The symptoms were classic but crippling. During our last Black Friday simulation, the Checkout Service began throwing HTTP 504 Gateway Timeouts. The root cause wasn't the database or the payment gateway; it was the Notification Service. Every time a user placed an order, the Checkout Service made a synchronous REST call to trigger an email confirmation. When the SMTP server lagged, the Checkout Service's thread pool was exhausted almost immediately. We weren't just coupling services; we were coupling their failure modes. To survive production traffic, we had to move from a request-response model to an event-driven architecture built on Apache Kafka.
The Architecture: Why Synchronous HTTP Failed Us
In our legacy setup (v1.0), the OrderService and NotificationService were tightly coupled. We were running Spring Boot 3.1 on AWS ECS (Fargate) with a standard Tomcat thread pool (200 threads). When Notification Service latency spiked from 50 ms to 2,000 ms due to third-party email provider throttling, OrderService threads piled up, blocked on socket reads. With 200 threads each held for roughly two seconds, the service could sustain at most ~100 requests per second before requests started queuing and timing out.
We saw the following stack trace repeatedly in our Datadog logs:
```
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
    ...
at com.example.orderservice.client.NotificationClient.sendEmail(NotificationClient.java:42)
```
This indicated that our throughput was strictly limited by the slowest downstream dependency. We needed to "fire and forget" the order event, guaranteeing it would be processed eventually without holding up the user's HTTP response.
The Failed "In-Memory" Fix
Before adopting Kafka, we tried a naive approach to save infrastructure costs. We wrapped the notification logic in a Java CompletableFuture.runAsync() block within the same JVM.
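Roughly, the change looked like this (a sketch, not our exact code; `NotificationClient` matches the class from the stack trace above, while `saveOrder` is an illustrative stand-in for the persistence logic):

```java
import java.util.concurrent.CompletableFuture;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class OrderService {

    private final NotificationClient notificationClient;

    public void placeOrder(OrderRequest request) {
        saveOrder(request); // persist the order synchronously

        // Fire-and-forget on the common ForkJoinPool. Nothing here is durable:
        // the queued task exists only in this JVM's memory.
        CompletableFuture.runAsync(() -> notificationClient.sendEmail(request.getEmail()));
    }

    private void saveOrder(OrderRequest request) {
        // DB write elided
    }
}
```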
This worked in the development environment, but in production it was a disaster. When the application restarted during a deployment, every pending asynchronous task queued in memory was lost instantly. We dropped about 1.5% of order confirmations during a rolling update. We needed durability, which meant we needed an external broker.
Environment & Kafka Configuration
For this implementation, we used the following stack:
- Language: Java 17
- Framework: Spring Boot 3.2
- Broker: Apache Kafka 3.6 (running via Docker Compose)
- Library: Spring Kafka
Here is the optimized `docker-compose.yml` setup to replicate a production-like broker locally. Note the `KAFKA_ADVERTISED_LISTENERS` configuration, which is a common tripping point for developers connecting from outside the container network.
```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Crucial for external access: clients outside the Docker network
      # must be handed an address they can actually reach.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```
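Once the broker is up (`docker compose up -d`), the `orders` topic needs to exist. Spring Boot's auto-configured `KafkaAdmin` creates any `NewTopic` beans at startup; here is a minimal sketch, where the partition count of 3 is an arbitrary choice for local testing:

```java
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class KafkaTopicConfig {

    // KafkaAdmin picks up NewTopic beans and creates any that don't exist yet.
    @Bean
    public NewTopic ordersTopic() {
        return TopicBuilder.name("orders")
                .partitions(3)  // assumption: 3 partitions for local testing
                .replicas(1)    // single-broker Compose setup
                .build();
    }
}
```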
Developing the Producer (Order Service)
The goal of the Producer is to serialize the order data and push it to the `orders` topic. We must ensure that the message is actually persisted by the broker before we confirm the transaction to the user.
In the configuration below, pay attention to the `ProducerConfig.ACKS_CONFIG`. Setting this to "all" is vital for data integrity, ensuring the leader and all in-sync replicas acknowledge the write before the send is considered successful.
```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.support.serializer.JsonSerializer;

@Configuration
public class KafkaProducerConfig {

    @Value("${spring.kafka.bootstrap-servers}")
    private String bootstrapServers;

    @Bean
    public ProducerFactory<String, OrderEvent> producerFactory() {
        Map<String, Object> configProps = new HashMap<>();
        configProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        configProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Using JsonSerializer to send full objects on the wire
        configProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
        // Reliability: the leader and all in-sync replicas must acknowledge
        configProps.put(ProducerConfig.ACKS_CONFIG, "all");
        configProps.put(ProducerConfig.RETRIES_CONFIG, 3);
        return new DefaultKafkaProducerFactory<>(configProps);
    }

    @Bean
    public KafkaTemplate<String, OrderEvent> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
```
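The `spring.kafka.bootstrap-servers` property injected above lives in the application config; a minimal `application.yml` matching the Compose listener would look like this:

```yaml
# application.yml -- minimal sketch; matches the advertised listener above
spring:
  kafka:
    bootstrap-servers: localhost:9092
```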
With the configuration in place, the service layer becomes incredibly lean. We no longer wait for the email to be sent. We publish the event and return 202 Accepted or 201 Created immediately.
```java
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
@Slf4j
@RequiredArgsConstructor
public class OrderService {

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;
    private static final String TOPIC = "orders";

    public void placeOrder(OrderRequest request) {
        // Business logic (save to DB)...
        OrderEvent event = new OrderEvent(request.getOrderId(), request.getEmail(), request.getAmount());

        // Async send. The returned CompletableFuture lets us log success or
        // failure without blocking the caller.
        kafkaTemplate.send(TOPIC, event.getOrderId(), event)
            .whenComplete((result, ex) -> {
                if (ex == null) {
                    log.info("Sent message with offset=[{}]", result.getRecordMetadata().offset());
                } else {
                    log.error("Unable to send message", ex);
                }
            });
    }
}
```
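For completeness, `OrderEvent` itself is never shown in this post; here is a minimal version consistent with the constructor and getters used above (the `BigDecimal` amount is an assumed type), including the no-args constructor that JSON deserialization needs on the consumer side:

```java
import java.math.BigDecimal;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// Assumed payload shape -- field names inferred from the producer/consumer code.
@Data
@NoArgsConstructor  // required for JSON deserialization
@AllArgsConstructor
public class OrderEvent {
    private String orderId;
    private String email;
    private BigDecimal amount;
}
```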
Developing the Consumer (Notification Service)
The Consumer's job is to listen to the topic and execute the slow side-effect (sending the email). The critical part here is the `groupId`. By defining a group ID, we allow multiple instances of the Notification Service to scale horizontally; Kafka will load-balance the partitions among them.
```java
import lombok.extern.slf4j.Slf4j;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class NotificationConsumer {

    @KafkaListener(
        topics = "orders",
        groupId = "notification-group",
        containerFactory = "kafkaListenerContainerFactory"
    )
    public void handleOrderEvent(OrderEvent event) {
        log.info("Received Order Event: {}", event.getOrderId());
        try {
            // Simulate slow SMTP process
            Thread.sleep(1000);
            sendEmail(event.getEmail());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
        } catch (Exception e) {
            log.error("Failed to process order notification", e);
            // In a real scenario, you would rethrow here to trigger a retry
            // or move the record to a Dead Letter Queue (DLQ)
        }
    }

    private void sendEmail(String recipient) {
        // Placeholder for the real SMTP integration
        log.info("Sending confirmation email to {}", recipient);
    }
}
```
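The `kafkaListenerContainerFactory` referenced by the listener isn't shown above, so here is a sketch of the wiring it implies. Note that Spring Kafka's `JsonDeserializer` rejects classes from untrusted packages by default; the `com.example.*` pattern below is an assumption based on the package in the stack trace:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.support.serializer.JsonDeserializer;

@Configuration
public class KafkaConsumerConfig {

    @Value("${spring.kafka.bootstrap-servers}")
    private String bootstrapServers;

    @Bean
    public ConsumerFactory<String, OrderEvent> consumerFactory() {
        // Deserialize JSON payloads into OrderEvent, trusting our own package.
        JsonDeserializer<OrderEvent> deserializer = new JsonDeserializer<>(OrderEvent.class);
        deserializer.addTrustedPackages("com.example.*"); // assumption: our base package

        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        return new DefaultKafkaConsumerFactory<>(props, new StringDeserializer(), deserializer);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, OrderEvent> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }
}
```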
Benchmark: Sync REST vs. Async Kafka
We ran a load test using JMeter with 500 concurrent users. The difference in the Order Service's response time was drastic because the heavy lifting was offloaded.
| Metric | Legacy (REST Blocking) | Optimized (Kafka Async) |
|---|---|---|
| Avg Response Time | 1,200 ms | 45 ms |
| Throughput (TPS) | 85 req/sec | 4,200 req/sec |
| Error Rate (Load) | 12% (Timeouts) | 0% |
The 45ms response time in the optimized setup represents only the time needed to persist the order to the database and the event to the Kafka topic. The user experience is instantaneous, and the email arrives a second later in the background.
Edge Cases: The "Poison Pill"
While Kafka is powerful, it introduces complexity. One major edge case is the "Poison Pill"—a message that cannot be deserialized or processed (e.g., malformed JSON). If your consumer throws an exception and infinite retries are configured, the consumer loop will get stuck on that single message forever, blocking all subsequent orders.
The fix is to configure a `DeadLetterPublishingRecoverer` in your Spring Kafka listener container factory. After N failed attempts, it automatically moves the offending message to a Dead Letter Topic (e.g., `orders.DLT`), allowing main processing to continue.
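Here is a sketch of that wiring, assuming a `KafkaTemplate`-backed `KafkaOperations` bean is available for republishing failed records and that the handler is registered on the listener container factory via `setCommonErrorHandler`:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Register on the container factory: factory.setCommonErrorHandler(errorHandler)
@Bean
public DefaultErrorHandler errorHandler(KafkaOperations<Object, Object> template) {
    // Once retries are exhausted, publish the failed record to <topic>.DLT
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    // Retry twice, one second apart, then hand the record to the recoverer
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2));
}
```

A fixed back-off keeps the example simple; an exponential back-off is often a better fit when the downstream failure is transient throttling rather than a genuinely malformed message.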
Conclusion
By decoupling the Order and Notification services with Apache Kafka, we transformed a fragile, synchronous system into a resilient, event-driven architecture. We eliminated the cascading failures caused by third-party latency and improved our throughput by over 40x. While the operational complexity of managing a broker is non-trivial, the stability gains in a microservices environment are well worth the investment.