Kafka Architecture: Low Level - 2025 Edition

January 9, 2025


🚀 What’s New in This 2025 Update

Major Updates and Changes

  • KRaft-Only Architecture - ZooKeeper completely eliminated
  • Raft Consensus Replication - Native leadership election
  • Java 17 Requirement - Modern JVM optimizations
  • Protocol Cleanup - Removed pre-0.10.x formats
  • Dynamic KRaft Quorums - Add/remove controllers without downtime
  • Improved Atomic Writes - Enhanced exactly-once semantics

Deprecated Features

  • ZooKeeper coordination - Fully removed
  • Java 8 support - Dropped; minimum is Java 11 for clients, Java 17 for brokers
  • Legacy wire protocols - Pre-0.10.x formats gone
  • Old replication mechanisms - Replaced by Raft

Ready to understand how Kafka achieves its legendary performance? Let’s dive deep into the engineering decisions that make Kafka the backbone of modern data infrastructure.

Kafka Architecture: Low-Level Design

mindmap
  root((Low-Level Design))
    Persistence
      Sequential I/O
      Page Cache
      Zero Copy
      Log Segments
    Network
      Binary Protocol
      Request Pipelining
      Batch Processing
      Compression
    Replication
      Raft Consensus
      ISR Management
      Leader Election
      Atomic Writes
    Performance
      OS Optimization
      JVM Tuning
      Thread Models
      Memory Management

Building on our Kafka architecture series including topics, producers, consumers, and log compaction, this article reveals the engineering brilliance behind Kafka’s performance.

Kafka Design Motivation

LinkedIn engineering created Kafka to solve a fundamental problem: handling massive real-time data streams efficiently. The design goals shaped every architectural decision:

  • High throughput - Millions of messages per second
  • Low latency - Single-digit-millisecond delivery
  • Horizontal scalability - Add nodes to increase capacity
  • Fault tolerance - Survive multiple failures
  • Durability - Never lose committed data
  • Operational simplicity - Easy to run at scale

flowchart TB
  subgraph DesignPrinciples["Design Principles"]
    P1[Distributed<br>Commit Log]
    P2[Sequential<br>I/O]
    P3[Zero<br>Copy]
    P4[Batching<br>Everywhere]
    P5[Smart<br>Clients]
  end
  
  subgraph Results["Results"]
    R1[700+ MB/s<br>Per Broker]
    R2[< 5ms<br>Latency]
    R3[Millions of<br>Partitions]
    R4[Petabyte<br>Scale]
  end
  
  P1 --> R3
  P2 --> R1
  P3 --> R2
  P4 --> R1
  P5 --> R4
  
  classDef principle fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
  classDef result fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
  
  class P1,P2,P3,P4,P5 principle
  class R1,R2,R3,R4 result

Cloudurable provides Kafka training, Kafka consulting, Kafka support and helps set up Kafka clusters in AWS.

Persistence: Embrace the Filesystem

Kafka’s genius begins with its storage strategy: trust the operating system.

Sequential I/O: The Speed Secret

flowchart LR
  subgraph RandomIO["Random I/O (Slow)"]
    D1[Disk] -->|Seek| H1[Head]
    H1 -->|Read| B1[Block 42]
    B1 -->|Seek| H2[Head]
    H2 -->|Read| B2[Block 7]
    B2 -->|Seek| H3[Head]
    H3 -->|Read| B3[Block 99]
  end
  
  subgraph SequentialIO["Sequential I/O (Fast)"]
    D2[Disk] -->|Read| S1[Block 1]
    S1 -->|Read| S2[Block 2]
    S2 -->|Read| S3[Block 3]
    S3 -->|Continue| S4[...]
  end
  
  RandomIO -->|~100 IOPS| Slow[Slow: Seek Time Dominates]
  SequentialIO -->|~600 MB/s| Fast[Fast: No Seeks]
  
  classDef random fill:#ffcdd2,stroke:#e53935,stroke-width:1px,color:#333333
  classDef sequential fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
  
  class D1,H1,H2,H3,B1,B2,B3 random
  class D2,S1,S2,S3,S4 sequential

Performance Numbers:

  • Sequential writes: 600+ MB/s on SATA SSD, 2+ GB/s on NVMe
  • Random writes (spinning disk): 100-200 IOPS, roughly 0.4-0.8 MB/s at 4 KB blocks
  • Sequential access can be ~1500x faster because seek time disappears
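
To see the pattern in code, here is a minimal, self-contained sketch (the file name and record payload are illustrative) of the append-only writing that makes sequential I/O possible: every write lands at the current tail of the file, so the disk never seeks.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SequentialAppendDemo {
    public static void main(String[] args) throws IOException {
        // Append-only writes: the file only grows at the tail, so the OS
        // can schedule large, seek-free sequential writes.
        try (FileChannel log = FileChannel.open(
                Path.of("demo.log"),
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer record = ByteBuffer.wrap("message-payload\n".getBytes());
            for (int i = 0; i < 100_000; i++) {
                record.rewind();    // reuse the same buffer
                log.write(record);  // always lands at the current tail
            }
        }
    }
}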

OS Page Cache: Free Performance

// Kafka doesn't manage its own cache
// Instead, it leverages the OS page cache

// Write path - OS buffers in page cache
fileChannel.write(ByteBuffer.wrap(messageBytes));

// Read path - OS serves from memory if cached
fileChannel.read(buffer, offset);

// Zero-copy transfer - never enters JVM
fileChannel.transferTo(position, count, socketChannel);

Benefits:

  • No GC pressure - Data stays outside JVM heap
  • Warm cache on restart - OS cache survives process restart
  • Automatic management - OS handles eviction optimally
  • Read-ahead optimization - OS prefetches sequential data

Log-Structured Storage

classDiagram
  class TopicPartition {
    +topic: String
    +partition: Int
    +logDir: Path
    +segments: List~LogSegment~
    +activeSegment: LogSegment
    +append(records: Records): void
    +read(offset: Long, maxBytes: Int): Records
  }
  
  class LogSegment {
    +baseOffset: Long
    +file: FileChannel
    +index: OffsetIndex
    +timeIndex: TimeIndex
    +maxSegmentSize: Long
    +append(records: Records): void
    +read(offset: Long): Records
    +close(): void
  }
  
  class OffsetIndex {
    +entries: MappedByteBuffer
    +lookup(offset: Long): Position
    +append(offset: Long, position: Int): void
  }
  
  class TimeIndex {
    +entries: MappedByteBuffer
    +lookup(timestamp: Long): Offset
    +append(timestamp: Long, offset: Long): void
  }
  
  TopicPartition "1" *-- "many" LogSegment
  LogSegment "1" *-- "1" OffsetIndex
  LogSegment "1" *-- "1" TimeIndex
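
Kafka's offset index is sparse: by default it records one entry roughly every index.interval.bytes (4 KB) of log data, so a lookup binary-searches for the nearest preceding entry and scans the segment forward from there. The sketch below illustrates the idea only; the entry layout is not Kafka's on-disk format.

import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified sparse index: maps a subset of offsets to byte positions.
// Kafka's real index is a memory-mapped array of fixed-width entries;
// a TreeMap stands in for the binary search here.
public class SparseOffsetIndex {
    private final NavigableMap<Long, Long> entries = new TreeMap<>();

    public void append(long offset, long filePosition) {
        entries.put(offset, filePosition);
    }

    // Returns the file position of the closest indexed offset <= target.
    // The caller then scans the log segment forward from that position.
    public long lookup(long targetOffset) {
        var floor = entries.floorEntry(targetOffset);
        return floor == null ? 0L : floor.getValue();
    }

    public static void main(String[] args) {
        SparseOffsetIndex index = new SparseOffsetIndex();
        index.append(0L, 0L);
        index.append(100L, 4096L);
        index.append(200L, 8192L);
        System.out.println(index.lookup(150L)); // 4096 -> scan forward
    }
}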

Producer Load Balancing and Batching

Smart Client Design

Kafka producers are “smart clients” that:

  1. Discover cluster topology - Query any broker for metadata
  2. Route directly to leaders - No proxy layer needed
  3. Batch intelligently - Accumulate records by partition
  4. Handle failures - Retry with exponential backoff
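
A minimal producer configuration exercising these behaviors; broker addresses, topic, and values here are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SmartClientDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Any one broker answers the metadata request; the client
        // discovers the full cluster topology from it
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Retry transient failures with backoff instead of failing fast
        props.put("retries", Integer.MAX_VALUE);
        props.put("retry.backoff.ms", 100);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records are routed directly to the partition leader - no proxy hop
            producer.send(new ProducerRecord<>("events", "user-42", "clicked"));
        }
    }
}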

Advanced Batching Strategy

stateDiagram-v2
  [*] --> Accumulating: Producer sends
  
  state Accumulating {
    [*] --> Collecting
    Collecting --> BatchReady: Size limit OR
    Collecting --> BatchReady: Time limit
    BatchReady --> Compressing: Optional
    Compressing --> Ready
  }
  
  Accumulating --> Sending: Batch ready
  
  state Sending {
    [*] --> InFlight
    InFlight --> WaitingAck: Network send
    WaitingAck --> Success: ACK received
    WaitingAck --> Retry: Failure
    Retry --> InFlight: Backoff
  }
  
  Sending --> [*]: Complete
  
  classDef accumulating fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
  classDef sending fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
  
  class Accumulating,Collecting,BatchReady,Compressing,Ready accumulating
  class Sending,InFlight,WaitingAck,Success,Retry sending

Batching Configuration:

// Batch size - larger = better throughput
props.put("batch.size", 16384);  // 16KB default

// Linger time - wait for more records
props.put("linger.ms", 10);      // 10ms delay

// Memory buffer for batching
props.put("buffer.memory", 33554432);  // 32MB

// Compression - batch level
props.put("compression.type", "lz4");

Producer Batching Architecture

[Diagram: Kafka producer batching architecture]

Network Protocol and Zero-Copy

Binary Protocol Efficiency

flowchart TB
  subgraph ProducerRequest["Producer Request"]
    H1[Request Header<br>16 bytes]
    P1[Partition 1<br>Compressed Batch]
    P2[Partition 2<br>Compressed Batch]
    P3[Partition 3<br>Compressed Batch]
  end
  
  subgraph NetworkTransfer["Network Transfer"]
    TCP[TCP Connection]
    SSL[Optional TLS]
    POOL[Connection Pool]
  end
  
  subgraph BrokerProcessing["Broker Processing"]
    NIO[NIO Thread]
    APPEND[Append to Log]
    ZERO[Zero-Copy to Followers]
  end
  
  H1 --> TCP
  P1 --> TCP
  P2 --> TCP
  P3 --> TCP
  TCP --> NIO
  NIO --> APPEND
  APPEND --> ZERO
  
  classDef request fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
  classDef network fill:#fff9c4,stroke:#f9a825,stroke-width:1px,color:#333333
  classDef broker fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
  
  class H1,P1,P2,P3 request
  class TCP,SSL,POOL network
  class NIO,APPEND,ZERO broker

Zero-Copy Transfer

// Traditional copy path (4 copies, 2 extra context switches)
// 1. Read: Disk → Kernel buffer
// 2. Copy: Kernel → User buffer (context switch)
// 3. Copy: User → Kernel socket buffer (context switch)
// 4. Write: Kernel → Network

// Kafka zero-copy (2 copies, no pass through user space)
// 1. Read: Disk → Kernel buffer
// 2. Transfer: Kernel buffer → NIC (DMA)

public long sendfile(FileChannel file, long position,
                     long count, WritableByteChannel socket) throws IOException {
    // FileChannel.transferTo maps to the sendfile() syscall on Linux
    return file.transferTo(position, count, socket);
}
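
One caveat: enabling TLS forfeits this zero-copy path, because outgoing data must be encrypted in user space before it reaches the socket.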

Modern Replication with Raft

KRaft Consensus Architecture

flowchart TB
  subgraph KRaftCluster["KRaft Controller Cluster"]
    C1[Controller 1<br>Leader<br>Term: 5]
    C2[Controller 2<br>Follower]
    C3[Controller 3<br>Follower]
    
    C1 -->|AppendEntries| C2
    C1 -->|AppendEntries| C3
    C2 -.->|Vote| C1
    C3 -.->|Vote| C1
  end
  
  subgraph RaftLog["Raft Log"]
    L1[Entry 1: Create Topic]
    L2[Entry 2: Elect Leader P0]
    L3[Entry 3: Update ISR]
    L4[Entry 4: Broker Join]
    L1 --> L2 --> L3 --> L4
  end
  
  subgraph Brokers["Kafka Brokers"]
    B1[Broker 1]
    B2[Broker 2]
    B3[Broker 3]
    B4[Broker 4]
  end
  
  C1 -->|Metadata| B1
  C1 -->|Metadata| B2
  C1 -->|Metadata| B3
  C1 -->|Metadata| B4
  
  classDef leader fill:#c8e6c9,stroke:#43a047,stroke-width:2px,color:#333333
  classDef follower fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
  classDef log fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
  
  class C1 leader
  class C2,C3 follower
  class L1,L2,L3,L4 log

Raft vs ZooKeeper Comparison

Aspect                 | ZooKeeper Era    | KRaft Era
-----------------------|------------------|----------
Failover Time          | 30+ seconds      | < 1 second
Metadata Consistency   | Eventual         | Strong
Operational Complexity | High (2 systems) | Low (1 system)
Max Partitions         | ~200K            | Millions
Network Hops           | 2-3              | 1
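
As a rough sketch, a combined broker-and-controller node's server.properties might look like this in KRaft mode (node IDs and hostnames are illustrative):

# KRaft mode: this node acts as both broker and controller
process.roles=broker,controller
node.id=1

# One listener for clients, one for the Raft quorum
listeners=PLAINTEXT://broker1:9092,CONTROLLER://broker1:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT

# Static quorum (classic style)
controller.quorum.voters=1@broker1:9093,2@broker2:9093,3@broker3:9093

# Or, with dynamic quorums, bootstrap instead and add/remove
# controllers at runtime via kafka-metadata-quorum.sh:
# controller.quorum.bootstrap.servers=broker1:9093,broker2:9093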

Producer Durability and Exactly-Once

Durability Levels

stateDiagram-v2
  [*] --> Producer: Send message
  
  state "acks=0" as acks0 {
    Producer --> Network: Fire & forget
    Network --> [*]: No confirmation
  }
  
  state "acks=1" as acks1 {
    Producer --> Leader: Send
    Leader --> LogWrite: Append
    LogWrite --> Producer: ACK
  }
  
  state "acks=all" as acksall {
    Producer --> Leader: Send
    Leader --> Replicate: Copy to ISRs
    Replicate --> AllISRs: Wait for all
    AllISRs --> Producer: ACK
  }
  
  classDef fast fill:#fff9c4,stroke:#f9a825,stroke-width:1px,color:#333333
  classDef balanced fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
  classDef safe fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
  
  class acks0 fast
  class acks1 balanced
  class acksall safe
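
The producer settings behind these levels, paired with a topic-level min.insync.replicas so that acks=all actually waits on multiple replicas (values are illustrative):

// Strongest durability: wait for all in-sync replicas
props.put("acks", "all");

// Idempotence prevents duplicates on retry (the default in modern clients)
props.put("enable.idempotence", true);

// Broker/topic side (not a producer setting): require at least 2
// in-sync replicas or fail the write
//   min.insync.replicas=2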

Exactly-Once Semantics

// Initialize transactional producer
// (requires transactional.id, and therefore idempotence, in the config)
producer.initTransactions();

try {
    // Start transaction
    producer.beginTransaction();
    
    // Send multiple messages atomically
    producer.send(new ProducerRecord<>("accounts", "debit", amount));
    producer.send(new ProducerRecord<>("accounts", "credit", amount));
    
    // Commit consumer offsets in the same transaction
    // (newer clients take ConsumerGroupMetadata instead of a group id string)
    producer.sendOffsetsToTransaction(offsets, consumerGroupId);
    
    // Commit atomically
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException e) {
    // Fatal errors - this producer has been fenced; close it
    producer.close();
} catch (KafkaException e) {
    // Transient error - abort so the work can be retried
    producer.abortTransaction();
}

Performance Optimizations

JVM and OS Tuning

# JVM Settings for Kafka (Java 17)
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent \
  -XX:+UseStringDeduplication"

# OS Tuning (Linux)
# Increase file descriptors
ulimit -n 100000

# Network tuning
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Disk scheduler for SSDs ('none' on modern multi-queue kernels; 'noop' on legacy)
echo none > /sys/block/sda/queue/scheduler

Thread Model

classDiagram
  class KafkaBroker {
    +acceptorThreads: Thread[]
    +networkThreads: Thread[]
    +ioThreads: Thread[]
    +backgroundThreads: Thread[]
  }
  
  class AcceptorThread {
    +port: Int
    +accept(): SocketChannel
  }
  
  class NetworkThread {
    +selector: Selector
    +processRequests(): void
    +sendResponses(): void
  }
  
  class IOThread {
    +requestQueue: BlockingQueue
    +handleProduce(): void
    +handleFetch(): void
  }
  
  class BackgroundThread {
    +logCleaner: LogCleaner
    +replicaFetcher: ReplicaFetcher
    +performMaintenance(): void
  }
  
  KafkaBroker "1" *-- "1+" AcceptorThread
  KafkaBroker "1" *-- "N" NetworkThread
  KafkaBroker "1" *-- "M" IOThread
  KafkaBroker "1" *-- "many" BackgroundThread

Quotas and Multi-Tenancy

Resource Management

# Producer quotas
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'producer_byte_rate=1048576' \
  --entity-type users --entity-name alice

# Consumer quotas
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'consumer_byte_rate=2097152' \
  --entity-type users --entity-name bob

# Request quotas (CPU time)
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'request_percentage=50' \
  --entity-type clients --entity-name mobile-app
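
When a client exceeds its quota, the broker does not drop requests; it throttles by delaying responses until the client falls back under its allowance.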

Low-Level Architecture Review

How does Kafka achieve high throughput?

Through sequential I/O, zero-copy transfers, batching, compression, and leveraging OS page cache instead of managing its own cache.

What makes KRaft better than ZooKeeper?

Native integration, sub-second failover, support for millions of partitions, reduced operational complexity, and strong consistency guarantees.

How does exactly-once work?

Idempotent producers with sequence numbers, transactional APIs for atomic writes across partitions, and transaction coordinators managing state.

Why is batching critical?

Amortizes network overhead, improves compression ratios, enables efficient sequential writes, and maximizes throughput.

What are the durability trade-offs?

  • acks=0: Maximum speed, no durability
  • acks=1: Balanced performance and safety
  • acks=all: Maximum durability, higher latency

How does Kafka handle slow consumers?

Pull-based model prevents overwhelming consumers, quotas limit resource usage, and consumer lag monitoring enables proactive intervention.
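
On the consumer side, pull pacing is tunable. A few illustrative settings:

// Cap the records returned per poll so processing keeps up
props.put("max.poll.records", 100);

// Let the broker wait for data instead of hot-polling
props.put("fetch.min.bytes", 1024);
props.put("fetch.max.wait.ms", 500);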

Best Practices for 2025

  1. Leverage KRaft - Simpler operations, better performance
  2. Use Java 17 - Modern GC, better performance
  3. Enable compression - LZ4 for speed, ZSTD for ratio
  4. Tune OS settings - File descriptors, network buffers
  5. Monitor everything - JMX metrics are comprehensive
  6. Plan capacity - Disk, network, CPU in that order
  7. Test failure scenarios - Validate durability settings
  8. Use transactions wisely - Only when needed
  9. Optimize batch sizes - Based on message patterns
  10. Keep producers close - Minimize network latency

About Cloudurable

Master Kafka’s internals with expert guidance. Cloudurable provides Kafka training, Kafka consulting, Kafka support and helps set up Kafka clusters in AWS.

Check out our new GoLang course. We provide onsite, instructor-led Go training.

                                                                           