January 9, 2025
🚀 What’s New in This 2025 Update
Major Updates and Changes
- KRaft-Managed Compaction - All compaction under KRaft control
- Tiered Storage Integration - Compaction across local and remote tiers
- Diskless Topics - Object storage compaction (KIP-1165)
- Performance Optimizations - Reduced I/O with cloud-native design
- Simplified Operations - No ZooKeeper coordination needed
- Enhanced Monitoring - Better visibility into compaction progress
Deprecated Features
- ❌ ZooKeeper-based coordination - Fully removed
- ❌ Legacy message formats - v0 and v1 no longer supported
- ❌ Old compaction metrics - Updated for KRaft
Ready to master Kafka’s powerful state management feature? Let’s explore how log compaction enables event sourcing and stateful processing at scale.
Kafka Architecture: Log Compaction
mindmap
root((Log Compaction))
Core Concepts
Key-Based Retention
Latest Value Preservation
Tombstone Deletion
Offset Stability
Use Cases
State Restoration
Event Sourcing
Cache Rebuilding
CDC Patterns
Architecture
Head vs Tail
Segment Files
Log Cleaner
Compaction Threads
Modern Features
KRaft Integration
Tiered Storage
Diskless Topics
Cloud Optimization
Building on our Kafka architecture series including topics, producers, consumers, and ecosystem, this deep dive reveals how log compaction transforms Kafka into a distributed database.
Log compaction retains the latest update for each key—perfect for maintaining state, building caches, and implementing event sourcing patterns.
Kafka Log Compaction Fundamentals
While Kafka typically deletes old data by time or size, log compaction offers a third way: keep only the latest value for each key. Think of it as a distributed, eventually-consistent key-value store.
Why Log Compaction Matters
flowchart LR
subgraph Events["Event Stream"]
E1[user-1: CREATE]
E2[user-2: CREATE]
E3[user-1: UPDATE]
E4[user-1: UPDATE]
E5[user-2: DELETE]
E6[user-1: UPDATE]
end
subgraph Compacted["After Compaction"]
C1[user-1: UPDATE<br>Latest State]
C2[user-2: null<br>Tombstone]
end
Events -->|Compaction| Compacted
classDef event fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
classDef compacted fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
class E1,E2,E3,E4,E5,E6 event
class C1,C2 compacted
Step-by-Step Explanation:
- Multiple events for same key over time
- Compaction keeps only latest value per key
- Tombstones (null values) mark deletions
- Result: Complete snapshot of current state
Key Benefits:
- State restoration - Rebuild services from compacted log
- Space efficiency - Bounded by unique keys, not time
- Durability - Persistent record of all current values
- Performance - Fast bootstrap from compacted data
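The "latest value per key" semantics above can be sketched in a few lines of plain Java. This is a toy model, not the broker's implementation: it replays a stream of key/value pairs into a map, where a null value acts as a tombstone.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionDemo {
    // Replay events in order, keeping only the latest value per key.
    // events is a list of {key, value} pairs; a null value is a tombstone.
    static Map<String, String> compact(List<String[]> events) {
        Map<String, String> state = new LinkedHashMap<>();
        for (String[] e : events) {
            if (e[1] == null) state.remove(e[0]);     // tombstone deletes the key
            else state.put(e[0], e[1]);               // later values overwrite earlier ones
        }
        return state;
    }

    public static void main(String[] args) {
        List<String[]> events = List.of(
            new String[]{"user-1", "CREATE"},
            new String[]{"user-2", "CREATE"},
            new String[]{"user-1", "UPDATE-1"},
            new String[]{"user-2", null},             // tombstone for user-2
            new String[]{"user-1", "UPDATE-2"});
        System.out.println(compact(events));          // {user-1=UPDATE-2}
    }
}
```

The end state matches the diagram above: user-1 resolves to its last update, and the tombstone erases user-2 entirely.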
Cloudurable provides Kafka training, Kafka consulting, Kafka support and helps with setting up Kafka clusters in AWS.
Log Compaction Structure
A compacted log maintains two distinct regions:
flowchart TB
subgraph Log["Compacted Topic Log"]
subgraph Head["Head (Active)"]
H1[Offset 100: key-A = val1]
H2[Offset 101: key-B = val1]
H3[Offset 102: key-A = val2]
H4[Offset 103: key-C = val1]
H5[Offset 104: key-A = val3]
H6[→ New writes append here]
end
subgraph Tail["Tail (Compacted)"]
T1[Offset 0: key-X = val5]
T2[Offset 15: key-Y = val3]
T3[Offset 42: key-Z = null]
T4[← Only latest values]
end
end
Head -->|Periodic Compaction| Tail
classDef head fill:#fff9c4,stroke:#f9a825,stroke-width:1px,color:#333333
classDef tail fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
class H1,H2,H3,H4,H5,H6 head
class T1,T2,T3,T4 tail
Key Points:
- Head - Traditional Kafka log, all writes append
- Tail - Compacted region, only latest values
- Offsets preserved - Compaction never changes offsets
- Gradual process - Compaction runs periodically
Kafka Log Compaction Structure
The Compaction Process
stateDiagram-v2
[*] --> Monitoring: Log Cleaner Thread
Monitoring --> SelectingPartition: Check dirty ratio
SelectingPartition --> BuildingMap: Choose dirtiest
state BuildingMap {
[*] --> ScanningTail
ScanningTail --> MappingKeys: Read segments
MappingKeys --> LatestValues: Track latest offset per key
}
BuildingMap --> Rewriting: Memory map complete
state Rewriting {
[*] --> ReadingHead
ReadingHead --> Filtering: Check each record
Filtering --> Writing: Keep if latest
Filtering --> Skipping: Discard if outdated
Writing --> ReadingHead
Skipping --> ReadingHead
}
Rewriting --> SwappingSegments: New segments ready
SwappingSegments --> Monitoring: Atomically replace
classDef monitoring fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
classDef processing fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
classDef writing fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
class Monitoring,SelectingPartition monitoring
class BuildingMap,ScanningTail,MappingKeys,LatestValues processing
class Rewriting,ReadingHead,Filtering,Writing,Skipping,SwappingSegments writing
Step-by-Step Explanation:
- Log cleaner threads monitor partition dirty ratios
- Select partition with highest ratio of uncompacted data
- Build in-memory map of latest offset per key
- Rewrite segments, keeping only latest values
- Atomically swap new segments for old
- Original offsets preserved throughout
Kafka Log Compaction Process
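The cleaner's two passes can be modeled in plain Java. This is a simplified sketch, not broker code (tombstone retention and segment I/O are omitted): first scan records to build a map of the latest offset per key, then keep only records that hold that latest offset. Note that surviving records keep their original offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CleanerSketch {
    record Rec(long offset, String key, String value) {}

    // Pass 1: build the offset map (latest offset per key).
    // Pass 2: rewrite, retaining only the record at that offset for each key.
    static List<Rec> clean(List<Rec> log) {
        Map<String, Long> latest = new HashMap<>();
        for (Rec r : log) latest.put(r.key(), r.offset());     // later entries win
        List<Rec> out = new ArrayList<>();
        for (Rec r : log) {
            if (latest.get(r.key()) == r.offset()) out.add(r); // original offset preserved
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
            new Rec(100, "A", "v1"), new Rec(101, "B", "v1"),
            new Rec(102, "A", "v2"), new Rec(103, "C", "v1"),
            new Rec(104, "A", "v3"));
        // Survivors: offsets 101 (B), 103 (C), 104 (A) — offsets unchanged
        System.out.println(clean(log));
    }
}
```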
Compaction Guarantees
Ordering Guarantees
- Per-key order preserved - Updates never reordered
- Cross-key order maintained - Surviving records keep their relative order within a partition
- Offset stability - Message offsets never change
- Consumer compatibility - Works with all consumers
Visibility Guarantees
# Consumers that keep up see every record before it is compacted away
min.compaction.lag.ms=60000      # 1 minute minimum before compaction
# Tombstones remain visible for a configured period
delete.retention.ms=86400000     # 24 hours default
# Consumers lagging more than delete.retention.ms may miss tombstones
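The tombstone window reduces to a simple predicate. This is an illustrative sketch (the broker's actual bookkeeping is per-segment): a tombstone is only safe to drop once it has sat in the cleaned portion of the log for longer than delete.retention.ms.

```java
public class TombstoneRetention {
    // A tombstone may be removed only after it has been in the cleaned
    // portion of the log for at least deleteRetentionMs.
    static boolean canDropTombstone(long cleanedAtMs, long nowMs, long deleteRetentionMs) {
        return nowMs - cleanedAtMs >= deleteRetentionMs;
    }

    public static void main(String[] args) {
        long dayMs = 86_400_000L;  // delete.retention.ms default: 24 hours
        long cleanedAt = 0L;
        System.out.println(canDropTombstone(cleanedAt, 12 * 3_600_000L, dayMs)); // false: only 12h old
        System.out.println(canDropTombstone(cleanedAt, 25 * 3_600_000L, dayMs)); // true: past 24h
    }
}
```

A consumer whose lag exceeds this window can resume after the tombstone is gone and never learn the key was deleted, which is why the default is a generous 24 hours.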
Modern Log Cleaner Architecture with KRaft
classDiagram
class LogCleaner {
+cleanerThreads: CleanerThread[]
+cleanerManager: CleanerManager
+config: CleanerConfig
+start(): void
+shutdown(): void
}
class CleanerThread {
+threadId: int
+cleaning: boolean
+doWork(): void
+selectDirtiestLog(): TopicPartition
+cleanLog(tp: TopicPartition): void
}
class CleanerManager {
+inProgress: Map~TopicPartition, CleanerState~
+checkpoints: Map~TopicPartition, Long~
+kraft: KRaftMetadata
+isCompactable(tp: TopicPartition): boolean
+updateCheckpoint(tp: TopicPartition, offset: Long): void
}
class KRaftMetadata {
+topicConfigs: Map~String, Properties~
+getCompactionPolicy(topic: String): String
+getRetentionMs(topic: String): Long
}
class SegmentCleaner {
+sourceSegments: List~LogSegment~
+offsetMap: OffsetMap
+clean(): List~LogSegment~
+shouldRetainRecord(key: Bytes, offset: Long): boolean
}
LogCleaner "1" *-- "many" CleanerThread
LogCleaner "1" *-- "1" CleanerManager
CleanerManager "1" --> "1" KRaftMetadata
CleanerThread "1" --> "1" SegmentCleaner : creates
Configuration for Log Compaction
Topic-Level Settings
# Enable compaction for a topic
kafka-topics.sh --create \
--topic user-profiles \
--partitions 10 \
--replication-factor 3 \
--config cleanup.policy=compact \
--config min.cleanable.dirty.ratio=0.5 \
--config segment.ms=3600000
Key Configuration Parameters
# Cleanup policy (delete, compact, or both)
log.cleanup.policy=compact
# Minimum time before compaction eligible
log.cleaner.min.compaction.lag.ms=60000
# Maximum time to retain delete markers
log.cleaner.delete.retention.ms=86400000
# Minimum dirty ratio to trigger cleaning
log.cleaner.min.cleanable.ratio=0.5
# Memory for deduplication
log.cleaner.dedupe.buffer.size=134217728
# Number of cleaner threads
log.cleaner.threads=2
# I/O throttling for cleaners (default Double.MaxValue, i.e. unthrottled)
log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
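The dirty ratio that drives cleaner scheduling is just the fraction of the log occupied by the uncompacted head. A simplified sketch (the broker accounts in bytes per partition):

```java
public class DirtyRatio {
    // dirty ratio = bytes in the uncleaned head / total log bytes
    static double dirtyRatio(long headBytes, long tailBytes) {
        long total = headBytes + tailBytes;
        return total == 0 ? 0.0 : (double) headBytes / total;
    }

    // A partition is a cleaning candidate once its ratio crosses the threshold.
    static boolean eligibleForCleaning(long headBytes, long tailBytes, double minCleanableRatio) {
        return dirtyRatio(headBytes, tailBytes) >= minCleanableRatio;
    }

    public static void main(String[] args) {
        // 600 MB dirty head, 400 MB compacted tail: ratio 0.6, above the 0.5 default
        System.out.println(dirtyRatio(600, 400));               // 0.6
        System.out.println(eligibleForCleaning(600, 400, 0.5)); // true
    }
}
```

Lowering the threshold compacts more aggressively (tighter space bounds, more I/O); raising it batches more work per pass.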
Tiered Storage and Compaction
Kafka 4.0 introduces sophisticated compaction for tiered storage:
flowchart TB
subgraph LocalTier["Local Storage (Hot)"]
L1[Recent Segments]
L2[Active Head]
L3[Compaction Buffer]
end
subgraph RemoteTier["Object Storage (Cold)"]
R1[Historical Segments]
R2[Compacted Archives]
R3[Metadata Index]
end
subgraph CompactionFlow["Tiered Compaction"]
C1[Local Compaction]
C2[Upload to Remote]
C3[Remote Compaction]
C4[Metadata Update]
end
L1 --> C1
C1 --> L3
L3 --> C2
C2 --> R1
R1 --> C3
C3 --> R2
C4 --> R3
classDef local fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
classDef remote fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
classDef process fill:#c8e6c9,stroke:#43a047,stroke-width:1px,color:#333333
class L1,L2,L3 local
class R1,R2,R3 remote
class C1,C2,C3,C4 process
Tiered Compaction Benefits:
- Cost optimization - Compact before archiving
- Query performance - Indexed remote segments
- Flexible retention - Different policies per tier
- Resource efficiency - Offload to object storage
Diskless Topics and Object Compaction
KIP-1165 introduces compaction for diskless topics:
flowchart LR
subgraph Producers["Producers"]
P1[App 1]
P2[App 2]
end
subgraph Brokers["Kafka Brokers (No Local Storage)"]
B1[Broker 1]
B2[Broker 2]
end
subgraph ObjectStorage["Cloud Object Storage"]
O1[Segment Objects]
O2[Compaction Workers]
O3[Compacted Objects]
end
P1 --> B1
P2 --> B2
B1 -->|Direct Write| O1
B2 -->|Direct Write| O1
O1 --> O2
O2 -->|Object Compaction| O3
classDef producer fill:#bbdefb,stroke:#1976d2,stroke-width:1px,color:#333333
classDef broker fill:#fff9c4,stroke:#f9a825,stroke-width:1px,color:#333333
classDef storage fill:#e1bee7,stroke:#8e24aa,stroke-width:1px,color:#333333
class P1,P2 producer
class B1,B2 broker
class O1,O2,O3 storage
Use Cases and Patterns
1. Event Sourcing
// User profile events compacted by userId
producer.send(new ProducerRecord<>("user-profiles",
userId, // Key for compaction
userProfileJson)); // Latest state
2. Change Data Capture (CDC)
// Database changes compacted by primary key
producer.send(new ProducerRecord<>("inventory",
productId, // Product ID key
productData)); // Current inventory
// Deletions via tombstone
producer.send(new ProducerRecord<>("inventory",
productId, // Product ID key
null)); // Null = tombstone
3. Configuration Management
// Application configs compacted by config key
producer.send(new ProducerRecord<>("app-configs",
"feature-flags", // Config category
featureFlagSettings)); // Latest settings
4. Cache Restoration
// Rebuild cache from compacted topic
consumer.seekToBeginning(partitions);
while (true) {
    ConsumerRecords<String, String> records =
        consumer.poll(Duration.ofMillis(100)); // java.time.Duration; poll(long) is removed
    for (ConsumerRecord<String, String> record : records) {
        if (record.value() != null) {
            cache.put(record.key(), record.value());
        } else {
            cache.remove(record.key()); // Tombstone: drop the cached entry
        }
    }
    // In practice, break once positions reach the endOffsets captured before the loop
}
Best Practices for 2025
1. Design for Compaction
- Use meaningful, stable keys
- Keep values reasonably sized
- Design for eventual consistency
- Plan for tombstone retention
2. Monitor Compaction Health
# Key metrics to track
kafka.log:type=LogCleanerManager,name=cleanable-dirty-ratio
kafka.log:type=LogCleanerManager,name=max-clean-time-ms
kafka.log:type=LogCleanerManager,name=max-buffer-utilization-percent
3. Optimize Performance
- Tune cleaner threads based on CPU
- Allocate sufficient dedupe buffer memory
- Consider I/O throttling in shared environments
- Monitor and adjust dirty ratios
4. Handle Tiered Storage
- Configure appropriate local retention
- Plan remote compaction schedules
- Monitor cross-tier query performance
- Set tier-specific retention policies
5. Tombstone Management
- Set appropriate delete.retention.ms
- Ensure consumers handle nulls gracefully
- Monitor tombstone accumulation
- Plan periodic full reprocessing
Log Compaction Review
What are three ways Kafka can delete records?
- Time-based deletion - After retention period
- Size-based deletion - When log exceeds size limit
- Log compaction - Keep only latest value per key
What is log compaction good for?
Log compaction excels at maintaining current state snapshots for:
- Service state restoration after crashes
- Cache rebuilding
- Event sourcing
- Change data capture
- Configuration distribution
Describe compacted log structure
Compacted logs have:
- Head - Active region receiving new writes
- Tail - Compacted region with latest values only
- Preserved offsets - Compaction never changes offsets
- Atomic swaps - Segments replaced atomically
Do offsets change after compaction?
No. Compaction preserves original offsets, ensuring consumer compatibility.
What is a partition segment?
A segment is a chunk of a partition’s log stored as files. Segments enable:
- Efficient compaction without full partition rewrite
- Time or size-based rotation
- Parallel cleaning operations
- Atomic replacement during compaction
Related Content
- What is Kafka?
- Kafka Architecture
- Kafka Topic Architecture
- Kafka Consumer Architecture
- Kafka Producer Architecture
- Kafka Low Level Design
- Kafka and Schema Registry
- Kafka Ecosystem
- Kafka vs. JMS
- Kafka versus Kinesis
- Kafka Command Line Tutorial
- Kafka Failover Tutorial
- Kafka Producer Java Example
- Kafka Consumer Java Example
About Cloudurable
Master Kafka’s advanced features with expert guidance. Cloudurable provides Kafka training, Kafka consulting, Kafka support and helps with setting up Kafka clusters in AWS.
Check out our new GoLang course. We provide onsite Go Lang training that is instructor-led.