Kafka Topic Architecture

Kafka Topic Architecture - Replication, Failover and Parallel Processing

This article covers some lower-level details of Kafka topic architecture. It is a continuation of the Kafka Architecture article.

This article covers Kafka topic architecture, with a discussion of how partitions are used for failover and parallel processing.


Kafka Topics, Logs, Partitions

Recall that a Kafka topic is a named stream of records. Kafka stores topics in logs. A topic log is broken up into partitions. Kafka spreads a log's partitions across multiple servers or disks. Think of a topic as a category, stream name, or feed.
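For illustration (this sketch is not from the article; the broker address, topic name, and counts are made up), here is how a partitioned, replicated topic might be created with the Java AdminClient:

    // Hedged sketch: create a topic whose log is split into 13 partitions,
    // each replicated to 3 brokers for failover. Values are illustrative.
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("user-signups", 13, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

Each partition can live on a different broker, which is what enables both parallel processing and failover.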

Continue reading

Kafka vs. JMS

Kafka vs JMS, SQS, RabbitMQ Messaging

Is Kafka a queue or a publish and subscribe system? Yes. It can be both.

Kafka is like a queue for consumer groups, which we cover later. Basically, Kafka acts as a queue per consumer group, so it can do load balancing like JMS, RabbitMQ, etc.

For multiple consumer groups, Kafka is like topics in JMS, RabbitMQ, and other MOM systems. Kafka has topics; producers publish to the topics, and the subscribers (consumer groups) read from the topics.
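As a hedged sketch (the topic name and group id are illustrative, not from the article), the queue-like behavior comes from sharing a group.id: consumers in the same group split the topic's partitions between them, while a second group with a different group.id receives its own full copy of the stream, which is the publish-subscribe behavior.

    // Hedged sketch of consumer-group semantics with the Java client.
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Queue semantics within this group; pub-sub across groups.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("orders"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }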

Continue reading

Kafka Architecture

If you are not sure what Kafka is, see What is Kafka?.

Kafka Architecture

Kafka consists of Records, Topics, Consumers, Producers, Brokers, Logs, Partitions, and Clusters. A record has a value, an optional key, and a timestamp; Kafka records are immutable. A Kafka topic is a stream of records ("/orders", "/user-signups"); you can think of a topic as a feed name. A topic has a log, which is the topic's storage on disk. A topic log is broken up into partitions and segments. The Kafka Producer API is used to produce streams of data records. The Kafka Consumer API is used to consume a stream of records from Kafka. A broker is a Kafka server that runs in a Kafka cluster. Kafka brokers form a cluster; the Kafka cluster consists of many Kafka brokers on many servers. "Broker" sometimes refers to more of a logical system or to Kafka as a whole.
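To make the Producer API concrete, here is a minimal, hedged sketch (not the article's code; the topic name, key, and addresses are illustrative) of sending a record with an optional key:

    // Hedged sketch: records with the same key go to the same partition,
    // preserving per-key ordering.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SignupProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("user-signups", "user-42", "signed up"));
            }
        }
    }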

Continue reading

The Kafka Ecosystem - Kafka Core, Kafka Streams, Kafka Connect, Kafka REST Proxy, and the Schema Registry

The Kafka Ecosystem - Kafka Core, Kafka Streams, Kafka Connect, Kafka REST Proxy, and the Schema Registry

The core of Kafka is the brokers, topics, logs, partitions, and cluster, along with related tools like MirrorMaker. This is Kafka as it exists in Apache.

The Kafka ecosystem consists of Kafka Core, Kafka Streams, Kafka Connect, Kafka REST Proxy, and the Schema Registry. Most of the additional pieces of the Kafka ecosystem come from Confluent and are not part of Apache.

Continue reading

What is Apache Kafka?

What is Kafka?

Kafka’s growth is exploding: more than one-third of all Fortune 500 companies use Kafka. These companies include the top ten travel companies, seven of the top ten banks, eight of the top ten insurance companies, nine of the top ten telecom companies, and more. LinkedIn, Microsoft, and Netflix process four-comma message volumes a day with Kafka (1,000,000,000,000). Kafka is used for real-time streams of data: to collect big data, to do real-time analysis, or both. Kafka is used with in-memory microservices to provide durability, and it can be used to feed events to CEP (complex event processing) systems and IoT/IFTTT-style automation systems.

Continue reading

Kafka, Avro Serialization and the Schema Registry

Kafka Tutorial: Kafka, Avro Serialization and the Schema Registry

Confluent Schema Registry stores Avro schemas for Kafka producers and consumers. The Schema Registry provides a RESTful interface for managing Avro schemas and allows the storage of a history of versioned schemas. The Confluent Schema Registry also supports checking schema compatibility for Kafka.
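As a hedged illustration (the URL is a placeholder; the serializer classes come from Confluent's kafka-avro-serializer artifact, not Apache Kafka itself), a Java producer is typically pointed at the Schema Registry like this:

    // Hedged sketch: KafkaAvroSerializer registers and fetches Avro schemas
    // from the Schema Registry by subject.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");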

Cloudurable provides Kafka training, Kafka consulting, Kafka support and helps set up Kafka clusters in AWS.

Continue reading

Understanding Apache Avro: Avro Introduction for Big Data and Data Streaming Architectures

Avro Introduction for Big Data and Data Streaming Architectures

Apache Avro™ is a data serialization system. Avro provides data structures, a binary data format, a container file format for storing persistent data, and RPC capabilities. Avro does not require code generation and integrates well with JavaScript, Python, Ruby, C, C#, C++, and Java. Avro is used in the Hadoop ecosystem as well as by Kafka.

Avro is similar to Thrift, Protocol Buffers, JSON, etc. Unlike some of these, Avro does not require code generation. Avro needs less encoding as part of the data, since it stores field names and types in the schema, reducing duplication. Avro supports the evolution of schemas.
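The no-code-generation point is easy to show with a small, hedged Java sketch (the schema and field names are made up for illustration): parse a schema at runtime and build a record against it.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroNoCodegen {
        public static void main(String[] args) {
            // Field names and types live in the schema, not in each encoded record.
            String schemaJson = "{\"type\":\"record\",\"name\":\"Employee\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);
            GenericRecord employee = new GenericData.Record(schema);
            employee.put("name", "Ada");
            employee.put("age", 42);
            System.out.println(employee);
        }
    }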

Continue reading

Kinesis vs. Kafka

Kinesis vs. Kafka

Kinesis works with streaming data.

  • Stock prices
  • Game data (scores from games)
  • Social network data
  • Geospatial data (e.g., Uber-style location data)
  • IoT sensors

Kafka works with streaming data too.

Kinesis Streams is like Kafka Core. Kinesis Analytics is like Kafka Streams. A Kinesis shard is like a Kafka partition.

They are similar and get used in similar use cases.

Data is stored in Kinesis for 24 hours by default, and you can increase retention up to 7 days.

Continue reading

Kafka Broker Startup Scripts

Running a Kafka Broker

Starting brokers in Kafka is pretty straightforward; here are some simple quick-start instructions. But as developers, we want to do at least a little more than just the basics. For instance, my first needs were to start multiple brokers on the same machine and to enable JMX.

Out of the box, you can simply rely on the supplied server.properties. Each broker needs a unique id and a unique port. These are the corresponding properties from server.properties:
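For example (these values are illustrative placeholders, not the article's actual listing), a second broker on the same machine might override:

    broker.id=1
    # Older brokers use port; newer releases use listeners=PLAINTEXT://:9093
    port=9093
    log.dirs=/tmp/kafka-logs-1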

Continue reading

Kafka Tutorial with Examples

Kafka Tutorial

Kafka Tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line. Then we expand on this with a multi-server example. Lastly, we add some simple Java client examples for a Kafka producer and a Kafka consumer. We have started to expand the Java examples to correlate with the design discussion of Kafka, and we have also expanded the Kafka design section and added references.

Continue reading

AWS Cassandra Cluster Tutorial 5: Setting up Cassandra Cluster in AWS/EC2

Cassandra Cluster Tutorial 5 - Cassandra AWS Cluster with CloudFormation, bastion host, Ansible and the aws-command line

This Cassandra tutorial is useful for developers and DevOps/DBA staff who want to launch a Cassandra cluster in AWS.

The cassandra-image project has been using Vagrant and Ansible to set up a Cassandra cluster for local testing. Then we used Packer, Ansible, and EC2: in the last tutorial, we used Packer to create AWS images. In this tutorial, we will use CloudFormation to create a VPC, subnets, security groups, and more to launch a Cassandra cluster in EC2 using the AWS AMI image we created with Packer in the last article. The next two tutorials after this one will set up Cassandra to work in multiple AZs and multiple regions using custom snitches for Cassandra.

Continue reading

Configuring metricsd to set up a disk alarm

What is MetricsD?

MetricsD is a Golang program that gathers metrics from an AWS EC2 instance and reports them to destinations such as AWS CloudWatch. Metrics collected include disk space, CPU activity, memory allocation, and Cassandra KPIs. MetricsD is most often run as a systemd process.

The disk gatherer reports to AWS CloudWatch, sets alarms, or sends emails.

The disk gatherer reports disk state information to AWS CloudWatch, sets alarms in AWS CloudWatch, or sends emails. It leverages the df command.

Continue reading

Cassandra AWS System Memory Guidelines

System Memory Guidelines for Cassandra AWS

Basic guidelines for AWS Cassandra

Do not use less than 8GB of memory for the JVM. The more RAM the better. Use G1GC. Data is first stored in memory (memtables) and then written to disk sequentially as SSTables. The larger the SSTables, the less scanning needs to be done while reading and while determining whether a key is in an SSTable using a bloom filter. In the EC2 world this equates to an m4.xlarge (16GB of memory), since you also need some memory for the OS, specifically the I/O buffers. The i2.xlarge and d2.xlarge are the smallest in their families and exceed the minimum memory requirement (and then some).
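As a hedged sketch of what that guideline looks like in JVM terms (sizes are illustrative; where these flags live depends on the Cassandra version, e.g. conf/jvm.options):

    # Illustrative heap settings: fixed 8GB heap with the G1 collector
    -Xms8G
    -Xmx8G
    -XX:+UseG1GC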

Continue reading

AWS Cassandra: Cassandra, NUMA and EC2

AWS Cassandra and NUMA

The i3.8xlarge, c4.8xlarge, m4.10xlarge, and larger EC2 instance types use more than one CPU socket, which means NUMA controls are available.

A good read on this is Al Tobey’s blog post on Cassandra tuning.

The quickest way to tell if a machine is NUMA is to run “numactl --hardware”. - Al Tobey, blog post on Cassandra tuning

NUMA stands for Non-Uniform Memory Access. Modern x86 CPUs contain an integrated memory controller. Multi-socket systems have one memory controller per socket, and each CPU gets a share of the memory. If one CPU socket needs memory that another CPU socket owns, the memory is transferred between sockets. Transferring this memory is more expensive than accessing memory that is local to the socket. When a JVM thread only uses memory local to one CPU, things go fast; when it does not, access is slower by roughly an order of magnitude (think 10 CPU cycles vs. 100).

Continue reading

Cassandra AWS CPU Guidelines

Cassandra CPU requirements in AWS Cloud

Cassandra is highly concurrent. Cassandra nodes can use as many CPU cores as are available if configured correctly.
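As a hedged illustration (the values are common starting points, not the article's recommendations), the cassandra.yaml knobs that govern this concurrency look like:

    # cassandra.yaml concurrency settings (illustrative values)
    concurrent_reads: 32
    concurrent_writes: 32
    concurrent_compactors: 4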

What are vCPUs and ECUs?

An Amazon EC2 vCPU is a hyperthread, often referred to as a virtual core. Think of it as a hardware thread of execution: it is able to run one software thread at a time (which of course could be swapped out).

An Amazon ECU is a made-up term that AWS used to use; one ECU was roughly the power of the Intel Pentium chip used on the earliest incarnations of EC2. 50 ECUs would be like 50 Pentium chips from a bygone era. Ignore ECUs.

Continue reading

Cassandra AWS Storage Requirements

Cassandra AWS Storage Requirements

Cassandra does a lot of sequential disk I/O for the commit log and for writing out SSTables. You still need random I/O for read operations. The more read operations that are cache misses, the more IOPS your EBS volumes need.

Cassandra writes to four areas:

  • commit logs
  • SSTables
  • index files
  • bloom filters

Consider EC2 instance store instead of EBS for Cassandra

AWS provides local storage for EC2 instances, called instance storage, which is not available with all EC2 instance types, as well as Elastic Block Store (EBS). Instance storage does not have to go over a SAN or intranet; instead, it uses the local hardware bus. Instance storage is right there on the server you are renting. The downsides of EC2 instance storage are the expense and that it is not as flexible as EBS. Due to historic problems with EBS, instance storage used to be the only real option for running Cassandra in AWS. EBS has a reputation for degrading performance over time. Some of this has likely been fixed with enhanced EBS, but instance storage is more reliable.

Continue reading

What is Cassandra?

What is Cassandra?

Cassandra is a linearly scalable, open-source NoSQL database. Cassandra uses a log-structured merge-tree (LSM tree), which makes it one of the best NoSQL options for high-throughput writes. Cassandra delivers continuous availability with operational simplicity. Unlike many other NoSQL solutions, Cassandra is a masterless, peer-to-peer, distributed clustered store. Each node knows about the cluster network topology via the gossip protocol.

Cloudurable provides Cassandra training, Cassandra consulting, Cassandra support and helps set up Cassandra clusters in AWS.

Continue reading

Part 2 Setting up Ansible and ssh for Cassandra Database Cluster DevOps

Cassandra Cluster Tutorial 3: Part 2 of 2

Setting up Ansible and SSH for our Cassandra Database Cluster for DevOps/DBA Tasks

This tutorial series centers on how to perform DevOps/DBA tasks with the Cassandra Database. As we mentioned before, Ansible and ssh are essential DevOps/DBA tools for common DBA/DevOps tasks while working with Cassandra clusters. Please read part 1 before reading part 2.

In part 1, we set up Ansible for our Cassandra Database cluster to automate common DevOps/DBA tasks. As part of this setup, we created an ssh key and then set up our instances with this key so we could use ssh, scp, and, most importantly, Ansible. We also created an Ansible playbook to install keys on our Cassandra nodes from a bastion host that we set up with Vagrant, along the lines of the sketch below.
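A minimal, hedged sketch of such a playbook (the host group, user, and key path are illustrative, not the tutorial's actual files):

    # Hedged sketch: push a public ssh key out to every Cassandra node.
    - hosts: cassandra-nodes
      become: true
      tasks:
        - name: Install the ansible user's public ssh key
          authorized_key:
            user: ansible
            key: "{{ lookup('file', 'files/ansible.pub') }}"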

Continue reading

Setting up Ansible/SSH for Cassandra Database Cluster DevOps Part 1

Cassandra Cluster Tutorial 3: Part 1 of 2

Setting up Ansible/SSH for our Cassandra Database Cluster for DevOps/DBA Tasks

Ansible and ssh are essential DevOps/DBA tools for common DBA/DevOps tasks like managing backups, rolling upgrades to the Cassandra cluster in AWS/EC2, and so much more. An excellent aspect of Ansible is that it uses ssh, so you do not have to install an agent to use Ansible.

This article series centers on how to perform DevOps/DBA tasks with the Cassandra Database. However, the use of Ansible for DevOps/DBA transcends the Cassandra Database, so this article is good information for any DevOps/DBA or developer who needs to manage groups of instances, boxes, or hosts, whether they are on-prem bare metal, dev boxes, or in the AWS cloud. You don’t need to be setting up Cassandra to get value from this article.

Continue reading

Cloud DevOps: Packer, Ansible, SSH and AWS/EC2

Cloud DevOps: Using Packer, Ansible/SSH and AWS command-line tools to create EC2 Cassandra instances in AWS and manage them as a DevOps/DBA.

This article is useful for developers and DevOps/DBA staff who want to create AWS AMI images and manage those EC2 instances with Ansible. Although this article is part of a series about setting up the Cassandra Database images and doing DevOps/DBA with Cassandra clusters, the topics we cover apply to AWS DevOps in general - even if you don’t use Cassandra at all.

Continue reading
