March 14, 2017
What is Cassandra?
Cassandra is a linearly scalable, open source NoSQL database. Cassandra uses a log-structured merge-tree (LSM tree), which makes it one of the best NoSQL options for high-throughput writes. Cassandra delivers continuous availability with operational simplicity. Unlike many other NoSQL solutions, Cassandra is a masterless, peer-to-peer, distributed clustered store. Each node learns about the cluster network topology via the gossip protocol.
Cassandra Key Components
A Cassandra node is one server in a Cassandra cluster. Cassandra nodes store partitions of data. Cassandra nodes deploy into Cassandra clusters, the largest unit of deployment. In AWS, it is normal for a Cassandra cluster to span Availability Zones (AZs) and AWS regions to improve disaster recovery and speed client throughput. A Cassandra node usually equates to an Amazon EC2 instance in AWS.
Cassandra clusters consist of racks and data centers. Nodes get deployed as members of racks, which get deployed into data centers. In AWS, Availability Zones (AZs) usually equate to Cassandra racks, while AWS regions equate to Cassandra data centers. In an AWS region, the racks (AZs) of a Cassandra cluster would deploy into an AWS VPC.
A commit log is a transaction log on every node in the cluster. All Cassandra write operations are written to the commit log first. Commit logs are written sequentially and are append-only. Cassandra reads the commit log only during node recovery, when it replays the log to restore writes. Commit logs are either flushed periodically by the OS or written in batches. If you use HDDs, take care to keep the commit log on a different EBS volume (virtual disk) than the SSTables. Use SSDs if you want to share a volume between the commit log and SSTables.
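The append-then-replay behavior of a commit log can be illustrated with a small sketch. This is a toy model and not Cassandra's actual on-disk format: the `CommitLog` class, the JSON-lines layout, and the file name are all assumptions made for illustration, and it calls fsync on every append for simplicity rather than syncing periodically or in batches.

```python
import json
import os
import tempfile


class CommitLog:
    """Toy append-only log: one JSON record per line (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        # Writes only ever go to the end of the file (sequential, append-only).
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # simplification: sync on every write

    def replay(self):
        # Used only during recovery: rebuild in-memory state in write order.
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                state[record["key"]] = record["value"]
        return state


def demo_recovery():
    with tempfile.TemporaryDirectory() as tmp:
        log = CommitLog(os.path.join(tmp, "commitlog.jsonl"))
        log.append("user:1", "alice")
        log.append("user:2", "bob")
        log.append("user:1", "alice-updated")
        # Simulate a restart: memory is gone, only the log file remains.
        return CommitLog(log.path).replay()
```

Calling `demo_recovery()` rebuilds the latest value for each key purely from the log file, which is the role the commit log plays when a node restarts.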
Cassandra nodes deploy to racks (AWS AZs). The number of racks/AZs should be a multiple of the replication factor. Nodes on a rack or in an AZ have fast connectivity. AZs in the same region have low-latency links, which benefit replication and data consistency checks. In AWS EC2, you can speed up node-to-node communication by using placement groups and enhanced networking, which offers speeds up to 10 GbE.
A memtable is the in-memory counterpart of an SSTable. The memtable is a write-back cache of data rows that is looked up by key. One memtable links to one Cassandra table. Memtables are flushed to disk when the node reaches a global memory threshold, when the commit log is full, or when a table-level flush interval elapses. In EC2, Cassandra nodes must run on EC2 instance types that have enough memory to support Cassandra caches and memtables while leaving enough room for Linux OS buffers for the TCP/IP stack.
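A rough sketch of the memtable idea follows. It models only one of the flush triggers (a size threshold) and keeps flushed data in a Python list instead of on disk; the `Memtable` class and the `flush_threshold` parameter are hypothetical names, not Cassandra settings.

```python
class Memtable:
    """Toy in-memory write buffer for a single table."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: max rows held in memory
        self.rows = {}
        self.flushed = []  # each entry stands in for one on-disk table

    def put(self, key, value):
        self.rows[key] = value
        if len(self.rows) >= self.flush_threshold:
            self.flush()

    def get(self, key):
        # Check memory first, then flushed tables from newest to oldest.
        if key in self.rows:
            return self.rows[key]
        for table in reversed(self.flushed):
            for k, v in table:
                if k == key:
                    return v
        return None

    def flush(self):
        if self.rows:
            # Rows are written out sorted by key; the memtable then starts empty.
            self.flushed.append(sorted(self.rows.items()))
            self.rows = {}
```

Reads consult the newest data first, so a value still sitting in memory shadows an older copy that was flushed earlier.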
Cassandra also has key caches and row caches. Use cases that have high read-to-write ratios will benefit from large caches and EC2 instances that have enough system memory (DRAM).
Cassandra uses bloom filters to quickly test whether a key exists in an SSTable. A bloom filter can tell if the key might exist in a set or definitely does not exist in the set. False positives are possible, but false negatives are not. Bloom filters are a good way of avoiding expensive I/O operations. Bloom filters are stored on disk and are also held in memory to quickly resolve which SSTable has a key. The larger the bloom filter, the more accurate it is. Larger bloom filters require more memory, which might require EC2 instances with more memory.
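A minimal bloom filter sketch shows why false negatives cannot happen while false positives can. The sizes, the salted SHA-256 hashing, and the class itself are assumptions chosen for readability; Cassandra's own implementation differs.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

Every added key sets all of its bits, so `might_contain` always returns True for it. An absent key returns True only if other keys happen to have set all of its bits, which becomes less likely as `num_bits` grows.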
An SSTable is the on-disk representation of a memtable. An SSTable is a write-once (immutable) data structure. When memtables flush, they are written key-sorted and sequentially to form an SSTable. Later, SSTables are compacted by being merge-sorted and rewritten as a new, larger SSTable. Compaction of SSTables can require between 20% and 50% overhead disk space. This space has to be accounted for when allocating EBS volumes.
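The merge-sort step of compaction can be sketched as follows. This is a simplified illustration, assuming each row is a `(key, timestamp, value)` tuple and that the newest timestamp wins for a given key; the `compact` function is a hypothetical name, and details such as deletes are left out.

```python
import heapq


def compact(*sstables):
    """Merge key-sorted tables of (key, timestamp, value) rows into one."""
    merged = []
    # heapq.merge streams the already-sorted inputs in (key, timestamp) order.
    for key, ts, value in heapq.merge(*sstables, key=lambda row: (row[0], row[1])):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, ts, value)  # later timestamp for the same key wins
        else:
            merged.append((key, ts, value))
    return merged


older = [("a", 1, "x"), ("c", 1, "z")]
newer = [("a", 2, "y"), ("b", 2, "w")]
compacted = compact(older, newer)
```

Because the inputs are already sorted, the merge reads each table sequentially and writes one new sorted table. The old and new tables coexist on disk until the merge finishes, which is one reason to reserve extra disk space.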
Index files index into the main data files (SSTables) and speed up reads. You need memory for these as well.
A keyspace is like a database schema. Keyspaces contain many tables and define the replication strategy, replication factor, and related rules. The more replication, the more I/O is needed, which increases I/O costs (more IOPS and more network bandwidth).
A Cassandra table is like a SQL table except it can contain complex columns (maps, sets, lists). Cassandra tables are collections of ordered columns fetched by row. A table's row key is known as the partition key. The partition key determines how data is distributed across a cluster.
Cloudurable specializes in AWS DevOps Automation for Cassandra and Kafka
We hope this brief article on what Cassandra is has been helpful. We help companies install Cassandra and Kafka in AWS. We help them plug into CloudWatch for KPIs, monitoring, alarms and log aggregation, as well as automate the install and maintenance tasks for Kafka and Cassandra. We also provide Cassandra consulting and Kafka consulting to get you set up fast in AWS with CloudFormation and CloudWatch. Support us by checking out our Cassandra training and Kafka training.
If you have a read-intensive application, having more memory for bloom filters, key caches, and index files is going to speed up read operations. You can also speed up reads by using RAID 1 or Cassandra JBOD (just a bunch of disks). JBOD is faster than RAID level 1. EC2 instances that support enhanced EBS throughput (EBS-optimized instances) will give better and more consistent performance than EC2 instances that do not. Matching the IOPS (I/O operations per second) to your use case is important as well. Since IOPS on SSD-backed EBS volumes is a function of volume size, if you have a smaller data set, you could use AWS EBS Provisioned IOPS for a read-intensive application as well.
Other key Cassandra Concepts, Data Structures, and Algorithms
Cassandra uses data partitioning. The Cassandra distributed cluster forms a single logical database. The data is spread across Cassandra nodes in the Cassandra cluster. Every Cassandra node is responsible for part of the data for the database. Spreading the data across the nodes is data partitioning. Other databases call this sharding.
Later versions of Cassandra use virtual nodes (vnodes). Thus each Cassandra server node can contain many virtual nodes. Vnodes make it easier to add and remove Cassandra server nodes in a cluster. Cassandra employs a partition key to spread the data across the nodes using consistent hashing. Since Cassandra uses consistent hashing, its cluster is often referred to as a ring.
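Consistent hashing with virtual nodes can be sketched in a few lines. This is an illustration under stated assumptions: the `token` function uses SHA-256 as a stand-in for Cassandra's partitioner, and the `Ring` class, node names, and `vnodes_per_node` value are hypothetical.

```python
import bisect
import hashlib


def token(value):
    # Stand-in token function: first 8 bytes of SHA-256 as an integer.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several tokens (virtual nodes) on the ring.
        self.tokens = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.keys = [t for t, _ in self.tokens]

    def node_for(self, partition_key):
        # Walk clockwise to the first vnode token >= the key's token, wrapping around.
        idx = bisect.bisect_left(self.keys, token(partition_key)) % len(self.keys)
        return self.tokens[idx][1]


ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

When a fourth node joins, only the keys that fall just before its new tokens change owner, and they all move to the new node. Keys elsewhere on the ring stay where they were, which is what makes adding and removing nodes cheap.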
More to come: we will discuss more concepts like data replication, data consistency, eventual consistency, tunable consistency, snitches, gossip, write path, read path, hinted handoffs, and anti-entropy mechanisms.
More info about Cassandra and AWS
Read more about Cassandra AWS with this slide deck.
Cloudurable™: streamline DevOps/DBA for Cassandra running on AWS. Cloudurable™ provides AMIs, CloudWatch monitoring, CloudFormation templates and monitoring tools to support Cassandra in production running in EC2. We also teach advanced Cassandra courses that show Developers and DevOps/DBA how to develop, support and deploy Cassandra to production in AWS EC2. We also provide Cassandra consulting and Cassandra training.
More info about Cloudurable
Please take some time to read the Advantage of using Cloudurable™.
- Subscription Cassandra support to streamline DevOps (Support subscription pricing for Cassandra and Kafka in AWS)
- Quickstart Mentoring Consulting for Developers and DevOps
- Architectural Analysis Consulting
- Training and mentoring for Cassandra for DevOps and Developers
- Training and mentoring for Kafka for DevOps and Developers
- We specialize in AWS Cassandra deployments for organizations that are setting up Cassandra as a Service.
Written by R. Hightower and JP Azar.
We hope you enjoyed this article. Please provide feedback.