Cassandra AWS Storage Requirements

March 14, 2017

                                                                           

Cassandra does a lot of sequential disk I/O for the commit log and for writing out SSTables. You still need random I/O for read operations. The more read operations that are cache misses, the more IOPS your EBS volumes need.

Cassandra writes to four areas:

  • commit logs
  • SSTable
  • an index file
  • a bloom filter

Consider EC2 instance store instead of EBS for Cassandra

AWS provides two kinds of storage for EC2: local storage called instance storage, which is not available with all EC2 instance types, and Elastic Block Store (EBS). Instance storage does not have to go over a SAN or the network; instead it uses the local hardware bus. Instance storage is right there on the server you are renting. The downside of EC2 instance storage is the expense, and it is not as flexible as EBS. Due to historic problems with EBS, instance storage used to be the only real option for running Cassandra in AWS. EBS has a reputation for degrading performance over time. Some of this has likely been fixed with enhanced EBS, but instance storage is more reliable.

EBS is ok for Cassandra, Prefer EBS

Using EBS with Cassandra did not work very well in the past, and you had to use more expensive EC2 instances with instance storage.

Until recently, using Cassandra with AWS EBS was not a good idea. The latest generation of EBS-optimized instances offers good performance that, for many use cases, rivals instance storage. EBS volumes are usually the best pick for price-performance. If in doubt, start with EBS-optimized instances. EBS has nice features like snapshots and redundancy that make it the preferred choice if performance is close or horizontal scale-out is an option. Also, with EBS elastic volumes, provisioned IOPS and enhanced EBS, it would be hard not to pick EBS. It is just a lot more flexible, and less expensive.

EC2 instances we work with to get the right IO for Cassandra use cases

The EC2 instances we use tend to come from the M4 family and the I3 family (released Nov 30, 2016). M4 is AWS EC2's newest generation of general-purpose instances with EBS-optimized storage, while the I3 family includes fast SSD-backed instance storage optimized for very high random I/O performance. I3s provide high IOPS at a low cost. In benchmarks of tiny reads and writes, instance storage beats M4 with EBS at about 8x the read speed (note that the benchmark was I2 vs. M4, but I3 is the latest generation). For medium reads and writes, EBS-optimized M4 instances perform about the same at roughly one eighth of the cost of I3s (keep in mind that price goes down and performance goes up over time). There have been some reports of EBS storage degrading over time. But at one eighth of the cost, and with some monitoring and replication, you could automate the retirement of EC2 instances whose EBS volumes are degrading.

EC2 I3 instances go up to 3.3 million random IOPS (great for reads) at 4KB block size, and the I3 throughput goes up to 16 GB/s. It is a beast.

An advantage of the M4 family is the ability to use EBS to create snapshots and to spin up a new instance simply by attaching an existing EBS volume to it. If you are not sure, start with m4.2xlarge. (You can use optimized EBS with I3 as well.)

You can consider the D2 family of EC2 instances for mostly-write workloads or offline analytics that perform large queries. The D2 family offers the highest throughput for the cost. If you are keeping a lot of logs or even approaching big data use cases, this might be a great option for high throughput (mostly writes and mostly batch reads). To learn more about a use case where D2 made the most sense, see Scale it to Billions, where they used D2 for an IoT application that streams device data into Cassandra.

Separate EBS volume for Cassandra commit log

If you are using magnetic disks, it makes sense to put the commit log on a separate disk where possible. SSTables are written sequentially as streams, but they are read using random access when the data is not found in cache.

Prefer SSD to HDD, at least start with SSD

SSTables benefit from SSD drives because they are read with random access. If in doubt, use SSD EBS volumes. SSDs provide low-latency response times for random reads and enough throughput for the long sequential writes done by compaction, SSTable flushes and the commit log. Magnetic disks in EC2 have greater throughput but fewer IOPS, which is good for SSTable compaction but not good for random reads. HDDs are the cheapest per byte of storage and per byte of throughput. There are some benchmarks that show HDD EBS volumes doing well, but it depends on the use case. You can change the volume type later after observing load test and production KPIs for IOPS and throughput. You can use provisioned IOPS with SSDs to buy IOPS for Cassandra clusters that are doing a lot of reads.
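As a rough illustration of how read load turns into an IOPS requirement, the sketch below multiplies the read rate by the cache miss ratio, since only cache misses reach the disk. The function name, the sstables_per_read factor and the sample numbers are assumptions made for this example, not measured Cassandra behavior; treat the result as a starting point to compare against your own KPIs.

```python
def estimate_disk_read_iops(reads_per_sec, cache_hit_ratio, sstables_per_read=1.0):
    """Rough estimate of disk read IOPS for one node.

    Only reads that miss the cache reach the volume. sstables_per_read is a
    hypothetical multiplier for how many SSTables a cache miss touches.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    if reads_per_sec < 0 or sstables_per_read < 0:
        raise ValueError("rates and multipliers must not be negative")
    miss_ratio = 1.0 - cache_hit_ratio
    return reads_per_sec * miss_ratio * sstables_per_read


if __name__ == "__main__":
    # Hypothetical node: 10,000 reads/s, 90% cache hits, 2 SSTables per miss.
    needed = estimate_disk_read_iops(10_000, 0.90, sstables_per_read=2)
    print(f"Estimated disk read IOPS: {needed:.0f}")
```

If the estimate is well above what the volume type delivers, that is a hint to look at provisioned IOPS, more cache, or more nodes.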

Keep Cassandra compaction in mind when choosing EBS volume sizes

Take replication and compaction overhead into account. The compaction process of SSTable data makes heavy use of the disk. LeveledCompactionStrategy may need 10 to 20% overhead. In the worst case, SizeTieredCompactionStrategy needs 50% overhead to perform compaction.

Keep this in mind while sizing disks. If you have a high-update use case, LeveledCompactionStrategy is the best solution if you want to limit the total disk size used at any point in time and to optimize reads, as a row will be spread across fewer (up to ten times fewer) SSTables. LeveledCompactionStrategy requires more IO and processing time for compactions. If in doubt, use LeveledCompactionStrategy.
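The sizing arithmetic can be sketched as follows. This is a minimal sketch that assumes the overhead percentages above mean the fraction of each volume that must stay free for compaction; the function names and the sample cluster are hypothetical, and real sizing should also leave room for growth and snapshots.

```python
# Fraction of the volume assumed to stay free for compaction, using the upper
# figures quoted in the text (an interpretation, not an official formula).
FREE_SPACE_FRACTION = {
    "LeveledCompactionStrategy": 0.20,
    "SizeTieredCompactionStrategy": 0.50,
}


def per_node_data_gb(raw_data_gb, replication_factor, node_count):
    """Data each node holds once every row is stored replication_factor times."""
    if node_count < 1 or replication_factor < 1:
        raise ValueError("node_count and replication_factor must be at least 1")
    return raw_data_gb * replication_factor / node_count


def required_volume_gb(data_gb_per_node, strategy):
    """Volume size that leaves the assumed compaction headroom free."""
    free = FREE_SPACE_FRACTION[strategy]
    return data_gb_per_node / (1.0 - free)


if __name__ == "__main__":
    # Hypothetical cluster: 1 TB of raw data, replication factor 3, 6 nodes.
    node_data = per_node_data_gb(1000, 3, 6)
    for name in FREE_SPACE_FRACTION:
        print(name, round(required_volume_gb(node_data, name)), "GB per node")
```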

We hope you found this blog post on AWS storage requirements for Cassandra running in EC2/AWS helpful. Cloudurable specializes in AWS DevOps automation for Cassandra and Kafka. Cloudurable provides Cassandra consulting and Kafka consulting to get you set up fast in AWS with CloudFormation and CloudWatch. Check out our AWS-centric Cassandra training and Kafka training.

Read speed RAID0 or JBOD

If you use RAID, RAID 0, which focuses on speed, is sufficient for Cassandra because Cassandra has replication and data-safety built-in. With Cassandra 3.x you should use Cassandra JBOD (just a bunch of disks) instead of RAID 0 for throughput speed. JBOD is preferred, and it can help with random read speeds.
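With JBOD, each disk or EBS volume is mounted separately and listed under data_file_directories in cassandra.yaml, and the commit log can point at its own volume through commitlog_directory. The sketch below renders such a fragment; the helper name and the mount points (/mnt/data1, /mnt/data2, /mnt/commitlog) are hypothetical and should be replaced with your own paths.

```python
def render_storage_yaml(data_dirs, commitlog_dir):
    """Render a cassandra.yaml fragment with one data directory per JBOD disk."""
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    lines = ["data_file_directories:"]
    # One list entry per mounted disk or EBS volume.
    lines += [f"    - {path}" for path in data_dirs]
    lines.append(f"commitlog_directory: {commitlog_dir}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical mount points for two data volumes and one commit log volume.
    print(render_storage_yaml(["/mnt/data1", "/mnt/data2"], "/mnt/commitlog"), end="")
```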

AWS Elastic Volume and Linux file systems - Elastic volume perfect for Cassandra

AWS added Elastic Volumes in February 2017, so you can now change the EBS volume type on a running node! The new EBS elastic volumes go well with ext4 and XFS. XFS is the preferred file system since it has fewer size constraints (sudo mkfs.xfs -K /dev/xvdb) and excels at writing in parallel. You can use ext4 as well, but avoid other file systems. After you grow a volume, expand an ext4 file system with sudo resize2fs /dev/xvda1, and expand an XFS file system with sudo xfs_growfs -d /mnt. The key point here is that the Linux OS will not automatically expand the file system. You will have to tell it.
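A small helper can pick the right grow command for each file system. This is only a sketch: the function name is made up for this example, it builds the command without running it, and the device and mount point are the ones used in the text, so adjust them for your node.

```python
def grow_filesystem_command(fs_type, device, mount_point):
    """Return the command that expands a file system after its EBS volume grows.

    ext4 is grown by device with resize2fs; XFS is grown by mount point with
    xfs_growfs. Other file systems are rejected, as the text advises against them.
    """
    if fs_type == "ext4":
        return ["sudo", "resize2fs", device]
    if fs_type == "xfs":
        return ["sudo", "xfs_growfs", "-d", mount_point]
    raise ValueError(f"unsupported file system: {fs_type}")


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try anywhere.
    print(" ".join(grow_filesystem_command("ext4", "/dev/xvda1", "/")))
    print(" ".join(grow_filesystem_command("xfs", "/dev/xvdb", "/mnt")))
```

To run a command for real, the returned list could be passed to subprocess.run(cmd, check=True) on the node itself.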

Cassandra Encryption at rest use Amazon KMS and EBS

If you need encryption of data at rest, use encrypted EBS volumes with KMS if running in EC2, and use a dm-crypt encrypted file system if not. With encrypted EBS volumes, the encryption work is done by AWS infrastructure rather than by the Cassandra JVM, and the keys are protected by KMS (which is backed by Hardware Security Modules), so it is going to be much faster than the encryption that comes with the JDK. The next fastest option is Linux file system encryption. Avoid JDK-based encryption in Cassandra. Also, KMS allows you to easily rotate keys and expire them.

Encrypted volumes have the same IOPS performance as unencrypted volumes. All other forms of encryption have overhead, and potentially a lot of overhead (a 20% or 30% CPU hit is not uncommon for OS and JDK encryption).

EBS easily supports KMS encryption, and it integrates well. However, instance storage requires that you use an encrypted file system like dm-crypt. This is another clear advantage for EBS with KMS.

EBS has been known to degrade over time - watch with CloudWatch

With EBS, you need to keep an eye out for EBS issues like poor throughput, performance degrading over time, and instances not dying cleanly. This is where system monitoring like CloudWatch comes into play, and it is one reason we build AMIs (images) that can be monitored using Amazon CloudWatch. We support aggregating Linux OS logs and Cassandra logs into CloudWatch. We also support sending OS metrics and Cassandra metrics to CloudWatch.
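One simple way to flag a degrading volume is to compare recent throughput samples against a known-good baseline. The sketch below is only an illustration: the function name, the window size and the 80% tolerance are hypothetical choices, and in practice the samples would come from your monitoring system (for example, metrics collected in CloudWatch) rather than a hard-coded list.

```python
from statistics import mean


def is_degrading(samples_mbps, baseline_mbps, window=5, tolerance=0.8):
    """Return True when the recent average falls below tolerance * baseline.

    samples_mbps is a list of throughput samples, oldest first. With fewer
    than `window` samples there is not enough data, so nothing is flagged.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(samples_mbps) < window:
        return False
    recent_average = mean(samples_mbps[-window:])
    return recent_average < baseline_mbps * tolerance


if __name__ == "__main__":
    # Hypothetical samples: the volume drops from about 100 MB/s to about 70 MB/s.
    samples = [100, 100, 70, 70, 70, 70, 70]
    if is_degrading(samples, baseline_mbps=100):
        print("Volume looks degraded; consider replacing the node")
```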

Set memtable_flush_writers in cassandra.yaml

If your data directories are backed by instance storage SSDs, you can increase this setting as long as memtable_flush_writers * data_file_directories <= number of vCPUs. If you are using instance storage HDDs or EBS SSDs, set memtable_flush_writers to the number of vCPUs. Do not leave this set to 1.
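The rule above can be written as a small calculation. This is a sketch of the guidance in this section only, not an official Cassandra formula; the function name and parameters are hypothetical.

```python
def suggested_memtable_flush_writers(vcpus, data_dirs, instance_store_ssd):
    """Suggest a memtable_flush_writers value from the rule of thumb above.

    Instance storage SSD: the largest value that keeps
    memtable_flush_writers * data_file_directories <= number of vCPUs.
    Instance storage HDD or EBS SSD: the number of vCPUs.
    """
    if vcpus < 1 or data_dirs < 1:
        raise ValueError("vcpus and data_dirs must be at least 1")
    if instance_store_ssd:
        return max(1, vcpus // data_dirs)
    return vcpus


if __name__ == "__main__":
    # Hypothetical 8 vCPU node with two data directories.
    print("instance store SSD:", suggested_memtable_flush_writers(8, 2, True))
    print("EBS SSD:", suggested_memtable_flush_writers(8, 2, False))
```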

If Cassandra read speed is a problem, there are lots of ways to scale reads

Scaling Cassandra read speeds:

  • Horizontally scale Cassandra (more nodes)
  • Use instance store (super fast IO)
  • Buy provisioned IOPs – or bigger SSDs
  • Add more disks to each node using JBOD (more disks / EBS volumes)
  • EC2 instances with more SSDs or disks if using instance storage
  • Use a bigger key-cache, row-cache (more memory / more cache)
  • More disk space for SizeTieredCompactionStrategy
  • Don’t forget to optimize query and partition keys
  • Add more tables or materialized views to optimize queries

Cloudurable provides Cassandra training, Cassandra consulting, Cassandra support and helps setting up Cassandra clusters in AWS.

More info about Cassandra and AWS

Read more about Cassandra AWS with this slide deck.

Amazon has a guide that covers Cassandra on AWS that is a must read. There is also this Amazon Cassandra guide on High Scalability that is a must read.

About Cloudurable™

Cloudurable™ streamlines DevOps/DBA for Cassandra running on AWS. Cloudurable™ provides AMIs, CloudWatch monitoring, CloudFormation templates and monitoring tools to support Cassandra in production running in EC2. We also teach advanced Cassandra courses that show developers and DevOps/DBA staff how to develop, support and deploy Cassandra to production in AWS EC2. We also provide Cassandra consulting and Cassandra training.

More info about Cloudurable

Please take some time to read the Advantage of using Cloudurable™.

Authors

Written by R. Hightower and JP Azar.

