March 14, 2017
Cassandra AWS Storage Requirements
Cassandra does a lot of sequential disk I/O for the commit log and for writing out SSTables. You still need random I/O for read operations. The more read operations that are cache misses, the more IOPS your EBS volumes need.
Cassandra writes to four areas:
- commit logs
- SSTables
- an index file
- a bloom filter
Consider EC2 instance store instead of EBS for Cassandra
AWS provides two kinds of storage for EC2: local instance storage, which is not available with all EC2 instance types, and Elastic Block Store (EBS). Instance storage does not have to go over a SAN or the network; instead it uses the local hardware bus. Instance storage is right there on the server you are renting. The downsides of EC2 instance storage are the expense and that it is not as flexible as EBS. Due to historic problems with EBS, instance storage used to be the only real option for running Cassandra in AWS. EBS has a reputation for degrading performance over time. Some of this has likely been fixed with enhanced EBS, but instance storage is more reliable.
EBS is ok for Cassandra, Prefer EBS
Using EBS with Cassandra did not work very well in the past, and you had to use more expensive EC2 instances with instance storage.
Until recently, using Cassandra with AWS EBS was not a good idea. The latest generation of EBS-optimized instances offers good performance that for many use cases rivals instance storage. EBS volumes are usually the best pick for price/performance. If in doubt, start with EBS-optimized instances. EBS has nice features like snapshots and redundancy that make it preferred if performance is close or horizontal scale-out is an option. Also, with EBS elastic volumes, provisioned IOPS and enhanced EBS, it would be hard not to pick EBS. It is just a lot more flexible and less expensive.
EC2 instances we work with for the right IO for Cassandra use case
EC2 instances we use tend to be from the M4 family and the I3 family (released Nov 30, 2016). M4 is AWS EC2's newest generation of general-purpose instances with EBS-optimized storage, whilst the I3 family includes fast SSD-backed instance storage optimized for very high random I/O performance. I3s provide high IOPS at a low cost. For tiny reads/writes, benchmarks show i3 instances beating m4 instances with 8x the read speed (note the benchmark was I2 vs. M4, but I3 is the latest). For medium reads/writes, m4 instances (EBS optimized) are equivalent but at one-eighth the cost of i3s (keep in mind price goes down and performance goes up over time). There have been some reports of EBS storage degrading over time. But at one-eighth the cost, and with some monitoring and replication, you could automate the retirement of EC2 instances whose optimized EBS volumes are degrading.
EC2 I3 instances go up to 3.3 million random IOPS (great for reads) at 4KB block size, and the I3 throughput goes up to 16 GB/s. It is a beast.
An advantage of the M4 family is the ability to use EBS to create snapshots and simply spin up new instances by attaching an EBS volume to a new instance. If you are not sure, start with m4.2xlarge. (You can use optimized EBS with I3 as well.)
You can consider the D2 family of EC2 instances for mostly-write workloads or offline analytics that perform large queries. The D2 family offers the highest throughput for cost. If you are keeping a lot of logs or even approaching big data use cases, this might be a great option for high throughput (mostly writes and mostly batch reads). To learn more about a use case where D2 made the most sense, see Scale it to Billions, where they used D2 with an IoT device streaming data application using Cassandra.
Separate EBS volume for Cassandra commit log
It makes sense if possible to have commit logs on a separate disk if using magnetic disks.
SSTables are written to in streams but are read from using random access if data is not found in cache.
Prefer SSD to HDD at least start with SSD
SSTables can benefit from SSD drives due to random access. If in doubt use SSD EBS volumes. SSDs provide low-latency response times for random read operations and supply enough throughput for the long sequential writes of compaction operations, writing SSTables and commit logs. Magnetic disks in EC2 have greater throughput but fewer IOPS, which is good for SSTable compaction but not good for random reads. HDDs are the cheapest per byte of storage and the cheapest per byte of throughput. There are some benchmarks that show HDD EBS volumes doing well, but it depends on the use case. If in doubt, use SSD volumes. You can change it later after observing load test and production KPIs for IOPS and throughput. You can use provisioned IOPS with SSDs to buy IOPS for Cassandra clusters that are doing a lot of reads.
Keep Cassandra compaction in mind when choosing EBS volume sizes
Take replication and compaction overhead into account. The compaction process of SSTable data makes heavy use of the disk.
LeveledCompactionStrategy may need 10 to 20% overhead.
SizeTieredCompactionStrategy worst case is 50% overhead needed to perform compaction.
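The overhead figures above can be turned into a rough volume size. The following is a minimal sketch, not a definitive sizing method. It assumes "overhead" means the percentage of the volume that must stay free for compaction, and required_volume_gb is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate the volume size that leaves compaction headroom.
# Assumption: "overhead" is the percentage of the volume that must stay free.
# usage: required_volume_gb <data_gb_on_node> <free_space_pct>
required_volume_gb() {
    data_gb=$1
    free_pct=$2
    usable_pct=$((100 - free_pct))
    # Integer ceiling of data_gb * 100 / usable_pct
    echo $(( (data_gb * 100 + usable_pct - 1) / usable_pct ))
}

required_volume_gb 1000 50   # SizeTieredCompactionStrategy worst case, prints 2000
required_volume_gb 1000 20   # LeveledCompactionStrategy upper estimate, prints 1250
```

The data size passed in should be the data held on one node after replication, since each replica occupies disk on the node that stores it.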
Keep this in mind while sizing disks. If you are doing a high-update use case, LeveledCompactionStrategy is the best solution if you want to limit the total disk size used at any point in time and to optimize reads, as the row will be spread across fewer (up to ten times fewer) SSTables.
LeveledCompactionStrategy requires more IO and processing time for compactions. If in doubt, use
We hope you find this blog post on AWS storage requirements for Cassandra running in EC2/AWS helpful. Cloudurable specializes in AWS DevOps automation for Cassandra and Kafka. Cloudurable provides Cassandra consulting and Kafka consulting to get you set up fast in AWS with CloudFormation and CloudWatch. Check out our AWS-centric Cassandra training and Kafka training.
Read speed RAID0 or JBOD
If you use RAID, RAID 0, which focuses on speed, is sufficient for Cassandra because Cassandra has replication and data safety built in. With Cassandra 3.x you should use Cassandra JBOD (just a bunch of disks) instead of RAID 0 for throughput speed. JBOD is preferred, and it can help with random read speeds.
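As a rough illustration of a JBOD layout with a separate commit log volume, the sketch below prints a cassandra.yaml fragment. The settings commitlog_directory and data_file_directories are real cassandra.yaml keys. The helper name jbod_yaml and the mount points are assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print a cassandra.yaml fragment for a JBOD layout.
# The first argument is the commit log directory; the rest are data directories,
# each assumed to be the mount point of its own disk or EBS volume.
# usage: jbod_yaml <commitlog_dir> <data_dir>...
jbod_yaml() {
    echo "commitlog_directory: $1"
    shift
    echo "data_file_directories:"
    for dir in "$@"; do
        echo "    - $dir"
    done
}

# Example mount points are placeholders, not paths from this article.
jbod_yaml /mnt/commitlog /mnt/data1 /mnt/data2
```

Listing one directory per volume lets Cassandra spread SSTables across the disks itself, which is what makes RAID 0 unnecessary in this setup.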
AWS Elastic Volume and Linux file systems - Elastic volume perfect for Cassandra
AWS Elastic Volumes were added in 2/2017; you can change the EBS type on a running node! The new EBS elastic volumes go well with ext4 and XFS. XFS is the preferred file system since it has fewer size constraints (sudo mkfs.xfs -K /dev/xvdb) and excels at writing in parallel. You can use ext4 as well but avoid others.
For ext4, you will need to expand the file system using sudo resize2fs /dev/xvda1, and for XFS use sudo xfs_growfs -d /mnt. The key point here is that the Linux OS will not automatically expand the file system. You will have to tell it.
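The choice between the two grow commands can be sketched as a small dry-run helper. This is only an illustration: grow_fs_command is a hypothetical name, and it prints the command rather than running it so you can review it first. The device and mount point are the ones used in the text above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the command that grows a file system
# after its EBS volume has been enlarged. It does not execute anything.
# usage: grow_fs_command <fs_type> <device> <mount_point>
grow_fs_command() {
    case "$1" in
        ext4) echo "sudo resize2fs $2" ;;       # resize2fs takes the device
        xfs)  echo "sudo xfs_growfs -d $3" ;;   # xfs_growfs takes the mount point
        *)    echo "unsupported file system: $1" >&2; return 1 ;;
    esac
}

grow_fs_command xfs /dev/xvdb /mnt
```

Note the asymmetry the helper encodes: resize2fs is pointed at the block device, while xfs_growfs is pointed at the mounted file system.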
Cassandra Encryption at rest use Amazon KMS and EBS
If you need data at rest encryption, use encrypted EBS volumes / KMS if running in EC2, and use dm-crypt file system if not. Since AWS KMS uses hardware-assisted encryption (Hardware Security Modules), it is going to be much faster than the encryption that comes with the JDK. Next fastest would be Linux based file system encryption. Avoid encryption from Cassandra JDK. Also, KMS allows you to easily rotate keys and expire them.
Encrypted volumes have the same IOPS performance as unencrypted volumes. All other forms of encryption have overhead, and potentially a lot of overhead (a 20% or 30% CPU hit is not uncommon for OS and JDK encryption).
EBS easily supports KMS encryption, and it integrates well. However, instance storage requires that you use an encrypted file system like dm-crypt. This is another clear advantage for KMS.
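Creating a KMS-encrypted EBS volume with the AWS CLI might look like the sketch below. The helper encrypted_volume_command is hypothetical and only prints the command. The availability zone, size, volume type and key alias are placeholder assumptions, not values from this article.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print an AWS CLI command that creates an
# encrypted EBS volume using a KMS key. Nothing is created by this script.
# usage: encrypted_volume_command <availability_zone> <size_gib> <kms_key>
encrypted_volume_command() {
    echo "aws ec2 create-volume --availability-zone $1 --size $2 --volume-type gp2 --encrypted --kms-key-id $3"
}

# Placeholder values: substitute your own zone, size and KMS key alias or ARN.
encrypted_volume_command us-east-1a 500 alias/cassandra-data
```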
EBS has been known to degrade over time - watch with CloudWatch
With EBS, you need to keep an eye out for EBS issues like poor throughput, performance degrading over time, and instances not cleanly dying. This is where system monitoring like CloudWatch comes into play, and it is one reason we build AMIs which can be monitored using Amazon CloudWatch. We support Linux OS log aggregation and Cassandra log aggregation into CloudWatch. We also support OS metrics and Cassandra metrics in CloudWatch.
Set memtable_flush_writers in cassandra.yaml
If your data directories are backed by instance storage SSDs, you can increase this using memtable_flush_writers * data_file_directories <= # of vCPUs. If you are using instance storage HDDs or EBS SSDs, use memtable_flush_writers: #vCPUs. Do not leave this set to 1.
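The two rules above can be sketched as a small calculation. This is an illustrative reading of the guidance, not an official formula. The helper name flush_writers and the storage label instance-ssd are assumptions introduced for this example.

```shell
#!/bin/sh
# Hypothetical helper: suggest a memtable_flush_writers line for cassandra.yaml.
# instance-ssd: keep writers * data directories <= vCPUs.
# anything else (instance storage HDD, EBS SSD): one writer per vCPU.
# usage: flush_writers <instance-ssd|other> <vcpus> <data_dir_count>
flush_writers() {
    storage=$1
    vcpus=$2
    dirs=$3
    if [ "$storage" = "instance-ssd" ]; then
        writers=$((vcpus / dirs))
        # Never suggest fewer than one writer.
        if [ "$writers" -lt 1 ]; then writers=1; fi
    else
        writers=$vcpus
    fi
    echo "memtable_flush_writers: $writers"
}

flush_writers instance-ssd 8 2   # prints memtable_flush_writers: 4
flush_writers ebs-ssd 8 2        # prints memtable_flush_writers: 8
```

The example uses 8 vCPUs, which matches an m4.2xlarge.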
If Cassandra read speed is a problem, there are lots of ways to scale reads
Scaling Cassandra read speeds:
- Horizontally scale Cassandra (more nodes)
- Use instance store (super fast IO)
- Buy provisioned IOPs – or bigger SSDs
- Add more disks to each node using JBOD (more disks / EBS volumes)
- EC2 instances with more SSDs or Disks if using
- Use a bigger key-cache, row-cache (more memory / more cache)
- More disk space for SizeTieredCompactionStrategy
- Don’t forget to optimize query and partition keys
- Add more tables or materialized views to optimize queries
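The "bigger SSDs" item in the list above works because the baseline IOPS of a general purpose (gp2) EBS volume grows with its size. The sketch below assumes the gp2 rule of 3 IOPS per GiB with a floor of 100. The upper cap has changed over time, so it is an optional parameter here (defaulting to 16000) and should be checked against the current EBS documentation. The helper name gp2_baseline_iops is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate the baseline IOPS of a gp2 volume.
# Assumptions: 3 IOPS per GiB, minimum 100, maximum given by the optional
# second argument (default 16000; verify against current AWS limits).
# usage: gp2_baseline_iops <size_gib> [max_iops]
gp2_baseline_iops() {
    max=${2:-16000}
    iops=$(( $1 * 3 ))
    if [ "$iops" -lt 100 ]; then iops=100; fi
    if [ "$iops" -gt "$max" ]; then iops=$max; fi
    echo "$iops"
}

gp2_baseline_iops 500   # prints 1500
```

If the estimate falls short of the read load you observe, that is the point at which provisioned IOPS or additional JBOD volumes become worth considering.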
Slide deck on AWS Cassandra Storage requirements
More info about Cassandra and AWS
Read more about Cassandra AWS with this slide deck.
Cloudurable™: streamline DevOps/DBA for Cassandra running on AWS. Cloudurable™ provides AMIs, CloudWatch Monitoring, CloudFormation templates and monitoring tools to support Cassandra in production running in EC2. We also teach advanced Cassandra courses, which teach developers and DevOps/DBA how to develop, support and deploy Cassandra to production in AWS EC2. We also provide Cassandra consulting and Cassandra training.
More info about Cloudurable
Please take some time to read the Advantage of using Cloudurable™.
- Subscription Cassandra support to streamline DevOps (Support subscription pricing for Cassandra and Kafka in AWS)
- Quickstart Mentoring Consulting for Developers and DevOps
- Architectural Analysis Consulting
- Training and mentoring for Cassandra for DevOps/DBA and Developers
- Training and mentoring for Kafka for DevOps and Developers
- We specialize in AWS Cassandra deployments for organizations that are setting up Cassandra as a Service.
Written by R. Hightower and JP Azar.
Slide deck that covers configuring AWS Cassandra
- Carpenter, Jeff; Hewitt, Eben (2016-06-29). Cassandra: The Definitive Guide: Distributed Data at Web Scale. O’Reilly Media. Kindle Edition.
- High Scalability: How To Setup A Highly Available Multi-AZ Cassandra Cluster On AWS EC2
- Datastax: Tuning Java Resources
- Uber Robert: Bandwidth required for hinted handoff
- Notes on Cassandra AWS Deploy
- Scaling to billions - what they don’t tell you in Cassandra README
- Amazon: Apache Cassandra on AWS Best Practices