Why you need watchdogs with Cassandra

Watchdogs for non-stop Cassandra

For rock-solid reliability and resilience, Cloudurable™ Cassandra EC2 instances are started and monitored by the systemd watchdog service. The systemd watchdog ensures that the Cassandra service is running, and if it stops running, systemd restarts Cassandra.

The key point here is that the watchdog automatically restarts Cassandra if the JVM ever exits or crashes. If you want to stop Cassandra, you need to tell systemd to stop the service (sudo systemctl stop cassandra). The watchdog is the one built into the OS: systemd provides full supervisor (software) watchdog support for individual system services. Since we configure Cassandra as a systemd service, we get this watchdog support. This allows you to support Cassandra in production, which is essential for DevOps.
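As a rough sketch of what such a service definition can look like, here is a minimal systemd unit fragment. It is not the unit file Cloudurable ships: the file path, install location, user name, and timing values are all assumptions for illustration. Restart=always is the directive that asks systemd to bring the process back after an exit or crash, while an explicit systemctl stop is still honored.

```ini
# /etc/systemd/system/cassandra.service (hypothetical path and values)
[Unit]
Description=Cassandra
After=network.target

[Service]
# Assumed service account and install location; adjust to your image.
User=cassandra
# -f keeps Cassandra in the foreground so systemd supervises the JVM directly.
ExecStart=/opt/cassandra/bin/cassandra -f
# Restart whenever the process exits or crashes.
# A deliberate "systemctl stop cassandra" is not restarted.
Restart=always
# Pause between restart attempts (assumed value).
RestartSec=10
# The JVM exits with status 143 after SIGTERM; treat that as a clean stop.
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
```

With a unit like this in place, sudo systemctl daemon-reload followed by sudo systemctl enable --now cassandra would register the service and start it.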

Since we forward the journald logs (journald is part of systemd) to AWS CloudWatch, any restarts are automatically recorded, and we can create alerts and/or remediation actions that handle them. Knowing that your instances are restarting due to bugs, anomalies, or other problems is a very important KPI.
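To make the restart KPI concrete, the sketch below counts restart messages in journald's JSON output and decides whether to raise an alert. It is an illustration only, not Cloudurable's alerting code: the function names, the threshold, and the message substring are assumptions, and the exact restart wording depends on your systemd version.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: substring systemd logs when it schedules an automatic restart.
# The wording differs between systemd versions; adjust it to match your journal.
RESTART_MARKER = "Scheduled restart job"


def count_restarts(journal_lines, now, window=timedelta(hours=1)):
    """Count restart messages logged within `window` before `now`."""
    cutoff = now - window
    restarts = 0
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        # MESSAGE may be a non-string value for binary payloads; skip those.
        if not isinstance(message, str) or RESTART_MARKER not in message:
            continue
        # __REALTIME_TIMESTAMP is microseconds since the Unix epoch, as a string.
        micros = int(entry["__REALTIME_TIMESTAMP"])
        logged_at = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
        if logged_at >= cutoff:
            restarts += 1
    return restarts


def should_alert(journal_lines, now, threshold=3, window=timedelta(hours=1)):
    """Return True when restarts in the window reach the (assumed) threshold."""
    return count_restarts(journal_lines, now, window) >= threshold
```

One way to feed it is the output of journalctl -u cassandra -o json, one JSON object per line, passing datetime.now(timezone.utc) as now. The same counting logic could run against wherever the forwarded logs end up.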

BUT WHY? How is this a benefit if Cassandra already has replication and resilience built in?

Why you need a watchdog

Cassandra is likely at the heart of your microservices' operational data storage needs. You might have a Cassandra keyspace per microservice, and you might have five or six microservices talking to six keyspaces.

What if those microservices start getting 2x or 3x more traffic than normal?
What if there is a denial-of-service attack?
What if there is just a bug that got past the load and regression testing of a new version of a microservice?
Or worse, what if there are just catastrophic anomalies?

Watchdog scenario

Bugs and outages in a distributed system can be subtle and hard to track down. Let’s spell out a scenario. A developer writes a non-optimal query and it goes unnoticed. Then, three months later, another developer on another project writes a Cassandra UDF that is resource-intensive and gets more so as the data set grows. Then a few suboptimal indexes are added to a table to get a team past a tight deadline.

Now Cassandra works fine 99.5% of the time, but occasionally the non-optimal query runs, the Cassandra UDF gets called, and the suboptimal index gets used all at once. When all three things happen at the same time, a Cassandra instance runs out of memory or becomes unresponsive or just crashes.

Now this is no longer a simple system. These microservices have evolved over time, and you have 6 keyspaces, 40 tables, 20 SASI indexes, 6 custom UDFs, and hundreds of queries spread over three code bases. This is why you need something to restart Cassandra so production does not go down while you figure out what this anomaly is. Now one of your nodes is down, and the other Cassandra nodes are taking 50% or 200% more load than normal, and they might also be hit by this anomaly. It is random and hard to track down. This is not really hypothetical. It can and does happen.

Basic-level support gives you non-stop Cassandra, along with the logs and metrics to help you diagnose issues. Enterprise-level support helps you collect key information at just the right time to find these difficult-to-track distributed-system bugs. Those key reports, with the Cloudurable support team behind you, can help you track down and diagnose the key issues. But more importantly, in the meantime, your cluster is still running.

Now people ask: why don’t I just get rid of that EC2 instance, make a snapshot (if possible), and then just spin up a new instance? There may be times when this is what you want to do. Your EC2 instance could have noisy neighbors, or EBS volume performance, which has been reported to decline over time, may have degraded, so you might determine that you need to do this.

However, this takes time, and depending on the size of your data, it could take quite a bit of time. Remember, Cassandra is not stateless. Also, if you have to bring up additional nodes that join without backup data, the existing Cassandra nodes need to populate the new node, putting additional strain on the system. You will have Cassandra nodes taking on more load because instances are going down and nodes are being restored.

The reality is that you want both watchdogs and the ability to spin up new servers, but for sporadic bugs or rare situations, the quickest way to recover is sometimes to just restart the Cassandra process. The restart will produce alerts and logs in CloudWatch, and then you can determine whether you need more drastic remediation. A watchdog system could be the difference between uptime even during adversity and long, sporadic downtime that is hard to diagnose.

More info

Cloudurable™ provides:

Contact us

For more details on subscription support or pricing, please contact us, call (415) 758-1113, or write to info@cloudurable.com.