Product Roadmap

Phase 1 (minimum viable product) (March 1st)

Run Cassandra as a systemd service with simple restart as a basic watchdog (DONE)
OS monitoring to AWS CloudWatch Metrics
status: code written, not yet configured as a systemd service in the AMI
OS log forwarding to AWS CloudWatch Logs
status: code written, not yet configured in the AMI
Cassandra log forwarding via Logstash (UDP)
status: prototype written, code not finished, log config not yet added to the AMI
Cassandra KPI forwarding to AWS CloudWatch Metrics
status: not started; similar to the OS monitoring to AWS CloudWatch Metrics work
AMI creation (partially done)
status: parts done, still needs to be prepared for the MVP release
Set up AWS VPC with bastion, configure SSH/Ansible to run via the bastion to the Cassandra nodes (tons of progress)
Cassandra nodes are in a private subnet of the VPC; the bastion is in a public subnet of the same VPC.

Phase 2 (enterprise) (May 1st)

Ansible scripts to upgrade from Phase 1 to Phase 2 (rolling restart, clean shutdown, possibly an S3 backup)
Enterprise Watchdog initial release
Monitor levels (threads, memory, disk, work queue size)
Trigger actions (restart Cassandra, dump thread stack, heap dump, store in S3)
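
A minimal sketch of how monitor levels could map to trigger actions. The names Rule and evaluate, the metric keys, and the thresholds are all hypothetical and only illustrate the idea; real actions (restart Cassandra, dump thread stack, heap dump, store in S3) would be plugged in as callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One monitored level and the action to run when it is exceeded (hypothetical shape)."""
    metric: str                            # e.g. "threads", "disk_used_pct"
    threshold: float                       # level above which the rule fires
    action: Callable[[str, float], None]   # e.g. restart, thread dump, heap dump


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Run the action of every rule whose metric exceeds its threshold.

    Returns the names of the metrics that fired, in rule order.
    Metrics missing from the sample are skipped.
    """
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            rule.action(rule.metric, value)
            fired.append(rule.metric)
    return fired
```

Keeping actions as plain callables means the same rule table could drive a restart, a dump, or an upload without the evaluation loop knowing which.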

Product vision

Basic version full vision

OS metrics sent to CloudWatch Metrics
Cassandra KPIs sent to CloudWatch Metrics
OS journald logs sent to CloudWatch Logs
Cassandra logs sent to CloudWatch Logs
systemd used to restart Cassandra if it dies

Enterprise version full vision

Everything in the basic version
Enterprise Watchdog

Enterprise Watchdog full vision

System health monitoring KPIs (may replace Cassandra KPI forwarding and OS monitoring to avoid duplicated effort)
Store data points per interval (hour), using Cassandra time series features
Keep the last week/month in memory by the hour plus standard deviation, along with the last 20 data points (to look for trends)
Anomaly detection will look for trends and anomalies
An anomaly will issue alerts or trigger actions (thread stack dump and heap dump)
Heartbeat: create a ping (can query a simple table in Cassandra to check that Cassandra is running)
Can monitor levels (threads, memory, disk, work queue size), issue alerts, and perform actions (dump thread stack and/or heap and store in S3)
Trigger actions based on health levels and anomalies
Monitor:
Thread counts
Work queue size
JVM memory used
Actions:
CloudWatch alert
CloudWatch log error
Restart Cassandra (cleanly)
Back up Cassandra
Send AWS SNS message
Send AWS SQS (queue) message
Run Lambda function
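
One possible shape for the anomaly detection over the last 20 data points is sketched below. It assumes a simple z-score test against the rolling mean and standard deviation; the class name AnomalyDetector, the z_limit parameter, and its default of 3.0 are illustrative assumptions, not a decided design.

```python
from collections import deque
from statistics import mean, pstdev


class AnomalyDetector:
    """Flags values that sit far from the rolling mean of recent data points."""

    def __init__(self, window: int = 20, z_limit: float = 3.0):
        self.points = deque(maxlen=window)  # keeps only the last `window` points
        self.z_limit = z_limit              # assumed cutoff, in standard deviations

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous versus the window, then record it."""
        anomalous = False
        if len(self.points) >= 2:
            history = list(self.points)
            mu = mean(history)
            sigma = pstdev(history)
            # With zero spread there is no scale to judge against, so nothing is flagged.
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.z_limit
        self.points.append(value)
        return anomalous
```

A True result would then feed the action list above (alert, dump, restart). Trend detection would need more than this, for example a slope over the hourly values.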

See Kapacitor for the direction we are headed, but with actions.

If we tie the datastore to Cassandra, it will be harder to use with Kafka and other systems. The datastore should be agnostic, but we should use Cassandra by default when monitoring Cassandra. Data should be small and carry TTLs so that it is cleaned up whenever possible. We may consider InfluxDB, LevelDB, etc. for the datastore. We therefore need an abstraction layer, which we can evolve as we add more datastores.
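
A rough sketch of what such an abstraction could look like, assuming a two-method interface with a TTL on every write. MetricStore, InMemoryStore, and their method names are hypothetical; a Cassandra-backed or InfluxDB-backed class would implement the same interface. The in-memory version is here only to show the contract.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple


class MetricStore(ABC):
    """Datastore-agnostic interface the watchdog would code against (assumed shape)."""

    @abstractmethod
    def put(self, metric: str, timestamp: float, value: float, ttl_seconds: float) -> None:
        """Store one data point that expires after ttl_seconds."""

    @abstractmethod
    def query(self, metric: str, start: float, end: float) -> List[Tuple[float, float]]:
        """Return unexpired (timestamp, value) pairs with start <= timestamp <= end."""


class InMemoryStore(MetricStore):
    """Reference backend for tests; expired rows are dropped lazily on query."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._rows: Dict[str, List[Tuple[float, float, float]]] = {}

    def put(self, metric, timestamp, value, ttl_seconds):
        expires_at = self._clock() + ttl_seconds
        self._rows.setdefault(metric, []).append((timestamp, value, expires_at))

    def query(self, metric, start, end):
        now = self._clock()
        live = [row for row in self._rows.get(metric, []) if row[2] > now]
        self._rows[metric] = live  # clean up expired data whenever possible
        return sorted((ts, v) for ts, v, _ in live if start <= ts <= end)
```

A Cassandra backend could map ttl_seconds onto the TTL of the write itself, so cleanup would be handled by the datastore rather than by the watchdog.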