Product Roadmap

Phase 1 (minimum viable product) (March 1st)

  • Run Cassandra as a systemd service (DONE) with simple restart (simple watchdog)
  • OS monitoring to AWS CloudWatch Metrics
      • status: code written, not yet configured as a systemd service in the AMI
  • OS log forwarding to AWS CloudWatch Logs
      • status: code written, not yet configured in the AMI
  • Cassandra log forwarding via logstash UDP
      • status: prototype written, code not done, log config not added to the image
  • Cassandra KPIs forwarding to AWS CloudWatch
      • status: not started; similar to the OS monitoring to AWS CloudWatch Metrics work
  • AMI creation (partially done)
      • Parts done. Need to prepare for MVP release
  • Set up AWS VPC with bastion; configure SSH/Ansible to run via the bastion to the Cassandra nodes (substantial progress)
      • Cassandra nodes in a private subnet in a VPC, bastion in a public subnet in the same VPC.
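
As an illustration of the log forwarding item above, here is a minimal Python sketch of sending one log event as a JSON datagram over UDP. It assumes a logstash UDP input with a JSON codec is listening; the address, port 5959, host name, and field names are all hypothetical, and the real prototype may look quite different.

```python
import json
import socket
from datetime import datetime, timezone


def build_log_event(message, level="INFO", host="cassandra-node-1"):
    """Build one JSON log event as UTF-8 bytes (field names are illustrative)."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "host": host,
        "message": message,
    }
    return json.dumps(event).encode("utf-8")


def send_log_event(payload, address=("127.0.0.1", 5959)):
    """Send one datagram. UDP is fire-and-forget, so delivery is not confirmed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # sendto returns the number of bytes handed to the network stack.
        return sock.sendto(payload, address)


if __name__ == "__main__":
    payload = build_log_event("Compaction finished", level="INFO")
    print(send_log_event(payload), "bytes sent")
```

Because UDP gives no delivery guarantee, this approach trades reliability for low overhead on the Cassandra node.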

Phase 2 (enterprise) (May 1st)

  • Ansible scripts to update from Phase 1 to Phase 2 (rolling restart, clean shutdown, maybe an S3 backup)
  • Enterprise Watchdog initial release
      • Monitor levels (threads, memory, disk, work queue size)
      • Trigger actions (restart Cassandra, dump thread stack, heap dump, store in S3)
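
The "monitor levels, trigger actions" idea can be sketched as a small rule table in Python. This is only an illustration: the metric names, thresholds, and the `Rule` and `evaluate` names are assumptions, and the actions are placeholders that return a description instead of touching Cassandra or S3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str                           # hypothetical metric name, e.g. "threads"
    threshold: float                      # trigger when the sample exceeds this level
    action: Callable[[str, float], str]   # placeholder action


def dump_thread_stack(metric: str, value: float) -> str:
    # Placeholder: a real action might capture a thread dump and store it in S3.
    return f"dump thread stack ({metric}={value})"


def restart_cassandra(metric: str, value: float) -> str:
    # Placeholder: a real action might ask systemd to restart the service.
    return f"restart Cassandra ({metric}={value})"


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Run every rule whose metric is present in the sample and above its threshold."""
    triggered = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            triggered.append(rule.action(rule.metric, value))
    return triggered


RULES = [
    Rule("threads", 500, dump_thread_stack),
    Rule("memory_pct", 90.0, restart_cassandra),
]

if __name__ == "__main__":
    print(evaluate(RULES, {"threads": 620, "memory_pct": 71.5}))
```

Keeping rules as data rather than code paths would make it easier to add disk and work queue levels later without changing the evaluation loop.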

Product vision

Basic version full vision

  • OS metrics sent to CloudWatch Metrics
  • Cassandra KPIs sent to CloudWatch Metrics
  • OS journald logs sent to CloudWatch Logs
  • Cassandra logs sent to CloudWatch Logs
  • systemd used to restart Cassandra if it dies.
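
For the metrics items above, a rough Python sketch of collecting one OS metric and shaping it as a CloudWatch metric datum follows. The metric name, namespace, and instance ID are made up for illustration. The dictionary layout follows the `MetricData` entries accepted by boto3's `put_metric_data`, but the send itself is left as a comment so the sketch runs without AWS credentials.

```python
import shutil
from datetime import datetime, timezone


def disk_used_percent(path="/"):
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def build_metric_datum(name, value, unit, instance_id):
    """Shape one data point like an entry of CloudWatch's MetricData list."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


# Hypothetical metric name and instance ID.
datum = build_metric_datum(
    "DiskUsedPercent", disk_used_percent("/"), "Percent", "i-0123456789abcdef0"
)

# With boto3 installed and credentials configured, the datum could be sent as:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Cassandra/OS", MetricData=[datum])
# ("Cassandra/OS" is an assumed namespace, not one defined by this roadmap.)
```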

Enterprise version full vision

  • Everything in basic support
  • Enterprise Watchdog

Enterprise Watchdog full vision

  • Health system monitoring KPIs (may replace Cassandra KPI forwarding and OS monitoring to avoid duplicated effort)
  • Store data points per interval (hourly), using Cassandra's time series support features
  • Keep the last week/month in memory by the hour, plus the standard deviation, and the last 20 data points (to look for trends)
  • Anomaly detection will look for trends and anomalies
  • An anomaly will issue alerts or perform actions (stack and memory dump)
  • Heartbeat: create a ping (can ping a simple table in Cassandra to see if Cassandra is running)
  • Can monitor levels (threads, memory, disk, work queue size), issue alerts, and perform actions (dump thread stack and/or heap and store in S3)
  • Trigger actions based on health levels and anomalies
  • Monitor:
      • Thread counts
      • Work queue size
      • JVM memory used
  • Actions:
      • CloudWatch alert
      • CloudWatch log error
      • Restart Cassandra (cleanly)
      • Back up Cassandra
      • Send AWS SNS message
      • Send AWS SQS message
      • Run Lambda function
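
The "last 20 data points plus stdev" idea could look roughly like the Python sketch below. It is one possible reading, not a settled design: the window size of 20 comes from the list above, while the z-score threshold of 3.0, the half-window trend measure, and the `MetricWindow` name are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class MetricWindow:
    """Keeps the most recent data points and flags values far from their mean."""

    def __init__(self, size=20, z_threshold=3.0):
        self.points = deque(maxlen=size)   # oldest points fall off automatically
        self.z_threshold = z_threshold

    def is_anomaly(self, value):
        """True when value is more than z_threshold stdevs from the window mean."""
        if len(self.points) < 2:
            return False                   # not enough history to judge
        history = list(self.points)
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            return value != mu             # flat history: any change stands out
        return abs(value - mu) / sigma > self.z_threshold

    def add(self, value):
        """Check value against the current window, then record it."""
        anomalous = self.is_anomaly(value)
        self.points.append(value)
        return anomalous

    def trend(self):
        """Mean of the newer half minus mean of the older half of the window."""
        if len(self.points) < 4:
            return 0.0
        history = list(self.points)
        half = len(history) // 2
        return mean(history[half:]) - mean(history[:half])


if __name__ == "__main__":
    window = MetricWindow()
    for thread_count in [100, 102, 101, 99, 100, 101, 98, 102, 150]:
        if window.add(thread_count):
            print("anomaly:", thread_count)
```

A positive `trend()` suggests the metric is drifting upward even when no single point is an outlier, which is the kind of signal that could drive an alert before a hard level is crossed.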

See Kapacitor for the direction we are headed, but with actions.

If we tie the datastore to Cassandra, it will be harder to use the watchdog with Kafka and other systems. The datastore should be agnostic, but we should use Cassandra by default when monitoring Cassandra. Data should be small, with TTLs so that it is cleaned up whenever possible. We may consider using InfluxDB, LevelDB, etc. for the datastore. We therefore need an abstraction layer, which we can evolve as we add more datastores.
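
One way such an abstraction layer could start is sketched below in Python. The interface name `DataPointStore`, its two methods, and the in-memory backend are hypothetical. A Cassandra, InfluxDB, or LevelDB backend would implement the same interface, with the TTL mapped onto whatever expiry mechanism that store provides.

```python
import time
from abc import ABC, abstractmethod


class DataPointStore(ABC):
    """Hypothetical datastore-agnostic interface for small, expiring data points."""

    @abstractmethod
    def put(self, metric, timestamp, value, ttl_seconds):
        """Record one data point that may be discarded after ttl_seconds."""

    @abstractmethod
    def query(self, metric, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""


class InMemoryStore(DataPointStore):
    """Reference backend for tests; expired rows are dropped lazily on query."""

    def __init__(self, clock=time.time):
        self._clock = clock    # injectable so tests can control time
        self._rows = {}        # metric -> list of (timestamp, value, expires_at)

    def put(self, metric, timestamp, value, ttl_seconds):
        expires_at = self._clock() + ttl_seconds
        self._rows.setdefault(metric, []).append((timestamp, value, expires_at))

    def query(self, metric, start, end):
        now = self._clock()
        live = [row for row in self._rows.get(metric, []) if row[2] > now]
        self._rows[metric] = live          # forget anything past its TTL
        return [(ts, value) for ts, value, _ in live if start <= ts <= end]


if __name__ == "__main__":
    store = InMemoryStore()
    store.put("threads", time.time(), 412, ttl_seconds=3600)
    print(store.query("threads", 0, time.time() + 1))
```

Starting with only `put` and `query` keeps the interface small enough that each new backend is cheap to add, and it can grow as real needs appear.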