Configuring MetricsD to set up a disk alarm

April 7, 2017

                                                                           

What is MetricsD?

MetricsD is a Go program that gathers metrics from an AWS EC2 instance and reports them to destinations such as AWS / CloudWatch. Metrics collected include disk space, CPU activity, memory allocation, and Cassandra KPIs. MetricsD is most often run as a systemd service.

Disk Gatherer reports to AWS / CloudWatch, sets alarms or sends emails.

The Disk Gatherer reports disk state information to AWS / CloudWatch, sets alarms in AWS / CloudWatch or sends emails. It leverages the df command.

The Disk Gatherer can collect several pieces of information, such as total bytes on disk, number of bytes used, and percent of bytes used. For the purposes of the alarm, all we care about is the percentage of used disk space, so the fields to gather must include at least usedpct. If the configuration does not list any fields, the default is usedpct, which is exactly what we need.
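As a rough illustration of how the total, used, available and usedpct fields relate, here is a minimal Go sketch that parses one data row of `df -k` output. It is not MetricsD's actual code: the `diskStat` type, the `parseDfRow` function and the sample row are hypothetical, and the formula used / total * 100 for the "calculated" usedpct is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// diskStat holds the columns from one data row of `df -k` output.
// The type and field names are hypothetical, not taken from MetricsD.
type diskStat struct {
	Filesystem string
	Mount      string
	Total      uint64 // 1K blocks on the disk
	Used       uint64 // 1K blocks used
	Available  uint64 // 1K blocks available
}

// parseDfRow parses a row laid out as:
//   Filesystem 1K-blocks Used Available Use% Mounted-on
// It assumes the filesystem name contains no spaces.
func parseDfRow(row string) (diskStat, error) {
	f := strings.Fields(row)
	if len(f) < 6 {
		return diskStat{}, fmt.Errorf("unexpected df row: %q", row)
	}
	var nums [3]uint64
	for i := range nums {
		n, err := strconv.ParseUint(f[i+1], 10, 64)
		if err != nil {
			return diskStat{}, fmt.Errorf("column %d of %q: %w", i+2, row, err)
		}
		nums[i] = n
	}
	return diskStat{
		Filesystem: f[0],
		Mount:      strings.Join(f[5:], " "),
		Total:      nums[0],
		Used:       nums[1],
		Available:  nums[2],
	}, nil
}

// usedPct computes the percentage of the disk in use.
// Assumption: the "calculated" usedpct is used / total * 100.
func (d diskStat) usedPct() float64 {
	if d.Total == 0 {
		return 0
	}
	return float64(d.Used) / float64(d.Total) * 100
}

func main() {
	// Made-up sample row for illustration only.
	d, err := parseDfRow("/dev/sda5 1000000 769000 231000 77% /data")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s mounted on %s: usedpct=%.2f\n", d.Filesystem, d.Mount, d.usedPct())
}
```

The header row of `df` fails the numeric parse and is returned as an error, so a caller can simply skip rows that do not parse.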

We must also set up the threshold. If the percentage used is greater than the threshold, an alarm is triggered.

# ------------------------------------------------------------
# fields []string
#     what fields to output
#     fields: total        - number of 1K bytes on the disk
#             used         - number of 1K bytes used on the disk
#             available    - number of 1K bytes available on the disk
#             usedpct      - percentage of bytes used on the disk (calculated)
#             availablepct - percentage of bytes available on the disk (calculated)
#             capacitypct  - percentage of bytes available on the disk (reported)
#             mount        - where the file system is mounted
#     default: ["usedpct"]
#
# alarm_threshold int
#     if used percent is more than this number, send an alarm
#     value <= 0 or > 100 mean never alarm
#     default: 101
# ------------------------------------------------------------
disk {
    alarm_threshold = 75
}
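The threshold rule described in the config comments can be sketched in Go as follows. This is only an illustration of the documented behavior (alarm when the used percentage is more than the threshold; a threshold <= 0 or > 100 never alarms). The function name `shouldAlarm` is hypothetical and not part of MetricsD.

```go
package main

import "fmt"

// shouldAlarm reports whether a disk alarm should fire.
// A threshold <= 0 or > 100 means "never alarm", matching the
// config comments; otherwise the alarm fires only when usedPct
// is strictly greater than the threshold.
func shouldAlarm(usedPct float64, threshold int) bool {
	if threshold <= 0 || threshold > 100 {
		return false
	}
	return usedPct > float64(threshold)
}

func main() {
	fmt.Println(shouldAlarm(76.9, 75))  // over the threshold
	fmt.Println(shouldAlarm(75.0, 75))  // equal is not "more than"
	fmt.Println(shouldAlarm(99.9, 101)) // default threshold of 101 never alarms
}
```

With the default of 101 the alarm is effectively disabled, which is why the example config sets alarm_threshold explicitly.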

Alarmers

When a metric is gathered, some bits are set to indicate whether the metric should be handled by the alarmers. There are currently two types of alarmer: aws and email. In the general section of the config, set alarmers to the ones you want active.

# ------------------------------------------------------------
# alarmers []string
#     aws email
# ------------------------------------------------------------
alarmers = ["aws", "email"]

The name of the alarm is built from the namespace, which is configured in the general settings, the metric name, and the ec2_instance_id and region, which are automatically determined from the node itself. The name shows up in the AWS CloudWatch console and in the subject of an email.

ALARM: <namespace>.<ec2_instance_id>.<metric_name> in <region>
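A small Go sketch of how a name in this format could be assembled. The function names `alarmName` and `alarmSubject` are hypothetical, and the sample values are the ones that appear in the sample emails in this article.

```go
package main

import "fmt"

// alarmName joins the namespace, EC2 instance id and metric name
// with dots, following the <namespace>.<ec2_instance_id>.<metric_name>
// pattern. The function name is hypothetical.
func alarmName(namespace, instanceID, metricName string) string {
	return fmt.Sprintf("%s.%s.%s", namespace, instanceID, metricName)
}

// alarmSubject formats the full line, including the region.
func alarmSubject(namespace, instanceID, metricName, region string) string {
	return fmt.Sprintf("ALARM: %s in %s",
		alarmName(namespace, instanceID, metricName), region)
}

func main() {
	fmt.Println(alarmSubject("NsScott", "Ec2IId-999", "diskUsedPct:/dev/sda5", "US-East-2"))
}
```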

AWS Alarmer

The AWS alarmer will build an alarm in CloudWatch. If you provide the ARN of an SNS topic, the alarm publishes to that topic when it fires and the topic's subscribers are notified. Otherwise the alarm is only visible in the console.

# ------------------------------------------------------------
# aws alarmer settings
#     the alarmer depends on aws{} being configured correctly
# ------------------------------------------------------------
# disk_alarm_arns string
#     the aws arns value of who to notify for a disk alarm
#     not required, cloudwatch alarm is not required to notify anyone
# ------------------------------------------------------------
aws_alarmer {
    disk_alarm_arns = "arn:aws:sns:us-east-2:111122223333:DSKALRM"
}

Email Alarmer

The Email alarmer sends an email when it gets the alarm and then keeps resending it until the problem is fixed, unless you tell it not to resend by setting resend_interval_seconds to -1.

# ------------------------------------------------------------
# email alarmer settings
#     the alarmer depends on smtp{} being configured correctly
# ------------------------------------------------------------
# resend_interval_seconds int
#     how often to RE-SEND an alarm email once the first has been sent
#     -1 for never resend
#     not supplied or 0 defaults to 300 seconds (5 minutes)
#     3600 (1 hour) is the maximum interval
#
# disk_alarm_tos []string
#     list of email addresses to send to
# ------------------------------------------------------------
email_alarmer {
    resend_interval_seconds = 300
    disk_alarm_tos = ["my.app.disk.alarm@mycompany.com", "cto@mycompany.com"]
}
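The resend rules in the comments above (-1 never resends, 0 or missing defaults to 300 seconds, 3600 seconds is the maximum) can be sketched in Go like this. The function name `resendInterval` is hypothetical. Treating every negative value like -1 and clamping values above 3600 down to the maximum are assumptions; the comments do not say how MetricsD handles those inputs.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultResend = 300 * time.Second  // used when the setting is 0 or not supplied
	maxResend     = 3600 * time.Second // documented maximum interval
)

// resendInterval converts the configured resend_interval_seconds
// into a duration. The boolean is false when alarms should never
// be resent.
func resendInterval(seconds int) (time.Duration, bool) {
	switch {
	case seconds < 0:
		// -1 is the documented "never resend" value; treating other
		// negative values the same way is an assumption.
		return 0, false
	case seconds == 0:
		return defaultResend, true
	case time.Duration(seconds)*time.Second > maxResend:
		// Clamping to the maximum is an assumption.
		return maxResend, true
	default:
		return time.Duration(seconds) * time.Second, true
	}
}

func main() {
	for _, s := range []int{-1, 0, 300, 7200} {
		d, resend := resendInterval(s)
		fmt.Printf("configured=%d resend=%t interval=%s\n", s, resend, d)
	}
}
```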


Sample email sent from AWS SNS

Subject: ALARM: "NsScott.Ec2IId-999.diskUsedPct:/dev/sda5" in US-East-2
From: AWS Notifications <no-reply@sns.amazonaws.com> 2:36 PM
To: scott.fauerbach

You are receiving this email because your Amazon CloudWatch Alarm "NsScott.Ec2IId-999.diskUsedPct:/dev/sda5" in the US-East-2 region has entered the ALARM state, because "Threshold Crossed: 1 datapoint (76.89999999999999) was greater than the threshold (75.0)." at "Friday 07 April, 2017 18:36:31 UTC".

View this alarm in the AWS Management Console:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-2#s=Alarms&alarm=NsScott.Ec2IId-999.diskUsedPct%3A%2Fdev%2Fsda5

Alarm Details:
- Name:                       NsScott.Ec2IId-999.diskUsedPct:/dev/sda5
- Description:                Disk Used Pct Alarm
- State Change:               OK -> ALARM
- Reason for State Change:    Threshold Crossed: 1 datapoint (76.89999999999999) was greater than the threshold (75.0).
- Timestamp:                  Friday 07 April, 2017 18:36:31 UTC
- AWS Account:                888877776666

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanThreshold 75.0 for 300 seconds.

Monitored Metric:
- MetricNamespace:                     NsScott
- MetricName:                          diskUsedPct:/dev/sda5
- Dimensions:                          [InstanceId = Ec2IId-999] [Role = dev-node] [Environment = dev] [Provider = disk]
- Period:                              300 seconds
- Statistic:                           Average
- Unit:                                not specified



State Change Actions:
- OK:
- ALARM: [arn:aws:sns:us-east-2:821683928919:OPS]
- INSUFFICIENT_DATA:


--
If you wish to stop receiving notifications from this topic, please click or visit the link below to unsubscribe:
https://sns.us-east-2.amazonaws.com/unsubscribe.html?SubscriptionArn=arn:aws:sns:us-east-2:821683928919:OPS#######

Please do not reply directly to this email. If you have any questions or comments regarding this email, please contact us at https://aws.amazon.com/support


Sample email sent directly from MetricsD

Subject: ALARM for diskUsedPct:/dev/sda5 on Ec2IId-999 in US-East-2
From: no-reply@mycompany.com 2:36 PM
To: scott.fauerbach

You are receiving this email because your MetricsD "Disk Used Pct Alarm" diskUsedPct:/dev/sda5 on Ec2IId-999 in US-East-2 is over the threshold (75) at Wed, 12 Apr 2017 14:29:53 EDT.

Instance Details:
- Namespace: NsScott
- Role: dev-node
- Environment: dev
- EC2 Instance Id: Ec2IId-999

Metric Details:
- Description: Disk Used Pct Alarm
- Metric Name: diskUsedPct:/dev/sda5
- Value: 76.90
- Threshold: 75
- Timestamp: Wed, 12 Apr 2017 14:36:31 EDT

Please do not reply directly to this email.
