Keeping an eye on GCP infrastructure resources is essential for your applications to run smoothly. The DevOps team should be notified when applications or services go down or become inaccessible, for example when compute instances cross a defined threshold, K8s pods crash, or the network goes down in a region. Alerts delivered to configured notification channels let DevOps teams act quickly to resolve issues and keep your services up and running.
This is where GCP monitoring and alerts come to the rescue, and Terraform allows us to manage these resources as code (IaC).
GCP Monitoring and Alerts
GCP monitoring provides a complete solution to collect and analyze the metrics of GCP resources and visualize them as dashboards. GCP also lets you monitor your applications' availability using uptime checks. We can set up alerts that fire when user-defined criteria for resource state or utilization are met, and notifications are then sent to configured notification channels such as email or Slack.
Why Terraform
We can create infrastructure manually using the GCP console. If the infrastructure is small and serves a single environment, such as Dev or Prod only, then manual creation is fine.
Glossary
Dashboards
Graphical visualizations of GCP resource metrics.
Alerts
Message raised by GCP monitoring when resource metrics match certain criteria.
Notification Channel
Communication channels, such as email or Slack, to which GCP sends alert notifications.
Uptime Check
Checks application availability by verifying that the application responds to a specific API or health check call.
Infrastructure as Code
Allows us to create and manage cloud infrastructure/resources using code.
Diagram
Fig – Infrastructure creation using Terraform.
Configuring monitoring and alerts for a few resources in a single environment may not take much time. But what if we need to configure them repeatedly across multiple environments like dev, prod, staging, and qa? That consumes a large amount of time and energy. This is where Terraform comes into the picture: it automates infrastructure creation so that we can create, modify, and destroy resources quickly.
Terraform In Short
Terraform is an infrastructure as code (IaC) tool developed by HashiCorp. It allows us to provision and manage cloud infrastructure.
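As a minimal sketch of where a Terraform configuration for GCP usually starts, the block below declares the Google provider and points it at a project. The variable name `project_id` matches the one used in the snippets later in this article; the `region` value is an assumption for illustration only.

```hcl
# Minimal provider setup (a sketch; the region is an assumed value).
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

variable "project_id" {
  type        = string
  description = "ID of the GCP project that holds the monitored resources"
}

provider "google" {
  project = var.project_id
  region  = "us-central1" # assumption: adjust to your own region
}
```

With this in place, `terraform init` downloads the provider and `terraform plan` shows what would be created before anything is applied.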
Terraform Code
Below are the Terraform code snippets: the main and variable files, and the modules for notification channels, uptime checks, and alerts for uptime checks. To have something to monitor and alert on, we created a simple Python app using a Terraform startup script, as seen below.
VM startup script
"sudo apt-get update; sudo apt-get install -yq build-essential python3-pip rsync; pip install flask; echo -e \"from flask import Flask \napp = Flask(__name__)\n@app.route('/')\ndef hello_cloud():\n\treturn 'Hello Cloud'\n\napp.run(host='0.0.0.0')\" > app.py; python3 app.py;"
Python “Hello Cloud” App.
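The article does not show how the startup script is attached to the VM. One way this might look is sketched below, using the `metadata_startup_script` argument of `google_compute_instance`. The resource name, machine type, image, and the `startup_script` variable are assumptions; the zone and the `instance_id` output are taken from the dashboard snippet later in this article, which reads `module.flask_app_vm.instance_id`.

```hcl
# Sketch only: name, machine type, image and the startup_script variable are assumed.
variable "startup_script" {
  type        = string
  description = "Shell commands to run at boot, such as the script shown above"
}

resource "google_compute_instance" "flask_app_vm" {
  name         = "flask-app-vm"
  machine_type = "e2-micro"
  zone         = "us-central1-c"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    network = "default"
    access_config {} # ephemeral external IP so the uptime checks can reach the app
  }

  # Executed at boot; installs Flask and starts the app on Flask's default port 5000.
  metadata_startup_script = var.startup_script
}

# Exposed so other modules (for example the dashboard) can filter by instance.
output "instance_id" {
  value = google_compute_instance.flask_app_vm.instance_id
}
```

Depending on the network setup, a firewall rule allowing inbound traffic on port 5000 may also be needed before the uptime checks can succeed.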
Monitoring Dashboard
module>monitoring>dashboard>dashboard.tf
resource "google_monitoring_dashboard" "dashboard" {
  dashboard_json = var.dash_json
}
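The module reads `var.dash_json`, so it needs a matching variable declaration. A minimal sketch of that declaration, assuming it lives in a `variables.tf` file next to `dashboard.tf` (the file name and description are assumptions):

```hcl
# Hypothetical variables.tf for the dashboard module.
variable "dash_json" {
  type        = string
  description = "Dashboard definition as a JSON-encoded string"
}
```

Because `jsonencode` returns a string, the caller can build the dashboard as a native Terraform object and still satisfy this `string` type.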
environment>dev>main.tf
module "flask_app_dashboard" {
  source = "../../module/monitoring/dashboard"
  dash_json = jsonencode({
    "displayName": "Flask App VM Dashboard",
    "dashboardFilters": [],
    "mosaicLayout": {
      "columns": 48,
      "tiles": [
        {
          "width": 24,
          "height": 16,
          "widget": {
            "title": "VM Instance - CPU utilization [MEAN]",
            "xyChart": {
              "chartOptions": {
                "mode": "COLOR"
              },
              "dataSets": [
                {
                  "breakdowns": [],
                  "dimensions": [],
                  "measures": [],
                  "minAlignmentPeriod": "60s",
                  "plotType": "LINE",
                  "targetAxis": "Y1",
                  "timeSeriesQuery": {
                    "timeSeriesFilter": {
                      "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN"
                      },
                      "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\""
                    }
                  }
                }
              ],
              "thresholds": [],
              "yAxis": {
                "label": "",
                "scale": "LINEAR"
              }
            }
          }
        },
        {
          "xPos": 24,
          "width": 24,
          "height": 16,
          "widget": {
            "title": "Flask App logs panel",
            "logsPanel": {
              "filter": "resource.type=\"gce_instance\" resource.labels.instance_id=\"${module.flask_app_vm.instance_id}\" resource.labels.zone=\"us-central1-c\"\n",
              "resourceNames": ["projects/1055175960331"]
            }
          }
        }
      ]
    },
    "labels": {}
  })
}
terraform apply -target module.flask_app_dashboard.google_monitoring_dashboard.dashboard -var-file dev.tfvars
Generated Dashboard in GCP Console
Uptime Checks – TCP and HTTP
module>monitoring>uptime-check>http>http-uptime-check.tf
resource "google_monitoring_uptime_check_config" "http-uptime-check" {
  for_each         = local.flat_hosts
  display_name     = "${each.value.hostname}-http-uptime-check"
  timeout          = "60s"
  selected_regions = ["ASIA_PACIFIC", "USA", "EUROPE"]
  http_check {
    path         = each.value.path
    port         = each.value.port
    use_ssl      = each.value.use_ssl
    validate_ssl = each.value.validate_ssl
    accepted_response_status_codes {
      status_class = "STATUS_CLASS_2XX"
    }
    accepted_response_status_codes {
      status_value = 301
    }
    accepted_response_status_codes {
      status_value = 302
    }
  }
  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = each.value.hostname
    }
  }
}
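The resource above iterates over `local.flat_hosts`, which is not shown in this article. One way it might be defined, assuming the module receives the `http_uptime_hosts` list passed from `main.tf`, is sketched below. The attribute types, the map key format, and the output expression are assumptions; only the variable, local, and output names come from the snippets in this article.

```hcl
# Hypothetical supporting declarations for the HTTP uptime-check module.
variable "project_id" {
  type = string
}

variable "http_uptime_hosts" {
  # String values such as "5000" or "false" in tfvars are converted automatically.
  type = list(object({
    hostname     = string
    path         = string
    port         = number
    use_ssl      = bool
    validate_ssl = bool
  }))
}

locals {
  # for_each needs a map (or a set of strings), so key each host by "hostname:port".
  flat_hosts = {
    for h in var.http_uptime_hosts : "${h.hostname}:${h.port}" => h
  }
}

# Full resource IDs of the form projects/<project>/uptimeCheckConfigs/<check_id>.
output "http_uptime_check_ids" {
  value = [for c in google_monitoring_uptime_check_config.http-uptime-check : c.id]
}
```

Keying the map by host and port keeps each check's address stable in the Terraform state, so adding or removing one host does not force the others to be recreated.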
environment>dev>main.tf
#TCP UPTIME CHECK
module "flask_app_tcp_uptime_check" {
  source           = "../../module/monitoring/uptime-check/tcp"
  tcp_uptime_hosts = var.flask_app_tcp_uptime_check_hosts
  project_id       = var.project_id
}
#HTTP UPTIME CHECK
module "flask_app_http_uptime_check" {
  source            = "../../module/monitoring/uptime-check/http"
  http_uptime_hosts = var.flask_app_http_uptime_check_hosts
  project_id        = var.project_id
}
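The TCP module referenced above is not listed in this article. A rough sketch of what it could contain, mirroring the HTTP module and using the provider's `tcp_check` block, follows. The resource name `tcp-uptime-check` and the output name `tcp_uptime_check_ids` appear elsewhere in this article; the variable types and the inline `for_each` expression are assumptions.

```hcl
# Hypothetical contents of module/monitoring/uptime-check/tcp.
variable "project_id" {
  type = string
}

variable "tcp_uptime_hosts" {
  type = list(object({
    hostname = string
    port     = number
  }))
}

resource "google_monitoring_uptime_check_config" "tcp-uptime-check" {
  # One check per host, keyed by "hostname:port".
  for_each         = { for h in var.tcp_uptime_hosts : "${h.hostname}:${h.port}" => h }
  display_name     = "${each.value.hostname}-tcp-uptime-check"
  timeout          = "60s"
  selected_regions = ["ASIA_PACIFIC", "USA", "EUROPE"]

  # A TCP check only verifies that a connection to the port can be opened.
  tcp_check {
    port = each.value.port
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = each.value.hostname
    }
  }
}

output "tcp_uptime_check_ids" {
  value = [for c in google_monitoring_uptime_check_config.tcp-uptime-check : c.id]
}
```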
Environment variables
Here we can set uptime checks for multiple targets by adding more hosts to the list in the tfvars file.
Generally, we don't need both TCP and HTTP uptime checks for the same application; we have created both here just as an example.
environment>dev>dev.tfvars
#TCP UPTIME CHECK
flask_app_tcp_uptime_check_hosts = [{
  hostname = "35.209.69.34"
  port     = "5000"
}]
#HTTP UPTIME CHECK
flask_app_http_uptime_check_hosts = [{
  hostname     = "35.209.69.34"
  path         = "/"
  port         = "5000"
  use_ssl      = "false"
  validate_ssl = "false"
}]
terraform apply -target module.flask_app_http_uptime_check.google_monitoring_uptime_check_config.http-uptime-check -var-file dev.tfvars
terraform apply -target module.flask_app_tcp_uptime_check.google_monitoring_uptime_check_config.tcp-uptime-check -var-file dev.tfvars
Created uptime checks in GCP console
Notification Channels
module>monitoring>notification-channel>email.tf
resource "google_monitoring_notification_channel" "email" {
  display_name = var.email_channel_display_name
  type         = "email"
  labels = {
    email_address = var.notification_email
  }
}
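For the alert policies to reference this channel, the module has to expose its identifier. A minimal sketch of the module's variables and output is shown below, assuming they sit in the same module directory. The output name `notification_channel_id` is the one consumed in `main.tf`; the descriptions and the choice of the `name` attribute are assumptions.

```hcl
# Hypothetical variables and output for the notification-channel module.
variable "email_channel_display_name" {
  type        = string
  description = "Display name shown for the channel in the GCP console"
}

variable "notification_email" {
  type        = string
  description = "Address that receives alert notifications"
}

# `name` is the full resource name:
# projects/<project>/notificationChannels/<channel_id>,
# which is the format alert policies expect in notification_channels.
output "notification_channel_id" {
  value = google_monitoring_notification_channel.email.name
}
```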
environment>dev>main.tf
#NOTIFICATION CHANNEL
module "notification_channel_email" {
  source                     = "../../module/monitoring/notification-channel"
  notification_email         = var.notification_email
  email_channel_display_name = var.notification_email_desc
}
environment>dev>dev.tfvars
#NOTIFICATION CHANNEL
notification_email      = "alerts@yourdomain.com"
notification_email_desc = "Notification email to receive uptime check alerts"
terraform apply -target module.notification_channel_email.google_monitoring_notification_channel.email -var-file dev.tfvars
Created Notification Channels in GCP console
Alerts for Uptime checks
module>monitoring>alerts>alert-policy-uptime-check.tf
resource "google_monitoring_alert_policy" "alert-policy-uptime-check" {
  count        = length(var.uptime_check_ids)
  project      = var.project_id
  enabled      = true
  display_name = "Uptime check alert policy for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"
  documentation {
    content = "Uptime check failed for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"
  }
  notification_channels = [var.notification_channel]
  combiner              = "OR"
  conditions {
    display_name = "Uptime check for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"
    condition_threshold {
      filter = <<-EOT
        metric.type="monitoring.googleapis.com/uptime_check/check_passed" AND metric.label.check_id="${element(split("/", var.uptime_check_ids[count.index]), 3)}" AND resource.type="uptime_url"
      EOT
      duration        = "0s"
      threshold_value = "1"
      comparison      = "COMPARISON_GT"
      aggregations {
        alignment_period     = "1200s"
        cross_series_reducer = "REDUCE_COUNT_FALSE"
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        group_by_fields      = ["resource.label.project_id", "resource.label.host"]
      }
      trigger {
        count = "1"
      }
    }
  }
  user_labels = {
    severity = "critical"
  }
  alert_strategy {
    auto_close = "604800s"
  }
}
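The policy above relies on three module inputs and repeatedly extracts the check ID from the full uptime check resource ID. The sketch below shows how those inputs might be declared and how the extraction works on a sample value. The variable names come from the resource above; the types, the `null` default, and the sample ID are assumptions for illustration.

```hcl
# Hypothetical variables.tf for the alerts module.
variable "project_id" {
  type    = string
  default = null # when not passed, the provider's configured project is used
}

variable "uptime_check_ids" {
  type        = list(string)
  description = "Full IDs: projects/<project>/uptimeCheckConfigs/<check_id>"
}

variable "notification_channel" {
  type        = string
  description = "Full resource name of the notification channel"
}

locals {
  # Made-up sample value, only to illustrate the extraction.
  sample_id = "projects/my-project/uptimeCheckConfigs/flask-app-check"

  # split("/") yields ["projects", "my-project", "uptimeCheckConfigs", "flask-app-check"],
  # so index 3 is the bare check ID: "flask-app-check".
  sample_check_id = element(split("/", local.sample_id), 3)
}
```

With `REDUCE_COUNT_FALSE`, `COMPARISON_GT`, and a threshold of 1, the condition is met when more than one checker location reports a failed check within the alignment period, which avoids alerting on a single-region blip.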
environment>dev>main.tf
#ALERTS - TCP
module "flask_app_tcp_email_alerts" {
  source               = "../../module/monitoring/alerts"
  depends_on           = [module.flask_app_tcp_uptime_check, module.notification_channel_email]
  project_id           = var.project_id
  uptime_check_ids     = module.flask_app_tcp_uptime_check.tcp_uptime_check_ids
  notification_channel = module.notification_channel_email.notification_channel_id
}
#ALERTS - HTTP
module "flask_app_http_email_alerts" {
  source               = "../../module/monitoring/alerts"
  depends_on           = [module.flask_app_http_uptime_check, module.notification_channel_email]
  project_id           = var.project_id
  uptime_check_ids     = module.flask_app_http_uptime_check.http_uptime_check_ids
  notification_channel = module.notification_channel_email.notification_channel_id
}
environment>dev>output.tf
output "tcp_uptime_check_ids" {
  value = module.flask_app_tcp_uptime_check.tcp_uptime_check_ids
}
output "http_uptime_check_ids" {
  value = module.flask_app_http_uptime_check.http_uptime_check_ids
}
output "notification_channel_id" {
  value = module.notification_channel_email.notification_channel_id
}
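terraform apply -target module.flask_app_tcp_email_alerts.google_monitoring_alert_policy.alert-policy-uptime-check -var-file dev.tfvars
terraform apply -target module.flask_app_http_email_alerts.google_monitoring_alert_policy.alert-policy-uptime-check -var-file dev.tfvars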
Created Policies in GCP Console.
Conclusion
Smooth running of apps and services in cloud environments is critical to business success. GCP monitoring provides a robust solution for tracking resource metrics and application availability, which helps the support team act quickly in case of any abnormality. Terraform (IaC) allows DevOps teams to efficiently manage and automate infrastructure, alerts, and monitoring across multiple environments with minimal time and effort.