This set of tutorials is for IT administrators and Operators who want to deploy, run, and manage modern application environments that run on Google Kubernetes Engine (GKE). As you progress through this set of tutorials you learn how to configure monitoring and alerts, scale workloads, and simulate failure, all using the Cymbal Bank sample microservices application.
The Cymbal Bank sample application used in this set of tutorials is made up of a number of microservices that all run in the GKE cluster. Problems with any of these services could result in a bad experience for the bank's customers, such as not being able to access the bank application. Learning about problems with the services as soon as possible means you can quickly start to troubleshoot and resolve the issues.
In this tutorial, you learn how to monitor workloads in a GKE cluster using Google Cloud Managed Service for Prometheus and Cloud Monitoring. You learn how to complete the following tasks:
Create a Slack webhook for Alertmanager.
Configure Prometheus to monitor the status of a sample microservices-based application.
Simulate an outage and review the alerts sent using the Slack webhook.
Enabling GKE and deploying the Cymbal Bank sample application for this series of tutorials means that you incur per-cluster charges for GKE on Google Cloud as listed on our Pricing page until you disable GKE or delete the project.
You are also responsible for other Google Cloud costs incurred while running the Cymbal Bank sample application, such as charges for Compute Engine VMs and Cloud Monitoring.
Before you begin
To learn how to monitor your workloads, you must complete the first tutorial to create a GKE cluster that uses Autopilot and deploy the Cymbal Bank sample microservices-based application.
We recommend that you complete this set of tutorials for scalable apps in order. As you progress through the set of tutorials, you learn new skills and use additional Google Cloud products and services.
To show an example of how a GKE Autopilot cluster can use Google Cloud Managed Service for Prometheus to generate messages to a communications platform, this tutorial uses Slack. In your own production deployments, you can use your organization's preferred communication tool to process and deliver messages when your GKE cluster has an issue.
Join a Slack workspace, either by registering with your email or by using an invitation sent by a Workspace Admin.
Note: If you are not an Admin for your Slack workspace, you might need approval from a Workspace Admin before your app is deployed to your workspace.
An important part of setting up monitoring is making sure that you're notified when actionable events such as outages occur. A common pattern for this is to send notifications to a communication tool such as Slack, which is what you use in this tutorial. Slack provides a webhooks feature that lets external applications, like your production deployments, generate messages. You can use other communication tools in your organization to process and deliver messages when your GKE cluster has an issue.
GKE clusters that use Autopilot include a Google Cloud Managed Service for Prometheus instance. This instance can generate alerts when something happens to your applications. These alerts can then use a Slack webhook to send a message to your Slack workspace so you receive prompt notifications when there's a problem.
To set up Slack notifications based on alerts generated by Prometheus, you must create a Slack application, activate Incoming Webhooks for the application, and install the application to a Slack workspace.
Sign in to Slack using your workspace name and your Slack account credentials.
In Prometheus, Alertmanager processes monitoring events that your deployments generate. Alertmanager can skip duplicate events, group related events, and send notifications, for example through a Slack webhook. This section shows you how to configure Alertmanager to use your new Slack webhook. The next section of the tutorial, Configure Prometheus, covers how to specify which events Alertmanager processes and sends.
To configure Alertmanager to use your Slack webhook, complete the following steps:
Change directories to the Git repository that includes all the sample manifests for Cymbal Bank from the previous tutorial:
cd ~/bank-of-anthos/
If needed, change the directory location to where you previously cloned the repository.
Update the Alertmanager sample YAML manifest with the webhook URL of your Slack application:
sed -i "s@SLACK_WEBHOOK_URL@SLACK_WEBHOOK_URL@g" "extras/prometheus/gmp/alertmanager.yaml"
In the replacement part of the command (the second SLACK_WEBHOOK_URL), substitute the URL of the webhook from the previous section.
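The substitution can be tried safely on a scratch file before you touch the real manifest. The following is a minimal sketch, not the tutorial's own procedure: the webhook URL is a made-up example value, and the two-line file is only a stand-in for the real alertmanager.yaml. It assumes GNU sed, as the command above does.

```shell
# Hypothetical webhook URL used only for illustration; substitute your own.
WEBHOOK_URL="https://hooks.example.com/services/T000/B000/XXXX"

# Work on a scratch copy so the real manifest is untouched in this sketch.
# The file content below is a stand-in, not the actual sample manifest.
workdir="$(mktemp -d)"
manifest="$workdir/alertmanager.yaml"
printf 'slack_configs:\n- api_url: SLACK_WEBHOOK_URL\n' > "$manifest"

# Same substitution as above: "@" is the sed delimiter because the URL contains "/".
sed -i "s@SLACK_WEBHOOK_URL@${WEBHOOK_URL}@g" "$manifest"

# Confirm that no placeholder is left behind.
if grep -q "SLACK_WEBHOOK_URL" "$manifest"; then
  echo "placeholder still present"
else
  echo "placeholder replaced"
fi
```

Running the same grep check against extras/prometheus/gmp/alertmanager.yaml after the real substitution is a quick way to catch a command that was pasted without editing.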
To dynamically use your unique Slack webhook URL without changes to the application code, you can use a Kubernetes Secret. The application code reads the value of this Secret. In more complex applications, this ability lets you change, or rotate, values for security or compliance reasons.
Create a Kubernetes Secret for Alertmanager using the sample YAML manifest that contains the Slack webhook URL:
kubectl create secret generic alertmanager \
-n gmp-public \
--from-file=extras/prometheus/gmp/alertmanager.yaml
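If you want to confirm what was stored, one option is to read the Secret back and decode it. The helper below is a sketch under assumptions: the function name is hypothetical, and it relies on the Secret key being the file's base name, alertmanager.yaml, which is how kubectl names keys created with --from-file.

```shell
# Allow the kubectl binary to be overridden, for example in a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Print the Alertmanager configuration stored in the Secret.
# Secret data is base64-encoded, so decode it before reading.
# The dot in the key name is escaped so jsonpath treats it as part of the key.
show_alertmanager_config() {
  "$KUBECTL" get secret alertmanager -n gmp-public \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode
}
```

Calling show_alertmanager_config prints the decoded configuration, including your webhook URL, so avoid running it where the terminal output is shared or logged.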
Prometheus can use exporters to get metrics from applications without code changes. The Prometheus blackbox exporter lets you probe endpoints over protocols such as HTTP or HTTPS. This exporter works well when you don't want to, or can't, expose the inner workings of your application to Prometheus.
Deploy the Prometheus blackbox exporter to your cluster:
kubectl apply -f extras/prometheus/gmp/blackbox-exporter.yaml
After you have configured Alertmanager to use your Slack webhook, you need to tell Prometheus what to monitor in Cymbal Bank, and what kinds of events you want Alertmanager to notify you about using the Slack webhook.
In the Cymbal Bank sample application that you use in these tutorials, there are various microservices that run in the GKE cluster. One problem you probably want to know about as soon as possible is if one of the Cymbal Bank services has stopped responding normally to requests, potentially meaning your customers can't access the application. You can configure Prometheus to respond to events based on your organization's policies.
Probes
You can configure Prometheus probes for the resources that you want to monitor. These probes can generate alerts based on the response that the probes receive. In the Cymbal Bank sample application, you can use HTTP probes that check for 200-level response codes from the Services. An HTTP 200-level response indicates that the Service is running correctly and can respond to requests. If there's a problem and the probe doesn't receive the expected response, you can define Prometheus rules that generate alerts for Alertmanager to process and perform additional actions.
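The health check that an HTTP probe performs can be approximated by hand. The following sketch is not how the blackbox exporter is implemented; it only illustrates the "200-level response means healthy" rule. Both function names are hypothetical, and the example URL in the usage note is an assumption about how a Service might be reachable from inside the cluster.

```shell
# Succeeds (exit 0) when the given HTTP status code is in the 200-299 range.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch only the HTTP status code for a URL, discarding the response body.
# curl reports 000 when no HTTP response was received at all.
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}
```

For example, from a Pod inside the cluster you might run is_healthy_status "$(http_status http://frontend:80)" && echo healthy. A Service with no running Pods typically produces a connection failure rather than a 2xx code, which this check treats as unhealthy.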
Create some Prometheus probes to monitor the HTTP status of the various microservices of the Cymbal Bank sample application. Review the following sample manifest:
As shown in this manifest file, it's best practice that each PodMonitoring Prometheus liveness probe monitors each Deployment separately.
To create the Prometheus liveness probes, apply the manifest to your cluster:
kubectl apply -f extras/prometheus/gmp/probes.yaml
Prometheus needs to know what you want to do based on the response that the probes you created in the previous steps receive. You define this response using Prometheus rules.
In this tutorial, you create Prometheus rules to generate alerts depending on the response to the liveness probe. Alertmanager then processes the output of these rules to generate notifications using the Slack webhook.
Create rules that generate events based on the response to the liveness probes. Review the following sample manifest:
This manifest describes a PrometheusRule and includes the following fields:
spec.groups.[*].name: the name of the rule group.
spec.groups.[*].interval: how often rules in the group are evaluated.
spec.groups.[*].rules[*].alert: the name of the alert.
spec.groups.[*].rules[*].expr: the PromQL expression to evaluate.
spec.groups.[*].rules[*].for: how long the alert condition must hold before the alert is considered firing.
spec.groups.[*].rules[*].annotations: a list of annotations to add to each alert. This is only valid for alerting rules.
spec.groups.[*].rules[*].labels: the labels to add or overwrite.
To create the rules, apply the manifest to your cluster:
kubectl apply -f extras/prometheus/gmp/rules.yaml
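To make the rule fields more concrete, the following sketch writes an illustrative spec fragment to a scratch file and prints it. It is not the content of rules.yaml. The alert name, severity, and summary are taken from the sample Slack notification later in this tutorial, probe_success is the success metric exposed by the blackbox exporter, and everything else (the group name, the job label contacts-probe, and both durations) is an assumption made for illustration.

```shell
# Write an illustrative rule group to a scratch file for review.
# Assumed values: group name, job label "contacts-probe", interval, and "for".
sketch="$(mktemp)"
cat > "$sketch" <<'EOF'
spec:
  groups:
  - name: uptime                # spec.groups.[*].name
    interval: 30s               # how often the group is evaluated
    rules:
    - alert: ContactsUnavailable
      # probe_success is 0 when the blackbox exporter probe fails.
      expr: probe_success{job="contacts-probe"} == 0
      for: 1m                   # condition must hold this long before firing
      labels:
        severity: warning
      annotations:
        summary: Contacts Service is unavailable
EOF

# Print the sketch so that it can be compared with the real manifest.
cat "$sketch"
```

Comparing this fragment with the manifest in extras/prometheus/gmp/rules.yaml shows which values the sample application actually uses.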
To make sure that your Prometheus probes, rules, and Alertmanager configuration are correct, you should test that alerts and notifications are sent when there's a problem. If you don't test this flow, you might not realize there's an outage of your production services when something goes wrong.
To simulate an outage of one of the microservices, scale the contacts Deployment to zero. With zero instances of the Service, the Cymbal Bank sample application can't read contact information for customers:
kubectl scale deployment contacts --replicas 0
GKE might take up to 5 minutes to scale down the Deployment.
Check the status of the Deployments in your cluster and verify that the contacts Deployment scales down correctly:
kubectl get deployments
In the following example output, the contacts Deployment has successfully scaled down to 0 instances:
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
balancereader        1/1     1            1           17m
blackbox-exporter    1/1     1            1           5m7s
contacts             0/0     0            0           17m
frontend             1/1     1            1           17m
ledgerwriter         1/1     1            1           17m
loadgenerator        1/1     1            1           17m
transactionhistory   1/1     1            1           17m
userservice          1/1     1            1           17m
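If you would rather script this check than read the table by eye, the READY column can be tested with awk. This is a minimal sketch with a hypothetical function name; it assumes the default column layout shown in the example output, where the first column is the Deployment name and the second is READY.

```shell
# Succeeds when the named Deployment reports zero ready replicas (READY is "0/0").
# Reads `kubectl get deployments` output on standard input.
is_scaled_to_zero() {
  awk -v name="$1" '
    $1 == name && $2 == "0/0" { found = 1 }
    END { exit !found }
  '
}
```

A typical use would be kubectl get deployments | is_scaled_to_zero contacts && echo "contacts is scaled down". The function returns a non-zero exit status if the Deployment is missing or still has replicas.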
After the contacts Deployment has scaled down to zero, the Prometheus probe reports an HTTP error code. This HTTP error generates an alert for Alertmanager to then process.
Check your Slack workspace channel for an outage notification message with text similar to the following example:
[FIRING:1] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and it's logs
In a real outage scenario, after you receive the notification in Slack you would start to troubleshoot and restore services. For this tutorial, simulate this process and restore the contacts Deployment by scaling back up the number of replicas:
kubectl scale deployment contacts --replicas 1
It might take up to 5 minutes to scale the Deployment and for the Prometheus probe to receive an HTTP 200 response. You can check the status of the Deployments using the kubectl get deployments command.
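Instead of polling by hand, you can let kubectl wait for the rollout to finish. The wrapper below is a sketch with a hypothetical function name; the 5-minute timeout mirrors the upper bound mentioned above and is otherwise an arbitrary choice.

```shell
# Allow the kubectl binary to be overridden, for example in a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until the contacts Deployment finishes rolling out,
# or give up with a non-zero exit status after 5 minutes.
wait_for_contacts() {
  "$KUBECTL" rollout status deployment/contacts --timeout=5m
}
```

A successful return only means that the Deployment's Pods are ready. The resolution message in Slack can arrive later, because the probe still has to run again and Alertmanager has to process the result.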
When a healthy response to the Prometheus probe is received, Alertmanager clears the event. You should see an alert resolution notification message in your Slack workspace channel similar to the following example:
[RESOLVED] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and it's logs
We recommend that you complete this set of tutorials for Cymbal Bank in order. As you progress through the set of tutorials, you learn new skills and use additional Google Cloud products and services.
If you want to take a break before you move on to the next tutorial and avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the project you created.
To preserve URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.
Learn how to scale your deployments in GKE in the next tutorial.