This section reviews the concept of service-level indicators (SLIs), defines what makes a good or useful SLI, and provides example SLI implementations for selected services. This page is intended for readers who want examples of service-specific SLI implementations.
Introduction to SLIs

The reliability of a service is an abstract notion; what reliability means depends on the service and the needs of its users. A service-level indicator (SLI) is a measure of that reliability, used both to communicate about the reliability of the service and to manage it.
SLIs are measured over a time window. The size of the window typically depends on the decision the information is used to make. For example, you might measure a single SLI over a short window, such as the last hour, to make tactical decisions like alerting, and over a longer window, such as the last 28 days, to make strategic decisions like prioritizing reliability work.
We recommend 28 days as a starting point for measuring your SLI; this value provides a good balance between the strategic and tactical use cases.
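As a minimal sketch of how the window size shapes the measurement, the following Python computes the same SLI over a short and a long trailing window. The function name, the event data, and the window sizes are hypothetical, chosen only to illustrate the tactical-versus-strategic contrast described above:

```python
from datetime import datetime, timedelta

def sli_over_window(events, window_end, window_days=28):
    """Compute an SLI as the percentage of good events in a trailing window.

    `events` is a list of (timestamp, is_good) tuples. A short window
    supports tactical decisions such as alerting; a long window supports
    strategic decisions such as reporting and planning.
    """
    window_start = window_end - timedelta(days=window_days)
    in_window = [good for ts, good in events if window_start <= ts <= window_end]
    if not in_window:
        return None  # no data in the window
    return 100.0 * sum(in_window) / len(in_window)

# Example: hypothetical events where every fifth day has a failure.
now = datetime(2025, 1, 28)
events = [(datetime(2025, 1, d), d % 5 != 0) for d in range(1, 29)]
print(sli_over_window(events, now, window_days=1))   # tactical view: 100.0
print(sli_over_window(events, now, window_days=28))  # strategic view: ~82.1
```

The same underlying data can look healthy over one window and unhealthy over another, which is why the window size must match the decision being made.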
For more information, see the Site Reliability Engineering Workbook.
Properties of a good SLI

We consider "good" SLIs to be those measures that meet the following criteria:
SLIs are good proxy measures for user happiness.
A good SLI correlates strongly with user happiness. You use the SLI as the basis for a service-level objective (SLO), a threshold set on the SLI. You set the SLO so that, when the SLI is within a defined range, most of your users are happy. For this relationship to hold, the SLI must be a good proxy measure for user happiness.
If the SLI is a good proxy for user happiness, then when there is an event that affects user happiness, the SLI changes in some direction. Likewise, when there are no events that affect user happiness, the SLI doesn't change.
SLIs scale monotonically and linearly with user happiness.
A good SLI scales monotonically and linearly with user happiness. If the SLI improves, then user happiness improves. Similarly, if the SLI decreases, then user happiness decreases. The amount of improvement in the value of a good SLI corresponds to the amount of improvement in user happiness.
SLIs produce measurements that range from 0% to 100%.
A good SLI produces a performance measurement that ranges from 0% to 100%: this range is intuitive and easy to work with. For example, SLI performance of 100% means that everything is working, and SLI performance of 0% means that nothing is working.
Having an SLI that ranges from 0% to 100% makes setting an SLO on the SLI easy and clear: assign a percentage target such as 99.9%, and the SLI performance must be at or above that target for the service to meet its SLO.
One way of implementing an SLI that has these properties is to think of the SLI in terms of promises made to your users. By counting the promises that you made and upheld over a time window, you can derive a number that ranges from 0% to 100%. Such SLIs also translate well into error budgets: for a given SLO, your error budget is the number of promises you can fail to uphold over a time window while still meeting your SLO.
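As a minimal sketch of this arithmetic, using hypothetical request counts and a hypothetical function name, the following Python derives the SLI from counted promises and computes the error budget that a given SLO implies:

```python
def error_budget_status(kept, total, slo_target=99.9):
    """Derive SLI performance and remaining error budget from counted promises."""
    sli = 100.0 * kept / total                 # ranges from 0% to 100%
    budget = total * (1 - slo_target / 100.0)  # promises you may fail to uphold
    spent = total - kept                       # promises actually broken
    return sli, budget - spent

# Example: 1,000,000 promises, 400 broken, 99.9% SLO.
# SLI is 99.96%; the budget is 1,000 broken promises, so 600 remain.
sli, remaining = error_budget_status(kept=999_600, total=1_000_000)
print(f"SLI: {sli:.2f}%  remaining error budget: {remaining:.0f} promises")
```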
Examples of promises include returning a 200 status code in response to a customer's request.

SLI specifications and implementations

An SLI specification is what you want to measure. The specification doesn't include the exact technical details of how you are going to measure it. For example, a specification of an SLI for page-loading time might be the proportion of page requests that load within a given latency threshold.
There can be many ways to measure an SLI, each with benefits and trade-offs. The ways of measuring an SLI are its SLI implementations. For example, you might implement the page-loading specification by measuring request latency from your server logs, at a load balancer or other infrastructure component, or with instrumentation code that runs in the user's browser.
Each of these choices involves trade-offs between characteristics such as fidelity to the user experience and how actionable the resulting signal is.
Fidelity to the user experience usually improves when the SLI is measured closer to the user. For example, the implementation that uses code in the user's browser measures the latency the user actually perceives more accurately than the other measurement choices do.
The tradeoff is that the browser-based measurement also includes any latency introduced by the user's connection to your service. For example, when a service is used over the public internet, this latency might vary significantly with geographic location or network conditions.
The result is that the browser-based signal is a good proxy for user happiness. However, this signal might not provide actionable information you can use to improve the reliability of your service.
For information about combining multiple measurements to balance this tradeoff, see this post from The Telegraph.
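As one hedged illustration of the idea, and not a description of the Telegraph's implementation, the following sketch assumes you can join a browser-side and a server-side latency sample for the same request. Subtracting the two separates time spent in your service from time spent on the network path, which makes the user-facing signal more actionable:

```python
def attribute_latency(samples):
    """Split user-perceived latency into service time and network time.

    `samples` is a list of (browser_ms, server_ms) pairs joined on a
    shared request ID (a hypothetical join); the difference approximates
    time spent outside your service, such as the user's connection.
    """
    for browser_ms, server_ms in samples:
        network_ms = max(browser_ms - server_ms, 0)
        yield server_ms, network_ms

# Example: the second request is slow mostly because of network time,
# pointing at the user's connection rather than at your service.
for service_ms, network_ms in attribute_latency([(480, 120), (950, 130)]):
    print(f"service: {service_ms} ms, network: {network_ms} ms")
```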
Bucketing

You might need multiple SLIs for a service when your service performs different kinds of work for different users, or when it performs a particular task with different possible outcomes.
Different tasks

Services that perform multiple types of work for different categories of users, where each type of work influences user happiness differently, benefit from multiple SLIs.
For example, if your service handles both read and write requests, users performing those tasks might have different requirements: a user reading data might expect a fast response, while a user writing data might tolerate higher latency in exchange for assurance that the data is stored durably.
To capture these different requirements, your SLI must be able to distinguish between these two cases. Typically, the SLI metric has a label that you can use to classify values into one of several buckets.
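As a minimal sketch of label-based bucketing, assuming each measurement carries a hypothetical request-type label such as "read" or "write", the following Python computes a separate SLI per bucket:

```python
from collections import defaultdict

def sli_by_bucket(measurements):
    """Compute one SLI per bucket, keyed by the metric's request-type label.

    `measurements` is a list of (label, is_good) pairs, for example
    ("read", True) or ("write", False).
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for label, is_good in measurements:
        total[label] += 1
        good[label] += is_good
    return {label: 100.0 * good[label] / total[label] for label in total}

# Reads and writes become separate SLIs, so each can be held to its own SLO.
print(sli_by_bucket([("read", True), ("read", True), ("write", False), ("write", True)]))
```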
One task with different outcomes

Services that perform a single type of work, but where user expectations differ based on the response, benefit from multiple SLIs.
For example, if your service offers only read access to data, users might have different tolerance for latency depending on the outcome of the request: a successful request should return quickly, but a failed request often returns an error almost immediately. If the SLI doesn't separate the two, a burst of fast-failing requests can make measured latency look better even as users grow unhappier.
In this case, your latency SLI needs to be able to distinguish between successful and unsuccessful requests.
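As a minimal sketch of that distinction, assuming each request record carries a status code and a latency, and using a hypothetical latency threshold, the following Python excludes failed requests from the latency SLI:

```python
def latency_sli(requests, threshold_ms=200):
    """Latency SLI computed over successful requests only.

    `requests` is a list of (status_code, latency_ms) pairs. Failed
    requests are excluded so that errors returned quickly don't make
    latency look better than users' actual experience.
    """
    ok = [latency for status, latency in requests if status < 400]
    if not ok:
        return None
    fast_enough = sum(latency <= threshold_ms for latency in ok)
    return 100.0 * fast_enough / len(ok)

# A burst of fast 500s doesn't improve this SLI; only the two
# successful requests are measured, so the result is 50.0.
print(latency_sli([(200, 150), (200, 250), (500, 5), (500, 6)]))
```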
What's next

For information about implementing SLIs for Google Cloud services by using Google Cloud metrics, and about implementing application-specific SLIs, see the related pages in this section.
For an example that illustrates how to create an SLI for services that report custom metrics, see Setting SLOs: observability using custom metrics.