Showing content from https://www.redhat.com/en/topics/devops/what-is-sre below:

What is SRE (site reliability engineering)?

A site reliability engineer fills a unique role that calls for one of three backgrounds: experience as a sysadmin, experience as a software developer with additional operations exposure, or experience in IT operations combined with software development skills.

SRE teams are responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services in production.

SRE teams decide when new features can launch. They use service-level agreements (SLAs) to define the required reliability of the system, expressed through service-level indicators (SLIs) and service-level objectives (SLOs).

An SLI measures a specific aspect of the service level provided. Key SLIs include request latency, availability, error rate, and system throughput. An SLO is the target value or range for a given service level, as measured by the corresponding SLI.
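The relationship between an SLI and an SLO can be illustrated with a minimal sketch. It assumes availability is measured as the fraction of successful requests in a window; the `RequestWindow` type, the function names, and the 99.9% target are hypothetical choices for illustration, not values from this article.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window (hypothetical type)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """Availability SLI: fraction of requests that succeeded in the window."""
    if window.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (window.total - window.failed) / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is a target for the SLI; it is met when the SLI reaches the target."""
    return sli >= slo_target


window = RequestWindow(total=100_000, failed=50)
sli = availability_sli(window)
print(f"availability SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI here is a measurement and the SLO is the threshold it is compared against; other SLIs such as latency or throughput would follow the same pattern with a different measurement function.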

The SLO for system reliability is then set according to the amount of downtime determined to be acceptable. This downtime level is referred to as an error budget: the maximum allowable threshold for errors and outages.
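As a rough worked example, an availability SLO can be converted into an error budget expressed in minutes of downtime. The sketch below assumes a 30-day measurement window and a time-based availability target; the function name and both values are illustrative assumptions rather than figures from this article.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (99.9%). The 30-day default
    window is an assumption made for this example.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
print(f"{error_budget_minutes(0.999):.1f} minutes")
```

A target of 100% would leave a budget of zero, which is one way to see why SRE does not aim for perfect reliability.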

With SRE, 100% reliability is not expected; failure is anticipated and planned for.

Once the error budget is established, the development team is able to "spend" it when releasing a new feature. Using the SLO and error budget, the team then determines whether a product or service can launch based on the error budget that remains.

If a service is running within the error budget, then the development team can launch whenever it wants. If the system currently has too many errors or has been down for longer than the error budget allows, then no new launches can take place until the errors are back within budget.
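The launch rule described above can be sketched as a simple gate. This is a minimal illustration that assumes the budget and the downtime already consumed are both tracked in minutes; the function names are hypothetical, and treating "exactly on budget" as still within budget is an assumption.

```python
def remaining_budget(budget_minutes: float, downtime_minutes: float) -> float:
    """Error budget left in the current window; negative means overspent."""
    return budget_minutes - downtime_minutes


def can_launch(budget_minutes: float, downtime_minutes: float) -> bool:
    """Allow launches only while consumed downtime stays within the budget.

    Assumption: consuming exactly the full budget still counts as within it.
    """
    return remaining_budget(budget_minutes, downtime_minutes) >= 0


# With a 43.2-minute budget, 20 minutes of downtime still permits a launch,
# while 50 minutes blocks launches until the service is back within budget.
print(can_launch(43.2, 20.0))
print(can_launch(43.2, 50.0))
```

In practice a team might also reserve part of the remaining budget for the risk a launch itself carries; that refinement is left out here.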

The development team conducts automated operations tests to demonstrate reliability. 

Site reliability engineers split their time between operations tasks and project work. According to SRE best practices from Google, site reliability engineers should spend no more than 50% of their time on operations, and their workload should be monitored to ensure they do not exceed that limit.

The rest of their time should be spent on development tasks like creating new features, scaling the system, and implementing automation.

Excess operational work and poorly performing services can be redirected back to the development team so that the site reliability engineer doesn't spend too much time on the operations of an application or service. 
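One way to monitor the 50% limit is to compare logged operations hours against total hours. The sketch below assumes time is tracked in hours per engineer per period; the names and the sample figures are hypothetical.

```python
OPS_CAP = 0.5  # maximum share of time on operations, per the 50% guideline


def ops_share(ops_hours: float, project_hours: float) -> float:
    """Fraction of total logged time spent on operations work."""
    total = ops_hours + project_hours
    if total == 0:
        return 0.0
    return ops_hours / total


def excess_ops_hours(ops_hours: float, project_hours: float,
                     cap: float = OPS_CAP) -> float:
    """Operations hours above the cap, i.e. candidates for redirecting
    back to the development team. Returns 0.0 when under the cap."""
    total = ops_hours + project_hours
    return max(0.0, ops_hours - cap * total)


# Hypothetical week: 26 hours of operations work, 14 hours of project work.
print(f"ops share: {ops_share(26, 14):.0%}")
print(f"hours over the cap: {excess_ops_hours(26, 14)}")
```

A nonzero result from `excess_ops_hours` signals that operational load should be handed back or automated away.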

Automation is an important part of the site reliability engineer’s role. If they are repeatedly dealing with a problem, then they will likely automate a solution. 

Maintaining the balance between operations and development work is a key component of SRE. 
