In this blog, we will discuss how Outsourcing Service Providers are using Site Reliability Engineering (SRE) practices to modernize their service offerings and what that means for your outsourcing contract(s).
Site Reliability Engineering (SRE) is a set of software engineering practices developed by Google to improve business satisfaction with IT systems by proactively automating processes to increase service reliability, using Agile and Development, Security and Operations (DevSecOps) principles. SRE focuses on improving “availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of the service(s)” (1). SRE gives teams the tools to balance the risk of releasing new features against the need to keep services reliable for business users.
Traditional outsourcing contracts incorporate Information Technology Infrastructure Library (ITIL) v3 processes, which are prescriptive and already include guidance on how work gets done, such as automation, continual service improvement and alignment with business objectives. Modern outsourcing contracts incorporate ITIL 4 practices, which allow flexibility in how the work gets done based on the needs of the business while adhering to company policies. A key aspect of this flexibility for SRE is establishing an availability target for each product/service and, from it, an Error Budget: the amount of downtime the service can absorb before the target is missed. Error Budgets are used to balance service reliability with the pace of innovation. For example, if the organization expects no more than 3 hours of planned and unplanned downtime in a year and has had only 1 hour of downtime so far, 2 hours of budget remain and the team could plan to implement more changes since the system is stable. SRE relies on Error Budgets, Service Level Objectives (SLOs), and a mindset that incidents and problems are not development or operations concerns but business systems availability concerns. Adding the right SRE activities to your outsourcing contracts can help with managing business expectations and outcomes.
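To make the Error Budget arithmetic concrete, here is a minimal Python sketch using the 3-hour and 1-hour figures from the example above. The function names, the 365-day year, and the 80% "slow down" threshold are illustrative assumptions, not terms from any contract or from the SRE literature.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def remaining_error_budget(budget_hours: float, downtime_hours: float) -> float:
    """Hours of downtime still available in the period (never below zero)."""
    return max(budget_hours - downtime_hours, 0.0)


def budget_consumed_fraction(budget_hours: float, downtime_hours: float) -> float:
    """Share of the Error Budget already used, capped at 1.0."""
    return min(downtime_hours / budget_hours, 1.0)


def can_plan_more_changes(budget_hours: float, downtime_hours: float,
                          slow_down_threshold: float = 0.8) -> bool:
    """Hypothetical policy: keep releasing while less than 80% of the budget is used."""
    return budget_consumed_fraction(budget_hours, downtime_hours) < slow_down_threshold


def implied_availability_pct(budget_hours: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability target implied by a downtime budget over the period."""
    return 100.0 * (1.0 - budget_hours / period_hours)


if __name__ == "__main__":
    budget, used = 3.0, 1.0  # figures from the example in the text
    print(f"Remaining budget: {remaining_error_budget(budget, used):.1f} hours")
    print(f"Budget consumed: {budget_consumed_fraction(budget, used):.0%}")
    print(f"Plan more changes? {can_plan_more_changes(budget, used)}")
    print(f"Implied availability target: {implied_availability_pct(budget):.3f}%")
```

In a real contract, the period, the threshold, and what counts as downtime would be agreed between the Client and the Service Provider rather than hard-coded as they are here.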
There are numerous considerations when contracting for SRE services. The key considerations are Governance, Performance Management, Statement of Work, and Pricing, which together establish the guardrails for how SRE will be managed.
The first consideration is Governance of the contract – SRE is new enough that both parties must collaborate with and trust one another to define the intent of the services and the outcomes expected. History shows that trust and intent are contentious points between Clients and Service Providers, and that both tend to deteriorate from the point of contract signature. KPMG LLP recommends adding a Collaboration Schedule to the contract to capture the behaviors expected of all parties so that trust is built. For example, Collaboration rules like ‘Partner always’, ‘End-to-end business Process Optimization’ and ‘Performance Focus First’ address today's pain points and describe behaviors that are necessary in modern operating environments involving multiple parties. The Collaboration Schedule is necessary because a Reliability Engineer (RE) splits time between operations and application development, which have historically been delivered by different Service Providers. This time split often results in the RE playing a broker role between the Client and/or Service Provider(s) to find the right balance between service stability and the introduction of new features. To find that balance and establish accountability across suppliers, KPMG recommends establishing an SRE Roles and Responsibilities matrix in the contract, very similar in look and feel to the Financial Responsibilities Matrix in your traditional outsourcing contract.
The second consideration is Performance Management – Service Level Agreements (SLAs) have always been a point of contention between Clients and Service Providers because SLAs do not always align with business outcomes or measure end-to-end business processes. SRE practices are designed to keep SLAs from breaching their targets by using more stringent Service Level Objectives (SLOs) and Service Level Indicators (SLIs). SLOs align operations with business objectives using SLI measurements such as system response time, end-to-end process failure and user satisfaction. For example, missing payroll payments due to a process step failure in an HR system could have been proactively detected and resolved using SLOs/SLIs. This raises two questions:
- Should I just update my SLA measurements to match the more stringent SLOs/SLIs? The quick answer is no. Continuous improvement will already raise the measured result over the life of the contract, for example from an SLA target of 99.95% to an SLI of 99.97%, and Service Providers are reluctant to commit to higher percentages without charging more for services.
- How do I measure the performance of an SRE service? Along with the measurements previously mentioned, KPMG recommends that you also include Critical Deliverables in your contracts. These focus on SRE outcomes, and each one is simply met or not met. Critical Deliverables carry material financial penalties to drive Service Provider focus on business outcomes and end-to-end business processes. Building on the missing payroll payments example, Critical Deliverables would exist for ‘Payroll processed with zero errors’ and ‘HR Payroll data loaded on time’, which is what the business really cares about.
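To show what the percentages above mean in practice, the following Python sketch converts an availability percentage into allowed downtime per year and flags when a measured SLI falls below a stricter SLO before it reaches the SLA. It assumes the 99.95% and 99.97% figures are availability percentages measured over a 365-day year and treats 99.97% as a hypothetical SLO target. The function names and status labels are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime permitted in the period by an availability percentage."""
    return period_hours * (100.0 - availability_pct) / 100.0


def classify(measured_sli_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA.

    The SLO is assumed to be at least as strict as the SLA, so an SLO miss
    acts as an early warning before the SLA itself is breached.
    """
    if measured_sli_pct < sla_pct:
        return "sla_breach"
    if measured_sli_pct < slo_pct:
        return "slo_warning"
    return "healthy"


if __name__ == "__main__":
    for label, pct in (("SLA", 99.95), ("SLO", 99.97)):
        hours = allowed_downtime_hours(pct)
        print(f"{label} {pct}% allows about {hours:.2f} hours of downtime per year")
    # A hypothetical measured SLI that sits between the two targets
    print(classify(99.96, slo_pct=99.97, sla_pct=99.95))
```

Under these assumptions, the gap between 99.95% and 99.97% is roughly 4.38 versus 2.63 hours of downtime a year, which is the margin an SLO gives the team to react before the contractual SLA is breached.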
The third consideration is the Statement(s) of Work (SOW) – Traditional outsourcing SOWs contain specific activities describing ‘how’ work is performed. Modern outsourcing SOWs contain less prescriptive responsibility statements that focus on ‘what’ work needs to be performed, aligning with Agile principles. For example, Incident Management in a Traditional SOW will have a statement like ‘Provide a data extract on a nightly basis of defined data elements from the Incident database.’, while a Modern SOW responsibility statement will look like ‘Supplier shall perform all activities and services required to maintain and support Incident Management (e.g., Incident response and resolution)’.
The fourth consideration is Pricing – Outsourcing is usually focused on price and performance, with price being most important during contracting and performance most important post-contracting. Service Providers are constantly introducing new services like SRE to differentiate themselves from the competition, add value and allow for price premiums. With Agile ways of working, value realization to the business is a key measurable component of outsourcing governance; examples include producing business-requested features quickly, collaborating with the business, and adjusting plans at the speed of business. KPMG recommends establishing a Value Realization Framework, used in combination with Performance Management, to align business expectations with contractually measurable outcomes from Service Providers so that you get the value you pay for. This raises a key question: why would you want to pay more for a new service you are unfamiliar with? The quick answer is you don’t have to. KPMG recommends making sure SRE services are included in the base services in your Request For Solution (RFS) requirements to Service Providers.
(1) Source: The Site Reliability Workbook, “How SRE Relates to DevOps,” Murphy et al., 2018