Is Your Service Level Agreement Driving Downtime?
In the context of maintenance, OEM service level agreements (SLAs) are meant to reduce the downtime associated with hardware and software malfunctions. Businesses often purchase several versions of SLAs from an OEM to cover a wide variety of equipment and uptime needs. These agreements fall into one of three categories:
- Software Troubleshooting and Tech Support
- Hardware Replacement
- Field Engineering
Software Troubleshooting and tech support is remote support. When something goes awry, technical support teams run through a standard troubleshooting process. If the issue is not resolved, the case may be escalated through various tiers within the tech support team. With each escalation, the client gains access to support agents with more experience and knowledge.
Hardware replacement provides clients with new or refurbished hardware within a specific period of time (as noted in the SLA). Ideally, replacement hardware is delivered quickly to minimize downtime. When field engineering support is not purchased, internal teams are responsible for installing and configuring the replacement hardware. Hardware replacement coverage must be purchased together with troubleshooting and tech support coverage, which helps avoid unnecessary hardware replacements that would not resolve the issue.
Field engineering support cannot be purchased on its own; it requires both troubleshooting and tech support coverage and hardware replacement coverage. It provides clients with on-site access to experts who install and configure replacement hardware.
Only 3 percent of network pros at midsize and large enterprises report that they catch and correct all mistakes before they cause an outage. - Veriflow
Are You Buying a Response or a Resolution?
Many SLAs are not built to drive resolution. Instead, each of the three coverage areas is driven by response times. This means that the support provider must respond to your inquiry within the set period of time. The response may be as simple as clarifying the problem or asking for the contract number. Once that initial response is made, the clock restarts. A 30-minute response time for high-priority issues may sound good, but it can quickly turn into 60, 90, or even 120 minutes before troubleshooting actually begins.
Service level agreements built around resolution incentivize providers to resolve an issue within a set time period and prevent lengthy back-and-forth sessions with tech support. A problem or issue is resolved when the cause of the issue is identified and repaired, and connectivity is restored. After resolution, there is a reasonable expectation that the same issue (with the same cause) will not reoccur.
In addition to response and resolution times, some SLAs promise a time to restoration. This means interrupted connectivity or traffic is restored within a set period, even though the issue that caused the interruption may not yet be identified or resolved. It gives clients and support teams the opportunity to enact a workaround until the root cause can be identified and resolved.
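The difference between the three clocks is easier to see when they are measured on the same ticket. The following is a minimal sketch, not any provider's actual schema: the `Ticket` class, its field names, and the sample timestamps are all hypothetical and exist only to illustrate how response, restoration, and resolution times diverge.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical record of the milestones on one support case."""
    opened: datetime
    first_response: datetime
    restored: Optional[datetime] = None  # traffic flowing again, possibly via a workaround
    resolved: Optional[datetime] = None  # root cause identified and repaired

    def time_to_respond(self) -> timedelta:
        return self.first_response - self.opened

    def time_to_restore(self) -> Optional[timedelta]:
        return self.restored - self.opened if self.restored else None

    def time_to_resolve(self) -> Optional[timedelta]:
        return self.resolved - self.opened if self.resolved else None


# Illustrative timestamps only (assumed, not taken from a real case).
opened = datetime(2024, 1, 8, 9, 0)
ticket = Ticket(
    opened=opened,
    first_response=opened + timedelta(minutes=25),
    restored=opened + timedelta(hours=3),
    resolved=opened + timedelta(days=2),
)

print("respond:", ticket.time_to_respond())
print("restore:", ticket.time_to_restore())
print("resolve:", ticket.time_to_resolve())
```

In this made-up case the provider would meet a 30-minute response commitment even though service was down for three hours and the root cause stayed open for two days.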
Eighty percent of unplanned outages are due to ill-planned changes made by administrators or developers. - Visible Ops
How Severe Is It?
In addition to varying coverage levels and timing, SLAs are also defined by problem severity. Each provider may use a unique term, yet the definition at each level is standard across the industry. Problem severity falls into one of four categories:
- P1 = Critical Impact/System Down
- P2 = Significant Impact
- P3 = Normal/Minor Impact
- P4 = Low Impact/Informational Inquiry
A P1 issue is network critical: it causes a business impact, and a critical piece of the network is inoperable. The issue prevents users from performing normal, mission-critical functions or processing revenue.
P2 issues impact the network and the business but do not cause a STOP. The condition likely causes severe latency and results in extremely slow functionality. The network is usable, but use is severely limited.
Issues assigned a severity level of P3 pose minimal network impact and no business impact, such as a performance issue. This issue may impact network functionality for multiple users but does not impact production.
P4 issues are informational or inquisitive. This could be anything from minor loss of functionality to feature requests and how-to questions.
Within an SLA, each level of severity is assigned a response time. As the severity decreases, the response time increases. Remember that response does not mean resolution. Most issues require three to ten touches to resolve. A P2 issue with a response time of 60 minutes can equate to three to ten hours of downtime. Often, it is the support provider that assigns the severity level, leaving the client with little control over how quickly the issue is resolved.
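The arithmetic behind that estimate can be sketched in a few lines. This assumes the worst case, in which every touch consumes the full response window; the function name and its parameters are illustrative and not part of any SLA tooling.

```python
def downtime_range_minutes(response_minutes, min_touches=3, max_touches=10):
    """Worst-case downtime range, assuming each touch uses the full response window."""
    return response_minutes * min_touches, response_minutes * max_touches


# P2 issue with a 60-minute response time, three to ten touches to resolve.
low, high = downtime_range_minutes(60)
print(f"{low / 60:.0f} to {high / 60:.0f} hours")  # 3 to 10 hours
```

Real cases will usually land somewhere inside this range, since not every response takes the full allotted time.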
One in ten companies report needing 99.999% availability. - Information Technology and Intelligence Corp.
Tracking Support Performance
Original equipment manufacturers (OEMs) do not offer SLAs that support time to resolution or restoration metrics, focusing instead on time to respond. To create an SLA that supports your business needs, partner with high-quality support teams that commit to Mean Time to Restore (MTTR) metrics and provide credits when MTTR averages exceed SLA-specified values.
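A credit clause tied to MTTR comes down to comparing an average against a threshold. The sketch below assumes restore times are logged in hours per incident and that the credit triggers when the period's average exceeds the SLA value; the function names, the sample figures, and the 3-hour target are hypothetical, and an actual contract may define the calculation differently.

```python
from statistics import mean


def mttr_hours(restore_hours):
    """Mean Time to Restore over a reporting period, in hours."""
    return mean(restore_hours)


def credit_due(restore_hours, sla_mttr_hours):
    """True when the period's MTTR exceeds the SLA-specified value."""
    if not restore_hours:  # no incidents in the period, so nothing to credit
        return False
    return mttr_hours(restore_hours) > sla_mttr_hours


# Illustrative restore times for one month (assumed data).
monthly_restore_hours = [1.5, 4.0, 2.5, 6.0]

print(mttr_hours(monthly_restore_hours))                      # 3.5
print(credit_due(monthly_restore_hours, sla_mttr_hours=3.0))  # True
```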
When crafting hardware replacement support agreements, ensure that “replacement” refers to the entire unit or Field Replaceable Part (FRP), not to a component that requires disassembling the unit and may or may not resolve the issue. Downtime increases when only a portion of the malfunctioning unit is replaced and the issue remains unresolved.
It is also important to track “dead on arrival” rates of replacement equipment. Since most OEMs pull hardware replacement equipment from an inventory of refurbished products, DOA rates are notoriously high. The industry average is 2 percent. However, high-quality support providers who routinely test replacement hardware provide DOA rates that are much lower, closer to 0.5 percent.
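Tracking a DOA rate only requires counting replacement units received and how many of them failed on arrival. The following is a rough sketch: the two benchmark constants come from the figures above, while the function name and the sample counts are assumptions made for illustration.

```python
INDUSTRY_AVERAGE_DOA = 0.02   # 2 percent, the industry average cited above
TESTED_PROVIDER_DOA = 0.005   # 0.5 percent, providers that routinely test replacements


def doa_rate(units_received, units_dead_on_arrival):
    """Fraction of replacement units that arrived non-functional."""
    if units_received <= 0:
        raise ValueError("units_received must be positive")
    return units_dead_on_arrival / units_received


# Hypothetical counts from a client's own records.
rate = doa_rate(units_received=400, units_dead_on_arrival=10)

print(f"DOA rate: {rate:.1%}")  # DOA rate: 2.5%
print("above industry average:", rate > INDUSTRY_AVERAGE_DOA)
```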
Tracking these metrics often falls to the client since OEM portal tools rarely allow for in-depth reporting. Establish a robust tracking system to ensure that your SLAs are serving your uptime needs.
Service level agreements are complex, and since businesses have many post-purchase agreements in addition to entitled services, understanding how agreements are impacting your business may require the insight of an expert. System downtime is not always avoidable, but getting your business back up and running as quickly as possible should be the primary goal of your support agreement.