Chapter 5. Working with Third Parties Shouldn’t Suck

Over the years, the definition of Site Reliability Engineering (SRE) has evolved, but the easiest definition to digest is arguably “what happens when software engineering is tasked with what used to be called ‘operations.’”1 Most Site Reliability teams think of operations in terms of the applications running on their own infrastructure. These days, more and more companies rely on third parties to serve a very specific function in which they specialize. This includes things like Domain Name System (DNS), Content Delivery Network (CDN), Application Performance Management (APM), Storage, Payments, Email, Messaging (SMS), Security (such as Single Sign-On [SSO] or Two-Factor Authentication [2FA]), Log Processing, and more. Any one of these resources, if not implemented properly, is a dependency that has the capacity to bring down your site.

Are vendors black boxes that we don’t control? Not necessarily. As we approach working with vendors, it’s important that we apply the same suite of SRE disciplines to those third-party relationships in an effort to make the experience suck less.

Build, Buy, or Adopt?

Before we dive into the topic of working with vendors, we should discuss the decisions that would lead us to buy over build or adopt. Our level of involvement in this process will depend on the combination of importance and stakeholders. Determining importance is the first step in this entire process and will dictate the significance of other deciding factors, such as weight of cost, the weight of support, influencers, Service-Level Agreements (SLAs), and more.

Establish Importance

Determining importance can be challenging for an SRE leading a project or integration. For instance, if the decision is on which JavaScript framework to use for the next version of the website, it’s clear that there are going to be many stakeholders involved in the decision-making process and many more impacted, such as Data Science, Quality Assurance, Tools, and more.

However, the decision over which certificate authority (CA) to select is another story altogether. For some, the choice of certificate is as simple as saying, “Let’s just use Let’s Encrypt,” and moving ahead. There are, however, a handful of companies that must consider factors that go beyond the choice between a single-name certificate and a multiname certificate.

If you’re scratching your head right now, good. Let’s explore this a bit more.

Depending on the SRE team, certificates might simply imply security. To other SRE teams, certificates might elicit concerns about the impact on performance. Yet other SRE teams might question the impact on protocols and cipher suites. And beyond SRE teams comes the question of long-term ownership, integration, changes to the request process, workflow, access control, revocation process, certificate rotation, and more.

Determining the importance of an integration early on will help avoid technical debt and scaling limitations down the road.

Identify Stakeholders

The smaller the company, the easier the decision-making process is. However, as companies grow, it is important to at least consider factors that could limit scale, slow growth, reduce productivity, and cause headaches down the road. Further, understanding importance also allows us to identify who might be impacted by our decisions so that they can be brought into the fold.

As SREs, we know firsthand the role collaboration plays in site reliability. The earliest SRE teams at Google were embedded with software engineering teams, provided valuable insights into scaling, and took an operational approach to building reliable services. It would only make sense that as we consider the impact of integration, we identify the stakeholders and influencers who will prove key to our project’s overall success.

Ownership and a sense of importance play a significant role in a project’s success. It goes without saying that people don’t like unplanned work and they don’t like being told what to do. The earlier stakeholders are identified and consulted, the easier it will be to instill ownership and gain additional people working toward the project’s success. Nothing is worse than starting a project and committing time and energy, only to have it shot down by a stakeholder whose role was not considered.

Make a Decision

After we have established importance and identified stakeholders, we can begin evaluating the decision to build or buy. Quite often, the build discussion is heavily dependent on the incorporation of an open source solution, which I propose is an altogether new category to be considered: adopt. Let’s break down the differences between each option.

Build (developed in-house from scratch)

Your needs are so custom and innovative that they cannot be met by anyone other than your company. Often, these solutions are open-sourced after battle hardening. Examples include Kafka by LinkedIn, React by Facebook, and Envoy by Lyft.

Buy (SaaS, hosted, or licensed solution)

Your needs are common and a paid solution exists to solve that problem.

Adopt (integration of an open source solution)

Your needs are common and open source solutions exist.

As we deliberate whether to build, buy, or adopt, we might note that some key differences exist in when and where the money is spent. As we consider adopting an open source solution, the capital expenditure (CapEx) and operational expense (OpEx) can be difficult to predict. Evaluating and researching an open source project for a Proof-of-Concept (PoC) is relatively inexpensive; however, as the project is utilized at scale, issues such as memory utilization, computational constraints, and storage requirements can quickly exceed benchmarks captured during R&D. Additionally, skilled personnel are necessary not only for the initial integration, but also for continued operation, enhancements, and inevitable upgrades. Open source projects will have bugs that can expose you to significant security vulnerabilities, memory leaks, and loss of data. In some cases, full version upgrades create incompatibilities with other integrations as a result of feature deprecation. Last but not least, migrating between open source versions can introduce breaking changes, often causing full configuration rewrites, loss of backward compatibility, and operational instability. These sorts of issues are generally covered in a single line item in the CapEx table, but that line item should be expanded into its own project operating expense (PrOpEx) table.

As a result, our estimates often skew low if we don’t consult with an experienced project manager. In the buy case, CapEx is generally low with little integration cost, while OpEx is generally perceived as high—perceived because we often do not consider the personnel hours we would otherwise have to spend to produce a similarly operationally viable product. Many of the considerations involved in an open source adoption are resolved by a third-party solution, and the vendor is on the hook for maintaining that operational stability.

Neither adopt nor buy solutions invalidate the need for documentation, maintenance, monitoring, logging, tooling, automation, and all other aspects that define a service. The difference between a PoC, a functional prototype, a Minimum Viable Product (MVP), and a first-class service with high operational integrity resembles the distance between the sun and each planet. Toward the end of this chapter, we address some of the critical facets that define a first-class service.

However, now that we’re at least considering the difference between build, buy, and adopt, we can address some additional points of consideration.

Acknowledge Reality

You’re a fantastic SRE. You write elegant code. You’re a natural leader. Scalability comes naturally to you. But you aren’t good at everything. It’s likely that you’ve found a niche that allows you to excel. The next interesting project might not be something that should be built from scratch. Not because you can’t, but because business decisions are rife with justifications that go beyond your personal interests. As a responsible SRE, you will find that prioritizing business objectives over personal interests leads to success. The unfortunate reality is that working on something cool is not always what’s best for business.

As we let that reality sink in, let’s take a look at some other project considerations that we need to assess:

  • What problem is being solved?

  • How does this impact the bottom line?

  • Will this impact the critical path?

  • Is this a core competency?

  • Maturity of a solution?

  • Nice to have or need to have?

  • Will this have continued adoption?

  • What vulnerabilities are exposed?

  • What is our CapEx?

  • What are our OpExes?

  • What are our PrOpExes?

  • What are our abandonment expenses?

  • Who are our customers?

  • To whom are we a customer?

  • Did we identify ancillary benefits?

  • What is the delta of inception to production?

  • What is the integration timeline?

  • How will this be monitored?

  • Who’s responsible long term?

  • What does escalation look like?

  • What are the SLAs? Is that critical?

  • What are the risks/benefits?

  • What is our fallback?

  • How to measure success?

  • How to tolerate failure?

  • What happens if I get hit by a truck?

These points are only a handful of the considerations to keep in the back of your mind when debating whether to buy or adopt. Going through each consideration with your team depends on the significance of the integration. As we consider buy options, it’s worth expanding on a few of these points to provoke additional thoughts that you might not have considered.

Is this a core competency?

Always weigh whether a solution or integration under consideration is both your core competency and the company’s core competency.

Third parties offer specialized solutions and have teams of engineers working to expand their value proposition by building additional features and services. A Software as a Service (SaaS) solution for log aggregation might be more effective in the long run than trying to adopt an open source solution such as Elasticsearch. You have to decide whether Elasticsearch is really something that you, your team, and your company need to focus on.

Integration timeline?

However long you think it’s going to take to complete an integration, double it. Then add a little more. A PoC for an open source solution might be easy to implement, but this does not mean that the project is production ready. The definition of production readiness for SREs should encompass testing/staging, monitoring, logging, configuration management, deployment procedure, disaster recovery, and beyond.

With things like CDN, DNS, monitoring, CA, and messaging solutions, implementation of these production readiness factors can prove to be incredibly challenging, and in some cases, impossible. Fortunately, with third-party integrations, especially at larger companies, a variety of teams will evaluate the solution as we work on the integration. You should expect that Legal, Security, and Procurement2 teams will play a role during the purchase process. Each of these teams will introduce its own processes, which can delay getting the third-party solution into production.

Project Operating Expense and Abandonment Expense

Beyond consideration for the initial cost to purchase goods and services (CapEx, such as hardware, software, and one-time licenses) and the ongoing costs to run and maintain (OpEx, such as monthly server prices, recurring service fees), additional expense categories should be a part of the consideration process.

Project Operating Expense (PrOpEx)

Implementation costs for a solution. Typically a single line item in the CapEx table, implementation costs can prove to be a more significant factor in the build, buy, or adopt decision. Implementation costs to consider include services that will be utilized as the solution is implemented, such as legal fees, vulnerability probing, additional monitoring and performance infrastructure, and professional or consulting services. These costs can also expose overlooked OpEx costs.

Abandonment expense (AbEx)

This is the cost to abandon a solution to implement another altogether. This is sometimes referred to by the softer term migration cost. There is always the potential that adopting an open source solution might result in abandonment in order to buy a fleshed-out SaaS solution, or vice versa. Costs will differ between abandonment scenarios. In the buy-to-adopt scenario, you might be required to pay out contractual obligations such as monthly usage commitments in addition to the incurred costs of restarting a project. In the adopt-to-buy scenario, cost is less of an issue; however, time is a much larger factor. In the previous section, we saw that Legal, Security, and Procurement processes can prolong time to production. It’s in our best interest to plan accordingly and abandon early.

You should consider AbEx throughout the lifetime of a project. Working with third parties means that we should also consider the buy-to-buy abandonment scenario, as well—that is, leaving one vendor to use a cheaper, more robust vendor’s solution. If you’re not careful, one vendor’s custom solution could cause vendor lock-in, ultimately increasing abandonment expense. Vendor lock-in can also tip the scales in the vendor’s favor when it comes time to renegotiate contracts.

After deliberating the risks and benefits of every solution, you may reach the decision to buy. If so, it should come as no surprise to the higher-ups that there’s work involved in setting up the third party beyond simple configuration. SREs have an obligation to build world-class solutions, and to that end, we should explore some best practices to move third parties from being ancillary to your tech stack to being an extension of it.

Third Parties as First-Class Citizens

In many organizations, you hear the term “first-class citizen” being thrown around in reference to some mission-critical service—for example, auth/login or ads. Rarely will you hear “third party” and “first-class citizen” in the same sentence. By and large, third parties are treated as ancillary to the tech stack. That sentiment is to be expected—if you’re paying another company for its products, features, and services, it’s that company’s responsibility to manage its SLAs. However, your end users (your customers) do not necessarily share that sentiment, nor are they as forgiving. For all intents and purposes, your third-party integrations are an extension of your company.

When They’re Down, You’re Down

Third-party integrations take many forms. As we integrate with third-party solutions, it’s important to define how these can affect our end users. In Figure 5-1, there are a few third-party integration points that have direct and indirect impacts on end-user experience.

Figure 5-1. Third parties can fit anywhere around the edge of your tech stack

Direct impact

Quite often, the third-party providers that we work with have a direct impact on site availability, site reliability, and performance. When these providers fail, the impact on overall reliability is obvious and immediate. Here is a short list of providers with direct impact:

  • DNS is the first system that end users experience. When DNS is down, you’re down.

  • Edge products, such as site accelerators, wide-area firewall, forward proxies/load balancers, and so on, can suffer congestion issues, human configuration error, and routing failures.

  • CDNs for static objects, such as JavaScript and stylesheets, play a role in perceived user experience; improper configuration can lead to ineffective cacheability and high page load times, leading to increased bounce rates. The ability to diagnose and triage performance has become increasingly complicated with Single-Page Application (SPA) frameworks.4

Indirect impact

Some providers impact site reliability in less obvious ways. Based on experience, these are third parties that process transactions on the backend of the technology stack. The following are examples of providers with indirect impact:

  • Payment processor outages—for instance, failed API key rotation or incorrect billing codes—should not block the user experience. An optimistic approach might use previous billing history as an indicator of future payment processing success and simply queue the transaction (a minimal sketch of this pattern follows this list).

  • Synthetic and Real-User Monitoring (RUM) are necessary for telemetry; however, use caution when utilizing either for automation.5

  • SMS and email integrations generally have no direct touchpoints between your end users and your tech stack. However, end users have come to expect instant gratification; therefore, undelivered payment receipts or delayed two-factor verification codes will lead to a negative perception of your company.
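
To make the optimistic approach concrete, here is a minimal sketch in Python. The processor client, its charge method, and the billing-history check are hypothetical placeholders; the point is simply that a processor failure becomes a queued retry rather than a blocked user.

```python
import queue
import time

payment_retry_queue = queue.Queue()

def charge_or_queue(user_id, amount_cents, processor, has_good_billing_history):
    """Attempt a charge; if the processor is unreachable, optimistically queue it."""
    try:
        return processor.charge(user_id=user_id, amount_cents=amount_cents)  # hypothetical client
    except ConnectionError:
        if has_good_billing_history:
            # Let the user proceed; a background worker drains the queue later.
            payment_retry_queue.put(
                {"user_id": user_id, "amount_cents": amount_cents, "queued_at": time.time()}
            )
            return {"status": "pending", "user_id": user_id}
        raise  # No history to justify optimism; surface the failure.
```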

SREs are not only seeking excellent uptimes, but also a consistent user experience overall. A theoretically simple mitigation strategy for mission-critical services is to establish redundancy. Whether direct or indirect, these mission-critical third-party integrations should be treated as first-class citizens and run like any other service in your tech stack.

But if third parties are categorized as black boxes, we shouldn’t expect much from them in terms of data, right? That’s simply not true these days. More and more third parties are building APIs that let us run their infrastructure as code, which means we are able to manage configuration, reporting, and a suite of other features to run our third-party products more like services.

Running the Black Box Like a Service

Third-party solutions are not like the services you run in your tech stack; therefore, the consensus view is that they’re black boxes—you put stuff in and get stuff out. Each third-party vendor offers a set of products and features that provide some value propositions. Depending on the vendor, its solution could very well be a black box. However, many modern vendors have come to understand the need for transparency and have built APIs to facilitate integrations. Thus, rather than stamping a vendor as a black box, we should consider a vendor as being on a spectrum.

We can use Figure 5-2 as a way to determine where on the spectrum a third-party vendor might land. Let’s take two vendors. Vendor A might have great reporting, tons of monitoring capabilities, simple configurations, access to a robust set of APIs, and high-caliber documentation to maximize self-service. Vendor B could have the best solution with simple configurations through an intuitive portal, but if we’re looking for the ability to utilize Vendor B to help automate processes, limited API access will prevent us from fully implementing its solution. In this example, Vendor A is a very open box, and Vendor B is a more closed solution, perhaps closer to a black box.

Figure 5-2. Not all third-party solutions are black box

It’s often hard to determine which part of a third-party integration could possibly be treated like a custom-built service. To that end, we need to define the top-level functionality that we are attempting to achieve. After we understand that functionality well, we should refine the activities needed to manage that integration to yield a set of predictable and actionable states. Some of these states might be the following:

Fully operational

All vendor operations working as expected.

Operational but degraded performance

Vendor solution is operational but performance is degraded; for example, CDN is delivering content, but objects are being delivered slowly.

Operational but limited visibility

Vendor solution is operational, but telemetry does not reflect the operational state. An example with a third-party integration would be a reporting API or log delivery service outage when you rely on that third-party data to flag service state.

Operational but API down

Vendor solution is operational, but its configuration API and/or portal are down. Vendors typically segregate configuration from operation, so we can still be up operationally, but the inability to make changes might create a bottleneck.

Hard down

Vendor solution is not operational.

With third parties, some additional states might include portal outages, processing delays, data consistency issues, regional performance degradation, regional service outages, and so on. The level of observability we have will dictate our success when we’re dealing with these states. Let’s dig into what it takes to help manage these states.
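
One way to make these states actionable is to encode them explicitly in tooling. The sketch below is a minimal example; the underlying health checks (synthetic probes, log freshness, API pings) are assumed to exist elsewhere, and only the mapping onto states is shown.

```python
from enum import Enum

class VendorState(Enum):
    """Actionable states for a third-party integration, per the list above."""
    FULLY_OPERATIONAL = "fully_operational"
    DEGRADED_PERFORMANCE = "degraded_performance"
    LIMITED_VISIBILITY = "limited_visibility"
    API_DOWN = "api_down"
    HARD_DOWN = "hard_down"

def classify(serving_ok, latency_ok, telemetry_ok, api_ok):
    """Map a handful of boolean health checks onto a single vendor state."""
    if not serving_ok:
        return VendorState.HARD_DOWN
    if not latency_ok:
        return VendorState.DEGRADED_PERFORMANCE
    if not telemetry_ok:
        return VendorState.LIMITED_VISIBILITY
    if not api_ok:
        return VendorState.API_DOWN
    return VendorState.FULLY_OPERATIONAL
```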

Service-Level Indicators, Service-Level Objectives, and SLAs

We can’t talk about running a service without talking about Service-Level Indicators (SLI), Service-Level Objectives (SLO), and SLAs. But how do you capture SLIs on services that run outside of your tech stack that are seemingly out of your control? This is where we need to be creative.

SLIs on black boxes

SLIs for services within our own tech stack are relatively easy to collect. In many cases, it’s a matter of running a daemon to collect operating system/virtual machine/container stats or importing a library/SDK that emits metrics to our metrics collection cluster. Easy enough. Vendors run their services within their own tech stack. They will not typically emit server-level metrics; however, they do offer a wealth of data in a couple of different forms.

Polling API informs SLIs

If your vendor does not offer a robust reporting API, walk away. Reporting APIs will be a first step toward building your SLIs. You should not, however, trust these as real-time data or use them as if they were. Many vendors will try to keep their reporting APIs as close to “real time” as possible; however, there is always a processing delay. Any provider running a distributed network will likely suffer from some reporting latency—they need to aggregate, process, and provide a flexible API—and this all comes at a cost that is not necessarily beneficial to their bottom line.
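
As a rough illustration, here is a sketch of deriving an availability SLI from a polled reporting API. The endpoint, query parameters, and response fields are hypothetical, and whatever you compute this way should be treated as a trailing indicator rather than a real-time one.

```python
import json
import urllib.request

# Hypothetical vendor reporting endpoint; path, parameters, and response
# shape are assumptions, not any particular vendor's API.
REPORTING_URL = "https://api.example-vendor.com/v1/reports/edge?window=5m"

def availability_sli_from_polling(api_token):
    """Derive a rough availability SLI from a polled (aggregated) report."""
    req = urllib.request.Request(
        REPORTING_URL, headers={"Authorization": f"Bearer {api_token}"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        report = json.load(resp)

    total = report["requests_total"]   # assumed field names
    errors = report["requests_5xx"]
    return 1.0 if total == 0 else (total - errors) / total
```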

Real-time data informs SLIs

As technology has progressed, vendors are beginning to offer real-time logs delivered via Syslog, HTTP POST, or SFTP upload. In addition, log processing companies have partnered with these vendors to facilitate integration and dashboards. With these integrations in place, we can see the paradigm shift from “the vendor is a black box” to “the vendor is a partner.” With the data on hand, we can shed some of the weight of vendor dependency and become more self-reliant. If real-time data is offered as a paid service, ask that it be free. Ultimately, you, as an SRE, will utilize this data to troubleshoot issues, thus alleviating their customer support burden.
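
With a real-time feed in place, an SLI can be computed over a short rolling window as events arrive. The sketch below assumes the feed has already been parsed down to (timestamp, HTTP status) pairs; adapt the parsing to whatever format your vendor delivers.

```python
from collections import deque
import time

WINDOW_SECONDS = 300
events = deque()  # (timestamp, http_status) tuples from the real-time feed

def record(status, now=None):
    """Add one delivery receipt / log line to the rolling window."""
    now = time.time() if now is None else now
    events.append((now, status))
    # Evict events that have fallen out of the window.
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()

def error_rate():
    """Fraction of requests in the window that were server errors."""
    if not events:
        return 0.0
    errors = sum(1 for _, status in events if status >= 500)
    return errors / len(events)
```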

Synthetic monitoring informs SLIs

Trust but verify. Our providers can give us a wealth of data, and it’s unlikely that our vendors would hide anything. But for services that we run that have a very broad scope, such as CDN or DNS, it can often be extremely difficult to triage and diagnose an issue. Synthetic monitoring (APM solutions such as Catchpoint and Keynote) provides an additional layer of detail that could not be seen through service metrics alone. Synthetic tests will often reveal configuration issues, delivery latency, cache efficiency, compression handling, and corner cases that would be difficult to find with log data alone. For SREs that own CDN and DNS, this is becoming critical. And if you’re wondering, synthetic monitoring providers offer both real-time and polling APIs, so you can do more cool stuff with test results.

RUM informs SLIs

Nothing is better than real-user data. With advancements made in modern browsers, there is a wealth of information that we can garner from navigation timing and resource timing APIs (client side). RUM can tell you all sorts of information about how users are experiencing your third-party services, which includes CDN, DNS, CA, payment processors, ads, and more.
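
On the backend, each RUM beacon can be reduced to a handful of delivery metrics per page view. The payload shape below loosely mirrors Navigation Timing and Resource Timing entries, but the exact beacon format depends on your RUM provider or in-house collector, and the CDN hostname is an assumption.

```python
import json

def summarize_beacon(raw_payload):
    """Boil one RUM beacon down to page load time and CDN delivery stats."""
    beacon = json.loads(raw_payload)
    cdn_entries = [
        e for e in beacon.get("resources", [])
        if e.get("name", "").startswith("https://static.example.com/")  # assumed CDN hostname
    ]
    durations = sorted(e["duration"] for e in cdn_entries if "duration" in e)
    return {
        "page_load_ms": beacon.get("loadEventEnd", 0) - beacon.get("navigationStart", 0),
        "cdn_object_count": len(cdn_entries),
        # Crude p50 of per-object delivery times from the Resource Timing entries.
        "cdn_p50_ms": durations[len(durations) // 2] if durations else None,
    }
```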

SLOs

SREs typically focus on meeting their SLOs—the agreement between SRE and product owners for how often a service’s SLIs can be out of spec. For an SRE working with a provider, SLOs can be difficult to calculate and even more difficult to define. It’s not simply a matter of carrying over the SLA from the vendor. SREs responsible for third-party integrations might be dealing with multiple vendors simultaneously while supporting a broad range of product teams, as is the case with payments, ads, CDN, and DNS. In this scenario, the product owners end up being your sibling SRE teams, and the SLOs we define allow those sibling teams to calculate risk and formulate error budgets.
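
The arithmetic behind an error budget is straightforward; here is a quick sketch with illustrative numbers.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime (in minutes) for a given SLO over a window."""
    return (1.0 - slo) * days * 24 * 60

budget = error_budget_minutes(slo=0.999)   # ~43.2 minutes per 30 days
consumed = 12.0                            # minutes of measured vendor-related impact
print(f"Budget: {budget:.1f} min, consumed: {consumed:.1f} min "
      f"({consumed / budget:.0%} of budget)")
```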

Negotiating SLAs with vendors

When you first sign up for service with your vendor, you sign a lot of things, including a Master Service Agreement (MSA), which covers all of the vendor’s products and services, and an SLA. Historically, SREs haven’t had to work with MSAs or SLAs, but with the evolution of the SRE role and broadened scope of work, these agreements are more commonplace for the SRE. That said, if you’re big enough to have a legal team and a procurement team, be sure they are in the loop before signing anything.

Often, vendor SLAs provide service credits when the vendor fails to meet its end of the agreement. Most SLAs specify uptime targets along the lines of 99.5%. Unless there’s a catastrophic event, these SLA targets are easily met. Things get tricky when we begin to consider SLAs that meet specific needs, such as throughput for large object delivery on a CDN, indexing latency for our log processing companies, or metric accuracy for our synthetic monitoring providers.

Use the same SLIs and SLOs to inform your SLAs to hold vendors accountable for their service availability and performance. Your procurement team will appreciate the transparency and validation of their hard work.
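
In practice, holding a vendor accountable means comparing your own measured uptime against the contractual target. A minimal sketch, using an illustrative 99.5% target rather than any real contract’s terms:

```python
SLA_UPTIME = 0.995  # illustrative contractual target

def sla_check(measured_uptime):
    """Compare our own SLI-derived uptime against the vendor's SLA."""
    if measured_uptime >= SLA_UPTIME:
        return "within SLA"
    shortfall = (SLA_UPTIME - measured_uptime) * 100
    return f"SLA missed by {shortfall:.2f} points; open a credit request with procurement"

print(sla_check(0.9932))
```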

Playbook: From Staging to Production

With many of the services we run, we have to adhere to some sort of playbook—a means for running our service in production. This involves a detailed explanation of what the service is intended to do, how to test it, how to deploy it, and how to deal with it should things go sideways.

Testing and staging

Quarterly releases are ancient history. These days, Continuous Integration (CI) and Continuous Deployment (CD) are standard. We’ve become accustomed to maintaining a deployment pipeline similar to one that is triggered by committing code to master and allowing our automation to run a suite of unit tests, followed by deployment to a staging environment, followed by Selenium testing, followed by canary deployment, and finally production deployment.

Third-party integrations can pose a different set of challenges with regard to properly testing functionality within staging environments. It’s unlikely that CDN, DNS, certificates, and edge proxy configurations are considered as part of a CI/CD pipeline. It’s even less likely that your staging environment resembles production, if that staging environment even exists. Unfortunately, as these products have been seen as ancillary, little attention has been given to how these technologies play a role in the deployment process.

With the emergence of new players in the CDN, DNS, certificate, edge, and APM space, progress is being made as to how we can configure and manage their respective products via code. As a result, replication of environments has become more feasible. Additionally, some vendors offer some form of staging environment. For example, Akamai offers configuration staging as a standard part of its configuration deployment procedure; the staged configuration is triggered by modifying hostname resolution to target the staging hosts. Staging environments ultimately encourage the creation of unit, regression, and Selenium test regimens.
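
A common pattern for exercising a staged edge configuration is the curl --resolve trick: connect to the staging edge’s address but negotiate TLS and send the Host header as production. Below is a rough Python equivalent; the staging hostname is an assumption modeled on Akamai-style naming, so substitute whatever your vendor actually provides.

```python
import socket
import ssl

PRODUCTION_HOST = "www.example.com"
STAGING_EDGE_HOST = "www.example.com.edgekey-staging.net"  # assumed staging edge hostname

def fetch_status_from_staging(path="/"):
    """Connect to the staging edge, but negotiate TLS and send the Host header
    as production so the staged configuration for the real site is exercised."""
    staging_ip = socket.gethostbyname(STAGING_EDGE_HOST)
    ctx = ssl.create_default_context()
    with socket.create_connection((staging_ip, 443), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=PRODUCTION_HOST) as tls:
            tls.sendall(
                f"GET {path} HTTP/1.1\r\nHost: {PRODUCTION_HOST}\r\n"
                f"Connection: close\r\n\r\n".encode()
            )
            response = b""
            while chunk := tls.recv(4096):
                response += chunk
    return response.split(b"\r\n", 1)[0].decode()  # e.g., "HTTP/1.1 200 OK"

print(fetch_status_from_staging("/"))
```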

Monitoring

There are two categories of monitoring that we should consider with third-party solutions: integration monitoring (is the integration working?) and impact monitoring (what impact does the solution have on our users?). Obviously, we want to ensure that the integrated service (CDN, message processor, etc.) is stable; hence, we should be looking at whether the third party is up or down. Additionally, we want to know when the integrated service is dropping connections, becoming overloaded, or introducing processing delays. Some third-party integrations rely entirely on APIs; thus, monitoring the integration is a bit more straightforward, as is the case with an SMS or email provider.

Impact monitoring is a bit different. When we relinquish some control of our site to third parties—for example, content delivery by CDN or steering to data centers by DNS providers—monitoring a third party’s impact on our users becomes a bit more challenging. To understand how end users are impacted by these critical path integrations, we should rely on synthetic monitoring or RUM.

Uses for synthetic monitoring

Synthetic monitoring is a dimension of end-user experience testing that relies on probes or nodes that run canned tests at regular intervals. Many third parties specialize in this type of APM solution; these include Catchpoint, Keynote, Gomez, Pingdom, and Thousand Eyes. This type of monitoring is great for things as simple as testing whether an endpoint is up or down, or as complicated as transactional workflows.

Here are some things to do:

  • Utilize real-time push APIs rather than polling for aggregated test data. If you’re using synthetic tests to identify failures, polling APIs will delay your Mean Time to Detect (MTTD) because of reporting aggregation delays, whereas push APIs can ensure that you get the raw results (see the sketch following this list).

  • Create multiple synthetic tests for each CDN/edge configuration. Each CDN can behave differently within the same region, and synthetic tests can possibly reflect regional performance and reliability issues.

  • Monitor for content errors such as HTTP error codes or connection resets or timeouts on child assets (not just base page) to better understand how a CDN provider is handling your content.

  • Run DNS tests against authoritative name servers so that you remain aware of issues related to your traffic steering; testing against default resolvers will not isolate issues to your DNS provider(s).

  • Ensure good test node coverage over your highest-traffic geographies and maintain low test intervals to ensure maximum test data. This will allow you to flag for errors from a consensus of servers over a smaller rolling window.
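
Here is the consensus idea from the list above as a small sketch. It assumes test results are being pushed to you and have already been parsed into dictionaries with node, region, and success fields; the threshold is illustrative.

```python
from collections import defaultdict

CONSENSUS_THRESHOLD = 0.5  # fraction of a region's nodes that must agree

def regions_in_consensus_failure(results):
    """`results` are pushed test results for one rolling window, e.g.
    {"node": "lax-1", "region": "us-west", "success": False}."""
    by_region = defaultdict(lambda: {"nodes": set(), "failed": set()})
    for r in results:
        by_region[r["region"]]["nodes"].add(r["node"])
        if not r["success"]:
            by_region[r["region"]]["failed"].add(r["node"])
    return [
        region for region, seen in by_region.items()
        if len(seen["failed"]) / len(seen["nodes"]) >= CONSENSUS_THRESHOLD
    ]
```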

And here are some pitfalls to avoid:

  • Don’t rely on polling of third-party reporting APIs. This data is generally aggregated.

  • Don’t configure long test intervals to save money. Long test execution intervals will limit granularity. Because each node will generally run a test only once within a given interval, it’s unlikely that monitoring consecutive events from a single node will flag failures. Instead, consensus for a test over a given interval might prove more fruitful.

  • Don’t run static page tests to monitor your CDNs. Synthetic test nodes will likely hit the same CDN edge Point-of-Presence (PoP) repeatedly and artificially improve cache efficiency anyway. Therefore, static page testing will tell you only how well a CDN serves cached content; this does not represent how well a CDN serves content from origin to end user.

  • Don’t pick nodes for the sake of picking nodes. Some nodes might not be on eyeball networks (such as Comcast or Time Warner); instead, nodes sit on backbone networks in data centers or peering facilities. Performance data from backbone networks might not be a reflection of what real eyeball networks experience.

  • Don’t select nodes that are regionally disparate from your actual users. Selecting a node in Timbuktu when most of your customers are in North America is simply illogical.

Uses for RUM

RUM also falls within the dimension of end-user experience testing. The key distinction is that RUM is executed on the client side. The improvements in browser APIs have allowed for enhanced reporting data. We now have access to navigation timing APIs—page load time—and more interestingly, resource timing APIs—per object load times—across most modern browsers. The same third parties that offer a synthetic monitoring solution often offer a complementary RUM solution in the form of a JavaScript beacon.

Here are some things to do:

  • Take advantage of both navigation and resource timing APIs to pinpoint content delivery issues.

  • If using vendor-agnostic hostnames (hostnames that are not vendor specific), implement a custom header to identify which third-party vendor was responsible for object delivery.

  • If your company has a Data Science/Performance team, it is likely that real-user data is already being collected and processed through a data pipeline. They might be enticed to enhance that data pipeline to include enhanced delivery details from resource timing APIs.

  • Ensure Security and Privacy teams evaluate the RUM beacons that you are considering using. This is especially important because RUM can expose personally identifiable information (PII), cookies, and IP details. This might run afoul of the EU’s General Data Protection Regulation (GDPR)8 compliance rules, which went into effect on May 25, 2018.

Again, here are some pitfalls to avoid:

  • Don’t reinvent the wheel; if you have a Data Science or other team concerned about RUM, it might already have a data pipeline9 that you can utilize. Similarly, Google, Amazon Web Services (AWS), and others offer solutions for message queueing and streaming.

  • Don’t rely on third-party APIs as a conduit for monitoring third-party health. Third parties generally do an excellent job of performing what they’re paid to do; however, their reporting APIs are not necessarily their core competency. If you rely on third-party reporting APIs, you are likely relying on data that is stale or inaccurate. This is because of data aggregation delays, processing delays, data pruning, and repackaging of that data for your consumption. Instead, we want to rely on as much real data as possible, such as data that can be pulled from logs in real time.

Tooling

Tooling is a critical part of our third-party integration, especially in a larger organization. As SREs responsible for third-party integrations, we want to avoid becoming a choke point as much as possible. To that end, we should build systems that take advantage of everything the third party has to offer via APIs. These APIs facilitate integrations by allowing you to build the tools needed to effectively manage your third-party integrations. As these APIs are employed, we will want to consider AbEx and avoid the use of custom features that could create vendor lock-in. We can see an example of this in CDN purge APIs—some vendors utilize surrogate keys, some utilize some form of regular expression, some utilize wildcards, and some utilize batching. Building any one of these vendor-specific mechanisms into our regular workflow could cause an issue if we need to replace or add another vendor’s solution.

To further remove ourselves as a choke point, we want to avoid any integration that requires that we play a role in access control. Managing access to vendor portals for every engineer is overly cumbersome and has the potential to lead to security issues and technical debt.

Finally, as we create tooling around our vendor’s API, it’s critical that we consider abandonment. Try to maintain a modular architecture to support a variety of third parties. This layer of abstraction allows for additional providers and swappability should a third-party vendor need to be replaced.
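
As a sketch of what that abstraction layer might look like for CDN purging, consider the following. Both vendor classes, and the API behaviors hinted at in their comments, are hypothetical; the point is that callers depend only on the interface.

```python
from abc import ABC, abstractmethod

class CdnPurgeClient(ABC):
    @abstractmethod
    def purge(self, urls):
        """Remove the given URLs from the vendor's caches."""

class VendorAPurge(CdnPurgeClient):
    def purge(self, urls):
        # e.g., this hypothetical vendor batches URLs into a single API call
        print(f"POST /purge/batch with {len(urls)} URLs")

class VendorBPurge(CdnPurgeClient):
    def purge(self, urls):
        # e.g., this hypothetical vendor purges one URL per request
        for url in urls:
            print(f"DELETE /content?url={url}")

def purge_everywhere(clients, urls):
    """Callers never know (or care) which vendors sit behind the interface."""
    for client in clients:
        client.purge(urls)
```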

Automation

As an extension of tooling, we want to be able to automate some of the functionality of our third-party solutions as much as possible. The possibilities here are endless and fully depend on how you make use of your third party’s APIs. To give you an idea of what we can do, let’s look at the following examples:

Automating content removal on account closure

When a user closes their account, certain privacy laws apply and require that user PII be deleted. As a result, it’s possible to utilize your company’s data pipeline (or a similar message queue) to identify such events and automate the content removal process. This will be necessary for GDPR compliance.
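
Here is a sketch of the pub-sub side, assuming a hypothetical event schema on your internal pipeline and reusing a purge client along the lines of the abstraction shown earlier.

```python
import json

def handle_account_event(message_body, purge_client):
    """Consume one pipeline message; purge user content on account closure."""
    event = json.loads(message_body)
    if event.get("type") != "account_closed":
        return
    user_id = event["user_id"]
    # Assemble the user-specific assets (avatars, uploads) that must be removed.
    urls = [f"https://static.example.com/users/{user_id}/{obj}"
            for obj in event.get("stored_objects", [])]
    if urls:
        purge_client.purge(urls)
```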

Automating data center/endpoint selection

In the event that a critical service is down behind an endpoint, we can use DNS APIs to steer traffic away from the affected endpoint.
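
What this looks like in practice depends entirely on the DNS provider’s API. The sketch below uses a hypothetical steering endpoint and payload purely to show the shape of the automation.

```python
import json
import urllib.request

def drain_datacenter(api_token, record_id, unhealthy_pool):
    """Set the weight of the unhealthy pool to zero so resolvers stop
    receiving answers that point at it."""
    payload = json.dumps({"pool": unhealthy_pool, "weight": 0}).encode()
    req = urllib.request.Request(
        f"https://api.example-dns.com/v1/records/{record_id}/steering",  # hypothetical endpoint
        data=payload,
        method="PATCH",
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```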

Automating certificate renewal

When certificates are about to expire, you could easily use your CA’s API to reissue and reinstall certificates on your endpoints rather than relying on manual intervention or risking an outage.
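
The expiry check can be done with nothing but the standard library; the renewal call itself is stubbed out because it depends entirely on your CA’s API or ACME client. A minimal sketch:

```python
import datetime
import socket
import ssl

RENEW_WINDOW_DAYS = 30

def days_until_expiry(hostname, port=443):
    """Read the certificate currently served for `hostname` and return the
    number of days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as raw:
        with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=datetime.timezone.utc
    )
    return (expires - datetime.datetime.now(datetime.timezone.utc)).days

def maybe_renew(hostname, ca_client):
    """`ca_client` stands in for your CA's API client or an ACME client."""
    if days_until_expiry(hostname) <= RENEW_WINDOW_DAYS:
        ca_client.renew_and_install(hostname)  # hypothetical call
```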

Logging

Third-party reporting APIs are generally not great. Often, the reporting data we receive via pull APIs is aggregated, processed, and scrubbed. Aggregation delays are common with pull APIs, so this data should not necessarily be used for raising alerts. As an SRE, you need to know what’s happening now, not what happened 15 minutes ago. Because of this, third-party integrations have often been looked at as black boxes with little to no visibility into their operations.

But times have changed. Many third parties offer real-time data feeds that include all sorts of useful data; CDNs are offering delivery receipts for every object; DNS providers are offering query data; and synthetic monitoring providers offer granular test data.

The benefit of all this available data is that we’re able to build with it as we see fit. However, this is also a risk. Just because we have the data does not mean that all of it is useful. As we utilize real-time logging data, we need to be sure we’re keeping what we need to supplement our monitoring as well as provide a way to track down issues. For example, CDN logs can include browser type, extension details, client IP details, cookies, and other request headers. Much of this is unnecessary for most applications; isolating HTTP response codes by region, for example, requires only a handful of fields.
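
Here is a sketch of that trimming plus a simple regional aggregation; the field names are assumptions about the log schema, not any vendor’s actual format.

```python
from collections import Counter

KEEP_FIELDS = ("timestamp", "status", "region", "object_path")

def trim(record):
    """Drop fields (cookies, full header sets, client details) that most
    applications never need, before indexing or retaining the record."""
    return {k: record[k] for k in KEEP_FIELDS if k in record}

def regional_5xx_counts(records):
    """Aggregate server errors by region from trimmed log records."""
    counts = Counter()
    for record in map(trim, records):
        if int(record.get("status", 0)) >= 500:
            counts[record.get("region", "unknown")] += 1
    return counts
```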

Finally, logging can be absurdly expensive. Not only do some vendors charge for the delivery of this data, but you’re also responsible for consuming, indexing, processing, and retaining it. This can be a large undertaking in and of itself, and you should approach it with caution.

Disaster planning

Although we don’t like to think about it, working with a third-party vendor incurs some risk. As part of our playbook, we will need to consider what to do when the vendor is hard down. Disaster planning should include considerations for the removal of the third party altogether. You will likely have to consider the following:

  • Maintaining an additional (as active or standby) provider

  • Having capacity to serve requests ourselves

  • Queueing requests to be handled later

Removal of a CDN might require that we accept a certain amount of latency. Removal of a payment processor might require an optimistic approach to ensure a good customer experience. Some delay in a welcome message or payment receipt via email might not be entirely noticeable to an end user. Remember, slow is better than down.

Communication

With third-party integrations, it’s important to not live in a silo. Talking to a solutions or sales engineer actually plays to your advantage for discussing product enhancements, discovering unknown or underutilized features, understanding their product roadmap, and preparing third parties for future events such as product releases. Additionally, it pays to maintain bidirectional communication to align priorities. Third-party vendors should be under NDA; therefore, it’s helpful to communicate any upcoming projects that might alter the relationship.

Establishing a regular call cadence can be useful but might not be ideal for all SREs. Often, these calls are tedious and time-consuming, and they can conflict with production issues. Additionally, there is a bit of organization required, especially when you’re working with multiple vendors. To that end, it’s often a good idea to work with a project manager or someone similar who can facilitate these calls and keep detailed notes. Support from a project manager will help you keep your vendors aligned with your team’s and company’s priorities.

Decommissioning

When it comes time to decommission a service, we consider all of our dependencies. The same is true of dealing with third parties. We obviously need to consider removal of tooling, service monitoring, alerts, and data collection services/endpoints. In addition, we are obliged to consider a few other dependencies that play a role in the integration.

Sometimes, termination is not as simple as stating “you’re fired.” Terms of the termination are generally stated in your service agreements. If you’re deciding to terminate a contract for financial reasons, you might be able to negotiate in advance of termination. In these cases, use caution when evaluating contracts that include Most Favored Customer (MFC) clauses because these often make it harder to negotiate a better deal. Some agreements also allow for autorenewal months prior to a contract’s end date. Your legal and procurement teams might have prevented this, but it’s important to look at the contract and discuss termination with them.

Communication obviously played a role in the purchase and ongoing relationship of your third-party integration. Communication is even more critical during contract termination. If, for instance, the third party was not meeting SLAs but that was never communicated, it’s not necessarily fair to that vendor and could prove costly for the third-party sales teams.

The tech industry is small, and stories of a bad breakup will haunt you and your company. These stories could play a role in your future integration attempts. Sales organizations remember which customers were most difficult to work with, and salespeople move between companies, especially within the same industry. Failure to communicate during the decommissioning process can tarnish reputations and complicate future relationships.

The best way to avoid this is to communicate your intent to terminate a third party well in advance of the end date—at least a quarter ahead. This gives the third-party sales team an opportunity to rectify any problems that might have persisted during the integration life cycle. We vote with our dollars. Losing a large enough customer can incentivize a third party to prove themselves.

Closing Thoughts

Many SREs consider working with third-party solutions to be a disappointment from both an intellectual-challenge and career-path perspective. Not only would we rather solve large problems by building our own custom, in-house solutions, but we’re often encouraged to do so to advance our professional trajectory. However, custom solutions are not necessarily in the best interest of our company or our end users.

Quite often, buy options offer as much of a challenge as the build or adopt options. We should shift our mindset away from third-party solutions getting in the way of our creativity and instead consider how a third-party solution plays a role in end-user experience and site reliability.

As we close out, you should take the following points with you:

  • Third parties are an extension of your stack, not ancillary.

  • If it’s critical path, treat it like a service.

  • Consider abandonment during the life cycle of an integration.

  • The quality of your third-party integration depends on good communication.

Contributor Bio

Jonathan Mercereau has spent his career leading teams and architecting resilient, fault-tolerant, and performant solutions, working with the biggest players in DNS, CDN, Certificate Authority, and Synthetic Monitoring. Chances are, you’ve experienced the results of his work, from multi-CDN and streaming-algorithm optimization at Netflix to multivendor solutions and performance enhancements at LinkedIn. In 2016, Jonathan cofounded a SaaS startup, traffiq corp, to bring big-company traffic engineering best practices, orchestration, and automation to the broader web.

1 Site Reliability Engineering, Introduction.

2 Procurement teams are the experts in handling of purchase orders and subcontracts. They purchase the goods and services required for the project and company initiatives. They pay the bills to keep the website on, but also ensure that legal and business obligations are met.

3 Several third-party CDN providers offer ESI solutions. Some providers offer basic ESI support based on Specification 1.0, while others offer extensions and custom solutions.

4 SPA frameworks dynamically rewrite the current page rather than reloading the page from the server; that is, SPAs use client-side rendering instead of server-side rendering. They do so by loading HTML, JavaScript, and stylesheets in a single page load and caching those objects locally in the browser. Content is then loaded dynamically via WebSockets, polling over AJAX, server-sent events, or event-triggered AJAX calls. For a cached SPA, traditional page load timing using Navigation Timing API is incredibly low; however, the page is likely unusable without the dynamic content. Triggering a Resource Timing API in modern browsers well after expected page load times can help with the ability to diagnose and triage user experience issues.

5 Use of monitoring-based automation is growing among large companies. Synthetic monitoring solutions utilize a wide dispersion of nodes running tests on set intervals. Automation relying on consecutive event triggers from synthetic test nodes can be unwieldy and inconsistent, especially between geographies. Instead, triggers based on event consensus among test nodes can yield better results. RUM can be just as unwieldy for larger sites because the amount of data is much larger. A more surgical approach to automation is required for working with RUM-based automation.

6 An origin server in the context of CDNs is simply the source of the original content.

7 A CDN edge hostname generally maps to a single configuration/protocol combination; that configuration contains a mapping to the origin server and content rules. Conditional rules (such as request headers or context-path) can trigger alternate origin server selection, caching rules, header modification, and so on.

8 GDPR applies to all foreign companies and extends current data protection regulations for all EU residents. You can find more information at https://www.eugdpr.org.

9 Many companies use a data pipeline for interapplication messaging following a publish–subscribe (pub–sub) architecture to trigger events. These pipelines are often also utilized for streaming data, such as page views, application logs, or real-user data.
