© Andrew Davis 2019
A. Davis, Mastering Salesforce DevOps, https://doi.org/10.1007/978-1-4842-5473-8_11

11. Keeping the Lights On

Andrew Davis
San Diego, CA, USA

Keeping the lights on in your Salesforce org is generally just a matter of paying your Salesforce bill. They do most of the hard work for you. Nevertheless, in this chapter we’ll look briefly at the different ways in which “Dev” and “Ops” can work together in the Salesforce world, as well as how some of the more truly operational aspects of managing a Salesforce org remain relevant and can be understood as an extension of the broader development lifecycle.

Salesforce Does the Hard Work for You

I decided to focus my career on the Salesforce platform in part because I enjoyed the simplicity with which businesses and nonprofits could build on the platform. When I first started looking at jobs in this field, I naturally looked at the job openings at Salesforce themselves. I was a bit daunted to see the depth and sophistication of the skills they seek in their employees: Oracle database administration, site reliability engineering, offensive security engineering, network analysis, data science, cryptography, the list goes on.

If you think that the content presented in this book is sophisticated, it’s nothing compared to the level of product engineering happening under the hood to allow Salesforce to scale globally, deliver features aggressively, and maintain industry-leading uptime.

Fortunately, that’s all someone else’s job.

That’s the reason we (or our employers) pay Salesforce, and the reason that SaaS makes sense. Let someone else do the performance tuning on the databases. Let someone else understand load balancers, cache invalidation, compilers, and performance optimization. From our point of view, Salesforce basically just works.

If there’s an outage or performance impairment in a Salesforce instance, we can check https://trust.salesforce.com. Outages do happen occasionally, but they’re rare, generally short-lived, and followed by a root cause analysis and improvements to prevent them from happening again.

We can also file support tickets with them, which generally get handled promptly. You can pay for various tiers of support up to and including 24x7 Mission-Critical Applications (MCA) phone support.

So from an operations point of view, keeping Salesforce in operation is enormously complex—but fortunately not for admins.

What Does Dev and Ops Cooperation Mean?

The term DevOps originates from a talk by John Allspaw and Paul Hammond at the 2009 Velocity conference called “10+ Deploys per Day: Dev and Ops Cooperation at Flickr.”1 That talk was significant for many reasons, but largely because it showed the benefits of cooperation between two teams who had traditionally worked in isolation.

There are several reasons why this isolation had evolved. First, there is a fundamental tension between the goals of development and operations. Developers’ job is to continually innovate to better meet the evolving needs of the business. The goal of admin/operations teams is to maintain a trusted system by keeping everything running smoothly without downtime. For operations, innovation and change imply the risk of breaking things. There is thus a tension between innovation and trust. These teams often reported to different managers (e.g., CTO vs. CIO) and had competing goals and incentives, which further deepened the separation.

Another reason for the separation was to enforce separation of duties, a compliance requirement that dictates that those who develop a system should be different from those who put it into production. Separation of duties is a long-standing principle in financial accounting, and organizations subject to audits and regulatory oversight extend that to their IT systems. Access controls are often used to actually prevent developers from releasing code into production environments.

The basic reason why this separation became dysfunctional can be explained by Conway’s Law:

Organizations which design systems are constrained to produce systems which are copies of the communication structures of these organizations. 2

Melvin Conway was a programmer, but Conway’s Law is a sociological statement that basically says that your organizational communication structure will affect and limit your architecture. The challenge that the Dev vs. Ops separation introduced was that getting software running in production depends on both the software and the underlying infrastructure, and the two cannot be optimized independently. So if the teams responsible for these areas are not communicating well, naturally the overall system will suffer.

Matthew Skelton and Manuel Pais maintain a web site (https://web.devopstopologies.com) documenting some of the many patterns and antipatterns for Devs and Ops to work together effectively. The examples are thought-provoking, and in practice our projects and teams may shift between different patterns and antipatterns over time.

Since the term DevOps was introduced, it’s been observed that there are frequently many other communication gaps in the software value chain. Wherever there are role boundaries, such gaps may exist, for example, between business analysts and architects, between developers and QA testers, and between security specialists and everyone else. Terms such as TestOps, DevTestOps, and DevSecOps have come into use, but the underlying concept is the same: creating silos between teams increases inefficiency and generally doesn’t succeed in reducing risk. The lesson here is that “DevOps” is not really about “Dev” vs. “Ops” but rather about ensuring that there is strong cooperation between all of the teams who participate in producing the final system.

The remedy for this dysfunction has been described as an “Inverse Conway Maneuver.”3 This sounds like a surgical technique, but basically means that you should organize your teams (and their communication) in a way that is most conducive to your desired architecture.

Bringing all of this academic discussion back down to earth, what does “Dev and ops cooperation” look like on the Salesforce platform?

Since the actual “operations” of Salesforce is mostly handled by Salesforce themselves, the cooperation challenges on the Salesforce platform tend to look different than they might on other platforms. But the core idea remains the same: silos between teams can negatively impact the organization’s effectiveness and the effectiveness of individual teams. Cooperation and communication across teams are critical if they are all collaborating around a single org.

If your Salesforce team is small, any cooperation problems are more likely due to personalities rather than organization structure. If your team fits Amazon’s two pizza rule (you can feed them all with two pizzas), your communication will hopefully be pretty smooth, and confusion can be quickly resolved.

As teams scale and grow, however, communication becomes more challenging. Jez Humble has often made the point4 that enterprises are complex adaptive systems. By definition this means that individuals, including the CEO, don’t have perfect knowledge of the entire company and are not able to fully predict the effects of any action. Your Salesforce org is also a complex adaptive system once it reaches a certain scale.

If you’re the sole admin for a new, small org at a small company, you might be familiar with every single customization and interaction in that org. But as your company grows, the dev and admin team grows, and time passes, the complexity grows beyond what one person can hold in their mind. Even a solo admin is likely to forget the details of customizations they implemented several months or years ago.

The basic point is that developers need to be aware of how their applications are behaving in production so that they can debug or improve them if needed. And admins need confidence that development has been well tested before they inherit it in their production org. They also need applications to provide clear information about what they’re doing and why, so problems can be resolved easily. The two groups need a way to collaborate with each other easily should problems arise.

Version control can go a long way toward providing shared visibility and answers to “what changed?” even across large and distributed teams. It’s also important to establish lines of communication between those who initially built functionality, the users of that functionality, and the admins who support those users.

Multiple development teams can be more easily supported when they can develop their work as packages, with clear boundaries between them. Package boundaries reduce the risk and uncertainty of merging large and complex metadata. Packages also establish clarity about the original authors of certain functionality, making it easy for admins to know who to talk to if questions or side effects arise.

You can think of your Salesforce “Operations” as being 80% handled by Salesforce themselves and 20% by your own production admins. You can then consider the other roles (or potential roles) you have in your team, such as business analysts, admins (the app-building kind), code-based developers, testers, security specialists, and so on. If you are all collaborating around a single production org, then there absolutely needs to be good and clear cooperation between all of these teams, since that production org is a single interrelated system.

Since these disparate teams are (whether they like it or not) collaborating on a single interrelated system, what are the possible ways in which silos might interfere with effective cooperation? In the words of Skelton and Pais, “There is no ‘right’ team topology, but several ‘bad’ topologies for any one organization.” Common failure modes include
  • Not providing developers access to a Salesforce DX Dev Hub (you can safely give them the “free limited access” production license).

  • Long-running parallel development projects not developing in an integrated environment.

  • Developers having no access to test in a production-like environment.

  • Developers being unable to submit Salesforce support tickets.

  • Infrequent release windows.

  • Developers neglecting to establish production logging.

  • Inefficient release paths or handoffs from development through to production.

  • Parallel consulting partner-led projects competing with one another. Development teams from different vendors can easily introduce silos.

This list is not comprehensive, and challenges may change over time.

If you are maintaining multiple independent production orgs, then the need for collaboration across the teams maintaining those orgs is reduced. But companies with multiple orgs invariably have some common standards, processes, and applications across those orgs, and it will be in everyone’s best interest if you establish a Center of Excellence to allow for knowledge sharing across these teams.

Salesforce Admin Activities

It’s well beyond the scope of this book to cover every activity an admin might need to undertake. The focus in this section is to share tips that can make the overall development lifecycle smoother.

User Management

User management is a fundamental aspect of a Salesforce admin’s responsibility. By definition, an admin is delegated by the company to maintain the Salesforce org for the users. It’s their responsibility to ensure that people who should have access to the org promptly get the right level of access. It’s also their responsibility to ensure that people who should not have access do not get that access.

The most reliable way to ensure correct access is for your company to maintain a single sign-on (SSO) system that acts as the single source of truth for current employees and to use that SSO for accessing Salesforce. Generally, password access to Salesforce should be prevented, except for admins to use in case there’s a problem with the SSO.

There are two extremes of access that can interfere with the overall flow of development. One extreme is having too many system administrators in production—this brings a serious risk of untracked and conflicting changes. The other extreme is not allowing developers any kind of access to a production org.

In some cases, your developers may not need, or should not have, access to Salesforce production data or capabilities. Yet for developers to effectively debug production issues, they need to see debug logs, and viewing those logs requires the “View All Data” permission, which may be an inappropriate level of access.

The Salesforce DX Dev Hub is necessarily a production org, and for developers to use Salesforce DX scratch orgs or packaging, they need a user account on that production org. Salesforce offers a “Free Limited Access License” that allows use of a Dev Hub on production without the ability to view data or change metadata in that org. If security concerns bar you from even offering such safe and restricted licenses, you will need to establish a separate production org that the developers can use collectively as their Dev Hub.

You will also need to establish one or more integration users that can be used by integrated systems. Ideally, each integrated system (including your CI system) should have its own integration user for security and logging purposes, but since integration user licenses cost as much as regular user licenses, most companies opt to combine multiple integrations under a smaller number of integration users.

Security

Salesforce security follows a layered model, where different security mechanisms such as Profiles, permission sets, and roles can provide increasing levels of access to users. By default, no one can do or access anything, but each of these layers can add permissions; the layers never remove privileges, because permissions are additive.

Every Salesforce user is assigned a single Profile, which establishes their basic security privileges. Salesforce provides some standard profiles, notably “Standard User,” “System Administrator,” and “Integration User.” These profiles can be cloned and customized as needed. You can also create Permission Sets, which work similarly to profiles except that a user can be assigned more than one Permission Set.

It’s in the best interest of security as well as org governance for there to be a very limited number of people holding System Administrator privileges. Salesforce provides very granular access controls that allow you to add admin-like privileges to a permission set, which you can then apply as needed to users.

Use Permission Sets Instead of Profiles

Because of an early design decision in the Metadata API, Profiles are notorious for being the single biggest pain in the butt for Salesforce release managers. When you retrieve a profile using the Metadata API, the profile definition you receive varies depending on what other metadata you have requested. Managing profiles in a CI tool thus requires sophisticated tooling, and Salesforce release management is plagued by missing profile permissions and by deployment errors caused by permissions that reference metadata not present in the target org. Even groups with sophisticated CI/CD processes sometimes choose to manage profile permissions manually.

Salesforce is working on improvements, but in the meantime I would strongly suggest you avoid using Profiles to provide permissions and use permission sets instead. Since API version 40.0, Permission Sets are always retrieved and deployed in a consistent way, making them a better candidate for CI/CD.

Permission sets can be used for every type of user characteristic except for the following:
  • Page layout assignments (which page layout a user sees for a record)

  • Login hours

  • Login IP ranges

  • Session settings

  • Password policies

  • Delegated authentication

  • Two-factor authentication with single sign-on

  • Desktop client access

  • Organization-wide email addresses allowed in the From field when sending emails

I would thus suggest that you use this list to determine what profiles you actually need in your organization. For example, if you maintain a call center and for security you want to lock call center users to a particular login IP range, whereas your salespeople need access to Salesforce from any location, that’s perfect justification for having a “Call Center” profile. Don’t create different profiles for every category of user in your organization. In fact, even the need for different page layout assignments is small, since you can simply restrict access to certain fields to prevent those fields from cluttering a record layout for users.

Managing Scheduled Jobs

Salesforce provides a system for scheduling jobs that can run periodically in an org. Scheduled jobs are typically used to manage batch processing for activities that would be too slow or too computationally expensive to run on the fly.

One large nonprofit organization I worked with prepared many layers of elaborate reports on daily, monthly, quarterly, and annual cycles. The reports aggregated and summarized data across millions of opportunities, far too much to be handled using standard Salesforce reports. They opted to use Salesforce for this task instead of a separate BI tool, and so our team wrote extensive batch Apex code to summarize all of this data and then created scheduled jobs to run the appropriate batch jobs at the appropriate times.

Such scheduled jobs can represent critically important aspects of your Salesforce configuration. Therefore (you probably know where I’m going with this), they should ideally also be stored in version control, with an automated process used to ensure that they are in place. Such a system becomes extremely helpful when promoting such customizations between environments. If scheduled job definitions are stored in version control, they can be promoted and tested gradually between environments and deployed to production when the team is confident in them.

Such a system needs to be idempotent, meaning that you need to ensure you can run the job scheduling task repeatedly without it creating multiple jobs. In practice, I haven’t seen many teams doing this, but it’s worth considering if scheduled jobs are critical for you, and you want this level of reliability.

Probably the smoothest way to manage this is to write the scheduling system in Apex itself, running queries to check existing scheduled jobs and using the System.schedule() method to schedule any that are missing. You can then run this code as anonymous Apex as a postdeployment step. Listing 11-1 shows the syntax for scheduling new jobs and aborting all existing scheduled jobs. Production-ready code should include further checks, such as validating existing jobs and ensuring that jobs are not running before canceling them.
  public with sharing class JobScheduler {
    // Schedule all required jobs. As written, running this repeatedly will create
    // duplicate jobs; see the idempotency discussion above.
    public static void scheduleAll(){
      System.schedule('ScheduledJob1', '0 0 2 ? * SAT', new ScheduledJob1());
      System.schedule('ScheduledJob2', '0 1 * * * ?', new ScheduledJob2());
    }
    // Cancel every scheduled Apex job in the org
    public static void abortAll(){
      for(CronTrigger ct : getScheduledJobs()){
        System.abortJob(ct.Id);
      }
    }
    private static CronTrigger[] getScheduledJobs() {
      // CronJobDetail.JobType '7' identifies scheduled Apex jobs
      final String SCHEDULED_JOB = '7';
      return [
          SELECT Id, CronJobDetail.Name, CronExpression, State
          FROM CronTrigger
          WHERE CronJobDetail.JobType = :SCHEDULED_JOB];
    }
  }
Listing 11-1

Starter code for scheduling and aborting jobs
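
To make the scheduling idempotent, as described earlier, one option is to check for an existing job before scheduling it. The following is a minimal sketch of that idea; the helper name scheduleIfMissing is hypothetical.

  public static void scheduleIfMissing(String jobName, String cronExpression, Schedulable job) {
    // Only schedule the job if no scheduled Apex job with this name already exists
    Integer existing = [
        SELECT COUNT()
        FROM CronTrigger
        WHERE CronJobDetail.Name = :jobName
          AND CronJobDetail.JobType = '7'];
    if (existing == 0) {
      System.schedule(jobName, cronExpression, job);
    }
  }

A method like this could replace the direct System.schedule() calls in scheduleAll(), making the postdeployment step safe to rerun.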

One challenge related to scheduled jobs is that you can encounter a deployment error if you update an Apex class that is referenced in a scheduled job. If your classes are written such that changing a class definition midway through a scheduled job being processed won’t cause problems, you can enable the Apex Setting “Allow deployments of components when corresponding Apex jobs are pending or in progress.”

Monitoring and Observability

Monitoring and observability are two concepts related to gaining insight into the behavior of production systems to become aware of and diagnose any problems that arise. Figure 11-1 helps explain the relationship between these concepts. Observability refers to the degree to which you are able to get information about a particular behavior or system, especially if you need to debug it. For example, debug logs provide one form of observability, but only if those who can make sense of the logs have access to them. Monitoring takes a broader, high-level view of the system, typically tracking trends over time. Monitoring depends on observability because if the behavior of a system can’t be observed, it can’t be monitored. Finally, analysis is the activity of using monitoring to gain actionable insights into a system. This can be manual analysis, for example, debugging, or automated analysis, for example, raising alerts if exceptions are encountered or thresholds exceeded.
Figure 11-1

Analysis depends on monitoring, which in turn depends on observability

The reason this topic is relevant in the context of DevOps is that if production systems are not observable, it’s far more difficult to improve the system, and we’re left to rely on anecdotal feedback such as users submitting cases. A production system has zero observability if there is no way to inspect that system to see how it’s behaving. Conversely, providing access to debug logs, event notifications, and performance metrics means there’s a high degree of observability. For observability, more is better, but there are legitimate security concerns that might cause teams to limit who has access to that information.

In terms of monitoring, just because a system can be inspected deeply doesn’t mean that you should monitor every aspect of that system. Google’s Site Reliability Engineering book advises:

Your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. … “What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise. 5

The Site Reliability Engineering book advises that there are four golden signals of monitoring that you should focus on:
  1. Latency (the time it takes to request a service)

  2. Traffic (how much demand there is on your system)

  3. Errors

  4. Saturation (resource utilization)

Salesforce monitors and helps ensure low latency for their services. But you might be interested to observe and monitor the page load time for a complex Visualforce page you built, for example. Traffic is normally fairly predictable for your internal users, but if you are hosting a Community, you might want to monitor that traffic, especially if it’s subject to surges due to marketing campaigns and so on. Error reporting is an important topic that we’ll deal with later, as is saturation, which mostly equates to governor limits for Salesforce.

Salesforce provides greatly simplified monitoring and observability tools compared to what a traditional sysadmin might deal with when managing a server. This is because Salesforce itself is handling the monitoring of things like uptime, CPU load, network latency, and so on. Nevertheless, as users of Salesforce, we build business-critical capabilities on that platform and so may need our own levels of monitoring and observability. It’s worth looking at the existing options and ways they can best be exploited.

Built-In Monitoring and Observability Tools

Salesforce continues to expand the different kinds of monitoring available to users. These different tools are spread across the Setup UI, but many of them are in the Environments ➤ Jobs, Environments ➤ Logs, and Environments ➤ Monitoring sections. Debug logs are typically the most useful for developers. The limits on debug logs and the capabilities for parsing them have been greatly improved by the Salesforce DX team, so that it’s finally possible to set your log levels to “Finest” without exceeding log size limits, step through Apex code using the Apex Replay Debugger in Visual Studio Code, and see how the values of your variables evolve as your code executes. Frontend developers working with JavaScript and Lightning Components have had such capabilities (e.g., via Chrome Dev Tools) for many years, so it’s a relief to now have such visibility for Apex.

One challenge with debug logs is that you require the “Manage Users” and “View All Data” permissions to see them, which may not be an acceptable level of access to give developers in production. The simplest solution to this is to create a Permission Set called (for example) “Debug and View All Data” that you can assign to developers temporarily if they are struggling to debug a production issue. That’s similar to the approach used to get hands-on support from Salesforce or an ISV in your org: an admin “Grants Account Login Access” to the support agent for a period of time.
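
If you want to script that temporary assignment rather than clicking through Setup, a few lines of anonymous Apex can do it. This is a rough sketch; the permission set API name Debug_And_View_All_Data and the username are hypothetical placeholders.

  // Temporarily grant the (hypothetical) debug permission set to a developer
  PermissionSet ps = [SELECT Id FROM PermissionSet WHERE Name = 'Debug_And_View_All_Data'];
  User dev = [SELECT Id FROM User WHERE Username = 'developer@example.com'];
  insert new PermissionSetAssignment(AssigneeId = dev.Id, PermissionSetId = ps.Id);

  // Later, revoke the elevated access by deleting the assignment
  delete [SELECT Id FROM PermissionSetAssignment
          WHERE AssigneeId = :dev.Id AND PermissionSetId = :ps.Id];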

A workaround that my colleagues at Appirio use regularly is to create a custom object to store error messages and an Apex class that writes the error details (including a stack trace) to that object. That class can then be invoked inside try-catch blocks in code that you want to monitor for errors. Once stored in a custom object, those error messages are persisted across time and no longer require elevated permissions to access. You can use Workflow Email Alerts to provide notifications and Dashboards for monitoring. You can now also use Change Data Capture on that object or fire dedicated platform events to make that information immediately available to external monitoring tools.
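
A minimal sketch of that pattern follows; the Error_Log__c object and its fields are hypothetical stand-ins for whatever custom object you create.

  public with sharing class ErrorLogger {
    // Persist exception details to a custom object so they survive beyond debug logs
    // and can drive reports, dashboards, and email alerts
    public static void log(String context, Exception e) {
      insert new Error_Log__c(
          Context__c     = context,
          Message__c     = e.getMessage(),
          Type__c        = e.getTypeName(),
          Stack_Trace__c = e.getStackTraceString());
    }
  }

A call such as ErrorLogger.log('AccountService', e) can then be placed in the catch blocks of code you want to monitor.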

There are other monitoring capabilities for different aspects of Salesforce, for example, inbound and outbound email logs, data and file storage limits, daily API limits, and the background jobs pages.

One of the simplest automated alerts you can establish is to “Set Up Governor Limit Email Warnings,” which sends email warnings when Apex code uses more than 50% of governor limits.6 You may want to combine that with email filters to automatically ignore certain kinds of notifications. But this provides basic insight into whether any of your custom code might be at risk of exceeding governor limits.

Add-On Monitoring and Observability Tools

Salesforce Shield is the best-known add-on to aid observability. Shield provides three capabilities: platform encryption, field audit trail, and event monitoring. The event monitoring capability of Shield aids observability by providing real-time access to performance, security, and usage data from across Salesforce. This information can then be ingested into third-party monitoring tools like Splunk. To whet your appetite for this, Salesforce provides user login and logout information for free.

Salesforce also now provides Proactive Monitoring as an add-on service. Salesforce customers who subscribe to mission-critical support or other premier support options have historically been able to contact Salesforce to get access to similar monitoring information to diagnose problems like page load times or transient errors. Proactive Monitoring provides these capabilities as a standard bundle.

There are also a number of third-party solutions that can aid in monitoring. ThousandEyes provides an enterprise-wide view of availability and page load times, including details on which Salesforce data center is being accessed, and the performance of each intermediate section of the Internet between your users and Salesforce. This can be used to monitor uptime or to diagnose network issues that might affect certain branches in a global organization, but it can also be used to monitor page load times if you have concerns about particular applications.

AppNeta is similar to ThousandEyes in monitoring availability and load times, but focuses more on analyzing and categorizing your internal network traffic with an eye toward prioritizing higher-value traffic.

Opsview is an open-core product that provides visibility of system performance across on-premise, private, and public cloud. Their Salesforce connector uses the Salesforce REST API to monitor organizational limits. If you only need to monitor Salesforce, it may be simpler to just periodically query those limits yourself using the Salesforce Limits API.7
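
If you prefer to stay on-platform, recent API versions also expose similar information in Apex through the System.OrgLimits class. The following is a minimal sketch under that assumption; 'DailyApiRequests' is one of the limit names returned by the Limits API.

  // Anonymous Apex sketch: inspect consumption of a specific org limit
  Map<String, System.OrgLimit> orgLimits = OrgLimits.getMap();
  System.OrgLimit apiRequests = orgLimits.get('DailyApiRequests');
  System.debug('Daily API requests used: ' + apiRequests.getValue()
      + ' of ' + apiRequests.getLimit());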

If you are hosting a Salesforce Community, Google Analytics remains one of the best tools to aggregate and analyze user behavior like page views, click-through rate, and time spent on each page. Because of its applicability to any web site, Google Analytics is probably one of the best-known monitoring tools. For a time at Appirio, we used it to get insight into employee usage of our internal Salesforce instance.

What to Monitor

As mentioned previously, the point of monitoring and observability is to help you identify when something is going wrong, what it is, and why it’s happening. The goal is to maintain a high signal-to-noise ratio, by only monitoring things that have real benefit and minimizing distracting information. With observability, more is generally better. But with monitoring, you should aim to be very targeted about what you want to monitor.

An occasion that often warrants monitoring is the rollout of a new business-critical feature. Such features should be debugged and performance tested in a staging environment, but it’s important to have a way to monitor their performance and usage once they’ve actually been rolled out. It’s common for development teams to be exhausted from the final push to take a new production application live. If your teams are oriented around doing projects, they will be tempted to think of release day as the last day that they have to think about that application. But if you think of each application as a product, then release day is the first day that information becomes available on whether the application is reliable and serving user needs.

There are two main things to monitor when it comes to new services: are they working? And are they helping? To determine whether your applications are working, it is important to gather error messages (e.g., into a custom object as described earlier) and also to look at page load times if performance is a concern. For ad hoc analysis of page load times, the Salesforce Developer Console is still the best tool in my opinion. The Developer Console is a bit buggy, but hidden inside it are excellent tools for analyzing performance. Salesforce wrote a blog post on how to make use of these capabilities.8

Apex itself allows you to monitor limits using the Limits Class.9 So you could also monitor and log metrics such as heap size and CPU execution time on your newly released applications if there’s a concern about a particular metric. Storing such performance metrics in a custom object creates a Salesforce-native way to monitor application performance. You can then build reports and dashboards around those if that helps.
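
As a rough sketch of that idea, you could capture a couple of these metrics at the end of a transaction; the Performance_Metric__c object and its fields are hypothetical.

  // Record governor limit consumption on a custom object for later reporting
  insert new Performance_Metric__c(
      Context__c        = 'QuarterlyRollupBatch',
      Cpu_Time_Ms__c    = Limits.getCpuTime(),
      Cpu_Time_Limit__c = Limits.getLimitCpuTime(),
      Heap_Bytes__c     = Limits.getHeapSize(),
      Heap_Limit__c     = Limits.getLimitHeapSize());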

Finally, one of the most important things you can do with monitoring systems is to quiet, disable, or delete them once an application has proven to be stable and the monitoring data no longer identifies any issues worthy of analysis. As the Site Reliability Engineering book states:

Like all software systems, monitoring can become so complex that it’s fragile, complicated to change, and a maintenance burden. Therefore, design your monitoring system with an eye toward simplicity. … Data collection, aggregation, and alerting configuration that is rarely exercised … should be up for removal.

Other Duties As Assigned

Salesforce admins claimed the title “Awesome Admin” for themselves, and it’s well deserved. They are often the main point of escalation for huge groups of users when they encounter challenges or confusion. This chapter just scratches the surface of the potential activities that admins might undertake or the issues that might arise in the service of production users.

Summary

Whereas with traditional IT systems, it is the role of admins to optimize and keep those systems running and patched, Salesforce does most of this hard work for you. Salesforce Admins generally play a dual role: part actual administrator and part App Builder. That dual role entrusts them with the responsibility of ensuring the production org is stable and accessible while also giving them the opportunity to use their creativity to make the org better. Having spoken about the administrative tasks of managing users, tuning security, and ensuring monitoring, we now turn to the more creative work of improving the org for the benefit of users.
