Chapter 16. DevOps War Stories and Interviews

Author: Noah

When I was finishing my last year of college at Cal Poly San Luis Obispo, I needed to take Organic Chemistry over the summer to graduate on time. Unfortunately, there was no financial aid over the summer, so I had to rent a house and get a full-time job. I was able to get a part-time job at the library for minimum wage, but it still wasn’t enough money. I scoured the help wanted ads and the only job popping up was for a bouncer at a vast country-western nightclub.

At my job interview, the manager who interviewed me was about six feet tall, close to three hundred pounds of mostly muscle. He also had a huge black eye. He told me that the previous weekend a large pack of people had beat up all the bouncers, including him. He told me it was between me and a shot-putter on the track team. To help decide between me and the shot-putter, he asked me if I would run in the middle of a similar fight. I told him that I wouldn’t run from a brawl and I got the job.

Later I started to realize that I may have been a little bit naive in estimating my courage and my abilities. There were some massive wrestlers and football players that would regularly come in and get into fights, and they were terrifying people. At one concert, they called in for backup because they were expecting trouble. A new bouncer with a mohawk and Chinese writing tattooed on his skull was my coworker for that event. Several years later I saw him on TV winning the UFC Light Heavyweight Championship and put a name to a face: Chuck Liddell. Bouncing was a dangerous job. This fact became evident one day when I tried to break up a fight in which a drunk 250-pound football player was pummeling a victim in the face. I wanted to pull him off but instead I was thrown several feet back, as if the man was effortlessly punching a pillow across the room. It dawned on me at that moment that I wasn’t invincible, and my martial arts skills were nonexistent. I never forgot that lesson.

One way of describing this overconfidence is the Dunning-Kruger effect. The Dunning-Kruger effect is a cognitive bias where people mistakenly assess their ability as greater than it is. You can see this effect in practice on the yearly Stack Overflow survey. In 2019, 70% of developers considered themselves above average, while 10% considered themselves below average. What is the takeaway here? Don’t trust human cognition! Trust automation instead. “Trust me,” “I am the boss,” “I have been doing this for X years,” and other statements of confidence are nonsense in comparison to the brutal efficiency of automation done correctly. The belief that automation is greater than hierarchy is what DevOps is all about.

This chapter links everything in the book about automation together by using real people and real case studies to explore the DevOps best practices of:

  • Continuous Integration

  • Continuous Delivery

  • Microservices

  • Infrastructure as Code

  • Monitoring and Logging

  • Communication and Collaboration

Film Studio Can’t Make Film

After living in New Zealand for a year working on the movie Avatar, I was in a very peaceful frame of mind. I lived on the North Island in a town called Miramar, which was a stunningly beautiful peninsula. I would step outside my door onto the beach for my daily 14-kilometer run. Eventually, the contract ended, and I had to get a new job. I accepted a position in the Bay Area at a major film studio that employed hundreds of people and was located at a facility that covered over a hundred thousand square feet. Hundreds of millions of dollars had been invested in this company, and it seemed like a cool place to work. I flew in over the weekend and arrived on a Sunday (the day before work).

My first day at work was quite shocking. It snapped me right out of my paradise mindset. The entire studio was down for the count, and hundreds of employees couldn’t work because the central software system, an asset management system, wouldn’t work. Out of panic and desperation, I was brought into the secret war room and shown the extent of the problem. I could tell my peaceful days of running casually along the beach were over. I had entered a war zone. Yikes!

As I learned more about this crisis, it became apparent that it had been a slow-burning fire for quite some time. Regular all-day outages and severe technical problems were the norm. The list of issues was as follows:

  • The system was developed in isolation without code review.

  • There was no version control.

  • There was no build system.

  • There were no tests.

  • There were functions over a thousand lines long.

  • It was sometimes hard to get a hold of key people responsible for the project.

  • Outages were costly because highly paid personnel couldn’t work.

  • Film studios encourage reckless software development because they “are not a software company.”

  • Remote locations had mysterious connection issues.

  • There was no real monitoring.

  • Many departments had implemented ad hoc solutions and patches for problems.

The only fix to this standard set of problems is to start doing one right thing at a time. The new team formed to solve this problem did precisely that. One of the first steps to address the challenge was to set up continuous integration and automated load-testing in a staging environment. It is incredible how much a simple action like this can lead to considerable gains in understanding.

Note

One of the last and more interesting problems we resolved surfaced in our monitoring system. After the system performance had stabilized and software engineering best practices were applied, we discovered a surprising bug. We had the system perform a daily check-in of assets as a basic health check. Several days later, we began experiencing severe performance issues daily. When we looked at the spikes in CPU on the database, they correlated to the basic health check. Digging further, it became apparent that the check-in code (a homegrown ORM, or object-relational mapper) was generating an exponentially growing number of SQL queries. Most workflow processes involved only two or three versions of an asset, so the flaw had stayed hidden until our health check monitoring exposed it. One more reason for automated health checking.

When we ran the load test, we discovered a whole host of issues. One issue we found immediately was that after a small amount of concurrent traffic, the MySQL database would flatline. We found that the exact version of MySQL we were using had some severe performance problems. Switching to the latest version dramatically improved performance. With that performance issue solved and a way to automatically test whether we were making the problem better, we quickly iterated dramatic fixes.
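
A load test of this kind does not have to be elaborate to be useful. The following is a minimal sketch, not the harness we used at the studio: run_load_test and fake_request are hypothetical names, and fake_request is a stand-in that you would replace with a real call against your staging environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_load_test(request_fn, total_requests=100, concurrency=10):
    """Call request_fn total_requests times using `concurrency` worker threads.

    Returns a summary dict with the number of successes, the number of
    failures, and the mean latency in seconds of the successful calls.
    """

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
        except Exception:
            return None  # count any exception as a failed request
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [r for r in results if r is not None]
    return {
        "ok": len(latencies),
        "failed": len(results) - len(latencies),
        "mean_latency": mean(latencies) if latencies else None,
    }


def fake_request():
    """Hypothetical stand-in for an HTTP call to the staging system."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(run_load_test(fake_request, total_requests=50, concurrency=5))
```

Running the same test before and after each change, with the concurrency raised step by step, is what gives you a repeatable answer to the question “did that make it better?”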

Next, we got source code into version control, creating a branch-based deployment strategy and then running linting and testing, along with code review, on each check-in. This action also dramatically improved visibility into our performance issues and presented self-evident solutions. In an industrial crisis, automation and standards of excellence are two of the essential tools you can deploy.

One final issue we had to resolve was that our remote film studio location was having severe reliability problems. Staff at the remote location were sure the cause was poor performance in our API. The crisis was urgent enough that the top executives at the company had me and another engineer fly down to the location to troubleshoot the problem. When we got there, we called our central office and had them look only at requests from that IP range. When we launched their application, there was no network traffic observed.

We checked the network infrastructure at the remote location and verified we could send traffic between the client machine and the central office. On a hunch, we decided to look at both local and network performance on the Windows machine using specialized diagnostic software. We then observed thousands of socket connections launched over 2 to 3 seconds. After digging into the issue a bit, we discovered that the Windows operating system would temporarily shut down the entire network stack. If too many network connections spawn within a short window, the OS protects itself. The client application was attempting to launch thousands of network connections in a for loop and eventually shut off its network stack. We went into the code and limited the machine so it could only make one network connection, and suddenly, things worked.
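
The general shape of that fix is to put a hard cap on concurrent connections instead of opening one per item in a loop. The sketch below illustrates the idea only; it is not the client’s actual code. ConnectionCounter, fetch_all, and fetch_one are hypothetical names, and the counter merely simulates opening and closing a socket so that the peak can be observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConnectionCounter:
    """Tracks how many simulated connections are open at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.open_now = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.open_now += 1
            self.peak = max(self.peak, self.open_now)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.open_now -= 1
        return False


def fetch_all(asset_ids, fetch_one, max_connections=1):
    """Fetch every asset, never running more than max_connections at once."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch_one, asset_ids))


counter = ConnectionCounter()


def fetch_one(asset_id):
    """Hypothetical fetch; the context manager stands in for a socket."""
    with counter:
        time.sleep(0.001)
        return asset_id * 2
```

With max_connections=1 this behaves like the single-connection limit described above; raising the value lets you find out how much parallelism the client machine tolerates without guessing.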

Later we ran pylint on the source code of the client software and discovered approximately one-third of the system was not executable. The key issue wasn’t a performance problem, but a lack of software engineering and DevOps best practices. A few simple modifications to the workflow, such as continuous integration, monitoring, and automated load testing, could have flushed this problem out in under a week.

Game Studio Can’t Ship Game

When I first joined an established game company, they were undergoing a transformational change. The existing product was extremely innovative when it first launched, but by the time I joined, they had decided they needed to invest in new products. The current culture of the company was a very data center-centric culture with substantial change management at every step. Many of the tools developed were logical extensions of the desire to maintain uptime for a highly successful, but dying, game. The introduction of new people, new departments, and new products led to inevitable, consistent conflict and also a crisis.

The initial depth of the crisis became known to me early, even though I was working on the legacy product. As I was walking by the new products team on the way to another meeting, I heard an interesting conversation. A Spanish developer on the flagship new product caught my ear when he said during an Agile standup, “It no worketh…” This statement alone was quite shocking, but I was even more shocked to hear the response, “Luigi, that is tech talk; this is not the forum for that.”

I knew at that point there was, in fact, something wrong. Later, many people on the project quit, and I took over a project that was over a year late and was on rewrite number three in language number three. One of the key findings was that this “canary in the coal mine” developer was precisely correct. Nothing worked! On my first day on the project, I checked out the source code on my laptop and tried to run the web application. It completely locked up my computer on a few refreshes of Chrome. Uh-oh, here I was again.

After digging into the project a bit, I realized there were a few critical issues. The first was that the core technical crisis needed addressing. There was a “cargo cult” Agile process that was very good at creating and closing tickets, but that built something that didn’t function. One of the first things I did was isolate the core engineers from this project management process, and we designed a fix for the core problems without the overhead of “Agile.” Next, when the core engine issue was resolved, we created an automated deployment and custom load-testing process.

Because this was the top priority for the company, we reprioritized the work of members of some of the other teams working on the core product to build a custom load test and custom instrumentation. This initiative was also met with substantial pushback because it meant that product managers who were working with these resources would be standing idly by. It was an essential point in the project because it forced management to decide whether launching this first new product was a top priority to the company or not.

The final big hurdle in launching the product was the creation of a continuous delivery system. On average, it took around one week to make even a small change like changing HTML. The deployment process that worked reasonably well for a C++ game that had hundreds of thousands of paying customers did not work for a modern web application. The legacy game ran out of a traditional data center. It was very different than what was ideal for creating a web-based game in the cloud.

One of the things cloud computing also exposes is lack of automation. The very nature of cloud computing demands a higher level of DevOps skills and automation. Things are not elastic if a human has to be involved in scaling servers up and down. Continuous delivery means that software is continuously delivered to environments where it can be deployed quickly as a final step. A “release manager” who is involved in a week-long deployment process with many manual steps is in direct opposition to DevOps.
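
To make the elasticity point concrete, here is a minimal sketch of a scaling decision expressed as code rather than as a person watching a dashboard. The function name, the parameters, and the bounds are all assumptions for illustration; a real system would feed this from its own metrics and hand the answer to its cloud provider’s scaling mechanism.

```python
import math


def desired_servers(requests_per_second, capacity_per_server,
                    min_servers=1, max_servers=20):
    """Return how many servers the current load needs, within fixed bounds."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(requests_per_second / capacity_per_server)
    # Clamp so that we never scale to zero or past the budgeted maximum.
    return max(min_servers, min(max_servers, needed))
```

Once the rule is a function, it can be tested, reviewed, and run every minute, which is something no release manager can do.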

Python Scripts Take 60 Seconds to Launch

Working at one of the top film studios in the world, with one of the largest supercomputers in the world, is a great way to see what happens at scale. One of the critical issues with open source software is that it may be built on a laptop in isolation from the needs of a large company. A developer is trying to solve a particular problem. On the one hand, the solution is elegant, but on the other hand, it creates a problem.

One of these problems in Python surfaced at this film studio because they had to deal with petabytes of data on a centralized file server. Python scripts were the currency of the company, and they ran just about everywhere. Unfortunately, they took around 60 seconds to launch. A few of us got together to solve the issue, and we used one of our favorite tools, strace:

root@f1bfc615a58e:/app# strace -c -e stat64,open python -c 'import click'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        97         4 open
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    97         4 total

Module lookup in Python 2 is O(nlogn), or “superlinear” time. Each additional directory in your path increases the time it takes to launch a script at least linearly. This performance penalty became a real problem at the film studio because it often meant more than one hundred thousand calls to the filesystem to launch a Python script. This process was not only slow but incrementally destructive to the performance of the file server. Eventually, this started to crush the centralized multimillion dollar file server completely.
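
The way the number of filesystem calls grows with the length of the path can be shown with a small simulation. This is a sketch of a naive path-by-path search, not CPython’s actual import machinery: the suffix list is an assumption chosen for illustration, and count_probes, build_search_path, and the module name shot_tools are hypothetical.

```python
import os
import tempfile

# Assumed suffix list for illustration; the real list depends on the
# interpreter version and platform.
SUFFIXES = (".so", "module.so", ".py", ".pyc")


def count_probes(module_name, search_path, suffixes=SUFFIXES):
    """Return (probes, found_path) for a naive path-by-path module search.

    Every candidate filename in every directory costs one filesystem check,
    which is the cost that hurts on a shared network file server.
    """
    probes = 0
    for directory in search_path:
        for suffix in suffixes:
            candidate = os.path.join(directory, module_name + suffix)
            probes += 1
            if os.path.exists(candidate):
                return probes, candidate
    return probes, None


def build_search_path(root, n_dirs, module_name):
    """Create n_dirs directories and place module_name.py in the last one."""
    dirs = []
    for i in range(n_dirs):
        d = os.path.join(root, "dir%03d" % i)
        os.mkdir(d)
        dirs.append(d)
    with open(os.path.join(dirs[-1], module_name + ".py"), "w") as f:
        f.write("# empty module\n")
    return dirs


if __name__ == "__main__":
    for n in (10, 100):
        with tempfile.TemporaryDirectory() as root:
            path = build_search_path(root, n, "shot_tools")
            probes, _ = count_probes("shot_tools", path)
            print(n, "directories ->", probes, "filesystem checks")
```

Multiply the per-import count by the number of imports in a script, and by the number of scripts launching across a render farm, and the load on a central file server adds up quickly.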

The solution was a combination of doing a deep dive with strace, i.e., using the right tool for the job, and also hacking Python to stop looking up imports by using paths. Later versions of Python have addressed this issue through caching lookups, but it pays to learn tools that enable you to do a deep dive on performance. A final touch is to always run profiling as a part of the continuous integration process to catch these types of performance bugs.
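
One way to catch this class of regression in continuous integration is to time an import in a fresh interpreter and fail the build when it exceeds a budget. The sketch below is one possible approach under stated assumptions: import_seconds and check_import_budget are hypothetical names, the budget value is yours to choose, and the measurement includes interpreter startup, so it is a coarse guard rather than a profiler.

```python
import subprocess
import sys
import time


def import_seconds(module_name):
    """Time `import module_name` in a fresh interpreter, in seconds."""
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        check=True,  # raise CalledProcessError if the import itself fails
    )
    return time.perf_counter() - start


def check_import_budget(module_name, budget_seconds):
    """Return (ok, elapsed); ok is False when the import exceeds the budget."""
    elapsed = import_seconds(module_name)
    return elapsed <= budget_seconds, elapsed
```

A build step could call check_import_budget for each entry-point module and exit with a nonzero status when ok is False.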

Note

Two other situations at film studios involving bad UX design combined with lousy architecture come to mind. One day an animator came over to the main engineering section and asked for advice on solving a problem with a Filemaker Pro database that had been set up. The Filemaker Pro database that kept track of shots for the animation department kept getting deleted. When I asked to look at the database, there was a UI with two buttons next to each other. One was a medium-sized green button that said “Save Entry,” and another was a large red button that said, “Delete Database.”

At a completely different company, we noticed that one particular IP address was sending a tremendous amount of load toward the production MySQL database. When we tracked down the developer, they seemed a bit hesitant to talk to us. We asked if their department was doing anything special. They said they had a PyQt GUI that performed automation tasks. When we looked at the GUI, there were several buttons of normal size, and then a large button labeled “GO.” We asked what the “GO” button did, and the developer sheepishly said that everyone knows you don’t press that button. I opened up an SSH connection to our database and ran top on the MySQL server. Next, I did press that button, despite his protests not to. Sure enough, the database immediately went to 100% CPU sustained for several minutes.

Putting Out a Fire with a Cache and Intelligent Instrumentation

At a sports social network where I was CTO, we were experiencing substantial problems with the performance of our relational database. We started to reach the limits of vertical scaling. We were running the biggest version of SQL Server that Amazon RDS (Amazon’s managed relational database service) provided. To make things worse, we couldn’t easily switch to horizontal scaling because, at that time, RDS didn’t support read replicas for SQL Server.

There are many ways to solve a crisis like this. One method could have involved rewriting key queries, but we were experiencing so much traffic and a shortage of engineers that we had to get creative. One of our DevOps-centric engineers came up with a critical solution by doing the following:

  • Added more instrumentation via an APM that traced the time SQL calls took and mapped those to routes

  • Added Nginx as a cache to read-only routes

  • Load-tested this solution in a dedicated staging environment
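
The first two steps above can be sketched in a few lines. To be clear about the substitutions: our engineer used an APM product and Nginx, whereas this sketch uses a plain timing decorator and an in-process time-to-live cache to show the same two ideas. The names traced, cached, sql_timings, and team_roster are hypothetical, as is the route string.

```python
import time
from collections import defaultdict
from functools import wraps

# route -> list of elapsed seconds for the SQL work done by that route
sql_timings = defaultdict(list)


def traced(route):
    """Record how long the wrapped query function takes, keyed by route."""
    def decorator(query_fn):
        @wraps(query_fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return query_fn(*args, **kwargs)
            finally:
                sql_timings[route].append(time.perf_counter() - start)
        return wrapper
    return decorator


def cached(ttl_seconds, clock=time.monotonic):
    """Cache a read-only function's results for ttl_seconds per argument tuple."""
    def decorator(fn):
        store = {}  # args -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the database entirely
            value = fn(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator


@cached(ttl_seconds=30)
@traced("/teams/<id>/roster")
def team_roster(team_id):
    """Hypothetical read-only route; the sleep stands in for a slow query."""
    time.sleep(0.01)
    return {"team_id": team_id, "players": ["a", "b"]}
```

The timings tell you which routes are worth caching, and the cache keeps repeat reads away from the database. Putting the cache in front of the application, as we did with Nginx, has the further advantage of requiring almost no application changes.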

This engineer saved our bacon, and he implemented the solution through minimal modifications to the application. It dramatically increased our performance and allowed us to eventually scale to millions of monthly users, becoming one of the largest sports sites in the world. DevOps principles aren’t just important in the abstract; they can figuratively save you from drowning in a sea of deliberate technical debt.

You’ll Automate Yourself Out of a Job!

In my early twenties, I got a job at one of the top film studios in the world and was pretty excited to use my combination of skills in video, film, programming, and IT. It was also a union job, which I had never experienced before working in technology. The upside of a union job was that it had incredible benefits and pay, but I later discovered that there were some downsides involving automation.

After working there for a couple of months, I realized that one of my tasks was pretty silly. I would walk around the film studio on a Saturday (for overtime pay), place a CD into high-end editing systems, and “do maintenance.” The general premise was a good one; do weekly preventative maintenance so that these expensive machines would have minimal downtime during the week. The implementation was flawed, though. Why would I do anything manually if I could automate it? After all, it was a computer.

After my second Saturday “doing maintenance,” I constructed a secret plan to automate my job. Because it was a union job, I had to be a bit careful, though, and keep it a secret until I had validated that it would work. If I asked for permission, you could forget about it. I first wrote down a series of steps that were necessary to automate this:

  1. Connect the OS X machines to the company LDAP servers. This step would allow multiple users and allow me to mount NFS home directories.

  2. Reverse engineer the editing software to let multiple users access the software. I had applied group-level permissions to several lists to hack it, allowing various users to use the same machine.

  3. Create an image of the software in the state I would want it to install.

  4. Write a script to “NetBoot,” i.e., boot the machine from an operating system image on the network, and then reimage the machines.

Once I got this sorted out, I was able to walk over to any machine, reboot it, and hold down the “N” key. It then wholly reimaged the software (and still preserved user data because it was on the network). It took 3.5 to 5 minutes to completely reinstall the entire machine because of the fast system and because I was doing block-level copying.

In my first test run, I was able to complete my “maintenance” in 30 minutes. The only bottleneck was walking over to the machines and rebooting them while holding down the “N” key. Additionally, I then told the film editors first to try restoring their machines by resetting them and holding the “N” key, and this eliminated support calls dramatically. Yikes, my job and my whole department’s job got way more manageable. This automation wasn’t entirely good at the union shop.

Soon an older union worker pulled me into a surprise meeting with my boss. He was unhappy with what I did. The meeting ended with him screaming at me while pointing a finger at me and saying, “You will script yourself out of a job, kid!” The boss’s boss was also unhappy. He had advocated to management for months to get a maintenance team approved, and then I wrote a script that eliminated much of what our department did. He also yelled at me profusely.

The word spread, and everyone liked the new automated process, including the stars and the film editors. I did not get fired because of this. Later, word spread about what I was doing, and I did script myself out of a job. I got recruited to work at Sony Imageworks and was hired to do what I almost got fired for doing. It was a fun job, too. I got to play basketball at lunch with Adam Sandler and the cast of his movies quite frequently. So, yes, you can script yourself out of a job and right into a better one!

DevOps Antipatterns

Let’s dive into some clear examples of what not to do. It is often much easier to learn from mistakes than to learn from perfection. This section dives into many horror stories and antipatterns to avoid.

No Automated Build Server Antipattern

It never ceased to amaze me how many troubled projects or companies had no build server. This fact is perhaps the most significant red flag that exists for a software company. If your software isn’t put through a build server, you can pretty much guarantee minimal other forms of automation are occurring. This problem is a canary in the coal mine. Build servers are a foundational piece that must be in place to ensure you can reliably deliver software. Often, the first thing I would do in a crisis would be to immediately set up a build server. Just running code through pylint makes things better quickly.
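
Even before a full build server is in place, the core of what it does fits in a short script: run each check in order and stop at the first failure. This is a minimal sketch, not a replacement for a real build system. The package name myapp, the tests directory, and the choice of pytest as the test runner are placeholders, and run_build is a hypothetical helper.

```python
import subprocess
import sys


def run_build(steps):
    """Run each (name, command) step in order and stop at the first failure.

    Returns a list of (name, passed) tuples for the steps that actually ran.
    """
    results = []
    for name, command in steps:
        completed = subprocess.run(command)
        passed = completed.returncode == 0
        results.append((name, passed))
        if not passed:
            break  # later steps are meaningless once an earlier one fails
    return results


# Assumed project layout: a package called "myapp" and a "tests" directory.
DEFAULT_STEPS = [
    ("lint", [sys.executable, "-m", "pylint", "myapp"]),
    ("test", [sys.executable, "-m", "pytest", "tests"]),
]
```

Calling run_build(DEFAULT_STEPS) on every check-in and exiting with a nonzero status when any step fails is enough for a build server to mark the commit red.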

Somewhat related to this is the “almost working” build server. It is shocking how some organizations do the same thing they do with DevOps and say, “that isn’t my job…that is the build engineer.” This dismissive attitude, just like the attitude of “this is not my job; this is DevOps,” is cancer. Every job in automation is your job if you work in a software company. There is no more critical or virtuous task than making sure things are automated. It is frankly ridiculous to say automation tasks are not your job. Shame on anyone who says this.

Flying Blind

Are you logging your code? If you are not, why? Do you drive your car without headlights, too? Application visibility is another problem that is quite easy to remedy. For production software, it is technically possible to have too much logging, but more often than not in a troubled project, there is zero! For distributed systems, logging is critical. No matter how skilled the developers, no matter how simple the problem, no matter how good the operations team, you need logging. If you don’t include logging in your application, your project is dead on arrival.
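
Getting from zero logging to useful logging takes very little code with the standard library. The sketch below is a minimal starting point rather than a complete logging strategy; the logger name asset_service and the check_in function are hypothetical.

```python
import logging

logger = logging.getLogger("asset_service")  # hypothetical service name


def configure_logging(level=logging.INFO):
    """Send timestamped records from every module to stderr."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )


def check_in(asset_id, versions):
    """Hypothetical operation that logs its start, its outcome, and failures."""
    logger.info("check-in started asset=%s versions=%d", asset_id, len(versions))
    try:
        if not versions:
            raise ValueError("no versions to check in")
        logger.info("check-in finished asset=%s", asset_id)
        return len(versions)
    except ValueError:
        # logger.exception records the traceback along with the message
        logger.exception("check-in failed asset=%s", asset_id)
        raise
```

Call configure_logging() once at startup. Logging the start, the finish, and the failure of each operation, with the identifiers needed to correlate them, is usually the difference between a five-minute diagnosis and a war room.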

Difficulties in Coordination as an Ongoing Accomplishment

One of the difficulties in working in DevOps teams is the status differences among the cofounder/CTO, the cofounder/CEO, and other members of the group. This conflict causes coordination difficulties in getting: a more reliable infrastructure, better instrumentation, proper backups, testing and QA, and ultimate resolution of any ongoing stability crisis.

Another ordinary organization dynamic that erodes the integrating conditions and causes coordination breakdowns is status differences in groups, because high-status groups may feel no need to recognize the task contributions of members of low-status groups. For instance, Metiu shows how, in software production, high-status programmers refuse to read the notes and comments that low-status programmers provide to document the progress of work. Because accountability requires an acknowledgment of mutual responsibilities, status differences that prevent such acknowledgment limit the development of liability.1

In situations where substantial status differences exist, members of groups may not be able to trust one another. When working interdependently, low-status individuals in these situations will ask fewer questions and give less feedback, for fear of offending others and possible repercussions. Such a situation leads to less sharing of knowledge, limiting common understanding in the group.

In organizational behavior, there is a concept called “closure.” Closure is defined as the act of monopolizing goods or opportunities based on status. According to Metiu, a typical high-status group of software developers will use the following techniques to practice closure:

  • Lack of interaction

  • Use of geographical distance or proximity (in the case of an office)

  • Nonuse of work

  • Criticism

  • Code-ownership transfer

Having observed interactions inside companies, my belief is that executive management often engages in closure for projects they work on with staff. For example, even though a CTO may ask a DevOps engineer to work on an instrumentation task, the CTO may then refuse to use it. By not using it, the CTO implies that the DevOps engineer could never be in the same high-status group as the CTO. This behavior is textbook “closure” according to research in software development teams by Metiu.2

This behavior is one of the most substantial challenges to overcome in fixing pervasive problems within engineering in organizations. When a high-status individual “owned” a component, it has historically not worked until several “low status” team members got involved and took joint responsibility. These projects included UI, logging, data-center migration, infrastructure, and more. Admittedly this is a complex problem and not the only factor, but it is a factor with some unknown, yet significant, weight.

If the leadership in an organization “is better” than other people, you will never implement true DevOps principles. You will be applying Highest Paid Person’s Opinion (HIPO) principles. While DevOps can quite literally save lives and save your company, HIPOs are ferocious animals that can and do kill everything in their path.

No Teamwork

It is ubiquitous at martial arts studios to have the students help mop the floor. There are a lot of obvious reasons for doing this. It shows respect to the instructor and teaches students self-discipline. However, there are also some more subtle reasons as well.

There is a game theory problem at work here. Being exposed to a staph infection can cause serious health concerns. If you are offered an opportunity to mop the floor at the gym you train at, think very carefully about how you respond. People will watch how well you clean the floor, and if you do it well, they will do it well because they respect you. If you treat the task as something “beneath you” and don’t perform it well, you could cause two problems. One, you have not cleaned the floor well, which could cause other members of the gym to get sick. Two, you have “infected” the mindset of others, who will in turn not clean the floor. Your actions have consequences in the present and the future.

So by “winning” and not mopping the floor correctly, you actually “lose” because you have played a part in encouraging unsanitary conditions that can be life-threatening. What is the moral of the story? If you train at a martial arts gym regularly and are asked to mop the floor, make sure you do an incredible job with a happy face. Your life could depend on it.

Let’s look at the same situation at a software company. Many critical tasks fit this same profile: adding proper logging, creating a continuous deployment of your project, load testing your project, linting your code, or doing a code review. If you show a poor attitude or don’t complete these tasks, your company may get a life-threatening disease, just like staph. Approach and completion both are important. What is the message you are sending to your coworkers?

There is an excellent book on teamwork by Larson and LaFasto3 that covers a comprehensive and scientific study of teams. There are eight characteristics they identified that explain how and why effective teams develop:

  • A clear, elevating goal

  • A results-driven structure

  • Competent team members

  • Unified commitment

  • A collaborative climate

  • Standards of excellence

  • External support and recognition

  • Principled leadership

Let’s review some examples of how these have worked or not worked in organizations.

A clear, elevating goal

If your organization doesn’t have a clear, elevating goal, you’re in trouble, full stop! As an engineer, I wanted the goal to be to make excellent, reliable software that worked. At troubled companies, though, I was told about many goals: going after whales, letting Amazon “burn down” while we moved to a data center, getting the company sold to “X” or “Y.”

A results-driven structure

Is your organization a ROWE, or results-only work environment? Many of the tools and processes used in companies are questionable if they are not directly attributable to results: Skype, email, extremely long meetings, working “late.” Ultimately, none of this helps the company by itself. More of a focus on results, versus “face time” or a “quick response on Skype” or email, could be a breakout change in an organization. What about “fake Agile”? Is your company doing cargo-cult Agile? Does this process accomplish nothing but burn hours of developers’ time in meetings while you talk about burndown charts and story points and use lots of other process buzzwords?

Competent team members

It should go without saying that you need competent team members at your organization to be successful. Competency doesn’t mean “elite” schooling or “leet code,” though, it means the ability and desire to perform tasks as part of a team.

Unified commitment

Do you have self-serving people on your team? Are they looking out only for themselves? Do they push a last-minute change to the database, then walk out the door without testing because it was 4:35 and they needed to catch the bus, not caring if they burned down production? This behavior is cancer that destroys your team quicker than anything else. You cannot have self-serving people in a high-performing team; they will ruin it.

A collaborative climate

Is there an appropriate level of task conflict? Everyone cannot merely agree with each other, because you won’t catch mistakes. At the same time, you cannot have people yelling at each other. It needs to be a respectful environment where people are open and expect feedback. If the scale falls too far in either direction, you’re doomed. Achieving this balance is easier said than done.

Another example is the hiring process. Many companies complain about the inability to hire, hire for diversity, and generally get good candidates. The real issue is that their hiring process is “fugly:”

  1. First, the company FLATTERs candidates to apply.

  2. Next, they WASTE their time with bespoke irrelevant tests.

  3. Then they HAZE them with a round of interviews that have worse predictive value than random.

  4. Then they GHOST the candidate and don’t give them any feedback.

  5. They LIE and say they are trying to hire people when they have a broken process.

  6. They then YELL on social media about how hard it is to engage diverse, or any, candidates.

The reason you cannot hire is your process is FUGLY! Treat people with respect, and you will get respect. The connection will manifest itself in being able to retain many great employees who have been needlessly disregarded by a hiring practice that optimizes for the wrong thing.

Standards of excellence

This step is a significant challenge for organizations. Most IT professionals work very hard but could improve their standards of excellence and craftsmanship. Another way of stating this is to say that a higher degree of self-discipline is required. Higher standards for writing software, testing, and deploying are necessary. More stringent measures are required for reading the documentation on new technologies before they are deployed.

One example is the software life cycle. At every stage, higher standards are necessary. Write a technical overview and create a diagram before work starts. It is important to never release code that hasn’t been through a proper DevOps life cycle.

In terms of infrastructure, best practices need to be followed at many steps, whether the component is a ZooKeeper configuration, an EC2 storage configuration, MongoDB, or serverless. Every single component in the technology stack needs to be revisited and looked at for best-practice compliance. Many situations exist where documentation stated a proper way to configure an element, but it was never read! It would be safe to assume over 50% of the technology stacks at many companies are still improperly configured, despite significant technology improvements.

Please note that I am making a clear distinction between working “long hours, nights, and weekends” and being highly disciplined at work and following standards of excellence. In the software industry, there are too many nights and weekends worked and, by an order of magnitude, not enough discipline. It would be a grave mistake to underestimate how significant the lack of standards and controls is versus merely telling someone to work longer and harder.

Finally, there needs to be a higher standard for gathering quantitative data when recommending strategic directions for many companies. The lack of any real quantitative analysis of “migrating to a new data center” or “pursuing whales” speaks to a lack of discipline and process for many in management. It simply isn’t good enough to have an opinion, often stated as a fact by a member of a management team, without having the data to back it up. Management needs high standards. Everyone in the company can see when data, not opinion, hierarchy, aggression, or a desire for a commission, is used to make decisions.

External support and recognition

Historically, there have been some real issues with external support and recognition for DevOps professionals. A readily evident example is being on call. Things have improved dramatically in the tech world. But even today, many people on pager duty are not recognized for how hard they work and how challenging it is to be on call.

At many organizations, there appears to be no tangible reward for working hard, such as volunteering to be on call. If anything, there is a clear precedent that shirking your duty could get you promoted because you were crafty enough to get out of lower-status work. In the case of one employee I worked with, he said it “wasn’t smart” (his words) to agree to be on call. He refused to go on call when he was in engineering. Shirking his duty then led to a promotion. It is challenging to ask for extraordinary contributions when leaders show below average commitment and below average integrity.

Another example of a lack of external support is when one department drops the hard tasks on another. They commonly say, “This is DevOps; this isn’t my job.” I have seen sales engineering teams set up many environments: a data center environment, a Rackspace environment, and an AWS environment. These environments continuously paged the people on call, even though the on-call engineers hadn’t set them up. When the sales engineer was confronted with this problem, he mentioned that he was in sales and that this “wasn’t his job.” Engineering didn’t have access to the machines he had set up. They were misconfigured and paging people. The clear message here is “don’t be a sucker” and get stuck on call. The “smart” people shirk this responsibility and delegate it to the lower-status “suckers.”

Yet another example of a lack of external support is when I was working at a company where customer data was deleted accidentally. A sales engineer initially misconfigured the machine without enough storage to support the customer’s desired retention period. The responsibility, though, of continually cleaning up the data was in the hands of DevOps, the “suckers.”

The maintenance of the machine required running dangerous Unix commands multiple times a day, and often in the middle of the night. Unsurprisingly, one member of the DevOps team mistyped one of the commands and deleted the customer data. The sales engineer got angry and refused to let the customer know; instead, he attempted to force the DevOps engineer to call the customer and perform a “mea culpa.” It is problematic that companies have weak external support, and management has allowed individuals to behave in a way that is not supportive. This behavior sends a clear message that the administration will not tackle the “tough” problems like addressing immature or unprincipled behavior and instead will shift it to DevOps.

Principled leadership

There have been some great examples of principled leadership at companies I have worked at, as well as some unfortunate cases. Larson and LaFasto mention that a transformative leader “establishes trust through positioning—assuring that the leader’s behavior exemplifies the ideals and course of the vision.” A CTO, for example, was on call for months during a crisis to show solidarity with everyone. This situation is an example of not asking someone to do something you wouldn’t do yourself. Responsibility occurs when it is a personal sacrifice and inconvenient.

Another excellent example of principled leadership was with a product manager and the front-end team. She “required” that the front-end team use the ticket system and led by example, actively working with the queue and culling it. As a result, UX engineers learned this skill and how important it was for planning. She could have just said, “use the system,” but instead used it herself. This situation led to a real success that can be measured quantitatively. The ticket turnover rate, which the product manager closely monitored, improved.

On the other hand, some of the practices that startup CEOs promoted were unprincipled. Some would frequently send emails out about people needing to “work late,” then go home at 4 P.M. Teams pick up on this behavior, and some of these repercussions remain forever. Another way to phrase this would be to call this “inauthentic leadership.”

I have seen situations where a DevOps team was harassed, and it was quite damaging. The harassment took the form of claims that the team didn’t work hard or was incompetent. This can be particularly damaging if it comes from someone who often leaves early and refuses to do challenging engineering tasks. Harassment is terrible enough, but when it comes from a legitimate slacker who is allowed to terrorize people, it becomes insufferable.

Larson and LaFasto also mention that any team that rated itself as low performing in these three categories didn’t last long:

  • Clear, elevating goal

  • Competent team members

  • Standards of excellence

Interviews

Glenn Solomon

What are some brief pieces of wisdom you can offer to Python and DevOps folks?

All companies will become software companies. There will be four or five companies that will be fundamental in this growth. DevOps is a critical aspect of this evolution. The velocity of change is important. New and different jobs will be created.

Personal website

https://goinglongblog.com

Company website

https://www.ggvc.com

Public contact information

https://www.linkedin.com/in/glennsolomon

Andrew Nguyen

Where do you work and what do you do there?

I am the Program Director of Health Informatics and the Chair of the Department of Health Professions at the University of San Francisco. My research interests involve the application of machine/deep learning to healthcare data with a specific focus on unstructured data. This includes text using NLP as well as sensor data using signal processing and analysis, both of which highly benefit from advances in deep learning. I am also the founder and CTO of qlaro, Inc., a digital health startup focused on using machine learning and NLP to empower cancer patients from diagnosis through survival. We help patients prioritize what they need to do next and how best to ask questions of their physicians and care team.

What is your favorite cloud and why?

While I started exploring cloud services (primarily from the perspective of IaaS) using AWS, I’ve most recently been using GCP for my work. I made the switch early on purely due to the cost saving when deploying a HIPAA-compliant solution. Since then, I’ve been using GCP out of convenience since it’s what I have the most experience with. However, where possible, I do my work using platform-agnostic tools to minimize impact, should I need to make the change.

From a machine learning perspective, I’m much more agnostic and happy to use either AWS or GCP, depending on the particulars of the machine learning project. That said, for my next project (which will involve collecting, storing, and processing a significant amount of data), I am planning on using GCP given the ease of developing and running Apache Beam jobs on various runners, including Google Dataflow.

When did you start using Python?

I started using Python about 15 years ago as a web development language when Django was first released. Since then, I’ve used it as a general-purpose programming/scripting language as well as a data science language.

What is your favorite thing about Python?

My favorite thing is that it’s a ubiquitous, interpreted, object-oriented language. It runs on pretty much any system and provides the power of OOP with the simplicity of an interpreted scripting language.

What is your least favorite thing about Python?

Whitespace. I understand the reasoning behind using whitespace the way Python does. However, it gets annoying when trying to determine the scope of a function that spans more than the screen can display.

What is the software industry going to look like in 10 years?

I think we will see more and more people doing “software development” without writing as much code. Similar to how Word and Google Docs make it easy for anyone to format a document without manual word processing, I think folks will be able to write small functions or use GUIs to take care of simple business logic. In some sense, as tools like AWS Lambda and Google Cloud Functions become the norm, we’ll see an increasing amount of turnkey functions that don’t require formal computer science training to use effectively.

What technology would you short?

I would short MLaaS (machine learning as a service) companies—that is, companies that focus purely on machine learning algorithms. Just as we don’t see companies that provide word processing services, tools and platforms such as AutoML or SageMaker will make it easy enough for most companies to bring the ML capability in-house. While we can’t solve all ML problems using such tools, we can probably solve 80 to 90% of them. So, there will still be companies creating new ML approaches or providing ML as a service, but we’ll see immense consolidation around major cloud providers (versus the endless stream of companies “doing machine learning” that we see today).

What is the most important skill you would recommend someone interested in Python DevOps learn?

Learn the concepts, not just the tools and tooling. New paradigms will come and go; but for every paradigm, we’ll see dozens of competing tools and libraries. If you’re only learning the specific tool or library, you’ll quickly fall behind as a new paradigm starts to materialize and take over.

What is the most important skill you would recommend someone learn?

Learn how to learn. Figure out how you learn and how you can learn quickly. As with Moore’s Law, where we saw the doubling of processor speeds with each generation, we are seeing accelerating growth of DevOps tools. Some build on existing approaches, while others attempt to supplant them. In any case, you’ll need to know how you learn so that you can quickly and efficiently learn about the ever-increasing number of tools out there—and then quickly decide if it’s worth pursuing.

Tell the readers something cool about you.

I enjoy hiking, backpacking, and generally being outside. In my free time, I also volunteer with the search and rescue team of my local Sheriff’s Office, usually searching for missing people in the woods, but also during disasters such as the Camp Fire that hit Paradise, California.

Gabriella Roman

What is your name and current profession?

Hello! My name is Gabriella Roman, and I’m currently an undergraduate student studying computer science at Boston University.

Where do you work and what do you do there?

I’m an intern at Red Hat, Inc., where I work on the Ceph team. I mainly work with ceph-medic, a python tool that performs checks against Ceph clusters, either by fixing bugs in old checks or resolving issues with new checks. I also work with the DocUBetter team to update Ceph’s documentation.

What is your favorite cloud and why?

Since I’ve only ever really used Google Cloud Storage, I can’t really make a case for why it’s my favorite. I just happened to try it out and, without much reason to dislike it, I have stayed loyal to it for the past 10 years. I do like its simple interface, and as someone who does not like to keep much digital clutter, the 15-GB limit does not bother me.

When did you start using Python?

I first learned Python in an introductory computer science course I took in the second half of my sophomore year.

What is your favorite thing about Python?

Its readability. Python’s syntax is among the simplest of the programming languages, making it a great choice for beginners.

What is your least favorite thing about Python?

I don’t have enough experience with other programming languages yet for comparison.

What is the software industry going to look like in 10 years?

It’s nearly impossible to know what the future will hold, especially in a field that is constantly changing. All I can say is that I hope the software industry continues to evolve in a positive direction, and that software is not used wrongfully.

What is the most important skill you would recommend to someone interested in learning Python?

Practicing good code style, especially when working with a team, helps avoid a lot of unnecessary headache. As a Python newbie myself, I find it especially helpful when I read through code that is well organized and has detailed documentation.

What is the most important skill you would recommend someone learn?

This one isn’t exactly a skill, but more of a state of mind: be willing to learn new things! We’re constantly learning, even when we least expect it, so keep an open mind and allow others to share their knowledge with you!

Tell the readers something cool about you.

I really enjoy playing video games! Some of my favorites are The Last of Us, Hollow Knight, and League of Legends.

Professional website

https://www.linkedin.com/in/gabriellasroman

Rigoberto Roche

Where do you work and what do you do there?

I work at NASA Glenn Research Center as Lead Engineer for the Machine Learning and Smart Algorithms Team. My job is to develop decision-making algorithms to control all aspects of space communication and navigation.

What is your favorite cloud and why?

Amazon Web Services, because it is the one I have the most experience with due to its availability in my workflow.

When did you start using Python?

2014

What is your favorite thing about Python?

Easy-to-read code and quick development time.

What is your least favorite thing about Python?

Whitespace delineation.

What is the software industry going to look like in 10 years?

It is hard to tell. It seems there is a drive to cloud computing and decentralized programming that will drive developers to work as independent contractors for everything. It’ll be a gig economy, not a large business industry. The biggest shift will be the use of automatic coding tools to separate the creative development from the syntax learning tasks. This can open the door to more creative professionals to develop new things and new systems.

What technology would you short?

Uber and Lyft. Anything that has manual labor that can be automated by narrow AI: driving, warehousing, paralegal work. Problems that can be solved by deep learning.

What is the most important skill you would recommend someone interested in Python DevOps learn?

The ability to learn quickly, with a benchmark of “Can you be dangerous in one month or less?” Another is the ability to understand and build from basic principles “like a physicist,” by doing the actual work yourself and understanding more than the theory.

What is the most important skill you would recommend someone learn?

Brain hooks (memory palace), the pomodoro technique, and spaced recall self-testing for content absorption.

Tell the readers something cool about you.

I love combat training systems like Rickson Gracie’s style of Jiu-Jitsu and Mossad Krav Maga (not the sport one). My passion in this world is to build a truly thinking machine.

Personal website

Just google my name.

Personal Blog

Don’t have one.

Company website

www.nasa.gov

Public contact information

[email protected]

Jonathan LaCour

Where do you work and what do you do there?

I am the CTO for Mission, a cloud consulting and managed service provider focused on AWS. At Mission, I direct the creation and definition of our service offerings and lead our platform team, which focuses on driving efficiency and quality via automation.

What is your favorite cloud and why?

I have deep roots in public cloud, both as a consumer and a builder of public cloud services. That experience has led me to understand that AWS provides the deepest, broadest, and most widespread public cloud available. Because AWS is the clear market leader, they also have the largest community of open source tools, frameworks, and projects.

When did you start using Python?

I first started programming in Python in late 1996 around the release of Python 1.4. At the time, I was in high school, but was working in my spare time as a programmer for an enterprise healthcare company. Python instantly felt like “home,” and I have been using Python as my language of choice ever since.

What is your favorite thing about Python?

Python is a very low-friction language that happily fades into the background, allowing the developer to focus on solving problems rather than fighting with unnecessary complexity. Python is just so much fun to use as a result!

What is your least favorite thing about Python?

Python applications can be more difficult to deploy and distribute than I’d like. With languages like Go, applications can be built into portable binaries that are easy to distribute, whereas Python programs require significantly more effort.

What is the software industry going to look like in 10 years?

The last 10 years have been about the rise of public cloud services, with a focus on infrastructure as code, and infrastructure automation. I believe that the next 10 years will be about the rise of serverless architectures and managed services. Applications will no longer be built around the concept of “servers,” and will instead be built around services and functions. Many will transition off of servers into container orchestration platforms like Kubernetes, while others will make the leap directly to serverless.

What technology would you short?

Blockchain. While an interesting technology, the overreach of its applicability is astounding, and the space is filled with hucksters and snake oil salesmen peddling blockchain as the solution to all problems.

What is the most important skill you would recommend someone interested in Python DevOps learn?

Since I first started using Python in 1996, I’ve found that the most important driver for learning has been curiosity and the drive to automate. Python is an incredible tool for automation, and a curious mind can constantly find new ways to automate everything from our business systems to our homes. I’d encourage anyone getting started with Python to look for opportunities to “scratch your own itches” by solving real problems with Python.

What is the most important skill you would recommend someone learn?

Empathy. Too often, technologists embrace technology without considering the impact on humanity and on each other. Empathy is a personal core value for me, and it helps me to be a better technologist, manager, leader, and human.

Tell the readers something cool about yourself.

I have spent the last three years resurrecting my personal website, pulling in content from as far back as 2002. Now, my website is my personal archive of memories, photos, writing, and more.

Personal website

https://cleverdevil.io

Personal blog

https://cleverdevil.io

Company website

https://www.missioncloud.com

Public contact information

https://cleverdevil.io

Ville Tuulos

Where do you work and what do you do there?

I work at Netflix where I lead our machine learning infrastructure team. Our job is to provide a platform for data scientists that allows them to prototype end-to-end ML workflows quickly and deploy them to production confidently.

What is your favorite cloud and why?

I am a shameless fan of AWS. I have been using AWS since the EC2 beta in 2006. AWS continues to impress me both technically and as a business. Their core pieces of infrastructure, like S3, scale and perform extremely well, and they are very robust. From a business point of view they have done two things right. They have embraced open source technologies, which has made adoption easier in many cases, and they are very sensitive to customer feedback.

When did you start using Python?

I started using Python around 2001. I remember being very excited about the release of generators and generator expressions soon after I had started using Python.

What is your favorite thing about Python?

I am fascinated by programming languages in general. Not only technically, but also as a medium of human communication and as a culture. Python is an extremely well-balanced language. In many ways it is a simple and easily approachable language, but at the same time expressive enough to handle even complex applications. It is not a high-performance language, but in most cases it is performant enough, especially when it comes to I/O. Many other languages are better optimized for particular use cases, but only a few are as well-rounded as Python.

Also, the CPython implementation is a straightforward piece of C code and much simpler than JVM, V8, or the Go runtime, which makes it easy to debug and extend when needed.

What is your least favorite thing about Python?

The other side of being a well-rounded generalist language is that Python is not an optimal language for any particular use case. I miss C when I work on anything performance-critical. I miss Erlang when I am building anything requiring concurrency. And when hacking algorithms, I miss the type inference of OCaml. Paradoxically, when using any of these languages, I miss the generality, pragmatism, and the community of Python.

What is the software industry going to look like in 10 years?

The trend is clear if you look at the past 50 years of computing. Software is eating the world and the software industry keeps moving upward in the tech stack. Relatively speaking, we have fewer people focusing on hardware, operating systems, and low-level coding than ever before. Correspondingly, we have an increasing number of people writing software who don’t have much experience or knowledge of the lower levels of the stack, which is OK. I think this trend has massively contributed to the success of Python thus far. I predict that we will see more and more human-centered solutions like Python in the future, so we can empower an ever-increasing group of people to build software.

What technology would you short?

I tend to short technologies that assume that technical factors trump human factors. History is littered with technologies that were technically brilliant but failed to appreciate the actual needs of their users. Taking this stance is not always easy, since it is a natural engineering instinct to feel that technically elegant solutions deserve to win.

What is the most important skill you would recommend someone interested in Python DevOps learn?

I would recommend anyone serious about Python, DevOps in particular, to learn a bit about functional programming. Nothing hardcore, just the mindset around idempotency, function composition, and the benefits of immutability. I think the functional mindset is very useful for large-scale DevOps: how to think about immutable infrastructure, packaging, etc.
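The three ideas named in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ServerConfig`, `ensure_package`, `set_port`, and `compose` are hypothetical names invented for the sketch, not part of any real provisioning tool.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class ServerConfig:
    """Immutable description of a server (hypothetical fields for illustration)."""
    hostname: str
    packages: FrozenSet[str] = frozenset()
    port: int = 80


Step = Callable[[ServerConfig], ServerConfig]


def ensure_package(name: str) -> Step:
    """Return a step that makes sure `name` is listed as installed."""
    def step(config: ServerConfig) -> ServerConfig:
        # Set union is idempotent: adding a package twice changes nothing.
        return replace(config, packages=config.packages | {name})
    return step


def set_port(port: int) -> Step:
    """Return a step that sets the listening port."""
    def step(config: ServerConfig) -> ServerConfig:
        return replace(config, port=port)
    return step


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single function."""
    return lambda config: reduce(lambda acc, step: step(acc), steps, config)


provision = compose(ensure_package("nginx"), ensure_package("certbot"), set_port(443))

base = ServerConfig(hostname="web-1")
once = provision(base)
twice = provision(once)  # Running the same steps again yields an equal config.
```

Because every step returns a new `ServerConfig` instead of mutating the old one, `base` is untouched after provisioning, and applying `provision` a second time gives a result equal to the first. That is the property you want from infrastructure code that may be rerun at any time.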

What is the most important skill you would recommend someone learn?

Learning to identify what problems are worth solving is critical. So many times I have observed software projects where an almost infinite amount of resources has been put into solving problems that ultimately don’t matter. I have found Python to be a good way to hone this skill, since it allows you to quickly prototype fully functional solutions that can help you see what is relevant.

Tell the readers something cool about you.

With a friend of mine, I hacked an urban game that was played in NYC. Players took photos with their phones that were projected on a giant billboard in Times Square in real time. A nerdy cool thing about this is that the whole game, including the client running on the phones, was written in Python. What’s even cooler is that the game took place in 2006, in the Jurassic era of smartphones, predating the iPhone.

Personal website

https://www.linkedin.com/in/villetuulos

Company website

https://research.netflix.com

Public contact information

@vtuulos on Twitter

Joseph Reis

Where do you work and what do you do there?

I’m the cofounder of Ternary Data. Mostly I work in sales, marketing, and product development.

What is your favorite cloud and why?

It’s a toss up between AWS and Google Cloud. I find AWS better for apps, but Google Cloud superior for data and ML/AI.

When did you start using Python?

2009

What is your favorite thing about Python?

There’s (usually) one way to do things, so that cuts down on the mental overhead needed to figure out the best solution to the problem at hand. Just do it the Python way and move on.

What is your least favorite thing about Python?

The GIL is my least favorite thing. Though thankfully, the world seems to be moving toward a resolution of the GIL.

What is the software industry going to look like in 10 years?

Probably much like it does now, though with faster iteration cycles of best practices and new tools. What’s old is new again, and what’s new is old. The thing that doesn’t change is people.

What technology would you short?

I’d short AI in the short term, but very long AI over the coming decades. A lot of hype around AI poses a risk of some broken hearts in the short term.

What is the most important skill you would recommend someone interested in Python DevOps learn?

Automate EVERYTHING possible. Python is an awesome language for simplifying your life, and your company’s processes. Definitely take advantage of this great power.

What is the most important skill you would recommend someone learn?

Learn to have a growth mindset. Being flexible, adaptable, and able to learn new things will keep you relevant for a very long time. With the hyper-fast changes in tech—and the world in general—there will be endless opportunities to learn…mostly because you’ll have to :).

Tell the readers something cool about you.

I’m a former rock climbing bum, club DJ, and adventurer. Now I’m a rock climber with a job, I still DJ, and I go on as many adventures as possible. So, not much has changed, but the itch to explore and do dangerous things still continues.

Personal website and blog

https://josephreis.com

Company website

https://ternarydata.com

Public contact information

[email protected]

Teijo Holzer

Where do you work and what do you do there?

I have been working as a Senior Software Engineer at Weta Digital, New Zealand, for 12 years. My responsibilities include software development (mainly Python and C++), but I also occasionally perform System Engineering and DevOps tasks.

What is your favorite cloud and why?

It has to be AWS.

One of the main features they offer is the support for continuous integration and delivery. In software engineering, you want to automate as many mundane tasks as possible so you can concentrate on the fun parts of innovative software development.

Things that you usually don’t want to think about are code builds, running existing automated tests, releasing and deploying new versions, restarting services, etc. So you want to rely on tools like Ansible, Puppet, Jenkins, etc., to perform these tasks automatically at certain defined points (e.g., when you merge a new feature branch into master).

Another big plus is the amount of available support in online forums like Stack Overflow, etc. Being the current market leader in the cloud platform space naturally leads to a larger user base asking questions and solving problems.

When did you start using Python?

I started using Python more than 15 years ago and have more than 12 years of professional Python experience.

What is your favorite thing about Python?

That there is no need to reformat your source code, ever. The choice to have whitespace carry syntactical/grammatical meaning means that other people’s Python code immediately has a very high readability score. I also like the Python license, which led to a huge uptake of Python as a scripting language in many third-party commercial applications.

What is your least favorite thing about Python?

The difficulty of performing highly concurrent tasks in a reliable and effective fashion. In Python, efficient and reliable threading and multiprocessing are still difficult to achieve in complex environments.

What is the software industry going to look like in 10 years?

In my opinion, more emphasis will be placed on being able to integrate and deliver customer-centric solutions based on existing infrastructure and tooling within an aggressive time frame. There is no need to constantly reinvent the same wheel. So system engineering and DevOps skills will become more important in the software industry. You need to be able to scale fast if required.

What technology would you short?

Any system that has a single point of failure. Building robust systems requires you to acknowledge that all systems will eventually fail, so you need to cater for that at every level. That starts by not using assert statements in your code, and goes all the way up to providing high-availability, multimaster DB servers. Building fault-tolerant systems is especially important when there are many users relying on your systems 24/7. Even AWS only offers 99.95% uptime.
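One possible reading of the advice about assert statements is that Python removes them entirely when the interpreter runs with the `-O` flag, so a check written as an assert can silently disappear in production. The sketch below contrasts that with an explicit check and also works out what a 99.95% uptime figure allows in a year. The function names and the two-replica rule are hypothetical, chosen only for illustration.

```python
def check_replicas_with_assert(replicas: int) -> int:
    # This check is skipped entirely under `python -O`,
    # so a bad value would slip through unnoticed.
    assert replicas >= 2, "need at least two replicas"
    return replicas


def check_replicas(replicas: int) -> int:
    """Explicit check that still runs when assertions are disabled."""
    if replicas < 2:
        raise ValueError(f"need at least two replicas, got {replicas}")
    return replicas


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per 365-day year permitted by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * 365 * 24


# 0.05% of the 8,760 hours in a year is a little under four and a half hours.
allowed = downtime_hours_per_year(99.95)
```

Callers of `check_replicas` can catch `ValueError` and fall back or alert, which fits the broader point of the answer: plan for failure at every level rather than assuming it will not happen.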

What is the most important skill you would recommend someone interested in Python DevOps learn?

Speedy automation. Every time you find yourself repeating the same tasks over and over, or you find yourself again waiting for a long-running task to complete, ask yourself: How can I automate and speed up those tasks? Having a quick turnaround time is essential for effective DevOps work.

What is the most important skill you would recommend someone learn?

Speedy automation, as discussed above.

Tell the readers something cool about you.

I like presenting at Python conferences. Look for my most recent talk about Python, Threading and Qt at Kiwi PyCon X.

Recent talk:

https://python.nz/kiwipycon.talk.teijoholzer

Company website

http://www.wetafx.co.nz

Matt Harrison

Where do you work and what do you do there?

I work at a company I created called MetaSnake. It provides corporate training and consulting in Python and data science. I spend about half of my time teaching engineers how to be productive with Python or how to do data science. The other half is consulting and helping companies leverage these technologies.

What is your favorite cloud and why?

I’ve used both Google and AWS in the past. They (and others) have excellent Python support, which I love. I don’t know that I have a favorite, but I’m glad there are multiple clouds, as I believe competition gives us better products.

When did you start using Python?

I started using Python in 2000, when working at a small startup doing search. A colleague and I needed to build a small prototype. I was pushing for using Perl and he wanted to use TCL. Python was a compromise, as neither of us wanted to use the other’s preferred technology. I believe both of us promptly forgot what we were using before and have leveraged Python since.

What is your favorite thing about Python?

Python fits my brain. It is easy to start from something simple to build an MVP and then productionize it. I’m really enjoying using notebook environments like Jupyter and Colab. They make data analysis really interactive.

What is your least favorite thing about Python?

The built-in docstrings for the classes, such as lists and dictionaries, need some cleanup. They are hard for newcomers to understand.

What is the software industry going to look like in 10 years?

I don’t have a crystal ball. For me, the main difference between now and 10 years ago is leveraging the cloud. Otherwise, I use many of the same tools. I expect programming in the next 10 years to be very similar: off-by-one errors will still pop up, CSS will still be hard, and deployment might be slightly easier.

What technology would you short?

I imagine that proprietary tools for data analysis will go the way of the dinosaur. There might be an effort to save them by open sourcing them, but it will be too little, too late.

What is the most important skill you would recommend someone interested in Python DevOps learn?

I think curiosity and a willingness to learn are very important, especially as many of these tools are fast-moving targets. There seems to always be a new offering or new software.

What is the most important skill you would recommend someone learn?

I have two. One, learning how you learn. People learn in different ways. Find what way works for you.

The other skill is not technical. It is learning how to network. This doesn’t have to be a dirty word and is very useful for people in tech. Most of my jobs and work have come from networking. This will pay huge dividends.

Tell the readers something cool about you.

I like to get outside. That might be running, ultimate hiking, or skiing.

Personal website/blog

https://hairysun.com

Company website

https://www.metasnake.com

Public contact information

[email protected]

Michael Foord

Where do you work and what do you do there?

My last two jobs have been working on DevOps tools, which has led to a reluctant passion for the topic on my part. Reluctant because I was long skeptical of the DevOps movement, thinking that it was mostly managers wanting developers to do sysadmin work as well. I’ve come to see that the part of DevOps I really care about is the systems-level thinking and having development processes be fully aware of that.

I worked for three years on Juju for Canonical, an interesting foray into programming with Go, and then for a year for Red Hat building a test automation system for Ansible Tower. Since then I’ve been self-employed with a mixture of training, team coaching, and contracting, including an AI project I’m working on now.

In my copious spare time, I work on Python itself as part of the Python Core Dev team.

What is your favorite cloud and why?

I’m going to throw this question sideways a little. My favorite cloud is all of them, or at least not having to care (too much) which cloud I’m on.

The Juju model describes your system in a backend agnostic way. It provides a modelling language for describing your services and the relationships between them, which can then be deployed to any cloud.

This allows you to start on, say, AWS or Azure, and for cost or data security reasons migrate to an on-prem private cloud like Kubernetes or OpenStack without having to change your tooling.

I like to control my major dependencies, so I prefer to work with something like OpenStack rather than a public cloud. I’m also a fan of Canonical’s MaaS (Metal As A Service), which is a bare-metal provisioner. It started life as a fork of Cobbler, I believe. You can use it directly, or as a substrate for managing the hardware with a private cloud. I wrote the Juju code to connect to the MaaS 2 API, and I was very impressed with it.

I’m much more of a fan of LXC/LXD or KVM virtualization than I am of Docker (virtually heresy these days), so Kubernetes or OpenShift wouldn’t be my first port of call.

For commercial projects I sometimes recommend VMware cloud solutions, simply because of the availability of sysadmins used to running these systems.

When did you start using Python?

I started programming with Python as a hobby in about 2002. I enjoyed it so much that I started full-time programming around 2006. I was lucky enough to get a job with a London fintech startup where I really learned the craft of software engineering.

What is your favorite thing about Python?

Its pragmatism. Python is enormously practical, which makes it useful for real-world tasks. This stretches right into the object system that strives to make the theory match the practice.

This is why I love teaching Python. For the most part, the theory is the same as the practice, so you get to teach them in the same breath.

What is your least favorite thing about Python?

Python is old, and if you include the standard library, big. Python has quite a few warts, such as the lack of symmetry in the descriptor protocol, meaning you can’t write a setter for a class descriptor. Those are largely minor.
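The descriptor asymmetry mentioned above can be shown with a small sketch. The class names Tracked and Config are hypothetical and are not from the interview. Reading the attribute on the class goes through the descriptor's `__get__`, but assigning to it on the class never reaches `__set__`; the assignment simply replaces the descriptor in the class dictionary.

```python
class Tracked:
    """Data descriptor that records every read and write it intercepts."""

    def __init__(self, value):
        self.value = value
        self.calls = []

    def __get__(self, instance, owner):
        self.calls.append("get")
        return self.value

    def __set__(self, instance, value):
        self.calls.append("set")
        self.value = value


class Config:
    level = Tracked(1)


# Grab the descriptor object itself without triggering __get__.
descriptor = Config.__dict__["level"]

Config.level          # class access goes through __get__
Config().level = 2    # instance assignment goes through __set__
Config.level = 3      # class assignment bypasses __set__ entirely

print(descriptor.calls)          # ['get', 'set']
print(Config.__dict__["level"])  # 3, the descriptor has been replaced
```

Intercepting assignment on the class itself would require defining the descriptor on a metaclass, which is the kind of workaround the lack of symmetry forces.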

The big wart for me, which many will agree with, is the lack of true free-threading. In a multicore world, this has been getting more and more important, and the Python community was in denial about it for years. Thankfully we’re now seeing the core-devs take practical steps to fix this. Subinterpreter support has several PEPs and is actively being worked on. There are also people looking at [possibly] moving away from reference counting for garbage collection, which will make free-threading much easier. In fact, most of the work has already been done by Larry Hastings in his Gilectomy experiment, but it’s still hampered by reference counting.

What is the software industry going to look like in 10 years?

I think we’re in the early part of an AI gold rush. That’s going to spawn thousands of short-lived and useless products, but also completely change the industry. AI will be a standard part of most large-scale systems.

In addition, DevOps is providing a way for us to think about system development, deployment, and maintenance. We’ve already seen the effects of that in microservices and polyglot environments. I think we’ll see a rise of a new generation of DevOps tooling that democratises system-level thinking, making it much easier to build and maintain larger-scale systems. These will become more “solved problems,” and the frontiers will expand into new challenges that haven’t yet been described.

What technology would you short?

Ooh, a challenging question. I’m going to say the current generation of DevOps tooling.

The genius of DevOps is codifying arcane deployment and configuration knowledge; for example, playbooks with Ansible, and Charms for Juju.

The ideal DevOps tooling would allow you to describe and then orchestrate your system, in a backend agnostic way, but also incorporate monitoring and an awareness of the system state. This would make deployment, testing, reconfiguring, scaling, and self healing straightforward and standard features.

Perhaps we need an App Store for the cloud. I think a lot of people are trying to get there.

What is the most important skill you would recommend someone interested in Python DevOps learn?

I tend to learn by doing, so I resent being told what to learn. I’ve definitely taken jobs just to learn new skills. When I started with Canonical, it was because I wanted to learn web development.

So practical experience trumps learning. Having said that, virtual machines and containers are likely to remain the base unit of system design and deployment. Being comfortable slinging containers around is really powerful.

Networking is hard, important, and a very valuable skill. Combine that with containers through software-defined networking layers, and it makes a potent combination.

What is the most important skill you would recommend someone learn?

You’ll never know enough, so the most important skill is to be able to learn and to change. If you can change, you’re never stuck. Being stuck is the worst thing in the world.

Tell the readers something cool about you.

I dropped out of Cambridge University, I’ve been homeless, I’ve lived in a community for a number of years, I sold bricks for 10 years, and I taught myself to program. Now I’m on the Python core development team and have the privilege of traveling the world to speak about and teach Python.

Personal website/blog

http://www.voidspace.org.uk

Public contact information

[email protected]

Recommendations

“All models are wrong…but some are useful,” certainly applies to any general piece of advice on DevOps. Some elements of my analysis will be absolutely wrong, but some of it will be useful. It is impossible that my own personal bias doesn’t play a role in my analysis. Despite being potentially wrong in some of my analysis and being very biased, there are clearly some urgent issues to fix in the management of most companies. Some of the highest priority issues are:

  1. Status differences have led to accountability problems, with software stability being a highly visible example. Engineering managers (especially startup founders), in particular, need to acknowledge how informal status closure affects software quality and fix it.

  2. There is a culture of meaningless risk (firing silver bullets versus fixing broken windows) at many organizations.

  3. There are ineffective or meaningless standards of excellence, and there is a general lack of discipline in engineering at many organizations.

  4. Culturally, data has not been used to make decisions. The highest paid person’s opinion (HIPO), status, aggression, intuition, and possibly even the roll of the dice have been the reasons for a decision.

  5. True understanding of the concept of “opportunity cost” eludes executive management. This lack of understanding then trickles down the ranks.

  6. There is a need to increase focus on meritocracy over “snake oil and bullshit,” as senior data scientist Jeremy Howard at Kaggle says.

The “right” things can be put in place in a few months in engineering: a ticket system, code review, testing, planning, scheduling, and more. Executive leadership in companies can agree these are the right things to do, but their actions must match their words. Instead of focusing on execution, consistency, and accountability, executive leadership is often focused on firing silver bullets from a high-powered elephant gun. Unfortunately, they often miss every elephant they shoot at. Executive teams would be wise to learn from these mistakes and avoid the negative culture around them.

Exercises

  • What are the core components necessary for a capable team?

  • Describe three areas you could improve as a team member.

  • Describe three areas you excel at as a team member.

  • What is right about all companies in the future?

  • Why does DevOps need external support and recognition?

Challenges

  • Create a detailed analysis of your current team using the teamwork framework by Larson and LaFasto.

  • Have everyone on your team fill out anonymous index cards that have three positive things and three valuable feedback items for each member of your small group (must have negative and positive issues). Get into a room and have each person read the index cards from their teammates. (Yes, this does work and can be a life-changing experience for individuals in your team.)

Capstone Project

Now that you have reached the end of this book, there is a Capstone Project that you can build that demonstrates mastery of the concepts covered:

  • Using the ideas explored in the book, create a scikit-learn, PyTorch, or TensorFlow application that serves out predictions via Flask. Deploy this project to a primary cloud provider while completing all of these tasks:

    • Endpoint and health-check monitoring

    • Continuous delivery to multiple environments

    • Logging to a cloud-based service such as Amazon CloudWatch

    • Load-test the performance and create a scalability plan

1 Anca Metiu, “Owning the Code: Status Closure in Distributed Groups,” Organization Science (Jul-Aug. 2006).

2 Anca Metiu, “Owning the Code: Status Closure in Distributed Groups,” Organization Science (Jul-Aug. 2006).

3 Larson, C. E., & LaFasto, F. M. J. (1989). Sage series in interpersonal communication, Vol. 10. Teamwork: What must go right/what can go wrong. Thousand Oaks, CA, US: Sage Publications, Inc.
