Chapter 2
Dev: Technical Debt Is Now Your Problem

All developers know the feeling of sitting behind the screen and opening another developer's code. Or maybe you work in IT Ops and are reviewing a system administrator's automations, or a data scientist examining your colleague's machine learning model, or a product owner trying to understand a website's information architecture.

On the first inspection, it looks like a pile of spaghetti with three different sauces on top. One function calls another function that calls a third function of poorly structured code, with no consistent naming conventions, zero unit tests, and almost no documentation. You open five or more tabs to trace through the flow, reverse engineer the business logic, and hope that the modifications you need to make don't cause defects or introduce new issues.

Spaghetti code often happens when a developer, engineer, data scientist, or designer tries to get something done in too little time or with insufficient expertise. In some cases, it's written with zero care about what happens when the next person must review or improve on it. Other times, the code has been modified several times by different people, each imprinting their coding methods and style without cleaning up their predecessor's crafty work, again, often because management doesn't allocate sufficient time, direction, or priority to maintain and improve the code.

When you encounter spaghetti code, it's almost always because something of importance suddenly needs fixing. You rarely have the opportunity to clean up and refactor code. Usually, you come across it when you're under pressure to solve a problem. And that's why you do it. You ignore that pile of crappy code and come up with your way of implementing what is needed today.

You convince yourself that your code is better than your predecessor’s. You will code with all the modernized components and standards, and it will take less time than pulling off an anthropological study of the coding fiasco that someone else left behind.

And in the back of your mind, you know that when your spaghetti code starts rotting and is causing heartache, there will be someone else responsible for cleaning up your mess.

Let me tell you about one colossal coding mess that I created. It was an Italian restaurant filled with bowls of spaghetti code held together with duct tape and Band‐Aids. I must admit, it was a beast, and I never moved it entirely out of my responsibility for others to support it. It started with a simple problem that grew to layers of complexity over time. When it worked, you didn't know it even existed, but when it failed, it took a MacGyver to diagnose the problem and solve it before customers got angry.

Yeah, you've probably seen this before too.

Here I am, head of software development at the ripe age of 26 in the late 1990s. We have a simple problem of getting newspaper classified ads from their advertising systems into our search engine and available for users by dawn. That's when the paperboy—or papergirl—delivered the newspaper to doorsteps, so the assumption at that time was that readers who got their news from our website expected the same level of service.

But this wasn't easy. The newspaper's physical process of completing the print edition was developed and improved upon over decades. Sure, the editors can stop the printing presses for any number of reasons, but there are buttoned‐down procedures to handle these disruptions. And although the printing press often supports multiple newspapers, there is a finite capacity limit given the sequential operations of printing and packaging newspapers, then loading them onto delivery trucks.

Our process for loading the digital version of the newspaper onto the website was still being worked out. More important, the process for publishing classified ads, a major source of revenue on a newspaper website, was our primary product and development challenge.

Gotta have those ads up. Must enable our users to search for that '73 red Mustang, the apartment with an amazing view of the park, the new job that pays better and is closer to home, or the house in a better school district. Our most dedicated users go treasure hunting in the “Buy, Sell, Trade” section, which eBay and Craigslist eventually disrupt with larger inventories and more convenient user experiences than hunting through individual newspaper websites.

But before eBay, Craigslist, Cars.com, HotJobs, and a slew of other websites decimate the newspaper classified ad business, I have the glorious job of running a software system that sucks all those ads in. Yeah, it is a file each newspaper FTPs to us almost every night on a schedule that matches the print newspaper's production workflow. Beyond that, it is a data dump in a format that looks nothing like a CSV, XML, or JSON output.

That's because it comes out of the newspaper's legacy ad systems, which do several things really well. They enable call centers of ad salespeople to sell classified ads on the phone. Yup, you used to call newspaper salespeople to tell them about that beautiful Camaro you want to sell. The ad reps would help you beautify a drab ad into something that someone was more likely to notice. Most important, this system calculates the advertisement's length in print ad lines and allows the ad salesperson to sell you on a whole bunch of add‐on widgets. Want boldface in your ad? A small icon to catch people's attention? All that costs you extra and costs the newspaper next to nothing.

Now ask me: why did I care to work on this problem of searching classified ads? You're talented and can work on all sorts of different business opportunities, so why newspapers, and why classifieds?

Classified ads are cash cows for newspapers, generating $19.6 billion in revenue at their peak in 2000,1 and their main costs are the call centers for selling the ads. The underlying systems are optimized to make the sales process efficient, produce the ads correctly, and calculate the cost accurately. Their last job is to spit out a file containing the classified ads with all the formatting required to print them on a newspaper page.

Until our website came along, the print systems were the only ones utilizing this output. And that's the file we get daily. It is a mix of funky codes signifying categories, line breaks, ad breaks, and other required formatting. Some of the markups are needed to place the ad in the right section and render it correctly in print, but most of the coding provides little metadata to make the ad searchable online, and a lot of it is totally useless for displaying the ad on a website.

Oh, yeah, and almost every newspaper's file is different. Sure, there are some similarities when they originate from the same underlying system. After that, the page formatting, the newspaper's categories, and the upselling ad enhancements are mostly different from paper to paper.

So with that as our starting point, here is how a pile of spaghetti code is created one application and script at a time.

My Recipe: Creating the Perfect Bowl of Spaghetti Code

First, we need an application to parse through all of these different file formats. So, the company's founding CTO, Ilan, whom I succeed two years later, develops a parser and calls it the “Universal Markup Conversion Utility.” At least that's what I think it is called because we always refer to it by its acronym UMCU, pronounced Uhm–Coo. Ops uses this tool to extract the text, searchable categories, and other information from the classified ad files we receive. It has its own parsing language that, quite frankly, I never fully understand. But we have a small team in operations responsible for setting up and maintaining these scripts, so it's out of sight, out of mind for me.

Back then, my brainchild is an NLP, an application that I appropriately name Babelfish. If you don't know what a Babel fish is, get yourself a copy of the Douglas Adams classic, The Hitchhiker's Guide to the Galaxy. It's a masterpiece that every engineer should read before acceptance into geekdom.

The Babelfish application opens one file at a time, reads one ad, parses out words, and looks to assign tags for searchable criteria. It makes it possible to extract key classified ad details such as makes and models of cars, job titles, number of bedrooms and bathrooms, and prices of anything. Apache Spark doesn't exist yet, and other parallel processing engines aren't affordable to startups. Sure, we experimented with parallelizing the jobs and even considered making the application multi‐threaded. But our higher priority is always to improve the accuracy of the searchable tags extracted from these ads.
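For a flavor of what that rule‐based extraction involves, here is a minimal, hypothetical sketch in modern Python. The patterns and tag names are illustrative assumptions, not Babelfish's actual dictionaries, rule files, or C++ engine.

    import re

    # Illustrative rules only; the real dictionaries and rule files were far richer.
    RULES = {
        "price": re.compile(r"\$\s?([\d,]+)"),
        "bedrooms": re.compile(r"(\d+)\s*(?:br|bed(?:room)?s?)", re.IGNORECASE),
        "bathrooms": re.compile(r"(\d+(?:\.\d)?)\s*(?:ba|bath(?:room)?s?)", re.IGNORECASE),
        "car_year": re.compile(r"\b(19[5-9]\d|20[0-2]\d)\b"),
    }

    def tag_ad(ad_text: str) -> dict:
        """Extract searchable tags from one classified ad's raw text."""
        tags = {}
        for name, pattern in RULES.items():
            match = pattern.search(ad_text)
            if match:
                tags[name] = match.group(1).replace(",", "")
        return tags

    print(tag_ad("1973 Mustang, red, runs great, $2,500 obo"))
    # {'price': '2500', 'car_year': '1973'}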

I'm super‐proud of this creation called Babelfish, and in retrospect, it was short‐sighted that we did not consider extending it into a product by itself. It is a legit NLP engine that we develop in the 1990s—and 15 years later, other NLP technologies I get the chance to use aren't all that much better. Yes, we had AI in the 1990s! Built on a fast C++ engine and applied to data coming from thousands of customers. Why aren't we worth billions, damn it?

But I digress.

As good a technology as this is, it is a very poorly architected application. You feed it dictionary and rule files that are all in proprietary formats. It has minimal error checking and processes the data sequentially and at a snail's pace. While the rule engine is powerful, it isn't easy to learn, and it has no constructs to simplify implementations. As a result, if you develop a regular expression for parsing out car prices, you must copy and paste it into the real estate rules. It's not quite a mess, but it is poor architecture, bad coding, and technical debt.

So now we have a lot more contextual metadata about the ads and are ready to send this off to a search engine for indexing. We use the AltaVista search engine running on DEC AlphaServers, and we spend days visiting the DEC labs to test out these beasts, which are faster and cheaper than comparable servers from Sun Microsystems but clunky to develop on in DEC's flavor of UNIX.

We build another application that reads the file outputs from Babelfish and uses the AltaVista software development kit to create or update these index files. Of course, this is terribly inefficient, but this application is in its third version, having been used to work with other search technologies before we selected AltaVista. And, of course, we don't have time to better architect this with Babelfish's core capabilities. And keep in mind that APIs, service buses, and microservices don't exist yet. We're lucky to have web servers that allow us to plug in proprietary code and respond to web requests.

That's three applications: UMCU, Babelfish, and the Indexer. A fourth handles basic accounting, captures how many ads came from different newspapers, and stores this data in an Oracle database. Yup, that is a separate application.

But what makes this messy is less about the applications themselves and more about how we stitch them together. You see, every file must run through these applications, and we have limited hardware. We have cron jobs set up to run different applications on a cluster of servers. Don't know what a cron job is? It's what we use to schedule applications to run, in the days before all the fancy schedulers and other utilities for kicking off batch jobs.

Some of these cron jobs run independently, and others run on different servers. But they all leverage the same basic pattern (a rough code sketch follows the list):

  1. Make sure no other instances of this application are running.
  2. If there are new files, then run the application.
  3. Check for errors when the application completes, and send out alerts if there is a failure.
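To make the pattern concrete, here is a minimal sketch of what one of those cron‐launched wrapper jobs might look like if written today in Python. The lock file path, input directory, command, and alert address are hypothetical placeholders, and it assumes a local mail relay for alerts; it is not the actual scripts we ran.

    import glob
    import os
    import smtplib
    import subprocess
    import sys
    from email.message import EmailMessage

    # Hypothetical paths and addresses for illustration only.
    LOCK_FILE = "/var/run/ad_processor.lock"
    INBOX_DIR = "/data/incoming"
    PROCESS_CMD = ["/opt/apps/ad_processor"]
    ALERT_TO = "noc@example.com"

    def send_alert(subject, body):
        # Assumes a local mail relay is listening on localhost.
        msg = EmailMessage()
        msg["Subject"], msg["From"], msg["To"] = subject, "cron@example.com", ALERT_TO
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    def main():
        # 1. Make sure no other instance of this application is running.
        if os.path.exists(LOCK_FILE):
            sys.exit(0)  # another run is in progress; wait for the next cron tick
        open(LOCK_FILE, "w").close()
        try:
            # 2. If there are new files, run the application on them.
            new_files = glob.glob(os.path.join(INBOX_DIR, "*.dat"))
            if not new_files:
                return
            result = subprocess.run(PROCESS_CMD + new_files, capture_output=True, text=True)
            # 3. Check for errors when the application completes, and alert on failure.
            if result.returncode != 0:
                send_alert("Ad processing failed", result.stderr[-2000:])
        finally:
            os.remove(LOCK_FILE)

    if __name__ == "__main__":
        main()

Notice the weakness baked into even this tidy version: if the application hangs or the wrapper dies before cleanup, the lock file lingers and the whole chain stalls, which is exactly the kind of failure described next.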

This algorithm looks logical and simple, right? Its implementation is robust and should work without issues, right?

Well, this is far from reality, and plenty goes wrong.

There are many error conditions to check for and resolve. As we find one, we update the scripts to identify and resolve the issue.

If the application hangs and never completes for whatever reason, then the whole process grinds to a halt like someone jabbed an iron staff into the printing press. These disruptions happen more often than anyone wants to admit because the underlying infrastructure, databases we connect to, or services we rely on have their own faults. We develop more scripts to look for hung applications, stop them, and then restart them to resolve these issues.

As we grow customers, the processing time increases, and delays are even harder to recover from and get back on schedule. So, we add more hardware and more scripts to distribute the load across the infrastructure.

Load balancers distribute web traffic across multiple servers, but there aren't mainstream technologies to distribute batch processing applications. Maybe banks and insurance companies have them, but we're a startup and not seeking out these enterprise technologies. The scripts we develop to distribute the load need additional scripting to ensure robust operations, especially if one of the servers fails or is running slow.

The only thing going in our favor is that we instrument robust logging. Every application has a log, and every process, regardless of how many scripts are daisy‐chained, also has a log. I become a wizard with UNIX tools like sed, awk, grep, and find to parse through these logs when seeking the root cause of a failed process.
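Those one‐liners boil down to counting and correlating error lines across the chain of logs. Here is a small, hypothetical Python equivalent; the log location and error keywords are assumptions, not our actual conventions.

    import glob
    import re
    from collections import Counter

    # Hypothetical log location; every application and daisy-chained script wrote its own log.
    LOG_GLOB = "/var/log/adpipeline/*.log"
    ERROR_PATTERN = re.compile(r"ERROR|FATAL|failed", re.IGNORECASE)

    # Count error lines per log file to spot which step in the chain broke first.
    errors_by_file = Counter()
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as log:
            for line in log:
                if ERROR_PATTERN.search(line):
                    errors_by_file[path] += 1

    for path, count in errors_by_file.most_common(5):
        print(f"{count:6d} error lines  {path}")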

By 2001, this jalopy is processing the classified ads from over 1,600 newspapers. We advance the scripting so much that it fails infrequently, but when it does, it's frustrating and challenging for the network operations center (NOC) to figure out the problem and the best route to solve it.

Once you write and support bad code, you attain a specialized skill set of detecting it in other people's code. Like any disease, you see the symptoms of it first. The application performing poorly or crashing could be an infrastructure or configuration issue, but I never rule out a coding problem as a possible root cause. When development teams are slow or even scared to make code changes, they're almost always inhibited by hard‐to‐read and trace code. Deployments that create production issues? You guessed it. It's probably bad code.

Bad code comes in many flavors, sizes, and impacts. And every time I walk into a new job or am involved in acquiring a company with proprietary software, I open my nostrils to sniff out where the code smells bad and what impact it's causing.

The technical debt, legacy systems, and half‐developed integrations pile up. As people leave the company and others get reassigned to work on the next set of business priorities, fewer people are left behind with the knowledge to maintain these systems.

Knowledge transfer and enabling ongoing support may not be problems for you as a developer or technical lead. They might not be factors for you as a cloud or systems engineer when procedures are in place to recover from issues. Data scientists seek to move on to the next challenge, model, and insights. Product managers want to build products, deliver on customer expectations, deliver business impacts, and then move on to the next growth opportunity.

As you move up the ranks to director, vice president, or a CXO (anyone with a c‐level title), you'll end up owning the mess. You'll struggle to find people to work on legacy systems, the time to address the root cause, and the budget to upgrade them. What's worse, even with the best intentions, you're probably developing the next generation of legacy applications. You probably won't have robust standards in place or reinforced, and business leaders will demand that applications ship faster than developers and engineers can learn and implement best practices.

I think back to those days when we were developing software without many of today's tools, platforms, and best practices. Today, your apps are either cloud‐native or modernized to run in clouds and take advantage of the cloud's robust elastic infrastructure. Instead of just logging, there are standard practices for application observability. Instead of scripting and connecting applications with digital duct tape, you should leverage monitoring, automation, and AIOps platforms that analyze operational data in real time, correlate alerts to incidents, and trigger automated responses. There are static and dynamic code analysis tools that can pinpoint bad code before developers check it into code repositories, and there is an oversupply of open source and commercial web services that developers and data scientists should leverage instead of developing their own technical solutions. Then there are low‐code and no‐code tools to create apps, application programming interfaces (APIs), integrations, and dashboards, so you may be able to upgrade employee workflows and customer experiences without having to build code from scratch at all.

And yet, we develop more and will leave a new generation of technical debt and legacy systems behind, often because we try to do too much, too fast, and without sufficient investment in best practices.

Dramatize the Mess: Getting Leadership's Buy‐In to Address Technical Debt

Let's time‐warp to another point in my career and a new scenario where I confront a mess and must illustrate the magnitude of technical debt and spaghetti code to senior leaders.

This mess isn't my frankenarchitecture, but you can't pin the blame solely on the developers. After all, technology legacies start at the top and flow downstream into the implementations. The developer might have done all the coding, but they weren't the one layering on years of business rules and exceptions. The architect didn't create standards. The team leaders didn't conduct code reviews. And the managers didn't ensure knowledge sharing.

I'm sitting in a meeting one day with senior leaders of our transformation program, including the CEO. This company completely relies on technology. They collect data, process it, analyze it, and then sell analytics from it. Like most analytics companies I work with, they collect information on buyer behaviors and sell purchasing analytics back to the businesses that sell them products and services. Opportunities to capture data exist along the customer journey in every industry, especially retail, adtech, fintech, and other sectors where many interactions occur in digital modes. When the entire customer journey is digital, it's relatively easy to put hooks into the process and create A/B testing scenarios that collect data and use it to understand buying decisions. But when the journey is a mix of digital and physical steps, the data collection is more complex.

The CEO and the leadership team are frustrated because we had another data processing delay. It takes days to process data, and it requires weeks to deploy algorithm changes. These production and cycle times might have been acceptable five years ago, but today they're a barrier for this company.

One of the primary problems is that one developer manages this codebase. He's been at the company for a long time, and his peers like him for his business knowledge and understanding of the data. But relying on him frustrates the leadership group because there's no one else to step in and address issues.

Maybe they should have thought about the problem of code coverage while the developer was creating, improving, and growing the codebase. But chances are, the leadership team has little visibility on what the technology looks like underneath the hood. Maybe they need to.

My senses awaken to the smell of bad code. I'm only a few months into the company, and the developer has been with them for well over a decade. Data integrations, processing, and DataOps are the arteries of data and analytics companies, and the underlying technology supporting them is the beating heart. They deliver new and processed data across the organization for decision‐making. When the arteries are clogged, it can slow the delivery of much‐needed data to the brain, which we use to analyze, interpret, and identify insights. When the data arteries come to a halt, it's like a heart attack, and it starves a data‐driven company, especially when analytics is the product.

The CEO looks over at me and says, “Isaac, we need to review the situation and come back with some recommendations.”

I'm not surprised to be handed this assignment. But because no one addressed why one developer is maintaining the company's lifeblood, I already suspect I'm jumping into a messy code issue.

I schedule time with the developer to do a code walkthrough. It's mostly database stored procedures and reporting services that stitch together an end‐to‐end data flow. The developer never documented the codebase, and there are no high‐level diagrams illustrating the data flows. Each procedure is a mix of data processing and computations and coded that way for performance considerations. Some of the code is in version control, but the rest is embedded in platforms that make it cumbersome to manage in external repositories. There's lots of code—certainly too much for me to understand in one sit‐down. In fact, I'm still searching for how anyone would make their way through this codebase. It's like an overgrown jungle, but if you go in with a machete, you might actually hack something important.

I need more time to figure this one out, but I have to go back to the leadership team with an answer. So, I ask Phil to make a printout for me.

“A printout? That will take hundreds if not thousands of pages,” he responds while giving me a you're‐out‐of‐your‐mind look. He thinks I'm going to read through all the code, and he also doesn't want to waste all that paper.

So, I let him know my intent: “You can print it double‐sided in the smallest font available. I need to make a point. If it requires more than one package of paper, then you can stop.”

I also hate the idea of wasting paper, but this is the most efficient way for me to make a point—dramatically. When asked for an update at the next meeting, I put the three‐inch brick of paper down on the table. I pass it around and state emphatically that this will not be a simple problem to solve.

I look around the room for reaction and sense bewilderment. It's a shock‐and‐awe moment, and I hope the drama I created through this visual will convey the story and the magnitude of the problem. There's no anger or blame in their eyes—which is good—but I can sense that the weight of this problem (pun intended) is sinking in. Unfortunately, Digital Trailblazers are bound to hit legacy boulders in the road. The question becomes, which boulders should you focus on? You'll have to decide whether it's better to chip away at the problem or when it makes more sense to re‐platform to a modernized solution, which often makes the most sense when the issue is business critical.

Over the next few months, we chip away at this problem. We put an extract, transform, and load (ETL) platform in for new data flows and refactor critical aspects of the legacy ones. But I never get to see this mess fully cleaned up.

Large codebases, managed by very few people, are one source of technical debt. I've seen 1.4 million lines of Perl code, HTML pages with tens of thousands of lines of JavaScript, stored procedures with thousands of lines of code, and databases with thousands of tables. I've seen a proprietary ERP developed as monolithic stored procedures running the back office of a large institution. Recently, I saw an organization with multiple service buses, each with multiple versions running production workloads.

So, while there is smelly code that needs fixing and technical debt that needs support, there are also what I call burning fires.

Burning fires take down the business. And not just once, often multiple times. Once the fire starts burning, a team assembles in a war room to determine root causes. Burning fires bring together people from the network operations center, incident managers, and IT service management. If the fire is burning out of control, the war rooms may also require developers, testers, and business managers. Their collective work results in the problem being identified and resolved, but the fire is still smoldering. Soon, another fire flares up, and the team assembles again in another war room to put out a fire that's only slightly different than previous ones.

In this all‐too‐common scenario, management rarely gives the technology team sufficient time to come up with permanent fixes. It's a byproduct of how companies thought of technology as one‐time capital investments and then left a small fraction of operational expenditure to support it. Today, we live in a SaaS world where customers expect ongoing technology improvements, and now it's up to Digital Trailblazers to budget and manage their technology investments to support it. Are you up for this challenge?

Bad Code in Modern Platforms: Still a Problem

Bad code doesn't just show up in legacy systems. It can easily appear in new code on modern platforms. Only, the stakes may be higher when bad code leads to operational issues that evolve into a business crisis.

I witness a code nightmare years later when an executive committee hires me to lead a digital transformation at a services company. They're eager to turn around a business model that collects data today only to report findings several months later. We have our work cut out for us, as every aspect of the business runs on processes and technologies requiring improvements and transformations.

But before I can move the organization into high transformational gear, there is a critical operational issue that needs addressing in their call centers.

It is a problem with several call centers that survey our clients' customers. You walk into the store, perform a transaction, and get a call from one of our agents to collect feedback on the experience. Our clients—ranging from big brands to small businesses—are all demanding, and their bonuses depend on the results. If the surveys aren't working properly, they aren't shy about letting us have it.

I inherit a set of technology and operations that definitely require improvements, but it isn't a complete disaster. The IT team manages hundreds of low‐end desktops used by the agents, upgraded dialers that are the cornerstone of saving the company a ton of money, and the new enterprise survey platform that will bring the business into the digital age.

No one cares about how the technology is implemented or integrated. The executives care that they invested hundreds of thousands of dollars into it, and the surveys for our biggest clients aren't working. They care that the team can't figure out the problem. And they are furious that the client is calling our CEO and giving him an earful.

I feel like Al Pacino in the third Godfather movie when he says, “Just when I thought I was out, they pull me back in!” I'm freaking hired to transform the company, but I am now back in the trenches to solve an old‐school performance problem.

I'm on the short plane ride up, thinking about what's to come. I have no idea what type of problem I am walking into because it's in a technical area where I have limited experience. I've been in organizations that have call rooms and others that perform market research surveys. This group does market research surveys with people in multiple call centers over the phone. Some of their surveys operate on a legacy proprietary application, but the failing ones run on a commercial market research platform that integrates with the dialer.

This performance problem could be a network, hardware, platform, or application issue. It might be a problem in a single call center or agent desktop that's bottlenecking the service. Maybe the hardware they sized is insufficient to support this client's larger surveys. Or perhaps the audio recordings are bottlenecking the network or storage infrastructure.

The application is given a list of people and phone numbers at the beginning of the night. It feeds the dialer these numbers and starts dialing down the list until it makes a connection. Once you pick up and say, “Hello,” it routes the call to someone in the call center. You've likely received one of these calls before and may have wondered why there's a short pause between announcing yourself and hearing a voice on the other end. In this case, the attendant reads a script and asks the respondent to participate in a quick survey. If they confirm, the attendant clicks to start the survey.

What I find out is that on some occasions, the survey never appears. It's a blank screen. Sometimes it does come up after a long wait time, but the attendant has lost the call by then.

This particular survey is a big one being dialed out of several call centers simultaneously. It only has a few questions, but there's significant programming logic to it.

When I walk into the office, I can see the teams standing by whiteboards holding their daily agile standups. Even though these are surveys, they are still programmed. The output data and reporting also require coding. And even though there are many options to tag the verbal responses to open‐ended questions, the coding is largely done manually. Their legacy is in programming everything, and many scripts are copies from one market research study used as the base for the next one. It's copies of copies of code, and once a market research study ends, so does its underlying code. Of course, this code isn't in version control, and code from completed surveys is never archived.

There are many opportunities to improve things here, but today I'm solely focused on figuring out why this survey, the biggest one they do for their largest customer, is failing consistently.

The leader of this team is a very rational thinker and a solid operational manager who keeps the trains running. He knows when to ask for help and is a team player when trying to solve any form of issue, especially customer issues. He's hyperfocused on the customer because he's in a services business. Either he delivers a good experience and the results from the surveys daily, or customers are upset.

I like him and am intrigued by the challenges facing the group. He's hoping to fix the immediate problems and then move on to the greater challenges of streamlining their market research operations. He knows the group must evolve from classic market research services to digital ones. I accept the challenge and look forward to the partnership, but also warn him, “Give me a shovel and I'll start digging, but you're probably not going to like what I find.”

On their new platform, this one survey has none of that spaghetti code of scripts required by their legacy platform, which is another reason everyone is angry. This new platform is there to automate their processes and bring them into the digital age of market research. It's the Rolls‐Royce of market research platforms because it supports surveys collected over the phone as well as surveys completed through websites and mobile phones. The company invested a ton of money in this platform, and the payoff should come from big surveys like this one.

The failures are costing the company a lot of money.

Every night we fill the call centers and pay attendants to dial. And every night, they complete only a small fraction of the surveys. We're not getting paid for the surveys that aren't fully complete. It's a triple whammy to our financials because we take on all the costs, collect only a small fraction of the revenue, and pay a full IT team to figure out what's going wrong with the survey.

I walk into the conference room, and I know everyone there. One thing I like about this group is that they aren't intimidated by my presence. There's no feeling of, “Oh, shit, the CIO is in the room,” that I sometimes get with other teams. They carry on their conversation, and it's a back‐and‐forth hypothesis on what's wrong.

The head of survey operations responsible for the survey thinks it's a hardware issue with the dialers. A support engineer from the vendor is on the phone, and he blames the network or possibly a hacked workstation that's flooding it with network chatter. The head of network operations reports to me and is also dialed in. He reminds everyone that the network and workstations are working fine for other surveys, so the issue must be something about this particular survey and its underlying platform. There's also a support engineer from the market research survey platform dialed in, and she wants to review all the software the development team used to integrate and move data around. That goes on the to‐do list, even though the software lead is grumbling because those same integrations and scripts are components of other surveys.

I wait for an opening to chime in, and as I listen to the conversation, I think to myself, “How can I support this team? What are they missing, what questions should I ask, and where should I guide them?” It's a challenge since I am still new to the organization and know little about the architecture, so I have to reach into the fundamentals of encouraging the team while identifying improvement opportunities.

I step in at an opportune moment and say, “Lots of good theories and logic, everyone. I can see there are several moving parts to this program. There are several new platforms mixed with legacy technologies that have been in place for a long time. There's clearly something different about this survey, and we can't continue to play hot potato and hope the issue is not in our scope of responsibilities. Who can show me data on how different aspects of this system perform while the survey is running?”

There's a short pause, and as it continues, I'm getting a little scared. Surely the technologists look at performance metrics?

I finally get a response, “Leo is the best person looking at the underlying data.” I look around the room to identify Leo and what he's learned from the data, but I am told he isn't in the meeting.

The consensus coming out of this meeting is that it's a dialer issue, but I'm not convinced. When the meeting ends, I go to visit Leo.

Leo is playing around with this new tool called Splunk. Every system administrator should have thought of creating this tool because it simplifies aggregating and querying data from multiple log files. But of course, using these tools without a strategy can be a colossal waste of time. I ask Leo, “Build me an outlier dashboard. Show me which of the four call centers are underperforming. Which agent desks are slower than the others? Which system parameters are spiking 95 percent above their means? What part of the survey are agents running when issues begin to happen?”
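The dashboard I ask Leo for comes down to simple outlier math. Here is a hedged sketch in Python with pandas, assuming a hypothetical CSV export of per‐desk metrics rather than Leo's actual Splunk searches; the column names are made up for illustration.

    import pandas as pd

    # Hypothetical export of per-desk survey metrics: call_center, desk_id, question_step, response_ms
    metrics = pd.read_csv("survey_metrics.csv")

    # Which of the call centers are underperforming?
    by_center = metrics.groupby("call_center")["response_ms"].mean().sort_values(ascending=False)

    # Which agent desks are slower than the others, and which readings spike 95 percent above their means?
    desk_means = metrics.groupby("desk_id")["response_ms"].transform("mean")
    outliers = metrics[metrics["response_ms"] > 1.95 * desk_means]

    # What part of the survey are agents running when issues begin to happen?
    trouble_steps = outliers["question_step"].value_counts().head(10)

    print(by_center, trouble_steps, sep="\n\n")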

One week later, I am back at the office looking at the report. Leo affectionately names the report “Isaac's Dashboard,” and even my name is in the URL. I laugh to myself upon seeing this, but the truth is, I hope this data‐driven approach becomes ingrained in their culture and operating model without requiring my guidance.

So, I ask the team, “How are we going to test this survey offline and without impacting our client?”

It turns out there isn't an easy answer. We can test the survey, but not with the dialers. They have already tested the survey and found no issues, so the next step has to include the dialers. That means staffing the call centers and tasking agents to dial the survey, but we can't experiment with a live client survey. We need to try different scenarios, tweak configurations, and even try out variants of the survey. So the only way to test is to staff the call centers and dial a mock‐up version of the survey: a list of people not affiliated in any way with the client, answering a bunch of made‐up questions programmed with the identical sample and survey logic used by the client.

Testing manually is a costly approach. We must pay the agents, and while this test survey is running, we wouldn't be able to run other surveys that generate revenue for the business. It's like stopping a factory floor and building fake widgets to identify quality issues.

“Set it up,” I tell them. I'll be back next week to help run the war room. In the meantime, I'll have to explain this approach and the underlying costs to the executive group. I'll bear this burden, and I know I'll be facing a firing squad of questions.

That's a key role for the Digital Trailblazer. Understand the problem, listen to the team's insights, ask questions, review data, debate approaches, and decide on a course of action. Then go back and communicate the plan to stakeholders and provide them sufficient time and resources to address problems.

Here are five takeaways from this chapter:

  1. Focus everyone on technical simplicity when developing any kind of new system. The best technical solutions may not come from building something new. Today more than ever, you might be able to provide good enough results through SaaS solutions. You can also develop applications on low‐code platforms that provide the structure to create more capabilities with less and sometimes no code. When it's clear that you must engineer new systems, challenge business leaders to simplify requirements, and influence developers to find simple, supportable implementations. Over‐engineered technology solutions developed on the latest and greatest application architectures are often the hardest to support long term.
  2. Let the data be your guide when making decisions, but also trust your instincts. Decision‐making requires balancing facts, insights from data, opinions shared by colleagues, and intuitions from past experiences. Your ability to listen to all sources and then guide teams decisively is a critical leadership skill. This skill is essential when resolving operational issues, especially as you become less hands‐on with the underlying technology and less familiar with all the end‐to‐end business processes.
  3. Build rapport with your team by acknowledging your own mistakes and failures. Many leaders recommend showing humility with your team, and they also recommend using stories to help convey messages. Both practices are truly important when working with technical staff. They must know that you've been in their shoes, and from your stories they should learn your philosophies on how you tackle implementation challenges. Now it's easy to share your best practices, but this isn't always the best approach if you want others to learn, internalize, and make the lessons core to their behaviors. You didn't understand best practices only through successes. Don't be shy about sharing the hard lessons you've learned along the way, and own up to where you made mistakes.
  4. Lead teams by helping prioritize which questions need solving. Your knee‐jerk response to solving an operational, technology, or customer issue may be to get to root causes and fix the implementation. But before you go there, add some discipline to your thinking. Leaders didn't ask me to solve an ETL problem. They asked me to explain why solving it was so hard and how best to address a personnel issue. The team was collaborating to solve a customer issue, but the problem was that they weren't using data to guide their efforts. Sometimes, the question is worth solving, as when finding a way to validate the implementation changes to the market research survey solved the problem. Other times, the leader should help identify the question or problem but put any thinking and work on its solution in the parking lot. There's always more work to do than any organization can handle. Prioritizing is a critical leadership skill for Digital Trailblazers.
  5. Demonstrate the business and customer impacts around tech debt and legacy systems. I don't meet many CIOs and CTOs who profess to have a complete handle on all the technical debt and full business impact of their legacy systems. It's unlikely they have sufficient budget, business priority, or skills to address technical, data, and other forms of debt before cracks in the foundation become business crises. What's worse, the term technical debt is a terrible way to garner an executive's full attention. CFOs are used to having financial debt on the balance sheet, and COOs have more than their share of operational workarounds. Digital Trailblazers must tell the technical debt story to executives in a language they understand, like dropping a brick of paper on the desk or using a business crisis to get support for investment. And you can't go to the executives with a request to fix everything. Transforming legacy systems requires a disciplined approach to prioritize what's most impactful. Then, Digital Trailblazers must understand and implement solutions that meet future business needs, not just upgrade systems to meet today's technology challenges. So before investing in new systems that replace existing ones, it's critical to review the problems through a customer and strategic lens. What problems are worth solving, when, how easily, and with how much investment?

▪  ▪  ▪

If you would like more specifics on these lessons learned and best practices, please visit https://www.starcio.com/digital-trailblazer/chapter-2.

Note

  1. John Reinan, “How Craigslist Killed the Newspapers’ Golden Goose,” MinnPost, Feb. 3, 2014, https://www.minnpost.com/business/2014/02/how-craigslist-killed-newspapers-golden-goose/.