Chapter 2. The Importance of Building a Self-Service Culture

Before we can talk about how to build a data lake, we need to discuss the culture of the company using that data lake and, more specifically, the mental shift required for organizations to fully embrace the value of a data lake. In more traditional organizations, the DataOps team stands between the data and the business users. Traditionally, when a user needs data, they approach a data analyst or data scientist and make a request. The data team then responds. It’s common for the data team to build dashboards and have a set of prebuilt reports that it refreshes or sends out periodically, but so-called ad hoc requests are usually handled on a case-by-case basis.

This gatekeeper approach inevitably causes bottlenecks, as shown in Figure 2-1. Users who need data for a key presentation or to make a strategic decision are forced to wait their turn in the queue. Often, they give up and make the decision without data to back it up. Under this paradigm, it becomes difficult, if not impossible, for an organization to extract the full value from its data. A self-service culture is therefore essential if your company is going to get the most value from its data and, eventually, become a truly data-driven organization.

Figure 2-1. The importance of building a self-service culture

The End Goal: Becoming a Data-Driven Organization

A data-driven business is one in which decisions are powered by data as opposed to intuition or even personal experience. It’s one in which people who tend to “think from their gut” are encouraged to use hard empirical evidence to back up what they say and do. In a data-driven company like Facebook, for example, no one would think of showing up to an important meeting without quantifiable facts to back up their position.

According to Forrester, data-driven companies grow eight times faster than those that work from intuition or speculation. Insights-driven businesses grow on average more than 30% annually and are on track to earn $1.8 trillion by 2021.

Obviously, a first step toward being data driven is to make data readily accessible to everyone in the organization—that is, to democratize the data. And this means having a self-service data culture.

Numerous—if not most—companies today have announced their intentions of becoming data driven. Many have already started down this road. The basic premise has long been that deploying a data warehouse, populating it with company data, and hiring a team of intelligence analysts will lead in no time to data-driven nirvana.

But it’s not happening very quickly. Virtually all respondents to a recent NewVantage survey said that their firms are trying to make the shift, but only about a third have succeeded.

A data-driven organization should possess three things, according to Ashish Thusoo, cofounder and CEO of Qubole, as documented in the O’Reilly book Creating a Data-Driven Enterprise with DataOps:

  • A culture in which everyone buys into the idea of using data to make business decisions

  • An organizational structure that supports a data-driven culture

  • Technology that makes data self-service

We are focusing on the last point in this book, but let’s go over all three requirements nonetheless.

Foster a Culture of Data-Driven Decision Making

Whether you know it or not, your business already has a decision-making culture. The problem is that your culture might not be geared toward a data-driven approach. It might be ingrained in the culture of your organization, like many others, that the “HIPPO” (highest-paid person in the office) is the last stop in the decision-making process, and the senior person in the meeting gets to make the final choice. Unfortunately, the HIPPO can at times be very wrong. But unless you have the data to use as evidence for your arguments (and your company culture supports arguing with the top brass in the organization), their decision usually stands.

To successfully become data driven, your employees should always use data to start, continue, or conclude every single business decision, no matter how major or minor.

You probably need to start the shift from the top. For example, many marketing departments are becoming more data driven. A chief marketing officer (CMO) can set the right tone by making it mandatory to experiment with and test new creative initiatives and campaigns to gather data on their impact, as opposed to simply relying on gut feeling and intuition. That message gives primacy to data, and that sentiment then flows to the rest of the marketing team.

It’s important to understand that different employees within your organization will have different reasons for buying into using data in their day-to-day jobs. You first must identify who all of these stakeholders are. Then, you must understand what will motivate them to begin using data to make decisions. And then you must make it easy for them to serve themselves, organizationally and technically, a topic we address in the next two sections.

Build an Organizational Structure That Supports a Self-Service Culture

Organizationally, how do you support a data-driven company? Most successful data-driven entities have created a central data platform team that publishes data and manages the necessary technical infrastructure. Although some organizations establish multiple data teams and embed them in different departments, each catering to the needs of that particular department, this is typically less useful if you hope to get to data-driven nirvana because it ends up creating a number of isolated data silos. Therefore, a strong, functional, central data team is extremely important. As we discussed in Chapter 1, this provides a single source of truth that underpins all data analyses.

Your next step is to embed business function–specific analysts within each department: staff specially designated to help users in that department extract value from your company’s data. This works best because those analysts possess domain-specific knowledge about the business function (the marketing data analyst understands campaigns, and the finance data analyst understands accounts payable) while also having intimate knowledge of the data. They can convert the language of the data systems to the language of the business, as illustrated in Figure 2-2.

This is critical because the two languages are very different. The business wants to ask questions such as the following:

  • Which geographic regions of my business are the best to invest in?

  • What is the size of the market?

  • Who is the competition?

  • What are the best market opportunities today?

Figure 2-2. The hub-and-spoke data organizational model

According to Thusoo, Facebook quickly understood that it needed a centralized data team. Then, it embedded analysts in every product team. “We also took care that all the analysts had a central forum at which they could meet and communicate what they were doing, allowing data intelligence to flow through the entire organization,” he says.

Essentially, this model transmitted the data-driven DNA of the self-service culture throughout the company. If any link in this “value chain” of data is weak, you face barriers to developing a true data-driven culture. Chief data officers usually carry the ultimate responsibility of nurturing and growing this value chain of data.

Putting a Self-Service Technological Infrastructure in Place

All data-using employees must be supported by the central data team if you hope to achieve self-service. This team is ultimately responsible for maintaining the infrastructure. Their job is to deploy whatever people, processes, and technologies are necessary to make data available to everyone who needs it, on a self-service basis.

In a recent survey by Qubole, 70% of organizations said a self-service environment was either already in place or planned, as shown in Figure 2-3.

Figure 2-3. Plans to build a self-service environment

By empowering users to explore data easily and with the tools they’re comfortable with, data teams can keep staffing low and raise the productivity and effectiveness of business users throughout the organization. The key here is that data engineering teams need to be able to automate the underlying infrastructure to be fully aligned with key business initiatives. If they don’t do that, they’ll waste time providing information to other data teams rather than doing infrastructure setup and maintenance.

You’ll know that your company is genuinely data driven when “bottom-up” demand for self-service data access emerges. When this happens, you must ensure that the tools and mechanisms are there to support this bottom-up interest among employees. For example, with self-service tools and processes in place, employees in marketing could themselves find and analyze previous campaigns’ datasets—which are stored in the centralized data lake—to come up with ideas for successful future campaigns and messaging.

Challenges of Building a Self-Service Infrastructure

When it comes to creating a data-driven infrastructure that includes a data lake, organizations most often face the following four challenges:

  • Lack of specialized expertise

  • The disparity and distribution of data

  • Organizational resistance

  • Reluctance to commit to open source

Let’s examine each of these a bit more closely.

Lack of Specialized Expertise

Building a data lake for truly big data requires specialized technologies and skillsets. In fact, the primary big data challenge in the Qubole survey was lack of experience, as shown in Figure 2-4.

Figure 2-4. Challenges faced by big data teams

The 80/20 rule appears to be true in the big data world, in which only 20% of the technology workforce has adequate hands-on experience building big data platforms that can scale as needed. This raises urgent issues, because data in corporate data lakes is growing year after year, as depicted in Figure 2-5.

Figure 2-5. The growth of corporate data stores

During the initial exploratory phases, users might experiment with data on their local laptops. This works for the short term because the datasets don’t exceed the resource limits of their system. However, as soon as the data begins to grow, in either volume or complexity, processing it demands more compute resources than a single machine can offer. This is when organizations begin to look at cluster computing engines like Hadoop, Hive, Spark, or Presto, depending on the particular use case.

Also, most IT engineers only have experience with static, monolithic applications that have fixed resources, not clusters and workloads, which are more ephemeral.

The LinkedIn Workforce Report for US (August 2018) states that “demand for data scientists is off the charts,” with data science skills shortages present in almost every large US city, as demonstrated in Figure 2-6.

Figure 2-6. Job growth in big data space

Moreover, the technology in the big data space is still young and thus constantly evolving. Though Spark was the most used framework in 2017, as Figure 2-7 highlights, Flink and Presto were catching up with extraordinary market-share growth of 125% and 63%, respectively.

Figure 2-7. Frameworks continually evolving

Disparity and Distribution of Data

Next, the disparity of data is a challenge. The centralized data team needs to be able to capture all of the data in the organization. But the number of sources is mind-boggling—and growing every month. Moreover, much of the data is siloed. You might need to find and capture data from within different business applications, product applications, public and private customer interaction points, monitoring systems, third-party data providers, and many other sources.

Also, many enterprise data systems are primarily set up for operational reasons. Collecting data from them is usually an afterthought. The natural inclination of your business is probably not to go out of its way to capture this data, much less expend the effort to consolidate it in one place. Potentially valuable data from all of these sources remains in silos. In the process, the organization loses many opportunities to derive insights or discover optimizations by putting data from different sources together.

Deciding to create a data lake can raise all sorts of questions:

  • Where do I store the data?

  • In what format should I store it?

  • How long is that data useful?

  • How do I make it easy for users to find data?

  • Who has access to the data?

  • What are our permission rights to the data?

The first step toward overcoming these challenges and answering these questions is to take an inventory of all of your data sources and create a company-wide data-capture infrastructure that lays out the correct way to capture and log the data. Everyone in the organization should adhere to those standards.

The next step is to consolidate all of the data in the centralized data lake so that every data consumer in the organization knows where to find it.
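As a rough illustration of the inventory step, the following Python sketch records, for each source, answers to several of the questions listed above: where the data is stored, in what format, how long it is useful, and who has access. Everything here is an assumption made for illustration: the `DataSource` fields, the helper functions `find_sources` and `readable_by`, and the example entries and paths are hypothetical and do not describe any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSource:
    """One entry in a hypothetical company-wide data inventory."""
    name: str            # short identifier users can search for
    owner: str           # team accountable for the source
    location: str        # where the data is stored in the lake
    file_format: str     # format the data is stored in
    retention_days: int  # how long the data is considered useful
    readers: List[str] = field(default_factory=list)  # groups with read access


def find_sources(inventory: List[DataSource], keyword: str) -> List[str]:
    """Return the names of sources whose name contains the keyword."""
    needle = keyword.lower()
    return [s.name for s in inventory if needle in s.name.lower()]


def readable_by(inventory: List[DataSource], group: str) -> List[str]:
    """Return the names of sources that the given group may read."""
    return [s.name for s in inventory if group in s.readers]


# Illustrative entries only; the names and paths are made up.
inventory = [
    DataSource("marketing_campaigns", "marketing",
               "s3://example-lake/marketing/campaigns/", "parquet", 730,
               ["marketing", "finance"]),
    DataSource("web_clickstream", "product",
               "s3://example-lake/product/clickstream/", "json", 90,
               ["product", "marketing"]),
    DataSource("accounts_payable", "finance",
               "s3://example-lake/finance/accounts_payable/", "csv", 2555,
               ["finance"]),
]

print(find_sources(inventory, "campaign"))
print(readable_by(inventory, "marketing"))
```

Even a simple structure like this makes gaps visible: a source with no owner, no retention period, or no listed readers is one whose questions have not yet been answered.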

Organizational Resistance

Change is difficult—especially a change as massive as moving to a self-service, data-driven company.

The first step is to get the hub-and-spoke data organization in place and ensure that it is aligned with the business requirements. This in and of itself can prove challenging, and given how slowly companies change—especially larger enterprises—it could take many weeks or months.

Next, you need to ensure that you’ve put the right tools in place to enable self-service for your users. Depending on how sophisticated your users are, these tools can range from canned dashboards and reports to ad hoc querying tools, all the way to fully customizable data platforms. And you must stay on top of whether these tools are actually being used—there’s no point in having them if no one pays attention to them.

Your users’ level of data literacy—their ability to find, work with, analyze, and argue using data—is critical to building a self-service, data-driven culture. Airbnb serves as an example of what you can do to improve enterprise-wide data literacy. Airbnb had a data literacy problem, despite having built a centralized data lake and populated it with massive datasets that the company thought would be useful to employees in the 22 countries where it operates. At the beginning of Q3 2016, only 30% of Airbnb employees used this data platform at least once weekly.

To remedy this, Airbnb built its Data University, with the mission of educating everyone in the company on how to use data effectively. Engineers, product managers, designers—all employees, really—were taught how to use data to unearth insights that would help them make better decisions.

Data University proved a great success, and has transformed Airbnb’s culture to one that is both self-service and data driven. Employees learned how to handle ad hoc data requests themselves, and within a year, 60% of them were using the Airbnb data platform at least weekly to make data-driven decisions.
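The adoption figure quoted above (the share of employees who use the data platform at least once a week) is straightforward to track if the platform logs who ran queries and when. The following Python sketch shows one way such a metric might be computed. It assumes a hypothetical usage log of `(user, date)` pairs and a hypothetical employee roster; it is not a description of how Airbnb measured adoption.

```python
from datetime import date, timedelta
from typing import Iterable, Set, Tuple


def weekly_active_share(
    events: Iterable[Tuple[str, date]],
    employees: Set[str],
    week_start: date,
) -> float:
    """Fraction of employees with at least one event in the 7 days from week_start."""
    if not employees:
        return 0.0
    week_end = week_start + timedelta(days=7)
    # Collect distinct employees who appear in the log during the week.
    active = {
        user
        for user, day in events
        if week_start <= day < week_end and user in employees
    }
    return len(active) / len(employees)


# Hypothetical roster and usage log, for illustration only.
employees = {"ana", "ben", "chi", "dev", "eli"}
usage_log = [
    ("ana", date(2024, 1, 1)),
    ("ana", date(2024, 1, 3)),   # repeat use by the same person counts once
    ("ben", date(2024, 1, 5)),
    ("chi", date(2024, 1, 9)),   # falls in the following week
    ("zed", date(2024, 1, 2)),   # not on the roster, so ignored
]

share = weekly_active_share(usage_log, employees, date(2024, 1, 1))
print(f"{share:.0%} of employees used the platform this week")
```

Tracking this number week over week gives the central data team a simple signal of whether training and tooling efforts are actually changing behavior.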

Finally, consider creating some internal centers of excellence around big data analytics. Some of the best data-driven companies constantly reach out to various teams and ask, “How are you doing? What are you using to solve problems?”

You might have some centers that are really good at Spark, whereas others are really good at analyses. Still others might be really good at data governance. With these centers of excellence in place, you can quickly overcome many roadblocks.

Reluctance to Commit to Open Source

It’s taken a while, but open source is no longer such a dirty word with mainstream enterprises, even fairly traditional ones like financial services or health care companies.

Figure 2-8 shows that “open source software programs play an important role in how DevOps and open source best practices are adopted by organizations,” according to a survey conducted by the New Stack and the Linux Foundation.

Figure 2-8. Large companies most likely to use open source

In the survey results, more than half of respondents (53%) across all industries say their organizations use an open source software program or have plans to use one. Large companies are almost twice as likely to run an open source program as smaller companies (63% versus 37%). And the number of large companies using open source programs is expected to triple by 2020.

Companies with open source programs also see more benefits from open source code and community participation. As Figure 2-9 highlights, 44% of companies with open source programs contribute code upstream, whereas only 6% of other companies do so.

Figure 2-9. Companies with open source programs are more likely to benefit

There are three key arguments for adopting open source software:

  • Your company is on the cutting edge of software innovation.

  • Your company is able to tap into a huge community of support.

  • Your company can “fork” your own version of open source software to build into your applications.

Going the other way—to closed, proprietary big data solutions—you might have clearer roadmaps. You certainly have more structure. And you’re paying for stability and enterprise-grade support if something goes wrong. The latter is probably the number one reason: if something goes wrong, you know who to yell at. You have a contract that says the issue will be fixed within four hours. More risk-averse companies tend to stick with proprietary systems. But even that is changing, as big conservative financial institutions like JPMorgan have embraced Hadoop and other open source big data tools.
