Chapter 2

Improved agility and insights through (visual) discovery

Abstract

This chapter focuses on the expansion of traditional business intelligence to include data discovery. It discusses how discovery should be seen as a complementary process that drives BI going forward, rather than replaces it, supporting data-driven competencies for evolving data-centric organizations. It also reviews the role of friction in discovery and how to navigate the four forms of discovery to maximize the value of data discovery as a key strategic process. Finally, it explores the emergence of visual discovery and briefly touches on the role of discovery in the days to come, and the unique challenges and opportunities that discovery will bring as it becomes an increasingly fundamental strategic process.

Keywords

data discovery
BI
friction
visual discovery
strategy
agility
insights

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny’…”

—Isaac Asimov

The characteristics of data-driven companies, and the proof points and case studies shared in chapter: Separating Leaders from Laggards, support the continued reinvention of traditional business intelligence (BI) today as data discovery and visualization continue to take an increasingly important seat at the BI table. Now it is time to take a look at how the data industry itself is changing from the inside out.
Assessing change covers three areas: people, processes, and technology. In this chapter, I will focus on the expansion of traditional BI to include data discovery. This is the processes portion of the change in the data industry. And, while the role of data discovery is steeped in philosophical debate as the new kid on the block (though, as we will come to see, it is not really so new after all), it should be seen as a complementary process that drives BI going forward, rather than replaces it. As an extension of BI, data discovery supports the data-driven competencies of evolving data-centric organizations. This chapter reviews the role of friction in discovery and how to navigate the four forms of discovery to maximize the value of data discovery as a key strategic process—and what that means for those companies looking to adopt a discovery-centric culture today. With a foundational knowledge of what discovery is and how it differs from exploration, we will also briefly explore the emergence of visual discovery, a concept that will be paramount in later discussions throughout the rest of this book as we continue to dive into the role of data visualization and its importance in visual analytics. Before moving ahead into the last piece of the people, processes, and technology trifecta—people—in chapter: From Self-Service to Self-Sufficiency, we will touch on the role of discovery in the days to come, and the unique challenges and opportunities that discovery will bring with it as it becomes an increasingly fundamental strategic process in data-driven organizations. You will find that many of these will shape later conversations in Part III of this book.

2.1. The discovery imperative

Data discovery—or, to use its previous moniker, information discovery (ID)—is not exactly new. In fact, the past has seen information discovery referred to, somewhat facetiously, as “one of the coolest and scariest technologies in BI.”
Nevertheless, recently, discovery—like many other things in today’s data intelligence industry—is undergoing reinvention. Spurred into the spotlight by an influx of new tools and technologies, the new and improved approach to data discovery is fueled by slick, intuitive user interfaces and fast in-memory caching and compressed, associative databases—and by the realized affordability of Moore’s Law, which provides users with desktops that boast more capacity, more horsepower, and higher-resolution graphics, enabling desktop discovery tools that were not possible five to seven years ago—especially for mainstream users. The ability for people to effortlessly access data from anywhere, and then quickly and iteratively explore relationships and visualize information, has forever changed the way we think about data discovery in a world where analysts of all shapes and sizes (or, skill levels) get to play with data.
You could say that the entrance of discovery has erupted into a bit of a data discovery epidemic—but in a good way. Once introduced in an organization, information discovery tools spread like wildfire among those seeking to capture immediate value and insights—and, like children opening presents on Christmas morning, they often do so without waiting on IT or BI teams (our analogous parents) to catch up. The tool vendors know this, too. For example, Tableau Software has a well-known “land and expand” strategy of trickling in with individual users (or user groups) and expanding to enterprise licenses within a short amount of time. Veteran BI professionals struggle with the “open-data” tenet that appears counterintuitive to decades of the traditional BI mindset of well-organized and properly controlled data and information usage. The overarching concern is the descent of BI into chaos and the inevitable discovery hangover: we should have known better. This is a worrisome shadow in the back of many of our minds that we are hoping never to face.
The rush toward new technologies that deliver instant gratification and value should be considered carefully. In the case of data discovery, I believe this is an undeniable and natural way in which humans learn, gain intelligence, and make better decisions. Being a species that relies on competition—figuring out things faster and better than others—we are practically bred to discover, and what better place to apply that intrinsic drive of curiosity than to the thing that is becoming front and center in much of our personal and professional lives—data? With that in mind, let us explore some of the critical conversations that BI teams are having today—and those that you should carefully consider at some point in your adoption of discovery as a fundamental process alongside traditional BI.

2.2. Business intelligence versus data discovery

Now, let us not start out on the wrong foot. While it is common to hear that ominous phrase “BI versus Discovery” and get the impression that traditional BI and data discovery are somehow sworn adversaries or otherwise at odds, they actually have quite a lot in common. In fact, they should be seen, if anything, as companions rather than rivals. After all, sometimes the most unlikely of pairs make the best of friends. Consider Nobel Laureate T.S. Eliot and the bawdy comedian Groucho Marx, who had a peculiar friendship that began when Eliot wrote Marx for a signed photograph; the two stayed pen pals until shortly before Eliot’s death in 1965. Or the quick-witted Mark Twain and inventor Nikola Tesla, who became such good friends that each credited the other with special restorative powers—Twain’s novels a tonic for Tesla’s recovery from bedridden illness as a young man, and Tesla’s electric wizardry a cure for poor Twain’s severe bout of—shall we say—constipation.
Such an improbable companionship have BI and discovery. Like Eliot and Marx’s shared love for literature or Tesla and Twain’s intellectual curiosity, at their cores both BI and discovery share the same intended purpose: to derive value from data. Their approaches to that purpose are a matter of different perspectives. When it comes to defining what really distinguishes discovery from BI, it boils down to a simple change in how we think about the data. Rather than relying on business people to tell us how the business works—the BI approach—discovery instead relies on using real data to show us—to give us insights into—what is really going on in and around the business.
We can articulate the fundamental difference between BI and discovery in the following way: traditionally, enterprise BI has focused on how to create systems that move information in and around and up and down the organization, while maintaining its business context. It focuses on keeping that very important context bubble wrapped tightly around the data so that the business consumer does not need to worry about it. It is “rely and verify”: a framework wherein the analyst role is embedded within the system and the end user does not have to be a data expert, but just has to be able to rely on the data presented to verify that a business need is met. The end goal of BI is to create and predefine easily digestible information that is plattered up and served to the end user with the purpose of being consumed as-is. It is insight off a menu.
Traditional BI focuses on establishing predefined business goals and metrics that drive the necessary constructs of business performance management and dashboards, and then transforms data against them. The traditional BI process is designed to understand your data through a disciplined process of “analyze, design, and develop.” It looks inherently backward, consuming data as-is through reporting and analysis to achieve a rear-view mirror perspective into the business. It is reactive versus proactive.
Discovery, instead, is all about being proactive. It begins not with a predefinition but with a goal to explore and connect unknowns—which is less a political statement and more aptly the whole idea behind collecting and storing all available data and looking to connect the dots and make associations between trends to see things that had not been known before—new insights. Opposite to the “rely and verify” approach of traditional BI, discovery approaches the data in an iterative process of “discover, verify, operationalize” to uncover new insights and then build and operationalize new analytic models that provide value back to the business. It is kind of like why buffets make for such a popular family dinner destination: everyone can personalize their dining experience with a seemingly endless combination cultivated from a diverse selection and amount (I am looking at you, dinner rolls) of food types.
Ultimately, the fundamental difference between BI and discovery is simple: one starts with a predefinition and expectation of the data, while the other ends with a new definition derived from new insights into the data.
When we talk about BI versus Discovery, we are not really putting them on opposite sides of the battlefield. Instead, what we are ultimately talking about is having the ability—the willingness—to iterate and explore the data without the assumptions and biases of predefinitions.
Consider this example: IT (or a BI team) asks the business to tell it what the business needs to know. The business, in turn, answers with a metric—not what they need to know, but what they need to measure in order to calculate what they need to know. This, by the way, is how we have come up with things like dimensional modeling, OLAP cubes, and other slice-and-dice approaches to understanding and interpreting data to achieve a business goal or other key performance indicator (KPI). Whole generations of BI have fixated on understanding and defining how data needs to map into a metric. But here is the rub: things like OLAP are only as good as what you predefine—if you only predefine five dimensions, you will not discover the other twenty hiding in the data. You have to know that you do not know what you are looking for to be able to know how to find it—and no, that is not a riddle to the location of the Holy Grail. It is simply another way of thinking that supports the need for discovery—and for the environment in which to discover.
Discovery (which should not be confused with exploration—see Box 2.1) begins with a goal to achieve within the business, but it accepts that we simply do not know what the metrics are or what data we need (or have) to meet that goal. It requires living inside—getting all up close and personal with—the data. This is the investigative nature of discovery—exploring, playing, visualizing, picking apart, and mashing back together the data in an iterative process to discover relationships, patterns, and trends in the data itself. We may already know the context of the data, but the goal is to build new models to uncover relationships that we do not already know, and then figure out how that information can provide value back to the business while always evolving and changing as the business does over time.

Box 2.1   Discovery versus exploration

Discovery does not equal exploration, just as exploration—likewise—does not equal discovery. While interrelated, these concepts are not interchangeable—or, at least not in my opinion.
As I see it, data exploration is a process by which to systematically investigate, scrutinize, or look for new information. Within the discovery process, exploration is the journey that we take through each step, as we continue to seek a new discovery. To explore is a precursor to discovery: it is to set out to do something—to search for answers, to inquire into a subject or an issue. From the mid-16th century, to explore is to ask “why.”
Discovery itself is the end game of exploration: it is to make known—or to expose—something that has been previously unknown or unseen, like an insight. Discovery moves beyond the “why” to the “what”—or, as from the late Latin discooperire, to “uncover.” It finds something new in the course (or as the result) of exploration. It is the moment in which we become aware, observe, or recognize something new or unexpected—or, alternatively, the moment when we realize that there is not something new to find. As the saying goes, “even nothing is something.” When we talk about the discovery process, what we are really saying is the process by which you earn a discovery—it is an exploratory process primed for a happy discovery ending.
And, discovery is as much about prediction as it is about iteration. Analysts with an inherent knowledge of the data can look at the context and identify that it is not quite right—that it does not join with an established metric quite as anticipated—they can predict that and already have a plan in mind of what to try next. There is another critical component to context, too: each analyst must decide whether it is—and how it is—applicable to their situation. This has always been a conflict between enterprise data warehouse context and departmental data marts—now it is at the empowered individual level, too. Then, they can go forth and discover and derive further specific context from what is already known. They can iterate. It is agile, yes, but it misses some of the discipline that makes BI, well…BI. Data discovery has always been a part of the BI requirements gathering process, and included data profiling, data quality, data mining, and metric feasibility. Discovery does not have to be completely standalone or complementary to BI—it can also continue to aid those BI processes, which, years ago, required the assistance of agile methodologies.
To go full-on discovery mode requires this give-and-take ability to predict and iterate—to not be satisfied with one answer and to keep on searching for new information. We want to be able to fail fast—to take advantage of short shelf lives and get the most out of our information when and how we can—and then move on. And that kind of iterative ability necessitates self-sufficiency, a new-and-improved breed of the old “self-service” that we will explore in detail in the next chapter. Analysts now need not only access to data; they need to be able to create and consume on the fly, that is, without having to go and ask for help and without being hindered by avoidable friction in the discovery process. They need discovery tools, and they need discovery environments (see Box 2.2). This is part of IT’s new role—enablement and consultation—and part of a larger shift we are going to start seeing happen in the industry.

Box 2.2   Playing in the discovery sandbox

When you have great toys, the next thing you need is a great place to play with them.
Like discovery, the need for an environment to support it is not new. There have been “exploration data warehouses” for some time. More recently, the term “analytic sandbox” has been thrown into the vernacular to support the interactive nature of analytic models for business analytics. These analytic sandboxes can be a virtual partition (or schema) of existing analytic-oriented databases, independent physical databases, or analytic appliances. Other specialized databases are built inside out to support very human, intuitive discovery frameworks, too. Or, more recently still, these discovery sandboxes can be a section of the data lake—or even a data “pond.” Big data environments, like Hadoop, seem to intrinsically enable information discovery. Still, the data industry recognizes the information discovery category primarily as vendors whose desktop software enables users to leverage in-memory databases, connectivity, integration, and visualization to explore and discover information. Many of these vendors specialize and differentiate themselves by how they architect discovery-oriented sandboxes, as well as by how they perform supporting tasks, like integration, preparation, or data acquisition.
The choice between enabling “desktop sandboxes” or “architected sandboxes” (or both) can center on choices regarding data movement, or location of the analytic workload, or user type. With a separate sandbox database, data collections from the data warehouse (and other sources and systems that contain additional data) can be moved via ETL (and a large number of other open source and proprietary data collection and integration technologies) to isolate analytic processing and manipulation of data without impacting other established operational BI workloads for the sake of discovery. Another advantage comes from the possibility of collaboration among business analysts, who can share derived data sets and test semantic definitions together. This kind of collaboration is not easily done when a business analyst works locally on their desktop and wants to perform other desktop functions, or requires the desktop to be connected off hours to execute data integration routines. Most discovery tools—visual or otherwise—now allow business analysts to publish their findings and data sets to server-based versions.
It was not too long ago that the discovery-inclined Hadoop environment lacked integration tool compatibility and was burdened by too much programming complexity, thereby limiting its use to specialized users like the data scientist. But, in the past handful of years, this has changed. Vendors new and incumbent have increased integration capabilities with Hadoop via HCatalog, Hive, and more. New platforms and frameworks, like Spark and the even newer Apache Flink, are squaring off to reduce complexity, eliminate mapping, and find even more performance gains. Whatever the tool of choice—and whether it is a single Hadoop cluster, a full-fledged data lake, or one of their more traditional on-premises counterparts—the choice is not whether to have a sandbox—it is where.
When picking your sandbox locale, consider the following:
Who is doing discovery?
Does all the data needed fit onto a desktop, even with compression? (Excel can hold over a million rows per worksheet itself.)
Are users more likely to work in isolation or collaboratively? What about now versus down the road?
Is the same data being pulled onto many desktops, such that a one-time centralized operation would enable many users to perform discovery? Or is it already more centralized?
What types of sandbox-ready architectures already exist—and do you need another?

2.3. The business impact of the discovery culture

I seem to have found myself sounding remarkably like a broken record in this chapter: discovery itself (as a function of learning more about available data) is not new, nor is the need for discovery environments. And, that is all true enough—these principles may be as old as the data industry itself. We have always wanted to push our data to the max and learn as much as possible from it, and we have wanted protected ways in which to do it. (It is a little reminiscent of that infamous canonical wisdom of the Biblical verse: there is nothing new under the sun.) What data discovery and its associated counterparts are, however—and you are welcome to pick your adjective of choice here—is changing. Being reinvented, evolving, modernized—the list can go on. It is a swing from the Old Testament—the old way of doing things, if you will—to the New.
Scottish poet, novelist, and purveyor of folk tales Andrew Lang—who, incidentally, is also “not new,” seeing as how he passed away in 1912—has, amongst his writings, been credited with leaving behind a lovely quote that has since been used as an immortal critique of scholarly position (or of social progress): “he uses statistics as a drunken man uses lamp posts – for support rather than for illumination.” Lang may have been speaking specifically about statistics, but his words hit on another thread—that of the need to see beyond the obvious and look toward the larger picture. Discovery—and the insights that we reach through discovery—should not be limited to face value alone, but should also be a part of a larger shift toward a discovery-oriented culture.
To recap our previous discussion into agreeable takeaways, we know that both BI and discovery rely on data. Likewise, they both require a high degree of business context and knowledge, and both pivot between analysis and verification to be actionable. However, while BI and discovery ultimately share the same mission of delivering derived insights and value to the business, they are—again—two uniquely distinct capabilities that are both concerned with analyzing, predicting, and discovering. And, all of these are part of the information feedback loop to measure the business, analyze, act, remeasure, and verify. To take advantage of data discovery for business value requires more than a pairing of two approaches to information management. It also necessitates a cultural change within the business. The fostering of a discovery culture—including embracing new mental models and fostering an iterative and agile environment—enables business users from analysts to data scientists to help organizations unlock the value of discovery. This was the salient point of an article I coauthored with Manan Goel for Teradata Magazine in 2014—discovery and BI each provide business benefits, but they are distinctly different. So what does that mean for your organization? Let us discuss.

2.3.1. Fostering a discovery culture

Discovery organizations are different from those that are Business Intelligence Competency Center (BICC)-based, or those that are otherwise traditional BI-centric. Primarily this is because, unlike traditional BI, discovery is iterative—a fundamental difference that requires change to make it happen. And, like any organizational change, discovery cannot simply be given lip service as an organizational imperative: it must be embedded into the fabric of the business in order to live up to its potential of being a valuable process.
The discovery environment operates under the new mental model of “fail fast, discover more.” It is highly iterative, focusing on providing the agility to access, assemble, verify, and deploy processes wherein analysts can rapidly move through data to explore for insights. It depends on providing access and the ability to incorporate data, too, so that analysts can explore all data—not just that stored in the data warehouse—and leverage all forms of business analytics—SQL, nPath, graph, textual, statistical, and predictive, among others—to achieve business goals.
Finally, the discovery process is collaborative and requires the ability to share and verify the findings. The discovery culture requires that the business users have some level of independence from IT, and that they have intuitive, visually optimized tools to support exploration, too.
The discovery culture is:
Agile and iterative
Failure-tolerant and experimental
Collaborative and IT-independent

2.3.2. Discovery culture challenges

Enabling a discovery culture is not without its set of challenges. First, it is—or, it can be—difficult for successful BI delivery organizations to accept “iterative failures” as a good cultural attitude. There is a stark contrast between the traditional build-to-last mindset and the built-to-change mindset needed to be agile, discover opportunities faster, and capitalize on them before competitors. We are not typically programmed with the mindset that failure is acceptable—much less that it is actually okay, and a normal part of exploration. Instead, we hear “don’t mess up,” or “try harder,” or “practice makes perfect.” A recent Forbes article put it this way: failing—fast or otherwise—is an idea we think is a good one, but only when it applies to other people.
It is counterintuitive—and an interesting psychological exercise—to think about why failure is a good thing. Yet not only is failure a good thing, it is still decidedly undervalued as a technique for innovation. As an anecdote about the need to fail—and quickly—consider the story of British inventor James Dyson. Dyson, like the vacuum company “Dyson”? Yes, exactly. One day, Dyson looked at his household vacuum cleaner and decided he could make it better. It took him 15 years and 5,125 failed prototypes before he decided to fail on purpose (Dyson is quoted as saying “I thought I’d try the wrong shape…and it worked”). On his 5,126th prototype, Dyson found his winning design—winning to the tune of $6 billion to date in worldwide sales—and the improved Dyson vacuum is a great modern success story that speaks to the power of failure. This may seem like a lesson in perseverance more than anything, but when you do the math and realize that Dyson was creating an average of 28 prototypes per month, it becomes jaw-dropping to think about how quickly he was designing, experimenting, testing, iterating—how fast he was failing.

Another rags-to-riches story about the power of failure is that of J. K. Rowling, author of the multimillion-dollar Harry Potter series. The story goes that Rowling received twelve publishing rejections in a row before being accepted—and was only then accepted after the eight-year-old daughter of an editor demanded to read the rest of the manuscript (these are publisher rejections, mind you, and do not include prior rejections from literary agents, which, knowing a bit about the publishing industry, I can only assume were even more numerous). Even after the acceptance of The Sorcerer’s Stone, Rowling was advised to get a day job—which might be the most embarrassing comment that editor ever made to a woman whose last four books in the Potter series have since consecutively set records as the fastest-selling books in history, on both sides of the Atlantic, with series sales of 450 million copies—and this does not even include earnings from merchandising, movies, or the Universal theme parks, either. Burn!
The lesson, again, is that sometimes failure can be a good thing. And the faster you fail, the quicker you can move on to that sweet, sweet win. Keep in mind that, as it becomes more data-driven, your company is—or will be—dependent on data discovery to stay competitive and alive, so the more people exploring in a field of data, the better.
Alongside the requisite “fail fast” mental models of discovery, a highly iterative and exploratory environment, and capable tools and wide-open access (more on these in the next chapter), embracing a discovery culture means that we also face new challenges in governing people, roles, and responsibilities. Aside from the mindset change of failure as a good thing, governance is arguably the second biggest challenge to be faced by the discovery culture. And, as if that were not enough on its own, we must be aware of the need to govern the results of discovery itself—and this applies both to the insight and to the way in which we present, or visualize, the insight. Roles and responsibilities for accessing and working with the data should be established, as well as definitions of ownership and delegation for the semantic context discovered. The discovery results themselves should also undergo governance, as should the operationalization of newly discovered analytic models. For the most part, BI has had the benefit of having governance baked in (as broad and optimistic as that may sound) while defining everything from extraction and transformation to consumption of context. Discovery is driving governance to new levels and unexplored territories of policies and decisions to balance risks. That said, a conversation on governance cannot be constrained to a few measly sentences, so I will devote an entire chapter to this concept later and move on for now.

2.3.3. Discovery organizations as ambidextrous by design

Last of all, I just want to insert a brief tidbit on organizational design for the discovery culture, for those of you interested in such things. A strong case can be made that the discovery culture is one that is ambidextrous by design. These organizations attempt to create excellence—competitive advantage—by supporting entrepreneurial and innovative practices (Selcer and Decker, 2012)—hence, self-sufficiency (which I will introduce in greater detail in chapter: From Self-Service to Self-Sufficiency). Ambidextrous organizations—like data- and customer-centric organizations—are adaptable and continue to mature and evolve as they react to the internal and external challenges that contribute to the ongoing shaping of the organizational design. This can be top-down, to strongly coordinate work for efficiency and productivity, or it can be bottom-up, and thus promote individuality and creativity.

2.4. The role of friction in discovery

The role of friction in data discovery is much akin to that minimalist design mantra that will later on be the capstone of our visual design conversation: less is more.
Unlike traditional BI, discovery hinges on the ability to explore data in a “fail-fast” iterative process that cycles through repetitive steps of accessing the data; exploring it; blending and integrating it (or “interrogating” it) with other data; analytically modeling the data; and finally, verifying and governing the new discovery before operationalizing it back into the enterprise. As an iterative exercise, discovery inherently places a premium on the ability to quickly and agilely harness large amounts and varieties of data to explore, and also—equally as important—on speed. This is not just speed for the sake of speed, either: it is a fundamental prerequisite of truly enabling the discovery culture in your enterprise. The quicker you can move through the discovery process, the quicker you can arrive at insights that add value back into the business. Finding a worthwhile discovery—especially the first time around—is not guaranteed. This is the essence of fail-fast: if one discovery does not work, toss it aside and keep looking. Sure, it may take only one attempt to discover a valuable nugget, but it also may take 99 iterations through the discovery cycle before one valuable insight is uncovered (if at all). The value of speed, then, can be found at the intersection of actionable time to insight and the ease—or “frictionless-ness”—of the discovery process.
Friction is caused by the incremental events that add time and complexity to discovery through activities that slow down the process, like those that IT used to do for business users—many, if not all, of which can be reduced (or removed completely) with robust self-service, visualization, collaboration, and sharing capabilities. Friction, then, is a speed killer—it adds time to discovery and interrupts the “train of thought” ability to move quickly through the discovery process and earn insight.
Somehow every performance measurement seems to boil down to that same old 80/20 concept. When I was in project management, we applied it there: 80% of time should be spent planning, 20% doing the work. Now, in discovery, it is here again. When it comes to discovery, we often hear that 80% of time is spent in data preparation, leaving only 20% of the time for data science and exploration (some have suggested this ratio is more closely aligned with a 90/10 split). The more friction that is added into the process, the longer it takes to yield an insight in the data. Removing friction reduces the barriers to discovery and subsequently shortens time to insight—it increases speed.
Think of it this way: would data discovery be nearly as worthwhile if analysts had to budget an hour of waiting time for each step of the discovery process? And, what if it does take 99 iterations to find one meaningful insight? Do the math: that is 5 steps at 1 hour each, 99 times, for a total of 495 hours, or approximately 21 days, to value. Yikes! But if we pick up the speed—reduce the friction—and drop that hour of waiting time down to 1 min per step, the time to value is significantly reduced: 1-min steps, 99 times, is a mere 495 min—8.25 hours—to value. How much time would you prefer to spend—three weeks or one day—before discovering an insight that adds value in your business? How much would you be okay with wasting?
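The arithmetic generalizes into a quick back-of-the-envelope model. The sketch below, in Python, is illustrative only; the step count, iteration count, and wait times are the chapter’s hypothetical figures, not measurements.

def time_to_insight(wait_minutes_per_step, steps=5, iterations=99):
    """Total waiting time, in hours, before the first insight surfaces."""
    return wait_minutes_per_step * steps * iterations / 60

for wait in (60, 1):  # an hour of friction per step versus a single minute
    hours = time_to_insight(wait)
    print(f"{wait} min/step -> {hours:g} hours (~{hours / 24:.1f} days)")

# Output:
# 60 min/step -> 495 hours (~20.6 days)
# 1 min/step -> 8.25 hours (~0.3 days)

The friction knob is the only thing that changes between the two lines of output; the discovery process itself is identical.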
Yes, sometimes there are unavoidable activities (like governance) that will inevitably add friction to the discovery process. However, the goal is to reduce the strain (and time sink) around those activities as much as possible—as illustrated by the math earlier, every second counts in discovery. Remember: the real value of discovery is to earn insights on data and facilitate interactive and immediate reaction by the business. Breaking down the activities that induce friction in each step of the discovery process reduces the barriers to discovery and subsequently shortens time to insight. Speed, then, is a function of friction: the less friction in the discovery process, the more value speed can deliver to the business. As friction decreases, time to failure improves and subsequently time to insight shortens—and the more valuable discovery becomes. Now, multiply that by the ever-increasing number of people in the enterprise (not just data scientists and analysts) working to solve the challenges of the business. Compare that to the number of people using Google and Yahoo! to discover solutions to everyday life challenges, and how that has evolved over the past 5, 10, and 15 years—adoption has been proven.
Ultimately, it is a simple equation: less friction equals faster time to insight. The role of friction in discovery then? As little as possible.

2.5. The four forms of discovery

Among their many other benefits, Hadoop and other big data playgrounds serve as a staging ground for discovery. This is not by accident, but by design: these ecosystems provide the ability to scale affordably to store and then search all data—structured and unstructured alike—enabling business analysts to explore and discover within a single environment.
Below I want to break down four identified forms of discovery, each of which is in use in organizations today, before introducing the concept of visual data discovery, which is the heart of this book. Each of the four forms of discovery below can be organized into two categories: traditional and advanced (or new). These traditional forms of discovery include commonplace, structured BI-discovery tools, like spreadsheets and basic visualizations, while advanced forms of discovery leverage multifaceted search mode capabilities and innovations in advanced visualizations to support new capabilities in data discovery.

2.5.1. Traditional forms of discovery

First, both mainstay spreadsheets and basic visualizations—like basic graphs and percentage of whole (eg, pie) charts—are traditional forms of discovery.

2.5.1.1. Spreadsheets

Spreadsheets (like Microsoft Excel) remain the most popular and pervasive business analytics paradigm for working with data, in part because of their widespread and long-standing availability and user familiarity. And, with a wide range of analysis and reporting capabilities, a spreadsheet can be a powerful analytic tool in the hands of an experienced user. The original spreadsheet pioneers VisiCalc and Lotus 1-2-3 discovered a powerful paradigm for humans to organize, calculate, and analyze data that has proven to stand the test of time—though some of the companies themselves did not. Today, Microsoft Excel 2013 can hold over one million rows (1,048,576) and over 16,000 columns (16,384) of data in memory in a worksheet.
With spreadsheets, the real value is in providing access to data for the user to manipulate locally. With this tool, and the data already organized neatly into rows and columns, an analyst can slice and dice spreadsheet data through sorting, filtering, pivoting, or building very simple or very complex formulas and statistics directly into their spreadsheet(s). They can discover new insights by simply reorganizing the data. Some vendors, like Datameer, for example, have started to capitalize on the concept of “spreadsheet discovery.” In this very Excel-esque spreadsheet user interface, analysts and business users can leverage the fluency of the Excel environment to discover big data insights. (This is not the only way vendors are reimagining the familiarity of intuitive Microsoft environments—some discovery tools (business user-based Chartio and statistical discovery solution JMP come to mind) have very wizard-like interfaces to guide users to discovery, too.) We might not always like to admit it, but Microsoft’s mantra of technology for the everyday user has enabled nearly every company in the world today.
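To make the slice-and-dice concrete, here is a minimal sketch of those same spreadsheet moves expressed in code, using Python’s pandas library as a programmatic stand-in for a worksheet. The orders.csv file and its column names (region, product, revenue) are hypothetical.

import pandas as pd

# Load a hypothetical worksheet: rows and columns, spreadsheet-style
df = pd.read_csv("orders.csv")

# Filter: keep one region (the AutoFilter equivalent)
west = df[df["region"] == "West"]

# Sort: largest orders first
top_west = west.sort_values("revenue", ascending=False)

# Pivot: summarize revenue by region and product, like a PivotTable
summary = df.pivot_table(index="region", columns="product",
                         values="revenue", aggfunc="sum")
print(top_west.head(10))
print(summary)

Each operation simply reorganizes the same rows and columns—exactly the kind of reorganization through which spreadsheet users discover new insights.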

2.5.1.2. Basic visualizations

Basic visualizations, such as graphs or charts (including those embedded in dashboards)—whether generated through Excel or not—provide simple, straightforward visual representations of data that allow analysts to discover insights that might not be as easily perceived in a plain text format.
It is no small task to put a point on exactly what constitutes a basic data visualization, as the range and breadth of visualizations is quite broad. Perhaps the simplest description is to say that basic visualizations are an effective means of describing, exploring, or summarizing data, because the use of a visual image can simplify complex information and help to highlight—or discover—patterns and trends in the data. They can help in presenting large amounts of data, and can just as easily be used to present smaller datasets, too. That said, basic visualizations fall short of their more advanced cousins in many ways. They are more often than not static, one-layered visualizations that offer little to no interactive or animated capabilities. Moreover, they lack dynamic data content and do not offer the ability to query data, personalize appearance, or provide real-time monitoring mechanisms (like sharing or alerts).

2.5.2. Advanced forms of discovery

The evolution of traditional forms of discovery has led to newer, more advanced forms of discovery that can search through—and visualize—multiple kinds of data within one environment. The two other forms of data discovery are what we classify as analytic forms of discovery.

2.5.2.1. Multifaceted, search mode

Multifaceted (or “search-mode”) discovery allows analysts to mine through data for insights without discriminating between structured and unstructured data. Analysts can access data in documents, emails, images, wikis, social data, etc. in a search-engine fashion (like Google, Yahoo!, or Bing), with the ability to iterate back and forth as needed and to drill down into available data to discover new insights. IBM Watson, for example, is a search-mode form of discovery, capable of answering questions posed in everyday language. We will touch on other discovery-oriented languages, and how they are readying for the Internet of Things, in more detail in later chapters.
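As a toy illustration of search-mode discovery, consider the sketch below, which uses SQLite’s FTS5 full-text index (available in most standard Python builds) as a modest stand-in for an enterprise search engine; the documents and sources are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(source, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("email", "customer reported a billing outage last Tuesday"),
    ("wiki", "runbook for diagnosing billing pipeline failures"),
    ("social", "great support experience after the outage was fixed"),
])

# One query spans every source at once, search-engine style; the analyst
# can then drill down on whichever hits look promising and iterate.
for source, body in conn.execute(
        "SELECT source, body FROM docs WHERE docs MATCH 'billing OR outage'"):
    print(source, "->", body)

The point is the shape of the interaction: one query, many heterogeneous sources, and an invitation to refine and drill down rather than a predefined report.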

2.5.2.2. Advanced visualizations

Finally, advanced visualizations are everything that basic data visualizations are not. They are tools for visual discovery that allow analysts to experiment with big data to uncover insights in a totally new way. These advanced visualizations can also complement or supplement traditional forms of discovery, providing the opportunity to compare various forms of discovery to potentially discover even more insights, or to have a more complete view of the data.
With advanced visualizations, analysts can visualize clusters or aggregate data; they can also experiment with data through iteration to look for correlations or predictors to discover new analytic models. These advanced visualizations are interactive, possibly animated, and some can even provide real-time data analysis with streaming visualization capabilities. Moreover, advanced visualizations are multidimensional, linked, and layered, providing optimal visual discovery opportunities for users to follow train-of-thought thinking as they visually step through data and craft compelling data narratives through visual storytelling. While basic data visualization types—again, like bar or pie charts—can be optimized to be advanced data visualizations, there also exists an entire new spectrum on the visualization continuum devoted to advanced types of visual displays, such as network visualizations or arc diagrams, that can layer on multiple dimensions of data at once.
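For a flavor of what “interactive, animated, and layered” means in practice, here is a minimal sketch using the open source Plotly Express library and its bundled Gapminder sample data; any comparable visualization library or a discovery tool’s scripting layer could serve the same purpose.

import plotly.express as px

df = px.data.gapminder()  # life expectancy, GDP, and population by country and year

fig = px.scatter(
    df, x="gdpPercap", y="lifeExp",
    size="pop", color="continent",  # two additional dimensions layered on
    hover_name="country",           # interactivity: details on hover
    animation_frame="year",         # animation: step through time
    log_x=True,
)
fig.show()  # renders an interactive chart the analyst can pan, zoom, and replay

A single view here carries five dimensions of data at once, and the analyst can replay, filter, and interrogate it rather than consume it as-is.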
The inclusion of visual cues—like intelligent icons and waves of color in heat maps—is an emerging technique in advanced visual discovery that leverages principles and best practices in the cognitive sciences and visual design. I will explore the “beautiful science” of data visualization in later chapters, when we talk about how to use color, perceptual pop-out, numerosity, and other techniques to layer visual intuition on top of cognitive understanding to interact with, learn from, and earn new insights from—and engage in visual dialog with—data.
Remember, advanced visualizations are not simply a function of how the data is visualized, but are measured by how dynamic, interactive, and functional they are. Advanced data visualizations enable visual discovery by design—the core focus of chapter: The Importance of Visual Design (for now, see Box 2.3).

Box 2.3   The emergence of visual discovery

By now, we have nearly dissected every angle of discovery—how it differs from traditional BI, the nuances between traditional and advanced discovery, and so forth. We have also begun to touch on the construct at the heart of this text, as the phrase “visual data discovery” has started to materialize in paragraphs past.
It probably will not surprise you that there are many definitions of visual data discovery floating around out there. Such is the fate of buzzy new terminology when it first flies into the face of every vendor, marketer, and customer trying to wrangle out and differentiate on a piece of the definitive pie of definitions.
Rather than tossing another definition into the pool, I would like to cast my vote with one that already exists and try to achieve some degree of unification. Gartner analyst Cindi Howson, a long-time data viz guru, has offered perhaps one of the most clear and succinct definitions of visual data discovery, stating: “Visual data discovery tools speed the time to insight through the use of visualizations, best practices in visual perception, and easy exploration.” Howson also notes that such tools support business agility and self-service BI through a variety of innovations that may include in-memory processing and the mashing of multiple data sources. To Cindi’s definition, I would also like to add that visual data discovery is a mechanism for discovery that places an inherent premium on the visual—perhaps more so than on analytical prowess—to guide discovery, and that works to facilitate a visual dialog with progressively vast amounts of large and diverse data sets.
Thus, my definition of visual data discovery is this: visual data discovery is the use of visually-oriented, self-service tools designed to guide users to insights through the effective use of visual design principles, graphicacy best practices, and kinesthetic learning supported by animation, interactivity, and collective learning. Narrated visual discovery is the basis of true data storytelling.
Later chapters will explore the anatomy of a visual discovery application and other technical details.

2.6. SQL: The language of discovery

For many a self-sufficient analyst, learning and being able to use SQL is not a skill prerequisite in the modern data analyst toolkit, as many self-service discovery-oriented tools are designed to take the scripting burden off of the user. Nevertheless, SQL remains a high-premium component of data discovery.
As organizations transition more and more from traditional BI methods of querying, analyzing, and reporting on data to the highly iterative and interactive model of data discovery with new data of varying structure and volume, the ability to load, access, integrate, and explore data is becoming increasingly critical. While technologies like Hadoop are gaining acceptance and adoption as data management systems built from the ground up to be affordable, scalable, and flexible in working with data—and while many new languages are emerging to work with data in new, condensed, and more intuitive ways—the Structured Query Language (SQL) remains the key to unlocking the real business value inside new data discovery with a traditional method. Even SQL’s little cousin MDX (the multidimensional expressions language for OLAP) will be empowered by big data.
The more users that are enabled to participate in the discovery process, the more value can be unlocked—in terms of both the quantity of users and the variety thereof. Every data user in the organization can be a vessel of discovery—from casual users (who represent 80% of BI users), to power users and business analysts (who represent 20% of BI users), to the few data scientists and analytic modelers in the organization. In addition to enabling more users within the organization to perform discovery, discovery-centric organizations should seek to extend the role of traditional data stewards. Beyond traditional subject, process, business-unit (ie, sales and marketing), data, and project stewards, look to define new data stewards—including big data stewards (covering website, social media, M2M, and big transaction data)—and to leverage the role of the data scientist to recommend algorithms, verify and test analytic models, and—more important—research and mentor others in the organization. By the way, this includes C-level titles, too. We will explore the roles of Chief Analytics Officers and Chief Data Officers—or even Chief Storytelling Officers—in more detail in later chapters. However, even though these people may perhaps be less likely to dig in and work directly with the data, they are every bit as relevant to this part of the discussion due to their positions as leaders, mentors, and enablers within a visual discovery culture.
However, the very few data scientists in the organization—and even the enabled power users—can only scratch the surface of the business value of discovery, and that, too, will eventually flatten out over time. Going forward, we should position these users as the enablers for the enterprise casual users, who also need access—those “know it when they see it” discoverers who benefit most from having the ability to interact and explore data in a familiar and self-sufficient way. This is where SQL as the language of discovery comes in.
Today’s self-sufficient analyst requires the ability to access data, to load it, and to iteratively explore and “fail fast” to discover insights hidden within data. This challenges traditional data management and data warehouses primarily through schema and controls. However, using SQL for discovery leverages decades of familiarity, adoption, and maturity that exist within tools already installed in today’s technology ecosystems. For example, many spreadsheet raw-data formats and intuitive visualization tools are heavily dependent on the SQL world. Therefore, analysts and power users immediately benefit from having a highly iterative, high-performing SQL capability within Hadoop—hence the rush to provide faster SQL access to schemas in Hadoop over Hive.
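What that iterative, fail-fast SQL loop looks like can be sketched in a few lines. Here, Python’s built-in sqlite3 module stands in for whatever SQL engine backs the discovery sandbox (a SQL-on-Hadoop engine, in this chapter’s setting); the table and the candidate queries are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, channel TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("West", "web", 120.0), ("West", "store", 30.0),
    ("East", "web", 20.0), ("East", "store", 95.0),
])

# Each hypothesis is one pass through the discovery cycle: run it,
# eyeball the result, keep or discard it, and move on to the next.
hypotheses = {
    "revenue by region": "SELECT region, SUM(revenue) FROM sales GROUP BY region",
    "revenue by channel": "SELECT channel, SUM(revenue) FROM sales GROUP BY channel",
    "region/channel mix": "SELECT region, channel, SUM(revenue) "
                          "FROM sales GROUP BY region, channel",
}

for name, sql in hypotheses.items():
    print(name, "->", conn.execute(sql).fetchall())  # discard dead ends, iterate

The loop itself is the discovery process in miniature: the faster each query returns, the more hypotheses the analyst can afford to throw away.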
Unlocking big data value in discovery is heavily dependent on the ability to execute SQL—because it is already so pervasive—coupled with functionality that delivers both performance and capability. However, not all SQL engines are created equal. They are all maturing differently. They have a different history or a different DNA. And, some are starting fresh, while others are leveraging years of database capability. The following three areas are considerations when evaluating the power of SQL on Hadoop. (Relatedly, Radiant Advisors annually publishes independent benchmarks focused on the performance dimensions of speed and SQL capability, which are recommended reading for a look at how individual queries perform on today’s leading SQL-on-Hadoop options.)

2.6.1. SQL capability and compatibility

First is SQL capability. We have learned through existing relational databases that not all SQL is the same. There is vendor-specific SQL, and while an engine may run SQL, it could support any of many versions, from the SQL-92 and SQL-99 standards to the later analytic functions found in post-2000 versions of the ANSI standard. If you have existing tools and reports that you want to connect to Hadoop, you do not want to rewrite a vast amount of existing SQL statements—you want to be sure they are going to work in existing tools and applications.
Compatibility with standardized SQL—or with more mature, advanced SQL—therefore minimizes rework. And, without SQL capability and maturity, many analytic functions simply will not run anyway. With this in mind, look at vendors’ roadmaps to see what analytic functions they have on tap over the next year or two.
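One pragmatic way to gauge capability is to probe an engine with a representative query and see whether it parses. The sketch below does exactly that with an ANSI window function, again using SQLite (version 3.25 or later, which added window-function support) as a stand-in for the SQL-on-Hadoop engine under evaluation.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER)")

# A representative analytic query: rank each user's events in time order.
probe = """
SELECT user_id, ts,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS step
FROM events
"""
try:
    conn.execute(probe)
    print("engine accepts ANSI window functions")
except sqlite3.OperationalError as exc:
    print("analytic SQL not supported here:", exc)

A small suite of probes like this—one per analytic feature your existing reports depend on—turns a vendor’s compatibility claims into something you can verify directly.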

2.6.2. Scalability

Second is scalability. With a large cluster of up to thousands of nodes, there is the assumption that the SQL engine runs on all the nodes in the cluster. However, be aware and cognizant of limitations. For example, if you are running 100, 500, or thousands of nodes, maybe the SQL engine is not capable of running on all of that and is limited to running on only 16 or 32 nodes at a time. Some early adopters have sectioned off areas of clusters to operate SQL-on-Hadoop engines on, resulting in a tiered architecture within single clusters. Even more important recently is whether the engine is YARN-certified or not.
Additionally, be aware of data duplication due to the data file format inside of HDFS (the Hadoop Distributed File System). Is data stored in an open file format, like text, Optimized Row Columnar (ORC), JSON, or Parquet, or does it require extraction out of those files and into a proprietary file format that cannot be accessed by other Hadoop applications—especially YARN applications? Beware of duplicating data in order to fuel the SQL-on-Hadoop engine.

2.6.3. Speed

Finally, again, it is all about speed. Speed does matter, especially in interactive response time and especially for big data sets—and especially from a discovery perspective, where the goal is to discover fast by iteratively “failing fast.” If you know you are going to fail 99 times before you find the first insight, you want to move through those 99 iterations as quickly as possible. Waiting 700 seconds for a query to return versus 7 seconds—or waiting on a batch process—can be a painful form of analysis, wherein you discover your patience threshold before any insights into the data. When evaluating speed, look beyond response times to consider workloads, caching, and concurrency, too.

2.6.4. Thinking long-term

When choosing any SQL-on-Hadoop engine, take your broader data architecture and data strategies into consideration and make sure they align. The architecture for SQL-on-Hadoop will continue to evolve—it has already moved from a batch-oriented Hive on MapReduce in Hadoop version 1 to the current Hadoop version 2, which added the benefits of Hive and Tez running on YARN for vectorized queries that brought orders-of-magnitude performance increases to Hive 13. Today, we see “architectural SQL,” with Hive and Tez running on YARN inside of clusters, and we can also begin bringing in other engines that run directly against HDFS. Architecture is a differentiator between those SQL-on-Hadoop engines that are already YARN-compatible and those that pick up performance by going directly to the HDFS core, and it bears on compatibility with long-term strategies. Of course, many new technologies and strategies continue to show up in the running for how to best manage and benefit from big data, but the one constant in a sea of change and invention remains SQL.
Ultimately, SQL is becoming more and more part of the story of unifying our data platform environment. Through SQL, we can bring together enterprise data, high-end analytics, and big data sets that live in Hadoop, and we can start to work with all of these through a unifying language.

2.7. Discovery in the days to come

This has been a long, somewhat nuanced dissertation on the role of data discovery—and visual data discovery—alongside traditional BI and other forms of information discovery. Ultimately, we can distill the bulk of this chapter into two capstone points: while discovery and BI share some commonalities, they are fundamentally different in approach and process, yet intertwined. And, data discovery is expected to become the prominent approach to BI in the next several years.
Today data discovery is beginning to deliver on all of its promise of capabilities and earned actionable insights (remember Netflix in the last chapter, which is basically the icon for discovery and empowerment as an organization that has disrupted an entire industry)—especially those reached visually. However, there are still areas of improvement to close gaps and find balance in features and functionalities. One of these areas is governance—both in the discovery process itself, and in how data visualization itself is utilized within the organization as an information asset. While discovery is good—and necessary—that does not mean it is a free-for-all. With intuitive, robust tools and wide-open access, we face the new challenges of properly governing roles, responsibilities, and how data is used in the business. We do not want to return to the chaos of reports that do not match, garbage data, or mismatched information systems. Discovery is as strategic a process as any other in BI.
We will explore these issues of expanding data governance for discovery in more detail later in Part III. For now let us continue to explore the shift from self-service to self-sufficiency and the business implications of the big data democracy. Let us talk about people.

Reference

Selcer, A., Decker, P., 2012. The structuration of ambidexterity: an urge for caution in organizational design. Int. J. Org. Innovat. 5 (1), 65–96.
