Chapter 2. Data

Data is the fundamental building block of everything in this book from here on out. The visualizations in the following chapters require a solid understanding of data and how to turn a question into a data-informed answer. In my role as a full-time trainer, I frequently meet people who have worked with data for a while but need to brush up on certain aspects. In this chapter I will cover those aspects to ensure you have what you need on your journey to becoming an effective communicator with data.

This chapter will help you form and refine those fundamental skills by building your awareness of:

  • What we mean by data and some of the key features of data

  • The sources of data and how it is created

  • Where you will find data

  • How to structure data sources to make them easier for analysis and communication

  • How to identify the correct data for your questions

Working with data can be intimidating at first, but you’ll find that tasks become much easier once you’ve developed a set of fundamental skills with whichever technology you choose to work with. To answer your questions, you will need to sift through a vast amount of data from a plethora of sources. The core skills covered here will prepare you to work with data in your workplace, no matter the data’s source.

Knowing your data is also important, as it allows you to validate your visualizations and describe the work to others. Imagine trying to speak a language without being sure what the words mean or represent. Often you won’t have a perfect data set to answer the question at hand. Knowing more about what is possible, where the data is potentially stored, and how you want to receive the data set for analysis will allow you to work confidently and more efficiently with data.

What Is Data?

You hear the word data constantly, but what does it actually mean? Data can be defined as the facts or numbers collected about observations for the purpose of understanding the subject better.

Data collection has become a lot easier with the growth in digitalization. Technology has become intertwined with most facets of our lives, and thus we can measure those facets and store the subsequent data. With the volume of data points being captured, the new challenge isn’t just finding the data points; it’s also creating clarity around what they mean. For a long time, most organizations struggled to store the vast quantities of data being captured by their services, such as their customer-facing apps or call-handling systems. Advances in technology have reduced this challenge, so we can spend less time working out whether the data was stored and more time focusing on finding and making use of the data.

The volume of data created means it isn’t always stored as you need it to be for your analysis, however. You might have others who provide you with data to work with in your role at your organization, but not everyone will be so lucky. Also, if you develop an understanding of what it takes to prepare data, you’ll be able to articulate what you need more clearly. To understand how to turn data into something meaningful, you must first understand the key features of data so you can recognize what’s useful and what isn’t.

The Key Features of Data

What image does the word data conjure in your head? I’ve worked with data for a decade and a half, and I still see the same image: a spreadsheet, with columns and rows holding cells of data. Before we begin looking at the rows and columns, let’s focus on the cells themselves and their contents.

Figure 2-1 shows an excerpt from a spreadsheet that I’ll use to talk through the importance of cells in a data set. The excerpt contains cells with different types of information.

Basic spreadsheet of data
Figure 2-1. Basic spreadsheet of data

There are three main classifications of cells that you’ll need to recognize to work with data effectively.

A header is similar to a title for the cells listed underneath it. Each header should name what the values in the cells below it represent. If your data has been created for you, the headers should be clear. If you find that they’re not clear, it’s much more likely that the data set hasn’t been prepared for you. In Figure 2-1, each cell under “Country” has a value that is a recognizable country. The more cells that contain a value you’d expect to see based on the header, the more confident you can be that the headers represent what you’d expect them to.

Categorical data is where a cell contains a value that helps us understand how to interpret the cells containing numbers. If Figure 2-1 contained only sales and target values, the user of the data would have no way to understand what those values actually represent. Using the cells with categorical data, we can interpret that York, in the northern region of the United Kingdom, had sales of 381,511. The contents of cells containing categorical data are often regarded as the categorical variables.

Numbers are the final classification of cells to recognize early on. The numerical data points are often the element of the data set you are actually looking to understand. You can analyze numbers by aggregating the values or comparing the variance between them, among other types of analysis. In Figure 2-1, you might want to compare the sales to the target values to see whether each row’s sales number is bigger than the target value.

You will use cells of data as the building blocks of your analysis. You will choose which ones to include; most data sets will also have many that you will ignore.

So where do these cells come from in the first place? Data can be created in many different ways and is kept for many different reasons. Everything we do creates data; understanding that can help us improve our lives. Richard Silvester, founder of the data visualization company infogr8, talks about how much data can be produced by our daily activities. In Table 2-1, I’ve used Richard’s “day in the life” concept to illustrate how you might create many different data points during the first part of a typical day.

Table 2-1. Data created during a typical morning
Action Data Created Use
Waking up Sleep-tracking data on wearable devices Tracking your sleep on a wearable device can help you learn about your sleep patterns and what affects them.
Showering Water/electric meter data Energy companies can use smart meters to optimize production and understand high demand periods.
Making breakfast Data on products bought at the supermarket or ordered online as you run out of supplies Your product-ordering data is fed back to suppliers to drive production levels and inform logistics companies.
Watching the news on your tablet Usage data captured by streaming channels; app usage tracked by device provider Production of programming can be modified in response to what people do or don’t watch. Streaming services want you to keep using them and will recommend what you should watch next on their platform.
Checking social media for your informal news Likes, shares, and app usage Algorithms assess your interests and provide information on the content you interact with.
Traveling to work Data from riding your bicycle (tracked on Strava), driving your car and filling up the gas tank, or buying a commuter rail ticket The Strava file tracking accelerations and routes taken helps improve cycling infrastructure. Satellite navigation systems are similar for roads. Ticket purchases show demand not only for trains but also for additional services, such as cafes, which gives banks evidence to support new or expanding businesses.
Entering the office Data from security badge used to access the building This provides time tracking for employees and also helps companies ensure everyone is clear of the building in case of safety events such as fires. This data can also be used to determine space requirements and working patterns.

Just by getting up and going to the office, you’ve created a data trail. Companies utilize data about what you use and when you use it to shape their logistics flows. These data points require analysis to enable companies to make meaningful decisions. The next key feature of data is how we structure the data cells.

Rows and Columns

Cells of values mean very little on their own. I’ve used the terms rows and columns to describe how cells are organized. Being clear on how to interpret rows and columns in data sets will allow you to understand more about the values you find within them. Let’s look at rows and columns in turn to understand their importance.

Rows

Ideally, a row of data should contain information about a single observation of whatever the data set is about. You need to look not just at the values generated by the observation but also at the different categorical and numerical values that are held on the same row. The categorical values will set the level of detail, or granularity, of the data.

Let’s look at a basic data set (Table 2-2) to identify what each row represents.

Table 2-2. Basic data set
Weekday Store Sales Value Sales Target
Monday Manchester 1000 800
Tuesday Manchester 750 800
Wednesday Manchester 400 900
Thursday Manchester 1350 1000
Friday Manchester 1500 1000

I trust you’ve noticed there are only two columns of categorical values, Weekday and Store. Therefore, the data set’s granularity is one row per store per day. Table 2-2 lists data for only one store, so each weekday sets the level of granularity at which a record of sales value and sales target is made.

Not taking time to recognize the granularity of the data set forming your communications is a mistake I see made frequently, even by experienced data workers. When working with a new data set, you must understand its granularity, because that will determine whether you need to aggregate the rows of data. Aggregation is when you sum up, average, or find the maximum value of data points at a different level of granularity than exists in the data set. For example, in Table 2-2, if you wanted to average the sales targets for the Manchester store, you’d aggregate the five rows down to a single value: 900.

Using aggregation to answer the questions you are analyzing is a common task when communicating with data. Table 2-3 looks at some examples and the types of aggregation you might use based on the data in Table 2-2.

Table 2-3. Using aggregation to analyze data
Question Aggregation Technique Needed
Which day had the highest sales value? Maximum: You’d assess each sales value and return only the largest one.
What are the total sales for the Manchester store? Sum: You’d add up the sales values and return the total amount.
What is the difference between the highest day’s sales and the lowest? Maximum, Minimum, and Subtraction: You’d find the largest and smallest sales values and subtract one from the other.

Most software used to communicate with data is designed to make these aggregations easy and intuitive to complete. When communicating with data, you must validate your results to ensure your communication is accurate.
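To make this concrete, here is a minimal sketch in Python using the pandas library (just one of many tools that could do this); the values come from Table 2-2, and the variable names are my own:

    import pandas as pd

    # Recreate Table 2-2: one row per store per weekday
    sales = pd.DataFrame({
        "Weekday": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
        "Store": ["Manchester"] * 5,
        "Sales Value": [1000, 750, 400, 1350, 1500],
        "Sales Target": [800, 800, 900, 1000, 1000],
    })

    # Which day had the highest sales value? (Maximum)
    best_day = sales.loc[sales["Sales Value"].idxmax(), "Weekday"]

    # What are the total sales for the Manchester store? (Sum)
    total_sales = sales["Sales Value"].sum()

    # Difference between the highest and lowest day's sales (Maximum, Minimum, Subtraction)
    sales_range = sales["Sales Value"].max() - sales["Sales Value"].min()

    print(best_day, total_sales, sales_range)  # Friday 5000 1100

Each aggregation reads down the Sales Value column and returns a single value per question, which is exactly what the descriptions in Table 2-3 ask for.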

Columns

The final part of the data structure to understand is the column. Columns organize similar cells of data so you can make sense of them. In a well-structured data set, each column will represent either a category or a numerical value but not both. The term data field is commonly used interchangeably with column; they mean the same thing. Software used to analyze data will frequently require each column to be uniquely named so the software can refer to the relevant values in a data set.

Sadly, the data sets you’ll need to use when working with data are not always structured nicely. They frequently will require data preparation to form the necessary column structure. You may need to merge values or split a single column into multiple columns. Some columns might not be required at all, and you’ll have to remove them to make your analysis easier to conduct.

You need a clear understanding of the questions you are trying to answer before you look at the data set. Table 2-4 doesn’t show an additional column of data, but the Store column has a deeper level of granularity, as that column now contains a second value. If you are trying to answer questions only about the Manchester store, the additional value acts as a distraction to the analysis.

However, you can’t simply remove the Store column, as the additional store, York, has added new rows of related data. The table’s length has doubled, as the data set now covers two stores instead of one, with two rows of data for each weekday. If you removed the Store column, you’d no longer see which store each row’s observations are about.

Table 2-4. Beyond the basic data set
Weekday Store Sales Value Sales Target
Monday Manchester 1000 800
Monday York 650 500
Tuesday Manchester 750 800
Tuesday York 400 500
Wednesday Manchester 400 900
Wednesday York 600 600
Thursday Manchester 1350 1000
Thursday York 650 750
Friday Manchester 1500 1000
Friday York 700 750

Additional columns of data aren’t necessarily something to be frustrated with. They can allow you to perform deeper and more insightful analysis. Each question posed in Table 2-3 could be answered about both stores. Alternatively, the questions could look at which store had the higher sales each day or the higher average target. This is why it’s important to be clear on the questions you need to ask; you want to ensure your data set has enough granularity for you to answer the questions but not so much that you have to do a lot of work to find the answers.

Rows and columns work hand in hand; thus you need to understand the effect changing one feature will have on the other. Each column should be a single data type, so let’s turn to that concept next.

Data Types

The term data type is used to describe the form of data that exists within a cell in a data set. You learned in “Rows and Columns” that each header should describe the cells of data below it. Therefore, the column of data below the header must be in a consistent form and include only one data type. Most data software solutions allow you to have only a single data type in a column.

The rest of this section will go into more detail about each data type and what you should consider when working with it.

Numbers

Numerical values are at the heart of data and make up the majority of measures found within data sets. A measure is another name for the numerical values you’ve encountered in the example data sets so far, like sales and target values. A value made up of only the digits 0 through 9 is a numeric data type. When communicating with data, you often will aggregate these values if you’re answering overarching questions, or you might still refer to each separate value if focusing on more detailed questions.

You may also find numerical values forming identification numbers, which are commonly used in data sets to allow different data sets to be joined together, or to ensure there is a unique value for each categorical variable if the names of those values might change over time, as with a rebranded product. Most large companies will assign a customer an identification (ID) number to make analysis more anonymous.

Numeric data fields will be present in most data sets across all industries and departments. Consider the example in Figure 2-2, taken from the World Bank, of the world’s cereal yield in kilograms per hectare.

World Bank data on cereal yields, in kilograms per hectare
Figure 2-2. World Bank data on cereal yields, in kilograms per hectare

The main set of numbers comprises the values of cereal yield, the main subject of the data set. Although analyzing the numerical data is the key focus of understanding this data set, completing that task in this table alone isn’t easy. Numerical data fields can also have null data points. A null represents the absence of data in a data field or row.

In Figure 2-2, numbers are also present in the form of years. The year would not be used as a measure but in fact acts as a category of our data, as the years are part of the description for the yield values. The yield values are set at the level of country and year, despite other categorical data being present in the data set. The additional categorical fields do not contain multiple values per country and year and thus do not add to the data’s granularity.

The formats of numeric data types are important too. Whether a number is a whole number, or integer, or a number with a decimal place (sometimes referred to as a float) can affect how you use the value. Many data tools treat these fields differently. The format also changes how the value should be interpreted. Take, for example, a column that has the word “percentage” in the header. If the data field is an integer, it might be represented with a value of 31, implying the value is actually 31%. If the field is being held as a float, the value might be recorded as 0.31. This demonstrates the importance of naming your headers clearly to indicate how each value should be used. After all, a misread format could suggest that sales increased 3,100% (if 31 were treated as a proportion) or that a survey response rate was a terrible 0.31% (if 0.31 were treated as a whole-number percentage).
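As a quick illustration (my own hypothetical values, not taken from the figure), here is how the same 31% can be stored and misread in Python:

    # The same underlying fact, 31%, stored two ways
    pct_as_int = 31      # a whole-number percentage
    pct_as_float = 0.31  # a proportion held as a float

    print(f"{pct_as_float:.0%}")      # 31%   - 0.31 correctly treated as a proportion
    print(f"{pct_as_int / 100:.0%}")  # 31%   - 31 correctly treated as a whole-number percentage
    print(f"{pct_as_int:.0%}")        # 3100% - 31 mistakenly treated as a proportion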

Numeric data will be very important when you are communicating with data, so you’ll want to check that you are using the values correctly.

Strings

For new data analysts, string data is the data type that takes the most getting used to. This is due to how string data is assessed by the tools that ingest it from the data source. Even if you are a more experienced data user, you might still regularly struggle to work with string data due to the number of forms it can take.

String data is made up of alphanumeric data that can also include punctuation marks and symbols. Student names, university course descriptions, and course identification codes can all be forms of string data. Any field that is not just numbers can be held as a string. Any computer system that allows you to freely enter data will probably store the data as a string field, as the user might use more than just numeric values.

The string fields are going to be categorical fields in your data set. You might still use string fields as a measure by counting the number of rows that contain a specific value. However, most of your analysis will use the categorical data fields to break up the measures you are analyzing. Take Course Name as an example—if it were a string field, a value would be generated for each different course in each of the following questions:

  • What is the minimum number of students attending lectures per course?

  • What is the average grade per course?

  • What are the total fees paid by students per course?

The flexibility of string data is fantastic but can cause headaches too. Tools read string data character by character. They also assess the position of each character in the string. In Figure 2-3, each character’s position in the string value Communicating with Data has been denoted below the character. This demonstrates how business intelligence tools read from left to right, including punctuation and also symbols such as spaces (characters 14 and 19 in Figure 2-3).

Character positions in a string value
Figure 2-3. Character positions in a string value

If an extra space were accidentally added to the term Communicating with Data, the position of each subsequent character would change, and thus the strings would be seen as completely different from each other, even if you, the consumer, might read the terms the same way.
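A minimal Python sketch shows both points: how a tool reads each character’s position and how one stray space makes two strings unequal (the variable names are my own):

    title = "Communicating with Data"

    # Tools read strings character by character, left to right
    for position, character in enumerate(title, start=1):
        print(position, repr(character))  # spaces appear at positions 14 and 19

    # One extra space shifts every later character, so the strings are no longer equal
    with_extra_space = "Communicating  with Data"
    print(title == with_extra_space)  # False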

When working with string data, you’ll often have to clean up the values held in the data field to ensure you are comparing similar string values correctly. Common cleaning tasks might involve any of the following:

Changing case

You might need to make a value UPPERCASE, lowercase, or Title Case (where the first letter of each word is capitalized).

Splitting names

Dividing names is useful for tasks like finding the city name in a full postal address or breaking up longer, amalgamated strings into separate columns of data.

Solving spelling mistakes

Finding and fixing typos can make analyzing data much easier.

String fields can be converted into other data types. By using string fields as a date or Boolean data type instead, you can complete specific analyses more easily.
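Here is a minimal Python sketch of the case and splitting tasks, plus a string-to-date conversion (the example values are my own); most data preparation tools offer equivalent functions:

    from datetime import datetime

    # Changing case
    course = "communicating WITH data"
    print(course.upper())  # COMMUNICATING WITH DATA
    print(course.lower())  # communicating with data
    print(course.title())  # Communicating With Data

    # Splitting names: pull the city out of a longer, amalgamated string
    address = "1 Main Street, York, YO1 7HB"
    street, city, postcode = [part.strip() for part in address.split(",")]
    print(city)  # York

    # Converting a string field to another data type, here a date
    exam_date = datetime.strptime("31/12/2021", "%d/%m/%Y")
    print(exam_date.year)  # 2021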

Dates

Date fields are the bane of many analysts’ lives. Date fields often must be precisely formatted for the date to be recognized. During the analysis phase, you might need many different levels of detail from the same date field. Let’s break down how much detail one date field actually contains using the example in Figure 2-4.

British date value
Figure 2-4. British date value

The basic parts of the date listed in the value are:

  • Day = 31

  • Month = 12

  • Year = 2021

But there are also a number of inferred aspects:

  • Week number = 52

  • Quarter = 4

  • Weekday = Friday

  • Day of year = 365

Getting these different parts of the date involves using functions, or instructions to cause changes to a data field. Functions can extract part of a date (to form the values just given) or move dates forward or backward. If users have entered dates as strings, you can use functions to convert them to the proper data type.
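The exact functions differ by tool; as an illustration, here is a minimal sketch using Python’s built-in datetime module to pull those parts out of the date in Figure 2-4:

    from datetime import datetime

    # Parse the British-formatted date from Figure 2-4
    date_value = datetime.strptime("31/12/2021", "%d/%m/%Y")

    # Basic parts
    print(date_value.day, date_value.month, date_value.year)  # 31 12 2021

    # Inferred aspects
    print(date_value.isocalendar()[1])      # 52 (ISO week number)
    print((date_value.month - 1) // 3 + 1)  # 4 (quarter)
    print(date_value.strftime("%A"))        # Friday (weekday)
    print(date_value.timetuple().tm_yday)   # 365 (day of year)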

Booleans

The Boolean data type seems like the simplest form, but using it can be anything but simple.

Booleans come from conditional calculations, otherwise known as yes/no questions: either the condition has been met or it has not. A Boolean data field holds just three values: true (the condition has been met), false (the condition has not been met), or null. Examples of conditions that could result in a Boolean field are:

  • Did Sales meet or exceed the target of 100,000?

  • Did the student pass the exam?

  • Did the customer buy the product?

When you use a conditional calculation in a business intelligence tool, the tool creates a new data field with true or false values. If you’d like a more descriptive term or a simple yes/no answer, you can use aliases to change the true or false values to any term you like.

Because of their simplicity, Boolean fields calculate very quickly, so when working with large data sets they can be quite efficient, especially compared to string data.
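As a minimal sketch (the store names and figures here are my own, not from the earlier tables), a conditional calculation in pandas might look like this:

    import pandas as pd

    sales = pd.DataFrame({
        "Store": ["Manchester", "York", "Leeds"],
        "Sales": [120000, 95000, 100000],
    })

    # The condition "Did Sales meet or exceed the target of 100,000?" creates a Boolean field
    sales["Met Target"] = sales["Sales"] >= 100000

    # Aliasing True/False to friendlier labels for your audience
    sales["Met Target Label"] = sales["Met Target"].map({True: "Yes", False: "No"})

    print(sales)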

How Is Data Created?

Data does not just magically appear—it has to be created somewhere, and understanding where is important. To ensure you’re using data effectively and accurately, you need to evaluate its source.

Think of it like writing: writers often quote other works as evidence or to reinforce an argument. Without a source, though, a quote loses its validity and credibility. You should treat data without a source just as cautiously as a quote without a source. So where does data get created, and how can you know which sources to trust?

In your organization, you will discover useful sources of data—databases or files that are trusted by many and that will form the backbone of your analysis. Asking where your current reporting and data come from will lead you back to a number of sources. Not all of the stored data points will be shared in the reporting you use; understanding what other data points are available will enable you to ask more diverse questions of your data.

The data you’ll need to answer all your questions will likely be created and stored outside your organization. By looking at outside sources of data, you can validate or challenge the data within your organization. These sources can take your analysis to the next level, and thus finding them is worth the additional effort.

Where Is Data Created?

Data is created in many places—far too many to cover in this book. This section will give you an overview of four main types of data sources that you’ll frequently encounter.

These sources differ from each other in that sometimes data creation is a by-product of an activity and sometimes it is the reason for the activity. Data is considered a by-product when an activity occurs and data can be formed from it. Operational and transportation systems and the Internet of Things exist not to capture data but to enable processes and services to happen. Surveys, by contrast, are intentionally created to gather data to produce analysis and aid decision making in organizations.

Let’s look at each type of data source in turn to help you understand what you need to consider when data is collected from it.

Operational systems

Workers and customers of organizations all over the world use operational systems every day. The term operational system covers a multitude of systems, from manufacturing machines to software for registering insurance policies. The system part of the term frequently refers to the computerization of a previously manual process. By using computers to complete the process, data can be captured at many points in time. Drawing data from operational systems allows you to measure the duration of processes for the customer as well as measure where errors might have occurred. Another benefit is that data is produced without any extra effort from the system’s operator, and thus steps aren’t missed.

Banks, for example, now process transactions instantly, as technology links not only the bank’s internal systems but also the global banking system. Movements of money, stock sales, and loan approvals now happen more quickly and get approved based on models, and these occur with more transparency to the client and customer than ever before. Data points enable rapid decision making but also can be used for analysis to make the processes even faster and more accurate.

In retail, the key operational systems are the till on the counter and the system measuring stock levels in the warehouse. Only over the last few decades have these systems been linked to other systems in retail organizations to seamlessly order new stock as products sell in stores. Every transaction and every movement of inventory or money creates data points in operational systems. Figure 2-5 shows the flow of retail goods from manufacturing to distribution to the point of sale. Every step of this flow creates data points. The retailer can analyze these data points to learn how long shipments take, which items are most popular, and so on, and then identify and correct any problems found.

Operational process flow
Figure 2-5. Operational process flow

These data points are increasingly stored in databases, whether they are used for analysis now or might be useful in the future. Operational systems are designed to do a job (whether that’s manufacturing pretzels, validating tickets, or issuing insurance policies), not for data production, so the raw data they produce is rarely ready for instant analysis. The data must be carefully prepared to avoid losing data points or manipulating them incorrectly.

Surveys

Organizations often use surveys to collect information directly from users, customers, and clients. These can be brief and broad or deep and narrow, as those being surveyed are unlikely to want to spend hours answering questions. The different types of surveys produce two main types of data:

Quantitative

Quantitative data is data that can easily be measured or calculated. It can come from counting responses or aggregating numeric responses and is often numerical, which makes it easier to analyze.

Qualitative

Qualitative data is more descriptive, collected from free-text-entry responses to questions. It can offer rich insights, but these insights are much more difficult to find, especially within large volumes of survey responses.

Qualitative responses are held as string data, which, as you’ve learned, can hold different characters and terms. These answers need to be readable by machines, so preparing them often involves breaking the strings into single words whose frequency can be counted.

Survey data doesn’t just come in raw numbers or qualitative strings. Surveys can capture data in many different ways, and this is a major factor in what data you will receive for analysis as well as in how comprehensive that analysis will be.

Radio buttons (Figure 2-6) and single-value drop-down lists (Figure 2-7) allow the user to choose only one answer from a list of options. Limiting the possible answers makes your analysis simpler. You have to set the possible answers before issuing the survey, so you won’t discover new insights, but you can confirm possible preferences, for example.

Radio button options
Figure 2-6. Radio button options
Single value drop down options
Figure 2-7. Single-value drop-down options

Multiple-choice questions, as in the multiple-value drop-down (Figure 2-8), give respondents more options. The answers are still predefined, so respondents must still pick the most relevant answers. Surveys can offer an Other answer option, but this free text entry makes analysis much harder because the answer introduces string data rather than set values.

Multiple value drop down options
Figure 2-8. Multiple-value drop-down options

Free text entry allows respondents to share their full feedback in their own words (Figure 2-9). The creator doesn’t have to think through all potential answers before sending out the survey. Free text entry is a qualitative method, and it produces string fields that take a lot more work to process. Spelling issues, abstract phrasing, and sarcasm (especially among British respondents!) can all complicate your analysis.

Free text entry
Figure 2-9. Free text entry

Gaining opinions directly from those you are surveying is extremely useful and provides a powerful message to communicate.

Movement/transportation

Our everyday movements create a lot of data. You can measure your own movement using a smartwatch; gathering the movements of hundreds of thousands of people can provide unique insights into their behavior. Many people are nervous about sharing this information, so access will be limited to the company capturing it, or the data will be aggregated before it is shared with third-party organizations.

The Quantified Self movement was developed by individuals who began tracking different elements of their lives by using data, from time spent reading emails to the foods they eat throughout the year. The aim of capturing this data is the same as that of capturing data in organizations: learning from what happens currently to improve what should happen in the future. These types of measurements have become common in many personal devices, such as my iPhone telling me how long I have stared at its screen each day in the last week.

As the movement has grown and demonstrated more value, the number of devices (and of the sensors within them) that capture this data has increased. Strava, an app that allows runners and cyclists to track their activities by distance and location, has been part of this development. As wearable devices have become more popular, their functions have expanded to tracking speed, heart rate, elevation, and more. The data can be useful for athletes who want to monitor their performance and determine the impact of their training.

Individual data is just the beginning, though; data collected from large volumes of runners and riders can be useful on a societal level. For example, local governments might use it to determine where to place infrastructure for safer bike routes.

The rise of satellite navigation for cars has led to similar data sets: measuring the speeds of vehicles using the system can help city planners monitor how roads are used and why traffic becomes congested. Strava’s Metro project provides this data to transport planners at a very aggregated level to help them make decisions about future infrastructure (see Figure 2-10).

Collecting and storing this data is a vast task that would overtax traditional data storage solutions. Storing and processing data, especially in great volume, has become much cheaper over the last two decades, making such granular tracking possible. The ethical use of this data and of other personally identifiable information continues to be a sensitive subject that you will come across frequently as you work with data.

Strava Metro
Figure 2-10. Strava Metro

The internet of things

Data is also being created by our own homes through a series of devices called the Internet of Things. The Internet of Things often involves giving internet or local network connectivity to devices that traditionally have not been connected to the internet.

These devices can communicate with each other to create linked-up services, such as home appliances that you can control from your smartphone, or farm equipment that adjusts irrigation levels based on the predicted rainfall. Smart meters mean utility companies don’t have to collect meter readings door to door and can determine usage on a much more granular basis.

Using data from these devices can be overwhelming due to the volume of collected data points, but unless you are the provider of these devices, you will likely receive aggregated records that reduce the volume of data.

Data is being collected from more and more devices, which is both a blessing and a curse. Having more data to provide answers is useful—but if you need to form answers quickly, having so many sources to draw on can pose a big challenge.

Should You Trust Your Data?

You should treat data the same way you should treat words: with great skepticism. Before you trust someone’s argument or believe something you read online, you should check the source, check whether the quote has been changed, and understand the wider context of how the words are being used.

Data requires the same treatment. Starting with the source of the data allows you to identify potential bias. This is not to say you should ignore a data source just because it might be biased, but you certainly should use it with care and highlight to your audience any impact the source might have on your findings.

Table 2-5 uses our data sources from earlier in this section and assesses them for potential bias.

Table 2-5. Potential biases in data sources
Source of Data Potential Biases
Operational systems An operational system’s developer specifies which data points are collected. Has the developer accounted for issues faced by people of other genders/ages/ethnicities and ensured that data that could support them is collected and not ignored?
Surveys Quantitative data requires anticipating respondents’ answers.
Qualitative data requires much more cleaning before analysis, so understanding what preparation has occurred is key, as those producing the data can add their own biases, similar to the operational system developer above.
Transport Quantified Self services require technology and often subscription fees, biasing the data toward wealthier individuals.
Internet of Things Data acquired from this type of source can be biased toward those who can afford such devices.

The source is clearly important to the data’s user, and as the communicator of the data to others, it’s imperative that you understand how the data is intended to be used. This can put a lot of pressure on you as the author of the analysis, but as with anything else, critically appraising the source will help you understand potential issues. You can also work with other analysts to learn from their experiences.

Data As a Resource

Data comes from a lot of places; it’s also stored in a lot of different places. In this section, I’ll guide you through the most common data storage formats, as well as one on the rise. Knowing more about these different methods of data storage will enable you to ask clear questions and set better requirements with your organization’s data owners. This section will also cover a key consideration in holding data: security.

Before we dive more deeply into this section on storage, let me go over some important terminology:

Data point

A single value in a data set

Data set

A single source of data

Database

A collection of data sets

Data warehouse

A collection of databases

Data lake

A less structured repository of data to be converted into curated data sources for analysis

Getting used to this terminology will make it much easier to work with data in your organization. When sourcing data sets, asking for the right level of data or source will be a big help in finding the right data for your analysis.

Files

Files and files and files of data are nothing new in the world or on your computer. Datafiles are created every day as a result of your work. This can involve taking data inputs from sources like market research, financial accounts, or internal sales records, adding your own logic through calculations or filters, and then creating a new file. Your files are likely to build up quickly (see Figure 2-11) and will contain lots of data that you’ll want to make use of over time.

Files within File Explorer
Figure 2-11. Files within File Explorer

The volume of datafiles poses a unique opportunity and a challenge at the same time. You will likely need to make use of not only your own files but also those of others across your organization. Keeping the files stored in a logical way that allows you and others to find and access the information quickly can be a challenge. Teams often have their own file management techniques, with little consistency between teams. Often only by collaborating successfully with others will you be able to access the data you need to answer your questions.

This challenge has become more pronounced, as files now are not just stored locally on your and/or your colleagues’ computers but potentially are stored on cloud-based drives that many people can access.

Common file types

You will encounter multiple file types when working with data, because different software produces different output formats. If you can become familiar with different file formats, you will be able to work with more data from more diverse sources. This will allow you to improve your work by using more sources to test your theories and validate your findings. The file type often defines how the data is stored within the file and the method with which a data tool queries the data in the file.

The file types are named by their extensions, the letters at the end of the filename that describe their format. The main file types you are likely to use are:

Excel spreadsheets (.xlsx)

An Excel spreadsheet is a “jack of all trades, master of none” when it comes to data in a business. A spreadsheet can be used to store data and may be used to process the data as well. The flexibility of using a spreadsheet to work with data means each solution is very customizable; thus it may sometimes be difficult to pick up and work with the data inside a particular spreadsheet.

Comma-separated values files (.csv)

These are commonly used to export data from operational systems, databases, and website downloads, since many tools (including spreadsheets) are built to use them as inputs. The data fields are separated by commas (hence the name), and individual rows are shown by new lines within the file.

Text files (.txt)

These are even simpler than CSV files, as data fields do not have to be separated by a comma. Instead, they can be separated by any character (or set of characters) chosen to mark a new data field. If you plan to use the data within text files, you must first understand how the files are structured and how to turn the contents into rows and columns.

Portable Document Format files (.pdf)

PDFs are not necessarily only about data. PDFs often contain vast amounts of text as well as images. Tables of data found within PDFs can be read by a few different data tools, making them a useful source of data if you can connect to the data within them.

Spatial files (.shp, .kml, or .geojson)

Spatial files contain exactly what they are named for: spatial objects. When you are working with data about locations, spatial objects such as points, lines, or polygons describe locations or geographical boundaries in a common manner. Shapefiles contain this information, so this file type is most regularly used when communicating data using maps; not all tools are able to visualize this type of data.

A plethora of other file types may be involved in your analysis, but many are unique to the tools that create them, and this book might be the size of an encyclopedia if all were covered.
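As an illustration of how a data tool connects to a few of these formats, here is a minimal sketch using pandas; the filenames are hypothetical placeholders:

    import pandas as pd

    orders = pd.read_excel("sales_2021.xlsx", sheet_name="Orders")  # Excel (needs the openpyxl package)
    extract = pd.read_csv("daily_extract.csv")                      # comma-separated values
    log = pd.read_csv("system_log.txt", sep="|")                    # text file with a pipe delimiter

    print(extract.head())  # preview the first rows and check the headers match your expectations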

Note

Many data tools connect to the files containing the data using a driver that allows the tool to query the data source. When looking to connect to more bespoke file types than those just described, you may need to download a driver to connect to the data set.

Common challenges

While the flexibility of the file types mentioned in the previous section can be a good thing when entering data, it can pose significant challenges when you’re using the data for analysis:

Control

The lack of rules or controls when entering data or adding data fields can create issues for those using the files. Most data analytics tools require data to be in a set structure of rows and columns, but files can have data added to them that does not meet these requirements. For example, a row could be added with totals for the rest of the data; if this wasn’t spotted, all the values would be double-counted in your analysis.

Adding data fields is easy to do in datafiles. If you don’t set up the analytics tool to process these additional fields, you can create issues or even prevent refreshes of your analytical views. When you build useful analytics that people begin to rely on, you need to ensure the views are available when the audience requires them. Unexpected data changes are a big reason for communications failing.

Origin

Even if a datafile is useful for analysis, it frequently will need to be updated to ensure the analysis is based on the latest view of a situation. A file does not necessarily contain a link back to its origin showing how it was formed. Extracts from data sources pose interesting challenges in terms of whether the logic used to create those files can be replicated to produce a like-for-like result.

As discussed earlier, you need to understand the origin of the data to ensure that you assess any potential bias. If you can’t trace the origin of the data in a file, understanding the potential effect on your analysis might be very difficult.

Processing

Whether or not its origin can be found, a datafile does not necessarily retain the history of changes made to it. This creates challenges, as you will not know if alterations, calculations, or manipulations have been made to the original data. If changes have been made, false conclusions may be drawn from the data. Changes are not always made consciously either; it’s easy to mistype or overwrite data values in datafiles, as there is little formal process behind writing the changes.

Size

Datafiles are not built for handling large volumes of data. As data becomes easier to capture, the volume of data to be stored becomes larger. Many file types have limits on how many rows can be stored. Although file types like Excel have raised those limits over time, the increased capacity still may not be enough for the data you need to hold. Datafiles are also slower than databases when you connect analytical data tools to them.

Use

With all the unknown aspects just demonstrated, the use of datafiles can also be misconstrued due to the lack of context about their creation. A datafile may have been the result of research focusing on a specific group of people or a database query filtering out products. If this context is lost, an analysis of the data may draw poor conclusions due to the mistaken assumption that the data set is based on a complete population or set of products.

If you work with data, you will likely have to work with datafiles. The flexibility of data entry into datafiles can make them perfect for ad hoc analysis. However, if your analysis is set up to be productionalized, or run automatically, then you should look to provide more control on the input files to prevent errors in the processing of the data, as well as incorrect results being drawn from the analysis.

Databases, Warehouses, and Lakes

Databases provide more rigor with input and processing of data. Databases are purpose built to ingest, process, store, and output data. The key differences from datafiles are that databases are more heavily administered and require a coding language to interact with the data. If you are new to working with data, this might pose a barrier to accessing the information you need to meet your needs or your stakeholders’.

Database software is specially designed to work with data and handle many of the challenges that datafiles pose. Databases are often run on a computer, sometimes called a server, with much larger memory and more processing power than a laptop or desktop computer. A database is divided into tables of data. These tables might focus on elements of an organization such as:

  • Customers or students

  • Products

  • Sales

  • Employee details

These tables often need to be joined together to answer the questions posed. For example, when analyzing sales, data on which products were sold and to whom must be factored in to provide a complete picture. Data tables can be linked together in commonly requested ways to form views. The mapping of how the tables in the database join together is called the database schema.

Common types of databases

The most common database type uses Structured Query Language, or SQL for short. SQL is a coding language that allows you to run queries on the database to return either data sets or aggregated values to help your analysis. Microsoft SQL Server is one of the most frequently used databases, as many servers run Windows as their operating system. SQL lies at the heart of a number of other commonly used databases, including MySQL, PostgreSQL, and Teradata. Each has slightly different code requirements, making it difficult to use these sources without some training.
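To give a feel for what a SQL query looks like, here is a minimal sketch using Python’s built-in sqlite3 module (a lightweight SQL database, not one of the servers named above); the table and values are my own:

    import sqlite3

    # Build a small in-memory SQL database to demonstrate a query
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE sales (weekday TEXT, store TEXT, sales_value INTEGER)")
    connection.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Monday", "Manchester", 1000), ("Monday", "York", 650), ("Tuesday", "Manchester", 750)],
    )

    # A query that returns an aggregated value per store
    query = "SELECT store, SUM(sales_value) AS total_sales FROM sales GROUP BY store"
    for store, total_sales in connection.execute(query):
        print(store, total_sales)  # Manchester 1750 and York 650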

As databases have grown over time, organizations have needed to run multiple databases alongside each other. A collection of databases managed together is called a data warehouse.

Many newer databases can still be queried with SQL, but the database is more likely to be hosted on a cloud-based server than on a server run inside your organization. Cloud-based servers include Amazon Web Services, Microsoft Azure, and Google Cloud Platform, all of which have their own database technologies or can host other providers’ databases. Cloud-based databases are useful as information scales, as more data can be added without the server’s storage or memory maxing out.

The challenge of streaming data in vast volumes has created problems for SQL-based solutions. The data structure required to make SQL work effectively means that large inflows of data cannot be processed fast enough to push data into the structure of the database. This is where the newer solution of data lakes comes in. Data lakes are repositories of data in which the data is often held in an unstructured state, to be processed later. Once data is identified as being useful and can be transformed correctly, it is processed into a more traditional SQL database.

Common challenges

You are likely to encounter a number of challenges when working with databases. First, there are major barriers to accessing the data using SQL. In most organizations, employees have not been taught how to code in SQL. This means many people are unable to access the data source directly and have to rely on others to query data for them. This can make things difficult if the data you want to communicate is located only in a database.

It isn’t just the coding language that limits access to data. Database access is much more controlled than access to datafiles because more important processes, like regulatory-mandated reporting, are completed from such data sets. In most organizations, even if access to data sets is obtainable, the process requires both IT teams and the business owner of the data to grant access.

Although databases are designed for using much larger data sets, the analysis of data from databases can be harder if you are using more basic analytical tools like Excel. Before analysis can even begin, determining the correct ways to join the tables and views from the database to form the data set often involves many subject matter experts or hefty amounts of experience. Working with data from a database is difficult enough, but completing this process when the source data resides in a data lake is an even greater technical challenge. Exploring and processing the data from a data lake can be quite time consuming, often with many iterations required along the way.

Each of these elements creates a number of hurdles to overcome, but the effort is worth it. Working with databases offers more powerful processing of a more tightly controlled data set. Your analysis can be made more repeatable and more stable if you use data from a database rather than using datafiles.

Application Programming Interfaces (APIs)

The modern analyst has ever-increasing demands being placed on their skills. One area for growth for many analysts has been in coding skills due to the introduction of APIs as a way to connect to more data sets. An API is not used only for data analysis; it also allows applications to write data points to a data set. For example, liking a tweet actually uses APIs to write a record to Twitter’s database without giving you access to the database, which contains a lot of valuable information. APIs have become more popular due to their ability to tie different data sources together in one place and standardize methods of access via website-based traffic (Figure 2-12). You will often see an API call as an entry in the URL window on your browser, as you are fundamentally going to a website address to fetch your data for analysis.

API layer between you and the data stores
Figure 2-12. API layer between you and the data stores

Like any technology, APIs have gone through distinct stages of development that addressed issues users had found when working with them. Simple Object Access Protocol (SOAP) APIs were the original type of APIs but posed challenges, as the data was passed in XML files, which are not the easiest to work with and whose structure is hard to understand. Representational State Transfer (REST) APIs introduced a more human, standardized approach to the work being completed through the APIs and allowed people to work with JSON files, which have a more logical structure. The newest form of API is GraphQL, which is much better suited for streaming data sets, where SOAP and REST APIs were struggling to keep up.

Services use APIs when communicating over the web, which is becoming the home for more and more data sources. There are some differences between working with APIs and the files/databases covered thus far in this chapter. API calls often have limits to the scope of what they can request from the underlying data sets and also restrict the amount of data that can be returned; the internet has varying levels of bandwidth, so APIs are constructed to keep data sizes small enough that information can be sent and received quickly. This can pose a challenge when conducting data analysis on large data sets, as you may need to make multiple calls to gather the information you require.
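A minimal sketch with Python’s requests library shows the pattern; the endpoint, parameter names, and page size here are hypothetical, as every API documents its own:

    import requests

    BASE_URL = "https://api.example.com/v1/sales"  # hypothetical endpoint

    records = []
    page = 1
    while True:
        # Many APIs cap the rows per call, so the data is requested page by page
        response = requests.get(BASE_URL, params={"page": page, "per_page": 100}, timeout=30)
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break  # an empty page means there is nothing left to fetch
        records.extend(batch)
        page += 1

    print(len(records), "rows collected")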

The challenges of working with APIs go beyond just the volume of queries you need to run. First, the coding required to send those queries can take some getting used to if you do not code in a computational language already. If you work in an organization, you may have a team that will support you with this part of the process and with the subsequent reshaping of data that many data analytics tools require.

Another challenge is that developers write APIs to allow users to access certain data sets in certain ways. This can create an inherent bias in the API as far as what data is available to answer your queries. As with any data source, you need to consider where the data originated from as well as what data has been collected and what has been removed.

Data Security and Ethics

Analysis of data can reveal a number of trends around clients, products, and services. However, that same data can reveal a lot about individuals too, if used in certain ways. This makes the need to protect personal data from misuse very important. Misuse can take different forms, from identity theft to unethical use of the data.

When you work with data, it is important to ensure that

  • the subject of the data is aware of how the data will be used.

  • the data is kept secure at every stage, from its transfer from the source to the deployment of the analysis.

  • the data is deleted once it has met its purpose and the supplier will no longer benefit from its retention.

The process of securing data through all stages of use is often known as data management. Having clearly sourced data, a defined purpose for the data, and a plan for deleting the data is important. Across the world, legislation has come into force that ensures that an individual’s data is used only with their permission and is deleted when holding the data no longer serves the individual’s interests.

Securing data is just as important. Merely having a good password on data isn’t enough. Holding data on encrypted devices and using it inside secure networks means data loss due to hostile attempts to steal it will be less likely. Securing data means protecting the subjects of the data from potential harm or negative actions against them. After all, this data could be about you—how would you want your information to be treated?

Data security is not the only aspect that needs to be considered with regard to data use. Ethical use of data is becoming a more contentious issue as the power of data becomes more apparent. Being ethical involves thinking beyond just your own and your organization’s standards. As data points are often about human behavior, close analysis can reveal details of people’s movements, beliefs, sexual orientation, and lots more besides. Ensuring your analysis does not stray from the intended purpose for which the provider of the data gave permission should remain a primary concern.

There are many books on data security, and even more on the ethical use of data, so this book does not attempt to explain all that is required to be compliant. Always keep in mind the potential impacts of data leaks or data misuse as you work with data.

Easy or Hard? The “Right” Data Structure

A clear understanding of the constituent parts of a data set and where data comes from can help set the foundation for using data for analysis. However, the structure of the data, sometimes referred to as the shape, can make a huge difference in how easy the data is to analyze. An earlier section of this chapter covered columns and rows and understanding the granularity of a data set. This section will look at how to shape and clean data for analysis.

The Shape of Data

Creating a data set that meets a certain structure can be the difference between hours of tough calculations and easy drag-and-drop use of the data. Whether your data set comes from a datafile or a database, the likelihood is that the data will be in columns of some kind. What each column represents makes a significant difference in how easily you can form your analysis, or whether analysis is even possible.

Categorical data

The categorical data describes what each measure refers to. Having a separate column for each piece of descriptive data makes analyzing the data much easier, as you can simply answer questions based on each column. For example, to look at the average test score in the data from Table 2-6, you could find a value at multiple levels in the data set:

  • Overall

  • Per department

  • Per subject

  • Per course

  • Per student

Table 2-6. Categorical data
Department Subject Course Student Score
Arts History HI101 30957191 76
Arts History HI101 52814935 92
Arts History HI101 89620539 60
Arts History HI102 30957191 66
Arts History HI102 89620539 35
Social Sciences Politics PO101 30957191 58
Social Sciences Politics PO101 51123824 61
Social Sciences Politics PO102 89620539 75
Social Sciences Geography GE101 51123824 91
Social Sciences Geography GE101 63947347 82

Most analytics tools calculate these results by reading down a column of numbers to determine the values that meet the condition set. If the software you use doesn’t work in this manner, you’ll need to perform this type of calculation to answer the same questions. For example, let’s work out each subject’s average score (Table 2-7).

Table 2-7. Calculating the average score per subject
Department Subject Course Student Score
Arts History HI101 30957191 76
Arts History HI101 52814935 92
Arts History HI101 89620539 60
Arts History HI102 30957191 66
Arts History HI102 89620539 35
Social Sciences Politics PO101 30957191 58
Social Sciences Politics PO101 51123824 61
Social Sciences Politics PO102 89620539 75
Social Sciences Geography GE101 51123824 91
Social Sciences Geography GE101 63947347 82

To work out the average, you first need to determine the number of subjects. In this case, there are three subjects: history, politics, and geography. The analytical tool sums up the values that correspond to each subject and then divides by the number of records, with these results:

  • History—a total of 329, divided by 5 records, means an average of 65.8

  • Politics—a total of 194, divided by 3 records, means an average of 64.7

  • Geography—a total of 173, divided by 2 records, means an average of 86.5

Analyzing the data by department instead is simple, and similar logic would apply: instead of looking for each subject, the analytical tools would look for the different departments and then perform the same calculations for the relevant rows of data.
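
If you are working in code rather than a drag-and-drop tool, the same logic is straightforward to express. Here is a minimal sketch in Python using pandas (my choice of library here is an assumption; any similar tool would do), with the data from Table 2-6 typed in by hand:

    import pandas as pd

    # Table 2-6 entered manually for illustration
    scores = pd.DataFrame({
        "Department": ["Arts"] * 5 + ["Social Sciences"] * 5,
        "Subject": ["History"] * 5 + ["Politics"] * 3 + ["Geography"] * 2,
        "Course": ["HI101", "HI101", "HI101", "HI102", "HI102",
                   "PO101", "PO101", "PO102", "GE101", "GE101"],
        "Student": [30957191, 52814935, 89620539, 30957191, 89620539,
                    30957191, 51123824, 89620539, 51123824, 63947347],
        "Score": [76, 92, 60, 66, 35, 58, 61, 75, 91, 82],
    })

    # Reading down the Score column at different levels of detail
    overall = scores["Score"].mean()                          # one value for the whole table
    per_subject = scores.groupby("Subject")["Score"].mean()   # matches the averages worked out above
    per_department = scores.groupby("Department")["Score"].mean()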

Measures

Just as each type of categorical data needs its own column in your data set, so does each measure. In Table 2-8, the measures are not in individual columns, preventing fast aggregations. In most tools, having values in separate columns makes analysis easier by enabling the aggregation of whole columns.

Table 2-8. Multiple measures as multiple rows
Department Subject Course Student Measure Value
Arts History HI101 30957191 Score 76
Arts History HI101 30957191 Attendance 6
Arts History HI101 52814935 Score 92
Arts History HI101 52814935 Attendance 8
Arts History HI101 89620539 Score 60
Arts History HI101 89620539 Attendance 10

For this data, having separate columns for Score and Attendance would allow those values to be aggregated, as we saw in the preceding section. Being able to conduct calculations between measures is also important, and having a separate column per measure ensures the calculations involve less complexity. The multiple measures within the Measure column in Table 2-8 also create additional challenges when analyzing the data set, as a row of data is no longer about a student’s score in each course. To be able to count the number of students in the data set, you need to ignore each duplication of a student’s reference number.

If the data set requiring analysis isn’t in the form of a separate category or measure in an individual column, then the data requires reshaping. Assessing a data set’s structure is one of the first tasks you should undertake when working with a new data set or analytical challenge. A few techniques are available to achieve the required shape.

Pivot—columns to rows

Pivoting your data is one of the most common methods of reshaping data. The technique should be used when you see a value that could be classed as a category in the headers. These headers often occur when adding data to a data set as a new column. In the case of Table 2-9, adding a new student’s results to the data set creates a new column.

Table 2-9. Data requiring a columns-to-rows pivot
Department Subject Course 30957191 52814935 89620539 51123824 63947347
Arts History HI101 76 92 60
Arts History HI102 66 35
Social Sciences Politics PO101 58 61
Social Sciences Politics PO102 75
Social Sciences Geography GE101 91 82

The challenge in analyzing this data set is not simply aggregating a single column but updating the analysis as new students are added. This is much easier when Student values are all in a single column and the measures are held in a separate column alongside it. The same reshaping is shown for course results in Figure 2-13.

Figure 2-13. Pivoting columns to rows

The resulting data set could include null values if any years don’t have all course results recorded. Most tools actually remove those null records, however, as they do not contribute any values to analysis.
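
Tools differ in how they expose this pivot, but the underlying operation is the same everywhere. As a rough sketch, here is how the columns-to-rows pivot of Table 2-9 might look in Python with pandas (melt is pandas’ name for this pivot; the DataFrame below is typed in from the table above):

    import pandas as pd

    # Table 2-9: one column per student ID
    wide = pd.DataFrame({
        "Department": ["Arts", "Arts", "Social Sciences", "Social Sciences", "Social Sciences"],
        "Subject": ["History", "History", "Politics", "Politics", "Geography"],
        "Course": ["HI101", "HI102", "PO101", "PO102", "GE101"],
        "30957191": [76, 66, 58, None, None],
        "52814935": [92, None, None, None, None],
        "89620539": [60, 35, None, 75, None],
        "51123824": [None, None, 61, None, 91],
        "63947347": [None, None, None, None, 82],
    })

    # Pivot the student columns into rows: one Student column, one Score column
    long = wide.melt(
        id_vars=["Department", "Subject", "Course"],
        var_name="Student",
        value_name="Score",
    )

    # Drop the null records the pivot creates, as most tools do automatically
    long = long.dropna(subset=["Score"])

With this shape, new students added to the source simply become extra rows, so the analysis no longer needs reworking each time a new column of results arrives.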

Pivot—rows to columns

As seen in “Measures”, having multiple rows for each metric instead of separate columns can create its own difficulties. Therefore, there is a form of pivoting to correct the issue. Pivoting rows to columns requires a slightly different technique, but ideally it should produce the same data structure as resulted from the columns-to-rows pivot. Table 2-10 is the complete data set from Table 2-8, where it is much harder to find the relationship between Score and Attendance.

Table 2-10. Full data set of multiple measures across multiple rows
Department Subject Course Student Measure Value
Arts History HI101 30957191 Score 76
Arts History HI101 30957191 Attendance 6
Arts History HI101 52814935 Score 92
Arts History HI101 52814935 Attendance 8
Arts History HI101 89620539 Score 60
Arts History HI101 89620539 Attendance 10
Arts History HI102 30957191 Score 66
Arts History HI102 30957191 Attendance 7
Arts History HI102 89620539 Score 35
Arts History HI102 89620539 Attendance 7
Social Sciences Politics PO101 30957191 Score 58
Social Sciences Politics PO101 30957191 Attendance 8
Social Sciences Politics PO101 51123824 Score 61
Social Sciences Politics PO101 51123824 Attendance 8
Social Sciences Politics PO102 89620539 Score 75
Social Sciences Politics PO102 89620539 Attendance 10
Social Sciences Geography GE101 51123824 Score 91
Social Sciences Geography GE101 51123824 Attendance 7
Social Sciences Geography GE101 63947347 Score 82
Social Sciences Geography GE101 63947347 Attendance 10

Pivoting rows to columns requires setting which data field will form the new headers of the columns of measures (Figure 2-14). The values that will fall underneath the headers can be selected from a column in the data set—in our example, the Value column. If multiple values exist in the same row in the resulting data set, the data being used for the pivot will often require you to select an aggregation. For the data set in Table 2-10, there are no duplicates, so that won’t be factored into consideration.

Figure 2-14. Pivoting rows to columns

The result of the rows-to-columns pivot is a data set in which each categorical data field and measure has its own column, making analysis easier (Table 2-11).

Table 2-11. Clean data source resulting from the pivot
Department Subject Course Student Score Attendance
Arts History HI101 30957191 76 6
Arts History HI101 52814935 92 8
Arts History HI101 89620539 60 10
Arts History HI102 30957191 66 7
Arts History HI102 89620539 35 7
Social Sciences Politics PO101 30957191 58 8
Social Sciences Politics PO101 51123824 61 8
Social Sciences Politics PO102 89620539 75 10
Social Sciences Geography GE101 51123824 91 7
Social Sciences Geography GE101 63947347 82 10
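
In code, the rows-to-columns pivot follows the same steps: choose the field that supplies the new headers and the field that supplies the values beneath them. A sketch in Python with pandas, using the Table 2-8 rows for brevity (Table 2-10 works identically), might look like this:

    import pandas as pd

    # The rows from Table 2-8; Table 2-10 follows the same pattern
    measures = pd.DataFrame({
        "Department": ["Arts"] * 6,
        "Subject": ["History"] * 6,
        "Course": ["HI101"] * 6,
        "Student": [30957191, 30957191, 52814935, 52814935, 89620539, 89620539],
        "Measure": ["Score", "Attendance"] * 3,
        "Value": [76, 6, 92, 8, 60, 10],
    })

    # Measure supplies the new column headers; Value supplies the numbers beneath them
    clean = measures.pivot_table(
        index=["Department", "Subject", "Course", "Student"],
        columns="Measure",
        values="Value",
        aggfunc="first",  # an aggregation must be chosen in case of duplicates; none exist here
    ).reset_index()

    clean.columns.name = None  # drop the "Measure" label left on the column index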

Aggregation

The shape of data isn’t just about columns of data; the number of rows also changes the shape of data significantly. In “Rows”, rows of data were identified as records of individual observations, but for ease of analysis, you can aggregate rows to make the data less granular. Often you’ll leave data sets at the most granular level to ensure you can ask a range of questions about the data. However, you can prepare data for specific purposes if you are aware of those questions. To raise the data to a higher level of granularity (i.e., less detail), the metrics that share common categorical values are aggregated. Typical aggregations include:

  • Sum—totaling the values

  • Average: mean—totaling the values and dividing by the number of records

  • Average: mode—finding the most common value

  • Average: median—ordering all the values and finding the middle value

  • Minimum—finding the lowest value

  • Maximum—finding the highest value

Table 2-12 shows the average (mean) score and average attendance for each course, as calculated from the student score and attendance data in Table 2-11. The level of granularity in the data set currently is each student in a particular course. To aggregate the measures up to the level of the course, the students’ details need to be aggregated within each course.

Table 2-12. Averaged scores and attendances
Department Subject Course Avg. Score Avg. Attendance
Arts History HI101 76 8
Arts History HI102 50.5 7
Social Sciences Politics PO101 59.5 8
Social Sciences Politics PO102 75 10
Social Sciences Geography GE101 86.5 8.5
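
The same aggregations can be expressed in code. Here is a sketch in Python with pandas, re-entering Table 2-11 by hand so the snippet stands alone; the names avg_score and avg_attendance are my own placeholders for the Avg. Score and Avg. Attendance columns of Table 2-12:

    import pandas as pd

    # Table 2-11: one row per student per course
    clean = pd.DataFrame({
        "Department": ["Arts"] * 5 + ["Social Sciences"] * 5,
        "Subject": ["History"] * 5 + ["Politics"] * 3 + ["Geography"] * 2,
        "Course": ["HI101", "HI101", "HI101", "HI102", "HI102",
                   "PO101", "PO101", "PO102", "GE101", "GE101"],
        "Student": [30957191, 52814935, 89620539, 30957191, 89620539,
                    30957191, 51123824, 89620539, 51123824, 63947347],
        "Score": [76, 92, 60, 66, 35, 58, 61, 75, 91, 82],
        "Attendance": [6, 8, 10, 7, 7, 8, 8, 10, 7, 10],
    })

    # Aggregate the student-level rows up to course level (Table 2-12)
    per_course = clean.groupby(["Department", "Subject", "Course"], as_index=False).agg(
        avg_score=("Score", "mean"),
        avg_attendance=("Attendance", "mean"),
    )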

Aggregation can also remove granularity by determining the number of instances of a particular categorical value:

  • Count—adding up the number of instances of a value in a category

  • Count distinct—adding up the number of instances of different values in a category

Let’s use a count to help the user of the data set judge how complete it is; with only a few students recorded against each course, it’s unlikely that all the data has arrived yet. Table 2-13 shows the count of students per course, along with the previous aggregations.

Table 2-13. Count of students per course
Department Subject Course Avg. Score Avg. Attendance Count of Students
Arts History HI101 76 8 3
Arts History HI102 50.5 7 2
Social Sciences Politics PO101 59.5 8 2
Social Sciences Politics PO102 75 10 1
Social Sciences Geography GE101 86.5 8.5 2
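
Continuing the previous sketch, the count aggregations can be added to the same grouping; note that counting rows and counting distinct students give different answers whenever a student appears more than once at the chosen level of detail:

    # Extend the course-level aggregation with a distinct count of students (Table 2-13)
    per_course = clean.groupby(["Department", "Subject", "Course"], as_index=False).agg(
        avg_score=("Score", "mean"),
        avg_attendance=("Attendance", "mean"),
        count_students=("Student", "nunique"),  # count distinct; ("Student", "count") counts rows
    )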

Due to the ever-growing size of data sets, data aggregation is increasingly needed to avoid poor performance by analytical tools that must process billions of rows of data. Aggregating data to a higher level of granularity means the tools have to process fewer rows and can thus perform calculations more quickly. This also allows you to validate data more easily due to having fewer rows to check against. As previously mentioned, aggregating data should occur only when answering your questions doesn’t require a deeper level of granularity.

Cleaning Data

As mentioned in “Rows and Columns”, a data field, or column of data, should contain a value describing a single element of the observation. Outputs from operational systems do not always come in an easy-to-read format or meet this “one data item in each column” guideline, and thus they require cleaning through splitting before analytical use. Note that data cleaning isn’t required only for system-generated sources; manually entered file-based sources often require cleaning as well due to poor entries.

Splitting

Not every source will require you to create a well-structured data set. Determine early on whether this extra work will be needed, as you’ll want to factor that into the time you have to produce your analysis. Splitting is a process that can be completed in many tools either with a few clicks or by writing a calculation. Whether you build the calculation yourself or have a tool with the automated option, the same logic applies. The logic requires the data processor to recognize the breaks between the data points to create a new column.

The data in Table 2-14 could actually be split a few ways. Note, for example, that the main data values are divided by hyphens.

Table 2-14. Log from security badge entry
Log
2022_10_13-08:31:47-30957191
2022_10_13-08:42:21-89620539
2022_10_13-08:47:19-52814935
2022_10_13-08:49:56-51123824

Splitting the log values at the hyphens would result in three columns containing the date, time, and student ID, as shown in Table 2-15.

Table 2-15. Result of splitting logic
Date Time Student ID
2022_10_13 08:31:47 30957191
2022_10_13 08:42:21 89620539
2022_10_13 08:47:19 52814935
2022_10_13 08:49:56 51123824

Depending on the type of analysis, you might want to perform a second split calculation on Date to produce separate columns for year, month, and day. To analyze whether students turn up later for classes in winter months compared to summer months, you could apply the same logic to the Time column to create separate columns for hours, minutes, and seconds. Analyzing the data held within the logs is much more difficult without splitting up the data. And with data updating continuously, splitting can become an arduous task unless the process is automated. Work out what you need from the data before trying to productionalize the data preparation to ensure you have the required data fields.
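
If you are scripting the preparation rather than clicking through a tool, the splitting logic is short to write. Here is a sketch in Python with pandas, using the log values from Table 2-14:

    import pandas as pd

    # Badge-entry log from Table 2-14: one string per row
    logs = pd.DataFrame({"Log": [
        "2022_10_13-08:31:47-30957191",
        "2022_10_13-08:42:21-89620539",
        "2022_10_13-08:47:19-52814935",
        "2022_10_13-08:49:56-51123824",
    ]})

    # Split at the hyphens into the three columns of Table 2-15
    parts = logs["Log"].str.split("-", expand=True)
    parts.columns = ["Date", "Time", "Student ID"]

    # A second split turns Date into Year, Month, and Day for seasonal analysis
    parts[["Year", "Month", "Day"]] = parts["Date"].str.split("_", expand=True)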

Replacing rogue characters

As covered in “Rows and Columns”, each data field should be only one data type. String data fields can contain any alphanumeric character, but to conduct mathematical calculations, you will likely want only numeric data. Data entry is rarely perfect, especially if values are entered manually. But if manual entry can result in mistakes, why do systems allow for manual data entry? The answer is simple: flexibility. When a system is being designed, restricting the data to be entered to a simple drop-down set of choices isn’t always feasible. The flexibility of data entry is useful for the inputter, but the resulting data challenges can be tough. Even the most careful data entry can lead to incorrect characters being added to data fields. For example, in Table 2-16, the Capacity data field should clearly be a column of integers, but the character A has accidentally been added to a value.

Table 2-16. Rogue character in a data field
Building Room Capacity
Roscoe Theater A 470
Roscoe Theater B 236
Simon Theater A A198
Simon Theater B 330
Simon Theater C 121

To return the field to being entirely numeric, you could manually remove the value. But this could lead to important values being filtered out of the data set, changing the resulting analysis. Alternatively, most data tools feature a calculation function called Replace that allows the user to replace characters. In this example, you could clean the data entry by using Replace to delete the A. Look for what the data value should be before making any amendments to avoid leaving an incorrect value.
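
Most tools expose this as a Replace function; in code the idea is the same. Here is a sketch in Python with pandas, using the values from Table 2-16 and assuming you have already checked that the correct capacity is 198:

    import pandas as pd

    # Table 2-16: the rogue "A" forces the Capacity column to be read as text
    rooms = pd.DataFrame({
        "Building": ["Roscoe", "Roscoe", "Simon", "Simon", "Simon"],
        "Room": ["Theater A", "Theater B", "Theater A", "Theater B", "Theater C"],
        "Capacity": ["470", "236", "A198", "330", "121"],
    })

    # Strip the rogue character, then treat the column as numeric again
    rooms["Capacity"] = (
        rooms["Capacity"]
        .str.replace("A", "", regex=False)  # removes the stray letter only from Capacity
        .astype(int)
    )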

Learning to structure and prepare data can make your analysis easier and could even unlock other useful values in the data set.

The “Right” Data

Whether you have the “right” data to start your analysis is often determined not by the accessible data but by the question being posed. Without a clear understanding of what is needed, you’ll find it’s virtually impossible to provide the right answers. As I’ve become a better analyst, I’ve learned how to identify what the stakeholder actually needs. A number of factors can prevent stakeholders from asking for what they really need, ranging from the availability of data to the situation the need has arisen from (Figure 2-15).

Figure 2-15. Requirement-gathering challenges

Context

A person trying to find answers from data must always understand the context of the question being asked. For example, the answer to even a simple question like “What’s happened to our profits?” differs greatly depending on the global macroeconomic situation, the organization’s stage of development, and even when the question was last asked.

The organization’s context isn’t the only one to understand; the personal context of the individual asking the question is important too. Maybe they’re under pressure because they’re not meeting their targets and thus require a more analytical answer about their performance and how to improve it. Or perhaps the stakeholder is performing well and wants the analysis to demonstrate just how far ahead they are compared to expectations. Without clarity on context, it’s unlikely you will be able to use data to show what the stakeholder requires.

Stakeholder knowledge

The stakeholder’s knowledge is a key factor in whether they will be able to articulate what they need. Knowledge in this regard has two components: subject matter expertise and data skills. Subject matter expertise is often why someone has the role they do in an organization. Is the individual highly experienced in their field? Are they renowned within the organization for their level of knowledge about how the organization works? Their level of expertise affects the level of detail the analysis might go into. If the stakeholder is new to their role or is more of a generalist manager, the requirements might remain at a higher level—in which case the analysis will need to include a lot more contextual information rather than being deeply nuanced.

Simply knowing a lot about the subject of the data analysis isn’t enough for a deep, technical analysis. Knowing how data works in an organization, what can be achieved with data analytics, and what data is even available for analysis can dramatically change the requirements. A stakeholder with low data skills and data awareness is unlikely to form the perfect set of requirements. Few people will openly highlight their weaknesses, especially as they move up in an organization. Therefore, you’ll need to ask additional questions to deduce whether you should use your data skills to help the stakeholder ask for what they actually need.

Data availability

Obviously you can’t complete data analysis where the data doesn’t actually exist. But just because you don’t have access to the data now doesn’t mean you can’t get access. A valuable capability in an organization is knowing what data actually exists and where it resides. Being able to locate data quickly, or knowing who can, is a huge factor in being able to communicate with data effectively, because this knowledge can reduce the time it takes to find the data you need. With more data coming from external third parties or internet-based sources, being able to corral data for use in your analysis can change the requirements you are able to meet or promote. Without this knowledge, stakeholders are likely to ask broader questions for fear that the data won’t be available or that they seem naive.

Software functionality

Even if you have all the constituent parts listed thus far, the software must have the functionality to meet the set requirements. Often tools can be manipulated to meet requirements, and the internet is littered with work-arounds for software that doesn’t have a specific feature. Software functionality is frequently underestimated when setting requirements. A stakeholder who doesn’t know the capability of the software used in their organization often doesn’t request its full capability, even when achieving it is relatively easy. Many data visualization tools, for instance, are much more interactive than stakeholders initially perceive, which is why they ask for more closed requirements. This means they want a particular answer and won’t be open to the analysis being more exploratory.

Often this mixture of challenges can make data work fail to have the intended impact or create frustration between all parties involved, unless time is spent on getting the fundamental requirements right.

With so many potential challenges, how should you go about avoiding them and gathering requirements successfully?

Requirement Gathering

Requirement gathering is a task that can make or break your analytics project before you’ve even begun. With clear requirements from your stakeholder (or from yourself if it is a personal project), you can meet the needs of the project. Requirements do not necessarily have to be precise, as a downside of absolute precision is that you may miss some key insights. On the other hand, if the requirements are too loose, answering the questions being posed may be difficult.

With good requirement gathering, you’ll be able to answer questions that your key stakeholders haven’t even thought about yet, as in many cases you’ll be analyzing data that has never been looked at before, and you won’t know what it shows until you wrangle it together and start to conduct the analysis.

Asking the right questions

Getting to the real needs of a stakeholder is all about asking them the right questions. As shown in Figure 2-16, a stakeholder’s knowledge of their subject and data can have a massive impact on whether you can form an understanding of their real needs. If you have a stakeholder who is used to requesting and using data solutions, you are much more likely to obtain a thorough understanding of what they need to achieve. However, if the stakeholder’s knowledge of the subject or the data isn’t strong, you will likely need to ask additional questions to understand what is really required.

Using the university theme of this chapter, let’s work through an example.

Scenario: university student planning

You work at a university, and school grades have just been released, so places at the university are just being confirmed. A number of students won’t have made their grades, leaving spaces available in some courses. The planning manager, Vicky, comes to you to set requirements for a project that will help a number of teams understand the changing picture as results come in and places at the university get confirmed or otherwise open up if students don’t make their grades.

Vicky’s initial request is for “a dashboard to show how many students we have versus the places available across the university.” When you first hear this request, it doesn’t seem unreasonable and even seems achievable. You might need only to grab some summary numbers to measure:

  • How many students have met their grades to confirm their spot?

  • How many students have missed their grades?

  • How many students have yet to confirm?

However, if you offer those numbers, you are likely to get back another set of questions. This is the great aspect of communicating with data: people are smart, so once you show them something, they learn and will often want to know something else building on that knowledge. A good data worker will try to get ahead of the subject by forming an understanding of where the questioning might go as they gather the requirements.

Rather than just saying yes to Vicky’s request, you can ask a simple question that could help move you toward a better set of requirements: “What are you planning to do with those numbers?” Vicky’s answer could guide you in adding to those initial summary numbers so that you can provide her with what she really needs. Some potential answers from Vicky are:

  • “We need to make sure we have enough classrooms.”

  • “We need to make sure we have the right number of lecturers.”

  • “We need to make sure we have sufficient on-campus housing space for the students.”

There have been numerous times in my career when the answers have covered all of the above points. Although this means more work to set up the initial views, effort will be saved down the line, as you’re more likely to provide the answers required on the first try.

Another way you can guide yourself and your stakeholders to the requirements they actually need is by using the “five whys” technique that was originally created by Sakichi Toyoda at the Toyota Motor Corporation. By channeling what I call your “inner toddler” and asking “Why?” in response to each answer to your questions, you will get to the real focal point of the requirements. Toyoda found that you should ask “Why?” up to five times in a row to find the real reason for the request. Clearly, not asking “Why?” like an actual toddler is the key to not annoying your stakeholder.

Here is how your interaction with Vicky could play out using the “five whys” technique:

Vicky: “I need a dashboard to show how many students we have versus the places available across the university.”

You: “Why do you need those numbers?”

Vicky: “I need to be able to tell the vice chancellor what capacity we might have next year.”

You: “Why would the vice chancellor be looking for those values?”

Vicky: “She is looking to confirm the number of teaching staff that will be needed.”

You: “Why is she looking for those numbers right now?”

Vicky: “She needs to know whether we need more large classroom space if we don’t have enough staff to teach in smaller groups and rooms.”

You: “Why the focus on classrooms?”

Vicky: “There are plans to refurbish all the lecture theatres in Building C, and that will affect capacity for social science courses.”

Now you have much clearer insight into what information is actually required for the analysis, and you needed only four whys to get there. The data needs to be more than just overall student numbers; it needs to be split by department too. Alongside that data set, joining a data set containing room capacity per faculty would allow you to compare current student numbers against teaching space, in case other courses are not full and could switch to alternate rooms.

The technique seems so simple—so why doesn’t everyone just ask the right question the first time? Asking stakeholders challenging questions can be difficult. Stakeholders frequently are more senior members of the organization, and you might be nervous about challenging them. However, as discussed, stakeholders may not be used to working in this way, so posing these questions can be useful. Also, thinking through these questions with someone else can help you think differently about the problem in front of you and how you might want to approach the requirements.

Sketching

Understanding your stakeholder’s requirements is not the only part of the puzzle you need to piece together before starting the work to ensure you have the right data. Stakeholders often have specific charts, layouts, and formats in mind; not delivering those will reduce the likelihood of the stakeholder using the analytical products you come up with. If you deliver the initial work without checking the visual requirements, the stakeholder will likely ask you to rework your output, even if it answers the question.

Chapters 3 and 4 of this book will look at the most effective ways to visualize certain data points, but stakeholder requirements will not necessarily fit these best practices. The aim of these chapters is to give you the knowledge to articulate how data should be visualized, and why, to ensure the message within the data is being communicated clearly.

The format of your communication is especially important when you’re dealing with work that is for your stakeholder’s management or for external third parties. Trying to get these details out of someone’s head and into a requirements document is difficult for a number of reasons:

They are an experienced data professional.

Although an experienced data professional would score well on our stakeholder knowledge test, they pose separate challenges. They may expect you to match how they would have approached the requirements they have given. Often they will have worked with different tools that have different options and strengths from the ones you use.

They’re not the end user.

When you don’t get to talk to the end user, it’s difficult to understand how they might use the analysis and what they need it to answer. Even if your stakeholder has spoken to the end user, the requirements will be second- or third-hand. Each link in the chain will add their own interpretation and needs into the requirements, further convoluting the clarity around what is essential.

“I know what I want; I just don’t know what it is yet.”

This is a really tough situation to be in and not uncommon in analytics, as questions can be hastily formed based on situational stimuli like meetings or seeing other projects. The situation is better when you have worked with the stakeholder for a while but can be very challenging if it’s your first interaction with them.

A nice approach to working through these challenges is to get the stakeholder to draw what they’re thinking of. The drawing does not have to be a masterpiece; even the crudest rendering can quickly give you a sense of what the end product might look like. Without the data, the focus instantly becomes what each sketched element adds to the work, how the user might interact with the work, and whose expertise you might need to complete the project.

Figure 2-16 shows a sketch based on Vicky’s requirements following the use of the “five whys” technique.

Figure 2-16. Sketch of dashboard meeting Vicky’s requirements

The sketch can help drive you toward getting the right data by working out what data fields would be needed to form specific graphs (Figure 2-17).

Figure 2-17. Detailed sketch of overview chart

Getting the right requirements can help guide you toward getting the best outcome the first time and ensure you are collating all the necessary data for your project. Iterating a sketch on a whiteboard or on paper is easier than visualizing data in data software unless the user is highly experienced with the software. Faster iteration will allow you to find the best answer sooner and will likely increase the work’s longevity, since it will be more relevant to users.

Use of the Data

Once you know the requirements, as well as how the data is likely to be represented in a data visualization, the next step is understanding the data flow from source to analysis. You need to consider a number of factors to ensure you get the data you require when you need it, and in a usable format.

Frequency

As discussed in “How Is Data Created?”, data is created in many different ways, and it is being captured more frequently than ever before. For example, surveys are now conducted online rather than just in person or by mail, allowing results to be updated in real time. Thus you’ll need to ensure you can offer your stakeholder updated versions of the analysis without creating the challenge of continually maintaining the visualization. Depending on the tools you use, this can be easy, or you may need to revert to manual updates.

Being aware of when the data is “ready” is an important factor in ensuring the data is suitable for answering the stakeholder’s needs. Many stakeholders will be looking for certainty from their analytical products, and updating data will create confusion if they are not prepared for changing values. If updates to the data are likely to change your visualization, the view needs to make clear whether and when that might happen. However, updating data can be beneficial for stakeholders who want to start looking at potential results and emerging trends rather than waiting for the complete data set. Getting clarity on whether your stakeholder prefers an early view of provisional data or would rather wait for a solid, complete data set is key.

Volume

Along with access to streaming data sets, modern data sets are growing ever bigger. Many storage solutions are capable of storing huge data volumes, but the analyst and visualization layer will feel the effects of working with large data sets. These effects include:

Slower development

Working with large data sets can slow down everything you do with data. Data exploration is a key part of most analysis, since how you visualize the data will likely change as you understand more about the data set. Each step of the exploration will be slower because the data tool has to process more records. These steps include understanding what is within the data as data loads as well as building the charts themselves.

More difficult analysis

Larger data sets are harder to analyze, and not just due to the inevitably slower performance of the majority of data tools. Uncovering key findings and anomalies can become much more difficult. It can be easier to identify trends with larger data sets, as there are more data points to validate them, but analysis isn’t always focused on that. Finding a data outlier can mean identifying a new high-growth client, a product whose sales are about to take off, or a musician who is about to have a million streams.

Slower performance

Slower performance is an issue for the end user as well as for the analyst. After working in financial services for nearly a decade, I’ve learned that people are not happy waiting for the loading icon to stop spinning and render their analysis. This can often mean the difference between them using an analytical view or just ignoring it.

You can take different approaches to working around these factors:

Sampling

Taking a sample of a large data set removes much of the difficulty in developing a visualization. Finding a large enough or representative sample can be a challenge. You may want to take a random sample of the data, or just the first thousand rows. Each technique has its own issues if you are using only this sample to form your analysis, but sampling will save development time regardless. Be careful not to miss extremes in the data that could dramatically alter the techniques you use to visualize the data. Replacing the sampled data set with the full data set before forming any conclusions is also key.
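
Most data tools offer sampling options; in code it can be as simple as the sketch below in Python with pandas (the filename is a stand-in for whatever large extract you are working with):

    import pandas as pd

    big = pd.read_csv("large_extract.csv")  # placeholder for a large data set

    dev_sample = big.sample(frac=0.01, random_state=42)  # random 1% sample, reproducible seed
    dev_head = big.head(100_000)                         # or simply the first rows

    # Develop against the sample, then swap the full data set back in
    # before forming any conclusions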

Incremental data refresh

Often data sets will require updating over time. When you’re working with a large data set, the time spent refreshing the full data set can be significant. Powering your analysis with a data set in which only the new rows of data are added shortens the update time considerably. This technique does not work, however, if previously loaded records require an occasional update. For example, if a customer might return a product from a retail order, the data set should be updated to reflect the customer’s actual order.
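
How an incremental refresh is configured depends heavily on the tool, but the underlying logic is simple. Here is a rough sketch in Python with pandas, assuming each record carries a unique order_id (a hypothetical column name) and that previously loaded records never change:

    import pandas as pd

    existing = pd.read_csv("orders_extract.csv")  # rows loaded on previous refreshes (hypothetical file)
    latest = pd.read_csv("orders_latest.csv")     # the newest extract from the source system

    # Append only the rows that haven't been loaded before
    new_rows = latest[~latest["order_id"].isin(existing["order_id"])]
    refreshed = pd.concat([existing, new_rows], ignore_index=True)
    refreshed.to_csv("orders_extract.csv", index=False)

    # If old records can change (e.g., a returned order), this approach isn't enough;
    # a full refresh or an explicit update step is needed instead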

One big data set versus many data sets

With data sets being created from an ever-growing number of sources, pulling data sources together to get the right data set for your analysis is an increasingly common challenge. When combining different data sources, consider what the resulting data set is required for and whether you can alter it slightly to make it useful for others, to prevent a proliferation of bespoke data sources, one for each piece of analysis. After all, managing all these different data sources can become quite burdensome.

There are two common methods for merging data sources. The first is joining. Joining two data sources involves adding data fields from one data source to the other. For this technique to work, you’ll need to specify join conditions so the tool knows which records to append to each other. Figure 2-18 illustrates the joining of two data sets, one holding the average grade per course and the other the number of students in each course, to create the single data set required for analysis.

The join technique requires a condition, a logic statement, to link the two tables. In the example in Figure 2-18, the logic is where the course IDs match; the join condition finds matching values from the two input tables in only two cases. An inner join returns only the records that meet the join condition. If you want to include courses that don’t have a match in both sources, other types of joins are available, depending on the output you are looking to create. This is why a join is often demonstrated as a Venn diagram: to illustrate what you want the join to return. The inner join is where the two circles intersect, and it adds the corresponding data fields from one source onto the other.

Figure 2-18. Join technique used for adding columns to a data set
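
In code, a join is usually a single function call once the join condition is known. Here is a sketch in Python with pandas, using small made-up stand-ins for the two inputs in Figure 2-18:

    import pandas as pd

    # Hypothetical stand-ins for the two inputs in Figure 2-18
    avg_grades = pd.DataFrame({"Course": ["HI101", "HI102", "PO101"],
                               "Avg. Score": [76, 50.5, 59.5]})
    student_counts = pd.DataFrame({"Course": ["HI101", "PO101", "GE101"],
                                   "Count of Students": [3, 2, 2]})

    # Inner join on the course ID: only courses present in both inputs are returned
    joined = avg_grades.merge(student_counts, on="Course", how="inner")

    # how="left" or how="outer" would also keep the courses that fail the join condition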

The second technique, unioning, involves stacking similarly structured data sets on top of each other (Figure 2-19).

Figure 2-19. Union technique used for stacking data sources
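
A union is even simpler to express, provided the sources share the same column structure. Continuing the pandas sketch with two hypothetical extracts:

    import pandas as pd

    # Two similarly structured extracts, e.g., one per department (hypothetical values)
    arts = pd.DataFrame({"Course": ["HI101", "HI102"], "Avg. Score": [76, 50.5]})
    social_sciences = pd.DataFrame({"Course": ["PO101", "GE101"], "Avg. Score": [59.5, 86.5]})

    # Stack the rows of the two sources on top of each other
    unioned = pd.concat([arts, social_sciences], ignore_index=True)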

Whether to have one large data set that is suitable for many potential purposes or many data sets that are set up in more bespoke fashion for each individual piece of analysis is a delicate balancing act and will often depend on the data tools you are using.

Summary

Unsurprisingly, learning what constitutes data, where it comes from, how it is stored, and what you need to consider when working with it is a key step in learning how to communicate with data. The better your fundamental skills are, the more effective your use of data will be. Many of these aspects can be daunting at first but are worth trying just to see what challenges you face. Like the use of written or verbal language, trying to wrangle data to help convince others is a never-ending process. Different audiences and stakeholders will inevitably have different needs, so growing your skills with the fundamental building blocks of any form of communication is only going to be beneficial.
