Restructuring data in Tableau connections

The Excel workbook World Population Data.xlsx, which is included in the Data directory of the resources that are included with this book, is typical of many Excel documents. Here is what it looks like:

Excel documents such as this are often more human readable but contain multiple issues for data analysis in Tableau. The issues in this particular document include the following:

  • Excessive headers (titles, notes, and formatting) that are not part of the data
  • Merged cells
  • Country name and code in a single column
  • Columns that are likely unnecessary (Indicator Name and Indicator Code)
  • The data is wide, that is, there is a column for each year and the population measure is spread across these columns within a single record

When we initially connect to the Excel document in Tableau, the connection screen will look similar to this:

The data preview reveals some of the issues resulting from the poor structure:

  • Since the column headers were not in the first Excel row, Tableau gave the defaults of F1, F2, and so on, to each column
  • The title World Population Data and note about sample data were interpreted as values in the F1 column
  • The actual column headers are treated as a row of data (the third row)

Fortunately, these issues can be addressed in the connection window. First, we can correct many of the excessive header issues by turning on the Tableau Data Interpreter, a component which specifically identifies and resolves common structural issues in Excel or Google Sheets documents. When you check the Use Data Interpreter option, the data preview reveals much better results:

Clicking the Review the results... link that appears under the checkbox will cause Tableau to generate a new Excel document that is color-coded to indicate how the Data Interpreter parsed the Excel document. Use this feature to verify that Tableau has correctly interpreted the Excel document and retained the data you expect.

Observe the elimination of the excess headers and the correct names of the columns. A few additional issues still need to be corrected.

First, we can hide the Indicator Name and Indicator Code columns if we feel they are not useful for our analysis. Clicking the drop-down arrow on a column header reveals a menu of options. Hide will remove the field from the connection and even prevent it from being stored in extracts:

Second, we can use the option on the same menu to split the Country Name and Code column into two columns so that we can work with the name and code separately. In this case, the Split option on the menu works well and Tableau perfectly splits the data, even removing the parentheses from around the code. In cases where the split option does not initially work, try the Custom Split... option. We'll also use the Rename... option to rename the split fields from Country Name and Code - Split 1 and Country Name and Code - Split 2 to Country Name and Country Code, respectively. Then, we'll Hide the original Country Name and Code field.

At this point, most of the data structure issues have been remedied. However, you'll recognize that the data is in a wide format. We have already seen the issues that we'll run into:

Our final step is to pivot the Year columns. This means that we'll reshape the data in such a way that every country will have a row for every year. Select all the year columns by clicking the 1960 column, scrolling to the far right, and holding Shift while clicking the 2013 column. Finally, use the drop-down menu on any one of the year fields and select the Pivot option.

The result is two columns (Pivot field names and Pivot field values) in place of all the year columns. Rename the two new columns to Year and Population. Your dataset is now narrow and tall instead of wide and short.

Finally, notice from the icon on the Year column that it is recognized by Tableau as a text field. Clicking the icon will allow you to change the data type directly. In this case, selecting Date will result in NULL values, but changing the data type to a Number (whole) will give you integer values that will work well in most cases:

Alternatively, you could use the first drop down menu on the Year field and select Create Calculated Field... This would allow you to create a calculated field name Year (date) which parses the year string as a date with code such as DATE(DATEPARSE("yyyy", [Year])).
This code will parse the string and then convert it into a simple date without a time. You can then hide the original Year field. You can hide any field, even if it is used in calculations, as long as it isn't used in a view. This leaves you with a very clean dataset.

The final dataset is far easier to work with in Tableau than the original:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.73.127