CHAPTER 6: Julia the Data Engineer


As we discussed in the previous chapter, data engineering encompasses the first parts of the data science pipeline, focusing on preprocessing (preparing) your data and generating features. Data engineering brings out the most information-rich aspects of your data and prepares them for further analysis. This maximizes your chances of getting some useful insights out of your process. Data engineering is extremely useful for complex or noisy data, since such data cannot be used as-is in the data analysis processes that ensue.

When engineering data, we explore datasets and develop visuals from their variables. We also statistically assess the relationships among those variables, as well as their connections with the target variable. This is a topic we'll cover in the next chapter, as it requires the use of particular packages. For now we'll focus on data preparation and data representation, which we can perform using the base package of Julia. This will also enable you to practice what you know about Julia programming and understand your data on a deeper level.

This chapter covers the following topics:

  • Accessing data in additional file types, such as .json
  • Cleaning up your data by removing invalid values and filling in missing values
  • Applying advanced aspects of the Julia language, useful for more customized tasks
  • Exploring various ways of data transformation
  • Storing data in different file types, so that it is readily available for the stages that follow as well as for other applications.

Data Frames

Data frames are popular data structures, originally introduced by the R analytics platform. Their main advantages over conventional data structures (such as matrices) are that they can handle missing values, and that they can accommodate multiple data types. An additional bonus of data frames is that the variables can be named and accessed through their names instead of numbers. This advantage applies to both variables stored in columns and to rows. Data frames are also easier to load directly from delimited files, such as .csv files, and saving them to such files is a straightforward process.

In order to work with data frames you need to add and load the corresponding package, called DataFrames, while also doing a package update, just in case:

In[1]: Pkg.add("DataFrames")

Pkg.update()

In[2]: using DataFrames

You can learn more about this useful package at its website: http://bit.ly/29dSqKP.

Creating and populating a data frame

You can create a data frame in Julia as follows:

In[3]: df = DataFrame()

Alternatively, if you already have your data in an array format, you can put it into a data frame by doing the following:

df = DataFrame(A)

where A is the array containing the data you want to insert into the data frame.
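
For instance, here is a minimal sketch of this (the array contents are made up for illustration); the resulting columns receive default names such as x1 and x2, which you can rename afterwards:

  A = [1 2; 3 4; 5 6]   # a hypothetical 3x2 array
  df = DataFrame(A)     # its two columns become x1 and x2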

Another way to populate a data frame is by adding DataArrays to it. A DataArray is another structure found in the DataFrames package: essentially a one-dimensional array that, just like a data frame, can include missing values. A DataArray can be constructed in a similar manner:

da = DataArray()

Or, to make it more useful, you can populate it and fill it at the same time:

In[4]: da = DataArray([1,2,3,4,5,6,7,8,9,10])

So, if you have a couple of data arrays, e.g. da1 and da2, you can put them into a data frame df under the names var1 and var2 as follows:

In[5]: df[:var1] = da1

    df[:var2] = da2

The colon right before the name of each data array is an operator that tells Julia that what follows is a symbol. Whenever it sees :var1 it will understand that you are referring to a column named var1. This may seem unusual at first, but it's a much more efficient way of referring to a column than the quotes approach used in Python's pandas package and in Graphlab (e.g. df["var1"]).

Data frames basics

Now, let’s look into how we can handle data frames effectively in order to leverage them for efficient data engineering.

Variable names in a data frame

Since having easily understood names for variables is a big selling point for data frames, let’s examine this aspect first. To find out what names the variables have, you just need to apply the names() command:

In[6]: names(df)

Out[6]: 2-element Array{Symbol,1}:

  :var1

  :var2

After seeing the result, we realize that the names we chose for the variables are not intuitive. To change them into something more meaningful, we employ the rename!() command:

In[7]: rename!(df, [:var1, :var2], [:length, :width])

There is also a version of this command without the ‘!’ part: rename(). It works the same way, but instead of modifying the data frame it is applied to, it returns a renamed copy of it. Unless you specifically want such a copy for some reason, it is best to stick with rename!().

Upon renaming the variables of our data frame, we realize that the second variable was actually supposed to be “height,” not “width.” To change the name we use the rename!() command again, this time without the array inputs, something made possible with multiple dispatch:

In[8]: rename!(df, :width, :height)

Although this command changes the data frame in a way, it doesn’t alter any of its data. We’ll look at commands that change the actual content of the data frame later in this chapter.

Accessing particular variables in a data frame

To access a particular variable in our data frame df we simply need to reference it, just like we would reference a variable in an array. Be sure to use its name as a symbol variable, though:

In[9]: df[:length]

If the name of the variable is itself a variable, you need to convert it first using the symbol() function:

In[10]: var_name = "height"

  df[symbol(var_name)]

You can also refer to a given variable by number, just like in arrays. This is less practical since the column numbers are not clearly evident (you’ll first need to view the data frame or run an exploratory command). However, if you have no idea how the variables are named and you just want to see whether the data frame has been populated correctly, you can use this option (e.g. df[1] will show you the first column and variable of data frame df).

Exploring a data frame

Once you have figured out what variables are in a data frame and have changed them into something that makes more sense to you than X1 or var1, it is time to explore the actual contents of that data frame. You can do that in various ways; there is no right or wrong strategy. However, there are some commands that will come in handy regardless of your approach. First, as a data scientist, you must pay attention to the data types, which you can easily find through the following command:

In[11]: showcols(df)

3x2 DataFrames.DataFrame

| Col # |  Name  | Eltype | Missing |

|-------|--------|--------|---------|

|   1   | length | Int64  |    0    |

|   2   | height | Int64  |    0    |

This is particularly useful in large datasets with many different variables. As a bonus, you also get some metadata about the number of missing values in each variable.

In order to get a better feel for the variables, it helps to see a small sample of them, usually the first few rows. Just like in R and Python, we can view the first six rows using the head() command:

In[12]: head(df)

If that proves insufficient, we can take a peek at the last six rows using tail(), another familiar command to many R and Python users:

In[13]: tail(df)

Since that small glimpse of data may not be enough to help us understand the variables involved, we can also ask Julia for a summary of them, through the describe() command:

In[14]: describe(df)

What is returned is a set of descriptive statistics for each variable. You could also retrieve these descriptions by examining each variable on your own, using the same command we saw in a previous chapter. But it's helpful to be able to gather this information by applying the command to the data frame directly, thanks to multiple dispatch.

Filtering sections of a data frame

More often than not, we are more interested in a particular segment of the data frame than the entire data frame. So, just like with arrays, we must know how to select parts of a data frame. This is achieved using the same method as with arrays. The only difference is that with data frames, you can refer to the variables by name:

In[15]: df[1:5, [:length]]

The output of this will be another data frame containing the data of rows 1 through 5 for the variable "length." If you omit the brackets around the variable name, e.g. df[1:5, :length], what you'll get is a data array instead. That's interesting, but the real value of filtering a data frame arises when you select rows whose positions you don't know beforehand. Just like when filtering parts of an array, filtering a data frame is possible through the inclusion of conditionals. So, if we want to find all the cases where length is higher than 2 we type:

In[16]: df[df[:length] .> 2, :]

The colon after the comma tells Julia that you want the data from all the variables. Long expressions like this one can be somewhat challenging if you are not used to them, and may even cause errors due to typos. One good practice to simplify expressions is to introduce an index variable, say ind, that replaces the conditional. Naturally, this is going to be a binary variable, so it shouldn’t take up too much memory:

In[17]: ind = df[:length] .> 2

  df[ind, :]

Applying functions to a data frame’s variables

If you want to apply a certain function to a single variable, just apply it to the corresponding data array, exactly how you would if you were dealing with a regular array. If your target is a series of variables or the whole data frame, then the colwise() command is what you need:

In[18]: colwise(maximum, df)

If df had a bunch of variables and you wanted to apply a function, say mean(), to only length and height, then you would proceed as follows:

In[19]: colwise(mean, df[[:length, :height]])

Applying a function to a data frame does not change the data frame itself. If you want the changes to be permanent you’ll need to take an additional step and assign the result of the colwise() function to the variables involved.
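
One simple way to make such changes stick (shown here as a minimal sketch, with a doubling transformation that is purely an assumption for illustration) is to loop over the columns and assign the transformed values back, which has the same effect as storing the output of colwise():

  for name in names(df)
      df[name] = df[name] * 2   # hypothetical transformation: double every value
  end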

Working with data frames

Now that we’ve learned the basics of data frames, let’s see how we can use them to better understand our data. To start, let’s try using data frames to handle more challenging cases where there is some data missing from the dataset at hand.

Finding that your data has missing values is a major pain, but it's a common phenomenon that needs to be addressed. Fortunately, data frames are well-equipped to help: the DataFrames package has a data type dedicated to missing values, NA (standing for Not Applicable or Not Available). These NAs, even if they are few, can cause some serious issues, since no operations applied to the corresponding variables will work:

In[20]: df[:weight] = DataArray([10,20,-1,15,25,5,10,20,-1,5])

  df[df[:weight] .== -1,:weight] = NA

  mean(df[:weight])

Out[20]: NA

This code returns an NA because mean() cannot handle data arrays with missing values, leading them to taint the whole array. To get rid of missing values, you must either replace them with something meaningful (in the context of the data type of the variable where they are present), or delete the rows where they are present. You could also delete the whole variable, if the missing values are the majority of its elements, but not without loss of some signal in the dataset.

The most effective strategy is to replace the missing values with something that makes sense, such as the median, mean, or mode of that variable. In order to do this, we first need to spot the missing values (marked as NAs) in the data frame, which is made possible using the isna() function:

In[21]: isna(df[:weight])

Out[21]: 10-element BitArray{1}:

  false

  false

  true

  false

  false

  false

  false

  false

  true

  false

To make the result more intuitive, we can use isna() in combination with the find() function:

In[22]: find(isna(df[:weight]))

Out[22]: 2-element Array{Int64,1}:

  3

  9

So, we discover that we have missing values in rows 3 and 9. Now we’re ready to replace them with something else. The best choice is usually the mean of the variable (for classification problems, we would take into account the class each one of these values belongs to). To ensure that the type of the variable is maintained, we will need to round it off, and convert it into an Int64:

In[23]: m = round(Int64,mean(df[!isna(df[:weight]), :weight]))

In[24]: df[isna(df[:weight]), :weight] = m

  show(df[:weight])

[10,20,14,15,25,5,10,20,14,5]

Altering data frames

As we said earlier, there’s an alternative method for dealing with missing values: deleting the entire row or column. This strategy is less desirable than simply filling in missing values, because the columns or rows in question may still contain valuable data. However, if you do need to make alterations to data frames (for this or any other reason), it can be done easily using the delete!() command:

In[25]: delete!(df, :length)

If you want to meddle with the rows, use the push!() and @data() commands:

In[26]: push!(df, @data([6, 15]))

With this command we've just added another data point having height = 6 and weight = 15 (remember that the length variable was removed in the previous step). Although regular arrays can't handle NAs (a type that only works with DataFrame and DataArray objects), you can still include NAs in a @data() expression.

If you wish to delete certain rows, you can do that using the deleterows!() command:

In[27]: deleterows!(df, 9:11)

Alternatively, you can use an array of indexes instead of a range, which has its own advantages:

In[28]: deleterows!(df, [1, 2, 4])

Sorting the contents of a data frame

One way to look at data frames is like tables in a database. As such, it is usually handy to apply some ordering operations to them, making their data more readily available. A couple of functions that are particularly useful for this task are by() and sort!(). The former groups the rows by the unique values of a given variable, listing those values in ascending order. It is often used in combination with the function nrow(), which returns the number of rows of a data frame; combined, they count how many instances of each value exist in the data frame. For example:

In[29]: by(df, :weight, nrow)

Out[29]:

  weight  x1

1  5   1

2  10  1

3  14  1

4  20  1

5  25  1

If you wish to order all the elements of a data frame based on a variable (or a combination of variables), the sort!() command is what you are looking for:

In[30]: sort!(df, cols = [order(:height), order(:weight)])

This code takes data frame df and sorts all its elements based on their height values and then on their weight. As there are no duplicates in the height variable, the second part of the ordering is redundant in this case, though it may be useful for a larger sample of this dataset.

Data frame tips

Another interesting kind of data frame is the SFrame, a scalable tabular and graph data structure developed by Turi Inc. (formerly known as Dato) and used in the Graphlab big data platform. SFrames are designed to handle very large datasets that don't fit in the computer's memory. By working directly off the hard disk, they allow for scalable data analytics on both conventional table-based data and graph data. Although still experimental in Julia, SFrames are worth keeping in mind, as they are very versatile and can allow for seamless data transfer among various platforms. Should you wish to try them out in Julia, you can do so by using the SFrames package found at http://bit.ly/29dV7fy.

Although data frames have proven popular since their initial release (particularly among statisticians), they don't always offer a significant advantage in analytics work. So if you are working on a new data processing algorithm, you may still need to use conventional arrays. Also, arrays tend to be faster than data frames. If you need to do a lot of looping, you may want to steer away from data frames, or use them in combination with arrays so that you get the best of both worlds.

Data frames are particularly useful when loading or saving data, especially when dealing with missing values. We will look into how you can incorporate data frames in your IO processes in the next section.

Importing and Exporting Data

Accessing .json data files

In Chapter 2 we saw how we could load data from .csv, .txt, and .jld files. However, we often need to access semi-structured data that is in other data formats. One such data format is .json and it’s widely used for storing all kinds of data across different applications. You can load the data from a .json file into a variable X using the JSON package:

In[31]: Pkg.add("JSON")

In[32]: import JSON

In[33]: f = open("file.json")

  X = JSON.parse(f)

  close(f)

Note that we could have typed “using JSON” instead of “import JSON”. That would also work, but it's best to avoid that option, as some of the JSON package's functions conflict with functions of the Base package. Also, the variable containing the parsed .json file is a dictionary, which makes sense since a .json file is essentially a dictionary structure too. Moreover, the parse() function of the JSON package can also be applied to a string containing the .json contents.
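
For example, here is a quick sketch of parsing a string directly (the string contents are made up):

  X = JSON.parse("{\"temp\": 25.3, \"units\": \"Celsius\"}")   # yields a Dict with keys "temp" and "units"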

Storing data in .json files

Because your data will usually be in tables, saving it to a .json file isn’t something you will regularly want to do. But if the need should ever arise, Julia has this option for you. Apply the following code to export the dictionary variable X as a .json file:

In[34]: f = open("test.json", "w")

  JSON.print(f, X)

  close(f)

The test.json file that is created is bound to look different than the file.json file we used earlier. The reason is that the JSON package exports data in a compact format, omitting all extra spaces and new line characters that you would normally see in a data file. You can read about this package at http://bit.ly/29mk8Ye.

Loading data files into data frames

Data frames and data files are intimately related, so converting one to the other is a straightforward process. If you want to load data from a file into a data frame, your best bet is the readtable() command:

In[35]: df = readtable("CaffeineForTheForce.csv")

If you know how the missing values are denoted in the data file and you want to save some time, you can take care of them while loading the dataset into a data frame, using the following:

df = readtable("CaffeineForTheForce.csv", nastrings = ["N/A", "-", ""])

This command will load the Caffeine for the Force dataset, taking into account that missing values are represented as one or more of the following strings: "N/A", "-", "" (empty string).

You can load all kinds of delimited files into a data frame using the command readtable().

Saving data frames into data files

If you want to save data from a data frame into a file, the command you are looking for is writetable():

In[36]: writetable(“dataset.csv”, df)

This command saves our data frame df into a delimited file called dataset.csv. We can use the same command to save it in another format, without any additional parameters:

In[37]: writetable(“dataset.tsv”, df)

You can see that the new file has tab-separated values. Unfortunately, the DataFrames package supports only a limited range of delimited formats, so if you try to save your data frame as something it doesn't recognize, it will be exported as a .csv.

Cleaning Up Data

To clean up data, you must address certain parts that don’t fit or create gaps in your data structures (usually arrays or data frames). The first thing to do is take care of the missing values as we saw earlier in the chapter. Beyond that, depending on the data type, you can clean up your data in different ways.

Cleaning up numeric data

Cleaning up numeric data involves examining outliers: those nasty data points that usually have an extreme value in one or more of the variables, and an extreme capacity to mess up your models. Exterminating extreme values in a variable can be very helpful, especially when employing distance-based models. Even though sophisticated techniques like Deep Networks and Extreme Learning Machines are practically immune to outliers, it is good practice to handle them at this stage since a well-engineered dataset allows for a more versatile data modeling phase.

Outliers are an elusive kind of abnormality, not as clear-cut as missing values. Even after years of evolution of data analytics methods, there is still no foolproof approach to identifying them and no consensus as to whether we should eliminate them at all. That’s because certain outliers may actually contain important information and should therefore not be ignored (it was the unconscious elimination of outliers, after all, that caused the hole in the ozone layer to remain undetected for a while).

Let's look at how we can find these outliers and then decide whether we ought to eliminate them (or replace them with some other value that makes more sense, as is often the case). The process is relatively simple: all variables follow a certain distribution (which may or may not coincide with the familiar ones from statistics theory). So, certain values are more likely to be found than others, depending on the form of that distribution. If a particular value is extremely unlikely to be found (say, less than 1% likely), then it qualifies as an outlier.

If we have a series of numeric variables, it makes sense to take into account all of the normalized variables together, when looking for outliers. We’ll examine this process in the next section. For now, let’s just pinpoint the extreme values for an individual variable. Check out the corresponding section in the “Caffeine for the Force” case study, where several outliers are pinpointed.
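
To give you a taste of what this looks like in practice, here is a minimal sketch (not the case study's exact method, and the data is made up) that flags the values of a numeric variable x lying outside its 1st and 99th percentiles:

  x = [5.1, 4.8, 5.3, 97.0, 5.0, 4.9, 5.2, -80.0, 5.1, 5.0]   # hypothetical variable
  lo, hi = quantile(x, 0.01), quantile(x, 0.99)
  outlier_ind = find((x .< lo) | (x .> hi))   # indexes of the suspected outliers
  show(outlier_ind)
  [4,8]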

Once an outlier is identified, it is often replaced with something else, such as the maximum or minimum value of that variable. If it appears that none of these are meaningful (e.g. we decide that the outlier is probably due to some typo), we can also treat it as a missing value and replace it with the mean or median of that variable. In the case of a classification problem, it is useful to take into account the class of that data point before replacing its value. If we have a lot of data, it is not uncommon to get rid of that data point altogether. See the “Caffeine for the Force” case study for examples of how outliers are handled in the corresponding dataset.

Cleaning up text data

Cleaning up text data is fairly straightforward, but rather time-consuming. The idea is to remove “irrelevant” characters from the text, including but not limited to the following:

  • Punctuation marks
  • Numbers
  • Symbols ( “+”, “*”, “<”, etc.)
  • Extra white spaces
  • Special characters ( “@”, “~”, etc.).

Depending on the application, you may also want to remove words that don’t appear in the dictionary, names, stop words, or anything else that isn’t relevant to the problem you are trying to solve.

One efficient way of stripping a given text (stored in variable S) of most of the irrelevant characters is the following:

In[38]: Z = ""

  for c in S

   if lowercase(c) in "qwertyuiopasdfghjklzxcvbnm "

     Z = string(Z,c)

   end

  end

The result of the above snippet, variable Z, is a string variable stripped of all characters that are not in the alphabet and are not spaces. This may not be perfect (duplicate spaces may still be present, along with other undesirable character patterns), but it’s a good start. In the next section we’ll see how we can make the text data even more useful by applying normalization.
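
Should you want to go one step further, a small follow-up sketch like the one below (a simple regular expression substitution, not part of the original snippet) collapses any runs of spaces that remain:

  Z = replace(Z, r" +", " ")   # squeeze consecutive spaces into a single one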

Formatting and Transforming Data

A clean dataset is much more usable, but it’s still not entirely ready (although you could use it as-is in a model). To further polish it we must format it, which can save a lot of valuable resources. To format your dataset you must examine each one of its variables and determine the appropriate strategy depending on what kind of data it contains. Once the formatting is done, you’ll often need to make some transformations (e.g. normalize the data) to ensure a cleaner signal. Let’s look at each one of these steps, for the different types of data, in more detail.

Formatting numeric data

Formatting numeric data is simple. Basically, you need to decide which data type best encapsulates the nature of the variables involved, and apply this type to each one of them. Doing so is a matter of intuition and domain knowledge. A good statistics book can get you started. Once you decide on the type of a numeric variable you need to format it accordingly. The Julia command convert() will do just that:

In[39]: x = [1.0, 5.0, 3.0, 78.0, -2.0, -54.0]

   x = convert(Array{Int8}, x)

  show(x)

  Int8[1,5,3,78,-2,-54]

In this example, Julia takes an array of floats and formats it as an array of 8-bit integers. You could always use other kinds of integers too, depending on your particular data. For instance, if you think this variable may take values beyond the Int8 range (-128 to 127), you may want to use Int16 or higher. Whatever the case, float may not be the best type, since none of the elements of that variable seem to have a value that cannot be expressed as an integer. If, however, you plan to perform distance calculations with this data, you may want to keep the float type and just reduce the memory it takes up:

In[40]: x = convert(Array{Float16}, x)

Formatting text data

The main method of formatting text data is to convert it to the appropriate string subtype. Specifically, you'll need to decide how your text data is going to be encoded. Using the default string type, AbstractString, usually works well for all your text manipulation applications.

On the rare occasions when you must analyze individual characters, you need to convert your data into the Char type. Keep in mind, though, that a Char variable can only contain a single character. More often than not you'll use Char as part of an array (i.e. Array{Char}). It's important to remember that even if the contents of a string variable of length 1 match a particular Char variable exactly, Julia will never see them as the same thing:

In[41]: 'c' == "c"

Out[41]: false

So, if you wish to avoid confusions and errors, you may want to employ the convert() function accordingly:

convert(Char, SomeSingleCharacterString)

Or, in the opposite case:

convert(AbstractString, SomeCharacterVariable)

Fortunately, string() has been designed more carefully; when you need to aggregate string variables, you can use any combination of AbstractString variables (or any of its subtypes) and Char variables.
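
Here is a brief illustration of this flexibility (the values are arbitrary):

  c = '!'
  s = "Hello"
  string(s, ", world", c)   # yields "Hello, world!"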

Importance of data types

Although we talked about this in Chapter 2, it cannot be stressed enough that data types need to be chosen carefully, particularly when dealing with large data sets. An incorrect data type may waste valuable resources (especially RAM). That’s the best case scenario, since an incorrect data type may yield some strange errors or exceptions, resulting in a waste of another valuable resource: time. So, be mindful when choosing types for your dataset’s variables. A set of well-defined variables can go a long way.

Applying Data Transformations to Numeric Data

One arena where Julia excels is numeric computations, which form the core of data transformations (keep in mind that string variables can be seen as numbers coded in a particular way). Data transformations are a core part of data engineering as well, which is why Julia often doubles as a data engineer.

This is important to keep in mind because it weakens the argument that Julia isn’t useful in data science because it doesn’t have enough mature packages. Those who make this argument ignore the fact that a large part of data science involves tasks that don’t need packages. Julia proves them wrong by being adept at numeric computations, and by gradually growing its package spectrum. Data transformation is at the core of all this.

There are several data transformation tasks that are applicable to numeric variables. Here we’ll focus on the most important ones: normalization, discretization or binarization, and making a binary variable continuous.

Normalization

This process involves taking a variable and transforming it in such a way that its elements are not too large or too small compared to other variables’ elements. Statisticians love this data transformation because it often makes use of statistical techniques. However, it is much deeper than it is often portrayed in statistics books.

A good normalization will allow for the development of several new features (e.g. log(x), 1/x) that can significantly enhance the performance of basic classifiers or regressors. These features may or may not be possible using the conventional normalization techniques, depending on the data at hand. Regardless, these normalization techniques can improve your dataset, especially if you are planning to use distance-based methods or polynomial models.

There are three main normalization techniques used in data science, each of which has its advantages and drawbacks:

1. Max-min normalization. This approach is fairly straightforward and it’s computationally cheap. It involves transforming a variable to [0, 1], using its maximum and minimum values as boundaries. The main issue with this method is that if there are outliers in the variables, all of the values of the normalized variable are clustered together. A typical implementation in Julia is:

   norm_x = (x - minimum(x)) / (maximum(x) - minimum(x))

2. Mean-standard deviation normalization. This is the most common approach, and it’s also relatively computationally cheap. It makes use of the mean and the standard deviation as its parameters and yields values that are centered around 0 and not too far from it (usually they are in the area -3 to 3, though there are no fixed extreme values in the normalized variable). The main drawback of this method is that it yields negative values, making the normalized variable incompatible with certain transformations like log(x). There are also no fixed boundaries for the normalized variable. Here is how you can implement it in Julia:

  norm_x = (x - mean(x)) / std(x)

3. Sigmoidal normalization. This is an interesting approach that is more computationally expensive than the others. It is immune to outliers and yields values in (0, 1). However, depending on the data, it may yield a clustered normalized variable. With the proper parameters it can yield a virtually perfect normalized variable that is both meaningful and applicable for other transformations (e.g. log(x)). Finding these parameters, however, involves advanced data analytics that are way beyond the scope of this book. Its more basic implementation is straightforward in Julia:

  norm_x = 1 ./ (1 + exp(-x))

Whichever normalization method you choose, it is good practice to make a note of it, along with the parameters used whenever applicable (e.g. in the mean-standard deviation normalization, note the mean and standard values you used in the transformation). This way, you can apply the same transformation to new data as it becomes available and merge it smoothly with your normalized dataset.
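
A minimal sketch of this bookkeeping for the mean-standard deviation method (the variable names are assumptions) would be:

  mu, sigma = mean(x), std(x)   # keep these two parameters around
  norm_x = (x - mu) / sigma
  # later, when a new batch x_new arrives, reuse the stored parameters:
  # norm_x_new = (x_new - mu) / sigma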

Discretization (binning) and binarization

Continuous variables are great, but sometimes it makes sense to turn them into ordinal ones, or even a set of binary variables. Imagine for example a variable that is brimming with missing values or a variable whose values are polarized (this is often the case with variables stemming from certain questionnaires). In such cases the additional detail that the continuous variable provides is not helpful. If a variable age has values that are clustered around 20, 45, and 60, you might as well treat it as a trinary variable age_new having the values young, middle_aged, and mature.

Turning a continuous variable into an ordinal one is a process called discretization, or binning (since all the values of the variable are put into several bins). One place where this process is utilized is the creation of a histogram, which will be covered in the next chapter.

To perform discretization you merely need to define the boundaries of each new value. So, in the previous example with the age variable, you can say that anyone with age <= 30 is young, anyone with age > 55 is mature, and anyone else is middle_aged. The exact values of the boundaries are something you can figure out from your data. In general, for N bins, you’ll need N-1 boundary values, which don’t have to follow any particular pattern.
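
Here is a minimal sketch of the age example (the boundary values 30 and 55, as well as the data, are assumptions):

  age = [22, 41, 67, 29, 58, 45]   # hypothetical ages
  age_new = [a <= 30 ? "young" : (a > 55 ? "mature" : "middle_aged") for a in age]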

Turning a discrete variable into a set of binary ones is an equally straightforward process. You just need to create one binary variable for each value of the discrete one (though you can actually omit the last binary variable if you want, since it doesn't yield any additional information). For example, in the previous case of the age_new variable, you can turn it into a set of 3 binary ones, is_young, is_middle_aged, and is_mature:

In[42]: age_new = ["young", "young", "mature", "middle-aged", "mature"]

  is_young = (age_new .== "young")

  is_middle_aged = (age_new .== "middle-aged")

  is_mature = (age_new .== "mature")

  show(is_young)

  Bool[true,true,false,false,false]

You could also handle missing values this way:

In[43]: age_new = ["young", "young", "", "mature", "middle-aged", "", "NA",

  "mature", ""]

  is_missing = (age_new .== "") | (age_new .== "NA")

  show(is_missing)

  Bool[false,false,true,false,false,true,true,false,true]

If there are several ways that a missing value is denoted, you can put them all in an array NA_denotations and use the following code for the missing values variable instead:

is_missing = [age_value in NA_denotations for age_value in age_new]

Binary to continuous (binary classification only)

Sometimes we need more granularity but all we have is a bunch of binary variables (these are usually variables that came from the original data, not from another continuous variable). Fortunately, there are ways to turn a binary variable into a continuous one—at least for a certain kind of problem where we need to predict a binary variable. These are often referred to as “binary classification problems”.

The most popular method for turning a binary variable into a continuous one is the relative risk transformation (for more information, see http://bit.ly/29tudWw). Another approach that's somewhat popular is the odds ratio (for details, see http://bit.ly/29voj4m). We'll leave the implementation of these methods as an exercise for you. (Hint: create a table containing all four possible combinations of the binary variable you want to transform and the class variable.)

Applying data transformations to text data

When it comes to text data, Julia proves itself to be equally adept at performing transformations. In particular, you can change the case of the text (lower or upper case), and also turn the whole thing into a vector. Let’s look into each one of these processes in more detail.

Case normalization

In common text, character case fluctuates often, making it difficult to work with. The most popular way to tackle this is through case normalization, i.e. changing all the characters of the text into the same case. Usually we choose to normalize everything to lower case, since it’s been proven to be easier to read. Happily, Julia has a built-in function for that–we’ve seen it a couple of times already:

In[44]: S_new = lowercase(S)

Out[44]: "mr. smith is particularly fond of product #2235; what a surprise!"

Should you want to normalize everything to upper case instead, you can use the uppercase() function. Non-alphabetic characters are left untouched, making lowercase() and uppercase() a couple of versatile functions.

Vectorization

Vectorization has nothing to do with the performance-improving vectorization performed in R, Matlab, and the like. In fact, in Julia it's best to avoid this practice since it doesn't help at all (it's much faster to use a for-loop instead). The vectorization we deal with here involves transforming a series of elements in a text string (or a “bag of words”) into a relatively sparse vector of 1s and 0s (or, equivalently, a Boolean or BitArray vector). Although this practice may take its toll on memory resources, it is essential for any kind of text analytics work.

Let’s take a simple dataset comprising four sentences and see how it can be vectorized:

In[45]: X = ["Julia is a relatively new programming language", "Julia can be used in data science", "Data science is used to derive insights from data", "Data is often noisy"]

The first step is to construct the vocabulary of this dataset:

In[46]: temp = [split(lowercase(x), " ") for x in X]

  vocabulary = unique(temp[1])

  for T in temp[2:end]

   vocabulary = union(vocabulary, T)

  end

  vocabulary = sort(vocabulary)

  N = length(vocabulary)

  show(vocabulary)

  

  SubString{ASCIIString}["a","be","can","data","derive","from","in",

  "insights","is","julia","language","new","noisy","often",

  "programming","relatively","science","to","used"]

Next we need to create a matrix for the vectorized results and populate it, using this vocabulary:

In[47]: n = length(X)

  VX = zeros(Int8, n, N) # Vectorized X

  for i = 1:n

   temp = split(lowercase(X[i]))

   for T in temp

     ind = find(T .== vocabulary)

     VX[i,ind] = 1

   end  

  end

  println(VX)

  Int8[1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0

    0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1

    0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1

    0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0]

Here we chose 8-bit Integers as the type for the output, but we could have gone with Bool or BitArray if we preferred. Also, we could have removed some stop-words from the vocabulary, in order to make the whole process and the end result a bit lighter (and potentially more useful).

The vectorized representation of our text array X is a relatively compact (dense) matrix VX. That's because the vocabulary size is small (just 19 words). In most real-world problems, though, the number of words in a vocabulary is significantly larger; this will generate a huge matrix VX which will be sparse (i.e. the majority of its elements will be zeros). To remedy this, we often employ some kind of factorization or feature selection.

Preliminary Evaluation of Features

Feature evaluation is a fascinating part of data analytics, as it provides priceless insight into the value of individual features. In essence, feature evaluation serves as a proxy for a feature's predictive potential, giving you an intuitive measure of how useful it is going to be in predictive models. This is something useful to know, especially if you plan to use computationally expensive methods in the stages that follow. The method we use to evaluate features greatly depends on the problem at hand, particularly on the nature of the target variable. So, we'll tackle feature evaluation in the case of both a continuous target variable (as in regression problems) and a discrete one (as in classification problems).

Regression

For regression problems there are two main ways to evaluate a feature:

  • Examine the absolute value of the coefficient of a regression model, such as linear regression, support vector machine (SVM), or a decision tree
  • Calculate the absolute value of the correlation of that feature with the target variable (particularly the rank-based correlation).

The higher each one of these two values is, the better the feature in general.
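
As a small sketch of the second option (the feature and target values are made up, and the Pearson correlation from the Base package stands in for a rank-based one, which would require an additional package):

  f = [1.0, 2.5, 3.1, 4.0, 5.2]   # hypothetical feature
  y = [2.1, 4.8, 6.4, 8.1, 9.9]   # hypothetical continuous target
  feature_score = abs(cor(f, y))  # closer to 1 suggests a more promising feature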

Classification

Here feature evaluation gets a bit trickier, since there is no universally agreed-upon way to perform it for classification problems. There is a plethora of methods for evaluating a classification feature accurately, the most important of which are the following:

  • Index of Discernibility. This metric was developed especially for this purpose. Although it was originally created to assess whole datasets, it also works well with individual features and all kinds of classification problems. It takes values in [0, 1] and is distance-based. The publicly available versions of this metric are the Spherical Index of Discernibility (original metric), the Harmonic Index of Discernibility (simpler and somewhat faster), and the Distance-based Index of Discernibility (much faster and more scalable).
  • Fisher's Discriminant Ratio. This metric is related to linear discriminant analysis, and also doubles as a feature evaluation measure. It works with individual features only and is easy to compute (a small sketch of it follows this list).
  • Similarity index. This simple metric works with discrete features only. It takes values in [0, 1] and is very quick to calculate.
  • Jaccard Similarity. This simple yet robust metric is also applicable to discrete features only. It takes values in [0, 1], is quick to compute, and focuses on an individual class of the target variable.
  • Mutual Information. This is a well-researched metric for both discrete and continuous variables. It yields positive values and is fast to compute (though its use on continuous variables is not as easy, since it involves complex intervals). It's also simple to normalize it so that it yields values in [0, 1].
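
As promised, here is a minimal sketch of Fisher's Discriminant Ratio for a single feature and a binary target (both made up); a larger value suggests better separation between the two classes:

  f = [1.1, 0.9, 1.3, 3.2, 2.9, 3.5]   # hypothetical feature
  y = [0, 0, 0, 1, 1, 1]               # hypothetical binary target
  f1, f2 = f[y .== 0], f[y .== 1]
  FDR = (mean(f1) - mean(f2))^2 / (var(f1) + var(f2))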

Although uncommon, it is possible to combine a pair (or more) of these metrics for a more robust evaluation of a feature. If you want to evaluate feature combos to see how well they work together, your only option is the Index of Discernibility.

Feature evaluation tips

Whichever metric you choose, keep in mind that although a feature’s estimated value is a proxy to its performance, it is not absolute. Poor features have been proven to be essential for robust classifications, as they seem to fill the information gaps of other features. So, no feature evaluation metric can be 100% accurate in predicting a feature’s value in practice. However, you can effectively use feature evaluation to eliminate the obviously useless features from your dataset, conserving resources and making the stages that follow easier.

Summary

  • Data engineering is an essential part of the data science pipeline, and although it can be time-consuming and monotonous, it can save you a lot of time in the long run.
  • Data frames are popular data structures that can handle missing values (represented as NA) effectively, using the isna() function to identify them. It’s also easy to load data into them, using the readtable() command, as well as to save them into delimited files, using the writetable() function.
  • Accessing data from .json files can be accomplished using the JSON package and its parse() command. You can create a .json file using the print() command of the same package. Data extracted from a .json file is made available as a dictionary object.
  • Cleaning up data is a complex process comprising the following elements, depending on the sort of data we have:
    • Numeric data: eliminate missing values, handle outliers.
    • Text data: remove unnecessary characters, remove stop words (in the case of text analytics).
  • Formatting data (casting each variable as the most suitable type) is important as it saves memory resources and helps prevent errors later on.
  • Transformations are commonplace in data engineering and depend on the kind of data at hand:
    • Numeric data: normalization (making all features of comparable size), discretization (making a continuous feature discrete), binarization (turning a discrete feature into a set of binary ones), and making a binary feature continuous (for binary classification problems only).
    • Text data: case normalization (making all characters lower or upper case) and vectorization (turning text into binary arrays).
  • Feature evaluation is essential in understanding the dataset at hand. Depending on the type of modeling that you plan to do afterwards, there are different strategies for accomplishing feature evaluation, the most important of which are:
    • Index of Discernibility – continuous features.
    • Fisher's Discriminant Ratio – continuous features.
    • Similarity Index – discrete features.
    • Jaccard Similarity – discrete features.
    • Mutual Information – both discrete and continuous features.

Chapter Challenge

  1. What is the importance of data engineering in a data science project?
  2. What are the main advantages of data frames over matrices?
  3. How can you import data from a .json file?
  4. Suppose you have to do some engineering work on a data file that is larger than your machine's memory. How would you go about doing this?
  5. What does data cleaning entail?
  6. Why are data types important in data engineering?
  7. How would you transform numeric data so that all the variables are comparable in terms of values?
  8. What data types in Julia would you find most useful when dealing with a text analytics problem?
  9. Suppose you have some text data that you are in the process of engineering. Each record has a string variable containing the presence or not of certain key words and key phrases. How would you save that file efficiently so that you could use it in the future and also share it with the engineering team?
  10. How would you evaluate the features in the OnlineNewsPopularity dataset? What about the ones in the Spam dataset?