CHAPTER 2: Setting Up the Data Science Lab


Just like every applied scientist worth the title, if you are to use Julia for data science, you will need to set up a lab where you can explore the data at your disposal and distill its essence. Similar to most other programming languages, the interface that comes pre-equipped with Julia is minimalistic. Usually referred to as the Read-Eval-Print Loop (REPL), this basic interface is great for trying out the language and creating some fundamental scripts.

However, if you want to do some serious work with Julia, you will need an IDE tailored for Julia. And if you want to do something powerful enough to wow your organization, you will need to equip Julia with a few additional packages beyond the base set.

In this chapter we will examine the following topics:

  • IDEs and packages for Julia
  • Walkthrough of the IJulia IDE
  • Datasets we’ll be using in this book
  • An example of a simple machine learning algorithm implemented in Julia
  • Saving your workspace into a data file

Before we start to set up the lab, though, you must install Julia on your machine. Appendix A will walk you through how to install Julia.

Julia IDEs

It is highly recommended that you install an IDE. Alternatively, if you are comfortable with text editors, you can get your editor of choice to recognize Julia code, enabling you to create and edit Julia scripts there with ease.

Moreover, now would be a good time to get acquainted with Julia’s selection of packages and how you can install them on your machine. You can also start getting a bit of hands-on experience with Julia, using it on some relatively simple algorithms, while at the same time getting acquainted with its IO capabilities.

Juno

Juno is a minimalistic yet powerful IDE, based on Light Table, that specializes in Julia scripts. Juno is equipped with autocomplete, which predicts the functions or variables that you will type, similar to the predictive text input features of most smartphones. Autocomplete increases the speed and accuracy with which you code.

Juno is intuitive and easy to master–definitely a worthy option for Julia. If you wish to learn more about its functionality, you can always refer to its abundant online documentation, which is accessible via the “Help” menu. You can see a screenshot of Juno in Figure 2.1.

Juno’s console features a handy little animation that shows you when Julia is occupied with a script or with starting up; this makes it easy to understand what’s currently going on (you can see it in Figure 2.1, right below the line numbers). Unfortunately, you cannot write directly in the console area, as you would in other IDEs (such as Canopy, for Python). That’s not too inconvenient, though, since it is easy to run individual commands in a script window by pressing Ctrl-Enter (or Shift-Enter) after typing your command.

It’s important to remember that whenever you want to execute something in Juno, you have to make sure that the script you are using has the .jl extension, as this is the only way for the IDE to recognize Julia code and therefore be able to run it.


Figure 2.1 A screenshot of Juno with all its viewers open.

When scripts don’t work as expected, Juno has a straightforward way of letting you know: through the use of red highlighted text in the console (see Figure 2.2).


Figure 2.2 A partial screenshot of Juno, focusing on the part of the IDE where an error message would appear. The text highlighted in red denotes encountered issues.

Although Juno is still in development, it has a lot of great features that are constantly being refined. A recent development on the Atom software (a generic IDE for all kinds of languages) allowed it to integrate the Julia kernel, enabling it to identify and run Julia scripts (see http://bit.ly/2akTBua for details).

What’s more, Atom allows the user to switch back and forth between the text editor (where you normally develop all your scripts) and the REPL, making it ideal for testing out ideas before you put them in your programs. This makes the Julia plug-in for Atom a viable alternate IDE, which is gradually being incorporated into the Juno project. So keep a close eye on these developments, as they may prove to be the new norm for Julia IDEs. You can learn more about this in the corresponding article in the official Julia blog (http://bit.ly/29Nzsf1).

IJulia

Until the Atom-based option matures (eventually making Juno the go-to IDE for Julia scripts) there are other options out there that are already developed enough to be completely reliable. One such option is IJulia, which is simply Julia on your web browser. If you come from a Python background and you are already used to IPython notebooks, you may want to get started working with IJulia. In fact, this is the IDE we’ll be using throughout the book, partly because it’s more established (the creators of Julia and other experts use it in most of their presentations), and partly because it is easy to showcase code using this IDE.

Curiously, IJulia is in a way older than any other Julia IDE, since it is essentially the notebook software that was developed for Python (currently known as Jupyter, and formerly known as IPython, available at https://jupyter.org), with the only difference being the Julia kernel in the backend. Consider Jupyter a car that can have different engines under its hood: although it usually runs on the Python engine, it works perfectly with the Julia engine as well.

IJulia is also handy when it comes to presentations and tutorial building, as well as data exploration. It’s no coincidence that Jupyter is popular among data scientists who use Python.

All IJulia notebooks are rendered on your web browser, although their code files are stored natively. You can find the corresponding installation files on this site: http://bit.ly/1aA7oeg. Alternatively, you can run it from the cloud, through JuliaBox (https://juliabox.com), although you will need a Google account for that. You can get an idea of IJulia’s interface in Figure 2.3.


Figure 2.3 A screenshot of IJulia from its online version, JuliaBox. This interactive version of the language is ideal for sharing code in presentations, for building tutorials, and for high-level programming.

Additional IDEs

In addition to Juno and IJulia, another Julia IDE was recently released by the company that created Julia Studio. Julia Studio was an excellent IDE that made Julia as user-friendly as R and Matlab, and is probably what made people fall in love with Julia “back in the day.” However, it has been discontinued for a while now, even though some Julia tutorials still refer to it as the go-to IDE for Julia.

This latest IDE is called Epicenter. Although it is a worthy alternative to the existing IDEs, the functionality of its free version is limited. If you wish to create more elaborate Julia applications, you may want to consider investing in one of Epicenter’s paid editions. You can find more about it here: http://bit.ly/29Nzvrv.

Another IDE that so far has escaped publicity is the Julia IDE that was created by the well-known Tutorials Point website (http://bit.ly/29KlxWp). It is similar to JuliaBox in that it’s cloud-based, but doesn’t require you to log in and has an intuitive interface that rivals that of Juno. Furthermore, it offers you direct access to the console, so you can work on script development and script experimentation in the same window (instead of having to switch between the editor and the REPL, or between two different tabs of the Juno editor). However, it doesn’t always have the latest version of Julia and may therefore not be able to run state-of-the-art packages properly, so we recommend you use it only for simple scripts.

You can always use a combination of all these IDEs. Maybe you can have Juno on your main workstation, use IJulia when sharing your code with your team and/or your clients, use JuliaBox when you are on a computer that doesn’t have Julia installed (and you don’t have the option to do so), and use Epicenter when working on more sophisticated projects that require advanced features, such as sophisticated GUIs.

Finally, if you are comfortable using the REPL coupled with a simple text editor for all your Julia development tasks, and you are a Windows user, you can use Notepad++, which now has a Julia language package available (http://bit.ly/29Y9SWL). Just be sure to save your script files with the .jl extension so that the text editor recognizes them as Julia scripts and applies the right highlighting to them. Alternatively, if you are using Linux or Mac OS, you could use Emacs for your Julia scripts by applying the corresponding language pack (http://bit.ly/29Y9CqH).

Julia Packages

Finding and selecting packages

As we learned in the previous chapter, packages are an important and sought-after aspect of any programming language. The reason is simple: packages empower you to efficiently do things that would normally take a lot of time to do yourself, by providing the essential auxiliary functions. The best part about packages is that the tasks they are designed to help with are the tasks that are most tedious and time-consuming. Until you reach a level of mastery of Julia, you may need to rely on packages to obtain the functions you need to work with your data.

Finding packages is a relatively straightforward task when it comes to the Julia platform. The safest place to do so is through the corresponding page on the official Julia site (http://pkg.julialang.org), which provides a list of all the packages that have been officially adopted by the Julia ecosystem. These packages can also be found on GitHub: https://github.com/JuliaLang.

If you are more adventurous with your package use, you can always roam around GitHub to find other, more experimental Julia code repositories. Keep in mind, however, that these are provided on an as-is basis and you are welcome to use them at your own risk. Be sure to install all of the dependencies they may have to ensure a smoother experience.

Selecting the right package for the job is also a straightforward process, if a more subjective one. What we recommend is to carefully read the documentation provided in the package’s README.md file. Once you decide that the package will meet your needs, check out the test statistics below the file listing. Figure 2.4 shows an example of such stats.


Figure 2.4 Test statistics for a relatively mature Julia package. Although the package seems to still need some work, it passes all the functionality tests for both the latest stable version (bottom left) and the nightly release (bottom right).

If the package passes most of the tests for the version of Julia that you plan to be using it on, then it should work well for you. Even if it’s not quite ready yet, it doesn’t hurt to keep that package in mind for the future, since Julia’s packages evolve rapidly and it probably won’t be long before that package meets expectations.

Installing packages

Whether you are using Juno, IJulia, Epicenter, a text editor, or even the REPL, at one point or another you will come to the realization that the base package that Julia has is limited (though much more powerful than Python’s base package–at least for data analysis work). To install a package in Julia, you need to execute the following code:

Pkg.add("mypackage")

where mypackage is the name of the package you want to install. This has to be in double quotes, as Julia expects the name as a string. Depending on the size of the package, it can take some time to download and install on your computer, at the location C:\Users\username\.julia\v0.4 for Windows, and ~/.julia/v0.4 for UNIX-based systems.

Packages that are not official (mainly because they are still under development) need to be added manually. Fortunately, the Pkg.clone() command makes the whole process easy. To install such a package you just need to run this command, passing the package’s GitHub URL as an argument:

Pkg.clone("git://github.com/AuthorName/SomeCoolPackage.jl.git")

Some of these packages may not work properly (especially if they have not been updated for a while), or may even not work at all. Still, it is useful to be familiar with them, in case you want to explore the bleeding edge of Julia tech!

Although not required, it is good practice to update the package once you install it, since there are bound to be newer versions of it. You can do that by executing the snippet:

Pkg.update()

It may take some time for this command to complete, particularly the first time it is run. If you are installing many packages, run this command only after you have installed all packages, as there is no way to update just a single package. Also, it is a good idea to run the Pkg.update() command periodically, to ensure that you always have the latest version of your installed packages. While running this command on the REPL, any issues will appear in red, while everything else should be in blue.

Using packages

Once the package is installed, you can make use of the functions it contains by executing the following in every Julia session where you need to use the package’s functions:

using mypackage

After a package is installed, Julia recognizes its name as an identifier (the name of a module) rather than a string, so you won’t need to put it in double quotes to load it into memory.

Hacking packages

Once you have become confident in your Julia programming skills, you can try your hand at improving an existing package without needing to know C. To do this, you can change the package’s source code and then load it into memory again (using the aforementioned command). For example, the source code of a package called CoolPackage should be all the .jl files in the directory C:\Users\username\.julia\v0.4\CoolPackage\src in the case of Windows, and ~/.julia/v0.4/CoolPackage/src for UNIX-based systems.

IJulia Basics

Handling files

Creating a notebook

Creating a new file in IJulia (referred to as an IJulia notebook) is a fairly straightforward process. An IJulia notebook is something that is usable only in IJulia, either natively or on the cloud. If you wish to create a script file to be shared with people who use a different IDE, you’ll need to export your IJulia notebook as a .jl file (something we’ll cover shortly).

Once you have opened the application, just click the “New” button at the top-right and then select the “Julia” option toward the bottom of the menu that appears. If you are already processing a notebook, you can create a new one using the “File” menu. Figure 2.5 shows both of these options.


Figure 2.5 Creating a new notebook in IJulia: whether on the main screen (top), or while processing another notebook (bottom).

Saving a notebook

You can save a Julia notebook in IJulia by clicking on the disk button to the top-left, or you can select the fifth command in the “File” menu (see Figure 2.6).


Figure 2.6 Saving a notebook in IJulia: through the “File” menu (top) and through the disk button (bottom).

When you save a notebook, IJulia takes all your text and code and puts it in a text file having the .ipynb extension. This kind of file is readable by the Jupyter program (even if IJulia is not installed on the computer). We recommend you keep all your Julia scripts in the same folder for easier access and referencing.

Jupyter will periodically auto-save your notebook; even if you close the notebook, nothing will be lost as long as the IDE is running. An IJulia notebook contains a variety of information–not just code. So when you save such a file (with the extension .ipynb), you are saving all of the markup you have developed (e.g. the section headers and HTML text), the Julia code, and all the results this code has produced by the time you save the notebook (including both text and graphics).

Renaming a notebook

Renaming a Julia notebook in IJulia can be done in two ways (beyond renaming the actual file in the OS shell). The simplest method is to click on its name (“Untitled” by default) at the top of the screen, right next to the Jupyter logo, and type the new name in the text box that appears. You can do the same task by selecting the fourth option on the “File” menu. See Figure 2.7.


Figure 2.7 Renaming a notebook in IJulia.

Loading a notebook

There are several ways to load a Julia notebook in IJulia. From the main screen, just click on the notebook file (or any text file for that matter). Alternatively, if you are already processing a file, you can load another one using the “File” menu (this opens another browser tab showing the Jupyter main screen). See Figure 2.8.


Figure 2.8 Loading a notebook in IJulia.

Exporting a notebook

A notebook can be exported in several ways, depending on what you plan to do with it afterwards. Here are your options:

  • If you wish to move the actual notebook file onto another machine and process it using IJulia there, you can export it as a notebook (.ipynb file).
  • If you wish to use the code in the notebook on the REPL, commenting out all its other content (headings, explanatory text, etc.), exporting it as a .jl file would be the best option.
  • If you prefer to share the notebook with other people without needing to run code (e.g. via a website, or as part of a demonstration), you can export it either as an .html file or a LaTeX-based digital printout (.pdf file).

All of these exporting options, as well as other ones, are available to you via the “File” menu, as shown in Figure 2.9.


Figure 2.9 Exporting a notebook in IJulia.

Organizing code in .jl files

Since Julia takes a functional approach to programming, the scripts you find and create will commonly take the form of functions. Functions can coexist in the same .jl file without interfering with each other. Related functions should exist within the same file.

For instance, if you have a function called fun(x::Int64) and another one called fun(x::Array{Int64, 1}), you may want to put these together since they extend each other through multiple dispatch. Also, you may want to store the function MyFun(x::Int64, y::Int64), which relies on fun(), in that same .jl file. We suggest that you put any testing code in a wrapper function, such as main(), to ensure that it won’t be run unless you want to run it.
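To make this layout concrete, here is a minimal sketch of what such a .jl file might contain. The function bodies are placeholders of our own invention (the text only specifies the names and argument types), so treat them as illustrative assumptions rather than code used later in the book.

```julia
# Hypothetical contents of a single .jl file; the bodies are made up for illustration.

# Method for a single integer
fun(x::Int64) = x^2

# Method for a vector of integers; multiple dispatch picks it automatically
fun(x::Array{Int64, 1}) = [fun(v) for v in x]

# A related function that relies on fun(), so it lives in the same file
MyFun(x::Int64, y::Int64) = fun(x) + fun(y)

# Testing code sits in a wrapper, so it runs only when main() is called
function main()
    println(fun(3))
    println(fun([1, 2, 3]))
    println(MyFun(3, 4))
end
```

Loading this file defines the three functions but prints nothing; the checks run only when you call main() yourself.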

None of this is crucial if you are using IJulia, but if you want to share your code with other Julians, it is prudent to know how to do so. Because all .jl files can be run from an IJulia notebook (using the include() command) it is a good idea to have commonly-used programs saved as .jl files–especially if you plan to make changes to these programs at any point.

Referencing code

Implementing and executing simple algorithms is straightforward. However, at some point you’ll want to build something more elaborate that makes use of other functions that may be already stored in other .jl files. Naturally, you could always copy and paste the corresponding code into your current .jl file, but that would create unnecessary redundancy. Instead, call the code you need by referencing the corresponding .jl file. This is possible through the following command:

include("my other script.jl")

If the referenced .jl file is in another directory, be sure to include the full path as well. Remember that on Windows the folder separator inside a Julia string is not the single backslash you may be used to, but either the double backslash \\ or the forward slash /. Julia views the double backslash as a single character, even though you type two. Type length("\\") to verify.

Once you run the include() command, everything in that file will be executed:

include("D:\\Julia scripts\\my other script.jl")

It is important that you don’t have any unnecessary code that is not within a function in the referenced file as this may create confusion when running your current program.
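The referencing workflow can be tried out without touching your own files. The sketch below is only an illustration: it writes a tiny script to a temporary folder and then includes it. The file contents and the function name double() are hypothetical, and joinpath() is used so that the correct folder separator is chosen for whatever OS you are on.

```julia
# Create a throwaway script in a temporary folder (path and contents are illustrative)
dir = mktempdir()
path = joinpath(dir, "my other script.jl")   # joinpath picks the right separator

open(path, "w") do io
    write(io, "double(x) = 2x\n")            # the referenced file holds only a function
end

# Everything in the referenced file is executed, which here defines double()
include(path)

println(double(21))
```

Since the referenced file contains nothing but a function definition, including it has no side effects beyond making double() available.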

Working directory

At some point in your Julia scripting you may want to identify the location where Julia thinks you are working, and perhaps change that location. All this is done using two simple commands: pwd() and cd().

The pwd() command is short for “print working directory” and does just that:

In[1]: pwd()

Out[1]: "c:\\users\\Zacharias\\Documents\\Julia"

If you want to change the directory where you are, you can do this using cd(), just like in the command prompt:

In[2]: cd("d:\\")

In[3]: pwd()

Out[3]: "d:\\"

The default folder that Julia works with is listed in the juliarc.jl file, which is in the folder where Julia is installed, and then within this folder the path resources\app\julia\etc\julia. If you plan to make changes to that file, we recommend that you create a backup copy first, as a faulty edit may seriously disrupt the function of Julia.
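If you only need to work in another folder for a moment, a possible pattern is to pass a block of code to cd(), which changes the directory, runs the block, and then switches back. The sketch below uses a temporary folder created with mktempdir() purely as a stand-in for a real data folder.

```julia
original = pwd()        # remember where we started
tmp = mktempdir()       # stand-in for a real data folder

# cd() with a do-block changes directory only for the duration of the block
cd(tmp) do
    println("Working in: ", pwd())
end

# Back in the original working directory
println("Back in: ", pwd())
```

This avoids having to remember to call cd() a second time to return to where you were.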

Datasets We Will Use

No lab is complete without some data to work with. Normally, this would come from a combination of data streams. For the purpose of this book, we’ll assume that you have already queried your data sources and you have everything in a series of data files, primarily in the .csv format (comma separated values). For data that is not so formatted, we will illustrate how you can parse it, obtain its more information-rich elements, and then store it in a structured format like .csv or any other delimited file (see Chapter 5 for details). We will also cover a case of semi-structured data, in the form of raw .txt files.

Dataset descriptions

For most of the case studies we will be utilizing two .csv files: magic04.csv and OnlineNewsPopularity.csv (obtained from the UCI machine learning repository, http://archive.ics.uci.edu/ml). We will also make use of the Spam Assassin dataset, which more closely parallels a real-world problem than any of the benchmark datasets you can find in a repository.

Magic dataset

The Magic (aka Magic04) dataset refers to the data collected by the Magic telescope using the imaging technique. It consists of approximately 19,000 data points and 10 features. The attribute we are trying to predict (located in the last column in the dataset) represents the kind of radiation each data point corresponds to: either gamma or hadron (represented as g and h, respectively). As such, this is a classification problem. You can take a look at a couple of lines of the dataset below:

22.0913,10.8949,2.2945,0.5381,0.2919,15.2776,18.2296,7.3975,21.068,123.281,g

100.2775,21.8784,3.11,0.312,0.1446,-48.1834,57.6547,-9.6341,20.7848,346.433,h

OnlineNewsPopularity dataset

As its name suggests, the OnlineNewsPopularity dataset comprises data from a variety of news websites, including their popularity measured in number of shares the articles received (target variable). The dataset consists of about 40,000 data points and 59 features (not counting the URL attribute, which is more like an identifier for each data point). As the attribute we are trying to predict in this case is a continuous one, this is a classical regression problem. You can view a sample of this dataset below:

http://mashable.com/2013/01/07/beewi-smart-toys/, 731.0, 10.0, 370.0, 0.559888577828, 0.999999995495, 0.698198195053, 2.0, 2.0, 0.0, 0.0, 4.35945945946, 9.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8500.0, 8500.0, 8500.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0222452755449, 0.306717575824, 0.0222312775078, 0.0222242903103, 0.626581580813, 0.437408648699, 0.0711841921519, 0.0297297297297, 0.027027027027, 0.52380952381, 0.47619047619, 0.350609996065, 0.136363636364, 0.6, -0.195, -0.4, -0.1, 0.642857142857, 0.214285714286, 0.142857142857, 0.214285714286, 855

http://mashable.com/2013/01/07/bodymedia-armbandgets-update/, 731.0, 8.0, 960.0, 0.418162618355, 0.999999998339, 0.54983388613, 21.0, 20.0, 20.0, 0.0, 4.65416666667, 10.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 545.0, 16000.0, 3151.15789474, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0200816655822, 0.114705387413, 0.0200243688545, 0.0200153281713, 0.825173249979, 0.514480300844, 0.268302724212, 0.0802083333333, 0.0166666666667, 0.827956989247, 0.172043010753, 0.402038567493, 0.1, 1.0, -0.224479166667, -0.5, -0.05, 0.0, 0.0, 0.5, 0.0, 556

Spam Assassin dataset

The final dataset in our collection is composed of .txt files, representing a collection of 3,298 emails, 501 of which are spam. The rest are normal emails (referred to as ham) and are divided into two categories: easy ham and hard ham, depending on the difficulty of detecting them accurately. As you would expect, this is a classical classification problem.

In this case, we will need to do some work since there are no features per se. We will have to create them from scratch, using the text data of the emails. Below is a sample of one such email. Our focus when dealing with this email will be its subject, which is the line starting with “Subject:”.

Return-Path: <Online#[email protected]>

Received: from acmta4.cnet.com (abv-sfo1-acmta4.cnet.com [206.16.1.163])

by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g69MseT08837

for <[email protected]>; Tue, 9 Jul 2002 23:54:40 +0100

Received: from abv-sfo1-ac-agent2 (206.16.0.224) by acmta4.cnet.com (PowerMTA(TM) v1.5); Tue, 9 Jul 2002 15:49:15 -0700 (envelope-from <Online#[email protected]>)

Message-ID: <1100198.1026255272511.JavaMail.root@abv-sfo1-ac-agent2>

Date: Tue, 9 Jul 2002 15:54:30 -0700 (PDT)

From: “CNET News.com Daily Dispatch” <Online#[email protected]>

To: [email protected]

Subject: CNET NEWS.COM: Cable companies cracking down on Wi-Fi

Mime-Version: 1.0

Content-Type: text/html; charset=ISO-8859-1

Content-Transfer-Encoding: 7bit

X-Mailer: Accucast (http://www.accucast.com)

X-Mailer-Version: 2.8.4-2

Downloading datasets

The above datasets were selected because they are complex enough to be interesting, but not necessarily daunting. They are also universal enough to relate to many different data science applications. You can download them in either one of these ways:

  • Directly from the UCI repository
  • From the dedicated Dropbox folder the author has created: http://bit.ly/29mtzIY.

Once downloaded and unpacked, you can take a peek at the datasets in spreadsheet software (such as MS Excel, Numbers, or Gnumeric, depending on the OS you are using), or even a text editor (the data is stored in .csv files, so they can be opened as normal text files even by your system’s default editor). You can learn more about these datasets by reading the corresponding .names files.

Loading datasets

CSV files

To make access to the aforementioned datasets easier, especially if you are new to Julia, it is recommended that you move the extracted .csv files to the working folder of the language. To reveal Julia’s working folder, just run the pwd() command we saw previously:

In[1]: pwd() #A

#A println(pwd()) is also an option, particularly for other IDEs and the REPL.

You can load a .csv file into Julia’s workspace (the memory of your computer that is handled by Julia) using the following simple command:

In[2]: data = readcsv("magic04.csv");

The semicolon at the end is optional, but useful if you don’t want the console window to fill up with data from the loaded file. The data from the file is now stored in a two-dimensional array (matrix) called “data.” You can also load the data into other formats (such as our all-time favorite: the data frame), but we’ll get to that later on. Also, keep in mind that you can incorporate the file path in the filename. If your .csv file was stored in D:\data, you would run the following code instead:

In[3]: data = readcsv("D:\\data\\magic04.csv");
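The readcsv() command belongs to the Julia version used in this book (v0.4). If you happen to be on Julia 1.0 or later, where readcsv() is no longer available, a roughly equivalent sketch uses readdlm() from the DelimitedFiles standard library. The example below is self-contained: it writes the two sample rows of the Magic dataset to a temporary file (the file name magic_sample.csv is made up) and then splits the resulting matrix into features and labels.

```julia
using DelimitedFiles   # standard library on Julia 1.x; assumption: you are not on v0.4

# Write two sample rows to a temporary .csv file (illustrative file name)
path = joinpath(mktempdir(), "magic_sample.csv")
open(path, "w") do io
    println(io, "22.0913,10.8949,2.2945,0.5381,0.2919,15.2776,18.2296,7.3975,21.068,123.281,g")
    println(io, "100.2775,21.8784,3.11,0.312,0.1446,-48.1834,57.6547,-9.6341,20.7848,346.433,h")
end

data = readdlm(path, ',')            # mixed matrix: numbers plus the class column

X = Float64.(data[:, 1:end-1])       # the 10 numeric features
labels = String.(data[:, end])       # the class labels, g or h

println(size(data))
```

Separating the numeric columns from the label column early on keeps the feature matrix purely numeric, which is what most distance calculations expect.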

Text files

You can load text data (say, one of the spam emails) in the Julia workspace just as easily, using the following commands:

In[1]: f = open(filename, "r")

lines = readlines(f);

close(f)

The above snippet will open the text file whose path and name are passed as filename, making it available as an IO object f. The parameter "r" tells Julia that when you create this object, you only want to read data from it.

To make use of the IO object f you need to apply a function such as readlines() that can receive such objects as an input. readlines() can parse the whole file, split its contents into one-line strings, and return an array of these strings as its output.

Here’s a variant of the above method, where each line of the file is parsed in a sequential manner using a for-loop:

f = open(filename, "r")

for line in eachline(f)

  [some code]

end

close(f)

If you wish to have the whole contents of the text file stored in a single string, you can do that using the following snippet:

f = open(filename, "r")

text = readall(f);

close(f)

Although this approach may seem more elegant, it may be suboptimal in practice. The reason is twofold. First, the text files you encounter in the data science world are often large and storing them in a variable (or an array) takes up a great deal of memory (which usually comes at a cost). Second, we generally won’t need the whole file, so we can process it line by line without having to store more than one line in memory.
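As an illustration of the line-by-line approach, the sketch below scans an email file and stops as soon as it finds the subject line, so the rest of the file is never read. The helper name find_subject() and the temporary file are our own assumptions for the sake of a self-contained example; the do-block form of open() closes the file automatically, even if an error occurs.

```julia
# Write a few header lines to a temporary file (illustrative stand-in for a real email)
path = joinpath(mktempdir(), "sample_email.txt")
open(path, "w") do io
    println(io, "Date: Tue, 9 Jul 2002 15:54:30 -0700 (PDT)")
    println(io, "Subject: CNET NEWS.COM: Cable companies cracking down on Wi-Fi")
    println(io, "Mime-Version: 1.0")
end

# Hypothetical helper: return the subject of an email, or "" if there is none
function find_subject(filename)
    open(filename, "r") do f
        for line in eachline(f)
            if startswith(line, "Subject:")
                return strip(line[9:end])   # drop the 8-character "Subject:" prefix
            end
        end
        return ""
    end
end

println(find_subject(path))
```

Only one line is held in memory at a time, which is the main advantage over reading the whole file into a single string.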

Coding and Testing a Simple Machine Learning Algorithm in Julia

To familiarize yourself with your new lab, let’s start with a relatively simple example. If you are completely new to Julia, don’t worry if you don’t understand everything here; this is just to help you get acquainted with the look and feel of the language. We encourage you to revisit this example once you feel comfortable with Julia code.

In the meantime, you can either type the code yourself, or use the kNN.jl file, provided along with the rest of the Julia code that accompanies this book (just open the file on IJulia or a text editor, copy all its contents, and then paste it on a new IJulia notebook).

We selected one of the simplest algorithms that is still useful. We want you to learn to code something that you may actually use in your work as a data scientist, such as an algorithm that can help you with classification tasks. One such algorithm is the one applicable to the magic dataset, which involves classifying the observed radiation of the telescope as gamma or hadron (denoted as g and h respectively, in the last attribute of the dataset).

This is possible using the information in the remaining attributes, which we will use as-is for the purposes of this example. The end result will be a prediction of some unknown observations into one of these two classes, and the validation of this prediction based on the actual classes of these observations.

The focus of this section is not to make you an expert in classification algorithms (which we’ll discuss in detail at a later chapter), so it’s alright if you don’t understand everything. For the time being, you can focus on the Julia aspect of it and try to understand how the programs involved work.

Algorithm description

The algorithm we’ll be examining is called “k Nearest Neighbor” (or kNN) and is a basic classification algorithm from the early days of machine learning. Despite its antiquity, it is still used in image analysis and recommender systems, among other fields. kNN is a distance-based classifier. Because it doesn’t have a training phase, it can be put to work right away, which makes it a solid choice when speed of setup is of the essence.

Its philosophy is straightforward: in order to classify a given (unknown) data point X, find the k data points that are most similar to it, and apply the majority vote. Similarity is usually based on the inverse of the distances involved, so the data points with the smallest distances to X are selected. The pseudo-code of the algorithm is shown in Listing 2.1.

Inputs: training data input values (X), training data labels (x), testing data input values (Y), number of neighbors (k)

Output: predicted labels for testing data (y)

for each element i in array Y

  calculate distances of Yi to all training points Xj     #A

  find indexes of k smallest distances

  break down these data points based on their classes

  find the class with the majority of these k data points  #B

  assign this class as the label of Yi

end for

#A  distance function

#B  classify function

Listing 2.1 Pseudo-code for the kNN algorithm.

This is shown in more detail in Figure 2.10.

Image012.jpg

Figure 2.10 A flowchart of the kNN algorithm, a basic machine learning algorithm for classification.
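To make the steps of Listing 2.1 concrete, here is a minimal sketch that classifies a single point against a toy training set. It assumes a current Julia release (1.x), whose syntax differs slightly from the listings in this chapter; the data, the function name classify_point, and the choice of k are made up for illustration.

```julia
# Classify one query point by majority vote among its k nearest neighbors.
# Hypothetical helper for illustration; X holds one training point per row.
function classify_point(X::AbstractMatrix, labels::AbstractVector,
                        query::AbstractVector, k::Integer)
    # Euclidean distance from the query to every training point
    dists = [sqrt(sum(abs2, X[j, :] .- query)) for j in 1:size(X, 1)]
    # Indexes of the k smallest distances
    nearest = sortperm(dists)[1:k]
    # Count the votes per class and keep the most frequent class
    neighbor_labels = labels[nearest]
    classes = unique(neighbor_labels)
    votes = [count(l -> l == c, neighbor_labels) for c in classes]
    return classes[argmax(votes)]
end

# Toy data: two points near (1, 1) labeled "g", two near (5, 5) labeled "h"
X_toy = [1.0 1.0; 1.2 0.8; 5.0 5.0; 5.2 4.9]
labels_toy = ["g", "g", "h", "h"]

prediction = classify_point(X_toy, labels_toy, [1.1, 1.0], 3)
println(prediction)
```

With k = 3, two of the three nearest neighbors of the query point are labeled g, so the sketch predicts g.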

Algorithm implementation

To implement kNN in Julia we need two things: a distance calculation function and a classification function. These can be wrapped in the main function, which we’ll call apply_kNN(). For simplicity, in this implementation of kNN we will use the Euclidean distance (though in practice you can use any distance metric you prefer). It is generally best to start with the auxiliary functions (in this case, distance() and classify()). If you run the whole code as a single script, you need to have all the functions defined before you can use them, as we saw in the previous chapter.

First, we need to think about what each one of these functions will have as its input(s) and what it will yield as an output. In the case of distance(), it is rather obvious: it needs to take in two data points as vectors (one-dimensional arrays) and yield a single number (float). As for classify(), it needs to take the distances of all the data points as a vector, along with their labels as another vector, and the number of neighbors to examine (a single number), ultimately yielding a single element (which could be either a number or a string, depending on what populates the labels array of our dataset).

In Julia, although it is important to define all these elements, being too specific may cause frustrating errors. For example, if we need two numbers x and y as inputs, and it doesn’t matter what kind of numbers these are, we can define x and y as just numbers instead of, say, floats or integers. This will also allow our functions to be more versatile. However, in cases where some types of numbers wouldn’t make sense as inputs, we must be more specific.

In the case of classify(), for example, we don’t want k to be a number with a decimal point, a fraction, or a complex number. It needs to be an integer. So in the corresponding functions (in this case our wrapper function apply_kNN() as well as the classify() function), it is better off being defined as such. As for the dataset to be used in apply_kNN(), this will be in the form of a matrix and will need to be defined accordingly.
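The effect of loose versus strict argument types can be seen in a small sketch. It assumes Julia 1.x, and the functions midpoint and first_k are hypothetical helpers that do not appear elsewhere in this chapter.

```julia
# Any kind of number is acceptable here, so the arguments are typed as Number.
midpoint(x::Number, y::Number) = (x + y) / 2

# k only makes sense as a whole number, so it is restricted to Integer.
first_k(v::AbstractVector, k::Integer) = v[1:k]

println(midpoint(1, 2.5))          # an Int and a Float64 are both accepted
println(first_k([10, 20, 30], 2))

# Passing a Float64 for k matches no method, so Julia raises a MethodError.
rejected = try
    first_k([10, 20, 30], 2.0)
    false
catch err
    err isa MethodError
end
println(rejected)
```

The looser Number annotation keeps midpoint usable with integers, floats, and even complex numbers, while the Integer annotation on k catches a nonsensical argument at the call site instead of deep inside the function.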

Let’s start with the auxiliary functions of the algorithm, so that you get familiar with how an algorithm is implemented in Julia. We’ll first define how we want to calculate the distance between two data points, as in the code in Listing 2.2 (there are better ways to implement this, but this one is easier to understand).

We could include this code in the main method since it’s fairly small, but what if we wanted to try out different distance metrics? It is much easier to make this kind of change when the code is broken down into several auxiliary functions which are self-sufficient and easier to understand and edit.

In[1]: function distance{T<:Number}(x::Array{T,1}, y::Array{T,1})
  dist = 0                    #A
  for i in 1:length(x)        #B
    dist += (x[i] - y[i])^2
  end
  dist = sqrt(dist)
  return dist
end

#A initialize distance variable

#B repeat for all dimensions of x and y

Listing 2.2 An auxiliary function for the implementation of the kNN algorithm. This one is responsible for calculating the distance between two points, x and y, represented as vectors.
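As one example of the more compact alternatives mentioned above, the same distance can be written in a single expression. This sketch assumes Julia 1.x; the name euclidean is ours, and unlike Listing 2.2 it does not require x and y to share an element type.

```julia
# Euclidean distance as the square root of the sum of squared differences.
# abs2 squares each element of the broadcast difference x .- y.
euclidean(x::AbstractVector{<:Number}, y::AbstractVector{<:Number}) =
    sqrt(sum(abs2, x .- y))

println(euclidean([0, 0], [3, 4]))       # 3-4-5 triangle, prints 5.0
println(euclidean([1.0, 2.0], [1, 2]))   # mixed element types, prints 0.0
```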

Now let’s get into the meat of the algorithm: the classification process for a single data point, based on the distances we’ve calculated. The distance function has been applied several times to create a one-dimensional array called distances, which we’ll use as one of the inputs in the classify() function. The code for this function is available in Listing 2.3.

In[2]: function classify{T<:Any}(distances::Array{Float64,1}, labels::Array{T,1}, k::Int64)
  class = unique(labels)                #A
  nc = length(class)                    #B
  indexes = Array(Int, k)               #C
  M = typemax(typeof(distances[1]))     #D
  class_count = zeros(Int, nc)
  for i in 1:k
    indexes[i] = indmin(distances)
    distances[indexes[i]] = M           #E
  end
  klabels = labels[indexes]
  for i in 1:nc
    for j in 1:k
      if klabels[j] == class[i]
        class_count[i] += 1
      end
    end
  end
  index = indmax(class_count)
  return class[index]
end

#A find all the distinct classes

#B number of classes

#C initialize vector of indexes of the nearest neighbors

#D the largest possible number that this vector can have

#E make sure this element is not selected again

Listing 2.3 Another auxiliary function of the implementation of the kNN algorithm. This one performs classification of a point based on its distances from the known points of the dataset.
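Note that classify() overwrites entries of the distances array it receives in order to mark neighbors as already selected. If that side effect is unwanted, the k smallest distances can be located without modifying the input. A possible sketch, assuming Julia 1.x and a hypothetical helper name k_nearest:

```julia
# Return the indexes of the k smallest distances, closest first.
# partialsortperm only computes indexes, so `distances` is left untouched.
k_nearest(distances::AbstractVector{<:Real}, k::Integer) =
    collect(partialsortperm(distances, 1:k))

d = [0.9, 0.1, 0.5, 0.3]
println(k_nearest(d, 2))   # indexes of 0.1 and 0.3
println(d)                 # the original vector is intact
```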

Now it’s time to see how it all fits together by implementing the main function (often referred to as the wrapper function). This is the function we’ll call whenever we want to use the kNN algorithm (although we can always call the other functions on their own, which is particularly useful for debugging). So, let’s type the code in Listing 2.4 and conclude our project.

In[3]: function apply_kNN{T1<:Number, T2<:Any}(X::Array{T1,2}, x::Array{T2,1}, Y::Array{T1,2}, k::Int)
  N = size(X,1)                      #A
  n = size(Y,1)                      #B
  D = Array(Float64, N)              #C
  z = Array(typeof(x[1]), n)         #D
  for i in 1:n
    for j in 1:N
      D[j] = distance(vec(X[j,:]), vec(Y[i,:]))
    end
    z[i] = classify(D, x, k)
  end
  return z
end

#A number of known data points

#B number of data points to classify

#C initialize distance vector

#D initialize labels vector (output)

Listing 2.4 The main function (wrapper) of the implementation of the kNN algorithm.

Algorithm testing

In order to use this function, we are going to need some data. So, let’s load the magic dataset we saw in the previous section:

In[4]: data = readcsv("magic04.csv")

This puts all the data into a single matrix. To make it easier to work with, we can organize it first into inputs (features) and outputs (labels). You can do this with the following commands:

In[5]: I = map(Float64, data[:, 1:(end-1)])  #A

In[6]: O = data[:, end]                      #B

#A take all the columns of the data matrix, apart from the last one, and convert everything into a Float64. Result = 2-dim Array (matrix) of Float64 numbers with 10 columns

#B take only the last column of the data matrix. Result = 1-dim Array

Now, if you were to use this data to test a classifier (in this case the kNN algorithm), both of these arrays would need to be divided into training and testing sets. This is a technique that is worth looking into in detail, so we’ll describe it in a later chapter. For now, we can do a basic random sampling.

First we’ll get a random set of indexes for the training set (say, half of the total number of data points). Then we’ll select the corresponding data points and their labels from I and O respectively, and store them in two arrays. Afterwards, we’ll put the remaining data points and their labels in two other arrays. You can do all that with the code (which we encourage you to write in IJulia or some other Julia IDE) in Listing 2.5.

In[7]: N = length(O)           #A
In[8]: n = round(Int64, N/2)   #B
In[9]: R = randperm(N)         #C
In[10]: indX = R[1:n]          #D
In[11]: X = I[indX,:]          #E
In[12]: x = O[indX]            #F
In[13]: indY = R[(n+1):end]    #G
In[14]: Y = I[indY,:]          #E
In[15]: y = O[indY]            #F

#A number of data points in the whole dataset (which is equivalent to the length of array O)

#B half of the above number

#C a random permutation of all the indexes (essential for sampling)

#D get some random indexes for the training set

#E input values for training and testing set respectively

#F target values for training and testing set respectively

#G some random indexes for the testing set

Listing 2.5 Code for testing the implementation of the kNN algorithm, using the preloaded Magic dataset.
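The same split can be wrapped in a reusable function. The sketch below assumes Julia 1.x, where randperm lives in the Random standard library; the function name split_half and the seed argument are our additions, the latter so that the split can be reproduced.

```julia
using Random

# Randomly split inputs (one row per data point) and labels into two halves.
function split_half(inputs::AbstractMatrix, labels::AbstractVector; seed::Integer = 1)
    N = length(labels)
    n = round(Int, N / 2)
    R = randperm(MersenneTwister(seed), N)   # seeded random permutation of 1:N
    train, test = R[1:n], R[(n + 1):end]
    return inputs[train, :], labels[train], inputs[test, :], labels[test]
end

inputs_toy = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0; 9.0 10.0; 11.0 12.0]
labels_toy = ["g", "h", "g", "h", "g", "h"]
X, x, Y, y = split_half(inputs_toy, labels_toy)
println(size(X), " ", size(Y))
```

Because every index of the permutation lands in exactly one of the two halves, no data point is used for both training and testing.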

Now you are ready to see the kNN classifier you have built in action. Just run the following commands:

In[16]: z = apply_kNN(X, x, Y, 5)                #A
In[17]: println( sum(y .== z) / length(y) )      #B
        println(z[1:5])                          #C

#A predicted labels for the testing set (output of the classifier)

#B accuracy rate of classification

#C small sample of the classifier’s output

Upon running this code, you should see something like the following in your IJulia notebook (results may vary every time due to the random sampling involved):

Out[17]: 0.805888538380652

  Any["g","g","g","g","g"]
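The accuracy calculation is a good candidate for a small function of its own. A minimal sketch, assuming Julia 1.x and a hypothetical function name accuracy:

```julia
# Fraction of predictions that match the actual labels.
function accuracy(actual::AbstractVector, predicted::AbstractVector)
    length(actual) == length(predicted) ||
        throw(DimensionMismatch("vectors differ in length"))
    return count(actual .== predicted) / length(actual)
end

println(accuracy(["g", "h", "g", "h"], ["g", "g", "g", "h"]))   # 3 of 4 correct, prints 0.75
```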

Congratulations! You have just run your first data science experiment using Julia! Be sure to save this code to an IJulia notebook (or some other format such as a .jl file) as you may want to go back to it later.

Saving Your Workspace into a Data File

Now that you have some results, you may want to store them somewhere for future reference or even further processing. Unlike R, Julia does not store your workspace once you exit; if you want certain variables to be accessible next time you spin up the Julia kernel, you’d better save them somewhere yourself. Fortunately you can do that in two effective ways:

  • Save the data in a delimited file (e.g. a .csv) for easy access of the data from other programs (e.g. a spreadsheet)
  • Use the native Julia data file format (.jld), by employing the corresponding package.

Each one of these methods has its own advantages, which we will examine in more detail in the subsections that follow.

Saving data into delimited files

This is probably the simplest and most widely used option. It doesn’t require any packages, it creates files that are easily accessible by other programs, and it uses formats that most people are already familiar with. You can save your data (say, an array A) in a delimited file with semi-colons (;) as the separators among the data fields, by employing the writedlm() function as seen here:

writedlm("/data/mydata.dat", A, ";")

As you can see, the first argument of the writedlm() function is the file name (including the path), the second is the array to be saved, and the third is the delimiter. The delimiter is usually a character, although it can be any string (e.g. : ). The default value for this parameter is a tab (denoted \t), yielding .tsv files. Nevertheless, the extension of the delimited file you create is up to you; Julia will not assume anything, even if it seems obvious to you.

One special case of delimited files, which is particularly popular when it comes to numeric data, is .csv files. You can save your data in a .csv by selecting the comma as a delimiter, but there is a simpler way: the writecsv() function. You can use this as follows:

writecsv("/data/mydata.dat", A)

Delimited files are not the most resource-efficient method of saving a dataset, yet it is often necessary to utilize them. When the volume of your data increases, or when you wish to preserve the metadata of the variables involved, it is best to make use of the second alternative to saving data files. As the examples in this book involve a quite manageable data size, we will be using delimited files for all our data exports.
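For reference, a round trip through a semicolon-delimited file might look as follows. This sketch assumes Julia 1.x, where writedlm and readdlm are provided by the DelimitedFiles standard library (writecsv is no longer available there), and it writes to a temporary folder rather than /data.

```julia
using DelimitedFiles

A = [1.5 2.0; 3.25 4.0]
path = joinpath(mktempdir(), "mydata.dat")   # temporary location for the demo

writedlm(path, A, ';')              # save with semicolons as separators
B = readdlm(path, ';', Float64)     # read it back as a Float64 matrix

println(read(path, String))         # raw file contents
println(B == A)
```

Reading the file back and comparing it with the original array is a quick way to confirm that the chosen delimiter did not clash with the data.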

Saving data into native Julia format

Often it is easier to use the native format of a language for storing the data used (e.g. SFrames, SArrays, and SGraphs in Graphlab; .RData in R; or .mat files in Matlab). That’s not to say that other languages cannot access these formats–they are just each more practical for the language they were created for. The Julia Data Format (.jld files) is one such native format developed by Simon Kornblith and Tim Holy to address this need, using the more generic HDF5 data format. As this is no small feat, the Julia data format took a while to enter the scene.

To employ this data format you first need to add the JLD and HDF5 packages, and then use them to create a .jld file containing your data (say an array A composed of floats, and an integer b). You can do all this using the following code:

Pkg.add("HDF5")
Pkg.add("JLD")
using JLD
f = jldopen("mydata.jld", "w")
@write f A
@write f b
close(f)

The @ character marks a macro in Julia. A macro receives the code that follows it and rewrites it before it is run, which is why @write can take the variable A itself and store it under the name “A”. We recommend you look into macros only once you feel confident enough with the basics: perhaps after solving a number of problems using Julia, and getting well-acquainted with all the functions described in this book.

Alternatively, instead of the last four lines, you can write the following equivalent code:

save("mydata.jld", "var_A", A, "var_b", b)

where the arguments in quotes correspond to the filename and the stored names of the variables (here we have made them different than the original ones to avoid any confusion). If you wish to save all the variables in the workspace, you just need to type the following:

save("mydata.jld")

To retrieve the data stored in a .jld file, type:

D = load("mydata.jld")

Loading a .jld file creates a dictionary where all its contents can be found. The variable names are stored as keys in that dictionary. If you wish to access only a particular variable in a .jld file (e.g. the variable called “var_b”), you simply specify it using the following piece of code:

b = load("mydata.jld", "var_b")

Another way to access the variables stored in a .jld file is to use the following:

f = jldopen("mydata.jld", "r")

dump(f, 20)

This will spit out the first 20 variables in the .jld file, the way they are stored in the corresponding dictionary (i.e. variable name: variable type and dimensionality). This approach is particularly useful if you are not sure which variable you want to access.

The JLD package is still new and the documentation is incomplete at the time of this writing. We encourage you to probe it in more depth on your own, as it is bound to evolve over time. It is a useful tool and can make your data storing and retrieving work much easier. You can read more in the corresponding documentation file in the GitHub location: http://bit.ly/29fVavH.

Saving data into text files

If your data is highly unstructured and none of the above options work, you can always save it in simple text files. Keep in mind, though, that retrieving the data afterwards may require some more work (something we will revisit in Chapter 6). To save data in a .txt file you just need to do the following:

f = open("/data/mydata.txt", "w")

write(f, SomeStringVariable)

write(f, AnotherStringVariable)

.

.

.

close(f)

In order for your data file to have some spacing between each pair of consecutive variables, you need to add new line characters at the end of each string (represented as \n in Julia). So, if you want to save the string data “Julia rocks!” to your (already open) data file, you need to make the following adjustment:

data = string(data, "\n")

Naturally, you aren’t limited to string variables, even though whatever you save into a text file will eventually be converted into a string type. So, if you have an array A, you can save it as follows:

A = [123, 34423.23, -322, 4553452352345234523452345345261709106832734]
f = open("/data/mydata.txt", "w")
for a in A
  write(f, string(a, "\n"))
end
close(f)

This way the array will be saved into the mydata.txt file in the data folder, with one element per line. This will make it easier to read afterwards in a text editor, as well as access it using a script from any language.
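In current Julia the same idea is often written with a do block, which closes the file automatically, and println, which appends the newline for you. A sketch under those assumptions (Julia 1.x, a temporary file instead of /data/mydata.txt, and a shorter array):

```julia
A = [123, 34423.23, -322]
path = joinpath(mktempdir(), "mydata.txt")   # temporary location for the demo

# Write one element per line; the file is closed when the block ends.
open(path, "w") do f
    for a in A
        println(f, a)
    end
end

# Read the lines back and convert them to numbers again.
restored = parse.(Float64, readlines(path))
println(restored == A)
```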

Help!

Regardless of your background, you’ll eventually run into a situation where you need some assistance, usually with a function. When this happens, before you rush off to Stackoverflow, you may want to check out the Julia documentation. To do that, just type:

help(somefunction)

Alternatively you can type the following:

? somefunction

Although the output of this function is not the easiest to understand, it is very useful. The more you use the function, the easier it will get. In this ever-changing environment that is Julia, it is essential to get acquainted with the Julia documentation. Embrace it as your best resource for learning various aspects of the language, particularly concerning the use of data types, operators, and functions.

If you have a more general question, you can search the Julia manual at http://bit.ly/29bWHU2 or the Julia wikibook at http://bit.ly/29cIges. The latter is usually preferable as it has more examples and it is much easier to read. For more subjective questions (like “How do I evaluate the performance of this code?”) your best bet would be to consult the experts. Stackoverflow is a good place to start (use the tag “julia-lang”), though the Julia users group at Google is also a viable option. Whatever the case, don’t fret! Just like every new technology, Julia has its quirks and may take a while to get used to. That’s why through the rest of the book we will try to keep the coding to a minimum, so that you can focus on the more interesting (and valuable) parts of the language.

Summary

  • Julia is easier to work with when you make use of an IDE, such as Juno, IJulia (Jupyter), or Epicenter.
  • You can easily access Julia on the cloud using JuliaBox. This allows you to create and store IJulia notebooks on a server, using a Google account.
  • Installing packages is essential for serious Julia programming. This can be done using the Pkg.add() command. To install package abc you just need to type: Pkg.add("abc").
  • You can load a .csv file in the Julia workspace using the readcsv() function. For example, the data file mydata.csv can be accessed as follows: readcsv("mydata.csv").
  • You can load data from a .txt file in various ways, the simplest one being: f = open(filename, "r"); lines = readlines(f); close(f).
  • k Nearest Neighbor (kNN) is a simple yet efficient machine learning algorithm that can be used to perform classification on any labeled dataset (i.e. a dataset that has a target variable which is discrete).
  • When implementing a non-trivial algorithm, it is useful to break it down into auxiliary functions. The auxiliary functions need to be loaded into memory before the main (wrapper) function is executed.
  • You can save data into files using one of the following methods:
      • For a delimited file: writedlm(filename, variable, delimiter). The delimiter can be any printable string, although it is usually a character. If it’s omitted, the tab character is used as a default.
      • For a Julia native data file: save("filename.jld", "name_for_1st_variable", var1, "name_for_2nd_variable", var2, …). You need to load into memory the JLD and HDF5 packages first.
      • For a simple text file: f = open(filename, "w"); write(f, StringVariable1); write(f, StringVariable2); …; close(f).
  • You can seek help for a function in Julia using the help() command: help(FunctionName).
  • You can find help for more complex matters in the Julia wikibook, the Julia documentation, Stackoverflow, and any Julia users group (online or physical).

Chapter Challenge

  1. Which IDE(s) would you use if you wanted to take advantage of the latest developments in the language?
  2. Which IDE would you use if you couldn’t have Julia installed on your machine due to lack of administrative privileges?
  3. What are the advantages of using an IDE (e.g. Juno or the online IDE at tutorialspoint.com) over REPL? What are the advantages of IJulia over an IDE and REPL?
  4. Why is it useful to have auxiliary functions in a program?
  5. What’s a wrapper function?
  6. What do the functions sqrt(), indmin(), and length() do? What data types are their arguments? (Hint: use the help() function for each one of these.)
  7. What range of values does the expression sum(y .== z) / length(y) take?
  8. Say that you have an array A that you wish to save in .csv format. How would you go about doing this?
  9. You have a variety of arrays and single value variables in your workspace and you have to shut down your machine because your OS wants to install another batch of updates (again!). How would you go about saving your most important variables while preserving their types so you can resume your work after the OS updates are completed?
  10. What’s the difference between the functions max() and maximum()? Which one would you use for finding the maximum value of two numbers? (Hint: make use of help() again, and try out a couple of examples to see their difference in practice.)
  11. How would you install the package called NMF, get it up to date, and load it into memory?
  12. Can you use the kNN algorithm to analyze text using the data as-is? Explain.
  13. The senior data scientist in your group says that in the dataset you are working on, it would be better to use Manhattan/city-block distance instead of Euclidean (you can learn more about this type of distance on this website: http://bit.ly/29J8D0Y). How would you go about doing that, without having to rewrite the whole kNN algorithm from scratch?

Note: You can view the answers to all of the above in Appendix F.
