CHAPTER 13: Reaching the Next Level

If what they say about learning is true, this chapter is where you’ll get to know what you know (and perhaps more importantly, what you don’t know). Its aim is to make you self-sufficient in Julia when it comes to data science applications.

That’s not to say that you’ll be a know-it-all after finishing this chapter! However, if you can work through everything we recommend in this chapter, you’ll have gained an impressive level of competency in the language. You won’t need to rely on this or any other book for your Julia-related endeavors in data analytics, and you’ll be empowered to use Julia more freely.

Although this chapter is accompanied by an IJulia notebook, we recommend that you don’t use it unless you need a little help. Instead, try first to create your own notebook and write all the necessary code from scratch.

This chapter covers the following topics:

  • The Julia community and how to learn about the latest developments in the language
  • Applying what you’ve learned about using Julia for data science through a hands-on project
  • How you can use your newly acquired skills to contribute to the Julia project, making Julia more relevant in the data science world.

Julia Community

Sites to interact with other Julians

Although the Julia community is relatively small, there are people out there who use Julia and are open to interacting with other users of this technology. Fortunately, they tend to be quite experienced in Julia, as well as in other computer languages, since it’s rare that someone starts programming with Julia as their first programming language.

Most Julians understand that the world has not yet adopted this language, so if they do any programming for their day job, they most likely use other, more established platforms. This is helpful because they’ll usually be able to answer your questions about whether functions in other languages have equivalents in Julia, or at least point you in the right direction. That’s not to say that there are no knowledgeable people in other programming communities. You can find experienced programmers everywhere, though the population density of such individuals tends to be higher in Julia groups.

So, if you want to interact with other Julians, this Google group is a good place to start: http://bit.ly/29CYPoj. If Spanish happens to be your language of choice, there is a version of this group in Spanish: http://bit.ly/29kHaRS.

What if you have a very specific question about a Julia script or one of the built-in functions, or you can’t find a way to deal with an error in your code? Well, that’s what StackOverflow is for, with the julia-lang tag being the go-to option for organizing questions related to this language.

There are also Meetup groups, which are a great way to meet other Julia users in person. You can find a list of all of them here: http://bit.ly/29xJOCM. Currently Julia is popular in the United States and a few other countries, so if you are privileged enough to live in a city where there is an active Julia meetup, do check it out.

Code repositories

As Julia is an open-source project, all of its code and the code of the majority of its packages are available on GitHub. You can find a list of all supported packages on this site: http://pkg.julialang.org, along with the links to the corresponding GitHub repositories. In Appendix C, we have a list of all the packages used in this book along with their GitHub links.

Videos

As videos have quickly become the most popular medium for sharing information, Julia’s creators have taken advantage of this and maintain a dedicated channel on YouTube: http://bit.ly/29D0hac. Beyond this channel, there are several other videos created by individual Julia users. If you happen to speak Portuguese, check out this great video by Alexandre Gomiero de Oliveira: http://bit.ly/28RMlbW. If you prefer something in English, this channel by Juan Klopper is a good place to start: http://bit.ly/28NS27K. Always be on the lookout for new Julia videos, since new ones pop up regularly.

News

If you find that YouTube is too slow in delivering the latest developments of this language, then try looking for a reliable blog to keep you up to date. You can read the official blog maintained by the creators of the language (http://julialang.org/blog), or this meta-blog that gathers Julia-related posts from various blogs: http://www.juliabloggers.com. The entries of this blog contain all kinds of articles, along with links to the original sources.

If you like small snippets of news, then Twitter may be the best medium for you. In this case, check out tweets with the hashtag #JuliaLang, and #JuliaLangsEs for the Spanish ones.

Practice What You’ve Learned

Now it’s time to apply what you have learned to a hands-on project. Before you start, we recommend you review the notebook from Chapter 5 one more time, to make sure that you understand the whole process and how Julia fits into it. Also, you may want to go back to all the questions or exercises of this book that originally puzzled you, and try to tackle them once more. Once you are ready, you can gather all your notes and start looking at your task below.

Although it would be great if you could apply as much of your new know-how as possible, you won’t need to use every single tool in your toolbox in order to complete this task. An example of a minimalist solution to this project is available in this chapter’s corresponding IJulia notebook, while a high-level approach to how you could tackle this project is in Chapter 5. Your solution is bound to be different from the suggested one and that’s fine; our solution is there just as an aid in case you get stuck. Also, you may find some of the auxiliary functions in that notebook handy, so feel free to use them for this and any other application.

In this project, your task is to analyze the Spam Assassin dataset that we saw briefly in the beginning of this book, ultimately creating a system that will be able to distinguish a spam email from a normal (or “ham”) email based upon its subject. You can use whatever method you see fit. Provide the necessary code for your analysis (along with any other relevant material) and validate your results.

The dataset comprises three folders of emails, labeled “spam,” “hard-ham,” and “easy-ham.” Most of the data in each file will be irrelevant to this task, as you need to focus on the subject only. (Once you finish the task, you can tweak your system so that it makes use of additional data, for additional practice.)

We recommend you do some data engineering to gather all the relevant data across all the files, and store it in additional files that you can use for further analysis. Also, since all of this is text data, you’ll need to create your own features before you can apply any Stats or ML algorithms. We also recommend you use an IJulia notebook for all your code, your results, and any additional comments throughout this project.

You’ll quickly notice that some files contain emails written in another language, so you won’t be able to make much of them due to their encoding. We suggest you skip them and work with the remaining part of the dataset. If you find the programming involved too frustrating at this point, and you find yourself stuck, you can either use some other data science tool you are more familiar with (see Appendix D), or you can go ahead and work with the titles_only.csv delimited file. We created this file to help you along in the process of tackling this project (the delimiter is a comma, something made possible through the removal of all special characters from the subject data).
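As a starting point for the data engineering step described above, here is a minimal sketch of how the subjects could be gathered from one folder of raw emails. The function names extract_subject and collect_subjects are our own illustrative choices (they are not taken from the chapter’s notebook), and the sketch assumes that each email is a plain-text file whose headers end at the first blank line. It ignores subjects that are folded across several header lines, and it simply skips any file that is not valid UTF-8.

```julia
# Hypothetical helper: pull the Subject header out of one raw email.
# Returns `nothing` when no Subject header is found.
function extract_subject(raw::String)
    for line in eachline(IOBuffer(raw))
        isempty(line) && break                  # a blank line ends the headers
        m = match(r"^subject:\s*(.*)$"i, line)  # case-insensitive header match
        m === nothing || return String(strip(m.captures[1]))
    end
    return nothing
end

# Hypothetical helper: collect the subjects of all readable emails in a folder.
function collect_subjects(dir::AbstractString)
    subjects = String[]
    for path in readdir(dir; join=true)
        isfile(path) || continue
        raw = read(path, String)
        isvalid(raw) || continue                # skip files that are not valid UTF-8
        s = extract_subject(raw)
        s === nothing || push!(subjects, s)
    end
    return subjects
end
```

You could call collect_subjects once per folder (“spam,” “hard-ham,” and “easy-ham”) and write each result, together with its class label, to a file of your own for the later stages of the analysis.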

Some features to get you started

If this whole project seems daunting to you, that’s fine; most text-based projects are like that. However, without going into natural language processing (NLP) methods, we can discuss some methods of solving such a problem with Julia, applying the data science process. So, if you just don’t know where to start, here are several features you can try using to build a baseline model.

  • The presence of the following words or phrases:
      • sale
      • guaranteed
      • low price
      • zzzzteana
      • fortune
      • money
      • entrepreneurs
      • perl
      • bug
      • investment
  • The presence of very long words (more than 10 characters)
  • The number of digits (i.e. numeric characters) in the subject.
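The features listed above can be sketched in a few lines of Julia. The function name subject_features and the layout of the resulting vector are our own assumptions, chosen for illustration only. Note that the keyword test is a plain substring match on the lowercased subject, so “bug” will also match “debugging”; whether that is desirable is for you to decide.

```julia
# Words and phrases taken from the list above
const KEYWORDS = ["sale", "guaranteed", "low price", "zzzzteana", "fortune",
                  "money", "entrepreneurs", "perl", "bug", "investment"]

# Hypothetical helper: turn one subject line into a numeric feature vector.
# Positions 1-10: keyword presence (1.0 or 0.0), in the order of KEYWORDS
# Position 11:    1.0 if any word has more than 10 characters, else 0.0
# Position 12:    number of digits in the subject
function subject_features(subject::AbstractString)
    s = lowercase(subject)
    kw = [occursin(k, s) ? 1.0 : 0.0 for k in KEYWORDS]
    has_long_word = any(w -> length(w) > 10, split(s)) ? 1.0 : 0.0
    n_digits = Float64(count(isdigit, s))
    return vcat(kw, has_long_word, n_digits)
end

features = subject_features("Guaranteed LOW PRICE on 2 investments in 2024")
```

Applying this function to every subject and stacking the resulting vectors gives you a numeric feature matrix that you can pass to whichever classifier you choose.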

Finding useful features is a good exercise in creativity (an essential quality in data science) as well as for refining your understanding of how prediction works in machine learning. Some features may be great for predicting one class, other features may be good for predicting another, while others may be good as aids of other features. Also, you may use functions like the sigmoid (sig() in the notebook) to make your numeric features more powerful (i.e. able to provide higher distances between the classes you are trying to distinguish).
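To illustrate the idea of transforming a numeric feature, here is a minimal sketch of a logistic sigmoid applied to a digit count. This is not the sig() function from the notebook, whose exact definition we do not reproduce here; the name sigmoid and the center and scale parameters are assumptions made for this example, and the sample values are made up.

```julia
# Logistic sigmoid; `center` and `scale` are illustrative tuning parameters.
# Values far below `center` map close to 0, values far above it close to 1.
sigmoid(x; center=0.0, scale=1.0) = 1 / (1 + exp(-(x - center) / scale))

# Made-up digit counts for five subjects
digit_counts = [0, 1, 2, 8, 15]

# Squash the counts into the (0, 1) range, centered on 3 digits
squashed = sigmoid.(digit_counts; center=3.0, scale=1.0)
```

The transformation keeps the ordering of the values but compresses the extremes, so a subject with 8 digits and one with 15 end up almost equally “digit-heavy.” Whether this actually helps separate the classes is something to check on your own data.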

Since this particular problem is relatively simple, if you apply yourself during feature creation, it is possible to obtain a perfect class separation (discernibility of 1.0). Of course, this does not guarantee a great performance for this kind of problem, since there is a significant imbalance between the classes. Remember that your aim is to be able to pinpoint the spam emails, so predicting the ham ones is good, but not as important.
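Because of the class imbalance, plain accuracy can look good even when few spam emails are caught. One possible choice (among several) is to measure precision and recall for the spam class specifically. The sketch below assumes that the labels are stored as Boolean vectors in which true means spam; the function name spam_metrics is hypothetical.

```julia
# Hypothetical helper: precision, recall and F1 for the spam class.
# `actual` and `predicted` are Boolean vectors where `true` means spam.
# If no email is predicted (or labeled) as spam, the ratios come out as NaN.
function spam_metrics(actual::AbstractVector{Bool}, predicted::AbstractVector{Bool})
    tp = count(actual .& predicted)        # spam correctly flagged
    fp = count((.!actual) .& predicted)    # ham wrongly flagged as spam
    fn = count(actual .& (.!predicted))    # spam that slipped through
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return (precision=prec, recall=rec, f1=f1)
end

# Made-up labels: 2 spam emails out of 6
actual    = [true, true, false, false, false, false]
predicted = [true, false, true, false, false, false]
m = spam_metrics(actual, predicted)
```

In this made-up example four of the six predictions are correct, yet only half of the spam is caught and only half of the flagged emails are really spam, which is exactly the kind of gap that accuracy alone would hide.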

Also, keep in mind that you can build a good model even if you don’t have the most complete feature set, so don’t spend all of your time on this part of the process. We recommend you build a basic feature set using the low-hanging fruit first, create a model based on that, and then go back and enrich your features. Also, be sure to apply as many of the techniques we described in the book as possible. After all, the objective of the project is to get some hands-on experience in using data science methods in Julia.

Some thoughts on this project

Although this is real-world data, we made an effort to keep the whole thing as simple as possible. Ideally, you would make use of other relevant data apart from the subject text, such as the IP of the sender, time of correspondence, and the actual content of the email. Were you to take all of this data into account, you would surely have better performance in your system, but the whole project might take a long time to complete (you may need to create a series of new features specializing on the body text, for example). Also, if you were to work on such an application in the real world, you would probably have access to some blacklist where known spammers would be mentioned, making the whole process a bit easier.

The objective of this project, however, is not to build the perfect spam detection system, but rather to make use of Julia for a data science project and have something to use as a reference for your future projects. Also, such a project can be a great opportunity to exercise your discernment and intuition, qualities that, although crucial to data science, are rarely acknowledged in the majority of relevant books out there. Among other things, these qualities are imperative for figuring out which validation metrics you would apply to this problem, which words or phrases you would use as features, and what other aspects of the linguistic structure you could express as numeric features.

Once you have finished this project and you are satisfied with the outcome, you can try your hand at more sophisticated datasets (Kaggle would be a good place to start) and gradually expand your Julia expertise. Also, once you obtain more experience, you can dig deeper in the Spam Assassin dataset and see how you can improve your prediction system further. Perhaps you can perform some clever feature selection too, so that the whole thing is even faster. The sky is the limit!

Final Thoughts about Your Experience with Julia in Data Science

Refining your Julia programming skills

Unfortunately, there is only one way to do this and there is no instant gratification: practice writing Julia programs. The various challenges at http://www.codeabbey.com are a great place to find programming problems, although you’re unlikely to find many solutions written in Julia. If you want something more focused on Julia, the first 10 of the “99 problems in Haskell” are available on this site, along with some proposed solutions: http://bit.ly/29xKI2c.

Once you reach the level where you can create bug-free code (at least for the majority of cases!), you can start looking at how you can make the code more efficient. One good way of monitoring that is the @time macro, which you can insert in front of the code you want to benchmark in order to get some measurement of the resources it takes (elapsed time and memory allocated). It would also be helpful to get to know the different data types, and to see how expert Julians handle them for optimal efficiency in their code. You’ll find that Julia is a high-performance language even with mediocre code, and that it can be even more efficient when you optimize your coding style.
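As a small illustration of @time in use, the sketch below compares two ways of summing the squares of a vector. The function names are hypothetical and the data is random, so the timings will differ from machine to machine; what matters is the difference in allocated memory that @time reports. Each function is called once beforehand so that compilation is not included in the measurement.

```julia
# Two hypothetical ways to sum the squares of a vector
function sumsq_alloc(x)
    return sum(x .^ 2)      # builds a temporary array of squares first
end

function sumsq_noalloc(x)
    return sum(abs2, x)     # squares each element on the fly, no temporary array
end

x = rand(10^6)

# Run each function once so that compilation time is not measured
sumsq_alloc(x)
sumsq_noalloc(x)

@time sumsq_alloc(x)
@time sumsq_noalloc(x)
```

Both calls return the same value up to floating-point rounding, but the first one should report several megabytes of allocations while the second should report almost none.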

Contributing to the Julia project

You can contribute to the Julia project in the following ways:

Easy ways:

  • Share your experience of Julia with other people.
  • Share articles you find about the language to raise awareness of Julia among your contacts.
  • Report any issues with the language or its packages to the Julia developers’ community.
  • Donate to the project (there is a donate button at the bottom of the community page of the language’s official site: http://julialang.org/community).

More hands-on ways:

  • Create functions or packages that may be useful to other Julia users and share the corresponding scripts online (via your personal website, or through a public one like Dropbox or GitHub).
  • Edit existing packages and resolve issues or limitations they have.
  • Solve programming challenges in Julia and make the solutions available to everyone through a blog.
  • Start using Julia for your everyday work (even if it is in parallel with your existing platform).

Ways involving some commitment:

  • Attend the annual Julia conferences (currently in Boston and Bangalore, in June and in September or October, respectively).
  • Promote Julia-related events, articles, and books on your site, blog, or Meetup event.
  • Create learning material for the language and make it available to other people (or translate existing material to your native tongue).
  • Buy more copies of this book and give them to friends as presents!

Okay, you don’t have to do the last one if your friends are not particularly interested in the topic (and you want to keep them as friends). However, you can still buy these extra copies and hand them out to your colleagues or to members of your local data science Meetup group!

Future of Julia in data science

It is difficult to make predictions about this matter, especially at this point when there isn’t a ton of information to base those predictions on. However, if the latest trends in the Julia language are any indication of its growth and increased applicability, we can expect to see more of it in the future, particularly in data science related applications.

At the end of the day, no matter how conservative people are when it comes to embracing a new tool, eventually its usability will overcome the inertia of its potential users. A good example of this is Linux, which has been a great alternative to other operating systems since the early 90s, even though people were not initially willing to adopt it.

Now, however, the majority of computers (and other electronics) make use of this OS in one way or another, while more and more people are switching to it as their main OS. Perhaps something similar will happen with Julia. It may not happen this year, or the next, but just like Python came to be the status quo due to its simplicity and ease of use, so can Julia gradually become a major player in the data science field.

Will Julia replace R, Python, and other languages? That’s unlikely; there will always be people (and companies) that will use these tools even when they don’t have any clear advantage over the alternatives, simply because these people have invested a lot in them. Consider the case of the Fortran language, which has been gradually replaced by C, C++, Java, and other languages. Still, there are people out there who use Fortran and will probably continue doing so even when it has become entirely irrelevant. Of course, R and Python are far more adaptive and high-level than Fortran. Yet it is likely that in the future, working with these platforms will be a matter of preference rather than of any actual advantage, just as some people prefer to work with C# instead of Java or C++.

Besides, Julia doesn’t have an aggressive approach towards existing technologies, which is why there are several APIs for bridging the gap between them and Julia (see Appendix D). So, it is unlikely that there will be a “fight to the death” between Julia and the R / Python duo, but rather some symbiosis of sorts. Perhaps Julia will lead the way in data engineering tasks and prove self-sufficient in the majority of other applications, with R being the go-to platform for statistical analyses, Python being the choice for some specialized models, and all of them running in parallel. At the end of the day, true data scientists just want to get the job done and couldn’t care less about the tools used, just like good developers are happy to use whatever language they see fit for building the application they need.

It is our wish that you become the kind of data scientist who takes advantage of new technologies such as Julia to make the field richer and more valuable to everyone. Perhaps you can one day be a contributor to this technology, making things easier for the Julia-oriented data scientists to come.
