Chapter 22
IN THIS CHAPTER
Locating starting challenges
Working with specific kinds of data
Performing analysis, pattern recognition, and classification
Dealing with huge online datasets
Data science is all about working with data. While working through this book, you have used a number of datasets, including the toy datasets that come with the Scikit-learn library. Of course, these datasets are all great for getting you started, but just as a runner wouldn’t stop after conquering the local fun run, so you need to start training for data science marathons by working with larger datasets.
This chapter introduces you to a number of challenging datasets that can help you become a world-class data scientist. By combining what you discover in this book with these new datasets, you can learn how to do amazing things. In fact, some people may view you as a bit of a magician as you pull seemingly impossible data patterns out of your hat. Each of the following datasets provides you with specific skills and helps you achieve different goals.
You use Scikit-learn quite a bit while working through this book, so you may already understand it to some degree. The Kaggle competition at https://www.kaggle.com/c/data-science-london-Scikit-learn
(the current competition ended in December 2014, but others should follow) provides a practice ground for trying, creating, and sharing examples that use the Scikit-learn classification algorithms. All the tools for the previous competition are still in place, and the challenge is still well worth exploring. You can find the data used for the competition at https://www.kaggle.com/c/data-science-london-scikit-learn/data
. The rules appear at https://www.kaggle.com/c/data-science-london-scikit-learn/rules
, and you can discover how Kaggle evaluates your submissions at https://www.kaggle.com/c/data-science-london-scikit-learn/details/evaluation
.
Of course, you might not have any desire to compete. Looking at the leaderboard (https://www.kaggle.com/c/data-science-london-scikit-learn/leaderboard
) may keep you from seriously considering actual competition because the contest has attracted serious data scientists. However, you can still enjoy taking a chance at figuring out how to solve a challenging data problem, and learn something new in the meantime, without needing to submit your solution to the leaderboard.
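If you want a feel for the workflow before downloading anything, the following sketch mirrors the usual steps. The random arrays are stand-ins for the competition's train.csv, trainLabels.csv, and test.csv files (the 40-feature shape mirrors the competition data), and you should check the submission column names against the competition's sample submission file:

```python
# Sketch of the typical workflow for the London scikit-learn challenge,
# using synthetic stand-in data so that it runs without the downloads.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
X_train = rng.randn(100, 40)               # stand-in for train.csv
y_train = (X_train[:, 0] > 0).astype(int)  # stand-in for trainLabels.csv
X_test = rng.randn(20, 40)                 # stand-in for test.csv

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Verify the column names against the competition's sample submission.
submission = "Id,Solution\n" + "\n".join(
    f"{i + 1},{p}" for i, p in enumerate(predictions))
print(len(predictions))
```

Swapping in the real competition files is then just a matter of replacing the synthetic arrays with the loaded CSV contents.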
You work with the Titanic data to some extent in the book (Chapters 6 and 20) by using Titanic.csv
and Titanic3.csv
from the Vanderbilt University School of Medicine. This challenge is actually much easier than the one described in the previous section because Kaggle designed it for the beginner. You can find it at https://www.kaggle.com/c/titanic
. The data model, found at https://www.kaggle.com/c/titanic/data
, is different from the one in the book, but the concepts are the same. You can find the rules for this competition at https://www.kaggle.com/c/titanic/rules
and the method of evaluation at https://www.kaggle.com/c/titanic#evaluation
.
You can find the leaderboard for this competition at https://www.kaggle.com/c/titanic/leaderboard
. The number of people who have already achieved what amounts to a perfect score should fill you with confidence.
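To give you a taste of what the challenge involves, here is a minimal sketch of the usual Titanic workflow. A tiny hand-made table stands in for the train.csv file you download from Kaggle; the column names match the ones the competition uses, but the rows themselves are invented:

```python
# A minimal sketch of the Titanic workflow on a hand-made stand-in
# for Kaggle's train.csv (the rows here are invented).
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Age":      [38.0, 22.0, 26.0, 54.0, 27.0, None],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# Typical preprocessing: encode Sex numerically, fill missing ages.
train["Sex"] = (train["Sex"] == "female").astype(int)
train["Age"] = train["Age"].fillna(train["Age"].median())

X = train[["Pclass", "Sex", "Age"]]
y = train["Survived"]
model = LogisticRegression().fit(X, y)
accuracy = model.score(X, y)
print(accuracy)
```

The same preprocessing steps (encoding categories and filling missing values) carry over directly to the real competition file.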
Competitions are great at helping you think through solutions in an environment in which others are doing the same. In the real world, you may find yourself pitted against competition on a regular basis, so competitions provide good experiences in thinking critically and quickly. They also present you with an opportunity to learn from others. The best place to find such competitions is on the Kaggle site at https://www.kaggle.com/competitions
.
This site will help you locate any past or present Kaggle competition. To find a present competition, click the Active Competitions link. To find a past competition, click the All Competitions link. All the datasets are freely available, so you have a chance to try your skills against any real-world scenario you might want to select. The Kaggle community will provide you with plenty of tutorials, benchmarks, and beat-the-benchmarks posts.
It’s interesting to note that the Kaggle competitions come from companies that don’t normally have access to data scientists, so you really are working in a real-world environment. You can also use this site to locate a potential job. Just go to https://www.kaggle.com/jobs
by clicking the Jobs link on the main page.
The Madelon Data Set at https://archive.ics.uci.edu/ml/datasets/Madelon
is an artificial dataset containing a two-class classification problem with continuous input variables. This NIPS 2003 feature-selection challenge will seriously test your skills in cross-validating models. The main emphasis of this challenge is to devise strategies for avoiding overfitting, an issue that you first confront in the “Finding more things that can go wrong” section of Chapter 16. You find overfitting mentioned in Chapters 18, 19, and 20 as well. To obtain the dataset, contact Isabelle Guyon at the address found in the Source section of the page at https://archive.ics.uci.edu/ml/datasets/Madelon
.
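To see why cross-validation matters so much here, consider this small demonstration. Random noise stands in for the actual Madelon data (which you must request from the maintainer), and the classifier choice is arbitrary; the point is that the training score looks perfect while cross-validation reveals the model has learned nothing:

```python
# Overfitting demonstration: a flexible model "learns" pure noise.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 500)      # many noisy features, as in Madelon
y = rng.randint(0, 2, 200)   # labels unrelated to the features

tree = DecisionTreeClassifier(random_state=0)
train_score = tree.fit(X, y).score(X, y)             # a perfect 1.0
cv_score = cross_val_score(tree, X, y, cv=5).mean()  # near chance
print(train_score, round(cv_score, 2))
```

The gap between the two scores is exactly what the Madelon challenge asks you to control.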
The MovieLens site (https://movielens.org/
) is all about helping you find a movie you might like. After all, with millions of movies out there, finding something new and interesting could take time that you don’t want to spend. The setup works by asking you to input ratings for movies you already know about. The MovieLens site then makes recommendations for you based on your ratings. In short, your ratings teach an algorithm what to look for, and then the site applies this algorithm to the entire dataset.
You can obtain the MovieLens dataset at https://grouplens.org/datasets/movielens/
. The interesting thing about this site is that you can download all or part of the dataset based on how you want to interact with it. The downloads come in several sizes, ranging from small subsets that are convenient for education and development to the full dataset.
This dataset presents you with an opportunity to work with user-generated data using both supervised and unsupervised techniques. The large datasets present special challenges that only big data can provide. You can find some starter information for working with supervised and unsupervised techniques in Chapters 15 and 19.
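The core idea behind rating-based recommendation is easy to sketch. The following toy example computes user-to-user similarity from a hand-made ratings table (the names and ratings are invented); a real system would run the same comparison over the downloaded MovieLens ratings:

```python
# Toy illustration of the rating-similarity idea behind MovieLens.
from math import sqrt

ratings = {
    "alice": {"Alien": 5, "Heat": 3, "Up": 1},
    "bob":   {"Alien": 4, "Heat": 3, "Up": 2},
    "carol": {"Alien": 1, "Heat": 2, "Up": 5},
}

def cosine(u, v):
    """Cosine similarity over the movies two users both rated."""
    shared = set(u) & set(v)
    num = sum(u[m] * v[m] for m in shared)
    den = (sqrt(sum(u[m] ** 2 for m in shared)) *
           sqrt(sum(v[m] ** 2 for m in shared)))
    return num / den if den else 0.0

sim_ab = cosine(ratings["alice"], ratings["bob"])
sim_ac = cosine(ratings["alice"], ratings["carol"])
print(sim_ab > sim_ac)  # alice's tastes align with bob's, not carol's
```

Once you can score similarity, recommending a movie is a matter of looking at what the most similar users rated highly.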
Everyone wants to get rid of spam e-mail — those time wasters that contain everything from invitations to join in a fantastic new venture to pornography. Of course, the best way to accomplish the task is to create an algorithm to do the sorting for you. However, you need to train the algorithm to perform its work, which is where the Spambase Data Set comes into play. You can find the Spambase Data Set at https://archive.ics.uci.edu/ml/datasets/Spambase
.
This collection of spam e-mails came from postmasters and individuals who had filed spam reports. It also includes nonspam e-mail from various sources to allow the creation of filters that let good e-mails through. This is a complex challenge that deals with textual data and multiple, dissimilar targets.
You can see how others have used this dataset by reviewing the papers at these locations:
http://rexa.info/paper/a2734ae038cae7393159934e860c24a52dc2754d
http://rexa.info/paper/631197638c7e0317c98e1a8d98e5fce8921aa758
http://rexa.info/paper/48d6beec2a36a87d9d88b6de85dd85a75e5ed24d
http://rexa.info/paper/3cb3fbd5512e3cd12111b598fece53fcb42c484b
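The Spambase file itself is a plain CSV in which each row holds 57 numeric features (mostly word and character frequencies) followed by a 0/1 spam label. This sketch parses that layout from a small in-line sample with made-up values, instead of the actual download:

```python
# Parse the Spambase row layout: 57 numeric features plus a 0/1 label.
import csv
import io

# Two made-up rows in Spambase's layout (real rows come from the file).
sample = "\n".join(
    ",".join(["0.1"] * 57 + [label]) for label in ("1", "0"))

rows = [list(map(float, row)) for row in csv.reader(io.StringIO(sample))]
X = [row[:-1] for row in rows]      # the 57 frequency features
y = [int(row[-1]) for row in rows]  # 1 = spam, 0 = good mail
print(len(X[0]), y)
```

From here, X and y plug directly into any Scikit-learn classifier of the kind you use throughout the book.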
Pattern recognition, especially working with handwritten information, is an important data science task. The Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits at http://yann.lecun.com/exdb/mnist/
provides a training set of 60,000 examples, and a test set of 10,000 examples. This is a subset of the original National Institute of Standards and Technology (NIST) dataset found at https://srdata.nist.gov/gateway/gateway?keyword=handwriting+recognition
. It’s a good dataset to use to learn how to work with handwritten data without having to perform a lot of preprocessing at the outset.
The following pages provide examples of reading the MNIST data in Python:
https://cs.indstate.edu/~jkinne/cs475-f2011/code/mnistHandwriting.py
https://martin-thoma.com/classify-mnist-with-pybrain/
https://gist.github.com/akesling/5358964
The host page also contains an important listing of methods used to work with the training and test set. The list contains an impressive number of classifiers that should give you some ideas for your own experiments. The point is that this particular dataset is useful for all sorts of different tasks.
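The MNIST files use the compact binary IDX layout described on the host page: big-endian 32-bit integers for a magic number, the image count, and the row and column sizes, followed by raw pixel bytes. This sketch builds and parses a tiny synthetic file in that layout so that you can see the mechanics without downloading anything:

```python
# Parse the MNIST IDX image layout from a synthetic in-memory file.
import struct

# Fake image file: magic 2051, two 28 x 28 images of zero pixels.
header = struct.pack(">IIII", 2051, 2, 28, 28)
data = header + bytes(2 * 28 * 28)

magic, count, rows, cols = struct.unpack(">IIII", data[:16])
images = [data[16 + i * rows * cols: 16 + (i + 1) * rows * cols]
          for i in range(count)]
print(magic, count, rows, cols, len(images[0]))
```

Reading the real train-images file works the same way; you simply open the (decompressed) download and feed its bytes through the same unpacking steps.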
The Canadian Institute for Advanced Research (CIFAR) datasets at https://www.cs.toronto.edu/~kriz/cifar.html
provide you with graphics content to work with in various ways. The CIFAR-10 and CIFAR-100 datasets contain labeled subsets of a dataset with 80 million tiny images (you can read about how the dataset works with the original image dataset in the Learning Multiple Layers of Features from Tiny Images technical report at https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
). In the CIFAR-10 dataset, you find 60,000 32 x 32 color images in ten classes (6,000 images per class). The classes cover everyday subjects, such as airplanes, birds, cats, ships, and trucks.
The CIFAR-100 dataset contains more classes. Instead of 10 classes, you get 100 classes containing 600 images each. The size of the dataset is the same, but the number of classes is larger. The classification system is hierarchical in this case. The 100 classes divide into 20 superclasses. For example, in the aquatic mammals superclass, you find the beaver, dolphin, otter, seal, and whale classes.
This is an excellent challenge to take after you have worked with the digits dataset described in the previous section. Taking this challenge helps you to deal with colorful, complex images. If you worked through the examples in Chapter 14, you already have some experience working with images using the toy Olivetti Faces dataset.
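According to the CIFAR page, each Python batch file is a pickled dictionary whose data entry stores 3,072 bytes per image (the 1,024 red pixels, then the green, then the blue) alongside a list of labels. The synthetic batch below stands in for a real downloaded file so that the unpacking logic can run anywhere:

```python
# Unpack the CIFAR-10 Python batch layout from a synthetic batch.
import pickle

fake_batch = pickle.dumps({
    b"data": [bytes(3072), bytes(3072)],   # two all-black 32 x 32 images
    b"labels": [3, 5],                     # made-up class ids
})

# A real file is read with pickle.load(f, encoding="bytes") instead.
batch = pickle.loads(fake_batch)
first_image = batch[b"data"][0]
red = first_image[0:1024]       # channel-planar layout: red plane first,
green = first_image[1024:2048]  # then green,
blue = first_image[2048:3072]   # then blue
print(len(first_image), len(red), batch[b"labels"])
```

Reshaping each channel into a 32 x 32 grid then gives you an image you can display or feed to a classifier.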
If you want to work with a really large dataset, try the Amazon.com review dataset at https://snap.stanford.edu/data/web-Amazon.html
. This dataset consists of reviews from Amazon.com taken over a period of 18 years, including ~35 million reviews up to March 2013. The reviews include product and user information, ratings, and a plain-text review. This is the dataset to tackle after you work through smaller datasets, such as MovieLens. It can help you understand how to work with user-generated data in a business context.
Unlike many of the datasets in this chapter, the Amazon.com dataset comes in a number of forms. Yes, you can download all.txt.gz
to obtain the entire dataset (11GB of data), but you also have the option to download just portions of the dataset. For example, you can choose to download just the 184,887 reviews associated with baby products by obtaining Baby.txt.gz
(a 42MB download).
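The review files store one field: value pair per line, with a blank line between reviews (check the format notes on the download page). The following sketch parses a tiny gzip-compressed sample written in that style, with invented product ids, so the parser runs without the multi-gigabyte download:

```python
# Stream-parse the blank-line-delimited review format from a gzip sample.
import gzip
import io

sample = (b"product/productId: B000ABC\n"
          b"review/score: 5.0\n"
          b"review/text: Great product.\n"
          b"\n"
          b"product/productId: B000XYZ\n"
          b"review/score: 2.0\n"
          b"review/text: Not for me.\n"
          b"\n")

def parse_reviews(lines):
    """Yield one dict per blank-line-separated review block."""
    review = {}
    for raw in lines:
        line = raw.decode("latin-1").rstrip("\n")
        if not line:
            if review:
                yield review
            review = {}
        else:
            key, _, value = line.partition(": ")
            review[key] = value
    if review:
        yield review

packed = gzip.compress(sample)  # mimic the .txt.gz packaging
with gzip.open(io.BytesIO(packed), "rb") as handle:
    reviews = list(parse_reviews(handle))
print(len(reviews), reviews[0]["review/score"])
```

Because the parser streams line by line, the same code handles Baby.txt.gz or even the full archive without loading everything into memory at once.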
Imagine trying to work through the connections between 3.5 billion web pages. You can do just that by downloading the immense dataset at https://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
. The biggest, richest, most complex dataset of all is the Internet itself. Start with a subsample offered by the Common Crawl 2012 web corpus (https://commoncrawl.org/
) and learn how to extract and elaborate data from websites. The principal uses for this dataset involve analyzing the link structure of the web and mining the content of the pages themselves.
Pay particular attention to the Contents section near the middle of the page. Clicking a link takes you to an entry at http://webdatacommons.org/hyperlinkgraph/
that explains the dataset in more detail. You need the additional information to perform most data science tasks. Near the bottom of the page are links for downloading various levels of the entire graph (fortunately, you don’t have to download everything, which would be a 45GB download for the index file and a 331GB download for the arc file).
Don’t let the idea of performing an analysis on such a large dataset scare you. If you worked through the examples in Chapter 7, you have worked with simple graph data. This dataset is a similar task but on a significantly larger scale. Yes, size does matter to some extent, but you already know some of the required techniques for getting the job done.
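Assuming the layout sketched here, with an index file listing one URL per line (the line number serving as the node id) and an arc file listing source and target node ids per line, a first pass at the graph needs nothing more than ordinary Python:

```python
# Count link degrees from hypothetical index-file and arc-file lines.
from collections import Counter

index_lines = ["example.org/a", "example.org/b", "example.org/c"]
arc_lines = ["0\t1", "0\t2", "1\t2"]  # node 0 links to 1 and 2, etc.

urls = {node: url for node, url in enumerate(index_lines)}
edges = [tuple(map(int, line.split("\t"))) for line in arc_lines]

out_degree = Counter(src for src, _ in edges)  # links leaving each page
in_degree = Counter(dst for _, dst in edges)   # links arriving at each page

most_linked = max(urls, key=lambda node: in_degree.get(node, 0))
print(urls[most_linked], in_degree[most_linked])
```

The real arc file is far too large to hold in a list, but the same counting logic works when you stream the file line by line.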