Chapter 22

Ten Data Challenges You Should Take

IN THIS CHAPTER

  • Locating starting challenges
  • Working with specific kinds of data
  • Performing analysis, pattern recognition, and classification
  • Dealing with huge online datasets

Data science is all about working with data. While working through this book, you have used a number of datasets, including the toy datasets that come with the Scikit-learn library. Of course, these datasets are all great for getting you started, but just as a runner doesn’t stop after conquering the local fun run, you need to start training for data science marathons by working with larger datasets.

This chapter introduces you to a number of challenging datasets that can help you become a world-class data scientist. By combining what you discover in this book with these new datasets, you can learn how to do amazing things. In fact, some people may view you as a bit of a magician as you pull seemingly impossible data patterns out of your hat. Each of the following datasets provides you with specific skills and helps you achieve different goals.

Remember You can find a wealth of datasets on the Internet. However, not every dataset is created equal, and you need to choose your challenges with care. These ten datasets provide well-known functionality, often provide you with tutorials, and appear in scientific papers. These three features make these datasets stand apart from the competition. Yes, other good datasets are available, but these ten datasets provide skills needed to conquer even bigger challenges, such as that database lurking on your company server.

Meeting the Data Science London + Scikit-learn Challenge

You use Scikit-learn quite a bit in this book, so you already have a head start on this challenge. The Kaggle competition at https://www.kaggle.com/c/data-science-london-scikit-learn (the most recent competition ended in December 2014, but there should be others) provides a practice ground for trying, creating, and sharing examples that use Scikit-learn’s classification algorithms. All the tools for the previous competition are still in place, and the site is still well worth exploring. You can find the data used for the competition at https://www.kaggle.com/c/data-science-london-scikit-learn/data. The rules appear at https://www.kaggle.com/c/data-science-london-scikit-learn/rules, and you can discover how Kaggle evaluates your submissions at https://www.kaggle.com/c/data-science-london-scikit-learn/details/evaluation.

Of course, you might not have any desire to compete. Looking at the leaderboard (https://www.kaggle.com/c/data-science-london-scikit-learn/leaderboard) may keep you from seriously considering actual competition because the contest has attracted serious data scientists. However, you can still enjoy figuring out how to solve a challenging data problem and learn something new in the meantime, without ever submitting your solution to the leaderboard.
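If you want to dig into the data before worrying about the leaderboard, here’s a minimal sketch of a first attempt. It assumes the file names shown on the competition’s Data page (train.csv, trainLabels.csv, and test.csv); adjust them to match what you actually download:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The training features and labels ship as separate headerless files.
X = pd.read_csv('train.csv', header=None)
y = pd.read_csv('trainLabels.csv', header=None).values.ravel()

# A cross-validated baseline tells you whether later tweaks actually help.
model = RandomForestClassifier(n_estimators=200, random_state=1)
print(cross_val_score(model, X, y, cv=5).mean())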

Remember Because this site builds on knowledge you already have from the book, it’s actually the best place to begin building new skills. That’s why this site appears first in the chapter: You can get a good start using other datasets with techniques you already know.

Predicting Survival on the Titanic

You work with the Titanic data to some extent in the book (Chapters 6 and 20) by using Titanic.csv and Titanic3.csv from the Vanderbilt University School of Medicine. This challenge is actually much easier than the one described in the previous section because Kaggle designed it for the beginner. You can find it at https://www.kaggle.com/c/titanic. The data model, found at https://www.kaggle.com/c/titanic/data, is different from the one in the book, but the concepts are the same. You can find the rules for this competition at https://www.kaggle.com/c/titanic/rules and the method of evaluation at https://www.kaggle.com/c/titanic#evaluation.

You can find the leaderboard for this competition at https://www.kaggle.com/c/titanic/leaderboard. The number of people who have already achieved what amounts to a perfect score should fill you with confidence.

Tip The biggest challenge in this case is that the dataset is quite small, which requires that you create new features in order to obtain an accurate score. The competition helps you apply the skills you learn in the “Considering the Art of Feature Creation” section of Chapter 9 and that you see demonstrated in Chapter 19.
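As a taste of the kind of feature creation the competition rewards, this sketch (which assumes the train.csv file listed on the competition’s Data page) derives two classic features from the raw columns:

import pandas as pd

train = pd.read_csv('train.csv')

# Family size combines the siblings/spouses and parents/children counts.
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
# A passenger's title (Mr, Mrs, Miss, and so on) hides in the Name column.
train['Title'] = train['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

print(train[['Name', 'Title', 'FamilySize']].head())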

Finding a Kaggle Competition that Suits Your Needs

Competitions are great at helping you think through solutions in an environment in which others are doing the same. In the real world, you may find yourself pitted against competition on a regular basis, so competitions provide good experiences in thinking critically and quickly. They also present you with an opportunity to learn from others. The best place to find such competitions is on the Kaggle site at https://www.kaggle.com/competitions.

This site will help you locate any past or present Kaggle competition. To find a present competition, click the Active Competitions link. To find a past competition, click the All Competitions link. All the datasets are freely available, so you have a chance to try your skills against any real-world scenario you might want to select. The Kaggle community will provide you with plenty of tutorials, benchmarks, and beat-the-benchmarks posts.

Remember You don’t have to select an ongoing competition. For example, you might see a past competition that meets a need and try that instead (benefiting from the published solutions). If you enter an active competition, you can post questions on the forum and have some of the most skilled data scientists in the world answer them. Because of the great number of competitions on this site, it’s likely that you’ll find one to suit your interests!

It’s interesting to note that the Kaggle competitions come from companies that don’t normally have access to data scientists, so you really are working in a real-world environment. You can also use this site to locate a potential job: just click the Jobs link on the main page to go to https://www.kaggle.com/jobs.

Honing Your Overfit Strategies

The Madelon Data Set at https://archive.ics.uci.edu/ml/datasets/Madelon is an artificial dataset containing a two-class classification problem with continuous input variables. This NIPS 2003 feature selection challenge will seriously test your skills in cross-validating models. The main emphasis of the challenge is to devise strategies for avoiding overfitting — an issue that you first confront in the “Finding more things that can go wrong” section of Chapter 16. You find overfitting mentioned in Chapters 18, 19, and 20 as well. To obtain the dataset, contact Isabelle Guyon at the address found in the Source section of the page at https://archive.ics.uci.edu/ml/datasets/Madelon.
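Even before you obtain the files, you can rehearse the core skill on synthetic data. Scikit-learn’s make_classification generator was adapted from the algorithm used to build Madelon, so a rough stand-in is easy to create, and cross-validation exposes the overfitting that a training score hides:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A Madelon-like setup: a few informative features buried among
# hundreds of redundant and noise features.
X, y = make_classification(n_samples=600, n_features=500,
                           n_informative=20, n_redundant=50,
                           random_state=1)

tree = DecisionTreeClassifier(random_state=1)
print('Training accuracy:', tree.fit(X, y).score(X, y))       # a perfect 1.0
print('Cross-validated:', cross_val_score(tree, X, y, cv=5).mean())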

Tip This particular dataset attracted the attention of a number of people who created papers about it. The best papers appear in the book Feature Extraction, Foundations and Applications at https://www.springer.com/us/book/9783540354871. You can also download an associated technical report from https://clopinet.com/isabelle/Projects/ETH/TM-fextract-class.pdf. The Advances in Neural Information Processing Systems 17 (NIPS 2004) at https://papers.nips.cc/book/advances-in-neural-information-processing-systems-17-2004 also contains useful links to papers that will help you with this particular dataset.

Trudging Through the MovieLens Dataset

The MovieLens site (https://movielens.org/) is all about helping you find a movie you might like. After all, with millions of movies out there, finding something new and interesting could take time that you don’t want to spend. The setup works by asking you to input ratings for movies you already know about. The MovieLens site then makes recommendations for you based on your ratings. In short, your ratings teach an algorithm what to look for, and then the site applies this algorithm to the entire dataset.

You can obtain the MovieLens dataset at https://grouplens.org/datasets/movielens/. The interesting thing about this site is that you can download all or part of the dataset based on how you want to interact with it. You can find downloads in the following sizes:

  • 100,000 ratings from 1,000 users on 1,700 movies
  • 1 million ratings from 6,000 users on 4,000 movies
  • 10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users
  • 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users
  • MovieLens’s latest dataset in small or full sizes (the full size contained 21,000,000 ratings and 470,000 tag applications applied to 27,000 movies by 230,000 users as of this writing but will increase in size with time)

This dataset presents you with an opportunity to work with user-generated data using both supervised and unsupervised techniques. The large datasets present special challenges that only big data can provide. You can find some starter information for working with supervised and unsupervised techniques in Chapters 15 and 19.
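As a starting point, this sketch assumes the 100,000-rating release, whose u.data file stores tab-separated user, movie, rating, and timestamp columns, and pivots it into the user-by-movie matrix that both recommenders and clustering algorithms work from:

import pandas as pd

ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])

# Missing entries in the pivoted matrix are the movies each user
# hasn't rated, which is exactly what a recommender tries to predict.
matrix = ratings.pivot(index='user_id', columns='movie_id', values='rating')
print(matrix.shape)
print(ratings['rating'].mean())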

Getting Rid of Spam E-mails

Everyone wants to get rid of spam e-mail — those time wasters that contain everything from invitations to join in a fantastic new venture to pornography. Of course, the best way to accomplish the task is to create an algorithm to do the sorting for you. However, you need to train the algorithm to perform its work, which is where the Spambase Data Set comes into play. You can find the Spambase Data Set at https://archive.ics.uci.edu/ml/datasets/Spambase.

This collection of spam e-mails came from postmasters and individuals who had filed spam reports. It also includes nonspam e-mail from various sources so that you can create filters that let good e-mail through. The challenge is complex because it deals with textual data and with complex, differing targets.
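Getting a baseline is straightforward because the features have already been extracted as numbers. This sketch assumes the spambase.data file from the download page, which holds 57 feature columns (word and character frequencies plus capital-run statistics) followed by a 0/1 spam label:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv('spambase.data', header=None)
X, y = data.iloc[:, :57], data.iloc[:, 57]

# A quick cross-validated baseline before you try anything fancier.
print(cross_val_score(GaussianNB(), X, y, cv=10).mean())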

Tip You can find a number of papers that cite this particular dataset. An overview of the pertinent papers and their host sites appears on the dataset’s page.

Working with Handwritten Information

Pattern recognition, especially working with handwritten information, is an important data science task. The Modified National Institute of Standards and Technology (MNIST) dataset of handwritten digits at http://yann.lecun.com/exdb/mnist/ provides a training set of 60,000 examples and a test set of 10,000 examples. This is a subset of the original National Institute of Standards and Technology (NIST) dataset found at https://srdata.nist.gov/gateway/gateway?keyword=handwriting+recognition. It’s a good dataset for learning how to work with handwritten data without having to perform a lot of preprocessing at the outset.

Tip The dataset appears in four files. The two training and two test files contain images and labels, and you need all four files in order to create a complete dataset for working with digits. A potential stumbling block is that the image files aren’t stored in a standard graphics format; instead, they use a simple custom format that appears at the bottom of the page. Of course, you could always build your own Python application for reading them, but using code that someone else has created is a lot easier, and a quick online search turns up a number of Python scripts for reading the MNIST files.
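Here’s a minimal reader as a sketch, assuming the four files keep their standard names (train-images-idx3-ubyte.gz and so on) and follow the layout documented at the bottom of the page:

import gzip
import struct
import numpy as np

def read_idx(path):
    # The format: a magic number (two zero bytes, a type code, and a
    # dimension count), one big-endian 32-bit size per dimension, and
    # then the raw data. All the MNIST files use unsigned bytes.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as f:
        _, _, ndim = struct.unpack('>HBB', f.read(4))
        dims = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(dims)

images = read_idx('train-images-idx3-ubyte.gz')  # shape (60000, 28, 28)
labels = read_idx('train-labels-idx1-ubyte.gz')  # shape (60000,)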

The host page also contains an important listing of methods used to work with the training and test set. The list contains an impressive number of classifiers that should give you some ideas for your own experiments. The point is that this particular dataset is useful for all sorts of different tasks.

Remember You have worked with the digits toy dataset from Scikit-learn in a number of chapters in the book. To use this dataset, you import the digits database using from sklearn.datasets import load_digits. This particular dataset appears in Chapters 12, 15, 17, 19, and 20, so you gain a considerable amount of experience in working with a much smaller digits database when you work through the examples in those chapters.
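Loading the toy dataset takes only a few lines, which makes it a handy warm-up before you tackle the full MNIST files:

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)   # (1797, 64): flattened 8 x 8 grayscale images
print(digits.target[:10])  # the labels for the first ten digits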

Working with Pictures

The Canadian Institute for Advanced Research (CIFAR) datasets at https://www.cs.toronto.edu/~kriz/cifar.html provide you with graphics content to work with in various ways. The CIFAR-10 and CIFAR-100 datasets contain labeled subsets of a dataset with 80 million tiny images (you can read about how the dataset works with the original image dataset in the Learning Multiple Layers of Features from Tiny Images technical report at https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf). In the CIFAR-10 dataset, you find 60,000 32 x 32 color images in ten classes (for 6,000 images in each class). Here are the classes you find:

  • Airplane
  • Automobile
  • Bird
  • Cat
  • Deer
  • Dog
  • Frog
  • Horse
  • Ship
  • Truck

The CIFAR-100 dataset contains more classes: instead of 10 classes, you get 100 classes containing 600 images each, so the overall dataset is the same size. The classification system is also hierarchical in this case. The 100 classes divide into 20 superclasses; for example, in the aquatic mammals superclass, you find the beaver, dolphin, otter, seal, and whale classes.

Warning Both CIFAR datasets come in Python, MATLAB, and binary versions. Make sure that you download the correct version and follow the instructions for using them on the download page. Yes, you could use the other versions with Python, but doing so would require a lot of extra programming, and because you already have access to a Python version, you wouldn’t gain anything from the exercise.
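After you extract the Python version, each batch file is a pickled dictionary, and reading it looks much like the small helper shown on the download page (the directory and file names here assume the CIFAR-10 archive):

import pickle

def unpickle(path):
    # Each batch dict holds b'data' (a 10,000 x 3,072 uint8 array,
    # with 1,024 bytes per color channel) and b'labels'.
    with open(path, 'rb') as f:
        return pickle.load(f, encoding='bytes')

batch = unpickle('cifar-10-batches-py/data_batch_1')
images = batch[b'data'].reshape(-1, 3, 32, 32)  # N x channel x row x column
labels = batch[b'labels']
print(images.shape, labels[:5])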

This is an excellent challenge to take after you have worked with the digits dataset described in the previous section. Taking this challenge helps you to deal with colorful, complex images. If you worked through the examples in Chapter 14, you already have some experience working with images using the toy Olivetti Faces dataset.

Analyzing Amazon.com Reviews

If you want to work with a really large dataset, try the Amazon.com review dataset at https://snap.stanford.edu/data/web-Amazon.html. This dataset consists of reviews from Amazon.com taken over a period of 18 years, including ~35 million reviews up to March 2013. The reviews include product and user information, ratings, and a plain-text review. This is the dataset to tackle after you work through smaller datasets, such as MovieLens. It can help you understand how to work with user-generated data in a business context.

Unlike many of the datasets in this chapter, the Amazon.com dataset comes in a number of forms. Yes, you can download all.txt.gz to obtain the entire dataset (11GB of data), but you also have the option to download just portions of the dataset. For example, you can choose to download just the 184,887 reviews associated with baby products by obtaining Baby.txt.gz (a 42MB download).

Tip Make sure to check out the bottom of the page. The site owner has thoughtfully provided you with the Python code required to interpret the data. Using this simple function makes working with the immense dataset a lot easier. Even if you choose to create a modified version of the function, you at least have a good starting point.
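In case the page changes, here’s a sketch of the same idea in Python 3: each review is a block of field: value lines ended by a blank line, so a small generator can stream even the 11GB file without loading it into memory (the latin-1 encoding is an assumption that sidesteps decoding errors):

import gzip

def parse(path):
    entry = {}
    with gzip.open(path, 'rt', encoding='latin-1') as f:
        for line in f:
            line = line.strip()
            if not line:           # a blank line ends the record
                if entry:
                    yield entry
                entry = {}
                continue
            field, _, value = line.partition(': ')
            entry[field] = value
    if entry:
        yield entry

for review in parse('Baby.txt.gz'):
    print(review.get('review/score'), review.get('product/productId'))
    break                          # remove to stream every review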

Interacting with a Huge Graph

Imagine trying to work through the connections between 3.5 billion web pages. You can do just that by downloading the immense dataset at https://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us. The biggest, richest, most complex dataset of all is the Internet itself. Start with a subsample offered by the Common Crawl 2012 web corpus (https://commoncrawl.org/) and learn how to extract and process data from websites. The principal uses for this dataset are:

  • Search algorithms
  • Spam detection methods
  • Graph analysis algorithms
  • Web science research

Pay particular attention to the Contents section near the middle of the page. Clicking a link takes you to an entry at http://webdatacommons.org/hyperlinkgraph/ that explains the dataset in more detail. You need the additional information to perform most data science tasks. Near the bottom of the page are links for downloading various levels of the entire graph (fortunately, you don’t have to download everything, which would be a 45GB download for the index file and a 331GB download for the arc file).

Don’t let the idea of performing an analysis on such a large dataset scare you. If you worked through the examples in Chapter 7, you have worked with simple graph data. This dataset is a similar task but on a significantly larger scale. Yes, size does matter to some extent, but you already know some of the required techniques for getting the job done.
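If you want to limber up first, a sketch like this one applies the classic search-ranking computation to a toy hyperlink graph (the real dataset demands far more scalable tooling than NetworkX, but the concepts carry over):

import networkx as nx

# Each pair is (source page, target page) in a tiny directed graph.
G = nx.DiGraph([('a', 'b'), ('b', 'c'), ('c', 'a'),
                ('a', 'c'), ('d', 'c')])

# PageRank scores every page using the link structure alone.
print(nx.pagerank(G, alpha=0.85))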

Tip This particular site provides access to a number of other datasets. Links for these datasets are at the bottom of the page. For example, you can find “Great statistical analysis: forecasting meteorite hits” at https://www.analyticbridge.com/profiles/blogs/great-statistical-analysis-forecasting-meteorite-hits. In short, if analyzing the entire Internet doesn’t appeal to you, try one of the other amazing (and huge) datasets.
