Gathering data

Gathering data is a must when practice time comes. Web scraping is one possibility: it is both a way to get data and a way to sharpen your coding skills along the way. It may lead you to master packages such as rvest, and you will most likely improve your coding skills just by doing so. Web scraping also goes hand in hand with knowledge of HTML, HTTP, and more, so look into it if you want to cultivate a broader set of skills.

BEWARE! Scraping the web is not always legally safe. Make sure that you're not breaking any laws or otherwise doing something harmful with the information you scraped.
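If you decide to give rvest a try, a minimal sketch of the typical workflow looks like the following. The URL and the CSS selector here are placeholders, not real targets; swap in a page you are allowed to scrape and a selector that matches it:

    library(rvest)

    # Hypothetical URL and CSS selector -- replace with your own
    page <- read_html("https://example.com/articles")

    titles <- page %>%
      html_elements(".article-title") %>%  # select nodes by CSS selector
      html_text2()                         # extract clean, trimmed text

    head(titles)

read_html() downloads and parses the page, html_elements() picks the matching nodes, and html_text2() pulls out readable text; html_table() does the same job for HTML tables.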

If you are not into it, there is no need to go that far. Some R packages come with many datasets you can use, and one web page summarizes a great number of them: https://vincentarelbundock.github.io/Rdatasets/datasets.html.

The web page, Vincent Arel-Bundock's repository, offers several useful pieces of information for every dataset. There, you will find the dataset's name and the name of the package holding it, as well as the number of rows and columns and the available data types. Based on this information, you can pick a data frame to use and either download it directly from there or load it through the related package:

 Vincent's repository

Loading a data frame directly from a package, which can be installed with install.packages(), saves time. Of course, going through the trouble of loading data from a file tends to be more realistic than pulling it from a package, but sometimes saving time is exactly what you need. Speaking for myself, I tend to visit Vincent's repository when I just want to practice visualizations and nothing else.
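Both routes are quick. Here is a minimal sketch of the two of them: loading a dataset bundled with a package, and reading the CSV that Vincent's repository serves for a dataset (the URL follows the repository's csv/<package>/<dataset>.csv pattern at the time of writing; check the dataset's page for the current link):

    # Route 1: load a dataset that ships with a package
    install.packages("ggplot2")  # only needed once
    library(ggplot2)
    data(diamonds)               # now available as a data frame
    str(diamonds)

    # Route 2: read the CSV served by the repository
    url <- "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv"
    mtcars_df <- read.csv(url)
    head(mtcars_df)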

If you have access to primary data, you can also build your own dataset and possibly make a living out of it.

If your idea is to browse data based on field/category, the UCI Machine Learning Repository is a good call: https://archive.ics.uci.edu/ml/index.php.

The repository is "a collection of databases, domain theories, and data generators," as the website defines it. Besides the data itself, readers will also find a brief description along with tips about how to use each dataset, which algorithms to deploy, or which variable to predict; thus, you get not only data but also an objective.
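Many UCI datasets are plain text files you can read straight into R. As an illustration, the classic Iris dataset has long been served at the URL below; the file has no header row, so column names are supplied by hand (verify the current link on the dataset's page, since paths occasionally change):

    # Read the Iris dataset straight from the UCI repository
    url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    iris_uci <- read.csv(url, header = FALSE,
                         col.names = c("sepal_length", "sepal_width",
                                       "petal_length", "petal_width", "species"))
    summary(iris_uci)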

If you happen to own a dataset, you can donate it to the UCI Machine Learning Repository.

As the name suggests, the UCI Machine Learning Repository is optimal if you want to try machine learning models. The home page displays the latest datasets to come on board and the most popular ones. You can also click View ALL Data Sets to browse for more, or you can use the search bar:

UCI Machine Learning Repository

Datasets can also be browsed by category. Be advised that you can use a dataset differently than what it is suggested for (set another target, test an alternative algorithm), but it feels nice to have an objective, doesn't it? If a clear goal, a benchmark, and sometimes even money prizes sound better still, meet Kaggle: https://www.kaggle.com.

Kaggle is not exactly a data repository, but you could say that it has a data repository embedded in it. Mainly, Kaggle hosts data-driven competitions. The organizers provide the rules plus two datasets: a training set and a test set. Using the training set, competitors have to fit a model that predicts the targets for the test set.
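A typical Kaggle workflow in R looks roughly like the sketch below. The file names train.csv and test.csv and the columns id and target are common conventions, but they are assumptions here; every competition defines its own:

    # Hypothetical Kaggle-style workflow; names vary by competition
    train <- read.csv("train.csv")
    test  <- read.csv("test.csv")

    # Fit a simple baseline model, excluding the id column
    model <- lm(target ~ . - id, data = train)

    # Predict the target for the test set and write a submission file
    predictions <- predict(model, newdata = test)
    submission  <- data.frame(id = test$id, target = predictions)
    write.csv(submission, "submission.csv", row.names = FALSE)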

Couldn't find the data you were looking for? Try reddit.com/r/data, and ask users for help.

Sometimes, competitions are sponsored by real companies trying to solve real problems, usually with money prizes on offer. Winning money in such competitions is quite a feat, but scoring high enough is likely to culminate in a job offer. From the companies' perspective, Kaggle is a storefront for data scientists.

Besides, even if you land in the lower ranks, you are likely to learn a lot just by keeping an eye on the kernels and discussion boards. You might as well experience how feature engineering and fine-tuning can make a difference in real-world problems.

Expect to learn a lot from just testing yourself in a competition. Expect to learn even more if you consistently follow the kernels and discussions. From the data scientists' perspective, Kaggle is a storefront for models and best practices, and you might learn which algorithms are best at solving a given kind of problem.

Many competitions allow users to join forces into teams at some stage. That's a great opportunity to learn from others, plus an opportunity for networking.

The data science field is evolving at a fast pace, and you have to stay up to date. But how? The next section will tell you more about that.
