2

Organizing Data with Datasets

In his story The Adventure of the Copper Beeches, Arthur Conan Doyle has Sherlock Holmes shout “Data! Data! Data! I cannot make bricks without clay.” This mindset, which served the most famous detective in literature so well, should be adopted by every data scientist. For that reason, we begin the more technical part of this book with a chapter dedicated to data: specifically, in the Kaggle context, leveraging the power of the Kaggle Datasets functionality for our purposes.

In this chapter, we will cover the following topics:

  • Setting up a dataset
  • Gathering the data
  • Working with datasets
  • Using Kaggle Datasets in Google Colab
  • Legal caveats

Setting up a dataset

In principle, you can upload any data you are able to use to Kaggle (subject to limitations; see the Legal caveats section later in this chapter). The specific limits at the time of writing are 100 GB per private dataset and a 100 GB total quota. Keep in mind that the size limit per single dataset is calculated on the uncompressed data; uploading compressed versions speeds up the transfer but does not help against the limits. You can check the most recent documentation for datasets at this link: https://www.kaggle.com/docs/datasets.

Kaggle promotes itself as a “home of data science” and the impressive collection of datasets available from the site certainly lends some credence to that claim. Not only can you find data on topics ranging from oil prices to anime recommendations, but it is also impressive how quickly data ends up there. When the emails of Anthony Fauci were released under the Freedom of Information Act in May 2021 (https://www.washingtonpost.com/politics/interactive/2021/tony-fauci-emails/), they were uploaded as a Kaggle dataset a mere 48 hours later.

Figure 2.1: Trending and popular datasets on Kaggle

Before uploading the data for your project into a dataset, make sure to check the existing content. For several popular applications (image classification, NLP, financial time series), there is a chance that the data you need is already stored there.

For the sake of this introduction, let us assume the kind of data you will be using in your project is not already there, so you need to create a new dataset. When you head to the menu with three lines on the left-hand side and click on Data, you will be redirected to the Datasets page:


Figure 2.2: The Datasets page

When you click on + New Dataset, you will be prompted for the basics: uploading the actual data and giving it a title:


Figure 2.3: Entering dataset details

The icons on the left-hand side correspond to the different sources you can utilize for your dataset. We describe them in the order they are shown on the page:

  • Upload a file from a local drive (shown in the figure)
  • Create from a remote URL
  • Import a GitHub repository
  • Use output files from an existing Notebook
  • Import a Google Cloud Storage file

An important point about the GitHub option: this feature is particularly handy when it comes to experimental libraries. While such libraries frequently offer functionality that is not yet available elsewhere, they are usually not included in the Kaggle environment, so if you want to use one in your code, you can import it as a dataset, as demonstrated below:

  1. Go to Datasets and click + New Dataset.
  2. Select the GitHub icon.
  3. Insert the link to the repository, as well as the title for the dataset.
  4. Click on Create at the bottom right:

Figure 2.4: Dataset from GitHub repository
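
Once the repository has been imported as a dataset and attached to a Notebook, a common pattern is to install the library straight from the input directory. Below is a minimal sketch, assuming a hypothetical dataset named my-experimental-lib whose root contains the repository's setup.py; adjust the path to the folder layout you actually see under /kaggle/input:

    # Inside a Kaggle Notebook cell: install the library from the attached dataset
    # (the dataset and package names are hypothetical examples)
    !pip install /kaggle/input/my-experimental-lib/

    # The package is then importable for the rest of the session
    import my_experimental_lib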

Next to the Create button, there is another one marked Private. By default, any dataset you create is private: only you, its creator, can view and edit it. It is probably a good idea to leave this setting at its default when creating the dataset and only make it public (available to either a select list of contributors, or to everyone) at a later stage.

Keep in mind that Kaggle is a popular platform and many people upload their datasets – including private ones – so try to think of a non-generic title. This will increase the chance of your dataset actually being noticed.

Once you have completed all the steps and clicked Create, voilà! Your first dataset is ready. You can then head to the Data tab:

Figure 2.5: The Data tab

The screenshot above shows the different information you can provide about your dataset; the more of it you provide, the higher the usability index. This index is a synthetic measure summarizing how well your dataset is described. Datasets with higher usability indexes appear higher up in the search results. For each dataset, the usability index is based on several factors, including the level of documentation, the availability of related public content such as Notebooks that reference it, the file types, and the coverage of key metadata.

In principle, you do not have to fill out all the fields shown in the image above; your newly created dataset is perfectly usable without them (and if it is a private one, you probably do not care; after all, you know what is in it). However, community etiquette would suggest filling out the information for the datasets you make public: the more you specify, the more usable the data will be to others.
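
As a side note, the same creation workflow can also be scripted with the official Kaggle API (installed via pip install kaggle), which we will meet again later in this chapter when downloading data into Colab. Here is a minimal sketch, assuming your API token is already configured and your files sit in a local folder named my-dataset:

    # Generate a dataset-metadata.json template inside the folder;
    # fill in at least the title and id fields before uploading
    !kaggle datasets init -p ./my-dataset

    # Upload the folder as a new dataset (it starts out private, as noted above)
    !kaggle datasets create -p ./my-dataset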

Gathering the data

Apart from legal aspects, there is no real limit on the kind of content you can store in a dataset: tabular data, images, text; as long as it fits within the size requirements, you can store it. This includes data harvested from other sources; tweets gathered by hashtag or topic are among the popular datasets at the time of writing:


Figure 2.6: Tweets are among the most popular datasets

Discussion of the different frameworks for harvesting data from social media (Twitter, Reddit, and so on) is outside the scope of this book.

Andrew Maranhão

https://www.kaggle.com/andrewmvd

We spoke to Andrew Maranhão (aka Larxel), Datasets Grandmaster (number 1 in Datasets at time of writing) and Senior Data Scientist at the Hospital Albert Einstein in São Paulo, about his rise to Datasets success, his tips for creating datasets, and his general experiences on Kaggle.

What’s your favourite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

Medical imaging is usually my favourite. It speaks to my purpose and job. Among medical competitions, NLP is language-bound, tabular data varies widely among hospitals, but imaging is mostly the same, so any advancement in this context can bring about benefits for many countries across the world, and I love this impact potential. I also have a liking for NLP and tabular data, but I suppose this is pretty standard.

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

In a competition on tuberculosis detection in X-ray images, we had around 1,000 images, which is a pretty small number for capturing all the manifestations of the disease. I came up with two ideas to offset this:

  1. Pre-train on external data of pneumonia detection (~20k images), as pneumonia can be mistaken for tuberculosis.
  2. Pre-train on multilabel classification of lung abnormalities (~600k images) and use grad-CAM with a simple SSD to generate bounding box annotations of classification labels.

In the end, a simple blend of these two scored 22% higher than the second-place team. The competition took place at a medical convention, with about 100 teams participating.

You have become a Dataset Grandmaster and achieved the number 1 rank in Datasets. How do you choose topics and find, gather, and publish data for your datasets on Kaggle?

This is a big question; I’ll try to break it down piece by piece.

  1. Set yourself a purpose

The first thing that I have in mind when choosing a topic is the reason I am doing this in the first place.

When there is a deeper reason underneath, great datasets just come about as a result, not as a goal in themselves. Fei-Fei Li, the head of the lab that created ImageNet, revealed in a TED talk that she wanted to create a world where machines would be able to reason and appreciate the world with their vision in the same way her children did.

Having a purpose in mind will make it more likely that you’ll engage and improve over time, and will also differentiate you and your datasets. You can certainly live off tabular data on everyday topics, though I find that unlikely to leave a lasting impact.

  2. A great dataset is the embodiment of a great question

If we look at the greatest datasets in current literature, such as ImageNet and others, we can see some common themes:

  • It is a daring, relevant question with great potential for all of us (scientific or real-world application)
  • The data was well collected, controlled for quality, and well documented
  • There is an adequate amount of data and diversity for our current hardware
  • It has an active community that continuously improves the data and/or builds upon that question

As I mentioned before, I feel that asking questions is a primary role of a data scientist and is likely to become even more prominent as automated machine and deep learning solutions advance. This is where datasets can certainly exercise something unique to your skillset.

  3. Create your process for success, rather than only pursuing success for the sake of success

Quality far overshadows quantity; you only need 15 datasets to become a Grandmaster and the flagship datasets of AI are few and well made.

I have thrown away as many datasets as I have published. It takes time, and it is not a one and done type of thing as many people treat it – datasets have a maintenance and continuous improvement side to them.

One thing that is very often overlooked is supporting the community that gathers around your data. Notebooks and datasets are mutual efforts, so supporting those who take the time to analyze your data goes a long way for your dataset too. Analyzing their bottlenecks and choices can point you toward pre-processing steps that could be done and provided, and also toward improving the clarity of your documentation.

All in all, the process that I recommend starts with setting your purpose, breaking it down into objectives and topics, formulating questions to fulfil these topics, surveying possible sources of data, selecting and gathering, pre-processing, documenting, publishing, maintaining and supporting, and finally, improvement actions.

For instance, let’s say that you would like to increase social welfare; you break it down into an objective, say, racial equity. From there, you analyze topics related to the objective and find the Black Lives Matter movement. From here, you formulate the question: how can I make sense of the millions of voices talking about it?

This narrows down your data type to NLP, which you can gather data for from news articles, YouTube comments, and tweets (which you choose, as it seems more representative of your question and feasible). You pre-process the data, removing identifiers, and document the collection process and dataset purpose.

With that done, you publish it, and a few Kagglers attempt topic modeling but struggle to do so because some tweets contain many foreign languages that create encoding problems. You support them by giving them advice and highlighting their work, and decide to go back and narrow the tweets down to English, to fix this for good.

Their analysis reveals the demands, motivations, and fears relating to the movement. With their efforts, it was possible to break down millions of tweets into a set of recommendations that may improve racial equity in society.

  4. Doing a good job is all that is in your control

Ultimately, it is other people that turn you into a Grandmaster, and votes don’t always translate into effort or impact. I worked on one of my datasets, about Cyberpunk 2077, for about 40 hours in total, and to this day it is still one of my least upvoted datasets.

But it doesn’t matter. I put in the effort, I tried, and I learned what I could — that’s what is in my control, and next week I’ll do it again no matter what. Do your best and keep going.

Are there any particular tools or libraries that you would recommend using for data analysis/machine learning?

Strangely enough, I both recommend and unrecommend libraries. LightGBM is a great tabular ML library with a fantastic ratio of performance to compute time. CatBoost can sometimes outperform it, but at the cost of increased compute time, during which you could be generating and testing new ideas. Optuna is great for hyperparameter tuning, Streamlit for frontends, Gradio for MVPs, FastAPI for microservices, Plotly and Plotly Express for charts, and PyTorch and its derivatives for deep learning.

While libraries are great, I also suggest that at some point in your career you take the time to implement it yourself. I first heard this advice from Andrew Ng and then from many others of equal calibre. Doing this creates very in-depth knowledge that sheds new light on what your model does and how it responds to tuning, data, noise, and more.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

Over the years, the things I most wished I had realized sooner were:

  1. Absorbing all the knowledge at the end of a competition
  2. Replication of winning solutions in finished competitions

Under the pressure of a competition drawing to a close, you can see the leaderboard shaking more than ever before. This makes it less likely that you will take risks and take the time to see things in all their detail. When a competition is over, you don’t have that rush and can take as long as you need; you can also replicate the rationale of the winners who made their solutions known.

If you have the discipline, this will do wonders for your data science skills, so the bottom line is: stop when you are done, not when the competition ends. I have also heard this advice in an Andrew Ng keynote, where he recommended replicating papers as one of the best ways to develop yourself as an AI practitioner.

Also, at the end of a competition, you are likely to be exhausted and just want to call it a day. No problem there; just keep in mind that the discussion forum after the competition is done is one of the most knowledge-rich places on Planet Earth, primarily because many rationales and code for winning solutions are made public there. Take the time to read and study what the winners did; don’t give in to the desire to move on to something else, as you might miss a great learning opportunity.

Has Kaggle helped you in your career? If so, how?

Kaggle helped my career by providing a wealth of knowledge and experience, and also by helping me build my portfolio. My first job as a data scientist was largely due to Kaggle and DrivenData competitions. All throughout my career, I studied competition solutions and participated in a few more. Further engagement with Datasets and Notebooks also proved very fruitful in learning new techniques and asking better questions.

In my opinion, asking great questions is the primary challenge faced by a data scientist. Answering them is surely great as well, although I believe we are not far from a future where automated solutions will be more and more prevalent in modeling. There will always be room for modeling, but I suppose a lot of work will be streamlined in that regard. Asking great questions, however, is far harder to automate – if the question is not good, even the best solution could be meaningless.

Have you ever used something you have done in Kaggle competitions in order to build your portfolio to show to potential employers?

Absolutely. I landed my first job as a data scientist in 2017 using Kaggle as proof of knowledge. To this day, it is still a fantastic CV component, as educational backgrounds and degrees are less representative of data science knowledge and experience than a portfolio is.

A portfolio with competition projects shows not just added experience but also a willingness to go above and beyond for your development, which is arguably more important for long-term success.

Do you use other competition platforms? How do they compare to Kaggle?

I also use DrivenData and AICrowd. The great thing about them is that they allow organizations that don’t have the same access to financial resources, such as start-ups and research institutions, to create competitions.

Great competitions come from a combination of great questions and great data, and this can happen regardless of company size. Kaggle has a bigger and more active community, and the hardware it provides, coupled with the data and Notebook capabilities, makes it the best option; yet both DrivenData and AICrowd introduce challenges that are just as interesting and allow for more diversity.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

Assuming your primary goal is development, my recommendation is that you pick a competition on a topic that interests you and a task that you haven’t done before. Critical sense and competence require depth and diversity. Focusing and giving your best will guarantee depth, and diversity is achieved by doing things you have not done before or have not done in the same way.

Working with datasets

Once you have created a dataset, you probably want to use it in your analysis. In this section, we discuss different methods of going about this.

Very likely, the most important one is starting a Notebook where you use your dataset as a primary source. You can do this by going to the dataset page and then clicking on New Notebook:

Figure 2.7: Creating a Notebook from the dataset page

Once you have done this, you will be redirected to your Notebook page:


Figure 2.8: Starting a Notebook using your dataset

Here are a few pointers around this:

  • The alphanumeric title is generated automatically; you can edit it by clicking on it.
  • On the right-hand side under Data, you see the list of data sources attached to your Notebook; the dataset I selected can be accessed under ../input/ or from /kaggle/input/.
  • The opening block (with the imported packages, descriptive comments, and printing the list of available files) is added automatically to a new Python Notebook; a typical version is sketched below.
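
As a reference, here is a minimal sketch of what that opening block, followed by a first read of your data, typically looks like; the dataset folder and CSV filename in the last line are hypothetical:

    import numpy as np   # linear algebra
    import pandas as pd  # data processing, CSV file I/O
    import os

    # List every file available under the attached data sources
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

    # Read one of the listed files (the path shown is a hypothetical example)
    df = pd.read_csv('/kaggle/input/my-dataset/my-file.csv')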

With this basic setup, you can start to write a Notebook for your analysis and utilize your dataset as a data source. We will discuss Notebooks at greater length in the next chapter, which is dedicated to Kaggle Notebooks.

Using Kaggle Datasets in Google Colab

Kaggle Notebooks are free to use, but not without limits (more on that in the next chapter), and the first one you are likely to hit is the time limit. A popular alternative is to move to Google Colab, a free Jupyter Notebook environment that runs entirely in the cloud: https://colab.research.google.com.

Even once we’ve moved the computations there, we might still want to have access to the Kaggle datasets, so importing them into Colab is a rather handy feature. The remainder of this section discusses the steps necessary to use Kaggle Datasets through Colab.

The first thing we do, assuming we are already registered on Kaggle, is head to the account page to generate the API token (an access token containing security credentials for a login session, user identification, privileges, and so on):

  1. Go to your account, which can be found at https://www.kaggle.com/USERNAME/account, and click on Create New API Token:

Figure 2.9: Creating a new API token

A file named kaggle.json containing your username and token will be created.
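
For reference, kaggle.json is a small JSON file of roughly the following shape (the values shown are placeholders):

    {"username": "YOUR_KAGGLE_USERNAME", "key": "YOUR_API_KEY"}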

  2. The next step is to create a folder named Kaggle in your Google Drive and upload the .json file there:

Figure 2.10: Uploading the .json file into Google Drive

  3. Once done, you need to create a new Colab notebook and mount your drive by running the following code in the notebook:
    from google.colab import drive
    drive.mount('/content/gdrive')
    
  4. Get the authorization code from the URL prompt and provide it in the empty box that appears, and then execute the following code to provide the path to the .json config:
    import os
    # /content/gdrive/My Drive/Kaggle is the path where kaggle.json is
    # present in the Google Drive
    os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
    # change the working directory
    %cd /content/gdrive/My Drive/Kaggle
    # check the present working directory using the pwd command
    !pwd
    
  5. We can download the dataset now. Begin by going to the dataset’s page on Kaggle, clicking on the three dots next to New Notebook, and selecting Copy API command:

Figure 2.11: Copying the API command

  6. Run the API command to download the dataset (readers interested in the details of the commands used can consult the official documentation: https://www.kaggle.com/docs/api):
    !kaggle datasets download -d ajaypalsinghlo/world-happiness-report-2021
    
  7. The dataset will be downloaded to the Kaggle folder as a .zip archive – unpack it and you are good to go; a minimal sketch of this final step is shown below.
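
For completeness, here is one way the final unpacking step and a first look at the data might go; the exact CSV filename inside the archive is an assumption:

    # Unpack the downloaded archive into the current directory
    !unzip -q world-happiness-report-2021.zip

    # Load the data; the filename inside the archive is an assumption
    import pandas as pd
    df = pd.read_csv("world-happiness-report-2021.csv")
    df.head()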

As you can see from the list above, using a Kaggle dataset in Colab is a straightforward process – all you need is an API token, and making the switch gives you the possibility of using more GPU hours than what is granted by Kaggle.

Legal caveats

Just because you can put some data on Kaggle does not necessarily mean that you should. An excellent example would be the People of Tinder dataset. In 2017, a developer used the Tinder API to scrape the website for semi-private profiles and uploaded the data on Kaggle. After the issue became known, Kaggle ended up taking the dataset down. You can read the full story here: https://www.forbes.com/sites/janetwburns/2017/05/02/tinder-profiles-have-been-looted-again-this-time-for-teaching-ai-to-genderize-faces/?sh=1afb86b25454.

In general, before you upload anything to Kaggle, ask yourself two questions:

  1. Is it allowed from a copyright standpoint? Remember to always check the licenses. When in doubt, you can always consult https://opendefinition.org/guide/data/ or contact Kaggle.
  2. Are there privacy risks associated with this dataset? Even if posting certain types of information is not, strictly speaking, illegal, doing so might still harm another person’s privacy.

These limitations are largely a matter of common sense, so they are unlikely to hamper your efforts on Kaggle.

Summary

In this chapter, we introduced Kaggle Datasets, the standardized manner of storing and using data on the platform. We discussed dataset creation, ways of working with datasets outside of Kaggle, and the most important functionality: using a dataset in your Notebook. This provides a good segue to our next chapter, where we focus our attention on Kaggle Notebooks.

Join our book’s Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:

https://packt.link/KaggleDiscord
