Recipe 1. Text data collection using APIs
Recipe 2. Reading PDF files in Python
Recipe 3. Reading Word documents
Recipe 4. Reading JSON objects
Recipe 5. Reading HTML page and HTML parsing
Recipe 6. Regular expressions
Recipe 7. String handling
Recipe 8. Web scraping
Introduction
Before getting into the details of the book, let's look at the different data sources that are generally available. We need to identify potential data sources that can benefit a business.
Client Data
For any problem statement, one of the sources is the client's own data that is already present. Where that data is stored depends on the type of business, the amount of data, and the cost associated with the different storage options. Typical options include:
SQL databases
Hadoop clusters
Cloud storage
Flat files
Free sources
A huge amount of data is freely available on the internet. We just need to narrow down the problem and start exploring the many free data sources, for example:
Free APIs like Twitter
Wikipedia
Government data (e.g. http://data.gov )
Census data (e.g. http://www.census.gov/data.html )
Health care claim data (e.g. https://www.healthdata.gov/ )
Web scraping
Web scraping means extracting content/data from websites, blogs, forums, and retail sites (e.g., for product reviews), with permission from the respective sources, using web scraping packages in Python.
There are a lot of other sources like crime data, accident data, and economic data that can also be leveraged for analysis based on the problem statement.
Recipe 1-1. Collecting Data
As discussed, there are a lot of free APIs through which we can collect data and use it to solve problems. We will discuss the Twitter API in particular (the same approach can be used with other APIs as well).
Problem
You want to collect text data using Twitter APIs.
Solution
Twitter has a gigantic amount of data with a lot of value in it. Social media marketers make their living from it. An enormous number of tweets are posted every day, and every tweet has some story to tell. When all of this data is collected and analyzed, it gives a business tremendous insight into their company, product, service, etc.
Let’s see how to pull the data in this recipe and then explore how to leverage it in coming chapters.
How It Works
Step 1-1 Log in to the Twitter developer portal
Create your own app in the Twitter developer portal and collect the following credentials; they are used to authenticate your API requests.
Consumer key: the key associated with the application (Twitter, Facebook, etc.).
Consumer secret: the password used to authenticate with the authentication server (Twitter, Facebook, etc.).
Access token: the key given to the client after successful authentication of the above keys.
Access token secret: the password for the access key.
Step 1-2 Execute the below query in Python
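Below is a minimal sketch using the tweepy library; the four credential strings are placeholders for the keys from Step 1-1, and "ABC" stands for whatever product you want to search for. Note that tweepy 4.x names the search method search_tweets, while older releases call it search.

# Import the tweepy library (install it with: !pip install tweepy)
import tweepy

# Placeholder credentials from the Twitter developer portal
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate with the credentials obtained in Step 1-1
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Pull the top 10 English tweets about the product "ABC", excluding retweets
# (in tweepy versions before 4.0, use api.search instead of api.search_tweets)
tweets = api.search_tweets(q="ABC -filter:retweets", lang="en", count=10)
for tweet in tweets:
    print(tweet.text)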
The query above will pull the top 10 tweets when the product ABC is searched. The API will pull English tweets since the language given is ‘en’ and it will exclude retweets.
Recipe 1-2. Collecting Data from PDFs
Most of the time your data will be stored as PDF files. We need to extract text from these files and store it for further analysis.
Problem
You want to read a PDF file.
Solution
The simplest way to do this is by using the PyPDF2 library.
How It Works
Let’s follow the steps in this section to extract data from PDF files.
Step 2-1 Install and import all the necessary libraries
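A minimal sketch of this step; the ! prefix runs pip inside a Jupyter notebook (drop it in a terminal).

!pip install PyPDF2

import PyPDF2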
Note
You can download any PDF file from the web and place it in the location where you are running this Jupyter notebook or Python script.
Step 2-2 Extracting text from PDF file
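Below is a minimal sketch, assuming a file named "sample.pdf" (a placeholder; use any PDF you downloaded) in your working directory. Recent PyPDF2 releases expose PdfReader; older releases use PdfFileReader with getPage() and extractText() instead.

# Extract text from a PDF, page by page
from PyPDF2 import PdfReader  # PdfFileReader in older PyPDF2 releases

def extract_pdf_text(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

print(extract_pdf_text("sample.pdf"))  # "sample.pdf" is a placeholder filename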
Please note that the function above doesn’t work for scanned PDFs.
Recipe 1-3. Collecting Data from Word Files
Next, let us look at another small recipe: reading Word files in Python.
Problem
You want to read Word files.
Solution
The simplest way to do this is by using the python-docx library (imported as docx).
How It Works
Let’s follow the steps in this section to extract data from the Word file.
Step 3-1 Install and import all the necessary libraries
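A minimal sketch of this step; note that the package installs as python-docx but is imported as docx.

!pip install python-docx

import docx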
Note
You can download any Word file from the web and place it in the location where you are running this Jupyter notebook or Python script.
Step 3-2 Extracting text from a Word file
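Below is a minimal sketch, assuming a file named "sample.docx" (a placeholder; use any Word file you downloaded) in your working directory.

import docx

# Open the Word document and join the text of all its paragraphs
doc = docx.Document("sample.docx")  # "sample.docx" is a placeholder filename
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
print(text)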
Recipe 1-4. Collecting Data from JSON
In this recipe, let us look at reading a JSON file/object.
Problem
You want to read a JSON file/object.
Solution
The simplest way to do this is by using the requests and json libraries.
How It Works
Let’s follow the steps in this section to extract data from the JSON.
Step 4-1 Install and import all the necessary libraries
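A minimal sketch of this step; json ships with Python, so only requests needs installing.

!pip install requests

import json
import requests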
Step 4-2 Extracting text from JSON file
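Below is a minimal sketch showing both cases: fetching a JSON object from a web API and reading a local JSON file. The GitHub URL is just an example of a public endpoint that returns JSON, and "sample.json" is a placeholder filename.

import json
import requests

# Fetch a JSON response from a web API and convert it to a dictionary
response = requests.get("https://api.github.com")  # example public endpoint
data = response.json()
print(data)

# Read a local JSON file instead
with open("sample.json") as f:  # "sample.json" is a placeholder filename
    data = json.load(f)
print(data)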
Recipe 1-5. Collecting Data from HTML
In this recipe, let us look at reading HTML pages.
Problem
You want to parse/read HTML pages.
Solution
The simplest way to do this is by using the bs4 library.
How It Works
Let’s follow the steps in this section to extract data from the web.
Step 5-1 Install and import all the necessary libraries
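A minimal sketch of this step; the package installs as beautifulsoup4 and is imported from bs4.

!pip install beautifulsoup4

from bs4 import BeautifulSoup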
Step 5-2 Fetch the HTML file
Step 5-3 Parse the HTML file
Step 5-4 Extracting tag value
Step 5-5 Extracting all instances of a particular tag
Step 5-6 Extracting all text of a particular tag
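The sketch below walks through Steps 5-2 to 5-6. The Wikipedia URL is just an example; any public HTML page works.

import urllib.request
from bs4 import BeautifulSoup

# Step 5-2: fetch the HTML file
url = "https://en.wikipedia.org/wiki/Natural_language_processing"  # example page
html = urllib.request.urlopen(url).read()

# Step 5-3: parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 5-4: extract a tag's value (here, the page title)
print(soup.title.string)

# Step 5-5: extract all instances of a particular tag
paragraphs = soup.find_all("p")
print(len(paragraphs))

# Step 5-6: extract all the text of a particular tag
for paragraph in paragraphs:
    print(paragraph.get_text())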
As you can observe, using the 'p' tag extracted most of the text present on the page.
Recipe 1-6. Parsing Text Using Regular Expressions
In this recipe, we are going to discuss how regular expressions are helpful when dealing with text data. Regular expressions are very much required when dealing with raw data from the web, which contains HTML tags, long text, and repeated text. We don't need such data, either while developing an application or in its output.
We can do all sorts of basic and advanced data cleaning using regular expressions.
Problem
You want to parse text data using regular expressions.
Solution
The best way to do this is by using the “re” library in Python.
How It Works
re.I: This flag is used for case-insensitive matching.
re.L: This flag makes the match locale dependent (it affects \w, \b, \s, and similar classes).
re.M: This flag is useful if you want to find patterns across multiple lines (^ and $ match at the start and end of each line).
re.S: This flag makes the dot (.) match any character, including a newline.
re.U: This flag is used to work with Unicode data.
re.X: This flag is used for writing regex in a more readable (verbose) format.
Find a single occurrence of the characters a and b:
Regex: [ab]

Find characters except a and b:
Regex: [^ab]

Find the character range a to z:
Regex: [a-z]

Find any range except a to z:
Regex: [^a-z]

Find all the characters a to z as well as A to Z:
Regex: [a-zA-Z]

Any single character:
Regex: .

Any whitespace character:
Regex: \s

Any non-whitespace character:
Regex: \S

Any digit:
Regex: \d

Any non-digit:
Regex: \D

Any non-word character:
Regex: \W

Any word character:
Regex: \w

Match either a or b:
Regex: (a|b)

The occurrence of a is either zero or one time (? matches zero or one occurrence, but not more than one):
Regex: a?

The occurrence of a is zero or more times (* matches zero or more occurrences):
Regex: a*

The occurrence of a is one or more times (+ matches one or more occurrences):
Regex: a+

Exactly three occurrences of a:
Regex: a{3}

Three or more occurrences of a:
Regex: a{3,}

Between three and six occurrences of a:
Regex: a{3,6}

Start of the string:
Regex: ^

End of the string:
Regex: $

Word boundary:
Regex: \b

Non-word boundary:
Regex: \B
The re.match() and re.search() functions are used to find patterns, which can then be processed according to the requirements of the application.
re.match(): This checks for a match only at the beginning of the string. So, if it finds the pattern at the beginning of the input string, it returns a match object; otherwise, it returns None.
re.search(): This checks for a match anywhere in the string. It finds the first occurrence of the pattern in the given input string or data.
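A short illustration of the difference, using a made-up sentence:

import re

sentence = "NLP is fun"
print(re.match(r"fun", sentence))   # None, since "fun" is not at the beginning
print(re.search(r"fun", sentence))  # a match object, since "fun" occurs later
print(re.match(r"NLP", sentence))   # a match object, since "NLP" is at the beginning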
Now let’s look at a few of the examples using these regular expressions.
Tokenizing
You can use the re.split function to split a sentence into words. For an explanation of the regex, please refer to the main recipe.
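A minimal sketch using re.split on a made-up sentence; \s+ matches one or more whitespace characters.

import re

# Split the sentence into words wherever whitespace occurs
sentence = "This is my book on NLP"
print(re.split(r"\s+", sentence))

# Output
['This', 'is', 'my', 'book', 'on', 'NLP']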
Extracting email IDs
- 1. Read/create the document or sentences:

doc = "For more details please mail us at: [email protected], [email protected]"

- 2. Execute the re.findall function:

addresses = re.findall(r'[\w.-]+@[\w.-]+', doc)
for address in addresses:
    print(address)
Replacing email IDs
- 1. Read/create the document or sentences.
- 2. Execute the re.sub function, as shown in the sketch below.
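A minimal sketch; the sample addresses and the replacement address newemail@example.com are placeholders of our own, not from the original text.

import re

# 1. Read/create the document or sentences
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"  # placeholder addresses

# 2. Execute the re.sub function to swap in the new address
new_doc = re.sub(r'[\w.-]+@[\w.-]+', 'newemail@example.com', doc)
print(new_doc)

# Output
For more details please mail us at: newemail@example.com, newemail@example.com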
For an explanation of regex, please refer to Recipe 1-6.
Extract data from the ebook and perform regex
- 1. Extract the content from the book:

# Import libraries
import re
import requests

# URL you want to extract
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

# Function to extract the book text
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Discards the metadata from the end of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

# Processing
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Calling the above functions
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)

# Output
produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when da
- 2. Perform some exploratory data analysis on this data using regex:

# Count the number of times "the" appears in the book
len(re.findall(r'the', processed_book))

# Output
302

# Replace "i" with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

# Output
produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when da

# Find all occurrences of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

# Output
['ironical--it', 'malicious--smile', 'fur--or', 'astrachan--overcoat', 'it--the', 'Italy--was', 'malady--a', 'money--and', 'little--to', 'No--Mr', 'is--where', 'I--I', 'I--', '--though', 'crime--we', 'or--judge', 'gaiters--still', '--if', 'through--well', 'say--through', 'however--and', 'Epanchin--oh', 'too--at', 'was--and', 'Andreevitch--that', 'everyone--that', 'reduce--or', 'raise--to', 'listen--and', 'history--but', 'individual--one', 'yes--I', 'but--', 't--not', 'me--then', 'perhaps--', 'Yes--those', 'me--is', 'servility--if', 'Rogojin--hereditary', 'citizen--who', 'least--goodness', 'memory--but', 'latter--since', 'Rogojin--hung', 'him--I', 'anything--she', 'old--and', 'you--scarecrow', 'certainly--certainly', 'father--I', 'Barashkoff--I', 'see--and', 'everything--Lebedeff', 'about--he', 'now--I', 'Lihachof--', 'Zaleshoff--looking', 'old--fifty', 'so--and', 'this--do', 'day--not', 'that--', 'do--by', 'know--my', 'illness--I', 'well--here', 'fellow--you']
Recipe 1-7. Handling Strings
In this recipe, we are going to discuss how to handle strings and deal with text data.
We can do all sorts of basic text explorations using string operations.
Problem
You want to explore handling strings.
Solution
The simplest way to do this is by using the below string functionality.
s.find(t): index of the first instance of string t inside s (-1 if not found)
s.rfind(t): index of the last instance of string t inside s (-1 if not found)
s.index(t): like s.find(t), except it raises ValueError if not found
s.rindex(t): like s.rfind(t), except it raises ValueError if not found
s.join(text): combine the words of the text into a string using s as the glue
s.split(t): split s into a list wherever a t is found (whitespace by default)
s.splitlines(): split s into a list of strings, one per line
s.lower(): a lowercased version of the string s
s.upper(): an uppercased version of the string s
s.title(): a titlecased version of the string s
s.strip(): a copy of s without leading or trailing whitespace
s.replace(t, u): replace instances of t with u inside s
How It Works
Now let us look at a few of the examples.
Replacing content
- 1. Creating a string:

String_v1 = "I am exploring NLP"

# To extract a particular character or range of characters from a string
print(String_v1[0])

# Output
I

# To extract "exploring"
print(String_v1[5:14])

# Output
exploring

- 2. Replace "exploring" with "learning" in the above string:

String_v2 = String_v1.replace("exploring", "learning")
print(String_v2)

# Output
I am learning NLP
Concatenating two strings
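A minimal sketch with made-up strings:

# Concatenate two strings with the + operator
s1 = "nlp"
s2 = "machine learning"
s3 = s1 + " " + s2
print(s3)

# Output
nlp machine learning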
Searching for a substring in a string
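A minimal sketch using s.find(t) from the table above:

# find returns the index of the first occurrence, or -1 if the substring is absent
var = "I am learning NLP"
print(var.find("learn"))

# Output
5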
Recipe 1-8. Scraping Text from the Web
In this recipe, we are going to discuss how to scrape data from the web.
Caution
Before scraping any website, blog, or e-commerce site, please make sure you read its terms and conditions to check whether it gives permission for data scraping.
So, what is web scraping, also called web harvesting or web data extraction?
It is a technique to extract a large amount of data from websites and save it in a database or locally. You can use this data to extract information related to your customers/users/products for the business’s benefit.
Prerequisite: Basic understanding of HTML structure.
Problem
You want to extract data from the web by scraping. Here we have taken the example of the IMDB website for scraping top movies.
Solution
The simplest way to do this is by using the Beautiful Soup or Scrapy library in Python. Let's use Beautiful Soup in this recipe.
How It Works
Let’s follow the steps in this section to extract data from the web.
Step 8-1 Install all the necessary libraries
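A minimal sketch of this step; the ! prefix runs pip inside a Jupyter notebook (drop it in a terminal).

!pip install beautifulsoup4
!pip install requests
!pip install pandas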
Step 8-2 Import the libraries
Step 8-3 Identify the URL to extract the data
Step 8-4 Request the URL and download the content using Beautiful Soup
Step 8-5 Understand the website page structure to extract the required information
Go to the website and right-click on the page content to inspect the html structure of the website.
Identify the data and fields you want to extract. Say, for example, we want the movie name and IMDB rating from this page.
Check under which div or class the movie names are present in the HTML, and parse the soup accordingly.
In the below example, to extract the movie name, we can parse our soup through <table class="chart full-width"> and <td class="titleColumn">.
Similarly, we can fetch the other details. For more details, please refer to the code in step 8-6.
Step 8-6 Use beautiful soup to extract and parse the data from HTML tags
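Below is a minimal sketch covering Steps 8-2 through 8-6, assuming the IMDB Top 250 chart still uses the table and class names described above; adjust the tag names if the page structure has changed since.

import requests
from bs4 import BeautifulSoup

# Step 8-3: the URL of the IMDB Top 250 chart
url = "https://www.imdb.com/chart/top"

# Step 8-4: request the URL and parse the content with Beautiful Soup
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Step 8-6: walk the chart table and collect movie names and ratings
movie_names, ratings = [], []
table = soup.find("table", class_="chart full-width")
for row in table.find_all("tr")[1:]:
    title_column = row.find("td", class_="titleColumn")
    rating_column = row.find("td", class_="ratingColumn imdbRating")
    if title_column and rating_column and rating_column.strong:
        movie_names.append(title_column.a.text)
        ratings.append(float(rating_column.strong.text))

print(movie_names[:5])
print(ratings[:5])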
If your request to the URL fails, you may need to try again after some time; this is common in web scraping.
Web pages are dynamic, and the HTML tags of websites keep changing. Understand the tags and make small changes to the code in accordance with the HTML, and you are good to go.
Step 8-7 Convert the lists to a data frame, and perform the analysis that meets the business requirements
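Continuing the sketch above, combine the two lists collected in Step 8-6 into a pandas data frame:

import pandas as pd

top_movies = pd.DataFrame({"movie_name": movie_names, "rating": ratings})
print(top_movies.head())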
Step 8-8 Download the data frame
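Save the data frame locally; the filename is a placeholder.

# Write the data frame to a CSV file for later analysis
top_movies.to_csv("imdb_top_movies.csv", index=False)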
We have implemented most of the ways and techniques to extract text data from possible sources. In the coming chapters, we will look at how to explore, process, and clean this data, followed by feature engineering and building NLP applications.