Parsing data

A basic analysis of the URL tells us that we are passing in parameters that include the minimum and maximum price but, most importantly, the page number. We can use this in our code and simply change that page number dynamically in a loop to pull additional pages.

Let's try this with some sample code:

url_prefix = "https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page="
page_no = 1
url_suffix = "&sort=hopscore&q=&search=0"

# Build and print the URL for the first three pages
for i in range(3):
    target_page = url_prefix + str(page_no) + url_suffix
    print(target_page)
    page_no += 1

The preceding code results in the following output:

https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page=1&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page=2&sort=hopscore&q=&search=0
https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page=3&sort=hopscore&q=&search=0

This looks like a success. Now we just need to put it all together. We'll start by turning our parsing loop into a proper function that we can call for each page. We do that in the following code:

import numpy as np

def parse_data(listing_divs):
    listing_list = []
    for idx in range(len(listing_divs)):
        indv_listing = []
        current_listing = listing_divs[idx]
        # Pull the listing URL, address, and neighborhood
        href = current_listing.select('a[id*=title]')[0]['href']
        addy = current_listing.select('a[id*=title]')[0].string
        hood = current_listing.select('div[id*=hood]')[0].string.replace('\n', '')

        indv_listing.append(href)
        indv_listing.append(addy)
        indv_listing.append(hood)

        # Pull the rent, beds, and baths from the listing's info table
        listing_specs = current_listing.select('table[id*=info] tr')
        for spec in listing_specs:
            try:
                values = spec.text.strip().replace(' ', '_').split()
                clean_values = [x for x in values if x != '_']
                indv_listing.extend(clean_values)
            except:
                indv_listing.append(np.nan)
        listing_list.append(indv_listing)
    return listing_list

This function takes in a page full of listing_divs and returns the data payload for each one, which we can then keep adding to our master list of apartment data. Notice that there is some additional code in there to validate and remove the erroneous '_' values that get added in the listing_specs loop; without it, bad parsing would add an extra column where there shouldn't be one.
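To see why that cleanup matters, here is a minimal sketch using a made-up spec string (the sample text is an assumption for illustration; the real RentHop markup will differ), showing how lone spaces sitting between newlines turn into stray '_' tokens:

# Hypothetical spec text -- actual listing markup will differ
sample_spec_text = '$2,500\n \n1 Bed\n \n1 Bath'

# The same transformation used inside parse_data()
values = sample_spec_text.strip().replace(' ', '_').split()
print(values)        # ['$2,500', '_', '1_Bed', '_', '1_Bath']

# Spaces that sit alone between newlines become lone '_' tokens,
# so we filter them out before extending the listing row
clean_values = [x for x in values if x != '_']
print(clean_values)  # ['$2,500', '1_Bed', '1_Bath']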

Next, we will build the main loop that will retrieve each page, get the listing_divs, parse out the data points, and finally add all of the info to our final Python list of all data points for each listing. We do that in the following code:

import requests
from bs4 import BeautifulSoup

all_pages_parsed = []
page_no = 1

for i in range(100):
    target_page = url_prefix + str(page_no) + url_suffix
    print(target_page)

    # Retrieve the page and parse the HTML
    r = requests.get(target_page)
    soup = BeautifulSoup(r.content, 'html5lib')

    # Pull out each listing's div and parse its data points
    listing_divs = soup.select('div[class*=search-info]')
    one_page_parsed = parse_data(listing_divs)

    all_pages_parsed.extend(one_page_parsed)

    page_no += 1

Before trying this on 100 pages, you should confirm that it works on a much smaller number, like 3.
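One way to do that dry run, as a minimal sketch that reuses the url_prefix, url_suffix, and parse_data defined above (test_pages_parsed is just a throwaway name for this check), is to reset the page counter and cap the loop at three pages:

# Dry run: reset the counter and pull only the first three pages
page_no = 1
test_pages_parsed = []

for i in range(3):
    target_page = url_prefix + str(page_no) + url_suffix
    r = requests.get(target_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    listing_divs = soup.select('div[class*=search-info]')
    test_pages_parsed.extend(parse_data(listing_divs))
    page_no += 1

print(len(test_pages_parsed))  # roughly 20 listings per page, so around 60 here

If the count and the contents look right, you can bump the range back up to 100.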

You should have noticed each page's URL being printed out as the code ran. If you ran the full 100 pages, you should see roughly 2,000 listings in your all_pages_parsed list.
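To confirm that count and spot-check a single record before moving into pandas, a quick look like the following is enough (the field order should match the columns we assign in the next step):

print(len(all_pages_parsed))   # expect roughly 2,000 records for 100 pages
print(all_pages_parsed[0])     # one listing: url, address, neighborhood, rent, beds, baths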

Let's now move our data into a pandas DataFrame, so that we can work with it more easily. We do that in the following code:

import pandas as pd

df = pd.DataFrame(all_pages_parsed, columns=['url', 'address', 'neighborhood', 'rent', 'beds', 'baths'])

df

The preceding code displays our DataFrame, with one row per listing and the six columns we just named.

Now that we have all our data pulled down, parsed, and incorporated in a DataFrame, let's move on to cleansing and verifying our data.
