The BMW website has a search tool to find local dealerships, available at https://www.bmw.de/de/home.html?entryType=dlo:
This tool takes a location and then displays the points near it on a map, such as this search for Berlin:
Using Firebug, we find that the search triggers this AJAX request:
https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/clients/BMWDIGITAL_DLO/DE/pois?country=DE&category=BM&maxResults=99&language=en&lat=52.507537768880056&lng=13.425269635701511
Here, the maxResults parameter is set to 99. However, we can increase this to download all locations in a single query, a technique covered in Chapter 1, Introduction to Web Scraping. Here is the result when maxResults is increased to 1000:
>>> url = 'https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/clients/BMWDIGITAL_DLO/DE/pois?country=DE&category=BM&maxResults=%d&language=en&lat=52.507537768880056&lng=13.425269635701511'
>>> jsonp = D(url % 1000)
>>> jsonp
'callback({"status":{ ... })'
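Formatting the number into a long URL string works, but the query string can also be assembled from its parts, which makes it easier to see and change individual parameters such as maxResults. The following is a minimal sketch, assuming Python 3 and only the standard library. The build_poi_url helper is a hypothetical name introduced here for illustration; the endpoint and parameter values are copied from the request shown above.

```python
from urllib.parse import urlencode

# Endpoint copied from the AJAX request captured in the browser.
BASE_URL = ('https://c2b-services.bmw.com/c2b-localsearch/services/api/v3/'
            'clients/BMWDIGITAL_DLO/DE/pois')


def build_poi_url(max_results,
                  lat='52.507537768880056',
                  lng='13.425269635701511'):
    """Hypothetical helper: build the dealer search URL for a given maxResults.

    The coordinates are passed as strings so that they appear in the URL
    exactly as they did in the original request.
    """
    params = [
        ('country', 'DE'),
        ('category', 'BM'),
        ('maxResults', max_results),
        ('language', 'en'),
        ('lat', lat),
        ('lng', lng),
    ]
    return BASE_URL + '?' + urlencode(params)


url = build_poi_url(1000)
```

The resulting url can then be passed to whichever download function is in use. Whether the server honors an arbitrarily large maxResults is something to check against the response rather than assume.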
This AJAX request provides the data in JSONP format, which stands for JSON with padding. The padding is usually a function call that takes the pure JSON data as its argument, in this case a call to the callback function. To parse this data with Python's json module, we first need to strip this padding:
>>> import json
>>> pure_json = jsonp[jsonp.index('(') + 1 : jsonp.rindex(')')]
>>> dealers = json.loads(pure_json)
>>> dealers.keys()
[u'status', u'count', u'translation', u'data', u'metadata']
>>> dealers['count']
731
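The same slicing can be wrapped in a small function so that it can be reused and tried out without touching the network. This is a sketch only: strip_jsonp is a hypothetical helper name, and the sample string is made up for illustration and is not a real response from the BMW service.

```python
import json


def strip_jsonp(jsonp):
    """Hypothetical helper: return the JSON text inside a JSONP wrapper.

    Takes everything between the first '(' and the last ')', so brackets
    that appear inside the JSON data itself are left untouched.
    """
    start = jsonp.index('(') + 1
    end = jsonp.rindex(')')
    return jsonp[start:end]


# Made-up sample in the same shape as a JSONP response.
sample = 'callback({"count": 2, "data": {"pois": []}})'
parsed = json.loads(strip_jsonp(sample))
```

Using the first opening bracket and the last closing bracket, as the original snippet does, means the function also copes with a trailing semicolon after the call and with brackets inside string values.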
We now have all the German BMW dealers loaded into a Python dictionary; currently, there are 731 of them. Here is the data for the first dealer:
>>> dealers['data']['pois'][0]
{u'attributes': {u'businessTypeCodes': [u'NO', u'PR'],
                 u'distributionBranches': [u'T', u'F', u'G'],
                 u'distributionCode': u'NL',
                 u'distributionPartnerId': u'00081',
                 u'fax': u'+49 (30) 20099-2110',
                 u'homepage': u'http://bmw-partner.bmw.de/niederlassung-berlin-weissensee',
                 u'mail': u'[email protected]',
                 u'outletId': u'3',
                 u'outletTypes': [u'FU'],
                 u'phone': u'+49 (30) 20099-0',
                 u'requestServices': [u'RFO', u'RID', u'TDA'],
                 u'services': []},
 u'category': u'BMW',
 u'city': u'Berlin',
 u'country': u'Germany',
 u'countryCode': u'DE',
 u'dist': 6.65291036632401,
 u'key': u'00081_3',
 u'lat': 52.562568863415,
 u'lng': 13.463589476607,
 u'name': u'BMW AG Niederlassung Berlin Filiale Wei\xdfensee',
 u'postalCode': u'13088',
 u'street': u'Gehringstr. 20'}
We can now save the data of interest. Here is a snippet to write the name, latitude, and longitude of these dealers to a spreadsheet:
import csv

with open('bmw.csv', 'w') as fp:
    writer = csv.writer(fp)
    writer.writerow(['Name', 'Latitude', 'Longitude'])
    for dealer in dealers['data']['pois']:
        # The Python 2 csv module expects byte strings, so encode the name
        name = dealer['name'].encode('utf-8')
        lat, lng = dealer['lat'], dealer['lng']
        writer.writerow([name, lat, lng])
After running this example, the contents of the bmw.csv spreadsheet will look similar to this:
Name,Latitude,Longitude
BMW AG Niederlassung Berlin Filiale Weißensee,52.562568863415,13.463589476607
Autohaus Graubaum GmbH,52.4528925,13.521265
Autohaus Reier GmbH & Co. KG,52.56473,13.32521
...
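The snippet in this section targets Python 2, where the dealer name has to be encoded by hand before it is written. As a rough sketch of how the same step might look in Python 3, where the csv module works with text strings directly, consider the following. The write_dealers function is a hypothetical helper, and the two sample records reuse the values from the output shown above so that the example runs without a network request.

```python
import csv
import io


def write_dealers(pois, fp):
    """Hypothetical helper: write name, latitude and longitude rows to fp."""
    writer = csv.writer(fp)
    writer.writerow(['Name', 'Latitude', 'Longitude'])
    for dealer in pois:
        # No manual encoding here; the file object handles the encoding
        writer.writerow([dealer['name'], dealer['lat'], dealer['lng']])


# Two records with the same keys as the entries in dealers['data']['pois'].
sample_pois = [
    {'name': 'BMW AG Niederlassung Berlin Filiale Weißensee',
     'lat': 52.562568863415, 'lng': 13.463589476607},
    {'name': 'Autohaus Graubaum GmbH',
     'lat': 52.4528925, 'lng': 13.521265},
]

# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_dealers(sample_pois, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, the file would be opened with open('bmw.csv', 'w', newline='', encoding='utf-8') and passed to write_dealers in place of the buffer. The newline='' argument is what the csv module documentation recommends, so that the writer controls the line endings itself.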
The full source code for scraping this data from BMW is available at https://bitbucket.org/wswp/code/src/tip/chapter09/bmw.py.
Translating foreign content
You may have noticed that the first screenshot for BMW was in German, but the second was in English. This is because the text for the second was translated using the Google Translate browser extension. This is a useful technique when trying to understand how to navigate a website in a foreign language. When the BMW website is translated, the website still works as usual. Be aware, though, that Google Translate will break some websites, for example, if the content of a select box is translated and a form depends on the original value.
Google Translate is available as the Google Translate extension for Chrome, as the Google Translator add-on for Firefox, and as part of the Google Toolbar for Internet Explorer. Alternatively, http://translate.google.com can be used for translations; however, this often breaks functionality because Google is hosting the content.