Web scraping with Python 3 using requests and BeautifulSoup
pip install -r requirements.txt
requirements.txt
requests==2.19.1
beautifulsoup4==4.6.3
The requests module handles requesting the url and fetching the response, while bs4 (beautifulsoup4) makes parsing the returned html for scraping easier.
Once you have the requirements installed you can simply import them and start. For now we will be requesting Yelp and playing around with our modules.
requesting_yelp.py
import requests
from bs4 import BeautifulSoup
You can visit Yelp and search for anything in the search bar; for example, we searched for restaurants and it returned Best Restaurants in San Francisco, CA.
Copy the url from the browser and paste it into your file
requesting_yelp.py
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1"
Now requests comes into play. Since the browser makes a GET request for this url and gets back the whole html response in the form of a new webpage, we are going to do the same with requests.get(url) and store the result.
requesting_yelp.epy
response = requests.get(url)
Now response contains the result returned from the GET request to the url. We can call methods on response; for a start, you can print response itself and its status code,
print(response)
print(response.status_code)
run the file
cmd
.../python_scraping_web> py requesting_yelp.py
<Response [200]>
200
200. The HTTP 200 OK success status response code indicates that the request has succeeded.
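In general, any code in the 2xx range means success, so it is worth guarding on it before parsing further. Here is a minimal sketch; the helper name is our own invention (requests itself also offers response.ok and response.raise_for_status() for this purpose):

```python
def is_success(status_code):
    """Return True for HTTP status codes in the 2xx success range."""
    return 200 <= status_code < 300

# Only 2xx codes count as success
print(is_success(200))  # True
print(is_success(404))  # False
```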
To actually print the whole html that the webpage contains we can print
print(response.text)
Earlier versions of python requests used to print the html from response.text in an ugly way, but printing it now gives reasonably readable html; we can also use the bs4 module to prettify it.
For that we need to create a BeautifulSoup object by passing in the text returned from the url (along with a parser such as 'html.parser'),
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<img height="1" src="https://www.facebook.com/tr?id=102029836881428&ev=PageView&noscript=1" style="display:none" width="1">
</img>
</noscript>
<script>
 (function() { var main = null; var main=function(){var c=Math.random()+"";var b=c*10000000000000;document.write('<iframe src="https://6372968.fls.doubleclick.net/activityi;src=6372968;type=invmedia;cat=qr3hlsqk;dc_lat=;dc_rdid=;tag_for_child_directed_treatment=;ord='+b+'?" width="1" height="1" frameborder="0" style="display:none"></iframe>')}; if (main === null) { throw 'invalid inline script, missing main declaration.'; } main(); })();
</script>
<noscript>
 <iframe frameborder="0" height="1" src="https://6372968.fls.doubleclick.net/activityi;src=6372968;type=invmedia;cat=qr3hlsqk;dc_lat=;dc_rdid=;tag_for_child_directed_treatment=;ord=1?" style="display:none" width="1">
 </iframe>
</noscript>
</body>
</html>
Resultant html should be something like this in both requests and BeautifulSoup case.
But BeautifulSoup gives us more advanced methods for scraping, like find() and find_all() (findAll is its older camelCase alias).
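To see the difference between find() and find_all() without hitting the network, here is a small sketch on an inline snippet (the html is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">First</li>
  <li class="item">Second</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching tag
first = soup.find('li', {'class': 'item'})
print(first.text)  # First

# find_all() (findAll is the older alias) returns every match in a list
items = soup.find_all('li', {'class': 'item'})
print(len(items))  # 2
```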
requesting_yelp.py
links = soup.findAll('a')
print(links)

... <span class="dropdown_label"> The Netherlands </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.com.tr/" role="menuitem"> <span class="dropdown_label"> Turkey </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.co.uk/" role="menuitem"> <span class="dropdown_label"> United Kingdom </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.com/" role="menuitem"> <span class="dropdown_label"> United States </span> ...
A lot of links exist, so your terminal should be full of links and html tags.
We can loop over the links variable and print the individual link
for link in links:
    print(link)
On running
...
<a href="/atlanta">Atlanta</a>
<a href="/austin">Austin</a>
<a href="/boston">Boston</a>
<a href="/chicago">Chicago</a>
<a href="/dallas">Dallas</a>
<a href="/denver">Denver</a>
<a href="/detroit">Detroit</a>
<a href="/honolulu">Honolulu</a>
<a href="/houston">Houston</a>
<a href="/la">Los Angeles</a>
<a href="/miami">Miami</a>
<a href="/minneapolis">Minneapolis</a>
<a href="/nyc">New York</a>
<a href="/philadelphia">Philadelphia</a>
<a href="/portland">Portland</a>
<a href="/sacramento">Sacramento</a>
<a href="/san-diego">San Diego</a>
<a href="/sf">San Francisco</a>
<a href="/san-jose">San Jose</a>
<a href="/seattle">Seattle</a>
<a href="/dc">Washington, DC</a>
<a href="/locations">More Cities</a>
<a href="https://yelp.com/about">About</a>
<a href="https://officialblog.yelp.com/">Blog</a>
<a href="https://www.yelp-support.com/?l=en_US">Support</a>
<a href="/static?p=tos">Terms</a>
<a href="http://www.databyacxiom.com" rel="nofollow" target="_blank">Some Data By Acxiom</a>
This looks a lot cleaner now.
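If you only need the urls rather than the whole tags, each link's href attribute can be read with get(); a short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<a href="/atlanta">Atlanta</a><a href="/austin">Austin</a>'
soup = BeautifulSoup(html, 'html.parser')

# tag.get('href') returns the attribute value, or None when it is missing
hrefs = [link.get('href') for link in soup.find_all('a')]
print(hrefs)  # ['/atlanta', '/austin']
```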
So far we have been requesting a single url. In this section we will be formatting the url to request different pages.
If you look at the yelp url we used before, you might find at the very bottom that there is pagination being used.
So what we can do is visit another search page, say page 2, and we find that the url changed a bit. Specifically, the url got a new value at the end for the 2nd page, which is
https://www.yelp.com/search?find_desc=Restaurants&find_loc=los+angeles&start=30
You guessed it right: &start=30 is what is new in the url. If you have worked with Django, then you might have used pagination somewhere in your templates. That means we can append this value to the end of the existing url to reach another search result page.
Have a look at formatting_url.py
import requests
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"
start = 30

url = base_url.format(city)
second_page = url + '&start=' + str(start)
response = requests.get(second_page)
print(f"STATUS CODE: {response.status_code} FOR {response.url}")

soup = BeautifulSoup(response.text, 'html.parser')
links = soup.findAll('a')
We assign the value 30 to start, append it as str(start) at the end of the url, name the result second_page, and then request that page. We get a 200 status code. This means that by finding the patterns in the url we can request more urls.
So what more could be done? We can start a loop that requests the urls, incrementing the start value by 30 each time.
start = 0
for i in range(40):
    url = base_url.format(city)
    url += '&start=' + str(start)
    ...
    if start == 270:
        break
    start += 30
Now how do I know that we have to increment by 30? Well, I checked the pattern of the urls by visiting the pages. We stop at 270 so that we only request 10 pages; you can use whatever value you want, but it should be a multiple of 30.
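The same ten page urls can also be generated with the step argument of range(), which avoids the manual counter and break; a small sketch reusing the base_url and city from above:

```python
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"

# range(0, 300, 30) yields 0, 30, ..., 270, i.e. exactly ten page offsets
urls = [base_url.format(city) + '&start=' + str(start)
        for start in range(0, 300, 30)]

print(len(urls))  # 10
print(urls[1])    # ends with &start=30
```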
Now we will be using the previous code that we wrote in formatting_url.py and extracting the particular piece of text that we need from the html tags, which is the title of each restaurant on each search page.
Visit the url, open the developer tools, and point at the block containing a restaurant's title, rating, reviews, etc.; you will find an li tag with the class regular-search-result. We will be using this class to pick the particular li tags out of the response with BeautifulSoup.
reading_name.py
import requests
...
info_block = soup.findAll('li', {'class': 'regular-search-result'})
print(info_block)
Run the file and you should see the whole li tag and its inner tags printed. But we want to extract the title of the restaurant from each li tag, and for that we have to find the class used on the restaurant's title.
The title is wrapped inside an anchor tag with the class biz-name.
info_block = soup.findAll('a', {'class': 'biz-name'})
print(info_block)

count = 0
for info in info_block:
    print(info.text)
    count += 1
print(count)
On printing the text of the html tag we get the title of the restaurant. These are not all the titles, because some blocks don't have the biz-name class, but we have what we need.
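As a side note, the same lookup can be written with a CSS selector via select(), which some find more readable; a sketch on a made-up fragment mirroring the biz-name class:

```python
from bs4 import BeautifulSoup

html = '<a class="biz-name" href="/biz/x"><span>Some Restaurant</span></a>'
soup = BeautifulSoup(html, 'html.parser')

# 'a.biz-name' selects <a> tags carrying the biz-name class,
# equivalent to findAll('a', {'class': 'biz-name'})
names = [tag.text for tag in soup.select('a.biz-name')]
print(names)  # ['Some Restaurant']
```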
We can also write the names of the restaurants to a file, but we have to use a try/except block around the file-writing operation, since some restaurant titles contain characters that may not encode cleanly and could cause an error.
with open('los_angeles_restaurants.txt', 'a') as file:
    start = 0
    for i in range(100):
        url = base_url.format(city) + '&start=' + str(start)
        response = requests.get(url)
        start += 30
        print(f"STATUS CODE: {response.status_code} FOR {response.url}")
        soup = BeautifulSoup(response.text, 'html.parser')
        names = soup.findAll('a', {'class': 'biz-name'})
        count = 0
        for info in names:
            try:
                title = info.text
                print(title)
                file.write(title + '\n')
                count += 1
            except Exception as e:
                print(e)
        print(f"{count} RESTAURANTS EXTRACTED...")
        print(start)
        if start == 990:
            break
For any questions regarding what we have done so far contact me at CodeMentor
In this section we will go a little further and extract the name, address, and phone number of each restaurant.
This time we will be looking for the div tag with the class biz-listing-large that contains the restaurant details.
In writing_details.py we have reused a lot of code from the other files; the only difference is that we open a new file, fetch the title, address, and phone number from their respective classes, and write them to the file.
...
city = "los+angeles"
...
file_path = f'yelp-{city}.txt'
with open(file_path, 'w') as textFile:
    soup = BeautifulSoup(response.text, 'html.parser')
    businesses = soup.findAll('div', {'class': 'biz-listing-large'})
    count = 0
    for biz in businesses:
        title = biz.find('a', {'class': 'biz-name'}).text
        address = biz.find('address').text
        phone = biz.find('span', {'class': 'biz-phone'}).text
        detail = f"{title}\n{address}\n{phone}"
        textFile.write(str(detail) + '\n\n')
We edit the city value to "los+angeles" so that it works both in the url and in our file path name.
yelp-los+angeles.txt still doesn't have the text formatted as nicely as we wanted, but we will be working on that in the next section.
AMF Beverly Lanes
1201 W Beverly Blvd
(323) 728-9161

Maccheroni Republic
332 S Broadway
(213) 346-9725

Home Restaurant - Los Feliz
1760 Hillhurst Ave
(323) 669-0211
...
Once we have extracted the data, we want to make it look good, i.e. without stray spaces and newlines, and for that we will use some simple logic.
writing_clean_data.py
...
with open(file_path, 'a') as textFile:
    count = 0
    for biz in businesses:
        try:
            title = biz.find('a', {'class': 'biz-name'}).text
            address = biz.find('address').contents
            # print(address)
            phone = biz.find('span', {'class': 'biz-phone'}).text
            region = biz.find('span', {'class': 'neighborhood-str-list'}).contents
            count += 1
            for item in address:
                if "br" in item:
                    print(item.getText())
                else:
                    print('\n' + item.strip(" \n\r\t"))
            for item in region:
                if "br" in item:
                    print(item.getText())
                else:
                    print(item.strip(" \n\t\r") + '\n')
            ...
We simply get the text of the item if it contains any br tags; otherwise we strip the newlines, carriage returns, tabs, and spaces from the text. On running the file:
800 W Sunset Blvd
Echo Park

4156 Santa Monica Blvd
Silver Lake

8500 Beverly Blvd
Beverly Grove

5484 Wilshire Blvd
Mid-Wilshire

5115 Wilshire Blvd
Hancock Park

126 E 6th St
Downtown

8164 W 3rd St
Beverly Grove

7910 W 3rd St
Beverly Grove

4163 W 5th St
Koreatown

435 N Fairfax Ave
Beverly Grove

1267 W Temple St
Echo Park

429 W 8th St
Downtown

724 S Spring St
Downtown

8450 W 3rd St
Beverly Grove

2308 S Union Ave
University Park

5583 W Pico Blvd
Mid-Wilshire

'NoneType' object has no attribute 'contents'

3413 Cahuenga Blvd W
Hollywood Hills

727 N Broadway
Chinatown

6602 Melrose Ave
Hancock Park

612 E 11th St
Downtown
...
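BeautifulSoup can also do most of this stripping for us: get_text() takes separator and strip arguments that trim each piece of text and join the pieces left around tags such as br. A sketch on a made-up address block:

```python
from bs4 import BeautifulSoup

html = '<address>  800 W Sunset Blvd<br/>Echo Park </address>'
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims whitespace around every text fragment,
# and separator joins the fragments split by tags like <br>
clean = soup.find('address').get_text(separator=', ', strip=True)
print(clean)  # 800 W Sunset Blvd, Echo Park
```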
In the same way, we have to clean the phone number.
...
for item in phone:
    if "br" in item:
        phone_number += item.getText() + " "
    else:
        phone_number += item.strip(" \n\t\r") + " "
...
except Exception as e:
    print(e)
    logs = open('errors.log', 'a')
    logs.write(str(e) + '\n')
    logs.close()
    address = None
    phone_number = None
    region = None
Again run the file, this time with the stop value changed to start = 990, delete all the content in yelp-{city}-clean.txt, and run the file again. All the details of the restaurants will be written to the file
yelp-{city}-clean.txt
Tea Station Express

Bestia
2121 E 7th Pl
Downtown
(213) 514-5724

République
624 S La Brea Ave
Hancock Park
(310) 362-6115

The Morrison
3179 Los Feliz Blvd
Atwater Village
(323) 667-1839

A Food Affair
1513 S Robertson Blvd
Pico-Robertson
(310) 557-9795

Running Goose
1620 N Cahuenga Blvd
Hollywood
(323) 469-1080

Howlin’ Ray’s

Perch
448 S Hill St
Downtown
(213) 802-1770

Faith & Flower
705 W 9th St
Downtown
(213) 239-0642
Notice that some of the data is missing: when an error occurs we reduce the risk of the code crashing by setting the values to None, so those records come out incomplete.
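To keep those half-empty records out of the output file entirely, we could also skip the write whenever any field came back as None; a small sketch (the helper name is hypothetical, and the fields mirror the variables used above):

```python
def format_detail(title, address, region, phone):
    # Return a formatted record, or None when any field is missing,
    # so the caller can skip writing incomplete entries
    if None in (title, address, region, phone):
        return None
    return f"{title}\n{address}\n{region}\n{phone}"

print(format_detail("Bestia", "2121 E 7th Pl", "Downtown", "(213) 514-5724"))
print(format_detail("Tea Station Express", None, None, None))  # None
```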