Web scraping with Python 3 using requests and BeautifulSoup
pip install -r requirements.txt
requirements.txt
requests==2.19.1
beautifulsoup4==4.6.3
The requests module handles requesting the url and fetching the response, while bs4 (beautifulsoup4) makes parsing the returned html for scraping easier.
Once you have the requirements installed you can simply import them and start. For now we will be requesting Yelp and playing around with our modules.
requesting_yelp.py
import requests
from bs4 import BeautifulSoup
You can visit Yelp and search for anything in the search bar; for example, we searched for restaurants and it returned Best Restaurants in San Francisco, CA.
Copy the url from the browser and paste it into your file
requesting_yelp.py
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1"
Now requests comes into play. Since the browser makes a GET request for this url and gets back the whole html response in the form of a new webpage, we are going to do the same with requests.get(url) and store the result.
requesting_yelp.epy
response = requests.get(url)
Now response contains the result returned from the GET request to the url. We can call methods on response; for a start, you can print response itself and its status code,
print(response)
print(response.status_code)
run the file
cmd
.../python_scraping_web> py requesting_yelp.py
<Response [200]>
200
200. The HTTP 200 OK success status response code indicates that the request has succeeded.
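In general, any code in the 2xx range means success, so it is worth guarding on it before parsing further. Here is a minimal sketch; the helper name is our own invention (requests itself also offers response.ok and response.raise_for_status() for this purpose):

```python
def is_success(status_code):
    """Return True for HTTP status codes in the 2xx success range."""
    return 200 <= status_code < 300

# Only 2xx codes count as success
print(is_success(200))  # True
print(is_success(404))  # False
```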
To actually print the whole html that the webpage contains we can print
print(response.text)
Earlier versions of python requests used to print the html from response.text in an ugly way, but printing it now gives reasonably readable html; we can also use the bs4 module to prettify it.
For that we need to create a BeautifulSoup object by passing in the text returned from the url (along with a parser such as 'html.parser'),
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<img height="1" src="https://www.facebook.com/tr?id=102029836881428&ev=PageView&noscript=1" style="display:none" width="1">
</img>
</noscript>
<script>
 (function() { var main = null; var main=function(){var c=Math.random()+"";var b=c*10000000000000;document.write('<iframe src="https://6372968.fls.doubleclick.net/activityi;src=6372968;type=invmedia;cat=qr3hlsqk;dc_lat=;dc_rdid=;tag_for_child_directed_treatment=;ord='+b+'?" width="1" height="1" frameborder="0" style="display:none"></iframe>')}; if (main === null) { throw 'invalid inline script, missing main declaration.'; } main(); })();
</script>
<noscript>
 <iframe frameborder="0" height="1" src="https://6372968.fls.doubleclick.net/activityi;src=6372968;type=invmedia;cat=qr3hlsqk;dc_lat=;dc_rdid=;tag_for_child_directed_treatment=;ord=1?" style="display:none" width="1">
 </iframe>
</noscript>
</body>
</html>
Resultant html should be something like this in both requests and BeautifulSoup case.
But BeautifulSoup gives us more advanced methods for scraping, like find() and find_all() (findAll is its older camelCase alias).
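To see the difference between find() and find_all() without hitting the network, here is a small sketch on an inline snippet (the html is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">First</li>
  <li class="item">Second</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching tag
first = soup.find('li', {'class': 'item'})
print(first.text)  # First

# find_all() (findAll is the older alias) returns every match in a list
items = soup.find_all('li', {'class': 'item'})
print(len(items))  # 2
```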
requesting_yelp.py
links = soup.findAll('a')
print(links)

... <span class="dropdown_label"> The Netherlands </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.com.tr/" role="menuitem"> <span class="dropdown_label"> Turkey </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.co.uk/" role="menuitem"> <span class="dropdown_label"> United Kingdom </span> </a>, <a class="dropdown_link js-dropdown-link" href="https://www.yelp.com/" role="menuitem"> <span class="dropdown_label"> United States </span> ...
A lot of links exist, so your terminal should be full of links and html tags.
We can loop over the links variable and print the individual link
for link in links:
    print(link)
On running
...
<a href="/atlanta">Atlanta</a>
<a href="/austin">Austin</a>
<a href="/boston">Boston</a>
<a href="/chicago">Chicago</a>
<a href="/dallas">Dallas</a>
<a href="/denver">Denver</a>
<a href="/detroit">Detroit</a>
<a href="/honolulu">Honolulu</a>
<a href="/houston">Houston</a>
<a href="/la">Los Angeles</a>
<a href="/miami">Miami</a>
<a href="/minneapolis">Minneapolis</a>
<a href="/nyc">New York</a>
<a href="/philadelphia">Philadelphia</a>
<a href="/portland">Portland</a>
<a href="/sacramento">Sacramento</a>
<a href="/san-diego">San Diego</a>
<a href="/sf">San Francisco</a>
<a href="/san-jose">San Jose</a>
<a href="/seattle">Seattle</a>
<a href="/dc">Washington, DC</a>
<a href="/locations">More Cities</a>
<a href="https://yelp.com/about">About</a>
<a href="https://officialblog.yelp.com/">Blog</a>
<a href="https://www.yelp-support.com/?l=en_US">Support</a>
<a href="/static?p=tos">Terms</a>
<a href="http://www.databyacxiom.com" rel="nofollow" target="_blank">Some Data By Acxiom</a>
This looks a lot cleaner now.
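If you only need the urls rather than the whole tags, each link's href attribute can be read with get(); a short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<a href="/atlanta">Atlanta</a><a href="/austin">Austin</a>'
soup = BeautifulSoup(html, 'html.parser')

# tag.get('href') returns the attribute value, or None when it is missing
hrefs = [link.get('href') for link in soup.find_all('a')]
print(hrefs)  # ['/atlanta', '/austin']
```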
So far we have been requesting a single url. In this section we will be formatting the url to request different pages.
If you look at the yelp url we used before, you might find at the very bottom that there is pagination being used.
So what we can do is visit another search page, say page 2, and we find that the url changed a bit. Specifically, the url got a new value at the end for the 2nd page, which is
https://www.yelp.com/search?find_desc=Restaurants&find_loc=los+angeles&start=30
You guessed it right: &start=30 is what is new in the url. If you have worked with Django, then you might have used pagination somewhere in your templates. That means we can append this value to the end of the existing url to reach another search result page.
Have a look at formatting_url.py
import requests
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"
start = 30

url = base_url.format(city)
second_page = url + '&start=' + str(start)
response = requests.get(second_page)
print(f"STATUS CODE: {response.status_code} FOR {response.url}")

soup = BeautifulSoup(response.text, 'html.parser')
links = soup.findAll('a')
We assign the value 30 to start, append it as str(start) at the end of the url, name the result second_page, and then request that page. We get a 200 status code. This means that by finding the patterns in the url we can request more urls.
So what more could be done? We can start a loop that requests the urls, incrementing the start value by 30 each time.
start = 0
for i in range(40):
    url = base_url.format(city)
    url += '&start=' + str(start)
    ...
    if start == 270:
        break
    start += 30
Now how do I know that we have to increment by 30? Well, I checked the pattern of the urls by visiting the pages. We stop at 270 so that we only request 10 pages; you can use whatever value you want, but it should be a multiple of 30.
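The same ten page urls can also be generated with the step argument of range(), which avoids the manual counter and break; a small sketch reusing the base_url and city from above:

```python
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"

# range(0, 300, 30) yields 0, 30, ..., 270, i.e. exactly ten page offsets
urls = [base_url.format(city) + '&start=' + str(start)
        for start in range(0, 300, 30)]

print(len(urls))  # 10
print(urls[1])    # ends with &start=30
```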
Now we will be using the previous code that we wrote in formatting_url.py and extracting the particular piece of text that we need from the html tags, which is the title of each restaurant on each search page.
Visit the url, open the developer tools, and point at the block containing a restaurant's title, rating, reviews, etc.; you will find an li tag with the class regular-search-result. We will be using this class to pick the particular li tags out of the response with BeautifulSoup.
reading_name.py
import requests
...
info_block = soup.findAll('li', {'class': 'regular-search-result'})
print(info_block)
Run the file and you should see the whole li tag and its inner tags printed. But we want to extract the title of the restaurant from each li tag, and for that we have to find the class used on the restaurant's title.
The title is wrapped inside an anchor tag with the class biz-name.
info_block = soup.findAll('a', {'class': 'biz-name'})
print(info_block)

count = 0
for info in info_block:
    print(info.text)
    count += 1
print(count)
On printing the text of the html tag we get the title of the restaurant. These are not all the titles, because some blocks don't have the biz-name class, but we have what we need.
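As a side note, the same lookup can be written with a CSS selector via select(), which some find more readable; a sketch on a made-up fragment mirroring the biz-name class:

```python
from bs4 import BeautifulSoup

html = '<a class="biz-name" href="/biz/x"><span>Some Restaurant</span></a>'
soup = BeautifulSoup(html, 'html.parser')

# 'a.biz-name' selects <a> tags carrying the biz-name class,
# equivalent to findAll('a', {'class': 'biz-name'})
names = [tag.text for tag in soup.select('a.biz-name')]
print(names)  # ['Some Restaurant']
```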
We can also write the names of the restaurants to a file, but we have to use a try/except block around the file-writing operation, since some restaurant titles contain characters that may not encode cleanly and could cause an error.
with open('los_angeles_restaurants.txt', 'a') as file:
    start = 0
    for i in range(100):
        url = base_url.format(city) + '&start=' + str(start)
        response = requests.get(url)
        start += 30
        print(f"STATUS CODE: {response.status_code} FOR {response.url}")
        soup = BeautifulSoup(response.text, 'html.parser')
        names = soup.findAll('a', {'class': 'biz-name'})
        count = 0
        for info in names:
            try:
                title = info.text
                print(title)
                file.write(title + '\n')
                count += 1
            except Exception as e:
                print(e)
        print(f"{count} RESTAURANTS EXTRACTED...")
        print(start)
        if start == 990:
            break
For any questions regarding what we have done so far contact me at CodeMentor
In this section we will go a little further and extract the name, address, and phone number of each restaurant.
This time we will be looking for the div tag with the class biz-listing-large that contains the restaurant details.
In writing_details.py we have reused a lot of code from the other files; the only difference is that we open a new file, fetch the title, address, and phone number from their respective classes, and write them to the file.
...
city = "los+angeles"
...
file_path = f'yelp-{city}.txt'
with open(file_path, 'w') as textFile:
    soup = BeautifulSoup(response.text, 'html.parser')
    businesses = soup.findAll('div', {'class': 'biz-listing-large'})
    count = 0
    for biz in businesses:
        title = biz.find('a', {'class': 'biz-name'}).text
        address = biz.find('address').text
        phone = biz.find('span', {'class': 'biz-phone'}).text
        detail = f"{title}\n{address}\n{phone}"
        textFile.write(str(detail) + '\n\n')
We edit the city value to "los+angeles" so that it works both in the url and in our file path name.
yelp-los+angeles.txt still doesn't have the text formatted as nicely as we wanted, but we will be working on that in the next section.
AMF Beverly Lanes
1201 W Beverly Blvd
(323) 728-9161

Maccheroni Republic
332 S Broadway
(213) 346-9725

Home Restaurant - Los Feliz
1760 Hillhurst Ave
(323) 669-0211
...
Once we have extracted the data, we want to make it look good, i.e. without stray spaces and newlines, and for that we will use some simple logic.
writing_clean_data.py
...
with open(file_path, 'a') as textFile:
    count = 0
    for biz in businesses:
        try:
            title = biz.find('a', {'class': 'biz-name'}).text
            address = biz.find('address').contents
            # print(address)
            phone = biz.find('span', {'class': 'biz-phone'}).text
            region = biz.find('span', {'class': 'neighborhood-str-list'}).contents
            count += 1
            for item in address:
                if "br" in item:
                    print(item.getText())
                else:
                    print('\n' + item.strip(" \n\r\t"))
            for item in region:
                if "br" in item:
                    print(item.getText())
                else:
                    print(item.strip(" \n\t\r") + '\n')
            ...
We simply get the text of the item if it contains any br tags; otherwise we strip the newlines, carriage returns, tabs, and spaces from the text. On running the file:
800 W Sunset Blvd
Echo Park

4156 Santa Monica Blvd
Silver Lake

8500 Beverly Blvd
Beverly Grove

5484 Wilshire Blvd
Mid-Wilshire

5115 Wilshire Blvd
Hancock Park

126 E 6th St
Downtown

8164 W 3rd St
Beverly Grove

7910 W 3rd St
Beverly Grove

4163 W 5th St
Koreatown

435 N Fairfax Ave
Beverly Grove

1267 W Temple St
Echo Park

429 W 8th St
Downtown

724 S Spring St
Downtown

8450 W 3rd St
Beverly Grove

2308 S Union Ave
University Park

5583 W Pico Blvd
Mid-Wilshire

'NoneType' object has no attribute 'contents'

3413 Cahuenga Blvd W
Hollywood Hills

727 N Broadway
Chinatown

6602 Melrose Ave
Hancock Park

612 E 11th St
Downtown
...
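BeautifulSoup can also do most of this stripping for us: get_text() takes separator and strip arguments that trim each piece of text and join the pieces left around tags such as br. A sketch on a made-up address block:

```python
from bs4 import BeautifulSoup

html = '<address>  800 W Sunset Blvd<br/>Echo Park </address>'
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims whitespace around every text fragment,
# and separator joins the fragments split by tags like <br>
clean = soup.find('address').get_text(separator=', ', strip=True)
print(clean)  # 800 W Sunset Blvd, Echo Park
```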
In the same way, we have to clean the phone number.
...
for item in phone:
    if "br" in item:
        phone_number += item.getText() + " "
    else:
        phone_number += item.strip(" \n\t\r") + " "
...
except Exception as e:
    print(e)
    logs = open('errors.log', 'a')
    logs.write(str(e) + '\n')
    logs.close()
    address = None
    phone_number = None
    region = None
Again run the file, this time with the stop value changed to start = 990, delete all the content in yelp-{city}-clean.txt, and run the file again. All the details of the restaurants will be written to the file
yelp-{city}-clean.txt
Tea Station Express

Bestia
2121 E 7th Pl
Downtown
(213) 514-5724

République
624 S La Brea Ave
Hancock Park
(310) 362-6115

The Morrison
3179 Los Feliz Blvd
Atwater Village
(323) 667-1839

A Food Affair
1513 S Robertson Blvd
Pico-Robertson
(310) 557-9795

Running Goose
1620 N Cahuenga Blvd
Hollywood
(323) 469-1080

Howlin’ Ray’s

Perch
448 S Hill St
Downtown
(213) 802-1770

Faith & Flower
705 W 9th St
Downtown
(213) 239-0642
Notice that some of the data is missing: when an error occurs we reduce the risk of the code crashing by setting the values to None, so those records come out incomplete.
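To keep those half-empty records out of the output file entirely, we could also skip the write whenever any field came back as None; a small sketch (the helper name is hypothetical, and the fields mirror the variables used above):

```python
def format_detail(title, address, region, phone):
    # Return a formatted record, or None when any field is missing,
    # so the caller can skip writing incomplete entries
    if None in (title, address, region, phone):
        return None
    return f"{title}\n{address}\n{region}\n{phone}"

print(format_detail("Bestia", "2121 E 7th Pl", "Downtown", "(213) 514-5724"))
print(format_detail("Tea Station Express", None, None, None))  # None
```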