Python Forum
Scraping Issue with BS
#11
You can test the button's attributes; in BeautifulSoup this is done with .attrs.
In this example I use .get(), because .attrs returns a dictionary, so a default value such as 'More pages' can be supplied for when the attribute is missing.
import requests
from bs4 import BeautifulSoup
import time

# First
url = "https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html"
# Last
#url = "https://www.homeadvisor.com/c.Garage-Garage-Doors.Atlanta.GA.-12036.html?startingIndex=50"

headers = {
    # Adjacent string literals are joined, so the header value stays on one logical line
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
# Raw string, so the backslashes that escape "@" in the class names reach the CSS selector unchanged
button = soup.select(r'button.page-next.\@px-1.\@ml-1')
Test
# First page, or any page before the last page, will return this
>>> button[0].attrs.get('disabled', 'More pages')
'More pages'
>>> # Last page
>>> button[0].attrs.get('disabled', 'More pages')
'disabled'
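A minimal sketch of the same test wrapped in a helper, in case it is useful. The markup below is hypothetical and only stands in for the real page, and is_last_page is a name made up for illustration. It checks for the presence of the attribute with has_attr() rather than comparing its value, because with html.parser a bare disabled attribute (no value) is stored as an empty string, so a comparison against 'disabled' would miss it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the live site's HTML may differ.
MIDDLE_PAGE = '<button class="page-next">Next</button>'
LAST_PAGE_VALUED = '<button class="page-next" disabled="disabled">Next</button>'
LAST_PAGE_BARE = '<button class="page-next" disabled>Next</button>'


def is_last_page(html):
    """Return True when the next-page button carries a disabled attribute."""
    soup = BeautifulSoup(html, "html.parser")
    button = soup.select_one("button.page-next")
    # A missing button is treated as "no more pages" so a loop cannot run forever.
    if button is None:
        return True
    # has_attr() only checks presence, so the attribute's value does not matter.
    return button.has_attr("disabled")
```

In a pagination loop this could replace the direct comparison, for example `if is_last_page(response.text): break`.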
#12
Fantastic!

This does the job. The next thing I need to do is grab the city, state and category for each listing. These only appear in two places: the URL and the top of the page. I will extract them with a regex and add them to the dictionary.

PAGE = 0
while True:
    html = get_html(session, BASE_URL, PAGE)
    listings = get_listings(html)
    for listing in listings:
        print(listing['company'], listing['phone'], listing['rating'], end='\n')
    # Raw string, so the backslashes that escape "@" reach the CSS selector unchanged
    button = html.select(r'button.page-next.\@px-1.\@ml-1')
    # On the last page the next button is disabled, so stop paginating
    if button[0].attrs.get('disabled') == 'disabled':
        break
    PAGE += 25
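For the city, state and category mentioned above, here is a rough sketch of pulling them out of the URL with a regex. It assumes every listing URL follows the same /c.<Category>.<City>.<ST>.-<id>.html layout as the Atlanta example in this thread, which I have not verified for other cities; LISTING_URL and parse_listing_url are names invented for this example.

```python
import re

# Assumed URL layout: /c.<Category>.<City>.<ST>.-<id>.html
LISTING_URL = re.compile(
    r"/c\.(?P<category>[^./]+)\.(?P<city>[^./]+)\.(?P<state>[A-Z]{2})\.-\d+\.html"
)


def parse_listing_url(url):
    """Return a dict with category, city and state, or None if the URL does not match."""
    match = LISTING_URL.search(url)
    if match is None:
        return None
    # groupdict() maps each named group to the text it captured.
    return match.groupdict()
```

The returned dictionary can be merged into each listing with `listing.update(parse_listing_url(url))`, after checking that the result is not None.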
I also have to read each URL from a file instead of hard-coding it. In the following code, I am appending the listings for each state > city > category, then writing the rows to a CSV file. One question: how do I write the column names only once?

def save_csv(listings, filename):
    filename = 'home-advisor-data-{}.csv'.format(state)
    with open(filename, 'a', encoding='utf-8', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        writer.writerow(['Company', 'Phone Number', 'Rating'])
        #While paginating through each page of results, it will write these literal columns.
        #How do I avoid this? I only want these at the top, once.
        for listing in listings:
            writer.writerow(
                [listing['company'], listing['Phone_Number'], listing['Rating']])
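One possible way to write the column names only once, sketched under a few assumptions: the file is opened in append mode as above, the header is wanted only when the file is new or empty, and the listing keys are 'company', 'phone' and 'rating' as in the pagination loop. FIELDNAMES is a name introduced here for illustration.

```python
import csv
import os

FIELDNAMES = ["Company", "Phone Number", "Rating"]


def save_csv(listings, filename):
    """Append listings to filename, writing the header row only for a new or empty file."""
    # Check before opening: append mode creates the file if it does not exist yet.
    needs_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, "a", encoding="utf-8", newline="") as file:
        writer = csv.writer(file)
        if needs_header:
            writer.writerow(FIELDNAMES)
        for listing in listings:
            # Key names are assumed to match those used in the pagination loop.
            writer.writerow([listing["company"], listing["phone"], listing["rating"]])
```

Calling save_csv once per page of results then appends rows to the same file without repeating the header. An alternative is to open the file once before the pagination loop, write the header there, and pass the writer into the loop.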