DEV Community

Areahints

Need help with python

How would you achieve the following logic using Python?

  • Take a search query, for example, "why do I like dogs?"
  • Open a browser, navigate to DuckDuckGo (or another search engine), and search for the query.
  • Save the HTML of the search results page.
  • Open each URL on the first page of results in a new tab.
  • Save the HTML of each opened URL.
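For the first step, the query's spaces and punctuation need URL-encoding before they can go into a search URL. A minimal sketch using only the standard library (the function name build_search_url and the DuckDuckGo base URL are just illustrative choices):

```python
from urllib.parse import quote_plus

def build_search_url(query, base='https://duckduckgo.com/?q='):
    # quote_plus turns spaces into '+' and percent-encodes
    # characters such as '?' that are unsafe in a query string
    return base + quote_plus(query)

print(build_search_url('why do I like dogs?'))
# → https://duckduckgo.com/?q=why+do+I+like+dogs%3F
```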

Top comments (2)

rhymes

Why do you need to open the pages in the browser? Wouldn't it be easier to just download the HTML?

  • open the URL https://duckduckgo.com/?q=dogs with requests
  • save the HTML
  • parse it with html.parser from the standard library
  • download all the links

This is the simplest version I can think of. There are other ways to scrape pages and links.
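The steps above can be sketched with nothing but the standard library; html.parser provides an HTMLParser class to subclass. This is a rough sketch, not a robust scraper (the LinkExtractor name and the choice of User-Agent are my own assumptions):

```python
from html.parser import HTMLParser
import urllib.request

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def fetch_html(url):
    # A browser-like User-Agent header avoids some bot-blocking heuristics
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='replace')

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

if __name__ == '__main__':
    # Fetch the results page, save it, then list the links found in it
    html = fetch_html('https://duckduckgo.com/?q=dogs')
    with open('results.html', 'w', encoding='utf-8') as f:
        f.write(html)
    for link in extract_links(html):
        print(link)
```

Each extracted link could then be passed back through fetch_html and saved the same way.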

If you truly need to "drive" the browser instead, you probably want to look into something like pyppeteer, which drives a headless Chrome/Chromium.
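A minimal sketch of that route, assuming pyppeteer is installed (pip install pyppeteer); the function name save_rendered_page is illustrative:

```python
import asyncio

async def save_rendered_page(url, filename):
    # Deferred import: pyppeteer is a third-party package and
    # downloads a Chromium build on first use
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)
        html = await page.content()   # HTML after JavaScript has run
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)
        return html
    finally:
        await browser.close()

if __name__ == '__main__':
    asyncio.run(save_rendered_page('https://duckduckgo.com/?q=dogs', 'search.html'))
```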

Areahints

@rhymes

This is what I've tried to do:

import logging
import re
import urllib.request
from urllib.request import Request

# Global variables
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
url_google = 'https://www.google.com/search?q='
url_duck = 'https://duckduckgo.com/?q='

# Get user's search query and replace spaces/punctuation with '+'
query = input('What are you searching for?: ')
query = re.sub(r'[ ?.!/;:]', '+', query)

# Use user's choice to build the request URL
choice = int(input('Select Search Engine, Google = 1, Duckduckgo = 2: '))
search_url = (url_google if choice == 1 else url_duck) + query

def set_custom_log_info(filename):
    logging.basicConfig(filename=filename, level=logging.INFO)

def report(e):
    logging.exception(str(e))

def write_webpage_as_html(filename, data=b''):
    try:
        with open(filename, 'wb') as fobj:
            fobj.write(data)
    except Exception as e:
        print(e)
        report(e)
        return False
    return True

class Search:
    def __init__(self, url):
        self._url = url
        self._data = b''

    def retrieve_webpage(self):
        try:
            # urlopen's third positional argument is a timeout, not headers;
            # custom headers have to go on a Request object
            request = Request(self._url, headers=headers)
            html = urllib.request.urlopen(request)
        except Exception as e:
            print(e)
            report(e)
        else:
            self._data = html.read()
            if self._data:
                print('Retrieved successfully')

if __name__ == '__main__':
    set_custom_log_info('search.log')
    search_scrap = Search(search_url)   # Search needs the URL it should fetch
    search_scrap.retrieve_webpage()
    write_webpage_as_html('results.html', search_scrap._data)

I am still getting errors; any advice is welcome.