Downloading Images - Unable to find correct selector

Brompy · Jan-20-2020, 08:01 AM

I'm trying to write a script that 1) goes to a website 2) downloads and parses the HTML 3) downloads a comic image 4) selects the "previous comic button" 5) repeats 1-4

The script is failing on either hitting the "previous comic" button, or downloading the next page before it can reach the second image.

I have tried tinkering with the different selectors, but I'm not sure why it's not working.

    #! python3
    #swordscraper.py - Downloads all the swords comics.

    import requests, os, bs4

    os.chdir(r'C:\Users\bromp\OneDrive\Desktop\Python')
    os.makedirs('swords', exist_ok=True) #store comics in /swords
    url = 'https://swordscomic.com/' #starting url
    while not url.endswith('#'):
        #Download the page.
        print('Downloading page %s...' % url)
        res = requests.get(url)
        res.raise_for_status
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        #Find the URL of the comic image.
        comicElem = soup.select('#comic-image')
        if comicElem == []:
            print('Could not find comic image.')
        else:
            comicUrl = comicElem[0].get('src')
            comicUrl = "http://" + comicUrl
            if 'swords' not in comicUrl:
                comicUrl=comicUrl[:7]+'swordscomic.com/'+comicUrl[7:]
            #Download the image.
            print('Downloading image %s...' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()

            #Save the image to ./swords
            imageFile = open(os.path.join('swords', os.path.basename(comicUrl)), 'wb')
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()

        #Get the Prev button's url.
        prevLink = soup.select('a[id="navigation-previous"]')[1]
        url = 'https://swordscomic.com/' + prevLink.get('href')

This is the output it gives when I run it:

    Downloading page https://swordscomic.com/...
    Downloading image http://swordscomic.com//media/Swords364t.png...
    Traceback (most recent call last):
      File "C:\Users\bromp\AppData\Local\Programs\Python\Python37-32\swordscraper.py", line 40, in <module>
        prevLink = soup.select('a[id="navigation-previous"]')[1]
    IndexError: list index out of range

Do I need to use a different module, like Selenium?
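
Editor's note: the output above shows the image URL ending up with a double slash (`http://swordscomic.com//media/Swords364t.png`), which suggests the `src` attribute is a root-relative path that is being glued onto the host by hand. A minimal sketch of an alternative is to let the standard library's `urljoin` resolve whatever `src` or `href` value the page provides. The `src` value below is inferred from the output, and the `href` value is a hypothetical example, not taken from the site's HTML.

```python
from urllib.parse import urljoin

PAGE_URL = "https://swordscomic.com/"


def absolute_url(page_url, link):
    """Resolve an href/src value found on page_url into a full URL."""
    return urljoin(page_url, link)


if __name__ == "__main__":
    # src value inferred from the output in the post above
    print(absolute_url(PAGE_URL, "/media/Swords364t.png"))
    # hypothetical href for a "previous" link
    print(absolute_url(PAGE_URL, "/comic/CCCLXIII/"))
```

The same call handles root-relative, protocol-relative, and already absolute links, so the string slicing on `comicUrl` would no longer be needed.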

**Larz60+** · Jan-20-2020, 04:42 PM

I think you want to use Selenium for this.
How to do this:
web scraping part 1
web scraping part 2

***snippsat*** · (This post was last modified: Jan-20-2020, 11:02 PM by snippsat.)

You can use Selenium, but there is a cooler way here, because the site uses Roman numerals in its URLs.
You can then write the script so that an integer is used to navigate or, e.g., to download all images.
Here is an example that gets one image.

    # roman.py
    import requests
    import os, re
    from roman_convert import roman_to_int, int_to_roman

    def make_url(roman_number):
        return f'https://swordscomic.com/comic/{roman_number}/'

    def download(url, img_nr):
        img = requests.get(url)
        img_name = f'{int_to_roman(img_nr)}.png'
        with open(img_name, 'wb') as f_out:
            f_out.write(img.content)

    if __name__ == '__main__':
        img_nr = 361
        url = f'https://swordscomic.com/media/Swords{img_nr}t.png'
        roman_number = int_to_roman(img_nr)
        org_link = make_url(roman_number)
        print(f'Dowloading --> {org_link}')
        download(url, img_nr)

Output:

    E:\div_code\home
    λ python roman.py
    Dowloading --> https://swordscomic.com/comic/CCCLXI/

CCCLXI.png
The roman_convert module that I import was not written by me; it is just code from the first hit on Google.


    # roman_convert.py
    def roman_to_int(s):
        rom_val = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
        int_val = 0
        for i in range(len(s)):
            if i > 0 and rom_val[s[i]] > rom_val[s[i - 1]]:
                int_val += rom_val[s[i]] - 2 * rom_val[s[i - 1]]
            else:
                int_val += rom_val[s[i]]
        return int_val

    def int_to_roman(num):
        val = [
            1000, 900, 500, 400,
            100, 90, 50, 40,
            10, 9, 5, 4,
            1
            ]
        syb = [
            "M", "CM", "D", "CD",
            "C", "XC", "L", "XL",
            "X", "IX", "V", "IV",
            "I"
            ]
        roman_num = ''
        i = 0
        while num > 0:
            for _ in range(num // val[i]):
                roman_num += syb[i]
                num -= val[i]
            i += 1
        return roman_num

    if __name__ == '__main__':
        n = int_to_roman(362)
        print(n) # CCCLXII
        print(roman_to_int(n)) # 362
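
Editor's note: since the converter is borrowed code, a quick round-trip check may be reassuring before relying on it for URLs. The sketch below restates both functions compactly (same logic, rewritten only so that the block runs on its own) and converts every integer from 1 to 3999 to a numeral and back. The helper name `round_trip_failures` and the 3999 upper bound are choices made for this illustration, not part of the original module.

```python
# Compact restatement of the two converters so this block runs on its own.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
VALUE = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def int_to_roman(num):
    """Greedy conversion: take each symbol as many times as it fits."""
    parts = []
    for value, symbol in ROMAN:
        count, num = divmod(num, value)
        parts.append(symbol * count)
    return "".join(parts)


def roman_to_int(s):
    """Add symbol values, correcting when a smaller symbol precedes a larger one."""
    total = 0
    for i, ch in enumerate(s):
        if i > 0 and VALUE[ch] > VALUE[s[i - 1]]:
            total += VALUE[ch] - 2 * VALUE[s[i - 1]]
        else:
            total += VALUE[ch]
    return total


def round_trip_failures(limit=3999):
    """Return every n in 1..limit that does not survive int -> roman -> int."""
    return [n for n in range(1, limit + 1) if roman_to_int(int_to_roman(n)) != n]


if __name__ == "__main__":
    print(int_to_roman(361))
    print(round_trip_failures())
```

An empty list from `round_trip_failures()` means every number in the range converts back to itself.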

Brompy · Jan-22-2020, 11:32 AM

Thank you both for the replies.

Snippsat, your solution is very cool and clever. But when I try to use it, it does download a .png, but the file is unreadable and only 13 KB in size. What is going wrong?

***snippsat*** · Jan-22-2020, 04:54 PM

(Jan-22-2020, 11:32 AM) Brompy wrote: But when I try to use it, it does download a .png, but the file is unreadable and only 13 KB in size. What is going wrong?

Not every day has an image, so the download will be only 13 KB on those days.
My setup is:

    E:\div_code\home\
      |-- roman.py
      |-- roman_convert

Below I change roman.py to download more images at once.
It also reports the days with no image, as both an integer and a Roman numeral.
Here it downloads 25 days; 4 days have no image, as shown in the list below.

[('CCCXLVI.png', 346), ('CCCXLVII.png', 347), ('CCCLVI.png', 356), ('CCCLXIII.png', 363)]

    # roman.py
    import requests
    import os, re
    from roman_convert import roman_to_int, int_to_roman

    def make_url(roman_number):
        return f'https://swordscomic.com/comic/{roman_number}/'

    no_image = []  # file names of days that returned no real image

    def download(url, img_nr):
        img = requests.get(url)
        img_name = f'{int_to_roman(img_nr)}.png'
        # Check the size before opening the file, so no empty .png is left behind
        if len(img.content) < 15000:
            no_image.append(img_name)
        else:
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)

    if __name__ == '__main__':
        #img_nr = 364
        for img_nr in range(340, 365):
            url = f'https://swordscomic.com/media/Swords{img_nr}t.png'
            roman_number = int_to_roman(img_nr)
            org_link = make_url(roman_number)
            print(f'Dowloading --> {org_link}')
            download(url, img_nr)
        # Convert the skipped file names back to day numbers for the report
        day_int = [roman_to_int(ro.split('.')[0]) for ro in no_image]
        print(list(zip(no_image, day_int)))

Run:

Output:

    E:\div_code\home
    λ python roman.py
    Dowloading --> https://swordscomic.com/comic/CCCXL/
    Dowloading --> https://swordscomic.com/comic/CCCXLI/
    Dowloading --> https://swordscomic.com/comic/CCCXLII/
    Dowloading --> https://swordscomic.com/comic/CCCXLIII/
    Dowloading --> https://swordscomic.com/comic/CCCXLIV/
    Dowloading --> https://swordscomic.com/comic/CCCXLV/
    Dowloading --> https://swordscomic.com/comic/CCCXLVI/
    Dowloading --> https://swordscomic.com/comic/CCCXLVII/
    Dowloading --> https://swordscomic.com/comic/CCCXLVIII/
    Dowloading --> https://swordscomic.com/comic/CCCXLIX/
    Dowloading --> https://swordscomic.com/comic/CCCL/
    Dowloading --> https://swordscomic.com/comic/CCCLI/
    Dowloading --> https://swordscomic.com/comic/CCCLII/
    Dowloading --> https://swordscomic.com/comic/CCCLIII/
    Dowloading --> https://swordscomic.com/comic/CCCLIV/
    Dowloading --> https://swordscomic.com/comic/CCCLV/
    Dowloading --> https://swordscomic.com/comic/CCCLVI/
    Dowloading --> https://swordscomic.com/comic/CCCLVII/
    Dowloading --> https://swordscomic.com/comic/CCCLVIII/
    Dowloading --> https://swordscomic.com/comic/CCCLIX/
    Dowloading --> https://swordscomic.com/comic/CCCLX/
    Dowloading --> https://swordscomic.com/comic/CCCLXI/
    Dowloading --> https://swordscomic.com/comic/CCCLXII/
    Dowloading --> https://swordscomic.com/comic/CCCLXIII/
    Dowloading --> https://swordscomic.com/comic/CCCLXIV/
    [('CCCXLVI.png', 346), ('CCCXLVII.png', 347), ('CCCLVI.png', 356), ('CCCLXIII.png', 363)]
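
Editor's note: the 15000-byte cut-off in the script is a size heuristic. If the small responses are not PNG data at all (an assumption, based only on the report that the 13 KB file was unreadable), a check on the PNG file signature may be a more direct test. The sketch below uses the fixed 8-byte signature that every PNG file starts with; the helper names `looks_like_png` and `save_if_png` are made up for this illustration.

```python
# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def save_if_png(data, path):
    """Write data to path only when it looks like a PNG; report what happened."""
    if not looks_like_png(data):
        return False
    with open(path, "wb") as f_out:
        f_out.write(data)
    return True


if __name__ == "__main__":
    print(looks_like_png(PNG_SIGNATURE + b"...image bytes..."))
    print(looks_like_png(b"<!DOCTYPE html><html>not an image</html>"))
```

In the script, `save_if_png(img.content, img_name)` could stand in for the length comparison, with a `False` result meaning the day goes into `no_image`. If the site answers with a small but valid placeholder PNG instead, the size check remains the better fit.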

I can show a quick test with Selenium, as this may not be so easy if you are new to it.
Here I go back 3 times, then pass the page source to BeautifulSoup so I can find the real download link (it is not the Roman numerals link) in a meta tag.
To download, I use the same function with some modifications.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup
    import time, os
    import requests

    no_image = []  # names of downloads too small to be a comic

    def download(img_url):
        img = requests.get(img_url)
        img_name = os.path.basename(img_url)
        # Check the size before opening the file, so no empty file is left behind
        if len(img.content) < 15000:
            no_image.append(img_name)
        else:
            with open(img_name, 'wb') as f_out:
                f_out.write(img.content)

    if __name__ == '__main__':
        #--| Setup
        # Note: executable_path and find_elements_by_css_selector are Selenium 3 API;
        # Selenium 4 uses a Service object and find_elements(By.CSS_SELECTOR, ...).
        chrome_options = Options()
        #chrome_options.add_argument("--headless")
        browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe')
        #--| Parse or automation
        browser.get('https://swordscomic.com/comic/CCCLXV/')
        # Click the "previous" button three times, pausing so each page can load
        back = browser.find_elements_by_css_selector('#navigation-previous')[0].click()
        time.sleep(3)
        back = browser.find_elements_by_css_selector('#navigation-previous')[0].click()
        time.sleep(3)
        back = browser.find_elements_by_css_selector('#navigation-previous')[0].click()

        # Give source code to BeautifulSoup
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        img_url = soup.find('meta', property="og:image")
        img_url = img_url.attrs['content']
        download(img_url)
        browser.quit()
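
Editor's note: the last step above, reading the image link out of the `og:image` meta tag, can be tried without a browser or any third-party package. The sketch below swaps BeautifulSoup for the standard library's `html.parser` purely so that it runs anywhere; `SAMPLE_HTML` is a made-up page, not the real markup of the site, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image":
            self.og_image = attr.get("content")


def find_og_image(html):
    """Return the og:image URL found in html, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Made-up page used only to exercise the parser.
SAMPLE_HTML = """<html><head>
<meta property="og:title" content="Example">
<meta property="og:image" content="https://example.com/media/sample.png">
</head><body></body></html>"""

if __name__ == "__main__":
    print(find_og_image(SAMPLE_HTML))
```

Returning `None` when the tag is missing gives the caller a chance to skip the page, whereas `soup.find(...)` followed by `.attrs` would raise an `AttributeError` in that case.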
