BeautifulSoup: how to pick text that's not in HTML tags?

pitonas · (This post was last modified: Oct-08-2018, 10:33 AM by pitonas.)

Hello guys,

I'm building a web scraper and everything went smoothly until I came across this situation:

There is a tag that contains the information that I need to pick.

Travel date: 2019.10.10 
Travel duration: 7 nights


The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling

It works until there is no such "sibling"; then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a line to check whether the variable is None and, if so, to pick up the information from another element.

 if travel_date is None: travel_date = inner_page_soup.find('div', {"class":"info"}).span.text

Do you have any ideas why it's not working?

**Larz60+** · (This post was last modified: Oct-08-2018, 11:08 AM by Larz60+.)

If I knew the URL of your site, I would have used it as the example.
Instead, I use https://www.weather.gov/

Load the website in Chrome or Firefox.
Highlight the text you are interested in and right-click.
Choose Inspect Element, then move the cursor in the inspector over the text node:

<strong>Travel date:</strong>&nbsp;2019.10.10<br>

Right-click --> Copy --> XPath
Paste it into your code like this (your XPath will be different):

xpath = '/html/body/div[5]/div/div[4]/p/a[2]'

Now run code like:

from lxml import html
import requests
import sys


def get_stuff():
    page = None
    response = requests.get('https://www.weather.gov/')
    if response.status_code == 200:
        page = response.content
    else:
        print("can't load page")
        sys.exit(-1)
    tree = html.fromstring(page)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()

results:

Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
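
Applied to the markup from the question, the same idea might look like the sketch below. It is only a sketch under assumptions: the `snippet` string stands in for the real page (whose URL is not known), and `text_after_label` is a hypothetical helper name, not part of lxml. The XPath step `following-sibling::text()[1]` asks for the first bare text node after the matching strong tag.

```python
from lxml import html

# Stand-in for the real page, built from the markup shown in the thread.
snippet = (
    "<p><strong>Travel date:</strong>&nbsp;2019.10.10<br>"
    "<strong>Travel duration:</strong>&nbsp;7 nights</p>"
)


def text_after_label(tree, label):
    """Return the first text node that follows <strong>label</strong>
    among its siblings, or None if the label is not found."""
    # $label is an XPath variable, filled in by the keyword argument.
    nodes = tree.xpath(
        './/strong[text()=$label]/following-sibling::text()[1]',
        label=label,
    )
    # strip() also removes the non-breaking space produced by &nbsp;
    return nodes[0].strip() if nodes else None


tree = html.fromstring(snippet)
print(text_after_label(tree, "Travel date:"))
print(text_after_label(tree, "Travel duration:"))
print(text_after_label(tree, "Price:"))  # label not on the page
```

Because `xpath()` returns an empty list when nothing matches, the missing-label case ends up as None instead of raising an exception.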

pitonas · Oct-08-2018, 12:07 PM

Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work according to your example. However, I've never used the lxml parser before, and I would need to rewrite my whole code. It looks so much easier to use XPaths...

***snippsat*** · (This post was last modified: Oct-08-2018, 12:47 PM by snippsat.)

It will be in the text of the p tag.
You have to do some cleanup.

from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""
soup = BeautifulSoup(html, 'lxml')

>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']

You can also quickly make a dictionary.

>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10'
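
One caveat with `dict([i.split(': ') for i in s])`: a line that has no `': '` in it, or has it more than once, raises a ValueError. A more forgiving variant could split on the first colon only, as in this sketch. The name `lines_to_dict` and the two extra sample lines are made up for illustration and are not from the real page.

```python
def lines_to_dict(lines):
    """Build {label: value} from 'label: value' lines, skipping the rest."""
    result = {}
    for line in lines:
        # partition() splits on the first colon only and never raises.
        key, sep, value = line.partition(":")
        if sep:  # sep is empty when the line contains no colon
            result[key.strip()] = value.strip()
    return result


# The last two entries are invented to show the edge cases.
s = [
    "Travel date: 2019.10.10",
    "Travel duration: 7 nights",
    "Departure: 10:30",
    "no colon here",
]
d = lines_to_dict(s)
print(d)
```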

Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...

BS has support for CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors and XPath in BS and lxml in Web-Scraping part-1.
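
About the error in the first post: `'NoneType' object has no attribute 'next_sibling'` means `find()` itself returned None, so the exception is raised before the `is None` check on the next line is ever reached. Here is a minimal sketch that stays in BS, uses a CSS selector, and only accepts a sibling that is plain text. The name `labelled_values` is hypothetical, and I use `html.parser` here only to avoid the extra lxml dependency; `'lxml'` should work the same way for this snippet.

```python
from bs4 import BeautifulSoup, NavigableString

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""


def labelled_values(markup):
    """Map each <strong> label to the bare text node right after it."""
    soup = BeautifulSoup(markup, "html.parser")
    values = {}
    for label in soup.select("p > strong"):
        sibling = label.next_sibling
        # next_sibling can be None or another tag, so accept text only.
        if isinstance(sibling, NavigableString):
            key = label.get_text(strip=True).rstrip(":")
            values[key] = sibling.strip()
    return values


info = labelled_values(html)
travel_date = info.get("Travel date")  # None if the label is missing
print(info)
```

With `info.get(...)` a missing label gives None, and at that point a fallback such as the `div` with class `info` from the first post can be tried.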

pitonas · Oct-08-2018, 01:43 PM

Awesome! Thank you, snippsat
