Python Forum
BeautifulSoup: how to pick text that's not in HTML tags?
#1
Hello guys,

I'm building a web scraper, and everything went smoothly until I came across this situation:

There is a <p> tag that contains the information that I need to pick.
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>

The problem is that I need only the date (2019.10.10) and the number of nights (7 nights).

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until there is no such "sibling", and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a line to check whether the variable is None and, if so, fall back to other info:
if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class": "info"}).span.text
Do you have any ideas why it's not working?
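
One likely reading of the error, offered as a sketch rather than a confirmed diagnosis: find() returns None when no matching <strong> tag exists, so .next_sibling raises before the "is None" check is ever reached. The check would have to be applied to the result of find() itself. The helper name get_travel_date and the sample HTML strings below are hypothetical; the div.info fallback mirrors the snippet above, and 'html.parser' is used only so the sketch needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup, NavigableString


def get_travel_date(soup):
    """Return the text after the 'Travel date:' label, a fallback, or None."""
    label = soup.find('strong', string='Travel date:')
    # Guard the result of find() *before* touching .next_sibling.
    if label is not None and isinstance(label.next_sibling, NavigableString):
        return label.next_sibling.strip()
    # Fallback taken from the question: <div class="info"><span>...</span></div>
    info = soup.find('div', {'class': 'info'})
    if info is not None and info.span is not None:
        return info.span.get_text(strip=True)
    return None


# Hypothetical pages: one with the label, one with only the fallback block.
with_label = BeautifulSoup(
    '<p><strong>Travel date:</strong>&nbsp;2019.10.10<br></p>', 'html.parser')
without_label = BeautifulSoup(
    '<div class="info"><span>2019.11.05</span></div>', 'html.parser')

print(get_travel_date(with_label))
print(get_travel_date(without_label))
```

Returning None when neither source is present leaves the decision about a missing date to the caller instead of raising in the middle of a scrape.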
#2
If I knew the URL of your site, I would have used it as the example.
Instead, I use https://www.weather.gov/ here.

Load the website into Chrome or Firefox.
Highlight the text you are interested in and right-click.
Choose Inspect Element, then move the cursor in the inspector over the text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
Right-click --> Copy --> XPath.
Paste it into your code like this (your XPath will be different):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys


def get_stuff():
    page = None
    response = requests.get('https://www.weather.gov/')
    if response.status_code == 200:
        page = response.content
    else:
        print("can't load page")
        sys.exit(-1)
    # Parse the downloaded bytes into an element tree
    tree = html.fromstring(page)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()
results:
Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
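
Since the real URL is not available, the same XPath idea can be sketched against the HTML fragment from the question instead. This is only an illustration under assumptions: the helper name text_after_label is hypothetical, and the expression relies on the value being the first text node that follows each <strong> label, as in the posted fragment.

```python
from lxml import html

# The fragment posted in the question.
snippet = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

tree = html.fromstring(snippet)


def text_after_label(tree, label):
    """Return the first text node following <strong>label</strong>, stripped."""
    nodes = tree.xpath(
        './/strong[text()=$label]/following-sibling::text()[1]', label=label
    )
    # xpath() returns a list; it is empty when the label is absent.
    return nodes[0].strip() if nodes else None


print(text_after_label(tree, 'Travel date:'))
print(text_after_label(tree, 'Travel duration:'))
```

Matching on the label text rather than on an absolute path copied from the browser keeps the expression valid if the surrounding page layout shifts.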
#3
Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work according to your example, however, I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
#4
The values are in the text of the <p> tag.
It just needs some cleanup.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""
soup = BeautifulSoup(html, 'lxml')

>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
You can also quickly make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10'
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS supports CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors and XPath in BS and lxml in Web-Scraping part-1.
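
As a rough sketch of the CSS-selector route in BS, applied to the same fragment: select() finds each <strong> label inside the <p>, and the value is read from the text node right after it. The dictionary name details is hypothetical, and 'html.parser' is swapped in for 'lxml' only so the sketch needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup, NavigableString

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""
soup = BeautifulSoup(html, 'html.parser')

details = {}
# 'p > strong' selects every <strong> that is a direct child of a <p>.
for label in soup.select('p > strong'):
    value = label.next_sibling
    # Only accept a plain text node; skip labels followed directly by a tag.
    if isinstance(value, NavigableString):
        key = label.get_text(strip=True).rstrip(':')
        details[key] = value.strip()

print(details)
```

Reading the sibling of each label avoids splitting the whole paragraph on ': ', which could misfire if a value itself contained a colon.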
#5
Awesome! Thank you, snippsat

