Python Forum
[SOLVED] [Beautiful Soup] How to deprettify?
#1
Hello,

I made the mistake of using soup.prettify() to save soups to files, and I now have extra whitespace that shows up as stray spaces when I view the files in a WYSIWYG HTML editor.

The following code fails to remove that whitespace.

Before I write a Python script to run the files through Tidy instead, does anyone know whether this can be fixed with BS?

Thank you.

for file in glob.glob("*.html"):
    BASE = Path(file).stem
    OUTPUTFILE = fr"{BASE}.CONV.html"

    soup = BeautifulSoup(open(file,"br"),"lxml")
    for tag in soup.find_all():
        if tag.string:
            tag.string.replace_with(' '.join(tag.string.split()))
            print(tag.string)
        else:
            print(tag.name, " no string")
            pass

    with open(OUTPUTFILE, 'w', encoding='utf-8') as outp:
        outp.write(str(soup))
#2
To show the problem.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print('-' * 25)
print(str(soup))
Output:
<body>
 <h1>
  This is a Heading
 </h1>
 <p>
  This is a paragraph
 </p>
 <p>
  blue car
 </p>
</body>
-------------------------
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>
So the extra newlines are annoying (I tried to fix that a long time ago); for now, here are just some workarounds.
An easy fix is to use an online HTML formatter, e.g. Code Beautify.
Or install Prettier, which has a command-line tool; e.g. prettier --write . formats all the HTML files in a folder.
G:\div_code\html_file
λ prettier --write .
h1.html 170ms
h2.html 5ms
Then the output of both BS options above will be correctly formatted HTML.
Output:
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>
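If you would rather drive Prettier from Python than from the shell, a rough sketch could look like this. The function names are made up for illustration; it assumes prettier is installed and on PATH, and it runs the same prettier --write . command as above inside the given folder:

```python
import shutil
import subprocess


def prettier_command(executable="prettier"):
    """Build the argument list for the command shown above: prettier --write ."""
    return [executable, "--write", "."]


def run_prettier(folder):
    """Run Prettier in `folder`; return None if the executable is not found."""
    exe = shutil.which("prettier")  # also resolves prettier.cmd on Windows
    if exe is None:
        return None
    # cwd=folder makes "." refer to that folder; files are rewritten in place.
    return subprocess.run(
        prettier_command(exe),
        cwd=folder,
        capture_output=True,
        text=True,
        check=False,
    )


# Example call (not executed here), using the folder from the post:
# result = run_prettier(r"G:\div_code\html_file")
```

Since --write overwrites the files, it is worth trying this on a copy of the folder first.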
#3
Thank you.
#4
For others' benefit, here's how to do it in Beautiful Soup:

import sys
import os
import glob
import shutil
from bs4 import BeautifulSoup

ROOT = r"c:\temp"
os.chdir(ROOT)

for file in glob.glob("*.html"):
    print("Handling ", file)

    #save original file
    ORIGFILE = fr"{file}.orig"
    #grab original times
    mtime = os.stat(file).st_mtime
    atime = os.stat(file).st_atime
    tup = (atime, mtime)
    dest = shutil.copyfile(file, ORIGFILE)
    os.utime(ORIGFILE, tup)

    #Remove all carriage returns
    with open(file, "r") as f:
        dna = f.read().replace("\n", "")

    #trim each string
    soup = BeautifulSoup(dna,"lxml")
    _ = [s.replace_with(s.text.strip()) for s in soup.find_all(string=True)]

    #save soup back to file
    with open(file, 'w', encoding='utf-8') as outp:
        outp.write(str(soup))

    #Must close before updating time
    os.utime(file, tup)
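A possible variation on the script above, as a sketch rather than a drop-in replacement: instead of deleting every newline in the file and stripping every string, it collapses whitespace per string and leaves <pre>, <textarea>, <script> and <style> content alone. The function name deprettify, the PRESERVE set, and the sample input are made up for illustration, and "html.parser" is used so the example runs without lxml:

```python
from bs4 import BeautifulSoup, NavigableString

# Tags whose text is left untouched because whitespace matters there
# (hypothetical list, adjust to your files).
PRESERVE = {"pre", "textarea", "script", "style"}


def deprettify(html, parser="html.parser"):
    soup = BeautifulSoup(html, parser)
    # find_all() returns a list, so the tree can be modified while looping.
    for s in soup.find_all(string=True):
        if type(s) is not NavigableString:
            continue  # skip comments, doctypes and other special strings
        if any(parent.name in PRESERVE for parent in s.parents):
            continue
        cleaned = " ".join(s.split())  # collapse runs of whitespace, trim ends
        if cleaned:
            s.replace_with(cleaned)
        else:
            s.extract()  # whitespace-only string: drop it
    return str(soup)


sample = "<div>\n <h1>\n  This is a Heading\n </h1>\n <pre>keep   me\n  too</pre>\n</div>"
result = deprettify(sample)
print(result)
```

Like the script above, this also removes the spaces next to inline tags such as <b> or <a>, so "some <b>bold</b> text" comes out as "some<b>bold</b>text"; check a few files by hand before overwriting anything.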