Python Forum
[SOLVED] Right way to open files with different encodings?
#1
Hello,

Some of the files could be Windows-encoded (latin1, iso8859-1, cp1252), others could be utf-8.

Is try/except the right way to do it?

from bs4 import BeautifulSoup

#with open(file, 'r') as f:
#with open(file, 'r',encoding='utf-8') as f:
#latin1, iso8859-1, cp1252
with open(file, 'r',encoding='latin-1') as f:
    content_text = f.read()
soup = BeautifulSoup(content_text, 'html.parser')
Thank you.
#2
(Apr-23-2024, 08:49 AM)Winfried Wrote: Is try/except the right way to do it?
In general, there is no reliable way to decode a file whose encoding is unknown. Specialized modules such as chardet contain tools to guess the encoding of a file. It is probably the best solution, but read the FAQ of the chardet module first.

Python itself is not equipped with tools to guess encodings. Attempting to decode and catching exceptions will tell you that some encodings are not the actual encoding of the file, but a successful decode does not mean that the encoding is the correct one, and the result can be mojibake.
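To make the point concrete, here is a minimal sketch of the try/except approach using only the standard library. The helper name `decode_with_fallback` and the candidate order are assumptions for illustration, not something from the thread. Strict utf-8 goes first because it rejects most non-utf-8 input, while latin-1 accepts any byte sequence and so can only serve as a last resort.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises
    return data.decode("latin-1"), "latin-1"


utf8_bytes = "Jalapeño".encode("utf-8")
cp1252_bytes = "Jalapeño".encode("cp1252")

print(decode_with_fallback(utf8_bytes))    # ('Jalapeño', 'utf-8')
print(decode_with_fallback(cp1252_bytes))  # ('Jalapeño', 'cp1252')

# A decode that "succeeds" with the wrong codec yields mojibake:
print(utf8_bytes.decode("latin-1"))        # JalapeÃ±o
```

The order matters: if latin-1 or cp1252 were tried first, utf-8 files would decode without any exception and come out garbled.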
#3
(Apr-23-2024, 08:49 AM)Winfried Wrote: Some of the files could be Windows (latin1, iso8859-1, cp1252), others could be utf-8.
Since these are .html files, one piece of advice: if you are creating or saving these files yourself, there is a way with Requests and BS to always save them as utf-8.
If the files already exist, then, as Gribouillis mentioned, there is chardet.
For example, say I have one .html file that I saved as latin-1 and one saved as utf-8.
λ chardetect page_latin.html
page_latin.html: ISO-8859-1 with confidence 0.73

G:\div_code\html_utf
λ chardetect page_utf8.html
page_utf8.html: utf-8 with confidence 0.7525
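The same guess can be obtained from Python with `chardet.detect()`, which takes bytes and returns a dict containing `encoding` and `confidence`. The sketch below is only an illustration: the helper name `guess_encoding` is made up, and the guarded import is an assumption so that the snippet still loads when the third-party chardet package is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # assumption: degrade gracefully instead of failing


def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet, or (None, 0.0)."""
    if chardet is None:
        return None, 0.0
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


# Same markup as the test file, encoded as latin-1
sample = "<h1>Jalapeñod je pèle</h1>".encode("latin-1")
print(guess_encoding(sample))
```

The result is a statistical guess, so the confidence value is worth checking, especially for short inputs.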
from bs4 import BeautifulSoup

with open('page_latin.html', encoding='latin-1') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)

# utf-8 file; encoding passed explicitly because open() otherwise
# uses the locale's default, which is not utf-8 on every system
with open('html_new.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
    h1_tag = soup.find('h1')
    print(h1_tag)
Output:
<h1>Jalapeñod je pèle</h1>
<h1>Jalapeñod je pèle</h1>
So everything works as it should. If you take away encoding='latin-1', reading that file as utf-8 breaks with a UnicodeDecodeError.

You can also convert to utf-8, since this is what happens when a file is opened in Beautiful Soup:
Bs4 Doc Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode.
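Unicode, Dammit can also be used on its own, and it accepts a list of candidate encodings to try before it falls back to guessing. The snippet below is a small sketch of that, with the two candidates taken from the encodings discussed in this thread; the sample bytes are made up for the example.

```python
from bs4.dammit import UnicodeDammit

# Same heading as in the test file, encoded as latin-1
latin1_bytes = "<h1>Jalapeñod je pèle</h1>".encode("latin-1")

# The candidates are tried in the order given: utf-8 fails on these
# bytes, so the second candidate is used.
dammit = UnicodeDammit(latin1_bytes, ["utf-8", "cp1252"])

print(dammit.original_encoding)  # the encoding that ended up being used
print(dammit.unicode_markup)     # the decoded text as a str
```

When the encoding of a file is already known, it can be given directly to the `BeautifulSoup` constructor through its `from_encoding` argument instead.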

So, to convert from latin-1 to utf-8:
from bs4 import BeautifulSoup

with open('page_latin.html', 'rb') as fp, open('html_new.html', 'w', encoding='utf-8') as fp_out:
    file_out = fp.read()
    # When a file is opened in BS, the result is Unicode
    soup = BeautifulSoup(file_out, 'lxml')
    fp_out.write(soup.prettify())
λ chardetect html_new.html
html_new.html: utf-8 with confidence 0.7525
File used in the test; the content is the same, just saved with different encodings.
<html lang="en">
<head>
  <title>Here is site title</title>
</head>
<body>
  <h1>Jalapeñod je pèle</h1>
</body>
</html>
#4
Thanks for the info. I forgot to check the thread again.

It turns out BS rewrites the meta tag from the input file if its charset wasn't utf-8 (e.g. "text/html; charset=Windows-1252" → "text/html; charset=utf-8"), but doesn't add one if there wasn't any to begin with. And if the input file is missing a meta encoding tag, BS/Unicode, Dammit can guess the encoding wrong.

import re

from bs4 import BeautifulSoup

# IMPORTANT: With no meta tag in the input file, BS can be mistaken,
# e.g. ISO-8859-1 is wrongly guessed as ISO-8859-8
#INPUTFILE = "test.1252.no.meta.html"
INPUTFILE = "test.1252.with.meta.html"

with open(INPUTFILE, 'rb') as fp_in, open('output.utf8.html', 'w', encoding='utf-8') as fp_out:
    soup = BeautifulSoup(fp_in, 'lxml')
    print("Orig encod:", soup.original_encoding)

    # If there is no meta tag, add one since BS doesn't.
    # Attribute values are matched case-sensitively, hence the regex.
    meta = soup.head.find("meta", attrs={"http-equiv": re.compile(r"^content-type$", re.I)})
    if not meta:
        print("No meta")
        metatag = soup.new_tag('meta')
        metatag.attrs['http-equiv'] = 'Content-Type'
        metatag.attrs['content'] = 'text/html; charset=utf-8'
        soup.head.append(metatag)

    print(soup.head.prettify())
    fp_out.write(soup.prettify())