Winfried (Feb-26-2024, 02:28 PM):

Hello,

In a directory, I have a bunch of HTML files that were written in cp-1252 (i.e. Latin-1) that I need to convert to utf-8. The following doesn't seem to work: after running the loop once, the second run shows files still considered to be in cp-1252.

What's the right way to proceed? Thank you.

import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")
files = glob.glob("*.html")

for file in files:
    #detect encoding
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)

        #get original access and modification times
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)

        #convert to utf8
        data = open(file, "r").read()
        data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'w', encoding='utf-8') as outp:
            outp.write(data)

        #set creation/modification back to original date
        os.utime(file, tup)
    elif encoding == "utf-8":
        #print("File in utf-8", file)
        pass
    else:
        print("Encoding error:", file, encoding)

Gribouillis (Feb-26-2024, 03:10 PM):

You could replace line 25 (the data = open(file, "r").read() line) with

data = rawdata.decode(encoding=encoding)

Then at line 27, open the file in binary mode, without an encoding argument, because data is normally a byte string after the encode() of line 26.

Winfried (Feb-26-2024, 03:21 PM):

The two lines you mean?

#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)

#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
    outp.write(data)  #TypeError: a bytes-like object is required, not 'str' CHECK LATER

It still doesn't work: files that were supposedly converted to utf-8 in the first run are still considered as Windows files:

if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue

Also, the code above adds new carriage returns in the output :-/

<head>
<title>my title</title>
<meta name="description" content="my title">
<meta name="keywords" content="my title">
<meta name="classification" content="windows">
</head>

Gribouillis (Feb-26-2024, 03:28 PM):

After decoding, you need to add

data = data.encode(encoding = 'UTF-8', errors = 'strict')

Bytes and unicode strings are different types in Python. Try to understand what the code does in detail.

Winfried:

Using this code, the second run still says files are not in utf-8.
It doesn't look like it's the right way to convert Windows files to utf-8:

for file in files:
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)

        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)

        #convert to utf8
        data = rawdata.decode(encoding=encoding)
        data = data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'wb') as outp:
            outp.write(data)

        #set creation/modification date
        os.utime(file, tup)
    elif encoding == "utf-8":
        #print(encoding)
        #print("File in utf-8", file)
        pass
    else:
        #ISO-8859-1
        #ascii
        print("Encoding error:", file, encoding)
        #exit()

Gribouillis:

If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf-8, no matter what chardet detects.

Winfried:

Looks like it. After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.

Bottom line: chardet doesn't seem reliable for checking how a file is encoded :-/

Thanks for the help.

Gribouillis (Feb-26-2024, 06:23 PM):

(Feb-26-2024, 06:19 PM) Winfried wrote:
    After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.

If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF-8 encodings.

>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'

Winfried:

Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which will take care of (1) converting the data to utf-8 if needed and (2) adding/editing the relevant meta line in the header.

file = r"c:\temp\input.html"

with open(file, 'r') as f:
    content_text = f.read()

soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)
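For later readers, here is a minimal sketch of the full round trip with Beautiful Soup, combining the detection idea from the last post with the timestamp preservation from the first one. The directory and glob pattern are borrowed from the original code; everything else (reading raw bytes, soup.encode("utf-8"), the binary write) is one possible way to wire it up, not the poster's final script.

import glob
import os

from bs4 import BeautifulSoup

for file in glob.glob(r".\input_test\*.html"):
    # remember the original access/modification times so they can be restored
    st = os.stat(file)
    times = (st.st_atime, st.st_mtime)

    # feed raw bytes to BeautifulSoup: it uses its UnicodeDammit machinery
    # to guess the source encoding (cp1252, latin-1, ascii, ...)
    with open(file, "rb") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    print(file, "detected as", soup.original_encoding)

    # soup.encode() returns the whole document as utf-8 bytes; per the
    # Beautiful Soup docs it also rewrites an existing <meta ... charset=...>
    # declaration to match the output encoding
    with open(file, "wb") as f:
        f.write(soup.encode("utf-8"))

    # restore the original timestamps, as in the first post
    os.utime(file, times)

Two caveats: Beautiful Soup only rewrites a charset declaration that is already present, and the <head> shown earlier has none, so you may still need to add a <meta charset="utf-8"> tag yourself. As for the extra carriage returns reported above, that is what you typically get when a string that still contains \r\n (e.g. from rawdata.decode(...)) is written back in text mode on Windows, where each \n is translated to \r\n again; writing bytes with 'wb', as in the sketch, sidesteps that.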