Winfried (Feb-26-2024, 02:28 PM):

Hello,

In a directory, I have a bunch of HTML files that were written in cp-1252 (i.e. Latin-1) that I need to convert to utf-8. The following doesn't seem to work: after running the loop once, the second run shows files still considered to be in cp-1252.

What's the right way to proceed? Thank you.

import os
import glob
import chardet
from bs4 import BeautifulSoup
from datetime import datetime

os.chdir(r".\input_test")
files = glob.glob("*.html")

for file in files:
    #detect encoding
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)

        #get original access and modification times
        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)

        #convert to utf8
        data = open(file, "r").read()
        data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'w', encoding='utf-8') as outp:
            outp.write(data)

        #set creation/modification back to original date
        os.utime(file, tup)
    elif encoding == "utf-8":
        #print("File in utf-8", file)
        pass
    else:
        print("Encoding error:", file, encoding)

Gribouillis (Feb-26-2024, 03:10 PM):

You could replace line 25 (the data = open(file, "r").read() line) with

data = rawdata.decode(encoding=encoding)

Then at line 27, open the file in binary mode, without an encoding argument, because data is normally a byte string after the encode() of line 26.

Winfried (Feb-26-2024, 03:21 PM):

The two lines you mean?

#data = open(file, "r").read()
#data.encode(encoding = 'UTF-8', errors = 'strict')
data = rawdata.decode(encoding=encoding)

#with open(file, 'w', encoding='utf-8') as outp:
with open(file, 'wb') as outp:
    outp.write(data)  #TypeError: a bytes-like object is required, not 'str' CHECK LATER

It still doesn't work: files that were supposedly converted to utf-8 in the first run are still considered as Windows files:

if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
    print("File still not in utf-8",file)
    continue

Also, the code above adds new carriage returns in the output :-/

<head>
<title>my title</title>
<meta name="description" content="my title">
<meta name="keywords" content="my title">
<meta name="classification" content="windows">
</head>

Gribouillis (Feb-26-2024, 03:28 PM):

After decoding, you need to add

data = data.encode(encoding = 'UTF-8', errors = 'strict')

Bytes and unicode strings are different types in Python. Try to understand what the code does in detail.

Winfried:

Using this code, the second run still says files are not in utf-8.
It doesn't look like it's the right way to convert Windows files to utf-8:

for file in files:
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    if encoding in ["Windows-1252","ascii","ISO-8859-1"]:
        print("File still not in utf-8",file)
        continue

        print("Converting ",file)

        atime = os.stat(file).st_atime
        mtime = os.stat(file).st_mtime
        tup = (atime, mtime)

        #convert to utf8
        data = rawdata.decode(encoding=encoding)
        data = data.encode(encoding = 'UTF-8', errors = 'strict')
        with open(file, 'wb') as outp:
            outp.write(data)

        #set creation/modification date
        os.utime(file, tup)
    elif encoding == "utf-8":
        #print(encoding)
        #print("File in utf-8", file)
        pass
    else:
        #ISO-8859-1
        #ascii
        print("Encoding error:", file, encoding)
        #exit()

Gribouillis:

If a unicode string is encoded in utf-8 and written to a file, the file is encoded in utf-8, no matter what chardet detects.

Winfried:

Looks like it. After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0", while both Notepad++ and Notepad2 say it's utf-8.

Bottom line: chardet doesn't seem reliable for checking how a file is encoded :-/

Thanks for the help.

Gribouillis (Feb-26-2024, 06:23 PM):

(Feb-26-2024, 06:19 PM) Winfried wrote:
    After converting it to utf-8, chardet says a file is still "ascii with confidence 1.0" while both Notepad++ and Notepad2 say it's utf-8.

If a file contains only ASCII characters, there is no difference at all between the ASCII and the UTF-8 encodings.

>>> s = 'hello world'
>>> s.encode('utf8')
b'hello world'
>>> s.encode('ascii')
b'hello world'

Winfried:

Turns out there's a much easier solution: just open the file and feed it to Beautiful Soup, which will take care of (1) converting the data to utf-8 if needed and (2) adding/editing the relevant meta line in the header.

file = r"c:\temp\input.html"

with open(file, 'r') as f:
    content_text = f.read()

soup = BeautifulSoup(content_text, 'html.parser')
print(soup.head)
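For later readers, here is a minimal sketch of the full round trip with Beautiful Soup, combining the detection idea from the last post with the timestamp preservation from the first one. The directory and glob pattern are borrowed from the original code; everything else (reading raw bytes, soup.encode("utf-8"), the binary write) is one possible way to wire it up, not the poster's final script.

import glob
import os

from bs4 import BeautifulSoup

for file in glob.glob(r".\input_test\*.html"):
    # remember the original access/modification times so they can be restored
    st = os.stat(file)
    times = (st.st_atime, st.st_mtime)

    # feed raw bytes to BeautifulSoup: it uses its UnicodeDammit machinery
    # to guess the source encoding (cp1252, latin-1, ascii, ...)
    with open(file, "rb") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    print(file, "detected as", soup.original_encoding)

    # soup.encode() returns the whole document as utf-8 bytes; per the
    # Beautiful Soup docs it also rewrites an existing <meta ... charset=...>
    # declaration to match the output encoding
    with open(file, "wb") as f:
        f.write(soup.encode("utf-8"))

    # restore the original timestamps, as in the first post
    os.utime(file, times)

Two caveats: Beautiful Soup only rewrites a charset declaration that is already present, and the <head> shown earlier has none, so you may still need to add a <meta charset="utf-8"> tag yourself. As for the extra carriage returns reported above, that is what you typically get when a string that still contains \r\n (e.g. from rawdata.decode(...)) is written back in text mode on Windows, where each \n is translated to \r\n again; writing bytes with 'wb', as in the sketch, sidesteps that.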