Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from pathlib import Path

import pandas

out_path = Path('tmp.xml')


def generate_xml(N):
    record = '''
  <outer>
    <inner1>1</inner1>
    <inner2>b</inner2>
  </outer>
'''
    with open(out_path, 'w') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<root_elem>')
        for i in range(1, N):
            f.write(record)
        f.write('</root_elem>')


## bigger files take a long time to load, but for me, the 'magic number' is between 10M and 20M:
## 20M always fails to read, 10M always succeeds
## I think more accurate predictor is file size, which needs to be over 1.2GB roughly.

## Small sanity check:
generate_xml(1000)
pandas.read_xml(out_path)

## The bug:
generate_xml(20000000)

from xml.etree import ElementTree as ET

## File is obviously valid XML, and ET can parse it
tmp = ET.parse(out_path)

## The following fails
pandas.read_xml(out_path)

Issue Description
Reading large XML files with pandas.read_xml on Windows 11 (not reproduced on Linux) raises XMLSyntaxError: switching encoding: encoder error, line 1, column 1.
The failure starts at a file size somewhere between 1.2 GB and 1.6 GB.
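To narrow down the trigger size, it may help to translate a target file size into a record count before generating the file. The sketch below is only an estimate: `RECORD` assumes a particular whitespace layout for the record string, and `estimated_size`, `records_for_size`, and the `crlf` flag are hypothetical helpers, not part of pandas. The `crlf` flag accounts for the fact that a file opened in text mode on Windows writes each `\n` as `\r\n`.

```python
# Assumed layout of the pieces written by generate_xml; the exact
# whitespace inside RECORD is a guess and changes the result.
HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<root_elem>'
RECORD = "\n  <outer>\n    <inner1>1</inner1>\n    <inner2>b</inner2>\n  </outer>\n"
FOOTER = "</root_elem>"


def _nbytes(text, crlf):
    """Byte length of text as UTF-8, optionally with \n written as \r\n."""
    if crlf:
        text = text.replace("\n", "\r\n")
    return len(text.encode("utf-8"))


def estimated_size(n_records, crlf=False):
    """Estimated file size in bytes for a file holding n_records records."""
    return (
        _nbytes(HEADER, crlf)
        + n_records * _nbytes(RECORD, crlf)
        + _nbytes(FOOTER, crlf)
    )


def records_for_size(target_bytes, crlf=False):
    """Smallest record count whose estimated size reaches target_bytes."""
    overhead = _nbytes(HEADER, crlf) + _nbytes(FOOTER, crlf)
    per_record = _nbytes(RECORD, crlf)
    # Ceiling division without floats.
    return max(0, -(-(target_bytes - overhead) // per_record))
```

Note that `generate_xml(N)` loops over `range(1, N)` and so writes `N - 1` records; to get `n` records, call `generate_xml(n + 1)`.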
Expected Behavior
pandas should either parse the file or raise an accurate error (about running out of memory, for example). The misleading error is especially frustrating when working with third-party XML files, where a wrong encoding is entirely plausible, so it triggers a futile search for a problem in the file itself.
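Since ElementTree can parse the file, one possible stopgap is to stream the records with the standard library and build the rows by hand. This is a minimal sketch, not a verified fix for the error above: `iter_records` is a hypothetical helper, and it assumes every record is a direct child of the root element, as in the generated file.

```python
import xml.etree.ElementTree as ET


def iter_records(source, record_tag="outer"):
    """Yield one dict per <record_tag> element without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # Map each child tag to its text, e.g. {"inner1": "1", "inner2": "b"}.
            yield {child.tag: child.text for child in elem}
            # Drop records that were already processed so memory stays bounded.
            root.clear()
```

The rows can then be collected with something like `pandas.DataFrame(list(iter_records(out_path)))`. All values come back as strings, so any numeric conversion that `read_xml` would normally infer has to be done explicitly.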
Installed Versions
pandas : 2.2.2
 numpy : 1.26.4
 pytz : 2024.1
 dateutil : 2.9.0.post0
 setuptools : None
 pip : 24.1.1
 Cython : None
 pytest : None
 hypothesis : None
 sphinx : None
 blosc : None
 feather : None
 xlsxwriter : None
 lxml.etree : 5.2.2
 html5lib : None
 pymysql : None
 psycopg2 : None
 jinja2 : None
 IPython : 8.26.0
 pandas_datareader : None
 adbc-driver-postgresql: None
 adbc-driver-sqlite : None
 bs4 : None
 bottleneck : None
 dataframe-api-compat : None
 fastparquet : None
 fsspec : None
 gcsfs : None
 matplotlib : None
 numba : None
 numexpr : None
 odfpy : None
 openpyxl : None
 pandas_gbq : None
 pyarrow : 16.1.0
 pyreadstat : None
 python-calamine : None
 pyxlsb : None
 s3fs : None
 scipy : None
 sqlalchemy : None
 tables : None
 tabulate : None
 xarray : None
 xlrd : None
 zstandard : None
 tzdata : 2024.1
 qtpy : None
 pyqt5 : None