I ran into another "list index out of range" error. When parsing a large file using pymupdf.layout + pymupdf4llm, I get the following traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 83, in to_markdown
    parsed_doc = parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/__init__.py", line 42, in parse_document
    return document_layout.parse_document(
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/document_layout.py", line 908, in parse_document
    utils.clean_tables(page, blocks)
  File "/usr/local/lib/python3.10/site-packages/pymupdf4llm/helpers/utils.py", line 261, in clean_tables
    y_vals = [y_vals0[0]]
IndexError: list index out of range

Versions:
pymupdf4llm: 0.2.5
pymupdf-layout: 1.26.6
The commands used were:
doc = pymupdf.open(pdf_name)
md_chunks = pymupdf4llm.to_markdown(doc)

The PDF file is 142 MB, so I cannot upload it here.
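
Since the file is too large to share, one way to narrow the problem down might be to convert the document one page at a time and record which pages raise. The sketch below is not part of the original report: `find_failing_pages` and `failing_pages_of_pdf` are hypothetical helper names, and it assumes that `to_markdown` accepts a `pages` list of 0-based page numbers in the installed version and that importing `pymupdf.layout` before `pymupdf4llm` is how the layout mode is enabled.

```python
from typing import Callable, Dict


def find_failing_pages(page_count: int,
                       convert_page: Callable[[int], object]) -> Dict[int, str]:
    """Call convert_page(pno) for each page and collect the errors.

    Returns a dict mapping the 0-based page number to a short
    "ExceptionName: message" string for every page that raised.
    """
    failures: Dict[int, str] = {}
    for pno in range(page_count):
        try:
            convert_page(pno)
        except Exception as exc:  # deliberately broad: we only want to locate pages
            failures[pno] = f"{type(exc).__name__}: {exc}"
    return failures


def failing_pages_of_pdf(pdf_name: str) -> Dict[int, str]:
    """Hypothetical wrapper around the commands from the report."""
    # Imported lazily so the helper above works without PyMuPDF installed.
    import pymupdf
    import pymupdf.layout  # noqa: F401  (assumption: enables the layout mode)
    import pymupdf4llm

    doc = pymupdf.open(pdf_name)
    # Assumption: `pages` restricts the conversion to the given 0-based pages.
    return find_failing_pages(
        doc.page_count,
        lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno]),
    )
```

If only a few pages are reported, they could probably be copied into a small PDF (for example with `Document.insert_pdf` and its `from_page`/`to_page` arguments) that is small enough to attach as a reproducer.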
P.S. These files are part of the Dutch government's open data and are important to parse. Unfortunately, they vary greatly in quality and size. On the other hand, that makes them great test cases.