Skip to content

Conversation

@NamelessCoder
Copy link

Fixes an issue with parsing some PDF documents which contain this instruction. I have no idea what it means.

Type of pull request

  • Bug fix (involves code and configuration changes)
  • New feature (involves code and configuration changes)
  • Documentation update
  • Something else

About

I am working on parsing a HUGE library of PDF files and some of those files contain an endobj header element
which isn't handled by the parser, and this results in an exception. The files do parse correctly if this header
element is simply ignored. It doesn't seem to have any practical use.

Checklist for code / configuration changes

See CONTRIBUTING.md for all essential information about contributing.

Fixes an issue with parsing some PDF documents which contain this instruction. I have no idea what it means.
@k00ni
Copy link
Collaborator

k00ni commented Nov 3, 2025

Thank you for your PR.

Is it still work in progress?

If not, there are a few tasks left to solve before I take a closer look. Please read https://github.com/smalot/pdfparser/blob/master/CONTRIBUTING.md for more information.

@NamelessCoder
Copy link
Author

@k00ni It's no longer WIP.

I assume you mean the part about needing a test? I did look in the test cases but didn't find any current tests that deal with the "skip unknowns" behavior in the parser class, so I wasn't sure if the policy would be to allow it w/o test, or require a test to be written for just that one thing (same as other issueXX tests), or a general test to nail down the "skip unknowns" as a whole.

Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing code without tests that cover the changes might lead to unintended side effects. We therefore only allow code changes without tests under rare circumstances.

But lets take a step back and get an overview about the underlying issue:

  1. I assume you receive an Exception (Invalid type) without this code change?
  2. Can you provide a PDF which triggers this problem and is free of charge so we can use it in the test suite?
  3. If you are not able to produce/provide a usable PDF (e.g. because it contains sensitive data) we have to think about a basic test case which triggers your code.

Don't get me wrong, I know that a faulty library can disrupt a project in a significant way but just because a PDF fails to parse doesn't mean the library needs a change. It might be because of the PDF itself, the library or somewhere in between. I hope we can find a way to "harden" the library but also add at least 1 test to cover the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment