I've created a simple pdf [hi.pdf] with the word hi and when I open it in Notepad++, its encoding is ANSI, which I assume is Notepad++'s best guess, with it opening successfully when I Save as hiSaveAs.pdf.
However, when I copy the contents of hi.pdf from Notepad++, pasting into a new file and saving as hiANSI.pdf with an encoding of ANSI, the file is corrupted and can't be opened:
Error, failed to load pdf document. - When I re-open
hiANSI.pdfin Notepad++, it has UTF8 listed as the encoding and when I compare it tohi.pdf, I notice it has whitespaces wherehi.pdfhas theNULcharacter: - If I change the encoding of
hiANSI.pdfto ANSI instead of UTF8, the text differs fromhi.pdfeven more:
Can someone explain what is happening here?
- Why does Save as work, but copying the exact same text into a new Notepad++ file results with a whitespace instead of the
NULchar? - Why does Notepad++ think
hiANSI.pdfis UTF8, buthi.pdfANSI?
This does not answer this question.
The MSB is not being stripped. Have a look at the hex comparison:
For example, why is 0A being added between 0D and 25 (first row, 10th byte)?
UPDATE:
I noticed Notepad did much less than Notepad++ in terms of "helping". For example when I saved hi.pdf as hiANSI.pdf using Notepad instead of Notepad++, the only thing Notepad did to help was add 0x0A (line feed) after 0x0D (carriage return), and replaced 0x00 (NUL) with 0x20 (space):
If I saved hi.pdf as hiANSI.bin, it did even less. It just replaced 0x00 with 0x20:
In the above two cases, it produced a valid PDF but with "hi" replaced with "IJ":
UPDATE
If I replace the following 0x20 bytes in hiANSI.pdf with 0x00 to match hi.pdf, it displays "hi" instead of "IJ" but with a different font:
Here are the two bytes I changed (highlighted in yellow):
Why does changing these two bytes have this effect?







