Hi, I think the title doesn’t capture my situation all that well.
And apologies in advance for a somewhat long post.
I am writing a little Python script that takes as input
- a larger PDF with a bunch of songs (say 300) in it[1] and
- a list of songs to extract (say 20)
and should produce two smaller PDFs as output: one with the listed songs, chords and all, as they appear in the source, and the other with the same songs without the chords (so approximately “every other line” from the source PDF). As you can guess, one is for the band and the other for the singer.
I have written all the code to detect which blocks are needed to produce the result and where they should go. The output looks correct if I render it in a base14 font, say helv (the test source is Arial, so all the metrics happen to work out), but I am having trouble generating the output when I try to carry over the original fonts as well.
The emission stage in my script is a loop that, for each page, calls page.insert_font(fontname=…, fontbuffer=…) with the previously extracted fonts (using get_fonts()), then scans all the characters and calls page.insert_text((char["origin"][0] + offsetx, char["origin"][1] + offsety), char["c"], fontname=…, …) to transfer the characters one at a time and carry over all their positions[2].
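For concreteness, the loop looks roughly like the sketch below. The helper name emit_chars and the record layout (keys "origin", "c", "font", "size") are made up for this post; page is the destination page, and fonts maps my own reference names to the font buffers extracted earlier. Only insert_font() and insert_text() are actual PyMuPDF calls, so treat the rest as illustration rather than my exact code.

```python
def emit_chars(page, chars, fonts, offset=(0.0, 0.0)):
    """Copy single characters onto `page`, keeping their original origins.

    page   -- destination page (a PyMuPDF Page, or any object offering the
              same insert_font / insert_text methods)
    chars  -- iterable of dicts with keys "origin", "c", "font", "size"
              (hypothetical layout, flattened from the extracted text)
    fonts  -- dict: reference name -> font file bytes taken from the source
    offset -- (dx, dy) shift applied to every character origin
    Returns the number of characters written.
    """
    dx, dy = offset

    # Register every extracted font on the page once, before drawing.
    for name, buffer in fonts.items():
        page.insert_font(fontname=name, fontbuffer=buffer)

    count = 0
    for ch in chars:
        x, y = ch["origin"]
        # One call per character: precise, but this is the slow part.
        page.insert_text(
            (x + dx, y + dy),
            ch["c"],
            fontname=ch["font"],
            fontsize=ch["size"],
        )
        count += 1
    return count
```

With fontname="helv" and no insert_font() call this produces the right text; with the extracted buffers it produces the tofu described next.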
The issue is that I get all the right content if I use a base14 font, but if I use the original document’s fonts, I get a whole pile of tofu boxes instead. I suspect the reason behind this is that the original font is subsetted, and the character I’m passing in is not mapped through the subset’s CMAP.
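To test the subsetting suspicion, a check along these lines might help; it is a minimal sketch, not something I have validated on my file. It relies on the PDF convention that a subset font's base name starts with a six-uppercase-letter tag and a plus sign (for example ABCDEF+ArialMT). The helper names are my own. For the second helper, font is assumed to be a pymupdf.Font built from the extracted buffer, whose has_glyph() I understand to return 0 when the font has no glyph for a code point.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "ABCDEF+ArialMT".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_name(basefont):
    """Return True if the base font name carries a subset tag."""
    return bool(_SUBSET_TAG.match(basefont))


def chars_without_glyph(font, text):
    """Return the sorted distinct characters of `text` that `font` cannot map.

    `font` is assumed to offer has_glyph(codepoint) returning 0 when the
    glyph is missing (as I understand pymupdf.Font to do).
    """
    return sorted({c for c in text if not font.has_glyph(ord(c))})
```

If is_subset_name() is true for the source fonts and chars_without_glyph() returns most of the lyrics' characters, that would support the theory that the Unicode lookup fails in the extracted subset.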
It seems that I would need to transfer over the CMAP and pass the glyph indices in the font, instead of the Unicode codepoints, to a method like insert_text(), right? Or register the CMAP with the font and let PyMuPDF deal with the Unicode to glyph-index lookup? I could find no way of achieving either of these, though, and searching for this did not turn up anything useful.
Or even, is there a way that I could just copy-paste the spans across from source to dest? Or something along those lines?
One last thought about this last question: I do need the relative positioning of the characters in the original and copied pages to be kept precisely (the chords “float” above the lyrics, and they must not be repositioned relative to each other), and I need the ability to remove “every other line” (roughly) to produce “lyrics-only” versions for the singer. For this second reason I suspect an approach based on clipping/cropping and using show_pdf_page() might be too coarse (and I’m somewhat concerned that clipping with rectangles would let through small parts of adjacent content that should not be in the output).
Any help much appreciated,
with thanks
Luca
[1] Imagine pages with a few songs like this https://www.pinterest.com/pin/acoustic-guitar-chords-and-lyrics--23784704274174095/ on each page.
[2] Going one character at a time is very slow: it takes a second or three per song. I was hoping to improve on this once the rest of the script was working correctly. I don’t love it.
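One improvement I have in mind for [2] is to merge adjacent characters into runs and issue one insert_text() call per run. Below is a minimal sketch under stated assumptions: each character record carries "origin", "bbox", "c", "font" and "size" (a hypothetical layout of mine), and the right edge of a character's bbox is taken as where the next glyph would naturally start. A new run begins whenever the font, size or baseline changes, or whenever there is a horizontal gap, so chords spaced apart by positioning rather than by space characters stay separate. Within a run the positions would follow the destination font's own advance widths, so I would still have to verify that they match the source.

```python
def group_into_runs(chars, tolerance=0.5):
    """Merge consecutive characters into runs that can be emitted in one call.

    chars     -- iterable of dicts with "origin", "bbox", "c", "font", "size"
    tolerance -- maximum baseline or horizontal deviation (in points) for a
                 character to count as directly following the previous one
    Returns a list of dicts with "origin", "text", "font", "size".
    """
    runs = []
    end_x = None  # right edge of the last character of the current run
    for ch in chars:
        x, y = ch["origin"]
        last = runs[-1] if runs else None
        continues = (
            last is not None
            and last["font"] == ch["font"]
            and last["size"] == ch["size"]
            and abs(last["origin"][1] - y) <= tolerance  # same baseline
            and abs(x - end_x) <= tolerance              # no horizontal gap
        )
        if continues:
            last["text"] += ch["c"]
        else:
            runs.append(
                {"origin": (x, y), "text": ch["c"],
                 "font": ch["font"], "size": ch["size"]}
            )
        end_x = ch["bbox"][2]
    return runs
```

Each run would then be written with a single insert_text(run["origin"], run["text"], …) call instead of one call per character.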