I'm pretty sure this code is based on something snippsat wrote some time ago; it's too clever for me.
PyPDF2 is good, but it can't get some text, for reasons I don't understand.
# importing required modules
import pdfplumber

path2pdf = '/home/pedro/pdfExtractedPages/The_Knights_Tale_Modern_English.pdf'
path2text = '/home/pedro/temp/The_Knights_Tale_Middle_English.txt'

"""
>>> test = enumerate(pages, 1) # 1 starts counting at 1, set 0 to count from zero
>>> test
<enumerate object at 0x7f586adb5880>
>>> for t in test:
        print(t)

(1, <Page:1>)
(2, <Page:2>)
(3, <Page:3>)
"""

text_pages = []
# open the PDF named in path2pdf (the original used an undefined name, pdf_file)
with pdfplumber.open(path2pdf) as pdf:
    pages = pdf.pages
    for page in pages:
        # fall back to '' in case a page yields no text, so join() below can't fail
        text = page.extract_text() or ''
        text_pages.append(text)

print('text_pages is now', len(text_pages), 'long. Now joining the list to a string ... ')
textstring = ''.join(text_pages)
with open(path2text, 'w') as tf:
    tf.write(textstring)

Gets the bits other modules can't reach! Worked for me, thank you snippsat!
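The enumerate note in the docstring could also be put to use when joining the pages. What follows is only a minimal sketch, not part of snippsat's code: join_pages and the "--- page N ---" marker are hypothetical names I made up for illustration, and the sample list stands in for the text_pages list built by the script. It treats a missing page text (None) as an empty string.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]], start: int = 1) -> str:
    """Join per-page text into one string, with a marker line before each page.

    Hypothetical helper for illustration only; the marker format is an assumption.
    """
    parts = []
    # enumerate(..., start) numbers the pages from `start` (1 by default)
    for number, text in enumerate(page_texts, start):
        parts.append(f"--- page {number} ---")
        # a page with no text (None) becomes an empty string
        parts.append(text or "")
    return "\n".join(parts)


# stand-in for the text_pages list; the middle page has no text
sample = ["first page", None, "third page"]
print(join_pages(sample))
```

With the script above, the call would be textstring = join_pages(text_pages) in place of the ''.join(...) line. The markers make it easier to see which PDF page a passage came from.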