Skip to content

Conversation

@mahmoodfathy
Copy link
Contributor

Summary of Changes:

  • added a new function to process pdfs into chunks
  • added a new function to parse a pdf document using langchain
  • added support for the pdf icon in the UI

What it should look like:
image



@staticmethod
def _split_PDF_into_paragraphs(texts:List[PDFDocument],minimum_length=256):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move this to pdf_parser, class BasicDocument's content should stay :str

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 i refactored , should be now resolved , moved it into pdf parser file and refactored the logic

type: DocumentType
title: str
content: str
content: Union[str, List[PDFDocument]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be str.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 reverted and changed back to as it was str

loader = PyPDFLoader(input_filename)
documents = loader.load()
text_split = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_split.split_documents(documents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add here the pdf->text, this func should just return str

Copy link
Contributor Author

@mahmoodfathy mahmoodfathy Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 refactored and now pdf_to_text function returns a str

python-dateutil
httplib2
pypdf
pycryptodome
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure all of these needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 yes they are required by pypdf and langchain

if document.file_type != FileType.PDF:
paragraphs = Indexer._split_into_paragraphs(document.content)
else:
paragraphs = split_PDF_into_paragraphs(document.content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now when content is always simple str, is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 yes because the way i split documents is different than the other one , split_PDF_into_paragraphs is different than this Indexer._split_into_paragraphs so i have to check the file type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 with your above suggestion to move the split_PDF_into_paragraphs function to the pdf parser module it is important to do this check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Roey7 reverted back after discussions

@yuvalsteuer yuvalsteuer merged commit d4c7ce4 into GerevAI:main Apr 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants