Feature: Adding support for processing and searching pdfs inside google drive #94

mahmoodfathy · 2023-04-18T08:38:02Z

Summary of Changes:

Added a new function to process PDFs into chunks
Added a new function to parse a PDF document using langchain
Added support for the PDF icon in the UI

What it should look like:

Roey7 · 2023-04-18T11:30:44Z

app/indexing/index_documents.py

+
+
+ @staticmethod
+ def _split_PDF_into_paragraphs(texts:List[PDFDocument],minimum_length=256):


Please move this to pdf_parser. The content field of class BasicDocument should stay str.

@Roey7 refactored, this should be resolved now: I moved it into the pdf parser file and reworked the logic.

Roey7 · 2023-04-18T11:30:56Z

app/data_source/api/basic_document.py

 type: DocumentType
 title: str
- content: str
+ content: Union[str, List[PDFDocument]]


Should be str.

@Roey7 reverted; content is str again, as it was before.

Roey7 · 2023-04-18T11:31:22Z

app/parsers/pdf.py

+loader = PyPDFLoader(input_filename)
+documents = loader.load()
+text_split = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
+texts = text_split.split_documents(documents)


Add the PDF-to-text conversion here; this function should just return a str.

@Roey7 refactored; the pdf_to_text function now returns a str.
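For readers following the thread, here is a minimal sketch of what "return a str" could look like. It is not the code from this PR: `Page` and `pages_to_text` are hypothetical names, and `Page` only stands in for the objects returned by `loader.load()` in the diff above, which expose their text as `page_content`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is used."""
    page_content: str


def pages_to_text(pages: Iterable[Page]) -> str:
    """Join the non-empty page texts into one plain string."""
    cleaned = (page.page_content.strip() for page in pages)
    # A blank line between pages lets a later paragraph splitter see page breaks.
    return "\n\n".join(text for text in cleaned if text)


# With langchain installed, the pages could come from the loader in the diff:
#   pages = PyPDFLoader(input_filename).load()
text = pages_to_text([Page("First page. "), Page("   "), Page("Second page.")])
print(repr(text))
```

Returning one string keeps BasicDocument.content a plain str, which is what the review asks for; chunking can then happen later in the indexer.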


Roey7 · 2023-04-23T15:09:56Z

app/requirements.txt

+python-dateutil
+httplib2
+pypdf
+pycryptodome


Are you sure all of these are needed?

@Roey7 yes, they are required by pypdf and langchain.

Roey7 · 2023-04-23T15:10:45Z

app/indexing/index_documents.py

+ if document.file_type != FileType.PDF:
+ paragraphs = Indexer._split_into_paragraphs(document.content)
+ else:
+ paragraphs = split_PDF_into_paragraphs(document.content)


Now that content is always a plain str, is this needed?

@Roey7 yes, because PDFs are split differently from other documents: split_PDF_into_paragraphs is not the same as Indexer._split_into_paragraphs, so I have to check the file type.

@Roey7 with your suggestion above to move split_PDF_into_paragraphs into the pdf parser module, this check is important.

@Roey7 reverted after discussion.
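Once content is a plain str, one splitter can serve every file type. The sketch below is an assumption about how such a splitter might work, not the project's actual Indexer._split_into_paragraphs: the function name, the `delimiter` parameter, and the merge rule are hypothetical, and only the `minimum_length=256` default is taken from the diff.

```python
from typing import List


def split_into_paragraphs(text: str, delimiter: str = "\n\n",
                          minimum_length: int = 256) -> List[str]:
    """Split on delimiter, merging neighbours until a chunk reaches minimum_length."""
    paragraphs: List[str] = []
    current = ""
    for piece in text.split(delimiter):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated delimiters
        current = f"{current}\n{piece}" if current else piece
        if len(current) >= minimum_length:
            paragraphs.append(current)
            current = ""
    if current:
        # Keep the trailing remainder even if it is shorter than minimum_length.
        paragraphs.append(current)
    return paragraphs


chunks = split_into_paragraphs("a\n\nb\n\nc", minimum_length=3)
print(chunks)
```

Because the delimiter is a parameter, text extracted from a PDF and text from other sources can share the same code path without a file-type check.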


Feature:added support for pdfs inside google drive to be processed

fcb1090

Roey7 reviewed Apr 18, 2023


Refactor: Refactored splitting the PDF into paragraphs logic and resolved PR comments

78993eb

Roey7 reviewed Apr 23, 2023


mahmoodfathy added 2 commits April 23, 2023 21:27

removed split_PDF_into_paragraphs as it is not needed-current function works after changing the delimeter

a7ee8ae

Chore:remove extra unecessary lines

1f85426

yuvalsteuer merged commit d4c7ce4 into GerevAI:main Apr 30, 2023
