Skip to content

Conversation

@MthwRobinson
Copy link
Contributor

Summary

Adds tracked metadata from unstructured elements to the document metadata when UnstructuredFileLoader is used in "elements" mode. Tracked metadata is available in unstructured>=0.4.9, but the code is written for backward compatibility with older unstructured versions.

Testing

Before running, make sure to upgrade to unstructured==0.4.9. In the code snippet below, you should see page_number, filename, and category in the metadata for each document. doc[0] should have page_number: 1 and doc[-1] should have page_number: 2. The example document is layout-parser-paper-fast.pdf from the unstructured sample docs.

from langchain.document_loaders import UnstructuredFileLoader loader = UnstructuredFileLoader(file_path=f"layout-parser-paper-fast.pdf", mode="elements") docs = loader.load()
@hwchase17 hwchase17 merged commit 3ea1e5a into langchain-ai:master Feb 16, 2023
dongreenberg pushed a commit to dongreenberg/langchain that referenced this pull request Feb 17, 2023
### Summary Adds tracked metadata from `unstructured` elements to the document metadata when `UnstructuredFileLoader` is used in `"elements"` mode. Tracked metadata is available in `unstructured>=0.4.9`, but the code is written for backward compatibility with older `unstructured` versions. ### Testing Before running, make sure to upgrade to `unstructured==0.4.9`. In the code snippet below, you should see `page_number`, `filename`, and `category` in the metadata for each document. `doc[0]` should have `page_number: 1` and `doc[-1]` should have `page_number: 2`. The example document is `layout-parser-paper-fast.pdf` from the [`unstructured` sample docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs). ```python from langchain.document_loaders import UnstructuredFileLoader loader = UnstructuredFileLoader(file_path=f"layout-parser-paper-fast.pdf", mode="elements") docs = loader.load() ```
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023
### Summary Adds tracked metadata from `unstructured` elements to the document metadata when `UnstructuredFileLoader` is used in `"elements"` mode. Tracked metadata is available in `unstructured>=0.4.9`, but the code is written for backward compatibility with older `unstructured` versions. ### Testing Before running, make sure to upgrade to `unstructured==0.4.9`. In the code snippet below, you should see `page_number`, `filename`, and `category` in the metadata for each document. `doc[0]` should have `page_number: 1` and `doc[-1]` should have `page_number: 2`. The example document is `layout-parser-paper-fast.pdf` from the [`unstructured` sample docs](https://github.com/Unstructured-IO/unstructured/tree/main/example-docs). ```python from langchain.document_loaders import UnstructuredFileLoader loader = UnstructuredFileLoader(file_path=f"layout-parser-paper-fast.pdf", mode="elements") docs = loader.load() ```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants