@MthwRobinson

Summary

Closes #464 and addresses user requests in langchain#979.

Adds a url kwarg to partition so that users can process remote documents directly. If the user passes in content_type, partition processes the document using content_type as the MIME type. Otherwise, it uses the Content-Type header to determine the MIME type.
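
The precedence described above (explicit content_type first, Content-Type header second) can be sketched roughly as follows. This is an illustration only, not the code from this PR: `resolve_content_type` is a hypothetical helper name, and stripping header parameters such as `; charset=utf-8` is an assumption about how the header value might be normalized.

```python
from typing import Mapping, Optional


def resolve_content_type(
    content_type: Optional[str],
    headers: Mapping[str, str],
) -> Optional[str]:
    """Hypothetical helper: choose the MIME type for a remote document."""
    # An explicit content_type from the caller always wins.
    if content_type:
        return content_type

    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_value = lowered.get("content-type")
    if header_value is None:
        return None

    # Assumption: drop parameters such as "; charset=utf-8".
    return header_value.split(";")[0].strip()
```

With this sketch, passing `content_type="text/markdown"` overrides a server that reports `text/plain`, which is the situation the markdown example under Testing appears to work around.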

Testing

    from unstructured.partition.auto import partition

    # A markdown file
    url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md"
    elements = partition(url=url, content_type="text/markdown")

    # An HTML file
    url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-april-11-2023"
    elements = partition(url=url)

    # A PDF file
    url = "https://www.understandingwar.org/sites/default/files/Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"
    elements = partition(url=url, strategy="fast")

@MthwRobinson MthwRobinson requested a review from qued April 12, 2023 14:56
@MthwRobinson MthwRobinson changed the title from "efeat: add url kwarg to partititon" to "feat: add url kwarg to partititon" Apr 12, 2023
@qued qued left a comment

LGTM!

@MthwRobinson MthwRobinson enabled auto-merge (squash) April 12, 2023 18:01
@MthwRobinson MthwRobinson merged commit e2e473d into main Apr 12, 2023
@MthwRobinson MthwRobinson deleted the feat/other-content-types branch April 12, 2023 18:31

3 participants