@MthwRobinson
Contributor

Summary

Adds an `UnstructuredURLLoader` that supports loading data from a list of URLs.

Testing

```python
from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

loader = UnstructuredURLLoader(urls=urls)
raw_documents = loader.load()
```
Contributor

@hwchase17 hwchase17 left a comment

looks awesome! thanks

@hwchase17 hwchase17 merged commit 07a407d into langchain-ai:master Feb 10, 2023
dongreenberg pushed a commit to dongreenberg/langchain that referenced this pull request Feb 17, 2023
…ain-ai#979)

### Summary

Adds a `UnstructuredURLLoader` that supports loading data from a list of URLs.

### Testing

```python
from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]

loader = UnstructuredURLLoader(urls=urls)
raw_documents = loader.load()
```
@blob42 blob42 mentioned this pull request Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023
@Boscop

Boscop commented Apr 9, 2023

@MthwRobinson Can you please adjust `UnstructuredURLLoader` to allow loading `text/plain` files from URLs? (Ideally also other MIME types like `text/markdown`.)
Currently I'm getting `Error fetching or processing https://raw.githubusercontent.com/[...], exeption: Expected content type text/html. Got text/plain; charset=utf-8.`
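
For readers following along, here is a minimal sketch of how a strict HTML-only check can produce the error message above. It is not the library's actual implementation: `media_type` and `check_html` are hypothetical helpers written only for illustration, using the standard library to split the `Content-Type` header into its media type and parameters.

```python
from email.message import Message


def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def check_html(content_type_header: str) -> None:
    """Hypothetical strict check that only accepts HTML responses."""
    if media_type(content_type_header) != "text/html":
        # Assumption: the message format is copied from the error reported above.
        raise ValueError(
            f"Expected content type text/html. Got {content_type_header}."
        )
```

With these helpers, `check_html("text/html; charset=utf-8")` passes silently, while `check_html("text/plain; charset=utf-8")` raises a `ValueError`, which is the situation a raw GitHub file URL runs into.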

@MthwRobinson
Contributor Author

@Boscop - Thanks for flagging. We added an issue in the unstructured library to support other MIME types and will pick it up as soon as we can.

@IqraShahid-dev

@MthwRobinson I got an error for a PDF while fetching from an S3 bucket:

```
raise ValueError(f"Expected content type text/html. Got {content_type}.")
ValueError: Expected content type text/html. Got application/pdf.
```
@MthwRobinson
Contributor Author

#2793 will allow for processing non-HTML resources with the URL loader. @Boscop - you can pass the `content_type` kwarg into the loader to force it to use a specific MIME type.
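
A minimal sketch of what passing the `content_type` keyword argument mentioned above might look like. It assumes a langchain version that includes #2793 and that the loader forwards extra keyword arguments to the unstructured library. `loader_kwargs` and `load_with_content_type` are hypothetical helpers, not part of either library.

```python
from typing import Any, Dict, List, Optional


def loader_kwargs(urls: List[str], content_type: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for UnstructuredURLLoader (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"urls": urls}
    if content_type is not None:
        # Assumption: extra keyword arguments such as content_type are
        # forwarded by the loader to unstructured's partitioning step.
        kwargs["content_type"] = content_type
    return kwargs


def load_with_content_type(urls: List[str], content_type: str):
    """Load the URLs while forcing a MIME type (needs langchain and unstructured)."""
    # Imported lazily so loader_kwargs can be used without langchain installed.
    from langchain.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(**loader_kwargs(urls, content_type))
    return loader.load()
```

For example, `load_with_content_type(["https://example.com/notes.txt"], "text/plain")` would ask the loader to treat the response as plain text; the URL here is a placeholder, not one from this thread.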
