Documents
documents ¶
Documents module for data retrieval and processing workflows.
This module provides core abstractions for handling data in retrieval-augmented generation (RAG) pipelines, vector stores, and document processing workflows.
Documents vs. message content
This module is distinct from langchain_core.messages.content, which provides multimodal content blocks for LLM chat I/O (text, images, audio, etc. within messages).
Key distinction:
-
Documents (this module): For data retrieval and processing workflows
- Vector stores, retrievers, RAG pipelines
- Text chunking, embedding, and semantic search
- Example: Chunks of a PDF stored in a vector database
-
Content Blocks (
messages.content): For LLM conversational I/O- Multimodal message content sent to/from models
- Tool calls, reasoning, citations within chat
- Example: An image sent to a vision model in a chat message (via
ImageContentBlock)
While both can represent similar data types (text, files), they serve different architectural purposes in LangChain applications.
Document ¶
Bases: BaseMedia
Class for storing a piece of text and associated metadata.
Note
Document is for retrieval workflows, not chat I/O. For sending text to an LLM in a conversation, use message types from langchain.messages.
Example
| METHOD | DESCRIPTION |
|---|---|
__init__ | Pass page_content in as positional or named arg. |
is_lc_serializable | Return |
get_lc_namespace | Get the namespace of the LangChain object. |
__str__ | Override |
lc_id | Return a unique identifier for this class for serialization purposes. |
to_json | Serialize the object to JSON. |
to_json_not_implemented | Serialize a "not implemented" object. |
lc_secrets property ¶
A map of constructor argument names to secret ids.
For example, {"openai_api_key": "OPENAI_API_KEY"}
lc_attributes property ¶
lc_attributes: dict List of attribute names that should be included in the serialized kwargs.
These attributes must be accepted by the constructor.
Default is an empty dictionary.
id class-attribute instance-attribute ¶
An optional identifier for the document.
Ideally this should be unique across the document collection and formatted as a UUID, but this will not be enforced.
metadata class-attribute instance-attribute ¶
Arbitrary metadata associated with the content.
__init__ ¶
Pass page_content in as positional or named arg.
is_lc_serializable classmethod ¶
is_lc_serializable() -> bool Return True as this class is serializable.
get_lc_namespace classmethod ¶
__str__ ¶
__str__() -> str Override __str__ to restrict it to page_content and metadata.
| RETURNS | DESCRIPTION |
|---|---|
str | A string representation of the |
lc_id classmethod ¶
Return a unique identifier for this class for serialization purposes.
The unique identifier is a list of strings that describes the path to the object.
For example, for the class langchain.llms.openai.OpenAI, the id is ["langchain", "llms", "openai", "OpenAI"].
to_json ¶
Serialize the object to JSON.
| RAISES | DESCRIPTION |
|---|---|
ValueError | If the class has deprecated attributes. |
| RETURNS | DESCRIPTION |
|---|---|
SerializedConstructor | SerializedNotImplemented | A JSON serializable object or a |
to_json_not_implemented ¶
Serialize a "not implemented" object.
| RETURNS | DESCRIPTION |
|---|---|
SerializedNotImplemented |
|
Blob ¶
Bases: BaseMedia
Raw data abstraction for document loading and file processing.
Represents raw bytes or text, either in-memory or by file reference. Used primarily by document loaders to decouple data loading from parsing.
Inspired by Mozilla's Blob
Initialize a blob from in-memory data
Load from memory and specify MIME type and metadata
Load the blob from a file
| METHOD | DESCRIPTION |
|---|---|
check_blob_is_valid | Verify that either data or path is provided. |
as_string | Read data as a string. |
as_bytes | Read data as bytes. |
as_bytes_io | Read data as a byte stream. |
from_path | Load the blob from a path like object. |
from_data | Initialize the |
__repr__ | Return the blob representation. |
__init__ | |
is_lc_serializable | Is this class serializable? |
get_lc_namespace | Get the namespace of the LangChain object. |
lc_id | Return a unique identifier for this class for serialization purposes. |
to_json | Serialize the object to JSON. |
to_json_not_implemented | Serialize a "not implemented" object. |
data class-attribute instance-attribute ¶
Raw data associated with the Blob.
mimetype class-attribute instance-attribute ¶
mimetype: str | None = None MIME type, not to be confused with a file extension.
encoding class-attribute instance-attribute ¶
encoding: str = 'utf-8' Encoding to use if decoding the bytes into a string.
Uses utf-8 as default encoding if decoding to string.
path class-attribute instance-attribute ¶
Location where the original content was found.
source property ¶
source: str | None The source location of the blob as string if known otherwise none.
If a path is associated with the Blob, it will default to the path location.
Unless explicitly set via a metadata field called 'source', in which case that value will be used instead.
lc_secrets property ¶
A map of constructor argument names to secret ids.
For example, {"openai_api_key": "OPENAI_API_KEY"}
lc_attributes property ¶
lc_attributes: dict List of attribute names that should be included in the serialized kwargs.
These attributes must be accepted by the constructor.
Default is an empty dictionary.
id class-attribute instance-attribute ¶
An optional identifier for the document.
Ideally this should be unique across the document collection and formatted as a UUID, but this will not be enforced.
metadata class-attribute instance-attribute ¶
Arbitrary metadata associated with the content.
check_blob_is_valid classmethod ¶
Verify that either data or path is provided.
as_string ¶
as_string() -> str Read data as a string.
| RAISES | DESCRIPTION |
|---|---|
ValueError | If the blob cannot be represented as a string. |
| RETURNS | DESCRIPTION |
|---|---|
str | The data as a string. |
as_bytes ¶
as_bytes() -> bytes Read data as bytes.
| RAISES | DESCRIPTION |
|---|---|
ValueError | If the blob cannot be represented as bytes. |
| RETURNS | DESCRIPTION |
|---|---|
bytes | The data as bytes. |
as_bytes_io ¶
as_bytes_io() -> Generator[BytesIO | BufferedReader, None, None] Read data as a byte stream.
| RAISES | DESCRIPTION |
|---|---|
NotImplementedError | If the blob cannot be represented as a byte stream. |
| YIELDS | DESCRIPTION |
|---|---|
BytesIO | BufferedReader | The data as a byte stream. |
from_path classmethod ¶
from_path( path: PathLike, *, encoding: str = "utf-8", mime_type: str | None = None, guess_type: bool = True, metadata: dict | None = None, ) -> Blob Load the blob from a path like object.
| PARAMETER | DESCRIPTION |
|---|---|
path | Path-like object to file to be read TYPE: |
encoding | Encoding to use if decoding the bytes into a string TYPE: |
mime_type | If provided, will be set as the MIME type of the data TYPE: |
guess_type | If TYPE: |
metadata | Metadata to associate with the TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Blob |
|
from_data classmethod ¶
from_data( data: str | bytes, *, encoding: str = "utf-8", mime_type: str | None = None, path: str | None = None, metadata: dict | None = None, ) -> Blob Initialize the Blob from in-memory data.
| PARAMETER | DESCRIPTION |
|---|---|
data | The in-memory data associated with the |
encoding | Encoding to use if decoding the bytes into a string TYPE: |
mime_type | If provided, will be set as the MIME type of the data TYPE: |
path | If provided, will be set as the source from which the data came TYPE: |
metadata | Metadata to associate with the TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Blob |
|
is_lc_serializable classmethod ¶
is_lc_serializable() -> bool Is this class serializable?
By design, even if a class inherits from Serializable, it is not serializable by default. This is to prevent accidental serialization of objects that should not be serialized.
| RETURNS | DESCRIPTION |
|---|---|
bool | Whether the class is serializable. Default is |
get_lc_namespace classmethod ¶
lc_id classmethod ¶
Return a unique identifier for this class for serialization purposes.
The unique identifier is a list of strings that describes the path to the object.
For example, for the class langchain.llms.openai.OpenAI, the id is ["langchain", "llms", "openai", "OpenAI"].
to_json ¶
Serialize the object to JSON.
| RAISES | DESCRIPTION |
|---|---|
ValueError | If the class has deprecated attributes. |
| RETURNS | DESCRIPTION |
|---|---|
SerializedConstructor | SerializedNotImplemented | A JSON serializable object or a |
to_json_not_implemented ¶
Serialize a "not implemented" object.
| RETURNS | DESCRIPTION |
|---|---|
SerializedNotImplemented |
|
BaseMedia ¶
Bases: Serializable
Base class for content used in retrieval and data processing workflows.
Provides common fields for content that needs to be stored, indexed, or searched.
Note
For multimodal content in chat messages (images, audio sent to/from LLMs), use langchain.messages content blocks instead.
| METHOD | DESCRIPTION |
|---|---|
__init__ | |
is_lc_serializable | Is this class serializable? |
get_lc_namespace | Get the namespace of the LangChain object. |
lc_id | Return a unique identifier for this class for serialization purposes. |
to_json | Serialize the object to JSON. |
to_json_not_implemented | Serialize a "not implemented" object. |
id class-attribute instance-attribute ¶
An optional identifier for the document.
Ideally this should be unique across the document collection and formatted as a UUID, but this will not be enforced.
metadata class-attribute instance-attribute ¶
Arbitrary metadata associated with the content.
lc_secrets property ¶
A map of constructor argument names to secret ids.
For example, {"openai_api_key": "OPENAI_API_KEY"}
lc_attributes property ¶
lc_attributes: dict List of attribute names that should be included in the serialized kwargs.
These attributes must be accepted by the constructor.
Default is an empty dictionary.
is_lc_serializable classmethod ¶
is_lc_serializable() -> bool Is this class serializable?
By design, even if a class inherits from Serializable, it is not serializable by default. This is to prevent accidental serialization of objects that should not be serialized.
| RETURNS | DESCRIPTION |
|---|---|
bool | Whether the class is serializable. Default is |
get_lc_namespace classmethod ¶
lc_id classmethod ¶
Return a unique identifier for this class for serialization purposes.
The unique identifier is a list of strings that describes the path to the object.
For example, for the class langchain.llms.openai.OpenAI, the id is ["langchain", "llms", "openai", "OpenAI"].
to_json ¶
Serialize the object to JSON.
| RAISES | DESCRIPTION |
|---|---|
ValueError | If the class has deprecated attributes. |
| RETURNS | DESCRIPTION |
|---|---|
SerializedConstructor | SerializedNotImplemented | A JSON serializable object or a |
to_json_not_implemented ¶
Serialize a "not implemented" object.
| RETURNS | DESCRIPTION |
|---|---|
SerializedNotImplemented |
|