Streamline the Document API #38
Add the TextFormatter and DefaultTextFormatter concepts, which can filter the metadata and format the Document metadata and text according to predefined templates.
When the splitter breaks the parent Document into multiple chunks (i.e. into a list of child Documents), copy the source content formatter to the chunks by default. Use the copyContentFormatter flag to enable or disable copying.
Add a useFormattedContent field to control whether the formatted or the raw content is used for indexing.
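To make the description above concrete, here is a minimal sketch of how a formatter might filter metadata per mode and then render metadata plus text. Only MetadataMode and its four constants come from this PR; the class name SketchFormatter, its constructor arguments, and the "key: value" template are assumptions made for illustration, not the actual DefaultTextFormatter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Mirrors the enum shown in this PR.
enum MetadataMode { ALL, EMBED, LLM, NONE }

// Hypothetical formatter; names and template are illustrative assumptions.
final class SketchFormatter {

    private final List<String> excludedEmbedKeys;
    private final List<String> excludedLlmKeys;

    SketchFormatter(List<String> excludedEmbedKeys, List<String> excludedLlmKeys) {
        this.excludedEmbedKeys = List.copyOf(excludedEmbedKeys);
        this.excludedLlmKeys = List.copyOf(excludedLlmKeys);
    }

    /** Returns the metadata entries that the given mode lets through, in insertion order. */
    Map<String, Object> filter(Map<String, Object> metadata, MetadataMode mode) {
        Map<String, Object> kept = new LinkedHashMap<>();
        if (mode == MetadataMode.NONE) {
            return kept; // NONE drops all metadata
        }
        for (Map.Entry<String, Object> entry : metadata.entrySet()) {
            String key = entry.getKey();
            boolean excluded = (mode == MetadataMode.EMBED && excludedEmbedKeys.contains(key))
                    || (mode == MetadataMode.LLM && excludedLlmKeys.contains(key));
            if (!excluded) {
                kept.put(key, entry.getValue()); // ALL keeps every entry
            }
        }
        return kept;
    }

    /** Renders "key: value" lines, a blank line, then the text (assumed template). */
    String format(String text, Map<String, Object> metadata, MetadataMode mode) {
        String header = filter(metadata, mode).entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.joining("\n"));
        return header.isEmpty() ? text : header + "\n\n" + text;
    }
}
```

With this shape, the same Document can yield different strings for embedding and for the prompt simply by passing a different mode.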
enum MetadataMode {
    ALL, EMBED, LLM, NONE;
I think we should change LLM to something like INFERENCE or MODEL, as I've generally removed the use of LLM when naming things.
I like INFERENCE
/**
 * Metadata keys that are excluded from text for the LLM.
 */
private final List<String> excludedLlmMetadataKeys;
Similar comment with respect to naming: excludedInferenceMetadataKeys?
done
if (this.copyContentFormatter) {
    // Copies the parent content formatter to the chunk documents it was
    // split into.
    newDoc = newDoc.updateContentFormatter(formatters.get(i));
Multiple formatters indexed by a List of strings seems unusual to me.
This is because createDocuments takes a list of texts, one per source document, plus the merged metadata of all source documents. We cannot merge the content formatters, so we have to pass in a list of formatters (one per source document) whose order corresponds to the list of texts.
We could simplify createDocuments to take one document at a time, but we would lose the metadata merging. Or should we introduce an internal record class like Chunk(String text, Map<String, Object> metadata, ContentFormatter formatter)?
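A rough sketch of the record-based alternative follows, so the trade-off is easier to see. The Chunk record shape is taken from the comment above; the ContentFormatter interface here is a hypothetical stand-in (the real one is not shown in this thread), and ChunkSplitterSketch with its fixed-size splitting is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the PR's ContentFormatter; its real signature is not shown here.
interface ContentFormatter {
    String format(String text, Map<String, Object> metadata);
}

// Bundles text, metadata, and formatter so they travel together instead of as parallel lists.
record Chunk(String text, Map<String, Object> metadata, ContentFormatter formatter) {}

final class ChunkSplitterSketch {

    /**
     * Splits each source into pieces of at most maxChars characters.
     * Fixed-size splitting is a placeholder strategy, not the PR's splitter.
     */
    static List<Chunk> split(List<Chunk> sources, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<Chunk> result = new ArrayList<>();
        for (Chunk source : sources) {
            String text = source.text();
            for (int start = 0; start < text.length(); start += maxChars) {
                int end = Math.min(text.length(), start + maxChars);
                // Each piece carries its own source's metadata and formatter.
                result.add(new Chunk(text.substring(start, end), source.metadata(), source.formatter()));
            }
        }
        return result;
    }
}
```

Because the formatter is a field of the record, no index lookup such as formatters.get(i) is needed, and every piece keeps only the metadata of the source it was cut from.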
Actually, why do we merge the metadata from all parent documents? Shouldn't the chunked docs inherit only the metadata of their own parent, not a mixture of the metadata of all the different parents?
The latter can make sense if, for example, we can determine that all parent documents come from the same initial source. But currently we don't have a convention to reliably recognise this.
- Make the metadata mode configurable for the EmbeddingClient implementations.
- Use the EMBED mode by default.
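One way these two points could look in code is sketched below. Everything except the MetadataMode constants is an assumption: EmbeddingClientSketch, FormattableDocument, and prepare are hypothetical names, and the real EmbeddingClient implementations may wire the mode differently.

```java
import java.util.Objects;

// Mirrors the enum shown in this PR.
enum MetadataMode { ALL, EMBED, LLM, NONE }

// Hypothetical view of a document that can render itself for a given mode.
interface FormattableDocument {
    String getFormattedContent(MetadataMode mode);
}

// Hypothetical client; shows only how the mode is configured and applied.
final class EmbeddingClientSketch {

    private final MetadataMode metadataMode;

    /** Defaults to EMBED, as proposed above. */
    EmbeddingClientSketch() {
        this(MetadataMode.EMBED);
    }

    EmbeddingClientSketch(MetadataMode metadataMode) {
        this.metadataMode = Objects.requireNonNull(metadataMode, "metadataMode");
    }

    MetadataMode getMetadataMode() {
        return this.metadataMode;
    }

    /** Returns the text that would be sent to the embedding model. */
    String prepare(FormattableDocument document) {
        return document.getFormattedContent(this.metadataMode);
    }
}
```

A caller that wants raw text only would construct the client with MetadataMode.NONE; otherwise the no-argument constructor applies EMBED.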
Replaced #45