tzolov commented Sep 29, 2023

Add the TextFormatter and DefaultTextFormatter concepts, which filter a Document's metadata and format its metadata and text according to predefined templates.

tzolov changed the title from "Streamline the Document API" to "[WIP] Streamline the Document API" on Sep 29, 2023
 When the splitter breaks the parent Document into multiple chunks (i.e. into a list of child Documents), copy the source content formatter to the chunks by default. Use the copyContentFormatter flag to enable or disable copying.
 Add useFormattedContent field to control the formatted vs. raw content being used for indexing.
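
For illustration, the formatter idea in the description could be sketched roughly as below. This is a minimal sketch under assumptions, not the code in this PR: the class name `SketchTextFormatter`, the `{key}`/`{value}`/`{metadata}`/`{content}` placeholders, and the constructor shape are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; names and placeholder syntax are assumptions, not the PR's API.
final class SketchTextFormatter {

    private final List<String> excludedKeys; // metadata keys dropped before formatting
    private final String metadataTemplate;   // applied to each kept entry, e.g. "{key}: {value}"
    private final String textTemplate;       // combines both parts, e.g. "{metadata}\n\n{content}"

    SketchTextFormatter(List<String> excludedKeys, String metadataTemplate, String textTemplate) {
        this.excludedKeys = List.copyOf(excludedKeys);
        this.metadataTemplate = metadataTemplate;
        this.textTemplate = textTemplate;
    }

    String format(String content, Map<String, Object> metadata) {
        // Filter out excluded keys, then render each remaining entry with the metadata template.
        String metadataText = metadata.entrySet().stream()
            .filter(e -> !excludedKeys.contains(e.getKey()))
            .map(e -> metadataTemplate
                .replace("{key}", e.getKey())
                .replace("{value}", String.valueOf(e.getValue())))
            .collect(Collectors.joining("\n"));
        // Place the rendered metadata and the raw content into the text template.
        return textTemplate
            .replace("{metadata}", metadataText)
            .replace("{content}", content);
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("source", "a.pdf");
        metadata.put("internalId", 42);
        SketchTextFormatter formatter = new SketchTextFormatter(
            List.of("internalId"), "{key}: {value}", "{metadata}\n\n{content}");
        System.out.println(formatter.format("Hello", metadata));
    }
}
```

Under these assumptions, the raw content stays untouched on the Document and the formatted view is computed on demand, which is what a flag such as useFormattedContent would switch between.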

enum MetadataMode {

ALL, EMBED, LLM, NONE;
Member

I think we should change LLM to something like INFERENCE or MODEL, as I've generally removed the use of LLM when naming things.

Contributor Author

I like INFERENCE

/**
* Metadata keys that are excluded from text for the LLM.
*/
private final List<String> excludedLlmMetadataKeys;
Member

Similar comment with respect to naming: excludedInferenceMetadataKeys?

Contributor Author

done
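
A rough sketch of how the renamed mode and key list might fit together is shown below. It is illustrative only: `SketchMetadataMode`, `SketchMetadataFilter`, and `excludedEmbedMetadataKeys` are hypothetical names (the last one assumed by symmetry with excludedInferenceMetadataKeys), and the meaning given to ALL and NONE is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of mode-dependent metadata filtering; not the PR's actual code.
enum SketchMetadataMode { ALL, EMBED, INFERENCE, NONE }

final class SketchMetadataFilter {

    // Assumed counterpart of the inference list, used when building text for embedding.
    private final List<String> excludedEmbedMetadataKeys;
    // Metadata keys that are excluded from text sent to the model for inference.
    private final List<String> excludedInferenceMetadataKeys;

    SketchMetadataFilter(List<String> excludedEmbedMetadataKeys,
                         List<String> excludedInferenceMetadataKeys) {
        this.excludedEmbedMetadataKeys = List.copyOf(excludedEmbedMetadataKeys);
        this.excludedInferenceMetadataKeys = List.copyOf(excludedInferenceMetadataKeys);
    }

    Map<String, Object> filter(Map<String, Object> metadata, SketchMetadataMode mode) {
        switch (mode) {
            case ALL:
                return new LinkedHashMap<>(metadata);   // assumption: keep every key
            case NONE:
                return new LinkedHashMap<>();           // assumption: keep no keys
            case EMBED:
                return without(metadata, excludedEmbedMetadataKeys);
            case INFERENCE:
                return without(metadata, excludedInferenceMetadataKeys);
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    // Returns a copy of the metadata with the excluded keys removed, preserving order.
    private static Map<String, Object> without(Map<String, Object> metadata, List<String> excluded) {
        Map<String, Object> kept = new LinkedHashMap<>(metadata);
        kept.keySet().removeAll(excluded);
        return kept;
    }
}
```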

if (this.copyContentFormatter) {
// Copy the parent's content formatter to the chunk documents it was
// split into.
newDoc = newDoc.updateContentFormatter(formatters.get(i));
Member

Multiple formatters indexed by a list of strings seems unusual to me.

Contributor Author

This is because createDocuments takes a list of texts, one per source document, plus the merged metadata of all source documents. We cannot merge the content formatters, so we have to pass in a list of formatters (one per source document) whose order corresponds to the list of texts.

We could simplify createDocuments to take one document at a time, but we would lose the metadata merging. Or should we introduce an internal record class like Chunk(String text, Map<String, Object> metadata, ContentFormatter formatter)?

Contributor Author

Actually, why do we merge the metadata from all parent documents? Shouldn't the chunked docs inherit only the metadata of their own parent, not a mixture of the metadata of all the different parents?
The latter can make sense if, for example, we can determine that all parent documents come from the same initial source. But currently we don't have a convention to reliably recognise this.
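
To make the per-parent alternative concrete, here is a minimal sketch built around the record floated above. It assumes Java 16+ for records. `SketchContentFormatter` and `SketchSplitter` are hypothetical stand-ins, and the fixed-size character split is only a placeholder for the real splitting logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the content formatter type discussed in this thread.
interface SketchContentFormatter {
    String format(String text, Map<String, Object> metadata);
}

// The record shape proposed above: each chunk carries its own text, metadata and formatter.
record Chunk(String text, Map<String, Object> metadata, SketchContentFormatter formatter) {}

final class SketchSplitter {

    private final int chunkSize;

    SketchSplitter(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Splits one parent at a time, so children inherit only that parent's
    // metadata and formatter. No parallel list of formatters is needed.
    List<Chunk> split(Chunk parent) {
        List<Chunk> children = new ArrayList<>();
        String text = parent.text();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(text.length(), start + chunkSize);
            children.add(new Chunk(
                text.substring(start, end),
                new LinkedHashMap<>(parent.metadata()), // copy, so children do not share a map
                parent.formatter()));
        }
        return children;
    }

    // Splits several parents without ever merging their metadata.
    List<Chunk> splitAll(List<Chunk> parents) {
        List<Chunk> all = new ArrayList<>();
        for (Chunk parent : parents) {
            all.addAll(split(parent));
        }
        return all;
    }
}
```

With this shape, the formatter travels with each chunk, at the cost of giving up the cross-parent metadata merge questioned above.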

tzolov commented Oct 3, 2023

Replaced #45

tzolov closed this Oct 3, 2023