Streamline the Document API #38

tzolov · 2023-09-29T16:45:59Z

Add the TextFormatter and DefaultTextFormatter concepts, which filter the Document metadata and format the metadata and text according to predefined templates.
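As a rough illustration of the idea (not the code in this PR), a formatter of this kind could look like the sketch below. The class name `TemplateFormatterSketch`, the `{key}`/`{value}`/`{metadata}`/`{content}` placeholders, and the constructor parameters are assumptions made for the example; only the mode names ALL, EMBED, LLM, NONE come from the diff under review.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the metadata modes shown in the diff.
enum MetadataModeSketch { ALL, EMBED, LLM, NONE }

// Hypothetical formatter: filters metadata per mode, then fills two templates.
final class TemplateFormatterSketch {

    private final String metadataTemplate; // assumed placeholders: {key}, {value}
    private final String textTemplate;     // assumed placeholders: {metadata}, {content}
    private final List<String> excludedEmbedKeys;
    private final List<String> excludedLlmKeys;

    TemplateFormatterSketch(String metadataTemplate, String textTemplate,
            List<String> excludedEmbedKeys, List<String> excludedLlmKeys) {
        this.metadataTemplate = metadataTemplate;
        this.textTemplate = textTemplate;
        this.excludedEmbedKeys = excludedEmbedKeys;
        this.excludedLlmKeys = excludedLlmKeys;
    }

    String format(String content, Map<String, Object> metadata, MetadataModeSketch mode) {
        // Render each surviving metadata entry on its own line.
        String metadataText = filter(metadata, mode).entrySet().stream()
            .map(e -> metadataTemplate
                .replace("{key}", e.getKey())
                .replace("{value}", String.valueOf(e.getValue())))
            .collect(Collectors.joining("\n"));
        return textTemplate
            .replace("{metadata}", metadataText)
            .replace("{content}", content);
    }

    private Map<String, Object> filter(Map<String, Object> metadata, MetadataModeSketch mode) {
        if (mode == MetadataModeSketch.NONE) {
            return Map.of();
        }
        List<String> excluded = switch (mode) {
            case EMBED -> excludedEmbedKeys;
            case LLM -> excludedLlmKeys;
            default -> List.of(); // ALL keeps every key
        };
        Map<String, Object> kept = new LinkedHashMap<>();
        metadata.forEach((key, value) -> {
            if (!excluded.contains(key)) {
                kept.put(key, value);
            }
        });
        return kept;
    }
}
```

With a metadata template of `{key}: {value}` and a text template of `{metadata}\n\n{content}`, a key listed in `excludedEmbedKeys` is dropped from the text produced for EMBED but still appears for ALL.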

…ecorator

When the splitter breaks the parent Document into multiple chunks (i.e. into a list of child Documents), copy the source content formatter to the chunks by default. Use the copyContentFormatter flag to enable or disable copying.
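A minimal sketch of that behaviour, assuming a fixed-size splitter and a formatter modelled as a plain `UnaryOperator<String>`. `DocSketch`, `SplitterSketch`, and `DEFAULT_FORMATTER` are hypothetical names for illustration; only the `copyContentFormatter` flag name comes from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical document that carries its own content formatter.
record DocSketch(String text, UnaryOperator<String> formatter) {
    String formattedContent() {
        return formatter.apply(text);
    }
}

// Hypothetical fixed-size splitter; the real TextSplitter is more involved.
final class SplitterSketch {

    static final UnaryOperator<String> DEFAULT_FORMATTER = UnaryOperator.identity();

    private final int chunkSize;
    private final boolean copyContentFormatter;

    SplitterSketch(int chunkSize, boolean copyContentFormatter) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.copyContentFormatter = copyContentFormatter;
    }

    List<DocSketch> split(DocSketch parent) {
        List<DocSketch> chunks = new ArrayList<>();
        String text = parent.text();
        for (int start = 0; start < text.length(); start += chunkSize) {
            String piece = text.substring(start, Math.min(text.length(), start + chunkSize));
            // Children inherit the parent's formatter only when the flag is set.
            UnaryOperator<String> formatter =
                copyContentFormatter ? parent.formatter() : DEFAULT_FORMATTER;
            chunks.add(new DocSketch(piece, formatter));
        }
        return chunks;
    }
}
```

With the flag on, every chunk formats its text the same way the parent would; with it off, chunks fall back to the default formatter.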

Add useFormattedContent field to control the formatted vs. raw content being used for indexing.

markpollack · 2023-10-02T13:15:22Z

spring-ai-core/src/main/java/org/springframework/ai/document/ContentFormatter.java

+
+enum MetadataMode {
+
+ALL, EMBED, LLM, NONE;


I think we should change LLM to something like INFERENCE or MODEL, as I've generally removed the use of LLM when naming things.

I like INFERENCE

markpollack · 2023-10-02T13:16:10Z

spring-ai-core/src/main/java/org/springframework/ai/document/DefaultContentFormatter.java

+/**
+ * Metadata keys that are excluded from text for the LLM.
+ */
+private final List<String> excludedLlmMetadataKeys;


Similar comment with respect to naming: excludedInferenceMetadataKeys?

spring-ai-core/src/main/java/org/springframework/ai/document/Document.java

markpollack · 2023-10-02T13:24:03Z

spring-ai-core/src/main/java/org/springframework/ai/splitter/TextSplitter.java

+if (this.copyContentFormatter) {
+// Copies the parent content formatter to the chunk documents it was
+// split into.
+newDoc = newDoc.updateContentFormatter(formatters.get(i));


Multiple formatters indexed by a List of strings seems unusual to me.

It is because createDocuments takes a list of texts, one per source document, plus the merged metadata of all source documents. We cannot merge the content formatters, so we have to pass in a list of formatters (one per source document) whose order corresponds to the list of texts.

We can simplify createDocuments to take one document at a time, but then we lose the metadata merging. Or we could introduce an internal record class like Chunk(String text, Map<String, Object> metadata, ContentFormatter formatter)?

Actually, why do we merge the metadata from all parent documents? Shouldn't the chunked docs inherit only the metadata of their own parent, not a mixture of the metadata of all the different parents?
The latter can make sense if we can determine that all parent documents come from the same initial source, for example. But currently we don't have a convention to reliably recognise this.
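To make the second option concrete, here is a hedged sketch of the Chunk-style record, in which each chunk inherits only its own parent's metadata and formatter. `ChunkSketch`, `ParentSketch`, `ContentFormatterStub`, and the fixed-size splitting are assumptions for illustration, not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical minimal formatter contract, standing in for ContentFormatter.
interface ContentFormatterStub {
    String format(String text);
}

// Hypothetical parent document and the Chunk-style record proposed above.
record ParentSketch(String text, Map<String, Object> metadata, ContentFormatterStub formatter) {}

record ChunkSketch(String text, Map<String, Object> metadata, ContentFormatterStub formatter) {}

final class PerParentSplitterSketch {

    // Splits each parent independently, so no metadata is merged across parents.
    static List<ChunkSketch> split(List<ParentSketch> parents, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<ChunkSketch> chunks = new ArrayList<>();
        for (ParentSketch parent : parents) {
            String text = parent.text();
            for (int start = 0; start < text.length(); start += chunkSize) {
                String piece = text.substring(start, Math.min(text.length(), start + chunkSize));
                // Each chunk carries its own parent's metadata and formatter.
                chunks.add(new ChunkSketch(piece, parent.metadata(), parent.formatter()));
            }
        }
        return chunks;
    }
}
```

Because text, metadata, and formatter travel together in one record, there is no need for parallel lists that must be kept aligned by index.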

…actors

- Make the metadata mode configurable for the EmbeddingClient implementations.
- Use the EMBED mode by default.
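One way to picture a configurable metadata mode with EMBED as the default is the sketch below. `EmbeddingInputSketch`, `ModeSketch`, and the `key: value` rendering are hypothetical and are not the EmbeddingClient API; the sketch only shows how the mode could decide which metadata reaches the text that gets embedded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical subset of the metadata modes relevant to embedding.
enum ModeSketch { ALL, EMBED, NONE }

// Hypothetical helper that builds the text handed to an embedding model.
final class EmbeddingInputSketch {

    private final ModeSketch metadataMode;
    private final Set<String> excludedEmbedKeys;

    // EMBED is the default when no mode is configured.
    EmbeddingInputSketch(Set<String> excludedEmbedKeys) {
        this(ModeSketch.EMBED, excludedEmbedKeys);
    }

    EmbeddingInputSketch(ModeSketch metadataMode, Set<String> excludedEmbedKeys) {
        this.metadataMode = metadataMode;
        this.excludedEmbedKeys = excludedEmbedKeys;
    }

    String textToEmbed(String content, Map<String, Object> metadata) {
        String header = metadata.entrySet().stream()
            .filter(e -> include(e.getKey()))
            .map(e -> e.getKey() + ": " + e.getValue())
            .collect(Collectors.joining("\n"));
        return header.isEmpty() ? content : header + "\n\n" + content;
    }

    private boolean include(String key) {
        return switch (metadataMode) {
            case ALL -> true;
            case EMBED -> !excludedEmbedKeys.contains(key);
            case NONE -> false;
        };
    }
}
```

Constructing the helper without a mode yields the EMBED behaviour, while passing `ModeSketch.NONE` embeds the raw content only.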

tzolov · 2023-10-03T18:39:01Z

Replaced #45

tzolov added 4 commits September 29, 2023 18:42

Streamline the Document API

0b165c6

Add the TextFormatter and DefaultTextFormatter concepts, which filter the Document metadata and format the metadata and text according to predefined templates.

add text formatter tests

ca1b733

code style reformat

84dbce6

minor code style improvements

7ac1532

tzolov changed the title ~~Streamline the Document API [WIP]~~ Streamline the Document API Sep 29, 2023

tzolov added 7 commits September 30, 2023 01:19

minor fixes

e9554f3

Rename TextFormatter to ContentFormatter. Add DocumentWithFormatter d…

90bc92b

…ecorator

further Document and Formatting streamlining

ea4af65

Allow the TextSplitter to copy the document content formatter

3282693

When the splitter breaks the parent Document into multiple chunks (i.e. into a list of child Documents), copy the source content formatter to the chunks by default. Use the copyContentFormatter flag to enable or disable copying.

Add TextSplitter tests

6f8a0dd

Let the EmbeddingClient impls use the formatted content by default

5571443

Add useFormattedContent field to control the formatted vs. raw content being used for indexing.

minor improvements

fc7958e

markpollack reviewed Oct 2, 2023

tzolov added 5 commits October 2, 2023 17:39

Move MetadataMode out of the ContentFormatter state. Add MetadataExtr…

52401bc

…actors

Address review requests. Bump version to 0.7.0-SN

442c2ee

Configurable metadata-mode for EmbeddingClients

c1bad54

- Make the metadata mode configurable for the EmbeddingClient implementations.
- Use the EMBED mode by default.

All functional

a07ca19

code style improvements

701c966

tzolov closed this Oct 3, 2023
