CSHARP-1014 Add vector support #611

joao-r-reis · 2024-08-28T10:30:46Z

No description provided.

SiyaoIsHiding

It looks good! None of my requested changes are urgent.

src/Cassandra/CqlVector.cs

SiyaoIsHiding · 2024-09-04T20:20:23Z

src/Cassandra.IntegrationTests/Core/TypeSerializersTests.cs

+ public BigInteger b { get; set; }
+ }
+
+ public class MixedTypeTwo


I was confused for a moment, so I wanted to put a note here: FixedType, MixedType, etc. are UDTs with fixed-size or variable-size fields. When a UDT is used as a vector subtype, it is always considered variable-size, no matter which fields it has.

I'm going to be honest, I looked at Bret's python driver PR for some of these test cases and these types actually come from there so I didn't spend much time thinking about the test cases that involve these types and their names.

Looking a bit more into this, it does appear that having all these types is a bit redundant since UDT as a vector subtype is always handled the same way regardless of the types in that UDT (and the names might be a bit misleading as you noted).

@absurdfarce do you have any thoughts on this? I'm leaning towards renaming these UDTs but keeping the tests, since they are already functional and removing tests feels awkward.

My intent with the Python test was to make it as robust as possible in the hopes of catching future regressions. I agree with @SiyaoIsHiding that UDTs will always be encoded as a variable-size type, so these distinctions shouldn't matter for the current implementation. It's more of a hedge against future changes.
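To illustrate why the fields inside a UDT would not change how it is handled as a vector subtype, here is a minimal sketch of the two encoding paths. It is not the driver's serializer: `VectorEncodingSketch`, `Encode`, and `fixedElementSize` are hypothetical names, and the layout is an assumption (fixed-size elements are concatenated as-is, while each variable-size element is preceded by its length). The real length prefix is assumed to be an unsigned vint; the sketch only handles the single-byte form, so elements longer than 127 bytes are rejected.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch, not the driver's vector serializer.
public static class VectorEncodingSketch
{
    // elements: element values that are already serialized to bytes.
    // fixedElementSize: byte size of a fixed-size subtype (e.g. 4 for int or float),
    // or null for a variable-size subtype such as text or a UDT.
    public static byte[] Encode(IReadOnlyList<byte[]> elements, int? fixedElementSize)
    {
        using (var stream = new MemoryStream())
        {
            foreach (var element in elements)
            {
                if (fixedElementSize.HasValue)
                {
                    // Fixed-size path: no per-element prefix, sizes must match.
                    if (element.Length != fixedElementSize.Value)
                    {
                        throw new ArgumentException("Unexpected element size.");
                    }
                }
                else
                {
                    // Variable-size path: write a length prefix before each element.
                    // Assumption: lengths up to 127 fit in one prefix byte.
                    if (element.Length > 127)
                    {
                        throw new NotSupportedException(
                            "Multi-byte length prefixes are omitted from this sketch.");
                    }
                    stream.WriteByte((byte)element.Length);
                }
                stream.Write(element, 0, element.Length);
            }
            return stream.ToArray();
        }
    }
}
```

Under these assumptions a UDT always takes the second path, whether or not its own fields are fixed-size, which is why the FixedType and MixedType variants exercise the same code.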

SiyaoIsHiding · 2024-09-05T04:03:08Z

src/Cassandra.Tests/DataTypeParserTests.cs

+ AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )");
+ AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )");
+ AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) ");
+ AssertFn(" org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) ");


Thank you for the comprehensive test cases! This is not urgent but I am nervous that this DataTypeParser can still go wrong someday in some weird corner cases 💀 I tested out the following cases and commented the result returned by the DataTypeParser. Are they all expected?

AssertFn(" org.apache.cassandra.db.marshal. VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) "); // A custom type, without throwing an error
AssertFn("ORG.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type,3)"); // custom type
AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass
AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, -1)"); // pass

Yeah, honestly I also got a bit anxious when I was basically forced to change this class to support whitespace in the vector type, and that's why I added all of these test cases.

As you noted here (and in this comment), there are a few cases that still don't work, but I really want to limit the changes I make to this class, otherwise there's an even higher risk that something will break. None of the test cases you listed worked before this change (I think?) and I don't think they are valid cases in C* 5 or Astra, so I'd be OK with keeping the behavior as is.

@absurdfarce any thoughts on this? Do you think it would be worth it to make this parser resilient enough to handle the test cases that Jane listed here?

AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass

I think this should pass, these are returned by the server so if the server does return them then the driver should be able to parse it. This is also in line with Bret's thoughts about not throwing an exception if someone tries to create a vector with a dimension of 0 (I actually just pushed a commit to remove that validation and it actually allows the serializer to use the default ctor which is a significant performance boost).

For a vector of dimension -1 the user will not actually be able to provide a vector that the driver accepts but that's a separate concern (it's not the concern of this class).

I understand @SiyaoIsHiding to be saying that all four of the cases she referenced above are parsed correctly, which seems right to me. Cases 2 and 3 above (Int32 subtype, size 3 and 0 respectively) seem correct on their face, and case 1 is just a whitespace question (which I know has been a bit of an issue). The only troublesome one to me would be the last case (Int32 subtype, size -1), but even there the type definition is syntactically correct; it's only the semantics that are wrong. So if the parser is evaluating both, I guess I'd expect it to fail, but if it's just checking syntax I don't see a problem there.

If I've misunderstood then please let me know; I'm very sympathetic to @joao-r-reis's point that it's worth our while to minimize changes to this functionality in order to reduce the chances of any additional breakage.

Is this DataTypeParser only used when communicating with the server but not when parsing user-provided statements?

The first two don't actually parse correctly, they are parsed as "custom types" which are the default when the driver doesn't recognize a type (it should be parsed as an actual vector type). The first case has whitespaces in some places where the parser still doesn't parse correctly and the second case has upper case letters which the parser also doesn't handle correctly.

Neither of these cases has ever worked before, so I'm wondering if we can just leave them be, since we don't expect any known supported type to be returned with this whitespace or these upper case letters...

Is this DataTypeParser only used when communicating with the server but not when parsing user-provided statements?

I believe so... Can't 100% confirm it without double checking the codebase. I can do that tomorrow.

ParseFqTypeName is used in the SchemaParser to parse the schema using the system tables and to parse custom types returned by the server when it returns rows (QUERY/EXECUTE requests).

ParseTypeName is only used in the SchemaParser to also parse schema using system tables.
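The four cases discussed in this thread can be illustrated with a minimal sketch. This is not the driver's DataTypeParser: `VectorTypeNameSketch` and `TryParse` are hypothetical names, and the matching rules (an exact, case-sensitive class-name prefix, with whitespace tolerated only around the arguments) are assumptions chosen to reproduce the behavior described above. A `false` result stands in for the "custom type" fallback.

```csharp
using System;

// Hypothetical sketch, not the driver's DataTypeParser.
public static class VectorTypeNameSketch
{
    private const string VectorPrefix = "org.apache.cassandra.db.marshal.VectorType";

    // Returns true with the subtype and dimension when the name is recognized as
    // a vector type; returns false when it would fall back to a "custom type".
    public static bool TryParse(string typeName, out string subtype, out int dimension)
    {
        subtype = string.Empty;
        dimension = 0;
        var name = typeName.Trim();
        // Case-sensitive match: "ORG.apache..." and "marshal. VectorType" both fail here.
        if (!name.StartsWith(VectorPrefix, StringComparison.Ordinal))
        {
            return false;
        }
        var rest = name.Substring(VectorPrefix.Length).Trim();
        if (rest.Length < 2 || rest[0] != '(' || rest[rest.Length - 1] != ')')
        {
            return false;
        }
        var args = rest.Substring(1, rest.Length - 2);
        // The dimension is the last argument, so split on the last comma
        // (the subtype itself may contain commas, e.g. a nested vector).
        var lastComma = args.LastIndexOf(',');
        if (lastComma < 0)
        {
            return false;
        }
        subtype = args.Substring(0, lastComma).Trim();
        // Syntax only: 0 and -1 parse fine; validating the value is a separate concern.
        return int.TryParse(args.Substring(lastComma + 1).Trim(), out dimension);
    }
}
```

With these rules, the whitespace inside the class name and the upper case `ORG` prefix are not recognized as vectors, while dimensions of 0 and -1 parse successfully because only the syntax is checked.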

SiyaoIsHiding · 2024-09-05T04:15:43Z

src/Cassandra.Tests/DataTypeParserTests.cs

+ Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension);
+ }
+
+ AssertFn("vector<int,3>");


Again, here are a bunch of corner cases that are neither important nor urgent to fix, but I'll put them here to make sure this is what we want.

AssertFn("Vector<int, 3>"); // this will raise NullReferenceException
AssertFn("vector<INT, 3>"); // NullReferenceException
AssertFn("vector <int, 3>"); // Not a valid type vector
AssertFn("vector<3, 3>"); // NullReferenceException
AssertFn("<vector<int, 3>>"); // NullReferenceException
AssertFn("vector<int, 3)"); // pass

Commented about this here.

I will say though that regardless of whether these should be supported or not, throwing a null reference exception doesn't seem ok so I'll take a look.

Ok, it's because the udtResolver param is being passed as null on the test, fixed it.

AssertFn("vector<3, 3>"); // NullReferenceException

The driver assumes the first param to be a type so if the user doesn't define a UDT binding that matches the type name of the first parameter then an error will be returned.

This method doesn't appear to parse custom types (that are unrecognized by the driver) at all (as opposed to ParseFqTypeName) and I don't know if that's intentional or not. I don't think we have a need to change this anyway so...

What do you think about

AssertFn("vector <int, 3>"); // Not a valid type vector

?

It's the same as list <int> for example, I'm fairly sure list <int> also doesn't parse correctly but it's never been an issue. I do think that we should at the very least document these use cases on a JIRA ticket so we don't lose track of them and maybe we can improve this parser in the near future to account for these cases.

Ticket here.

src/Cassandra.Tests/SerializerTests.cs

SiyaoIsHiding · 2024-09-05T22:07:20Z

doc/features/vectors/README.md

+ .Insert(new TestTable1 { I = 3, J = new CqlVector<float>(10.1f, 10.2f, 10.3f) })
+ .ExecuteAsync();
+
+// DON'T USE AllowFiltering, this is required in this case because the ANN operator 


I couldn't understand this. Why shouldn't I use AllowFiltering?

It performs a full scan on all C* nodes, so the performance can be very unpredictable. Here are some docs on this topic from the official C* documentation.

Got it. Could we consider the following wording, and also add the other LINQ syntax?

// Using AllowFiltering is not recommended due to unpredictable performance.
// Here we use AllowFiltering in this case just for the example,
// as the ANN operator is not supported in LINQ yet.
var entity = (await table.Where(t => t.I == 3 && t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f})).AllowFiltering().ExecuteAsync()).SingleOrDefault();
// The following way also works
var entity = (await (from t in table where t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f}) select t).AllowFiltering().ExecuteAsync()).SingleOrDefault();

updated this section, can you give it another look?

SiyaoIsHiding · 2024-09-05T22:38:03Z

src/Cassandra.Tests/DataTypeParserTests.cs

+ Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension);
+ }
+
+ AssertFn("vector<int,3>");


What do you think about

AssertFn("vector <int, 3>"); // Not a valid type vector

?

joao-r-reis and others added 19 commits August 23, 2024 13:35

wip

1e01fab

wip

5618e62

changes to cql vector type

b2741ad

bug fixing and changes to CqlVector type

a7c8e1e

Jane Review for Vector Support (#610)

81a35c3

I will try to add some tests :)

fix custom type decode bug

493a20f

Merge branch 'CSHARP-1014' of github.com:datastax/csharp-driver into …

1d1b454

…CSHARP-1014

fix CqlVector.FromArray method

c6e9a77

wip

0f5e242

fix nested vector bug

30cd71e

fix TODOs

789371d

wip tests

ba7c265

wip

1cfbb51

fix data type parser and add more tests

2c769f5

fix CqlVector.Equals

ccafe17

more integration tests and more fixes

014b030

fix build

9b07666

add more integration tests

6c81f50

add vector integration tests with LINQ and client side encryption

8a2d24c

joao-r-reis marked this pull request as ready for review September 3, 2024 15:55

SiyaoIsHiding reviewed Sep 4, 2024

src/Cassandra.Tests/SerializerTests.cs

support Vector to collection/array conversion when selecting vector c…

a973720

…olumns

joao-r-reis mentioned this pull request Sep 4, 2024

[CSHARP-969] - OpenTelemetry tracing #608

Merged

joao-r-reis added 2 commits September 4, 2024 14:59

fix timeuuid serializer test

49bda43

3.21.0-alpha1 release

5c67fed

SiyaoIsHiding requested changes Sep 5, 2024

SiyaoIsHiding reviewed Sep 5, 2024

src/Cassandra.Tests/SerializerTests.cs Outdated

joao-r-reis added 3 commits September 5, 2024 11:56

PR discussions

301772f

rename Dimension to Dimensions and add xml docs

6f84945

add vector support section to the driver manual

c775296

SiyaoIsHiding approved these changes Sep 5, 2024

update vector docs

dd49450

joao-r-reis merged commit 8564d39 into master Sep 9, 2024

joao-r-reis deleted the CSHARP-1014 branch September 9, 2024 15:26

joao-r-reis added this to the 3.21.0 milestone Sep 9, 2024

4 participants
