Skip to content

Conversation

@joao-r-reis
Copy link
Collaborator

No description provided.

@joao-r-reis joao-r-reis marked this pull request as ready for review September 3, 2024 15:55
Copy link
Contributor

@SiyaoIsHiding SiyaoIsHiding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good! None of my request of changes is urgent.

public BigInteger b { get; set; }
}

public class MixedTypeTwo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused for a moment so I wanted to put a note here: Those FixedType, MixedType, e.t.c mean they are UDTs with fixed/var-sized fields. When UDT types are used as a subtype in vectors, no matter what field they have, UDTs are always considered var-sized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to be honest, I looked at Bret's python driver PR for some of these test cases and these types actually come from there so I didn't spend much time thinking about the test cases that involve these types and their names.

Looking a bit more into this, it does appear that having all these types is a bit redundant since UDT as a vector subtype is always handled the same way regardless of the types in that UDT (and the names might be a bit misleading as you noted).

@absurdfarce do you have any thoughts on this? I'm leaning towards changing these UDT names but keeping them since the tests are already functional and removing tests feels awkward

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent with the Python test was to make it as robust as possible in the hopes of catching future regression. I agree with @SiyaoIsHiding that UDTs will always be encoded as variable size type so these distinctions shouldn't matter for the current implementation. It's more of a hedge against future changes.

AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )");
AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )");
AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) ");
AssertFn(" org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comprehensive test cases! This is not urgent but I am nervous that this DataTypeParser can still go wrong someday in some weird corner cases 💀 I tested out the following cases and commented the result returned by the DataTypeParser. Are they all expected?

AssertFn(" org.apache.cassandra.db.marshal. VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) "); // A custom type, without throwing an error AssertFn("ORG.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type,3)"); // custom type AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, -1)"); // pass
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah honestly I also got a bit anxious when I was basically forced to change this class to support the whitespaces in the vector type and that's why I added all of these test cases.

As you noted here (and this comment) there's a few cases that still don't work but I really want to limit the changes I make on this class otherwise there's an even higher risk that something will break. None of those test cases you listed worked before this change (I think?) and I don't think they are valid cases in C* 5 or Astra so I'd be ok with keeping the behavior as is.

@absurdfarce any thoughts on this? Do you think it would be worth it to make this parser resilient enough to handle the test cases that Jane listed here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass 

I think this should pass, these are returned by the server so if the server does return them then the driver should be able to parse it. This is also in line with Bret's thoughts about not throwing an exception if someone tries to create a vector with a dimension of 0 (I actually just pushed a commit to remove that validation and it actually allows the serializer to use the default ctor which is a significant performance boost).

For a vector of dimension -1 the user will not actually be able to provide a vector that the driver accepts but that's a separate concern (it's not the concern of this class).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand @SiyaoIsHiding to be saying that all four of the cases she referenced above are parsed correctly, which seems right to me. Cases 2 and 3 above (Int32 subtype, size 3 and 0 respectively) seem correct on their face and case 1 is just a whitespace question (which I know has been a bit of an issue). The only troublesome one there to me would be the last case (Int32 subtype, size -1), but even there the type definition is syntacticly correct... it's only the semantics that are wrong. So if the parser is evaluating both I guess I'd expect it to fail but if it's just checking syntax I don't see a problem there.

If I've misunderstood then please let me know; I'm very sympathetic to @joao-r-reis 's point that it's worth our while to minimize changes to this functionality in order to reduce the chances of any additional breakage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this DataTypeParser only used when communicating with the server but not when parsing user-provided statements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two don't actually parse correctly, they are parsed as "custom types" which are the default when the driver doesn't recognize a type (it should be parsed as an actual vector type). The first case has whitespaces in some places where the parser still doesn't parse correctly and the second case has upper case letters which the parser also doesn't handle correctly.

Neither of these cases have ever worked before so I'm wondering if we can just leave them be since we don't expect any known supported type to be returned with these whitespaces or upper case letters...

Copy link
Collaborator Author

@joao-r-reis joao-r-reis Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this DataTypeParser only used when communicating with the server but not when parsing user-provided statements?

I believe so... Can't 100% confirm it without double checking the codebase. I can do that tomorrow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseFqTypeName is used in the SchemaParser to parse the schema using the system tables and to parse custom types returned by the server when it returns rows (QUERY/EXECUTE requests).

ParseTypeName is only used in the SchemaParser to also parse schema using system tables.

Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension);
}

AssertFn("vector<int,3>");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, here are a bunch of corner cases that are not important nor urgent to fix, but I would just put it here to make sure it is what we want.

 AssertFn("Vector<int, 3>"); // this will raise NullReferenceException AssertFn("vector<INT, 3>"); // NullReferenceException AssertFn("vector <int, 3>"); // Not a valid type vector AssertFn("vector<3, 3>"); // NullReferenceException AssertFn("<vector<int, 3>>"); // NullReferenceException AssertFn("vector<int, 3)"); // pass
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented about this here.

I will say though that regardless of whether these should be supported or not, throwing a null reference exception doesn't seem ok so I'll take a look.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's because the udtResolver param is being passed as null on the test, fixed it.

Copy link
Collaborator Author

@joao-r-reis joao-r-reis Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertFn("vector<3, 3>"); // NullReferenceException 

The driver assumes the first param to be a type so if the user doesn't define a UDT binding that matches the type name of the first parameter then an error will be returned.

This method doesn't appear to parse custom types (that are unrecognized by the driver) at all (as opposed to ParseFqTypeName) and I don't know if that's intentional or not. I don't think we have a need to change this anyway so...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think about

AssertFn("vector <int, 3>"); // Not a valid type vector

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as list <int> for example, I'm fairly sure list <int> also doesn't parse correctly but it's never been an issue. I do think that we should at the very least document these use cases on a JIRA ticket so we don't lose track of them and maybe we can improve this parser in the near future to account for these cases.

Ticket here.

.Insert(new TestTable1 { I = 3, J = new CqlVector<float>(10.1f, 10.2f, 10.3f) })
.ExecuteAsync();

// DON'T USE AllowFiltering, this is required in this case because the ANN operator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't understand this. Why I shouldn't use AllowFiltering?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It performs a full scan on all C* nodes so the performance can be very unpredictable. Here's some docs on this topic from the official C* docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Do you think we can consider the following wording, and adding another way of LINQ?

// Using AllowFiltering is not recommended due to unpredictable performance. // Here we use AllowFiltering in this case just for the example, // as the ANN operator is not supported in LINQ yet. var entity = (await table.Where(t => t.I == 3 && t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f})).AllowFiltering().ExecuteAsync()).SingleOrDefault(); // The following way also works var entity = (await (from t in table where t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f}) select t).AllowFiltering().ExecuteAsync()).SingleOrDefault();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated this section, can you give it another look?

Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension);
}

AssertFn("vector<int,3>");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think about

AssertFn("vector <int, 3>"); // Not a valid type vector

?

@joao-r-reis joao-r-reis merged commit 8564d39 into master Sep 9, 2024
@joao-r-reis joao-r-reis deleted the CSHARP-1014 branch September 9, 2024 15:26
@joao-r-reis joao-r-reis added this to the 3.21.0 milestone Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants