- Notifications
You must be signed in to change notification settings - Fork 249
CSHARP-1014 Add vector support #611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I will try to add some tests :)
SiyaoIsHiding left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good! None of my request of changes is urgent.
| public BigInteger b { get; set; } | ||
| } | ||
| | ||
| public class MixedTypeTwo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused for a moment so I wanted to put a note here: Those FixedType, MixedType, e.t.c mean they are UDTs with fixed/var-sized fields. When UDT types are used as a subtype in vectors, no matter what field they have, UDTs are always considered var-sized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to be honest, I looked at Bret's python driver PR for some of these test cases and these types actually come from there so I didn't spend much time thinking about the test cases that involve these types and their names.
Looking a bit more into this, it does appear that having all these types is a bit redundant since UDT as a vector subtype is always handled the same way regardless of the types in that UDT (and the names might be a bit misleading as you noted).
@absurdfarce do you have any thoughts on this? I'm leaning towards changing these UDT names but keeping them since the tests are already functional and removing tests feels awkward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent with the Python test was to make it as robust as possible in the hopes of catching future regression. I agree with @SiyaoIsHiding that UDTs will always be encoded as variable size type so these distinctions shouldn't matter for the current implementation. It's more of a hedge against future changes.
| AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )"); | ||
| AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 )"); | ||
| AssertFn("org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) "); | ||
| AssertFn(" org.apache.cassandra.db.marshal.VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comprehensive test cases! This is not urgent but I am nervous that this DataTypeParser can still go wrong someday in some weird corner cases 💀 I tested out the following cases and commented the result returned by the DataTypeParser. Are they all expected?
AssertFn(" org.apache.cassandra.db.marshal. VectorType( org.apache.cassandra.db.marshal.Int32Type , 3 ) "); // A custom type, without throwing an error AssertFn("ORG.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type,3)"); // custom type AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, -1)"); // passThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah honestly I also got a bit anxious when I was basically forced to change this class to support the whitespaces in the vector type and that's why I added all of these test cases.
As you noted here (and this comment) there's a few cases that still don't work but I really want to limit the changes I make on this class otherwise there's an even higher risk that something will break. None of those test cases you listed worked before this change (I think?) and I don't think they are valid cases in C* 5 or Astra so I'd be ok with keeping the behavior as is.
@absurdfarce any thoughts on this? Do you think it would be worth it to make this parser resilient enough to handle the test cases that Jane listed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssertFn("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.Int32Type, 0)"); // pass I think this should pass, these are returned by the server so if the server does return them then the driver should be able to parse it. This is also in line with Bret's thoughts about not throwing an exception if someone tries to create a vector with a dimension of 0 (I actually just pushed a commit to remove that validation and it actually allows the serializer to use the default ctor which is a significant performance boost).
For a vector of dimension -1 the user will not actually be able to provide a vector that the driver accepts but that's a separate concern (it's not the concern of this class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand @SiyaoIsHiding to be saying that all four of the cases she referenced above are parsed correctly, which seems right to me. Cases 2 and 3 above (Int32 subtype, size 3 and 0 respectively) seem correct on their face and case 1 is just a whitespace question (which I know has been a bit of an issue). The only troublesome one there to me would be the last case (Int32 subtype, size -1), but even there the type definition is syntacticly correct... it's only the semantics that are wrong. So if the parser is evaluating both I guess I'd expect it to fail but if it's just checking syntax I don't see a problem there.
If I've misunderstood then please let me know; I'm very sympathetic to @joao-r-reis 's point that it's worth our while to minimize changes to this functionality in order to reduce the chances of any additional breakage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this DataTypeParser only used when communicating with the server but not when parsing user-provided statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first two don't actually parse correctly, they are parsed as "custom types" which are the default when the driver doesn't recognize a type (it should be parsed as an actual vector type). The first case has whitespaces in some places where the parser still doesn't parse correctly and the second case has upper case letters which the parser also doesn't handle correctly.
Neither of these cases have ever worked before so I'm wondering if we can just leave them be since we don't expect any known supported type to be returned with these whitespaces or upper case letters...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this
DataTypeParseronly used when communicating with the server but not when parsing user-provided statements?
I believe so... Can't 100% confirm it without double checking the codebase. I can do that tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ParseFqTypeName is used in the SchemaParser to parse the schema using the system tables and to parse custom types returned by the server when it returns rows (QUERY/EXECUTE requests).
ParseTypeName is only used in the SchemaParser to also parse schema using system tables.
| Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension); | ||
| } | ||
| | ||
| AssertFn("vector<int,3>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, here are a bunch of corner cases that are not important nor urgent to fix, but I would just put it here to make sure it is what we want.
AssertFn("Vector<int, 3>"); // this will raise NullReferenceException AssertFn("vector<INT, 3>"); // NullReferenceException AssertFn("vector <int, 3>"); // Not a valid type vector AssertFn("vector<3, 3>"); // NullReferenceException AssertFn("<vector<int, 3>>"); // NullReferenceException AssertFn("vector<int, 3)"); // passThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented about this here.
I will say though that regardless of whether these should be supported or not, throwing a null reference exception doesn't seem ok so I'll take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it's because the udtResolver param is being passed as null on the test, fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssertFn("vector<3, 3>"); // NullReferenceException The driver assumes the first param to be a type so if the user doesn't define a UDT binding that matches the type name of the first parameter then an error will be returned.
This method doesn't appear to parse custom types (that are unrecognized by the driver) at all (as opposed to ParseFqTypeName) and I don't know if that's intentional or not. I don't think we have a need to change this anyway so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think about
AssertFn("vector <int, 3>"); // Not a valid type vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same as list <int> for example, I'm fairly sure list <int> also doesn't parse correctly but it's never been an issue. I do think that we should at the very least document these use cases on a JIRA ticket so we don't lose track of them and maybe we can improve this parser in the near future to account for these cases.
doc/features/vectors/README.md Outdated
| .Insert(new TestTable1 { I = 3, J = new CqlVector<float>(10.1f, 10.2f, 10.3f) }) | ||
| .ExecuteAsync(); | ||
| | ||
| // DON'T USE AllowFiltering, this is required in this case because the ANN operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't understand this. Why I shouldn't use AllowFiltering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It performs a full scan on all C* nodes so the performance can be very unpredictable. Here's some docs on this topic from the official C* docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Do you think we can consider the following wording, and adding another way of LINQ?
// Using AllowFiltering is not recommended due to unpredictable performance. // Here we use AllowFiltering in this case just for the example, // as the ANN operator is not supported in LINQ yet. var entity = (await table.Where(t => t.I == 3 && t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f})).AllowFiltering().ExecuteAsync()).SingleOrDefault(); // The following way also works var entity = (await (from t in table where t.J == CqlVector<float>.New(new [] {10.1f, 10.2f, 10.3f}) select t).AllowFiltering().ExecuteAsync()).SingleOrDefault();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated this section, can you give it another look?
| Assert.AreEqual(3, ((VectorColumnInfo)dataType.TypeInfo).Dimension); | ||
| } | ||
| | ||
| AssertFn("vector<int,3>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think about
AssertFn("vector <int, 3>"); // Not a valid type vector?
No description provided.