Conversation

akkadhim commented Dec 25, 2024

This PR adds the recommendation system experiments. Please ignore any changes outside the examples/recomm_system directory.

BooBSD commented Dec 26, 2024

@akkadhim Could you please export your noisy datasets to a CSV file for testing in other languages?

akkadhim (Author) commented:
> @akkadhim Could you please export your noisy datasets to a CSV file for testing in other languages?

Sure, below are different datasets for different noise ratios.

noisy_dataset_0.05.csv
noisy_dataset_0.005.csv
noisy_dataset_0.02.csv
noisy_dataset_0.2.csv
noisy_dataset_0.01.csv
noisy_dataset_0.1.csv

BooBSD commented Dec 27, 2024

@akkadhim Thank you!

BooBSD commented Dec 27, 2024

@akkadhim Is it correct that, after one-hot booleanization, your input data consists of 10709 bits? This includes 1350 unique product_ids + 317 categories + 9042 user_ids.

akkadhim (Author) commented:
> @akkadhim Is it correct that, after one-hot booleanization, your input data consists of 10709 bits? This includes 1350 unique product_ids + 317 categories + 9042 user_ids.

After expanding the original dataset and adding the noise, the unique features are:

  • Users: 1193
  • Items: 1350
  • Categories: 211

I used one-hot encoding for the TM classifier, and at that step the dataset was split into train and test portions.
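
For readers who want to reproduce this step, a minimal sketch of the one-hot booleanization and split described above, assuming pandas and scikit-learn; the column names follow this thread, while the label column (rating) and filename are illustrative rather than taken from the PR:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("noisy_dataset_0.05.csv")  # any of the noise ratios above

# One-hot encode the categorical columns: each unique user, item, and
# category value becomes a single bit of the Boolean input vector.
X = pd.get_dummies(df[["user_id", "product_id", "category"]]).to_numpy(dtype="uint8")
y = df["rating"].to_numpy()  # hypothetical label column

# Split into train and test portions at this stage, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```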

BooBSD commented Dec 27, 2024

@akkadhim
Got it. However, the columns category and user_id contain lists of categories and users, joined by the "|" and "," characters (for example: "Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables" or "AH4BURHCF5UQFZR4VJQXBEQCTYVQ,AGSJLPK6HU2FB4HII64NQ3OYFFFA,AGG75KFRXNLCYVRAPA6D4ZBNTNSA"). Why weren't they split into individual unique categories and user IDs? Could you confirm whether your booleanization method is correct?
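
For comparison, a minimal sketch of the token-level booleanization suggested here, assuming pandas and NumPy; the multi_hot helper is illustrative, and the column names and separators follow the examples above:

```python
import numpy as np
import pandas as pd

def multi_hot(series: pd.Series, sep: str) -> np.ndarray:
    """One bit per unique token appearing anywhere in the column."""
    token_lists = [str(v).split(sep) for v in series.fillna("")]
    vocab = {t: i for i, t in enumerate(sorted({t for ts in token_lists for t in ts if t}))}
    bits = np.zeros((len(token_lists), len(vocab)), dtype=np.uint8)
    for row, tokens in enumerate(token_lists):
        for t in tokens:
            if t:
                bits[row, vocab[t]] = 1
    return bits

df = pd.read_csv("noisy_dataset_0.05.csv")

# Split the "|"-joined category paths and ","-joined user lists into
# individual tokens, then concatenate the bit blocks into one input vector.
X = np.hstack([multi_hot(df["category"], "|"), multi_hot(df["user_id"], ",")])
```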

BooBSD commented Dec 27, 2024

@akkadhim
I tested both booleanization methods (yours and mine) and obtained approximately the same validation accuracy.
I split your dataset such that the first 80% is used for training, and the last 20% for validation.

My best validation accuracy:

  • noisy_dataset_0.005.csv: 99.73%
  • noisy_dataset_0.2.csv: 84.87%

Here is the proof:

```
#1   Accuracy: 83.81%  Best: 83.81%  Training: 1.946s  Testing: 0.107s
#2   Accuracy: 96.69%  Best: 96.69%  Training: 0.609s  Testing: 0.009s
#3   Accuracy: 99.69%  Best: 99.69%  Training: 0.442s  Testing: 0.008s
#4   Accuracy: 99.69%  Best: 99.69%  Training: 0.350s  Testing: 0.007s
#5   Accuracy: 99.69%  Best: 99.69%  Training: 0.279s  Testing: 0.007s
#6   Accuracy: 99.69%  Best: 99.69%  Training: 0.238s  Testing: 0.006s
#7   Accuracy: 99.69%  Best: 99.69%  Training: 0.192s  Testing: 0.006s
#8   Accuracy: 99.69%  Best: 99.69%  Training: 0.178s  Testing: 0.006s
#9   Accuracy: 99.69%  Best: 99.69%  Training: 0.173s  Testing: 0.006s
#10  Accuracy: 99.69%  Best: 99.69%  Training: 0.147s  Testing: 0.005s
....
#300 Accuracy: 99.73%  Best: 99.73%  Training: 0.085s  Testing: 0.003s
#301 Accuracy: 99.69%  Best: 99.73%  Training: 0.090s  Testing: 0.003s
#302 Accuracy: 99.73%  Best: 99.73%  Training: 0.086s  Testing: 0.003s
#303 Accuracy: 99.73%  Best: 99.73%  Training: 0.084s  Testing: 0.003s
#304 Accuracy: 99.73%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#305 Accuracy: 99.73%  Best: 99.73%  Training: 0.089s  Testing: 0.003s
#306 Accuracy: 99.73%  Best: 99.73%  Training: 0.080s  Testing: 0.003s
#307 Accuracy: 99.73%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#308 Accuracy: 99.73%  Best: 99.73%  Training: 0.089s  Testing: 0.003s
#309 Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#310 Accuracy: 99.69%  Best: 99.73%  Training: 0.083s  Testing: 0.003s
#311 Accuracy: 99.69%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#312 Accuracy: 99.73%  Best: 99.73%  Training: 0.082s  Testing: 0.003s
#313 Accuracy: 99.73%  Best: 99.73%  Training: 0.079s  Testing: 0.003s
#314 Accuracy: 99.69%  Best: 99.73%  Training: 0.081s  Testing: 0.003s
#315 Accuracy: 99.73%  Best: 99.73%  Training: 0.083s  Testing: 0.003s
#316 Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#317 Accuracy: 99.73%  Best: 99.73%  Training: 0.085s  Testing: 0.003s
#318 Accuracy: 99.73%  Best: 99.73%  Training: 0.086s  Testing: 0.003s
#319 Accuracy: 99.73%  Best: 99.73%  Training: 0.088s  Testing: 0.003s
#320 Accuracy: 99.73%  Best: 99.73%  Training: 0.091s  Testing: 0.003s
```

These results were obtained on a CPU, and it works quite fast.
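
For reference, a minimal sketch of the ordered split described above (first 80% for training, last 20% for validation, with no shuffling), assuming pandas; the filename is one of the datasets attached earlier:

```python
import pandas as pd

df = pd.read_csv("noisy_dataset_0.005.csv")
cut = int(len(df) * 0.8)                     # first 80% of rows for training,
train, valid = df.iloc[:cut], df.iloc[cut:]  # last 20% for validation
```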

akkadhim (Author) commented:
> @akkadhim Got it. However, the columns category and user_id contain lists of categories and users, joined by the "|" and "," characters [...]. Why weren't they split into individual unique categories and user IDs? Could you confirm whether your booleanization method is correct?

For user_id, standard CSV quoting handles such cases: a value containing commas is enclosed in double quotes, so the list survives as a single field. The category column likewise keeps the original hierarchical structure of the dataset. Splitting these fields would alter the representation of the hierarchical categories and their associated user IDs.
Yes, the booleanization is correct.
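
A minimal sketch of that quoting behaviour, using Python's standard csv module; the values are shortened from the examples quoted above, and the product id is hypothetical:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL (the default) wraps any field containing the delimiter,
# so the comma-joined user_id list is written as one quoted field.
csv.writer(buf).writerow(
    ["P001", "Computers&Accessories|Cables|USBCables", "AH4BURHC...,AGSJLPK6..."]
)

row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row[2] == "AH4BURHC...,AGSJLPK6..."  # read back as a single field
```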

akkadhim (Author) commented:
> @akkadhim I tested both booleanization methods (yours and mine) and obtained approximately the same validation accuracy. [...] These results were obtained on a CPU, and it works quite fast.

Very impressive! Nice work, @BooBSD!

BooBSD commented Jan 29, 2025

@akkadhim Hey, can you please share your presentation on GloVe, Word2Vec, etc., from the latest meeting call?

akkadhim (Author) commented:
> @akkadhim Hey, can you please share your presentation on GloVe, Word2Vec, etc., from the latest meeting call?

Sure, please send me your email address.

BooBSD commented Jan 30, 2025
