Note: The most recent model was trained on CPU using 1200 (query, positive, negative) samples. The training loss is visualized in loss_1200.jpeg.
pip install -r requirements.txtThis command loads a subset of the dataset, performs cleaning, flattening, and tokenization, and then trains the model on 10 samples to demonstrate pipeline functionality, decreasing loss, and similarity matrix. To train on a larger dataset, subset_size value in src/config.py can be updated.
python -m src.runAn initially empty folder which later stores tokenized_msmacro_dataset\ and cleaned_dataset.json
Main source code directory with modular components.
Unified training pipeline that handles data import, cleaning, preprocessing, and model training using configurations defined in config.py.
Holds training hyperparameters like dataset path, batch size, learning rate, number of epochs, etc. Subset size currently set to 10 to check if the pipeline is working properly.
import_data.py- Loads raw data and extracts query, positive, and negative passages.preprocess.py- Flattens and tokenizes the dataset into triplets and saves them using Hugging FaceDataset.save_to_disk.collator.py– Defines a custom data collator that pads and batches query, positive, and negative inputs for training.dataloader.py- Creates a PyTorch DataLoader for the tokenized dataset using a custom collator.
embeddingmodel.py– Wraps the Hugging Face transformer model (SmolLM2-135M) to produce sentence-level embeddings using the last token representation.dualencoder.py– Combines the same embedding model for query, positive, and negative inputs. Performs normalization and returns embeddings.tokenizer.py– Loads and returns the tokenizer used for preprocessing.
utils.py– Includessave_checkpoint()to save model periodically.trainer.py– Containstrain_embedding_model()andtrain_one_epoch()methods to train the model.
Shows the contrastive loss trend over epochs during training on 1200 samples.
Contains:
- Logs from early experimentation on 500 samples.
- Saved similarity matrix and loss values for inspection.