Marin


"I am not afraid of storms, for I am learning how to sail my ship."
– Louisa May Alcott

Marin is an open-source framework for the research and development of foundation models.

A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process stays transparent.

Marin's primary use case is training language models like Llama, DeepSeek, and Qwen. This covers data curation, transformation, filtering, tokenization, training, and evaluation.

We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.

The documentation for Marin is available on ReadTheDocs or in the docs/ folder.

To get started with Marin, follow the getting-started guides in the documentation.

Example

Marin experiments are defined as a set of steps that can depend on each other and are executed in a topological order, like a Makefile.

As a brief example of how you can use Marin, here is a complete script for training a tiny model on TinyStories. You can check out the full script for more details.

```python
from fray.cluster import ResourceConfig
from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"

# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)

# 3. Define training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_cpu(),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)

# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many [eval_harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)

if __name__ == "__main__":
    executor_main(steps=[
        nano_tinystories_model,
    ])
```

Here, we create two steps: one that tokenizes the dataset and one that trains the model. The training step depends on the tokenized dataset, so the executor runs it only after tokenization completes.

With slight modifications, you can extend this to train a larger model on a larger dataset or a mixture of datasets, and scale up to very large TPU pods (including multislice TPUs and, soon, multi-node GPUs!).
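For illustration, a scaled-up experiment might look something like the sketch below. This is a rough sketch only: `llama_8b`, `ResourceConfig.with_tpu(...)`, and the dataset id are assumptions made for this example rather than confirmed Marin APIs; consult the documentation for the actual interfaces.

```python
# Hypothetical sketch of scaling up the tutorial above.
# llama_8b and ResourceConfig.with_tpu(...) are assumed to exist by analogy
# with llama_nano and ResourceConfig.with_cpu(); they are not verbatim Marin APIs.
from fray.cluster import ResourceConfig
from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_8b  # llama_8b: assumed larger model config
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# Tokenize a larger corpus the same way the tutorial tokenizes TinyStories.
fineweb_tokenized = default_tokenize(
    name="HuggingFaceFW/fineweb-edu",      # illustrative dataset id
    dataset="HuggingFaceFW/fineweb-edu",
    tokenizer=llama3_tokenizer,
)

# Swap CPU resources for an accelerator and raise the batch size / step count.
big_train_config = SimpleTrainConfig(
    resources=ResourceConfig.with_tpu("v4-128"),  # assumed helper, analogous to with_cpu()
    train_batch_size=1024,
    num_train_steps=100_000,
    learning_rate=3e-4,
    weight_decay=0.1,
)

big_model = default_train(
    name="marin-8b-fineweb-sketch",
    tokenized=fineweb_tokenized,
    model_config=llama_8b,
    train_config=big_train_config,
    tags=["llama", "8b", "sketch"],
)

if __name__ == "__main__":
    executor_main(steps=[big_model])
```

The structure is identical to the TinyStories tutorial; only the dataset, model config, and resource request change.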

Agent-Friendly Recipes

  • New: See docs/recipes/add_dataset.md for a step-by-step guide to adding new datasets, designed for both humans and coding agents.
