Marin


"I am not afraid of storms, for I am learning how to sail my ship."
– Louisa May Alcott

Marin is an open-source framework for the research and development of foundation models.

A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process stays transparent.

Marin's primary use case is training language models like Llama, DeepSeek, and Qwen. This covers data curation, transformation, filtering, tokenization, training, and evaluation.

We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.

The documentation for Marin is available on ReadTheDocs or in the docs/ folder.

To get started with Marin, follow the getting-started guides in the documentation.

Example

Marin experiments are defined as a set of steps that can depend on each other and are executed in a topological order, like a Makefile.

As a brief example of how you can use Marin, here is a complete script for training a tiny model on TinyStories. You can check out the full script for more details.

```python
from fray.cluster import ResourceConfig
from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"

# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)

# 3. Define training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_cpu(),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)

# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many [eval_harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)

if __name__ == "__main__":
    executor_main(steps=[
        nano_tinystories_model,
    ])
```

Here, we create two steps: one that tokenizes the dataset and one that trains the model. The training step depends on the tokenized dataset, so the executor runs it only after tokenization completes.

With slight modifications, you can extend this to train a larger model on a larger dataset or a mixture of datasets, and scale up to very large TPU pods (including multislice TPUs and, soon, multi-node GPUs!).
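For illustration, a scaled-up experiment might look something like the sketch below. This is a rough sketch only: `llama_8b`, `ResourceConfig.with_tpu(...)`, and the dataset id are assumptions made for this example rather than confirmed Marin APIs; consult the documentation for the actual interfaces.

```python
# Hypothetical sketch of scaling up the tutorial above.
# llama_8b and ResourceConfig.with_tpu(...) are assumed to exist by analogy
# with llama_nano and ResourceConfig.with_cpu(); they are not verbatim Marin APIs.
from fray.cluster import ResourceConfig
from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_8b  # llama_8b: assumed larger model config
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# Tokenize a larger corpus the same way the tutorial tokenizes TinyStories.
fineweb_tokenized = default_tokenize(
    name="HuggingFaceFW/fineweb-edu",      # illustrative dataset id
    dataset="HuggingFaceFW/fineweb-edu",
    tokenizer=llama3_tokenizer,
)

# Swap CPU resources for an accelerator and raise the batch size / step count.
big_train_config = SimpleTrainConfig(
    resources=ResourceConfig.with_tpu("v4-128"),  # assumed helper, analogous to with_cpu()
    train_batch_size=1024,
    num_train_steps=100_000,
    learning_rate=3e-4,
    weight_decay=0.1,
)

big_model = default_train(
    name="marin-8b-fineweb-sketch",
    tokenized=fineweb_tokenized,
    model_config=llama_8b,
    train_config=big_train_config,
    tags=["llama", "8b", "sketch"],
)

if __name__ == "__main__":
    executor_main(steps=[big_model])
```

The structure is identical to the TinyStories tutorial; only the dataset, model config, and resource request change.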

Agent-Friendly Recipes

  • New: See docs/recipes/add_dataset.md for a step-by-step guide to adding new datasets, designed for both humans and coding agents.
