@rjsu26 rjsu26 commented Oct 31, 2025

  • Added Docker support: Created Dockerfile and docker-compose.yml for containerized deployments
  • Environment improvements: Added necessary environment variables to avoid dependency conflicts in training scripts
  • Documentation updates: Updated Tutorial.md with virtualenv setup instructions
  • Cleanup: Removed old output logs, result CSV files, and consolidated requirements files
  • Minor fixes: Adjusted the DeepSpeed configuration so the example run is faster and completes within the example run command's timeout.

rjsu26 commented Oct 31, 2025

Eventually we will split the recipe so that it detects the GPU type (H200, B200, etc.) and installs the matching PyTorch/CUDA build.
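A rough sketch of what the detection step might look like, assuming `nvidia-smi` is on the PATH. The helper names (`gpu_family`, `detect_gpu_names`) and the `KNOWN_FAMILIES` list are hypothetical and not part of this PR. Choosing the actual PyTorch/CUDA versions per family is deliberately left out, since this PR does not specify them.

```python
import subprocess
from typing import List

# Hypothetical list of GPU families the recipe could branch on (assumption,
# not taken from this PR).
KNOWN_FAMILIES = ("H200", "B200", "H100", "A100")


def gpu_family(device_name: str) -> str:
    """Return the first known family token found in a device name, else 'unknown'."""
    upper = device_name.upper()
    for family in KNOWN_FAMILIES:
        if family in upper:
            return family
    return "unknown"


def detect_gpu_names() -> List[str]:
    """Query nvidia-smi for device names; return an empty list if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    # Print one line per detected GPU with its inferred family.
    for name in detect_gpu_names():
        print(name, "->", gpu_family(name))
```

A setup script could then branch on the returned family to select a requirements file or base image, falling back to a default when the family is `unknown`.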

@rjsu26 rjsu26 self-assigned this Oct 31, 2025
- Add .dockerignore, .gitignore, Dockerfile, docker-compose.yml, Makefile
- Update training/.deepspeed_env and training/Tutorial.md
- Update training/run_opt-* scripts (1.3b, 13b, 350m variants)
- Tweak training/utils/ds_utils.py and training/utils/model/model_utils.py
@rjsu26 rjsu26 force-pushed the raj_new_libraries branch from 83d4da6 to 29ed7a1 Compare November 4, 2025 00:14
@rjsu26 rjsu26 merged commit c952261 into master Nov 4, 2025

3 participants