Conversation

@jlamypoirier (Collaborator) commented Jan 10, 2024

The Mistral model adapted to run Starcoder 2:

  • Use layer norm (RMSNorm still available as an option)
  • Use standard MLP (gated MLP still available as an option)
  • Add back biases (optional)
  • Change the (default?) tokenizer class
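A minimal numpy sketch of the first two toggles (layer norm vs. RMSNorm, standard vs. gated MLP). Function names and the `use_rms` / `w_gate` switches are illustrative, not the PR's actual code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm (the Starcoder 2 default here): zero-center, then scale by std.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm (the Mistral default, kept as an option): scale only, no mean subtraction.
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def normalize(x, use_rms=False):
    # Hypothetical config switch mirroring "RMS still available as option".
    return rms_norm(x) if use_rms else layer_norm(x)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    return x / (1 + np.exp(-x))

def mlp(x, w_up, w_down, w_gate=None):
    # Standard MLP (Starcoder 2): down(gelu(up(x))).
    # Gated MLP (Mistral, kept as an option): down(silu(gate(x)) * up(x)).
    if w_gate is None:
        return gelu(x @ w_up) @ w_down
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

Both paths keep the same input/output shapes, so the choice can live purely in the model config.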

Missing for Starcoder 1 (do we want to support it?):

  • Absolute position embeddings

Other notes:

  • Has fewer entries in the modeling auto mappings than gpt_bigcode (3 instead of 6); probably doesn't matter
  • Uses repeat for the KV cache in flash attention; might not be necessary

Still got a bunch of minor things to do (see todos)
