simple-parallel-transformer

As it says on the tin, this repo has a simple implementation of a Transformer model, with some additional improvements. The purpose is mainly pedagogical.

Design

More specifically, its block incorporates the following modifications (a rough code sketch follows the list):

  • CogView's Sandwich LayerNorm, which places a LayerNorm at both the start and the end of each block.
  • Single-block design from the Gated Attention Unit, by way of Mamba. Instead of separate attention and feedforward layers, the block combines them in parallel as a gated attention layer, with the gate passed through a SiLU activation (as in SwiGLU). It also expands the internal dimension to be larger than the residual dimension.
  • Smeared keys on each head, to facilitate learning of previous-token heads, induction heads, and n-grams.
  • Per-head, data-dependent attention biasing. This is done with a projection followed by a sigmoid, a cumulative sum to produce "absolute positions", and then a subtraction to obtain "relative position" biases for each query-key pair, similar to ALiBi.
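
Taken together, these pieces can live in a single block. Below is a minimal PyTorch sketch of how such a block might fit together; the names, shapes, projection layout, and default hyperparameters are illustrative assumptions, not the repo's actual code.

```python
import math
import torch
import torch.nn.functional as F
from torch import nn


class ParallelBlock(nn.Module):
    """Illustrative gated parallel transformer block (a sketch, not this repo's code)."""

    def __init__(self, dim, heads=8, expansion=2):
        super().__init__()
        inner = dim * expansion                        # expanded internal dimension
        self.heads = heads
        self.head_dim = inner // heads
        self.norm_in = nn.LayerNorm(dim)               # sandwich norm: LayerNorm at block start...
        self.norm_out = nn.LayerNorm(dim)              # ...and another at block end
        self.to_qkvg = nn.Linear(dim, inner * 4)       # queries, keys, values, and gate in one projection
        self.to_pos = nn.Linear(dim, heads)            # per-head data-dependent position signal
        self.smear = nn.Parameter(torch.zeros(heads))  # per-head key-smearing coefficient
        self.to_out = nn.Linear(inner, dim)

    def forward(self, x):
        b, t, _ = x.shape
        h, d = self.heads, self.head_dim
        residual = x
        x = self.norm_in(x)

        q, k, v, g = self.to_qkvg(x).chunk(4, dim=-1)
        q, k, v = (z.view(b, t, h, d).transpose(1, 2) for z in (q, k, v))

        # Smeared keys: each head mixes a learned fraction of the previous token's key into its own.
        alpha = torch.sigmoid(self.smear).view(1, h, 1, 1)
        k_prev = F.pad(k, (0, 0, 1, 0))[:, :, :-1]     # keys shifted right by one position
        k = (1 - alpha) * k + alpha * k_prev

        # Data-dependent attention bias: projection -> sigmoid -> cumulative sum gives
        # "absolute positions"; pairwise subtraction gives ALiBi-like relative biases per head.
        pos = torch.sigmoid(self.to_pos(x)).cumsum(dim=1).transpose(1, 2)   # (b, h, t)
        bias = pos.unsqueeze(-1) - pos.unsqueeze(-2)                        # (b, h, t, t)

        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d) - bias
        attn = scores.masked_fill(causal, float("-inf")).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, t, h * d)
        out = out * F.silu(g)                          # gating replaces a separate feedforward layer
        return residual + self.norm_out(self.to_out(out))
```

In this sketch the gate `g` comes from the same projection as the queries, keys, and values, so the attention output and the feedforward-style gating share a single expanded projection, which is what lets one layer stand in for both the attention and feedforward sublayers.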

References
