Transformers From Scratch

Contents: Features · Example · Details · Datasets · Models and notebooks · Repository structure · Installation · Running · References

This repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, based on:

  • The seminal paper Attention Is All You Need by Vaswani et al.[1], which introduces the attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art machine translation performance and surpassing previous LSTM- and CNN-based neural machine translation architectures.
  • The chapter on Transformers and Large Language Models from Speech and Language Processing by Jurafsky & Martin[2], which provides a more comprehensive and illustrative look into some of the high-level details discussed in Attention Is All You Need.

Features

  • Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
  • Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
  • Various decoding methods for causal/sequence-to-sequence generation (a minimal sampling sketch follows this list):
    • Search-based (greedy and beam search)
    • Sampling-based (nucleus, temperature and top-k sampling)
  • Example applications to real-world datasets.
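
As a rough illustration of the sampling-based decoding methods, the sketch below combines temperature scaling with top-k filtering over a vector of next-token logits. It is a minimal standalone example with an illustrative function name and signature, not the project's TemperatureSamplingDecoder.

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, k: int | None = None) -> int:
    # illustrative helper, not part of the transformer package
    # temperature scaling: higher temperature flattens the distribution, lower sharpens it
    logits = logits / temperature
    if k is not None:
        # top-k filtering: keep only the k most likely tokens, mask out the rest
        topk = torch.topk(logits, k)
        logits = torch.full_like(logits, float("-inf")).scatter(0, topk.indices, topk.values)
    # sample one token ID from the resulting distribution
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

Greedy search corresponds to taking logits.argmax() instead of sampling, while nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability exceeds p rather than a fixed k.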

PyTorch restrictions

This project is implemented using PyTorch and PyTorch Lightning.

As PyTorch provides a number of transformer and attention related layers in its torch.nn submodule, this project explicitly avoids the use of those layers, notably:

  • nn.MultiheadAttention
  • nn.Transformer, nn.TransformerEncoder and nn.TransformerDecoder
  • nn.TransformerEncoderLayer and nn.TransformerDecoderLayer

All other layers provided by torch.nn are allowed, including:

  • nn.Embedding: For token embedding look-up by vocabulary ID.
  • nn.LayerNorm: For layer normalization as implemented in Attention Is All You Need.
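
To give a sense of what this restriction means in practice, below is a minimal sketch of scaled dot-product attention written only with basic tensor operations (no nn.MultiheadAttention). It is an illustrative example, not the project's actual attention module.

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # illustrative only; q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    # attention scores: query-key similarity, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    if mask is not None:
        # positions where mask is False are excluded from attention
        scores = scores.masked_fill(~mask, float("-inf"))
    # normalize scores into attention weights and take a weighted sum of the values
    weights = torch.softmax(scores, dim=-1)
    return weights @ v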

Other restrictions

  • Transformer models implemented and made available in other libraries such as HuggingFace's transformers are not used in this project.
  • However, the tokenizers provided by transformers were used, as developing tokenization algorithms was not the primary objective of this project.
  • No existing "x from scratch" resources were used, such as the famous Let's build GPT: from scratch, in code, spelled out. by Andrej Karpathy[3].
  • No other online resources were used, apart from official documentation for packages such as PyTorch, PyTorch Lightning and HuggingFace Tokenizers.

Example

Training a causal language model to generate "Florida man"-style news headlines.

from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# initialize the causal language model
model = CausalLM(
    params=TransformerParams(context_length=64),
    tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
    model=model,
)

# generation without context
decoder.generate()
# 'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
# 'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'

Details

While the original architecture described in Attention Is All You Need is an encoder-decoder transformer applied to neural machine translation (a sequence-to-sequence task), this project was designed to be more general, supporting a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures, summarized below.

  • Encoder-only
    • Tasks: contextualized embedding and supervised inference
    • Example use-cases: producing contextualized token embeddings, sentiment classification, intent classification
  • Decoder-only
    • Tasks: autoregressive generation
    • Example use-cases: text generation
  • Encoder-decoder
    • Tasks: sequence-to-sequence generation
    • Example use-cases: machine translation, text summarization
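
The main mechanical difference between these variants is the attention mask: decoder-only models (and the decoder half of encoder-decoder models) use a causal mask so that each position can only attend to earlier positions, while encoder-only models attend bidirectionally. A minimal sketch of such a mask, assuming an attention function like the one shown earlier:

import torch

# causal (lower-triangular) mask: position i may attend to positions 0..i only,
# which is what makes decoder-only generation autoregressive
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# an encoder-only model attends bidirectionally, i.e. no mask (or an all-True mask),
# so every position sees the whole input sequence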

Datasets

The following datasets were used to test the above transformer implementations on various tasks.

  • arXiv Paper Abstracts: arXiv manuscripts and their metadata including titles, abstracts and categories.
  • CommonLit Readability Prize: Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
  • Reddit r/FloridaMan: News headlines about various (often funny and irrational) actions performed by Florida men and women.
  • Europarl: Transcriptions of European Parliament proceedings between 1996 and 2006, collected in 11 languages.

Models and notebooks

Encoder-only models

  • ClassifierLM: A generic transformer-based language model for assigning classes to text.
  • RegressorLM: A generic transformer-based language model for assigning scores to text.

Decoder-only models

  • CausalLM: A generic transformer-based language model for generating text in an autoregressive manner.

Encoder-decoder models

  • Seq2SeqLM: A generic transformer-based language model for generating output text given an input text (see the construction sketch below).
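
For orientation, the sketch below shows how these wrappers might be constructed, assuming they all follow the same params/tokenizer constructor pattern as the CausalLM example shown earlier; the exact constructor arguments (for instance, the number of classes for ClassifierLM) may differ in the actual implementation.

from transformers import LlamaTokenizer

from transformer.params import TransformerParams
from transformer.models import CausalLM, ClassifierLM, RegressorLM, Seq2SeqLM

# tokenizer shared across models, as in the earlier example
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

params = TransformerParams(context_length=64)

# NOTE: constructor arguments are assumed to mirror the CausalLM example; check the actual class signatures
causal = CausalLM(params=params, tokenizer=tokenizer)          # decoder-only: autoregressive generation
classifier = ClassifierLM(params=params, tokenizer=tokenizer)  # encoder-only: class labels
regressor = RegressorLM(params=params, tokenizer=tokenizer)    # encoder-only: scalar scores
seq2seq = Seq2SeqLM(params=params, tokenizer=tokenizer)        # encoder-decoder: input text to output text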

Repository structure

Installation

The transformer implementation is installable as a local Python package, named transformer.

pip install -e .

To run the notebooks, you will need additional dependencies, which can be installed with the notebooks extra.

pip install -e ".[notebooks]"

This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.

Running

You should be able to simply run the Jupyter notebooks in the notebooks/ folder.

Beware: they take time to run, even with a good GPU (especially the sequence-to-sequence notebooks)!

References

[1] Vaswani et al., "Attention Is All You Need", Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), 6000-6010.
[2] Dan Jurafsky & James H. Martin, "Transformers and Large Language Models", Speech and Language Processing, 3rd ed. draft (2024), ch. 10.
[3] Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out.", YouTube (2023).

© 2024-2025, Edwin Onuonga - Published under the terms of the MIT license.
Authored and maintained by Edwin Onuonga.
