GPT-NeoX-Colab

An accessible set of Google Colab notebooks for training and experimenting with small GPT-NeoX language models (SLMs) on limited resources.

Status

The Shakespeare Text Generation Notebook is available. The Python Code Completion Notebook is under development.

Overview

This repository provides a collection of example Google Colab notebooks that guide users through setting up and training GPT-NeoX models on tasks such as text generation and code completion. The notebooks are designed to run on a single GPU of the kind available on Colab (such as a T4), providing an educational environment for experimenting with GPT-NeoX’s configurations, data handling, and training workflows.

Features

Lightweight Model Configurations – Optimized settings for fast training on single GPUs.
Custom Data Handling – Instructions on loading, preprocessing, and tokenizing custom datasets.
Hyperparameter Experimentation – Modular code to quickly adjust configurations and observe results.
Experiment Tracking with DagsHub – Track model metrics, hyperparameters, and checkpoints.
Collaboration-Ready – GitHub-based code sharing with clear structure and collaborative tools.
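
To get a feel for what "lightweight" means on a single GPU, it can help to estimate a model's size before training. The sketch below uses the common rule of thumb of roughly 12 × hidden_size² parameters per transformer layer plus the token embedding matrix. The function name, the dictionary keys, and the example values are illustrative assumptions and are not taken from this repository's config files.

```python
def estimate_parameters(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Rough decoder-only transformer parameter count (ignores biases and layer norms)."""
    # Per layer: ~4*h^2 for attention (Q, K, V, output projection)
    # plus ~8*h^2 for an MLP with a 4x-wide hidden layer.
    per_layer = 12 * hidden_size ** 2
    # Token embedding matrix; an untied output projection would add
    # another vocab_size * hidden_size on top of this.
    embedding = vocab_size * hidden_size
    return num_layers * per_layer + embedding


# Hypothetical small configuration (illustrative values only).
small_config = {"num_layers": 6, "hidden_size": 512, "vocab_size": 50304}

n_params = estimate_parameters(**small_config)
print(f"~{n_params / 1e6:.1f}M parameters")
```

With a small model, the embedding matrix can account for a large share of the total, so the vocabulary size matters as much as depth and width.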

Available Notebooks

1. Shakespeare Text Generation

Objective: Train a small GPT-NeoX model on the Shakespeare dataset to explore basic text generation.

Highlights:

Setup and configuration guidance for small, Colab-friendly models.
Step-by-step instructions on data loading, tokenization, and model training.
Integrated experiment tracking with DagsHub to log metrics and visualize model performance.
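
The data loading and tokenization steps can be pictured with a small standalone sketch. It assumes character-level tokenization and a simple contiguous train/validation split. Both are assumptions made for illustration, and the notebook itself may use a different tokenizer and preprocessing pipeline. All function names here are hypothetical.

```python
def build_vocab(text: str) -> dict[str, int]:
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to a list of token ids."""
    return [vocab[ch] for ch in text]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Convert token ids back to text."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


def train_val_split(ids: list[int], val_fraction: float = 0.1):
    """Keep the last val_fraction of the sequence for validation."""
    split = int(len(ids) * (1 - val_fraction))
    return ids[:split], ids[split:]


sample = "To be, or not to be, that is the question."
vocab = build_vocab(sample)
ids = encode(sample, vocab)
train_ids, val_ids = train_val_split(ids)
print(len(vocab), len(train_ids), len(val_ids))
```

A character-level vocabulary keeps the embedding table tiny, which suits very small models, at the cost of longer sequences than a subword tokenizer would produce.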

Open in Colab ➔

2. Python Code Completion

Objective: Train a GPT-NeoX model from scratch for Python code completion tasks.

Highlights:

Includes dataset recommendations and preprocessing steps specific to Python code.
Detailed sections on customizing training algorithms and hyperparameters.
Integrated DagsHub tracking to facilitate comparison of different model configurations.
Evaluation on a public benchmark.
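
The benchmark is not named here, although Additional Resources lists the CodeXGLUE token-level and line-level code completion tasks. As a rough sketch of what such an evaluation measures, the functions below compute token-level accuracy and line-level exact match over already-generated predictions. This is an illustration under those assumptions, not the benchmark's official scorer, and the function names are hypothetical.

```python
def token_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of positions where the predicted token equals the reference token."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def exact_match(pred_lines: list[str], ref_lines: list[str]) -> float:
    """Fraction of lines that match the reference after trimming surrounding whitespace."""
    if len(pred_lines) != len(ref_lines):
        raise ValueError("pred_lines and ref_lines must have the same length")
    if not ref_lines:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(pred_lines, ref_lines))
    return matches / len(ref_lines)


# Toy example: three of four tokens are predicted correctly.
acc = token_accuracy(["def", "f", "(", ")"], ["def", "g", "(", ")"])
em = exact_match(["return x ", "pass"], ["return x", "break"])
print(acc, em)
```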

Open in Colab ➔

Quick Start

Prerequisites

Google Colab – A free Google account with access to Colab GPUs.
GitHub Account – For code sharing and collaboration.
DagsHub Account – For experiment tracking.

Setup and Execution

Run the Notebook

From this GitHub repository, open one of the notebooks in Colab using the Colab link at the top of the notebook:
- Shakespeare
- Code Completion

Then follow the setup and training instructions in the notebook.

Repository Structure

GPT-NeoX-Colab/
├── notebooks/
│   ├── shakespeare_training.ipynb
│   └── code_completion_training.ipynb
├── configs/
│   ├── shakespeare.yaml
│   └── code_completion.yaml
├── data/
│   └── (Data instructions and files)
├── tokenizer/
│   └── (Tokenizer files)
├── scripts/
│   └── (Helper scripts for training and data processing)
├── README.md
├── .gitignore
└── requirements.txt

Key Components

1. GitHub for Collaboration

Version Control: Use GitHub for all code updates, bug tracking, and feature requests.
Clear Commit Messages: Make changes with clear, descriptive messages to facilitate collaboration.
Repository Organization: The repository is structured to keep configurations, scripts, and data handling in separate folders for clarity.

2. DagsHub for Experiment Tracking

Project Creation: Log into DagsHub and create a project to track experiments.
Automated Logging: The notebooks are configured to log hyperparameters, metrics, and artifacts in real time.
Comparisons: Easily compare different model runs, configurations, and metrics.
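
Comparing runs is easier when every hyperparameter is logged under a stable, flat name. The sketch below flattens a nested configuration into dotted keys that a tracking tool can display side by side. The config contents and the function name are hypothetical, and how the notebooks actually send data to DagsHub is not shown here.

```python
def flatten_config(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-sections such as "model" or "optimizer".
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical run configuration (illustrative values only).
run_config = {
    "model": {"num_layers": 6, "hidden_size": 512},
    "optimizer": {"type": "Adam", "lr": 0.0006},
}

params = flatten_config(run_config)
for name, value in sorted(params.items()):
    print(f"{name} = {value}")

# Assumption: if a notebook logs through MLflow (DagsHub offers an
# MLflow-compatible tracking endpoint), a flat dict like `params`
# could be passed to mlflow.log_params(params).
```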

Additional Resources

GPT-NeoX Documentation: EleutherAI GPT-NeoX
SourceGraph for Code Navigation: SourceGraph GPT-NeoX – For navigating the GPT-NeoX codebase.
Benchmarking Datasets:
- CodeXGLUE Token-Level Code Completion
- CodeXGLUE Line-Level Code Completion

Contribution Guidelines

We welcome contributions! To contribute:

Fork this repository.
Create a new branch (feature/some-feature).
Commit your changes and open a pull request.
Ensure that your code is well-documented and follows best practices.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgments

Special thanks to EleutherAI for developing GPT-NeoX and providing the open-source community with invaluable tools and resources.
Thanks to Weights & Biases for providing excellent tools for experiment tracking.

For further assistance in exploring the GPT-NeoX codebase, visit SourceGraph GPT-NeoX.
