This repository contains a from-scratch implementation of the Direct Preference Optimization (DPO) loss. The model is fine-tuned on 500 training examples sampled from the UltraFeedback Binarized preference dataset, using a vanilla PyTorch training loop.
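For orientation, below is a minimal sketch of the DPO loss as described in the original paper. The function name, argument names, and tensor shapes are illustrative and not necessarily those used in this repository; the inputs are assumed to be per-sequence log-probabilities already summed over tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding the log-probability
    of the chosen/rejected completion under the policy or frozen reference model.
    """
    # Log-ratios of the policy against the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO objective: -log sigmoid(beta * (chosen margin - rejected margin))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, handy for logging the chosen/rejected reward margin
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return loss, chosen_rewards, rejected_rewards
```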
- When tuning hyperparameters, keep in mind that the effectiveness of particular values depends on your specific setup. For example, in my setup a beta value of 0.1 yields better results than 0.2; similarly, a lower weight decay and a longer sequence length tend to help. Treat these values as initial benchmarks and fine-tune them for your own setup to get the best results.
The DPO model is trained with the following config:
Training Parameters:
- Epochs: 1
- Batch Size: 1
- Gradient Accumulation Steps: 2
- Learning Rate: 1e-7
- Learning Rate Decay: Cosine
- Weight Decay: 1e-2

DPO Loss Parameter:
- Beta: 0.1
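For reference, this is roughly how the config above could be wired into a vanilla PyTorch loop. `model`, `ref_model`, `train_loader`, and `compute_logps` are placeholders for the repository's actual objects, and `dpo_loss` refers to the sketch shown earlier; treat this as an illustration of the hyperparameters, not the repository's exact code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, ref_model, train_loader, compute_logps,
          epochs=1, grad_accum_steps=2, lr=1e-7, weight_decay=1e-2, beta=0.1):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Cosine decay over the number of optimizer updates (not micro-batches)
    total_updates = max(1, (len(train_loader) * epochs) // grad_accum_steps)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_updates)

    ref_model.eval()  # the reference model stays frozen
    model.train()
    for _ in range(epochs):
        for step, batch in enumerate(train_loader):
            # Per-sequence log-probs for the chosen and rejected completions
            policy_chosen, policy_rejected = compute_logps(model, batch)
            with torch.no_grad():
                ref_chosen, ref_rejected = compute_logps(ref_model, batch)

            loss, _, _ = dpo_loss(policy_chosen, policy_rejected,
                                  ref_chosen, ref_rejected, beta=beta)
            # Scale by the accumulation steps so gradients average correctly
            (loss / grad_accum_steps).backward()

            # Update weights every `grad_accum_steps` micro-batches
            if (step + 1) % grad_accum_steps == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
```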
Title | Link | Description |
---|---|---|
DPO paper | | The original paper introducing Direct Preference Optimization (DPO), providing in-depth details about the algorithm. |
H2O-Danube-1.8B Technical Report | | This technical report presents the H2O-Danube-1.8B model and its applications, including how the model is aligned using DPO. |
Sebastian's newsletter on iterative DPO | Newsletter | Sebastian Raschka's newsletter discusses iterative DPO and its practical implications, offering insights into its effectiveness in real-world scenarios. |
Implementation of the DPO algorithm | GitHub | This GitHub repository provides an implementation of the DPO algorithm, allowing you to directly access and use the codebase in your projects. |
Video explaining DPO | YouTube | This video offers a visual and auditory explanation of DPO, helpful for those who prefer a simplified overview of the concept. |