FMS Acceleration is designed to accelerate the fine-tuning and training of large models. This framework comprises a collection of libraries intended to be used with the fms-hf-tuning suite.
The fms-acceleration framework provides accelerators for full fine tuning and Parameter Efficient Fine Tuning (PEFT), including:
- Low Rank Adaptation (LoRA) acceleration (coming soon)
- Bits-and-Bytes (BNB) quantised LoRA : QLoRA acceleration
- AutoGPTQ quantised LoRA : GPTQ-LoRA acceleration
- Full Fine Tuning acceleration (coming soon)
- Padding-Free Attention
Our tests show a significant increase in training token throughput using this fms-acceleration framework.
For example:
- QLoRA: 22-43% token throughput increase on 1 GPU as compared to using Hugging Face BNB QLoRA
- QLoRA: Straightforward integration with multiple GPUs as compared to using Hugging Face BNB QLoRA
- GPTQ-LoRA: 22-44% token throughput increase on 1 GPU as compared to using Hugging Face BNB QLoRA
- GPTQ-LoRA: Straightforward integration with multiple GPUs as compared to using Hugging Face BNB QLoRA
The numbers above include results obtained using fused-op-and-kernels; the actual implementation is coming soon (see below).
This package is in BETA and is under development. Expect breaking changes!
Plugin | Description | Depends | License | Status |
---|---|---|---|---|
framework | This acceleration framework for integration with huggingface trainers | | | Alpha |
accelerated-peft | For PEFT-training, e.g., 4bit QLoRA. | Huggingface, AutoGPTQ | Apache 2.0, MIT | Alpha |
fused-op-and-kernels | Fused LoRA and triton kernels (e.g., fast cross-entropy, rms, rope) | -- | Apache 2.0 (contains extracted code) | Beta |
attention-and-distributed-packing | Padding-Free Flash Attention Computation | flash-attn | Apache 2.0 | Beta |
accelerated-moe | Triton Kernels for Mixture-of-Expert parallel, inspired by ScatterMoe and MegaBlocks | -- | Apache 2.0 | Beta |
Below we demonstrate how to accelerate your tuning experience with `tuning/sft_trainer.py` from fms-hf-tuning.

Note: New exciting plugins will be added over time, so please check here for the latest accelerations!
`fms-acceleration` is part of fms-hf-tuning, and instructions to utilize `fms-acceleration` for tuning are found here. In particular, `fms-acceleration` plugins can be accessed via command line arguments to fms-hf-tuning (e.g., `--auto_gptq triton_v2`); this is made available via integrated configuration dataclasses that configure the `AccelerationFramework` for the user.
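For illustration, here is a minimal sketch of such an invocation. Only `--auto_gptq triton_v2` (from above) and `--peft_method lora` are the acceleration-relevant flags; the model, data, and output paths are placeholders, and the remaining argument names follow typical fms-hf-tuning usage and should be checked against its documentation.

```shell
# Sketch only: enabling GPTQ-LoRA acceleration through the integrated dataclass argument.
# <model-path>, <data-path> and <output-dir> are placeholders.
python tuning/sft_trainer.py \
    --model_name_or_path <model-path> \
    --training_data_path <data-path> \
    --output_dir <output-dir> \
    --peft_method lora \
    --auto_gptq triton_v2
```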
As new plugins become available, more command line arguments will be made available to fms-hf-tuning to enable them. However, this kind of integration takes time; plugins that are in development / research stages may not be immediately integrated. Therefore, an intermediary step is required to access plugins in `fms-acceleration` before they become integrated into fms-hf-tuning. In fact, such a method is critical for the benchmarking / testing that needs to happen before integration of any plugin into fms-hf-tuning can even be considered. Hence, we provide a method to configure the acceleration framework via a configuration YAML that is passed into `AccelerationFramework` via an environment variable; the instructions for this are provided below. Furthermore, experienced users can also leverage this to test plugins early, but be warned that the learning curve is steep (since it requires knowledge of how to write such a configuration). To aid with this, the instructions below describe both a basic and an advanced flow.

Note: As mentioned above, the recommended approach for fms-hf-tuning is to use the acceleration config dataclasses. The configuration YAML method documented here is only for testing/research purposes and is not recommended for production. For general use, please refer instead to the instructions here.
Below we illustrate the configuration YAML flow for accelerated quantised PEFT, using the GPTQ-LoRA tuning with the AutoGPTQ `triton_v2` kernel use case; this state-of-the-art kernel was provided by jeromeku in March 2024.
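To give a sense of what such a configuration YAML contains, the sketch below writes a minimal GPTQ-LoRA acceleration config. The field names and nesting here are illustrative assumptions only; in practice, copy a file from sample-configurations verbatim rather than hand-writing one.

```shell
# Illustrative only: the real schema lives in sample-configurations; copy from there.
cat > framework.yaml <<'EOF'
plugins:
  peft:
    quantization:
      auto_gptq:
        # assumed fields: select the AutoGPTQ triton_v2 kernel and load a quantized checkpoint
        kernel: triton_v2
        from_quantized: true
EOF
```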
There is both a basic and an advanced usage of the configuration YAML flow. Most users of fms-hf-tuning only require the basic flow:
- Assumption 1: the user has an already prepared configuration, say from sample-configurations.
- Assumption 2: the user knows exactly what acceleration `plugins` are required (based on the configuration).
- Assumption 3: the arguments for running `sft_trainer.py` are the same, save for one extra argument, `--acceleration_framework_config_file`, used to pass in the acceleration config.
In this case, the basic flow comprises 3 steps:
1. First go to fms-hf-tuning and install the framework library:

   ```shell
   $ pip install -e .[fms-accel]
   ```

   or alternatively install the framework directly:

   ```shell
   $ pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
   ```

   The above installs the command line utility `fms_acceleration.cli`, which is used to install plugins (and also other things like view sample configurations).
2. `install` the required framework plugins; we install the `fms-acceleration-peft` plugin for GPTQ-LoRA tuning with triton v2 as:

   ```shell
   python -m fms_acceleration.cli install fms_acceleration_peft
   ```

   The above is the equivalent of:

   ```shell
   pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft
   ```
3. Run `sft_trainer.py`, providing the acceleration configuration (via the environment variable `ACCELERATION_FRAMEWORK_CONFIG_FILE`) and the arguments; given the basic flow assumptions, we simply re-use the same `sft_trainer.py` arguments as we had without the `fms_acceleration` package:

   ```shell
   # when using sample-configurations, arguments can be referred from
   # defaults.yaml and scenarios.yaml
   ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
   python sft_trainer.py \
       ...  # arguments
   ```
   The framework activates relevant plugins given the framework configuration; for more details see framework/README.md.

   Activate `TRANSFORMERS_VERBOSITY=info` to see the huggingface trainer printouts and verify that `AccelerationFramework` is activated!

   ```
   # this printout will be seen in huggingface trainer logs if acceleration is activated
   ***** FMS AccelerationFramework *****
   Active Plugin: AutoGPTQAccelerationPlugin. Python package: fms_acceleration_peft. Version: 0.0.1.

   ***** Running training *****
   Num examples = 1,549
   Num Epochs = 1
   Instantaneous batch size per device = 4
   Total train batch size (w. parallel, distributed & accumulation) = 4
   Gradient Accumulation steps = 1
   Total optimization steps = 200
   Number of trainable parameters = 13,631,488
   ```
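The throughput comparison above also notes straightforward multi-GPU integration. As a minimal sketch only, assuming the job is launched through Hugging Face `accelerate` (the launcher choice and its flags are assumptions here, not prescribed by fms-acceleration), the same environment variable carries the acceleration config:

```shell
# Sketch (assumptions: accelerate launcher, 2 GPUs); the acceleration config is still
# passed via the environment variable, and the sft_trainer.py arguments are unchanged.
ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
accelerate launch --num_processes 2 \
    sft_trainer.py \
    ...  # arguments
```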
The advanced flow makes further use of `fms_acceleration.cli` to:
- list all available configs and the acceleration plugins they depend on.
- list all available plugins and check which ones are installed.
- identify critical `sft_trainer` arguments required for correct operation of a particular framework config.
The advanced flow comprises 5 steps:
1. Same as Step 1 of the basic flow.

2. Use `fms_acceleration.cli configs` to search for sample configs:

   ```shell
   $ python -m fms_acceleration.cli configs

   1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
   2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']
   ```

   This is equivalent to searching over:
   - the full sample configuration list, which shows the `plugins` required for all available configs.
   - e.g., the Accelerated GPTQ-LoRA configuration here.
3. `install` plugins, same as Step 2 of the basic flow, noting that in addition we can use `plugins` to display all available plugins; this list updates as more plugins get developed. Recall that `configs` lists the required `plugins` for the sample configurations; make sure all of them are installed.

   ```shell
   $ python -m fms_acceleration.cli plugins
   Choose from the list of plugin shortnames, and do:
   * 'python -m fms_acceleration.cli install <pip-install-flags> PLUGIN_NAME'.

   List of PLUGIN_NAME [PLUGIN_SHORTNAME]:

   1. fms_acceleration_peft [peft]
   ```

   After `install`, the list will update to indicate the installed plugins.

4. Get the correct arguments for `sft_trainer.py`:
   - arguments required for correct operation (e.g., if using accelerated peft, then `peft_method` is required).
   - Use `arguments` along with the sample configuration `shortname` to display the relevant critical arguments; these arguments can be manually referred from scenarios.yaml (a combined invocation sketch is given after these steps):

   ```shell
   $ python -m fms_acceleration.cli arguments accelerated-peft-autogptq
   Searching for configuration shortnames: ['accelerated-peft-autogptq']
   1. scenario: accelerated-peft-gptq
      configs: accelerated-peft-autogptq
      arguments:
        --learning_rate 2e-4 \
        --fp16 True \
        --torch_dtype float16 \
        --peft_method lora \
        --r 16 \
        --lora_alpha 16 \
        --lora_dropout 0.0 \
        --target_modules ['q_proj', 'k_proj', 'v_proj', 'o_proj']
   ```
5. More info on `defaults.yaml` and `scenarios.yaml` can be found here.
   - Arguments not critical to the plugins are found in defaults.yaml. These can be set freely.
   - Arguments critical to plugins are found in scenarios.yaml. The relevant section of scenarios.yaml is the one whose `framework_config` entries match the `shortname` of the sample configuration of interest.
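Putting the advanced flow together, the sketch below combines the environment variable from the basic flow with the critical GPTQ-LoRA arguments reported by `fms_acceleration.cli arguments` above. The model, data, and output paths are placeholders, and the space-separated form of `--target_modules` is an assumption about how the list is passed on the command line.

```shell
# Sketch only: framework.yaml is a sample configuration (e.g., the accelerated-peft-autogptq
# sample); <model-path>, <data-path> and <output-dir> are placeholders.
ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
python sft_trainer.py \
    --model_name_or_path <model-path> \
    --training_data_path <data-path> \
    --output_dir <output-dir> \
    --learning_rate 2e-4 \
    --fp16 True \
    --torch_dtype float16 \
    --peft_method lora \
    --r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --target_modules q_proj k_proj v_proj o_proj
```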
This repo requires CUDA to compute the kernels, and it is convenient to use NVIDIA PyTorch containers that already come with CUDA installed. We have tested with the following versions:
- `pytorch:24.01-py3`
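As a convenience sketch, the tested container can be started with Docker as follows; the image path is the standard NGC location for that tag, and the mount and flags should be adjusted to your setup.

```shell
# Sketch: start the tested NGC PyTorch container with GPU access and mount the working directory.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace/fms \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    bash
```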
The benchmarks can be reproduced with the provided scripts.
- includes baseline benchmarks (e.g., standard fine-tuning, standard peft).
- benchmarks for various acceleration sample configs.
See the CSV files below for various results:
For a deeper dive into the details, see framework/README.md.
IBM Research, Singapore
- Fabian Lim flim@sg.ibm.com
- Aaron Chew aaron.chew1@sg.ibm.com
- Laura Wynter lwynter@sg.ibm.com