This repo provides high-performance model inference, mainly targeting the CodeFuse model from Ant Group.
Compared to the original FasterTransformer (FT), it adds the following features:
- ✅ Int8 quantization of the CodeFuse model
- ✅ Prompts may end mid-word (no need to end on a complete word)
- ✅ Python API
- ✅ Streaming Output with Python API
- ✅ Faster model loading
- ✅ Assorted bug fixes
Batch size: 1. Model: CodeFuse 13B. Measurement: latency (ms).

| Input Length | Output Length | Single A100, fp16 | Single A100, int8 | 2 × A100 (tensor parallelism), fp16 | 2 × A100 (tensor parallelism), int8 |
|---:|---:|---:|---:|---:|---:|
| 16 | 8 | 160 | 195 | 238 | 84 |
| 64 | 32 | 608 | 369 | 373 | 295 |
| 256 | 128 | 2650 | 1530 | 1492 | 1130 |
| 1024 | 512 | 10776 | 7054 | 6786 | 5415 |
| Tokens per sec | | 48 | 75 | 77 | 98 |
We build and run inside the `nvcr.io/nvidia/pytorch:22.09-py3` container.
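A typical way to launch this container (this `docker run` invocation is not from the original setup notes; the bind-mount path is an assumption, so point it at your own checkout of this repo):

```bash
# Launch the recommended NGC PyTorch container with all GPUs visible.
# The mount target /workspace/FasterTransformer is an assumed path; adjust as needed.
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/FasterTransformer \
    -w /workspace/FasterTransformer \
    nvcr.io/nvidia/pytorch:22.09-py3
```

The commands below are then run inside this container.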
pip install --no-cache-dir pybind11==2.6.2 transformers accelerate sentencepiece
echo "export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/" >> ~/.bashrc
export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/
mkdir -p build && cd build
export TORCH_PYTHON_LIBRARIES=/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so
cmake -DCMAKE_BUILD_TYPE=Release -DSM="80;75" -DBUILD_PYT=ON -DSPARSITY_SUPPORT=OFF -DMEASURE_BUILD_TIME=ON \
-DBUILD_CUTLASS_MIXED_GEMM=ON -DBUILD_MULTI_GPU=ON -DBUILD_TRT=OFF \
-DENABLE_FP8=OFF -DBUILD_PYBIND=ON -DTORCH_PYTHON_LIBRARIES=${TORCH_PYTHON_LIBRARIES} ..
make -j"$(grep -c ^processor /proc/cpuinfo)"
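If the build succeeds, the PyTorch op library that the example scripts load should be present under `build/lib` (the same file later passed as `--lib_path` in the quantization step). A quick, optional sanity check:

```bash
# Run from the build directory: confirm the op library was produced.
# lib/libth_common.so is the library referenced by --lib_path below.
ls -lh lib/libth_common.so
```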
You can use the examples/pytorch/codefuse/huggingface_convert.py
script to convert checkpoint files from the HuggingFace format to the FasterTransformer format.
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2
python ../examples/pytorch/codefuse/huggingface_convert.py \
-o ../models/${MODEL_NAME}/fastertransformer \
-i ../models/${MODEL_NAME}/transformers \
-infer_gpu_num ${TENSOR_PARA_SIZE} \
-processes 20 \
-weight_data_type fp16 \
-model_name gptneox
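The converted checkpoint ends up under `../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu`, which is the path the quantization and inference examples below read from. An optional check (not part of the original instructions) that the conversion produced weight files:

```bash
# List a few of the converted FasterTransformer weight files.
# The directory name matches the --in_dir used in the quantization step below.
ls ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu | head
```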
You can use the examples/pytorch/codefuse/quant_and_save.py
script to convert fp16 or fp32 FasterTransformer checkpoint files to int8 weights plus quantization scales, which gives faster model loading and smaller checkpoint files.
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2
python ../examples/pytorch/codefuse/quant_and_save.py \
--in_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
--out_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8 \
--lib_path ../build/lib/libth_common.so \
--tensor_para_size ${TENSOR_PARA_SIZE} \
--use_gptj_residual \
--data_type fp16
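Since int8 weights take roughly half the space of fp16 weights, the quantized checkpoint directory should be noticeably smaller. One way to compare the two (illustrative only, not part of the original instructions):

```bash
# Compare on-disk size of the fp16 checkpoint and its int8 counterpart.
du -sh ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
       ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8
```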
You can use the examples/pytorch/codefuse/codefuse_example.py
script to run model inference.
export MODEL_NAME=codefuse
# fp16 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu \
--tokenizer_path ../models/${MODEL_NAME}/transformers
# int8 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu_int8 \
--tokenizer_path ../models/${MODEL_NAME}/transformers \
--int8_mode 1 \
--enable_int8_weights 1
# fp16 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
--world_size 2 \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu \
--tokenizer_path ../models/${MODEL_NAME}/transformers
# int8 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
--world_size 2 \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu_int8 \
--tokenizer_path ../models/${MODEL_NAME}/transformers \
--int8_mode 1 \
--enable_int8_weights 1