A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/codefuse-ai/FasterTransformer4CodeFuse below:

codefuse-ai/FasterTransformer4CodeFuse: High-performance LLM inference based on our optimized version of FastTransfomer

FasterTransformer4CodeFuse

Provide high-performance model inference, mainly supporting the CodeFuse model from Ant Group.

Compared to the original FT, this repo has these features:

Batch size: 1

Model CodeFuse 13B Measurements Latency (ms) GPU Single A100 2 * A100 Tensor Parallelism Data Type fp16 int8 fp16 int8 Input/Output Length 16 8 160 195 238 84 64 32 608 369 373 295 256 128 2650 1530 1492 1130 1024 512 10776 7054 6786 5415 Tokens Per Sec 48 75 77 98

We run in the container environment: nvcr.io/nvidia/pytorch:22.09-py3

pip install --no-cache-dir pybind11==2.6.2 transformers accelerate sentencepiece

echo "export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/" >> ~/.bashrc
export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/
mkdir build ; cd build
export TORCH_PYTHON_LIBRARIES=/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so
cmake -DCMAKE_BUILD_TYPE=Release -DSM="80;75" -DBUILD_PYT=ON -DSPARSITY_SUPPORT=OFF -DMEASURE_BUILD_TIME=ON \
      -DBUILD_CUTLASS_MIXED_GEMM=ON -DBUILD_MULTI_GPU=ON -DBUILD_TRT=OFF \
      -DENABLE_FP8=OFF -DBUILD_PYBIND=ON -DTORCH_PYTHON_LIBRARIES=${TORCH_PYTHON_LIBRARIES} ..
make -j"$(grep -c ^processor /proc/cpuinfo)"

You can use examples/pytorch/codefuse/huggingface_convert.py script to convert checkpoint files from HuggingFace to FasterTransformer.

export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/huggingface_convert.py \
       -o ../models/${MODEL_NAME}/fastertransformer \
       -i ../models/${MODEL_NAME}/transformers \
       -infer_gpu_num ${TENSOR_PARA_SIZE} \
       -processes 20 \
       -weight_data_type fp16 \
       -model_name gptneox

You can use examples/pytorch/codefuse/quant_and_save.py script to convert fp16 or fp32 FasterTransformer checkpoint files to int8 files and scales, getting higher model load speed and smaller checkpoint files.

export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2

python ../examples/pytorch/codefuse/quant_and_save.py \
       --in_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
       --out_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8 \
       --lib_path ../build/lib/libth_common.so \
       --tensor_para_size ${TENSOR_PARA_SIZE} \
       --use_gptj_residual \
       --data_type fp16

You can use examples/pytorch/codefuse/codefuse_example.py to run model inference.

export MODEL_NAME=codefuse

# fp16 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu \
       --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
       --ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu_int8 \
       --tokenizer_path ../models/${MODEL_NAME}/transformers \
       --int8_mode 1 \
       --enable_int8_weights 1

# fp16 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu \
         --tokenizer_path ../models/${MODEL_NAME}/transformers

# int8 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
         --world_size 2 \
         --ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu_int8 \
         --tokenizer_path ../models/${MODEL_NAME}/transformers \
         --int8_mode 1 \
         --enable_int8_weights 1

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4