TensorRT-LLM Notes


With in-flight batching enabled, the client side needs to use inflight_batcher_llm_client.py:

python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir ${HF_LLAMA_MODEL}

bad_words: words that must not appear in the output;

stop_words: generation stops once any of these words has been produced;
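
To illustrate what the two options mean, here is a toy decoder that works on whole words instead of token IDs. It is only a sketch, not how TensorRT-LLM implements them: `toy_generate` and the candidate lists are hypothetical, and keeping the stop word in the output is a choice made for this example.

```python
from typing import Iterable, List, Sequence


def toy_generate(candidates: Iterable[Sequence[str]],
                 bad_words: Sequence[str] = (),
                 stop_words: Sequence[str] = ()) -> List[str]:
    """Toy word-level decoder (hypothetical, not the TensorRT-LLM code path).

    `candidates` yields, for each step, the words the model would pick,
    ordered from most to least likely.
    """
    banned = set(bad_words)
    stops = set(stop_words)
    output: List[str] = []
    for ranked in candidates:
        # bad_words: skip banned words and fall back to the next-best candidate.
        allowed = [w for w in ranked if w not in banned]
        if not allowed:
            break  # nothing left to emit at this step
        word = allowed[0]
        output.append(word)
        # stop_words: end generation as soon as a stop word has been produced.
        if word in stops:
            break
    return output


steps = [["hello", "hi"], ["darn", "dear"], ["world", "all"], ["END", "."], ["ignored"]]
result = toy_generate(steps, bad_words=["darn"], stop_words=["END"])
print(result)  # ['hello', 'dear', 'world', 'END']
```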

Commonly used engine build parameters:

--gpt_attention_plugin float16

--gemm_plugin float16

--context_fmha enable

--kv_cache_type paged: presumably enables the paged KV cache (unverified)

Best Practices for Tuning the Performance of TensorRT-LLM — tensorrt_llm documentation

`max_batch_size`, `max_seq_len` and `max_num_tokens`

--multiple_profiles: lets TensorRT-LLM build multiple optimization profiles and automatically pick the best-performing one;

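1. Enabled by default: --gpt_attention_plugin: updates the KV cache in place, which lowers GPU memory usage and avoids memory copies;

2. Enabled by default: --context_fmha: whether attention uses a fused kernel; short sequences use the vanilla implementation, long sequences use FlashAttention and FlashAttention-2 (see the official documentation);

3. Enabled by default: --remove_input_padding: input sequences are no longer padded at the end (my guess: this exists to support in-flight batching);

4. Enabled by default: --paged_kv_cache: paged attention;

5. Enabled by default: in-flight batching; it turns on automatically when items 1, 3, and 4 are all enabled. As far as I understand, it puts context-phase sequences and generation-phase sequences in the same batch and interleaves their computation.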
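
The interleaving in item 5 can be pictured with a toy scheduler. This is a sketch of the general idea only, not TensorRT-LLM's batch manager: `Request`, `simulate`, and the request sizes are made up for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    output_len: int        # tokens to generate after the context (prefill) step
    generated: int = 0
    prefilled: bool = False


def simulate(requests: List[Request], max_batch_size: int) -> List[List[Tuple[str, str]]]:
    """Return, per iteration, the (request, phase) pairs scheduled together."""
    waiting = deque(requests)
    active: List[Request] = []
    timeline = []
    while waiting or active:
        # Admit new requests as soon as a slot is free, without waiting
        # for the rest of the batch to finish.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        step = []
        for req in active:
            if not req.prefilled:
                req.prefilled = True       # context phase: process the whole prompt
                step.append((req.name, "context"))
            else:
                req.generated += 1         # generation phase: emit one token
                step.append((req.name, "generation"))
        timeline.append(step)
        # Finished requests leave the batch immediately.
        active = [r for r in active if r.generated < r.output_len]
    return timeline


timeline = simulate([Request("A", 1), Request("B", 3), Request("C", 1)], max_batch_size=2)
for i, step in enumerate(timeline):
    print(i, step)
```

In the third iteration, request C runs its context phase in the same batch in which B is still generating, which is the mixing described above.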

--reduce_fusion enable: fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel; there seems to be a prerequisite: "when the custom AllReduce is already enabled"

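Sharded embedding storage (see the official example): enable two or three of these options together: --use_parallel_embedding, --embedding_sharding_dim, and --use_embedding_sharing. "--embedding_sharding_dim 0" shards along the vocab dimension; "--embedding_sharding_dim 1" shards along the hidden dimension. (Best Practices says the convert_checkpoint.py stage also needs --use_embedding_sharing, and the trtllm-build stage also needs --lookup_plugin and --gemm_plugin.)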

# 2-way tensor parallelism with embedding parallelism along hidden dimension
python3 convert_checkpoint.py --model_dir gpt2 \
        --dtype float16 \
        --tp_size 2 \
        --use_parallel_embedding \
        --embedding_sharding_dim 1 \
        --output_dir gpt2/trt_ckpt/fp16/2-gpu

trtllm-build --checkpoint_dir gpt2/trt_ckpt/fp16/2-gpu \
        --output_dir gpt2/trt_engines/fp16/2-gpu
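
To make the two sharding dimensions concrete, this small sketch computes the per-GPU shape of a [vocab_size, hidden_size] embedding table. `embedding_shard_shape` is a hypothetical helper, the sizes are Llama-7B-like values chosen because they divide evenly, and padding of uneven vocab sizes is not modelled.

```python
from typing import Tuple


def embedding_shard_shape(vocab_size: int, hidden_size: int,
                          tp_size: int, sharding_dim: int) -> Tuple[int, int]:
    """Per-GPU shape of a [vocab_size, hidden_size] embedding table."""
    if sharding_dim == 0:
        # Split rows: each rank holds a slice of the vocabulary.
        if vocab_size % tp_size:
            raise ValueError("this sketch assumes vocab_size divides evenly")
        return (vocab_size // tp_size, hidden_size)
    if sharding_dim == 1:
        # Split columns: each rank holds a slice of every embedding vector.
        if hidden_size % tp_size:
            raise ValueError("this sketch assumes hidden_size divides evenly")
        return (vocab_size, hidden_size // tp_size)
    raise ValueError("sharding_dim must be 0 or 1")


by_vocab = embedding_shard_shape(32000, 4096, tp_size=2, sharding_dim=0)
by_hidden = embedding_shard_shape(32000, 4096, tp_size=2, sharding_dim=1)
print(by_vocab)   # (16000, 4096)
print(by_hidden)  # (32000, 2048)
```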

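--use_fused_mlp (disabled by default): fuses the two MLP layers and the activation layer into one kernel. With FP8 it may affect accuracy, so try turning it off if accuracy drops (to turn it on: --use_fused_mlp=enable --gemm_swiglu_plugin fp8);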

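--gemm_plugin (disabled by default): runs GEMM computations through NVIDIA cuBLASLt. Recommended for FP16 and BF16, where it speeds up computation and reduces GPU memory usage. For FP8: recommended on for small batches and off for large batches;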

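KV cache memory usage: --kv_cache_free_gpu_mem_fraction ranges from 0.0 to 1.0 and sets the fraction of free GPU memory used for the KV cache. The default is 0.9; it can be raised to 0.95, but should not be set to 1.0, because some memory must be left for inputs and outputs. --max_tokens_in_paged_kv_cache sets the number of tokens the KV cache can hold; setting only the fraction parameter is recommended, since the token count is then computed automatically;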
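
As a back-of-the-envelope check on how a memory fraction turns into a token count, the sketch below uses the usual KV cache size formula (keys plus values, for every layer and KV head). The helper names and the 60 GiB free-memory figure are assumptions for illustration; the real runtime also accounts for block granularity and other buffers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_kv_tokens(free_gpu_bytes: int, fraction: float, bytes_per_token: int) -> int:
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0.0, 1.0]")
    return int(free_gpu_bytes * fraction) // bytes_per_token


# Llama-7B-like shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Hypothetical 60 GiB of free GPU memory with the default fraction of 0.9.
print(max_kv_tokens(60 * 2**30, 0.9, per_token))
```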

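--max_attention_window_size: trades output quality for performance. The KV cache keeps at most this many tokens; if input tokens plus output tokens exceed this number, only the most recent tokens (up to this limit) take part in attention, and the memory held by the earliest tokens is freed;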
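
A minimal sketch of which token positions stay visible under such a window. `tokens_in_attention_window` is a hypothetical helper, and this models only the plain "keep the last N tokens" behaviour described above.

```python
def tokens_in_attention_window(total_tokens: int, max_attention_window_size: int) -> range:
    """0-based positions of the tokens still held in the KV cache."""
    start = max(0, total_tokens - max_attention_window_size)
    return range(start, total_tokens)


# 10 tokens so far (input + output), window of 4: only the last 4 are kept.
kept = list(tokens_in_attention_window(10, 4))
print(kept)  # [6, 7, 8, 9]

# Shorter than the window: nothing is dropped.
print(list(tokens_in_attention_window(3, 4)))  # [0, 1, 2]
```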

LLaMA Examples

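--use_paged_context_fmha: compared with --context_fmha (FlashAttention), this targets long contexts. During prefill the input context can be split into multiple chunks, each processed separately, which shrinks activation memory and allows longer inputs. It must be used together with --max_num_tokens, which sets the maximum number of tokens per chunk (my guess: an input longer than this is cut into multiple chunks and prefilled over several forward passes).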

python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3-8B-1048k/trt_ckpts \
            --output_dir /tmp/llama-3-8B-1048k/trt_engines \
            --gemm_plugin float16 \
            --max_num_tokens 4096 \
            --max_input_len 131072 \
            --max_seq_len 131082 \
            --use_paged_context_fmha enable

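Note: max_input_len sets the maximum input length, max_num_tokens sets the chunk size used to split the input, and max_seq_len - max_input_len gives the maximum output length. (Setting --max_batch_size to 1 reduces GPU memory usage further.)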
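
Plugging the numbers from the build command into this reading of the parameters gives the following. `chunked_prefill_plan` is a hypothetical helper, and the chunk count relies on the guess above that a long input is split into max_num_tokens-sized pieces.

```python
from typing import Tuple


def chunked_prefill_plan(input_len: int, max_num_tokens: int,
                         max_input_len: int, max_seq_len: int) -> Tuple[int, int]:
    """Return (number of prefill chunks, output budget for a max-length input)."""
    if input_len > max_input_len:
        raise ValueError("input exceeds max_input_len")
    num_chunks = -(-input_len // max_num_tokens)   # integer ceiling division
    output_budget = max_seq_len - max_input_len
    return num_chunks, output_budget


plan = chunked_prefill_plan(input_len=131072, max_num_tokens=4096,
                            max_input_len=131072, max_seq_len=131082)
print(plan)  # (32, 10)
```

So a full-length input would take 32 prefill chunks and leave room for only 10 output tokens with these settings.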

int8 kv cache + per-channel weight-only quantization:

# Build model with both INT8 weight-only and INT8 KV cache enabled
# --int8_kv_cache: INT8 KV cache (KV)
# --use_weight_only, --weight_only_precision int8: INT8 weight-only quantization (W)
python convert_checkpoint.py --model_dir ./llama-models/llama-7b-hf   \
                             --output_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
                             --dtype float16  \
                             --int8_kv_cache \
                             --use_weight_only \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int8_kv_wq \
            --output_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_weight_only/1-gpu \
            --gemm_plugin auto  # disabled by default!

int8 kv cache + int4 awq weight quantization:

(Note: this uses quantize.py, not convert_checkpoint.py.)

# --qformat int4_awq, --awq_block_size 128: INT4 AWQ weight quantization (AWQ)
# --kv_cache_dtype int8: INT8 KV cache (KV)
python ../quantization/quantize.py --model_dir /tmp/llama-7b-hf \
                                   --output_dir ./tllm_checkpoint_1gpu_awq_int8_kv_cache \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --kv_cache_dtype int8 \
                                   --calib_size 32

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_awq_int8_kv_cache \
            --output_dir ./tmp/llama/7B/trt_engines/int8_kv_cache_int4_AWQ/1-gpu/ \
            --gemm_plugin auto

SmoothQuant: (INT8-Activation * INT8-Weight)

(--per_token and --per_channel trade space for accuracy; zero, one, or both may be enabled.)

# Build model for SmoothQuant in the _per_token_ + _per_channel_ mode
# --per_token and --per_channel are optional
python3 convert_checkpoint.py --model_dir /llama-models/llama-7b-hf \
                            --output_dir /tmp/tllm_checkpoint_1gpu_sq \
                            --dtype float16 \
                            --smoothquant 0.5 \
                            --per_token \
                            --per_channel

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_1gpu_sq \
             --output_dir ./engine_outputs \
             --gemm_plugin auto
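
The "space for accuracy" trade-off behind --per_token and --per_channel can be seen in a tiny symmetric INT8 quantization example: one scale per row costs more storage than one scale per tensor, but small-magnitude rows keep their precision. This is a generic sketch, not the SmoothQuant code path in TensorRT-LLM; `quantize_int8`, `total_abs_error`, and the sample values are made up. A row here stands for a token (activations) or a channel (weights).

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_int8(rows: Matrix, per_row: bool) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization; returns (quantized rows, one scale per row)."""
    if per_row:
        # One scale per row; fall back to 1.0 for an all-zero row.
        scales = [max(abs(v) for v in row) / 127 or 1.0 for row in rows]
    else:
        # A single scale shared by the whole tensor.
        shared = max(abs(v) for row in rows for v in row) / 127 or 1.0
        scales = [shared] * len(rows)
    quantized = [[round(v / s) for v in row] for row, s in zip(rows, scales)]
    return quantized, scales


def total_abs_error(rows: Matrix, quantized: List[List[int]], scales: List[float]) -> float:
    """Sum of |original - dequantized| over all elements."""
    return sum(abs(v - q * s)
               for row, qrow, s in zip(rows, quantized, scales)
               for v, q in zip(row, qrow))


data = [[0.01, -0.02], [10.0, -5.0]]   # one small-magnitude row, one large
q_tensor, s_tensor = quantize_int8(data, per_row=False)
q_row, s_row = quantize_int8(data, per_row=True)
err_tensor = total_abs_error(data, q_tensor, s_tensor)
err_row = total_abs_error(data, q_row, s_row)
print(len(set(s_tensor)), "distinct scale, error", err_tensor)
print(len(set(s_row)), "distinct scales, error", err_row)
```

With a single shared scale the small row rounds to zero, while per-row scales preserve it at the cost of storing one extra scale per row.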