Before submitting:
- Check if your issue is listed in known issues.
- vLLM Version: (from `pip show vllm` or git commit hash)
- Hardware Setup:
  - GPU(s): (Make, Model, and Count)
  - Driver Version: (`nvidia-smi` or `rocm-smi` output)
  - Memory: (Host and GPU memory)
- Execution Environment:
  - Docker Image: (Name + Tag)
  - CUDA/ROCm Version:
  - Python Version:
  - Kernel Version: (`uname -a`)
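One way to gather the details listed above on a ROCm machine is sketched below; substitute `nvidia-smi` for `rocm-smi` on NVIDIA hardware. This is only a convenience sketch, not part of the report template.

```bash
# Collect the requested environment details (ROCm example).
pip show vllm                      # vLLM version (or `git rev-parse HEAD` in a source checkout)
rocm-smi --showproductname         # GPU make/model (one entry per GPU)
rocm-smi --showdriverversion       # driver version
free -h                            # host memory
rocm-smi --showmeminfo vram        # GPU memory
python --version                   # Python version
uname -a                           # kernel version
```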
```bash
# Full command from benchmarks/ directory
# Include all parameters and quantization flags
# Example:
python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-2-7b-hf \
    --tensor-parallel-size 2 \
    --dtype half \
    --num-prompts 64 \
    --input-len 1024 \
    --output-len 128
```

```bash
# Any non-default environment variables
# Example:
export VLLM_USE_TRITON_FLASH_ATTN=False
```

| Metric | Good Performance (Image: vllm:old) | Regressed Performance (Image: vllm:new) |
|---|---|---|
| Throughput (tokens/s) | 1250 | 840 |
| Memory Utilization | 78% | 92% |
| GPU Utilization | 95% | 68% |
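For context, the size of the regression can be quantified directly from the throughput numbers in the table above; the values below are just the example figures (1250 and 840 tokens/s), not measurements.

```bash
# Throughput regression from the example table values.
old=1250   # tokens/s with vllm:old
new=840    # tokens/s with vllm:new
awk -v old="$old" -v new="$new" \
    'BEGIN { printf "Throughput drop: %.1f%%\n", (old - new) / old * 100 }'
# Prints: Throughput drop: 32.8%
```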
- Original working Docker image: `docker pull rocm/vllm-dev:main`
- Regression Docker image: `docker pull rocm/vllm-dev:nightly`
- Performance difference persists across multiple runs (see the A/B run sketch below)
- Verified with different input sizes/batch sizes
```bash
# Minimal command that triggers the issue
# Include deployment commands if applicable
python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-2-7b-hf \
    --max-num-seqs 16 \
    --enforce-eager
```

<details>
<summary>Expand for full logs</summary>

```text
[Full plaintext log output]
```

</details>
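To capture a complete plaintext log for the report, one option is to raise vLLM's log level while re-running the minimal command and tee the output to a file; `VLLM_LOGGING_LEVEL` is vLLM's logging-level environment variable, and the rest of the command simply mirrors the example above.

```bash
# Re-run the minimal reproducer with verbose logging and save the output.
VLLM_LOGGING_LEVEL=DEBUG \
python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-2-7b-hf \
    --max-num-seqs 16 \
    --enforce-eager 2>&1 | tee full_run.log
```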
```bash
# Non-default configurations
export VLLM_USE_TRITON_FLASH_ATTN=false
```

- Issue reproduces with `--enforce-eager` mode
- Issue reproduces with different random seeds (see the sweep sketch below)
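A quick way to check both points is to sweep a few seeds with and without eager mode. This assumes `benchmark_latency.py` accepts a `--seed` flag on your vLLM version (verify with `--help`); otherwise set the seed however your setup exposes it.

```bash
# Sweep seeds with and without --enforce-eager (assumed --seed flag; check --help).
for seed in 0 1 2; do
    for eager in "" "--enforce-eager"; do
        python benchmarks/benchmark_latency.py \
            --model meta-llama/Llama-2-7b-hf \
            --max-num-seqs 16 \
            --seed "${seed}" ${eager} \
            2>&1 | tee "latency_seed${seed}${eager:+_eager}.log"
    done
done
```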
- First observed date:
- Frequency: (Always/Intermittent/Specific Conditions)
- Related components: (e.g., FP8 quantization, PagedAttention)
- Custom modifications: (List any code/configuration changes)