
Deploy a vLLM Inference Server in Under 5 Minutes

Wollnut Labs Team · March 22, 2025 · 4 min

Why vLLM?

vLLM is one of the fastest open-source LLM inference engines. It uses PagedAttention for efficient KV-cache memory management and supports continuous batching for high throughput.

Quick Deploy

  • Launch an instance with the **vLLM Inference Server** template
  • SSH into your instance
  • Start vLLM:

    python -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
      --port 8000 \
      --tensor-parallel-size 1

  • Query your endpoint (a Python client example follows this list):

    curl http://YOUR_IP:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "prompt": "Explain quantum computing in simple terms:",
        "max_tokens": 200
      }'
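
Because the server exposes an OpenAI-compatible API, you can also query it with the official `openai` Python client instead of curl. A minimal sketch, assuming the `openai` package is installed and YOUR_IP is replaced with your instance's address:

    from openai import OpenAI

    # Point the client at the vLLM server. Any placeholder key works
    # unless the server was started with --api-key.
    client = OpenAI(base_url="http://YOUR_IP:8000/v1", api_key="EMPTY")

    response = client.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        prompt="Explain quantum computing in simple terms:",
        max_tokens=200,
    )
    print(response.choices[0].text)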

Scaling Up

For larger models like DeepSeek R1 (671B), use an H100 8x or H200 4x instance with tensor parallelism, setting `--tensor-parallel-size` to match the GPU count (8 on H100 8x, 4 on H200 4x). For example, on H100 8x:

    python -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-R1 \
      --tensor-parallel-size 8 \
      --port 8000

Performance Tips

  • Use `--quantization awq` or `--quantization gptq` to reduce memory usage
  • Set `--max-model-len` to limit context length and save memory
  • Use `--gpu-memory-utilization 0.95` to maximize GPU memory usage
  • Enable `--enable-chunked-prefill` for better long-context performance; these options can also be set from Python, as shown in the sketch after this list
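
If you want to experiment with these settings before wiring them into a launch command, the same options are exposed as engine arguments on vLLM's offline `LLM` class. A minimal sketch; the parameter values are illustrative, not recommendations:

    from vllm import LLM, SamplingParams

    # The server flags above map to engine arguments; tune the values for your
    # model and GPU rather than copying these illustrative numbers.
    llm = LLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        max_model_len=8192,           # cap context length to save KV-cache memory
        gpu_memory_utilization=0.95,  # allow vLLM to use up to 95% of GPU memory
        enable_chunked_prefill=True,  # chunk long prefills for smoother scheduling
    )

    outputs = llm.generate(
        ["Explain quantum computing in simple terms:"],
        SamplingParams(max_tokens=200),
    )
    print(outputs[0].outputs[0].text)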

Cost

Running a 7B model inference server on H100 1x costs $2.25/hr. For production workloads serving thousands of requests, this fixed hourly rate is significantly cheaper than paying per token with API-based alternatives.
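
To compare this against per-token API pricing, a rough back-of-the-envelope calculation helps. The throughput and API price below are placeholder assumptions, not measurements; substitute your own numbers:

    # Back-of-the-envelope cost comparison. All inputs are illustrative placeholders.
    gpu_cost_per_hour = 2.25       # H100 1x rate quoted above, USD
    tokens_per_second = 2_000      # placeholder: measure your own vLLM throughput
    api_price_per_million = 1.00   # placeholder: your API provider's per-token rate, USD

    tokens_per_hour = tokens_per_second * 3600
    self_hosted_per_million = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)
    print(f"self-hosted: ${self_hosted_per_million:.2f} per million tokens")
    print(f"api:         ${api_price_per_million:.2f} per million tokens")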