
Deploy a vLLM Inference Server in Under 5 Minutes

Wollnut Labs Team · March 22, 2025 · 4 min

Why vLLM?

vLLM is one of the fastest open-source LLM inference engines. It uses PagedAttention for efficient KV-cache memory management and supports continuous batching for high throughput.

Quick Deploy

  • Launch an instance with the **vLLM Inference Server** template
  • SSH into your instance
  • Start vLLM:

    python -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
      --port 8000 \
      --tensor-parallel-size 1

  • Query your endpoint (a Python client example follows this list):

    curl http://YOUR_IP:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "prompt": "Explain quantum computing in simple terms:",
        "max_tokens": 200
      }'
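
Because the server exposes an OpenAI-compatible API, you can also query it with the official `openai` Python client instead of curl. A minimal sketch, assuming the `openai` package is installed and YOUR_IP is replaced with your instance's address:

    from openai import OpenAI

    # Point the client at the vLLM server. Any placeholder key works
    # unless the server was started with --api-key.
    client = OpenAI(base_url="http://YOUR_IP:8000/v1", api_key="EMPTY")

    response = client.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        prompt="Explain quantum computing in simple terms:",
        max_tokens=200,
    )
    print(response.choices[0].text)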

Scaling Up

For larger models like DeepSeek R1 (671B), use an H100 8x or H200 4x instance with tensor parallelism, setting `--tensor-parallel-size` to match the GPU count (8 on H100 8x, 4 on H200 4x). For example, on H100 8x:

    python -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-R1 \
      --tensor-parallel-size 8 \
      --port 8000

Performance Tips

  • Use `--quantization awq` or `--quantization gptq` to reduce memory usage
  • Set `--max-model-len` to limit context length and save memory
  • Use `--gpu-memory-utilization 0.95` to maximize GPU memory usage
  • Enable `--enable-chunked-prefill` for better long-context performance; these options can also be set from Python, as shown in the sketch after this list
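
If you want to experiment with these settings before wiring them into a launch command, the same options are exposed as engine arguments on vLLM's offline `LLM` class. A minimal sketch; the parameter values are illustrative, not recommendations:

    from vllm import LLM, SamplingParams

    # The server flags above map to engine arguments; tune the values for your
    # model and GPU rather than copying these illustrative numbers.
    llm = LLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        max_model_len=8192,           # cap context length to save KV-cache memory
        gpu_memory_utilization=0.95,  # allow vLLM to use up to 95% of GPU memory
        enable_chunked_prefill=True,  # chunk long prefills for smoother scheduling
    )

    outputs = llm.generate(
        ["Explain quantum computing in simple terms:"],
        SamplingParams(max_tokens=200),
    )
    print(outputs[0].outputs[0].text)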

Cost

Running a 7B model inference server on H100 1x costs $2.25/hr. For production workloads serving thousands of requests, this fixed hourly rate is significantly cheaper than paying per token with API-based alternatives.
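
To compare this against per-token API pricing, a rough back-of-the-envelope calculation helps. The throughput and API price below are placeholder assumptions, not measurements; substitute your own numbers:

    # Back-of-the-envelope cost comparison. All inputs are illustrative placeholders.
    gpu_cost_per_hour = 2.25       # H100 1x rate quoted above, USD
    tokens_per_second = 2_000      # placeholder: measure your own vLLM throughput
    api_price_per_million = 1.00   # placeholder: your API provider's per-token rate, USD

    tokens_per_hour = tokens_per_second * 3600
    self_hosted_per_million = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)
    print(f"self-hosted: ${self_hosted_per_million:.2f} per million tokens")
    print(f"api:         ${api_price_per_million:.2f} per million tokens")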