Best VPS for Whisper 2026: Self-Host Speech-to-Text

Find the best VPS for running OpenAI Whisper. Compare GPU and CPU options for self-hosted speech-to-text transcription on your own server.


Best VPS for Whisper in 2026

Want to transcribe audio without sending it to third-party APIs? OpenAI’s Whisper runs entirely on your own server — giving you unlimited, private speech-to-text. Here’s what VPS specs you actually need.

What is Whisper?

Whisper is OpenAI’s open-source speech recognition model. It handles transcription, translation to English, and automatic language detection across 99+ languages, all from a single command:

whisper audio.mp3 --model medium --language en

Why self-host Whisper?

Three reasons keep coming up: privacy (audio never leaves your server), unlimited volume (no per-minute billing), and cost (a flat-rate VPS beats API pricing past a few dozen hours a month).

VPS Requirements for Whisper

Whisper’s resource needs depend on model size and whether you use GPU acceleration.

Minimum (CPU-only, small model): 2+ vCPU and 4GB RAM

Optimal (GPU acceleration): NVIDIA GPU with 10GB+ VRAM and 16GB system RAM, enough for large-v3

Whisper Model Sizes

Pick based on your available resources:

| Model | Size | Min VRAM | Min RAM (CPU) | Relative Speed | Accuracy |
|----------|-------|------|------|-----|--------|
| tiny | 75MB | 1GB | 2GB | 32x | Basic |
| base | 142MB | 1GB | 2GB | 16x | Good |
| small | 466MB | 2GB | 4GB | 6x | Better |
| medium | 1.5GB | 5GB | 8GB | 2x | Great |
| large-v3 | 3.1GB | 10GB | 16GB | 1x | Best |

Tip: The medium model hits the sweet spot — 95%+ accuracy with reasonable speed. Use large-v3 only when accuracy is critical.
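To make the table concrete, here is a small hypothetical helper (not part of Whisper itself) that picks the most accurate model fitting the memory you have, using the minimums from the table above:

```python
# Hypothetical helper: pick the most accurate Whisper model that fits
# available memory, using the minimums from the table above.
# Each entry is (min_vram_gb, min_cpu_ram_gb); dict order = accuracy order.
REQUIREMENTS = {
    "tiny":     (1, 2),
    "base":     (1, 2),
    "small":    (2, 4),
    "medium":   (5, 8),
    "large-v3": (10, 16),
}

def pick_model(ram_gb: float, vram_gb: float = 0) -> str:
    """Return the largest model that fits in GPU VRAM or CPU RAM."""
    best = "tiny"
    for model, (min_vram, min_ram) in REQUIREMENTS.items():
        if vram_gb >= min_vram or ram_gb >= min_ram:
            best = model  # later entries are more accurate
    return best

print(pick_model(ram_gb=8))               # an 8GB CPU box -> medium
print(pick_model(ram_gb=16, vram_gb=16))  # a 16GB GPU instance -> large-v3
```

Swap in your own server specs to see where you land before renting anything.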

Best VPS for Whisper (CPU)

CPU transcription works fine for batch jobs and occasional use. Expect roughly real-time speed with the small model (1 hour of audio ≈ 1 hour of processing).

1. Hetzner CPX41 (Best Value)

€14.99/mo | 8 vCPU (AMD EPYC), 16GB RAM, 160GB NVMe

Handles the medium model comfortably. AMD EPYC processors offer strong AVX2 performance, which Whisper relies on heavily.

Performance: ~1x real-time with the medium model, ~3x with small

2. Hostinger KVM8 (Budget Pick)

$19.99/mo | 8 vCPU, 16GB RAM, 200GB NVMe

Good specs at a fair price. The 200GB storage is handy if you’re processing lots of audio files.

3. Contabo VPS XL (Most RAM)

€13.99/mo | 8 vCPU, 30GB RAM, 400GB SSD

If you want to run large-v3 on CPU, you need 16GB+ RAM. Contabo’s generous memory allocation makes this possible at budget pricing.

Best GPU VPS for Whisper

GPU acceleration makes Whisper 10-30x faster. A 1-hour podcast transcribes in 2-5 minutes.
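The arithmetic behind those estimates is simple; this quick sketch converts the "Nx real-time" figures quoted in this guide into wall-clock time:

```python
def processing_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Wall-clock processing time for a given 'Nx real-time' speed factor."""
    return audio_minutes / speed_factor

# A 1-hour podcast at the speeds quoted in this guide:
for label, factor in [("CPU medium (~1x)", 1),
                      ("CPU small (~3x)", 3),
                      ("GPU large-v3 (~15x)", 15)]:
    print(f"{label}: {processing_minutes(60, factor):.0f} min")
```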

1. Vultr Cloud GPU (Best Availability)

$90/mo | NVIDIA A16 (16GB VRAM), 6 vCPU, 16GB RAM

Runs every Whisper model including large-v3. Always available — no spot instance headaches.

Performance: ~10-15x real-time with large-v3

2. Hetzner Dedicated GPU (Best Monthly Rate)

€179/mo | NVIDIA RTX 4000 (8GB VRAM), 8 cores, 64GB RAM

Best value for 24/7 transcription workloads. Runs medium and small models at blazing speed.

3. RunPod (Cheapest for Batch Jobs)

$0.20/hr | NVIDIA RTX 4090 (24GB VRAM)

Spin up when you have files to process, shut down when done. Perfect for occasional bulk transcription.

4. Lambda Labs (Heavy Workloads)

$0.50/hr (~$360/mo) | NVIDIA A10 (24GB VRAM)

For production transcription pipelines processing thousands of hours monthly.

Complete Setup Guide

Step 1: Create Your VPS

We’ll use Hetzner CPX41 for this guide:

  1. Sign up at Hetzner Cloud
  2. Create server → Ubuntu 22.04 → CPX41
  3. Add your SSH key
  4. Note the IP address

Step 2: Install Whisper

ssh root@your-server-ip

# Install dependencies
apt update && apt install -y python3-pip ffmpeg

# Install Whisper
pip3 install openai-whisper

Step 3: Transcribe Your First File

# Basic transcription
whisper recording.mp3 --model medium

# With language detection
whisper recording.mp3 --model medium --task transcribe

# Translate to English
whisper foreign_audio.mp3 --model medium --task translate

# Output subtitles
whisper video.mp4 --model medium --output_format srt
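If you script these commands from Python, a small command-builder keeps the flags in one place. This is a hedged sketch (the `build_whisper_cmd` helper is hypothetical, not part of Whisper), using only the CLI flags shown above:

```python
import subprocess

def build_whisper_cmd(audio_path, model="medium", task="transcribe",
                      language=None, output_format=None):
    """Assemble a whisper CLI invocation like the examples above."""
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    return cmd

cmd = build_whisper_cmd("video.mp4", output_format="srt")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment on a server with whisper installed
```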

Step 4: Switch to faster-whisper (Recommended)

faster-whisper uses CTranslate2 and is roughly 4x faster than standard Whisper with lower memory usage:

pip3 install faster-whisper

python3 << 'EOF'
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("recording.mp3")

print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
EOF

Why faster-whisper? Same model weights and transcription quality, roughly 4x the speed, and int8 quantization cuts memory use enough to run medium comfortably on an 8GB server.

Step 5: Set Up as API Service

Create a simple transcription API with FastAPI:

pip3 install fastapi uvicorn python-multipart faster-whisper

# transcription_api.py
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import tempfile, os

app = FastAPI()
model = WhisperModel("medium", device="cpu", compute_type="int8")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    segments, info = model.transcribe(tmp_path)
    text = " ".join(s.text for s in segments)
    os.unlink(tmp_path)

    return {
        "language": info.language,
        "text": text.strip()
    }

Run the server:

uvicorn transcription_api:app --host 0.0.0.0 --port 8000

Send files to your API:

curl -X POST http://your-server-ip:8000/transcribe \
  -F "file=@recording.mp3"

Step 6: Docker Setup (Alternative)

docker run -d -p 8000:8000 \
  --name whisper \
  -v whisper-models:/root/.cache \
  onerahmet/openai-whisper-asr-webservice:latest

This gives you a ready-made REST API with Swagger docs at http://your-server-ip:8000/docs.

Performance Optimization

1. Use faster-whisper with int8

# CPU — int8 quantization (fastest)
model = WhisperModel("medium", device="cpu", compute_type="int8")

# GPU — float16 (best quality/speed balance)
model = WhisperModel("medium", device="cuda", compute_type="float16")

2. Batch Processing Script

#!/bin/bash
# transcribe_all.sh — process all audio files in a directory
INPUT_DIR="./audio"
OUTPUT_DIR="./transcripts"
mkdir -p "$OUTPUT_DIR"

for file in "$INPUT_DIR"/*.{mp3,wav,m4a,flac}; do
    [ -f "$file" ] || continue
    filename=$(basename "$file" | sed 's/\.[^.]*$//')
    echo "Processing: $file"
    whisper "$file" --model medium --output_dir "$OUTPUT_DIR" --output_format txt
done
echo "Done! Transcripts in $OUTPUT_DIR"
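The same batch job can be written in Python, which makes the file matching easier to test. A hedged sketch (the helper names are my own; it shells out to the same `whisper` CLI used above):

```python
import pathlib
import subprocess

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def collect_audio(directory):
    """Find audio files the same way the shell glob above does."""
    root = pathlib.Path(directory)
    return sorted(p for p in root.iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTS)

def transcribe_all(input_dir="./audio", output_dir="./transcripts"):
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)
    for audio in collect_audio(input_dir):
        print(f"Processing: {audio}")
        subprocess.run(["whisper", str(audio), "--model", "medium",
                        "--output_dir", output_dir, "--output_format", "txt"],
                       check=True)

if __name__ == "__main__" and pathlib.Path("./audio").is_dir():
    transcribe_all()
```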

3. Enable Swap for Large Models

fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab

4. Use VAD (Voice Activity Detection)

Skip silence to speed up processing:

segments, info = model.transcribe(
    "recording.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

This can speed up transcription by 2-3x on recordings with lots of silence or pauses.
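As a back-of-envelope check (an idealized model, assuming VAD skips silent stretches entirely), the speedup follows directly from the fraction of the recording that is silence:

```python
def vad_speedup(silence_fraction: float) -> float:
    """Idealized speedup if VAD skips the silent portion entirely."""
    return 1.0 / (1.0 - silence_fraction)

# A recording that is half silence roughly halves the work:
print(f"{vad_speedup(0.5):.1f}x")
print(f"{vad_speedup(2/3):.1f}x")
```

So the 2-3x figure corresponds to recordings that are roughly half to two-thirds silence.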

Cost Comparison: VPS vs APIs

| Option | Pricing | Cost for 100 hrs of Audio |
|--------|---------|---------------------------|
| OpenAI Whisper API | $0.006/min | $36 |
| Google Speech-to-Text | $0.006/min | $36 |
| AWS Transcribe | $0.024/min | $144 |
| Hetzner VPS + Whisper | €15/mo | €15 (unlimited) |
| Vultr GPU + Whisper | $90/mo | $90 (unlimited) |

Self-hosting breaks even at roughly 40 hours/month on Hetzner, or 250 hours/month on Vultr GPU. After that, every hour is free.
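The break-even math is worth sanity-checking yourself; this sketch uses the figures from the table above (and treats EUR ≈ USD for a rough estimate):

```python
def break_even_hours(vps_monthly: float, api_per_minute: float) -> float:
    """Audio hours/month where a flat-rate VPS beats per-minute API pricing."""
    return vps_monthly / (api_per_minute * 60)

# Figures from the cost table above
print(f"Hetzner vs Whisper API:   {break_even_hours(15, 0.006):.0f} hrs/mo")
print(f"Vultr GPU vs Whisper API: {break_even_hours(90, 0.006):.0f} hrs/mo")
```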

Use Cases

Podcast Transcription

Run large-v3 for best accuracy. A 1-hour episode takes ~5 min on GPU, ~1 hour on CPU.

Meeting Notes

Combine Whisper with WhisperX for speaker diarization:

pip3 install whisperx

python3 << 'EOF'
import whisperx

model = whisperx.load_model('medium', 'cpu', compute_type='int8')
result = model.transcribe('meeting.mp3')

# Speaker diarization needs a (free) Hugging Face token for the pyannote models
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_HF_TOKEN', device='cpu')
result = whisperx.assign_word_speakers(diarize_model('meeting.mp3'), result)
EOF

Subtitle Generation

whisper video.mp4 --model medium --output_format srt --word_timestamps True

Voice Note Processing

Build a Telegram bot or webhook that auto-transcribes voice messages.

FAQ

Can I run Whisper on 2GB RAM?

Yes, with the tiny or base model. Accuracy is lower but fine for clear English audio.

Is GPU required?

No. CPU works perfectly for batch processing where speed isn’t critical. Use faster-whisper with int8 for best CPU performance.

Which model should I use?

medium for most use cases. large-v3 if accuracy is critical (legal, medical). small if speed matters more than perfect accuracy.

Can Whisper handle multiple languages?

Yes. It auto-detects language and can transcribe 99+ languages. Translation to English is built in.

How accurate is Whisper?

The large-v3 model approaches human-level accuracy, with a word error rate of roughly 2-5% on clean audio (i.e. ~95-98% of words correct). medium is close behind at ~93-96% accuracy.

Recommendations at a Glance

| Use Case | VPS | Cost | Model | Speed |
|----------|-----|------|-------|-------|
| Occasional Use | Hetzner CPX21 | €8/mo | small | ~3x real-time |
| Daily Transcription | Hetzner CPX41 | €15/mo | medium | ~1x real-time |
| Fast Processing | Vultr GPU | $90/mo | large-v3 | ~15x real-time |
| Bulk/Production | Lambda A10 | $360/mo | large-v3 | ~20x real-time |

For most users, Hetzner CPX41 at €15/mo with faster-whisper and the medium model is the sweet spot. Accurate enough for real work, affordable enough to leave running.


Ready to get started?

Get the best VPS hosting deal today. Hostinger offers 4GB RAM VPS starting at just $4.99/mo.

Get Hostinger VPS — $4.99/mo

// up to 75% off + free domain included


Andrius Putna

I am Andrius Putna. Geek. Since the early 2000s I have loved tinkering with web technologies; now it's AI. I bridge business and technology to drive meaningful impact, combining expertise in customer experience, technology, and business strategy. Father, open-source contributor, investor, 2x Ironman, MBA graduate.

// last updated: March 6, 2026. Disclosure: This article may contain affiliate links.