Best VPS for Whisper 2026: Self-Host Speech-to-Text

Find the best VPS for running OpenAI Whisper. Compare GPU and CPU options for self-hosted speech-to-text transcription on your own server.


Best VPS for Whisper in 2026

Want to transcribe audio without sending it to third-party APIs? OpenAI’s Whisper runs entirely on your own server — giving you unlimited, private speech-to-text. Here’s what VPS specs you actually need.

What is Whisper?

Whisper is OpenAI’s open-source speech recognition model. It handles transcription, translation to English, and automatic language detection across 99+ languages, all from a single command:

whisper audio.mp3 --model medium --language en

Why self-host Whisper?

Three reasons keep coming up: privacy (audio never leaves your server), unlimited volume (no per-minute billing), and cost (a flat-rate VPS beats API pricing past a few dozen hours a month).

VPS Requirements for Whisper

Whisper’s resource needs depend on model size and whether you use GPU acceleration.

Minimum (CPU-only, small model): 2+ vCPU and 4GB RAM

Optimal (GPU acceleration): NVIDIA GPU with 10GB+ VRAM and 16GB system RAM, enough for large-v3

Whisper Model Sizes

Pick based on your available resources:

| Model | Size | Min VRAM | Min RAM (CPU) | Relative Speed | Accuracy |
|----------|-------|------|------|-----|--------|
| tiny | 75MB | 1GB | 2GB | 32x | Basic |
| base | 142MB | 1GB | 2GB | 16x | Good |
| small | 466MB | 2GB | 4GB | 6x | Better |
| medium | 1.5GB | 5GB | 8GB | 2x | Great |
| large-v3 | 3.1GB | 10GB | 16GB | 1x | Best |

Tip: The medium model hits the sweet spot — 95%+ accuracy with reasonable speed. Use large-v3 only when accuracy is critical.
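To make the table concrete, here is a small hypothetical helper (not part of Whisper itself) that picks the most accurate model fitting the memory you have, using the minimums from the table above:

```python
# Hypothetical helper: pick the most accurate Whisper model that fits
# available memory, using the minimums from the table above.
# Each entry is (min_vram_gb, min_cpu_ram_gb); dict order = accuracy order.
REQUIREMENTS = {
    "tiny":     (1, 2),
    "base":     (1, 2),
    "small":    (2, 4),
    "medium":   (5, 8),
    "large-v3": (10, 16),
}

def pick_model(ram_gb: float, vram_gb: float = 0) -> str:
    """Return the largest model that fits in GPU VRAM or CPU RAM."""
    best = "tiny"
    for model, (min_vram, min_ram) in REQUIREMENTS.items():
        if vram_gb >= min_vram or ram_gb >= min_ram:
            best = model  # later entries are more accurate
    return best

print(pick_model(ram_gb=8))               # an 8GB CPU box -> medium
print(pick_model(ram_gb=16, vram_gb=16))  # a 16GB GPU instance -> large-v3
```

Swap in your own server specs to see where you land before renting anything.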

Best VPS for Whisper (CPU)

CPU transcription works fine for batch jobs and occasional use. Expect roughly real-time speed with the small model (1 hour of audio ≈ 1 hour of processing).

1. Hetzner CPX41 (Best Value)

€14.99/mo | 8 vCPU (AMD EPYC), 16GB RAM, 160GB NVMe

Handles the medium model comfortably. AMD EPYC processors offer strong AVX2 performance, which Whisper relies on heavily.

Performance: ~1x real-time with the medium model, ~3x with small

2. Hostinger KVM8 (Budget Pick)

$19.99/mo | 8 vCPU, 16GB RAM, 200GB NVMe

Good specs at a fair price. The 200GB storage is handy if you’re processing lots of audio files.

3. Contabo VPS XL (Most RAM)

€13.99/mo | 8 vCPU, 30GB RAM, 400GB SSD

If you want to run large-v3 on CPU, you need 16GB+ RAM. Contabo’s generous memory allocation makes this possible at budget pricing.

Best GPU VPS for Whisper

GPU acceleration makes Whisper 10-30x faster. A 1-hour podcast transcribes in 2-5 minutes.
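The arithmetic behind those estimates is simple; this quick sketch converts the "Nx real-time" figures quoted in this guide into wall-clock time:

```python
def processing_minutes(audio_minutes: float, speed_factor: float) -> float:
    """Wall-clock processing time for a given 'Nx real-time' speed factor."""
    return audio_minutes / speed_factor

# A 1-hour podcast at the speeds quoted in this guide:
for label, factor in [("CPU medium (~1x)", 1),
                      ("CPU small (~3x)", 3),
                      ("GPU large-v3 (~15x)", 15)]:
    print(f"{label}: {processing_minutes(60, factor):.0f} min")
```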

1. Vultr Cloud GPU (Best Availability)

$90/mo | NVIDIA A16 (16GB VRAM), 6 vCPU, 16GB RAM

Runs every Whisper model including large-v3. Always available — no spot instance headaches.

Performance: ~10-15x real-time with large-v3

2. Hetzner Dedicated GPU (Best Monthly Rate)

€179/mo | NVIDIA RTX 4000 (8GB VRAM), 8 cores, 64GB RAM

Best value for 24/7 transcription workloads. Runs medium and small models at blazing speed.

3. RunPod (Cheapest for Batch Jobs)

$0.20/hr | NVIDIA RTX 4090 (24GB VRAM)

Spin up when you have files to process, shut down when done. Perfect for occasional bulk transcription.

4. Lambda Labs (Heavy Workloads)

$0.50/hr (~$360/mo) | NVIDIA A10 (24GB VRAM)

For production transcription pipelines processing thousands of hours monthly.

Complete Setup Guide

Step 1: Create Your VPS

We’ll use Hetzner CPX41 for this guide:

  1. Sign up at Hetzner Cloud
  2. Create server → Ubuntu 22.04 → CPX41
  3. Add your SSH key
  4. Note the IP address

Step 2: Install Whisper

ssh root@your-server-ip

# Install dependencies
apt update && apt install -y python3-pip ffmpeg

# Install Whisper
pip3 install openai-whisper

Step 3: Transcribe Your First File

# Basic transcription
whisper recording.mp3 --model medium

# With language detection
whisper recording.mp3 --model medium --task transcribe

# Translate to English
whisper foreign_audio.mp3 --model medium --task translate

# Output subtitles
whisper video.mp4 --model medium --output_format srt
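If you script these commands from Python, a small command-builder keeps the flags in one place. This is a hedged sketch (the `build_whisper_cmd` helper is hypothetical, not part of Whisper), using only the CLI flags shown above:

```python
import subprocess

def build_whisper_cmd(audio_path, model="medium", task="transcribe",
                      language=None, output_format=None):
    """Assemble a whisper CLI invocation like the examples above."""
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language:
        cmd += ["--language", language]
    if output_format:
        cmd += ["--output_format", output_format]
    return cmd

cmd = build_whisper_cmd("video.mp4", output_format="srt")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment on a server with whisper installed
```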

Step 4: Switch to faster-whisper (Recommended)

faster-whisper uses CTranslate2 and is roughly 4x faster than standard Whisper with lower memory usage:

pip3 install faster-whisper

python3 << 'EOF'
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("recording.mp3")

print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
EOF

Why faster-whisper? Same model weights and transcription quality, roughly 4x the speed, and int8 quantization cuts memory use enough to run medium comfortably on an 8GB server.

Step 5: Set Up as API Service

Create a simple transcription API with FastAPI:

pip3 install fastapi uvicorn python-multipart faster-whisper

# transcription_api.py
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import tempfile, os

app = FastAPI()
model = WhisperModel("medium", device="cpu", compute_type="int8")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    segments, info = model.transcribe(tmp_path)
    text = " ".join(s.text for s in segments)
    os.unlink(tmp_path)

    return {
        "language": info.language,
        "text": text.strip()
    }

Run the server:

uvicorn transcription_api:app --host 0.0.0.0 --port 8000

Send files to your API:

curl -X POST http://your-server-ip:8000/transcribe \
  -F "file=@recording.mp3"

Step 6: Docker Setup (Alternative)

docker run -d -p 8000:8000 \
  --name whisper \
  -v whisper-models:/root/.cache \
  onerahmet/openai-whisper-asr-webservice:latest

This gives you a ready-made REST API with Swagger docs at http://your-server-ip:8000/docs.

Performance Optimization

1. Use faster-whisper with int8

# CPU — int8 quantization (fastest)
model = WhisperModel("medium", device="cpu", compute_type="int8")

# GPU — float16 (best quality/speed balance)
model = WhisperModel("medium", device="cuda", compute_type="float16")

2. Batch Processing Script

#!/bin/bash
# transcribe_all.sh — process all audio files in a directory
INPUT_DIR="./audio"
OUTPUT_DIR="./transcripts"
mkdir -p "$OUTPUT_DIR"

for file in "$INPUT_DIR"/*.{mp3,wav,m4a,flac}; do
    [ -f "$file" ] || continue
    filename=$(basename "$file" | sed 's/\.[^.]*$//')
    echo "Processing: $file"
    whisper "$file" --model medium --output_dir "$OUTPUT_DIR" --output_format txt
done
echo "Done! Transcripts in $OUTPUT_DIR"
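The same batch job can be written in Python, which makes the file matching easier to test. A hedged sketch (the helper names are my own; it shells out to the same `whisper` CLI used above):

```python
import pathlib
import subprocess

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def collect_audio(directory):
    """Find audio files the same way the shell glob above does."""
    root = pathlib.Path(directory)
    return sorted(p for p in root.iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTS)

def transcribe_all(input_dir="./audio", output_dir="./transcripts"):
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)
    for audio in collect_audio(input_dir):
        print(f"Processing: {audio}")
        subprocess.run(["whisper", str(audio), "--model", "medium",
                        "--output_dir", output_dir, "--output_format", "txt"],
                       check=True)

if __name__ == "__main__" and pathlib.Path("./audio").is_dir():
    transcribe_all()
```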

3. Enable Swap for Large Models

fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab

4. Use VAD (Voice Activity Detection)

Skip silence to speed up processing:

segments, info = model.transcribe(
    "recording.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

This can speed up transcription by 2-3x on recordings with lots of silence or pauses.
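As a back-of-envelope check (an idealized model, assuming VAD skips silent stretches entirely), the speedup follows directly from the fraction of the recording that is silence:

```python
def vad_speedup(silence_fraction: float) -> float:
    """Idealized speedup if VAD skips the silent portion entirely."""
    return 1.0 / (1.0 - silence_fraction)

# A recording that is half silence roughly halves the work:
print(f"{vad_speedup(0.5):.1f}x")
print(f"{vad_speedup(2/3):.1f}x")
```

So the 2-3x figure corresponds to recordings that are roughly half to two-thirds silence.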

Cost Comparison: VPS vs APIs

| Option | Pricing | Cost for 100 hrs of Audio |
|--------|---------|---------------------------|
| OpenAI Whisper API | $0.006/min | $36 |
| Google Speech-to-Text | $0.006/min | $36 |
| AWS Transcribe | $0.024/min | $144 |
| Hetzner VPS + Whisper | €15/mo | €15 (unlimited) |
| Vultr GPU + Whisper | $90/mo | $90 (unlimited) |

Self-hosting breaks even at roughly 40 hours/month on Hetzner, or 250 hours/month on Vultr GPU. After that, every hour is free.
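The break-even math is worth sanity-checking yourself; this sketch uses the figures from the table above (and treats EUR ≈ USD for a rough estimate):

```python
def break_even_hours(vps_monthly: float, api_per_minute: float) -> float:
    """Audio hours/month where a flat-rate VPS beats per-minute API pricing."""
    return vps_monthly / (api_per_minute * 60)

# Figures from the cost table above
print(f"Hetzner vs Whisper API:   {break_even_hours(15, 0.006):.0f} hrs/mo")
print(f"Vultr GPU vs Whisper API: {break_even_hours(90, 0.006):.0f} hrs/mo")
```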

Use Cases

Podcast Transcription

Run large-v3 for best accuracy. A 1-hour episode takes ~5 min on GPU, ~1 hour on CPU.

Meeting Notes

Combine Whisper with WhisperX for speaker diarization:

pip3 install whisperx

python3 << 'EOF'
import whisperx

model = whisperx.load_model('medium', 'cpu', compute_type='int8')
result = model.transcribe('meeting.mp3')

# Speaker diarization needs a (free) Hugging Face token for the pyannote models
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_HF_TOKEN', device='cpu')
result = whisperx.assign_word_speakers(diarize_model('meeting.mp3'), result)
EOF

Subtitle Generation

whisper video.mp4 --model medium --output_format srt --word_timestamps True

Voice Note Processing

Build a Telegram bot or webhook that auto-transcribes voice messages.

FAQ

Can I run Whisper on 2GB RAM?

Yes, with the tiny or base model. Accuracy is lower but fine for clear English audio.

Is GPU required?

No. CPU works perfectly for batch processing where speed isn’t critical. Use faster-whisper with int8 for best CPU performance.

Which model should I use?

medium for most use cases. large-v3 if accuracy is critical (legal, medical). small if speed matters more than perfect accuracy.

Can Whisper handle multiple languages?

Yes. It auto-detects language and can transcribe 99+ languages. Translation to English is built in.

How accurate is Whisper?

The large-v3 model approaches human-level accuracy, with a word error rate of roughly 2-5% on clean audio (i.e. ~95-98% of words correct). medium is close behind at ~93-96% accuracy.

Recommendations at a Glance

| Use Case | VPS | Cost | Model | Speed |
|----------|-----|------|-------|-------|
| Occasional Use | Hetzner CPX21 | €8/mo | small | ~3x real-time |
| Daily Transcription | Hetzner CPX41 | €15/mo | medium | ~1x real-time |
| Fast Processing | Vultr GPU | $90/mo | large-v3 | ~15x real-time |
| Bulk/Production | Lambda A10 | $360/mo | large-v3 | ~20x real-time |

For most users, Hetzner CPX41 at €15/mo with faster-whisper and the medium model is the sweet spot. Accurate enough for real work, affordable enough to leave running.


Ready to get started?

Get the best VPS hosting deal today. Hostinger offers 4GB RAM VPS starting at just $4.99/mo.

Get Hostinger VPS — $4.99/mo

// up to 75% off + free domain included


Andrius Putna

I am Andrius Putna. Geek. Since the early 2000s I have loved tinkering with web technologies; now it's AI. I bridge business and technology to drive meaningful impact, combining expertise in customer experience, technology, and business strategy. Father, open-source contributor, investor, 2x Ironman, MBA graduate.

// last updated: March 6, 2026. Disclosure: This article may contain affiliate links.