Runbook: Jellyfin transcoding and ffmpeg
Jellyfin runs as a native systemd service (jellyfin.service) on VM 115 with an RTX 2080 Ti passed through from the Proxmox host, not in Kubernetes or Docker (it moved off both in March 2026). Reach the VM by SSH (ssh ladino@192.168.1.170) or from a Proxmox host with qm guest exec 115 -- <cmd>. NVENC is the hardware acceleration path; there is no Intel QSV / VAAPI here.
When a play "doesn't work," the cause is almost always one of: a codec mismatch the client cannot direct-play, a transcoding failure on the GPU, the GPU watchdog killing the transcode, or insufficient bandwidth. ffmpeg does the work, so most diagnostics start by reading what ffmpeg said.
Symptoms and what they usually mean
| Symptom | Likely cause |
|---|---|
| "Playback error" immediately, before any frame | Profile mismatch (codec, container, level), or transcode failed to start |
| Plays a few seconds then stops, repeatedly | Transcode falling behind, or the gpu-watchdog SIGTERMing ffmpeg (see below) |
| Audio works, video black or green | Hardware decode pipeline failing, silent CPU fallback dying on the time budget |
| Stream choppy on local LAN | Transcoding used where direct-play was possible; check the decision log |
| VM CPU pegged during a single stream | Software transcode where NVENC was expected (driver gone, or HEVC source on a card that cannot decode it) |
| "Hardware is lacking required capabilities" in ffmpeg log | The card cannot decode this codec. Almost always the K620 failover path (Maxwell cannot do HEVC/AV1/10-bit) |
| Multiple streams all stutter together | Concurrent transcode count exceeds what the GPU can sustain |
Triage
Step 1: read the ffmpeg log for the failed session
Jellyfin writes a log for every ffmpeg invocation to /var/log/jellyfin/. Files are named FFmpeg.Transcode-<timestamp>_<sessionid>_<hash>.log.
# Most recent transcode log
ls -t /var/log/jellyfin/FFmpeg.Transcode-*.log | head -1
# Tail it
tail -100 "$(ls -t /var/log/jellyfin/FFmpeg.Transcode-*.log | head -1)"
Things to grep for:
- "Hardware is lacking required capabilities" / "Failed setup for format cuda": the GPU cannot decode this source codec in hardware. On the 2080 Ti this should not happen for H.264/HEVC. If you see it, you are almost certainly on the K620 failover node (Maxwell: H.264 NVENC only, no HEVC/AV1/10-bit). Check nvidia-smi (next step) and see the GPU D3cold runbook.
- "exited with code 143": SIGTERM (128 + 15). Something killed ffmpeg. The usual culprit is the gpu-watchdog stopping Jellyfin and running killall -9 ffmpeg on a (real or false-positive) GPU failure. Check journalctl -u gpu-watchdog.
- "Cannot load libnvcuvid" / "OpenEncodeSessionEx failed" / "no capable devices found": NVENC/NVDEC path broken. Usually the nvidia driver in the guest is not talking to the card (nvidia-smi fails), which after a migration is often the HA cmdline-cache quirk (GPU not actually attached to the QEMU process).
- "Invalid data found when processing input": bad source file or a truncated NFS read. ffprobe the file directly.
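As a rough sketch, the strings above can be turned into a first-pass classifier. The function name classify_ffmpeg_log and the category labels are hypothetical, chosen for illustration; only the matched strings come from this runbook.

```shell
# Sketch only: map the known error strings to a next step.
# classify_ffmpeg_log and its labels are illustrative names, not part of Jellyfin.
classify_ffmpeg_log() {
    log="$1"
    if grep -qE 'Hardware is lacking required capabilities|Failed setup for format cuda' "$log"; then
        echo "wrong-gpu"        # likely on the K620 failover node
    elif grep -q 'exited with code 143' "$log"; then
        echo "killed-sigterm"   # check journalctl -u gpu-watchdog
    elif grep -qE 'Cannot load libnvcuvid|OpenEncodeSessionEx failed|no capable devices found' "$log"; then
        echo "nvenc-broken"     # driver or passthrough problem
    elif grep -q 'Invalid data found when processing input' "$log"; then
        echo "bad-source"       # ffprobe the file directly
    else
        echo "unknown"          # read the log by hand
    fi
}
```

On the VM it could be pointed at the newest log: classify_ffmpeg_log "$(ls -t /var/log/jellyfin/FFmpeg.Transcode-*.log | head -1)". The checks run in the order listed, so a log containing several of the strings reports only the first match.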
Step 2: confirm the GPU is actually there and healthy
This is the difference between a server that handles several streams and one that handles none. It is also a frequent silent regression after a migration, reboot, or driver update.
# Always wrap nvidia-smi in timeout. A hung GPU hangs nvidia-smi forever.
timeout 10 nvidia-smi --query-gpu=name,driver_version,utilization.gpu,memory.used --format=csv
# Expected on the primary node: "NVIDIA GeForce RTX 2080 Ti, 570.xxx, ..."
# "Quadro K620" means you are on the failover node and HEVC/AV1/HDR will not transcode.
# "NVIDIA-SMI has failed..." or "No NVIDIA GPU found" means the card is not attached to the guest.
# What does ffmpeg think it can do?
/usr/lib/jellyfin-ffmpeg/ffmpeg -hide_banner -hwaccels # expect: cuda (among others)
/usr/lib/jellyfin-ffmpeg/ffmpeg -hide_banner -encoders | grep -E "nvenc"
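# Expect h264_nvenc and hevc_nvenc in the list. This shows what the ffmpeg build
# includes, not what the card supports: the 2080 Ti (Turing) can use both, the K620
# only h264_nvenc. av1_nvenc may be listed, but neither card can encode AV1.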
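The three nvidia-smi outcomes described in the comments above can be sketched as a small helper for scripts. The name gpu_role and its return labels are hypothetical; the matched model strings are the ones this runbook expects.

```shell
# Sketch only: classify the GPU name reported by nvidia-smi.
# gpu_role and its labels are illustrative names.
gpu_role() {
    case "$1" in
        *"RTX 2080 Ti"*) echo "primary" ;;   # full HEVC/H.264 hardware path
        *"K620"*)        echo "failover" ;;  # H.264 only
        *)               echo "missing" ;;   # error text, timeout, or no card
    esac
}

# Usage on the VM (not executed here):
#   gpu_role "$(timeout 10 nvidia-smi --query-gpu=name --format=csv,noheader 2>&1)"
```

Anything that is not one of the two known cards is treated as "missing", which also covers the empty output left behind when timeout kills a hung nvidia-smi.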
If nvidia-smi fails inside the VM, the problem is below Jellyfin (passthrough / driver), not Jellyfin itself. Go to the GPU D3cold runbook. In particular, after an HA-driven migration, verify the card is in the running QEMU process on the host:
# On the Proxmox host running VM 115
QEMU_PID=$(cat /run/qemu-server/115.pid)
tr "\0" "\n" < /proc/$QEMU_PID/cmdline | grep -c vfio-pci # expect 4; if 0, qm stop 115 && qm start 115
Step 3: is the watchdog killing the transcode?
systemctl is-active gpu-watchdog
journalctl -u gpu-watchdog --since "20 min ago" --no-pager | tail -20
# "Detected expected GPU: NVIDIA GeForce RTX 2080 Ti" = active recovery mode (normal on primary node)
# "entering PASSIVE mode" = on the fallback K620, watchdog will NOT kill transcodes (correct)
# Repeated "GPU failure detected ... Stopping Jellyfin" = the watchdog is the problem
If the watchdog is crash-looping on the primary node, the GPU is genuinely sick: go to the GPU D3cold runbook. On the fallback node it should be PASSIVE and harmless; if it is not, the host has an old pre-2026-05-27 watchdog and needs roles/jellyfin re-applied.
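A minimal sketch of reading those journal lines mechanically, assuming the message texts quoted above. The helper name watchdog_state is hypothetical, and treating two or more failure lines as "repeated" is an assumption, not a documented threshold.

```shell
# Sketch only: summarise gpu-watchdog journal text read from stdin.
# watchdog_state is an illustrative name; the threshold of 2 is an assumption.
watchdog_state() {
    journal=$(cat)
    failures=$(printf '%s\n' "$journal" | grep -c 'GPU failure detected' || true)
    if [ "$failures" -ge 2 ]; then
        echo "crash-loop"   # the watchdog itself is killing transcodes
    elif printf '%s\n' "$journal" | grep -q 'entering PASSIVE mode'; then
        echo "passive"      # expected on the K620 fallback node
    elif printf '%s\n' "$journal" | grep -q 'Detected expected GPU'; then
        echo "active"       # expected on the primary node
    else
        echo "unknown"
    fi
}

# Usage on the VM (not executed here):
#   journalctl -u gpu-watchdog --since "20 min ago" --no-pager | watchdog_state
```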
Step 4: check the playback decision
Sometimes ffmpeg is fine and Jellyfin is choosing to transcode where it could direct-play, or refusing direct-stream because the client lied about support.
The key line is the transcoding decision and its reason ("Container is not supported", "Audio codec is not supported", "Bitrate is too high"). Most of the time it is correct and the fix is client-side (e.g. an app that misreports HEVC support). Occasionally it is a misconfigured custom profile.
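To see which reason dominates, the reasons can be tallied from a log file. This is a sketch under assumptions: tally_transcode_reasons is a hypothetical helper, the log file is passed in because its location is not stated here, and the three phrases are the ones quoted above, so the exact wording in a given Jellyfin version may differ.

```shell
# Sketch only: count how often each quoted reason appears in a log file.
# tally_transcode_reasons is an illustrative name; adjust the phrases to
# match what your Jellyfin log actually prints.
tally_transcode_reasons() {
    for reason in "Container is not supported" \
                  "Audio codec is not supported" \
                  "Bitrate is too high"; do
        count=$(grep -cF "$reason" "$1" || true)
        printf '%s\t%s\n' "$count" "$reason"
    done
}
```

A single reason with a high count across many sessions usually points at one client or one profile rather than at the server.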
Fix patterns
Wrong GPU (on the K620 failover node)
If nvidia-smi shows a Quadro K620, the 2080 Ti hung and HA failed VM 115 over. H.264 SDR content will still transcode; HEVC/AV1/HDR will not. Follow the GPU D3cold runbook: recover the 2080 Ti (cold reboot pve2) and migrate VM 115 back.
Hardware accel config
Encoder settings live in /etc/jellyfin/encoding.xml (managed by the roles/jellyfin Ansible role; edit there, not live). Current known-good values:
<HardwareAccelerationType>nvenc</HardwareAccelerationType>
<EnableHardwareEncoding>true</EnableHardwareEncoding>
<EnableTonemapping>false</EnableTonemapping> <!-- HDR tonemapping crashes this card; keep off -->
<EnableEnhancedNvdecDecoder>false</EnableEnhancedNvdecDecoder> <!-- 4K DV HDR can OOM 11GB VRAM -->
EnableTonemapping and enhanced NVDEC are deliberately off (they have crashed or OOM'd the GPU historically). Do not flip them on without testing under 4K HDR load.
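A quick drift check against those four values could look like the sketch below. check_encoding_xml is a hypothetical helper, and it assumes each element sits on one line exactly as written above, with no extra whitespace inside the tags.

```shell
# Sketch only: report any of the four known-good settings that are missing
# or different. check_encoding_xml is an illustrative name.
check_encoding_xml() {
    file="$1"
    rc=0
    for want in \
        '<HardwareAccelerationType>nvenc</HardwareAccelerationType>' \
        '<EnableHardwareEncoding>true</EnableHardwareEncoding>' \
        '<EnableTonemapping>false</EnableTonemapping>' \
        '<EnableEnhancedNvdecDecoder>false</EnableEnhancedNvdecDecoder>'
    do
        if ! grep -qF "$want" "$file"; then
            echo "DRIFT: expected $want"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on the VM (not executed here):
#   check_encoding_xml /etc/jellyfin/encoding.xml
```

If it reports drift, fix it in the roles/jellyfin Ansible role rather than editing the live file.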
Verify the full pipeline with a synthetic transcode
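# HEVC NVDEC + h264 NVENC. Exits 0 on the 2080 Ti, FAILS with
# "Hardware is lacking required capabilities" on the K620.
# hevc_cuvid is forced below, so the sample must be an HEVC file; an H.264
# sample fails on either card. Check the codec before trusting a failure.
SAMPLE=$(find /mnt/media/library -type f -name "*.mkv" -size -500M 2>/dev/null | head -1)
/usr/lib/jellyfin-ffmpeg/ffprobe -v error -select_streams v:0 \
-show_entries stream=codec_name -of csv=p=0 "$SAMPLE" # expect: hevc
timeout 30 /usr/lib/jellyfin-ffmpeg/ffmpeg -hide_banner -loglevel error \
-hwaccel cuda -hwaccel_output_format cuda -c:v hevc_cuvid \
-i "$SAMPLE" -t 10 -c:v h264_nvenc -preset p1 -b:v 5M -f null - && echo OK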
Transcoding works but the box is melting
Cap concurrent transcodes in Dashboard > Playback. The 2080 Ti handles several 1080p streams comfortably; still set a ceiling so it refuses rather than degrades. Better to reject one play than serve several broken ones.
Source file is corrupt
If ffprobe errors, the file is the problem. Re-grab from the *arr stack or restore from a NAS snapshot.
Telemetry
Already wired in the lab (textfile collectors on the VM, scraped by Prometheus):
- gpu-metrics.timer writes nvidia_* (utilization, encoder/decoder load, VRAM, temp) every 30s.
- jellyfin-metrics.timer writes session/transcode counts every 60s.
- gpu_watchdog.prom (nvidia_gpu_failures_total, soft/hard recovery counters) written by the watchdog.
A "transcoding active for >X min with rising failures" signal is more useful than "Jellyfin is up", because Jellyfin can be Ready while every stream silently fails.
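For a quick look at the failure counter without going through Prometheus, a textfile can be read directly. This is a sketch: prom_value is a hypothetical helper, the file path is passed in because the textfile directory is not stated here, and it assumes the usual "name value" or "name{labels} value" line format with no trailing timestamp.

```shell
# Sketch only: print the value of one metric from a Prometheus textfile.
# prom_value is an illustrative name. Exits non-zero if the metric is absent.
prom_value() {
    awk -v m="$2" '
        $1 == m || index($0, m "{") == 1 { print $NF; found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage on the VM (path is a placeholder, not executed here):
#   prom_value /path/to/gpu_watchdog.prom nvidia_gpu_failures_total
```

Reading the counter twice a few minutes apart while a transcode is active gives a manual version of the "rising failures" signal.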
Patterns worth knowing
- Read the ffmpeg log first, every time. The actual error is in there in plain text.
- "exited with code 143" is not an ffmpeg bug, it is SIGTERM. Something killed it. On this host that is almost always the gpu-watchdog. Chase the watchdog, not the codec.
- Wrap nvidia-smi in timeout. A hung GPU hangs it forever and wedges anything that calls it.
- "Hardware is lacking required capabilities" means wrong card, not wrong config. You are on the K620. Recover the 2080 Ti.
- ICMP to the VM is blocked (UFW); use HTTP for liveness. curl http://192.168.1.170:8096/health, not ping.
- SQLite plus an abrupt stop equals corruption. Never restart Jellyfin mid-stream. See the DB corruption runbook.
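The 128 + N convention behind exit code 143 applies to any signal, so other codes can be decoded the same way. A minimal sketch, using the shell's built-in kill -l; signal_for_exit is a hypothetical helper name.

```shell
# Sketch only: decode a "killed by signal" exit status (128 + signal number).
# signal_for_exit is an illustrative name.
signal_for_exit() {
    code="$1"
    if [ "$code" -gt 128 ]; then
        kill -l "$((code - 128))"   # prints the signal name, e.g. TERM
    else
        echo "not-a-signal"         # ordinary exit status, read the log
    fi
}
```

signal_for_exit 143 reports TERM, matching the SIGTERM case above; a 137 would report KILL, which points at something sending SIGKILL instead.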