Runbook: Jellyfin transcoding and ffmpeg¶
When a play "doesn't work," the cause is almost always one of: codec mismatch the client cannot direct-play, transcoding failure inside the pod, or insufficient bandwidth between server and client. ffmpeg is the part doing the work, so most diagnostics start by reading what ffmpeg said.
Symptoms and what they usually mean¶
| Symptom | Likely cause |
|---|---|
| "Playback error" immediately, before any video frame | Profile mismatch (codec, container, level), or transcode failed to start |
| Plays for a few seconds then stops | Transcode segment generation falling behind, or stream killed by a session limit |
| Audio works, video black or green | Hardware accel pipeline failing, falling back to CPU silently then dying on time budget |
| Stream choppy on local LAN | Transcode being used unnecessarily; check direct-play eligibility |
| Server pod CPU pegged at 100% during a single stream | Transcoding in software when hardware accel was expected to be available |
| Multiple streams all stutter together | Concurrent transcode count exceeded what the GPU/CPU can sustain |
Triage¶
Step 1: read the ffmpeg log for the failed session¶
Jellyfin writes every ffmpeg invocation to /config/log/. The file name is timestamped and includes the session ID.
POD=$(kubectl -n media get pod -l app.kubernetes.io/name=jellyfin -o jsonpath='{.items[0].metadata.name}')
# Most recent transcode log
kubectl -n media exec $POD -- ls -t /config/log/ffmpeg-transcode-*.txt | head -1
# Tail the last one
kubectl -n media exec $POD -- sh -c 'tail -200 $(ls -t /config/log/ffmpeg-transcode-*.txt | head -1)'
Things to grep for:
- "Failed to open codec" or "No such codec": missing codec, almost always because the container build of jellyfin-ffmpeg lost a codec or the input has a codec the build does not include (Dolby Vision profile 8, certain DTS variants).
- "vaapi_open_internal: Failed to initialize VAAPI device": Intel QSV / VAAPI hardware accel cannot reach the GPU. Almost always permissions on /dev/dri/renderD128 inside the pod.
- "Cannot load nvcuda.dll" or "nvenc not loaded": NVENC path broken. Usually the wrong driver inside the pod versus the host, or no nvidia.com/gpu resource on the Deployment.
- "Decoder ... not allowed": Jellyfin's profile is rejecting the source codec for transcode. Usually a codec license / build flag issue.
- "Invalid data found when processing input": the source file is bad, or the network mount is returning truncated data. Try ffprobe on the file directly.
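A quick scan of the newest log for all of the above at once (patterns mirror the list above):
kubectl -n media exec $POD -- sh -c \
  'grep -iE "failed to open codec|no such codec|failed to initialize vaapi|nvcuda|nvenc not loaded|not allowed|invalid data found" \
  $(ls -t /config/log/ffmpeg-transcode-*.txt | head -1)'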
Step 2: confirm hardware accel is wired up¶
Hardware-accelerated transcoding is the difference between a server that handles 4 streams and one that handles 1. It is also a frequent silent regression after image upgrades.
For Intel QSV / VAAPI:
# Does the pod see the iGPU device?
kubectl -n media exec $POD -- ls -la /dev/dri/
# Expected: card0 + renderD128, both readable by the jellyfin user (UID 1000 typically)
# If renderD128 has mode 0660 root:render and the pod runs as 1000, transcode will fail.
# What does ffmpeg think it can do?
kubectl -n media exec $POD -- /usr/lib/jellyfin-ffmpeg/ffmpeg -hide_banner -hwaccels
# Expected output includes: vaapi, qsv (Intel), or cuda/nvenc (NVIDIA)
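Listing hwaccels only proves the build supports them; a one-second synthetic encode proves the device actually works. A sketch, assuming the default renderD128 node:
# 1-second test pattern, uploaded to the GPU and encoded with h264_vaapi, output discarded
kubectl -n media exec $POD -- /usr/lib/jellyfin-ffmpeg/ffmpeg -v error \
  -init_hw_device vaapi=va:/dev/dri/renderD128 -filter_hw_device va \
  -f lavfi -i testsrc=duration=1:size=640x360:rate=30 \
  -vf 'format=nv12,hwupload' -c:v h264_vaapi -f null - \
  && echo "VAAPI OK"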
For NVIDIA / NVENC:
kubectl -n media exec $POD -- nvidia-smi # should list the GPU
kubectl -n media describe pod $POD | grep -i nvidia.com/gpu
# Expected: "nvidia.com/gpu: 1" in resources
If the device is missing inside the pod, the issue is at the Deployment manifest level (not Jellyfin itself):
- VAAPI: needs the /dev/dri device added via volumes + volumeMounts, and securityContext.runAsGroup set to the host's render group GID, OR securityContext.privileged: true (heavier).
- NVENC: needs nvidia.com/gpu: 1 in resources, the NVIDIA device plugin running on the node, and runtimeClassName: nvidia if the cluster uses runtime classes.
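Two quick checks for the node side of this (run the last two on the host itself, not in the pod):
# Is any node actually advertising a GPU resource to the scheduler?
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# On the host node, for VAAPI: the render group GID that must appear in supplementalGroups
getent group render
ls -ln /dev/dri/renderD128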
Step 3: check the user's playback profile¶
Sometimes ffmpeg is fine and the problem is upstream: Jellyfin is choosing to transcode where it could direct-play, or refusing to direct-stream because the client claims the format is unsupported.
# Tail the main Jellyfin log during a known-bad play
kubectl -n media logs $POD -f | grep -iE "stream|playback|transcoding decision|videocodec|audiocodec"
The key line is "Transcoding decision: ...". It states the chosen reason ("Container is not supported", "Audio codec is not supported", "Bitrate is too high"). Most of the time the decision is correct and the answer is "fix the client" (e.g. the iOS native player, which lies about HEVC support). Sometimes it is a misconfigured custom profile in the Dashboard, which is easy to fix.
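When the logs are too noisy, the Sessions API exposes the same decision per active stream. A sketch, assuming an API key created under Dashboard > API Keys, curl available in the image, and jq on the workstation; field names track recent Jellyfin releases:
# TranscodingInfo is null for direct-play sessions; TranscodeReasons explains rejections
kubectl -n media exec $POD -- curl -s \
  "http://localhost:8096/Sessions?api_key=<your-api-key>" \
  | jq '.[] | {Client, PlayMethod: .PlayState.PlayMethod, Reasons: .TranscodingInfo.TranscodeReasons}'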
Fix patterns¶
Hardware accel broken after image bump¶
# Make sure the Deployment has these for VAAPI / Intel QSV:
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        supplementalGroups: [44, 109]  # video, render groups on Debian-based images
      containers:
      - name: jellyfin
        resources:
          limits:
            gpu.intel.com/i915: 1  # if using the Intel device plugin
        volumeMounts:
        - mountPath: /dev/dri
          name: dri
      volumes:
      - name: dri
        hostPath:
          path: /dev/dri
          type: Directory
Verify the host's render group GID matches one of the supplementalGroups. On Debian 12 it is 109; on older releases it may be 110.
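To confirm the group change actually landed in the container after a rollout:
# The render GID (109 here) should appear in the container's group list
kubectl -n media exec $POD -- id
kubectl -n media exec $POD -- test -r /dev/dri/renderD128 && echo "renderD128 readable"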
Transcoding works but the box is melting¶
Cap concurrent transcodes. Set Dashboard > Playback > "Maximum number of concurrent stream transcodes" conservatively (1 to 2 for a small iGPU, 4 to 6 for an Arc or other dedicated GPU). Better to refuse a play than to deliver four broken ones.
Direct-play forced where transcoding is wanted¶
A client-specific profile override in /config/data/dlna/profiles/users/ may be forcing direct play, which then times out on the wall clock instead of falling back to a transcode. Move the profile aside, restart Jellyfin, and retest with the default profile.
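A sketch of the move-aside, with a hypothetical profile file name (list the directory first to find the real one):
# List what's there; the override file name varies per client
kubectl -n media exec $POD -- ls /config/data/dlna/profiles/users/
kubectl -n media exec $POD -- sh -c \
  'mkdir -p /config/profile-backup && mv "/config/data/dlna/profiles/users/<profile>.xml" /config/profile-backup/'
# Restart (deployment name assumed to be "jellyfin")
kubectl -n media rollout restart deployment/jellyfin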
Source file is corrupt¶
# Run from the pod (uses jellyfin-ffmpeg, not system ffmpeg)
kubectl -n media exec $POD -- /usr/lib/jellyfin-ffmpeg/ffprobe -v error \
-show_streams -show_format \
"/media/<path-to-the-bad-file>"
If ffprobe errors out, the file is the problem, not Jellyfin. Re-grab from the *arr stack or restore from snapshot.
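ffprobe only reads stream headers. If it passes but playback still dies mid-file, a full decode pass will surface deeper corruption (slow on large files):
# Decode every frame, discard output; any stream-level corruption prints to stderr
kubectl -n media exec $POD -- /usr/lib/jellyfin-ffmpeg/ffmpeg -v error \
  -i "/media/<path-to-the-bad-file>" -f null -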
Telemetry to add (if not present)¶
The lab's Prometheus stack already scrapes pod CPU and memory. Useful Jellyfin-specific signals if you want them:
- jellyfin_active_sessions from the Jellyfin Prometheus plugin.
- jellyfin_transcoding_sessions count.
- A nvidia-smi exporter (or DCGM) if running NVENC; track GPU utilization and encoder utilization separately. The lab already has GPU temperature via node_hwmon (see the homepage temperatures panel) but no direct NVENC saturation metric.
A "transcoding active for >X min" alert is more useful than "Jellyfin is up", because Jellyfin can be up and Ready while every stream silently fails.
Patterns worth knowing¶
- Read the ffmpeg log first, every time. It is verbose and intimidating, but the actual error is in there in plain text and saves hours of guessing.
- Hardware accel is a silent regression magnet. Every image rebuild, every kernel upgrade, every CSI/device-plugin change can break it without a clear failure. Add a synthetic transcode check (a tiny file ffmpeg-transcoded on a CronJob) that pages on failure; a sketch follows this list.
- Refuse, don't degrade. Capping concurrent transcodes preserves quality of service. Letting a small box accept 6 streams and serve 6 broken ones is worse than serving 2 well.
- SQLite + active stream + abrupt restart = corruption. Treat it as a hard rule. See the DB corruption runbook.
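The synthetic transcode canary from the list above, as a minimal sketch: it reuses the VAAPI smoke test from Step 2 and the render GID from the fix pattern. The image, schedule, and GID are assumptions to adjust; failures surface as failed Jobs, which can page via kube-state-metrics' kube_job_status_failed.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jellyfin-transcode-canary
  namespace: media
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          securityContext:
            supplementalGroups: [109]  # host render group GID, see the fix pattern above
          containers:
          - name: canary
            image: jellyfin/jellyfin  # any image carrying jellyfin-ffmpeg
            command:
            - /usr/lib/jellyfin-ffmpeg/ffmpeg
            args: ["-v", "error",
                   "-init_hw_device", "vaapi=va:/dev/dri/renderD128",
                   "-filter_hw_device", "va",
                   "-f", "lavfi", "-i", "testsrc=duration=2:size=640x360:rate=30",
                   "-vf", "format=nv12,hwupload",
                   "-c:v", "h264_vaapi", "-f", "null", "-"]
            volumeMounts:
            - name: dri
              mountPath: /dev/dri
          volumes:
          - name: dri
            hostPath:
              path: /dev/dri
              type: Directory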