Video Transcoding at Scale
Build a GPU-accelerated transcoding pipeline on Jetson Orin NX clusters or traditional GPU servers — all supporting infrastructure included
Video transcoding is compute-heavy and highly parallel. Every video is independent, so more workers means more throughput. The challenge is managing the fleet: deploying workers consistently, routing input and output through shared storage, and scaling up or down as demand changes.
The right hardware for a transcoding farm isn't always a rack of A100s. For most workloads — HLS ladder generation, clip transcoding, live stream packaging — the NVIDIA Jetson Orin NX is the ideal building block: dedicated NVENC and NVDEC engines, low power draw, ARM64, and a price point that makes large clusters economically practical. PodWarden runs on ARM64 and manages Jetson nodes exactly like any other host.
The Hub catalog covers every supporting service the pipeline needs. If you already have S3, a job queue, or a monitoring stack, bring it. If you don't, everything is available.
What You Need
| Component | Bring your own | Or deploy from Hub |
|---|---|---|
| Object storage | AWS S3, GCS, Wasabi, MinIO already running | MinIO or RustFS — deploy to any node, exposes S3 API |
| Job queue | Redis, RabbitMQ, SQS, NATS | Redis or RabbitMQ — from the Hub catalog |
| Database | Existing PostgreSQL (for job tracking UI) | PostgreSQL — from Hub |
| Secrets | Vault, AWS Secrets Manager | Vault — from Hub |
| Monitoring | Existing Prometheus + Grafana | Prometheus + Grafana + DCGM Exporter |
| Container registry | Docker Hub, GHCR | Harbor or Gitea — needed if you maintain custom ARM64 worker images |
Stack Architecture
The Jetson Orin NX Advantage
The Jetson Orin NX is a system-on-module with dedicated hardware video engines that make it exceptional for transcoding:
| Feature | Orin NX 16GB | Why it matters |
|---|---|---|
| NVENC | 1× dedicated encode engine | H.264, H.265, AV1 hardware encode — zero CPU load |
| NVDEC | 1× dedicated decode engine | Hardware decode of input streams |
| CUDA cores | 1024 | Available for filters, scaling, color conversion |
| Power draw | 10–25W | Dense clusters without specialized power infrastructure |
| Architecture | ARM64 | Standard Linux, standard NVIDIA container runtime |
| Form factor | SoM (67.6 × 45 mm) | Compact carrier boards, rack-mount sleds |
The NVENC and NVDEC engines are independent of the CUDA cores — they run simultaneously without contention. A single Orin NX sustains multiple concurrent 1080p60 encode sessions while the CPU handles queue polling, format demuxing, and S3 upload.
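To make the decode-on-NVDEC / encode-on-NVENC split concrete, here is a minimal Python sketch that assembles an FFmpeg command line for a hardware-accelerated H.264 encode. It assumes an NVENC-enabled FFmpeg build on the node; the file names and bitrate are illustrative, not part of the stack above.

```python
# Sketch: build an FFmpeg command that decodes on NVDEC and encodes on
# NVENC, leaving the CUDA cores free for filters. Assumes an
# NVENC-enabled FFmpeg build; paths and bitrate are illustrative.
import subprocess

def nvenc_cmd(src: str, dst: str, preset: str = "p4", bitrate: str = "5M") -> list[str]:
    return [
        "ffmpeg", "-y",
        "-hwaccel", "cuda",    # hardware decode via NVDEC
        "-i", src,
        "-c:v", "h264_nvenc",  # hardware encode via NVENC
        "-preset", preset,     # p1 (fastest) .. p7 (best quality)
        "-b:v", bitrate,
        "-c:a", "copy",        # pass audio through untouched
        dst,
    ]

if __name__ == "__main__":
    subprocess.run(nvenc_cmd("input.mp4", "output.mp4"), check=True)
```

Running two or three of these processes side by side is what the CONCURRENCY setting on the worker controls.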
Because Orin NX modules are inexpensive, you can build clusters that would be cost-prohibitive with traditional GPU servers. A rack of Jetson nodes — each running two or three concurrent encode sessions — often outperforms a handful of A100 machines for this workload, at a fraction of the cost and power budget.
Jetson vs x86 Topology
Building the Foundation
Deploy supporting services before the transcoding workers. They're standard stacks — import from Hub, assign to your cluster, deploy.
Transcoding Pipeline Flow
1. Object Storage
Workers need to read source video and write encoded output. Register an existing S3 endpoint as a storage connection under Settings → Storage. PodWarden tests connectivity from all cluster nodes and injects credentials as environment variables at deploy time.
If you don't have S3 storage:
- MinIO — Import from Hub. Deploy to a node with fast disk (NVMe recommended for video I/O). Exposes a full S3 API; every other component treats it identically to AWS S3.
- RustFS — High-performance alternative, also S3-compatible. Better suited for high-throughput encode pipelines with many simultaneous workers.
Create two buckets: one for ingest (source files), one for output (encoded files).
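A small sketch of that bucket setup, written to be idempotent against any S3-compatible endpoint (MinIO, RustFS, AWS). The client is injected so the same function works with a real boto3 client or a test double; the endpoint URL and bucket names mirror the examples used elsewhere in this guide.

```python
# Sketch: idempotent creation of the ingest and output buckets against
# any S3-compatible endpoint. The client is injected (boto3 or a fake);
# bucket names match the examples used elsewhere in this guide.
BUCKETS = ("media-ingest", "media-output")  # source files, encoded output

def ensure_buckets(s3_client, buckets=BUCKETS):
    existing = {b["Name"] for b in s3_client.list_buckets().get("Buckets", [])}
    for name in buckets:
        if name not in existing:
            s3_client.create_bucket(Bucket=name)

if __name__ == "__main__":
    import boto3  # requires: pip install boto3
    client = boto3.client("s3", endpoint_url="http://minio.mesh:9000")
    ensure_buckets(client)
```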
2. Job Queue
Workers poll the queue for jobs, transcode, and report completion. Pick based on what you know:
- Redis — Simple, fast, widely supported. Import from Hub. Workers use `BLPOP` or a queue library.
- RabbitMQ — More durable, supports dead-letter queues for failed jobs. Better for high-volume pipelines where job loss is not acceptable.
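The poll loop itself is short. This sketch assumes jobs are JSON blobs on a Redis list and uses redis-py's `blpop` semantics (returns a `(key, value)` pair, or `None` on timeout); the job field names are an assumption of the sketch, and the transcode step is stubbed out behind `handle`.

```python
# Sketch of the worker poll loop. Jobs are assumed to be JSON blobs on a
# Redis list; redis-py's blpop returns (key, value) or None on timeout.
import json

def parse_job(raw: bytes) -> dict:
    """Decode one queued job; the expected field names are an assumption."""
    job = json.loads(raw)
    for key in ("input_key", "output_key", "preset"):
        if key not in job:
            raise ValueError(f"malformed job, missing {key!r}")
    return job

def run_worker(queue, handle, queue_name="transcode-jobs"):
    """`queue` needs a redis-py-style blpop(name, timeout=...) method."""
    while True:
        item = queue.blpop(queue_name, timeout=30)
        if item is None:
            continue  # idle poll; a kind: Job worker would exit here instead
        _, raw = item
        handle(parse_job(raw))
```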
If you use a cloud queue (SQS, Cloud Tasks), set the QUEUE_URL environment variable accordingly — no Hub component needed.
3. Monitoring
Import Prometheus, Grafana, and DCGM Exporter from Hub.
Deploy DCGM Exporter as a DaemonSet — it runs on every GPU node automatically and exposes per-GPU metrics including NVENC/NVDEC engine utilization (on supported drivers). On Jetson, the equivalent is Tegrastats Exporter — also available from the Hub catalog.
Grafana shows queue depth, encode throughput (frames/second per node), GPU/encoder utilization, and error rates. This tells you immediately when a node is stalled, when the queue is backing up, or when a bad source file is causing repeated failures.
4. Secrets
Store S3 credentials, queue passwords, and registry credentials in Vault (from Hub) or your existing secrets manager. PodWarden injects secrets via secret_refs at deploy time — they never appear in template definitions or deployment logs.
5. Container Registry (optional but recommended)
Jetson workers require ARM64 images built on top of NVIDIA's L4T base (nvcr.io/nvidia/l4t-base). You'll likely maintain a custom FFmpeg image for your pipeline.
Deploy Harbor or Gitea (with built-in registry) from Hub. Build your ARM64 FFmpeg image once and push it there. All Jetson workers pull from your internal registry — no external registry dependency, no rate limits.
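A minimal sketch of what that custom image might look like. The base image is the one named above; the tag and package choices are assumptions — on Jetson, the stock Ubuntu `ffmpeg` package may lack NVENC support, so production images often build FFmpeg from source against the L4T multimedia APIs instead.

```dockerfile
# Sketch only: base image from this guide; tag and packages are assumptions.
ARG L4T_TAG=r36.2.0   # match the JetPack/L4T release running on your nodes
FROM nvcr.io/nvidia/l4t-base:${L4T_TAG}

# Stock ffmpeg may lack NVENC on Jetson -- production images typically
# build FFmpeg from source against the L4T multimedia APIs instead.
RUN apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY worker.py /app/worker.py
WORKDIR /app
CMD ["python3", "worker.py"]
```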
Worker Templates
Jetson Orin NX worker (ARM64)
Kind: Deployment
Image: registry.internal/ffmpeg-worker:latest-arm64
GPU count: 1
VRAM: 8Gi
CPU: 4
Memory: 8Gi
Node selector: { "nvidia.com/gpu.product": "Orin" }

| Variable | Example | Description |
|---|---|---|
| QUEUE_URL | redis://redis.mesh:6379 | Job queue connection |
| QUEUE_NAME | transcode-jobs | Queue name |
| INPUT_BUCKET | s3://media-ingest | Source video bucket |
| OUTPUT_BUCKET | s3://media-output | Encoded output bucket |
| PRESET | hls-ladder | Encoding profile |
| CONCURRENCY | 3 | Parallel encode sessions per node |
| HWACCEL | cuda | Hardware acceleration |
| NVENC_PRESET | p4 | NVENC quality/speed (p1–p7) |
| S3_ENDPOINT_URL | http://minio.mesh:9000 | Internal MinIO endpoint |
Sensitive values (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, queue password) come from Vault via secret_refs.
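A worker can assemble its configuration directly from these variables at startup. The sketch below reads the names from the table above; the dataclass itself is an illustration, not part of PodWarden, and QUEUE_URL is treated as required so a misconfigured pod fails fast.

```python
# Sketch: assemble worker configuration from the environment variables
# in the table above. The dataclass is illustrative, not PodWarden API.
import os
from dataclasses import dataclass

@dataclass
class WorkerConfig:
    queue_url: str
    queue_name: str
    input_bucket: str
    output_bucket: str
    preset: str
    concurrency: int
    nvenc_preset: str

def load_config(env=os.environ) -> WorkerConfig:
    return WorkerConfig(
        queue_url=env["QUEUE_URL"],  # required: fail fast if absent
        queue_name=env.get("QUEUE_NAME", "transcode-jobs"),
        input_bucket=env.get("INPUT_BUCKET", "s3://media-ingest"),
        output_bucket=env.get("OUTPUT_BUCKET", "s3://media-output"),
        preset=env.get("PRESET", "hls-ladder"),
        concurrency=int(env.get("CONCURRENCY", "3")),
        nvenc_preset=env.get("NVENC_PRESET", "p4"),
    )
```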
x86 GPU server worker (amd64)
For heavier jobs — multi-stream 4K, HDR tonemapping, complex filter graphs, ProRes output — use a traditional GPU server:
Kind: Deployment
Image: registry.internal/ffmpeg-worker:latest-amd64
GPU count: 1
VRAM: 16Gi
CPU: 16
Memory: 32Gi
Node selector: { "kubernetes.io/arch": "amd64" }

Same environment variables, different image architecture and node selector. Both worker types coexist in the same cluster. The queue routes job types to the appropriate workers.
Volume mounts
| Path | Volume type | Purpose |
|---|---|---|
| /tmp/transcode | emptyDir | Working directory for in-flight segments |
Source video and output are handled via S3 API calls from within the worker — no persistent mounts needed unless you're using NFS for source files.
Multi-Profile Deployments
Maintain separate stacks for each encoding profile:
| Profile | Target | Notes |
|---|---|---|
| HLS adaptive bitrate | Web playback | 1080p/720p/480p/360p ladder, fMP4 segments |
| Broadcast archive | Long-term storage | ProRes 422 HQ or DNxHR — CPU-encoded on x86 nodes |
| Social clips | Short-form platforms | H.264/AV1, vertical and square crops |
| Proxy generation | Editorial workflows | Low-res H.264, fast encode for NLE preview |
Each profile is a separate stack with a different PRESET value. Deploy all profiles to the same cluster. The queue routes job types to the matching worker — Jetson nodes handle the volume, x86 GPU servers handle the exceptions.
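One simple way to implement that routing is a per-profile queue name, so each worker stack only polls its own queue. The profile keys below follow the table above; the `transcode-jobs:<profile>` naming convention and the jetson/x86 split are assumptions of this sketch.

```python
# Sketch: route each job to a per-profile queue so the matching worker
# stack picks it up. The queue-name convention is an assumption.
PROFILES = {
    "hls-ladder":        "jetson",  # volume work: Jetson NVENC workers
    "social-clips":      "jetson",
    "proxy":             "jetson",
    "broadcast-archive": "x86",     # ProRes/DNxHR: CPU-encoded on x86 nodes
}

def queue_for(profile: str) -> str:
    if profile not in PROFILES:
        raise ValueError(f"unknown profile {profile!r}")
    return f"transcode-jobs:{profile}"
```

A dispatcher pushes each incoming job onto `queue_for(job["preset"])`; each worker stack sets QUEUE_NAME to its own profile's queue.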
Multi-Profile Routing
Scaling the Fleet
Jetson nodes are inexpensive enough that horizontal scaling is usually the right answer. Add nodes, join the cluster, workers start picking up jobs automatically. No queue reconfiguration, no storage changes.
For temporary capacity spikes, rent cloud GPU nodes, join them to the cluster with an x86 worker template, and remove them when the backlog clears.
Scaling Lifecycle
Job kind for batch processing — For a one-time migration or catalogue re-encode, use kind: Job instead of a Deployment. The job processes the queue and stops when complete. PodWarden records run duration and exit code. Batch jobs don't idle after finishing — important when running on rented nodes.
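The drain-and-exit behaviour of such a batch worker can be sketched as a loop that stops once the queue has been empty for a timeout. As before, `queue` is anything with a redis-py-style `blpop`, and the timeout threshold is an assumption.

```python
# Sketch: a kind: Job batch worker drains the queue and exits instead of
# idling -- relevant when the nodes are rented. `queue` needs a
# redis-py-style blpop(name, timeout=...) method.
def drain(queue, handle, queue_name="transcode-jobs", idle_timeout=30):
    """Process jobs until blpop times out (backlog cleared), then return."""
    processed = 0
    while True:
        item = queue.blpop(queue_name, timeout=idle_timeout)
        if item is None:  # queue empty for idle_timeout seconds: done
            return processed
        _, raw = item
        handle(raw)
        processed += 1
```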
Networking
Jetson nodes behind NAT (home labs, edge deployments) connect via Tailscale mesh — no public IP needed. PodWarden auto-detects Tailscale-connected nodes and tags them mesh. The MinIO, Redis, and Vault instances on your mesh are reachable from all worker nodes.
For latency-sensitive live transcoding, co-locate Jetson nodes with your ingest infrastructure on the same LAN. Tag those nodes lan and set the worker template to require lan connectivity — PodWarden schedules only on nodes that can reach the ingest source.
Hub Templates for This Stack
| Template | Role |
|---|---|
| FFmpeg worker (Jetson NVENC) | ARM64 NVENC/NVDEC worker |
| FFmpeg worker (x86 NVENC) | amd64 GPU-accelerated worker |
| FFmpeg worker (CPU) | Software encoding, any architecture |
| MinIO | S3-compatible object storage |
| RustFS | High-performance S3 object storage |
| Redis | Job queue |
| RabbitMQ | Durable job queue with dead-letter support |
| PostgreSQL | Job tracking database |
| Vault | Secrets management |
| Prometheus | Metrics collection |
| Grafana | Transcoding pipeline dashboards |
| DCGM Exporter | Per-GPU metrics for x86 nodes (DaemonSet) |
| Tegrastats Exporter | Per-GPU metrics for Jetson nodes (DaemonSet) |
| Harbor | Private container registry for custom ARM64 images |
The complete pipeline — storage, queue, workers, monitoring — runs on your own nodes, managed from one dashboard. No external dependencies unless you choose them.