Three Paths to GPU Scheduling, One Platform for All
When teams start scaling AI and HPC workloads on Kubernetes, one question appears quickly: which scheduler is a good fit for real GPU work?
At Radiant, we wanted a practical answer. So we benchmarked three different approaches to Kubernetes batch scheduling:
- Kueue for Kubernetes-native queuing and admission control
- Volcano for batch scheduling with queue management and native workflow primitives
- Slinky for teams that want the power of Slurm inside Kubernetes
The main takeaway for users evaluating Radiant is straightforward:
You are not locked into a single scheduling model on Radiant.
If your team prefers a lightweight Kubernetes-native approach, Kueue is a practical option. If you need richer batch semantics and native workflow support, Volcano covers more ground. If your users already operate in a Slurm environment and want familiar HPC controls such as QOS, fair-share, and dependency chains, Slinky brings that model into Kubernetes.
Radiant provides a GPU platform where teams can choose and run the scheduling approach that fits their workload style, operational preferences, and user expectations.
Why We Ran This Benchmark
A scheduler matters most when the cluster is busy.
In real GPU environments, teams care about more than simply starting pods. They need to know:
- Will an urgent job jump ahead when GPUs are busy?
- Can one team avoid starving another team?
- Can lower-priority work be preempted safely?
- Can multi-stage pipelines run in the right order?
- What happens if a controller restarts mid-run?
These are the questions that matter when choosing a platform.
That is why our benchmark focused on scheduler behavior under pressure, not synthetic marketing claims. We tested:
- Installation complexity
- Configuration overhead
- Job priorities
- Fair-share scheduling
- Preemption support
- Job dependencies
- Recovery for queued jobs
- GPU/resource allocation correctness
- Observability and debuggability
- Stability
How We Tested
To isolate scheduler behavior from model-specific noise, we used lightweight mock GPU jobs based on nvidia/cuda:12.4.0-base-ubuntu22.04. These jobs requested real nvidia.com/gpu resources and used controlled sleep durations to simulate workload stages.
The benchmark included:
- Blocker jobs that occupied 1-2 GPUs to create contention
- Competing jobs that tested queue order, priority, and fairness
- Preemption scenarios where urgent jobs arrived after lower-priority work was already running
- Three-stage pipelines representing preprocess, train, and evaluation flows
- Recovery scenarios where scheduler/controller components were deliberately interrupted
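A mock job of the kind described above can be sketched as an ordinary Kubernetes Job. The name, sleep duration, and GPU count here are illustrative, not the exact manifests from the benchmark:

```yaml
# Illustrative blocker job: holds one GPU for 5 minutes to create contention.
apiVersion: batch/v1
kind: Job
metadata:
  name: blocker-job
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: mock-gpu
        image: nvidia/cuda:12.4.0-base-ubuntu22.04
        command: ["sleep", "300"]
        resources:
          limits:
            nvidia.com/gpu: "1"
```

Because the container only sleeps, the job exercises the scheduler's queuing, priority, and preemption logic without any model-specific noise.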
Test environments:
- Kueue: vcluster environment with 2x NVIDIA H200 141GB HBM3
- Slinky and Volcano: dedicated host Kubernetes cluster with 2x NVIDIA H100 80GB
That split was not arbitrary. In our setup, Slinky and Volcano needed a dedicated host Kubernetes cluster for different technical reasons:
- Slinky requires privileged worker pods and elevated Linux capabilities, which were blocked by the baseline PodSecurity restrictions in the vcluster environment.
- Volcano was affected by a vcluster late-bind synchronization issue: its scheduler assigns nodeName after pod creation, and that late update was not propagated correctly to the host cluster, leaving pods stuck in Pending.
The Short Version
Here is the practical takeaway:
- Kueue is a straightforward path for teams that want to stay close to native Kubernetes concepts.
- Volcano provides a broad set of Kubernetes-native batch features.
- Slinky fits HPC-oriented teams that want Slurm semantics on Kubernetes.
All three handled the benchmark scenarios correctly. The difference is in how they do it, how much setup they require, and which operating model your users want.
What We Found
Kueue: The Kubernetes-Native Choice
Kueue performed well in the areas that Kubernetes-first teams usually care about:
- Minimal job-level changes
- Clear queue and admission model
- Clear priority handling
- Fast preemption once enabled
- Good recovery behavior because state lives in Kubernetes CRDs
In our tests, Kueue admitted a later high-priority job ahead of earlier low-priority work and preempted running lower-priority work in about a second when configured with withinClusterQueue: LowerPriority.
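That preemption policy lives on the ClusterQueue. A minimal sketch, assuming a single flavor and illustrative quota values (queue and flavor names here are hypothetical):

```yaml
# Illustrative Kueue ClusterQueue with lower-priority preemption enabled.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 2
  preemption:
    # Allow admitted workloads to evict lower-priority workloads
    # in the same ClusterQueue.
    withinClusterQueue: LowerPriority
```

Without the preemption stanza, a high-priority job would still be admitted ahead of queued work but would wait for running jobs to finish rather than evict them.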
That makes Kueue a good match for platform teams that want to preserve a standard Kubernetes developer experience. Users continue to submit ordinary Kubernetes Jobs, while the platform controls admission centrally.
The trade-off is that Kueue does not natively solve everything. Fair-share behavior is not provided by default, and job dependencies require an additional workflow layer such as Argo or Tekton.
Good fit for Radiant users: teams that want a Kubernetes workflow with low operational overhead and close integration with existing K8s tooling.
Volcano: The Kubernetes Batch Specialist
Volcano offered one of the broadest feature sets in the benchmark.
It handled:
- Priority scheduling
- Fair-share via queue weights and scheduling plugins
- Native workflow ordering through JobFlow and JobTemplate
- Recovery without special persistence configuration
- Solid GPU allocation and operational visibility
Its native pipeline support was notable. We defined a three-stage workflow declaratively and Volcano executed it with zero-gap handoffs between stages. For teams building multi-step AI or data processing pipelines, that is a useful capability.
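A three-stage pipeline of that shape can be sketched with Volcano's JobFlow API. Each flow name references a JobTemplate of the same name; the pipeline and stage names here are illustrative:

```yaml
# Illustrative Volcano JobFlow: preprocess -> train -> evaluate.
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: train-pipeline
spec:
  jobRetainPolicy: retain   # keep completed jobs for inspection
  flows:
  - name: preprocess
  - name: train
    dependsOn:
      targets: ["preprocess"]
  - name: evaluate
    dependsOn:
      targets: ["train"]
```

Volcano starts each stage as soon as its dependencies complete, which is what produced the zero-gap handoffs we observed.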
Volcano also recovered cleanly in destructive testing. After deleting both scheduler and controller pods, the running job continued unaffected and queued jobs were dispatched normally once resources became available again. Because the state is stored in Kubernetes CRDs, recovery is direct and fast.
The biggest nuance is preemption. Volcano supports it, but it's not as intuitive as a simple single-queue priority model. In our successful test, preemption worked through cross-queue reclaim, using asymmetric queue weights and 1-GPU jobs. It was effective, but more architectural tuning was required than with Kueue or Slinky.
In our environment, Volcano also needed a dedicated host Kubernetes cluster rather than a vcluster. The reason was not Volcano's scheduling model itself, but the interaction with the vcluster sync layer: Volcano sets nodeName after pod creation, and that late scheduling update was not propagated correctly to the host cluster, so pods remained Pending.
Good fit for Radiant users: teams that want a Kubernetes-native batch scheduler with richer scheduling behavior, native pipeline orchestration, and clear control over queue-based resource sharing.
Slinky: Slurm Semantics on Kubernetes
Slinky was the most operationally involved option, but it also delivered the most familiar HPC scheduling behavior.
It gave us:
- Native Slurm QOS and priority handling
- Clear fair-share behavior
- Near-instant preemption
- Native dependency chains via sbatch --dependency=afterok
- Good queue introspection through squeue, sprio, sshare, and sacct
This makes Slinky a good option for organizations already familiar with Slurm. Instead of asking those teams to adopt a different scheduling mental model, they can install Slinky on Radiant and keep a Slurm-style operational experience inside Kubernetes.
In our tests, Slinky behaved as expected for priority, fair-share, and preemption. A low-QOS running job was preempted in under a second when a high-QOS job arrived. Fair-share also behaved as expected for HPC users: the lower-usage team gained a scheduling advantage over the heavier user.
The trade-off is setup and operational complexity. Slinky needed more moving pieces, including Slurm accounting and a MariaDB backend, and it requires privileged worker pods. That means it is not always the easiest first step for cloud-native teams or restricted multi-tenant environments.
That privileged execution model is also why Slinky needed a dedicated host Kubernetes cluster in our benchmark. In the vcluster environment, the required privileges and Linux capabilities for slurmd were blocked by the baseline PodSecurity policy.
Good fit for Radiant users: HPC and research teams that already use Slurm and want to preserve that experience while gaining the deployment and infrastructure benefits of Kubernetes.
Cross-Framework Lessons
A few patterns became clear across the benchmark.
To make that more concrete, here is the condensed cross-framework view from the benchmark runs:
- Kueue: native priorities; fair-share needs extra configuration; fast preemption once enabled; dependencies via an external workflow layer; CRD-backed recovery
- Volcano: priorities and fair-share via queue weights; preemption via cross-queue reclaim; native JobFlow dependencies; CRD-backed recovery
- Slinky: Slurm QOS priorities; native fair-share; near-instant preemption; sbatch dependency chains; recovery depends on persistence configuration
This is where Radiant is useful: users can run different scheduler models on the platform without being forced into a single workflow pattern.
1. There is no single "best" scheduler for every user
This is why the scheduler choice on Radiant matters.
Different teams want different operating models:
- Some want Kubernetes-first simplicity
- Some want advanced batch scheduling without leaving Kubernetes
- Some want the full Slurm mindset they already know
Supporting only one scheduler would force some users into the wrong abstraction. Supporting multiple proven frameworks lets customers match tooling to workload reality.
2. Preemption is not the same across platforms
All three frameworks supported preemption, but the implementation model differed significantly:
- Kueue: fast and Kubernetes-native once explicitly enabled
- Slinky: direct using Slurm QOS preemption
- Volcano: effective, but most dependent on queue architecture and scheduler tuning
This matters for real production design. "Supports preemption" is not enough. The exact control model affects predictability, operational simplicity, and user expectations.
3. Native dependency handling is a major differentiator
For multi-stage pipelines:
- Slinky offers proven Slurm dependency primitives
- Volcano offers native declarative workflow support
- Kueue needs an external workflow layer
If your users run chained jobs regularly, this difference becomes important very quickly.
4. Recovery behavior separates platform-grade tools from demos
All three frameworks showed meaningful recovery capability, but with different foundations:
- Kueue and Volcano benefit from Kubernetes-native CRD-backed state
- Slinky can recover well too, but only when persistence is configured correctly
That distinction matters in production, and it highlights why scheduler choice is also an operational architecture choice.
What This Means for Radiant Users
The broader takeaway is not that one framework "won." It is that different workload styles can be supported on the same Radiant platform.
That matters because real customers rarely look the same:
- AI platform teams may prefer Kueue for its native Kubernetes ergonomics
- Batch and ML pipeline teams may prefer Volcano for its richer queueing and workflow model
- Research, simulation, and HPC teams may prefer Slinky because Slurm is already part of their operating DNA
On Radiant, those users do not need to give up a shared platform just because they prefer different scheduling semantics.
They can run on the same GPU platform, benefit from the same operational environment, and choose the scheduler model that best matches how they work.
Practical Guidance
If you want the simplest Kubernetes-native experience, start with Kueue.
If you want a broader Kubernetes batch feature set, start with Volcano.
If your organization already lives in Slurm and wants that same control plane logic on Kubernetes, use Slinky.
That kind of user choice is what Radiant is intended to support.
For the raw benchmark details (including how to run the benchmark yourself), see the benchmark repository.
