
Benchmarking Kueue, Volcano, and Slinky on Radiant

10 Min Read
April 23, 2026

Three Paths to GPU Scheduling, One Platform for All

When teams start scaling AI and HPC workloads on Kubernetes, one question appears quickly: which scheduler is a good fit for real GPU work?

At Radiant, we wanted a practical answer. So we benchmarked three different approaches to Kubernetes batch scheduling:

  • Kueue for Kubernetes-native queuing and admission control
  • Volcano for batch scheduling with queue management and native workflow primitives
  • Slinky for teams that want the power of Slurm inside Kubernetes

The main takeaway for users evaluating Radiant is straightforward:

You are not locked into a single scheduling model on Radiant.

If your team prefers a lightweight Kubernetes-native approach, Kueue is a practical option. If you need richer batch semantics and native workflow support, Volcano covers more ground. If your users already operate in a Slurm environment and want familiar HPC controls such as QOS, fair-share, and dependency chains, Slinky brings that model into Kubernetes.

Radiant provides a GPU platform where teams can choose and run the scheduling approach that fits their workload style, operational preferences, and user expectations.

Why We Ran This Benchmark

A scheduler matters most when the cluster is busy.

In real GPU environments, teams care about more than simply starting pods. They need to know:

  • Will an urgent job jump ahead when GPUs are busy?
  • Can one team avoid starving another team?
  • Can lower-priority work be preempted safely?
  • Can multi-stage pipelines run in the right order?
  • What happens if a controller restarts mid-run?

These are the questions that matter when choosing a platform.

That is why our benchmark focused on scheduler behavior under pressure, not synthetic marketing claims. We tested:

  1. Installation complexity
  2. Configuration overhead
  3. Job priorities
  4. Fair-share scheduling
  5. Preemption support
  6. Job dependencies
  7. Recovery for queued jobs
  8. GPU/resource allocation correctness
  9. Observability and debuggability
  10. Stability

How We Tested

To isolate scheduler behavior from model-specific noise, we used lightweight mock GPU jobs based on nvidia/cuda:12.4.0-base-ubuntu22.04. These jobs requested real nvidia.com/gpu resources and used controlled sleep durations to simulate workload stages.

The benchmark included:

  • Blocker jobs that occupied 1-2 GPUs to create contention
  • Competing jobs that tested queue order, priority, and fairness
  • Preemption scenarios where urgent jobs arrived after lower-priority work was already running
  • Three-stage pipelines representing preprocess, train, and evaluation flows
  • Recovery scenarios where scheduler/controller components were deliberately interrupted
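A blocker job of the kind described above can be sketched as an ordinary Kubernetes Job. This is an illustrative manifest, not one of the exact benchmark manifests; the name and sleep duration are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mock-gpu-blocker          # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mock-stage
          image: nvidia/cuda:12.4.0-base-ubuntu22.04
          # A controlled sleep stands in for a workload stage.
          command: ["/bin/sh", "-c", "sleep 300"]
          resources:
            limits:
              nvidia.com/gpu: 1   # real GPU request, so the job creates contention
```

Because the job requests a real nvidia.com/gpu resource, it occupies the GPU for the scheduler exactly as a training job would, while the sleep keeps runtime deterministic.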

Test environments:

  • Kueue: vcluster environment with 2x NVIDIA H200 141GB HBM3
  • Slinky and Volcano: dedicated host Kubernetes cluster with 2x NVIDIA H100 80GB

That split was not arbitrary. In our setup, Slinky and Volcano needed a dedicated host Kubernetes cluster for different technical reasons:

  • Slinky requires privileged worker pods and elevated Linux capabilities, which were blocked by the baseline PodSecurity restrictions in the vcluster environment.

  • Volcano was affected by a vcluster late-bind synchronization issue: its scheduler assigns nodeName after pod creation, and that late update was not propagated correctly to the host cluster, leaving pods stuck in Pending.

The Short Version

Here is the practical takeaway:

  • Kueue is a straightforward path for teams that want to stay close to native Kubernetes concepts.
  • Volcano provides a broad set of Kubernetes-native batch features.
  • Slinky fits HPC-oriented teams that want Slurm semantics on Kubernetes.

All three handled the benchmark scenarios correctly. The difference is in how they do it, how much setup they require, and which operating model your users want.

What We Found

Kueue: The Kubernetes-Native Choice

Kueue performed well in the areas that Kubernetes-first teams usually care about:

  • Minimal job-level changes
  • Clear queue and admission model
  • Clear priority handling
  • Fast preemption once enabled
  • Good recovery behavior because state lives in Kubernetes CRDs

In our tests, Kueue admitted a later high-priority job ahead of earlier low-priority work and preempted running lower-priority work in about a second when configured with withinClusterQueue: LowerPriority.
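As a minimal sketch, that preemption policy sits on the ClusterQueue. The queue name, flavor name, and quota below are illustrative, not our exact benchmark configuration:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue               # illustrative name
spec:
  namespaceSelector: {}         # admit workloads from all namespaces
  preemption:
    # Let admitted workloads evict lower-priority workloads
    # in the same ClusterQueue.
    withinClusterQueue: LowerPriority
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 2
```

With this in place, users keep submitting plain Jobs pointed at a LocalQueue; the preemption behavior is entirely a platform-side setting.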

That makes Kueue a good match for platform teams that want to preserve a standard Kubernetes developer experience. Users continue to submit ordinary Kubernetes Jobs, while the platform controls admission centrally.

The trade-off is that Kueue does not natively solve everything. Fair-share behavior is not provided by default, and job dependencies require an additional workflow layer such as Argo or Tekton.

Good fit for Radiant users: teams that want a Kubernetes workflow with low operational overhead and close integration with existing K8s tooling.

Volcano: The Kubernetes Batch Specialist

Volcano offered one of the broader feature sets in the benchmark.

It handled:

  • Priority scheduling
  • Fair-share via queue weights and scheduling plugins
  • Native workflow ordering through JobFlow and JobTemplate
  • Recovery without special persistence configuration
  • Solid GPU allocation and operational visibility

Its native pipeline support was notable. We defined a three-stage workflow declaratively and Volcano executed it with zero-gap handoffs between stages. For teams building multi-step AI or data processing pipelines, that is a useful capability.
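A three-stage pipeline of the kind we ran can be sketched with a JobFlow, where each flow entry references a JobTemplate of the same name. The stage names here are illustrative:

```yaml
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: train-pipeline          # illustrative name
spec:
  jobRetainPolicy: retain       # keep finished stage jobs for inspection
  flows:
    - name: preprocess          # runs first, no dependencies
    - name: train
      dependsOn:
        targets: ["preprocess"] # starts when preprocess succeeds
    - name: evaluate
      dependsOn:
        targets: ["train"]
```

The declarative dependsOn graph is what produced the zero-gap handoffs we observed: Volcano dispatches the next stage as soon as its dependency completes, with no external workflow engine involved.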

Volcano also recovered cleanly in destructive testing. After deleting both scheduler and controller pods, the running job continued unaffected and queued jobs were dispatched normally once resources became available again. Because the state is stored in Kubernetes CRDs, recovery is direct and fast.

The biggest nuance is preemption. Volcano supports it, but it's not as intuitive as a simple single-queue priority model. In our successful test, preemption worked through cross-queue reclaim, using asymmetric queue weights and 1-GPU jobs. It was effective, but more architectural tuning was required than with Kueue or Slinky.
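As a sketch of that setup, cross-queue reclaim is driven by asymmetric Queue weights. The names and weight values below are illustrative:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: urgent                  # illustrative name
spec:
  weight: 8                     # larger share of the cluster
  reclaimable: true
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: best-effort
spec:
  weight: 1                     # small share; jobs here can be reclaimed
  reclaimable: true
```

When the urgent queue is under its weighted share and the cluster is full, the reclaim action can evict reclaimable jobs from the best-effort queue, which is the mechanism our successful preemption test relied on.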

In our environment, Volcano also needed a dedicated host Kubernetes cluster rather than a vcluster. The reason was not Volcano's scheduling model itself, but the interaction with the vcluster sync layer: Volcano sets nodeName after pod creation, and that late scheduling update was not propagated correctly to the host cluster, so pods remained Pending.

Good fit for Radiant users: teams that want a Kubernetes-native batch scheduler with richer scheduling behavior, native pipeline orchestration, and clear control over queue-based resource sharing.

Slinky: Slurm Semantics on Kubernetes

Slinky was the most operationally involved option, but it also delivered the most familiar HPC scheduling behavior.

It gave us:

  • Native Slurm QOS and priority handling
  • Clear fair-share behavior
  • Near-instant preemption
  • Native dependency chains via sbatch --dependency=afterok
  • Good queue introspection through squeue, sprio, sshare, and sacct
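The dependency chains above follow standard Slurm usage. This is an illustrative submission sketch (script names are placeholders), using --parsable so each sbatch call prints only the job ID:

```bash
#!/bin/bash
# Chain three stages so each starts only if the previous one succeeds.
pre=$(sbatch --parsable preprocess.sh)
train=$(sbatch --parsable --dependency=afterok:${pre} train.sh)
sbatch --dependency=afterok:${train} evaluate.sh

# Inspect queue state and priority factors:
squeue -u "$USER"
sprio
```

For Slurm users this is muscle memory, which is precisely the point: Slinky preserves the submission workflow rather than replacing it.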

This makes Slinky a good option for organizations already familiar with Slurm. Instead of asking those teams to adopt a different scheduling mental model, they can install Slinky on Radiant and keep a Slurm-style operational experience inside Kubernetes.

In our tests, Slinky behaved as expected for priority, fair-share, and preemption. A low-QOS running job was preempted in under a second when a high-QOS job arrived. Fair-share also behaved as expected for HPC users: the lower-usage team gained a scheduling advantage over the heavier user.

The trade-off is setup and operational complexity. Slinky needed more moving pieces, including Slurm accounting and a MariaDB backend, and it requires privileged worker pods. That means it is not always the easiest first step for cloud-native teams or restricted multi-tenant environments.

That privileged execution model is also why Slinky needed a dedicated host Kubernetes cluster in our benchmark. In the vcluster environment, the required privileges and Linux capabilities for slurmd were blocked by the baseline PodSecurity policy.

Good fit for Radiant users: HPC and research teams that already use Slurm and want to preserve that experience while gaining the deployment and infrastructure benefits of Kubernetes.

Cross-Framework Lessons

A few patterns became clear across the benchmark.

To make that more concrete, here is the condensed cross-framework view from the benchmark runs:

Criterion                   | Kueue                     | Slinky                             | Volcano
----------------------------|---------------------------|------------------------------------|-------------------------------------------
Installation complexity     | Medium                    | High                               | Low-Medium
Configuration overhead      | Low                       | Medium-High                        | Low
Job priorities              | Pass (66s gain)           | Pass (60s gain)                    | Pass (63s gain)
Fair-share scheduling       | Not supported by default  | Pass (3x priority for light user)  | Pass (proportion plugin, queue-based)
Preemption support          | Pass (~1s)                | Pass (<1s via QOS)                 | Pass with caveats (~35s, cross-queue only)
Job dependencies            | Not natively supported    | Pass (native --dependency=afterok) | Pass (native JobFlow CRDs, zero-gap)
Recovery for queued jobs    | Pass                      | Pass (requires persistence)        | Pass (zero config, ~4s recovery)
GPU/resource allocation     | Pass                      | Pass                               | Pass
Observability/debuggability | Good                      | Good                               | Good

This is where Radiant is useful: users can run different scheduler models on the platform without being forced into a single workflow pattern.

1. There is no single "best" scheduler for every user

This is why the scheduler choice on Radiant matters.

Different teams want different operating models:

  • Some want Kubernetes-first simplicity
  • Some want advanced batch scheduling without leaving Kubernetes
  • Some want the full Slurm mindset they already know

Supporting only one scheduler would force some users into the wrong abstraction. Supporting multiple proven frameworks lets customers match tooling to workload reality.

2. Preemption is not the same across platforms

All three frameworks supported preemption, but the implementation model differed significantly:

  • Kueue: fast and Kubernetes-native once explicitly enabled
  • Slinky: direct using Slurm QOS preemption
  • Volcano: effective, but most dependent on queue architecture and scheduler tuning

This matters for real production design. "Supports preemption" is not enough. The exact control model affects predictability, operational simplicity, and user expectations.

3. Native dependency handling is a major differentiator

For multi-stage pipelines:

  • Slinky offers proven Slurm dependency primitives
  • Volcano offers native declarative workflow support
  • Kueue needs an external workflow layer

If your users run chained jobs regularly, this difference becomes important very quickly.

4. Recovery behavior separates platform-grade tools from demos

All three frameworks showed meaningful recovery capability, but with different foundations:

  • Kueue and Volcano benefit from Kubernetes-native CRD-backed state
  • Slinky can recover well too, but only when persistence is configured correctly

That distinction matters in production, and it highlights why scheduler choice is also an operational architecture choice.

What This Means for Radiant Users

The broader takeaway is not that one framework "won." It is that different workload styles can be supported on the same Radiant platform.

That matters because real customers rarely look the same:

  • AI platform teams may prefer Kueue for its native Kubernetes ergonomics
  • Batch and ML pipeline teams may prefer Volcano for its richer queueing and workflow model
  • Research, simulation, and HPC teams may prefer Slinky because Slurm is already part of their operating DNA

On Radiant, those users do not need to give up a shared platform just because they prefer different scheduling semantics.

They can run on the same GPU platform, benefit from the same operational environment, and choose the scheduler model that best matches how they work.

Practical Guidance

If you want the simplest Kubernetes-native experience, start with Kueue.

If you want a broader Kubernetes batch feature set, start with Volcano.

If your organization already lives in Slurm and wants that same control plane logic on Kubernetes, use Slinky.

That kind of user choice is what Radiant is intended to support.

For the raw benchmark details (including how to run the benchmark yourself), see the benchmark repository.
