
How to Run Fish Audio S2 Pro on Radiant AI Cloud

Open Source Text to Speech
5 Min Read
May 27, 2026

The last year has seen an explosion in open-source voice AI, with models such as NVIDIA Magpie, Voxtral, Kokoro, and Maya1 beginning to produce audio comparable in quality to proprietary models. Despite this rapid progress, most text-to-speech (TTS) systems still struggle to sound genuinely human. Speech generation often breaks down when emotion, pacing, emphasis, or conversational nuance enter the picture: voices go flat, multi-speaker interactions feel robotic, and fine-grained control typically requires rigid SSML syntax.

Fish Audio S2 Pro approaches the problem differently. Rather than treating speech generation as a static text-to-waveform task, S2 Pro introduces a more expressive and controllable framework for voice synthesis. It combines natural-language emotional instructions, inline prosody control, multilingual generation, and low-latency streaming into a single open-weight model. The result is one of the most flexible open-weight TTS systems currently available.

In this guide, we’ll run Fish Audio S2 Pro on Radiant AI Cloud and explore what makes the model particularly compelling for production voice applications.

Model Overview

Most TTS systems optimize for intelligibility and speed. Fish Audio S2 Pro pushes further into expressiveness and instruction following.

The model supports:

  • Natural emotional and tone control using inline tags such as [whisper], [laugh], [professional broadcast tone], and [excited]
  • More than 15,000 unique tags, not limited to presets: free-form text descriptions are also supported
  • Multi-speaker voice generation in a single output stream
  • Voice cloning from short reference samples
  • Streaming low-latency inference
  • Automatic multilingual generation across 80+ languages

Unlike older speech systems that depend heavily on structured SSML markup, S2 Pro allows developers to shape delivery using free-form natural-language cues directly inside prompts. 
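Because the cues are plain bracketed text, they are easy to inspect before a prompt is sent to the model. The sketch below is a small helper of our own, not part of Fish Speech: it lists the inline cues in a prompt and recovers the text without them. It assumes cues are never nested inside each other.

```python
import re

# Matches one inline [cue]; nested brackets are not handled in this sketch.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(prompt: str) -> list[str]:
    """Return the inline bracket cues found in a prompt, in order."""
    return [tag.strip() for tag in TAG_PATTERN.findall(prompt)]


def strip_tags(prompt: str) -> str:
    """Return the prompt with bracket cues removed and whitespace tidied."""
    without_tags = TAG_PATTERN.sub(" ", prompt)
    return re.sub(r"\s+", " ", without_tags).strip()


prompt = "[professional broadcast tone] Welcome back to the show! [laugh] Let's begin."
print(extract_tags(prompt))
print(strip_tags(prompt))
```

A helper like this is handy for logging which cues were used in a request, or for estimating spoken length from the tag-free text.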

Fish Audio S2 Pro uses a Dual-Autoregressive architecture built around a decoder-only transformer paired with an audio codec based on residual vector quantization (RVQ). The system is split into two generation stages:

  • A larger “Slow AR” model predicts semantic speech structure over time
  • A smaller “Fast AR” model reconstructs fine-grained acoustic detail across residual codec streams 

This separation allows the model to maintain both high expressiveness and relatively efficient inference performance.
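The control flow of the two stages can be pictured with a toy loop. This is an illustration only, not the actual Fish Audio implementation: the two "models" are random stand-ins, the vocabulary sizes are hypothetical, and treating the Slow AR output as the first code of each frame is our assumption. Only the codebook count of 10 comes from the model's published specifications.

```python
import random

NUM_CODEBOOKS = 10  # codebook count from the model specifications


def slow_ar_step(history: list[list[int]], rng: random.Random) -> int:
    """Stand-in for the Slow AR model: picks the next token along the time axis."""
    return rng.randrange(4096)  # hypothetical vocabulary size


def fast_ar_step(semantic: int, partial: list[int], rng: random.Random) -> int:
    """Stand-in for the Fast AR model: picks the next residual code in this frame."""
    return rng.randrange(1024)  # hypothetical codebook size


def generate(num_frames: int, seed: int = 0) -> list[list[int]]:
    """Produce num_frames frames, each holding NUM_CODEBOOKS codes."""
    rng = random.Random(seed)
    frames: list[list[int]] = []
    for _ in range(num_frames):
        # Stage 1: one step of the larger model per frame.
        semantic = slow_ar_step(frames, rng)
        codes = [semantic]
        # Stage 2: the smaller model fills in the remaining codebooks.
        while len(codes) < NUM_CODEBOOKS:
            codes.append(fast_ar_step(semantic, codes, rng))
        frames.append(codes)
    return frames


frames = generate(num_frames=3)
print(len(frames), len(frames[0]))
```

The point of the structure is that the expensive model runs once per frame, while the cheap model runs several times per frame.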

Model Attribute | Specifications
Architecture | Dual-Autoregressive (Dual-AR) with RVQ codec
Total parameters | ~4.4B (4B Slow AR + 400M Fast AR)
Training data | 10M+ hours of audio
Audio compression | 10 codebooks, ~21 Hz frame rate
Input modalities | Text, reference audio (for zero-shot cloning)
Control syntax | Free-form [bracket] natural-language tags
Supported languages | 80+ (automatic detection)
License | Fish Audio Research License (open weights)
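The audio compression figures translate directly into how many codec codes represent a clip. The arithmetic below is a rough sketch that treats the approximate 21 Hz frame rate as exact.

```python
FRAME_RATE_HZ = 21   # approximate frame rate from the specifications
NUM_CODEBOOKS = 10   # codebooks per frame from the specifications


def codes_for_audio(seconds: float) -> tuple[int, int]:
    """Return (frames, total codec codes) for a clip of the given length."""
    frames = round(seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS


frames, codes = codes_for_audio(10.0)
print(frames, codes)  # 210 frames, 2100 codes
```

Ten seconds of speech is therefore on the order of 210 frames and 2,100 codes.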

Deploying Fish Audio S2 Pro on Radiant AI Cloud

Prerequisites

To get started, create a GPU virtual machine (VM) on Radiant AI Cloud.

We selected the NVIDIA H200 GPU for this tutorial for its wide availability and cost-effectiveness. H100 and L40S GPUs can also run this model. Upgrading to NVIDIA B200 hardware would unlock higher throughput and support a larger number of concurrent inference clients.

Step 1: Clone the Repository

SSH into your instance and clone the Fish Speech repository:

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

Install dependencies:

sudo apt-get update
# PortAudio headers are needed to build pyaudio
sudo apt-get install -y portaudio19-dev python3-all-dev
pip install -e .

Step 2: Download the Model

hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro 
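Before moving on, it can be worth confirming that the download actually landed. The sketch below uses only the Python standard library to count the files and total size under the target directory. It makes no assumption about which files the checkpoint contains.

```python
from pathlib import Path


def summarize_checkpoint(directory: str) -> tuple[int, float]:
    """Return (file count, total size in GiB) for everything under directory."""
    root = Path(directory)
    files = [p for p in root.rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1024**3


path = "checkpoints/s2-pro"  # same --local-dir as the download command
if Path(path).is_dir():
    count, gib = summarize_checkpoint(path)
    print(f"{count} files, {gib:.1f} GiB under {path}")
else:
    print(f"{path} not found; run the download step first")
```

An empty or very small directory usually means the download was interrupted and should be rerun.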

Step 3: Build the Awesome WebUI

Note: Make sure you’re running the latest version of Node.js.

sudo apt-get install -y nodejs
cd awesome_webui 
npm install
npm run build
# return to the repository root before starting the server
cd ..

Step 4: Start Inference Server

python tools/api_server.py --listen 0.0.0.0:8888 --compile 
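The server may take a while to start accepting connections, particularly with --compile enabled. A minimal readiness check, using only the standard library, is to poll the TCP port until a connection succeeds. This only confirms that something is listening on the port; it does not verify that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0,
                  interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and try again.
            time.sleep(interval_s)
    return False
```

For example, calling wait_for_port("127.0.0.1", 8888) on the VM returns True once the server started above is listening, or False if the timeout passes first.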

With the server running, a snapshot of the GPU instance shows about 20 GB of VRAM in use:

Fish Audio TTS Memory VRAM Requirements
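To take the same reading on your own instance, you can query nvidia-smi from a script. The sketch below wraps the standard --query-gpu option and parses its CSV output. The helper names are ours, and values are reported in MiB.

```python
import subprocess


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows in MiB, one GPU per line."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory per GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling query_gpu_memory() on the VM returns one (used, total) pair per GPU. Comparing the reading before and after starting the server isolates the model's share.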

Step 5: Generate Speech

You can now generate expressive speech directly from the UI at http://VM-IP:8888/ui, replacing VM-IP with the IP address of your instance.

Text to Speech via UI
Fish Audio on Awesome UI
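The same server can also be called from a script. The sketch below assumes the server exposes a POST /v1/tts endpoint that accepts a JSON body with "text" and "format" fields, as earlier Fish Speech releases did. The endpoint path and field names are assumptions here, so check the repository's API documentation for S2 Pro before relying on them.

```python
import json
import urllib.request


def build_tts_request(base_url: str, text: str,
                      audio_format: str = "wav") -> urllib.request.Request:
    """Build a POST request for the assumed /v1/tts endpoint."""
    payload = {"text": text, "format": audio_format}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_tts_request(base_url, text)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

For example, synthesize("http://VM-IP:8888", "[excited] Hello from the cloud!", "hello.wav") would save the result locally if the endpoint behaves as assumed.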

Voice Cloning

Fish Audio S2 Pro also supports voice cloning from reference audio. According to the Fish Audio documentation, users can upload short voice samples and generate consistent speech that carries the cloned speaker's characteristics. Check out the vLLM commands to generate voice clones using Fish Audio S2 Pro.
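As a rough sketch of what a cloning request could look like over HTTP, the helper below packages a reference clip and its transcript into a JSON-friendly payload. The "references", "audio", and "text" field names follow the request schema of earlier Fish Speech releases and are assumptions for S2 Pro. The base64 encoding is there because JSON cannot carry raw bytes.

```python
import base64


def build_reference(audio_bytes: bytes, transcript: str) -> dict:
    """Package one reference clip and its transcript (assumed field names)."""
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "text": transcript,
    }


def build_clone_payload(text: str, references: list[dict],
                        audio_format: str = "wav") -> dict:
    """Combine the target text with one or more reference clips."""
    return {"text": text, "references": references, "format": audio_format}


# Placeholder bytes stand in for the contents of a real reference recording.
reference = build_reference(b"RIFF", "This is my reference sentence.")
payload = build_clone_payload("Hello in a cloned voice.", [reference])
print(sorted(payload))
```

In practice the bytes would come from reading a short recording of the target speaker, and the transcript should match what is said in that recording.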

How good is Fish Audio S2 Pro?

What stands out immediately is not just voice quality, but controllability. Many TTS systems can sound realistic under ideal conditions. But it’s usually harder for speech models to maintain realism while dynamically shifting emotional tone, speaker identity, pacing, or conversational context. Fish Audio S2 Pro handles these transitions surprisingly naturally.

The inline prompting system also dramatically lowers friction for developers experimenting with expressive speech generation. Instead of building rigid prosody pipelines, developers can iterate conversationally.

Here are some sample responses to text-to-speech prompts:

Prompt:

[professional broadcast tone] Welcome back to the show, chef! Today we are making the perfect French baguette.

[Slight French Accent] Ah, thank you! The secret is in the fermentation and, of course, a little bit of magic.

Listen to a TTS demo

Multi-speaker conversation

Prompt:

<speaker:0> [professional] Reinforcement learning relies on a continuous cycle of interaction between an ML algorithm known as an agent and its environment.

<speaker:1> [excited] In reinforcement learning, a policy is a mapping that tells an agent which action to choose in each state to maximize long-term cumulative rewards. 

<speaker:2> [voice down] At each step, the agent observes the current state of the environment, selects an action, and [break] in response the environment transitions to a new state while providing a reward signal.  

<speaker:0>  The reward is a numerical measure of the outcome, higher for a desirable result and lower for an undesirable result.

Listen to the Multi-speaker conversation
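Prompts in this format are straightforward to assemble programmatically. The helper below is a sketch of our own, not a Fish Audio utility. It renders a list of turns into the <speaker:N> [tag] layout used in the prompt above, with the tag being optional.

```python
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    speaker: int
    text: str
    tag: Optional[str] = None  # free-form delivery cue, without brackets


def render_dialogue(turns: list[Turn]) -> str:
    """Render turns as '<speaker:N> [tag] text' blocks separated by blank lines."""
    lines = []
    for turn in turns:
        cue = f"[{turn.tag}] " if turn.tag else ""
        lines.append(f"<speaker:{turn.speaker}> {cue}{turn.text}")
    return "\n\n".join(lines)


script = [
    Turn(0, "Reinforcement learning relies on a cycle of interaction.", "professional"),
    Turn(1, "A policy maps each state to an action.", "excited"),
    Turn(0, "The reward is a numerical measure of the outcome."),
]
print(render_dialogue(script))
```

Keeping turns as structured data makes it easy to swap speakers or cues without hand-editing the prompt string.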

Non-English TTS

Prompt:

<speaker:0> Bonjour Sophie. Comment allez-vous aujourd’hui ?

<speaker:1> Bonjour Marc. Très bien, merci. Et vous ?

<speaker:0> Ça va très bien. Avez-vous reçu le rapport financier de ce matin ?

<speaker:1> Oui, je l’ai lu. Les chiffres sont très bons pour ce trimestre.

<speaker:0> C’est une excellente nouvelle. Pouvons-nous en discuter à la réunion de cet après-midi ?

<speaker:1> Absolument. À quelle heure est prévue la réunion ?

<speaker:0> Elle est à quatorze heures dans la grande salle.

<speaker:1> Parfait, je serai prête. Merci Marc.

<speaker:0> Je vous en prie, Sophie. À plus tard !

Listen to the French TTS demo

Token Generation Speed

Fish Audio AI Performance

Fish Audio S2 Pro sustained around 70 tokens/sec on an H200 GPU. We expect even higher throughput on Blackwell GPUs.
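To put the throughput figure in terms of audio, it can be compared with the codec frame rate. The calculation below rests on an assumption the measurement does not confirm: that each reported token corresponds to one codec frame at roughly 21 Hz. If a token means something else in the server's logs, the ratio changes accordingly.

```python
TOKENS_PER_SECOND = 70   # throughput measured on the H200
FRAME_RATE_HZ = 21       # approximate codec frame rate from the specifications


def real_time_factor(tokens_per_second: float, frame_rate_hz: float) -> float:
    """Seconds of audio produced per second of compute, assuming one token per frame."""
    return tokens_per_second / frame_rate_hz


rtf = real_time_factor(TOKENS_PER_SECOND, FRAME_RATE_HZ)
print(f"{rtf:.2f} s of audio per second of compute")
```

Under that assumption, a single stream generates audio a little over three times faster than it plays back, which leaves headroom for streaming.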

What Comes Next for Open Voice Models

Open-source speech generation is evolving quickly. The next wave of models will likely converge several capabilities into unified voice-native systems:

  • Real-time streaming interaction
  • Speech-to-speech generation
  • Emotionally adaptive conversational agents
  • Long-context conversational memory
  • Multimodal reasoning with voice outputs
  • Multi-agent spoken collaboration

As these systems mature, infrastructure becomes increasingly important.

Low-latency streaming voice inference needs fast compute, networking, storage, and orchestration. Running these workloads efficiently requires infrastructure designed specifically for modern AI systems rather than retrofitted cloud abstractions.

Build Voice AI on Radiant

Fish Audio S2 Pro demonstrates how quickly open-source AI audio is advancing. Expressive speech generation, multilingual synthesis, real-time streaming, and programmable emotional control are becoming foundational AI capabilities rather than experimental features.

Radiant provides the infrastructure layer needed to deploy these systems reliably at production scale. Explore high-performance AI infrastructure for next-generation AI workloads with Radiant AI Cloud.
