What is OpenAudio S1?
OpenAudio S1 is the latest-generation text-to-speech (TTS) model from Fish Audio, a leader in AI speech technology. Built on Fish Audio's "Fish Speech" series, OpenAudio S1 promises to redefine the AI voice-generation experience.
OpenAudio S1 Key Features
Text-to-Speech (TTS)
Converts text into highly natural and expressive speech, reaching the expressiveness and naturalness of professional voice actors.
Voice Cloning
Supports zero-shot and few-shot voice cloning with just 10–30 seconds of audio, generating high-fidelity cloned voices in under a minute—ideal for personalized broadcasters or celebrity voice simulations.
Highly Natural Sound & Emotional Control
Produces smooth, realistic voices nearly indistinguishable from human voiceovers, with over 50 emotions and tone markers. Adjust expression, emotion, and subtle cues like laughter or whispers via natural language.
Strong Instruction-Following & Customization
Control speech rate, volume, pauses, and more with simple text commands. Developers can customize tone, emphasis, and pacing in real-time via API.
Multispeaker & Style Flexibility
Seamlessly switch between characters and styles within a single clip—perfect for audiobooks, podcasts, and interactive dialogues.
Multilingual & Cross-lingual Support
Covers 13 languages including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. Handles any language script without phoneme reliance.
High Accuracy & Fast Performance
Achieves English WER as low as 0.008 and CER of 0.004. Cloud processing averages 20s per voice; real-time factor is 1:5 on RTX 4060 and 1:15 on RTX 4090, with <100ms latency for 11 languages.
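For context on the accuracy figures above, WER and CER are edit-distance metrics: the number of word-level (or character-level) substitutions, insertions, and deletions needed to turn the model's transcript back into the reference, divided by the reference length. A minimal, self-contained sketch of how they are computed (not Fish Audio's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A WER of 0.008 thus means fewer than one wrong word per hundred words of reference text.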
Innovative Dual-AR Architecture
Combines fast and slow Transformer modules for stable, efficient voice generation.
RLHF Training & Large-Scale Data
Emotional expressiveness is enhanced with RLHF (Reinforcement Learning from Human Feedback) and training on 2 million hours of audio data.
How OpenAudio S1 Works
Discover how OpenAudio S1's modular architecture and task-specific tuning deliver natural, consistent, and controllable text-to-speech and voice cloning.
Extensive Training Data
Trained on 2 million hours of audio, OpenAudio S1 achieves breakthrough quality and diversity in voice generation. This vast dataset enables the model to produce smooth, realistic voices nearly indistinguishable from human voiceovers.
Strong Instruction-Following Capability
Users can control speech rate, volume, pauses, and add effects like laughter with simple text commands. Developers can further customize tone, emphasis, and pacing in real-time via API.
High Accuracy & Fast Performance
Achieves English WER as low as 0.008 and CER of 0.004. Cloud processing averages 20s per voice; with 'fish-tech acceleration,' real-time factor is 1:5 on RTX 4060 and 1:15 on RTX 4090, supporting 11 languages with <100ms latency.
Innovative Dual-AR Architecture
Utilizes a unique dual autoregressive (Dual-AR) architecture, combining fast and slow Transformer modules for stable, efficient voice generation. Grouped finite scalar vector quantization (GFSQ) further enhances codebook processing, ensuring high-fidelity output with reduced computational cost.
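The exact internals of the Dual-AR decoder are not published, but the fast/slow split can be sketched schematically: an outer (slow) module predicts one coarse frame-level token per step, while an inner (fast) module fills in the codebook groups within each frame. Everything below is an illustrative assumption, with trivial deterministic stand-ins for the two Transformers:

```python
# Hedged schematic of a dual autoregressive (Dual-AR) decoder. Module
# behavior and token shapes here are assumptions, not OpenAudio S1's code.

def slow_step(context):
    """Slow Transformer stand-in: one coarse (frame-level) token per step."""
    return len(context) % 7  # dummy deterministic "prediction"

def fast_step(frame_token, group_index):
    """Fast Transformer stand-in: one codebook entry per group in a frame."""
    return (frame_token + group_index) % 4  # dummy deterministic "prediction"

def generate(num_frames, groups_per_frame):
    """Run the slow module once per frame; the fast module then iterates
    over that frame's codebook groups (cf. the GFSQ grouping)."""
    frames = []
    coarse_context = []
    for _ in range(num_frames):
        coarse = slow_step(coarse_context)
        coarse_context.append(coarse)
        frames.append([fast_step(coarse, g) for g in range(groups_per_frame)])
    return frames
```

The design intuition is that the expensive long-range context lives only in the slow loop, so the fast inner loop stays cheap, which is what makes the generation both stable and efficient.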
Zero-shot & Few-shot Voice Cloning
Requires only 10–30 seconds of audio to generate high-fidelity cloned voices in under a minute, capturing unique speaking patterns, rhythm, and style for personalized voice cloning.
RLHF-Driven Emotional Expression
Leverages online RLHF (Reinforcement Learning from Human Feedback) to capture voice timbre and intonation, enabling natural emotional expression. Users can control over 50 emotions and tone markers—like (angry), (happy), (sad), (excited), (whisper), and more—including subtle cues such as laughter or crying.
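Marked-up input like this is easy to assemble programmatically. Below is a small helper; the marker list is drawn from this page, while the full 50+ marker set and the exact placement rules are assumptions to check against the official documentation:

```python
# Helper for composing marked-up TTS input. KNOWN_MARKERS lists only the
# markers mentioned on this page; the real set is larger (50+).

KNOWN_MARKERS = {
    "angry", "happy", "sad", "excited", "surprised",
    "whispering", "shouting", "soft tone", "in a hurry tone",
    "laughing", "chuckling", "sighing", "crying",
}

def mark(text, marker):
    """Append a parenthesized emotion/tone marker to a sentence."""
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker}")
    return f"{text} ({marker})"

line = mark("I can't believe we won!", "excited")
# line == "I can't believe we won! (excited)"
```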
Multilingual & Cross-lingual Support
Supports 13 languages, including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. No phoneme dependency—handles any language script for TTS by simply pasting text into the input box.
OpenAudio S1: Technical Highlights
Extensive Training Data
OpenAudio S1 was trained on an enormous dataset of 2 million hours of audio. This large-scale training has led to substantial improvements in the quality and diversity of its voice generation. For the 4-billion parameter S1 model, this training volume contributes to its industry-leading performance.
Innovative Dual-AR Architecture
The model adopts a unique dual autoregressive (Dual-AR) architecture, combining fast and slow Transformer modules to optimize the stability and efficiency of voice generation. Grouped finite scalar vector quantization (GFSQ) technology further enhances codebook processing, ensuring high-fidelity voice output while reducing computational costs.
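Finite scalar quantization can be illustrated with a toy sketch: each latent dimension is clamped to [-1, 1] and snapped onto a small fixed grid, and the dimensions are processed group by group, each group yielding one compact code. The level count and group size below are assumptions, not OpenAudio S1's actual configuration:

```python
# Toy illustration of grouped finite scalar quantization (FSQ).
# levels=5 and group_size=2 are arbitrary example values.

def fsq(value, levels):
    """Quantize one scalar in [-1, 1] onto `levels` evenly spaced points."""
    v = max(-1.0, min(1.0, value))
    step = 2.0 / (levels - 1)
    return round((v + 1.0) / step) * step - 1.0

def grouped_fsq(vector, group_size, levels):
    """Quantize a latent vector group by group; returns grouped values."""
    return [
        [fsq(x, levels) for x in vector[i:i + group_size]]
        for i in range(0, len(vector), group_size)
    ]

q = grouped_fsq([0.1, -0.9, 0.5, 0.77], group_size=2, levels=5)
# q == [[0.0, -1.0], [0.5, 1.0]]
```

Because the grid is fixed rather than learned, there is no codebook lookup to train or store per entry, which is one way such schemes reduce computational cost while keeping output fidelity high.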
RLHF-Driven Emotional Expression
OpenAudio S1's ability to generate highly natural and emotionally nuanced speech is significantly enhanced by online Reinforcement Learning from Human Feedback (RLHF). This allows the model to precisely capture voice timbre and intonation, generating more natural emotional expressions compared to traditional TTS models.
Interactive OpenAudio S1 Demo
Experience OpenAudio S1 directly in your browser.
How to Use OpenAudio S1-Mini on Hugging Face
OpenAudio S1-Mini lets you generate expressive, high-quality speech from your text in seconds. Follow these steps to get started with the Hugging Face web interface:
Step 1: Access the Model
- Go to the OpenAudio S1-Mini page on Hugging Face.
- Log in to your Hugging Face account, or Sign Up if you don't have one.
Step 2: Navigate to the Interface
- Look for the "Use via API" or Space/Demo option on the model page.
- Click on the interactive demo or Space to access the web interface.
- Wait for the model to load. You’ll see: "The model running in this WebUI is OpenAudio S1 Mini".
Step 3: Enter Your Text
- Locate the "Input Text" section at the top.
- Click the text box that says "Put your text here".
- Type or paste the text you want converted to speech. Supported languages: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese.
Step 4: Add Emotional Control (Optional)
Enhance your speech by adding markers in parentheses:
- Emotions: (excited), (sad), (angry), (happy), (surprised)
- Tones: (whispering), (shouting), (soft tone), (in a hurry tone)
- Special effects: (laughing), (chuckling), (sighing), (crying)
Example: Hello there! (excited) How are you doing today? (curious)
Step 5: Configure Advanced Settings (Optional)
Click on the "Advanced Config" tab to adjust:
- Temperature: 0.9 (default) – Higher = more creative, Lower = more consistent
- Top-P: 0.9 (default) – Controls randomness
- Repetition Penalty: 1.1 (default) – Prevents repetitive speech
- Maximum tokens per batch: 0 (unlimited) or set a limit
- Seed: 0 (random) or enter a number for reproducible results
Step 6: Generate Audio
- Click the blue "Generate" button.
- Wait for processing (may take a few seconds to minutes depending on text length).
- The generated audio will appear in the "Generated Audio" section on the right.
Step 7: Listen and Download
- Play the audio using the audio player controls.
- Download the audio file if you want to save it locally.
- The audio is generated in a high-quality format suitable for download and reuse.
Step 8: Iterate and Refine
- Modify your text or settings if needed.
- Try different emotional markers for varied expression.
- Experiment with parameters to get the desired voice quality.
- Generate again with new settings.
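When scripting generation rather than clicking through the web UI, the Advanced Config values from the walkthrough above can be collected into one settings object. The field names below mirror the UI labels; whether any real endpoint accepts these exact names is an assumption to verify against the official API documentation:

```python
# Defaults taken from the Advanced Config step of the walkthrough.
# Field names are hypothetical, modeled on the UI labels.

DEFAULTS = {
    "temperature": 0.9,        # higher = more creative, lower = more consistent
    "top_p": 0.9,              # controls randomness
    "repetition_penalty": 1.1, # discourages repetitive speech
    "max_tokens_per_batch": 0, # 0 = unlimited
    "seed": 0,                 # 0 = random; any other value reproduces a run
}

def make_request(text, **overrides):
    """Build a generation request from defaults plus per-call overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return {"text": text, **DEFAULTS, **overrides}

req = make_request("Hello there! (excited)", seed=42)
```

Fixing the seed, as in the usage line, is the UI's mechanism for reproducing a previous run exactly.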
Frequently Asked Questions
Find answers to common questions about OpenAudio S1's features, requirements, and capabilities
What is OpenAudio S1?
OpenAudio S1 is Fish Audio's latest-generation voice generation model, building on the Fish Speech series to achieve unprecedented levels of speech naturalness and expressiveness. It aims to reach the quality of professional voice actors and has been recognized as a new benchmark in Text-to-Speech (TTS).
What are OpenAudio S1's key features?
OpenAudio S1 offers highly natural sound, a wealth of tone control with over 50 emotions and tone markers, and strong instruction-following capabilities to control speech details like rate, volume, and pauses. It also provides ultra-realistic voice cloning from as little as 10-30 seconds of audio.
How was OpenAudio S1 trained, and what technology does it use?
It was trained on an enormous dataset of 2 million hours of audio data. Key technical highlights include an innovative Dual-AR architecture combining fast and slow Transformer modules for stability and efficiency, and RLHF (Reinforcement Learning from Human Feedback) technology for more natural emotional expressions.
How accurate and fast is OpenAudio S1?
OpenAudio S1 demonstrates industry-leading accuracy, achieving an English word error rate (WER) as low as 0.008 and a character error rate (CER) of only 0.004. It also provides lightning-fast performance, with cloud processing generating high-quality voice in an average of 20 seconds, and real-time factors up to 1:15 on an NVIDIA RTX 4090. It ranked #1 on the TTS-Arena leaderboard.
Are there different versions of OpenAudio S1?
Yes, OpenAudio S1 comes in two main versions: OpenAudio S1, the full-featured 4-billion parameter proprietary model available via cloud services, and OpenAudio S1-mini, a 0.5-billion parameter distilled, fully open-source model optimized for faster inference.
Which languages does OpenAudio S1 support?
OpenAudio S1 supports 13 languages, including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. It boasts no phoneme dependency, meaning it can handle text in any language script by simply copying and pasting. It supports 11 languages with real-time performance and under 100 milliseconds of latency.
How does OpenAudio S1 compare to competitors?
User feedback indicates OpenAudio S1 surpasses competitors like ElevenLabs in terms of voice authenticity and emotional nuance. It offers layered emotional control capable of understanding complex prompts like "excited but nervous" and adding subtle sound cues, making its voices feel more human.
Is OpenAudio S1 easy to use?
Yes, it features an easy-to-use, Gradio-based web UI and a PyQt6 graphical interface. It's designed to be deploy-friendly with native support for Linux and Windows. A free plan is available allowing users to test the S1 model with a credit limit, and voice cloning can be done quickly, taking about 23 seconds to record and analyze a voice.