What is OpenAudio S1?
OpenAudio S1 is the latest-generation text-to-speech (TTS) model from Fish Audio, a leader in AI speech technology. Built on Fish Audio's "Fish Speech" series, OpenAudio S1 promises to redefine the AI voice-generation experience.
OpenAudio S1 Key Features
Text-to-Speech (TTS)
Converts text into highly natural and expressive speech, reaching the expressiveness and naturalness of professional voice actors.
Voice Cloning
Supports zero-shot and few-shot voice cloning with just 10–30 seconds of audio, generating high-fidelity cloned voices in under a minute—ideal for personalized broadcasters or celebrity voice simulations.
Highly Natural Sound & Emotional Control
Produces smooth, realistic voices nearly indistinguishable from human voiceovers, with over 50 emotions and tone markers. Adjust expression, emotion, and subtle cues like laughter or whispers via natural language.
Strong Instruction-Following & Customization
Control speech rate, volume, pauses, and more with simple text commands. Developers can customize tone, emphasis, and pacing in real-time via API.
Multispeaker & Style Flexibility
Seamlessly switch between characters and styles within a single clip—perfect for audiobooks, podcasts, and interactive dialogues.
Multilingual & Cross-lingual Support
Covers 13 languages including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. Handles any language script without phoneme reliance.
High Accuracy & Fast Performance
Achieves English WER as low as 0.008 and CER of 0.004. Cloud processing averages 20s per voice; real-time factor is 1:5 on RTX 4060 and 1:15 on RTX 4090, with <100ms latency for 11 languages.
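For context on the accuracy figures above, WER and CER are edit-distance metrics: the number of word-level (or character-level) substitutions, insertions, and deletions needed to turn the model's transcript back into the reference, divided by the reference length. A minimal, self-contained sketch of how they are computed (not Fish Audio's evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A WER of 0.008 thus means fewer than one wrong word per hundred words of reference text.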
Innovative Dual-AR Architecture
Combines fast and slow Transformer modules for stable, efficient voice generation.
RLHF Training & Large-Scale Data
Emotional expressiveness is enhanced with RLHF (Reinforcement Learning from Human Feedback) and training on 2 million hours of audio data.
How OpenAudio S1 Works
Discover how OpenAudio S1's modular architecture and task-specific tuning deliver natural, consistent, and controllable text-to-speech and voice cloning.
Extensive Training Data
Trained on 2 million hours of audio, OpenAudio S1 achieves breakthrough quality and diversity in voice generation. This vast dataset enables the model to produce smooth, realistic voices nearly indistinguishable from human voiceovers.
Strong Instruction-Following Capability
Users can control speech rate, volume, pauses, and add effects like laughter with simple text commands. Developers can further customize tone, emphasis, and pacing in real-time via API.
High Accuracy & Fast Performance
Achieves English WER as low as 0.008 and CER of 0.004. Cloud processing averages 20s per voice; with 'fish-tech acceleration,' real-time factor is 1:5 on RTX 4060 and 1:15 on RTX 4090, supporting 11 languages with <100ms latency.
Innovative Dual-AR Architecture
Utilizes a unique dual autoregressive (Dual-AR) architecture, combining fast and slow Transformer modules for stable, efficient voice generation. Grouped finite scalar vector quantization (GFSQ) further enhances codebook processing, ensuring high-fidelity output with reduced computational cost.
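The exact internals of the Dual-AR decoder are not published, but the fast/slow split can be sketched schematically: an outer (slow) module predicts one coarse frame-level token per step, while an inner (fast) module fills in the codebook groups within each frame. Everything below is an illustrative assumption, with trivial deterministic stand-ins for the two Transformers:

```python
# Hedged schematic of a dual autoregressive (Dual-AR) decoder. Module
# behavior and token shapes here are assumptions, not OpenAudio S1's code.

def slow_step(context):
    """Slow Transformer stand-in: one coarse (frame-level) token per step."""
    return len(context) % 7  # dummy deterministic "prediction"

def fast_step(frame_token, group_index):
    """Fast Transformer stand-in: one codebook entry per group in a frame."""
    return (frame_token + group_index) % 4  # dummy deterministic "prediction"

def generate(num_frames, groups_per_frame):
    """Run the slow module once per frame; the fast module then iterates
    over that frame's codebook groups (cf. the GFSQ grouping)."""
    frames = []
    coarse_context = []
    for _ in range(num_frames):
        coarse = slow_step(coarse_context)
        coarse_context.append(coarse)
        frames.append([fast_step(coarse, g) for g in range(groups_per_frame)])
    return frames
```

The design intuition is that the expensive long-range context lives only in the slow loop, so the fast inner loop stays cheap, which is what makes the generation both stable and efficient.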
Zero-shot & Few-shot Voice Cloning
Requires only 10–30 seconds of audio to generate high-fidelity cloned voices in under a minute, capturing unique speaking patterns, rhythm, and style for personalized voice cloning.
RLHF-Driven Emotional Expression
Leverages online RLHF (Reinforcement Learning from Human Feedback) to capture voice timbre and intonation, enabling natural emotional expression. Users can control over 50 emotions and tone markers—like (angry), (happy), (sad), (excited), (whisper), and more—including subtle cues such as laughter or crying.
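Marked-up input like this is easy to assemble programmatically. Below is a small helper; the marker list is drawn from this page, while the full 50+ marker set and the exact placement rules are assumptions to check against the official documentation:

```python
# Helper for composing marked-up TTS input. KNOWN_MARKERS lists only the
# markers mentioned on this page; the real set is larger (50+).

KNOWN_MARKERS = {
    "angry", "happy", "sad", "excited", "surprised",
    "whispering", "shouting", "soft tone", "in a hurry tone",
    "laughing", "chuckling", "sighing", "crying",
}

def mark(text, marker):
    """Append a parenthesized emotion/tone marker to a sentence."""
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker}")
    return f"{text} ({marker})"

line = mark("I can't believe we won!", "excited")
# line == "I can't believe we won! (excited)"
```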
Multilingual & Cross-lingual Support
Supports 13 languages, including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. No phoneme dependency—handles any language script for TTS by simply pasting text into the input box.
OpenAudio S1: Technical Highlights
Extensive Training Data
OpenAudio S1 was trained on an enormous dataset of 2 million hours of audio. This large-scale training has led to substantial improvements in the quality and diversity of its voice generation. For the 4-billion parameter S1 model, this training volume contributes to its industry-leading performance.
Innovative Dual-AR Architecture
The model adopts a unique dual autoregressive (Dual-AR) architecture, combining fast and slow Transformer modules to optimize the stability and efficiency of voice generation. Grouped finite scalar vector quantization (GFSQ) technology further enhances codebook processing, ensuring high-fidelity voice output while reducing computational costs.
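Finite scalar quantization can be illustrated with a toy sketch: each latent dimension is clamped to [-1, 1] and snapped onto a small fixed grid, and the dimensions are processed group by group, each group yielding one compact code. The level count and group size below are assumptions, not OpenAudio S1's actual configuration:

```python
# Toy illustration of grouped finite scalar quantization (FSQ).
# levels=5 and group_size=2 are arbitrary example values.

def fsq(value, levels):
    """Quantize one scalar in [-1, 1] onto `levels` evenly spaced points."""
    v = max(-1.0, min(1.0, value))
    step = 2.0 / (levels - 1)
    return round((v + 1.0) / step) * step - 1.0

def grouped_fsq(vector, group_size, levels):
    """Quantize a latent vector group by group; returns grouped values."""
    return [
        [fsq(x, levels) for x in vector[i:i + group_size]]
        for i in range(0, len(vector), group_size)
    ]

q = grouped_fsq([0.1, -0.9, 0.5, 0.77], group_size=2, levels=5)
# q == [[0.0, -1.0], [0.5, 1.0]]
```

Because the grid is fixed rather than learned, there is no codebook lookup to train or store per entry, which is one way such schemes reduce computational cost while keeping output fidelity high.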
RLHF-Driven Emotional Expression
OpenAudio S1's ability to generate highly natural and emotionally nuanced speech is significantly enhanced by online Reinforcement Learning from Human Feedback (RLHF). This allows the model to precisely capture voice timbre and intonation, generating more natural emotional expressions compared to traditional TTS models.
Interactive OpenAudio S1 Demo
Experience OpenAudio S1 directly in your browser.
How to Use OpenAudio S1-Mini on Hugging Face
OpenAudio S1-Mini lets you generate expressive, high-quality speech from your text in seconds. Follow these steps to get started with the Hugging Face web interface:
Step 1: Access the Model
- Go to the OpenAudio S1-Mini page on Hugging Face.
- Log in to your Hugging Face account, or Sign Up if you don't have one.
Step 2: Navigate to the Interface
- Look for the "Use via API" or Space/Demo option on the model page.
- Click on the interactive demo or Space to access the web interface.
- Wait for the model to load. You’ll see: "The model running in this WebUI is OpenAudio S1 Mini".
Step 3: Enter Your Text
- Locate the "Input Text" section at the top.
- Click the text box that says "Put your text here".
- Type or paste the text you want converted to speech. Supported languages: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese.
Step 4: Add Emotional Control (Optional)
Enhance your speech by adding markers in parentheses:
- Emotions: (excited), (sad), (angry), (happy), (surprised)
- Tones: (whispering), (shouting), (soft tone), (in a hurry tone)
- Special effects: (laughing), (chuckling), (sighing), (crying)
Example: Hello there! (excited) How are you doing today? (curious)
Step 5: Configure Advanced Settings (Optional)
Click on the "Advanced Config" tab to adjust:
- Temperature: 0.9 (default) – Higher = more creative, Lower = more consistent
- Top-P: 0.9 (default) – Controls randomness
- Repetition Penalty: 1.1 (default) – Prevents repetitive speech
- Maximum tokens per batch: 0 (unlimited) or set a limit
- Seed: 0 (random) or enter a number for reproducible results
Step 6: Generate Audio
- Click the blue "Generate" button.
- Wait for processing (may take a few seconds to minutes depending on text length).
- The generated audio will appear in the "Generated Audio" section on the right.
Step 7: Listen and Download
- Play the audio using the audio player controls.
- Download the audio file if you want to save it locally.
- The audio is generated in a high-quality format suitable for download and reuse.
Step 8: Iterate and Refine
- Modify your text or settings if needed.
- Try different emotional markers for varied expression.
- Experiment with parameters to get the desired voice quality.
- Generate again with new settings.
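When scripting generation rather than clicking through the web UI, the Advanced Config values from the walkthrough above can be collected into one settings object. The field names below mirror the UI labels; whether any real endpoint accepts these exact names is an assumption to verify against the official API documentation:

```python
# Defaults taken from the Advanced Config step of the walkthrough.
# Field names are hypothetical, modeled on the UI labels.

DEFAULTS = {
    "temperature": 0.9,        # higher = more creative, lower = more consistent
    "top_p": 0.9,              # controls randomness
    "repetition_penalty": 1.1, # discourages repetitive speech
    "max_tokens_per_batch": 0, # 0 = unlimited
    "seed": 0,                 # 0 = random; any other value reproduces a run
}

def make_request(text, **overrides):
    """Build a generation request from defaults plus per-call overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    return {"text": text, **DEFAULTS, **overrides}

req = make_request("Hello there! (excited)", seed=42)
```

Fixing the seed, as in the usage line, is the UI's mechanism for reproducing a previous run exactly.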
Frequently Asked Questions
Find answers to common questions about OpenAudio S1's features, requirements, and capabilities
What is OpenAudio S1?
OpenAudio S1 is Fish Audio's latest-generation voice generation model, building on the Fish Speech series to achieve unprecedented levels of speech naturalness and expressiveness. It aims to reach the quality of professional voice actors and has been recognized as a new benchmark in Text-to-Speech (TTS).
What are OpenAudio S1's key features?
OpenAudio S1 offers highly natural sound, a wealth of tone control with over 50 emotions and tone markers, and strong instruction-following capabilities to control speech details like rate, volume, and pauses. It also provides ultra-realistic voice cloning from as little as 10-30 seconds of audio.
How was OpenAudio S1 trained, and what technology does it use?
It was trained on an enormous dataset of 2 million hours of audio data. Key technical highlights include an innovative Dual-AR architecture combining fast and slow Transformer modules for stability and efficiency, and RLHF (Reinforcement Learning from Human Feedback) technology for more natural emotional expressions.
How accurate and fast is OpenAudio S1?
OpenAudio S1 demonstrates industry-leading accuracy, achieving an English word error rate (WER) as low as 0.008 and a character error rate (CER) of only 0.004. It also provides lightning-fast performance, with cloud processing generating high-quality voice in an average of 20 seconds, and real-time factors up to 1:15 on an NVIDIA RTX 4090. It ranked #1 on the TTS-Arena leaderboard.
Are there different versions of OpenAudio S1?
Yes, OpenAudio S1 comes in two main versions: OpenAudio S1, the full-featured 4-billion parameter proprietary model available via cloud services, and OpenAudio S1-mini, a 0.5-billion parameter distilled, fully open-source model optimized for faster inference.
Which languages does OpenAudio S1 support?
OpenAudio S1 supports 13 languages, including English, Chinese, Japanese, Korean, French, German, Arabic, and Spanish. It boasts no phoneme dependency, meaning it can handle text in any language script by simply copying and pasting. It supports 11 languages with real-time performance and under 100 milliseconds of latency.
How does OpenAudio S1 compare to competitors?
User feedback indicates OpenAudio S1 surpasses competitors like ElevenLabs in terms of voice authenticity and emotional nuance. It offers layered emotional control capable of understanding complex prompts like "excited but nervous" and adding subtle sound cues, making its voices feel more human.
Is OpenAudio S1 easy to use?
Yes, it features an easy-to-use, Gradio-based web UI and a PyQt6 graphical interface. It's designed to be deploy-friendly with native support for Linux and Windows. A free plan is available allowing users to test the S1 model with a credit limit, and voice cloning can be done quickly, taking about 23 seconds to record and analyze a voice.