Create Engaging Audio with Multilingual Voice Synthesis API

3 May 2025

The Multilingual Voice Synthesis API lets developers convert text into high-quality audio in more than 30 languages, making it a powerful tool for adding dynamic audio to a wide range of applications. The service provides advanced synthesis capabilities, including emotional expression and real-time voice cloning, and is optimized for low latency, so generated audio remains responsive even in interactive use.

Common use cases for this API include developing voiceovers for videos, creating audiobooks, enhancing accessibility features in applications, and building interactive voice response systems. By integrating this API, developers can enrich their applications with lifelike audio that resonates with diverse audiences, making content more relatable and engaging.

Prerequisites

To get started with the Multilingual Voice Synthesis API, you will need an API key and a basic understanding of making API calls.

Convert Text to Audio with Emotion and Multilingual Support

This action transforms text into audio, leveraging the Speech-02 series to deliver high-quality voice synthesis. It allows for emotional expression and supports multilingual capabilities, making it ideal for applications that require dynamic audio responses. With features like pitch, speed, and volume control, developers can create tailored audio experiences that reflect the desired tone and mood.

Input Requirements

To use this action, the following input parameters are required:

  • text (string): The text to convert into speech (max 5000 characters). You can use <#x#> to control pause duration between words.
  • pitch (integer): Adjusts the pitch of the speech output (range: -12 to +12).
  • speed (number): Adjusts the speed of speech (range: 0.5 to 2.0).
  • volume (number): Adjusts the volume of the speech output (range: 0 to 10).
  • audioBitrate (integer): Specifies the audio bitrate (options: 32000, 64000, 128000, 256000).
  • audioChannel (string): Defines the audio channel (options: mono, stereo).
  • speechEmotion (string): Selects the emotional expression (options include neutral, happy, sad, angry, fearful, disgusted, surprised).
  • audioSampleRate (integer): Determines the sample rate (options: 8000, 16000, 22050, 24000, 32000, 44100).
  • voiceIdentifier (string): Specifies the desired voice ID from a selection of predefined voices.
  • languageEnhancement (string): Enhances recognition of specific languages and dialects.
  • englishTextNormalization (boolean): Enables normalization for improved number reading.
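Several of these parameters have documented ranges and option sets, so it can be useful to validate a payload client-side before making a request. The helper below is an illustrative sketch; the function name and the defaults it assumes are not part of the API:

```python
# Illustrative client-side validation of the documented parameter ranges.
VALID_BITRATES = {32000, 64000, 128000, 256000}
VALID_SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
VALID_CHANNELS = {"mono", "stereo"}
VALID_EMOTIONS = {"neutral", "happy", "sad", "angry",
                  "fearful", "disgusted", "surprised"}

def validate_payload(p: dict) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    if not p.get("text") or len(p["text"]) > 5000:
        errors.append("text is required and limited to 5000 characters")
    if not -12 <= p.get("pitch", 0) <= 12:
        errors.append("pitch must be between -12 and 12")
    if not 0.5 <= p.get("speed", 1.0) <= 2.0:
        errors.append("speed must be between 0.5 and 2.0")
    if not 0 <= p.get("volume", 1.0) <= 10:
        errors.append("volume must be between 0 and 10")
    if p.get("audioBitrate", 128000) not in VALID_BITRATES:
        errors.append("unsupported audioBitrate")
    if p.get("audioSampleRate", 32000) not in VALID_SAMPLE_RATES:
        errors.append("unsupported audioSampleRate")
    if p.get("audioChannel", "mono") not in VALID_CHANNELS:
        errors.append("unsupported audioChannel")
    if p.get("speechEmotion", "neutral") not in VALID_EMOTIONS:
        errors.append("unsupported speechEmotion")
    return errors
```

Catching an out-of-range pitch or an unsupported sample rate locally saves a round trip and produces a clearer error message than a generic 4xx response.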

Expected Output

The action returns a URL to the generated audio file, which you can fetch and use directly in your applications.
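The exact response schema of the execution wrapper is not shown here, so the field names below ("output", "audioUrl") are assumptions for illustration only. A small helper can pull the link out of a parsed JSON result:

```python
from typing import Optional

# "output" and "audioUrl" are assumed field names -- check the API
# reference for the actual response schema.
def extract_audio_url(result: dict) -> Optional[str]:
    """Return the generated audio URL from a parsed JSON result, if present."""
    output = result.get("output") or {}
    return output.get("audioUrl")
```

With a real response, you would call this on the parsed body (e.g. `response.json()`) and then fetch the returned link.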

Use Cases for this Action

  • Voiceovers for Media: Create engaging audio for videos, advertisements, and presentations that require a professional touch.
  • Audiobook Production: Convert written content into lifelike audiobooks, catering to a wider audience.
  • Interactive Applications: Enhance user interaction in apps by providing spoken responses that feel natural and engaging.
  • Accessibility Features: Improve accessibility by enabling text-to-speech capabilities in applications, making content available to users with visual impairments.
Example Request

The following Python example sends this action's sample payload to the execution endpoint.

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "d4c29702-81a5-4a8c-bfad-14f0621b51c5" # Action ID for: Convert Text to Audio with Emotion and Multilingual Support

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "audioBitrate": 128000,
  "audioChannel": "mono",
  "speechEmotion": "angry",
  "audioSampleRate": 32000,
  "voiceIdentifier": "Deep_Voice_Man",
  "languageEnhancement": "English",
  "englishTextNormalization": true
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action.name or action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

The Multilingual Voice Synthesis API empowers developers to create rich, engaging audio experiences that can be tailored to various applications. With its multilingual support and emotional expression capabilities, this API offers a unique opportunity to make content more relatable and enjoyable for users. Whether you're enhancing an existing application or developing a new project, integrating this voice synthesis technology will significantly elevate the quality of audio interactions.

As a next step, consider exploring the API's documentation further to tailor the audio output to your specific needs and maximize its potential in your projects.