Create Engaging Multilingual Audio with AI Speech Synthesis

The Multilingual Text To Audio Synthesis API empowers developers to transform written content into high-quality audio in multiple languages, complete with emotional nuances and voice cloning capabilities. This API is particularly advantageous for applications requiring voiceovers, audiobooks, and other audio content that benefits from a more human-like delivery. By harnessing advanced speech synthesis technology, developers can simplify the audio production process, enhance user engagement, and reach a broader audience through multilingual support.
Prerequisites
To get started with the Multilingual Text To Audio Synthesis API, you'll need an API key and a basic understanding of making API calls.
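Your API key should stay out of source control. One common pattern, sketched below with an illustrative environment-variable name, is to load it at runtime:

```python
import os

# Read the API key from an environment variable rather than hard-coding it.
# The variable name COGNITIVE_ACTIONS_API_KEY is illustrative.
API_KEY = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")

if not API_KEY:
    print("Warning: COGNITIVE_ACTIONS_API_KEY is not set")
```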
Generate Emotional Multilingual Speech
This action allows you to convert text into high-fidelity multilingual audio, integrating emotional expression and voice cloning capabilities. Utilizing the Speech-02-HD and Speech-02-Turbo models, this action is ideal for applications like voiceovers and audiobooks where emotional depth and clarity are essential.
Purpose
The Generate Emotional Multilingual Speech action addresses the need for realistic and expressive audio output, making it suitable for various content types, from educational resources to entertainment.
Input Requirements
You need to provide a JSON object that includes the text you want to convert, along with optional parameters to customize the output. The key properties include:
- text: The text to convert to speech (limit: < 5000 characters).
- pitch: Adjusts the speech pitch (range: -12 to 12).
- speed: Controls the speech speed (range: 0.5 to 2.0).
- volume: Sets the speech volume (range: 0 to 10).
- bitrate: Specifies the audio bitrate (options: 32000, 64000, 128000, 256000).
- channel: Defines audio channel (mono or stereo).
- emotion: Sets the emotional tone (options: neutral, happy, sad, etc.).
- audioSampleRate: Determines the sample rate (options: 8000, 16000, 22050, 24000, 32000, 44100).
- voiceIdentifier: Selects the desired voice ID.
- languageEnhancement: Enhances recognition of specific languages.
- enableEnglishNormalization: Activates text normalization for improved reading.
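As a client-side sanity check before calling the API, the documented ranges above can be enforced locally. The helper below is an illustrative sketch, not part of the API itself:

```python
# Hypothetical helper: validate a payload against the documented parameter
# ranges before sending a request. Field names match the list above.
def validate_payload(payload: dict) -> list:
    errors = []
    text = payload.get("text", "")
    if not text or len(text) >= 5000:
        errors.append("text must be non-empty and under 5000 characters")
    if not -12 <= payload.get("pitch", 0) <= 12:
        errors.append("pitch must be between -12 and 12")
    if not 0.5 <= payload.get("speed", 1.0) <= 2.0:
        errors.append("speed must be between 0.5 and 2.0")
    if not 0 <= payload.get("volume", 1) <= 10:
        errors.append("volume must be between 0 and 10")
    if payload.get("bitrate", 128000) not in (32000, 64000, 128000, 256000):
        errors.append("bitrate must be one of 32000, 64000, 128000, 256000")
    if payload.get("channel", "mono") not in ("mono", "stereo"):
        errors.append("channel must be 'mono' or 'stereo'")
    if payload.get("audioSampleRate", 32000) not in (8000, 16000, 22050, 24000, 32000, 44100):
        errors.append("audioSampleRate must be a supported sample rate")
    return errors

print(validate_payload({"text": "Hello", "pitch": 0, "speed": 1.0}))  # []
print(validate_payload({"text": "Hello", "speed": 3.0}))
```

Catching an out-of-range value locally gives a clearer error message than a generic 4xx response from the server.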
Expected Output
The output will be a URL linking to the generated audio file, allowing you to easily access and use the audio in your applications.
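The exact response shape is not specified here; assuming the result JSON carries the link in an `output.audio_url` field (an assumption, not documented behavior), extracting and downloading the file might look like:

```python
# Assumption: the action result places the file link at output["audio_url"].
# Adjust the key names to match the response documented for your account.
def extract_audio_url(result: dict):
    output = result.get("output") or {}
    return output.get("audio_url")

example_result = {
    "status": "succeeded",
    "output": {"audio_url": "https://example.com/audio/abc123.mp3"},
}
audio_url = extract_audio_url(example_result)
print(audio_url)

# To save the file locally (network call, shown commented for completeness):
# import requests
# audio = requests.get(audio_url, timeout=30)
# with open("speech.mp3", "wb") as f:
#     f.write(audio.content)
```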
Use Cases
- Voiceovers: Create engaging voiceovers for videos, presentations, or advertisements that resonate with diverse audiences.
- Audiobooks: Produce high-quality audiobooks with emotional depth, making the listening experience more immersive.
- Language Learning: Develop language learning applications where users can listen to correct pronunciations and emotional expressions in different languages.
- Interactive Applications: Enhance customer engagement in chatbots or virtual assistants by providing emotionally rich audio responses.
Example Request

The following Python example sends the predefined example input for this action to a hypothetical execution endpoint:

```python
import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint.
# Ensure your environment handles the API key securely.
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users.
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

# Action ID for: Generate Emotional Multilingual Speech
action_id = "a649502d-7d78-415e-b97d-c37d6112ce46"

# Construct the exact input payload based on the action's requirements.
# This example uses the predefined example_input for this action:
payload = {
    "text": (
        "Speech-02-series is a Text-to-Audio and voice cloning technology that "
        "offers voice synthesis, emotional expression, and multilingual "
        "capabilities.\n\nThe HD version is optimized for high-fidelity "
        "applications like voiceovers and audiobooks. While the turbo one is "
        "designed for real-time applications with low latency.\n\nWhen using "
        "this model on Replicate, each character represents 1 token."
    ),
    "pitch": 0,
    "speed": 1,
    "volume": 1,
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "happy",
    "audioSampleRate": 32000,
    "voiceIdentifier": "Friendly_Person",
    "languageEnhancement": "English",
    "enableEnglishNormalization": True
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API.
}

# Prepare the request body for the hypothetical execution endpoint.
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status()  # Raise an exception for 4xx or 5xx status codes
    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
print("------------------------------------------------")
```
Conclusion
The Multilingual Text To Audio Synthesis API not only streamlines the audio creation process but also enhances user experience through emotional and multilingual capabilities. By integrating this API, developers can create diverse audio content that is accessible and engaging for a global audience. Next steps might include exploring additional customization options or integrating this API into existing applications to elevate the audio experience.