Enhance User Experience with Multilingual Voice Synthesis

The Multilingual Voice Synthesis API lets developers integrate high-quality voice synthesis into their applications. It converts text into remarkably realistic audio and supports advanced capabilities such as voice cloning, emotional expression, and multiple languages. With it, you can create engaging content for audiobooks, voiceovers, and interactive applications, improving the user experience while keeping the development process simple.
Imagine the possibilities: creating immersive educational tools, developing dynamic virtual assistants, or generating captivating audio content for storytelling. With the Multilingual Voice Synthesis API, you can bring your text to life in ways that resonate with your audience, regardless of their language or emotional context.
Prerequisites
To get started, you will need a Cognitive Actions API key and a basic understanding of making API calls.
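Since every request requires your Cognitive Actions API key, it is best read from the environment rather than hard-coded. Below is a minimal sketch of that pattern; the environment variable name is an assumption, not part of the API:

```python
import os

# Hypothetical helper: read the Cognitive Actions API key from an
# environment variable so it never appears in source control.
def get_api_key() -> str:
    key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")
    if not key:
        raise RuntimeError(
            "Set the COGNITIVE_ACTIONS_API_KEY environment variable "
            "before calling the API."
        )
    return key
```

The examples that follow use a placeholder key inline for readability, but in production code a helper like this keeps credentials out of your repository.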
Convert Text to Audio with Voice Synthesis
The "Convert Text to Audio with Voice Synthesis" action is the core functionality of the Multilingual Voice Synthesis API. It transforms written text into spoken audio in a natural, human-sounding voice.
Purpose
This action is designed to solve the challenge of creating realistic audio outputs from text, which is essential for applications requiring voice interaction. It allows developers to customize the audio output based on pitch, speed, volume, emotional tone, and more, ensuring a tailored user experience.
Input Requirements
The input for this action requires a JSON object with the following key parameters:
- text (string): The text to be converted to speech (max 5000 characters).
- pitch (integer): Adjusts the pitch of the audio (range: -12 to 12).
- speed (number): Modifies the speed of speech (range: 0.5 to 2).
- volume (number): Sets the audio volume (range: 0 to 10).
- bitrate (integer): Selects the audio bitrate from available options (32000, 64000, 128000, 256000).
- channel (string): Determines whether the output is mono or stereo.
- emotion (string): Specifies the emotional tone (options include neutral, happy, sad, etc.).
- audioSampleRate (integer): Sets the sample rate (options include 8000, 16000, 22050, etc.).
- voiceIdentifier (string): Chooses the voice ID for the output.
- languageEnhancement (string): Improves synthesis quality for a specific language.
- enableEnglishNormalization (boolean): Normalizes English text for better number reading.
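Because the API enforces these ranges server-side, catching invalid values before sending a request saves a round trip. Here is a hedged sketch of client-side validation for the documented limits; the helper name and return shape are illustrative, not part of the API:

```python
# Allowed bitrates documented for the action.
ALLOWED_BITRATES = {32000, 64000, 128000, 256000}

def validate_payload(p: dict) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    if not p.get("text") or len(p["text"]) > 5000:
        errors.append("text must be 1-5000 characters")
    if not -12 <= p.get("pitch", 0) <= 12:
        errors.append("pitch must be between -12 and 12")
    if not 0.5 <= p.get("speed", 1) <= 2:
        errors.append("speed must be between 0.5 and 2")
    if not 0 <= p.get("volume", 1) <= 10:
        errors.append("volume must be between 0 and 10")
    if p.get("bitrate", 128000) not in ALLOWED_BITRATES:
        errors.append("bitrate must be one of 32000, 64000, 128000, 256000")
    return errors
```

Parameters not listed here (such as emotion or voiceIdentifier) accept a fixed set of values documented per deployment, so they are best validated against that catalog rather than a hard-coded list.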
Expected Output
The output will be a direct link to the generated audio file, allowing immediate access to the synthesized speech.
Use Cases for this Specific Action
This action is ideal for a variety of applications:
- Audiobooks: Create engaging audiobooks that captivate listeners with expressive narration.
- Voiceovers: Generate professional-quality voiceovers for videos, advertisements, or presentations.
- Interactive Learning: Develop educational tools that provide spoken instructions or feedback in multiple languages.
- Virtual Assistants: Enhance user engagement by integrating natural-sounding dialogue into chatbots and virtual assistants.
The following Python example sends the action's example input to the execution endpoint:

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint.
# Ensure your environment handles the API key securely.
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users.
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

# Action ID for: Convert Text to Audio with Voice Synthesis
action_id = "d7c9e3f5-782d-4a4d-8d2f-c91c60f37499"

# Construct the exact input payload based on the action's requirements.
# This example uses the predefined example_input for this action:
payload = {
    "text": (
        "Speech-02-series is a Text-to-Audio and voice cloning technology "
        "that offers voice synthesis, emotional expression, and multilingual "
        "capabilities.\n\nThe HD version is optimized for high-fidelity "
        "applications like voiceovers and audiobooks. While the turbo one is "
        "designed for real-time applications with low latency.\n\nWhen using "
        "this model on Replicate, each character represents 1 token."
    ),
    "pitch": 0,
    "speed": 1,
    "volume": 1,
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "angry",
    "audioSampleRate": 32000,
    "voiceIdentifier": "Deep_Voice_Man",
    "languageEnhancement": "English",
    "enableEnglishNormalization": True
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
print("------------------------------------------------")
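Once the action succeeds, the result contains a link to the generated audio file. A minimal sketch of fetching that file follows; note that the exact key holding the link ("output" and "audio_url" below) is an assumption about the response shape and should be adjusted to the documented format for your deployment:

```python
import urllib.request

def extract_audio_url(result: dict) -> str:
    # Assumed response shape: {"output": {"audio_url": "..."}}.
    # Adjust the keys to match the actual Cognitive Actions response.
    return result.get("output", {}).get("audio_url", "")

def download_audio(result: dict, path: str = "speech.mp3") -> str:
    """Save the synthesized audio from an action result to a local file."""
    url = extract_audio_url(result)
    if not url:
        raise ValueError("No audio URL found in the action result")
    with urllib.request.urlopen(url, timeout=60) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path
```

Generated audio links are typically short-lived, so downloading (or re-hosting) the file promptly after synthesis is usually the safer design.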
Conclusion
The Multilingual Voice Synthesis API offers developers a robust solution for creating high-quality audio content from text. With features that allow for emotional expression and multilingual support, this API is not only versatile but also essential for building applications that prioritize user engagement. Whether you are developing interactive tools, enhancing media content, or creating personalized experiences, the API can elevate your projects to new heights.
As a next step, explore how you can integrate this API into your applications and start transforming your text into immersive audio experiences today!