Transform Text to Multilingual Speech with MeloTTS: A Developer's Guide

22 Apr 2025

Converting text into natural-sounding speech is now a common requirement in AI-powered applications. The MeloTTS Cognitive Actions, part of the cjwbw/melotts specification, let developers integrate high-quality multilingual text-to-speech into their applications, with a choice of languages, dialects, and speech speeds.

Prerequisites

Before diving into the integration of MeloTTS, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • Basic knowledge of making HTTP requests and handling JSON data.
  • Familiarity with Python for executing API calls.

Authentication typically involves passing the API key in the request headers when invoking the Cognitive Actions.
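
As a minimal sketch of that step, the helper below builds the request headers from an API key. The Bearer scheme mirrors the conceptual example later in this guide; the environment variable name COGNITIVE_ACTIONS_API_KEY and the function name build_headers are assumptions made for illustration, not part of the platform's documented interface.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token.

    The key is taken from the argument or, failing that, from the
    COGNITIVE_ACTIONS_API_KEY environment variable (an assumed name).
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError(
            "No API key found: pass api_key or set COGNITIVE_ACTIONS_API_KEY."
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly remains possible for tests or scripts.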

Cognitive Actions Overview

Generate Multilingual Speech with MeloTTS

The Generate Multilingual Speech with MeloTTS action converts text into high-quality speech in multiple languages. It is categorized under text-to-speech and supports an adjustable speech speed and, for English, a choice of speaker.

Input

The input for this action requires a JSON object structured according to the following schema:

{
  "text": "The field of text-to-speech has seen rapid development recently.",
  "speed": 1,
  "speaker": "EN-BR",
  "language": "EN"
}

  • text (required): The content to be converted into speech. (Example: "The field of text-to-speech has seen rapid development recently.")
  • speed (optional): The rate at which the speech output is delivered, ranging from 0.1 (10 times slower) to 10 (10 times faster). (Default: 1)
  • speaker (optional): Specifies the speaker for English language models. For non-English languages, use '-'. (Example: "EN-BR")
  • language (optional): The language of the speech output. Supported options include: EN (English), ES (Spanish), FR (French), ZH (Chinese), JP (Japanese), KR (Korean). (Default: EN)
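
The field rules above can be checked on the client before a request is sent. The following is a rough sketch under that reading of the schema; build_payload and SUPPORTED_LANGUAGES are hypothetical names introduced here, and the service may apply its own validation that differs from this.

```python
# Language codes listed in the schema above.
SUPPORTED_LANGUAGES = {"EN", "ES", "FR", "ZH", "JP", "KR"}


def build_payload(text, language="EN", speaker=None, speed=1.0):
    """Build an input payload, enforcing the documented field constraints."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text is required and must be a non-empty string")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(SUPPORTED_LANGUAGES)}")
    if not 0.1 <= speed <= 10:
        raise ValueError("speed must be between 0.1 and 10")

    payload = {"text": text, "speed": speed, "language": language}
    if language != "EN":
        # The schema says non-English languages use '-' as the speaker.
        payload["speaker"] = "-"
    elif speaker is not None:
        payload["speaker"] = speaker
    return payload
```

For example, build_payload("Bonjour", language="FR") sets the speaker to "-" automatically, while build_payload("Hello", speaker="EN-BR") passes the English speaker through unchanged.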

Output

Upon successful execution, the action returns a URL pointing to the generated speech audio file. For example:

https://assets.cognitiveactions.com/invocations/5fa16356-9991-4976-acfe-206104fbcb31/dd30dfd4-e017-4aeb-b0de-7006b8811bfc.wav

This URL can be used to access and play the resulting audio file.
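
To save the audio locally rather than only play it from the URL, a small download helper is enough. The sketch below uses only the Python standard library; filename_from_url and download_audio are illustrative names, and the code assumes the URL is publicly fetchable without extra headers, which this guide does not confirm.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="speech.wav"):
    """Return the last path segment of the URL, or a default if there is none."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def download_audio(url, dest_dir=".", timeout=30):
    """Stream the audio file at `url` into dest_dir and return the local path."""
    dest_path = os.path.join(dest_dir, filename_from_url(url))
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest_path
```

Calling download_audio with the URL returned by the action would write the .wav file into the current directory under the name taken from the URL.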

Conceptual Usage Example (Python)

Here's a conceptual example of how a developer might call this action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "7cef1f8a-1a55-4631-8ed2-0fe33fb3e913"  # Action ID for Generate Multilingual Speech with MeloTTS

# Construct the input payload based on the action's requirements
payload = {
    "text": "The field of text-to-speech has seen rapid development recently.",
    "speed": 1,
    "speaker": "EN-BR",
    "language": "EN"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors subclass ValueError, whichever JSON backend requests uses
            print(f"Response body: {e.response.text}")

In this example, the script sets up the authorization headers and constructs the input JSON payload according to the action's schema. The action ID tells the platform which action to run. The endpoint URL and the request body structure are hypothetical, so adjust them to match the platform's actual API before use.

Conclusion

The MeloTTS Cognitive Action offers a straightforward way to add multilingual speech synthesis to your applications, with control over language, speaker, and speed. A single request returns a URL to the generated audio, which you can play or store as needed. Consider exploring additional use cases to get the most from it in your projects. Happy coding!