Generate Multilingual Speech with Zonos Cognitive Actions

23 Apr 2025

High-quality, expressive speech in multiple languages is an increasingly common requirement for voice applications. The Zonos Cognitive Actions give developers text-to-speech capabilities they can integrate into their own applications. With Zonos, you can create multilingual voice clones that convey emotion as well as information, which makes interactions with technology feel more natural. This post walks through the "Generate Multilingual Voice Clone" action and shows how to call it.

Prerequisites

Before you can start using the Zonos Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform, which will authenticate your requests.
  • Basic understanding of JSON and how to structure API requests.

Authentication typically involves passing your API key in the headers of your requests.
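As a minimal sketch of that header-based authentication, the helper below builds the request headers from an API key. The Bearer-token scheme, the helper name `build_headers`, and the environment variable name `COGNITIVE_ACTIONS_API_KEY` are assumptions made for illustration; check your platform documentation for the exact header it expects.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return request headers carrying the API key as a Bearer token (assumed scheme)."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    # COGNITIVE_ACTIONS_API_KEY is a hypothetical environment variable name.
    key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")
    if key:
        print(sorted(build_headers(key)))  # print header names only, not the secret
    else:
        print("Set COGNITIVE_ACTIONS_API_KEY before sending requests.")
```

Reading the key from the environment keeps it out of source control, which is usually preferable to hard-coding it in a script.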

Cognitive Actions Overview

Generate Multilingual Voice Clone

Purpose:
This action generates multilingual speech using Zonos, a text-to-speech model built by Zyphra. It supports voice cloning from short reference clips and lets you control parameters such as emotion and speaking rate. The output is high-fidelity audio at 44 kHz.

Category: Text-to-Speech

Input:
The input for this action is structured as follows:

  • text (required): The text to be converted into speech.
  • seed (optional): An integer to initialize the random number generator for consistent results.
  • audio (optional): A URI pointing to an audio file for voice cloning.
  • emotion (optional): A comma-separated list of floats representing an emotion vector.
  • language (optional): The language code for speech generation, defaulting to "en-us".
  • modelVariant (optional): Specifies the model type to use, defaulting to "transformer".
  • speakingRate (optional): The speaking rate in phonemes per second. A default is applied when the field is omitted.
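The field list above can be sketched as a small payload builder. This is an illustration, not part of any official SDK: the function name `build_payload` is hypothetical, and it assumes the service accepts requests in which unset optional fields are simply left out. The length and ordering of the emotion vector are not specified here, so any values you pass are your own responsibility.

```python
from typing import Optional, Sequence


def build_payload(
    text: str,
    audio: Optional[str] = None,
    emotion: Optional[Sequence[float]] = None,
    language: str = "en-us",
    model_variant: str = "transformer",
    speaking_rate: Optional[float] = None,
    seed: Optional[int] = None,
) -> dict:
    """Assemble the action input, omitting optional fields that were not set."""
    if not text.strip():
        raise ValueError("text is required")
    payload = {"text": text, "language": language, "modelVariant": model_variant}
    if audio is not None:
        payload["audio"] = audio
    if emotion is not None:
        # The action takes the emotion vector as a comma-separated string of floats.
        payload["emotion"] = ",".join(str(float(x)) for x in emotion)
    if speaking_rate is not None:
        payload["speakingRate"] = speaking_rate
    if seed is not None:
        payload["seed"] = seed
    return payload
```

Calling `build_payload("Hello", seed=1)` yields a dictionary with `text`, `language`, `modelVariant`, and `seed`, ready to be serialized as JSON.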

Example Input:

{
  "seed": 1,
  "text": "Hi! I'm Zonos, a text-to-speech model built by Zyphra...",
  "audio": "https://replicate.delivery/pbxt/MUEtXI54W68rj2eUER8rrkaRNUPjtqZdVXN5hQnhmRVMBqwC/richard_sample.wav",
  "emotion": "",
  "language": "en-us",
  "modelVariant": "transformer",
  "speakingRate": 15
}

Output:
The action typically returns a URI pointing to the generated audio file. For example:

https://assets.cognitiveactions.com/invocations/d898583a-fd72-435b-81d9-e6e36d7e8b27/c37c1986-2d70-4685-b940-8aa9985497de.wav
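Once you have the output URI, you will usually want the audio on disk. The sketch below uses only the Python standard library and assumes the URI can be fetched without extra authentication, which may not hold for your account. The helper names `filename_from_uri` and `download_audio` are hypothetical.

```python
import os
import posixpath
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_uri(uri: str, default: str = "output.wav") -> str:
    """Return the last path segment of the URI, or a default if there is none."""
    name = posixpath.basename(urlparse(uri).path)
    return name or default


def download_audio(uri: str, dest_dir: str = ".") -> str:
    """Stream the audio file into dest_dir and return the local path."""
    path = os.path.join(dest_dir, filename_from_uri(uri))
    # Assumes the URI is readable without additional credentials.
    with urlopen(uri, timeout=60) as resp, open(path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return path
```

Streaming with `shutil.copyfileobj` avoids loading the whole file into memory, which matters for longer clips.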

Conceptual Usage Example (Python): Here's how you might structure a call to this action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "ee09c9f5-be03-4725-90af-16da5d5f681b" # Action ID for Generate Multilingual Voice Clone

# Construct the input payload based on the action's requirements
payload = {
    "seed": 1,
    "text": "Hi! I'm Zonos, a text-to-speech model built by Zyphra...",
    "audio": "https://replicate.delivery/pbxt/MUEtXI54W68rj2eUER8rrkaRNUPjtqZdVXN5hQnhmRVMBqwC/richard_sample.wav",
    "emotion": "",
    "language": "en-us",
    "modelVariant": "transformer",
    "speakingRate": 15
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # response.json() raises a ValueError subclass on a non-JSON body
            print(f"Response body: {e.response.text}")

In this snippet, replace the placeholder API key with your own key and the hypothetical endpoint URL with the one for your account. The action ID corresponds to the "Generate Multilingual Voice Clone" action, and the payload follows the input schema described above.

Conclusion

Integrating the Zonos Cognitive Actions into your applications can make user interactions more engaging by adding expressive, multilingual speech. With the steps in this post, you can start using text-to-speech in your own projects. Possible use cases include personalized audiobooks, virtual assistants, and language learning tools.