Transform Text to Speech Effortlessly with ttsds/xtts_1 Cognitive Actions

22 Apr 2025

Converting text to speech can improve accessibility, content consumption, and interactive applications. The ttsds/xtts_1 spec provides Cognitive Actions that let developers add text-to-speech to their applications. With the "Synthesize Speech from Text" action, you can turn written content into spoken audio in a chosen language, using a speaker reference audio file to guide the voice.

Prerequisites

Before diving into the integration of Cognitive Actions, ensure you have:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Basic knowledge of JSON, since the action's input and output use this format.

Authentication typically involves including your API key in the request headers, which allows secure access to the service.
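
As a minimal sketch of the header-based authentication described above, the helper below builds the request headers from an API key. The Bearer scheme matches the usage example later in this post; the function name build_headers and the environment variable COGNITIVE_ACTIONS_API_KEY are assumptions made for illustration, not part of the platform.

```python
import os


def build_headers(api_key=None):
    """Return request headers that carry the API key as a Bearer token."""
    # COGNITIVE_ACTIONS_API_KEY is a hypothetical environment variable name.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError(
            "No API key provided; pass api_key or set COGNITIVE_ACTIONS_API_KEY."
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing api_key explicitly overrides the environment value.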

Cognitive Actions Overview

Synthesize Speech from Text

Description: This action converts text into speech using a specified speaker reference and language. It supports a variety of languages and allows developers to customize the synthesized voice, providing a flexible solution for diverse applications.

  • Category: Text-to-Speech

Input

The input for this action is structured as follows:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "language": "en",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

Required Fields:

  • text: The content you wish to synthesize into speech.
  • speakerReference: A URI pointing to a reference audio file (such as a .wav) that guides the voice of the synthesized speech.

Optional Fields:

  • language: The language code for the text (default is "en" for English).
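
The field rules above could be checked on the client before sending a request. The sketch below is one way to do that; build_payload is a hypothetical helper, and restricting speakerReference to http(s) URIs is an assumption rather than a documented constraint of the action.

```python
from urllib.parse import urlparse


def build_payload(text, speaker_reference, language="en"):
    """Build the action input, enforcing the required fields."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text is required and must be a non-empty string")
    # Assumption: the reference audio is fetched over http(s).
    parsed = urlparse(speaker_reference)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("speakerReference is required and must be an http(s) URI")
    return {
        "text": text,
        "language": language,  # optional; defaults to "en"
        "speakerReference": speaker_reference,
    }
```

Calling build_payload("Hello there.", "https://example.com/voice.wav") yields a dictionary with language set to "en", matching the default described above.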

Output

Upon successful execution, the action returns a URL pointing to the audio file of the synthesized speech. For example:

"https://assets.cognitiveactions.com/invocations/9d85fab9-3515-4daf-b43f-27b0ed15960c/4e6afd9d-e583-4ea3-bd8a-3675acedd122.wav"

This URL can be used to play or download the synthesized speech.
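
To illustrate the download case, here is a small sketch that saves the returned audio to disk using only the Python standard library. The helper names filename_from_url and download_audio are hypothetical, and the sketch assumes the URL can be fetched with a plain GET request without extra authentication.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(audio_url, default="speech.wav"):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(urlparse(audio_url).path)
    return name or default


def download_audio(audio_url, dest_dir="."):
    """Fetch the synthesized audio and return the path it was written to."""
    dest = os.path.join(dest_dir, filename_from_url(audio_url))
    # Assumption: the output URL is readable without additional headers.
    with urllib.request.urlopen(audio_url, timeout=30) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest
```

For the example URL above, filename_from_url returns the final .wav segment, so the file keeps the identifier assigned by the service.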

Conceptual Usage Example (Python)

Here’s how you might call the Synthesize Speech from Text action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "0ce97daa-4ea3-4316-b699-12f07e9e8e4f"  # Action ID for Synthesize Speech from Text

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "language": "en",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60  # Fail instead of hanging indefinitely if the service does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body was not valid JSON (covers json and simplejson decode errors)
            print(f"Response body: {e.response.text}")

In this snippet, replace the API key placeholder with your own key and confirm the endpoint, which is hypothetical here, as is the shape of the request body. The payload variable follows the input schema described above: text and speakerReference are required, and language is optional.

Conclusion

The ttsds/xtts_1 Cognitive Actions, particularly Synthesize Speech from Text, give developers a way to convert text into speech across multiple languages and speaker voices. By using these actions, you can make the content in your applications more accessible and engaging.

Explore these capabilities further and consider how they can fit into your next project!