Transforming Text into Speech with the ttsds/styletts2 Cognitive Actions

25 Apr 2025

Converting text into speech can make applications more accessible and more engaging to use. The ttsds/styletts2 spec exposes a Cognitive Action that synthesizes speech from provided text. It can also take a reference recording of a specific speaker, so the generated audio uses that speaker's voice.

Prerequisites

To utilize the Cognitive Actions under the ttsds/styletts2 spec, you'll need:

  • API Key: Ensure you have an API key for the Cognitive Actions platform. This key will be necessary for authenticating your requests.
  • Basic Setup: A development environment that can make HTTP requests. Familiarity with JSON helps, since the action's inputs and outputs are JSON-formatted.

Authentication Concept

Typically, authentication is done by passing your API key in the headers of your requests. This ensures that your application can securely communicate with the Cognitive Actions service.
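As a minimal sketch of the header-based approach described above, the helper below builds request headers from an API key. The Bearer scheme and the COGNITIVE_ACTIONS_API_KEY environment variable name are assumptions for illustration; check your platform's documentation for the exact header it expects.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return HTTP headers that carry the API key (Bearer scheme assumed)."""
    if not api_key or not api_key.strip():
        raise ValueError("API key must be a non-empty string")
    return {
        "Authorization": f"Bearer {api_key.strip()}",
        "Content-Type": "application/json",
    }


# Reading the key from the environment keeps it out of source control.
# The variable name is hypothetical; the fallback is a placeholder only.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "YOUR_COGNITIVE_ACTIONS_API_KEY")
headers = build_headers(api_key)
```

Rejecting an empty key up front surfaces a configuration mistake locally instead of as an authentication error from the service.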

Cognitive Actions Overview

Generate Voice from Text

Description: This action converts provided text into synthesized speech. If a reference recording of a specific speaker is supplied, the output is synthesized in that speaker's voice.

Category: Text-to-Speech

Input

The input schema for this action accepts the following fields:

  • text (required): This is the main content that needs to be converted into speech.
  • speakerReference (optional): This is a URI reference to an audio file of the speaker's voice. If provided, this voice will be used in the synthesis.

Example Input:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}
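The two input fields can be assembled and checked before a request is sent. The sketch below is one possible way to do that; the build_payload helper is hypothetical, and the restriction of speakerReference to http(s) URIs is an assumption rather than something the schema states.

```python
from typing import Optional
from urllib.parse import urlparse


def build_payload(text: str, speaker_reference: Optional[str] = None) -> dict:
    """Build the action input, enforcing that 'text' is present."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' is required and must be a non-empty string")

    payload = {"text": text}

    # speakerReference is optional; include it only when the caller supplies one.
    if speaker_reference is not None:
        parsed = urlparse(speaker_reference)
        # Assumption: the reference audio is reachable over http or https.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("'speakerReference' must be an http(s) URI")
        payload["speakerReference"] = speaker_reference

    return payload


# Without a reference, only the required field is sent.
print(build_payload("Hello, world."))
```

Omitting the optional key entirely, rather than sending it as null, avoids guessing how the service treats empty values.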

Output

The action typically returns a URI to the generated audio file. This file contains the synthesized speech based on the input text.

Example Output:

"https://assets.cognitiveactions.com/invocations/d5a3c6d6-141a-4d12-a3da-d0916a66e3c7/1009dc9e-d84a-486a-93c4-26d88227d5bb.wav"
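Since the result is a URI rather than the audio itself, a client usually needs one more step to fetch the file. The following sketch uses only the Python standard library; the helper names are hypothetical, and it assumes the returned URI can be downloaded with a plain GET request and no extra authentication.

```python
import shutil
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_uri(uri: str, default: str = "speech.wav") -> str:
    """Use the last path segment of the URI as the local file name."""
    name = PurePosixPath(urlparse(uri).path).name
    return name or default


def download_audio(uri: str, dest_dir: str = ".") -> Path:
    """Stream the audio at 'uri' into dest_dir and return the saved path."""
    target = Path(dest_dir) / filename_from_uri(uri)
    # Assumption: the URI is publicly readable; add headers if yours is not.
    with urlopen(uri, timeout=30) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download_audio with the URI returned by the action would save the .wav file under its original name in the chosen directory.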

Conceptual Usage Example (Python)

Here’s how you might call this Cognitive Action using Python. Please note that the endpoint URL and request structure are illustrative.

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "93f270b2-3b59-40d0-9694-bd6779f19569" # Action ID for Generate Voice from Text

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Body is not valid JSON; covers every JSON decode error requests can raise
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action_id corresponds to the action for generating voice from text. The payload is constructed using the example input, and the request is sent to the hypothetical endpoint.

Conclusion

The ttsds/styletts2 Cognitive Actions provide a robust solution for transforming text into speech, enhancing user interaction with personalized voice synthesis. By utilizing the Generate Voice from Text action, developers can easily integrate voice capabilities into their applications. Consider exploring other potential use cases, such as creating interactive voice responses or enhancing accessibility features within your applications. Start integrating today and elevate your user experience!