Create Engaging Audio Experiences with Guided Text to Speech Cognitive Actions

22 Apr 2025

Demand for interactive audio experiences keeps growing. The Guided Text to Speech Cognitive Actions, based on the lee101/guided-text-to-speech specification, let developers generate synthetic speech whose voice is steered by a plain-language description. By using these pre-built actions, you can integrate text-to-speech into your applications and improve user engagement and accessibility.

Prerequisites

Before you start using the Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform, which will be used to authenticate your requests.
  • Familiarity with JSON format, as the input and output for the actions are structured in JSON.

Authentication typically involves passing your API key in the request headers to ensure secure access to the action functionalities.
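
As a minimal sketch of that header-based authentication, the helper below reads the key from an environment variable instead of hard-coding it. The variable name COGNITIVE_ACTIONS_API_KEY and the function name build_headers are assumptions made for illustration; the Bearer scheme mirrors the conceptual Python snippet later in this article and should be checked against the platform's own documentation.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token."""
    # Assumption: the key lives in an environment variable with this name.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError(
            "No API key found; pass api_key or set COGNITIVE_ACTIONS_API_KEY."
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Keeping the key out of source code makes it harder to leak through version control.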

Cognitive Actions Overview

Generate Guided Text to Speech

The Generate Guided Text to Speech action synthesizes speech from two inputs: a free-text description of the voice and the text to be spoken. The description can cover attributes such as tone, emotion, speed, and background noise, which gives you finer control over the resulting audio.

  • Category: Text-to-Speech
  • Purpose: To create synthetic speech that accurately reflects the specified characteristics of the voice and the provided text prompt.

Input

This action requires two input fields: voice and prompt.

  • voice: A string that describes the speaker's voice characteristics, including elements like gender, pitch, pace, and clarity.
    • Example: "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio."
  • prompt: A string that contains the text to be spoken.
    • Example: "hi whats the weather?"

Here’s a JSON example of the input payload:

{
  "voice": "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
  "prompt": "hi whats the weather?"
}
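
The same payload can be assembled in Python before it is sent. The sketch below is illustrative only: build_payload is a hypothetical helper, and the rule that both fields must be non-empty strings is an assumption rather than a documented constraint of the action.

```python
import json


def build_payload(voice: str, prompt: str) -> dict:
    """Check the two input fields and return the payload dictionary."""
    # Assumption: both fields must be non-empty strings.
    for name, value in (("voice", voice), ("prompt", prompt)):
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{name}' must be a non-empty string")
    return {"voice": voice.strip(), "prompt": prompt.strip()}


payload = build_payload(
    "A female speaker with a slightly low-pitched, quite monotone voice "
    "delivers her words at a slightly faster-than-average pace in a "
    "confined space with very clear audio.",
    "hi whats the weather?",
)
print(json.dumps(payload, indent=2))
```

Validating locally catches an empty field before a request is spent on it.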

Output

Upon successful execution, the action returns a URL pointing to the generated audio file (an MP3 in the example below), which you can download or play back.

  • Example Output:
    "https://assets.cognitiveactions.com/invocations/4e5c0363-f033-493b-a042-0428cab7f1c5/757e5b42-6054-4b38-9efd-9cab58fb1cbf.mp3"
    
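Once you have the URL, saving the audio locally might look like the sketch below. It uses only the Python standard library and assumes the URL can be fetched without extra authentication headers, which this article does not confirm. The helper names filename_from_url and download_audio are hypothetical.

```python
import os
import posixpath
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(audio_url: str) -> str:
    """Return the last path segment of the URL, used as the local file name."""
    return posixpath.basename(urlparse(audio_url).path)


def download_audio(audio_url: str, dest_dir: str = ".") -> str:
    """Download the audio file into dest_dir and return the local path."""
    local_path = os.path.join(dest_dir, filename_from_url(audio_url))
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urlopen(audio_url, timeout=30) as response, open(local_path, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return local_path
```

Calling download_audio with the returned URL writes the file under its original name in the current directory.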

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet demonstrating how to call the Generate Guided Text to Speech action.

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "dbedd35f-8568-4ba3-b494-e91932a9e6d2" # Action ID for Generate Guided Text to Speech

# Construct the input payload based on the action's requirements
payload = {
    "voice": "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
    "prompt": "hi whats the weather?"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30,  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON (covers every JSON decode error requests can raise)
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The payload variable is constructed using the example input, and the request is sent to the hypothetical endpoint. The output will contain the URL to the generated audio file if the request is successful.
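
Because the endpoint and request structure are hypothetical, the exact shape of the JSON response is also unknown. One hedged way to handle this is to search the decoded result for the first string that looks like a URL, whether the result is a bare string, a list, or a nested object. The function find_audio_url and the key names in the usage lines are illustrative assumptions, not part of the platform's API.

```python
def find_audio_url(result):
    """Return the first http(s) URL string found in result, or None."""
    if isinstance(result, str):
        return result if result.startswith(("http://", "https://")) else None
    if isinstance(result, dict):
        values = result.values()
    elif isinstance(result, list):
        values = result
    else:
        return None
    # Walk nested containers depth-first until a URL is found.
    for value in values:
        url = find_audio_url(value)
        if url:
            return url
    return None


# Hypothetical response shapes, for illustration only.
print(find_audio_url("https://example.com/audio.mp3"))
print(find_audio_url({"status": "ok", "output": ["https://example.com/audio.mp3"]}))
```

If the platform documents a fixed response schema, reading the URL from its named field is preferable to this generic search.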

Conclusion

The Guided Text to Speech Cognitive Actions let developers generate speech whose voice is tailored through a plain-language description, which can make applications more interactive and accessible. Explore use cases ranging from virtual assistants to educational tools, and start integrating these text-to-speech capabilities into your projects today.