Unlocking Text-to-Speech Capabilities with ttsds/whisperspeech Cognitive Actions

23 Apr 2025

The ttsds/whisperspeech API provides Cognitive Actions for converting text to speech in applications that need audio output. The actions keep the generated voice consistent with a reference speaker and support multiple languages, which makes them well suited to natural-sounding audio narration. This article walks through how to integrate these Cognitive Actions into your applications.

Prerequisites

Before you dive into integrating the Cognitive Actions, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • Basic understanding of JSON and RESTful API concepts.

For authentication, you will typically pass your API key in the headers of your requests. The Python usage example later in this article sends it as a Bearer token in the Authorization header.
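As a minimal sketch of that step, the helper below builds the request headers from an API key. It assumes the Bearer scheme used in this article's usage example; the environment variable name COGNITIVE_ACTIONS_API_KEY and the function name build_auth_headers are illustrative choices, not part of the platform.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token."""
    # Fall back to an environment variable; the variable name is an
    # assumption made for this sketch, not a platform convention.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError(
            "No API key provided and COGNITIVE_ACTIONS_API_KEY is not set"
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly is still possible for quick experiments.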

Cognitive Actions Overview

Convert Text to Speech with Speaker Consistency

This action converts text into speech whose voice stays consistent with a provided speaker reference. The WhisperSpeech model supports several languages and lets you choose the model version used for synthesis.

Input

The input for this action requires the following fields:

  • speakerReference (string, required): A URI pointing to the reference audio file that will be used for maintaining speaker consistency.
  • text (string, required): The text that you want to convert into speech. This should be a coherent string suitable for audio conversion.
  • version (string, optional): The version of the model to be used for synthesis. Options include "tiny", "base", "small", or "medium", with "small" as the default.
  • language (string, optional): The language of the text. Options include English (en), Polish (pl), German (de), French (fr), Italian (it), Dutch (nl), Spanish (es), and Portuguese (pt), with English as the default.
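The field list above can be turned into a small client-side check before any request is sent. The sketch below is one possible way to do that: the allowed values and defaults come from the list above, while the function name build_tts_payload and the specific error messages are hypothetical.

```python
ALLOWED_VERSIONS = {"tiny", "base", "small", "medium"}
ALLOWED_LANGUAGES = {"en", "pl", "de", "fr", "it", "nl", "es", "pt"}


def build_tts_payload(text, speaker_reference, version="small", language="en"):
    """Validate the inputs and return a payload dict for the action."""
    if not text or not text.strip():
        raise ValueError("text is required and must not be empty")
    if not speaker_reference:
        raise ValueError("speakerReference is required")
    if version not in ALLOWED_VERSIONS:
        raise ValueError(f"version must be one of {sorted(ALLOWED_VERSIONS)}")
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(ALLOWED_LANGUAGES)}")
    return {
        "text": text,
        "version": version,
        "language": language,
        "speakerReference": speaker_reference,
    }
```

Calling build_tts_payload("Hello", "https://example.com/ref.wav") returns a payload with the documented defaults filled in, and an unsupported version or language fails locally instead of costing a round trip to the API.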

Example Input:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "version": "small",
  "language": "en",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

Output

Upon successful execution, the action returns a URL pointing to the synthesized audio file.

Example Output:

https://assets.cognitiveactions.com/invocations/7c0ea921-0c15-4ae2-8221-9dafbf4a1110/bab740d8-5a44-4e9f-b28f-87382379ddb1.wav
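Since the output is a URL rather than the audio itself, you will usually want to fetch the file. The following sketch uses only the Python standard library and assumes the returned URL can be downloaded without additional authentication, which this article does not confirm; the helper names filename_from_url and download_audio are hypothetical.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(audio_url, default="output.wav"):
    """Derive a local filename from the last path segment of the URL."""
    name = posixpath.basename(urlparse(audio_url).path)
    return name or default


def download_audio(audio_url, dest_dir="."):
    """Download the synthesized audio and return the local file path."""
    dest_path = os.path.join(dest_dir, filename_from_url(audio_url))
    # Assumes the URL is publicly readable; add headers if it is not.
    with urllib.request.urlopen(audio_url, timeout=60) as response:
        with open(dest_path, "wb") as out:
            out.write(response.read())
    return dest_path
```

The filename is taken from the URL path so that each invocation's output keeps its unique name on disk.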

Conceptual Usage Example (Python)

Here’s how you might implement this action in Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "f6230c8b-ce34-4eff-99ee-ef8ee9611e2f" # Action ID for Convert Text to Speech with Speaker Consistency

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "version": "small",
    "language": "en",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Body is not valid JSON (covers JSON decode errors across requests versions)
            print(f"Response body: {e.response.text}")

In this example, replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key. The action_id corresponds to the "Convert Text to Speech with Speaker Consistency" action, and the payload follows the input schema outlined above. The endpoint URL and the request body structure are hypothetical, as the code comments note, so confirm both before relying on them.

Conclusion

The ttsds/whisperspeech Cognitive Actions convert text into speech while keeping the voice consistent with a reference speaker across the eight supported languages. As a next step, consider exploring additional use cases or integrating other Cognitive Actions to expand your application's capabilities. Happy coding!