Transform Your Applications with Text-to-Speech Using WhisperSpeech Actions

21 Apr 2025

Integrating text-to-speech into your applications can improve the user experience by making content more accessible and engaging. The WhisperSpeech Cognitive Actions, part of the lucataco/whisperspeech-small spec, give developers tools to convert text into high-quality speech. This open-source system builds on models such as Whisper, EnCodec, and Vocos, and aims to deliver speech synthesis that rivals traditional methods.

Prerequisites

Before diving into the integration of the WhisperSpeech Cognitive Actions, ensure you have the following:

  • API Key: You will need an API key for the Cognitive Actions platform to authenticate your requests.
  • Setup: Familiarize yourself with passing the API key in the headers of your requests, which is typically done using the Authorization header.
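
As a minimal sketch of the header setup described above, the helper below builds the request headers from an API key. It assumes the key is stored in an environment variable named COGNITIVE_ACTIONS_API_KEY (a hypothetical name chosen for illustration) and that the platform accepts a Bearer token, as the Python usage example in this article does. The function name build_headers is also an assumption, not part of any official client.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token.

    Assumption: the platform expects "Authorization: Bearer <key>".
    """
    # COGNITIVE_ACTIONS_API_KEY is a hypothetical environment variable name.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("No API key provided; set COGNITIVE_ACTIONS_API_KEY")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly remains possible for tests or scripts.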

Cognitive Actions Overview

Convert Text to Speech with WhisperSpeech

This action converts a string of text into synthesized speech. It supports zero-shot voice cloning and is designed to produce high-quality audio, aiming to do for speech synthesis what Stable Diffusion did for image synthesis.

Input

The input schema for this action is a JSON object with the following fields:

  • prompt: A string containing the text to be converted into speech.
  • languageCode: A string indicating the language for synthesis; supported values are en (English) and pl (Polish). Defaults to en.
  • voiceProfileUrl: A string that provides a URL to an audio file for zero-shot voice cloning, enabling the model to mimic the specified voice.

Example Input
{
  "prompt": "This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer",
  "languageCode": "en",
  "voiceProfileUrl": ""
}
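
The input fields above can be assembled and checked before any request is sent. The following is a minimal sketch: the field names and the supported language codes come from the schema above, while the function name build_payload, the non-empty prompt check, and the URL scheme check are assumptions added for illustration rather than documented platform behavior.

```python
SUPPORTED_LANGUAGES = {"en", "pl"}  # values listed in the input schema


def build_payload(prompt, language_code="en", voice_profile_url=""):
    """Build the input JSON object for the text-to-speech action."""
    # Assumption: an empty or whitespace-only prompt is not useful to send.
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"languageCode must be one of {sorted(SUPPORTED_LANGUAGES)}, "
            f"got {language_code!r}"
        )
    # Assumption: when a voice profile is given, it should be an HTTP(S) URL.
    if voice_profile_url and not voice_profile_url.startswith(("http://", "https://")):
        raise ValueError("voiceProfileUrl must be an http(s) URL or empty")
    return {
        "prompt": prompt,
        "languageCode": language_code,
        "voiceProfileUrl": voice_profile_url,
    }
```

Calling build_payload("Hello") yields the same shape as the example input, with languageCode falling back to en.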

Output

Upon successful execution, the action returns a URL pointing to the generated audio file in WAV format.
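
Because the result is a URL rather than raw audio, a client usually needs one more step to fetch the file. The sketch below uses only the Python standard library; the helper names filename_from_url and download_audio are hypothetical, and it assumes the returned URL is publicly readable over HTTP(S) and ends in a .wav file name.

```python
import posixpath
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(audio_url):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(urlparse(audio_url).path)
    if not name.lower().endswith(".wav"):
        raise ValueError(f"Expected a URL ending in .wav, got: {audio_url}")
    return name


def download_audio(audio_url, dest_dir="."):
    """Stream the generated WAV file to dest_dir and return its local path."""
    # Only fetch over HTTP(S); reject other schemes such as file://.
    if urlparse(audio_url).scheme not in ("http", "https"):
        raise ValueError("audio_url must use http or https")
    dest_path = posixpath.join(dest_dir, filename_from_url(audio_url))
    with urllib.request.urlopen(audio_url, timeout=60) as response, \
            open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)  # copy in chunks, not all at once
    return dest_path
```

Streaming with shutil.copyfileobj avoids loading the whole file into memory, which matters for longer passages of synthesized speech.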

Example Output
"https://assets.cognitiveactions.com/invocations/a521a8ec-1dcc-4f1b-9ac8-98a7382cbbf9/97ddc8df-5f02-4dca-8b55-4d8a82bf5270.wav"

Conceptual Usage Example (Python)

Here's a conceptual Python code snippet demonstrating how to call this action using a hypothetical Cognitive Actions execution endpoint:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "0209c4e9-3cf8-4538-a2bb-e55e55e7edea"  # Action ID for Convert Text to Speech with WhisperSpeech

# Construct the input payload based on the action's requirements
payload = {
    "prompt": "This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer",
    "languageCode": "en",
    "voiceProfileUrl": ""
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors from requests subclass ValueError
            print(f"Response body: {e.response.text}")

To adapt this snippet, substitute your own API key and action ID and adjust the input payload as needed. The endpoint URL and request structure shown here are illustrative and may differ in your setup.

Conclusion

The WhisperSpeech Cognitive Actions give developers a straightforward way to add text-to-speech to their applications. With zero-shot voice cloning and support for English and Polish, these actions offer a flexible way to increase user engagement through high-quality audio output. Consider exploring additional use cases, such as voice assistants, audiobooks, or interactive learning tools, to make full use of WhisperSpeech in your projects.