Transform Your Applications with Text-to-Speech Using WhisperSpeech Actions

Integrating advanced text-to-speech capabilities into your applications can greatly enhance user experience, making content more accessible and engaging. The WhisperSpeech Cognitive Actions, part of the lucataco/whisperspeech-small spec, provide developers with powerful tools to convert text into high-quality speech. This open-source system employs cutting-edge models such as Whisper, EnCodec, and Vocos to deliver speech synthesis that rivals traditional methods.
Prerequisites
Before diving into the integration of the WhisperSpeech Cognitive Actions, ensure you have the following:
- API Key: You will need an API key for the Cognitive Actions platform to authenticate your requests.
- Setup: Familiarize yourself with passing the API key in the headers of your requests, which is typically done using the `Authorization` header.
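As a minimal sketch of the setup step, the headers for each request can be built as follows. The Bearer scheme shown here is an assumption; confirm the exact authentication format in your platform's documentation.

```python
# Build the request headers for the Cognitive Actions platform.
# NOTE: the Bearer scheme is an assumption, not confirmed by the spec.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

print(headers["Authorization"])  # → Bearer YOUR_COGNITIVE_ACTIONS_API_KEY
```

These headers are reused for every action invocation shown later in this article.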
Cognitive Actions Overview
Convert Text to Speech with WhisperSpeech
This action allows you to convert a string of text into synthesized speech. It leverages voice cloning capabilities and is designed to create high-quality audio output, similar in impact to what Stable Diffusion has achieved in the realm of image synthesis.
Input
The input schema for this action is a JSON object requiring the following fields:
- prompt: A string containing the text to be converted into speech.
- languageCode: A string indicating the language for synthesis; supported values are `en` (English) and `pl` (Polish). Defaults to `en`.
- voiceProfileUrl: A string providing a URL to an audio file for zero-shot voice cloning, enabling the model to mimic the specified voice.
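The field constraints above can be sketched as a small helper that validates the language code and applies the documented `en` default. The helper name is our own, not part of the spec:

```python
SUPPORTED_LANGUAGES = {"en", "pl"}  # per the input schema above

def build_tts_input(prompt: str, language_code: str = "en",
                    voice_profile_url: str = "") -> dict:
    """Build the input object for the WhisperSpeech action."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported languageCode: {language_code!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return {
        "prompt": prompt,
        "languageCode": language_code,
        "voiceProfileUrl": voice_profile_url,
    }

print(build_tts_input("Hello world")["languageCode"])  # → en
```

Centralizing validation like this keeps malformed payloads from ever reaching the API.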
Example Input
```json
{
  "prompt": "This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and LAION on the Juwels supercomputer",
  "languageCode": "en",
  "voiceProfileUrl": ""
}
```
Output
Upon successful execution, the action returns a URL pointing to the generated audio file in WAV format.
Example Output
"https://assets.cognitiveactions.com/invocations/a521a8ec-1dcc-4f1b-9ac8-98a7382cbbf9/97ddc8df-5f02-4dca-8b55-4d8a82bf5270.wav"
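Since the action returns a bare URL string, a quick sanity check before downloading can look like the sketch below. The path layout is taken from the example output above and may vary:

```python
from urllib.parse import urlparse

def audio_filename(url: str) -> str:
    """Extract the file name from the returned audio URL, verifying it is a WAV."""
    path = urlparse(url).path
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".wav"):
        raise ValueError(f"Expected a .wav URL, got: {url}")
    return name

url = ("https://assets.cognitiveactions.com/invocations/"
       "a521a8ec-1dcc-4f1b-9ac8-98a7382cbbf9/"
       "97ddc8df-5f02-4dca-8b55-4d8a82bf5270.wav")
print(audio_filename(url))  # → 97ddc8df-5f02-4dca-8b55-4d8a82bf5270.wav
```

From there you can fetch the file with any HTTP client and save it to disk.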
Conceptual Usage Example (Python)
Here's a conceptual Python code snippet demonstrating how to call this action using a hypothetical Cognitive Actions execution endpoint:
```python
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

# Action ID for Convert Text to Speech with WhisperSpeech
action_id = "0209c4e9-3cf8-4538-a2bb-e55e55e7edea"

# Construct the input payload based on the action's requirements
payload = {
    "prompt": "This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and LAION on the Juwels supercomputer",
    "languageCode": "en",
    "voiceProfileUrl": ""
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # response body was not valid JSON
            print(f"Response body: {e.response.text}")
```
In this snippet, replace the API key and action ID with your own values and structure the input JSON to match the action's schema. Note that the endpoint URL and request structure are illustrative and may differ based on your setup.
Conclusion
The WhisperSpeech Cognitive Actions offer a robust solution for developers looking to integrate advanced text-to-speech functionalities into their applications. With the ability to customize voice profiles and support multiple languages, these actions provide a flexible and powerful way to enhance user engagement through high-quality audio output. Consider exploring additional use cases, such as creating voice assistants, audiobooks, or interactive learning tools, to leverage the full potential of WhisperSpeech in your projects.