Create Realistic Speech with Tortoise TTS Cognitive Actions

Integrating text-to-speech into your applications can significantly improve the user experience. The afiaka87/tortoise-tts Cognitive Action lets developers generate realistic speech from text and clone voices from pre-recorded audio. It is based on Tortoise TTS, the model developed by James Betker (also known as 'neonbjb'), and offers several quality presets and voice customization options.
Prerequisites
To use the Tortoise TTS Cognitive Action, you will need:
- An API key for the Cognitive Actions platform to authenticate your requests. This key is typically passed in the headers of your API calls.
- A basic understanding of JSON to structure your input data correctly.
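As a minimal sketch of the first prerequisite, the key can be read from the environment instead of being hard-coded. The variable name COGNITIVE_ACTIONS_API_KEY and the helper build_headers are illustrative assumptions; the Bearer scheme mirrors the Python example later in this article and should be checked against the platform's documentation.

```python
import os


def build_headers(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> dict:
    """Build request headers from an API key stored in the environment.

    The variable name and the Bearer scheme are assumptions for illustration.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Keeping the key out of source files makes it less likely to end up in version control by accident.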
Cognitive Actions Overview
Generate and Clone Voices
This action generates realistic speech from text and can clone a voice from an MP3 file. Four quality presets are available: ultra_fast, fast, standard, and high_quality.
Input
The action accepts the following input fields:
- seed (integer): A random seed for reproducibility. Default is 0.
  Example: 0
- text (string): The text to be spoken.
  Example: "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them."
- voicePreset (string): Specifies the quality level of the voice synthesis. Options include 'ultra_fast', 'fast', 'standard', and 'high_quality'. Default is 'fast'.
  Example: "fast"
- primaryVoice (string): Choose the primary voice for generation. Use 'random' for a random selection or 'custom_voice' for a custom option. Default is 'random'.
  Example: "custom_voice"
- cvvpInfluence (number): Controls the influence of the CVVP model on the output, ranging from 0 to 1. A higher value may reduce the likelihood of multiple speakers. Default is 0 (disabled).
- secondaryVoice (string): (Optional) Averages the latents of multiple voices to create a new voice. Use 'disabled' to turn off voice mixing. Default is 'disabled'.
- tertiaryVoice (string): (Optional) Similar to secondaryVoice. Default is 'disabled'.
- customVoiceUri (string): (Optional) A URI to an MP3 file for creating a custom voice. The audio must be at least 15 seconds long, feature a single speaker, and be in MP3 format. This overrides the primary voice setting.
  Example: "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3"
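To make the idea behind secondaryVoice and tertiaryVoice concrete, here is a toy sketch of averaging voice latents. It assumes each voice is represented by a fixed-length vector of floats; the vectors and the average_latents helper are hypothetical, and the actual model works on much larger tensors internally.

```python
def average_latents(latents: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized latent vectors (toy illustration)."""
    if not latents:
        raise ValueError("At least one latent vector is required")
    length = len(latents[0])
    if any(len(vec) != length for vec in latents):
        raise ValueError("All latent vectors must have the same length")
    return [sum(values) / len(latents) for values in zip(*latents)]


# Hypothetical latents for two voices; the blend sits halfway between them.
voice_a = [0.0, 1.0, 4.0]
voice_b = [2.0, 3.0, 0.0]
blended = average_latents([voice_a, voice_b])
print(blended)  # [1.0, 2.0, 2.0]
```

Adding a third vector to the list would correspond to also setting tertiaryVoice.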
Here’s a practical example of the JSON payload needed to invoke the action:
{
  "seed": 0,
  "text": "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.",
  "voicePreset": "fast",
  "primaryVoice": "custom_voice",
  "tertiaryVoice": "disabled",
  "customVoiceUri": "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3",
  "secondaryVoice": "disabled"
}
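The constraints in the field list above can be checked on the client before a request is sent. The sketch below is one possible validator, not part of the platform: validate_payload is a hypothetical helper, and its rules (allowed presets, the 0 to 1 range for cvvpInfluence, an MP3 file for custom voices) come from the field descriptions in this section. It assumes that text is required, that omitted fields fall back to the documented defaults, and that a custom voice URI ends in ".mp3", which is only a heuristic.

```python
from urllib.parse import urlparse

ALLOWED_PRESETS = {"ultra_fast", "fast", "standard", "high_quality"}


def validate_payload(payload: dict) -> list[str]:
    """Return a list of problems found in the payload (empty if none)."""
    problems = []

    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text must be a non-empty string")

    preset = payload.get("voicePreset", "fast")
    if preset not in ALLOWED_PRESETS:
        problems.append(f"voicePreset must be one of {sorted(ALLOWED_PRESETS)}")

    influence = payload.get("cvvpInfluence", 0)
    is_number = isinstance(influence, (int, float)) and not isinstance(influence, bool)
    if not is_number or not 0 <= influence <= 1:
        problems.append("cvvpInfluence must be a number between 0 and 1")

    seed = payload.get("seed", 0)
    if isinstance(seed, bool) or not isinstance(seed, int):
        problems.append("seed must be an integer")

    uri = payload.get("customVoiceUri")
    # Assumption: selecting 'custom_voice' only makes sense with a URI supplied.
    if payload.get("primaryVoice") == "custom_voice" and not uri:
        problems.append("customVoiceUri is needed when primaryVoice is 'custom_voice'")
    # Heuristic: treat a path ending in .mp3 as an MP3 file.
    if uri and not urlparse(uri).path.lower().endswith(".mp3"):
        problems.append("customVoiceUri should point to an MP3 file")

    return problems


example = {
    "seed": 0,
    "text": "Hello from Tortoise.",
    "voicePreset": "fast",
    "primaryVoice": "custom_voice",
    "customVoiceUri": "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3",
}
print(validate_payload(example))  # []
```

The minimum length and single-speaker requirements for the audio cannot be verified from the URL alone, so the sketch leaves them to the service.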
Output
The action typically returns a URL pointing to the generated audio file. Here’s an example of the expected output:
"https://assets.cognitiveactions.com/invocations/561ba549-0086-4650-ac4f-c122bd2c8148/52c3aa14-45bf-45e0-9f37-4168ecfcf1de.mp3"
Conceptual Usage Example (Python)
Here’s a conceptual Python code snippet demonstrating how to call the Tortoise TTS Cognitive Action:
import json

import requests

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "fbe73780-2a6e-4403-8440-0d543fc5ceb5"  # Action ID for Generate and Clone Voices

# Construct the input payload based on the action's requirements
payload = {
    "seed": 0,
    "text": "The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.",
    "voicePreset": "fast",
    "primaryVoice": "custom_voice",
    "tertiaryVoice": "disabled",
    "customVoiceUri": "https://replicate.delivery/mgxm/671f3086-382f-4850-be82-db853e5f05a8/nixon.mp3",
    "secondaryVoice": "disabled",
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; fall back to the raw text
            print(f"Response body: {e.response.text}")
In this example, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The endpoint URL and the request body structure are hypothetical, so check them against the platform's documentation; the input payload itself follows the field schema described above.
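Since the action's output is a URL, saving the audio locally takes a second HTTP request. The following is a sketch that assumes the URL can be fetched without extra authentication; filename_from_url and download_audio are hypothetical helper names, and only the standard library is used.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str, default: str = "output.mp3") -> str:
    """Use the last path segment of the URL as the local file name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default


def download_audio(url: str, directory: str = ".") -> Path:
    """Stream the audio at `url` into `directory` and return the local path."""
    target = Path(directory) / filename_from_url(url)
    with urllib.request.urlopen(url, timeout=60) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download_audio with the URL returned by the action would write the MP3 into the current directory under the file name taken from the URL.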
Conclusion
The Tortoise TTS Cognitive Action lets developers add text-to-speech and voice cloning to their applications. Its quality presets, voice mixing, and custom voice support suit use cases such as voiceovers for videos, interactive applications, and accessibility tools.