Transform Text to Speech with the jaaari/kokoro-82m Cognitive Actions

Converting text into natural-sounding speech is useful in applications ranging from accessibility tools to entertainment. The jaaari/kokoro-82m API offers a Cognitive Action that generates speech from text using the Kokoro v1.0 model, an 82-million-parameter model based on the StyleTTS2 architecture. This guide walks through how to call that action from your own application.
Prerequisites
Before diving into the integration, ensure you have the following in place:
- API Key: You need an API key to authenticate with the Cognitive Actions API. Include it in the headers of every request.
- Setup: Familiarize yourself with making HTTP requests in your programming environment of choice, as you will be sending JSON payloads to the Cognitive Actions API.
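As a minimal sketch of the authentication setup, the helper below builds the request headers from an API key. The environment variable name COGNITIVE_ACTIONS_API_KEY and the function name build_headers are assumptions chosen for illustration, and the Bearer-token scheme mirrors the conceptual Python example in this guide rather than a confirmed specification.

```python
import os


def build_headers(api_key=None):
    """Return HTTP headers for a Cognitive Actions request.

    Assumption: the service accepts a Bearer token, as in this guide's
    conceptual example. The environment variable name is hypothetical.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise RuntimeError(
            "Set COGNITIVE_ACTIONS_API_KEY or pass api_key explicitly"
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly is convenient in tests.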
Cognitive Actions Overview
Generate Speech from Text
The Generate Speech from Text action converts text input into high-quality speech. It supports a variety of languages and voice types, and the speech speed can be adjusted to suit different user preferences.
- Category: Text-to-Speech
Input
The input for this action requires the following fields:
- text (required): The text that you want to convert to speech. The action automatically splits long text into smaller chunks for processing.
- speed (optional): A number that determines the speech speed multiplier, ranging from 0.1 (slowest) to 5.0 (fastest). The default is 1.0, representing normal speed.
- voiceType (optional): The specific voice model to use for speech synthesis. There are several options available, with 'af' being the default.
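The field rules above can be checked on the client before a request is sent. The following is a sketch only: the function name build_inputs is hypothetical, and the checks simply restate the documented constraints (text required, speed between 0.1 and 5.0 with a default of 1.0, voiceType defaulting to 'af'). The service may apply additional validation of its own.

```python
def build_inputs(text, speed=1.0, voice_type="af"):
    """Build the input payload, enforcing the documented field rules."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text is required and must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(speed, bool) or not isinstance(speed, (int, float)):
        raise TypeError("speed must be a number")
    if not 0.1 <= speed <= 5.0:
        raise ValueError("speed must be between 0.1 and 5.0")
    return {"text": text, "speed": speed, "voiceType": voice_type}


# Usage: only text is required; the other fields fall back to the defaults.
inputs = build_inputs("Hello from Kokoro", speed=1.25, voice_type="af_nicole")
```

Failing early on an out-of-range speed avoids spending a network round trip on a request that does not meet the documented limits.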
Example Input JSON:
{
  "text": "Hi! I'm Kokoro, a text-to-speech voice crafted by hexgrad — based on StyleTTS2. You can also find me in Kuluko, an app that lets you create fully personalized audiobooks — from characters to storylines — all tailored to your preferences. Want to give it a go? Search for Kuluko on the Apple or Android app store and start crafting your own story today!",
  "speed": 1,
  "voiceType": "af_nicole"
}
Output
Upon successful execution, the action returns a URL that links to the generated audio file in WAV format. The output typically looks like this:
Example Output:
"https://assets.cognitiveactions.com/invocations/e12737ec-b836-4a55-a399-3c2ff67ad7f7/b5d5d027-9cc6-4942-bc85-b5fddf4fe381.wav"
Conceptual Usage Example (Python)
The following conceptual Python snippet shows how to invoke the Generate Speech from Text action through the Cognitive Actions API. The endpoint URL and the request body structure are hypothetical placeholders, as marked in the comments.
import requests
import json
# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint
action_id = "f6130e8b-afca-4b6f-afa9-0083f82ea039" # Action ID for Generate Speech from Text
# Construct the input payload based on the action's requirements
payload = {
    "text": "Hi! I'm Kokoro, a text-to-speech voice crafted by hexgrad — based on StyleTTS2. You can also find me in Kuluko, an app that lets you create fully personalized audiobooks — from characters to storylines — all tailored to your preferences. Want to give it a go? Search for Kuluko on the Apple or Android app store and start crafting your own story today!",
    "speed": 1,
    "voiceType": "af_nicole"
}
headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}
try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=30,  # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; fall back to the raw text
            print(f"Response body: {e.response.text}")
In this code snippet:
- We define the action ID for the Generate Speech from Text action.
- The input payload is constructed based on the required and optional fields.
- The request is sent to the Cognitive Actions API, and we handle potential errors gracefully.
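Once the action has returned the audio URL described in the Output section, the WAV file can be saved locally. The sketch below uses only the Python standard library. The helper names filename_from_url and download_audio are hypothetical, and it assumes the returned URL can be fetched without extra authentication, which this guide does not confirm.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(audio_url, default="speech.wav"):
    """Use the last path segment of the URL as the local file name."""
    name = urlparse(audio_url).path.rsplit("/", 1)[-1]
    return name or default


def download_audio(audio_url, out_dir="."):
    """Stream the audio file to out_dir and return the local path.

    Assumption: the URL is directly downloadable without credentials.
    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    path = os.path.join(out_dir, filename_from_url(audio_url))
    with urllib.request.urlopen(audio_url, timeout=60) as resp:
        with open(path, "wb") as fh:
            shutil.copyfileobj(resp, fh)  # Copy in chunks, not all at once
    return path
```

A call such as download_audio(url) with the URL taken from the action's response would write the file to the current directory under the name found in the URL.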
Conclusion
The jaaari/kokoro-82m Cognitive Action turns text into speech with a single request: you send the text, plus an optional speed and voice type, and receive a URL to the generated WAV file. Use cases worth exploring include audiobooks, voiceovers for videos, and accessibility features in your applications. Happy coding!