Transform Your Text into Speech with Kokoro TTS Cognitive Actions

24 Apr 2025

The Kokoro TTS action from the kjjk10/kokoro-82m specification converts text into natural-sounding speech. It is built on a text-to-speech model with 82 million parameters and offers a selection of voices, which makes it a practical way to add spoken output to an application and improve its accessibility.

Prerequisites

Before diving into the integration of Kokoro TTS, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Basic familiarity with sending HTTP requests and handling JSON payloads.

Authentication typically involves passing your API key in the request headers, allowing secure access to the Cognitive Actions.
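
As a minimal sketch, the header construction might look like the following. It assumes a Bearer token in the Authorization header, which is the scheme used by the conceptual example in this article rather than a confirmed platform requirement. The environment variable name and the build_auth_headers helper are hypothetical.

```python
import os


def build_auth_headers(api_key: str) -> dict:
    """Return HTTP headers carrying the API key.

    Assumption: the platform accepts a Bearer token. Confirm the actual
    scheme in the Cognitive Actions documentation.
    """
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# COGNITIVE_ACTIONS_API_KEY is a hypothetical environment variable name.
# Reading the key from the environment keeps it out of source control.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "YOUR_COGNITIVE_ACTIONS_API_KEY")
headers = build_auth_headers(api_key)
```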

Cognitive Actions Overview

Generate Speech with Kokoro TTS

The Generate Speech with Kokoro TTS action is designed to transform textual content into speech. It supports multiple voice options, enabling developers to create diverse audio outputs tailored to their applications' needs.

  • Category: Text-to-Speech
  • Purpose: Convert text into high-quality speech audio.

Input

The action requires the following input parameters:

  • text (string): The content to be converted into speech. This can include plain text or formatted strings.
    • Default: "Hello, world!"
    • Example:
      You open your eyes so that only a slender chink of light seeps in, and peer at the gingko trees in front of the Provincial Office. As though there, between those branches, the wind is about to take on visible form. As though the raindrops suspended in the air, held breath before the plunge, are on the cusp of trembling down, glittering like jewels.
      
      When you open your eyes properly, the trees’ outlines dim and blur. You’re going to need glasses before long.
      
  • voiceOption (string): Specifies the voice to be used for text-to-speech conversion. Choose from a predefined list of voices.
    • Default: "af"
    • Example: "af_bella"

Here’s how the input JSON payload would look:

{
  "text": "You open your eyes so that only a slender chink of light seeps in, and peer at the gingko trees in front of the Provincial Office. As though there, between those branches, the wind is about to take on visible form. As though the raindrops suspended in the air, held breath before the plunge, are on the cusp of trembling down, glittering like jewels.\n\nWhen you open your eyes properly, the trees’ outlines dim and blur. You’re going to need glasses before long.",
  "voiceOption": "af_bella"
}
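
A payload like this can also be assembled in code. The sketch below uses the two documented parameters and their defaults ("Hello, world!" and "af"). The build_payload helper and its non-empty checks are illustrative assumptions, since the action's own validation rules are not described here.

```python
import json

DEFAULT_TEXT = "Hello, world!"  # documented default for "text"
DEFAULT_VOICE = "af"            # documented default for "voiceOption"


def build_payload(text: str = DEFAULT_TEXT, voice_option: str = DEFAULT_VOICE) -> dict:
    """Build the input payload for the Kokoro TTS action (hypothetical helper)."""
    # Assumption: empty values are not useful inputs, so reject them early.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if not isinstance(voice_option, str) or not voice_option:
        raise ValueError("voice_option must be a non-empty string")
    return {"text": text, "voiceOption": voice_option}


payload = build_payload("First paragraph.\n\nSecond paragraph.", "af_bella")

# json.dumps escapes the paragraph break as \n\n, matching the JSON above.
print(json.dumps(payload, indent=2, ensure_ascii=False))
```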

Output

On success, the action returns a URL pointing to the generated audio file.

  • Example Output:
https://assets.cognitiveactions.com/invocations/b02a49fa-727d-44b1-a5a2-33553c79d02f/c79a33e1-6f62-4d20-93c6-08992904adc0.wav

This URL can be used to play or download the audio file generated from the provided text.
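
Downloading the file could be sketched as follows, using only the Python standard library. It assumes the returned URL can be fetched without extra authentication, which this article does not confirm. The helper names and the "speech.wav" fallback are hypothetical.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(audio_url: str, fallback: str = "speech.wav") -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(audio_url).path)
    return name or fallback


def download_audio(audio_url: str, dest_dir: str = ".") -> str:
    """Stream the audio file to dest_dir and return the local path.

    Assumption: the URL is readable with a plain GET request.
    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    dest_path = os.path.join(dest_dir, filename_from_url(audio_url))
    with urlopen(audio_url, timeout=60) as response, open(dest_path, "wb") as fh:
        shutil.copyfileobj(response, fh)  # copy in chunks, not all at once
    return dest_path
```

Calling download_audio with the URL returned by the action would save the file under its original name in the current directory.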

Conceptual Usage Example (Python)

Here’s a conceptual Python snippet demonstrating how to call the Kokoro TTS action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "5253e1d7-968f-456b-a6ea-0d790cae6e05"  # Action ID for Generate Speech with Kokoro TTS

# Construct the input payload based on the action's requirements
payload = {
    "text": "You open your eyes so that only a slender chink of light seeps in, and peer at the gingko trees in front of the Provincial Office. As though there, between those branches, the wind is about to take on visible form. As though the raindrops suspended in the air, held breath before the plunge, are on the cusp of trembling down, glittering like jewels.\n\nWhen you open your eyes properly, the trees’ outlines dim and blur. You’re going to need glasses before long.",
    "voiceOption": "af_bella"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Fail instead of hanging if the server never responds
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; show the raw text instead
            print(f"Response body: {e.response.text}")

In this code example:

  • Replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key.
  • The action_id corresponds to the Generate Speech action.
  • The payload is structured according to the input schema, and the request is sent to a hypothetical execution endpoint.

Conclusion

The Kokoro TTS action provides an efficient way to integrate text-to-speech capabilities into your applications, enhancing user experience and accessibility. By leveraging the power of this Cognitive Action, developers can easily convert text into engaging audio content. Explore further possibilities by experimenting with different voice options and incorporating this functionality into various use cases. Happy coding!