Generate Speech from Text Effortlessly with adirik/styletts2 Cognitive Actions

22 Apr 2025

Converting text into expressive, natural-sounding speech is increasingly valuable for developers. The adirik/styletts2 API exposes this capability as a Cognitive Action that transforms text into speech using style diffusion and adversarial training with large speech language models. The result is high-quality, human-like speech synthesis, well suited to applications in gaming, virtual assistants, and more.

Prerequisites

Before you start using the Cognitive Actions, you'll need to ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic understanding of JSON and how to make HTTP requests.

To authenticate your requests, you will typically include your API key in the headers of your requests, ensuring that your application can securely communicate with the Cognitive Actions service.
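As a concrete sketch, a small helper can assemble those headers once and reuse them across requests. Bearer-token authentication is an assumption here; some platforms use an API-key header instead, so check the platform's documentation for the exact scheme:

```python
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # replace with your key

def build_headers(api_key: str) -> dict:
    """Build the request headers for the Cognitive Actions service.

    Bearer-token auth is assumed; consult the platform docs if your
    deployment uses a different header (e.g. an X-Api-Key header).
    """
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Centralizing the headers this way keeps the key out of individual call sites and makes it easy to switch authentication schemes later.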

Cognitive Actions Overview

Generate Speech from Text with StyleTTS 2

The Generate Speech from Text with StyleTTS 2 action transforms plain text into speech and supports speaker adaptation through reference speech. It is built on the StyleTTS 2 model, which achieves human-level text-to-speech synthesis with expressive, natural-sounding output.

Input

The Input schema for this action requires several parameters, with text as a mandatory field. Here’s a breakdown of the input fields:

  • text (required): The text string you want to convert into speech.
  • beta (optional): A number (default: 0.7) that affects the prosody of the speech.
  • alpha (optional): A number (default: 0.3) that influences the timbre of the speaker's voice.
  • seed (optional): An integer (default: 0) for reproducibility of the speech synthesis.
  • weights (optional): A URL to Replicate model weights, used when the model has been fine-tuned on new speakers.
  • reference (optional): URI of a reference speech for style adaptation.
  • diffusionSteps (optional): An integer (default: 10) that determines the number of diffusion steps for synthesis.
  • embeddingScale (optional): A number (default: 1) that affects the emotional expressiveness of the speech.

Example Input:

{
  "beta": 0.7,
  "seed": 0,
  "text": "StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models to achieve human-level text-to-speech synthesis.",
  "alpha": 0.3,
  "diffusionSteps": 10,
  "embeddingScale": 1.5
}
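Since text is the only required field, a small helper that merges caller overrides onto the documented defaults can keep call sites short. This is a sketch; the default values mirror the schema above, and the optional weights and reference fields are simply passed through:

```python
# Defaults taken from the input schema described above.
DEFAULTS = {
    "beta": 0.7,
    "alpha": 0.3,
    "seed": 0,
    "diffusionSteps": 10,
    "embeddingScale": 1,
}

# Optional fields that have no default value.
OPTIONAL_FIELDS = {"weights", "reference"}

def build_payload(text: str, **overrides) -> dict:
    """Build an input payload; only `text` is required, the rest default."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"Unknown input fields: {sorted(unknown)}")
    payload = {**DEFAULTS, "text": text}
    payload.update(overrides)
    return payload
```

For example, build_payload("Hello!", embeddingScale=1.5) reproduces the example input above while leaving the other fields at their schema defaults, and misspelled field names fail fast instead of being silently ignored.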

Output

Upon successful execution, the action returns a URL pointing to the generated speech audio file. This audio can be played back in applications or stored for future use.

Example Output:

https://assets.cognitiveactions.com/invocations/80777c18-dcf6-47e0-b41d-490ce7c7d674/7a1238b6-9ba5-4f0e-950f-1a69449a77fb.mp3
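To persist the generated audio, you can stream the returned URL to disk. The sketch below derives a local filename from the URL path and downloads the file with requests; the URL format shown above is assumed:

```python
import os
from urllib.parse import urlparse

import requests

def filename_from_url(url: str) -> str:
    """Derive a local filename from the last segment of the URL path."""
    return os.path.basename(urlparse(url).path)

def download_audio(url: str, dest_dir: str = ".") -> str:
    """Stream the generated audio file to disk and return its local path."""
    path = os.path.join(dest_dir, filename_from_url(url))
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    return path
```

Streaming in chunks keeps memory usage flat even for long audio files.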

Conceptual Usage Example (Python)

Here’s how a developer might call the Cognitive Actions execution endpoint for the Generate Speech from Text with StyleTTS 2 action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "ab720d5d-a46b-43a5-8ee9-ee9eddec34d5"  # Action ID for Generate Speech from Text with StyleTTS 2

# Construct the input payload based on the action's requirements
payload = {
    "beta": 0.7,
    "seed": 0,
    "text": "StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models to achieve human-level text-to-speech synthesis.",
    "alpha": 0.3,
    "diffusionSteps": 10,
    "embeddingScale": 1.5
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # response body is not valid JSON
            print(f"Response body: {e.response.text}")

In this code snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The payload variable constructs the input using the required fields based on the action's schema. The response will include the URL of the generated audio file.
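Because the exact shape of the response is hypothetical, a defensive extractor can make downstream code tolerant of minor structural differences. This sketch walks a few plausible key names looking for an http URL; adjust the lookups to match what your Cognitive Actions deployment actually returns:

```python
def extract_audio_url(result: dict) -> str:
    """Pull the generated audio URL out of an execution result.

    The response structure here is an assumption; the function checks a
    few plausible keys ("output", "result", "url") and recurses into
    nested dicts, returning "" if no URL is found.
    """
    for key in ("output", "result", "url"):
        value = result.get(key)
        if isinstance(value, str) and value.startswith("http"):
            return value
        if isinstance(value, dict):
            nested = extract_audio_url(value)
            if nested:
                return nested
    return ""
```

With this in place, audio_url = extract_audio_url(result) followed by a simple truthiness check gives you a single point to update if the response schema changes.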

Conclusion

The adirik/styletts2 Cognitive Actions provide an innovative and efficient way to integrate text-to-speech capabilities into your applications. With customizable parameters for prosody and emotional expressiveness, you can create a more engaging user experience. Consider exploring additional use cases, such as voiceovers for videos or interactive voice response systems, to fully leverage the potential of this powerful tool. Happy coding!