Transform Text to Speech with lucataco/step-audio-tts-3b Cognitive Actions

25 Apr 2025

Converting text into natural-sounding speech is a common requirement in audio applications. The lucataco/step-audio-tts-3b Cognitive Actions give developers pre-built actions for adding text-to-speech to their applications. The underlying model supports multiple languages and emotional expressions, along with styles such as RAP and humming. Let’s explore how to use these actions in your projects.

Prerequisites

Before diving into the integration, ensure you have the following:

  • An API key for the Cognitive Actions platform.
  • Basic knowledge of making HTTP requests and handling JSON data.

For authentication, you typically include your API key in the request headers, allowing secure access to the Cognitive Actions services.
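
A minimal sketch of that header setup, assuming the key is stored in an environment variable so it stays out of source code. The variable name COGNITIVE_ACTIONS_API_KEY and the helper build_auth_headers are illustrative choices, and the Bearer scheme simply mirrors the conceptual usage example later in this article; check the platform's documentation for the scheme it actually expects.

```python
import os


def build_auth_headers(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> dict:
    """Build request headers from an API key stored in an environment variable.

    The variable name and the Bearer scheme are assumptions for illustration.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing early on a missing key tends to give a clearer error than an authentication failure returned by the server.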

Cognitive Actions Overview

Generate Speech with Step-Audio-TTS-3B

The Generate Speech with Step-Audio-TTS-3B action allows you to convert text into speech using the Step-Audio-TTS-3B model. This action is particularly useful for applications that require engaging audio content, such as virtual assistants, educational tools, and entertainment platforms.

Input

The input for this action is defined by the following schema:

  • text (string): The text to be synthesized into speech. The example below begins the text with a style tag, (RAP).
  • speakerName (string): The name of the speaker whose voice will be used for synthesis. Options include:
    • 闫雨婷
    • 闫雨婷RAP
    • 闫雨婷VOCAL

Example Input JSON:

{
  "text": "(RAP) I set out on the journey of freedom, chasing that distant dream, breaking free from the shackles of bondage, letting my soul drift with the wind, every step is full of power, every moment is extremely shining, the belief in freedom is burning, illuminating the direction of my progress!",
  "speakerName": "闫雨婷"
}
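
Checking the input before sending it can catch mistakes such as a misspelled speaker name. The following is a small sketch under stated assumptions: the helper build_payload is hypothetical, and the allowed names are only the three listed in the schema above, so the platform may accept others.

```python
# Speaker names listed in the schema above; the platform may accept others.
ALLOWED_SPEAKERS = ("闫雨婷", "闫雨婷RAP", "闫雨婷VOCAL")


def build_payload(text: str, speaker_name: str) -> dict:
    """Return an input payload, rejecting empty text and unlisted speakers."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    if speaker_name not in ALLOWED_SPEAKERS:
        raise ValueError(f"unknown speakerName: {speaker_name!r}")
    return {"text": cleaned, "speakerName": speaker_name}
```

For example, build_payload("(RAP) Hello", "闫雨婷") returns a dictionary with the same two keys as the example JSON above.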

Output

Upon successful execution, the action returns a URL pointing to the generated audio file. The audio output typically sounds natural and expressive, reflecting the emotional tone of the input text.

Example Output:

https://assets.cognitiveactions.com/invocations/5ad68ea0-048b-41c4-b893-71e0660c4f74/28e9b9b3-da17-4344-8ef1-c1ccd899a064.wav
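
Since the result is a URL, most applications will want to save the audio locally. The sketch below uses only the Python standard library and assumes the URL can be fetched without extra authentication headers, which the source does not confirm. The helper names filename_from_url and download_audio, and the fallback file name speech.wav, are illustrative.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(audio_url: str, default: str = "speech.wav") -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(audio_url).path)
    return name or default


def download_audio(audio_url: str, dest_dir: str = ".") -> str:
    """Stream the audio file into dest_dir and return the local path.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    dest_path = os.path.join(dest_dir, filename_from_url(audio_url))
    with urlopen(audio_url, timeout=60) as response, open(dest_path, "wb") as fh:
        # Copy in chunks so the whole file is never held in memory
        shutil.copyfileobj(response, fh)
    return dest_path
```

Calling download_audio with the example URL above would write a file named after the last path segment of that URL into the current directory.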

Conceptual Usage Example (Python)

Here’s a conceptual Python code snippet demonstrating how to call the cognitive action endpoint:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "a0c764c1-aa84-4a21-b006-964c0b3360c6" # Action ID for Generate Speech with Step-Audio-TTS-3B

# Construct the input payload based on the action's requirements
payload = {
    "text": "(RAP) I set out on the journey of freedom, chasing that distant dream, breaking free from the shackles of bondage, letting my soul drift with the wind, every step is full of power, every moment is extremely shining, the belief in freedom is burning, illuminating the direction of my progress!",
    "speakerName": "闫雨婷"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=120 # Avoid hanging indefinitely; the value is an arbitrary choice
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Raised when the body is not valid JSON; covers all requests versions
            print(f"Response body: {e.response.text}")

This code snippet demonstrates how to structure the input payload correctly and make a request to the Cognitive Actions endpoint. Make sure to replace the placeholders with your actual API key and the correct endpoint.

Conclusion

The lucataco/step-audio-tts-3b Cognitive Actions give developers a straightforward way to add text-to-speech to their applications. Because the action produces expressive speech in several voices, you can use it to enrich user interactions. Start experimenting with these actions to create dynamic audio experiences for your audience.