Enhance Your Applications with Text-to-Speech: Integrating FishSpeech Cognitive Actions

24 Apr 2025

In application development, realistic audio output can noticeably improve the user experience. The FishSpeech Cognitive Actions, part of the ttsds/fishspeech_1_1_large specification, let developers convert text into speech that preserves the vocal characteristics of a specified speaker. This makes audio output more realistic and interactions with users more personalized.

Prerequisites

Before diving into using the Cognitive Actions, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • Basic knowledge of JSON and how to structure API requests.
  • Familiarity with Python and its requests library for making HTTP requests.

Authentication is typically handled by passing your API key in the request headers, allowing you to securely access the Cognitive Actions.
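
As a minimal sketch of the header-based authentication described above: the helper below builds the request headers from an API key. The environment variable name COGNITIVE_ACTIONS_API_KEY and the build_headers function are assumptions made for illustration, and the Bearer scheme simply mirrors the usage example later in this article; check the platform's documentation for the exact header format.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token."""
    # Assumption: the key is stored in an environment variable with this name.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise RuntimeError("Set COGNITIVE_ACTIONS_API_KEY or pass api_key explicitly.")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing api_key explicitly is convenient in tests.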

Cognitive Actions Overview

Generate Speech from Text and Speaker Reference

The Generate Speech from Text and Speaker Reference action converts a given text into speech using a specific speaker reference, so that the output reflects the vocal characteristics of the designated speaker.

  • Category: text-to-speech

Input

The action requires the following input fields:

  • text (required): The main text content to be converted into speech.
  • textReference (required): An additional reference string that provides context related to the main text.
  • speakerReference (required): A URI pointing to a specific audio resource that represents the speaker.

Here’s a practical example of the JSON payload needed to invoke this action:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "textReference": "and keeping eternity before the eyes, though much.",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}
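
Because all three fields are required, it can help to check a payload locally before sending it. The following is a rough sketch, not part of the platform: validate_payload and REQUIRED_FIELDS are hypothetical names, and the rule that speakerReference must be an http(s) URI is an assumption based on the example above.

```python
from urllib.parse import urlparse

# Field names taken from the input list above.
REQUIRED_FIELDS = ("text", "textReference", "speakerReference")


def validate_payload(payload):
    """Return a list of problems found in the payload; an empty list means none."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    speaker = payload.get("speakerReference")
    if isinstance(speaker, str) and speaker.strip():
        parsed = urlparse(speaker)
        # Assumption: the speaker reference must be an http(s) URI.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("speakerReference is not an http(s) URI")
    return problems
```

Calling validate_payload on the example payload above returns an empty list; a payload that omits textReference yields one problem entry naming that field.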

Output

Upon successful execution, the action returns a URI pointing to the generated speech audio file. Here’s an example of what the output might look like:

"https://assets.cognitiveactions.com/invocations/c6c1d520-e08d-482f-bd8f-a56571a425c1/aa927724-b708-48aa-8c6c-c1bcf3c5222c.wav"

This URI can be used to play the generated speech audio in your application.
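
One way to use the returned URI is to save the audio locally before playing it. The sketch below uses only the Python standard library; filename_from_uri and download_audio are hypothetical helper names, and it assumes the URI can be fetched with a plain GET request without extra authentication, which the source does not confirm.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_uri(uri, default="speech.wav"):
    """Derive a local file name from the last path segment of the URI."""
    name = os.path.basename(urlparse(uri).path)
    return name or default


def download_audio(uri, directory="."):
    """Stream the audio at `uri` into `directory` and return the local path."""
    target = os.path.join(directory, filename_from_uri(uri))
    # Assumption: the URI is publicly readable; add headers if it is not.
    with urllib.request.urlopen(uri, timeout=30) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Streaming with shutil.copyfileobj avoids loading the whole file into memory, which matters for longer recordings.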

Conceptual Usage Example (Python)

Here’s a Python code snippet demonstrating how a developer might call the Generate Speech action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "c49a8d65-7d95-466e-bb20-694e7f55fed5"  # Action ID for Generate Speech from Text and Speaker Reference

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "textReference": "and keeping eternity before the eyes, though much.",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status()  # Raise exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON (covers every JSONDecodeError variant)
            print(f"Response body: {e.response.text}")

In this example, replace the placeholder API key with your own and the hypothetical endpoint URL with the real one for your account. The action ID and input payload follow the structure described above; the request body shape (action_id and inputs) is illustrative and may differ on the actual platform.

Conclusion

Integrating the FishSpeech Cognitive Actions into your applications lets you offer realistic, personalized speech output. By generating speech from text and a speaker reference, you can make interactions feel more dynamic and relatable. As a next step, consider other use cases where text-to-speech can improve the user experience, such as educational tools, virtual assistants, or audiobook applications.