Transform Text to Speech with ttsds/fishspeech_1_0 Cognitive Actions

24 Apr 2025

Audio interaction is an increasingly common way to improve the user experience of digital applications. The ttsds/fishspeech_1_0 specification provides a Cognitive Action that converts text into speech using the Fish Speech V1.0 model developed by fish.audio. The action produces spoken audio from text content, which makes it useful for applications such as virtual assistants and audiobooks.

Prerequisites

To get started with the Cognitive Actions in the ttsds/fishspeech_1_0 spec, you'll need to meet a few general requirements:

  • API Key: You will need an API key to authenticate your requests to the Cognitive Actions platform.
  • Proper Setup: Ensure you have access to the necessary endpoints and have configured your application to make HTTP requests.

Authentication typically involves passing your API key in the request headers.
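
As a minimal sketch of header-based authentication, the helper below builds the request headers from an API key. It assumes a Bearer token scheme; the function name build_headers and the environment variable COGNITIVE_ACTIONS_API_KEY are illustrative choices, not names documented by the platform.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return request headers carrying the API key as a Bearer token (assumed scheme)."""
    if not api_key or not api_key.strip():
        raise ValueError("API key is missing or empty")
    return {
        "Authorization": f"Bearer {api_key.strip()}",
        "Content-Type": "application/json",
    }


if __name__ == "__main__":
    # Hypothetical environment variable name; adjust to your own configuration.
    key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "")
    try:
        headers = build_headers(key)
        print(sorted(headers))
    except ValueError as err:
        print(f"Cannot build headers: {err}")
```

Reading the key from the environment rather than hard-coding it keeps the secret out of source control.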

Cognitive Actions Overview

Generate Speech from Text

The Generate Speech from Text action converts written text into spoken audio. You provide the text content, a reference text snippet, and a URI pointing to an audio file of the speaker, and the action returns generated audio suitable for various applications.

  • Category: Text-to-Speech

Input

The input for this action requires the following fields:

  • text (string, required): The primary text content to be transformed into speech.
  • textReference (string, required): A reference text snippet paired with the speaker reference (in voice-cloning models, typically a transcript of the reference audio).
  • speakerReference (uri, required): A URL pointing to an audio file that represents the speaker.
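
Checking the payload locally before sending it can catch missing fields early. The sketch below is one way to do that; the helper name validate_inputs and the rule that speakerReference must be an http(s) URL are assumptions for illustration, since the exact validation performed by the service is not described here.

```python
from urllib.parse import urlparse

# Field names taken from the input list above.
REQUIRED_FIELDS = ("text", "textReference", "speakerReference")


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems found in the payload; an empty list means it looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = inputs.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' must be a non-empty string")
    speaker = inputs.get("speakerReference")
    if isinstance(speaker, str) and speaker.strip():
        parsed = urlparse(speaker)
        # Assumption: only http(s) URLs are accepted as speaker references.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("'speakerReference' must be an http(s) URL")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.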

Example Input:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "textReference": "and keeping eternity before the eyes, though much",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

Output

Upon successful execution, the action returns a URI to the generated audio file. The output typically looks like this:

Example Output:

https://assets.cognitiveactions.com/invocations/067cdf2d-0278-4b6a-833b-8c26b4926af3/a153e168-c413-45fc-88f4-41e8ece3af76.wav
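
Since the output is a URI rather than the audio bytes, a client usually needs to fetch the file separately. The following is a rough sketch using only the Python standard library; it assumes the URI can be downloaded without extra authentication, and the helper names filename_from_uri and download_audio are hypothetical.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_uri(uri: str, default: str = "output.wav") -> str:
    """Use the last path segment of the URI as the local file name."""
    name = posixpath.basename(urlparse(uri).path)
    return name or default


def download_audio(uri: str, dest_dir: str = ".") -> str:
    """Stream the audio file into dest_dir and return the local path."""
    path = os.path.join(dest_dir, filename_from_uri(uri))
    # Assumption: the URI is publicly readable; add headers here if it is not.
    with urllib.request.urlopen(uri, timeout=60) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return path
```

Calling download_audio with the example URI above would save the file under its original .wav name in the current directory, provided the link is still valid.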

Conceptual Usage Example (Python)

Here's a conceptual Python code snippet that demonstrates how to call the Generate Speech from Text action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "b6946f4d-e95a-4b68-b1e5-57cbe3602816" # Action ID for Generate Speech from Text

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "textReference": "and keeping eternity before the eyes, though much",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Covers every JSON decode error variant that response.json() can raise
            print(f"Response body: {e.response.text}")

In this snippet, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The input payload mirrors the action's input fields, and the request is sent to the hypothetical Cognitive Actions execution endpoint. On success, the response should contain the URI of the generated audio.
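
Because the exact shape of the response JSON is not specified here, one defensive option is to search the parsed result for the first string that looks like an audio URL. The sketch below does that; the helper name find_audio_uri, the list of audio extensions, and the sample response in the usage note are all assumptions, not a documented schema.

```python
from typing import Any, Optional

# Assumed set of audio extensions; the example output above uses .wav.
AUDIO_SUFFIXES = (".wav", ".mp3", ".flac", ".ogg")


def find_audio_uri(node: Any) -> Optional[str]:
    """Depth-first search for the first string that looks like an audio URL."""
    if isinstance(node, str):
        # Ignore any query string and letter case when checking the extension.
        lowered = node.lower().split("?", 1)[0]
        if lowered.startswith(("http://", "https://")) and lowered.endswith(AUDIO_SUFFIXES):
            return node
        return None
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_audio_uri(child)
        if found:
            return found
    return None
```

For instance, find_audio_uri({"result": {"output": ["https://example.com/a.wav"]}}) returns the URL string, while a response with no matching string yields None. Once the real response schema is known, a direct key lookup is preferable to this search.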

Conclusion

The Generate Speech from Text action in the ttsds/fishspeech_1_0 spec lets developers add voice capabilities to their applications by converting text into audio with a single request. This is useful whether you're developing a virtual assistant, an educational tool, or an entertainment application.

Consider exploring further use cases or combining this action with other functionalities to create innovative applications!