Generate Bilingual Text-to-Speech with Fish Speech Cognitive Actions

23 Apr 2025

Converting text into natural-sounding speech is useful in a growing range of applications. The Fish Speech Cognitive Actions, from the spec titled "jichengdu/fish-speech," give developers access to this capability. Built on the Fish Speech V1.5 model, the action generates high-quality bilingual speech without relying on phoneme input, which makes it suitable for many kinds of projects.

Prerequisites

To use the Fish Speech Cognitive Actions, you need two things:

  • API Key: Ensure you have a valid API key for authenticating your requests to the Cognitive Actions platform.
  • Endpoint Access: You'll need to know the endpoint where the Cognitive Actions are accessible.

Authentication generally involves passing your API key in the request headers to authorize your access to the service.
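
As a minimal sketch of that step, the helper below reads the key from an environment variable instead of hard-coding it. The variable name COGNITIVE_ACTIONS_API_KEY and the function name build_auth_headers are assumptions made for illustration, and the Bearer scheme simply mirrors the usage example later in this article; check the platform's own documentation for the exact header it expects.

```python
import os


def build_auth_headers(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> dict:
    """Build request headers from an API key stored in an environment variable.

    The environment variable name is a hypothetical choice for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # Bearer scheme assumed from the usage example in this article.
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Keeping the key out of source code makes it harder to commit by accident.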

Cognitive Actions Overview

Generate Speech with Fish Speech

The Generate Speech with Fish Speech action allows you to transform text into speech using the advanced capabilities of the Fish Speech model. This action is particularly useful for applications that require multilingual support and high-quality audio output.

  • Category: Text-to-Speech
  • Purpose: Convert text input into a speech output, with options for compilation optimization and reference audio.

Input

The input schema for this action accepts the following fields, of which only text is required:

  • text (required): The text that you want to convert into speech.
    Example: "我的猫猫就是全世界最好的猫"
  • useCompile (optional): A boolean flag indicating whether to use compilation optimization. Defaults to true if not specified.
  • referenceText (optional): Text content corresponding to the reference audio.
  • referenceAudio (optional): A URI of the reference audio file.

Example Input:

{
  "text": "我的猫猫就是全世界最好的猫",
  "useCompile": true
}
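
The example above uses only text and useCompile. To show how the two optional reference fields fit into the same object, here is a small sketch of a payload builder. The function name build_speech_input is hypothetical, the reference URI and transcript are placeholders, and the schema as described here does not say whether referenceText and referenceAudio must be supplied together, so the sketch treats them independently.

```python
from typing import Optional


def build_speech_input(
    text: str,
    use_compile: bool = True,
    reference_text: Optional[str] = None,
    reference_audio: Optional[str] = None,
) -> dict:
    """Assemble the input object, omitting optional fields that are unset."""
    if not text.strip():
        raise ValueError("text is required and must not be empty")
    payload = {"text": text, "useCompile": use_compile}
    if reference_text is not None:
        payload["referenceText"] = reference_text
    if reference_audio is not None:
        payload["referenceAudio"] = reference_audio
    return payload


# Placeholder values: the URI and transcript below are illustrative only.
example = build_speech_input(
    "我的猫猫就是全世界最好的猫",
    reference_text="Transcript of the reference clip.",
    reference_audio="https://example.com/reference.wav",
)
```

Leaving unset fields out of the payload lets the service apply its own defaults rather than receiving explicit nulls.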

Output

This action typically returns a URI pointing to the generated audio file, which you can use to download or play the synthesized speech.

Example Output:

"https://assets.cognitiveactions.com/invocations/0ca91503-5e6a-43ab-8c89-90f226a3fd30/962e9c8b-0e74-4fa9-8483-075ad15ee35c.wav"
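
Once you have such a URI, saving the audio locally might look like the sketch below. It assumes the URI can be fetched with a plain GET and no extra authentication, which this article does not confirm; the function names audio_filename_from_uri and download_audio are hypothetical.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def audio_filename_from_uri(uri: str) -> str:
    """Return the last path segment of the URI to use as a local file name."""
    name = posixpath.basename(urlparse(uri).path)
    if not name:
        raise ValueError(f"No file name found in URI: {uri}")
    return name


def download_audio(uri: str, dest_dir: str = ".") -> str:
    """Download the audio at `uri` into `dest_dir` and return the local path.

    Assumes the URI is readable without additional credentials.
    """
    dest_path = os.path.join(dest_dir, audio_filename_from_uri(uri))
    with urllib.request.urlopen(uri, timeout=60) as response:
        data = response.read()
    with open(dest_path, "wb") as out:
        out.write(data)
    return dest_path
```

Calling download_audio with the URI from the action's result would write a .wav file into the current directory.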

Conceptual Usage Example (Python)

Here’s how you might call the Generate Speech action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "e9a3dc39-e1be-4b71-addc-25e7c83cab11"  # Action ID for Generate Speech with Fish Speech

# Construct the input payload based on the action's requirements
payload = {
    "text": "我的猫猫就是全世界最好的猫",
    "useCompile": True
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60  # Avoid hanging indefinitely; adjust to your expected synthesis time
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; covers every JSON decode error requests can raise
            print(f"Response body: {e.response.text}")

In the code above, you will need to replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action ID is set to the value corresponding to the Generate Speech action, and the input payload is structured according to the required schema. The endpoint URL and request structure are illustrative, serving as a guide for your implementation.

Conclusion

The Fish Speech Cognitive Actions provide a robust solution for developers looking to integrate high-quality text-to-speech capabilities into their applications. With the ability to handle bilingual text and produce natural-sounding speech without phoneme constraints, you can create engaging and accessible user experiences. To explore further, consider experimenting with other features of the Fish Speech model or integrating additional functionalities into your projects. Happy coding!