Generate High-Quality Speech with jichengdu/fish-speech Cognitive Actions

24 Apr 2025

Converting text into natural-sounding speech is useful in many applications, from virtual assistants to content creation. The jichengdu/fish-speech API exposes Cognitive Actions for text-to-speech conversion, giving developers access to its speech synthesis model over HTTP. This post shows how to use the "Generate Natural Speech with Fish Speech V1.5" action to produce audio output from text.

Prerequisites

Before diving into the usage of the Fish Speech Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Basic knowledge of making HTTP requests in Python.
  • Familiarity with JSON structures, as you will be working with input and output in this format.

Authentication typically involves passing your API key in the headers of each request so that the service can verify the caller.
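
As a minimal sketch of header-based authentication, the helper below reads the key from an environment variable and builds the request headers. The helper name, the environment variable name, and the Bearer scheme are assumptions made for illustration; check the platform's documentation for the scheme it actually expects.

```python
import os


def build_auth_headers(api_key=None):
    """Return HTTP headers carrying the API key as a Bearer token.

    Assumption: the platform accepts "Authorization: Bearer <key>".
    The environment variable name below is hypothetical.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("No API key provided or found in the environment")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control, which is usually preferable to hard-coding it in a script.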

Cognitive Actions Overview

Generate Natural Speech with Fish Speech V1.5

The "Generate Natural Speech with Fish Speech V1.5" action converts text into natural-sounding speech in English and Chinese. According to its description, synthesis does not depend on a traditional phoneme conversion step.

Input

The action accepts the following input fields. Only text is required:

  • text (required): The text you want to convert into speech.
  • useCompile (optional): Indicates whether to use compile-time optimization (default is true).
  • referenceText (optional): Text content corresponding to reference audio to improve synthesis accuracy.
  • referenceAudio (optional): A URI pointing to the reference audio file to guide prosody and intonation.

Example Input:

{
  "text": "我的猫猫就是全世界最好的猫",
  "useCompile": true
}
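
The optional reference fields can be added to the same structure. The sketch below assembles the input and leaves out optional fields that are not set. The field names come from the schema above; the helper name, the sample text, and the reference URL are hypothetical placeholders.

```python
def build_payload(text, use_compile=True, reference_text=None, reference_audio=None):
    """Assemble the action input, omitting optional fields that are not set."""
    if not text or not text.strip():
        raise ValueError("'text' is required and must not be empty")
    payload = {"text": text, "useCompile": use_compile}
    if reference_text is not None:
        payload["referenceText"] = reference_text
    if reference_audio is not None:
        payload["referenceAudio"] = reference_audio
    return payload


# The reference URL below is a placeholder, not a real file.
payload = build_payload(
    "Hello from Fish Speech.",
    reference_text="Transcript of the reference clip.",
    reference_audio="https://example.com/reference.wav",
)
```

Validating the required field locally avoids a round trip to the service for a request that would presumably be rejected anyway.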

Output

Upon successful execution, this action returns a URI pointing to the generated audio file.

Example Output:

https://assets.cognitiveactions.com/invocations/95385794-9fab-480b-9d84-9d36b73acd22/5da225fa-5e96-4075-a42f-f20620c16238.wav
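
Because the result is a URI rather than raw audio, a client usually needs one more step to save the file. The following sketch uses only the Python standard library and assumes the URI can be fetched without extra authentication, which the source does not confirm. Both function names are hypothetical.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(audio_url, default="speech.wav"):
    """Use the last path segment of the URL as the local file name."""
    name = os.path.basename(urlparse(audio_url).path)
    return name or default


def download_audio(audio_url, out_dir="."):
    """Copy the generated audio to disk and return the local path."""
    local_path = os.path.join(out_dir, filename_from_url(audio_url))
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urlopen(audio_url, timeout=60) as resp, open(local_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return local_path
```

Calling download_audio with the URI returned by the action would write a .wav file into the current directory.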

Conceptual Usage Example (Python)

To illustrate how developers can invoke this action, here's a conceptual Python code snippet:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "b56daa86-828b-40df-baf3-d79dee0725a7"  # Action ID for Generate Natural Speech with Fish Speech V1.5

# Construct the input payload based on the action's requirements
payload = {
    "text": "我的猫猫就是全世界最好的猫",
    "useCompile": True
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=120  # Arbitrary limit so a stalled request cannot hang forever
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; covers every JSON decode error type
            print(f"Response body: {e.response.text}")

In this code snippet, replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key. The endpoint URL and the request body structure are hypothetical, so adjust them to match the platform's documentation. The action_id identifies the speech-generation action, and the payload follows the input schema described above.

Conclusion

The jichengdu/fish-speech API converts text into natural-sounding speech through its Cognitive Actions. By integrating the "Generate Natural Speech with Fish Speech V1.5" action into your application, you can add audio output with a single HTTP request. Possible use cases include virtual assistants, language learning tools, and accessibility features. Happy coding!