Effortless Text-to-Speech Conversion with ttsds/amphion_maskgct Cognitive Actions

Text-to-speech has become a common feature in applications that want better accessibility and user engagement. The ttsds/amphion_maskgct spec gives developers access to it through its Generate Speech with MaskGCT action. The action is built on MaskGCT, Amphion's zero-shot text-to-speech model based on a masked generative codec transformer. By integrating this action, you can convert text into natural-sounding speech with a single API request.
Prerequisites
Before you start using the Cognitive Actions from the ttsds/amphion_maskgct spec, there are a few prerequisites to consider:
- API Key: You will need an API key to authenticate your requests to the Cognitive Actions platform.
- Endpoint Setup: Ensure you have the correct endpoint URL to send your requests. This will typically be provided by the Cognitive Actions platform.
Authentication is usually performed by including your API key in the request headers, enabling secure access to the actions.
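As a minimal sketch of the header-based authentication described above, the helper below builds the request headers from an API key. The Bearer scheme matches the usage example later in this article; the environment variable name COGNITIVE_ACTIONS_API_KEY and the function name build_headers are assumptions made for illustration, not part of the platform.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token."""
    # Prefer an explicitly passed key; otherwise fall back to the environment.
    # The variable name below is hypothetical.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("Set COGNITIVE_ACTIONS_API_KEY or pass api_key")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control, which is usually preferable to hard-coding it.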
Cognitive Actions Overview
Generate Speech with MaskGCT
The Generate Speech with MaskGCT action converts text into speech using the MaskGCT model. This action is particularly useful for applications that need text-to-speech capabilities, such as virtual assistants, audiobooks, or educational tools.
Input
The input for this action must adhere to the following schema:
- language (string, required): The language code of the text. Supported values are:
  - en: English
  - zh: Chinese
- speakerReference (string, required): A URI pointing to an audio file or other reference for the speaker.
- text (string, required): The main text content for processing.
- textReference (string, required): Reference text that accompanies speakerReference. For zero-shot models such as MaskGCT, this is typically the transcript of the reference audio.
Example Input:
{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "language": "en",
  "textReference": "and keeping eternity before the eyes, though much",
  "speakerReference": "https://replicate.delivery/pbxt/MNDu8UJR7zB1dZHG3UOPCD5B4crZunv2j32UsTd3Qd5PdG1R/example.wav"
}
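Checking a payload against the schema before sending it can save a round trip. The following is a sketch of such a client-side check, not the platform's own validation. The function name validate_input is hypothetical, and the rule that speakerReference must be an absolute URI with a scheme and host is an assumption inferred from the example above.

```python
from urllib.parse import urlparse

SUPPORTED_LANGUAGES = {"en", "zh"}
REQUIRED_FIELDS = ("text", "language", "textReference", "speakerReference")


def validate_input(payload):
    """Return a list of problems found in the payload (empty if none)."""
    problems = []
    # Every field in the schema is a required, non-empty string.
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: required non-empty string")

    # Only the documented language codes are accepted.
    language = payload.get("language")
    if isinstance(language, str) and language.strip():
        if language not in SUPPORTED_LANGUAGES:
            problems.append(f"language: unsupported value {language!r}")

    # Assumption: the speaker reference is an absolute URI (scheme + host).
    speaker = payload.get("speakerReference")
    if isinstance(speaker, str) and speaker.strip():
        parsed = urlparse(speaker)
        if not (parsed.scheme and parsed.netloc):
            problems.append("speakerReference: expected an absolute URI")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once.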
Output
The action typically returns a link to the generated audio file in the form of a URI.
Example Output:
https://assets.cognitiveactions.com/invocations/596b2ba9-24b7-43c1-9530-cf1a8016316f/c3522a90-583d-4239-9950-609e30edfe16.wav
This output indicates the location of the audio file containing the generated speech.
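Once you have the URI, you will usually want the audio as a local file. The sketch below uses only the Python standard library and assumes the URI can be fetched without additional authentication, which the source does not confirm. The helper names filename_from_uri and download_audio are hypothetical.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_uri(uri, default="speech.wav"):
    """Derive a local file name from the last path segment of the URI."""
    name = posixpath.basename(urlparse(uri).path)
    return name or default


def download_audio(uri, directory="."):
    """Stream the audio at `uri` into `directory` and return the local path."""
    target = os.path.join(directory, filename_from_uri(uri))
    # Copy in chunks so large audio files are not held in memory at once.
    with urllib.request.urlopen(uri, timeout=30) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download_audio with the URI returned by the action would write the .wav file into the current directory under its original name.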
Conceptual Usage Example (Python)
Here's how you might implement the Generate Speech with MaskGCT action in your application using Python:
import json

import requests

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "1985f655-6a0d-4fdb-922b-fc96b091ecd5"  # Action ID for Generate Speech with MaskGCT

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "language": "en",
    "textReference": "and keeping eternity before the eyes, though much",
    "speakerReference": "https://replicate.delivery/pbxt/MNDu8UJR7zB1dZHG3UOPCD5B4crZunv2j32UsTd3Qd5PdG1R/example.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=120,  # Avoid hanging forever; speech generation can be slow, so adjust as needed
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    # e.response is None for connection-level failures (DNS errors, timeouts)
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON
            print(f"Response body: {e.response.text}")
In this Python code snippet:
- You need to replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
- The action ID for Generate Speech with MaskGCT is set, and the input payload is structured based on the schema provided.
- The response is handled gracefully, allowing you to capture and display any errors that occur during execution.
Conclusion
The Generate Speech with MaskGCT action from the ttsds/amphion_maskgct spec provides a robust solution for integrating text-to-speech capabilities into your applications. By leveraging this action, you can enhance user experiences through natural-sounding audio outputs. As a next step, consider exploring additional use cases such as creating audiobooks or enhancing accessibility features in your applications. Happy coding!