Transform Text to Speech with ttsds/gptsovits_2 Cognitive Actions

Text-to-speech (TTS) is now straightforward to integrate into applications. The ttsds/gptsovits_2 API provides Cognitive Actions that convert text into natural-sounding speech, with support for multiple languages and speaker references. This blog post walks you through integrating the Perform Text-to-Speech Prediction action and shows how to use it in your applications.
Prerequisites
Before diving into the Cognitive Actions, ensure you have the following:
- An API key for the Cognitive Actions platform, which you will use for authentication.
- Familiarity with JSON structures, as you'll be constructing and sending JSON payloads.
- Basic understanding of Python and libraries such as requests will be helpful for making API calls.
Authentication typically involves passing your API key in the request headers, allowing you to securely access the Cognitive Actions.
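As a minimal sketch of that pattern, the helper below reads the key from an environment variable instead of hardcoding it, then builds the request headers. The variable name COGNITIVE_ACTIONS_API_KEY and the function build_headers are illustrative assumptions, and the Bearer scheme mirrors the usage example later in this post; check both against the platform's documentation.

```python
import os


def build_headers(api_key=None):
    """Build request headers carrying the API key as a Bearer token.

    Assumption: the platform accepts 'Authorization: Bearer <key>'.
    """
    # Fall back to a (hypothetical) environment variable so the key
    # does not have to be hardcoded in source files.
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("No API key provided or found in the environment")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Calling build_headers() with no argument raises a ValueError when the environment variable is unset, which surfaces a missing key before any request is sent.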
Cognitive Actions Overview
Perform Text-to-Speech Prediction
The Perform Text-to-Speech Prediction action is designed to convert text into speech, offering support for various languages, including English, Chinese, Japanese, Korean, and Cantonese. This action allows you to specify a speaker reference to personalize the speech output.
Input
To use this action, you need to provide the following input fields:
- text (required): The main content that will be processed. It must be a string.
- language (required): Specifies the language of the text. Acceptable values are:
  - en for English
  - zh for Chinese
  - ja for Japanese
  - ko for Korean
  - yue for Cantonese
- textReference (optional): A supporting text reference related to the main content.
- speakerReference (required): A URI link pointing to an audio recording of the speaker, used as a reference.
Here’s an example of the JSON payload required to invoke this action:
{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "language": "en",
  "textReference": "and keeping eternity before the eyes, though much.",
  "speakerReference": "https://replicate.delivery/pbxt/MNDu8UJR7zB1dZHG3UOPCD5B4crZunv2j32UsTd3Qd5PdG1R/example.wav"
}
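Before sending a payload like this, it can help to check it locally against the field rules listed above. The following is a rough sketch of such a check, not part of the API: validate_payload and the two constants are hypothetical names, and treating textReference as a string is an assumption based on the example.

```python
SUPPORTED_LANGUAGES = {"en", "zh", "ja", "ko", "yue"}
REQUIRED_FIELDS = ("text", "language", "speakerReference")


def validate_payload(payload):
    """Return a list of problems found in the payload (empty if none)."""
    problems = []
    # Every required field must be present as a non-empty string.
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' is required and must be a non-empty string")
    # The language code must be one of the documented values.
    language = payload.get("language")
    if isinstance(language, str) and language.strip() and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language code: {language!r}")
    # textReference is optional; assumed to be a string when given.
    reference = payload.get("textReference")
    if reference is not None and not isinstance(reference, str):
        problems.append("'textReference' must be a string when provided")
    return problems
```

An empty list means the payload passes these client-side checks; the service may still apply its own validation, for example on whether the speakerReference URI is reachable.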
Output
When you execute the action, it typically returns a URI linking to the audio file of the generated speech. For example, the output might look like this:
https://assets.cognitiveactions.com/invocations/15a149d9-ce4c-4024-8e35-acde00f48084/f73dbb0d-0cff-43c6-997a-d8ba93f07298.wav
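If the returned URI can be fetched directly, saving the audio locally might look like the sketch below. It uses only the Python standard library and assumes the file is downloadable without extra authentication headers; filename_from_uri and download_audio are illustrative helper names, not part of the platform.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_uri(uri):
    """Return the last path segment of the URI, ignoring any query string."""
    return os.path.basename(urlparse(uri).path)


def download_audio(uri, dest_dir="."):
    """Stream the audio file at `uri` into dest_dir and return the local path."""
    local_path = os.path.join(dest_dir, filename_from_uri(uri))
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urlopen(uri, timeout=60) as response, open(local_path, "wb") as fh:
        shutil.copyfileobj(response, fh)
    return local_path
```

Streaming with shutil.copyfileobj avoids loading the whole file into memory, which matters for longer generated clips.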
Conceptual Usage Example (Python)
Here’s a conceptual Python code snippet to demonstrate how you might call the Perform Text-to-Speech Prediction action using the requests library:
import json

import requests

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "d9883102-7b90-496a-965f-00cda9f11726"  # Action ID for Perform Text-to-Speech Prediction

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "language": "en",
    "textReference": "and keeping eternity before the eyes, though much.",
    "speakerReference": "https://replicate.delivery/pbxt/MNDu8UJR7zB1dZHG3UOPCD5B4crZunv2j32UsTd3Qd5PdG1R/example.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; fall back to raw text
            print(f"Response body: {e.response.text}")
In this snippet, replace the placeholder YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action ID and the input payload are structured according to the Perform Text-to-Speech Prediction action's requirements.
Conclusion
The ttsds/gptsovits_2 Cognitive Actions provide a robust solution for integrating text-to-speech capabilities into your applications. By utilizing the Perform Text-to-Speech Prediction action, you can create engaging audio outputs tailored to various languages and speaker styles. As you explore these capabilities, consider how you can enhance user experiences through personalized audio content, such as in e-learning platforms, accessibility tools, or interactive storytelling applications. Happy coding!