Create Engaging Text-to-Speech Experiences with jichengdu/spark-tts Actions

A rich auditory experience can significantly enhance user engagement. The jichengdu/spark-tts API offers a set of Cognitive Actions that let developers create synthetic speech from text. This guide covers generating speech in two modes, voice cloning and custom voice creation. With these pre-built actions, you can integrate text-to-speech functionality into your applications with little setup.
Prerequisites
Before diving into the integration, ensure you have the following:
- An API key for the Cognitive Actions platform.
- Familiarity with making HTTP requests in your application.
- Basic knowledge of JSON format for payload structuring.
Authentication typically involves passing your API key in the request headers, allowing you to securely access the Cognitive Actions.
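As a minimal sketch of the header-based authentication described above, the helper below builds the request headers from an API key. The Bearer scheme matches the conceptual example later in this guide; the function name build_headers and the environment variable COGNITIVE_ACTIONS_API_KEY are assumptions made for illustration, not documented parts of the platform.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return request headers carrying the API key as a Bearer token."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# COGNITIVE_ACTIONS_API_KEY is a hypothetical environment variable name;
# the fallback string is a placeholder, not a working key.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "YOUR_COGNITIVE_ACTIONS_API_KEY")
headers = build_headers(api_key)
```

Reading the key from the environment keeps it out of source control.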
Cognitive Actions Overview
Generate Speech from Text
The Generate Speech from Text action provides an easy way to create synthetic speech using the Spark-TTS model. This action supports two modes: voice cloning, which replicates a specific voice using prompt audio, and voice creation, where you can customize the voice parameters such as gender, pitch, and speed.
Input
The action takes a JSON object with the following fields. The // comments are annotations only and must be removed from a real request, since JSON does not allow comments:
{
  "mode": "voice_creation", // or "voice_cloning"
  "text": "白日依山尽,黄河入海流。",
  "topK": 50,
  "topP": 0.95,
  "pitch": "high",
  "speed": "low",
  "gender": "female",
  "promptText": "",
  "temperature": 0.8,
  "promptSpeechPath": "http://example.com/prompt_audio.wav" // Required for voice_cloning
}
Example Input:
{
  "mode": "voice_creation",
  "text": "白日依山尽,黄河入海流。",
  "topK": 50,
  "topP": 0.95,
  "pitch": "high",
  "speed": "low",
  "gender": "female",
  "promptText": "",
  "temperature": 0.8
}
Required Fields:
- text: The text to convert into speech.
- mode: Choose between voice_cloning or voice_creation.
Optional Fields:
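- topK: Limits token selection to the K most likely candidates at each sampling step.
- topP: Nucleus sampling threshold; sampling is restricted to the smallest set of tokens whose cumulative probability reaches this value.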
- pitch, speed, gender: Specify voice characteristics for voice creation.
- promptText: Transcript for voice cloning.
- promptSpeechPath: URI to prompt audio (required for voice cloning).
- temperature: Influences randomness in speech generation.
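The field rules above can be sketched as a small payload builder that checks the required fields before a request is sent. This is an illustration only: build_payload and VALID_MODES are hypothetical names, and dropping unset options assumes the service falls back to its own defaults for omitted optional fields, which this guide does not confirm.

```python
VALID_MODES = {"voice_creation", "voice_cloning"}


def build_payload(text, mode, **options):
    """Assemble an input payload and check the fields listed as required."""
    if not text:
        raise ValueError("text is required")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if mode == "voice_cloning" and not options.get("promptSpeechPath"):
        raise ValueError("promptSpeechPath is required for voice_cloning")
    payload = {"text": text, "mode": mode}
    # Assumption: omitted optional fields fall back to service-side defaults,
    # so options explicitly set to None are left out of the payload.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


# Voice creation: describe the voice with gender, pitch, and speed.
creation = build_payload("Hello", "voice_creation", gender="female", pitch="high", speed="low")

# Voice cloning: point at prompt audio and, optionally, its transcript.
cloning = build_payload(
    "Hello",
    "voice_cloning",
    promptSpeechPath="http://example.com/prompt_audio.wav",
    promptText="Transcript of the prompt audio.",
)
```

Validating on the client side surfaces a missing promptSpeechPath before a request is spent on it.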
Output
The output of the action is typically a URL to the generated audio file. Here’s an example of the output response:
https://assets.cognitiveactions.com/invocations/9f9472af-806b-44a0-a214-58a1fa8edb68/a5ea0f13-8169-405e-8e2f-80925cf63813.wav
This URL points to the audio file where the synthesized speech can be accessed.
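Once you have the audio URL, saving the file locally might look like the sketch below. It uses only the Python standard library; the helper names audio_filename and download_audio are illustrative, and the code assumes the URL can be fetched with a plain GET request without extra authentication, which this guide does not state.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def audio_filename(url: str) -> str:
    """Derive a local file name from the last path segment of the audio URL."""
    name = posixpath.basename(urlparse(url).path)
    # Fall back to a generic name when the URL path has no final segment.
    return name or "speech.wav"


def download_audio(url: str, dest_dir: str = ".") -> str:
    """Download the audio file and return the path it was saved to."""
    dest = os.path.join(dest_dir, audio_filename(url))
    # Assumption: the asset URL is readable without an Authorization header.
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as fh:
        fh.write(resp.read())
    return dest
```

Calling download_audio with the URL returned by the action would write the .wav file into the current directory.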
Conceptual Usage Example (Python)
Here’s how you might use this action in a Python application:
import json

import requests

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "bd0a382e-91cf-497d-9448-2186ce16b742"  # Action ID for Generate Speech from Text

# Construct the input payload based on the action's requirements
payload = {
    "mode": "voice_creation",
    "text": "白日依山尽,黄河入海流。",
    "topK": 50,
    "topP": 0.95,
    "pitch": "high",
    "speed": "low",
    "gender": "female",
    "promptText": "",
    "temperature": 0.8,
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2, ensure_ascii=False))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:
            # The error body is not valid JSON; fall back to the raw text
            print(f"Response body: {e.response.text}")
In this code snippet, make sure to replace the placeholder values with your actual API key and endpoint. The action_id variable is set to the ID for the Generate Speech from Text action. The input JSON payload is structured according to the action's requirements.
Conclusion
The jichengdu/spark-tts actions open up a world of possibilities for integrating text-to-speech functionality into your applications. By utilizing the Generate Speech from Text action, you can create engaging auditory experiences tailored to your users' needs. Whether you choose voice cloning or custom voice creation, these capabilities can enhance accessibility and interactivity in your projects.
Start exploring these actions today and consider how they can enrich your applications!