Generate High-Quality Speech with lucataco/csm-1b Cognitive Actions

The lucataco/csm-1b API provides Cognitive Actions that give developers access to advanced speech generation. One such action generates RVQ (residual vector quantization) audio codes from text and audio inputs using the Conversational Speech Model (CSM) by Sesame. The model delivers high-quality voice synthesis, making it well suited to research and educational applications.
Prerequisites
To get started with the Cognitive Actions in the lucataco/csm-1b API, you'll need an API key for the Cognitive Actions platform. Typically, authentication can be handled by passing this key in the headers of your API requests, ensuring secure access to the functionalities provided.
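The snippet below sketches how the authentication headers might be built, assuming the Bearer-token scheme used in the full example later in this article. Reading the key from an environment variable (name chosen here for illustration) avoids hard-coding credentials:

```python
import os

# Hypothetical: read the key from an environment variable rather than hard-coding it.
COGNITIVE_ACTIONS_API_KEY = os.environ.get(
    "COGNITIVE_ACTIONS_API_KEY", "YOUR_COGNITIVE_ACTIONS_API_KEY"
)

def auth_headers(api_key: str) -> dict:
    """Build request headers for the Cognitive Actions API (Bearer scheme assumed)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```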
Cognitive Actions Overview
Generate RVQ Audio Codes
The Generate RVQ Audio Codes action is designed to convert text into speech using the CSM model. This action allows users to specify various parameters, including the speaker and the maximum audio length.
Input
The input for this action requires the following fields:
- Text: The string of text you want to convert to speech.
- Speaker ID: An integer indicating which speaker to use for speech synthesis (0 or 1).
- Maximum Audio Length: An integer defining the maximum duration of the generated audio in milliseconds, which must be between 1,000 and 30,000 milliseconds.
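The constraints above (speaker 0 or 1, audio length between 1,000 and 30,000 ms) can be checked client-side before a request is sent. A minimal sketch; the function name is illustrative, not part of the API:

```python
def validate_inputs(text: str, speaker: int, max_audio_length_ms: int) -> dict:
    """Validate the documented input constraints and return a request payload."""
    if not text:
        raise ValueError("text must be a non-empty string")
    if speaker not in (0, 1):
        raise ValueError("speaker must be 0 or 1")
    if not 1_000 <= max_audio_length_ms <= 30_000:
        raise ValueError("maxAudioLengthMs must be between 1000 and 30000")
    return {"text": text, "speaker": speaker, "maxAudioLengthMs": max_audio_length_ms}
```

Calling it with out-of-range values raises a `ValueError` before any network traffic happens.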
Example Input:
```json
{
  "text": "This is CSM by Sesame, generate RVQ audio codes from text",
  "speaker": 0,
  "maxAudioLengthMs": 10000
}
```
Output
The output of the action is a URL pointing to the generated audio file. This audio file can then be used in various applications, such as embedding in web pages or applications.
Example Output:
https://assets.cognitiveactions.com/invocations/5d3b93eb-56b7-4ab9-84ed-7e8dd98ab5f0/62131ffb-3b2c-466f-ae09-ebc05adcdf8f.wav
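The returned URL can be fetched like any other web asset. Below is a stdlib-only sketch that derives a local filename from the URL's last path segment and saves the WAV file; the helper names are illustrative:

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def filename_from_url(url: str) -> str:
    """Derive a local filename from the asset URL's final path segment."""
    return Path(urlparse(url).path).name

def download_audio(url: str, dest_dir: str = ".") -> Path:
    """Fetch the generated audio file and save it locally."""
    dest = Path(dest_dir) / filename_from_url(url)
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as f:
        f.write(resp.read())
    return dest

# Example (network call, so not run here):
# download_audio("https://assets.cognitiveactions.com/invocations/5d3b93eb-56b7-4ab9-84ed-7e8dd98ab5f0/62131ffb-3b2c-466f-ae09-ebc05adcdf8f.wav")
```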
Conceptual Usage Example (Python)
Below is a conceptual Python code snippet demonstrating how to invoke the Generate RVQ Audio Codes action using a hypothetical Cognitive Actions execution endpoint:
```python
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "d2c33a73-b998-481f-a698-6d3ebfecfea0"  # Action ID for Generate RVQ Audio Codes

# Construct the input payload based on the action's requirements
payload = {
    "text": "This is CSM by Sesame, generate RVQ audio codes from text",
    "speaker": 0,
    "maxAudioLengthMs": 10000
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical request structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
```
In this example, you'll replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action ID is set to the one provided for the Generate RVQ Audio Codes action. The input payload is constructed according to the action's schema, ensuring that the request is correctly formatted.
Conclusion
The lucataco/csm-1b Cognitive Actions provide developers with a robust solution for generating high-quality speech from text. By harnessing the power of the Conversational Speech Model, you can create engaging audio content for various applications. As you explore these capabilities further, consider how integrating speech synthesis can enhance your projects, whether in education, research, or entertainment.