Generate High-Quality Conversational Speech with Csm 1b

26 Apr 2025
The Csm 1b service gives developers a way to generate high-quality conversational speech. It is built on the Conversational Speech Model (CSM) from Sesame, which generates RVQ (residual vector quantization) audio codes from text and audio inputs. The model pairs a Llama backbone with an audio decoder, producing natural-sounding speech suited to a range of applications.

Common use cases for Csm 1b include educational tools that require dynamic and engaging audio content, interactive voice response systems, and research projects focusing on speech generation. By integrating this service, developers can enhance user experiences, making interactions more lifelike and engaging.

Before you start, ensure you have your Cognitive Actions API key and a basic understanding of making API calls.
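
One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes a variable named COGNITIVE_ACTIONS_API_KEY; that name is an illustrative choice for this example, not something the service is known to require.

```python
import os


def load_api_key(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

Failing early with a clear message tends to be easier to debug than sending a request with an empty key and reading an authorization error afterwards.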

Generate Conversational Speech

The "Generate Conversational Speech" action converts text input into spoken audio. It gives applications a speech synthesis step they can use to build more interactive, voice-driven experiences.

Input Requirements

To utilize this action, you need to provide the following inputs:

  • Text: The text that you want to convert to speech. For example, "This is CSM by Sesame, generate FVQ audio codes from text."
  • Speaker ID: Choose between two speaker options (0 for the default speaker or 1 for an alternate speaker).
  • Max Audio Length (ms): Specify the maximum duration for the generated audio, which can range from 1000 to 30000 milliseconds. The default value is set to 10000 milliseconds.
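
The constraints above can be checked on the client before any request is sent. The following is a minimal sketch, assuming the field names used in this article's example payload (text, speaker, maxAudioLengthMs); the helper name build_payload is hypothetical and not part of any SDK.

```python
DEFAULT_MAX_AUDIO_LENGTH_MS = 10000
MIN_AUDIO_LENGTH_MS = 1000
MAX_AUDIO_LENGTH_MS = 30000
VALID_SPEAKERS = (0, 1)


def build_payload(text, speaker=0, max_audio_length_ms=DEFAULT_MAX_AUDIO_LENGTH_MS):
    """Validate the three inputs and return them as a payload dict."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(speaker, bool) or speaker not in VALID_SPEAKERS:
        raise ValueError("speaker must be 0 (default) or 1 (alternate)")
    if isinstance(max_audio_length_ms, bool) or not isinstance(max_audio_length_ms, int):
        raise ValueError("max_audio_length_ms must be an integer")
    if not MIN_AUDIO_LENGTH_MS <= max_audio_length_ms <= MAX_AUDIO_LENGTH_MS:
        raise ValueError(
            f"max_audio_length_ms must be between {MIN_AUDIO_LENGTH_MS} "
            f"and {MAX_AUDIO_LENGTH_MS}"
        )
    return {
        "text": text,
        "speaker": speaker,
        "maxAudioLengthMs": max_audio_length_ms,
    }
```

For example, build_payload("Hello there", speaker=1) returns a dict with the default 10000 ms limit, while build_payload("Hello", max_audio_length_ms=500) raises ValueError because the value falls below the documented minimum.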

Expected Output

Upon successful execution, you will receive a URL to the generated audio file, which you can play directly or download. For instance, the output might look like this: https://assets.cognitiveactions.com/invocations/49fec5ab-1cdc-4981-869e-4b13b6d12f8b/06a32589-042e-4f1f-927a-2fb7897e5199.wav.
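
Because this article does not document the exact shape of the response JSON, the sketch below makes no assumption about key names. It walks whatever structure comes back, collects strings that look like .wav URLs, and can then save one to disk using only the standard library. The helper names find_audio_urls, filename_from_url, and download_audio are hypothetical.

```python
import shutil
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse


def find_audio_urls(result, suffix=".wav"):
    """Recursively collect string values that look like audio file URLs."""
    found = []
    if isinstance(result, str):
        is_http = result.startswith(("http://", "https://"))
        if is_http and urlparse(result).path.lower().endswith(suffix):
            found.append(result)
    elif isinstance(result, dict):
        for value in result.values():
            found.extend(find_audio_urls(value, suffix))
    elif isinstance(result, (list, tuple)):
        for item in result:
            found.extend(find_audio_urls(item, suffix))
    return found


def filename_from_url(url):
    """Return the last path segment of the URL, e.g. 'clip.wav'."""
    return PurePosixPath(urlparse(url).path).name


def download_audio(url, dest_path, timeout=60):
    """Stream the file at `url` to `dest_path` and return the path."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
    return dest_path
```

A typical use would be to call find_audio_urls on the parsed response, take the first match, and pass it to download_audio together with filename_from_url(url) as the destination. If the real response schema is known, reading the URL from its documented field is simpler and more robust than this search.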

Use Cases for This Action

This action is ideal for:

  • Educational Platforms: Create engaging audio lessons or tutorials that speak directly to students, enhancing learning experiences.
  • Virtual Assistants: Implement realistic speech in applications that require conversational interfaces, improving user interaction.
  • Accessibility Solutions: Generate audio content for visually impaired users, ensuring information is accessible in an auditory format.
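
The following Python script sketches how to invoke this action. The endpoint URL and request structure are illustrative; check the Cognitive Actions documentation for the actual values.

import requests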
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "36c33b14-3b94-43a1-b0d7-c0ab71c49081" # Action ID for: Generate Conversational Speech

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
    "text": "This is CSM by Sesame, generate FVQ audio codes from text",
    "speaker": 0,
    "maxAudioLengthMs": 10000
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print("--- Calling Cognitive Action: Generate Conversational Speech ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body,
        timeout=60  # Avoid waiting indefinitely if the server does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # response.json() raises a ValueError subclass on invalid JSON
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

The Csm 1b service gives developers a practical way to generate high-quality conversational speech. By converting text into lifelike audio, it supports applications in domains such as education, customer service, and accessibility, and can make user interactions more engaging.

As a next step, consider exploring additional features of the Cognitive Actions API to further expand your application's capabilities and improve user experiences.