Create Engaging Speech with Mars5 Text-to-Speech Actions

25 Apr 2025

The Mars5 TTS (Text-to-Speech) service lets developers generate high-quality speech with the MARS5 English Speech Model, which is known for handling prosody well across a wide range of material. Whether you're creating sports commentary, developing interactive anime content, or simply need a voice for your application, Mars5 TTS turns text into lifelike speech.

By integrating Mars5 TTS, developers can save time and resources while enhancing user experience with natural-sounding audio. Common use cases include voiceovers for videos, automated customer service responses, and educational content narration. With customizable parameters, you can tailor the speech output to fit your specific needs, making it a powerful tool in your development arsenal.

Prerequisites

Before you begin, ensure you have a Cognitive Actions API key and a basic understanding of making API calls. This will enable you to seamlessly access the Mars5 TTS functionalities.
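
One way to keep the API key out of source code is to read it from an environment variable. The sketch below assumes a variable named COGNITIVE_ACTIONS_API_KEY (an illustrative name, not one the service mandates) and builds the same Bearer header used in the example request later in this post. The helper names load_api_key and build_headers are hypothetical.

```python
import os


def load_api_key(env_var="COGNITIVE_ACTIONS_API_KEY", environ=os.environ):
    """Read the API key from the environment (variable name is an assumption)."""
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return key


def build_headers(api_key):
    """Build the request headers used by the example call in this post."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Calling build_headers(load_api_key()) at startup fails fast with a clear message if the key is missing, rather than surfacing later as an authorization error.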

Generate Speech with Mars5 TTS

The Generate Speech with Mars5 TTS action synthesizes speech from text using a two-stage pipeline: an autoregressive (AR) stage followed by a non-autoregressive (NAR) stage. It is best suited to scenarios that call for nuanced, expressive speech.

Input Requirements

  • text: The text that will be synthesized into speech. It’s recommended to craft engaging and clear text for better synthesis results.
  • topK: The number of highest-probability vocabulary tokens to keep during top-K filtering. Lower values restrict sampling to the most likely tokens.
  • temperature: Controls the randomness of the speech generation. A lower value makes the output more predictable, while a higher value increases variability.
  • frequencyPenalty: Applies a penalty to tokens based on their existing frequency in the generated output, promoting diversity in the speech.
  • referenceAudioFile: A URI for a reference audio file used for voice cloning. The audio should be no longer than 10 seconds.
  • repetitionPenaltyWindow: Defines how many of the most recently generated tokens are considered when applying repetition penalties, helping to reduce repeated sounds or phrases in the output.
  • referenceAudioTranscript: The transcription of the spoken content within the reference audio file, which should match the audio for optimal cloning accuracy.
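
To make the sampling parameters more concrete, here is a minimal, generic sketch of how top-K filtering, temperature, and a windowed frequency penalty are commonly combined when picking the next token. It is an illustration of the general technique only, not the MARS5 implementation; the function name, argument names, and the exact penalty formula are assumptions.

```python
import math
import random
from collections import Counter


def sample_next_token(logits, history, top_k=100, temperature=1.1,
                      frequency_penalty=5.0, penalty_window=150, rng=random):
    """Pick one token id from `logits` (a list of scores, one per token id)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Frequency penalty: lower the score of tokens seen in the recent window,
    # in proportion to how often they appeared there.
    recent = Counter(history[-penalty_window:]) if penalty_window > 0 else Counter()
    adjusted = [score - frequency_penalty * recent[tok]
                for tok, score in enumerate(logits)]

    # Top-K filtering: keep only the k highest-scoring token ids.
    k = max(1, min(top_k, len(adjusted)))
    kept = sorted(range(len(adjusted)), key=adjusted.__getitem__, reverse=True)[:k]

    # Temperature: divide scores before the softmax. Values below 1 sharpen
    # the distribution; values above 1 flatten it.
    scaled = [adjusted[tok] / temperature for tok in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]
```

In this sketch, a small topK with a low temperature yields near-deterministic output, while a larger frequencyPenalty steers sampling away from tokens that already appear often within the window.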

Expected Output

The action returns a URL for the generated speech audio file, which you can download or reference directly from your application or project.
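
Because the exact shape of the response JSON is not documented here, the sketch below takes a defensive approach: it searches the parsed result for the first http(s) URL and then saves that file locally. The helper names find_audio_url and save_audio are hypothetical, and the assumption that the first URL in the response is the audio file should be checked against a real response.

```python
import shutil
import urllib.request
from urllib.parse import urlparse


def find_audio_url(result):
    """Return the first http(s) URL string found anywhere in `result`, or None."""
    if isinstance(result, str):
        parsed = urlparse(result)
        is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        return result if is_url else None
    if isinstance(result, dict):
        children = result.values()
    elif isinstance(result, (list, tuple)):
        children = result
    else:
        return None
    for child in children:
        url = find_audio_url(child)
        if url:
            return url
    return None


def save_audio(url, path, timeout=60):
    """Stream the file at `url` to `path` using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

A typical use would be url = find_audio_url(result) followed by save_audio(url, "speech.wav") when url is not None. The output file name is arbitrary.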

Use Cases for this Specific Action

  • Sports Commentary: Create dynamic and engaging commentary for sports events, enhancing the viewer experience with expressive audio.
  • Anime Voiceovers: Synthesize character voices for animated content, providing unique and varied speech styles that bring characters to life.
  • Interactive Applications: Integrate lifelike speech in gaming or virtual environments, making user interactions more immersive and enjoyable.

The following Python snippet shows one way to call this action. The endpoint URL is hypothetical, as noted in the code comments.

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "b86aabad-680f-466c-85e2-9f269f3105db" # Action ID for: Generate Speech with Mars5 TTS

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
    "text": "Introducing Mars5, a revolutionary open-source text-to-speech model.",
    "topK": 100,
    "temperature": 1.1,
    "frequencyPenalty": 5,
    "referenceAudioFile": "https://replicate.delivery/pbxt/L9a6SelzU0B2DIWeNpkNR0CKForWSbkswoUP69L0NLjLswVV/voice_sample.wav",
    "repetitionPenaltyWindow": 150,
    "referenceAudioTranscript": "Hi there. I'm your new voice clone. Try your best to upload quality audio."
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: Generate Speech with Mars5 TTS ({action_id}) ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body,
        timeout=120  # Fail instead of hanging indefinitely; adjust for long syntheses
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors subclass ValueError across requests versions
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

Mars5 TTS offers developers a robust solution for generating high-quality speech tailored to various applications. By providing customizable options for text synthesis, it enhances the overall user experience and allows for creative flexibility in audio production. Consider incorporating Mars5 TTS into your next project to elevate the quality of your audio content and make your applications more engaging. Start exploring the possibilities today!