Create Multilingual Speech with Xtts V2 Fork

26 Apr 2025

The Xtts V2 Fork is a powerful text-to-speech service that enables developers to generate natural-sounding speech in multiple languages from text input. Built on a fork of the Coqui XTTS-v2 model, this service offers faster inference on CUDA 12.4, making it well suited to applications that require quick and efficient voice synthesis.

With the ability to create voice clones, developers can personalize audio outputs, ensuring that the synthesized speech maintains the unique characteristics of the original speaker. This capability opens up a range of possibilities for applications such as virtual assistants, audiobooks, language learning tools, and more.

Prerequisites

To use the Xtts V2 Fork, you will need an API key for the Cognitive Actions service and a basic understanding of making API calls.

Generate Voice Clone Speech

The Generate Voice Clone Speech action allows you to create multilingual speech from text. This action addresses the need for realistic and expressive voice synthesis, enabling applications to deliver content in a way that resonates with users.

Input Requirements:

  • Text: The text you want to convert to speech. This should be a string, and a default example is provided to showcase the model's language capabilities.
  • Speaker: A URL pointing to an audio file of the original speaker. The audio must be at least 6 seconds long and can be in formats like wav, mp3, or ogg.
  • Output Language: The language in which the text will be synthesized. Options include languages like English, Spanish, French, and many more.

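Because the speaker reference must be at least 6 seconds long, it can be worth validating a local audio sample before hosting it. The helper below is an illustrative sketch, not part of the service; it uses Python's standard wave module and therefore only covers WAV input:

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def is_valid_speaker_sample(path: str, minimum_seconds: float = 6.0) -> bool:
    """Check the 6-second minimum the service imposes on speaker audio."""
    return wav_duration_seconds(path) >= minimum_seconds
```

For mp3 or ogg samples, a third-party library (e.g. mutagen) would be needed to read the duration.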
Example Input:

{
  "text": "Hi there. This is forked Coqui XTTS-2 model with 17 languages supported.",
  "speaker": "https://replicate.delivery/pbxt/Jt79w0xsT64R1JsiJ0LQRL8UcWspg5J4RFrU6YwEKpOT1ukS/male.wav",
  "outputLanguage": "en"
}

Expected Output: The output will be a URL linking to the synthesized audio file containing the spoken version of the input text in the specified language.

"https://assets.cognitiveactions.com/invocations/97ded661-e787-460b-8d72-e549c34499be/ff571600-b167-4356-bc87-19a64e7c0d9f.wav"
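Once the action completes, the synthesized audio can be fetched from the returned URL and saved locally. A minimal sketch (the destination filename is an arbitrary choice):

```python
import requests

def download_speech(audio_url: str, dest_path: str) -> str:
    """Stream the synthesized audio from the returned URL to a local file."""
    response = requests.get(audio_url, stream=True, timeout=60)
    response.raise_for_status()
    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return dest_path

# e.g. download_speech(result_url, "speech.wav")
```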

Use Cases:

  • Virtual Assistants: Enhance user interactions by providing voice responses that sound like a specific person.
  • Audiobook Creation: Generate narrated versions of text content, making literature more accessible.
  • Language Learning: Create engaging and personalized audio for language learners, helping them better understand pronunciation and intonation.
  • Multimedia Applications: Add voiceovers to videos or presentations, improving viewer engagement through personalized audio.
Example Code:

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "68dff14c-53f9-43b9-afc1-bb739937cecb" # Action ID for: Generate Voice Clone Speech

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "text": "Hi there. This is forked Coqui XTTS-2 model with 17 languages supported.",
  "speaker": "https://replicate.delivery/pbxt/Jt79w0xsT64R1JsiJ0LQRL8UcWspg5J4RFrU6YwEKpOT1ukS/male.wav",
  "outputLanguage": "en"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: Generate Voice Clone Speech ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers json.JSONDecodeError and requests' own variant
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

The Xtts V2 Fork's voice cloning capabilities offer immense value for developers looking to integrate advanced text-to-speech functionalities into their applications. With support for multiple languages and the ability to replicate unique speaker characteristics, this service can cater to a wide range of use cases, from enhancing user experiences to creating engaging educational tools.

As you explore the possibilities of the Xtts V2 Fork, consider how you can leverage these features in your applications to provide richer, more interactive experiences for your users.