Enhance User Experience with Multilingual Voice Synthesis Using CosyVoice

26 Apr 2025

In an increasingly globalized world, the ability to communicate across languages is essential. CosyVoice offers an innovative solution: powerful Cognitive Actions designed for seamless voice synthesis. Capable of generating high-quality, low-latency speech, CosyVoice supports a range of voice cloning tasks, enabling developers to create multilingual experiences that resonate with users from diverse backgrounds.

Imagine providing your application with the ability to speak in multiple languages, utilizing unique voices that reflect the original speaker's tone and style. This is especially beneficial in scenarios such as creating localized content for global audiences, enhancing accessibility for users with different language preferences, or even developing personalized voice assistants that feel more human-like.

Prerequisites

To get started, you will need a valid Cognitive Actions API key and a basic understanding of how to make API calls.
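As a minimal sketch of that setup, you can keep the API key in an environment variable rather than in source code and build the standard bearer-token headers from it (the variable name and fallback placeholder here are illustrative choices, not part of the API):

```python
import os

# Read the Cognitive Actions API key from the environment; the fallback
# placeholder only exists so the snippet runs without configuration.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY", "YOUR_COGNITIVE_ACTIONS_API_KEY")

# Standard bearer-token headers for JSON API calls
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
```

These headers are reused by the full example later in this post.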

Generate Scalable Multilingual Voice Synthesis

The Generate Scalable Multilingual Voice Synthesis action allows developers to create high-quality speech synthesis effortlessly. This action is well suited to projects requiring multilingual support, rapid synthesis, and stronger pronunciation than earlier CosyVoice releases offered.

Purpose

This action solves the problem of generating natural-sounding speech in multiple languages while maintaining the essence of the original voice. It leverages advanced voice cloning technology that can handle zero-shot and cross-lingual tasks, making it versatile for various applications.

Input Requirements

To utilize this action, you need to provide the following inputs:

  • sourceAudio: A URI link to the source audio file that serves as the basis for voice cloning.
  • sourceTranscript: The transcript of the spoken content in the source audio, accurately reflecting what is said.
  • textToSpeechText: The text you want to convert into speech, which will be synthesized using the cloned voice.
  • task: Specifies the type of voice cloning or generation task (e.g., zero-shot voice clone, cross-lingual voice clone, instructed voice generation).
  • instruction: Specific instructions for the Instructed Voice Generation task, if applicable.

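Before sending a request, it can help to sanity-check that a payload carries every required field. The helper below is a hypothetical sketch (the field names come from the list above, but the validation rules are my own assumption for illustration):

```python
# Fields every task needs, per the input requirements listed above
REQUIRED_FIELDS = ["task", "sourceAudio", "sourceTranscript", "textToSpeechText"]

def validate_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks complete."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not payload.get(f)]
    # Assumed rule: instruction only matters for instructed voice generation
    if payload.get("task") == "instructed voice generation" and not payload.get("instruction"):
        problems.append("instructed voice generation requires a non-empty 'instruction'")
    return problems

payload = {
    "task": "zero-shot voice clone",
    "instruction": "",
    "sourceAudio": "https://replicate.delivery/pbxt/MgbBQRAKfZkuc9EcspUou25Uxfdgc3xWS43kvqIla8eWBsaQ/zero_shot_prompt.wav",
    "sourceTranscript": "希望你以后能够做得比我还好哟!",
    "textToSpeechText": "白日依山尽,黄河入海流。",
}
print(validate_payload(payload))  # an empty list means the payload is ready to send
```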
Example Input

```json
{
  "task": "zero-shot voice clone",
  "instruction": "",
  "sourceAudio": "https://replicate.delivery/pbxt/MgbBQRAKfZkuc9EcspUou25Uxfdgc3xWS43kvqIla8eWBsaQ/zero_shot_prompt.wav",
  "sourceTranscript": "希望你以后能够做得比我还好哟!",
  "textToSpeechText": "白日依山尽,黄河入海流。"
}
```

Expected Output

The output will be a URI link to the synthesized audio file, which you can use directly in your applications. The audio generated will reflect the input parameters, delivering a seamless user experience.

Example Output

https://assets.cognitiveactions.com/invocations/ad8aabb6-251f-4471-944c-752babf40f7d/4b5b024e-04a0-4313-b403-d8ec0a6ca87d.wav
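Since the action returns a link rather than raw audio bytes, a typical next step is to download the file. A minimal sketch with requests (the helper name is my own):

```python
import requests

def download_audio(audio_url: str, out_path: str) -> str:
    """Fetch the synthesized audio at audio_url and save it to out_path."""
    resp = requests.get(audio_url, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of saving an error page
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path
```

You could then call `download_audio(result_url, "synthesized.wav")` with the link your invocation returned.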

Use Cases for this Specific Action

  1. Localization of Content: If you have global users, using multilingual voice synthesis can help localize your content, making it more relatable and engaging.
  2. Creating Voice Assistants: Developers can create personalized voice assistants that can speak multiple languages, enhancing user interaction.
  3. Accessibility Solutions: This action can be used to provide audio content for users with disabilities, ensuring that everyone has access to information in their preferred language.

```python
import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "732946b7-a569-40ed-a60c-3da1702a7d62" # Action ID for: Generate Scalable Multilingual Voice Synthesis

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "task": "zero-shot voice clone",
  "instruction": "",
  "sourceAudio": "https://replicate.delivery/pbxt/MgbBQRAKfZkuc9EcspUou25Uxfdgc3xWS43kvqIla8eWBsaQ/zero_shot_prompt.wav",
  "sourceTranscript": "希望你以后能够做得比我还好哟!",
  "textToSpeechText": "白日依山尽,黄河入海流。"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body,
        timeout=120,  # synthesis can take a while; avoid hanging indefinitely
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # covers json.JSONDecodeError from non-JSON bodies
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")
```


Conclusion

CosyVoice's multilingual voice synthesis capabilities bring a wealth of benefits to developers looking to enhance their applications. With high-quality, low-latency speech synthesis, you can create engaging, personalized experiences for users worldwide. Whether you are localizing content, building voice assistants, or improving accessibility, integrating CosyVoice will help you meet diverse user needs effectively.

As you explore the possibilities of voice synthesis, consider how you can leverage this technology to create impactful interactions. Start integrating today and elevate your application's communication capabilities!