Transform Text to Speech with Voices Cognitive Actions

25 Apr 2025

Converting text to speech is useful wherever written content needs to be heard rather than read. The "Voices" service provides developers with Cognitive Actions for voice synthesis that turn written content into natural-sounding speech, which supports accessibility, communication, and new ways of consuming content.

The Voices service supports three synthesis modes (zero-shot, cross-lingual, and voice conversion), so developers can tailor the audio output to their needs. Whether you're building a language-learning application, creating voiceovers for videos, or developing assistive technologies, these actions simplify implementation.

Prerequisites

To get started, you'll need a Cognitive Actions API key and a basic understanding of making API calls.
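
One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes a variable named COGNITIVE_ACTIONS_API_KEY; that name and both helper functions are illustrative and are not defined by the service.

```python
import os


def load_api_key(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name your deployment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key


def build_headers(api_key: str) -> dict:
    """Build JSON request headers that carry the key as a Bearer token."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing early when the variable is missing tends to produce a clearer error than an authentication failure returned by the API.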

Perform Voice Synthesis

The Perform Voice Synthesis action is designed to convert text into speech, offering a range of customizable options to enhance the audio output. This operation is particularly beneficial for generating voiceovers, creating interactive applications, and providing auditory content for users with visual impairments.

Input Requirements

This action accepts the following parameters. Which of them are needed depends on the synthesis mode:

  • text: The text string to be synthesized (required for zero-shot and cross-lingual modes).
  • speed: A number to adjust the speech speed (default is 1, must be at least 0.2).
  • useCpu: A boolean to specify whether to use CPU instead of GPU (default is false).
  • promptText: The text corresponding to the prompt audio, applicable only in zero-shot mode.
  • promptAudio: A URI pointing to the prompt audio file for zero-shot and cross-lingual modes.
  • sourceAudio: A URI for the source audio file used in voice conversion.
  • targetAudio: A URI for the target audio file for voice conversion.
  • synthesisMode: Specifies the mode of synthesis (options include zero_shot, cross_lingual, and voice_conversion; default is zero_shot).
  • maxChunkDuration: The maximum duration for processing each audio chunk (default is 30 seconds).
  • useHalfPrecision: A boolean to activate FP16 precision for faster processing (default is true).
  • optimizeMemoryUsage: A boolean to enable optimizations for reduced memory usage (default is true).
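
The mode-specific rules in the list above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own validation: the validate_inputs helper is hypothetical, and treating promptAudio, sourceAudio, and targetAudio as mandatory for their modes is an assumption, since the list only states which modes they apply to.

```python
VALID_MODES = {"zero_shot", "cross_lingual", "voice_conversion"}


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems found in an input payload (empty if none)."""
    problems = []
    mode = inputs.get("synthesisMode", "zero_shot")  # documented default
    if mode not in VALID_MODES:
        return [f"unknown synthesisMode: {mode!r}"]

    if inputs.get("speed", 1) < 0.2:
        problems.append("speed must be at least 0.2")

    if mode in ("zero_shot", "cross_lingual"):
        if not inputs.get("text"):
            problems.append(f"text is required in {mode} mode")
        # Assumption: a prompt audio URI is needed to supply the voice.
        if not inputs.get("promptAudio"):
            problems.append(f"promptAudio is expected in {mode} mode")
    if mode != "zero_shot" and inputs.get("promptText"):
        problems.append("promptText applies only in zero_shot mode")
    if mode == "voice_conversion":
        # Assumption: both audio URIs are needed for conversion.
        for field in ("sourceAudio", "targetAudio"):
            if not inputs.get(field):
                problems.append(f"{field} is expected in voice_conversion mode")
    return problems
```

For example, a voice_conversion payload that supplies sourceAudio but omits targetAudio yields a single problem naming the missing field, while a complete cross_lingual payload yields an empty list.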

Expected Output

The output is a URI pointing to the generated audio file, which contains the synthesized speech.
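
Once a call succeeds, the audio URI has to be pulled out of the response and the file fetched. The response schema is not documented here, so the sketch below assumes the URI arrives either as a bare string or under a hypothetical "output" key (possibly as a one-element list); adjust extract_audio_uri to the actual shape. Both helper names are illustrative.

```python
import shutil
import urllib.request
from urllib.parse import urlparse


def extract_audio_uri(result):
    """Pull an http(s) URI out of a result whose shape is assumed, not documented."""
    candidate = result.get("output") if isinstance(result, dict) else result
    if isinstance(candidate, list) and candidate:
        candidate = candidate[0]
    if isinstance(candidate, str) and urlparse(candidate).scheme in ("http", "https"):
        return candidate
    raise ValueError("no audio URI found in result")


def download_audio(uri: str, destination: str) -> str:
    """Stream the audio file at `uri` to `destination` and return the path."""
    with urllib.request.urlopen(uri, timeout=60) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Streaming with shutil.copyfileobj avoids holding the whole audio file in memory, which matters for long passages.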

Use Cases for this Specific Action

This action is ideal for a variety of applications:

  • Language Learning Apps: Provide users with spoken examples of text, aiding in pronunciation and comprehension.
  • E-Learning Platforms: Enhance course materials with audio lectures, making content more engaging and accessible.
  • Assistive Technologies: Create applications that read text aloud for visually impaired users, improving accessibility.
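  • Content Creation: Generate voiceovers for videos, podcasts, or advertisements, streamlining the production process.

The following Python script shows how the action might be called in cross-lingual mode through a hypothetical execution endpoint:

import requests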
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "33fe6f58-cbbe-4cb9-a23c-f3c8cb675181" # Action ID for: Perform Voice Synthesis

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action, whose text
# is the same passage repeated four times
passage = (
    "If the supply of fruit is greater than the family needs, it may be made a "
    "source of income by sending the fresh fruit to the market if there is one "
    "near enough, or by preserving, canning, and making jelly for sale. To make "
    "such an enterprise a success the fruit and work must be first class. There "
    "is magic in the word 'Homemade,' when the product appeals to the eye and "
    "the palate; but many careless and incompetent people have found to their "
    "sorrow that this word has not magic enough to float inferior goods on the "
    "market. As a rule large canning and preserving establishments are clean "
    "and have the best appliances, and they employ chemists and skilled labor. "
    "The home product must be very good to compete with the attractive goods "
    "that are sent out from such establishments. Yet for first-class homemade "
    "products there is a market in all large cities. All first-class grocers "
    "have customers who purchase such goods."
)

payload = {
  "text": " ".join([passage] * 4),
  "speed": 1,
  "useCpu": false,
  "promptAudio": "https://replicate.delivery/pbxt/LsULg0uIcSsRSpH4gshQOgz6JT8yRNVmwcJdnTwwcNG0vNMa/voice_preview_Android%20X.Y.%20Z.%20-%20AI%20Robot%20of%20the%20Future.mp3",
  "synthesisMode": "cross_lingual",
  "maxChunkDuration": 30,
  "useHalfPrecision": true,
  "optimizeMemoryUsage": true
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print("--- Calling Cognitive Action: Perform Voice Synthesis ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors subclass ValueError
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")

Conclusion

The Voices Cognitive Actions offer an efficient and powerful way to integrate voice synthesis into your applications. By leveraging these capabilities, developers can create more interactive and accessible experiences for users. Whether you're enhancing educational tools, building new content platforms, or developing assistive technologies, the potential applications are vast. Start exploring the Voices service today and unlock new possibilities for your projects!