Transform Your Applications with Text-to-Speech Synthesis using Voicecraft Actions

23 Apr 2025

Adding voice output to an application can improve both user engagement and accessibility. The Voicecraft API, part of the ttsds/voicecraft spec, offers Cognitive Actions that let developers synthesize speech from text. With these pre-built actions you can turn written content into natural-sounding speech whose voice characteristics are adjusted from a reference recording, for uses ranging from virtual assistants to multimedia content creation.

Prerequisites

Before diving into the integration of Voicecraft's Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Basic familiarity with JSON and Python, as the examples use both.

Authentication typically involves passing your API key in the request headers, so the service can verify each request your application sends.
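
As a minimal sketch of header-based authentication, the helper below reads the key from an environment variable and builds the headers. The environment variable name and the build_headers helper are assumptions made for illustration. The Bearer scheme mirrors the Python example later in this article, so check the platform documentation for the exact header it expects.

```python
import os


def build_headers(api_key: str) -> dict:
    """Build request headers that carry the API key as a Bearer token."""
    if not api_key or not api_key.strip():
        raise ValueError("API key is missing or empty")
    return {
        "Authorization": f"Bearer {api_key.strip()}",
        "Content-Type": "application/json",
    }


# Hypothetical environment variable name; reading the key from the
# environment keeps it out of source control. Falls back to a placeholder.
api_key = os.environ.get("COGNITIVE_ACTIONS_API_KEY") or "YOUR_COGNITIVE_ACTIONS_API_KEY"
headers = build_headers(api_key)
```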

Cognitive Actions Overview

Synthesize Speech from Text

The Synthesize Speech from Text action allows you to convert text into spoken words using advanced voice synthesis models. This action is particularly useful for applications that require voice output, enhancing user experience through audio feedback.

  • Category: Text-to-Speech

Input

The input for this action requires the following fields:

  • text (required): The text content that will be synthesized into speech.
    Example: "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good."
  • speakerReference (required): A URI pointing to the reference audio file, used to adjust the voice characteristics in the synthesis process.
    Example: "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
  • textReference (required): A transcript of the reference audio. It should match the words spoken in the speakerReference file.
    Example: "and keeping eternity before the eyes, though much."
  • version (optional): Specifies which version of the synthesis model to use. The default version is "giga330m".
    Example: "giga330m"

Here’s a JSON payload example for this action:

{
  "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
  "version": "giga330m",
  "textReference": "and keeping eternity before the eyes, though much.",
  "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}
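
The field requirements above can be checked before a request is sent. The following is a rough sketch rather than part of the Voicecraft API: build_payload is a hypothetical helper, and the http(s) check on speakerReference is an assumption based on the example URI.

```python
from urllib.parse import urlparse

DEFAULT_VERSION = "giga330m"  # default model version stated in the field list


def build_payload(text, speaker_reference, text_reference, version=DEFAULT_VERSION):
    """Validate the three required fields and return the input payload."""
    required = {
        "text": text,
        "speakerReference": speaker_reference,
        "textReference": text_reference,
    }
    for name, value in required.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")

    # Assumption: the reference audio must be reachable over http(s).
    parsed = urlparse(speaker_reference)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("speakerReference must be an http(s) URI")

    return {
        "text": text,
        "version": version,
        "textReference": text_reference,
        "speakerReference": speaker_reference,
    }
```

Leaving out the version argument yields a payload that uses the documented default, "giga330m".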

Output

Upon successful execution, this action returns a URI pointing to the synthesized speech audio file, which is a WAV file in the example here.

Example Output:
"https://assets.cognitiveactions.com/invocations/f591131c-5223-4847-a43c-6df6df7dc3f5/ae000d3b-ac36-4039-9363-a8eb1e90af3e.wav"

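Because the action returns only a URI, an application usually needs to fetch the audio before playing or storing it. A minimal sketch using Python's standard library follows. The names local_filename and download_audio are hypothetical helpers, and the sketch assumes the URI can be read without extra authentication, which this article does not confirm.

```python
import posixpath
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_filename(audio_url: str, default: str = "output.wav") -> str:
    """Derive a local file name from the last path segment of the URI."""
    name = posixpath.basename(urlparse(audio_url).path)
    return name or default


def download_audio(audio_url: str, dest_dir: str = ".") -> Path:
    """Stream the audio at audio_url into dest_dir and return the saved path."""
    target = Path(dest_dir) / local_filename(audio_url)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the whole file is never held in memory at once.
    with urllib.request.urlopen(audio_url, timeout=30) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)
    return target
```

For the example output above, local_filename keeps the final path segment, so the file would be saved under the same .wav name that appears in the URI.
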
Conceptual Usage Example (Python)

Here’s how you might structure a Python script to call the Synthesize Speech from Text action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "299879c9-fec5-442e-94dc-cb70b2fa7ab8" # Action ID for Synthesize Speech from Text

# Construct the input payload based on the action's requirements
payload = {
    "text": "With tenure, Suzie'd have all the more leisure for yachting, but her publications are no good.",
    "version": "giga330m",
    "textReference": "and keeping eternity before the eyes, though much.",
    "speakerReference": "https://replicate.delivery/pbxt/MNFXdPaUPOwYCZjZM4azsymbzE2TCV2WJXfGpeV2DrFWaSq8/example_en.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely; adjust to your expected synthesis time
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Raised when the body is not valid JSON
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
  • The action_id corresponds to the Synthesize Speech from Text action.
  • The payload is structured according to the required input schema.

Conclusion

The Voicecraft Cognitive Actions provide powerful tools for integrating text-to-speech capabilities into your applications. By using the Synthesize Speech from Text action, developers can easily convert written content into natural-sounding speech, enhancing user interaction and accessibility.

From here, consider exploring additional use cases, such as creating voiceovers for videos, enhancing assistive technologies, or developing interactive voice response systems.