Unlocking Voice Editing and TTS with VoiceCraft Cognitive Actions

23 Apr 2025

Adding audio processing to an application can improve the user experience and streamline workflows. The VoiceCraft API exposes Cognitive Actions that let developers perform zero-shot speech editing and text-to-speech (TTS). This article walks through the zero-shot speech editing and TTS action, its inputs and outputs, and how to integrate it into your application.

Prerequisites

Before diving into the integration details, ensure you have the following:

  • An API key for the VoiceCraft platform.
  • Basic familiarity with making HTTP requests and handling JSON payloads.

To authenticate your API requests, you'll typically pass your API key in the headers of your requests.
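As a minimal sketch, assuming the key is sent as a Bearer token (the scheme used in the Python example later in this article) and is stored in an environment variable named VOICECRAFT_API_KEY (a hypothetical name chosen for illustration), the headers could be built like this:

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token.

    Assumption: the platform accepts an "Authorization: Bearer <key>" header.
    """
    # VOICECRAFT_API_KEY is a hypothetical environment variable name.
    key = api_key or os.environ.get("VOICECRAFT_API_KEY")
    if not key:
        raise ValueError("No API key provided and VOICECRAFT_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of source control; passing it explicitly is still possible for quick experiments.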

Cognitive Actions Overview

Perform Zero-Shot Speech Editing and TTS

The Perform Zero-Shot Speech Editing and TTS action edits existing speech, or synthesizes new speech in the voice of a reference recording, using only that recording and a target transcript. It is aimed at content such as audiobooks, podcasts, and internet videos, where audio quality and voice fidelity matter.

Input

The action requires a structured payload that includes both required and optional fields. Here’s the schema that defines what you need to provide:

  • originalAudio (string): URL link to the original audio file.
  • goalTranscript (string): Transcript of the target audio file.
  • task (string): Specify the desired task type. Options include speech_editing-substitution, speech_editing-insertion, speech_editing-deletion, or zero-shot text-to-speech (default).
  • seed (integer, optional): Random seed; leave blank to randomize.
  • batchSize (integer, optional): Number of samples to generate (default: 4 for TTS, 1 for editing).
  • cacheUsage (integer, optional): Set to 0 for reduced VRAM usage, 1 for standard usage (default).
  • leftPadding (number, optional): Margin to the left of the editing segment (default: 0.08).
  • rightPadding (number, optional): Margin to the right of the editing segment (default: 0.08).
  • cutOffSeconds (number, optional): Duration of original audio used (default: 3.01 seconds for TTS).
  • originalTranscript (string, optional): Original transcript of the audio, if available.
  • transcriptionModel (string, optional): Choose a WhisperX model for transcript generation.
  • audioSynthesisModel (string, optional): Select an audio synthesis model for execution.
  • probabilityThreshold (number, optional): Probability threshold for nucleus sampling (default: 0.9 for TTS).
  • repetitionLimit (integer, optional): Limit for repetitions in output (default: 3).
  • temperature (number, optional): Sampling temperature; the example below uses 1.
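The schema above can be turned into a small payload builder. The following is a sketch only: build_payload is a hypothetical helper, not part of any VoiceCraft SDK, and it assumes that the fields not marked optional (originalAudio and goalTranscript) are required. It applies the documented batchSize default for each task type and rejects unknown task names.

```python
# Task names as listed in the schema above.
TTS_TASK = "zero-shot text-to-speech"
EDIT_TASKS = {
    "speech_editing-substitution",
    "speech_editing-insertion",
    "speech_editing-deletion",
}


def build_payload(original_audio, goal_transcript, task=TTS_TASK, **options):
    """Assemble an input payload, filling in the documented batchSize default.

    Hypothetical helper for illustration; not part of any VoiceCraft SDK.
    """
    if task != TTS_TASK and task not in EDIT_TASKS:
        raise ValueError(f"Unknown task: {task!r}")
    if not original_audio or not goal_transcript:
        raise ValueError("originalAudio and goalTranscript are required")

    payload = {
        "task": task,
        "originalAudio": original_audio,
        "goalTranscript": goal_transcript,
        # Documented default: 4 samples for TTS, 1 for editing tasks.
        "batchSize": 4 if task == TTS_TASK else 1,
    }
    # Caller-supplied optional fields (seed, cutOffSeconds, ...) override defaults.
    payload.update(options)
    return payload


edit_payload = build_payload(
    "https://example.com/speech.wav",  # placeholder URL
    "A hypothetical edited transcript.",
    task="speech_editing-substitution",
    seed=42,
)
print(edit_payload["batchSize"])  # 1
```

Validating the task name locally catches typos before a request is sent, which is cheaper than waiting for the service to reject the call.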

Example Input:

{
  "task": "zero-shot text-to-speech",
  "batchSize": 4,
  "cacheUsage": 1,
  "leftPadding": 0.08,
  "temperature": 1,
  "rightPadding": 0.08,
  "cutOffSeconds": 3.01,
  "originalAudio": "https://replicate.delivery/pbxt/Kh3PJuzs2xNgaaNOU6fD3jTz0Xx2dE1zpdXpT2k19fzsB8qE/84_121550_000074_000000.wav",
  "goalTranscript": "I cannot believe that the same model can also do text to speech synthesis too!",
  "repetitionLimit": 3,
  "originalTranscript": "",
  "transcriptionModel": "base.en",
  "audioSynthesisModel": "giga330M_TTSEnhanced.pth",
  "probabilityThreshold": 0.8
}

Output

The action typically returns a JSON response containing the URL of the generated audio and a Whisper transcript of the original audio you supplied. Here’s what you can expect:

Example Output:

{
  "generated_audio": "https://assets.cognitiveactions.com/invocations/d0d640ba-0830-4cff-815f-1279fdef24e8/ee5ea7e6-c440-4a15-8c0f-d2fd68078339.wav",
  "whisper_transcript_orig_audio": "But when I had approached so near to them, the common object, which the sense deceives, lost not by distance any of its marks."
}
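A response shaped like the example output can be unpacked with a few lines of Python. This is a sketch under the assumption that the key names match the example exactly; extract_result is a hypothetical helper, and a real response may carry additional or differently named fields.

```python
def extract_result(result):
    """Pull the audio URL and Whisper transcript out of a response dict.

    The key names follow the example output above; a missing audio URL
    is reported as an error rather than silently ignored.
    """
    audio_url = result.get("generated_audio")
    if not audio_url:
        raise KeyError("Response has no 'generated_audio' field")
    # The transcript is treated as optional here (an assumption).
    transcript = result.get("whisper_transcript_orig_audio", "")
    return audio_url, transcript


sample = {
    "generated_audio": "https://example.com/out.wav",  # placeholder URL
    "whisper_transcript_orig_audio": "But when I had approached so near to them",
}
url, transcript = extract_result(sample)
print(url)
```

Checking for the audio URL up front gives a clear error at the point of failure instead of a confusing one later, for example when a download is attempted.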

Conceptual Usage Example (Python)

Below is a conceptual Python snippet that demonstrates how to invoke this action. Remember, the endpoint URL and request structure are hypothetical and should be adapted based on your actual implementation.

import requests
import json

# Replace with your VoiceCraft API key and endpoint
VOICECRAFT_API_KEY = "YOUR_VOICECRAFT_API_KEY"
VOICECRAFT_EXECUTE_URL = "https://api.voicecraft.com/actions/execute" # Hypothetical endpoint

action_id = "2c9747fa-38a2-4c64-9d71-ab2c8e2e07eb" # Action ID for Perform Zero-Shot Speech Editing and TTS

# Construct the input payload based on the action's requirements
payload = {
    "task": "zero-shot text-to-speech",
    "batchSize": 4,
    "cacheUsage": 1,
    "leftPadding": 0.08,
    "temperature": 1,
    "rightPadding": 0.08,
    "cutOffSeconds": 3.01,
    "originalAudio": "https://replicate.delivery/pbxt/Kh3PJuzs2xNgaaNOU6fD3jTz0Xx2dE1zpdXpT2k19fzsB8qE/84_121550_000074_000000.wav",
    "goalTranscript": "I cannot believe that the same model can also do text to speech synthesis too!",
    "repetitionLimit": 3,
    "originalTranscript": "",
    "transcriptionModel": "base.en",
    "audioSynthesisModel": "giga330M_TTSEnhanced.pth",
    "probabilityThreshold": 0.8
}

headers = {
    "Authorization": f"Bearer {VOICECRAFT_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        VOICECRAFT_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=120, # Avoid hanging indefinitely; tune for long-running synthesis
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Base class of every JSON decode error requests can raise
            print(f"Response body: {e.response.text}")

The snippet builds the input payload according to the action's schema, sends it with the API key in the Authorization header, and prints either the result or the error details. Replace the placeholder API key and the hypothetical endpoint with the actual values for your implementation.

Conclusion

The VoiceCraft Cognitive Actions bring speech editing and speech synthesis to your applications through a single action. With zero-shot speech editing and TTS, developers can correct or extend recorded speech and generate new speech in an existing voice without re-recording. Explore the possibilities and consider integrating these actions into your next project!