Efficient Audio Transcription with zeke/zisper Cognitive Actions

25 Apr 2025

Audio content keeps growing, and so does the need for efficient transcription. The zeke/zisper API provides a Cognitive Action called Transcribe Audio with WhisperX, designed for high-speed transcription of audio files. The action supports batch processing, optional memory usage logging for debugging, and a choice between timestamped segments and text-only output, which makes it a practical option for developers who want to add audio transcription to their applications.

Prerequisites

Before you start integrating the Cognitive Action, ensure you have the following:

  • An active API key for the Cognitive Actions platform.
  • A basic understanding of how to make HTTP requests in your preferred programming language.

For authentication, you’ll typically pass your API key in the request headers, allowing you to securely access the Cognitive Actions.

Cognitive Actions Overview

Transcribe Audio with WhisperX

The Transcribe Audio with WhisperX action uses the WhisperX library to transcribe audio files quickly. It processes audio segments in batches and offers options for detailed memory usage logging, word-level timing, and text-only output. This makes it a good fit for applications that need quick and reliable audio transcription.

Input

The input for this action is defined by the following schema:

  • audio (required): The URI of the audio file to be processed. This should be a direct link to the file in a supported format such as .m4a.
  • debug (optional): A boolean flag to enable detailed memory usage logging for debugging purposes. Defaults to false.
  • onlyText (optional): Boolean flag indicating if only transcribed text should be returned. Defaults to false.
  • batchSize (optional): Specifies the number of audio segments processed in parallel during transcription. Defaults to 32.
  • alignOutput (optional): A boolean flag that enables word-level timing for detailed transcription alignment. Currently limited to English transcriptions. Defaults to false.

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/JhaBNshkAf8NShYt0eQVdTqZ46LNcEHBcnacQRPkvOxTqfXb/yolo.m4a",
  "debug": false,
  "onlyText": false,
  "batchSize": 32,
  "alignOutput": false
}
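The schema above can be mirrored in a small helper that fills in the documented defaults. This is a minimal sketch, not part of the zeke/zisper API: the function name build_transcription_payload and its validation rules (a non-empty audio string, a positive integer batch size) are assumptions made for illustration. Only the field names and default values come from the schema.

```python
from typing import Any, Dict


def build_transcription_payload(
    audio: str,
    debug: bool = False,
    only_text: bool = False,
    batch_size: int = 32,
    align_output: bool = False,
) -> Dict[str, Any]:
    """Build an input payload using the documented field names and defaults.

    The checks below are illustrative assumptions, not rules stated by the API.
    """
    if not isinstance(audio, str) or not audio.strip():
        raise ValueError("audio must be a non-empty URI string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batchSize must be a positive integer")

    return {
        "audio": audio,
        "debug": debug,
        "onlyText": only_text,
        "batchSize": batch_size,
        "alignOutput": align_output,
    }
```

Calling build_transcription_payload("https://example.com/sample.m4a") with a placeholder URL yields a dictionary with the same keys as the example input, with every optional field set to its default.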

Output

The action typically returns an array of objects, each containing the following fields:

  • start: The timestamp in seconds when the segment starts.
  • end: The timestamp in seconds when the segment ends.
  • text: The transcribed text for that segment.

Example Output:

[
  {
    "end": 3.282,
    "text": " This is a test of YOLO.",
    "start": 1.038
  }
]
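Once you have the segment array, a common next step is to combine it into a plain transcript or a subtitle file. The following is a minimal sketch that assumes the output has exactly the shape shown above (a list of objects with start, end, and text). The helper names join_segments, format_timestamp, and to_srt are hypothetical and not part of the API.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]


def join_segments(segments: List[Segment]) -> str:
    """Concatenate segment texts in chronological order."""
    ordered = sorted(segments, key=lambda s: s["start"])
    # Segment text may carry leading whitespace, so strip before joining.
    return " ".join(s["text"].strip() for s in ordered)


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as SubRip (SRT) subtitle text."""
    ordered = sorted(segments, key=lambda s: s["start"])
    blocks = []
    for index, seg in enumerate(ordered, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [{"end": 3.282, "text": " This is a test of YOLO.", "start": 1.038}]
    print(join_segments(example))
    print(to_srt(example))
```

Sorting by start keeps the transcript in order even if segments arrive unsorted. If onlyText is enabled, the response shape may differ from the array shown above, so these helpers would need adjusting.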

Conceptual Usage Example (Python)

Here’s a conceptual example of how you might use this action in Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "af414730-0d85-456e-a117-c21016579b82"  # Action ID for Transcribe Audio with WhisperX

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/JhaBNshkAf8NShYt0eQVdTqZ46LNcEHBcnacQRPkvOxTqfXb/yolo.m4a",
    "debug": False,
    "onlyText": False,
    "batchSize": 32,
    "alignOutput": False
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON; covers every JSON decode error type
            print(f"Response body: {e.response.text}")

In this code snippet, replace the placeholder API key and the hypothetical endpoint with your actual values. The input payload matches the schema of the Transcribe Audio with WhisperX action. On success, the script prints the JSON result; on failure, it prints the error along with the response status code and body when a response is available.

Conclusion

The Transcribe Audio with WhisperX action from the zeke/zisper API provides a robust solution for developers looking to incorporate audio transcription into their applications. Its batch processing capabilities, logging options, and flexibility in output make it a valuable tool for any project dealing with audio data. Consider experimenting with the action's parameters to optimize transcription performance based on your specific use case. Happy coding!