Accelerate Audio Transcription with the Distil-Whisper Cognitive Actions

25 Apr 2025

Fast, accurate transcription makes audio content easier to search, caption, and reuse inside an application. The Distil-Whisper Cognitive Actions give developers an API for adding transcription to their applications. The API is built on the Distil-Whisper model, a distilled version of Whisper designed for faster inference with only a small loss in accuracy, and it handles both short and long-form audio.

Prerequisites

Before integrating the Distil-Whisper Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform, which you'll use to authenticate your requests.
  • Basic knowledge of JSON structure and API calls.

Authentication typically involves passing your API key in the headers of your HTTP requests, as in the Bearer token header used in the Python example later in this article. Send requests over HTTPS and keep the key out of source code and version control.
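
One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes a variable named COGNITIVE_ACTIONS_API_KEY (a name chosen here for illustration) and the Bearer scheme used in the Python example later in this article; check both against your platform settings.

```python
import os


def build_auth_headers(env_var: str = "COGNITIVE_ACTIONS_API_KEY") -> dict:
    """Build request headers from an API key stored in an environment variable.

    The variable name and the Bearer scheme are assumptions for illustration.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API.")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing early when the variable is missing tends to be easier to debug than an authentication error returned by the server.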

Cognitive Actions Overview

Enhance Audio Transcription

The Enhance Audio Transcription action uses the Distil-Whisper model to transcribe audio. Distil-Whisper is reported to run up to six times faster than the original Whisper model while staying within 1% word error rate (WER) of it on out-of-distribution evaluations. The action handles both short clips and long-form recordings.

Input

The input for this action requires a JSON object with the following properties:

  • audio: (required) The URI of the audio file to transcribe.
  • modelName: (optional) Specifies the model for processing. Defaults to "distil-whisper/distil-large-v2".
  • maxNewTokens: (optional) An integer specifying the maximum number of new tokens to generate. Defaults to 128.
  • longFormTranscription: (optional) A boolean flag that enables chunked transcription for long-form audio files. Defaults to false.

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/Jo9t4FqOSuJfla5FDD0TOgaLwIZiv0jETZXc5SWlL9ZtGFIg/1.wav"
}
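
The optional properties can be added to the same object. The following is a minimal sketch of a helper that assembles the input from the properties listed above; the function name, the http(s) check, and the positive-integer check are assumptions made for illustration, not documented platform rules. Properties left unset are omitted so that the service applies its own defaults.

```python
from urllib.parse import urlparse


def build_transcription_input(audio, model_name=None, max_new_tokens=None, long_form=None):
    """Assemble the input object for the Enhance Audio Transcription action.

    Hypothetical helper: only the property names come from the article.
    """
    parsed = urlparse(audio)
    # Assumption: the audio URI is an http(s) URL, as in the example input
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"audio must be an http(s) URI, got: {audio!r}")

    payload = {"audio": audio}

    if model_name is not None:
        payload["modelName"] = model_name

    if max_new_tokens is not None:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int):
            raise ValueError("maxNewTokens must be an integer")
        if max_new_tokens <= 0:
            raise ValueError("maxNewTokens must be positive")
        payload["maxNewTokens"] = max_new_tokens

    if long_form is not None:
        payload["longFormTranscription"] = bool(long_form)

    return payload
```

For a long recording, a call such as build_transcription_input(url, max_new_tokens=256, long_form=True) would produce an object with maxNewTokens and longFormTranscription set alongside audio.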

Output

This action returns a string containing the transcription of the audio.

Example Output:

others will be discontinued and need to be replaced by new benchmark rates.
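
How the string is wrapped in the HTTP response is not specified here. As a hedged sketch, the helper below accepts either a bare string or a JSON object holding the string under a key; the key name "output" is hypothetical and should be replaced with whatever the platform actually returns.

```python
def extract_transcript(result, output_key="output"):
    """Return the transcript text from a decoded response.

    Assumption: the response is either the transcript string itself or a
    dict containing it under `output_key` (a hypothetical key name).
    """
    if isinstance(result, str):
        return result.strip()
    if isinstance(result, dict) and isinstance(result.get(output_key), str):
        return result[output_key].strip()
    raise ValueError(
        f"No transcript string found in response of type {type(result).__name__}"
    )
```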

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet demonstrating how to invoke the Enhance Audio Transcription action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "028edefe-f28f-40fc-b530-fc73038cf5d7"  # Action ID for Enhance Audio Transcription

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/Jo9t4FqOSuJfla5FDD0TOgaLwIZiv0jETZXc5SWlL9ZtGFIg/1.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # requests has no default timeout; adjust for long audio files
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON (JSON decode errors subclass ValueError)
            print(f"Response body: {e.response.text}")

In this code snippet, replace "YOUR_COGNITIVE_ACTIONS_API_KEY" with your actual API key. The action_id is set to the ID of the Enhance Audio Transcription action, and the input payload carries the audio file URI required for transcription. The endpoint URL and the request body structure are hypothetical, so confirm both before using the snippet.

Conclusion

The Distil-Whisper Cognitive Actions let developers add fast, accurate audio transcription to an application without hosting a speech model themselves. A single action covers both short snippets and lengthy discussions: pass the audio URI, and enable longFormTranscription for long recordings. Consider integrating the action wherever your application needs text from audio.