Streamline Audio Transcription with hovevideo/stable-whisper Cognitive Actions

23 Apr 2025

Integrating advanced audio transcription capabilities into your applications has never been easier, thanks to the hovevideo/stable-whisper Cognitive Actions. Leveraging OpenAI's Whisper model, these actions enable developers to transcribe audio and video files accurately and efficiently. With the added benefit of stable timestamps through the stable-ts Python package, you can ensure that your transcriptions are not only precise but also time-synced with the original media. This blog post will guide you through the available Cognitive Action for audio transcription and how to implement it in your applications.

Prerequisites

Before diving into the integration of Cognitive Actions, make sure you have:

  • An API key for the Cognitive Actions platform.
  • Media files hosted at URLs that are accessible over the internet.
  • Familiarity with making HTTP requests and handling JSON data.

Authentication typically involves passing your API key in the request headers to ensure secure access to the Cognitive Actions service.
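As a minimal sketch of that convention, a bearer-token header set might be built like this. The exact scheme is an assumption here; consult the Cognitive Actions documentation for the authoritative header names:

```python
def build_headers(api_key: str) -> dict:
    """Construct request headers with bearer-token auth (hypothetical scheme)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_headers("YOUR_COGNITIVE_ACTIONS_API_KEY")
```

These headers are then passed with every request, as shown in the full example later in this post.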

Cognitive Actions Overview

Transcribe Audio with Timestamp Stabilization

Description:
This action transcribes audio files using OpenAI's Whisper model, enhanced with the stable-ts package to produce stabilized, more reliable timestamps.

Category: audio-transcription

Input

The input for this action requires a JSON object with the following schema:

{
  "type": "object",
  "title": "CompositeRequest",
  "required": [
    "url"
  ],
  "properties": {
    "url": {
      "type": "string",
      "title": "Media URL",
      "description": "URL pointing to the audio or video file to be processed. It must be accessible over the internet."
    },
    "outputFormat": {
      "enum": [
        "ass",
        "json"
      ],
      "type": "string",
      "title": "Output Format",
      "default": "json",
      "description": "Specifies the format of the output: 'ass' for SubStation Alpha subtitles or 'json' for transcription in JSON format. Default is 'json'."
    }
  }
}

Example Input:

{
  "url": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav"
}
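A small helper can make building valid payloads less error-prone. The field names and the set of allowed formats below come directly from the CompositeRequest schema above; the helper function itself is a hypothetical convenience, not part of the API:

```python
# Allowed values taken from the schema's outputFormat enum.
VALID_FORMATS = {"ass", "json"}

def build_payload(url: str, output_format: str = "json") -> dict:
    """Build a CompositeRequest payload, validating the output format."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(VALID_FORMATS)}")
    return {"url": url, "outputFormat": output_format}

# Request SubStation Alpha subtitles instead of the default JSON:
payload = build_payload("https://example.com/media.wav", "ass")
```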

Output

The action returns a URL that points to the transcription result in the requested format (JSON by default). For example:

https://assets.cognitiveactions.com/invocations/bffc47b2-b4c7-491e-b29b-fbe16a71fbd8/1fd191aa-7c8f-4fdb-b53d-1883796ace7b.json

This output will contain the transcribed text along with any relevant metadata, such as timestamps for each segment of the transcription.
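Once you download that JSON, you will typically want to walk its segments. The exact layout depends on the service, but Whisper-style transcriptions usually contain a list of segments, each with start and end times in seconds plus the transcribed text; the sketch below assumes that shape:

```python
def format_segments(transcription: dict) -> list[str]:
    """Render each transcription segment as a '[start - end] text' line.

    Assumes a Whisper-style result: {"segments": [{"start", "end", "text"}, ...]}.
    """
    lines = []
    for seg in transcription.get("segments", []):
        lines.append(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text'].strip()}")
    return lines

# Illustrative sample data, not real output from the service:
sample = {
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 5.0, "text": " This is a test."},
    ]
}
for line in format_segments(sample):
    print(line)
```

This kind of formatting is a common starting point for building subtitles or time-synced highlights from the transcription.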

Conceptual Usage Example (Python)

Here is a conceptual Python code snippet demonstrating how to call this action using the Cognitive Actions API:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "8b681ef3-517a-4809-b28f-93e1f3234f3b" # Action ID for Transcribe Audio with Timestamp Stabilization

# Construct the input payload based on the action's requirements
payload = {
    "url": "https://replicate.delivery/pbxt/Js2Fgx9MSOCzdTnzHQLJXj7abLp3JLIG3iqdsYXV24tHIdk8/OSR_uk_000_0050_8k.wav"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload} # Hypothetical structure
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")

In this example, you replace the placeholders for your API key and endpoint, then make a POST request to the hypothetical Cognitive Actions API endpoint. The payload contains the URL of the media file you want to transcribe, along with the action ID; upon successful execution, the transcription result is printed.

Conclusion

The hovevideo/stable-whisper Cognitive Actions provide a powerful solution for audio transcription, allowing developers to integrate high-quality, timestamped transcriptions into their applications effortlessly. The provided action not only simplifies the process of converting audio content into text but also ensures that the output is accurate and reliable. As you explore further, consider how these transcriptions can enhance user experiences, improve accessibility, or streamline workflows in your applications. Happy coding!