Transcribe Audio with Word Timestamps Using hnesk/whisper-wordtimestamps Cognitive Actions

22 Apr 2025
Accurately transcribing spoken audio into text is a vital task for many applications, such as accessibility tools, content creation, and data analysis. The hnesk/whisper-wordtimestamps API uses the OpenAI Whisper model to transcribe audio files while returning precise word-level timestamps. These timestamps make it easier to dissect, search, and interact with audio content.

Prerequisites

Before integrating the Cognitive Actions from the hnesk/whisper-wordtimestamps API, ensure you have the following:

  • An API key for accessing the Cognitive Actions platform.
  • Basic knowledge of making HTTP requests and handling JSON data.
  • Familiarity with Python programming, as we will provide conceptual code examples.

To authenticate your requests, you will typically pass the API key in the headers of your HTTP requests.
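As a minimal sketch, assuming the platform follows the common Bearer-token convention (the exact header format is an assumption to confirm against the platform's documentation):

```python
def build_headers(api_key: str) -> dict:
    """Build request headers for a Bearer-token API (hypothetical convention)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

# Example: headers to attach to every request
headers = build_headers("YOUR_COGNITIVE_ACTIONS_API_KEY")
```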

Cognitive Actions Overview

Transcribe Audio with Whisper Word Timestamps

Description: Use the OpenAI Whisper model to transcribe audio files, providing detailed word-level timestamps for enhanced accuracy and analysis. This operation exposes settings for fine-tuned transcription quality and language detection.

  • Category: audio-transcription

Input

The input for this action requires the following fields:

  • audio (required): URI of the audio file to be processed.
    Example: https://replicate.delivery/pbxt/IfYtYMI5B23lFkUoI7zDtehuLw2NzKCoJpmJQvSVGD5l3gfY/vocals.mp3
  • model (optional): Select the Whisper model variant. Default is 'base'.
    Example: large-v1
  • language (optional): Specify the language spoken in the audio. Use 'None' for automatic detection.
  • patience (optional): Patience value for beam decoding. Default is 1.0, which is equivalent to conventional beam search.
  • temperature (optional): Sampling temperature. Default is 0.
  • initialPrompt (optional): Initial prompt text for the first window.
    Example: Karstadtdetektiv
  • suppressTokens (optional): Comma-separated list of token IDs to suppress during sampling. Default is '-1', which suppresses most special tokens except common punctuation.
  • wordTimestamps (optional): Include word-level timestamps.
    Example: true
  • noSpeechThreshold (optional): If the probability of no speech in a segment exceeds this value, the segment is treated as silence. Default is 0.6.
  • appendPunctuations (optional): Symbols to append to the preceding word if word_timestamps is true.
    Example: "'.。,,!!??::”)]}、
  • prependPunctuations (optional): Symbols to prepend to the following word if word_timestamps is true.
    Example: "‘“¿([{-
  • conditionOnPreviousText (optional): Uses previous model output as a prompt for the next window. Default is true.
  • logProbabilityThreshold (optional): If the average log probability of the decoded tokens falls below this value, the decoding is treated as failed. Default is -1.
  • compressionRatioThreshold (optional): If the gzip compression ratio of the decoded text exceeds this value (suggesting repetitive output), the decoding is treated as failed. Default is 2.4.
  • temperatureIncrementOnFallback (optional): Temperature increment when decoding fails. Default is 0.2.
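The last three settings interact: when a decode fails the compression-ratio or log-probability check, Whisper retries at successively higher temperatures. A sketch of the resulting schedule, assuming the standard Whisper fallback behavior of stepping from the base temperature up to 1.0:

```python
def fallback_temperatures(temperature=0.0, increment=0.2, maximum=1.0):
    """Temperatures tried in order until a decode passes the
    compression-ratio and log-probability checks (standard Whisper
    fallback behavior; the cap of 1.0 is Whisper's convention)."""
    temps = []
    t = temperature
    while t <= maximum + 1e-9:  # small epsilon guards against float drift
        temps.append(round(t, 10))
        t += increment
    return temps

print(fallback_temperatures())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

With the defaults above, a failed greedy decode (temperature 0) is retried at 0.2, 0.4, and so on, trading determinism for robustness on difficult audio.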

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/IfYtYMI5B23lFkUoI7zDtehuLw2NzKCoJpmJQvSVGD5l3gfY/vocals.mp3",
  "model": "large-v1",
  "initialPrompt": "Karstadtdetektiv",
  "suppressTokens": "-1",
  "wordTimestamps": true,
  "noSpeechThreshold": 0.6,
  "appendPunctuations": "\"'.。,,!!??::”)]}、",
  "prependPunctuations": "\"'“¿([{-",
  "conditionOnPreviousText": true,
  "logProbabilityThreshold": -1,
  "compressionRatioThreshold": 2,
  "temperatureIncrementOnFallback": 0.2
}

Output

The action typically returns a JSON object containing the transcription and detailed timestamps for each word. The output structure includes segments of text with corresponding timestamps and probabilities.

Example Output:

{
  "segments": [
    {
      "id": 0,
      "start": 7.02,
      "end": 21.98,
      "text": " Ich bin Karstadtdetektiv, ich bin direktiv von Karstadt.",
      "words": [
        {
          "start": 7.02,
          "end": 7.58,
          "word": " Ich",
          "probability": 0.28249597549438477
        },
        ...
      ]
    }
  ],
  "transcription": " Ich bin Karstadtdetektiv, ich bin direktiv von Karstadt. ...",
  "detected_language": "german"
}
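Given output in this shape, the per-word timestamps can be flattened for downstream use, e.g. to build a simple caption or subtitle list. A sketch over the example structure above (the field names mirror the sample output; verify them against real responses):

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def words_from_result(result: dict) -> list:
    """Flatten (start, end, word) triples from every segment."""
    out = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            out.append((format_timestamp(w["start"]),
                        format_timestamp(w["end"]),
                        w["word"].strip()))
    return out

# Reduced version of the example output above
example = {
    "segments": [{"id": 0, "start": 7.02, "end": 21.98,
                  "words": [{"start": 7.02, "end": 7.58,
                             "word": " Ich", "probability": 0.28}]}]
}
print(words_from_result(example))  # [('00:00:07.020', '00:00:07.580', 'Ich')]
```

Each word also carries a probability, which can be used to flag low-confidence words for review.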

Conceptual Usage Example (Python)

Here’s how you can call the action using Python:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "086b1ea2-ae42-47ab-b8d3-ef9fb790e791" # Action ID for Transcribe Audio

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/IfYtYMI5B23lFkUoI7zDtehuLw2NzKCoJpmJQvSVGD5l3gfY/vocals.mp3",
    "model": "large-v1",
    "initialPrompt": "Karstadtdetektiv",
    "suppressTokens": "-1",
    "wordTimestamps": True,
    "noSpeechThreshold": 0.6,
    "appendPunctuations": "\"'.。,,!!??::”)]}、",
    "prependPunctuations": "\"'“¿([{-",
    "conditionOnPreviousText": True,
    "logProbabilityThreshold": -1,
    "compressionRatioThreshold": 2,
    "temperatureIncrementOnFallback": 0.2
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload} # Hypothetical structure
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # covers json.JSONDecodeError and simplejson's variant
            print(f"Response body: {e.response.text}")

In this snippet, replace COGNITIVE_ACTIONS_API_KEY with your actual API key; the endpoint URL and request structure shown are placeholders to confirm against the platform's documentation. The payload follows the input schema described above, and the response will contain the transcription and per-word timestamps, allowing for in-depth analysis and interaction.

Conclusion

The hnesk/whisper-wordtimestamps Cognitive Action provides a powerful tool for audio transcription, complete with detailed word-level timestamps that can greatly enhance the accuracy of transcribed content. By integrating this action into your applications, you can improve accessibility and create richer user experiences. Consider exploring various use cases, such as building transcription services, enhancing audio search capabilities, or implementing real-time captioning systems.