Accurate Audio Transcription with Timestamped Insights

25 Apr 2025

Accurate transcription of spoken audio has become increasingly valuable. The "Whisper Timestamped" service provides developers with Cognitive Actions that use the Whisper Large V3 model to transcribe audio files, enriching each transcription with word-level timestamps and per-word confidence scores. With support for multiple languages and features such as Voice Activity Detection and speech disfluency detection, Whisper Timestamped simplifies the transcription process while improving output accuracy.

Common use cases for this service include creating subtitles for videos, generating transcripts for podcasts, and analyzing spoken content for research purposes. Whether you are building an application that requires real-time transcription capabilities or simply need to convert audio files into text for better accessibility, Whisper Timestamped can help streamline your workflow and improve efficiency.

Before diving into the implementation details, ensure you have a valid Cognitive Actions API key and a basic understanding of making API calls.

Transcribe Audio with Whisper Large V3

This action allows you to transcribe audio files using the Whisper Large V3 model. It provides detailed word-level timestamps along with confidence scores, which are essential for applications that depend on precise transcriptions. The model is designed to handle multiple languages and includes features that enhance transcription accuracy, such as Voice Activity Detection and detection of speech disfluencies.

Input Requirements

To use this action, you need to provide the following input:

  • audioFileUri: A publicly accessible URI of the audio file you want to transcribe.
  • language: Optionally specify the language code, or set it to "auto" for automatic detection.
  • Other parameters include settings for verbosity, task type (transcribe or translate), temperature, and various thresholds.

Expected Output

The output will include the transcribed text, the language detected, and detailed segments of the audio with timestamps and confidence scores for each word. This structured output makes it easy to integrate the transcription into your applications.
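To give a feel for how the structured output is consumed, here is an illustrative sketch in Python. Note the field names (`segments`, `words`, `start`, `end`, `confidence`) and the sample values are assumptions for illustration, not the documented schema of this action; check the action's reference for the exact shape.

```python
# Illustrative only: the result structure below is an assumption, not the
# official schema of the Transcribe Audio action.
sample_result = {
    "text": "It's eleven o'clock.",
    "language": "en",
    "segments": [
        {
            "start": 0.0,
            "end": 1.6,
            "text": "It's eleven o'clock.",
            "words": [
                {"text": "It's", "start": 0.0, "end": 0.4, "confidence": 0.98},
                {"text": "eleven", "start": 0.4, "end": 0.9, "confidence": 0.95},
                {"text": "o'clock.", "start": 0.9, "end": 1.6, "confidence": 0.97},
            ],
        }
    ],
}

def word_timeline(result):
    """Flatten segments into (word, start, end, confidence) tuples."""
    return [
        (w["text"], w["start"], w["end"], w["confidence"])
        for seg in result.get("segments", [])
        for w in seg.get("words", [])
    ]

for word, start, end, conf in word_timeline(sample_result):
    print(f"{start:5.2f}-{end:5.2f}s  {word}  ({conf:.0%})")
```

Flattening the nested segments into a single word timeline like this is a common first step before building subtitles or running downstream analysis.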

Use Cases for this Specific Action

  • Video Subtitling: Automatically generate subtitles for video content, making it accessible to a wider audience.
  • Podcast Transcription: Convert podcast episodes into text format for easy sharing and reference.
  • Speech Analysis: Analyze spoken content for research, identifying patterns in speech disfluencies or measuring speaker confidence.
Code Example

The following Python script calls this action through a hypothetical Cognitive Actions execution endpoint:

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "975bf7ad-dc61-4c9f-8170-f84ffe553cb8" # Action ID for: Transcribe Audio with Whisper Large V3

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
    "verbose": False,
    "language": "auto",
    "operation": "transcribe",
    "temperature": 0,
    "audioFileUri": "https://github.com/CheyneyComputerScience/CREMA-D/raw/refs/heads/master/AudioMP3/1012_TIE_NEU_XX.mp3?download=",
    "suppressTokens": "-1",
    "logprobThreshold": -1,
    "noSpeechThreshold": 0.6,
    "detectDisfluencies": False,
    "computeWordConfidence": True,
    "voiceActivityDetection": True,
    "conditionOnPreviousText": True,
    "compressionRatioThreshold": 2.4
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: Transcribe Audio with Whisper Large V3 ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")
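As a follow-up to the video-subtitling use case, the segment timestamps in the result map naturally onto subtitle formats. The sketch below converts a list of segments into SRT text; the `start`, `end`, and `text` segment fields are assumed for illustration and should be matched against the action's actual output schema.

```python
def seconds_to_srt_time(t):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(t * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render segments with assumed 'start'/'end'/'text' keys as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{seconds_to_srt_time(seg['start'])} --> {seconds_to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example with hypothetical segment data:
print(segments_to_srt([
    {"start": 0.0, "end": 1.6, "text": "It's eleven o'clock."},
]))
```

Writing the returned string to a `.srt` file alongside the video is all most players need to display the generated subtitles.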

In conclusion, Whisper Timestamped offers a robust solution for audio transcription that combines accuracy with advanced features. By integrating this service into your applications, you can enhance the user experience and make audio content more accessible. Whether you're developing for media, education, or research, leveraging these Cognitive Actions can significantly streamline your workflow. Start exploring the capabilities of Whisper Timestamped today and transform how you handle audio data.