Achieve Precise Transcription Alignment with cureau/force-align-wordstamps

22 Apr 2025

Aligning a transcript with its audio file produces an accurate, time-stamped representation of spoken content. The cureau/force-align-wordstamps API provides a Cognitive Action that aligns an audio file with a transcript you supply and returns word-level timestamps, using the stable_whisper model. It suits developers who already have a transcript and need precise timing for each word.

Prerequisites

Before you start using the Cognitive Actions provided by the cureau/force-align-wordstamps API, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Basic understanding of how to make HTTP requests and handle JSON data.

To authenticate, include your API key in the headers of every request. The Python example later in this post passes it as a Bearer token in the Authorization header.
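As a minimal sketch, the headers can be built without hardcoding the key in source code. The Bearer scheme mirrors the conceptual Python example in this post, and the environment variable name COGNITIVE_ACTIONS_API_KEY is a hypothetical choice; check the platform's documentation for the header format it actually expects.

```python
import os


def build_headers(api_key=None):
    """Return request headers carrying the API key.

    The Bearer scheme and the environment variable name are assumptions,
    not confirmed details of the Cognitive Actions platform.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise ValueError("No API key provided or found in the environment")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Reading the key from the environment keeps it out of version control; passing api_key explicitly is still possible for tests.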

Cognitive Actions Overview

Align Transcript to Audio with Timestamps

Description:
This action aligns an audio file with a provided transcript and generates word-level timestamps. It uses the stable_whisper model to perform the alignment.

Category: audio-processing

Input

The input for this action requires the following fields:

  • audioFile (required): The URI of the input audio file. It must be a valid URL pointing to an audio file (e.g., an MP3).
  • transcript (required): The text representation of the audio content.
  • language (optional): The language code for the audio file and transcript. Defaults to 'en' for English.
  • showProbabilities (optional): A boolean indicating whether probabilities are included in the results. Defaults to false.

Example Input:

{
  "language": "en",
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": false
}
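The input fields can be assembled and checked before sending with a small helper. This is a sketch, not part of the API: the function name build_alignment_input is hypothetical, and restricting audioFile to http(s) URLs is an assumption about what counts as a valid URL here.

```python
from urllib.parse import urlparse


def build_alignment_input(audio_file, transcript, language="en",
                          show_probabilities=False):
    """Assemble the action input, applying the documented defaults."""
    parsed = urlparse(audio_file)
    # Assumption: only http(s) URLs with a host are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"audioFile must be an http(s) URL, got: {audio_file!r}")
    if not transcript or not transcript.strip():
        raise ValueError("transcript must be a non-empty string")
    return {
        "audioFile": audio_file,
        "transcript": transcript.strip(),
        "language": language,
        "showProbabilities": bool(show_probabilities),
    }
```

Calling it with only audio_file and transcript yields a payload with language set to "en" and showProbabilities set to False, matching the defaults listed above.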

Output

The action returns an array of objects, each containing the following properties:

  • word: The individual word from the transcript.
  • start: The start time of the word in seconds.
  • end: The end time of the word in seconds.

Example Output:

[
  {"start": 0.78, "end": 0.84, "word": "On"},
  {"start": 0.84, "end": 0.98, "word": "that"},
  {"start": 0.98, "end": 1.24, "word": "road"},
  ...
  {"start": 17.6, "end": 17.76, "word": "from"},
  {"start": 17.76, "end": 17.84, "word": "her"},
  {"start": 17.84, "end": 17.84, "word": "fingers."}
]
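One common use of this output is subtitle generation. The sketch below groups the returned words into fixed-size captions and renders them in SRT format. The helper names and the seven-words-per-caption default are illustrative choices, and the code assumes the array shape shown in the example output.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group aligned words into captions of up to max_words and render SRT."""
    blocks = []
    for index, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        start = format_srt_time(chunk[0]["start"])  # first word's start
        end = format_srt_time(chunk[-1]["end"])     # last word's end
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")


# Usage with the first three entries of the example output
sample = [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
    {"start": 0.98, "end": 1.24, "word": "road"},
]
print(words_to_srt(sample, max_words=2))
```

Splitting on a fixed word count is the simplest grouping; splitting on punctuation or on pauses between words would give more natural captions.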

Conceptual Usage Example (Python)

Here’s how you can call the Align Transcript to Audio with Timestamps action using Python. This conceptual example demonstrates how to structure your input JSON payload correctly:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute" # Hypothetical endpoint

action_id = "043e9650-7be5-4259-87e2-e9b45d4e6b34" # Action ID for Align Transcript to Audio with Timestamps

# Construct the input payload based on the action's requirements
payload = {
  "language": "en",
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": False
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}, # Hypothetical structure
        timeout=60 # Avoid hanging indefinitely if the server does not respond
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError: # Covers JSON decode errors across requests versions
            print(f"Response body: {e.response.text}")

In this code snippet:

  • The action_id variable holds the ID for the "Align Transcript to Audio with Timestamps" action.
  • The payload variable is constructed according to the required input schema.
  • A POST request is made to a hypothetical endpoint, and the response is printed in a formatted JSON structure.
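Before passing the parsed result to downstream code, it can help to check its shape. The sketch below assumes the response body is the array described in the Output section; a real endpoint may wrap it in an envelope, in which case the array would need to be extracted first. The function name validate_alignment is hypothetical.

```python
def validate_alignment(result):
    """Return a list of problems found in an alignment result (empty if none)."""
    if not isinstance(result, list):
        return ["result is not a list"]
    problems = []
    previous_start = 0.0
    for i, entry in enumerate(result):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
            continue
        missing = {"word", "start", "end"} - set(entry)
        if missing:
            problems.append(f"entry {i} is missing {sorted(missing)}")
            continue
        start, end = entry["start"], entry["end"]
        if not all(isinstance(t, (int, float)) for t in (start, end)):
            problems.append(f"entry {i} has non-numeric timestamps")
            continue
        if end < start:
            problems.append(f"entry {i} ends before it starts")
        if start < previous_start:
            problems.append(f"entry {i} starts before the previous word")
        previous_start = start
    return problems
```

Zero-length words, such as the final "fingers." entry in the example output where start equals end, are accepted rather than flagged.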

Conclusion

The cureau/force-align-wordstamps Cognitive Action gives developers a way to align audio files with existing transcripts and obtain word-level timestamps. Those timestamps support applications that rely on precise audio-to-text alignment. Consider integrating the action into your workflows to streamline your audio processing. Happy coding!