Achieve Precise Audio-Transcript Alignment with Cognitive Actions

25 Apr 2025

Aligning audio files with their transcripts is essential for applications ranging from accessible content to multimedia playback. The "Force Align Wordstamps" service offers Cognitive Actions that automate this process, producing accurate word-level timestamps for both clean and noisy audio. With these actions, developers can streamline workflows, improve content accessibility, and make audio data easier to use.

Prerequisites

To get started with the Force Align Wordstamps service, you'll need a Cognitive Actions API key and a basic understanding of making API calls.

Align Transcript to Audio

The "Align Transcript to Audio" action is designed to precisely sync audio files with their provided transcripts. This action utilizes advanced models to deliver accurate word-level timestamps, making it an invaluable tool for developers working with audio data.

Input Requirements: To use this action, you need to provide:

  • audioFile: A valid URI pointing to the input audio file.
  • transcript: The full text derived from the audio, captured as a string.
  • language (optional): Specifies the language of the audio and transcript (default is English).
  • showProbabilities (optional): A boolean indicating whether to display probabilities for transcript elements (default is false).
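
As a rough illustration of the parameters listed above, the following sketch assembles and checks an input object before it is sent. The helper name build_alignment_input is hypothetical, and restricting audioFile to http(s) URIs is an assumption made for this example rather than a documented rule of the service.

```python
from urllib.parse import urlparse


def build_alignment_input(audio_file, transcript, language="en", show_probabilities=False):
    """Return an input dict for the action, raising ValueError on obvious mistakes.

    Hypothetical helper: the field names come from the parameter list above,
    but the checks themselves are assumptions, not service rules.
    """
    parsed = urlparse(audio_file)
    # Assumption: only absolute http(s) URIs are treated as valid here.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audioFile must be an absolute http(s) URI")
    if not isinstance(transcript, str) or not transcript.strip():
        raise ValueError("transcript must be a non-empty string")
    if not isinstance(show_probabilities, bool):
        raise ValueError("showProbabilities must be a boolean")
    return {
        "audioFile": audio_file,
        "transcript": transcript,
        "language": language,                     # optional, defaults to English
        "showProbabilities": show_probabilities,  # optional, defaults to False
    }
```

Catching a malformed URI or an empty transcript locally avoids spending a request on input the service would likely reject.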

Example Input:

{
  "language": "en",
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": false
}

Expected Output: The output will consist of an array of wordstamps, each containing:

  • word: The specific word in the transcript.
  • start: The start timestamp of the word in seconds.
  • end: The end timestamp of the word in seconds.

Example Output:

{
  "wordstamps": [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
    ...
    {"start": 17.6, "end": 17.76, "word": "from"},
    {"start": 17.76, "end": 17.84, "word": "her"},
    {"start": 17.84, "end": 18.0, "word": "fingers."}
  ]
}
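
One way to put output of this shape to work is to turn it into subtitle cues. The sketch below groups wordstamps into SubRip (SRT) cues. The function names and the fixed words-per-cue grouping are choices made for this example, and it assumes the wordstamps arrive in chronological order, as they do in the example output.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def wordstamps_to_srt(wordstamps, max_words=7):
    """Group consecutive wordstamps into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(wordstamps), max_words):
        group = wordstamps[i:i + max_words]
        start = format_srt_time(group[0]["start"])   # cue starts with its first word
        end = format_srt_time(group[-1]["end"])      # and ends with its last word
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# The first two entries of the example output above.
sample = [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
]
print(wordstamps_to_srt(sample))
```

A production subtitle generator would more likely break cues at punctuation or pauses; the fixed word count simply keeps the sketch short.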

Use Cases for this specific action:

  • Accessibility Enhancements: Improve accessibility in educational and entertainment content by providing synchronized transcripts for audio files.
  • Content Creation: Streamline the video production process by ensuring that subtitles match the spoken content accurately, leading to better viewer engagement.
  • Speech Recognition Training: Use the aligned transcripts as high-quality training data for machine learning models in speech recognition applications.
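
Example Request (Python): The script below is a conceptual sketch; the endpoint URL and the structure of the request body are hypothetical.

import requests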
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "85caa374-0bd9-4acc-af58-f6ae3e5d771f" # Action ID for: Align Transcript to Audio

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "language": "en",
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": False  # Python boolean; serialized as JSON false
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print("--- Calling Cognitive Action: Align Transcript to Audio ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors are ValueError subclasses across requests versions
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")
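
For the accessibility use case, a player often needs to know which word is being spoken at a given playback time. The following is a minimal sketch of such a lookup over a response shaped like the example output. The function name word_at is hypothetical, and the code assumes the wordstamps are sorted by start time and do not overlap.

```python
from bisect import bisect_right


def word_at(wordstamps, t):
    """Return the wordstamp whose [start, end) interval contains time t, or None.

    Assumes wordstamps are sorted by "start" and non-overlapping.
    """
    starts = [w["start"] for w in wordstamps]
    i = bisect_right(starts, t) - 1  # index of the last word starting at or before t
    if i >= 0 and t < wordstamps[i]["end"]:
        return wordstamps[i]
    return None  # t falls before the first word, in a pause, or after the last word


# The first two entries of the example output.
sample_wordstamps = [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
]
current = word_at(sample_wordstamps, 0.9)
print(current["word"] if current else "(silence)")
```

Binary search keeps each lookup logarithmic in the number of words. For repeated lookups on a long recording, the list of start times could be computed once and reused instead of being rebuilt on every call.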

Conclusion

The Force Align Wordstamps service, and the "Align Transcript to Audio" action in particular, makes it easier to synchronize audio and text for applications ranging from accessibility to content creation. With word-level timestamps in hand, developers can offer a smoother user experience and get more value from their audio data. As a next step, consider integrating this action into your audio processing workflows.