Enhance Audio Transcription Accuracy with Wordstamp Alignment

26 Apr 2025

Precise alignment between audio files and their corresponding transcripts is essential for a variety of applications, including content creation, accessibility, and language learning. The "Force Align Wordstamps" service provides developers with Cognitive Actions that accurately synchronize audio content with text transcripts. By leveraging alignment algorithms, this service not only improves the accuracy of transcriptions but also produces word-level timestamps that can significantly improve user experience and content usability.

Common use cases for this service include enhancing accessibility for hearing-impaired individuals by providing accurate subtitles, creating interactive educational tools that allow learners to follow along with audio content, and improving the quality of automated transcription services used in media production. With this powerful tool, developers can streamline their workflows, save time, and enhance the overall quality of their audio-visual content.

Prerequisites

To get started with the Force Align Wordstamps service, you will need a valid Cognitive Actions API key and a basic understanding of making API calls.

Force Align Transcript to Audio

The "Force Align Transcript to Audio" action is designed to align audio files precisely with their source transcripts. By utilizing stable-ts technology, this action significantly improves the accuracy of word-level timestamps, ensuring that users can easily track the spoken content in relation to the text. Additionally, developers have the option to include probability scores, further enhancing the precision of the alignment.

Input Requirements:

  • audioFile: A URI pointing to the input audio file in MP3 format (required).
  • transcript: The textual representation of the audio content (required).
  • showProbabilities: A boolean indicating whether to include per-word probability scores in the output (optional, defaults to false).

Example Input:

{
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": false
}

Expected Output: The output will include an array of wordstamps, each containing the start and end timestamps for each word in the transcript, providing precise alignment between the audio and text.

Example Output:

{
  "wordstamps": [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
    {"start": 0.98, "end": 1.24, "word": "road"},
    ...
    {"start": 17.6, "end": 17.76, "word": "her"},
    {"start": 17.76, "end": 17.84, "word": "fingers."}
  ]
}
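The wordstamps array can be consumed directly by client applications. As a minimal sketch (using a truncated copy of the example output above, with an assumed helper `word_at` that is not part of the service), here is how a player could look up which word is being spoken at a given playback position, e.g. for karaoke-style highlighting:

```python
import bisect

# Truncated copy of the example wordstamps output above.
wordstamps = [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
    {"start": 0.98, "end": 1.24, "word": "road"},
]

# Start times are already sorted, so binary search is safe.
starts = [w["start"] for w in wordstamps]

def word_at(position: float):
    """Return the word whose [start, end) span covers `position`, else None."""
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and wordstamps[i]["start"] <= position < wordstamps[i]["end"]:
        return wordstamps[i]["word"]
    return None

print(word_at(1.0))   # "road"
print(word_at(0.5))   # None (before the first word starts)
```

Because the lookup is a binary search over start times, it stays fast even for long recordings with thousands of words.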

Use cases for this action:

  • Accessibility: Create accurate subtitles for videos and podcasts, ensuring that hearing-impaired users can follow along seamlessly.
  • Educational Tools: Develop interactive learning materials that allow students to read along with audio, enhancing comprehension and retention.
  • Content Production: Improve the quality of transcripts used in media production, making it easier to edit and produce content that meets industry standards.
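To illustrate the accessibility use case, the sketch below converts a wordstamps array into SRT subtitle blocks. The grouping size of seven words per caption is an arbitrary choice, and the helper names are hypothetical, not part of the service:

```python
def fmt(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def wordstamps_to_srt(wordstamps, words_per_caption=7):
    """Group word-level timestamps into numbered SRT caption blocks."""
    blocks = []
    for i in range(0, len(wordstamps), words_per_caption):
        chunk = wordstamps[i:i + words_per_caption]
        text = " ".join(w["word"] for w in chunk)
        blocks.append(
            f"{i // words_per_caption + 1}\n"
            f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(blocks)

# Truncated copy of the example wordstamps output above.
wordstamps = [
    {"start": 0.78, "end": 0.84, "word": "On"},
    {"start": 0.84, "end": 0.98, "word": "that"},
    {"start": 0.98, "end": 1.24, "word": "road"},
]
print(wordstamps_to_srt(wordstamps, words_per_caption=2))
```

The resulting string can be written to an `.srt` file and loaded by most video players and captioning tools.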

```python
import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "b5f53024-6868-4d25-98d2-6ad804d0976a" # Action ID for: Force Align Transcript to Audio

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
  "audioFile": "https://replicate.delivery/pbxt/MJgmXOy2ANed1nazwQPaEyP23w4GKmOy4KoWrz9IC7WzXSiN/audio.mp3",
  "transcript": "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers.",
  "showProbabilities": false
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print(f"--- Calling Cognitive Action: {action_id} ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")


Conclusion

The Force Align Wordstamps service offers developers a powerful solution for aligning audio and text with precision. By improving transcription accuracy and providing detailed word-level timestamps, it opens up a range of possibilities for enhancing accessibility, educational content, and media production. As you integrate these actions into your projects, consider how they can streamline your workflows and elevate the quality of your audio-visual content. Start exploring the potential of this service today and transform how your applications handle audio and transcript alignment!