Achieve Precision with Audio-Text Alignment Using Forced Alignment Actions

23 Apr 2025

The quinten-kamphuis/forced-alignment Cognitive Actions let developers generate precise word-level timings from an audio file and its matching text, which makes it possible to synchronize text and audio in multimedia applications. This blog post walks through the available Cognitive Action and how to use it in your projects.

Prerequisites

Before you start using the Cognitive Actions, ensure you have the following:

  • An API key for the Cognitive Actions platform to authenticate your requests.
  • Access to an endpoint for executing the actions (conceptual structure provided below).
  • Familiarity with making HTTP requests in your preferred programming language.

Authentication typically involves passing your API key in the request headers to authorize access to the Cognitive Actions.
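As a minimal sketch of that header-based authentication, the helper below reads the key from an environment variable instead of hard-coding it. The variable name COGNITIVE_ACTIONS_API_KEY, the build_headers helper, and the Bearer scheme are assumptions made for illustration; check your platform's documentation for the exact header it expects.

```python
import os


def build_headers(api_key=None):
    """Build request headers for the (hypothetical) Cognitive Actions endpoint.

    The Bearer scheme and the environment variable name are assumptions.
    """
    key = api_key or os.environ.get("COGNITIVE_ACTIONS_API_KEY")
    if not key:
        raise RuntimeError(
            "Set COGNITIVE_ACTIONS_API_KEY or pass api_key explicitly"
        )
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Keeping the key out of source code makes it harder to leak through version control.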

Cognitive Actions Overview

Generate Word-Level Audio-Text Alignment

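This action takes an audio file and the text spoken in it and returns a start and end time for each word, supporting all characters. It is built using torchaudio's MMS model. Because forced alignment needs the script as input, it is most valuable when you already have the text and need accurate timings for it, rather than when you need a transcript produced from scratch.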

Input

The input for this action requires two fields:

  • audioUri: A URI pointing to the audio resource. This field is required, and the audio file should be accessible via the provided link.
  • scriptContent: The text content of the script, that is, the words spoken in the audio. This field is also required; the closer the text matches the recording, the more meaningful the timings will be.

Example Input:

{
  "audioUri": "https://replicate.delivery/pbxt/Lv9UNQISI5TvC443clsryKIEOD3LLgHqh8rsNcnKokVSZGV9/audio.mp3",
  "scriptContent": "The whole city burned to the ground in a matter of hours!"
}
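Before sending a request, it can help to check both required fields locally. The sketch below is one way to do that; build_alignment_input is a hypothetical helper, and the rule that audioUri must be an absolute http(s) URL is an assumption based on the requirement that the file be reachable via the link.

```python
from urllib.parse import urlparse


def build_alignment_input(audio_uri: str, script_content: str) -> dict:
    """Return the input payload, raising ValueError if a required field is unusable."""
    parsed = urlparse(audio_uri)
    # Assumption: the service fetches the audio over HTTP(S).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("audioUri must be an absolute http(s) URL")
    if not script_content.strip():
        raise ValueError("scriptContent must not be empty")
    return {"audioUri": audio_uri, "scriptContent": script_content}
```

Failing early on an empty script or a relative path avoids spending a request on input the action cannot use.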

Output

The action returns an array of objects, each containing:

  • word: The individual word from the script.
  • start: The start time of the word in seconds.
  • end: The end time of the word in seconds.

Example Output:

[
  {
    "end": 0.160,
    "word": "The",
    "start": 0.080
  },
  {
    "end": 0.401,
    "word": "whole",
    "start": 0.280
  },
  {
    "end": 0.641,
    "word": "city",
    "start": 0.501
  },
  ...
]
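One common use of this output is generating captions. The following sketch turns a list of word objects in the shape shown above into WebVTT text with one cue per word. The function names are illustrative, and grouping words into longer cues is left out for brevity.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(words) -> str:
    """Build a WebVTT document with one cue per aligned word."""
    lines = ["WEBVTT", ""]
    for w in words:
        start = format_timestamp(w["start"])
        end = format_timestamp(w["end"])
        lines.append(f"{start} --> {end}")
        lines.append(w["word"])
        lines.append("")  # A blank line separates cues
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"end": 0.160, "word": "The", "start": 0.080},
        {"end": 0.401, "word": "whole", "start": 0.280},
        {"end": 0.641, "word": "city", "start": 0.501},
    ]
    print(to_webvtt(sample))
```

Working in whole milliseconds avoids floating-point surprises when the timestamp is split into hours, minutes, and seconds.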

Conceptual Usage Example (Python)

Below is a conceptual Python code snippet demonstrating how to call the Cognitive Actions endpoint to execute this action:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "12518adf-4c35-4a2b-be94-7108dd605073"  # Action ID for Generate Word-Level Audio-Text Alignment

# Construct the input payload based on the action's requirements
payload = {
    "audioUri": "https://replicate.delivery/pbxt/Lv9UNQISI5TvC443clsryKIEOD3LLgHqh8rsNcnKokVSZGV9/audio.mp3",
    "scriptContent": "The whole city burned to the ground in a matter of hours!"
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
        timeout=60,  # Avoid hanging indefinitely if the service does not respond
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # Body is not valid JSON (covers json.JSONDecodeError and requests' own JSON error)
            print(f"Response body: {e.response.text}")

In this code snippet, replace the COGNITIVE_ACTIONS_API_KEY placeholder with your actual API key, and confirm the endpoint URL and request body structure against your platform's documentation, since both are hypothetical here. The payload follows the input requirements of the action, providing the audio URI and the corresponding script content.
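Once the result has been parsed, a player can use it to highlight the word being spoken at a given playback position. The sketch below assumes the result is the array of word objects described in the Output section, sorted by start time with no overlapping words; word_at is a hypothetical helper name.

```python
from bisect import bisect_right


def word_at(words, t):
    """Return the word object whose [start, end] interval contains t, or None.

    Assumes words are sorted by "start" and do not overlap.
    """
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # Last word that starts at or before t
    if i >= 0 and t <= words[i]["end"]:
        return words[i]
    return None  # t falls before the first word or in a gap between words


if __name__ == "__main__":
    sample = [
        {"end": 0.160, "word": "The", "start": 0.080},
        {"end": 0.401, "word": "whole", "start": 0.280},
        {"end": 0.641, "word": "city", "start": 0.501},
    ]
    current = word_at(sample, 0.3)
    print(current["word"] if current else "(silence)")
```

Binary search keeps each lookup fast even for long recordings. For repeated lookups, the list of start times could be computed once and reused.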

Conclusion

The Generate Word-Level Audio-Text Alignment action gives developers a practical tool for achieving precise audio-text synchronization in their applications. By integrating this action, you can improve experiences such as synchronized transcripts, captioning, and multimedia content creation. Explore potential use cases and start integrating today!