Transforming Audio Experiences with the Universal Audio Processing API

2 May 2025

The Universal Audio Processing API gives developers a single interface for a range of audio tasks: Automatic Speech Recognition (ASR), audio reasoning, captioning, emotion sensing, and text-to-speech. It is built on the Kimi-Audio-7B-Instruct model, so one integration can both interpret incoming audio and generate audio responses. That makes it a practical building block for applications in fields such as healthcare, education, entertainment, and accessibility.

Imagine building an application that transcribes audio from meetings or lectures into text, generates captions for videos, or even creates interactive voice responses. The versatility of this API allows for seamless integration of these features, significantly reducing development time and complexity while enhancing user engagement.

Prerequisites

To get started, you'll need a Cognitive Actions API key and some familiarity with making API calls. This will enable you to authenticate and interact with the Universal Audio Processing API effectively.
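
One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes a variable named COGNITIVE_ACTIONS_API_KEY; that name and the helpers load_api_key and auth_headers are illustrative choices, not something the API is known to require.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 var_name: str = "COGNITIVE_ACTIONS_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing.

    The variable name is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def auth_headers(api_key: str) -> dict:
    """Build bearer-token headers for a JSON request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing fast on a missing key tends to give a clearer error than a 401 response from the server later on.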

Execute Universal Audio Processing

The Execute Universal Audio Processing action allows you to harness the full potential of the Kimi-Audio-7B-Instruct model for comprehensive audio operations. This action addresses a range of audio processing needs, making it possible to convert audio to text, generate audio responses, or both, based on your application's requirements.

Input Requirements

The API expects a JSON object with the following properties:

audioUrl: A valid URL pointing to the audio file you wish to process.
prompt: An optional text prompt to guide the model's processing, particularly useful for ASR tasks.
outputFormat: Specifies whether to return audio, text, or both.
Additional tuning parameters: textTopKLimit and audioTopKLimit, the repetition penalties and their window sizes, and the generation temperatures, all of which fine-tune the model's output.

For example, a typical input might look like this:

{
  "audioUrl": "https://example.com/audiofile.wav",
  "outputFormat": "both",
  "textTopKLimit": 5,
  "audioTopKLimit": 10,
  "returnJsonFormat": true,
  "textPenaltyRepetition": 1,
  "audioPenaltyRepetition": 1,
  "textWindowSizeRepetition": 16,
  "audioWindowSizeRepetition": 64,
  "textGenerationTemperature": 0,
  "audioGenerationTemperature": 0.8
}
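
If you build this payload in Python, a small helper can catch obvious mistakes before a request is sent. The following is a minimal sketch: the field names come from the example above, but the helper build_payload, the set of accepted outputFormat strings ("audio", "text", "both"), and the use of the example values as defaults are assumptions for illustration rather than documented behavior.

```python
import json
from urllib.parse import urlparse

# Assumed string values for outputFormat; only "both" appears in the example.
ASSUMED_OUTPUT_FORMATS = {"audio", "text", "both"}

# Defaults copied from the example payload in this article.
EXAMPLE_DEFAULTS = {
    "textTopKLimit": 5,
    "audioTopKLimit": 10,
    "returnJsonFormat": True,
    "textPenaltyRepetition": 1,
    "audioPenaltyRepetition": 1,
    "textWindowSizeRepetition": 16,
    "audioWindowSizeRepetition": 64,
    "textGenerationTemperature": 0,
    "audioGenerationTemperature": 0.8,
}


def build_payload(audio_url, output_format="both", prompt=None, **overrides):
    """Return an input dict for the action, with light client-side checks."""
    parsed = urlparse(audio_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"audioUrl must be an http(s) URL, got {audio_url!r}")
    if output_format not in ASSUMED_OUTPUT_FORMATS:
        raise ValueError(f"unexpected outputFormat: {output_format!r}")
    unknown = set(overrides) - set(EXAMPLE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload = {"audioUrl": audio_url, "outputFormat": output_format}
    if prompt is not None:
        payload["prompt"] = prompt
    payload.update({**EXAMPLE_DEFAULTS, **overrides})
    return payload


if __name__ == "__main__":
    example = build_payload("https://example.com/audiofile.wav",
                            textGenerationTemperature=0.2)
    # json.dumps turns Python's True into JSON's true.
    print(json.dumps(example, indent=2))
```

Note that the JSON literal true becomes True in a Python dict; serializing with json.dumps converts it back.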

Expected Output

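The output is a JSON object that can contain:

json_str: The text result, such as a transcription of the audio, if text output was requested.
media_path: A URL to the audio file generated from the input parameters, if audio output was requested.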

For instance, the output might resemble:

{
  "json_str": "This is the transcribed text from the audio.",
  "media_path": "https://example.com/generatedaudio.wav"
}
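
Since the text and audio parts are each returned only when requested, it can help to read the response defensively. The sketch below assumes the response is a flat JSON object shaped like the example above; ActionResult and parse_result are hypothetical names, and the real response may wrap these fields differently.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class ActionResult:
    text: Optional[str]       # value of "json_str", if present
    audio_url: Optional[str]  # value of "media_path", if present


def parse_result(result: Mapping[str, Any]) -> ActionResult:
    """Extract the text result and generated-audio URL from a response dict.

    Missing or empty fields become None instead of raising KeyError.
    """
    text = result.get("json_str") or None
    audio_url = result.get("media_path") or None
    if text is None and audio_url is None:
        raise ValueError("response contains neither json_str nor media_path")
    return ActionResult(text=text, audio_url=audio_url)
```

Calling parse_result on the sample output above would give an ActionResult with both fields populated; a text-only response would leave audio_url as None.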

Use Cases for this Action

The Execute Universal Audio Processing action is particularly beneficial in scenarios such as:

Creating Accessible Content: Automatically transcribing audio content into text for accessibility purposes, making it easier for users with hearing impairments to engage with media.
Enhancing User Interaction: Implementing voice-activated features in applications, allowing users to interact with your software using natural language, greatly improving user experience.
Multilingual Applications: Supporting global users by transcribing and translating audio content into multiple languages, thereby broadening your audience reach.

The following Python example calls the action with a sample input. As its comments note, the endpoint URL is hypothetical.

import requests
import json

# Replace with your actual Cognitive Actions API key and endpoint
# Ensure your environment securely handles the API key
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
# This endpoint URL is hypothetical and should be documented for users
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"

action_id = "ba9af154-ad28-48ba-9c78-66e464bb7198" # Action ID for: Execute Universal Audio Processing

# Construct the exact input payload based on the action's requirements
# This example uses the predefined example_input for this action:
payload = {
    "audioUrl": "https://replicate.delivery/pbxt/MvvLnD9djueaX3u8fCEm2bptKjp6j1DXIRB1MFncD9ZGxCvF/replicate-prediction-r6ns1e52whrga0cphha96k3axm.wav",
    "outputFormat": "both",
    "textTopKLimit": 5,
    "audioTopKLimit": 10,
    "returnJsonFormat": True,  # Python boolean; serialized as JSON true
    "textPenaltyRepetition": 1,
    "audioPenaltyRepetition": 1,
    "textWindowSizeRepetition": 16,
    "audioWindowSizeRepetition": 64,
    "textGenerationTemperature": 0,
    "audioGenerationTemperature": 0.8
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
    # Add any other required headers for the Cognitive Actions API
}

# Prepare the request body for the hypothetical execution endpoint
request_body = {
    "action_id": action_id,
    "inputs": payload
}

print("--- Calling Cognitive Action: Execute Universal Audio Processing ---")
print(f"Endpoint: {COGNITIVE_ACTIONS_EXECUTE_URL}")
print(f"Action ID: {action_id}")
print("Payload being sent:")
print(json.dumps(request_body, indent=2))
print("------------------------------------------------")

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json=request_body,
        timeout=120  # seconds; avoids hanging forever, adjust for long audio files
    )
    response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully. Result:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # JSON decode errors from requests subclass ValueError
            print(f"Response body (non-JSON): {e.response.text}")
    print("------------------------------------------------")
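
Once a call succeeds, the generated audio still has to be fetched from the returned URL. As a rough sketch using only the standard library, and assuming the result carries a media_path key as in the earlier output example, a helper like the hypothetical save_generated_audio below could stream it to disk. It also assumes the URL is readable without extra authentication, which may not hold for your deployment.

```python
import shutil
import urllib.request
from pathlib import Path


def save_generated_audio(result: dict, dest: str, timeout: float = 60.0) -> Path:
    """Download the file referenced by result["media_path"] to dest."""
    media_url = result.get("media_path")
    if not media_url:
        raise ValueError("result has no media_path; was audio output requested?")

    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream in chunks so large audio files are not held in memory.
    with urllib.request.urlopen(media_url, timeout=timeout) as response, \
            open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest_path
```

With the script above, this could be called as save_generated_audio(result, "output/response.wav") after the success message is printed; the output path is just an example.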

Conclusion

The Universal Audio Processing API is a powerful tool that enables developers to seamlessly integrate advanced audio processing capabilities into their applications. By utilizing the Execute Universal Audio Processing action, you can enhance user experiences, improve accessibility, and create innovative audio-driven features. With its flexible input options and comprehensive output formats, this API is poised to meet a wide range of audio processing needs.

Explore the possibilities of audio transformation today and take your applications to the next level!