Transcribe and Translate Audio with loginethu/whisper-a100 Cognitive Actions

The loginethu/whisper-a100 API provides powerful Cognitive Actions for transcribing and translating audio files. Built on the OpenAI Whisper model optimized for A100 hardware, these actions give developers a straightforward way to convert spoken language into text, with optional translation into English and a choice of output formats. By leveraging these pre-built actions, you can save development time while adding advanced speech-to-text capabilities to your applications.
Prerequisites
Before diving into the integration of Cognitive Actions, there are a few prerequisites you need to be aware of:
- API Key: You will need a valid API key for the Cognitive Actions platform. This key is essential for authenticating your requests.
- Setup: Ensure that you have access to the Cognitive Actions API endpoint.
- Authentication: Conceptually, you will pass the API key in the headers of your requests to authenticate your calls.
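As a minimal sketch of the authentication step, the headers might be assembled as shown below. The Bearer scheme and header names mirror the usage example later in this article; `build_auth_headers` is a hypothetical helper, not part of the API:

```python
# Illustrative sketch: how the API key is conceptually passed in request headers.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, replace with your key

def build_auth_headers(api_key: str) -> dict:
    """Build the headers used to authenticate Cognitive Actions requests."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

headers = build_auth_headers(API_KEY)
```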
Cognitive Actions Overview
Transcribe and Translate Audio Using Whisper on A100
Description: This action transcribes audio files utilizing the OpenAI Whisper model optimized for A100 hardware. It allows options for translating the transcriptions into English and selecting different transcription formats.
Category: Speech-to-Text
Input
The input for this action requires the following fields:
- audio (required): A URI pointing to the audio file to be processed.
- model (optional): Select a Whisper model from available options (default: "large-v2").
- language (optional): The spoken language in the audio, with the option for automatic detection.
- translate (optional): Set to true to translate the text into English (default: false).
- temperature (optional): Temperature parameter for sampling (default: 0).
- suppressTokens (optional): A comma-separated list of token IDs to suppress during sampling (default: "-1").
- noSpeechThreshold (optional): Threshold for considering segments as silence (default: 0.6).
- transcriptionFormat (optional): Format for the transcription output (default: "plain text").
- conditionOnPreviousText (optional): Uses previous output as a prompt for consistency (default: true).
- logProbabilityThreshold (optional): If the average log probability of a segment falls below this value, the decoding is treated as failed (default: -1).
- compressionRatioThreshold (optional): If the gzip compression ratio of the decoded text exceeds this value, the decoding is treated as failed (default: 2.4).
- temperatureIncrementOnFallback (optional): Increment temperature if fallback occurs due to unsatisfied thresholds (default: 0.2).
Example Input:
```json
{
  "audio": "https://replicate.delivery/mgxm/e5159b1b-508a-4be4-b892-e1eb47850bdc/OSR_uk_000_0050_8k.wav",
  "model": "large-v2",
  "language": "af",
  "translate": false,
  "temperature": 0,
  "suppressTokens": "-1",
  "noSpeechThreshold": 0.6,
  "transcriptionFormat": "plain text",
  "conditionOnPreviousText": true,
  "logProbabilityThreshold": -1,
  "compressionRatioThreshold": 2.4,
  "temperatureIncrementOnFallback": 0.2
}
```
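Since most fields carry documented defaults, it can be convenient to collect them in one place so callers only override what they need. The helper below is purely illustrative and not part of the API; `WHISPER_DEFAULTS` and `build_whisper_input` are hypothetical names:

```python
# Documented defaults from the input schema above, gathered into one dict.
WHISPER_DEFAULTS = {
    "model": "large-v2",
    "translate": False,
    "temperature": 0,
    "suppressTokens": "-1",
    "noSpeechThreshold": 0.6,
    "transcriptionFormat": "plain text",
    "conditionOnPreviousText": True,
    "logProbabilityThreshold": -1,
    "compressionRatioThreshold": 2.4,
    "temperatureIncrementOnFallback": 0.2,
}

def build_whisper_input(audio: str, **overrides) -> dict:
    """Merge caller overrides onto the defaults; `audio` is the only required field."""
    if not audio:
        raise ValueError("`audio` is required and must be a URI")
    payload = {"audio": audio, **WHISPER_DEFAULTS}
    payload.update(overrides)
    return payload

# Example: only override the field you care about.
payload = build_whisper_input("https://example.com/sample.wav", translate=True)
```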
Output
The action typically returns a structured output that includes:
- segments: An array of text segments extracted from the audio, each containing:
  - id: Segment identifier
  - text: Transcribed text for the segment
  - start and end: Timestamps for the segment
  - tokens: Token IDs corresponding to the text
- Additional metrics such as avg_logprob, temperature, and no_speech_prob.
Example Output:
```json
{
  "segments": [
    {
      "id": 0,
      "start": 0,
      "end": 6.8,
      "text": " Die litte verhaal die hulle vertel is vals.",
      "tokens": [50364, 3229, 287, 9786, ...],
      "avg_logprob": -0.9236625600083966,
      "temperature": 0,
      "no_speech_prob": 0.12543471157550812,
      "compression_ratio": 1.5411764705882354
    },
    ...
  ],
  "transcription": " Die litte verhaal die hulle vertel is vals. Die deur was gebied, gesloten en gebolt ook.",
  "detected_language": "afrikaans"
}
```
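The segments array lends itself to simple post-processing, such as rendering each segment alongside its timestamps. The `format_segments` helper below is a sketch using the field names from the example output; it is not part of the API:

```python
# Sketch: turn the `segments` array into simple "[start-end] text" lines.
def format_segments(segments: list[dict]) -> list[str]:
    """Render each segment with its start/end timestamps (seconds)."""
    return [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text'].strip()}"
        for seg in segments
    ]

# Trimmed-down segment from the example output above.
example = [
    {"id": 0, "start": 0, "end": 6.8,
     "text": " Die litte verhaal die hulle vertel is vals."},
]
lines = format_segments(example)
# lines[0] == "[0.0-6.8] Die litte verhaal die hulle vertel is vals."
```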
Conceptual Usage Example (Python)
Here’s how you might invoke this action using Python. This snippet shows how to structure the input JSON payload correctly and send a request to the cognitive actions endpoint:
```python
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

# Action ID for Transcribe and Translate Audio Using Whisper on A100
action_id = "389c0d06-5e4d-44de-a9ad-20fe300dbf36"

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/mgxm/e5159b1b-508a-4be4-b892-e1eb47850bdc/OSR_uk_000_0050_8k.wav",
    "model": "large-v2",
    "language": "af",
    "translate": False,
    "temperature": 0,
    "suppressTokens": "-1",
    "noSpeechThreshold": 0.6,
    "transcriptionFormat": "plain text",
    "conditionOnPreviousText": True,
    "logProbabilityThreshold": -1,
    "compressionRatioThreshold": 2.4,
    "temperatureIncrementOnFallback": 0.2,
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json",
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload},  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
```
In this example, replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key. The action ID should match the one provided for the "Transcribe and Translate Audio Using Whisper on A100." The payload structure is derived directly from the action's input schema.
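Once the call succeeds, the fields you will most often want are transcription and detected_language. The helper below is illustrative only and assumes the response is shaped like the example output shown earlier; `summarize_result` is a hypothetical name:

```python
# Illustrative: pull the full transcription and detected language out of a
# result shaped like the example output above.
def summarize_result(result: dict) -> tuple[str, str]:
    """Return (transcription, detected_language), with safe fallbacks."""
    transcription = result.get("transcription", "").strip()
    language = result.get("detected_language", "unknown")
    return transcription, language

# Demo using values from the example output.
demo = {
    "transcription": " Die litte verhaal die hulle vertel is vals.",
    "detected_language": "afrikaans",
}
text, lang = summarize_result(demo)
```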
Conclusion
The loginethu/whisper-a100 Cognitive Action for audio transcription and translation offers an efficient solution for developers looking to integrate advanced speech-to-text capabilities into their applications. With support for multiple languages and flexible output formats, this action can significantly enhance user experience across various domains. As a next step, consider exploring additional use cases such as integrating this functionality into voice recognition applications, automatic subtitling for videos, or accessibility tools. Happy coding!