Transcribe Audio Effortlessly with Soykertje/Whisper's Cognitive Actions

23 Apr 2025

In today's digital landscape, transforming audio content into written text is a crucial capability for various applications, ranging from transcription services to accessibility solutions. The Soykertje/Whisper API offers a powerful Cognitive Action that allows developers to convert speech from audio files into text seamlessly. This action leverages the advanced Whisper model, providing multilingual transcription, language identification, optional translation to English, and detailed word-level timestamps for enhanced accuracy.

Prerequisites

To get started with the Cognitive Actions in the Soykertje/Whisper API, you will need:

  • An API key for the Cognitive Actions platform.
  • Basic knowledge of JSON and API requests.

Authentication typically involves passing your API key in the headers of your requests, ensuring secure access to the service.
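For example, a typical Bearer-token header set might look like the following. This is a sketch of a common convention, not the platform's confirmed scheme; check the Cognitive Actions documentation for the exact header names.

```python
# Hypothetical: assumes standard Bearer-token authentication.
API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"  # placeholder, replace with your key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```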

Cognitive Actions Overview

Convert Speech to Text

Description: This action utilizes the Whisper model to transcribe speech from audio files into text. It supports multilingual transcription and provides options for translation into English, along with word-level timestamps for improved accuracy.

  • Category: Speech-to-text

Input

The input schema for this action requires the following parameters:

  • audio (required): The URI of the audio file to be processed.
  • model (optional): Specifies the Whisper model to use (default is large-v2).
  • patience (optional): A patience value for beam decoding.
  • translate (optional): If set to true, the text is translated to English (default is false).
  • temperature (optional): A parameter for sampling randomness.
  • initialPrompt (optional): A prompt for the first window of processing.
  • spokenLanguage (optional): The language spoken in the audio (set to None for automatic detection).
  • suppressTokens (optional): Token IDs to suppress during sampling.
  • wordTimestamps (optional): Provides timestamps at the word level (default is true).
  • noSpeechThreshold (optional): Probability threshold for silence detection; if the no-speech probability of a segment exceeds this value, the segment is treated as silence.
  • transcriptionFormat (optional): Specifies the output format for transcription (plain text, srt, or vtt).
  • conditionOnPreviousText (optional): Whether to provide previous model output as a prompt.
  • logProbabilityThreshold (optional): If the average log probability of the decoded tokens falls below this value, the decode is treated as failed.
  • compressionRatioThreshold (optional): If the gzip compression ratio of the decoded text exceeds this value, the decode is treated as failed (a sign of repetitive output).
  • temperatureIncrementOnFallback (optional): Increment applied to temperature upon decoding failures.

Example Input:

{
  "audio": "https://replicate.delivery/pbxt/JOMXkhgXe8wbOGqnQ5LwJUulwpbcvd1NZUYKFe1TMPGaXg24/1987%20Micro%20Machines%20Car%20Playset%20Commercial%20%28Featuring%20John%20Moschitta%20the%20Micro%20Machine%20Man%29.mp3",
  "model": "large-v2",
  "translate": false,
  "temperature": 0,
  "suppressTokens": "-1",
  "wordTimestamps": true,
  "noSpeechThreshold": 0.6,
  "transcriptionFormat": "plain text",
  "conditionOnPreviousText": true,
  "logProbabilityThreshold": -1,
  "compressionRatioThreshold": 2.4,
  "temperatureIncrementOnFallback": 0.2
}
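Since most of these parameters have sensible defaults, it can be convenient to collect them in one place and merge in only the values you want to change. The `build_payload` helper below is a hypothetical convenience, not part of the API; the default values mirror the parameter list above.

```python
# Defaults taken from the input schema documented above.
DEFAULTS = {
    "model": "large-v2",
    "translate": False,
    "temperature": 0,
    "suppressTokens": "-1",
    "wordTimestamps": True,
    "noSpeechThreshold": 0.6,
    "transcriptionFormat": "plain text",
    "conditionOnPreviousText": True,
    "logProbabilityThreshold": -1,
    "compressionRatioThreshold": 2.4,
    "temperatureIncrementOnFallback": 0.2,
}

def build_payload(audio_uri, **overrides):
    """Build a Convert Speech to Text input payload from defaults plus overrides."""
    payload = {"audio": audio_uri, **DEFAULTS}
    payload.update(overrides)
    return payload
```

For example, `build_payload("https://example.com/clip.mp3", translate=True)` produces the full payload with only the translation flag changed.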

Output

The expected output from this action includes the full transcription of the audio, text segments with start and end timestamps, per-word probabilities, and the detected language. Here’s an example of the output structure:

Example Output:

{
  "segments": [
    {
      "id": 0,
      "end": 3.14,
      "seek": 0,
      "text": " This is the Micro Machine Man presenting the most midget miniature motorcade of micro machines.",
      "start": 0.18,
      "words": [
        { "end": 0.5, "word": " This", "start": 0.18, "probability": 0.5188 },
        { "end": 0.66, "word": " is", "start": 0.5, "probability": 0.9037 },
        { "end": 0.88, "word": " the", "start": 0.66, "probability": 0.7630 },
        { "end": 0.9, "word": " Micro", "start": 0.88, "probability": 0.7600 },
        { "end": 1, "word": " Machine", "start": 0.9, "probability": 0.4324 },
        { "end": 1.18, "word": " Man", "start": 1, "probability": 0.8935 }
      ],
      "transcription": "This is the Micro Machine Man presenting the most midget miniature motorcade of micro machines."
    }
  ],
  "detected_language": "english",
  "transcription": "This is the Micro Machine Man presenting the..."
}
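Because each segment carries its own start and end times, the output above can be post-processed into other formats client-side. The sketch below converts a list of segments into SRT subtitles, assuming only the `start`, `end`, and `text` fields shown in the example output:

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def segments_to_srt(segments):
    """Render output segments as an SRT subtitle block."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates subtitle entries
    return "\n".join(lines)
```

Note that the API can also return srt or vtt directly via the transcriptionFormat parameter; this helper is only useful when you request plain text plus timestamps and want to build subtitles yourself.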

Conceptual Usage Example (Python)

Here's a conceptual Python code snippet demonstrating how to call the Convert Speech to Text action using the Cognitive Actions API:

import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "43aa2a75-d09c-40ce-9e5f-6a73ed582bf2"  # Action ID for Convert Speech to Text

# Construct the input payload based on the action's requirements
payload = {
    "audio": "https://replicate.delivery/pbxt/JOMXkhgXe8wbOGqnQ5LwJUulwpbcvd1NZUYKFe1TMPGaXg24/1987%20Micro%20Machines%20Car%20Playset%20Commercial%20%28Featuring%20John%20Moschitta%20the%20Micro%20Machine%20Man%29.mp3",
    "model": "large-v2",
    "translate": False,
    "temperature": 0,
    "suppressTokens": "-1",
    "wordTimestamps": True,
    "noSpeechThreshold": 0.6,
    "transcriptionFormat": "plain text",
    "conditionOnPreviousText": True,
    "logProbabilityThreshold": -1,
    "compressionRatioThreshold": 2.4,
    "temperatureIncrementOnFallback": 0.2
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except ValueError:  # body was not valid JSON (requests raises a ValueError subclass)
            print(f"Response body: {e.response.text}")

In this code snippet:

  • Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
  • The input payload is constructed based on the defined schema, and a POST request is sent to the hypothetical execution endpoint.
  • The response is handled to either print the returned JSON result, including the transcription, or to log the error details.

Conclusion

The Convert Speech to Text action in the Soykertje/Whisper API provides a powerful way to transcribe audio files into text while offering advanced features like language identification and translation capabilities. By integrating this action into your applications, you can enhance accessibility and create more interactive experiences for users. For further exploration, consider experimenting with different audio inputs and configurations to see how the model adapts to various speech patterns and languages. Happy coding!