Enhancing Video Analysis: Integrate Active Speaker Detection with TalkNet-ASD

In the realm of video processing, accurately identifying speakers in a video can significantly enhance audio-visual analysis. The TalkNet-ASD model provides a powerful Cognitive Action that enables developers to detect active speakers within video content. This functionality opens up a multitude of applications, from improving accessibility features to enriching user experiences in conferencing tools.
Prerequisites
To start using the Cognitive Actions associated with TalkNet-ASD, ensure that you have:
- An API key for the Cognitive Actions platform.
- Access to a suitable environment for making API calls (e.g., Python with the requests library).
- An understanding of how to construct JSON payloads for API requests.
Authentication typically involves including your API key in the request headers.
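As a minimal sketch, a bearer-token header might look like the following (the exact scheme is an assumption here, mirroring the full example later in this article):

```python
# Hypothetical authentication headers for the Cognitive Actions platform;
# substitute your real API key for the placeholder.
headers = {
    "Authorization": "Bearer YOUR_COGNITIVE_ACTIONS_API_KEY",
    "Content-Type": "application/json",
}
```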
Cognitive Actions Overview
Detect Active Speaker in Video
Description:
This action utilizes the TalkNet-ASD model to accurately identify and detect who is speaking in a video, significantly improving the capabilities for audio-visual analysis.
Category: Video Processing
Input
The input for this action requires a JSON payload structured as follows:
- video (string, required): The URI path to the video file to be processed.
- start (integer, optional): The start time of the video, in seconds (default: 0).
- duration (integer, optional): The duration, in seconds, of the video segment to process (default: -1, which processes the entire video).
- minimumFaceSize (integer, optional): The minimum pixel size required for a face to be detected (default: 1).
- minimumTrackFrames (integer, optional): The minimum number of frames required for each shot to be tracked (default: 10).
- faceDetectionScaleFactor (number, optional): The scale factor for resizing frames during face detection (default: 0.25).
- cropBoundingBoxScale (number, optional): The scale factor for the bounding box when cropping (default: 0.4).
- returnJsonFormat (boolean, optional): Indicates whether to return results in JSON format (default: true).
- returnBoundingBoxAsPercentages (boolean, optional): If true, bounding box coordinates will be returned as percentages of the video's dimensions (default: false).
- numberOfAllowedFailedDetections (integer, optional): The maximum number of failed detections before tracking stops (default: 10).
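Since only video is required, a small helper can fill in the documented defaults and let callers override individual fields. This is a sketch, not part of any SDK; build_asd_payload is a hypothetical name, and the defaults come from the parameter list above.

```python
# Hypothetical helper: builds a payload for "Detect Active Speaker in Video"
# with the documented defaults; only `video` is required.
def build_asd_payload(video_uri, **overrides):
    payload = {
        "video": video_uri,
        "start": 0,
        "duration": -1,  # -1 processes the entire video
        "minimumFaceSize": 1,
        "minimumTrackFrames": 10,
        "faceDetectionScaleFactor": 0.25,
        "cropBoundingBoxScale": 0.4,
        "returnJsonFormat": True,
        "returnBoundingBoxAsPercentages": False,
        "numberOfAllowedFailedDetections": 10,
    }
    payload.update(overrides)  # caller-supplied values win over defaults
    return payload

# Process only the first 30 seconds of a (hypothetical) video URI
payload = build_asd_payload("https://example.com/video.mp4", duration=30)
```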
Example Input:
{
  "start": 0,
  "video": "https://replicate.delivery/pbxt/KUjguhc1e9L8mC40dd8Ub8lnLiQtjOLYvgsuuWNbQmLmfmXX/Untitled.mp4",
  "duration": 0,
  "minimumFaceSize": 1,
  "returnJsonFormat": false,
  "minimumTrackFrames": 10,
  "cropBoundingBoxScale": 0.4,
  "faceDetectionScaleFactor": 0.25,
  "returnBoundingBoxAsPercentages": false,
  "numberOfAllowedFailedDetections": 10
}
Output
The action typically returns a JSON response containing the following:
- json_str: A string potentially containing results in JSON format (can be null).
- media_path: An array of URIs pointing to the processed video segments where speakers have been detected.
Example Output:
{
  "json_str": null,
  "media_path": [
    "https://assets.cognitiveactions.com/invocations/6d2b76c9-7317-497e-ae47-841b022a501a/199627b3-bdd2-4321-ac80-0bc4f646e20f.mp4",
    "https://assets.cognitiveactions.com/invocations/6d2b76c9-7317-497e-ae47-841b022a501a/8182c863-9f29-4f45-a1c3-98add5874ff2.mp4",
    "https://assets.cognitiveactions.com/invocations/6d2b76c9-7317-497e-ae47-841b022a501a/4fbd29d5-184e-4ad4-95a4-9ffc526c4f42.mp4"
  ]
}
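When consuming this response, note that json_str can be null (for example when returnJsonFormat is false), so guard before parsing it. The sketch below assumes a result dictionary shaped like the example output; the segment URI is a placeholder, not a real asset:

```python
import json

# `result` mirrors the documented response shape; the media URI below is
# a hypothetical placeholder for a processed segment.
result = {
    "json_str": None,
    "media_path": [
        "https://assets.cognitiveactions.com/invocations/example/segment-1.mp4",
    ],
}

# json_str may be null, so only parse it when present
detections = json.loads(result["json_str"]) if result["json_str"] else None

# Each media_path entry points at a video segment with a detected speaker
for uri in result.get("media_path", []):
    print("Processed segment:", uri)
```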
Conceptual Usage Example (Python)
Below is a conceptual example of how to call the Detect Active Speaker in Video action using Python:
import requests
import json

# Replace with your Cognitive Actions API key and endpoint
COGNITIVE_ACTIONS_API_KEY = "YOUR_COGNITIVE_ACTIONS_API_KEY"
COGNITIVE_ACTIONS_EXECUTE_URL = "https://api.cognitiveactions.com/actions/execute"  # Hypothetical endpoint

action_id = "e45dd08c-bbd6-4016-a926-5e94b42dc059"  # Action ID for Detect Active Speaker in Video

# Construct the input payload based on the action's requirements
payload = {
    "start": 0,
    "video": "https://replicate.delivery/pbxt/KUjguhc1e9L8mC40dd8Ub8lnLiQtjOLYvgsuuWNbQmLmfmXX/Untitled.mp4",
    "duration": 0,
    "minimumFaceSize": 1,
    "returnJsonFormat": False,
    "minimumTrackFrames": 10,
    "cropBoundingBoxScale": 0.4,
    "faceDetectionScaleFactor": 0.25,
    "returnBoundingBoxAsPercentages": False,
    "numberOfAllowedFailedDetections": 10
}

headers = {
    "Authorization": f"Bearer {COGNITIVE_ACTIONS_API_KEY}",
    "Content-Type": "application/json"
}

try:
    response = requests.post(
        COGNITIVE_ACTIONS_EXECUTE_URL,
        headers=headers,
        json={"action_id": action_id, "inputs": payload}  # Hypothetical structure
    )
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    result = response.json()
    print("Action executed successfully:")
    print(json.dumps(result, indent=2))
except requests.exceptions.RequestException as e:
    print(f"Error executing action {action_id}: {e}")
    if e.response is not None:
        print(f"Response status: {e.response.status_code}")
        try:
            print(f"Response body: {e.response.json()}")
        except json.JSONDecodeError:
            print(f"Response body: {e.response.text}")
In this code snippet:
- Replace YOUR_COGNITIVE_ACTIONS_API_KEY with your actual API key.
- The action_id variable holds the ID for the active speaker detection action.
- The input payload is structured according to the required schema.
- The example demonstrates a POST request to a hypothetical endpoint to execute the action.
Conclusion
Integrating the TalkNet-ASD Cognitive Action for detecting active speakers in video content can greatly enhance your application's audio-visual capabilities. With a straightforward API call structure, developers can easily implement this feature to unlock various use cases, from improving accessibility to enriching conferencing experiences. Explore these capabilities today and elevate your video processing projects!