Video security analysis for privileged access management using generative AI

Security teams in highly regulated industries like financial services often employ Privileged Access Management (PAM) systems to secure, manage, and monitor the use of privileged access across their critical IT infrastructure. Security and compliance regulations require that security teams audit the actions performed by systems administrators using privileged credentials. Keystroke logging (the action of recording the keys struck on a keyboard into a log) and video recording of the server console sessions are features of PAM systems that enable security teams to meet these security and compliance obligations.
Keystroke logging produces a dataset that can be programmatically parsed, making it possible to review the activity in these sessions for anomalies, quickly and at scale. However, the capturing of keystrokes into a log is not always an option. Operating systems like Windows are predominantly interacted with through a graphical user interface, restricting the PAM system to capturing the activity in these privileged access sessions as video recordings of the server console.
Video recordings can’t be easily parsed like log files, requiring security team members to play back the recordings to review the actions performed in them. A typical PAM system of a financial services organization can produce over 100,000 hours of video recordings each month. If only 30% of these video recordings come from Windows Servers, it would require a workforce of 1,000 employees, working around the clock, to review them all. As a result, security teams are constrained to performing random spot-checks, impacting their ability to detect security anomalies by bad actors.
The following graphic is a simple example of Windows Server Console activity that could be captured in a video recording.

AI services have revolutionized the way we process, analyze, and extract insights from video content. These services use advanced machine learning (ML) algorithms and computer vision techniques to perform functions like object detection and tracking, activity recognition, and text and audio recognition. However, to describe what is occurring in the video from what can be visually observed, we can harness the image analysis capabilities of generative AI.
Advancements in multi-modal large language models (MLLMs), like Anthropic’s state-of-the-art Claude 3, offer cutting-edge computer vision techniques, enabling Anthropic’s Claude to interpret visual information and understand the relationships, activities, and broader context depicted in images. Using this capability, security teams can process all the video recordings into transcripts. Security analytics can then be performed against the transcripts, enabling organizations to improve their security posture by increasing their ability to detect security anomalies by bad actors.
In this post, we show you how to use Amazon Bedrock and Anthropic’s Claude 3 to solve this problem. We explain the end-to-end solution workflow, the prompts needed to produce the transcript and perform security analysis, and provide a deployable solution architecture.
Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using the AWS tools without having to manage any infrastructure.
Solution workflow
Our solution requires a two-stage workflow of video transcription and security analysis. The first stage uses Anthropic’s Claude to produce a transcript of the video recordings. The second stage uses Anthropic’s Claude to analyze the transcript for security anomalies.
Stage 1: Video transcription
Many of the MLLMs available at the time of writing, including Anthropic’s Claude, are unable to directly process sequential visual data formats like MPEG and AVI, and of those that can, their performance and accuracy are below what can be achieved when analyzing static images. Because of that, we need to break the video recordings into a sequence of static images for Anthropic’s Claude to analyze.
The following diagram depicts the workflow we will use to perform the video transcription.

The first step in our workflow extracts one still frame image per second from our video recording. Then we engineer the images into a prompt that instructs Anthropic’s Claude 3 Haiku to analyze them and produce a visual transcript. At the time of writing, Anthropic’s Claude on Amazon Bedrock is limited to accepting up to 20 images at one time; therefore, to transcribe videos longer than 20 seconds, we need to submit the images in batches to produce a transcript of each 20-second segment. After all segments have been individually transcribed, we engineer them into another prompt instructing Anthropic’s Claude 3 Sonnet to aggregate the segments into a complete transcript.
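The following is a minimal sketch of this step in Python, assuming OpenCV and boto3 are available; the model ID, prompt text, and 20-image batching shown here are illustrative, and the deployable implementation is in the solution repository referenced later in this post.

import base64
import json

import boto3
import cv2

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and Region are configured


def extract_frames_per_second(video_path):
    """Grab one still frame per second of video and return them as JPEG bytes."""
    capture = cv2.VideoCapture(video_path)
    step = max(int(round(capture.get(cv2.CAP_PROP_FPS))), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buffer.tobytes())
        index += 1
    capture.release()
    return frames


def transcribe_segment(frames, prompt_text, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    """Send up to 20 frames plus the transcription prompt to Claude on Amazon Bedrock."""
    content = [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(frame).decode("utf-8"),
            },
        }
        for frame in frames[:20]  # the service accepts at most 20 images per request
    ]
    content.append({"type": "text", "text": prompt_text})
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": content}],
    }
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]

In the deployed architecture described later in this post, AWS Step Functions orchestrates a call like transcribe_segment for each 20-second batch before the segment transcripts are aggregated.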
Stage 2: Security analysis
The second stage can be performed several times to run different queries against the combined transcript for security analysis.
The following diagram depicts the workflow we will use to perform the security analysis of the aggregated video transcripts.

The type of security analysis performed against the transcripts will vary depending on factors like the data classification or criticality of the server the recording was taken from. The following are some common examples of the security analysis that could be performed:

Compliance with change request runbook – Compare the actions described in the transcript with the steps defined in the runbook of the associated change request. Highlight any actions taken that don’t appear to be part of the runbook.
Sensitive data access and exfiltration risk – Analyze the actions described in the transcript to determine whether any sensitive data may have been accessed, changed, or copied to an external location.
Privilege elevation risk – Analyze the actions described in the transcript to determine whether any attempts were made to elevate privileges or gain unauthorized access to a system.

This workflow provides the mechanical function of processing the video recordings through Anthropic’s Claude into transcripts and performing security analysis. The key to the capability of the solution is the prompts we have engineered to instruct Anthropic’s Claude what to do.
Prompt engineering
Prompt engineering is the process of carefully designing the input prompts or instructions that are given to LLMs and other generative AI systems. These prompts are crucial in determining the quality, relevance, and coherence of the output generated by the AI.
For a comprehensive guide to prompt engineering, refer to Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock.
Video transcript prompt (Stage 1)
The utility of our solution relies on the accuracy of the transcripts we receive from Anthropic’s Claude when it is passed the images to analyze. We must also account for limitations in the data that we ask Anthropic’s Claude to analyze. The image sequences we pass to Anthropic’s Claude will often lack the visual indicators necessary to conclusively determine what actions are being performed. For example, the use of shortcut keys like Ctrl + S to save a document can’t be detected from an image of the console. The click of a button or menu items could also occur in the 1 fps time lapse between the still frame images. These limitations can lead Anthropic’s Claude to make inaccurate assumptions about the action being performed. To counter this, we include instructions in our prompt to not make assumptions and tag where it can’t categorically determine whether an action has been performed or not.
The outputs from generative AI models can never be 100% accurate, but we can engineer a complex prompt that produces a transcript with a level of accuracy sufficient for our security analysis purposes. We provide an example prompt with the solution, detailed in the following sections, that you can adapt and modify as needed. Using the task context, detailed task description and rules, immediate task, and instructions to think step-by-step in our prompt, we influence the accuracy of the image analysis by describing the role and task to be performed by Anthropic’s Claude. With the examples and output formatting elements, we control the consistency of the transcripts we receive as output.
To learn more about creating complex prompts and gain practical experience, refer to the Complex Prompts from Scratch lab in our Prompt Engineering with Anthropic’s Claude 3 workshop.
The following is an example of our task context:

You are a Video Transcriptionist who specializes in watching recordings from Windows
Server Consoles, providing a summary description of what tasks you visually observe
taking place in videos.  You will carefully watch through the video and document the
various tasks, configurations, and processes that you see being performed by the IT
Systems Administrator. Your goal is to create a comprehensive, step-by-step transcript
that captures all the relevant details.

The following is the detailed task description and rules:

Here is a description of how you will function:
– You receive an ordered sequence of still frame images taken from a sample of a video
recording.
– You will analyze each of the still frame images in the video sequence, comparing the
previous image to the current image, and determine a list of actions being performed by
the IT Systems Administrator.
– You will capture detail about the applications being launched, websites accessed,
files accessed or updated.
– Where you identify a Command Line Interface in use by the IT Systems Administrator,
you will capture the commands being executed.
– If there are many small actions such as typing text letter by letter then you can
summarize them as one step.
– If there is a big change between frames and the individual actions have not been
captured then you should describe what you think has happened. Precede that description
with the word ASSUMPTION to clearly mark that you are making an assumption.

The following are examples:

Here is an example.
<example>
1. The Windows Server desktop is displayed.
2. The administrator opens the Start menu.
3. The administrator uses the search bar to search for and launch the Paint application.
4. The Paint application window opens, displaying a blank canvas.
5. The administrator selects the Text tool from the toolbar in Paint.
6. The administrator types the text “Hello” using the keyboard.
7. The administrator types the text “World!” using the keyboard, completing the phrase
“Hello World!”.
8. The administrator adds a smiley face emoticon “:” and “)” to the end of the text.
9. ASSUMPTION: The administrator saves the Paint file.
10. ASSUMPTION: The administrator closes the Paint application.
</example>

The following summarizes the immediate task:

Analyze the actions the administrator performs.

The following are instructions to think step-by-step:

Think step-by-step before you narrate what action the administrator took in
<thinking></thinking> tags.
First, observe the images thoroughly and write down the key UI elements that are
relevant to administrator input, for example text input, mouse clicks, and buttons.
Then identify which UI elements changed from the previous frame to the current frame.
Then think about all the potential administrator actions that resulted in the change.
Finally, write down the most likely action that the user took in
<narration></narration> tags.

Lastly, the following is an example of output formatting:

Detail each of the actions in a numbered list.
Do not provide any preamble, only output the list of actions and start with 1.
Put your response in <narration></narration> tags.

Aggregate transcripts prompt (Stage 1)
To create the aggregated transcript, we pass all of the segment transcripts to Anthropic’s Claude in a single prompt along with instructions on how to combine them and format the output:

Combine the lists of actions in the provided messages.
List all the steps as a numbered list and start with 1.
You must keep the ASSUMPTION: where it is used.
Keep the style of the list of actions.
Do not provide any preamble, and only output the list of actions.

Security analysis prompts (Stage 2)
The prompts we use for the security analysis require the aggregated transcript to be provided to Anthropic’s Claude in the prompt along with a description of the security analysis to be performed.
The following prompt is for compliance with a change request runbook:

You are an IT Security Auditor. You will be given two documents to compare.
The first document is a runbook for an IT Change Management Ticket that describes the
steps an IT Administrator is going to perform.
The second document is a transcript of a video recording taken in the Windows Server
Console that the IT Administrator used to complete the steps described in the runbook.
Your task is to compare the transcript with the runbook and assess whether there are
any anomalies that could be a security concern.

You carefully review the two documents provided – the runbook for an IT Change
Management Ticket and the transcript of the video recording from the Windows Server
Console – to identify any anomalies that could be a security concern.

As the IT Security Auditor, you will provide your assessment as follows:
1. Comparison of the Runbook and Transcript:
– You will closely examine each step in the runbook and compare it to the actions
taken by the IT Administrator in the transcript.
– You will look for any deviations or additional steps that were not outlined in the
runbook, which could indicate unauthorized or potentially malicious activities.
– You will also check if the sequence of actions in the transcript matches the steps
described in the runbook.
2. Identification of Anomalies:
– You will carefully analyze the transcript for any unusual commands, script executions,
or access to sensitive systems or data that were not mentioned in the runbook.
– You will look for any indications of privilege escalation, unauthorized access
attempts, or the use of tools or techniques that could be used for malicious purposes.
– You will also check for any discrepancies between the reported actions in the runbook
and the actual actions taken, as recorded in the transcript.

Here are the two documents.  The runbook for the IT Change Management ticket is provided
in <runbook> tags.  The transcript is provided in <transcript> tags.

The following prompt is for sensitive data access and exfiltration risk:

You are an IT Security Auditor. You will be given a transcript that describes the actions
performed by an IT Administrator on a Windows Server. Your task is to assess whether there
are any actions taken, such as accessing, changing or copying of sensitive data, that could
be a breach of data privacy, data security or a data exfiltration risk.

The transcript is provided in <transcript> tags.

The following prompt is for privilege elevation risk:

You are an IT Security Auditor. You will be given a transcript that describes the actions
performed by an IT Administrator on a Windows Server. Your task is to assess whether there
are any actions taken that could represent an attempt to elevate privileges or gain
unauthorized access to a system.

The transcript is provided in <transcript> tags.
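As a rough sketch of how Stage 2 could submit one of these prompts, the following Python snippet uses the Amazon Bedrock Converse API through boto3; the model ID, function name, and tag layout are assumptions for illustration rather than the solution’s exact code.

import boto3

bedrock = boto3.client("bedrock-runtime")


def analyze_transcript(system_prompt, transcript, runbook=None,
                       model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Run one security analysis prompt against an aggregated transcript."""
    user_text = ""
    if runbook is not None:
        user_text += f"<runbook>\n{runbook}\n</runbook>\n"
    user_text += f"<transcript>\n{transcript}\n</transcript>"

    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]

The same function can serve all three analysis types; only the system prompt and, for runbook compliance, the additional runbook document change.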

Solution overview
The serverless architecture provides a video processing pipeline to run Stage 1 of the workflow, and a simple UI for the Stage 2 security analysis of the aggregated transcripts. This architecture can be used for demonstration purposes and testing with your own video recordings and prompts; however, it is not suitable for production use.
The following diagram illustrates the solution architecture.

In Stage 1, video recordings are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, which sends a notification of the object creation to Amazon EventBridge. An EventBridge rule then triggers the AWS Step Functions workflow to begin processing the video recording into a transcript. The Step Functions workflow generates the still frame images from the video recording and uploads them to another S3 bucket. Then the workflow runs parallel tasks to submit the images, for each 20-second segment, to Amazon Bedrock for transcribing before writing the output to an Amazon DynamoDB table. The segment transcripts are passed to the final task in the workflow, which submits them to Amazon Bedrock, with instructions to combine them into an aggregated transcript, which is written to DynamoDB.
The UI is provided by a simple Streamlit application with access to the DynamoDB and Amazon Bedrock APIs. Through the Streamlit application, users can read the transcripts from DynamoDB and submit them to Amazon Bedrock for security analysis.
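A minimal sketch of such a UI is shown below; the DynamoDB table name, attribute names, and the analyze_transcript helper (from the earlier sketch) are assumptions for illustration, not the solution’s actual code.

import boto3
import streamlit as st

# Hypothetical table and attribute names, for illustration only
table = boto3.resource("dynamodb").Table("pam-session-transcripts")

st.title("PAM session security analysis")
session_id = st.text_input("Session recording ID")
analysis_prompt = st.text_area("Security analysis prompt", "You are an IT Security Auditor. ...")

if st.button("Analyze") and session_id:
    item = table.get_item(Key={"session_id": session_id}).get("Item")
    if not item:
        st.error("No transcript found for that session.")
    else:
        transcript = item["aggregated_transcript"]
        st.subheader("Aggregated transcript")
        st.text(transcript)
        st.subheader("Security analysis")
        # analyze_transcript is the Bedrock helper sketched earlier in this post
        st.write(analyze_transcript(analysis_prompt, transcript))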
Solution implementation
The solution architecture we’ve presented provides a starting point for security teams looking to improve their security posture. For a detailed solution walkthrough and guidance on how to implement this solution, refer to the Video Security Analysis for Privileged Access Management using GenAI GitHub repository. This will guide you through the prerequisite tools, enabling models in Amazon Bedrock, cloning the repository, and using the AWS Cloud Development Kit (AWS CDK) to deploy into your own AWS account.
We welcome your feedback, questions, and contributions as we continue to refine and expand this approach to video-based security analysis.
Conclusion
In this post, we showed you an innovative solution to a challenge faced by security teams in highly regulated industries: the efficient security analysis of vast amounts of video recordings from Privileged Access Management (PAM) systems. We demonstrated how you can use Anthropic’s Claude 3 family of models and Amazon Bedrock to perform the complex task of analyzing video recordings of server console sessions and perform queries to highlight any potential security anomalies.
We also provided a template for how you can analyze sequences of still frame images taken from a video recording, which could be applied to different types of video content. You can use the techniques described in this post to develop your own video transcription solution. By tailoring the prompt engineering to your video content type, you can adapt the solution to your use case. Furthermore, by using model evaluation in Amazon Bedrock, you can improve the accuracy of the results you receive from your prompt.
To learn more, the Prompt Engineering with Anthropic’s Claude 3 workshop is an excellent resource for you to gain hands-on experience in your own AWS account.

About the authors
Ken Haynes is a Senior Solutions Architect in AWS Global Financial Services and has been with AWS since September 2022. Prior to AWS, Ken worked for Santander UK Technology and Deutsche Bank helping them build their cloud foundations on AWS, Azure, and GCP.
Rim Zaafouri is a technologist at heart and a cloud enthusiast. As an AWS Solutions Architect, she guides financial services businesses in their cloud adoption journey and helps them to drive innovation, with a particular focus on serverless technologies and generative AI. Beyond the tech world, Rim is an avid fitness enthusiast and loves exploring new destinations around the world.
Patrick Sard works as a Solutions Architect accompanying financial institutions in EMEA through their cloud transformation journeys. He has helped multiple enterprises harness the power of AI and machine learning on AWS. He’s currently guiding organizations to unlock the transformative potential of Generative AI technologies. When not architecting cloud solutions, you’ll likely find Patrick on a tennis court, applying the same determination to perfect his game as he does to solving complex technical challenges.

How Cato Networks uses Amazon Bedrock to transform free text search into structured GraphQL queries

This is a guest post authored by Asaf Fried, Daniel Pienica, Sergey Volkovich from Cato Networks.
Cato Networks is a leading provider of secure access service edge (SASE), an enterprise networking and security unified cloud-centered service that converges SD-WAN, a cloud network, and security service edge (SSE) functions, including firewall as a service (FWaaS), a secure web gateway, zero trust network access, and more.
On our SASE management console, the central events page provides a comprehensive view of the events occurring on a specific account. With potentially millions of events over a selected time range, the goal is to refine these events using various filters until a manageable number of relevant events are identified for analysis. Users can review different types of events such as security, connectivity, system, and management, each categorized by specific criteria like threat protection, LAN monitoring, and firmware updates. However, the process of adding filters to the search query is manual and can be time-consuming, because it requires in-depth familiarity with the product glossary.
To address this challenge, we recently enabled customers to perform free text searches on the event management page, allowing new users to run queries with minimal product knowledge. This was accomplished by using foundation models (FMs) to transform natural language into structured queries that are compatible with our products’ GraphQL API.
In this post, we demonstrate how we used Amazon Bedrock, a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using AWS tools without having to manage the infrastructure. Amazon Bedrock enabled us to enrich FMs with product-specific knowledge and convert free text inputs from users into structured search queries for the product API that can greatly enhance user experience and efficiency in data management applications.
Solution overview
The Events page includes a filter bar with both event and time range filters. These filters need to be added and updated manually for each query. The following screenshot shows an example of the event filters (1) and time filters (2) as seen on the filter bar (source: Cato knowledge base).

The event filters are a conjunction of statements in the following form:

Key – The field name
Operator – The evaluation operator (for example, is, in, includes, greater than, etc.)
Value – A single value or list of values

For example, the following screenshot shows a filter for action in [ Alert, Block ].

The time filter is a time range following ISO 8601 time intervals standard.
For example, the following screenshot shows a time filter for UTC.2024-10-{01/00:00:00–02/00:00:00}.

Converting free text to a structured query of event and time filters is a complex natural language processing (NLP) task that can be accomplished using FMs. Customizing an FM that is specialized on a specific task is often done using one of the following approaches:

Prompt engineering – Add instructions in the context/input window of the model to help it complete the task successfully.
Retrieval Augmented Generation (RAG) – Retrieve relevant context from a knowledge base, based on the input query. This context is augmented to the original query. This approach is used for reducing the amount of context provided to the model to relevant data only.
Fine-tuning – Train the FM on data relevant to the task. In this case, the relevant context will be embedded into the model weights, instead of being part of the input.

For our specific task, we’ve found prompt engineering sufficient to achieve the results we needed.
Because the event filters on the Events page are specific to our product, we need to provide the FM with the exact instructions for how to generate them, based on free text queries. The main considerations when creating the prompt are:

Include the relevant context – This includes the following:

The available keys, operators, and values the model can use.
Specific instructions. For example, numeric operators can only be used with keys that have numeric values.

Make sure it’s simple to validate – Given the extensive number of instructions and limitations, we can’t trust the model output without checking the results for validity. For example, what if the model generates a filter with a key not supported by our API?

Instead of asking the FM to generate the GraphQL API request directly, we can use the following method (a code sketch follows the list):

Instruct the model to return a response that follows JSON Schema, a well-known IETF validation standard.
Validate the JSON schema on the response.
Translate it to a GraphQL API request.
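The following is a minimal sketch of that three-step flow in Python using the jsonschema package; the GraphQL query string and variable layout are hypothetical placeholders rather than Cato’s actual API.

import json

from jsonschema import Draft202012Validator


def parse_model_response(response_text: str, schema: dict) -> dict:
    """Steps 1 and 2: parse the FM output as JSON and validate it against the schema."""
    query = json.loads(response_text)
    Draft202012Validator(schema).validate(query)
    return query


def to_graphql_request(query: dict) -> dict:
    """Step 3: translate the validated object into a (hypothetical) GraphQL request."""
    return {
        "query": "query Events($filters: [EventFilterInput!], $time: TimeRangeInput) { ... }",
        "variables": {
            "filters": query.get("filters", []),
            "time": query.get("time"),
        },
    }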

Request prompt
Based on the preceding examples, the system prompt will be structured as follows:
# General Instructions

Your task is to convert free text queries to a JSON format that will be used to query security and network events in a SASE management console of Cato Networks. You are only allowed to output text in JSON format. Your output will be validated against the following schema that is compatible with the IETF standard:

# Schema definition
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Query Schema",
    "description": "Query object to be executed in the 'Events' management console page.",
    "type": "object",
    "properties":
    {
        "filters":
        {
            "type": "array",
            "description": "List of filters to apply in the query, based on the free text query provided.",
            "items":
            {
                "oneOf":
                [
                    {
                        "$ref": "#/$defs/Action"
                    },
                    .
                    .
                    .
                ]
            }
        },
        "time":
        {
            "description": "Start datetime and end datetime to be used in the query.",
            "type": "object",
            "required":
            [
                "start",
                "end"
            ],
            "properties":
            {
                "start":
                {
                    "description": "start datetime",
                    "type": "string",
                    "format": "date-time"
                },
                "end":
                {
                    "description": "end datetime",
                    "type": "string",
                    "format": "date-time"
                }
            }
        },
        "$defs":
        {
            "Operator":
            {
                "description": "The operator used in the filter.",
                "type": "string",
                "enum":
                [
                    "is",
                    "in",
                    "not_in",
                    .
                    .
                    .
                ]
            },
            "Action":
            {
                "required":
                [
                    "id",
                    "operator",
                    "values"
                ],
                "description": "The action taken in the event.",
                "properties":
                {
                    "id":
                    {
                        "const": "action"
                    },
                    "operator":
                    {
                        "$ref": "#/$defs/Operator"
                    },
                    "values":
                    {
                        "type": "array",
                        "minItems": 1,
                        "items":
                        {
                            "type": "string",
                            "enum":
                            [
                                "Block",
                                "Allow",
                                "Monitor",
                                "Alert",
                                "Prompt"
                            ]
                        }
                    }
                }
            },
            .
            .
            .
        }
    }
}

Each user query (appended to the system prompt) will be structured as follows:
# Free text query
Query: {free_text_query}

# Add current timestamp for context (used for time filters) 
Context: If you need a reference to the current datetime, it is {datetime}, and the current day of the week is {day_of_week}
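A minimal sketch of assembling this user message at request time, assuming Python; the function name and formatting choices are illustrative.

from datetime import datetime, timezone


def build_user_message(free_text_query: str) -> str:
    now = datetime.now(timezone.utc)
    return (
        "# Free text query\n"
        f"Query: {free_text_query}\n\n"
        "# Add current timestamp for context (used for time filters)\n"
        f"Context: If you need a reference to the current datetime, it is {now.isoformat()}, "
        f"and the current day of the week is {now.strftime('%A')}"
    )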

The same JSON schema included in the prompt can also be used to validate the model’s response. This step is crucial, because model behavior is inherently non-deterministic, and responses that don’t comply with our API will break the product functionality.
In addition to validating alignment, the JSON schema can also point out the exact schema violation. This allows us to create a policy based on different failure types (see the sketch after the list). For example:

If there are missing fields marked as required, output a translation failure to the user
If the value given for an event filter doesn’t comply with the format, remove the filter and create an API request from other values, and output a translation warning to the user
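The following sketch shows one way such a policy could be implemented with the jsonschema package; the outcome labels and the decision to drop offending filters are assumptions for illustration.

from jsonschema import Draft202012Validator


def apply_failure_policy(query: dict, schema: dict):
    """Classify schema violations and decide whether to fail, warn, or pass the query through."""
    errors = list(Draft202012Validator(schema).iter_errors(query))
    if not errors:
        return query, "ok"

    # Missing required fields: surface a translation failure to the user
    if any(error.validator == "required" for error in errors):
        return None, "translation_failure"

    # Value or format violations inside a filter: drop that filter and warn the user
    bad_filter_indexes = {
        error.absolute_path[1]
        for error in errors
        if len(error.absolute_path) >= 2 and error.absolute_path[0] == "filters"
    }
    query["filters"] = [
        f for i, f in enumerate(query.get("filters", [])) if i not in bad_filter_indexes
    ]
    return query, "translation_warning"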

After the FM successfully translates the free text into structured output, converting it into an API request—such as GraphQL—is a straightforward and deterministic process.
To validate this approach, we’ve created a benchmark with hundreds of text queries and their corresponding expected JSON outputs. For example, let’s consider the following text query:
Security events with high risk level from IPS and Anti Malware engines
For this query, we expect the following response from the model, based on the JSON schema provided:
{
    "filters":
    [
        {
            "id": "risk_level",
            "operator": "is",
            "values":
            [
                "High"
            ]
        },
        {
            "id": "event_type",
            "operator": "is",
            "values":
            [
                "Security"
            ]
        },
        {
            "id": "event_subtype",
            "operator": "in",
            "values":
            [
                "IPS",
                "Anti Malware"
            ]
        }
    ]
}

For each response of the FM, we define three different outcomes:

Success:

Valid JSON
Valid by schema
Full match of filters

Partial:

Valid JSON
Valid by schema
Partial match of filters

Error:

Invalid JSON or invalid by schema

Because translation failures lead to a poor user experience, releasing the feature was contingent on achieving an error rate below 0.05, and the selected FM was the one with the highest success rate (ratio of responses with full match of filters) passing this criterion.
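As an illustration, a benchmark case could be scored against these outcome definitions roughly as follows; the filter-matching logic is a simplification and assumes the same schema validation shown earlier.

import json

from jsonschema import Draft202012Validator


def score_response(response_text: str, expected: dict, schema: dict) -> str:
    """Label one benchmark case as 'success', 'partial', or 'error'."""
    try:
        query = json.loads(response_text)
        Draft202012Validator(schema).validate(query)
    except Exception:
        return "error"  # invalid JSON or invalid by schema

    def normalize(filters):
        return {json.dumps(f, sort_keys=True) for f in filters}

    if normalize(query.get("filters", [])) == normalize(expected.get("filters", [])):
        return "success"  # full match of filters
    return "partial"      # valid output, but only a partial match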
Working with Amazon Bedrock
Amazon Bedrock is a fully managed service that simplifies access to a wide range of state-of-the-art FMs through a single, serverless API. It offers a production-ready service capable of efficiently handling large-scale requests, making it ideal for enterprise-level deployments.
Amazon Bedrock enabled us to efficiently transition between different models, making it simple to benchmark and optimize for accuracy, latency, and cost, without the complexity of managing the underlying infrastructure. Additionally, some model providers available through Amazon Bedrock, such as Cohere and Anthropic, offer models with native understanding of JSON schemas and structured data, further enhancing their applicability to our specific task.
Using our benchmark, we evaluated several FMs on Amazon Bedrock, taking into account accuracy, latency, and cost. Based on the results, we selected anthropic.claude-3-5-sonnet-20241022-v2:0, which met the error rate criterion and achieved the highest success rate while maintaining reasonable costs and latency. Following this, we proceeded to develop the complete solution, which includes the following components:

Management console – Cato’s management application that the user interacts with to view their account’s network and security events.
GraphQL server – A backend service that provides a GraphQL API for accessing data in a Cato account.
Amazon Bedrock – The cloud service that handles hosting and serving requests to the FM.
Natural language search (NLS) service – An Amazon Elastic Kubernetes Service (Amazon EKS) hosted service to bridge between Cato’s management console and Amazon Bedrock. This service is responsible for creating the complete prompt for the FM and validating the response using the JSON schema.

The following diagram illustrates the workflow from the user’s manual query to the extraction of relevant events.

With the new capability, users can also use free text query mode, which is processed as shown in the following diagram.

The following screenshot of the Events page displays free text query mode in action.

Business impact
The recent feature update has received positive customer feedback. Users, especially those unfamiliar with Cato, have found the new search capability more intuitive, making it straightforward to navigate and engage with the system. Additionally, the inclusion of multi-language input, natively supported by the FM, has made the Events page more accessible for non-native English speakers to use, helping them interact and find insights in their own language.
One of the standout impacts is the significant reduction in query time—cut down from minutes of manual filtering to near-instant results. Account admins using the new feature have reported near-zero time to value, experiencing immediate benefits with minimal learning curve.
Conclusion
Accurately converting free text inputs into structured data is crucial for applications that involve data management and user interaction. In this post, we introduced a real business use case from Cato Networks that significantly improved user experience.
By using Amazon Bedrock, we gained access to state-of-the-art generative language models with built-in support for JSON schemas and structured data. This allowed us to optimize for cost, latency, and accuracy without the complexity of managing the underlying infrastructure.
Although a prompt engineering solution met our needs, users handling complex JSON schemas might want to explore alternative approaches to reduce costs. Including the entire schema in the prompt can lead to a significantly high token count for a single query. In such cases, consider using Amazon Bedrock to fine-tune a model, to embed product knowledge more efficiently.

About the Authors
Asaf Fried leads the Data Science team in Cato Research Labs at Cato Networks. Member of Cato Ctrl. Asaf has more than six years of both academic and industry experience in applying state-of-the-art and novel machine learning methods to the domain of networking and cybersecurity. His main research interests include asset discovery, risk assessment, and network-based attacks in enterprise environments.
Daniel Pienica is a Data Scientist at Cato Networks with a strong passion for large language models (LLMs) and machine learning (ML). With six years of experience in ML and cybersecurity, he brings a wealth of knowledge to his work. Holding an MSc in Applied Statistics, Daniel applies his analytical skills to solve complex data problems. His enthusiasm for LLMs drives him to find innovative solutions in cybersecurity. Daniel’s dedication to his field is evident in his continuous exploration of new technologies and techniques.
Sergey Volkovich is an experienced Data Scientist at Cato Networks, where he develops AI-based solutions in cybersecurity & computer networks. He completed an M.Sc. in physics at Bar-Ilan University, where he published a paper on theoretical quantum optics. Before joining Cato, he held multiple positions across diverse deep learning projects, ranging from publishing a paper on discovering new particles at the Weizmann Institute to advancing computer networks and algorithmic trading. Presently, his main area of focus is state-of-the-art natural language processing.
Omer Haim is a Senior Solutions Architect at Amazon Web Services, with over 6 years of experience dedicated to solving complex customer challenges through innovative machine learning and AI solutions. He brings deep expertise in generative AI and container technologies, and is passionate about working backwards from customer needs to deliver scalable, efficient solutions that drive business value and technological transformation.

Snowflake AI Research Open-Sources SwiftKV: A Novel AI Approach that Reduces Inference Costs of Meta Llama LLMs up to 75% on Cortex AI

Large Language Models (LLMs) have become pivotal in artificial intelligence, powering a variety of applications from chatbots to content generation tools. However, their deployment at scale presents notable challenges. High computational costs, latency, and energy consumption often limit their wider use. Organizations face the difficulty of balancing high throughput with reasonable operating expenses. Additionally, as models grow larger, the need for more efficient solutions becomes increasingly urgent. Addressing these issues is essential to making LLMs more practical and accessible.

The Snowflake AI Research team introduces SwiftKV, a solution designed to enhance LLM inference throughput while reducing associated costs. SwiftKV uses key-value caching techniques to reuse intermediate computations during inference. By eliminating redundant calculations, it streamlines the inference process and makes LLM deployments more efficient.

SwiftKV’s design targets the computational intensity of LLMs. Conventional inference pipelines often recompute identical operations for multiple requests, resulting in inefficiencies. SwiftKV introduces a caching layer that identifies and stores reusable computational results. This approach accelerates inference and reduces resource requirements, making it a practical choice for organizations aiming to optimize their AI operations.

Technical Details and Key Benefits of SwiftKV

SwiftKV incorporates a key-value memory system into the LLM inference architecture. Its operation can be summarized as follows (a conceptual sketch follows the list):

Key-Value Caching: During inference, SwiftKV captures intermediate activations (keys) and their corresponding results (values). For similar queries, it retrieves the precomputed values rather than recalculating them.

Efficient Storage Management: The caching mechanism employs strategies such as least recently used (LRU) eviction to manage memory effectively, ensuring that the cache remains useful without excessive resource consumption.

Seamless Integration: SwiftKV is compatible with existing LLM frameworks, such as Hugging Face’s Transformers and Meta’s LLaMA, enabling easy adoption without significant changes to existing pipelines.
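To make the described caching pattern concrete, the following is a purely conceptual Python sketch of reuse with LRU eviction keyed by the request; it is an illustration of the idea above, not SwiftKV's actual implementation.

from collections import OrderedDict


class ReuseCacheWithLRU:
    """Toy reuse cache with least-recently-used eviction, keyed by the request."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    def get_or_compute(self, key, compute):
        if key in self._store:
            self._store.move_to_end(key)        # mark as recently used
            return self._store[key]             # reuse the stored result, skip recomputation
        value = compute()                        # fall back to the expensive computation
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict the least recently used entry
        return value


# Example: identical requests reuse the cached result instead of recomputing it
cache = ReuseCacheWithLRU(capacity=2)
result = cache.get_or_compute("summarize: Q3 report", lambda: "expensive LLM call result")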

The benefits of SwiftKV include:

Cost Reduction: By avoiding redundant computations, SwiftKV significantly cuts inference costs. Snowflake AI Research reports up to a 75% reduction in costs in some scenarios.

Enhanced Throughput: The caching mechanism reduces inference time, improving response speed.

Energy Savings: Lower computational demands translate into reduced energy consumption, supporting sustainable AI practices.

Scalability: SwiftKV is well-suited for large-scale deployments, meeting the needs of enterprises expanding their AI capabilities.

https://www.snowflake.com/en/blog/up-to-75-lower-inference-cost-llama-meta-llm/

Results

Snowflake AI Research’s evaluations of SwiftKV provide valuable insights into its effectiveness. For example, integrating SwiftKV with Meta’s LLaMA models led to up to a 75% reduction in inference costs without any compromise in accuracy or performance. These outcomes highlight the efficiency gains possible with this approach.

Additionally, tests demonstrate significant reductions in inference latency, even for larger models. The caching system ensures that complex queries benefit from faster processing times. This combination of cost efficiency and performance optimization makes SwiftKV a compelling choice for organizations aiming to scale AI solutions affordably.

The open-sourcing of SwiftKV encourages collaboration within the AI community. By sharing this technology, Snowflake AI Research invites developers, researchers, and enterprises to explore and enhance its capabilities, fostering innovation in LLM efficiency.


Conclusion: A Step Forward in LLM Efficiency

SwiftKV offers a thoughtful solution to the challenges of deploying LLMs at scale. By tackling high computational costs and latency, it helps make AI applications more practical and accessible. The incorporation of key-value caching into inference pipelines showcases how targeted optimizations can drive significant improvements.

As the field of AI progresses, tools like SwiftKV will continue to shape the development of efficient and sustainable technologies. Its open-source nature ensures that the broader community can contribute to its growth and application. By enabling more cost-effective and scalable use of LLMs, SwiftKV underscores the importance of innovation in making AI truly transformative for businesses and developers alike.

Check out the Details and GitHub Page. All credit for this research goes to the researchers of this project.


Create Portrait Mode Effect with Segment Anything Model 2 (SAM2)

Have you ever admired how smartphone cameras isolate the main subject from the background, adding a subtle blur to the background based on depth? This “portrait mode” effect gives photographs a professional look by simulating shallow depth-of-field similar to DSLR cameras. In this tutorial, we’ll recreate this effect programmatically using open-source computer vision models, like SAM2 from Meta and MiDaS from Intel ISL.

Tools and Frameworks

To build our pipeline, we’ll use:

Segment Anything Model (SAM2): To segment objects of interest and separate the foreground from the background.

Depth Estimation Model: To compute a depth map, enabling depth-based blurring.

Gaussian Blur: To blur the background with intensity varying based on depth.

Step 1: Setting Up the Environment

To get started, install the following dependencies:

pip install matplotlib samv2 pytest opencv-python timm pillow

Step 2: Loading a Target Image

Choose a picture to apply this effect and load it into Python using the Pillow library.

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

image_path = "<path to your image>.jpg"
img = Image.open(image_path)
img_array = np.array(img)

# Display the image
plt.imshow(img)
plt.axis("off")
plt.show()

Step 3: Initialize the SAM2

To initialize the model, download the pretrained checkpoint. SAM2 offers four variants based on performance and inference speed: tiny, small, base_plus, and large. In this tutorial, we’ll use tiny for faster inference.

Download the model checkpoint from: https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_<model_type>.pt

Replace <model_type> with your desired model type.

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.utils.misc import variant_to_config_mapping
from sam2.utils.visualization import show_masks

model = build_sam2(
    variant_to_config_mapping["tiny"],
    "sam2_hiera_tiny.pt",
)
image_predictor = SAM2ImagePredictor(model)

Step 4: Feed Image into SAM and Select the Subject

Set the image in SAM and provide points that lie on the subject you want to isolate. SAM predicts a binary mask of the subject and background.

image_predictor.set_image(img_array)
input_point = np.array([[2500, 1200], [2500, 1500], [2500, 2000]])
input_label = np.array([1, 1, 1])

masks, scores, logits = image_predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    box=None,
    multimask_output=True,
)
output_mask = show_masks(img_array, masks, scores)
sorted_ind = np.argsort(scores)[::-1]

Step 5: Initialize the Depth Estimation Model

For depth estimation, we use MiDaS by Intel ISL. Similar to SAM, you can choose different variants based on accuracy and speed. Note: The predicted depth map is reversed, meaning larger values correspond to closer objects. We’ll invert it in the next step so that the blur logic is more intuitive.

import torch
import torchvision.transforms as transforms

model_type = "DPT_Large"  # MiDaS v3 - Large (highest accuracy)

# Load MiDaS model
model = torch.hub.load("intel-isl/MiDaS", model_type)
model.eval()

# Load and preprocess image
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
input_batch = transform(img_array)

# Perform depth estimation
with torch.no_grad():
    prediction = model(input_batch)
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img_array.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

prediction = prediction.cpu().numpy()

# Visualize the depth map
plt.imshow(prediction, cmap="plasma")
plt.colorbar(label="Relative Depth")
plt.title("Depth Map Visualization")
plt.show()

Step 6: Apply Depth-Based Gaussian Blur

Here we optimize the depth-based blurring using an iterative Gaussian blur approach. Instead of applying a single large kernel, we apply a smaller kernel multiple times for pixels with higher depth values.

import cv2

def apply_depth_based_blur_iterative(image, depth_map, base_kernel_size=7, max_repeats=10):
    if base_kernel_size % 2 == 0:
        base_kernel_size += 1

    # Invert depth map
    depth_map = np.max(depth_map) - depth_map

    # Normalize depth to range [0, max_repeats]
    depth_normalized = cv2.normalize(depth_map, None, 0, max_repeats, cv2.NORM_MINMAX).astype(np.uint8)

    blurred_image = image.copy()

    for repeat in range(1, max_repeats + 1):
        mask = (depth_normalized == repeat)
        if np.any(mask):
            blurred_temp = cv2.GaussianBlur(blurred_image, (base_kernel_size, base_kernel_size), 0)
            for c in range(image.shape[2]):
                blurred_image[..., c][mask] = blurred_temp[..., c][mask]

    return blurred_image

blurred_image = apply_depth_based_blur_iterative(img_array, prediction, base_kernel_size=35, max_repeats=20)

# Visualize the result
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title("Original Image")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(blurred_image)
plt.title("Depth-based Blurred Image")
plt.axis("off")
plt.show()

Step 7: Combine Foreground and Background

Finally, use the SAM mask to extract the sharp foreground and combine it with the blurred background.

def combine_foreground_background(foreground, background, mask):
    if mask.ndim == 2:
        mask = np.expand_dims(mask, axis=-1)
    return np.where(mask, foreground, background)

mask = masks[sorted_ind[0]].astype(np.uint8)
mask = cv2.resize(mask, (img_array.shape[1], img_array.shape[0]))
foreground = img_array
background = blurred_image

combined_image = combine_foreground_background(foreground, background, mask)

plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title("Original Image")
plt.axis("off")

plt.subplot(1, 2, 2)
plt.imshow(combined_image)
plt.title("Final Portrait Mode Effect")
plt.axis("off")
plt.show()

Conclusion

With just a few tools, we’ve recreated the portrait mode effect programmatically. This technique can be extended for photo editing applications, simulating camera effects, or creative projects.

Future Enhancements:

Use edge detection algorithms for better refinement of subject edges.

Experiment with kernel sizes to enhance the blur effect.

Create a user interface to upload images and select subjects dynamically.

Resources:

Segment Anything Model 2 by Meta (https://github.com/facebookresearch/sam2)

CPU-compatible implementation of SAM 2 (https://github.com/SauravMaheshkar/samv2/tree/main)

MiDaS Depth Estimation Model (https://pytorch.org/hub/intelisl_midas_v2/)


Enhancing Lexicon-Based Text Embeddings with Large Language Models

Lexicon-based embeddings are a promising alternative to dense embeddings, yet they face several challenges that limit their wider adoption. One key problem is tokenization redundancy: subword tokenization splits semantically equivalent tokens apart, causing inefficiencies and inconsistencies in the embeddings. Another limitation of causal LLMs is unidirectional attention, which prevents tokens from fully leveraging their surrounding context during pretraining. These challenges constrain the adaptability and efficiency of lexicon-based embeddings, especially in tasks beyond information retrieval, making a stronger approach necessary to make them more useful.

Several techniques have been proposed to explore the potential of lexicon-based embeddings. SPLADE uses bidirectional attention and aligns the embeddings with language modeling objectives, while PromptReps uses prompt engineering to produce lexicon-based embeddings from causal LLMs. SPLADE is limited to smaller models and a specific retrieval task, restricting its applicability, and PromptReps lacks contextual understanding because of unidirectional attention, yielding suboptimal performance. Methods in this class often suffer from high computational cost, inefficiency due to fragmented tokenization, and limited applicability to broader tasks such as clustering and classification.

Researchers from the University of Amsterdam, the University of Technology Sydney, and Tencent IEG propose LENS (Lexicon-based EmbeddiNgS), a framework designed to address the limitations of current lexicon-based embedding techniques. By applying KMeans to cluster semantically similar tokens, LENS merges token embeddings, reducing redundancy and dimensionality. This streamlined representation yields lower-dimensional embeddings while preserving semantic depth. In addition, bidirectional attention is used to overcome the contextual constraints imposed by the unidirectional attention of causal LLMs, allowing tokens to fully use context on both sides. Experiments showed that max-pooling works best among the pooling strategies for token embeddings. Finally, LENS can be combined with dense embeddings; the hybrid embedding draws on the strengths of both approaches and performs better across a wide range of tasks. These enhancements make LENS a versatile and efficient framework for producing interpretable, context-aware embeddings suited to clustering, classification, and retrieval applications.

LENS uses clustering to replace the original token embeddings with cluster centroids in the language modeling head, reducing redundancy and the dimensionality of the embeddings. The resulting embeddings are 4,000 or 8,000 dimensional, making them as efficient and scalable as dense embeddings. In the fine-tuning stage, the model incorporates bidirectional attention, which improves contextual understanding because tokens can attend to their full context. The framework is based on the Mistral-7B model; the datasets are public and cover tasks such as retrieval, clustering, classification, and semantic textual similarity. Training uses a streamlined single-stage pipeline with the InfoNCE loss to refine the embeddings. This methodology provides ease of use, scalability, and strong task performance, making the framework suitable for a range of applications.
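To make the clustering and pooling ideas concrete, the following is a small Python sketch using scikit-learn and NumPy; the array sizes, random data, and scoring step are illustrative assumptions, not the paper's code.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-ins (the real setting uses the full vocabulary and 4,000/8,000 clusters)
vocab_embeddings = rng.normal(size=(2_000, 64))   # rows of the LM head, one per vocab token
hidden_states = rng.normal(size=(12, 64))         # one hidden state per token of a sentence

# 1) Cluster semantically similar vocabulary tokens and keep only the centroids,
#    shrinking the lexicon dimension from |V| down to n_clusters.
n_clusters = 50
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vocab_embeddings)
centroid_head = kmeans.cluster_centers_           # shape (n_clusters, 64)

# 2) Score each token position against the clustered head, then max-pool over the
#    sequence to obtain a single lexicon-style embedding for the whole text.
token_scores = hidden_states @ centroid_head.T    # shape (sequence_length, n_clusters)
lexicon_embedding = token_scores.max(axis=0)      # shape (n_clusters,)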

LENS performs strongly across benchmarks such as the Massive Text Embedding Benchmark (MTEB) and AIR-Bench. On MTEB, LENS-8000 attains the highest mean score among publicly trained models, outpacing dense embeddings in six of seven task categories. The more compact LENS-4000 model also performs competitively, highlighting its scalability and efficiency. Combined with dense embeddings, LENS establishes new baselines in retrieval tasks and provides consistent improvements across datasets. Qualitative evaluations show that LENS captures semantic associations, reduces tokenization noise, and yields compact, informative embeddings. Strong generalization and competitive out-of-domain performance further establish its versatility and broad applicability.

LENS is an exciting step forward for lexicon-based embedding models because it addresses tokenization redundancy and improves contextual representations through bidirectional attention. Its compactness, efficiency, and interpretability hold up across a wide spectrum of tasks, from retrieval to clustering and classification, where it outperforms traditional approaches. Moreover, its effectiveness when combined with dense embeddings points to its potential to reshape text representation. Future studies may extend this work with multilingual datasets and larger models to broaden its significance within the domain of artificial intelligence.

Check out the Paper. All credit for this research goes to the researchers of this project.


Solve forecasting challenges for the retail and CPG industry using Amazon SageMaker Canvas

Businesses today deal with a reality that is increasingly complex and volatile. Companies across retail, manufacturing, healthcare, and other sectors face pressing challenges in accurate planning and forecasting. Predicting future inventory needs, setting achievable strategic goals, and budgeting effectively involve grappling with ever-changing consumer demand and global market forces. Inventory shortages, surpluses, and unmet customer expectations pose constant threats. Supply chain forecasting is critical to helping businesses tackle these uncertainties.
By using historical sales and supply data to anticipate future shifts in demand, supply chain forecasting supports executive decision-making on inventory, strategy, and budgeting. Analyzing past trends while accounting for impacts ranging from seasons to world events provides insights to guide business planning. Organizations that tap predictive capabilities to inform decisions can thrive amid fierce competition and market volatility. Overall, mastering demand predictions allows businesses to fulfill customer expectations by providing the right products at the right times.
In this post, we show you how Amazon Web Services (AWS) helps in solving forecasting challenges by customizing machine learning (ML) models for forecasting. We dive into Amazon SageMaker Canvas and explain how SageMaker Canvas can solve forecasting challenges for retail and consumer packaged goods (CPG) enterprises.
Introduction to Amazon SageMaker Canvas
Amazon SageMaker Canvas is a powerful no-code ML service that gives business analysts and data professionals the tools to build accurate ML models without writing a single line of code. This visual, point-and-click interface democratizes ML so users can take advantage of the power of AI for various business applications. SageMaker Canvas supports multiple ML modalities and problem types, catering to a wide range of use cases based on data types, such as tabular data (our focus in this post), computer vision, natural language processing, and document analysis. To learn more about the modalities that Amazon SageMaker Canvas supports, visit the Amazon SageMaker Canvas product page.
For time-series forecasting use cases, SageMaker Canvas uses AutoML to train six algorithms on your historical time-series dataset and combines them using a stacking ensemble method to create an optimal forecasting model. The algorithms are: Convolutional Neural Network – Quantile Regression (CNN-QR), DeepAR+, Prophet, Non-Parametric Time Series (NPTS), Autoregressive Integrated Moving Average (ARIMA), and Exponential Smoothing (ETS). To learn more about these algorithms, visit Algorithms support for time-series forecasting in the Amazon SageMaker documentation.
How Amazon SageMaker Canvas can help retail and CPG manufacturers solve their forecasting challenges
The combination of a user-friendly UI and automated ML technology available in SageMaker Canvas gives users the tools to efficiently build, deploy, and maintain ML models with little to no coding required. For example, business analysts who have no coding or cloud engineering expertise can quickly use Amazon SageMaker Canvas to upload their time-series data and make forecasting predictions. And this isn't a service for business analysts only: any team at a retail or CPG company can use it to generate forecasting data through the user-friendly UI of SageMaker Canvas.
To effectively use Amazon SageMaker Canvas for retail forecasting, customers should use their sales data for a set of SKUs for which they would like to forecast demand. It’s crucial to have data across all months of the year, considering the seasonal variation in demand in a retail environment. Additionally, it’s essential to provide a few years’ worth of data to eliminate anomalies or outliers within the data.
Retail and CPG organizations rely on industry standard methods in their approach to forecasting. One of these methods is quantiles. Quantiles in forecasting represent specific points in the predicted distribution of possible future values. They allow ML models to provide probabilistic forecasts rather than merely single point estimates. Quantiles help quantify the uncertainty in predictions by showing the range and spread of possible outcomes. Common quantiles used are the 10th, 50th (median), and 90th percentiles. For example, the 90th percentile forecast means there’s a 90% chance the actual value will be at or below that level.
By providing a probabilistic view of future demand, quantile forecasting enables retail and CPG organizations to make more informed decisions in the face of uncertainty, ultimately leading to improved operational efficiency and financial performance.
Amazon SageMaker Canvas addresses this need with ML models coupled with quantile regression. With quantile regression, you can select from a wide range of planning scenarios, which are expressed as quantiles, rather than relying on single point forecasts. It's these quantiles that offer choice.
What do these quantiles mean? Consider the following figure, which is a sample time-series forecasting prediction from Amazon SageMaker Canvas. The figure shows a time-series forecast with multiple outcomes, made possible through quantile regression. The red line, denoted p05, indicates that the actual value is expected to fall below that line about 5% of the time; conversely, about 95% of the time the actual value will fall above the p05 line.
Retail or CPG organizations can evaluate multiple quantile prediction points with a consideration for the over- and under-supply costs of each item to automatically select the quantile likely to provide the most profit in future periods. When necessary, you can override the selection when business rules desire a fixed quantile over a dynamic one.
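To make the quantile interpretation concrete, the following minimal sketch uses NumPy with simulated demand values as a stand-in for a forecast distribution; the distribution, seed, and variable names are illustrative only and are not output from SageMaker Canvas.

```python
import numpy as np

# Simulated distribution of possible demand outcomes for one item in one future
# period (a stand-in for the probabilistic forecast that SageMaker Canvas builds).
rng = np.random.default_rng(seed=42)
simulated_demand = rng.gamma(shape=5.0, scale=100.0, size=10_000)

# A pXX quantile forecast is the level that actual demand is expected to fall
# below roughly XX% of the time.
p05, p50, p90 = np.quantile(simulated_demand, [0.05, 0.50, 0.90])

print(f"p05 = {p05:.0f}  (actual demand falls below this about 5% of the time)")
print(f"p50 = {p50:.0f}  (the median forecast)")
print(f"p90 = {p90:.0f}  (actual demand falls below this about 90% of the time)")

# Sanity check: the empirical share of simulated outcomes below p05 is ~5%.
print(f"share below p05: {(simulated_demand < p05).mean():.2%}")
```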

To learn more about how to use quantiles for your business, check out the post Beyond forecasting: The delicate balance of serving customers and growing your business.
Another powerful feature that Amazon SageMaker Canvas offers is what-if analysis, which complements quantile forecasting with the ability to interactively explore how changes in input variables affect predictions. Users can change model inputs and immediately observe how these changes impact individual predictions. This feature allows for real-time exploration of different scenarios without needing to retrain the model.
What-if analysis in SageMaker Canvas can be applied to various scenarios, such as:

Forecasting inventory in coming months
Predicting sales for the next quarter
Assessing the effect of price reductions on holiday season sales
Estimating customer footfall in stores over the next few hours

How to generate forecasts
The following example illustrates the steps users follow to generate forecasts from a time-series dataset. We use a consumer electronics dataset to forecast 5 months of sales based on current and historical demand. To download a copy of this dataset, visit .
To access Amazon SageMaker Canvas, you can either sign in to the AWS Management Console and navigate to Amazon SageMaker Canvas, or use single sign-on as detailed in Enable single sign-on access of Amazon SageMaker Canvas using AWS IAM Identity Center. In this post, we access Amazon SageMaker Canvas through the AWS Management Console.
Generate forecasts
To generate forecasts, follow these steps:

On the Amazon SageMaker console, in the left navigation pane, choose Canvas.
Choose Open Canvas on the right side under Get Started, as shown in the following screenshot. If this is your first time using SageMaker Canvas, you need to create a SageMaker Canvas user by following the prompts on the screen. A new browser tab will open for the SageMaker Canvas console.

In the left navigation pane, choose Datasets.
To import your time-series dataset, choose the Import data dropdown menu and then choose Tabular, as shown in the following screenshot.

In Dataset name, enter a name such as Consumer_Electronics and then choose Create, as shown in the following screenshot.

Upload your dataset (in CSV or Parquet format) from your computer or an Amazon Simple Storage Service (Amazon S3) bucket.
Preview the data, then choose Create dataset, as shown in the following screenshot.

Under Status, your dataset import will show as Processing. When it shows as Complete, proceed to the next step.

Now that you have your dataset created and your time-series data file uploaded, create a new model to generate forecasts for your dataset. In the left navigation pane, choose My Models, then choose New model, as shown in the following screenshot.

In Model name, enter a name such as consumer_electronics_forecast. Under Problem type, select your use case type. Our use case is Predictive analysis, which builds models using tabular datasets for different problems, including forecasts.
Choose Create.

You will be transferred to the Build tab. In the Target column dropdown menu, select the column for which you want to generate forecasts. This is the demand column in our dataset, as shown in the following screenshot. After you select the target column, SageMaker Canvas will automatically select Time series forecasting as the Model type.
Choose Configure model.

A window will pop up asking you to provide more information, as shown in the following screenshot. Enter the following details:

Choose the column that uniquely identifies the items in your dataset – This configuration determines how you identify your items in the datasets in a unique way. For this use case, select item_id because we’re planning to forecast sales per store.
Choose a column that groups the forecast by the values in the column – If you have logical groupings of the items selected in the previous field, you can choose that feature here. We don’t have one for this use case, but examples would be state, region, country, or other groupings of stores.
Choose the column that contains the time stamps – The timestamp is the feature that contains the timestamp information. SageMaker Canvas requires the timestamps to be in the format YYYY-MM-DD HH:mm:ss (for example, 2022-01-01 01:00:00); a formatting sketch follows this list.
Specify the number of months you want to forecast into the future – SageMaker Canvas forecasts values up to the point in time specified in the timestamp field. For this use case, we will forecast values up to 5 months in the future. You may choose to enter any valid value, but be aware a higher number will impact the accuracy of predictions and also may take longer to compute.
You can use a holiday schedule to improve your prediction accuracy – (Optional) You can enable Use holiday schedule and choose a relevant country if you want to learn how it helps with accuracy. However, it might not have much impact on this use case because our dataset is synthetic.
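The following minimal sketch shows one way to normalize timestamps into the required YYYY-MM-DD HH:mm:ss format before uploading the CSV. It assumes a pandas DataFrame with a column named timestamp; the file and column names are illustrative.

```python
import pandas as pd

# Illustrative file and column names; replace them with your dataset's actual names.
df = pd.read_csv("consumer_electronics.csv")

# Parse whatever timestamp representation is present, then rewrite it in the
# YYYY-MM-DD HH:mm:ss format that SageMaker Canvas expects.
df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d %H:%M:%S")

df.to_csv("consumer_electronics_formatted.csv", index=False)
```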

To change the quantiles from the default values as explained previously, in the left navigation pane, choose Forecast quantiles. In the Forecast quantiles field, enter your own values, as shown in the following screenshot.

SageMaker Canvas chooses an AutoML algorithm based on your data and then trains an ensemble model to make predictions for time-series forecasting problems. Using time-series forecasts, you can make predictions that can vary with time, such as forecasting:

Your inventory in the coming months
Your sales for the next months
The effect of reducing the price on sales during the holiday season
The number of customers entering a store in the next several hours
How a reduction in the price of a product affects sales over a time period

If you’re not sure which forecasting algorithms to try, select all of them. To help you decide which algorithms to select, refer to Algorithms support for time-series forecasting, where you can learn more details and compare algorithms.

Choose Save.

Train the model
Now that the configuration is done, you can train the model. SageMaker Canvas offers two build options:

Quick build – Builds a model in a fraction of the time compared to a standard build. Potential accuracy is exchanged for speed.
Standard build – Builds the best model from an optimized process powered by AutoML. Speed is exchanged for greatest accuracy.

For this walkthrough, we choose Standard build, as shown in the following screenshot.

When the model training finishes, you will be routed to the Analyze tab. There, you can find the average prediction accuracy and the column impact on the prediction outcome.

Your numbers might differ from what the following screenshot shows. This is due to the stochastic nature of the ML process.

Here are explanations of what these metrics mean and how you can use them (a short script after this list illustrates simplified versions of the calculations):

wQL – The average Weighted Quantile Loss (wQL) evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles (unless the user has changed them). A lower value indicates a more accurate model. In our example, we used the default quantiles. If you choose quantiles with different percentiles, wQL will center on the numbers you choose.
MAPE – Mean absolute percentage error (MAPE) is the percentage error (percent difference of the mean forecasted value compared to the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
WAPE – Weighted Absolute Percent Error (WAPE) is the sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
RMSE – Root mean square error (RMSE) is the square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
MASE – Mean absolute scaled error (MASE) is the mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.
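To make these definitions concrete, here is a minimal NumPy sketch of simplified versions of the metrics. The wQL shown averages the weighted quantile loss over the P10, P50, and P90 quantiles, and MASE uses a naive one-step baseline; SageMaker Canvas's exact computations may differ in detail, and the numbers are illustrative.

```python
import numpy as np

def mape(y, f):
    # Mean absolute percentage error, as a fraction (multiply by 100 for percent).
    return np.mean(np.abs((y - f) / y))

def wape(y, f):
    # Sum of absolute errors normalized by the sum of absolute actuals.
    return np.sum(np.abs(y - f)) / np.sum(np.abs(y))

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def mase(y, f, y_train):
    # Forecast MAE scaled by the MAE of a naive one-step baseline on the training data.
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y - f)) / naive_mae

def weighted_quantile_loss(y, f_q, q):
    # Weighted quantile (pinball) loss for a single quantile forecast f_q.
    return 2 * np.sum(q * np.maximum(y - f_q, 0) + (1 - q) * np.maximum(f_q - y, 0)) / np.sum(np.abs(y))

# Illustrative numbers only.
y_train = np.array([100.0, 120.0, 110.0, 130.0, 125.0])
y_true = np.array([140.0, 135.0, 150.0])
forecast_p50 = np.array([138.0, 140.0, 145.0])
forecast_p10 = forecast_p50 * 0.8
forecast_p90 = forecast_p50 * 1.2

avg_wql = np.mean([weighted_quantile_loss(y_true, f, q)
                   for f, q in [(forecast_p10, 0.1), (forecast_p50, 0.5), (forecast_p90, 0.9)]])

print(f"MAPE: {mape(y_true, forecast_p50):.3f}")
print(f"WAPE: {wape(y_true, forecast_p50):.3f}")
print(f"RMSE: {rmse(y_true, forecast_p50):.3f}")
print(f"MASE: {mase(y_true, forecast_p50, y_train):.3f}")
print(f"Average wQL (P10/P50/P90): {avg_wql:.3f}")
```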

You can change the default metric based on your needs. wQL is the default metric. Companies should choose a metric that aligns with their specific business goals and is straightforward for stakeholders to interpret. The choice of metric should be driven by the specific characteristics of the demand data, the business objectives, and the interpretability requirements of stakeholders.
For instance, a high-traffic grocery store that sells perishable items requires the lowest possible wQL. This is crucial to prevent lost sales from understocking while also avoiding overstocking, which can lead to spoilage of those perishables.
It’s often recommended to evaluate multiple metrics and select the one that best aligns with the company’s forecasting goals and data patterns. For example, wQL is a robust metric that can handle intermittent demand and provide a more comprehensive evaluation of forecast accuracy across different quantiles. However, RMSE gives higher weight to larger errors due to the squaring operation, making it more sensitive to outliers.

Choose Predict to open the Predict tab.

To generate forecast predictions for all the items in the dataset, select Batch prediction. To generate forecast predictions for a specific item (for example, to predict demand in real-time), select Single prediction. The following steps show how to perform both operations.

To generate forecast predictions for a specific item, follow these steps:

Choose Single item and select any of the items from the item dropdown list. SageMaker Canvas generates a prediction for the item, showing the average prediction (that is, the demand for that item over time). SageMaker Canvas provides the upper bound, lower bound, and expected forecast values.

It’s a best practice to have bounds rather than a single prediction point so that you can pick whichever best fits your use case. For example, you might use the lower bound to reduce the waste and cost of overstocking, or you might follow the upper bound to make sure that you meet customer demand. For instance, a highly advertised item in a promotional flyer might be stocked at the 90th percentile (p90) to ensure availability and prevent customer disappointment. On the other hand, accessories or bulky items that are less likely to drive customer traffic could be stocked at the 40th percentile (p40). It’s generally not advisable to stock below the 40th percentile, to avoid being consistently out of stock.

To export the forecast, select the Download prediction dropdown menu to download the forecast prediction chart as an image or the forecast prediction values as a CSV file.

You can use the What if scenario button to explore how changing the price will affect the demand for an item. To use this feature, leave the future-dated rows empty for the feature you're predicting. This dataset has empty cells for a few items, which means that this feature is enabled for them. Choose What if scenario and edit the values for the different dates to view how changing the price will affect demand. This feature helps organizations test specific scenarios without making changes to the underlying data.
To generate batch predictions on the entire dataset, follow these steps:

Choose All items and then choose Start Predictions. The Status will show as Generating predictions, as shown in the following screenshot.

When it’s complete, the Status will show as Ready, as shown in the following screenshot. Select the three-dot additional options icon and choose Preview. This will open the prediction results in a preview page.

Choose Download to export these results to your local computer or choose Send to Amazon QuickSight for visualization, as shown in the following screenshot.

Training time and performance
SageMaker Canvas provides efficient training times and offers valuable insights into model performance. You can inspect model accuracy, perform backtesting, and evaluate various performance metrics for the underlying models. By combining multiple algorithms in the background, SageMaker Canvas significantly reduces the time required to train models compared to training each model individually. Additionally, by using the model leaderboard dashboard, you can assess the performance of each trained algorithm against your specific time-series data, ranked based on the selected performance metric (wQL by default).
This dashboard also displays other metrics, which you can use to compare different algorithms trained on your data across various performance measures, facilitating informed decision-making and model selection.
To view the leaderboard, choose Model leaderboard, as shown in the following screenshot.

The model leaderboard shows you the different algorithms used to train your data along with their performance based on all the available metrics, as shown in the following screenshot.

Integration
Retail and CPG organizations often rely on applications such as inventory lifecycle management, order management systems, and business intelligence (BI) dashboards, which incorporate forecasting capabilities. In these scenarios, organizations can seamlessly integrate the SageMaker Canvas forecasting service with their existing applications, enabling them to harness the power of forecasting data. To use the forecasting data within these applications, an endpoint for the forecasting model is required. Although SageMaker Canvas models can be deployed to provide endpoints, this process may require additional effort from a machine learning operations (MLOps) perspective. Fortunately, Amazon SageMaker streamlines this process, simplifying the deployment and integration of SageMaker Canvas models.
The following steps show how you can deploy SageMaker Canvas models using SageMaker:

On the SageMaker console, in the left navigation pane, choose My Models.
Select the three-dot additional options icon next to the model you want to deploy and choose Deploy, as shown in the following screenshot.

Under Instance type, select the size of the instance to deploy your model to. Choose Deploy and wait until your deployment status changes to In service.

After your deployment is in service, in the left navigation pane, choose ML Ops to get your deployed model endpoint, as shown in the following screenshot. You can test your deployment or start using the endpoint in your applications.
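After the endpoint shows as In service, applications can call it directly with the AWS SDK. The following minimal sketch assumes an endpoint named canvas-demand-forecast that accepts a CSV payload matching the training schema; both the endpoint name and the payload columns are illustrative assumptions.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Illustrative CSV payload; the columns must match the schema the Canvas model
# was trained on (for example: item_id, timestamp, price).
payload = "item_001,2024-06-01 00:00:00,499.99\n"

response = runtime.invoke_endpoint(
    EndpointName="canvas-demand-forecast",  # assumed endpoint name
    ContentType="text/csv",
    Body=payload,
)

print(response["Body"].read().decode("utf-8"))
```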

Reproducibility and API management
It’s important to understand that Amazon SageMaker Canvas is backed by AutoML APIs, which you can also call directly for reproducible, programmatic workflows. To learn more, see Speed up your time series forecasting by up to 50 percent with Amazon SageMaker Canvas UI and AutoML APIs in the AWS Machine Learning Blog.
Insights
Retail and CPG enterprises typically use visualization tools such as Amazon QuickSight or third-party software such as Tableau to understand forecast results and share them across business units. To streamline the visualization, SageMaker Canvas provides embedded visualization for exploring forecast results. For those retail and CPG enterprises that want to visualize the forecasting data in their own BI dashboard systems (such as Amazon QuickSight, Tableau, and Qlik), SageMaker Canvas forecasting models can be deployed to generate forecasting endpoints. Users can also send a batch prediction file to Amazon QuickSight from the Predict window, as shown in the following screenshot.

The following screenshot shows the batch prediction file in QuickSight as a dataset that you can use for analysis.

When your dataset is in Amazon QuickSight, you can start analyzing or even visualizing your data using the visualization tools, as shown in the following screenshot.

Cost
Amazon SageMaker Canvas offers a flexible, cost-effective pricing model based on three key components: workspace instance runtime, utilization of pre-built models, and resource consumption for custom model creation and prediction generation. The billing cycle commences upon launching the SageMaker Canvas application, encompassing a range of essential tasks including data ingestion, preparation, exploration, model experimentation, and analysis of prediction and explainability results. This comprehensive approach means that users only pay for the resources they actively use, providing a transparent and efficient pricing structure. To learn more about pricing examples, check out Amazon SageMaker Canvas pricing.
Ownership and portability
More retail and CPG enterprises have embraced multi-cloud deployments for several reasons. To streamline portability of models built and trained on Amazon SageMaker Canvas to other cloud providers or on-premises environments, Amazon SageMaker Canvas provides downloadable model artifacts.
Also, several retail and CPG companies have many business units (such as merchandising, planning, or inventory management) within the organization who all use forecasting for solving different use cases. To streamline ownership of a model and facilitate straightforward sharing between business units, Amazon SageMaker Canvas now extends its Model Registry integration to time-series forecasting models. With a single click, customers can register the ML models built on Amazon SageMaker Canvas with the SageMaker Model Registry, as shown in the following screenshot. Register a Model Version in the Amazon SageMaker Developer Guide shows you where to find the S3 bucket location where your model’s artifacts are stored.
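As a sketch of how another business unit might locate a registered Canvas model and its artifacts programmatically, the following Boto3 example lists the model package versions in a registry group and prints the S3 location of each version's model data; the group name is an illustrative assumption.

```python
import boto3

sm = boto3.client("sagemaker")

# Illustrative model package group registered from SageMaker Canvas.
group_name = "canvas-demand-forecast-models"

packages = sm.list_model_packages(ModelPackageGroupName=group_name)

for summary in packages["ModelPackageSummaryList"]:
    details = sm.describe_model_package(ModelPackageName=summary["ModelPackageArn"])
    for container in details["InferenceSpecification"]["Containers"]:
        # ModelDataUrl is the S3 location of the model artifacts for this version.
        print(summary["ModelPackageArn"], "->", container["ModelDataUrl"])
```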

Clean up
To avoid incurring unnecessary costs, you can delete the model you just built, then delete the dataset, and sign out of your Amazon SageMaker Canvas domain. If you also signed up for Amazon QuickSight, you can unsubscribe and remove your Amazon QuickSight account.
Conclusion
Amazon SageMaker Canvas empowers retail and CPG companies with a no-code forecasting solution. It delivers automated time-series predictions for inventory planning and demand anticipation, featuring an intuitive interface and rapid model development. With seamless integration capabilities and cost-effective insights, it enables businesses to enhance operational efficiency, meet customer expectations, and gain a competitive edge in the fast-paced retail and consumer goods markets.
We encourage you to evaluate how you can improve your forecasting capabilities using Amazon SageMaker Canvas. Use the intuitive no-code interface to analyze and improve the accuracy of your demand predictions for retail and CPG products, enhancing inventory management and operational efficiency. To get started, you can review the workshop Amazon SageMaker Canvas Immersion Day.

About the Authors
Aditya Pendyala is a Principal Solutions Architect at AWS based out of NYC. He has extensive experience in architecting cloud-based applications. He is currently working with large enterprises to help them craft highly scalable, flexible, and resilient cloud architectures, and guides them on all things cloud. He has a Master of Science degree in Computer Science from Shippensburg University and believes in the quote “When you cease to learn, you cease to grow.”
Julio Hanna, an AWS Solutions Architect based in New York City, specializes in enterprise technology solutions and operational efficiency. With a career focused on driving innovation, he currently leverages Artificial Intelligence, Machine Learning, and Generative AI to help organizations navigate their digital transformation journeys. Julio’s expertise lies in harnessing cutting-edge technologies to deliver strategic value and foster innovation in enterprise environments.

Enabling generative AI self-service using Amazon Lex, Amazon Bedrock, …

Chat-based assistants have become an invaluable tool for providing automated customer service and support. This post builds on a previous post, Integrate QnABot on AWS with ServiceNow, and explores how to build an intelligent assistant using Amazon Lex, Amazon Bedrock Knowledge Bases, and a custom ServiceNow integration to create an automated incident management support experience.
Amazon Lex is powered by the same deep learning technologies used in Alexa. With it, developers can quickly build conversational interfaces that can understand natural language, engage in realistic dialogues, and fulfill customer requests. Amazon Lex can be configured to respond to customer questions using Amazon Bedrock foundation models (FMs) to search and summarize FAQ responses. Amazon Bedrock Knowledge Bases provides the capability of amassing data sources into a repository of information. Using knowledge bases, you can effortlessly create an application that uses Retrieval Augmented Generation (RAG), a technique where the retrieval of information from data sources enhances the generation of model responses.
ServiceNow is a cloud-based platform for IT workflow management and automation. With its robust capabilities for ticketing, knowledge management, human resources (HR) services, and more, ServiceNow is already powering many enterprise service desks.
By connecting an Amazon Lex chat assistant with Amazon Bedrock Knowledge Bases and ServiceNow, companies can provide 24/7 automated support and self-service options to customers and employees. In this post, we demonstrate how to integrate Amazon Lex with Amazon Bedrock Knowledge Bases and ServiceNow.
Solution overview
The following diagram illustrates the solution architecture.

The workflow includes the following steps:

The ServiceNow knowledge base is exported into Amazon Simple Storage Service (Amazon S3), which will be used as the data source for Amazon Bedrock Knowledge Bases. Data in Amazon S3 is encrypted by default. You can further enhance security by Using server-side encryption with AWS KMS keys (SSE-KMS).
Amazon AppFlow can be used to sync between ServiceNow and Amazon S3. Other alternatives like AWS Glue can also be used to ingest data from ServiceNow.
Amazon Bedrock Knowledge Bases is created with Amazon S3 as the data source and Amazon Titan (or any other model of your choice) as the embedding model.
When users of the Amazon Lex chat assistant ask queries, Amazon Lex fetches answers from Amazon Bedrock Knowledge Bases.
If the user requests a ServiceNow ticket to be created, Amazon Lex invokes the AWS Lambda function.
The Lambda function fetches secrets from AWS Secrets Manager and makes an HTTP call to create a ServiceNow ticket (a sketch of such a function follows this list).
Application Auto Scaling is enabled on AWS Lambda to automatically scale Lambda according to user interactions.
The solution conforms to responsible AI policies, and Guardrails for Amazon Bedrock enforces the organization's responsible AI policies.
The solution is monitored using Amazon CloudWatch, AWS CloudTrail, and Amazon GuardDuty.
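The following is a minimal, hedged sketch of the Lambda step that reads the ServiceNow credentials from Secrets Manager and calls the ServiceNow Table API to open an incident. The secret name, secret fields, and short description are illustrative; the actual function deployed by the CloudFormation stack may differ.

```python
import json

import boto3
import urllib3

http = urllib3.PoolManager()
secrets = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # Illustrative secret name; the stack in this post provisions its own secret.
    secret = json.loads(
        secrets.get_secret_value(SecretId="servicenow/credentials")["SecretString"]
    )
    host, user, password = secret["host"], secret["username"], secret["password"]

    # ServiceNow Table API call to create an incident record.
    url = f"https://{host}/api/now/table/incident"
    headers = urllib3.make_headers(basic_auth=f"{user}:{password}")
    headers["Content-Type"] = "application/json"
    headers["Accept"] = "application/json"

    body = json.dumps({"short_description": event.get("description", "Password reset request")})
    response = http.request("POST", url, body=body, headers=headers)

    return {"statusCode": response.status, "body": response.data.decode("utf-8")}
```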

Be sure to follow least privilege access policies while giving access to any system resources.
Prerequisites
The following prerequisites need to be completed before building the solution.

On the Amazon Bedrock console, sign up for access to the Anthropic Claude model of your choice using the instructions at Manage access to Amazon Bedrock foundation models. For information about pricing for using Amazon Bedrock, see Amazon Bedrock pricing.
Sign up for a ServiceNow account if you do not have one. Save your username and password. You will need to store them in AWS Secrets Manager later in this walkthrough.
Create a ServiceNow instance following the instructions in Integrate QnABot on AWS with ServiceNow.
Create a user with permissions to create incidents in ServiceNow using the instructions at Create a user. Make a note of these credentials for use later in this walkthrough.

The instructions provided in this walkthrough are for demonstration purposes. Follow ServiceNow documentation to create community instances and follow their best practices.
Solution overview
To integrate Amazon Lex with Amazon Bedrock Knowledge Bases and ServiceNow, follow the steps in the next sections.
Deployment with AWS CloudFormation console
In this step, you first create the solution architecture discussed in the solution overview, except for the Amazon Lex assistant, which you will create later in the walkthrough. Complete the following steps:

On the CloudFormation console, verify that you are in the correct AWS Region and choose Create stack to create the CloudFormation stack.
Download the CloudFormation template and upload it in the Specify template section. Choose Next.
For Stack name, enter a name such as ServiceNowBedrockStack.
In the Parameters section, for ServiceNow details, provide the values of ServiceNow host and ServiceNow username created earlier.
Keep the other values as default. Under Capabilities on the last page, select I acknowledge that AWS CloudFormation might create IAM resources. Choose Submit to create the CloudFormation stack.
After the successful deployment of the whole stack, from the Outputs tab, make a note of the output key value BedrockKnowledgeBaseId because you will need it later during creation of the Amazon Lex assistant.

Integration of Lambda with Application Auto Scaling is beyond the scope of this post. For guidance, refer to the instructions at AWS Lambda and Application Auto Scaling.
Store the secrets in AWS Secrets Manager
Follow these steps to store your ServiceNow username and password in AWS Secrets Manager:

On the CloudFormation console, on the Resources tab, enter the word “secrets” to filter search results. Under Physical ID, select the console URL of the AWS Secrets Manager secret you created using the CloudFormation stack.
On the AWS Secrets Manager console, on the Overview tab, under Secret value, choose Retrieve secret value.
Select Edit and enter the username and password of the ServiceNow instance you created earlier. Make sure that both the username and password are correct.

Download knowledge articles
You need access to ServiceNow knowledge articles. Follow these steps:

Create a knowledge base if you don’t have one. Periodically, you may need to sync your knowledge base to keep it up to date.
Sync the data from ServiceNow to Amazon S3 using Amazon AppFlow by following instructions at ServiceNow. Alternatively, you can use AWS Glue to ingest data from ServiceNow to Amazon S3 by following instructions at the blog post, Extract ServiceNow data using AWS Glue Studio in an Amazon S3 data lake and analyze using Amazon Athena.
Download a sample article.

Sync Amazon Bedrock Knowledge Bases
This solution uses the fully managed Knowledge Base for Amazon Bedrock to seamlessly power a RAG workflow, eliminating the need for custom integrations and data flow management. As the data source for the knowledge base, the solution uses Amazon S3. The following steps outline uploading ServiceNow articles to an S3 bucket created by a CloudFormation template.

On the CloudFormation console, on the Resources tab, enter “S3” to filter search results. Under Physical ID, select the URL for the S3 bucket created using the CloudFormation stack.
Upload the previously downloaded knowledge articles to this S3 bucket.

Next you need to sync the data source.

On the CloudFormation console, on the Outputs tab, enter “Knowledge” to filter search results. Under Value, select the console URL of the knowledge bases that you created using the CloudFormation stack. Open that URL in a new browser tab.
Scroll down to Data source and select the data source. Choose Sync.

You can test the knowledge base by choosing the model in the Test the knowledge base section and asking the model a question.
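Although this walkthrough uses the console, the sync and the test can also be scripted with the AWS SDK. In the following minimal sketch, the knowledge base ID, data source ID, and model ARN are illustrative placeholders that you would replace with the values from your CloudFormation outputs and the Amazon Bedrock console.

```python
import boto3

agent = boto3.client("bedrock-agent")
agent_runtime = boto3.client("bedrock-agent-runtime")

# Illustrative IDs and ARN; use the values from your CloudFormation outputs
# and the Amazon Bedrock console.
kb_id = "KBID1234"
data_source_id = "DSID1234"
model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

# Sync the S3 data source into the knowledge base.
job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=data_source_id)
print("Ingestion job:", job["ingestionJob"]["ingestionJobId"])

# Once the sync completes, ask a question against the knowledge base.
answer = agent_runtime.retrieve_and_generate(
    input={"text": "What benefits does AnyCompany offer to its employees?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {"knowledgeBaseId": kb_id, "modelArn": model_arn},
    },
)
print(answer["output"]["text"])
```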
Responsible AI using Guardrails for Amazon Bedrock
Conversational AI applications require robust guardrails to safeguard sensitive user data, adhere to privacy regulations, enforce ethical principles, and mitigate hallucinations, fostering responsible development and deployment. Guardrails for Amazon Bedrock allow you to configure your organizational policies against the knowledge bases. They help keep your generative AI applications safe by evaluating both user inputs and model responses.
To set up guardrails, follow these steps:

Follow the instructions at the Amazon Bedrock User Guide to create a guardrail.

You can reduce hallucinations in the model responses by enabling the grounding check and relevance check and adjusting their thresholds; a scripted version of this configuration follows these steps.

Create a version of the guardrail.
Select the newly created guardrail and copy the guardrail ID. You will use this ID later in the intent creation.
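For reference, a similar guardrail can also be created programmatically. In the following minimal sketch, the guardrail name, blocked messages, and thresholds are illustrative; adjust them to your organization's policies.

```python
import boto3

bedrock = boto3.client("bedrock")

# Illustrative guardrail with contextual grounding and relevance checks enabled.
guardrail = bedrock.create_guardrail(
    name="servicenow-assistant-guardrail",
    description="Blocks ungrounded or irrelevant answers from the knowledge base",
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that information.",
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},   # illustrative thresholds
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
)

# Publish a version of the guardrail; use the returned ID and version in the Lex intent.
version = bedrock.create_guardrail_version(guardrailIdentifier=guardrail["guardrailId"])
print("Guardrail ID:", guardrail["guardrailId"], "Version:", version["version"])
```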

Amazon Lex setup
In this section, you configure your Amazon Lex chat assistant with intents to call Amazon Bedrock. This walkthrough uses Amazon Lex V2.

On the CloudFormation console, on the Outputs tab, copy the value of BedrockKnowledgeBaseId. You will need this ID later in this section.
On the Outputs tab, under Outputs, enter “bot” to filter search results. Choose the console URL of the Amazon Lex assistant you created using the CloudFormation stack. Open that URL in a new browser tab.
On the Amazon Lex Intents page, choose Create another intent. On the Add intent dropdown menu, choose Use built-in intent.
On the Use built-in intent screen, under Built-in intent, choose QnAIntent - GenAI feature.
For Intent name, enter BedrockKb and select Add.
In the QnA configuration section, under Select model, choose Anthropic and Claude 3 Haiku or a model of your choice.
Expand Additional Model Settings and enter the Guardrail ID for the guardrails you created earlier. Under Guardrail Version, enter a number that corresponds to the number of versions you have created.
Enter the Knowledge base for Amazon Bedrock Id that you captured earlier in the CloudFormation outputs section. Choose Save intent at the bottom.

You can now add more QnAIntents pointing to different knowledge bases.

Return to the intents list by choosing Back to intents list in the navigation pane.
Select Build to build the assistant.

A green banner on the top of the page with the message Successfully built language English (US) in bot: servicenow-lex-bot indicates the Amazon Lex assistant is now ready.

Test the solution
To test the solution, follow these steps:

In the navigation pane, choose Aliases. Under Aliases, select TestBotAlias.
Under Languages, choose English (US). Choose Test.
A new test window will pop up at the bottom of the screen.
Enter the question “What benefits does AnyCompany offer to its employees?” Then press Enter.

The chat assistant generates a response based on the content in the knowledge base.

To test Amazon Lex to create a ServiceNow ticket for information not present in the knowledge base, enter the question “Create a ticket for password reset” and press Enter.

The chat assistant generates a new ServiceNow ticket because this information is not available in the knowledge base.

To search for the incident, log in to the ServiceNow endpoint that you configured earlier.
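You can also drive the same tests programmatically with the Amazon Lex V2 runtime API. In the following minimal sketch, the bot ID, bot alias ID, and session ID are illustrative placeholders you can copy from the Amazon Lex console.

```python
import boto3

lex = boto3.client("lexv2-runtime")

# Illustrative identifiers; copy the real values from the Amazon Lex console.
bot_id = "ABCDE12345"
bot_alias_id = "TSTALIASID"

for utterance in [
    "What benefits does AnyCompany offer to its employees?",
    "Create a ticket for password reset",
]:
    response = lex.recognize_text(
        botId=bot_id,
        botAliasId=bot_alias_id,
        localeId="en_US",
        sessionId="test-session-1",
        text=utterance,
    )
    for message in response.get("messages", []):
        print(utterance, "->", message["content"])
```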

Monitoring
You can use CloudWatch logs to review the performance of the assistant and to troubleshoot issues with conversations. From the CloudFormation stack that you deployed, you have already configured your Amazon Lex assistant CloudWatch log group with appropriate permissions.
To view the conversation logs from the Amazon Lex assistant, follow these directions.
On the CloudFormation console, on the Outputs tab, enter “Log” to filter search results. Under Value, choose the console URL of the CloudWatch log group that you created using the CloudFormation stack. Open that URL in a new browser tab.
To protect sensitive data, Amazon Lex obscures slot values in conversation logs. As a security best practice, do not store any slot values in request or session attributes. Amazon Lex V2 doesn’t obscure the slot value in audio. You can selectively capture only text using the instructions at Selective conversation log capture.
Enable logging for Amazon Bedrock ingestion jobs
You can monitor Amazon Bedrock ingestion jobs using CloudWatch. To configure logging for an ingestion job, follow the instructions at Knowledge bases logging.
AWS CloudTrail logs
AWS CloudTrail is an AWS service that tracks actions taken by a user, role, or an AWS service. CloudTrail is enabled on your AWS account when you create the account. When activity occurs in your AWS account, that activity is recorded in a CloudTrail event along with other AWS service events in Event history. You can view, search, and download recent events in your AWS account. For more information, see Working with CloudTrail Event history.
As a security best practice, you should monitor any access to your environment. You can configure Amazon GuardDuty to identify any unexpected and potentially unauthorized activity in your AWS environment.
Cleanup
To avoid incurring future charges, delete the resources you created. To clean up the AWS environment, use the following steps:

Empty the contents of the S3 bucket you created as part of the CloudFormation stack.
Delete the CloudFormation stack you created.

Conclusion
As customer expectations continue to evolve, embracing innovative technologies like conversational AI and knowledge management systems becomes essential for businesses to stay ahead of the curve. By implementing this integrated solution, companies can enhance operational efficiency and deliver superior service to both their customers and employees, while also adapting the responsible AI policies of the organization.
Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.

About the Authors
Marcelo Silva is an experienced tech professional who excels in designing, developing, and implementing cutting-edge products. Starting off his career at Cisco, Marcelo worked on various high-profile projects including deployments of the first ever carrier routing system and the successful rollout of ASR9000. His expertise extends to cloud technology, analytics, and product management, having served as senior manager for several companies such as Cisco, Cape Networks, and AWS before joining GenAI. Currently working as a Conversational AI/GenAI Product Manager, Marcelo continues to excel in delivering innovative solutions across industries.
Sujatha Dantuluri is a seasoned Senior Solutions Architect on the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.
NagaBharathi Challa is a solutions architect on the US federal civilian team at Amazon Web Services (AWS). She works closely with customers to effectively use AWS services for their mission use cases, providing architectural best practices and guidance on a wide range of services. Outside of work, she enjoys spending time with family and spreading the power of meditation.
Pranit Raje is a Cloud Architect on the AWS Professional Services India team. He specializes in DevOps, operational excellence, and automation using DevSecOps practices and infrastructure as code. Outside of work, he enjoys going on long drives with his beloved family, spending time with them, and watching movies.

Step Towards Best Practices for Open Datasets for LLM Training

Large language models rely heavily on open datasets to train, which poses significant legal, technical, and ethical challenges in managing such datasets. There are uncertainties around the legal implications of using data under varying copyright laws and changing regulations regarding safe usage. The lack of global standards or centralized databases to validate and license datasets, together with incomplete or inconsistent metadata, makes it difficult to assess the legal status of works. Technical barriers also limit access to digitized public domain material. Most open datasets are not governed and have not implemented any kind of legal safety net for their contributors, exposing contributors to risk and making the datasets hard to scale. While intended to create more transparency and collaborative work, such efforts do little to engage broader social challenges such as diversity and accountability and often exclude underrepresented languages and viewpoints.

Current methods of building open datasets for LLMs often lack clear legal frameworks and face significant technical, operational, and ethical challenges. Traditional methods depend on incomplete metadata, complicating the verification of copyright status and compliance across regions with different laws. Digitizing public domain materials and making them accessible is challenging because large projects like Google Books restrict usage, which prevents the construction of open datasets. Volunteer-driven projects lack structured governance, which exposes contributors to legal risks. Such gaps limit equal access, reduce diversity in data representation, and concentrate power in a few dominant organizations. This creates an ecosystem where open datasets struggle to compete with proprietary models, reducing accountability and slowing progress toward transparent and inclusive AI development.

To mitigate issues in metadata encoding, data sourcing, and processing for machine learning datasets, researchers proposed a framework focused on building a reliable corpus using openly licensed and public domain data for training large language models (LLMs). The framework emphasizes overcoming technical challenges like ensuring reliable metadata and digitizing physical records. It promotes cross-domain cooperation to responsibly curate, govern, and release these datasets while promoting competition in the LLM ecosystem. It also emphasizes metadata standards, reproducibility for accountability, and ensuring data source diversity as an alternative to more traditional methods lacking structured governance and transparency.

Researchers included all the practical steps of sourcing, processing, and governing datasets. Tools for detecting openly licensed content were used to ensure high-quality data. The framework integrated standards for metadata consistency, emphasized digitization, and encouraged collaboration with communities to create datasets. It also supported transparency and reproducibility in preprocessing and addressed potential biases and harmful content, providing a robust and inclusive system for training LLMs while reducing legal risks. The framework also highlights engaging with underrepresented communities to build diverse datasets and creating clearer, machine-readable terms of use. Additionally, it proposes funding models that combine public funding with contributions from tech companies and cultural institutions to keep participation in the open data ecosystem sustainable.

Finally, the researchers provided a broadly outlined plan for approaching these issues in the context of training LLMs on unlicensed data, with a focus on the openness of datasets and on coordinated efforts across different spheres. Initiatives such as metadata standardization, an enhanced digitization process, and responsible governance are intended to make the artificial intelligence ecosystem more open. The work lays a foundation for future research that probes newer innovations in dataset management, AI governance, and technologies that improve data accessibility while addressing ethical and legal challenges.

Check out the Paper. All credit for this research goes to the researchers of this project.

AutoCBT: An Adaptive Multi-Agent Framework for Enhanced Automated Cognitive Behavioral Therapy

Traditional psychological counseling, often conducted in person, remains limited to individuals actively seeking help for psychological concerns. In contrast, online automated counseling presents a viable option for those hesitant to pursue therapy due to stigma or shame. Cognitive Behavioral Therapy (CBT), a widely practiced approach in psychological counseling, aims to help individuals identify and correct cognitive distortions contributing to negative emotions and behaviors. The emergence of LLMs has opened new possibilities for automating CBT diagnosis and treatment. However, current LLM-based CBT systems face challenges such as fixed structural frameworks, which limit adaptability and self-optimization, and repetitive response patterns that provide generic, unhelpful suggestions.

Recent advancements in AI have introduced frameworks like CBT-LLM, which employs prompt-based learning, and CoCoA, which integrates memory mechanisms for retrieval-augmented generation. These systems aim to identify and address cognitive distortions in user statements while enhancing the depth and relevance of therapeutic interactions. Despite their potential, existing methods often lack personalization, adaptability to changing user needs, and a nuanced understanding of dynamic therapeutic processes. To bridge these gaps, ongoing research uses annotated datasets, ontologies, and advanced LLMs to develop context-aware CBT systems that mimic human cognitive processes.

Researchers from the Shenzhen Key Laboratory for High-Performance Data Mining, Shenzhen Institutes of Advanced Technology, the Chinese Academy of Sciences, and several other institutions developed AutoCBT, an autonomous multi-agent framework designed for CBT in single-turn psychological consultations. Utilizing Quora-like and YiXinLi models, AutoCBT integrates dynamic routing and memory mechanisms to improve response quality and adaptability. The framework undergoes structured reasoning and editing to generate high-quality, context-aware outputs. Evaluated on a bilingual dataset, it outperforms traditional LLM-based systems, addressing challenges like dynamic routing, supervisory mechanisms, and Llama’s over-protection issue.

AutoCBT is a versatile framework designed for multi-agent systems in CBT, comprising a Counsellor Agent (interface), Supervisor Agents, communication topology, and routing strategies. The Counsellor Agent, powered by LLMs, interacts with users and seeks input from Supervisor Agents to generate confident, high-quality responses. Agents feature memory mechanisms for short-term and long-term storage, and routing strategies like unicast and broadcast enable dynamic communication. AutoCBT incorporates CBT principles—empathy, belief identification, reflection, strategy, and encouragement—mapped to specific Supervisor Agents. Its effectiveness was validated using a bilingual dataset combining PsyQA and TherapistQA, categorized and augmented with cognitive distortion examples.

In online psychological counseling, LLMs like Qwen-2.5-72B and Llama-3.1-70B were evaluated for handling emotional nuances and instruction adherence. AutoCBT, a two-stage framework, outperformed Generation and PromptCBT by incorporating dynamic routing and supervisory mechanisms, achieving higher scores across empathy, cognitive distortion handling, and response relevance. AutoCBT’s iterative approach enhanced its draft responses, which were validated by automatic and human evaluations. Challenges included routing conflicts, role confusion, and redundant feedback loops, mitigated through design adjustments. Llama’s over-caution led to frequent refusals on sensitive topics, unlike Qwen, which responded comprehensively, highlighting the importance of balance in model sensitivity.

In conclusion, AutoCBT is an innovative multi-agent framework designed for CBT-based psychological counseling. By integrating dynamic routing and supervisory mechanisms, AutoCBT addresses limitations in traditional LLM-based counseling, significantly enhancing response quality and effectiveness in identifying and addressing cognitive distortions. AutoCBT achieves superior dialogue quality through its adaptive and autonomous design compared to static, prompt-based systems. Challenges in LLMs’ semantic understanding and instruction adherence were identified and mitigated through targeted solutions. Leveraging bilingual datasets and models, the framework demonstrates its potential to deliver high-quality, automated counseling services. It offers a scalable alternative for individuals hesitant to pursue traditional therapy due to stigma.

Check out the Paper. All credit for this research goes to the researchers of this project.

This AI Paper Introduces a Novel DINOv2-LLaVA Framework: Advanced Vision-Language Model for Automated Radiology Report Generation

The automation of radiology report generation has become one of the significant areas of focus in biomedical natural language processing. This is driven by the vast and exponentially growing medical imaging data and a dependency on highly accurate diagnostic interpretation in modern health care. Advancements in artificial intelligence make image analysis combined with natural language processing the key to changing the landscape of radiology workflows regarding efficiency, consistency, and accuracy of diagnostics.

A significant challenge in this field lies in generating comprehensive and accurate reports that meet the complexities of medical imaging. Radiology reports often require precise descriptions of imaging findings and their clinical implications. Ensuring consistency in report quality while capturing subtle nuances from medical images is particularly challenging. The limited availability of radiologists and the growing demand for imaging interpretations further complicate the situation, highlighting the need for effective automation solutions.

The traditional approach to the automation of radiology reporting is based on convolutional neural networks (CNNs) or visual transformers to extract features from images. Such image-processing techniques often combine with transformers or recurrent neural networks (RNNs) to generate textual outputs. These approaches have shown promise but usually fail to maintain factual accuracy and clinical relevance. Integrating image and text data remains a technical hurdle, which opens the way for further improvements in model design and data utilization.

Researchers from AIRI and Skoltech developed an advanced system to address these challenges. It pairs a DINOv2 vision encoder specifically trained for medical data with an open biomedical large language model, OpenBio-LLM-8B, connected through the LLaVA framework, which eases vision-language interaction. The authors relied on the PadChest, BIMCV-COVID19, CheXpert, OpenI, and MIMIC-CXR datasets to train and test their model so it can handle a wide range of clinical settings.

The proposed system integrates advanced methodologies for both image encoding and language generation. The DINOv2 vision encoder works on chest X-ray images, extracting nuanced features from radiological studies. These features are processed by OpenBio-LLM-8B, a text decoder optimized for the biomedical domain. Training was conducted over two days on powerful computational resources, including 4 NVIDIA A100 GPUs. The team used Low-Rank Adaptation (LoRA) fine-tuning to enhance learning without overfitting. Only high-quality images were included through a careful preprocessing pipeline, using the first two images from every study for evaluation.

The system performed well across all the chosen evaluation metrics for radiology report generation. On the hidden test sets, the model achieved a BLEU-4 score of 11.68 for findings and 12.33 for impressions, reflecting its precision in generating relevant textual content. In addition, the system attained an F1-CheXbert score of 57.49 for findings and 56.97 for impressions, indicating that it can capture critical medical observations accurately. The BERTScore for findings was 53.80, further validating the semantic consistency of the generated texts. For findings, the ROUGE-L and F1-RadGraph scores were 26.16 and 28.67, respectively.

The researchers addressed long-standing challenges in radiology automation by leveraging a carefully curated dataset and specialized computational techniques. Their approach balanced computational efficiency with clinical precision, demonstrating the practical feasibility of such systems in real-world settings. Integrating domain-specific encoders and decoders proved instrumental in achieving high-quality outputs, setting a new benchmark for automated radiology reporting.

This research marks a major milestone in biomedical natural language processing. By addressing the complexities of medical imaging, the AIRI and Skoltech team has demonstrated how AI can change radiology workflows. Their findings highlight the need to combine domain-specific models with robust datasets for meaningful progress in automating diagnostic reporting.

Check out the Paper. All credit for this research goes to the researchers of this project.

Researchers from MIT, Google DeepMind, and Oxford Unveil Why Vision-Language Models Do Not Understand Negation and Propose a Groundbreaking Solution

Vision-language models (VLMs) play a crucial role in multimodal tasks like image retrieval, captioning, and medical diagnostics by aligning visual and linguistic data. However, understanding negation in these models remains one of the main challenges. Negation is critical for nuanced applications, such as distinguishing “a room without windows” from “a room with windows.” Despite their advancements, current VLMs fail to interpret negation reliably, severely limiting their effectiveness in high-stakes domains like safety monitoring and healthcare. Addressing this challenge is essential to expand their applicability in real-world scenarios.

Current VLMs, such as CLIP, use shared embedding spaces to align visual and textual representations. Though these models excel in tasks such as cross-modal retrieval and image captioning, their performance falls sharply when dealing with negated statements. This limitation arises from pretraining data bias: the training datasets contain mainly affirmative examples, leading to affirmation bias, where models treat negated and affirmative statements as equivalent. Existing benchmarks such as CREPE and CC-Neg rely on simplistic templated examples that don’t represent the richness and depth of negation in natural language. VLMs tend to collapse the embeddings of negated and affirmative captions, making it extremely challenging to tease apart fine-grained differences between the concepts. This poses a problem for using VLMs in precise language understanding applications, for instance, querying a medical imaging database with complex inclusion and exclusion criteria.

To address these limitations, researchers from MIT, Google DeepMind, and the University of Oxford proposed the NegBench framework to evaluate and improve negation comprehension in VLMs. The framework assesses two fundamental tasks: Retrieval with Negation (Retrieval-Neg), which examines the model’s capacity to retrieve images according to both affirmative and negated specifications, such as “a beach without people,” and Multiple Choice Questions with Negation (MCQ-Neg), which evaluates nuanced comprehension by requiring models to select the appropriate caption from subtle variations. NegBench uses large synthetic datasets, such as CC12M-NegCap and CC12M-NegMCQ, augmented with millions of captions that cover a wide range of negation scenarios. This exposes VLMs to challenging negatives and paraphrased captions, improving the training and evaluation of models. Standard datasets, such as COCO and MSR-VTT, were also adapted to include negated captions and paraphrases, further expanding linguistic diversity and testing robustness. By incorporating varied and complex negation examples, NegBench effectively overcomes existing limitations, significantly enhancing model performance and generalization.

NegBench leverages both real and synthetic datasets to test negation comprehension. Datasets like COCO, VOC2007, and CheXpert were adapted to include negation scenarios, such as “This image includes trees but not buildings.” For MCQs, templates like “This image includes A but not B” were used alongside paraphrased variations for diversity. NegBench is further augmented with the HardNeg-Syn dataset, in which images are synthesized in pairs that differ only in the presence or absence of specific objects, creating difficult cases for negation understanding. Model fine-tuning relied on two training objectives: a contrastive loss that aligns image-caption pairs and improves retrieval, and a multiple-choice loss that encourages fine-grained negation judgments by preferring the correct caption in the MCQ setting. A sketch of how these two objectives can be combined appears below.
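The following is a minimal PyTorch sketch of how a CLIP-style model could be fine-tuned with the two objectives described above. The function names, the temperature value, and the loss weight `alpha` are illustrative assumptions, not the authors’ implementation.

```python
# Minimal PyTorch sketch of the two training objectives described above for a CLIP-style
# model. Function names, the temperature, and the loss weight `alpha` are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-caption pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = image_emb @ caption_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def mcq_loss(image_emb, choice_embs, correct_idx, temperature=0.07):
    """Cross-entropy over caption choices, e.g. affirmative vs. negated variants.

    choice_embs: (B, K, D) embeddings of K candidate captions per image.
    correct_idx: (B,) index of the correct caption for each image.
    """
    image_emb = F.normalize(image_emb, dim=-1).unsqueeze(1)       # (B, 1, D)
    choice_embs = F.normalize(choice_embs, dim=-1)                # (B, K, D)
    logits = (image_emb * choice_embs).sum(dim=-1) / temperature  # (B, K)
    return F.cross_entropy(logits, correct_idx)

def negation_finetune_loss(image_emb, caption_emb, choice_embs, correct_idx, alpha=0.5):
    """Hypothetical combined objective: retrieval alignment plus MCQ discrimination."""
    return contrastive_loss(image_emb, caption_emb) + alpha * mcq_loss(image_emb, choice_embs, correct_idx)
```

In practice, the contrastive term would presumably run over negation-enriched caption batches (such as CC12M-NegCap), while the MCQ term would use the templated choices (such as CC12M-NegMCQ).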

The fine-tuned models showed considerable improvements in retrieval and comprehension tasks when trained on the negation-enriched datasets. For retrieval, recall increased by 10% on negated queries, bringing performance nearly on par with standard retrieval tasks. On the multiple-choice question tasks, accuracy improvements of up to 40% were reported, showing a better ability to differentiate between subtly different affirmative and negated captions. The gains were consistent across a range of datasets, including COCO and MSR-VTT, and on synthetic datasets like HardNeg-Syn, where models handled negation and complex linguistic constructions appropriately. This suggests that representing diverse kinds of negation in training and testing is effective at reducing affirmation bias and improving generalization.

NegBench addresses a critical gap in VLMs as the first work to systematically target their inability to understand negation. It brings significant improvements in retrieval and comprehension tasks by incorporating diverse negation examples into training and evaluation. Such improvements open up avenues for much more robust AI systems capable of nuanced language understanding, with important implications for critical domains like medical diagnostics and semantic content retrieval.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.

Researchers from China Develop Advanced Compression and Learning Techniques to Process Long-Context Videos at 100 Times Less Compute

One of the most significant and advanced capabilities of a multimodal large language model is long-context video modeling, which allows models to handle movies, documentaries, and live streams spanning multiple hours. However, despite the commendable advancements in video comprehension with LLMs, including caption generation and question answering, many obstacles remain in processing extremely long videos. The most crucial of these is capturing the long-range context that such videos carry.

Although much work has already been done in this domain, ranging from training on massive text and frame corpora to building efficient training systems with long-context parallelism and data packing, these super-long multimodal contexts significantly reduce training and inference efficiency. Moreover, the redundancy introduced by frames further complicates model learning. An interesting direction in this field is the compression of video tokens, which shows great potential but involves a trade-off against detailed representation. This article presents the latest research on a new compression method for long-context multimodal modeling.

Researchers from the Shenzhen Institutes of Advanced Technology propose a hierarchical video token compression method (HiCo) with a practical context modeling system, VideoChat-Flash, tailored for processing long-context videos. HiCo addresses the visual redundancies in video information by compressing extended contexts from the clip level to the video level, minimizing computation while preserving all critical data. VideoChat-Flash, in turn, features a multi-stage short-to-long learning scheme along with a rich dataset of real-world long videos. The result is a capable long-video-understanding MLLM, backed by a training infrastructure that supports a high degree of sequence parallelism.

HiCo compresses tokens hierarchically to obtain high-density token representations and widen the context window. The authors sequentially segment long videos into shorter clips and feed them into the MLLM, with compression guided by spatiotemporal redundancies. HiCo further links the compressed tokens to the user query and exploits semantic correlations between clips and real-world embeddings to reduce the token count even further.
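To make the clip-to-video idea concrete, here is a deliberately simplified sketch of two-level token compression. The real HiCo method is query-conditioned and semantics-aware; the average-pooling scheme, the ratios, and the function names below are assumptions for exposition only.

```python
# Deliberately simplified two-level (clip -> video) token compression. The real HiCo is
# query-conditioned and semantics-aware; the average-pooling scheme, ratios, and names
# here are assumptions for exposition only.
import torch

def compress_clip_tokens(tokens, ratio=4):
    """Average-pool groups of `ratio` neighboring tokens: (N, D) -> (~N/ratio, D)."""
    n, d = tokens.shape
    if n < ratio:
        return tokens.mean(dim=0, keepdim=True)
    n_keep = n // ratio
    return tokens[: n_keep * ratio].view(n_keep, ratio, d).mean(dim=1)

def compress_video(clips, clip_ratio=4, video_ratio=2):
    """First compress within each clip, then once more across the whole video."""
    clip_level = [compress_clip_tokens(c, clip_ratio) for c in clips]
    return compress_clip_tokens(torch.cat(clip_level, dim=0), video_ratio)

# Toy example: 10 clips x 1,024 tokens of width 768 -> roughly 1,280 condensed tokens.
clips = [torch.randn(1024, 768) for _ in range(10)]
print(compress_video(clips).shape)  # torch.Size([1280, 768])
```

In this toy setup, ten clips of 1,024 tokens each collapse to roughly 1,280 tokens before reaching the language model, illustrating how hierarchical pooling shrinks the visual context.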

Next comes VideoChat-Flash, which employs a multi-stage short-to-long learning scheme with a corresponding data recipe. The authors begin supervised fine-tuning with short videos and their associated captions and QAs, gradually shift to long videos, and ultimately train on a mixed-length corpus. Short videos prove highly effective for building basic visual perception and for expressing long-video content concisely. The authors provide a massive fine-tuning dataset encompassing 300,000 hours of video with annotations spanning 2 billion words. A sketch of what such a curriculum might look like follows.
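Below is a hedged sketch of a short-to-long curriculum. The stage boundaries, video-length caps, and data descriptions are placeholders rather than the paper’s actual recipe, and `train` and `filter_by_length` are hypothetical helpers.

```python
# Hedged sketch of a short-to-long curriculum. Stage boundaries, length caps, and data
# descriptions are placeholders, not the paper's recipe; `train` and `filter_by_length`
# are hypothetical helpers.
curriculum = [
    {"stage": 1, "max_len_sec": 60,   "data": "short clips with captions and QA"},
    {"stage": 2, "max_len_sec": 600,  "data": "medium-length videos"},
    {"stage": 3, "max_len_sec": 7200, "data": "hour-scale long videos"},
    {"stage": 4, "max_len_sec": None, "data": "mixed-length corpus"},
]

for stage in curriculum:
    # train(model, filter_by_length(dataset, stage["max_len_sec"]))  # hypothetical call
    print(f"Stage {stage['stage']}: supervised fine-tuning on {stage['data']}")
```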

Another innovation proposed in the paper is a modified “Needle in a Haystack” (NIAH) task for multi-hop video configurations. Conventionally, the NIAH task evaluates a model by requiring it to locate an indicated image, find a target word, or answer a question in a video. Here, a target image is typically inserted into video frames, which the model can identify through visual distinction alone, without understanding the context. To close this loophole, the authors propose a new benchmark, “multi-hop needle in a video haystack,” which requires the model to locate a sequence of interconnected indicative images, where subsequent images can only be found by following clues that begin with the first image.
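The snippet below sketches how such a probe might be assembled: a chain of needle frames is scattered through a long video, and each needle is assumed to visually encode a clue pointing to the next. This is an illustrative construction, not the authors’ benchmark code.

```python
# Illustrative construction of a "multi-hop needle in a video haystack" probe; each needle
# image is assumed to visually encode a clue (e.g. a timestamp) pointing to the next one,
# so the model must follow the chain rather than spot a single out-of-place frame.
import random

def build_multihop_haystack(haystack_frames, needle_frames):
    """Scatter an ordered chain of needle frames through a long video.

    Returns the augmented frame list and the final indices of the needles, in hop order.
    """
    frames = list(haystack_frames)
    positions = sorted(random.sample(range(len(frames)), len(needle_frames)))
    needle_indices = []
    for i, (pos, needle) in enumerate(zip(positions, needle_frames)):
        frames.insert(pos + i, needle)   # earlier insertions shift later indices by i
        needle_indices.append(pos + i)
    return frames, needle_indices
```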

The proposed method achieved a computational reduction of up to two orders of magnitude in experiments. VideoChat-Flash demonstrated remarkable performance on both mainstream short- and long-video benchmarks at the 2B and 7B scales. At the 7B scale it surpassed all other methods, establishing a new state of the art in short-video understanding. Even in long-video comprehension, the model outperformed previous open-source MLLMs, achieving SOTA results on several benchmarks. It also exhibited strong temporal grounding capabilities, with zero-shot performance exceeding that of many well-known MLLMs. Additionally, VideoChat-Flash achieved an accuracy of 99.1% on the NIAH task over more than 10,000 frames.

Conclusion: The authors introduced a hierarchical compression technique, HiCo, and VideoChat-Flash, an MLLM trained using an innovative multi-stage scheme. This method advanced compression techniques to reduce computations for long-context videos while surpassing the accuracies of current SOTA models.

Check out the Paper. All credit for this research goes to the researchers of this project.

OmniThink: A Cognitive Framework for Enhanced Long-Form Article Generation Through Iterative Reflection and Expansion

LLMs have made significant strides in automated writing, particularly in tasks like open-domain long-form generation and topic-specific reports. Many approaches rely on Retrieval-Augmented Generation (RAG) to incorporate external information into the writing process. However, these methods often fall short because of fixed retrieval strategies, limiting the generated content’s depth, diversity, and utility. This lack of nuanced and comprehensive exploration results in repetitive, shallow, and unoriginal outputs. While newer methods like STORM and Co-STORM broaden information collection through role-playing and multi-perspective retrieval, they remain confined by static knowledge boundaries and fail to leverage the full potential of LLMs for dynamic and context-aware retrieval.

Unlike humans, who naturally reorganize and refine their cognitive frameworks through reflective practices, machine writing lacks such iterative processes. Reflection-based frameworks like OmniThink aim to address these shortcomings by enabling models to dynamically adjust retrieval strategies and deepen topic understanding. Recent research has highlighted the importance of integrating diverse perspectives and reasoning across multiple sources to generate high-quality outputs. While prior methods, such as multi-turn retrieval and roundtable simulations, have made progress in diversifying information sources, they often fail to adapt flexibly as the model’s understanding evolves.

Researchers from Zhejiang University, Tongyi Lab (Alibaba Group), and the Zhejiang Key Laboratory of Big Data Intelligent Computing introduced OmniThink. This machine-writing framework mimics human cognitive processes of iterative reflection and expansion. OmniThink dynamically adjusts retrieval strategies to gather diverse, relevant information by emulating how learners progressively deepen their understanding. This approach enhances knowledge density while maintaining coherence and depth. Evaluated on the WildSeek dataset using a new “knowledge density” metric, OmniThink demonstrated improved article quality. Human evaluations and expert feedback affirmed its potential for generating insightful, comprehensive, long-form content, addressing key challenges in automated writing.

Open-domain long-form generation entails creating detailed articles by retrieving and synthesizing information from open sources. Traditional methods involve two steps: retrieving topic-related data via search engines and generating an outline before composing the article. However, issues like redundancy and low knowledge density persist. OmniThink addresses this by emulating human-like iterative expansion and reflection, building an information tree and conceptual pool to structure relevant, diverse data. Through a three-step process—information acquisition, outline structuring and article composition—OmniThink ensures logical coherence and rich content. It integrates semantic similarity to retrieve relevant data and refines drafts to produce concise, high-density articles.
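The loop below is a schematic sketch of the expand-and-reflect cycle described above, written against generic `llm` and `search` callables. It is meant to convey the control flow, not to reproduce OmniThink’s actual implementation; the prompts, data structures, and round count are assumptions.

```python
# Schematic sketch of the expand-and-reflect loop, written against generic `llm` and
# `search` callables. Prompts, data structures, and the round count are assumptions,
# not OmniThink's actual implementation.
def omnithink_style_draft(topic, llm, search, max_rounds=3):
    info_tree = {topic: []}   # retrieved material, organized by sub-topic
    concept_pool = set()      # distilled concepts accumulated across rounds

    for _ in range(max_rounds):
        # Expansion: retrieve new material for every sub-topic currently in the tree.
        for subtopic in list(info_tree):
            info_tree[subtopic].extend(search(subtopic))
        # Reflection: ask the model what is still missing and spawn new sub-topics.
        gaps = llm(f"Given notes on '{topic}' and concepts {sorted(concept_pool)}, "
                   "list sub-topics that still need research, one per line.")
        for gap in gaps.splitlines():
            if gap.strip():
                info_tree.setdefault(gap.strip(), [])
        concept_pool.update(llm(f"List the key concepts in: {info_tree}").split(", "))

    outline = llm(f"Write an outline for '{topic}' using concepts: {sorted(concept_pool)}")
    return llm(f"Write the article following this outline:\n{outline}\n\nNotes:\n{info_tree}")
```

The key design point is that retrieval targets are regenerated each round from the model’s own reflection rather than fixed up front, which is what distinguishes this style of writing from static RAG pipelines.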

OmniThink demonstrates outstanding performance in generating articles and outlines, excelling in metrics like relevance, breadth, depth, and novelty, particularly when using GPT-4o. Its dynamic expansion and reflection mechanisms enhance information diversity, knowledge density, and creativity, enabling deeper knowledge exploration. The model’s outline generation improves structural coherence and logical consistency, attributed to its unique Concept Pool design. Human evaluations confirm OmniThink’s superior performance compared to baselines like Co-STORM, especially in breadth. However, subtle improvements in novelty are less evident to human evaluators, highlighting the need for more refined evaluation methods to assess advanced model capabilities accurately.

In conclusion, OmniThink is a machine writing framework that mimics human-like iterative expansion and reflection to produce well-structured, high-quality long-form articles. Unlike traditional retrieval-augmented generation methods, which often result in shallow, redundant, and unoriginal content, OmniThink enhances knowledge density, coherence, and depth by progressively deepening topic understanding, similar to human cognitive learning. As automatic and human evaluations confirm, this model-agnostic approach can integrate with existing frameworks. Future work aims to incorporate advanced methods combining deeper reasoning, role-playing, and human-computer interaction, further addressing challenges in generating informative and diverse long-form content.

Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.

Google AI Introduces ZeroBAS: A Neural Method to Synthesize Binaural Audio from Monaural Audio Recordings and Positional Information without Training on Any Binaural Data

Humans possess an extraordinary ability to localize sound sources and interpret their environment using auditory cues, a phenomenon termed spatial hearing. This capability enables tasks such as identifying speakers in noisy settings or navigating complex environments. Emulating such auditory spatial perception is crucial for enhancing the immersive experience in technologies like augmented reality (AR) and virtual reality (VR). However, the transition from monaural (single-channel) to binaural (two-channel) audio synthesis—which captures spatial auditory effects—faces significant challenges, particularly due to the limited availability of multi-channel and positional audio data.

Traditional mono-to-binaural synthesis approaches often rely on digital signal processing (DSP) frameworks. These methods model auditory effects using components such as the head-related transfer function (HRTF), room impulse response (RIR), and ambient noise, typically treated as linear time-invariant (LTI) systems. Although DSP-based techniques are well-established and can generate realistic audio experiences, they fail to account for the nonlinear acoustic wave effects inherent in real-world sound propagation.
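For context, the DSP baseline described above is essentially a chain of convolutions. The sketch below assumes pre-measured HRIR and RIR arrays are available; loading them and choosing a measurement set are outside its scope.

```python
# Sketch of the classic DSP baseline: convolve the mono signal with left/right head-related
# impulse responses (HRIRs) and optionally a room impulse response, treating the whole
# chain as a linear time-invariant system. The impulse-response arrays are assumed to be
# loaded from a measurement set.
import numpy as np
from scipy.signal import fftconvolve

def dsp_mono_to_binaural(mono, hrir_left, hrir_right, rir=None):
    """mono, hrir_*, rir: 1-D float arrays sampled at the same rate."""
    if rir is not None:
        mono = fftconvolve(mono, rir, mode="full")        # add room reverberation
    left = fftconvolve(mono, hrir_left, mode="full")      # spatialize for the left ear
    right = fftconvolve(mono, hrir_right, mode="full")    # spatialize for the right ear
    return np.stack([left, right], axis=0)                # (2, num_samples)
```

Because the chain is modeled as a linear time-invariant system, it cannot capture the nonlinear propagation effects noted above.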

Supervised learning models have emerged as an alternative to DSP, leveraging neural networks to synthesize binaural audio. However, such models face two major limitations: first, the scarcity of position-annotated binaural datasets, and second, susceptibility to overfitting to specific acoustic environments, speaker characteristics, and training datasets. The need for specialized equipment for data collection further constrains these approaches, making supervised methods costly and less practical.

To address these challenges, researchers from Google have proposed ZeroBAS, a zero-shot neural method for mono-to-binaural speech synthesis that does not rely on binaural training data. This innovative approach employs parameter-free geometric time warping (GTW) and amplitude scaling (AS) techniques based on source position. These initial binaural signals are further refined using a pretrained denoising vocoder, yielding perceptually realistic binaural audio. Remarkably, ZeroBAS generalizes effectively across diverse room conditions, as demonstrated using the newly introduced TUT Mono-to-Binaural dataset, and achieves performance comparable to, or even better than, state-of-the-art supervised methods on out-of-distribution data.

The ZeroBAS framework comprises a three-stage architecture as follows:

In stage 1, geometric time warping (GTW) transforms the monaural input into two channels (left and right) by simulating interaural time differences (ITD) based on the relative positions of the sound source and the listener’s ears. GTW computes the propagation delay for each ear channel, and the warped signals are then linearly interpolated to produce the initial binaural channels.

In stage 2, amplitude scaling (AS) enhances the spatial realism of the warped signals by simulating the interaural level difference (ILD) based on the inverse-square law. Human perception of sound spatiality relies on both ITD and ILD, with the latter dominant for high-frequency sounds. Using the Euclidean distances of the source from the left and right ears, the amplitude of each channel is scaled accordingly.

Stage 3 involves iterative refinement of the warped and scaled signals using a pretrained denoising vocoder, WaveFit. This vocoder leverages log-mel spectrogram features and denoising diffusion probabilistic models (DDPMs) to generate clean binaural waveforms. By applying the vocoder iteratively, the system mitigates acoustic artifacts and ensures high-quality binaural audio output.
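Stages 1 and 2 are parameter-free, so they can be sketched directly. The snippet below is an illustrative reading of them; the speed-of-sound constant, the interpolation scheme, and the distance-based attenuation are assumptions, and stage 3 is omitted because it depends on the pretrained WaveFit vocoder.

```python
# Illustrative reading of stages 1-2. The speed-of-sound constant, the linear-interpolation
# warp, and the 1/d amplitude attenuation (the inverse-square law governs intensity, so the
# paper's exact scaling may differ) are assumptions; stage 3 (WaveFit refinement) is omitted.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def warp_and_scale(mono, sr, src_pos, left_ear, right_ear):
    """mono: 1-D waveform at sample rate sr; positions are 3-D coordinates in meters."""
    d_left = np.linalg.norm(np.asarray(src_pos, float) - np.asarray(left_ear, float))
    d_right = np.linalg.norm(np.asarray(src_pos, float) - np.asarray(right_ear, float))

    t = np.arange(len(mono)) / sr
    # Stage 1 (GTW): delay each channel by its propagation time to simulate the ITD.
    left = np.interp(t - d_left / SPEED_OF_SOUND, t, mono, left=0.0)
    right = np.interp(t - d_right / SPEED_OF_SOUND, t, mono, left=0.0)

    # Stage 2 (AS): attenuate each channel with distance to approximate the ILD.
    left /= max(d_left, 1e-3)
    right /= max(d_right, 1e-3)
    return np.stack([left, right], axis=0)
```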

ZeroBAS was evaluated on two datasets: the Binaural Speech dataset and the newly introduced TUT Mono-to-Binaural dataset, the latter designed to test the generalization of mono-to-binaural synthesis methods in diverse acoustic environments (results are reported in Tables 1 and 2 of the paper). In objective evaluations, ZeroBAS demonstrated significant improvements over DSP baselines and approached the performance of supervised methods despite not being trained on binaural data. Notably, ZeroBAS achieved superior results on the out-of-distribution TUT dataset, highlighting its robustness across varied conditions.

Subjective evaluations further confirmed the efficacy of ZeroBAS. Mean Opinion Score (MOS) assessments showed that human listeners rated ZeroBAS’s outputs as slightly more natural than those of supervised methods. In MUSHRA evaluations, ZeroBAS achieved comparable spatial quality to supervised models, with listeners unable to discern statistically significant differences.

Even though this method is quite remarkable, it does have some limitations. ZeroBAS struggles to directly process phase information because the vocoder lacks positional conditioning, and it relies on general models instead of environment-specific ones. Despite these constraints, its ability to generalize effectively highlights the potential of zero-shot learning in binaural audio synthesis.

In conclusion, ZeroBAS offers a fascinating, room-agnostic approach to binaural speech synthesis that achieves perceptual quality comparable to supervised methods without requiring binaural training data. Its robust performance across diverse acoustic environments makes it a promising candidate for real-world applications in AR, VR, and immersive audio systems.

Check out the Paper and Details. All credit for this research goes to the researchers of this project.

Microsoft Presents a Comprehensive Framework for Securing Generative AI Systems Using Lessons from Red Teaming 100 Generative AI Products

The rapid advancement and widespread adoption of generative AI systems across various domains have increased the critical importance of AI red teaming for evaluating technology safety and security. While AI red teaming aims to evaluate end-to-end systems by simulating real-world attacks, current methodologies face significant challenges in effectiveness and implementation. The complexity of modern AI systems, with their expanding capabilities across multiple modalities including vision and audio, has created an unprecedented array of potential vulnerabilities and attack vectors. Moreover, integrating agentic systems that grant AI models higher privileges and access to external tools has substantially increased the attack surface and potential impact of security breaches.

Current approaches to AI security have revealed significant limitations in addressing both traditional and emerging vulnerabilities. Traditional security assessment methods mainly focus on model-level risks while overlooking critical system-level vulnerabilities that often prove more exploitable. Moreover, AI systems utilizing retrieval augmented generation (RAG) architectures have shown susceptibility to cross-prompt injection attacks, where malicious instructions hidden in documents can manipulate model behavior and facilitate data exfiltration. While some defensive techniques like input sanitization and instruction hierarchies offer partial solutions, they cannot eliminate security risks due to the fundamental limitations of language models.
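As a concrete illustration of the input-sanitization idea mentioned above, and of why it is only a partial defense, here is a toy filter that flags retrieved RAG chunks containing instruction-like text. The patterns are illustrative assumptions, and simple matching of this kind is easy for an attacker to evade.

```python
# Toy input-sanitization filter: flag retrieved RAG chunks containing instruction-like text
# before they reach the model. The patterns are illustrative, and matching of this kind is
# easy for an attacker to evade, which is why it is only a partial defense.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"disregard the (system|above) prompt",
    r"you are now",
    r"exfiltrate|send .+ to http",
]

def flag_suspicious_chunks(retrieved_chunks):
    """Split a batch of retrieved document chunks into (clean, flagged) lists."""
    clean, flagged = [], []
    for chunk in retrieved_chunks:
        if any(re.search(p, chunk, flags=re.IGNORECASE) for p in INJECTION_PATTERNS):
            flagged.append(chunk)
        else:
            clean.append(chunk)
    return clean, flagged
```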

Researchers from Microsoft have proposed a comprehensive framework for AI red teaming based on their extensive experience testing over 100 generative AI products. Their approach introduces a structured threat model ontology designed to systematically identify and evaluate traditional and emerging security risks in AI systems. The framework encompasses eight key lessons from real-world operations, ranging from fundamental system understanding to integrating automation in security testing. This methodology addresses the growing complexity of AI security by combining systematic threat modeling with practical insights derived from actual red teaming operations. The approach emphasizes the importance of considering both system-level and model-level vulnerabilities.

The operational architecture of Microsoft’s AI red teaming framework uses a dual-focus approach targeting both standalone AI models and integrated systems. The framework distinguishes between cloud-hosted models and complex systems that incorporate those models into applications such as copilots and plugins. The methodology has evolved significantly since 2021, expanding from security-focused assessments to comprehensive responsible AI (RAI) impact evaluations. The testing protocol maintains rigorous coverage of traditional security concerns, including data exfiltration, credential leaking, and remote code execution, while simultaneously addressing AI-specific vulnerabilities.

The effectiveness of Microsoft’s red teaming framework has been demonstrated through a comparative analysis of attack methodologies. The findings challenge conventional assumptions about the necessity of sophisticated techniques, revealing that simpler approaches often match or exceed the effectiveness of complex gradient-based methods. The research also highlights the advantage of system-level attack approaches over model-specific tactics, a conclusion supported by real-world evidence showing that attackers typically exploit combinations of simple vulnerabilities across system components rather than mounting complex model-level attacks. These results emphasize the importance of adopting a holistic security perspective that considers both AI-specific and traditional system vulnerabilities.

In conclusion, researchers from Microsoft have proposed a comprehensive framework for AI red teaming. The framework, developed through testing more than 100 generative AI products, provides valuable insights into effective risk evaluation methodologies. The combination of a structured threat model ontology with practical lessons learned offers a robust foundation for organizations developing their own AI security assessment protocols and provides essential guidance for addressing real-world vulnerabilities. The framework’s emphasis on practical, implementable solutions positions it as a valuable resource for organizations, research institutions, and governments working to establish effective AI risk assessment protocols.

Check out the Paper. All credit for this research goes to the researchers of this project.