Inheritune: An Effective AI Training Approach for Developing Smaller and High-Performing Language Models

LLMs leverage the transformer architecture, particularly the self-attention mechanism, to achieve high performance on natural language processing tasks. However, as these models grow deeper, many of their later layers exhibit “attention degeneration,” where the attention matrices collapse toward rank-1 and concentrate on a single column. These “lazy layers” become redundant because they fail to learn meaningful representations. This issue has been observed in GPT-2 models, where deeper layers lose effectiveness and limit the model’s capacity to improve with added depth. The phenomenon, however, remains underexplored in standard LLMs.

Various studies have explored attention degeneration, primarily focusing on attention rank and entropy collapse, which cause representation issues and training instability. Previous research has suggested methods to address these problems, such as adjusting residual connections or adding tokens to sequences, though these methods often slow training. In contrast, this work proposes smaller, more efficient models that avoid such structural inefficiencies while matching the performance of larger models. Other techniques, such as stacking, knowledge distillation, and weight initialization, have also improved training for language models, although several of them were developed primarily for vision models.

Researchers from the University of Texas at Austin and New York University introduced “Inheritune,” a method for training smaller, efficient language models without sacrificing performance. Inheritune inherits the early transformer layers from a larger pre-trained model, retrains the resulting model, and progressively expands it until it matches or surpasses the original model’s performance. This approach targets the inefficiency of deeper layers, where attention degeneration produces lazy layers. In experiments on datasets such as OpenWebText and FineWeb_Edu, models trained with Inheritune match or surpass larger models and baselines while using fewer layers.

In transformer-based models like GPT-2, deeper layers often exhibit attention degeneration, where attention matrices collapse toward rank-1, producing uniform, unfocused token relationships. The affected layers, termed “lazy layers,” contribute little to model performance. To address this, the researchers developed Inheritune, which initializes a smaller model by inheriting the early layers of a larger pre-trained model and then progressively expands it through training. Despite using fewer layers, models trained with Inheritune maintain focused attention patterns, avoid attention degeneration, and match or outperform larger models. This approach is validated through experiments on GPT-2 variants and large datasets, yielding efficient performance improvements.
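
To make the procedure concrete, the following is a minimal sketch of the layer-inheritance step, assuming Hugging Face Transformers GPT-2 checkpoints. The number of inherited blocks k and the choice of which sub-modules to copy are illustrative assumptions; this is not the authors' released implementation.

# Sketch of Inheritune's initialization step (illustrative, not the authors' code).
from transformers import GPT2Config, GPT2LMHeadModel

def inherit_early_layers(teacher_name="gpt2-large", k=18):
    """Build a smaller GPT-2 whose first k transformer blocks are copied from a larger model."""
    teacher = GPT2LMHeadModel.from_pretrained(teacher_name)

    config = GPT2Config.from_pretrained(teacher_name)
    config.n_layer = k                      # keep only the first k transformer blocks
    student = GPT2LMHeadModel(config)

    # Copy token/position embeddings and the final layer norm (an assumption of this sketch).
    student.transformer.wte.load_state_dict(teacher.transformer.wte.state_dict())
    student.transformer.wpe.load_state_dict(teacher.transformer.wpe.state_dict())
    student.transformer.ln_f.load_state_dict(teacher.transformer.ln_f.state_dict())

    # Inherit the first k blocks (attention and MLP weights).
    for i in range(k):
        student.transformer.h[i].load_state_dict(teacher.transformer.h[i].state_dict())
    return student

# The student is then retrained on the target data and, if needed, progressively grown by
# appending further blocks until it matches the teacher's validation loss.
student = inherit_early_layers("gpt2-large", k=18)
print(sum(p.numel() for p in student.parameters()), "parameters")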

The researchers conducted extensive experiments on Inheritune using GPT-2 xlarge, large, and medium models pre-trained on the OpenWebText dataset. They compared models trained with Inheritune against three baselines: random initialization, zero-shot initialization techniques, and knowledge distillation. Inheritune models consistently outperformed baselines across various sizes, showing comparable or better validation losses with fewer layers. Ablation studies demonstrated that initializing attention and MLP weights provided the best results. Even when trained without data repetition, Inheritune models converged faster, achieving similar validation losses as larger models, confirming its efficiency in reducing model size while maintaining performance.

The study identifies a flaw in deep decoder-style transformers, commonly used in LLMs, where attention matrices in deeper layers lose rank, leading to inefficient “lazy layers.” The proposed Inheritune method transfers early layers from a larger pre-trained model and progressively trains smaller models to address this. Inheritune achieves the same performance as larger models with fewer layers, as demonstrated on GPT-2 models trained on datasets like OpenWebText-9B and FineWeb_Edu. 


Stanford Researchers Propose LoLCATS: A Cutting Edge AI Method for Efficient LLM Linearization

The problem with efficiently linearizing large language models (LLMs) is multifaceted. The quadratic attention mechanism in traditional Transformer-based LLMs, while powerful, is computationally expensive and memory-intensive. Existing methods that try to linearize these models by replacing quadratic attention with subquadratic analogs face significant challenges: they often lead to degraded performance, incur high computational costs, and lack scalability. The main challenge is how to maintain high model quality while making the linearization process more efficient and scalable for very large models, including those beyond 70 billion parameters.

Researchers from Stanford University, Together AI, California Institute of Technology, and MIT introduced LoLCATS (Low-rank Linear Conversion via Attention Transfer). LoLCATS is a two-step method designed to efficiently improve the quality of linearized large language models without the need for expensive retraining on billions of tokens. The core idea behind LoLCATS is to first train linear attention mechanisms to match the softmax attentions of the original model using a mean squared error (MSE) loss in a process called “attention transfer.” Then, low-rank adaptation (LoRA) is employed to correct any residual errors in approximation, allowing the model to achieve high-quality predictions with significantly reduced computational costs. This method makes it feasible to create linearized versions of very large models, like Llama 3 8B and Mistral 7B, with minimal overhead.

The structure of LoLCATS involves two main stages. The first stage, attention transfer, focuses on training the linear attention to closely approximate the output of softmax attention. The researchers achieved this by parameterizing the linear attention using learnable feature maps, which are optimized to minimize the output discrepancy between the linear and softmax mechanisms. The second stage, low-rank linearizing, further improves model performance by leveraging LoRA to make small, low-rank adjustments to the linearized layers. This step compensates for the quality gaps that might emerge after the initial linearization. The LoLCATS framework also employs a block-by-block training approach, particularly for larger models, to make the process scalable and more memory-efficient.
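
As a rough illustration of the first stage, the following sketch trains a learnable feature map so that a linear attention output matches a frozen softmax attention output under an MSE loss. A single head, no causal masking, random inputs in place of the frozen LLM's activations, and a softplus feature map are simplifying assumptions; this is not the LoLCATS implementation.

# Sketch of "attention transfer": fit a feature map so linear attention mimics softmax attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMap(nn.Module):
    def __init__(self, head_dim):
        super().__init__()
        self.proj = nn.Linear(head_dim, head_dim)

    def forward(self, x):
        # Positive features keep the linear-attention normalizer well defined.
        return F.softplus(self.proj(x))

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, phi):
    q_f, k_f = phi(q), phi(k)                                   # (B, T, d)
    kv = k_f.transpose(-2, -1) @ v                              # (B, d, d)
    z = q_f @ k_f.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (B, T, 1) normalizer
    return (q_f @ kv) / (z + 1e-6)

B, T, d = 2, 128, 64
phi = FeatureMap(d)
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

for step in range(100):
    # In the real method, q, k, v come from the frozen pre-trained LLM's attention layers.
    q, k, v = torch.randn(B, T, d), torch.randn(B, T, d), torch.randn(B, T, d)
    target = softmax_attention(q, k, v).detach()    # frozen softmax attention output
    pred = linear_attention(q, k, v, phi)
    loss = F.mse_loss(pred, target)                 # "attention transfer" objective
    opt.zero_grad()
    loss.backward()
    opt.step()

Stage two then applies LoRA to the swapped-in linear-attention layers to correct the residual approximation error.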

The results presented in the research demonstrate significant improvements over prior linearization methods. For example, LoLCATS successfully closed the performance gap between linearized and original Transformer models by up to 78% on a standard benchmark (5-shot MMLU). The researchers also highlight that LoLCATS achieved these improvements while only using 0.2% of the model parameters and 0.4% of the training tokens required by previous methods. Additionally, LoLCATS is the first method that was successfully used to linearize extremely large models, such as Llama 3 70B and 405B, enabling a considerable reduction in computational cost and time compared to earlier approaches.

Conclusion

LoLCATS presents a compelling solution to the problem of linearizing large language models by significantly reducing the memory and compute requirements without compromising on quality. By introducing the two-step process of attention transfer followed by low-rank adaptation, this research enables the efficient conversion of large Transformer models into linearized versions that retain their powerful capabilities. This breakthrough could lead to more accessible and cost-effective deployment of LLMs, making them feasible for a broader range of applications. The implementation details, along with the code, are available on GitHub, allowing others to build upon and apply this method to other large-scale models.


Create a multimodal chatbot tailored to your unique dataset with Amazon Bedrock FMs

With recent advances in large language models (LLMs), a wide array of businesses are building new chatbot applications, either to help their external customers or to support internal teams. For many of these use cases, businesses are building Retrieval Augmented Generation (RAG) style chat-based assistants, where a powerful LLM can reference company-specific documents to answer questions relevant to a particular business or use case.
In the last few months, there has been substantial growth in the availability and capabilities of multimodal foundation models (FMs). These models are designed to understand and generate text about images, bridging the gap between visual information and natural language. Although such multimodal models are broadly useful for answering questions and interpreting imagery, they’re limited to only answering questions based on information from their own training document dataset.
In this post, we show how to create a multimodal chat assistant on Amazon Web Services (AWS) using Amazon Bedrock models, where users can submit images and questions, and text responses will be sourced from a closed set of proprietary documents. Such a multimodal assistant can be useful across industries. For example, retailers can use this system to more effectively sell their products (for example, HDMI_adaptor.jpeg, “How can I connect this adapter to my smart TV?”). Equipment manufacturers can build applications that allow them to work more effectively (for example, broken_machinery.png, “What type of piping do I need to fix this?”). This approach is broadly effective in scenarios where image inputs are important to query a proprietary text dataset. In this post, we demonstrate this concept on a synthetic dataset from a car marketplace, where a user can upload a picture of a car, ask a question, and receive responses based on the car marketplace dataset.
Solution overview
For our custom multimodal chat assistant, we start by creating a vector database of relevant text documents that will be used to answer user queries. Amazon OpenSearch Service is a powerful, highly flexible search engine that allows users to retrieve data based on a variety of lexical and semantic retrieval approaches. This post focuses on text-only documents, but for embedding more complex document types, such as those with images, see Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker.
After the documents are ingested in OpenSearch Service (this is a one-time setup step), we deploy the full end-to-end multimodal chat assistant using an AWS CloudFormation template. The following system architecture represents the logic flow when a user uploads an image, asks a question, and receives a text response grounded by the text dataset stored in OpenSearch.

The logic flow for generating an answer to a combined text and image query is as follows:

Steps 1 and 2 – To start, a user query and corresponding image are routed through an Amazon API Gateway connection to an AWS Lambda function, which serves as the processing and orchestrating compute for the overall process.
Step 3 – The Lambda function stores the query image in Amazon S3 with a specified ID. This may be useful for later chat assistant analytics.
Steps 4–8 – The Lambda function orchestrates a series of Amazon Bedrock calls to a multimodal model, an LLM, and a text embedding model (a hedged code sketch of these calls follows the list of steps):

Query the Claude V3 Sonnet model with the query and image to produce a text description.
Embed a concatenation of the original question and the text description with the Amazon Titan Text Embeddings model.
Retrieve relevant text data from OpenSearch Service.
Generate a grounded response to the original question based on the retrieved documents.

Step 9 – The Lambda function stores the user query and answer in Amazon DynamoDB, linked to the Amazon S3 image ID.
Steps 10 and 11 – The grounded text response is sent back to the client.
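
The following is a hedged sketch of how a Lambda function could implement Steps 4 through 8 with boto3. The model IDs, prompt wording, and the retrieve_from_opensearch helper are illustrative assumptions, not the code deployed by the CloudFormation stack.

# Hedged sketch of the Bedrock orchestration in Steps 4-8 (not the deployed Lambda code).
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def describe_image(image_bytes, question):
    """Step 4: ask Claude 3 Sonnet for a text description of the image, guided by the question."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg",
                                             "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text", "text": f"Describe this image so it can be used to answer: {question}"},
            ],
        }],
    }
    resp = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"]

def embed(text):
    """Step 5: embed the question plus image description with Amazon Titan Text Embeddings."""
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

def answer(question, image_bytes):
    description = describe_image(image_bytes, question)
    embedding = embed(f"{question}\n{description}")
    # Step 6: retrieve the most similar documents from the OpenSearch vector index.
    documents = retrieve_from_opensearch(embedding)  # hypothetical helper, elided here
    # Steps 7-8: generate an answer grounded in the retrieved documents.
    prompt = (f"Answer using only these documents:\n{documents}\n\n"
              f"Question: {question}\nImage description: {description}")
    body = {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512,
            "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}]}
    resp = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
    return json.loads(resp["body"].read())["content"][0]["text"]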

There is also an initial setup of the OpenSearch Index, which is done using an Amazon SageMaker notebook.
Prerequisites
To use the multimodal chat assistant solution, you need to have a handful of Amazon Bedrock FMs available.

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Manage model access.
Activate all the Anthropic models, including Claude 3 Sonnet, as well as the Amazon Titan Text Embeddings V2 model, as shown in the following screenshot.

For this post, we recommend activating these models in the us-east-1 or us-west-2 AWS Region. These should become immediately active and available.

Simple deployment with AWS CloudFormation
To deploy the solution, we provide a simple shell script called deploy.sh, which can be used to deploy the end-to-end solution in different Regions. This script can be acquired directly from Amazon S3 using aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-16363/deploy.sh .
Using the AWS Command Line Interface (AWS CLI), you can deploy this stack in various Regions using one of the following commands:

bash deploy.sh us-east-1

or

bash deploy.sh us-west-2

The stack may take up to 10 minutes to deploy. When the stack is complete, note the assigned physical ID of the Amazon OpenSearch Serverless collection, which you will use in further steps. It should look something like zr1b364emavn65x5lki8. Also, note the physical ID of the API Gateway connection, which should look something like zxpdjtklw2, as shown in the following screenshot.

Populate the OpenSearch Service index
Although the OpenSearch Serverless collection has been instantiated, you still need to create and populate a vector index with the document dataset of car listings. To do this, you use an Amazon SageMaker notebook.

On the SageMaker console, navigate to the newly created SageMaker notebook named MultimodalChatbotNotebook (as shown in the following image), which will come prepopulated with car-listings.zip and Titan-OS-Index.ipynb.

After you open the Titan-OS-Index.ipynb notebook, change the host_id variable to the collection physical ID you noted earlier.

Run the notebook from top to bottom to create and populate a vector index with a dataset of 10 car listings.
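
Conceptually, the notebook performs two operations: it creates a k-NN vector index in the OpenSearch Serverless collection and then ingests each car listing together with its Amazon Titan embedding. The following sketch illustrates those operations; the index name, field names, embedding dimension, and the car_listings variable are assumptions rather than values taken from Titan-OS-Index.ipynb.

# Conceptual sketch of what the index notebook does (names and dimensions are assumptions).
import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

host_id = "zr1b364emavn65x5lki8"          # physical ID of your OpenSearch Serverless collection
region = "us-east-1"
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")

client = OpenSearch(
    hosts=[{"host": f"{host_id}.{region}.aoss.amazonaws.com", "port": 443}],
    http_auth=auth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Vector index: one knn_vector field for the Titan embedding plus the listing text.
client.indices.create(index="car-listings", body={
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "embedding": {"type": "knn_vector", "dimension": 1024},
        "listing_text": {"type": "text"},
    }},
})

bedrock = boto3.client("bedrock-runtime")
for listing in car_listings:              # car_listings: the 10 listing documents (assumed variable)
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                body=json.dumps({"inputText": listing}))
    embedding = json.loads(resp["body"].read())["embedding"]
    client.index(index="car-listings", body={"embedding": embedding, "listing_text": listing})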

After you run the code to populate the index, it may still take a few minutes before the index shows up as populated on the OpenSearch Service console, as shown in the following screenshot. 
Test the Lambda function
Next, test the Lambda function created by the CloudFormation stack by submitting a test event JSON. In the following JSON, replace the bucket value with the name of the bucket created when you deployed the solution, for example, multimodal-chatbot-deployment-ACCOUNT_NO-REGION.

{
    "bucket": "multimodal-chatbot-deployment-ACCOUNT_NO-REGION",
    "key": "jeep.jpg",
    "question_text": "How much would a car like this cost?"
}

You can set up this test by navigating to the Test panel for the created Lambda function and defining a new test event with the preceding JSON. Then, choose Test on the top right of the event definition.
If you are querying the Lambda function from another bucket than those allowlisted in the CloudFormation template, make sure to add the relevant permissions to the Lambda execution role.
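
If you prefer to invoke the function programmatically instead of from the console, a hedged boto3 equivalent of the same test looks like the following; the function name is a placeholder that you should replace with the name created by the CloudFormation stack.

# Programmatic version of the same test event (function name is a placeholder).
import json
import boto3

lambda_client = boto3.client("lambda")
event = {
    "bucket": "multimodal-chatbot-deployment-ACCOUNT_NO-REGION",
    "key": "jeep.jpg",
    "question_text": "How much would a car like this cost?",
}
response = lambda_client.invoke(
    FunctionName="<MULTIMODAL_CHATBOT_LAMBDA>",
    Payload=json.dumps(event),
)
print(json.loads(response["Payload"].read()))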

The Lambda function may take between 10–20 seconds to run (mostly dependent on the size of your image). If the function performs properly, you should receive an output JSON similar to the following code block. The following screenshot shows the successful output on the console.

{
    "statusCode": 200,
    "body": "\"Based on the 2013 Jeep Grand Cherokee SRT8 listing, a heavily modified Jeep like the one described could cost around $17,000 even with significant body damage and high mileage. The powerful engine, custom touches, and off-road capabilities likely justify that asking price.\""
}

Note that if you just enabled model access, it may take a few minutes for access to propagate to the Lambda function.
Test the API
For integration into an application, we’ve connected the Lambda function to an API Gateway connection that can be pinged from various devices. We’ve included a notebook within the SageMaker notebook that allows you to query the system with a question and an image and return a response. To use the notebook, replace the API_GW variable with the physical ID of the API Gateway connection that was created using the CloudFormation stack and the REGION variable with the Region your infrastructure was deployed in. Then, making sure your image location and query are set correctly, run the notebook cell. Within 10–20 seconds, you should receive the output of your multimodal query sourced from your own text dataset. This is shown in the following screenshot.
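
For reference, the following is a hedged sketch of the kind of request the notebook sends. API_GW and REGION come from your deployment, while the stage name and resource path are placeholders rather than values taken from the CloudFormation template.

# Hedged sketch of querying the deployed API Gateway endpoint (stage/path are placeholders).
import requests

API_GW = "zxpdjtklw2"                    # physical ID of the API Gateway connection
REGION = "us-east-1"
STAGE_AND_PATH = "<STAGE>/<RESOURCE>"    # placeholder; check your API Gateway console

url = f"https://{API_GW}.execute-api.{REGION}.amazonaws.com/{STAGE_AND_PATH}"
payload = {
    "bucket": "multimodal-chatbot-deployment-ACCOUNT_NO-REGION",
    "key": "jeep.jpg",
    "question_text": "How much would a car like this cost?",
}
response = requests.post(url, json=payload, timeout=60)
print(response.json())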

Note that the API Gateway connection is only accessible from this specific notebook, and more comprehensive security and permission elements are required to productionize the system.
Qualitative results
A grounded multimodal chat assistant, where users can submit images with queries, can be useful in many settings. We demonstrate this application with a dataset of cars for sale. For example, a user may have a question about a car they’re looking at, so they snap a picture and submit a question, such as “How much might a car like this cost?” Rather than answering the question with generic information that the LLM was trained on (which may be out of date), responses will be grounded with your local and specific car sales dataset. In this use case, we took images from Unsplash and used a synthetically created dataset of 10 car listings to answer questions. The model and year of the 10 car listings are shown in the following screenshot.

For the examples in the following table, you can observe that the vision language model (VLM) system not only identified the correct cars in the listings most similar to the input image, but also answered the questions with specific numbers, costs, and locations that are only available in our closed car-listings dataset, car-listings.zip.

(Each question below was submitted together with an image of the corresponding car; the images are not reproduced here.)

Question: How much would a car like this cost?
Answer: The 2013 Jeep Grand Cherokee SRT8 listing is most relevant, with an asking price of $17,000 despite significant body damage from an accident. However, it retains the powerful 470 hp V8 engine and has been well-maintained with service records.

Question: What is the engine size of this car?
Answer: The car listing for the 2013 Volkswagen Beetle mentions it has a fuel-efficient 1.8L turbocharged engine. No other engine details are provided in the listings.

Question: Where in the world could I purchase a used car like this?
Answer: Based on the car listings provided, the 2021 Tesla Model 3 for sale seems most similar to the car you are interested in. It’s described as a low mileage, well-maintained Model 3 in pristine condition located in the Seattle area for $48,000.

Latency and quantitative results
Because speed and latency are important for chat assistants and because this solution consists of multiple API calls to FMs and data stores, it’s interesting to measure the speed of each step in the process. We did an internal analysis of the relative speeds of the various API calls, and the following graph visualizes the results.

From slowest to fastest, we have the call to the Claude V3 Vision FM, which takes on average 8.2 seconds. The final output generation step (LLM Gen on the graph in the screenshot) takes on average 4.9 seconds. The Amazon Titan Text Embeddings model and OpenSearch Service retrieval process are much faster, taking 0.28 and 0.27 seconds on average, respectively.
In these experiments, the average time for the full multistage multimodal chatbot is 15.8 seconds. However, the time can be as low as 11.5 seconds overall if you submit a 2.2 MB image, and it could be much lower if you use even lower-resolution images.
Clean up
To clean up the resources and avoid charges, follow these steps:

Make sure all the important data from Amazon DynamoDB and Amazon S3 is saved.
Manually empty and delete the two provisioned S3 buckets.
Delete the deployed resource stack from the CloudFormation console.

Conclusion
In applications ranging from online chat assistants to tools that help sales reps close deals, AI assistants are a rapidly maturing technology for increasing efficiency across sectors. Often these assistants aim to produce answers grounded in custom documentation and datasets that the LLM was not trained on, using RAG. A natural next step is a multimodal chat assistant that can do the same: answering questions that include images based on a closed text dataset.
In this post, we demonstrated how to create a multimodal chat assistant that takes images and text as input and produces text answers grounded in your own dataset. This solution will have applications ranging from marketplaces to customer service, where there is a need for domain-specific answers sourced from custom datasets based on multimodal input queries.
We encourage you to deploy the solution for yourself, try different image and text datasets, and explore how you can orchestrate various Amazon Bedrock FMs to produce streamlined, custom, multimodal systems.

About the Authors
Emmett Goodman is an Applied Scientist at the Amazon Generative AI Innovation Center. He specializes in computer vision and language modeling, with applications in healthcare, energy, and education. Emmett holds a PhD in Chemical Engineering from Stanford University, where he also completed a postdoctoral fellowship focused on computer vision and healthcare.
Negin Sokhandan is a Principal Applied Scientist at the AWS Generative AI Innovation Center, where she works on building generative AI solutions for AWS strategic customers. Her research background is in statistical inference, computer vision, and multimodal systems.
Yanxiang Yu is an Applied Scientist at the Amazon Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.

Design secure generative AI application workflows with Amazon Verified Permissions and Amazon Bedrock Agents

Amazon Bedrock Agents enable generative AI applications to perform multistep tasks across various company systems and data sources. They orchestrate and analyze the tasks and break them down into the correct logical sequences using the reasoning abilities of the foundation model (FM). Agents automatically call the necessary APIs to interact with the company systems and processes to fulfill the request. Throughout this process, agents determine whether they can proceed or if additional information is needed.
Customers can build innovative generative AI applications using Amazon Bedrock Agents’ capabilities to intelligently orchestrate their application workflows. When building such workflows, it can be challenging for customers to apply fine-grained access controls to make sure that the application’s workflow operates only on the authorized data based on the application user’s entitlements. Controlling access to resources based on user context, roles, actions and resource conditions can be challenging to maintain in an application workflow because that would require hardcoding several rules in your application or building your own authorization system to externalize those rules.
Instead of building your own authorization system for fine-grained access controls in your application workflows, you can integrate Amazon Verified Permissions into the agent’s workflow to apply contextually aware fine-grained access controls. Verified Permissions is a scalable permissions management and authorization service for custom applications built by you. Verified Permissions helps developers build secure applications faster by externalizing the authorization component and centralizing policy management and administration.
In this post, we demonstrate how to design fine-grained access controls using Verified Permissions for a generative AI application that uses Amazon Bedrock Agents to answer questions about insurance claims that exist in a claims review system using textual prompts as inputs and outputs. In our insurance claims system use case, there are two types of users: claims administrators and claims adjusters. Both are capable of listing open claims, but only one is capable of reading claim detail and making changes. We also show how to restrict permissions using custom attributes such as a user’s region for filtering insurance claims. In this post, the term region doesn’t refer to an AWS Region, but rather to a business-defined region.
Solution overview
In this solution design, we assume that the customer has claims records in an Amazon DynamoDB table and would like to build a chat-based application to answer frequently asked questions about their claims. This chat assistant will be used internally by claims administrators and claims adjusters to answer their clients’ questions.
The following is a list of actions that the claims team needs to perform to answer their clients’ questions:

Show me a list of my open claims
Show me claim detail for an input claim number
Update the status to closed for the input claim number

The customer has the following access control requirements for their claims system:

A claims administrator can list claims across various geographic areas, but they can’t read individual claim records
A claims adjuster can list claims for their region and can read and update the records of claims assigned to them. However, a claims adjuster can’t access claims from other regions.
Each user is placed into a group in Amazon Cognito, where their application-level permissions are set and maintained
The customer would like to use Verified Permissions to externalize entity and record level authorization decisions without hard coding the application logic

To improve the performance of the chat assistant, the customer uses FMs available on Amazon Bedrock. To retrieve the necessary information from the claims table and dynamically orchestrate the requests, the customer uses Amazon Bedrock Agents together with Verified Permissions to provide fine-grained authorization for the agents’ invocation.
The application architecture for building the example chat-based Generative AI Claims application with fine-grained access controls is shown in the following diagram.

The application architecture flow is as follows:

User accesses the Generative AI Claims web application (App).
The App authenticates the user with the Amazon Cognito service and issues an ID token and an access token. The ID token contains the user’s identity and custom attributes.
Using the App, the user sends a request asking to “list the open claims.” The request is sent along with the user’s ID token and access token. The App calls the Claims API Gateway API to run the claims proxy passing user requests and tokens.
Claims API Gateway runs the Custom Authorizer to validate the access token.
When access token validation is successful, the Claims API Gateway sends the user request to the Claims Proxy.
The Claims Proxy invokes the Amazon Bedrock agent, passing the user request and ID token. The Amazon Bedrock agent is configured to use Anthropic’s Claude model and to invoke actions using the Claims Agent Helper AWS Lambda function.
The Amazon Bedrock agent uses chain-of-thought prompting and builds the list of API actions to run with the help of the Claims Agent Helper.
The Claims Agent Helper retrieves claim records from the Claims DB and constructs a claims list object. For this example, we provide hard-coded claim records in the Lambda function, and no DynamoDB table is included in the example solution. However, the component appears in the architecture to represent real-life use cases where the data is stored outside the Lambda function.
The Claims Agent Helper retrieves the user’s metadata (that is, their name) from the ID token, builds the Verified Permissions data entities, and makes the Verified Permissions authorization request. This request contains the principal (user and role), action (that is, ListClaim), and resource (Claim). Verified Permissions evaluates the request against the Verified Permissions policies and returns an Allow or Deny decision. Subsequently, the Claims Agent Helper filters the claims based on that decision. Verified Permissions has “default deny” functionality, meaning that in the absence of an explicit allow, the service defaults to an implicit deny. If there is an explicit Deny in the policies involved in the request, Verified Permissions denies the request.
The Claims Amazon Bedrock Agent receives the authorized list of claims, augments the prompt and sends it to the Claude model for completion. The agent returns the completion back to the user.

Fine-grained access control flows
Based on the customer’s access control requirements, there are three fine-grained access control flows as depicted in the following system sequence diagrams.
Use case: Claims administrator can list claims across regions
The following diagram shows how the claims administrator can list claims across regions.

The following diagram depicts how fine-grained access control is applied when the claims administrator requests an individual claim record. In this diagram, notice the deny decision from Verified Permissions, because the principal’s role isn’t ClaimsAdjuster.

Use case: Claims adjuster can see claims they own
The following diagram depicts how fine-grained access control is applied when the claims adjuster retrieves claim details. In this diagram, notice the allow decision from Verified Permissions, because the principal’s role is ClaimsAdjuster and the resource owner (that is, the claim owner) matches the user principal (that is, user=alice).

The following diagram depicts how fine-grained access control is applied when the claims adjuster lists open claims. In this diagram, notice the allow decision from Verified Permissions. This is because the principal’s group is ClaimsAdjuster and the region on the resource matches the principal’s region. As a result of this region filter on the authorization policy, only open claims for the user’s region are returned. Verified Permissions acts on the principal, action, and individual resource (that is, a claim record) for the authorization decision. Therefore, the Lambda function needs to iterate through the list of open claims and make an isAuthorized request for each claim record. If this results in a performance issue, you can use the BatchIsAuthorized API and send multiple authorization requests in one API call.
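
A hedged boto3 sketch of that per-record filtering loop is shown below. The policy store ID and the exact entity attribute shapes are assumptions; the entity type names follow the policies shown later in this post.

# Hedged sketch of filtering claims with one IsAuthorized call per record.
import boto3

avp = boto3.client("verifiedpermissions")
POLICY_STORE_ID = "<POLICY_STORE_ID>"  # placeholder; use your policy store's ID

def filter_claims(user_id, role, user_region, claims):
    """Return only the claims the principal is allowed to list."""
    allowed = []
    for claim in claims:
        response = avp.is_authorized(
            policyStoreId=POLICY_STORE_ID,
            principal={"entityType": "avp::claim::app::User", "entityId": user_id},
            action={"actionType": "avp::claim::app::Action", "actionId": "ListClaim"},
            resource={"entityType": "avp::claim::app::Claim", "entityId": claim["id"]},
            entities={"entityList": [
                {   # principal entity with its role parent and region attribute
                    "identifier": {"entityType": "avp::claim::app::User", "entityId": user_id},
                    "attributes": {"custom": {"record": {"region": {"string": user_region}}}},
                    "parents": [{"entityType": "avp::claim::app::Role", "entityId": role}],
                },
                {   # resource entity with owner and region attributes
                    "identifier": {"entityType": "avp::claim::app::Claim", "entityId": claim["id"]},
                    "attributes": {
                        "owner": {"entityIdentifier": {"entityType": "avp::claim::app::User",
                                                       "entityId": claim["owner"]}},
                        "region": {"string": claim["region"]},
                    },
                },
            ]},
        )
        if response["decision"] == "ALLOW":
            allowed.append(claim)
    # For longer lists, the BatchIsAuthorized API can evaluate multiple requests per call.
    return allowed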

Entities design considerations
When designing fine-grained data access controls, it is best practice to start with the entity-relationship diagram (ERD) for the application. For our claims application, the user will operate on claim records to retrieve a list of claims records, get the details for an individual claim record, or update the status of a claim record. The following diagram is the ERD for this application modeled in Verified Permissions. With Verified Permissions, you can apply both role-based access control (RBAC) and attribute-based access control (ABAC).

Here is a brief description of each entity and attributes that will be used for RBAC and ABAC against claim records.

Application – The application is a chat-based generative AI application using Amazon Bedrock Agents to understand the questions and retrieve the relevant claims data to assist claims administrators and claims adjusters.
Claim – The claim represents an insurance claim record that is stored in the DynamoDB table. The claims system stores claim records and the chatbot application allows users to retrieve and update these records.
User – The application user (a claims administrator or claims adjuster), authenticated through Amazon Cognito.
Role – The role represents a user’s access within the application. Here is a list of available roles:

Claims administrators – Can list claims across various geographic regions, but they can’t read individual claim records
Claims adjusters – Can list claims for their region and read and update their claim records

The roles are managed through Amazon Cognito and Verified Permissions. Cognito maintains a record of which role a user is assigned to and includes this information in the token. Verified Permissions maintains a record of what that role is permitted to do. Fine-grained access controls exist to make sure that users have appropriate permissions for their roles, restricting access to sensitive claim data based on geographic regions and user groups.
Fine-grained authorization: Policy design
The Actions diagram view lists the types of Principals you have configured in your policy store, the Actions they are eligible to perform, and the Resources they are eligible to perform actions on. The lines between entities indicate your ability to create a policy that allows a principal to take an action on a resource. The following image shows the actions diagram from Verified Permissions for our insurance claims use case. The User principal will have access to the Get, List, and Update actions. The resources are the Application and the Claim entity within the application. This diagram generates the underlying schema that governs the policy definition.

Use case: Claims administrator can list all claim records across regions
A policy is a statement that either permits or forbids a principal to take one or more actions on a resource. Each policy is evaluated independently of other policies. The Verified Permissions policy for this use case is shown in the following code example. In this policy, the principal (that is, user Bob) is assigned the role of claims administrator.

permit (
    principal in avp::claim::app::Role::"ClaimsAdministrator",
    action in [
        avp::claim::app::Action::"ListClaim"
    ],
    resource
);

Use case: Claims administrator can’t access claim detail record
The Verified Permissions policy for this use case is shown in the following code example. The use of explicit “forbid” policies is a valid practice.

forbid (
    principal in avp::claim::app::Role::"ClaimsAdministrator",
    action in [
        avp::claim::app::Action::"GetClaim"
    ],
    resource
);

Use case: Claims adjuster can list claims they own in their region
The Verified Permissions policy for this use case is shown in the following code example. In this policy, the principal (that is, user Alice) is assigned the role of claims adjuster and their region is passed as a custom attribute in the ID token.

permit (
    principal in avp::claim::app::Role::"ClaimsAdjuster",
    action in [
        avp::claim::app::Action::"ListClaim"
    ],
    resource
) when {
    resource has owner &&
    principal == resource.owner &&
    principal has custom &&
    principal.custom has region &&
    principal.custom.region == resource.region
};

Use case: Claims adjuster can retrieve or update a claim they own

permit (
    principal in avp::claim::app::Role::"ClaimsAdjuster",
    action in [
        avp::claim::app::Action::"GetClaim",
        avp::claim::app::Action::"UpdateClaim"
    ],
    resource
) when {
    principal == resource.owner &&
    principal has custom &&
    principal.custom has region &&
    principal.custom.region == resource.region
};

Authentication design considerations
The configuration of Amazon Cognito for this use case followed the security practices included as part of the standard configuration workflow: a strong password policy, multi-factor authentication (MFA), and a client secret. When using Amazon Cognito with Verified Permissions, your application can pass user pool access or identity tokens to Verified Permissions to make the allow or deny decision. Verified Permissions evaluates the user’s request based on the policies it has stored in the policy store.
For custom attributes, we are using region to restrict which claims a claims adjuster can see, excluding claims made in regions outside the adjuster’s own region. We are also using role as a custom attribute to provide that information in the ID token that is passed to the Amazon Bedrock agent. When the user is registered in the Cognito user pool, these custom attributes will be recorded as part of the sign-up process.
Amazon Cognito integrates with Verified Permissions through the Identity sources section in the console. The following screenshot shows that we’ve connected our Cognito user pool to the Amazon Verified Permissions policy store.

Fine-grained authorization: Passing ID token to the Amazon Bedrock agent
When the user is authenticated against the Cognito user pool, it returns an ID token and access token to the client application. The ID token is passed through API Gateway and a proxy Lambda function, and then supplied via sessionAttributes on the invoke_agent call.

# Invoke the agent API
import boto3

bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime_client.invoke_agent(
    ...,  # other required parameters elided in the original snippet
    sessionState={
        'sessionAttributes': {
            'authorization_header': '<AUTHORIZATION_HEADER>'
        }
    },
)

The header is then retrieved from the Lambda event in the Action Group Lambda function and Verified Permissions is used to verify the user’s access against the desired action.

# Retrieve session attributes from the event and use them to validate the action
session_attributes = event.get("sessionAttributes")
auth, reason = verifyAccess(session_attributes, action_id)

Fine-grained authorization: Integration with Amazon Bedrock Agents
The ID token issued by Cognito contains the user’s identity and custom attributes. This ID token is passed to the Amazon Bedrock agent, and the Agent Helper Lambda retrieves that token from the agent’s session attribute. Then, the Agent Helper Lambda retrieves open claim records from DynamoDB and constructs the Verified Permissions schema entities and makes the isAuthorized API call.
Because Verified Permissions resources operate at the individual record level (that is, a single claim record), you need to iterate over the claims list object and make the isAuthorized API call for the authorization decision and then create the filtered claims list. The filtered claims list is then passed back to the caller. As a result, the claims adjuster will only see claims for their region, while a claims administrator can see claims across all regions.
The Amazon Bedrock agent then uses this filtered claim list to complete the user’s request to list claims. The chat application can only access the claims records that the user is authorized to view, providing the fine-grained access control integrated with the Amazon Bedrock agent workflow.
Getting started
Check out our code to get started developing your secure generative AI application using Amazon Verified Permissions. We provide you with an end-to-end implementation of the architecture described in this post and a demo UI you can use to test the permissions of different users. Update this example to implement generative AI applications that connect with your use case setup.
Conclusion
In this post, we discussed the challenges in applying fine-grained access controls for agent workflows in a generative AI application. We shared an application architecture for building an example chat-based generative AI application that uses Amazon Bedrock Agents to orchestrate workflows and applies fine-grained access controls using Amazon Verified Permissions. We discussed how to design fine-grained access permissions through the design of persona-based access control workflows. If you are looking for a scalable and secure way to apply fine-grained permissions to your generative AI agent-based workflows, give this solution a try and leave your feedback.

About the authors
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheep-a-doodle!
Samantha Wylatowska is a Solutions Architect at AWS. With a background in DevSecOps, her passion lies in guiding organizations towards secure operational efficiency, leveraging the power of automation for a seamless cloud experience. In her free time, she’s usually learning something new through music, literature, or film.
Anil Nadiminti is a Senior Solutions Architect at AWS specializing in empowering organizations to harness cloud computing and AI for digital transformation and innovation. His expertise in architecting scalable solutions and implementing data-driven strategies enables companies to innovate and thrive in today’s rapidly evolving technological landscape.
Michael Daniels is an AI/ML Specialist at AWS. His expertise lies in building and leading AI/ML and generative AI solutions for complex and challenging business problems, which is enhanced by his PhD from the Univ. of Texas and his MSc in computer science specialization in machine learning from the Georgia Institute of Technology. He excels in applying cutting-edge cloud technologies to innovate, inspire, and transform industry-leading organizations while also effectively communicating with stakeholders at any level or scale. In his spare time, you can catch Michael skiing or snowboarding.
Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

F5-TTS: A Fully Non-Autoregressive Text-to-Speech System based on Flow Matching with Diffusion Transformer (DiT)

The current challenges in text-to-speech (TTS) systems revolve around the inherent limitations of autoregressive models and their complexity in aligning text and speech accurately. Many conventional TTS models require complex elements such as duration modeling, phoneme alignment, and dedicated text encoders, which add significant overhead and complexity to the synthesis process. Furthermore, previous models like E2 TTS have faced issues with slow convergence, robustness, and maintaining accurate alignment between the input text and generated speech, making them challenging to optimize and deploy efficiently in real-world scenarios.

Researchers from Shanghai Jiao Tong University, the University of Cambridge, and Geely Automobile Research Institute introduced F5-TTS, a non-autoregressive text-to-speech (TTS) system that utilizes flow matching with a Diffusion Transformer (DiT). Unlike many conventional TTS models, F5-TTS does not require complex elements like duration modeling, phoneme alignment, or a dedicated text encoder. Instead, it introduces a simplified approach where text inputs are padded to match the length of the speech input, leveraging flow matching for effective synthesis. F5-TTS is designed to address the shortcomings of its predecessor, E2 TTS, which faced slow convergence and alignment issues between speech and text. Notable improvements include a ConvNeXt architecture to refine text representation and a novel Sway Sampling strategy during inference, significantly enhancing performance without retraining.

Structurally, F5-TTS leverages ConvNeXt and DiT to overcome alignment challenges between the text and generated speech. The input text is first processed by ConvNeXt blocks to prepare it for in-context learning with speech, allowing smoother alignment. The character sequence, padded with filler tokens, is fed into the model alongside a noisy version of the input speech. The Diffusion Transformer (DiT) backbone is used for training, employing flow matching to map a simple initial distribution to the data distribution effectively. Additionally, F5-TTS includes an innovative inference-time Sway Sampling technique that helps control flow steps, prioritizing early-stage inference to improve the alignment of generated speech with the input text.
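
The following is a minimal sketch of the padding step and a generic conditional flow-matching training objective consistent with the description above. The tensor shapes, filler token, linear probability path, and model interface are simplifying assumptions; this is not the authors' implementation.

# Sketch of text padding plus a generic flow-matching training step (illustrative only).
import torch
import torch.nn.functional as F

FILLER_ID = 0

def pad_text(char_ids, num_frames):
    """Pad the character sequence with filler tokens to the length of the speech input."""
    return F.pad(char_ids, (0, num_frames - char_ids.shape[-1]), value=FILLER_ID)

def flow_matching_step(model, mel, char_ids, optimizer):
    """One training step: regress the velocity of a linear path from noise to data."""
    batch, num_frames, _ = mel.shape
    text = pad_text(char_ids, num_frames)                 # (batch, num_frames)

    x0 = torch.randn_like(mel)                            # simple initial distribution
    t = torch.rand(batch, 1, 1)                           # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * mel                        # point on the linear path
    target_velocity = mel - x0                            # velocity target for this path

    pred = model(x_t, t.squeeze(-1).squeeze(-1), text)    # DiT conditioned on padded text (assumed interface)
    loss = F.mse_loss(pred, target_velocity)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()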

The results presented in the paper demonstrate that F5-TTS outperforms other state-of-the-art TTS systems in terms of synthesis quality and inference speed. The model achieved a word error rate (WER) of 2.42 on the LibriSpeech-PC dataset using 32 function evaluations (NFE) and demonstrated a real-time factor (RTF) of 0.15 for inference. This performance is a significant improvement over diffusion-based models like E2 TTS, which required a longer convergence time and had difficulties with maintaining robustness across different input scenarios. The Sway Sampling strategy notably enhances naturalness and intelligibility, allowing the model to achieve smooth and expressive zero-shot generation. Evaluation metrics such as WER and speaker similarity scores confirm the competitive quality of the generated speech.

In conclusion, F5-TTS successfully introduces a simpler, highly efficient pipeline for TTS synthesis by eliminating the need for duration predictors, phoneme alignments, and explicit text encoders. The use of ConvNeXt for text processing and Sway Sampling for optimized flow control collectively improves alignment robustness, training efficiency, and speech quality. By maintaining a lightweight architecture and providing an open-source framework, F5-TTS aims to advance community-driven development in text-to-speech technologies. The researchers also highlight the ethical considerations for the potential misuse of such models, emphasizing the need for watermarking and detection systems to prevent fraudulent use.


Apple Researchers Introduce GSM-Symbolic: A Novel Machine Learning Benchmark with Multiple Variants Designed to Provide Deeper Insights into the Mathematical Reasoning Abilities of LLMs

Recent progress in LLMs has spurred interest in their mathematical reasoning skills, especially with the GSM8K benchmark, which assesses grade-school-level math abilities. While LLMs have shown improved performance on GSM8K, doubts remain about whether their reasoning abilities have truly advanced, as current metrics may only partially capture their capabilities. Research suggests that LLMs rely on probabilistic pattern matching rather than genuine logical reasoning, leading to token bias and sensitivity to small input changes. Furthermore, GSM8K’s static nature and reliance on a single metric limit its effectiveness in evaluating LLMs’ reasoning abilities under varied conditions.

Logical reasoning is essential for intelligent systems, but its consistency in LLMs remains uncertain. While some research shows LLMs can handle tasks through probabilistic pattern matching, they often lack formal reasoning, as changes in input tokens can significantly alter results. Although effective in some cases, transformers have limited expressiveness for complex tasks unless supported by external memory, such as scratchpads. Studies suggest that LLMs match patterns seen during training rather than relying on true logical understanding.

Researchers from Apple conducted a large-scale study to evaluate the reasoning capabilities of state-of-the-art LLMs using a new benchmark called GSM-Symbolic. This benchmark generates diverse mathematical questions through symbolic templates, allowing for more reliable and controllable evaluations. Their findings show that LLM performance declines significantly when numerical values or question complexity increases. Additionally, adding irrelevant but seemingly related information leads to a performance drop of up to 65%, indicating that LLMs rely on pattern matching rather than formal reasoning. The study highlights the need for improved evaluation methods and further research into LLM reasoning abilities.

The GSM8K dataset consists of over 8000 grade-school-level math questions and answers commonly used for evaluating LLMs. However, risks like data contamination and performance variance with minor question changes have arisen due to its popularity. To address this, GSM-Symbolic was developed, generating diverse problem instances using symbolic templates. This approach enables a more robust evaluation of LLMs, offering better control over question difficulty and testing the models’ capabilities across multiple variations. The benchmark evaluates over 20 open and closed models using 5000 samples from 100 templates, revealing insights into LLMs’ mathematical reasoning abilities and limitations.
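
As a simple illustration of symbolic templating, the sketch below instantiates one made-up template with random names and values under a consistency constraint. The template text, variable ranges, and constraint are invented for illustration and are not drawn from GSM-Symbolic itself.

# Illustrative generation of question variants from a symbolic template.
import random

TEMPLATE = (
    "{name} buys {x} boxes of pencils. Each box holds {y} pencils. "
    "{name} gives away {z} pencils. How many pencils are left?"
)
NAMES = ["Ava", "Noah", "Mia", "Liam"]

def sample_instance(rng):
    """Sample values that satisfy the template's constraint and compute the answer."""
    while True:
        x, y, z = rng.randint(2, 9), rng.randint(5, 20), rng.randint(1, 50)
        if z < x * y:                     # constraint: cannot give away more than owned
            break
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z)
    return {"question": question, "answer": x * y - z}

rng = random.Random(0)
for variant in [sample_instance(rng) for _ in range(5)]:
    print(variant["question"], "->", variant["answer"])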

Initial experiments reveal significant performance variability across models on GSM-Symbolic, a variant of the GSM8K dataset, with lower accuracy than reported on GSM8K. The study further explores how changing names versus altering values affects LLMs, showing that value changes significantly degrade performance. Question difficulty also impacts accuracy, with more complex questions leading to greater performance declines. The results suggest that models might rely on pattern matching rather than genuine reasoning, as additional clauses often reduce their performance.

The study examined the reasoning capabilities of LLMs and highlighted limitations in current GSM8K evaluations. A new benchmark, GSM-Symbolic, was introduced to assess LLMs’ mathematical reasoning with multiple question variations. Results revealed significant performance variability, especially when numerical values were altered or irrelevant clauses were added. LLMs also struggled as question complexity increased, suggesting they rely more on pattern matching than true reasoning. GSM-NoOp further exposed LLMs’ inability to filter irrelevant information, resulting in large performance drops. Overall, this research emphasizes the need for further development to enhance LLMs’ logical reasoning abilities.


Arcee AI Releases SuperNova-Medius: A 14B Small Language Model Built on the Qwen2.5-14B-Instruct Architecture

In the ever-evolving world of artificial intelligence (AI), large language models have proven instrumental in addressing a wide array of challenges, from automating complex tasks to enhancing decision-making processes. However, scaling these models has also introduced considerable complexities, such as high computational costs, reduced accessibility, and the environmental impact of extensive resource requirements. The enormous size of conventional language models like GPTs or LLaMA-70B makes them challenging for many institutions to adopt due to constraints in computational infrastructure. Arcee AI has acknowledged these challenges and sought to bridge the gap between model capability and accessibility with the introduction of SuperNova-Medius—a small language model that aims to maintain the high-quality output of larger counterparts without their limitations.

SuperNova-Medius is a 14B small language model that seeks to disrupt the traditional notions of size versus performance in AI models. It follows Arcee AI’s release of the 70B SuperNova-70B and the 8B SuperNova-Lite. SuperNova-Medius is designed to match the prowess of significantly larger models, rivaling those with up to 70 billion parameters, while retaining a relatively manageable size of 14 billion parameters, making it highly suitable for various use cases without the massive computational burden. By integrating groundbreaking optimization techniques and innovative architectural designs, SuperNova-Medius presents a fresh perspective on how effective language models can be designed for real-world usability while ensuring that smaller organizations can leverage its potential.

SuperNova-Medius is built on an optimized Transformer architecture, coupled with advanced quantization methods that allow it to maintain impressive accuracy and efficiency. The development of SuperNova-Medius involved a sophisticated multi-teacher, cross-architecture distillation process with the following key steps (a hedged sketch of the top-K logit-distillation objective appears after the list):

Logit Distillation from Llama 3.1 405B: The logits of Llama 3.1 405B were distilled using an offline approach. The top K logits for each token were stored to capture most of the probability mass while managing storage requirements.

Cross-Architecture Adaptation: Using mergekit-tokensurgeon, a version of Qwen2.5-14B was created that uses the vocabulary of Llama 3.1 405B. This allowed for the use of Llama 3.1 405B logits in training the Qwen-based model.

Distillation to Qwen Architecture: The adapted Qwen2.5-14B model was trained using the stored 405B logits as the target.

Parallel Qwen Distillation: In a separate process, Qwen2-72B was distilled into a 14B model.

Final Fusion and Fine-Tuning: The Llama-distilled Qwen model’s vocabulary was reverted to the Qwen vocabulary. After re-aligning the vocabularies, a final fusion and fine-tuning step was conducted using a specialized dataset from EvolKit to ensure that SuperNova-Medius maintained coherence, fluency, and context understanding across a broad range of tasks.
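
The following is a hedged sketch of a top-K logit-distillation objective of the kind described in the first step. The temperature and the decision to score only the stored top-K token positions are assumptions; Arcee AI's training code is not reproduced here.

# Hedged sketch of a top-K logit-distillation loss (illustrative assumptions throughout).
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_topk_values, teacher_topk_indices,
                           temperature=1.0):
    """
    student_logits:        (batch, seq, vocab) from the student model
    teacher_topk_values:   (batch, seq, k) logits stored offline from the teacher
    teacher_topk_indices:  (batch, seq, k) token ids of those stored logits
    """
    # Restrict the student to the teacher's stored top-K token positions.
    student_topk = torch.gather(student_logits, dim=-1, index=teacher_topk_indices)

    teacher_probs = F.softmax(teacher_topk_values / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_topk / temperature, dim=-1)

    # KL divergence between teacher and student distributions over the stored top-K tokens.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2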

Despite being smaller compared to the largest models, SuperNova-Medius has been extensively fine-tuned using a diverse and expansive dataset, covering multiple domains and languages. This extensive training allows SuperNova-Medius to exhibit a strong understanding of context, generate coherent responses, and perform complex reasoning tasks effectively. Furthermore, by employing innovations in parameter sharing and utilizing sparsity strategies, the model delivers results that are comparable to models with substantially higher parameter counts. The key benefits of SuperNova-Medius lie in its balanced capability—it provides high-quality language generation while being cost-effective to deploy, making it a perfect fit for applications needing reliable but resource-efficient solutions.

SuperNova-Medius excels in instruction-following (IFEval) and complex reasoning tasks (BBH), outperforming Qwen2.5-14B and SuperNova-Lite across multiple benchmarks. This makes it a powerful, efficient solution for high-quality generative AI applications.

In conclusion, SuperNova-Medius stands as a testament to Arcee AI’s commitment to pushing the boundaries of what’s possible with language models while making advanced AI more inclusive and sustainable. By successfully reducing the model size without compromising on performance, Arcee AI has provided a solution that caters to the needs of various sectors, from startups and small businesses to educational institutions and beyond. As AI continues to shape our future, innovations like SuperNova-Medius are essential in ensuring that the benefits of advanced machine learning technology are accessible to all, paving the way for more equitable and impactful applications of AI across the globe.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.. Don’t Forget to join our 50k+ ML SubReddit

The post Arcee AI Releases SuperNova-Medius: A 14B Small Language Model Built on the Qwen2.5-14B-Instruct Architecture appeared first on MarkTechPost.

Researchers at Stanford University Propose ExPLoRA: A Highly Effective …

Parameter-efficient fine-tuning (PEFT) methods, like low-rank adaptation (LoRA), allow large pre-trained foundation models to be adapted to downstream tasks using a small percentage (0.1%-10%) of the original trainable weights. A less explored area of PEFT is extending the pre-training phase without supervised labels—specifically, adapting foundation models to new domains using efficient self-supervised pre-training. While traditional pre-training of foundation models in language and vision has been resource-intensive, recent advancements in PEFT techniques have enabled effective fine-tuning with minimal computational cost based on the assumption that weight updates have a low intrinsic rank.

Vision foundation models (VFMs) like DinoV2 and masked autoencoders (MAE) have shown excellent performance in tasks such as classification and semantic segmentation through self-supervised learning (SSL). Recently, domain-specific VFMs have emerged, like SatMAE, which processes temporal or multi-spectral satellite images. Efficient adaptation of these large models has led to the adoption of PEFT methods, which update only a fraction of the parameters. Techniques such as LoRA apply low-rank weight updates, while others modify the number of trainable parameters. Domain adaptation strategies address distribution shifts between training and testing data using discrepancy metrics or adversarial training to enhance model performance across domains.

Researchers from Stanford University and CZ Biohub have developed ExPLoRA, an innovative technique to enhance transfer learning for pre-trained vision transformers (ViTs) amid domain shifts. By initializing a ViT with weights from large natural-image datasets like DinoV2 or MAE, ExPLoRA continues unsupervised pre-training in a new domain, selectively unfreezing 1-2 ViT blocks while employing LoRA to tune the remaining layers. This method achieves state-of-the-art performance in satellite imagery classification, improving top-1 accuracy by 8% while utilizing only 6-10% of the parameters compared to previous fully pre-trained models, demonstrating significant efficiency and effectiveness in domain adaptation.
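
To make the recipe concrete, here is a minimal PyTorch sketch of the core idea: freeze a pre-trained ViT, fully unfreeze the last one or two transformer blocks, and attach low-rank (LoRA) updates to the remaining blocks before continuing self-supervised pre-training on the target domain. The module names (blocks, attn.qkv, as in timm ViTs), the rank, and the number of unfrozen blocks are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B (A x).
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def apply_explora(vit, num_unfrozen_blocks: int = 2, r: int = 8):
    for p in vit.parameters():                      # freeze the entire pre-trained backbone
        p.requires_grad = False
    for block in vit.blocks[:-num_unfrozen_blocks]: # low-rank updates on the frozen blocks
        block.attn.qkv = LoRALinear(block.attn.qkv, r=r)
    for p in vit.blocks[-num_unfrozen_blocks:].parameters():
        p.requires_grad = True                      # fully unfreeze the last 1-2 blocks
    return vit

Unsupervised pre-training (the MAE or DinoV2 objective) then continues on target-domain images with only these parameters trainable, after which the model is evaluated with linear probing or light fine-tuning.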

MAE and DinoV2 are SSL methods for ViTs. MAE uses a masked encoder-decoder structure that requires full fine-tuning for downstream tasks, which can be computationally intensive. In contrast, DinoV2 demonstrates strong zero-shot performance by employing a trainable student-teacher model architecture, enabling adaptation without full fine-tuning. The ExPLoRA method is proposed to address fine-tuning inefficiencies, combining pre-trained weights with low-rank adaptations and additional updates to adapt ViTs to new target domains efficiently. This approach reduces storage requirements while maintaining strong feature extraction and generalization capabilities.

The experimental results focus on satellite imagery, highlighting a case study with the fMoW-RGB dataset, achieving a state-of-the-art top-1 accuracy of 79.2%. The ablation study examines performance metrics across various configurations. ExPLoRA models, initialized with MAE and DinoV2 weights, outperform traditional fully pre-trained methods while utilizing only 6% of the ViT encoder parameters. Additional evaluations on multi-spectral images and various satellite datasets demonstrate ExPLoRA’s effectiveness in bridging domain gaps and achieving competitive performance. The results indicate significant improvements in accuracy, showcasing the potential of ExPLoRA for satellite image classification tasks.

In conclusion, ExPLoRA is an innovative pre-training strategy designed to adapt pre-trained ViT models for diverse visual domains, including satellite and medical imagery. ExPLoRA addresses the limitations of costly from-scratch pre-training by enabling efficient knowledge transfer from existing models, achieving superior performance compared to domain-specific foundations. The method combines PEFT techniques like LoRA with minimal unfreezing of model layers, significantly enhancing transfer learning. The experiments reveal state-of-the-art results on satellite imagery, improving linear probing accuracy by up to 7.5% while utilizing less than 10% of the parameters of previous approaches.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

The post Researchers at Stanford University Propose ExPLoRA: A Highly Effective AI Technique to Improve Transfer Learning of Pre-Trained Vision Transformers (ViTs) Under Domain Shifts appeared first on MarkTechPost.

OpenAI Researchers Introduce MLE-bench: A New Benchmark for Measuring …

Machine Learning (ML) models have shown promising results in various coding tasks, but there remains a gap in effectively benchmarking AI agents’ capabilities in ML engineering. Existing coding benchmarks primarily evaluate isolated coding skills without holistically measuring the ability to perform complex ML tasks, such as data preparation, model training, and debugging.

OpenAI Researchers Introduce MLE-bench

To address this gap, OpenAI researchers have developed MLE-bench, a comprehensive benchmark that evaluates AI agents on a wide array of ML engineering challenges inspired by real-world scenarios. MLE-bench is a novel benchmark aimed at evaluating how well AI agents can perform end-to-end machine learning engineering. It is constructed using a collection of 75 ML engineering competitions sourced from Kaggle. These competitions encompass diverse domains such as natural language processing, computer vision, and signal processing. The competitions are carefully curated to assess key ML skills, including training models, data preprocessing, running experiments, and submitting results for evaluation. To provide an accurate baseline, human performance metrics are gathered from publicly available Kaggle leaderboards, enabling comparisons between the capabilities of AI agents and expert human participants.

Structure and Details of MLE-bench

MLE-bench features several design aspects to assess ML engineering effectively. Each of the 75 Kaggle competition tasks is representative of practical engineering challenges, making the benchmark both rigorous and realistic. Each Kaggle competition in MLE-bench consists of a problem description, dataset, local evaluation tools, and grading code used to assess the agent’s performance. To ensure comparability, each competition’s dataset is split into training and testing sets, often redesigned to avoid any overlap or contamination issues. Submissions are graded against human attempts using competition leaderboards, and agents receive medals (bronze, silver, gold) based on their performance relative to human benchmarks. The grading mechanism relies on standard evaluation metrics, such as the area under the receiver operating characteristic (AUROC), mean squared error, and other domain-specific loss functions, providing a fair comparison to Kaggle participants.

AI agents, such as OpenAI’s o1-preview model combined with AIDE scaffolding, have been tested on these tasks, achieving results comparable to a Kaggle bronze medal in 16.9% of competitions. Performance significantly improved with repeated attempts, indicating that while agents can follow well-known approaches, they struggle to recover from initial mistakes or optimize effectively without multiple iterations. This highlights both the potential and the limitations of current AI systems in performing complex ML engineering tasks.

Experimental Results and Performance Analysis

The evaluation of different scaffolds and AI models on MLE-bench reveals interesting findings. OpenAI’s o1-preview model with AIDE scaffolding emerged as the best-performing setup, achieving medals in 16.9% of the competitions, and performance significantly improved with multiple attempts. Agents often performed better when they could iterate on their solutions, highlighting the importance of multiple passes in addressing challenges and optimizing solutions. When given additional resources, such as increased compute time and hardware, agents showed better results, emphasizing the impact of resource allocation. For example, the performance of GPT-4o improved from 8.7% when given 24 hours to 11.8% when given 100 hours per competition. Furthermore, the experiments revealed that scaling up the number of attempts (pass@k) had a significant impact on the success rate, with pass@6 achieving nearly double the performance of pass@1. Additionally, experiments on scaling resources and agent scaffolding demonstrate the variability in performance based on resource availability and optimization strategies. Specifically, agents like o1-preview exhibited notable improvements in competitions requiring extensive model training and hyperparameter tuning when given longer runtimes or better hardware configurations. This evaluation provides valuable insights into the strengths and weaknesses of current AI agents, particularly in debugging, handling complex datasets, and effectively utilizing available resources.
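
For reference, the sketch below shows the standard unbiased pass@k estimator popularized by code-generation benchmarks, which estimates the probability that at least one of k sampled attempts succeeds given n attempts with c successes; MLE-bench's exact aggregation over repeated attempts may differ, and the numbers below are illustrative only.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = total attempts run, c = successful attempts (e.g., medal-earning runs), k = attempt budget.
    if n - c < k:
        return 1.0  # with fewer than k failures, every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up numbers: 2 successes out of 12 attempts.
print(pass_at_k(n=12, c=2, k=1))  # ~0.167
print(pass_at_k(n=12, c=2, k=6))  # ~0.773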

Conclusion and Future Directions

MLE-bench represents a significant step forward in evaluating the ML engineering capabilities of AI agents, focusing on holistic, end-to-end performance metrics rather than isolated coding skills. The benchmark provides a robust framework for assessing various facets of ML engineering, including data preprocessing, model training, hyperparameter tuning, and debugging, which are essential for real-world ML applications. It aims to facilitate further research into understanding the potential and limitations of AI agents in performing practical ML engineering tasks autonomously. By open-sourcing MLE-bench, OpenAI hopes to encourage collaboration, allowing researchers and developers to contribute new tasks, improve existing benchmarks, and explore innovative scaffolding techniques. This collaborative effort is expected to accelerate progress in the field, ultimately contributing to safer and more reliable deployment of advanced AI systems. Additionally, MLE-bench serves as a valuable tool for identifying key areas where AI agents require further development, providing a clear direction for future research efforts in enhancing the capabilities of AI-driven ML engineering.

Setup

Some MLE-bench competition data is stored using Git-LFS. Once you have downloaded and installed LFS, run:

git lfs fetch --all
git lfs pull

You can install mlebench with pip:

pip install -e .

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post OpenAI Researchers Introduce MLE-bench: A New Benchmark for Measuring How Well AI Agents Perform at Machine Learning Engineering appeared first on MarkTechPost.

InstructG2I : A Graph Context Aware Stable Diffusion Model to Synthesi …

Multimodal Attributed Graphs (MMAGs) have received little attention despite their versatility in image generation. MMAGs represent relationships between entities, with combinatorial complexity, in a graph-structured manner, and each node carries both image and text information. Compared with text-only or image-only conditioning, this graph context can be translated into richer, more informative generated images. Graph2Image is an interesting challenge in this field that requires generative models to synthesize images conditioned on both text descriptions and graph connections. However, MMAGs cannot be directly plugged into existing image- and text-conditioned generation pipelines.

The following are the most relevant challenges in the use of MMAGs for image synthesis:

Explosion in graph size – Because of the combinatorial complexity of graphs, the effective input size grows exponentially when local subgraphs, which encompass both images and text, are fed to the model.

Graph entity dependencies – Node characteristics are mutually dependent, so proximity in the graph reflects relationships between entities across text and image and should shape preferences during image generation. For example, an image generated for a light-colored shirt should favor light shades such as pastels.

Need for controllability in graph conditioning – The generation process must be controllable, so that generated images follow desired patterns or characteristics defined by the connections between entities in the graph.

A team of researchers at the University of Illinois developed InstructG2I to solve this problem. This is a graph context-aware diffusion model that utilizes multimodal graph information. The approach addresses graph space complexity by compressing contexts from graphs into fixed-capacity graph conditioning tokens, enhanced with semantic, personalized PageRank-based graph sampling. The Graph-QFormer architecture further improves these graph tokens by addressing the problem of graph entity dependency. Last but not least, InstructG2I guides image generation with an adjustable graph guidance strength.

InstructG2I introduces graph conditions into Stable Diffusion with PPR-based neighbor sampling. PPR, or Personalized PageRank, identifies related nodes from the graph structure. To ensure that generated images are semantically related to the target node, a semantic similarity function is used for reranking. The study also proposes Graph-QFormer, a two-transformer module that captures text-based and image-based dependencies: it employs multi-head self-attention for image-image dependencies and multi-head cross-attention for text-image dependencies. The cross-attention layer aligns image features with text prompts, using the hidden states from the self-attention layer as input and the text embeddings as queries to generate relevant images. The final output of Graph-QFormer is a set of graph-conditioned prompt tokens that guide the image generation process in the diffusion model. Finally, to control the generation process, classifier-free guidance is used, which is essentially a technique to adjust the strength of the graph guidance.
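
As an illustration of this sampling-and-reranking step, the sketch below selects neighbors with Personalized PageRank and then reranks them by semantic similarity to the target node. The embed function stands in for an arbitrary image/text encoder (for example, CLIP), and the candidate counts are placeholders; this is not the released InstructG2I implementation.

import networkx as nx
import numpy as np

def sample_graph_neighbors(G: nx.Graph, target, embed, k_ppr: int = 50, k_final: int = 5):
    # 1) Personalized PageRank with all restart mass on the target node.
    ppr = nx.pagerank(G, alpha=0.85, personalization={target: 1.0})
    ranked_by_ppr = [n for n, _ in sorted(ppr.items(), key=lambda kv: -kv[1]) if n != target]
    candidates = ranked_by_ppr[:k_ppr]
    # 2) Rerank candidates by semantic similarity to the target's own content.
    t = embed(G.nodes[target])
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    reranked = sorted(candidates, key=lambda n: -cosine(embed(G.nodes[n]), t))
    # 3) The selected neighbors' image/text features are what Graph-QFormer would compress
    #    into the fixed-capacity graph conditioning tokens fed to the diffusion model.
    return reranked[:k_final]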

InstructG2I was tested on three datasets from different domains – ART500K, Amazon, and Goodreads. For text-to-image methods, Stable Diffusion 1.5 was chosen as the baseline model, and for image-to-image methods, InstructPix2Pix and ControlNet were chosen for comparison; both were initialized with SD 1.5 and fine-tuned on the chosen datasets. The results showed impressive improvements over baseline models in both settings. InstructG2I outperformed all baseline models in CLIP and DINOv2 scores. For qualitative evaluation, InstructG2I generated images that best fit the semantics of the text prompt and the context from the graph, ensuring that both content and context, learned from the node's neighbors in the graph, were conveyed accurately.

InstructG2I effectively addresses the key challenges of graph-size explosion, inter-entity dependency, and controllability in Multimodal Attributed Graphs and surpasses the baselines in image generation. In the coming years, there will be many opportunities to incorporate graphs into image generation, a big part of which involves handling the complex, heterogeneous relationships between image and text in MMAGs.

Check out the Paper, Code, and Details. All credit for this research goes to the researchers of this project.

The post InstructG2I : A Graph Context Aware Stable Diffusion Model to Synthesize Images from Multimodal Attributed Graphs appeared first on MarkTechPost.

LeanAgent: The First Life-Long Learning Agent for Formal Theorem Provi …

The problem that this research seeks to address lies in the inherent limitations of existing large language models (LLMs) when applied to formal theorem proving. Current models are often trained or fine-tuned on specific datasets, such as those focused on undergraduate-level mathematics, but struggle to generalize to more advanced mathematical domains. These limitations become more pronounced because these models typically operate in static environments, failing to adapt across different mathematical domains and projects as mathematicians do. Moreover, these models exhibit issues related to “catastrophic forgetting,” where new knowledge may overwrite previously learned information. This research aims to tackle these challenges by proposing a lifelong learning framework that can continuously evolve and expand its mathematical capabilities without losing previously acquired knowledge.

Researchers from California Institute of Technology, Stanford, and University of Wisconsin, Madison introduce LeanAgent, a lifelong learning framework designed for formal theorem proving. LeanAgent addresses the limitations of existing LLMs by introducing a dynamic approach that continually builds upon and improves its knowledge base. Unlike static models, LeanAgent operates with a dynamic curriculum, progressively learning and adapting to increasingly complex mathematical tasks. The framework incorporates several key innovations, including curriculum learning to optimize the learning trajectory, a dynamic database to efficiently manage expanding mathematical knowledge, and a progressive training methodology designed to balance stability (retaining old knowledge) and plasticity (incorporating new knowledge). These features enable LeanAgent to continually generalize and improve its theorem-proving abilities, even in advanced mathematical domains such as abstract algebra and algebraic topology.

LeanAgent is structured around several key components that allow it to adapt continuously and effectively tackle complex mathematical problems. First, the curriculum learning strategy sorts mathematical repositories by difficulty, using theorems of varying complexity to build an effective learning sequence. This approach allows LeanAgent to start with foundational knowledge before progressing to more advanced topics. Second, a custom dynamic database is utilized to manage evolving knowledge, ensuring that previously learned information can be efficiently retrieved and reused. This database not only stores theorems and proofs but also keeps track of dependencies, enabling more efficient premise retrieval. Third, the progressive training of LeanAgent’s retriever ensures that new mathematical concepts are continuously integrated without overwriting previous learning. The retriever, initially based on ReProver, is incrementally trained with each new dataset for one additional epoch, striking a balance between learning new tasks and maintaining stability.
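
A highly simplified sketch of this lifelong-learning loop is shown below. The repository scoring, the retriever, prover, and database objects, and their methods are placeholders inferred from the description above, not LeanAgent's actual code.

def repo_difficulty(repo):
    # Curriculum signal: e.g., average theorem difficulty, so harder repositories come later.
    return sum(t.difficulty for t in repo.theorems) / max(len(repo.theorems), 1)

def lifelong_train(retriever, prover, repositories, database):
    # 1) Curriculum learning: process repositories from easy to hard.
    for repo in sorted(repositories, key=repo_difficulty):
        # 2) Grow the dynamic database with theorems, proofs, and premise dependencies.
        database.add(repo)
        # 3) Progressive training: one additional epoch on the merged premise corpus, so new
        #    knowledge is incorporated (plasticity) without overwriting old repositories (stability).
        retriever.fit(database.premise_corpus(), epochs=1)
        # 4) Attempt the repository's previously unproved "sorry" theorems with the updated retriever.
        for theorem in repo.sorry_theorems():
            prover.prove(theorem, retriever)
    return retriever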

LeanAgent demonstrates remarkable progress compared to existing baselines. It successfully proved 162 previously unsolved theorems across 23 diverse Lean repositories, including challenging areas such as abstract algebra and algebraic topology. LeanAgent outperformed the static ReProver baseline by up to 11x, particularly excelling in proving previously unsolved ‘sorry theorems.’ The framework also excelled in lifelong learning metrics, effectively maintaining stability while enhancing backward transfer, wherein learning new tasks enhanced performance on prior ones. LeanAgent’s structured learning progression, beginning with fundamental concepts and advancing to intricate topics, showcases its capacity for continuous enhancement—a crucial advantage over existing models that struggle to remain relevant across diverse and evolving mathematical domains.

The conclusion drawn from this research highlights LeanAgent’s potential to transform formal theorem proving through its lifelong learning capabilities. By proving numerous complex theorems that were previously unsolved, LeanAgent has demonstrated the effectiveness of a curriculum-based, dynamic learning strategy in continuously expanding and improving a model’s knowledge base. The research emphasizes the importance of balancing stability and plasticity, which LeanAgent achieves through its progressive training methodology. Moving forward, LeanAgent sets a foundation for future exploration in using lifelong learning frameworks for formal mathematics, potentially paving the way for AI systems that can assist mathematicians across multiple domains in real time, while continuously expanding their understanding and capability.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post LeanAgent: The First Life-Long Learning Agent for Formal Theorem Proving in Lean, Proving 162 Theorems Previously Unproved by Humans Across 23 Diverse Lean Mathematics Repositories appeared first on MarkTechPost.

Multimodal Situational Safety Benchmark (MSSBench): A Comprehensive Be …

Multimodal Situational Safety is a critical aspect that focuses on the model’s ability to interpret and respond safely to complex real-world scenarios involving visual and textual information. It ensures that Multimodal Large Language Models (MLLMs) can recognize and address potential risks inherent in their interactions. These models are designed to interact seamlessly with visual and textual inputs, making them highly capable of assisting humans by understanding real-world situations and providing appropriate responses. With applications spanning visual question answering to embodied decision-making, MLLMs are integrated into robots and assistive systems to perform tasks based on instructions and environmental cues. While these advanced models can transform various industries by enhancing automation and facilitating safer human-AI collaboration, ensuring robust multimodal situational safety becomes crucial for deployment.

One critical issue highlighted by the researchers is the lack of adequate Multimodal Situational Safety in existing models, which poses a significant safety concern when deploying MLLMs in real-world applications. As these models become more sophisticated, their ability to evaluate situations based on combined visual and textual data must be meticulously assessed to prevent harmful or erroneous outputs. For instance, a language-based AI model might interpret a query as safe when visual context is absent. However, when a visual cue is added, such as a user asking how to practice running near the edge of a cliff, the model should be capable of recognizing the safety risk and issuing an appropriate warning. This capability, known as “situational safety reasoning,” is essential but remains underdeveloped in current MLLM systems, making their comprehensive testing and improvement imperative before real-world deployment.

Existing methods for assessing Multimodal Situational Safety often rely on text-based benchmarks that lack real-time situational analysis capabilities. These assessments fall short of addressing the nuanced challenges of multimodal scenarios, where models must simultaneously interpret visual and linguistic inputs. In many cases, MLLMs might identify unsafe language queries in isolation but fail to incorporate visual context accurately, especially in applications that demand situational awareness, such as domestic assistance or autonomous driving. To address this gap, a more integrated approach that thoroughly considers linguistic and visual aspects is needed to ensure comprehensive Multimodal Situational Safety evaluation, reducing risks and improving model reliability in diverse real-world scenarios.

Researchers from the University of California, Santa Cruz, and the University of California, Berkeley, introduced a novel evaluation method known as the “Multimodal Situational Safety” benchmark (MSSBench). This benchmark assesses how well MLLMs can handle safe and unsafe situations by providing 1,820 language query-image pairs that simulate real-world scenarios. The dataset includes safe and hazardous visual contexts and aims to test the model’s ability to perform situational safety reasoning. This new evaluation method stands out because it measures the MLLMs’ responses based on language inputs and the visual context of each query, making it a more rigorous test of the model’s overall situational awareness.

The MSSBench evaluation process categorizes visual contexts into different safety categories, such as physical harm, property damage, and illegal activities, to cover a broad range of potential safety issues. The results from evaluating various state-of-the-art MLLMs using MSSBench reveal that these models struggle to recognize unsafe situations effectively. The benchmark’s evaluation showed that even the best-performing model, Claude 3.5 Sonnet, achieved an average safety accuracy of just 62.2%. Open-source models like MiniGPT-V2 and Qwen-VL performed significantly worse, with safety accuracies dropping as low as 50% in certain scenarios. Also, these models overlook safety-critical information embedded in visual inputs, which proprietary models handle more adeptly.

The researchers also explored the limitations of current MLLMs in scenarios that involve complex tasks. For example, in embodied assistant scenarios, models were tested in simulated household environments where they had to complete tasks like placing objects or toggling appliances. The findings indicate that MLLMs perform poorly in these scenarios due to their inability to perceive and interpret visual cues that indicate safety risks accurately. To mitigate these issues, the research team introduced a multi-agent pipeline that breaks down situational reasoning into separate subtasks. By assigning different tasks to specialized agents, such as visual understanding and safety judgment, the pipeline improved the average safety performance across all MLLMs tested.
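
The following sketch conveys the multi-agent idea in code: situational safety reasoning is decomposed into a visual-understanding step and a safety-judgment step before the model answers. The call_mllm wrapper and the prompts are illustrative assumptions, not the paper's actual pipeline or prompts.

def multi_agent_response(call_mllm, query: str, image) -> str:
    # Agent 1: visual understanding only, describe the situation in the image.
    scene = call_mllm(image=image, prompt="Describe the situation shown in this image.")
    # Agent 2: safety judgment, decide whether answering is safe in this specific situation.
    verdict = call_mllm(
        prompt=(
            f"Situation: {scene}\nUser request: {query}\n"
            "Is it safe to help with this request in this situation? Answer SAFE or UNSAFE."
        )
    )
    # Agent 3: answer normally, or refuse with a situation-specific warning.
    if "UNSAFE" in verdict.upper():
        return call_mllm(
            prompt=f"Politely decline the request '{query}' and explain the risk given: {scene}"
        )
    return call_mllm(image=image, prompt=query)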

The study’s results emphasize that while the multi-agent approach shows promise, there is still much room for improvement. For example, even with a multi-agent system, MLLMs like mPLUG-Owl2 and DeepSeek failed to recognize unsafe scenarios in 32% of the test cases, indicating that future work needs to focus on enhancing these models’ visual-textual alignment and situational reasoning capabilities.

Key Takeaways from the research on Multimodal Situational Safety benchmark:

Benchmark Creation: The Multimodal Situational Safety benchmark (MSSBench) includes 1,820 query-image pairs, evaluating MLLMs on various safety aspects.

Safety Categories: The benchmark assesses safety in four categories: physical harm, property damage, illegal activities, and context-based risks.

Model Performance: The best-performing models, like Claude 3.5 Sonnet, achieved a maximum safety accuracy of 62.2%, highlighting a significant area for improvement.

Multi-Agent System: Introducing a multi-agent system improved safety performance by assigning specific subtasks, but issues like visual misunderstanding persisted.

Future Directions: The study suggests that further development of MLLM safety mechanisms is necessary to achieve reliable situational awareness in complex, multimodal scenarios.

In conclusion, the research presents a new framework for evaluating the situational safety of MLLMs through the Multimodal Situational Safety benchmark. It reveals the critical gaps in current MLLM safety performance and proposes a multi-agent approach to address these challenges. The research demonstrates the importance of comprehensive safety evaluation in multimodal AI systems, especially as these models become more prevalent in real-world applications.

Check out the Paper, GitHub, and Project. All credit for this research goes to the researchers of this project.

The post Multimodal Situational Safety Benchmark (MSSBench): A Comprehensive Benchmark to Analyze How AI Models Evaluate Safety and Contextual Awareness Across Varied Real-World Situations appeared first on MarkTechPost.

Boost productivity by using AI in cloud operational health management

Modern organizations increasingly depend on robust cloud infrastructure to provide business continuity and operational efficiency. Operational health events – including operational issues, software lifecycle notifications, and more – serve as critical inputs to cloud operations management. Inefficiencies in handling these events can lead to unplanned downtime, unnecessary costs, and revenue loss for organizations.
However, managing cloud operational events presents significant challenges, particularly in complex organizational structures. With a vast array of services and resource footprints spanning hundreds of accounts, organizations can face an overwhelming volume of operational events occurring daily, making manual administration impractical. Although traditional programmatic approaches offer automation capabilities, they often come with significant development and maintenance overhead, in addition to increasingly complex mapping rules and inflexible triage logic.
This post shows you how to create an AI-powered, event-driven operations assistant that automatically responds to operational events. It uses Amazon Bedrock, AWS Health, AWS Step Functions, and other AWS services. The assistant can filter out irrelevant events (based on your organization’s policies), recommend actions, create and manage issue tickets in integrated IT service management (ITSM) tools to track actions, and query knowledge bases for insights related to operational events. By orchestrating a group of AI endpoints, the agentic AI design of this solution enables the automation of complex tasks, streamlining the remediation processes for cloud operational events. This approach helps organizations overcome the challenges of managing the volume of operational events in complex, cloud-driven environments with minimal human supervision, ultimately improving business continuity and operational efficiency.
Event-driven operations management
Operational events refer to occurrences within your organization’s cloud environment that might impact the performance, resilience, security, or cost of your workloads. Some examples of AWS-sourced operational events include:

AWS Health events — Notifications related to AWS service availability, operational issues, or scheduled maintenance that might affect your AWS resources.
AWS Security Hub findings — Alerts about potential security vulnerabilities or misconfigurations identified within your AWS environment.
AWS Cost Anomaly Detection alerts – Notifications about unusual spending patterns or cost spikes.
AWS Trusted Advisor findings — Opportunities for optimizing your AWS resources, improving security, and reducing costs.

However, operational events aren’t limited to AWS-sourced events. They can also originate from your own workloads or on-premises environments. In principle, any event that can integrate with your operations management and is of importance to your workload health qualifies as an operational event.
Operational event management is a comprehensive process that provides efficient handling of events from start to finish. It involves notification, triage, progress tracking, action, and archiving and reporting at a large scale. The following is a breakdown of the typical tasks included in each step:

Notification of events:

Format notifications in a standardized, user-friendly way.
Dispatch notifications through instant messaging tools or emails.

Triage of events:

Filter out irrelevant or noise events based on predefined company policies.
Analyze the events’ impact by examining their metadata and textual description.
Convert events into actionable tasks and assign responsible owners based on roles and responsibilities.
Log tickets or page the appropriate personnel in the chosen ITSM tools.

Status tracking of events and actions:

Group related events into threads for straightforward management.
Update ticket statuses based on the progress of event threads and action owner updates.

Insights and reporting:

Query and consolidate knowledge across various event sources and tickets.
Create business intelligence (BI) dashboards for visual representation and analysis of event data.

A streamlined process should include steps to ensure that events are promptly detected, prioritized, acted upon, and documented for future reference and compliance purposes, enabling efficient operational event management at scale. However, traditional programmatic automation has limitations when handling multiple tasks. For instance, programmatic rules for event attribute-based noise filtering lack flexibility when faced with organizational changes, expansion of the service footprint, or new data source formats, leading to growing complexity.
Automating impact analysis in traditional automation through keyword matching on free-text descriptions is impractical. Converting events to tickets requires manual effort to generate action hints and lacks correlation to the originating events. Extracting event storylines from long, complex threads of event updates is challenging.
Let’s explore an AI-based solution to see how it can help address these challenges and improve productivity.
Solution overview
The solution uses AWS Health and AWS Security Hub findings as sources of operational events to demonstrate the workflow. It can be extended to incorporate additional types of operational events—from AWS or non-AWS sources—by following an event-driven architecture (EDA) approach.
The solution is designed to be fully serverless on AWS and can be deployed as infrastructure as code (IaC) by using the AWS Cloud Development Kit (AWS CDK).
Slack is used as the primary UI, but you can implement the solution using other messaging tools such as Microsoft Teams.
The cost of running and hosting the solution depends on the actual consumption of queries and the size of the vector store and the Amazon Kendra document libraries. See Amazon Bedrock pricing, Amazon OpenSearch pricing and Amazon Kendra pricing for pricing details.
The full code repository is available in the accompanying GitHub repo.
The following diagram illustrates the solution architecture.

Figure – solution architecture diagram
Solution walk-through
The solution consists of three microservice layers, which we discuss in the following sections.
Event processing layer
The event processing layer manages notifications, acknowledgments, and triage of actions. Its main logic is controlled by two key workflows implemented using Step Functions.

Event orchestration workflow – This workflow is subscribed to and invoked by operational events delivered to the main Amazon EventBridge hub. It sends HealthEventAdded or SecHubEventAdded events back to the main event hub, following the workflow shown in the figure below (a publishing sketch follows the two workflow figures).

Figure – Event orchestration workflow

Event notification workflow – This workflow formats notifications that are exchanged between Slack chat and backend microservices. It listens to control events such as HealthEventAdded and SecHubEventAdded.

Figure – Event notification workflow
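As an illustration of how a workflow step can publish one of these control events back to the main event hub, the boto3 sketch below sends a HealthEventAdded event. The bus name, source string, and detail payload are placeholders rather than the solution's exact schema.

import json
import boto3

events = boto3.client("events")

def publish_control_event(event_bus_name: str, health_event: dict) -> None:
    # Publish a HealthEventAdded control event to the main EventBridge hub.
    events.put_events(
        Entries=[
            {
                "EventBusName": event_bus_name,         # e.g., the solution's AiOps event bus (placeholder)
                "Source": "aiops.event-orchestration",  # assumed source name
                "DetailType": "HealthEventAdded",
                "Detail": json.dumps(health_event),
            }
        ]
    )
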
AI layer
The AI layer handles the interactions between Agents for Amazon Bedrock, Knowledge Bases for Amazon Bedrock, and the UI (Slack chat). It has several key components.
OpsAgent is an operations assistant powered by Anthropic Claude 3 Haiku on Amazon Bedrock. It reacts to operational events based on the event type and text descriptions. OpsAgent is supported by two other AI model endpoints on Amazon Bedrock with different knowledge domains. An action group is defined and attached to OpsAgent, allowing it to solve more complex problems by orchestrating the work of AI endpoints and taking actions such as creating tickets without human supervision.
OpsAgent is pre-prompted with required company policies and guidelines to perform event filtering, triage, and ITSM actions based on your requirements. See the sample escalation policy in the GitHub repo (between escalation_runbook tags).
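For illustration, the following boto3 sketch shows how an operational event description could be sent to a Bedrock agent such as OpsAgent. The agent IDs are placeholders and the prompt framing is an assumption, not the solution's actual integration code.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def triage_event(event_text: str, session_id: str) -> str:
    response = agent_runtime.invoke_agent(
        agentId="AGENT_ID",             # placeholder
        agentAliasId="AGENT_ALIAS_ID",  # placeholder
        sessionId=session_id,
        inputText=f"Triage this operational event according to company policy:\n{event_text}",
    )
    # The agent streams its answer back as chunked events.
    return "".join(
        part["chunk"]["bytes"].decode("utf-8")
        for part in response["completion"]
        if "chunk" in part
    )
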
OpsAgent uses two supporting AI model endpoints:

The events expert endpoint uses the Amazon Titan foundation model (FM) in Amazon Bedrock and Amazon OpenSearch Serverless to answer questions about operational events using Retrieval Augmented Generation (RAG).
The ask-aws endpoint uses the Amazon Titan model and Amazon Kendra as the RAG source. It contains the latest AWS documentation on selected topics. You must synchronize the Amazon Kendra data sources to ensure the underlying AI model is using the latest documentation. You can do this using the AWS Management Console after the solution is deployed.

These dedicated endpoints with specialized RAG data sources help break down complex tasks, improve accuracy, and make sure the correct model is used.
The AI layer also includes two AI orchestration Step Functions workflows. The workflows manage the AI agent, AI model endpoints, and the interaction with the user (through Slack chat):

The AI integration workflow defines how the operations assistant reacts to operational events based on the event type and the text descriptions of those events. The following figure illustrates the workflow.

Figure – AI integration workflow

The AI chatbot workflow manages the interaction between users and the OpsAgent assistant through a chat interface. The chatbot handles chat sessions and context.

Figure: AI chatbot workflow
Archiving and reporting layer
The archiving and reporting layer handles streaming, storing, and extracting, transforming, and loading (ETL) operational event data. It also prepares a data lake for BI dashboards and reporting analysis. However, this solution doesn’t include an actual dashboard implementation; it prepares an operational event data lake for later development.
Use case examples
You can use this solution for automated event notification, autonomous event acknowledgement, and action triage by setting up a virtual supervisor or operator that follows your organization’s policies. The virtual operator is equipped with multiple AI capabilities—each of which is specialized in a specific knowledge domain—such as generating recommended actions or taking actions to issue tickets in ITSM tools, as shown in the following figure.

Figure – use case example 1
The virtual event supervisor filters out noise based on your policies, as illustrated in the following figure.

Figure – use case example 2
AI can use the tickets that are related to a specific AWS Health event to provide the latest status updates on those tickets, as shown in the following figure.

Figure – use case example 3
The following figure shows how the assistant evaluates complex threads of operational events to provide valuable insights.

Figure – use case example 4
The following figure shows a more sophisticated use case.

Figure – use case example 5
Prerequisites
To deploy this solution, you must meet the following prerequisites:

Have at least one AWS account with permissions to create and manage the necessary resources and components for the application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?. The project uses a typical setup of two accounts, where one is the organization’s health administrator account and the other is the worker account hosting backend microservices. The worker account can be the same as the administrator account if you choose to use a single account setup.
Make sure you have access to Amazon Bedrock FMs in your preferred AWS Region in the worker account. The FMs used in the post are Anthropic Claude 3 Haiku, and Amazon Titan Text G1 – Premier.
Enable the AWS Health Organization view and delegate an administrator account in your AWS management account if you want to manage AWS Health events across your entire organization. Enabling AWS Health Organization view is optional if you only need to source operational events from a single account. Delegation of a separate administrator account for AWS Health is also optional if you want to manage all operational events from your AWS management account.
Enable AWS Security Hub in your AWS management account. Optionally, enable Security Hub with Organizations integration if you want to monitor security findings for the entire organization instead of just a single account.
Have a Slack workspace with permissions to configure a Slack app and set up a channel.
Install the AWS CDK in your local environment and bootstrap it in your AWS accounts; it will be used for solution deployment into the administration account and worker account.
Have AWS Serverless Application Model (AWS SAM) and Docker installed in your development environment to build AWS Lambda packages.

Create a Slack app and set up a channel
Set up Slack:

Create a Slack app from the manifest template, using the content of the slack-app-manifest.json file from the GitHub repository.
Install your app into your workspace, and take note of the Bot User OAuth Token value to be used in later steps.
Take note of the Verification Token value under Basic Information of your app; you will need it in later steps.
In your Slack desktop app, go to your workspace and add the newly created app.
Create a Slack channel and add the newly created app as an integrated app to the channel.
Find and take note of the channel ID by choosing (right-clicking) the channel name, choosing Additional options to access the More menu, and choosing Open details to see the channel details.

Prepare your deployment environment
Use the following commands to ready your deployment environment for the worker account. Make sure you aren’t running the command under an existing AWS CDK project root directory. This step is required only if you chose a worker account that’s different from the administration account:

# Make sure your shell session environment is configured to access the worker
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
# Note that in this step you are bootstrapping your worker account in such a way
# that your administration account is trusted to execute CloudFormation deployment in
# your worker account, the following command uses an example execution role policy of ‘AdministratorAccess’,
# you can swap it for other policies of your own for least privilege best practice,
# for more information on the topic, refer to https://repost.aws/knowledge-center/cdk-customize-bootstrap-cfntoolkit
cdk bootstrap aws://<replace with your AWS account id of the worker account>/<replace with the region where your worker services is> --trust <replace with your AWS account id of the administration account> --cloudformation-execution-policies 'arn:aws:iam::aws:policy/AdministratorAccess' --trust-for-lookup <replace with your AWS account id of the administration account>

Use the following commands to ready your deployment environment for the administration account. Make sure you aren’t running the commands under an existing AWS CDK project root directory:

# Make sure your shell session environment is configured to access the administration
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
# Note ‘us-east-1’ region is required for receiving AWS Health events associated with
# services that operate in AWS global region.
cdk bootstrap <replace with your AWS account id of the administration account>/us-east-1

# Optional, if you have your cloud infrastructures hosted in other AWS regions than ‘us-east-1’,
# repeat the below commands for each region
cdk bootstrap <replace with your AWS account id of the administration account>/<replace with the region name, e.g. us-west-2>

Copy the GitHub repo to your local directory
Use the following commands to clone the GitHub repo to your local directory:

git clone https://github.com/aws-samples/ops-health-ai.git
cd ops-health-ai
npm install
cd lambda/src
# Depending on your build environment, you might want to change the arch type to 'x86'
# or 'arm' in lambda/src/template.yaml file before build
sam build --use-container
cd ../..

Create an .env file
Create an .env file containing the following code under the project root directory. Replace the variable placeholders with your account information:

CDK_ADMIN_ACCOUNT=<replace with your 12 digits administration AWS account id>
CDK_PROCESSING_ACCOUNT=<replace with your 12 digits worker AWS account id. This account id is the same as the admin account id if using single account setup>
EVENT_REGIONS=us-east-1,<region 1 of where your infrastructures are hosted>,<region 2 of where your infrastructures are hosted>
CDK_PROCESSING_REGION=<replace with the region where you want the worker services to be, e.g. us-east-1>
EVENT_HUB_ARN=arn:aws:events:<replace with the worker service region>:<replace with the worker service account id>:event-bus/AiOpsStatefulStackAiOpsEventBus
SLACK_CHANNEL_ID=<your Slack channel ID noted down from earlier step>
SLACK_APP_VERIFICATION_TOKEN=<replace with your Slack app verification token>
SLACK_ACCESS_TOKEN=<replace with your Slack Bot User OAuth Token value>

Deploy the solution using the AWS CDK
Deploy the processing microservice to your worker account (the worker account can be the same as your administrator account):

In the project root directory, run the following command: cdk deploy --all --require-approval never
Capture the HandleSlackCommApiUrl stack output URL.
Go to your Slack app and navigate to Event Subscriptions to change the Request URL.
Update the URL value with the stack output URL and save your settings.

Test the solution
Test the solution by sending a mock operational event to your administration account. Run the following AWS Command Line Interface (AWS CLI) command: aws events put-events --entries file://test-events/mockup-events.json
You will receive Slack messages notifying you about the mock event, followed by an automatic update from the AI assistant reporting the actions it took and the reasons for each action. You don’t need to manually choose Accept or Discharge for each event.
Try creating more mock events based on your past operational events and test them with the use cases described in the Use case examples section.
If you have just enabled AWS Security Hub in your administrator account, you might need to wait for up to 24 hours for any findings to be reported and acted on by the solution. AWS Health events, on the other hand, will be reported whenever applicable.
Clean up
To clean up your resources, run the following command in the CDK project directory: cdk destroy --all
Conclusion
This solution uses AI to help you automate complex tasks in cloud operational event management, bringing new opportunities to further streamline cloud operations at scale with improved productivity and operational resilience.
To learn more about the AWS services used in this solution, see:

Concepts for AWS Health
Understanding how findings work in Security Hub
Agents for Amazon Bedrock

About the author
Sean Xiaohai Wang is a Senior Technical Account Manager at Amazon Web Services. He helps enterprise customers build and operate efficiently on AWS.

How Indeed builds and deploys fine-tuned LLMs on Amazon SageMaker

This post is cowritten with Ethan Handel and Zhiyuan He from Indeed.com.
Indeed is the world’s #1 job site¹ and a leading global job matching and hiring marketplace. Our mission is to help people get jobs. At Indeed, we serve over 350 million global Unique Visitors  monthly² across more than 60 countries, powering millions of connections to new job opportunities every day. Since our founding nearly two decades ago, machine learning (ML) and artificial intelligence (AI) have been at the heart of building data-driven products that better match job seekers with the right roles and get people hired.
On the Core AI team at Indeed, we embody this legacy of AI innovation by investing heavily in HR domain research and development. We provide teams across the company with production-ready, fine-tuned large language models (LLMs) based on state-of-the-art open source architectures. In this post, we describe how using the capabilities of Amazon SageMaker has accelerated Indeed’s AI research, development velocity, flexibility, and overall value in our pursuit of using Indeed’s unique and vast data to leverage advanced LLMs.
Infrastructure challenges
Indeed’s business is fundamentally text-based. Indeed generates 320 terabytes of data daily³, which is uniquely valuable due to its breadth and the ability to connect elements like job descriptions and resumes and match them to the actions and behaviors that drive a key company metric: a successful hire. LLMs represent a significant opportunity to improve how job seekers and employers interact in Indeed’s marketplace, with use cases such as match explanations, job description generation, match labeling, resume or job description skill extraction, and career guides, among others.
Last year, the Core AI team evaluated if Indeed’s HR domain-specific data could be used to fine-tune open source LLMs to enhance performance on particular tasks or domains. We chose the fine-tuning approach to best incorporate Indeed’s unique knowledge and vocabulary around mapping the world of jobs. Other strategies like prompt tuning or Retrieval Augmented Generation (RAG) and pre-training models were initially less appropriate due to context window limitations and cost-benefit trade-offs.
The Core AI team’s objective was to explore solutions that addressed the specific needs of Indeed’s environment by providing high performance for fine-tuning, minimal effort for iterative development, and a pathway for future cost-effective production inference. Indeed was looking for a solution that addressed the following challenges:

How do we efficiently set up repeatable, low-overhead patterns for fine-tuning open-source LLMs?
How can we provide production LLM inference at Indeed’s scale with favorable latency and costs?
How do we efficiently onboard early products with different request and inference patterns?

The following sections discuss how we addressed each challenge.
Solution overview
Ultimately, Indeed’s Core AI team converged on the decision to use Amazon SageMaker to solve for the aforementioned challenges and meet the following requirements:

Accelerate fine-tuning using Amazon SageMaker
Serve production traffic quickly using Amazon SageMaker inference
Enable Indeed to serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference capabilities (inference components)

Accelerate fine-tuning using Amazon SageMaker
One of the primary challenges that we faced was achieving efficient fine-tuning. Initially, Indeed’s Core AI team manually set up raw Amazon Elastic Compute Cloud (Amazon EC2) instances and configured training environments. Scientists had to manage personal development accounts and GPU schedules, leading to development overhead and resource under-utilization. To address these challenges, we used Amazon SageMaker to initiate and manage training jobs efficiently. Transitioning to Amazon SageMaker provided several advantages:

Resource optimization – Amazon SageMaker offered better instance availability and billed only for the actual training time, reducing costs associated with idle resources
Ease of setup – We no longer needed to worry about the setup required for running training jobs, simplifying the process significantly
Scalability – The Amazon SageMaker infrastructure allowed us to scale our training jobs efficiently, accommodating the growing demands of our LLM fine-tuning efforts

Smoothly serve production traffic using Amazon SageMaker inference
To better serve Indeed users with LLMs, we standardized the request and response formats across different models by employing open source software as an abstraction layer. This layer converted the interactions into a standardized OpenAI format, simplifying integration with various services and providing consistency in model interactions.
We built an inference infrastructure using Amazon SageMaker inference to host fine-tuned Indeed in-house models. The Amazon SageMaker infrastructure provided a robust service for deploying and managing models at scale. We deployed different specialized models on Amazon SageMaker inference endpoints. Amazon SageMaker supports various inference frameworks; we chose the Text Generation Inference (TGI) framework from Hugging Face for flexibility in access to the latest open source models.
The setup on Amazon SageMaker inference has enabled rapid iteration, allowing Indeed to experiment with over 20 different models in a month. Furthermore, the robust infrastructure is capable of hosting dynamic production traffic, handling up to 3 million requests per day.
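As a rough illustration of this deployment pattern, the SageMaker Python SDK sketch below stands up a TGI-backed real-time endpoint for a fine-tuned model. The model artifact location, endpoint name, instance type, and environment values are placeholders and will differ from Indeed's production configuration.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()                      # assumes a SageMaker execution context
image_uri = get_huggingface_llm_image_uri("huggingface")   # Hugging Face TGI serving image

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data="s3://example-bucket/fine-tuned-model.tar.gz",  # placeholder artifact
    env={
        "HF_MODEL_ID": "/opt/ml/model",   # serve the fine-tuned weights unpacked from model_data
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",       # illustrative; larger models may need p4d instances
    endpoint_name="llm-demo-endpoint",    # placeholder
)
print(predictor.predict({"inputs": "Extract the skills mentioned in this job description: ..."}))
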
The following architecture diagram showcases the interaction between Indeed’s application and Amazon SageMaker inference endpoints.

Serve a variety of production use cases with flexibility using Amazon SageMaker generative AI inference components
Results from LLM fine-tuning revealed performance benefits. The final challenge was quickly implementing the capability to serve production traffic to support real, high-volume production use cases. Given the applicability of our models to meet use cases across the HR domain, our team hosted multiple different specialty models for various purposes. Most models didn’t necessitate the extensive resources of an 8-GPU p4d instance but still required the latency benefits of A100 GPUs.
Amazon SageMaker recently introduced a new feature called inference components that significantly enhances the efficiency of deploying multiple ML models to a single endpoint. This innovative capability allows for the optimal placement and packing of models onto ML instances, resulting in an average cost savings of up to 50%. The inference components abstraction enables users to assign specific compute resources, such as CPUs, GPUs, or AWS Neuron accelerators, to each individual model. This granular control allows for more efficient utilization of computing power, because Amazon SageMaker can now dynamically scale each model up or down based on the configured scaling policies. Furthermore, the intelligent scaling offered by this capability automatically adds or removes instances as needed, making sure that capacity is met while minimizing idle compute resources. This flexibility extends the ability to scale a model down to zero copies, freeing up valuable resources when demand is low. This feature empowers generative AI and LLM inference to optimize their model deployment costs, reduce latency, and manage multiple models with greater agility and precision. By decoupling the models from the underlying infrastructure, inference components offer a more efficient and cost-effective way to use the full potential of Amazon SageMaker inference.
Amazon SageMaker inference components allowed Indeed’s Core AI team to deploy different models to the same instance with the desired copies of a model, optimizing resource usage. By consolidating multiple models on a single instance, we created the most cost-effective LLM solution available to Indeed product teams. Furthermore, with inference components now supporting dynamic auto scaling, we could optimize the deployment strategy. This feature automatically adjusts the number of model copies based on demand, providing even greater efficiency and cost savings, even compared to third-party LLM providers.
Since integrating inference components into the inference design, Indeed’s Core AI team has built and validated LLMs that have served over 6.5 million production requests.
The following figure illustrates the internals of the Core AI’s LLM server.

The simplicity of our Amazon SageMaker setup significantly improves deployment speed and flexibility. Today, we deploy Amazon SageMaker models using the Hugging Face TGI image in a custom Docker container, giving Indeed instant access to over 18 open source model families.
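Indeed's custom container isn't shared in this post, but as a rough sketch of the stock TGI path with the SageMaker Python SDK, a deployment might look like the following; the model ID, instance type, and token limits are illustrative choices, not Indeed's.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface")  # latest Hugging Face TGI container for SageMaker

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",  # illustrative open model; swap in your own fine-tuned weights
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Extract the skills mentioned in this job description: ..."}))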
The following diagram illustrates Indeed’s Core AI flywheel.

Core AI’s business value from Amazon SageMaker
The seamless integration of Amazon SageMaker inference components, coupled with our team’s iterative enhancements, has accelerated our path to value. We can now swiftly deploy and fine-tune our models, while benefiting from robust scalability and cost-efficiency—a significant advantage in our pursuit of delivering cutting-edge HR solutions to our customers.
Maximize performance
High-velocity research enables Indeed to iterate on fine-tuning approaches to maximize performance. We have fine-tuned over 75 models to advance research and production objectives.
We can quickly validate and improve our fine-tuning methodology with many open source LLMs. For instance, based on empirical performance improvements, we moved from fine-tuning base foundation models (FMs) with third-party instruction data to fine-tuning instruction-tuned FMs.
For our unique purposes, our portfolio of LLMs performs at parity with or better than the most popular general third-party models across 15 HR domain-specific tasks. For specific HR domain tasks like extracting skill attributes from resumes, fine-tuning yields a 4–5 times performance improvement over general-domain third-party models and a notable increase in HR marketplace functionality.
The following figure illustrates Indeed’s inference continuous integration and delivery (CI/CD) workflow.

The following figure presents some task examples.

High flexibility
Flexibility allows Indeed to be at the frontier of LLM technology. We can deploy and test the latest state-of-the-art open science models on our scalable Amazon SageMaker inference infrastructure immediately upon availability. When Meta launched the Llama 3 model family in April 2024, these FMs were deployed within the day, enabling Indeed to start research and provide early testing for teams across Indeed. Within weeks, we fine-tuned our best-performing model to date and released it. The following figure illustrates an example task.

Production scale
LLMs developed by Core AI have already served 6.5 million live production requests with a single p4d instance and a p99 latency of under 7 seconds.
Cost-efficiency
Each LLM request through Amazon SageMaker is on average 67% cheaper than the prevailing third-party vendor model’s on-demand pricing in early 2024, creating the potential for significant cost savings.
Indeed’s contributions to Amazon SageMaker inference: Enhancing generative AI inference capabilities
Building on the success of this use case, Indeed has partnered closely with the Amazon SageMaker inference team, providing input that helps AWS build and enhance key generative AI capabilities within Amazon SageMaker. Since the early days of the engagement, Indeed has given the Amazon SageMaker inference team valuable input to improve our offerings. The features and optimizations introduced through this collaboration are empowering other AWS customers to unlock the transformative potential of generative AI with greater ease, cost-effectiveness, and performance.

“Amazon SageMaker inference has enabled Indeed to rapidly deploy high-performing HR domain generative AI models, powering millions of users seeking new job opportunities every day. The flexibility, partnership, and cost-efficiency of Amazon SageMaker inference has been valuable in supporting Indeed’s efforts to leverage AI to better serve our users.”
– Ethan Handel, Senior Product Manager at Indeed.

Conclusion
Indeed’s implementation of Amazon SageMaker inference components has been instrumental in solidifying the company’s position as an AI leader in the HR industry. Core AI now has a robust service landscape that enhances the company’s ability to develop and deploy AI solutions tailored to the HR industry. With Amazon SageMaker, Indeed has successfully built and integrated HR domain LLMs that significantly improve job matching processes and other aspects of Indeed’s marketplace.
The flexibility and scalability of Amazon SageMaker inference components have empowered Indeed to stay ahead of the curve, continually adapting its AI-driven solutions to meet the evolving needs of job seekers and employers worldwide. This strategic partnership underscores the transformative potential of integrating advanced AI capabilities, like those offered by Amazon SageMaker inference components, into core business operations to drive efficiency and innovation.
¹ Comscore, Unique Visitors, June 2024
² Indeed internal data, average monthly unique visitors, October 2023 – March 2024
³ Indeed data

About the Authors
Ethan Handel is a Senior Product Manager at Indeed, based in Austin, TX. He specializes in generative AI research and development and applied data science products, unlocking new ways to help people get jobs across the world every day. He loves solving big problems and innovating with how Indeed gets value from data. Ethan also loves being a dad of three, is an avid photographer, and loves everything automotive.
Zhiyuan He is a Staff Software Engineer at Indeed, based in Seattle, WA. He leads a dynamic team that focuses on all aspects of utilizing LLM at Indeed, including fine-tuning, evaluation, and inferencing, enhancing the job search experience for millions globally. Zhiyuan is passionate about tackling complex challenges and is exploring creative approaches.
Alak Eswaradass is a Principal Solutions Architect at AWS based in Chicago, IL. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges and is enthusiastic about solving a variety of ML use cases for AWS customers. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, multi-tenant models, cost optimizations, and making deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Brett Seib is a Senior Solutions Architect, based in Austin, Texas. He is passionate about innovating and using technology to solve business challenges for customers. Brett has several years of experience in the enterprise, Artificial Intelligence (AI), and data analytics industries, accelerating business outcomes.

Improve LLM application robustness with Amazon Bedrock Guardrails and …

Agentic workflows offer a fresh perspective on building dynamic and complex workflows for business use cases, with large language models (LLMs) serving as their reasoning engine. These agentic workflows decompose natural language query-based tasks into multiple actionable steps, with iterative feedback loops and self-reflection, to produce the final result using tools and APIs. This naturally warrants measuring and evaluating the robustness of these workflows, in particular against inputs that are adversarial or harmful in nature.
Amazon Bedrock Agents can break down natural language conversations into a sequence of tasks and API calls by applying ReAct and chain-of-thought (CoT) prompting techniques with LLMs. This offers tremendous use case flexibility, enables dynamic workflows, and reduces development cost. Amazon Bedrock Agents helps you customize and tailor apps to meet specific project requirements while protecting private data and securing your applications. These agents work with AWS managed infrastructure capabilities and Amazon Bedrock, reducing infrastructure management overhead.
Although Amazon Bedrock Agents have built-in mechanisms to help avoid general harmful content, you can incorporate a custom, user-defined fine-grained mechanism with Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides additional customizable safeguards on top of the built-in protections of foundation models (FMs), delivering safety protections that are among the best in the industry by blocking harmful content and filtering hallucinated responses for Retrieval Augmented Generation (RAG) and summarization workloads. This enables you to customize and apply safety, privacy, and truthfulness protections within a single solution.
In this post, we demonstrate how you can identify and improve the robustness of Amazon Bedrock Agents when integrated with Amazon Bedrock Guardrails for domain-specific use cases.
Solution overview
In this post, we explore a sample use case for an online retail chatbot. The chatbot requires dynamic workflows for use cases like searching for and purchasing shoes based on customer preferences using natural language queries. To implement this, we build an agentic workflow using Amazon Bedrock Agents.
To test its adversarial robustness, we then prompt this bot to give fiduciary advice regarding retirement. We use this example to demonstrate the robustness concern, then show how adding Amazon Bedrock Guardrails to the agentic workflow helps prevent the bot from giving such advice.
In this implementation, the preprocessing stage of the agent (the first stage of the agentic workflow, before the LLM is invoked) is turned off by default. Even with preprocessing turned on, there is usually a need for more fine-grained, use case-specific control over what can be marked as safe and acceptable. In this example, a retail agent for shoes giving away fiduciary advice is definitely out of scope of the product use case and may be detrimental, resulting in customers losing trust, among other safety concerns.
Another typical fine-grained robustness control requirement could be to restrict personally identifiable information (PII) from being generated by these agentic workflows. We can configure and set up Amazon Bedrock Guardrails in Amazon Bedrock Agents to deliver improved robustness against such regulatory compliance cases and custom business needs without the need for fine-tuning LLMs.
The following diagram illustrates the solution architecture.

We use the following AWS services:

Amazon Bedrock to invoke LLMs
Amazon Bedrock Agents for the agentic workflows
Amazon Bedrock Guardrails to deny adversarial inputs
AWS Identity and Access Management (IAM) for permission control across various AWS services
AWS Lambda for business API implementation
Amazon SageMaker to host Jupyter notebooks and invoke the Amazon Bedrock Agents API

In the following sections, we demonstrate how to use the GitHub repository to run this example using three Jupyter notebooks.
Prerequisites
To run this demo in your AWS account, complete the following prerequisites:

Create an AWS account if you don’t already have one.
Clone the GitHub repository and follow the steps explained in the README.
Set up a SageMaker notebook using an AWS CloudFormation template, available in the GitHub repo. The CloudFormation template also provides the required IAM access to set up SageMaker resources and Lambda functions.
Acquire access to models hosted on Amazon Bedrock. Choose Manage model access in the navigation pane on the Amazon Bedrock console and choose from the list of available options. We use Anthropic Claude 3 Haiku on Amazon Bedrock and Amazon Titan Embeddings Text v1 on Amazon Bedrock for this post.

Create a guardrail
In the Part 1a notebook, complete the following steps to create a guardrail to help prevent the chatbot from providing fiduciary advice:

Create a guardrail with Amazon Bedrock Guardrails using the Boto3 API, with content filters, word and phrase filters, and sensitive information filters such as PII and regular expressions (regex) to protect our retail customers' sensitive information.
List and create guardrail versions.
Update the guardrails.
Perform unit testing on the guardrails.
Note the guardrail-id and guardrail-arn values to use in Part 1c:

create_response = client.create_guardrail(
    name=guardrail_name,
    description='Prevents our model from providing fiduciary advice.',
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'Fiduciary Advice',
                'definition': 'Providing personalized advice or recommendations on managing financial assets, investments, or trusts in a fiduciary capacity or assuming related obligations and liabilities.',
                'examples': [
                    'What stocks should I invest in for my retirement?',
                    'Is it a good idea to put my money in a mutual fund?',
                    'How should I allocate my 401(k) investments?',
                    'What type of trust fund should I set up for my children?',
                    'Should I hire a financial advisor to manage my investments?'
                ],
                'type': 'DENY'
            }
        ]
    },
    # ... remaining filter configurations omitted
)
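Steps 2 and 4 in the preceding list can also be exercised directly through the API; the following is a minimal sketch that assumes client is the bedrock control-plane client used above and creates a separate bedrock-runtime client for unit testing.

import boto3

# Step 2: create an immutable version of the guardrail
version_response = client.create_guardrail_version(
    guardrailIdentifier=create_response['guardrailId'],
    description='Blocks fiduciary advice',
)

# Step 4: unit test the guardrail on a sample adversarial input
bedrock_runtime = boto3.client('bedrock-runtime')
result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier=create_response['guardrailId'],
    guardrailVersion=version_response['version'],
    source='INPUT',
    content=[{'text': {'text': 'What stocks should I invest in for my retirement?'}}],
)
print(result['action'])  # expect GUARDRAIL_INTERVENED for the denied topic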

Test the use case without guardrails
In the Part 1b notebook, complete the following steps to run the use case with Amazon Bedrock Agents, without Amazon Bedrock Guardrails and without preprocessing, to demonstrate the adversarial robustness problem:

Choose the underlying FM for your agent.
Provide a clear and concise agent instruction.
Create and associate an action group with an API schema and Lambda function.
Create, invoke, test, and deploy the agent.
Demonstrate a chat session with multi-turn conversations.

The agent instruction is as follows:

“You are an agent that helps customers purchase shoes. If the customer does not provide their name in the first input, ask for their name before invoking any functions.
Retrieve customer details like customer ID and preferred activity based on the name.
Then check inventory for shoe best fit activity matching customer preferred activity.
Generate response with shoe ID, style description and colors based on shoe inventory details.
If multiple matches exist, display all of them to the user.
After customer indicates they would like to order the shoe, use the shoe ID corresponding to their choice and
customer ID from initial customer details received, to place order for the shoe.”

A valid user query would be “Hello, my name is John Doe. I am looking to buy running shoes. Can you elaborate more about Shoe ID 10?” However, without Amazon Bedrock Guardrails, the agent also provides fiduciary advice for queries like the following:

“How should I invest for my retirement? I want to be able to generate $5,000 a month.”
“How do I make money to prepare for my retirement?”
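To reproduce such a multi-turn chat session programmatically, the agent can be invoked through the bedrock-agent-runtime client; the following minimal sketch uses placeholder agent and alias IDs from the deployment step.

import uuid
import boto3

agent_runtime = boto3.client('bedrock-agent-runtime')

def chat(agent_id, agent_alias_id, session_id, text):
    """Send one turn to the agent and assemble the streamed completion."""
    response = agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=text,
    )
    return ''.join(
        event['chunk']['bytes'].decode('utf-8')
        for event in response['completion']
        if 'chunk' in event
    )

session_id = str(uuid.uuid4())
print(chat('<agent-id>', '<agent-alias-id>', session_id,
           'Hello, my name is John Doe. I am looking to buy running shoes.'))
print(chat('<agent-id>', '<agent-alias-id>', session_id,
           'How should I invest for my retirement?'))
# In Part 1b the agent answers the second query; in Part 1c the guardrail intervenes.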

Test the use case with guardrails
In the Part 1c notebook, repeat the steps from Part 1b, but now use Amazon Bedrock Agents with guardrails (and still no preprocessing) to evaluate and improve adversarial robustness by disallowing fiduciary advice. The complete steps are as follows:

Choose the underlying FM for your agent.
Provide a clear and concise agent instruction.
Create and associate an action group with an API schema and Lambda function.
During the configuration setup of Amazon Bedrock Agents in this example, associate the guardrail created previously in Part 1a with this agent.
Create, invoke, test, and deploy the agent.
Demonstrate a chat session with multi-turn conversations.

To associate a guardrail-id with an agent during creation, we can use the following code snippet:

gconfig = {
    'guardrailIdentifier': 'an9l3icjg3kj',
    'guardrailVersion': 'DRAFT'
}

response = bedrock_agent_client.create_agent(
    agentName=agent_name,
    agentResourceRoleArn=agent_role['Role']['Arn'],
    description='Retail agent for shoe purchase.',
    idleSessionTTLInSeconds=3600,
    foundationModel='anthropic.claude-3-haiku-20240307-v1:0',
    instruction=agent_instruction,
    guardrailConfiguration=gconfig,
)

As we would expect, our retail chatbot now declines to answer such queries because they have no relationship to its purpose in our use case.
Cost considerations
The following are important cost considerations:

There are no separate charges for building resources using Amazon Bedrock Agents.
You will incur charges for embedding model and text model invocations on Amazon Bedrock; generating text and embeddings with the agent is charged according to the price of each FM. For more details, refer to Amazon Bedrock pricing.
You will incur charges for Amazon Bedrock Guardrails. For more details, see Amazon Bedrock pricing.
You will incur charges for storing files in Amazon Simple Storage Service (Amazon S3). For more details, see Amazon S3 pricing.
You will incur charges for your SageMaker instance, Lambda function, and AWS CloudFormation usage. For more details, see Amazon SageMaker pricing, AWS Lambda pricing, and AWS CloudFormation pricing.

Clean up
For the Part 1b and Part 1c notebooks, the implementation automatically cleans up resources after a complete run of a notebook to avoid incurring recurring costs. See the Clean-up Resources section of the notebooks for how to skip the automatic cleanup and experiment with different prompts.
The order of cleanup is as follows:

Disable the action group.
Delete the action group.
Delete the alias.
Delete the agent.
Delete the Lambda function.
Empty the S3 bucket.
Delete the S3 bucket.
Delete IAM roles and policies.

You can delete guardrails from the Amazon Bedrock console or through the API. You will not be charged for the guardrails unless they are invoked, as they are through the agents in this demo. For more details, see Delete a guardrail.
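For the API route, a single call removes the guardrail and all of its versions once you're finished experimenting; guardrail_id is the value noted in Part 1a.

import boto3

bedrock = boto3.client('bedrock')
bedrock.delete_guardrail(guardrailIdentifier=guardrail_id)  # removes the guardrail and all of its versions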
Conclusion
In this post, we demonstrated how Amazon Bedrock Guardrails can improve the robustness of the agent framework. We were able to stop our chatbot from responding to non-relevant queries and protect personal information from our customers, ultimately improving the robustness of our agentic implementation with Amazon Bedrock Agents.
In general, the preprocessing stage of Amazon Bedrock Agents can intercept and reject adversarial inputs, but guardrails can help block prompts that are highly specific to a topic or use case (such as PII or HIPAA rules) that the LLM hasn't seen before, without the need to fine-tune the LLM.
To learn more about creating models with Amazon Bedrock, see Customize your model to improve its performance for your use case. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.
Acknowledgements
The author thanks all the reviewers for their valuable feedback.

About the Author
Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.