A New AI Approach for Estimating Causal Effects Using Neural Networks

Have you ever wondered how we can determine the true impact of a particular intervention or treatment on certain outcomes? This is a crucial question in fields like medicine, economics, and social sciences, where understanding cause-and-effect relationships is essential. Researchers have long grappled with this challenge, known as the “Fundamental Problem of Causal Inference”: when we observe an outcome under one intervention, we typically don’t know what would have happened under an alternative one. This issue has led to the development of various indirect methods to estimate causal effects from observational data.

Some existing approaches include the S-Learner, which trains a single model with the treatment variable as a feature, and the T-Learner, which fits separate models for treated and untreated groups. However, these methods can suffer from issues like bias towards zero treatment effect (S-Learner) and data efficiency problems (T-Learner).
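
To make the distinction concrete, below is a minimal sketch of the two meta-learners using scikit-learn. The gradient-boosting estimator and the synthetic data are illustrative assumptions, not details from the paper.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                              # covariates
t = rng.integers(0, 2, size=n)                           # binary treatment
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)    # outcome with a true effect of 2

# S-Learner: a single model that treats the treatment indicator as just another feature
s_model = GradientBoostingRegressor().fit(np.column_stack([X, t]), y)
cate_s = (s_model.predict(np.column_stack([X, np.ones(n)]))
          - s_model.predict(np.column_stack([X, np.zeros(n)])))

# T-Learner: separate models fit on the treated and untreated subsets
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate_t = m1.predict(X) - m0.predict(X)

print("S-Learner ATE estimate:", cate_s.mean())
print("T-Learner ATE estimate:", cate_t.mean())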

More sophisticated methods like TARNet, Dragonnet, and BCAUSS have emerged, leveraging the concept of representation learning with neural networks. These models typically consist of a pre-representation component that learns representations from the input data and a post-representation component that maps these representations to the desired output.

While these representation-based approaches have shown promising results, they often overlook a particular source of bias: spurious interactions (see Table 1) between variables within the model. But what exactly are spurious interactions, and why are they problematic? Imagine you’re trying to estimate the causal effect of a treatment on an outcome while considering various other factors (covariates) that might influence the outcome. In some cases, the neural network might detect and rely on interactions between variables that don’t actually have a causal relationship. These spurious interactions can act as correlational shortcuts, distorting the estimated causal effects, especially when data is limited.

Reference: https://arxiv.org/pdf/2404.12238.pdf

To address this issue, researchers from the Universitat de Barcelona have proposed a novel method called Neural Networks with Causal Graph Constraints (NN-CGC). The core idea behind NN-CGC is to constrain the learned distribution of the neural network to better align with the causal model, effectively reducing the reliance on spurious interactions.

Here’s a simplified explanation of how NN-CGC works:

Variable Grouping: The input variables are divided into groups based on the causal graph (or expert knowledge if the causal graph is unavailable). Each group contains variables that are causally related to each other as shown in Figure 1.

Independent Causal Mechanisms: Each variable group is processed independently through a set of layers, modeling the Independent Causal Mechanisms for the outcome variable and its direct causes.

Constraining Interactions: By processing each variable group separately, NN-CGC ensures that the learned representations are free from spurious interactions between variables from different groups.

Post-representation: The outputs from the independent group representations are combined and passed through a linear layer to form the final representation. This final representation can then be fed into the output heads of existing architectures like TARNet, Dragonnet, or BCAUSS.

By incorporating causal constraints in this manner, NN-CGC aims to mitigate the bias introduced by spurious variable interactions, leading to more accurate causal effect estimations.
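
For intuition, here is a minimal PyTorch sketch of the constrained pre-representation described above. The group assignments, layer sizes, and TARNet-style output heads are illustrative assumptions rather than the authors’ exact implementation.

import torch
import torch.nn as nn

class ConstrainedRepresentation(nn.Module):
    """Process each causal variable group independently, then combine linearly."""
    def __init__(self, groups, hidden=32, rep_dim=64):
        super().__init__()
        self.groups = groups  # list of covariate-index lists, e.g. [[0, 1], [2, 3, 4]]
        self.group_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(len(g), hidden), nn.ELU(),
                          nn.Linear(hidden, hidden), nn.ELU())
            for g in groups
        ])
        # Linear post-representation: combines group outputs without creating
        # new nonlinear (potentially spurious) cross-group interactions
        self.mix = nn.Linear(hidden * len(groups), rep_dim)

    def forward(self, x):
        parts = [net(x[:, g]) for net, g in zip(self.group_nets, self.groups)]
        return self.mix(torch.cat(parts, dim=-1))

# Example: 5 covariates split into two causally related groups (assumed grouping)
rep = ConstrainedRepresentation(groups=[[0, 1], [2, 3, 4]])
phi = rep(torch.randn(8, 5))                             # shared representation
head_t0, head_t1 = nn.Linear(64, 1), nn.Linear(64, 1)    # TARNet-style outcome heads
y0_hat, y1_hat = head_t0(phi), head_t1(phi)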

The researchers evaluated NN-CGC on various synthetic and semi-synthetic benchmarks, including the well-known IHDP and JOBS datasets. The results are quite promising: across multiple scenarios and metrics (like PEHE and ATE), the constrained versions of TARNet, Dragonnet, and BCAUSS (combined with NN-CGC) consistently outperformed their unconstrained counterparts, achieving new state-of-the-art performance.

One interesting observation is that in high-noise environments, the unconstrained models sometimes performed better than the constrained ones. This suggests that in such cases, the constraints might be discarding some causally valid information alongside the spurious interactions.

Overall, NN-CGC presents a novel and flexible approach to incorporating causal information into neural networks for causal effect estimation. By addressing the often-overlooked issue of spurious interactions, it demonstrates significant improvements over existing methods. The researchers have made their code openly available, allowing others to build upon and refine this promising technique.


DeepMind Researchers Propose Naturalized Execution Tuning (NExT): A Self-Training Machine Learning Method that Drastically Improves the LLM’s Ability to Reason about Code Execution

Understanding and reasoning about program execution is a critical skill for developers, often applied during tasks like debugging and code repair. Traditionally, developers simulate code execution mentally or through debugging tools to identify and fix errors. Despite their sophistication, large language models (LLMs) trained on code have struggled to grasp the deeper, semantic aspects of program execution beyond the superficial textual representation of code. This limitation often affects their performance in complex software engineering tasks, such as program repair, where understanding the execution flow of a program is essential.

Existing research in AI-driven software development includes several frameworks and models focused on enhancing code execution reasoning. Notable examples include CrossBeam, which leverages execution states in sequence-to-sequence models, and specialized neural architectures like the instruction pointer attention graph neural networks. Other approaches, such as the differentiable Forth interpreter and Scratchpad, integrate execution traces directly into model training to improve program synthesis and debugging capabilities. These methods pave the way for advanced reasoning about code, focusing on both the process and the dynamic states of execution within programming environments.

Researchers from Google DeepMind, Yale University, and the University of Illinois have proposed NExT, which introduces a novel approach by teaching LLMs to interpret and utilize execution traces, enabling more nuanced reasoning about program behavior during runtime. This method stands apart due to its incorporation of detailed runtime data directly into model training, fostering a deeper semantic understanding of code. By embedding execution traces as inline comments, NExT allows models to access crucial contexts that traditional training methods often overlook, making the generated rationales for code fixes more accurate and grounded in actual code execution.
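
To give a feel for the idea, the sketch below shows a buggy Python function annotated with a hypothetical execution trace rendered as inline comments, in the spirit of what NExT exposes to the model; the exact trace format used in the paper differs.

def average(values):
    total = 0
    for v in values:                        # trace: values=[2, 4, 6] -> v takes 2, 4, 6
        total += v                          # trace: total=2, then 6, then 12
    return total / (len(values) - 1)        # trace: returns 12 / 2 = 6.0, expected 4.0

# With the runtime values visible, a trace-aware model can ground its rationale in the
# observed divisor (len(values) - 1 = 2 instead of 3) and propose the fix:
#     return total / len(values)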

The methodology of NExT utilizes a self-training loop to refine the model’s ability to generate execution-aware rationales. Initially, execution traces are synthesized with proposed code fixes in a dataset, where each trace details variable states and their changes during execution. Using the PaLM 2 model from Google, the method evaluates performance on tasks such as program repair, significantly enhancing model accuracy with repeated iterations. Datasets include Mbpp-R and HumanEval Fix-Plus, benchmarks designed to test programming skills and error fixing in code. This method of iterative learning and synthetic dataset generation focuses on practical improvements in LLMs’ programming capabilities without requiring extensive manual annotations.

Substantial improvements in program repair tasks demonstrate the effectiveness of NExT. Upon applying the NExT methodology, the PaLM 2 model achieved a 26.1% absolute increase in the fixed rate on the Mbpp-R dataset and a 14.3% absolute improvement on HumanEval Fix-Plus. These results indicate significant enhancements in the model’s ability to diagnose and correct programming errors accurately. Moreover, the quality of rationales generated by the model, essential for explaining code fixes, was markedly improved, as evidenced by automated metrics and human evaluations.

In conclusion, the NExT methodology significantly advances the capability of large language models to understand and fix code by integrating execution traces into their training. This approach has markedly improved the fix rates and rationale quality in complex programming tasks, as evidenced by substantial gains on established benchmarks like Mbpp-R and HumanEval Fix-Plus. NExT’s practical impact on enhancing the accuracy and reliability of automated program repair showcases its potential to transform software development practices.


Enhancing AI Model’s Scalability and Performance: A Study on Multi-Head Mixture-of-Experts

Large capacity models, such as Large Language Models (LLMs) and Large Multi-modal Models (LMMs), have demonstrated effectiveness across various domains and tasks. Scaling up these models by increasing parameter count enhances performance but significantly reduces inference speed, limiting practicality. Sparse Mixtures of Experts (SMoE) offer a promising alternative, enabling model scalability while mitigating computational costs. However, SMoE faces two key challenges: i) low expert activation and ii) limited analytical capabilities, which hinder its effectiveness and scalability.

SMoE enhances model capacity while maintaining constant computational demand, yielding superior performance compared to densely activated models. Unlike dense models, SMoE employs N independent feed-forward networks (FFNs) as experts within each Mixture-of-Experts (MoE) layer, along with a gating function that distributes weights over these experts’ outputs. The routing mechanism selects the top-k experts out of N, where k << N, which facilitates data and expert parallelism. Larger k values often improve model performance but can reduce training efficiency.

Researchers from Tsinghua University and Microsoft Research introduce Multi-Head Mixture-of-Experts (MH-MoE). MH-MoE utilizes a multi-head mechanism to divide each input token into multiple sub-tokens and distribute them across different experts, achieving denser expert activation without increasing computational or parameter complexity. In contrast to SMoE, MH-MoE activates four experts for a single input token by splitting it into four sub-tokens. This allocation enables the model to focus on various representation spaces within experts, facilitating a more nuanced understanding of vision and language patterns.

The architecture of MH-MoE addresses issues of low expert activation and token ambiguity by employing a multi-head mechanism to split tokens into sub-tokens and route them to various experts. In MH-MoE, each parallel layer contains a set of N experts, with a multi-head layer projecting inputs followed by token splitting and gating functions to route sub-tokens to experts. The top-k routing mechanism activates experts with the highest scores, and the resulting sub-tokens are processed by these activated experts and rearranged before token merging to maintain input-output shape consistency. The Token-Splitting-Merging (TSM) operation increases the data volume routed to specific experts, resulting in denser expert activation and improved understanding. This process ensures no additional computational cost in subsequent blocks, with a hyperparameter β used to balance parameters and computational complexity with the original SMoE.
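
The sketch below illustrates the token-splitting, top-k routing, and merging flow in PyTorch; the dimensions, gating details, and expert definitions are simplified assumptions for illustration, not the paper’s exact architecture.

import torch
import torch.nn as nn

class MHMoELayer(nn.Module):
    """Illustrative multi-head token splitting with top-k expert routing and merging."""
    def __init__(self, d_model=64, heads=4, num_experts=8, top_k=2):
        super().__init__()
        assert d_model % heads == 0
        self.heads, self.d_sub, self.top_k = heads, d_model // heads, top_k
        self.multi_head = nn.Linear(d_model, d_model)    # multi-head projection
        self.merge = nn.Linear(d_model, d_model)         # token-merging projection
        self.gate = nn.Linear(self.d_sub, num_experts)   # router over sub-tokens
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_sub, 4 * self.d_sub), nn.GELU(),
                          nn.Linear(4 * self.d_sub, self.d_sub))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, s, d = x.shape
        sub = self.multi_head(x).reshape(b, s * self.heads, self.d_sub)  # split into sub-tokens
        scores = self.gate(sub).softmax(dim=-1)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)               # top-k routing
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = topk_idx[..., k] == e
                if mask.any():
                    out[mask] += topk_w[..., k][mask].unsqueeze(-1) * expert(sub[mask])
        merged = out.reshape(b, s, d)                    # rearrange sub-tokens back into tokens
        return self.merge(merged)

y = MHMoELayer()(torch.randn(2, 10, 64))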

The validation perplexity curves for all pretrained models and pre-training tasks are examined under two expert settings (8 experts and 32 experts). MH-MoE consistently maintains lower perplexity than the baselines across various experimental setups, indicating more effective learning. Also, increasing the number of experts correlates with a decrease in perplexity for MH-MoE, suggesting enhanced representation learning capabilities. Downstream evaluation across different pre-training tasks further validates the efficacy of MH-MoE. In English-focused language modeling, MH-MoE achieves the best performance across multiple benchmarks, demonstrating its effectiveness in improving language representation. Similarly, MH-MoE outperforms X-MoE consistently in multi-lingual language modeling, showcasing its superiority in modeling cross-lingual natural language. In masked multi-modal modeling tasks such as visual question answering, visual reasoning, and image captioning, MH-MoE consistently outperforms Dense and X-MoE baselines, underscoring its ability to capture diverse semantic and detailed information within visual data.

In conclusion, this paper investigates methods for achieving denser expert activation without introducing additional cost while enhancing fine-grained understanding ability. The proposed MH-MoE offers a straightforward implementation of these functionalities, and its simplicity facilitates seamless integration with other SMoE frameworks to further improve performance. Extensive empirical results across three tasks validate the effectiveness of MH-MoE in achieving these objectives.


Neural Flow Diffusion Models (NFDM): A Novel Machine Learning Framework that Enhances Diffusion Models by Supporting a Broader Range of Forward Processes Beyond the Fixed Linear Gaussian

Generative models, a class of probabilistic machine learning models, have many uses across domains, including the visual and performing arts, medicine, and even physics. They excel at building probability distributions that accurately describe a dataset and can then draw new samples that resemble the original data. These capabilities make them well suited for generating synthetic datasets to supplement training data (data augmentation) and for discovering latent structures and patterns in an unsupervised learning setting.

Diffusion models, a type of generative model, are built around two main steps: a forward process and a reverse process. The forward process gradually corrupts the data distribution, taking it from its original state to a noisy one. The reverse process learns to invert the corruptions introduced by the forward process and restore the data distribution; in this way, the model learns to generate data from pure noise. Diffusion models have shown impressive performance in several fields. Most current diffusion models, however, assume a fixed Gaussian forward process, which prevents them from adapting to the task at hand or simplifying the reverse process’s learning target.

New research by the University of Amsterdam and Constructor University, Bremen, introduces Neural Flow Diffusion Models (NFDM), a framework that lets the forward process specify and learn latent variable distributions. As long as a continuous (and learnable) distribution can be represented as an invertible mapping applied to noise, NFDM can accommodate it, unlike traditional diffusion models that depend on a conditional Gaussian forward process. The researchers minimize a variational upper bound on the negative log-likelihood (NLL) with an end-to-end, simulation-free optimization procedure. They also propose a neural-network-based parameterization of the forward process, which allows it to adapt to the reverse process during training and makes the data distribution easier to learn.
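
Conceptually, the fixed Gaussian forward process of a standard diffusion model is replaced by a learnable, invertible mapping applied to noise. The sketch below contrasts the two; the network architecture and conditioning choices are purely illustrative assumptions, not NFDM’s actual parameterization.

import torch
import torch.nn as nn

def fixed_gaussian_forward(x, t):
    """Standard diffusion forward step: z_t = alpha_t * x + sigma_t * eps (not learnable)."""
    alpha_t, sigma_t = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
    eps = torch.randn_like(x)
    return alpha_t * x + sigma_t * eps

class LearnableForward(nn.Module):
    """NFDM-style idea: z_t = F_phi(eps; x, t), an invertible (here affine in eps) map of noise."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, 2 * dim))

    def forward(self, x, t):
        eps = torch.randn_like(x)
        mu, log_scale = self.net(torch.cat([x, t], dim=-1)).chunk(2, dim=-1)
        z_t = mu + log_scale.exp() * eps   # invertible in eps for any x and t
        return z_t, eps

x, t = torch.randn(4, 8), torch.rand(4, 1)
z_fixed = fixed_gaussian_forward(x, t)
z_learned, eps = LearnableForward(dim=8)(x, t)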

Leveraging NFDM’s flexibility, the researchers also explore training with constraints on the reverse process to obtain generative dynamics with targeted properties. As a case study, they consider a curvature penalty on the deterministic generative trajectories. The empirical results show better computational efficiency than baselines on synthetic datasets, MNIST, CIFAR-10, and downsampled ImageNet.

Presenting their experimental findings on CIFAR-10, ImageNet 32 and 64, the team showcased the vast potential of NFDM with a learnable forward process. The state-of-the-art NLL results they achieved are crucial for a myriad of applications, including data compression, anomaly detection, and out-of-distribution detection. They also demonstrated NFDM’s application in learning generative processes with specific attributes, such as dynamics with straight-line trajectories. In these cases, NFDM led to significantly faster sampling rates, improved generation quality, and required fewer sampling steps, underscoring its practical value.

The researchers are candid about the considerations that must be made when adopting NFDM. They acknowledge that compared to traditional diffusion models, the computational costs increase when a neural network is used to parameterize the forward process. Their results indicate that NFDM optimization iterations take around 2.2 times longer than traditional diffusion models. However, they believe that NFDM’s potential in various fields and practical applications is driven by its flexibility in learning generative processes. They also propose potential avenues for improvement, such as incorporating orthogonal methods like distillation, changing the target, and exploring different parameterizations. 


Snowflake AI Research Team Unveils Arctic: An Open-Source Enterprise-Grade Large Language Model (LLM) with a Staggering 480B Parameters

Snowflake AI Research has launched the Arctic, a cutting-edge open-source large language model (LLM) specifically designed for enterprise AI applications, setting a new standard for cost-effectiveness and accessibility. This model leverages a unique Dense-MoE Hybrid transformer architecture to handle SQL generation, coding, and following instructions efficiently. With its capability to operate under lower computing budgets, Arctic offers a high-performance solution for businesses looking to integrate advanced AI without the substantial costs typically involved.

Snowflake Arctic includes two variants: 

Arctic Base

Arctic Instruct 

Both models are hosted on Hugging Face and are readily accessible for use. The Arctic Base provides a solid foundation for general AI tasks, suitable for most enterprise needs. Meanwhile, the Arctic Instruct is optimized for more specific instructions, offering tailored performance enhancements such as precise command response and advanced query handling. Both versions are designed to be highly adaptable and scalable to meet diverse business requirements.


Arctic’s training process reflects significant advancements in efficiency, utilizing less than $2 million in computing costs, a fraction of what is typically required. This efficiency is achieved through its innovative architecture, combining a 10B dense transformer with a 128×3.66B MoE MLP, optimizing cost and performance. Snowflake’s strategic approach allows Arctic to deliver top-tier enterprise intelligence while maintaining competitive performance across various metrics, including coding, SQL generation, and instruction following.

Snowflake Arctic is open-sourced under the Apache 2.0 license to improve transparency and collaboration, providing ungated access to both the model weights and the underlying code. This initiative supports technological innovation and fosters a community around the Arctic, enabling developers and researchers to build upon and extend the model’s capabilities. Open sourcing includes comprehensive documentation and data recipes, further aiding users in customizing the model for their needs.


Getting started with Snowflake Arctic is streamlined through platforms such as Hugging Face, AWS, and NVIDIA AI Catalog, among others. Users can download the model variants directly and access detailed tutorials for setting up Arctic for inference, including basic setups and more complex deployments. This ease of access facilitates rapid integration and utilization of Arctic in enterprise environments, ensuring that businesses can quickly benefit from its advanced AI capabilities.


In conclusion, Snowflake Arctic represents a significant leap forward in enterprise AI. By dramatically reducing the cost and complexity of deploying advanced AI solutions, Arctic enables businesses of all sizes to harness the power of large language models. Its open-source nature and robust support ecosystem further enhance its value, making it an ideal choice for organizations seeking to innovate and excel in today’s competitive market.

Deploy a Hugging Face (PyAnnote) speaker diarization model on Amazon S …

Speaker diarization, an essential process in audio analysis, segments an audio file based on speaker identity. This post delves into integrating Hugging Face’s PyAnnote for speaker diarization with Amazon SageMaker asynchronous endpoints.
We provide a comprehensive guide on how to deploy speaker segmentation and clustering solutions using SageMaker on the AWS Cloud. You can use this solution for applications dealing with multi-speaker (over 100) audio recordings.
Solution overview
Amazon Transcribe is the go-to service for speaker diarization in AWS. However, for non-supported languages, you can use other models (in our case, PyAnnote) that will be deployed in SageMaker for inference. For short audio files where inference takes up to 60 seconds, you can use real-time inference. For inference that takes longer than 60 seconds, asynchronous inference should be used. The added benefit of asynchronous inference is the cost savings from auto scaling the instance count to zero when there are no requests to process.
Hugging Face is a popular open source hub for machine learning (ML) models. AWS and Hugging Face have a partnership that allows a seamless integration through SageMaker with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK. SageMaker features and capabilities help developers and data scientists get started with natural language processing (NLP) on AWS with ease.
The integration for this solution involves using Hugging Face’s pre-trained speaker diarization model using the PyAnnote library. PyAnnote is an open source toolkit written in Python for speaker diarization. This model, trained on the sample audio dataset, enables effective speaker partitioning in audio files. The model is deployed on SageMaker as an asynchronous endpoint setup, providing efficient and scalable processing of diarization tasks.
The following diagram illustrates the solution architecture.
For this post, we use the following audio file.

Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels. Audio files sampled at a different rate are resampled to 16kHz automatically upon loading.
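
If you need to replicate that preprocessing yourself, librosa (already included in the requirements file later in this post) handles both steps in a single call; the file paths below are placeholders.

import librosa
import soundfile as sf

# Load any audio file, downmix to mono, and resample to 16 kHz in one call
waveform, sample_rate = librosa.load("path/to/your_audio.wav", sr=16000, mono=True)
sf.write("audio_16k_mono.wav", waveform, sample_rate)
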
Prerequisites
Complete the following prerequisites:

Create a SageMaker domain.
Make sure your AWS Identity and Access Management (IAM) user has the necessary access permissions for creating a SageMaker role.
Make sure the AWS account has a service quota for hosting a SageMaker endpoint for an ml.g5.2xlarge instance.

Create a model function for accessing PyAnnote speaker diarization from Hugging Face
You can use the Hugging Face Hub to access the desired pre-trained PyAnnote speaker diarization model. You use the same script for downloading the model file when creating the SageMaker endpoint.

See the following code:

from pyannote.audio import Pipeline

def model_fn(model_dir):
    # Load the pre-trained diarization pipeline from the Hugging Face Hub
    model = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="Replace-with-the-Hugging-Face-auth-token")
    return model

Package the model code

Prepare essential files like inference.py, which contains the inference code:

%%writefile model/code/inference.py
from pyannote.audio import Pipeline
import subprocess
import boto3
from urllib.parse import urlparse
import pandas as pd
from io import StringIO
import os
import torch

def model_fn(model_dir):
    # Load the pre-trained diarization pipeline from the Hugging Face Hub
    model = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_oBxxxxxxxxxxxx")  # replace with your Hugging Face auth token
    return model

def diarization_from_s3(model, s3_file, language=None):
    # Download the audio file from Amazon S3 to local storage
    s3 = boto3.client("s3")
    o = urlparse(s3_file, allow_fragments=False)
    bucket = o.netloc
    key = o.path.lstrip("/")
    s3.download_file(bucket, key, "tmp.wav")
    # Run the diarization pipeline and collect the speaker turns
    result = model("tmp.wav")
    data = {}
    for turn, _, speaker in result.itertracks(yield_label=True):
        data[turn] = (turn.start, turn.end, speaker)
    data_df = pd.DataFrame(list(data.values()), columns=["start", "end", "speaker"])
    print(data_df.shape)
    result = data_df.to_json(orient="split")
    return result

def predict_fn(data, model):
    # The request payload carries the S3 location of the audio file to diarize
    s3_file = data.pop("s3_file")
    language = data.pop("language", None)
    result = diarization_from_s3(model, s3_file, language)
    return {
        "diarization_from_s3": result
    }

Prepare a requirements.txt file, which contains the required Python libraries necessary to run the inference:

with open("model/code/requirements.txt", "w") as f:
    f.write("transformers==4.25.1\n")
    f.write("boto3\n")
    f.write("pyannote.audio\n")
    f.write("soundfile\n")
    f.write("librosa\n")
    f.write("onnxruntime\n")
    f.write("wget\n")
    f.write("pandas")

Lastly, compress the inference.py and requirements.txt files and save it as model.tar.gz:

!tar zcvf model.tar.gz *

Configure a SageMaker model
Define a SageMaker model resource by specifying the image URI, model data location in Amazon Simple Storage Service (S3), and SageMaker role:

import sagemaker
import boto3

sess = sagemaker.Session()

sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

Upload the model to Amazon S3
Upload the zipped PyAnnote Hugging Face model file to an S3 bucket:

s3_location = f"s3://{sagemaker_session_bucket}/whisper/model/model.tar.gz"
!aws s3 cp model.tar.gz $s3_location

Create a SageMaker asynchronous endpoint
Configure an asynchronous endpoint for deploying the model on SageMaker using the provided asynchronous inference configuration:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join
from sagemaker.utils import name_from_base

async_endpoint_name = name_from_base("custom-asyc")

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,       # path to your model and script
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.17",  # transformers version used
    pytorch_version="1.10",       # pytorch version used
    py_version="py38",            # python version used
)

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "async_inference/output"
    ),  # where the results will be stored
    # Add notification SNS topics if needed
    notification_config={
        # "SuccessTopic": "PUT YOUR SUCCESS SNS TOPIC ARN",
        # "ErrorTopic": "PUT YOUR ERROR SNS TOPIC ARN",
    },  # notification configuration
)

env = {"MODEL_SERVER_WORKERS": "2"}

# deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.xx",  # replace with your instance type, for example ml.g5.2xlarge (see prerequisites)
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
    env=env,
)

Test the endpoint
Evaluate the endpoint functionality by sending an audio file for diarization and retrieving the JSON output stored in the specified S3 output path:

from sagemaker.async_inference import WaiterConfig

# Replace with the S3 URI of your audio object
data = {"s3_file": "s3://<your-bucket>/<path-to-audio>/audio.wav"}

res = async_predictor.predict_async(data=data)
print(f"Response output path: {res.output_path}")
print("Start polling to get response:")

config = WaiterConfig(
    max_attempts=10,  # number of attempts
    delay=10,         # time in seconds to wait between attempts
)
res.get_result(config)

To deploy this solution at scale, we suggest using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), or Amazon Simple Queue Service (Amazon SQS). These services are designed for scalability, event-driven architectures, and efficient resource utilization. They can help decouple the asynchronous inference process from the result processing, allowing you to scale each component independently and handle bursts of inference requests more effectively.
Results
Model output is stored at s3://sagemaker-xxxx/async_inference/output/. The output shows that the audio recording has been segmented into speaker turns with three columns:

Start (start time in seconds)
End (end time in seconds)
Speaker (speaker label)

The following code shows an example of our results:

[0.9762308998, 8.9049235993, “SPEAKER_01”]

[9.533106961, 12.1646859083, “SPEAKER_01”]

[13.1324278438, 13.9303904924, “SPEAKER_00”]

[14.3548387097, 26.1884550085, “SPEAKER_00”]

[27.2410865874, 28.2258064516, “SPEAKER_01”]

[28.3446519525, 31.298811545, “SPEAKER_01”]

Clean up
You can set the scaling policy’s MinCapacity to 0; asynchronous inference lets you auto scale down to zero when there are no requests. You don’t need to delete the endpoint; it scales back up from zero when needed again, reducing costs when it’s not in use. See the following code:

# Common class representing application autoscaling for SageMaker
client = boto3.client("application-autoscaling")

# This is the format in which application autoscaling references the endpoint;
# replace <endpoint_name> with your endpoint name
resource_id = "endpoint/" + "<endpoint_name>" + "/variant/" + "variant1"

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # The number of EC2 instances for your Amazon SageMaker model endpoint variant
    MinCapacity=0,
    MaxCapacity=5
)

If you want to delete the endpoint, use the following code:

async_predictor.delete_endpoint()

Benefits of asynchronous endpoint deployment
This solution offers the following benefits:

The solution can efficiently handle multiple or large audio files.
This example uses a single instance for demonstration. If you want to use this solution for hundreds or thousands of audio files and process them across multiple instances with an asynchronous endpoint, you can use an auto scaling policy, which is designed for a large number of requests. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload.
The solution optimizes resources and reduces system load by separating long-running tasks from real-time inference.

Conclusion
In this post, we provided a straightforward approach to deploy Hugging Face’s speaker diarization model on SageMaker using Python scripts. Using an asynchronous endpoint provides an efficient and scalable means to deliver diarization predictions as a service, accommodating concurrent requests seamlessly.
Get started today with asynchronous speaker diarization for your audio projects. Reach out in the comments if you have any questions about getting your own asynchronous diarization endpoint up and running.

About the Authors
Sanjay Tiwary is a Specialist Solutions Architect AI/ML who spends his time working with strategic customers to define business requirements, provide L300 sessions around specific use cases, and design AI/ML applications and services that are scalable, reliable, and performant. He has helped launch and scale the AI/ML powered Amazon SageMaker service and has implemented several proofs of concept using Amazon AI services. He has also developed the advanced analytics platform as a part of the digital transformation journey.
Kiran Challapalli is a deep tech business developer with the AWS public sector. He has more than 8 years of experience in AI/ML and 23 years of overall software development and sales experience. Kiran helps public sector businesses across India explore and co-create cloud-based solutions that use AI, ML, and generative AI—including large language models—technologies.

Evaluate the text summarization capabilities of LLMs for enhanced deci …

Organizations across industries are using automatic text summarization to more efficiently handle vast amounts of information and make better decisions. In the financial sector, investment banks condense earnings reports down to key takeaways to rapidly analyze quarterly performance. Media companies use summarization to monitor news and social media so journalists can quickly write stories on developing issues. Government agencies summarize lengthy policy documents and reports to help policymakers strategize and prioritize goals.
By creating condensed versions of long, complex documents, summarization technology enables users to focus on the most salient content. This leads to better comprehension and retention of critical information. The time savings allow stakeholders to review more material in less time, gaining a broader perspective. With enhanced understanding and more synthesized insights, organizations can make better informed strategic decisions, accelerate research, improve productivity, and increase their impact. The transformative power of advanced summarization capabilities will only continue growing as more industries adopt artificial intelligence (AI) to harness overflowing information streams.
In this post, we explore leading approaches for evaluating summarization accuracy objectively, including ROUGE metrics, METEOR, and BERTScore. Understanding the strengths and weaknesses of these techniques can help guide selection and improvement efforts. The overall goal of this post is to demystify summarization evaluation to help teams better benchmark performance on this critical capability as they seek to maximize value.
Types of summarization
Summarization can generally be divided into two main types: extractive summarization and abstractive summarization. Both approaches aim to condense long pieces of text into shorter forms, capturing the most critical information or essence of the original content, but they do so in fundamentally different ways.
Extractive summarization involves identifying and extracting key phrases, sentences, or segments from the original text without altering them. The system selects parts of the text deemed most informative or representative of the whole. Extractive summarization is useful if accuracy is critical and the summary needs to reflect the exact information from the original text. These could be use cases like highlighting specific legal terms, obligations, and rights outlined in the terms of use. The most common techniques used for extractive summarization are term frequency-inverse document frequency (TF-IDF), sentence scoring, text rank algorithm, and supervised machine learning (ML).
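To illustrate the simplest of these techniques, the following toy sketch scores sentences by their average TF-IDF weight and keeps the top-ranked ones; it is for illustration only, not a production extractive summarizer.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, num_sentences=2):
    # Naive sentence split; an NLP library would do this more robustly
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Score each sentence by its mean TF-IDF weight
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-num_sentences:])  # keep original sentence order
    return ". ".join(sentences[i] for i in top) + "."

document = (
    "The quarterly report shows revenue grew 12 percent. Costs rose modestly. "
    "Management attributes the growth to strong demand in Europe. "
    "The outlook for next quarter remains cautious."
)
print(extractive_summary(document))
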
Abstractive summarization goes a step further by generating new phrases and sentences that were not in the original text, essentially paraphrasing and condensing the original content. This approach requires a deeper understanding of the text, because the AI needs to interpret the meaning and then express it in a new, concise form. Large language models (LLMs) are best suited for abstractive summarization because the transformer models use attention mechanisms to focus on relevant parts of the input text when generating summaries. The attention mechanism allows the model to assign different weights to different words or tokens in the input sequence, enabling it to capture long-range dependencies and contextually relevant information.
In addition to these two primary types, there are hybrid approaches that combine extractive and abstractive methods. These approaches might start with extractive summarization to identify the most important content and then use abstractive techniques to rewrite or condense that content into a fluent summary.
The challenge
Finding the optimal method to evaluate summary quality remains an open challenge. As organizations increasingly rely on automatic text summarization to distill key information from documents, the need grows for standardized techniques to measure summarization accuracy. Ideally, these evaluation metrics would quantify how well machine-generated summaries extract the most salient content from source texts and present coherent summaries reflecting the original meaning and context.
However, developing robust evaluation methodologies for text summarization presents difficulties:

Human-authored reference summaries used for comparison often exhibit high variability based on subjective determinations of importance
Nuanced aspects of summary quality like fluency, readability, and coherence prove difficult to quantify programmatically
Wide variation exists across summarization methods from statistical algorithms to neural networks, complicating direct comparisons

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
ROUGE metrics, such as ROUGE-N and ROUGE-L, play a crucial role in evaluating the quality of machine-generated summaries compared to human-written reference summaries. These metrics focus on assessing the overlap between the content of machine-generated and human-crafted summaries by analyzing n-grams, which are groups of words or tokens. For instance, ROUGE-1 evaluates the match of individual words (unigrams), whereas ROUGE-2 considers pairs of words (bigrams). Additionally, ROUGE-L assesses the longest common subsequence of words between the two texts, allowing for flexibility in word order.
To illustrate this, consider the following examples:

ROUGE-1 metric – ROUGE-1 evaluates the overlap of unigrams (single words) between a generated summary and a reference summary. For example, if a reference summary contains “The quick brown fox jumps,” and the generated summary is “The brown fox jumps quickly,” the ROUGE-1 metric would consider “brown,” “fox,” and “jumps” as overlapping unigrams. ROUGE-1 focuses on the presence of individual words in the summaries, measuring how well the generated summary captures the key words from the reference summary.
ROUGE-2 metric – ROUGE-2 assesses the overlap of bigrams (pairs of adjacent words) between a generated summary and a reference summary. For instance, if the reference summary has “The cat is sleeping,” and the generated summary reads “A cat is sleeping,” ROUGE-2 would identify “cat is” and “is sleeping” as overlapping bigrams. ROUGE-2 provides insight into how well the generated summary maintains the sequence and context of word pairs compared to the reference summary.
ROUGE-N metric – ROUGE-N is a generalized form where N represents any number, allowing evaluation based on n-grams (sequences of N words). Considering N=3, if the reference summary states “The sun is shining brightly,” and the generated summary is “Sun shining brightly,” ROUGE-3 would recognize “sun shining brightly” as a matching trigram. ROUGE-N offers flexibility to evaluate summaries based on different lengths of word sequences, providing a more comprehensive assessment of content overlap.

These examples illustrate how ROUGE-1, ROUGE-2, and ROUGE-N metrics function in evaluating automatic summarization or machine translation tasks by comparing generated summaries with reference summaries based on different levels of word sequences.
Calculate a ROUGE-N score
You can use the following steps to calculate a ROUGE-N score (a minimal Python implementation follows the list):

Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or natural language processing (NLP) libraries.
Generate n-grams (contiguous sequences of N words) from both the generated summary and the reference summary.
Count the number of overlapping n-grams between the generated summary and the reference summary.
Calculate precision, recall, and F1 score:

Precision – The number of overlapping n-grams divided by the total number of n-grams in the generated summary.
Recall – The number of overlapping n-grams divided by the total number of n-grams in the reference summary.
F1 score – The harmonic mean of precision and recall, calculated as (2 * precision * recall) / (precision + recall).

The aggregate F1 score obtained from calculating precision, recall, and F1 score for each row in the dataset is considered as the ROUGE-N score.
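
The following minimal Python function implements these steps for a single summary pair, using whitespace tokenization and clipped n-gram counts; it is for illustration, and in practice you would likely use an established package such as rouge-score.

from collections import Counter

def rouge_n(generated, reference, n=2):
    """Compute ROUGE-N precision, recall, and F1 following the steps above."""
    def ngrams(text, n):
        tokens = text.lower().split()                    # simple whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen, ref = ngrams(generated, n), ngrams(reference, n)
    overlap = sum((gen & ref).values())                  # clipped overlapping n-gram counts
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge_n("The brown fox jumps quickly", "The quick brown fox jumps", n=1))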

Limitations
ROUGE has the following limitations:

Narrow focus on lexical overlap – The core idea behind ROUGE is to compare the system-generated summary to a set of reference or human-created summaries, and measure the lexical overlap between them. This means ROUGE has a very narrow focus on word-level similarity. It doesn’t actually evaluate semantic meaning, coherence, or readability of the summary. A system could achieve high ROUGE scores by simply extracting sentences word-for-word from the original text, without generating a coherent or concise summary.
Insensitivity to paraphrasing – Because ROUGE relies on lexical matching, it can’t detect semantic equivalence between words and phrases. Therefore, paraphrasing and use of synonyms will often lead to lower ROUGE scores, even if the meaning is preserved. This disadvantages systems that paraphrase or summarize in an abstractive way.
Lack of semantic understanding – ROUGE doesn’t evaluate whether the system truly understood the meanings and concepts in the original text. A summary could achieve high lexical overlap with references, while missing the main ideas or containing factual inconsistencies. ROUGE would not identify these issues.

When to use ROUGE
ROUGE is simple and fast to calculate. Use it as a baseline or benchmark for summary quality related to content selection. ROUGE metrics are most effectively employed in scenarios involving abstractive summarization tasks, automatic summarization evaluation, assessments of LLMs, and comparative analyses of different summarization approaches. By using ROUGE metrics in these contexts, stakeholders can quantitatively evaluate the quality and effectiveness of summary generation processes.
Metric for Evaluation of Translation with Explicit Ordering (METEOR)
One of the major challenges in evaluating summarization systems is assessing how well the generated summary flows logically, rather than just selecting relevant words and phrases from the source text. Simply extracting relevant keywords and sentences doesn’t necessarily produce a coherent and cohesive summary. The summary should flow smoothly and connect ideas logically, even if they aren’t presented in the same order as the original document.
The flexibility of matching by reducing words to their root or base form (for example, after stemming, words like “running,” “runs,” and “ran” all become “run”) and by using synonyms means METEOR correlates better with human judgments of summary quality. It can identify if important content is preserved, even if the wording differs. This is a key advantage over n-gram based metrics like ROUGE, which only look for exact token matches. METEOR also gives higher scores to summaries that focus on the most salient content from the reference. Lower scores are given to repetitive or irrelevant information. This aligns well with the goal of summarization: keeping only the most important content. METEOR is a semantically meaningful metric that can overcome some of the limitations of n-gram matching for evaluating text summarization. The incorporation of stemming and synonyms allows for better assessment of information overlap and content accuracy.
To illustrate this, consider the following examples:
Reference Summary: Leaves fall during autumn.
Generated Summary 1: Leaves drop in fall.
Generated Summary 2: Leaves green in summer.
Comparing the reference with generated summary 1:
Reference Summary: Leaves fall during autumn.
Generated Summary 1: Leaves drop in fall.
Even though “fall” and “autumn” are different tokens, METEOR recognizes them as synonyms through its synonym matching. “Drop” and “fall” are identified as a stemmed match. For generated summary 2, there are no matches with the reference summary besides “Leaves,” so this summary would receive a much lower METEOR score. The more semantically meaningful matches, the higher the METEOR score. This allows METEOR to better evaluate the content and accuracy of summaries compared to simple n-gram matching.
Calculate a METEOR score
Complete the following steps to calculate a METEOR score (a short example using NLTK follows the list):

Tokenize the generated summary and the reference summary into individual words or tokens using basic tokenization methods like splitting by whitespace or NLP libraries.
Align the unigrams of the generated summary with those of the reference summary, matching on exact tokens, stemmed forms (reducing words to their base or root form), and synonyms. Calculate the unigram precision, recall, and F-mean score over these matches, giving more weight to recall than precision.
Apply a fragmentation penalty based on how many contiguous chunks the matched words form; summaries that preserve the reference word order produce fewer, longer chunks and are penalized less.
Reduce the F-mean score from Step 2 by this penalty to obtain the final METEOR score. The METEOR score ranges from 0 to 1, where 0 indicates no similarity between the generated summary and reference summary, and 1 indicates perfect alignment. Typically, summarization scores fall between 0 and 0.6.
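
In practice you would rarely implement METEOR by hand; NLTK ships an implementation that handles stemming and WordNet-based synonym matching. The minimal sketch below scores the earlier examples, assuming the required NLTK data packages can be downloaded.

import nltk
from nltk.translate.meteor_score import meteor_score

# One-time downloads needed for WordNet-based synonym matching
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = "leaves fall during autumn".split()
generated_1 = "leaves drop in fall".split()
generated_2 = "leaves green in summer".split()

# meteor_score expects a list of tokenized references and a tokenized hypothesis
print("Summary 1:", meteor_score([reference], generated_1))  # higher: exact and synonym matches
print("Summary 2:", meteor_score([reference], generated_2))  # lower: only "leaves" matches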

Limitations
When employing the METEOR metric for evaluating summarization tasks, several challenges may arise:

Semantic complexity – METEOR’s emphasis on semantic similarity can struggle to capture the nuanced meanings and context in complex summarization tasks, potentially leading to inaccuracies in evaluation.
Reference variability – Variability in human-generated reference summaries can impact METEOR scores, because differences in reference content may affect the evaluation of machine-generated summaries.
Linguistic diversity – The effectiveness of METEOR may vary across languages due to linguistic variations, syntax differences, and semantic nuances, posing challenges in multilingual summarization evaluations.
Length discrepancy – Evaluating summaries of varying lengths can be challenging for METEOR, because discrepancies in length compared to the reference summary may result in penalties or inaccuracies in assessment.
Parameter tuning – Optimizing METEOR’s parameters for different datasets and summarization tasks can be time-consuming and require careful tuning to make sure the metric provides accurate evaluations.
Evaluation bias – There is a risk of evaluation bias with METEOR if not properly adjusted or calibrated for specific summarization domains or tasks. This can potentially lead to skewed results and affect the reliability of the evaluation process.

By being aware of these challenges and considering them when using METEOR as a metric for summarization tasks, researchers and practitioners can navigate potential limitations and make more informed decisions in their evaluation processes.
When to use METEOR
METEOR is commonly used to automatically evaluate the quality of text summaries. It is preferable to use METEOR as an evaluation metric when the order of ideas, concepts, or entities in the summary matters. METEOR considers the order and matches n-grams between the generated summary and reference summaries. It rewards summaries that preserve sequential information. Unlike metrics like ROUGE, which rely on overlap of n-grams with reference summaries, METEOR matches stems, synonyms, and paraphrases. METEOR works better when there can be multiple correct ways of summarizing the original text. METEOR incorporates WordNet synonyms and stemmed tokens when matching n-grams. In short, summaries that are semantically similar but use different words or phrasing will still score well. METEOR has a built-in penalty for summaries with repetitive n-grams. Therefore, it discourages word-for-word extraction or lack of abstraction. METEOR is a good choice when semantic similarity, order of ideas, and fluent phrasing are important for judging summary quality. It is less appropriate for tasks where only lexical overlap with reference summaries matters.
BERTScore
Surface-level lexical measures like ROUGE and METEOR evaluate summarization systems by comparing the word overlap between a candidate summary and a reference summary. However, they rely heavily on exact string matching between words and phrases. This means they may miss semantic similarities between words and phrases that have different surface forms but similar underlying meanings. By relying only on surface matching, these metrics may underestimate the quality of system summaries that use synonymous words or paraphrase concepts differently from reference summaries. Two summaries could convey nearly identical information but receive low surface-level scores due to vocabulary differences.
BERTScore is a way to automatically evaluate how good a summary is by comparing it to a reference summary written by a human. It uses BERT, a popular NLP technique, to understand the meaning and context of words in the candidate summary and reference summary. Specifically, it looks at each word or token in the candidate summary and finds the most similar word in the reference summary based on the BERT embeddings, which are vector representations of the meaning and context of each word. It measures the similarity using cosine similarity, which tells how close the vectors are to each other. For each word in the candidate summary, it finds the most related word in the reference summary using BERT’s understanding of language. It compares all these word similarities across the whole summary to get an overall score of how semantically similar the candidate summary is to the reference summary. The more similar the words and meanings captured by BERT, the higher the BERTScore. This allows it to automatically evaluate the quality of a generated summary by comparing it to a human reference without needing human evaluation each time.
To illustrate this, imagine you have a machine-generated summary: “The quick brown fox jumps over the lazy dog.” Now, let’s consider a human-crafted reference summary: “A fast brown fox leaps over a sleeping canine.”
Calculate a BERTScore
Complete the following steps to calculate a BERTScore (a short example using the bert-score package follows the list):

BERTScore uses contextual embeddings to represent each token in both the candidate (machine-generated) and reference (human-crafted) sentences. Contextual embeddings are a type of word representation in NLP that captures the meaning of a word based on its context within a sentence or text. Unlike traditional word embeddings that assign a fixed vector to each word regardless of its context, contextual embeddings consider the surrounding words to generate a unique representation for each word depending on how it is used in a specific sentence.
The metric then computes the similarity between each token in the candidate sentence with each token in the reference sentence using cosine similarity. Cosine similarity helps us quantify how closely related two sets of data are by focusing on the direction they point in a multi-dimensional space, making it a valuable tool for tasks like search algorithms, NLP, and recommendation systems.
By comparing the contextual embeddings and computing similarity scores for all tokens, BERTScore generates a comprehensive evaluation that captures the semantic relevance and context of the generated summary compared to the human-crafted reference.
The final BERTScore output provides a similarity score that reflects how well the machine-generated summary aligns with the reference summary in terms of meaning and context.
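
The bert-score Python package wraps these steps; the short example below scores the fox/canine pair from earlier, with the English default model left as an assumption about a typical setup.

from bert_score import score

candidates = ["The quick brown fox jumps over the lazy dog."]
references = ["A fast brown fox leaps over a sleeping canine."]

# Returns precision, recall, and F1 tensors computed from contextual BERT embeddings
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")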

In essence, BERTScore goes beyond traditional metrics by considering the semantic nuances and context of sentences, offering a more sophisticated evaluation that closely mirrors human judgment. This advanced approach enhances the accuracy and reliability of evaluating summarization tasks, making BERTScore a valuable tool in assessing text generation systems.
Limitations
Although BERTScore offers significant advantages in evaluating summarization tasks, it also comes with certain limitations that need to be considered:

Computational intensity – BERTScore can be computationally intensive due to its reliance on pre-trained language models like BERT. This can lead to longer evaluation times, especially when processing large volumes of text data.
Dependency on pre-trained models – The effectiveness of BERTScore is highly dependent on the quality and relevance of the pre-trained language model used. In scenarios where the pre-trained model may not adequately capture the nuances of the text, the evaluation results may be affected.
Scalability – Scaling BERTScore for large datasets or real-time applications can be challenging due to its computational demands. Implementing BERTScore in production environments may require optimization strategies to provide efficient performance.
Domain specificity – BERTScore’s performance may vary across different domains or specialized text types. Adapting the metric to specific domains or tasks may require fine-tuning or adjustments to produce accurate evaluations.
Interpretability – Although BERTScore provides a comprehensive evaluation based on contextual embeddings, interpreting the specific reasons behind the similarity scores generated for each token can be complex and may require additional analysis.
Reference-free evaluation – Although BERTScore reduces the reliance on reference summaries for evaluation, this reference-free approach may not fully capture all aspects of summarization quality, particularly in scenarios where human-crafted references are essential for assessing content relevance and coherence.

Acknowledging these limitations can help you make informed decisions when using BERTScore as a metric for evaluating summarization tasks, providing a balanced understanding of its strengths and constraints.
When to use BERTScore
BERTScore can evaluate the quality of text summarization by comparing a generated summary to a reference summary. It uses neural networks like BERT to measure semantic similarity beyond just exact word or phrase matching. This makes BERTScore very useful when semantic fidelity (preserving the full meaning and content) is critical for your summarization task. BERTScore will give higher scores to summaries that convey the same information as the reference summary, even if they use different words and sentence structures. The bottom line is that BERTScore is ideal for summarization tasks where retaining the full semantic meaning, not just keywords or topics, is vital. Its advanced neural scoring allows it to compare meaning beyond surface-level word matching. This makes it suitable for cases where subtle differences in wording can substantially alter overall meaning and implications. BERTScore, in particular, excels in capturing semantic similarity, which is crucial for assessing the quality of abstractive summaries like those produced by Retrieval Augmented Generation (RAG) models.
Model evaluation frameworks
Model evaluation frameworks are essential for accurately gauging the performance of various summarization models. These frameworks are instrumental in comparing models, assessing coherence between generated summaries and source content, and pinpointing deficiencies in evaluation methods. By conducting thorough assessments and consistent benchmarking, these frameworks propel text summarization research by advocating standardized evaluation practices and enabling multifaceted model comparisons.
In AWS, the FMEval library within Amazon SageMaker Clarify streamlines the evaluation and selection of foundation models (FMs) for tasks like text summarization, question answering, and classification. It empowers you to evaluate FMs based on metrics such as accuracy, robustness, creativity, bias, and toxicity, supporting both automated and human-in-the-loop evaluations for LLMs. With UI-based or programmatic evaluations, FMEval generates detailed reports with visualizations to quantify model risks like inaccuracies, toxicity, or bias, helping organizations align with their responsible generative AI guidelines. In this section, we demonstrate how to use the FMEval library.
Evaluate Claude v2 on summarization accuracy using Amazon Bedrock
The following code snippet is an example of how to interact with the Anthropic Claude model using Python code:

import json
import boto3
# Create an Amazon Bedrock runtime client; credentials and Region are resolved from the environment.
bedrock_runtime = boto3.client("bedrock-runtime")
# We use Claude v2 in this example.
# See https://docs.anthropic.com/claude/reference/claude-on-amazon-bedrock#list-available-models
# for instructions on how to list the model IDs for all available Claude model variants.
model_id = "anthropic.claude-v2"
accept = "application/json"
contentType = "application/json"
# `prompt_data` is structured in the format that the Claude model expects
# (the "\n\nHuman: ...\n\nAssistant:" turn format), as documented here:
# https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
prompt_data = """

Human: Who is Barack Obama?

Assistant:"""
# For more details on parameters that can be included in `body` (such as "max_tokens_to_sample"),
# see https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-claude.html#model-parameters-claude-request-body
body = json.dumps({"prompt": prompt_data, "max_tokens_to_sample": 500})
# Invoke the model
response = bedrock_runtime.invoke_model(
    body=body, modelId=model_id, accept=accept, contentType=contentType
)
# Parse the invocation response
response_body = json.loads(response.get("body").read())
print(response_body.get("completion"))

In simple terms, this code performs the following actions:

Import the necessary libraries: json to work with JSON data, and boto3 to create the Amazon Bedrock runtime client.
Define the model ID as anthropic.claude-v2 and set the content type for the request.
Create a prompt_data variable that structures the input data for the Claude model. In this case, it asks the question “Who is Barack Obama?” and expects a response from the model.
Construct a JSON object named body that includes the prompt data, and specify additional parameters like the maximum number of tokens to generate.
Invoke the Claude model using bedrock_runtime.invoke_model with the defined parameters.
Parse the response from the model, extract the completion (generated text), and print it out.

Make sure the AWS Identity and Access Management (IAM) role associated with the Amazon SageMaker Studio user profile has access to the Amazon Bedrock models being invoked. Refer to Identity-based policy examples for Amazon Bedrock for guidance on best practices and examples of identity-based policies for Amazon Bedrock.
Using the FMEval library to evaluate the summarized output from Claude
We use the following code to evaluate the summarized output:

from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy
config = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri="gigaword_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output="completion",
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}',
)
eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(
    model=bedrock_model_runner,
    dataset_config=config,
    prompt_template="Human: Summarise the following text in one sentence: $feature\n\nAssistant:\n",
    save=True,
)

In the preceding code snippet, to evaluate text summarization using the FMEval library, we complete the following steps:

Create a ModelRunner to perform invocation on your LLM. The FMEval library provides built-in support for Amazon SageMaker endpoints and Amazon SageMaker JumpStart LLMs. You can also extend the ModelRunner interface for LLMs hosted anywhere (a minimal sketch follows these steps).
Use supported eval_algorithms, such as toxicity, summarization accuracy, and semantic robustness, based on your evaluation needs.
Customize the evaluation configuration parameters for your specific use case.
Use the evaluation algorithm with either built-in or custom datasets to evaluate your LLM model. The dataset used in this case is sourced from the following GitHub repo.
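
As noted above, the ModelRunner interface can be extended; the following is a minimal sketch for an LLM behind a generic HTTP endpoint, where the endpoint URL and response schema are hypothetical, and the exact base-class signature should be confirmed in the FMEval developer guide:

from typing import Optional, Tuple
import requests
from fmeval.model_runners.model_runner import ModelRunner
class HttpEndpointModelRunner(ModelRunner):
    # Hypothetical ModelRunner for an LLM hosted behind a simple HTTP endpoint.
    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Send the prompt to the endpoint and return (generated_text, log_probability).
        # The {"prompt": ...} request and "generated_text" response fields are assumptions.
        response = requests.post(self.endpoint_url, json={"prompt": prompt}, timeout=60)
        response.raise_for_status()
        return response.json().get("generated_text"), None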

Refer to the developer guide and examples for detailed usage of evaluation algorithms.
The following table summarizes the results of the evaluation.

| model_input | model_output | target_output | prompt | scores | meteor_score | rouge_score | bert_score |
|---|---|---|---|---|---|---|---|
| John Edward 0 Bates, formerly of Spalding, Linco… | I cannot make any definitive judgments, as th… | A former Lincolnshire Police officer carried o… | Human: John Edward Bates, formerly of Spalding… | [{'name': 'meteor', 'value': 0.101010101010101… | 0.10101 | 0 | 0.557155 |
| 23 October 2015 Last updated at 17:44 BST It'… | Here are some key points about hurricane/trop… | Hurricane Patricia has been rated as a categor… | Human: 23 October 2015 Last updated at 17:44 B… | [{'name': 'meteor', 'value': 0.102339181286549… | 0.102339 | 0.018265 | 0.441421 |
| Ferrari appeared in a position to challenge un… | Here are the key points from the article:… | Lewis Hamilton stormed to pole position at the… | Human: Ferrari appeared in a position to chall… | [{'name': 'meteor', 'value': 0.322543352601156… | 0.322543 | 0.078212 | 0.606487 |
| The Bath-born player, 28, has made 36 appearan… | Okay, let me summarize the key points: - E… | Newport Gwent Dragons number eight Ed Jackson | Human: The Bath-born player, 28, has made 36 a… | [{'name': 'meteor', 'value': 0.105740181268882… | 0.10574 | 0.012987 | 0.539488 |
| Weaknesses in the way mice swapped data with c… | Here are the key points I gathered from the a… | Hackers could gain access to home and | Human: Weaknesses in the way mice swapped data… | [{'name': 'meteor', 'value': 0.201048289433848… | 0.201048 | 0.021858 | 0.526947 |

Check out the sample notebook for more details about the summarization evaluation that we discussed in this post.
Conclusion
ROUGE, METEOR, and BERTScore all measure the quality of machine-generated summaries, but they focus on different aspects, such as lexical overlap, fluency, or semantic similarity. Make sure to select the metric that aligns with what defines “good” for your specific summarization use case. You can also use a combination of metrics, as sketched below; this provides a more well-rounded evaluation and guards against the potential weaknesses of any individual metric. With the right measurements, you can iteratively improve your summarizers to meet whichever notion of accuracy matters most.
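
For example, a minimal sketch that scores the same output with all three metrics using the Hugging Face evaluate library (assuming the rouge_score, nltk, and bert_score backends are installed):

import evaluate
predictions = ["Hurricane Patricia was rated a category 5 storm."]
references = ["Hurricane Patricia has been rated as a category 5 storm."]
# Load one scorer per metric and combine the results into a single report.
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")
report = {
    "rouge": rouge.compute(predictions=predictions, references=references),
    "meteor": meteor.compute(predictions=predictions, references=references),
    "bertscore": bertscore.compute(predictions=predictions, references=references, lang="en"),
}
print(report)
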
Additionally, FM and LLM evaluation is necessary to productionize these models at scale. With FMEval, you get not only a vast set of built-in algorithms across many NLP tasks, but also a scalable and flexible tool for large-scale evaluations of your own models, datasets, and algorithms. To scale up, you can use this package in your LLMOps pipelines to evaluate multiple models. To learn more about FMEval in AWS and how to use it effectively, refer to Use SageMaker Clarify to evaluate large language models. For further understanding and insights into the capabilities of SageMaker Clarify in evaluating FMs, see Amazon SageMaker Clarify Makes It Easier to Evaluate and Select Foundation Models.

About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning, and is a member of the technical field community within Amazon. Dinesh works closely with UK Central Government customers to solve their problems using AWS services. Outside of work, Dinesh enjoys spending quality time with his family, playing chess, and exploring a diverse range of music.

Pranav Sharma is an AWS leader driving technology and business transformation initiatives across Europe, the Middle East, and Africa. He has experience in designing and running artificial intelligence platforms in production that support millions of customers and deliver business outcomes. He has played technology and people leadership roles for Global Financial Services organizations. Outside of work, he likes to read, play tennis with his son, and watch movies.

What Does Google’s Cookie Delay Mean for Marketers?

About a month ago I wrote a post titled “Can AI Help Overcome the Loss of Third-Party Cookies? Google Thinks So”.

The post was written in response to Google’s VP of Global Advertising Strategies telling everyone it was time to move on from third-party cookies. 

That there would be no more delays and marketers needed to shift their focus before it was too late. 

And then…

Ah, Google.

It turns out there would in fact be more delays and the new timeline is somewhere in 2025…maybe.

Here’s the thing.

Cookies are going away. It might not be 100% by the end of the year and it might not be 100% at the end of next year, but they will go away. 

While the US certainly lags behind the EU when it comes to privacy, there are still concerns and tech companies are continuing to adapt their technologies to ensure these concerns are addressed. 

What does this mean for marketers?

We’re going to give you the same advice we have been giving – focus on first-party data.

First-party data is your data to own. 

You don’t have to worry about if or when Google decides to make a change. 

Plus, first-party data lends itself to SO MANY things, including:

Personalization

Segmentation 

Ad Targeting

Customer retention

Campaign optimization

Attribution

Website optimization

Customer experience

And more! 

Marketers can’t keep relying on Google delays. We have to move forward.

How to Collect First-Party Data  

It’s easy for me to sit here and tell marketers to move on from third-party data but I understand the reality of the situation – collecting first-party data comes with its own set of challenges. 

How do you get people to give you their info?

How do you store their info securely?

How do you connect your first-party data?

Yes, it’s certainly important to have the right tech stack in place but when it comes to actually collecting first-party data, it’s easier than ever!

Starting with website visitor identification. 

Website Visitor Identification + Your Website = First-Party Data

Just by adding one line of code to your website, you can start identifying your website visitors and gathering first-party data. 

This includes names, emails, phone numbers, job titles, companies, and more. 

What makes this so cool is that a platform like Customers.ai not only helps you identify the people, but also gives you important information about them. 

You can get demographic data and business data, and you can even track the customer journey!

On top of being able to collect the data and enrich it, Customers.ai allows you to send the data directly to your CRM, directly to your marketing automation system, directly to your ad campaigns, etc. 

You have the ability to build true customer profiles and get real attribution data without needing third-party data! 

Look, there has never been an easier way to collect first-party data in a compliant manner. 

If you aren’t using this, you are truly missing out.

The best part? It’s easy to set up!

To install the Website Visitor ID X-Ray Pixel, sign up (for FREE!), go to your dashboard, and navigate to My Automations. 

Select + New Automation and get your pixel. We have easy install options for Google Tag Manager, WordPress, and Shopify, or you can install the pixel manually.

Other Methods of Collecting First-Party Data

We are (obviously) partial to website visitor ID tools because, well, that’s what we do (and for the record, it’s SO easy). 

But that isn’t the only way to collect first-party data. Let’s look at a few other ways:

Email Campaigns

“Fill out your email for a 25% off discount.”

“Sign up for our newsletter and get a free [insert offer]”

“Enter your email for a chance to win [insert prize]”

There’s a reason for these things and the reason is first-party data. 

Email can not only give you personal identifiers; metrics from campaigns (think open rates, click-through rates, unsubscribe rates, etc.) are also great for understanding shopper preferences. 

Surveys & Quizzes

Surveys and quizzes sure are fun!

They are also brilliant methods of collecting user data. 

The key is making it simple enough and interesting enough that your customers want to complete it. 

Be sure to only ask for the information you need. 

Remember, you can build customer profiles over time so you don’t need to ask everything in one place.

Customer Feedback & Reviews

Soliciting reviews through emails, social media, or on your own website is another avenue for collecting valuable first-party data. 

Reviews can be associated with specific customers to build customer profiles and can help you better understand what they like and dislike.

Don’t Let Google Make You Complacent

While I’m sure marketers are taking a collective breath at another delay from Google, we think it’s important not to get complacent. 

Change is hard and shifting the way we do things takes time.

Use this time you have to innovate without the ticking clock of cookie deprecation.

Use this time to better understand how you can not just collect first-party data but use it to gain deeper audience insights, and, yes, actually get ready for a cookie-less world.

We are here to help and if you want to see how you can start collecting data in under 90 seconds, sign up for a free trial today!

See Who Is On Your Site Right Now!

Turn anonymous visitors into genuine contacts.

Try it Free, No Credit Card Required

Get The X-Ray Pixel

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

The post What Does Google’s Cookie Delay Mean for Marketers? appeared first on Customers.ai.

Top 10 Explainable AI (XAI) Frameworks

The increasing complexity of AI systems, particularly with the rise of opaque models like Deep Neural Networks (DNNs), has highlighted the need for transparency in decision-making processes. As black-box models become more prevalent, stakeholders in AI demand explanations to justify decisions, especially in critical contexts like medicine and autonomous vehicles. Transparency is essential for ethical AI and improving system performance, as it helps detect biases, enhance robustness against adversarial attacks, and ensure meaningful variables influence the output.

To ensure practicality, interpretable AI systems must offer insights into model mechanisms, visualize discrimination rules, or identify factors that could perturb the model. Explainable AI (XAI) aims to balance model explainability with high learning performance, fostering human understanding, trust, and effective management of AI partners. Drawing from social sciences and psychology, XAI seeks to create a suite of techniques facilitating transparency and comprehension in the evolving landscape of AI.

Some XAI frameworks that have proven their success in this field:

1. What-If Tool (WIT): An open-source application proposed by Google researchers, enabling users to analyze ML systems without extensive coding. It facilitates testing performance in hypothetical scenarios, analyzing data feature importance, visualizing model behavior, and assessing fairness metrics.

2. Local Interpretable Model-Agnostic Explanations (LIME): An explanation method that clarifies the predictions of any classifier by learning an interpretable model localized around the prediction, ensuring the explanation is understandable and reliable.
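
As a brief, hedged illustration of the LIME API (using a scikit-learn classifier on the Iris dataset purely as an example):

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)
explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
# Fit a local surrogate model around a single prediction and list feature weights.
explanation = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=4)
print(explanation.as_list())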

3. SHapley Additive exPlanations (SHAP): SHAP provides a comprehensive framework for interpreting model predictions by assigning an importance value to each feature for a specific prediction. Key innovations of SHAP include (1) the discovery of a new category of additive feature importance measures and (2) theoretical findings that demonstrate a distinct solution within this category that possesses a collection of favorable properties.
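
A similarly brief sketch of SHAP, using a scikit-learn gradient boosting model purely as a stand-in:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Feature attributions for the first prediction.
print(dict(zip(X.columns, shap_values[0])))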

4. DeepLIFT (Deep Learning Important FeaTures): DeepLIFT is a method that deconstructs a neural network’s output prediction for a given input by tracing the influence of all neurons in the network back to each input feature. This technique compares the activation of each neuron to a predefined ‘reference activation’ and assigns contribution scores based on the observed differences. DeepLIFT can separately address positive and negative contributions, allowing it to reveal dependencies that other techniques may miss. Moreover, it can compute these contribution scores efficiently in just one backward pass through the network.

5. ELI5: A Python package that helps debug machine learning classifiers and explain their predictions. It supports multiple ML frameworks and packages, including Keras, XGBoost, LightGBM, and CatBoost. ELI5 also implements several algorithms for inspecting black-box models.

6. AI Explainability 360 (AIX360): The AIX360 toolkit is an open-source library that allows for the interpretability and explainability of data & machine learning models. This Python package includes a comprehensive set of algorithms covering different explanation dimensions and proxy explainability metrics.

7. Shapash: A Python library designed to make machine learning interpretable and accessible to everyone. It offers various visualization types with clear and explicit labels that are easy to understand. This enables Data Scientists to comprehend their models better and share their findings, while end users can grasp the decisions made by a model through a summary of the most influential factors. MAIF Data Scientists developed Shapash.

8. XAI: A Machine Learning library designed with AI explainability at its core. XAI contains various tools that enable the analysis and evaluation of data and models. The Institute for Ethical AI & ML maintains the XAI library. More broadly, the XAI library is designed using the three steps of explainable machine learning, which involve 1) data analysis, 2) model evaluation, and 3) production monitoring.

9. OmniXAI: An open-source Python library for XAI proposed by Salesforce researchers, offering comprehensive capabilities for understanding and interpreting ML decisions. It integrates various interpretable ML techniques into a unified interface, supporting multiple data types and models. With a user-friendly interface, practitioners can easily generate explanations and visualize insights with minimal code. OmniXAI aims to simplify XAI for data scientists and practitioners across different ML process stages.


10. Activation atlases: These atlases expand upon feature visualization, a method used to explore the representations within the hidden layers of neural networks. Initially, feature visualization concentrated on single neurons. By gathering and visualizing hundreds of thousands of examples of how neurons interact, activation atlases shift the focus from isolated neurons to the broader representational space that these neurons collectively inhabit.

In conclusion, the landscape of AI is evolving rapidly, with increasingly complex models driving advancements across various sectors. However, the rise of opaque models like Deep Neural Networks has underscored the critical need for transparency in decision-making processes. XAI frameworks have emerged as essential tools to address this challenge, offering practitioners the means to understand and interpret machine learning decisions effectively. Through a diverse array of techniques and libraries such as the What-If Tool, LIME, SHAP, and OmniXAI, stakeholders can gain insights into model mechanisms, visualize data features, and assess fairness metrics, thereby fostering trust, accountability, and ethical AI implementation in diverse real-world applications.
The post Top 10 Explainable AI (XAI) Frameworks appeared first on MarkTechPost.

JP Morgan AI Research Introduces FlowMind: A Novel Machine Learning Approach that Leverages the Capabilities of LLMs such as GPT to Create an Automatic Workflow Generation System

Automation in modern industries often involves repetitive tasks, but the challenge arises when tasks require flexibility and spontaneous decision-making. Traditional robotic process automation (RPA) systems are designed for static, routine activities, falling short when unpredictability is introduced. These systems are typically confined to predefined workflows, limiting their ability to handle tasks that deviate from standard procedures or require immediate adaptation.

In many sectors, particularly financial services, dynamic workflow automation is critical. Traditional approaches cannot efficiently manage non-standard tasks requiring high security and adaptability levels. This issue is pronounced in environments where data integrity and confidentiality are paramount.

Existing research in Robotic Process Automation (RPA) has focused on rule-based systems like UiPath and Blue Prism, which automate routine tasks such as data entry and customer service. The rise of Large Language Models (LLMs) like OpenAI’s Generative Pretrained Transformer (GPT) series has expanded capabilities into dynamic code generation. Frameworks like Langchain and HuggingFace’s Transformer Agent further integrate LLMs with external data for adaptive responses. At the same time, AutoGPT addresses limited problem-solving scenarios, highlighting the need for more robust and flexible automation solutions in data-sensitive fields like finance.

Researchers at J.P. Morgan AI Research have introduced FlowMind, a system employing LLMs, particularly Generative Pretrained Transformer (GPT), to automate workflows dynamically. This innovation stands out because it incorporates ‘lecture recipes’ to prime LLMs before task engagement, ensuring an understanding of the task context and API functionality. This methodology significantly boosts the model’s ability to handle complex, real-world tasks securely and efficiently without directly interacting with sensitive data.

FlowMind operates through a structured two-stage framework. Initially, the system educates the LLM on task-specific APIs through a detailed lecture phase, preparing the model with necessary contextual information and technical specifics. In the workflow generation phase, the LLM applies this knowledge to generate and execute code based on user inputs dynamically. The methodology utilizes the NCEN-QA dataset, specifically designed for financial workflows, which includes a variety of question-answer pairs based on N-CEN reports about funds. This dataset tests the LLM’s ability to handle real-world financial queries effectively. User feedback is integrated into the process, allowing for continuous refinement of the workflows to ensure relevance and accuracy.
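
The paper’s implementation is not public; the following is only a rough illustrative sketch of the two-stage idea, where call_llm and the API descriptions are hypothetical stand-ins rather than the authors’ code:

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion client (e.g., a GPT API call).
    raise NotImplementedError
# Stage 1: the "lecture" primes the model with context and the safe, task-specific APIs,
# so generated workflows call APIs rather than touching raw, sensitive data directly.
LECTURE = (
    "You write Python workflows that answer financial questions.\n"
    "You may only call these APIs:\n"
    "- get_fund_holdings(fund_name) -> list of holdings\n"
    "- get_fund_expense_ratio(fund_name) -> float\n"
)
def generate_workflow(user_query: str) -> str:
    # Stage 2: generate code for the user query using only the lectured APIs;
    # user feedback on the result can be appended for another refinement round.
    prompt = LECTURE + "\nUser request: " + user_query + "\nRespond with a Python function named workflow()."
    return call_llm(prompt)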

FlowMind has demonstrated robust performance in automated workflow generation, achieving exceptional accuracy rates across various tests. Specifically, in the NCEN-QA dataset, FlowMind achieved an outstanding accuracy of 99.5% on easier tasks and 96.0% on more complex scenarios, significantly outperforming traditional RPA systems. These impressive results illustrate the effectiveness of lecture-based preparation and API integration. Incorporating user feedback into the workflow led to further improvements, allowing the system to refine its outputs and adapt to user-specific requirements, ultimately enhancing the accuracy and applicability of the generated workflows.

In conclusion, the research introduced FlowMind, developed by J.P. Morgan AI Research. It leverages LLMs, specifically GPT, to automate complex workflows dynamically. This system uniquely integrates structured API interactions and user feedback into a two-stage framework, enhancing security and adaptability. The methodology has proven effective, achieving up to 100% accuracy in realistic financial scenarios through the NCEN-QA dataset. FlowMind’s innovative approach represents a significant advancement in RPA, offering a scalable, efficient solution that directly addresses the needs of industries requiring robust, flexible automation systems.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post JP Morgan AI Research Introduces FlowMind: A Novel Machine Learning Approach that Leverages the Capabilities of LLMs such as GPT to Create an Automatic Workflow Generation System appeared first on MarkTechPost.

OpenCRISPR: An Open-Source AI-Generated Gene Editor that Exhibits Compatibility with Base Editing

Gene editing is a rapidly evolving field with profound implications for agriculture, biotechnology, and medicine. One of the most promising tools in this arena is the CRISPR-Cas system, originally derived from bacterial immune defence mechanisms. This technology offers a precise means of altering genetic sequences, but its adaptation from microbial environments to more complex eukaryotic cells often compromises efficiency and specificity.

Researchers have sought to enhance the functionality of CRISPR systems to address this challenge. Traditional methods like directed evolution and structure-guided design have facilitated some progress. However, these techniques struggle with protein evolution’s intricate and unpredictable nature, often leading to suboptimal performance when these systems are applied outside their natural contexts.

A research team from Profluent Bio, Berkeley, CA, USA; Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland; Department of Microbiology, University of Washington, Seattle, WA, USA has employed artificial intelligence to overcome these limitations. The team has pioneered the design of novel gene editors by training large language models on a dataset comprising over a million CRISPR operons and 26 terabases of assembled genomes. This AI-driven approach sidesteps the slow and uncertain process of natural evolution, enabling the rapid generation of diverse and highly functional proteins.

The results of this AI-centric methodology are striking: the newly designed proteins, including the standout OpenCRISPR-1, have significantly improved target accuracy and reduced off-target effects. For example, OpenCRISPR-1 demonstrated compatibility with base editing, a refined form of gene editing that allows for single nucleotide changes without creating double-strand breaks. In practical terms, OpenCRISPR-1 achieved editing efficiency comparable to the best existing systems like SpCas9 but with far fewer unwanted mutations, showing up to a 95% reduction in off-target activity in some cases.

These AI-generated proteins exhibited an expanded range of functionality. They maintained high activity across varied conditions, easily adapting to different temperatures and molecular environments. This adaptability is critical for applications in human health, where precision and reliability are paramount. The research documented the creation of over four million protein sequences, from which a select group was chosen for detailed characterisation based on their robustness and specificity.

In conclusion, the study underscores a significant advancement in gene editing technology by addressing the functional limitations of CRISPR-Cas systems in non-native environments. Researchers have innovatively generated diverse functional gene editors by harnessing artificial intelligence. The standout, OpenCRISPR-1, demonstrated high efficiency and specificity and remarkably reduced off-target effects. This breakthrough illustrates the potential of AI-driven methodologies to refine and accelerate the development of gene editing tools, paving the way for more precise and reliable applications in medicine and agriculture. This research sets a new standard in the field and promises transformative future developments.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post OpenCRISPR: An Open-Source AI-Generated Gene Editor that Exhibits Compatibility with Base Editing appeared first on MarkTechPost.

Enhance conversational AI with advanced routing techniques with Amazon Bedrock

Conversational artificial intelligence (AI) assistants are engineered to provide precise, real-time responses through intelligent routing of queries to the most suitable AI functions. With AWS generative AI services like Amazon Bedrock, developers can create systems that expertly manage and respond to user requests. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
This post assesses two primary approaches for developing AI assistants: using managed services such as Agents for Amazon Bedrock, and employing open source technologies like LangChain. We explore the advantages and challenges of each, so you can choose the most suitable path for your needs.
What is an AI assistant?
An AI assistant is an intelligent system that understands natural language queries and interacts with various tools, data sources, and APIs to perform tasks or retrieve information on behalf of the user. Effective AI assistants possess the following key capabilities:

Natural language processing (NLP) and conversational flow
Knowledge base integration and semantic searches to understand and retrieve relevant information based on the nuances of conversation context
Running tasks, such as database queries and custom AWS Lambda functions
Handling specialized conversations and user requests

We demonstrate the benefits of AI assistants using Internet of Things (IoT) device management as an example. In this use case, AI can help technicians manage machinery efficiently with commands that fetch data or automate tasks, streamlining operations in manufacturing.
Agents for Amazon Bedrock approach
Agents for Amazon Bedrock allows you to build generative AI applications that can run multi-step tasks across a company’s systems and data sources. It offers the following key capabilities:

Automatic prompt creation from instructions, API details, and data source information, saving weeks of prompt engineering effort
Retrieval Augmented Generation (RAG) to securely connect agents to a company’s data sources and provide relevant responses
Orchestration and running of multi-step tasks by breaking down requests into logical sequences and calling necessary APIs
Visibility into the agent’s reasoning through a chain-of-thought (CoT) trace, allowing troubleshooting and steering of model behavior
Prompt engineering abilities to modify the automatically generated prompt template for enhanced control over agents

You can use Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock to build and deploy AI assistants for complex routing use cases. They provide a strategic advantage for developers and organizations by simplifying infrastructure management, enhancing scalability, improving security, and reducing undifferentiated heavy lifting. They also allow for simpler application layer code, because the routing logic, vectorization, and memory are fully managed.
Solution overview
This solution introduces a conversational AI assistant tailored for IoT device management and operations when using Anthropic’s Claude v2.1 on Amazon Bedrock. The AI assistant’s core functionality is governed by a comprehensive set of instructions, known as a system prompt, which delineates its capabilities and areas of expertise. This guidance makes sure the AI assistant can handle a wide range of tasks, from managing device information to running operational commands.

The following is the system prompt that outlines the full scope of the AI assistant’s capabilities:

"""You are an IoT Ops agent that handles the following activities:
- Looking up IoT device information
- Checking IoT operating metrics (historical data)
- Performing actions on a device by device ID
- Answering general questions
You can check device information (Device ID, Features, Technical Specifications, Installation Guide, Maintenance and Troubleshooting, Safety Guidelines, Warranty, and Support) from the "IotDeviceSpecs" knowledge base.
Additionally, you can access device historical data or device metrics. The device metrics are stored in an Athena DB named "iot_ops_glue_db" in a table named "iot_device_metrics".
The table schema includes fields for oil level, temperature, pressure, received_at timestamp, and device_id.
The available actions you can perform on the devices include start, shutdown, and reboot."""

Equipped with these capabilities, as detailed in the system prompt, the AI assistant follows a structured workflow to address user questions. The following figure provides a visual representation of this workflow, illustrating each step from initial user interaction to the final response.

The workflow is composed of the following steps:

The process begins when a user requests the assistant to perform a task; for example, asking for the maximum data points for a specific IoT device device_xxx. This text input is captured and sent to the AI assistant.
The AI assistant interprets the user’s text input. It uses the provided conversation history, action groups, and knowledge bases to understand the context and determine the necessary tasks.
After the user’s intent is parsed and understood, the AI assistant defines tasks. This is based on the instructions that are interpreted by the assistant as per the system prompt and user’s input.
The tasks are then run through a series of API calls. This is done using ReAct prompting, which breaks down the task into a series of steps that are processed sequentially:

For device metrics checks, we use the check-device-metrics action group, which involves an API call to Lambda functions that then query Amazon Athena for the requested data.
For direct device actions like start, stop, or reboot, we use the action-on-device action group, which invokes a Lambda function. This function initiates a process that sends commands to the IoT device. For this post, the Lambda function sends notifications using Amazon Simple Email Service (Amazon SES).
We use Knowledge Bases for Amazon Bedrock to fetch from historical data stored as embeddings in the Amazon OpenSearch Service vector database.

After the tasks are complete, the final response is generated by the Amazon Bedrock FM and conveyed back to the user.
Agents for Amazon Bedrock automatically stores information using a stateful session to maintain the same conversation. The state is deleted after a configurable idle timeout elapses.
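
At the application layer, the entire workflow above is triggered by a single call to the agent runtime. The following is a minimal sketch using the boto3 bedrock-agent-runtime client; the agent ID, alias ID, and session ID are placeholders for values from your own deployment:

import boto3
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
response = bedrock_agent_runtime.invoke_agent(
    agentId="AGENT_ID",             # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId="session-001",        # reuse the same sessionId to keep conversation state
    inputText="What is the maximum temperature recorded for device_xxx this week?",
)
# The completion is returned as an event stream; concatenate the chunks into text.
completion = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        completion += chunk["bytes"].decode("utf-8")
print(completion)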

Technical overview
The following diagram illustrates the architecture to deploy an AI assistant with Agents for Amazon Bedrock.

It consists of the following key components:

Conversational interface – The conversational interface uses Streamlit, an open source Python library that simplifies the creation of custom, visually appealing web apps for machine learning (ML) and data science. It is hosted on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate, and it is accessed using an Application Load Balancer. You can use Fargate with Amazon ECS to run containers without having to manage servers, clusters, or virtual machines.
Agents for Amazon Bedrock – Agents for Amazon Bedrock completes the user queries through a series of reasoning steps and corresponding actions based on ReAct prompting:

Knowledge Bases for Amazon Bedrock – Knowledge Bases for Amazon Bedrock provides fully managed RAG to supply the AI assistant with access to your data. In our use case, we uploaded device specifications into an Amazon Simple Storage Service (Amazon S3) bucket. It serves as the data source to the knowledge base.
Action groups – These are defined API schemas that invoke specific Lambda functions to interact with IoT devices and other AWS services.
Anthropic Claude v2.1 on Amazon Bedrock – This model interprets user queries and orchestrates the flow of tasks.
Amazon Titan Embeddings – This model serves as a text embeddings model, transforming natural language text—from single words to complex documents—into numerical vectors. This enables vector search capabilities, allowing the system to semantically match user queries with the most relevant knowledge base entries for effective search.

The solution is integrated with AWS services such as Lambda for running code in response to API calls, Athena for querying datasets, OpenSearch Service for searching through knowledge bases, and Amazon S3 for storage. These services work together to provide a seamless experience for IoT device operations management through natural language commands.
Benefits
This solution offers the following benefits:

Implementation complexity:

Fewer lines of code are required, because Agents for Amazon Bedrock abstracts away much of the underlying complexity, reducing development effort
Managing vector databases like OpenSearch Service is simplified, because Knowledge Bases for Amazon Bedrock handles vectorization and storage
Integration with various AWS services is more streamlined through pre-defined action groups

Developer experience:

The Amazon Bedrock console provides a user-friendly interface for prompt development, testing, and root cause analysis (RCA), enhancing the overall developer experience

Agility and flexibility:

Agents for Amazon Bedrock allows for seamless upgrades to newer FMs (such as Claude 3.0) when they become available, so your solution stays up to date with the latest advancements
Service quotas and limitations are managed by AWS, reducing the overhead of monitoring and scaling infrastructure

Security:

Amazon Bedrock is a fully managed service, adhering to AWS’s stringent security and compliance standards, potentially simplifying organizational security reviews

Although Agents for Amazon Bedrock offers a streamlined and managed solution for building conversational AI applications, some organizations may prefer an open source approach. In such cases, you can use frameworks like LangChain, which we discuss in the next section.
LangChain dynamic routing approach
LangChain is an open source framework that simplifies building conversational AI by allowing the integration of large language models (LLMs) and dynamic routing capabilities. With LangChain Expression Language (LCEL), developers can define the routing, which allows you to create non-deterministic chains where the output of a previous step defines the next step. Routing helps provide structure and consistency in interactions with LLMs.
For this post, we use the same example as the AI assistant for IoT device management. However, the main difference is that we need to handle the system prompts separately and treat each chain as a separate entity. The routing chain decides the destination chain based on the user’s input. The decision is made with the support of an LLM by passing the system prompt, chat history, and user’s question.
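
To make the pattern concrete, the following is a minimal LCEL-style routing sketch; the BedrockChat wrapper, prompts, and destination chains are simplified assumptions rather than the full solution:

from langchain_community.chat_models import BedrockChat
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
llm = BedrockChat(model_id="anthropic.claude-v2:1")
# Classifier chain: returns the name of a destination chain.
classifier = (
    ChatPromptTemplate.from_template(
        "Classify the question as one of: sql, lambdachain, rag, default.\n"
        "Question: {question}\nAnswer with a single word."
    )
    | llm
    | StrOutputParser()
)
# Simplified destination chains; the real ones would call Athena, Lambda, or a retriever.
sql_chain = ChatPromptTemplate.from_template("Write an Athena SQL query for: {question}") | llm | StrOutputParser()
rag_chain = ChatPromptTemplate.from_template("Answer from the device knowledge base: {question}") | llm | StrOutputParser()
default_chain = ChatPromptTemplate.from_template("{question}") | llm | StrOutputParser()
def route(inputs: dict):
    destination = classifier.invoke(inputs).strip().lower()
    if "sql" in destination:
        return sql_chain
    if "rag" in destination:
        return rag_chain
    return default_chain
# When the routed function returns a Runnable, LCEL invokes it with the same input.
full_chain = RunnableLambda(route)
print(full_chain.invoke({"question": "What are the max metrics for device 1009?"}))
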
Solution overview
The following diagram illustrates the dynamic routing solution workflow.

The workflow consists of the following steps:

The user presents a question to the AI assistant. For example, “What are the max metrics for device 1009?”
An LLM evaluates each question along with the chat history from the same session to determine its nature and which subject area it falls under (such as SQL, action, search, or SME). The LLM classifies the input and the LCEL routing chain takes that input.
The router chain selects the destination chain based on the input, and the LLM is provided with the following system prompt:

"""Given the user question below, classify it as one of the candidate prompts. You may want to modify the input considering the chat history and the context of the question.
Sometimes the user may just assume that you have the context of the conversation and may not provide a clear input. Hence, you are being provided with the chat history for more context.
Respond with only a Markdown code snippet containing a JSON object formatted EXACTLY as specified below.
Do not provide an explanation for your classification beside the Markdown; I just need to know your decision on which destination and next_inputs.
<candidate prompt>
physics: Good for answering questions about physics
sql: Good for querying SQL from AWS Athena. User input may look like: get me max or min for device x?
lambdachain: Good to execute actions with AWS Lambda like shutting down a device or turning off an engine. User input can be like: shutdown device x, or terminate process y, etc.
rag: Good to search the knowledge base and retrieve information about devices and other related information. User question can be like: what do you know about device x?
default: if the input is not well suited for any of the candidate prompts above. This could be used to carry on the conversation and respond to queries like: provide a summary of the conversation
</candidate prompt>"""

The LLM evaluates the user’s question along with the chat history to determine the nature of the query and which subject area it falls under. The LLM then classifies the input and outputs a JSON response in the following format:

<Markdown>
```json
{{
    "destination": string name of the prompt to use
    "next_inputs": string a potentially modified version of the original input
}}
```

The router chain uses this JSON response to invoke the corresponding destination chain. There are four subject-specific destination chains, each with its own system prompt:

SQL-related queries are sent to the SQL destination chain for database interactions. You can use LCEL to build the SQL chain.
Action-oriented questions invoke the custom Lambda destination chain for running operations. With LCEL, you can define your own custom function; in our case, it’s a function to run a predefined Lambda function to send an email with a device ID parsed. Example user input might be “Shut down device 1009.”
Search-focused inquiries proceed to the RAG destination chain for information retrieval.
SME-related questions go to the SME/expert destination chain for specialized insights.
Each destination chain takes the input and runs the necessary models or functions:

The SQL chain uses Athena for running queries.
The RAG chain uses OpenSearch Service for semantic search.
The custom Lambda chain runs Lambda functions for actions.
The SME/expert chain provides insights using the Amazon Bedrock model.

Responses from each destination chain are formulated into coherent insights by the LLM. These insights are then delivered to the user, completing the query cycle.
User input and responses are stored in Amazon DynamoDB to provide context to the LLM for the current session and from past interactions. The duration of persisted information in DynamoDB is controlled by the application.

Technical overview
The following diagram illustrates the architecture of the LangChain dynamic routing solution.

The web application is built on Streamlit hosted on Amazon ECS with Fargate, and it is accessed using an Application Load Balancer. We use Anthropic’s Claude v2.1 on Amazon Bedrock as our LLM. The web application interacts with the model using LangChain libraries. It also interacts with a variety of other AWS services, such as OpenSearch Service, Athena, and DynamoDB, to fulfill end-users’ needs.
Benefits
This solution offers the following benefits:

Implementation complexity:

Although it requires more code and custom development, LangChain provides greater flexibility and control over the routing logic and integration with various components.
Managing vector databases like OpenSearch Service requires additional setup and configuration efforts. The vectorization process is implemented in code.
Integrating with AWS services may involve more custom code and configuration.

Developer experience:

LangChain’s Python-based approach and extensive documentation can be appealing to developers already familiar with Python and open source tools.
Prompt development and debugging may require more manual effort compared to using the Amazon Bedrock console.

Agility and flexibility:

LangChain supports a wide range of LLMs, allowing you to switch between different models or providers, fostering flexibility.
The open source nature of LangChain enables community-driven improvements and customizations.

Security:

As an open source framework, LangChain may require more rigorous security reviews and vetting within organizations, potentially adding overhead.

Conclusion
Conversational AI assistants are transformative tools for streamlining operations and enhancing user experiences. This post explored two powerful approaches using AWS services: the managed Agents for Amazon Bedrock and the flexible, open source LangChain dynamic routing. The choice between these approaches hinges on your organization’s requirements, development preferences, and desired level of customization. Regardless of the path taken, AWS empowers you to create intelligent AI assistants that revolutionize business and customer interactions.
Find the solution code and deployment assets in our GitHub repository, where you can follow the detailed steps for each conversational AI approach.

About the Authors
Ameer Hakme is an AWS Solutions Architect based in Pennsylvania. He collaborates with Independent Software Vendors (ISVs) in the Northeast region, assisting them in designing and building scalable and modern platforms on the AWS Cloud. An expert in AI/ML and generative AI, Ameer helps customers unlock the potential of these cutting-edge technologies. In his leisure time, he enjoys riding his motorcycle and spending quality time with his family.
Sharon Li is an AI/ML Solutions Architect at Amazon Web Services based in Boston, with a passion for designing and building Generative AI applications on AWS. She collaborates with customers to leverage AWS AI/ML services for innovative solutions.
Kawsar Kamal is a senior solutions architect at Amazon Web Services with over 15 years of experience in the infrastructure automation and security space. He helps clients design and build scalable DevSecOps and AI/ML solutions in the Cloud.

Improve LLM performance with human and AI feedback on Amazon SageMaker …

The Amazon EU Design and Construction (Amazon D&C) team is the engineering team designing and constructing Amazon warehouses. The team navigates a large volume of documents and locates the right information to make sure the warehouse design meets the highest standards. In the post A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, we presented a question answering bot solution using a Retrieval Augmented Generation (RAG) pipeline with a fine-tuned large language model (LLM) for Amazon D&C to efficiently retrieve accurate information from a large volume of unorganized documents, and provide timely and high-quality services in their construction projects. The Amazon D&C team implemented the solution in a pilot for Amazon engineers and collected user feedback.
In this post, we share how we analyzed the feedback data, identified the limitations in accuracy and the hallucinations that RAG produced, and used the human evaluation scores to train the model through reinforcement learning. To increase the number of training samples for better learning, we also used another LLM to generate feedback scores. This method addressed the RAG limitations and further improved the bot response quality. We present the reinforcement learning process and the benchmarking results to demonstrate the LLM performance improvement. The solution uses Amazon SageMaker JumpStart as the core service for model deployment, fine-tuning, and reinforcement learning.
Collect feedback from Amazon engineers in a pilot project
After developing the solution described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction, the Amazon D&C team deployed the solution and ran a pilot project with Amazon engineers. The engineers accessed the pilot system through a web application developed by Streamlit, connected with the RAG pipeline. In the pipeline, we used Amazon OpenSearch Service for the vector database, and deployed a fine-tuned Mistral-7B-Instruct model on Amazon SageMaker.
One of the key objectives of the pilot is to collect feedback from Amazon engineers and use the feedback to further reduce LLM hallucinations. To achieve this, we developed a feedback collection module in the UI, as shown in the following figure, and stored the web session information and user feedback in Amazon DynamoDB. Through the feedback collection UI, Amazon engineers can select from five satisfaction levels: strongly disagree, disagree, neutral, agree, and strongly agree, corresponding to the feedback scores from 1–5. They can also provide a better answer to the question or comment on why the LLM response is not satisfactory.

During the pilot, users provided 118 feedback responses. The following chart shows the distribution from negative to positive. We observed that incorrect answers, marked by negative feedback (disagree and strongly disagree), account for 45% of the total feedback (53 out of 118). Some of the errors (about 10% of negative feedback and 7.5% of overall responses) can be addressed by user education and prompt engineering. Additionally, we can address the issue with LLM fine-tuning and reinforcement learning, described in the next section.

Some users simply left a note, such as “Great!” for the strongly agree answers, or “Doesn’t answer the question,” for the strongly disagree answers. Other users provided scores and explained how they justify the LLM answers in their notes. In the following example, the user specifically provided the adequate document and content to correct the LLM hallucination.

Question
How many car parking spaces do we have?

Bot Answer
Based on the given context, we don’t have enough information to determine the number of car parking spaces. The context discusses the minimum size of parking spaces and the requirements for external lighting levels, but it does not provide a specific number of parking spaces. **You can refer to further explanations in the following resources:** * ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 10 * ARS GEN 10.0/05.01.02. Design Criteria & Appendices/Performance Package AR Sortable Design Criteria v20.1.1.pdf – page: 79

User score
Strongly Disagree

User notes
This is specified on page 21 of design criteria section 01 13 10

Improve bot response with supervised fine-tuning and reinforcement learning
The solution consists of three steps of fine-tuning:

Conduct supervised fine-tuning using labeled data. This method was described in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction.
Collect user feedback to label the question-answer pairs for further LLM tuning.
When the training data is ready, further tune the model using reinforcement learning from human feedback (RLHF).

RLHF is widely used throughout generative artificial intelligence (AI) and LLM applications. It incorporates human feedback in the reward function and trains the model with a reinforcement learning algorithm to maximize rewards, which makes the model perform tasks more aligned with human goals. The following diagram shows the pipeline of the steps.

We tested the methodology using the Amazon D&C documents with a Mistral-7B model on SageMaker JumpStart.
Supervised fine-tuning
In the previous post, we demonstrated how the fine-tuned Falcon-7B model outperforms the RAG pipeline and improves the quality and accuracy of QA bot responses. For this post, we performed supervised fine-tuning on the Mistral-7B model. The supervised fine-tuning used the PEFT/LoRA technique (LoRA_r = 512, LoRA_alpha = 1024) on 436,207,616 parameters (5.68% of the total 7,677,964,288 parameters). The training was conducted on a p3.8xlarge node with 137 samples synthetically generated by LLM and validated by humans; the process converged well after 20 epochs, as shown in the following figure.
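
The fine-tuning itself ran through SageMaker JumpStart; for reference, a minimal PEFT/LoRA configuration with the hyperparameters above might look like the following sketch, where the base model ID and the target_modules choice are assumptions rather than details from the training job:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
lora_config = LoraConfig(
    r=512,                                 # LoRA_r used in this post
    lora_alpha=1024,                       # LoRA_alpha used in this post
    target_modules=["q_proj", "v_proj"],   # assumption: which attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints the trainable parameter count and fraction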

The fine-tuned model was validated by 274 samples, and the inference results were compared with the reference answers by the semantic similarity score. The score is 0.8100, which is higher than the score of 0.6419 from the traditional RAG.
Collect human and AI feedback for reinforcement learning
For RLHF, a sufficient amount of high-quality training samples labeled by subject matter experts (SMEs) are essential. However, poor-quality human labels will likely cause worse model performance than the original model after RLHF training. SMEs’ time is a scarce resource in any organization; reviewing hundreds or thousands of LLM responses and providing feedback requires a significant time investment from SMEs that may not have a clear return on investment.
To address this challenge, we adopted the reinforcement learning from AI feedback (RLAIF) method. RLAIF employs an AI assistant (another LLM) to provide evaluation scores, rather than from humans. In this hybrid learning approach, the learning agent refines the actions not only based on the interaction with a human but also from feedback provided by another AI model. It is much more scalable to provide sufficient training data for reinforcement learning, and much less subjective because the training is not dependent on particular preferences from a small group of SMEs. The SMEs may still be involved, but only to supervise the process and examine the AI feedback quality. This tremendously reduces the SMEs’ workload, because the AI-generated scores and labels are much less complicated for SMEs to analyze, filter, and sort.
In this use case, we provide the following prompt to an Anthropic Claude 2 model to generate the evaluation scores as the training samples for the reinforcement learning step:

“””

Human:

I am going to provide you with a question, a reference answer, and a machine-generated answer to the question.

question: {question}

reference answer: {reference_answer}

machine_answer: {machine_answer}

Please analyze the question, the reference answer, and the machine generated answer to determine if the machine generated answer semantically matches the reference answer.

First think through your answer inside of <thinking> tags, then assign a score between 0.0 and 5.0 based on how well the machine answer matches the reference answer. A score of 5.0 means a complete match, and a score of 0.0 means a complete mismatch. Answer the score inside of <scoring> tags.

Also provide the reason to give this score inside of <reason> tags.

Assistant: <thinking>

“””
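
A hedged sketch of how such a template can be sent to Anthropic Claude 2 on Amazon Bedrock and the numeric score extracted from the <scoring> tags follows; the regex-based parsing is our illustration, not code from the solution:

import json
import re
from typing import Optional
import boto3
bedrock_runtime = boto3.client("bedrock-runtime")
def ai_feedback_score(question: str, reference_answer: str, machine_answer: str) -> Optional[float]:
    # Build the evaluation prompt in Claude's Human/Assistant format.
    prompt = (
        "\n\nHuman: I am going to provide you with a question, a reference answer, "
        "and a machine-generated answer to the question.\n\n"
        f"question: {question}\n\nreference answer: {reference_answer}\n\nmachine_answer: {machine_answer}\n\n"
        "First think through your answer inside of <thinking> tags, then assign a score between 0.0 and 5.0 "
        "inside of <scoring> tags, and give the reason inside of <reason> tags.\n\nAssistant: <thinking>"
    )
    body = json.dumps({"prompt": prompt, "max_tokens_to_sample": 500})
    response = bedrock_runtime.invoke_model(
        body=body, modelId="anthropic.claude-v2", accept="application/json", contentType="application/json"
    )
    completion = json.loads(response["body"].read())["completion"]
    match = re.search(r"<scoring>\s*([0-9.]+)\s*</scoring>", completion)
    return float(match.group(1)) if match else None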

The Claude 2 model generates an evaluation score and a comment on why the score is given, as shown in the following example.

Question
What is meant by ‘Fit Out Works’ in this context?

Reference Answer
Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. These Works include Amazon’s required Fit-Outs, big-box installations, and customizations.

QA Bot response
Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position.

AI score
4.5

AI comment
The machine answer provides the key information from the standard answer – that Fit Out Works refers to Works carried out under the construction contract by or on behalf of Amazon via the Developer up to First Receive Date from a notional Developer Shell & Core Base-build position. The additional details about including Amazon’s required Fit-Outs, big-box installations, and customizations are relevant and help provide further context, so I’ve scored the machine answer 4.5 out of 5. It captures the essence of the standard answer and provides some useful extra details.

Out of the 274 validation questions, the supervised fine-tuned model generated 159 responses with AI scores greater than 4. We observed 60 answers with scores lower than 3, so there is room to improve the overall response quality.

The Amazon Engineering SMEs validated this AI feedback and acknowledged the benefits of using AI scores. Without AI feedback, the SMEs would need considerable time to review and analyze each LLM response to identify truncated answers and hallucinations, and to judge whether the LLM is returning correct content and key concepts. AI feedback provides AI scores automatically and enables the SMEs to use filtering, sorting, and grouping to validate the scores and identify trends in the responses. This reduces the average SME review time by 80%.
Reinforcement learning from human and AI feedback
When the training samples are ready, we use the proximal policy optimization (PPO) algorithm to perform reinforcement learning. PPO uses a policy gradient method, which takes small steps to update the policy in the learning process, so that the learning agents can reliably reach the optimal policy network. This makes the training process more stable and reduces the possibility of divergence.
During the training, we first use the human- and AI-labeled data to build a reward model, which will be used to guide the weight updates in the learning process. For this use case, we select a distilroberta-base reward model and train it with samples in the following format:
[Instruction, Chosen_response, Rejected_response]
The following is an example of a training record.

Instruction
According to the context, what is specified for inclusive and accessible design?

Chosen_response
BREEAM Credit HEA06 – inclusive and accessible design – The building is designed to be fit for purpose, appropriate and accessible by all potential users. An access strategy is developed in line with the BREEAM Check list A3

Rejected_response
The context states that

The reward model is trained with a learning rate of 1e-5. As shown in the following chart, the training converges well after 10 epochs.
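The reward model training code is not included in this post. The following is a minimal sketch of the core idea, scoring chosen and rejected responses with a distilroberta-base classifier head and minimizing a pairwise ranking loss (dataset loading and batching are omitted; the pairs list is a placeholder for the [Instruction, Chosen_response, Rejected_response] records):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
# A single scalar output per sequence acts as the reward score
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def reward(instruction, response):
    inputs = tokenizer(instruction, response, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0]

# pairs is a placeholder list of (instruction, chosen_response, rejected_response) records
for instruction, chosen, rejected in pairs:
    # Pairwise ranking loss: the chosen response should score higher than the rejected one
    loss = -torch.nn.functional.logsigmoid(reward(instruction, chosen) - reward(instruction, rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()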

Then the reward model is used for reinforcement learning. For this use case, we use PEFT/LoRA to accelerate the learning, with LoRA_r set to 32 and LoRA_alpha set to 64. The PPO-based learning uses a learning rate of 1.41e-5 over three epochs. Reinforcement training is highly volatile and very sensitive to the label and feedback data. We use the reward_mean metric, the average reward obtained from the environment as measured by the reward model, to observe the learning process.
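As a rough illustration, the PPO loop can be sketched with the Hugging Face TRL library (not the exact code used in the project; API details vary across TRL versions, and the dataset, generation, and reward-model scoring are omitted):

import torch
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder checkpoint name; the post starts from the supervised fine-tuned Mistral-7B model
model_name = "mistralai/Mistral-7B-v0.1"
lora_config = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, peft_config=lora_config)

ppo_config = PPOConfig(model_name=model_name, learning_rate=1.41e-5)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

# batches is a placeholder iterable of (query_tensors, response_tensors, rewards),
# where the rewards come from the distilroberta-base reward model trained earlier
for query_tensors, response_tensors, rewards in batches:
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    # Monitor the average reward per batch (the reward_mean metric) to track convergence
    print(torch.stack(rewards).mean().item())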
After RLAIF, the share of bot responses with positive scores (above 3) increases from 78.1% to 85.5%, and the share with negative scores (below 3) decreases from 21.9% to 14.5%. In total, the average AI feedback score increases by 8%, from 3.9 to 4.2.

You can see the improvement in the following example.

Question
For floor areas being tiled, what screed is needed?

Reference
Concrete Screed flooring is applicable only in floor-tiled areas. Refer to drawing series A-32 to see where floor-tiles are applicable.

Bot response before RLAIF
Concrete Screed flooring is applicable only in

AI score before RLAIF
3

Bot response after RLAIF
Concrete Screed flooring is applicable only in areas being tiled. Refer to drawing series A-32 to see where tile flooring is applicable.

AI score after RLAIF
4.5

Conclusion
This post is a continuation of our work with the Amazon EU Design and Construction team as discussed in A generative AI-powered solution on Amazon SageMaker to help Amazon EU Design and Construction. In this post, we showed how we generated human and AI feedback data to fine-tune the Mistral-7B model with reinforcement learning. The model after RLAIF provided better performance for Amazon Engineering's question answering bot, improving the AI feedback score by 8%. In the Amazon D&C team's pilot project, using RLAIF reduced the validation workload for SMEs by an estimated 80%. As the next step, we will scale up this solution by connecting with Amazon Engineering's data infrastructure, and design a framework to automate the continuous learning process with a human in the loop. We will also further improve the AI feedback quality by tuning the prompt template.
Through this process, we learned how to further improve the quality and performance of question answering tasks using RLHF and RLAIF.

Human validation and augmentation are essential to provide accurate and responsible outputs from LLMs. The human feedback can be used in RLHF to further improve the model response.
RLAIF automates the evaluation and learning cycle. The AI-generated feedback is less subjective because it doesn’t depend on a particular preference from a small pool of SMEs.
RLAIF is more scalable to improve the bot quality through continued reinforcement learning while minimizing the efforts required from SMEs. It is especially useful for developing domain-specific generative AI solutions within large organizations.
This process should be done on a regular basis, especially when new domain data is available to be covered by the solution.

In this use case, we used SageMaker JumpStart to test multiple LLMs and experiment with multiple LLM training approaches. It significantly accelerates the AI feedback and learning cycle with maximized efficiency and quality. For your own project, you can introduce the human-in-the-loop approach to collect your users’ feedback, or generate AI feedback using another LLM. Then you can follow the three-step process defined in this post to fine-tune your models using RLHF and RLAIF. We recommend experimenting with the methods using SageMaker JumpStart to speed up the process.

About the Author
Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Elad Dwek is a Construction Technology Manager at Amazon. With a background in construction and project management, Elad helps teams adopt new technologies and data-based processes to deliver construction projects. He identifies needs and solutions, and facilitates the development of the bespoke attributes. Elad has an MBA and a BSc in Structural Engineering. Outside of work, Elad enjoys yoga, woodworking, and traveling with his family.
Luca Cerabone is a Business Intelligence Engineer at Amazon. Drawing from his background in data science and analytics, Luca crafts tailored technical solutions to meet the unique needs of his customers, driving them towards more sustainable and scalable processes. Armed with an MSc in Data Science, Luca enjoys engaging in DIY projects, gardening and experimenting with culinary delights in his leisure moments.

Improve accuracy of Amazon Rekognition Face Search with user vectors

In various industries, such as financial services, telecommunications, and healthcare, customers use a digital identity process, which usually involves several steps to verify end-users during online onboarding or step-up authentication. An example of one step that can be used is face search, which can help determine whether a new end-user’s face matches those associated with an existing account.
Building an accurate face search system involves several steps. The system must be able to detect human faces in images, extract the faces into vector representations, store face vectors in a database, and compare new faces against existing entries. Amazon Rekognition makes this effortless by giving you pre-trained models that are invoked via simple API calls.
Amazon Rekognition enables you to achieve very high face search accuracy with a single face image. In some cases, you can use multiple images of the same person’s face to create user vectors and improve accuracy even further. This is especially helpful when images have variations in lighting, poses, and appearances.
In this post, we demonstrate how to use the Amazon Rekognition Face Search APIs with user vectors to increase the similarity score for true matches and decrease the similarity score for true non-matches.
We compare the results of performing face matching with and without user vectors.
Amazon Rekognition face matching
Amazon Rekognition face matching enables measuring the similarity of a face vector extracted from one image to a face vector extracted from another image. A pair of face images is said to be a true match if both images contain the face of the same person, and a true non-match otherwise. Amazon Rekognition returns a score for the similarity of the source and target faces. The minimum similarity score is 0, implying very little similarity, and the maximum is 100.
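For a simple 1:1 comparison of two images, the CompareFaces API returns this similarity score directly. The following is a minimal sketch (the bucket and object names are placeholders):

import boto3

client = boto3.client('rekognition')

# Compare the largest face in the source image against faces in the target image
response = client.compare_faces(
    SourceImage={'S3Object': {'Bucket': 'my-bucket', 'Name': 'source.jpeg'}},
    TargetImage={'S3Object': {'Bucket': 'my-bucket', 'Name': 'target.jpeg'}},
    SimilarityThreshold=80
)
for match in response['FaceMatches']:
    print(f"Similarity: {match['Similarity']:.2f}")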
For comparing a source face with a collection of target faces (1:N matching), Amazon Rekognition allows you to create a Collection object and populate it with faces from images using API calls.
When adding a face to a collection, Amazon Rekognition doesn’t store the actual image of the face but rather the face vector, a mathematical representation of the face. With the SearchFaces API, you can compare a source face with one or several collections of target faces.
In June 2023, AWS launched user vectors, a new capability that significantly improves face search accuracy by using multiple face images of a user. Now, you can create user vectors, which aggregate multiple face vectors of the same user. User vectors offer higher face search accuracy with more robust depictions, because they contain varying degrees of lighting, sharpness, pose, appearance, and more. This improves the accuracy compared to searching against individual face vectors.
In the following sections, we outline the process of using Amazon Rekognition user vectors. We guide you through creating a collection, storing face vectors in that collection, aggregating those face vectors into user vectors, and then comparing the results of searching against those individual face vectors and user vectors.
Solution overview
For this solution, we use an Amazon Rekognition collection of users, each with their associated indexed face vectors built from a number of different face images per user.
Let’s look at the workflow to build a collection with users and faces:

Create an Amazon Rekognition collection.
For each user, create a user in the collection.
For each image of the user, add the face to the collection (IndexFaces, which returns a face ID corresponding to each face vector).
Associate all indexed face IDs with the user (this is necessary for user vectors).

Then, we will compare the following workflows:
Searching with a new given input image against individual face vectors in our collection:

Get all faces from an image (DetectFaces).
For each face, compare against individual faces in our collection (SearchFacesByImage).

Searching with a new given input image against user vectors in our collection:

Get all faces from an image (DetectFaces).
For each face, compare to the user vector (SearchUsersByImage).

Now let's describe the solution in detail.
Prerequisites
Add the following policy to your AWS Identity and Access Management (IAM) user or role. The policy grants you permission to the relevant Amazon Rekognition APIs and allows access to an Amazon Simple Storage Service (Amazon S3) bucket to store the images:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RekognitionPermissions",
            "Effect": "Allow",
            "Action": [
                "rekognition:CreateCollection",
                "rekognition:DeleteCollection",
                "rekognition:CreateUser",
                "rekognition:IndexFaces",
                "rekognition:DetectFaces",
                "rekognition:AssociateFaces",
                "rekognition:SearchUsersByImage",
                "rekognition:SearchFacesByImage"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3BucketPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<replace_with_your_bucket>/*",
                "arn:aws:s3:::<replace_with_your_bucket>"
            ]
        }
    ]
}

Create an Amazon Rekognition collection and add users and faces
First, we create an S3 bucket to store users’ images. We organize the bucket by creating a folder for each user that contains their personal images. Our images folder looks like the following structure:

images
├── photo.jpeg
├── Swami
│   ├── Swami1.jpeg
│   └── Swami2.jpeg
└── Werner
    ├── Werner1.jpeg
    ├── Werner2.jpeg
    └── Werner3.jpeg

Our S3 bucket has a directory for each user that stores their images. There are currently two folders, and each contains several images. You can add more folders for your users, each containing one or more images to be indexed.
Next, we create our Amazon Rekognition collection. We have supplied helpers.py, which contains different methods that we use:

create_collection – Create a new collection
delete_collection – Delete a collection
create_user – Create a new user in a collection
add_faces_to_collection – Add faces to collection
associate_faces – Associate face_ids to a user in a collection
get_subdirs – Get all subdirectories under the S3 prefix
get_files – Get all files under the S3 prefix

The following is an example method for creating an Amazon Rekognition collection:

import boto3

session = boto3.Session()
client = session.client('rekognition')

def create_collection(collection_id):
    try:
        # Create a collection
        print('Creating collection: ' + collection_id)
        response = client.create_collection(CollectionId=collection_id)
        print('Collection ARN: ' + response['CollectionArn'])
        print('Status code: ' + str(response['StatusCode']))
        print('Done...')
    except client.exceptions.ResourceAlreadyExistsException:
        print('Resource already exists...')

Create the collection with the following code:

import helpers

collection_id = "faces-collection"
helpers.create_collection(collection_id)

Next, let’s add the face vectors into our collection and aggregate them into user vectors.
For each user in the S3 directory, we create a user vector in the collection. Then we index the face images for each user into the collection as individual face vectors, which generates face IDs. Lastly, we associate the face IDs to the appropriate user vector.
This creates two types of vectors in our collection:

Individual face vectors
User vectors, which are built based on the face vector IDs supplied using the method associate_faces

See the following code:

bucket = '<replace_with_your_bucket>'
prefix = 'images/'

# Get all the users directories from S3 containing the images
folder_list = helpers.get_subdirs(bucket, prefix)
print(f"Found users folders: {folder_list}")
print()

for user_id in folder_list:
    face_ids = []
    helpers.create_user(collection_id, user_id)
    # Get all files per user under the S3 user directory
    images = helpers.get_files(bucket, prefix + user_id + "/")
    print(f"Found images={images} for {user_id}")
    for image in images:
        face_id = helpers.add_faces_to_collection(bucket, image, collection_id)
        face_ids.append(face_id)
    helpers.associate_faces(collection_id, user_id, face_ids)
    print()

We use the following methods:

get_subdirs – Returns a list of all the users' directories. In our example, the value is [Swami, Werner].
get_files – Returns all the image files under the S3 prefix for the user.
face_ids – This is a list containing all the face IDs belonging to a user. We use this list when calling the AssociateFaces API.

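Only create_collection is reproduced in this post. For illustration, the user- and face-related helpers could be implemented along the following lines (a minimal sketch, not the actual contents of helpers.py; client is the Rekognition client created earlier, and error handling is omitted):

def create_user(collection_id, user_id):
    # Create a user entry that face vectors can later be associated with
    client.create_user(CollectionId=collection_id, UserId=user_id)

def add_faces_to_collection(bucket, key, collection_id):
    # Index the largest face in the image and return its face ID
    response = client.index_faces(
        CollectionId=collection_id,
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        MaxFaces=1
    )
    return response['FaceRecords'][0]['Face']['FaceId']

def associate_faces(collection_id, user_id, face_ids):
    # Attach the indexed face IDs to the user, which builds the user vector
    client.associate_faces(CollectionId=collection_id, UserId=user_id, FaceIds=face_ids)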
As explained earlier, you can add more users by adding folders for them (the folder name dictates the user ID) and adding their images to that folder (no ordering is required for the files).
Now that our environment is set up and we have both individual face vectors and user vectors, let’s compare our search quality against each of them. To do that, we use a new photo with multiple people and attempt to match their faces against our collection, first against the individual face vectors and then against the user vectors.
Face search of image against a collection of individual face vectors
To search against our individual face vectors, we use the Amazon Rekognition SearchFacesByImage API. This function uses a source face image to search against individual face vectors in our collection and returns faces that match our defined similarity score threshold.
An important consideration is that the SearchFacesByImage API will only operate on the largest face detected in the image. If multiple faces are present, you need to crop each individual face and pass it separately to the method for identification.
For extracting face details from an image (such as their location in the image), we use the Amazon Rekognition DetectFaces API.
The following detect_faces_in_image method detects faces in an image. For each face, it performs the following actions:

Print its bounding box location
Crop the face from the image, check whether the face exists in the collection, and print the user or 'Unknown'
Print the similarity score

The example Python code uses the Pillow library for doing the image manipulations (such as printing, drawing, and cropping).
We use a similarity score threshold of 99%, which is a common setting for identity verification use cases.
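The body of detect_faces_in_image is analogous to the detect_users_in_image function shown later in this post; the main difference is that each cropped face is passed to SearchFacesByImage instead of SearchUsersByImage. A minimal sketch of that call (imgByteArr holds the cropped face as bytes, as in the later function):

# Search the collection's individual face vectors with the cropped face
face_response = client.search_faces_by_image(
    CollectionId=collection_id,
    Image={'Bytes': imgByteArr},
    FaceMatchThreshold=threshold,
    MaxFaces=1
)
if face_response['FaceMatches']:
    match = face_response['FaceMatches'][0]
    # Each match carries the stored face ID and a similarity score
    print(match['Face']['FaceId'], match['Similarity'])
else:
    print('Unknown')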
Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key = "images/photo.jpeg"

img = detect_users.detect_faces_in_image(
    bucket,
    file_key,
    collection_id,
    threshold=99
)
img.show()  # or in Jupyter use display(img)

file_key is the S3 object key we want to match against our collection. We have supplied an example image (photo.jpeg) under the images folder.
The following image shows our results.

Using a threshold of 99%, only one person was identified. Dr. Werner Vogels was flagged as Unknown. If we run the same code using a lower threshold of 90 (set threshold=90), we get the following results.

Now we see Dr. Werner Vogels' face has a similarity score of 96.86%. Next, let's check if we can get the similarity score above our defined threshold by using user vectors.
Face search of image against a collection of user vectors
To search against our user vectors, we use the Amazon Rekognition SearchUsersByImage API. This function uses a source face image to search against user vectors in our collection and returns users that match our defined similarity score threshold.
The same consideration is relevant here – the SearchUsersByImage API will only operate on the largest face detected in the image. If there are multiple faces present, you need to crop each individual face and pass it separately to the method for identification.
For extracting face details from an image (such as their location in the image), we use the Amazon Rekognition DetectFaces API.
The following detect_users_in_image method detects faces in an image. For each face, it performs the following actions:

Print its bounding box location
Crop the face from the image, check whether the user exists in our collection, and print the user or 'Unknown'
Print the similarity score

See the following code:

import boto3
import io
import math
from PIL import Image, ImageDraw, ImageFont

def detect_users_in_image(bucket, key, collection_id, threshold=80):

    session = boto3.Session()
    client = session.client('rekognition')

    # Load image from S3 bucket
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, key)
    s3_response = s3_object.get()

    stream = io.BytesIO(s3_response['Body'].read())
    image = Image.open(stream)

    # Call DetectFaces to find faces in image
    response = client.detect_faces(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        Attributes=['ALL']
    )

    imgWidth, imgHeight = image.size
    draw = ImageDraw.Draw(image)

    # Calculate and display bounding boxes for each detected face
    for faceDetail in response['FaceDetails']:
        print('The detected face is between ' + str(faceDetail['AgeRange']['Low'])
              + ' and ' + str(faceDetail['AgeRange']['High']) + ' years old')

        box = faceDetail['BoundingBox']
        left = imgWidth * box['Left']
        top = imgHeight * box['Top']
        width = imgWidth * box['Width']
        height = imgHeight * box['Height']

        print('Left: ' + '{0:.0f}'.format(left))
        print('Top: ' + '{0:.0f}'.format(top))
        print('Face Width: ' + '{0:.0f}'.format(width))
        print('Face Height: ' + '{0:.0f}'.format(height))

        points = (
            (left, top),
            (left + width, top),
            (left + width, top + height),
            (left, top + height),
            (left, top)
        )

        # Crop the face box and convert it to byte array
        face = image.crop((left, top, left + width, top + height))
        imgByteArr = image_to_byte_array(face, image.format)

        # Search for a user in our collection using the cropped image
        user_response = client.search_users_by_image(
            CollectionId=collection_id,
            Image={'Bytes': imgByteArr},
            UserMatchThreshold=threshold
        )
        # print(user_response)

        # Extract user id and the similarity from the response
        if user_response['UserMatches']:
            similarity = user_response['UserMatches'][0]['Similarity']
            similarity = (math.trunc(similarity * 100) / 100) if isinstance(similarity, float) else similarity
            user_id = user_response['UserMatches'][0]['User']['UserId']
            print(f"User {user_id} was found, similarity of {similarity}%")
            print("")
        else:
            user_id = "Unknown"
            similarity = 0

        draw.line(points, fill='#00d400', width=4)
        font = ImageFont.load_default(size=25)
        draw.text((left, top - 30), user_id, fill='#00d400', font=font)
        if similarity > 0:
            draw.text((left, top + 1), str(similarity), fill='#00d400', font=font)

    return image

The function returns a modified image with the results that can be saved to Amazon S3 or printed. The function also outputs statistics about the estimated ages of the faces to the terminal.
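Note that image_to_byte_array is a small helper that is not reproduced in this post; a minimal sketch of what it might look like (it simply serializes the cropped Pillow image to bytes so it can be sent in the API request):

import io

def image_to_byte_array(image, fmt):
    # Serialize a Pillow image to raw bytes, defaulting to JPEG if no format is known
    buffer = io.BytesIO()
    image.save(buffer, format=fmt or 'JPEG')
    return buffer.getvalue()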
Run the following code:

import detect_users
from PIL import Image

# The image we would like to match faces against our collection.
file_key = "images/photo.jpeg"

img = detect_users.detect_users_in_image(
    bucket,
    file_key,
    collection_id,
    threshold=99
)
img.show()  # or in Jupyter use display(img)

The following image shows our results.

The users that exist in our collection were identified correctly with high similarity (over 99%).
We were able to increase the similarity score by using three face vectors per user vector. As we increase the number of face vectors used, we expect the similarity score for true matches to also increase. You can use up to 100 face vectors per user vector.
End-to-end example code can be found in the GitHub repository. It includes a detailed Jupyter notebook that you can run on Amazon SageMaker Studio (or other alternatives).
Clean up
To delete the collection, use the following code:

helpers.delete_collection(collection_id)

Conclusion
In this post, we presented how to use Amazon Rekognition user vectors to implement face search against a collection of users’ faces. We demonstrated how to improve face search accuracy by using multiple face images per user and compared it against individual face vectors. Additionally, we described how you can use the different Amazon Rekognition APIs to detect faces. The provided example code serves as a solid foundation for constructing a functional face search system.
For more information about Amazon Rekognition user vectors, refer to Searching faces in a collection. If you’re new to Amazon Rekognition, you can use our Free Tier, which lasts 12 months and includes processing 5,000 images per month and storing 1,000 user vector objects per month.

About the Authors
Arik Porat is a Senior Startups Solutions Architect at Amazon Web Services. He works with startups to help them build and design their solutions in the cloud, and is passionate about machine learning and container-based solutions. In his spare time, Arik likes to play chess and video games.
Eliran Efron is a Startups Solutions Architect at Amazon Web Services. Eliran is a data and compute enthusiast, assisting startups designing their system architectures. In his spare time, Eliran likes to build and race cars in Touring races and build IoT devices.

Top 15 AI Libraries/Frameworks for Automatically Red-Teaming Your Generative AI Application

Prompt Fuzzer: The Prompt Fuzzer is an interactive tool designed to evaluate the security of GenAI application system prompts by simulating various dynamic LLM-based attacks. It assesses security by analyzing the results of these simulations, helping users fortify their system prompts accordingly. This tool specifically customizes its tests to fit the unique configuration and domain of the user’s application. The Fuzzer also features a Playground chat interface, allowing users to refine their system prompts iteratively, enhancing their resilience against a broad range of generative AI attacks. Users should be aware that using the Prompt Fuzzer will consume tokens.

Garak: Garak is a tool that evaluates whether an LLM can be made to fail in undesirable ways. It tests for vulnerabilities, including hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and other potential weaknesses. Analogous to nmap for network security, Garak is a diagnostic tool for LLMs. It is freely available, and its developers are passionate about continuously enhancing it to support applications better.

HouYi: This repository contains the source code for HouYi, a framework designed to automatically inject prompts into applications integrated with large language models (LLMs) to test their vulnerability. Additionally, the repository includes a demo script that simulates an LLM-integrated application and shows how to deploy HouYi for such attacks. Users can apply HouYi to real-world LLM-integrated applications by creating their harnesses and defining the attack intention.

JailbreakingLLMs: There is an increasing focus on aligning LLMs with human values, yet these models are susceptible to adversarial jailbreaks that bypass their safety mechanisms. The Prompt Automatic Iterative Refinement (PAIR) algorithm has been developed to address this. Inspired by social engineering tactics, PAIR uses one LLM to automatically generate jailbreak prompts for another target LLM without human help. PAIR can efficiently create a jailbreak by making iterative queries, often in fewer than twenty attempts. This method demonstrates high success rates and is effective on various LLMs, including GPT-3.5/4, Vicuna, and PaLM-2.

LLMAttacks: Recent efforts have aimed to align LLMs to prevent them from generating objectionable content. LLMAttacks provides a method that effectively prompts these models to produce undesirable outputs. By automatically generating adversarial suffixes through greedy and gradient-based searches, the process bypasses the need for manual crafting. These suffixes have proven transferable across multiple LLMs, including ChatGPT, Bard, and Claude, as well as open-source models like LLaMA-2-Chat and Pythia. This advancement highlights significant vulnerabilities in LLMs, underscoring the need for strategies to counteract such adversarial tactics.

PromptInject: Transformer-based LLMs like GPT-3 are extensively used in customer-facing applications but remain vulnerable to malicious interactions. The study introduces PROMPTINJECT, a framework for creating adversarial prompts through a mask-based iterative process. This research highlights how GPT-3 can be misaligned using straightforward, handcrafted inputs. It focuses on two attack methods: goal hijacking and prompt leaking. Findings reveal that even attackers with low skill levels can exploit the stochastic nature of GPT-3, posing significant long-tail risks to these models.

The Recon-ng Framework: Recon-ng is a comprehensive reconnaissance framework tailored for efficient, web-based, open-source intelligence gathering. It features a user interface similar to the Metasploit Framework, which eases the learning process but serves a different purpose. Unlike other frameworks aimed at exploitation or social engineering, Recon-ng is specifically designed for reconnaissance. Those looking to conduct exploits should use Metasploit, and the Social-Engineer Toolkit is recommended for social engineering. Recon-ng supports a modular architecture, making it accessible for Python developers to contribute. Users can refer to the Wiki and the Development Guide for starting points and details.

Buster: Buster is a sophisticated OSINT tool that facilitates a range of online investigations. It can retrieve social accounts linked to an email from various platforms such as Gravatar, About.me, Myspace, Skype, GitHub, LinkedIn, and from records of previous breaches. Buster also finds links to mentions of the email across Google, Twitter, dark web search engines, and paste sites. Additionally, it can identify breaches associated with an email, reveal domains registered to an email via reverse WHOIS, generate potential emails and usernames for an individual, locate emails tied to social media accounts or usernames, and uncover a person's work email.

WitnessMe: WitnessMe is a web inventory tool inspired by Eyewitness and designed for extensibility, enabling custom functions using its backend-driven headless browser via the Pyppeteer library. This tool stands out for its ease of use with Python 3.7+, Docker compatibility, and avoidance of installation dependencies. It supports extensive parsing of large Nessus and NMap XML files, offers CSV and HTML reporting, and features HTTP proxy support and a RESTful API for remote operations. WitnessMe includes a CLI for reviewing scan results and is optimized for deployment to cloud platforms like GCP Cloud Run and AWS ElasticBeanstalk. Additionally, it offers signature scanning and terminal-based screenshot previews.

LLM Canary: The LLM Canary tool is an accessible, open-source security benchmarking suite that enables developers to test, assess, and compare LLMs. This tool helps developers identify security trade-offs when choosing a model and address vulnerabilities before integration. It incorporates test groups aligned with the OWASP Top 10 for LLMs and stays updated with the latest threats. Users of LLM Canary can identify and evaluate potential vulnerabilities, run simultaneous tests on multiple LLMs for efficiency, compare results against benchmarks or previous tests, and design custom tests for comprehensive security evaluation.

PyRIT: PyRIT, developed by the AI Red Team, is a library designed to enhance the robustness evaluation of LLM endpoints, targeting harm categories such as fabrication, misuse, and prohibited content. This tool automates AI red teaming tasks, freeing up resources to handle more complex issues, and identifies security and privacy harms, including malware generation and identity theft. It provides a benchmark for researchers to compare current model performance against future iterations, helping detect any degradation. At Microsoft, PyRIT is used to refine product versions and meta prompts to better safeguard against prompt injection attacks.

LLMFuzzer: LLMFuzzer is an innovative open-source fuzzing framework tailored for LLMs and their API integrations. It's ideal for security enthusiasts, pen-testers, and cyber security researchers who aim to uncover and exploit vulnerabilities in AI systems. The tool streamlines the testing process with features like robust fuzzing, LLM API integration testing, various fuzzing strategies, and a modular design for easy expansion. Future enhancements include additional attacks, HTML report generation, diverse connectors and comparers, proxy support, side LLM observation, and an autonomous attack mode.

PromptMap: Prompt injection is a security vulnerability where malicious prompts manipulate a ChatGPT instance to perform unintended actions. The tool "promptmap" automates the testing of these attacks by analyzing the context and purpose of your ChatGPT rules. Using your system prompts, it crafts tailored attack prompts and tests them on a ChatGPT instance. promptmap then evaluates the success of the prompt injection by analyzing the responses from your ChatGPT instance. This tool helps identify and mitigate potential vulnerabilities by simulating real attack scenarios.

Gitleaks: Gitleaks is a Static Application Security Testing (SAST) tool designed to detect hardcoded secrets such as passwords, API keys, and tokens in git repositories. It offers a straightforward interface for scanning your code for historical and current secrets. Users can easily run Gitleaks locally with a simple command, and it identifies sensitive information, providing details like file location and author. Gitleaks can be installed via Homebrew, Docker, or Go, and binaries are available for various platforms. It also supports integration as a pre-commit hook or a GitHub action to enhance security practices.

Cloud_enum: The multi-cloud OSINT tool is designed to identify public resources across AWS, Azure, and Google Cloud. For Amazon Web Services, it can enumerate open or protected S3 buckets and various awsapps like WorkMail and WorkDocs. In Microsoft Azure, the tool can discover storage accounts, open blob storage containers, hosted databases, virtual machines, and web apps. The Google Cloud Platform detects open or protected GCP and Firebase buckets, Firebase Realtime Databases, Google App Engine sites, and Cloud Functions, including the enumeration of projects and regions and the brute-forcing of function names. It also identifies open Firebase apps.