Stanford Researchers Developed POPPER: An Agentic AI Framework that Automates Hypothesis Validation with Rigorous Statistical Control, Reducing Errors and Accelerating Scientific Discovery by 10x

Hypothesis validation is fundamental in scientific discovery, decision-making, and information acquisition. Whether in biology, economics, or policymaking, researchers rely on testing hypotheses to guide their conclusions. Traditionally, this process involves designing experiments, collecting data, and analyzing results to determine the validity of a hypothesis. However, the volume of generated hypotheses has increased dramatically with the advent of LLMs. While these AI-driven hypotheses offer novel insights, their plausibility varies widely, making manual validation impractical. Automating hypothesis validation has therefore become essential to ensuring that only scientifically rigorous hypotheses guide future research.

The main challenge in hypothesis validation is that many real-world hypotheses are abstract and not directly measurable. For instance, stating that a specific gene causes a disease is too broad and needs to be translated into testable implications. The rise of LLMs has exacerbated this issue, as these models generate hypotheses at an unprecedented scale, many of which may be inaccurate or misleading. Existing validation methods struggle to keep pace, making it difficult to determine which hypotheses are worth further investigation. Also, statistical rigor is often compromised, leading to false verifications that can misdirect research and policy efforts.

Traditional methods of hypothesis validation include statistical testing frameworks such as p-value-based hypothesis testing and Fisher’s combined test. However, these approaches rely on human intervention to design falsification experiments and interpret results. Some automated approaches exist, but they often lack mechanisms for controlling Type-I errors (false positives) and ensuring that conclusions are statistically reliable. Many AI-driven validation tools do not systematically challenge hypotheses through rigorous falsification, increasing the risk of misleading findings. As a result, a scalable and statistically sound solution is needed to automate the hypothesis validation process effectively.

Researchers from Stanford University and Harvard University introduced POPPER, an agentic framework that automates the process of hypothesis validation by integrating rigorous statistical principles with LLM-based agents. The framework systematically applies Karl Popper’s principle of falsification, which emphasizes disproving rather than proving hypotheses. POPPER employs two specialized AI-driven agents: 

The Experiment Design Agent, which formulates falsification experiments

The Experiment Execution Agent, which implements them

Each hypothesis is divided into specific, testable sub-hypotheses and subjected to falsification experiments. POPPER ensures that only well-supported hypotheses are advanced by continuously refining the validation process and aggregating evidence. Unlike traditional methods, POPPER dynamically adapts its approach based on prior results, significantly improving efficiency while maintaining statistical integrity.

POPPER functions through an iterative process in which falsification experiments sequentially test hypotheses. The Experiment Design Agent generates these experiments by identifying the measurable implications of a given hypothesis. The Experiment Execution Agent then carries out the proposed experiments using statistical methods, simulations, and real-world data collection. Key to POPPER’s methodology is its ability to strictly control Type-I error rates, ensuring that false positives are minimized. Unlike conventional approaches that treat p-values in isolation, POPPER introduces a sequential testing framework in which individual p-values are converted into e-values, a statistical measure allowing continuous evidence accumulation while maintaining error control. This adaptive approach enables the system to refine its hypotheses dynamically, reducing the chances of reaching incorrect conclusions. The framework’s flexibility allows it to work with existing datasets, conduct new simulations, or interact with live data sources, making it highly versatile across disciplines.
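To make the e-value idea concrete, here is a minimal sketch of sequential e-value aggregation in Python. It uses a generic power calibrator and a 1/α rejection threshold, which are standard constructions but not necessarily the exact ones POPPER implements.

# Minimal sketch of sequential falsification with e-values (illustrative; not POPPER's
# exact calibrator or stopping rule). Each falsification experiment yields a p-value,
# which is converted to an e-value with the power calibrator e(p) = kappa * p**(kappa - 1),
# valid for any kappa in (0, 1). The running product of independent e-values is itself
# an e-value, so rejecting once it exceeds 1/alpha keeps the Type-I error rate at or below alpha.

def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Convert a p-value into an e-value using a power calibrator."""
    return kappa * p ** (kappa - 1)

def sequential_falsification(p_values, alpha: float = 0.1, kappa: float = 0.5) -> bool:
    """Accumulate e-values across experiments; return True once the evidence threshold is met."""
    e_process = 1.0
    for p in p_values:
        e_process *= p_to_e(p, kappa)
        if e_process >= 1.0 / alpha:
            return True   # enough accumulated evidence to reject the null at level alpha
    return False          # evidence insufficient so far

# Example: p-values from four sequential falsification experiments
print(sequential_falsification([0.04, 0.02, 0.11, 0.01], alpha=0.1))  # True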

POPPER was evaluated across six domains, including biology, sociology, and economics. The system was tested against 86 validated hypotheses, with results showing Type-I error rates below 0.10 across all datasets. POPPER demonstrated significant improvements in statistical power compared to existing validation methods, outperforming standard techniques such as Fisher’s combined test and likelihood ratio models. In one study focusing on biological hypotheses related to Interleukin-2 (IL-2), POPPER’s iterative testing mechanism improved validation power by 3.17 times compared to alternative methods. Also, an expert evaluation involving nine PhD-level computational biologists and biostatisticians found that POPPER’s hypothesis validation accuracy was comparable to that of human researchers but was completed in one-tenth the time. By leveraging its adaptive testing framework, POPPER reduced the time required for complex hypothesis validation by a factor of 10, making it significantly more scalable and efficient.

Several key takeaways from the research include:

POPPER provides a scalable, AI-driven solution that automates the falsification of hypotheses, reducing manual workload and improving efficiency.

The framework maintains strict Type-I error control, keeping the false-positive rate below 0.10, which is critical for scientific integrity.

Compared to human researchers, POPPER completes hypothesis validation 10 times faster, significantly improving the speed of scientific discovery.

Unlike traditional p-value testing, the e-value framework allows experimental evidence to accumulate while hypothesis validation is dynamically refined.

Tested across six scientific fields, including biology, sociology, and economics, demonstrating broad applicability.

Evaluated by nine PhD-level scientists, POPPER’s accuracy matched human performance while dramatically reducing time spent on validation.

Improved statistical power by 3.17 times over traditional hypothesis validation methods, ensuring more reliable conclusions.

POPPER integrates Large Language Models to dynamically generate and refine falsification experiments, making it adaptable to evolving research needs.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


xAI Releases Grok 3 Beta: A Super Advanced AI Model Blending Strong Reasoning with Extensive Pretraining Knowledge

Modern AI systems have made significant strides, yet many still struggle with complex reasoning tasks. Issues such as inconsistent problem-solving, limited chain-of-thought capabilities, and occasional factual inaccuracies remain. These challenges hinder practical applications in research and software development, where nuanced understanding and precision are crucial. The drive to overcome these limitations has prompted a reexamination of how AI models are built and trained, with a focus on improving transparency and reliability.

xAI’s recent release of the Grok 3 Beta marks a thoughtful step forward in AI development. In their announcement, the company outlines how this new model builds on its predecessors with a refined approach to reasoning and problem-solving. Grok 3 is trained on the company’s Colossus supercluster using substantially more compute than previous iterations. This enhanced training has yielded improvements in areas such as mathematics, coding, and instruction-following, while also enabling the model to consider multiple solution paths before arriving at a final answer.

Rather than relying on oversold promises, the release emphasizes that Grok 3—and its streamlined variant, Grok 3 mini—are still evolving. Early access is designed to encourage user feedback, which will help guide further improvements. The model’s ability to reveal its reasoning process through a “Think” button invites users to engage directly with its problem-solving steps, promoting a level of transparency that is often absent in traditional AI outputs.

Technical Details and Practical Benefits

At its core, Grok 3 leverages a reinforcement learning framework to enhance its chain-of-thought process. This approach allows the model to simulate a form of internal reasoning, iterating over possible solutions and correcting errors along the way. Users can observe this process, which is particularly valuable in tasks where a clear rationale is as important as the final answer. The integration of this reasoning mode sets Grok 3 apart from many earlier models that simply generate responses without an explainable thought process.

Technically, Grok 3’s architecture benefits from an expanded context window, now capable of handling up to one million tokens. This makes it better suited for processing lengthy documents and managing intricate instructions. Benchmark tests indicate notable improvements in various areas, including competition math challenges, advanced reasoning tasks, and code generation. For example, the model achieved a 93.3% accuracy rate on a recent mathematics competition when utilizing its highest level of test-time compute. These technical enhancements translate into practical benefits: clearer, more reliable responses that can support both academic and professional applications without unnecessary embellishment.

Data Insights and Comparative Analysis

The model’s performance in various benchmarks, such as those assessing reasoning and code generation, demonstrates that it can effectively handle complex tasks. Although some skepticism remains within the community, the empirical results suggest that Grok 3 is a robust addition to the AI landscape.

Comparative analysis with other leading models highlights that while many systems continue to be popular choices, Grok 3’s combination of enhanced reasoning and a larger context window provides a distinct advantage in addressing more involved queries. Furthermore, the introduction of the Grok 3 mini variant broadens the range of applications by offering a more cost-efficient option for tasks that do not require as extensive world knowledge. This data underscores the importance of continued innovation in AI, driven by rigorous testing and real-world performance rather than speculative promises.

Conclusion

Grok 3 represents a thoughtful evolution in the quest for more reliable and transparent AI reasoning. By focusing on improved problem-solving through reinforcement learning and offering users a window into its internal thought processes, the model addresses several longstanding challenges. Its performance across a range of benchmarks—spanning from competition math to advanced code generation—demonstrates that a balanced, methodical approach to AI development can yield meaningful improvements.

For researchers and developers, Grok 3 offers not only enhanced technical capabilities but also a practical tool for exploring complex ideas with greater clarity. The model’s design reflects a measured progression in AI, one that values incremental improvements and user engagement over hyperbolic claims. As xAI continues to refine Grok 3 based on real-world feedback, the technology stands to play a significant role in both academic research and practical applications in software development.

Check out the Technical details. All credit for this research goes to the researchers of this project.


Google DeepMind Releases PaliGemma 2 Mix: New Instruction Vision Language Models Fine-Tuned on a Mix of Vision Language Tasks

Vision‐language models (VLMs) have long promised to bridge the gap between image understanding and natural language processing. Yet, practical challenges persist. Traditional VLMs often struggle with variability in image resolution, contextual nuance, and the sheer complexity of converting visual data into accurate textual descriptions. For instance, models may generate concise captions for simple images but falter when asked to describe complex scenes, read text from images, or even detect multiple objects with spatial precision. These shortcomings have historically limited VLM adoption in applications such as optical character recognition (OCR), document understanding, and detailed image captioning. Google’s new release aims to tackle these issues head on—by providing a flexible, multi-task approach that enhances fine-tuning capability and improves performance across a range of vision-language tasks. This is especially vital for industries that depend on precise image-to-text translation, like autonomous vehicles, medical imaging, and multimedia content analysis.

Google DeepMind has just unveiled a new set of PaliGemma 2 checkpoints that are tailor-made for use in applications such as OCR, image captioning, and beyond. These checkpoints come in a variety of sizes—from 3B to a massive 28B parameters—and are offered as open-weight models. One of the most striking features is that these models are fully integrated with the Transformers ecosystem, making them immediately accessible via popular libraries. Whether you are using the HF Transformers API for inference or adapting the model for further fine-tuning, the new checkpoints promise a streamlined workflow for developers and researchers alike. By offering multiple parameter scales and supporting a range of image resolutions (224×224, 448×448, and even 896×896), Google has ensured that practitioners can select the precise balance between computational efficiency and model accuracy needed for their specific tasks.

Technical Details and Benefits

At its core, PaliGemma 2 Mix builds upon the pre-trained PaliGemma 2 models, which themselves integrate the powerful SigLIP image encoder with the advanced Gemma 2 text decoder. The “Mix” models are a fine-tuned variant designed to perform robustly across a mix of vision-language tasks. They utilize open-ended prompt formats—such as “caption {lang}”, “describe {lang}”, “ocr”, and more—thereby offering enhanced flexibility. This fine-tuning approach not only improves task-specific performance but also provides a baseline that signals the model’s potential when adapted to downstream tasks.

The architecture supports both HF Transformers and JAX frameworks, meaning that users can run the models in different precision formats (e.g., bfloat16, 4-bit quantization with bitsandbytes) to suit various hardware configurations. This multi-resolution capability is a significant technical benefit, allowing the same base model to excel at coarse tasks (like simple captioning) and fine-grained tasks (such as detecting minute details in OCR) simply by adjusting the input resolution. Moreover, the open-weight nature of these checkpoints enables seamless integration into research pipelines and facilitates rapid iteration without the overhead of proprietary restrictions.
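As a quick illustration of that workflow, the following hedged sketch runs one of the Mix checkpoints through the Hugging Face Transformers API in bfloat16 for an OCR-style prompt. The checkpoint name, image file, and prompt prefix are assumptions to verify against the official model cards.

# Hedged sketch: OCR with a PaliGemma 2 Mix checkpoint via Hugging Face Transformers.
# The model ID, image path, and prompt prefix are assumptions based on the release notes;
# check the model card for the exact identifiers and prompt conventions.
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-448"  # assumed checkpoint name (3B parameters, 448x448)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = PaliGemmaProcessor.from_pretrained(model_id)

image = Image.open("ticket.png")   # any local image containing printed text
prompt = "<image>ocr"              # task prefix; "caption en" or "describe en" also work

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated text
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))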

Performance Insights and Benchmark Results

Early benchmarks of the PaliGemma 2 Mix models are promising. In tests spanning general vision-language tasks, document understanding, localization tasks, and text recognition, the model variants show consistent performance improvements over their predecessors. For instance, when tasked with detailed image description, both the 3B and 10B checkpoints produced accurate and nuanced captions—correctly identifying objects and spatial relations in complex urban scenes.

In OCR tasks, the fine-tuned models demonstrated robust text extraction capabilities by accurately reading dates, prices, and other details from challenging ticket images. Moreover, for localization tasks involving object detection and segmentation, the model outputs include precise bounding box coordinates and segmentation masks. These outputs have been evaluated on standard benchmarks with metrics such as CIDEr scores for captioning and Intersection over Union (IoU) for segmentation. The results underscore the model’s ability to scale with increased parameter count and resolution: larger checkpoints generally yield higher performance, though at the cost of increased computational resource requirements. This scalability, combined with excellent performance in both quantitative benchmarks and qualitative real-world examples, positions PaliGemma 2 Mix as a versatile tool for a wide array of applications.

Conclusion

Google’s release of the PaliGemma 2 Mix checkpoints marks a significant milestone in the evolution of vision-language models. By addressing long-standing challenges—such as resolution sensitivity, context-rich captioning, and multi-task adaptability—these models empower developers to deploy AI solutions that are both flexible and highly performant. Whether for OCR, detailed image description, or object detection, the open-weight, transformer-compatible nature of PaliGemma 2 Mix provides an accessible platform that can be seamlessly integrated into various applications. As the AI community continues to push the boundaries of multimodal processing, tools like these will be critical in bridging the gap between raw visual data and meaningful language interpretation.

Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project.


Generate synthetic counterparty (CR) risk data with generative AI usin …

Data is the lifeblood of modern applications, driving everything from application testing to machine learning (ML) model training and evaluation. As data demands continue to surge, the emergence of generative AI models presents an innovative solution. These large language models (LLMs), trained on expansive data corpora, possess the remarkable capability to generate new content across multiple media formats—text, audio, and video—and across various business domains, based on provided prompts and inputs.
In this post, we explore how you can use these LLMs with advanced Retrieval Augmented Generation (RAG) to generate high-quality synthetic data for a finance domain use case. You can use the same technique to generate synthetic data for other business domain use cases as well. For this post, we demonstrate how to generate counterparty risk (CR) data, which would be beneficial for over-the-counter (OTC) derivatives that are traded directly between two parties, without going through a formal exchange.
Solution overview
OTC derivatives are typically customized contracts between counterparties and include a variety of financial instruments, such as forwards, options, swaps, and other structured products. A counterparty is the other party involved in a financial transaction. In the context of OTC derivatives, the counterparty refers to the entity (such as a bank, financial institution, corporation, or individual) with whom a derivative contract is made.
For example, in an OTC swap or option contract, one entity agrees to terms with another party, and each entity becomes the counterparty to the other. The responsibilities, obligations, and risks (such as credit risk) are shared between these two entities according to the contract.
As financial institutions continue to navigate the complex landscape of CR, the need for accurate and reliable risk assessment models has become paramount. For our use case, ABC Bank, a fictional financial services organization, has taken on the challenge of developing an ML model to assess the risk of a given counterparty based on their exposure to OTC derivative data.
Building such a model presents numerous challenges. Although ABC Bank has gathered a large dataset from various sources and in different formats, the data may be biased, skewed, or lack the diversity needed to train a highly accurate model. The primary challenge lies in collecting and preprocessing the data to make it suitable for training an ML model. Deploying a poorly suited model could result in misinformed decisions and significant financial losses.
We propose a generative AI solution that uses the RAG approach. RAG is a widely used approach that enhances LLMs by supplying extra information from external data sources not included in their original training. The entire solution can be broadly divided into three steps: indexing, data generation, and validation.
Data indexing
In the indexing step, we parse, chunk, and convert the representative CR data into vector format using the Amazon Titan Text Embeddings V2 model and store this information in a Chroma vector database. Chroma is an open source vector database known for its ease of use, efficient similarity search, and support for multimodal data and metadata. It offers both in-memory and persistent storage options, integrates well with popular ML frameworks, and is suitable for a wide range of AI applications. It is particularly beneficial for smaller to medium-sized datasets and projects requiring local deployment or low resource usage. The following diagram illustrates this architecture.

Here are the steps for data indexing:

The sample CR data is segmented into smaller, manageable chunks to optimize it for embedding generation.
These segmented data chunks are then passed to a method responsible for both generating embeddings and storing them efficiently.
The Amazon Titan Text Embeddings V2 API is called upon to generate high-quality embeddings from the prepared data chunks.
The resulting embeddings are then stored in the Chroma vector database, providing efficient retrieval and similarity searches for future use.

Data generation
When the user requests data for a certain scenario, the request is converted into vector format and then looked up in the Chroma database to find matches with the stored data. The retrieved data is augmented with the user request and additional prompts to Anthropic’s Claude Haiku on Amazon Bedrock. Anthropic’s Claude Haiku was chosen primarily for its speed, processing over 21,000 tokens per second, which significantly outpaces its peers. Moreover, Anthropic’s Claude Haiku’s efficiency in data generation is remarkable, with a 1:5 input-to-output token ratio. This means it can generate a large volume of data from a relatively small amount of input or context. This capability not only enhances the model’s effectiveness, but also makes it cost-efficient for our application, where we need to generate numerous data samples from a limited set of examples. Anthropic’s Claude Haiku LLM is invoked iteratively to efficiently manage token consumption and help prevent reaching the maximum token limit. The following diagram illustrates this workflow.

Here are the steps for data generation:

The user initiates a request to generate new synthetic counterparty risk data based on specific criteria.
The Amazon Titan Text Embeddings V2 LLM is employed to create embeddings for the user’s request prompts, transforming them into a machine-interpretable format.
These newly generated embeddings are then forwarded to a specialized module designed to identify matching stored data.
The Chroma vector database, which houses previously stored embeddings, is queried to find data that closely matches the user’s request.
The identified matching data and the original user prompts are then passed to a module responsible for generating new synthetic data.
Anthropic’s Claude 3 Haiku model is invoked, using both the matching embeddings and user prompts as input to create high-quality synthetic data.
The generated synthetic data is then parsed and formatted into a .csv file using the Pydantic library, providing a structured and validated output.
To confirm the quality of the generated data, several statistical methods are applied, including quantile-quantile (Q-Q) plots and correlation heat maps of key attributes, providing a comprehensive validation process.

Data validation
When validating the synthetic CR data generated by the LLM, we employed Q-Q plots and correlation heat maps focusing on key attributes such as cp_exposure, cp_replacement_cost, and cp_settlement_risk. These statistical tools serve crucial roles in promoting the quality and representativeness of the synthetic data. By using the Q-Q plots, we can assess whether these attributes follow a normal distribution, which is often expected for many financial variables. By comparing the quantiles of our synthetic data against theoretical normal distributions, we can identify significant deviations that might indicate bias or unrealistic data generation.
Simultaneously, the correlation heat maps provide a visual representation of the relationships between these attributes and others in the dataset. This is particularly important because it helps verify that the LLM has maintained the complex interdependencies typically observed in real CR data. For instance, we would expect certain correlations between exposure and replacement cost, or between replacement cost and settlement risk. By making sure these correlations are preserved in our synthetic data, we can be more confident that analyses or models built on this data will yield insights that are applicable to real-world scenarios. This rigorous validation process helps to mitigate the risk of introducing artificial patterns or biases, thereby enhancing the reliability and utility of our synthetic CR dataset for subsequent research or modeling tasks.
We’ve created a Jupyter notebook containing three parts to implement the key components of the solution. We provide code snippets from the notebooks for better understanding.
Prerequisites
To set up the solution and generate test data, you should have the following prerequisites:

Python 3 must be installed on your machine
We recommend using an integrated development environment (IDE) that can run Jupyter notebooks
You can also create a Jupyter notebook instance using Amazon SageMaker from the AWS console and develop the code there
You need to have an AWS account with access to Amazon Bedrock and the following LLMs enabled (be careful not to share the AWS account credentials):

Amazon Titan Text Embeddings V2
Anthropic’s Claude 3 Haiku

Setup
Here are the steps to set up the environment:
import sys
!{sys.executable} -m pip install -r requirements.txt
The content of the requirements.txt is given here.
boto3
langchain
langchain-community
streamlit
chromadb==0.4.15
numpy
jq
langchain-aws
seaborn
matplotlib
scipy
The following code snippet performs all the necessary imports for the snippets in this post:
import json
import time
import boto3
import chromadb
from pprint import pprint
from uuid import uuid4
from botocore.config import Config
from pydantic import BaseModel
from langchain_aws import ChatBedrock
from langchain_community.document_loaders import JSONLoader
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
Index data in the Chroma database
In this section, we show how indexing of data is done in a Chroma database as a locally maintained open source vector store. This index data is used as context for generating data.
The following code snippet shows the preprocessing steps of loading the JSON data from a file and splitting it into smaller chunks:
def load_using_jsonloaer(path):
    loader = JSONLoader(path,
                        jq_schema=".[]",
                        text_content=False)
    documents = loader.load()
    return documents


def split_documents(documents):
    doc_list = [item for item in documents]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=0)
    texts = text_splitter.split_documents(doc_list)
    return texts
The following snippet shows how an Amazon Bedrock embedding instance is created. We used the Amazon Titan Embeddings V2 model:
def get_bedrock_embeddings():
    aws_region = "us-east-1"
    model_id = "amazon.titan-embed-text-v2:0"  # look for latest version of model
    bedrock_embeddings = BedrockEmbeddings(model_id=model_id, region_name=aws_region)
    return bedrock_embeddings
The following code shows how the embeddings are created and then loaded in the Chroma database:
persistent_client = chromadb.PersistentClient(path="../data/chroma_index")
collection = persistent_client.get_or_create_collection("test_124")
print(collection)
# query the database
vector_store_with_persistent_client = Chroma(collection_name="test_124",
                                             persist_directory="../data/chroma_index",
                                             embedding_function=get_bedrock_embeddings(),
                                             client=persistent_client)
load_json_and_index(vector_store_with_persistent_client)
Generate data
The following code snippet shows the configuration used during the LLM invocation using Amazon Bedrock APIs. The LLM used is Anthropic’s Claude 3 Haiku:
config = Config(
    region_name='us-east-1',
    signature_version='v4',
    retries={
        'max_attempts': 2,
        'mode': 'standard'
    }
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
model_id = "anthropic.claude-3-haiku-20240307-v1:0"  # look for latest version of model
model_kwrgs = {
    "temperature": 0,
    "max_tokens": 8000,
    "top_p": 1.0,
    "top_k": 25,
    "stop_sequences": ["company-1000"],
}
# Initialize the language model
llm = ChatBedrock(
    model_id=model_id,
    model_kwargs=model_kwrgs,
    client=bedrock_runtime,
)
The following code shows how the context is fetched by looking up the Chroma database (where data was indexed) for matching embeddings. We use the same Amazon Titan model to generate the embeddings:
def get_context(scenario):
    region_name = 'us-east-1'
    credential_profile_name = "default"
    titan_model_id = "amazon.titan-embed-text-v2:0"
    kb_context = []
    be = BedrockEmbeddings(region_name=region_name,
                           credentials_profile_name=credential_profile_name,
                           model_id=titan_model_id)

    vector_store = Chroma(collection_name="test_124", persist_directory="../data/chroma_index",
                          embedding_function=be)
    search_results = vector_store.similarity_search(scenario, k=3)
    for doc in search_results:
        kb_context.append(doc.page_content)
    return json.dumps(kb_context)
The following snippet shows how we formulated the detailed prompt that was passed to the LLM. We provided examples for the context, scenario, start index, end index, records count, and other parameters. The prompt is subjective and can be adjusted for experimentation.
# Create a prompt template
prompt_template = ChatPromptTemplate.from_template(
    "You are a financial data expert tasked with generating records "
    "representing company OTC derivative data and "
    "should be good enough for investor and lending ML model to take decisions "
    "and data should accurately represent the scenario: {scenario} \n"
    "and as per examples given in context: "
    "and context is {context} "
    "the examples given in context is for reference only, do not use same values while generating dataset. "
    "generate dataset with the diverse set of samples but record should be able to represent the given scenario accurately. "
    "Please ensure that the generated data meets the following criteria: "
    "The data should be diverse and realistic, reflecting various industries, "
    "company sizes, financial metrics. "
    "Ensure that the generated data follows logical relationships and correlations between features "
    "(e.g., higher revenue typically corresponds to more employees, "
    "better credit ratings, and lower risk). "
    "And Generate {count} records starting from index {start_index}. "
    "generate just JSON as per schema and do not include any text or message before or after JSON. "
    "{format_instruction} \n"
    "If continuing, start after this record: {last_record}\n"
    "If stopping, do not include this record in the output. "
    "Please ensure that the generated data is well-formatted and consistent."
)
The following code snippet shows the process for generating the synthetic data. You can call this method in an iterative manner to generate more records. The input parameters include scenario, context, count, start_index, and last_record. The response data is formatted into CSV using the instructions provided by output_parser.get_format_instructions():

def generate_records(start_index, count, scenario, context, last_record=""):
    try:
        response = chain.invoke({
            "count": count,
            "start_index": start_index,
            "scenario": scenario,
            "context": context,
            "last_record": last_record,
            "format_instruction": output_parser.get_format_instructions(),
            "data_set_class_schema": DataSet.schema_json()
        })

        return response
    except Exception as e:
        print(f"Error in generate_records: {e}")
        raise e
Parsing the output generated by the LLM and representing it in CSV was quite challenging. We used a Pydantic parser to parse the JSON output generated by the LLM, as shown in the following code snippet:
class CustomPydanticOutputParser(PydanticOutputParser):
    def parse(self, text: str) -> BaseModel:
        # Extract JSON from the text
        try:
            # Find the first occurrence of '{'
            start = text.index('{')
            # Find the last occurrence of '}'
            end = text.rindex('}') + 1
            json_str = text[start:end]

            # Parse the JSON string
            parsed_json = json.loads(json_str)

            # Use the parent class to convert to Pydantic object
            return super().parse_with_cls(parsed_json)
        except (ValueError, json.JSONDecodeError) as e:
            raise ValueError(f"Failed to parse output: {e}")
The following code snippet shows how the records are generated in an iterative manner with 10 records in each invocation to the LLM:
def generate_full_dataset(total_records, batch_size, scenario, context):
dataset = []
total_generated = 0
last_record = “”
batch: DataSet = generate_records(total_generated,
min(batch_size, total_records – total_generated),
scenario, context, last_record)
# print(f”batch: {type(batch)}”)
total_generated = len(batch.records)
dataset.extend(batch.records)
while total_generated < total_records:
try:
batch = generate_records(total_generated,
min(batch_size, total_records – total_generated),
scenario, context, batch.records[-1].json())
processed_batch = batch.records

if processed_batch:
dataset.extend(processed_batch)
total_generated += len(processed_batch)
last_record = processed_batch[-1].start_index
print(f”Generated {total_generated} records.”)
else:
print(“Generated an empty or invalid batch. Retrying…”)
time.sleep(10)
except Exception as e:
print(f”Error occurred: {e}. Retrying…”)
time.sleep(5)

return dataset[:total_records] # Ensure exactly the requested number of records
Verify the statistical properties of the generated data
We generated Q-Q plots for key attributes of the generated data: cp_exposure, cp_replacement_cost, and cp_settlement_risk, as shown in the following screenshots. The Q-Q plots compare the quantiles of the data distribution with the quantiles of a normal distribution. If the data isn’t skewed, the points should approximately follow the diagonal line.

As the next step of verification, we created a correlation heat map of the following attributes: cp_exposure, cp_replacement_cost, cp_settlement_risk, and risk. The plot is perfectly balanced, with the diagonal elements showing a value of 1. A value of 1 indicates the column is perfectly correlated with itself. The following screenshot is the correlation heat map.
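For reference, here is a minimal sketch of how these checks can be produced with the scipy, seaborn, and matplotlib packages already listed in requirements.txt. The CSV file name is a placeholder, and loading the generated records into a pandas DataFrame is an assumption about how the output is stored.

# Minimal sketch of the statistical checks; the CSV path is a placeholder assumption,
# and pandas is available as a dependency of seaborn.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

df = pd.read_csv("synthetic_cr_data.csv")  # placeholder path for the generated dataset

attributes = ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk"]

# Q-Q plots of each key attribute against a theoretical normal distribution
fig, axes = plt.subplots(1, len(attributes), figsize=(15, 4))
for ax, col in zip(axes, attributes):
    stats.probplot(df[col], dist="norm", plot=ax)
    ax.set_title(f"Q-Q plot: {col}")
plt.tight_layout()
plt.show()

# Correlation heat map of the key attributes plus the risk column
corr = df[attributes + ["risk"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()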

Clean up
It’s a best practice to clean up the resources you created as part of this post to prevent unnecessary costs and potential security risks from leaving resources running. If you created the Jupyter notebook instance in SageMaker, complete the following steps:

Save and shut down the notebook:
# First save your work
# Then close all open notebooks by clicking File -> Close and Halt
Clear the output (if needed before saving):
# Option 1: Using the notebook menu
# Kernel -> Restart & Clear Output

# Option 2: Using code
from IPython.display import clear_output
clear_output()
Stop and delete the Jupyter notebook instance created in SageMaker:
# Option 1: Using the AWS CLI
# Stop the notebook instance when not in use
aws sagemaker stop-notebook-instance --notebook-instance-name <your-notebook-name>

# If you no longer need the notebook instance
aws sagemaker delete-notebook-instance --notebook-instance-name <your-notebook-name>

# Option 2: Using the SageMaker console
# Amazon SageMaker -> Notebooks
# Select the notebook, open the Actions drop-down, and choose Stop.
# After the instance stops, open the Actions drop-down and choose Delete.

Responsible use of AI
Responsible AI use and data privacy are paramount when using AI in financial applications. Although synthetic data generation can be a powerful tool, it’s crucial to make sure that no real customer information is used without proper authorization and thorough anonymization. Organizations must prioritize data protection, implement robust security measures, and adhere to relevant regulations. Additionally, when developing and deploying AI models, it’s essential to consider ethical implications, potential biases, and the broader societal impact. Responsible AI practices include regular audits, transparency in decision-making processes, and ongoing monitoring to help prevent unintended consequences. By balancing innovation with ethical considerations, financial institutions can harness the benefits of AI while maintaining trust and protecting individual privacy.
Conclusion
In this post, we showed how to generate a well-balanced synthetic dataset representing various aspects of counterparty data, using RAG-based prompt engineering with LLMs. Counterparty data analysis is imperative before entering into OTC transactions between two counterparties. Because actual business data in this domain isn’t easily available, this approach lets you generate synthetic training data for your ML models at minimal cost, often within minutes. After you train the model, you can use it to make intelligent decisions before entering into an OTC derivative transaction.
For more information about this topic, refer to the following resources:

What is RAG (Retrieval-Augmented Generation)?
Retrieval Augmented Generation options and architectures on AWS
Generate synthetic data for evaluating RAG systems using Amazon Bedrock
Amazon Titan Text Embeddings models
Claude 3 Haiku: our fastest model yet

About the Authors
Santosh Kulkarni is a Senior Modernization Architect with over 16 years of experience, specializing in developing serverless, container-based, and data architectures for clients across various domains. Santosh’s expertise extends to machine learning, as a certified AWS ML specialist. He is currently engaged in multiple initiatives leveraging Amazon Bedrock and hosted foundation models.
Joyanta Banerjee is a Senior Modernization Architect with AWS ProServe and specializes in building secure and scalable cloud native applications for customers from different industry domains. He has developed an interest in the AI/ML space, particularly in leveraging the generative AI capabilities available on Amazon Bedrock.
Mallik Panchumarthy is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Mallik works with customers to help them architect efficient, secure, and scalable AI and machine learning applications. Mallik specializes in the generative AI services Amazon Bedrock and Amazon SageMaker.

Turbocharging premium audit capabilities with the power of generative …

This post is co-written with Sajin Jacob, Jerry Chen, Siddarth Mohanram, Luis Barbier, Kristen Chenowith, and Michelle Stahl from Verisk.
Verisk (Nasdaq: VRSK) is a leading data analytics and technology partner for the global insurance industry. Through advanced analytics, software, research, and industry expertise across more than 20 countries, Verisk helps build resilience for individuals, communities, and businesses. The company is committed to ethical and responsible AI development with human oversight and transparency. Verisk is using generative AI to enhance operational efficiencies and profitability for insurance clients while adhering to its ethical AI principles.
Verisk’s Premium Audit Advisory Service (PAAS®) is the leading source of technical information and training for premium auditors and underwriters. PAAS helps users classify exposure for commercial casualty insurance, including general liability, commercial auto, and workers’ compensation. PAAS offers a wide range of essential services, including more than 40,000 classification guides and more than 500 bulletins. PAAS now includes PAAS AI, the first commercially available interactive generative AI chat developed specifically for premium audit. It reduces research time and empowers users to make informed decisions by answering questions and quickly retrieving and summarizing multiple PAAS documents such as class guides, bulletins, and rating cards.
In this post, we describe the development of the customer support process in PAAS, incorporating generative AI, the data, the architecture, and the evaluation of the results. Conversational AI assistants are rapidly transforming customer and employee support. Verisk has embraced this technology and developed its own PAAS AI, which provides an enhanced self-service capability to the PAAS platform.
The opportunity
The Verisk PAAS platform houses a vast array of documents—including class guides, advisory content, and bulletins—that aid Verisk’s customers in determining the appropriate rules and classifications for workers’ compensation, general liability, and commercial auto business. When premium auditors need accurate answers within this extensive document repository, the challenges they face are:

Overwhelming volume – The sheer volume of documents (advisories, bulletins, and so on) makes manual searching time-consuming and inefficient
Slow response times – Finding accurate information within this vast repository can be slow, hindering timely decision-making
Inconsistent quality of responses – Manual searches might yield irrelevant or incomplete results, leading to uncertainty and potential errors

To address this issue, Verisk PAAS AI is designed to alleviate the burden by providing round-the-clock support for business processing and delivering precise and quick responses to customer queries. This technology is deeply integrated into Verisk’s newly reimagined PAAS platform, using all of Verisk’s documentation, training materials, and collective expertise. It employs a retrieval augmented generation (RAG) approach and a combination of AWS services alongside proprietary evaluations to promptly answer most user questions about the capabilities of the Verisk PAAS platform.
When deployed at scale, this PAAS AI will enable Verisk staff to dedicate more time to complex issues, critical projects, and innovation, thereby enhancing the overall customer experience. Throughout the development process, Verisk encountered several considerations, key findings, and decisions that provide valuable insights for any enterprise looking to explore the potential of generative AI.
The approach
When creating an interactive agent using large language models (LLMs), two common approaches are RAG and model fine-tuning. The choice between these methods depends on the specific use case and available data. Verisk PAAS began developing a RAG pipeline for its PAAS AI and has progressively improved this solution. Here are some reasons why continuing with a RAG architecture was beneficial for Verisk:

Dynamic data access – The PAAS platform is constantly evolving, adding new business functions and technical capabilities. Verisk needed to make sure its responses are based on the most current information. The RAG approach allows access to continuously updated data, providing responses with the latest information without frequently retraining the model.
Multiple data sources – Besides data recency, another crucial aspect is the ability to draw from multiple PAAS resources to acquire relevant context. The ease of expanding the knowledge base without the need for fine-tuning new data sources makes the solution adaptable.
Reduced hallucinations – Retrieval minimizes the risk of hallucinations compared with free-form text generation because responses come directly from the provided excerpts. Verisk developed an evaluation tool to enhance response quality.
LLM linguistics – Although appropriate context can be retrieved from enterprise data sources, the underlying LLM manages the linguistics and fluency.
Transparency – Verisk aimed to consistently improve the PAAS AI’s response generation ability. A RAG architecture offered the transparency required in the context retrieval process, which would ultimately be used to generate user responses. This transparency helped Verisk identify areas where document restructuring was needed.
Data governance – With diverse users accessing the platform and differing data access permissions, data governance and isolation were critical. Verisk implemented controls within the RAG pipeline to restrict data access based on user permissions, helping to ensure that responses are delivered only to authorized users.

Although both RAG and fine-tuning have their pros and cons, RAG is the best approach for building a PAAS AI on the PAAS platform, given Verisk’s needs for real-time accuracy, explainability, and configurability. The pipeline architecture supports iterative enhancement as the use cases for the Verisk PAAS platform develop.
Solution overview
The following diagram showcases a high-level architectural data flow that highlights various AWS services used in constructing the solution. Verisk’s system demonstrates a complex AI setup, where multiple components interact and frequently call on the LLM to provide user responses. Employing the PAAS platform to manage these varied components was an intuitive decision.

The key components are as follows:

Amazon ElastiCache
Amazon Bedrock
Amazon OpenSearch Service
Snowflake in Amazon
Evaluation API
Feedback loop (implementation in progress)

Amazon ElastiCache
Verisk’s PAAS team determined that ElastiCache is the ideal solution for storing all chat history. This storage approach allows for seamless integration in conversational chats and enables the display of recent conversations on the website, providing an efficient and responsive user experience.
Amazon Bedrock
Anthropic’s Claude, available in Amazon Bedrock, played various roles within Verisk’s solution:

Response generation – When building their PAAS AI, Verisk conducted a comprehensive evaluation of leading LLMs, using their extensive dataset to test each model’s capabilities. Through Amazon Bedrock, Verisk gained streamlined access to multiple best-in-class foundation models (FMs), enabling efficient testing and comparison across key performance criteria. The Amazon Bedrock unified API and robust infrastructure provided the ideal platform to develop, test, and deploy LLM solutions at scale. After this extensive testing, Verisk found Anthropic’s Claude model consistently outperformed across key criteria. Anthropic’s Claude demonstrated superior language understanding in Verisk’s complex business domain, allowing more pertinent responses to user questions. Given the model’s standout results across Verisk PAAS platform use cases, it was the clear choice to power the PAAS AI’s natural language capabilities.
Conversation summarization – When a user asks a follow-up question, the PAAS AI can continue the conversational thread. To enable this, Verisk used Claude to summarize the dialogue to update the context from ElastiCache. The full conversation summary and new excerpts are input to the LLM to generate the next response. This conversational flow allows the PAAS AI to answer user follow-up questions and have a more natural, contextual dialogue, bringing Verisk PAAS closer to having a true AI assistant that can engage in useful, back-and-forth conversations with users.
Keyword extraction – Keywords are extracted from user questions and previous conversations to be used for creating the new summarized prompt and to be input to Verisk’s knowledge base retrievers to perform vector similarity search.

Amazon OpenSearch Service
Primarily used for the storage of text embeddings, OpenSearch facilitates efficient document retrieval by enabling rapid access to indexed data. These embeddings serve as semantic representations of documents, allowing for advanced search capabilities that go beyond simple keyword matching. This semantic search functionality enhances the system’s ability to retrieve relevant documents that are contextually similar to the search queries, thereby improving the overall accuracy and speed of data queries. Additionally, OpenSearch functions as a semantic cache for similarity searches, optimizing performance by reducing the computational load and improving response times during data retrieval operations. This makes it an indispensable tool in the larger PAAS ecosystem, where the need for quick and precise information access is paramount.
Snowflake in Amazon
The integration of Snowflake in the PAAS AI ecosystem helps provide scalable and real-time access to data, allowing Verisk to promptly address customer concerns and improve its services. By using Snowflake’s capabilities, Verisk can perform advanced analytics, including sentiment analysis and predictive modeling, to better understand customer needs and enhance user experiences. This continuous feedback loop is vital for refining the PAAS AI and making sure it remains responsive and relevant to user demands.
Structuring and retrieving the data
An essential element in developing the PAAS AI’s knowledge base was properly structuring and effectively querying the data to deliver accurate answers. Verisk explored various techniques to optimize both the organization of the content and the methods to extract the most relevant information:

Chunking – A key step in preparing the accumulated questions and answers was splitting the data into individual documents to facilitate indexing into OpenSearch Service. Rather than uploading large files containing multiple pages of content, Verisk chunked the data into smaller segments by document section and character lengths. By splitting the data into small, modular chunks focused on a single section of a document, Verisk could more easily index each document and had greater success in pulling back the correct context. Chunking the data also enabled straightforward updating and reindexing of the knowledge base over time.
Hybrid query – When querying the knowledge base, Verisk found that using just standard vector search wasn’t enough to retrieve all the relevant contexts pertaining to a question. Therefore, a sparse BM25 search was combined with the dense vector search to create a hybrid search approach, which yielded much better context retrieval results (a sketch of this kind of query appears after this list).
Data separation and filters – Another issue Verisk ran into was that, because of the vast amount of documents and the overlapping content within certain topics, incorrect documents were being retrieved for some questions that asked for specific topics that were present across multiple sources—some of these weren’t needed or appropriate in the context of the user’s question. Therefore, data separation was implemented to split the documents based on document type and filter by line of business to improve context retrieval within the application.
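To illustrate the idea, the sketch below shows what a hybrid BM25 plus k-NN query with a line-of-business filter can look like using the opensearch-py client. The endpoint, index name, field names, and filter value are illustrative assumptions, not Verisk’s actual configuration.

# Illustrative sketch of a hybrid BM25 + k-NN query with a metadata filter against
# OpenSearch; the endpoint, index name, field names, and filter values are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder endpoint

def hybrid_search(query_text, query_vector, line_of_business, k=5):
    body = {
        "size": k,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"line_of_business": line_of_business}}  # data separation filter
                ],
                "should": [
                    {"match": {"text": {"query": query_text}}},                         # sparse BM25 match
                    {"knn": {"text_embedding": {"vector": query_vector, "k": k}}},      # dense vector match
                ],
                "minimum_should_match": 1,
            }
        },
    }
    return client.search(index="paas_documents", body=body)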

By thoroughly experimenting and optimizing both the knowledge base powering the PAAS AI and the queries to extract answers from it, Verisk was able to achieve very high answer accuracy during the proof of concept, paving the way for further development. The techniques explored—hybrid querying, HTML section chunking, and index filtering—became core elements of Verisk’s approach for extracting quality contexts.
LLM parameters and models
Experimenting with prompt structure, length, temperature, role-playing, and context was key to improving the quality and accuracy of the PAAS AI’s Claude-powered responses. The prompt design guidelines provided by Anthropic were incredibly helpful.
Verisk crafted prompts that provided Anthropic’s Claude with clear context and set roles for answering user questions. Setting the temperature to 0 helped reduce the randomness of LLM-generated responses and make them more deterministic.
Verisk also experimented with different models to improve the efficiency of the overall solution. For scenarios where latency was more important and less reasoning was required, Anthropic’s Claude Haiku was the perfect solution. For other scenarios such as question answering using provided contexts where it was more important for the LLM to be able to understand every detail given in the prompt, Anthropic’s Claude Sonnet was the better choice to balance latency, performance, and cost.
Guardrails
LLM guardrails were implemented in the PAAS AI project using both the guardrails provided by Amazon Bedrock and specialized sections within the prompt to detect unrelated questions and prompt attack attempts. Amazon Bedrock guardrails can be attached to any Amazon Bedrock model invocation call and automatically detect if the given model input and output are in violation of the language filters that are set (violence, misconduct, sexual, and so on), which helps with screening user inputs. The specialized prompts further improve LLM security by creating a second net that uses the power of the LLMs to catch any inappropriate inputs from the users.
This gives Verisk confidence that the model will respond only within its intended purpose of supporting premium auditing services and will not be misused by threat actors.
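As a rough illustration, the following sketch attaches a preconfigured Amazon Bedrock guardrail to a model invocation with boto3. The guardrail identifier, version, region, and prompt are placeholders rather than Verisk’s actual values.

# Hedged sketch: invoking Anthropic's Claude 3 Haiku on Amazon Bedrock with a guardrail
# attached; the guardrail ID, version, region, and prompt are placeholder assumptions.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    guardrailIdentifier="example-guardrail-id",  # placeholder guardrail ID
    guardrailVersion="1",                        # placeholder guardrail version
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": "Which classification applies to roofing contractors?"}
        ],
    }),
)
print(json.loads(response["body"].read()))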

After validating several evaluation tools such as Deepeval, Ragas, Trulens, and so on, the Verisk PAAS team realized that there were certain limitations to using these tools for their specific use case. Consequently, the team decided to develop its own evaluation API, shown in the following figure.
This custom API evaluates the answers based on three major metrics:

Answer relevancy score – Using LLMs, the process assesses whether the answers provided are relevant to the customer’s prompt. This helps make sure that the responses are directly addressing the questions posed.
Context relevancy score – By using LLMs, the process evaluates whether the context retrieved is appropriate and aligns well with the question. This helps make sure that the LLM has the appropriate and accurate contexts to generate a response.
Faithfulness score – Using LLMs, the process checks if the responses are generated based on their retrieved context or if they are hallucinated. This is crucial for maintaining the integrity and reliability of the information provided.

This custom evaluation approach helps make sure that the answers generated are not only relevant and contextually appropriate but also faithful to the established generative AI knowledge base, minimizing the risk of misinformation. By incorporating these metrics, Verisk has enhanced the robustness and reliability of their PAAS AI, providing customers with accurate and trustworthy responses.
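A generic sketch of this style of LLM-as-judge scoring is shown below. The prompts, 0-to-1 scale, and judge model are illustrative assumptions and not Verisk’s proprietary Evaluation API.

# Generic LLM-as-judge sketch for the three metrics described above; the prompts, scale,
# and judge model are illustrative assumptions.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder judge model

def judge_score(instruction: str) -> float:
    """Ask the judge model for a 0-1 score and parse it from the response."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{"role": "user",
                      "content": instruction + "\nReply with only a number between 0 and 1."}],
    })
    response = bedrock_runtime.invoke_model(modelId=JUDGE_MODEL_ID, body=body)
    text = json.loads(response["body"].read())["content"][0]["text"]
    return float(text.strip())

def evaluate(question: str, context: str, answer: str) -> dict:
    return {
        "answer_relevancy": judge_score(
            f"Question: {question}\nAnswer: {answer}\nHow relevant is the answer to the question?"),
        "context_relevancy": judge_score(
            f"Question: {question}\nContext: {context}\nHow relevant is the context to the question?"),
        "faithfulness": judge_score(
            f"Context: {context}\nAnswer: {answer}\nIs the answer fully supported by the context?"),
    }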

The Verisk PAAS team has implemented a comprehensive feedback loop mechanism, shown in the following figure, to support continuous improvement and address any issues that might arise.
This feedback loop is structured around the following key components:

Customer feedback analysis – The team actively collects and analyzes feedback from customers to identify potential data issues or problems with the generative AI responses. This analysis helps pinpoint specific areas that need improvement.
Issue categorization – After an issue is identified, it’s categorized based on its nature. If it’s a data-related issue, it’s assigned to the internal business team for resolution. If it’s an application issue, a Jira ticket is automatically created for the PAAS IT team to address and fix the problem.
QA test case updates – The system provides an option to update QA test cases based on the feedback received. This helps make sure that the test scenarios remain relevant and comprehensive, covering a wide range of potential issues.
Ground truth agreements – Ground truth agreements, which serve as the benchmark for evaluating LLM response quality, are periodically reviewed and updated. This helps make sure that the evaluation metrics remain accurate and reflective of the desired standards.
Ongoing evaluations – Regular evaluations of the LLM responses are conducted using the updated QA test cases and ground truth agreements. This helps in maintaining high-quality responses and quickly addressing any deviations from the expected standards.

This robust feedback loop mechanism enables Verisk to continuously fine-tune the PAAS AI, making sure that it delivers precise, relevant, and contextually appropriate answers to customer queries. By integrating customer feedback, categorizing issues efficiently, updating test scenarios, and adhering to stringent evaluation protocols, Verisk maintains a high standard of service and drives continuous improvement in its generative AI capabilities.
Business impact
Verisk initially rolled out the PAAS AI to one beta customer to demonstrate real-world performance and impact. Supporting a customer in this way is a stark contrast to how Verisk has historically engaged with and supported customers, where a dedicated team would typically interact with the customer directly. Verisk’s PAAS AI has revolutionized the way subject matter experts (SMEs) work and scales cost-effectively while still providing high-quality assistance. What previously took hours of manual review can now be accomplished in minutes, resulting in an extraordinary 96–98% reduction in processing time per specialist. This dramatic improvement in efficiency not only streamlines operations but also allows Verisk’s experts to focus on more strategic initiatives that drive greater value for the organization.
In analyzing this early usage data, Verisk uncovered additional areas where it can drive business value for its customers. As Verisk collects additional information, this data will help uncover what will be needed to improve results and prepare to roll out to a wider customer base of approximately 15,000 users.
Ongoing development will focus on expanding these capabilities, prioritized based on the collected questions. Most exciting, though, are the new possibilities on the horizon with generative AI. Verisk knows this technology is rapidly advancing and is eager to harness innovations to bring even more value to customers. As new models and techniques emerge, Verisk plans to adapt the PAAS AI to take advantage of the latest capabilities. Although the PAAS AI currently focuses on responding to user questions, this is only the starting point. Verisk plans to quickly improve its capabilities to proactively make suggestions and configure functionality directly in the system itself. The Verisk PAAS team is inspired by the challenge of pushing the boundaries of what’s possible with generative AI and is excited to test those boundaries.
Conclusion
Verisk’s development of a PAAS AI for its PAAS platform demonstrates the transformative power of generative AI in customer support and operational efficiency. Through careful data harvesting, structuring, retrieval, and the use of LLMs, semantic search functionalities, and stringent evaluation protocols, Verisk has crafted a robust system that delivers accurate, real-time answers to user questions. By continuing to enhance the PAAS AI’s features while maintaining ethical and responsible AI practices, Verisk is set to provide increased value to its customers, enable staff to concentrate on innovation, and establish new benchmarks for customer service in the insurance sector.
For more information, see the following resources:

Explore generative AI on AWS
Learn about unlocking the business value of generative AI
Learn more about Anthropic’s Claude 3 models on Amazon Bedrock
Learn about Amazon Bedrock and how to build and scale generative AI applications with FMs
Explore generative AI quickstart proofs of concept

About the Authors
Sajin Jacob is the Director of Software Engineering at Verisk, where he leads the Premium Audit Advisory Service (PAAS) development team. In this role, Sajin plays a crucial part in designing the architecture and providing strategic guidance to eight development teams, optimizing their efficiency and ensuring the maintainability of all solutions. He holds an MS in Software Engineering from Periyar University, India.
Jerry Chen is a Lead Software Developer at Verisk, based in Jersey City. He leads the GenAi development team, working on solutions for projects within the Verisk Underwriting department to enhance application functionalities and accessibility. Within PAAS, he has worked on the implementation of the conversational RAG architecture with enhancements such as hybrid search, guardrails, and response evaluations. Jerry holds a degree in Computer Science from Stevens Institute of Technology.
Sid Mohanram is the Senior Vice President of Core Lines Technology at Verisk. His area of expertise includes data strategy, analytics engineering, and digital transformation. Sid is head of the technology organization with global teams across five countries. He is also responsible for leading the technology transformation for the multi-year Core Lines Reimagine initiative. Sid holds an MS in Information Systems from Stevens Institute of Technology.
Luis Barbier is the Chief Technology Officer (CTO) of Verisk Underwriting at Verisk. He provides guidance to the development teams’ architectures to maximize efficiency and maintainability for all underwriting solutions. Luis holds an MBA from Iona University.
Kristen Chenowith, MSMSL, CPCU, WCP, APA, CIPA, AIS, is PAAS Product Manager at Verisk. She is currently the product owner for the Premium Audit Advisory Service (PAAS) product suite, including PAAS AI, a first to market generative AI chat tool for premium audit that accelerates research for many consultative questions by 98% compared to traditional methods. Kristen holds an MS in Management, Strategy and Leadership at Michigan State University and a BS in Business Administration at Valparaiso University. She has been in the commercial insurance industry and premium audit field since 2006.
Michelle Stahl, MBA, CPCU, AIM, API, AIS, is a Digital Product Manager with Verisk. She has over 20 years of experience building and transforming technology initiatives for the insurance industry. She has worked as a software developer, project manager, and product manager throughout her career.
Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, fast-paced, deeply customer-obsessed, and uses the working backward process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.
Ryan Doty is a Solutions Architect Manager at AWS, based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.
Apoorva Kiran, PhD, is a Senior Solutions Architect at AWS, based out of New York. He is aligned with the financial service industry, and is responsible for providing architectural guidelines to design innovative and scalable fintech solutions. He specializes in developing and commercializing artificial intelligence and machine learning products. Connect with him on LinkedIn.

Microsoft Researchers Present Magma: A Multimodal AI Model Integrating …

Multimodal AI agents are designed to process and integrate various data types, such as images, text, and videos, to perform tasks in digital and physical environments. They are used in robotics, virtual assistants, and user interface automation, where they need to understand and act based on complex multimodal inputs. These systems aim to bridge verbal and spatial intelligence by leveraging deep learning techniques, enabling interactions across multiple domains.

AI systems often specialize in vision-language understanding or robotic manipulation but struggle to combine these capabilities into a single model. Many AI models are designed for domain-specific tasks, such as UI navigation in digital environments or physical manipulation in robotics, limiting their generalization across different applications. The challenge lies in developing a unified model to understand and act across multiple modalities, ensuring effective decision-making in structured and unstructured environments.

Existing Vision-Language-Action (VLA) models attempt to address multimodal tasks by pretraining on large datasets of vision-language pairs followed by action trajectory data. However, these models typically lack adaptability across different environments. Examples include Pix2Act and WebGUM, which excel in UI navigation, and OpenVLA and RT-2, which are optimized for robotic manipulation. These models often require separate training processes and fail to generalize across both digital and physical environments. Also, conventional multimodal models struggle with integrating spatial and temporal intelligence, limiting their ability to perform complex tasks autonomously.

Researchers from Microsoft Research, the University of Maryland, the University of Wisconsin–Madison, KAIST, and the University of Washington introduced Magma, a foundation model designed to unify multimodal understanding with action execution, enabling AI agents to function seamlessly in digital and physical environments. Magma is designed to overcome the shortcomings of existing VLA models by incorporating a robust training methodology that integrates multimodal understanding, action grounding, and planning. Magma is trained using a diverse dataset comprising 39 million samples, including images, videos, and robotic action trajectories. It incorporates two novel techniques:

Set-of-Mark (SoM): SoM enables the model to label actionable visual objects, such as buttons in UI environments

Trace-of-Mark (ToM): ToM allows it to track object movements over time and plan future actions accordingly

Magma employs a combination of deep learning architectures and large-scale pretraining to optimize its performance across multiple domains. The model uses a ConvNeXt-XXL vision backbone to process images and videos, while an LLaMA-3-8B language model handles textual inputs. This architecture enables Magma to integrate vision-language understanding with action execution seamlessly. It is trained on a curated dataset that includes UI navigation tasks from SeeClick and Vision2UI, robotic manipulation datasets from Open-X-Embodiment, and instructional videos from sources like Ego4D, Something-Something V2, and Epic-Kitchen. By leveraging SoM and ToM, Magma can effectively learn action grounding from UI screenshots and robotics data while enhancing its ability to predict future actions based on observed visual sequences. During training, the model processes up to 2.7 million UI screenshots, 970,000 robotic trajectories, and over 25 million video samples to ensure robust multimodal learning.

In zero-shot UI navigation tasks, Magma achieved an element selection accuracy of 57.2%, outperforming models like GPT-4V-OmniParser and SeeClick. In robotic manipulation tasks, Magma attained a success rate of 52.3% in Google Robot tasks and 35.4% in Bridge simulations, significantly surpassing OpenVLA, which only achieved 31.7% and 15.9% in the same benchmarks. The model also performed exceptionally well in multimodal understanding tasks, reaching 80.0% accuracy in VQA v2, 66.5% in TextVQA, and 87.4% in POPE evaluations. Magma also demonstrated strong spatial reasoning capabilities, scoring 74.8% on the BLINK dataset and 80.1% on the Visual Spatial Reasoning (VSR) benchmark. In video question-answering tasks, Magma achieved an accuracy of 88.6% on IntentQA and 72.9% on NextQA, further highlighting its ability to process temporal information effectively.

Several key takeaways emerge from the research on Magma:

Magma was trained on 39 million multimodal samples, including 2.7 million UI screenshots, 970,000 robotic trajectories, and 25 million video samples.

The model combines vision, language, and action in a unified framework, overcoming the limitations of domain-specific AI models.

SoM enables accurate labeling of clickable objects, while ToM allows tracking object movement over time, improving long-term planning capabilities.

Magma achieved a 57.2% accuracy rate in element selection in UI tasks, a 52.3% success rate in robotic manipulation, and an 80.0% accuracy rate in VQA tasks.

Magma outperformed existing AI models by over 19.6% in spatial reasoning benchmarks and improved by 28% over previous models in video-based reasoning.

Magma demonstrated superior generalization across multiple tasks without requiring additional fine-tuning, making it a highly adaptable AI agent.

Magma’s capabilities can enhance decision-making and execution in robotics, autonomous systems, UI automation, digital assistants, and industrial AI.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Advancing MLLM Alignment Through MM-RLHF: A Large-Scale Human Preferen …

Multimodal Large Language Models (MLLMs) have gained significant attention for their ability to handle complex tasks involving vision, language, and audio integration. However, they lack comprehensive alignment beyond basic supervised fine-tuning (SFT). Current state-of-the-art models often bypass rigorous alignment stages, leaving crucial aspects like truthfulness, safety, and human preference alignment inadequately addressed. Existing approaches target only specific domains such as hallucination reduction or conversational improvements, falling short of enhancing the model’s overall performance and reliability. This narrow focus raises questions about whether human preference alignment can improve MLLMs across a broader spectrum of tasks.

Recent years have witnessed substantial progress in MLLMs, built upon advanced LLM architectures like GPTs, LLaMA, Alpaca, Vicuna, and Mistral. These models have evolved through end-to-end training approaches, tackling complex multimodal tasks involving image-text alignment, reasoning, and instruction following. Several open-source MLLMs, including Otter, mPLUG-Owl, LLaVA, Qwen-VL, and VITA, have emerged to address fundamental multimodal challenges. However, alignment efforts have remained limited. While algorithms like Fact-RLHF and LLAVACRITIC have shown promise in reducing hallucinations and improving conversational abilities, they haven’t enhanced general capabilities. Evaluation frameworks such as MME, MMBench, and Seed-Bench have been developed to assess these models.

Researchers from KuaiShou, CASIA, NJU, USTC, PKU, Alibaba, and Meta AI have proposed MM-RLHF, an innovative approach featuring a comprehensive dataset of 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a significant advancement in terms of size, diversity, and annotation quality compared to existing resources. The method introduces two key innovations: a Critique-Based Reward Model that generates detailed critiques before scoring outputs, and Dynamic Reward Scaling that optimizes sample weights based on reward signals. It enhances both the interpretability of model decisions and the efficiency of the alignment process, addressing the limitations of traditional scalar reward mechanisms in multimodal contexts.

The MM-RLHF implementation involves a complex data preparation and filtering process across three main domains: image understanding, video understanding, and multimodal safety. The image understanding component integrates data from multiple sources including LLaVA-OV, VLfeedback, and LLaVA-RLHF, with multi-turn dialogues converted to single-turn format. This compilation results in over 10 million dialogue samples covering diverse tasks from basic conversation to complex reasoning. The data filtering process uses predefined sampling weights categorized into three types: multiple-choice questions for testing reasoning and perception, long-text questions for evaluating conversational abilities, and short-text questions for basic image analysis.

The evaluation of MM-RLHF and MM-DPO shows significant improvements across multiple dimensions when applied to models like LLaVA-OV-7B, LLaVA-OV-0.5B, and InternVL-1B. Conversational abilities improved by over 10%, while unsafe behaviors decreased by at least 50%. The aligned models show better results in hallucination reduction, mathematical reasoning, and multi-image understanding, even without specific training data for some tasks. However, model-specific variations are observed, with different models requiring distinct hyperparameter settings for optimal performance. Also, high-resolution tasks show limited gains due to dataset constraints and filtering strategies that don’t target resolution optimization.

In this paper, researchers introduced MM-RLHF, a dataset and alignment approach that shows significant advancement in MLLM development. Unlike previous task-specific approaches, this method takes a holistic approach to improve model performance across multiple dimensions. The dataset’s rich annotation granularity, including per-dimension scores and ranking rationales, offers untapped potential for future development. Future research directions will focus on utilizing this granularity through advanced optimization techniques, addressing high-resolution data limitations, and expanding the dataset through semi-automated methods, potentially establishing a foundation for more robust multimodal learning frameworks.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Learning Intuitive Physics: Advancing AI Through Predictive Representa …

Humans possess an innate understanding of physics, expecting objects to behave predictably without abrupt changes in position, shape, or color. This fundamental cognition is observed in infants, primates, birds, and marine mammals, supporting the core knowledge hypothesis, which suggests humans have evolutionarily developed systems for reasoning about objects, space, and agents. While AI surpasses humans in complex tasks like coding and mathematics, it struggles with intuitive physics, highlighting Moravec’s paradox. AI approaches to physical reasoning fall into two categories: structured models, which simulate object interactions using predefined rules, and pixel-based generative models, which predict future sensory inputs without explicit abstractions.

Researchers from FAIR at Meta, Univ Gustave Eiffel, and EHESS explore how general-purpose deep neural networks develop an understanding of intuitive physics by predicting masked regions in natural videos. Using the violation-of-expectation framework, they demonstrate that models trained to predict outcomes in an abstract representation space—such as Joint Embedding Predictive Architectures (JEPAs)—can accurately recognize physical properties like object permanence and shape consistency. In contrast, video prediction models operating in pixel space and multimodal large language models perform closer to random guessing. This suggests that learning in an abstract space, rather than relying on predefined rules, is sufficient to acquire an intuitive understanding of physics.

The study focuses on a video-based JEPA model, V-JEPA, which predicts future video frames in a learned representation space, aligning with the predictive coding theory in neuroscience. V-JEPA achieved 98% zero-shot accuracy on the IntPhys benchmark and 62% on the InfLevel benchmark, outperforming other models. Ablation experiments revealed that intuitive physics understanding emerges robustly across different model sizes and training durations. Even a small 115 million parameter V-JEPA model or one trained on just one week of video showed above-chance performance. These findings challenge the notion that intuitive physics requires innate core knowledge and highlight the potential of abstract prediction models in developing physical reasoning.

The violation-of-expectation paradigm in developmental psychology assesses intuitive physics understanding by observing reactions to physically impossible scenarios. Traditionally applied to infants, this method measures surprise responses through physiological indicators like gaze time. More recently, it has been extended to AI systems by presenting them with paired visual scenes, where one includes a physical impossibility, such as a ball disappearing behind an occluder. The V-JEPA architecture, designed for video prediction tasks, learns high-level representations by predicting masked portions of videos. This approach enables the model to develop an implicit understanding of object dynamics without relying on predefined abstractions, as shown through its ability to anticipate and react to unexpected physical events in video sequences.

V-JEPA was tested on datasets such as IntPhys, GRASP, and InfLevel-lab to benchmark intuitive physics comprehension, assessing properties like object permanence, continuity, and gravity. Compared to other models, including VideoMAEv2 and multimodal language models like Qwen2-VL-7B and Gemini 1.5 pro, V-JEPA achieved significantly higher accuracy, demonstrating that learning in a structured representation space enhances physical reasoning. Statistical analyses confirmed its superiority over untrained networks across multiple properties, reinforcing that self-supervised video prediction fosters a deeper understanding of real-world physics. These findings highlight the challenge of intuitive physics for existing AI models and suggest that predictive learning in a learned representation space is key to improving AI’s physical reasoning abilities.

In conclusion, the study explores how state-of-the-art deep learning models develop an understanding of intuitive physics. The model demonstrates intuitive physics comprehension without task-specific adaptation by pretraining V-JEPA on natural videos using a prediction task in a learned representation space. Results suggest this ability arises from general learning principles rather than hardwired knowledge. However, V-JEPA struggles with object interactions, likely due to training limitations and short video processing. Enhancing model memory and incorporating action-based learning could improve performance. Future research may examine models trained on infant-like visual data, reinforcing the potential of predictive learning for physical reasoning in AI.

Check out the Paper. All credit for this research goes to the researchers of this project.

Build verifiable explainability into financial services workflows with …

Foundation models (FMs) and generative AI are transforming how financial service institutions (FSIs) operate their core business functions. AWS FSI customers, including NASDAQ, State Bank of India, and Bridgewater, have used FMs to reimagine their business operations and deliver improved outcomes.
FMs are probabilistic in nature and produce a range of outcomes. Though these models can produce sophisticated outputs through the interplay of pre-training, fine-tuning, and prompt engineering, their decision-making process remains less transparent than classical predictive approaches. Although emerging techniques such as tool use and Retrieval Augmented Generation (RAG) aim to enhance transparency, they too rely on probabilistic mechanisms—whether in retrieving relevant context or selecting appropriate tools. Even methods such as attention visualization and prompt tracing produce probabilistic insights rather than deterministic explanations.
AWS customers operating in regulated industries such as insurance, banking, payments, and capital markets, where decision transparency is paramount, want to launch FM-powered applications with the same confidence as traditional, deterministic software. To address these challenges, we’re introducing Automated Reasoning checks in Amazon Bedrock Guardrails (preview). Automated Reasoning checks can detect hallucinations, suggest corrections, and highlight unstated assumptions in the response of your generative AI application. More importantly, Automated Reasoning checks can explain why a statement is accurate using mathematically verifiable, deterministic formal logic.
To use Automated Reasoning checks, you first create an Automated Reasoning policy by encoding a set of logical rules and variables from available source documentation. Automated Reasoning checks can then validate that the questions (prompts) and the FM-suggested answers are consistent with the rules defined in the Automated Reasoning policy using sound mathematical techniques. This fundamentally changes the approach to a solution’s transparency in FM applications, adding a deterministic verification for process-oriented workflows common in FSI organizations.
In this post, we explore how Automated Reasoning checks work through various common FSI scenarios such as insurance legal triaging, underwriting rules validation, and claims processing.
What is Automated Reasoning and how does it help?
Automated Reasoning is a field of computer science focused on mathematical proof and logical deduction—similar to how an auditor might verify financial statements or how a compliance officer makes sure that regulatory requirements are met. Rather than using probabilistic approaches such as traditional machine learning (ML), Automated Reasoning tools rely on mathematical logic to definitively verify compliance with policies and provide certainty (under given assumptions) about what a system will or won’t do. Automated Reasoning checks in Amazon Bedrock Guardrails is the first offering from a major cloud provider in the generative AI space.
The following financial example serves as an illustration.
Consider a basic trading rule: “If a trade is over $1 million AND the client is not tier-1 rated, THEN additional approval is required.”
An Automated Reasoning system would analyze this rule by breaking it down into logical components:

Trade value > $1,000,000
Client rating ≠ tier-1
Result: Additional approval required

When presented with a scenario, the system can provide a deterministic (yes or no) answer about whether additional approval is needed, along with the exact logical path it used to reach that conclusion. For instance:

Scenario A – $1.5M trade, tier-2 client → Additional approval required (Both conditions met)
Scenario B – $2M trade, tier-1 client → No additional approval (Second condition not met)
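Automated Reasoning compiles such rules into formal logic rather than application code, but the determinism it provides can be illustrated with an ordinary Python predicate. The sketch below is only an analogy for the trading rule above, not how the Amazon Bedrock feature is implemented.

from dataclasses import dataclass

@dataclass
class Trade:
    value_usd: float
    client_tier: str  # for example, "tier-1" or "tier-2"

def requires_additional_approval(trade: Trade) -> bool:
    """Deterministic encoding of: trade value > $1M AND client is not tier-1."""
    return trade.value_usd > 1_000_000 and trade.client_tier != "tier-1"

# Scenario A: $1.5M trade, tier-2 client -> True (additional approval required)
print(requires_additional_approval(Trade(1_500_000, "tier-2")))
# Scenario B: $2M trade, tier-1 client -> False (second condition not met)
print(requires_additional_approval(Trade(2_000_000, "tier-1")))

The same inputs always produce the same outputs, and each verdict can be traced back to the exact condition that drove it.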

What makes Automated Reasoning different is its fundamental departure from probabilistic approaches common in generative AI. At its core, Automated Reasoning provides deterministic outcomes where the same input consistently produces the same output, backed by verifiable proof chains that trace each conclusion to its original rules. This mathematical certainty, based on formal logic rather than statistical inference, enables complete verification of possible scenarios within defined rules (and under given assumptions).
FSIs regularly apply Automated Reasoning to verify regulatory compliance, validate trading rules, manage access controls, and enforce policy frameworks. However, it’s important to understand its limitations. Automated Reasoning can’t predict future events or handle ambiguous situations, nor can it learn from new data the way ML models do. It requires precise, formal definition of rules and isn’t suitable for subjective decisions that require human judgment. This is where the combination of generative AI and Automated Reasoning comes into play.
As institutions seek to integrate generative AI into their decision-making processes, Amazon Bedrock Guardrails Automated Reasoning checks provides a way to incorporate Automated Reasoning into the generative AI workflow. Automated Reasoning checks deliver deterministic verification of model outputs against documented rules, complete with audit trails and mathematical proof of policy adherence. This capability makes it particularly valuable for regulated processes where accuracy and governance are essential, such as risk assessment, compliance monitoring, and fraud detection. Most importantly, through its deterministic rule-checking and explainable audit trails, Automated Reasoning checks effectively address one of the major barriers to generative AI adoption: model hallucination, where models generate unreliable or unfaithful responses to the given task.
Using Automated Reasoning checks for Amazon Bedrock in financial services
A great candidate for applying Automated Reasoning in FSI is any scenario where a process or workflow can be translated into a set of logical rules. Hard-coding rules as programmatic functions provides deterministic outcomes, but it becomes complex to maintain and requires highly structured inputs, potentially compromising the user experience. Alternatively, using an FM as the decision engine offers flexibility but introduces uncertainty, because FMs operate as black boxes where the internal reasoning process remains opaque and difficult to audit. In addition, the FM’s potential to hallucinate or misinterpret inputs means that conclusions require human review to confirm accuracy.
Solution overview
This is where Automated Reasoning checks come into play. The following diagram demonstrates the workflow to combine generative AI and Automated Reasoning to incorporate both methods.

The following steps explain the workflow in detail:

The source document along with the intent instructions are passed to the Automated Reasoning checks service to build the rules and variables and create an Automated Reasoning checks policy.
An Automated Reasoning checks policy is created and versioned.
An Automated Reasoning checks policy and version is associated with an Amazon Bedrock guardrail.
An ApplyGuardrail API call is made with the question and an FM response to the associated Amazon Bedrock guardrail.
The Automated Reasoning checks model is triggered with the inputs from the ApplyGuardrail API, building a logical representation of the input and FM response.
An Automated Reasoning check is completed based on the created rules and variables from the source document and the logical representation of the inputs.
The results of the Automated Reasoning check are shared with the user along with what rules, variables, and variable values were used in its determination, plus suggestions on what would make the assertion valid.
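As a rough sketch of steps 4 through 7, the ApplyGuardrail call below submits a question and an FM answer to a guardrail that has an Automated Reasoning policy associated with it; the validation result, applied rules, extracted variables, and suggestions come back in the response assessments. The guardrail identifiers and the qualifier usage are assumptions, and because the feature is in preview, the exact response fields may differ.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical identifiers for the guardrail associated with the policy in step 3.
GUARDRAIL_ID = "your-guardrail-id"
GUARDRAIL_VERSION = "1"

question = ("Is the risk acceptable for a driver with the following profile? "
            "Has 2 chargeable accidents in a span of 10 years.")
fm_answer = "Driver has unacceptable risk. Number of chargeable accidents count is 2."

# Step 4: send the question and the FM response to the guardrail so the
# Automated Reasoning check can build a logical representation of the pair.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier=GUARDRAIL_ID,
    guardrailVersion=GUARDRAIL_VERSION,
    source="OUTPUT",  # validate the model's answer in the context of the question
    content=[
        {"text": {"text": question, "qualifiers": ["query"]}},
        {"text": {"text": fm_answer, "qualifiers": ["guard_content"]}},
    ],
)

# Steps 6-7: inspect the overall action and the per-policy assessments.
print(response["action"])
print(response["assessments"])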

Prerequisites
Before you build your first Automated Reasoning check for Amazon Bedrock Guardrails, make sure you have the following:

An AWS account that provides access to AWS services, including Amazon Bedrock.
The new Automated Reasoning checks safeguard is available today in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. Make sure that you have access to the Automated Reasoning checks preview within Amazon Bedrock. To request access to the preview today, contact your AWS account team. To learn more, visit Amazon Bedrock Guardrails.
An AWS Identity and Access Management (IAM) user set up for the Amazon Bedrock API and appropriate permissions added to the IAM user

Solution walkthrough
To build an Automated Reasoning check for Amazon Bedrock Guardrails, follow these steps:

On the Amazon Bedrock console, under Safeguards in the navigation pane, select Automated Reasoning.
Choose Create policy, as shown in the following screenshot.

On the Create policy section, shown in the following screenshot, enter the following inputs:

Name – Name of the Automated Reasoning checks policy.
Description – Description of the Automated Reasoning checks policy.
Source content – The document to create the rules and variables from. You need to upload a document in PDF format.
Intent – Instructions on how to approach the creation of the rules and variables.

The following sections dive into some example uses of Automated Reasoning checks.
Automated Reasoning checks for insurance underwriting rules validation
Consider a scenario for an auto insurance company’s underwriting rules validation process.
Underwriting is a fundamental function within the insurance industry, serving as the foundation for risk assessment and management. Underwriters are responsible for evaluating insurance applications, determining the level of risk associated with each applicant, and making decisions on whether to accept or reject the application based on the insurer’s guidelines and risk appetite.
One of the key challenges in underwriting is the process of rule validations, which is the verification that the information provided in the documents adheres to the insurer’s underwriting guidelines. This is a complex task that deals with unstructured data and varying document formats.
This example uses an auto insurance company’s underwriting rules guideline document. A typical underwriting manual can have rules to define unacceptable drivers, unacceptable vehicles, and other definitions, as shown in the following example:
Unacceptable drivers

Drivers with 3 or more DUIs.
For new business or additional drivers, drivers with 3 or more accidents, regardless of fault.
Drivers with more than 2 major violations.
Drivers with more than 3 chargeable accidents.
Military personnel not stationed in California.
Drivers 75 and older without a completed company Physician’s Report form.
Any driver disclosing physical or mental conditions that might affect the driver’s ability to safely operate a motor vehicle may be required to complete a company Physician’s Report form to verify their ability to drive. In addition, if in the course of an investigation we discover an undisclosed medical concern, a completed company Physician’s Report form will be required.
Any unlisted or undisclosed driver that is a household member or has regular use of a covered vehicle.

Unacceptable Vehicles

Vehicles principally garaged outside the state of California.
Vehicles with more or less than 4 wheels.
Vehicles with cargo capacity over 1 ton.
Motor vehicles not eligible to be licensed for highway use.
Taxicabs, limousines, emergency vehicles, escort vehicles, and buses.
Vehicles used for pickup or delivery of goods at any time including pizzas, magazines, and newspapers.
Vehicles used for public livery, conveyance, and company fleets.
Vehicles made available to unlisted drivers for any use including business use such as sales, farming, or artisan use (for example, pooled vehicles).
Vehicles used to transport nursery or school children, migrant workers, or hotel or motel guests.
Vehicles with permanent or removable business-solicitation logos or advertising.
Vehicles owned or leased by a partnership or corporation.
Step vans, panel vans, dump trucks, flatbed trucks, amphibious vehicles, dune buggies, motorcycles, scooters, motor homes, travel trailers, micro or kit cars, antique or classic vehicles, custom, rebuilt, altered or modified vehicles.
Physical damage coverage for vehicles with an ISO symbol of more than 20 for model year 2010 and earlier or ISO symbol 41 for model year 2011 and later.
Liability coverage for vehicles with an ISO symbol of more than 25 for vehicles with model year 2010 and earlier or ISO symbol 59 for model year 2011 and later.
Salvaged vehicles for comprehensive and collision coverage. Liability only policies for salvaged vehicles are acceptable.
Physical damage coverage for vehicles over 15 years old for new business or for vehicles added during the policy term.

For this example, we entered the following inputs for the Automated Reasoning check:

Name – Auto Policy Rule Validation.
Description – A policy document outlining the rules and criteria that define unacceptable drivers and unacceptable vehicles.
Source content – A document describing the company’s underwriting manual and guidelines. You can copy and paste the example provided and create a PDF document. Upload this document as your source content.
Intent – Create a logical model for auto insurance underwriting policy approval. An underwriter associate will provide the driver profile and type of vehicle and ask whether a policy can be written for this potential customer. The underwriting guideline document uses a list of unacceptable driver profiles and unacceptable vehicles. Make sure to create a separate rule for each unacceptable condition listed in the document, and create a variable to capture whether the driver is an acceptable risk or not. A customer that doesn’t violate any rule is acceptable. Here is an example: “Is the risk acceptable for a driver with the following profile? A driver has 4 car accidents, uses the car as an Uber taxi, and has 3 DUIs.” The model should determine: “The driver has unacceptable risks. Driving a taxi is an unacceptable risk. The driver has multiple DUIs.”

The model creates rules and variables from the source content. Depending on the size of the source content, this process may take more than 10 minutes.
The process of rule and variable creation is probabilistic in nature, and we highly recommend that you edit the created rules and variables to align better with your source content.
After the process is complete, a set of rules and variables will be created and can be reviewed and edited.
The following screenshots show an extract of the rules and variables created by the Automated Reasoning checks feature. The actual policy will have more rules and variables that can be viewed in Amazon Bedrock, but we’re not showing them here due to space limits.
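The generated rules are expressed in the policy’s formal logic, but conceptually they behave like the hand-written predicates below, with one rule per unacceptable condition and a variable capturing whether the driver is an acceptable risk. This is an illustrative analogy only, not the representation Amazon Bedrock produces, and it covers just a subset of the conditions above.

from dataclasses import dataclass

@dataclass
class DriverProfile:
    dui_count: int = 0
    accident_count: int = 0
    major_violation_count: int = 0
    used_as_taxi: bool = False

def unacceptable_reasons(driver: DriverProfile) -> list[str]:
    """Return the underwriting rules (a subset) that the profile violates."""
    reasons = []
    if driver.dui_count >= 3:
        reasons.append("Drivers with 3 or more DUIs are unacceptable.")
    if driver.accident_count >= 3:
        reasons.append("Drivers with 3 or more accidents are unacceptable.")
    if driver.major_violation_count > 2:
        reasons.append("Drivers with more than 2 major violations are unacceptable.")
    if driver.used_as_taxi:
        reasons.append("Taxi or livery use is an unacceptable vehicle use.")
    return reasons

profile = DriverProfile(dui_count=3, accident_count=4, used_as_taxi=True)
violations = unacceptable_reasons(profile)
is_acceptable_risk = not violations  # acceptable only if no rule is violated
print(is_acceptable_risk, violations)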

The Automated Reasoning checks policy must be associated with an Amazon Bedrock guardrail. For more information, refer to Create a guardrail.

Test the policy
To test this policy, we considered a hypothetical scenario with an FM-generated response to validate.
Question: Is the risk acceptable for a driver with the following profile? Has 2 chargeable accidents in a span of 10 years. Driving records show a negligent driving charge and one DUI.
Answer: Driver has unacceptable risk. Number of chargeable accidents count is 2.
After entering the question and answer inputs, choose Submit, as shown in the following screenshot.
The Automated Reasoning check returned as Invalid, as shown in the following screenshot. The components shown in the screenshot are as follows:

Validation result – This is the Automated Reasoning checks validation output. This conclusion is reached by computing the extracted variable assignments against the rules defined in the Automated Reasoning policy.
Applied rules – These are the rules that were used to reach the validation result for this finding.
Extracted variables – This list shows how Automated Reasoning checks interpreted the input Q&A and used it to assign values to variables in the Automated Reasoning policy. These variable values are computed against the rules in the policy to reach the validation result.
Suggestions – When the validation result is invalid, this list shows a set of variable assignments that would make the conclusion valid. When the validation result is valid, this list shows a list of assignments that are necessary for the result to hold; these are unstated assumptions in the answer. You can use these values alongside the rules to generate a string that provides feedback to your FM.

The model evaluated the answer against the Automated Reasoning logical rules, and in this scenario the following rule was triggered:
“A driver is considered an acceptable risk if and only if their number of violations is less than or equal to 2.”
The extracted value for violation_count is 2, but the answer set the is_acceptable_risk variable to false; according to the Automated Reasoning logic, a driver with 2 or fewer violations is an acceptable risk, so the answer isn’t valid.
The suggested value for is_acceptable_risk is true.
Here is an example with a revised answer.
Question: Is the risk acceptable for a driver with the following profile? Has 2 chargeable accidents in a span of 10 years. Driving records show a negligent driving charge and one DUI.
Answer: Driver has acceptable risk.
Because no rules were violated, the Automated Reasoning logic determines the assertion is Valid, as shown in the following screenshot.

Automated Reasoning checks for insurance legal triaging
For the next example, consider a scenario where an underwriter is evaluating whether a long-term care (LTC) claim requires legal intervention.
For this example, we entered the following inputs:

Name – Legal LTC Triage
Description – A workflow document outlining the criteria, process, and requirements for referring LTC claims to legal investigation
Source content – A document describing your LTC legal triaging process. You need to upload your own legal LTC triage document in PDF format. This document should outline the criteria, process, and requirements for referring LTC claims to legal investigation.
Intent – Create a logical model that validates compliance requirements for LTC claims under legal investigation. The model must evaluate individual policy conditions including benefit thresholds, care durations, and documentation requirements that trigger investigations. It should verify timeline constraints, proper sequencing of actions, and policy limits. Each requirement must be evaluated independently, where a single violation results in noncompliance. For example: “A claim has two care plan amendments within 90 days, provider records covering 10 months, and a review meeting at 12 days. Is this compliant?” The model should determine: “Not compliant because: multiple amendments require investigation, provider records must cover 12 months, and review meetings must be within 10 days.”

The process of rule and variable creation is probabilistic in nature, and we highly recommend that you edit the created rules and variables to align better with your source content.
After the process is complete, a set of rules and variables will be created. To review and edit a rule or variable, select the more options icon under Actions and then choose Edit. The following screenshots show the Rules and Variables screens.

Test the policy
From here we can test our Automated Reasoning checks in the test playground. Note that to do this, the Automated Reasoning checks policy must be associated with an Amazon Bedrock guardrail. To test this policy, we posed the following hypothetical scenario with an FM-generated response for the Automated Reasoning checks policy to validate.
Question: A claim with care duration of 28 months, no documentation irregularities, and total projected benefit value of $200,000 has been submitted. Does this require legal investigation?
Answer: This claim does not require legal investigation because the total projected benefit value is below $250,000 and there are no documentation irregularities.

After completing the check, the Automated Reasoning tool produces the validation result, which for this example was Invalid, as shown in the following screenshot. This means the FM generated response violates one or more rules from the generated Automated Reasoning checks policy.

The rule that was triggered was the following:
“A claim is flagged for legal investigation if and only if there are documentation irregularities, or the total projected benefit exceeds $250,000, or the care duration is more than 24 months, or the number of care plan amendments within a 90-day period is greater than 1.”
Based on our input, the model determined the variable inputs to be:

total_projected_benefit (Real number) – 200,000 – The total projected monetary value of benefits for a long-term care claim
flag_for_legal_investigation (Boolean) – FALSE – Indicates whether a claim should be flagged for legal investigation based on the specified criteria
has_documentation_irregularities (Boolean) – FALSE – Presence of irregularities in the care provider’s documentation
care_duration_months (Integer) – 28 – The length of time for which care is provided or expected to be provided

From this, we can determine exactly why the rule was found INVALID: our input had care_duration_months greater than 24 months, yet flag_for_legal_investigation was set to FALSE, which invalidated the rule.
In the suggestions, we observe that for our original Q&A to be correct, we’d have to have flag_for_legal_investigation as TRUE, along with the total_projected_benefit being 200,000.
We can validate whether the suggestion will yield a VALID response by adjusting our answer to the original question to the following.
“This claim does require legal investigation even though the total projected benefit value is below $250,000 and there are no documentation irregularities.”

As shown in the following screenshot, no rules were triggered. However, what changed is our extracted variables and our suggestions.

Now that the assertion is valid, the remaining requirements appear as unstated assumptions that, according to our rules, must hold for this to be a VALID response. We can use these suggestions to modify our response to the end user with more granular detail.
Automated Reasoning checks for insurance claims processing
The final example demonstrates an Automated Reasoning checks example for claims processing.
Claims processing is another fundamental function within insurance companies, and it’s the process used by policy holders to exercise their policy to get compensation for an event (a car accident, for example). Claims processors work to validate the claim and the beneficiaries, determine the amount of compensation, and work to settle the claim. This process includes verification of the people involved, proof of the incident, and a host of legal guidelines that they’re required to follow.
One of the key issues in claims processing is validating the claim and the parties involved. In this example, we use Automated Reasoning checks to provide recommendations to individuals attempting to file a claim in the case of a house fire.
As in the previous examples, we create an Automated Reasoning guardrail policy as follows:

Name – Home Owners Insurance Claims Policy
Description – This policy is used for the validation of homeowners’ insurance claims and includes the processes and procedures needed to file a claim.
Source content – A document describing the company’s homeowners insurance claims process. This document should outline the necessary processes and procedures needed to file a claim.
Intent – Create a logical model that validates the requirements for homeowner claims. The model must evaluate individual policy conditions, including benefit thresholds, durations, and documentation requirements needed for the creation of a claim. It should verify timeline constraints, proper sequencing of actions, and policy limits. Each requirement must be evaluated independently, where any single violation results in noncompliance. For example: “I had a fire at my house. What documents do I need in order to file a claim?” The model should determine: “You will need to provide a fire department report, police report, photos, and your policy number.”

The following screenshots show an extract of the rules and variables created by the Automated Reasoning checks feature. The actual policy will have more rules and variables that can be viewed in Amazon Bedrock, but we’re not showing them due to space limits.

Test the policy
To test this policy, we considered a hypothetical scenario with an FM-generated response to validate.
Question: I had a fire at my house. What documents do I need to file a claim?
Answer: You provide a report from the fire department, a police report, photos, and policy number.
In this case, the Automated Reasoning check returned as Valid, as shown in the following screenshot. Automated Reasoning checks validated that the answer is correct and aligns to the provided claims processing document.

Conclusion
In this post, we demonstrated that Automated Reasoning checks solve a core challenge with FMs: the ability to verifiably demonstrate the reasoning behind decision-making. By incorporating Automated Reasoning checks into our workflow, we were able to validate a complex triage scenario and determine the exact reason why a decision was made. Automated Reasoning is deterministic, meaning that with the same ruleset, same variables, and same input and FM response, the determination will be reproducible. This means you can reproduce the findings for compliance or regulatory reporting.
Automated Reasoning checks in Amazon Bedrock Guardrails empowers financial service professionals to work more effectively with generative AI by providing deterministic validation of FM responses for decision-oriented documents. This enhances human decision-making by reducing hallucination risk and creating reproducible, explainable safeguards that help professionals better understand and trust FM-generated insights.
The new Automated Reasoning checks safeguard is available today in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. We invite you to build your first Automated Reasoning checks. For detailed guidance, visit our documentation and code examples in our GitHub repo. Please share your experiences in the comments or reach out to the authors with questions. Happy building!

About the Authors
Alfredo Castillo is a Senior Solutions Architect at AWS, where he works with Financial Services customers on all aspects of internet-scale distributed systems, and specializes in machine learning, natural language processing, intelligent document processing, and generative AI. Alfredo has a background in both electrical engineering and computer science. He is passionate about family, technology, and endurance sports.
Andy Hall is a Senior Solutions Architect with AWS and is focused on helping Financial Services customers with their digital transformation to AWS. Andy has helped companies to architect, migrate, and modernize large-scale applications to AWS. Over the past 30 years, Andy has led efforts around Software Development, System Architecture, Data Processing, and Development Workflows for large enterprises.
Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Best practices for Amazon SageMaker HyperPod task governance

At AWS re:Invent 2024, we launched a new innovation in Amazon SageMaker HyperPod on Amazon Elastic Kubernetes Service (Amazon EKS) that enables you to run generative AI development tasks on shared accelerated compute resources efficiently and reduce costs by up to 40%. Administrators can use SageMaker HyperPod task governance to govern allocation of accelerated compute to teams and projects, and enforce policies that determine the priorities across different types of tasks. The resulting improvement in utilization of compute resources enables organizations to focus on accelerating their generative AI innovation and time to market, instead of spending time coordinating resource allocation and continuously replanning their generative AI development tasks.
In this post, we provide best practices to maximize the value of SageMaker HyperPod task governance and make the administration and data science experiences seamless. We also discuss common governance scenarios when administering and running generative AI development tasks.
Prerequisites
To get started with SageMaker HyperPod task governance on an existing SageMaker HyperPod cluster orchestrated by Amazon EKS, make sure you uninstall any existing Kueue installations, and have a Kubernetes cluster running version 1.30+.
Administration experience
Administrators are the first persona interacting with SageMaker HyperPod task governance. They are responsible for managing the cluster compute allocation according to the organization’s priorities and goals.
Managing compute
The first step to managing capacity across teams is to set up compute allocations. When setting up a compute allocation, keep in mind the following considerations:

What type of tasks does this team typically run?
Does this team constantly run tasks and require reserved capacity?
What is this team’s priority relative to other teams?

When setting up a compute allocation, an administrator sets the team’s fair-share weight, which provides prioritization relative to other teams when vying for the same idle compute. A higher weight enables a team to access unutilized resources within shared capacity sooner. As a best practice, set the fair-share weight higher for teams that will require access to capacity sooner than other teams.
After the fair-share weight is set, the administrator then sets up the quota and borrowing strategy. Quota determines the allocation per instance type within the cluster’s instance groups. Borrowing strategy determines whether a team will share or reserve their allotted capacity. To enforce proper quota management, the total reserved quota should not surpass the cluster’s available capacity for that resource. For instance, if a cluster comprises 20 ml.c5.2xlarge instances, the cumulative quota assigned to teams should remain under 20.
If the compute allocations for teams allow for “Lend and Borrow” or “Lend,” idle capacity is shared between these teams. For example, if Team A has a quota of 6 but is using only 2 for its tasks, and Team B has a quota of 5 and is using 4 for its tasks, then when a task requiring 4 resources is submitted to Team B, 3 of those resources are borrowed from Team A based on its “Lend and Borrow” setting. If a team’s compute allocation is set to “Don’t Lend,” the team will not be able to borrow any additional capacity beyond its reserved capacity.
To maintain a pool of resources that all teams can borrow from, users can set up a dedicated team whose resources bridge the gap between the other teams’ allocations and the total cluster capacity. Make sure that this cumulative resource allocation includes the appropriate instance types and doesn’t exceed the total cluster capacity. To make sure that these resources can be shared among teams, set the participating teams’ compute allocations to “Lend and Borrow” or “Lend” for this common pool of resources. In addition, every time new teams are introduced, quota allocations are changed, or the cluster capacity changes, revisit the quota allocations of all teams to make sure the cumulative quota remains at or below cluster capacity.
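The quota and borrowing arithmetic described above can be sanity-checked with a few lines of Python; the team names, quotas, and usage numbers below mirror the hypothetical example and are not tied to any real configuration.

# Hypothetical quotas and current usage on a 20-instance cluster.
cluster_capacity = 20  # total ml.c5.2xlarge instances in the cluster
quotas = {"team-a": 6, "team-b": 5, "shared-pool": 9}
in_use = {"team-a": 2, "team-b": 4, "shared-pool": 0}

# The cumulative reserved quota should not exceed the cluster capacity.
assert sum(quotas.values()) <= cluster_capacity

def borrow_needed(team: str, task_size: int) -> int:
    """Instances a team must borrow from teams set to Lend or Lend and Borrow."""
    headroom = quotas[team] - in_use[team]
    return max(0, task_size - headroom)

# Team B submits a task needing 4 instances: 1 fits in its own quota, 3 are borrowed.
print(borrow_needed("team-b", 4))  # -> 3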
After compute allocations have been set, the administrator also needs to set a cluster policy, which comprises two components: task prioritization and idle compute allocation. Administrators set up task prioritization, which determines the priority level for tasks running in a cluster. Next, an administrator sets the idle compute allocation setting to either “first come, first serve,” in which tasks are not prioritized, or “fair-share allocation,” in which idle compute is distributed to teams based on their fair-share weight.
Observability
To get started with observability, install the Amazon CloudWatch Observability add-on with Kueue metrics selected. The SageMaker HyperPod task governance dashboard provides a single pane of glass view of cluster utilization across teams. At present, you can view running PyTorch, TensorFlow, and MPI tasks. Administrators can analyze the graphs within the dashboard to understand equity in resource sharing and utilization of resources.
To view utilization of resources, users can see the following dashboard showing GPU and vCPU utilization. These graphs inform administrators where teams can further maximize their GPU utilization. In this example, administrators observe GPU utilization of around 52%.

Administrators have a real-time view of instance utilization as tasks run or are moved to pending during preemption. In this example, the ML engineering team is borrowing 5 GPUs for its training task.

With SageMaker HyperPod, you can additionally set up observability tools of your choice. In our public workshop, we have steps on how to set up Amazon Managed Prometheus and Grafana dashboards.
Data scientist experience
Data scientists are the second persona interacting with SageMaker HyperPod clusters. Data scientists are responsible for the training, fine-tuning, and deployment of models on accelerated compute instances. It’s important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.
Access control
When working with SageMaker HyperPod task governance, data scientists will assume their specific role. Each data science team will need to have their own role and associated role-based access control (RBAC) on the cluster. RBAC prevents data scientists from submitting tasks to teams in which they do not belong. For more information about data science role permissions, see AWS Identity and Access Management for SageMaker HyperPod. As a best practice, administrators should limit data scientists according to the principle of least privilege. After roles and access entries are set up, data scientists can assume their associated AWS Identity and Access Management (IAM) role to submit tasks to corresponding namespaces. It’s important to note that users interacting with the console dashboard who didn’t create the associated EKS cluster will need to have their role added to the AccessEntry list for the EKS cluster.
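For example, adding a data science role to the cluster’s access entries and scoping it to the team’s namespace can be scripted with the EKS access entry APIs. The following is a minimal boto3 sketch; the cluster name, account ID, role name, and access policy are placeholders, and you should scope them to your own least-privilege requirements.

import boto3

eks = boto3.client("eks")

cluster_name = "my-hyperpod-eks-cluster"  # placeholder
role_arn = "arn:aws:iam::111122223333:role/ResearchersDataScientist"  # placeholder

# Register the team's IAM role as an access entry on the EKS cluster.
eks.create_access_entry(
    clusterName=cluster_name,
    principalArn=role_arn,
)

# Scope the role's Kubernetes permissions to the team's namespace only.
eks.associate_access_policy(
    clusterName=cluster_name,
    principalArn=role_arn,
    policyArn="arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy",
    accessScope={"type": "namespace", "namespaces": ["hyperpod-ns-researchers"]},
)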
Submitting tasks
There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: kubectl and the SageMaker HyperPod CLI. With both options, data scientists need to reference their team’s namespace and task priority class in the task configuration file in order to use their allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, SageMaker HyperPod task governance automatically assumes the lowest priority.
In the following code snippet, we show the labels required in a kubectl manifest file for the researchers namespace with inference priority. Priority classes have -priority appended to the name set in the cluster policy. For further guidance on submitting tasks to SageMaker HyperPod task governance, refer to the SageMaker HyperPod task governance documentation.
metadata:
  name: job-name
  namespace: hyperpod-ns-researchers
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-researchers-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
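If you would rather submit jobs programmatically than hand-edit manifests, the same labels apply. The following is a minimal sketch using the official Kubernetes Python client, assuming the Kubeflow training operator’s PyTorchJob CRD is installed on the cluster; the container image and resource values are placeholders rather than a recommended configuration.

from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context

namespace = "hyperpod-ns-researchers"  # team namespace from the manifest above

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {
        "name": "job-name",
        "namespace": namespace,
        "labels": {
            # Required for task governance: the team's local queue and priority class
            "kueue.x-k8s.io/queue-name": "hyperpod-ns-researchers-localqueue",
            "kueue.x-k8s.io/priority-class": "inference-priority",
        },
    },
    "spec": {
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "pytorch",
                                "image": "<your-training-image>",  # placeholder
                                "resources": {"limits": {"nvidia.com/gpu": 1}},
                            }
                        ]
                    }
                },
            }
        }
    },
}

# Create the custom resource; Kueue admits or queues it based on the team's quota.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    plural="pytorchjobs",
    namespace=namespace,
    body=pytorch_job,
)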
HyperPod CLI
The HyperPod CLI was created to abstract the complexities of working with kubectl and enable developers using SageMaker HyperPod to iterate faster with custom commands. HyperPod CLI v2.0.0 introduces a new default scheduler type with autofill commands, auto discovery of namespaces, improved cluster and task management features, and enhanced visibility into task priorities and accelerator quota allocations. Data scientists can use the new HyperPod CLI to quickly submit tasks, iterate, and experiment in their generative AI development lifecycle.
Sample commands
The following is a short reference guide for helpful commands when interacting with SageMaker HyperPod task governance:

Describing cluster policy with the AWS CLI – This AWS Command Line Interface (AWS CLI) command is useful to view the cluster policy settings for your cluster.
List compute quota allocations with the AWS CLI – This AWS CLI command is useful to view the different teams set up with task governance and their respective quota allocation settings (see the boto3 sketch after this list).
HyperPod CLI – The HyperPod CLI abstracts common kubectl commands used to interact with SageMaker HyperPod clusters, such as submitting, listing, and cancelling tasks. Refer to the HyperPod CLI documentation for a full list of commands.
kubectl – You can also use kubectl to interact with task governance with the following example commands:

kubectl get pytorchjobs -n hyperpod-ns-<team-name> – This command shows you the PyTorch tasks running in the specified team namespace.
kubectl get workloads -n hyperpod-ns-<team-name> / kubectl describe workload <workload-name> -n hyperpod-ns-<team-name> – These commands show the workloads running in your cluster per namespace and provide detailed reasonings on Kueue Admission. You can use these commands to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”
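For teams that automate reporting, the two AWS CLI commands above also have boto3 equivalents that are convenient in notebooks or scripts. The sketch below assumes the cluster policy is exposed as a cluster scheduler configuration and that quotas can be listed per cluster; the operation names, parameters, and response keys are assumptions, so confirm them in the current boto3 documentation.

import boto3

sagemaker = boto3.client("sagemaker")
cluster_arn = "arn:aws:sagemaker:us-east-1:111122223333:cluster/EXAMPLE"  # placeholder ARN

# Cluster policy (task prioritization and idle compute allocation).
policies = sagemaker.list_cluster_scheduler_configs(ClusterArn=cluster_arn)
for cfg in policies.get("ClusterSchedulerConfigSummaries", []):
    print(cfg.get("Name"), cfg.get("Status"))

# Compute quota allocations for every team set up with task governance.
quotas = sagemaker.list_compute_quotas(ClusterArn=cluster_arn)
for quota in quotas.get("ComputeQuotaSummaries", []):
    print(quota.get("Name"), quota.get("ComputeQuotaTarget"))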

Common scenarios
SageMaker HyperPod task governance enables allocating compute quota to teams, increasing utilization of compute resources, reducing costs, and accelerating waiting tasks by priority, in turn accelerating time to market. To relate these value propositions to real-world scenarios, we will talk about an enterprise and a startup situation.
Enterprises have different teams working towards various business goals, each with budgets that limit their compute access. To maximize resource utilization within budget constraints, SageMaker HyperPod task governance allows enterprises to allocate compute quotas to teams for artificial intelligence and machine learning (AI/ML) tasks. When teams use up their allocation, they can access idle compute from other teams to accelerate waiting tasks, providing optimal resource utilization across the organization.
Startups aim to maximize compute resource utilization while achieving timely allocation for high-priority tasks. SageMaker HyperPod task governance’s prioritization feature allows you to assign priorities to different task types, such as prioritizing inference over training. This makes sure that high-priority tasks receive necessary compute resources before lower-priority ones, optimizing overall resource allocation.
Now we will walk you through two common scenarios for users interacting with SageMaker HyperPod task governance.
Scenario 1: Enterprise
In the first scenario, we have an enterprise company that wants to manage compute allocations to optimize for cost. This company has five teams sharing 80 GPUs, with the following configuration:

Team 1 – Compute allocation: 20; Strategy: Don’t Lend
Team 2 – Compute allocation: 20; Strategy: Don’t Lend
Team 3 – Compute allocation: 5; Strategy: Lend & Borrow at 150%; Fair-share weight: 100
Team 4 – Compute allocation: 10; Strategy: Lend & Borrow at 100%; Fair-share weight: 75
Team 5 – Compute allocation: 25; Strategy: Lend & Borrow at 50%; Fair-share weight: 50

This sample configuration reserves capacity for teams that will constantly use instances for high-priority tasks. In addition, a few teams have the option to lend and borrow idle compute from other teams—this improves cost optimization by reserving capacity as needed and allowing inconsistent workloads to run on available idle compute with prioritization.
Scenario 2: Startup
In the second scenario, we have a startup customer that wants to provide equitable compute allocation for members of its engineering and research teams. This company has three teams sharing 15 GPUs:

Team 1 (ML engineering) – Compute allocation: 6; Strategy: Lend & Borrow at 50%; Fair-share weight: 100
Team 2 (Researchers) – Compute allocation: 5; Strategy: Lend & Borrow at 50%; Fair-share weight: 100
Team 3 (Real-time chatbot) – Compute allocation: 4; Strategy: Don’t Lend; Fair-share weight: 100

This sample configuration promotes equitable compute allocation across the company because all teams have the same fair-share weight and are able to preempt tasks with lower priority.

Conclusion
In this post, we discussed best practices for efficient use of SageMaker HyperPod task governance. We also provided certain patterns that you can adopt while administering generative AI tasks, whether you are aiming to optimize for cost or optimize for equitable compute allocation. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance.

About the Author
Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.
Chaitanya Hazarey leads software development for SageMaker HyperPod task governance at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.
 Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.

How Formula 1® uses generative AI to accelerate race-day issue resolu …

Formula 1® (F1) races are high-stakes affairs where operational efficiency is paramount. During these live events, F1 IT engineers must triage critical issues across its services, such as network degradation to one of its APIs. This impacts downstream services that consume data from the API, including products such as F1 TV, which offers live and on-demand coverage of every race as well as real-time telemetry. Determining the root cause of these issues and preventing them from happening again takes significant effort. Due to the event schedule and change freeze periods, it can take up to 3 weeks to triage, test, and resolve a critical issue, requiring investigations across teams including development, operations, infrastructure, and networking.
“We used to have a recurring issue with the web API system, which was slow to respond and provided inconsistent outputs. Teams spent around 15 full engineer days to iteratively resolve the issue over several events: reviewing logs, inspecting anomalies, and iterating on the fixes,” says Lee Wright, head of IT Operations at Formula 1. Recognizing this challenge as an opportunity for innovation, F1 partnered with Amazon Web Services (AWS) to develop an AI-driven solution using Amazon Bedrock to streamline issue resolution. In this post, we show you how F1 created a purpose-built root cause analysis (RCA) assistant to empower users such as operations engineers, software developers, and network engineers to troubleshoot issues, narrow down on the root cause, and significantly reduce the manual intervention required to fix recurrent issues during and after live events. We’ve also provided a GitHub repo for a general-purpose version of the accompanying chat-based application.
Users can ask the RCA chat-based assistant questions using natural language prompts, with the solution troubleshooting in the background, identifying potential reasons for the incident and recommending next steps. The assistant is connected to internal and external systems, with the capability to query various sources such as SQL databases, Amazon CloudWatch logs, and third-party tools to check the live system health status. Because the solution doesn’t require domain-specific knowledge, it even allows engineers of different disciplines and levels of expertise to resolve issues.
“With the RCA tool, the team could narrow down the root cause and implement a solution within 3 days, including deployments and testing over a race weekend. The system not only saves time on active resolution, it also routes the issue to the correct team to resolve, allowing teams to focus on other high-priority tasks, like building new products to enhance the race experience,” adds Wright. By using generative AI, engineers can receive a response within 5–10 seconds on a specific query and reduce the initial triage time from more than a day to less than 20 minutes. The end-to-end time to resolution has been reduced by as much as 86%.
Implementing the root cause analysis solution architecture
In collaboration with the AWS Prototyping team, F1 embarked on a 5-week prototype to demonstrate the feasibility of this solution. The objective was to use AWS to replicate and automate the current manual troubleshooting process for two candidate systems. As a starting point, the team reviewed real-life issues and drafted a flowchart outlining 1) the troubleshooting process, 2) the teams and systems involved, 3) the required live checks, and 4) the log investigations required for each scenario. The following is a diagram of the solution architecture.

To handle the log data efficiently, raw logs were centralized into an Amazon Simple Storage Service (Amazon S3) bucket. An Amazon EventBridge schedule checked this bucket hourly for new files and triggered log transformation extract, transform, and load (ETL) pipelines built using AWS Glue and Apache Spark. The transformed logs were stored in a separate S3 bucket, while another EventBridge schedule fed these transformed logs into Amazon Bedrock Knowledge Bases, an end-to-end managed Retrieval Augmented Generation (RAG) workflow capability, allowing the chat assistant to query them efficiently. Amazon Bedrock Agents facilitates interaction with internal systems such as databases and Amazon Elastic Compute Cloud (Amazon EC2) instances and external systems such as Jira and Datadog. Anthropic’s Claude 3 models (the latest model at the time of development) were used to orchestrate and generate high-quality responses, maintaining accurate and relevant information from the chat assistant. Finally, the chat application is hosted in an AWS Fargate for Amazon Elastic Container Service (Amazon ECS) service, providing scalability and reliability to handle variable loads without compromising performance.
The following sections further explain the main components of the solution: ETL pipelines to transform the log data, agentic RAG implementation, and the chat application.
Creating ETL pipelines to transform log data
Preparing your data to provide quality results is the first step in an AI project. AWS helps you improve your data quality over time so you can innovate with trust and confidence. Amazon CloudWatch gives you visibility into system-wide performance and allows you to set alarms, automatically react to changes, and gain a unified view of operational health.
For this solution, AWS Glue and Apache Spark handled data transformations from these logs and other data sources to improve the chatbot’s accuracy and cost efficiency. AWS Glue helps you discover, prepare, and integrate your data at scale. For this project, there was a simple three-step process for the log data transformation. The following is a diagram of the data processing flow.

Data standardization: Schemas, types, and formats – Conforming the data to a unified format helps the chat assistant understand the data more thoroughly, improving output accuracy. To enable Amazon Bedrock Knowledge Bases to ingest data consumed from different sources and formats (with differences in structure, schema, column names, and timestamp formats), the data must first be standardized.
Data filtering: Removing unnecessary data – To improve the chat assistant’s performance further, it’s important to reduce the amount of data to scan. A simple way to do that is to determine which data columns wouldn’t be used by the chat assistant. This removed a considerable amount of data in the ETL process even before ingestion into the knowledge base. It also reduced costs in the embeddings process, because less data had to be transformed, tokenized, and stored in the vector database. All this helps improve the chat assistant’s accuracy, performance, and cost. For example, the chat assistant doesn’t need all the headers from some HTTP requests, but it does need the host and user agent.
Data aggregation: Reducing data size – Users only need to know by the minute when a problem occurred, so aggregating data at the minute level helped to reduce the data size. For example, when there are 60 data points per minute with API response times, data was aggregated to a single data point per minute. This single aggregated event contains attributes such as the maximum time taken to fulfill a request, focusing the chat assistant to identify if the response time was high—again reducing the data needed to analyze the issue.
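To make the aggregation step concrete, the following is a minimal PySpark sketch of the kind of transformation the AWS Glue jobs perform, assuming the standardized logs expose a timestamp, an endpoint, and a response_time_ms column; the bucket paths and column names are placeholders, not F1’s actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-aggregation").getOrCreate()

# Standardized logs written by the previous ETL step (placeholder path and schema).
logs = spark.read.parquet("s3://example-bucket/standardized-logs/")

# Keep only the columns the chat assistant needs (data filtering), then collapse
# the per-second data points into one record per minute (data aggregation).
per_minute = (
    logs.select("timestamp", "endpoint", "response_time_ms")
    .withColumn("minute", F.date_trunc("minute", F.col("timestamp")))
    .groupBy("minute", "endpoint")
    .agg(
        F.max("response_time_ms").alias("max_response_time_ms"),
        F.count("*").alias("request_count"),
    )
)

per_minute.write.mode("overwrite").parquet("s3://example-bucket/aggregated-logs/")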

Building the RCA assistant with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases
Amazon Bedrock was used to build an agentic (agent-based) RAG solution for the RCA assistant. Amazon Bedrock Agents streamlines workflows and automates repetitive tasks. Agents uses the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. They use the provided instruction to create an orchestration plan and then carry out the plan by invoking company APIs and accessing knowledge bases using RAG to provide a final response to the end user.
Knowledge bases are essential to the RAG framework, querying business data sources and adding relevant context to answer your questions. Amazon Bedrock Agents also allows interaction with internal and external systems, such as querying database statuses to check their health, querying Datadog for live application monitoring, and raising Jira tickets for future analysis and investigation. Anthropic’s Claude 3 Sonnet model was selected for informative and comprehensive answers and the ability to understand diversified questions. For example, it can correctly interpret user input date formats such as “2024-05-10” or “10th May 2024.”
Amazon Bedrock Agents integrates with Amazon Bedrock Knowledge Bases, providing the end user with a single and consolidated frontend. The RCA agent considers the tools and knowledge bases available, then intelligently and autonomously creates an execution plan. After the agent receives documents from the knowledge base and responses from tool APIs, it consolidates the information to feed it to the large language model (LLM) and generate the final response. The following diagram illustrates the orchestration flow.

Systems security
With Amazon Bedrock, you have full control over the data used to customize the FMs for generative AI applications such as RCA. Data is encrypted in transit and at rest. Identity-based policies provide further control over your data, helping you manage what actions roles can perform, on which resources, and under what conditions.
To evaluate the health of the systems under investigation, the RCA agent runs a series of checks, such as AWS Boto3 API calls (for example, boto3_client.describe_security_groups, to determine if an IP address is allowed to access the system) or database SQL queries (for example, sys.dm_os_schedulers, to query database system metrics such as CPU, memory, or user locks).
To help protect these systems against potential hallucinations or even prompt injections, agents aren’t allowed to create their own database queries or system health checks on the fly. Instead, a series of controlled SQL queries and API checks were implemented, following the principle of least privilege (PoLP). This layer also validates the input and output schema (see Powertools docs), making sure this aspect is also controlled. To learn more about protecting your application, refer to the ArXiv paper, From Prompt Injections to SQL Injection Attacks. The following code is an example.

"""
– Health Checks: one explicit function per Health Check, to avoid potential LLM hallucinations or risky syntax errors.
– DB is KMS-encrypted and behind private subnets. Connection uses Least-Privileges and Secrets Manager
– Schema is protected using OpenAPI, via AWS Lambda Powertools BedrockAgentResolver
"""

from typing import List, Annotated
from helpers import run_sql_query, check_ec2_port_access
from aws_lambda_powertools.event_handler.bedrock_agent import BedrockAgentResolver
from aws_lambda_powertools.event_handler.openapi.params import Query, Body
from aws_lambda_powertools import Metrics, Tracer, Logger
from aws_lambda_powertools.metrics import MetricUnit

# Initialize Agents, Metrics, Loggers and Tracers
app = BedrockAgentResolver()
metrics = Metrics(namespace="rca-stack-api-logs", service="HealthChecks")
tracer = Tracer()
logger = Logger(level="INFO")

@tracer.capture_method
@app.get("/checkDatabaseCPUMemory", description="Checks the CPU and Memory usage, for the Database server.")
def check_db_cpu_memory() -> Annotated[List, Body(description="Returns Database CPU and Memory metrics")]:
    response = run_sql_query("db_cpu_memory")
    metrics.add_metric(name="DBCpuMemory", unit=MetricUnit.Count, value=1)
    logger.info(response)

    return response

Frontend application: The chat assistant UI
The chat assistant UI was developed using the Streamlit framework, which is Python-based and provides simple yet powerful application widgets. In the Streamlit app, users can test their Amazon Bedrock agent iterations seamlessly by providing or replacing the agent ID and alias ID. In the chat assistant, the full conversation history is displayed, and the conversation can be reset by choosing Clear. The response from the LLM application consists of two parts. On the left is the final neutral response based on the user’s questions. On the right is the trace of LLM agent orchestration plans and executions, which is hidden by default to keep the response clean and concise. The trace can be reviewed and examined by the user to make sure that the correct tools are invoked and the correct documents are retrieved by the LLM chatbot.
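The following is a minimal Streamlit sketch of that layout, assuming the agent is called through the Amazon Bedrock Agents runtime API; the agent and alias IDs are placeholders, the trace rendering is simplified, and this is not F1’s production UI.

import uuid

import boto3
import streamlit as st

agents_runtime = boto3.client("bedrock-agent-runtime")

def ask_rca_agent(question: str, agent_id: str, agent_alias_id: str) -> str:
    """Send a question to the agent and return the streamed text response."""
    response = agents_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=st.session_state.session_id,
        inputText=question,
        enableTrace=True,  # surfaces orchestration traces alongside the answer
    )
    # The completion is returned as a stream of chunk events.
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )

st.title("RCA Assistant")

# Swap these to test different agent iterations.
agent_id = st.sidebar.text_input("Agent ID", value="AGENT_ID")
agent_alias_id = st.sidebar.text_input("Agent alias ID", value="ALIAS_ID")

if "history" not in st.session_state:
    st.session_state.history = []
    st.session_state.session_id = str(uuid.uuid4())

if st.button("Clear"):
    st.session_state.history = []
    st.session_state.session_id = str(uuid.uuid4())

if question := st.chat_input("Describe the issue you are seeing"):
    answer = ask_rca_agent(question, agent_id, agent_alias_id)
    st.session_state.history.append(("user", question))
    st.session_state.history.append(("assistant", answer))

# Full conversation history, with the orchestration trace hidden by default.
for role, text in st.session_state.history:
    with st.chat_message(role):
        st.markdown(text)
        if role == "assistant":
            with st.expander("Agent trace", expanded=False):
                st.write("Trace events from enableTrace would be rendered here.")  # placeholder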
A general-purpose version of the chat-based application is available from this GitHub repo, where you can experiment with the solution and modify it for additional use cases.
In the following demo, the scenario involves user complaints that they can’t connect to F1 databases. Using the chat assistant, users can check if the database driver version they’re using is supported by the server. Additionally, users can verify EC2 instance network connectivity by providing the EC2 instance ID and AWS Region. These checks are performed by API tools accessible by the agent. Furthermore, users can troubleshoot website access issues by checking system logs. In the demo, users provide an error code and date, and the chat assistant retrieves relevant logs from Amazon Bedrock Knowledge Bases to answer their questions and provide information for future analysis.

Technical engineers can now use natural language queries to investigate system errors and issues. The assistant is integrated with existing incident management tools (such as Jira) to facilitate seamless communication and ticket creation. In most cases, the chat assistant can quickly identify the root cause and provide remediation recommendations, even if multiple issues are present. When warranted, particularly challenging issues are automatically escalated to the F1 engineering team for investigation, allowing engineers to better prioritize their tasks.
Conclusion
In this post, we explained how F1 and AWS have developed a root cause analysis (RCA) assistant powered by Amazon Bedrock to reduce manual intervention and accelerate the resolution of recurrent operational issues during races from weeks to minutes. The RCA assistant enables the F1 team to spend more time on innovation and improving its services, ultimately delivering an exceptional experience for fans and partners. The successful collaboration between F1 and AWS showcases the transformative potential of generative AI in empowering teams to accomplish more in less time.
Learn more about how AWS helps F1 on and off the track.

About the Author
Carlos Contreras is a Senior Big Data and Generative AI Architect, at Amazon Web Services. Carlos specializes in designing and developing scalable prototypes for customers, to solve their most complex business challenges, implementing RAG and Agentic solutions with Distributed Data Processing techniques.
Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training.
Olga Miloserdova is an Innovation Lead at Amazon Web Services, where she supports executive leadership teams across industries to drive innovation initiatives leveraging Amazon’s customer-centric Working Backwards methodology.
Ying Hou, PhD is a Senior GenAI Prototyping Architect at AWS, where she collaborates with customers to build cutting-edge GenAI applications, specialising in RAG and agentic solutions. Her expertise spans GenAI, ASR, Computer Vision, NLP, and time series prediction models. When she’s not architecting AI solutions, she enjoys spending quality time with her family, getting lost in novels, and exploring the UK’s national parks.

Introducing The First And Only ESP Built For Website Visitor Identific …

In February 2025, Klaviyo announced significant pricing changes that have left many businesses re-evaluating their email marketing strategies. 

The shift to billing based on active profiles means that even unengaged or inactive contacts can increase costs, leading to unexpected budget strains for many.

Needless to say, people are not happy.

“This is going to hurt. Looks like Klaviyo needs more money. With our current 100k active profiles we would be looking at a 345% price increase from $400/month to $1,380 – even though we usually stay under the 250k monthly email cap. Guess we need to start suppressing…” — Michael Simpson (@Michael_in_biz), January 14, 2025

For businesses relying on Klaviyo, this update isn’t just a small tweak…it’s a major shake-up. More contacts now mean more costs, whether they’re engaged or not.

At Customers.ai, we feel that growing your email list shouldn’t come with runaway expenses.

That’s exactly why we built Customers.ai ESP – the first ESP designed specifically for Website Visitor Identification.

With Customers.ai ESP, you can engage visitor ID contacts separately, qualify high-intent shoppers, and only push the best leads into Klaviyo or Mailchimp. The best part? You can still grow your list while keeping costs in check.

Exciting, right?! Let’s take a closer look.

The Customers.ai Advantage: The Only Website Visitor ID Product With A Natively Integrated ESP

Most ESPs were built in a world where email lists were static, engagement tracking was basic, and segmentation meant sorting by “new” vs. “returning” customers. That world is gone.

Today, your email list is constantly evolving. People visit your site, browse, engage, and drop off, all before ever filling out a form. 

If your ESP wasn’t built to keep up with this, you’re either missing opportunities or paying way too much to send emails that never land.

That’s where Customers.ai ESP comes in.

Better Data = Better Deliverability

Bad email data is a one-way ticket to the spam folder. Sending to disengaged or low-quality contacts tanks your sender reputation and once you’re flagged, good luck getting back in the inbox.

That’s why we have a built-in filter that automatically screens out disengaged contacts before they ever damage your deliverability. 

Unlike other visitor identification platforms, we don’t just dump every email into your ESP and hope for the best – we only send high-quality, verified contacts that are ready to engage.

Power Your Automations with Real-Time Segmentation

Most ESPs let you segment based on basic attributes like age, location, or purchase history. But if you’re still relying on old-school lists, you’re missing out big time.

With Customers.ai ESP, you can:

Segment based on real visitor behavior

Push contacts to your other ESP only when they engage, protecting your Klaviyo or Mailchimp account.

Trigger automations in real time so you reach the right people at the right moment – before they bounce.

This isn’t just email marketing. This is email marketing built for the way people actually shop today.

Why You Need the Customers.ai ESP

Not all ESPs are built the same. Some were made for traditional email lists. Some were made for newsletters. 

The Customers.ai ESP was built for high-performance, data-driven marketing that turns anonymous visitors into customers – without blowing up your budget.

Here’s what makes our ESP different:

1. Protect Your Deliverability

Your sender reputation is everything. Once you start landing in spam, it’s a steep climb back. Customers.ai ESP keeps your list clean and your emails landing in the inbox.

2. ESP Interoperability

Already using Klaviyo or Mailchimp? Keep them! 

Our ESP is designed to work alongside your existing provider, so you don’t have to choose between tools. You get the best of both.

3. Cut Klaviyo Costs

Here’s the problem: Website Visitor ID tools generate 10x more email contacts but they also 10x your Klaviyo bill. Not with us. 

We keep costs low so you can scale your email marketing without draining your budget.

4. Engage Website ID Contacts Separately

Not every contact should be dumped into your primary ESP. With Customers.ai ESP, you can keep your website visitor ID contacts separate and only push high-intent contacts to your other ESP once they’re ready to convert.

5. Superpower Your Segmentations

Forget broad audience lists. Use real-time consumer data to power hyper-targeted automations that reach the right people at the right time.

6. Build Beautiful Emails

Good segmentation is only half the battle. You also need emails that convert. 

“New feature alert! The new and improved drag-and-drop email editor. Our partner lilikoi agency used it to set up an on-brand welcome email to high-intent visitors for Solaris Solar. The result? A 37.5% open rate and 9.9% click rate.” — CustomersAI (@CustomersAI), December 4, 2024

Our drag-and-drop builder makes it easy to create high-impact emails that engage and drive action.

7. Engage Website Visitors at Scale

Previously anonymous visitors don’t have to stay anonymous. Customers.ai ESP lets you reach, engage, and convert website traffic in ways no other ESP can.

This is how email should work in 2025 – smarter, more scalable, and built to drive results.

Key Features That Set Our ESP Apart

Most ESPs charge more as you grow. Customers.ai ESP is designed to let you scale without the steep price hikes. 

But cost savings aren’t the only thing that makes it different. 

It’s packed with automation, advanced deliverability tools, and AI-powered customization that other ESPs don’t offer. Basically? More power, fewer limits, better pricing.

Here are some of the core features that set us apart:

Built for Growth Without the Cost Creep

Klaviyo and Mailchimp pricing increases fast, especially when your contact list grows. Customers.ai ESP includes 10 email sends per contact at a fraction of the cost, making high-volume email marketing actually affordable.

Unlimited Everything

Hitting limits on email automations and campaigns is a thing of the past. Customers.ai ESP gives you unlimited email automations, unlimited campaigns, and unlimited email aliases so you can run as many workflows and sequences as you need – without restrictions.

Email Deliverability That Works for You

A bad sender reputation means emails don’t land, and your entire strategy falls apart. Built-in email warmup and deliverability monitoring ensure you stay out of spam and get in front of your audience.

Automation and AI-Powered Personalization

The right message at the right time drives conversions, but manually creating every email is inefficient. With an automation builder and AI-driven email customization, Customers.ai ESP makes scaling personalized email campaigns simple.

Seamless CRM Integration

Connect your ESP to your CRM for better customer insights, cleaner data, and smoother workflows. 

No more disjointed platforms or missing information. Everything works together.

Advanced Analytics and Reporting

Stop guessing what’s working and what’s not. Customers.ai ESP provides deep insights into your campaigns so you can optimize in real time and make data-backed decisions.

We feel confident in saying that the Customers.ai ESP isn’t just another email platform. It’s a smarter, cost-effective way to scale your email marketing.

The Future of Email Marketing Starts with Customers.ai

Email is evolving and the brands that win aren’t the ones just sending more emails. They’re the ones sending smarter emails – at the right time, to the right people, with precision.

That’s exactly what Customers.ai ESP was built for. 

More powerful segmentation, lower costs, and real-time engagement that turns website visitors into customers. 

Whether you use it as a standalone ESP or alongside your existing provider, it’s designed to help you scale without limits, protect your deliverability, and maximize every contact.

If you’re ready to stop overpaying for outdated email strategies and start using the first ESP built for anonymous visitor identification, it’s time to see Customers.ai ESP in action.

See the Anonymous Visitors Hiding on Your Site

Book a demo of Customers.ai’s U.S. website visitor identification, customer journey insights and remarketing platform to skyrocket conversions and sales.

Book a Demo

Important Next Steps

See what targeted outbound marketing is all about. Capture and engage your first 500 website visitor leads with Customers.ai X-Ray website visitor identification for free.

Talk and learn about sales outreach automation with other growth enthusiasts. Join Customers.ai Island, our Facebook group of 40K marketers and entrepreneurs who are ready to support you.

Advance your marketing performance with Sales Outreach School, a free tutorial and training area for sales pros and marketers.

The post Introducing The First And Only ESP Built For Website Visitor Identification appeared first on Customers.ai.

Scale AI Research Introduces J2 Attackers: Leveraging Human Expertise …

Transforming language models into effective red teamers is not without its challenges. Modern large language models have transformed the way we interact with technology, yet they still struggle with preventing the generation of harmful content. Efforts such as refusal training help these models deny risky requests, but even these safeguards can be bypassed with carefully designed attacks. This ongoing tension between innovation and security remains a critical issue in deploying these systems responsibly.

In practice, ensuring safety means contending with both automated attacks and human-crafted jailbreaks. Human red teamers often devise sophisticated multi-turn strategies that expose vulnerabilities in ways that automated techniques sometimes miss. However, relying solely on human expertise is resource intensive and lacks the scalability required for widespread application. As a result, researchers are exploring more systematic and scalable methods to assess and strengthen model safety.

Scale AI Research introduces J2 attackers to address these challenges. In this approach, a human red teamer first “jailbreaks” a refusal-trained language model, encouraging it to bypass its own safeguards. This transformed model, now referred to as a J2 attacker, is then used to systematically test vulnerabilities in other language models. The process unfolds in a carefully structured manner that balances human guidance with automated, iterative refinement.

The J2 method begins with a manual phase where a human operator provides strategic prompts and specific instructions. Once the initial jailbreak is successful, the model enters a multi-turn conversation phase where it refines its tactics using feedback from previous attempts. This blend of human expertise and the model’s own in-context learning abilities creates a feedback loop that continuously improves the red teaming process. The result is a measured and methodical system that challenges existing safeguards without resorting to sensationalism.

The technical framework behind J2 attackers is thoughtfully designed. It divides the red teaming process into three distinct phases: planning, attack, and debrief. During the planning phase, detailed prompts break down conventional refusal barriers, allowing the model to prepare its approach. The subsequent attack phase consists of a series of controlled, multi-turn dialogues with the target model, each cycle refining the strategy based on prior outcomes.

In the debrief phase, an independent evaluation is conducted to assess the success of the attack. This feedback is then used to further adjust the model’s tactics, fostering a cycle of continuous improvement. By modularly incorporating diverse red teaming strategies—from narrative-based fictionalization to technical prompt engineering—the approach maintains a disciplined focus on security without overhyping its capabilities.

Empirical evaluations of the J2 attackers reveal encouraging, yet measured, progress. In controlled experiments, models like Sonnet-3.5 and Gemini-1.5-pro achieved attack success rates of around 93% and 91% against GPT-4o on the Harmbench dataset. These figures are comparable to the performance of experienced human red teamers, who averaged success rates close to 98%. Such results underscore the potential of an automated system to assist in vulnerability assessments while still relying on human oversight.

Further insights show that the iterative planning-attack-debrief cycles play a crucial role in refining the process. Studies indicate that approximately six cycles tend to offer a balance between thoroughness and efficiency. An ensemble of multiple J2 attackers, each applying different strategies, further enhances overall performance by covering a broader spectrum of vulnerabilities. These findings provide a solid foundation for future work aimed at further stabilizing and improving the security of language models.

In conclusion, the introduction of J2 attackers by Scale AI represents a thoughtful step forward in the evolution of language model safety research. By enabling a refusal-trained language model to facilitate red teaming, this approach opens new avenues for systematically uncovering vulnerabilities. The work is grounded in a careful balance between human guidance and automated refinement, ensuring that the method remains both rigorous and accessible.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Scale AI Research Introduces J2 Attackers: Leveraging Human Expertise to Transform Advanced LLMs into Effective Red Teamers appeared first on MarkTechPost.

Stanford Researchers Introduced a Multi-Agent Reinforcement Learning F …

Artificial intelligence in multi-agent environments has made significant strides, particularly in reinforcement learning. One of the core challenges in this domain is developing AI agents capable of communicating effectively through natural language. This is particularly critical in settings where each agent has only partial visibility of the environment, making knowledge-sharing essential for achieving collective goals. Social deduction games provide an ideal framework for testing AI’s ability to deduce information through conversations, as these games require reasoning, deception detection, and strategic collaboration.

A key issue in AI-driven social deduction is ensuring that agents can conduct meaningful discussions without relying on human demonstrations. Many language models falter in multi-agent settings due to their dependence on vast datasets of human conversations. The challenge intensifies as AI agents struggle to assess whether their contributions meaningfully impact decision-making. Without a clear mechanism to evaluate the usefulness of their messages, they often generate unstructured and ineffective communication, leading to suboptimal performance in strategic games that require deduction and persuasion.

Existing reinforcement learning approaches attempt to address this problem but frequently fall short. Some techniques depend on pre-existing datasets of human interactions, which are not always available or adaptable to new scenarios. Others incorporate language models with reinforcement learning but fail due to sparse feedback, which makes it difficult for AI to refine its dialogue strategies. Traditional methods thus cannot systematically improve communication skills over time, making AI discussions in multi-agent environments less effective.

A research team from Stanford University introduced an innovative method for training AI agents in social deduction settings without human demonstrations. Their approach leverages multi-agent reinforcement learning to develop AI capable of understanding and articulating meaningful arguments. The research focuses on the game Among Us, where crewmates must identify an imposter through verbal discussions. The researchers designed a training mechanism that divides communication into listening and speaking, allowing the AI to optimize both skills independently. The method integrates a structured reward system that progressively enables agents to refine their discussion techniques.

The methodology introduces a dense reward signal that provides precise feedback to improve communication. AI agents enhance their listening abilities by predicting environmental details based on prior discussions. At the same time, their speaking proficiency improves through reinforcement learning, where messages are assessed based on their impact on other agents’ beliefs. This structured approach ensures that AI-generated messages are logical, persuasive, and relevant to the conversation. The research team employed RWKV, a recurrent neural network model, as the foundation for their training, optimizing it for long-form discussions and dynamic gameplay environments.

Experimental results demonstrated that this training approach significantly improved AI performance compared to traditional reinforcement learning techniques. The trained AI exhibited behaviors akin to human players, including suspect accusation, evidence presentation, and reasoning based on observed actions. The study showed that AI models utilizing this structured discussion learning framework achieved a win rate of approximately 56%, compared to the 28% win rate of reinforcement learning models without the structured dialogue framework. Furthermore, the AI trained using this method outperformed models four times larger in size, underscoring the efficiency of the proposed training strategy. When analyzing discussion behaviors, the research team observed that the AI could accurately identify imposters at a success rate twice as high as baseline reinforcement learning approaches.

Further analysis revealed that AI models trained under this framework adapted effectively to adversarial strategies. Imposters attempted to manipulate discussions by shifting blame, initially confusing AI crewmates. However, the AI agents learned to differentiate between genuine accusations and misleading statements through iterative training. Researchers found that AI-generated messages that explicitly named a suspect were more likely to influence group decisions. This emergent behavior closely resembled human intuition, indicating that the AI could adapt discussion strategies dynamically.

This research marks a significant advancement in AI-driven social deduction. By addressing the communication challenges in multi-agent settings, the study provides a structured and effective framework for training AI agents to engage in meaningful discussions without relying on extensive human demonstrations. The proposed method enhances AI decision-making, allowing for more persuasive and logical reasoning in environments that require collaboration and the detection of deception. The research opens possibilities for broader applications, including AI assistants capable of analyzing complex discussions, negotiating, and strategizing in real-world scenarios.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Stanford Researchers Introduced a Multi-Agent Reinforcement Learning Framework for Effective Social Deduction in AI Communication appeared first on MarkTechPost.

Rethinking AI Safety: Balancing Existential Risks and Practical Challe …

Recent discussions on AI safety increasingly link it to existential risks posed by advanced AI, suggesting that addressing safety inherently involves considering catastrophic scenarios. However, this perspective has drawbacks: it may exclude researchers with different approaches, mislead the public into thinking AI safety is solely about existential threats, and create resistance among skeptics. As AI rapidly advances, policymakers must establish regulatory frameworks and safety standards. While existential risks dominate the current discourse, past technological safety fields—such as aviation, pharmaceuticals, and cybersecurity—have developed robust engineering and governance practices. These frameworks could inform AI safety, ensuring reliable and responsible system deployment.

Researchers from the University of Edinburgh and Carnegie Mellon University highlight that AI safety discussions often focus on existential risks, which may exclude diverse perspectives and mislead public perception. Their systematic review of peer-reviewed research reveals a broad spectrum of safety concerns, including adversarial robustness and interpretability, aligning with traditional system safety practices. The study suggests integrating near-term and long-term risks rather than prioritizing existential threats. While AI safety research evolves rapidly, capturing relevant studies remains challenging. Expanding discourse to incorporate established engineering safety principles can help address immediate and future AI risks effectively.

The researchers systematically reviewed AI safety literature using a structured methodology based on Kitchenham and Charters’ guidelines, complemented by snowball sampling to capture emerging research. They focused on two key research questions: identifying risks across the AI system lifecycle and evaluating proposed mitigation strategies. Their search process involved querying the Web of Science (WoS) and Scopus databases, refining results through hierarchical filters, and supplementing findings with influential seed papers. The review process included screening 2,666 database papers and 117 from snowball sampling, ultimately selecting 383 for analysis. Papers were annotated with metadata such as author affiliations, publication year, and citation count and were categorized based on methodological approaches, specific safety concerns addressed, and risk mitigation strategies.

The study’s bibliometric analysis revealed a steady increase in AI safety research since 2016, driven by advancements in deep learning. A word cloud analysis highlighted key themes such as safe reinforcement learning, adversarial robustness, and domain adaptation. A co-occurrence graph of abstract terms identified four major research clusters: (1) human and societal implications of AI, focusing on trust, accountability, and safety assurance; (2) safe reinforcement learning, emphasizing robust agent control in uncertain environments; (3) supervised learning, particularly in classification tasks, with a focus on robustness, generalization, and accuracy; and (4) adversarial attacks and defense strategies in deep learning models. The findings suggest that AI safety research aligns with traditional safety engineering principles, integrating aspects of reliability engineering, control theory, and cybersecurity to ensure AI systems are both effective and secure.

AI safety research categorizes risks into eight types, including noise, lack of monitoring, system misspecification, and adversarial attacks. Most studies address issues related to noise and outliers, which affect model robustness and generalization. A significant focus is also on monitoring failures, system misspecifications, and control enforcement gaps. Research methods include applied algorithms, simulated agents, analysis frameworks, and mechanistic interpretability. While theoretical works propose conceptual models, applied studies develop practical algorithms. Recent efforts emphasize reinforcement learning safety, adversarial robustness, and explainability. The field parallels traditional engineering safety, integrating verification techniques to enhance AI reliability and mitigate potential risks.

In conclusion, the study systematically reviewed peer-reviewed literature to explore AI safety challenges. The findings highlight diverse motivations and research outcomes aimed at ensuring AI systems are reliable and beneficial. AI safety research addresses various risks, including design flaws, robustness issues, inadequate monitoring, and embedded biases. The study advocates for framing AI safety within broader technological safety, expanding stakeholder engagement, and promoting inclusive research. While existential risks remain relevant, a wider perspective fosters productive discourse. Future research should explore sociotechnical AI safety and incorporate non-peer-reviewed sources for a comprehensive understanding, ensuring AI safety remains an evolving, inclusive, and multidisciplinary field.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Rethinking AI Safety: Balancing Existential Risks and Practical Challenges appeared first on MarkTechPost.