This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks

Neural networks have long been powerful tools for handling complex data-driven tasks. Still, they often struggle to make discrete decisions under strict constraints, like routing vehicles or scheduling jobs. These discrete decision problems, commonly found in operations research, are computationally intensive and difficult to integrate into the smooth, continuous frameworks of neural networks. Such challenges limit the ability to combine learning-based models with combinatorial reasoning, creating a bottleneck in applications that demand both.

A major issue arises when integrating discrete combinatorial solvers with gradient-based learning systems. Many combinatorial problems are NP-hard, so exact solutions cannot be computed in reasonable time for large instances. Existing strategies often depend on exact solvers or introduce continuous relaxations that may not respect the hard constraints of the original problem. These approaches typically carry heavy computational costs, and when exact oracles are unavailable, they fail to deliver consistent gradients for learning. This creates a gap where neural networks can learn representations but cannot reliably make complex, structured decisions in a way that scales.

Commonly used methods rely on exact solvers for structured inference tasks, such as MAP solvers in graphical models or linear programming relaxations. These methods often require repeated oracle calls during each training iteration and depend on specific problem formulations. Techniques like Fenchel-Young losses or perturbation-based methods allow approximate learning, but their guarantees break down when used with inexact solvers like local search heuristics. This reliance on exact solutions hinders their practical use in large-scale, real-world combinatorial tasks, such as vehicle routing with dynamic requests and time windows.

Researchers from Google DeepMind and ENPC propose a novel solution by transforming local search heuristics into differentiable combinatorial layers through the lens of Markov Chain Monte Carlo (MCMC) methods. The researchers create MCMC layers that operate on discrete combinatorial spaces by mapping problem-specific neighborhood systems into proposal distributions. This design allows neural networks to integrate local search heuristics, like simulated annealing or Metropolis-Hastings, as part of the learning pipeline without access to exact solvers. Their approach enables gradient-based learning over discrete solutions by using acceptance rules that correct for the bias introduced by approximate solvers, ensuring theoretical soundness while reducing the computational burden.

In more detail, the researchers construct a framework where local search heuristics propose neighbor solutions based on the problem structure, and MCMC acceptance rules turn these moves into a valid sampling process over the solution space. The resulting MCMC layer approximates the target distribution over feasible solutions and yields unbiased gradient estimates of a target-dependent Fenchel-Young loss, even when only a single MCMC iteration is run per forward pass. This makes learning practical with minimal sampling effort while preserving theoretical convergence properties. By embedding this layer in a neural network, they can train models that predict the parameters of combinatorial problems and improve solution quality over time.
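To make the mechanism concrete, the sketch below shows a generic Metropolis-Hastings step over a discrete solution space, where a local-search move is proposed and accepted according to a score-based Gibbs distribution. It is a simplified illustration under stated assumptions (a symmetric proposal, placeholder score and neighborhood functions), not the paper's exact layer.

import math
import random

def mcmc_layer_sample(theta_score, neighbors, y_init, temperature=1.0, n_steps=1):
    """Illustrative Metropolis-Hastings walk over a discrete solution space.

    theta_score(y): model-predicted score of feasible solution y (higher is better)
    neighbors(y):   list of feasible neighbors of y produced by a local-search move
    y_init:         initial feasible solution (e.g., from a heuristic)
    Assumes a symmetric proposal; asymmetric neighborhoods need a Hastings correction.
    """
    y = y_init
    for _ in range(n_steps):
        y_prop = random.choice(neighbors(y))  # propose a local-search move
        # Metropolis acceptance for the Gibbs distribution proportional to exp(score / T)
        accept_prob = min(1.0, math.exp((theta_score(y_prop) - theta_score(y)) / temperature))
        if random.random() < accept_prob:
            y = y_prop
    return y  # the sample stands in for an exact oracle/argmax call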

The research team evaluated this method on a large-scale dynamic vehicle routing problem with time windows, a complex, real-world combinatorial optimization task. They showed their approach could handle large instances efficiently, significantly outperforming perturbation-based methods under limited time budgets. For example, their MCMC layer achieved a test relative cost of 5.9% compared to anticipative baselines when using a heuristic-based initialization. In comparison, the perturbation-based method achieved 6.3% under the same conditions. Even at extremely low time budgets, such as a 1 ms time limit, their method outperformed perturbation methods by a large margin—achieving 7.8% relative cost versus 65.2% for perturbation-based approaches. They also demonstrated that initializing the MCMC chain with ground-truth solutions or heuristic-enhanced states improved learning efficiency and solution quality, especially when using a small number of MCMC iterations.

This research demonstrates a principled way to integrate NP-hard combinatorial problems into neural networks without relying on exact solvers. The problem of combining learning with discrete decision-making is addressed by using MCMC layers constructed from local search heuristics, enabling theoretically sound, efficient training. The proposed method bridges the gap between deep learning and combinatorial optimization, providing a scalable and practical solution for complex tasks like vehicle routing.


Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, utilizing supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it faces significant constraints because it depends on training queries with verifiable answers, which restricts large-scale training on general-domain queries where verification is intractable. Further, current reward models, categorized into scalar and generative types, cannot effectively scale test-time compute for reward estimation. Existing approaches apply uniform computational resources across all inputs, lacking the adaptability to allocate additional resources to challenging queries that require nuanced analysis.

Formulation strategies and scoring schemes characterize reward models. Numeric approaches assign scalar scores to query-response pairs, while generative methods produce natural language feedback. Scoring follows absolute evaluation of individual pairs or discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies like multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, limiting their effectiveness across diverse query types.

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional computational resources when evaluating responses to complex tasks. RRMs introduce a dimension for enhancing reward modeling by scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs utilize additional test-time compute for complex queries where appropriate rewards are not immediately apparent. This encourages RRMs to self-evolve reward reasoning capabilities without explicit reasoning traces as training data.

RRMs are built on the Qwen2 backbone, a Transformer decoder, and formulate reward modeling as text completion: the model autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the model must choose a preferred response without allowing ties. The researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. RRMs also support multi-response evaluation through ELO rating systems and knockout tournaments, both of which can be combined with majority voting for additional test-time compute: the RRM is sampled multiple times per pairwise comparison and the votes are aggregated into a robust result.
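As a rough illustration of how repeated pairwise judgments can be aggregated, the snippet below sketches majority voting over sampled comparisons and a simple knockout tournament. The judge callable standing in for an RRM call is a placeholder assumption, not the released model API.

import random
from typing import Callable, List

def majority_vote(judge: Callable[[str, str, str], int],
                  query: str, resp_a: str, resp_b: str, n_samples: int = 5) -> int:
    """Sample the (stochastic) judge several times; return 1 if resp_b wins the majority, else 0."""
    votes = [judge(query, resp_a, resp_b) for _ in range(n_samples)]  # judge returns 1 if it prefers resp_b
    return 1 if sum(votes) > n_samples / 2 else 0

def knockout_tournament(judge, query: str, responses: List[str], n_samples: int = 5) -> str:
    """Reduce many candidate responses to a single winner via pairwise majority-vote matches."""
    pool = list(responses)
    while len(pool) > 1:
        random.shuffle(pool)
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(b if majority_vote(judge, query, a, b, n_samples) else a)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate out gets a bye
        pool = winners
    return pool[0]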

Evaluation results show that RRMs achieve competitive performance against strong baselines on RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. Comparing with DirectJudge models trained on identical data reveals substantial performance gaps, indicating RRMs effectively use test-time compute for complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models without additional test-time compute, with majority voting providing substantial improvements across evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.

In conclusion, researchers introduced RRMs to perform explicit reasoning processes before reward assignment to address computational inflexibility in existing reward modeling approaches. Rule-based-reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision. RRMs efficiently utilize test-time compute through parallel and sequential scaling approaches. The effectiveness of RRMs in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment techniques.


Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)

Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution—and it’s already widely used:

LLMs train on AI-generated text

Fraud systems simulate edge cases

Vision models pretrain on fake images

SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

In this tutorial, we’ll use SDV to generate synthetic data step by step.

We will first install the sdv library:

pip install sdv

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames. In this case, we access the main dataset using data['data'].

from sdv.metadata import Metadata

metadata = Metadata.load_from_json('metadata.json')

We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:

The table name

The primary key

The data type of each column (e.g., categorical, numerical, datetime, etc.)

Optional column formats like datetime patterns or ID patterns

Table relationships (for multi-table setups)

Here is a sample metadata.json format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.

You can control how many rows to generate using the num_rows argument.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A great place to start is by generating a quality report.
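If you want a single headline number, you can query the report object directly. Here is a minimal sketch, assuming the standard SDV QualityReport API (get_score and get_details); check your installed SDV version if the method names differ:

# Overall score between 0 and 1 (higher means the synthetic data matches the real data more closely)
print(f"Overall quality score: {quality_report.get_score():.2%}")

# Per-property breakdown, e.g., how well individual column distributions match
print(quality_report.get_details(property_name='Column Shapes'))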

You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.

In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.


NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks

NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks.

The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.

Model Architecture and Training Stack

Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count.

The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments.

This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes.

Performance Benchmarks

Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains.

While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.

Edge-Ready Deployment

One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations.

For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility.

Licensing and Access

The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
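For reference, the following minimal sketch shows how such an open checkpoint is typically loaded with Hugging Face Transformers. The model ID comes from the article; the generation settings are illustrative defaults, and exact prompting conventions (for example, any system-prompt toggles for reasoning behavior) should be taken from the model card rather than from this sketch.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"  # ID referenced in the article
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Build a simple chat prompt and generate a short completion
messages = [{"role": "user", "content": "Explain why the sky is blue in two sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))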

Conclusion

Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.


A Coding Implementation to Build an AI Agent with Live Python Execution and Automated Validation

In this tutorial, we will discover how to harness the power of an advanced AI Agent, augmented with both Python execution and result-validation capabilities, to tackle complex computational tasks. By integrating LangChain’s ReAct agent framework with Anthropic’s Claude API, we build an end-to-end solution to generate Python code and execute it live, capture its outputs, maintain execution state, and automatically verify results against expected properties or test cases. This seamless loop of “write → run → validate” empowers you to develop robust analyses, algorithms, and simple ML pipelines with confidence in every step.

!pip install langchain langchain-anthropic langchain-core anthropic

We install the core LangChain framework along with the Anthropic integration and its core utilities, ensuring you have both the agent orchestration tools (langchain, langchain-core) and the Claude-specific bindings (langchain-anthropic, anthropic) available in your environment.

import os
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool
from langchain_core.prompts import PromptTemplate
from langchain_anthropic import ChatAnthropic
import sys
import io
import re
import json
from typing import Dict, Any, List

We bring together everything needed to build our ReAct-style agent: OS access for environment variables, LangChain's agent constructors (create_react_agent, AgentExecutor), the Tool class for defining custom actions, the PromptTemplate for crafting the chain-of-thought prompt, and Anthropic's ChatAnthropic client for connecting to Claude. Standard Python modules (sys, io, re, json) handle I/O capture, regular expressions, and serialization, while typing provides type hints for clearer, more maintainable code.

class PythonREPLTool:
    def __init__(self):
        self.globals_dict = {
            '__builtins__': __builtins__,
            'json': json,
            're': re
        }
        self.locals_dict = {}
        self.execution_history = []

    def run(self, code: str) -> str:
        try:
            old_stdout = sys.stdout
            old_stderr = sys.stderr
            sys.stdout = captured_output = io.StringIO()
            sys.stderr = captured_error = io.StringIO()

            execution_result = None

            try:
                result = eval(code, self.globals_dict, self.locals_dict)
                execution_result = result
                if result is not None:
                    print(result)
            except SyntaxError:
                exec(code, self.globals_dict, self.locals_dict)

            output = captured_output.getvalue()
            error_output = captured_error.getvalue()

            sys.stdout = old_stdout
            sys.stderr = old_stderr

            self.execution_history.append({
                'code': code,
                'output': output,
                'result': execution_result,
                'error': error_output
            })

            response = f"**Code Executed:**\n```python\n{code}\n```\n\n"
            if error_output:
                response += f"**Errors/Warnings:**\n{error_output}\n\n"
            response += f"**Output:**\n{output if output.strip() else 'No console output'}"

            if execution_result is not None and not output.strip():
                response += f"\n**Return Value:** {execution_result}"

            return response

        except Exception as e:
            sys.stdout = old_stdout
            sys.stderr = old_stderr

            error_info = f"**Code Executed:**\n```python\n{code}\n```\n\n**Runtime Error:**\n{str(e)}\n**Error Type:** {type(e).__name__}"

            self.execution_history.append({
                'code': code,
                'output': '',
                'result': None,
                'error': str(e)
            })

            return error_info

    def get_execution_history(self) -> List[Dict[str, Any]]:
        return self.execution_history

    def clear_history(self):
        self.execution_history = []

This PythonREPLTool encapsulates a stateful in‐process Python REPL: it captures and executes arbitrary code (evaluating expressions or running statements), redirects stdout/stderr to record outputs and errors, and maintains a history of each execution. Returning a formatted summary, including the executed code, any console output or errors, and return values, provides transparent, reproducible feedback for every snippet run within our agent.

class ResultValidator:
def __init__(self, python_repl: PythonREPLTool):
self.python_repl = python_repl

def validate_mathematical_result(self, description: str, expected_properties: Dict[str, Any]) -> str:
“””Validate mathematical computations”””
validation_code = f”””
# Validation for: {description}
validation_results = {{}}

# Get the last execution results
history = {self.python_repl.execution_history}
if history:
last_execution = history[-1]
print(f”Last execution output: {{last_execution[‘output’]}}”)

# Extract numbers from the output
import re
numbers = re.findall(r’d+(?:.d+)?’, last_execution[‘output’])
if numbers:
numbers = [float(n) for n in numbers]
validation_results[‘extracted_numbers’] = numbers

# Validate expected properties
for prop, expected_value in {expected_properties}.items():
if prop == ‘count’:
actual_count = len(numbers)
validation_results[f’count_check’] = actual_count == expected_value
print(f”Count validation: Expected {{expected_value}}, Got {{actual_count}}”)
elif prop == ‘max_value’:
if numbers:
max_val = max(numbers)
validation_results[f’max_check’] = max_val <= expected_value
print(f”Max value validation: {{max_val}} <= {{expected_value}} = {{max_val <= expected_value}}”)
elif prop == ‘min_value’:
if numbers:
min_val = min(numbers)
validation_results[f’min_check’] = min_val >= expected_value
print(f”Min value validation: {{min_val}} >= {{expected_value}} = {{min_val >= expected_value}}”)
elif prop == ‘sum_range’:
if numbers:
total = sum(numbers)
min_sum, max_sum = expected_value
validation_results[f’sum_check’] = min_sum <= total <= max_sum
print(f”Sum validation: {{min_sum}} <= {{total}} <= {{max_sum}} = {{min_sum <= total <= max_sum}}”)

print(“nValidation Summary:”)
for key, value in validation_results.items():
print(f”{{key}}: {{value}}”)

validation_results
“””
return self.python_repl.run(validation_code)

def validate_data_analysis(self, description: str, expected_structure: Dict[str, Any]) -> str:
“””Validate data analysis results”””
validation_code = f”””
# Data Analysis Validation for: {description}
validation_results = {{}}

# Check if required variables exist in global scope
required_vars = {list(expected_structure.keys())}
existing_vars = []

for var_name in required_vars:
if var_name in globals():
existing_vars.append(var_name)
var_value = globals()[var_name]
validation_results[f'{{var_name}}_exists’] = True
validation_results[f'{{var_name}}_type’] = type(var_value).__name__

# Type-specific validations
if isinstance(var_value, (list, tuple)):
validation_results[f'{{var_name}}_length’] = len(var_value)
elif isinstance(var_value, dict):
validation_results[f'{{var_name}}_keys’] = list(var_value.keys())
elif isinstance(var_value, (int, float)):
validation_results[f'{{var_name}}_value’] = var_value

print(f”✓ Variable ‘{{var_name}}’ found: {{type(var_value).__name__}} = {{var_value}}”)
else:
validation_results[f'{{var_name}}_exists’] = False
print(f”✗ Variable ‘{{var_name}}’ not found”)

print(f”nFound {{len(existing_vars)}}/{{len(required_vars)}} required variables”)

# Additional structure validation
for var_name, expected_type in {expected_structure}.items():
if var_name in globals():
actual_type = type(globals()[var_name]).__name__
validation_results[f'{{var_name}}_type_match’] = actual_type == expected_type
print(f”Type check ‘{{var_name}}’: Expected {{expected_type}}, Got {{actual_type}}”)

validation_results
“””
return self.python_repl.run(validation_code)

def validate_algorithm_correctness(self, description: str, test_cases: List[Dict[str, Any]]) -> str:
“””Validate algorithm implementations with test cases”””
validation_code = f”””
# Algorithm Validation for: {description}
validation_results = {{}}
test_results = []

test_cases = {test_cases}

for i, test_case in enumerate(test_cases):
test_name = test_case.get(‘name’, f’Test {{i+1}}’)
input_val = test_case.get(‘input’)
expected = test_case.get(‘expected’)
function_name = test_case.get(‘function’)

print(f”nRunning {{test_name}}:”)
print(f”Input: {{input_val}}”)
print(f”Expected: {{expected}}”)

try:
if function_name and function_name in globals():
func = globals()[function_name]
if callable(func):
if isinstance(input_val, (list, tuple)):
result = func(*input_val)
else:
result = func(input_val)

passed = result == expected
test_results.append({{
‘test_name’: test_name,
‘input’: input_val,
‘expected’: expected,
‘actual’: result,
‘passed’: passed
}})

status = “✓ PASS” if passed else “✗ FAIL”
print(f”Actual: {{result}}”)
print(f”Status: {{status}}”)
else:
print(f”✗ ERROR: ‘{{function_name}}’ is not callable”)
else:
print(f”✗ ERROR: Function ‘{{function_name}}’ not found”)

except Exception as e:
print(f”✗ ERROR: {{str(e)}}”)
test_results.append({{
‘test_name’: test_name,
‘error’: str(e),
‘passed’: False
}})

# Summary
passed_tests = sum(1 for test in test_results if test.get(‘passed’, False))
total_tests = len(test_results)
validation_results[‘tests_passed’] = passed_tests
validation_results[‘total_tests’] = total_tests
validation_results[‘success_rate’] = passed_tests / total_tests if total_tests > 0 else 0

print(f”n=== VALIDATION SUMMARY ===”)
print(f”Tests passed: {{passed_tests}}/{{total_tests}}”)
print(f”Success rate: {{validation_results[‘success_rate’]:.1%}}”)

test_results
“””
return self.python_repl.run(validation_code)

This ResultValidator class builds on the PythonREPLTool to automatically generate and run bespoke validation routines, checking numerical properties, verifying data structures, or running algorithm test cases against the agent’s execution history. Emitting Python snippets that extract outputs, compare them to expected criteria, and summarize pass/fail results closes the loop on “execute → validate” within our agent’s workflow.

python_repl = PythonREPLTool()
validator = ResultValidator(python_repl)

Here, we instantiate our interactive Python REPL tool (python_repl) and then create a ResultValidator tied to that same REPL instance. This wiring ensures any code you execute is immediately available for automated validation steps, closing the loop on execution and correctness checking.

python_tool = Tool(
    name="python_repl",
    description="Execute Python code and return both the code and its output. Maintains state between executions.",
    func=python_repl.run
)

validation_tool = Tool(
    name="result_validator",
    description="Validate the results of previous computations with specific test cases and expected properties.",
    func=lambda query: validator.validate_mathematical_result(query, {})
)

Here, we wrap our REPL and validation methods into LangChain Tool objects, assigning them clear names and descriptions. The agent can invoke python_repl to run code and result_validator to check the last execution against your specified criteria automatically.

prompt_template = """You are Claude, an advanced AI assistant with Python execution and result validation capabilities.

You can execute Python code to solve complex problems and then validate your results to ensure accuracy.

Available tools:
{tools}

Use this format:
Question: the input question you must answer
Thought: analyze what needs to be done
Action: {tool_names}
Action Input: [your input]
Observation: [result]
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I should validate my results
Action: [validation if needed]
Action Input: [validation parameters]
Observation: [validation results]
Thought: I now have the complete answer
Final Answer: [comprehensive answer with validation confirmation]

Question: {input}
{agent_scratchpad}"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["input", "agent_scratchpad"],
    partial_variables={
        "tools": "python_repl - Execute Python code\nresult_validator - Validate computation results",
        "tool_names": "python_repl, result_validator"
    }
)

The prompt template above frames Claude as a dual-capability assistant that first reasons ("Thought"), selects from the python_repl and result_validator tools to run code and check outputs, and then iterates until it has a validated solution. By defining a clear chain-of-thought structure with placeholders for tool names and their usage, it guides the agent to: (1) break down the problem, (2) call python_repl to execute the necessary code, (3) call result_validator to confirm correctness, and finally (4) deliver a self-checked "Final Answer." This scaffolding enforces a disciplined "write → run → validate" workflow.

class AdvancedClaudeCodeAgent:
    def __init__(self, anthropic_api_key=None):
        if anthropic_api_key:
            os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

        self.llm = ChatAnthropic(
            model="claude-3-opus-20240229",
            temperature=0,
            max_tokens=4000
        )

        self.agent = create_react_agent(
            llm=self.llm,
            tools=[python_tool, validation_tool],
            prompt=prompt
        )

        self.agent_executor = AgentExecutor(
            agent=self.agent,
            tools=[python_tool, validation_tool],
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=8,
            return_intermediate_steps=True
        )

        self.python_repl = python_repl
        self.validator = validator

    def run(self, query: str) -> str:
        try:
            result = self.agent_executor.invoke({"input": query})
            return result["output"]
        except Exception as e:
            return f"Error: {str(e)}"

    def validate_last_result(self, description: str, validation_params: Dict[str, Any]) -> str:
        """Manually validate the last computation result"""
        if 'test_cases' in validation_params:
            return self.validator.validate_algorithm_correctness(description, validation_params['test_cases'])
        elif 'expected_structure' in validation_params:
            return self.validator.validate_data_analysis(description, validation_params['expected_structure'])
        else:
            return self.validator.validate_mathematical_result(description, validation_params)

    def get_execution_summary(self) -> Dict[str, Any]:
        """Get summary of all executions"""
        history = self.python_repl.get_execution_history()
        return {
            'total_executions': len(history),
            'successful_executions': len([h for h in history if not h['error']]),
            'failed_executions': len([h for h in history if h['error']]),
            'execution_details': history
        }
This AdvancedClaudeCodeAgent class wraps everything into a single, easy-to-use interface: it configures the Anthropic Claude client (using your API key), instantiates a ReAct-style agent with our python_repl and result_validator tools and the custom prompt, and sets up an executor that drives iterative “think → code → validate” loops. Its run() method lets you submit natural-language queries and returns Claude’s final, self-checked answer; validate_last_result() exposes manual hooks for additional checks; and get_execution_summary() provides a concise report on every code snippet you’ve executed (how many succeeded, failed, and their details).

if __name__ == "__main__":
API_KEY = “Use Your Own Key Here”

agent = AdvancedClaudeCodeAgent(anthropic_api_key=API_KEY)

print(” Advanced Claude Code Agent with Validation”)
print(“=” * 60)

print(“n Example 1: Prime Number Analysis with Twin Prime Detection”)
print(“-” * 60)
query1 = “””
Find all prime numbers between 1 and 200, then:
1. Calculate their sum
2. Find all twin prime pairs (primes that differ by 2)
3. Calculate the average gap between consecutive primes
4. Identify the largest prime gap in this range
After computation, validate that we found the correct number of primes and that all identified numbers are actually prime.
“””
result1 = agent.run(query1)
print(result1)

print(“n” + “=” * 80 + “n”)

print(” Example 2: Advanced Sales Data Analysis with Statistical Validation”)
print(“-” * 60)
query2 = “””
Create a comprehensive sales analysis:
1. Generate sales data for 12 products across 24 months with realistic seasonal patterns
2. Calculate monthly growth rates, yearly totals, and trend analysis
3. Identify top 3 performing products and worst 3 performing products
4. Perform correlation analysis between different products
5. Create summary statistics (mean, median, standard deviation, percentiles)
After analysis, validate the data structure, ensure all calculations are mathematically correct, and verify the statistical measures.
“””
result2 = agent.run(query2)
print(result2)

print(“n” + “=” * 80 + “n”)

print(” Example 3: Advanced Algorithm Implementation with Test Suite”)
print(“-” * 60)
query3 = “””
Implement and validate a comprehensive sorting and searching system:
1. Implement quicksort, mergesort, and binary search algorithms
2. Create test data with various edge cases (empty lists, single elements, duplicates, sorted/reverse sorted)
3. Benchmark the performance of different sorting algorithms
4. Implement a function to find the kth largest element using different approaches
5. Test all implementations with comprehensive test cases including edge cases
After implementation, validate each algorithm with multiple test cases to ensure correctness.
“””
result3 = agent.run(query3)
print(result3)

print(“n” + “=” * 80 + “n”)

print(” Example 4: Machine Learning Model with Cross-Validation”)
print(“-” * 60)
query4 = “””
Build a complete machine learning pipeline:
1. Generate a synthetic dataset with features and target variable (classification problem)
2. Implement data preprocessing (normalization, feature scaling)
3. Implement a simple linear classifier from scratch (gradient descent)
4. Split data into train/validation/test sets
5. Train the model and evaluate performance (accuracy, precision, recall)
6. Implement k-fold cross-validation
7. Compare results with different hyperparameters
Validate the entire pipeline by ensuring mathematical correctness of gradient descent, proper data splitting, and realistic performance metrics.
“””
result4 = agent.run(query4)
print(result4)

print(“n” + “=” * 80 + “n”)

print(” Execution Summary”)
print(“-” * 60)
summary = agent.get_execution_summary()
print(f”Total code executions: {summary[‘total_executions’]}”)
print(f”Successful executions: {summary[‘successful_executions’]}”)
print(f”Failed executions: {summary[‘failed_executions’]}”)

if summary[‘failed_executions’] > 0:
print(“nFailed executions details:”)
for i, execution in enumerate(summary[‘execution_details’]):
if execution[‘error’]:
print(f” {i+1}. Error: {execution[‘error’]}”)

print(f”nSuccess rate: {(summary[‘successful_executions’]/summary[‘total_executions’]*100):.1f}%”)

We instantiate the AdvancedClaudeCodeAgent with your Anthropic API key, run four illustrative example queries (covering prime-number analysis, sales data analytics, algorithm implementations, and a simple ML pipeline), and print each validated result. Finally, it gathers and displays a concise execution summary, including total runs, successes, failures, and error details, demonstrating the agent's live "write → run → validate" workflow.

In conclusion, we have developed a versatile AdvancedClaudeCodeAgent capable of seamlessly blending generative reasoning with precise computational control. At its core, this Agent doesn’t just draft Python snippets; it runs them on the spot and checks their correctness against your specified criteria, closing the feedback loop automatically. Whether you’re performing prime-number analyses, statistical data evaluations, algorithm benchmarking, or end-to-end ML workflows, this pattern ensures reliability and reproducibility.


Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with La …

In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. The tutorial begins by simplifying dependency installation to ensure an effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. It then walks through integrating these tools into an agent architecture built with LangGraph, illustrating practical usage through interactive examples and clear explanations so that both beginners and advanced developers can rapidly deploy custom multi-functional AI agents.

import subprocess
import sys

def install_packages():
    packages = [
        "langgraph",
        "langchain",
        "langchain-anthropic",
        "langchain-community",
        "requests",
        "python-dotenv",
        "duckduckgo-search"
    ]

    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
            print(f"✓ Installed {package}")
        except subprocess.CalledProcessError:
            print(f"✗ Failed to install {package}")

print("Installing required packages...")
install_packages()
print("Installation complete!\n")

We automate the installation of the essential Python packages required for building a LangGraph-based multi-tool AI agent. The function uses subprocess to run pip commands silently and confirms that each package, ranging from LangGraph and LangChain components to web-search and environment-handling tools, is installed successfully. This setup streamlines environment preparation, making the notebook portable and beginner-friendly.

import os
import json
import math
import requests
from typing import Dict, List, Any, Annotated, TypedDict
from datetime import datetime
import operator

from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from duckduckgo_search import DDGS

We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions.

os.environ["ANTHROPIC_API_KEY"] = "Use Your API Key Here"

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

We set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key (which you should replace with a valid key), while os.getenv securely retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times.

from typing import TypedDict

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]

@tool
def calculator(expression: str) -> str:
    """
    Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more.

    Args:
        expression: Mathematical expression as a string (e.g., "2 + 3 * 4", "sin(3.14159/2)")

    Returns:
        Result of the calculation as a string
    """
    try:
        allowed_names = {
            'abs': abs, 'round': round, 'min': min, 'max': max,
            'sum': sum, 'pow': pow, 'sqrt': math.sqrt,
            'sin': math.sin, 'cos': math.cos, 'tan': math.tan,
            'log': math.log, 'log10': math.log10, 'exp': math.exp,
            'pi': math.pi, 'e': math.e
        }

        expression = expression.replace('^', '**')

        result = eval(expression, {"__builtins__": {}}, allowed_names)
        return f"Result: {result}"
    except Exception as e:
        return f"Error in calculation: {str(e)}"

We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution.

@tool
def web_search(query: str, num_results: int = 3) -> str:
    """
    Search the web for information using DuckDuckGo.

    Args:
        query: Search query string
        num_results: Number of results to return (default: 3, max: 10)

    Returns:
        Search results as formatted string
    """
    try:
        num_results = min(max(num_results, 1), 10)

        with DDGS() as ddgs:
            results = list(ddgs.text(query, max_results=num_results))

        if not results:
            return f"No search results found for: {query}"

        formatted_results = f"Search results for '{query}':\n\n"
        for i, result in enumerate(results, 1):
            formatted_results += f"{i}. **{result['title']}**\n"
            formatted_results += f"   {result['body']}\n"
            formatted_results += f"   Source: {result['href']}\n\n"

        return formatted_results
    except Exception as e:
        return f"Error performing web search: {str(e)}"

We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility.

@tool
def weather_info(city: str) -> str:
    """
    Get current weather information for a city using OpenWeatherMap API.
    Note: This is a mock implementation for demo purposes.

    Args:
        city: Name of the city

    Returns:
        Weather information as a string
    """
    mock_weather = {
        "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65},
        "london": {"temp": 15, "condition": "Rainy", "humidity": 80},
        "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70},
        "paris": {"temp": 18, "condition": "Overcast", "humidity": 75}
    }

    city_lower = city.lower()
    if city_lower in mock_weather:
        weather = mock_weather[city_lower]
        return (f"Weather in {city}:\n"
                f"Temperature: {weather['temp']}°C\n"
                f"Condition: {weather['condition']}\n"
                f"Humidity: {weather['humidity']}%")
    else:
        return f"Weather data not available for {city}. (This is a demo with limited cities: New York, London, Tokyo, Paris)"

We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API.

@tool
def text_analyzer(text: str) -> str:
    """
    Analyze text and provide statistics like word count, character count, etc.

    Args:
        text: Text to analyze

    Returns:
        Text analysis results
    """
    if not text.strip():
        return "Please provide text to analyze."

    words = text.split()
    sentences = text.split('.') + text.split('!') + text.split('?')
    sentences = [s.strip() for s in sentences if s.strip()]

    analysis = f"Text Analysis Results:\n"
    analysis += f"• Characters (with spaces): {len(text)}\n"
    analysis += f"• Characters (without spaces): {len(text.replace(' ', ''))}\n"
    analysis += f"• Words: {len(words)}\n"
    analysis += f"• Sentences: {len(sentences)}\n"
    analysis += f"• Average words per sentence: {len(words) / max(len(sentences), 1):.1f}\n"
    analysis += f"• Most common word: {max(set(words), key=words.count) if words else 'N/A'}"

    return analysis

The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count (with and without spaces), word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit.

@tool
def current_time() -> str:
    """
    Get the current date and time.

    Returns:
        Current date and time as a formatted string
    """
    now = datetime.now()
    return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M:%S')}"

The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow.

tools = [calculator, web_search, weather_info, text_analyzer, current_time]

def create_llm():
if ANTHROPIC_API_KEY:
return ChatAnthropic(
model=”claude-3-haiku-20240307″,
temperature=0.1,
max_tokens=1024
)
else:
class MockLLM:
def invoke(self, messages):
last_message = messages[-1].content if messages else “”

if any(word in last_message.lower() for word in [‘calculate’, ‘math’, ‘+’, ‘-‘, ‘*’, ‘/’, ‘sqrt’, ‘sin’, ‘cos’]):
import re
numbers = re.findall(r'[d+-*/.()sw]+’, last_message)
expr = numbers[0] if numbers else “2+2″
return AIMessage(content=”I’ll help you with that calculation.”,
tool_calls=[{“name”: “calculator”, “args”: {“expression”: expr.strip()}, “id”: “calc1″}])
elif any(word in last_message.lower() for word in [‘search’, ‘find’, ‘look up’, ‘information about’]):
query = last_message.replace(‘search for’, ”).replace(‘find’, ”).replace(‘look up’, ”).strip()
if not query or len(query) < 3:
query = “python programming”
return AIMessage(content=”I’ll search for that information.”,
tool_calls=[{“name”: “web_search”, “args”: {“query”: query}, “id”: “search1”}])
elif any(word in last_message.lower() for word in [‘weather’, ‘temperature’]):
city = “New York”
words = last_message.lower().split()
for i, word in enumerate(words):
if word == ‘in’ and i + 1 < len(words):
city = words[i + 1].title()
break
return AIMessage(content=”I’ll get the weather information.”,
tool_calls=[{“name”: “weather_info”, “args”: {“city”: city}, “id”: “weather1″}])
elif any(word in last_message.lower() for word in [‘time’, ‘date’]):
return AIMessage(content=”I’ll get the current time.”,
tool_calls=[{“name”: “current_time”, “args”: {}, “id”: “time1″}])
elif any(word in last_message.lower() for word in [‘analyze’, ‘analysis’]):
text = last_message.replace(‘analyze this text:’, ”).replace(‘analyze’, ”).strip()
if not text:
text = “Sample text for analysis”
return AIMessage(content=”I’ll analyze that text for you.”,
tool_calls=[{“name”: “text_analyzer”, “args”: {“text”: text}, “id”: “analyze1″}])
else:
return AIMessage(content=”Hello! I’m a multi-tool agent powered by Claude. I can help with:n• Mathematical calculationsn• Web searchesn• Weather informationn• Text analysisn• Current time/datennWhat would you like me to help you with?”)

def bind_tools(self, tools):
return self

print(” Note: Using mock LLM for demo. Add your ANTHROPIC_API_KEY for full functionality.”)
return MockLLM()

llm = create_llm()
llm_with_tools = llm.bind_tools(tools)

We initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed.

def agent_node(state: AgentState) -> Dict[str, Any]:
    """Main agent node that processes messages and decides on tool usage."""
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    """Determine whether to continue with tool calls or end."""
    last_message = state["messages"][-1]
    if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
        return "tools"
    return END

We define the agent’s core decision-making logic. The agent_node function handles incoming messages, invokes the language model (with tools), and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow.

def create_agent_graph():
    tool_node = ToolNode(tools)

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", agent_node)
    workflow.add_node("tools", tool_node)

    workflow.add_edge(START, "agent")
    workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    workflow.add_edge("tools", "agent")

    memory = MemorySaver()

    app = workflow.compile(checkpointer=memory)

    return app

print("Creating LangGraph Multi-Tool Agent...")
agent = create_agent_graph()
print("✓ Agent created successfully!\n")

We construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application (app), enabling a structured, memory-aware multi-tool agent ready for deployment.

def test_agent():
    """Test the agent with various queries."""
    config = {"configurable": {"thread_id": "test-thread"}}

    test_queries = [
        "What's 15 * 7 + 23?",
        "Search for information about Python programming",
        "What's the weather like in Tokyo?",
        "What time is it?",
        "Analyze this text: 'LangGraph is an amazing framework for building AI agents.'"
    ]

    print(" Testing the agent with sample queries...\n")

    for i, query in enumerate(test_queries, 1):
        print(f"Query {i}: {query}")
        print("-" * 50)

        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Response: {last_message.content}\n")

        except Exception as e:
            print(f"Error: {str(e)}\n")

The test_agent function is a validation utility that checks whether the LangGraph agent responds correctly across different use cases. It runs predefined queries covering arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query and neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use.

def chat_with_agent():
    """Interactive chat function."""
    config = {"configurable": {"thread_id": "interactive-thread"}}

    print(" Multi-Tool Agent Chat")
    print("Available tools: Calculator, Web Search, Weather Info, Text Analyzer, Current Time")
    print("Type 'quit' to exit, 'help' for available commands\n")

    while True:
        try:
            user_input = input("You: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            elif user_input.lower() == 'help':
                print("\nAvailable commands:")
                print("• Calculator: 'Calculate 15 * 7 + 23' or 'What's sin(pi/2)?'")
                print("• Web Search: 'Search for Python tutorials' or 'Find information about AI'")
                print("• Weather: 'Weather in Tokyo' or 'What's the temperature in London?'")
                print("• Text Analysis: 'Analyze this text: [your text]'")
                print("• Current Time: 'What time is it?' or 'Current date'")
                print("• quit: Exit the chat\n")
                continue
            elif not user_input:
                continue

            response = agent.invoke(
                {"messages": [HumanMessage(content=user_input)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Agent: {last_message.content}\n")

        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {str(e)}\n")

The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval.

if __name__ == "__main__":
    test_agent()

    print("=" * 60)
    print(" LangGraph Multi-Tool Agent is ready!")
    print("=" * 60)

    chat_with_agent()

def quick_demo():
    """Quick demonstration of agent capabilities."""
    config = {"configurable": {"thread_id": "demo"}}

    demos = [
        ("Math", "Calculate the square root of 144 plus 5 times 3"),
        ("Search", "Find recent news about artificial intelligence"),
        ("Time", "What's the current date and time?")
    ]

    print(" Quick Demo of Agent Capabilities\n")

    for category, query in demos:
        print(f"[{category}] Query: {query}")
        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )
            print(f"Response: {response['messages'][-1].content}\n")
        except Exception as e:
            print(f"Error: {str(e)}\n")

print("\n" + "=" * 60)
print(" Usage Instructions:")
print("1. Add your ANTHROPIC_API_KEY to use Claude model")
print("   os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-api-key'")
print("2. Run quick_demo() for a quick demonstration")
print("3. Run chat_with_agent() for interactive chat")
print("4. The agent supports: calculations, web search, weather, text analysis, and time")
print("5. Example: 'Calculate 15*7+23' or 'Search for Python tutorials'")
print("=" * 60)

Finally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agent() to validate functionality with sample queries, followed by launching the interactive chat_with_agent() mode for real-time interaction. The quick_demo() function also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality.

In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. Developers can confidently extend and customize their AI agents with this foundational knowledge.

Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project.
The post Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation appeared first on MarkTechPost.

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability.

In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog. 

Researchers from Stanford, UIUC, CMU, and Visa Research explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations.

The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains. 
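
To make the reward design concrete, below is a minimal Python sketch of a correctness-guided speedup reward under simplifying assumptions; the ExecResult container and the exact scaling are illustrative and not the authors’ implementation.

from dataclasses import dataclass

@dataclass
class ExecResult:
    all_passed: bool   # whether the candidate assembly passed every test case
    exec_time: float   # measured execution time of the candidate, in seconds

def correctness_guided_speedup_reward(result: ExecResult, baseline_time: float) -> float:
    """Reward is 0 for incorrect or invalid programs, otherwise the speedup over the gcc -O3 baseline."""
    if not result.all_passed:
        return 0.0
    return baseline_time / max(result.exec_time, 1e-9)

# Example: a correct candidate running in 0.8s against a 1.2s gcc -O3 baseline earns a reward of 1.5.
print(correctness_guided_speedup_reward(ExecResult(True, 0.8), 1.2))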

The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities. 

In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems. 

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers appeared first on MarkTechPost.

A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen

In this tutorial, we demonstrated how Microsoft’s AutoGen framework empowers developers to orchestrate complex, multi-agent workflows with minimal code. By leveraging AutoGen’s RoundRobinGroupChat and TeamTool abstractions, you can seamlessly assemble specialist assistants, such as Researchers, FactCheckers, Critics, Summarizers, and Editors, into a cohesive “DeepDive” tool. AutoGen handles the intricacies of turn‐taking, termination conditions, and streaming output, allowing you to focus on defining each agent’s expertise and system prompts rather than plumbing together callbacks or manual prompt chains. Whether conducting in‐depth research, validating facts, refining prose, or integrating third‐party tools, AutoGen provides a unified API that scales from simple two‐agent pipelines to elaborate, five‐agent collaboratives.

!pip install -q autogen-agentchat[gemini] autogen-ext[openai] nest_asyncio

We install the AutoGen AgentChat package with Gemini support, the OpenAI extension for API compatibility, and the nest_asyncio library to patch the notebook’s event loop, ensuring you have all the components needed to run asynchronous, multi-agent workflows in Colab.

import os, nest_asyncio
from getpass import getpass

nest_asyncio.apply()
os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")

We import and apply nest_asyncio to enable nested event loops in notebook environments, then securely prompt for your Gemini API key using getpass and store it in os.environ for authenticated model client access.

from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="gemini-1.5-flash-8b",
    api_key=os.environ["GEMINI_API_KEY"],
    api_type="google",
)

We initialize an OpenAI‐compatible chat client pointed at Google’s Gemini by specifying the gemini-1.5-flash-8b model, injecting your stored Gemini API key, and setting api_type=”google”, giving you a ready-to-use model_client for downstream AutoGen agents.

from autogen_agentchat.agents import AssistantAgent

researcher = AssistantAgent(name="Researcher", system_message="Gather and summarize factual info.", model_client=model_client)
factchecker = AssistantAgent(name="FactChecker", system_message="Verify facts and cite sources.", model_client=model_client)
critic = AssistantAgent(name="Critic", system_message="Critique clarity and logic.", model_client=model_client)
summarizer = AssistantAgent(name="Summarizer", system_message="Condense into a brief executive summary.", model_client=model_client)
editor = AssistantAgent(name="Editor", system_message="Polish language and signal APPROVED when done.", model_client=model_client)

We define five specialized assistant agents, Researcher, FactChecker, Critic, Summarizer, and Editor, each initialized with a role-specific system message and the shared Gemini-powered model client, enabling them, respectively, to gather information, verify accuracy, critique content, condense summaries, and polish language within the AutoGen workflow.

from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination

max_msgs = MaxMessageTermination(max_messages=20)
text_term = TextMentionTermination(text="APPROVED", sources=["Editor"])
termination = max_msgs | text_term
team = RoundRobinGroupChat(
    participants=[researcher, factchecker, critic, summarizer, editor],
    termination_condition=termination
)

We import the RoundRobinGroupChat class along with two termination conditions, then compose a stop rule that fires after 20 total messages or when the Editor agent mentions “APPROVED.” Finally, we instantiate a round-robin team of the five specialized agents with that combined termination logic, enabling them to cycle through research, fact-checking, critique, summarization, and editing until one of the stop conditions is met.

from autogen_agentchat.tools import TeamTool

deepdive_tool = TeamTool(team=team, name="DeepDive", description="Collaborative multi-agent deep dive")

We wrap our RoundRobinGroupChat team in a TeamTool named “DeepDive” with a human-readable description, effectively packaging the entire multi-agent workflow into a single callable tool that other agents can invoke seamlessly.

host = AssistantAgent(
    name="Host",
    model_client=model_client,
    tools=[deepdive_tool],
    system_message="You have access to a DeepDive tool for in-depth research."
)

We create a “Host” assistant agent configured with the shared Gemini-powered model_client, grant it the DeepDive team tool for orchestrating in-depth research, and prime it with a system message that informs it of its ability to invoke the multi-agent DeepDive workflow.

import asyncio

async def run_deepdive(topic: str):
    result = await host.run(task=f"Deep dive on: {topic}")
    print(" DeepDive result:\n", result)
    await model_client.close()

topic = "Impacts of Model Context Protocol on Agentic AI"
loop = asyncio.get_event_loop()
loop.run_until_complete(run_deepdive(topic))

Finally, we define an asynchronous run_deepdive function that tells the Host agent to execute the DeepDive team tool on a given topic, prints the comprehensive result, and then closes the model client; it then grabs Colab’s existing asyncio loop and runs the coroutine to completion for a seamless, synchronous execution.

In conclusion, integrating Google Gemini via AutoGen’s OpenAI‐compatible client and wrapping our multi‐agent team as a callable TeamTool gives us a powerful template for building highly modular and reusable workflows. AutoGen abstracts away event loop management (with nest_asyncio), streaming responses, and termination logic, enabling us to iterate quickly on agent roles and overall orchestration. This advanced pattern streamlines the development of collaborative AI systems and lays the foundation for extending into retrieval pipelines, dynamic selectors, or conditional execution strategies.

Check out the Notebook here. All credit for this research goes to the researchers of this project.
The post A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen appeared first on MarkTechPost.

Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO

The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly.

Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model.

Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization (DeGRPO), Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query.

The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types.
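
To illustrate the decoupling idea, here is a rough, simplified sketch (not the paper’s exact objective): the policy-gradient term for the single control token is weighted separately from the term averaged over response tokens, so long answers cannot drown out the mode-selection signal. The function name and the alpha/beta weights are assumptions made for the example.

import torch

def degrpo_style_loss(logp_mode, logp_response, advantage, alpha=1.0, beta=1.0):
    # Separate policy-gradient terms: one for the control token (<short>/<think>), one for the answer tokens.
    mode_term = -(advantage * logp_mode)
    response_term = -(advantage * logp_response).mean()
    return alpha * mode_term + beta * response_term

logp_mode = torch.tensor(-0.7, requires_grad=True)   # log-prob of the chosen control token
logp_response = torch.log(torch.rand(120))           # toy log-probs for 120 answer tokens
advantage = torch.tensor(0.8)                        # group-relative advantage of this rollout
loss = degrpo_style_loss(logp_mode, logp_response, advantage)
loss.backward()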

When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks.

Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Researchers from the National University of Singapore Introduce ‘Thinkless,’ an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO appeared first on MarkTechPost.

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

Recent advances in long-context (LC) modeling have unlocked new capabilities for LLMs and large vision-language models (LVLMs). Long-context vision–language models (LCVLMs) represent an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags. It is still unclear how well current LCVLMs perform in long-context settings, what tasks they struggle with, and how robust they are to input length variation. Current benchmarks face the following problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of context length control, and (d) evaluation at a single context length.

Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack (NIAH) task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.

Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, covering natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme combining vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance poorly predicts overall LC capability, both model types struggle with LC tasks, and stronger reasoning models show better LC performance.

The researchers construct long-context inputs by inserting gold passages containing answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses lead sections from Wikipedia entity pages. Further, Wikipedia pages are split into 100-word passages, and retrieved distractors are added until the desired input lengths are reached. Many-shot in-context learning tasks utilize four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens using the Llama2 tokenizer with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation.
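
The counting rule can be approximated in a few lines of Python; the rounding behavior below is an assumption for illustration, and the benchmark’s exact accounting may differ.

import math

def image_token_count(height_px, width_px, patch=14, unshuffle=2):
    # Each image is split into patch x patch vision patches; pixel unshuffle then merges
    # unshuffle x unshuffle neighboring patches into a single token.
    patches = math.ceil(height_px / patch) * math.ceil(width_px / patch)
    return math.ceil(patches / (unshuffle * unshuffle))

def total_tokens(text_token_count, image_sizes):
    # Cross-modal length = text tokens (e.g., from the Llama2 tokenizer) + visual tokens.
    return text_token_count + sum(image_token_count(h, w) for h, w in image_sizes)

# Example: 5,000 text tokens plus four 896x896 images -> 5000 + 4 * (64 * 64 / 4) = 9096 tokens.
print(total_tokens(5000, [(896, 896)] * 4))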

The evaluation on MMLONGBENCH across tasks and context lengths shows that all models struggle, but closed-source models perform better. At the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o achieving an average score of only 62.9. Gemini-2.5-Pro is the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, the Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4), and Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models also generalize beyond their training context lengths, with Qwen2-VL-72B achieving a 51.9 average score at 128K despite a 32K training window.

In conclusion, researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance unreliably predicts overall long-context ability, and frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH is a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning capabilities.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models appeared first on MarkTechPost.

Principal Financial Group increases Voice Virtual Assistant performance …

This post was cowritten by Mulay Ahmed, Assistant Director of Engineering, and Ruby Donald, Assistant Director of Engineering at Principal Financial Group. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
Principal Financial Group® is an integrated global financial services company with specialized solutions helping people, businesses, and institutions reach their long-term financial goals and access greater financial security.
With US contact centers that handle millions of customer calls annually, Principal® wanted to further modernize their customer call experience. With a robust AWS Cloud infrastructure already in place, they selected a cloud-first approach to create a more personalized and seamless experience for their customers that would:

Understand customer intents through natural language (vs. touch tone experiences)
Assist customers with self-service offerings where possible
Accurately route customer calls based on business rules
Assist engagement center agents with contextual data

Initially, Principal developed a voice Virtual Assistant (VA) using an Amazon Lex bot to recognize customer intents. The VA can perform self-service transactions or route customers to specific call center queues in the Genesys Cloud contact center platform, based on customer intents and business rules.
As customers interact with the VA, it’s essential to continuously monitor its health and performance. This allows Principal to identify opportunities for fine-tuning, which can enhance the VA’s ability to understand customer intents. Consequently, this will reduce fallback intent rates, improve functional intent fulfillment rates, and lead to better customer experiences.
In this post, we explore how Principal used this opportunity to build an integrated voice VA reporting and analytics solution using an Amazon QuickSight dashboard.
Amazon Lex is a service for building conversational interfaces using voice and text. It provides high-quality speech recognition and language understanding capabilities, enabling the addition of sophisticated, natural language chatbots to new and existing applications.
Genesys Cloud, an omni-channel orchestration and customer relationship platform, provides a contact center platform in a public cloud model that enables quick and simple integration of AWS Contact Center Intelligence (AWS CCI). As part of AWS CCI, Genesys Cloud integrates with Amazon Lex, which enables self-service, intelligent routing, and data collection capabilities.
QuickSight is a unified business intelligence (BI) service that makes it straightforward within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data.
Solution overview
Principal required a reporting and analytics solution that would monitor VA performance based on customer interactions at scale, enabling Principal to improve the Amazon Lex bot performance.
Reporting requirements included customer and VA interaction and Amazon Lex bot performance (target metrics and intent fulfillment) analytics to identify and implement tuning and training opportunities.
The solution used a QuickSight dashboard that derives these insights from the following customer interaction data used to measure VA performance:

Genesys Cloud data such as queues and data actions
Business-specific data such as product and call center operations data
Business API-specific data and metrics such as API response codes

The following diagram shows the solution architecture using Genesys, Amazon Lex, and QuickSight.

The solution workflow involves the following steps:

Users call in and interact with Genesys Cloud.
Genesys Cloud calls an AWS Lambda routing function. This function will return a response to Genesys Cloud with the necessary data, to route the customer call. To generate a response, the function fetches routing data from an Amazon DynamoDB table, and requests an Amazon Lex V2 bot to provide an answer on the user intent.
The Amazon Lex V2 bot processes the customer intent and calls a Lambda fulfillment function to fulfill the intent.
The fulfillment function executes custom logic (routing and session variables logic) and calls necessary APIs to fetch the data required to fulfill the intent.
The APIs process and return the data requested (such as data to perform a self-service transaction).
The Amazon Lex V2 bot’s conversation logs are sent to Amazon CloudWatch (these logs will be used for business analytics, operational monitoring, and alerts).
Genesys Cloud calls a third Lambda function to send customer interaction reports. The Genesys report function pushes these reports to an Amazon Simple Storage Service (Amazon S3) bucket (these reports will be used for business analytics).
An Amazon Data Firehose delivery stream ships the conversation logs from CloudWatch to an S3 bucket.
The Firehose delivery stream transforms the logs into Parquet or CSV format using a Lambda function (a minimal sketch of such a transform handler follows this list).
An AWS Glue crawler scans the data in Amazon S3.
The crawler creates or updates the AWS Glue Data Catalog with the schema information.
We use Amazon Athena to query the datasets (customer interaction reports and conversation logs).
QuickSight connects to Athena to query the data from Amazon S3 using the Data Catalog.
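
The following is a minimal sketch of the Firehose transform handler referenced above. It follows the standard Firehose-to-Lambda record contract (base64-encoded records in, transformed records out), but the log field names (sessionId, intentName, nluConfidence) are hypothetical, and Principal’s actual function differs.

import base64
import json

def lambda_handler(event, context):
    """Decode each conversation-log record and re-emit it as a CSV row for delivery to Amazon S3."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        row = ",".join([
            str(payload.get("sessionId", "")),
            str(payload.get("intentName", "")),
            str(payload.get("nluConfidence", "")),
        ]) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(row.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}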

Other design considerations
The following are other key design considerations to implement the VA solution:

Cost optimization – The solution uses Amazon S3 Bucket Keys to optimize on costs:

Reduce the number of Amazon S3 requests to AWS Key Management Service (AWS KMS) to complete encryption operations.
Reduce the number of AWS KMS events in AWS CloudTrail logs.

Encryption – The solution encrypts data at rest with AWS KMS and in transit using SSL/TLS.
Genesys Cloud integration – The integration between the Amazon Lex V2 bot and Genesys Cloud is done using AWS Identity and Access Management (IAM). For more details, see Genesys Cloud.
Logging and monitoring – The solution monitors AWS resources with CloudWatch and uses alerts to receive notification upon failure events.
Least privilege access – The solution uses IAM roles and policies to grant the minimum necessary permissions to users and services.
Data privacy – The solution handles customer sensitive data such as personally identifiable information (PII) according to compliance and data protection requirements. It implements data masking when applicable and appropriate.
Secure APIs – APIs implemented in this solution are protected and designed according to compliance and security requirements.
Data types – The solution defines data types, such as time stamps, in the Data Catalog (and Athena) in order to refresh data (SPICE data) in QuickSight on a schedule.
DevOps – The solution is version controlled, and changes are deployed using pipelines, to enable faster release cycles.
Analytics on Amazon Lex – Analytics on Amazon Lex empowers teams with data-driven insights to improve the performance of their bots. The overview dashboard provides a single snapshot of key metrics, such as the total number of conversations and intent recognition rates. Principal does not use this capability for the following reasons:

The dashboard can’t integrate with external data:

Genesys Cloud data (such as queues and data actions)
Business-specific data (such as product and call center operations data)
Business API-specific data and metrics (such as response codes)

The dashboard can’t be customized to add additional views and data.

Sample dashboard
With this reporting and analytics solution, Principal can consolidate data from multiple sources and visualize the performance of the VA to identify areas of opportunities for improvement. The following screenshot shows an example of their QuickSight dashboard for illustrative purposes.

Conclusion
In this post, we presented how Principal created a report and analytics solution for their VA solution using Genesys Cloud and Amazon Lex, along with QuickSight to provide customer interaction insights.
The VA solution allowed Principal to maintain its existing contact center solution with Genesys Cloud and achieve better customer experiences. It offers other benefits such as the ability for a customer to receive support on some inquiries without requiring an agent on the call (self-service). It also provides intelligent routing capabilities, leading to reduced call time and increased agent productivity.
With the implementation of this solution, Principal can monitor and derive insights from its VA solution and fine-tune its performance accordingly.
In its 2025 roadmap, Principal will continue to strengthen the foundation of the solution described in this post. In a second post, Principal will present how they automate the deployment and testing of new Amazon Lex bot versions.
AWS and Amazon are not affiliates of any company of the Principal Financial Group®. This communication is intended to be educational in nature and is not intended to be taken as a recommendation.
Insurance products issued by Principal National Life Insurance Co (except in NY) and Principal Life Insurance Company®. Plan administrative services offered by Principal Life. Principal Funds, Inc. is distributed by Principal Funds Distributor, Inc. Securities offered through Principal Securities, Inc., member SIPC and/or independent broker/dealers. Referenced companies are members of the Principal Financial Group®, Des Moines, IA 50392. ©2025 Principal Financial Services, Inc. 4373397-042025

About the Authors
Mulay Ahmed is an Assistant Director of Engineering at Principal and well-versed in architecting and implementing complex enterprise-grade solutions on AWS Cloud.
Ruby Donald is an Assistant Director of Engineering at Principal and leads the Enterprise Virtual Assistants Engineering Team. She has extensive experience in building and delivering software at enterprise scale.

Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use

Modern web usage spans many digital interactions, from filling out forms and managing accounts to executing data queries and navigating complex dashboards. Despite the web being deeply intertwined with productivity and work processes, many of these actions still demand repetitive human input. This is especially true in environments that require detailed instructions or decisions beyond mere searches. While artificial intelligence agents have emerged to support task automation, many prioritize complete autonomy. However, this frequently sidelines user control, leading to outcomes that diverge from user expectations. The next leap forward in productivity-enhancing AI involves agents designed not to replace users but to collaborate with them, blending automation with continuous, real-time human input for more accurate and trusted results.

A key challenge in deploying AI agents for web-based tasks is the lack of visibility and intervention. Users often cannot see what steps the agent is planning, how it intends to execute them, or when it might go off track. In scenarios that involve complex decisions, like entering payment information, interpreting dynamic content, or running scripts, users need mechanisms to step in and redirect the process. Without these capabilities, systems risk making irreversible mistakes or misaligning with user goals. This highlights a significant limitation in current AI automation: the absence of structured human-in-the-loop design, where users dynamically guide and supervise agent behavior, without acting merely as spectators.

Previous solutions approached web automation through rule-based scripts or general-purpose AI agents driven by language models. These systems interpret user commands and attempt to carry them out autonomously. However, they often execute plans without surfacing intermediate decisions or allowing meaningful user feedback. A few offer command-line-like interactions, which are inaccessible to the average user and rarely include layered safety mechanisms. Moreover, minimal support for task reuse or performance learning across sessions limits long-term value. These systems also tend to lack adaptability when the context changes mid-task or errors must be corrected collaboratively.

Researchers at Microsoft introduced Magentic-UI, an open-source prototype that emphasizes collaborative human-AI interaction for web-based tasks. Unlike previous systems aiming for full independence, this tool promotes real-time co-planning, execution sharing, and step-by-step user oversight. Magentic-UI is built on Microsoft’s AutoGen framework and is tightly integrated with Azure AI Foundry Labs. It’s a direct evolution from the previously introduced Magentic-One system. With its launch, Microsoft Research aims to address fundamental questions about human oversight, safety mechanisms, and learning in agentic systems by offering an experimental platform for researchers and developers.

Magentic-UI includes four core interactive features: co-planning, co-tasking, action guards, and plan learning. Co-planning lets users view and adjust the agent’s proposed steps before execution begins, offering full control over what the AI will do. Co-tasking enables real-time visibility during operation, letting users pause, edit, or take over specific actions. Action guards are customizable confirmations for high-risk activities like closing browser tabs or clicking “submit” on a form, actions that could have unintended consequences. Plan learning allows Magentic-UI to remember and refine steps for future tasks, improving over time through experience. These capabilities are supported by a modular team of agents: the Orchestrator leads planning and decision-making, WebSurfer handles browser interactions, Coder executes code in a sandbox, and FileSurfer interprets files and data.
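
As a toy illustration of the action-guard idea (not Magentic-UI’s actual code), the sketch below wraps an agent action so that anything flagged as high risk requires explicit user approval before it runs; the action names and callback signature are invented for the example.

HIGH_RISK_ACTIONS = {"click_submit", "close_tab", "execute_code"}  # illustrative action names

def guarded_execute(action, execute_fn, ask_user_fn):
    """Run an action only after explicit approval when it is flagged as high risk."""
    if action in HIGH_RISK_ACTIONS and not ask_user_fn(f"Allow the agent to '{action}'? [y/N] "):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "output": execute_fn()}

# Toy usage with an auto-approving callback standing in for the UI prompt.
print(guarded_execute("close_tab", lambda: "tab closed", lambda prompt: True))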

Technically, when a user submits a request, the Orchestrator agent generates a step-by-step plan. Users can modify it through a graphical interface by editing, deleting, or regenerating steps. Once finalized, the plan is delegated across specialized agents. Each agent reports after performing its task, and the Orchestrator determines whether to proceed, repeat, or request user feedback. All actions are visible on the interface, and users can halt execution at any point. This architecture not only ensures transparency but also allows for adaptive task flows. For example, if a step fails due to a broken link, the Orchestrator can dynamically adjust the plan with user consent.

In controlled evaluations using the GAIA benchmark, which includes complex tasks like navigating the web and interpreting documents, Magentic-UI’s performance was rigorously tested. GAIA consists of 162 tasks requiring multimodal understanding. When operating autonomously, Magentic-UI completed 30.3% of tasks successfully. However, when supported by a simulated user with access to additional task information, success jumped to 51.9%, a 71% improvement. Another configuration using a smarter simulated user improved the rate to 42.6%. Interestingly, Magentic-UI requested help in only 10% of the enhanced tasks and asked for final answers in 18%. In those cases, the system asked for help an average of just 1.1 times. This shows how minimal but well-timed human intervention significantly boosts task completion without high oversight costs.

Magentic-UI also features a “Saved Plans” gallery that displays strategies reused from past tasks. Retrieval from this gallery is approximately three times faster than generating a new plan. A predictive mechanism surfaces these plans while users type, streamlining repeated tasks like flight searches or form submissions. Safety mechanisms are robust. Every browser or code action runs inside a Docker container, ensuring that no user credentials are exposed. Users can define allow-lists for site access, and every action can be gated behind approval prompts. A red-team evaluation further tested it against phishing attacks and prompt injections, where the system either sought user clarification or blocked execution, reinforcing its layered defense model.

Several Key Takeaways from the Research on Magentic-UI:

With simple human input, Magentic-UI boosts task completion by 71% (from 30.3% to 51.9%).

Requests user help in only 10% of enhanced tasks and averages 1.1 help requests per task.

It features a co-planning UI that allows full user control before execution.

Executes tasks via four modular agents: Orchestrator, WebSurfer, Coder, and FileSurfer.

Stores and reuses plans, reducing repeat task latency by up to 3x.

All actions are sandboxed via Docker containers; no user credentials are ever exposed.

Passed red-team evaluations against phishing and injection threats.

Supports fully user-configurable “action guards” for high-risk steps.

Fully open-source and integrated with Azure AI Foundry Labs.

In conclusion, Magentic-UI addresses a long-standing problem in AI automation, the lack of transparency and controllability. Rather than replacing users, it enables them to remain central to the process. The system performs well even with minimal help and learns to improve each time. The modular design, robust safeguards, and detailed interaction model create a strong foundation for future intelligent assistants.

Check out the Technical details and GitHub Page. All credit for this research goes to the researchers of this project.
The post Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use appeared first on MarkTechPost.

Beyond Aha Moments: Structuring Reasoning in Large Language Models

Large Reasoning Models (LRMs) like OpenAI’s o1 and o3, DeepSeek-R1, Grok 3.5, and Gemini 2.5 Pro have shown strong capabilities in long CoT reasoning, often displaying advanced behaviors such as self-correction, backtracking, and verification—collectively known as “aha moments.” These behaviors have been observed to emerge through outcome-driven RL without the need for supervised fine-tuning. Models like DeepSeek-R1 and its open-source replications (e.g., TinyZero and Logic-RL) have demonstrated that carefully designed RL pipelines—using rule-based rewards, curriculum learning, and structured training—can induce such reflective reasoning abilities. However, these emergent behaviors tend to be unpredictable and inconsistent, limiting their practical reliability and scalability.

To address this, researchers have explored structured RL frameworks that target specific reasoning types, such as deduction, abduction, and induction. These approaches involve aligning specialist models, merging them in parameter space, and applying domain-specific continual RL. Tools like Logic-RL use rule-conditioned RL to solve logic puzzles, improving transferability to tasks like math reasoning. Meanwhile, other works propose mechanisms to enhance reasoning robustness, such as training models to reason both forwards and backwards, or iteratively self-critiquing their outputs. Studies analyzing “aha moments” suggest that these behaviors stem from internal shifts in uncertainty, latent representation, and self-assessment, offering new insights into engineering more reliable reasoning models. 

Researchers from the National University of Singapore, Tsinghua University, and Salesforce AI Research address the limitations of relying on spontaneous “aha moments” in large language models by explicitly aligning them with three core reasoning abilities: deduction, induction, and abduction. They introduce a three-stage pipeline—individual meta-ability alignment, parameter-space merging, and domain-specific reinforcement learning—significantly enhancing model performance. Using a programmatically generated, self-verifiable task suite, their approach boosts accuracy over instruction-tuned baselines by over 10%, with further gains from domain-specific RL. This structured alignment framework offers a scalable, generalizable method for improving reasoning across math, coding, and science domains. 

The researchers designed tasks aligned with deduction, induction, and abduction by using a structured “given two, infer the third” format based on hypothesis (H), rule (R), and observation (O). Deduction is framed as satisfiability checking, induction as masked-sequence prediction, and abduction as reverse rule-graph inference. These tasks are synthetically generated and automatically verified. The training pipeline includes three stages: (A) independently training models for each reasoning type using REINFORCE++ with structured rewards, (B) merging models through weighted parameter interpolation, and (C) fine-tuning the unified model on domain-specific data via reinforcement learning, isolating the benefit of meta-ability alignment. 
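
Stage (B), merging through weighted parameter interpolation, reduces to a weighted sum over the specialists’ parameters. The sketch below uses toy tensors in place of full LLM checkpoints, and the 0.6/0.25/0.15 weights are arbitrary.

import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted parameter-space interpolation: theta_merged = sum_i w_i * theta_i."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
            for name in state_dicts[0]}

# Toy "checkpoints" for deduction-, induction-, and abduction-aligned specialists.
deduction = {"layer.weight": torch.full((2, 2), 1.0)}
induction = {"layer.weight": torch.full((2, 2), 2.0)}
abduction = {"layer.weight": torch.full((2, 2), 4.0)}
merged = merge_state_dicts([deduction, induction, abduction], [0.6, 0.25, 0.15])
print(merged["layer.weight"])  # every entry equals 0.6*1 + 0.25*2 + 0.15*4 = 1.7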

The study evaluates models aligned with meta-abilities—deduction, induction, and abduction—using a curriculum learning setup across difficulty levels. Models trained on synthetic tasks strongly generalize to seven unseen math, code, and science benchmarks. At both 7B and 32B scales, meta-ability–aligned and merged models consistently outperform instruction-tuned baselines, with the merged model offering the highest gains. Continued domain-specific RL from these merged checkpoints (Domain-RL-Meta) leads to further improvements over standard RL finetuning (Domain-RL-Ins), especially in math benchmarks. Overall, the alignment strategy enhances reasoning abilities, and its benefits scale with model size, significantly boosting performance ceilings across tasks. 

In conclusion, the study shows that large reasoning models can develop advanced problem-solving skills without depending on unpredictable “aha moments.” By aligning models with three core reasoning abilities—deduction, induction, and abduction—using self-verifiable tasks, the authors create specialist agents that can be effectively combined into a single model. This merged model outperforms instruction-tuned baselines by over 10% on diagnostic tasks and up to 2% on real-world benchmarks. When used as a starting point for domain-specific reinforcement learning, it raises performance by another 4%. This modular, systematic training approach offers a scalable and controllable foundation for building reliable, interpretable reasoning systems. 

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Beyond Aha Moments: Structuring Reasoning in Large Language Models appeared first on MarkTechPost.

Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent Design

Anthropic has announced the release of its next-generation language models: Claude Opus 4 and Claude Sonnet 4. The update marks a significant technical refinement in the Claude model family, particularly in areas involving structured reasoning, software engineering, and autonomous agent behaviors.

This release is not another reinvention but a focused improvement—bringing increased consistency, interpretability, and performance across complex reasoning tasks. With extended context handling, long-horizon planning, and more efficient coding capabilities, these models reflect a maturing shift toward functional generalist systems that can serve a range of high-complexity applications.

Claude Opus 4: Scaling Advanced Reasoning and Multi-file Code Understanding

Positioned as the flagship model, Claude Opus 4 has been benchmarked as Anthropic’s most capable model to date. Designed to handle intricate reasoning workflows and software development scenarios, Opus 4 has achieved:

72.5% accuracy on the SWE-bench benchmark, which tests models against real-world GitHub issue resolution.

43.2% on TerminalBench, which evaluates correctness in terminal-based code generation tasks requiring multi-step planning.

A notable aspect of Claude Opus 4 is its agentic behavior in software environments. In practical testing, the model was able to autonomously sustain nearly seven hours of uninterrupted code generation and task execution. This is a marked improvement from Claude 3 Opus, which previously sustained such tasks for under an hour.

These improvements are attributed to enhanced memory management, broader context retention, and a more robust internal planning loop. From a developer’s perspective, Opus 4 reduces the need for frequent interventions and exhibits stronger consistency in handling edge cases across software stacks.

Claude Sonnet 4: A Balanced Model for General Reasoning and Code Tasks

Claude Sonnet 4 replaces its predecessor, Claude 3.5 Sonnet, with a more stable and balanced architecture that brings improvements in both speed and quality without significantly increasing computational costs.

Sonnet 4 is optimized for mid-scale deployments where cost-performance trade-offs are critical. While not matching Opus 4’s reasoning ceiling, it inherits many architectural upgrades—supporting multi-file code navigation, intermediate tool use, and structured text processing with improved latency.

It serves as the new default model for free-tier users on Claude.ai and is also available via API. This makes Sonnet 4 a practical option for lightweight development tools, user-facing assistants, and analytical pipelines requiring consistent but less intensive model calls.

Architectural Highlights: Hybrid Reasoning and Extended Thinking

Both models incorporate hybrid reasoning capabilities, introducing two distinct response modes:

Fast Mode for low-latency responses suitable for short prompts and conversational tasks.

Extended Thinking Mode for computationally intensive tasks requiring deeper inference, longer memory chains, or multi-turn agentic behavior.

This dual-mode reasoning strategy allows users to dynamically allocate compute and latency budgets based on task complexity. It is especially relevant in agent frameworks, where LLMs must balance fast reaction time with deliberative planning.
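
In practice, the mode choice surfaces through the API’s extended thinking controls. The sketch below assumes the Anthropic Python SDK’s thinking parameter and uses a placeholder model identifier; check Anthropic’s current documentation for exact model names, minimum budgets, and limits.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-0",  # placeholder ID; substitute the current Claude Opus 4 or Sonnet 4 identifier
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # allocate a reasoning budget for harder tasks
    messages=[{"role": "user", "content": "Plan a safe migration of a monolith to microservices."}],
)
print(response.content)  # returns thinking and text blocks when extended thinking is enabled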

Deployment and Integration

Claude Opus 4 and Sonnet 4 are accessible through multiple cloud platforms:

Anthropic’s Claude API

Amazon Bedrock

Google Cloud Vertex AI

This cross-platform availability simplifies model deployment into diverse enterprise environments, supporting use cases ranging from autonomous agents to code analysis, decision support, and retrieval-augmented generation (RAG) pipelines.
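
For example, on Amazon Bedrock the models can be invoked through the Converse API with boto3; the model ID below is a placeholder and must match an identifier (or inference profile) enabled in your account and Region.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder; use the ID enabled in your account
    messages=[{"role": "user", "content": [{"text": "Summarize the trade-offs of event-driven architectures."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])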

Conclusion

The Claude 4 series does not introduce radical design changes but instead demonstrates measured improvements in reliability, interpretability, and task generalization. With Claude Opus 4, Anthropic positions itself firmly in the upper tier of AI model providers for reasoning and coding automation. Meanwhile, Claude Sonnet 4 offers a technically sound, cost-efficient entry point for developers and researchers working on mid-scale AI applications.

For engineering teams evaluating LLMs for long-context planning, software agents, or structured data workflows, the Claude 4 models present a competitive, technically capable alternative.

Check out the Technical details and Get started today on Claude, Claude Code, or the platform of your choice. All credit for this research goes to the researchers of this project.
The post Anthropic Releases Claude Opus 4 and Claude Sonnet 4: A Technical Leap in Reasoning, Coding, and AI Agent Design appeared first on MarkTechPost.

Optimize query responses with user feedback using Amazon Bedrock embed …

Improving response quality for user queries is essential for AI-driven applications, especially those focusing on user satisfaction. For example, an HR chat-based assistant should strictly follow company policies and respond using a certain tone. A deviation from that can be corrected by feedback from users. This post demonstrates how Amazon Bedrock, combined with a user feedback dataset and few-shot prompting, can refine responses for higher user satisfaction. By using Amazon Titan Text Embeddings v2, we demonstrate a statistically significant improvement in response quality, making it a valuable tool for applications seeking accurate and personalized responses.
Recent studies have highlighted the value of feedback and prompting in refining AI responses. Prompt Optimization with Human Feedback proposes a systematic approach to learning from user feedback, using it to iteratively fine-tune models for improved alignment and robustness. Similarly, Black-Box Prompt Optimization: Aligning Large Language Models without Model Training demonstrates how retrieval augmented chain-of-thought prompting enhances few-shot learning by integrating relevant context, enabling better reasoning and response quality. Building on these ideas, our work uses the Amazon Titan Text Embeddings v2 model to optimize responses using available user feedback and few-shot prompting, achieving statistically significant improvements in user satisfaction. Amazon Bedrock already provides an automatic prompt optimization feature to automatically adapt and optimize prompts without additional user input. In this blog post, we showcase how to use OSS libraries for a more customized optimization based on user feedback and few-shot prompting.
We developed a practical solution using Amazon Bedrock that automatically improves chat assistant responses based on user feedback. The solution uses embeddings and few-shot prompting. To demonstrate its effectiveness, we used a publicly available user feedback dataset; when applied inside a company, the solution can instead use the organization's own feedback data from its users. With our test dataset, it shows a 3.67% increase in user satisfaction scores. The key steps include:

Retrieve a publicly available user feedback dataset (for this example, Unified Feedback Dataset on Hugging Face).
Create embeddings for queries to capture semantically similar examples, using Amazon Titan Text Embeddings v2.
Use similar queries as examples in a few-shot prompt to generate optimized prompts.
Compare optimized prompts against direct large language model (LLM) calls.
Validate the improvement in response quality using a paired sample t-test.

The following diagram is an overview of the system.

The key benefits of using Amazon Bedrock are:

Zero infrastructure management – Deploy and scale without managing complex machine learning (ML) infrastructure
Cost-effective – Pay only for what you use with the Amazon Bedrock pay-as-you-go pricing model
Enterprise-grade security – Use AWS built-in security and compliance features
Straightforward integration – Integrate seamlessly with existing applications and open source tools
Multiple model options – Access various foundation models (FMs) for different use cases

The following sections dive deeper into these steps, providing code snippets from the notebook to illustrate the process.
Prerequisites
Prerequisites for implementation include an AWS account with Amazon Bedrock access, Python 3.8 or later, and configured AWS credentials.
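The code in this post also relies on a handful of Python packages (boto3, datasets, pandas, numpy, scikit-learn, scipy, tqdm, pydantic, and langchain-aws). A quick way to confirm that your credentials and Amazon Bedrock access are set up is a small check like the following; the Region is an assumption and should match where Bedrock is enabled for your account.

import boto3

# Confirm that credentials resolve and that Bedrock model metadata is reachable
bedrock = boto3.client("bedrock", region_name="us-east-1")
models = bedrock.list_foundation_models()
print(f"{len(models['modelSummaries'])} foundation models visible to this account")
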
Data collection
We downloaded a user feedback dataset from Hugging Face, llm-blender/Unified-Feedback. The dataset contains fields such as conv_A_user (the user query) and conv_A_rating (a binary rating; 0 means the user doesn't like the response and 1 means the user likes it). The following code retrieves the dataset and focuses on the fields needed for embedding generation and feedback analysis. It can be run in an Amazon SageMaker notebook or a Jupyter notebook that has access to Amazon Bedrock.

from datasets import load_dataset

# Load the dataset and specify the subset
dataset = load_dataset("llm-blender/Unified-Feedback", "synthetic-instruct-gptj-pairwise")

# Access the 'train' split
train_dataset = dataset["train"]

# Convert the dataset to a Pandas DataFrame
df = train_dataset.to_pandas()

# Flatten the nested conv_A conversation structure safely
df['conv_A_user'] = df['conv_A'].apply(lambda x: x[0]['content'] if len(x) > 0 else None)
df['conv_A_assistant'] = df['conv_A'].apply(lambda x: x[1]['content'] if len(x) > 1 else None)

# Drop the original nested columns because they are no longer needed
df = df.drop(columns=['conv_A', 'conv_B'])

Data sampling and embedding generation
To manage the process effectively, we sampled 6,000 queries from the dataset. We used Amazon Titan Text Embeddings v2 to create embeddings for these queries, transforming text into high-dimensional representations that allow for similarity comparisons. See the following code:

import boto3
from langchain_aws import BedrockEmbeddings

# Take a sample of 6,000 queries; keep it as df_test so the remaining rows can be used for evaluation later
df_test = df.sample(n=6000, random_state=42)

# AWS session and Region
session = boto3.Session()
region = 'us-east-1'

# Initialize the S3 client (only needed if you persist intermediate artifacts to Amazon S3)
s3_client = boto3.client('s3')

# Initialize the Bedrock runtime client and the Titan Text Embeddings v2 model
boto3_bedrock = boto3.client('bedrock-runtime', region_name=region)
titan_embed_v2 = BedrockEmbeddings(
    client=boto3_bedrock, model_id="amazon.titan-embed-text-v2:0")

# Function to convert text to embeddings
def get_embeddings(text):
    response = titan_embed_v2.embed_query(text)
    return response  # Returns the embedding vector

# Apply the function to the user query column and store the vectors in a new column
df_test['conv_A_user_vec'] = df_test['conv_A_user'].apply(get_embeddings)
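
If you prefer not to depend on the LangChain wrapper, the same embedding can be requested directly from the Bedrock runtime client. The request and response fields below follow the Titan Text Embeddings v2 schema as we understand it (inputText in, embedding out); verify them against the model documentation.

import json

def get_embeddings_direct(text, client=boto3_bedrock):
    # Invoke Titan Text Embeddings v2 directly through the Bedrock runtime client
    body = json.dumps({"inputText": text, "dimensions": 1024, "normalize": True})
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]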

Few-shot prompting with similarity search
For this part, we took the following steps:

Sample 100 queries from the dataset for testing. Sampling 100 queries helps us run multiple trials to validate our solution.
Compute the cosine similarity (a measure of similarity between two non-zero vectors) between the embeddings of these test queries and the stored 6,000 embeddings.
Select the top k queries most similar to each test query to serve as few-shot examples. We set k = 10 to balance computational efficiency and diversity of the examples.

See the following code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Step 2: Define cosine similarity function
def compute_cosine_similarity(embedding1, embedding2):
    embedding1 = np.array(embedding1).reshape(1, -1)  # Reshape to 2D array
    embedding2 = np.array(embedding2).reshape(1, -1)  # Reshape to 2D array
    return cosine_similarity(embedding1, embedding2)[0][0]

# Retrieve the conversations most similar to a query
def get_matched_convo(query, df):
    # Embed the incoming query
    query_embedding = get_embeddings(query)

    # Step 3: Compute similarity with each row in the DataFrame
    df['similarity'] = df['conv_A_user_vec'].apply(lambda x: compute_cosine_similarity(query_embedding, x))

    # Step 4: Sort rows based on similarity score (descending order)
    df_sorted = df.sort_values(by='similarity', ascending=False)

    # Step 5: Keep the top matching rows (here, the top 10 matches)
    top_matches = df_sorted.head(10)

    # Return the top matches with their queries, responses, ratings, and similarity scores
    return top_matches[['conv_A_user', 'conv_A_assistant', 'conv_A_rating', 'similarity']]
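
For larger candidate pools, the row-by-row apply call can be replaced with a single vectorized computation. The following optional sketch assumes the same df_test DataFrame and get_embeddings helper defined above and returns the same columns:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_matched_convo_vectorized(query, df, top_k=10):
    # Stack all stored embeddings into one (n, d) matrix
    matrix = np.vstack(df['conv_A_user_vec'].to_numpy())
    query_vec = np.array(get_embeddings(query)).reshape(1, -1)

    # One call scores the query against every stored embedding at once
    sims = cosine_similarity(query_vec, matrix)[0]

    out = df.copy()
    out['similarity'] = sims
    return out.nlargest(top_k, 'similarity')[
        ['conv_A_user', 'conv_A_assistant', 'conv_A_rating', 'similarity']]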

The get_matched_convo function provides few-shot context for each test query, using cosine similarity to retrieve the closest matches. These example queries and their feedback serve as additional context to guide the prompt optimization. The following function generates the few-shot prompt:

import boto3
from langchain_aws import ChatBedrock
from pydantic import BaseModel

# Initialize Amazon Bedrock client
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")

# Configure the model to use
model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
model_kwargs = {
    "max_tokens": 2048,
    "temperature": 0.1,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman"],
}

# Create the LangChain Chat object for Bedrock
llm = ChatBedrock(
    client=bedrock_runtime,
    model_id=model_id,
    model_kwargs=model_kwargs,
)

import pandas as pd

# Pydantic model to validate the output prompt
class OptimizedPromptOutput(BaseModel):
    optimized_prompt: str

# Function to generate the few-shot prompt
def generate_few_shot_prompt_only(user_query, nearest_examples):
    # Ensure that the examples are provided as a DataFrame
    if not isinstance(nearest_examples, pd.DataFrame):
        raise ValueError("Expected nearest_examples to be a DataFrame")

    # Construct the few-shot prompt using the nearest matching examples
    few_shot_prompt = "Here are examples of user queries, LLM responses, and feedback:\n\n"
    for i in range(len(nearest_examples)):
        few_shot_prompt += f"User Query: {nearest_examples.loc[i, 'conv_A_user']}\n"
        few_shot_prompt += f"LLM Response: {nearest_examples.loc[i, 'conv_A_assistant']}\n"
        few_shot_prompt += f"User Feedback: {'👍' if nearest_examples.loc[i, 'conv_A_rating'] == 1.0 else '👎'}\n\n"

    # Add the user query for which the optimized prompt is required
    few_shot_prompt += "Based on these examples, generate a general optimized prompt for the following user query:\n\n"
    few_shot_prompt += f"User Query: {user_query}\n"
    few_shot_prompt += "Optimized Prompt: Provide a clear, well-researched response based on accurate data and credible sources. Avoid unnecessary information or speculation."

    return few_shot_prompt

The get_optimized_prompt function performs the following tasks:

Generate a few-shot prompt from the user query and its similar examples.
Use the few-shot prompt in an LLM call to generate an optimized prompt.
Validate that the output follows the expected format using Pydantic.

See the following code:

# Function to generate an optimized prompt using Bedrock and return only the prompt using Pydantic
def get_optimized_prompt(user_query, nearest_examples):
    # Generate the few-shot prompt
    few_shot_prompt = generate_few_shot_prompt_only(user_query, nearest_examples)

    # Call the LLM to generate the optimized prompt
    response = llm.invoke(few_shot_prompt)

    # Extract and validate only the optimized prompt using Pydantic
    optimized_prompt = response.content  # Access the 'content' attribute of the AIMessage object
    optimized_prompt_output = OptimizedPromptOutput(optimized_prompt=optimized_prompt)

    return optimized_prompt_output.optimized_prompt

# Example usage
query = "Is the US dollar weakening over time?"
nearest_examples = get_matched_convo(query, df_test)
nearest_examples.reset_index(drop=True, inplace=True)

# Generate the optimized prompt
optimized_prompt = get_optimized_prompt(query, nearest_examples)
print("Optimized Prompt:", optimized_prompt)

The make_llm_call_with_optimized_prompt function combines the optimized prompt with the user query and calls the LLM (Anthropic's Claude 3.5 Haiku) to get the final response:

import time

# Function to make the LLM call using the optimized prompt and user query
def make_llm_call_with_optimized_prompt(optimized_prompt, user_query):
    start_time = time.time()

    # Combine the optimized prompt and user query to form the input for the LLM
    final_prompt = f"{optimized_prompt}\n\nUser Query: {user_query}\nResponse:"

    # Make the call to the LLM using the combined prompt
    response = llm.invoke(final_prompt)

    # Extract only the content from the LLM response
    final_response = response.content  # The response text without any added labels
    time_taken = time.time() - start_time
    return final_response, time_taken

# Example usage
user_query = "How to grow avocado indoor?"
# 'optimized_prompt' has already been generated in the previous step
final_response, time_taken = make_llm_call_with_optimized_prompt(optimized_prompt, user_query)
print("LLM Response:", final_response)

Comparative evaluation of optimized and unoptimized prompts
To compare the optimized prompt with the baseline (in this case, the unoptimized prompt), we defined a function that generates a response without the optimized prompt for every query in the evaluation dataset:

from tqdm import tqdm

def get_unoptimized_prompt_response(df_eval):
    # Iterate over the dataframe and make LLM calls
    for index, row in tqdm(df_eval.iterrows(), total=len(df_eval)):
        # Get the user query from 'conv_A_user'
        user_query = row['conv_A_user']

        # Make the Bedrock LLM call directly with the raw user query
        response = llm.invoke(user_query)

        # Store the response content in a new column 'unoptimized_prompt_response'
        df_eval.at[index, 'unoptimized_prompt_response'] = response.content

    return df_eval

The following function generates the query response using similarity search and intermediate optimized prompt generation for all the queries in the evaluation dataset:

def get_optimized_prompt_response(df_eval):
    # Iterate over the dataframe and make LLM calls
    for index, row in tqdm(df_eval.iterrows(), total=len(df_eval)):
        # Get the user query from 'conv_A_user'
        user_query = row['conv_A_user']

        # Retrieve similar examples and build the intermediate optimized prompt
        nearest_examples = get_matched_convo(user_query, df_test)
        nearest_examples.reset_index(drop=True, inplace=True)
        optimized_prompt = get_optimized_prompt(user_query, nearest_examples)

        # Make the Bedrock LLM call with the optimized prompt
        final_response, time_taken = make_llm_call_with_optimized_prompt(optimized_prompt, user_query)

        # Store the response content in a new column 'optimized_prompt_response'
        df_eval.at[index, 'optimized_prompt_response'] = final_response

    return df_eval

This code compares responses generated with and without few-shot optimization, setting up the data for evaluation.
LLM as judge and evaluation of responses
To quantify response quality, we used an LLM as a judge to score the optimized and unoptimized responses for alignment with the user query. We used Pydantic here to make sure the output sticks to the desired pattern of 0 (LLM predicts the response won’t be liked by the user) or 1 (LLM predicts the response will be liked by the user):

from pydantic import conint

# Define Pydantic model to enforce predicted feedback as 0 or 1
class FeedbackPrediction(BaseModel):
    predicted_feedback: conint(ge=0, le=1)  # Only allow values 0 or 1

# Function to generate the judge's few-shot prompt
def generate_few_shot_prompt(df_examples, unoptimized_response):
    few_shot_prompt = (
        "You are an impartial judge evaluating the quality of LLM responses. "
        "Based on the user queries and the LLM responses provided below, your task is to determine whether the response is good or bad, "
        "using the examples provided. Return 1 if the response is good (thumbs up) or 0 if the response is bad (thumbs down).\n\n"
    )
    few_shot_prompt += "Below are examples of user queries, LLM responses, and user feedback:\n\n"

    # Iterate over few-shot examples
    for i, row in df_examples.iterrows():
        few_shot_prompt += f"User Query: {row['conv_A_user']}\n"
        few_shot_prompt += f"LLM Response: {row['conv_A_assistant']}\n"
        few_shot_prompt += f"User Feedback: {'👍' if row['conv_A_rating'] == 1 else '👎'}\n\n"

    # Provide the response to be rated for feedback prediction
    few_shot_prompt += (
        "Now, evaluate the following LLM response based on the examples above. Return 0 for bad response or 1 for good response.\n\n"
        f"LLM Response: {unoptimized_response}\n"
        "Predicted Feedback (0 for 👎, 1 for 👍):"
    )
    return few_shot_prompt

LLM-as-a-judge is a technique in which an LLM evaluates the quality of a piece of text, grounded by example judgments. We use it here to score the difference between the results produced by the optimized and unoptimized prompts. Amazon Bedrock launched an LLM-as-a-judge capability in December 2024 that can be used for such use cases. In the following function, we demonstrate how the LLM acts as an evaluator, scoring responses for the full evaluation dataset based on their alignment with the query and predicted user satisfaction:

from pydantic import ValidationError

# Function to predict feedback using few-shot examples
def predict_feedback(df_examples, df_to_rate, response_column, target_col):
    # Create a new column to store predicted feedback
    df_to_rate[target_col] = None

    # Iterate over each row in the dataframe to rate
    for index, row in tqdm(df_to_rate.iterrows(), total=len(df_to_rate)):
        try:
            # Brief pause to stay within service request quotas
            time.sleep(2)

            # Get the response to be rated (optimized or unoptimized, depending on response_column)
            unoptimized_response = row[response_column]

            # Generate the judge's few-shot prompt
            few_shot_prompt = generate_few_shot_prompt(df_examples, unoptimized_response)

            # Call the LLM to predict the feedback
            response = llm.invoke(few_shot_prompt)

            # Extract the predicted feedback (the model returns '0' or '1')
            predicted_feedback_str = response.content.strip()

            # Validate the feedback using Pydantic
            try:
                feedback_prediction = FeedbackPrediction(predicted_feedback=int(predicted_feedback_str))
                # Store the predicted feedback in the dataframe
                df_to_rate.at[index, target_col] = feedback_prediction.predicted_feedback
            except (ValueError, ValidationError):
                # In case of invalid output, assign the default value 0
                df_to_rate.at[index, target_col] = 0
        except Exception:
            # Skip rows where the LLM call itself fails
            pass

    return df_to_rate

We repeated this process for 20 trials, capturing user satisfaction scores each time. The overall score for the dataset is the sum of the individual user satisfaction scores. The following code shows a single trial:

# Hold out an evaluation set of 100 queries that were not used as few-shot examples
df_eval = df.drop(df_test.index).sample(100)

df_eval['unoptimized_prompt_response'] = ""  # Create an empty column to store responses
df_eval = get_unoptimized_prompt_response(df_eval)
df_eval['optimized_prompt_response'] = ""  # Create an empty column to store responses
df_eval = get_optimized_prompt_response(df_eval)

# Call the function to predict feedback
df_with_predictions = predict_feedback(df_eval, df_eval, 'unoptimized_prompt_response', 'predicted_unoptimized_feedback')
df_with_predictions = predict_feedback(df_with_predictions, df_with_predictions, 'optimized_prompt_response', 'predicted_optimized_feedback')

# Calculate success rates for the original ratings and the unoptimized and optimized responses
original_success = df_with_predictions.conv_A_rating.sum() * 100.0 / len(df_with_predictions)
unoptimized_success = df_with_predictions.predicted_unoptimized_feedback.sum() * 100.0 / len(df_with_predictions)
optimized_success = df_with_predictions.predicted_optimized_feedback.sum() * 100.0 / len(df_with_predictions)

# Display results
print(f"Original success: {original_success:.2f}%")
print(f"Unoptimized Prompt success: {unoptimized_success:.2f}%")
print(f"Optimized Prompt success: {optimized_success:.2f}%")

Result analysis
The following line chart shows the performance improvement of the optimized solution over the unoptimized one. Green areas indicate positive improvements, whereas red areas show negative changes.

As we gathered the result of 20 trials, we saw that the mean of satisfaction scores from the unoptimized prompt was 0.8696, whereas the mean of satisfaction scores from the optimized prompt was 0.9063. Therefore, our method outperforms the baseline by 3.67%.
Finally, we ran a paired sample t-test to compare satisfaction scores from the optimized and unoptimized prompts. This statistical test validated whether prompt optimization significantly improved response quality. See the following code:

from scipy import stats

# User satisfaction scores collected across the 20 trials in the notebook
unopt = []  # 20 scores for the unoptimized prompt
opt = []    # 20 scores for the optimized prompt

# Paired sample t-test
t_stat, p_val = stats.ttest_rel(unopt, opt)
print(f"t-statistic: {t_stat}, p-value: {p_val}")
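
The unopt and opt lists hold one satisfaction score per trial. A minimal sketch of how they could be populated, reusing the helper functions defined earlier, is shown below; we normalize by the number of evaluation queries so the per-trial values match the reported means, although the notebook may aggregate differently.

unopt, opt = [], []
for trial in range(20):
    # Draw a fresh evaluation sample for each trial
    df_eval = df.drop(df_test.index).sample(100)
    df_eval['unoptimized_prompt_response'] = ""
    df_eval = get_unoptimized_prompt_response(df_eval)
    df_eval['optimized_prompt_response'] = ""
    df_eval = get_optimized_prompt_response(df_eval)

    preds = predict_feedback(df_eval, df_eval, 'unoptimized_prompt_response', 'predicted_unoptimized_feedback')
    preds = predict_feedback(preds, preds, 'optimized_prompt_response', 'predicted_optimized_feedback')

    # Mean predicted satisfaction per trial (the paired values compared by the t-test)
    unopt.append(preds['predicted_unoptimized_feedback'].astype(float).mean())
    opt.append(preds['predicted_optimized_feedback'].astype(float).mean())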

After running the t-test, we got a p-value of 0.000762, which is less than 0.05. Therefore, the performance boost of optimized prompts over unoptimized prompts is statistically significant.
Key takeaways
We learned the following key takeaways from this solution:

Few-shot prompting improves query response – Using highly similar few-shot examples leads to significant improvements in response quality.
Amazon Titan Text Embeddings enables contextual similarity – The model produces embeddings that facilitate effective similarity searches.
Statistical validation confirms effectiveness – A p-value of 0.000762 indicates that our optimized approach meaningfully enhances user satisfaction.
Improved business impact – This approach delivers measurable business value through improved AI assistant performance. The 3.67% increase in satisfaction scores translates to tangible outcomes: HR departments can expect fewer policy misinterpretations (reducing compliance risks), and customer service teams might see a significant reduction in escalated tickets. The solution’s ability to continuously learn from feedback creates a self-improving system that increases ROI over time without requiring specialized ML expertise or infrastructure investments.

Limitations
Although the system shows promise, its performance heavily depends on the availability and volume of user feedback, especially in closed-domain applications. In scenarios where only a handful of feedback examples are available, the model might struggle to generate meaningful optimizations or fail to capture the nuances of user preferences effectively. Additionally, the current implementation assumes that user feedback is reliable and representative of broader user needs, which might not always be the case.
Next steps
Future work could focus on expanding this system to support multilingual queries and responses, enabling broader applicability across diverse user bases. Incorporating Retrieval Augmented Generation (RAG) techniques could further enhance context handling and accuracy for complex queries. Additionally, exploring ways to address the limitations in low-feedback scenarios, such as synthetic feedback generation or transfer learning, could make the approach more robust and versatile.
Conclusion
In this post, we demonstrated the effectiveness of query optimization using Amazon Bedrock, few-shot prompting, and user feedback to significantly enhance response quality. By aligning responses with user-specific preferences, this approach alleviates the need for expensive model fine-tuning, making it practical for real-world applications. Its flexibility makes it suitable for chat-based assistants across various domains, such as ecommerce, customer service, and hospitality, where high-quality, user-aligned responses are essential.
To learn more, refer to the following resources:

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Approaches to Few-Shot Learning
Recent Advances in LLM Feedback Integration
Frameworks for Query Optimization

About the Authors
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Parth Patwa is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services.
Yingwei Yu is an Applied Science Manager at the Generative AI Innovation Center at Amazon Web Services.