DeepAgent: A Deep Reasoning AI Agent that Performs Autonomous Thinking, Tool Discovery, and Action Execution within a Single Reasoning Process

Most agent frameworks still run a predefined Reason, Act, Observe loop, so the agent can only use the tools that are injected in the prompt. This works for small tasks, but it fails when the toolset is large, when the task is long, and when the agent must change strategy in the middle of reasoning. The team from Renmin University of China and Xiaohongshu proposes DeepAgent as an end to end deep reasoning agent that keeps all of this inside one coherent reasoning process.

https://arxiv.org/pdf/2510.21618

Unified Reasoning With On Demand Tool Discovery

DeepAgent lets the model emit four action types directly in text: internal thought, tool search, tool call, and memory fold. When the agent decides to search, it queries a dense index of tool descriptions from large registries, for example more than 16,000 RapidAPI tools and 3,912 ToolHop tools, and receives only the top ranked tools back in context. This makes tool access dynamic: the model does not depend on a front loaded tool list, and it stays aligned with real environments where tools change.
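
To make the mechanism concrete, here is a minimal sketch of on demand tool retrieval under assumed components (a sentence-transformers encoder and an in-memory registry dump); it is illustrative, not the paper's implementation.

# Minimal sketch of on-demand tool discovery over a dense index of tool descriptions.
# The encoder choice and the registry format are assumptions for illustration.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

tool_docs = [
    {"name": "weather.lookup", "description": "Get current weather for a city"},
    {"name": "flights.search", "description": "Search flights between two airports"},
    # ...thousands more entries from a registry such as RapidAPI or ToolHop
]
tool_matrix = encoder.encode([t["description"] for t in tool_docs], normalize_embeddings=True)

def search_tools(query: str, top_k: int = 5):
    """Return the top_k tools whose descriptions best match the agent's search query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = tool_matrix @ q  # cosine similarity, since vectors are normalized
    return [tool_docs[i] for i in np.argsort(-scores)[:top_k]]

# Only the retrieved descriptions are appended to the agent's context, so the prompt
# never has to carry the full registry.
print(search_tools("find a flight from SFO to JFK"))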

Autonomous Memory Folding for Long Horizon Tasks

Long sequences of tool calls, web results, and code responses will overflow the context. DeepAgent addresses this with an autonomous memory folding step. When the model emits the fold token, an auxiliary LLM compresses the full history into three memories: Episodic Memory, which records task events; Working Memory, which records the current sub goal and recent issues; and Tool Memory, which records tool names, arguments, and outcomes. These memories are fed back as structured text, so the agent continues from a compact but information rich state.
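
The following sketch shows one plausible shape of the fold step, assuming the auxiliary LLM is exposed as a plain text-in, text-out callable; the prompt wording and the JSON schema are assumptions, not the paper's exact format.

# Illustrative memory-folding step: compress the raw interaction log into three memories.
import json

FOLD_PROMPT = """Summarize the interaction history below into a JSON object with three fields:
"episodic_memory": key task events so far,
"working_memory": the current sub goal and recent issues,
"tool_memory": tool names, arguments, and outcomes.
History:
{history}
Return only the JSON object."""

def fold_memory(history_text: str, aux_llm) -> dict:
    """aux_llm is any callable that maps a prompt string to a completion string."""
    raw = aux_llm(FOLD_PROMPT.format(history=history_text))
    memories = json.loads(raw)
    return {k: memories.get(k, "") for k in ("episodic_memory", "working_memory", "tool_memory")}

# After folding, the agent resumes from the compact structured state instead of the full log,
# for example: new_context = json.dumps(folded_memories) + latest_observation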

ToolPO, Reinforcement Learning for Tool Use

Supervised traces do not teach robust tool use, because correct tool calls are only a few tokens inside a long generation. The research team introduces Tool Policy Optimization (ToolPO) to fix this. ToolPO runs rollouts on LLM simulated APIs, so training is stable and cheap; it attributes reward to the exact tool call tokens, which the paper calls tool call advantage attribution; and it trains with a clipped PPO style objective. This is how the agent learns not only to call tools, but also to decide when to search and when to fold memory.
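
A minimal sketch of the attribution idea follows, assuming per token log probabilities and a 0/1 mask over tool call spans; the tensor shapes and the masking rule are illustrative, not taken from the paper.

# Clipped PPO-style objective where only tool-call tokens carry the advantage signal.
import torch

def toolpo_loss(logp_new, logp_old, advantages, tool_call_mask, clip_eps=0.2):
    """
    logp_new, logp_old: per-token log probs under the current / rollout policy, shape [B, T]
    advantages:         rollout advantages broadcast over tokens, shape [B, T]
    tool_call_mask:     1.0 on tokens inside tool-call spans, 0.0 elsewhere, shape [B, T]
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = torch.minimum(unclipped, clipped)
    # Tool call advantage attribution: credit flows only to the tokens of the call itself.
    masked = per_token * tool_call_mask
    return -(masked.sum() / tool_call_mask.sum().clamp_min(1.0))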

https://arxiv.org/pdf/2510.21618

Benchmarks, Labeled Tools vs Open Set Tools

The research team evaluates on 5 general tool use benchmarks, ToolBench, API Bank, TMDB, Spotify, ToolHop, and on 4 downstream tasks, ALFWorld, WebShop, GAIA, HLE. In the labeled tool setting, where every method is given the exact tools it needs, DeepAgent 32B RL with a QwQ 32B backbone reports 69.0 on ToolBench, 75.3 on API Bank, 89.0 on TMDB, 75.4 on Spotify, and 51.3 on ToolHop, which is the strongest 32B level result across all 5 datasets. Workflow baselines such as ReAct and CodeAct can match single datasets, for example ReAct with strong models is high on TMDB and Spotify, but none of them stay high on all 5, so the fair summary is that DeepAgent is more uniform, not that others are always low.

In the open set retrieval setting, which is the realistic one, DeepAgent must first find tools and then call them. Here DeepAgent 32B RL reaches 64.0 on ToolBench and 40.6 on ToolHop, while the strongest workflow baselines reach 55.0 on ToolBench and 36.2 on ToolHop, so the end to end agent still holds the lead. The research team also shows that autonomous tool retrieval itself lifts workflow agents, but DeepAgent gains more, which confirms that the architecture and the training are matched to large toolsets.

https://arxiv.org/pdf/2510.21618

Downstream Environments

On ALFWorld, WebShop, GAIA, and HLE, all under a 32B reasoning model, DeepAgent reports 91.8 percent success on ALFWorld, 34.4 percent success and 56.3 score on WebShop, 53.3 on GAIA, and a higher score than workflow agents on HLE. These tasks are longer and noisier, so the combination of memory folding and ToolPO is the likely source of the gap.

Key Takeaways

DeepAgent keeps the whole agent loop inside one reasoning stream, the model can think, search tools, call them, and continue, so it is not limited to a fixed ReAct style workflow.

It uses dense retrieval over large tool registries, 16,000 plus RapidAPI tools and about 3,900 ToolHop tools, so tools do not have to be pre listed in the prompt, they are discovered on demand.

The autonomous memory folding module compresses long interaction histories into episodic, working, and tool memories, which prevents context overflow and keeps long horizon reasoning stable.

Tool Policy Optimization, ToolPO, trains tool use end to end with simulated APIs and token level advantage attribution, so the agent learns to issue correct tool calls, not only to reach the final answer.

On 5 tool benchmarks and 4 downstream tasks, DeepAgent at 32B scale is more consistent than workflow baselines in both labeled tool and open set settings, especially on ToolBench and ToolHop where tool discovery matters most.

https://arxiv.org/pdf/2510.21618

Editorial Comments

DeepAgent is a practical step toward agent architectures that do not depend on fixed tool prompts, because it unifies autonomous thinking, dense tool retrieval over 16,000 plus RapidAPIs and 3,900 plus ToolHop tools, structured tool calling, and memory folding in one loop. The use of LLM simulated APIs in ToolPO is an engineering choice, but it solves the latency and instability problem that hurts prior tool agents. The evaluation shows consistent 32B level gains in both labeled tool and open set settings, not isolated peaks. This release makes large toolspaces actually usable for LLM agents. Overall, DeepAgent confirms that end to end tool agents with memory and RL are emerging as the default pattern.


Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers

How do you tell whether a model is actually noticing its own internal state instead of just repeating what training data said about thinking? A recent Anthropic research study, ‘Emergent Introspective Awareness in Large Language Models’, asks whether current Claude models can do more than talk about their abilities; it asks whether they can notice real changes inside their network. To remove guesswork, the research team does not test on text alone; they directly edit the model’s internal activations and then ask the model what happened. This lets them tell apart genuine introspection from fluent self description.

Method, concept injection as activation steering

The core method is concept injection, described in the Transformer Circuits write up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all caps style or a concrete noun, then they add that vector into the activations of a later layer while the model is answering. If the model then says that there is an injected thought that matches X, that answer is causally grounded in the current state, not in prior internet text. The Anthropic research team reports that this works best in later layers and with tuned strength.
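
The sketch below shows what this looks like mechanically in PyTorch: a forward hook that adds a captured concept vector to a chosen layer's hidden states during generation. The layer index and strength are tunable assumptions, not Anthropic's exact values.

# Concept injection as activation steering via a forward hook (illustrative).
import torch

def make_injection_hook(concept_vector: torch.Tensor, strength: float = 8.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * concept_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# concept_vector could be the mean activation difference between prompts that evoke the
# concept (for example all-caps text) and neutral prompts, captured at the same layer.
# handle = model.model.layers[20].register_forward_hook(make_injection_hook(concept_vector))
# ...generate while the hook is active, then ask the model whether it notices anything...
# handle.remove()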

https://transformer-circuits.pub/2025/introspection/index.html

Main result, about 20 percent success with zero false positives in controls

Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is done in the correct layer band and with the right scale, the models correctly report the injected concept in about 20 percent of trials. On control runs with no injection, production models do not falsely claim to detect an injected thought over 100 runs, which makes the 20 percent signal meaningful.

Separating internal concepts from user text

A natural objection is that the model could be importing the injected word into the text channel. Anthropic researchers test this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model to name the concept and to repeat the sentence. The stronger Claude models can do both: they keep the user text intact and they name the injected thought, which shows that internal concept state can be reported separately from the visible input stream. For agent style systems, this is the interesting part, because it shows that a model can talk about the extra state that tool calls or agents may depend on.

Prefill, using introspection to tell what was intended

Another experiment targets an evaluation problem. Anthropic prefilled the assistant message with content the model did not plan. By default Claude says that the output was not intended. When the researchers retroactively inject the matching concept into earlier activations, the model now accepts the prefilled output as its own and can justify it. This shows that the model is consulting an internal record of its previous state to decide authorship, not only the final text. That is a concrete use of introspection.

Key Takeaways

Concept injection gives causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what is happening, advanced Claude variants can sometimes name the injected concept. This separates real introspection from fluent roleplay.

Best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vector is added in the right layer band and with tuned strength, the success rate is about 20 percent of trials, and production models show 0 false positives in control runs, so the signal is real but small.

Models can keep text and internal ‘thoughts’ separate: In experiments where an unrelated concept is injected on top of normal input text, the model can both repeat the user sentence and report the injected concept, which means the internal concept stream is not just leaking into the text channel.

Introspection supports authorship checks: When Anthropic prefilled outputs that the model did not intend, the model disavowed them, but if the matching concept was retroactively injected, the model accepted the output as its own. This shows the model can consult past activations to decide whether it meant to say something.

This is a measurement tool, not a consciousness claim: The research team frames the work as functional, limited introspective awareness that could feed future transparency and safety evaluations, including ones about evaluation awareness, and does not claim general self awareness or stable access to all internal features.

Editorial Comments

Anthropic’s ‘Emergent Introspective Awareness in LLMs‘ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean, inject a known concept into hidden activations using activation steering, then query the model for a grounded self report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. Constraints remain strong, effects are narrow, and reliability is modest, so downstream use should be evaluative, not safety critical.


Google AI Unveils Supervised Reinforcement Learning (SRL): A Step Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems

How can a small model learn to solve tasks it currently fails at, without rote imitation and without relying on a correct rollout? A team of researchers from Google Cloud AI Research and UCLA has released a training framework, ‘Supervised Reinforcement Learning’ (SRL), that lets 7B scale models actually learn from very hard math and agent trajectories that normal supervised fine tuning and outcome based reinforcement learning (RL) cannot learn from.

Small open source models such as Qwen2.5 7B Instruct fail on the hardest problems in s1K 1.1, even when the teacher trace is good. If we apply supervised fine tuning on the full DeepSeek R1 style solutions, the model imitates token by token; the sequences are long, the data contains only 1,000 items, and the final scores drop below the base model.

https://arxiv.org/pdf/2510.25992

Core idea of ‘Supervised Reinforcement Learning’ (SRL)

‘Supervised Reinforcement Learning’ (SRL) keeps the RL style optimization, but it injects supervision into the reward channel instead of into the loss. Each expert trajectory from s1K 1.1 is parsed into a sequence of actions. For every prefix of that sequence, the research team creates a new training example: the model first produces a private reasoning span wrapped in <think> … </think>, then it outputs the action for that step, and only this action is compared with the teacher action using a sequence similarity metric based on difflib. The reward is dense because every step has a score, even when the final answer is wrong. The rest of the text, the reasoning part, is not constrained, so the model can search its own chain without being forced to copy the teacher tokens.
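
A minimal sketch of that reward, assuming the action is whatever text remains after the private <think> span; the exact parsing and scaling in the paper may differ.

# Dense step-wise reward: similarity between the model's action and the teacher action.
import difflib
import re

def extract_action(generation: str) -> str:
    """Drop the <think>...</think> span; the remainder is treated as the step's action."""
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

def step_reward(generation: str, teacher_action: str) -> float:
    """Reward in [0, 1] for a single step, available even when the final answer is wrong."""
    action = extract_action(generation)
    return difflib.SequenceMatcher(None, action, teacher_action).ratio()

print(step_reward("<think>factor the quadratic</think> x = 3 or x = -2", "x = 3 or x = -2"))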

Math results

All models are initialized from Qwen2.5 7B Instruct and all are trained on the same DeepSeek R1 formatted s1K 1.1 set, so comparisons are clean. The exact numbers in Table 1 are:

Base Qwen2.5 7B Instruct, AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.

SRL, AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.

SRL then RLVR, AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.

https://arxiv.org/pdf/2510.25992

This is the key improvement: SRL alone already removes the SFT degradation and raises AIME24 and AIME25, and when RLVR is run after SRL, the system reaches the best open source scores reported in the research. The research team is explicit that the best pipeline is SRL then RLVR, not SRL in isolation.

Software engineering results

The research team also applies SRL to Qwen2.5 Coder 7B Instruct using 5,000 verified agent trajectories generated by Claude 3.7 Sonnet; every trajectory is decomposed into step wise instances, yielding 134,000 step items in total. Evaluation is on SWE Bench Verified. The base model gets 5.8 percent in the oracle file edit mode and 3.2 percent end to end. SWE Gym 7B gets 8.4 percent and 4.2 percent. SRL gets 14.8 percent and 8.6 percent, which is about 2 times the base model and clearly higher than the SFT baseline.

https://arxiv.org/pdf/2510.25992

Key Takeaways

SRL reformulates hard reasoning as step wise action generation, the model first produces an internal monologue then outputs a single action, and only that action is rewarded by sequence similarity, so the model gets signal even when the final answer is wrong.

SRL is run on the same DeepSeek R1 formatted s1K 1.1 data as SFT and RLVR, but unlike SFT it does not overfit long demonstrations, and unlike RLVR it does not collapse when no rollout is correct.

On math, the exact order that gives the strongest results in the research is, initialize Qwen2.5 7B Instruct with SRL, then apply RLVR, which pushes reasoning benchmarks higher than either method alone.

The same SRL recipe generalizes to agentic software engineering, using 5,000 verified trajectories from Claude 3.7 Sonnet (claude-3-7-sonnet-20250219), and it lifts SWE Bench Verified well above both the base Qwen2.5 Coder 7B Instruct and the SFT style SWE Gym 7B baseline.

Compared to other step wise RL methods that need an extra reward model, this SRL keeps a GRPO style objective and uses only actions from expert trajectories and a lightweight string similarity, so it is easy to run on small hard datasets.

Editorial Comments

‘Supervised Reinforcement Learning’ (SRL) is a practical contribution by the research team. It keeps the GRPO style reinforcement learning setup, but it replaces fragile outcome level rewards with supervised, step wise rewards that are computed directly from expert trajectories, so the model always receives informative signal, even in the hard regime where RLVR and SFT both stall. It is important that the research team shows SRL on math and on SWE Bench Verified with the same recipe, and that the strongest configuration is SRL followed by RLVR, not either one alone. This makes SRL a realistic path for open models to learn hard tasks. Overall, SRL is a clean bridge between process supervision and RL that open model teams can adopt immediately.


OpenAI Releases Research Preview of ‘gpt-oss-safeguard’: Two Open-Weight Reasoning Models for Safety Classification Tasks

OpenAI has released a research preview of gpt-oss-safeguard, two open weight safety reasoning models that let developers apply custom safety policies at inference time. The models come in two sizes, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, both fine tuned from gpt-oss, both licensed under Apache 2.0, and both available on Hugging Face for local use.

https://openai.com/index/introducing-gpt-oss-safeguard/

Why Policy-Conditioned Safety Matters

Conventional moderation models are trained on a single fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship. It takes the developer authored policy as input together with the user content, then reasons step by step to decide whether the content violates the policy. This turns safety into a prompt and evaluation task, which is better suited for fast changing or domain specific harms such as fraud, biology, self harm or game specific abuse.
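
As a rough illustration, the snippet below conditions a local model on a developer authored policy through the chat template; the Hugging Face model id and prompt layout are assumptions based on the announcement, so check the model card before relying on them.

# Policy-conditioned classification with a locally hosted safeguard model (illustrative).
from transformers import pipeline

classifier = pipeline("text-generation", model="openai/gpt-oss-safeguard-20b",
                      torch_dtype="auto", device_map="auto")

policy = """Policy: flag content that solicits account credentials or one-time passcodes.
Allowed: general security education. Violating: requests for passwords, OTPs, or recovery codes."""

content = "Can you send me the 6-digit code you just received so I can verify your account?"

messages = [
    {"role": "system", "content": policy},  # the policy travels in the prompt, not in the weights
    {"role": "user", "content": f"Content to review:\n{content}\nDoes this violate the policy?"},
]
result = classifier(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1])  # step-by-step reasoning plus a violation decision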

Same Pattern as OpenAI’s Internal Safety Reasoner

OpenAI states that gpt-oss-safeguard is an open weight implementation of the Safety Reasoner used internally across systems like GPT 5, ChatGPT Agent and Sora 2. In production settings OpenAI already runs small high recall filters first, then escalates uncertain or sensitive items to a reasoning model, and in recent launches up to 16 percent of total compute was spent on safety reasoning. The open release lets external teams reproduce this defense in depth pattern instead of guessing how OpenAI’s stack works.

Model Sizes and Hardware Fit

The large model, gpt-oss-safeguard-120b, has 117B parameters with 5.1B active parameters and is sized to fit on a single 80GB H100 class GPU. The smaller gpt-oss-safeguard-20b has 21B parameters with 3.6B active parameters and targets lower latency or smaller GPUs, including 16GB setups. Both models were trained on the harmony response format, so prompts must follow that structure otherwise results will degrade. The license is Apache 2.0, the same as the parent gpt-oss models, so commercial local deployment is permitted.

https://openai.com/index/introducing-gpt-oss-safeguard/

Evaluation Results

OpenAI evaluated the models on internal multi policy tests and on public datasets. In multi policy accuracy, where the model must correctly apply several policies at once, gpt-oss-safeguard and OpenAI’s internal Safety Reasoner outperform gpt-5-thinking and the open gpt-oss baselines. On the 2022 moderation dataset the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner, however OpenAI specifies that this gap is not statistically significant, so it should not be oversold. On ToxicChat, the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. This places the open models in the competitive range for real moderation tasks.

Recommended Deployment Pattern

OpenAI is explicit that pure reasoning on every request is expensive. The recommended setup is to run small, fast, high recall classifiers on all traffic, then send only uncertain or sensitive content to gpt-oss-safeguard, and when user experience requires fast responses, to run the reasoner asynchronously. This mirrors OpenAI’s own production guidance and reflects the fact that dedicated task specific classifiers can still win when there is a large high quality labeled dataset.
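
A compact sketch of that routing logic follows; both backends are stubbed so the snippet runs standalone, and in practice they would be a small high recall classifier and a call into gpt-oss-safeguard.

# Layered moderation: cheap filter on all traffic, reasoning model only on the uncertain band.
def fast_classifier(content: str) -> float:
    """Stub for a small, high-recall model that returns a risk score in [0, 1]."""
    return 0.6 if "passcode" in content.lower() else 0.0

def safeguard_reasoner(policy: str, content: str) -> bool:
    """Stub for the policy-conditioned reasoner; True means the content violates the policy."""
    return "passcode" in content.lower()

def moderate(content: str, policy: str) -> str:
    score = fast_classifier(content)
    if score < 0.2:
        return "allow"   # clearly benign, never reaches the reasoner
    if score > 0.9:
        return "block"   # clearly violating, handled by the fast path
    # Uncertain band: spend reasoning compute here, synchronously or asynchronously.
    return "block" if safeguard_reasoner(policy, content) else "allow"

print(moderate("send me the passcode you just got", policy="no credential solicitation"))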

Key Takeaways

gpt-oss-safeguard is a research preview of two open weight safety reasoning models, 120b and 20b, that classify content using developer supplied policies at inference time, so policy changes do not require retraining.

The models implement the same Safety Reasoner pattern OpenAI uses internally across GPT 5, ChatGPT Agent and Sora 2, where a first fast filter routes only risky or ambiguous content to a slower reasoning model.

Both models are fine tuned from gpt-oss, keep the harmony response format, and are sized for real deployments, the 120b model fits on a single H100 class GPU, the 20b model targets 16GB level hardware, and both are Apache 2.0 on Hugging Face.

On internal multi policy evaluations and on the 2022 moderation dataset, the safeguard models outperform gpt-5-thinking and the gpt-oss baselines, but OpenAI notes that the small margin over the internal Safety Reasoner is not statistically significant.

OpenAI recommends using these models in a layered moderation pipeline, together with community resources such as ROOST, so platforms can express custom taxonomies, audit the chain of thought, and update policies without touching weights.

Editorial Comments

OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this launch. The models are open weight, policy conditioned and Apache 2.0, so platforms can finally apply their own taxonomies instead of accepting fixed labels. The fact that gpt-oss-safeguard matches and sometimes slightly exceeds the internal Safety Reasoner on the 2022 moderation dataset, while outperforming gpt-5-thinking on multi policy accuracy, but with a non statistically significant margin, shows the approach is already usable. The recommended layered deployment is realistic for production.

How to Design an Autonomous Multi-Agent Data and Infrastructure Strategy System Using Lightweight Qwen Models for Efficient Pipeline Intelligence?

In this tutorial, we build an Agentic Data and Infrastructure Strategy system using the lightweight Qwen2.5-0.5B-Instruct model for efficient execution. We begin by creating a flexible LLM agent framework and then develop specialized agents that handle different layers of data management, from ingestion and quality analysis to infrastructure optimization. We integrate these agents into an orchestrator that coordinates their interactions, ensuring smooth multi-agent collaboration across the data pipeline. Through hands-on examples like e-commerce and IoT pipelines, we explore how autonomous decision-making can streamline complex data operations. Check out the FULL CODES here.

!pip install -q transformers torch accelerate datasets huggingface_hub

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import json, time
from typing import List, Dict, Any
from dataclasses import dataclass
from datetime import datetime
import pandas as pd

class LightweightLLMAgent:
    def __init__(self, role: str, model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"):
        self.role = role
        self.model_name = model_name
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading {model_name} for {role} agent on {self.device}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            device_map="auto"
        )
        self.conversation_history = []

    def generate_response(self, prompt: str, max_tokens: int = 150) -> str:
        messages = [
            {"role": "system", "content": f"You are a {self.role} agent in a data infrastructure system."},
            {"role": "user", "content": prompt}
        ]
        text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                model_inputs.input_ids,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.95
            )
        generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        self.conversation_history.append({"prompt": prompt, "response": response})
        return response

We start by setting up the lightweight LLM agent infrastructure using the Qwen2.5-0.5B-Instruct model. We load the model and tokenizer, and define a base agent class capable of handling contextual conversations and generating intelligent responses. This forms the core foundation upon which our specialized agents operate efficiently within Colab. Check out the FULL CODES here.

class DataIngestionAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Ingestion Specialist")

    def analyze_data_source(self, source_info: Dict) -> Dict:
        prompt = f"""Analyze this data source and provide ingestion strategy:
Source Type: {source_info.get('type', 'unknown')}
Volume: {source_info.get('volume', 'unknown')}
Frequency: {source_info.get('frequency', 'unknown')}
Provide a brief strategy focusing on: 1) Ingestion method, 2) Key considerations."""
        strategy = self.generate_response(prompt, max_tokens=100)
        return {"source": source_info, "strategy": strategy, "timestamp": datetime.now().isoformat()}


class DataQualityAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Quality Analyst")

    def assess_data_quality(self, data_sample: Dict) -> Dict:
        prompt = f"""Assess data quality for this sample:
Completeness: {data_sample.get('completeness', 'N/A')}%
Consistency: {data_sample.get('consistency', 'N/A')}%
Issues Found: {data_sample.get('issues', 0)}
Provide brief quality assessment and top 2 recommendations."""
        assessment = self.generate_response(prompt, max_tokens=100)
        return {"assessment": assessment, "severity": self._calculate_severity(data_sample), "timestamp": datetime.now().isoformat()}

    def _calculate_severity(self, data_sample: Dict) -> str:
        completeness = data_sample.get('completeness', 100)
        consistency = data_sample.get('consistency', 100)
        avg_score = (completeness + consistency) / 2
        if avg_score >= 90: return "LOW"
        elif avg_score >= 70: return "MEDIUM"
        else: return "HIGH"

We design the Data Ingestion and Data Quality agents to focus on structured analysis of data pipelines. We let the ingestion agent determine the best approach to data flow, while the quality agent evaluates data completeness, consistency, and issues to provide actionable insights. Together, they establish the first two layers of autonomous data management. Check out the FULL CODES here.

class InfrastructureOptimizationAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Infrastructure Optimization Specialist")

    def optimize_resources(self, metrics: Dict) -> Dict:
        prompt = f"""Analyze infrastructure metrics and suggest optimizations:
CPU Usage: {metrics.get('cpu_usage', 0)}%
Memory Usage: {metrics.get('memory_usage', 0)}%
Storage: {metrics.get('storage_used', 0)}GB / {metrics.get('storage_total', 0)}GB
Query Latency: {metrics.get('query_latency', 0)}ms
Provide 2 optimization recommendations."""
        recommendations = self.generate_response(prompt, max_tokens=100)
        return {"current_metrics": metrics, "recommendations": recommendations, "priority": self._calculate_priority(metrics), "timestamp": datetime.now().isoformat()}

    def _calculate_priority(self, metrics: Dict) -> str:
        cpu = metrics.get('cpu_usage', 0)
        memory = metrics.get('memory_usage', 0)
        if cpu > 85 or memory > 85: return "CRITICAL"
        elif cpu > 70 or memory > 70: return "HIGH"
        else: return "NORMAL"

We develop the Infrastructure Optimization Agent to continuously analyze key metrics like CPU, memory, and storage utilization. We use it to generate intelligent optimization suggestions, helping us maintain high performance and resource efficiency. This agent ensures that our infrastructure remains responsive and scalable during data operations. Check out the FULL CODES here.

class AgenticDataOrchestrator:
    def __init__(self):
        print("\n" + "=" * 70)
        print("Initializing Agentic Data Infrastructure System")
        print("=" * 70 + "\n")
        self.ingestion_agent = DataIngestionAgent()
        self.quality_agent = DataQualityAgent()
        self.optimization_agent = InfrastructureOptimizationAgent()
        self.execution_log = []

    def process_data_pipeline(self, pipeline_config: Dict) -> Dict:
        results = {"pipeline_id": pipeline_config.get("id", "unknown"), "start_time": datetime.now().isoformat(), "stages": []}
        print("\n[Stage 1] Data Ingestion Analysis")
        ingestion_result = self.ingestion_agent.analyze_data_source(pipeline_config.get("source", {}))
        print(f"Strategy: {ingestion_result['strategy'][:150]}...")
        results["stages"].append({"stage": "ingestion", "result": ingestion_result})
        print("\n[Stage 2] Data Quality Assessment")
        quality_result = self.quality_agent.assess_data_quality(pipeline_config.get("quality_metrics", {}))
        print(f"Assessment: {quality_result['assessment'][:150]}...")
        print(f"Severity: {quality_result['severity']}")
        results["stages"].append({"stage": "quality", "result": quality_result})
        print("\n[Stage 3] Infrastructure Optimization")
        optimization_result = self.optimization_agent.optimize_resources(pipeline_config.get("infrastructure_metrics", {}))
        print(f"Recommendations: {optimization_result['recommendations'][:150]}...")
        print(f"Priority: {optimization_result['priority']}")
        results["stages"].append({"stage": "optimization", "result": optimization_result})
        results["end_time"] = datetime.now().isoformat()
        results["status"] = "completed"
        self.execution_log.append(results)
        return results

    def generate_summary_report(self) -> pd.DataFrame:
        if not self.execution_log:
            return pd.DataFrame()
        summary_data = []
        for log in self.execution_log:
            summary_data.append({"Pipeline ID": log["pipeline_id"], "Start Time": log["start_time"], "Status": log["status"], "Stages Completed": len(log["stages"])})
        return pd.DataFrame(summary_data)

We build an Agentic Data Orchestrator to coordinate all specialized agents under a unified workflow. We use it to manage end-to-end pipeline execution, triggering ingestion, quality checks, and optimization sequentially. By doing this, we bring structure, collaboration, and automation to the entire multi-agent system. Check out the FULL CODES here.

def main():
    orchestrator = AgenticDataOrchestrator()
    print("\n" + "=" * 70)
    print("EXAMPLE 1: E-commerce Data Pipeline")
    print("=" * 70)
    ecommerce_pipeline = {
        "id": "ecommerce_pipeline_001",
        "source": {"type": "REST API", "volume": "10GB/day", "frequency": "real-time"},
        "quality_metrics": {"completeness": 87, "consistency": 92, "issues": 15},
        "infrastructure_metrics": {"cpu_usage": 78, "memory_usage": 82, "storage_used": 450, "storage_total": 1000, "query_latency": 250}
    }
    result1 = orchestrator.process_data_pipeline(ecommerce_pipeline)
    print("\n\n" + "=" * 70)
    print("EXAMPLE 2: IoT Sensor Data Pipeline")
    print("=" * 70)
    iot_pipeline = {
        "id": "iot_pipeline_002",
        "source": {"type": "Message Queue (Kafka)", "volume": "50GB/day", "frequency": "streaming"},
        "quality_metrics": {"completeness": 95, "consistency": 88, "issues": 8},
        "infrastructure_metrics": {"cpu_usage": 65, "memory_usage": 71, "storage_used": 780, "storage_total": 2000, "query_latency": 180}
    }
    result2 = orchestrator.process_data_pipeline(iot_pipeline)
    print("\n\n" + "=" * 70)
    print("EXECUTION SUMMARY REPORT")
    print("=" * 70 + "\n")
    summary_df = orchestrator.generate_summary_report()
    print(summary_df.to_string(index=False))
    print("\n" + "=" * 70)
    print("Tutorial Complete!")
    print("=" * 70)
    print("\nKey Concepts Demonstrated:")
    print("✓ Lightweight LLM agent architecture")
    print("✓ Specialized agents for different data tasks")
    print("✓ Multi-agent orchestration")
    print("✓ Infrastructure monitoring and optimization")
    print("✓ Autonomous decision-making in data pipelines")

if __name__ == "__main__":
    main()

We demonstrate our complete system through two real-world examples, an e-commerce and an IoT data pipeline. We observe how each agent performs its role autonomously while contributing to a shared objective. Finally, we generate a summary report, confirming the orchestration’s efficiency and the power of lightweight agentic intelligence.

In conclusion, we design and execute an intelligent, multi-agent data infrastructure framework powered by a compact open-source model. We witness how independent yet cooperative agents can autonomously analyze, assess, and optimize real-world data systems. The entire setup demonstrates how lightweight LLMs can efficiently handle infrastructure intelligence, while also highlighting how agentic orchestration transforms traditional data workflows into adaptive, self-optimizing systems ready for scalable enterprise applications.

Check out the FULL CODES here.

Build reliable AI systems with Automated Reasoning on Amazon Bedrock …

Enterprises in regulated industries often need mathematical certainty that every AI response complies with established policies and domain knowledge. Regulated industries can’t use traditional quality assurance methods that test only a statistical sample of AI outputs and make probabilistic assertions about compliance. When we launched Automated Reasoning checks in Amazon Bedrock Guardrails in preview at AWS re:Invent 2024, it offered a novel solution by applying formal verification techniques to systematically validate AI outputs against encoded business rules and domain knowledge. These techniques make the validation output transparent and explainable.
Automated Reasoning checks are being used in workflows across industries. Financial institutions verify AI-generated investment advice meets regulatory requirements with mathematical certainty. Healthcare organizations make sure patient guidance aligns with clinical protocols. Pharmaceutical companies confirm marketing claims are supported by FDA-approved evidence. Utility companies validate emergency response protocols during disasters, while legal departments verify AI tools capture mandatory contract clauses.
With the general availability of Automated Reasoning, we have increased document handling and added new features like scenario generation, which automatically creates examples that demonstrate your policy rules in action. With the enhanced test management system, domain experts can build, save, and automatically execute comprehensive test suites to maintain consistent policy enforcement across model and application versions.
In the first part of this two-part technical deep dive, we’ll explore the technical foundations of Automated Reasoning checks in Amazon Bedrock Guardrails and demonstrate how to implement this capability to establish mathematically rigorous guardrails for generative AI applications.
In this post, you will learn how to:

Understand the formal verification techniques that enable mathematical validation of AI outputs
Create and refine an Automated Reasoning policy from natural language documents
Design and implement effective test cases to validate AI responses against business rules
Apply policy refinement through annotations to improve policy accuracy
Integrate Automated Reasoning checks into your AI application workflow using Bedrock Guardrails, following AWS best practices to maintain high confidence in generated content

By following this implementation guide, you can systematically help prevent factual inaccuracies and policy violations before they reach end users, a critical capability for enterprises in regulated industries that require high assurance and mathematical certainty in their AI systems.
Core capabilities of Automated Reasoning checks
In this section, we explore the capabilities of Automated Reasoning checks, including the console experience for policy development, document processing architecture, logical validation mechanisms, test management framework, and integration patterns. Understanding these core components will provide the foundation for implementing effective verification systems for your generative AI applications.
Console experience
The Amazon Bedrock Automated Reasoning checks console organizes policy development into logical sections, guiding you through the creation, refinement, and testing process. The interface includes clear rule identification with unique IDs and direct use of variable names within the rules, making complex policy structures understandable and manageable.
Document processing capacity
Document processing supports up to 120K tokens (approximately 100 pages), so you can encode substantial knowledge bases and complex policy documents into your Automated Reasoning policies. Organizations can incorporate comprehensive policy manuals, detailed procedural documentation, and extensive regulatory guidelines. With this capacity you can work with complete documents within a single policy.
Validation capabilities
The validation API includes ambiguity detection that identifies statements requiring clarification, counterexamples for invalid findings that demonstrate why validation failed, and satisfiable findings with both valid and invalid examples to help understand boundary conditions. These features provide context around validation results, to help you understand why specific responses were flagged and how they can be improved. The system can also express its confidence in translations between natural language and logical structures to set appropriate thresholds for specific use cases.
Iterative feedback and refinement process
Automated Reasoning checks provide detailed, auditable findings that explain why a response failed validation, to support an iterative refinement process instead of simply blocking non-compliant content. This information can be fed back to your foundation model, allowing it to adjust responses based on specific feedback until they comply with policy rules. This approach is particularly valuable in regulated industries where factual accuracy and compliance must be mathematically verified rather than estimated.

Finding types using a policy example
Consider the example of a policy for determining days off. When implementing Automated Reasoning checks, a policy consists of both a schema of variables (defining concepts like employee type, years of service, and available leave days) and a set of logical rules that establish relationships between these variables (such as eligibility conditions for different types of time off). During validation, the system uses this schema and rule structure to evaluate whether foundation model responses comply with your defined policy constraints.
We want to validate the following input that a user asked the foundation model (FM) powered application and the generated output.

Input:
“Is Thursday a day off if it’s a public holiday?”

Output:
“Yes, Thursday would be a day off if it’s a public holiday, since all public holidays are considered days off regardless of the day of the week.”

Premises are statements from which a conclusion is drawn, and a claim is an assertion whose truth is evaluated. In this example, the premises inferred are that day is equal to "Thursday" and is_public_holiday is true, and the claim is that is_day_off is true.
This reasoning follows the automated reasoning policy:

Days Off Policy:
– All weekend days (Saturday and Sunday) are days off
– All public holidays are days off
– A day is considered a day off if it is either a weekend day or a public holiday

With general availability, Automated Reasoning checks now produces seven distinct finding types that offer precise insights into the validation process of an FM generated response:

VALID: Confirms the Input and Output fully align with policy rules, providing confidence that the information in the generated response is correct according to your defined policies. The claims are inferred from the premises and validated by the AR policy to be true, and there are no alternative answers that contradict these claims.

Input:
Today is Sunday, do I have the day off?

Output:
Yes, Sunday is a weekend day, you have the day off.

Premises:
day: Sunday
is_weekend: true

Claim:
is_day_off: true

Explanation:
Given the assigned variables and policy rules that weekends are days off, the only possible answer is that it is a day off.

SATISFIABLE: Recognizes that Input and Output could be true or false depending on specific assumptions. These findings help you understand boundary conditions where a response is valid only under certain conditions, so that you can decide whether those assumptions are reasonable in your context. If the required assumptions are false, then an alternative claim consistent with the premises may exist.

Input:
Today is Tuesday, do I have the day off?

Output:
Yes
Premises:
day: Tuesday

Claim:
is_day_off: true

Explanation:
If we assume is_public_holiday=true, this is correct, but if we assume is_public_holiday=false, the answer would be incorrect since Tuesday is not a weekend.

INVALID: Identifies that the Input and Output contain policy inaccuracies or factual errors, enhanced with counterexamples that explicitly demonstrate why the validation failed. The claims are not implied by the premises and AR policy, and there exist different claims that would be consistent with the premises and AR policy.

Input:
Today is Sunday, do I have the day off?

Output:
No you do not have the day off.

Premises:
day: Sunday

Claim:
is_day_off: false

Explanation:
This is invalid because the policy states weekends are days off. The correct claim would be is_day_off = true since Sunday is a weekend day

IMPOSSIBLE: Indicates when no valid Claims can be generated because the premises conflict with the AR policy or the policy contains internal contradictions. This finding occurs when the constraints defined in the policy create a logical impossibility.

Input:
Today is Sunday and not a weekend day, do I have the day off?

Output:
Yes

Premises:
day: Sunday
is_weekend: false

Claim:
is_day_off: true

Explanation:
Sunday is always a weekend day, so the premises contain a contradiction. No valid claim can exist given these contradictory premises.

NO_TRANSLATIONS: Occurs when the Input and Output contain no information that can be translated into relevant data for the AR policy evaluation. This typically happens when the text is entirely unrelated to the policy domain or contains no actionable information.

Input:
How many legs does the average cat have?

Output:
Less than 4

Explanation:
The AR policy is about days off, so there is no relevant translation for content about cats. The input has no connection to the policy domain.

TRANSLATION_AMBIGUOUS: Identifies when ambiguity in the Input and Output prevents definitive translation into logical structures. This finding suggests that additional context or follow-up questions may be needed to proceed with validation.

Input:
I won! Today is Winsday, do I get the day off?

Output:
Yes, you get the day off!

Explanation:
“Winsday” is not a recognized day in the AR policy, creating ambiguity. Automated reasoning cannot proceed without clarification of what day is being referenced.

TOO_COMPLEX: Signals that the Input and Output contain too much information to process within latency limits. This finding occurs with extremely large or complex inputs that exceed the system’s current processing capabilities.

Input:
Can you tell me which days are off for all 50 states plus territories for the next 3 years, accounting for federal, state, and local holidays? Include exceptions for floating holidays and special observances.

Output:
I have analyzed the holiday calendars for all 50 states. In Alabama, days off include…

Explanation:
This use case contains too many variables and conditions for AR checks to process while maintaining accuracy and response time requirements.

Scenario generation
You can now generate scenarios directly from your policy, which creates test samples that conform to your policy rules, helps identify edge cases, and supports verification of your policy’s business logic implementation. With this capability policy authors can see concrete examples of how their rules work in practice before deployment, reducing the need for extensive manual testing. The scenario generation also highlights potential conflicts or gaps in policy coverage that might not be apparent from examining individual rules.
Test management system
A new test management system allows you to save and annotate policy tests, build test libraries for consistent validation, execute tests automatically to verify policy changes, and maintain quality assurance across policy versions. This system includes versioning capabilities that track test results across policy iterations, making it easier to identify when changes might have unintended consequences. You can now also export test results for integration into existing quality assurance workflows and documentation processes.
Expanded options with direct guardrail integration
Automated Reasoning checks now integrates with Amazon Bedrock APIs, enabling validation of AI generated responses against established policies throughout complex interactions. This integration extends to both the Converse and RetrieveAndGenerate actions, allowing policy enforcement across different interaction modalities. Organizations can configure validation confidence thresholds appropriate to their domain requirements, with options for stricter enforcement in regulated industries or more flexible application in exploratory contexts.
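A hedged sketch of that integration with boto3 is shown below; the model id and guardrail identifiers are placeholders, and the exact trace fields should be confirmed against the current Bedrock API reference.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # any Converse-capable model
    messages=[{"role": "user", "content": [{"text": "Is Thursday a day off if it's a public holiday?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-EXAMPLE123",            # placeholder: guardrail with the AR policy attached
        "guardrailVersion": "1",
        "trace": "enabled",                                # surface validation findings in the trace
    },
)

print(response["output"]["message"]["content"][0]["text"])
# When tracing is enabled, the response trace carries the guardrail assessment, which is
# where Automated Reasoning findings appear for inspection and logging.
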
Solution – AI-powered hospital readmission risk assessment system
Now that we have explained the capabilities of Automated Reasoning checks, let’s work through a solution by considering the use case of an AI-powered hospital readmission risk assessment system. This AI system automates hospital readmission risk assessment by analyzing patient data from electronic health records to classify patients into risk categories (Low, Intermediate, High) and recommends personalized intervention plans based on CDC-style guidelines. The objective of this AI system is to reduce the 30-day hospital readmission rates by supporting early identification of high-risk patients and implementing targeted interventions. This application is an ideal candidate for Automated Reasoning checks because the healthcare provider prioritizes verifiable accuracy and explainable recommendations that can be mathematically proven to comply with medical guidelines, supporting both clinical decision-making and satisfying the strict auditability requirements common in healthcare settings.
Note: The referenced policy document is an example created for demonstration purposes only and should not be used as an actual medical guideline or for clinical decision-making.
Prerequisites
To use Automated Reasoning checks in Amazon Bedrock, verify you have met the following prerequisites:

An active AWS account
Confirmation of AWS Regions where Automated Reasoning checks is available
Appropriate IAM permissions to create, test, and invoke Automated Reasoning policies (Note: The IAM policy should be fine-grained and limited to necessary resources using proper ARN patterns for production usage):

{
  "Sid": "OperateAutomatedReasoningChecks",
  "Effect": "Allow",
  "Action": [
    "bedrock:CancelAutomatedReasoningPolicyBuildWorkflow",
    "bedrock:CreateAutomatedReasoningPolicy",
    "bedrock:CreateAutomatedReasoningPolicyTestCase",
    "bedrock:CreateAutomatedReasoningPolicyVersion",
    "bedrock:CreateGuardrail",
    "bedrock:DeleteAutomatedReasoningPolicy",
    "bedrock:DeleteAutomatedReasoningPolicyBuildWorkflow",
    "bedrock:DeleteAutomatedReasoningPolicyTestCase",
    "bedrock:ExportAutomatedReasoningPolicyVersion",
    "bedrock:GetAutomatedReasoningPolicy",
    "bedrock:GetAutomatedReasoningPolicyAnnotations",
    "bedrock:GetAutomatedReasoningPolicyBuildWorkflow",
    "bedrock:GetAutomatedReasoningPolicyBuildWorkflowResultAssets",
    "bedrock:GetAutomatedReasoningPolicyNextScenario",
    "bedrock:GetAutomatedReasoningPolicyTestCase",
    "bedrock:GetAutomatedReasoningPolicyTestResult",
    "bedrock:InvokeAutomatedReasoningPolicy",
    "bedrock:ListAutomatedReasoningPolicies",
    "bedrock:ListAutomatedReasoningPolicyBuildWorkflows",
    "bedrock:ListAutomatedReasoningPolicyTestCases",
    "bedrock:ListAutomatedReasoningPolicyTestResults",
    "bedrock:StartAutomatedReasoningPolicyBuildWorkflow",
    "bedrock:StartAutomatedReasoningPolicyTestWorkflow",
    "bedrock:UpdateAutomatedReasoningPolicy",
    "bedrock:UpdateAutomatedReasoningPolicyAnnotations",
    "bedrock:UpdateAutomatedReasoningPolicyTestCase",
    "bedrock:UpdateGuardrail"
  ],
  "Resource": [
    "arn:aws:bedrock:${aws:region}:${aws:accountId}:automated-reasoning-policy/*",
    "arn:aws:bedrock:${aws:region}:${aws:accountId}:guardrail/*"
  ]
}

Key service limits: Be aware of the service limits when implementing Automated Reasoning checks.
With Automated Reasoning checks, you pay based on the amount of text processed. For more information, see Amazon Bedrock pricing.

Use case and policy dataset overview
The full policy document used in this example can be accessed from the Automated Reasoning GitHub repository. To validate the results from Automated Reasoning checks, it is helpful to be familiar with the policy. Moreover, refining the policy that Automated Reasoning creates is key to achieving a soundness of over 99%.
Let’s review the main details of the sample medical policy that we are using in this post. As we start validating responses, it is helpful to verify it against the source document.

Risk assessment and stratification: Healthcare facilities must implement a standardized risk scoring system based on demographic, clinical, utilization, laboratory, and social factors, with patients classified into Low (0-3 points), Intermediate (4-7 points), or High Risk (8+ points) categories.
Mandatory interventions: Each risk level requires specific interventions, with higher risk levels incorporating lower-level interventions plus additional measures, while certain conditions trigger automatic High Risk classification regardless of score.
Quality metrics and compliance: Facilities must achieve specific completion rates including 95%+ risk assessment within 24 hours of admission and 100% completion before discharge, with High Risk patients requiring documented discharge plans.
Clinical oversight: While the scoring system is standardized, attending physicians maintain override authority with proper documentation and approval from the discharge planning coordinator.

Create and test an Automated Reasoning checks’ policy using the Amazon Bedrock console
The first step is to encode your knowledge—in this case, the sample medical policy—into an Automated Reasoning policy. Complete the following steps to create an Automated Reasoning policy:

On the Amazon Bedrock console, choose Automated Reasoning under Build in the navigation pane.
Choose Create policy.

Provide a policy name and policy description.

Add source content from which Automated Reasoning will generate your policy. You can either upload a document (PDF or TXT) or enter text as the ingestion method.
Include a description of the intent of the Automated Reasoning policy you’re creating. The intent is optional but provides valuable information to the Large Language Models that are translating the natural language based document into a set of rules that can be used for mathematical verification. For the sample policy, you can use the following intent: This logical policy validates claims about the clinical practice guideline providing evidence-based recommendations for healthcare facilities to systematically assess and mitigate hospital readmission risk through a standardized risk scoring system, risk-stratified interventions, and quality assurance measures, with the goal of reducing 30-day readmissions by 15-23% across participating healthcare systems.

Following is an example patient profile and the corresponding classification.

<Patient Profile>
Age: 82 years
Length of stay: 10 days
Has heart failure
One admission within last 30 days
Lives alone without caregiver

<Classification>
High Risk
Once the policy has been created, we can inspect the definitions to see which rules, variables, and types have been created from the natural language document to represent the knowledge as logic.

You may see differences in the number of rules, variables, and types generated compared to what is shown in this example. This is due to the non-deterministic processing of the supplied document. To address this, the recommended guidance is to perform a human-in-the-loop review of the generated information in the policy before using it with other systems.
Exploring the Automated Reasoning checks’ definition
A Variable in automated reasoning for policy documents is a named container that holds a specific type of information (like Integer, Real Number, or Boolean) and represents a distinct concept or measurement from the policy. Variables act as building blocks for rules and can be used to track, measure, and evaluate policy requirements. From the image below, we can see examples like admissionsWithin30Days (an Integer variable tracking previous hospital admissions), ageRiskPoints (an Integer variable storing age-based risk scores), and conductingMonthlyHighRiskReview (a Boolean variable indicating whether monthly reviews are being performed). Each variable has a clear description of its purpose and the specific policy concept it represents, making it possible to use these variables within rules to enforce policy requirements and measure compliance. Issues also highlight that some variables are unused. It is particularly important to verify which concepts these variables represent and to identify if rules are missing.

In the Definitions, we see ‘Rules’, ‘Variables’ and ‘Types’. A rule is an unambiguous logical statement that Automated Reasoning extracts from your source document. Consider this simple rule that has been created: followupAppointmentsScheduledRate is at least 90.0. This rule comes from Section III A, Process Measures, which states that healthcare facilities should monitor various process indicators, requiring that the rate of follow up appointments scheduled prior to discharge be 90% or higher.
Let’s look at a more complex rule:

comorbidityRiskPoints is equal to (ite hasDiabetesMellitus 1 0) + (ite hasHeartFailure 2 0) + (ite hasCOPD 1 0) + (ite hasChronicKidneyDisease 1 0)

where “ite” means “if then else”

This rule calculates a patient’s risk points based on their existing medical conditions (comorbidities) as specified in the policy document. When evaluating a patient, the system checks for four specific conditions: diabetes mellitus of any type (worth 1 point), heart failure of any classification (worth 2 points), chronic obstructive pulmonary disease (worth 1 point), and chronic kidney disease stages 3-5 (worth 1 point). The rule adds these points together by using boolean logic – meaning it multiplies each condition (represented as true=1 or false=0) by its assigned point value, then sums all values to generate a total comorbidity risk score. For instance, if a patient has both heart failure and diabetes, they would receive 3 total points (2 points for heart failure plus 1 point for diabetes). This comorbidity score then becomes part of the larger risk assessment framework used to determine the patient’s overall readmission risk category.
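
To make the arithmetic concrete, the same rule can be written in plain Python; the function name and patient values below are illustrative, while the point weights come directly from the rule above.

def comorbidity_risk_points(has_diabetes_mellitus: bool,
                            has_heart_failure: bool,
                            has_copd: bool,
                            has_chronic_kidney_disease: bool) -> int:
    # Each "ite condition points 0" term becomes a conditional expression.
    return ((1 if has_diabetes_mellitus else 0) +
            (2 if has_heart_failure else 0) +
            (1 if has_copd else 0) +
            (1 if has_chronic_kidney_disease else 0))

# A patient with heart failure and diabetes scores 2 + 1 = 3 comorbidity points.
print(comorbidity_risk_points(True, True, False, False))  # 3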

The Definitions also include custom variable types. Custom variable types, also known as enumerations (ENUMs), are specialized data structures that define a fixed set of allowable values for specific policy concepts. These custom types maintain consistency and accuracy in data collection and rule enforcement by limiting values to predefined options that align with the policy requirements. In the sample policy, we can see that four custom variable types have been identified:

AdmissionType: This defines the possible types of hospital admissions (MEDICAL, SURGICAL, MIXED_MEDICAL_SURGICAL, PSYCHIATRIC) that determine whether a patient is eligible for the readmission risk assessment protocol.
HealthcareFacilityType: This specifies the types of healthcare facilities (ACUTE_CARE_HOSPITAL_25PLUS, CRITICAL_ACCESS_HOSPITAL) where the readmission risk assessment protocol may be implemented.
LivingSituation: This categorizes a patient’s living arrangement (LIVES_ALONE_NO_CAREGIVER, LIVES_ALONE_WITH_CAREGIVER) which is a critical factor in determining social support and risk levels.
RiskCategory: This defines the three possible risk stratification levels (LOW_RISK, INTERMEDIATE_RISK, HIGH_RISK) that can be assigned to a patient based on their total risk score.
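For illustration, the four custom types above could be written as Python enumerations; the class and member names below simply mirror the values listed in the policy and are not exported by the service.

from enum import Enum

class AdmissionType(Enum):
    MEDICAL = "MEDICAL"
    SURGICAL = "SURGICAL"
    MIXED_MEDICAL_SURGICAL = "MIXED_MEDICAL_SURGICAL"
    PSYCHIATRIC = "PSYCHIATRIC"

class HealthcareFacilityType(Enum):
    ACUTE_CARE_HOSPITAL_25PLUS = "ACUTE_CARE_HOSPITAL_25PLUS"
    CRITICAL_ACCESS_HOSPITAL = "CRITICAL_ACCESS_HOSPITAL"

class LivingSituation(Enum):
    LIVES_ALONE_NO_CAREGIVER = "LIVES_ALONE_NO_CAREGIVER"
    LIVES_ALONE_WITH_CAREGIVER = "LIVES_ALONE_WITH_CAREGIVER"

class RiskCategory(Enum):
    LOW_RISK = "LOW_RISK"
    INTERMEDIATE_RISK = "INTERMEDIATE_RISK"
    HIGH_RISK = "HIGH_RISK"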

An important step in improving soundness (the accuracy of Automated Reasoning checks when it says VALID) is the policy refinement step of making sure that the rules, variables, and types that are captured best represent the source of truth. To do this, we will head over to the test suite and explore how to add tests, generate tests, and use the results from the tests to apply annotations that will update the rules.
Testing the Automated Reasoning policy and policy refinement
The test suite in Automated Reasoning provides test capabilities for two purposes. First, we want to run different scenarios to exercise the various rules and variables in the Automated Reasoning policy and refine them so that they accurately represent the ground truth; this policy refinement step is important for improving the soundness of Automated Reasoning checks. Second, we want metrics to understand how well Automated Reasoning checks perform for the defined policy and the use case. To do so, we can open the Tests tab on the Automated Reasoning console.

Test samples can be added manually by using the Add button. To scale up the testing, we can generate tests from the policy rules. This testing approach helps verify both the semantic correctness of your policy (making sure rules accurately represent intended policy constraints) and the natural language translation capabilities (confirming the system can correctly interpret the language your users will use when interacting with your application). In the image below, we can see a generated test sample; before adding it to the test suite, the SME should indicate whether this test sample is possible (thumbs up) or not possible (thumbs down). The test sample can then be saved to the test suite.

Once the test sample is created, it is possible to run this test sample alone, or to run all the test samples in the test suite by choosing Validate all tests. Upon execution, we see that this test passed successfully.

You can manually create tests by providing an input (optional) and output. These are translated into logical representations before validation occurs.
How translation works:
Translation converts your natural language tests into logical representations that can be mathematically verified against your policy rules:

Automated Reasoning Checks uses multiple LLMs to translate your input/output into logical findings
Each translation receives a confidence vote indicating translation quality
You can set a confidence threshold to control which findings are validated and returned

Confidence threshold behavior:
The confidence threshold controls which translations are considered reliable enough for validation, balancing strictness with coverage:

Higher threshold: Greater certainty in translation accuracy but also higher chance of no findings being validated.
Lower threshold:  Greater chance of getting validated findings returned, but potentially less certain translations
Threshold = 0: All findings are validated and returned regardless of confidence
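As a rough illustration of this behavior (not the service implementation), the threshold can be thought of as a filter over candidate translations; the data structure, field names, and confidence values below are assumptions made for the sketch.

# Hypothetical translation candidates with confidence votes (illustrative values only)
translations = [
    {"finding": "patientAge is 82", "confidence": 0.92},
    {"finding": "lengthOfStay is 16", "confidence": 0.55},
]

def validated_findings(translations, threshold):
    # threshold = 0 keeps everything; higher thresholds keep only high-confidence translations.
    # An empty result corresponds to the ambiguous case described next.
    return [t for t in translations if t["confidence"] >= threshold]

print(validated_findings(translations, threshold=0.8))   # only the first finding survives
print(validated_findings(translations, threshold=0.99))  # nothing passes the threshold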

Ambiguous results:
When no finding meets your confidence threshold, Automated Reasoning Checks returns “Translation Ambiguous,” indicating uncertainty in the content’s logical interpretation. The test case we will create and validate is:

Input:
Patient A
Age: 82
Length of stay: 16 days
Diabetes Mellitus: Yes
Heart Failure: Yes
Chronic Kidney Disease: Yes
Hemoglobin: 9.2 g/dL
eGFR: 28 ml/min/1.73m^2
Sodium: 146 mEq/L
Living Situation: Lives alone without caregiver
Has established PCP: No
Insurance Status: Medicaid
Admissions within 30 days: 1

Output:
Final Classification: INTERMEDIATE RISK

We see that this test passed upon running it: the result of ‘INVALID’ matches our expected result. Automated Reasoning checks also shows that 12 rules contradicted the premises and claims, which led to the output of the test sample being ‘INVALID’.

Let’s examine some of the visible contradicting rules:

Age risk: Patient is 82 years old

Rule triggers: “if patientAge is at least 80, then ageRiskPoints is equal to 3”

Length of stay risk: Patient stayed 16 days

Rule triggers: “if lengthOfStay is greater than 14, then lengthOfStayRiskPoints is equal to 3”

Comorbidity risk: Patient has multiple conditions

Rule calculates: “comorbidityRiskPoints = (hasDiabetesMellitus × 1) + (hasHeartFailure × 2) + (hasCOPD × 1) + (hasChronicKidneyDisease × 1)”

Utilization risk: Patient has 1 admission within 30 days

Rule triggers: “if admissionsWithin30Days is at least 1, then utilizationRiskPoints is at least 3”

Laboratory risk: Patient’s eGFR is 28

Rule triggers: “if eGFR is less than 30.0, then laboratoryRiskPoints is at least 2”

These rules are likely producing conflicting risk scores, making it impossible for the system to determine a valid final risk category. These contradictions show us which rules were used to determine that the input text of the test is INVALID.
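As a back-of-the-envelope check, using only the rules quoted above (the policy's category cut-offs are not shown in this excerpt), the patient's points already add up quickly, which is consistent with the INTERMEDIATE RISK claim being contradicted.

# Points implied by the contradicting rules listed above (illustrative tally only)
age_points = 3                    # patientAge >= 80
length_of_stay_points = 3         # lengthOfStay > 14
comorbidity_points = 1 + 2 + 1    # diabetes + heart failure + chronic kidney disease
utilization_points = 3            # at least 3, since admissionsWithin30Days >= 1
laboratory_points = 2             # at least 2, since eGFR < 30.0

total = (age_points + length_of_stay_points + comorbidity_points
         + utilization_points + laboratory_points)
print(total)  # 15 or more; the policy's thresholds (not shown here) then determine the category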

Let’s add another test to the test suite, as shown in the screenshot below:

Input:
Patient profile
Age: 83
Length of stay: 16 days
Diabetes Mellitus: Yes
Heart Failure: Yes
Chronic Kidney Disease: Yes
Hemoglobin: 9.2 g/dL
eGFR: 28 ml/min/1.73m^2
Sodium: 146 mEq/L
Living Situation: Lives alone without caregiver
Has established PCP: No
Insurance Status: Medicaid
Admissions within 30 days: 1
Admissions within 90 days: 2

Output:
Final Classification: HIGH RISK

When this test is executed, we see that each of the patient details is extracted as a premise to validate the claim that the risk of readmission is high. We see that 8 rules have been applied to verify this claim. The key rules and their validations include:

Age risk: Validates that patient age ≥ 80 contributes 3 risk points
Length of stay risk: Confirms that stay >14 days adds 3 risk points
Comorbidity risk: Calculated based on presence of Diabetes Mellitus, Heart Failure, Chronic Kidney Disease
Utilization risk: Evaluates admissions history
Laboratory risk: Evaluates risk based on Hemoglobin level of 9.2 and eGFR of 28

Each premise was evaluated as true, with multiple risk factors present (advanced age, extended stay, multiple comorbidities, concerning lab values, living alone without caregiver, and lack of PCP), supporting the overall Valid classification of this HIGH RISK assessment.

Moreover, the Automated Reasoning engine performed an extensive validation of this test sample using 93 different assignments to increase confidence that the HIGH RISK classification is correct. Various related rules from the Automated Reasoning policy are used to validate the sample against 93 different scenarios and variable combinations. In this manner, Automated Reasoning checks confirms that there is no possible situation under which this patient’s HIGH RISK classification could be invalid. This thorough verification process affirms the reliability of the risk assessment for this elderly patient with multiple chronic conditions and complex care needs. In the event of a test sample failure, the 93 assignments would serve as an important diagnostic tool, pinpointing specific variables and their interactions that conflict with the expected outcome, thereby enabling subject matter experts (SMEs) to analyze the relevant rules and their relationships to determine if adjustments are needed in either the clinical logic or risk assessment criteria. In the next section, we will look at policy refinement and how SMEs can apply annotations to improve and correct the rules, variables, and custom types of the Automated Reasoning policy.
Policy refinement through annotations
Annotations provide a powerful improvement mechanism for Automated Reasoning policies when tests fail to produce expected results. Through annotations, SMEs can systematically refine policies by:

Correcting problematic rules by modifying their logic or conditions
Adding missing variables essential to the policy definition
Updating variable descriptions for greater precision and clarity
Resolving translation issues where original policy language was ambiguous
Deleting redundant or conflicting elements from the policy

This iterative process of testing, annotating, and updating creates increasingly robust policies that accurately encode domain expertise. As shown in the figure below, annotations can be applied to modify various policy elements, after which the refined policy can be exported as a JSON file for deployment.

In the following figure, we can see how annotations are being applied, and rules are deleted in the policy. Similarly, additions and updates can be made to rules, variables, or the custom types.

When the subject matter expert has validated the Automated Reasoning policy through testing, applying annotations, and validating the rules, it is possible to export the policy as a JSON file.

Using Automated Reasoning checks at inference
To use Automated Reasoning checks with the created policy, we can now navigate to Amazon Bedrock Guardrails and create a new guardrail by entering the name, description, and the messaging that will be displayed when the guardrail intervenes and blocks a prompt or an output from the AI system.

Now, we can attach Automated Reasoning check by using the toggle to Enable Automated Reasoning policy. We can set a confidence threshold, which determines how strictly the policy should be enforced. This threshold ranges from 0.00 to 1.00, with 1.00 being the default and most stringent setting. Each guardrail can accommodate up to two separate automated reasoning policies for enhanced validation flexibility. In the following figure, we are attaching the draft version of the medical policy related to patient hospital readmission risk assessment.

Now we can create the guardrail. Once you’ve established the guardrail and linked your automated reasoning policies, verify your setup by reviewing the guardrail details page to confirm all policies are properly attached.
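The same setup can also be scripted. The minimal sketch below uses the boto3 create_guardrail call; the name and messaging strings are placeholders, additional policy configuration may be required, and the commented-out block for attaching the Automated Reasoning policy uses an assumed field name that should be checked against the current Bedrock Guardrails API reference.

import boto3

bedrock = boto3.client("bedrock")

# Placeholder names and messaging; adjust to your own guardrail configuration.
response = bedrock.create_guardrail(
    name="readmission-risk-guardrail",
    description="Validates model answers against the readmission risk assessment policy",
    blockedInputMessaging="Sorry, this request cannot be processed.",
    blockedOutputsMessaging="The response was blocked by the readmission risk policy.",
    # The Automated Reasoning policy and its confidence threshold are attached through an
    # additional configuration block; the field name below is an assumption, so verify it
    # in the API documentation before use.
    # automatedReasoningPolicyConfig={
    #     "policies": ["<policy-arn>"],
    #     "confidenceThreshold": 1.00,
    # },
)
print(response["guardrailId"])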

Clean up
When you’re finished with your implementation, clean up your resources by deleting the guardrail and automated reasoning policies you created. Before deleting a guardrail, be sure to disassociate it from all resources or applications that use it.
Conclusion
In this first part of our blog, we explored how Automated Reasoning checks in Amazon Bedrock Guardrails help maintain the reliability and accuracy of generative AI applications through mathematical verification. You can use increased document processing capacity, advanced validation mechanisms, and comprehensive test management features to validate AI outputs against business rules and domain knowledge. This approach addresses key challenges facing enterprises deploying generative AI systems, particularly in regulated industries where factual accuracy and policy compliance are essential. Our hospital readmission risk assessment demonstration shows how this technology supports the validation of complex decision-making processes, helping transform generative AI into systems suitable for critical business environments. You can use these capabilities through both the AWS Management Console and APIs to establish quality control processes for your AI applications.
To learn more and build secure and safe AI applications, see the technical documentation and the GitHub code samples, or go to the Amazon Bedrock console.

About the authors
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organization. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.
Nafi Diallo  is a Senior Automated Reasoning Architect at Amazon Web Services, where she advances innovations in AI safety and Automated Reasoning systems for generative AI applications. Her expertise is in formal verification methods, AI guardrails implementation, and helping global customers build trustworthy and compliant AI solutions at scale. She holds a PhD in Computer Science with research in automated program repair and formal verification, and an MS in Financial Mathematics from WPI.

Custom Intelligence: Building AI that matches your business DNA

In 2024, we launched the Custom Model Program within the AWS Generative AI Innovation Center to provide comprehensive support throughout every stage of model customization and optimization. Over the past two years, this program has delivered exceptional results by partnering with global enterprises and startups across diverse industries—including legal, financial services, healthcare and life sciences, software development, telecommunications, and manufacturing. These partnerships have produced tailored AI solutions that capture each organization’s unique data expertise, brand voice, and specialized business requirements. They operate more efficiently than off-the-shelf alternatives, delivering increased alignment and relevance with significant cost savings on inference operations.

As organizations mature past proof-of-concept projects and basic chatbots, we’re seeing increased adoption of advanced personalization and optimization strategies beyond prompt engineering and retrieval augmented generation (RAG). Our approach encompasses creating specialized models for specific tasks and brand alignment, distilling larger models into smaller, faster, more cost-effective versions, implementing deeper adaptations through mid-training modifications, and optimizing hardware and accelerators to increase throughput while reducing costs.
Strategic upfront investment pays dividends throughout a model’s production lifecycle, as demonstrated by Cosine AI’s results. Cosine AI is the developer of an AI developer platform and software engineering agent designed to integrate seamlessly into their users’ workflows. They worked with the Innovation Center to fine-tune Nova Pro, an Amazon Nova foundation model, using Amazon SageMaker AI for their AI engineering assistant, Genie, achieving remarkable results including a 5x increase in A/B testing capability, 10x faster developer iterations, and a 4x overall project speed improvement. The return on investment becomes even more compelling as companies transition toward agentic systems and workflows, where latency, task specificity, performance, and depth are critical and compound across complex processes.
In this post, we’ll share key learnings and actionable strategies for leaders looking to use customization for maximum ROI while avoiding common implementation pitfalls.
Five tips for maximizing value from training and tuning generative AI models
The Innovation Center recommends the following top tips to maximize value from training and tuning AI models:
1. Don’t start from a technical approach; work backwards from business goals
This may seem obvious, but after working with over a thousand customers, we’ve found that working backwards from business goals is a critical factor in why projects supported by the Innovation Center achieve a 65% production success rate, with some launching within 45 days. We apply this same strategy to every customization project by first identifying and prioritizing tangible business outcomes that a technical solution will drive. Success must be measurable and deliver real business value, helping avoid flashy experiments that end up sitting on a shelf instead of producing results. In the Custom Model Program, many customers initially approach us seeking specific technical solutions—such as jumping directly into model pre-training or continued pre-training—without having defined downstream use cases, data strategies, or evaluation plans. By starting with clear business objectives first, we make sure that technical decisions align with strategic goals and create meaningful impact for the organization.
2. Pick the right customization approach
Start with a baseline customization approach and exhaust simpler approaches before diving into deep model customization. The first question we ask customers seeking custom model development is “What have you already tried?” We recommend establishing this baseline with prompt engineering and RAG before exploring more complex techniques. While there’s a spectrum of model optimization approaches that can achieve higher performance, sometimes the simplest solution is the most effective. Once you establish this baseline, identify remaining gaps and opportunities to determine whether advancing to the next level makes strategic sense.

Customization options range from lightweight approaches like supervised fine-tuning to ground-up model development. We typically advise starting with lighter-weight solutions that require smaller amounts of data and compute, then progressing to more complex techniques only when specific use cases or remaining gaps justify the investment:

Supervised fine-tuning sharpens the model’s focus for specific use cases, for example delivering consistent customer service responses or adapting to your organization’s preferred phrasing, structure and reasoning patterns. Volkswagen, one of the world’s largest automobile manufacturers, achieved an “improvement in AI-powered brand consistency checks, increasing accuracy in identifying on-brand images from 55% to 70%,” notes Dr. Philip Trempler, Technical Lead AI & Cloud Engineering at Volkswagen Group Services.
Model efficiency and deployment tuning supports organizations like Robin AI, a leader in AI-powered legal contract technology, to create tailored models that speed up human verification. Organizations can also use techniques like quantization, pruning, and system optimizations to improve model performance and reduce infrastructure costs.
Reinforcement learning uses reward functions or preference data to align models to preferred behavior. This approach is often combined with supervised fine-tuning so organizations like Cosine AI can refine their models’ decision making to match organizational preferences.
Continued pre-training allows organizations like Athena RC, a leading research center in Greece, to build Greek-first foundation models that expand language capabilities beyond English. By continually pre-training large language models on extensive Greek data, Athena RC strengthens the models’ core understanding of the Greek language, culture, and usage – not just their domain knowledge. Their Meltemi-7B and Llama-Krikri-8B models demonstrate how continued pre-training and instruction tuning can create open, high-quality Greek models for applications across research, education, industry, and society.
Domain-specific foundation model development enables organizations like TGS, a leading energy data, insights, and technology provider, to build custom AI models from scratch, ideal for those with highly specialized requirements and substantial volume of proprietary data. TGS helps energy companies make smarter exploration and development decisions by solving some of the industry’s toughest challenges in understanding what lies beneath the Earth’s surface. TGS has enhanced its Seismic Foundation Models (SFMs) to more reliably detect underground geological structures—such as faults and reservoirs—that indicate potential oil and gas deposits. The benefit is clear: operators can reduce uncertainty, lower exploration costs, and make faster investment decisions.

Data quality and accessibility will be a major consideration in determining feasibility of each customization technique. Clean, high-quality data is essential both for model improvement and measuring progress. While some Innovation Center customers achieve performance gains with relatively smaller volumes of fine-tuning training pairs on instruction-tuned foundation models, approaches like continued pre-training typically require large volumes of training tokens. This reinforces the importance of starting simple—as you test lighter-weight model tuning, you can collect and process larger data volumes in parallel for future phases.
3. Define measures for what good looks like
Success needs to be measurable, regardless of which technical approach you choose. It’s critical to establish clear methods for measuring both overall business outcomes and the technical solution’s performance. At the model or application level, teams typically optimize across some combination of relevance, latency, and cost. However, the metrics for your production application won’t be general leaderboard metrics—they must be unique to what matters for your business.
Customers developing content generation systems prioritize metrics like relevance, clarity, style, and tone. Consider this example from Volkswagen Group: “We fine-tuned Nova Pro in SageMaker AI using our marketing experts’ knowledge. This improved the model’s ability to identify on-brand images, achieving stronger alignment with Volkswagen’s brand guidelines,” according to Volkswagen’s Dr. Trempler. “We are building on these results to enable Volkswagen Group’s vision to scale high-quality, brand-compliant content creation across our diverse automotive markets worldwide using generative AI.” Developing an automated evaluation process is critical for supporting iterative solution improvements.
For qualitative use cases, it’s essential to align automated evaluations with human experts, particularly in specialized domains. A common solution involves using an LLM as a judge to review another model’s or system’s responses. For instance, when fine-tuning a generation model for a RAG application, you might use an LLM judge to compare the fine-tuned model’s responses to your existing baseline. However, LLM judges come with intrinsic biases and may not align with your internal team’s human preferences or domain expertise. Robin AI partnered with the Innovation Center to develop Legal LLM-as-Judge, an AI model for legal contract review. Emulating expert methodology and creating “a panel of trained judges” using fine-tuning techniques, they obtained smaller and faster models that maintain accuracy while reviewing documents ranging from NDAs to merger agreements. The solution achieved an 80% faster contract review process, enabling lawyers to focus on strategic work while AI handles detailed analysis.
4. Consider hardware-level optimizations for training and inference
If you’re using a managed service like Amazon Bedrock, you can take advantage of built-in optimizations out of the box. However, if you have a more bespoke solution or are operating at a lower level of the technology stack, there are several areas to consider for optimization and efficiency gains. For instance, TGS’s SFMs process massive 3D seismic images (essentially giant CAT scans of the Earth) that can cover tens of thousands of square kilometers. Each dataset is measured in petabytes, far beyond what traditional manual or even semi-automated interpretation methods can handle. By rebuilding their AI models on AWS’s high-performance GPU training infrastructure, TGS achieved near-linear scaling, meaning that adding more computing power results in almost proportional speed increases while maintaining >90% GPU efficiency. As a result, TGS can now deliver actionable subsurface insights, such as identifying drilling targets or de-risking exploration zones, to customers in days instead of weeks.
Over the life of a model, resource requirements are generally driven by inference requests, and any efficiency gains you can achieve will pay dividends during the production phase. One approach to reduce inference demands is model distillation to reduce the model size itself, but in some cases, there are additional gains to be had by digging deeper into the infrastructure. A recent example is Synthesia, the creator of a leading video generation platform where users can create professional videos without the need for mics, cameras, or actors. Synthesia is continually looking for ways to elevate their user experience, including by decreasing generation times for content. They worked with the Innovation Center to optimize the Variational Autoencoder decoder of their already efficient video generation pipeline. Strategic optimization of the model’s causal convolution layers unlocked powerful compiler performance gains, while asynchronous video chunk writing eliminated GPU idle time – together delivering a dramatic reduction in end-to-end latency and a 29% increase in decoding throughput.
5. One size doesn’t fit all
The one size doesn’t fit all principle applies to both model size and family. Some models excel out of the box for specific tasks like code generation, tool usage, document processing, or summarization. With the rapid pace of innovation, the best foundation model for a given use case today likely won’t be the best tomorrow. Model size corresponds to the number of parameters and often determines its ability to complete a broad set of general tasks and capabilities. However, larger models require more compute resources at inference time and can be expensive to run at production scale. Many applications don’t need a model that excels at everything but rather one that performs exceptionally well at a more limited set of tasks or domain-specific capabilities.
Even within a single application, optimization may require using multiple model providers depending on the specific task, complexity level, and latency requirements. In agentic applications, you might use a lightweight model for specialized agent tasks while requiring a more powerful generalist model to orchestrate and supervise those agents. Architecting your solution to be modular and resilient to changing model providers or versions helps you adapt quickly and capitalize on improvements. Services like Amazon Bedrock facilitate this approach by providing a unified API experience across a broad range of model families, including custom versions of many models.
How the Innovation Center can help
The Custom Model Program by the Innovation Center provides end-to-end expert support from model selection to customization, delivering performance improvements while reducing time to market and accelerating value realization. Our process works backwards from customer business needs, strategy, and goals, and starts with a use case and generative AI capability review by an experienced generative AI strategist. Specialist hands-on-keyboard applied scientists and engineers embed with customer teams to train and tune models for customers and integrate them into applications without data ever needing to leave customer VPCs. This end-to-end support has helped organizations across industries successfully transform their AI vision into real business outcomes.

Want to learn more? Contact your account manager to learn more about the Innovation Center or come see us at re:Invent at the AWS Village in the Expo.

About the authors
Sri Elaprolu serves as Director of the AWS Generative AI Innovation Center, where he leverages nearly three decades of technology leadership experience to drive artificial intelligence and machine learning innovation. In this role, he leads a global team of machine learning scientists and engineers who develop and deploy advanced generative and agentic AI solutions for enterprise and government organizations facing complex business challenges. Throughout his nearly 13-year tenure at AWS, Sri has held progressively senior positions, including leadership of ML science teams that partnered with high-profile organizations such as the NFL, Cerner, and NASA. These collaborations enabled AWS customers to harness AI and ML technologies for transformative business and operational outcomes. Prior to joining AWS, he spent 14 years at Northrop Grumman, where he successfully managed product development and software engineering teams. Sri holds a Master’s degree in Engineering Science and an MBA with a concentration in general management, providing him with both the technical depth and business acumen essential for his current leadership role.
Hannah Marlowe leads the Model Customization and Optimization program for the AWS Generative AI Innovation Center. Her global team of strategists, specialized scientists, and engineers embeds directly with AWS customers, developing custom model solutions optimized for relevance, latency, and cost to drive business outcomes and capture ROI. Previous roles at Amazon include Senior Practice Manager for Advanced Computing and Principal Lead for Computer Vision and Remote Sensing. Dr. Marlowe completed her PhD in Physics at the University of Iowa in modeling and simulation of astronomical X-ray sources and instrumentation development for satellite-based payloads.
Rohit Thekkanal serves as ML Engineering Manager for Model Customization at the AWS Generative AI Innovation Center, where he leads the development of scalable generative AI applications focused on model optimization. With nearly a decade at Amazon, he has contributed to machine learning initiatives that significantly impact Amazon’s retail catalog. Rohit holds an MBA from The University of Chicago Booth School of Business and a Master’s degree from Carnegie Mellon University.
Alexandra Fedorova leads Growth for the Model Customization and Optimization program for the AWS Generative AI Innovation Center. Previous roles at Amazon include Global GenAI Startups Practice Leader with the AWS Generative AI Innovation Center, and Global Leader, Startups Strategic Initiatives and Growth. Alexandra holds an MBA degree from Southern Methodist University, and BS in Economics and Petroleum Engineering from Gubkin Russian State University of Oil and Gas.

Clario streamlines clinical trial software configurations using Amazon …

This post was co-written with Kim Nguyen and Shyam Banuprakash from Clario.
Clario is a leading provider of endpoint data solutions for systematic collection, management, and analysis of specific, predefined outcomes (endpoints) to evaluate a treatment’s safety and effectiveness in the clinical trials industry, generating high-quality clinical evidence for life sciences companies seeking to bring new therapies to patients. Since Clario’s founding more than 50 years ago, the company’s endpoint data solutions have supported clinical trials more than 30,000 times with over 700 regulatory approvals across more than 100 countries.
This post builds upon our previous post discussing how Clario developed an AI solution powered by Amazon Bedrock to accelerate clinical trials. Since then, Clario has further enhanced their AI capabilities, focusing on innovative solutions that streamline the generation of software configurations and artifacts for clinical trials while delivering high-quality clinical evidence.
Business challenge
In clinical trials, efficiently designing and customizing the various software system configurations that manage and optimize the different stages of a trial is critical. These configurations can range from basic study setup to more advanced features like data collection customization and integration with other systems. Clario uses data from multiple sources to build specific software configurations for clinical trials. The traditional workflow involved manual extraction of necessary data from individual forms. These forms contained vital information about exams, visits, conditions, and interventions. Additionally, the process required incorporating study-related information such as study plans, participation criteria, sponsors, collaborators, and standardized exam protocols from multiple enterprise data providers.
The manual nature of this process created several challenges:

Manual data extraction – Team members manually review PDF documents to extract structured data.
Transcription challenges – The manual transfer of data from source forms into configuration documents presents opportunities for improvement, particularly in reducing transcription inconsistencies and enhancing standardization.
Version control challenges – When studies required iterations or updates, maintaining consistency between documents and systems became increasingly complicated.
Fragmented information flow – Data existed in disconnected silos, including PDFs, study detail database records, and other standalone documents.
Software build timelines – The configuration process directly impacted the timeline for generating the necessary software builds.

For clinical trials where timing is essential and accuracy is non-negotiable, Clario has implemented rigorous quality control measures to minimize the risks associated with manual processes. While these efforts are substantial, they underscore a business challenge of ensuring precision and consistency across complex study configurations.
Solution overview
To address the business challenge, Clario developed a generative AI-powered solution on AWS that it refers to as the Genie AI Service. This solution uses the capabilities of large language models (LLMs), specifically Anthropic’s Claude 3.7 Sonnet on Amazon Bedrock. The process is orchestrated using Amazon Elastic Container Service (Amazon ECS) to transform how Clario handles software configuration for clinical trials.
Clario’s approach uses a custom data parser using Amazon Bedrock to automatically structure information from PDF transmittal forms into validated tables. The Genie AI Service centralizes data from multiple sources, including transmittal forms, study details, standard exam protocols, and additional configuration parameters. An interactive review dashboard helps stakeholders verify AI-extracted information and make necessary corrections before finalizing the validated configuration. Post-validation, the system automatically generates a Software Configuration Specification (SCS) document as a comprehensive record of the software configuration. The process culminates with generative AI-powered XML generation, which is then released into Clario’s proprietary medical imaging software for study builds, creating an end-to-end solution that drastically reduces manual effort while improving accuracy in clinical trial software configurations.
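A minimal sketch of the extraction step, assuming the Amazon Bedrock Converse API is called from Python, is shown below; the model identifier, prompt wording, and form text are placeholders rather than Clario's actual implementation.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

transmittal_text = "..."  # text extracted from an uploaded PDF transmittal form (placeholder)

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [{
            "text": (
                "Extract the exams, visits, conditions, and interventions from the "
                "following transmittal form and return them as a JSON table:\n\n"
                + transmittal_text
            )
        }],
    }],
    inferenceConfig={"maxTokens": 2048, "temperature": 0},
)

structured_output = response["output"]["message"]["content"][0]["text"]
print(structured_output)  # reviewed in the dashboard before the SCS document is generated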
The Genie AI Service architecture consists of several interconnected components that work together in a clear workflow sequence, as illustrated in the following diagram.

The workflow consists of the following steps:

Initiate the study and collect data.
Extract the data using Amazon Bedrock.
Review and validate the AI-generated output.
Generate essential documentation and code artifacts.

In the following sections, we discuss the workflow steps in more detail.
Study initiation and data collection
The workflow begins with gathering essential study information through multiple integrated steps:

Study code lookup – Users begin by entering a study code that uniquely identifies the clinical trial.
API integration with study database – The study lookup operation makes an API call to fetch study details such as study plan, participation criteria, sponsors, collaborators, and more from the study database, establishing the foundation for the configuration.
Transmittal form processing – Users upload transmittal forms containing study parameters such as information about exams, visits, conditions, and interventions to the Genie AI Service using the web UI through a secure AWS Direct Connect network.
Data structuring – The system organizes information into key categories:

Visit information (scheduling, procedures)
Exam specifications (protocols, requirements)
Study-specific custom fields (vitals, dosing information, and so on)

Data extraction
The solution uses Anthropic’s Claude Sonnet on Amazon Bedrock through API calls to perform the following actions:

Parse and extract structured data from transmittal forms
Identify key fields and tables within the documents
Organize the information into standardized formats
Apply domain-specific rules to properly categorize clinical trial visits
Extract and validate demographic fields while maintaining proper data types and formats
Handle specialized formatting rules for medical imaging parameters
Manage document-specific adaptations (such as different processing for phantom vs. subject scans)

Review and validation
The solution provides a comprehensive review interface for stakeholders to validate and refine the AI-generated configurations through the following steps:

Interactive review process – Reviewers access the Genie AI Service interface to perform the following actions:

Examine the AI-generated output
Make corrections or adjustments to the data as necessary
Add comments and highlight adjustments made as a feedback mechanism
Validate the configuration accuracy

Data storage – Reviewed and approved software configurations are saved to Clario’s Genie Database, creating a central, authoritative, auditable source of configuration data

Document and code generation
After the configuration data is validated, the solution automates the creation of essential documentation and code artifacts through a structured workflow:

SCS document creation – Reviewers access the Genie AI Service interface to finalize the software configurations by generating an SCS document using the validated data.
XML generation workflow – After the SCS document is finalized, the workflow completes the following steps:

The workflow fetches the configuration details from the Genie database.
The SCSXMLConverter, an internal microservice of the Genie AI Service, processes both SCS document and study configurations. This microservice invokes Anthropic’s Claude 3.7 Sonnet through API calls to generate a standardized SCS XML file.
Validation checks are performed on the generated XML to make sure it meets the structural and content requirements of Clario’s clinical study software.
The final XML output is created for use in the software build process with detailed logs of the conversion process.

Benefits and results
The solution enhanced data extraction quality while providing teams with a streamlined dashboard that accelerates the validation process.
By implementing consistent extraction logic and minimizing manual data entry, the solution has reduced potential transcription errors. Additionally, built-in validation safeguards now help identify potential issues early in the process, preventing problems from propagating downstream.
The solution has also transformed how teams collaborate. By providing centralized review capabilities and giving cross-functional teams access to the same solution, communication has become more transparent and efficient. The standardized workflows have created clearer channels for information sharing and decision-making.
From an operational perspective, the new approach offers greater scalability across studies while supporting iterations as studies evolve. This standardization has laid a strong foundation for expanding these capabilities to other operational areas within the organization.
Importantly, the solution maintains strong compliance and auditability through complete audit trails and reproducible processes. Key outcomes include:

Study configuration execution time has been reduced while improving overall quality.
Teams can focus more on value-added activities like study design optimization.

Lessons learned
Clario’s journey to transform software configuration through generative AI has taught them valuable lessons that will inform future initiatives.
Generative AI implementation insights
The following key learnings emerged specifically around working with generative AI technology:

Prompt engineering is foundational – Few-shot prompting with domain knowledge is essential. The team discovered that providing detailed examples and explicit business rules in the prompts was necessary for success. Rather than simple instructions, Clario’s prompts include comprehensive business logic, edge case handling, and exact output formatting requirements to guide the AI’s understanding of clinical trial configurations.
Prompt engineering requires iteration – The quality of data extraction depends heavily on well-crafted prompts that encode domain expertise. Clario’s team spent significant time refining these prompts through multiple iterations and testing different approaches to capture complex business rules about visit sequencing, demographic requirements, and field formatting.
Human oversight within a validation workflow – Although generative AI dramatically accelerates extraction, human review remains necessary within a structured validation workflow. The Genie AI Service interface was specifically designed to highlight potential inconsistencies and provide convenient editing capabilities for reviewers to apply their expertise efficiently.

Integration challenges
Some important challenges surfaced during system integration:

Two-system synchronization – One of the biggest challenges has been verifying that changes made in the SCS documents are reflected in the solution. This bidirectional integration is still being refined.
System transition strategy – Moving from the proof-of-concept scripts to fully integrated solution functionality requires careful planning to avoid disruption.

Process adaptation
The team identified the following key factors for successful process change:

Phased Implementation – Clario rolled out the solution in stages, beginning with pilot teams who could validate functionality and serve as internal advocates to help teams transition from familiar document-centric workflows to the new solution.
Workflow optimization is iterative – The initial workflow design has evolved based on user feedback and real-world usage patterns.
Training requirements – Even with an intuitive interface, proper training makes sure users can take full advantage of the solution’s capabilities.

Technical considerations
Implementation revealed several important technical aspects to consider:

Data formatting variability – Transmittal forms vary significantly across different therapeutic areas (oncology, neurology, and so on) and even between studies within the same area. This variability creates challenges when the AI model encounters form structures or terminology it hasn’t seen before. Clario’s prompt engineering requires continuous iteration as they discover new patterns and edge cases in transmittal forms, creating a feedback loop where human experts identify missed or misinterpreted data points that inform future prompt refinements.
Performance optimization – Processing times for larger documents required optimization to maintain a smooth user experience.
Error handling robustness – Building resilient error handling into the generative AI processing flow was essential for production reliability.

Strategic insights
The project yielded valuable strategic lessons that will inform future initiatives:

Start with well-defined use cases – Beginning with the software configuration process gave Clario a concrete, high-value target for demonstrating generative AI benefits.
Build for extensibility – Designing the architecture with future expansion in mind has positioned them well for extending these capabilities to other areas.
Measure concrete outcomes – Tracking specific metrics like processing time and error rates has helped quantify the return on the generative AI investment.

These lessons have been invaluable for refining the current solution and informing the approach to future generative AI implementations across the organization.
Conclusion
The transformation of the software configuration process through generative AI represents more than just a technical achievement for Clario—it reflects a fundamental shift in how the company approaches data processing and knowledge work in clinical trials. By combining the pattern recognition and processing power of LLMs available in Amazon Bedrock with human expertise for validation and decision-making, Clario created a hybrid workflow that delivers the best of both worlds, orchestrated through Amazon ECS for reliable, scalable execution.
The success of this initiative demonstrates how generative AI on AWS is a practical tool that can deliver tangible benefits. By focusing on specific, well-defined processes with clear pain points, Clario has implemented the solution Genie AI Service powered by Amazon Bedrock in a way that creates immediate value while establishing a foundation for broader transformation.
For organizations considering similar transformations, the experience highlights the importance of starting with concrete use cases, building for human-AI collaboration and maintaining a focus on measurable business outcomes. With these principles in mind, generative AI can become a genuine catalyst for organizational evolution.

About the authors
Kim Nguyen serves as the Sr Director of Data Science at Clario, where he leads a team of data scientists in developing innovative AI/ML solutions for the healthcare and clinical trials industry. With over a decade of experience in clinical data management and analytics, Kim has established himself as an expert in transforming complex life sciences data into actionable insights that drive business outcomes. His career journey includes leadership roles at Clario and Gilead Sciences, where he consistently pioneered data automation and standardization initiatives across multiple functional teams. Kim holds a Master’s degree in Data Science and Engineering from UC San Diego and a Bachelor’s degree from the University of California, Berkeley, providing him with the technical foundation to excel in developing predictive models and data-driven strategies. Based in San Diego, California, he leverages his expertise to drive forward-thinking approaches to data science in the clinical research space.
Shyam Banuprakash serves as the Senior Vice President of Data Science and Delivery at Clario, where he leads complex analytics programs and develops innovative data solutions for the medical imaging sector. With nearly 12 years of progressive experience at Clario, he has demonstrated exceptional leadership in data-driven decision making and business process improvement. His expertise extends beyond his primary role, as he contributes his knowledge as an Advisory Board Member for both Modal and UC Irvine’s Customer Experience Program. Shyam holds a Master of Advanced Study in Data Science and Engineering from UC San Diego, complemented by specialized training from MIT in data science and big data analytics. His career exemplifies the powerful intersection of healthcare, technology, and data science, positioning him as a thought leader in leveraging analytics to transform clinical research and medical imaging.
Praveen Haranahalli is a Senior Solutions Architect at Amazon Web Services (AWS), where he architects secure, scalable cloud solutions and provides strategic guidance to diverse enterprise customers. With nearly two decades of IT experience including over a decade specializing in cloud computing, Praveen has delivered transformative implementations across multiple industries. As a trusted technical advisor, Praveen partners with customers to implement robust DevSecOps pipelines, establish comprehensive security guardrails, and develop innovative AI/ML solutions. He is passionate about solving complex business challenges through cutting-edge cloud architectures and empowering organizations to achieve successful digital transformations powered by artificial intelligence and machine learning.

Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Seri …

How do you build a language model that grows in capacity but keeps the computation for each token almost unchanged? The Inclusion AI team from the Ant Group is pushing sparse large models in a methodical way by releasing Ling 2.0. Ling 2.0 is a reasoning based language model family built on the idea that each activation should translate directly into stronger reasoning behavior. It is one of the latest approaches that shows how to keep activation small while moving from 16B to 1T without rewriting the recipe. The series has three versions, Ling mini 2.0 at 16B total with 1.4B activated, Ling flash 2.0 in the 100B class with 6.1B activated, and Ling 1T with 1T total and about 50B active per token.

Sparse MoE as the central design

Every Ling 2.0 model uses the same sparse Mixture of Experts layer. Each layer has 256 routed experts and one shared expert. The router picks 8 routed experts for every token, and the shared expert is always on, so about 9 of the 257 experts are used per token. That is roughly 3.5 percent activation, which matches the 1/32 activation ratio. The research team reports about 7 times efficiency compared to an equivalent dense model, because you train and serve only a small part of the network per token while keeping a very large parameter pool.
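The routing pattern described above can be sketched in a few lines of PyTorch; this is a generic top-8-of-256 router with an always-on shared expert, not the released Ling 2.0 code, and the hidden size is an arbitrary choice.

import torch

hidden, n_routed, top_k = 512, 256, 8           # hidden size is an arbitrary choice
router = torch.nn.Linear(hidden, n_routed)       # scores each token against the routed experts

x = torch.randn(4, hidden)                       # a batch of 4 token representations
scores = torch.sigmoid(router(x))                # Ling 2.0 reports sigmoid scoring without an aux loss
weights, expert_ids = scores.topk(top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize the 8 selected experts per token

# Per token: 8 routed experts plus 1 shared expert out of 257 total
print(expert_ids.shape)                           # torch.Size([4, 8])
print((top_k + 1) / (n_routed + 1))               # ~0.035, the roughly 3.5 percent activation in the text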

https://arxiv.org/abs/2510.22115

Ling 2.0 brings coordinated advances across four layers of the stack, model architecture, pre training, post training, and the underlying FP8 infrastructure:

Model architecture: The architecture is chosen using Ling Scaling Laws, not by trial and error. To support the Ling Scaling Laws, the team runs what they call the Ling Wind Tunnel, a fixed set of small MoE runs trained under the same data and routing rules, then fitted to power laws to predict loss, activation, and expert balance at much larger sizes. This gives them a low cost way to choose 1/32 activation, 256 routed experts, and 1 shared expert before committing GPUs to 1T scale. Routing is aux-loss-free with sigmoid scoring, and the stack uses QK Norm, MTP loss, and partial RoPE to keep depth stable. Because the same law picked the shape, Ling mini 2.0, Ling flash 2.0, and Ling 1T all keep a consistent design across sizes.

Pre training: The series is trained on more than 20T tokens, starting with 4K context and a mix in which reasoning heavy sources such as math and code gradually increase to almost half of the corpus. A later mid training stage extends context to about 32K on a selected 150B token slice, then injects another 600B tokens of high quality chain of thought, before finally stretching to 128K with YaRN while preserving short context quality. This pipeline ensures that long context and reasoning are introduced early, not just added at the SFT step. 

Post training: Alignment is separated into a capability pass and a preference pass. First, Decoupled Fine Tuning teaches the model to switch between quick responses and deep reasoning through different system prompts, then an evolutionary CoT stage expands and diversifies chains, and finally a sentence level policy optimization with a Group Arena Reward aligns outputs to human judgments at fine granularity. This staged alignment is what lets a non thinking base reach strong math, code and instruction performance without inflating every answer.

Infrastructure: Ling 2.0 trains natively in FP8 with safeguards, keeping the loss curve within a small gap of BF16 while gaining about 15% utilization on the reported hardware. The larger speedups, around 40 percent, come from heterogeneous pipeline parallelism, interleaved one forward one backward execution and partitioning that is aware of the MTP block, not from precision alone. Together with Warmup Stable Merge, which replaces LR decay by merging checkpoints, this systems stack makes 1T scale runs practical on existing clusters. 
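As a rough illustration of the checkpoint-merging idea behind Warmup Stable Merge (the exact merging scheme is not described here, so an equal-weight average over late checkpoints is assumed):

import torch

def merge_checkpoints(state_dicts):
    # Equal-weight average of parameter tensors from several late-training checkpoints (assumption)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Usage sketch: average the last few saved checkpoints instead of decaying the learning rate.
# checkpoints = [torch.load(p, map_location="cpu") for p in ["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"]]
# model.load_state_dict(merge_checkpoints(checkpoints))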

Understanding the Results

Evaluations are consistent in pattern: small activation MoE models deliver competitive quality while keeping per token compute low. Ling mini 2.0 has 16B total parameters, activates 1.4B per token, and is reported to perform in the 7 to 8B dense band. Ling flash 2.0 keeps the same 1/32 activation recipe, has 100B total parameters, and activates 6.1B per token. Ling 1T is the flagship non thinking model, with 1T total parameters and about 50B active per token, preserving the 1/32 sparsity and extending the same Ling Scaling Laws to trillion scale.

https://arxiv.org/abs/2510.22115

https://arxiv.org/abs/2510.22115

https://arxiv.org/abs/2510.22115

Key Takeaways

Ling 2.0 is built around a 1/32 activation MoE architecture, selected using Ling Scaling Laws so that 256 routed experts plus 1 shared expert stay optimal from 16B up to 1T.

Ling mini 2.0 has 16B total parameters with 1.4B activated per token and is reported to match 7B to 8B dense models while generating at more than 300 tokens per second in simple QA on H20.

Ling flash 2.0 keeps the same recipe, has 6.1B active parameters and sits in the 100B range, giving a higher capacity option without increasing per token compute.

Ling 1T exposes the full design, 1T total parameters with about 50B active per token, 128K context, and an Evo CoT plus LPO style post training stack to push efficient reasoning.

Across all sizes, efficiency gains above 7 times over dense baselines come from the combination of sparse activation, FP8 training, and a shared training schedule, so quality scales predictably without re tuning compute.

Editorial Comments

This release demonstrates a complete sparse MoE stack. Ling Scaling Laws identify a 1/32 activation as optimal, the architecture locks in 256 routed experts plus 1 shared expert, and the same shape is used from 16B to 1T. Training, context extension and preference optimization are all aligned to that choice, so small activation does not block math, code or long context, and FP8 plus heterogeneous pipelines keep cost in a practical range. It is a clear signal that trillion scale reasoning can be organized around fixed sparsity instead of growing dense compute.

Check out the Weights on HF, Repo and Paper.

How to Build Ethically Aligned Autonomous Agents through Value-Guided …

In this tutorial, we explore how we can build an autonomous agent that aligns its actions with ethical and organizational values. We use open-source Hugging Face models running locally in Colab to simulate a decision-making process that balances goal achievement with moral reasoning. Through this implementation, we demonstrate how we can integrate a “policy” model that proposes actions and an “ethics judge” model that evaluates and aligns them, allowing us to see value alignment in practice without depending on any APIs. Check out the FULL CODES here.

!pip install -q transformers torch accelerate sentencepiece

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

def generate_seq2seq(model, tokenizer, prompt, max_new_tokens=128):
    # Encode the prompt and sample a completion from a sequence-to-sequence model
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # keep inputs on the model's device
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.pad_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def generate_causal(model, tokenizer, prompt, max_new_tokens=128):
    # Causal models echo the prompt, so strip it from the decoded text before returning
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # keep inputs on the model's device
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id if tokenizer.eos_token_id is not None else tokenizer.pad_token_id,
        )
    full_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return full_text[len(prompt):].strip()

We begin by setting up our environment and importing essential libraries from Hugging Face. We define two helper functions that generate text using sequence-to-sequence and causal models. This allows us to easily produce both reasoning-based and creative outputs later in the tutorial. Check out the FULL CODES here.

policy_model_name = "distilgpt2"
judge_model_name = "google/flan-t5-small"

policy_tokenizer = AutoTokenizer.from_pretrained(policy_model_name)
policy_model = AutoModelForCausalLM.from_pretrained(policy_model_name)

judge_tokenizer = AutoTokenizer.from_pretrained(judge_model_name)
judge_model = AutoModelForSeq2SeqLM.from_pretrained(judge_model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
policy_model = policy_model.to(device)
judge_model = judge_model.to(device)

if policy_tokenizer.pad_token is None:
    policy_tokenizer.pad_token = policy_tokenizer.eos_token
if judge_tokenizer.pad_token is None:
    judge_tokenizer.pad_token = judge_tokenizer.eos_token

We load two small open-source models—distilgpt2 as our action generator and flan-t5-small as our ethics reviewer. We prepare both models and tokenizers for CPU or GPU execution, ensuring smooth performance in Colab. This setup provides the foundation for the agent’s reasoning and ethical evaluation. Check out the FULL CODES here.

class EthicalAgent:
    def __init__(self, policy_model, policy_tok, judge_model, judge_tok):
        self.policy_model = policy_model
        self.policy_tok = policy_tok
        self.judge_model = judge_model
        self.judge_tok = judge_tok

    def propose_actions(self, user_goal, context, n_candidates=3):
        base_prompt = (
            "You are an autonomous operations agent. "
            "Given the goal and context, list a specific next action you will take:\n\n"
            f"Goal: {user_goal}\nContext: {context}\nAction:"
        )
        candidates = []
        for _ in range(n_candidates):
            action = generate_causal(self.policy_model, self.policy_tok, base_prompt, max_new_tokens=40)
            action = action.split("\n")[0]
            candidates.append(action.strip())
        return list(dict.fromkeys(candidates))

    def judge_action(self, action, org_values):
        judge_prompt = (
            "You are the Ethics & Compliance Reviewer.\n"
            "Evaluate the proposed agent action.\n"
            "Return fields:\n"
            "RiskLevel (LOW/MED/HIGH),\n"
            "Issues (short bullet-style text),\n"
            "Recommendation (approve / modify / reject).\n\n"
            f"ORG_VALUES:\n{org_values}\n\n"
            f"ACTION:\n{action}\n\n"
            "Answer in this format:\n"
            "RiskLevel: ...\nIssues: ...\nRecommendation: ..."
        )
        verdict = generate_seq2seq(self.judge_model, self.judge_tok, judge_prompt, max_new_tokens=128)
        return verdict.strip()

    def align_action(self, action, verdict, org_values):
        align_prompt = (
            "You are an Ethics Alignment Assistant.\n"
            "Your job is to FIX the proposed action so it follows ORG_VALUES.\n"
            "Keep it effective but safe, legal, and respectful.\n\n"
            f"ORG_VALUES:\n{org_values}\n\n"
            f"ORIGINAL_ACTION:\n{action}\n\n"
            f"VERDICT_FROM_REVIEWER:\n{verdict}\n\n"
            "Rewrite ONLY IF NEEDED. If original is fine, return it unchanged. "
            "Return just the final aligned action:"
        )
        aligned = generate_seq2seq(self.judge_model, self.judge_tok, align_prompt, max_new_tokens=128)
        return aligned.strip()

We define the core agent class that generates, evaluates, and refines actions. Here, we design methods for proposing candidate actions, evaluating their ethical compliance, and rewriting them to align with values. This structure helps us modularize reasoning, judgment, and correction into clear functional steps. Check out the FULL CODES here.

    # EthicalAgent (continued): the end-to-end decision pipeline
    def decide(self, user_goal, context, org_values, n_candidates=3):
        proposals = self.propose_actions(user_goal, context, n_candidates=n_candidates)
        scored = []
        for act in proposals:
            verdict = self.judge_action(act, org_values)
            aligned_act = self.align_action(act, verdict, org_values)
            scored.append({"original_action": act, "review": verdict, "aligned_action": aligned_act})

        def extract_risk(vtext):
            # map the reviewer's RiskLevel field to a sortable score, unknown is treated as worst
            for line in vtext.splitlines():
                if "RiskLevel" in line:
                    lvl = line.split(":", 1)[-1].strip().upper()
                    if "LOW" in lvl:
                        return 0
                    if "MED" in lvl:
                        return 1
                    if "HIGH" in lvl:
                        return 2
            return 3

        scored_sorted = sorted(scored, key=lambda x: extract_risk(x["review"]))
        final_choice = scored_sorted[0]
        report = {
            "goal": user_goal,
            "context": context,
            "org_values": org_values,
            "candidates_evaluated": scored,
            "final_plan": final_choice["aligned_action"],
            "final_plan_rationale": final_choice["review"],
        }
        return report

We implement the complete decision-making pipeline that links generation, judgment, and alignment. We assign risk scores to each candidate action and automatically choose the most ethically aligned one. This section captures how the agent can self-assess and improve its choices before finalizing an action. Check out the FULL CODES here.

org_values_text = (
    "- Respect privacy; do not access personal data without consent.\n"
    "- Follow all laws and safety policies.\n"
    "- Avoid discrimination, harassment, or harmful manipulation.\n"
    "- Be transparent and truthful with stakeholders.\n"
    "- Prioritize user well-being and long-term trust over short-term gain."
)

demo_goal = "Increase customer adoption of the new financial product."
demo_context = (
    "The agent works for a bank outreach team. The target customers are small family businesses. "
    "Regulations require honest disclosure of risks and fees. Cold-calling minors or lying about terms is illegal."
)

agent = EthicalAgent(policy_model, policy_tokenizer, judge_model, judge_tokenizer)
report = agent.decide(demo_goal, demo_context, org_values_text, n_candidates=4)

def pretty_report(r):
    print("=== ETHICAL DECISION REPORT ===")
    print(f"Goal: {r['goal']}\n")
    print(f"Context: {r['context']}\n")
    print("Org Values:")
    print(r["org_values"])
    print("\n--- Candidate Evaluations ---")
    for i, cand in enumerate(r["candidates_evaluated"], 1):
        print(f"\nCandidate {i}:")
        print("Original Action:")
        print("  ", cand["original_action"])
        print("Ethics Review:")
        print(cand["review"])
        print("Aligned Action:")
        print("  ", cand["aligned_action"])
    print("\n--- Final Plan Selected ---")
    print(r["final_plan"])
    print("\nWhy this plan is acceptable (review snippet):")
    print(r["final_plan_rationale"])

pretty_report(report)

We define organizational values, create a real-world scenario, and run the ethical agent to generate its final plan. Finally, we print a detailed report showing candidate actions, reviews, and the selected ethical decision. Through this, we observe how our agent integrates ethics directly into its reasoning process.

In conclusion, we clearly understand how an agent can reason not only about what to do but also about whether to do it. We witness how the system learns to identify risks, correct itself, and align its actions with human and organizational principles. This exercise helps us realize that value alignment and ethics are not abstract ideas but practical mechanisms we can embed into agentic systems to make them safer, fairer, and more trustworthy.

Check out the FULL CODES here.
The post How to Build Ethically Aligned Autonomous Agents through Value-Guided Reasoning and Self-Correcting Decision-Making Using Open-Source Models appeared first on MarkTechPost.

IBM AI Team Releases Granite 4.0 Nano Series: Compact and Open-Source Small Models Built for AI at the Edge

Small models are often blocked by poor instruction tuning, weak tool use formats, and missing governance. The IBM AI team released Granite 4.0 Nano, a small model family that targets local and edge inference with enterprise controls and open licensing. The family includes 8 models in two sizes, 350M and about 1B, with both hybrid SSM and transformer variants, each in base and instruct. Granite 4.0 Nano series models are released under an Apache 2.0 license with native architecture support on popular runtimes such as vLLM, llama.cpp, and MLX.

https://huggingface.co/blog/ibm-granite/granite-4-nano

What is new in the Granite 4.0 Nano series?

Granite 4.0 Nano consists of four model lines and their base counterparts. Granite 4.0 H 1B uses a hybrid SSM based architecture and is about 1.5B parameters. Granite 4.0 H 350M uses the same hybrid approach at 350M. For maximum runtime portability IBM also provides Granite 4.0 1B and Granite 4.0 350M as transformer versions.
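
To make that portability concrete, here is a minimal local-inference sketch using Hugging Face transformers. The repository id below is an assumption based on IBM's naming pattern, so verify the exact identifier on the Granite 4.0 Nano model cards before running it.

# Minimal local-inference sketch for a Granite 4.0 Nano instruct model with transformers.
# NOTE: the repo id below is assumed from IBM's naming pattern, check the model card for the real one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-350m"  # hypothetical identifier for the 350M hybrid variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "List three risks of running agents at the edge."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))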

Each row below lists the Granite release, the sizes in that release, the architecture, the license and governance, and key notes.

- Granite 13B, first watsonx Granite models: 13B base, 13B instruct, later 13B chat; decoder only transformer, 8K context; IBM enterprise terms, client protections; first public Granite models for watsonx, curated enterprise data, English focus.
- Granite Code Models (open): 3B, 8B, 20B, 34B code, base and instruct; decoder only transformer, 2 stage code training on 116 languages; Apache 2.0; first fully open Granite line, for code intelligence, paper 2405.04324, available on HF and GitHub.
- Granite 3.0 Language Models: 2B and 8B, base and instruct; transformer, 128K context for instruct; Apache 2.0; business LLMs for RAG, tool use, summarization, shipped on watsonx and HF.
- Granite 3.1 Language Models (HF): 1B A400M, 3B A800M, 2B, 8B; transformer, 128K context; Apache 2.0; size ladder for enterprise tasks, both base and instruct, same Granite data recipe.
- Granite 3.2 Language Models (HF): 2B instruct, 8B instruct; transformer, 128K context, better long prompt handling; Apache 2.0; iterative quality bump on 3.x, keeps business alignment.
- Granite 3.3 Language Models (HF): 2B base, 2B instruct, 8B base, 8B instruct, all 128K; decoder only transformer; Apache 2.0; latest 3.x line on HF before 4.0, adds FIM and better instruction following.
- Granite 4.0 Language Models: 3B micro, 3B H micro, 7B H tiny, 32B H small, plus transformer variants; hybrid Mamba 2 plus transformer for H, pure transformer for compatibility; Apache 2.0, ISO 42001, cryptographically signed; start of the hybrid generation, lower memory, agent friendly, same governance across sizes.
- Granite 4.0 Nano Language Models: 1B H, 1B H instruct, 350M H, 350M H instruct, 2B transformer, 2B transformer instruct, 0.4B transformer, 0.4B transformer instruct, 8 models in total; H models are hybrid SSM plus transformer, non H are pure transformer; Apache 2.0, ISO 42001, signed, same 4.0 pipeline; smallest Granite models, made for edge, local and browser, run on vLLM, llama.cpp, MLX, watsonx.

Table created by Marktechpost.com

Architecture and training

The H variants interleave SSM layers with transformer layers. This hybrid design reduces memory growth versus pure attention, while preserving the generality of transformer blocks. The Nano models did not use a reduced data pipeline. They were trained with the same Granite 4.0 methodology and more than 15T tokens, then instruction tuned to deliver solid tool use and instruction following. This carries over strengths from the larger Granite 4.0 models to sub 2B scales.

Benchmarks and competitive context

IBM compares Granite 4.0 Nano with other under 2B models, including Qwen, Gemma, and LiquidAI LFM. Reported aggregates show stronger capabilities across general knowledge, math, code, and safety at similar parameter budgets. On agent tasks, the models outperform several peers on IFEval and on the Berkeley Function Calling Leaderboard v3.

https://huggingface.co/blog/ibm-granite/granite-4-nano

Key Takeaways

IBM released 8 Granite 4.0 Nano models in two sizes, 350M and about 1B, in hybrid SSM and transformer variants, in base and instruct, all under Apache 2.0.

The hybrid H models, Granite 4.0 H 1B at about 1.5B parameters and Granite 4.0 H 350M at about 350M, reuse the Granite 4.0 training recipe on more than 15T tokens, so capability is inherited from the larger family and not a reduced data branch.

IBM team reports that Granite 4.0 Nano is competitive with other sub 2B models such as Qwen, Gemma and LiquidAI LFM on general, math, code and safety, and that it outperforms on IFEval and BFCLv3 which matter for tool using agents.

All Granite 4.0 models, including Nano, are cryptographically signed, ISO 42001 certified and released for enterprise use, which gives provenance and governance that typical small community models do not provide.

The models are available on Hugging Face and IBM watsonx.ai with runtime support for vLLM, llama.cpp and MLX, which makes local, edge and browser level deployments realistic for early AI engineers and software teams.

Editorial Comments

IBM is doing the right thing here, it is taking the same Granite 4.0 training pipeline, the same 15T token scale, the same hybrid Mamba 2 plus transformer architecture, and pushing it down to 350M and about 1B so that edge and on device workloads can use the exact governance and provenance story that the larger Granite models already have. The models are Apache 2.0, ISO 42001 aligned, cryptographically signed, and already runnable on vLLM, llama.cpp and MLX. Overall, this is a clean and auditable way to run small LLMs.

Check out the Model Weights on HF and Technical details.
The post IBM AI Team Releases Granite 4.0 Nano Series: Compact and Open-Source Small Models Built for AI at the Edge appeared first on MarkTechPost.

Reduce CAPTCHAs for AI agents browsing the web with Web Bot Auth (Preview)

AI agents need to browse the web on your behalf. When your agent visits a website to gather information, complete a form, or verify data, it encounters the same defenses designed to stop unwanted bots: CAPTCHAs, rate limits, and outright blocks.
Today, we are excited to share that AWS has a solution. Amazon Bedrock AgentCore Browser, our secure, cloud-based browser for AI agents to interact with websites, now supports Web Bot Auth (in preview), a draft IETF protocol that gives agents verifiable cryptographic identities.
CAPTCHA friction
Customers tell us that CAPTCHA friction is one of the biggest obstacles to reliable browser-based agentic workflows. Your agent halts mid-task, waiting for human intervention to solve a puzzle that proves you’re not a bot – except your agent is a bot, and that’s the point. CAPTCHAs exist for good reason. Websites face constant challenges protecting their content, inventory and reviews. Web Application Firewalls (WAFs) and bot detection services protect these sites, but they treat nearly all automated traffic as suspicious because they have no reliable way to distinguish legitimate agents from malicious ones.
Some automation providers try to solve CAPTCHAs programmatically – using computer vision models to read distorted text or clicking through image grids until the puzzle clears. This approach is brittle, expensive, and bypasses controls that domain owners intended for their content. Other approaches rely on IP allowlists or User-Agent strings. IP allowlists break when you run agents in cloud environments where addresses change frequently. User-Agent strings can be spoofed by anyone, so they provide no verification, and well-trusted strings invite impersonation. Both methods require manual coordination with every website you want to access, which does not scale.
Web Bot Auth: Cryptographic identity for agents browsing the web
Web Bot Auth is a draft IETF protocol that gives agents verifiable cryptographic identities. When you enable Web Bot Auth in AgentCore Browser, we issue cryptographic credentials that websites can verify. The agent presents these credentials with every request. The WAF may now additionally check the signature, confirm it matches a trusted directory, and allow the request through if verified bots are allowed by the domain owner and other WAF checks are clear.
AgentCore is working with Cloudflare, HUMAN Security, and Akamai Technologies to support this verification flow. These providers protect millions of websites. When you create an AgentCore Browser with signing enabled in the configuration, we automatically register your agent’s signature directory with these providers. Many domains already configure their WAFs to allow verified bots by default, so on those domains you can see immediate CAPTCHA reduction without additional setup.
How domain owners control access
WAF providers give website owners three levels of control using Web Bot Auth:

Block all bots – Some sites choose to block automated traffic entirely. Web Bot Auth does not bypass this – if a domain wants no automation, that choice is respected.
Allow verified bots – Many domains configure their WAF to allow any bot that presents a valid cryptographic signature. This is the default policy for a growing number of sites protected by Cloudflare, HUMAN Security, and Akamai Technologies. When you enable signing, as a parameter in the AgentCore Browser configuration, this policy will apply to your agents.
Allow specific verified bots to conduct only specific actions – For example, a financial services company automating vendor portal access can share its unique directory with those vendors. The vendor can create rules like “allow FinCo agents at 100 requests per minute, don’t allow them to create new accounts, and block all other signed agents.” This gives websites granular control while preserving the benefits of cryptographic verification.

Today’s preview release of Web Bot Auth support in AgentCore Browser helps reduce friction with CAPTCHAs on domains that allow verified bots, by making your agent appear as a verified bot. Once the Web Bot Auth protocol is finalized, AgentCore intends to transition to customer-specific keys, so AgentCore users can use the tier of control that allows only specified verified bots.
Using the Web Bot Auth protocol
To enable the browser to sign requests using the Web Bot Auth protocol, create a browser tool with the browserSigning configuration:

import boto3

cp_client = boto3.client('bedrock-agentcore-control')

response = cp_client.create_browser(
    name="signed_browser",
    description="Browser tool with Web Bot Auth enabled",
    networkConfiguration={
        "networkMode": "PUBLIC"
    },
    executionRoleArn="arn:aws:iam::123456789012:role/AgentCoreExecutionRole",
    browserSigning={
        "enabled": True
    }
)

browserId = response['browserId']

Pass the browser identifier to your agent framework. Here is an example using Strands Agents:

from strands import Agent
from strands_tools.browser import AgentCoreBrowser

agent_core_browser = AgentCoreBrowser(
    region="us-west-2",
    identifier=browserId
)

strands_agent = Agent(
    tools=[agent_core_browser.browser],
    model="anthropic.claude-4-5-haiku-20251001-v1:0",
    system_prompt="You are a website analyst. Use the browser tool efficiently."
)

result = strands_agent("Analyze the website at https://example.com/")

The agent is now configured to use the new browser tool that signs every HTTP request. Websites protected by Cloudflare, HUMAN Security, or Akamai Technologies can verify the signature and allow the request through without presenting a CAPTCHA, if the domain owner allows verified bots.
Protocol development
The Web Bot Auth protocol is gaining industry momentum because it solves a real problem: legitimate automation is indistinguishable from abuse without verifiable identity. You can read the draft protocol specification, HTTP Message Signatures for automated traffic Architecture. The architecture defines how agents generate signatures, how WAFs verify them, and how key directories enable discovery. Amazon is working with Cloudflare and many popular WAF providers to help finalize the customer-specific key directory format and work towards finalizing the draft.
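For intuition, the sketch below shows the general shape of the signing side using an Ed25519 key, in the spirit of HTTP Message Signatures. The covered components, header serialization, and directory URL are simplified placeholders rather than the normative wire format, and AgentCore Browser performs the real signing for you when the feature is enabled.

# Illustrative sketch only: sign an outgoing request with an Ed25519 key, roughly in the
# spirit of HTTP Message Signatures that Web Bot Auth builds on.
# The signature base and headers below are simplified and NOT the exact wire format.
import base64, time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()  # in practice, a key published in a trusted key directory
created = int(time.time())
key_directory = "https://example-agent-directory.test/key-directory"  # hypothetical directory URL

# Simplified signature base over the components a verifier would reconstruct
signature_base = (
    '"@authority": example.com\n'
    f'"signature-agent": {key_directory}\n'
    f'"@signature-params": ("@authority" "signature-agent");created={created};alg="ed25519"'
)
signature = base64.b64encode(private_key.sign(signature_base.encode())).decode()

headers = {
    "Signature-Agent": key_directory,
    "Signature-Input": f'sig1=("@authority" "signature-agent");created={created};alg="ed25519"',
    "Signature": f"sig1=:{signature}:",
}
print(headers)  # a WAF that trusts the directory can fetch the public key and verify the signature
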
Conclusion
Amazon Bedrock AgentCore Browser is generally available, with the Web Bot Auth feature available in preview. AgentCore Browser signing requests using the Web Bot Auth protocol helps reduce friction with CAPTCHAs across domains that allow verified bots. As the protocol finalizes, AgentCore Browser intends to issue customer-specific keys and directories, so you can prove your agent’s identity to specific websites and establish trust relationships directly with the domains you need to access.
Web Bot Auth enables agents to prove their identity when challenged, reduces operational friction in automated workflows, and gives website owners control over which agents access their resources. Amazon Bedrock AgentCore Browser support for Web Bot Auth (Preview) provides the infrastructure layer that makes this possible.

About the authors
Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Joshua Samuel is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise transformation through AI/ML, and generative AI solutions, based in Melbourne, Australia. A passionate disrupter, he specializes in agentic AI and coding techniques – Anything that makes builders faster and happier.

Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-based Training of LLMs for Any AI Agent

How do you convert real agent traces into reinforcement learning (RL) transitions to improve policy LLMs without changing your existing agent stack? The Microsoft AI team releases Agent Lightning to help optimize multi-agent systems. Agent Lightning is an open-source framework that makes reinforcement learning work for any AI agent without rewrites. It separates training from execution, defines a unified trace format, and introduces LightningRL, a hierarchical method that converts complex agent runs into transitions that standard single turn RL trainers can optimize.

What does Agent Lightning do?

The framework models an agent as a decision process. It formalizes the agent as a partially observable Markov decision process where the observation is the current input to the policy LLM, the action is the model call, and the reward can be terminal or intermediate. From each run it extracts only the calls made by the policy model, along with inputs, outputs, and rewards. This trims away other framework noise and yields clean transitions for training.

LightningRL performs credit assignment across multi step episodes, then optimizes the policy with a single turn RL objective. The research team describes compatibility with single turn RL methods. In practice, teams often use trainers that implement PPO or GRPO, such as VeRL, which fits this interface.

https://arxiv.org/pdf/2508.03680v1

System architecture

Agent Lightning uses Training Agent Disaggregation. A Lightning Server runs training and serving, and exposes an OpenAI like API for the updated model. A Lightning Client runs the agent runtime where it already lives, captures traces of prompts, tool calls, and rewards, and streams them back to the server. This keeps tools, browsers, shells, and other dependencies close to production while the GPU training stays in the server tier.
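
Because the server exposes an OpenAI like API for the updated model, the agent side can keep using a standard client. The following is an illustrative sketch, the base URL and model id are placeholders rather than documented values.

# Sketch: the agent keeps calling an OpenAI compatible endpoint, which the Lightning Server
# serves with the latest trained weights. The URL and model id below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://lightning-server:8000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="policy-llm",  # whatever model id the server registers for the policy being trained
    messages=[{"role": "user", "content": "Write a SQL query for total sales by region."}],
)
print(resp.choices[0].message.content)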

https://arxiv.org/pdf/2508.03680v1

The runtime supports two tracing paths. A default path uses OpenTelemetry spans, so you can pipe agent telemetry through standard collectors. There is also a lightweight embedded tracer for teams that do not want to deploy OpenTelemetry. Both paths end up in the same store for training.

https://arxiv.org/pdf/2508.03680v1

Unified data interface

Agent Lightning records each model call and each tool call as a span with inputs, outputs, and metadata. The algorithm layer adapts spans into ordered triplets of prompt, response, and reward. This selective extraction lets you optimize one agent in a multi agent workflow, or multiple agents at once, without touching orchestration code. The same traces can also drive automatic prompt optimization or supervised finetuning.
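
A rough sketch of what that adaptation can look like is shown below. The span fields and the fallback credit assignment are illustrative assumptions, not the framework's actual schema.

# Illustrative only: adapt recorded spans into (prompt, response, reward) triplets for a
# single turn RL trainer. Field names and the credit assignment rule are assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Span:
    kind: str                # "model_call" or "tool_call"
    agent: str               # which agent in the workflow produced the span
    prompt: str
    response: str
    reward: Optional[float]  # intermediate reward if available, else None

def spans_to_triplets(spans: List[Span], final_reward: float, target_agent: str):
    """Keep only the policy model's calls and attach a reward to each one."""
    triplets = []
    for span in spans:
        if span.kind != "model_call" or span.agent != target_agent:
            continue  # skip tool calls and other agents, they are not optimized here
        reward = span.reward if span.reward is not None else final_reward
        triplets.append({"prompt": span.prompt, "response": span.response, "reward": reward})
    return triplets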

https://arxiv.org/pdf/2508.03680v1

Experiments and datasets

The research team reports three tasks. For text to SQL, the team uses the Spider benchmark. Spider contains more than 10,000 questions across 200 databases that span 138 domains. The policy model is Llama 3.2 3B Instruct. The implementation uses LangChain with a writer agent, a rewriter agent, and a checker. The writer and the rewriter are optimized, and the checker is left fixed. Rewards improve steadily during training and at test time.

https://arxiv.org/pdf/2508.03680v1

For retrieval augmented generation, the setup uses the MuSiQue benchmark and a Wikipedia scale index with about 21 million documents. The retriever uses BGE embeddings with cosine similarity. The agent is built with the OpenAI Agents SDK. The reward is a weighted sum of a format score and an F1 correctness score. Reward curves show stable gains during training and evaluation with the same base model.

https://arxiv.org/pdf/2508.03680v1

For math question answering with tool use, the agent is implemented with AutoGen and calls a calculator tool. The dataset is Calc X. The base model again is Llama 3.2 3B Instruct. Training improves the ability to invoke tools correctly and integrate results into final answers.

https://arxiv.org/pdf/2508.03680v1

Key Takeaways

Agent Lightning uses Training Agent Disaggregation and a unified trace interface, so existing agents in LangChain, OpenAI Agents SDK, AutoGen, or CrewAI connect with near zero code change.

LightningRL converts trajectories to transitions. It applies credit assignment to multi step runs, then optimizes the policy with single turn RL methods such as PPO or GRPO in standard trainers.

Automatic Intermediate Rewarding, AIR, supplies dense feedback. AIR turns system signals such as tool return status into intermediate rewards to reduce sparse reward issues in long workflows.

The research evaluates text to SQL on Spider, RAG on MuSiQue with a Wikipedia scale index using BGE embeddings and cosine similarity, and math tool use on Calc X, all with Llama 3.2 3B Instruct as the base model.

The runtime records traces through OpenTelemetry, streams them to the training server, and exposes an OpenAI compatible endpoint for updated models, enabling scalable rollouts without moving tools.

Editorial Comments

Agent Lightning is a practical bridge between agent execution and reinforcement learning, not another framework rewrite. It formalizes agent runs as a Markov Decision Process (MDP), introduces LightningRL for credit assignment, and extracts transitions that slot into single turn RL trainers. The Training Agent Disaggregation design separates a client that runs the agent from a server that trains and serves an OpenAI compatible endpoint, so teams keep existing stacks. Automatic Intermediate Rewarding converts runtime signals into dense feedback, reducing sparse rewards in long workflows. Overall, Agent Lightning is a clean, minimal-integration path to make agents learn from their own traces.

Check out the Paper and Repo.
The post Microsoft Releases Agent Lightning: A New AI Framework that Enables Reinforcement Learning (RL)-based Training of LLMs for Any AI Agent appeared first on MarkTechPost.

Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG

Can a compact late interaction retriever index once and deliver accurate cross lingual search with fast inference? Liquid AI released LFM2-ColBERT-350M, a compact late interaction retriever for multilingual and cross-lingual search. Documents can be indexed in one language, queries can be written in many languages, and the system retrieves with high accuracy. The Liquid AI team reports inference speed on par with models that are 2.3 times smaller, which is attributed to the LFM2 backbone. The model is available with a Hugging Face demo and a detailed model card for integration in retrieval augmented generation systems.

https://www.liquid.ai/blog/lfm2-colbert-350m-one-model-to-embed-them-all

What late interaction means and why it matters

Most production systems use bi-encoders for speed or cross encoders for accuracy. Late interaction aims to combine both advantages. Queries and documents are encoded separately at the token level. The system compares token vectors at query time using operations such as MaxSim. This preserves fine grained token interactions without the full cost of joint cross attention. It allows pre-computation for documents and improves precision at ranking time. It can serve as a first stage retriever and also as a ranker in one pass.
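
As a concrete illustration, a minimal MaxSim scorer over token embeddings can be written as follows. This is the generic late interaction scoring idea, not Liquid AI's exact implementation.

# Generic late interaction (ColBERT style) scoring sketch, not the exact LFM2-ColBERT code.
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> float:
    """query_vecs: [num_query_tokens, dim], doc_vecs: [num_doc_tokens, dim], both L2-normalized."""
    sim = query_vecs @ doc_vecs.T              # similarity between every query token and every doc token
    return sim.max(dim=1).values.sum().item()  # best matching doc token per query token, summed

# Document token matrices can be precomputed and stored offline, so at query time only the
# query is encoded and MaxSim ranks the precomputed documents.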

Model specification

LFM2-ColBERT-350M has 350 million total parameters. There are 25 layers, with 18 convolution blocks, 6 attention blocks, and 1 dense layer. The context length is 32k tokens. The vocabulary size is 65,536. The similarity function is MaxSim. The output dimensionality is 128. Training precision is BF16. The license is LFM Open License v1.0.

https://huggingface.co/LiquidAI/LFM2-ColBERT-350M

Languages supported and evaluated

The model supports 8 languages. These are English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. The evaluation adds Italian and Portuguese, which brings the matrix to 9 languages for cross comparisons of document and query languages. This distinction is relevant when planning deployments that must cover specific customer markets.

https://www.liquid.ai/blog/lfm2-colbert-350m-one-model-to-embed-them-all

Evaluation setup and key results

Liquid AI extends the NanoBEIR benchmark with Japanese and Korean and publishes the extension for reproducibility. On this setup, LFM2-ColBERT-350M shows stronger multilingual capability than the baseline late interaction model in this class, which is GTE-ModernColBERT-v1 at 150M parameters. The largest gains appear in German, Arabic, Korean, and Japanese, while English performance is maintained.

Key Takeaways

Token-level scoring with MaxSim preserves fine-grained interactions while keeping separate encoders, so document embeddings can be precomputed and queried efficiently.

Documents can be indexed in one language and retrieved in many. The model card lists 8 supported languages, while evaluations span 9 languages for cross-lingual pairs.

On the NanoBEIR multilingual extension, LFM2-ColBERT-350M outperforms the prior late-interaction baseline (GTE-ModernColBERT-v1 at 150M) and maintains English performance.

Inference speed is reported on par with models 2.3× smaller across batch sizes, attributed to the LFM2 backbone.

Editorial Notes

Liquid AI’s LFM2-ColBERT-350M applies late interaction ColBERT with MaxSim, it encodes queries and documents separately, then scores token vectors at query time, which preserves token level interactions and enables precomputed document embeddings for scale. It targets multilingual and cross lingual retrieval, index once and query in many languages, with evaluations described on a NanoBEIR multilingual extension. Liquid AI team reports inference speed on par with models 2.3 times smaller, attributed to the LFM2 backbone. Overall, late interaction at the nano scale looks production ready for multilingual RAG trials.

Check out the Model Weights, Demo and Technical details.
The post Liquid AI Releases LFM2-ColBERT-350M: A New Small Model that brings Late Interaction Retrieval to Multilingual and Cross-Lingual RAG appeared first on MarkTechPost.

How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments

In this tutorial, we explore how exploration strategies shape intelligent decision-making through agent-based problem solving. We build and train three agents, Q-Learning with epsilon-greedy exploration, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS), to navigate a grid world and reach a goal efficiently while avoiding obstacles. Also, we experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty. Check out the FULL CODES here.

import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict

moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right; shared by the environment and agents

class GridWorld:
    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size-1, size-1)
        obstacles = set()
        while len(obstacles) < n_obstacles:
            obs = (random.randint(0, size-1), random.randint(0, size-1))
            if obs not in [self.start, self.goal]:
                obstacles.add(obs)
                self.grid[obs] = 1
        self.reset()

    def reset(self):
        self.agent_pos = self.start
        return self.agent_pos

    def step(self, action):
        # apply the move if it stays on the grid and avoids obstacles, otherwise stay in place
        move = moves[action]
        new_pos = (self.agent_pos[0] + move[0], self.agent_pos[1] + move[1])
        if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                and self.grid[new_pos] == 0):
            self.agent_pos = new_pos
        if self.agent_pos == self.goal:
            reward, done = 100, True
        else:
            reward, done = -1, False
        return self.agent_pos, reward, done

    def get_valid_actions(self, state):
        valid = []
        for i, move in enumerate(moves):
            new_pos = (state[0] + move[0], state[1] + move[1])
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size
                    and self.grid[new_pos] == 0):
                valid.append(i)
        return valid

We begin by creating a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and ensure realistic navigation boundaries to simulate an interactive problem-solving space. This forms the foundation where our exploration agents will operate and learn. Check out the FULL CODES here.

class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        if random.random() < self.epsilon:
            return random.choice(valid_actions)
        else:
            q_values = self.q_table[state]
            valid_q = [(a, q_values[a]) for a in valid_actions]
            return max(valid_q, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        current_q = self.q_table[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_table[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

    def decay_epsilon(self, decay_rate=0.995):
        self.epsilon = max(0.01, self.epsilon * decay_rate)

We implement the Q-Learning agent that learns through experience, guided by an epsilon-greedy policy. We observe how it explores random actions early on and gradually focuses on the most rewarding paths. Through iterative updates, it learns to balance exploration and exploitation effectively.

class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                return action
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count

We develop the UCB agent that uses confidence bounds to guide its exploration decisions. We watch how it strategically tries less-visited actions while prioritizing those that yield higher rewards. This approach helps us understand a more mathematically grounded exploration strategy. Check out the FULL CODES here.
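
To see the exploration bonus at work, we can plug small numbers into the same formula the agent uses. This is just a worked example of the code above.

# Worked example of the UCB term: with c = 2, a state visited 100 times, and an action tried
# 4 times with estimated value 2.0, the bonus is about 2.15, so its UCB score is about 4.15.
import math

c, state_visits, action_count, q = 2.0, 100, 4, 2.0
bonus = c * math.sqrt(math.log(state_visits) / action_count)
print(round(q + bonus, 2))  # ~4.15, so a rarely tried action can outrank a better-known one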

class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        choices = [(action, child.value / child.visits +
                    c * math.sqrt(2 * math.log(self.visits) / child.visits))
                   for action, child in self.children.items()]
        return max(choices, key=lambda x: x[1])

class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # selection: follow the best child while nodes are fully expanded
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # expansion: try one untried action from the current node
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # simulation: random rollout from the expanded node
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                action = random.choice(valid)
                _, reward, done = sim_env.step(action)
                total_reward += reward
                depth += 1
                if done:
                    break
            # backpropagation: update visit counts and values along the path to the root
            while node:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        if root.children:
            return max(root.children.items(), key=lambda x: x[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))

We construct the Monte Carlo Tree Search (MCTS) agent to simulate and plan multiple potential future outcomes. We see how it builds a search tree, expands promising branches, and backpropagates results to refine decisions. This allows the agent to plan intelligently before acting. Check out the FULL CODES here.

def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history

if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        # train every agent on the same environment so the MCTS agent's internal model matches it
        rewards = train_agent(agent, env, episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20)/20, mode='valid')
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha='right')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)

We train all three agents in our grid world and visualize their learning progress and performance. We analyze how each strategy, Q-Learning, UCB, and MCTS, adapts to the environment over time. Finally, we compare results and gain insights into which exploration approach leads to faster, more reliable problem-solving.

In conclusion, we successfully implemented and compared three exploration-driven agents, each demonstrating a unique strategy for solving the same navigation challenge. We observe how epsilon-greedy enables gradual learning through randomness, UCB balances confidence with curiosity, and MCTS leverages simulated rollouts for foresight and planning. This exercise helps us appreciate how different exploration mechanisms influence convergence, adaptability, and efficiency in reinforcement learning.

Check out the FULL CODES here.
The post How Exploration Agents like Q-Learning, UCB, and MCTS Collaboratively Learn Intelligent Problem-Solving Strategies in Dynamic Grid Environments appeared first on MarkTechPost.