A Coding Guide to Design an Agentic AI System Using a Control-Plane Architecture for Safe, Modular, and Scalable Tool-Driven Reasoning Workflows

In this tutorial, we build an advanced Agentic AI system using the control-plane design pattern, and we walk through each component step by step as we implement it. We treat the control plane as the central orchestrator that coordinates tools, manages safety rules, and structures the reasoning loop. We also set up a miniature retrieval system, define modular tools, and integrate an agentic reasoning layer that dynamically plans and executes actions. Finally, we observe how the entire system behaves like a disciplined, tool-aware AI capable of retrieving knowledge, assessing understanding, updating learner profiles, and logging all interactions through a unified, scalable architecture. Check out the FULL CODES here.

import subprocess
import sys

def install_deps():
    deps = ['anthropic', 'numpy', 'scikit-learn']
    for dep in deps:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', dep])

try:
    import anthropic
except ImportError:
    install_deps()
    import anthropic

import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Optional
from datetime import datetime

@dataclass
class Document:
    id: str
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[np.ndarray] = None

class SimpleRAGRetriever:
    def __init__(self):
        self.documents = self._init_knowledge_base()

    def _init_knowledge_base(self) -> List[Document]:
        docs = [
            Document("cs101", "Python basics: Variables store data. Use x=5 for integers, name='Alice' for strings. Print with print().", {"topic": "python", "level": "beginner"}),
            Document("cs102", "Functions encapsulate reusable code. Define with def func_name(params): and call with func_name(args).", {"topic": "python", "level": "intermediate"}),
            Document("cs103", "Object-oriented programming uses classes. class MyClass: defines structure, __init__ initializes instances.", {"topic": "python", "level": "advanced"}),
            Document("math101", "Linear algebra: Vectors are ordered lists of numbers. Matrix multiplication combines transformations.", {"topic": "math", "level": "intermediate"}),
            Document("ml101", "Machine learning trains models on data to make predictions. Supervised learning uses labeled examples.", {"topic": "ml", "level": "beginner"}),
            Document("ml102", "Neural networks are composed of layers. Each layer applies weights and activation functions to transform inputs.", {"topic": "ml", "level": "advanced"}),
        ]
        for i, doc in enumerate(docs):
            # Mock embeddings: a random base vector plus a per-document bump to keep documents distinguishable
            doc.embedding = np.random.rand(128)
            doc.embedding[i*20:(i+1)*20] += 2
        return docs

    def retrieve(self, query: str, top_k: int = 2) -> List[Document]:
        # The query embedding is also mocked; a real system would embed the query text
        query_embedding = np.random.rand(128)
        scores = [cosine_similarity([query_embedding], [doc.embedding])[0][0] for doc in self.documents]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

We set up all dependencies, import the libraries we rely on, and initialize the data structures for our knowledge base. We define a simple retriever and generate mock embeddings to simulate similarity search in a lightweight way. As we run this block, we prepare everything needed for retrieval-driven reasoning in the later components. Check out the FULL CODES here.
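
A minimal usage sketch (assuming the classes above are already defined in the same session) shows the retriever in action; because the embeddings are random mocks, the ranking is illustrative rather than semantically meaningful.

retriever = SimpleRAGRetriever()
for doc in retriever.retrieve("How do I define a Python function?", top_k=2):
    # Print a short preview of each retrieved document
    print(doc.id, doc.metadata, doc.content[:60])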

class ToolRegistry:
    def __init__(self, retriever: SimpleRAGRetriever):
        self.retriever = retriever
        self.interaction_log = []
        self.user_state = {"level": "beginner", "topics_covered": []}

    def search_knowledge(self, query: str, filters: Optional[Dict] = None) -> Dict:
        docs = self.retriever.retrieve(query, top_k=2)
        if filters:
            docs = [d for d in docs if all(d.metadata.get(k) == v for k, v in filters.items())]
        return {
            "tool": "search_knowledge",
            "results": [{"content": d.content, "metadata": d.metadata} for d in docs],
            "count": len(docs)
        }

    def assess_understanding(self, topic: str) -> Dict:
        questions = {
            "python": ["What keyword defines a function?", "How do you create a variable?"],
            "ml": ["What is supervised learning?", "Name two types of ML algorithms."],
            "math": ["What is a vector?", "Explain matrix multiplication."]
        }
        return {
            "tool": "assess_understanding",
            "topic": topic,
            "questions": questions.get(topic, ["General comprehension check."])
        }

    def update_learner_profile(self, topic: str, level: str) -> Dict:
        if topic not in self.user_state["topics_covered"]:
            self.user_state["topics_covered"].append(topic)
        self.user_state["level"] = level
        return {
            "tool": "update_learner_profile",
            "status": "updated",
            "profile": self.user_state.copy()
        }

    def log_interaction(self, event: str, details: Dict) -> Dict:
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "event": event,
            "details": details
        }
        self.interaction_log.append(log_entry)
        return {"tool": "log_interaction", "status": "logged", "entry_id": len(self.interaction_log)}

We build the tool registry that our agent uses while interacting with the system. We define tools such as knowledge search, assessments, profile updates, and logging, and we maintain a persistent user-state dictionary. As we use this layer, we see how each tool becomes a modular capability that the control plane can route to. Check out the FULL CODES here.
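
As a quick sketch of how these tools behave before the control plane wraps them, we can call the registry directly; this assumes the classes above are defined in the current session, and the search results vary because retrieval uses mock embeddings.

registry = ToolRegistry(SimpleRAGRetriever())
print(registry.search_knowledge("python functions", filters={"topic": "python"}))
print(registry.assess_understanding("ml"))
print(registry.update_learner_profile("python", "intermediate"))
print(registry.log_interaction("demo", {"note": "direct tool call"}))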

class ControlPlane:
    def __init__(self, tool_registry: ToolRegistry):
        self.tools = tool_registry
        self.safety_rules = {
            "max_tools_per_request": 4,
            "allowed_tools": ["search_knowledge", "assess_understanding",
                              "update_learner_profile", "log_interaction"]
        }
        self.execution_log = []

    def execute(self, plan: Dict[str, Any]) -> Dict[str, Any]:
        if not self._validate_request(plan):
            return {"error": "Safety validation failed", "plan": plan}

        action = plan.get("action")
        params = plan.get("parameters", {})
        result = self._route_and_execute(action, params)

        self.execution_log.append({
            "timestamp": datetime.now().isoformat(),
            "plan": plan,
            "result": result
        })

        return {
            "success": True,
            "action": action,
            "result": result,
            "metadata": {
                "execution_count": len(self.execution_log),
                "safety_checks_passed": True
            }
        }

    def _validate_request(self, plan: Dict) -> bool:
        action = plan.get("action")
        if action not in self.safety_rules["allowed_tools"]:
            return False
        if len(self.execution_log) >= 100:
            return False
        return True

    def _route_and_execute(self, action: str, params: Dict) -> Any:
        tool_map = {
            "search_knowledge": self.tools.search_knowledge,
            "assess_understanding": self.tools.assess_understanding,
            "update_learner_profile": self.tools.update_learner_profile,
            "log_interaction": self.tools.log_interaction
        }
        tool_func = tool_map.get(action)
        if tool_func:
            return tool_func(**params)
        return {"error": f"Unknown action: {action}"}

We implement the control plane that orchestrates tool execution, checks safety rules, and manages permissions. We validate every request, route actions to the right tool, and keep an execution log for transparency. As we run this snippet, we observe how the control plane becomes the governing system that ensures predictable and safe agentic behavior. Check out the FULL CODES here.
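
The following minimal sketch (assuming the classes above are in scope) illustrates the safety gate: an allowed action executes and is logged, while an action outside allowed_tools is rejected before it reaches any tool.

cp = ControlPlane(ToolRegistry(SimpleRAGRetriever()))
ok = cp.execute({"action": "search_knowledge", "parameters": {"query": "vectors"}})
bad = cp.execute({"action": "delete_database", "parameters": {}})  # not in allowed_tools
print(ok["success"], ok["metadata"]["execution_count"])
print(bad)  # {'error': 'Safety validation failed', ...}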

class TutorAgent:
    def __init__(self, control_plane: ControlPlane, api_key: str):
        self.control_plane = control_plane
        self.client = anthropic.Anthropic(api_key=api_key)
        self.conversation_history = []

    def teach(self, student_query: str) -> str:
        plan = self._plan_actions(student_query)
        results = []
        for action_plan in plan:
            result = self.control_plane.execute(action_plan)
            results.append(result)

        response = self._synthesize_response(student_query, results)

        self.conversation_history.append({
            "query": student_query,
            "plan": plan,
            "results": results,
            "response": response
        })
        return response

    def _plan_actions(self, query: str) -> List[Dict]:
        plan = []
        query_lower = query.lower()

        if any(kw in query_lower for kw in ["what", "how", "explain", "teach"]):
            plan.append({
                "action": "search_knowledge",
                "parameters": {"query": query},
                "context": {"intent": "knowledge_retrieval"}
            })

        if any(kw in query_lower for kw in ["test", "quiz", "assess", "check"]):
            topic = "python" if "python" in query_lower else "ml"
            plan.append({
                "action": "assess_understanding",
                "parameters": {"topic": topic},
                "context": {"intent": "assessment"}
            })

        plan.append({
            "action": "log_interaction",
            "parameters": {"event": "query_processed", "details": {"query": query}},
            "context": {"intent": "logging"}
        })

        return plan

    def _synthesize_response(self, query: str, results: List[Dict]) -> str:
        response_parts = [f"Student Query: {query}\n"]

        for result in results:
            if result.get("success") and "result" in result:
                tool_result = result["result"]

                if result["action"] == "search_knowledge":
                    response_parts.append("\nRetrieved Knowledge:")
                    for doc in tool_result.get("results", []):
                        response_parts.append(f"  • {doc['content']}")

                elif result["action"] == "assess_understanding":
                    response_parts.append("\nAssessment Questions:")
                    for q in tool_result.get("questions", []):
                        response_parts.append(f"  • {q}")

        return "\n".join(response_parts)

We implement the TutorAgent, which plans actions, communicates with the control plane, and synthesizes final responses. We analyze queries, generate multi-step plans, and combine tool outputs into meaningful answers for learners. As we execute this snippet, we see the agent behaving intelligently by coordinating retrieval, assessment, and logging. Check out the FULL CODES here.
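
To see the keyword-based planner on its own, the sketch below inspects the plan for a sample query; a placeholder API key is enough here because the Anthropic client is constructed but never called during planning.

agent = TutorAgent(ControlPlane(ToolRegistry(SimpleRAGRetriever())), api_key="placeholder")
for step in agent._plan_actions("Explain Python functions and test my understanding"):
    print(step["action"], step["context"]["intent"])
# Expected actions: search_knowledge, assess_understanding, log_interaction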

def run_demo():
    print("=" * 70)
    print("Control Plane as a Tool: RAG AI Tutor Demo")
    print("=" * 70)

    API_KEY = "your-api-key-here"

    retriever = SimpleRAGRetriever()
    tool_registry = ToolRegistry(retriever)
    control_plane = ControlPlane(tool_registry)

    print("System initialized")
    print(f"Tools: {len(control_plane.safety_rules['allowed_tools'])}")
    print(f"Knowledge base: {len(retriever.documents)} documents")

    try:
        tutor = TutorAgent(control_plane, API_KEY)
    except Exception:
        print("Mock mode enabled")
        tutor = None

    demo_queries = [
        "Explain Python functions to me",
        "I want to learn about machine learning",
        "Test my understanding of Python basics"
    ]

    for query in demo_queries:
        print("\n--- Query ---")
        if tutor:
            print(tutor.teach(query))
        else:
            plan = [
                {"action": "search_knowledge", "parameters": {"query": query}},
                {"action": "log_interaction", "parameters": {"event": "query", "details": {}}}
            ]
            print(query)
            for action in plan:
                result = control_plane.execute(action)
                print(f"{action['action']}: {result.get('success', False)}")

    print("Summary")
    print(f"Executions: {len(control_plane.execution_log)}")
    print(f"Logs: {len(tool_registry.interaction_log)}")
    print(f"Profile: {tool_registry.user_state}")

if __name__ == "__main__":
    run_demo()

We run a complete demo that initializes all components, processes sample student queries, and prints system state summaries. We watch the agent step through retrieval and logging while the control plane enforces rules and tracks execution history. As we finish this block, we get a clear picture of how the entire architecture works together in a realistic teaching loop.

In conclusion, we gain a clear understanding of how the control-plane pattern simplifies orchestration, strengthens safety, and creates a clean separation between reasoning and tool execution. We now see how a retrieval system, tool registry, and agentic planning layer come together to form a coherent AI tutor that responds intelligently to student queries. As we experiment with the demo, we observe how the system routes tasks, applies rules, and synthesizes useful insights from tool outputs, all while remaining modular and extensible.

Check out the FULL CODES here.
The post A Coding Guide to Design an Agentic AI System Using a Control-Plane Architecture for Safe, Modular, and Scalable Tool-Driven Reasoning Workflows appeared first on MarkTechPost.

DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024

How can an AI system prove complex olympiad level math problems in clear natural language while also checking that its own reasoning is actually correct? DeepSeek AI has released DeepSeekMath-V2, an open weights large language model that is optimized for natural language theorem proving with self verification. The model is built on DeepSeek-V3.2-Exp-Base, runs as a 685B parameter mixture of experts, and is available on Hugging Face under an Apache 2.0 license.

In evaluations, DeepSeekMath-V2 reaches gold level scores on IMO 2025 and CMO 2024, and achieves 118 of 120 points on Putnam 2024 when used with scaled test time compute.

Why Final-Answer Rewards Are Not Enough

Most recent math reasoning models use reinforcement learning that rewards only the final answer on benchmarks such as AIME and HMMT. This approach pushed models from weak baselines to near saturation on short answer contests in about one year. (Hugging Face)

However, the DeepSeek research team points out two structural problems:

A correct numeric answer does not guarantee correct reasoning. The model may reach the right number through algebraic mistakes that cancel out.

Many tasks, such as olympiad proofs and theorem proving, require a complete argument in natural language. These tasks do not have a single final numeric answer, so standard answer based rewards do not apply.

DeepSeekMath-V2 therefore optimizes proof quality instead of pure answer accuracy. The system evaluates whether a proof is complete and logically sound, and uses that evaluation as the main learning signal.

Training a Verifier before the Generator

The core design is verifier first. The DeepSeek research team trains an LLM based verifier that can read a problem and a candidate proof, then output both a natural language analysis and a discrete quality score in the set {0, 0.5, 1}.

The initial reinforcement learning data comes from Art of Problem Solving contests. The research team crawls 17,503 proof style problems from olympiads, team selection tests, and post 2010 problems that explicitly require proofs. These problems form the base set for cold start RL. Candidate proofs come from a DeepSeek-V3.2 reasoning model that is prompted to iteratively refine its own solutions, which increases detail but also creates many imperfect proofs. Human experts label these proofs using the 0, 0.5, 1 rubric, based on rigor and completeness.

The verifier is trained with Group Relative Policy Optimization (GRPO). The reward has two components:

A format reward, which checks that the verifier output follows a fixed template, including an analysis section and a final score in a box.

A score reward, which penalizes the absolute difference between the predicted score and the expert score.

This stage produces a verifier that can grade olympiad style proofs in a consistent way.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

Meta Verification to Control Hallucinated Critiques

A verifier can still game the reward. It can output the correct final score while inventing fake issues in the analysis. This would satisfy the numeric objective but make the explanations unreliable.

To address this, the research team introduces a meta verifier. The meta verifier reads the original problem, the proof, and the verifier analysis, and then evaluates whether the analysis is faithful. It scores aspects such as restatement of steps, identification of real defects, and consistency between the narrative and the final score.

The meta verifier is also trained with GRPO, with its own format and score rewards. Its output, a meta quality score, is then used as an extra reward term for the base verifier. Analyses that hallucinate problems get low meta scores, even if the final proof score is correct. In experiments, this raises the average meta evaluated quality of analyses from around 0.85 to 0.96 on a validation split, while keeping proof score accuracy stable.

Self Verifying Proof Generator and Sequential Refinement

Once the verifier is strong, DeepSeek research team trains the proof generator. The generator takes a problem and outputs both a solution and a self analysis that follows the same rubric as the verifier.

The reward for the generator combines three signals:

The verifier score on the generated proof.

The agreement between the self reported score and the verifier score.

The meta verification score of the self analysis.

Formally, the main reward uses weights α = 0.76 for the proof score and β = 0.24 for the self analysis component, multiplied by a format term that enforces the output structure. This pushes the generator to write proofs that the verifier accepts, and to be honest about remaining issues. If it claims that a flawed proof is perfect, it loses reward through disagreement and low meta scores.
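
The reward combination described above can be written compactly. The sketch below is an illustrative reconstruction based only on the reported weights and terms (α = 0.76, β = 0.24, multiplied by a format term); it is not DeepSeek's actual implementation, and the exact definition of the self analysis component is assumed.

# Illustrative sketch of the generator reward, not DeepSeek's code.
# proof_score: verifier score for the generated proof, in {0, 0.5, 1}
# self_analysis_score: assumed to combine agreement with the verifier and the meta verification score
# format_ok: 1.0 if the output follows the required structure, else 0.0
def generator_reward(proof_score, self_analysis_score, format_ok, alpha=0.76, beta=0.24):
    return format_ok * (alpha * proof_score + beta * self_analysis_score)

print(generator_reward(1.0, 1.0, 1.0))  # 1.0, strong proof and faithful self analysis
print(generator_reward(1.0, 0.0, 1.0))  # 0.76, good proof but unreliable self report
print(generator_reward(0.5, 1.0, 0.0))  # 0.0, a format violation zeroes the reward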

DeepSeek also exploits the 128K token context limit of the base model. For hard problems, the generator often cannot repair all issues in a single pass, because the refined proof plus analysis would exceed context. In that case, the system runs sequential refinement. It generates a proof and self analysis, feeds them back as context, and asks the model to produce a new proof that fixes the previously detected issues. This loop can repeat several times, subject to the context budget.

https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main

Scaling Verification and Auto Labeling

As the generator improves, it produces harder proofs, which are costly to label by hand. To keep training data fresh, the research team introduces an automatic labeling pipeline based on scaled verification.

For each candidate proof, the system samples multiple independent verifier analyses, then evaluates each analysis using the meta verifier. If several high quality analyses converge on the same serious issues, the proof is labeled as incorrect. If no valid issues survive meta checking, the proof is labeled as correct. In the final training iterations this pipeline replaces human labels, with spot checks confirming good agreement with experts.
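
The agreement logic can be sketched in a few lines; the thresholds and the way issues are matched below are illustrative assumptions, not details from the paper.

# Hedged sketch of the auto-labeling idea: a proof is marked incorrect only when
# several high quality verifier analyses agree on the same serious issue.
def auto_label(analyses, min_quality=0.8, min_agreeing=2):
    credible = [a for a in analyses if a["meta_score"] >= min_quality]
    issue_counts = {}
    for a in credible:
        for issue in a["issues"]:
            issue_counts[issue] = issue_counts.get(issue, 0) + 1
    has_confirmed_issue = any(count >= min_agreeing for count in issue_counts.values())
    return "incorrect" if has_confirmed_issue else "correct"

print(auto_label([
    {"meta_score": 0.95, "issues": ["step 3 unjustified"]},
    {"meta_score": 0.90, "issues": ["step 3 unjustified"]},
    {"meta_score": 0.40, "issues": ["wrong lemma"]},
]))  # incorrect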

Competition and Benchmark Results

The research team evaluated DeepSeekMath-V2 on several fronts:

On an internal set of 91 CNML level problems covering algebra, geometry, number theory, combinatorics, and inequalities, DeepSeekMath-V2 achieves the highest mean proof score in every category when compared against Gemini 2.5 Pro and GPT 5 Thinking High, as measured by their verifier.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO Shortlist 2024, sequential refinement with self verification improves both pass at 1 and best of 32 quality metrics as the maximum number of refinement iterations increases.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

On IMO ProofBench, expert evaluation (shown in the figure above) indicates that DeepSeekMath-V2 outperforms DeepMind DeepThink IMO Gold on the Basic subset and remains competitive on the Advanced subset, while clearly beating other large models.

https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf

For full competitions, it reports:

IMO 2025: 5 of 6 problems solved, gold medal level.

CMO 2024: 4 problems fully solved plus partial credit on 1 more, gold medal level.

Putnam 2024: 11 of 12 problems solved completely and the remaining problem with minor errors, for 118 of 120 points, above the best human score of 90.

Key Takeaways

DeepSeekMath V2 is a 685B parameter model built on DeepSeek V3.2 Exp Base, designed for natural language theorem proving with self verification, and released as open weights under the Apache 2.0 license.

The main innovation is a verifier first training pipeline with a GRPO trained verifier and meta verifier that score proofs on rigor, not only final answers, which directly addresses the gap between correct answers and correct reasoning.

A proof generator is then trained against this verifier and meta verifier, using rewards that combine proof quality, agreement with self evaluation, and analysis faithfulness, plus sequential refinement under 128K context to iteratively repair proofs.

With scaled test time compute and large verification budgets, DeepSeekMath V2 reaches gold level performance on IMO 2025 and CMO 2024 and scores 118 of 120 on Putnam 2024, surpassing the best human score that year.

Editorial Notes

DeepSeekMath-V2 is an important step toward self verifiable mathematical reasoning, because it directly tackles the gap between correct final answers and correct reasoning, using a verifier, meta verifier and proof generator trained with GRPO on olympiad style proofs and deployed at 685B scale to reach gold level performance on IMO 2025, CMO 2024 and a near perfect 118 of 120 score on Putnam 2024. Overall, this release shows that self verifiable mathematical reasoning with open weights is now practically achievable for competition level problems.

Check out the Full Paper, Model Weights on HF and Repo.
The post DeepSeek AI Releases DeepSeekMath-V2: The Open Weights Maths Model That Scored 118/120 on Putnam 2024 appeared first on MarkTechPost.

A Coding Implementation for an Agentic AI Framework that Performs Literature Analysis, Hypothesis Generation, Experimental Planning, Simulation, and Scientific Reporting

In this tutorial, we build a complete scientific discovery agent step by step and experience how each component works together to form a coherent research workflow. We begin by loading our literature corpus, constructing retrieval and LLM modules, and then assembling agents that search papers, generate hypotheses, design experiments, and produce structured reports. Through snippets mentioned below, we see how an agentic pipeline emerges naturally, allowing us to explore a scientific question from initial curiosity to a full analysis within a single, integrated system. Check out the FULL CODES here.

import sys, subprocess

def install_deps():
    pkgs = ["transformers", "scikit-learn", "numpy"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

try:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
except ImportError:
    install_deps()
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

from dataclasses import dataclass
from typing import List, Dict, Any

np.random.seed(42)

LITERATURE = [
    {"id": "P1", "title": "Self-Supervised Protein Language Models for Structure Prediction", "field": "computational biology",
     "abstract": "We explore transformer-based protein language models trained on millions of sequences. The models learn residue-level embeddings that improve secondary structure prediction and stability estimation."},
    {"id": "P2", "title": "CRISPR Off-Target Detection Using Deep Learning", "field": "genome editing",
     "abstract": "We propose a convolutional neural network architecture for predicting CRISPR-Cas9 off-target effects directly from genomic sequences, achieving state-of-the-art accuracy on GUIDE-seq datasets."},
    {"id": "P3", "title": "Foundation Models for Scientific Equation Discovery", "field": "scientific ML",
     "abstract": "Large language models are combined with symbolic regression to recover governing equations from noisy experimental observations in physics and fluid dynamics."},
    {"id": "P4", "title": "Active Learning for Materials Property Optimization", "field": "materials science",
     "abstract": "We integrate Bayesian optimization with graph neural networks to actively select candidate materials that maximize target properties while reducing experimental cost."},
    {"id": "P5", "title": "Graph-Based Retrieval for Cross-Domain Literature Review", "field": "NLP for science",
     "abstract": "We construct a heterogeneous citation and concept graph over multi-domain scientific papers and show that graph-aware retrieval improves cross-domain literature exploration."},
]

corpus_texts = [p["abstract"] + " " + p["title"] for p in LITERATURE]
vectorizer = TfidfVectorizer(stop_words="english")
corpus_matrix = vectorizer.fit_transform(corpus_texts)

MODEL_NAME = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_text(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

We lay the foundation for our scientific agent by loading libraries, preparing the literature corpus, and initializing our language model. We build the TF-IDF vectorizer and vectorize all abstracts so we can later retrieve relevant papers. With the model loaded and data structured, we create the computational backbone for everything that follows. Check out the FULL CODES here.

@dataclass
class PaperHit:
    paper: Dict[str, Any]
    score: float

class LiteratureAgent:
    def __init__(self, vectorizer, corpus_matrix, papers: List[Dict[str, Any]]):
        self.vectorizer = vectorizer
        self.corpus_matrix = corpus_matrix
        self.papers = papers

    def search(self, query: str, k: int = 3) -> List[PaperHit]:
        q_vec = self.vectorizer.transform([query])
        sims = cosine_similarity(q_vec, self.corpus_matrix)[0]
        idxs = np.argsort(-sims)[:k]
        hits = [PaperHit(self.papers[i], float(sims[i])) for i in idxs]
        return hits

We implement the literature-search component of our agent. We convert user queries into a vector space and identify the most relevant scientific papers using cosine similarity. Through this, we give our system the ability to ground its reasoning in the closest-matching prior work. Check out the FULL CODES here.
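
A short usage sketch (assuming the setup cell above has been run) shows the ranked hits for a sample query.

lit_agent = LiteratureAgent(vectorizer, corpus_matrix, LITERATURE)
for hit in lit_agent.search("protein language models for CRISPR prediction", k=3):
    print(f"{hit.paper['id']} ({hit.score:.3f}): {hit.paper['title']}")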

@dataclass
class ExperimentPlan:
    system: str
    hypothesis: str
    variables: Dict[str, Any]
    protocol: List[str]

@dataclass
class ExperimentResult:
    plan: ExperimentPlan
    metrics: Dict[str, float]

class ExperimentAgent:
    def design_experiment(self, question: str, hypothesis: str, hits: List[PaperHit]) -> ExperimentPlan:
        top_field = hits[0].paper["field"] if hits else "computational science"
        protocol = [
            f"Construct dataset combining ideas from: {', '.join(h.paper['id'] for h in hits)}.",
            "Split data into train/validation/test.",
            "Compare baseline model vs. augmented model implementing the hypothesis.",
            "Evaluate using appropriate metrics and perform ablation analysis.",
        ]
        variables = {
            "baseline_model": "sequence CNN",
            "augmented_model": "protein language model + CNN",
            "n_train_samples": 5000,
            "n_validation_samples": 1000,
            "metric": "AUROC",
        }
        system = f"{top_field} system related to: {question}"
        return ExperimentPlan(system=system, hypothesis=hypothesis, variables=variables, protocol=protocol)

    def run_experiment(self, plan: ExperimentPlan) -> ExperimentResult:
        base = 0.78 + 0.02 * np.random.randn()
        gain = abs(0.05 + 0.01 * np.random.randn())
        metrics = {
            "baseline_AUROC": round(base, 3),
            "augmented_AUROC": round(base + gain, 3),
            "estimated_gain": round(gain, 3),
        }
        return ExperimentResult(plan=plan, metrics=metrics)

We design and simulate experiments based on the retrieved literature and the generated hypothesis. We automatically define variables, build a protocol, and generate synthetic metrics that imitate the dynamics of a real scientific evaluation. This lets us move from theoretical ideas to an actionable experimental plan. Check out the FULL CODES here.
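
A brief sketch (assuming the earlier cells are in scope) shows how a plan and its simulated metrics look for a toy question and hypothesis.

exp_agent = ExperimentAgent()
hits = LiteratureAgent(vectorizer, corpus_matrix, LITERATURE).search("CRISPR off-target prediction", k=2)
plan = exp_agent.design_experiment("Do protein LM embeddings help?", "Embeddings improve AUROC.", hits)
result = exp_agent.run_experiment(plan)
print(plan.protocol[0])
print(result.metrics)  # synthetic baseline vs. augmented AUROC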

class ReportAgent:
    def write_report(self, question: str, hits: List[PaperHit], plan: ExperimentPlan, result: ExperimentResult) -> str:
        related_work = "\n".join(f"- {h.paper['title']} ({h.paper['field']})" for h in hits)
        protocol_str = "\n".join(f"- {step}" for step in plan.protocol)
        prompt = f"""
You are an AI research assistant writing a concise research-style report.

Research question:
{question}

Hypothesis:
{plan.hypothesis}

Relevant prior work:
{related_work}

Planned experiment:
System: {plan.system}
Variables: {plan.variables}
Protocol:
{protocol_str}

Simulated results:
{result.metrics}

Write a clear report with the following sections:
1. Background
2. Proposed Approach
3. Experimental Setup
4. Results and Discussion
5. Limitations and Future Work
"""
        return generate_text(prompt.strip(), max_new_tokens=320)

We generate a full research-style report using the LLM. We assemble the hypothesis, protocol, results, and related work into a structured document with clearly defined sections. This allows us to turn the pipeline’s raw outputs into polished scientific communication. Check out the FULL CODES here.

class ScientificAgent:
    def __init__(self):
        self.lit_agent = LiteratureAgent(vectorizer, corpus_matrix, LITERATURE)
        self.exp_agent = ExperimentAgent()
        self.report_agent = ReportAgent()

    def propose_hypothesis(self, question: str, hits: List[PaperHit]) -> str:
        context = " ".join(h.paper["abstract"] for h in hits)
        prompt = f"""
You are an AI scientist. Given a research question and related abstracts,
propose a single, testable hypothesis in 2-3 sentences.

Research question:
{question}

Related abstracts:
{context}
"""
        return generate_text(prompt.strip(), max_new_tokens=96)

    def run_pipeline(self, question: str) -> str:
        hits = self.lit_agent.search(question, k=3)
        hypothesis = self.propose_hypothesis(question, hits)
        plan = self.exp_agent.design_experiment(question, hypothesis, hits)
        result = self.exp_agent.run_experiment(plan)
        report = self.report_agent.write_report(question, hits, plan, result)
        return report

if __name__ == "__main__":
    research_question = (
        "How can protein language model embeddings improve CRISPR off-target "
        "prediction compared to sequence-only CNN baselines?"
    )
    agent = ScientificAgent()
    final_report = agent.run_pipeline(research_question)
    print(final_report)

We orchestrate the entire pipeline, searching the literature, generating a hypothesis, designing the experiment, running the simulation, and writing the report. We then execute the system on a real research question and observe the complete workflow in action. This step brings all the modules together into a unified scientific agent.

In conclusion, we see how a compact codebase can evolve into a functioning AI co-researcher capable of searching, reasoning, simulating, and summarizing. We understand how each snippet contributes to the full pipeline and how agentic components amplify one another when combined. Also, we place ourselves in a strong position to extend the agent with richer literature sources, more realistic models, and more sophisticated experimental logic, pushing our scientific exploration further with every iteration.

Check out the FULL CODES here.
The post A Coding Implementation for an Agentic AI Framework that Performs Literature Analysis, Hypothesis Generation, Experimental Planning, Simulation, and Scientific Reporting appeared first on MarkTechPost.

OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Database for Multi-model RAG and AI Agents

AI applications rarely deal with one clean table. They mix user profiles, chat logs, JSON metadata, embeddings, and sometimes spatial data. Most teams answer this with a patchwork of an OLTP database, a vector store, and a search engine. OceanBase released seekdb, an open source AI focused database (under the Apache 2.0 license). seekdb is described as an AI native search database that unifies relational data, vector data, text, JSON, and GIS in one engine and exposes hybrid search and in database AI workflows. 

What is seekdb?

seekdb is positioned as the lightweight, embedded version of the OceanBase engine, aimed at AI applications rather than general purpose distributed deployments. It runs as a single node database, supports embedded mode and client or server mode, and remains compatible with MySQL drivers and SQL syntax.

In the capability matrix, seekdb is marked as:

Embedded database supported

Standalone database supported

Distributed database not supported

while the full OceanBase product covers the distributed case.

From a data model perspective, seekdb supports:

Relational data with standard SQL

Vector search

Full text search

JSON data

Spatial GIS data

all inside one storage and indexing layer.

Hybrid search as the core feature

The main feature OceanBase pushes is hybrid search. This is search that combines vector based semantic retrieval, full text keyword retrieval, and scalar filters in a single query and a single ranking step.

seekdb implements hybrid search through a system package named DBMS_HYBRID_SEARCH with two entry points:

DBMS_HYBRID_SEARCH.SEARCH which returns results as JSON, sorted by relevance

DBMS_HYBRID_SEARCH.GET_SQL which returns the concrete SQL string used for execution

The hybrid search path can run:

pure vector search

pure full text search

combined hybrid search

and can push relational filters and joins down into storage. It also supports query reranking strategies like weighted scores and reciprocal rank fusion and can plug in large language model based re-rankers.
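
Reciprocal rank fusion is a simple way to merge vector and keyword result lists. The sketch below is a generic Python illustration of the idea, not seekdb internals, and assumes each retriever returns an ordered list of document ids.

# Generic reciprocal rank fusion: documents ranked highly by both retrievers rise to the top.
def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # semantic retrieval order
keyword_hits = ["d1", "d9", "d3"]  # BM25 keyword retrieval order
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # d1 and d3 lead the fused ranking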

For retrieval augmented generation (RAG) and agent memory, this means you can write a single SQL query that does semantic matching on embeddings, exact matching on product codes or proper nouns, and relational filtering on user or tenant scopes.

Vector and full text engine details

At its core, seekdb exposes a modern vector and full text stack.

For vectors, seekdb:

supports dense vectors and sparse vectors

supports Manhattan, Euclidean, inner product, and cosine distance metrics

provides in memory index types such as HNSW, HNSW SQ, HNSW BQ

provides disk based index types including IVF and IVF PQ

Hybrid vector indexes show how you can store raw text, let seekdb call an embedding model automatically, and have the system maintain the corresponding vector index without a separate preprocessing pipeline.

For text, seekdb offers full text search with:

keyword, phrase, and Boolean queries

BM25 ranking for relevance

multiple tokenizer modes

The key point is that full text and vector indexes are first class and are integrated in the same query planner as scalar indexes and GIS indexes, so hybrid search does not need external orchestration.

AI functions inside the database

seekdb includes built in AI function expressions that let you call models directly from SQL, without a separate application service mediating every call. The main functions are:

AI_EMBED to convert text into embeddings

AI_COMPLETE for text generation using a chat or completion model

AI_RERANK to rerank a list of candidates

AI_PROMPT to assemble prompt templates and dynamic values into a JSON object for AI_COMPLETE

Model metadata and endpoints are managed by the DBMS_AI_SERVICE package, which lets you register external providers, set URLs, and configure keys, all on the database side. 

Multimodal data and workloads

seekdb is built to handle multiple data modalities in one node. It has a multimodal data and indexing layer that covers vectors, text, JSON, and GIS, and a multi-model compute layer for hybrid workloads across vector, full text, and scalar conditions.

It also provides JSON indexes for metadata queries and GIS indexes for spatial conditions. This allows queries like:

find semantically similar documents

filter by JSON metadata like tenant, region, or category

constrain by spatial range or polygon

without leaving the same engine.

Because seekdb is derived from the OceanBase engine, it inherits ACID transactions, row and column hybrid storage, and vectorized execution, although high scale distributed deployments remain a job for the full OceanBase database.


Key Takeaways

AI native hybrid search: seekdb unifies vector search, full text search and relational filtering in a single SQL and DBMS_HYBRID_SEARCH interface, so RAG and agent workloads can run multi signal retrieval in one query instead of stitching together multiple engines.

Multimodal data in one engine: seekdb stores and indexes relational data, vectors, text, JSON and GIS in the same engine, which lets AI applications keep documents, embeddings and metadata consistent without maintaining separate databases.

In database AI functions for RAG: With AI_EMBED, AI_COMPLETE, AI_RERANK and AI_PROMPT, seekdb can call embedding models, LLMs and rerankers directly from SQL, which simplifies RAG pipelines and moves more orchestration logic into the database layer.

Single node, embedded friendly design: seekdb is a single node, MySQL compatible engine that supports embedded and standalone modes, while distributed, large scale deployments remain the role of full OceanBase, which makes seekdb suitable for local, edge and service embedded AI workloads.

Open source and tool ecosystem: seekdb is open sourced under Apache 2.0 and integrates with a growing ecosystem of AI tools and frameworks, with Python support via pyseekdb and MCP based integration for code assistants and agents, so it can act as a unified data plane for AI applications.

Check out the Repo and Project.
The post OceanBase Releases seekdb: An Open Source AI Native Hybrid Search Database for Multi-model RAG and AI Agents appeared first on MarkTechPost.

Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent Hunyuan has released HunyuanOCR, a 1B parameter vision language model that is specialized for OCR and document understanding. The model is built on Hunyuan’s native multimodal architecture and runs spotting, parsing, information extraction, visual question answering, and text image translation through a single end to end pipeline.

HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Architecture, Native Resolution ViT plus Lightweight LLM

HunyuanOCR uses 3 main modules, a Native Resolution Visual Encoder called Hunyuan ViT, an Adaptive MLP Connector, and a Lightweight Language Model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.

The Adaptive MLP Connector performs learnable pooling on the spatial dimension. It compresses the dense visual tokens into a shorter sequence, while keeping information from text dense regions. This reduces sequence length passed to the language model and lowers compute, while preserving OCR relevant details.

The language model is based on the densely architected Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits rotary position embeddings into 4 subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi column pages, cross page flows, and sequences of video frames.

Training and inference follow a fully end to end paradigm. There is no external layout analysis or post processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.

Data and Pre Training Recipe

The data pipeline builds more than 200M image text pairs, across 9 real world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards and certificates and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.

Synthetic data comes from a multilingual generator that supports right to left scripts and paragraph level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Pre training follows 4 stages. Stage-1 performs vision language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage-2 runs multimodal pre training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage-3 extends context length to 32k with 80B tokens focused on long documents and long text. Stage-4 is application oriented supervised fine tuning on 24B tokens of human annotated and hard negative data, keeping 32k context and unified instruction templates.

Reinforcement Learning with Verifiable Rewards

After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) and a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection over union matching of boxes combined with normalized edit distance over text. For document parsing, the reward uses normalized edit distance between the generated structure and the reference.

For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks semantic match. Translation uses a COMET style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break schema, which stabilizes optimization and encourages valid JSON or structured outputs.
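
The spotting reward can be visualized with a toy sketch that matches predicted boxes to reference boxes by IoU and scores the matched text by a normalized similarity; the matching strategy, thresholds, and the use of difflib as a stand-in for edit distance are assumptions for illustration, not the HunyuanOCR implementation.

# Toy sketch of an IoU-plus-edit-distance spotting reward, not HunyuanOCR's exact reward.
import difflib

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def spotting_reward(pred, gold, iou_thresh=0.5):
    # pred and gold are lists of (box, text); each gold box is matched to its best-IoU prediction.
    scores = []
    for g_box, g_text in gold:
        best_iou, best_text = max(((iou(g_box, p_box), p_text) for p_box, p_text in pred), default=(0.0, ""))
        text_sim = difflib.SequenceMatcher(None, best_text, g_text).ratio()
        scores.append(text_sim if best_iou >= iou_thresh else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

print(spotting_reward([((0, 0, 10, 5), "HELL0")], [((0, 0, 10, 5), "HELLO")]))  # 0.8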

Benchmark Results, a 1B Model Competing with Larger VLMs

On the internal text spotting benchmark of 900 images across 9 categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR and also general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.

On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non Chinese and non English languages, it reaches 91.03, and the paper reports state of the art results across all 14 languages.

For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at similar scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.

In text image translation, HunyuanOCR uses the DoTA benchmark and a DocML based internal set. It achieves a strong COMET score on DoTA for English to Chinese document translation, and the model wins first place in Track 2.2 OCR free Small Model of the ICDAR 2025 DIMT competition.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Key Takeaways

Compact end to end OCR VLM: HunyuanOCR is a 1B parameter OCR focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA and translation in one end to end instruction driven pipeline without external layout or detection modules.

Unified support for diverse OCR scenarios: The model is trained on more than 200M image text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.

Data pipeline plus reinforcement learning: Training uses a 4 stage recipe, vision language alignment, multimodal pre training, long context pre training and application oriented supervised fine tuning, followed by reinforcement learning with group relative policy optimization and verifiable rewards for spotting, parsing, VQA and translation.

Strong benchmark results for sub 3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding, and achieves 860 on OCRBench, which is reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.

Editorial Notes

HunyuanOCR is a strong signal that OCR specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end to end architecture with Native Vision Transformer, Adaptive MLP Connector and RL with verifiable rewards to deliver a single model that covers spotting, parsing, IE, VQA and translation across more than 100 languages, and it does so while reaching leading scores on OCRBench for sub 3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction driven OCR engines that are realistic for production deployment.

Check out the Paper, Model weight and Repo.
The post Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM appeared first on MarkTechPost.

Black Forest Labs Releases FLUX.2: A 32B Flow Matching Transformer for Production Image Pipelines

Black Forest Labs has released FLUX.2, its second generation image generation and editing system. FLUX.2 targets real world creative workflows such as marketing assets, product photography, design layouts, and complex infographics, with editing support up to 4 megapixels and strong control over layout, logos, and typography.

FLUX.2 product family and FLUX.2 [dev]

The FLUX.2 family spans hosted APIs and open weights:

FLUX.2 [pro] is the managed API tier. It targets state of the art quality relative to closed models, with high prompt adherence and low inference cost, and is available in the BFL Playground, BFL API, and partner platforms.

FLUX.2 [flex] exposes parameters such as number of steps and guidance scale, so developers can trade off latency, text rendering accuracy, and visual detail.

FLUX.2 [dev] is the open weight checkpoint, derived from the base FLUX.2 model. It is described as the most powerful open weight image generation and editing model, combining text to image and multi image editing in one checkpoint, with 32 billion parameters.

FLUX.2 [klein] is a coming open source Apache 2.0 variant, size distilled from the base model for smaller setups, with many of the same capabilities.

All variants support image editing from text and multiple references in a single model, which removes the need to maintain separate checkpoints for generation and editing.

Architecture, latent flow, and the FLUX.2 VAE

FLUX.2 uses a latent flow matching architecture. The core design couples a Mistral-3 24B vision language model with a rectified flow transformer that operates on latent image representations. The vision language model provides semantic grounding and world knowledge, while the transformer backbone learns spatial structure, materials, and composition.

The model is trained to map noise latents to image latents under text conditioning, so the same architecture supports both text driven synthesis and editing. For editing, latents are initialized from existing images, then updated under the same flow process while preserving structure.

A new FLUX.2 VAE defines the latent space. It is designed to balance learnability, reconstruction quality, and compression, and is released separately on Hugging Face under an Apache 2.0 license. This autoencoder is the backbone for all FLUX.2 flow models and can also be reused in other generative systems.

https://bfl.ai/blog/flux-2

Capabilities for production workflows

The FLUX.2 Docs and Diffusers integration highlight several key capabilities:

Multi reference support: FLUX.2 can combine up to 10 reference images to maintain character identity, product appearance, and style across outputs.

Photoreal detail at 4MP: the model can edit and generate images up to 4 megapixels, with improved textures, skin, fabrics, hands, and lighting suitable for product shots and photo like use cases.

Robust text and layout rendering: it can render complex typography, infographics, memes, and user interface layouts with small legible text, which is a common weakness in many older models.

World knowledge and spatial logic: the model is trained for more grounded lighting, perspective, and scene composition, which reduces artifacts and the synthetic look.

https://bfl.ai/blog/flux-2

Key Takeaways

FLUX.2 is a 32B latent flow matching transformer that unifies text to image, image editing, and multi reference composition in a single checkpoint.

FLUX.2 [dev] is the open weight variant, paired with the Apache 2.0 FLUX.2 VAE, while the core model weights use the FLUX.2-dev Non Commercial License with mandatory safety filtering.

The system supports up to 4 megapixel generation and editing, robust text and layout rendering, and up to 10 visual references for consistent characters, products, and styles.

Full precision inference requires more than 80GB VRAM, but 4 bit and FP8 quantized pipelines with offloading make FLUX.2 [dev] usable on 18GB to 24GB GPUs and even 8GB cards with sufficient system RAM.

Editorial Notes

FLUX.2 is an important step for open weight visual generation, since it combines a 32B rectified flow transformer, a Mistral 3 24B vision language model, and the FLUX.2 VAE into a single high fidelity pipeline for text to image and editing. The clear VRAM profiles, quantized variants, and strong integrations with Diffusers, ComfyUI, and Cloudflare Workers make it practical for real workloads, not only benchmarks. This release pushes open image models closer to production grade creative infrastructure.

Check out the Technical details, Model weight and Repo.
The post Black Forest Labs Releases FLUX.2: A 32B Flow Matching Transformer for Production Image Pipelines appeared first on MarkTechPost.

How to Implement Functional Components of Transformer and Mini-GPT Mod …

In this tutorial, we explore how to build neural networks from scratch using Tinygrad while remaining fully hands-on with tensors, autograd, attention mechanisms, and transformer architectures. We progressively build every component ourselves, from basic tensor operations to multi-head attention, transformer blocks, and, finally, a working mini-GPT model. Through each stage, we observe how Tinygrad’s simplicity helps us understand what happens under the hood when models train, optimize, and fuse kernels for performance. Check out the FULL CODES here.

import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])

import numpy as np
from tinygrad import Tensor, nn, Device
from tinygrad.nn import optim
import time

print(f"Using device: {Device.DEFAULT}")
print("=" * 60)

print("\nPART 1: Tensor Operations & Autograd")
print("-" * 60)

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)

z = (x @ y).sum() + (x ** 2).mean()
z.backward()

print(f"x:\n{x.numpy()}")
print(f"y:\n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:\n{x.grad.numpy()}")
print(f"∂z/∂y:\n{y.grad.numpy()}")

We set up Tinygrad in our Colab environment and immediately begin experimenting with tensors and automatic differentiation. We create a small computation graph and observe how gradients flow through matrix operations. As we print the outputs, we gain an intuitive understanding of how Tinygrad handles backpropagation under the hood. Check out the FULL CODES here.

print("\n\nPART 2: Building Custom Layers")
print("-" * 60)

class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        scale = (self.head_dim ** -0.5)
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)

class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()

We design our own multi-head attention module and a transformer block entirely from scratch. We implement the projections, attention scores, softmax, feedforward layers, and layer normalization manually. As we run this code, we see how each component contributes to a transformer layer’s overall behavior. Check out the FULL CODES here.
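
As a quick shape sanity check on the block defined above (assuming it has been run in the same session), we can push a random tensor through it and confirm that the residual structure preserves the input shape.

# Expect the output shape to match the input: residual attention and MLP keep (B, T, C).
blk = TransformerBlock(dim=32, num_heads=4)
dummy = Tensor.randn(2, 8, 32)
print(blk(dummy).shape)  # (2, 8, 32)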

print("\nPART 3: Mini-GPT Architecture")
print("-" * 60)

class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params

model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")

We assemble the full MiniGPT architecture using the components built earlier. We embed tokens, add positional information, stack multiple transformer blocks, and project the final outputs back to vocab logits. As we initialize the model, we begin to appreciate how a compact transformer can be built with surprisingly few moving parts. Check out the FULL CODES here.
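
As a quick check of the freshly initialized model, we can run a random batch of token IDs through it and confirm that the logits have shape (batch, sequence, vocab_size). This is a small sketch that reuses the integer-tensor construction style from this tutorial.

# Forward-pass check on random token IDs (no training yet)
idx = Tensor(np.random.randint(0, 256, (2, 16)), dtype='int32')
logits = model(idx)
print(logits.shape)              # expected: (2, 16, 256)
print(logits.numpy()[0, 0, :5])  # first five logits for the first token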

print("\n\n PART 4: Training Loop")
print("-" * 60)

def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype='int32'), Tensor(y, dtype='int32')

optimizer = optim.Adam(params, lr=0.001)
losses = []

print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")

print("\n\n PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)

N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)

print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")

print("\nCalling .realize() to execute...")
start = time.time()
realized = lazy_result.realize()
elapsed = time.time() - start

print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("\nNote: Operations were fused into optimized kernels!")

We train the MiniGPT model on simple synthetic data and observe the loss decreasing across steps. We also explore Tinygrad’s lazy execution model by creating a fused kernel that executes only when it is realized. As we monitor timings, we understand how kernel fusion improves performance. Check out the FULL CODES here.

print("\n\n PART 6: Custom Operations")
print("-" * 60)

def custom_activation(x):
    return x * x.sigmoid()

x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()

print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")

print("\n\n" + "=" * 60)
print(" Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations with automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a mini-GPT language model from scratch
4. Training loop with Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")

We implement a custom activation function and verify that gradients propagate correctly through it. We then print a summary of all major concepts covered in the tutorial. As we finish, we reflect on how each section builds our ability to understand, modify, and extend deep learning internals using Tinygrad.

In conclusion, we reinforce our understanding of how neural networks truly operate beneath modern abstractions, and we experience firsthand how Tinygrad empowers us to tinker with every internal detail. We have built a transformer, trained it on synthetic data, experimented with lazy evaluation and kernel fusion, and even created custom operations, all within a minimal, transparent framework. At last, we recognize how this workflow prepares us for deeper experimentation, whether we extend the model, integrate real datasets, or continue exploring Tinygrad’s low-level capabilities.

Check out the FULL CODES here.
The post How to Implement Functional Components of Transformer and Mini-GPT Model from Scratch Using Tinygrad to Understand Deep Learning Internals appeared first on MarkTechPost.

How Myriad Genetics achieved fast, accurate, and cost-efficient docume …

This post was written with Martyna Shallenberg and Brode Mccrady from Myriad Genetics.
Healthcare organizations face challenges in processing and managing high volumes of complex medical documentation while maintaining quality in patient care. These organizations need solutions to process documents effectively to meet growing demands. Myriad Genetics, a provider of genetic testing and precision medicine solutions serving healthcare providers and patients worldwide, addresses this challenge.
Myriad’s Revenue Engineering Department processes thousands of healthcare documents daily across Women’s Health, Oncology, and Mental Health divisions. The company classifies incoming documents into classes such as Test Request Forms, Lab Results, Clinical Notes, and Insurance to automate Prior Authorization workflows. The system routes these documents to appropriate external vendors for processing based on their identified document class. They manually perform Key Information Extraction (KIE) including insurance details, patient information, and test results to determine Medicare eligibility and support downstream processes.
As document volumes increased, Myriad faced challenges with its existing system. The automated document classification solution worked but was costly and time-consuming. Information extraction remained manual due to complexity. To address high costs and slow processing, Myriad needed a better solution.
This post explores how Myriad Genetics partnered with the AWS Generative AI Innovation Center (GenAIIC) to transform their healthcare document processing pipeline using Amazon Bedrock and Amazon Nova foundation models. We detail the challenges with their existing solution, and how generative AI reduced costs and improved processing speed.
We examine the technical implementation using AWS’s open source GenAI Intelligent Document Processing (GenAI IDP) Accelerator solution, the optimization strategies used for document classification and key information extraction, and the measurable business impact on Myriad’s prior authorization workflows. We cover how we used prompt engineering techniques, model selection strategies, and architectural decisions to build a scalable solution that processes complex medical documents with high accuracy while reducing operational costs.
Document processing bottlenecks limiting healthcare operations
Myriad Genetics’ daily operations depend on efficiently processing complex medical documents containing critical information for patient care workflows and regulatory compliance. Their existing solution combined Amazon Textract for Optical Character Recognition (OCR) with Amazon Comprehend for document classification.
Despite 94% classification accuracy, this solution had operational challenges:

Operational costs: 3 cents per page resulting in $15,000 monthly expenses per business unit
Classification latency: 8.5 minutes per document, delaying downstream prior authorization workflows

Information extraction was entirely manual, requiring contextual understanding to differentiate critical clinical distinctions (like “is metastatic” versus “is not metastatic”) and to locate information such as insurance numbers and patient details across varying document formats. The processing burden was substantial: in the Women’s Health business unit alone, customer service required up to 10 full-time employees contributing 78 hours daily.
Myriad needed a solution to:

Reduce document classification costs while maintaining or improving accuracy
Accelerate document processing to eliminate workflow bottlenecks
Automate information extraction for medical documents
Scale across multiple business units and document types

Amazon Bedrock and generative AI
Modern large language models (LLMs) process complex healthcare documents with high accuracy due to pre-training on massive text corpora. This pre-training enables LLMs to understand language patterns and document structures without feature engineering or large labeled datasets. Amazon Bedrock is a fully managed service that offers a broad range of high-performing LLMs from leading AI companies. It provides the security, privacy, and responsible AI capabilities that healthcare organizations require when processing sensitive medical information. For this solution, we used Amazon’s newest foundation models:

Amazon Nova Pro: A cost-effective, low-latency model ideal for document classification
Amazon Nova Premier: An advanced model with reasoning capabilities for information extraction

Solution overview
We implemented a solution with Myriad using AWS’s open source GenAI IDP Accelerator. The accelerator provides a scalable, serverless architecture that converts unstructured documents into structured data. The accelerator processes multiple documents in parallel through configurable concurrency limits without overwhelming downstream services. Its built-in evaluation framework lets users provide expected output through the user interface (UI) and evaluate generated results to iteratively customize configuration and improve accuracy.

The accelerator offers 1-click deployment with a choice of pre-built patterns optimized for different workloads with different configurability, cost, and accuracy requirements:

Pattern 1 – Uses Amazon Bedrock Data Automation, a fully managed service that offers rich out-of-the-box features, ease of use, and straightforward per-page pricing. This pattern is recommended for most use cases.
Pattern 2 – Uses Amazon Textract and Amazon Bedrock with Amazon Nova, Anthropic’s Claude, or custom fine-tuned Amazon Nova models. This pattern is ideal for complex documents requiring custom logic.
Pattern 3 – Uses Amazon Textract, Amazon SageMaker with a fine-tuned model for classification, and Amazon Bedrock for extraction. This pattern is ideal for documents requiring specialized classification.

Pattern 2 proved most suitable for this project, meeting the critical requirement of low cost while offering flexibility to optimize accuracy through prompt engineering and LLM selection. This pattern offers no-code configuration: document types, extraction fields, and processing logic are customized through a configuration file that is editable in the web UI.
We customized the definitions of document classes, key attributes and their definitions per document class, LLM choice, LLM hyperparameters, and classification and extraction LLM prompts via Pattern 2’s config file. In production, Myriad integrated this solution into their existing event-driven architecture. The following diagram illustrates the production pipeline:

Document Ingestion: Incoming order events trigger document retrieval from source document management systems, with cache optimization for previously processed documents.
Concurrency Management: Amazon DynamoDB tracks concurrent AWS Step Functions jobs, while Amazon Simple Queue Service (SQS) queues files that exceed concurrency limits for document processing.
Text Extraction: Amazon Textract extracts text, layout information, tables, and forms from the normalized documents.
Classification: The configured LLM analyzes the extracted content using the customized document classification prompt provided in the config file and classifies documents into appropriate categories.
Key Information Extraction: The configured LLM extracts medical information using the extraction prompt provided in the config file.
Structured Output: The pipeline formats the results into a structured response and delivers it to Myriad’s Authorization System via RESTful operations.

Document classification with generative AI
While Myriad’s existing solution achieved 94% accuracy, misclassifications occurred due to structural similarities, overlapping content, and shared formatting patterns across document types. This semantic ambiguity made it difficult to distinguish between similar documents. We guided Myriad on prompt optimization techniques that used LLM’s contextual understanding capabilities. This approach moved beyond pattern matching to enable semantic analysis of document context and purpose, identifying distinguishing features that human experts recognize but previous automated systems missed.
AI-driven prompt engineering for document classification
We developed class definitions with distinguishing characteristics between similar document types. To identify these differentiators, we provided document samples from each class to Anthropic Claude Sonnet 3.7 on Amazon Bedrock with model reasoning enabled (a feature that allows the model to demonstrate its step-by-step analysis process). The model identified distinguishing features between similar document classes, which Myriad’s subject matter experts refined and incorporated into the GenAI IDP Accelerator’s Pattern 2 config file for document classification prompts.
Format-based classification strategies
We used document structure and formatting as key differentiators to distinguish between similar document types that shared comparable content but differed in structure. We enabled the classification models to recognize format-specific characteristics such as layout structures, field arrangements, and visual elements, allowing the system to differentiate between documents that textual content alone cannot distinguish. For example, lab reports and test results both contain patient information and medical data, but lab reports display numerical values in tabular format while test results follow a narrative format. We instructed the LLM: “Lab reports contain numerical results organized in tables with reference ranges and units. Test results present findings in paragraph format with clinical interpretations.”
Implementing negative prompting for enhanced accuracy
We implemented negative prompting techniques to resolve confusion between similar documents by explicitly instructing the model what classifications to avoid. This approach added exclusionary language to classification prompts, specifying characteristics that should not be associated with each document type. Initially, the system frequently misclassified Test Request Forms (TRFs) as Test Results due to confusion between patient medical history and lab measurements. Adding a negative prompt like “These forms contain patient medical history. DO NOT confuse them with test results which contain current/recent lab measurements” to the TRF definition improved the classification accuracy by 4%. By providing explicit guidance on common misclassification patterns, the system avoided typical errors and confusion between similar document types.
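
The snippet below is a minimal, illustrative sketch of how a negative-prompting classification call to Amazon Nova Pro might look through the Amazon Bedrock Converse API with boto3. It is not the accelerator's actual configuration; the model ID, class definitions, and prompt wording are assumptions for illustration.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical class definitions pairing positive descriptions with negative guidance
classification_prompt = """You are a healthcare document classifier.
Classes:
- Test Request Form: contains patient medical history and ordered tests.
  DO NOT confuse with Test Results, which contain current/recent lab measurements.
- Lab Results: numerical results organized in tables with reference ranges and units.
- Insurance: coverage, policy, and payer information.

Classify the document below and reply with only the class name.

Document text:
{document_text}
"""

def classify(document_text: str) -> str:
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # assumed inference profile ID; adjust for your account and region
        messages=[{"role": "user", "content": [{"text": classification_prompt.format(document_text=document_text)}]}],
        inferenceConfig={"maxTokens": 50, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()
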
Model selection for cost and performance optimization
Model selection drives optimal cost-performance at scale, so we conducted comprehensive benchmarking using the GenAI IDP Accelerator’s evaluation framework. We tested four foundation models—Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Premier, and Anthropic Claude Sonnet 3.7—using 1,200 healthcare documents across three document classes: Test Request Forms, Lab Results, and Insurance. We assessed each model using three critical metrics: classification accuracy, processing latency, and cost per document. The accelerator’s cost tracking enabled direct comparison of operational expenses across different model configurations, ensuring performance improvements translate into measurable business value at scale.
The evaluation results demonstrated that Amazon Nova Pro achieved optimal balance for Myriad’s use case. We transitioned from Myriad’s Amazon Comprehend implementation to Amazon Nova Pro with optimized prompts for document classification, achieving significant improvements: classification accuracy increased from 94% to 98%, processing costs decreased by 77%, and processing speed improved by 80%—reducing classification time from 8.5 minutes to 1.5 minutes per document.
Automating Key Information Extraction with generative AI
Myriad’s information extraction was manual, requiring up to 10 full-time employees contributing 78 hours daily in the Women’s Health unit alone, which created operational bottlenecks and scalability constraints. Automating healthcare KIE presented challenges: checkbox fields required distinguishing between marking styles (checkmarks, X’s, handwritten marks); documents contained ambiguous visual elements like overlapping marks or content spanning multiple fields; extraction needed contextual understanding to differentiate clinical distinctions and locate information across varying document formats. We worked with Myriad to develop an automated KIE solution, implementing the following optimization techniques to address extraction complexity.
Enhanced OCR configuration for checkbox recognition
To address checkbox identification challenges, we enabled Amazon Textract’s specialized TABLES and FORMS features on the GenAI IDP Accelerator portal as shown in the following image, to improve OCR discrimination between selected and unselected checkbox elements. These features enhanced the system’s ability to detect and interpret marking styles found in medical forms.

We enhanced accuracy by incorporating visual cues into the extraction prompts. We updated the prompts with instructions such as “look for visible marks in or around the small square boxes (✓, x, or handwritten marks)” to guide the language model in identifying checkbox selections. This combination of enhanced OCR capabilities and targeted prompting improved checkbox extraction in medical forms.
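
Programmatically, enabling those features amounts to passing the TABLES and FORMS (and optionally LAYOUT) feature types to the Textract AnalyzeDocument API. The sketch below shows one possible boto3 call; the bucket and object names are placeholders, and multi-page PDFs would use the asynchronous StartDocumentAnalysis variant instead.

import boto3

textract = boto3.client("textract", region_name="us-east-1")

# FORMS returns key-value pairs (including checkbox SELECTED/NOT_SELECTED status),
# TABLES returns cell-level structure, and LAYOUT adds reading-order blocks.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-docs-bucket", "Name": "forms/sample_trf.png"}},
    FeatureTypes=["TABLES", "FORMS", "LAYOUT"],
)

checkboxes = [b for b in response["Blocks"] if b["BlockType"] == "SELECTION_ELEMENT"]
for box in checkboxes:
    print(box["SelectionStatus"])  # SELECTED or NOT_SELECTED
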
Visual context learning through few-shot examples
Configuring Textract and improving prompts alone could not handle complex visual elements effectively. We implemented a multimodal approach that sent both document images and extracted text from Textract to the foundation model, enabling simultaneous analysis of visual layout and textual content for accurate extraction decisions. We implemented few-shot learning by providing example document images paired with their expected extraction outputs to guide the model’s understanding of various form layouts and marking styles. Multiple document image examples with their correct extraction patterns create lengthy LLM prompts. We leveraged the GenAI IDP Accelerator’s built-in integration with Amazon Bedrock’s prompt caching feature to reduce costs and latency. Prompt caching stores lengthy few-shot examples in memory for 5 minutes—when processing multiple similar documents within that timeframe, Bedrock reuses cached examples instead of reprocessing them, reducing both cost and processing time.
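
A rough sketch of how few-shot document examples might be combined with Bedrock prompt caching through the Converse API is shown below. The cache-point placement, image handling, file names, and model ID are assumptions based on the Bedrock documentation, not Myriad's implementation.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("example_form.png", "rb") as f:  # hypothetical few-shot example image
    example_image = f.read()

few_shot_blocks = [
    {"text": "Example 1 - expected extraction:"},
    {"image": {"format": "png", "source": {"bytes": example_image}}},
    {"text": '{"insurance_id": "ABC123", "is_metastatic": false}'},
    # Cache everything above this point so repeated requests reuse the lengthy examples
    {"cachePoint": {"type": "default"}},
]

def extract(document_text: str):
    messages = [{"role": "user", "content": few_shot_blocks + [
        {"text": f"Extract the same fields from this document:\n{document_text}"}
    ]}]
    return bedrock.converse(
        modelId="us.amazon.nova-premier-v1:0",  # assumed model ID; verify availability in your region
        messages=messages,
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
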
Chain of thought reasoning for complex extraction
While this multimodal approach improved extraction accuracy, we still faced challenges with overlapping and ambiguous tick marks in complex form layouts. To perform well in ambiguous and complex situations, we used Amazon Nova Premier and implemented Chain of Thought reasoning to have the model think through extraction decisions step-by-step using thinking tags. For example:

Analyze the checkbox marks in this form:

<thinking>
1. What checkboxes are present? [List all visible options]
2. Where are the marks positioned? [Describe mark locations]
3. Which marks are clear vs ambiguous? [Assess mark quality]
4. For overlapping marks: Which checkbox contains most of the mark?
5. Are marks positioned in the center or touching edges? [Prioritize center positioning]
</thinking>

Additionally, we included reasoning explanations in the few-shot examples, demonstrating how we reached conclusions in ambiguous cases. This approach enabled the model to work through complex visual evidence and contextual clues before making final determinations, improving performance with ambiguous tick marks.
Testing across 32 document samples with varying complexity levels via the GenAI IDP Accelerator revealed that Amazon Textract with Layout, TABLES, and FORMS features enabled, paired with Amazon Nova Premier’s advanced reasoning capabilities and the inclusion of few-shot examples, delivered the best results. The solution achieved 90% accuracy (same as human evaluator baseline accuracy) while processing documents in approximately 1.3 minutes each.
Results and business impact
Through our new solution, we delivered measurable improvements that met the business goals established at the project outset:
Document classification performance:

We increased accuracy from 94% to 98% through prompt optimization techniques for Amazon Nova Pro, including AI-driven prompt engineering, document-format based classification strategies, and negative prompting.
We reduced classification costs by 77% (from 3.1 to 0.7 cents per page) by migrating from Amazon Comprehend to Amazon Nova Pro with optimized prompts.
We reduced classification time by 80% (from 8.5 to 1.5 minutes per document) by choosing Amazon Nova Pro to provide a low-latency and cost-effective solution.

New automated Key Information Extraction performance:

We achieved 90% extraction accuracy (same as the baseline manual process): Delivered through a combination of Amazon Textract’s document analysis capabilities, visual context learning through few-shot examples and Amazon Nova Premier’s reasoning for complex data interpretation.
We achieved processing costs of 9 cents per page and processing time of 1.3 minutes per document compared to manual baseline requiring up to 10 full-time employees working 78 hours daily per business unit.

Business impact and rollout
Myriad has planned a phased rollout beginning with document classification. They plan to launch our new classification solution in the Women’s Health business unit, followed by Oncology and Mental Health divisions. As a result of our work, Myriad will realize up to $132K in annual savings in their document classification costs. The solution reduces each prior authorization submission time by 2 minutes—specialists now complete orders in four minutes instead of six minutes due to faster access to tagged documents. This improvement saves 300 hours monthly across 9,000 prior authorizations in Women’s Health alone, equivalent to 50 hours per prior authorization specialist.
These measurable improvements have transformed Myriad’s operations, as their engineering leadership confirms:

“Partnering with the GenAIIC to migrate our Intelligent Document Processing solution from AWS Comprehend to Bedrock has been a transformative step forward. By improving both performance and accuracy, the solution is projected to deliver savings of more than $10,000 per month. The team’s close collaboration with Myriad’s internal engineering team delivered a high-quality, scalable solution, while their deep expertise in advanced language models has elevated our capabilities. This has been an excellent example of how innovation and partnership can drive measurable business impact.” – Martyna Shallenberg, Senior Director of Software Engineering, Myriad Genetics

Conclusion
The AWS GenAI IDP Accelerator enabled Myriad’s rapid implementation, providing a flexible framework that reduced development time. Healthcare organizations need tailored solutions—the accelerator delivers extensive customization capabilities that let users adapt solutions to specific document types and workflows without requiring extensive code changes or frequent redeployment during development. Our approach demonstrates the power of strategic prompt engineering and model selection. We achieved high accuracy in a specialized domain by focusing on prompt design, including negative prompting and visual cues. We optimized both cost and performance by selecting Amazon Nova Pro for classification and Nova Premier for complex extraction—matching the right model to each specific task.
Explore the solution for yourself
Organizations looking to improve their document processing workflows can experience these benefits firsthand. The open source GenAI IDP Accelerator that powered Myriad’s transformation is available to deploy and test in your environment. The accelerator’s straightforward setup process lets users quickly evaluate how generative AI can transform document processing challenges.
Once you’ve explored the accelerator and seen its potential impact on your workflows, reach out to the AWS GenAIIC team to explore how the GenAI IDP Accelerator can be customized and optimized for your specific use case. This hands-on approach ensures you can make informed decisions about implementing intelligent document processing in your organization.

About the authors
Priyashree Roy is a Data Scientist II at the AWS Generative AI Innovation Center, where she applies her expertise in machine learning and generative AI to develop innovative solutions for strategic AWS customers. She brings a rigorous scientific approach to complex business challenges, informed by her PhD in experimental particle physics from Florida State University and postdoctoral research at the University of Michigan.
Mofijul Islam is an Applied Scientist II and Tech Lead at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models (LLM), multi-agent learning, code generation, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual natural language processing (NLP), and multitask learning. His research has been published in top-tier conferences like NeurIPS, International Conference on Learning Representations (ICLR), Empirical Methods in Natural Language Processing (EMNLP), Society for Artificial Intelligence and Statistics (AISTATS), and Association for the Advancement of Artificial Intelligence (AAAI), as well as Institute of Electrical and Electronics Engineers (IEEE) and Association for Computing Machinery (ACM) Transactions.
Nivedha Balakrishnan is a Deep Learning Architect II at the AWS Generative AI Innovation Center, where she helps customers design and deploy generative AI applications to solve complex business challenges. Her expertise spans large language models (LLMs), multimodal learning, and AI-driven automation. She holds a Master’s in Applied Data Science from San Jose State University and a Master’s in Biomedical Engineering from Linköping University, Sweden. Her previous research focused on AI for drug discovery and healthcare applications, bridging life sciences with machine learning.
Martyna Shallenberg is a Senior Director of Software Engineering at Myriad Genetics, where she leads cross-functional teams in building AI-driven enterprise solutions that transform revenue cycle operations and healthcare delivery. With a unique background spanning genomics, molecular diagnostics, and software engineering, she has scaled innovative platforms ranging from Intelligent Document Processing (IDP) to modular LIMS solutions. Martyna is also the Founder & President of BioHive’s HealthTech Hub, fostering cross-domain collaboration to accelerate precision medicine and healthcare innovation.
Brode Mccrady is a Software Engineering Manager at Myriad Genetics, where he leads initiatives in AI, revenue systems, and intelligent document processing. With over a decade of experience in business intelligence and strategic analytics, Brode brings deep expertise in translating complex business needs into scalable technical solutions. He holds a degree in Economics, which informs his data-driven approach to problem-solving and business strategy.
Randheer Gehlot is a Principal Customer Solutions Manager at AWS who specializes in healthcare and life sciences transformation. With a deep focus on AI/ML applications in healthcare, he helps enterprises design and implement efficient cloud solutions that address real business challenges. His work involves partnering with organizations to modernize their infrastructure, enable innovation, and accelerate their cloud adoption journey while ensuring practical, sustainable outcomes.
Acknowledgements
We would like to thank Bob Strahan, Kurt Mason, Akhil Nooney and Taylor Jensen for their significant contributions, strategic decisions and guidance throughout.

How CBRE powers unified property management search and digital assista …

This post was written with Lokesha Thimmegowda, Muppirala Venkata Krishna Kumar, and Maraka Vishwadev of CBRE.
CBRE is the world’s largest commercial real estate services and investment firm. The company serves clients in more than 100 countries and offers services ranging from capital markets and leasing advisory to investment management, project management and facilities management.
CBRE uses AI to improve commercial real estate solutions with advanced analytics, automated workflows, and predictive insights. The chance to unlock value with AI in the commercial real estate lifecycle begins with data at scale. With the industry’s largest dataset and a comprehensive suite of enterprise-grade technology, the company has implemented a range of AI solutions to boost individual productivity and support broad-scale transformation.
This blog post describes how CBRE and AWS partnered to transform how property management professionals access information, creating a next-generation search and digital assistant experience that unifies access across many types of property data using Amazon Bedrock, Amazon OpenSearch Service, Amazon Relational Database Service, Amazon Elastic Container Service, and AWS Lambda.
Unified property management search challenges
CBRE’s proprietary PULSE system consolidates a wide range of essential property data—covering structured data from relational databases that record transactions and unstructured data stored in document repositories containing everything from lease agreements to property inspections. In the past, property management professionals had to sift through millions of documents and switch between multiple different systems to locate property maintenance details. Data was scattered across 10 distinct sources and four separate databases, which made it hard to get complete answers. This fragmented setup reduced productivity and made it difficult to uncover key insights about property operations.
Experts in property management, not database syntax, needed to ask complex questions in natural language, quickly synthesize disparate information, and avoid manual review of lengthy documents.
The challenge: deliver an intuitive, unified search solution bridging structured and unstructured content, with robust security, enterprise-grade performance and reliability.
Solution architecture
CBRE implemented a global search solution within PULSE, powered by Amazon Bedrock, to address these challenges. The search architecture is designed for a seamless, intelligent, and secure information retrieval experience across diverse data types. It orchestrates an interplay of user interaction, AI-driven processing, and robust data storage.
CBRE’s PULSE search solution uses Amazon Bedrock for the rapid deployment of generative AI capabilities by using multiple foundation models through a single API. CBRE’s implementation uses Amazon Nova Pro for SQL query generation, achieving a 67% reduction in processing time, while Claude Haiku powers intelligent document interactions. The solution maintains enterprise-grade security for all property data. By combining Amazon Bedrock capabilities with Retrieval Augmented Generation (RAG) and Amazon OpenSearch Service, CBRE created a unified search experience across more than eight million documents and multiple databases, fundamentally transforming how property professionals access and analyze business-critical information.
The following diagram illustrates the architecture for the solution that CBRE implemented in AWS:

Let us go through the flow for the solution:

Property Manager and PULSE UI: Property managers interact through the intuitive PULSE user interface, which serves as the gateway for both traditional keyword searches and natural language queries (NLQ). The UI displays search results, supports document conversations, and presents intelligent summaries on desktop and mobile.
Dynamic search execution: When users submit requests, the system first retrieves user-specific permissions from Amazon ElastiCache for Redis, chosen for its low latency and high throughput. Search operations across Amazon OpenSearch and transactional databases are then constrained by these user-specific permissions, making sure users only access authorized results with real-time granular control.
Orchestration layer: This central control hub serves as the application’s brain, receiving user requests from PULSE UI and intelligently routing them to appropriate backend services. Key responsibilities include:

Routing queries to relevant data systems (structured databases, unstructured documents, or both for deep search).
Initiating parallel searches across SQL Interact and Doc Interact components.
Merging, de-duplicating, and ranking results from disparate sources for unified outcomes.
Managing conversation history through Amazon DynamoDB integration.

SQL interact component (structured data search): This pathway manages interactions with structured relational databases (RDBMS) through these key steps:

4.1 Database metadata retrieval: Dynamically fetches schema details (for example, table names, column names, data types, relationships, constraints) for entities like property, contacts, and tenants from an Amazon OpenSearch index.
4.2 Amazon Bedrock LLM (Amazon Nova Pro): Interprets the user's natural language query alongside schema metadata, translating it into accurate, optimized SQL queries tailored to the database. Using Amazon Nova Pro, the solution reduced SQL query generation time from an average of 12 seconds to 4 seconds (see the sketch after this list).
4.3 RDBMS systems (PostgreSQL, MS SQL): Actual transactional databases, such as PostgreSQL and MS SQL, which house the core structured property management data (for example, properties, contacts, tenants, K2 forms). They execute the LLM-generated SQL queries and return the structured tabular results back to the SQL Interact component.
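
The following is a simplified sketch, not CBRE's production code, of how step 4.2 could work with the Bedrock Converse API: the schema metadata retrieved in step 4.1 is embedded in the prompt, and Amazon Nova Pro returns a SQL statement. The model ID, prompt text, and helper names are illustrative assumptions.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_sql(nl_query: str, schema_metadata: str) -> str:
    prompt = (
        "You translate natural language questions into SQL for the schema below.\n"
        f"Schema metadata:\n{schema_metadata}\n\n"
        f"Question: {nl_query}\n"
        "Return only the SQL statement."
    )
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # assumed inference profile ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 400, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()

# Example usage with hypothetical schema metadata pulled from the OpenSearch metadata index
print(generate_sql("List tenants whose leases expire this year",
                   "table tenants(tenant_id, name, lease_end_date, property_id)"))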

DocInteract Component (Unstructured Document Search): This pathway is specifically designed for intelligent search and interaction with unstructured documents.

5.1 Vector Store (OpenSearch Cluster): Stores documents, including those from OpenText, as high-dimensional vectors for efficient semantic search using techniques like k-Nearest Neighbors while prioritizing speed and accuracy with metadata filtering.
5.2 Amazon Bedrock LLM (Claude Haiku): Interprets NLQs and translates them into optimized OpenSearch DSL queries, while powering the “Chat With AI” feature for direct document interaction, generating concise, conversational responses including answers, summaries, and natural dialogue.

Having established the core architecture with both SQL Interact and DocInteract components, the following sections explore the specific optimizations and innovations implemented for each data type, beginning with structured data search enhancements.
Structured data search
Building on the SQL interact component outlined in the architecture, the PULSE Search application offers two search methods for accessing structured data in PostgreSQL and MS SQL. Keyword Search scans the fields and schemas for specific terms, providing comprehensive coverage of the entire data system. With Natural Language Query (NLQ) Search, users can interact with the databases using everyday language, and the system translates their questions into database queries. Both methods help property managers efficiently locate and retrieve information across the database modules.
Database layer search performance enhancement at the SQL level
Our unique challenge involved implementing application-wide keyword searches that needed to scan across the columns in database tables – a non-conventional requirement compared to traditional indexed column-specific searches in RDBMS systems. This universal search capability was essential for user experience, allowing information discovery without knowing specific column names or data structures.
We leveraged native full-text search capabilities in both PostgreSQL and MS SQL Server databases:

PostgreSQL Implementation:

SELECT * FROM dbo.pg_db_view_name bd WHERE textsearchable_all_col @@ to_tsquery('english', 'keyword')

Microsoft SQL Server Implementation:

SELECT * FROM [dbo].ms_db_view_name WHERE CONTAINS(*, '8384F')

Note: Our implementation uses a specialized text search column (textsearchable_all_col) that concatenates the searchable fields of the view pg_db_view_name, while ms_db_view_name represents a view created with full-text search indexing.
This optimization delivered an 80% improvement in query performance by harnessing native database capabilities while balancing comprehensive search coverage with optimal database performance through specialized indexing algorithms.
Database layer search performance enhancement at the SQL interact API level
We implemented several optimizations in database search functionality targeting three key performance indicators (KPIs): Accuracy (precision of results), Consistency (reproducible outcomes), and Relevancy (alignment of results with user intent). The enhancements reduced response latency while simultaneously boosting these ACR metrics, resulting in faster and more dependable search results.
Prompt Engineering Changes: We implemented a comprehensive approach to prompt management and optimization, focusing on the following factors.

Configurability: We implemented modular prompt templates stored in external files to enable version control, simplified management, and reduced prompt size, improving performance and maintainability.
Dynamic field selection for context window reduction: The system uses KNN-based similarity search to filter and select only the most relevant schema fields aligned with user intent, reducing context window size and optimizing prompt effectiveness.
Dynamic few-shot examples: The system intelligently selects the most relevant few-shot example from a configuration file using KNN-based similarity search for SQL generation. This context-aware approach makes sure that only the most pertinent example is included in the prompt, minimizing unnecessary data overhead, and it helped us get consistent and accurate SQL generation from the LLM (see the sketch after this list).
Business rule integration: The system maintains a centralized repository of business rules in a dedicated schema wise configuration file, making rule management and updates streamlined and efficient. During prompt generation, relevant business rules are dynamically integrated into prompts, facilitating consistency in rule application while providing flexibility for updates and maintenance.
LLM score-based relevancy: We added a fourth LLM call to evaluate and reorder schema relevance after initial KNN retrieval, addressing challenges where vector search returned irrelevant or poorly ordered schemas. For example, when processing a user query about property or contact information, the vector search might return three schemas, but:

The third schema might be irrelevant to the query.
The ordering of the two relevant schemas might not reflect their true relevancy to the query.
To address these challenges, we introduced an additional LLM processing (4th LLM parallel call) step that:

Evaluates the relevance of each schema to the user query.
Assigns relevancy scores to determine schema importance.
Reorders schemas based on their actual relevance to the query.
This enhancement improved our schema selection process by:

Making sure only truly relevant schemas are selected.
Maintaining proper relevancy ordering.
Providing more accurate context for subsequent query processing.

These enhancements improved schema selection by verifying only truly relevant schemas are processed, maintaining proper relevancy ordering, and providing more accurate context for query processing. The result was more precise, contextually appropriate responses and improved overall application performance.
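
As a rough illustration of the dynamic few-shot selection described above (a sketch with an assumed embedding model and example store, not the production implementation), the closest example to the user query can be picked by cosine similarity over precomputed embeddings:

import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> np.ndarray:
    # Assumed embedding model; any endpoint that returns fixed-size vectors works here
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                body=json.dumps({"inputText": text}))
    return np.array(json.loads(resp["body"].read())["embedding"])

# Hypothetical example repository: (question, SQL) pairs with precomputed embeddings
examples = [
    {"q": "Which tenants are in building 12?", "sql": "SELECT name FROM tenants WHERE building_id = 12"},
    {"q": "Show contacts for property ABC", "sql": "SELECT * FROM contacts WHERE property = 'ABC'"},
]
for ex in examples:
    ex["vec"] = embed(ex["q"])

def best_example(user_query: str) -> dict:
    qv = embed(user_query)
    scores = [float(qv @ ex["vec"] / (np.linalg.norm(qv) * np.linalg.norm(ex["vec"]))) for ex in examples]
    return examples[int(np.argmax(scores))]  # top-1 example keeps the prompt small
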
Parallel LLM inference for SQL generation with Amazon Nova Pro
We implemented a comprehensive parallel processing architecture for NLQ to SQL conversion, enhancing system performance and efficiency. The solution introduces concurrent schema-based API calls to the LLM inference engine, with asynchronous processing for multiple schema evaluations. Our security-first approach authenticates and validates user entitlements while performing context-aware schema identification that incorporates similarity search and enforces access permissions. The system only processes schemas for which the user has explicit authorization, facilitating foundational data security. Following authentication, the system dynamically generates prompts (as detailed in our prompt engineering framework) and initiates concurrent processing of the most relevant schemas through parallel LLM inference calls. Before execution, it enhances the generated SQL queries with mandatory security joins that enforce building-level access controls, restricting users to their authorized buildings only.
Finalized SQL queries are executed on respective database systems (PostgreSQL or SQL Server). The system processes the query results and returns them as a structured API response, maintaining security and data integrity throughout the entire workflow. This architecture facilitates both optimal performance through parallel processing and comprehensive security through multi-layered access controls.
This integrated approach incorporates concurrent validation of generated SQL queries, reducing processing time, improving system throughput, and lowering inference latency. With the introduction of Amazon Nova Pro, inference latency improved significantly. The framework's architecture facilitates efficient resource utilization while maintaining high accuracy in SQL query generation, making it particularly effective for handling complex database operations and high-volume query processing requirements.
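
A simplified sketch of the parallel, per-schema SQL generation is shown below, using a thread pool with boto3 rather than CBRE's actual orchestration code; the generate_sql helper is the one sketched earlier and the schema list is illustrative.

from concurrent.futures import ThreadPoolExecutor

def generate_for_schemas(nl_query: str, authorized_schemas: list) -> dict:
    # One concurrent Bedrock call per authorized, relevant schema
    with ThreadPoolExecutor(max_workers=max(1, len(authorized_schemas))) as pool:
        futures = {
            schema["name"]: pool.submit(generate_sql, nl_query, schema["metadata"])
            for schema in authorized_schemas
        }
        return {name: future.result() for name, future in futures.items()}

# Example: schemas the user is entitled to, each with its metadata string
results = generate_for_schemas(
    "Total open work orders per property this month",
    [{"name": "property", "metadata": "table work_orders(id, property_id, status, created_at)"},
     {"name": "contacts", "metadata": "table contacts(id, property_id, role, email)"}],
)
print(results)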

Enhancing unstructured data search
The PULSE document search uses two main methods, enhanced by purpose-built specialized search functions. Users can employ the streamlined Keyword Search to precisely locate terms within documents and metadata for fast retrieval when precise search terms are known. This straightforward approach makes sure users can quickly locate exact matches across the entire document landscape. The second method, Natural Language Query (NLQ) Search, supports interaction with documents using everyday language, interpreting intent and converting queries into search parameters, which is particularly powerful for complex or concept-based queries. Complementing these core search methods, the system offers specialized search capabilities including Favorites and Collections search so users can efficiently navigate their personally curated document sets and shared collections. Additionally, the system provides intelligent document upload search functionality that helps users quickly locate appropriate document categories and upload locations based on document types and property contexts.
The search infrastructure supports comprehensive file formats including PDFs, Microsoft Office documents (Word, Excel, PowerPoint), emails (MSG), images (JPG, PNG), text files, HTML files, and various other document types, facilitating comprehensive coverage across the document categories in the property management environment.
Prompt engineering and management optimization
Our Document Search system incorporates advanced prompt engineering techniques to enhance search accuracy, efficiency, and maintainability. Let’s explore the key features of our prompt management system and the value they bring to the search experience.
Two-stage prompt architecture and modular prompt management:
At the core of our system is a two-stage prompt architecture. This design separates tool selection from task execution for more efficient and accurate query processing.

# Modular prompt loading from configuration
get_doc_detect_prompt = get_prompts("doc_prompts/tool_detect/Get_Document_data_detect")
get_doc_prompt = get_prompts("doc_prompts/prepare_prompt/Get_Document_data_prompt")
keyword_search_detect_prompt = get_prompts("doc_prompts/tool_detect/keyword_search_detect")

def detect_tool(user_prompt):
    tool_descriptions = {
        "Get_Document_data": get_doc_detect_prompt,
        "keyword_search": keyword_search_detect_prompt,
        "Get_Favdocs_collections": fav_collection_detect_prompt,
        "upload_documents": upload_document_detect_prompt
    }

    messages = [
        {"role": "system", "content": "You are an AI assistant that determines the most appropriate tool…"},
        {"role": "user", "content": f"Here are the tool descriptions:\n{json.dumps(tool_descriptions, indent=2)}\n\nUser query: {user_prompt}\n\nWhich tool should be used?"}
    ]

This architecture reduces token usage by up to 60% by loading only necessary prompts per query processing stage. The lightweight initial stage quickly routes queries to appropriate tools, while specialized prompts handle the actual execution with focused context, improving both performance and accuracy in tool selection and query execution.
Our modular prompt management system stores prompts in external configuration files for dynamic loading based on context and supporting personalization. It supports prompt updates without code deployments, cutting update cycles from hours to minutes. This architecture facilitates A/B testing of different prompt variations and quick rollbacks, enhancing system adaptability and reliability.

def prepare_tool_prompt(detected_tool, userid):
    tool_prompts = {
        "keyword_search": keyword_search_prompt,
        "Get_Document_data": get_doc_prompt.replace("userid", userid),
        "upload_documents": upload_document_prompt,
        "Get_Favdocs_collections": fav_collection_prompt
    }
    return tool_prompts[detected_tool]

The system implements context-aware prompt selection, adapting to query types, document characteristics, and search contexts. This approach makes sure that the most appropriate prompt and query structure are used for each unique search scenario. For example, the system distinguishes between different question types (for example, ‘list_question’) for tailored processing of various query intents.
Search algorithm optimization
Our document search system implements search algorithms that combine vector-based semantic search with traditional text-based approaches to search across document metadata and content. We use different query strategies optimized for specific search scenarios.
Keyword search:
Keyword search uses a dual strategy combining both metadata and content searches using phrase matching. A fixed query template structure facilitates efficiency and consistency, incorporating predefined metadata, content, permission rules, and building ID constraints, while dynamically integrating user-specific terms and roles. This approach allows for fast and reliable searches while maintaining proper access controls and relevance.
User queries like “lease agreement” or “property tax 2023” are parsed into component words, each requiring a match in the document content for relevancy, facilitating precise results.

"bool": {
    "must": [
        {"match_phrase": {"srccontent": word}} for word in search_words
    ]
}

Similarly, for metadata searches, the system uses phrase searching across metadata fields:

"multi_match": {
    "query": search_words,
    "type": "phrase",
    "fields": ["srcmetadata"]
}

This approach provides exact matching capabilities across document metadata, facilitating precise results when users are searching for specific document properties. The system executes both search types concurrently and results from both searches are then merged and deduplicated, with scoring normalized across both result sets.
Natural language query search:
Our NLQ search combines LLM-generated queries with vector-based semantic search through two main components. The metadata search uses an LLM to generate OpenSearch queries from natural language input. For instance, “Find lease agreements mentioning early termination for tech companies from last year” is transformed into a structured query that searches across document types, dates, property names and other metadata fields.
For content searches, we employ KNN vector search with a K-factor of 5 to identify semantically similar content. The system converts queries into vector embeddings and executes both metadata and content searches simultaneously, combining results while minimizing duplicates.
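
The content-side k-NN query can be expressed directly as OpenSearch DSL. The sketch below (index and field names are assumptions, and the embed helper stands in for whatever embedding model is used) shows a k=5 vector query issued with the opensearch-py client.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "example-domain-endpoint", "port": 443}], use_ssl=True)

query_vector = embed("lease agreements mentioning early termination")  # embedding helper assumed

knn_query = {
    "size": 5,
    "query": {
        "knn": {
            "content_embedding": {        # assumed vector field name
                "vector": query_vector.tolist(),
                "k": 5
            }
        }
    },
    "_source": ["srcmetadata", "document_id"]
}

hits = client.search(index="pulse-documents", body=knn_query)["hits"]["hits"]
for hit in hits:
    print(hit["_score"], hit["_source"].get("document_id"))
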
Chat with Document (digital assistant for in-depth document interaction):
The Chat with Document feature supports natural conversation with specific documents after initial search. Users can ask questions, request summaries, or seek specific information from selected documents through a straightforward interaction process.
When engaged, the system retrieves the complete document content using its node identifier and processes user queries through a streamlined pipeline. Each query is handled by an LLM using carefully constructed prompts that combine the user’s question with relevant document context.
With this capability users can extract information from complex documents efficiently. For example, property managers can quickly understand lease terms or payment schedules without manually scanning lengthy agreements. The feature provides instant summaries and explanations for rapid information access and decision-making in document-intensive workflows.
Scaling document ingestion
To handle high-throughput document processing and large-scale enterprise ingestion, our ingestion pipeline uses asynchronous Amazon Textract for scalable, parallel text extraction. The architecture efficiently processes diverse file types (PDFs, PPTs, Word documents, Excel files, and images), even with hundreds of pages or high-resolution content. Once a document is uploaded to an Amazon S3 bucket, a message triggers an SQS queue, invoking a Lambda function that initiates an asynchronous Textract job, offloading heavy extraction and OCR tasks without blocking execution.
For text documents, the system reads the file from Amazon S3 and submits it to Amazon Textract’s asynchronous API, which processes the document in the background. Once the job completes, the results are retrieved and parsed to extract structured text. This text is then chunked intelligently—based on token count or semantic boundaries—and passed through a Bedrock embedding model (For example, Amazon Titan Text embeddings v2). Each chunk is enriched with metadata and indexed into Amazon OpenSearch for fast and context-aware search capabilities. Once ingested, our intelligent query strategy, driven by user and CBRE market lookups, dynamically directs searches to the relevant OpenSearch indexes.
Image files follow a similar flow but use Amazon Bedrock Claude 3 Haiku for OCR after base64 conversion. Extracted text is then chunked, embedded, and indexed like standard text documents.
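
Inside the Lambda handler, the ingestion steps described above might look roughly like the following condensed sketch; the bucket, index, chunk size, and model ID are placeholders, and a production pipeline would receive job completion through an SNS notification instead of polling.

import json
import time
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def handle_document(bucket: str, key: str):
    # Kick off asynchronous, non-blocking text extraction with layout/table/form analysis
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS", "LAYOUT"],
    )
    job_id = job["JobId"]

    # Simplified polling; the real pipeline reacts to the Textract completion notification
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    text = " ".join(b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE")
    chunks = [text[i:i + 2000] for i in range(0, len(text), 2000)]  # naive fixed-size chunking

    for chunk in chunks:
        resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                    body=json.dumps({"inputText": chunk}))
        embedding = json.loads(resp["body"].read())["embedding"]
        # index `chunk`, `embedding`, and document metadata into the relevant OpenSearch index here
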
Security and access control
User authentication and authorization occurs through a multi-layered security process:

Access token validation: The system verifies the user's identity against Microsoft B2C and validates their access token on each request. The user is also checked for authorization to access the application.
Entitlement verification: Simultaneously, the system checks the user's permissions in a Redis database to verify they have the appropriate access rights to specific application modules and the database schemas (entitlements) they are authorized to query.
Property access validation: The system also retrieves the user's authorized building list from the Redis database (the building ID list to which the user is mapped), making sure they can only access data related to the properties in their business portfolio.

This parallel validation process facilitates secure, appropriate access while maintaining optimal performance through Redis's high-speed data retrieval. Redis is populated at application load from the user entitlement and building mappings maintained in the database. If the user details are not found in Redis, an API is invoked to repopulate the Redis database.
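
A minimal sketch of the Redis-backed entitlement lookup is shown below; the key names, host, and fallback API are assumptions for illustration.

import json
import redis

cache = redis.Redis(host="entitlement-cache.example.internal", port=6379, ssl=True)

def get_user_access(user_id: str) -> dict:
    entitlements = cache.get(f"entitlements:{user_id}")
    buildings = cache.get(f"buildings:{user_id}")
    if entitlements is None or buildings is None:
        # Cache miss: call the (hypothetical) entitlement API and repopulate Redis
        access = fetch_entitlements_from_api(user_id)  # assumed helper
        cache.set(f"entitlements:{user_id}", json.dumps(access["entitlements"]))
        cache.set(f"buildings:{user_id}", json.dumps(access["buildings"]))
        return access
    return {"entitlements": json.loads(entitlements), "buildings": json.loads(buildings)}

# Every downstream SQL or OpenSearch query is then constrained to the returned building list.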

Results and impact
CBRE’s experience with this initiative has led to enhanced operational efficiency and data reliability, directly translating into tangible business benefits:

Cost savings and resource optimization: By reducing hours of manual effort annually per user, the business can realize substantial cost savings (for example, in labor costs, reduced overtime, or reallocated personnel). This frees up valuable user time so that the team can focus on more strategic, high-value tasks that drive building performance, innovation and growth rather than repetitive manual processes.
Improved decision-making and risk mitigation: Delivering results with 95% accuracy means business decisions are based on highly reliable data. This minimizes the risk of errors, leading to more informed strategies, fewer costly mistakes, and ultimately, better business outcomes.
Increased productivity and throughput: With less time spent on manual tasks and a higher assurance of data quality, workflows can become smoother and faster. This translates to increased overall productivity and potentially higher throughput for related processes, enhancing service delivery.

Lessons learned and best practices
The following are our lessons learned and best practices based on our experience building this solution:

Use prompt modularization: Prompt engineering is essential for optimizing application performance and maintaining consistent results. Breaking prompts into modular components improved prompt management, and enhanced control and maintainability through streamlined version control, simplified testing and validation, and better performance tracking. The modular approach to prompt design reduced token usage, which in turn decreased LLM response times and improved overall system performance. It also enhanced SQL generation efficiency through faster troubleshooting, reduced implementation time, and more reliable query generation, resulting in quicker resolution of edge cases and business rule updates.
Provide accurate few-shot examples: To increase the accuracy and consistency of SQL generation, use dynamic few-shot examples with modular components so the example repository can be updated seamlessly.

Include examples covering common use cases and edge scenarios.
Maintain a diverse set of high-quality example pairs covering various business scenarios.
Keep examples concise and focused on specific patterns.
Regularly update examples based on new business requirements. Remove or update outdated examples.
Limit to top-1 or top-2 most relevant examples to manage token usage.
Regularly validate the relevance of selected examples.
Set up feedback loops to continuously improve example matching accuracy.
Fine-tune similarity thresholds for optimal example matching.

Reduce the context window: To shrink the context passed to the model, select only the top-N KNN-matched fields from the schema definition along with the key/mandatory fields. Apply dynamic field selection only to schemas that have a high number of fields and would otherwise inflate the context window.
Improve relevancy: The LLM scoring mechanism helped us obtain the right set of relevant schemas (modules). Applying LLM intelligence over the KNN results for the relevant modules produced the most relevant, correctly ordered results. Also consider:

Vector similarity alone may not capture true semantic relevance.
Top-K nearest neighbors don’t always guarantee contextual accuracy.
Order of results may not reflect actual relevance to the query.
Use of LLM Scoring provided a more accurate schema relevancy determination.

Conclusion
CBRE Property Management and AWS together demonstrated how innovative cloud AI solutions can unlock real business value at scale. By using AWS services and best practices, enterprises can reimagine how they access, manage, and derive insight from their data and take real action.
To learn how your organization can accelerate digital transformation with AWS, contact your AWS account team or start exploring AWS AI and data analytics services today.
Further reading on AWS services featured in this solution:

Amazon Bedrock: Foundation Model Service
Amazon Nova
Amazon OpenSearch Service documentation

About the authors
Lokesha Thimmegowda is a Senior Principal Software Engineer at CBRE, specializing in artificial intelligence and AWS. With four AWS certifications, including Solutions Architect Professional and AWS AI Practitioner, he excels at guiding teams through complex challenges with innovative solutions. Lokesha is passionate about designing transformative solution architectures that drive efficiency. Outside of work, he enjoys daily tennis with his daughters and weekend cricket.
Muppirala Venkata Krishna Kumar is a Principal Software Engineer at CBRE with over 18 years of expertise in leading technical teams and designing end-to-end solutions across diverse domains. He is a strategic technical lead with a strong command of both front-end and back-end technologies, cloud architecture using AWS, and AI/ML-driven innovations. He is passionate about staying at the forefront of technology, continuously learning, and implementing modern tools to drive impactful results. Outside of work, he values quality time with family and enjoys spiritual travel experiences that bring balance and inspiration.
Maraka Vishwadev is a Senior Staff Engineer at CBRE with 18 years of experience in enterprise software development, specializing in backend–frontend technologies and AWS Cloud. He leads impactful initiatives in Generative AI, leveraging Large Language Models to drive intelligent automation, enhance user experiences, and unlock new business capabilities. He is deeply involved in architecting and delivering scalable, secure, and cloud-native solutions, aligning technology with business strategy. Vishwa balances his professional life with cooking, movies, and quality family time.
Chanpreet Singh is a Senior Consultant at AWS with 18+ years of industry experience, specializing in Data Analytics and AI/ML solutions. He partners with enterprise customers to architect and implement cutting-edge solutions in Big Data, Machine Learning, and Generative AI using AWS native services, partner solutions and open-source technologies. A passionate technologist and problem solver, he balances his professional life with nature exploration, reading, and quality family time.
Sachin Khanna is a Lead Consultant specializing in Artificial Intelligence and Machine Learning (AI/ML) within the AWS Professional Services team. With a strong background in data management, generative AI, large language models, and machine learning, he brings extensive expertise to projects involving data, databases, and AI-driven solutions. His proficiency in cloud migration and cost optimization has enabled him to guide customers through successful cloud adoption journeys, delivering tailored solutions and strategic insights.
Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With seven AWS certifications including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.

Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker H …

Modern AI applications demand fast, cost-effective responses from large language models, especially when handling long documents or extended conversations. However, LLM inference can become prohibitively slow and expensive as context length increases, with latency growing exponentially and costs mounting with each interaction.
LLM inference requires recalculating attention mechanisms for the previous tokens when generating each new token. This creates significant computational overhead and high latency for long sequences. Key-value (KV) caching addresses this bottleneck by storing and reusing key-value vectors from previous computations, reducing inference latency and time-to-first-token (TTFT). Intelligent routing in LLMs is a technique that sends requests with shared prompts to the same inference instance to maximize the efficiency of the KV cache. It routes a new request to an instance that has already processed the same prefix, allowing it to reuse the cached KV data to accelerate processing and reduce latency. However, customers have told us that setting up and configuring the right framework for KV caching and intelligent routing at production scale is challenging and takes long experimental cycles.
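To make the mechanism concrete, here is a minimal, framework-agnostic sketch of KV caching for a single attention head using NumPy; the toy dimensions and projection matrices are assumptions for illustration, and real serving stacks manage this per layer and per head inside the inference engine.

import numpy as np

d = 8                                  # toy head dimension
Wq, Wk, Wv = (np.random.randn(d, d) * 0.1 for _ in range(3))
k_cache, v_cache = [], []              # grow by one entry per generated token

def attend(x_new):
    # Reuse cached K/V from earlier tokens instead of recomputing them.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x_new @ Wq
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token only

for _ in range(5):                     # decode five tokens; past K/V are never recomputed
    out = attend(np.random.randn(d))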
Today we’re excited to announce that Amazon SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing capabilities through the HyperPod Inference Operator. These new capabilities can deliver significant performance improvements for LLM inference workloads, reducing time to first token (TTFT) by up to 40%, increasing throughput, and lowering compute costs by up to 25% for long-context prompts and multi-turn chat conversations, as measured with our internal benchmarking tools. These capabilities are available for use with the HyperPod Inference Operator, which automatically manages the routing and distributed KV caching infrastructure, significantly reducing operational overhead while delivering enterprise-grade performance for production LLM deployments. By using the new Managed Tiered KV Cache feature, you can efficiently offload attention caches to CPU memory (L1 cache) and distribute the L2 cache for cross-instance sharing through a tiered storage architecture in HyperPod, for optimal resource utilization and cost efficiency at scale.
Efficient KV caching combined with intelligent routing maximizes cache hits across workers so you can achieve higher throughput and lower costs for your model deployments. These features are particularly beneficial in applications that are processing long documents where the same context or prefix is referenced, or in multi-turn conversations where context from previous exchanges needs to be maintained efficiently across multiple interactions.
For example, legal teams analyzing 200 page contracts can now receive instant answers to follow-up questions instead of waiting 5+ seconds per query, healthcare chatbots maintain natural conversation flow across 20+ turn patient dialogues, and customer service systems process millions of daily requests with both better performance and lower infrastructure costs. These optimizations make document analysis, multi-turn conversations, and high-throughput inference applications economically viable at enterprise scale.
Optimizing LLM inference with Managed Tiered KV Cache and Intelligent Routing
Let’s break down the new features:

Managed Tiered KV Cache: Automatic management of attention states across CPU memory (L1) and distributed tiered storage (L2) with configurable cache sizes and eviction policies. SageMaker HyperPod handles the distributed cache infrastructure through the newly launched tiered storage, alleviating operational overhead for cross node cache sharing across clusters. KV cache entries are accessible cluster-wide (L2) so that a node can benefit from computations performed by other nodes.
Intelligent Routing: Configurable request routing to maximize cache hits using strategies like prefix-aware, KV-aware, and round-robin routing.
Observability: Built-in HyperPod observability integration provides metrics and logs for Managed Tiered KV Cache and Intelligent Routing in Amazon Managed Grafana.

Sample flow for inference requests with KV caching and Intelligent Routing
As a user sends an inference request to the HyperPod Load Balancer, it forwards the request to the Intelligent Router within the HyperPod cluster. The Intelligent Router dynamically distributes requests to the most appropriate model pod (Instance A or Instance B) based on the routing strategy to maximize KV cache hits and minimize inference latency. As the request reaches the model pod, the pod first checks the L1 cache (CPU) for frequently used key-value pairs, then queries the shared L2 cache (Managed Tiered KV Cache) if needed, before performing full computation of the token. Newly generated KV pairs are stored in both cache tiers for future reuse. After computation completes, the inference result flows back through the Intelligent Router and Load Balancer to the user.

Managed Tiered KV Cache
Managed Tiered KV Cache and Intelligent Routing are configurable opt-in features. When you enable Managed Tiered KV Cache, the L1 cache is enabled by default, while both L1 and L2 caches can be individually enabled or disabled. The L1 cache resides locally on each inference node and uses CPU memory. This local cache provides significantly faster access, making it ideal for frequently accessed data within a single model instance. The cache automatically manages memory allocation and eviction policies to optimize for the most valuable cached content. The L2 cache operates as a distributed cache layer spanning the entire cluster, enabling cache sharing across multiple model instances. We support two backend options for the L2 cache, each with the following benefits:

Managed Tiered KV Cache (Recommended): A HyperPod disaggregated memory solution that offers excellent scalability to terabyte-sized pools, low latency, an AWS network-optimized and GPU-aware design with zero-copy support, and cost efficiency at scale.
Redis: Simple to set up, works well for small to medium workloads, and offers a rich ecosystem of tools and integrations.

The two-tier architecture works together seamlessly. When a request arrives, the system first checks the L1 cache for the required KV pairs. If found, they are used immediately with minimal latency. If not found in L1, the system queries the L2 cache. If found there, the data is retrieved and optionally promoted to L1 for faster future access. Only if the data is not present in either cache does the system perform the full computation, storing the results in both L1 and L2 for future reuse.
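The lookup-and-promotion logic can be illustrated with a minimal sketch. The class below is a simplified stand-in for the managed infrastructure, which also handles eviction, serialization, and network transport for you; the dictionary-backed tiers and the compute_kv callback are assumptions made for illustration.

class TieredKVCache:
    # Toy two-tier cache: L1 is node-local, L2 is shared across the cluster.
    def __init__(self):
        self.l1 = {}   # per-node CPU memory
        self.l2 = {}   # cluster-wide tier (managed tiered storage or Redis)

    def get_or_compute(self, prefix_key, compute_kv):
        if prefix_key in self.l1:              # fastest path: local hit
            return self.l1[prefix_key]
        if prefix_key in self.l2:              # cluster hit: promote to L1
            self.l1[prefix_key] = self.l2[prefix_key]
            return self.l1[prefix_key]
        kv = compute_kv(prefix_key)            # miss: full prefill computation
        self.l1[prefix_key] = kv               # store in both tiers for reuse
        self.l2[prefix_key] = kv
        return kv

cache = TieredKVCache()
kv = cache.get_or_compute("contract-123:chunk-0", compute_kv=lambda key: f"kv-for-{key}")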
Intelligent Routing
Intelligent Routing offers configurable strategies to optimize request distribution based on your workload characteristics, with the routing strategy being user-configurable at deployment time to match your application’s specific requirements (a minimal sketch of the prefix-aware strategy follows the table below).

Prefix-aware routing serves as the default strategy, maintaining a tree structure to track which prefixes are cached on which endpoints, delivering strong general-purpose performance for applications with common prompt templates such as multi-turn conversations, customer service bots with standard greetings, and code generation with common imports.
KV-aware routing provides the most sophisticated cache management through a centralized controller that tracks cache locations and handles eviction events in real-time, excelling at long conversation threads, document processing workflows, and extended coding sessions where maximum cache efficiency is critical.
Round-robin routing offers the most straightforward approach, distributing requests evenly across the available workers, best suited for scenarios where requests are independent, such as batch inference jobs, stateless API calls, and load testing scenarios.

Strategy / Best for
Prefix-aware routing (default) / Multi-turn conversations, customer service bots, code generation with common headers
KV-aware routing / Long conversations, document processing, extended coding sessions
Round-robin routing / Batch inference, stateless API calls, load testing

Deploying the Managed Tiered KV Cache and Intelligent Routing solution
Prerequisites
Create a HyperPod cluster with Amazon EKS as an orchestrator.

In the Amazon SageMaker AI console, navigate to HyperPod Clusters, then Cluster Management.
On the Cluster Management page, select Create HyperPod cluster, then Orchestrated by Amazon EKS.
You can use one-click deployment from the SageMaker AI console. For cluster setup details, see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
Verify that the HyperPod cluster status is InService.

Verify that the inference operator is up and running. The Inference add-on is installed as a default option when you create the HyperPod cluster from the console. If you want to use an existing EKS cluster, see Setting up your HyperPod clusters for model deployment to manually install the inference operator.

From the command line, run the following command: 

kubectl get pods -n hyperpod-inference-system

Output:

hyperpod-inference-operator-controller-manager-xxxxxx pod is in running state in namespace hyperpod-inference-system

Alternatively, verify that the operator is running from the console: navigate to the EKS cluster, then Resources, Pods, and pick the hyperpod-inference-system namespace.

Preparing your model deployment manifest files
You can enable these features by adding configurations to your InferenceEndpointConfig custom resource manifest.
For the complete example, visit the AWS samples GitHub repository.

export MODEL_NAME="Llama-3.1-8B-Instruct"
export INSTANCE_TYPE="ml.g5.24xlarge"
export MODEL_IMAGE="public.ecr.aws/deep-learning-containers/vllm:0.11.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.0"
export S3_BUCKET="my-model-bucket"
export S3_MODEL_PATH="models/Llama-3.1-8B-Instruct"
export AWS_REGION="us-west-2"
export CERT_S3_URI="s3://my-bucket/certs/"
export NAMESPACE="default"
export NAME="demo"

cat << EOF > inference_endpoint_config.yaml
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: ${NAME}
  namespace: ${NAMESPACE}
spec:
  modelName: ${MODEL_NAME}
  instanceType: ${INSTANCE_TYPE}
  replicas: 1
  invocationEndpoint: v1/chat/completions
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: ${S3_BUCKET}
      region: ${AWS_REGION}
    modelLocation: ${S3_MODEL_PATH}
    prefetchEnabled: false
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: "tieredstorage" # can also be "redis"
      # Set l2CacheLocalUrl if selecting "redis"
      # l2CacheLocalUrl: "redis://redis.default.svc.cluster.local:6379"
  intelligentRoutingSpec:
    enabled: true
    routingStrategy: prefixaware
  tlsConfig:
    tlsCertificateOutputS3Uri: ${CERT_S3_URI}
  metrics:
    enabled: true
    modelMetrics:
      port: 8000
  loadBalancer:
    healthCheckPath: /health
  worker:
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "4"
    image: ${MODEL_IMAGE}
    args:
      - "--model"
      - "/opt/ml/model"
      - "--max-model-len"
      - "20000"
      - "--tensor-parallel-size"
      - "4"
    modelInvocationPort:
      containerPort: 8000
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables:
      - name: OPTION_ROLLING_BATCH
        value: "vllm"
      - name: SAGEMAKER_SUBMIT_DIRECTORY
        value: "/opt/ml/model/code"
      - name: MODEL_CACHE_ROOT
        value: "/opt/ml/model"
      - name: SAGEMAKER_MODEL_SERVER_WORKERS
        value: "1"
      - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
        value: "3600"
EOF

kubectl apply -f inference_endpoint_config.yaml

# Check inferenceendpointconfig status
kubectl get inferenceendpointconfig ${NAME} -n ${NAMESPACE}
NAME AGE
demo 8s

# Check pods status – you should see worker pods
kubectl get pods -n ${NAMESPACE}
NAME READY STATUS RESTARTS AGE
demo-675886c7bb-7bhhg 3/3 Running 0 30s

# Router pods are under hyperpod-inference-system namespace
kubectl get pods -n hyperpod-inference-system
NAME READY STATUS RESTARTS AGE
hyperpod-inference-operator-controller-manager-dff64b947-m5nqk 1/1 Running 0 5h49m
demo-default-router-8787cf46c-jmgqd 2/2 Running 0 2m16s

Observability
You can monitor Managed KV Cache and Intelligent Routing metrics through the SageMaker HyperPod Observability features. For more information, see Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod.
KV Cache Metrics are available in the Inference dashboard.

Benchmarking
We conducted comprehensive benchmarking to validate real-world performance improvements for production LLM deployments. Our benchmarks ran the Managed Tiered KV Cache and Intelligent Routing features with the Llama-3.1-70B-Instruct model deployed across 7 replicas on p5.48xlarge instances (each equipped with eight NVIDIA GPUs), under a steady-load traffic pattern. The benchmark environment used a dedicated client node group, with one c5.12xlarge instance per 100 concurrent requests to generate a controlled load, and a dedicated server node group, so that model servers operated in isolation and avoided resource contention under high concurrency.
Our benchmarks demonstrate that a combination of L1 and L2 Managed Tiered KV Cache and Intelligent Routing delivers substantial performance improvements across multiple dimensions. For medium context scenarios (8k tokens), we observed a 40% reduction in time to first token (TTFT) at P90, 72% reduction at P50, 24% increase in throughput, and 21% cost reduction compared to baseline configurations without optimization. The benefits are even more pronounced for long context workloads (64K tokens), achieving a 35% reduction in TTFT at P90, 94% reduction at P50, 38% throughput increase, and 28% cost savings. The optimization benefits scale dramatically with context length. While 8K token scenarios demonstrate solid improvements across the metrics, 64K token workloads experience transformative gains that fundamentally change the user experience. Our testing also confirmed that AWS-managed tiered storage consistently outperformed Redis-based L2 caching across the scenarios. The tiered storage backend delivered better latency and throughput without requiring the operational overhead of managing separate Redis infrastructure, making it the recommended choice for most deployments. Finally, unlike traditional performance optimizations that require tradeoffs between cost and speed, this solution delivers both simultaneously.
The accompanying charts compare TTFT (P90), TTFT (P50), throughput (TPS), and cost per 1,000 tokens ($) across the tested configurations.

Conclusion
Managed Tiered KV Cache and Intelligent Routing in Amazon SageMaker HyperPod Model Deployment help you optimize LLM inference performance and costs through efficient memory management and smart request routing. You can get started today by adding these configurations to your HyperPod model deployments in the AWS Regions where SageMaker HyperPod is available.
To learn more, visit the Amazon SageMaker HyperPod documentation or follow the model deployment getting started guide.

About the authors
Chaitanya Hazarey is the Software Development Manager for SageMaker HyperPod Inference at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.
Pradeep Cruz is a Senior SDM at Amazon Web Services (AWS), driving AI infrastructure and applications at enterprise scale. Leading cross-functional organizations at Amazon SageMaker AI, he has built and scaled multiple high-impact services for enterprise customers including SageMaker HyperPod-EKS Inference, Task Governance, Feature Store, AIOps, and JumpStart Model Hub at AWS, alongside enterprise AI platforms at T-Mobile and Ericsson. His technical depth spans distributed systems, GenAI/ML, Kubernetes, cloud computing, and full-stack software development.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers to design cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay gained over two decades of experience in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master’s degree in computer science and business management.
Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker with a focus on building performant, scalable inference systems for large language models. His technical interests span AI/ML, databases, and search technologies, where he specializes in developing production-ready solutions that enable efficient model deployment and inference at scale. His work involves optimizing system performance, implementing intelligent routing mechanisms, and designing architectures that support both research and production workloads, with a passion for solving complex distributed systems challenges and making advanced AI capabilities more accessible to developers and organizations. Outside of work, he enjoys traveling, hiking, and spending time with family.
Ziwen Ning is a Senior Software Development Engineer at AWS, currently working on SageMaker Hyperpod Inference with a focus on building scalable infrastructure for large-scale AI model inference. His technical expertise spans container technologies, Kubernetes orchestration, and ML infrastructure, developed through extensive work across the AWS ecosystem. He has deep experience in container registries and distribution, container runtime development and open source contributions, and containerizing ML workloads with custom resource management and monitoring. Ziwen is passionate about designing production-grade systems that make advanced AI capabilities more accessible. In his free time, he enjoys kickboxing, badminton, and immersing himself in music.
Roman Blagovirnyy is a Sr. User Experience Designer on the SageMaker AI team with 19 years of diverse experience in interactive, workflow, and UI design, working on enterprise and B2B applications and features for the finance, healthcare, security, and HR industries prior to joining Amazon. At AWS, Roman was a key contributor to the design of SageMaker AI Studio, SageMaker Studio Lab, data and model governance capabilities, and HyperPod. Roman currently works on new features and improvements to the administrator experience for HyperPod. In addition, Roman has a keen interest in design operations and process.
Caesar Chen is the Software Development Manager for SageMaker HyperPod at AWS, where he leads the development of cutting-edge machine learning infrastructure. With extensive experience in building production-grade ML systems, he drives technical innovation while fostering team excellence. His work in scalable model hosting infrastructure empowers data scientists and ML engineers to deploy and manage models with greater efficiency and reliability.
Chandra Lohit Reddy Tekulapally is a Software Development Engineer with the Amazon SageMaker HyperPod team. He is passionate about designing and building reliable, high-performance distributed systems that power large-scale AI workloads. Outside of work, he enjoys traveling and exploring new coffee spots.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker HyperPod as the best-in-class choice for generative AI model training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Salesforce AI Research Introduces xRouter: A Reinforcement Learning Ro …

When your application can call many different LLMs with very different prices and capabilities, who should decide which one answers each request? The Salesforce AI research team introduces xRouter, a tool-calling-based routing system that targets this gap: a reinforcement-learning-trained router that learns when to answer locally and when to call external models, while tracking cost at the token level.

What is xRouter?

xRouter is a tool calling based orchestration system built on Qwen2.5-7B-Instruct as the router backbone. The router is an instruction tuned model with tool calling capabilities that decides which downstream model to invoke, how to prompt it, and whether to synthesize or select an answer. The implementation uses DAPO, Distributional Advantage Policy Optimization, inside the Verl reinforcement learning framework, and exposes an OpenAI compatible API.

The router operates over more than 20 LLM tools in the full system. These tools span premium, standard, budget and specialized tiers, including GPT-5, GPT-4.1, GPT-5-Mini, GPT-5-Nano, o3, Kimi K2, DeepSeek-R1, Qwen3-235B variants and GPT-OSS models. The offloading pool is a 12 model subset that includes GPT-5, GPT-5-Mini, GPT-5-Nano, GPT-4o, GPT-4.1, o3, o3-Pro, o4-Mini, GPT-OSS-120B, GPT-OSS-20B and two Gemini-2.5 variants.

https://arxiv.org/pdf/2510.08439

Cost Aware Reward and Success Gating

Routing is framed as a reinforcement learning problem. For each episode, the reward combines a binary success signal and a cost penalty. The research team defines a reward that gives a fixed bonus when the final answer is correct, then subtracts a term proportional to the total normalized cost of all model calls. If the answer is wrong, the reward is zero regardless of how cheap it was.

As per the Model weights page, reward = quality − λ × normalized_cost, where λ is a cost penalty coefficient. Episodes with failures effectively have zero quality. This ‘success gated, cost shaped’ objective forces the router to first achieve correctness, then optimize cost among successful strategies. In practice, training uses 3 cost penalty settings, which produce the xRouter-7B-1, xRouter-7B-2 and xRouter-7B-3 variants.
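As a minimal illustration of this success-gated, cost-shaped objective, the sketch below computes the reward for a single episode; the bonus value, the cost normalization, and the λ setting are assumptions for illustration rather than the paper's exact constants.

def xrouter_reward(is_correct, total_cost_usd, lam=1.0, bonus=1.0, cost_scale=0.05):
    # Success-gated: wrong answers earn nothing, no matter how cheap they were.
    if not is_correct:
        return 0.0
    # Cost-shaped: correct answers earn a bonus minus a penalty on normalized spend.
    normalized_cost = total_cost_usd / cost_scale
    return bonus - lam * normalized_cost

print(xrouter_reward(True, 0.002))    # cheap correct answer -> high reward
print(xrouter_reward(True, 0.02))     # expensive correct answer -> lower reward
print(xrouter_reward(False, 0.0001))  # wrong answer -> 0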

https://arxiv.org/pdf/2510.08439

Training Data and Signal Design

xRouter training data comes from Reasoning360, which includes math, code and general reasoning tasks with difficulty estimates derived from a strong reference model, Qwen3-32B. The research team stratifies samples into easy, medium and hard bands, and adds simpler chit-chat, retrieval and factual questions to teach the router when it can answer directly without delegation. Each sample includes descriptions and prices for models from different tiers. The system also refreshes the model catalog and perturbs costs to avoid overfitting to a static price table.

Failed trajectories, such as wrong answers from expensive models or unnecessary calls when the router could have answered itself, still incur full cost and receive zero reward. This produces a clean learning signal, where correctness gates reward and cost shapes the routing policy.

How the Router Behaves at Inference Time?

The router supports three execution modes. It can answer directly from the backbone without calling tools. It can call one or more downstream models, then synthesize a response using its own reasoning over their outputs. It can also call downstream models and use a special select_response tool to pick one of the replies as the final answer. These modes are implemented through function calls in an OpenAI style interface, which the orchestration engine executes through LiteLLM and SGLang.
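The sketch below illustrates the shape of such a tool-calling interface. The tool names (call_model, select_response), their schemas, and the dispatch logic are hypothetical stand-ins for illustration; they are not xRouter's actual function signatures.

# Hypothetical tool schemas in the OpenAI function-calling style.
TOOLS = [
    {"type": "function", "function": {
        "name": "call_model",
        "description": "Invoke a downstream model with a prompt.",
        "parameters": {"type": "object",
                       "properties": {"model": {"type": "string"}, "prompt": {"type": "string"}},
                       "required": ["model", "prompt"]}}},
    {"type": "function", "function": {
        "name": "select_response",
        "description": "Pick one previously returned reply as the final answer.",
        "parameters": {"type": "object",
                       "properties": {"index": {"type": "integer"}},
                       "required": ["index"]}}},
]

def execute(tool_calls, replies):
    # Toy dispatcher covering the three modes: direct answer, synthesize, or select.
    if not tool_calls:
        return "direct answer from the router backbone"
    final = None
    for call in tool_calls:
        if call["name"] == "call_model":
            replies.append(f"reply from {call['arguments']['model']}")
        elif call["name"] == "select_response":
            final = replies[call["arguments"]["index"]]
    return final or f"answer synthesized from {len(replies)} replies"

print(execute([{"name": "call_model", "arguments": {"model": "gpt-5-mini", "prompt": "2+2?"}},
               {"name": "select_response", "arguments": {"index": 0}}], []))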

Empirically, trained xRouter instances use a mix of direct and synthesized responses. Off the shelf routers such as GPT-4o, GPT-4.1, GPT-5, Qwen2.5-7B and Qwen3-8B tend to respond directly most of the time, even when instructed to offload when uncertain. This is an important behavioral difference and explains part of the efficiency gain.

Quantitative Results and Cost Utility

On static routing baselines across Minerva, MATH-500, Olympiad Bench, AIME-24, AMC-23, Codeforces, Code-Contests and Human-EvalPlus, xRouter-7B variants consistently improve accuracy compared to using the same base model as an untrained router. xRouter-7B-2, for example, reaches near GPT-5 accuracy on Olympiad Bench while using about one eighth of the GPT-5 evaluation cost.

In the system level comparison on LiveCodeBench v5, GPQA Diamond, AIME25, MT-Bench, IFEval and LiveBench, xRouter-7B-3 achieves the highest average accuracy on LiveCodeBench v5 among all tested systems, and does this with moderate cost. Across tasks such as GPQA, xRouter variants reach around 80 to 90 percent of GPT-5 accuracy while consuming less than one fifth of the cost. The research team summarizes that their cost aware reward can reduce inference cost by up to 80 percent at similar completion rates. The Hugging Face model card reports up to 60 percent cost reduction for comparable quality under other settings.

The research team also defines ‘cost utility’ as accuracy divided by cost. Open source single models with very low API prices often reach higher cost utility, but with lower absolute accuracy. xRouter sits in the middle, trading some cost utility for stronger task performance, which is usually what production systems care about.

Key Takeaways

xRouter is a tool calling router built on Qwen2.5 7B Instruct that learns to select among 20 plus external LLMs with a reinforcement learning policy that is explicitly cost aware.

The router uses a success-gated reward: tasks only get positive reward when the final answer is correct, and within successful trajectories it applies a cost penalty of λ times the normalized cost, which yields three xRouter-7B variants with different cost-accuracy trade-offs.

Training on Reasoning360 with difficulty stratification and synthetic easy queries teaches xRouter when to answer directly and when to offload, while perturbing prices and model pools improves robustness to changing provider catalogs.

Across math, coding and reasoning benchmarks, xRouter 7B models achieve near GPT 5 accuracy on hard tasks like Olympiad Bench and around 80 to 90 percent of GPT 5 accuracy on GPQA, while cutting offloading cost by up to 60 to 80 percent depending on the evaluation setup.

Editorial Notes

xRouter is a practical step toward cost aware orchestration for heterogeneous LLM fleets. It shows that a mid size router, trained with DAPO on Reasoning360 using a success gated, cost shaped reward, can consistently approach GPT 5 accuracy while reducing offloading cost by up to 60 to 80 percent.

Check out the PAPER and Model Weight. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Salesforce AI Research Introduces xRouter: A Reinforcement Learning Router for Cost Aware LLM Orchestration appeared first on MarkTechPost.

Agent0: A Fully Autonomous AI Framework that Evolves High-Performing A …

Large language models need huge human datasets, so what happens if a model must create all its own curriculum and teach itself to use tools? A team of researchers from UNC-Chapel Hill, Salesforce Research and Stanford University introduces ‘Agent0’, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration.

Agent0 targets mathematical and general reasoning. It shows that careful task generation and tool integrated rollouts can push a base model beyond its original capabilities, across ten benchmarks.

https://arxiv.org/pdf/2511.16043

Two agents from one base model

Agent0 starts from a base policy π_base, for example Qwen3 4B Base or Qwen3 8B Base. It clones this policy into:

a Curriculum Agent πθ that generates tasks,

an Executor Agent πϕ that solves those tasks with a Python tool.

Training proceeds in iterations with two stages per iteration:

Curriculum evolution: The curriculum agent generates a batch of tasks. For each task, the executor samples multiple responses. A composite reward measures how uncertain the executor is, how often it uses the tool and how diverse the batch is. πθ is updated with Group Relative Policy Optimization (GRPO) using this reward.

Executor evolution: The trained curriculum agent is frozen. It generates a large pool of tasks. Agent0 filters this pool to keep only tasks near the executor’s capability frontier, then trains the executor on these tasks using an ambiguity aware RL objective called Ambiguity Dynamic Policy Optimization (ADPO).

This loop creates a feedback cycle. As the executor becomes stronger by using the code interpreter, the curriculum must generate more complex, tool reliant problems to keep its reward high.

https://arxiv.org/pdf/2511.16043

How the curriculum agent scores tasks?

The curriculum reward combines three signals:

Uncertainty reward: For each generated task x, the executor samples k responses and majority votes a pseudo answer. Self consistency p̂(x) is the fraction of responses that agree with this majority. The reward is maximal when p̂ is close to 0.5 and low when tasks are too easy or too hard. This encourages tasks that are challenging but still solvable for the current executor.

Tool use reward: The executor can trigger a sandboxed code interpreter using python tags and receives results tagged as output. Agent0 counts the number of tool calls in a trajectory and gives a scaled, capped reward, with a cap C set to 4 in experiments. This favors tasks that actually require tool calls rather than pure mental arithmetic.

Repetition penalty: Within each curriculum batch, Agent0 measures pairwise similarity between tasks using a BLEU based distance. Tasks are clustered, and a penalty term increases with cluster size. This discourages the curriculum from generating many near duplicates.

A composite reward multiplies a format check with a weighted sum of uncertainty and tool rewards minus the repetition penalty. This composite value feeds into GRPO to update πθ.
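A minimal sketch of this composite curriculum reward is shown below; the weights and the scaling of each term are assumptions made for illustration, with the tool-call cap C set to 4 as in the experiments.

def curriculum_reward(p_hat, tool_calls, cluster_size, format_ok,
                      w_unc=1.0, w_tool=0.5, w_rep=0.2, cap=4):
    # Uncertainty reward peaks when executor self-consistency is near 0.5.
    r_uncertainty = 1.0 - 2.0 * abs(p_hat - 0.5)
    # Tool-use reward: scaled count of python tool calls, capped at C.
    r_tool = min(tool_calls, cap) / cap
    # Repetition penalty grows with the size of the near-duplicate cluster.
    penalty = w_rep * max(cluster_size - 1, 0)
    # Format check (0 or 1) multiplies the weighted sum minus the penalty.
    return format_ok * (w_unc * r_uncertainty + w_tool * r_tool - penalty)

print(curriculum_reward(p_hat=0.5, tool_calls=3, cluster_size=1, format_ok=1))   # challenging, tool-heavy task
print(curriculum_reward(p_hat=0.95, tool_calls=0, cluster_size=4, format_ok=1))  # easy, near-duplicate task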

How the executor learns from noisy self labels?

The executor is also trained with GRPO but on multi turn, tool integrated trajectories and pseudo labels instead of ground truth answers.

Frontier dataset construction: After curriculum training in an iteration, the frozen curriculum generates a large candidate pool. For each task, Agent0 computes self consistency p̂(x) with the current executor and keeps only tasks where p̂ lies in an informative band, for example between 0.3 and 0.8. This defines a challenging frontier dataset that avoids trivial or impossible problems.

Multi turn tool integrated rollouts: For each frontier task, the executor generates a trajectory that can interleave:

natural language reasoning tokens,

python code segments,

output tool feedback.

Generation pauses when a tool call appears, executes the code in a sandboxed interpreter built on VeRL Tool, then resumes conditioned on the result. The trajectory terminates when the model produces a final answer inside \boxed{...} tags.

A majority vote across sampled trajectories defines a pseudo label and a terminal reward for each trajectory.

ADPO, ambiguity aware RL: Standard GRPO treats all samples equally, which is unstable when labels come from majority voting on ambiguous tasks. ADPO modifies GRPO in two ways using p̂ as an ambiguity signal (a minimal sketch follows the two modifications below).

It scales the normalized advantage with a factor that increases with self consistency, so trajectories from low confidence tasks contribute less.

It sets a dynamic upper clipping bound for the importance ratio, which depends on self consistency. Empirical analysis shows that fixed upper clipping mainly affects low probability tokens. ADPO relaxes this bound adaptively, which improves exploration on uncertain tasks, as visualized by the up clipped token probability statistics.
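Here is a minimal sketch of those two modifications, together with the frontier filtering band described earlier; the exact scaling function, clipping constants, and the 0.3 to 0.8 band endpoints are illustrative assumptions.

import numpy as np

def in_frontier(p_hat, low=0.3, high=0.8):
    # Keep only tasks whose self-consistency lies in an informative band.
    return low <= p_hat <= high

def adpo_update_terms(advantage, importance_ratio, p_hat,
                      eps_low=0.2, eps_high_base=0.2, eps_high_span=0.2):
    # Scale the advantage by self-consistency, so ambiguous tasks contribute less.
    scaled_adv = p_hat * advantage
    # Relax the upper clipping bound for more ambiguous (low p_hat) tasks.
    upper = 1.0 + eps_high_base + eps_high_span * (1.0 - p_hat)
    clipped_ratio = float(np.clip(importance_ratio, 1.0 - eps_low, upper))
    return scaled_adv, clipped_ratio

print([p for p in (0.9, 0.55, 0.1) if in_frontier(p)])          # only 0.55 survives filtering
print(adpo_update_terms(advantage=1.2, importance_ratio=1.5, p_hat=0.55))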

https://arxiv.org/pdf/2511.16043

Results on mathematical and general reasoning

Agent0 is implemented on top of VeRL and evaluated on Qwen3 4B Base and Qwen3 8B Base. It uses a sandboxed Python interpreter as the single external tool.

The research team evaluates on ten benchmarks:

Mathematical reasoning: AMC, Minerva, MATH, GSM8K, Olympiad Bench, AIME24, AIME25.

General reasoning: SuperGPQA, MMLU Pro, BBEH.

They report pass@1 for most datasets and mean@32 for AMC and AIME tasks.

For Qwen3 8B Base, Agent0 reaches:

math average 58.2 versus 49.2 for the base model,

overall general average 42.1 versus 34.5 for the base model.

Agent0 also improves over strong data free baselines such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, both with and without tools. On Qwen3 8B, it surpasses R Zero by 6.4 percentage points and Absolute Zero by 10.6 points on the overall average. It also beats Socratic Zero, which relies on external OpenAI APIs.

Across three co evolution iterations, average math performance on Qwen3 8B increases from 55.1 to 58.2 and general reasoning also improves per iteration. This confirms stable self improvement rather than collapse.

Qualitative examples show that curriculum tasks evolve from basic geometry questions to complex constraint satisfaction problems, while executor trajectories mix reasoning text with Python calls to reach correct answers.

Key Takeaways

Fully data free co evolution: Agent0 eliminates external datasets and human annotations. Two agents, a curriculum agent and an executor agent, are initialized from the same base LLM and co evolve only via reinforcement learning and a Python tool.

Frontier curriculum from self uncertainty: The curriculum agent uses the executor’s self consistency and tool usage to score tasks. It learns to generate frontier tasks that are neither trivial nor impossible, and that explicitly require tool integrated reasoning.

ADPO stabilizes RL with pseudo labels: The executor is trained with Ambiguity Dynamic Policy Optimization. ADPO down weights highly ambiguous tasks and adapts the clipping range based on self consistency, which makes GRPO style updates stable when rewards come from majority vote pseudo labels.

Consistent gains on math and general reasoning: On Qwen3 8B Base, Agent0 improves math benchmarks from 49.2 to 58.2 average and general reasoning from 34.5 to 42.1, which corresponds to relative gains of about 18 percent and 24 percent.

Outperforms prior zero data frameworks: Across ten benchmarks, Agent0 surpasses previous self evolving methods such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, including those that already use tools or external APIs. This shows that the co evolution plus tool integration design is a meaningful step beyond earlier single round self play approaches.

Editorial Notes

Agent0 is an important step toward practical, data free reinforcement learning for tool integrated reasoning. It shows that a base LLM can act as both Curriculum Agent and Executor Agent, and that GRPO with ADPO and VeRL Tool can drive stable improvement from majority vote pseudo labels. The method also demonstrates that tool integrated co evolution can outperform prior zero data frameworks such as R Zero and Absolute Zero on strong Qwen3 baselines. Agent0 makes a strong case that self evolving, tool integrated LLM agents are becoming a realistic training paradigm.

Check out the PAPER and Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Agent0: A Fully Autonomous AI Framework that Evolves High-Performing Agents without External Data through Multi-Step Co-Evolution appeared first on MarkTechPost.

How to Build a Neuro-Symbolic Hybrid Agent that Combines Logical Plann …

In this tutorial, we demonstrate how to combine the strengths of symbolic reasoning with neural learning to build a powerful hybrid agent. We focus on creating a neuro-symbolic architecture that uses classical planning for structure, rules, and goal-directed behavior, while neural networks handle perception and action refinement. As we walk through the code, we see how both layers interact in real time, allowing us to navigate an environment, overcome uncertainty, and adapt intelligently. At last, we understand how neuro-symbolic systems bring interpretability, robustness, and flexibility together in a single agentic framework. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserimport numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Set, Optional
from collections import deque
import warnings
warnings.filterwarnings('ignore')

@dataclass
class State:
    robot_pos: Tuple[int, int]
    holding: Optional[str] = None
    visited: Set[Tuple[int, int]] = field(default_factory=set)
    objects_collected: Set[str] = field(default_factory=set)

    def __hash__(self):
        return hash((self.robot_pos, self.holding))

class SymbolicPlanner:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.actions = ['up', 'down', 'left', 'right', 'pickup', 'drop']

    def get_successors(self, state: State, obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[Tuple[str, State]]:
        successors = []
        x, y = state.robot_pos
        moves = {'up': (x, y-1), 'down': (x, y+1), 'left': (x-1, y), 'right': (x+1, y)}
        for action, new_pos in moves.items():
            nx, ny = new_pos
            if (0 <= nx < self.grid_size and 0 <= ny < self.grid_size and new_pos not in obstacles):
                new_state = State(new_pos, state.holding, state.visited | {new_pos}, state.objects_collected.copy())
                successors.append((action, new_state))
        if state.holding is None:
            for obj_name, obj_pos in objects.items():
                if state.robot_pos == obj_pos and obj_name not in state.objects_collected:
                    new_state = State(state.robot_pos, obj_name, state.visited.copy(), state.objects_collected.copy())
                    successors.append(('pickup', new_state))
        if state.holding is not None:
            new_state = State(state.robot_pos, None, state.visited.copy(), state.objects_collected | {state.holding})
            successors.append(('drop', new_state))
        return successors

    def heuristic(self, state: State, goal: Tuple[int, int]) -> float:
        return abs(state.robot_pos[0] - goal[0]) + abs(state.robot_pos[1] - goal[1])

    def a_star_plan(self, start_state: State, goal: Tuple[int, int], obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[str]:
        counter = 0
        frontier = [(self.heuristic(start_state, goal), counter, 0, start_state, [])]
        visited = set()
        while frontier:
            frontier.sort()
            _, _, cost, state, plan = frontier.pop(0)
            counter += 1
            if state.robot_pos == goal and len(state.objects_collected) >= len(objects):
                return plan
            state_key = (state.robot_pos, state.holding)
            if state_key in visited:
                continue
            visited.add(state_key)
            for action, next_state in self.get_successors(state, obstacles, objects):
                new_cost = cost + 1
                new_plan = plan + [action]
                priority = new_cost + self.heuristic(next_state, goal)
                frontier.append((priority, counter, new_cost, next_state, new_plan))
                counter += 1
        return []

We lay the foundation for our symbolic reasoning system and define how states, actions, and transitions work. We implement classical planning logic using A* search to generate goal-directed, interpretable action sequences. As we build this part, we establish the rule-based backbone that guides the agent’s high-level decisions. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass NeuralPerception:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.W1 = np.random.randn(grid_size * grid_size, 64) * 0.1
        self.b1 = np.zeros(64)
        self.W2 = np.random.randn(64, 32) * 0.1
        self.b2 = np.zeros(32)
        self.W3 = np.random.randn(32, grid_size * grid_size) * 0.1
        self.b3 = np.zeros(grid_size * grid_size)

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def perceive(self, noisy_grid: np.ndarray) -> np.ndarray:
        x = noisy_grid.flatten()
        h1 = self.relu(x @ self.W1 + self.b1)
        h2 = self.relu(h1 @ self.W2 + self.b2)
        out = self.sigmoid(h2 @ self.W3 + self.b3)
        return out.reshape(self.grid_size, self.grid_size)

class NeuralPolicy:
    def __init__(self, state_dim: int = 4, action_dim: int = 4):
        self.W = np.random.randn(state_dim, action_dim) * 0.1
        self.b = np.zeros(action_dim)
        self.action_map = ['up', 'down', 'left', 'right']

    def softmax(self, x):
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()

    def get_action_probs(self, state_features: np.ndarray) -> np.ndarray:
        logits = state_features @ self.W + self.b
        return self.softmax(logits)

    def select_action(self, state_features: np.ndarray, symbolic_action: str) -> str:
        probs = self.get_action_probs(state_features)
        if symbolic_action in self.action_map:
            sym_idx = self.action_map.index(symbolic_action)
            probs[sym_idx] += 0.7
            probs = probs / probs.sum()
        return np.random.choice(self.action_map, p=probs)

We introduce the neural components that allow our agent to sense and adapt. We design a lightweight neural network to denoise the environment and a simple policy network to refine actions based on features. As we integrate these elements, we ensure that our agent can handle uncertainty and adjust behavior dynamically. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass NeuroSymbolicAgent:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.planner = SymbolicPlanner(grid_size)
        self.perception = NeuralPerception(grid_size)
        self.policy = NeuralPolicy()
        self.obstacles = {(3, 3), (3, 4), (4, 3), (5, 5), (6, 2)}
        self.objects = {'key': (2, 6), 'gem': (6, 6)}
        self.goal = (7, 7)

    def create_noisy_observation(self, true_grid: np.ndarray) -> np.ndarray:
        noise = np.random.randn(*true_grid.shape) * 0.2
        return np.clip(true_grid + noise, 0, 1)

    def extract_state_features(self, pos: Tuple[int, int], goal: Tuple[int, int]) -> np.ndarray:
        return np.array([pos[0]/self.grid_size, pos[1]/self.grid_size, goal[0]/self.grid_size, goal[1]/self.grid_size])

    def execute_mission(self, verbose: bool = True) -> Tuple[List, List]:
        start_state = State(robot_pos=(0, 0), visited={(0, 0)})
        symbolic_plan = self.planner.a_star_plan(start_state, self.goal, self.obstacles, self.objects)
        if verbose:
            print(f" Symbolic Plan Generated: {len(symbolic_plan)} steps")
            print(f" Plan: {symbolic_plan[:10]}{'...' if len(symbolic_plan) > 10 else ''}\n")
        true_grid = np.zeros((self.grid_size, self.grid_size))
        for obs in self.obstacles:
            true_grid[obs[1], obs[0]] = 1.0
        noisy_obs = self.create_noisy_observation(true_grid)
        perceived_grid = self.perception.perceive(noisy_obs)
        if verbose:
            print(f" Neural Perception: Denoised obstacle map")
            print(f" Perception accuracy: {np.mean((perceived_grid > 0.5) == true_grid):.2%}\n")
        trajectory = [(0, 0)]
        current_pos = (0, 0)
        actions_taken = []
        for i, sym_action in enumerate(symbolic_plan[:30]):
            features = self.extract_state_features(current_pos, self.goal)
            refined_action = self.policy.select_action(features, sym_action) if sym_action in ['up', 'down', 'left', 'right'] else sym_action
            actions_taken.append(refined_action)
            if refined_action == 'up':
                current_pos = (current_pos[0], max(0, current_pos[1]-1))
            elif refined_action == 'down':
                current_pos = (current_pos[0], min(self.grid_size-1, current_pos[1]+1))
            elif refined_action == 'left':
                current_pos = (max(0, current_pos[0]-1), current_pos[1])
            elif refined_action == 'right':
                current_pos = (min(self.grid_size-1, current_pos[0]+1), current_pos[1])
            if current_pos not in self.obstacles:
                trajectory.append(current_pos)
        return trajectory, actions_taken

We bring the symbolic and neural layers together into a unified agent. We generate a symbolic plan, perceive the environment through neural processing, and refine each planned action using the neural policy. As we execute the mission loop, we observe how both systems interact seamlessly to produce robust behavior. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef visualize_execution(agent: NeuroSymbolicAgent, trajectory: List, title: str = "Neuro-Symbolic Agent Execution"):
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    ax = axes[0]
    grid = np.zeros((agent.grid_size, agent.grid_size, 3))
    for obs in agent.obstacles:
        grid[obs[1], obs[0]] = [0.3, 0.3, 0.3]
    for obj_pos in agent.objects.values():
        grid[obj_pos[1], obj_pos[0]] = [1.0, 0.8, 0.0]
    grid[agent.goal[1], agent.goal[0]] = [0.0, 1.0, 0.0]
    for i, pos in enumerate(trajectory):
        intensity = 0.3 + 0.7 * (i / len(trajectory))
        grid[pos[1], pos[0]] = [intensity, 0.0, 1.0]
    if trajectory:
        grid[trajectory[0][1], trajectory[0][0]] = [1.0, 0.0, 0.0]
    ax.imshow(grid)
    ax.set_title("Agent Trajectory in Environment", fontsize=14, fontweight='bold')
    ax.set_xlabel("X Position")
    ax.set_ylabel("Y Position")
    ax.grid(True, alpha=0.3)
    ax = axes[1]
    ax.axis('off')
    ax.text(0.5, 0.95, "Neuro-Symbolic Architecture", ha='center', fontsize=16, fontweight='bold', transform=ax.transAxes)
    layers = [("SYMBOLIC LAYER", 0.75, "Planning • State Logic • Rules"), ("INTEGRATION", 0.60, "Feature Extraction • Action Blending"), ("NEURAL LAYER", 0.45, "Perception • Policy Learning"), ("EXECUTION", 0.30, "Action Refinement • Feedback"), ("ENVIRONMENT", 0.15, "State Transitions • Observations")]
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']
    for i, (name, y, desc) in enumerate(layers):
        ax.add_patch(plt.Rectangle((0.1, y-0.05), 0.8, 0.08, facecolor=colors[i], alpha=0.7, transform=ax.transAxes))
        ax.text(0.5, y, f"{name}\n{desc}", ha='center', va='center', fontsize=10, fontweight='bold', transform=ax.transAxes)
    plt.tight_layout()
    plt.savefig('neurosymbolic_agent.png', dpi=150, bbox_inches='tight')
    plt.show()
    print(f"\n Execution complete! Trajectory length: {len(trajectory)} steps")

We visualize how the agent moves through the environment and how the architecture is structured. We plot obstacles, objects, the goal, and the full trajectory so that we can clearly see the agent’s decision process. As we render the architecture layers, we understand how the hybrid design flows from planning to perception to action. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserif __name__ == "__main__":
    print("=" * 70)
    print("NEURO-SYMBOLIC HYBRID AGENT TUTORIAL")
    print("Combining Classical AI Planning with Modern Neural Networks")
    print("=" * 70)
    print()
    agent = NeuroSymbolicAgent(grid_size=8)
    trajectory, actions = agent.execute_mission(verbose=True)
    visualize_execution(agent, trajectory)
    print("\n" + "=" * 70)
    print("KEY INSIGHTS:")
    print("=" * 70)
    print("✦ Symbolic Layer: Provides interpretable, verifiable plans")
    print("✦ Neural Layer: Handles noisy perception & adapts to uncertainty")
    print("✦ Integration: Combines strengths of both paradigms")
    print("✦ Benefits: Explainability + Flexibility + Robustness")
    print("=" * 70)

We run the complete neuro-symbolic pipeline from planning to execution to visualization. We instantiate the agent, execute the mission, and display key insights to summarize the system’s behavior. As we run this final block, we see the overall hybrid architecture in action and appreciate how each component contributes to the outcome.

In conclusion, we observe how smoothly the symbolic and neural components work together to produce a more capable and reliable agent. We appreciate how the symbolic planner gives us transparent, verifiable steps, while the neural layer adds adaptability and perceptual grounding that pure logic cannot offer. Through this hybrid approach, we can build agents that reason, perceive, and act in ways that are both intelligent and interpretable. We end with a deeper understanding of how neuro-symbolic AI moves us closer to practical, resilient agentic systems.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Neuro-Symbolic Hybrid Agent that Combines Logical Planning with Neural Perception for Robust Autonomous Decision-Making appeared first on MarkTechPost.

Amazon SageMaker AI introduces EAGLE based adaptive speculative decodi …

Generative AI models continue to expand in scale and capability, increasing the demand for faster and more efficient inference. Applications need low latency and consistent performance without compromising output quality. Amazon SageMaker AI introduces new enhancements to its inference optimization toolkit that bring EAGLE based adaptive speculative decoding to more model architectures. These updates make it easier to accelerate decoding, optimize performance using your own data and deploy higher-throughput models using the familiar SageMaker AI workflow.
EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency, is a technique that speeds up large language model decoding by predicting future tokens directly from the hidden layers of the model. When you guide optimization using your own application data, the improvements align with the actual patterns and domains you serve, producing faster inference that reflects your real workloads rather than generic benchmarks. Based on the model architecture, SageMaker AI trains EAGLE 3 or EAGLE 2 heads.
Note that this training and optimization is not limited to a one-time operation. You can start by using the datasets provided by SageMaker for the initial training, but as you continue to gather and collect your own data, you can also fine-tune using your own curated dataset for highly adaptive, workload-specific performance. An example would be using a tool such as Data Capture to curate your own dataset over time from the real-time requests hitting your hosted model. This can be an iterative process, with multiple cycles of training to continuously improve performance.
In this post we’ll explain how to use EAGLE 2 and EAGLE 3 speculative decoding in Amazon SageMaker AI.
Solution overview
SageMaker AI now offers native support for both EAGLE 2 and EAGLE 3 speculative decoding, enabling each model architecture to apply the technique that best matches its internal design. For your base LLM, you can utilize either SageMaker JumpStart models or bring your own model artifacts to S3 from other model hubs, such as HuggingFace.
Speculative decoding is a widely employed technique for accelerating inference in LLMs without compromising quality. This method involves using a smaller draft model to generate preliminary tokens, which are then verified by the target LLM. The extent of the speedup achieved through speculative decoding is heavily dependent on the selection of the draft model.

The sequential nature of modern LLMs makes them expensive and slow, and speculative decoding has proven to be an effective solution to this problem. Methods like EAGLE improve upon this by reusing features from the target model, leading to better results. However, a current trend in the LLM community is to increase training data to boost model intelligence without adding inference costs. Unfortunately, this approach has limited benefits for EAGLE. This limitation is due to EAGLE’s constraints on feature prediction. To address this, EAGLE-3 is introduced, which predicts tokens directly instead of features and combines features from multiple layers using a technique called training-time testing. These changes significantly improve performance and allow the model to fully benefit from increased training data.

To give customers maximum flexibility, SageMaker supports every major workflow for building or refining an EAGLE model. You can train an EAGLE model entirely from scratch using the SageMaker curated open dataset, or train it from scratch with your own data to align speculative behavior with your traffic patterns. You can also start from an existing EAGLE base model: either retraining it with the default open dataset for a fast, high-quality baseline, or fine-tuning that base model with your own dataset for highly adaptive, workload-specific performance. In addition, SageMaker JumpStart provides fully pre-trained EAGLE models so you can begin optimizing immediately without preparing any artifacts.
The solution spans six supported architectures and includes a pre-trained, pre-cached EAGLE base to accelerate experimentation. SageMaker AI also supports widely used training data formats, specifically ShareGPT and OpenAI chat and completions, so existing corpora can be used directly. Customers can also provide data captured from their own SageMaker AI endpoints, provided the data is in one of the formats specified above. Whether you rely on the SageMaker open dataset or bring your own, optimization jobs typically deliver around a 2.5x throughput improvement over standard decoding while adapting naturally to the nuances of your specific use case.
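For reference, here is a minimal sketch of what records in the two supported formats can look like when written to JSONL; the field contents are made-up examples, and in practice you would use a single format per dataset.

import json

sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Summarize the lease terms for property 42."},
        {"from": "gpt", "value": "The lease runs 36 months at $12,000 per month..."},
    ]
}

openai_chat_record = {
    "messages": [
        {"role": "user", "content": "Summarize the lease terms for property 42."},
        {"role": "assistant", "content": "The lease runs 36 months at $12,000 per month..."},
    ]
}

# Write each record as one JSON object per line (JSONL).
with open("eagle_training_data.jsonl", "w") as f:
    for record in (sharegpt_record, openai_chat_record):
        f.write(json.dumps(record) + "\n")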
All optimization jobs automatically produce benchmark results, giving you clear visibility into latency and throughput improvements. You can run the entire workflow using SageMaker Studio or the AWS CLI, and you deploy the optimized model through the same interface you already use for standard SageMaker AI inference.
SageMaker AI currently supports LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM and GptOssForCausalLM with EAGLE 3, and Qwen3NextForCausalLM with EAGLE 2. You can use one optimization pipeline across a mix of architectures while still gaining the benefits of model-specific behavior.
How EAGLE works inside the model
Speculative decoding can be thought of like a seasoned chief scientist guiding the flow of discovery. In traditional setups, a smaller “assistant” model runs ahead, quickly sketching out several possible token continuations, while the larger model examines and corrects those suggestions. This pairing reduces the number of slow, sequential steps by verifying multiple drafts at once.
EAGLE streamlines this process even further. Instead of depending on an external assistant, the model effectively becomes its own lab partner: it inspects its internal hidden-layer representations to anticipate several future tokens in parallel. Because these predictions arise from the model’s own learned structure, they tend to be more accurate upfront, leading to deeper speculative steps, fewer rejections, and smoother throughput.
By removing the overhead of coordinating a secondary model and enabling highly parallel verification, this approach alleviates memory bandwidth bottlenecks and delivers notable speedups, often around 2.5x, while maintaining the same output quality the baseline model would produce.
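To make the draft-and-verify idea concrete, here is a toy, framework-agnostic sketch of one speculative decoding step. The draft_next_tokens and target_agrees callables are purely illustrative stand-ins, not SageMaker or EAGLE APIs.

# Toy sketch of one speculative decoding step (illustrative only; the
# draft/verify callables are hypothetical stand-ins, not real APIs).
def speculative_step(tokens, draft_next_tokens, target_agrees, k=4):
    # 1. A cheap draft stage proposes k candidate continuations.
    candidates = draft_next_tokens(tokens, k)
    # 2. The target model checks the candidates (in practice this happens in
    #    one parallel forward pass rather than a Python loop).
    accepted = []
    for tok in candidates:
        if target_agrees(tokens + accepted, tok):
            accepted.append(tok)  # draft matches what the target would emit
        else:
            break  # first mismatch ends the run; the target's own token is used instead
    # 3. Every accepted draft token saves one slow sequential decode step.
    return tokens + accepted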
Running optimization jobs from the SDK or CLI
You can interface with the optimization toolkit using the AWS Boto3 SDK for Python, the SageMaker Studio UI, or the AWS CLI. In this section we use the AWS CLI; the same API calls map directly to the Boto3 SDK. The core API calls for endpoint creation remain the same: create_model, create_endpoint_config, and create_endpoint. The workflow begins with model registration using the create_model API call, where you specify your serving container and stack. Alternatively, you can skip creating a SageMaker Model object and specify the model data directly in the optimization job API call.
For the EAGLE heads optimization, we specify the model data by pointing to the ModelDataSource parameter; specifying a Hugging Face Hub model ID is not currently supported. Pull your artifacts, upload them to an S3 bucket, and reference that location in the ModelDataSource parameter. By default, checks verify that the appropriate files are uploaded, so you need the standard model data expected for LLMs:

# traditional model data needed
model/
config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
generation_config.json
vocab.json
model.safetensors
model.safetensors.index.json

Let’s look at a few paths here:

Using your own model data with your own EAGLE curated dataset
Bringing your own trained EAGLE that you may want to train more
Bring your own model data and use SageMaker AI built-in datasets

1. Using your own model data with your own EAGLE curated dataset
We can start an optimization job with the create-optimization-job API call. Here is an example with a Qwen3 32B model. Note that you can bring your own data or use the built-in SageMaker-provided datasets. First, we create a SageMaker Model object that points to the S3 bucket with our model artifacts:

aws sagemaker --region us-west-2 create-model \
--model-name <target-model-name> \
--primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
"ModelDataSource": { "S3DataSource": { "S3Uri": "Enter model path",
"S3DataType": "S3Prefix", "CompressionType": "None" } } }' --execution-role-arn "Enter Execution Role ARN"

Our optimization call then pulls down these model artifacts when you specify the SageMaker Model and a TrainingDataSource parameter, as follows:

aws sagemaker --region us-west-2 create-optimization-job \
--optimization-job-name <job-name> \
--account-id <account-id> \
--deployment-instance-type ml.p5.48xlarge \
--max-instance-count 10 \
--model-source '{
"SageMakerModel": { "ModelName": "Created Model name" }
}' \
--optimization-configs '{
"ModelSpeculativeDecodingConfig": {
"Technique": "EAGLE",
"TrainingDataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "Enter custom train data location"
}
}
}' \
--output-config '{
"S3OutputLocation": "Enter optimization output location"
}' \
--stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
--role-arn "Enter Execution Role ARN"
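
If you work from Python instead of the CLI, the same request maps onto the Boto3 SageMaker client. The sketch below mirrors a subset of the CLI example above; names and S3 paths are placeholders, and note that the Boto3 API shape expresses OptimizationConfigs as a list.

# Boto3 sketch mirroring the CLI example above (placeholder names and paths).
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_optimization_job(
    OptimizationJobName="<job-name>",
    DeploymentInstanceType="ml.p5.48xlarge",
    ModelSource={"SageMakerModel": {"ModelName": "<created-model-name>"}},
    OptimizationConfigs=[{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/<custom-train-data>/",
            },
        }
    }],
    OutputConfig={"S3OutputLocation": "s3://<bucket>/<optimization-output>/"},
    StoppingCondition={"MaxRuntimeInSeconds": 432000},
    RoleArn="<execution-role-arn>",
)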

2. Bringing your own trained EAGLE that you may want to train more
For your own trained EAGLE, you can specify an additional parameter in the create_model API call that points to your EAGLE artifacts. Optionally, you can also specify a SageMaker JumpStart model ID to pull down the packaged model artifacts.

# Enable additional model data source with EAGLE artifacts
aws sagemaker --region us-west-2 create-model \
--model-name <target-model-name> \
--primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
"ModelDataSource": { "S3DataSource": { "S3Uri": "<model path>",
"S3DataType": "S3Prefix", "CompressionType": "None" } },
"AdditionalModelDataSources": [ { "ChannelName": "eagle_model",
"S3DataSource": { "S3Uri": "<pre-trained EAGLE path>",
"S3DataType": "S3Prefix", "CompressionType": "None" } } ] }' --execution-role-arn "Enter Execution Role ARN"

Similarly, the optimization API call then inherits this model object with the necessary model data:

aws sagemaker --region us-west-2 create-optimization-job \
--account-id <account-id> \
--optimization-job-name <job-name> \
--deployment-instance-type ml.p5.48xlarge \
--max-instance-count 10 \
--model-source '{
"SageMakerModel": {
"ModelName": "Created Model Name"
}
}' \
--optimization-configs '{
"ModelSpeculativeDecodingConfig": {
"Technique": "EAGLE",
"TrainingDataSource": {
"S3Uri": "Enter training data path",
"S3DataType": "S3Prefix"
}
}
}' \
--output-config '{
"SageMakerModel": {
"ModelName": "Model Name"
},
"S3OutputLocation": "Enter output data location"
}' \
--stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
--role-arn "Enter Execution Role ARN"

3. Bring your own model data and use SageMaker built-in datasets
Optionally, we can utilize the SageMaker provided datasets:

# SageMaker Provided Optimization Datasets
gsm8k_training.jsonl (https://huggingface.co/datasets/openai/gsm8k)
magicoder.jsonl (https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
opencodeinstruct.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
swebench_oracle_train.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
ultrachat_0_8k_515292.jsonl (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

After completion, SageMaker AI stores evaluation metrics in S3 and records the optimization lineage in Studio. You can deploy the optimized model to an inference endpoint with the create_endpoint API call or through the UI.
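For reference, a minimal Boto3 sketch of that endpoint flow follows. It assumes a SageMaker Model object already exists for the optimized artifacts; all names are placeholders.

# Sketch: deploy the optimized model with the standard endpoint APIs
# (placeholder names; assumes a SageMaker Model object already exists
# for the optimized artifacts).
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_endpoint_config(
    EndpointConfigName="<endpoint-config-name>",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "<optimized-model-name>",
        "InstanceType": "ml.p5.48xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="<endpoint-name>",
    EndpointConfigName="<endpoint-config-name>",
)
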
Benchmarks
To benchmark this further, we compared three configurations:

No EAGLE: Base model without EAGLE as a baseline
Base EAGLE: EAGLE training using built-in datasets provided by SageMaker AI
Trained EAGLE: EAGLE training using built-in datasets provided by SageMaker AI and retraining with own custom dataset

The numbers displayed below are for qwen3-32B across metrics such as Time to First Token (TTFT) and overall throughput.

| Configuration | Concurrency | TTFT (ms) | TPOT (ms) | ITL (ms) | Request Throughput | Output Throughput (tokens/sec) | OTPS per request (tokens/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No EAGLE | 4 | 168.04 | 45.95 | 45.95 | 0.04 | 86.76 | 21.76 |
| No EAGLE | 8 | 219.53 | 51.02 | 51.01 | 0.08 | 156.46 | 19.6 |
| Base EAGLE | 1 | 89.76 | 21.71 | 53.01 | 0.02 | 45.87 | 46.07 |
| Base EAGLE | 2 | 132.15 | 20.78 | 50.75 | 0.05 | 95.73 | 48.13 |
| Base EAGLE | 4 | 133.06 | 20.11 | 49.06 | 0.1 | 196.67 | 49.73 |
| Base EAGLE | 8 | 154.44 | 20.58 | 50.15 | 0.19 | 381.86 | 48.59 |
| Trained EAGLE | 1 | 83.6 | 17.32 | 46.37 | 0.03 | 57.63 | 57.73 |
| Trained EAGLE | 2 | 129.07 | 18 | 48.38 | 0.05 | 110.86 | 55.55 |
| Trained EAGLE | 4 | 133.11 | 18.46 | 49.43 | 0.1 | 214.27 | 54.16 |
| Trained EAGLE | 8 | 151.19 | 19.15 | 51.5 | 0.2 | 412.25 | 52.22 |

Pricing considerations
Optimization jobs run on SageMaker AI training instances, so you are billed based on the instance type and job duration. Deploying the resulting optimized model uses standard SageMaker AI inference pricing.
Conclusion
EAGLE-based adaptive speculative decoding gives you a faster and more effective path to improving generative AI inference performance on Amazon SageMaker AI. By working inside the model rather than relying on a separate draft network, EAGLE accelerates decoding, increases throughput, and maintains generation quality. When you optimize using your own dataset, the improvements reflect the unique behavior of your applications, resulting in better end-to-end performance. With built-in dataset support, benchmark automation, and streamlined deployment, the inference optimization toolkit helps you deliver low-latency generative applications at scale.

About the authors
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads at Expedia, and was a management consultant at McKinsey.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers in designing cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay has over two decades of experience in finance—including roles at banks and hedge funds—he has built risk models, trading systems, and market data platforms. Vinay holds a master’s degree in computer science and business management.
Siddharth Shah is a Principal Engineer at AWS SageMaker, specializing in large-scale model hosting and optimization for Large Language Models. He previously worked on the launch of Amazon Textract, performance improvements in the model-hosting platform, and expedited retrieval systems for Amazon S3 Glacier. Outside of work, he enjoys hiking, video games, and hobby robotics.
Andy Peng is a builder with curiosity, motivated by scientific research and product innovation. He helped build key initiatives that span AWS SageMaker and Bedrock, Amazon S3, AWS App Runner, AWS Fargate, Alexa Health & Wellness, and AWS Payments, from 0-1 incubation to 10x scaling. Open-source enthusiast.
Johna Liu is a Software Development Engineer on the Amazon SageMaker team, where she builds and explores AI/LLM-powered tools that enhance efficiency and enable new capabilities. Outside of work, she enjoys tennis, basketball and baseball.
Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring new Seattle restaurants, traveling, and spending time with family and friends.

Train custom computer vision defect detection model using Amazon SageM …

On October 10, 2024, Amazon announced the discontinuation of the Amazon Lookout for Vision service, with a scheduled shut down date of October 31, 2025 (see Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision blog post). As part of our transition guidance for customers, we recommend the use of Amazon SageMaker AI tools to build applications for customers who are interested in AI/ML computer vision models for automated quality inspection use cases. To support that effort, AWS has made a pre-trained computer vision defect detection model available on AWS Marketplace that can be fine-tuned using Amazon SageMaker AI for a customer’s specific use case. If run in the cloud, this model only requires paying for infrastructure costs for training or inference. This approach provides the tools to accelerate solution development while facilitating complete flexibility to build a solution that integrates with any existing hardware and software infrastructure.
In this blog post, you will learn how to migrate your computer vision workloads from Amazon Lookout for Vision to Amazon SageMaker AI by following our step-by-step guidance.
AWS is sharing the main underlying models used for the service to end users in the AWS Marketplace. You can use the two main types of models, binary classification and semantic segmentation, when you train in your own AWS accounts for deployment on AWS or at the edge.
This model helps customers continue to use AWS defect detection technology at their own pace with greater flexibility. For example, you can train your models with larger instance types for faster training times. With the ability to set hyperparameters, you can also adjust model behavior in ways that were not previously available on the AWS console. For example, you can set the multi-head model for semantic segmentation to disable the binary classifier head. This can make the model more tolerant of changing background and lighting conditions. You can also personalize the maximum training time, which was set to a non-changeable 24-hour limit on Amazon Lookout for Vision (L4V).
The GitHub repository for Amazon Lookout for Vision has been updated with a Jupyter Notebook to help you train datasets with these two model types and package them up. From there you can deploy the models by using a SageMaker endpoint, or edge devices.
To label the images beyond the sample data, you can use Amazon SageMaker Ground Truth to enable crowdsourcing or allow private teams to label the data, or use a partner solution such as Edge Impulse, Roboflow, or SuperbAI to do so. When you have the manifest file of the labeled data, the marketplace models can be used for training. Because you will no longer have a thumbnail-based dataset management tool like the Amazon Lookout for Vision console, consider one of the previously mentioned partner solutions to help manage datasets. You can also export your existing data from the Lookout for Vision service using this guide.
Prerequisites
Before you begin, make sure you have the following components and permissions in place:

Amazon SageMaker Studio or Amazon SageMaker Unified Studio for integrated development environment (IDE)
AWS Identity and Access Management (IAM) role with these permissions to follow the principle of least privilege

Amazon S3

s3:GetObject
s3:PutObject
s3:DeleteObject
s3:ListBucket

SageMaker

sagemaker:CreateTrainingJob
sagemaker:CreateModel
sagemaker:CreateEndpoint
sagemaker:CreateEndpointConfig
sagemaker:CreateTransformJob
sagemaker:DescribeTrainingJob
sagemaker:DescribeModel
sagemaker:DescribeEndpoint
sagemaker:DescribeEndpointConfig
sagemaker:DescribeTransformJob
sagemaker:InvokeEndpoint
sagemaker:DeleteEndpoint
sagemaker:DeleteEndpointConfig
sagemaker:DeleteModel

Model subscription:

An AWS account with a subscription to Computer Vision Defect Detection Model or
An IAM role with these three permissions to make AWS Marketplace subscriptions in the AWS account you use:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Labeled data (you can use the cookie data sample in Github) or label your own data with SageMaker Ground Truth or an AWS Partner tool
Basic knowledge of creating a SageMaker notebook instance and running Jupyter notebook

Architecture overview
The following diagram illustrates the end-to-end flow, from image acquisition to inference at the edge. This blog focuses on steps 2 and 3.

Use an edge application to configure cameras or sensors and capture training images.
Use SageMaker Ground Truth or AWS Partner platforms to export and label images.
Use Amazon SageMaker AI for model training.
Use REST, PLC, or digital input for image acquisition and processing.
Run real-time inference using the trained and deployed model.
Publish inference results to analytics and monitoring for alerts and analytics.
Perform automated action on the machine of concern or notify plant personnel of anomalies from inspection station component using OPC-UA or digital output.
Line operators and plant managers receive notifications for action.

Set up the labeling process
This section covers the steps to set up the labeling process using Amazon SageMaker Ground Truth, including creating a private labeling team and configuring the labeling job.

Configure Amazon SageMaker Ground Truth private team:

Select Amazon SageMaker AI, Ground Truth, Labeling workforces.
Select Private, then Create Private Team.
Enter a team name.
Leave other values as their defaults.
Select Create a new Amazon Cognito user group.
Select Create private Team.

On the Workers tab, select Invite New Workers.
Enter your team members’ email addresses to send sign-up invitations.

Label the dataset
After successfully completing the workforce setup for labeling, the next step is to label the dataset. This section explains how to prepare the dataset by uploading the images to an Amazon Simple Storage Service (Amazon S3) bucket, then create and run the SageMaker Ground Truth labeling job to label the images as normal or anomaly.

Upload the image datasets to an Amazon S3 bucket that SageMaker Ground Truth can access. If you don’t have a dataset, you can use either the cookie-dataset or aliens-dataset.

Copy all of the images from the “normal” and “anomaly” folders into a single directory for SageMaker Ground Truth (SMGT) to access, or you will get an error message in the next step.
To use AWS CloudShell, run the following script:

#!/bin/bash
# Clone the repository
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
cd amazon-lookout-for-vision/aliens-dataset
# Remove the existing all directory if it exists
rm -rf all
# Create a new all directory
mkdir -p all
# Copy normal images to the all directory
cp normal/*.png all/
# Copy anomaly images with an .anomaly.png suffix
for file in anomaly/*.png; do
  if [ -f "$file" ]; then
    filename=$(basename "$file")
    cp "$file" "all/${filename}.anomaly.png"
  fi
done
# Count files to verify
echo "Normal images: $(find normal -name "*.png" | wc -l)"
echo "Anomaly images: $(find anomaly -name "*.png" | wc -l)"
echo "Total images in all directory: $(find all -type f | wc -l)"
# Upload to S3
aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
# Clean up: remove the cloned repository
cd ../..
rm -rf amazon-lookout-for-vision

Alternatively, if you have the AWS CLI installed, you can copy them with the following commands (See setting up AWS CLI for how to do this):

sh-4.2$ git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
sh-4.2$ cd amazon-lookout-for-vision/aliens-dataset ## keep in mind the filenames here clash; the following commands help fix this
sh-4.2$ mkdir all
sh-4.2$ cp normal/*.png all/
sh-4.2$ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-19308/copy_conflicts.sh .

sh-4.2$ bash copy_conflicts.sh

sh-4.2$ ls -al all/

-rwxrwxr-x 1 ec2-user ec2-user 120035 Feb 17 16:39 59.png
-rwxrwxr-x 1 ec2-user ec2-user 93407 Feb 17 16:39 5.png
-rwxrwxr-x 1 ec2-user ec2-user 125477 Feb 17 16:39 5.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 123679 Feb 17 16:39 60.png
-rwxrwxr-x 1 ec2-user ec2-user 96330 Feb 17 16:39 6.png
-rwxrwxr-x 1 ec2-user ec2-user 126014 Feb 17 16:39 6.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 81051 Feb 17 16:39 7.png
-rwxrwxr-x 1 ec2-user ec2-user 128985 Feb 17 16:39 7.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 94216 Feb 17 16:39 8.png
-rwxrwxr-x 1 ec2-user ec2-user 128002 Feb 17 16:39 8.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 110814 Feb 17 16:39 9.png
-rwxrwxr-x 1 ec2-user ec2-user 131385 Feb 17 16:39 9.png.anomaly.png

sh-4.2$ aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
Note: To prevent filename clashes between the two folders, an .anomaly.png suffix was added to the anomaly images. The uploaded files should be in your <BUCKET_NAME>/aliens-dataset-all bucket for the Ground Truth job.

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling Jobs, Create labeling job.

There are several options here to fill in; the most important fields to fill or select are:

Input data setup: Select Automated data setup
S3 location for input datasets: <Full path where your dataset exists>
S3 location for output datasets: <Same location as the input dataset>
Data type: Select Image
IAM Role: Select Create new role if you do not have one set up that allows Ground Truth to interact with SageMaker services.

Choose Complete data setup. An Input data connection successful message displays. If you get an error, check your IAM role to make sure S3 access is enabled and that the directory has image files in it; the setup will not recurse through sub-directories.

Select the task type. These models support Image Classification (Single Label), which is binary classification (think good or bad), or Semantic segmentation. You cannot use a bounding box type with these models. You can change your selection later.
Choose Next.
For Worker types, select Private. You can read more about Amazon Mechanical Turk or labeling subscriptions in the Developer Guide.
Under Private teams, select the private team you created in the previous steps.
For Task timeout and Task expiration time, leave the default values.
Leave Enable automated data labeling unselected. You can read more about automated data labeling here; however, it is not compatible with semantic segmentation.
On the Image classification screen, add two new labels: normal and anomaly. You can fill in the rest as needed. Choose Preview to see a preview of what it will look like to the end user.
Choose Create.
Select Ground Truth, and then select the Private tab.

Open the labeling portal sign-in URL in a new tab in your browser and then sign in to see your assigned tasks.
Select an assigned task and choose Start working to label the data.
Select normal or anomaly.

When the job is complete, make note of the output dataset location. You will need this for the training step.

If you need to add workers to the labeling job:

On the Amazon SageMaker AI Ground Truth page, select Labeling workforces.
Select the Private tab.
Click on the private team that was created earlier (CV-team).
Select the Workers tab
Select the desired worker from the list and choose Add workers to team.

You will then be redirected to the Amazon SageMaker AI Labeling workforces page with a confirmation message that the worker has been added.

After you complete the labeling task, the output of the task is used to train the Computer Vision Detection model from the AWS Marketplace.
Train the model
This section discusses training the computer vision model using the AWS Marketplace Computer Vision Detection model and the labeled dataset from the previous step.

Go to the AWS Marketplace to subscribe to the model, https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6.
Choose Continue to Subscribe.
Choose Continue to configuration.
Select the latest software version, your Region, and make sure Create a training job is selected.

Note: Copy the Product ARN and store it in a text editor or notepad for later use.

Go to SageMaker AI, Notebook instances, Create notebook instance.

Note: A GPU-enabled notebook instance is not required. Amazon SageMaker training jobs spin up the GPU instances needed during training, so most basic instances are sufficient.

Select an ml.m5.2xlarge instance, JupyterLab 4, with a volume size of 128 GB. The default is 5 GB, which is too small.
Select an IAM role to allow the notebook to access resources in your account. You will need access to S3.
In the Git Repositories – optional section, select Clone a public Git repository to this notebook instance only.
Enter the Git repository URL. Leave all the other fields as their default, then choose Create notebook instance to start the instance.
After the instance starts (the status will display as InService), select the Open JupyterLab action for the new notebook instance.

JupyterLab opens:

On the left navigation pane, open the computer-vision-defect-detection folder.

In the AWS Console, go to Marketplace, Manage subscriptions, and then copy the ARN of your model subscription.

In the Jupyter notebook, locate the snippet below and update the placeholder value of the algorithm_name variable with the Product ARN you copied in the previous step.

# TODO: change this to use the subscribed SageMaker algorithm
algorithm_name = "<Customer to specify the algorithm name after subscription>"

The bucket used for this step is automatically created and named in the format sagemaker-<REGION>-<ACCOUNT_ID>.

# Initialize SageMaker session and get execution role
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
# bucket = sagemaker_session.default_bucket()
role = get_execution_role()
# Project name is used as part of the S3 output path
project = "ComputerVisionDefectDetection"

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling jobs and select the job that was completed.
Identify and take note of the output images folder (Output dataset location)

Note: To start the training job, look at the path for the output manifest in <BUCKET NAME>/aliens-dataset/all/aliensv2/manifests/output/output.manifest; this will be the training manifest for the next step.

Set the bucket variable to the images bucket name that you previously set and the object key to the path of your manifest (see the sketch after this list):

bucket: where to store the manifest file
classification_manifest_key: where the output manifest file is stored (for example, aliens-dataset-all/[job-name]/manifests/output/output.manifest)
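
For example, the later training cells reference a classification_s3_path variable built from these two values; a minimal sketch follows (variable names follow the notebook, the bucket and key values are placeholders):

# Assemble the augmented-manifest S3 URI used by the training input below.
# Variable names follow the notebook; the bucket and key values are placeholders.
bucket = "<BUCKET_NAME>"
classification_manifest_key = "aliens-dataset-all/<job-name>/manifests/output/output.manifest"
classification_s3_path = f"s3://{bucket}/{classification_manifest_key}"
print(classification_s3_path)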

Review the model training configuration in the Classification Model with Algorithm Estimator section.

# Create AlgorithmEstimator for classification
from sagemaker.algorithm import AlgorithmEstimator
from sagemaker.inputs import TrainingInput
import datetime

classification_estimator = AlgorithmEstimator(
    algorithm_arn=algorithm_name,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',
    volume_size=20,
    max_run=7200,
    input_mode='Pipe',  # REQUIRED: Algorithm only supports Pipe mode
    sagemaker_session=sagemaker_session,
    enable_network_isolation=True
)

# Set hyperparameters
classification_estimator.set_hyperparameters(
    ModelType='classification',
    TestInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label',
    TrainingInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label'
)

print("Classification estimator configured successfully")

# Define training input using the TrainingInput class
classification_training_input = TrainingInput(
    s3_data=classification_s3_path,
    s3_data_type='AugmentedManifestFile',
    attribute_names=[
        'source-ref',
        'anomaly-label-metadata',
        'anomaly-label'
    ],
    record_wrapping='RecordIO',
    input_mode='Pipe'  # Must match the estimator's input_mode
)

# Start training job
classification_job_name = f'defect-detection-classification-{datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}'
print(f"Starting classification training job: {classification_job_name}")
classification_estimator.fit(
    inputs={'training': classification_training_input},
    job_name=classification_job_name,
    wait=True,
    logs=True
)

Note: The job uses NVIDIA G4dn instances. You can size up to a larger instance to decrease training time, but only on a single instance. The image dataset training finishes in less than 10 minutes with an ml.g4dn.2xlarge. You can experiment with other instance types; however, results may vary because the models were extensively tested on G4dn instances.

Validate the values of TestInputDataAttributeNames and TrainingInputDataAttributeNames in the Hyperparameters section, as well as AttributeNames in the TrainingInput section. The labels on all three must match the structure of your manifest file. Here is a sample manifest:

{
"source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-1.jpg",
"anomaly-label-metadata": {
"job-name": "anomaly-label",
"class-name": "anomaly",
"human-annotated": "yes",
"creation-date": "2022-08-22T20:52:51.851Z",
"type": "groundtruth/image-classification"
},
"anomaly-label": 1
}
{
"source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-2.jpg",
"anomaly-label-metadata": {
"job-name": "anomaly-label",
"class-name": "anomaly",
"human-annotated": "yes",
"creation-date": "2022-08-22T21:11:39.545Z",
"type": "groundtruth/image-classification"
},
"anomaly-label": 1
}

Note: Two of the three values include the labeling job name.

response = sagemaker.create_training_job(
    TrainingJobName=classification_training_job_name,
    HyperParameters={
        'ModelType': 'classification',
        'TestInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata',
        'TrainingInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata'
    }
)

Run all the cells or blocks listed in the Classification Model with Algorithm Estimator section to start the training job.
If you want to train a segmentation model as well, follow the steps in the Segmentation Model with Algorithm Estimator section.

Note: After the training is completed, you are ready to test it! There are a few inference options available for this:

Real-time inference using Amazon SageMaker endpoints
Amazon SageMaker AI Batch Transform inference.
Edge deployment

Deploy the model
Amazon SageMaker AI endpoints and Amazon SageMaker AI Batch Transform inference are both used for inference but serve different purposes.
Amazon SageMaker AI endpoints
Amazon SageMaker AI endpoints are used for real-time inference, providing low-latency predictions suitable for applications requiring immediate responses. Endpoints remain active while they’re deployed, making them better suited for continuous and steady traffic, but potentially more costly due to ongoing resource usage.

In the Jupyter notebook, navigate to the (Optional) Running real-time inference using Amazon SageMaker endpoints section.
Run the following cell blocks to set up and invoke the endpoint:

classification_training_job_name = "<provide training job name here>"

# Create estimator from training job
estimator = AlgorithmEstimator.attach(classification_training_job_name)

# Deploy endpoint using SageMaker v2 SDK
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge'
)

print(f"Endpoint deployed: {predictor.endpoint_name}")

# Invoke the endpoint using the predictor
# (image_data and local_file are prepared in the preceding notebook cells)
result = predictor.predict(image_data)

# Clean up the temporary file
os.remove(local_file)

# Print the result
print("\nEndpoint Response:")
print(json.dumps(result, indent=2))

Validate the inference, then delete the endpoint by running the following block:

# Delete the endpoint
predictor.delete_endpoint()
print("Endpoint deleted")

Note: If you start an endpoint, keep in mind you will be billed while it is running until you turn it off.
Amazon SageMaker AI Batch Transform
Batch Transform is designed for offline inference and making predictions on large datasets stored in S3, and is ideal for bulk processing where low latency is not critical. After the job is complete, the resources are released, making it cost-effective for sporadic workloads.

Navigate to the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section.
Define the s3_input_data and s3_output_path parameters.

# Run batch transform job

#############################################
# Change to your input/output data S3 path  #
#############################################

s3_input_data = "s3://<Specify-s3-path-to-test-images>"
s3_output_path = f"s3://{bucket}/{project}/batch-transform-output"
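
The remaining cells in that section create a transformer from the trained estimator and submit the job against these paths. Here is a minimal sketch of that flow, assuming the estimator attached in the endpoint section; the instance type and image content type are assumptions to adjust for your data.

# Sketch: run Batch Transform with the trained estimator (assumes the
# estimator and S3 paths defined above; instance and content type are
# assumptions).
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    output_path=s3_output_path,
)

transformer.transform(
    data=s3_input_data,
    data_type="S3Prefix",
    content_type="image/jpeg",  # assumption: adjust to your image format
    wait=True,
)
print(f"Batch transform output: {s3_output_path}")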

Run all the cells and blocks in the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section to complete the batch inference.
Validate the batch transform job after completion by navigating to the s3_output_path folder. The following is a sample inference output file:

{
"Source": {
"Type": "direct"
},
"IsAnomalous": true,
"Confidence": 0.92744799389183
}

Clean up
To avoid incurring unnecessary charges, delete the following resources when you no longer need them:

Delete SageMaker endpoints.

Navigate to the Amazon SageMaker Console.
Select Endpoints.
Select the endpoint you created.
Choose Delete.

Delete SageMaker Notebook instances.

Navigate to the Amazon SageMaker Console.
Select Notebook instances.
Select the notebook instance you created.
Choose Stop if the instance is running.
Once stopped, choose Delete.

Delete S3 objects and buckets.

Navigate to the Amazon S3 Console.
Delete all objects in the buckets you created for this tutorial.
Delete the empty buckets.

Delete the Ground Truth labeling team.

Navigate to Ground Truth.
Select Labeling workforces.
Select the Private tab.
Select the private team you created.
Choose Delete team.

Conclusion
In this blog post, we've demonstrated how to transition from Amazon Lookout for Vision to the underlying Computer Vision Detection models available through the AWS Marketplace, showing the step-by-step process of setting up labeling, training the model, and running inference through batch transformation. The transition provides customers with greater flexibility in training options, hyperparameter adjustments, and deployment choices while continuing to use AWS defect detection technology at their own pace. Also be sure to check out our edge-based open source integrated Defect Detection Application on GitHub if you would like to build on what you have learned here.

About the authors
Ryan Vanderwerf is a senior partner solutions architect at Amazon Web Services specializing in smart manufacturing, vision, and machine learning. Ryan previously provided Java virtual machine-focused consulting and project development as a software engineer at OCI on the Grails and Micronaut team. He was chief architect/director of products at ReachForce, with a focus on software and system architecture for AWS Cloud SaaS solutions for marketing data management. Ryan has built several SaaS solutions in domains such as finance, media, telecom, and e-learning since 1996.
Lu Min is a Software Development Engineer for AWS Edge ML services, focused on developing machine learning solutions that operate at the edge for AWS customers. With expertise in optimizing ML models for resource-constrained environments, Lu helps customers implement efficient inference capabilities on edge devices and cloud communication, as well as manage model lifecycle using AWS SageMaker.
Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.
Kunle Adeleke is an enterprise solutions architect, providing guidance to help large AWS commercial customers in diverse industries craft their technology strategy. Kunle has led enterprise architecture teams and software development teams in both government and commercial sectors. His deep expertise spans software development, solution architecture, enterprise architecture, security, and data and AI/ML.