How to Build a Self-Evaluating Agentic AI System with LlamaIndex and OpenAI Using Retrieval, Tool Use, and Automated Quality Checks

In this tutorial, we build an advanced agentic AI workflow using LlamaIndex and OpenAI models. We focus on designing a reliable retrieval-augmented generation (RAG) agent that can reason over evidence, use tools deliberately, and evaluate its own outputs for quality. By structuring the system around retrieval, answer synthesis, and self-evaluation, we demonstrate how agentic patterns go beyond simple chatbots and move toward more trustworthy, controllable AI systems suitable for research and analytical use cases.

!pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio

import os
import asyncio
import nest_asyncio
nest_asyncio.apply()

from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

We set up the environment and install all required dependencies for running an agentic AI workflow. We securely load the OpenAI API key at runtime, ensuring that credentials are never hardcoded. We also prepare the notebook to handle asynchronous execution smoothly.

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

texts = [
    "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
    "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
    "Tool-using agents require constrained tools, validation, and self-review loops.",
    "A robust workflow follows retrieve, answer, evaluate, and revise steps.",
]

docs = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

We configure the OpenAI language model and embedding model and build a compact knowledge base for our agent. We transform raw text into indexed documents so that the agent can retrieve relevant evidence during reasoning.

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)

def retrieve_evidence(q: str) -> str:
    """Retrieve the top matching snippets from the index as numbered evidence."""
    r = query_engine.query(q)
    out = []
    for i, n in enumerate(r.source_nodes or []):
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)


def score_answer(q: str, a: str) -> str:
    """Score an answer for faithfulness and relevancy against retrieved context."""
    retrieved = query_engine.query(q)
    ctx_texts = [n.node.get_content() for n in retrieved.source_nodes or []]
    faith = faith_eval.evaluate(query=q, response=a, contexts=ctx_texts)
    rel = rel_eval.evaluate(query=q, response=a, contexts=ctx_texts)
    return f"Faithfulness: {faith.score}\nRelevancy: {rel.score}"

We define the core tools used by the agent: evidence retrieval and answer evaluation. We implement automatic scoring for faithfulness and relevancy so the agent can judge the quality of its own responses.

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context

agent = ReActAgent(
    tools=[retrieve_evidence, score_answer],
    llm=Settings.llm,
    system_prompt="""
Always retrieve evidence first.
Produce a structured answer.
Evaluate the answer and revise once if scores are low.
""",
    verbose=True,
)

ctx = Context(agent)

We create the ReAct-based agent and define its system behavior, guiding how it retrieves evidence, generates answers, and revises results. We also initialize the execution context that maintains the agent's state across interactions. This step brings together tools and reasoning into a single agentic workflow.

async def run_brief(topic: str):
    q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
    handler = agent.run(q, ctx=ctx)
    async for ev in handler.stream_events():
        print(getattr(ev, "delta", ""), end="")
    res = await handler
    return str(res)


topic = "RAG agent reliability and evaluation"
loop = asyncio.get_event_loop()
result = loop.run_until_complete(run_brief(topic))

print("\n\nFINAL OUTPUT\n")
print(result)

We execute the full agent loop by passing a topic into the system and streaming the agent’s reasoning and output. We allow the agent to complete its retrieval, generation, and evaluation cycle asynchronously.
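As an optional final check, we can reuse the score_answer tool outside the agent loop to grade the finished brief against the retrieved context. This is a minimal sketch, assuming the cells above have already run and result holds the agent's final output:

# Re-score the final brief with the same evaluators the agent used internally
scores = score_answer(topic, result)
print("\nSELF-EVALUATION\n" + scores)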

In conclusion, we showcased how an agent can retrieve supporting evidence, generate a structured response, and assess its own faithfulness and relevancy before finalizing an answer. We kept the design modular and transparent, making it easy to extend the workflow with additional tools, evaluators, or domain-specific knowledge sources. This approach illustrates how we can use agentic AI with LlamaIndex and OpenAI models to build more capable systems that are also more reliable and self-aware in their reasoning and responses.


How to Build a Safe, Autonomous Prior Authorization Agent for Healthcare Revenue Cycle Management with Human-in-the-Loop Controls

In this tutorial, we demonstrate how an autonomous, agentic AI system can simulate the end-to-end prior authorization workflow within healthcare Revenue Cycle Management (RCM). We show how an agent continuously monitors incoming surgery orders, gathers the required clinical documentation, submits prior authorization requests to payer systems, tracks their status, and intelligently responds to denials through automated analysis and appeals. We design the system to act conservatively and responsibly, escalating to a human reviewer when uncertainty crosses a defined threshold. While the implementation uses mocked EHR and payer portals for clarity and safety, we intentionally mirror real-world healthcare workflows to make the logic transferable to production environments. We emphasize that this is strictly a technical simulation and not a substitute for clinical judgment, payer policy interpretation, or regulatory compliance.

!pip -q install "pydantic>=2.0.0" "httpx>=0.27.0"

import os, time, json, random, hashlib
from typing import List, Dict, Optional, Any
from enum import Enum
from datetime import datetime, timedelta
from pydantic import BaseModel, Field

We set up the execution environment and install the minimal dependencies required to run the tutorial. We configure optional OpenAI usage in a safe, fail-open manner so the system continues to work even without external models. We ensure the foundation is lightweight, reproducible, and suitable for healthcare simulations.

USE_OPENAI = False
OPENAI_AVAILABLE = False

try:
    from getpass import getpass
    if not os.environ.get("OPENAI_API_KEY"):
        pass  # no interactive prompt here; export OPENAI_API_KEY beforehand to enable LLM-assisted analysis
    if os.environ.get("OPENAI_API_KEY"):
        USE_OPENAI = True
except Exception:
    USE_OPENAI = False

if USE_OPENAI:
    try:
        !pip -q install openai
        from openai import OpenAI
        client = OpenAI()
        OPENAI_AVAILABLE = True
    except Exception:
        OPENAI_AVAILABLE = False
        USE_OPENAI = False

We define strongly typed domain models for patients, surgical orders, clinical documents, and authorization decisions. We use explicit enums and schemas to mirror real healthcare RCM structures while avoiding ambiguity. We enforce clarity and validation to reduce downstream errors in automated decision-making.

class DocType(str, Enum):
    H_AND_P = "history_and_physical"
    LABS = "labs"
    IMAGING = "imaging"
    MED_LIST = "medication_list"
    CONSENT = "consent"
    PRIOR_TX = "prior_treatments"
    CLINICAL_NOTE = "clinical_note"


class SurgeryType(str, Enum):
    KNEE_ARTHROPLASTY = "knee_arthroplasty"
    SPINE_FUSION = "spine_fusion"
    CATARACT = "cataract"
    BARIATRIC = "bariatric_surgery"


class InsurancePlan(str, Enum):
    PAYER_ALPHA = "PayerAlpha"
    PAYER_BETA = "PayerBeta"
    PAYER_GAMMA = "PayerGamma"


class Patient(BaseModel):
    patient_id: str
    name: str
    dob: str
    member_id: str
    plan: InsurancePlan


class SurgeryOrder(BaseModel):
    order_id: str
    patient: Patient
    surgery_type: SurgeryType
    scheduled_date: str
    ordering_provider_npi: str
    diagnosis_codes: List[str] = Field(default_factory=list)
    created_at: str


class ClinicalDocument(BaseModel):
    doc_id: str
    doc_type: DocType
    created_at: str
    content: str
    source: str


class PriorAuthRequest(BaseModel):
    request_id: str
    order: SurgeryOrder
    submitted_at: Optional[str] = None
    docs_attached: List[ClinicalDocument] = Field(default_factory=list)
    payload: Dict[str, Any] = Field(default_factory=dict)


class AuthStatus(str, Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    DENIED = "denied"
    NEEDS_INFO = "needs_info"
    APPEALED = "appealed"


class DenialReason(str, Enum):
    MISSING_DOCS = "missing_docs"
    MEDICAL_NECESSITY = "medical_necessity"
    MEMBER_INELIGIBLE = "member_ineligible"
    DUPLICATE = "duplicate"
    CODING_ISSUE = "coding_issue"
    OTHER = "other"


class PayerResponse(BaseModel):
    status: AuthStatus
    payer_ref: str
    message: str
    denial_reason: Optional[DenialReason] = None
    missing_docs: List[DocType] = Field(default_factory=list)
    confidence: float = 0.9


class AgentDecision(BaseModel):
    action: str
    missing_docs: List[DocType] = Field(default_factory=list)
    rationale: str = ""
    uncertainty: float = 0.0
    next_wait_seconds: int = 0
    appeal_text: Optional[str] = None

We simulate an EHR system that emits surgery orders and stores clinical documentation. We intentionally model incomplete charts to reflect real-world documentation gaps that often drive prior authorization denials. We show how an agent can retrieve and augment patient records in a controlled manner.

def _now_iso() -> str:
    return datetime.utcnow().replace(microsecond=0).isoformat() + "Z"


def _stable_id(prefix: str, seed: str) -> str:
    h = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:10]
    return f"{prefix}_{h}"


class MockEHR:
    def __init__(self):
        self.orders_queue: List[SurgeryOrder] = []
        self.patient_docs: Dict[str, List[ClinicalDocument]] = {}

    def seed_data(self, n_orders: int = 5):
        random.seed(7)

        def make_patient(i: int) -> Patient:
            pid = f"PT{i:04d}"
            plan = random.choice(list(InsurancePlan))
            return Patient(
                patient_id=pid,
                name=f"Patient {i}",
                dob="1980-01-01",
                member_id=f"M{i:08d}",
                plan=plan,
            )

        def docs_for_order(patient: Patient, surgery: SurgeryType) -> List[ClinicalDocument]:
            base = [
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "H&P"),
                    doc_type=DocType.H_AND_P,
                    created_at=_now_iso(),
                    content="H&P: Relevant history, exam findings, and surgical indication.",
                    source="EHR",
                ),
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "NOTE"),
                    doc_type=DocType.CLINICAL_NOTE,
                    created_at=_now_iso(),
                    content="Clinical note: Symptoms, conservative management attempted, clinician assessment.",
                    source="EHR",
                ),
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient.patient_id + "MEDS"),
                    doc_type=DocType.MED_LIST,
                    created_at=_now_iso(),
                    content="Medication list: Current meds, allergies, contraindications.",
                    source="EHR",
                ),
            ]

            maybe = []
            if surgery in [SurgeryType.KNEE_ARTHROPLASTY, SurgeryType.SPINE_FUSION, SurgeryType.BARIATRIC]:
                maybe.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "LABS"),
                        doc_type=DocType.LABS,
                        created_at=_now_iso(),
                        content="Labs: CBC/CMP within last 30 days.",
                        source="LabSystem",
                    )
                )

            if surgery in [SurgeryType.SPINE_FUSION, SurgeryType.KNEE_ARTHROPLASTY]:
                maybe.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "IMG"),
                        doc_type=DocType.IMAGING,
                        created_at=_now_iso(),
                        content="Imaging: MRI/X-ray report supporting diagnosis and severity.",
                        source="Radiology",
                    )
                )

            final = base + [d for d in maybe if random.random() > 0.35]

            if random.random() > 0.6:
                final.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "PRIOR_TX"),
                        doc_type=DocType.PRIOR_TX,
                        created_at=_now_iso(),
                        content="Prior treatments: PT, meds, injections tried over 6+ weeks.",
                        source="EHR",
                    )
                )

            if random.random() > 0.5:
                final.append(
                    ClinicalDocument(
                        doc_id=_stable_id("DOC", patient.patient_id + "CONSENT"),
                        doc_type=DocType.CONSENT,
                        created_at=_now_iso(),
                        content="Consent: Signed procedure consent and risk disclosure.",
                        source="EHR",
                    )
                )

            return final

        for i in range(1, n_orders + 1):
            patient = make_patient(i)
            surgery = random.choice(list(SurgeryType))
            order = SurgeryOrder(
                order_id=_stable_id("ORD", patient.patient_id + surgery.value),
                patient=patient,
                surgery_type=surgery,
                scheduled_date=(datetime.utcnow().date() + timedelta(days=random.randint(3, 21))).isoformat(),
                ordering_provider_npi=str(random.randint(1000000000, 1999999999)),
                diagnosis_codes=["M17.11", "M54.5"] if surgery != SurgeryType.CATARACT else ["H25.9"],
                created_at=_now_iso(),
            )
            self.orders_queue.append(order)
            self.patient_docs[patient.patient_id] = docs_for_order(patient, surgery)

    def poll_new_surgery_orders(self, max_n: int = 1) -> List[SurgeryOrder]:
        pulled = self.orders_queue[:max_n]
        self.orders_queue = self.orders_queue[max_n:]
        return pulled

    def get_patient_documents(self, patient_id: str) -> List[ClinicalDocument]:
        return list(self.patient_docs.get(patient_id, []))

    def fetch_additional_docs(self, patient_id: str, needed: List[DocType]) -> List[ClinicalDocument]:
        generated = []
        for dt in needed:
            generated.append(
                ClinicalDocument(
                    doc_id=_stable_id("DOC", patient_id + dt.value + str(time.time())),
                    doc_type=dt,
                    created_at=_now_iso(),
                    content=f"Auto-collected document for {dt.value}: extracted and formatted per payer policy.",
                    source="AutoCollector",
                )
            )
        self.patient_docs.setdefault(patient_id, []).extend(generated)
        return generated

class MockPayerPortal:
    def __init__(self):
        self.db: Dict[str, Dict[str, Any]] = {}
        random.seed(11)

    def required_docs_policy(self, plan: InsurancePlan, surgery: SurgeryType) -> List[DocType]:
        base = [DocType.H_AND_P, DocType.CLINICAL_NOTE, DocType.MED_LIST]
        if surgery in [SurgeryType.SPINE_FUSION, SurgeryType.KNEE_ARTHROPLASTY]:
            base += [DocType.IMAGING, DocType.LABS, DocType.PRIOR_TX]
        if surgery == SurgeryType.BARIATRIC:
            base += [DocType.LABS, DocType.PRIOR_TX]
        if plan in [InsurancePlan.PAYER_BETA, InsurancePlan.PAYER_GAMMA]:
            base += [DocType.CONSENT]
        return sorted(list(set(base)), key=lambda x: x.value)

    def submit(self, pa: PriorAuthRequest) -> PayerResponse:
        payer_ref = _stable_id("PAYREF", pa.request_id + _now_iso())
        docs_present = {d.doc_type for d in pa.docs_attached}
        required = self.required_docs_policy(pa.order.patient.plan, pa.order.surgery_type)
        missing = [d for d in required if d not in docs_present]

        self.db[payer_ref] = {
            "status": AuthStatus.SUBMITTED,
            "order_id": pa.order.order_id,
            "plan": pa.order.patient.plan,
            "surgery": pa.order.surgery_type,
            "missing": missing,
            "polls": 0,
            "submitted_at": _now_iso(),
            "denial_reason": None,
        }

        msg = "Submission received. Case queued for review."
        if missing:
            msg += " Initial validation indicates incomplete documentation."
        return PayerResponse(status=AuthStatus.SUBMITTED, payer_ref=payer_ref, message=msg)

    def check_status(self, payer_ref: str) -> PayerResponse:
        if payer_ref not in self.db:
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message="Case not found (possible payer system error).",
                denial_reason=DenialReason.OTHER,
                confidence=0.4,
            )

        case = self.db[payer_ref]
        case["polls"] += 1

        if case["status"] == AuthStatus.SUBMITTED and case["polls"] >= 1:
            case["status"] = AuthStatus.IN_REVIEW

        if case["status"] == AuthStatus.IN_REVIEW and case["polls"] >= 3:
            if case["missing"]:
                case["status"] = AuthStatus.DENIED
                case["denial_reason"] = DenialReason.MISSING_DOCS
            else:
                roll = random.random()
                if roll < 0.10:
                    case["status"] = AuthStatus.DENIED
                    case["denial_reason"] = DenialReason.CODING_ISSUE
                elif roll < 0.18:
                    case["status"] = AuthStatus.DENIED
                    case["denial_reason"] = DenialReason.MEDICAL_NECESSITY
                else:
                    case["status"] = AuthStatus.APPROVED

        if case["status"] == AuthStatus.DENIED:
            dr = case["denial_reason"] or DenialReason.OTHER
            missing = case["missing"] if dr == DenialReason.MISSING_DOCS else []
            conf = 0.9 if dr != DenialReason.OTHER else 0.55
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message=f"Denied. Reason={dr.value}.",
                denial_reason=dr,
                missing_docs=missing,
                confidence=conf,
            )

        if case["status"] == AuthStatus.APPROVED:
            return PayerResponse(
                status=AuthStatus.APPROVED,
                payer_ref=payer_ref,
                message="Approved. Authorization issued.",
                confidence=0.95,
            )

        return PayerResponse(
            status=case["status"],
            payer_ref=payer_ref,
            message=f"Status={case['status'].value}. Polls={case['polls']}.",
            confidence=0.9,
        )

    def file_appeal(self, payer_ref: str, appeal_text: str, attached_docs: List[ClinicalDocument]) -> PayerResponse:
        if payer_ref not in self.db:
            return PayerResponse(
                status=AuthStatus.DENIED,
                payer_ref=payer_ref,
                message="Appeal failed: case not found.",
                denial_reason=DenialReason.OTHER,
                confidence=0.4,
            )

        case = self.db[payer_ref]
        docs_present = {d.doc_type for d in attached_docs}
        still_missing = [d for d in case["missing"] if d not in docs_present]
        case["missing"] = still_missing
        case["status"] = AuthStatus.APPEALED
        case["polls"] = 0

        msg = "Appeal submitted and queued for review."
        if still_missing:
            msg += f" Warning: still missing {', '.join([d.value for d in still_missing])}."
        return PayerResponse(status=AuthStatus.APPEALED, payer_ref=payer_ref, message=msg, confidence=0.9)

We model payer-side behavior, including documentation policies, review timelines, and denial logic. We encode simplified but realistic payer rules to demonstrate how policy-driven automation works in practice. We expose predictable failure modes that the agent must respond to safely.

def required_docs_for_order(payer: MockPayerPortal, order: SurgeryOrder) -> List[DocType]:
    return payer.required_docs_policy(order.patient.plan, order.surgery_type)


def attach_best_docs(ehr_docs: List[ClinicalDocument], required: List[DocType]) -> List[ClinicalDocument]:
    by_type: Dict[DocType, List[ClinicalDocument]] = {}
    for d in ehr_docs:
        by_type.setdefault(d.doc_type, []).append(d)
    attached = []
    for dt in required:
        if dt in by_type:
            attached.append(by_type[dt][-1])
    return attached


def compute_uncertainty(payer_resp: PayerResponse, missing_docs: List[DocType], llm_used: bool) -> float:
    base = 0.15
    if payer_resp.denial_reason in [DenialReason.OTHER]:
        base += 0.35
    if payer_resp.denial_reason in [DenialReason.MEDICAL_NECESSITY]:
        base += 0.25
    if payer_resp.denial_reason in [DenialReason.CODING_ISSUE]:
        base += 0.20
    if missing_docs:
        base += 0.10
    if llm_used:
        base -= 0.05
    return max(0.0, min(1.0, base + (1 - payer_resp.confidence) * 0.6))


def rule_based_denial_analysis(order: SurgeryOrder, payer_resp: PayerResponse) -> Dict[str, Any]:
    rec = {"missing_docs": [], "rationale": "", "appeal_text": ""}
    if payer_resp.denial_reason == DenialReason.MISSING_DOCS:
        rec["missing_docs"] = payer_resp.missing_docs
        rec["rationale"] = "Denial indicates incomplete documentation per payer policy. Collect and resubmit as appeal."
        rec["appeal_text"] = (
            f"Appeal for prior authorization ({payer_resp.payer_ref})\n"
            f"Patient: {order.patient.name} ({order.patient.member_id})\n"
            f"Procedure: {order.surgery_type.value}\n"
            f"Reason for appeal: Missing documentation has now been attached. Please re-review.\n"
        )
    elif payer_resp.denial_reason == DenialReason.CODING_ISSUE:
        rec["rationale"] = "Potential coding mismatch. Verify diagnosis/procedure codes and include supporting note."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Requesting reconsideration.\n"
            f"Attached: Updated clinical note clarifying diagnosis and indication; please re-review coding alignment.\n"
        )
    elif payer_resp.denial_reason == DenialReason.MEDICAL_NECESSITY:
        rec["rationale"] = "Medical necessity denial. Add prior treatments timeline, imaging severity, and functional impact."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Medical necessity reconsideration.\n"
            f"Attached: Prior conservative therapies, imaging, and clinician attestation of functional limitation.\n"
        )
    else:
        rec["rationale"] = "Unclear denial. Escalate if payer message lacks actionable details."
        rec["appeal_text"] = (
            f"Appeal ({payer_resp.payer_ref}): Requesting clarification and reconsideration.\n"
            f"Please provide specific criteria not met; attached full clinical packet.\n"
        )
    return rec


def llm_denial_analysis_and_appeal(order: SurgeryOrder, payer_resp: PayerResponse, docs: List[ClinicalDocument]) -> Dict[str, Any]:
    if not OPENAI_AVAILABLE:
        return rule_based_denial_analysis(order, payer_resp)

    doc_summary = [{"doc_type": d.doc_type.value, "source": d.source, "created_at": d.created_at} for d in docs]
    prompt = {
        "role": "user",
        "content": (
            "You are an RCM prior authorization specialist agent.\n"
            "Given the order, attached docs, and payer denial response, do three things:\n"
            "1) Identify what documentation is missing or what needs clarification.\n"
            "2) Recommend next steps.\n"
            "3) Draft a concise appeal letter.\n\n"
            f"ORDER:\n{order.model_dump_json(indent=2)}\n\n"
            f"PAYER_RESPONSE:\n{payer_resp.model_dump_json(indent=2)}\n\n"
            f"ATTACHED_DOCS_METADATA:\n{json.dumps(doc_summary, indent=2)}\n\n"
            "Return STRICT JSON with keys: missing_docs (list of strings), rationale (string), appeal_text (string)."
        ),
    }

    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[prompt],
            temperature=0.2,
        )
        text = resp.choices[0].message.content.strip()
        data = json.loads(text)
        missing = []
        for x in data.get("missing_docs", []):
            try:
                missing.append(DocType(x))
            except Exception:
                pass
        return {
            "missing_docs": missing,
            "rationale": data.get("rationale", ""),
            "appeal_text": data.get("appeal_text", ""),
        }
    except Exception:
        return rule_based_denial_analysis(order, payer_resp)

class PriorAuthAgent:
    def __init__(self, ehr: MockEHR, payer: MockPayerPortal, uncertainty_threshold: float = 0.55):
        self.ehr = ehr
        self.payer = payer
        self.uncertainty_threshold = uncertainty_threshold
        self.audit_log: List[Dict[str, Any]] = []

    def log(self, event: str, payload: Dict[str, Any]):
        self.audit_log.append({"ts": _now_iso(), "event": event, **payload})

    def build_prior_auth_request(self, order: SurgeryOrder) -> PriorAuthRequest:
        required = required_docs_for_order(self.payer, order)
        docs = self.ehr.get_patient_documents(order.patient.patient_id)
        attached = attach_best_docs(docs, required)

        req = PriorAuthRequest(
            request_id=_stable_id("PA", order.order_id + order.patient.member_id),
            order=order,
            docs_attached=attached,
            payload={
                "member_id": order.patient.member_id,
                "plan": order.patient.plan.value,
                "procedure": order.surgery_type.value,
                "diagnosis_codes": order.diagnosis_codes,
                "scheduled_date": order.scheduled_date,
                "provider_npi": order.ordering_provider_npi,
                "attached_doc_types": [d.doc_type.value for d in attached],
            },
        )
        self.log("pa_request_built", {"order_id": order.order_id, "required_docs": [d.value for d in required], "attached": req.payload["attached_doc_types"]})
        return req

    def submit_and_monitor(self, pa: PriorAuthRequest, max_polls: int = 7) -> Dict[str, Any]:
        pa.submitted_at = _now_iso()
        submit_resp = self.payer.submit(pa)
        self.log("submitted", {"request_id": pa.request_id, "payer_ref": submit_resp.payer_ref, "message": submit_resp.message})

        payer_ref = submit_resp.payer_ref

        for _ in range(max_polls):
            time.sleep(0.25)
            status = self.payer.check_status(payer_ref)
            self.log("status_polled", {"payer_ref": payer_ref, "status": status.status.value, "message": status.message})

            if status.status == AuthStatus.APPROVED:
                return {"final_status": "APPROVED", "payer_ref": payer_ref, "details": status.model_dump()}

            if status.status == AuthStatus.DENIED:
                decision = self.handle_denial(pa, payer_ref, status)
                if decision.action == "escalate":
                    return {
                        "final_status": "ESCALATED_TO_HUMAN",
                        "payer_ref": payer_ref,
                        "decision": decision.model_dump(),
                        "details": status.model_dump(),
                    }
                if decision.action == "appeal":
                    appeal_docs = pa.docs_attached[:]
                    appeal_resp = self.payer.file_appeal(payer_ref, decision.appeal_text or "", appeal_docs)
                    self.log("appeal_filed", {"payer_ref": payer_ref, "message": appeal_resp.message})

                    for _ in range(max_polls):
                        time.sleep(0.25)
                        post = self.payer.check_status(payer_ref)
                        self.log("post_appeal_polled", {"payer_ref": payer_ref, "status": post.status.value, "message": post.message})
                        if post.status == AuthStatus.APPROVED:
                            return {"final_status": "APPROVED_AFTER_APPEAL", "payer_ref": payer_ref, "details": post.model_dump()}
                        if post.status == AuthStatus.DENIED:
                            return {"final_status": "DENIED_AFTER_APPEAL", "payer_ref": payer_ref, "details": post.model_dump(), "decision": decision.model_dump()}

                    return {"final_status": "APPEAL_PENDING", "payer_ref": payer_ref, "decision": decision.model_dump()}

                return {"final_status": "DENIED_NO_ACTION", "payer_ref": payer_ref, "decision": decision.model_dump(), "details": status.model_dump()}

        return {"final_status": "PENDING_TIMEOUT", "payer_ref": payer_ref}

    def handle_denial(self, pa: PriorAuthRequest, payer_ref: str, denial_resp: PayerResponse) -> AgentDecision:
        order = pa.order
        analysis = llm_denial_analysis_and_appeal(order, denial_resp, pa.docs_attached) if (USE_OPENAI and OPENAI_AVAILABLE) else rule_based_denial_analysis(order, denial_resp)
        missing_docs: List[DocType] = analysis.get("missing_docs", [])
        rationale: str = analysis.get("rationale", "")
        appeal_text: str = analysis.get("appeal_text", "")

        if denial_resp.denial_reason == DenialReason.MISSING_DOCS and denial_resp.missing_docs:
            missing_docs = denial_resp.missing_docs

        if missing_docs:
            new_docs = self.ehr.fetch_additional_docs(order.patient.patient_id, missing_docs)
            pa.docs_attached.extend(new_docs)
            self.log("missing_docs_collected", {"payer_ref": payer_ref, "collected": [d.doc_type.value for d in new_docs]})

        uncertainty = compute_uncertainty(denial_resp, missing_docs, llm_used=(USE_OPENAI and OPENAI_AVAILABLE))
        self.log("denial_analyzed", {"payer_ref": payer_ref,
                                     "denial_reason": (denial_resp.denial_reason.value if denial_resp.denial_reason else None),
                                     "uncertainty": uncertainty,
                                     "missing_docs": [d.value for d in missing_docs]})

        if uncertainty >= self.uncertainty_threshold:
            return AgentDecision(
                action="escalate",
                missing_docs=missing_docs,
                rationale=f"{rationale} Escalating due to high uncertainty ({uncertainty:.2f}) >= threshold ({self.uncertainty_threshold:.2f}).",
                uncertainty=uncertainty,
                next_wait_seconds=0,
            )

        if not appeal_text:
            analysis2 = rule_based_denial_analysis(order, denial_resp)
            appeal_text = analysis2.get("appeal_text", "")

        attached_types = sorted(list({d.doc_type.value for d in pa.docs_attached}))
        appeal_text = (
            appeal_text.strip()
            + "\n\nAttached documents:\n- "
            + "\n- ".join(attached_types)
            + "\n\nRequested outcome: Reconsideration and authorization issuance.\n"
        )

        return AgentDecision(
            action="appeal",
            missing_docs=missing_docs,
            rationale=f"{rationale} Proceeding autonomously (uncertainty {uncertainty:.2f} < threshold {self.uncertainty_threshold:.2f}).",
            uncertainty=uncertainty,
            appeal_text=appeal_text,
            next_wait_seconds=1,
        )

We implement the core intelligence layer that attaches documents, analyzes denials, and estimates uncertainty. We demonstrate how rule-based logic and optional LLM reasoning can coexist without compromising determinism. We explicitly gate automation decisions to maintain safety in a healthcare context.

ehr = MockEHR()
ehr.seed_data(n_orders=6)

payer = MockPayerPortal()
agent = PriorAuthAgent(ehr, payer, uncertainty_threshold=0.55)

results = []
print("=== Starting Autonomous Prior Authorization Agent Demo ===")
print(f"OpenAI enabled: {USE_OPENAI and OPENAI_AVAILABLE}\n")

while True:
    new_orders = ehr.poll_new_surgery_orders(max_n=1)
    if not new_orders:
        break

    order = new_orders[0]
    print("\n--- New Surgery Order Detected ---")
    print(f"Order: {order.order_id} | Patient: {order.patient.patient_id} | Plan: {order.patient.plan.value} | Surgery: {order.surgery_type.value}")

    pa = agent.build_prior_auth_request(order)
    outcome = agent.submit_and_monitor(pa, max_polls=7)
    results.append({"order_id": order.order_id, "patient_id": order.patient.patient_id, **outcome})

    print(f"Outcome: {outcome['final_status']} | PayerRef: {outcome.get('payer_ref')}")

print("\n=== Summary ===")
status_counts = {}
for r in results:
    status_counts[r["final_status"]] = status_counts.get(r["final_status"], 0) + 1
print("Final status counts:", status_counts)

print("\nSample result (first case):")
print(json.dumps(results[0], indent=2))

print("\n=== Audit Log (last ~12 events) ===")
for row in agent.audit_log[-12:]:
    print(json.dumps(row, indent=2))

print(
    "\nHardening checklist (high level):\n"
    "- Swap mocks for real EHR + payer integrations (FHIR/HL7, payer APIs/portal automations)\n"
    "- Add PHI governance (tokenization, least-privilege access, encrypted logging, retention controls)\n"
    "- Add deterministic policy engine + calibrated uncertainty model\n"
    "- Add human-in-the-loop UI with SLA timers, retries/backoff, idempotency keys\n"
    "- Add evidence packing (policy citations, structured attachments, templates)\n"
)

We orchestrate the full end-to-end workflow and generate operational summaries and audit logs. We track outcomes, escalation events, and system behavior to support transparency and compliance. We emphasize observability and traceability as essential requirements for healthcare AI systems.
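For teams that want to keep the audit trail beyond the notebook session, a minimal sketch of exporting it is shown below; the file name is a placeholder, and a production system would write to durable, access-controlled storage with PHI safeguards.

# Persist the in-memory audit trail as JSON Lines for later compliance review
with open("prior_auth_audit_log.jsonl", "w") as fh:  # placeholder path
    for event in agent.audit_log:
        fh.write(json.dumps(event) + "\n")

print(f"Wrote {len(agent.audit_log)} audit events to prior_auth_audit_log.jsonl")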

In conclusion, we illustrated how agentic AI can meaningfully reduce administrative friction in healthcare RCM by automating repetitive, rules-driven prior authorization tasks while preserving human oversight for ambiguous or high-risk decisions. We showed that combining deterministic policy logic, uncertainty estimation, and optional LLM-assisted reasoning enables a balanced approach that aligns with healthcare’s safety-critical nature. This work should be viewed as an architectural and educational reference rather than a deployable medical system; any real-world implementation must adhere to HIPAA and regional data protection laws, incorporate de-identification and access controls, undergo clinical and compliance review, and be validated against payer-specific policies.


Black Forest Labs Releases FLUX.2 [klein]: Compact Flow Models for Interactive Visual Intelligence

Black Forest Labs releases FLUX.2 [klein], a compact image model family that targets interactive visual intelligence on consumer hardware. FLUX.2 [klein] extends the FLUX.2 line with sub second generation and editing, a unified architecture for text to image and image to image, and deployment options that range from local GPUs to cloud APIs, while keeping state of the art image quality.

From FLUX.2 [dev] to interactive visual intelligence

FLUX.2 [dev] is a 32 billion parameter rectified flow transformer for text conditioned image generation and editing, including composition with multiple reference images, and runs mainly on data center class accelerators. It is tuned for maximum quality and flexibility, with long sampling schedules and high VRAM requirements.

FLUX.2 [klein] takes the same design direction and compresses it into smaller rectified flow transformers with 4 billion and 9 billion parameters. These models are distilled to very short sampling schedules, support the same text to image and multi reference editing tasks, and are optimized for response times below 1 second on modern GPUs.

Model family and capabilities

The FLUX.2 [klein] family consists of four main open weight variants that share a single architecture.

FLUX.2 [klein] 4B

FLUX.2 [klein] 9B

FLUX.2 [klein] 4B Base

FLUX.2 [klein] 9B Base

FLUX.2 [klein] 4B and 9B are step distilled and guidance distilled models. They use 4 inference steps and are positioned as the fastest options for production and interactive workloads. FLUX.2 [klein] 9B combines a 9B flow model with an 8B Qwen3 text embedder and is described as the flagship small model on the Pareto frontier for quality versus latency across text to image, single reference editing, and multi reference generation.

The Base variants are undistilled versions with longer sampling schedules. The documentation lists them as foundation models that preserve the complete training signal and provide higher output diversity. They are intended for fine tuning, LoRA training, research pipelines, and custom post training workflows where control is more important than minimum latency.

All FLUX.2 [klein] models support three core tasks in the same architecture. They can generate images from text, they can edit a single input image, and they can perform multi reference generation and editing where several input images and a prompt jointly define the target output.

Latency, VRAM, and quantized variants

The FLUX.2 [klein] model page provides approximate end to end inference times on GB200 and RTX 5090. FLUX.2 [klein] 4B is the fastest variant and is listed at about 0.3 to 1.2 seconds per image, depending on hardware. FLUX.2 [klein] 9B targets about 0.5 to 2 seconds at higher quality. The Base models require several seconds because they run with 50 step sampling schedules, but they expose more flexibility for custom pipelines.

The FLUX.2 [klein] 4B model card states that 4B fits in about 13 GB of VRAM and is suitable for GPUs like the RTX 3090 and RTX 4070. The FLUX.2 [klein] 9B card reports a requirement of about 29 GB of VRAM and targets hardware such as the RTX 4090. This means a single high end consumer card can host the distilled variants with full resolution sampling.

To extend the reach to more devices, Black Forest Labs also releases FP8 and NVFP4 versions for all FLUX.2 [klein] variants, developed together with NVIDIA. FP8 quantization is described as up to 1.6 times faster with up to 40 percent lower VRAM usage, and NVFP4 as up to 2.7 times faster with up to 55 percent lower VRAM usage on RTX GPUs, while keeping the core capabilities the same.

Benchmarks against other image models

Black Forest Labs evaluates FLUX.2 [klein] through Elo style comparisons on text to image, single reference editing, and multi reference tasks. The performance charts show FLUX.2 [klein] on the Pareto frontier of Elo score versus latency and Elo score versus VRAM. The commentary states that FLUX.2 [klein] matches or exceeds the quality of Qwen based image models at a fraction of the latency and VRAM, and that it outperforms Z Image while supporting unified text to image and multi reference editing in one architecture.

https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence

The base variants trade some speed for full customizability and fine tuning, which aligns with their role as foundation checkpoints for new research and domain specific pipelines.

Key Takeaways

FLUX.2 [klein] is a compact rectified flow transformer family with 4B and 9B variants that supports text to image, single image editing, and multi reference generation in one unified architecture.

The distilled FLUX.2 [klein] 4B and 9B models use 4 sampling steps and are optimized for sub second inference on a single modern GPU, while the undistilled Base models use longer schedules and are intended for fine tuning and research.

Quantized FP8 and NVFP4 variants, built with NVIDIA, provide up to 1.6 times speedup with about 40 percent VRAM reduction for FP8 and up to 2.7 times speedup with about 55 percent VRAM reduction for NVFP4 on RTX GPUs.


Google AI Releases TranslateGemma: A New Family of Open Translation Models Built on Gemma 3 with Support for 55 Languages

Google AI has released TranslateGemma, a suite of open machine translation models built on Gemma 3 and targeted at 55 languages. The family comes in 4B, 12B and 27B parameter sizes. It is designed to run across devices from mobile and edge hardware to laptops and a single H100 GPU or TPU instance in the cloud.

TranslateGemma is not a separate architecture. It is Gemma 3 specialized for translation through a two stage post training pipeline: (1) supervised fine tuning on large parallel corpora, and (2) reinforcement learning that optimizes translation quality with a multi signal reward ensemble. The goal is to push translation quality while keeping the general instruction following behavior of Gemma 3.

Supervised fine tuning on synthetic and human parallel data

The supervised fine tuning stage starts from the public Gemma 3 4B, 12B and 27B checkpoints. The research team uses parallel data that combines human translations with high quality synthetic translations generated by Gemini models.

Synthetic data is produced from monolingual sources with a multi step procedure. The pipeline selects candidate sentences and short documents, feeds them to Gemini 2.5 Flash, and then filters outputs with MetricX 24 QE to keep only examples that show clear quality gains. This is applied across all WMT24++ language pairs plus 30 more language pairs.
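The filtering step can be pictured as a simple quality gate. The sketch below is illustrative only: translate and qe_error stand in for Gemini 2.5 Flash and MetricX 24 QE, and the max_error threshold is an assumed value, not one reported in the paper (MetricX-style scores are lower-better).

def build_synthetic_pairs(monolingual_sentences, translate, qe_error, max_error=2.0):
    """Keep only (source, translation) pairs whose estimated error is low enough."""
    pairs = []
    for src in monolingual_sentences:
        hyp = translate(src)              # candidate translation from the teacher model
        if qe_error(src, hyp) <= max_error:  # lower-better QE score used as the gate
            pairs.append((src, hyp))
    return pairs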

Low resource languages receive human generated parallel data from the SMOL and GATITOS datasets. SMOL covers 123 languages and GATITOS covers 170 languages. This improves coverage of scripts and language families that are under represented in publicly available web parallel data.

The final supervised fine tuning mixture also keeps 30 percent generic instruction following data from the original Gemma 3 mixture. This is important. Without it, the model would over specialize on pure translation and lose general LLM behavior such as following instructions or doing simple reasoning in context.

Training uses the Kauldron SFT (Supervised Fine tuning) tooling with the AdaFactor optimizer. The learning rate is 0.0001 with batch size 64 for 200000 steps. All model parameters are updated except the token embeddings, which are frozen. Freezing embeddings helps preserve representation quality for languages and scripts that do not appear in the supervised fine tuning data.
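A minimal PyTorch-style sketch of this setup, freezing the token embeddings and training the remaining parameters with Adafactor at a fixed learning rate of 1e-4; the checkpoint id and the omitted data and training loop wiring are assumptions for illustration, not the Kauldron configuration itself.

from transformers import AutoModelForCausalLM, Adafactor

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")  # assumed checkpoint id
model.get_input_embeddings().weight.requires_grad = False  # freeze token embeddings

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = Adafactor(trainable, lr=1e-4, scale_parameter=False, relative_step=False)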

Reinforcement learning with a translation focused reward ensemble

After supervised fine tuning, TranslateGemma runs a reinforcement learning phase on top of the same translation data mixture. The reinforcement learning objective uses several reward models.

The reward ensemble includes:

MetricX 24 XXL QE, a learned regression metric that approximates MQM scores and is used here in quality estimation mode without a reference.

Gemma AutoMQM QE, a span level error predictor fine tuned from Gemma 3 27B IT on MQM labeled data. It produces token level rewards based on error type and severity.

ChrF, a character n gram overlap metric that compares model output with synthetic references and is rescaled to match the other rewards.

A Naturalness Autorater that uses the policy model as an LLM judge and produces span level penalties for segments that do not sound like native text.

A generalist reward model from the Gemma 3 post training setup that keeps reasoning and instruction following ability intact.

TranslateGemma uses reinforcement learning algorithms that combine sequence level rewards with token level advantages. Span level rewards from AutoMQM and the Naturalness Autorater attach directly to the affected tokens. These token advantages are added to sequence advantages computed from reward to go and then batch normalized. This improves credit assignment compared with pure sequence level reinforcement learning.
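Conceptually, this credit assignment can be sketched as adding span-level rewards to a broadcast sequence-level reward and normalizing over the batch. The tensor shapes and normalization below are simplified assumptions, not the paper's exact recipe.

import torch

def combined_advantages(seq_reward, token_rewards):
    """seq_reward: (batch,) scalar reward per sequence (e.g., a QE metric).
    token_rewards: (batch, seq_len) span-level penalties/bonuses attached to tokens."""
    seq_adv = seq_reward.unsqueeze(-1).expand_as(token_rewards)  # broadcast to token positions
    adv = seq_adv + token_rewards                                # token-level credit assignment
    return (adv - adv.mean()) / (adv.std() + 1e-8)               # batch normalize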

Benchmark results on WMT24++

TranslateGemma is evaluated on the WMT24++ benchmark using MetricX 24 and Comet22. MetricX is lower better and correlates with MQM error counts. Comet22 is higher better and measures adequacy and fluency.

https://arxiv.org/pdf/2601.09012

The table above, from the research paper, summarizes results for English centered evaluation over 55 language pairs.

27B: Gemma 3 baseline has MetricX 4.04 and Comet22 83.1. TranslateGemma 27B reaches MetricX 3.09 and Comet22 84.4.

12B: Gemma 3 baseline has MetricX 4.86 and Comet22 81.6. TranslateGemma 12B reaches MetricX 3.60 and Comet22 83.5.

4B: Gemma 3 baseline has MetricX 6.97 and Comet22 77.2. TranslateGemma 4B reaches MetricX 5.32 and Comet22 80.1.

The key pattern is that TranslateGemma improves quality for every model size. At the same time, model scale interacts with specialization. The 12B TranslateGemma model surpasses the 27B Gemma 3 baseline. The 4B TranslateGemma model reaches quality similar to the 12B Gemma 3 baseline. This means a smaller translation specialized model can replace a larger baseline model for many machine translation workloads.

https://arxiv.org/pdf/2601.09012

A language level breakdown in the above appendix table from the research paper shows that these gains appear across all 55 language pairs. For example, MetricX improves from 1.63 to 1.19 for English to German, 2.54 to 1.88 for English to Spanish, 3.90 to 2.72 for English to Hebrew, and 5.92 to 4.45 for English to Swahili. Improvements are also large for harder cases such as English to Lithuanian, English to Estonian and English to Icelandic.

Human evaluation on WMT25 with MQM confirms this trend. TranslateGemma 27B usually yields lower MQM scores, that is fewer weighted errors, than Gemma 3 27B, with especially strong gains for low resource directions such as English to Marathi, English to Swahili and Czech to Ukrainian. There are two notable exceptions. For German as target both systems are very close. For Japanese to English TranslateGemma shows a regression caused mainly by named entity errors, even though other error categories improve.

Multimodal translation and interface for developers

TranslateGemma inherits the image understanding stack of Gemma 3. The research team evaluates image translation on the Vistra benchmark. They select 264 images that each contain a single text instance. The model receives only the image plus a prompt that asks it to translate the text in the image. There is no separate bounding box input and no explicit OCR step.

On this setting, TranslateGemma 27B improves MetricX from 2.03 to 1.58 and Comet22 from 76.1 to 77.7. The 4B variant shows smaller but positive gains. The 12B model improves MetricX but has a slightly lower Comet22 score than the baseline. Overall, the research team concludes that TranslateGemma retains the multimodal ability of Gemma 3 and that text translation improvements mostly carry over to image translation.

Key Takeaways

TranslateGemma is a specialized Gemma 3 variant for translation: TranslateGemma is a suite of open translation models derived from Gemma 3, with 4B, 12B and 27B parameter sizes, optimized for 55 languages through a two stage pipeline, supervised fine tuning then reinforcement learning with translation focused rewards.

Training combines Gemini synthetic data with human parallel corpora: The models are fine tuned on a mixture of high quality synthetic parallel data generated by Gemini and human translated data, which improves coverage for both high resource and low resource languages while preserving general LLM capabilities from Gemma 3.

Reinforcement learning uses an ensemble of quality estimation rewards: After supervised fine tuning, TranslateGemma applies reinforcement learning driven by an ensemble of reward models, including MetricX QE and AutoMQM, that explicitly target translation quality and fluency rather than generic chat behavior.

Smaller models match or beat larger Gemma 3 baselines on WMT24++: On WMT24++ across 55 languages, all TranslateGemma sizes show consistent improvements over Gemma 3, with the 12B model surpassing the 27B Gemma 3 baseline and the 4B model reaching quality comparable to the 12B baseline, which reduces compute requirements for a given translation quality level.

Models retain multimodal abilities and are released as open weights: TranslateGemma keeps Gemma 3 image text translation capabilities and improves performance on the Vistra image translation benchmark, and the weights are released as open models on Hugging Face and Vertex AI, enabling local and cloud deployment.


Advanced fine-tuning techniques for multi-agent orchestration: Pattern …

Our work with large enterprise customers and Amazon teams has revealed that high-stakes use cases continue to benefit significantly from advanced large language model (LLM) fine-tuning and post-training techniques. In this post, we show how fine-tuning enabled a 33% reduction in dangerous medication errors (Amazon Pharmacy), an 80% reduction in engineering human effort (Amazon Global Engineering Services), and an improvement in content quality assessment accuracy from 77% to 96% (Amazon A+). These aren’t hypothetical projections—they’re production results from Amazon teams. While many use cases can be effectively addressed through prompt engineering, Retrieval Augmented Generation (RAG) systems, and turnkey agent deployment, our work with Amazon and large enterprise accounts reveals a consistent pattern: one in four high-stakes applications—where patient safety, operational efficiency, or customer trust are on the line—demands advanced fine-tuning and post-training techniques to achieve production-grade performance.
This post details the techniques behind these outcomes: from foundational methods like Supervised Fine-Tuning (SFT, or instruction tuning) and Proximal Policy Optimization (PPO), to Direct Preference Optimization (DPO) for human alignment, to cutting-edge reasoning optimizations such as Group Relative Policy Optimization (GRPO), Direct Advantage Policy Optimization (DAPO), and Group Sequence Policy Optimization (GSPO) purpose-built for agentic systems. We walk through the technical evolution of each approach, examine real-world implementations at Amazon, present a reference architecture on Amazon Web Services (AWS), and provide a decision framework for selecting the right technique based on your use case requirements.
The continued relevance of fine-tuning in the agentic AI era
Despite the growing capabilities of foundation models and agent frameworks, roughly one in four enterprise use cases still requires advanced fine-tuning to achieve the necessary performance levels. These are typically scenarios where the stakes are high from a revenue or customer trust perspective, domain-specific knowledge is essential, enterprise integration at scale is required, governance and control are paramount, business process integration is complex, or multi-modal support is needed. Organizations pursuing these use cases have reported higher conversion to production, greater return on investment (ROI), and up to 3-fold year-over-year growth when advanced fine-tuning is appropriately applied.
Evolution of LLM fine-tuning techniques for agentic AI
The evolution of generative AI has seen several key advancements in model customization and performance optimization techniques. Starting with SFT, which uses labeled data to teach models to follow specific instructions, the field established its foundation but faced limitations in optimizing complex reasoning. To address these limitations, reinforcement learning (RL) refines the SFT process with a reward-based system that provides better adaptability and alignment with human preferences. Among multiple RL algorithms, a significant leap came with PPO, which pairs a policy network with a value (critic) network. The workflow uses a reinforcement learning policy to adjust the LLM weights under the guidance of a reward model. PPO scales well in complex environments, though it has challenges with stability and configuration complexity.
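For reference, PPO's clipped surrogate objective can be sketched in a few lines. This is a minimal illustration; production pipelines (for example, TRL-style trainers) add a critic loss, KL penalties, and advantage estimation around it, and the epsilon value here is an assumed default.

import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)            # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximize the clipped surrogate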
DPO emerged as a breakthrough in early 2024, addressing PPO’s stability issues by eliminating the explicit reward model and instead working directly with preference data that includes preferred and rejected responses for given prompts. DPO optimizes the LLM weights by comparing the preferred and rejected responses, allowing the LLM to learn and adjust its behavior accordingly. This simplified approach gained widespread adoption, with major language models incorporating DPO into their training pipelines to achieve better performance and more reliable outputs. Other alternatives, including Odds Ratio Preference Optimization (ORPO), Relative Preference Optimization (RPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO), are all RL methods for human preference alignment. By incorporating comparative and identity-based preference structures, and grounding optimization in behavioral economics, these methods are computationally efficient, interpretable, and aligned with actual human decision-making processes.
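A minimal sketch of the DPO objective, which scores the preferred ("chosen") response against the rejected one relative to a frozen reference model; the beta value is an assumed hyperparameter, and the inputs are summed log-probabilities of each response under the policy and reference models.

import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # how much the policy favors the chosen response
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # how much it favors the rejected one
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()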
As agent-based applications gained prominence in 2025, we observed increasing demands for customizing the reasoning model in agents to encode domain-specific constraints, safety guidelines, and reasoning patterns that align with agents’ intended functions (task planning, tool use, or multi-step problem solving). The objective is to improve agents’ performance in maintaining coherent plans, avoiding logical contradictions, and making appropriate decisions for domain-specific use cases. To meet these needs, GRPO was introduced to enhance reasoning capabilities and became particularly notable for its implementation in DeepSeek-R1.
The core innovation of GRPO lies in its group-based comparison approach: rather than comparing individual responses against a fixed reference, GRPO generates groups of responses and evaluates each against the average score of the group, rewarding those performing above average while penalizing those below. This relative comparison mechanism creates a competitive dynamic that encourages the model to produce higher-quality reasoning. GRPO is particularly effective for improving chain-of-thought (CoT) reasoning, which is the critical foundation for agent planning and complex task decomposition. By optimizing at the group level, GRPO captures the inherent variability in reasoning processes and trains the model to consistently outperform its own average performance.
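The group-relative advantage at the heart of GRPO can be sketched as follows; the reward tensor and the normalization constant are illustrative.

import torch

def grpo_advantages(group_rewards):
    """group_rewards: (group_size,) scores for responses sampled for the same prompt."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)  # above-average responses get positive advantage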
Some complex agent tasks might require more fine-grained and crisp corrections within long reasoning chains. DAPO addresses these use cases by building upon GRPO’s sequence-level rewards: it employs a higher clip ratio (approximately 30% higher than GRPO) to encourage more diverse and exploratory thinking, implements dynamic sampling to eliminate less meaningful samples and improve overall training efficiency, applies a token-level policy gradient loss to provide more granular feedback on lengthy reasoning chains rather than treating entire sequences as monolithic units, and incorporates overlong reward shaping to discourage excessively verbose responses that waste computational resources. Additionally, when agentic use cases require long text outputs in Mixture-of-Experts (MoE) model training, GSPO supports these scenarios by shifting the optimization from GRPO’s token-level importance weights to the sequence level. With these improvements, the newer methods (DAPO and GSPO) enable more efficient and sophisticated agent reasoning and planning strategies, while maintaining the computational efficiency and appropriate feedback resolution of GRPO.
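The practical difference between token-level and sequence-level optimization can be sketched through the importance weights each style computes; tensor shapes and the length normalization are illustrative assumptions rather than either paper's exact formulation.

import torch

def token_level_ratios(new_logprobs, old_logprobs):
    # (batch, seq_len): one importance ratio per token, as in GRPO/DAPO-style objectives
    return torch.exp(new_logprobs - old_logprobs)

def sequence_level_ratio(new_logprobs, old_logprobs, mask):
    # (batch,): length-normalized ratio over the whole sequence, as in GSPO-style objectives
    lengths = mask.sum(dim=-1).clamp(min=1)
    delta = ((new_logprobs - old_logprobs) * mask).sum(dim=-1) / lengths
    return torch.exp(delta)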
Real-world applications at Amazon
Using the fine-tuning techniques described in the previous sections, post-trained LLMs play two crucial roles in agentic AI systems. The first is in the development of specialized tool-using components and sub-agents within the broader agent architecture. These fine-tuned models act as domain experts, each optimized for specific functions. By incorporating domain-specific knowledge and constraints during the fine-tuning process, these specialized components can achieve significantly higher accuracy and reliability in their designated tasks compared to general-purpose models. The second key application is to serve as the core reasoning engine, where the foundation models are specifically tuned to excel at planning, logical reasoning, and decision-making for agents in a highly specific domain. The aim is to improve the model’s ability to maintain coherent plans and make logically sound decisions—essential capabilities for any agent system. This dual approach, combining a fine-tuned reasoning core with specialized sub-components, is emerging as a promising architecture at Amazon for evolving from LLM-driven applications to agentic systems and building more capable and reliable generative AI applications. The following table summarizes examples of multi-agent AI orchestration with advanced fine-tuning techniques.

|                    | Amazon Pharmacy                          | Amazon Global Engineering Services | Amazon A+ Content                        |
|--------------------|------------------------------------------|------------------------------------|------------------------------------------|
| Domain             | Healthcare                               | Construction and facilities        | Ecommerce                                |
| High-stakes factor | Patient safety                           | Operational efficiency             | Customer trust                           |
| Challenge          | $3.5B annual cost from medication errors | 3+ hour inspection reviews         | Quality assessment at 100 million+ scale |
| Techniques         | SFT, PPO, RLHF, advanced RL              | SFT, PPO, RLHF, advanced RL        | Feature-based fine-tuning                |
| Key outcome        | 33% reduction in medication errors       | 80% reduction in human effort      | 77%–96% accuracy                         |

Amazon Healthcare Services (AHS) began its generative AI journey two years ago with a significant challenge, when the team tackled customer service efficiency through a RAG-based Q&A system. Initial attempts using traditional RAG with foundation models yielded disappointing results, with accuracy hovering between 60 and 70%. The breakthrough came when they fine-tuned the embedding model specifically for pharmaceutical domain knowledge, resulting in a significant improvement to 90% accuracy and an 11% reduction in customer support contacts. In medication safety, medication direction errors can pose serious safety risks and cost up to $3.5 billion annually to correct. By fine-tuning a model with thousands of expert-annotated examples, Amazon Pharmacy created an agent component that validates medication directions using pharmacy logic and safety guidelines. This reduced near-miss events by 33%, as indicated in their Nature Medicine publication. In 2025, AHS is expanding their AI capabilities and transforming these separate LLM-driven applications into a holistic multi-agent system to enhance the patient experience. These individual applications driven by fine-tuned models play a crucial role in the overall agentic architecture, serving as domain expert tools that address specific mission-critical functions in pharmaceutical services.
The Amazon Global Engineering Services (GES) team, responsible for overseeing hundreds of Amazon fulfillment centers worldwide, embarked on an ambitious journey to use generative AI in their operations. Their initial foray into this technology focused on creating a sophisticated Q&A system designed to assist engineers in efficiently accessing relevant design information from vast knowledge repositories. The team’s approach was fine-tuning a foundation model using SFT, which resulted in a significant improvement in accuracy (measured by semantic similarity score) from 0.64 to 0.81. To better align with the feedback from the subject matter experts (SMEs), the team further refined the model using PPO, incorporating the human feedback data, which boosted the LLM-judge scores from 3.9 to 4.2 out of 5, a remarkable achievement that translated to a substantial 80% reduction in the effort required from the domain experts. Similar to the Amazon Pharmacy case, these fine-tuned specialized models will continue to function as domain expert tools within the broader agentic AI system.
In 2025, the GES team ventured into uncharted territory by applying agentic AI systems to optimize its business processes. LLM fine-tuning methodologies are a critical mechanism for enhancing the reasoning capabilities of AI agents, enabling effective decomposition of complex objectives into executable action sequences that align with predefined behavioral constraints and goal-oriented outcomes. They also serve as a critical architectural component for facilitating specialized task execution and optimizing task-specific performance metrics.
Amazon A+ Content powers rich product pages, with hundreds of millions of annual content submissions. The A+ team needed to evaluate content quality at scale—assessing cohesiveness, consistency, and relevancy, not just surface-level defects. Content quality directly impacts conversion and brand trust, making this a high-stakes application.
Following the architectural pattern seen in Amazon Pharmacy and Global Engineering Services, the team built a specialized evaluation agent powered by a fine-tuned model. They applied feature-based fine-tuning to Nova Lite on Amazon SageMaker—training a lightweight classifier on vision language model (VLM)-extracted features rather than updating full model parameters. This approach, enhanced by expert-crafted rubric prompts, improved classification accuracy from 77% to 96%. The result: an AI agent that evaluates millions of content submissions and delivers actionable recommendations. This demonstrates a key principle from our maturity framework—technique complexity should match task requirements. The A+ use case, while high-stakes and operating at massive scale, is fundamentally a classification task well-suited to these methods. Not every agent component requires GRPO or DAPO; selecting the right technique for each problem is what delivers efficient, production-grade systems.
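To make the feature-based approach concrete, the following is a minimal sketch that trains a lightweight classifier on pre-extracted VLM features while the base model stays frozen. The feature files, label scheme, and classifier choice are illustrative assumptions, not the A+ team's actual pipeline.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# One row of VLM-extracted features per content submission (hypothetical files and shapes)
X = np.load("aplus_vlm_features.npy")   # shape: (num_submissions, feature_dim)
y = np.load("aplus_rubric_labels.npy")  # shape: (num_submissions,), expert rubric labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# "Feature-based fine-tuning": only this small classifier head is trained;
# the underlying vision language model's weights stay frozen.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))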
Reference architecture for advanced AI orchestration using fine-tuning
Although fine-tuned models serve diverse purposes across different domains and use cases in an agentic AI system, the anatomy of an agent remains largely consistent and can be encompassed in component groupings, as shown in the following architecture diagram.

This modular approach uses a number of AWS generative AI services, including Amazon Bedrock AgentCore, Amazon SageMaker, and Amazon Bedrock. It maintains the structure of the key groupings that make up an agent while providing various options within each group to improve the AI agent.

LLM customization for AI agents

Builders can use various AWS services to fine-tune and post-train the LLMs for an AI agent using the techniques discussed in the previous section. If you use LLMs on Amazon Bedrock for your agents, you can use multiple model customization approaches to fine-tune your models. Distillation and SFT through parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) can be used to address simple customization tasks. For advanced fine-tuning, Continued Pre-training (CPT) extends a foundation model’s knowledge by training on domain-specific corpora (medical literature, legal documents, or proprietary technical content), embedding specialized vocabulary and domain reasoning patterns directly into model weights. Reinforcement fine-tuning (RFT), launched at re:Invent 2025, teaches models to understand what makes a quality response without large amounts of pre-labeled training data. There are two approaches supported for RFT: Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based graders for objective tasks like code generation or math reasoning, while Reinforcement Learning from AI Feedback (RLAIF) uses AI-based judges for subjective tasks like instruction following or content moderation.
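As a rough illustration of the Amazon Bedrock customization path, the following sketch starts a fine-tuning job with the AWS SDK for Python (boto3). The base model identifier, S3 locations, IAM role, and hyperparameter keys are placeholders; check the Amazon Bedrock documentation for the values your chosen model supports.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="agent-component-finetune-001",
    customModelName="my-domain-expert-model",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",  # placeholder role
    baseModelIdentifier="amazon.nova-lite-v1:0",                        # placeholder base model
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/custom-model-output/"},
    hyperParameters={        # supported keys and ranges vary by base model
        "epochCount": "2",
        "learningRate": "0.00001",
    },
)
print(response["jobArn"])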
If you require deeper control over model customization infrastructure for your AI agents, Amazon SageMaker AI provides a comprehensive platform for custom model development and fine-tuning. Amazon SageMaker JumpStart accelerates the customization journey by offering pre-built solutions with one-click deployment of popular foundation models (Llama, Mistral, Falcon, and others) and end-to-end fine-tuning notebooks that handle data preparation, training configuration, and deployment workflows. Amazon SageMaker Training jobs provide managed infrastructure for executing custom fine-tuning workflows, automatically provisioning GPU instances, managing training execution, and handling cleanup after completion. This approach suits most fine-tuning scenarios where standard instance configurations provide sufficient compute power and training completes reliably within the job duration limits. You can use SageMaker Training jobs with custom Docker containers and code dependencies housing any machine learning (ML) framework, training library, or optimization technique, enabling experimentation with emerging methods beyond managed offerings.
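For the SageMaker Training jobs path, a minimal sketch with the SageMaker Python SDK and a custom container might look like the following; the image URI, role, instance type, and hyperparameters are assumptions for illustration.

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-finetune-image:latest",  # custom container
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                       # placeholder role
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    hyperparameters={"epochs": 3, "learning_rate": 1e-5, "method": "dpo"},
    output_path="s3://my-bucket/model-artifacts/",
)

# SageMaker provisions the GPU instance, runs the container, and cleans up when training ends.
estimator.fit({"train": "s3://my-bucket/preference-pairs/"})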
At re:Invent 2025, Amazon SageMaker HyperPod introduced two capabilities for large-scale model customization: Checkpointless training reduces checkpoint-restart cycles, shortening recovery time from hours to minutes. Elastic training automatically scales workloads to use idle capacity and yields resources when higher-priority workloads peak. These features build on the core strengths of HyperPod—resilient distributed training clusters with automatic fault recovery for multi-week jobs spanning thousands of GPUs. HyperPod supports NVIDIA NeMo and AWS Neuronx frameworks, and is ideal when training scale, duration, or reliability requirements exceed what job-based infrastructure can economically provide.
For builders who want to customize models in SageMaker AI without managing infrastructure, Amazon SageMaker AI serverless customization, launched at re:Invent 2025, provides a fully managed, UI- and SDK-driven experience for model fine-tuning. This capability handles infrastructure management for you—SageMaker automatically selects and provisions appropriate compute resources (P5, P4de, P4d, and G5 instances) based on model size and training requirements. Through the SageMaker Studio UI, you can customize popular models (Amazon Nova, Llama, DeepSeek, GPT-OSS, and Qwen) using advanced techniques including SFT, DPO, RLVR, and RLAIF. You can also run the same serverless customization using the SageMaker Python SDK in your Jupyter notebook. The serverless approach provides pay-per-token pricing, automatic resource cleanup, integrated MLflow experiment tracking, and seamless deployment to both Amazon Bedrock and SageMaker endpoints.
If you need to customize Amazon Nova models for your agentic workflow, you can do so through recipes and train them on SageMaker AI. This provides an end-to-end customization workflow, including model training, evaluation, and deployment for inference, with greater flexibility and control to fine-tune the Nova models, optimize hyperparameters with precision, and implement techniques such as LoRA PEFT, full-rank SFT, DPO, RFT, CPT, PPO, and so on. For Nova models on Amazon Bedrock, you can also train your models with SFT and RFT, using reasoning content to capture intermediate thinking steps or reward-based optimization when exact correct answers are difficult to define. If you have more advanced agentic use cases that require deeper model customization, you can use Amazon Nova Forge—launched at re:Invent 2025—to build your own frontier models from early model checkpoints, blend your datasets with Amazon Nova-curated training data, and host your custom models securely on AWS.

AI agent development environments and SDKs

The development environment is where developers author, test, and iterate on agent logic before deployment. Developers use integrated development environments (IDEs) such as SageMaker AI Studio (Jupyter notebooks or code editors), Amazon Kiro, or IDEs on local machines like PyCharm. Agent logic is implemented using specialized SDKs and frameworks that abstract orchestration complexity—Strands provides a Python framework purpose-built for multi-agent systems, offering declarative agent definitions, built-in state management, and native AWS service integrations that handle the low-level details of LLM API calls, tool invocation protocols, error recovery, and conversation management. With these development tools handling the low-level details, developers can focus on business logic rather than infrastructure design and maintenance.
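As an illustration of the developer experience, a minimal Strands-style agent definition might look like the sketch below. The tool, system prompt, and question are hypothetical, and the example assumes the SDK's default Amazon Bedrock model provider.

from strands import Agent, tool

@tool
def retrieve_design_spec(query: str) -> str:
    """Look up fulfillment-center design information (stub for a real retrieval backend)."""
    return "Conveyor clearance spec: ..."   # placeholder response

agent = Agent(
    tools=[retrieve_design_spec],
    system_prompt="You are a facilities engineering assistant. Cite the spec you used.",
)

# The SDK handles the LLM calls, tool invocation, and conversation state.
result = agent("What clearance is required around conveyor type X?")
print(result)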

AI agent deployment and operation

After your AI agent is developed and ready to deploy to production, you can use Amazon Bedrock AgentCore to handle agent execution, memory, security, and tool integration without requiring infrastructure management. Bedrock AgentCore provides a set of integrated services, including:

AgentCore Runtime offers purpose-built environments that abstract away infrastructure management, while container-based alternatives (SageMaker AI jobs, AWS Lambda, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Service (Amazon ECS)) provide more control for custom requirements. Essentially, the runtime is where your carefully crafted agent code meets real users and delivers business value at scale (see the sketch after this list).
AgentCore Memory gives your AI agents the ability to remember past interactions, enabling them to provide more intelligent, context-aware, and personalized conversations. It provides a straightforward and powerful way to handle both short-term context and long-term knowledge retention without the need to build or manage complex infrastructure.
With AgentCore Gateway, developers can build, deploy, discover, and connect to tools at scale, providing observability into tool usage patterns, error handling for failed invocations, and integration with identity systems for accessing tools on behalf of users (using OAuth or API keys). Teams can update tool backends, add new capabilities, or modify authentication requirements without redeploying agents because the gateway architecture decouples tool implementation from agent logic—maintaining flexibility as business requirements evolve.
AgentCore Observability helps you trace, debug, and monitor agent performance in production environments. It provides real-time visibility into agent operational performance through access to dashboards powered by Amazon CloudWatch and telemetry for key metrics such as session count, latency, duration, token usage, and error rates, using the OpenTelemetry (OTEL) protocol standard.
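The following is a minimal sketch of how agent code might be packaged for AgentCore Runtime, assuming the bedrock-agentcore Python SDK's application-and-entrypoint pattern; the handler logic is a placeholder rather than a working agent.

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    """Receive a request payload and return the agent's response (placeholder logic)."""
    user_message = payload.get("prompt", "")
    # In a real deployment, this is where the fine-tuned reasoning model and its tools run.
    return {"result": f"Echoing for illustration: {user_message}"}

if __name__ == "__main__":
    app.run()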

LLM and AI agent evaluation

When your fine-tuned LLM-driven AI agents are running in production, it’s important to evaluate and monitor your models and agents continuously to ensure high quality and performance. Many enterprise use cases require custom evaluation criteria that encode domain expertise and business rules. For the Amazon Pharmacy medication direction validation process, evaluation criteria include:

Drug-drug interaction detection accuracy – Percentage of known contraindications correctly identified
Dosage calculation precision – Correct dosing adjustments for age, weight, and renal function
Near-miss prevention rate – Reduction in medication errors that could cause patient harm
FDA labeling compliance – Adherence to approved usage, warnings, and contraindications
Pharmacist override rate – Percentage of agent recommendations accepted without modification by licensed pharmacists
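As a simple illustration of how such domain-specific criteria can be tracked, the sketch below computes a few of them from a hypothetical SME review log; the record schema and field names are assumptions.

from typing import Iterable

def evaluate_medication_agent(records: Iterable[dict]) -> dict:
    """Compute example evaluation metrics from SME-reviewed agent decisions."""
    records = list(records)
    total = max(len(records), 1)
    interactions = [r for r in records if r["known_interaction"]]
    return {
        # Share of known contraindications the agent flagged
        "interaction_detection_accuracy":
            sum(r["agent_flagged_interaction"] for r in interactions) / max(len(interactions), 1),
        # Share of agent recommendations accepted without modification by licensed pharmacists
        "pharmacist_acceptance_rate":
            sum(r["accepted_without_edit"] for r in records) / total,
        # Share of reviewed cases where a potential near-miss was prevented
        "near_miss_prevention_rate":
            sum(r["near_miss_prevented"] for r in records) / total,
    }

# Toy usage with two hypothetical review records
print(evaluate_medication_agent([
    {"known_interaction": True, "agent_flagged_interaction": True,
     "accepted_without_edit": True, "near_miss_prevented": True},
    {"known_interaction": False, "agent_flagged_interaction": False,
     "accepted_without_edit": True, "near_miss_prevented": False},
]))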
For your models on Amazon Bedrock, you can use Amazon Bedrock evaluations to generate predefined metrics and human review workflows. For advanced scenarios, you can use SageMaker Training jobs to fine-tune specialized judge models on domain-specific evaluation datasets. For holistic AI agent evaluation, AgentCore Evaluations, launched at re:Invent 2025, provides automated assessment tools to measure how well your agent or its tools perform on completing specific tasks, handling edge cases, and maintaining consistency across different inputs and contexts.
Decision guide and recommended phased approach
Now that you understand the technical evolution of advanced fine-tuning techniques—from SFT to PPO, DPO, GRPO, DAPO and GSPO—the critical question becomes when and why you should use them. Our experience shows that organizations using a phased maturity approach achieve 70–85% production conversion rates (compared to the 30–40% industry average) and 3-fold year-over-year ROI growth. The 12–18 month journey from initial agent deployment to advanced reasoning capabilities delivers incremental business value at each phase. The key is letting your use case requirements, available data, and measured performance guide advancement—not technical sophistication for its own sake.
The maturity path progresses through four phases (shown in the following table). Strategic patience in this progression builds reusable infrastructure, collects quality training data, and validates ROI before major investments. As our examples demonstrate, aligning technical sophistication with human and business needs delivers transformative outcomes and sustainable competitive advantages in your most critical AI applications.

| Phase | Timeline | When to use | Key outcomes | Data needed | Investment |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Prompt engineering | 6–8 weeks | Starting agent journey; validating business value; simple workflows | 60–75% accuracy; failure patterns identified | Minimal prompts, examples | $50K–$80K (2–3 full-time employees (FTE)) |
| Phase 2: Supervised Fine-Tuning (SFT) | 12 weeks | Domain knowledge gaps; industry terminology issues; need for 80–85% accuracy | 80–85% accuracy; 60–80% SME effort reduction | 500–5,000 labeled examples | $120K–$180K (3–4 FTE and compute) |
| Phase 3: Direct Preference Optimization (DPO) | 16 weeks | Quality/style alignment; safety/compliance critical; brand consistency needed | 85–92% accuracy; CSAT improvement over 20% | 1,000–10,000 preference pairs | $180K–$280K (4–5 FTE and compute) |
| Phase 4: GRPO and DAPO | 24 weeks | Complex reasoning required; high-stakes decisions; multi-step orchestration; explainability essential | 95–98% accuracy; mission-critical deployment | 10,000+ reasoning trajectories | $400K–$800K (6–8 FTE and HyperPod) |

Conclusion
While agents have transformed how we build AI systems, advanced fine-tuning remains a critical component for enterprises seeking competitive advantage in high-stakes domains. By understanding the evolution of techniques like PPO, DPO, GRPO, DAPO, and GSPO, and applying them strategically within agent architectures, organizations can achieve significant improvements in accuracy, efficiency, and safety. The real-world examples from Amazon demonstrate that the combination of agentic workflows with carefully fine-tuned models delivers dramatic business outcomes.
AWS continues to accelerate these capabilities with several key launches at re:Invent 2025. Reinforcement fine-tuning (RFT) on Amazon Bedrock now enables models to learn quality responses through RLVR for objective tasks and RLAIF for subjective evaluations—without requiring large amounts of pre-labeled data. Amazon SageMaker AI Serverless Customization eliminates infrastructure management for fine-tuning, supporting SFT, DPO, and RLVR techniques with pay-per-token pricing. For large-scale training, Amazon SageMaker HyperPod introduced checkpointless training and elastic scaling to reduce recovery time and optimize resource utilization. Amazon Nova Forge empowers enterprises to build custom frontier models from early checkpoints, blending proprietary datasets with Amazon-curated training data. Finally, AgentCore Evaluation provides automated assessment tools to measure agent performance on task completion, edge cases, and consistency—closing the loop on production-grade agentic AI systems.
As you evaluate your generative AI strategy, use the decision guide and phased maturity approach outlined in this post to identify where advanced fine-tuning can tip the scales from good enough to transformative. Use the reference architecture as a baseline to structure your agentic AI systems, and use the capabilities introduced at re:Invent 2025 to accelerate your journey from initial agent deployment to production-grade outcomes.

About the authors
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Kristine Pearce is a Principal Worldwide Generative AI GTM Specialist at AWS, focused on SageMaker AI model customization, optimization, and inference at scale. She combines her MBA, BS Industrial Engineering background, and human-centered design expertise to bring strategic depth and behavioral science to AI-enabled transformation. Outside work, she channels her creativity through art.
Harsh Asnani is a Worldwide Generative AI Specialist Solutions Architect at AWS specializing in ML theory, MLOPs, and production generative AI frameworks. His background is in applied data science with a focus on operationalizing AI workloads in the cloud at scale.
Sung-Ching Lin is a Principal Engineer at Amazon Pharmacy, where he leads the design and adoption of AI/ML systems to improve customer experience and operational efficiency. He focuses on building scalable, agent-based architectures, ML evaluation frameworks, and production-ready AI solutions in regulated healthcare domains.
Elad Dwek is a Senior AI Business Developer at Amazon, working within Global Engineering, Maintenance, and Sustainability. He partners with stakeholders from business and tech side to identify opportunities where AI can enhance business challenges or completely transform processes, driving innovation from prototyping to production. With a background in construction and physical engineering, he focuses on change management, technology adoption, and building scalable, transferable solutions that deliver continuous improvement across industries. Outside of work, he enjoys traveling around the world with his family.
Carrie Song is a Senior Program Manager at Amazon, working on AI-powered content quality and customer experience initiatives. She partners with applied science, engineering, and UX teams to translate generative AI and machine learning insights into scalable, customer-facing solutions. Her work focuses on improving content quality and streamlining the shopping experience on product detail pages.

How Palo Alto Networks enhanced device security infra log analysis wit …

This post is co-written by Fan Zhang, Sr Principal Engineer / Architect from Palo Alto Networks.
Palo Alto Networks’ Device Security team wanted to detect early warning signs of potential production issues to provide more time to SMEs to react to these emerging problems. The primary challenge they faced was that reactively processing over 200 million daily service and application log entries resulted in delayed response times to these critical issues, leaving them at risk for potential service degradation.
To address this challenge, they partnered with the AWS Generative AI Innovation Center (GenAIIC) to develop an automated log classification pipeline powered by Amazon Bedrock. The solution achieved 95% precision in detecting production issues while reducing incident response times by 83%.
In this post, we explore how to build a scalable and cost-effective log analysis system using Amazon Bedrock to transform reactive log monitoring into proactive issue detection. We discuss how Amazon Bedrock, through Anthropic’s Claude Haiku model, and Amazon Titan Text Embeddings work together to automatically classify and analyze log data. We explore how this automated pipeline detects critical issues, examine the solution architecture, and share implementation insights that have delivered measurable operational improvements.
Palo Alto Networks offers Cloud-Delivered Security Services (CDSS) to tackle device security risks. Their solution uses machine learning and automated discovery to provide visibility into connected devices, enforcing Zero Trust principles. Teams facing similar log analysis challenges can find practical insights in this implementation.
Solution overview
Palo Alto Networks’ automated log classification system helps their Device Security team detect and respond to potential service failures ahead of time. The solution processes over 200 million service and application logs daily, automatically identifying critical issues before they escalate into service outages that impact customers.
The system uses Amazon Bedrock with Anthropic’s Claude Haiku model to understand log patterns and classify severity levels, and Amazon Titan Text Embeddings enables intelligent similarity matching. Amazon Aurora provides a caching layer that makes processing massive log volumes feasible in real time. The solution integrates seamlessly with Palo Alto Networks’ existing infrastructure, helping the Device Security team focus on preventing outages instead of managing complex log analysis processes.
Palo Alto Networks and the AWS GenAIIC collaborated to build a solution with the following capabilities:

Intelligent deduplication and caching – The system scales by intelligently identifying duplicate log entries for the same code event. Rather than using a large language model (LLM) to classify every log individually, the system first identifies duplicates through exact matching, then uses overlap similarity, and finally employs semantic similarity only if no earlier match is found (see the sketch after this list). This approach cost-effectively reduces the 200 million daily logs by over 99%, leaving only logs that represent unique events. The caching layer enables real-time processing by reducing redundant LLM invocations.
Context retrieval for unique logs – For unique logs, Anthropic’s Claude Haiku model on Amazon Bedrock classifies each log’s severity. The model processes the incoming log along with relevant labeled historical examples, which are dynamically retrieved at inference time through vector similarity search. Over time, labeled examples are added to provide rich context to the LLM for classification. This context-aware approach improves accuracy for Palo Alto Networks’ internal logs and systems, as well as for evolving log patterns that traditional rule-based systems struggle to handle.
Classification with Amazon Bedrock – The solution provides structured predictions, including severity classification (Priority 1 (P1), Priority 2 (P2), Priority 3 (P3)) and detailed reasoning for each decision. This comprehensive output helps Palo Alto Networks’ SMEs quickly prioritize responses and take preventive action before potential outages occur.
Integration with existing pipelines for action – Results integrate with their existing FluentD and Kafka pipeline, with data flowing to Amazon Simple Storage Service (Amazon S3) and Amazon Redshift for further analysis and reporting.
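A simplified sketch of the tiered matching logic referenced above is shown below; the embedding function, cache structures, and thresholds are illustrative assumptions rather than Palo Alto Networks' implementation.

import hashlib

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def lookup_or_mark_unique(log_line, exact_cache, recent_logs, embed, semantic_index,
                          overlap_threshold=0.9, semantic_threshold=0.95):
    # Stage 1a: exact match on a hash of the normalized log line
    key = hashlib.sha256(log_line.strip().lower().encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]                 # reuse the cached classification

    # Stage 1b: cheap token-overlap similarity against recently classified logs
    for seen_line, label in recent_logs:
        if token_overlap(log_line, seen_line) >= overlap_threshold:
            return label

    # Stage 1c: semantic similarity via embeddings (for example, Amazon Titan Text Embeddings)
    match = semantic_index.nearest(embed(log_line))   # hypothetical vector-store lookup
    if match is not None and match.score >= semantic_threshold:
        return match.label

    # Truly unique: route to Stage 2/3 (context retrieval and LLM classification)
    return None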

The following diagram (Figure 1) illustrates how the three-stage pipeline processes Palo Alto Networks’ 200 million daily log volume while balancing scale, accuracy, and cost-efficiency. The architecture consists of the following key components:

Data ingestion layer – FluentD and Kafka pipeline for incoming logs
Processing pipeline – Consisting of the following stages:

Stage 1: Smart caching and deduplication – Aurora for exact matching and Amazon Titan Text Embeddings for semantic matching
Stage 2: Context retrieval – Amazon Titan Text Embeddings for embedding historical labeled examples, plus vector similarity search
Stage 3: Classification – Anthropic’s Claude Haiku model for severity classification (P1/P2/P3)

Output layer – Aurora, Amazon S3, Amazon Redshift, and SME review interface

Figure 1: Automated log classification system architecture

The processing workflow moves through the following stages:

Stage 1: Smart caching and deduplication – Incoming logs from Palo Alto Networks’ FluentD and Kafka pipeline are immediately processed through an Aurora-based caching layer. The system first applies exact matching, then falls back to overlap similarity, and finally uses semantic similarity through Amazon Titan Text Embeddings if no earlier match is found. During testing, this approach identified that more than 99% of logs corresponded to duplicate events, even though they contained different time stamps, log levels, and phrasing. The caching system shortened response times for cached results and cut unnecessary LLM processing.
Stage 2: Context retrieval for unique logs – The remaining less than 1% of truly unique logs require classification. For these entries, the system uses Amazon Titan Text Embeddings to identify the most relevant historical examples from Palo Alto Networks’ labeled dataset. Rather than using static examples, this dynamic retrieval makes sure each log receives contextually appropriate guidance for classification.
Stage 3: Classification with Amazon Bedrock – Unique logs and their selected examples are processed by Amazon Bedrock using Anthropic’s Claude Haiku model. The model analyzes the log content alongside relevant historical examples to produce severity classifications (P1, P2, P3) and detailed explanations. Results are stored in Aurora and the cache and integrated into Palo Alto Networks’ existing data pipeline for SME review and action.

This architecture enables cost-effective processing of massive log volumes while maintaining 95% precision for critical P1 severity detection. The system uses carefully crafted prompts that combine domain expertise with dynamically selected examples:
system_prompt = """
<Task>
You are an expert log analysis system responsible for classifying production system logs based on severity. Your analysis helps engineering teams prioritize their response to system issues and maintain service reliability.
</Task>
<Severity_Definitions>
P1 (Critical): Requires immediate action – system-wide outages, repeated application crashes
P2 (High): Warrants attention during business hours – performance issues, partial service disruption
P3 (Low): Can be addressed when resources available – minor bugs, authorization failures, intermittent network issues
</Severity_Definitions>

<Examples>
<log_snippet>
2024-08-17 01:15:00.00 [warn] failed (104: Connection reset by peer) while reading response header from upstream
</log_snippet>
severity: P3
category: Category A

<log_snippet>
2024-08-18 17:40:00.00 <warn> Error: Request failed with status code 500 at settle
</log_snippet>
severity: P2
category: Category B

</Examples>

<Target_Log>
Log: {incoming_log_snippet}
Location: {system_location}
</Target_Log>

Provide severity classification (P1/P2/P3) and detailed reasoning.
"""
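To illustrate how a prompt like this might be sent to the model, the following sketch uses the Amazon Bedrock Converse API through boto3. The model ID, region, and inference settings are assumptions, and production code would add caching, retries, and output parsing.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def classify_log(incoming_log_snippet: str, system_location: str) -> str:
    prompt = system_prompt.format(
        incoming_log_snippet=incoming_log_snippet,
        system_location=system_location,
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    # The model returns the severity (P1/P2/P3) and its reasoning as free text.
    return response["output"]["message"]["content"][0]["text"]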
Implementation insights
The core value of Palo Alto Networks’ solution lies in making an insurmountable challenge manageable: AI helps their team analyze 200 million daily log entries efficiently, while the system’s dynamic adaptability makes it possible to extend the solution into the future by adding more labeled examples. Palo Alto Networks’ successful implementation of their automated log classification system yielded key insights that can help organizations building production-scale AI solutions:

Continuous learning systems deliver compounding value – Palo Alto Networks designed their system to improve automatically as SMEs validate classifications and label new examples. Each validated classification becomes part of the dynamic few-shot retrieval dataset, improving accuracy for similar future logs while increasing cache hit rates. This approach creates a cycle where operational use enhances system performance and reduces costs.
Intelligent caching enables AI at production scale – The multi-layered caching architecture processes more than 99% of logs through cache hits, transforming expensive per-log LLM operations into a cost-effective system capable of handling 200 million daily volumes. This foundation makes AI processing economically viable at enterprise scale while maintaining response times.
Adaptive systems handle evolving requirements without code changes – The solution accommodates new log categories and patterns without requiring system modifications. When performance needs improvement for novel log types, SMEs can label additional examples, and the dynamic few-shot retrieval automatically incorporates this knowledge into future classifications. This adaptability allows the system to scale with business needs.
Explainable classifications drive operational confidence – SMEs responding to critical alerts require confidence in AI recommendations, particularly for P1 severity classifications. By providing detailed reasoning alongside each classification, Palo Alto Networks enables SMEs to quickly validate decisions and take appropriate action. Clear explanations transform AI outputs from predictions into actionable intelligence.

These insights demonstrate how AI systems designed for continuous learning and explainability become increasingly valuable operational assets.
Conclusion
Palo Alto Networks’ automated log classification system demonstrates how generative AI powered by AWS helps operational teams manage vast log volumes in real time. In this post, we explored how an architecture combining Amazon Bedrock, Amazon Titan Text Embeddings, and Aurora processes 200 million daily logs through intelligent caching and dynamic few-shot learning, enabling proactive detection of critical issues with 95% precision. Palo Alto Networks’ automated log classification system delivered concrete operational improvements:

95% precision, 90% recall for P1 severity logs – Critical alerts are accurate and actionable, minimizing false alarms while catching 9 out of 10 urgent issues, leaving the remaining alerts to be captured by existing monitoring systems
83% reduction in debugging time – SMEs spend less time on routine log analysis and more time on strategic improvements
Over 99% cache hit rate – The intelligent caching layer processes the 200 million daily log volume cost-effectively through subsecond responses
Proactive issue detection – The system identifies potential problems before they impact customers, preventing the multi-week outages that previously disrupted service
Continuous improvement – Each SME validation automatically improves future classifications and increases cache efficiency, resulting in reduced costs

For organizations evaluating AI initiatives for log analysis and operational monitoring, Palo Alto Networks’ implementation offers a blueprint for building production-scale systems that deliver measurable improvements in operational efficiency and cost reduction. To build your own generative AI solutions, explore Amazon Bedrock for managed access to foundation models. For additional guidance, check out the AWS Machine Learning resources and browse implementation examples in the AWS Artificial Intelligence Blog.
The collaboration between Palo Alto Networks and the AWS GenAIIC demonstrates how thoughtful AI implementation can transform reactive operations into proactive, scalable systems that deliver sustained business value.
To get started with Amazon Bedrock, see Build generative AI solutions with Amazon Bedrock.

About the authors

Rizwan Mushtaq
Rizwan is a Principal Solutions Architect at AWS. He helps customers design innovative, resilient, and cost-effective solutions using AWS services. He holds an MS in Electrical Engineering from Wichita State University.

Hector Lopez
Hector Lopez, PhD is an Applied Scientist in AWS’s Generative AI Innovation Center, where he specializes in delivering production-ready generative AI solutions and proof-of-concepts across diverse industry applications. His expertise spans traditional machine learning and data science in life and physical sciences. Hector implements a first-principles approach to customer solutions, working backwards from core business needs to help organizations understand and leverage generative AI tools for meaningful business transformation.

Meena Menon
Meena Menon is a Sr. Customer Success Manager at AWS with over 20 years of experience delivering enterprise customer outcomes and digital transformation. At AWS, she partners with strategic ISVs including Palo Alto Networks, Proofpoint, New Relic, and Splunk to accelerate cloud modernization and migrations.

Fan Zhang
Fan is a Senior Principal Engineer/Architect at Palo Alto Networks, leading the IoT Security team’s infrastructure and data pipeline, as well as its generative AI infrastructure.

From beginner to champion: A student’s journey through the AWS AI Le …

The AWS AI League, launched by Amazon Web Services (AWS), expanded its reach to the Association of Southeast Asian Nations (ASEAN) last year, welcoming student participants from Singapore, Indonesia, Malaysia, Thailand, Vietnam, and the Philippines. The goal was to introduce students of all backgrounds and experience levels to the exciting world of generative AI through a gamified, hands-on challenge focused on fine-tuning large language models (LLMs).
In this blog post, you’ll hear directly from the AWS AI League champion, Blix D. Foryasen, as he shares his reflection on the challenges, breakthroughs, and key lessons discovered throughout the competition.
Behind the competition
The AWS AI League competition began with a tutorial session led by the AWS team and the Gen-C Generative AI Learning Community, featuring two powerful user-friendly services: Amazon SageMaker JumpStart and PartyRock.

SageMaker JumpStart enabled participants to run the LLM fine-tuning process in a cloud-based environment, offering flexibility to adjust hyperparameters and optimize performance.
PartyRock, powered by Amazon Bedrock, provided an intuitive playground and interface to curate the dataset used in fine-tuning a Llama 3.2 3B Instruct model. Amazon Bedrock offers a comprehensive selection of high-performing foundation models from leading AI companies, including Anthropic Claude, Meta Llama, Mistral, and more; all accessible through a single API.

With the goal of outperforming a larger LLM reference model in a quiz-based evaluation, participants engaged with three core domains of generative AI: foundation models, responsible AI, and prompt engineering. The preliminary round featured an open leaderboard ranking the best-performing fine-tuned models from across the region. Each submitted model was tested against a larger baseline LLM using an automated, quiz-style evaluation of generative AI-related questions. The evaluation, conducted by an undisclosed LLM judge, prioritized both accuracy and comprehensiveness. A model’s win rate improved each time it outperformed the baseline LLM.

Beyond its technical demands, the challenge also required strategic planning. Participants had to maximize their limited training hours on SageMaker JumpStart while carefully managing a restricted number of leaderboard submissions. Initially capped at 5 hours, the limit was later expanded to 30 hours in response to community feedback. Submission count would also influence tiebreakers for finalist selection.
The top tuner from each country advanced to the Regional Grand Finale, held on May 29, 2025, in Singapore. There, finalists competed head-to-head, each presenting their fine-tuned model’s responses to a new set of questions. Final scores were determined by a weighted judging system:

40% by an LLM-as-a-judge
40% by experts
20% by a live audience

A pragmatic approach to fine-tuning
Before diving into the technical details, a quick disclaimer: the approaches shared in the following sections are largely experimental and born from trial and error. They’re not necessarily the most optimal methods for fine-tuning, nor do they represent a definitive guide. Other finalists had different approaches because of different technical backgrounds. What ultimately helped me succeed wasn’t just technical precision, but collaboration, resourcefulness, and a willingness to explore how the competition might unfold based on insights from previous iterations. I hope this account can serve as a baseline or inspiration for future participants who might be navigating similar constraints. Even if you’re starting from scratch, as I did, there’s real value in being strategic, curious, and community-driven.

One of the biggest hurdles I faced was time, or the lack of it. Because of a late confirmation of my participation, I joined the competition 2 weeks after it had already begun. That left me with only 2 weeks to plan, train, and iterate. Given the tight timeline and limited compute hours on SageMaker JumpStart, I knew I had to make every training session count. Rather than attempting exhaustive experiments, I focused my efforts on curating a strong dataset and tweaking select hyperparameters. Along the way, I drew inspiration from academic papers and existing approaches in LLM fine-tuning, adjusting what I could within the constraints.
Crafting synthetic brilliance
As mentioned earlier, one of the key learning sessions at the start of the competition introduced participants to SageMaker JumpStart and PartyRock, tools that make fine-tuning and synthetic data generation both accessible and intuitive. In particular, PartyRock allowed us to clone and customize apps to control how synthetic datasets were generated. We could tweak parameters such as the prompt structure, creativity level (temperature), and token sampling strategy (top-p). PartyRock also gave us access to a wide range of foundation models.

From the start, I opted to generate my datasets using Claude 3.5 Sonnet, aiming for broad and balanced coverage across all three core sub-domains of the competition. To minimize bias and ensure fair representation across topics, I curated multiple dataset versions, each ranging from 1,500 to 12,000 Q&A pairs, carefully maintaining balanced distributions across sub-domains. The following are a few example themes that I focused on:

Prompt engineering: Zero-shot prompting, chain-of-thought (CoT) prompting, evaluating prompt effectiveness
Foundation models: Transformer architectures, distinctions between pretraining and fine-tuning
Responsible AI: Dataset bias, representation fairness, and data protection in AI systems

To maintain data quality, I fine-tuned the dataset generator to emphasize factual accuracy, uniqueness, and applied knowledge. Each generation batch consisted of 10 Q&A pairs, with prompts specifically designed to encourage depth and clarity.
Question prompt:

You are a quiz master in an AI competition preparing a set of challenging quiz bee questions about [Topic to generate] The purpose of these questions is to determine the better LLM between a fine-tuned LLaMA 3.2 3B Instruct and larger LLMs. Generate [Number of data rows to generate] questions on [Topic to generate], covering:
* Basic Questions (1/3) → Direct Q&A without reasoning. Must require a clear explanation, example, or real-world application. Avoid one-word fact-based questions.
* Hybrid Questions (1/3) → Requires a short analytical breakdown (e.g., comparisons, trade-offs, weaknesses, implications). Prioritize scenario-based or real-world dilemma questions.
* Chain-of-thought (CoT) Questions (1/3) → Requires multi-step logical deductions. Focus on evaluating existing AI methods, identifying risks, and critiquing trade-offs. Avoid open-ended “Design/Propose/Create” questions. Instead, use “Compare, Evaluate, Critique, Assess, Analyze, What are the trade-offs of…”

Ensure the questions on [Topic to generate]:
* Are specific, non-trivial, and informative.
* Avoid overly simple questions (e.g., mere definitions or fact-based queries).
* Encourage applied reasoning (i.e., linking theoretical concepts to real-world AI challenges).

Answer prompt:

You are an AI expert specializing in generative AI, foundation models, agentic AI, prompt engineering, and responsible AI. Your task is to generate well-structured, logically reasoned responses to a list of [Questions], ensuring that all responses follow a chain-of-thought (CoT) approach, regardless of complexity, and formatted in valid JSONL. Here are the answering guidelines:
* Every response must be comprehensive, factually accurate, and well-reasoned.
* Every response must use a step-by-step logical breakdown, even for seemingly direct questions.
For all questions, use structured reasoning:
* For basic Questions, use a concise yet structured explanation. Simple Q&As should still follow CoT reasoning, explaining why the answer is correct rather than just stating facts.
* For hybrid and CoT questions, use Chain of Thought and analyze the problem logically before providing a concluding statement.
* If applicable, use real-world examples or research references to enhance explanations.
* If applicable, include trade-offs between different AI techniques.
* Draw logical connections between subtopics to reinforce deep understanding.

Answering prompt examples:

* Basic question (direct Q&A without reasoning) → Use concise yet comprehensive, structured responses that provide a clear, well-explained, and well-structured definition and explanation without unnecessary verbosity.
* Applications. Highlight key points step-by-step in a few comprehensive sentences.
* Complex CoT question (multi-step reasoning) → Use CoT naturally, solving each step explicitly, with in-depth reasoning

For question generation, I set the temperature to 0.7, favoring creative and novel phrasing without drifting too far from factual grounding. For answer generation, I used a lower temperature of 0.2, targeting precision and correctness. In both cases, I applied top-p = 0.9, allowing the model to sample from a focused yet diverse range of likely tokens, encouraging nuanced outputs.

One important strategic assumption I made throughout the competition was that the evaluator LLM would prefer more structured, informative, and complete responses over overly creative or brief ones. To align with this, I included reasoning steps in my answers to make them longer and more comprehensive. Research has shown that LLM-based evaluators often score detailed, well-explained answers higher, and I leaned into that insight during dataset generation.
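To make the two-pass setup concrete, here is a minimal sketch of the idea, with a hypothetical generate() helper standing in for whichever model interface you use (PartyRock apps, in my case):

def generate(prompt: str, temperature: float, top_p: float) -> str:
    """Hypothetical helper; call your model of choice here."""
    raise NotImplementedError

def build_qa_batch(topic: str, n_pairs: int = 10) -> list:
    # Pass 1: questions favor creative, varied phrasing (higher temperature)
    questions_text = generate(
        f"Generate {n_pairs} challenging quiz questions about {topic}.",
        temperature=0.7, top_p=0.9,
    )
    questions = [q.strip() for q in questions_text.splitlines() if q.strip()]

    # Pass 2: answers favor precision and step-by-step reasoning (lower temperature)
    return [
        {"instruction": q,
         "response": generate(f"Answer step by step: {q}", temperature=0.2, top_p=0.9)}
        for q in questions
    ]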
Refining the submissions
SageMaker JumpStart offers a wide array of hyperparameters to configure, which can feel overwhelming, especially when you’re racing against time and unsure of what to prioritize. Fortunately, the organizers emphasized focusing primarily on epochs and learning rate, so I homed in on those variables. Each training job with a single epoch took approximately 10–15 minutes, making time management critical. To avoid wasting valuable compute hours, I began with a baseline dataset of 1,500 rows to test combinations of epochs and learning rates. I explored:

Epochs: 1 to 4
Learning rates: 0.0001, 0.0002, 0.0003, and 0.0004

After multiple iterations, the combination of two epochs and a learning rate of 0.0003 yielded the best result, achieving a 53% win rate on my 13th leaderboard submission. Encouraged by this, I continued using this combination for several subsequent experiments, even as I expanded my dataset. Initially, this strategy appeared to work. With a dataset of approximately 3,500 rows, my model reached a 57% win rate by my 16th submission. However, as I further increased the dataset to 5,500, 6,700, 8,500, and eventually 12,000 rows, my win rate steadily declined to 53%, 51%, 45%, and 42%, respectively. At that point, it was clear that increasing dataset size alone wasn’t enough; in fact, it might have been counterproductive without revisiting the hyperparameters. With only 5 training hours remaining and 54 submissions logged, I found myself stuck at 57%, while peers like the top tuner from the Philippines were already reaching a 71% win rate.
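For reference, submitting one of these runs through the SageMaker Python SDK might look roughly like the sketch below. The JumpStart model ID, hyperparameter names, and data location are assumptions, since the competition exposed these settings through its own environment.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b-instruct",   # assumed JumpStart model ID
    environment={"accept_eula": "true"},
    hyperparameters={
        "epoch": "2",              # the combination that first reached a 53% win rate
        "learning_rate": "0.0003",
    },
)

estimator.fit({"training": "s3://my-bucket/qa-dataset/"})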
Lessons from the field
With limited time left, both for training and leaderboard submissions, I turned to cross-country collaboration for support. One of the most insightful conversations I had was with Michael Ismail Febrian, the top tuner from Indonesia and the highest scorer in the elimination round. He encouraged me to explore LoRA (low-rank adaptation) hyperparameters, specifically:

lora_r
lora_alpha
target_modules

Michael also suggested enriching my dataset by using API-generated responses from more capable teacher models, especially for answering PartyRock-generated questions. Looking back at my existing fine-tuning pipeline, I realized a critical weakness: the generated answers were often too concise or shallow. Here’s an example of a typical Q&A pair from my earlier dataset:

Instruction: What factors influence model hallucination?

Response: Let’s analyze this step by step:
* Training factors:
* Data quality and coverage
* Training methodology
* Model architecture
* Deployment factors:
* Prompt design
* Temperature settings
* Context window utilization
* Mitigation strategies:
* Retrieval augmentation
* Fact-checking mechanisms
* Confidence thresholds

Example: GPT-3 showing lower hallucination rates (15% vs 21%) when using retrieval-augmented generation compared to base model.

While this structure is clean and organized, it lacked deeper explanation for each point, something models like ChatGPT and Gemini typically do well. I suspect this limitation came from token constraints when generating multiple responses in bulk. In my case, I generated 10 responses at a time in JSONL format under a single prompt, which might have led PartyRock to truncate outputs. Not wanting to spend on paid APIs, I discovered OpenRouter.ai, which offers limited access to large models, albeit rate-limited. With a cap of roughly 200 Q&A pairs per day per account, I got creative—I created multiple accounts to support my expanded dataset. My teacher model of choice was DeepSeek R1, a popular option known for its effectiveness in training smaller, specialized models. It was a bit of a gamble, but one that paid off in terms of output quality.
As for LoRA tuning, here’s what I learned:

lora_r and lora_alpha determine how much new information the model can absorb and how complex that information can be. A common rule of thumb is setting lora_alpha to 1x or 2x of lora_r.
target_modules defines which parts of the model are updated, often the attention layers or the feed-forward network.

I also consulted Kim, the top tuner from Vietnam, who flagged my 0.0003 learning rate as potentially too high. He, along with Michael, suggested a different strategy: increase the number of epochs and reduce the learning rate. This would allow the model to better capture complex relationships and subtle patterns, especially as dataset size grows. Our conversations underscored a hard-learned truth: data quality is more important than data quantity. There’s a point of diminishing returns when increasing dataset size without adjusting hyperparameters or validating quality—something I directly experienced. In hindsight, I realized I had underestimated how vital fine-grained hyperparameter tuning is, especially when scaling data. More data demands more precise tuning to match the growing complexity of what the model needs to learn.
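To show what those LoRA settings correspond to under the hood, here is a minimal sketch using the Hugging Face peft library, with the values I eventually settled on for my final run (described in the next section); the base model load and dropout value are illustrative, and SageMaker JumpStart exposes these as training hyperparameters rather than code.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

lora_config = LoraConfig(
    r=256,               # lora_r: rank of the low-rank update matrices
    lora_alpha=256,      # scaling factor, commonly 1x to 2x of r
    target_modules=[     # which layers receive LoRA adapters
        "q_proj", "k_proj", "v_proj", "o_proj",    # attention projections
        "gate_proj", "up_proj", "down_proj",       # feed-forward layers
    ],
    lora_dropout=0.05,   # illustrative value
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable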
Last-minute gambits
Armed with fresh insights from my collaborators and hard-won lessons from previous iterations, I knew it was time to pivot my entire fine-tuning pipeline. The most significant change was in how I generated my dataset. Instead of using PartyRock to produce both questions and answers, I opted to generate only the questions in PartyRock, then feed those prompts into the DeepSeek-R1 API to generate high-quality responses. Each answer was saved in JSONL format, and, crucially, included detailed reasoning. This shift significantly increased the depth and length of each answer, averaging around 900 tokens per response, compared to the much shorter outputs from PartyRock. Given that my earlier dataset of approximately 1,500 high-quality rows produced promising results, I stuck with that size for my final dataset. Rather than scale up in quantity, I doubled down on quality and complexity. For this final round, I made bold, blind tweaks to my hyperparameters:

Dropped the learning rate to 0.00008
Increased the LoRA parameters:

lora_r = 256
lora_alpha = 256

Expanded LoRA target modules to cover both attention and feed-forward layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

These changes were made with one assumption: longer, more complex answers require more capacity to absorb and generalize nuanced patterns. I hoped that these settings would enable the model to fully use the high-quality, reasoning-rich data from DeepSeek-R1.

With only 5 hours of training time remaining, I had just enough for two full training runs, each using different epoch settings (3 and 4). It was a make-or-break moment. If the first run underperformed, I had one last chance to redeem it. Thankfully, my first test run achieved a 65% win rate, a massive improvement, but still behind the current leader from the Philippines and trailing Michael’s impressive 89%. Everything now hinged on my final training job. It had to run smoothly, avoid errors, and outperform everything I had tried before.

And it did. That final submission achieved a 77% win rate, pushing me to the top of the leaderboard and securing my slot for the Grand Finale. After weeks of experimentation, sleepless nights, setbacks, and late-game adjustments, the journey, from a two-week-late entrant to national champion, was complete.
What I wish I had known sooner
I won’t pretend that my success in the elimination round was purely technical—luck played a big part. Still, the journey revealed several insights that could save future participants valuable time, training hours, and submissions. Here are some key takeaways I wish I had known from the start:

Quality is more important than quantity: More data doesn’t always mean better results. Whether you’re adding rows or increasing context length, you’re also increasing the complexity that the model must learn from. Focus on crafting high-quality, well-structured examples rather than blindly scaling up.
Fast learner versus slow learner: If you’re avoiding deep dives into LoRA or other advanced tweaks, understanding the trade-off between learning rate and epochs is essential. A higher learning rate with fewer epochs might converge faster, but could miss the subtle patterns captured by a lower learning rate over more epochs. Choose carefully based on your data’s complexity.
Don’t neglect hyperparameters: One of my biggest missteps was treating hyperparameters as static, regardless of changes in dataset size or complexity. As your data evolves, your model settings should too. Hyperparameters should scale with your data.
Do your homework: Avoid excessive guesswork by reading relevant research papers, documentation, or blog posts. Late in the competition, I stumbled upon helpful resources that I could have used to make better decisions earlier. A little reading can go a long way.
Track everything: When experimenting, it’s easy to forget what worked and what didn’t. Maintain a log of your datasets, hyperparameter combinations, and performance outcomes. This helps optimize your runs and aids in debugging.
Collaboration is a superpower: While it’s a competition, it’s also a chance to learn. Connecting with other participants, whether they’re ahead or behind, gave me invaluable insights. You might not always walk away with a trophy, but you’ll leave with knowledge, relationships, and real growth.

Grand Finale
The Grand Finale took place on the second day of the National AI Student Challenge, serving as the culmination of weeks of experimentation, strategy, and collaboration. Before the final showdown, all national champions had the opportunity to engage in the AI Student Developer Conference, where we shared insights, exchanged lessons, and built connections with fellow finalists from across the ASEAN region. During our conversations, I was struck by how remarkably similar many of our fine-tuning strategies were. Across the board, participants had used a mix of external APIs, dataset curation techniques, and cloud-based training systems like SageMaker JumpStart. It became clear that tool selection and creative problem-solving played just as big a role as raw technical knowledge. One particularly eye-opening insight came from a finalist who achieved an 85% win rate, despite using a large dataset—something I had initially assumed might hurt performance. Their secret was training over a higher number of epochs while maintaining a lower learning rate of 0.0001. However, this came at the cost of longer training times and fewer leaderboard submissions, which highlights an important trade-off:
With enough training time, a carefully tuned model, even one trained on a large dataset, can outperform faster, leaner models.
This reinforced a powerful lesson: there’s no single correct approach to fine-tuning LLMs. What matters most is how well your strategy aligns with the time, tools, and constraints at hand.
Preparing for battle
In the lead-up to the Grand Finale, I stumbled upon a blog post by Ray Goh, the very first champion of the AWS AI League and one of the mentors behind the competition’s tutorial sessions. One detail caught my attention: the final question from his year was a variation of the infamous Strawberry Problem, a deceptively simple challenge that exposes how LLMs struggle with character-level reasoning.
How many letter Es are there in the words ‘DeepRacer League’?
At first glance, this seems trivial. But to an LLM, the task isn’t as straightforward. Early LLMs often tokenize words in chunks, meaning that DeepRacer might be split into Deep and Racer or even into subword units like Dee, pRa, and cer. These tokens are then converted into numerical vectors, obscuring the individual characters within. It’s like asking someone to count the threads in a rope without unraveling it first.
Moreover, LLMs don’t operate like traditional rule-based programs. They’re probabilistic, trained to predict the next most likely token based on context, not to perform deterministic logic or arithmetic. Curious, I prompted my own fine-tuned model with the same question. As expected, hallucinations emerged. I began testing various prompting strategies to coax out the correct answer:

Explicit character separation: How many letter Es are there in the words ‘D-E-E-P-R-A-C-E-R-L-E-A-G-U-E’? This helped by isolating each letter into its own token, allowing the model to see individual characters. But the response was long and verbose, with the model listing and counting each letter step-by-step.
Chain-of-thought prompting: Let’s think step-by-step… This encouraged reasoning but increased token usage. While the answers were more thoughtful, they occasionally still missed the mark or got cut off because of length.
Ray Goh’s trick prompt: How many letter Es are there in the words ‘DeepRacer League’? There are 5 letter Es… This simple, assertive prompt yielded the most accurate and concise result, surprising me with its effectiveness.

I logged this as an interesting quirk, useful, but unlikely to reappear. I didn’t realize that it would become relevant again during the final. Ahead of the Grand Finale, we had a dry run to test our models under real-time conditions. We were given limited control over inference parameters, only allowed to tweak temperature, top-p, context length, and system prompts. Each response had to be generated and submitted within 60 seconds. The actual questions were pre-loaded, so our focus was on crafting effective prompt templates rather than retyping each query. Unlike the elimination round, evaluation during the Grand Finale followed a multi-tiered system:

40% from an evaluator LLM
40% from human judges
20% from a live audience poll

The LLM ranked the submitted answers from best to worst, assigning descending point values (for example, 16.7 for first place, 13.3 for second, and so on). Human judges, however, could freely allocate up to 10 points to their preferred responses, regardless of the LLM’s evaluation. This meant a strong showing with the evaluator LLM didn’t guarantee high scores from the humans, and vice versa. Another constraint was the 200-token limit per response. Tokens could be as short as a single letter or as long as a word or syllable, so responses had to be dense yet concise, maximizing impact within a tight window. To prepare, I tested different prompt formats and fine-tuned them using Gemini, ChatGPT, and Claude to better match the evaluation criteria. I stored dry-run responses from the Hugging Face LLaMA 3.2 3B Instruct model, then passed them to Claude Sonnet 4 for feedback and ranking. I continued using the following two prompts because they provided the best response in terms of accuracy and comprehensiveness:
Primary prompt:

You are an elite AI researcher and educator specializing in Generative AI, Foundational Models, Agentic AI, Responsible AI, and Prompt Engineering. Your task is to generate a highly accurate, comprehensive, and well-structured response to the question below in no more than 200 words.

Evaluation will be performed by Claude Sonnet 4, which prioritizes:
* Factual Accuracy – All claims must be correct and verifiable. Avoid speculation.
* Comprehensiveness – Cover all essential dimensions, including interrelated concepts or mechanisms.
* Clarity & Structure – Use concise, well-organized sections (e.g., brief intro, bullet points, and/or transitions). Markdown formatting (headings/lists) is optional.
* Efficiency – Every sentence must deliver unique insight. Avoid filler.
* Tone – Maintain a professional, neutral, and objective tone.

Your response should be dense with value while remaining readable and precise.

Backup prompt:

You are a competitive AI practitioner with deep expertise in [Insert domain: e.g., Agentic AI or Prompt Engineering], answering a technical question evaluated by Claude Sonnet 4 for accuracy and comprehensiveness. You must respond in exactly 200 words.

Format your answer as follows:
* Direct Answer (1–2 sentences) – Immediately state the core conclusion or definition.
* Key Technical Points (3–4 bullet points) – Essential mechanisms, distinctions, or principles.
* Practical Application (1–2 sentences) – Specific real-world use cases or design implications.
* Critical Insight (1 sentence) – Mention a key challenge, trade-off, or future direction.

Additional requirements:

Use precise technical language and terminology.
Include specific tools, frameworks, or metrics if relevant.
Every sentence must contribute uniquely—no redundancy.
Maintain a formal tone and answer density without over-compression.

In terms of hyperparameters, I used:

Top-p = 0.9
Max tokens = 200
Temperature = 0.2, to prioritize accuracy over creativity
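
For reference, these settings map onto a standard Hugging Face generation call roughly as follows; this is only a sketch, since the actual inference ran on the competition platform rather than locally:

from transformers import pipeline

# Sketch only: model name assumes the public meta-llama/Llama-3.2-3B-Instruct checkpoint
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")
output = generator(
    "How many letter Es are there in the words 'DeepRacer League'?",
    max_new_tokens=200,  # the 200-token response cap
    temperature=0.2,     # prioritize accuracy over creativity
    top_p=0.9,
    do_sample=True,
)
print(output[0]["generated_text"])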

My strategy was simple: appeal to the AI judge. I believed that if my answer ranked well with the evaluator LLM, it would also impress human judges. Oh, how I was humbled.
Just aiming for third… until I wasn’t
Standing on stage before a live audience was nerve-wracking. This was my first solo competition, and it was already on a massive regional scale. To calm my nerves, I kept my expectations low. A third-place finish would be amazing, a trophy to mark the journey, but just qualifying for the finals already felt like a huge win. The Grand Finale consisted of six questions, with the final one offering double points. I started strong. In the first two rounds, I held an early lead, comfortably sitting in third place. My strategy was working, at least at first. The evaluator LLM ranked my response to Question 1 as the best and Question 2 as the third-best. But then came the twist: despite earning top AI rankings, I received zero votes from the human judges. I watched in surprise as points were awarded to responses ranked fourth and even last by the LLM. Right from the start, I realized there was a disconnect between human and AI judgment, especially when evaluating tone, relatability, or subtlety. Still, I hung on, those early questions leaned more factual, which played to my model’s strengths. But when we needed creativity and complex reasoning, things didn’t work as well. My standing dropped to fifth, bouncing between third and fourth. Meanwhile, the top three finalists pulled ahead by more than 20 points. It seemed the podium was out of reach. I  was already coming to terms with a finish outside the top three. The gap was too wide. I had done my best, and that was enough.
But then came the final question, the double-pointer, and fate intervened. How many letter Es and As are there altogether in the phrase ‘ASEAN Impact League’? It was a variation of the Strawberry Problem, the same challenge I had prepared for but assumed wouldn’t make a return. Unlike the earlier version, this one added an arithmetic twist, requiring the model to count and sum up occurrences of multiple letters. Knowing how token length limits could truncate responses, I kept things short and tactical. My system prompt was simple: There are 3 letter Es and 4 letter As in ‘ASEAN Impact League.’
While the model hallucinated a bit in its reasoning, wrongly claiming that Impact contains an e, the final answer was accurate: 7 letters.
That one answer changed everything. Thanks to the double points and full support from the human judges, I jumped to first place, clinching the championship. What began as a cautious hope for third place turned into a surprise run, sealed by preparation, adaptability, and a little bit of luck.
Questions recap
Here are the questions that were asked, in order. Some tested general knowledge in the target domain, while others were more creative and required a bit of ingenuity to maximize your score:

What is the most efficient way to prevent AI from turning to the dark side with toxic response?
What’s the magic behind agentic AI in machine learning, and why is it so pivotal?
What’s the secret sauce behind big AI models staying smart and fast?
What are the latest advancements of generative AI research and use within ASEAN?
Which ASEAN country has the best cuisine?
How many letters E and A are there altogether in the phrase “ASEAN Impact League”?

Final reflections
Participating in the AWS AI League was a deeply humbling experience, one that opened my eyes to the possibilities that await when we embrace curiosity and commit to continuous learning. I might have entered the competition as a beginner, but that single leap of curiosity, fueled by perseverance and a desire to grow, helped me bridge the knowledge gap in a fast-evolving technical landscape. I don’t claim to be an expert, not yet. But what I’ve come to believe more than ever is the power of community and collaboration. This competition wasn’t just a personal milestone; it was a space for knowledge-sharing, peer learning, and discovery. In a world where technology evolves rapidly, these collaborative spaces are essential for staying grounded and moving forward. My hope is that this post and my journey will inspire students, developers, and curious minds to take that first step, whether it’s joining a competition, contributing to a community, or tinkering with new tools. Don’t wait to be ready. Start where you are, and grow along the way. I’m excited to connect with more passionate individuals in the global AI community. If another LLM League comes around, maybe I’ll see you there.
Conclusion
As we conclude this insight into Blix’s journey to becoming the AWS AI League ASEAN champion, we hope his story inspires you to explore the exciting possibilities at the intersection of AI and innovation. Discover the AWS services that powered this competition: Amazon Bedrock, Amazon SageMaker JumpStart, and PartyRock, and visit the official AWS AI League page to join the next generation of AI innovators.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the authors
Noor Khan is a Solutions Architect at AWS supporting Singapore’s public sector education and research landscape. She works closely with academic and research institutions, leading technical engagements and designing secure, scalable architectures. As part of the core AWS AI League team, she architected and built the backend for the platform, enabling customers to explore real-world AI use cases through gamified learning. Her passions include AI/ML, generative AI, web development and empowering women in tech!
Vincent Oh is the Principal Solutions Architect in AWS for Data & AI. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions. He created the AI League in the midst of helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor in Singapore Management University (SMU), teaching computer science modules under School of Computer & Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.
Blix Foryasen is a Computer Science student specializing in Machine Learning at National University – Manila. He is passionate about data science, AI for social good, and civic technology, with a strong focus on solving real-world problems through competitions, research, and community-driven innovation. Blix is also deeply engaged with emerging technological trends, particularly in AI and its evolving applications across industries, specifically in finance, healthcare, and education.

NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression

As context lengths move into tens and hundreds of thousands of tokens, the key value cache in transformer decoders becomes a primary deployment bottleneck. The cache stores keys and values for every layer and head with shape (2, L, H, T, D). For a vanilla transformer such as Llama1-65B, the cache reaches about 335 GB at 128k tokens in bfloat16, which directly limits batch size and increases time to first token.
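
As a quick sanity check on that figure, the arithmetic for a LLaMA-65B style configuration (80 layers, 64 heads, head dimension 128) works out as follows; the architecture numbers come from the published LLaMA-65B model, not from the KVzap paper itself:

# Back-of-the-envelope KV cache size: shape (2, L, H, T, D) in bfloat16 (2 bytes per value)
layers, heads, head_dim, tokens, bytes_per_value = 80, 64, 128, 128_000, 2
cache_bytes = 2 * layers * heads * tokens * head_dim * bytes_per_value  # keys + values
print(f"{cache_bytes / 1e9:.1f} GB")  # ~335.5 GB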

https://arxiv.org/pdf/2601.07891

Architectural compression leaves the sequence axis untouched

Production models already compress the cache along several axes. Grouped Query Attention shares keys and values across multiple queries and yields compression factors of 4 in Llama3, 12 in GLM 4.5 and up to 16 in Qwen3-235B-A22B, all along the head axis. DeepSeek V2 compresses the key and value dimension through Multi head Latent Attention. Hybrid models mix attention with sliding window attention or state space layers to reduce the number of layers that maintain a full cache.

These changes do not compress along the sequence axis. Sparse and retrieval style attention retrieve only a subset of the cache at each decoding step, but all tokens still occupy memory. Practical long context serving therefore needs techniques that delete cache entries which will have negligible effect on future tokens.

The KVpress project from NVIDIA collects more than twenty such pruning methods in one codebase and exposes them through a public leaderboard on Hugging Face. Methods such as H2O, Expected Attention, DuoAttention, Compactor and KVzip are all evaluated in a consistent way.

KVzip and KVzip plus as the scoring oracle

KVzip is currently the strongest cache pruning baseline on the KVpress Leaderboard. It defines an importance score for each cache entry using a copy and paste pretext task. The model runs on an extended prompt where it is asked to repeat the original context exactly. For each token position in the original prompt, the score is the maximum attention weight that any position in the repeated segment assigns back to that token, across heads in the same group when grouped query attention is used. Low scoring entries are evicted until a global budget is met.

KVzip+ refines this score. It multiplies the attention weight by the norm of the value contribution into the residual stream and normalizes by the norm of the receiving hidden state. This better matches the actual change that a token induces in the residual stream and improves correlation with downstream accuracy compared to the original score.
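
The scoring idea can be sketched as follows; tensor shapes and function names here are assumptions for illustration, not the released KVzip or kvpress implementation:

import torch

def kvzip_scores(attn, values, hidden, ctx_len):
    # attn:   (heads, repeat_len, key_len) attention from the repeated segment back to all keys
    # values: (heads, key_len, d_v) cached value vectors
    # hidden: (repeat_len, d_model) hidden states at the receiving (repeat) positions
    # KVzip: max attention any repeat position assigns to each original token, maximized over heads
    base = attn[:, :, :ctx_len].amax(dim=(0, 1))
    # KVzip+ refinement: weight by the value norm and normalize by the receiving hidden state norm
    v_norm = values[:, :ctx_len, :].norm(dim=-1)            # (heads, ctx_len)
    h_norm = hidden.norm(dim=-1).clamp_min(1e-6)            # (repeat_len,)
    contrib = attn[:, :, :ctx_len] * v_norm[:, None, :] / h_norm[None, :, None]
    plus = contrib.amax(dim=(0, 1))
    return base, plus  # low-scoring entries are evicted until the budget is met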

These oracle scores are effective but expensive. KVzip requires prefilling on the extended prompt, which doubles the context length and makes it too slow for production. It also cannot run during decoding because the scoring procedure assumes a fixed prompt.

https://arxiv.org/pdf/2601.07891

KVzap, a surrogate model on hidden states

KVzap replaces the oracle scoring with a small surrogate model that operates directly on hidden states. For each transformer layer and each sequence position t, the module receives the hidden vector hₜ and outputs predicted log scores for every key value head. Two architectures are considered, a single linear layer (KVzap Linear) and a two layer MLP with GELU and hidden width equal to one eighth of the model hidden size (KVzap MLP).
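
A minimal PyTorch sketch of the MLP variant is shown below; class and argument names are illustrative rather than the official kvpress module:

import torch
import torch.nn as nn

class KVzapMLP(nn.Module):
    # Per-layer surrogate: hidden state -> predicted log importance score per KV head
    def __init__(self, hidden_size: int, num_kv_heads: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 8),  # hidden width of one eighth of the model size
            nn.GELU(),
            nn.Linear(hidden_size // 8, num_kv_heads),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_size) -> scores: (batch, seq_len, num_kv_heads)
        return self.net(h)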

Training uses prompts from the Nemotron Pretraining Dataset sample. The research team filter 27k prompts to lengths between 750 and 1,250 tokens, sample up to 500 prompts per subset, and then sample 500 token positions per prompt. For each key value head they obtain about 1.2 million training pairs and a validation set of 23k pairs. The surrogate learns to regress from the hidden state to the log KVzip+ score. Across models, the squared Pearson correlation between predictions and oracle scores reaches between about 0.63 and 0.77, with the MLP variant consistently outperforming the linear variant.

https://arxiv.org/pdf/2601.07891

Thresholding, sliding window and negligible overhead

During inference, the KVzap model processes hidden states and produces scores for each cache entry. Entries with scores below a fixed threshold are pruned, while a sliding window of the most recent 128 tokens is always kept. The research team provides a concise PyTorch style function that applies the model, sets scores of the local window to infinity and returns compressed key and value tensors. In all experiments, pruning is applied after the attention operation.
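
The released function is not reproduced here, but its logic can be sketched as follows, with shapes and names assumed for illustration:

import torch

def prune_kv(scores, keys, values, threshold, window=128):
    # scores: (kv_heads, seq_len) predicted log importance per cached token
    # keys, values: (kv_heads, seq_len, head_dim)
    scores = scores.clone()
    scores[:, -window:] = float("inf")              # always keep the recent sliding window
    keep = scores >= threshold                      # boolean mask per head and position
    # Real implementations repack into padded tensors; a per-head list keeps the sketch simple
    kept_keys = [k[m] for k, m in zip(keys, keep)]
    kept_values = [v[m] for v, m in zip(values, keep)]
    return kept_keys, kept_values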

KVzap uses score thresholding rather than fixed top k selection. A single threshold yields different effective compression ratios on different benchmarks and even across prompts within the same benchmark. The research team report up to 20 percent variation in compression ratio across prompts at a fixed threshold, which reflects differences in information density.

Compute overhead is small. An analysis at the layer level shows that the extra cost of KVzap MLP is at most about 1.1 percent of the linear projection FLOPs, while the linear variant adds about 0.02 percent. The relative memory overhead follows the same values. In long context regimes, the quadratic cost of attention dominates so the extra FLOPs are effectively negligible.

https://arxiv.org/pdf/2601.07891

Results on RULER, LongBench and AIME25

KVzap is evaluated on long context and reasoning benchmarks using Qwen3-8B, Llama-3.1-8B Instruct and Qwen3-32B. Long context behavior is measured on RULER and LongBench. RULER uses synthetic tasks over sequence lengths from 4k to 128k tokens, while LongBench uses real world documents from multiple task categories. AIME25 provides a math reasoning workload with 30 Olympiad level problems evaluated under pass at 1 and pass at 4.

On RULER, KVzap matches the full cache baseline within a small accuracy margin while removing a large fraction of the cache. For Qwen3-8B, the best KVzap configuration achieves a removed fraction above 0.7 on RULER 4k and 16k while keeping the average score within a few tenths of a point of the full cache. Similar behavior holds for Llama-3.1-8B Instruct and Qwen3-32B.

On LongBench, the same thresholds lead to lower compression ratios because the documents are less repetitive. KVzap remains close to the full cache baseline up to about 2 to 3 times compression, while fixed budget methods such as Expected Attention degrade more on several subsets once compression increases.

On AIME25, KVzap MLP maintains or slightly improves pass at 4 accuracy at compression near 2 times and remains usable even when discarding more than half of the cache. Extremely aggressive settings, for example linear variants at high thresholds that remove more than 90 percent of entries, collapse performance as expected.

https://arxiv.org/pdf/2601.07891

Overall, the table above shows that the best KVzap configuration per model delivers average cache compression between roughly 2.7 and 3.5 times while keeping task scores very close to the full cache baseline across RULER, LongBench and AIME25.

Key Takeaways

KVzap is an input adaptive approximation of KVzip+ that learns to predict oracle KV importance scores from hidden states using small per layer surrogate models, either a linear layer or a shallow MLP, and then prunes low score KV pairs.

Training uses Nemotron pretraining prompts where KVzip+ provides supervision, producing about 1.2 million examples per head and achieving squared correlation in the 0.6 to 0.8 range between predicted and oracle scores, which is sufficient for faithful cache importance ranking.

KVzap applies a global score threshold with a fixed sliding window of recent tokens, so compression automatically adapts to prompt information density, and the research team report up to 20 percent variation in achieved compression across prompts at the same threshold.

Across Qwen3-8B, Llama-3.1-8B Instruct and Qwen3-32B on RULER, LongBench and AIME25, KVzap reaches about 2 to 4 times KV cache compression while keeping accuracy very close to the full cache, and it achieves state of the art tradeoffs on the NVIDIA KVpress Leaderboard.

The additional compute is small, at most about 1.1 percent extra FLOPs for the MLP variant, and KVzap is implemented in the open source kvpress framework with ready to use checkpoints on Hugging Face, which makes it practical to integrate into existing long context LLM serving stacks.

Check out the Paper and GitHub Repo.
The post NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers near-Lossless 2x-4x Compression appeared first on MarkTechPost.

DeepSeek AI Researchers Introduce Engram: A Conditional Memory Axis For Sparse LLMs

Transformers use attention and Mixture-of-Experts to scale computation, but they still lack a native way to perform knowledge lookup. They re-compute the same local patterns again and again, which wastes depth and FLOPs. DeepSeek’s new Engram module targets exactly this gap by adding a conditional memory axis that works alongside MoE rather than replacing it.

At a high level, Engram modernizes classic N gram embeddings and turns them into a scalable, O(1) lookup memory that plugs directly into the Transformer backbone. The result is a parametric memory that stores static patterns such as common phrases and entities, while the backbone focuses on harder reasoning and long range interactions.

https://github.com/deepseek-ai/Engram/tree/main

How Engram Fits Into A DeepSeek Transformer

The proposed approach uses the DeepSeek V3 tokenizer with a 128k vocabulary and pre-trains on 262B tokens. The backbone is a 30 block Transformer with hidden size 2560. Each block uses Multi head Latent Attention with 32 heads and connects to feed forward networks through Manifold Constrained Hyper Connections with expansion rate 4. Optimization uses the Muon optimizer.

Engram attaches to this backbone as a sparse embedding module. It is built from hashed N gram tables, with multi head hashing into prime sized buckets, a small depthwise convolution over the N gram context and a context aware gating scalar in the range 0 to 1 that controls how much of the retrieved embedding is injected into each branch.
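
A toy sketch of this lookup-plus-gate mechanism for a single 2-gram head is shown below; the hash scheme, defaults, and names are illustrative assumptions, not the released deepseek-ai/Engram code:

import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    def __init__(self, table_size: int = 1_000_003, dim: int = 1280, hidden: int = 2560):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)                    # hashed bigram table, prime-sized
        self.dwconv = nn.Conv1d(dim, dim, kernel_size=2, groups=dim, padding=1)
        self.gate = nn.Linear(hidden, 1)                              # context-aware scalar gate

    def forward(self, token_ids: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq), h: (batch, seq, hidden) backbone hidden states
        prev = torch.cat([torch.zeros_like(token_ids[:, :1]), token_ids[:, :-1]], dim=1)
        bucket = (token_ids * 31 + prev) % self.table.num_embeddings  # toy bigram hash
        e = self.table(bucket)                                        # (batch, seq, dim)
        e = self.dwconv(e.transpose(1, 2))[..., : token_ids.size(1)].transpose(1, 2)
        g = torch.sigmoid(self.gate(h))                               # gate value in (0, 1) per position
        return g * e  # gated retrieved embedding; the projection into the residual stream is omitted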

In the large scale models, Engram-27B and Engram-40B share the same Transformer backbone as MoE-27B. MoE-27B replaces the dense feed forward with DeepSeekMoE, using 72 routed experts and 2 shared experts. Engram-27B reduces routed experts from 72 to 55 and reallocates those parameters into a 5.7B Engram memory while keeping total parameters at 26.7B. The Engram module uses N equal to {2,3}, 8 Engram heads, dimension 1280 and is inserted at layers 2 and 15. Engram 40B increases the Engram memory to 18.5B parameters while keeping activated parameters fixed.

https://github.com/deepseek-ai/Engram/tree/main

Sparsity Allocation, A Second Scaling Knob Beside MoE

The core design question is how to split the sparse parameter budget between routed experts and conditional memory. The research team formalize this as the Sparsity Allocation problem, with allocation ratio ρ defined as the fraction of inactive parameters assigned to MoE experts. A pure MoE model has ρ equal to 1. Reducing ρ reallocates parameters from experts into Engram slots.

On mid scale 5.7B and 9.9B models, sweeping ρ gives a clear U shaped curve of validation loss versus allocation ratio. Engram models match the pure MoE baseline even when ρ drops to about 0.25, which corresponds to roughly half as many routed experts. The optimum appears when around 20 to 25 percent of the sparse budget is given to Engram. This optimum is stable across both compute regimes, which suggests a robust split between conditional computation and conditional memory under fixed sparsity.

The research team also studied an infinite memory regime on a fixed 3B MoE backbone trained for 100B tokens. They scale the Engram table from roughly 2.58e5 to 1e7 slots. Validation loss follows an almost perfect power law in log space, meaning that more conditional memory keeps paying off without extra compute. Engram also outperforms OverEncoding, another N gram embedding method that averages into the vocabulary embedding, under the same memory budget.

Large Scale Pre Training Results

The main comparison involves four models trained on the same 262B token curriculum, with 3.8B activated parameters in all cases. These are Dense 4B with 4.1B total parameters, MoE 27B and Engram 27B at 26.7B total parameters, and Engram 40B at 39.5B total parameters.

On The Pile test set, language modeling loss is 2.091 for MoE 27B, 1.960 for Engram 27B, 1.950 for the Engram 27B variant and 1.942 for Engram 40B. The Dense 4B Pile loss is not reported. Validation loss on the internal held out set drops from 1.768 for MoE 27B to 1.634 for Engram 27B and to 1.622 and 1.610 for the Engram variants.

Across knowledge and reasoning benchmarks, Engram-27B consistently improves over MoE-27B. MMLU increases from 57.4 to 60.4, CMMLU from 57.9 to 61.9 and C-Eval from 58.0 to 62.7. ARC Challenge rises from 70.1 to 73.8, BBH from 50.9 to 55.9 and DROP F1 from 55.7 to 59.0. Code and math tasks also improve, for example HumanEval from 37.8 to 40.8 and GSM8K from 58.4 to 60.6.

Engram 40B typically pushes these numbers further even though the authors note that it is likely under trained at 262B tokens because its training loss continues to diverge from the baselines near the end of pre training.

https://github.com/deepseek-ai/Engram/tree/main

Long Context Behavior And Mechanistic Effects

After pre-training, the research team extend the context window using YaRN to 32768 tokens for 5000 steps, using 30B high quality long context tokens. They compare MoE-27B and Engram-27B at checkpoints corresponding to 41k, 46k and 50k pre training steps.

On LongPPL and RULER at 32k context, Engram-27B matches or exceeds MoE-27B under three conditions. With about 82 percent of the pre training FLOPs, Engram-27B at 41k steps matches LongPPL while improving RULER accuracy, for example Multi Query NIAH 99.6 versus 73.0 and QA 44.0 versus 34.5. Under iso loss at 46k and iso FLOPs at 50k, Engram 27B improves both perplexity and all RULER categories including VT and QA.

Mechanistic analysis uses LogitLens and Centered Kernel Alignment. Engram variants show lower layer wise KL divergence between intermediate logits and the final prediction, especially in early blocks, which means representations become prediction ready sooner. CKA similarity maps show that shallow Engram layers align best with much deeper MoE layers. For example, layer 5 in Engram-27B aligns with around layer 12 in the MoE baseline. Taken together, this supports the view that Engram effectively increases model depth by offloading static reconstruction to memory.

Ablation studies on a 12 layer 3B MoE model with 0.56B activated parameters add a 1.6B Engram memory as a reference configuration, using N equal to {2,3} and inserting Engram at layers 2 and 6. Sweeping a single Engram layer across depth shows that early insertion at layer 2 is optimal. The component ablations highlight three key pieces, multi branch integration, context aware gating and tokenizer compression.

Sensitivity analysis shows that factual knowledge relies heavily on Engram, with TriviaQA dropping to about 29 percent of its original score when Engram outputs are suppressed at inference, while reading comprehension tasks retain around 81 to 93 percent of performance, for example C3 at 93 percent.

Key Takeaways

Engram adds a conditional memory axis to sparse LLMs so that frequent N gram patterns and entities are retrieved via O(1) hashed lookup, while the Transformer backbone and MoE experts focus on dynamic reasoning and long range dependencies.

Under a fixed parameter and FLOPs budget, reallocating about 20 to 25 percent of the sparse capacity from MoE experts into Engram memory lowers validation loss, showing that conditional memory and conditional computation are complementary rather than competing.

In large scale pre training on 262B tokens, Engram-27B and Engram-40B with the same 3.8B activated parameters outperform a MoE-27B baseline on language modeling, knowledge, reasoning, code and math benchmarks, while keeping the Transformer backbone architecture unchanged.

Long context extension to 32768 tokens using YaRN shows that Engram-27B matches or improves LongPPL and clearly improves RULER scores, especially Multi-Query-Needle in a Haystack and variable tracking, even when trained with lower or equal compute compared to MoE-27B.

Check out the Paper and GitHub Repo.

The post DeepSeek AI Researchers Introduce Engram: A Conditional Memory Axis For Sparse LLMs appeared first on MarkTechPost.

How the Amazon AMET Payments team accelerates test case generation wit …

At Amazon.ae, we serve approximately 10 million customers monthly across five countries in the Middle East and North Africa region—United Arab Emirates (UAE), Saudi Arabia, Egypt, Türkiye, and South Africa. Our AMET (Africa, Middle East, and Türkiye) Payments team manages payment selections, transactions, experiences, and affordability features across these diverse countries, publishing on average five new features monthly. Each feature requires comprehensive test case generation, which traditionally consumed 1 week of manual effort per project. Our quality assurance (QA) engineers spent this time analyzing business requirement documents (BRDs), design documents, UI mocks, and historical test preparations—a process that required one full-time engineer annually merely for test case creation.
To improve this manual process, we developed SAARAM (QA Lifecycle App), a multi-agent AI solution built with Amazon Bedrock, Claude Sonnet by Anthropic, and the Strands Agents SDK. SAARAM reduces test case generation from 1 week to mere hours while also improving test coverage quality. Our solution demonstrates how studying human cognitive patterns, rather than optimizing AI algorithms alone, can create production-ready systems that enhance rather than replace human expertise.
In this post, we explain how we overcame the limitations of single-agent AI systems through a human-centric approach, implemented structured outputs to significantly reduce hallucinations, and built a scalable solution now positioned for expansion across the AMET QA team and, later, other QA teams in the International Emerging Stores and Payments (IESP) organization.
Solution overview
The AMET Payments QA team validates code deployments affecting payment functionality for millions of customers across diverse regulatory environments and payment methods. Our manual test case generation process added turnaround time (TAT) to the product cycle, consuming valuable engineering resources on repetitive test prep and documentation tasks rather than strategic testing initiatives. We needed an automated solution that could maintain our quality standards while reducing the time investment.
Our objectives included reducing test case creation time from 1 week to a few hours, capturing institutional knowledge from experienced testers, standardizing testing approaches across teams, and minimizing the hallucination issues common in AI systems. The solution needed to handle complex business requirements spanning multiple payment methods, regional regulations, and customer segments while generating specific, actionable test cases aligned with our existing test management systems.
The architecture employs a sophisticated multi-agent workflow. To achieve this, we went through three different iterations, and we continue to improve and enhance the system as new techniques are developed and new models are deployed.
The challenge with traditional AI approaches
Our initial attempts followed conventional AI approaches, feeding entire BRDs to a single AI agent for test case generation. This method frequently produced generic outputs like “verify payment works correctly” instead of the specific, actionable test cases our QA team requires. For example, we need test cases as specific as “verify that when a UAE customer selects cash on delivery (COD) for an order above 1,000 AED with a saved credit card, the system displays the COD fee of 11 AED and processes the payment through the COD gateway with order state transitioning to ‘pending delivery.’”
The single-agent approach presented several critical limitations. Context length restrictions prevented processing large documents effectively, and the lack of specialized processing phases meant the AI couldn’t understand testing priorities or risk-based approaches. Additionally, hallucination issues created irrelevant test scenarios that could mislead QA efforts. The root cause was clear: the AI attempted to compress complex business logic without the iterative thinking process that experienced testers employ when analyzing requirements.
The following flow chart illustrates our issues when attempting to use a single agent with a comprehensive prompt.

The human-centric breakthrough
Our breakthrough came from a fundamental shift in approach. Instead of asking, “How should AI think about testing?”, we asked, “How do experienced humans think about testing?” to focus on following a specific step-by-step process instead of relying on the large language model (LLM) to realize this on its own. This philosophy change led us to conduct research interviews with senior QA professionals, studying their cognitive workflows in detail.
We discovered that experienced testers don’t process documents holistically—they work through specialized mental phases. First, they analyze documents by extracting acceptance criteria, identifying customer journeys, understanding UX requirements, mapping product requirements, analyzing user data, and assessing workstream capabilities. Then they develop tests through a systematic process: journey analysis, scenario identification, data flow mapping, test case development, and finally, organization and prioritization.
We then decomposed our original agent into sequential thinking actions that served as individual steps. We built and tested each step using Amazon Q Developer for CLI to make sure basic ideas were sound and incorporated both primary and secondary inputs.
This insight led us to design SAARAM with specialized agents that mirror these expert testing approaches. Each agent focuses on a specific aspect of the testing process, such as how human experts mentally compartmentalize different analysis phases.
Multi-agent architecture with Strands Agents
Based on our understanding of human QA workflows, we initially attempted to build our own agents from scratch. We had to create our own looping, serial, or parallel execution. We also created our own orchestration and workflow graphs, which demanded considerable manual effort. To address these challenges, we migrated to Strands Agents SDK. This provided the multi-agent orchestration capabilities essential for coordinating complex, interdependent tasks while maintaining clear execution paths, helping improve our performance and reduce our development time.
Workflow iteration 1: End-to-end test generation
Our first iteration of SAARAM consisted of a single input and created our first specialized agents. It involved processing a work document through five specialized agents to generate comprehensive test coverage.
Agent 1 is called the Customer Segment Creator, and it focuses on customer segmentation analysis, using four subagents:

Customer Segment Discovery identifies product user segments
Decision Matrix Generator creates parameter-based matrices
E2E Scenario Creation develops end-to-end (E2E) scenarios per segment
Test Steps Generation produces detailed test cases

Agent 2 is called the User Journey Mapper, and it employs four subagents to map product journeys comprehensively:

The Flow Diagram Creator and the Sequence Diagram Creator produce diagrams using Mermaid syntax.
The E2E Scenarios generator builds upon these diagrams.
The Test Steps Generator is used for detailed test documentation.

Agent 3 is called Customer Segment x Journey Coverage, and it combines inputs from Agents 1 and 2 to create detailed segment-specific analyses. Its four subagents produce:

Mermaid-based flow diagrams
User journeys
Sequence diagrams for each customer segment
Corresponding test steps.

Agent 4 is called the State Transition Agent. It analyzes various product state points in customer journey flows. Its subagents create Mermaid state diagrams representing different journey states and segment-specific state scenario diagrams, and generate the related test scenarios and steps.
The workflow, shown in the following diagram, concludes with a basic extract, transform, and load (ETL) process that consolidates and deduplicates the data from the agents, saving the final output as a text file.

This systematic approach facilitates comprehensive coverage of customer journeys, segments, and various diagram types, enabling thorough test coverage generation through iterative processing by agents and subagents.
Addressing limitations and enhancing capabilities
In our journey to develop a more robust and efficient tool using Strands Agents, we identified five crucial limitations in our initial approach:

Context and hallucination challenges – Our first workflow faced limitations from segregated agent operations where individual agents independently collected data and created visual representations. This isolation led to limited contextual understanding, resulting in reduced accuracy and increased hallucinations in the outputs.
Data generation inefficiencies – The limited context available to agents caused another critical issue: the generation of excessive irrelevant data. Without proper contextual awareness, agents produced less focused outputs, leading to noise that obscured valuable insights.
Restricted parsing capabilities – The initial system’s data parsing scope proved too narrow, limited to only customer segments, journey mapping, and basic requirements. This restriction prevented agents from accessing the full spectrum of information needed for comprehensive analysis.
Single-source input constraint – The workflow could only process Word documents, creating a significant bottleneck. Modern development environments require data from multiple sources, and this limitation prevented holistic data collection.
Rigid architecture problems – Importantly, the first workflow employed a tightly coupled system with rigid orchestration. This architecture made it difficult to modify, extend, or reuse components, limiting the system’s adaptability to changing requirements.

In our second iteration, we needed to implement strategic solutions to address these issues.
Workflow iteration 2: Comprehensive analysis workflow
Our second iteration represents a complete reimagining of the agentic workflow architecture. Rather than patching individual problems, we rebuilt from the ground up with modularity, context-awareness, and extensibility as core principles:
Agent 1 is the intelligent gateway. The file type decision agent serves as the system’s entry point and router. Processing documentation files, Figma designs, and code repositories, it categorizes and directs data to appropriate downstream agents. This intelligent routing is essential for maintaining both efficiency and accuracy throughout the workflow.
Agent 2 is for specialized data extraction. The Data Extractor agent employs six specialized subagents, each focused on specific extraction domains. This parallel processing approach facilitates thorough coverage while maintaining practical speed. Each subagent operates with domain-specific knowledge, extracting nuanced information that generalized approaches might overlook.
Agent 3 is the Visualizer agent, and it transforms extracted data into six distinct Mermaid diagram types, each serving specific analytical purposes. Entity relation diagrams map data relationships and structures, and flow diagrams visualize processes and workflows. Requirement diagrams clarify product specifications, and UX requirement visualizations illustrate user experience flows. Process flow diagrams detail system operations, and mind maps reveal feature relationships and hierarchies. These visualizations provide multiple perspectives on the same information, helping both human reviewers and downstream agents understand patterns and connections within complex datasets.
Agent 4 is the Data Condenser agent, and it performs crucial synthesis through intelligent context distillation, making sure each downstream agent receives exactly the information needed for its specialized task. This agent, powered by its condensed information generator, merges outputs from both the Data Extractor and Visualizer agents while performing sophisticated analysis.
The agent extracts critical elements from the full text context—acceptance criteria, business rules, customer segments, and edge cases—creating structured summaries that preserve essential details while reducing token usage. It compares each text file with its corresponding Mermaid diagram, capturing information that might be missed in visual representations alone. This careful processing maintains information integrity across agent handoffs, making sure important data is not lost as it flows through the system. The result is a set of condensed addendums that enrich the Mermaid diagrams with comprehensive context. This synthesis makes sure that when information moves to test generation, it arrives complete, structured, and optimized for processing.
Agent 5, the Test Generator agent, brings together the collected, visualized, and condensed information to produce comprehensive test suites. Working with six Mermaid diagrams plus condensed information from Agent 4, this agent employs a pipeline of five subagents. The Journey Analysis Mapper, Scenario Identification Agent, and Data Flow Mapping subagents generate comprehensive test cases based on their respective views of the input data flowing from Agent 4. With the test cases generated across three critical perspectives, the Test Cases Generator evaluates them, reformatting according to internal guidelines for consistency. Finally, the Test Suite Organizer performs deduplication and optimization, delivering a final test suite that balances comprehensiveness with efficiency.
The system now handles far more than the basic requirements and journey mapping of Workflow 1—it processes product requirements, UX specifications, acceptance criteria, and workstream extraction while accepting inputs from Figma designs, code repositories, and multiple document types. Most importantly, the shift to modular architecture fundamentally changed how the system operates and evolves. Unlike our rigid first workflow, this design allows for reusing outputs from earlier agents, integrating new testing type agents, and intelligently selecting test case generators based on user requirements, positioning the system for continuous adaptation.
The following figure shows our second iteration of SAARAM with five main agents and multiple subagents with context engineering and compression.

Additional Strands Agents features
Strands Agents provided the foundation for our multi-agent system, offering a model-driven approach that simplified complex agent development. Because the SDK can connect models with tools through advanced reasoning capabilities, we built sophisticated workflows with only a few lines of code. Beyond its core functionality, two key features proved essential for our production deployment: reducing hallucinations with structured outputs and workflow orchestration.
Reducing hallucinations with structured outputs
The structured output feature of Strands Agents uses Pydantic models to transform traditionally unpredictable LLM outputs into reliable, type-safe responses. This approach addresses a fundamental challenge in generative AI: although LLMs excel at producing humanlike text, they can struggle with consistently formatted outputs needed for production systems. By enforcing schemas through Pydantic validation, we make sure that responses conform to predefined structures, enabling seamless integration with existing test management systems.
The following sample implementation demonstrates how structured outputs work in practice:

from pydantic import BaseModel, Field, ValidationError
from typing import List
import json

from strands import tool

# Define structured output schema
class TestCaseItem(BaseModel):
    name: str = Field(description="Test case name")
    priority: str = Field(description="Priority: P0, P1, or P2")
    category: str = Field(description="Test category")

class TestOutput(BaseModel):
    test_cases: List[TestCaseItem] = Field(description="Generated test cases")

# Agent tool with validation
@tool
def save_results(results: str) -> str:
    try:
        # Parse and validate Claude's JSON output
        data = json.loads(results)
        validated = TestOutput(**data)

        # Save only if validation passes
        with open("results.json", "w") as f:
            json.dump(validated.dict(), f, indent=2)
        return "Validated results saved"

    except (json.JSONDecodeError, ValidationError) as e:
        return f"Invalid output format: {e}"

Pydantic automatically validates LLM responses against defined schemas to facilitate type correctness and required field presence. When responses don’t match the expected structure, validation errors provide clear feedback about what needs correction, helping prevent malformed data from propagating through the system. In our environment, this approach delivered consistent, predictable outputs across the agents regardless of prompt variations or model updates, minimizing an entire class of data formatting errors. As a result, our development team worked more efficiently with full IDE support.
Workflow orchestration benefits
The Strands Agents workflow architecture provided the sophisticated coordination capabilities our multi-agent system required. The framework enabled structured coordination with explicit task definitions, automatic parallel execution for independent tasks, and sequential processing for dependent operations. This meant we could build complex agent-to-agent communication patterns that would have been difficult to implement manually.
The following sample snippet shows how to create a workflow in Strands Agents SDK:

from strands import Agent
from strands_tools import workflow

# Create agent with workflow capability
# (create_main_agent_3 and load_prompt are project helpers not shown in this snippet;
# the temperature, priority, and timeout values below are illustrative placeholders,
# since the originals were omitted from the published sample.)
main_agent_3 = create_main_agent_3()

# Create workflow with structured output tasks
workflow_result = main_agent_3.tool.workflow(
    action="create",
    workflow_id="comprehensive_e2e_test_generation",
    tasks=[
        # Phase 1: Parallel execution (no dependencies)
        {
            "task_id": "journey_analysis",
            "description": "Generate journey scenario names with brief descriptions using structured output",
            "dependencies": [],
            "model_provider": "bedrock",
            "model_settings": {
                "model_id": "us.anthropic.claude-sonnet-4-20250514-v1:0",
                "params": {"temperature": 0.2},
            },
            "system_prompt": load_prompt("journey_analysis"),
            "structured_output_model": "JourneyAnalysisOutput",
            "priority": 1,
            "timeout": 300,
        },
        {
            "task_id": "scenario_identification",
            "description": "Generate scenario variations using structured output for different path types",
            "dependencies": [],
            "model_provider": "bedrock",
            "model_settings": {
                "model_id": "us.anthropic.claude-sonnet-4-20250514-v1:0",
                "params": {"temperature": 0.2},
            },
            "system_prompt": load_prompt("scenario_identification"),
            "structured_output_model": "ScenarioIdentificationOutput",
            "priority": 1,
            "timeout": 300,
        },
        {
            "task_id": "data_flow_mapping",
            "description": "Generate data flow scenarios using structured output covering information journey",
            "dependencies": [],
            "model_provider": "bedrock",
            "model_settings": {
                "model_id": "us.anthropic.claude-sonnet-4-20250514-v1:0",
                "params": {"temperature": 0.2},
            },
            "system_prompt": load_prompt("data_flow_mapping"),
            "structured_output_model": "DataFlowMappingOutput",
            "priority": 1,
            "timeout": 300,
        },
        # Phase 2: Waits for first 3 tasks to complete
        {
            "task_id": "test_case_development",
            "description": "Generate test cases from all scenario outputs using structured output",
            "dependencies": ["journey_analysis", "scenario_identification", "data_flow_mapping"],
            "model_provider": "bedrock",
            "model_settings": {
                "model_id": "us.anthropic.claude-sonnet-4-20250514-v1:0",
                "params": {"temperature": 0.2},
            },
            "system_prompt": load_prompt("test_case_development"),
            "structured_output_model": "TestCaseDevelopmentOutput",
            "priority": 2,
            "timeout": 600,
        },
        # Phase 3: Waits for test case development to complete
        {
            "task_id": "test_suite_organization",
            "description": "Organize all test cases into final comprehensive test suite using structured output",
            "dependencies": ["test_case_development"],
            "model_provider": "bedrock",
            "model_settings": {
                "model_id": "us.anthropic.claude-sonnet-4-20250514-v1:0",
                "params": {"temperature": 0.2},
            },
            "system_prompt": load_prompt("test_suite_organization"),
            "structured_output_model": "TestSuiteOrganizationOutput",
            "priority": 3,
            "timeout": 600,
        },
    ],
)

The workflow system delivered three critical capabilities for our use case. First, parallel processing optimization allowed journey analysis, scenario identification, and coverage analysis to run simultaneously, with independent agents processing different aspects without blocking each other. The system automatically allocated resources based on availability, maximizing throughput.
Second, intelligent dependency management made sure that test development waited for scenario identification to be completed, and organization tasks depended on the test cases being generated. Context was preserved and passed efficiently between dependent stages, maintaining information integrity throughout the workflow.
Finally, the built-in reliability features provided the resilience our system required. Automatic retry mechanisms handled transient failures gracefully, state persistence enabled pause and resume capabilities for long-running workflows, and comprehensive audit logging supported both debugging and performance optimization efforts.
The following table shows examples of input into the workflow and the potential outputs.

Input: Business requirement document

Functional requirements:

Credit card processing. System must:

Support credit card payment processing
Validate credit card details before processing
Display checkout and card entry forms
Save shipping information
Provide order confirmation after successful payment

Error handling. System must:

Handle payment failures gracefully
Display clear declined payment messages
Offer payment retry options
Support switching between payment methods
Handle gateway timeouts with retry mechanism

Refund processing. System must:

Support refund initiation from admin panel
Process refunds to original payment method
Track and update refund status
Maintain refund transaction records

Multiple payment methods. System must:

Support split payments across methods
Handle gift card partial payments
Calculate remaining balances accurately
Reconcile combined payment amounts
Verify total order amount matches payments


Output: Test cases generated

TC006: Credit card payment success
Scenario: Customer completes purchase using valid credit card
Steps:
1. Add items to cart and proceed to checkout. Expected result: Checkout form displayed.
2. Enter shipping information. Expected result: Shipping details saved.
3. Select credit card payment method. Expected result: Card form shown.
4. Enter valid card details. Expected result: Card validated.
5. Submit payment. Expected result: Payment processed, order confirmed.

TC008: Payment failure handling
Scenario: Payment fails due to insufficient funds or card decline
Steps:
1. Enter card with insufficient funds. Expected result: Payment declined message.
2. System offers retry option. Expected result: Payment form redisplayed.
3. Try alternative payment method. Expected result: Alternative payment successful.

TC009: Payment gateway timeout
Scenario: Payment gateway times out during transaction processing
Steps:
1. Submit payment during gateway maintenance. Expected result: Timeout error shown.
2. System provides retry mechanism. Expected result: Retry button available.
3. Retry payment after timeout. Expected result: Payment processes successfully.

TC010: Refund processing
Scenario: Customer refund is processed back to original payment method
Steps:
1. Initiate refund from admin panel. Expected result: Refund request created.
2. Process refund to original card. Expected result: Refund transaction initiated.
3. Verify refund status. Expected result: Refund marked as completed.

Integration with Amazon Bedrock
Amazon Bedrock served as the foundation for our AI capabilities, providing seamless access to Claude Sonnet by Anthropic through the Strands Agents built-in AWS service integration. We selected Claude Sonnet by Anthropic for its exceptional reasoning capabilities and ability to understand complex payment domain requirements. The Strands Agents flexible LLM API integration made this implementation straightforward. The following snippet shows how to effortlessly create an agent in Strands Agents:

import boto3
from strands import Agent
from strands.models import BedrockModel

bedrock_model = BedrockModel(
    model_id="anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="us-west-2",
    temperature=0.3,
)

agent = Agent(model=bedrock_model)

The managed service architecture of Amazon Bedrock reduced infrastructure complexity from our deployment. The service provided automatic scaling that adjusted to our workload demands, facilitating consistent performance across the agents regardless of traffic patterns. Built-in retry logic and error handling improved system reliability significantly, reducing the operational overhead typically associated with managing AI infrastructure at scale. The combination of the sophisticated orchestration capabilities of Strands Agents and the robust infrastructure of Amazon Bedrock created a production-ready system that could handle complex test generation workflows while maintaining high reliability and performance standards.
The following diagram shows the deployment of the SAARAM agent with Amazon Bedrock AgentCore and Amazon Bedrock.

Results and business impact
The implementation of SAARAM has delivered measurable improvements across multiple dimensions of our QA process. Before SAARAM, our QA engineers spent 3–5 days manually analyzing BRD documents and UI mocks to create comprehensive test cases. This manual process is now reduced to hours, with the system achieving:

Test case generation time: Reduced from 1 week to hours
Resource optimization: QA effort decreased from 1.0 full-time employee (FTE) to 0.2 FTE for validation
Coverage improvement: 40% more edge cases identified compared to manual process
Consistency: 100% adherence to test case standards and formats

The accelerated test case generation has driven improvements in our core business metrics:

Payment success rate: Increased through comprehensive edge case testing and risk-based test prioritization
Payment experience: Enhanced customer satisfaction because teams can now iterate on test coverage during the design phase
Developer velocity: Product and development teams generate preliminary test cases during design, enabling early quality feedback

SAARAM captures and preserves institutional knowledge that was previously dependent on individual QA engineers:

Testing patterns from experienced professionals are now codified
Historical test case learnings are automatically applied to new features
Consistent testing approaches across different payment methods and industries
Reduced onboarding time for new QA team members

This iterative improvement means that the system becomes more valuable over time.
Lessons learned
Our journey developing SAARAM provided crucial insights for building production-ready AI systems. Our breakthrough came from studying how domain experts think rather than optimizing how AI processes information. Understanding the cognitive patterns of testers and QA professionals led to an architecture that naturally aligns with human reasoning. This approach produced better results compared to purely technical optimizations. Organizations building similar systems should invest time observing and interviewing domain experts before designing their AI architecture—the insights gained directly translate to more effective agent design.
Breaking complex tasks into specialized agents dramatically improved both accuracy and reliability. Our multi-agent architecture, enabled by the orchestration capabilities of Strands Agents, handles nuances that monolithic approaches consistently miss. Each agent’s focused responsibility enables deeper domain expertise while providing better error isolation and debugging capabilities.
A key discovery was that the Strands Agents workflow and graph-based orchestration patterns significantly outperformed traditional supervisor agent approaches. Although supervisor agents make dynamic routing decisions that can introduce variability, workflows provide “agents on rails”—a structured path facilitating consistent, reproducible results. Strands Agents offers multiple patterns, including supervisor-based routing, workflow orchestration for sequential processing with dependencies, and graph-based coordination for complex scenarios. For test generation where consistency is paramount, the workflow pattern with its explicit task dependencies and parallel execution capabilities delivered the optimal balance of flexibility and control. This structured approach aligns perfectly with production environments where reliability matters more than theoretical flexibility.
Implementing Pydantic models through the Strands Agents structured output feature effectively reduced type-related hallucinations in our system. By enforcing AI responses to conform to strict schemas, we facilitate reliable, programmatically usable outputs. This approach has proven essential when consistency and reliability are nonnegotiable. The type-safe responses and automatic validation have become foundational to our system’s reliability.
Our condensed information generator pattern demonstrates how intelligent context management maintains quality throughout multistage processing. This approach of knowing what to preserve, condense, and pass between agents helps prevent the context degradation that typically occurs in token-limited environments. The pattern is broadly applicable to multistage AI systems facing similar constraints.
What’s next
The modular architecture we’ve built with Strands Agents enables straightforward adaptation to other domains within Amazon. The same patterns that generate payment test cases can be applied to retail systems testing, customer service scenario generation for support workflows, and mobile application UI and UX test case generation. Each adaptation requires only domain-specific prompts and schemas while reusing the core orchestration logic. Throughout the development of SAARAM, the team successfully addressed many challenges in test case generation—from reducing hallucinations through structured outputs to implementing sophisticated multi-agent workflows. However, one critical gap remains: the system hasn’t yet been provided with examples of what high-quality test cases actually look like in practice.
To bridge this gap, integrating Amazon Bedrock Knowledge Bases with a curated repository of historical test cases would provide SAARAM with concrete, real-world examples during the generation process. By using the integration capabilities of Strands Agents with Amazon Bedrock Knowledge Bases, the system could search through past successful test cases to find similar scenarios before generating new ones. When processing a BRD for a new payment feature, SAARAM would first query the knowledge base for comparable test cases—whether for similar payment methods, customer segments, or transaction flows—and use these as contextual examples to guide its output.
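For illustration, such a lookup might look like the following sketch using the Amazon Bedrock Retrieve API; the knowledge base ID and query text are placeholders, and this snippet is not part of the SAARAM codebase:

import boto3

bedrock_kb = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

# Placeholder knowledge base ID and query; in practice the query would be derived from the BRD
response = bedrock_kb.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    retrievalQuery={"text": "COD fee display for UAE orders above 1,000 AED"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# Pass the retrieved historical test cases to the Test Generator agent as few-shot context
examples = [result["content"]["text"] for result in response["retrievalResults"]]
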
Future deployment will use Amazon Bedrock AgentCore for comprehensive agent lifecycle management. Amazon Bedrock AgentCore Runtime provides the production execution environment with ephemeral, session-specific state management that maintains conversational context during active sessions while facilitating isolation between different user interactions. The observability capabilities of Bedrock AgentCore help deliver detailed visualizations of each step in SAARAM’s multi-agent workflow, which the team can use to trace execution paths through the five agents, audit intermediate outputs from the Data Condenser and Test Generator agents, and identify performance bottlenecks through real-time dashboards powered by Amazon CloudWatch with standardized OpenTelemetry-compatible telemetry.
The service enables several advanced capabilities essential for production deployment: centralized agent management and versioning through the Amazon Bedrock AgentCore control plane, A/B testing of different workflow strategies and prompt variations across the five subagents within the Test Generator, performance monitoring with metrics tracking token usage and latency across the parallel execution phases, automated agent updates without disrupting active test generation workflows, and session persistence for maintaining context when QA engineers iteratively refine test suite outputs. This integration positions SAARAM for enterprise-scale deployment while providing the operational visibility and reliability controls that transform it from a proof of concept into a production system capable of handling the AMET team’s ambitious goal of expanding beyond Payments QA to serve the broader organization.
Conclusion
SAARAM demonstrates how AI can change traditional QA processes when designed with human expertise at its core. By reducing test case creation from 1 week to hours while improving quality and coverage, we’ve enabled faster feature deployment and enhanced payment experiences for millions of customers across the MENA region. The key to our success wasn’t merely advanced AI technology—it was the combination of human expertise, thoughtful architecture design, and robust engineering practices. Through careful study of how experienced QA professionals think, implementation of multi-agent systems that mirror these cognitive patterns, and minimization of AI limitations through structured outputs and context engineering, we’ve created a system that enhances rather than replaces human expertise.
For teams considering similar initiatives, our experience emphasizes three critical success factors: invest time understanding the cognitive processes of domain experts, implement structured outputs to minimize hallucinations, and design multi-agent architectures that mirror human problem-solving approaches. These QA tools aren’t intended to replace human testers; they amplify their expertise through intelligent automation. If you’re interested in starting your journey on agents with AWS, check out our sample Strands Agents implementations repo or our newest launch, Amazon Bedrock AgentCore, and the end-to-end examples with deployment on our Amazon Bedrock AgentCore samples repo.

About the authors
Jayashree is a Quality Assurance Engineer at Amazon Music Tech, where she combines rigorous manual testing expertise with an emerging passion for GenAI-powered automation. Her work focuses on maintaining high system quality standards while exploring innovative approaches to make testing more intelligent and efficient. Committed to reducing testing monotony and enhancing product quality across Amazon’s ecosystem, Jayashree is at the forefront of integrating artificial intelligence into quality assurance practices.
Harsha Pradha G is a Senior Quality Assurance Engineer on the MENA Payments team at Amazon. With a strong foundation in building comprehensive quality strategies, she brings a unique perspective to the intersection of QA and AI as an emerging QA-AI integrator. Her work focuses on bridging the gap between traditional testing methodologies and cutting-edge AI innovations, while also serving as an AI content strategist and AI author.
Fahim Surani is a Senior Solutions Architect at AWS, helping customers across Financial Services, Energy, and Telecommunications design and build cloud and generative AI solutions. Since 2022, his focus has been driving enterprise cloud adoption, spanning cloud migrations, cost optimization, and event-driven architectures, including leading implementations recognized as early adopters of Amazon’s latest AI capabilities. Fahim’s work covers a wide range of use cases, with a primary interest in generative AI and agentic architectures. He is a regular speaker at AWS summits and industry events across the region.

Build a generative AI-powered business reporting solution with Amazon …

Traditional business reporting processes are often time-consuming and inefficient. Associates typically spend about two hours per month preparing their reports, while managers dedicate up to 10 hours per month aggregating, reviewing, and formatting submissions. This manual approach often leads to inconsistencies in both format and quality, requiring multiple cycles of review. Additionally, reports are fragmented across various systems, making consolidation and analysis more challenging.
Generative artificial intelligence (AI) presents a compelling solution to these reporting challenges. According to a Gartner survey, generative AI has become the most widely adopted AI technology in organizations, with 29% already putting it into active use.
This post introduces generative AI guided business reporting—with a focus on writing achievements and challenges about your business—providing a smart, practical solution that helps simplify and accelerate internal communication and reporting. Built following Amazon Web Services (AWS) best practices, this solution helps you spend less time writing reports and more time focusing on driving business results. It tackles three real-world challenges:

Uncover valuable insights from vast amounts of data
Manage risks associated with AI implementation
Drive growth through improved efficiency and decision-making

The full solution code is available in our GitHub repo, allowing you to deploy and test this solution in your own AWS environment.
The generative AI solution enhances the reporting process through automation. By utilizing large language model (LLM) processing, the reporting system can generate human-readable reports, answer follow-up questions, and make insights more accessible to non-technical stakeholders. This automation reduces costs and the need for extensive human resources while minimizing human error and bias. The result is a level of accuracy and objectivity that’s difficult to achieve with manual processes, ultimately leading to more efficient and effective business reporting.
Solution overview
This generative AI-powered Enterprise Writing Assistant demonstrates a modern, serverless architecture that leverages AWS’s powerful suite of services to deliver an intelligent writing solution. Built with scalability and security in mind, this system combines AWS Lambda functions, Amazon Bedrock for AI capabilities, and various AWS services to create a robust, enterprise-grade writing assistant that can help organizations streamline content creation processes while maintaining high standards of quality and consistency.

This solution uses a serverless, scalable design built on AWS services. Let’s explore how the components work together:
User interaction layer

Users access the solution through a browser that connects to a frontend web application hosted on Amazon S3 and distributed globally via Amazon CloudFront for optimal performance
Amazon Cognito user pools handle authentication and secure user management

API layer

Two API types in Amazon API Gateway manage communication between frontend and backend:

WebSocket API enables real-time, bidirectional communication for report writing and editing
REST API handles transactional operations like submitting and retrieving reports

Amazon CloudWatch monitors both APIs for operational visibility
Dedicated AWS Lambda authorizers secure both APIs by validating user credentials

Orchestration layer

Specialized AWS Lambda functions orchestrate the core business logic:

Business Report Writing Lambda handles report drafting and user assistance
Rephrase Lambda improves report clarity and professionalism
Submission Lambda processes final report submissions
View Submission Lambda retrieves previously submitted reports

AI and storage layer

Amazon Bedrock provides the LLM capabilities for report writing and rephrasing
Two Amazon DynamoDB tables store different types of data:

Session Management table maintains conversation context during active sessions
Business Report Store table permanently archives completed reports

This architecture facilitates high availability, automatic scaling, and cost optimization by using serverless components that only incur charges when in use. Communications between components are secured following AWS best practices.
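As a minimal sketch of the session-management side of this layer (the table and attribute names are assumptions, not the stack's actual schema), each conversation turn could be written to DynamoDB like this:

import time
import boto3

dynamodb = boto3.resource("dynamodb")
session_table = dynamodb.Table("SessionManagement")  # illustrative name; use the table created by the CDK stack

def save_turn(session_id: str, role: str, message: str, ttl_hours: int = 24):
    """Append one conversation turn so the report-writing Lambda can keep context during a session."""
    session_table.put_item(
        Item={
            "session_id": session_id,                            # assumed partition key
            "ts": int(time.time() * 1000),                       # assumed sort key
            "role": role,                                        # "user" or "assistant"
            "message": message,
            "expires_at": int(time.time()) + ttl_hours * 3600,   # TTL attribute for automatic cleanup
        }
    )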
You can deploy this architecture in your own AWS account by following the step-by-step instructions in the GitHub repository.
Real-world workflow: Report generation and rephrasing
The system’s workflow begins by analyzing and categorizing each user input through a classification process. This classification determines how the system processes and responds to the input. The system uses specific processing paths based on three distinct classifications:

Question or command: When the system classifies the input as a question or command, it activates the LLM with appropriate prompting to generate a relevant response. The system stores these interactions in the conversation memory, allowing it to maintain context for future related queries. This contextual awareness provides coherent and consistent responses that build upon previous interactions.
Verify submission: For inputs requiring verification, the system engages its evaluation protocols to provide detailed feedback on your submission. While the system stores these interactions in the conversation memory, it deliberately bypasses memory retrieval during the verification process. This design choice enables the verification process based solely on the current submission’s merits, without influence from previous conversations. This approach reduces system latency and facilitates more accurate and unbiased verification results.
Outside of scope: When the input falls outside the system’s defined parameters, it responds with the standardized message: “Sorry, I can only answer writing-related questions.” This maintains clear boundaries for the system’s capabilities and helps prevent confusion or inappropriate responses.

These classifications support efficient processing while maintaining appropriate context only where necessary, optimizing both performance and accuracy in different interaction scenarios.
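A simplified sketch of this routing logic (the classify, llm_answer, and evaluate_submission helpers and the labels are illustrative, not the solution's actual implementation):

def handle_input(user_input: str, memory: list) -> str:
    """Route an input down one of the three processing paths described above."""
    label = classify(user_input)  # assumed helper that asks the LLM for a classification label

    if label == "question_or_command":
        # Context-aware path: prior turns are included in the prompt.
        answer = llm_answer(user_input, context=memory)
        memory.append(user_input)
        return answer

    if label == "verify_submission":
        # Verification deliberately skips memory retrieval so feedback depends only on this draft.
        feedback = evaluate_submission(user_input)
        memory.append(user_input)
        return feedback

    # Outside of scope
    return "Sorry, I can only answer writing-related questions."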
User experience walkthrough
Now that we have explored the architecture, let’s dive into the user experience of our generative AI-powered Enterprise Writing Assistant. The following walkthrough demonstrates the solution in action, showcasing how AWS services come together to deliver a seamless, intelligent writing experience for enterprise users.
Home page
The home page offers two views: Associate view and Manager view.
Associate view
Within the Associate view, you have three options: Write Achievement, Write Challenge, or View Your Submissions. For this post, we walk through the Achievement view. The Challenge view follows the same process but with different guidelines. In the Achievement view, the system prompts you to either ask questions or make a submission. Inputs go through the generative AI workflow. The following example demonstrates an incomplete submission, along with the system’s feedback. This feedback includes a visual summary that highlights the missing or completed components. The system evaluates the submission based on a predefined guideline. Users can adapt this approach in their solutions. At this stage, the focus should not be on grammar or formatting, but rather on the overall concept.

If the system is prompted with an irrelevant question, it declines to answer to avoid misuse.

Throughout the conversation, you can ask questions related to writing a business report (an achievement or challenge about the business).

Once all criteria are met, the system can automatically rephrase the input text to fix grammatical and formatting issues. If you need to make changes to the input text, you can click the Previous button, which takes you back to the stage where you can modify your submission.

After rephrasing, the system shows both the original version and the rephrased version with highlighted differences.

The system also automatically extracts customer name metadata.

When complete, you can save or continue editing the output.
Manager view
In the Manager view, you have the ability to aggregate multiple submissions from direct reports into a consolidated roll-up report. The following shows how this interface appears.
Prerequisites
To deploy this solution in your AWS account, the following is needed:

An AWS account with administrative access
AWS CLI (2.22.8) installed and configured
Access to Amazon Bedrock models (Anthropic Claude)
Node.js (20.12.7) for the frontend components
Git for cloning the repository

Deploy the solution
The generative AI Enterprise Report Writing Assistant uses AWS CDK for infrastructure deployment, making it straightforward to set up in your AWS environment:

Clone the GitHub repository:

git clone https://github.com/aws-samples/sample-generative-ai-enterprise-report-writing-assistant.git && cd sample-generative-ai-enterprise-report-writing-assistant

Install dependencies:

npm install

Deploy the application to AWS:

cdk deploy

After deployment completes, wait 1-2 minutes for the AWS CodeBuild process to finish.
Access the application using the VueAppUrl from the CDK/CloudFormation outputs.

The deployment creates the necessary resources including Lambda functions, API Gateways, DynamoDB tables, and the frontend application hosted on S3 and CloudFront.
For detailed configuration options and customizations, refer to the README in the GitHub repository.
Clean up resources
To avoid incurring future charges, delete the resources created by this solution when they are no longer needed:

cdk destroy

This command removes the AWS resources provisioned by the CDK stack, including:

Lambda functions
API Gateway endpoints
DynamoDB tables
S3 buckets
CloudFront distributions
Cognito user pools

Be aware that some resources, like S3 buckets containing deployment artifacts, might need to be emptied before they can be deleted.
Conclusion
Traditional business reporting is time-consuming and manual, leading to inefficiencies across the board. The generative AI Enterprise Report Writing Assistant represents a significant leap forward in how organizations approach their internal reporting processes. By leveraging generative AI technology, this solution addresses the traditional pain points of business reporting while introducing capabilities that were previously unattainable. Through intelligent report writing assistance with real-time feedback, automated rephrasing for clarity and professionalism, streamlined submission and review processes, and robust verification systems, the solution delivers comprehensive support for modern business reporting needs. The architecture facilitates secure, efficient processing, striking the crucial balance between automation and human oversight. As organizations continue to navigate increasingly complex business problems, the ability to generate clear, accurate, and insightful reports quickly becomes not just an advantage but a necessity. The generative AI Enterprise Report Writing Assistant provides a framework that can scale with your organization’s needs while maintaining consistency and quality across the levels of reporting.
We encourage you to explore the GitHub repository to deploy and customize this solution for your specific needs. You can also contribute to the project by submitting pull requests or opening issues for enhancements and bug fixes.
For more information about generative AI on AWS, refer to the AWS Generative AI resource center.
Resources

AWS CDK Documentation
Amazon Bedrock Documentation
Vue.js Documentation
CloudScape Design System
LangChain Documentation
AWS Amplify

About the authors
Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.
Michael Massey is a Cloud Application Architect at Amazon Web Services, where he specializes in building frontend and backend cloud-native applications. He designs and implements scalable and highly-available solutions and architectures that help customers achieve their business goals.
Jeff Chen is a Principal Consultant at AWS Professional Services, specializing in guiding customers through application modernization and migration projects powered by generative AI. Beyond GenAI, he delivers business value across a range of domains including DevOps, data analytics, infrastructure provisioning, and security, helping organizations achieve their strategic cloud objectives.
Jundong Qiao is a Sr. Machine Learning Engineer at AWS Professional Service, where he specializes in implementing and enhancing AI/ML capabilities across various sectors. His expertise encompasses building next-generation AI solutions, including chatbots and predictive models that drive efficiency and innovation.

Safeguard generative AI applications with Amazon Bedrock Guardrails

Enterprises aiming to automate processes using AI agents or enhance employee productivity using AI chat-based assistants need to enforce comprehensive safeguards and audit controls for responsible use of AI and processing of sensitive data by large language models (LLMs). Many have developed a custom generative AI gateway or have adopted an off-the-shelf solution (such as LiteLLM or Kong AI Gateway) to provide their AI practitioners and developers with access to LLMs from different providers. However, enforcing and maintaining consistent policies for prompt safety and sensitive data protection across a growing list of LLMs from various providers at scale is challenging.
In this post, we demonstrate how you can address these challenges by adding centralized safeguards to a custom multi-provider generative AI gateway using Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides a suite of safety features that help organizations build responsible generative AI applications at scale. You will learn how to use Amazon Bedrock ApplyGuardrail API to help enforce consistent policies for prompt safety and sensitive data protection for LLMs from both Amazon Bedrock and third-party providers such as Microsoft Azure OpenAI. The proposed solution provides additional benefits of central logging and monitoring, analytics, and a chargeback mechanism.
Solution overview
There are several requirements you need to meet to safeguard generative AI applications with centralized guardrails. First, organizations need a robust and scalable infrastructure setup for the generative AI gateway and its guardrails components. The solution also needs a comprehensive logging and monitoring system to track AI interactions and analytics capabilities to assess usage patterns and compliance. For sensitive data protection, organizations need to establish clear data governance policies and implement appropriate safety controls. Additionally, they need to develop or integrate a chargeback mechanism to track and allocate AI usage costs across different departments or projects. Knowledge of regulatory requirements specific to their industry is crucial to make sure the guardrails are properly configured to meet compliance standards.
The following diagram depicts a conceptual illustration of our proposed solution. The workflow begins when authenticated users send HTTPS requests to the generative AI gateway, a centralized application running on Amazon Elastic Container Service (Amazon ECS) that serves as the primary interface for the LLM interactions. Within the generative AI gateway application logic, each incoming request is first forwarded to the Amazon Bedrock ApplyGuardrail API for content screening. The generative AI gateway then evaluates the content against predefined configurations, making critical decisions to either block the request entirely, mask sensitive information, or allow it to proceed unmodified.
This evaluation process, integral to the functionality of the generative AI gateway, facilitates adherence to established safety and compliance guidelines. For requests that pass this screening, the generative AI gateway logic determines the appropriate LLM provider (either Amazon Bedrock or a third-party service) based on the user’s specifications. The screened content is then forwarded to the selected LLM for processing. Finally, the generative AI gateway receives the LLM’s response and returns it to the user, completing the interaction cycle. The response flow follows two distinct paths: blocked requests result in users receiving a blocked content message, and approved requests deliver the model’s response with the necessary content masking applied to the user prompt. In our implementation, guardrails are only applied to the input or prompt and not to the LLM responses. This streamlined process provides a unified approach to LLM access, security, and compliance for both Amazon Bedrock and third-party providers.
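To make the screening step concrete, the following is a minimal sketch of calling the ApplyGuardrail API with boto3; the guardrail ID and version are placeholders, and the gateway's real request handling is more involved:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def screen_prompt(prompt: str, guardrail_id: str, guardrail_version: str = "1"):
    """Screen an incoming user prompt before it is routed to any LLM provider."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",  # in this solution, guardrails are applied to the prompt only
        content=[{"text": {"text": prompt}}],
    )
    if response["action"] == "GUARDRAIL_INTERVENED":
        # Block the request or return the guardrail's masked/replacement output to the caller.
        return {"allowed": False, "outputs": response.get("outputs", [])}
    return {"allowed": True, "outputs": []}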

The generative AI gateway application is hosted on AWS Fargate, and it’s built using FastAPI. The application interacts with other Amazon Web Services (AWS) services such as Amazon Simple Storage Service (Amazon S3), Amazon Bedrock, Amazon Kinesis and Amazon Data Firehose. The solution includes a robust data persistence layer that captures the interaction details and stores them on Amazon S3 through Amazon Kinesis Data Streams and Amazon Data Firehose. Data persisted includes sanitized requests and responses, transaction information, guardrail metadata, and blocked content with associated metadata. This comprehensive logging facilitates full auditability and enables continuous improvement of the guardrail mechanisms.
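A minimal sketch of that persistence step, assuming a placeholder stream name and a simplified record shape (the gateway's real schema captures more metadata):

import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "genai-gateway-transactions"  # placeholder stream name

def log_transaction(request_id: str, user_id: str, guardrail_action: list,
                    sanitized_prompt: str, response_text: str):
    """Stream one transaction record toward Firehose and Amazon S3 for audit and chargeback."""
    record = {
        "requestid": request_id,
        "userid": user_id,
        "guardrail_action": guardrail_action,
        "prompt": sanitized_prompt,
        "response": response_text,
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=user_id,  # spreads records across shards by user
    )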
Solution components
Scalability of the solution is achieved using the following tools and technologies:

nginx to provide maximum performance and stability of the application by load balancing requests within each container.
Gunicorn, a Python Web Server Gateway Interface (WSGI) HTTP server commonly used to serve Python web applications in production environments. It’s a high-performance server that can handle multiple worker processes and concurrent requests efficiently. Gunicorn supports synchronous communications only but has robust process management functionality.
Uvicorn to provide lightweight and asynchronous request handling. Although Gunicorn is synchronous, it supports using asynchronous worker types such as Uvicorn, with which asynchronous communication can be established. This is needed for applications with longer wait times. In case of fetching responses from LLMs, you should anticipate higher wait times.
FastAPI to serve the actual requests at the generative AI gateway application layer.
Amazon ECS Fargate cluster to host the containerized application on AWS, and AWS Auto Scaling to scale up or down the tasks or containers automatically.
Amazon Elastic Container Registry (Amazon ECR) for storing the Docker image of the generative AI gateway application.
Elastic Load Balancing (ELB) and Application Load Balancer for load balancing of requests across ECS containers.
HashiCorp Terraform for resource provisioning.

The following figure illustrates the architecture design of the proposed solution. Consumer applications (such as on-premises business app, inference app, Streamlit app, and Amazon SageMaker Studio Lab), dashboard, and Azure Cloud components aren’t included in the accompanying GitHub repository. They’re included in the architecture diagram to demonstrate integrations with downstream and upstream systems.

Centralized guardrails
The generative AI gateway enforces comprehensive security controls through Amazon Bedrock Guardrails, using the ApplyGuardrail API to implement multiple layers of protection. These guardrails provide four core safety features: content filtering to screen inappropriate or harmful content, denied topics to help prevent specific subject matter discussions, word filters to block specific terms or phrases, and sensitive information detection to help protect personal and confidential data.
Organizations can implement these controls using three configurable strength levels—low, medium, and high. This way, business units can align their AI security posture with their specific risk tolerance and compliance requirements. For example, a marketing team might operate with low-strength guardrails for creative content generation, whereas financial or healthcare divisions might require high-strength guardrails for handling sensitive customer data. Beyond these basic protections, Amazon Bedrock Guardrails also includes advanced features such as contextual grounding and automated reasoning checks, which help detect and prevent AI hallucinations (instances where models generate false or misleading information). Users can extend the functionalities of the generative AI gateway to support these advanced features based on their use case.
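As an illustration of how a strength level maps onto guardrail configuration, the sketch below creates a high-strength guardrail with boto3; the filter types, denied topic, and messages are examples rather than the solution's exact policy set:

import boto3

bedrock = boto3.client("bedrock")

def create_high_strength_guardrail(name: str = "genai-gateway-high"):
    """Create a guardrail whose content filters run at HIGH strength and that denies investment advice."""
    response = bedrock.create_guardrail(
        name=name,
        description="High-strength guardrail for sensitive business units",
        contentPolicyConfig={
            "filtersConfig": [
                {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
                {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
                {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
            ]
        },
        topicPolicyConfig={
            "topicsConfig": [
                {
                    "name": "investment_topic",
                    "definition": "Providing specific investment or stock-picking advice.",
                    "type": "DENY",
                }
            ]
        },
        blockedInputMessaging="Sorry, the content doesn't comply with Responsible AI policies so it cannot be processed!",
        blockedOutputsMessaging="Sorry, the content doesn't comply with Responsible AI policies so it cannot be processed!",
    )
    return response["guardrailId"], response["version"]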
Multi-provider integration
The generative AI gateway is both LLM provider and model-agnostic, which enables seamless integration with multiple providers and LLMs. Users can specify their preferred LLM model directly in the request payload, allowing the gateway to route requests to the appropriate model endpoint. AWS Secrets Manager is used for storing the generative AI gateway API access tokens and access tokens from third-party LLMs such as Azure OpenAI. The generative AI gateway API token is used for authenticating the caller. The LLM access token is used for establishing client connection for third-party providers.
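A minimal sketch of how the gateway could fetch a provider token at request time (the secret name and JSON key are placeholders):

import json
import boto3

secrets = boto3.client("secretsmanager")

def get_provider_token(secret_id: str = "genai-gateway/azure-openai") -> str:
    """Fetch an LLM provider access token from AWS Secrets Manager at request time."""
    value = secrets.get_secret_value(SecretId=secret_id)
    # Secrets are commonly stored as JSON; fall back to the raw string otherwise.
    try:
        return json.loads(value["SecretString"])["api_key"]
    except (json.JSONDecodeError, KeyError):
        return value["SecretString"]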
Logging, monitoring and alerting
A key advantage of implementing a generative AI gateway is its centralized approach to logging and monitoring the LLM interactions. Every interaction, including user requests and prompts, LLM responses, and user context, is captured and stored in a standardized format and location. Organizations can use this collection strategy to perform analysis, troubleshoot issues, and derive insights. Logging, monitoring, and alerting is enabled using the following AWS services:

Amazon CloudWatch captures the container and application logs. We can create custom metrics on specific log messages and create an alarm that can be used for proactive alerting (for example, when a 500 Internal Server Error occurs)
Amazon Simple Notification Service (Amazon SNS) for notification to a distribution list (for example, when a 500 Internal Server Error happens)
Kinesis Data Streams and Data Firehose for streaming request and response data and metadata to Amazon S3 (for compliance and analytics or chargeback). Chargeback is a mechanism to attribute costs to a hierarchy of owners. For instance, an application running on AWS would incur some costs for every service, however the application could be serving an employee working for a project governed by a business unit. Chargeback is a process where costs can be attributed to the lowest level of an individual user with potential to roll up at multiple intermediate levels all the way to the business unit.
Amazon S3 for persisting requests and responses at the transaction level (for compliance), in addition to transaction metadata and metrics (for example, token counts) for analytics and chargeback.
AWS Glue Crawler API and Amazon Athena for exposing a SQL table of transaction metadata for analytics and chargeback.

Repository structure
The GitHub repository contains the following directories and files:

genai-gateway/
├── src/ — Main application code
│   ├── clients/ — API endpoints
│   ├── controllers/ — FastAPI application entry point
│   ├── generators/ — LLM integration
│   ├── persistence/ — Persistence logic
│   └── utils/
├── terraform/ — IaC
├── tests/ — Testing scripts
│   └── regressiontests/
├── .gitignore
├── .env.example
├── Dockerfile
├── ngnix.conf
├── asgi.py
├── docker-entrypoint.sh
├── requirements.txt
├── serve.py
└── README.md

Prerequisites
You need the following prerequisites before deploying this solution:

An AWS Account
An AWS Identity and Access Management (IAM) role with the following permissions:

Amazon S3 access (CreateBucket, PutObject, GetObject, DeleteObject)
AWS Secrets Manager access
Amazon CloudWatch logs access
Amazon Bedrock service
Amazon Bedrock foundation model (FM) access
Amazon Bedrock Guardrails IAM permissions

IAM permissions for Amazon Bedrock Guardrails:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail",
                "bedrock:ListGuardrails",
                "bedrock:GetGuardrail"
            ],
            "Resource": "arn:aws:bedrock:<AWS_REGION>:<AWS_ACCOUNT_ID>:guardrail/*"
        }
    ]
}

Access to the serverless FMs on Amazon Bedrock is automatically enabled. You don’t need to manually request or enable model access, but you can use IAM policies and service control policies to restrict model access as needed.
External LLM endpoints configured in the customer environment. For example, Azure OpenAI endpoints must be created in the customer Azure account following the naming convention {model_name}-{azure_tier}-{azure_region}, such as gpt-4o-dev-eastus.

Deploy the solution
The deployment guide provided in this section assumes a dev environment; similar steps can be used for higher environments.
To safeguard generative AI applications with centralized guardrails, follow these steps:

Clone the GitHub repository and make sure environment variables for AWS authentication are available in your environment.
Execute ./deploy.sh, which automatically sets up a Terraform state bucket, creates an IAM policy for Terraform, and provisions the infrastructure with dependencies.
Invoke ./verify.sh to verify the deployment and make sure the environment is ready for testing.
Follow the instructions in the README, Auth Token Generation for Consumers, to generate consumer authorization tokens.
Follow the instructions in the README, Testing the Gateway, to test your own generative AI gateway.

For development and testing, the entire setup can be run locally, with both the generative AI gateway server and the client running on the developer’s laptop, by following the local setup instructions in the README.
Examples
In this first example, the following code sample is a curl command that invokes the anthropic.claude-3-sonnet-20240229-v1:0 model with a high-strength guardrail to demonstrate how the generative AI gateway guardrails perform against denied topics. The example illustrates the effectiveness of the safety mechanism in blocking denied topics by asking the model, I want to sell my house and invest the proceeds in a single stock. Which stock should I buy?:

#!/bin/bash
# Default Configuration (can be overridden by environment variables)
URL=${URL:-"https://<your-alb-dns-name>/process"}
APPID=${APPID:-"admin"}
APITOKEN=${APITOKEN:-"<your-api-token>"}
MODEL=${MODEL:-"anthropic.claude-3-sonnet-20240229-v1:0"}
USER_PROMPT=${USER_PROMPT:-"I want to sell my house and invest the proceeds in a single stock. Which stock should I buy?"}
SYSTEM_PROMPT=${SYSTEM_PROMPT:-"You are an expert financial advisor"}
GUARDRAIL_STRENGTH=${GUARDRAIL_STRENGTH:-"high"}
ENABLE_GUARDRAIL=${ENABLE_GUARDRAIL:-"true"}
USERID=${USERID:-"skoppar"}
COSTCENTER=${COSTCENTER:-"ags"}
MAX_TOKENS=${MAX_TOKENS:-20}
REQUEST_ID=$(uuidgen | tr '[:upper:]' '[:lower:]' | tr -d '-')
REQUEST_DATETIME=$(date -u +"%Y-%m-%dT%H:%M:%S%z")
# Bedrock request payload
echo "Sending request to $URL..."
curl -k -X POST "$URL" \
  -H "Content-Type: application/json" \
  -H "appid: $APPID" \
  -H "apitoken: $APITOKEN" \
  -w "\nHTTP Status: %{http_code}\n" \
  -d @- << EOF
{
  "requestid": "$REQUEST_ID",
  "requestdatetime": "$REQUEST_DATETIME",
  "appid": "$APPID",
  "userid": "$USERID",
  "costcenter": "$COSTCENTER",
  "provider": "amazon-bedrock",
  "apicontext": "chatcompletions",
  "requestbody": {
    "model": "$MODEL",
    "body": {
      "anthropic_version": "bedrock-2023-05-31",
      "max_tokens": $MAX_TOKENS,
      "system": "$SYSTEM_PROMPT",
      "messages": [
        {
          "role": "user",
          "content": "$USER_PROMPT"
        }
      ]
    },
    "accept": "application/json",
    "contentType": "application/json"
  },
  "guardrail_strength": "$GUARDRAIL_STRENGTH",
  "enable_guardrail": $ENABLE_GUARDRAIL
}
EOF

The following sample code is the output from the preceding curl command. This result includes the model’s generated text and modifications or interventions applied by the high-strength guardrails. Analyzing this output helps verify the effectiveness of the guardrails and makes sure that the model’s response aligns with the specified safety and compliance parameters:

{
  "transactionid": "ff73cd3c-b924-40b3-85d7-bcd36cf26ab6",
  "dt": "20251027",
  "transactionstartdate": "2025-10-27 15:51:48+0000",
  "requestid": "6b274e0ad6ad447a90d33e882687767f",
  "requestdatetime": "2025-10-27T15:51:47+0000",
  "appid": "admin",
  "provider": "amazon-bedrock",
  "costcenter": "ags",
  "userid": "skoppar",
  "promptlength": 125,
  "guardrail_id": [
    "arn:aws:bedrock:us-east-1:<account-id>:guardrail/o9mj8miraler"
  ],
  "guardrail_action": [
    "topicPolicy"
  ],
  "enable_guardrail": true,
  "responsebody": "{"usage": {"topicPolicyUnits": 1, "contentPolicyUnits": 1, "wordPolicyUnits": 1, "sensitiveInformationPolicyUnits": 1, "sensitiveInformationPolicyFreeUnits": 1, "contextualGroundingPolicyUnits": 0}, "action": "GUARDRAIL_INTERVENED", "outputs": [{"text": "Sorry, the content doesn't comply with Responsible AI policies so it cannot be processed!"}], "assessments": [{"topicPolicy": {"topics": [{"name": "investment_topic", "type": "DENY", "action": "BLOCKED"}]}}]}"
}

The second example tests the ability of the generative AI gateway to help protect sensitive personal information. It simulates a user query containing personally identifiable information (PII) such as a name, Social Security number, and email address.

USER_PROMPT="My name is John Smith, my SSN is 123-45-6789, and my email is john.smith@email.com. Can you help me with my account?" ./bedrock_curl_test.sh

In this case, the guardrail successfully intervened and masked PII data before sending the user query to the LLM, as evidenced by the guardrail_action field, indicating the sensitiveInformationPolicy was applied:

{
  "transactionid": "47665380-bf9f-4ed2-836e-916199a45518",
  "dt": "20250626",
  "transactionstartdate": "2025-06-26 23:02:59+0000",
  "requestid": "ebaf1fbffcd344f3b3d96353e772205d",
  "requestdatetime": "2025-06-26T23:02:59+0000",
  "appid": "admin",
  "provider": "amazon-bedrock",
  "costcenter": "proserve",
  "userid": "bommi",
  "promptlength": 149,
  "guardrail_id": [
    "arn:aws:bedrock:us-east-1:<account-id>:guardrail/jvf0bhhvtyf7",
    "arn:aws:bedrock:us-east-1:<account-id>:guardrail/uekx7u8xra91"
  ],
  "guardrail_action": ["sensitiveInformationPolicy"],
  "enable_guardrail": true,
  "responsebody": {
    "id": "msg_bdrk_012UbTrdpzy3iZ2s9wcKF6PU",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-sonnet-20240229",
    "content": [
      {
        "type": "text",
        "text": "I'm afraid I cannot provide any personal information or account details. For privacy reasons, I do not"
      }
    ],
    "stop_reason": "max_tokens",
    "stop_sequence": null,
    "usage": { "input_tokens": 57, "output_tokens": 20 }
  }
}

For more comprehensive test scripts, please refer to the /test directory of the repository. These additional scripts offer a wider range of test cases and scenarios to thoroughly evaluate the functionality and performance of the generative AI gateway.
Clean up
Upon concluding your exploration of this solution, you can clean up the resources by following these steps:

Run terraform destroy to delete the resources provisioned by Terraform.
(Optional) From the AWS Management Console or AWS Command Line Interface (AWS CLI), delete resources that aren’t deleted by Terraform (such as the S3 bucket, ECR repository, and EC2 subnet).

Cost estimation
This section describes the underlying cost structure for running the solution. When implementing this solution, there are several cost categories to be considered:

LLM provider costs – These represent the charges for using foundation models through various providers, including models hosted on Amazon Bedrock and third-party providers. Costs are typically calculated based on:

Number of input and output tokens processed
Model complexity and capabilities
Usage volume and patterns
Service level requirements

AWS infrastructure costs – These encompass the infrastructure expenses associated with generative AI gateway:

Compute resources (Amazon ECS Fargate)
Load balancing (Application Load Balancer)
Storage (Amazon S3, Amazon ECR)
Monitoring (Amazon CloudWatch)
Data processing (Amazon Kinesis)
Security services (AWS Secrets Manager)

Amazon Bedrock Guardrails costs – These are specific charges for implementing safety and compliance features:

Content filtering and moderation
Policy enforcement
Sensitive data protection

The following tables provide a sample cost breakdown for deploying and using generative AI gateway. For actual pricing, refer to the AWS Pricing Calculator.
Infrastructure costs:

Service | Estimated usage | Estimated monthly cost
Amazon ECS Fargate | 2 tasks, 1 vCPU, 2 GB RAM, running constantly | $70–$100
Application Load Balancer | 1 ALB, running constantly | $20–$30
Amazon ECR | Storage for Docker images | $1–$5
AWS Secrets Manager | Storing API keys and tokens | $0.40 per secret per month
Amazon CloudWatch | Log storage and metrics | $10–$20
Amazon SNS | Notifications | $1–$2
Amazon Kinesis Data Streams | 1 stream, low volume | $15–$25
Amazon Data Firehose | 1 delivery stream | $0.029 per GB processed
Amazon S3 | Storage for logs and data | $2–$5
AWS Glue | Crawler runs (assuming weekly) | $5–$10
Amazon Athena | Query execution | $1–$5

LLM and guardrails costs:

Service | Estimated usage | Estimated monthly cost
Amazon Bedrock Guardrails | 10,000 API calls per month | $10–$20
Claude 3 Sonnet (input) | 1M tokens per month at $0.003 per 1K tokens | $3
Claude 3 Sonnet (output) | 500K tokens per month at $0.015 per 1K tokens | $7.50
GPT-4 Turbo, Azure OpenAI (input) | 1M tokens per month at $0.01 per 1K tokens | $10
GPT-4 Turbo, Azure OpenAI (output) | 500K tokens per month at $0.03 per 1K tokens | $15

Total estimated cost: $170–$260 (base)

LLM costs can vary significantly based on the number of API calls, input/output token lengths, model selection, and volume discounts. We consider a moderate usage scenario to be about 50–200 queries per day, with an average input length of 500 tokens and average output length of 250 tokens. These costs could increase substantially with higher query volumes, longer conversations, use of more expensive models, and multiple model calls per request.
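As a quick back-of-the-envelope check under those assumptions (200 queries per day, 500 input and 250 output tokens per query, Claude 3 Sonnet pricing from the table above):

# Rough monthly LLM cost estimate at the upper end of the moderate-usage scenario.
queries_per_day = 200
days_per_month = 30
avg_input_tokens = 500
avg_output_tokens = 250

input_price_per_1k = 0.003   # Claude 3 Sonnet input, USD per 1K tokens
output_price_per_1k = 0.015  # Claude 3 Sonnet output, USD per 1K tokens

monthly_queries = queries_per_day * days_per_month                               # 6,000 queries
input_cost = monthly_queries * avg_input_tokens / 1000 * input_price_per_1k      # ~$9.00
output_cost = monthly_queries * avg_output_tokens / 1000 * output_price_per_1k   # ~$22.50

print(f"Estimated monthly LLM cost: ${input_cost + output_cost:.2f}")            # ~$31.50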
Conclusion
Centralized guardrails integrated with a custom multi-provider generative AI gateway offer a robust and scalable approach for enterprises to safely use LLMs while maintaining security and compliance standards. Through its implementation of the Amazon Bedrock Guardrails ApplyGuardrail API, the solution provides consistent policy enforcement for prompt safety and sensitive data protection across both Amazon Bedrock and third-party LLM providers.
Key advantages of this solution include:

Centralized guardrails with configurable security levels
Multi-provider LLM integration capabilities
Comprehensive logging and monitoring features
Production-grade scalability through containerization
Built-in compliance and audit capabilities

Organizations, particularly those in highly regulated industries, can use this architecture to adopt and scale their generative AI implementations while maintaining control over data protection and AI safety regulations. The solution’s flexible design and robust infrastructure make it a valuable tool for enterprises that want to safely harness the power of generative AI while managing associated risks.

About the authors
Hasan Shojaei Ph.D., is a Sr. Data Scientist with AWS Professional Services, where he helps customers across different industries such as sports, financial services, and manufacturing solve their business challenges using advanced AI/ML technologies. Outside of work, Hasan is passionate about books, photography, and skiing.
Sunita Koppar is a Senior Specialist Solutions Architect in Generative AI and Machine Learning at AWS, where she partners with customers across diverse industries to design solutions, build proof-of-concepts, and drive measurable business outcomes. Beyond her professional role, she is deeply passionate about learning and teaching Sanskrit, actively engaging with student communities to help them upskill and grow.
Anuja Narwadkar is a Global Senior Engagement Manager in AWS Professional Services, specializing in enterprise-scale Machine Learning and GenAI transformations. She leads ProServe teams in strategizing, architecture, and building transformative AI/ML solutions on AWS for large enterprises across industries, including financial services. Beyond her professional role, she likes to drive AI up-skill initiatives especially for women, read and cook.
Krishnan Gopalakrishnan is a Delivery Consultant at AWS Professional Services with 12+ years in Enterprise Data Architecture and AI/ML Engineering. He architects cutting-edge data solutions for Fortune 500 companies, building mission-critical pipelines and Generative AI implementations across retail, healthcare, fintech, and manufacturing. Krishnan specializes in scalable, cloud-native architectures that transform enterprise data into actionable AI-powered insights, enabling measurable business outcomes through data-driven decision making.
Bommi Shin is a Delivery Consultant with AWS Professional Services, where she helps enterprise customers implement secure, scalable artificial intelligence solutions using cloud technologies. She specializes in designing and building AI/ML and Generative AI platforms that address complex business challenges across a range of industries. Outside of work, she enjoys traveling, exploring nature, and delicious foods.

How to Build a Stateless, Secure, and Asynchronous MCP-Style Protocol …

In this tutorial, we build a clean, advanced demonstration of modern MCP design by focusing on three core ideas: stateless communication, strict SDK-level validation, and asynchronous, long-running operations. We implement a minimal MCP-like protocol using structured envelopes, signed requests, and Pydantic-validated tools to show how agents and services can interact safely without relying on persistent sessions. Check out the FULL CODES here.

import asyncio, time, json, uuid, hmac, hashlib
from dataclasses import dataclass
from typing import Any, Dict, Optional, Literal, List
from pydantic import BaseModel, Field, ValidationError, ConfigDict

def _now_ms():
    return int(time.time() * 1000)

def _uuid():
    return str(uuid.uuid4())

def _canonical_json(obj):
    return json.dumps(obj, separators=(",", ":"), sort_keys=True).encode()

def _hmac_hex(secret, payload):
    return hmac.new(secret, _canonical_json(payload), hashlib.sha256).hexdigest()

We set up the core utilities required across the entire system, including time helpers, UUID generation, canonical JSON serialization, and cryptographic signing. We ensure that all requests and responses can be deterministically signed and verified using HMAC. Check out the FULL CODES here.

class MCPEnvelope(BaseModel):
    model_config = ConfigDict(extra="forbid")
    v: Literal["mcp/0.1"] = "mcp/0.1"
    request_id: str = Field(default_factory=_uuid)
    ts_ms: int = Field(default_factory=_now_ms)
    client_id: str
    server_id: str
    tool: str
    args: Dict[str, Any] = Field(default_factory=dict)
    nonce: str = Field(default_factory=_uuid)
    signature: str

class MCPResponse(BaseModel):
    model_config = ConfigDict(extra="forbid")
    v: Literal["mcp/0.1"] = "mcp/0.1"
    request_id: str
    ts_ms: int = Field(default_factory=_now_ms)
    ok: bool
    server_id: str
    status: Literal["ok", "accepted", "running", "done", "error"]
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    signature: str

We define the structured MCP envelope and response formats that every interaction follows. We enforce strict schemas using Pydantic to guarantee that malformed or unexpected fields are rejected early. It ensures consistent contracts between clients and servers, which is critical for SDK standardization. Check out the FULL CODES here.

class ServerIdentityOut(BaseModel):
    model_config = ConfigDict(extra="forbid")
    server_id: str
    fingerprint: str
    capabilities: Dict[str, Any]

class BatchSumIn(BaseModel):
    model_config = ConfigDict(extra="forbid")
    numbers: List[float] = Field(min_length=1)

class BatchSumOut(BaseModel):
    model_config = ConfigDict(extra="forbid")
    count: int
    total: float

class StartLongTaskIn(BaseModel):
    model_config = ConfigDict(extra="forbid")
    seconds: int = Field(ge=1, le=20)
    payload: Dict[str, Any] = Field(default_factory=dict)

class PollJobIn(BaseModel):
    model_config = ConfigDict(extra="forbid")
    job_id: str

We declare the validated input and output models for each tool exposed by the server. We use Pydantic constraints to clearly express what each tool accepts and returns. It makes tool behavior predictable and safe, even when invoked by LLM-driven agents. Check out the FULL CODES here.

@dataclass
class JobState:
    job_id: str
    status: str
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None

class MCPServer:
    def __init__(self, server_id, secret):
        self.server_id = server_id
        self.secret = secret
        self.jobs = {}
        self.tasks = {}

    def _fingerprint(self):
        return hashlib.sha256(self.secret).hexdigest()[:16]

    async def handle(self, env_dict, client_secret):
        env = MCPEnvelope(**env_dict)
        payload = env.model_dump()
        sig = payload.pop("signature")
        if _hmac_hex(client_secret, payload) != sig:
            return {"error": "bad signature"}

        if env.tool == "server_identity":
            out = ServerIdentityOut(
                server_id=self.server_id,
                fingerprint=self._fingerprint(),
                capabilities={"async": True, "stateless": True},
            )
            resp = MCPResponse(
                request_id=env.request_id,
                ok=True,
                server_id=self.server_id,
                status="ok",
                result=out.model_dump(),
                signature="",
            )

        elif env.tool == "batch_sum":
            args = BatchSumIn(**env.args)
            out = BatchSumOut(count=len(args.numbers), total=sum(args.numbers))
            resp = MCPResponse(
                request_id=env.request_id,
                ok=True,
                server_id=self.server_id,
                status="ok",
                result=out.model_dump(),
                signature="",
            )

        elif env.tool == "start_long_task":
            args = StartLongTaskIn(**env.args)
            jid = _uuid()
            self.jobs[jid] = JobState(jid, "running")

            async def run():
                await asyncio.sleep(args.seconds)
                self.jobs[jid].status = "done"
                self.jobs[jid].result = args.payload

            self.tasks[jid] = asyncio.create_task(run())
            resp = MCPResponse(
                request_id=env.request_id,
                ok=True,
                server_id=self.server_id,
                status="accepted",
                result={"job_id": jid},
                signature="",
            )

        elif env.tool == "poll_job":
            args = PollJobIn(**env.args)
            job = self.jobs[args.job_id]
            resp = MCPResponse(
                request_id=env.request_id,
                ok=True,
                server_id=self.server_id,
                status=job.status,
                result=job.result,
                signature="",
            )

        payload = resp.model_dump()
        resp.signature = _hmac_hex(self.secret, payload)
        return resp.model_dump()

We implement the stateless MCP server along with its async task management logic. We handle request verification, tool dispatch, and long-running job execution without relying on session state. By returning job identifiers and allowing polling, we demonstrate non-blocking, scalable task execution. Check out the FULL CODES here.

class MCPClient:
    def __init__(self, client_id, secret, server):
        self.client_id = client_id
        self.secret = secret
        self.server = server

    async def call(self, tool, args=None):
        env = MCPEnvelope(
            client_id=self.client_id,
            server_id=self.server.server_id,
            tool=tool,
            args=args or {},
            signature="",
        ).model_dump()
        env["signature"] = _hmac_hex(self.secret, {k: v for k, v in env.items() if k != "signature"})
        return await self.server.handle(env, self.secret)

async def demo():
    server_secret = b"server_secret"
    client_secret = b"client_secret"
    server = MCPServer("mcp-server-001", server_secret)
    client = MCPClient("client-001", client_secret, server)

    print(await client.call("server_identity"))
    print(await client.call("batch_sum", {"numbers": [1, 2, 3]}))

    start = await client.call("start_long_task", {"seconds": 2, "payload": {"task": "demo"}})
    jid = start["result"]["job_id"]

    while True:
        poll = await client.call("poll_job", {"job_id": jid})
        if poll["status"] == "done":
            print(poll)
            break
        await asyncio.sleep(0.5)

await demo()

We build a lightweight stateless client that signs each request and interacts with the server through structured envelopes. We demonstrate synchronous tool calls and asynchronous task polling in a single flow. It shows how clients can reliably consume MCP-style services in real agent pipelines.

In conclusion, we showed how MCP evolves from a simple tool-calling interface into a robust protocol suitable for real-world systems. We start tasks asynchronously and poll for results without blocking execution, enforce clear contracts through schema validation, and rely on stateless, signed messages to preserve security and flexibility. Together, these patterns demonstrate how modern MCP-style systems support reliable, enterprise-ready agent workflows while remaining simple, transparent, and easy to extend.

Check out the FULL CODES here. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Stateless, Secure, and Asynchronous MCP-Style Protocol for Scalable Agent Workflows appeared first on MarkTechPost.

Google AI Releases MedGemma-1.5: The Latest Update to their Open Medic …

Google Research has expanded its Health AI Developer Foundations program (HAI-DEF) with the release of MedGemma-1.5. The model is released as open starting points for developers who want to build medical imaging, text and speech systems and then adapt them to local workflows and regulations.

https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/

MedGemma 1.5, small multimodal model for real clinical data

MedGemma is a family of medical generative models built on Gemma. The new release, MedGemma-1.5-4B, targets developers who need a compact model that can still handle real clinical data. The previous MedGemma-1-27B model remains available for more demanding text heavy use cases.

MedGemma-1.5-4B is multimodal. It accepts text, two dimensional images, high dimensional volumes and whole slide pathology images. The model is part of the Health AI Developer Foundations program so it is intended as a base to fine tune, not a ready made diagnostic device.

https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/

Support for high dimensional CT, MRI and pathology

A major change in MedGemma-1.5 is support for high dimensional imaging. The model can process three dimensional CT and MRI volumes as sets of slices together with a natural language prompt. It can also process large histopathology slides by working over patches extracted from the slide.

On internal benchmarks, MedGemma-1.5 improves disease related CT findings from 58% to 61% accuracy and MRI disease findings from 51% to 65% accuracy when averaged over findings. For histopathology, the ROUGE L score on single slide cases increases from 0.02 to 0.49. This matches the 0.498 ROUGE L score of the task specific PolyPath model.

https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/

Imaging and report extraction benchmarks

MedGemma-1.5 also improves several benchmarks that are closer to production workflows.

On the Chest ImaGenome benchmark for anatomical localization in chest X rays, it improves intersection over union from 3% to 38%. On the MS-CXR-T benchmark for longitudinal chest X-ray comparison, macro-accuracy increases from 61% to 66%.

Across internal single image benchmarks that cover chest radiography, dermatology, histopathology and ophthalmology, average accuracy goes from 59% to 62%. These are simple single image tasks, useful as sanity checks during domain adaptation.

MedGemma-1.5 also targets document extraction. On medical laboratory reports, the model improves macro F1 from 60% to 78% when extracting lab type, value and units. For developers this means less custom rule based parsing for semi structured PDF or text reports.

Applications deployed on Google Cloud can now work directly with DICOM, which is the standard file format used in radiology. This removes the need for a custom preprocessor for many hospital systems.

https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/

Medical text reasoning with MedQA and EHRQA

MedGemma-1.5 is not only an imaging model. It also improves baseline performance on medical text tasks.

On MedQA, a multiple choice benchmark for medical question answering, the 4B model improves accuracy from 64% to 69% relative to the previous MedGemma-1. On EHRQA, a text based electronic health record question answering benchmark, accuracy increases from 68% to 90%.

These numbers matter if you plan to use MedGemma-1.5 as a backbone for tools such as chart summarization, guideline grounding or retrieval augmented generation over clinical notes. The 4B size keeps fine tuning and serving cost at a practical level.

MedASR, a domain tuned speech recognition model

Clinical workflows contain a large amount of dictated speech. MedASR is the new medical automated speech recognition model released together with MedGemma-1.5.

MedASR uses a Conformer based architecture that is pre trained and fine tuned for clinical audio. It targets tasks such as chest X-ray dictation, radiology reports and general medical notes. The model is available through the same Health AI Developer Foundations channel on Vertex AI and on Hugging Face.

In evaluations against Whisper-large-v3, a general ASR model, MedASR reduces word error rate for chest X-ray dictation from 12.5% to 5.2%. That corresponds to 58% fewer transcription errors. On a broader internal medical dictation benchmark, MedASR reaches 5.2% word error rate while Whisper-large-v3 has 28.2%, which corresponds to 82% fewer errors.

Key Takeaways

MedGemma-1.5-4B is a compact multimodal medical model that handles text, 2D images, 3D CT and MRI volumes and whole slide pathology, released as part of the Health AI Developer Foundations program for adaptation to local use cases.

On imaging benchmarks, MedGemma-1.5 improves CT disease findings from 58% to 61%, MRI disease findings from 51% to 65%, and histopathology ROUGE-L from 0.02 to 0.49, matching the PolyPath model performance.

For downstream clinical style tasks, MedGemma-1.5 increases Chest ImaGenome intersection over union from 3% to 38%, MS-CXR-T macro accuracy from 61% to 66% and lab report extraction macro F1 from 60% to 78% while keeping model size at 4B parameters.

MedGemma-1.5 also strengthens text reasoning, raising MedQA accuracy from 64% to 69% and EHRQA accuracy from 68% to 90%, which makes it suitable as a backbone for chart summarization and EHR question answering systems.

MedASR, a Conformer based medical ASR model in the same program, cuts word error rate on chest X-ray dictation from 12.5% to 5.2% and on a broad medical dictation benchmark from 28.2% to 5.2% compared to Whisper-large-v3, providing a domain tuned speech front end for MedGemma centered workflows.

Check out the model weights and technical details for MedGemma-1.5 and MedASR.

How AutoScout24 built a Bot Factory to standardize AI agent developmen …

AutoScout24 is Europe’s leading automotive marketplace platform that connects buyers and sellers of new and used cars, motorcycles, and commercial vehicles across several European countries. Their long-term vision is to build a Bot Factory, a centralized framework for creating and deploying artificial intelligence (AI) agents that can perform tasks and make decisions within workflows, to significantly improve operational efficiency across their organization.
From disparate experiments to a standardized framework
As generative AI agents (systems that can reason, plan, and act) became more powerful, the opportunity to improve internal productivity at AutoScout24 was clear. This led to various engineering teams experimenting with the technology. As AI innovation accelerated across AutoScout24, they recognized an opportunity to pioneer a standardized approach for AI development. While AutoScout24 had successfully experimented with various tools and frameworks on Amazon Web Services (AWS), they envisioned creating a unified, enterprise-grade framework that could enable faster innovation. Their goal was to establish a paved path that could make it easier for teams across the organization to build secure, scalable, and maintainable AI agents. The AutoScout24 AI Platform Engineering team partnered with the AWS Prototype and Cloud Engineering (PACE) team in a three-week AI bootcamp. The goal was to move from fragmented experiments to a coherent strategy by creating a reusable blueprint, a Bot Factory, to standardize how future AI agents are built and operated within their company.
The challenge: identifying a high-impact use case
To ground the Bot Factory blueprint in a tangible business case, the team targeted a significant operational cost: internal developer support. The problem was well-defined. AutoScout24 AI Platform engineers were spending up to 30% of their time on repetitive tasks like answering questions, granting access to tools, and locating documentation. This support tax reduced overall productivity. It diverted skilled engineers from high-priority feature development and forced other developers to wait for routine requests to be completed. An automated support bot was an ideal first use case because it needed to perform two core agent functions:

Knowledge retrieval: Answering “how-to” questions by searching internal documentation, a capability known as Retrieval Augmented Generation (RAG).
Action execution: Performing tasks in other systems, such as assigning a GitHub Copilot license, which requires secure API integration, or “tool use.”

By building a bot that could do both, the team could validate the blueprint while delivering immediate business value.
Architectural overview
In this post, we explore the architecture that AutoScout24 used to build their standardized AI development framework, enabling rapid deployment of secure and scalable AI agents.

The architecture is designed with a simple, decoupled flow to make sure the system is both resilient and straightforward to maintain. The diagram provides a simplified view focused on the core generative-AI workflow. In a production environment, additional AWS services such as AWS Identity and Access Management (IAM), Amazon CloudWatch, AWS X-Ray, AWS CloudTrail, AWS Web Application Firewall (WAF), and AWS Key Management Service (KMS) could be integrated to enhance security, observability, and operational governance.
Here is how a request flows through the system:

User interaction via Slack: A developer posts a message in a support channel, for example, “@SupportBot, can I get a GitHub Copilot license?”
Secure ingress via Amazon API Gateway & AWS Lambda: Slack sends the event to an Amazon API Gateway endpoint, which triggers an AWS Lambda function. This function performs an essential security check, verifying the request’s cryptographic signature to confirm it’s authentically from Slack.
Decoupling via Amazon Simple Queue Service (SQS): The verified request is placed onto an Amazon SQS First-In, First-Out (FIFO) queue. This decouples the front-end from the agent, improving resilience. Using a FIFO queue with the message’s thread timestamp as the MessageGroupId makes sure that replies within a single conversation are processed in order, keeping multi-message threads coherent (see the sketch after this list).
Agent execution via Amazon Bedrock AgentCore: The SQS queue triggers a Lambda function when messages arrive, which activates the agent running in the AgentCore Runtime. AgentCore manages the operational tasks, including orchestrating calls to the foundation model and the agent’s tools. The Orchestrator Agent’s logic, built with Strands Agents, analyzes the user’s prompt and determines the correct specialized agent to invoke—either the Knowledge Base Agent for a question or the GitHub Agent for an action request.
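
The ingress and queueing steps can be sketched as follows. This is a hypothetical, simplified Lambda handler rather than AutoScout24’s actual code: it verifies Slack’s v0 request signature with the signing secret, then places the event on the FIFO queue using the thread timestamp as the MessageGroupId (the environment variable names are assumptions):

# Hypothetical sketch of the ingress Lambda: verify the Slack signature,
# then enqueue the verified event on the FIFO queue keyed by Slack thread.
import hashlib
import hmac
import json
import os
import time

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SUPPORT_BOT_QUEUE_URL"]            # assumed environment variable
SLACK_SIGNING_SECRET = os.environ["SLACK_SIGNING_SECRET"]  # assumed environment variable

def slack_signature_is_valid(headers: dict, body: str) -> bool:
    timestamp = headers.get("x-slack-request-timestamp", "0")
    if abs(time.time() - int(timestamp)) > 300:  # reject stale requests (replay protection)
        return False
    basestring = f"v0:{timestamp}:{body}"
    expected = "v0=" + hmac.new(
        SLACK_SIGNING_SECRET.encode(), basestring.encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, headers.get("x-slack-signature", ""))

def handler(event, context):
    headers = {k.lower(): v for k, v in event.get("headers", {}).items()}
    body = event.get("body", "")
    if not slack_signature_is_valid(headers, body):
        return {"statusCode": 401, "body": "invalid signature"}

    slack_event = json.loads(body).get("event", {})
    thread_ts = slack_event.get("thread_ts") or slack_event.get("ts", "unknown")

    # FIFO queue: one MessageGroupId per Slack thread keeps replies in order.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        MessageGroupId=thread_ts,
        MessageDeduplicationId=f"{thread_ts}:{slack_event.get('event_ts', thread_ts)}",
    )
    return {"statusCode": 200, "body": "ok"}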

A crucial implementation detail is how the system leverages AgentCore’s complete session isolation. To maintain conversational context, the system generates a unique, deterministic sessionId for each Slack thread by combining the channel ID and the thread’s timestamp. This sessionId is passed with every agent invocation within that thread. Interactions in a thread share this same sessionId, so the agent treats them as one continuous conversation. Meanwhile, interactions in other threads get different sessionIds, keeping their contexts separate. In effect, each conversation runs in an isolated session: AgentCore spins up separate resources per sessionId, so context and state do not leak between threads. In practice, this means that if a developer sends multiple messages in one Slack thread, the agent remembers the earlier parts of that conversation. Each thread’s history is preserved automatically by AgentCore.
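A minimal sketch of that session scheme follows, assuming the agent is reached through the bedrock-agentcore data-plane client in boto3; the helper names and payload shape are illustrative, not AutoScout24’s code:

# Hypothetical sketch: derive one deterministic session id per Slack thread so
# AgentCore keeps each conversation's context isolated from every other thread.
import hashlib
import json

import boto3

agentcore = boto3.client("bedrock-agentcore")

def thread_session_id(channel_id: str, thread_ts: str) -> str:
    # Hashing the pair gives a stable id per thread that is long enough for the runtime.
    return hashlib.sha256(f"{channel_id}:{thread_ts}".encode()).hexdigest()

def invoke_support_agent(runtime_arn: str, channel_id: str, thread_ts: str, prompt: str):
    # Every message in the same Slack thread reuses the same runtimeSessionId,
    # so the agent treats the thread as one continuous conversation.
    return agentcore.invoke_agent_runtime(
        agentRuntimeArn=runtime_arn,
        runtimeSessionId=thread_session_id(channel_id, thread_ts),
        payload=json.dumps({"prompt": prompt}).encode("utf-8"),
    )
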
This session management strategy is also vital for observability. Using the unique sessionId, each interaction can be traced with AWS X-Ray, which offers insight into the full flow: from the Slack message arriving at API Gateway, to the message being enqueued in SQS, through the orchestrator’s processing, the call to the foundation model, any tool invocations (such as a knowledge-base lookup or a GitHub API call), and finally the response back to Slack.
Per-step metadata and timing show where time is spent. If a step fails or is slow (for example, a timeout on an external API call), X-Ray pinpoints which step caused the issue, which makes it much faster to diagnose problems and build confidence in the system’s behavior.
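For example, the worker can annotate each trace with the session id. The sketch below uses the aws-xray-sdk and assumes active tracing is enabled on the Lambda functions; the subsegment name is an illustrative choice:

# Sketch: tag X-Ray traces with the Slack-thread session id so a whole
# conversation can be pulled up with one filter expression.
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # instrument boto3 and other supported libraries automatically

def process_message(session_id: str, prompt: str) -> None:
    # In the X-Ray console, filter with: annotation.sessionId = "<session id>"
    with xray_recorder.in_subsegment("agent_invocation") as subsegment:
        subsegment.put_annotation("sessionId", session_id)
        # ... invoke the AgentCore runtime and post the reply back to Slack ...
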
The solution: A reusable blueprint powered by AWS
The Bot Factory architecture designed by the AutoScout24 and AWS teams is event-driven, serverless, and built on a foundation of managed AWS services. This approach provides a resilient and scalable pattern that can be adapted for new use cases.
The solution builds on Amazon Bedrock and its integrated capabilities:

Amazon Bedrock provides access to high-performing foundation models (FMs), which act as the reasoning engine for the agent.
Amazon Bedrock Knowledge Bases enables the RAG capability, allowing the agent to connect to AutoScout24’s internal documentation and retrieve information to answer questions accurately.
Amazon Bedrock AgentCore is a key component of the operational side of the blueprint. It provides the fully managed, serverless runtime environment to deploy, operate, and scale the agents.

This solution provides a significant advantage for AutoScout24. Instead of building foundational infrastructure for session management, security, and observability, they use AgentCore’s purpose-built services. This allows the team to focus on the agent’s business logic rather than the underlying infrastructure. AgentCore also provides built-in security and isolation features. Each agent invocation runs in its own isolated container, helping to prevent data leakage between sessions. Agents are assigned specific IAM roles to restrict their AWS permissions (following the principle of least privilege). Credentials or tokens needed by agent tools (such as a GitHub API key) are stored securely in AWS Secrets Manager and accessed at runtime. These features give the team a secure environment for running agents with minimal custom infrastructure.
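As an illustration, and with a hypothetical secret name, a tool can fetch its GitHub token from AWS Secrets Manager at call time instead of bundling credentials with the code:

# Sketch: fetch the GitHub token for the Copilot-seat tool from Secrets Manager
# at runtime. The secret id shown here is a hypothetical example.
import json

import boto3

def get_github_token(secret_id: str = "bot-factory/github-api-token") -> str:
    secrets = boto3.client("secretsmanager")
    value = secrets.get_secret_value(SecretId=secret_id)
    secret_string = value["SecretString"]
    # Secrets can be stored as plain strings or as JSON key-value pairs; handle both.
    try:
        return json.loads(secret_string)["token"]
    except (ValueError, KeyError):
        return secret_string
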
The agent itself was built using the Strands Agents SDK, an open-source framework that simplifies defining an agent’s logic, tools, and behavior in Python. This combination proves effective: Strands to build the agent, and AgentCore to securely run it at scale.
The team adopted a sophisticated “agents-as-tools” design pattern, where a central orchestrator agent acts as the main controller. This orchestrator does not contain the logic for every possible task. Instead, it intelligently delegates requests to specialized, single-purpose agents. For the support bot, this included a Knowledge Base agent for handling informational queries and a GitHub agent for executing actions like assigning licenses. This modular design makes it straightforward to extend the system with new capabilities, such as adding a PR review agent, without re-architecting the entire pipeline.
Running these agents on Amazon Bedrock further enhances flexibility, since the team can choose from a broad range of foundation models. More powerful models can be applied to complex reasoning tasks, while lighter, cost-efficient models are well-suited for routine worker agents such as GitHub license requests or operational workflows. This ability to mix and match models allows AutoScout24 to balance cost, performance, and accuracy across their agent architecture.
Orchestrator agent: built with Strands SDK
Using the Strands Agents SDK helped the team to define the orchestrator agent with concise, declarative code. The framework uses a model-driven approach, where the developer focuses on defining the agent’s instructions and tools, and the foundation model handles the reasoning and planning. The orchestrator agent can be expressed in just a few dozen lines of Python. The example snippet below (simplified for clarity, not intended for direct use) shows how the agent is configured with a model, a system prompt, and a list of tools (which in this architecture represent the specialized agents):

# A simplified, representative example of the orchestrator agent logic
# built with the Strands Agents SDK and deployed on Amazon Bedrock AgentCore.
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent
from strands.models import BedrockModel
from tools import knowledge_base_query_tool, github_copilot_seat_agent

# Initialize the AgentCore application, which acts as the serverless container
app = BedrockAgentCoreApp()

class OrchestratorAgent:
    def __init__(self):
        # 1. Define the Model: Point to a foundation model in Amazon Bedrock.
        self.model = BedrockModel(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

        # 2. Define the Prompt: Give the agent its core instructions.
        self.system_prompt = """
        You are a helpful and friendly support bot for the AutoScout24 Platform Engineering team.
        Your goal is to answer developer questions and automate common requests.
        Use your tools to answer questions or perform actions.
        If you cannot handle a request, politely say so.
        """

        # 3. Define the Tools: Provide the agent with its capabilities.
        # These tools are entry points to other specialized Strands agents.
        self.tools = [
            knowledge_base_query_tool,
            github_copilot_seat_agent,
        ]

        # Create the agent instance
        self.agent = Agent(
            model=self.model,
            system_prompt=self.system_prompt,
            tools=self.tools,
        )

    def __call__(self, user_input: str):
        # Run the agent to get a response for the user's input
        return self.agent(user_input)

# Define the entry point that AgentCore will invoke when a new event arrives from SQS
@app.entrypoint
def main(event):
    # Extract the user's query from the incoming event
    user_query = event.get("prompt")

    # Instantiate and run the orchestrator agent
    return OrchestratorAgent()(user_query)

Another example is the GitHub Copilot license agent. It is implemented as a Strands tool function. The following snippet (again simplified, with illustrative placeholder configuration) shows how the team defined it using the @tool decorator. The tool function creates a GitHubCopilotSeatAgent, passes the user’s request (a GitHub username) to it, and returns the result:

from strands import Agent, tool
from strands.models import BedrockModel

class GitHubCopilotSeatAgent:
    def __init__(self):
        # Illustrative placeholders: the production agent wires in a task-specific prompt and a GitHub API tool.
        self.model = BedrockModel(model_id="anthropic.claude-3-sonnet-20240229-v1:0")
        self.system_prompt = "You process GitHub Copilot license requests for developers."
        self.tools = []

    def __call__(self, query: str):
        agent = Agent(model=self.model, system_prompt=self.system_prompt, tools=self.tools)
        return agent(query)

@tool
def github_copilot_seat_agent(github_username: str) -> str:
    agent = GitHubCopilotSeatAgent()
    response = agent(f"Request GitHub Copilot license for user: {github_username}")
    return str(response)

Key benefits of this approach include clear separation of concerns. The developer writes declarative code focused on the agent’s purpose. The complex infrastructure logic, including scaling, session management, and secure execution, is handled by Amazon Bedrock AgentCore. This abstraction enables rapid development and allowed AutoScout24 to move from prototype to production more quickly. The tools list effectively makes other agents callable functions, allowing the orchestrator to delegate tasks without needing to know their internal implementation.
The impact: A validated blueprint for enterprise AI
The Bot Factory project delivered results that extend beyond the initial prototype, creating immediate business value and establishing a strategic foundation for future AI innovation at AutoScout24. The key outcomes were:

A production-ready support bot: The team deployed a functional Slack bot that is actively reducing the manual support load on the AutoScout24 AI Platform Engineering Team, addressing the 30% of time previously spent on repetitive tasks.
A reusable Bot Factory blueprint: The project produced a validated, reusable architectural pattern. Now, teams at AutoScout24 can build a new agent by starting with this proven template (Slack -> API Gateway -> SQS -> AgentCore). This significantly accelerates innovation by allowing teams to focus on their unique business logic, not on reinventing the infrastructure. This modular design also prepares them for more advanced multi-agent collaboration, potentially using standards like the Agent-to-Agent (A2A) protocol as their needs evolve.
Enabling broader AI development: By abstracting away the infrastructure complexity, the Bot Factory empowers more people to build AI solutions. A domain expert in security or data analytics can now create a new tool or specialized agent and “plug it in” to the factory without needing to be an expert in distributed systems.

Conclusion: A new model for enterprise agents
AutoScout24’s partnership with AWS turned fragmented generative AI experiments into a scalable, standardized framework. By adopting Amazon Bedrock AgentCore, the team moved their support bot from prototype to production, while focusing on their Bot Factory vision. AgentCore manages session state and scaling, so engineers can focus on high-value business logic instead of infrastructure. The outcome is more than a support bot: it’s a reusable foundation for building enterprise agents. With AgentCore, AutoScout24 can move from prototype to production efficiently, setting a model for how organizations can standardize generative AI development on AWS. To start building enterprise agents with Amazon Bedrock, explore the following resources:

Amazon Bedrock AgentCore documentation
Amazon Bedrock Knowledge Bases documentation
Securely launch and scale your agents with Amazon Bedrock AgentCore
Build trustworthy AI agents with Amazon Bedrock AgentCore Observability

About the authors
Andrew Shved is a Senior AWS Prototyping Architect who leads teams and customers in building and shipping Generative AI–driven solutions, from early prototypes to production on AWS.
Muhammad Uzair Aslam is a tenured Technical Program Manager on the AWS Prototyping team, where he works closely with customers to accelerate their cloud and AI journeys. He thrives on diving deep into technical details and turning complexity into impactful, value-driven solutions.
Arslan Mehboob is a Platform Engineer and AWS-certified solutions architect with deep expertise in cloud infrastructure, scalable systems, and software engineering. He currently builds resilient cloud platforms and is passionate about AI and emerging technologies.
Vadim Shiianov is a Data Scientist specializing in machine learning and AI-driven systems for real-world business applications. He works on designing and deploying ML and Generative AI solutions that translate complex data into measurable impact. He is passionate about emerging technologies and building practical, scalable systems around them.