This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead

Can your AI security stack profile, reason, and neutralize a live security threat in ~220 ms—without a central round-trip? A team of researchers from Google and the University of Arkansas at Little Rock outlines an agentic cybersecurity “immune system” built from lightweight, autonomous sidecar AI agents colocated with workloads (Kubernetes pods, API gateways, edge services). Instead of exporting raw telemetry to a SIEM and waiting on batched classifiers, each agent learns local behavioral baselines, evaluates anomalies using federated intelligence, and applies least-privilege mitigations directly at the point of execution. In a controlled cloud-native simulation, this edge-first loop cut decision-to-mitigation to ~220 ms (≈3.4× faster than centralized pipelines), achieved F1 ≈ 0.89, and held host overhead under 10% CPU/RAM—evidence that collapsing detection and enforcement into the workload plane can deliver both speed and fidelity without material resource penalties.

https://arxiv.org/abs/2509.20640

What does “Profile → Reason → Neutralize” mean at the primitive level?

Profile. Agents are deployed as sidecars/daemonsets alongside microservices and API gateways. They build behavioral fingerprints from execution traces, syscall paths, API call sequences, and inter-service flows. This local baseline adapts to short-lived pods, rolling deploys, and autoscaling—conditions that routinely break perimeter controls and static allowlists. Profiling is not just a threshold on counts; it retains structural features (order, timing, peer set) that allow detection of zero-day-like deviations. The research team frames this as continuous, context-aware baselining across ingestion and sensing layers so that “normal” is learned per workload and per identity boundary.

Reason. When an anomaly appears (for example, an unusual burst of high-entropy uploads from a low-trust principal or a never-seen-before API call graph), the local agent mixes anomaly scores with federated intelligence—shared indicators and model deltas learned by peers—to produce a risk estimate. Reasoning is designed to be edge-first: the agent decides without a round-trip to a central adjudicator, and the trust decision is continuous rather than a static role gate. This aligns with zero-trust—identity and context are evaluated at each request, not just at session start—and it reduces central bottlenecks that add seconds of latency under load.

Neutralize. If risk exceeds a context-sensitive threshold, the agent executes an immediate local control mapped to least-privilege actions: quarantine the container (pause/isolate), rotate a credential, apply a rate-limit, revoke a token, or tighten a per-route policy. Enforcement is written back to policy stores and logged with a human-readable rationale for audit. The fast path here is the core differentiator: in the reported evaluation, the autonomous path triggers in ~220 ms versus ~540–750 ms for centralized ML or firewall update pipelines, which translates into a ~70% latency reduction and fewer opportunities for lateral movement during the decision window.
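
As a rough illustration of this loop (not the paper's implementation; all class names, scoring weights, and actions below are hypothetical), a sidecar agent's fast path can be sketched in a few dozen lines of Python:

```python
# A rough sketch of the edge-first Profile -> Reason -> Neutralize loop;
# names and weights are hypothetical, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Event:
    principal: str
    api_path: str
    bytes_out: int
    peer: str

@dataclass
class LocalBaseline:
    """Per-workload behavioral fingerprint (call paths, peer set, volume)."""
    seen_paths: set = field(default_factory=set)
    seen_peers: set = field(default_factory=set)

    def anomaly_score(self, ev: Event) -> float:
        score = 0.0
        if ev.api_path not in self.seen_paths:
            score += 0.5          # never-seen-before API call path
        if ev.peer not in self.seen_peers:
            score += 0.3          # new inter-service peer
        if ev.bytes_out > 5_000_000:
            score += 0.2          # unusually large egress burst
        return min(score, 1.0)

    def update(self, ev: Event) -> None:
        self.seen_paths.add(ev.api_path)
        self.seen_peers.add(ev.peer)

def reason(local: float, federated: float) -> float:
    # Blend the local anomaly score with federated indicators (weights illustrative).
    return 0.7 * local + 0.3 * federated

def neutralize(ev: Event, risk: float, threshold: float = 0.8) -> None:
    if risk >= threshold:
        # Least-privilege local actions; this print stands in for runtime/policy calls.
        print(f"quarantine workload and revoke token for {ev.principal} (risk={risk:.2f})")

baseline = LocalBaseline()
ev = Event("svc-billing", "/v1/export", 8_000_000, "unknown-host")
risk = reason(baseline.anomaly_score(ev), federated=0.6)
neutralize(ev, risk)
baseline.update(ev)               # keep learning either way
```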

Where do the numbers come from, and what were the baselines?

The research team evaluated the architecture in a Kubernetes-native simulation spanning API abuse and lateral-movement scenarios. Against two typical baselines—(i) static rule pipelines and (ii) a batch-trained classifier—the agentic approach reports Precision 0.91 / Recall 0.87 / F1 0.89, while the baselines land near F1 0.64 (rules) and F1 0.79 (baseline ML). Decision latency falls to ~220 ms for local enforcement, compared with ~540–750 ms for centralized paths that require coordination with a controller or external firewall. Resource overhead on host services remains below 10% in CPU/RAM.


Why does this matter for zero-trust engineering, not just research graphs?

Zero-trust (ZT) calls for continuous verification at request-time using identity, device, and context. In practice, many ZT deployments still defer to central policy evaluators, so they inherit control-plane latency and queueing pathologies under load. By moving risk inference and enforcement to the autonomous edge, the architecture turns ZT posture from periodic policy pulls into a set of self-contained, continuously learning controllers that execute least-privilege changes locally and then synchronize state. That design simultaneously reduces mean time-to-contain (MTTC) and keeps decisions near the blast radius, which helps when inter-pod hops are measured in milliseconds. The research team also formalizes federated sharing to distribute indicators/model deltas without heavy raw-data movement, which is relevant for privacy boundaries and multi-tenant SaaS.

How does it integrate with existing stacks—Kubernetes, APIs, and identity?

Operationally, the agents are co-located with workloads (sidecar or node daemon). On Kubernetes, they can hook CNI-level telemetry for flow features, container runtime events for process-level signals, and envoy/nginx spans at API gateways for request graphs. For identity, they consume claims from your IdP and compute continuous trust scores that factor recent behavior and environment (e.g., geo-risk, device posture). Mitigations are expressed as idempotent primitives—network micro-policy updates, token revocation, per-route quotas—so they are straightforward to roll back or tighten incrementally. The architecture’s control loop (sense → reason → act → learn) is strictly feedback-driven and supports both human-in-the-loop (policy windows, approval gates for high-blast-radius changes) and autonomy for low-impact actions.
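
To make the “idempotent primitives” point concrete, here is a minimal, hypothetical sketch: applying the same mitigation twice converges to one policy entry, which is what makes local enforcement safe to retry and easy to roll back. The policy-store shape is an assumption, not an API from the paper.

```python
# Hypothetical sketch of a mitigation expressed as an idempotent primitive:
# re-applying the same clamp converges to one policy entry, so the agent can
# retry safely and the change is easy to roll back.
def clamp_route_quota(policy_store: dict, route: str, rps_limit: int) -> dict:
    current = policy_store.get(route, {}).get("rps")
    if current is None or current > rps_limit:
        policy_store[route] = {"rps": rps_limit, "reason": "agent clamp"}
    return policy_store

store = {}
clamp_route_quota(store, "/v1/export", 50)
clamp_route_quota(store, "/v1/export", 50)   # second call is a no-op
print(store)                                 # {'/v1/export': {'rps': 50, 'reason': 'agent clamp'}}
```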

What are the governance and safety guardrails?

Speed without auditability is a non-starter in regulated environments. The research team emphasizes explainable decision logs that capture which signals and thresholds led to the action, with signed and versioned policy/model artifacts. It also discusses privacy-preserving modes—keeping sensitive data local while sharing model updates; differentially private updates are mentioned as an option in stricter regimes. For safety, the system supports override/rollback and staged rollouts (e.g., canarying new mitigation templates in non-critical namespaces). This is consistent with broader security work on threats and guardrails for agentic systems; if your org is adopting multi-agent pipelines, cross-check against current threat models for agent autonomy and tool use.

How do the reported results translate to production posture?

The evaluation is a 72-hour cloud-native simulation with injected behaviors: API misuse patterns, lateral movement, and zero-day-like deviations. Real systems will add messier signals (e.g., noisy sidecars, multi-cluster networking, mixed CNI plugins), which affects both detection and enforcement timing. That said, the fast-path structure—local decision + local act—is topology-agnostic and should preserve order-of-magnitude latency gains so long as mitigations are mapped to primitives available in your mesh/runtime. For production, begin with observe-only agents to build baselines, then turn on mitigations for low-risk actions (quota clamps, token revokes), then gate high-blast-radius controls (network slicing, container quarantine) behind policy windows until confidence/coverage metrics are green.

How does this sit in the broader agentic-security landscape?

There is growing research on securing agent systems and using agent workflows for security tasks. The work discussed here focuses on defense via agent autonomy close to workloads. In parallel, other work tackles threat modeling for agentic AI, secure A2A protocol usage, and agentic vulnerability testing. If you adopt the architecture, pair it with a current agent-security threat model and a test harness that exercises tool-use boundaries and memory safety of agents.

Comparative Results (Kubernetes simulation)

| Metric | Static rules pipeline | Baseline ML (batch classifier) | Agentic framework (edge autonomy) |
| --- | --- | --- | --- |
| Precision | 0.71 | 0.83 | 0.91 |
| Recall | 0.58 | 0.76 | 0.87 |
| F1 | 0.64 | 0.79 | 0.89 |
| Decision-to-mitigation latency | ~750 ms | ~540 ms | ~220 ms |
| Host overhead (CPU/RAM) | Moderate | Moderate | <10% |

Key Takeaways

Edge-first “cybersecurity immune system.” Lightweight sidecar/daemon AI agents colocated with workloads (Kubernetes pods, API gateways) learn behavioral fingerprints, decide locally, and enforce least-privilege mitigations without SIEM round-trips.

Measured performance. Reported decision-to-mitigation is ~220 ms—about 3.4× faster than centralized pipelines (≈540–750 ms)—with F1 ≈ 0.89 (P≈0.91, R≈0.87) in a Kubernetes simulation.

Low operational cost. Host overhead remains <10% CPU/RAM, making the approach practical for microservices and edge nodes.

Profile → Reason → Neutralize loop. Agents continuously baseline normal activity (profile), fuse local signals with federated intelligence for risk scoring (reason), and apply immediate, reversible controls such as container quarantine, token rotation, and rate-limits (neutralize).

Zero-trust alignment. Decisions are continuous and context-aware (identity, device, geo, workload), replacing static role gates and reducing dwell time and lateral movement risk.

Governance and safety. Actions are logged with explainable rationales; policies/models are signed and versioned; high-blast-radius mitigations can be gated behind human-in-the-loop and staged rollouts.

Summary

Treat defense as a distributed control plane made of profiling, reasoning, and neutralizing agents that act where the threat lives. The reported profile—~220 ms actions, ≈ 3.4× faster than centralized baselines, F1 ≈ 0.89, <10% overhead—is consistent with what you’d expect when you eliminate central hops and let autonomy handle least-privilege mitigations locally. It aligns with zero-trust’s continuous verification and gives teams a practical path to self-stabilizing operations: learn normal, flag deviations with federated context, and contain early—before lateral movement outpaces your control plane.

Check out the Paper and GitHub Page.


Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World

Can a single AI stack plan like a researcher, reason over scenes, and transfer motions across different robots—without retraining from scratch? Google DeepMind’s Gemini Robotics 1.5 says yes, by splitting embodied intelligence into two models: Gemini Robotics-ER 1.5 for high-level embodied reasoning (spatial understanding, planning, progress/success estimation, tool-use) and Gemini Robotics 1.5 for low-level visuomotor control. The system targets long-horizon, real-world tasks (e.g., multi-step packing, waste sorting with local rules) and introduces motion transfer to reuse data across heterogeneous platforms.

https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/

What actually is the stack?

Gemini Robotics-ER 1.5 (reasoner/orchestrator): A multimodal planner that ingests images/video (and optionally audio), grounds references via 2D points, tracks progress, and invokes external tools (e.g., web search or local APIs) to fetch constraints before issuing sub-goals. It’s available via the Gemini API in Google AI Studio.

Gemini Robotics 1.5 (VLA controller): A vision-language-action model that converts instructions and percepts into motor commands, producing explicit “think-before-act” traces to decompose long tasks into short-horizon skills. Availability is limited to selected partners during the initial rollout.
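
For developers who want to try the reasoning half, a minimal sketch of querying it through the Gemini API (Python google-genai SDK) might look like the following; the model id string is an assumption, so check the current model list in AI Studio before use.

```python
# Hedged sketch: querying the embodied-reasoning model through the Gemini API.
# The model id below is an assumption; in practice you would also attach an
# image part for the scene rather than describing it in text.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",   # assumed preview model id
    contents=(
        "Given a photo of my desk (attach an image part in practice), "
        "return the 2D point of the mug and a 3-step plan to place it on the shelf."
    ),
)
print(response.text)
```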

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf

Why split cognition from control?

Earlier end-to-end VLAs (Vision-Language-Action) struggle to plan robustly, verify success, and generalize across embodiments. Gemini Robotics 1.5 isolates those concerns: Gemini Robotics-ER 1.5 handles deliberation (scene reasoning, sub-goaling, success detection), while the VLA specializes in execution (closed-loop visuomotor control). This modularity improves interpretability (visible internal traces), error recovery, and long-horizon reliability.

Motion Transfer across embodiments

A core contribution is Motion Transfer (MT): training the VLA on a unified motion representation built from heterogeneous robot data—ALOHA, bi-arm Franka, and Apptronik Apollo—so skills learned on one platform can zero-shot transfer to another. This reduces per-robot data collection and narrows sim-to-real gaps by reusing cross-embodiment priors.

Quantitative signals

The research team showcased controlled A/B comparisons on real hardware and in aligned MuJoCo scenes. These include:

Generalization: Robotics 1.5 surpasses prior Gemini Robotics baselines in instruction following, action generalization, visual generalization, and task generalization across the three platforms.

Zero-shot cross-robot skills: MT yields measurable gains in progress and success when transferring skills across embodiments (e.g., Franka→ALOHA, ALOHA→Apollo), rather than merely improving partial progress.

“Thinking” improves acting: Enabling VLA thought traces increases long-horizon task completion and stabilizes mid-rollout plan revisions.

End-to-end agent gains: Pairing Gemini Robotics-ER 1.5 with the VLA agent substantially improves progress on multi-step tasks (e.g., desk organization, cooking-style sequences) versus a Gemini-2.5-Flash-based baseline orchestrator.


Safety and evaluation

DeepMind research team highlights layered controls: policy-aligned dialog/planning, safety-aware grounding (e.g., not pointing to hazardous objects), low-level physical limits, and expanded evaluation suites (e.g., ASIMOV/ASIMOV-style scenario testing and auto red-teaming to elicit edge-case failures). The goal is to catch hallucinated affordances or nonexistent objects before actuation.

Competitive/industry context

Gemini Robotics 1.5 is a shift from “single-instruction” robotics toward agentic, multi-step autonomy with explicit web/tool use and cross-platform learning, a capability set relevant to consumer and industrial robotics. Early partner access centers on established robotics vendors and humanoid platforms.

Key Takeaways

Two-model architecture (ER ↔ VLA): Gemini Robotics-ER 1.5 handles embodied reasoning—spatial grounding, planning, success/progress estimation, tool calls—while Robotics 1.5 is the vision-language-action executor that issues motor commands.

“Think-before-act” control: The VLA produces explicit intermediate reasoning/traces during execution, improving long-horizon decomposition and mid-task adaptation.

Motion Transfer across embodiments: A single VLA checkpoint reuses skills across heterogeneous robots (ALOHA, bi-arm Franka, Apptronik Apollo), enabling zero-/few-shot cross-robot execution rather than per-platform retraining.

Tool-augmented planning: ER 1.5 can invoke external tools (e.g., web search) to fetch constraints, then condition plans—e.g., packing after checking local weather or applying city-specific recycling rules.

Quantified improvements over prior baselines: The tech report documents higher instruction/action/visual/task generalization and better progress/success on real hardware and aligned simulators; results cover cross-embodiment transfers and long-horizon tasks.

Availability and access: ER 1.5 is available via the Gemini API (Google AI Studio) with docs, examples, and preview knobs; Robotics 1.5 (VLA) is limited to select partners with a public waitlist.

Safety & evaluation posture: DeepMind highlights layered safeguards (policy-aligned planning, safety-aware grounding, physical limits) and an upgraded ASIMOV benchmark plus adversarial evaluations to probe risky behaviors and hallucinated affordances.

Summary

Gemini Robotics 1.5 operationalizes a clean separation of embodied reasoning and control, adds motion transfer to recycle data across robots, and showcases the reasoning surface (point grounding, progress/success estimation, tool calls) to developers via the Gemini API. For teams building real-world agents, the design reduces per-platform data burden and strengthens long-horizon reliability—while keeping safety in scope with dedicated test suites and guardrails.

Check out the Paper and Technical details.


Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared

Local LLMs matured fast in 2025: open-weight families like Llama 3.1 (128K context length (ctx)), Qwen3 (Apache-2.0, dense + MoE), Gemma 2 (9B/27B, 8K ctx), Mixtral 8×7B (Apache-2.0 SMoE), and Phi-4-mini (3.8B, 128K ctx) now ship reliable specs and first-class local runners (GGUF/llama.cpp, LM Studio, Ollama), making on-prem and even laptop inference practical if you match context length and quantization to VRAM. This guide lists the ten most deployable options by license clarity, stable GGUF availability, and reproducible performance characteristics (params, context length (ctx), quant presets).

Top 10 Local LLMs (2025)

1) Meta Llama 3.1-8B — robust “daily driver,” 128K context

Why it matters. A stable, multilingual baseline with long context and first-class support across local toolchains.
Specs. Dense 8B decoder-only; official 128K context; instruction-tuned and base variants. Llama license (open weights). Common GGUF builds and Ollama recipes exist. Typical setup: Q4_K_M/Q5_K_M for ≤12-16 GB VRAM, Q6_K for ≥24 GB.
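
A minimal local-inference sketch with llama-cpp-python is shown below; the GGUF file name and the context/offload settings are assumptions you should adapt to the quant you download and the VRAM you have.

```python
# Hedged sketch: running a Llama 3.1-8B Instruct GGUF locally with llama-cpp-python.
# The model path is hypothetical; pick the quant that fits your memory budget
# (e.g., Q4_K_M for ~12-16 GB VRAM, Q6_K for 24 GB+).
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # hypothetical file name
    n_ctx=16384,       # trade context length against memory; the full 128K needs far more
    n_gpu_layers=-1,   # offload all layers to GPU if it fits, else lower this
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF quantization in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```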

2) Meta Llama 3.2-1B/3B — edge-class, 128K context, on-device friendly

Why it matters. Small models that still take 128K tokens and run acceptably on CPUs/iGPUs when quantized; good for laptops and mini-PCs.
Specs. 1B/3B instruction-tuned models; 128K context confirmed by Meta. Works well via llama.cpp GGUF and LM Studio’s multi-runtime stack (CPU/CUDA/Vulkan/Metal/ROCm).

3) Qwen3-14B / 32B — open Apache-2.0, strong tool-use & multilingual

Why it matters. Broad family (dense+MoE) under Apache-2.0 with active community ports to GGUF; widely reported as a capable general/agentic “daily driver” locally.
Specs. 14B/32B dense checkpoints with long-context variants; modern tokenizer; rapid ecosystem updates. Start at Q4_K_M for 14B on 12 GB; move to Q5/Q6 when you have 24 GB+. (Qwen)

4) DeepSeek-R1-Distill-Qwen-7B — compact reasoning that fits

Why it matters. Distilled from R1-style reasoning traces; delivers step-by-step quality at 7B with widely available GGUFs. Excellent for math/coding on modest VRAM.
Specs. 7B dense; long-context variants exist per conversion; curated GGUFs cover F32→Q4_K_M. For 8–12 GB VRAM try Q4_K_M; for 16–24 GB use Q5/Q6.

5) Google Gemma 2-9B / 27B — efficient dense; 8K context (explicit)

Why it matters. Strong quality-for-size and quantization behavior; 9B is a great mid-range local model.
Specs. Dense 9B/27B; 8K context (don’t overstate); open weights under Gemma terms; widely packaged for llama.cpp/Ollama. 9B@Q4_K_M runs on many 12 GB cards.

6) Mixtral 8×7B (SMoE) — Apache-2.0 sparse MoE; cost/perf workhorse

Why it matters. Mixture-of-Experts throughput benefits at inference: ~2 experts/token selected at runtime; a great compromise when you have ≥24–48 GB VRAM (or multi-GPU) and want stronger general performance.
Specs. 8 experts of 7B each (sparse activation); Apache-2.0; instruct/base variants; mature GGUF conversions and Ollama recipes.

7) Microsoft Phi-4-mini-3.8B — small model, 128K context

Why it matters. Realistic “small-footprint reasoning” with 128K context and grouped-query attention; solid for CPU/iGPU boxes and latency-sensitive tools.
Specs. 3.8B dense; 200k vocab; SFT/DPO alignment; the model card documents 128K context and the training profile. Use Q4_K_M on ≤8–12 GB VRAM.

8) Microsoft Phi-4-Reasoning-14B — mid-size reasoning (check ctx per build)

Why it matters. A 14B reasoning-tuned variant that is materially better for chain-of-thought-style tasks than generic 13–15B baselines.
Specs. Dense 14B; context varies by distribution (the model card for a common release lists 32K). For 24 GB VRAM, Q5_K_M/Q6_K is comfortable; mixed-precision runners (non-GGUF) need more.

9) Yi-1.5-9B / 34B — Apache-2.0 bilingual; 4K/16K/32K variants

Why it matters. Competitive EN/zh performance and a permissive license; 9B is a strong alternative to Gemma-2-9B; 34B steps toward higher reasoning under Apache-2.0.
Specs. Dense; context variants 4K/16K/32K; open weights under Apache-2.0 with active HF cards/repos. For 9B use Q4/Q5 on 12–16 GB.

10) InternLM 2 / 2.5-7B / 20B — research-friendly; math-tuned branches

Why it matters. An open series with a lively research cadence; 7B is a practical local target; 20B moves you toward Gemma-2-27B-class capability (at higher VRAM).
Specs. Dense 7B/20B; multiple chat/base/math variants; active HF presence. GGUF conversions and Ollama packs are common.


Summary

In local LLMs, the trade-offs are clear: pick dense models for predictable latency and simpler quantization (e.g., Llama 3.1-8B with a documented 128K context; Gemma 2-9B/27B with an explicit 8K window), move to sparse MoE like Mixtral 8×7B when your VRAM and parallelism justify higher throughput per cost, and treat small reasoning models (Phi-4-mini-3.8B, 128K) as the sweet spot for CPU/iGPU boxes. Licenses and ecosystems matter as much as raw scores: Qwen3’s Apache-2.0 releases (dense + MoE) and Meta/Google/Microsoft model cards give the operational guardrails (context, tokenizer, usage terms) you’ll actually live with. On the runtime side, standardize on GGUF/llama.cpp for portability, layer Ollama/LM Studio for convenience and hardware offload, and size quantization (Q4→Q6) to your memory budget. In short: choose by context + license + hardware path, not just leaderboard vibes.


The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens

Google released an updated version of Gemini 2.5 Flash and Gemini 2.5 Flash-Lite preview models across AI Studio and Vertex AI, plus rolling aliases—gemini-flash-latest and gemini-flash-lite-latest—that always point to the newest preview in each family. For production stability, Google advises pinning fixed strings (gemini-2.5-flash, gemini-2.5-flash-lite). Google will give a two-week email notice before retargeting a -latest alias, and notes that rate limits, features, and cost may vary across alias updates.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

What actually changed?

Flash: Improved agentic tool use and more efficient “thinking” (multi-pass reasoning). Google reports a +5 point lift on SWE-Bench Verified vs. the May preview (48.9% → 54.0%), indicating better long-horizon planning/code navigation.

Flash-Lite: Tuned for stricter instruction following, reduced verbosity, and stronger multimodal/translation. Google’s internal chart shows ~50% fewer output tokens for Flash-Lite and ~24% fewer for Flash, which directly cuts output-token spend and wall-clock time in throughput-bound services.


Independent Stats from the community thread

Artificial Analysis (the account behind the AI benchmarking site) received pre-release access and published external measurements across intelligence and speed. Highlights from the thread and companion pages:

Throughput: In endpoint tests, Gemini 2.5 Flash-Lite (Preview 09-2025, reasoning) is reported as the fastest proprietary model they track, around ~887 output tokens/s on AI Studio in their setup.

Intelligence index deltas: The September previews for Flash and Flash-Lite improve on Artificial Analysis’ aggregate “intelligence” scores compared with prior stable releases (site pages break down reasoning vs. non-reasoning tracks and blended price assumptions).

Token efficiency: The thread reiterates Google’s own reduction claims (−24% Flash, −50% Flash-Lite) and frames the win as cost-per-success improvements for tight latency budgets.

“Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors. Key takeaways from our intelligence…” — Artificial Analysis (@ArtificialAnlys), September 25, 2025

Cost surface and context budgets (for deployment choices)

Flash-Lite GA list price is $0.10 / 1M input tokens and $0.40 / 1M output tokens (Google’s July GA post and DeepMind’s model page). That baseline is where verbosity reductions translate to immediate savings.

Context: Flash-Lite supports ~1M-token context with configurable “thinking budgets” and tool connectivity (Search grounding, code execution)—useful for agent stacks that interleave reading, planning, and multi-tool calls.

Browser-agent angle and the o3 claim

A circulating claim says the “new Gemini Flash has o3-level accuracy, but is 2× faster and 4× cheaper on browser-agent tasks.” This is community-reported, not in Google’s official post. It likely traces to private/limited task suites (DOM navigation, action planning) with specific tool budgets and timeouts. Use it as a hypothesis for your own evals; don’t treat it as a cross-bench truth.

“This is insane! The new Gemini Flash model released yesterday has the same accuracy as o3, but it is 2x faster and 4x cheaper for browser agent tasks. I ran evaluations the whole day and could not believe this. The previous gemini-2.5-flash had only 71% on this benchmark. https://t.co/KdgkuAK30W” — Magnus Müller (@mamagnus00), September 26, 2025

Practical guidance for teams

Pin vs. chase -latest: If you depend on strict SLAs or fixed limits, pin the stable strings. If you continuously canary for cost/latency/quality, the -latest aliases reduce upgrade friction (Google provides two weeks’ notice before switching the pointer).

High-QPS or token-metered endpoints: Start with Flash-Lite preview; the verbosity and instruction-following upgrades shrink egress tokens. Validate multimodal and long-context traces under production load.

Agent/tool pipelines: A/B Flash preview where multi-step tool use dominates cost or failure modes; Google’s SWE-Bench Verified lift and community tokens/s figures suggest better planning under constrained thinking budgets.

Model strings (current)

Previews: gemini-2.5-flash-preview-09-2025, gemini-2.5-flash-lite-preview-09-2025

Stable: gemini-2.5-flash, gemini-2.5-flash-lite

Rolling aliases: gemini-flash-latest, gemini-flash-lite-latest (pointer semantics; may change features/limits/pricing).
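
A minimal sketch of the pin-vs-alias choice, assuming the google-genai Python SDK: keep the stable string on the production path and canary the rolling alias beside it.

```python
# Hedged sketch: comparing a pinned stable model string with the rolling alias.
# The prompt is illustrative; the model strings come from the article above.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

PINNED = "gemini-2.5-flash-lite"        # stable: features/limits stay fixed
CANARY = "gemini-flash-lite-latest"     # rolling alias: pointer may move (two weeks' notice)

prompt = "Classify this ticket as billing, bug, or feature request: 'App crashes on login.'"

for model in (PINNED, CANARY):
    resp = client.models.generate_content(model=model, contents=prompt)
    print(model, "->", resp.text.strip())
```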

Summary

Google’s new release update tightens tool-use competence (Flash) and token/latency efficiency (Flash-Lite) and introduces -latest aliases for faster iteration. External benchmarks from Artificial Analysis indicate meaningful throughput and intelligence-index gains for the September 2025 previews, with Flash-Lite now testing as the fastest proprietary model in their harness. Validate on your workload—especially browser-agent stacks—before committing to the aliases in production.


Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder

Hugging Face (HF) has released Smol2Operator, a reproducible, end-to-end recipe that turns a small vision-language model (VLM) with no prior UI grounding into a GUI-operating, tool-using agent. The release covers data transformation utilities, training scripts, transformed datasets, and the resulting 2.2B-parameter model checkpoint—positioned as a complete blueprint for building GUI agents from scratch rather than a single benchmark result.

But what’s new?

Two-phase post-training over a small VLM: Starting from SmolVLM2-2.2B-Instruct—a model that “initially has no grounding capabilities for GUI tasks”—Smol2Operator first instills perception/grounding, then layers agentic reasoning with supervised fine-tuning (SFT).

Unified action space across heterogeneous sources: A conversion pipeline normalizes disparate GUI action taxonomies (mobile, desktop, web) into a single, consistent function API (e.g., click, type, drag, normalized [0,1] coordinates), enabling coherent training across datasets. An Action Space Converter supports remapping to custom vocabularies.

But why Smol2Operator?

Most GUI-agent pipelines are blocked by fragmented action schemas and non-portable coordinates. Smol2Operator’s action-space unification and normalized coordinate strategy make datasets interoperable and training stable under image resizing, which is common in VLM preprocessing. This reduces the engineering overhead of assembling multi-source GUI data and lowers the barrier to reproducing agent behavior with small models.
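
The normalized-coordinate idea is simple enough to show directly; the helper names below are illustrative rather than Smol2Operator's exact API, but they capture why a [0,1] click survives image resizing.

```python
# Illustrative sketch: actions carry normalized [0, 1] coordinates, so a click
# recorded on one screenshot size stays valid after VLM preprocessing resizes it.
def to_normalized(x_px: int, y_px: int, width: int, height: int) -> tuple[float, float]:
    return x_px / width, y_px / height

def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    return round(x * width), round(y * height)

# A click recorded on a 1920x1080 capture...
nx, ny = to_normalized(960, 540, 1920, 1080)            # -> (0.5, 0.5)
# ...still lands on the same element after resizing to 1024x768.
print({"action": "click", "x": nx, "y": ny}, "->", to_pixels(nx, ny, 1024, 768))
```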

How it works: training stack and data path

Data standardization:

Parse and normalize function calls from source datasets (e.g., AGUVIS stages) into a unified signature set; remove redundant actions; standardize parameter names; convert pixel to normalized coordinates.

Phase 1 (Perception/Grounding):

SFT on the unified action dataset to learn element localization and basic UI affordances, measured on ScreenSpot-v2 (element localization on screenshots).

Phase 2 (Cognition/Agentic reasoning):

Additional SFT to convert grounded perception into step-wise action planning aligned with the unified action API.

The HF team reports a clean performance trajectory on the ScreenSpot-v2 benchmark as grounding is learned, and shows the same training strategy scaling down to a ~460M “nanoVLM,” indicating the method’s portability across model capacities (numbers are presented in the post’s tables).

Scope, limits, and next steps

Not a “SOTA at all costs” push: The HF team frames the work as a process blueprint—owning data conversion → grounding → reasoning—rather than chasing leaderboard peaks.

Evaluation focus: Demonstrations center on ScreenSpot-v2 perception and qualitative end-to-end task videos; broader cross-environment, cross-OS, or long-horizon task benchmarks are future work. The HF team notes potential gains from RL/DPO beyond SFT for on-policy adaptation.

Ecosystem trajectory: ScreenEnv’s roadmap includes wider OS coverage (Android/macOS/Windows), which would increase external validity of trained policies.

Summary

Smol2Operator is a fully open-source, reproducible pipeline that upgrades SmolVLM2-2.2B-Instruct—a VLM with zero GUI grounding—into an agentic GUI coder via a two-phase SFT process. The release standardizes heterogeneous GUI action schemas into a unified API with normalized coordinates, provides transformed AGUVIS-based datasets, publishes training notebooks and preprocessing code, and ships a final checkpoint plus a demo Space. It targets process transparency and portability over leaderboard chasing, and slots into the smolagents runtime with ScreenEnv for evaluation, offering a practical blueprint for teams building small, operator-grade GUI agents.

Check out the Technical details and the Full Collection on HF.


Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves Programs for Scientific Discovery with Unprecedented Sample-Efficiency


Sakana AI has released ShinkaEvolve, an open-source framework that uses large language models (LLMs) as mutation operators in an evolutionary loop to evolve programs for scientific and engineering problems—while drastically cutting the number of evaluations needed to reach strong solutions. On the canonical circle-packing benchmark (n=26 in a unit square), ShinkaEvolve reports a new SOTA configuration using ~150 program evaluations, where prior systems typically burned thousands. The project ships under Apache-2.0, with a research report and public code.

https://sakana.ai/shinka-evolve/

What problem is it actually solving?

Most “agentic” code-evolution systems explore by brute force: they mutate code, run it, score it, and repeat—consuming enormous sampling budgets. ShinkaEvolve targets that waste explicitly with three interacting components:

Adaptive parent sampling to balance exploration/exploitation. Parents are drawn from “islands” via fitness- and novelty-aware policies (power-law or weighted by performance and offspring counts) rather than always climbing the current best.

Novelty-based rejection filtering to avoid re-evaluating near-duplicates. Mutable code segments are embedded; if cosine similarity exceeds a threshold, a secondary LLM acts as a “novelty judge” before execution.

Bandit-based LLM ensembling so the system learns which model (e.g., GPT/Gemini/Claude/DeepSeek families) is yielding the biggest relative fitness jumps and routes future mutations accordingly (UCB1-style update on improvement over parent/baseline).
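
Of these three mechanisms, the bandit ensembling is the easiest to sketch. The following is a toy UCB1-style selector over mutation LLMs, with placeholder model names and synthetic rewards; it illustrates the routing idea, not ShinkaEvolve's exact update rule.

```python
# Toy UCB1-style routing of mutations to LLMs by mean fitness improvement.
# Model names and reward scales are placeholders.
import math, random

models = ["gpt-like", "gemini-like", "claude-like", "deepseek-like"]
counts = {m: 0 for m in models}
rewards = {m: 0.0 for m in models}      # cumulative improvement over the parent
total = 0

def pick_model() -> str:
    global total
    total += 1
    for m in models:                    # play each arm once first
        if counts[m] == 0:
            return m
    return max(
        models,
        key=lambda m: rewards[m] / counts[m] + math.sqrt(2 * math.log(total) / counts[m]),
    )

def update(model: str, improvement: float) -> None:
    counts[model] += 1
    rewards[model] += max(improvement, 0.0)

for _ in range(20):                     # stand-in for generations of mutations
    m = pick_model()
    # Pretend one model yields slightly larger fitness jumps on this problem.
    update(m, random.gauss(0.1 if m == "claude-like" else 0.0, 0.05))

print(max(models, key=lambda m: rewards[m] / max(counts[m], 1)))
```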

Does the sample-efficiency claim hold beyond toy problems?

The research team evaluates four distinct domains and shows consistent gains with small budgets:

Circle packing (n=26): reaches an improved configuration in roughly 150 evaluations; the research team also validates with stricter exact-constraint checking.

AIME math reasoning (2024 set): evolves agentic scaffolds that trace out a Pareto frontier of accuracy vs. LLM-call budget, outperforming hand-built baselines under limited query budgets and transferring to other AIME years and LLMs.

Competitive programming (ALE-Bench LITE): starting from ALE-Agent solutions, ShinkaEvolve delivers ~2.3% mean improvement across 10 tasks and pushes one task’s solution from 5th → 2nd in an AtCoder leaderboard counterfactual.

LLM training (Mixture-of-Experts): evolves a new load-balancing loss that improves perplexity and downstream accuracy at multiple regularization strengths vs. the widely-used global-batch LBL.


How does the evolutionary loop look in practice?

ShinkaEvolve maintains an archive of evaluated programs with fitness, public metrics, and textual feedback. For each generation: sample an island and parent(s); construct a mutation context with top-K and random “inspiration” programs; then propose edits via three operators—diff edits, full rewrites, and LLM-guided crossovers—while protecting immutable code regions with explicit markers. Executed candidates update both the archive and the bandit statistics that steer subsequent LLM/model selection. The system periodically produces a meta-scratchpad that summarizes recently successful strategies; those summaries are fed back into prompts to accelerate later generations.

What are the concrete results?

Circle packing: combined structured initialization (e.g., golden-angle patterns), hybrid global–local search (simulated annealing + SLSQP), and escape mechanisms (temperature reheating, ring rotations) discovered by the system—not hand-coded a priori.

AIME scaffolds: three-stage expert ensemble (generation → critical peer review → synthesis) that hits the accuracy/cost sweet spot at ~7 calls while retaining robustness when swapped to different LLM backends.

ALE-Bench: targeted engineering wins (e.g., caching kd-tree subtree stats; “targeted edge moves” toward misclassified items) that push scores without wholesale rewrites.

MoE loss: adds an entropy-modulated under-use penalty to the global-batch objective; empirically reduces miss-routing and improves perplexity/benchmarks as layer routing concentrates.

How does this compare to AlphaEvolve and related systems?

AlphaEvolve demonstrated strong closed-source results but at higher evaluation counts. ShinkaEvolve reproduces and surpasses the circle-packing result with orders-of-magnitude fewer samples and releases all components open-source. The research team also contrast variants (single-model vs. fixed ensemble vs. bandit ensemble) and ablate parent selection and novelty filtering, showing each contributes to the observed efficiency.

Summary

ShinkaEvolve is an Apache-2.0 framework for LLM-driven program evolution that cuts evaluations from thousands to hundreds by combining fitness/novelty-aware parent sampling, embedding-plus-LLM novelty rejection, and a UCB1-style adaptive LLM ensemble. It sets a new SOTA on circle packing (~150 evals), finds stronger AIME scaffolds under strict query budgets, improves ALE-Bench solutions (~2.3% mean gain, 5th→2nd on one task), and discovers a new MoE load-balancing loss that improves perplexity and downstream accuracy. Code and report are public.

FAQs — ShinkaEvolve

1) What is ShinkaEvolve?
An open-source framework that couples LLM-driven program mutations with evolutionary search to automate algorithm discovery and optimization. Code and report are public.

2) How does it achieve higher sample-efficiency than prior evolutionary systems?
Three mechanisms: adaptive parent sampling (explore/exploit balance), novelty-based rejection to avoid duplicate evaluations, and a bandit-based selector that routes mutations to the most promising LLMs.

3) What supports the results?
It reaches state-of-the-art circle packing with ~150 evaluations; on AIME-2024 it evolves scaffolds under a 10-query cap per problem; it improves ALE-Bench solutions over strong baselines.

4) Where can I run it and what’s the license?
The GitHub repo provides a WebUI and examples; ShinkaEvolve is released under Apache-2.0.

Check out the Technical details, Paper, and GitHub Page.


Google AI Ships a Model Context Protocol (MCP) Server for Data Commons, Giving AI Agents First-Class Access to Public Stats

Google released a Model Context Protocol (MCP) server for Data Commons, exposing the project’s interconnected public datasets—census, health, climate, economics—through a standards-based interface that agentic systems can query in natural language. The Data Commons MCP Server is available now with quickstarts for Gemini CLI and Google’s Agent Development Kit (ADK).

What was released

An MCP server that lets any MCP-capable client or AI agent discover variables, resolve entities, fetch time series, and generate reports from Data Commons without hand-coding API calls. Google positions it as “from initial discovery to generative reports,” with example prompts spanning exploratory, analytical, and generative workflows.

Developer on-ramps: a PyPI package, a Gemini CLI flow, and an ADK sample/Colab to embed Data Commons queries inside agent pipelines.

Why MCP now?

MCP is an open protocol for connecting LLM agents to external tools and data with consistent capabilities (tools, prompts, resources) and transport semantics. By shipping a first-party MCP server, Google makes Data Commons addressable through the same interface that agents already use for other sources, reducing per-integration glue code and enabling registry-based discovery alongside other servers.

What you can do with it?

Exploratory: “What health data do you have for Africa?” → enumerate variables, coverage, and sources.

Analytical: “Compare life expectancy, inequality, and GDP growth for BRICS nations.” → retrieve series, normalize geos, align vintages, and return a table or chart payload.

Generative: “Generate a concise report on income vs. diabetes in US counties.” → fetch measures, compute correlations, include provenance.

Integration surface

Gemini CLI / any MCP client: install the Data Commons MCP package, point the client at the server, and issue NL queries; the client coordinates tool calls behind the scenes.

ADK agents: use Google’s sample agent to compose Data Commons calls with your own tools (e.g., visualization, storage) and return sourced outputs.

Docs entry point: MCP — Query data interactively with an AI agent with links to quickstart and user guide.
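
A minimal client-side sketch using the official MCP Python SDK is shown below; the server launch command and its arguments are assumptions, so check the PyPI package's quickstart for the real entry point.

```python
# Hedged sketch: connecting an MCP client to a locally launched Data Commons
# MCP server over stdio and listing its tools. The command and args below are
# hypothetical placeholders for the package's documented entry point.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(
        command="datacommons-mcp",        # hypothetical command name
        args=["serve", "stdio"],          # hypothetical arguments
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Discover the variable/entity/time-series tools the server exposes.
            print([t.name for t in tools.tools])

asyncio.run(main())
```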

Real-world use case

Google highlights ONE Data Agent, built with the Data Commons MCP Server for the ONE Campaign. It lets policy analysts query tens of millions of health-financing datapoints via natural language, visualize results, and export clean datasets for downstream work.

Summary

In short, Google’s Data Commons MCP Server turns a sprawling corpus of public statistics into a first-class, protocol-native data source for agents—reducing custom glue code, preserving provenance, and fitting cleanly into existing MCP clients like Gemini CLI and ADK.

Check out the GitHub Repository and try it out in Gemini CLI.


Building health care agents using Amazon Bedrock AgentCore

This blog was co-authored with Kuldeep Singh, Head of AI Platform at Innovaccer.
The integration of agentic AI is ushering in a transformative era in health care, marking a significant departure from traditional AI systems. Agentic AI demonstrates autonomous decision-making capabilities and adaptive learning in complex medical environments, enabling it to monitor patient progress, coordinate care teams, and adjust treatment strategies in real time. These intelligent systems are becoming deeply embedded in healthcare operations, from enhancing diagnostic precision through advanced pattern recognition to optimizing clinical workflows and accelerating drug discovery processes. Agentic AI combines proactive problem-solving abilities with real-time adaptability so that healthcare professionals can focus on high-value, patient-centered activities while the AI handles routine tasks and complex data analysis.
Innovaccer, a pioneering healthcare AI company, recently launched Innovaccer Gravity, built using Amazon Bedrock AgentCore, a new healthcare intelligence platform set to revolutionize data integration and AI-driven healthcare transformation. Building on their impressive track record—where their existing solutions serve more than 1,600 US care locations, manage more than 80 million unified health records, and have generated $1.5B in cost savings—this exemplifies how AWS customers are leading the agentic AI evolution by creating intelligent solutions that transform healthcare delivery while delivering significant ROI.
Health care demands precision and accountability. AI agents operating within this domain must handle sensitive patient data securely, adhere to rigorous compliance regulations (like HIPAA), and maintain consistent interoperability across diverse clinical workflows. Standard, generalized protocols fall short when dealing with complex healthcare systems and patient data protection requirements. Healthcare organizations need a robust service to convert their existing APIs into Model Context Protocol (MCP) compatible tools that can scale effectively while providing built-in authentication, authorization, encryption, and comprehensive audit trails. Amazon Bedrock AgentCore Gateway offers health care providers and digital health companies a straightforward and secure way to build, deploy, discover, and connect to tools at scale that they can use to create AI-powered healthcare solutions while maintaining the highest standards of security and compliance.
Problem
Healthcare organizations face significant data silo challenges because of diverse electronic health record (EHR) formats across different systems, often maintaining multiple systems to serve specialized departmental needs and legacy systems. FHIR (Fast Healthcare Interoperability Resources) solves these interoperability challenges by standardizing healthcare data into exchangeable resources (like patient records and lab results), enabling seamless communication between different systems while maintaining security and improving care coordination. However, implementing FHIR presents its own challenges, including technical complexity in integrating with legacy systems and the need for specialized expertise in healthcare informatics and API development.
The implementation of AI agents introduces new layers of complexity, requiring careful design and maintenance of interfaces with existing systems. AI agents need secure access to the FHIR data and other healthcare tools with authentication (both inbound and outbound) and end-to-end encryption. MCP is a standardized communication framework that enables AI systems to seamlessly interact with external tools, data sources, and services through a unified interface. However, the development and scaling of MCP servers require substantial resources and expertise. Hosting these services demands ongoing development time and attention to maintain optimal performance and reliability. As healthcare organizations navigate this complex terrain, addressing these challenges becomes critical for achieving true interoperability and harnessing the full potential of modern healthcare technology.
Deploy, enhance, and monitor AI agents at scale using Amazon Bedrock AgentCore
By using Amazon Bedrock AgentCore, you can deploy and operate highly capable AI agents securely at scale. It offers infrastructure purpose-built for dynamic agent workloads, powerful tools to enhance agents, and essential controls for real-world deployment. Bedrock AgentCore offers a set of composable services with the services most relevant to the solution in this post mentioned in the following list. For more information, see the Bedrock AgentCore documentation.

AgentCore Runtime provides a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using any open source framework, protocol, and model. Runtime was built to work for agentic workloads with industry-leading extended runtime support, fast cold starts, true session isolation, built-in identity, and support for multi-modal payloads.
AgentCore Gateway provides a secure way for agents to discover and use tools along with straightforward transformation of APIs, AWS Lambda functions, and existing services into agent-compatible tools. Gateway speeds up custom code development, infrastructure provisioning, and security implementation so developers can focus on building innovative agent applications.
AgentCore Identity provides a secure, scalable agent identity and access management capability, accelerating AI agent development. It is compatible with existing identity providers, avoiding the need to migrate users or rebuild authentication flows.
AgentCore Observability helps developers trace, debug, and monitor agent performance in production through unified operational dashboards, with support for OpenTelemetry-compatible telemetry and detailed visualizations of each step of the agent workflow.

In this solution, we demonstrate how the user (a parent) can interact with a Strands or LangGraph agent in conversational style and get information about the immunization history and schedule of their child, inquire about the available slots, and book appointments. With some changes, AI agents can be made event-driven so that they can automatically send reminders, book appointments, and so on. This reduces the administrative burden on healthcare organizations and the parents who no longer need to keep track of the paperwork or make multiple calls to book appointments.

As shown in the preceding diagram, the workflow for the healthcare appointment booking solution built using Amazon Bedrock AgentCore is the following:

User interacts with Strands or LangGraph agent: The solution contains both Strands and LangGraph agents. You can also use other frameworks such as AutoGen and CrewAI.
Reasoning LLM from Amazon Bedrock: Claude 3.5 Sonnet large language model (LLM) is used from Amazon Bedrock. The model demonstrates advanced reasoning by grasping nuances and complex instructions, along with strong tool-calling capabilities that allow it to effectively integrate with external applications and services to automate various tasks such as web browsing, calculations, or data interactions.
Tools exposed using AgentCore Gateway: AgentCore Gateway provides secure access to the necessary tools required for the Strands or LangGraph agent using standard MCP clients. In this solution, REST APIs are hosted on Amazon API Gateway and exposed as MCP tools using AgentCore Gateway.
Ingress authentication for AgentCore Gateway: AgentCore Gateway is protected with OAuth 2.0 using Amazon Cognito as the identity provider. You can use other OAuth 2.0-compatible identity providers such as Auth0 and Keycloak as needed to fit your use case.
OpenAPI specs converted into tools with AgentCore Gateway: Amazon API Gateway is used as the backend to expose the APIs. By importing the OpenAPI specs, AgentCore Gateway provides an MCP compatible server without additional configuration for tool metadata. The following are the tools used in the solution.

get_patient_emr() – Gets the parent’s and child’s demographics information.
search_immunization_emr() – Gets the immunization history and schedule for the child.
get_available_slots() – Gets the pediatrician’s schedule around parent’s preferred date.
book_appointment() – Books an appointment and returns the confirmation number.

AWS HealthLake as the FHIR server: HealthLake is used to manage patient data related to demographics, immunization history, schedule and appointments, and so on. HealthLake is a HIPAA-eligible service offering healthcare companies a complete view of individual and patient population health data, using FHIR API-based transactions to securely store and transform their data into a queryable format at petabyte scale and further analyze this data using machine learning (ML) models.
Egress authentication from AgentCore Gateway to tools: OAuth 2.0 with Amazon Cognito as the identity provider is used to do the authentication between AgentCore Gateway and the tools used in the solution.
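
For orientation, the following is a hedged sketch of how an MCP client inside the agent could authenticate to AgentCore Gateway and list the tools generated from the OpenAPI spec. The gateway URL, token value, and tool arguments are placeholders for values from your own deployment, not code from the sample solution.

```python
# Hedged sketch: an MCP client calling an AgentCore Gateway endpoint over
# streamable HTTP with an OAuth 2.0 bearer token from Amazon Cognito.
# URL, token, and tool arguments are placeholders.
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

GATEWAY_URL = "https://your-gateway-id.gateway.bedrock-agentcore.us-east-1.amazonaws.com/mcp"  # placeholder
ACCESS_TOKEN = "eyJ..."   # OAuth 2.0 access token obtained from Cognito (placeholder)

async def main():
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    async with streamablehttp_client(GATEWAY_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])   # e.g., get_patient_emr, book_appointment
            # Invoke one of the tools generated from the OpenAPI spec;
            # the argument name is a placeholder for the spec's actual schema.
            result = await session.call_tool("book_appointment", {"patientId": "child-123"})
            print(result)

asyncio.run(main())
```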

Solution setup

Important: The following code example is meant for learning and demonstration purposes only. For production implementations, it is recommended to add required error handling, input validation, logging, and security controls.

The code and instructions to set up and clean up this example solution are available on GitHub. When set up, the solution looks like the following and is targeted towards parents to use for immunization-related appointments.

Customizing the solution
The solution can be customized to extend the same or a different use case through the following mechanisms:

OpenAPI specification: The solution uses a sample OpenAPI specification (named fhir-openapi-spec.yaml) with APIs hosted on API Gateway. The OpenAPI specification can be customized to add more tools or use entirely different tools by editing the YAML file. You must recreate the AgentCore gateway after making changes to the OpenAPI spec.
Agent instructions and LLM: The strands_agent.py or langgraph_agent.py can be modified to make changes to the goal or instructions for the Agent or to work with a different LLM.

Future enhancements
We’re already looking forward and planning future enhancements for this solution.

AgentCore Runtime: Host the Strands or LangGraph agent on AgentCore Runtime.
AgentCore Memory: Use AgentCore Memory to preserve session information in short-term (in session) as well as long-term (across sessions) to provide a more personalized experience to the agent users.

Innovaccer’s use case for Bedrock AgentCore
Innovaccer’s Gravity platform includes more than 400 connectors to unify data from EHRs from sources such as Epic, Oracle Cerner, and MEDITECH, more than 20 pre-trained models, 15 pre-built AI agents, 100 FHIR resources, and 60 out-of-the-box solutions with role-based access control, a comprehensive audit trail, end-to-end encryption, and secure personal health information (PHI) handling. They also provide a low-code or no-code interface to build additional AI agents with the tools exposed using Healthcare Model Context Protocol (HMCP) servers.
Innovaccer uses Bedrock AgentCore for the following purposes:

AgentCore Gateway to turn their OpenAPI specifications into HMCP compatible tools without the heavy lifting required to build, secure, or scale MCP servers.
AgentCore Identity to handle the inbound and outbound authentication integrating with Innovaccer- or customer-provided OAuth servers.
AgentCore Runtime to deploy and scale the AI agents with multi-agent collaboration, along with logging, traceability, and the ability to plug in custom guardrails.

Bedrock AgentCore supports enterprise-grade security with encryption in transit and at rest, complete session isolation, audit trails using AWS CloudTrail, and comprehensive controls to help Innovaccer agents operate reliably and securely at scale.
Pricing for Bedrock AgentCore Gateway:
AgentCore Gateway offers a consumption-based pricing model with billing based on API invocations (such as ListTools, InvokeTool, and Search API) and indexing of tools. For more information, see the pricing page.
Conclusion
The integration of Amazon Bedrock AgentCore with healthcare systems represents a significant leap forward in the application of AI to improve patient care and streamline healthcare operations. By using the suite of services provided by Bedrock AgentCore, healthcare organizations can deploy sophisticated AI agents that securely interact with existing systems, adhere to strict compliance standards, and scale efficiently.
The solution architecture presented in this post demonstrates the practical application of these technologies, showcasing how AI agents can simplify complex processes such as immunization scheduling and appointment booking. This can reduce administrative burdens on healthcare providers and enhance the patient experience by providing straightforward access to critical health information and services.
As we look to the future, the potential for AI agents in the healthcare industry is vast. From improving diagnostic accuracy to personalizing treatment plans and streamlining clinical workflows, the possibilities are endless. Tools like Amazon Bedrock AgentCore can help healthcare organizations confidently navigate the complexities of implementing AI while maintaining the highest standards of security, compliance, and patient care.
The healthcare industry stands at the cusp of a transformative era, where AI agents will play an increasingly central role in delivering efficient, personalized, and high-quality care. By embracing these technologies and continuing to innovate, we can create a healthcare network that is more responsive, intelligent, and patient-centric than ever before.

About the Authors
Kamal Manchanda is a Senior Solutions Architect at AWS with 17 years of experience in cloud, data, and AI technologies. He works closely with C-level executives and technical teams of AWS customers to drive cloud adoption and digital transformation initiatives. Prior to AWS, he led global teams delivering cloud-centric systems, data-driven applications, and AI/ML solutions across consulting and product organizations. Kamal specializes in translating complex business challenges into scalable, secure solutions that deliver measurable business value.
Kuldeep Singh is AVP and Head of AI Platform at Innovaccer. He leads the work on AI agentic workflow layers for Gravity by Innovaccer, a healthcare intelligence platform designed to unify data, agents, and compliant workflows so health systems can deploy AI at scale. With deep experience in data engineering, AI, and product leadership, Kuldeep focuses on making healthcare more efficient, safe, and patient-centered. He plays a key role in building tools that allow care teams to automate complex, multi-step tasks (like integrating payer or EHR data, orchestrating clinical agents) without heavy engineering. He’s passionate about reducing clinician burnout, improving patient outcomes, and turning pilot projects into enterprise-wide AI solutions.

Build multi-agent site reliability engineering assistants with Amazon Bedrock AgentCore

Site reliability engineers (SREs) face an increasingly complex challenge in modern distributed systems. During production incidents, they must rapidly correlate data from multiple sources—logs, metrics, Kubernetes events, and operational runbooks—to identify root causes and implement solutions. Traditional monitoring tools provide raw data but lack the intelligence to synthesize information across these diverse systems, often leaving SREs to manually piece together the story behind system failures.
With a generative AI solution, SREs can ask their infrastructure questions in natural language. For example, they can ask “Why are the payment-service pods crash looping?” or “What’s causing the API latency spike?” and receive comprehensive, actionable insights that combine infrastructure status, log analysis, performance metrics, and step-by-step remediation procedures. This capability transforms incident response from a manual, time-intensive process into a time-efficient, collaborative investigation.
In this post, we demonstrate how to build a multi-agent SRE assistant using Amazon Bedrock AgentCore, LangGraph, and the Model Context Protocol (MCP). This system deploys specialized AI agents that collaborate to provide the deep, contextual intelligence that modern SRE teams need for effective incident response and infrastructure management. We walk you through the complete implementation, from setting up the demo environment to deploying on Amazon Bedrock AgentCore Runtime for production use.
Solution overview
This solution uses a comprehensive multi-agent architecture that addresses the challenges of modern SRE operations through intelligent automation. The solution consists of four specialized AI agents working together under a supervisor agent to provide comprehensive infrastructure analysis and incident response assistance.
The examples in this post use synthetically generated data from our demo environment. The backend servers simulate realistic Kubernetes clusters, application logs, performance metrics, and operational runbooks. In production deployments, these stub servers would be replaced with connections to your actual infrastructure systems, monitoring services, and documentation repositories.
The architecture demonstrates several key capabilities:

Natural language infrastructure queries – You can ask complex questions about your infrastructure in plain English and receive detailed analysis combining data from multiple sources
Multi-agent collaboration – Specialized agents for Kubernetes, logs, metrics, and operational procedures work together to provide comprehensive insights
Real-time data synthesis – Agents access live infrastructure data through standardized APIs and present correlated findings
Automated runbook execution – Agents retrieve and display step-by-step operational procedures for common incident scenarios
Source attribution – Every finding includes explicit source attribution for verification and audit purposes

The following diagram illustrates the solution architecture.

The architecture demonstrates how the SRE support agent integrates seamlessly with Amazon Bedrock AgentCore components:

Customer interface – Receives alerts about degraded API response times and returns comprehensive agent responses
Amazon Bedrock AgentCore Runtime – Manages the execution environment for the multi-agent SRE solution
SRE support agent – Multi-agent collaboration system that processes incidents and orchestrates responses
Amazon Bedrock AgentCore Gateway – Routes requests to specialized tools through OpenAPI interfaces:

Kubernetes API for getting cluster events
Logs API for analyzing log patterns
Metrics API for analyzing performance trends
Runbooks API for searching operational procedures

Amazon Bedrock AgentCore Memory – Stores and retrieves session context and previous interactions for continuity
Amazon Bedrock AgentCore Identity – Handles authentication for tool access using Amazon Cognito integration
Amazon Bedrock AgentCore Observability – Collects and visualizes agent traces for monitoring and debugging
Amazon Bedrock LLMs – Powers the agent intelligence through Anthropic’s Claude large language models (LLMs)

The multi-agent solution uses a supervisor-agent pattern where a central orchestrator coordinates five specialized agents (a minimal routing sketch follows this list):

Supervisor agent – Analyzes incoming queries and creates investigation plans, routing work to appropriate specialists and aggregating results into comprehensive reports
Kubernetes infrastructure agent – Handles container orchestration and cluster operations, investigating pod failures, deployment issues, resource constraints, and cluster events
Application logs agent – Processes log data to find relevant information, identifies patterns and anomalies, and correlates events across multiple services
Performance metrics agent – Monitors system metrics and identifies performance issues, providing real-time analysis and historical trending
Operational runbooks agent – Provides access to documented procedures, troubleshooting guides, and escalation procedures based on the current situation

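To make the supervisor pattern concrete, the following is a minimal, illustrative LangGraph sketch. It is not the repository’s implementation: the node names, the keyword-based router that stands in for the LLM planner, and the stubbed specialist functions are assumptions for illustration only; in the actual solution, the specialists call MCP tools exposed through Amazon Bedrock AgentCore Gateway.

# Minimal sketch of supervisor-style routing with LangGraph (illustrative only).
# The router and specialists below are stubs; real agents would call LLMs and MCP tools.
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class InvestigationState(TypedDict):
    query: str
    findings: List[str]
    next_agent: str


def supervisor(state: InvestigationState) -> InvestigationState:
    # The real supervisor asks an LLM to build an investigation plan;
    # a keyword rule stands in for that decision here.
    query = state["query"].lower()
    if "pod" in query or "crash" in query:
        state["next_agent"] = "kubernetes_agent"
    elif "latency" in query or "response time" in query:
        state["next_agent"] = "metrics_agent"
    else:
        state["next_agent"] = "logs_agent"
    return state


def kubernetes_agent(state: InvestigationState) -> InvestigationState:
    state["findings"].append("k8s: payment-service pod is in CrashLoopBackOff")
    return state


def metrics_agent(state: InvestigationState) -> InvestigationState:
    state["findings"].append("metrics: p99 latency up 3x since 14:23 UTC")
    return state


def logs_agent(state: InvestigationState) -> InvestigationState:
    state["findings"].append("logs: OutOfMemoryError in UserService.loadAllUsers")
    return state


graph = StateGraph(InvestigationState)
graph.add_node("supervisor", supervisor)
graph.add_node("kubernetes_agent", kubernetes_agent)
graph.add_node("metrics_agent", metrics_agent)
graph.add_node("logs_agent", logs_agent)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", lambda state: state["next_agent"])
graph.add_edge("kubernetes_agent", END)
graph.add_edge("metrics_agent", END)
graph.add_edge("logs_agent", END)

app = graph.compile()
result = app.invoke({"query": "Why are the payment-service pods crash looping?",
                     "findings": [], "next_agent": ""})
print(result["findings"])
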
Using Amazon Bedrock AgentCore primitives
The solution showcases the power of Amazon Bedrock AgentCore by using multiple core primitives. It supports two providers for Anthropic’s LLMs: Amazon Bedrock with Anthropic’s Claude 3.7 Sonnet for AWS-integrated deployments, and the Anthropic API with Anthropic’s Claude 4 Sonnet for direct API access.
The Amazon Bedrock AgentCore Gateway component converts the SRE agent’s backend APIs (Kubernetes, application logs, performance metrics, and operational runbooks) into Model Context Protocol (MCP) tools. This enables agents built with an open-source framework supporting MCP (such as LangGraph in this post) to seamlessly access infrastructure APIs.
Security for the entire solution is provided by Amazon Bedrock AgentCore Identity. It supports ingress authentication for secure access control for agents connecting to the gateway, and egress authentication to manage authentication with backend servers, providing secure API access without hardcoding credentials.
The serverless execution environment for deploying the SRE agent in production is provided by Amazon Bedrock AgentCore Runtime. It automatically scales from zero to handle concurrent incident investigations while maintaining complete session isolation. Amazon Bedrock AgentCore Runtime supports both OAuth and AWS Identity and Access Management (IAM) for agent authentication. Applications that invoke agents must have appropriate IAM permissions and trust policies. For more information, see Identity and access management for Amazon Bedrock AgentCore.
Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent learning assistant that personalizes investigations based on user preferences and historical context. The memory component provides three distinct strategies:

User preferences strategy (/sre/users/{user_id}/preferences) – Stores individual user preferences for investigation style, communication channels, escalation procedures, and report formatting. For example, Alice (a technical SRE) receives detailed systematic analysis with troubleshooting steps, whereas Carol (an executive) receives business-focused summaries with impact analysis.
Infrastructure knowledge strategy (/sre/infrastructure/{user_id}/{session_id}) – Accumulates domain expertise across investigations, enabling agents to learn from past discoveries. When the Kubernetes agent identifies a memory leak pattern, this knowledge becomes available for future investigations, enabling faster root cause identification.
Investigation memory strategy (/sre/investigations/{user_id}/{session_id}) – Maintains historical context of past incidents and their resolutions. This enables the solution to suggest proven remediation approaches and avoid anti-patterns that previously failed.

The memory component demonstrates its value through personalized investigations. When both Alice and Carol investigate “API response times have degraded 3x in the last hour,” they receive identical technical findings but completely different presentations.
Alice receives a technical analysis:

memory_client.retrieve_user_preferences(user_id="Alice")
# Returns: {"investigation_style": "detailed_systematic_analysis", "reports": "technical_exposition_with_troubleshooting_steps"}

Carol receives an executive summary:

memory_client.retrieve_user_preferences(user_id="Carol")
# Returns: {"investigation_style": "business_impact_focused", "reports": "executive_summary_without_technical_details"}

Adding observability to the SRE agent
Adding observability to an SRE agent deployed on Amazon Bedrock AgentCore Runtime is straightforward using the Amazon Bedrock AgentCore Observability primitive. This enables comprehensive monitoring through Amazon CloudWatch with metrics, traces, and logs. Setting up observability requires three steps:

Add the OpenTelemetry packages to your pyproject.toml:

dependencies = [
    # … other dependencies …
    "opentelemetry-instrumentation-langchain",
    "aws-opentelemetry-distro~=0.10.1",
]

Configure observability for your agents to enable metrics in CloudWatch.
Start your container using the opentelemetry-instrument utility to automatically instrument your application.

The following command is added to the Dockerfile for the SRE agent:

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

As shown in the following screenshot, with observability enabled, you gain visibility into the following:

LLM invocation metrics – Token usage, latency, and model performance across agents
Tool execution traces – Duration and success rates for each MCP tool call
Memory operations – Retrieval patterns and storage efficiency
End-to-end request tracing – Complete request flow from user query to final response

The observability primitive automatically captures these metrics without additional code changes, providing production-grade monitoring capabilities out of the box.
Development to production flow
The SRE agent follows a four-step structured deployment process from local development to production, with detailed procedures documented in Development to Production Flow in the accompanying GitHub repo.

The deployment process maintains consistency across environments: the core agent code (sre_agent/) remains unchanged, and the deployment/ folder contains deployment-specific utilities. The same agent works locally and in production through environment configuration, with Amazon Bedrock AgentCore Gateway providing MCP tools access across different stages of development and deployment.
Implementation walkthrough
In the following section, we focus on how Amazon Bedrock AgentCore Gateway, Memory, and Runtime work together to build this multi-agent collaboration solution and deploy it end-to-end with MCP support and persistent intelligence.
We start by setting up the repository and establishing the local runtime environment with API keys, LLM providers, and demo infrastructure. We then bring core AgentCore components online by creating the gateway for standardized API access, configuring authentication, and establishing tool connectivity. We add intelligence through AgentCore Memory, creating strategies for user preferences and investigation history while loading personas for personalized incident response. Finally, we configure individual agents with specialized tools, integrate memory capabilities, orchestrate collaborative workflows, and deploy to AgentCore Runtime with full observability.
Detailed instructions for each step are provided in the repository:

Use Case Setup Guide – Backend deployment and development setup
Deployment Guide – Production containerization and Amazon Bedrock AgentCore Runtime deployment

Prerequisites
You can find the port forwarding requirements and other setup instructions in the README file’s Prerequisites section.
Convert APIs to MCP tools with Amazon Bedrock AgentCore Gateway
Amazon Bedrock AgentCore Gateway demonstrates the power of protocol standardization by converting existing backend APIs into MCP tools that agent frameworks can consume. This transformation happens seamlessly, requiring only OpenAPI specifications.
Upload OpenAPI specifications
The gateway process begins by uploading your existing API specifications to Amazon Simple Storage Service (Amazon S3). The create_gateway.sh script automatically handles uploading the four API specifications (Kubernetes, Logs, Metrics, and Runbooks) to your configured S3 bucket with proper metadata and content types. These specifications will be used to create API endpoint targets in the gateway.
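The following is a minimal sketch of what that upload step looks like with boto3; the bucket name, local paths, and key prefix are placeholders for illustration, because create_gateway.sh performs this step for you against the repository’s actual file layout.

# Hypothetical sketch of uploading the OpenAPI specs to S3; the real work is
# done by create_gateway.sh, and these names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "your-sre-agent-spec-bucket"  # placeholder: your configured S3 bucket

for spec in ("k8s_api.yaml", "logs_api.yaml", "metrics_api.yaml", "runbooks_api.yaml"):
    s3.upload_file(
        Filename=f"backend/openapi_specs/{spec}",  # placeholder local path
        Bucket=bucket,
        Key=f"openapi/{spec}",
        ExtraArgs={"ContentType": "application/yaml"},
    )
    print(f"Uploaded {spec} to s3://{bucket}/openapi/{spec}")
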
Create an identity provider and gateway
Authentication is handled seamlessly through Amazon Bedrock AgentCore Identity. The main.py script creates both the credential provider and gateway:

# Create AgentCore Gateway with JWT authorization
def create_gateway(
    client: Any,
    gateway_name: str,
    role_arn: str,
    discovery_url: str,
    allowed_clients: list = None,
    description: str = "AgentCore Gateway created via SDK",
    search_type: str = "SEMANTIC",
    protocol_version: str = "2025-03-26",
) -> Dict[str, Any]:

    # Build auth config for Cognito
    auth_config = {"customJWTAuthorizer": {"discoveryUrl": discovery_url}}
    if allowed_clients:
        auth_config["customJWTAuthorizer"]["allowedClients"] = allowed_clients

    protocol_configuration = {
        "mcp": {"searchType": search_type, "supportedVersions": [protocol_version]}
    }

    response = client.create_gateway(
        name=gateway_name,
        roleArn=role_arn,
        protocolType="MCP",
        authorizerType="CUSTOM_JWT",
        authorizerConfiguration=auth_config,
        protocolConfiguration=protocol_configuration,
        description=description,
        exceptionLevel="DEBUG",
    )
    return response

Deploy API endpoint targets with credential providers
Each API becomes an MCP target through the gateway. The solution automatically handles credential management:

def create_api_endpoint_target(
    client: Any,
    gateway_id: str,
    s3_uri: str,
    provider_arn: str,
    target_name_prefix: str = "open",
    description: str = "API Endpoint Target for OpenAPI schema",
) -> Dict[str, Any]:

    api_target_config = {"mcp": {"openApiSchema": {"s3": {"uri": s3_uri}}}}

    # API key credential provider configuration
    credential_config = {
        "credentialProviderType": "API_KEY",
        "credentialProvider": {
            "apiKeyCredentialProvider": {
                "providerArn": provider_arn,
                "credentialLocation": "HEADER",
                "credentialParameterName": "X-API-KEY",
            }
        },
    }

    response = client.create_gateway_target(
        gatewayIdentifier=gateway_id,
        name=target_name_prefix,
        description=description,
        targetConfiguration=api_target_config,
        credentialProviderConfigurations=[credential_config],
    )
    return response

Validate MCP tools are ready for agent framework
Post-deployment, Amazon Bedrock AgentCore Gateway provides a standardized /mcp endpoint secured with JWT tokens. Testing the deployment with mcp_cmds.sh reveals the power of this transformation:

Tool summary:
================
Total tools found: 21

Tool names:
• x_amz_bedrock_agentcore_search
• k8s-api___get_cluster_events
• k8s-api___get_deployment_status
• k8s-api___get_node_status
• k8s-api___get_pod_status
• k8s-api___get_resource_usage
• logs-api___analyze_log_patterns
• logs-api___count_log_events
• logs-api___get_error_logs
• logs-api___get_recent_logs
• logs-api___search_logs
• metrics-api___analyze_trends
• metrics-api___get_availability_metrics
• metrics-api___get_error_rates
• metrics-api___get_performance_metrics
• metrics-api___get_resource_metrics
• runbooks-api___get_common_resolutions
• runbooks-api___get_escalation_procedures
• runbooks-api___get_incident_playbook
• runbooks-api___get_troubleshooting_guide
• runbooks-api___search_runbooks

Universal agent framework compatibility
This MCP-standardized gateway can now be configured as a Streamable-HTTP server for MCP clients, including AWS Strands (Amazon’s agent development framework), LangGraph (the framework used in our SRE agent implementation), and CrewAI (a multi-agent collaboration framework).
The advantage of this approach is that existing APIs require no modification—only OpenAPI specifications. Amazon Bedrock AgentCore Gateway handles the following:

Protocol translation – Between REST APIs and MCP
Authentication – JWT token validation and credential injection
Security – TLS termination and access control
Standardization – Consistent tool naming and parameter handling

This means you can take existing infrastructure APIs (Kubernetes, monitoring, logging, documentation) and instantly make them available to AI agent frameworks that support MCP—through a single, secure, standardized interface.
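As a rough illustration of what a client sees, the following sketch lists the gateway’s tools over Streamable HTTP using the open source MCP Python SDK; the gateway endpoint URL and the JWT access token are placeholders, and the repository’s own scripts (such as mcp_cmds.sh) already handle this for you.

# Illustrative only: list the gateway's MCP tools with the MCP Python SDK.
# GATEWAY_MCP_URL and ACCESS_TOKEN are placeholders you must supply.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

GATEWAY_MCP_URL = "https://your-gateway-endpoint/mcp"  # placeholder
ACCESS_TOKEN = "your-jwt-access-token"                 # placeholder


async def list_gateway_tools() -> None:
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    async with streamablehttp_client(GATEWAY_MCP_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name)


asyncio.run(list_gateway_tools())
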
Implement persistent intelligence with Amazon Bedrock AgentCore Memory
Whereas Amazon Bedrock AgentCore Gateway provides seamless API access, Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent, learning assistant. The memory implementation demonstrates how a few lines of code can enable sophisticated personalization and cross-session knowledge retention.
Initialize memory strategies
The SRE agent memory component is built on Amazon Bedrock AgentCore Memory’s event-based model with automatic namespace routing. During initialization, the solution creates three memory strategies with specific namespace patterns:

from sre_agent.memory.client import SREMemoryClient
from sre_agent.memory.strategies import create_memory_strategies

# Initialize memory client
memory_client = SREMemoryClient(
    memory_name="sre_agent_memory",
    region="us-east-1"
)

# Create three specialized memory strategies
strategies = create_memory_strategies()
for strategy in strategies:
    memory_client.create_strategy(strategy)

The three strategies each serve distinct purposes:

User preferences (/sre/users/{user_id}/preferences) – Individual investigation styles and communication preferences
Infrastructure knowledge (/sre/infrastructure/{user_id}/{session_id}) – Domain expertise accumulated across investigations
Investigation summaries (/sre/investigations/{user_id}/{session_id}) – Historical incident patterns and resolutions

Load user personas and preferences
The solution comes preconfigured with user personas that demonstrate personalized investigations. The manage_memories.py script loads these personas:

# Load Alice – Technical SRE Engineer
alice_preferences = {
    "investigation_style": "detailed_systematic_analysis",
    "communication": ["#alice-alerts", "#sre-team"],
    "escalation": {"contact": "alice.manager@company.com", "threshold": "15min"},
    "reports": "technical_exposition_with_troubleshooting_steps",
    "timezone": "UTC"
}

# Load Carol – Executive/Director
carol_preferences = {
    "investigation_style": "business_impact_focused",
    "communication": ["#carol-executive", "#strategic-alerts"],
    "escalation": {"contact": "carol.director@company.com", "threshold": "5min"},
    "reports": "executive_summary_without_technical_details",
    "timezone": "EST"
}

# Store preferences using memory client
memory_client.store_user_preference("Alice", alice_preferences)
memory_client.store_user_preference("Carol", carol_preferences)

Automatic namespace routing in action
The power of Amazon Bedrock AgentCore Memory lies in its automatic namespace routing. When the SRE agent creates events, it only needs to provide the actor_id—Amazon Bedrock AgentCore Memory automatically determines which namespaces the event belongs to:

# During investigation, the supervisor agent stores context
memory_client.create_event(
    memory_id="sre_agent_memory-abc123",
    actor_id="Alice",  # AgentCore Memory routes this automatically
    session_id="investigation_2025_01_15",
    messages=[("investigation_started", "USER")]
)

# Memory system automatically:
# 1. Checks all strategy namespaces
# 2. Matches actor_id "Alice" to /sre/users/Alice/preferences
# 3. Stores event in User Preferences Strategy
# 4. Makes event available for future retrievals

Validate the personalized investigation experience
The memory component’s impact becomes clear when both Alice and Carol investigate the same issue. Using identical technical findings, the solution produces completely different presentations of the same underlying content.
Alice’s technical report contains detailed systematic analysis for technical teams:

Technical Investigation Summary

Root Cause: Payment processor memory leak causing OOM kills

Analysis:
– Pod restart frequency increased 300% at 14:23 UTC
– Memory utilization peaked at 8.2GB (80% of container limit)
– JVM garbage collection latency spiked to 2.3s

Next Steps:
1. Implement heap dump analysis (`kubectl exec payment-pod -- jmap`)
2. Review recent code deployments for memory management changes
3. Consider increasing memory limits and implementing graceful shutdown

Carol’s executive summary contains a business impact assessment focused on executive stakeholders:

Business Impact Assessment
Status: CRITICAL – Customer payment processing degraded
Impact: 23% transaction failure rate, $47K revenue at risk
Timeline: Issue detected 14:23 UTC, resolution ETA 45 minutes
Business Actions:
– Customer communication initiated via status page
– Finance team alerted for revenue impact tracking
– Escalating to VP Engineering if not resolved by 15:15 UTC

The memory component enables this personalization while continuously learning from each investigation, building organizational knowledge that improves incident response over time.
Deploy to production with Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore makes it straightforward to deploy existing agents to production. The process involves three key steps: containerizing your agent, deploying to Amazon Bedrock AgentCore Runtime, and invoking the deployed agent.
Containerize your agent
Amazon Bedrock AgentCore Runtime requires ARM64 containers. The following code shows the complete Dockerfile:

# Use uv's ARM64 Python base image
FROM --platform=linux/arm64 ghcr.io/astral-sh/uv:python3.12-bookworm-slim

WORKDIR /app

# Copy uv files
COPY pyproject.toml uv.lock ./

# Install dependencies
RUN uv sync --frozen --no-dev

# Copy SRE agent module
COPY sre_agent/ ./sre_agent/

# Set environment variables
# Note: Set DEBUG=true to enable debug logging and traces
ENV PYTHONPATH="/app" \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8080

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

Existing agents just need a FastAPI wrapper (agent_runtime:app) to become compatible with Amazon Bedrock AgentCore, and we add opentelemetry-instrument to enable observability through Amazon Bedrock AgentCore.
Deploy to Amazon Bedrock AgentCore Runtime
Deploying to Amazon Bedrock AgentCore Runtime is straightforward with the deploy_agent_runtime.py script:

import boto3

# Create AgentCore client
client = boto3.client('bedrock-agentcore', region_name=region)

# Environment variables for your agent
env_vars = {
    'GATEWAY_ACCESS_TOKEN': gateway_access_token,
    'LLM_PROVIDER': llm_provider,
    'ANTHROPIC_API_KEY': anthropic_api_key  # if using Anthropic
}

# Deploy container to AgentCore Runtime
response = client.create_agent_runtime(
    agentRuntimeName=runtime_name,
    agentRuntimeArtifact={
        'containerConfiguration': {
            'containerUri': container_uri  # Your ECR container URI
        }
    },
    networkConfiguration={"networkMode": "PUBLIC"},
    roleArn=role_arn,
    environmentVariables=env_vars
)

print(f"Agent Runtime ARN: {response['agentRuntimeArn']}")

Amazon Bedrock AgentCore handles the infrastructure, scaling, and session management automatically.
Invoke your deployed agent
Calling your deployed agent is just as simple with invoke_agent_runtime.py:

# Prepare your query with user_id and session_id for memory personalization
payload = json.dumps({
“input”: {
“prompt”: “API response times have degraded 3x in the last hour”,
“user_id”: “Alice”, # User for personalized investigation
“session_id”: “investigation-20250127-123456″ # Session for context
}
})

# Invoke the deployed agent
response = agent_core_client.invoke_agent_runtime(
agentRuntimeArn=runtime_arn,
runtimeSessionId=session_id,
payload=payload,
qualifier=”DEFAULT”
)

# Get the response
response_data = json.loads(response[‘response’].read())
print(response_data) # Full response includes output with agent’s investigation

Key benefits of Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Runtime offers the following key benefits:

Zero infrastructure management – No servers, load balancers, or scaling to configure
Built-in session isolation – Each conversation is completely isolated
AWS IAM integration – Secure access control without custom authentication
Automatic scaling – Scales from zero to thousands of concurrent sessions

The complete deployment process, including building containers and handling AWS permissions, is documented in the Deployment Guide.
Real-world use cases
Let’s explore how the SRE agent handles common incident response scenarios with a real investigation.
When facing a production issue, you can query the system in natural language. The solution uses Amazon Bedrock AgentCore Memory to personalize the investigation based on your role and preferences:

export USER_ID=Alice
sre-agent --prompt "API response times have degraded 3x in the last hour"

The supervisor retrieves Alice’s preferences from memory (detailed systematic analysis style) and creates an investigation plan tailored to her role as a Technical SRE:

Investigation Plan
1. Use metrics_agent to analyze API performance metrics including response times, error rates, and resource utilization to identify the extent and pattern of slowdown
2. Use logs_agent to examine application logs for errors related to slow API responses, focusing on database connection issues and memory errors
3. Use kubernetes_agent to check pod status and resource constraints, especially for web-service and database pods, looking for CrashLoopBackOff states and missing ConfigMaps
Complexity: Simple
Auto-execute: Yes
Agents involved: Metrics Agent, Logs Agent, Kubernetes Agent

The agents investigate sequentially according to the plan, each contributing their specialized analysis. The solution then aggregates these findings into a comprehensive executive summary:

Executive Summary
Key Insights
– Root Cause: Database service failure due to missing ConfigMap 'database-config' in production namespace, causing database pod to crash repeatedly
– Impact: Severe performance degradation with API response times increased from 150ms to 5000ms (33x slower)
– Severity: High – Database unavailability, memory exhaustion (100%), and CPU saturation (95%) causing 75% error rate
Next Steps
1. Immediate (< 1 hour): Create/update ConfigMap 'database-config' in production namespace and restart database pod
2. Short-term (< 24 hours):
– Fix permissions on '/var/lib/postgresql/data' directory
– Increase Java heap space for web-service to address OutOfMemoryErrors
– Optimize UserService.loadAllUsers method causing memory issues
3. Long-term (< 1 week):
– Implement resource monitoring with alerts for CPU (>80%), memory (>90%)
– Optimize slow database queries, particularly "SELECT * FROM users WHERE status='active'"
– Scale up resources or implement autoscaling for web-service
Critical Alerts
– Database pod (database-pod-7b9c4d8f2a-x5m1q) in CrashLoopBackOff state
– Web-service experiencing OutOfMemoryErrors in UserService.loadAllUsers(UserService.java:45)
– Node-3 experiencing memory pressure (>85% usage)
– Web-app-deployment showing readiness probe failures with 503 errors
Troubleshooting Steps
1. Verify ConfigMap status: `kubectl get configmap database-config -n production`
2. Check database pod logs: `kubectl logs database-pod-7b9c4d8f2a-x5m1q -n production`
3. Create/update ConfigMap: `kubectl create configmap database-config --from-file=database.conf -n production`
4. Fix data directory permissions: `kubectl exec database-pod-7b9c4d8f2a-x5m1q -n production -- chmod -R 700 /var/lib/postgresql/data`
5. Restart database pod: `kubectl delete pod database-pod-7b9c4d8f2a-x5m1q -n production`

This investigation demonstrates how Amazon Bedrock AgentCore primitives work together:

Amazon Bedrock AgentCore Gateway – Provides secure access to infrastructure APIs through MCP tools
Amazon Bedrock AgentCore Identity – Handles ingress and egress authentication
Amazon Bedrock AgentCore Runtime – Hosts the multi-agent solution with automatic scaling
Amazon Bedrock AgentCore Memory – Personalizes Alice’s experience and stores investigation knowledge for future incidents
Amazon Bedrock AgentCore Observability – Captures detailed metrics and traces in CloudWatch for monitoring and debugging

The SRE agent demonstrates intelligent agent orchestration, with the supervisor routing work to specialists based on the investigation plan. The solution’s memory capabilities make sure each investigation builds organizational knowledge and provides personalized experiences based on user roles and preferences.
This investigation showcases several key capabilities:

Multi-source correlation – It connects database configuration issues to API performance degradation
Sequential investigation – Agents work systematically through the investigation plan while providing live updates
Source attribution – Findings include the specific tool and data source
Actionable insights – It provides a clear timeline of events and prioritized recovery steps
Cascading failure detection – It can help show how one failure propagates through the system

Business impact
Organizations implementing AI-powered SRE assistance report significant improvements in key operational metrics. Initial investigations that previously took 30–45 minutes can now be completed in 5–10 minutes, providing SREs with comprehensive context before diving into detailed analysis. This dramatic reduction in investigation time translates directly to faster incident resolution and reduced downtime.
The solution improves how SREs interact with their infrastructure. Instead of navigating multiple dashboards and tools, engineers can ask questions in natural language and receive aggregated insights from relevant data sources. This reduction in context switching enables teams to maintain focus during critical incidents and reduces cognitive load during investigations.
Perhaps most importantly, the solution democratizes knowledge across the team. All team members can access the same comprehensive investigation techniques, reducing dependency on tribal knowledge and on-call burden. The consistent methodology provided by the solution makes sure investigation approaches remain uniform across team members and incident types, improving overall reliability and reducing the chance of missed evidence.
The automatically generated investigation reports provide valuable documentation for post-incident reviews and help teams learn from each incident, building organizational knowledge over time. Furthermore, the solution extends existing AWS infrastructure investments, working alongside services like Amazon CloudWatch, AWS Systems Manager, and other AWS operational tools to provide a unified operational intelligence system.
Extending the solution
The modular architecture makes it straightforward to extend the solution for your specific needs.
For example, you can add specialized agents for your domain (a sketch of registering a new backend API target for such an agent follows this list):

Security agent – For compliance checks and security incident response
Database agent – For database-specific troubleshooting and optimization
Network agent – For connectivity and infrastructure debugging

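As a hedged sketch of this extension path, a new specialist’s backend API can be registered as an additional gateway target by reusing the create_api_endpoint_target helper shown earlier. The gateway ID, S3 spec location, and credential provider ARN below are placeholders, and the control-plane client name is an assumption about the AgentCore SDK rather than something taken from the repository.

# Hypothetical: expose a security/compliance API as another MCP target so a
# new security agent can call it. All identifiers below are placeholders.
import boto3

# Assumption: the AgentCore control-plane boto3 client
client = boto3.client("bedrock-agentcore-control", region_name="us-east-1")

security_target = create_api_endpoint_target(
    client=client,
    gateway_id="your-gateway-id",                                        # placeholder
    s3_uri="s3://your-sre-agent-spec-bucket/openapi/security_api.yaml",  # placeholder
    provider_arn="arn-of-your-api-key-credential-provider",              # placeholder
    target_name_prefix="security-api",
    description="Security and compliance API target",
)
print(security_target)
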
You can also replace the demo APIs with connections to your actual systems:

Kubernetes integration – Connect to your cluster APIs for pod status, deployments, and events
Log aggregation – Integrate with your log management service (Elasticsearch, Splunk, CloudWatch Logs)
Metrics platform – Connect to your monitoring service (Prometheus, Datadog, CloudWatch Metrics)
Runbook repository – Link to your operational documentation and playbooks stored in wikis, Git repositories, or knowledge bases

Clean up
To avoid incurring future charges, use the cleanup script to remove the billable AWS resources created during the demo:

# Complete cleanup – deletes AWS resources and local files
./scripts/cleanup.sh

This script automatically performs the following actions:

Stops backend servers
Deletes the gateway and its targets
Deletes Amazon Bedrock AgentCore Memory resources
Deletes the Amazon Bedrock AgentCore Runtime
Removes generated files (gateway URIs, tokens, agent ARNs, memory IDs)

For detailed cleanup instructions, refer to Cleanup Instructions.
Conclusion
The SRE agent demonstrates how multi-agent systems can transform incident response from a manual, time-intensive process into a time-efficient, collaborative investigation that provides SREs with the insights they need to resolve issues quickly and confidently.
By combining the enterprise-grade infrastructure of Amazon Bedrock AgentCore with standardized tool access in MCP, we’ve created a foundation that can adapt as your infrastructure evolves and new capabilities emerge.
The complete implementation is available in our GitHub repository, including demo environments, configuration guides, and extension examples. We encourage you to explore the solution, customize it for your infrastructure, and share your experiences with the community.
To get started building your own SRE assistant, refer to the following resources:

Automate tasks in your application using AI agents
Amazon Bedrock AgentCore Samples GitHub repository
Model Context Protocol documentation
LangGraph documentation

About the authors
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.
Dheeraj Oruganty is a Delivery Consultant at Amazon Web Services. He is passionate about building innovative Generative AI and Machine Learning solutions that drive real business impact. His expertise spans Agentic AI Evaluations, Benchmarking and Agent Orchestration, where he actively contributes to research advancing the field. He holds a master’s degree in Data Science from Georgetown University. Outside of work, he enjoys geeking out on cars, motorcycles, and exploring nature.

OpenAI Releases ChatGPT ‘Pulse’: Proactive, Personalized Daily Briefings for Pro Users

OpenAI introduced ChatGPT Pulse, a proactive experience that compiles personalized, research-backed updates each morning. In preview on mobile and limited to $200/month Pro subscribers, Pulse surfaces topical cards built from a user’s chats, explicit feedback, and opt-in connected apps (e.g., calendar/email), shifting ChatGPT from a request-driven tool to a context-aware assistant.

What Pulse Actually Does Under the Hood

Each day, Pulse performs background research anchored to user signals: recent conversations, long-term interests, thumbs-up/down feedback, and data from connected apps where enabled. The output appears as scannable visual cards (briefs and deep links) rather than an infinite feed, designed for quick triage and drill-down. Early examples include targeted news roundups and context-conditioned suggestions (e.g., travel planning aligned with calendar events).

Data Sources and Controls

Integrations are off by default and can be toggled. When granted, Pulse may use Gmail/Google Calendar context to tailor cards (e.g., meeting prep, itinerary nudges). OpenAI positions this as a user-level personalization layer; reporting notes emphasize optionality and in-app settings for managing connected accounts and memory.

Availability and Rollout Plan

Pulse is rolling out now to Pro on the ChatGPT mobile app as a dedicated tab. OpenAI says it wants broader availability “soon,” with Plus access targeted after product and efficiency improvements. The company reiterated the Pro-first gating due to compute costs.

Product Positioning: Toward Agentic, Goal-Oriented Workflows

OpenAI frames Pulse as the first step toward agent-like behavior where the model tracks goals and initiates updates without prompts. External coverage highlights the shift from chat to assistant workflows that reason over user state and schedule. This aligns with OpenAI’s recent emphasis on agents and proactive help, not passive Q&A.

The Signal from Leadership

Sam Altman summarized the intent succinctly: Pulse is his “favorite feature” to date, starting with Pro. His post also underscores the model’s use of interests and recent chats, hinting at broader personalization as users share preferences over time. OpenAI’s official announcement on X mirrors the blog language around daily, proactive updates.

Today we are launching my favorite feature of ChatGPT so far, called Pulse. It is initially available to Pro subscribers. Pulse works for you overnight, and keeps thinking about your interests, your connected data, your recent chats, and more. Every morning, you get a… — Sam Altman (@sama), September 25, 2025

Competitive Context

Pulse lands in a crowded “morning brief” space but differs by tying briefs to your live context and chats rather than generic headlines. It also inches ChatGPT toward hands-on assistant territory seen in agent platforms that watch calendars, draft emails, and pre-stage tasks—yet packaged for consumers inside the ChatGPT app rather than a separate agent runner.

Summary

Pulse formalizes ChatGPT as a proactive system: it reads your signals, checks your day, and delivers a compact, personalized brief—first for Pro on mobile, with Plus on the roadmap once the system is optimized. The implementation details (APIs, enterprise knobs, retention policies) will determine how far it goes beyond morning cards into full agent workflows.

The post OpenAI Releases ChatGPT ‘Pulse’: Proactive, Personalized Daily Briefings for Pro Users appeared first on MarkTechPost.

OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks

OpenAI introduced GDPval, a new evaluation suite designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations in nine GDP-dominant U.S. sectors. Unlike academic benchmarks, GDPval centers on authentic deliverables—presentations, spreadsheets, briefs, CAD artifacts, audio/video—graded by occupational experts through blinded pairwise comparisons. OpenAI also released a 220-task “gold” subset and an experimental automated grader hosted at evals.openai.com.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals averaging 14 years of experience. Tasks map to O*NET work activities and include multi-modal file handling (docs, slides, images, audio, video, spreadsheets, CAD), with up to dozens of reference files per task. The gold subset provides public prompts and references; primary scoring still relies on expert pairwise judgments due to subjectivity and format requirements.

https://openai.com/index/gdpval/

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review, with model progress trending roughly linearly across releases. Reported model-vs-human win/tie rates are near parity for top models; error profiles cluster around instruction-following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding (e.g., format checks, artifact rendering for self-inspection) yield predictable gains.

Time–Cost Math: Where AI Pays Off

GDPval runs scenario analyses comparing human-only to model-assisted workflows with expert review. It quantifies (i) human completion time and wage-based cost, (ii) reviewer time/cost, (iii) model latency and API cost, and (iv) empirically observed win rates. Results indicate potential time/cost reductions for many task classes once review overhead is included.
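As a purely illustrative sketch of that scenario arithmetic, the snippet below compares a human-only workflow with a model-assisted-plus-review workflow; every number is a made-up placeholder, not a GDPval figure.

# Illustrative-only cost comparison; all inputs are placeholder assumptions,
# not numbers reported by GDPval.
human_hours = 6.0      # expert time to complete the task unaided
hourly_wage = 80.0     # expert wage in USD/hour
review_hours = 1.0     # expert time to review a model deliverable
model_cost = 2.50      # model API cost per attempt in USD
win_rate = 0.45        # fraction of deliverables accepted after review

human_only_cost = human_hours * hourly_wage

# Pessimistic simplification: if the deliverable is rejected, the expert
# redoes the task from scratch after the review.
assisted_cost = (model_cost + review_hours * hourly_wage
                 + (1 - win_rate) * human_hours * hourly_wage)

print(f"Human-only:     ${human_only_cost:,.2f}")
print(f"Model-assisted: ${assisted_cost:,.2f}")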

Automated Judging: Useful Proxy, Not Oracle

For the gold subset, an automated pairwise grader shows ~66% agreement with human experts, within ~5 percentage points of human–human agreement (~71%). It’s positioned as an accessibility proxy for rapid iteration, not a replacement for expert review.

https://openai.com/index/gdpval/

Why This Isn’t Yet Another Benchmark

Occupational breadth: Spans top GDP sectors and a wide slice of O*NET work activities, not just narrow domains.

Deliverable realism: Multi-file, multi-modal inputs/outputs stress structure, formatting, and data handling.

Moving ceiling: Uses human preference win rate against expert deliverables, enabling re-baselining as models improve.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work. Physical labor, long-horizon interactivity, and organization-specific tooling are out of scope. Tasks are one-shot and precisely specified; ablations show performance drops with reduced context. Construction and grading are resource-intensive, motivating the automated grader—whose limits are documented—and future expansion.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evals with occupational, multi-modal, file-centric tasks and reports human preference outcomes, time/cost analyses, and ablations on reasoning effort and agent scaffolding. v0 is versioned and expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by pairing expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time/cost trade-offs while exposing failure modes and the effects of scaffolding and reasoning effort. Scope remains v0—computer-mediated, one-shot tasks with expert review—yet it establishes a reproducible baseline for tracking real-world capability gains across occupations.

Check out the Paper, Technical details, and Dataset on Hugging Face.

The post OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks appeared first on MarkTechPost.

Meta FAIR Released Code World Model (CWM): A 32-Billion-Parameter Open-Weights LLM to Advance Research on Code Generation with World Models

Meta FAIR released Code World Model (CWM), a 32-billion-parameter dense decoder-only LLM that injects world modeling into code generation by training on execution traces and long-horizon agent–environment interactions—not just static source text.

What’s new: learning code by predicting execution?

CWM mid-trains on two large families of observation–action trajectories: (1) Python interpreter traces that record local variable states after each executed line, and (2) agentic interactions inside Dockerized repositories that capture edits, shell commands, and test feedback. This grounding is intended to teach semantics (how state evolves) rather than only syntax.

To scale collection, the research team built executable repository images from thousands of GitHub projects and foraged multi-step trajectories via a software-engineering agent (“ForagerAgent”). The release reports ~3M trajectories across ~10k images and 3.15k repos, with mutate-fix and issue-fix variants.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Model and context window

CWM is a dense, decoder-only Transformer (no MoE) with 64 layers, GQA (48Q/8KV), SwiGLU, RMSNorm, and Scaled RoPE. Attention alternates local 8k and global 131k sliding-window blocks, enabling 131k tokens effective context; training uses document-causal masking.

Training recipe (pre → mid → post)

General pretraining: 8T tokens (code-heavy) at 8k context.

Mid-training: +5T tokens, long-context (131k) with Python execution traces, ForagerAgent data, PR-derived diffs, IR/compilers, Triton kernels, and Lean math.

Post-training: 100B-token SFT for instruction + reasoning, then multi-task RL (~172B tokens) across verifiable coding, math, and multi-turn SWE environments using a GRPO-style algorithm and a minimal toolset (bash/edit/create/submit).

Quantized inference fits on a single 80 GB H100.

Benchmarks

The research team cites the following pass@1 / scores (test-time scaling noted where applicable):

SWE-bench Verified: 65.8% (with test-time scaling).

LiveCodeBench-v5: 68.6%; LCB-v6: 63.5%.

Math-500: 96.6%; AIME-24: 76.0%; AIME-25: 68.2%.

CruxEval-Output: 94.3%.

The research team positions CWM as competitive with similarly sized open-weights baselines and even with larger or closed models on SWE-bench Verified.

For context on SWE-bench Verified’s task design and metrics, see the official benchmark resources.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Why world modeling matters for code?

The release emphasizes two operational capabilities:

Execution-trace prediction: given a function and a trace start, CWM predicts stack frames (locals) and the executed line at each step via a structured format—usable as a “neural debugger” for grounded reasoning without live execution.

Agentic coding: multi-turn reasoning with tool use against real repos, verified by hidden tests and patch similarity rewards; the setup trains the model to localize faults and generate end-to-end patches (git diff) rather than snippets.

Some details worth noting

Tokenizer: Llama-3 family with reserved control tokens; reserved IDs are used to demarcate trace and reasoning segments during SFT.

Attention layout: the 3:1 local:global interleave is repeated across the depth; long-context training occurs at large token batch sizes to stabilize gradients.

Compute scaling: learning-rate/batch size schedules are derived from internal scaling-law sweeps tailored for long-context overheads.

Summary

CWM is a pragmatic step toward grounded code generation: Meta ties a 32B dense transformer to execution-trace learning and agentic, test-verified patching, releases intermediate/post-trained checkpoints, and gates usage under the FAIR Non-Commercial Research License—making it a useful platform for reproducible ablations on long-context, execution-aware coding without conflating research with production deployment.

Check out the Paper, GitHub Page, and Model on Hugging Face.

The post Meta FAIR Released Code World Model (CWM): A 32-Billion-Parameter Open-Weights LLM, to Advance Research on Code Generation with World Models appeared first on MarkTechPost.

DoWhile loops now supported in Amazon Bedrock Flows

Today, we are excited to announce support for DoWhile loops in Amazon Bedrock Flows. With this powerful new capability, you can create iterative, condition-based workflows directly within your Amazon Bedrock flows, using Prompt nodes, AWS Lambda functions, Amazon Bedrock Agents, Amazon Bedrock Flows inline code, Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and other Amazon Bedrock nodes within the loop structure. This feature avoids the need for complex workarounds, enabling sophisticated iteration patterns that use the full range of Amazon Bedrock Flows components. Tasks like content refinement, recursive analysis, and multi-step processing can now seamlessly integrate AI model calls, custom code execution, and knowledge retrieval in repeated cycles. By providing loop support with diverse node types, this feature simplifies generative AI application development and accelerates enterprise adoption of complex, adaptive AI solutions.
Organizations using Amazon Bedrock Flows can now use DoWhile loops to design and deploy workflows for building more scalable and efficient generative AI applications fully within the Amazon Bedrock environment while achieving the following:

Iterative processing – Execute repeated operations until specific conditions are met, enabling dynamic content refinement and recursive improvements
Conditional logic – Implement sophisticated decision-making within flows based on AI outputs and business rules
Complex use cases – Manage multi-step generative AI workflows that require repeated execution and refinement
Builder-friendly – Create and manage loops through both the Amazon Bedrock API and the AWS Management Console
Observability – Employ seamless tracking of loop iterations, conditions, and execution paths

In this post, we discuss the benefits of this new feature, and show how to use DoWhile loops in Amazon Bedrock Flows.
Benefits of DoWhile loops in Amazon Bedrock Flows
Using DoWhile loops in Amazon Bedrock Flows offers the following benefits:

Simplified flow control – Create sophisticated iterative workflows without complex orchestration or external services
Flexible processing – Enable dynamic, condition-based execution paths that can adapt based on AI outputs and business rules
Enhanced development experience – Help users build complex iterative workflows through an intuitive interface, without requiring external workflow management

Solution overview
In the following sections, we show how to create a simple Amazon Bedrock flow using DoWhile loops with Lambda functions. Our example showcases a practical application where we construct a flow that generates a blog post on a given topic in an iterative manner until certain acceptance criteria are fulfilled. The flow demonstrates the power of combining different types of Amazon Bedrock Flows nodes within a loop structure, where Prompt nodes generate and fine-tune the blog post, Inline Code nodes allow writing custom Python code to analyze the outputs, and S3 Storage nodes enable storing each version of the blog post during the process for reference. The DoWhile loop continues to execute until the quality of the blog post meets the condition set in the loop controller. This example illustrates how different flow nodes can work together within a loop to progressively transform data until desired conditions are met, providing a foundation for understanding more complex iterative workflows with various node combinations.
Prerequisites
Before implementing the new capabilities, make sure you have the following:

An AWS account
Other Amazon Bedrock services in place:

Create and test your base prompts in Amazon Bedrock Prompt Management
Create guardrails with relevant rules using Amazon Bedrock Guardrails

Resources in auxiliary AWS services needed for your workflow, such as Lambda, Amazon DynamoDB, and Amazon S3
Required AWS Identity and Access Management (IAM) permissions:

Access to Amazon Bedrock Flows
Appropriate access to large language models (LLMs) in Amazon Bedrock

After these components are in place, you can proceed with using Amazon Bedrock Flows with DoWhile loop capabilities in your generative AI use case.
Create your flow using DoWhile Loop nodes
Complete the following steps to create your flow:

On the Amazon Bedrock console, choose Flows under Builder tools in the navigation pane.
Create a new flow, for example, dowhile-loop-demo. For detailed instructions on creating a flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Add a DoWhile loop node.
Add additional nodes according to the solution workflow (discussed in the next section).

Amazon Bedrock provides different node types to build your prompt flow. For this example, we use a DoWhile Loop node for calling different types of nodes for a generative AI-powered application, which creates a blog post on a given topic and checks the quality in every loop. There is one DoWhile Loop node in the flow. This new node type is on the Nodes tab in the left pane, as shown in the following screenshot.

DoWhile loop workflow
A DoWhile loop consists of two parts: the loop and the loop controller. The loop controller validates the logic for the loop and decides whether to continue or exit the loop. In this example, the loop executes Prompt, Inline Code, and S3 Storage nodes each time it runs.
Let’s go through this flow step-by-step, as illustrated in the preceding screenshot:

A user asks to write a blog post on a specific topic (for example, using the following prompt: {"topic": "AWS Lambda", "Audience": "Chief Technology Officer", "word_count": "500"}). This prompt is sent to the Prompt node (Content_Generator).
The Prompt node (Content_Generator) writes a blog post based on the prompt using one of the LLMs provided by Amazon Bedrock (such as Amazon Nova or Anthropic’s Claude) and sends it to the Loop Input node. This is the entry point to the DoWhile Loop node.
Three steps happen in tandem:

The Loop Input node forwards the blog post content to another Prompt node (Blog_Analysis_Rating) for rating the post based on criteria mentioned as part of the prompt. The output of this Prompt node is JSON code like the following example. The output of a Prompt node is always of type String. You can modify the prompt to get different types of output according to your needs. However, you can also ask the LLM to output a single rating number.

{
    "overall_rating": 8.5,
    "category_ratings": {
        "clarity_and_readability": 9,
        "value_to_target_audience": 8,
        "engagement_level": 8,
        "technical_accuracy": 9
    }
}

The blog post is sent to the flow output during every iteration. It becomes the final version when the loop exits, either because the continue condition is no longer met or because the maximum number of loop iterations is reached.
At the same time, the output of the previous Prompt node (Content_Generator) is forwarded to another Prompt node (Blog_Refinement) by the Loop Input node. This node recreates or modifies the blog post based on the feedback from the analysis.

The output of the Prompt node (Blog_Analysis_Rating) is fed into the Inline Code node, which extracts the necessary rating (or other information required for checking the condition inside the loop controller) and returns it as an input variable, for example a numeric rating.

def __func(variable):
    return float(variable["overall_rating"])

__func(variable)

Python code inside the Inline Code must be treated as untrusted, and appropriate parsing, validation, and data handling should be implemented.

The output of the Inline Code node is fed into the loop condition inside the loop controller to validate against the condition we set up for the continue loop. In this example, we are checking for a rating less than or equal to 9 for the generated blog post. You can check up to five conditions. Additionally, a maximum loop iterations parameter makes sure the loop doesn’t continue infinitely.
The step consists of two parts:

A Prompt node (Blog_Refinement) forwards the newly generated blog post to loopinput inside the loop controller.
The loop controller stores the version of the post in Amazon S3 for future reference and comparing the different versions generated.

This path executes if one of the conditions inside the continue loop is met and the maximum number of loop iterations has not been reached. If the loop continues, the newly modified blog post from the previous step is forwarded to the input field of the Loop Input node as LoopInput and the loop runs again.
The final output is produced when the DoWhile loop exits, either because the continue condition is no longer met or because the maximum number of iterations is reached. The output is the final version of the blog post.

You can see the output as shown in the following screenshot. The system also provides access to node execution traces, offering detailed insights into each processing step, real-time performance metrics, and highlighting issues that may have occurred during the flow’s execution. Traces can be enabled using an API and sent to an Amazon CloudWatch log. In the API, set the enableTrace field to true in an InvokeFlow request. Each flowOutputEvent in the response is returned alongside a flowTraceEvent.
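The following is a minimal sketch of enabling traces programmatically; the flow ID, flow alias, and input node name are placeholders for your own flow, and it assumes your SDK version exposes the enableTrace field on InvokeFlow.

# Sketch: invoke the flow with tracing enabled; identifiers are placeholders.
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_flow(
    flowIdentifier="YOUR_FLOW_ID",          # placeholder
    flowAliasIdentifier="YOUR_FLOW_ALIAS",  # placeholder
    enableTrace=True,                       # emit flowTraceEvent entries
    inputs=[{
        "content": {"document": {"topic": "AWS Lambda",
                                 "Audience": "Chief Technology Officer",
                                 "word_count": "500"}},
        "nodeName": "FlowInputNode",        # assumption: default input node name
        "nodeOutputName": "document",
    }],
)

# The response is an event stream; traces arrive alongside flow outputs.
for event in response["responseStream"]:
    if "flowTraceEvent" in event:
        print("trace:", event["flowTraceEvent"])
    elif "flowOutputEvent" in event:
        print("output:", event["flowOutputEvent"]["content"]["document"])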

You have now successfully created and executed an Amazon Bedrock flow using DoWhile Loop nodes. You can also use Amazon Bedrock APIs to programmatically execute this flow. For additional details on how to configure flows, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
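For reference, the following is a minimal Boto3 sketch of invoking the flow programmatically with tracing enabled. The flow ID, alias ID, and input node name are placeholders to replace with the values from your own flow; the input document mirrors the prompt shown earlier.

import boto3

client = boto3.client("bedrock-agent-runtime")

# Placeholders: replace with your flow ID, flow alias ID, and flow input node name
response = client.invoke_flow(
    flowIdentifier="YOUR_FLOW_ID",
    flowAliasIdentifier="YOUR_FLOW_ALIAS_ID",
    enableTrace=True,  # emits flowTraceEvent entries alongside flowOutputEvent
    inputs=[
        {
            "nodeName": "FlowInputNode",
            "nodeOutputName": "document",
            "content": {
                "document": {
                    "topic": "AWS Lambda",
                    "audience": "Chief Technology Officer",
                    "word_count": 500,
                }
            },
        }
    ],
)

# The response is an event stream; print outputs and traces as they arrive
for event in response["responseStream"]:
    if "flowOutputEvent" in event:
        print(event["flowOutputEvent"]["content"]["document"])
    elif "flowTraceEvent" in event:
        print(event["flowTraceEvent"]["trace"])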
Considerations
When working with DoWhile Loop nodes in Amazon Bedrock Flows, keep the following points in mind:

DoWhile Loop nodes don’t support nested loops (loops within loops)
Each loop controller can evaluate up to five input conditions for its exit criteria
A maximum iteration limit must be specified to help prevent infinite loops and enable controlled execution

Conclusion
The integration of DoWhile loops in Amazon Bedrock Flows marks a significant advancement in iterative workflow capabilities, enabling sophisticated loop-based processing that can incorporate Prompt nodes, Inline Code nodes, S3 Storage nodes, Lambda functions, agents, and Knowledge Base nodes. This enhancement responds directly to enterprise customers’ needs for handling complex, repetitive tasks within their AI workflows, helping developers create adaptive, condition-based solutions without requiring external orchestration tools. By providing support for iterative processing patterns, DoWhile loops help organizations build more sophisticated AI applications that can refine outputs, perform recursive operations, and implement complex business logic directly within the Amazon Bedrock environment. This powerful addition to Amazon Bedrock Flows democratizes the development of advanced AI workflows, making iterative AI processing more accessible and manageable across organizations.
DoWhile loops in Amazon Bedrock Flows are now available in all AWS Regions where Amazon Bedrock Flows is supported, except the AWS GovCloud (US) Regions. To get started, open the Amazon Bedrock console or use the Amazon Bedrock APIs to begin building flows with Amazon Bedrock Flows. To learn more, refer to Create your first flow in Amazon Bedrock and Track each step in your flow by viewing its trace in Amazon Bedrock.
We’re excited to see the innovative applications you will build with these new capabilities. As always, we welcome your feedback through AWS re:Post for Amazon Bedrock or your usual AWS contacts. Join the generative AI builder community at community.aws to share your experiences and learn from others.

About the authors
Shubhankar Sumar is a Senior Solutions Architect at AWS, where he specializes in architecting generative AI-powered solutions for enterprise software and SaaS companies across the UK. With a strong background in software engineering, Shubhankar excels at designing secure, scalable, and cost-effective multi-tenant systems on the cloud. His expertise lies in seamlessly integrating cutting-edge generative AI capabilities into existing SaaS applications, helping customers stay at the forefront of technological innovation.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Eric Li is a Software Development Engineer II at AWS, where he builds core capabilities for Amazon Bedrock and SageMaker to support generative AI applications at scale. His work focuses on designing secure, observable, and cost-efficient systems that help developers and enterprises adopt generative AI with confidence. He is passionate about advancing developer experiences for building with large language models, making it easier to integrate AI into production-ready cloud applications.

How PropHero built an intelligent property investment advisor with con …

This post was written with Lucas Dahan, Dil Dolkun, and Mathew Ng from PropHero.
PropHero is a leading property wealth management service that democratizes access to intelligent property investment advice through big data, AI, and machine learning (ML). For the Spanish and Australian consumer base, PropHero needed an AI-powered advisory system that could engage customers in accurate property investment discussions. The goal was to provide personalized investment insights and to guide and assist users at every stage of their investment journey: from understanding the process and gaining visibility into timelines to securely uploading documents and tracking progress in real time.
PropHero collaborated with the AWS Generative AI Innovation Center to implement an intelligent property investment advisor using AWS generative AI services with continuous evaluation. The solution helps users engage in natural language conversations about property investment strategies and receive personalized recommendations based on PropHero’s comprehensive market knowledge.
In this post, we explore how we built a multi-agent conversational AI system using Amazon Bedrock that delivers knowledge-grounded property investment advice. We cover the agent architecture, model selection strategy, and the comprehensive continuous evaluation system that maintains conversation quality while enabling rapid iteration and improvement.
The challenge: Making property investment knowledge more accessible
The area of property investment presents numerous challenges for both novice and experienced investors. Information asymmetry creates barriers where comprehensive market data remains expensive or inaccessible. Traditional investment processes are manual, time-consuming, and require extensive market knowledge to navigate effectively. For the Spanish and Australian consumers specifically, we needed to build a solution that could provide accurate, contextually relevant property investment advice in Spanish while handling complex, multi-turn conversations about investment strategies. The system needed to maintain high accuracy while delivering responses at scale, continuously learning and improving from customer interactions. Most importantly, it needed to assist users across every phase of their journey, from initial onboarding through to final settlement, ensuring comprehensive support throughout the entire investment process.
Solution overview
We built a complete end-to-end solution using AWS generative AI services, architected around a multi-agent AI advisor with integrated continuous evaluation. The system provides seamless data flow from ingestion through intelligent advisory conversations with real-time quality monitoring. The following diagram illustrates this architecture.

The solution architecture consists of four virtual layers, each serving specific functions in the overall system design.
Data foundation layer
The data foundation provides the storage and retrieval infrastructure for system components:

Amazon DynamoDB – Fast storage for conversation history, evaluation metrics, and user interaction data
Amazon Relational Database Service (Amazon RDS) for PostgreSQL – A PostgreSQL database storing LangFuse observability data, including large language model (LLM) traces and latency metrics
Amazon Simple Storage Service (Amazon S3) – A central data lake storing Spanish FAQ documents, property investment guides, and conversation datasets

Multi-agent AI layer
The AI processing layer encompasses the core intelligence components that power the conversational experience:

Amazon Bedrock – Foundation models (FMs) such as LLMs and rerankers powering specialized agents
Amazon Bedrock Knowledge Bases – Semantic search engine with semantic chunking for FAQ-style content
LangGraph – Orchestration of multi-agent workflows and conversation state management
AWS Lambda – Serverless functions executing multi-agent logic and retrieval of user information for richer context

Continuous evaluation layer
The evaluation infrastructure facilitates continuous quality monitoring and improvement through these components:

Amazon CloudWatch – Real-time monitoring of quality metrics with automated alerting and threshold management
Amazon EventBridge – Real-time event triggers for conversation completion and quality assessment
AWS Lambda – Automated evaluation functions measuring context relevance, response groundedness, and goal accuracy
Amazon QuickSight – Interactive dashboards and analytics for monitoring the respective metrics

Application and integration layer
The integration layer provides secure interfaces for external communication:

Amazon API Gateway – Secure API endpoints for conversational interface and evaluation webhooks

Multi-agent AI advisor architecture
The intelligent advisor uses a multi-agent system orchestrated through LangGraph, which sits in a single Lambda function, where each agent is optimized for specific tasks. The following diagram shows the communication flow among the various agents within the Lambda function.

Agent composition and model selection
Our model selection strategy involved extensive testing to match each component’s computational requirements with the most cost-effective Amazon Bedrock model. We evaluated factors including response quality, latency requirements, and cost per token to determine optimal model assignments for each agent type. Each component in the system uses the most appropriate model for its designated function, as outlined in the following list (component – Amazon Bedrock model – purpose).

Router Agent – Anthropic Claude 3.5 Haiku – Query classification and routing
General Agent – Amazon Nova Lite – Common questions and conversation management
Advisor Agent – Amazon Nova Pro – Specialized property investment advice
Settlement Agent – Anthropic Claude 3.5 Haiku – Customer support specializing in the pre-settlement phase of investment
Response Agent – Amazon Nova Lite – Final response generation and formatting
Embedding – Cohere Embed Multilingual v3 – Context retrieval
Retriever – Cohere Rerank 3.5 – Context retrieval and ranking
Evaluator – Anthropic Claude 3.5 Haiku – Quality assessment and scoring

End-to-end conversation flow
The conversation processing follows a structured workflow designed to produce accurate responses while maintaining quality standards (a minimal orchestration sketch follows this list):

User queries enter through API Gateway and are routed to the router agent.
The router agent determines the appropriate specialized agent based on query analysis.
User information is retrieved at the start for richer context, and knowledge-intensive queries trigger the retriever to access the Amazon Bedrock knowledge base.
Specialized agents process queries with retrieved user information and relevant context from the knowledge base.
The response agent formats and generates the final user-facing response with the appropriate tone.
Parallel evaluation processes assess context relevance, response groundedness, and goal accuracy.
Conversation data is stored in DynamoDB for analysis and improvement.
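The following is a minimal, illustrative LangGraph sketch of this routing pattern. The state fields, node functions, and keyword-based routing are simplified placeholders rather than PropHero’s production logic; in the actual system each node calls the Amazon Bedrock model listed earlier.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ConversationState(TypedDict):
    query: str     # user query arriving through API Gateway
    route: str     # decided by the router agent
    response: str  # final user-facing answer

def router_agent(state: ConversationState) -> ConversationState:
    # Placeholder for Anthropic Claude 3.5 Haiku classifying the query
    route = "advisor" if "invest" in state["query"].lower() else "general"
    return {**state, "route": route}

def advisor_agent(state: ConversationState) -> ConversationState:
    # Placeholder for Amazon Nova Pro generating specialized investment advice
    return {**state, "response": f"[advisor answer for: {state['query']}]"}

def general_agent(state: ConversationState) -> ConversationState:
    # Placeholder for Amazon Nova Lite handling common questions
    return {**state, "response": f"[general answer for: {state['query']}]"}

def response_agent(state: ConversationState) -> ConversationState:
    # Placeholder for Amazon Nova Lite formatting the final response
    return {**state, "response": state["response"].strip()}

graph = StateGraph(ConversationState)
graph.add_node("router", router_agent)
graph.add_node("advisor", advisor_agent)
graph.add_node("general", general_agent)
graph.add_node("respond", response_agent)

graph.add_edge(START, "router")
graph.add_conditional_edges("router", lambda s: s["route"],
                            {"advisor": "advisor", "general": "general"})
graph.add_edge("advisor", "respond")
graph.add_edge("general", "respond")
graph.add_edge("respond", END)

app = graph.compile()
print(app.invoke({"query": "How should I start investing in property?"}))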

Amazon Bedrock Knowledge Bases configuration
The knowledge base uses several AWS services for optimal performance:

Data source – S3 bucket
Chunking strategy – Semantic chunking optimized for FAQ-style content
Embedding model – Cohere Embed Multilingual v3 for Spanish understanding
Reranker – Cohere Rerank 3.5 for reranking retrieved Spanish content
Vector database – Amazon OpenSearch Serverless for scalable semantic search

Sample conversation experience
The system handles natural Spanish conversations seamlessly. For example (English translations in parentheses):
Usuario: “Hola, ¿qué es PropHero exactamente?” (“Hi, what exactly is PropHero?”)
Asistente: “¡Hola! PropHero es una plataforma que te permite crear, optimizar y crecer tu patrimonio inmobiliario delegando todo el proceso en nuestro equipo de profesionales…” (“Hi! PropHero is a platform that lets you build, optimize, and grow your property wealth by delegating the entire process to our team of professionals…”)
Usuario: “¿Y cómo funciona en la práctica?” (“And how does it work in practice?”)
Asistente: “Es muy sencillo. Comenzamos con una sesión con tu Property Coach para definir tu estrategia de inversión personalizada…” (“It’s very simple. We start with a session with your Property Coach to define your personalized investment strategy…”)
Integrated continuous evaluation system
The continuous evaluation system is integrated into the architecture, operating as a core component. This approach supports quality monitoring in real time alongside conversation processing.
Evaluation metrics
We used the following metrics from the Ragas library for evaluation:

Context Relevance (0–1) – Measures the relevance of retrieved context to user queries, evaluating RAG system effectiveness
Response Groundedness (0–1) – Measures whether responses are factually accurate and derived from PropHero’s official information
Agent Goal Accuracy (0–1) – Binary measure of whether responses successfully address user investment goals

Real-time evaluation workflow
The evaluation system operates seamlessly within the conversation architecture (a minimal sketch of the evaluation Lambda function follows this list):

Amazon DynamoDB Streams triggers – Conversation data written to DynamoDB automatically triggers a Lambda function for evaluation through Amazon DynamoDB Streams
Parallel processing – Lambda functions execute evaluation logic in parallel with response delivery
Multi-dimensional assessment – Each conversation is evaluated across three key dimensions simultaneously
Intelligent scoring with LLM-as-a-judge – Anthropic’s Claude 3.5 Haiku provides consistent evaluation as an LLM judge, offering standardized assessment criteria across conversations.
Monitoring and analytics – CloudWatch captures metrics from the evaluation process, and QuickSight provides dashboards for trend analysis
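The following is a minimal, hypothetical sketch of such an evaluation Lambda handler: it reads newly inserted conversation records from the DynamoDB Streams event, asks Anthropic’s Claude 3.5 Haiku (through the Bedrock Converse API) for a groundedness score, and publishes the score as a CloudWatch metric. The record attribute names, metric namespace, and prompt are illustrative assumptions, not PropHero’s actual schema.

import boto3

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")

JUDGE_MODEL_ID = "anthropic.claude-3-5-haiku-20241022-v1:0"  # Claude 3.5 Haiku

def handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]      # illustrative attribute names
        question = item["question"]["S"]
        answer = item["answer"]["S"]
        retrieved_context = item["context"]["S"]

        prompt = (
            "Rate from 0 to 1 how well the answer is grounded in the context.\n"
            f"Context:\n{retrieved_context}\n\nQuestion: {question}\n"
            f"Answer: {answer}\n\nReturn only the number."
        )
        result = bedrock.converse(
            modelId=JUDGE_MODEL_ID,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        score = float(result["output"]["message"]["content"][0]["text"].strip())

        # Publish the score so CloudWatch alarms and QuickSight dashboards can use it
        cloudwatch.put_metric_data(
            Namespace="PropHero/Evaluation",  # illustrative namespace
            MetricData=[{"MetricName": "ResponseGroundedness", "Value": score}],
        )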

The following diagram provides an overview of the Lambda function responsible for continuous evaluation.

Implementation insights and best practices
Our development journey involved a 6-week iterative process with PropHero’s technical team. We conducted testing across different model combinations and evaluated chunking strategies using real customer FAQ data. This journey revealed several architectural optimizations that enhanced system performance, achieved significant cost reductions, and improved user experience.
Model selection strategy
Our approach to model selection demonstrates the importance of matching model capabilities to specific tasks. By using Amazon Nova Lite for simpler tasks and Amazon Nova Pro for complex reasoning, the solution achieves optimal cost-performance balance while maintaining high accuracy standards.
Chunking and retrieval optimization
Semantic chunking proved superior to hierarchical and fixed chunking approaches for FAQ-style content. The Cohere Rerank 3.5 model enabled the system to use fewer chunks (10 vs. 20) while maintaining accuracy, reducing latency and cost.
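As an illustration, capping retrieval at 10 chunks is a single configuration value in the Knowledge Bases Retrieve API. The knowledge base ID below is a placeholder, and reranking and answer generation are omitted for brevity.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": "¿Qué es PropHero exactamente?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 10}
    },
)
for result in response["retrievalResults"]:
    print(result.get("score"), result["content"]["text"][:120])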
Multilingual capabilities
The system effectively handles Spanish and English queries by using FMs that support Spanish language on Amazon Bedrock.
Business impact
The PropHero AI advisor delivered measurable business value:

Enhanced customer engagement – A 90% goal accuracy rate makes sure customers receive relevant, actionable property investment advice. Over 50% of our users (and over 70% of paid users) are actively using the AI advisor.
Operational efficiency – Automated responses to common questions reduced customer service workload by 30%, freeing staff to focus on complex customer needs.
Scalable growth – The serverless architecture automatically scales to handle increasing customer demand without manual intervention.
Cost optimization – Strategic model selection achieved high performance while reducing AI costs by 60% compared to using premium models throughout.
Consumer base expansion – Successful Spanish language support enabled PropHero’s expansion into the Spanish consumer base with localized expertise.

Conclusion
The PropHero AI advisor demonstrates how AWS generative AI services can be used to create intelligent, context-aware conversational agents that deliver real business value. By combining a modular agent architecture with a robust evaluation system, PropHero has created a solution that enhances customer engagement while providing accurate and relevant responses. The comprehensive evaluation pipeline has been particularly valuable, providing clear metrics for measuring conversation quality and guiding ongoing improvements. This approach makes sure the AI advisor will continue to evolve and improve over time. For more information about building multi-agent AI advisors with continuous evaluation, refer to the following resources:

Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases – With Amazon Bedrock Knowledge Bases, you can implement semantic search with chunking strategies
LangGraph – LangGraph can help you build multi-agent workflows
Ragas – Ragas offers comprehensive LLM evaluation metrics, including context relevance, groundedness, and goal accuracy used in this implementation

To learn more about the Generative AI Innovation Center, get in touch with your account team.

About the authors
Adithya Suresh is a Deep Learning Architect at the AWS Generative AI Innovation Center based in Sydney, where he collaborates directly with enterprise customers to design and scale transformational generative AI solutions for complex business challenges. He uses AWS generative AI services to build bespoke AI systems that drive measurable business value across diverse industries.
Lucas Dahan was the Head of Data & AI at PropHero at the time of writing. He leads the technology team that is transforming property investment through innovative digital solutions.
Dil Dolkun is a Data & AI Engineer on PropHero’s tech team, and has been instrumental in designing data architectures and multi-agent workflows for PropHero’s generative AI property investment advisor system.
Mathew Ng is a Technical Lead at PropHero, who architected and scaled PropHero’s cloud-native, high-performance software solution from early-stage startup to successful Series A funding.
Aaron Su is a Solutions Architect at AWS, with a focus across AI and SaaS startups. He helps early-stage companies architect scalable, secure, and cost-effective cloud solutions.

Accelerate benefits claims processing with Amazon Bedrock Data Automat …

In the benefits administration industry, claims processing is a vital operational pillar that makes sure employees and beneficiaries receive timely benefits, such as health, dental, or disability payments, while controlling costs and adhering to regulations like HIPAA and ERISA. Businesses aim to optimize the workflow—covering claim submission, validation, adjudication, payment, and appeals—to enhance employee satisfaction, strengthen provider relationships, and mitigate financial risks. The process includes specific steps like claim submission (through portals or paper), data validation (verifying eligibility and accuracy), adjudication (assessing coverage against plan rules), payment or denial (including check processing for reimbursements), and appeal handling. Efficient claims processing supports competitive benefits offerings, which is crucial for talent retention and employer branding, but requires balancing speed, accuracy, and cost in a highly regulated environment.
Despite its importance, claims processing faces significant challenges in many organizations. Most notably, the reliance on legacy systems and manual processes results in frustratingly slow resolution times, high error rates, and increased administrative costs. Incomplete or inaccurate claim submissions—such as those with missing diagnosis codes or eligibility mismatches—frequently lead to denials and rework, creating frustration for both employees and healthcare providers. Additionally, fraud, waste, and abuse continue to inflate costs, yet detecting these issues without delaying legitimate claims remains challenging. Complex regulatory requirements demand constant system updates, and poor integration between systems—such as Human Resource Information Systems (HRIS) and other downstream systems—severely limits scalability. These issues drive up operational expenses, erode trust in benefits programs, and overburden customer service teams, particularly during appeals processes or peak claims periods.
Generative AI can help address these challenges. With Amazon Bedrock Data Automation, you can automate the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video. In the benefits claims process, Amazon Bedrock Data Automation can automate document processing by extracting and classifying documents from claims packets, policy applications, and supporting documents with industry-leading accuracy, reducing manual errors and accelerating resolution times. Its natural language processing capabilities interpret unstructured data, such as provider notes, supporting compliance with plan rules and regulations. By automating repetitive tasks and providing insights, Amazon Bedrock Data Automation helps reduce administrative burdens, enhance experiences for both employees and providers, and support compliance in a cost-effective manner. Furthermore, its scalable architecture enables seamless integration with existing systems, improving data flow across HRIS, claims systems, and provider networks, and its advanced analytics help detect fraud patterns to optimize cost control.
In this post, we examine the typical benefit claims processing workflow and identify where generative AI-powered automation can deliver the greatest impact.
Benefit claims processing
When an employee or beneficiary pays out of pocket for an expense covered under their health benefits, they submit a claim for reimbursement. This process requires several supporting documents, including doctor’s prescriptions and proof of payment, which might include check images, receipts, or electronic payment confirmations.
The claims processing workflow involves several critical steps:

Document intake and processing – The system receives and categorizes submitted documentation, including:

Medical records and prescriptions
Proof of payment documentation
Supporting forms and eligibility verification

Payment verification processing – For check-based reimbursements, the system must complete the following steps:

Extract information from check images, including the account number and routing number contained in the MICR line
Verify payee and payer names against the information provided during the claim submission process
Confirm payment amounts match the claimed expenses
Flag discrepancies for human review

Adjudication and reimbursement – When verification is complete, the system performs several actions:

Determine eligibility based on plan rules and coverage limits
Calculate appropriate reimbursement amounts
Initiate payment processing through direct deposit or check issuance
Provide notification to the claimant regarding the status of their reimbursement

In this post, we walk through a real-world scenario to make the complexity of this multi-step process clearer. The following example demonstrates how Amazon Bedrock Data Automation can streamline the claims processing workflow, from initial submission to final reimbursement.
Solution overview
Let’s consider a scenario where a benefit plan participant seeks treatment and pays out of pocket for the doctor’s fee using a check. They then buy the medications prescribed by the doctor at the pharmacy store. Later, they log in to their benefit provider’s portal and submit a claim along with the image of the check and payment receipt for the medications.
This solution uses Amazon Bedrock Data Automation to automate the two most critical and time-consuming aspects of this workflow: document intake and payment verification processing. The following diagram illustrates the benefits claims processing architecture.

The end-to-end process works through four integrated stages: ingestion, extraction, validation, and integration.
Ingestion
When a beneficiary uploads supporting documents (check image and pharmacy receipt) through the company’s benefit claims portal, these documents are securely saved in an Amazon Simple Storage Service (Amazon S3) bucket, triggering the automated claims processing pipeline.
Extraction
After documents are ingested, the system immediately begins with intelligent data extraction:

1. The S3 object upload triggers an AWS Lambda function, which invokes the Amazon Bedrock Data Automation project.
2. Amazon Bedrock Data Automation uses blueprints for file processing and extraction. Blueprints are artifacts used to configure file processing business logic by specifying a list of field names for data extraction, along with their desired data formats (string, number, or Boolean) and natural language context for data normalization and validation rules. Amazon Bedrock Data Automation provides a catalog of sample blueprints out of the box. You can create a custom blueprint for your unique document types that aren’t predefined in the catalog. This solution uses two blueprints designed for different document types, as shown in the following screenshot:

The catalog blueprint US-Bank-Check for check processing.
The custom blueprint benefit-claims-pharmacy-receipt-blueprint for pharmacy-specific receipts.

US-Bank-Check is a catalog blueprint provided out of the box by Amazon Bedrock Data Automation. The custom blueprint benefit-claims-pharmacy-receipt-blueprint is created using an AWS CloudFormation template to handle pharmacy receipt processing, addressing a specific document type that wasn’t available in the standard blueprint catalog. The benefit administrator wants to look for vendor-specific information such as name, address, and phone details for benefits claims processing. The custom blueprint schema therefore contains a natural language explanation for fields such as VendorName, VendorAddress, and VendorPhone, describing what each field represents, its expected data type, and its inference type (explained in Creating Blueprints for Extraction), as shown in the following screenshot.

3. The two blueprints are added to the Amazon Bedrock Data Automation project. An Amazon Bedrock Data Automation project is a grouping of both standard and custom blueprints that you can use to process different types of files (like documents, audio, and images) using specific configuration settings, where you can control what kind of information you want to extract from each file type. When the project is invoked asynchronously, it automatically applies the appropriate blueprint, extracts each field along with its confidence score and bounding box details, and saves the results in a separate S3 bucket. This intelligent classification alleviates the need for you to write complex document classification logic.
The following screenshot illustrates the document classification by the standard catalog blueprint US-Bank-Check.

The following screenshot shows the document classification by the custom blueprint benefit-claims-pharmacy-receipt-blueprint.
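For illustration, the Lambda function from step 1 might start the asynchronous Amazon Bedrock Data Automation job with a call like the following minimal sketch. The project ARN, profile ARN, and S3 URIs are placeholders, and error handling and notification configuration are omitted.

import boto3

bda_runtime = boto3.client("bedrock-data-automation-runtime")

response = bda_runtime.invoke_data_automation_async(
    # Placeholder S3 locations for the uploaded claim document and the results
    inputConfiguration={"s3Uri": "s3://claims-intake-bucket/uploads/check-image.png"},
    outputConfiguration={"s3Uri": "s3://claims-results-bucket/bda-output/"},
    dataAutomationConfiguration={
        "dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/EXAMPLE",
        "stage": "LIVE",
    },
    dataAutomationProfileArn="arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/us.data-automation-v1",
)
print(response["invocationArn"])  # track the asynchronous job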

Validation
With the data extracted, the system moves to the validation and decision-making process using the business rules specific to each document type.
The business rules are documented in standard operating procedure documents (AnyCompany Benefit Checks Standard Operating procedure.docx and AnyCompany Benefit Claims Standard Operating procedure.docx) and uploaded to an S3 bucket. Then the system creates a knowledge base for Amazon Bedrock with the S3 bucket as the source, as shown in the following screenshot.

When the extracted Amazon Bedrock Data Automation results are saved to the configured S3 bucket, a Lambda function is triggered automatically. Based on the business rules retrieved from the knowledge base for the specific document type and the extracted Amazon Bedrock Data Automation output, an Amazon Nova Lite large language model (LLM) makes the automated approve/deny decision for claims.
The following screenshot shows the benefit claim adjudication automated decision for US-Bank-Check.

The following screenshot shows the benefit claim adjudication automated decision for benefit-claims-pharmacy-receipt-blueprint.
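The following is a minimal sketch of that validation step: retrieve the relevant business rules from the knowledge base, then ask Amazon Nova Lite for an approve/deny decision on the extracted fields. The knowledge base ID, prompt, and field names are placeholders, not the exact implementation.

import json
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock = boto3.client("bedrock-runtime")

def adjudicate(extracted_fields: dict, document_type: str) -> str:
    # Pull the SOP-based business rules for this document type from the knowledge base
    rules = agent_runtime.retrieve(
        knowledgeBaseId="YOUR_KB_ID",  # placeholder
        retrievalQuery={"text": f"business rules for {document_type}"},
    )
    rules_text = "\n".join(r["content"]["text"] for r in rules["retrievalResults"])

    prompt = (
        f"Business rules:\n{rules_text}\n\n"
        f"Extracted claim data:\n{json.dumps(extracted_fields)}\n\n"
        "Decide APPROVE or DENY and explain briefly."
    )
    result = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return result["output"]["message"]["content"][0]["text"]

print(adjudicate({"PayeeName": "AnyCompany Clinic", "Amount": "120.00"}, "US-Bank-Check"))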

Integration
The system seamlessly integrates with existing business processes.
When validation is complete, an event is pushed to Amazon EventBridge, which triggers a Lambda function for downstream integration. In this implementation, we use an Amazon DynamoDB table and Amazon Simple Notification Service (Amazon SNS) email for downstream integration. A DynamoDB table is created as part of the deployment stack and stores details including the document classification, extracted data, and automated decision. An email notification is sent for both checks and receipts after the system makes the final decision. The following screenshot shows an example email for pharmacy receipt approval.

This flexible architecture helps you integrate with your existing applications through internal APIs or events to update claim status or trigger additional workflows when validation fails.
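A minimal sketch of the downstream integration Lambda function might look like the following; the table name, topic ARN, and event detail fields are placeholders rather than the stack’s actual resource names.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

TABLE_NAME = "BenefitClaimsResults"  # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:claims-notifications"  # placeholder

def handler(event, context):
    # EventBridge delivers the validation result in the event detail
    detail = event["detail"]

    # Persist the document classification, extracted data, and automated decision
    dynamodb.Table(TABLE_NAME).put_item(Item={
        "claimId": detail["claimId"],
        "documentClass": detail["documentClass"],
        "extractedData": json.dumps(detail["extractedData"]),
        "decision": detail["decision"],
    })

    # Notify the claimant or administrator of the final decision
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Benefit claim {detail['decision']}",
        Message=json.dumps(detail, indent=2),
    )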
Reducing manual effort through intelligent business rules management
Beyond automating document processing, this solution addresses a common operational challenge: Traditionally, customers must write and maintain code for handling business rules around claims adjudication and processing. Every business rule change requires development effort and code updates, slowing time-to-market and increasing maintenance overhead.
Our approach converts business rules and standard operating procedures (SOPs) into knowledge bases using Amazon Bedrock Knowledge Bases, which you can use for automated decision-making. This approach can dramatically reduce time-to-market when business rules change, because updates can be made through knowledge management rather than code deployment.
In the following sections, we walk you through the steps to deploy the solution to your own AWS account.
Prerequisites
To implement the solution provided in this post, you must have the following:

An AWS account
Access to Amazon Titan Text Embeddings V2 and Amazon Nova Lite foundation models (FMs) enabled in Amazon Bedrock

This solution uses Python 3.13 with Boto3 version 1.38 or later, and the AWS Serverless Application Model Command Line Interface (AWS SAM CLI) version 1.138.0. We assume that you have already installed these on your local machine. If not, refer to the following instructions:

Python 3.13 installation
Install the AWS SAM CLI

Set up code in your local machine
To set up the code, clone the GitHub repository. After you have cloned the repository to your local machine, the project folder structure will look like the following code, as mentioned in the README file:

Deploy the solution in your account
The sample code comes with a CloudFormation template that creates necessary resources. To deploy the solution in your account, follow the deployment instructions in the README file.
Clean up
Deploying this solution in your account will incur costs. Follow the cleanup instructions in the README file to avoid charges when you are done.
Conclusion
Benefits administration companies can significantly enhance their operations by automating claims processing using the solution outlined in this post. This strategic approach directly addresses the industry’s core challenges and can deliver several key advantages:

Enhanced processing efficiency through accelerated claims resolution times, reduced manual error rates, and higher straight-through processing rates that minimize the frustrating delays and manual rework plaguing legacy systems
Streamlined document integration and fraud detection capabilities, where adding new supporting documents becomes seamless through new Amazon Bedrock Data Automation blueprints, while AI-powered analytics identify suspicious patterns without delaying legitimate claims, avoiding traditional months-long development cycles and reducing costly fraud, waste, and abuse
Agile business rule management that enables rapid adaptation to changing HIPAA and ERISA requirements and modification of business rules, significantly reducing administrative costs and time-to-market while improving scalability and integration with existing HRIS and claims systems, ultimately enhancing employee satisfaction, strengthening provider relationships, and supporting competitive benefits offerings that are crucial for talent retention and employer branding

To get started with this solution, refer to the GitHub repo. For more information about Amazon Bedrock Data Automation, refer to Transform unstructured data into meaningful insights using Amazon Bedrock Data Automation and try the Document Processing Using Amazon Bedrock Data Automation workshop.

About the authors
Saurabh Kumar is a Senior Solutions Architect at AWS based out of Raleigh, NC, with expertise in Resilience Engineering, Chaos Engineering, and Generative AI solutions. He advises customers on fault-tolerance strategies and generative AI-driven modernization approaches, helping organizations build robust architectures while leveraging generative AI technologies to drive innovation.
Kiran Lakkireddy is a Principal Solutions Architect at AWS with expertise in Financial Services, Benefits Management and HR Services industries. Kiran provides technology and architecture guidance to customers in their business transformation, with a specialized focus on GenAI security, compliance, and governance. He regularly speaks to customer security leadership on GenAI security, compliance, and governance topics, helping organizations navigate the complex landscape of AI implementation while maintaining robust security standards.
Tamilmanam Sambasivam is a Solutions Architect and AI/ML Specialist at AWS. She helps enterprise customers solve their business problems by recommending the right AWS solutions. Her strong background in information technology (24+ years of experience) helps customers strategize, develop, and modernize their solutions in the AWS Cloud. In her spare time, Tamil likes to travel and garden.