Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

Anthropic has released Claude Sonnet 4.5, setting a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash + file edit), averaged over 10 runs, no test-time compute, 200K “thinking” budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises this to 82.0%.

Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.

Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks — a practical jump over earlier limits and directly relevant to agent reliability.

Reasoning/math. The release notes “substantial gains” across common reasoning and math evals; exact per-benchmark numbers (e.g., the AIME configuration) aren’t broken out here. Safety posture is ASL-3 with strengthened defenses against prompt injection.

https://www.anthropic.com/news/claude-sonnet-4-5

What’s in it for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate “using a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where can you run it?

Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.

AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).

Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.

GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.
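For teams taking the Anthropic API option above, a minimal Messages API call against the new model ID might look like the sketch below. The prompt and token budget are placeholder choices, not Anthropic’s recommended settings; the SDK call itself follows the standard Messages API.

# Minimal sketch: calling Claude Sonnet 4.5 through the Anthropic Messages API.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # model ID cited in the release
    max_tokens=1024,             # placeholder output budget
    messages=[
        {"role": "user", "content": "Summarize the failing tests in this repo and propose a fix plan."}
    ],
)

print(response.content[0].text)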

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is built for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the “best for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.

Introducing Claude Sonnet 4.5—the best coding model in the world. It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains on tests of reasoning and math. pic.twitter.com/7LwV9WPNAv — Claude (@claudeai), September 29, 2025

The post Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results appeared first on MarkTechPost.

Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required

oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and KV-cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10 GB while handling up to ~100K tokens of context.

But what’s new?

Recent updates add: (1) KV-cache reads/writes that bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) FlashAttention-2 support for Llama-3 for stability; and (4) GPT-OSS memory reductions via “flash-attention-like” kernels and chunked MLP. The table published by the maintainer reports end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; noted throughput “≈ 1 tok/2 s”.

GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.

Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How it works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. It uses FlashAttention-2 with online softmax so the full attention matrix is never materialized, and chunks large MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the oLLM project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
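oLLM’s exact kernels aren’t reproduced here, but the chunked-MLP idea is easy to illustrate: process the token dimension in slices so the large intermediate activation never exists in full. The sketch below is a generic PyTorch illustration of that technique under stated assumptions, not oLLM’s implementation.

import torch
import torch.nn.functional as F

def chunked_mlp(x, w_up, w_down, chunk_tokens=2048):
    """Up-projection + GELU + down-projection applied in token chunks.

    x: (seq_len, d_model), w_up: (d_model, d_ff), w_down: (d_ff, d_model).
    Peak memory holds only a (chunk_tokens, d_ff) intermediate instead of
    the full (seq_len, d_ff) activation.
    """
    outputs = []
    for start in range(0, x.shape[0], chunk_tokens):
        chunk = x[start:start + chunk_tokens]
        hidden = F.gelu(chunk @ w_up)      # small intermediate activation
        outputs.append(hidden @ w_down)    # project back down immediately
    return torch.cat(outputs, dim=0)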

Supported models and GPUs

Out of the box the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper; Qwen3-Next requires a dev build of Transformers (≥ 4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total, ~3B active) that vendors typically position for multi-A100/H100 deployments; oLLM’s claim is that you can execute it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. This stands in contrast to vLLM docs, which suggest multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows Inference(…).DiskCache(…) wiring and generate(…) with a streaming text callback. (PyPI currently lists 0.4.1; the README references 0.4.2 changes.)
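As a rough shape of the workflow the README describes, the call pattern could look like the following. The class and argument names are paraphrased from the description above and should be treated as assumptions; check the repo’s README for the exact signatures.

# Hypothetical usage sketch based on the README description above
# (Inference(...).DiskCache(...) wiring plus generate(...) with a streaming
# callback); names and arguments are assumptions, not the verified API.
from ollm import Inference  # assumed import path

llm = Inference("llama3-1B-chat")               # assumed model identifier
llm.DiskCache(cache_dir="/mnt/nvme/kv_cache")   # assumed: KV cache offloaded to NVMe

def on_token(text_chunk):
    print(text_chunk, end="", flush=True)       # stream tokens as they arrive

llm.generate("Summarize this 80K-token contract...", stream_callback=on_token)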

Performance expectations and trade-offs

Throughput: The maintainer reports ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti—usable for batch/offline analytics, not for interactive chat. SSD latency dominates.

Storage pressure: Long contexts require very large KV caches; oLLM writes these to SSD to keep VRAM flat. This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach is still storage-bound and workload-specific.

Hardware reality check: Running Qwen3-Next-80B “on consumer hardware” is feasible with oLLM’s disk-centric design, but typical high-throughput inference for this model still expects multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a drop-in replacement for production serving stacks like vLLM/TGI.

Bottom line

oLLM pushes a clear design point: keep precision high, push memory to SSD, and make ultra-long contexts viable on a single 8 GB NVIDIA GPU. It won’t match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it’s a pragmatic way to execute 8B–20B models comfortably and even step up to MoE-80B if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.

Check out the GitHub repo here.

The post Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required appeared first on MarkTechPost.

How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment?

In this tutorial, we set out to build an advanced interactive dashboard using Dash, Plotly, and Bootstrap. We highlight not only how these tools enable us to design layouts and visualizations, but also how Dash’s callback mechanism links controls to outputs, allowing for real-time responsiveness. By combining local execution with the ability to run in cloud platforms like Google Colab, we explore a workflow that is both flexible and practical. Check out the FULL CODES here.

!pip install dash plotly pandas numpy dash-bootstrap-components

import dash
from dash import dcc, html, Input, Output, callback, dash_table
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import dash_bootstrap_components as dbc

print("Generating sample data...")
np.random.seed(42)

We begin by installing and importing the necessary components, including Dash, Plotly, Pandas, NumPy, and Bootstrap, to set up our dashboard environment. We also initialize random seeds and generate sample data so that we can consistently test the interactive features as we build them. Check out the FULL CODES here.

start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
dates = pd.date_range(start=start_date, end=end_date, freq='D')
stock_names = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA']

all_data = []
base_prices = {'AAPL': 150, 'GOOGL': 120, 'MSFT': 250, 'AMZN': 100, 'TSLA': 200}

for stock in stock_names:
    print(f"Creating data for {stock}...")
    base_price = base_prices[stock]

    n_days = len(dates)
    returns = np.random.normal(0.0005, 0.025, n_days)
    prices = np.zeros(n_days)
    prices[0] = base_price

    for i in range(1, n_days):
        prices[i] = prices[i-1] * (1 + returns[i])

    volumes = np.random.lognormal(15, 0.5, n_days).astype(int)

    stock_df = pd.DataFrame({
        'Date': dates,
        'Stock': stock,
        'Price': prices,
        'Volume': volumes,
        'Returns': np.concatenate([[0], np.diff(prices) / prices[:-1]]),
        'Sector': np.random.choice(['Technology', 'Consumer', 'Automotive'], 1)[0]
    })

    all_data.append(stock_df)

df = pd.concat(all_data, ignore_index=True)

df['Date'] = pd.to_datetime(df['Date'])
df_sorted = df.sort_values(['Stock', 'Date']).reset_index(drop=True)

print("Calculating technical indicators...")
df_sorted['MA_20'] = df_sorted.groupby('Stock')['Price'].transform(lambda x: x.rolling(20, min_periods=1).mean())
df_sorted['Volatility'] = df_sorted.groupby('Stock')['Returns'].transform(lambda x: x.rolling(30, min_periods=1).std())

df = df_sorted.copy()

print(f"Data generated successfully! Shape: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Stocks: {df['Stock'].unique().tolist()}")

We generate synthetic stock data, including prices, volumes, and returns, for multiple tickers across a specified date range. We calculate moving averages and volatility to enrich the dataset with useful technical indicators, providing a strong foundation for building interactive visualizations. Check out the FULL CODES here.

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            html.H1("Advanced Financial Dashboard", className="text-center mb-4"),
            html.P(f"Interactive dashboard with {len(df)} data points across {len(stock_names)} stocks",
                   className="text-center text-muted"),
            html.Hr()
        ])
    ]),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H5("Dashboard Controls", className="card-title"),

                    html.Label("Select Stocks:", className="fw-bold mt-3"),
                    dcc.Dropdown(
                        id='stock-dropdown',
                        options=[{'label': f'{stock} ({base_prices[stock]})', 'value': stock}
                                 for stock in stock_names],
                        value=['AAPL', 'GOOGL'],
                        multi=True,
                        placeholder="Choose stocks to analyze..."
                    ),

                    html.Label("Date Range:", className="fw-bold mt-3"),
                    dcc.DatePickerRange(
                        id='date-picker-range',
                        start_date='2023-06-01',
                        end_date='2024-06-01',
                        display_format='YYYY-MM-DD',
                        style={'width': '100%'}
                    ),

                    html.Label("Chart Style:", className="fw-bold mt-3"),
                    dcc.RadioItems(
                        id='chart-type',
                        options=[
                            {'label': 'Line Chart', 'value': 'line'},
                            {'label': 'Area Chart', 'value': 'area'},
                            {'label': 'Scatter Plot', 'value': 'scatter'}
                        ],
                        value='line',
                        labelStyle={'display': 'block', 'margin': '5px'}
                    ),

                    dbc.Checklist(
                        id='show-ma',
                        options=[{'label': 'Show Moving Average', 'value': 'show'}],
                        value=[],
                        style={'margin': '10px 0'}
                    ),
                ])
            ], className="h-100")
        ], width=3),

        dbc.Col([
            dbc.Card([
                dbc.CardHeader("Stock Price Analysis"),
                dbc.CardBody([
                    dcc.Graph(id='main-chart', style={'height': '450px'})
                ])
            ])
        ], width=9)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="avg-price", className="text-primary mb-0"),
                    html.Small("Average Price", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="total-volume", className="text-success mb-0"),
                    html.Small("Total Volume", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="price-range", className="text-info mb-0"),
                    html.Small("Price Range", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="data-points", className="text-warning mb-0"),
                    html.Small("Data Points", className="text-muted")
                ])
            ])
        ], width=3)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader("Trading Volume"),
                dbc.CardBody([
                    dcc.Graph(id='volume-chart', style={'height': '300px'})
                ])
            ])
        ], width=6),
        dbc.Col([
            dbc.Card([
                dbc.CardHeader("Returns Distribution"),
                dbc.CardBody([
                    dcc.Graph(id='returns-chart', style={'height': '300px'})
                ])
            ])
        ], width=6)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader("Latest Stock Data"),
                dbc.CardBody([
                    dash_table.DataTable(
                        id='data-table',
                        columns=[
                            {'name': 'Stock', 'id': 'Stock'},
                            {'name': 'Date', 'id': 'Date'},
                            {'name': 'Price ($)', 'id': 'Price', 'type': 'numeric',
                             'format': {'specifier': '.2f'}},
                            {'name': 'Volume', 'id': 'Volume', 'type': 'numeric',
                             'format': {'specifier': ',.0f'}},
                            {'name': 'Daily Return (%)', 'id': 'Returns', 'type': 'numeric',
                             'format': {'specifier': '.2%'}}
                        ],
                        style_cell={'textAlign': 'center', 'fontSize': '14px', 'padding': '10px'},
                        style_header={'backgroundColor': 'rgb(230, 230, 230)', 'fontWeight': 'bold'},
                        style_data_conditional=[
                            {
                                'if': {'filter_query': '{Returns} > 0'},
                                'backgroundColor': '#d4edda'
                            },
                            {
                                'if': {'filter_query': '{Returns} < 0'},
                                'backgroundColor': '#f8d7da'
                            }
                        ],
                        page_size=15,
                        sort_action="native",
                        filter_action="native"
                    )
                ])
            ])
        ])
    ])
], fluid=True)

We define the app layout with Bootstrap rows and cards, where we place controls (dropdown, date range, chart style, MA toggle) alongside the main graph. We add metric cards, two secondary graphs, and a sortable/filterable data table, so we organize everything into a responsive, clean interface that we can wire up to callbacks next. Check out the FULL CODES here.

@callback(
    [Output('main-chart', 'figure'),
     Output('volume-chart', 'figure'),
     Output('returns-chart', 'figure'),
     Output('data-table', 'data'),
     Output('avg-price', 'children'),
     Output('total-volume', 'children'),
     Output('price-range', 'children'),
     Output('data-points', 'children')],
    [Input('stock-dropdown', 'value'),
     Input('date-picker-range', 'start_date'),
     Input('date-picker-range', 'end_date'),
     Input('chart-type', 'value'),
     Input('show-ma', 'value')]
)
def update_all_charts(selected_stocks, start_date, end_date, chart_type, show_ma):
    print(f"Callback triggered with stocks: {selected_stocks}")

    if not selected_stocks:
        selected_stocks = ['AAPL']

    filtered_df = df[
        (df['Stock'].isin(selected_stocks)) &
        (df['Date'] >= start_date) &
        (df['Date'] <= end_date)
    ].copy()

    print(f"Filtered data shape: {filtered_df.shape}")

    if filtered_df.empty:
        filtered_df = df[df['Stock'].isin(selected_stocks)].copy()
        print(f"Using all available data. Shape: {filtered_df.shape}")

    if chart_type == 'line':
        main_fig = px.line(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices - {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    elif chart_type == 'area':
        main_fig = px.area(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices - {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    else:
        main_fig = px.scatter(filtered_df, x='Date', y='Price', color='Stock',
                              title=f'Stock Prices - {chart_type.title()} View',
                              labels={'Price': 'Price ($)', 'Date': 'Date'})

    if 'show' in show_ma:
        for stock in selected_stocks:
            stock_data = filtered_df[filtered_df['Stock'] == stock]
            if not stock_data.empty:
                main_fig.add_scatter(
                    x=stock_data['Date'],
                    y=stock_data['MA_20'],
                    mode='lines',
                    name=f'{stock} MA-20',
                    line=dict(dash='dash', width=2)
                )

    main_fig.update_layout(height=450, showlegend=True, hovermode='x unified')

    volume_fig = px.bar(filtered_df, x='Date', y='Volume', color='Stock',
                        title='Daily Trading Volume',
                        labels={'Volume': 'Volume (shares)', 'Date': 'Date'})
    volume_fig.update_layout(height=300, showlegend=True)

    returns_fig = px.histogram(filtered_df.dropna(subset=['Returns']),
                               x='Returns', color='Stock',
                               title='Daily Returns Distribution',
                               labels={'Returns': 'Daily Returns', 'count': 'Frequency'},
                               nbins=50)
    returns_fig.update_layout(height=300, showlegend=True)

    if not filtered_df.empty:
        avg_price = f"${filtered_df['Price'].mean():.2f}"
        total_volume = f"{filtered_df['Volume'].sum():,.0f}"
        price_range = f"${filtered_df['Price'].min():.0f} - ${filtered_df['Price'].max():.0f}"
        data_points = f"{len(filtered_df):,}"

        table_data = filtered_df.nlargest(100, 'Date')[
            ['Stock', 'Date', 'Price', 'Volume', 'Returns']
        ].round(4).to_dict('records')

        for row in table_data:
            row['Date'] = row['Date'].strftime('%Y-%m-%d') if pd.notnull(row['Date']) else ''
    else:
        avg_price = "No data"
        total_volume = "No data"
        price_range = "No data"
        data_points = "0"
        table_data = []

    return (main_fig, volume_fig, returns_fig, table_data,
            avg_price, total_volume, price_range, data_points)

We wire up Dash’s callback to connect our controls to every output, so changing any input instantly updates charts, stats, and the table. We filter the dataframe by selections and dates, build figures (plus optional MA overlays), and compute summary metrics. Finally, we format recent rows for the table so we can inspect the latest results at a glance. Check out the FULL CODES here.

if __name__ == '__main__':
    print("Starting Dash app...")
    print("Available data preview:")
    print(df.head())
    print(f"Total rows: {len(df)}")

    # Dash >= 2.11 exposes the notebook renderer via jupyter_mode/jupyter_height;
    # older JupyterDash-based setups used mode=/height= instead, so adjust for your version.
    app.run(jupyter_mode='inline', port=8050, debug=True, jupyter_height=1000)

    # For a plain local script, the regular server is enough:
    # app.run(debug=True)

We set up the entry point for running the app. We print a quick preview of the dataset to confirm what’s available, and then launch the Dash server. In Colab, we can render the app inline; for local development, we simply switch to the regular app.run(debug=True).

In conclusion, we integrate interactive charts, responsive layouts, and Dash’s callback mechanism into a cohesive application. We see how the callbacks orchestrate communication between user input and dynamic updates, turning static visuals into powerful interactive tools. With the ability to operate smoothly both locally and online, this approach provides a versatile foundation that we can extend for broader applications.

Check out the FULL CODES here.

The post How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment? appeared first on MarkTechPost.

This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead

Can your AI security stack profile, reason, and neutralize a live security threat in ~220 ms—without a central round-trip? A team of researchers from Google and University of Arkansas at Little Rock outline an agentic cybersecurity “immune system” built from lightweight, autonomous sidecar AI agents colocated with workloads (Kubernetes pods, API gateways, edge services). Instead of exporting raw telemetry to a SIEM and waiting on batched classifiers, each agent learns local behavioral baselines, evaluates anomalies using federated intelligence, and applies least-privilege mitigations directly at the point of execution. In a controlled cloud-native simulation, this edge-first loop cut decision-to-mitigation to ~220 ms (≈3.4× faster than centralized pipelines), achieved F1 ≈ 0.89, and held host overhead under 10% CPU/RAM—evidence that collapsing detection and enforcement into the workload plane can deliver both speed and fidelity without material resource penalties.

https://arxiv.org/abs/2509.20640

What does “Profile → Reason → Neutralize” mean at the primitive level?

Profile. Agents are deployed as sidecars/daemonsets alongside microservices and API gateways. They build behavioral fingerprints from execution traces, syscall paths, API call sequences, and inter-service flows. This local baseline adapts to short-lived pods, rolling deploys, and autoscaling—conditions that routinely break perimeter controls and static allowlists. Profiling is not just a threshold on counts; it retains structural features (order, timing, peer set) that allow detection of zero-day-like deviations. The research team frames this as continuous, context-aware baselining across ingestion and sensing layers so that “normal” is learned per workload and per identity boundary.

Reason. When an anomaly appears (for example, an unusual burst of high-entropy uploads from a low-trust principal or a never-seen-before API call graph), the local agent mixes anomaly scores with federated intelligence—shared indicators and model deltas learned by peers—to produce a risk estimate. Reasoning is designed to be edge-first: the agent decides without a round-trip to a central adjudicator, and the trust decision is continuous rather than a static role gate. This aligns with zero-trust—identity and context are evaluated at each request, not just at session start—and it reduces central bottlenecks that add seconds of latency under load.

Neutralize. If risk exceeds a context-sensitive threshold, the agent executes an immediate local control mapped to least-privilege actions: quarantine the container (pause/isolate), rotate a credential, apply a rate-limit, revoke a token, or tighten a per-route policy. Enforcement is written back to policy stores and logged with a human-readable rationale for audit. The fast path here is the core differentiator: in the reported evaluation, the autonomous path triggers in ~220 ms versus ~540–750 ms for centralized ML or firewall update pipelines, which translates into a ~70% latency reduction and fewer opportunities for lateral movement during the decision window.
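The paper’s agents are not reproduced here, but the profile → reason → neutralize loop can be summarized as a small control loop. The sketch below is an illustrative Python skeleton under assumed signal names, mixing weights, and thresholds; it is not the authors’ implementation, and the mitigation call stands in for whatever least-privilege primitives your runtime exposes.

import time

ANOMALY_WEIGHT, FEDERATED_WEIGHT = 0.7, 0.3   # assumed mixing weights
RISK_THRESHOLD = 0.8                          # assumed; context-sensitive in the paper

def anomaly_score(event, baseline):
    """Profile: compare an observed call sequence against the learned local baseline."""
    unseen = [c for c in event["call_sequence"] if c not in baseline["known_calls"]]
    return min(1.0, len(unseen) / max(1, len(event["call_sequence"])))

def risk(event, baseline, federated_indicators):
    """Reason: blend the local anomaly score with shared indicators from peer agents."""
    local = anomaly_score(event, baseline)
    shared = 1.0 if event["principal"] in federated_indicators else 0.0
    return ANOMALY_WEIGHT * local + FEDERATED_WEIGHT * shared

def neutralize(event, mitigate, audit_log):
    """Neutralize: apply a least-privilege action locally and log a human-readable rationale."""
    action = {"type": "rate_limit", "target": event["principal"], "rps": 1}
    mitigate(action)  # e.g., write a per-route policy or revoke a token
    audit_log.append({"ts": time.time(), "action": action,
                      "reason": "risk above threshold", "event": event})

def agent_step(event, baseline, federated_indicators, mitigate, audit_log):
    if risk(event, baseline, federated_indicators) >= RISK_THRESHOLD:
        neutralize(event, mitigate, audit_log)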

Where do the numbers come from, and what were the baselines?

The research team evaluated the architecture in a Kubernetes-native simulation spanning API abuse and lateral-movement scenarios. Against two typical baselines—(i) static rule pipelines and (ii) a batch-trained classifier—the agentic approach reports Precision 0.91 / Recall 0.87 / F1 0.89, while the baselines land near F1 0.64 (rules) and F1 0.79 (baseline ML). Decision latency falls to ~220 ms for local enforcement, compared with ~540–750 ms for centralized paths that require coordination with a controller or external firewall. Resource overhead on host services remains below 10% in CPU/RAM.

https://arxiv.org/abs/2509.20640

Why does this matter for zero-trust engineering, not just research graphs?

Zero-trust (ZT) calls for continuous verification at request-time using identity, device, and context. In practice, many ZT deployments still defer to central policy evaluators, so they inherit control-plane latency and queueing pathologies under load. By moving risk inference and enforcement to the autonomous edge, the architecture turns ZT posture from periodic policy pulls into a set of self-contained, continuously learning controllers that execute least-privilege changes locally and then synchronize state. That design simultaneously reduces mean time-to-contain (MTTC) and keeps decisions near the blast radius, which helps when inter-pod hops are measured in milliseconds. The research team also formalizes federated sharing to distribute indicators/model deltas without heavy raw-data movement, which is relevant for privacy boundaries and multi-tenant SaaS.

How does it integrate with existing stacks—Kubernetes, APIs, and identity?

Operationally, the agents are co-located with workloads (sidecar or node daemon). On Kubernetes, they can hook CNI-level telemetry for flow features, container runtime events for process-level signals, and envoy/nginx spans at API gateways for request graphs. For identity, they consume claims from your IdP and compute continuous trust scores that factor recent behavior and environment (e.g., geo-risk, device posture). Mitigations are expressed as idempotent primitives—network micro-policy updates, token revocation, per-route quotas—so they are straightforward to roll back or tighten incrementally. The architecture’s control loop (sense → reason → act → learn) is strictly feedback-driven and supports both human-in-the-loop (policy windows, approval gates for high-blast-radius changes) and autonomy for low-impact actions.

What are the governance and safety guardrails?

Speed without auditability is a non-starter in regulated environments. The research team emphasizes explainable decision logs that capture which signals and thresholds led to the action, with signed and versioned policy/model artifacts. It also discusses privacy-preserving modes—keeping sensitive data local while sharing model updates; differentially private updates are mentioned as an option in stricter regimes. For safety, the system supports override/rollback and staged rollouts (e.g., canarying new mitigation templates in non-critical namespaces). This is consistent with broader security work on threats and guardrails for agentic systems; if your org is adopting multi-agent pipelines, cross-check against current threat models for agent autonomy and tool use.

How do the reported results translate to production posture?

The evaluation is a 72-hour cloud-native simulation with injected behaviors: API misuse patterns, lateral movement, and zero-day-like deviations. Real systems will add messier signals (e.g., noisy sidecars, multi-cluster networking, mixed CNI plugins), which affects both detection and enforcement timing. That said, the fast-path structure—local decision + local act—is topology-agnostic and should preserve order-of-magnitude latency gains so long as mitigations are mapped to primitives available in your mesh/runtime. For production, begin with observe-only agents to build baselines, then turn on mitigations for low-risk actions (quota clamps, token revokes), then gate high-blast-radius controls (network slicing, container quarantine) behind policy windows until confidence/coverage metrics are green.

How does this sit in the broader agentic-security landscape?

There is growing research on securing agent systems and using agent workflows for security tasks. The research team discussed here is about defense via agent autonomy close to workloads. In parallel, other work tackles threat modeling for agentic AI, secure A2A protocol usage, and agentic vulnerability testing. If you adopt the architecture, pair it with a current agent-security threat model and a test harness that exercises tool-use boundaries and memory safety of agents.

Comparative Results (Kubernetes simulation)

Metric | Static rules pipeline | Baseline ML (batch classifier) | Agentic framework (edge autonomy)
Precision | 0.71 | 0.83 | 0.91
Recall | 0.58 | 0.76 | 0.87
F1 | 0.64 | 0.79 | 0.89
Decision-to-mitigation latency | ~750 ms | ~540 ms | ~220 ms
Host overhead (CPU/RAM) | Moderate | Moderate | <10%

Key Takeaways

Edge-first “cybersecurity immune system.” Lightweight sidecar/daemon AI agents colocated with workloads (Kubernetes pods, API gateways) learn behavioral fingerprints, decide locally, and enforce least-privilege mitigations without SIEM round-trips.

Measured performance. Reported decision-to-mitigation is ~220 ms—about 3.4× faster than centralized pipelines (≈540–750 ms)—with F1 ≈ 0.89 (P≈0.91, R≈0.87) in a Kubernetes simulation.

Low operational cost. Host overhead remains <10% CPU/RAM, making the approach practical for microservices and edge nodes.

Profile → Reason → Neutralize loop. Agents continuously baseline normal activity (profile), fuse local signals with federated intelligence for risk scoring (reason), and apply immediate, reversible controls such as container quarantine, token rotation, and rate-limits (neutralize).

Zero-trust alignment. Decisions are continuous and context-aware (identity, device, geo, workload), replacing static role gates and reducing dwell time and lateral movement risk.

Governance and safety. Actions are logged with explainable rationales; policies/models are signed and versioned; high-blast-radius mitigations can be gated behind human-in-the-loop and staged rollouts.

Summary

Treat defense as a distributed control plane made of profiling, reasoning, and neutralizing agents that act where the threat lives. The reported profile—~220 ms actions, ≈ 3.4× faster than centralized baselines, F1 ≈ 0.89, <10% overhead—is consistent with what you’d expect when you eliminate central hops and let autonomy handle least-privilege mitigations locally. It aligns with zero-trust’s continuous verification and gives teams a practical path to self-stabilizing operations: learn normal, flag deviations with federated context, and contain early—before lateral movement outpaces your control plane.

Check out the Paper and GitHub Page.

The post This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead appeared first on MarkTechPost.

Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World

Can a single AI stack plan like a researcher, reason over scenes, and transfer motions across different robots—without retraining from scratch? Google DeepMind’s Gemini Robotics 1.5 says yes, by splitting embodied intelligence into two models: Gemini Robotics-ER 1.5 for high-level embodied reasoning (spatial understanding, planning, progress/success estimation, tool-use) and Gemini Robotics 1.5 for low-level visuomotor control. The system targets long-horizon, real-world tasks (e.g., multi-step packing, waste sorting with local rules) and introduces motion transfer to reuse data across heterogeneous platforms.

https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/

What actually is the stack?

Gemini Robotics-ER 1.5 (reasoner/orchestrator): A multimodal planner that ingests images/video (and optionally audio), grounds references via 2D points, tracks progress, and invokes external tools (e.g., web search or local APIs) to fetch constraints before issuing sub-goals. It’s available via the Gemini API in Google AI Studio.

Gemini Robotics 1.5 (VLA controller): A vision-language-action model that converts instructions and percepts into motor commands, producing explicit “think-before-act” traces to decompose long tasks into short-horizon skills. Availability is limited to selected partners during the initial rollout.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
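Since ER 1.5 is exposed through the Gemini API (as noted above), a grounding-style request can be sketched with the google-genai client. The model string below is an assumed placeholder for the preview ID, and the image/prompt handling is illustrative rather than DeepMind’s reference code.

# Hedged sketch: asking the embodied-reasoning model to ground objects as 2D points.
# Assumes `pip install google-genai pillow` and GEMINI_API_KEY in the environment;
# MODEL_ID is an assumed preview string -- confirm the exact ID in Google AI Studio.
from google import genai
from PIL import Image

MODEL_ID = "gemini-robotics-er-1.5-preview"  # assumption, not a verified string

client = genai.Client()
scene = Image.open("workbench.jpg")          # placeholder image of the robot workspace

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[scene, "Return 2D points (normalized x, y) for every graspable object, as JSON."],
)
print(response.text)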

Why split cognition from control?

Earlier end-to-end VLAs (Vision-Language-Action) struggle to plan robustly, verify success, and generalize across embodiments. Gemini Robotics 1.5 isolates those concerns: Gemini Robotics-ER 1.5 handles deliberation (scene reasoning, sub-goaling, success detection), while the VLA specializes in execution (closed-loop visuomotor control). This modularity improves interpretability (visible internal traces), error recovery, and long-horizon reliability.

Motion Transfer across embodiments

A core contribution is Motion Transfer (MT): training the VLA on a unified motion representation built from heterogeneous robot data—ALOHA, bi-arm Franka, and Apptronik Apollo—so skills learned on one platform can zero-shot transfer to another. This reduces per-robot data collection and narrows sim-to-real gaps by reusing cross-embodiment priors.

Quantitative signals

The research team showcased controlled A/B comparisons on real hardware and aligned MuJoCo scenes. This includes:

Generalization: Robotics 1.5 surpasses prior Gemini Robotics baselines in instruction following, action generalization, visual generalization, and task generalization across the three platforms.

Zero-shot cross-robot skills: MT yields measurable gains in progress and success when transferring skills across embodiments (e.g., Franka→ALOHA, ALOHA→Apollo), rather than merely improving partial progress.

“Thinking” improves acting: Enabling VLA thought traces increases long-horizon task completion and stabilizes mid-rollout plan revisions.

End-to-end agent gains: Pairing Gemini Robotics-ER 1.5 with the VLA agent substantially improves progress on multi-step tasks (e.g., desk organization, cooking-style sequences) versus a Gemini-2.5-Flash-based baseline orchestrator.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf

Safety and evaluation

DeepMind research team highlights layered controls: policy-aligned dialog/planning, safety-aware grounding (e.g., not pointing to hazardous objects), low-level physical limits, and expanded evaluation suites (e.g., ASIMOV/ASIMOV-style scenario testing and auto red-teaming to elicit edge-case failures). The goal is to catch hallucinated affordances or nonexistent objects before actuation.

Competitive/industry context

Gemini Robotics 1.5 is a shift from “single-instruction” robotics toward agentic, multi-step autonomy with explicit web/tool use and cross-platform learning, a capability set relevant to consumer and industrial robotics. Early partner access centers on established robotics vendors and humanoid platforms.

Key Takeaways

Two-model architecture (ER ↔ VLA): Gemini Robotics-ER 1.5 handles embodied reasoning—spatial grounding, planning, success/progress estimation, tool calls—while Robotics 1.5 is the vision-language-action executor that issues motor commands.

“Think-before-act” control: The VLA produces explicit intermediate reasoning/traces during execution, improving long-horizon decomposition and mid-task adaptation.

Motion Transfer across embodiments: A single VLA checkpoint reuses skills across heterogeneous robots (ALOHA, bi-arm Franka, Apptronik Apollo), enabling zero-/few-shot cross-robot execution rather than per-platform retraining.

Tool-augmented planning: ER 1.5 can invoke external tools (e.g., web search) to fetch constraints, then condition plans—e.g., packing after checking local weather or applying city-specific recycling rules.

Quantified improvements over prior baselines: The tech report documents higher instruction/action/visual/task generalization and better progress/success on real hardware and aligned simulators; results cover cross-embodiment transfers and long-horizon tasks.

Availability and access: ER 1.5 is available via the Gemini API (Google AI Studio) with docs, examples, and preview knobs; Robotics 1.5 (VLA) is limited to select partners with a public waitlist.

Safety & evaluation posture: DeepMind highlights layered safeguards (policy-aligned planning, safety-aware grounding, physical limits) and an upgraded ASIMOV benchmark plus adversarial evaluations to probe risky behaviors and hallucinated affordances.

Summary

Gemini Robotics 1.5 operationalizes a clean separation of embodied reasoning and control, adds motion transfer to recycle data across robots, and showcases the reasoning surface (point grounding, progress/success estimation, tool calls) to developers via the Gemini API. For teams building real-world agents, the design reduces per-platform data burden and strengthens long-horizon reliability—while keeping safety in scope with dedicated test suites and guardrails.

Check out the Paper and Technical details.

The post Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World appeared first on MarkTechPost.

Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared

Local LLMs matured fast in 2025: open-weight families like Llama 3.1 (128K context length (ctx)), Qwen3 (Apache-2.0, dense + MoE), Gemma 2 (9B/27B, 8K ctx), Mixtral 8×7B (Apache-2.0 SMoE), and Phi-4-mini (3.8B, 128K ctx) now ship reliable specs and first-class local runners (GGUF/llama.cpp, LM Studio, Ollama), making on-prem and even laptop inference practical if you match context length and quantization to VRAM. This guide lists the ten most deployable options by license clarity, stable GGUF availability, and reproducible performance characteristics (params, context length (ctx), quant presets).
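Matching quantization and context to VRAM is mostly arithmetic: weight bytes scale with parameter count and bits-per-weight, while KV-cache bytes scale with layers, KV heads, head size, and context length. The sketch below is a rough back-of-envelope estimator under those assumptions; it ignores runtime overhead, activation memory, and model-specific cache layouts, so treat the output as ballpark only. The example values are illustrative, not pulled from any one model card.

def estimate_vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, context_len, kv_bytes=2):
    """Very rough VRAM estimate: quantized weights + FP16 KV cache.

    params_b: parameters in billions; bits_per_weight: roughly 4.5 for Q4_K_M,
    roughly 6.5 for Q6_K. kv_bytes=2 assumes FP16 K/V entries.
    """
    weight_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
    return weight_gb + kv_gb

# Example: an 8B-class model at ~Q4_K_M with 32 layers, 8 KV heads,
# head_dim 128, and a 32K-token context window.
print(round(estimate_vram_gb(8, 4.5, 32, 8, 128, 32_768), 1), "GB (ballpark)")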

Top 10 Local LLMs (2025)

1) Meta Llama 3.1-8B — robust “daily driver,” 128K context

Why it matters. A stable, multilingual baseline with long context and first-class support across local toolchains.
Specs. Dense 8B decoder-only; official 128K context; instruction-tuned and base variants. Llama license (open weights). Common GGUF builds and Ollama recipes exist. Typical setup: Q4_K_M/Q5_K_M for ≤12-16 GB VRAM, Q6_K for ≥24 GB.

2) Meta Llama 3.2-1B/3B — edge-class, 128K context, on-device friendly

Why it matters. Small models that still take 128K tokens and run acceptably on CPUs/iGPUs when quantized; good for laptops and mini-PCs.
Specs. 1B/3B instruction-tuned models; 128K context confirmed by Meta. Works well via llama.cpp GGUF and LM Studio’s multi-runtime stack (CPU/CUDA/Vulkan/Metal/ROCm).

3) Qwen3-14B / 32B — open Apache-2.0, strong tool-use & multilingual

Why it matters. Broad family (dense+MoE) under Apache-2.0 with active community ports to GGUF; widely reported as a capable general/agentic “daily driver” locally.
Specs. 14B/32B dense checkpoints with long-context variants; modern tokenizer; rapid ecosystem updates. Start at Q4_K_M for 14B on 12 GB; move to Q5/Q6 when you have 24 GB+. (Qwen)

4) DeepSeek-R1-Distill-Qwen-7B — compact reasoning that fits

Why it matters. Distilled from R1-style reasoning traces; delivers step-by-step quality at 7B with widely available GGUFs. Excellent for math/coding on modest VRAM.
Specs. 7B dense; long-context variants exist per conversion; curated GGUFs cover F32→Q4_K_M. For 8–12 GB VRAM try Q4_K_M; for 16–24 GB use Q5/Q6.

5) Google Gemma 2-9B / 27B — efficient dense; 8K context (explicit)

Why it matters. Strong quality-for-size and quantization behavior; 9B is a great mid-range local model.
Specs. Dense 9B/27B; 8K context (don’t overstate); open weights under Gemma terms; widely packaged for llama.cpp/Ollama. 9B@Q4_K_M runs on many 12 GB cards.

6) Mixtral 8×7B (SMoE) — Apache-2.0 sparse MoE; cost/perf workhorse

Why it matters. Mixture-of-Experts throughput benefits at inference: ~2 experts/token selected at runtime; a great compromise when you have ≥24–48 GB VRAM (or multi-GPU) and want stronger general performance.
Specs. 8 experts of 7B each (sparse activation); Apache-2.0; instruct/base variants; mature GGUF conversions and Ollama recipes.

7) Microsoft Phi-4-mini-3.8B — small model, 128K context

Why it matters. Realistic “small-footprint reasoning” with 128K context and grouped-query attention; solid for CPU/iGPU boxes and latency-sensitive tools.
Specs. 3.8B dense; 200k vocab; SFT/DPO alignment; model card documents 128K context and training profile. Use Q4_K_M on ≤8–12 GB VRAM.

8) Microsoft Phi-4-Reasoning-14B — mid-size reasoning (check ctx per build)

Why it matters. A 14B reasoning-tuned variant that is materially better for chain-of-thought-style tasks than generic 13–15B baselines.
Specs. Dense 14B; context varies by distribution (the model card for a common release lists 32K). For 24 GB VRAM, Q5_K_M/Q6_K is comfortable; mixed-precision runners (non-GGUF) need more.

9) Yi-1.5-9B / 34B — Apache-2.0 bilingual; 4K/16K/32K variants

Why it matters. Competitive EN/zh performance and a permissive license; 9B is a strong alternative to Gemma-2-9B; 34B steps toward higher reasoning under Apache-2.0.
Specs. Dense; context variants 4K/16K/32K; open weights under Apache-2.0 with active HF cards/repos. For 9B use Q4/Q5 on 12–16 GB.

10) InternLM 2 / 2.5-7B / 20B — research-friendly; math-tuned branches

Why it matters. An open series with a lively research cadence; 7B is a practical local target; 20B moves you toward Gemma-2-27B-class capability (at higher VRAM).
Specs. Dense 7B/20B; multiple chat/base/math variants; active HF presence. GGUF conversions and Ollama packs are common.

source: marktechpost.com

Summary

In local LLMs, the trade-offs are clear: pick dense models for predictable latency and simpler quantization (e.g., Llama 3.1-8B with a documented 128K context; Gemma 2-9B/27B with an explicit 8K window), move to sparse MoE like Mixtral 8×7B when your VRAM and parallelism justify higher throughput per cost, and treat small reasoning models (Phi-4-mini-3.8B, 128K) as the sweet spot for CPU/iGPU boxes. Licenses and ecosystems matter as much as raw scores: Qwen3’s Apache-2.0 releases (dense + MoE) and Meta/Google/Microsoft model cards give the operational guardrails (context, tokenizer, usage terms) you’ll actually live with. On the runtime side, standardize on GGUF/llama.cpp for portability, layer Ollama/LM Studio for convenience and hardware offload, and size quantization (Q4→Q6) to your memory budget. In short: choose by context + license + hardware path, not just leaderboard vibes.

The post Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared appeared first on MarkTechPost.

The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens

Google released updated versions of the Gemini 2.5 Flash and Gemini 2.5 Flash-Lite preview models across AI Studio and Vertex AI, plus rolling aliases—gemini-flash-latest and gemini-flash-lite-latest—that always point to the newest preview in each family. For production stability, Google advises pinning fixed strings (gemini-2.5-flash, gemini-2.5-flash-lite). Google will give a two-week email notice before retargeting a -latest alias, and notes that rate limits, features, and cost may vary across alias updates.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

What actually changed?

Flash: Improved agentic tool use and more efficient “thinking” (multi-pass reasoning). Google reports a +5 point lift on SWE-Bench Verified vs. the May preview (48.9% → 54.0%), indicating better long-horizon planning/code navigation.

Flash-Lite: Tuned for stricter instruction following, reduced verbosity, and stronger multimodal/translation. Google’s internal chart shows ~50% fewer output tokens for Flash-Lite and ~24% fewer for Flash, which directly cuts output-token spend and wall-clock time in throughput-bound services.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

Independent Stats from the community thread

Artificial Analysis (the account behind the AI benchmarking site) received pre-release access and published external measurements across intelligence and speed. Highlights from the thread and companion pages:

Throughput: In endpoint tests, Gemini 2.5 Flash-Lite (Preview 09-2025, reasoning) is reported as the fastest proprietary model they track, around ~887 output tokens/s on AI Studio in their setup.

Intelligence index deltas: The September previews for Flash and Flash-Lite improve on Artificial Analysis’ aggregate “intelligence” scores compared with prior stable releases (site pages break down reasoning vs. non-reasoning tracks and blended price assumptions).

Token efficiency: The thread reiterates Google’s own reduction claims (−24% Flash, −50% Flash-Lite) and frames the win as cost-per-success improvements for tight latency budgets.

Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors. Key takeaways from our intelligence… pic.twitter.com/ybzKvZBH5A — Artificial Analysis (@ArtificialAnlys), September 25, 2025

Cost surface and context budgets (for deployment choices)

Flash-Lite GA list price is $0.10 / 1M input tokens and $0.40 / 1M output tokens (Google’s July GA post and DeepMind’s model page). That baseline is where verbosity reductions translate to immediate savings.

Context: Flash-Lite supports ~1M-token context with configurable “thinking budgets” and tool connectivity (Search grounding, code execution)—useful for agent stacks that interleave reading, planning, and multi-tool calls.

Browser-agent angle and the o3 claim

A circulating claim says the “new Gemini Flash has o3-level accuracy, but is 2× faster and 4× cheaper on browser-agent tasks.” This is community-reported, not in Google’s official post. It likely traces to private/limited task suites (DOM navigation, action planning) with specific tool budgets and timeouts. Use it as a hypothesis for your own evals; don’t treat it as a cross-bench truth.

This is insane! The new Gemini Flash model released yesterday has the same accuracy as o3, but it is 2x faster and 4x cheaper for browser agent tasks. I ran evaluations the whole day and could not believe this. The previous gemini-2.5-flash had only 71% on this benchmark. https://t.co/KdgkuAK30W pic.twitter.com/F69BiZHiwD — Magnus Müller (@mamagnus00), September 26, 2025

Practical guidance for teams

Pin vs. chase -latest: If you depend on strict SLAs or fixed limits, pin the stable strings. If you continuously canary for cost/latency/quality, the -latest aliases reduce upgrade friction (Google provides two weeks’ notice before switching the pointer).

High-QPS or token-metered endpoints: Start with Flash-Lite preview; the verbosity and instruction-following upgrades shrink egress tokens. Validate multimodal and long-context traces under production load.

Agent/tool pipelines: A/B Flash preview where multi-step tool use dominates cost or failure modes; Google’s SWE-Bench Verified lift and community tokens/s figures suggest better planning under constrained thinking budgets.

Model strings (current)

Previews: gemini-2.5-flash-preview-09-2025, gemini-2.5-flash-lite-preview-09-2025

Stable: gemini-2.5-flash, gemini-2.5-flash-lite

Rolling aliases: gemini-flash-latest, gemini-flash-lite-latest (pointer semantics; may change features/limits/pricing).
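A minimal way to exercise the pin-vs-alias choice from Python is sketched below, using the google-genai client; the prompt is a placeholder, and the model strings are the ones listed above.

# Sketch: comparing a pinned stable model against the rolling alias.
# Assumes `pip install google-genai` and GEMINI_API_KEY (or GOOGLE_API_KEY) in the environment.
from google import genai

client = genai.Client()  # picks up the API key from the environment

PINNED = "gemini-2.5-flash-lite"       # stable string for production SLAs
ROLLING = "gemini-flash-lite-latest"   # pointer alias; behavior/limits may change over time

prompt = "Extract the invoice number and total from: Invoice #8841, total $129.50"

for model_id in (PINNED, ROLLING):
    response = client.models.generate_content(model=model_id, contents=prompt)
    print(model_id, "->", response.text)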

Summary

Google’s latest release tightens tool-use competence (Flash) and token/latency efficiency (Flash-Lite) and introduces -latest aliases for faster iteration. External benchmarks from Artificial Analysis indicate meaningful throughput and intelligence-index gains for the September 2025 previews, with Flash-Lite now testing as the fastest proprietary model in their harness. Validate on your workload—especially browser-agent stacks—before committing to the aliases in production.

The post The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens appeared first on MarkTechPost.

Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder

Hugging Face (HF) has released Smol2Operator, a reproducible, end-to-end recipe that turns a small vision-language model (VLM) with no prior UI grounding into a GUI-operating, tool-using agent. The release covers data transformation utilities, training scripts, transformed datasets, and the resulting 2.2B-parameter model checkpoint—positioned as a complete blueprint for building GUI agents from scratch rather than a single benchmark result.

But what’s new?

Two-phase post-training over a small VLM: Starting from SmolVLM2-2.2B-Instruct—a model that “initially has no grounding capabilities for GUI tasks”—Smol2Operator first instills perception/grounding, then layers agentic reasoning with supervised fine-tuning (SFT).

Unified action space across heterogeneous sources: A conversion pipeline normalizes disparate GUI action taxonomies (mobile, desktop, web) into a single, consistent function API (e.g., click, type, drag, normalized [0,1] coordinates), enabling coherent training across datasets. An Action Space Converter supports remapping to custom vocabularies.
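The unified-action idea is straightforward to picture: every source dataset’s action gets rewritten into one function vocabulary with coordinates normalized to [0, 1] so they survive image resizing. The snippet below is an illustrative converter, not Hugging Face’s actual Action Space Converter; the action schema shown is an assumption for demonstration.

def to_unified_action(raw_action, image_width, image_height):
    """Map a source-specific GUI action into an assumed unified schema
    with [0, 1]-normalized coordinates (illustrative, not the HF converter)."""
    kind = raw_action["type"]                      # e.g., "tap", "click", "input_text"
    if kind in ("tap", "click"):
        x, y = raw_action["x"], raw_action["y"]    # pixel coordinates in the source data
        return {"name": "click",
                "x": round(x / image_width, 4),
                "y": round(y / image_height, 4)}
    if kind in ("input_text", "type"):
        return {"name": "type", "text": raw_action["text"]}
    raise ValueError(f"unmapped action type: {kind}")

# Example: a 1920x1080 desktop click becomes resolution-independent
print(to_unified_action({"type": "click", "x": 960, "y": 540}, 1920, 1080))
# {'name': 'click', 'x': 0.5, 'y': 0.5}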

But why Smol2Operator?

Most GUI-agent pipelines are blocked by fragmented action schemas and non-portable coordinates. Smol2Operator’s action-space unification and normalized coordinate strategy make datasets interoperable and training stable under image resizing, which is common in VLM preprocessing. This reduces the engineering overhead of assembling multi-source GUI data and lowers the barrier to reproducing agent behavior with small models.

How it works? training stack and data path

Data standardization:

Parse and normalize function calls from source datasets (e.g., AGUVIS stages) into a unified signature set; remove redundant actions; standardize parameter names; convert pixel to normalized coordinates.

Phase 1 (Perception/Grounding):

SFT on the unified action dataset to learn element localization and basic UI affordances, measured on ScreenSpot-v2 (element localization on screenshots).

Phase 2 (Cognition/Agentic reasoning):

Additional SFT to convert grounded perception into step-wise action planning aligned with the unified action API.

The HF team reports a clean performance trajectory on the ScreenSpot-v2 benchmark as grounding is learned, and shows the same training strategy scaling down to a ~460M “nanoVLM,” indicating the method’s portability across model capacities (numbers are presented in the post’s tables).

Scope, limits, and next steps

Not a “SOTA at all costs” push: The HF team frames the work as a process blueprint—owning data conversion → grounding → reasoning—rather than chasing leaderboard peaks.

Evaluation focus: Demonstrations center on ScreenSpot-v2 perception and qualitative end-to-end task videos; broader cross-environment, cross-OS, or long-horizon task benchmarks are future work. The HF team notes potential gains from RL/DPO beyond SFT for on-policy adaptation.

Ecosystem trajectory: ScreenEnv’s roadmap includes wider OS coverage (Android/macOS/Windows), which would increase external validity of trained policies.

Summary

Smol2Operator is a fully open-source, reproducible pipeline that upgrades SmolVLM2-2.2B-Instruct—a VLM with zero GUI grounding—into an agentic GUI coder via a two-phase SFT process. The release standardizes heterogeneous GUI action schemas into a unified API with normalized coordinates, provides transformed AGUVIS-based datasets, publishes training notebooks and preprocessing code, and ships a final checkpoint plus a demo Space. It targets process transparency and portability over leaderboard chasing, and slots into the smolagents runtime with ScreenEnv for evaluation, offering a practical blueprint for teams building small, operator-grade GUI agents.

Check out the Technical details, and Full Collection on HF.

The post Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder appeared first on MarkTechPost.

Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves Programs for Scientific Discovery with Unprecedented Sample-Efficiency

Table of contents
What problem is it actually solving?
Does the sample-efficiency claim hold beyond toy problems?
How does the evolutionary loop look in practice?
What are the concrete results?
How does this compare to AlphaEvolve and related systems?
Summary
FAQs — ShinkaEvolve

Sakana AI has released ShinkaEvolve, an open-sourced framework that uses large language models (LLMs) as mutation operators in an evolutionary loop to evolve programs for scientific and engineering problems—while drastically cutting the number of evaluations needed to reach strong solutions. On the canonical circle-packing benchmark (n=26 in a unit square), ShinkaEvolve reports a new SOTA configuration using ~150 program evaluations, where prior systems typically burned thousands. The project ships under Apache-2.0, with a research report and public code.

https://sakana.ai/shinka-evolve/

What problem is it actually solving?

Most “agentic” code-evolution systems explore by brute force: they mutate code, run it, score it, and repeat—consuming enormous sampling budgets. ShinkaEvolve targets that waste explicitly with three interacting components:

Adaptive parent sampling to balance exploration/exploitation. Parents are drawn from “islands” via fitness- and novelty-aware policies (power-law or weighted by performance and offspring counts) rather than always climbing the current best.

Novelty-based rejection filtering to avoid re-evaluating near-duplicates. Mutable code segments are embedded; if cosine similarity exceeds a threshold, a secondary LLM acts as a “novelty judge” before execution.

Bandit-based LLM ensembling so the system learns which model (e.g., GPT/Gemini/Claude/DeepSeek families) is yielding the biggest relative fitness jumps and routes future mutations accordingly (UCB1-style update on improvement over parent/baseline).
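As a concrete picture of the bandit-based ensembling, a UCB1-style selector over candidate LLMs can be sketched in a few lines; reward here is the relative fitness improvement over the parent, as the description above suggests. This is an illustrative skeleton, not ShinkaEvolve’s code.

import math

class UCB1ModelSelector:
    """Pick which LLM to use as the next mutation operator (illustrative sketch)."""

    def __init__(self, model_names, exploration=1.4):
        self.models = model_names
        self.c = exploration
        self.counts = {m: 0 for m in model_names}
        self.mean_reward = {m: 0.0 for m in model_names}

    def select(self):
        for m in self.models:                  # try every model at least once
            if self.counts[m] == 0:
                return m
        total = sum(self.counts.values())
        return max(self.models, key=lambda m: self.mean_reward[m]
                   + self.c * math.sqrt(math.log(total) / self.counts[m]))

    def update(self, model, child_fitness, parent_fitness):
        reward = child_fitness - parent_fitness   # relative improvement as the reward signal
        self.counts[model] += 1
        n = self.counts[model]
        self.mean_reward[model] += (reward - self.mean_reward[model]) / n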

Does the sample-efficiency claim hold beyond toy problems?

The research team evaluates four distinct domains and shows consistent gains with small budgets:

Circle packing (n=26): reaches an improved configuration in roughly 150 evaluations; the research team also validate with stricter exact-constraint checking.

AIME math reasoning (2024 set): evolves agentic scaffolds that trace out a Pareto frontier of accuracy vs. LLM-call budget, outperforming hand-built baselines under limited query budgets and transferring to other AIME years and LLMs.

Competitive programming (ALE-Bench LITE): starting from ALE-Agent solutions, ShinkaEvolve delivers ~2.3% mean improvement across 10 tasks and pushes one task’s solution from 5th → 2nd in an AtCoder leaderboard counterfactual.

LLM training (Mixture-of-Experts): evolves a new load-balancing loss that improves perplexity and downstream accuracy at multiple regularization strengths vs. the widely-used global-batch LBL.

https://sakana.ai/shinka-evolve/

How does the evolutionary loop look in practice?

ShinkaEvolve maintains an archive of evaluated programs with fitness, public metrics, and textual feedback. For each generation: sample an island and parent(s); construct a mutation context with top-K and random “inspiration” programs; then propose edits via three operators—diff edits, full rewrites, and LLM-guided crossovers—while protecting immutable code regions with explicit markers. Executed candidates update both the archive and the bandit statistics that steer subsequent LLM/model selection. The system periodically produces a meta-scratchpad that summarizes recently successful strategies; those summaries are fed back into prompts to accelerate later generations.
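Condensed to a Python-like skeleton, one generation of that loop might be structured as below; propose_mutation stands in for the LLM call, and the archive, island, and selector pieces are simplified placeholders rather than the framework’s real interfaces (the selector could be the UCB1 sketch shown earlier).

import random

def one_generation(archive, islands, selector, propose_mutation, evaluate, top_k=3):
    """Illustrative skeleton of a ShinkaEvolve-style generation (not the real API)."""
    island = random.choice(islands)                      # sample an island
    parent = max(island, key=lambda p: p["fitness"])     # simplified parent choice
    inspirations = sorted(archive, key=lambda p: p["fitness"], reverse=True)[:top_k]

    model = selector.select()                            # bandit picks the mutation LLM
    child_code = propose_mutation(model, parent, inspirations)  # diff / rewrite / crossover

    fitness, metrics = evaluate(child_code)              # run the candidate program, score it
    child = {"code": child_code, "fitness": fitness, "metrics": metrics}
    archive.append(child)
    island.append(child)
    selector.update(model, fitness, parent["fitness"])   # feed improvement back to the bandit
    return child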

What are the concrete results?

Circle packing: combined structured initialization (e.g., golden-angle patterns), hybrid global–local search (simulated annealing + SLSQP), and escape mechanisms (temperature reheating, ring rotations) discovered by the system—not hand-coded a priori.

AIME scaffolds: three-stage expert ensemble (generation → critical peer review → synthesis) that hits the accuracy/cost sweet spot at ~7 calls while retaining robustness when swapped to different LLM backends.

ALE-Bench: targeted engineering wins (e.g., caching kd-tree subtree stats; “targeted edge moves” toward misclassified items) that push scores without wholesale rewrites.

MoE loss: adds an entropy-modulated under-use penalty to the global-batch objective; empirically reduces miss-routing and improves perplexity/benchmarks as layer routing concentrates.

How does this compare to AlphaEvolve and related systems?

AlphaEvolve demonstrated strong closed-source results but at higher evaluation counts. ShinkaEvolve reproduces and surpasses the circle-packing result with orders-of-magnitude fewer samples and releases all components open-source. The research team also contrasts variants (single-model vs. fixed ensemble vs. bandit ensemble) and ablates parent selection and novelty filtering, showing that each contributes to the observed efficiency.

Summary

ShinkaEvolve is an Apache-2.0 framework for LLM-driven program evolution that cuts evaluations from thousands to hundreds by combining fitness/novelty-aware parent sampling, embedding-plus-LLM novelty rejection, and a UCB1-style adaptive LLM ensemble. It sets a new SOTA on circle packing (~150 evals), finds stronger AIME scaffolds under strict query budgets, improves ALE-Bench solutions (~2.3% mean gain, 5th→2nd on one task), and discovers a new MoE load-balancing loss that improves perplexity and downstream accuracy. Code and report are public.

FAQs — ShinkaEvolve

1) What is ShinkaEvolve? An open-source framework that couples LLM-driven program mutations with evolutionary search to automate algorithm discovery and optimization. Code and report are public.

2) How does it achieve higher sample-efficiency than prior evolutionary systems? Three mechanisms: adaptive parent sampling (explore/exploit balance), novelty-based rejection to avoid duplicate evaluations, and a bandit-based selector that routes mutations to the most promising LLMs.

3) What supports the results? It reaches state-of-the-art circle packing with ~150 evaluations; on AIME-2024 it evolves scaffolds under a 10-query cap per problem; it improves ALE-Bench solutions over strong baselines.

4) Where can I run it and what's the license? The GitHub repo provides a WebUI and examples; ShinkaEvolve is released under Apache-2.0.

Check out the Technical details, Paper and GitHub Page.

The post Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves Programs for Scientific Discovery with Unprecedented Sample-Efficiency appeared first on MarkTechPost.

Google AI Ships a Model Context Protocol (MCP) Server for Data Commons …

Google released a Model Context Protocol (MCP) server for Data Commons, exposing the project’s interconnected public datasets—census, health, climate, economics—through a standards-based interface that agentic systems can query in natural language. The Data Commons MCP Server is available now with quickstarts for Gemini CLI and Google’s Agent Development Kit (ADK).

What was released

An MCP server that lets any MCP-capable client or AI agent discover variables, resolve entities, fetch time series, and generate reports from Data Commons without hand-coding API calls. Google positions it as “from initial discovery to generative reports,” with example prompts spanning exploratory, analytical, and generative workflows.

Developer on-ramps: a PyPI package, a Gemini CLI flow, and an ADK sample/Colab to embed Data Commons queries inside agent pipelines.
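
As a rough illustration of the developer on-ramp, the snippet below connects a generic MCP Python client to a locally launched Data Commons MCP server and lists its tools. The launch command, arguments, and API-key variable are assumptions based on the quickstart description and may differ from the actual release; treat this as a sketch, not the official integration path.

import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command for the PyPI package; check the quickstart for the real entry point.
server = StdioServerParameters(
    command="datacommons-mcp",
    args=["serve", "stdio"],
    env={"DC_API_KEY": os.environ.get("DC_API_KEY", "")},  # hypothetical API-key variable
)

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

asyncio.run(main())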

Why MCP now?

MCP is an open protocol for connecting LLM agents to external tools and data with consistent capabilities (tools, prompts, resources) and transport semantics. By shipping a first-party MCP server, Google makes Data Commons addressable through the same interface that agents already use for other sources, reducing per-integration glue code and enabling registry-based discovery alongside other servers.

What can you do with it?

Exploratory: “What health data do you have for Africa?” → enumerate variables, coverage, and sources.

Analytical: “Compare life expectancy, inequality, and GDP growth for BRICS nations.” → retrieve series, normalize geos, align vintages, and return a table or chart payload.

Generative: “Generate a concise report on income vs. diabetes in US counties.” → fetch measures, compute correlations, include provenance.

Integration surface

Gemini CLI / any MCP client: install the Data Commons MCP package, point the client at the server, and issue NL queries; the client coordinates tool calls behind the scenes.

ADK agents: use Google’s sample agent to compose Data Commons calls with your own tools (e.g., visualization, storage) and return sourced outputs.

Docs entry point: MCP — Query data interactively with an AI agent with links to quickstart and user guide.

Real-world use case

Google highlights ONE Data Agent, built with the Data Commons MCP Server for the ONE Campaign. It lets policy analysts query tens of millions of health-financing datapoints via natural language, visualize results, and export clean datasets for downstream work.

Summary

In short, Google’s Data Commons MCP Server turns a sprawling corpus of public statistics into a first-class, protocol-native data source for agents—reducing custom glue code, preserving provenance, and fitting cleanly into existing MCP clients like Gemini CLI and ADK.

Check out the GitHub Repository and Try it out in Gemini CLI.

The post Google AI Ships a Model Context Protocol (MCP) Server for Data Commons, Giving AI Agents First-Class Access to Public Stats appeared first on MarkTechPost.

Building health care agents using Amazon Bedrock AgentCore

This blog was co-authored with Kuldeep Singh, Head of AI Platform at Innovaccer.
The integration of agentic AI is ushering in a transformative era in health care, marking a significant departure from traditional AI systems. Agentic AI demonstrates autonomous decision-making capabilities and adaptive learning in complex medical environments, enabling it to monitor patient progress, coordinate care teams, and adjust treatment strategies in real time. These intelligent systems are becoming deeply embedded in healthcare operations, from enhancing diagnostic precision through advanced pattern recognition to optimizing clinical workflows and accelerating drug discovery processes. Agentic AI combines proactive problem-solving abilities with real-time adaptability so that healthcare professionals can focus on high-value, patient-centered activities while the AI handles routine tasks and complex data analysis.
Innovaccer, a pioneering healthcare AI company, recently launched Innovaccer Gravity, a new healthcare intelligence platform built using Amazon Bedrock AgentCore and set to revolutionize data integration and AI-driven healthcare transformation. Building on the company's impressive track record (its existing solutions serve more than 1,600 US care locations, manage more than 80 million unified health records, and have generated $1.5B in cost savings), the launch exemplifies how AWS customers are leading the agentic AI evolution by creating intelligent solutions that transform healthcare delivery while delivering significant ROI.
Health care demands precision and accountability. AI agents operating within this domain must handle sensitive patient data securely, adhere to rigorous compliance regulations (like HIPAA), and maintain consistent interoperability across diverse clinical workflows. Standard, generalized protocols fall short when dealing with complex healthcare systems and patient data protection requirements. Healthcare organizations need a robust service to convert their existing APIs into Model Context Protocol (MCP) compatible tools that can scale effectively while providing built-in authentication, authorization, encryption, and comprehensive audit trails. Amazon Bedrock AgentCore Gateway offers health care providers and digital health companies a straightforward and secure way to build, deploy, discover, and connect to tools at scale that they can use to create AI-powered healthcare solutions while maintaining the highest standards of security and compliance.
Problem
Healthcare organizations face significant data silo challenges because of diverse electronic health record (EHR) formats across different systems, often maintaining multiple systems to serve specialized departmental needs and legacy systems. FHIR (Fast Healthcare Interoperability Resources) solves these interoperability challenges by standardizing healthcare data into exchangeable resources (like patient records and lab results), enabling seamless communication between different systems while maintaining security and improving care coordination. However, implementing FHIR presents its own challenges, including technical complexity in integrating with legacy systems and the need for specialized expertise in healthcare informatics and API development.
The implementation of AI agents introduces new layers of complexity, requiring careful design and maintenance of interfaces with existing systems. AI agents need secure access to the FHIR data and other healthcare tools with authentication (both inbound and outbound) and end-to-end encryption. MCP is a standardized communication framework that enables AI systems to seamlessly interact with external tools, data sources, and services through a unified interface. However, the development and scaling of MCP servers require substantial resources and expertise. Hosting these services demands ongoing development time and attention to maintain optimal performance and reliability. As healthcare organizations navigate this complex terrain, addressing these challenges becomes critical for achieving true interoperability and harnessing the full potential of modern healthcare technology.
Deploy, enhance, and monitor AI agents at scale using Amazon Bedrock AgentCore
By using Amazon Bedrock AgentCore, you can deploy and operate highly capable AI agents securely at scale. It offers infrastructure purpose-built for dynamic agent workloads, powerful tools to enhance agents, and essential controls for real-world deployment. Bedrock AgentCore offers a set of composable services; the ones most relevant to the solution in this post are described in the following list. For more information, see the Bedrock AgentCore documentation.

AgentCore Runtime provides a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using any open source framework, protocol, and model. Runtime was built to work for agentic workloads with industry-leading extended runtime support, fast cold starts, true session isolation, built-in identity, and support for multi-modal payloads.
AgentCore Gateway provides a secure way for agents to discover and use tools along with straightforward transformation of APIs, AWS Lambda functions, and existing services into agent-compatible tools. Gateway speeds up custom code development, infrastructure provisioning, and security implementation so developers can focus on building innovative agent applications.
AgentCore Identity provides a secure, scalable agent identity and access management capability that accelerates AI agent development. It is compatible with existing identity providers, avoiding the need to migrate users or rebuild authentication flows.
AgentCore Observability helps developers trace, debug, and monitor agent performance in production through unified operational dashboards, with support for OpenTelemetry-compatible telemetry and detailed visualizations of each step of the agent workflow.

In this solution, we demonstrate how the user (a parent) can interact with a Strands or LangGraph agent in conversational style and get information about the immunization history and schedule of their child, inquire about the available slots, and book appointments. With some changes, AI agents can be made event-driven so that they can automatically send reminders, book appointments, and so on. This reduces the administrative burden on healthcare organizations and the parents who no longer need to keep track of the paperwork or make multiple calls to book appointments.

The workflow for the healthcare appointment booking solution built using Amazon Bedrock AgentCore is as follows:

User interacts with Strands or LangGraph agent: The solution contains both Strands and LangGraph agents. You can also use other frameworks such as AutoGen and CrewAI.
Reasoning LLM from Amazon Bedrock: Claude 3.5 Sonnet large language model (LLM) is used from Amazon Bedrock. The model demonstrates advanced reasoning by grasping nuances and complex instructions, along with strong tool-calling capabilities that allow it to effectively integrate with external applications and services to automate various tasks such as web browsing, calculations, or data interactions.
Tools exposed using AgentCore Gateway: AgentCore Gateway provides secure access to the necessary tools required for the Strands or LangGraph agent using standard MCP clients. In this solution, REST APIs are hosted on Amazon API Gateway and exposed as MCP tools using AgentCore Gateway.
Ingress authentication for AgentCore Gateway: AgentCore Gateway is protected with OAuth 2.0 using Amazon Cognito as the identity provider (see the token-fetch sketch after this list). You can use other OAuth 2.0-compatible identity providers such as Auth0 and Keycloak as needed to fit your use case.
OpenAPI specs converted into tools with AgentCore Gateway: Amazon API Gateway is used as the backend to expose the APIs. By importing the OpenAPI specs, AgentCore Gateway provides an MCP compatible server without additional configuration for tool metadata. The following are the tools used in the solution.

get_patient_emr() – Gets the parent's and child's demographics information.
search_immunization_emr() – Gets the immunization history and schedule for the child.
get_available_slots() – Gets the pediatrician's schedule around the parent's preferred date.
book_appointment() – Books an appointment and returns the confirmation number.

AWS HealthLake as the FHIR server: HealthLake is used to manage patient data related to demographics, immunization history, schedules, and appointments. HealthLake is a HIPAA-eligible service that gives healthcare companies a complete view of individual and population health data, using FHIR API-based transactions to securely store and transform data into a queryable format at petabyte scale and further analyze it with machine learning (ML) models.
Egress authentication from AgentCore Gateway to tools: OAuth 2.0 with Amazon Cognito as the identity provider is used to do the authentication between AgentCore Gateway and the tools used in the solution.
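
For the ingress leg described above, a client typically obtains a JWT from the Cognito token endpoint using the OAuth 2.0 client-credentials grant before calling the gateway. The sketch below is illustrative; the domain, scope, and environment variable names are placeholders rather than values from the sample repository.

import os
import requests

# Placeholder Cognito domain and scope; substitute the values created by the solution's setup scripts.
TOKEN_URL = "https://your-domain.auth.us-east-1.amazoncognito.com/oauth2/token"

resp = requests.post(
    TOKEN_URL,
    data={"grant_type": "client_credentials", "scope": "gateway/invoke"},  # scope name is an assumption
    auth=(os.environ["COGNITO_CLIENT_ID"], os.environ["COGNITO_CLIENT_SECRET"]),
    timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # pass as a Bearer token when calling the AgentCore Gateway MCP endpoint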

Solution setup

Important: The following code example is meant for learning and demonstration purposes only. For production implementations, it is recommended to add required error handling, input validation, logging, and security controls.

The code and instructions to set up and clean up this example solution are available on GitHub. When set up, the solution is targeted towards parents to use for immunization-related appointments.

Customizing the solution
The solution can be customized to extend the same or a different use case through the following mechanisms:

OpenAPI specification: The solution uses a sample OpenAPI specification (named fhir-openapi-spec.yaml) with APIs hosted on API Gateway. The OpenAPI specification can be customized to add more tools or use entirely different tools by editing the YAML file. You must recreate the AgentCore gateway after making changes to the OpenAPI spec.
Agent instructions and LLM: The strands_agent.py or langgraph_agent.py can be modified to make changes to the goal or instructions for the Agent or to work with a different LLM.

Future enhancements
We’re already looking forward and planning future enhancements for this solution.

AgentCore Runtime: Host strands or a LangGraph agent on AgentCore Runtime.
AgentCore Memory: Use AgentCore Memory to preserve session information in short-term (in session) as well as long-term (across sessions) to provide a more personalized experience to the agent users.

Innovaccer’s use case for Bedrock AgentCore
Innovaccer's Gravity platform includes more than 400 connectors to unify data from EHR sources such as Epic, Oracle Cerner, and MEDITECH, more than 20 pre-trained models, 15 pre-built AI agents, 100 FHIR resources, and 60 out-of-the-box solutions with role-based access control, a comprehensive audit trail, end-to-end encryption, and secure personal health information (PHI) handling. They also provide a low-code or no-code interface to build additional AI agents with the tools exposed using Healthcare Model Context Protocol (HMCP) servers.
Innovaccer uses Bedrock AgentCore for the following purposes:

AgentCore Gateway to turn their OpenAPI specifications into HMCP compatible tools without the heavy lifting required to build, secure, or scale MCP servers.
AgentCore Identity to handle the inbound and outbound authentication integrating with Innovaccer- or customer-provided OAuth servers.
AgentCore Runtime to deploy and scale the AI agents with multi-agent collaboration, along with logging, traceability, and the ability to plug in custom guardrails.

Bedrock AgentCore supports enterprise-grade security with encryption in transit and at rest, complete session isolation, audit trails using AWS CloudTrail, and comprehensive controls to help Innovaccer agents operate reliably and securely at scale.
Pricing for Bedrock AgentCore Gateway:
AgentCore Gateway offers a consumption-based pricing model with billing based on API invocations (such as ListTools, InvokeTool and Search API), and indexing of tools. For more information, see the pricing page.
Conclusion
The integration of Amazon Bedrock AgentCore with healthcare systems represents a significant leap forward in the application of AI to improve patient care and streamline healthcare operations. By using the suite of services provided by Bedrock AgentCore, healthcare organizations can deploy sophisticated AI agents that securely interact with existing systems, adhere to strict compliance standards, and scale efficiently.
The solution architecture presented in this post demonstrates the practical application of these technologies, showcasing how AI agents can simplify complex processes such as immunization scheduling and appointment booking. This can reduce administrative burdens on healthcare providers and enhance the patient experience by providing straightforward access to critical health information and services.
As we look to the future, the potential for AI agents in the healthcare industry is vast. From improving diagnostic accuracy to personalizing treatment plans and streamlining clinical workflows, the possibilities are endless. Tools like Amazon Bedrock AgentCore can help healthcare organizations confidently navigate the complexities of implementing AI while maintaining the highest standards of security, compliance, and patient care.
The healthcare industry stands at the cusp of a transformative era, where AI agents will play an increasingly central role in delivering efficient, personalized, and high-quality care. By embracing these technologies and continuing to innovate, we can create a healthcare network that is more responsive, intelligent, and patient-centric than ever before.

About the Authors
Kamal Manchanda is a Senior Solutions Architect at AWS with 17 years of experience in cloud, data, and AI technologies. He works closely with C-level executives and technical teams of AWS customers to drive cloud adoption and digital transformation initiatives. Prior to AWS, he led global teams delivering cloud-centric systems, data-driven applications, and AI/ML solutions across consulting and product organizations. Kamal specializes in translating complex business challenges into scalable, secure solutions that deliver measurable business value.
Kuldeep Singh is AVP and Head of AI Platform at Innovaccer. He leads the work on AI agentic workflow layers for Gravity by Innovaccer, a healthcare intelligence platform designed to unify data, agents, and compliant workflows so health systems can deploy AI at scale. With deep experience in data engineering, AI, and product leadership, Kuldeep focuses on making healthcare more efficient, safe, and patient-centered. He plays a key role in building tools that allow care teams to automate complex, multi-step tasks (like integrating payer or EHR data, orchestrating clinical agents) without heavy engineering. He’s passionate about reducing clinician burnout, improving patient outcomes, and turning pilot projects into enterprise-wide AI solutions.

Build multi-agent site reliability engineering assistants with Amazon …

Site reliability engineers (SREs) face an increasingly complex challenge in modern distributed systems. During production incidents, they must rapidly correlate data from multiple sources—logs, metrics, Kubernetes events, and operational runbooks—to identify root causes and implement solutions. Traditional monitoring tools provide raw data but lack the intelligence to synthesize information across these diverse systems, often leaving SREs to manually piece together the story behind system failures.
With a generative AI solution, SREs can ask their infrastructure questions in natural language. For example, they can ask “Why are the payment-service pods crash looping?” or “What’s causing the API latency spike?” and receive comprehensive, actionable insights that combine infrastructure status, log analysis, performance metrics, and step-by-step remediation procedures. This capability transforms incident response from a manual, time-intensive process into a time-efficient, collaborative investigation.
In this post, we demonstrate how to build a multi-agent SRE assistant using Amazon Bedrock AgentCore, LangGraph, and the Model Context Protocol (MCP). This system deploys specialized AI agents that collaborate to provide the deep, contextual intelligence that modern SRE teams need for effective incident response and infrastructure management. We walk you through the complete implementation, from setting up the demo environment to deploying on Amazon Bedrock AgentCore Runtime for production use.
Solution overview
This solution uses a comprehensive multi-agent architecture that addresses the challenges of modern SRE operations through intelligent automation. The solution consists of four specialized AI agents working together under a supervisor agent to provide comprehensive infrastructure analysis and incident response assistance.
The examples in this post use synthetically generated data from our demo environment. The backend servers simulate realistic Kubernetes clusters, application logs, performance metrics, and operational runbooks. In production deployments, these stub servers would be replaced with connections to your actual infrastructure systems, monitoring services, and documentation repositories.
The architecture demonstrates several key capabilities:

Natural language infrastructure queries – You can ask complex questions about your infrastructure in plain English and receive detailed analysis combining data from multiple sources
Multi-agent collaboration – Specialized agents for Kubernetes, logs, metrics, and operational procedures work together to provide comprehensive insights
Real-time data synthesis – Agents access live infrastructure data through standardized APIs and present correlated findings
Automated runbook execution – Agents retrieve and display step-by-step operational procedures for common incident scenarios
Source attribution – Every finding includes explicit source attribution for verification and audit purposes

The following diagram illustrates the solution architecture.

The architecture demonstrates how the SRE support agent integrates seamlessly with Amazon Bedrock AgentCore components:

Customer interface – Receives alerts about degraded API response times and returns comprehensive agent responses
Amazon Bedrock AgentCore Runtime – Manages the execution environment for the multi-agent SRE solution
SRE support agent – Multi-agent collaboration system that processes incidents and orchestrates responses
Amazon Bedrock AgentCore Gateway – Routes requests to specialized tools through OpenAPI interfaces:

Kubernetes API for getting cluster events
Logs API for analyzing log patterns
Metrics API for analyzing performance trends
Runbooks API for searching operational procedures

Amazon Bedrock AgentCore Memory – Stores and retrieves session context and previous interactions for continuity
Amazon Bedrock AgentCore Identity – Handles authentication for tool access using Amazon Cognito integration
Amazon Bedrock AgentCore Observability – Collects and visualizes agent traces for monitoring and debugging
Amazon Bedrock LLMs – Powers the agent intelligence through Anthropic’s Claude large language models (LLMs)

The multi-agent solution uses a supervisor-agent pattern in which a central orchestrator coordinates four specialized agents, for five agents in total:

Supervisor agent – Analyzes incoming queries and creates investigation plans, routing work to appropriate specialists and aggregating results into comprehensive reports
Kubernetes infrastructure agent – Handles container orchestration and cluster operations, investigating pod failures, deployment issues, resource constraints, and cluster events
Application logs agent – Processes log data to find relevant information, identifies patterns and anomalies, and correlates events across multiple services
Performance metrics agent – Monitors system metrics and identifies performance issues, providing real-time analysis and historical trending
Operational runbooks agent – Provides access to documented procedures, troubleshooting guides, and escalation procedures based on the current situation

Using Amazon Bedrock AgentCore primitives
The solution showcases the power of Amazon Bedrock AgentCore by using multiple core primitives. It supports two providers for Anthropic's LLMs: Amazon Bedrock (Anthropic's Claude 3.7 Sonnet) for AWS-integrated deployments, and the Anthropic API (Anthropic's Claude 4 Sonnet) for direct API access.
The Amazon Bedrock AgentCore Gateway component converts the SRE agent’s backend APIs (Kubernetes, application logs, performance metrics, and operational runbooks) into Model Context Protocol (MCP) tools. This enables agents built with an open-source framework supporting MCP (such as LangGraph in this post) to seamlessly access infrastructure APIs.
Security for the entire solution is provided by Amazon Bedrock AgentCore Identity. It supports ingress authentication for secure access control for agents connecting to the gateway, and egress authentication to manage authentication with backend servers, providing secure API access without hardcoding credentials.
The serverless execution environment for deploying the SRE agent in production is provided by Amazon Bedrock AgentCore Runtime. It automatically scales from zero to handle concurrent incident investigations while maintaining complete session isolation. Amazon Bedrock AgentCore Runtime supports both OAuth and AWS Identity and Access Management (IAM) for agent authentication. Applications that invoke agents must have appropriate IAM permissions and trust policies. For more information, see Identity and access management for Amazon Bedrock AgentCore.
Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent learning assistant that personalizes investigations based on user preferences and historical context. The memory component provides three distinct strategies:

User preferences strategy (/sre/users/{user_id}/preferences) – Stores individual user preferences for investigation style, communication channels, escalation procedures, and report formatting. For example, Alice (a technical SRE) receives detailed systematic analysis with troubleshooting steps, whereas Carol (an executive) receives business-focused summaries with impact analysis.
Infrastructure knowledge strategy (/sre/infrastructure/{user_id}/{session_id}) – Accumulates domain expertise across investigations, enabling agents to learn from past discoveries. When the Kubernetes agent identifies a memory leak pattern, this knowledge becomes available for future investigations, enabling faster root cause identification.
Investigation memory strategy (/sre/investigations/{user_id}/{session_id}) – Maintains historical context of past incidents and their resolutions. This enables the solution to suggest proven remediation approaches and avoid anti-patterns that previously failed.

The memory component demonstrates its value through personalized investigations. When both Alice and Carol investigate “API response times have degraded 3x in the last hour,” they receive identical technical findings but completely different presentations.
Alice receives a technical analysis:

memory_client.retrieve_user_preferences(user_id="Alice")
# Returns: {"investigation_style": "detailed_systematic_analysis", "reports": "technical_exposition_with_troubleshooting_steps"}

Carol receives an executive summary:

memory_client.retrieve_user_preferences(user_id="Carol")
# Returns: {"investigation_style": "business_impact_focused", "reports": "executive_summary_without_technical_details"}

Adding observability to the SRE agent
Adding observability to an SRE agent deployed on Amazon Bedrock AgentCore Runtime is straightforward using the Amazon Bedrock AgentCore Observability primitive. This enables comprehensive monitoring through Amazon CloudWatch with metrics, traces, and logs. Setting up observability requires three steps:

Add the OpenTelemetry packages to your pyproject.toml:

dependencies = [
    # … other dependencies …
    "opentelemetry-instrumentation-langchain",
    "aws-opentelemetry-distro~=0.10.1",
]

Configure observability for your agents to enable metrics in CloudWatch.
Start your container using the opentelemetry-instrument utility to automatically instrument your application.

The following command is added to the Dockerfile for the SRE agent:

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

With observability enabled, you gain visibility into the following:

LLM invocation metrics – Token usage, latency, and model performance across agents
Tool execution traces – Duration and success rates for each MCP tool call
Memory operations – Retrieval patterns and storage efficiency
End-to-end request tracing – Complete request flow from user query to final response

The observability primitive automatically captures these metrics without additional code changes, providing production-grade monitoring capabilities out of the box.
Development to production flow
The SRE agent follows a four-step structured deployment process from local development to production, with detailed procedures documented in Development to Production Flow in the accompanying GitHub repo.

The deployment process maintains consistency across environments: the core agent code (sre_agent/) remains unchanged, and the deployment/ folder contains deployment-specific utilities. The same agent works locally and in production through environment configuration, with Amazon Bedrock AgentCore Gateway providing MCP tools access across different stages of development and deployment.
Implementation walkthrough
In the following section, we focus on how Amazon Bedrock AgentCore Gateway, Memory, and Runtime work together to build this multi-agent collaboration solution and deploy it end-to-end with MCP support and persistent intelligence.
We start by setting up the repository and establishing the local runtime environment with API keys, LLM providers, and demo infrastructure. We then bring core AgentCore components online by creating the gateway for standardized API access, configuring authentication, and establishing tool connectivity. We add intelligence through AgentCore Memory, creating strategies for user preferences and investigation history while loading personas for personalized incident response. Finally, we configure individual agents with specialized tools, integrate memory capabilities, orchestrate collaborative workflows, and deploy to AgentCore Runtime with full observability.
Detailed instructions for each step are provided in the repository:

Use Case Setup Guide – Backend deployment and development setup
Deployment Guide – Production containerization and Amazon Bedrock AgentCore Runtime deployment

Prerequisites
You can find the port forwarding requirements and other setup instructions in the README file’s Prerequisites section.
Convert APIs to MCP tools with Amazon Bedrock AgentCore Gateway
Amazon Bedrock AgentCore Gateway demonstrates the power of protocol standardization by converting existing backend APIs into MCP tools that agent frameworks can consume. This transformation happens seamlessly, requiring only OpenAPI specifications.
Upload OpenAPI specifications
The gateway process begins by uploading your existing API specifications to Amazon Simple Storage Service (Amazon S3). The create_gateway.sh script automatically handles uploading the four API specifications (Kubernetes, Logs, Metrics, and Runbooks) to your configured S3 bucket with proper metadata and content types. These specifications will be used to create API endpoint targets in the gateway.
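For reference, the upload step performed by create_gateway.sh is roughly equivalent to the following boto3 call per specification file; the bucket and key names here are placeholders, not values from the repository.

import boto3

s3 = boto3.client("s3")

# Upload one OpenAPI spec with an explicit content type (bucket/key are placeholders).
s3.upload_file(
    Filename="specs/k8s-api.yaml",
    Bucket="my-sre-agent-openapi-specs",
    Key="openapi/k8s-api.yaml",
    ExtraArgs={"ContentType": "application/yaml"},
)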
Create an identity provider and gateway
Authentication is handled seamlessly through Amazon Bedrock AgentCore Identity. The main.py script creates both the credential provider and gateway:

# Create AgentCore Gateway with JWT authorization
def create_gateway(
    client: Any,
    gateway_name: str,
    role_arn: str,
    discovery_url: str,
    allowed_clients: list = None,
    description: str = "AgentCore Gateway created via SDK",
    search_type: str = "SEMANTIC",
    protocol_version: str = "2025-03-26",
) -> Dict[str, Any]:

    # Build auth config for Cognito
    auth_config = {"customJWTAuthorizer": {"discoveryUrl": discovery_url}}
    if allowed_clients:
        auth_config["customJWTAuthorizer"]["allowedClients"] = allowed_clients

    protocol_configuration = {
        "mcp": {"searchType": search_type, "supportedVersions": [protocol_version]}
    }

    response = client.create_gateway(
        name=gateway_name,
        roleArn=role_arn,
        protocolType="MCP",
        authorizerType="CUSTOM_JWT",
        authorizerConfiguration=auth_config,
        protocolConfiguration=protocol_configuration,
        description=description,
        exceptionLevel='DEBUG'
    )
    return response

Deploy API endpoint targets with credential providers
Each API becomes an MCP target through the gateway. The solution automatically handles credential management:

def create_api_endpoint_target(
    client: Any,
    gateway_id: str,
    s3_uri: str,
    provider_arn: str,
    target_name_prefix: str = "open",
    description: str = "API Endpoint Target for OpenAPI schema",
) -> Dict[str, Any]:

    api_target_config = {"mcp": {"openApiSchema": {"s3": {"uri": s3_uri}}}}

    # API key credential provider configuration
    credential_config = {
        "credentialProviderType": "API_KEY",
        "credentialProvider": {
            "apiKeyCredentialProvider": {
                "providerArn": provider_arn,
                "credentialLocation": "HEADER",
                "credentialParameterName": "X-API-KEY",
            }
        },
    }

    response = client.create_gateway_target(
        gatewayIdentifier=gateway_id,
        name=target_name_prefix,
        description=description,
        targetConfiguration=api_target_config,
        credentialProviderConfigurations=[credential_config],
    )
    return response

Validate MCP tools are ready for agent framework
Post-deployment, Amazon Bedrock AgentCore Gateway provides a standardized /mcp endpoint secured with JWT tokens. Testing the deployment with mcp_cmds.sh reveals the power of this transformation:

Tool summary:
================
Total tools found: 21

Tool names:
• x_amz_bedrock_agentcore_search
• k8s-api___get_cluster_events
• k8s-api___get_deployment_status
• k8s-api___get_node_status
• k8s-api___get_pod_status
• k8s-api___get_resource_usage
• logs-api___analyze_log_patterns
• logs-api___count_log_events
• logs-api___get_error_logs
• logs-api___get_recent_logs
• logs-api___search_logs
• metrics-api___analyze_trends
• metrics-api___get_availability_metrics
• metrics-api___get_error_rates
• metrics-api___get_performance_metrics
• metrics-api___get_resource_metrics
• runbooks-api___get_common_resolutions
• runbooks-api___get_escalation_procedures
• runbooks-api___get_incident_playbook
• runbooks-api___get_troubleshooting_guide
• runbooks-api___search_runbooks

Universal agent framework compatibility
This MCP-standardized gateway can now be configured as a Streamable-HTTP server for MCP clients, including AWS Strands (Amazon's agent development framework), LangGraph (the framework used in our SRE agent implementation), and CrewAI (a multi-agent collaboration framework).
The advantage of this approach is that existing APIs require no modification—only OpenAPI specifications. Amazon Bedrock AgentCore Gateway handles the following:

Protocol translation – Between REST APIs and MCP
Authentication – JWT token validation and credential injection
Security – TLS termination and access control
Standardization – Consistent tool naming and parameter handling

This means you can take existing infrastructure APIs (Kubernetes, monitoring, logging, documentation) and instantly make them available to AI agent frameworks that support MCP—through a single, secure, standardized interface.
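As an illustration of that single interface, a generic MCP client can connect to the gateway's /mcp endpoint over streamable HTTP with a JWT bearer token and enumerate the tools shown above. This is a sketch using the open-source MCP Python SDK; the URL and token environment variables are placeholders, not names used by the sample repository.

import asyncio
import os

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

GATEWAY_MCP_URL = os.environ["GATEWAY_MCP_URL"]    # the gateway's /mcp endpoint (placeholder variable)
ACCESS_TOKEN = os.environ["GATEWAY_ACCESS_TOKEN"]  # JWT issued by the Cognito authorizer

async def main():
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    async with streamablehttp_client(GATEWAY_MCP_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())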
Implement persistent intelligence with Amazon Bedrock AgentCore Memory
Whereas Amazon Bedrock AgentCore Gateway provides seamless API access, Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent, learning assistant. The memory implementation demonstrates how a few lines of code can enable sophisticated personalization and cross-session knowledge retention.
Initialize memory strategies
The SRE agent memory component is built on Amazon Bedrock AgentCore Memory’s event-based model with automatic namespace routing. During initialization, the solution creates three memory strategies with specific namespace patterns:

from sre_agent.memory.client import SREMemoryClient
from sre_agent.memory.strategies import create_memory_strategies

# Initialize memory client
memory_client = SREMemoryClient(
    memory_name="sre_agent_memory",
    region="us-east-1"
)

# Create three specialized memory strategies
strategies = create_memory_strategies()
for strategy in strategies:
    memory_client.create_strategy(strategy)

The three strategies each serve distinct purposes:

User preferences (/sre/users/{user_id}/preferences) – Individual investigation styles and communication preferences
Infrastructure knowledge (/sre/infrastructure/{user_id}/{session_id}) – Domain expertise accumulated across investigations
Investigation summaries (/sre/investigations/{user_id}/{session_id}) – Historical incident patterns and resolutions

Load user personas and preferences
The solution comes preconfigured with user personas that demonstrate personalized investigations. The manage_memories.py script loads these personas:

# Load Alice – Technical SRE Engineer
alice_preferences = {
    "investigation_style": "detailed_systematic_analysis",
    "communication": ["#alice-alerts", "#sre-team"],
    "escalation": {"contact": "alice.manager@company.com", "threshold": "15min"},
    "reports": "technical_exposition_with_troubleshooting_steps",
    "timezone": "UTC"
}

# Load Carol – Executive/Director
carol_preferences = {
    "investigation_style": "business_impact_focused",
    "communication": ["#carol-executive", "#strategic-alerts"],
    "escalation": {"contact": "carol.director@company.com", "threshold": "5min"},
    "reports": "executive_summary_without_technical_details",
    "timezone": "EST"
}

# Store preferences using memory client
memory_client.store_user_preference("Alice", alice_preferences)
memory_client.store_user_preference("Carol", carol_preferences)

Automatic namespace routing in action
The power of Amazon Bedrock AgentCore Memory lies in its automatic namespace routing. When the SRE agent creates events, it only needs to provide the actor_id—Amazon Bedrock AgentCore Memory automatically determines which namespaces the event belongs to:

# During investigation, the supervisor agent stores context
memory_client.create_event(
memory_id=”sre_agent_memory-abc123″,
actor_id=”Alice”, # AgentCore Memory routes this automatically
session_id=”investigation_2025_01_15″,
messages=[(“investigation_started”, “USER”)]
)

# Memory system automatically:
# 1. Checks strategy namespaces <!– “all” is necessary here for technical accuracy –>
# 2. Matches actor_id “Alice” to /sre/users/Alice/preferences
# 3. Stores event in User Preferences Strategy
# 4. Makes event available for future retrievals

Validate the personalized investigation experience
The memory component’s impact becomes clear when both Alice and Carol investigate the same issue. Using identical technical findings, the solution produces completely different presentations of the same underlying content.
Alice’s technical report contains detailed systematic analysis for technical teams:

Technical Investigation Summary

Root Cause: Payment processor memory leak causing OOM kills

Analysis:
– Pod restart frequency increased 300% at 14:23 UTC
– Memory utilization peaked at 8.2GB (80% of container limit)
– JVM garbage collection latency spiked to 2.3s

Next Steps:
1. Implement heap dump analysis (`kubectl exec payment-pod -- jmap`)
2. Review recent code deployments for memory management changes
3. Consider increasing memory limits and implementing graceful shutdown

Carol's executive summary contains business-impact-focused analysis for executive stakeholders:

Business Impact Assessment
Status: CRITICAL – Customer payment processing degraded
Impact: 23% transaction failure rate, $47K revenue at risk
Timeline: Issue detected 14:23 UTC, resolution ETA 45 minutes
Business Actions:
– Customer communication initiated via status page
– Finance team alerted for revenue impact tracking
– Escalating to VP Engineering if not resolved by 15:15 UTC

The memory component enables this personalization while continuously learning from each investigation, building organizational knowledge that improves incident response over time.
Deploy to production with Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore makes it straightforward to deploy existing agents to production. The process involves three key steps: containerizing your agent, deploying to Amazon Bedrock AgentCore Runtime, and invoking the deployed agent.
Containerize your agent
Amazon Bedrock AgentCore Runtime requires ARM64 containers. The following code shows the complete Dockerfile:

# Use uv's ARM64 Python base image
FROM --platform=linux/arm64 ghcr.io/astral-sh/uv:python3.12-bookworm-slim

WORKDIR /app

# Copy uv files
COPY pyproject.toml uv.lock ./

# Install dependencies
RUN uv sync --frozen --no-dev

# Copy SRE agent module
COPY sre_agent/ ./sre_agent/

# Set environment variables
# Note: Set DEBUG=true to enable debug logging and traces
ENV PYTHONPATH="/app" \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8080

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

Existing agents just need a FastAPI wrapper (agent_runtime:app) to become compatible with Amazon Bedrock AgentCore, and we add opentelemetry-instrument to enable observability through Amazon Bedrock AgentCore.
Deploy to Amazon Bedrock AgentCore Runtime
Deploying to Amazon Bedrock AgentCore Runtime is straightforward with the deploy_agent_runtime.py script:

import boto3

# Create AgentCore client
client = boto3.client('bedrock-agentcore', region_name=region)

# Environment variables for your agent
env_vars = {
    'GATEWAY_ACCESS_TOKEN': gateway_access_token,
    'LLM_PROVIDER': llm_provider,
    'ANTHROPIC_API_KEY': anthropic_api_key  # if using Anthropic
}

# Deploy container to AgentCore Runtime
response = client.create_agent_runtime(
    agentRuntimeName=runtime_name,
    agentRuntimeArtifact={
        'containerConfiguration': {
            'containerUri': container_uri  # Your ECR container URI
        }
    },
    networkConfiguration={"networkMode": "PUBLIC"},
    roleArn=role_arn,
    environmentVariables=env_vars
)

print(f"Agent Runtime ARN: {response['agentRuntimeArn']}")

Amazon Bedrock AgentCore handles the infrastructure, scaling, and session management automatically.
Invoke your deployed agent
Calling your deployed agent is just as simple with invoke_agent_runtime.py:

# Prepare your query with user_id and session_id for memory personalization
payload = json.dumps({
“input”: {
“prompt”: “API response times have degraded 3x in the last hour”,
“user_id”: “Alice”, # User for personalized investigation
“session_id”: “investigation-20250127-123456″ # Session for context
}
})

# Invoke the deployed agent
response = agent_core_client.invoke_agent_runtime(
agentRuntimeArn=runtime_arn,
runtimeSessionId=session_id,
payload=payload,
qualifier=”DEFAULT”
)

# Get the response
response_data = json.loads(response[‘response’].read())
print(response_data) # Full response includes output with agent’s investigation

Key benefits of Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Runtime offers the following key benefits:

Zero infrastructure management – No servers, load balancers, or scaling to configure
Built-in session isolation – Each conversation is completely isolated
AWS IAM integration – Secure access control without custom authentication
Automatic scaling – Scales from zero to thousands of concurrent sessions

The complete deployment process, including building containers and handling AWS permissions, is documented in the Deployment Guide.
Real-world use cases
Let’s explore how the SRE agent handles common incident response scenarios with a real investigation.
When facing a production issue, you can query the system in natural language. The solution uses Amazon Bedrock AgentCore Memory to personalize the investigation based on your role and preferences:

export USER_ID=Alice
sre-agent --prompt "API response times have degraded 3x in the last hour"

The supervisor retrieves Alice’s preferences from memory (detailed systematic analysis style) and creates an investigation plan tailored to her role as a Technical SRE:

Investigation Plan
1. Use metrics_agent to analyze API performance metrics including response times, error rates, and resource utilization to identify the extent and pattern of slowdown
2. Use logs_agent to examine application logs for errors related to slow API responses, focusing on database connection issues and memory errors
3. Use kubernetes_agent to check pod status and resource constraints, especially for web-service and database pods, looking for CrashLoopBackOff states and missing ConfigMaps
Complexity: Simple
Auto-execute: Yes
Agents involved: Metrics Agent, Logs Agent, Kubernetes Agent

The agents investigate sequentially according to the plan, each contributing their specialized analysis. The solution then aggregates these findings into a comprehensive executive summary:

Executive Summary
Key Insights
– Root Cause: Database service failure due to missing ConfigMap ‘database-config’ in production namespace, causing database pod to crash repeatedly
– Impact: Severe performance degradation with API response times increased from 150ms to 5000ms (33x slower)
– Severity: High – Database unavailability, memory exhaustion (100%), and CPU saturation (95%) causing 75% error rate
Next Steps
1. Immediate (< 1 hour): Create/update ConfigMap ‘database-config’ in production namespace and restart database pod
2. Short-term (< 24 hours):
– Fix permissions on ‘/var/lib/postgresql/data’ directory
– Increase Java heap space for web-service to address OutOfMemoryErrors
– Optimize UserService.loadAllUsers method causing memory issues
3. Long-term (< 1 week):
– Implement resource monitoring with alerts for CPU (>80%), memory (>90%)
– Optimize slow database queries, particularly "SELECT * FROM users WHERE status='active'"
– Scale up resources or implement autoscaling for web-service
Critical Alerts
– Database pod (database-pod-7b9c4d8f2a-x5m1q) in CrashLoopBackOff state
– Web-service experiencing OutOfMemoryErrors in UserService.loadAllUsers(UserService.java:45)
– Node-3 experiencing memory pressure (>85% usage)
– Web-app-deployment showing readiness probe failures with 503 errors
Troubleshooting Steps
1. Verify ConfigMap status: `kubectl get configmap database-config -n production`
2. Check database pod logs: `kubectl logs database-pod-7b9c4d8f2a-x5m1q -n production`
3. Create/update ConfigMap: `kubectl create configmap database-config --from-file=database.conf -n production`
4. Fix data directory permissions: `kubectl exec database-pod-7b9c4d8f2a-x5m1q -n production -- chmod -R 700 /var/lib/postgresql/data`
5. Restart database pod: `kubectl delete pod database-pod-7b9c4d8f2a-x5m1q -n production`

This investigation demonstrates how Amazon Bedrock AgentCore primitives work together:

Amazon Bedrock AgentCore Gateway – Provides secure access to infrastructure APIs through MCP tools
Amazon Bedrock AgentCore Identity – Handles ingress and egress authentication
Amazon Bedrock AgentCore Runtime – Hosts the multi-agent solution with automatic scaling
Amazon Bedrock AgentCore Memory – Personalizes Alice’s experience and stores investigation knowledge for future incidents
Amazon Bedrock AgentCore Observability – Captures detailed metrics and traces in CloudWatch for monitoring and debugging

The SRE agent demonstrates intelligent agent orchestration, with the supervisor routing work to specialists based on the investigation plan. The solution’s memory capabilities make sure each investigation builds organizational knowledge and provides personalized experiences based on user roles and preferences.
This investigation showcases several key capabilities:

Multi-source correlation – It connects database configuration issues to API performance degradation
Sequential investigation – Agents work systematically through the investigation plan while providing live updates
Source attribution – Findings include the specific tool and data source
Actionable insights – It provides a clear timeline of events and prioritized recovery steps
Cascading failure detection – It can help show how one failure propagates through the system

Business impact
Organizations implementing AI-powered SRE assistance report significant improvements in key operational metrics. Initial investigations that previously took 30–45 minutes can now be completed in 5–10 minutes, providing SREs with comprehensive context before diving into detailed analysis. This dramatic reduction in investigation time translates directly to faster incident resolution and reduced downtime.

The solution improves how SREs interact with their infrastructure. Instead of navigating multiple dashboards and tools, engineers can ask questions in natural language and receive aggregated insights from relevant data sources. This reduction in context switching enables teams to maintain focus during critical incidents and reduces cognitive load during investigations.

Perhaps most importantly, the solution democratizes knowledge across the team. All team members can access the same comprehensive investigation techniques, reducing dependency on tribal knowledge and on-call burden. The consistent methodology provided by the solution makes sure investigation approaches remain uniform across team members and incident types, improving overall reliability and reducing the chance of missed evidence.
The automatically generated investigation reports provide valuable documentation for post-incident reviews and help teams learn from each incident, building organizational knowledge over time. Furthermore, the solution extends existing AWS infrastructure investments, working alongside services like Amazon CloudWatch, AWS Systems Manager, and other AWS operational tools to provide a unified operational intelligence system.
Extending the solution
The modular architecture makes it straightforward to extend the solution for your specific needs.
For example, you can add specialized agents for your domain:

Security agent – For compliance checks and security incident response
Database agent – For database-specific troubleshooting and optimization
Network agent – For connectivity and infrastructure debugging

You can also replace the demo APIs with connections to your actual systems:

Kubernetes integration – Connect to your cluster APIs for pod status, deployments, and events
Log aggregation – Integrate with your log management service (Elasticsearch, Splunk, CloudWatch Logs)
Metrics platform – Connect to your monitoring service (Prometheus, Datadog, CloudWatch Metrics)
Runbook repository – Link to your operational documentation and playbooks stored in wikis, Git repositories, or knowledge bases

Clean up
To avoid incurring future charges, use the cleanup script to remove the billable AWS resources created during the demo:

# Complete cleanup – deletes AWS resources and local files
./scripts/cleanup.sh

This script automatically performs the following actions:

Stop backend servers
Delete the gateway and its targets
Delete Amazon Bedrock AgentCore Memory resources
Delete the Amazon Bedrock AgentCore Runtime
Remove generated files (gateway URIs, tokens, agent ARNs, memory IDs)

For detailed cleanup instructions, refer to Cleanup Instructions.
Conclusion
The SRE agent demonstrates how multi-agent systems can transform incident response from a manual, time-intensive process into a time-efficient, collaborative investigation that provides SREs with the insights they need to resolve issues quickly and confidently.
By combining the enterprise-grade infrastructure of Amazon Bedrock AgentCore with standardized tool access in MCP, we’ve created a foundation that can adapt as your infrastructure evolves and new capabilities emerge.
The complete implementation is available in our GitHub repository, including demo environments, configuration guides, and extension examples. We encourage you to explore the solution, customize it for your infrastructure, and share your experiences with the community.
To get started building your own SRE assistant, refer to the following resources:

Automate tasks in your application using AI agents
Amazon Bedrock AgentCore Samples GitHub repository
Model Context Protocol documentation
LangGraph documentation

About the authors
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.
Dheeraj Oruganty is a Delivery Consultant at Amazon Web Services. He is passionate about building innovative Generative AI and Machine Learning solutions that drive real business impact. His expertise spans Agentic AI Evaluations, Benchmarking and Agent Orchestration, where he actively contributes to research advancing the field. He holds a master’s degree in Data Science from Georgetown University. Outside of work, he enjoys geeking out on cars, motorcycles, and exploring nature.

OpenAI Releases ChatGPT ‘Pulse’: Proactive, Personalized Daily Bri …

OpenAI introduced ChatGPT Pulse, a proactive experience that compiles personalized, research-backed updates each morning. In preview on mobile and limited to $200/month Pro subscribers, Pulse surfaces topical cards built from a user’s chats, explicit feedback, and opt-in connected apps (e.g., calendar/email), shifting ChatGPT from a request-driven tool to a context-aware assistant.

What Pulse Actually Does Under the Hood

Each day, Pulse performs background research anchored to user signals: recent conversations, long-term interests, thumbs-up/down feedback, and data from connected apps where enabled. The output appears as scannable visual cards (briefs and deep links) rather than an infinite feed, designed for quick triage and drill-down. Early examples include targeted news roundups and context-conditioned suggestions (e.g., travel planning aligned with calendar events).

Data Sources and Controls

Integrations are off by default and can be toggled on by the user. When granted, Pulse may use Gmail/Google Calendar context to tailor cards (e.g., meeting prep, itinerary nudges). OpenAI positions this as a user-level personalization layer, and early reporting emphasizes optionality and in-app settings for managing connected accounts and memory.

Availability and Rollout Plan

Pulse is rolling out now to Pro subscribers as a dedicated tab in the ChatGPT mobile app. OpenAI says it wants broader availability “soon,” with Plus access targeted after product and efficiency improvements. The company attributes the Pro-first gating to compute costs.

Product Positioning: Toward Agentic, Goal-Oriented Workflows

OpenAI frames Pulse as the first step toward agent-like behavior where the model tracks goals and initiates updates without prompts. External coverage highlights the shift from chat to assistant workflows that reason over user state and schedule. This aligns with OpenAI’s recent emphasis on agents and proactive help, not passive Q&A.

The Signal from Leadership

Sam Altman summarized the intent succinctly: Pulse is his “favorite feature” to date, starting with Pro. His post also underscores the model’s use of interests and recent chats, hinting at broader personalization as users share preferences over time. OpenAI’s official announcement on X mirrors the blog language around daily, proactive updates.

Today we are launching my favorite feature of ChatGPT so far, called Pulse. It is initially available to Pro subscribers. Pulse works for you overnight, and keeps thinking about your interests, your connected data, your recent chats, and more. Every morning, you get a… — Sam Altman (@sama) September 25, 2025

Competitive Context

Pulse lands in a crowded “morning brief” space but differs by tying briefs to your live context and chats rather than generic headlines. It also inches ChatGPT toward hands-on assistant territory seen in agent platforms that watch calendars, draft emails, and pre-stage tasks—yet packaged for consumers inside the ChatGPT app rather than a separate agent runner.

Summary

Pulse formalizes ChatGPT as a proactive system: it reads your signals, checks your day, and delivers a compact, personalized brief—first for Pro on mobile, with Plus on the roadmap once the system is optimized. The implementation details (APIs, enterprise knobs, retention policies) will determine how far it goes beyond morning cards into full agent workflows.


OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks

OpenAI introduced GDPval, a new evaluation suite designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations drawn from the nine U.S. sectors that contribute most to GDP. Unlike academic benchmarks, GDPval centers on authentic deliverables—presentations, spreadsheets, briefs, CAD artifacts, audio/video—graded by occupational experts through blinded pairwise comparisons. OpenAI also released a 220-task “gold” subset and an experimental automated grader hosted at evals.openai.com.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals averaging 14 years of experience. Tasks map to O*NET work activities and include multi-modal file handling (docs, slides, images, audio, video, spreadsheets, CAD), with some tasks carrying dozens of reference files. The gold subset provides public prompts and references; primary scoring still relies on expert pairwise judgments due to subjectivity and format requirements.
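For readers who want to inspect the tasks directly, the gold subset can be pulled with the Hugging Face datasets library. The dataset ID, split name, and column names below are assumptions; confirm them on the Hugging Face page linked from the announcement.

# Minimal sketch for browsing the public gold subset; the ID, split, and field
# names are assumptions to verify against the official dataset card.
from datasets import load_dataset

gold = load_dataset("openai/gdpval", split="train")  # assumed dataset ID and split
print(len(gold))           # expected to be on the order of the 220 gold tasks

example = gold[0]
print(example.keys())      # inspect the real schema before relying on any names
# Hypothetical column names, for illustration only:
# print(example["occupation"], example["sector"])
# print(example["prompt"][:500])
# print(example["reference_files"])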

https://openai.com/index/gdpval/

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review, with model progress trending roughly linearly across releases. Reported model-vs-human win/tie rates are near parity for the top models; error profiles cluster around instruction-following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding (e.g., format checks, artifact rendering for self-inspection) yield predictable gains.

Time–Cost Math: Where AI Pays Off

GDPval runs scenario analyses comparing human-only to model-assisted workflows with expert review. It quantifies (i) human completion time and wage-based cost, (ii) reviewer time/cost, (iii) model latency and API cost, and (iv) empirically observed win rates. Results indicate potential time/cost reductions for many task classes once review overhead is included.
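A toy version of that scenario arithmetic makes the trade-off easier to see. Every number below is a placeholder chosen for illustration, not a figure reported by GDPval.

# Back-of-the-envelope comparison of human-only vs. model-assisted-with-review
# workflows; all inputs are hypothetical placeholders.
human_hours = 6.0        # expert completion time for one task (h)
human_rate = 80.0        # expert wage ($/h)
review_hours = 1.5       # expert time to review/fix a model draft (h)
model_cost = 2.0         # API cost of the model attempt ($)
model_minutes = 10.0     # model latency (min)
win_rate = 0.45          # fraction of drafts accepted after review

human_only_cost = human_hours * human_rate
human_only_time = human_hours

# If the draft is rejected, assume the expert redoes the task from scratch.
assisted_cost = model_cost + review_hours * human_rate + (1 - win_rate) * human_only_cost
assisted_time = model_minutes / 60 + review_hours + (1 - win_rate) * human_hours

print(f"human-only:     ${human_only_cost:.0f}, {human_only_time:.1f} h")
print(f"model-assisted: ${assisted_cost:.0f}, {assisted_time:.1f} h")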

Automated Judging: Useful Proxy, Not Oracle

For the gold subset, an automated pairwise grader shows ~66% agreement with human experts, within ~5 percentage points of human–human agreement (~71%). It’s positioned as an accessibility proxy for rapid iteration, not a replacement for expert review.

https://openai.com/index/gdpval/

Why This Isn’t Yet Another Benchmark

Occupational breadth: Spans top GDP sectors and a wide slice of O*NET work activities, not just narrow domains.

Deliverable realism: Multi-file, multi-modal inputs/outputs stress structure, formatting, and data handling.

Moving ceiling: Uses human preference win rate against expert deliverables, enabling re-baselining as models improve.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work. Physical labor, long-horizon interactivity, and organization-specific tooling are out of scope. Tasks are one-shot and precisely specified; ablations show performance drops with reduced context. Task construction and grading are resource-intensive, which motivates the automated grader (whose limits are documented) and future expansion of the suite.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evals with occupational, multi-modal, file-centric tasks and reports human preference outcomes, time/cost analyses, and ablations on reasoning effort and agent scaffolding. The suite is versioned (currently v0) and expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by pairing expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time/cost trade-offs while exposing failure modes and the effects of scaffolding and reasoning effort. Scope remains v0—computer-mediated, one-shot tasks with expert review—yet it establishes a reproducible baseline for tracking real-world capability gains across occupations.

Check out the Paper, Technical details, and Dataset on Hugging Face.


Meta FAIR Released Code World Model (CWM): A 32-Billion-Parameter Open-Weights LLM to Advance Research on Code Generation with World Models

Meta FAIR released Code World Model (CWM), a 32-billion-parameter dense decoder-only LLM that injects world modeling into code generation by training on execution traces and long-horizon agent–environment interactions—not just static source text.

What’s new: learning code by predicting execution

CWM mid-trains on two large families of observation–action trajectories: (1) Python interpreter traces that record local variable states after each executed line, and (2) agentic interactions inside Dockerized repositories that capture edits, shell commands, and test feedback. This grounding is intended to teach semantics (how state evolves) rather than only syntax.
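To picture what a line-level observation trace looks like, the sketch below uses Python's standard sys.settrace hook to capture a function's local variables as execution moves from line to line. It only illustrates the idea of execution-grounded data; CWM's actual trace format is not reproduced here.

import sys

def trace_locals(func, *args, **kwargs):
    """Record (line_number, locals) snapshots while func executes.

    The 'line' event fires just before a line runs, so each snapshot shows the
    state after the previously executed line; the 'return' event captures the
    final state.
    """
    snapshots = []

    def tracer(frame, event, arg):
        if frame.f_code is func.__code__ and event in ("line", "return"):
            snapshots.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return result, snapshots

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

value, steps = trace_locals(gcd, 48, 18)
for lineno, local_vars in steps:
    print(lineno, local_vars)   # e.g. a line number plus {'a': ..., 'b': ...}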

To scale collection, the research team built executable repository images from thousands of GitHub projects and foraged multi-step trajectories via a software-engineering agent (“ForagerAgent”). The release reports ~3M trajectories across ~10k images and 3.15k repos, with mutate-fix and issue-fix variants.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Model and context window

CWM is a dense, decoder-only Transformer (no MoE) with 64 layers, GQA (48Q/8KV), SwiGLU, RMSNorm, and Scaled RoPE. Attention alternates local 8k and global 131k sliding-window blocks, enabling 131k tokens effective context; training uses document-causal masking.
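A small configuration object restating those reported hyperparameters can be handy when comparing against other open models. Field names are ours, unreported dimensions (hidden size, vocabulary, and so on) are deliberately omitted, and the 3:1 local-to-global interleave comes from the implementation notes later in this article.

from dataclasses import dataclass

@dataclass(frozen=True)
class CWMConfigSketch:
    n_layers: int = 64
    n_query_heads: int = 48       # GQA: 48 query heads ...
    n_kv_heads: int = 8           # ... sharing 8 KV heads
    mlp: str = "SwiGLU"
    norm: str = "RMSNorm"
    rope: str = "scaled"
    local_window: int = 8_192     # local sliding-window attention span
    global_window: int = 131_072  # global attention / effective context

    def attention_kind(self, layer_idx: int) -> str:
        """3:1 local:global interleave repeated across the depth."""
        return "global" if layer_idx % 4 == 3 else "local"

# Example: which of the first eight blocks attend globally?
cfg = CWMConfigSketch()
print([cfg.attention_kind(i) for i in range(8)])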

Training recipe (pre → mid → post)

General pretraining: 8T tokens (code-heavy) at 8k context.

Mid-training: +5T tokens, long-context (131k) with Python execution traces, ForagerAgent data, PR-derived diffs, IR/compilers, Triton kernels, and Lean math.

Post-training: a 100B-token SFT stage for instruction following and reasoning, then multi-task RL (~172B tokens) across verifiable coding, math, and multi-turn SWE environments using a GRPO-style algorithm and a minimal toolset (bash/edit/create/submit); a generic sketch of the group-relative update follows below.

Quantized inference fits on a single 80 GB H100.
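As noted in the post-training item, the RL stage is GRPO-style. This sketch shows only the generic ingredients (group-relative advantages and a clipped surrogate) under assumed per-completion rewards and log-probabilities; it is not CWM's training code.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each sampled completion's reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new: np.ndarray, logp_old: np.ndarray,
                      advantages: np.ndarray, clip: float = 0.2) -> float:
    """PPO-style clipped objective per completion (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip, 1 + clip) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# Example: 8 completions sampled for one prompt, rewarded 1/0 by hidden tests.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)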

Benchmarks

The research team cites the following pass@1 / scores (test-time scaling noted where applicable):

SWE-bench Verified: 65.8% (with test-time scaling).

LiveCodeBench-v5: 68.6%; LCB-v6: 63.5%.

Math-500: 96.6%; AIME-24: 76.0%; AIME-25: 68.2%.

CruxEval-Output: 94.3%.

The research team positions CWM as competitive with similarly sized open-weights baselines, and even with larger or closed models, on SWE-bench Verified.

For context on SWE-bench Verified’s task design and metrics, see the official benchmark resources.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Why does world modeling matter for code?

The release emphasizes two operational capabilities:

Execution-trace prediction: given a function and a trace start, CWM predicts stack frames (locals) and the executed line at each step via a structured format—usable as a “neural debugger” for grounded reasoning without live execution.

Agentic coding: multi-turn reasoning with tool use against real repos, verified by hidden tests and patch-similarity rewards; the setup trains the model to localize faults and generate end-to-end patches (git diff) rather than snippets (a rough sketch of such a reward follows below).
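The reward referenced above can be sketched with nothing beyond the standard library: a hidden-test pass rate blended with a difflib-based patch-similarity term. The weighting and the choice of similarity measure are assumptions, not CWM's published reward definition.

import difflib

def patch_similarity(generated_diff: str, reference_diff: str) -> float:
    """Rough similarity between two unified diffs, in [0, 1]."""
    return difflib.SequenceMatcher(None, generated_diff, reference_diff).ratio()

def patch_reward(tests_passed: int, tests_total: int,
                 generated_diff: str, reference_diff: str,
                 sim_weight: float = 0.2) -> float:
    """Blend hidden-test pass rate with similarity to the reference patch."""
    pass_rate = tests_passed / max(tests_total, 1)
    sim = patch_similarity(generated_diff, reference_diff)
    return (1 - sim_weight) * pass_rate + sim_weight * sim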

Some details worth noting

Tokenizer: Llama-3 family with reserved control tokens; reserved IDs are used to demarcate trace and reasoning segments during SFT.

Attention layout: the 3:1 local:global interleave is repeated across the depth; long-context training occurs at large token batch sizes to stabilize gradients.

Compute scaling: learning-rate/batch size schedules are derived from internal scaling-law sweeps tailored for long-context overheads.

Summary

CWM is a pragmatic step toward grounded code generation: Meta ties a 32B dense transformer to execution-trace learning and agentic, test-verified patching, releases intermediate/post-trained checkpoints, and gates usage under the FAIR Non-Commercial Research License—making it a useful platform for reproducible ablations on long-context, execution-aware coding without conflating research with production deployment.

Check out the Paper, GitHub Page, and Model on Hugging Face.
