The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming

Table of contents: Overview · What MCP standardizes · Normative authorization controls · Where MCP supports security engineering in practice · Case study: the first malicious MCP server · Using MCP to structure red-team exercises · Implementation-Focused Security Hardening Checklist · Governance alignment · Current adoption you can test against · Summary · Resources used in the article

Overview

Model Context Protocol (MCP) is an open, JSON-RPC–based standard that formalizes how AI clients (assistants, IDEs, web apps) connect to servers exposing three primitives—tools, resources, and prompts—over defined transports (primarily stdio for local and Streamable HTTP for remote). MCP’s value for security work is that it renders agent/tool interactions explicit and auditable, with normative requirements around authorization that teams can verify in code and in tests. In practice, this enables tight blast-radius control for tool use, repeatable red-team scenarios at clear trust boundaries, and measurable policy enforcement—provided organizations treat MCP servers as privileged connectors subject to supply-chain scrutiny.

What MCP standardizes

An MCP server publishes: (1) tools (schema-typed actions callable by the model), (2) resources (readable data objects the client can fetch and inject as context), and (3) prompts (reusable, parameterized message templates, typically user-initiated). Distinguishing these surfaces clarifies who is “in control” at each edge: model-driven for tools, application-driven for resources, and user-driven for prompts. Those roles matter in threat modeling, e.g., prompt injection often targets model-controlled paths, while unsafe output handling often occurs at application-controlled joins.
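To make the three surfaces concrete, the following is a minimal sketch of a server exposing one of each primitive, written against the official MCP Python SDK's FastMCP interface; the server name, tool, resource URI, and prompt shown are illustrative, not part of the spec.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-security-server")

@mcp.tool()  # model-driven: the assistant decides when to call this
def lookup_ticket(ticket_id: str) -> str:
    """Return the status of an internal ticket (illustrative)."""
    return f"status of {ticket_id}: open"

@mcp.resource("tickets://recent")  # application-driven: the client fetches and injects this as context
def recent_tickets() -> str:
    return "TICKET-1, TICKET-2"

@mcp.prompt()  # user-driven: surfaced as a reusable template in the client UI
def triage_prompt(ticket_id: str) -> str:
    return f"Triage ticket {ticket_id} and summarize the risk."

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport; remote deployments use Streamable HTTP instead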

Transports. The spec defines two standard transports—stdio (Standard Input/Output) and Streamable HTTP—and leaves room for pluggable alternatives. Local stdio reduces network exposure; Streamable HTTP fits multi-client or web deployments and supports resumable streams. Treat the transport choice as a security control: constrain network egress for local servers, and apply standard web authN/Z and logging for remote ones.

Client/server lifecycle and discovery. MCP formalizes how clients discover server capabilities (tools/resources/prompts), negotiate sessions, and exchange messages. That uniformity is what lets security teams instrument call flows, capture structured logs, and assert pre/postconditions without bespoke adapters per integration.

Normative authorization controls

MCP's authorization model is unusually prescriptive for an integration protocol, and its requirements should be enforced as follows:

No token passthrough. “The MCP server MUST NOT pass through the token it received from the MCP client.” Servers are OAuth 2.1 resource servers; clients obtain tokens from an authorization server using RFC 8707 resource indicators so tokens are audience-bound to the intended server. This prevents confused-deputy paths and preserves upstream audit/limit controls.

Audience binding and validation. Servers MUST validate that the access token’s audience matches themselves (resource binding) before serving a request. Operationally, this stops a client-minted token for “Service A” from being replayed to “Service B.” Red teams should include explicit probes for this failure mode.

This is the core of MCP’s security structure: model-side capabilities are powerful, but the protocol insists that servers be first-class principals with their own credentials, scopes, and logs—rather than opaque pass-throughs for a user’s global token.
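As a sketch of what the audience check can look like server-side, the snippet below validates an RS256-signed access token with PyJWT before any tool call is served; the issuer URL, JWKS path, and resource identifier are deployment-specific assumptions.

import jwt  # PyJWT, installed as pyjwt[crypto]
from jwt import PyJWKClient

EXPECTED_AUDIENCE = "https://mcp.example.internal"   # this server's resource identifier (assumption)
TRUSTED_ISSUER = "https://auth.example.internal"     # trusted authorization server (assumption)
jwks = PyJWKClient(f"{TRUSTED_ISSUER}/.well-known/jwks.json")

def authorize_request(bearer_token: str) -> dict:
    """Reject any token whose audience is not this MCP server."""
    signing_key = jwks.get_signing_key_from_jwt(bearer_token)
    claims = jwt.decode(
        bearer_token,
        signing_key.key,
        algorithms=["RS256"],
        audience=EXPECTED_AUDIENCE,  # raises jwt.InvalidAudienceError on mismatch
        issuer=TRUSTED_ISSUER,
    )
    # Never forward bearer_token upstream; use the server's own scoped credentials instead.
    return claims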

Where MCP supports security engineering in practice

Clear trust boundaries. The client-server edge is an explicit, inspectable boundary. You can attach consent UIs, scope prompts, and structured logging at that edge. Many client implementations present permission prompts that enumerate a server’s tools/resources before enabling them—useful for least-privilege and audit—even though UX is not specified by the standard.

Containment and least privilege. Because a server is a separate principal, you can enforce minimal upstream scopes. For example, a secrets-broker server can mint short-lived credentials and expose only constrained tools (e.g., “fetch secret by policy label”), rather than handing broad vault tokens to the model. Public MCP servers from security vendors illustrate this model.

Deterministic attack surfaces for red teaming. With typed tool schemas and replayable transports, red teams can build fixtures that simulate adversarial inputs at tool boundaries and verify post-conditions across models/clients. This yields reproducible tests for classes of failures like prompt injection, insecure output handling, and supply-chain abuse. Pair those tests with recognized taxonomies.

Case study: the first malicious MCP server

In late September 2025, researchers disclosed a trojanized postmark-mcp npm package that impersonated a Postmark email MCP server. Beginning with v1.0.16, the malicious build silently BCC-exfiltrated every email sent through it to an attacker-controlled address/domain. The package was subsequently removed, but guidance urged uninstalling the affected version and rotating credentials. This appears to be the first publicly documented malicious MCP server in the wild, and it underscores that MCP servers often run with high trust and should be vetted and version-pinned like any privileged connector.

Operational takeaways:

Maintain an allowlist of approved servers and pin versions/hashes.

Require code provenance (signed releases, SBOMs) for production servers.

Monitor for anomalous egress patterns consistent with BCC exfiltration.

Practice credential rotation and “bulk disconnect” drills for MCP integrations.

These are not theoretical controls; the incident impact flowed directly from over-trusted server code in a routine developer workflow.
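A minimal sketch of the allowlist-and-pin control from the takeaways above: before a client starts a server, check its name, version, and artifact hash against an approved manifest. The manifest file name and layout here are illustrative assumptions.

import hashlib
import json
from pathlib import Path

def verify_mcp_server(name: str, version: str, artifact: Path,
                      manifest: Path = Path("mcp_allowlist.json")) -> bool:
    """Deny-by-default check against a pinned allowlist of approved servers."""
    allowlist = json.loads(manifest.read_text())
    entry = allowlist.get(name)
    if entry is None or entry["version"] != version:
        return False  # unknown server or unpinned version
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return digest == entry["sha256"]

# Illustrative manifest entry:
# {"postmark-mcp": {"version": "1.0.15", "sha256": "<expected digest>"}}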

Using MCP to structure red-team exercises

1) Prompt-injection and unsafe-output drills at the tool boundary. Build adversarial corpora that enter via resources (application-controlled context) and attempt to coerce calls to dangerous tools. Assert that the client sanitizes injected outputs and that server post-conditions (e.g., allowed hostnames, file paths) hold. Map findings to LLM01 (Prompt Injection) and LLM02 (Insecure Output Handling).

2) Confused-deputy probes for token misuse. Craft tasks that try to induce a server to use a client-issued token or to call an unintended upstream audience. A compliant server must reject foreign-audience tokens per the authorization spec; clients must request audience-correct tokens with RFC 8707 resource. Treat any success here as a P1.
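A sketch of such a probe against a Streamable HTTP deployment, assuming a test authorization server that can mint a token for a different audience; the endpoint URL and token value below are placeholders.

import requests

MCP_URL = "https://mcp.example.internal/mcp"  # server under test (placeholder)
# Token deliberately minted for another audience by your test authorization server:
FOREIGN_AUDIENCE_TOKEN = "<token with aud=https://other-service.example>"

def test_rejects_foreign_audience_token():
    call = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
    resp = requests.post(
        MCP_URL,
        json=call,
        headers={"Authorization": f"Bearer {FOREIGN_AUDIENCE_TOKEN}"},
        timeout=10,
    )
    # A compliant server refuses the token outright; anything else is a P1 finding.
    assert resp.status_code in (401, 403)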

3) Session/stream resilience. For remote transports, exercise reconnection/resumption flows and multi-client concurrency for session fixation/hijack risks. Validate non-deterministic session IDs and rapid expiry/rotation in load-balanced deployments. (Streamable HTTP supports resumable connections; use it to stress your session model.)

4) Supply-chain kill-chain drills. In a lab, insert a trojaned server (with benign markers) and verify whether your allowlists, signature checks, and egress detection catch it—mirroring the Postmark incident TTPs. Measure time to detection and credential rotation MTTR.

5) Baseline with trusted public servers. Use vetted servers to construct deterministic tasks. Two practical examples: Google’s Data Commons MCP exposes public datasets under a stable schema (good for fact-based tasks/replays), and Delinea’s MCP demonstrates least-privilege secrets brokering for agent workflows. These are ideal substrates for repeatable jailbreak and policy-enforcement tests.

Implementation-Focused Security Hardening Checklist

Client side

Display the exact command or configuration used to start local servers; gate startup behind explicit user consent and enumerate the tools/resources being enabled. Persist approvals with scope granularity. (This is common practice in clients such as Claude Desktop.)

Maintain an allowlist of servers with pinned versions and checksums; deny unknown servers by default.

Log every tool call (name, arguments metadata, principal, decision) and resource fetch with identifiers so you can reconstruct attack paths post-hoc.

Server side

Implement OAuth 2.1 resource-server behavior; validate tokens and audiences; never forward client-issued tokens upstream.

Minimize scopes; prefer short-lived credentials and capabilities that encode policy (e.g., “fetch secret by label” instead of free-form read).

For local deployments, prefer stdio inside a container/sandbox and restrict filesystem/network capabilities; for remote, use Streamable HTTP with TLS, rate limits, and structured audit logs.

Detection & response

Alert on anomalous server egress (unexpected destinations, email BCC patterns) and sudden capability changes between versions.

Prepare break-glass automation to revoke client approvals and rotate upstream secrets quickly when a server is flagged (your “disconnect & rotate” runbook). The Postmark incident showed why time matters.

Governance alignment

MCP’s separation of concerns—clients as orchestrators, servers as scoped principals with typed capabilities—aligns directly with NIST’s AI RMF guidance for access control, logging, and red-team evaluation of generative systems, and with OWASP’s LLM Top-10 emphasis on mitigating prompt injection, unsafe output handling, and supply-chain vulnerabilities. Use those frameworks to justify controls in security reviews and to anchor acceptance criteria for MCP integrations.

Current adoption you can test against

Anthropic/Claude: product docs and ecosystem material position MCP as the way Claude connects to external tools and data; many community tutorials closely follow the spec’s three-primitive model. This provides ready-made client surfaces for permissioning and logging.

Google’s Data Commons MCP: released Sept 24, 2025, it standardizes access to public datasets; its announcement and follow-up posts include production usage notes (e.g., the ONE Data Agent). Useful as a stable “truth source” in red-team tasks.

Delinea MCP: open-source server integrating with Secret Server and Delinea Platform, emphasizing policy-mediated secret access and OAuth alignment with the MCP authorization spec. A practical example of least-privilege tool exposure.

Summary

MCP is not a silver-bullet “security product.” It is a protocol that gives security and red-team practitioners stable, enforceable levers: audience-bound tokens, explicit client-server boundaries, typed tool schemas, and transports you can instrument. Use those levers to (1) constrain what agents can do, (2) observe what they actually did, and (3) replay adversarial scenarios reliably. Treat MCP servers as privileged connectors—vet, pin, and monitor them—because adversaries already do. With those practices in place, MCP becomes a practical foundation for secure agentic systems and a reliable substrate for red-team evaluation.

Resources used in the article

MCP specification & concepts

https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization

https://modelcontextprotocol.io/specification/2025-03-26/basic/transports

https://modelcontextprotocol.io/docs/concepts/architecture

https://modelcontextprotocol.io/docs/concepts/prompts

MCP ecosystem (official)

https://www.anthropic.com/news/model-context-protocol

https://docs.claude.com/en/docs/mcp

https://docs.claude.com/en/docs/claude-code/mcp

https://modelcontextprotocol.io/quickstart/server

https://modelcontextprotocol.io/docs/develop/connect-local-servers

https://modelcontextprotocol.io/docs/develop/connect-remote-servers

Security frameworks

https://owasp.org/www-project-top-10-for-large-language-model-applications/

https://genai.owasp.org/llm-top-10/

https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

https://www.nist.gov/itl/ai-risk-management-framework

Incident: malicious postmark-mcp server

https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft

https://thehackernews.com/2025/09/first-malicious-mcp-server-found.html

https://www.itpro.com/security/a-malicious-mcp-server-is-silently-stealing-user-emails

https://threatprotect.qualys.com/2025/09/30/malicious-mcp-server-on-npm-postmark-mcp-exploited-in-attack/

Example MCP servers referenced

https://developers.googleblog.com/en/datacommonsmcp/

https://blog.google/technology/developers/ai-agents-datacommons/

https://github.com/DelineaXPM/delinea-mcp

https://delinea.com/news/delinea-mcp-server-to-provide-secure-credential-access-for-ai-agents?hs_amp=true

https://delinea.com/blog/unlocking-ai-agents-mcp


How Hapag-Lloyd improved schedule reliability with ML-powered vessel s …

This post is cowritten with Thomas Voss and Bernhard Hersberger from Hapag-Lloyd.
Hapag-Lloyd is one of the world’s leading shipping companies with more than 308 modern vessels, 11.9 million TEUs (twenty-foot equivalent units) transported per year, and 16,700 motivated employees in more than 400 offices in 139 countries. They connect continents, businesses, and people through reliable container transportation services on the major trade routes across the globe.
In this post, we share how Hapag-Lloyd developed and implemented a machine learning (ML)-powered assistant predicting vessel arrival and departure times that revolutionizes their schedule planning. By using Amazon SageMaker AI and implementing robust MLOps practices, Hapag-Lloyd has enhanced its schedule reliability—a key performance indicator in the industry and quality promise to their customers.
For Hapag-Lloyd, accurate vessel schedule predictions are crucial for maintaining schedule reliability, where schedule reliability is defined as percentage of vessels arriving within 1 calendar day (earlier or later) of their estimated arrival time, communicated around 3 to 4 weeks before arrival.
Prior to developing the new ML solution, Hapag-Lloyd relied on simple rule-based and statistical calculations, based on historical transit patterns for vessel schedule predictions. While this statistical method provided basic predictions, it couldn’t effectively account for real-time conditions such as port congestion, requiring significant manual intervention from operations teams.
Developing a new ML solution to replace the existing system presented several key challenges:

Dynamic shipping conditions – The estimated time of arrival (ETA) prediction model needs to account for numerous variables that affect journey duration, including weather conditions, port-related delays such as congestion, labor strikes, and unexpected events that force route changes. For example, when the Suez Canal was blocked by the Ever Given container ship in March 2021, vessels had to be rerouted around Africa, adding approximately 10 days to their journey times.
Data integration at scale – The development of accurate models requires integration of large volumes of historical voyage data with external real-time data sources including port congestion information and vessel position tracking (AIS). The solution needs to scale across 120 vessel services or lines and 1,200 unique port-to-port routes.
Robust MLOps infrastructure – A robust MLOps infrastructure is required to continuously monitor model performance and quickly deploy updates whenever needed. This includes capabilities for regular model retraining to adapt to changing patterns, comprehensive performance monitoring, and maintaining real-time inference capabilities for immediate schedule adjustments.

Hapag-Lloyd’s previous approach to schedule planning couldn’t effectively address these challenges. A comprehensive solution that could handle both the complexity of vessel schedule prediction and provide the infrastructure needed to sustain ML operations at global scale was needed.
The Hapag-Lloyd network consists of over 308 vessels and many more partner vessels that continuously circumnavigate the globe on predefined service routes, resulting in more than 3,500 port arrivals per month. Each vessel operates on a fixed service line, making regular round trips between a sequence of ports. For instance, a vessel might repeatedly sail a route from Southampton to Le Havre, Rotterdam, Hamburg, New York, and Philadelphia before starting the cycle again. For each port arrival, an ETA must be provided multiple weeks in advance to arrange critical logistics, including berth windows at ports and onward transportation of containers by sea, land, or air.

The following table shows an example where a vessel travels from Southampton to New York through Le Havre, Rotterdam, and Hamburg. The vessel’s time until arrival at the New York port can be calculated as the sum of the ocean-to-port time to Southampton and the respective berth times and port-to-port times for the intermediate ports called while sailing to New York. If this vessel encounters a delay in Rotterdam, the delay affects its arrival in Hamburg and cascades through the entire schedule, impacting arrivals in New York and beyond. This ripple effect can disrupt carefully planned transshipment connections and require extensive replanning of downstream operations.

Port         | Terminal call | Scheduled arrival | Scheduled departure
SOUTHAMPTON  | 1 | 2025-07-29 07:00 | 2025-07-29 21:00
LE HAVRE     | 2 | 2025-07-30 16:00 | 2025-07-31 16:00
ROTTERDAM    | 3 | 2025-08-03 18:00 | 2025-08-05 03:00
HAMBURG      | 4 | 2025-08-07 07:00 | 2025-08-08 07:00
NEW YORK     | 5 | 2025-08-18 13:00 | 2025-08-21 13:00
PHILADELPHIA | 6 | 2025-08-22 06:00 | 2025-08-24 16:30
SOUTHAMPTON  | 7 | 2025-09-01 08:00 | 2025-09-02 20:00

When a vessel departs Rotterdam with a delay, new ETAs must be calculated for the remaining ports. For Hamburg, we only need to estimate the remaining sailing time from the vessel’s current position. However, for subsequent ports like New York, the prediction requires multiple components: the remaining sailing time to Hamburg, the duration of port operations in Hamburg, and the sailing time from Hamburg to New York.
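As a toy illustration of that composition (not the production models), the downstream ETA is simply the sum of the predicted component durations; the hour values below are invented.

from datetime import datetime, timedelta

def downstream_eta(now: datetime,
                   remaining_sailing_to_hamburg_h: float,
                   berth_time_hamburg_h: float,
                   sailing_hamburg_to_new_york_h: float) -> datetime:
    """Compose the New York ETA from the three predicted components."""
    total_hours = (remaining_sailing_to_hamburg_h
                   + berth_time_hamburg_h
                   + sailing_hamburg_to_new_york_h)
    return now + timedelta(hours=total_hours)

# Invented example values: 40 h remaining sailing, 24 h at berth, 250 h Atlantic crossing
print(downstream_eta(datetime(2025, 8, 5, 12, 0), 40, 24, 250))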
Solution overview
As an input to the vessel ETA prediction, we process the following two data sources:

Hapag-Lloyd’s internal data, which is stored in a data lake. This includes detailed vessel schedules and routes, port and terminal performance information, real-time port congestion and waiting times, and vessel characteristics datasets. This data is prepared for model training using AWS Glue jobs.
Automatic Identification System (AIS) data, which provides streaming updates on the vessel movements. This AIS data ingestion is batched every 20 minutes using AWS Lambda and includes crucial information such as latitude, longitude, speed, and direction of vessels. New batches are processed using AWS Glue and Iceberg to update the existing AIS database—currently holding around 35 million observations.

These data sources are combined to create training datasets for the ML models. We carefully consider the timing of available data through temporal splitting to avoid data leakage. Data leakage occurs when using information that wouldn’t be available at prediction time in the real world. For example, when training a model to predict arrival time in Hamburg for a vessel currently in Rotterdam, we can’t use actual transit times that were only known after the vessel reached Hamburg.
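A minimal sketch of such a temporal split with pandas, assuming a voyage-level DataFrame with an event timestamp column (the column name is illustrative):

import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str, ts_col: str = "event_ts"):
    """Train only on records observed before the cutoff; evaluate on later ones,
    so no feature or label carries information from the model's future."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[ts_col] < cutoff_ts]
    test = df[df[ts_col] >= cutoff_ts]
    return train, test

# train_df, test_df = temporal_split(voyages, cutoff="2025-05-01")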
A vessel’s journey can be divided into different legs, which led us to develop a multi-step solution using specialized ML models for each leg, which are orchestrated as hierarchical models to retrieve the overall ETA:

The Ocean to Port (O2P) model predicts the time needed for a vessel to reach its next port from its current position at sea. The model uses features such as remaining distance to destination, vessel speed, journey progress metrics, port congestion data, and historical sea leg durations.
The Port to Port (P2P) model forecasts sailing time between any two ports for a given date, considering key features such as ocean distance between ports, recent transit time trends, weather, and seasonal patterns.
The Berth Time model estimates how long a vessel will spend at port. The model uses vessel characteristics (such as tonnage and load capacity), planned container load, and historical port performance.
The Combined model takes as input the predictions from the O2P, P2P, and Berth Time models, along with the original schedule. Rather than predicting absolute arrival times, it computes the expected deviation from the original schedule by learning patterns in historical prediction accuracy and specific voyage conditions. These computed deviations are then used to update ETAs for the upcoming ports in a vessel’s schedule.

All four models are trained using the XGBoost algorithm built into SageMaker, chosen for its ability to handle complex relationships in tabular data and its robust performance with mixed numerical and categorical features. Each model has a dedicated training pipeline in SageMaker Pipelines, handling data preprocessing steps and model training. The following diagram shows the data processing pipeline, which generates the input datasets for ML training.

As an example, this diagram shows the training pipeline of the Berth model. The steps in the SageMaker training pipelines of the Berth, P2P, O2P, and Combined models are identical. Therefore, the training pipeline is implemented once as a blueprint and reused across the other models, enabling a fast implementation turnaround.
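For orientation, the following is a hedged sketch of what a single training job for one of these component models looks like with SageMaker's built-in XGBoost container; it is not Hapag-Lloyd's pipeline code, and the role ARN, bucket paths, and hyperparameters are placeholders.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

# Built-in XGBoost container image for this Region
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://<bucket>/berth-model/output",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=300, max_depth=8)
estimator.fit({
    "train": TrainingInput("s3://<bucket>/berth/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://<bucket>/berth/validation.csv", content_type="text/csv"),
})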

Because the Combined model depends on outputs from the other three specialized models, we use AWS Step Functions to orchestrate the SageMaker pipelines for training. This helps ensure that the individual models are updated in the correct sequence and maintains prediction consistency across the system. The orchestration of the training pipelines is shown in the following pipeline architecture.
The individual workflow begins with a data processing pipeline that prepares the input data (vessel schedules, AIS data, port congestion, and port performance metrics) and splits it into dedicated datasets. This feeds into three parallel SageMaker training pipelines for our base models (O2P, P2P, and Berth), each following a standardized process of feature encoding, hyperparameter optimization, model evaluation, and registration using SageMaker Processing and hyperparameter tuning jobs and the SageMaker Model Registry. After training, each base model runs a SageMaker batch transform job to generate predictions that serve as input features for the Combined model training. The performance of the latest Combined model version is tested on the last 3 months of data with known ETAs, and performance metrics (R², mean absolute error (MAE)) are computed. If the model’s MAE exceeds a set threshold, the entire training process fails and the model version is automatically discarded, preventing the deployment of models that don’t meet the minimum performance bar.
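A sketch of that evaluation gate, assuming predictions and actuals for the 3-month holdout are available as arrays; the threshold value is illustrative.

from sklearn.metrics import mean_absolute_error, r2_score

MAE_THRESHOLD_HOURS = 12.0  # illustrative acceptance threshold

def evaluation_gate(y_true, y_pred) -> dict:
    """Fail the training workflow if the holdout MAE exceeds the threshold."""
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    if mae > MAE_THRESHOLD_HOURS:
        # Raising fails the pipeline step, so this model version is discarded.
        raise RuntimeError(f"MAE {mae:.2f} h exceeds threshold {MAE_THRESHOLD_HOURS} h")
    return {"mae": mae, "r2": r2}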
All four models are versioned and stored as separate model package groups in the SageMaker Model Registry, enabling systematic version control and deployment. This orchestrated approach helps ensure that our models are trained in the correct sequence using parallel processing, resulting in an efficient and maintainable training process.

The hierarchical model approach helps further ensure that a degree of explainability comparable to the current statistical and rule-based solution is maintained—avoiding ML black box behavior. For example, it becomes possible to highlight unusually long berthing time predictions when discussing prediction results with business experts. This helps increase transparency and build trust, which in turn increases acceptance within the company.
Inference solution walkthrough
The inference infrastructure implements a hybrid approach combining batch processing with real-time API capabilities as shown in Figure 5. Because most data sources update daily and require extensive preprocessing, the core predictions are generated through nightly batch inference runs. These pre-computed predictions are complemented by a real-time API that implements business logic for schedule changes and ETA updates.

Daily batch inference:

1. Amazon EventBridge triggers a Step Functions workflow every day.
2. The Step Functions workflow orchestrates the data and inference process:
   - Lambda copies internal Hapag-Lloyd data from the data lake to Amazon Simple Storage Service (Amazon S3).
   - AWS Glue jobs combine the different data sources and prepare inference inputs.
   - SageMaker inference executes in sequence:
     - Fallback predictions are computed from historical averages and written to Amazon Relational Database Service (Amazon RDS). Fallback predictions are used in case of missing data or a downstream inference failure.
     - Preprocessing data for the four specialized ML models.
     - O2P, P2P, and Berth model batch transforms.
     - The Combined model batch transform generates final ETA predictions, which are written to Amazon RDS.
   - Input features and output files are stored in Amazon S3 for analytics and monitoring.

For operational reliability, any failures in the inference pipeline trigger immediate email notifications to the on-call operations team through Amazon Simple Email Service (Amazon SES).

Real-time API:

Amazon API Gateway receives client requests containing the current schedule and an indication for which vessel-port combinations an ETA update is required. By receiving the current schedule through the client request, we can take care of intraday schedule updates while doing daily batch transform updates.
API Gateway triggers a Lambda function that constructs the response by linking the ETA predictions (stored in Amazon RDS) with the current schedule using custom business logic, so that short-term schedule changes unknown at batch inference time are handled. Typical examples of short-term schedule changes are port omissions (for example, due to port congestion) and one-time port calls.

This architecture enables millisecond response times to custom requests while achieving a 99.5% availability (a maximum 3.5 hours downtime per month).

Conclusion
Hapag-Lloyd’s ML-powered vessel scheduling assistant outperforms the current solution in both accuracy and response time. Typical API response times are on the order of hundreds of milliseconds, helping to ensure a real-time user experience and outperforming the current solution by more than 80%. Low response times are crucial because, in addition to fully automated schedule updates, business experts need them to work with the schedule assistant interactively. In terms of accuracy, the ML-powered ETA predictions improve MAE over the current solution by approximately 12%, which on average translates into climbing two positions in the international ranking of schedule reliability. Schedule reliability is one of the key performance metrics in liner shipping, and this is a significant improvement within the industry.
To learn more about architecting and governing ML workloads at scale on AWS, see the AWS blog post Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker and the accompanying AWS workshop AWS Multi-Account Data & ML Governance Workshop.
Acknowledgement
We acknowledge the significant and valuable work of Michal Papaj and Piotr Zielinski from Hapag-Lloyd in the data science and data engineering areas of the project.
About the authors
Thomas Voss Thomas Voss works at Hapag-Lloyd as a data scientist. With his background in academia and logistics, he takes pride in leveraging data science expertise to drive business innovation and growth through the practical design and modeling of AI solutions.
Bernhard Hersberger Bernhard Hersberger works as a data scientist at Hapag-Lloyd, where he heads the AI Hub team in Hamburg. He is enthusiastic about integrating AI solutions across the company, taking comprehensive responsibility from identifying business issues to deploying and scaling AI solutions worldwide.
Gabija Pasiunaite At AWS, Gabija Pasiunaite was a Machine Learning Engineer at AWS Professional Services based in Zurich. She specialized in building scalable ML and data solutions for AWS Enterprise customers, combining expertise in data engineering, ML automation and cloud infrastructure. Gabija has contributed to the AWS MLOps Framework used by AWS customers globally. Outside work, Gabija enjoys exploring new destinations and staying active through hiking, skiing, and running.
Jean-Michel Lourier Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist.
Mousam Majhi Mousam Majhi is a Senior ProServe Cloud Architect focusing on Data & AI within AWS Professional Services. He works with Manufacturing and Travel, Transportation & Logistics customers in DACH to achieve their business outcomes by leveraging data and AI powered solutions. Outside of work, Mousam enjoys hiking in the Bavarian Alps.

Rox accelerates sales productivity with AI agents powered by Amazon Be …

This post was co-written with Shriram Sridharan, Taeuk Kang, and Santhosh Kumar Manavasi Lakshminarayanan from Rox.
Rox is building a new revenue operating system for the applied AI era.
Modern revenue teams rely on more data than ever before, such as Customer Relationship Management (CRM) systems, marketing automation, finance systems, support tickets, and live product usage. Though each serves its role, together they create silos that slow sellers down and leave insights untapped.
Rox addresses this by providing a revenue operating system: a unified layer that brings these signals together and equips AI agents to execute go-to-market (GTM) workflows. Instead of reconciling reports or updating fields, sellers get real-time intelligence and automation in their daily flow.
Today, we’re excited to announce that Rox is generally available, with Rox infrastructure built on AWS and delivered across web, Slack, macOS, and iOS. In this post, we share how Rox accelerates sales productivity with AI agents powered by Amazon Bedrock.
Solution overview
As noted in Rox is transforming revenue teams with AI-driven integration powered by AWS, modern GTM teams need more than a static database. Revenue data spans dozens of systems, such as product usage, finance, and support, and teams require a system that unifies context and acts on it in real time.
Rox delivers this through a layered architecture on AWS:

System of record – A unified, governed knowledge graph consolidates CRM, finance, support, product telemetry, and web data
Agent swarms – Intelligent, account-aware agents reason over the graph and orchestrate multi-step workflows like research, outreach, opportunity management, and proposal generation
Interfaces across surfaces – Sellers engage these workflows where they work, such as web application, Slack, iOS, and macOS

This converts the CRM from a passive system of record into an active system of action, so teams can act on their data immediately and intelligently.
The following diagram illustrates the solution architecture.

Benefits and features of Rox
Now generally available, Rox extends from intelligence to full execution with Command, a new conversational interface that orchestrates multi-agent workflows. Command coordinates with multiple specialized agents running in parallel. A single request (for example, “prep me for the ACME renewal and draft follow-ups”) expands into a plan: research usage and support signals, identify missing stakeholders, refresh enrichment, propose next-best actions, draft outreach, update the opportunity, and assemble a proposal. Each step is completed through tool calls into your systems and is subject to guardrail approvals.

Our comprehensive safety architecture employs a sophisticated multi-layer guardrail system as the first line of defense against inappropriate, harmful, or malicious requests. Incoming requests undergo rigorous analysis through our advanced filtering mechanisms before reaching the inference layer. This preprocessing stage evaluates multiple dimensions of safety and appropriateness, such as legal compliance assessment and business relevance evaluation, to make sure only legitimate, safe, and contextually appropriate requests proceed to model execution.
Command decomposes the request, routes steps to the right agents, sequences external tool invocations (CRM, calendar, enrichment, email), reconciles results into the system of context, and returns one coherent thread that’s ready for consumption on the web, Slack, iOS, or macOS. Every suggestion is explainable (sources and traces), reversible (audit logs), and policy-aware (role-based access control, rate limits, required approvals).
How Amazon Bedrock powers Rox
Command demands a model capable of reasoning across multiple steps, orchestrating tools, and adapting dynamically.
To meet these needs, Rox chose Anthropic’s Claude Sonnet 4 on Amazon Bedrock. Anthropic’s Claude Sonnet 4 has consistently demonstrated unmatched tool-calling and reasoning performance, allowing Rox agents to sequence workflows like account research, enrichment, outreach, opportunity management, and proposal generation with reliability.
Amazon Bedrock provides the foundation to deliver Rox at enterprise scale, offering security, flexibility to integrate with the latest models, and scalability to handle thousands of concurrent agents reliably.
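For readers who want to try the same model, the following is a minimal sketch of calling Anthropic's Claude Sonnet 4 through the Amazon Bedrock Converse API with boto3; the model identifier and prompt are placeholders (verify availability in your Region), and this is not Rox's orchestration code.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # confirm the exact model ID in your Region
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize the open renewal risks for account ACME."}],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])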
In addition to Command, Rox includes the following features:

Research – Offers deep account and market research, grounded in unified context (carried over from private beta)
Meet – Makes it possible to record, transcribe, summarize, and turn meetings into actions (carried over from private beta)
Outreach – Provides personalized prospect engagement, contextualized by unified data (new)
Revenue – Helps you track, update, and advance pipelines in the flow of work (new)
Auto-fill proposals – Helps you assemble tailored proposals in seconds from account context (new)
Rox apps – Offers modular extensions that add purpose-built workflows (dashboards, trackers) directly into the system (new)
iOS app – Delivers notifications and meeting prep on the go (new)
Mac app – Brings the ability to transcribe calls and add them to the system of context (new)
Regional expansion – Now live in the AWS Middle East (Bahrain) AWS Region, aligning with data residency and sovereignty needs (new)

Early customer impact
In beta, enterprises saw immediate gains:

50% higher representative productivity
20% faster sales velocity
Twofold increase in revenue per rep

For example, real Rox customers were able to sharpen their focus on high-value opportunities, driving a 40–50% increase in average selling price. Another customer saw 90% reduction in rep prep time and faster closes, plus 15% more six-figure deals uncovered through Rox insights. Rox also shortens ramp time for new reps, with customers reporting 50% quicker ramp time using Rox.
Try Rox today
Our vision is for revenue teams to run with an always-on agent swarm that continuously researches accounts, engages stakeholders, and moves the pipeline forward.
Rox is now generally available. Get started at rox.com or visit the AWS Marketplace. Together with AWS, we will continue to build the AI-based operating system for modern revenue teams.

About the authors
Shriram Sridharan is the Co-Founder/Engineering Head of Rox, a Sequoia backed AI company. Before Rox, Shriram led the data infrastructure team at Confluent responsible for making Kafka faster and cheaper across clouds. Prior to that he was one of the early engineers in Amazon Aurora (pre-launch) re-imagining databases for the cloud. Aurora was the fastest growing AWS Service and a recipient of the 2019 SIGMOD systems award.
Taeuk Kang is a Founding Engineer at Rox, working across AI research and engineering. He studied Computer Science at Stanford. Prior to Rox, he built large language model agents and retrieval-augmented generation systems at X (formerly Twitter) and designed the distributed LLM infrastructure powering core product features and Trust & Safety, improving overall platform health. Earlier at Stripe, he developed high-performance streaming and batch data processing pipelines integrating Apache Flink, Spark, Kafka, and AWS SQS.
Santhosh Kumar Manavasi Lakshminarayanan leads Platform at Rox. Before Rox, he was Director of Engineering at StreamSets (acquired by IBM), where he led the StreamSets Cloud Platform, making it seamless for big enterprises to run their data pipelines at scale on modern cloud providers. Before StreamSets, he was a senior engineer on the Platform Metadata team at Informatica.
Andrew Brown is an Account Executive for AI Startups at Amazon Web Services (AWS) in San Francisco, CA. With a strong background in cloud computing and a focus on supporting startups, Andrew specializes in helping companies scale their operations using AWS technologies.
Santhan Pamulapati is a Sr. Solutions Architect for GenAI startups at AWS, with deep expertise in designing and building scalable solutions that drive customer growth. He has a strong background in building HPC systems leveraging AWS services and has worked with strategic customers to solve business challenges.

Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access

Delinea released a Model Context Protocol (MCP) server that lets AI agents access credentials stored in Delinea Secret Server and the Delinea Platform. The server applies identity checks and policy rules on every call, aiming to keep long-lived secrets out of agent memory while retaining full auditability.

What’s new

The GitHub project DelineaXPM/delinea-mcp (MIT-licensed) exposes a constrained MCP tool surface for credential retrieval and account operations, supports OAuth 2.0 dynamic client registration per the MCP spec, and offers both STDIO and HTTP/SSE transports. The repo includes Docker artifacts and example configs for editor/agent integrations.

How it works

The server exposes MCP tools that proxy to Secret Server and (optionally) the Delinea Platform: secret and folder retrieval/search, inbox/access-request helpers, user/session admin, and report execution; secrets themselves remain vaulted and are never presented to the agent. Configuration separates secrets into environment variables (e.g., DELINEA_PASSWORD) and non-secrets into config.json, with scope controls (enabled_tools, allowed object types), TLS certs, and an optional registration pre-shared key.
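A small sketch of that separation from the server operator's side, loading non-secrets from config.json and the credential from the environment; apart from DELINEA_PASSWORD and enabled_tools, which the project documents, the key names here are illustrative rather than the project's exact schema.

import json
import os
from pathlib import Path

config = json.loads(Path("config.json").read_text())  # non-secrets: scopes, TLS, allowed object types
password = os.environ["DELINEA_PASSWORD"]              # secret: never placed in config files or logs

enabled_tools = set(config.get("enabled_tools", []))

def is_tool_allowed(tool_name: str) -> bool:
    """Deny any tool outside the configured least-privilege surface."""
    return tool_name in enabled_tools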

Why it matters

Enterprises are rapidly wiring agents to operational systems through MCP. Recent incidents—such as a rogue MCP package exfiltrating email—underscore the need for registration controls, TLS, least-privilege tool surfaces, and traceable identity context on every call. Delinea’s server claims to implement these controls in a PAM-aligned pattern (ephemeral auth + policy checks + audit), reducing credential sprawl and simplifying revocation.

Summary

Delinea’s MIT-licensed MCP server gives enterprises a standard, auditable path for AI-agent credential access—short-lived tokens, policy evaluation, and constrained tools—that reduces secret exposure while integrating with Secret Server and the Delinea Platform. It’s available now on GitHub, with initial coverage and technical details confirming OAuth 2.0 support, STDIO/HTTP(SSE) transports, and scoped operations.


OpenAI Launches Sora 2 and a Consent-Gated Sora iOS App

OpenAI released Sora 2, a text-to-video-and-audio model focused on physical plausibility, multi-shot controllability, and synchronized dialogue/SFX. The OpenAI team has also launched a new invite-only Sora iOS app (U.S. and Canada first) that enables social creation, remixing, and consent-controlled “cameos” for inserting a verified likeness into generated scenes.

Model capabilities

Sora 2 claims materially better world modeling (e.g., rebounds on missed shots instead of object “teleportation”), maintains state across shots for instruction-following edits, and generates native, time-aligned audio (speech, ambient, effects). These are framed as prerequisites for simulation-grade video generation rather than single-clip “best effort” synthesis.

App architecture and “cameos”

The Sora app is built around cameos: users record a short in-app video+audio to verify identity and capture likeness; cameo owners control who can use their likeness and can revoke or delete any video—including drafts—that includes them. The app is available on iOS devices, with availability expanding beyond the initial U.S./Canada rollout.

Safety posture

OpenAI’s Sora 2 documents an iterative rollout with specific launch-time restrictions and provenance controls:

Uploads/Generations: At launch, OpenAI is restricting the use of image uploads that feature a photorealistic person and all video uploads. Sora 2 does not support video-to-video at launch, blocks text-to-video of public figures, and blocks generations that include real people except when a user has opted-in via the cameo feature. Additional classifier thresholds apply when a real person appears.

Provenance: All outputs carry C2PA metadata and a visible moving watermark on downloads, with internal detection tools for origin assessment.

Parental controls

In parallel with Sora, OpenAI introduced parental controls integrated via ChatGPT: parents can opt teens into a non-personalized feed, manage DM permissions, and control whether continuous scroll is allowed—aligned with the Sora feed’s “creation-over-consumption” philosophy.

Access and pricing

The Sora iOS app is available to download now; access opens by invite, with Sora 2 initially free under compute-constrained caps. ChatGPT Pro users get access to an experimental Sora 2 Pro tier on sora.com (and coming to the app). API access is planned after the consumer rollout. Existing Sora 1 Turbo content remains available in user libraries.

Summary

Sora 2 pushes text-to-video toward controllable, physics-respecting, audio-synchronized generation—and OpenAI is shipping it inside an invite-only iOS app with consent-gated cameos plus C2PA metadata and visible watermarks for provenance. The initial U.S./Canada rollout prioritizes safety constraints (e.g., restrictions on public-figure depictions) while staging broader access and API plans, signaling a deliberate shift from raw capability demos to governed, production-ready media tooling.



Zhipu AI Releases GLM-4.6: Achieving Enhancements in Real-World Coding, Long-Context Processing, Reasoning, Searching and Agentic AI

Zhipu AI has released GLM-4.6, a major update to its GLM series focused on agentic workflows, long-context reasoning, and practical coding tasks. The model raises the input window to 200K tokens with a 128K max output, targets lower token consumption in applied tasks, and ships with open weights for local deployment.

https://z.ai/blog/glm-4.6

So, what exactly is new?

Context + output limits: 200K input context and 128K maximum output tokens.

Real-world coding results: On the extended CC-Bench (multi-turn tasks run by human evaluators in isolated Docker environments), GLM-4.6 is reported near parity with Claude Sonnet 4 (48.6% win rate) and uses ~15% fewer tokens vs. GLM-4.5 to finish tasks. Task prompts and agent trajectories are published for inspection.

Benchmark positioning: Zhipu summarizes “clear gains” over GLM-4.5 across eight public benchmarks and states parity with Claude Sonnet 4/4.5 on several; it also notes GLM-4.6 still lags Sonnet 4.5 on coding—a useful caveat for model selection.

Ecosystem availability: GLM-4.6 is available via Z.ai API and OpenRouter; it integrates with popular coding agents (Claude Code, Cline, Roo Code, Kilo Code), and existing Coding Plan users can upgrade by switching the model name to glm-4.6.

Open weights + license: Hugging Face model card lists License: MIT and Model size: 355B params (MoE) with BF16/F32 tensors. (MoE “total parameters” are not equal to active parameters per token; no active-params figure is stated for 4.6 on the card.)

Local inference: vLLM and SGLang are supported for local serving; weights are on Hugging Face and ModelScope.
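As a hedged sketch, a locally served GLM-4.6 (for example behind vLLM's OpenAI-compatible server) can be queried like any OpenAI-style endpoint; the base URL, API key, and registered model name below are deployment-specific placeholders.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g., a local vLLM server

resp = client.chat.completions.create(
    model="glm-4.6",  # use the model name your server or provider registers
    messages=[{"role": "user", "content": "Write a Python function that parses ISO-8601 timestamps."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)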


Summary

GLM-4.6 is an incremental but material step: a 200K context window, ~15% token reduction on CC-Bench versus GLM-4.5, near-parity task win-rate with Claude Sonnet 4, and immediate availability via Z.ai, OpenRouter, and open-weight artifacts for local serving.

FAQs

1) What are the context and output token limits?
GLM-4.6 supports a 200K input context and 128K maximum output tokens.

2) Are open weights available and under what license?
Yes. The Hugging Face model card lists open weights with License: MIT and a 357B-parameter MoE configuration (BF16/F32 tensors).

3) How does GLM-4.6 compare to GLM-4.5 and Claude Sonnet 4 on applied tasks?
On the extended CC-Bench, GLM-4.6 reports ~15% fewer tokens vs. GLM-4.5 and near-parity with Claude Sonnet 4 (48.6% win-rate).

4) Can I run GLM-4.6 locally?
Yes. Zhipu provides weights on Hugging Face/ModelScope and documents local inference with vLLM and SGLang; community quantizations are appearing for workstation-class hardware.



Modernize fraud prevention: GraphStorm v0.5 for real-time inference

Fraud continues to cause significant financial damage globally, with U.S. consumers alone losing $12.5 billion in 2024—a 25% increase from the previous year according to the Federal Trade Commission. This surge stems not from more frequent attacks, but from fraudsters’ increasing sophistication. As fraudulent activities become more complex and interconnected, conventional machine learning approaches fall short by analyzing transactions in isolation, unable to capture the networks of coordinated activities that characterize modern fraud schemes.
Graph neural networks (GNNs) effectively address this challenge by modeling relationships between entities—such as users sharing devices, locations, or payment methods. By analyzing both network structures and entity attributes, GNNs are effective at identifying sophisticated fraud schemes where perpetrators mask individual suspicious activities but leave traces in their relationship networks. However, implementing GNN-based online fraud prevention in production environments presents unique challenges: achieving sub-second inference responses, scaling to billions of nodes and edges, and maintaining operational efficiency for model updates. In this post, we show you how to overcome these challenges using GraphStorm, particularly the new real-time inference capabilities of GraphStorm v0.5.
Previous solutions required tradeoffs between capability and simplicity. Our initial DGL approach provided comprehensive real-time capabilities but demanded intricate service orchestration—including manually updating endpoint configurations and payload formats after retraining with new hyperparameters. This approach also lacked model flexibility, requiring customization of GNN models and configurations when using architectures beyond relational graph convolutional networks (RGCN). Subsequent in-memory DGL implementations reduced complexity but encountered scalability limitations with enterprise data volumes. We built GraphStorm to bridge this gap, by introducing distributed training and high-level APIs that help simplify GNN development at enterprise scale.
In a recent blog post, we illustrated the capability and simplicity of GraphStorm’s enterprise-scale GNN model training and offline inference. While offline GNN fraud detection can identify fraudulent transactions after they occur, preventing financial loss requires stopping fraud before it happens. GraphStorm v0.5 makes this possible through native real-time inference support on Amazon SageMaker AI. GraphStorm v0.5 delivers two innovations: streamlined endpoint deployment that reduces weeks of custom engineering—coding SageMaker entry point files, packaging model artifacts, and calling SageMaker deployment APIs—to a single-command operation, and a standardized payload specification that helps simplify client integration with real-time inference services. These capabilities enable sub-second node classification tasks like fraud prevention, empowering organizations to proactively counter fraud threats with scalable, operationally straightforward GNN solutions.
To showcase these capabilities, this post presents a fraud prevention solution. Through this solution, we show how a data scientist can transition a trained GNN model to production-ready inference endpoints with minimal operational overhead. If you’re interested in implementing GNN-based models for real-time fraud prevention or similar business cases, you can adapt the approaches presented here to create your own solutions.
Solution overview
Our proposed solution is a 4-step pipeline as shown in the following figure. The pipeline starts at step 1 with transaction graph export from an online transaction processing (OLTP) graph database to scalable storage (Amazon Simple Storage Service (Amazon S3) or Amazon EFS), followed by distributed model training in step 2. Step 3 is GraphStorm v0.5’s simplified deployment process that creates SageMaker real-time inference endpoints with one command. After SageMaker AI has deployed the endpoint successfully, a client application integrates in step 4 with the OLTP graph database that processes live transaction streams. By querying the graph database, the client prepares subgraphs around the transactions to be predicted, converts each subgraph into the standardized payload format, and invokes the deployed endpoint for real-time prediction.
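To illustrate step 4, the following is a sketch of the client-side call with boto3; the endpoint name is the one printed by the deployment script, and the payload fields shown are illustrative placeholders. Consult the GraphStorm v0.5 payload specification for the exact schema.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "targets": [{"node_type": "transaction", "node_id": "T12345"}],  # illustrative field names
    "graph": {"nodes": [], "edges": []},  # filled with the subgraph extracted from the Neptune database
}

response = runtime.invoke_endpoint(
    EndpointName="graphstorm-fraud-endpoint",  # name printed by the launch script (placeholder)
    ContentType="application/json",
    Body=json.dumps(payload),
)
prediction = json.loads(response["Body"].read())
print(prediction)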

To provide concrete implementation details for each step in the real-time inference solution, we demonstrate the complete workflow using the publicly available IEEE-CIS fraud detection task.
Note: This example uses a Jupyter notebook as the controller of the overall four-step pipeline for simplicity. For more production-ready design, see the architecture described in Build a GNN-based real-time fraud detection solution.
Prerequisites
To run this example, you need an AWS account that the example’s AWS Cloud Development Kit (AWS CDK) code uses to create required resources, including Amazon Virtual Private Cloud (Amazon VPC), an Amazon Neptune database, Amazon SageMaker AI, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, and related roles and permissions.
Note: These resources incur costs during execution (approximately $6 per hour with default settings). Monitor usage carefully and review pricing pages for these services before proceeding. Follow cleanup instructions at the end to avoid ongoing charges.
Hands-on example: Real-time fraud prevention with IEEE-CIS dataset
All implementation code for this example, including Jupyter notebooks and supporting Python scripts, is available in our public repository. The repository provides a complete end-to-end implementation that you can directly execute and adapt for your own fraud prevention use cases.
Dataset and task overview
This example uses the IEEE-CIS fraud detection dataset, containing 500,000 anonymized transactions with approximately 3.5% fraudulent cases. The dataset includes 392 categorical and numerical features, with key attributes like card types, product types, addresses, and email domains forming the graph structure shown in the following figure. Each transaction (with an isFraud label) connects to Card Type, Location, Product Type, and Purchaser and Recipient email domain entities, creating a heterogeneous graph that enables GNN models to detect fraud patterns through entity relationships.

Unlike our previous post that demonstrated GraphStorm plus Amazon Neptune Analytics for offline analysis workflows, this example uses a Neptune database as the OLTP graph store, optimized for the quick subgraph extraction required during real-time inference. Following the graph design, the tabular IEEE-CIS data is converted to a set of CSV files compatible with the Neptune database format, allowing direct loading into both the Neptune database and GraphStorm’s GNN model training pipeline with a single set of files.
Step 0: Environment setup
Step 0 establishes the running environment required for the four-step fraud prevention pipeline. Complete setup instructions are available in the implementation repository.
To run the example solution, you need to deploy an AWS CloudFormation stack through the AWS CDK. This stack creates the Neptune DB instance, the VPC to place it in, and appropriate roles and security groups. It additionally creates a SageMaker AI notebook instance, from which you run the example notebooks that come with the repository.

git clone https://github.com/aws-samples/amazon-neptune-samples.git
cd neptune-database-graphstorm-online-inference/neptune-db-cdk
# Ensure you have CDK installed and have appropriate credentials set up
cdk deploy

When deployment is finished (it takes approximately 10 minutes for required resources to be ready), the AWS CDK prints a few outputs, one of which is the name of the SageMaker notebook instance you use to run through the notebooks:

# Example output
NeptuneInfraStack.NotebookInstanceName = arn:aws:sagemaker:us-east-1:012345678912:notebook-instance/NeptuneNotebook-9KgSB9XXXXXX

You can navigate to the SageMaker AI notebook UI, find the corresponding notebook instance, and select its Open Jupyterlab link to access the notebook.
Alternatively, you can use the AWS Command Line Interface (AWS CLI) to get a pre-signed URL to access the notebook. You will need to replace the <notebook-instance-name> with the actual notebook instance name.

aws sagemaker create-presigned-notebook-instance-url --notebook-instance-name <notebook-instance-name>

When you’re in the notebook instance web console, open the first notebook, 0-Data-Preparation.ipynb, to start going through the example.
Step 1: Graph construction
In the Notebook 0-Data-Preparation, you transform the tabular IEEE-CIS dataset into the heterogeneous graph structure shown in the figure at the start of this section. The provided Jupyter Notebook extracts entities from transaction features, creating Card Type nodes from card1–card6 features, Purchaser and Recipient nodes from email domains, Product Type nodes from product codes, and Location nodes from geographic information. The transformation establishes relationships between transactions and these entities, generating graph data in Neptune import format for direct ingestion into the OLTP graph store. The create_neptune_db_data() function orchestrates this entity extraction and relationship creation process across all node types (which takes approximately 30 seconds).

GRAPH_NAME = "ieee-cis-fraud-detection"
PROCESSED_PREFIX = f"./{GRAPH_NAME}"
ID_COLS = "card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain"
CAT_COLS = "M1,M2,M3,M4,M5,M6,M7,M8,M9"
# Lists of columns to keep from each file
COLS_TO_KEEP = {
    "transaction.csv": (
        ID_COLS.split(",")
        + CAT_COLS.split(",")
        # Numerical features without missing values
        + [f"C{idx}" for idx in range(1, 15)]
        + ["TransactionID", "TransactionAmt", "TransactionDT", "isFraud"]
    ),
    "identity.csv": ["TransactionID", "DeviceType"],
}

create_neptune_db_data(
    data_prefix="./input-data/",
    output_prefix=PROCESSED_PREFIX,
    id_cols=ID_COLS,
    cat_cols=CAT_COLS,
    cols_to_keep=COLS_TO_KEEP,
    num_chunks=1,
)

This notebook also generates the JSON configuration file required by GraphStorm’s GConstruct command and executes the graph construction process. This GConstruct command transforms the Neptune-formatted data into a distributed binary graph format optimized for GraphStorm’s training pipeline, which partitions the heterogeneous graph structure across compute nodes to enable scalable model training on industry-scale graphs (measured in billions of nodes and edges). For the IEEE-CIS data, the GConstruct command takes 90 seconds to complete.
In the Notebook 1-Load-Data-Into-Neptune-DB, you load the CSV data into the Neptune database instance (takes approximately 9 minutes), which makes them available for online inference. During online inference, after selecting a transaction node, you query the Neptune database to get the graph neighborhood of the target node, retrieving the features of every node in the neighborhood and the subgraph structure around the target.
Step 2: Model training
After you have converted the data into the distributed binary graph format, it’s time to train a GNN model. GraphStorm provides command-line scripts to train a model without writing code. In the Notebook 2-Model-Training, you train a GNN model using GraphStorm’s node classification command with configuration managed through YAML files. The baseline configuration defines a two-layer RGCN model with 128-dimensional hidden layers, training for 4 epochs with a 0.001 learning rate and 1024 batch size, which takes approximately 100 seconds for 1 epoch of model training and evaluation in an ml.m5.4xlarge instance. To improve fraud detection accuracy, the notebook provides more advanced model configurations like the command below.

!python -m graphstorm.run.gs_node_classification \
    --workspace ./ \
    --part-config ieee_gs/ieee-cis.json \
    --num-trainers 1 \
    --cf ieee_nc.yaml \
    --eval-metric roc_auc \
    --save-model-path ./model-simple/ \
    --topk-model-to-save 1 \
    --imbalance-class-weights 0.1,1.0

Arguments in this command address the dataset's label imbalance, where only 3.5% of transactions are fraudulent, by using AUC-ROC as the evaluation metric and applying class weights. The command also saves the best-performing model along with the configuration files required for endpoint deployment. Advanced configurations can further improve model performance through techniques like HGT encoders, multi-head attention, and a class-weighted cross-entropy loss function, though these optimizations increase computational requirements. GraphStorm enables these changes through runtime arguments and YAML configurations, reducing the need for code modifications.
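
To make the class-weighting idea concrete, here is a minimal PyTorch sketch, illustrative only and not GraphStorm internals, of a cross-entropy loss that down-weights the majority legitimate class in the same spirit as --imbalance-class-weights 0.1,1.0:

# Illustrative sketch of class-weighted cross entropy for an imbalanced
# fraud label (class 0 = legitimate, class 1 = fraudulent). Not GraphStorm code.
import torch
import torch.nn as nn

logits = torch.randn(8, 2)                        # hypothetical model outputs for 8 nodes
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1])   # mostly legitimate, one fraud

# Weight 0.1 for the majority class and 1.0 for the rare fraud class,
# analogous to --imbalance-class-weights 0.1,1.0
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.1, 1.0]))
loss = criterion(logits, labels)
print(loss.item())
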
Step 3: Real-time endpoint deployment
In the Notebook 3-GraphStorm-Endpoint-Deployment, you deploy the real-time endpoint through GraphStorm v0.5's launch script. The deployment requires three model artifacts generated during training: the saved model file that contains the weights, the updated graph construction JSON file with feature transformation metadata, and the runtime-updated training configuration YAML file. These artifacts let GraphStorm recreate the exact training configuration and model for consistent inference behavior. The updated graph construction JSON and training configuration YAML files contain the settings needed to restore the trained model on the endpoint and to process incoming request payloads, so be sure to use the updated files, not the originals, for endpoint deployment.
GraphStorm uses SageMaker AI bring your own container (BYOC) to provide a consistent inference environment. You need to build and push the GraphStorm real-time Docker image to Amazon ECR using the provided shell scripts. The Docker image contains the dependencies required for GraphStorm's real-time inference capabilities and runs on SageMaker AI managed infrastructure.
To deploy the endpoint, you can use the GraphStorm-provided launch_realtime_endpoint.py script that helps you gather required artifacts and creates the necessary SageMaker AI resources to deploy an endpoint. The script accepts the Amazon ECR image URI, IAM role, model artifact paths, and S3 bucket configuration, automatically handling endpoint provisioning and configuration. By default, the script waits for endpoint deployment to be complete before exiting. When completed, it prints the name and AWS Region of the deployed endpoint for subsequent inference requests. You will need to replace the fields enclosed by <> with the actual values of your environment.

!python ~/graphstorm/sagemaker/launch/launch_realtime_endpoint.py \
    --image-uri <account_id>.dkr.ecr.<aws_region>.amazonaws.com/graphstorm:sagemaker-endpoint-cpu \
    --role arn:aws:iam::<account_id>:role/<your_role> \
    --region <aws_region> \
    --restore-model-path <restore-model-path>/models/epoch-1/ \
    --model-yaml-config-file <restore-model-path>/models/GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml \
    --graph-json-config-file <restore-model-path>/models/data_transform_new.json \
    --infer-task-type node_classification \
    --upload-tarfile-s3 s3://<cdk-created-bucket> \
    --model-name ieee-fraud-detect

Step 4: Real-time inference
In the Notebook 4-Sample-Graph-and-Invoke-Endpoint, you build a basic client application that integrates with the deployed GraphStorm endpoint to perform real-time fraud prevention on incoming transactions. The inference process accepts transaction data through standardized JSON payloads, executes node classification predictions in a few hundred milliseconds, and returns fraud probability scores that enable immediate decision-making.
An end-to-end inference call for a node that already exists in the graph has three distinct stages:

Graph sampling from the Neptune database. For a given target node that already exists in the graph, retrieve its k-hop neighborhood with a fanout limit, that is, limiting the number of neighbors retrieved at each hop by a threshold.
Payload preparation for inference. Neptune returns graphs using GraphSON, a specialized JSON-like data format used to describe graph data. At this step, you need to convert the returned GraphSON to GraphStorm’s own JSON specification. This step is performed on the inference client, in this case a SageMaker notebook instance.
Model inference using a SageMaker endpoint. After the payload is prepared, you send an inference request to a SageMaker endpoint that has loaded a previously trained model snapshot. The endpoint receives the request, performs any feature transformations needed (such as converting categorical features to one-hot encoding), creates the binary graph representation in memory, and makes a prediction for the target node using the graph neighborhood and trained model weights. The response is encoded to JSON and sent back to the client.
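
A minimal client-side sketch of that third stage, assuming boto3 credentials are configured; the endpoint name and the payload contents are placeholders (the notebooks build the actual GraphStorm-format payload):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder: build this dict from the sampled Neptune neighborhood
# following GraphStorm's payload specification (see Notebook 4).
payload = {}

response = runtime.invoke_endpoint(
    EndpointName="<your-graphstorm-endpoint-name>",  # printed by the launch script
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result)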

An example response from the endpoint would look like:

{'status_code': 200,
 'request_uid': '877042dbc361fc33',
 'message': 'Request processed successfully.',
 'error': '',
 'data': {
    'results': [
            {
                'node_type': 'Transaction',
                'node_id': '2991260',
                'prediction': [0.995966911315918, 0.004033133387565613]
            }
        ]
    }
}

The data of interest for the transaction you scored are the node_id and its prediction entry. The prediction list holds the raw scores the model produces for class 0 (legitimate) at index 0 and class 1 (fraudulent) at index 1. In this example, the model marks the transaction as most likely legitimate. You can find the full GraphStorm response specification in the GraphStorm documentation.
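
For example, with the parsed response loaded into a dict named result (as in the sketch above), a few lines pull out the fraud score for downstream decisioning; the field names follow the sample response shown:

# 'result' is the parsed JSON response shown above
record = result["data"]["results"][0]
fraud_probability = record["prediction"][1]   # index 1 = class 1 (fraudulent)
print(record["node_id"], f"fraud probability: {fraud_probability:.4f}")
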
Complete implementation examples, including client code and payload specifications, are provided in the repository to guide integration with production systems.
Clean up
To stop accruing costs on your account, you need to delete the AWS resources that you created with the AWS CDK at the Environment Setup step.
You must first delete the SageMaker endpoint created in Step 3 for cdk destroy to complete. See Delete Endpoints and Resources for more options to delete an endpoint. When done, you can run the following from the repository's root:

cd neptune-database-graphstorm-online-inference/neptune-db-cdk
cdk destroy

See the AWS CDK docs for more information about how to use cdk destroy, or see the CloudFormation docs for how to delete a stack from the console UI. By default, the cdk destroy command does not delete the model artifacts and processed graph data stored in the S3 bucket during the training and deployment process. You must remove them manually. See Deleting a general purpose bucket for information about how to empty and delete an S3 bucket the AWS CDK has created.
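
If you prefer to script the manual cleanup, here is a sketch using boto3 that covers both steps, the endpoint resources before cdk destroy and the artifact bucket afterwards; all names are placeholders:

import boto3

# 1) Delete the real-time endpoint resources created in Step 3
#    (placeholders: use the names created by launch_realtime_endpoint.py).
sm = boto3.client("sagemaker")
sm.delete_endpoint(EndpointName="<endpoint-name>")
sm.delete_endpoint_config(EndpointConfigName="<endpoint-config-name>")
sm.delete_model(ModelName="<model-name>")

# 2) After `cdk destroy`, empty and delete the artifact bucket created by the CDK stack.
s3 = boto3.resource("s3")
bucket = s3.Bucket("<cdk-created-bucket>")
bucket.objects.all().delete()
bucket.object_versions.delete()   # only needed if versioning is enabled
bucket.delete()
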
Conclusion
Graph neural networks address complex fraud prevention challenges by modeling relationships between entities that traditional machine learning approaches miss when analyzing transactions in isolation. GraphStorm v0.5 simplifies real-time GNN inference in two ways: a single command creates an endpoint that previously required coordinating multiple services, and a standardized payload specification streamlines client integration with the real-time inference service. Organizations can now deploy enterprise-scale fraud prevention endpoints with a few streamlined commands instead of weeks of custom engineering.
To implement GNN-based fraud prevention with your own data:

Review the GraphStorm documentation for model configuration options and deployment specifications.
Adapt this IEEE-CIS example to your fraud prevention dataset by modifying the graph construction and feature engineering steps using the complete source code and tutorials available in our GitHub repository.
Access step-by-step implementation guidance to build production-ready fraud prevention solutions with GraphStorm v0.5’s enhanced capabilities using your enterprise data.

About the authors
Jian Zhang is a Senior Applied Scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection and decoration image generation. He has successfully developed graph-based machine learning solutions, particularly graph neural networks, for customers in China, the US, and Singapore. As an advocate for AWS graph capabilities, Zhang has given many public presentations about GraphStorm, GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Theodore Vasiloudis is a Senior Applied Scientist at AWS, where he works on distributed machine learning systems and algorithms. He led the development of GraphStorm Processing, the distributed graph processing library for GraphStorm and is a core developer for GraphStorm. He received his PhD in Computer Science from KTH Royal Institute of Technology, Stockholm, in 2019.
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in graph database. He is now leading the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in computer systems and architecture at the Fudan University, Shanghai, in 2014.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist—a field in which he holds a PhD.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.

Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

Anthropic has released Claude Sonnet 4.5, setting a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash + file edit), averaged over 10 runs, no test-time compute, 200K “thinking” budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises this to 82.0%.

Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.

Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks — a practical jump over earlier limits and directly relevant to agent reliability.

Reasoning/math. The release notes “substantial gains” across common reasoning and math evals, though exact per-benchmark numbers (e.g., the AIME configuration) are not broken out in the announcement. Safety posture is ASL-3 with strengthened defenses against prompt injection.

https://www.anthropic.com/news/claude-sonnet-4-5

What’s there for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate “using a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where you can run it?

Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.

AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).

Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.

GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.
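
For the Anthropic API path listed above, a minimal call with the official Python SDK looks roughly like this; an ANTHROPIC_API_KEY in the environment is assumed, the prompt and token budget are illustrative, and the model ID is the one cited in the article:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-5",          # model ID cited above
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this stack trace and propose a fix: ..."}],
)
print(message.content[0].text)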

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is built for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the “best for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.

Introducing Claude Sonnet 4.5—the best coding model in the world. It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains on tests of reasoning and math. pic.twitter.com/7LwV9WPNAv — Claude (@claudeai) September 29, 2025

The post Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results appeared first on MarkTechPost.

How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment?

In this tutorial, we set out to build an advanced interactive dashboard using Dash, Plotly, and Bootstrap. We highlight not only how these tools enable us to design layouts and visualizations, but also how Dash’s callback mechanism links controls to outputs, allowing for real-time responsiveness. By combining local execution with the ability to run in cloud platforms like Google Colab, we explore a workflow that is both flexible and practical. Check out the FULL CODES here.

!pip install dash plotly pandas numpy dash-bootstrap-components

import dash
from dash import dcc, html, Input, Output, callback, dash_table
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import dash_bootstrap_components as dbc

print("Generating sample data...")
np.random.seed(42)

We begin by installing and importing the necessary components, including Dash, Plotly, Pandas, NumPy, and Bootstrap, to set up our dashboard environment. We also initialize random seeds and generate sample data so that we can consistently test the interactive features as we build them. Check out the FULL CODES here.

start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
dates = pd.date_range(start=start_date, end=end_date, freq='D')
stock_names = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA']

all_data = []
base_prices = {'AAPL': 150, 'GOOGL': 120, 'MSFT': 250, 'AMZN': 100, 'TSLA': 200}

for stock in stock_names:
    print(f"Creating data for {stock}...")
    base_price = base_prices[stock]

    n_days = len(dates)
    returns = np.random.normal(0.0005, 0.025, n_days)
    prices = np.zeros(n_days)
    prices[0] = base_price

    for i in range(1, n_days):
        prices[i] = prices[i-1] * (1 + returns[i])

    volumes = np.random.lognormal(15, 0.5, n_days).astype(int)

    stock_df = pd.DataFrame({
        'Date': dates,
        'Stock': stock,
        'Price': prices,
        'Volume': volumes,
        'Returns': np.concatenate([[0], np.diff(prices) / prices[:-1]]),
        'Sector': np.random.choice(['Technology', 'Consumer', 'Automotive'], 1)[0]
    })

    all_data.append(stock_df)

df = pd.concat(all_data, ignore_index=True)

df['Date'] = pd.to_datetime(df['Date'])
df_sorted = df.sort_values(['Stock', 'Date']).reset_index(drop=True)

print("Calculating technical indicators...")
df_sorted['MA_20'] = df_sorted.groupby('Stock')['Price'].transform(lambda x: x.rolling(20, min_periods=1).mean())
df_sorted['Volatility'] = df_sorted.groupby('Stock')['Returns'].transform(lambda x: x.rolling(30, min_periods=1).std())

df = df_sorted.copy()

print(f"Data generated successfully! Shape: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Stocks: {df['Stock'].unique().tolist()}")
We generate synthetic stock data, including prices, volumes, and returns, for multiple tickers across a specified date range. We calculate moving averages and volatility to enrich the dataset with useful technical indicators, providing a strong foundation for building interactive visualizations. Check out the FULL CODES here.

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            html.H1(" Advanced Financial Dashboard", className="text-center mb-4"),
            html.P(f"Interactive dashboard with {len(df)} data points across {len(stock_names)} stocks",
                   className="text-center text-muted"),
            html.Hr()
        ])
    ]),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H5(" Dashboard Controls", className="card-title"),

                    html.Label("Select Stocks:", className="fw-bold mt-3"),
                    dcc.Dropdown(
                        id='stock-dropdown',
                        options=[{'label': f'{stock} ({base_prices[stock]})', 'value': stock}
                                 for stock in stock_names],
                        value=['AAPL', 'GOOGL'],
                        multi=True,
                        placeholder="Choose stocks to analyze..."
                    ),

                    html.Label("Date Range:", className="fw-bold mt-3"),
                    dcc.DatePickerRange(
                        id='date-picker-range',
                        start_date='2023-06-01',
                        end_date='2024-06-01',
                        display_format='YYYY-MM-DD',
                        style={'width': '100%'}
                    ),

                    html.Label("Chart Style:", className="fw-bold mt-3"),
                    dcc.RadioItems(
                        id='chart-type',
                        options=[
                            {'label': ' Line Chart', 'value': 'line'},
                            {'label': ' Area Chart', 'value': 'area'},
                            {'label': ' Scatter Plot', 'value': 'scatter'}
                        ],
                        value='line',
                        labelStyle={'display': 'block', 'margin': '5px'}
                    ),

                    dbc.Checklist(
                        id='show-ma',
                        options=[{'label': ' Show Moving Average', 'value': 'show'}],
                        value=[],
                        style={'margin': '10px 0'}
                    ),
                ])
            ], className="h-100")
        ], width=3),

        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Stock Price Analysis"),
                dbc.CardBody([
                    dcc.Graph(id='main-chart', style={'height': '450px'})
                ])
            ])
        ], width=9)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="avg-price", className="text-primary mb-0"),
                    html.Small("Average Price", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="total-volume", className="text-success mb-0"),
                    html.Small("Total Volume", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="price-range", className="text-info mb-0"),
                    html.Small("Price Range", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="data-points", className="text-warning mb-0"),
                    html.Small("Data Points", className="text-muted")
                ])
            ])
        ], width=3)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Trading Volume"),
                dbc.CardBody([
                    dcc.Graph(id='volume-chart', style={'height': '300px'})
                ])
            ])
        ], width=6),
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Returns Distribution"),
                dbc.CardBody([
                    dcc.Graph(id='returns-chart', style={'height': '300px'})
                ])
            ])
        ], width=6)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Latest Stock Data"),
                dbc.CardBody([
                    dash_table.DataTable(
                        id='data-table',
                        columns=[
                            {'name': 'Stock', 'id': 'Stock'},
                            {'name': 'Date', 'id': 'Date'},
                            {'name': 'Price ($)', 'id': 'Price', 'type': 'numeric',
                             'format': {'specifier': '.2f'}},
                            {'name': 'Volume', 'id': 'Volume', 'type': 'numeric',
                             'format': {'specifier': ',.0f'}},
                            {'name': 'Daily Return (%)', 'id': 'Returns', 'type': 'numeric',
                             'format': {'specifier': '.2%'}}
                        ],
                        style_cell={'textAlign': 'center', 'fontSize': '14px', 'padding': '10px'},
                        style_header={'backgroundColor': 'rgb(230, 230, 230)', 'fontWeight': 'bold'},
                        style_data_conditional=[
                            {
                                'if': {'filter_query': '{Returns} > 0'},
                                'backgroundColor': '#d4edda'
                            },
                            {
                                'if': {'filter_query': '{Returns} < 0'},
                                'backgroundColor': '#f8d7da'
                            }
                        ],
                        page_size=15,
                        sort_action="native",
                        filter_action="native"
                    )
                ])
            ])
        ])
    ])
], fluid=True)

We define the app layout with Bootstrap rows and cards, where we place controls (dropdown, date range, chart style, MA toggle) alongside the main graph. We add metric cards, two secondary graphs, and a sortable/filterable data table, so we organize everything into a responsive, clean interface that we can wire up to callbacks next. Check out the FULL CODES here.

@callback(
    [Output('main-chart', 'figure'),
     Output('volume-chart', 'figure'),
     Output('returns-chart', 'figure'),
     Output('data-table', 'data'),
     Output('avg-price', 'children'),
     Output('total-volume', 'children'),
     Output('price-range', 'children'),
     Output('data-points', 'children')],
    [Input('stock-dropdown', 'value'),
     Input('date-picker-range', 'start_date'),
     Input('date-picker-range', 'end_date'),
     Input('chart-type', 'value'),
     Input('show-ma', 'value')]
)
def update_all_charts(selected_stocks, start_date, end_date, chart_type, show_ma):
    print(f"Callback triggered with stocks: {selected_stocks}")

    if not selected_stocks:
        selected_stocks = ['AAPL']

    filtered_df = df[
        (df['Stock'].isin(selected_stocks)) &
        (df['Date'] >= start_date) &
        (df['Date'] <= end_date)
    ].copy()

    print(f"Filtered data shape: {filtered_df.shape}")

    if filtered_df.empty:
        filtered_df = df[df['Stock'].isin(selected_stocks)].copy()
        print(f"Using all available data. Shape: {filtered_df.shape}")

    if chart_type == 'line':
        main_fig = px.line(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices – {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    elif chart_type == 'area':
        main_fig = px.area(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices – {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    else:
        main_fig = px.scatter(filtered_df, x='Date', y='Price', color='Stock',
                              title=f'Stock Prices – {chart_type.title()} View',
                              labels={'Price': 'Price ($)', 'Date': 'Date'})

    if 'show' in show_ma:
        for stock in selected_stocks:
            stock_data = filtered_df[filtered_df['Stock'] == stock]
            if not stock_data.empty:
                main_fig.add_scatter(
                    x=stock_data['Date'],
                    y=stock_data['MA_20'],
                    mode='lines',
                    name=f'{stock} MA-20',
                    line=dict(dash='dash', width=2)
                )

    main_fig.update_layout(height=450, showlegend=True, hovermode='x unified')

    volume_fig = px.bar(filtered_df, x='Date', y='Volume', color='Stock',
                        title='Daily Trading Volume',
                        labels={'Volume': 'Volume (shares)', 'Date': 'Date'})
    volume_fig.update_layout(height=300, showlegend=True)

    returns_fig = px.histogram(filtered_df.dropna(subset=['Returns']),
                               x='Returns', color='Stock',
                               title='Daily Returns Distribution',
                               labels={'Returns': 'Daily Returns', 'count': 'Frequency'},
                               nbins=50)
    returns_fig.update_layout(height=300, showlegend=True)

    if not filtered_df.empty:
        avg_price = f"${filtered_df['Price'].mean():.2f}"
        total_volume = f"{filtered_df['Volume'].sum():,.0f}"
        price_range = f"${filtered_df['Price'].min():.0f} – ${filtered_df['Price'].max():.0f}"
        data_points = f"{len(filtered_df):,}"

        table_data = filtered_df.nlargest(100, 'Date')[
            ['Stock', 'Date', 'Price', 'Volume', 'Returns']
        ].round(4).to_dict('records')

        for row in table_data:
            row['Date'] = row['Date'].strftime('%Y-%m-%d') if pd.notnull(row['Date']) else ''
    else:
        avg_price = "No data"
        total_volume = "No data"
        price_range = "No data"
        data_points = "0"
        table_data = []

    return (main_fig, volume_fig, returns_fig, table_data,
            avg_price, total_volume, price_range, data_points)

We wire up Dash’s callback to connect our controls to every output, so changing any input instantly updates charts, stats, and the table. We filter the dataframe by selections and dates, build figures (plus optional MA overlays), and compute summary metrics. Finally, we format recent rows for the table so we can inspect the latest results at a glance. Check out the FULL CODES here.

if __name__ == '__main__':
    print("Starting Dash app...")
    print("Available data preview:")
    print(df.head())
    print(f"Total rows: {len(df)}")

    # Assumption: with Dash >= 2.11, inline notebook rendering is requested via
    # jupyter_mode/jupyter_height (the older JupyterDash-style mode=/height=
    # arguments are not accepted by dash.Dash.run).
    app.run(jupyter_mode='inline', port=8050, debug=True, jupyter_height=1000)

    # app.run(debug=True)  # plain server mode for local development

We set up the entry point for running the app. We print a quick preview of the dataset to confirm what's available, and then launch the Dash server. In Colab, we can run it inline; for local desktop development, we can simply switch to the regular app.run(debug=True).

In conclusion, we integrate interactive charts, responsive layouts, and Dash’s callback mechanism into a cohesive application. We see how the callbacks orchestrate communication between user input and dynamic updates, turning static visuals into powerful interactive tools. With the ability to operate smoothly both locally and online, this approach provides a versatile foundation that we can extend for broader applications.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment? appeared first on MarkTechPost.

Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required

oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and KV-cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10 GB while handling up to ~100K tokens of context.

What’s new?

The latest releases add: (1) KV cache reads/writes that bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) Llama-3 FlashAttention-2 for stability; and (4) GPT-OSS memory reductions via “flash-attention-like” kernels and chunked MLP. The table published by the maintainer reports end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; noted throughput “≈ 1 tok/2 s”.

GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.

Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How it works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. It uses FlashAttention-2 with online softmax so the full attention matrix is never materialized, and chunks large MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the oLLM project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.

Supported models and GPUs

Out of the box the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper; Qwen3-Next requires a dev build of Transformers (≥ 4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total, ~3B active) that vendors typically position for multi-A100/H100 deployments; oLLM’s claim is that you can execute it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. This stands in contrast to vLLM docs, which suggest multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows Inference(…).DiskCache(…) wiring and generate(…) with a streaming text callback. (PyPI currently lists 0.4.1; the README references 0.4.2 changes.)

Performance expectations and trade-offs

Throughput: The maintainer reports ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti—usable for batch/offline analytics, not for interactive chat. SSD latency dominates.

Storage pressure: Long contexts require very large KV caches; oLLM writes these to SSD to keep VRAM flat. This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach is still storage-bound and workload-specific.

Hardware reality check: Running Qwen3-Next-80B “on consumer hardware” is feasible with oLLM’s disk-centric design, but typical high-throughput inference for this model still expects multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a drop-in replacement for production serving stacks like vLLM/TGI.

Bottom line

oLLM pushes a clear design point: keep precision high, push memory to SSD, and make ultra-long contexts viable on a single 8 GB NVIDIA GPU. It won’t match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it’s a pragmatic way to execute 8B–20B models comfortably and even step up to MoE-80B if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.

Check out the GITHUB REPO here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required appeared first on MarkTechPost.

This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead

Can your AI security stack profile, reason, and neutralize a live security threat in ~220 ms—without a central round-trip? A team of researchers from Google and University of Arkansas at Little Rock outline an agentic cybersecurity “immune system” built from lightweight, autonomous sidecar AI agents colocated with workloads (Kubernetes pods, API gateways, edge services). Instead of exporting raw telemetry to a SIEM and waiting on batched classifiers, each agent learns local behavioral baselines, evaluates anomalies using federated intelligence, and applies least-privilege mitigations directly at the point of execution. In a controlled cloud-native simulation, this edge-first loop cut decision-to-mitigation to ~220 ms (≈3.4× faster than centralized pipelines), achieved F1 ≈ 0.89, and held host overhead under 10% CPU/RAM—evidence that collapsing detection and enforcement into the workload plane can deliver both speed and fidelity without material resource penalties.

https://arxiv.org/abs/2509.20640

What does “Profile → Reason → Neutralize” mean at the primitive level?

Profile. Agents are deployed as sidecars/daemonsets alongside microservices and API gateways. They build behavioral fingerprints from execution traces, syscall paths, API call sequences, and inter-service flows. This local baseline adapts to short-lived pods, rolling deploys, and autoscaling—conditions that routinely break perimeter controls and static allowlists. Profiling is not just a threshold on counts; it retains structural features (order, timing, peer set) that allow detection of zero-day-like deviations. The research team frames this as continuous, context-aware baselining across ingestion and sensing layers so that “normal” is learned per workload and per identity boundary.

Reason. When an anomaly appears (for example, an unusual burst of high-entropy uploads from a low-trust principal or a never-seen-before API call graph), the local agent mixes anomaly scores with federated intelligence—shared indicators and model deltas learned by peers—to produce a risk estimate. Reasoning is designed to be edge-first: the agent decides without a round-trip to a central adjudicator, and the trust decision is continuous rather than a static role gate. This aligns with zero-trust—identity and context are evaluated at each request, not just at session start—and it reduces central bottlenecks that add seconds of latency under load.

Neutralize. If risk exceeds a context-sensitive threshold, the agent executes an immediate local control mapped to least-privilege actions: quarantine the container (pause/isolate), rotate a credential, apply a rate-limit, revoke a token, or tighten a per-route policy. Enforcement is written back to policy stores and logged with a human-readable rationale for audit. The fast path here is the core differentiator: in the reported evaluation, the autonomous path triggers in ~220 ms versus ~540–750 ms for centralized ML or firewall update pipelines, which translates into a ~70% latency reduction and fewer opportunities for lateral movement during the decision window.
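
As an illustration only (the paper does not publish this code), here is a deliberately simplified Python sketch of the profile → reason → neutralize loop; the threshold, scoring function, and mitigation primitive are assumptions:

from dataclasses import dataclass, field
from statistics import mean, pstdev

RISK_THRESHOLD = 0.8  # assumed; the paper describes context-sensitive thresholds

@dataclass
class Baseline:
    """Toy behavioral baseline over a single numeric signal (e.g., requests/min)."""
    history: list = field(default_factory=list)

    def anomaly_score(self, value: float) -> float:
        if len(self.history) < 10:
            self.history.append(value)
            return 0.0                        # still learning "normal"
        mu, sigma = mean(self.history), pstdev(self.history) or 1.0
        self.history.append(value)
        return min(1.0, abs(value - mu) / sigma / 5.0)   # squash a z-score into [0, 1]

def reason(anomaly: float, federated_hit: bool) -> float:
    """Blend local evidence with federated intelligence into a risk estimate."""
    return min(1.0, anomaly + (0.2 if federated_hit else 0.0))

def neutralize(pod: str, risk: float) -> None:
    """Apply a least-privilege mitigation locally (here just logged)."""
    if risk >= RISK_THRESHOLD:
        print(f"[mitigate] quarantining {pod} (risk={risk:.2f})")  # e.g., micro-policy update, token revoke

baseline = Baseline()
for req_rate in [50, 52, 49, 51, 50, 53, 48, 50, 52, 51, 400]:
    risk = reason(baseline.anomaly_score(req_rate), federated_hit=(req_rate > 300))
    neutralize(pod="payments-7f9c", risk=risk)

In a real deployment the mitigation call would hit a Kubernetes or gateway API rather than print, and the baseline would track structural features (call order, timing, peer set) rather than a single rate, as described above.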

Where do the numbers come from, and what were the baselines?

The research team evaluated the architecture in a Kubernetes-native simulation spanning API abuse and lateral-movement scenarios. Against two typical baselines—(i) static rule pipelines and (ii) a batch-trained classifier—the agentic approach reports Precision 0.91 / Recall 0.87 / F1 0.89, while the baselines land near F1 0.64 (rules) and F1 0.79 (baseline ML). Decision latency falls to ~220 ms for local enforcement, compared with ~540–750 ms for centralized paths that require coordination with a controller or external firewall. Resource overhead on host services remains below 10% in CPU/RAM.

https://arxiv.org/abs/2509.20640

Why does this matter for zero-trust engineering, not just research graphs?

Zero-trust (ZT) calls for continuous verification at request-time using identity, device, and context. In practice, many ZT deployments still defer to central policy evaluators, so they inherit control-plane latency and queueing pathologies under load. By moving risk inference and enforcement to the autonomous edge, the architecture turns ZT posture from periodic policy pulls into a set of self-contained, continuously learning controllers that execute least-privilege changes locally and then synchronize state. That design simultaneously reduces mean time-to-contain (MTTC) and keeps decisions near the blast radius, which helps when inter-pod hops are measured in milliseconds. The research team also formalizes federated sharing to distribute indicators/model deltas without heavy raw-data movement, which is relevant for privacy boundaries and multi-tenant SaaS.

How does it integrate with existing stacks—Kubernetes, APIs, and identity?

Operationally, the agents are co-located with workloads (sidecar or node daemon). On Kubernetes, they can hook CNI-level telemetry for flow features, container runtime events for process-level signals, and envoy/nginx spans at API gateways for request graphs. For identity, they consume claims from your IdP and compute continuous trust scores that factor recent behavior and environment (e.g., geo-risk, device posture). Mitigations are expressed as idempotent primitives—network micro-policy updates, token revocation, per-route quotas—so they are straightforward to roll back or tighten incrementally. The architecture’s control loop (sense → reason → act → learn) is strictly feedback-driven and supports both human-in-the-loop (policy windows, approval gates for high-blast-radius changes) and autonomy for low-impact actions.

What are the governance and safety guardrails?

Speed without auditability is a non-starter in regulated environments. The research team emphasizes explainable decision logs that capture which signals and thresholds led to the action, with signed and versioned policy/model artifacts. It also discusses privacy-preserving modes—keeping sensitive data local while sharing model updates; differentially private updates are mentioned as an option in stricter regimes. For safety, the system supports override/rollback and staged rollouts (e.g., canarying new mitigation templates in non-critical namespaces). This is consistent with broader security work on threats and guardrails for agentic systems; if your org is adopting multi-agent pipelines, cross-check against current threat models for agent autonomy and tool use.

How do the reported results translate to production posture?

The evaluation is a 72-hour cloud-native simulation with injected behaviors: API misuse patterns, lateral movement, and zero-day-like deviations. Real systems will add messier signals (e.g., noisy sidecars, multi-cluster networking, mixed CNI plugins), which affects both detection and enforcement timing. That said, the fast-path structure—local decision + local act—is topology-agnostic and should preserve order-of-magnitude latency gains so long as mitigations are mapped to primitives available in your mesh/runtime. For production, begin with observe-only agents to build baselines, then turn on mitigations for low-risk actions (quota clamps, token revokes), then gate high-blast-radius controls (network slicing, container quarantine) behind policy windows until confidence/coverage metrics are green.

How does this sit in the broader agentic-security landscape?

There is growing research on securing agent systems and using agent workflows for security tasks. The work discussed here focuses on defense via agent autonomy close to workloads. In parallel, other work tackles threat modeling for agentic AI, secure A2A protocol usage, and agentic vulnerability testing. If you adopt the architecture, pair it with a current agent-security threat model and a test harness that exercises tool-use boundaries and memory safety of agents.

Comparative Results (Kubernetes simulation)

Metric | Static rules pipeline | Baseline ML (batch classifier) | Agentic framework (edge autonomy)
Precision | 0.71 | 0.83 | 0.91
Recall | 0.58 | 0.76 | 0.87
F1 | 0.64 | 0.79 | 0.89
Decision-to-mitigation latency | ~750 ms | ~540 ms | ~220 ms
Host overhead (CPU/RAM) | Moderate | Moderate | <10%

Key Takeaways

Edge-first “cybersecurity immune system.” Lightweight sidecar/daemon AI agents colocated with workloads (Kubernetes pods, API gateways) learn behavioral fingerprints, decide locally, and enforce least-privilege mitigations without SIEM round-trips.

Measured performance. Reported decision-to-mitigation is ~220 ms—about 3.4× faster than centralized pipelines (≈540–750 ms)—with F1 ≈ 0.89 (P≈0.91, R≈0.87) in a Kubernetes simulation.

Low operational cost. Host overhead remains <10% CPU/RAM, making the approach practical for microservices and edge nodes.

Profile → Reason → Neutralize loop. Agents continuously baseline normal activity (profile), fuse local signals with federated intelligence for risk scoring (reason), and apply immediate, reversible controls such as container quarantine, token rotation, and rate-limits (neutralize).

Zero-trust alignment. Decisions are continuous and context-aware (identity, device, geo, workload), replacing static role gates and reducing dwell time and lateral movement risk.

Governance and safety. Actions are logged with explainable rationales; policies/models are signed and versioned; high-blast-radius mitigations can be gated behind human-in-the-loop and staged rollouts.

Summary

Treat defense as a distributed control plane made of profiling, reasoning, and neutralizing agents that act where the threat lives. The reported profile—~220 ms actions, ≈ 3.4× faster than centralized baselines, F1 ≈ 0.89, <10% overhead—is consistent with what you’d expect when you eliminate central hops and let autonomy handle least-privilege mitigations locally. It aligns with zero-trust’s continuous verification and gives teams a practical path to self-stabilizing operations: learn normal, flag deviations with federated context, and contain early—before lateral movement outpaces your control plane.

Check out the Paper and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead appeared first on MarkTechPost.

Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World

Can a single AI stack plan like a researcher, reason over scenes, and transfer motions across different robots—without retraining from scratch? Google DeepMind’s Gemini Robotics 1.5 says yes, by splitting embodied intelligence into two models: Gemini Robotics-ER 1.5 for high-level embodied reasoning (spatial understanding, planning, progress/success estimation, tool-use) and Gemini Robotics 1.5 for low-level visuomotor control. The system targets long-horizon, real-world tasks (e.g., multi-step packing, waste sorting with local rules) and introduces motion transfer to reuse data across heterogeneous platforms.

https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/

What actually is the stack?

Gemini Robotics-ER 1.5 (reasoner/orchestrator): A multimodal planner that ingests images/video (and optionally audio), grounds references via 2D points, tracks progress, and invokes external tools (e.g., web search or local APIs) to fetch constraints before issuing sub-goals. It’s available via the Gemini API in Google AI Studio.

Gemini Robotics 1.5 (VLA controller): A vision-language-action model that converts instructions and percepts into motor commands, producing explicit “think-before-act” traces to decompose long tasks into short-horizon skills. Availability is limited to selected partners during the initial rollout.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf

Why split cognition from control?

Earlier end-to-end VLAs (Vision-Language-Action) struggle to plan robustly, verify success, and generalize across embodiments. Gemini Robotics 1.5 isolates those concerns: Gemini Robotics-ER 1.5 handles deliberation (scene reasoning, sub-goaling, success detection), while the VLA specializes in execution (closed-loop visuomotor control). This modularity improves interpretability (visible internal traces), error recovery, and long-horizon reliability.

Motion Transfer across embodiments

A core contribution is Motion Transfer (MT): training the VLA on a unified motion representation built from heterogeneous robot data—ALOHA, bi-arm Franka, and Apptronik Apollo—so skills learned on one platform can zero-shot transfer to another. This reduces per-robot data collection and narrows sim-to-real gaps by reusing cross-embodiment priors.

Quantitative signals

The research team showcased controlled A/B comparisons on real hardware and aligned MuJoCo scenes. These include:

Generalization: Robotics 1.5 surpasses prior Gemini Robotics baselines in instruction following, action generalization, visual generalization, and task generalization across the three platforms.

Zero-shot cross-robot skills: MT yields measurable gains in progress and success when transferring skills across embodiments (e.g., Franka→ALOHA, ALOHA→Apollo), rather than merely improving partial progress.

“Thinking” improves acting: Enabling VLA thought traces increases long-horizon task completion and stabilizes mid-rollout plan revisions.

End-to-end agent gains: Pairing Gemini Robotics-ER 1.5 with the VLA agent substantially improves progress on multi-step tasks (e.g., desk organization, cooking-style sequences) versus a Gemini-2.5-Flash-based baseline orchestrator.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf

Safety and evaluation

DeepMind research team highlights layered controls: policy-aligned dialog/planning, safety-aware grounding (e.g., not pointing to hazardous objects), low-level physical limits, and expanded evaluation suites (e.g., ASIMOV/ASIMOV-style scenario testing and auto red-teaming to elicit edge-case failures). The goal is to catch hallucinated affordances or nonexistent objects before actuation.

Competitive/industry context

Gemini Robotics 1.5 is a shift from “single-instruction” robotics toward agentic, multi-step autonomy with explicit web/tool use and cross-platform learning, a capability set relevant to consumer and industrial robotics. Early partner access centers on established robotics vendors and humanoid platforms.

Key Takeaways

Two-model architecture (ER ↔ VLA): Gemini Robotics-ER 1.5 handles embodied reasoning—spatial grounding, planning, success/progress estimation, tool calls—while Robotics 1.5 is the vision-language-action executor that issues motor commands.

“Think-before-act” control: The VLA produces explicit intermediate reasoning/traces during execution, improving long-horizon decomposition and mid-task adaptation.

Motion Transfer across embodiments: A single VLA checkpoint reuses skills across heterogeneous robots (ALOHA, bi-arm Franka, Apptronik Apollo), enabling zero-/few-shot cross-robot execution rather than per-platform retraining.

Tool-augmented planning: ER 1.5 can invoke external tools (e.g., web search) to fetch constraints, then condition plans—e.g., packing after checking local weather or applying city-specific recycling rules.

Quantified improvements over prior baselines: The tech report documents higher instruction/action/visual/task generalization and better progress/success on real hardware and aligned simulators; results cover cross-embodiment transfers and long-horizon tasks.

Availability and access: ER 1.5 is available via the Gemini API (Google AI Studio) with docs, examples, and preview knobs; Robotics 1.5 (VLA) is limited to select partners with a public waitlist.

Safety & evaluation posture: DeepMind highlights layered safeguards (policy-aligned planning, safety-aware grounding, physical limits) and an upgraded ASIMOV benchmark plus adversarial evaluations to probe risky behaviors and hallucinated affordances.

Summary

Gemini Robotics 1.5 operationalizes a clean separation of embodied reasoning and control, adds motion transfer to recycle data across robots, and showcases the reasoning surface (point grounding, progress/success estimation, tool calls) to developers via the Gemini API. For teams building real-world agents, the design reduces per-platform data burden and strengthens long-horizon reliability—while keeping safety in scope with dedicated test suites and guardrails.

Check out the Paper and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World appeared first on MarkTechPost.

Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared

Local LLMs matured fast in 2025: open-weight families like Llama 3.1 (128K context length (ctx)), Qwen3 (Apache-2.0, dense + MoE), Gemma 2 (9B/27B, 8K ctx), Mixtral 8×7B (Apache-2.0 SMoE), and Phi-4-mini (3.8B, 128K ctx) now ship reliable specs and first-class local runners (GGUF/llama.cpp, LM Studio, Ollama), making on-prem and even laptop inference practical if you match context length and quantization to VRAM. This guide lists the ten most deployable options by license clarity, stable GGUF availability, and reproducible performance characteristics (params, context length (ctx), quant presets).

Top 10 Local LLMs (2025)

1) Meta Llama 3.1-8B — robust “daily driver,” 128K context

Why it matters. A stable, multilingual baseline with long context and first-class support across local toolchains.
Specs. Dense 8B decoder-only; official 128K context; instruction-tuned and base variants. Llama license (open weights). Common GGUF builds and Ollama recipes exist. Typical setup: Q4_K_M/Q5_K_M for ≤12–16 GB VRAM, Q6_K for ≥24 GB.

2) Meta Llama 3.2-1B/3B — edge-class, 128K context, on-device friendly

Why it matters. Small models that still take 128K tokens and run acceptably on CPUs/iGPUs when quantized; good for laptops and mini-PCs.
Specs. 1B/3B instruction-tuned models; 128K context confirmed by Meta. Works well via llama.cpp GGUF and LM Studio’s multi-runtime stack (CPU/CUDA/Vulkan/Metal/ROCm).

3) Qwen3-14B / 32B — open Apache-2.0, strong tool-use & multilingual

Why it matters. Broad family (dense+MoE) under Apache-2.0 with active community ports to GGUF; widely reported as a capable general/agentic “daily driver” locally.
Specs. 14B/32B dense checkpoints with long-context variants; modern tokenizer; rapid ecosystem updates. Start at Q4_K_M for 14B on 12 GB; move to Q5/Q6 when you have 24 GB+. (Qwen)

4) DeepSeek-R1-Distill-Qwen-7B — compact reasoning that fits

Why it matters. Distilled from R1-style reasoning traces; delivers step-by-step quality at 7B with widely available GGUFs. Excellent for math/coding on modest VRAM.
Specs. 7B dense; long-context variants exist per conversion; curated GGUFs cover F32→Q4_K_M. For 8–12 GB VRAM try Q4_K_M; for 16–24 GB use Q5/Q6.

5) Google Gemma 2-9B / 27B — efficient dense; 8K context (explicit)

Why it matters. Strong quality-for-size and quantization behavior; 9B is a great mid-range local model.
Specs. Dense 9B/27B; 8K context (don’t overstate); open weights under Gemma terms; widely packaged for llama.cpp/Ollama. 9B@Q4_K_M runs on many 12 GB cards.

6) Mixtral 8×7B (SMoE) — Apache-2.0 sparse MoE; cost/perf workhorse

Why it matters. Mixture-of-Experts throughput benefits at inference: ~2 experts/token selected at runtime; great compromise when you have ≥24–48 GB VRAM (or multi-GPU) and want stronger general performance.
Specs. 8 experts of 7B each (sparse activation); Apache-2.0; instruct/base variants; mature GGUF conversions and Ollama recipes.

7) Microsoft Phi-4-mini-3.8B — small model, 128K context

Why it matters. Realistic “small-footprint reasoning” with 128K context and grouped-query attention; solid for CPU/iGPU boxes and latency-sensitive tools.
Specs. 3.8B dense; 200k vocab; SFT/DPO alignment; model card documents 128K context and training profile. Use Q4_K_M on ≤8–12 GB VRAM.

8) Microsoft Phi-4-Reasoning-14B — mid-size reasoning (check ctx per build)

Why it matters. A 14B reasoning-tuned variant that is materially better for chain-of-thought-style tasks than generic 13–15B baselines.
Specs. Dense 14B; context varies by distribution (model card for a common release lists 32K). For 24 GB VRAM, Q5_K_M/Q6_K is comfortable; mixed-precision runners (non-GGUF) need more.

9) Yi-1.5-9B / 34B — Apache-2.0 bilingual; 4K/16K/32K variants

Why it matters. Competitive EN/zh performance and permissive license; 9B is a strong alternative to Gemma-2-9B; 34B steps toward higher reasoning under Apache-2.0.
Specs. Dense; context variants 4K/16K/32K; open weights under Apache-2.0 with active HF cards/repos. For 9B use Q4/Q5 on 12–16 GB.

10) InternLM 2 / 2.5-7B / 20B — research-friendly; math-tuned branches

Why it matters. An open series with lively research cadence; 7B is a practical local target; 20B moves you toward Gemma-2-27B-class capability (at higher VRAM).
Specs. Dense 7B/20B; multiple chat/base/math variants; active HF presence. GGUF conversions and Ollama packs are common.

source: marktechpost.com

Summary

In local LLMs, the trade-offs are clear: pick dense models for predictable latency and simpler quantization (e.g., Llama 3.1-8B with a documented 128K context; Gemma 2-9B/27B with an explicit 8K window), move to sparse MoE like Mixtral 8×7B when your VRAM and parallelism justify higher throughput per cost, and treat small reasoning models (Phi-4-mini-3.8B, 128K) as the sweet spot for CPU/iGPU boxes. Licenses and ecosystems matter as much as raw scores: Qwen3’s Apache-2.0 releases (dense + MoE) and Meta/Google/Microsoft model cards give the operational guardrails (context, tokenizer, usage terms) you’ll actually live with. On the runtime side, standardize on GGUF/llama.cpp for portability, layer Ollama/LM Studio for convenience and hardware offload, and size quantization (Q4→Q6) to your memory budget. In short: choose by context + license + hardware path, not just leaderboard vibes.
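
As a concrete starting point for that workflow, here is a minimal llama-cpp-python sketch; the GGUF path, context size, and GPU-offload values are placeholders you size to your own hardware:

# Minimal local-inference sketch with llama-cpp-python; the model path and
# sizing knobs below are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # any GGUF you have downloaded
    n_ctx=8192,        # context window you actually need (bigger = more RAM/VRAM)
    n_gpu_layers=-1,   # offload all layers to GPU if they fit; lower this on small cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three uses for a 128K-context local model."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])

The same sizing logic applies whichever runner you choose: pick the quant (Q4→Q6) and context length that fit your memory budget, as discussed above.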

The post Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared appeared first on MarkTechPost.

The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens

Google released an updated version of Gemini 2.5 Flash and Gemini 2.5 Flash-Lite preview models across AI Studio and Vertex AI, plus rolling aliases—gemini-flash-latest and gemini-flash-lite-latest—that always point to the newest preview in each family. For production stability, Google advises pinning fixed strings (gemini-2.5-flash, gemini-2.5-flash-lite). Google will give a two-week email notice before retargeting a -latest alias, and notes that rate limits, features, and cost may vary across alias updates.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

What actually changed?

Flash: Improved agentic tool use and more efficient “thinking” (multi-pass reasoning). Google reports a +5 point lift on SWE-Bench Verified vs. the May preview (48.9% → 54.0%), indicating better long-horizon planning/code navigation.

Flash-Lite: Tuned for stricter instruction following, reduced verbosity, and stronger multimodal/translation. Google’s internal chart shows ~50% fewer output tokens for Flash-Lite and ~24% fewer for Flash, which directly cuts output-token spend and wall-clock time in throughput-bound services.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

Independent Stats from the community thread

Artificial Analysis (the account behind the AI benchmarking site) received pre-release access and published external measurements across intelligence and speed. Highlights from the thread and companion pages:

Throughput: In endpoint tests, Gemini 2.5 Flash-Lite (Preview 09-2025, reasoning) is reported as the fastest proprietary model they track, around ~887 output tokens/s on AI Studio in their setup.

Intelligence index deltas: The September previews for Flash and Flash-Lite improve on Artificial Analysis’ aggregate “intelligence” scores compared with prior stable releases (site pages break down reasoning vs. non-reasoning tracks and blended price assumptions).

Token efficiency: The thread reiterates Google’s own reduction claims (−24% Flash, −50% Flash-Lite) and frames the win as cost-per-success improvements for tight latency budgets.

“Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors. Key takeaways from our intelligence…” (Artificial Analysis, @ArtificialAnlys, September 25, 2025; pic.twitter.com/ybzKvZBH5A)
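If you want to sanity-check throughput on your own keys and region, a minimal sketch follows. It assumes the google-genai Python SDK (pip install google-genai) with a GEMINI_API_KEY in the environment and approximates token counts from character length; client-side streaming includes network latency, so these numbers will not reproduce Artificial Analysis’s harness figures.

```python
# Crude output-throughput probe: stream a response and estimate output tokens/s.
import time
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def rough_tokens_per_second(model: str, prompt: str) -> float:
    start = time.perf_counter()
    chars = 0
    for chunk in client.models.generate_content_stream(model=model, contents=prompt):
        if chunk.text:
            chars += len(chunk.text)
    elapsed = time.perf_counter() - start
    return (chars / 4) / elapsed  # ~4 characters per token is a rough heuristic, not a tokenizer count

if __name__ == "__main__":
    prompt = "Summarize the trade-offs between dense and MoE local LLMs in about 300 words."
    for model in ("gemini-2.5-flash-lite-preview-09-2025", "gemini-2.5-flash-preview-09-2025"):
        print(f"{model}: ~{rough_tokens_per_second(model, prompt):.0f} tok/s (rough, client-side)")
```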

Cost surface and context budgets (for deployment choices)

Flash-Lite GA list price is $0.10 / 1M input tokens and $0.40 / 1M output tokens (Google’s July GA post and DeepMind’s model page). That baseline is where verbosity reductions translate to immediate savings.

Context: Flash-Lite supports ~1M-token context with configurable “thinking budgets” and tool connectivity (Search grounding, code execution)—useful for agent stacks that interleave reading, planning, and multi-tool calls.
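To see what the token-efficiency claims mean against that list price, here is a back-of-the-envelope sketch; the traffic profile (request volume and per-request token counts) is an illustrative assumption, not a benchmark.

```python
# Effect of a ~50% output-token reduction on monthly spend at Flash-Lite GA list price.
PRICE_IN_PER_M = 0.10   # USD per 1M input tokens
PRICE_OUT_PER_M = 0.40  # USD per 1M output tokens

def monthly_cost(requests: int, in_tokens: int, out_tokens: int) -> float:
    return (requests * in_tokens / 1e6) * PRICE_IN_PER_M + \
           (requests * out_tokens / 1e6) * PRICE_OUT_PER_M

# Hypothetical service: 10M requests/month, 1,200 input and 400 output tokens per request.
baseline = monthly_cost(10_000_000, 1_200, 400)
leaner = monthly_cost(10_000_000, 1_200, 200)  # ~50% fewer output tokens (Google's claim)
print(f"${baseline:,.0f}/mo -> ${leaner:,.0f}/mo ({1 - leaner / baseline:.0%} lower)")
```

Input-token spend is untouched, so the realized saving depends on your input/output ratio; output-heavy workloads benefit most.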

Browser-agent angle and the o3 claim

A circulating claim says the “new Gemini Flash has o3-level accuracy, but is 2× faster and 4× cheaper on browser-agent tasks.” This is community-reported, not in Google’s official post. It likely traces to private/limited task suites (DOM navigation, action planning) with specific tool budgets and timeouts. Use it as a hypothesis for your own evals; don’t treat it as a cross-bench truth.

“This is insane! The new Gemini Flash model released yesterday has the same accuracy as o3, but it is 2x faster and 4x cheaper for browser agent tasks. I ran evaluations the whole day and could not believe this. The previous gemini-2.5-flash had only 71% on this benchmark.” (Magnus Müller, @mamagnus00, September 26, 2025; https://t.co/KdgkuAK30W, pic.twitter.com/F69BiZHiwD)

Practical guidance for teams

Pin vs. chase -latest: If you depend on strict SLAs or fixed limits, pin the stable strings. If you continuously canary for cost/latency/quality, the -latest aliases reduce upgrade friction (Google provides two weeks’ notice before switching the pointer).

High-QPS or token-metered endpoints: Start with Flash-Lite preview; the verbosity and instruction-following upgrades shrink egress tokens. Validate multimodal and long-context traces under production load.

Agent/tool pipelines: A/B Flash preview where multi-step tool use dominates cost or failure modes; Google’s SWE-Bench Verified lift and community tokens/s figures suggest better planning under constrained thinking budgets.

Model strings (current)

Previews: gemini-2.5-flash-preview-09-2025, gemini-2.5-flash-lite-preview-09-2025

Stable: gemini-2.5-flash, gemini-2.5-flash-lite

Rolling aliases: gemini-flash-latest, gemini-flash-lite-latest (pointer semantics; may change features/limits/pricing).
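A minimal sketch of the pin-vs-alias decision, assuming the google-genai Python SDK with a GEMINI_API_KEY in the environment; the CANARY flag and the smoke-test prompt are illustrative conventions, not part of any Google API.

```python
# Choose a pinned model string by default; opt canary traffic into the rolling alias.
import os
from google import genai

PINNED = "gemini-2.5-flash-lite"      # stable string: features/limits/pricing stay fixed
ROLLING = "gemini-flash-lite-latest"  # alias: retargeted with ~2 weeks' email notice

model = ROLLING if os.getenv("CANARY") == "1" else PINNED

client = genai.Client()
response = client.models.generate_content(model=model, contents="Return the word OK.")
print(model, "->", response.text)
```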

Summary

Google’s new release update tightens tool-use competence (Flash) and token/latency efficiency (Flash-Lite) and introduces -latest aliases for faster iteration. External benchmarks from Artificial Analysis indicate meaningful throughput and intelligence-index gains for the September 2025 previews, with Flash-Lite now testing as the fastest proprietary model in their harness. Validate on your workload—especially browser-agent stacks—before committing to the aliases in production.

The post The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens appeared first on MarkTechPost.

Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder

Hugging Face (HF) has released Smol2Operator, a reproducible, end-to-end recipe that turns a small vision-language model (VLM) with no prior UI grounding into a GUI-operating, tool-using agent. The release covers data transformation utilities, training scripts, transformed datasets, and the resulting 2.2B-parameter model checkpoint—positioned as a complete blueprint for building GUI agents from scratch rather than a single benchmark result.

But what’s new?

Two-phase post-training over a small VLM: Starting from SmolVLM2-2.2B-Instruct—a model that “initially has no grounding capabilities for GUI tasks”—Smol2Operator first instills perception/grounding, then layers agentic reasoning with supervised fine-tuning (SFT).

Unified action space across heterogeneous sources: A conversion pipeline normalizes disparate GUI action taxonomies (mobile, desktop, web) into a single, consistent function API (e.g., click, type, drag, normalized [0,1] coordinates), enabling coherent training across datasets. An Action Space Converter supports remapping to custom vocabularies.
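To make the “single, consistent function API” concrete, here is a hypothetical sketch of what a unified action schema with normalized coordinates can look like; the dataclasses and signatures are illustrative, not the release’s actual converter output.

```python
# Illustrative unified GUI action types with resolution-independent coordinates.
from dataclasses import dataclass

@dataclass
class Click:
    x: float  # normalized [0, 1] across image width
    y: float  # normalized [0, 1] across image height

@dataclass
class Type:
    text: str

@dataclass
class Drag:
    x1: float
    y1: float
    x2: float
    y2: float

def normalize(px: int, py: int, width: int, height: int) -> tuple[float, float]:
    """Convert pixel coordinates to [0, 1] so the same action survives
    the image resizing done in VLM preprocessing."""
    return px / width, py / height

# A pixel click at (640, 360) on a 1280x720 screenshot becomes Click(x=0.5, y=0.5),
# which still points at the same element after the frame is resized.
print(Click(*normalize(640, 360, 1280, 720)))
```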

But why Smol2Operator?

Most GUI-agent pipelines are blocked by fragmented action schemas and non-portable coordinates. Smol2Operator’s action-space unification and normalized coordinate strategy make datasets interoperable and training stable under image resizing, which is common in VLM preprocessing. This reduces the engineering overhead of assembling multi-source GUI data and lowers the barrier to reproducing agent behavior with small models.

How it works: training stack and data path

Data standardization:

Parse and normalize function calls from source datasets (e.g., AGUVIS stages) into a unified signature set; remove redundant actions; standardize parameter names; convert pixel coordinates to normalized [0,1] coordinates.

Phase 1 (Perception/Grounding):

SFT on the unified action dataset to learn element localization and basic UI affordances, measured on ScreenSpot-v2 (element localization on screenshots).

Phase 2 (Cognition/Agentic reasoning):

Additional SFT to convert grounded perception into step-wise action planning aligned with the unified action API.

The HF team reports a clean performance trajectory on ScreenSpot-v2 as grounding is learned, and shows the same training strategy scaling down to a ~460M “nanoVLM,” indicating the method’s portability across capacities (numbers are presented in the post’s tables).

Scope, limits, and next steps

Not a “SOTA at all costs” push: The HF team frames the work as a process blueprint—owning data conversion → grounding → reasoning—rather than chasing leaderboard peaks.

Evaluation focus: Demonstrations center on ScreenSpot-v2 perception and qualitative end-to-end task videos; broader cross-environment, cross-OS, or long-horizon task benchmarks are future work. The HF team notes potential gains from RL/DPO beyond SFT for on-policy adaptation.

Ecosystem trajectory: ScreenEnv’s roadmap includes wider OS coverage (Android/macOS/Windows), which would increase external validity of trained policies.

Summary

Smol2Operator is a fully open-source, reproducible pipeline that upgrades SmolVLM2-2.2B-Instruct—a VLM with zero GUI grounding—into an agentic GUI coder via a two-phase SFT process. The release standardizes heterogeneous GUI action schemas into a unified API with normalized coordinates, provides transformed AGUVIS-based datasets, publishes training notebooks and preprocessing code, and ships a final checkpoint plus a demo Space. It targets process transparency and portability over leaderboard chasing, and slots into the smolagents runtime with ScreenEnv for evaluation, offering a practical blueprint for teams building small, operator-grade GUI agents.


The post Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder appeared first on MarkTechPost.