Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access

Delinea released a Model Context Protocol (MCP) server that lets AI agents access credentials stored in Delinea Secret Server and the Delinea Platform. The server applies identity checks and policy rules on every call, aiming to keep long-lived secrets out of agent memory while retaining full auditability.

What’s new for me?

The GitHub project DelineaXPM/delinea-mcp (MIT-licensed) exposes a constrained MCP tool surface for credential retrieval and account operations, supports OAuth 2.0 dynamic client registration per the MCP spec, and offers both STDIO and HTTP/SSE transports. The repo includes Docker artifacts and example configs for editor/agent integrations.

How does it work?

The server exposes MCP tools that proxy to Secret Server and (optionally) the Delinea Platform: secret and folder retrieval/search, inbox/access-request helpers, user/session admin, and report execution; secrets themselves remain vaulted and are never presented to the agent. Configuration separates secrets into environment variables (e.g., DELINEA_PASSWORD) and non-secrets into config.json, with scope controls (enabled_tools, allowed object types), TLS certs, and an optional registration pre-shared key.
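
To make the secrets/non-secrets split concrete, here is a minimal sketch in Python. Only DELINEA_PASSWORD, config.json, and the enabled_tools idea come from the description above; every other key, value, and path is a hypothetical placeholder, so check the repository's example configs for the real schema:

import json
import os

# Secret material stays in environment variables, never in the config file.
delinea_password = os.environ["DELINEA_PASSWORD"]
print("DELINEA_PASSWORD loaded:", bool(delinea_password))

# Non-secret settings go into config.json. Apart from "enabled_tools", the keys
# and values below are illustrative placeholders, not the project's actual schema.
example_config = {
    "enabled_tools": ["secret_search", "secret_get"],   # constrain the exposed tool surface
    "allowed_object_types": ["secret", "folder"],       # hypothetical scope control
    "tls_cert_path": "/etc/delinea-mcp/server.crt",     # hypothetical TLS material location
    "registration_pre_shared_key_env": "MCP_REG_PSK",   # hypothetical pointer; the value stays in env
}

with open("config.json", "w") as f:
    json.dump(example_config, f, indent=2)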

Why exactly does it matter to me?

Enterprises are rapidly wiring agents to operational systems through MCP. Recent incidents—such as a rogue MCP package exfiltrating email—underscore the need for registration controls, TLS, least-privilege tool surfaces, and traceable identity context on every call. Delinea’s server claims to implement these controls in a PAM-aligned pattern (ephemeral auth + policy checks + audit), reducing credential sprawl and simplifying revocation.

Summary

Delinea’s MIT-licensed MCP server gives enterprises a standard, auditable way to handle AI-agent credential access—short-lived tokens, policy evaluation, and constrained tools—reducing secret exposure while integrating with Secret Server and the Delinea Platform. It’s available now on GitHub, with initial coverage and technical details confirming OAuth2, STDIO/HTTP(SSE) transports, and scoped operations.

The post Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access appeared first on MarkTechPost.

OpenAI Launches Sora 2 and a Consent-Gated Sora iOS App

OpenAI released Sora 2, a text-to-video-and-audio model focused on physical plausibility, multi-shot controllability, and synchronized dialogue/SFX. The OpenAI team has also launched a new invite-only Sora iOS app (U.S. and Canada first) that enables social creation, remixing, and consent-controlled “cameos” for inserting a verified likeness into generated scenes.

Model capabilities

Sora 2 claims materially better world modeling (e.g., rebounds on missed shots instead of object “teleportation”), maintains state across shots for instruction-following edits, and generates native, time-aligned audio (speech, ambient, effects). These are framed as prerequisites for simulation-grade video generation rather than single-clip “best effort” synthesis.

App architecture and “cameos”

The Sora app is built around cameos: users record a short in-app video+audio to verify identity and capture likeness; cameo owners control who can use their likeness and can revoke or delete any video—including drafts—that includes them. The app is available on iOS, with availability expanding after the U.S./Canada rollout.

Safety posture

OpenAI’s Sora 2 documents an iterative rollout with specific launch-time restrictions and provenance controls:

Uploads/Generations: At launch, OpenAI is restricting the use of image uploads that feature a photorealistic person and all video uploads. Sora 2 does not support video-to-video at launch, blocks text-to-video of public figures, and blocks generations that include real people except when a user has opted-in via the cameo feature. Additional classifier thresholds apply when a real person appears.

Provenance: All outputs carry C2PA metadata and a visible moving watermark on downloads, with internal detection tools for origin assessment.

Parental controls

In parallel with Sora, OpenAI introduced parental controls integrated via ChatGPT: parents can opt teens into a non-personalized feed, manage DM permissions, and control whether continuous scroll is allowed—aligned with the Sora feed’s “creation-over-consumption” philosophy.

Access and pricing

The Sora iOS app is available to download now; access opens by invite, with Sora 2 initially free under compute-constrained caps. ChatGPT Pro users get access to an experimental Sora 2 Pro tier on sora.com (and coming to the app). API access is planned after the consumer rollout. Existing Sora 1 Turbo content remains available in user libraries.

Summary

Sora 2 pushes text-to-video toward controllable, physics-respecting, audio-synchronized generation—and OpenAI is shipping it inside an invite-only iOS app with consent-gated cameos plus C2PA metadata and visible watermarks for provenance. The initial U.S./Canada rollout prioritizes safety constraints (e.g., restrictions on public-figure depictions) while staging broader access and API plans, signaling a deliberate shift from raw capability demos to governed, production-ready media tooling.

Sora 2 is here. pic.twitter.com/hy95wDM5nB— OpenAI (@OpenAI) September 30, 2025

The post OpenAI Launches Sora 2 and a Consent-Gated Sora iOS App appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.6: Achieving Enhancements in Real-World Coding, Long-Context Processing, Reasoning, Searching and Agentic AI

Zhipu AI has released GLM-4.6, a major update to its GLM series focused on agentic workflows, long-context reasoning, and practical coding tasks. The model raises the input window to 200K tokens with a 128K max output, targets lower token consumption in applied tasks, and ships with open weights for local deployment.

https://z.ai/blog/glm-4.6

So, what exactly is new?

Context + output limits: 200K input context and 128K maximum output tokens.

Real-world coding results: On the extended CC-Bench (multi-turn tasks run by human evaluators in isolated Docker environments), GLM-4.6 is reported near parity with Claude Sonnet 4 (48.6% win rate) and uses ~15% fewer tokens vs. GLM-4.5 to finish tasks. Task prompts and agent trajectories are published for inspection.

Benchmark positioning: Zhipu summarizes “clear gains” over GLM-4.5 across eight public benchmarks and states parity with Claude Sonnet 4/4.5 on several; it also notes GLM-4.6 still lags Sonnet 4.5 on coding—a useful caveat for model selection.

Ecosystem availability: GLM-4.6 is available via Z.ai API and OpenRouter; it integrates with popular coding agents (Claude Code, Cline, Roo Code, Kilo Code), and existing Coding Plan users can upgrade by switching the model name to glm-4.6.

Open weights + license: Hugging Face model card lists License: MIT and Model size: 355B params (MoE) with BF16/F32 tensors. (MoE “total parameters” are not equal to active parameters per token; no active-params figure is stated for 4.6 on the card.)

Local inference: vLLM and SGLang are supported for local serving; weights are on Hugging Face and ModelScope.
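
For local serving, a minimal vLLM sketch might look like the following. The repo id "zai-org/GLM-4.6" and the hardware sizing are assumptions to verify against the Hugging Face model card; a 355B-parameter MoE needs a multi-GPU node even in BF16:

from vllm import LLM, SamplingParams

# Offline-inference sketch; adjust tensor_parallel_size and max_model_len to your hardware.
llm = LLM(
    model="zai-org/GLM-4.6",   # assumed Hugging Face repo id; verify before use
    tensor_parallel_size=8,     # shard the MoE weights across 8 GPUs
    max_model_len=131072,       # long-context serving; raise toward 200K if memory allows
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Refactor this function to be iterative instead of recursive: ..."], params)
print(outputs[0].outputs[0].text)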

https://z.ai/blog/glm-4.6

Summary

GLM-4.6 is an incremental but material step: a 200K context window, ~15% token reduction on CC-Bench versus GLM-4.5, near-parity task win-rate with Claude Sonnet 4, and immediate availability via Z.ai, OpenRouter, and open-weight artifacts for local serving.

FAQs

1) What are the context and output token limits? GLM-4.6 supports a 200K input context and 128K maximum output tokens.

2) Are open weights available and under what license? Yes. The Hugging Face model card lists open weights with License: MIT and a 357B-parameter MoE configuration (BF16/F32 tensors).

3) How does GLM-4.6 compare to GLM-4.5 and Claude Sonnet 4 on applied tasks? On the extended CC-Bench, GLM-4.6 reports ~15% fewer tokens vs. GLM-4.5 and near-parity with Claude Sonnet 4 (48.6% win-rate).

4) Can I run GLM-4.6 locally? Yes. Zhipu provides weights on Hugging Face/ModelScope and documents local inference with vLLM and SGLang; community quantizations are appearing for workstation-class hardware.

Check out the GitHub Page, Hugging Face Model Card and Technical details.

The post Zhipu AI Releases GLM-4.6: Achieving Enhancements in Real-World Coding, Long-Context Processing, Reasoning, Searching and Agentic AI appeared first on MarkTechPost.

Modernize fraud prevention: GraphStorm v0.5 for real-time inference

Fraud continues to cause significant financial damage globally, with U.S. consumers alone losing $12.5 billion in 2024—a 25% increase from the previous year according to the Federal Trade Commission. This surge stems not from more frequent attacks, but from fraudsters’ increasing sophistication. As fraudulent activities become more complex and interconnected, conventional machine learning approaches fall short by analyzing transactions in isolation, unable to capture the networks of coordinated activities that characterize modern fraud schemes.
Graph neural networks (GNNs) effectively address this challenge by modeling relationships between entities—such as users sharing devices, locations, or payment methods. By analyzing both network structures and entity attributes, GNNs are effective at identifying sophisticated fraud schemes where perpetrators mask individual suspicious activities but leave traces in their relationship networks. However, implementing GNN-based online fraud prevention in production environments presents unique challenges: achieving sub-second inference responses, scaling to billions of nodes and edges, and maintaining operational efficiency for model updates. In this post, we show you how to overcome these challenges using GraphStorm, particularly the new real-time inference capabilities of GraphStorm v0.5.
Previous solutions required tradeoffs between capability and simplicity. Our initial DGL approach provided comprehensive real-time capabilities but demanded intricate service orchestration—including manually updating endpoint configurations and payload formats after retraining with new hyperparameters. This approach also lacked model flexibility, requiring customization of GNN models and configurations when using architectures beyond relational graph convolutional networks (RGCN). Subsequent in-memory DGL implementations reduced complexity but encountered scalability limitations with enterprise data volumes. We built GraphStorm to bridge this gap, by introducing distributed training and high-level APIs that help simplify GNN development at enterprise scale.
In a recent blog post, we illustrated GraphStorm’s enterprise-scale GNN model training and offline inference capability and simplicity. While offline GNN fraud detection can identify fraudulent transactions after they occur, preventing financial loss requires stopping fraud before it happens. GraphStorm v0.5 makes this possible with native real-time inference support through Amazon SageMaker AI. GraphStorm v0.5 delivers two innovations: streamlined endpoint deployment that reduces weeks of custom engineering—coding SageMaker entry point files, packaging model artifacts, and calling SageMaker deployment APIs—to a single-command operation, and standardized payload specification that helps simplify client integration with real-time inference services. These capabilities enable sub-second node classification tasks like fraud prevention, empowering organizations to proactively counter fraud threats with scalable, operationally straightforward GNN solutions.
To showcase these capabilities, this post presents a fraud prevention solution. Through this solution, we show how a data scientist can transition a trained GNN model to production-ready inference endpoints with minimal operational overhead. If you’re interested in implementing GNN-based models for real-time fraud prevention or similar business cases, you can adapt the approaches presented here to create your own solutions.
Solution overview
Our proposed solution is a 4-step pipeline as shown in the following figure. The pipeline starts at step 1 with transaction graph export from an online transaction processing (OLTP) graph database to scalable storage (Amazon Simple Storage Service (Amazon S3) or Amazon EFS), followed by distributed model training in step 2. Step 3 is GraphStorm v0.5’s simplified deployment process that creates SageMaker real-time inference endpoints with one command. After SageMaker AI has deployed the endpoint successfully, a client application integrates with the OLTP graph database to process live transaction streams in step 4. By querying the graph database, the client prepares subgraphs around to-be-predicted transactions, converts each subgraph into the standardized payload format, and invokes the deployed endpoint for real-time prediction.

To provide concrete implementation details for each step in the real-time inference solution, we demonstrate the complete workflow using the publicly available IEEE-CIS fraud detection task.
Note: This example uses a Jupyter notebook as the controller of the overall four-step pipeline for simplicity. For more production-ready design, see the architecture described in Build a GNN-based real-time fraud detection solution.
Prerequisites
To run this example, you need an AWS account that the example’s AWS Cloud Development Kit (AWS CDK) code uses to create required resources, including Amazon Virtual Private Cloud (Amazon VPC), an Amazon Neptune database, Amazon SageMaker AI, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, and related roles and permissions.
Note: These resources incur costs during execution (approximately $6 per hour with default settings). Monitor usage carefully and review pricing pages for these services before proceeding. Follow cleanup instructions at the end to avoid ongoing charges.
Hands-on example: Real-time fraud prevention with IEEE-CIS dataset
All implementation code for this example, including Jupyter notebooks and supporting Python scripts, is available in our public repository. The repository provides a complete end-to-end implementation that you can directly execute and adapt for your own fraud prevention use cases.
Dataset and task overview
This example uses the IEEE-CIS fraud detection dataset, containing 500,000 anonymized transactions with approximately 3.5% fraudulent cases. The dataset includes 392 categorical and numerical features, with key attributes like card types, product types, addresses, and email domains forming the graph structure shown in the following figure. Each transaction (with an isFraud label) connects to Card Type, Location, Product Type, and Purchaser and Recipient email domain entities, creating a heterogeneous graph that enables GNN models to detect fraud patterns through entity relationships.

Unlike our previous post that demonstrated GraphStorm plus Amazon Neptune Analytics for offline analysis workflows, this example uses a Neptune database as the OLTP graph store, optimized for the quick subgraph extraction required during real-time inference. Following the graph design, the tabular IEEE-CIS data is converted to a set of CSV files compatible with the Neptune database format, allowing direct loading into both the Neptune database and GraphStorm’s GNN model training pipeline with a single set of files.
Step 0: Environment setup
Step 0 establishes the running environment required for the four-step fraud prevention pipeline. Complete setup instructions are available in the implementation repository.
To run the example solution, you need to deploy an AWS CloudFormation stack through the AWS CDK. This stack creates the Neptune DB instance, the VPC to place it in, and appropriate roles and security groups. It additionally creates a SageMaker AI notebook instance, from which you run the example notebooks that come with the repository.

git clone https://github.com/aws-samples/amazon-neptune-samples.git
cd neptune-database-graphstorm-online-inference/neptune-db-cdk
# Ensure you have CDK installed and have appropriate credentials set up
cdk deploy

When deployment is finished (it takes approximately 10 minutes for required resources to be ready), the AWS CDK prints a few outputs, one of which is the name of the SageMaker notebook instance you use to run through the notebooks:

# Example output
NeptuneInfraStack.NotebookInstanceName = arn:aws:sagemaker:us-east-1:012345678912:notebook-instance/NeptuneNotebook-9KgSB9XXXXXX

You can navigate to the SageMaker AI notebook UI, find the corresponding notebook instance, and select its Open Jupyterlab link to access the notebook.
Alternatively, you can use the AWS Command Line Interface (AWS CLI) to get a pre-signed URL to access the notebook. You will need to replace the <notebook-instance-name> with the actual notebook instance name.

aws sagemaker create-presigned-notebook-instance-url --notebook-instance-name <notebook-instance-name>

When you’re in the notebook instance web console, open the first notebook, 0-Data-Preparation.ipynb, to start going through the example.
Step 1: Graph construction
In the Notebook 0-Data-Preparation, you transform the tabular IEEE-CIS dataset into the heterogeneous graph structure shown in the figure at the start of this section. The provided Jupyter Notebook extracts entities from transaction features, creating Card Type nodes from card1–card6 features, Purchaser and Recipient nodes from email domains, Product Type nodes from product codes, and Location nodes from geographic information. The transformation establishes relationships between transactions and these entities, generating graph data in Neptune import format for direct ingestion into the OLTP graph store. The create_neptune_db_data() function orchestrates this entity extraction and relationship creation process across all node types (which takes approximately 30 seconds).

GRAPH_NAME = "ieee-cis-fraud-detection"
PROCESSED_PREFIX = f"./{GRAPH_NAME}"
ID_COLS = "card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain"
CAT_COLS = "M1,M2,M3,M4,M5,M6,M7,M8,M9"
# Lists of columns to keep from each file
COLS_TO_KEEP = {
    "transaction.csv": (
        ID_COLS.split(",")
        + CAT_COLS.split(",")
        +
        # Numerical features without missing values
        [f"C{idx}" for idx in range(1, 15)]
        + ["TransactionID", "TransactionAmt", "TransactionDT", "isFraud"]
    ),
    "identity.csv": ["TransactionID", "DeviceType"],
}

create_neptune_db_data(
    data_prefix="./input-data/",
    output_prefix=PROCESSED_PREFIX,
    id_cols=ID_COLS,
    cat_cols=CAT_COLS,
    cols_to_keep=COLS_TO_KEEP,
    num_chunks=1,
)

This notebook also generates the JSON configuration file required by GraphStorm’s GConstruct command and executes the graph construction process. This GConstruct command transforms the Neptune-formatted data into a distributed binary graph format optimized for GraphStorm’s training pipeline, which partitions the heterogeneous graph structure across compute nodes to enable scalable model training on industry-scale graphs (measured in billions of nodes and edges). For the IEEE-CIS data, the GConstruct command takes 90 seconds to complete.
In the Notebook 1-Load-Data-Into-Neptune-DB, you load the CSV data into the Neptune database instance (takes approximately 9 minutes), which makes them available for online inference. During online inference, after selecting a transaction node, you query the Neptune database to get the graph neighborhood of the target node, retrieving the features of every node in the neighborhood and the subgraph structure around the target.
Step 2: Model training
After you have converted the data into the distributed binary graph format, it’s time to train a GNN model. GraphStorm provides command-line scripts to train a model without writing code. In the Notebook 2-Model-Training, you train a GNN model using GraphStorm’s node classification command with configuration managed through YAML files. The baseline configuration defines a two-layer RGCN model with 128-dimensional hidden layers, training for 4 epochs with a 0.001 learning rate and a 1024 batch size, which takes approximately 100 seconds per epoch of model training and evaluation on an ml.m5.4xlarge instance. To improve fraud detection accuracy, the notebook provides more advanced model configurations like the command below.

!python -m graphstorm.run.gs_node_classification \
        --workspace ./ \
        --part-config ieee_gs/ieee-cis.json \
        --num-trainers 1 \
        --cf ieee_nc.yaml \
        --eval-metric roc_auc \
        --save-model-path ./model-simple/ \
        --topk-model-to-save 1 \
        --imbalance-class-weights 0.1,1.0

Arguments in this command address the dataset’s label imbalance challenge, where only 3.5% of transactions are fraudulent, by using AUC-ROC as the evaluation metric and applying class weights. The command also saves the best-performing model along with essential configuration files required for endpoint deployment. Advanced configurations can further enhance model performance through techniques like HGT encoders, multi-head attention, and a class-weighted cross-entropy loss function, though these optimizations increase computational requirements. GraphStorm enables these changes through runtime arguments and YAML configurations, reducing the need for code modifications.
Step 3: Real-time endpoint deployment
In the Notebook 3-GraphStorm-Endpoint-Deployment, you deploy the real-time endpoint through GraphStorm v0.5’s straightforward launch script. The deployment requires three model artifacts generated during training: the saved model file that contains weights, the updated graph construction JSON file with feature transformation metadata, and the runtime-updated training configuration YAML file. These artifacts enable GraphStorm to recreate the exact training configurations and model for consistent inference behavior. Notably, the updated graph construction JSON and training configuration YAML files contain crucial configurations for restoring the trained model on the endpoint and processing incoming request payloads, so it is essential to use the updated JSON and YAML files for endpoint deployment.

GraphStorm uses SageMaker AI bring your own container (BYOC) to deploy a consistent inference environment. You need to build and push the GraphStorm real-time Docker image to Amazon ECR using the provided shell scripts. This containerized approach provides consistent runtime environments compatible with the SageMaker AI managed infrastructure. The Docker image contains the necessary dependencies for GraphStorm’s real-time inference capabilities on the deployment environment.
To deploy the endpoint, you can use the GraphStorm-provided launch_realtime_endpoint.py script that helps you gather required artifacts and creates the necessary SageMaker AI resources to deploy an endpoint. The script accepts the Amazon ECR image URI, IAM role, model artifact paths, and S3 bucket configuration, automatically handling endpoint provisioning and configuration. By default, the script waits for endpoint deployment to be complete before exiting. When completed, it prints the name and AWS Region of the deployed endpoint for subsequent inference requests. You will need to replace the fields enclosed by <> with the actual values of your environment.

!python ~/graphstorm/sagemaker/launch/launch_realtime_endpoint.py \
        --image-uri <account_id>.dkr.ecr.<aws_region>.amazonaws.com/graphstorm:sagemaker-endpoint-cpu \
        --role arn:aws:iam::<account_id>:role/<your_role> \
        --region <aws_region> \
        --restore-model-path <restore-model-path>/models/epoch-1/ \
        --model-yaml-config-file <restore-model-path>/models/GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml \
        --graph-json-config-file <restore-model-path>/models/data_transform_new.json \
        --infer-task-type node_classification \
        --upload-tarfile-s3 s3://<cdk-created-bucket> \
        --model-name ieee-fraud-detect

Step 4: Real-time inference
In the Notebook 4-Sample-Graph-and-Invoke-Endpoint, you build a basic client application that integrates with the deployed GraphStorm endpoint to perform real-time fraud prevention on incoming transactions. The inference process accepts transaction data through standardized JSON payloads, executes node classification predictions in a few hundred milliseconds, and returns fraud probability scores that enable immediate decision-making.
An end-to-end inference call for a node that already exists in the graph has three distinct stages:

Graph sampling from the Neptune database. For a given target node that already exists in the graph, retrieve its k-hop neighborhood with a fanout limit, that is, limiting the number of neighbors retrieved at each hop by a threshold.
Payload preparation for inference. Neptune returns graphs using GraphSON, a specialized JSON-like data format used to describe graph data. At this step, you need to convert the returned GraphSON to GraphStorm’s own JSON specification. This step is performed on the inference client, in this case a SageMaker notebook instance.
Model inference using a SageMaker endpoint. After the payload is prepared, you send an inference request to a SageMaker endpoint that has loaded a previously trained model snapshot. The endpoint receives the request, performs any feature transformations needed (such as converting categorical features to one-hot encoding), creates the binary graph representation in memory, and makes a prediction for the target node using the graph neighborhood and trained model weights. The response is encoded to JSON and sent back to the client.
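
As a rough sketch of stage 3 (not the official client), the snippet below calls the deployed endpoint with boto3; the payload fields other than node_type and node_id are illustrative, so follow GraphStorm's payload specification for the exact format:

import json

import boto3

# Invoke the deployed GraphStorm real-time endpoint (stage 3). The payload layout
# here is illustrative only; use the GraphStorm request specification for real calls.
runtime = boto3.client("sagemaker-runtime", region_name="<aws_region>")

payload = {
    "targets": [{"node_type": "Transaction", "node_id": "2991260"}],  # node to classify
    "graph": {"nodes": [], "edges": []},  # subgraph converted from Neptune GraphSON (hypothetical shape)
}

response = runtime.invoke_endpoint(
    EndpointName="<your-graphstorm-endpoint>",
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result["data"]["results"][0]["prediction"])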

An example response from the endpoint would look like:

{'status_code': 200,
 'request_uid': '877042dbc361fc33',
 'message': 'Request processed successfully.',
 'error': '',
 'data': {
    'results': [
            {
                'node_type': 'Transaction',
                'node_id': '2991260',
                'prediction': [0.995966911315918, 0.004033133387565613]
            }
        ]
    }
}

The data of interest for the single transaction you made a prediction for are in the prediction key and corresponding node_id. The prediction gives you the raw scores the model produces for class 0 (legitimate) and class 1 (fraudulent) at the corresponding 0 and 1 indexes of the predictions list. In this example, the model marks the transaction as most likely legitimate. You can find the full GraphStorm response specification in the GraphStorm documentation.
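
To turn that response into a decision, a client only needs the class-1 score; the snippet below assumes result holds the parsed JSON shown above, and the threshold is a hypothetical starting point, not a recommended value:

FRAUD_THRESHOLD = 0.5  # hypothetical cutoff; tune against your precision/recall targets

# `result` is the parsed JSON response shown above.
entry = result["data"]["results"][0]
fraud_score = entry["prediction"][1]  # index 1 holds the class-1 (fraudulent) score

if fraud_score >= FRAUD_THRESHOLD:
    print(f"Block transaction {entry['node_id']} (fraud score {fraud_score:.4f})")
else:
    print(f"Allow transaction {entry['node_id']} (fraud score {fraud_score:.4f})")
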
Complete implementation examples, including client code and payload specifications, are provided in the repository to guide integration with production systems.
Clean up
To stop accruing costs on your account, you need to delete the AWS resources that you created with the AWS CDK at the Environment Setup step.
You must first delete the SageMaker endpoint created during Step 3 for cdk destroy to complete. See Delete Endpoints and Resources for more options for deleting an endpoint. When done, you can run the following from the repository’s root:

cd neptune-database-graphstorm-online-inference/neptune-db-cdk
cdk destroy

See the AWS CDK docs for more information about how to use cdk destroy, or see the CloudFormation docs for how to delete a stack from the console UI. By default, the cdk destroy command does not delete the model artifacts and processed graph data stored in the S3 bucket during the training and deployment process. You must remove them manually. See Deleting a general purpose bucket for information about how to empty and delete an S3 bucket the AWS CDK has created.
Conclusion
Graph neural networks address complex fraud prevention challenges by modeling relationships between entities that traditional machine learning approaches miss when analyzing transactions in isolation. GraphStorm v0.5 helps simplify deployment of GNN real-time inference with one command for endpoint creation that previously required coordination of multiple services and a standardized payload specification that helps simplify client integration with real-time inference services. Organizations can now deploy enterprise-scale fraud prevention endpoints through streamlined commands that reduce custom engineering from weeks to single-command operations.
To implement GNN-based fraud prevention with your own data:

Review the GraphStorm documentation for model configuration options and deployment specifications.
Adapt this IEEE-CIS example to your fraud prevention dataset by modifying the graph construction and feature engineering steps using the complete source code and tutorials available in our GitHub repository.
Access step-by-step implementation guidance to build production-ready fraud prevention solutions with GraphStorm v0.5’s enhanced capabilities using your enterprise data.

About the authors
Jian Zhang is a Senior Applied Scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, the US, and Singapore. As an enlightener of AWS graph capabilities, Zhang has given many public presentations about GraphStorm, the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Theodore Vasiloudis is a Senior Applied Scientist at AWS, where he works on distributed machine learning systems and algorithms. He led the development of GraphStorm Processing, the distributed graph processing library for GraphStorm and is a core developer for GraphStorm. He received his PhD in Computer Science from KTH Royal Institute of Technology, Stockholm, in 2019.
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a new capability of Neptune that uses graph neural networks for graphs stored in a graph database. He is now leading the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in computer systems and architecture at Fudan University, Shanghai, in 2014.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist—a field in which he holds a PhD.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.

Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

Anthropic released Claude Sonnet 4.5, setting a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash + file edit), averaged over 10 runs, no test-time compute, 200K “thinking” budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises this to 82.0%. (A minimal sketch of this kind of two-tool scaffold appears after this list.)

Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.

Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks — a practical jump over earlier limits and directly relevant to agent reliability.

Reasoning/math. The release notes “substantial gains” across common reasoning and math evals; exact per-benchmark numbers depend on the evaluation configuration (e.g., the AIME setup). Safety posture is ASL-3 with strengthened defenses against prompt injection.

https://www.anthropic.com/news/claude-sonnet-4-5
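
For intuition only, here is a minimal sketch of what a bash-plus-file-edit tool scaffold can look like with the Anthropic Messages API. The tool names, schemas, and prompt are illustrative and are not Anthropic's evaluation harness; only the claude-sonnet-4-5 model ID comes from the launch materials:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Two illustrative tools in the spirit of the reported scaffold: run a shell command,
# or overwrite a file. The schemas below are assumptions, not Anthropic's eval setup.
tools = [
    {
        "name": "bash",
        "description": "Run a shell command in the repository and return stdout/stderr.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "edit_file",
        "description": "Overwrite the file at `path` with `content`.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
            "required": ["path", "content"],
        },
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_parser.py"}],
)
print(response.stop_reason)  # 'tool_use' when the model wants to call bash or edit_file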

What’s there for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate “using a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where can you run it?

Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.

AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).

Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.

GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is developed for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the “best for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.

Introducing Claude Sonnet 4.5—the best coding model in the world. It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains on tests of reasoning and math. pic.twitter.com/7LwV9WPNAv— Claude (@claudeai) September 29, 2025

The post Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results appeared first on MarkTechPost.

How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment?

In this tutorial, we set out to build an advanced interactive dashboard using Dash, Plotly, and Bootstrap. We highlight not only how these tools enable us to design layouts and visualizations, but also how Dash’s callback mechanism links controls to outputs, allowing for real-time responsiveness. By combining local execution with the ability to run in cloud platforms like Google Colab, we explore a workflow that is both flexible and practical. Check out the FULL CODES here.

!pip install dash plotly pandas numpy dash-bootstrap-components

import dash
from dash import dcc, html, Input, Output, callback, dash_table
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import dash_bootstrap_components as dbc

print("Generating sample data...")
np.random.seed(42)

We begin by installing and importing the necessary components, including Dash, Plotly, Pandas, NumPy, and Bootstrap, to set up our dashboard environment. We also initialize random seeds and generate sample data so that we can consistently test the interactive features as we build them. Check out the FULL CODES here.

start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 12, 31)
dates = pd.date_range(start=start_date, end=end_date, freq='D')
stock_names = ['AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA']

all_data = []
base_prices = {'AAPL': 150, 'GOOGL': 120, 'MSFT': 250, 'AMZN': 100, 'TSLA': 200}

for stock in stock_names:
    print(f"Creating data for {stock}...")
    base_price = base_prices[stock]

    n_days = len(dates)
    returns = np.random.normal(0.0005, 0.025, n_days)
    prices = np.zeros(n_days)
    prices[0] = base_price

    for i in range(1, n_days):
        prices[i] = prices[i-1] * (1 + returns[i])

    volumes = np.random.lognormal(15, 0.5, n_days).astype(int)

    stock_df = pd.DataFrame({
        'Date': dates,
        'Stock': stock,
        'Price': prices,
        'Volume': volumes,
        'Returns': np.concatenate([[0], np.diff(prices) / prices[:-1]]),
        'Sector': np.random.choice(['Technology', 'Consumer', 'Automotive'], 1)[0]
    })

    all_data.append(stock_df)

df = pd.concat(all_data, ignore_index=True)

df['Date'] = pd.to_datetime(df['Date'])
df_sorted = df.sort_values(['Stock', 'Date']).reset_index(drop=True)

print("Calculating technical indicators...")
df_sorted['MA_20'] = df_sorted.groupby('Stock')['Price'].transform(lambda x: x.rolling(20, min_periods=1).mean())
df_sorted['Volatility'] = df_sorted.groupby('Stock')['Returns'].transform(lambda x: x.rolling(30, min_periods=1).std())

df = df_sorted.copy()

print(f"Data generated successfully! Shape: {df.shape}")
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Stocks: {df['Stock'].unique().tolist()}")

We generate synthetic stock data, including prices, volumes, and returns, for multiple tickers across a specified date range. We calculate moving averages and volatility to enrich the dataset with useful technical indicators, providing a strong foundation for building interactive visualizations. Check out the FULL CODES here.

app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = dbc.Container([
    dbc.Row([
        dbc.Col([
            html.H1(" Advanced Financial Dashboard", className="text-center mb-4"),
            html.P(f"Interactive dashboard with {len(df)} data points across {len(stock_names)} stocks",
                   className="text-center text-muted"),
            html.Hr()
        ])
    ]),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H5(" Dashboard Controls", className="card-title"),

                    html.Label("Select Stocks:", className="fw-bold mt-3"),
                    dcc.Dropdown(
                        id='stock-dropdown',
                        options=[{'label': f'{stock} ({base_prices[stock]})', 'value': stock}
                                 for stock in stock_names],
                        value=['AAPL', 'GOOGL'],
                        multi=True,
                        placeholder="Choose stocks to analyze..."
                    ),

                    html.Label("Date Range:", className="fw-bold mt-3"),
                    dcc.DatePickerRange(
                        id='date-picker-range',
                        start_date='2023-06-01',
                        end_date='2024-06-01',
                        display_format='YYYY-MM-DD',
                        style={'width': '100%'}
                    ),

                    html.Label("Chart Style:", className="fw-bold mt-3"),
                    dcc.RadioItems(
                        id='chart-type',
                        options=[
                            {'label': ' Line Chart', 'value': 'line'},
                            {'label': ' Area Chart', 'value': 'area'},
                            {'label': ' Scatter Plot', 'value': 'scatter'}
                        ],
                        value='line',
                        labelStyle={'display': 'block', 'margin': '5px'}
                    ),

                    dbc.Checklist(
                        id='show-ma',
                        options=[{'label': ' Show Moving Average', 'value': 'show'}],
                        value=[],
                        style={'margin': '10px 0'}
                    ),
                ])
            ], className="h-100")
        ], width=3),

        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Stock Price Analysis"),
                dbc.CardBody([
                    dcc.Graph(id='main-chart', style={'height': '450px'})
                ])
            ])
        ], width=9)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="avg-price", className="text-primary mb-0"),
                    html.Small("Average Price", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="total-volume", className="text-success mb-0"),
                    html.Small("Total Volume", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="price-range", className="text-info mb-0"),
                    html.Small("Price Range", className="text-muted")
                ])
            ])
        ], width=3),
        dbc.Col([
            dbc.Card([
                dbc.CardBody([
                    html.H4(id="data-points", className="text-warning mb-0"),
                    html.Small("Data Points", className="text-muted")
                ])
            ])
        ], width=3)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Trading Volume"),
                dbc.CardBody([
                    dcc.Graph(id='volume-chart', style={'height': '300px'})
                ])
            ])
        ], width=6),
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Returns Distribution"),
                dbc.CardBody([
                    dcc.Graph(id='returns-chart', style={'height': '300px'})
                ])
            ])
        ], width=6)
    ], className="mb-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader(" Latest Stock Data"),
                dbc.CardBody([
                    dash_table.DataTable(
                        id='data-table',
                        columns=[
                            {'name': 'Stock', 'id': 'Stock'},
                            {'name': 'Date', 'id': 'Date'},
                            {'name': 'Price ($)', 'id': 'Price', 'type': 'numeric',
                             'format': {'specifier': '.2f'}},
                            {'name': 'Volume', 'id': 'Volume', 'type': 'numeric',
                             'format': {'specifier': ',.0f'}},
                            {'name': 'Daily Return (%)', 'id': 'Returns', 'type': 'numeric',
                             'format': {'specifier': '.2%'}}
                        ],
                        style_cell={'textAlign': 'center', 'fontSize': '14px', 'padding': '10px'},
                        style_header={'backgroundColor': 'rgb(230, 230, 230)', 'fontWeight': 'bold'},
                        style_data_conditional=[
                            {
                                'if': {'filter_query': '{Returns} > 0'},
                                'backgroundColor': '#d4edda'
                            },
                            {
                                'if': {'filter_query': '{Returns} < 0'},
                                'backgroundColor': '#f8d7da'
                            }
                        ],
                        page_size=15,
                        sort_action="native",
                        filter_action="native"
                    )
                ])
            ])
        ])
    ])
], fluid=True)

We define the app layout with Bootstrap rows and cards, where we place controls (dropdown, date range, chart style, MA toggle) alongside the main graph. We add metric cards, two secondary graphs, and a sortable/filterable data table, so we organize everything into a responsive, clean interface that we can wire up to callbacks next. Check out the FULL CODES here.

@callback(
    [Output('main-chart', 'figure'),
     Output('volume-chart', 'figure'),
     Output('returns-chart', 'figure'),
     Output('data-table', 'data'),
     Output('avg-price', 'children'),
     Output('total-volume', 'children'),
     Output('price-range', 'children'),
     Output('data-points', 'children')],
    [Input('stock-dropdown', 'value'),
     Input('date-picker-range', 'start_date'),
     Input('date-picker-range', 'end_date'),
     Input('chart-type', 'value'),
     Input('show-ma', 'value')]
)
def update_all_charts(selected_stocks, start_date, end_date, chart_type, show_ma):
    print(f"Callback triggered with stocks: {selected_stocks}")

    if not selected_stocks:
        selected_stocks = ['AAPL']

    filtered_df = df[
        (df['Stock'].isin(selected_stocks)) &
        (df['Date'] >= start_date) &
        (df['Date'] <= end_date)
    ].copy()

    print(f"Filtered data shape: {filtered_df.shape}")

    if filtered_df.empty:
        filtered_df = df[df['Stock'].isin(selected_stocks)].copy()
        print(f"Using all available data. Shape: {filtered_df.shape}")

    if chart_type == 'line':
        main_fig = px.line(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices - {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    elif chart_type == 'area':
        main_fig = px.area(filtered_df, x='Date', y='Price', color='Stock',
                           title=f'Stock Prices - {chart_type.title()} View',
                           labels={'Price': 'Price ($)', 'Date': 'Date'})
    else:
        main_fig = px.scatter(filtered_df, x='Date', y='Price', color='Stock',
                              title=f'Stock Prices - {chart_type.title()} View',
                              labels={'Price': 'Price ($)', 'Date': 'Date'})

    if 'show' in show_ma:
        for stock in selected_stocks:
            stock_data = filtered_df[filtered_df['Stock'] == stock]
            if not stock_data.empty:
                main_fig.add_scatter(
                    x=stock_data['Date'],
                    y=stock_data['MA_20'],
                    mode='lines',
                    name=f'{stock} MA-20',
                    line=dict(dash='dash', width=2)
                )

    main_fig.update_layout(height=450, showlegend=True, hovermode='x unified')

    volume_fig = px.bar(filtered_df, x='Date', y='Volume', color='Stock',
                        title='Daily Trading Volume',
                        labels={'Volume': 'Volume (shares)', 'Date': 'Date'})
    volume_fig.update_layout(height=300, showlegend=True)

    returns_fig = px.histogram(filtered_df.dropna(subset=['Returns']),
                               x='Returns', color='Stock',
                               title='Daily Returns Distribution',
                               labels={'Returns': 'Daily Returns', 'count': 'Frequency'},
                               nbins=50)
    returns_fig.update_layout(height=300, showlegend=True)

    if not filtered_df.empty:
        avg_price = f"${filtered_df['Price'].mean():.2f}"
        total_volume = f"{filtered_df['Volume'].sum():,.0f}"
        price_range = f"${filtered_df['Price'].min():.0f} - ${filtered_df['Price'].max():.0f}"
        data_points = f"{len(filtered_df):,}"

        table_data = filtered_df.nlargest(100, 'Date')[
            ['Stock', 'Date', 'Price', 'Volume', 'Returns']
        ].round(4).to_dict('records')

        for row in table_data:
            row['Date'] = row['Date'].strftime('%Y-%m-%d') if pd.notnull(row['Date']) else ''
    else:
        avg_price = "No data"
        total_volume = "No data"
        price_range = "No data"
        data_points = "0"
        table_data = []

    return (main_fig, volume_fig, returns_fig, table_data,
            avg_price, total_volume, price_range, data_points)

We wire up Dash’s callback to connect our controls to every output, so changing any input instantly updates charts, stats, and the table. We filter the dataframe by selections and dates, build figures (plus optional MA overlays), and compute summary metrics. Finally, we format recent rows for the table so we can inspect the latest results at a glance. Check out the FULL CODES here.

if __name__ == '__main__':
    print("Starting Dash app...")
    print("Available data preview:")
    print(df.head())
    print(f"Total rows: {len(df)}")

    # Inline rendering in a notebook requires Dash >= 2.11 (built-in Jupyter support)
    app.run(jupyter_mode='inline', port=8050, debug=True, jupyter_height=1000)

    # For local desktop development, use:
    # app.run(debug=True)

We set up the entry point for running the app. We print a quick preview of the dataset to confirm what’s available, and then launch the Dash server. In Colab, we can run it inline; for local desktop development, we can simply switch to the regular app.run(debug=True).

In conclusion, we integrate interactive charts, responsive layouts, and Dash’s callback mechanism into a cohesive application. We see how the callbacks orchestrate communication between user input and dynamic updates, turning static visuals into powerful interactive tools. With the ability to operate smoothly both locally and online, this approach provides a versatile foundation that we can extend for broader applications.

Check out the FULL CODES here.

The post How to Design an Interactive Dash and Plotly Dashboard with Callback Mechanisms for Local and Online Deployment? appeared first on MarkTechPost.

Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required

oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and KV cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10 GB while handling up to ~100K tokens of context.

But what’s new?

The recent updates add: (1) KV cache reads/writes that bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) Llama-3 FlashAttention-2 for stability; and (4) GPT-OSS memory reductions via “flash-attention-like” kernels and chunked MLP. The table published by the maintainer reports end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; noted throughput “≈ 1 tok/2 s”.

GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.

Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How it works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. It uses FlashAttention-2 with online softmax so the full attention matrix is never materialized, and chunks large MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the oLLM project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
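
To make the chunked-MLP idea concrete, here is a small generic PyTorch sketch (not oLLM's code): a long sequence flows through a large projection in slices, so the full intermediate activation is never held in memory at once.

import torch

def chunked_mlp(x: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor,
                chunk_size: int = 2048) -> torch.Tensor:
    """Up-project -> GELU -> down-project, one sequence chunk at a time.

    x: (seq_len, hidden); w_up: (hidden, intermediate); w_down: (intermediate, hidden).
    Peak memory holds only one (chunk_size, intermediate) activation instead of the full one.
    """
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size]
        hidden = torch.nn.functional.gelu(chunk @ w_up)  # (chunk, intermediate)
        outputs.append(hidden @ w_down)                  # (chunk, hidden)
        del hidden                                       # release the large intermediate early
    return torch.cat(outputs, dim=0)

# Example: a 100K-token sequence with a 4x expansion, processed 2K tokens at a time.
x = torch.randn(100_000, 1024)
w_up, w_down = torch.randn(1024, 4096), torch.randn(4096, 1024)
print(chunked_mlp(x, w_up, w_down).shape)  # torch.Size([100000, 1024])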

Supported models and GPUs

Out of the box the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper; Qwen3-Next requires a dev build of Transformers (≥ 4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total, ~3B active) that vendors typically position for multi-A100/H100 deployments; oLLM’s claim is that you can execute it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. This stands in contrast to vLLM docs, which suggest multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows Inference(…).DiskCache(…) wiring and generate(…) with a streaming text callback. (PyPI currently lists 0.4.1; the README references 0.4.2 changes.)

Performance expectations and trade-offs

Throughput: The maintainer reports ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti—usable for batch/offline analytics, not for interactive chat. SSD latency dominates.

Storage pressure: Long contexts require very large KV caches; oLLM writes these to SSD to keep VRAM flat. This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach is still storage-bound and workload-specific.

Hardware reality check: Running Qwen3-Next-80B “on consumer hardware” is feasible with oLLM’s disk-centric design, but typical high-throughput inference for this model still expects multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a drop-in replacement for production serving stacks like vLLM/TGI.

Bottom line

oLLM pushes a clear design point: keep precision high, push memory to SSD, and make ultra-long contexts viable on a single 8 GB NVIDIA GPU. It won’t match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it’s a pragmatic way to execute 8B–20B models comfortably and even step up to MoE-80B if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.

Check out the GITHUB REPO here.

The post Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload—No Quantization Required appeared first on MarkTechPost.

This AI Research Proposes an AI Agent Immune System for Adaptive Cyber …

Can your AI security stack profile, reason, and neutralize a live security threat in ~220 ms—without a central round-trip? A team of researchers from Google and University of Arkansas at Little Rock outline an agentic cybersecurity “immune system” built from lightweight, autonomous sidecar AI agents colocated with workloads (Kubernetes pods, API gateways, edge services). Instead of exporting raw telemetry to a SIEM and waiting on batched classifiers, each agent learns local behavioral baselines, evaluates anomalies using federated intelligence, and applies least-privilege mitigations directly at the point of execution. In a controlled cloud-native simulation, this edge-first loop cut decision-to-mitigation to ~220 ms (≈3.4× faster than centralized pipelines), achieved F1 ≈ 0.89, and held host overhead under 10% CPU/RAM—evidence that collapsing detection and enforcement into the workload plane can deliver both speed and fidelity without material resource penalties.

https://arxiv.org/abs/2509.20640

What does “Profile → Reason → Neutralize” mean at the primitive level?

Profile. Agents are deployed as sidecars/daemonsets alongside microservices and API gateways. They build behavioral fingerprints from execution traces, syscall paths, API call sequences, and inter-service flows. This local baseline adapts to short-lived pods, rolling deploys, and autoscaling—conditions that routinely break perimeter controls and static allowlists. Profiling is not just a threshold on counts; it retains structural features (order, timing, peer set) that allow detection of zero-day-like deviations. The research team frames this as continuous, context-aware baselining across ingestion and sensing layers so that “normal” is learned per workload and per identity boundary.

Reason. When an anomaly appears (for example, an unusual burst of high-entropy uploads from a low-trust principal or a never-seen-before API call graph), the local agent mixes anomaly scores with federated intelligence—shared indicators and model deltas learned by peers—to produce a risk estimate. Reasoning is designed to be edge-first: the agent decides without a round-trip to a central adjudicator, and the trust decision is continuous rather than a static role gate. This aligns with zero-trust—identity and context are evaluated at each request, not just at session start—and it reduces central bottlenecks that add seconds of latency under load.

Neutralize. If risk exceeds a context-sensitive threshold, the agent executes an immediate local control mapped to least-privilege actions: quarantine the container (pause/isolate), rotate a credential, apply a rate-limit, revoke a token, or tighten a per-route policy. Enforcement is written back to policy stores and logged with a human-readable rationale for audit. The fast path here is the core differentiator: in the reported evaluation, the autonomous path triggers in ~220 ms versus ~540–750 ms for centralized ML or firewall update pipelines, which translates into a ~70% latency reduction and fewer opportunities for lateral movement during the decision window.
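
As a conceptual illustration only (not the research team's implementation), the sketch below shows the shape of such an edge-side loop: a local anomaly score is blended with a federated risk prior, compared against a context-sensitive threshold, and mapped to a reversible, least-privilege mitigation; the weights and thresholds here are made up.

import time
from dataclasses import dataclass

@dataclass
class Observation:
    principal: str
    anomaly_score: float    # from the local behavioral baseline (0..1)
    federated_prior: float  # shared indicator/model signal from peer agents (0..1)
    blast_radius: str       # "low" or "high"; gates autonomy vs. human approval

# Illustrative thresholds; a real deployment would tune these per workload and context.
THRESHOLDS = {"low": 0.6, "high": 0.8}

def decide_and_act(obs: Observation) -> str:
    started = time.monotonic()
    # Reason: blend local evidence with federated intelligence into one risk estimate.
    risk = 0.7 * obs.anomaly_score + 0.3 * obs.federated_prior
    if risk < THRESHOLDS[obs.blast_radius]:
        action = "allow"
    elif obs.blast_radius == "low":
        # Neutralize locally with reversible, least-privilege primitives.
        action = "rate_limit_and_rotate_token"
    else:
        # High-blast-radius controls go through an approval gate instead of full autonomy.
        action = "quarantine_pending_approval"
    elapsed_ms = (time.monotonic() - started) * 1000
    print(f"{obs.principal}: risk={risk:.2f} action={action} decided in {elapsed_ms:.3f} ms")
    return action

decide_and_act(Observation("svc-payments", anomaly_score=0.92, federated_prior=0.75, blast_radius="low"))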

Where do the numbers come from, and what were the baselines?

The research team evaluated the architecture in a Kubernetes-native simulation spanning API abuse and lateral-movement scenarios. Against two typical baselines—(i) static rule pipelines and (ii) a batch-trained classifier—the agentic approach reports Precision 0.91 / Recall 0.87 / F1 0.89, while the baselines land near F1 0.64 (rules) and F1 0.79 (baseline ML). Decision latency falls to ~220 ms for local enforcement, compared with ~540–750 ms for centralized paths that require coordination with a controller or external firewall. Resource overhead on host services remains below 10% in CPU/RAM.

https://arxiv.org/abs/2509.20640

Why does this matter for zero-trust engineering, not just research graphs?

Zero-trust (ZT) calls for continuous verification at request-time using identity, device, and context. In practice, many ZT deployments still defer to central policy evaluators, so they inherit control-plane latency and queueing pathologies under load. By moving risk inference and enforcement to the autonomous edge, the architecture turns ZT posture from periodic policy pulls into a set of self-contained, continuously learning controllers that execute least-privilege changes locally and then synchronize state. That design simultaneously reduces mean time-to-contain (MTTC) and keeps decisions near the blast radius, which helps when inter-pod hops are measured in milliseconds. The research team also formalizes federated sharing to distribute indicators/model deltas without heavy raw-data movement, which is relevant for privacy boundaries and multi-tenant SaaS.
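
A toy sketch of the federated-sharing idea (illustrative, not the paper's protocol): each agent ships only a small weight delta computed from local telemetry, and the averaged delta is broadcast back, so raw events never leave the workload boundary.

import numpy as np

# Each sidecar agent computes a weight delta from its own telemetry and shares only that.
def local_delta(local_gradient: np.ndarray, lr: float = 0.1) -> np.ndarray:
    return -lr * local_gradient  # update derived from local data that never leaves the node

global_weights = np.zeros(4)  # toy detector weights shared by all agents

# Deltas reported by three agents (stand-ins for real local training results).
agent_deltas = [
    local_delta(np.array([0.2, -0.1, 0.0, 0.4])),
    local_delta(np.array([0.1, -0.3, 0.2, 0.1])),
    local_delta(np.array([0.3, 0.0, -0.1, 0.2])),
]

# Federated aggregation: average the deltas and broadcast the updated weights to every agent.
global_weights = global_weights + np.mean(agent_deltas, axis=0)
print(global_weights)  # only a few floats per agent crossed the privacy boundary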

How does it integrate with existing stacks—Kubernetes, APIs, and identity?

Operationally, the agents are co-located with workloads (sidecar or node daemon). On Kubernetes, they can hook CNI-level telemetry for flow features, container runtime events for process-level signals, and envoy/nginx spans at API gateways for request graphs. For identity, they consume claims from your IdP and compute continuous trust scores that factor recent behavior and environment (e.g., geo-risk, device posture). Mitigations are expressed as idempotent primitives—network micro-policy updates, token revocation, per-route quotas—so they are straightforward to roll back or tighten incrementally. The architecture’s control loop (sense → reason → act → learn) is strictly feedback-driven and supports both human-in-the-loop (policy windows, approval gates for high-blast-radius changes) and autonomy for low-impact actions.

What are the governance and safety guardrails?

Speed without auditability is a non-starter in regulated environments. The research team emphasizes explainable decision logs that capture which signals and thresholds led to the action, with signed and versioned policy/model artifacts. It also discusses privacy-preserving modes—keeping sensitive data local while sharing model updates; differentially private updates are mentioned as an option in stricter regimes. For safety, the system supports override/rollback and staged rollouts (e.g., canarying new mitigation templates in non-critical namespaces). This is consistent with broader security work on threats and guardrails for agentic systems; if your org is adopting multi-agent pipelines, cross-check against current threat models for agent autonomy and tool use.

How do the reported results translate to production posture?

The evaluation is a 72-hour cloud-native simulation with injected behaviors: API misuse patterns, lateral movement, and zero-day-like deviations. Real systems will add messier signals (e.g., noisy sidecars, multi-cluster networking, mixed CNI plugins), which affect both detection and enforcement timing. That said, the fast-path structure—local decision + local action—is topology-agnostic and should preserve order-of-magnitude latency gains so long as mitigations are mapped to primitives available in your mesh/runtime. For production, begin with observe-only agents to build baselines, then turn on mitigations for low-risk actions (quota clamps, token revokes), then gate high-blast-radius controls (network slicing, container quarantine) behind policy windows until confidence/coverage metrics are green.

How does this sit in the broader agentic-security landscape?

There is growing research on securing agent systems and using agent workflows for security tasks. The work discussed here focuses on defense via agent autonomy close to workloads. In parallel, other work tackles threat modeling for agentic AI, secure A2A protocol usage, and agentic vulnerability testing. If you adopt the architecture, pair it with a current agent-security threat model and a test harness that exercises tool-use boundaries and memory safety of agents.

Comparative Results (Kubernetes simulation)

Metric (static rules pipeline / baseline ML batch classifier / agentic framework with edge autonomy):
Precision: 0.71 / 0.83 / 0.91
Recall: 0.58 / 0.76 / 0.87
F1: 0.64 / 0.79 / 0.89
Decision-to-mitigation latency: ~750 ms / ~540 ms / ~220 ms
Host overhead (CPU/RAM): Moderate / Moderate / <10%

Key Takeaways

Edge-first “cybersecurity immune system.” Lightweight sidecar/daemon AI agents colocated with workloads (Kubernetes pods, API gateways) learn behavioral fingerprints, decide locally, and enforce least-privilege mitigations without SIEM round-trips.

Measured performance. Reported decision-to-mitigation is ~220 ms—about 3.4× faster than centralized pipelines (≈540–750 ms)—with F1 ≈ 0.89 (P≈0.91, R≈0.87) in a Kubernetes simulation.

Low operational cost. Host overhead remains <10% CPU/RAM, making the approach practical for microservices and edge nodes.

Profile → Reason → Neutralize loop. Agents continuously baseline normal activity (profile), fuse local signals with federated intelligence for risk scoring (reason), and apply immediate, reversible controls such as container quarantine, token rotation, and rate-limits (neutralize).

Zero-trust alignment. Decisions are continuous and context-aware (identity, device, geo, workload), replacing static role gates and reducing dwell time and lateral movement risk.

Governance and safety. Actions are logged with explainable rationales; policies/models are signed and versioned; high-blast-radius mitigations can be gated behind human-in-the-loop and staged rollouts.

Summary

Treat defense as a distributed control plane made of profiling, reasoning, and neutralizing agents that act where the threat lives. The reported profile—~220 ms actions, ≈ 3.4× faster than centralized baselines, F1 ≈ 0.89, <10% overhead—is consistent with what you’d expect when you eliminate central hops and let autonomy handle least-privilege mitigations locally. It aligns with zero-trust’s continuous verification and gives teams a practical path to self-stabilizing operations: learn normal, flag deviations with federated context, and contain early—before lateral movement outpaces your control plane.

Check out the Paper and GitHub Page.

The post This AI Research Proposes an AI Agent Immune System for Adaptive Cybersecurity: 3.4× Faster Containment with <10% Overhead appeared first on MarkTechPost.

Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots …

Can a single AI stack plan like a researcher, reason over scenes, and transfer motions across different robots—without retraining from scratch? Google DeepMind’s Gemini Robotics 1.5 says yes, by splitting embodied intelligence into two models: Gemini Robotics-ER 1.5 for high-level embodied reasoning (spatial understanding, planning, progress/success estimation, tool-use) and Gemini Robotics 1.5 for low-level visuomotor control. The system targets long-horizon, real-world tasks (e.g., multi-step packing, waste sorting with local rules) and introduces motion transfer to reuse data across heterogeneous platforms.

https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/

What actually is the stack?

Gemini Robotics-ER 1.5 (reasoner/orchestrator): A multimodal planner that ingests images/video (and optionally audio), grounds references via 2D points, tracks progress, and invokes external tools (e.g., web search or local APIs) to fetch constraints before issuing sub-goals. It’s available via the Gemini API in Google AI Studio.

Gemini Robotics 1.5 (VLA controller): A vision-language-action model that converts instructions and percepts into motor commands, producing explicit “think-before-act” traces to decompose long tasks into short-horizon skills. Availability is limited to selected partners during the initial rollout.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
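Since ER 1.5 is reachable through the Gemini API, a minimal sketch of a grounding query might look like the following. It uses the google-genai SDK; the model string is an assumed preview identifier and the prompt format is illustrative, so check the official docs before relying on either.

```python
from google import genai
from google.genai import types

# Assumes GEMINI_API_KEY is set in the environment; the model string below is an
# assumed preview identifier, not confirmed by this article.
client = genai.Client()

with open("workbench.png", "rb") as f:
    image = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",
    contents=[
        image,
        "Point to the blue mug (normalized 2D coordinates) and outline a short plan "
        "for moving it onto the tray.",
    ],
)
print(response.text)
```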

Why split cognition from control?

Earlier end-to-end VLAs (Vision-Language-Action) struggle to plan robustly, verify success, and generalize across embodiments. Gemini Robotics 1.5 isolates those concerns: Gemini Robotics-ER 1.5 handles deliberation (scene reasoning, sub-goaling, success detection), while the VLA specializes in execution (closed-loop visuomotor control). This modularity improves interpretability (visible internal traces), error recovery, and long-horizon reliability.

Motion Transfer across embodiments

A core contribution is Motion Transfer (MT): training the VLA on a unified motion representation built from heterogeneous robot data—ALOHA, bi-arm Franka, and Apptronik Apollo—so skills learned on one platform can zero-shot transfer to another. This reduces per-robot data collection and narrows sim-to-real gaps by reusing cross-embodiment priors.

Quantitative signals

The research team showcased controlled A/B comparisons on real hardware and aligned MuJoCo scenes. These include:

Generalization: Robotics 1.5 surpasses prior Gemini Robotics baselines in instruction following, action generalization, visual generalization, and task generalization across the three platforms.

Zero-shot cross-robot skills: MT yields measurable gains in progress and success when transferring skills across embodiments (e.g., Franka→ALOHA, ALOHA→Apollo), rather than merely improving partial progress.

“Thinking” improves acting: Enabling VLA thought traces increases long-horizon task completion and stabilizes mid-rollout plan revisions.

End-to-end agent gains: Pairing Gemini Robotics-ER 1.5 with the VLA agent substantially improves progress on multi-step tasks (e.g., desk organization, cooking-style sequences) versus a Gemini-2.5-Flash-based baseline orchestrator.

https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf

Safety and evaluation

DeepMind research team highlights layered controls: policy-aligned dialog/planning, safety-aware grounding (e.g., not pointing to hazardous objects), low-level physical limits, and expanded evaluation suites (e.g., ASIMOV/ASIMOV-style scenario testing and auto red-teaming to elicit edge-case failures). The goal is to catch hallucinated affordances or nonexistent objects before actuation.

Competitive/industry context

Gemini Robotics 1.5 is a shift from “single-instruction” robotics toward agentic, multi-step autonomy with explicit web/tool use and cross-platform learning, a capability set relevant to consumer and industrial robotics. Early partner access centers on established robotics vendors and humanoid platforms.

Key Takeaways

Two-model architecture (ER ↔ VLA): Gemini Robotics-ER 1.5 handles embodied reasoning—spatial grounding, planning, success/progress estimation, tool calls—while Robotics 1.5 is the vision-language-action executor that issues motor commands.

“Think-before-act” control: The VLA produces explicit intermediate reasoning/traces during execution, improving long-horizon decomposition and mid-task adaptation.

Motion Transfer across embodiments: A single VLA checkpoint reuses skills across heterogeneous robots (ALOHA, bi-arm Franka, Apptronik Apollo), enabling zero-/few-shot cross-robot execution rather than per-platform retraining.

Tool-augmented planning: ER 1.5 can invoke external tools (e.g., web search) to fetch constraints, then condition plans—e.g., packing after checking local weather or applying city-specific recycling rules.

Quantified improvements over prior baselines: The tech report documents higher instruction/action/visual/task generalization and better progress/success on real hardware and aligned simulators; results cover cross-embodiment transfers and long-horizon tasks.

Availability and access: ER 1.5 is available via the Gemini API (Google AI Studio) with docs, examples, and preview knobs; Robotics 1.5 (VLA) is limited to select partners with a public waitlist.

Safety & evaluation posture: DeepMind highlights layered safeguards (policy-aligned planning, safety-aware grounding, physical limits) and an upgraded ASIMOV benchmark plus adversarial evaluations to probe risky behaviors and hallucinated affordances.

Summary

Gemini Robotics 1.5 operationalizes a clean separation of embodied reasoning and control, adds motion transfer to recycle data across robots, and showcases the reasoning surface (point grounding, progress/success estimation, tool calls) to developers via the Gemini API. For teams building real-world agents, the design reduces per-platform data burden and strengthens long-horizon reliability—while keeping safety in scope with dedicated test suites and guardrails.

Check out the Paper and Technical details.

The post Gemini Robotics 1.5: DeepMind’s ER↔VLA Stack Brings Agentic Robots to the Real World appeared first on MarkTechPost.

Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses …

Local LLMs matured fast in 2025: open-weight families like Llama 3.1 (128K context length (ctx)), Qwen3 (Apache-2.0, dense + MoE), Gemma 2 (9B/27B, 8K ctx), Mixtral 8×7B (Apache-2.0 SMoE), and Phi-4-mini (3.8B, 128K ctx) now ship reliable specs and first-class local runners (GGUF/llama.cpp, LM Studio, Ollama), making on-prem and even laptop inference practical if you match context length and quantization to VRAM. This guide lists the ten most deployable options by license clarity, stable GGUF availability, and reproducible performance characteristics (params, context length (ctx), quant presets).

Top 10 Local LLMs (2025)

1) Meta Llama 3.1-8B — robust “daily driver,” 128K context

Why it matters. A stable, multilingual baseline with long context and first-class support across local toolchains.
Specs. Dense 8B decoder-only; official 128K context; instruction-tuned and base variants. Llama license (open weights). Common GGUF builds and Ollama recipes exist. Typical setup: Q4_K_M/Q5_K_M for ≤12–16 GB VRAM, Q6_K for ≥24 GB.
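For illustration, a minimal llama-cpp-python sketch of running a Q4_K_M build with a reduced context window that fits a 12–16 GB card; the GGUF filename and the 32K context value are assumptions you would adjust to your own download and VRAM budget.

```python
from llama_cpp import Llama

# Assumed local GGUF path; 32K context is a VRAM-friendly slice of the model's 128K maximum.
llm = Llama(
    model_path="./Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=32768,       # trade context length for KV-cache memory
    n_gpu_layers=-1,   # offload all layers if VRAM allows; lower this on smaller cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the trade-off between Q4_K_M and Q6_K quantization."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```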

2) Meta Llama 3.2-1B/3B — edge-class, 128K context, on-device friendly

Why it matters. Small models that still take 128K tokens and run acceptably on CPUs/iGPUs when quantized; good for laptops and mini-PCs.
Specs. 1B/3B instruction-tuned models; 128K context confirmed by Meta. Works well via llama.cpp GGUF and LM Studio’s multi-runtime stack (CPU/CUDA/Vulkan/Metal/ROCm).

3) Qwen3-14B / 32B — open Apache-2.0, strong tool-use & multilingual

Why it matters. Broad family (dense+MoE) under Apache-2.0 with active community ports to GGUF; widely reported as a capable general/agentic “daily driver” locally.
Specs. 14B/32B dense checkpoints with long-context variants; modern tokenizer; rapid ecosystem updates. Start at Q4_K_M for 14B on 12 GB; move to Q5/Q6 when you have 24 GB+. (Qwen)

4) DeepSeek-R1-Distill-Qwen-7B — compact reasoning that fits

Why it matters. Distilled from R1-style reasoning traces; delivers step-by-step quality at 7B with widely available GGUFs. Excellent for math/coding on modest VRAM.
Specs. 7B dense; long-context variants exist per conversion; curated GGUFs cover F32→Q4_K_M. For 8–12 GB VRAM try Q4_K_M; for 16–24 GB use Q5/Q6.

5) Google Gemma 2-9B / 27B — efficient dense; 8K context (explicit)

Why it matters. Strong quality-for-size and quantization behavior; 9B is a great mid-range local model.
Specs. Dense 9B/27B; 8K context (don’t overstate); open weights under Gemma terms; widely packaged for llama.cpp/Ollama. 9B@Q4_K_M runs on many 12 GB cards.

6) Mixtral 8×7B (SMoE) — Apache-2.0 sparse MoE; cost/perf workhorse

Why it matters. Mixture-of-Experts throughput benefits at inference: ~2 experts/token selected at runtime; great compromise when you have ≥24–48 GB VRAM (or multi-GPU) and want stronger general performance.
Specs. 8 experts of 7B each (sparse activation); Apache-2.0; instruct/base variants; mature GGUF conversions and Ollama recipes.

7) Microsoft Phi-4-mini-3.8B — small model, 128K context

Why it matters. Realistic “small-footprint reasoning” with 128K context and grouped-query attention; solid for CPU/iGPU boxes and latency-sensitive tools.
Specs. 3.8B dense; 200k vocab; SFT/DPO alignment; model card documents 128K context and training profile. Use Q4_K_M on ≤8–12 GB VRAM.

8) Microsoft Phi-4-Reasoning-14B — mid-size reasoning (check ctx per build)

Why it matters. A 14B reasoning-tuned variant that is materially better for chain-of-thought-style tasks than generic 13–15B baselines.
Specs. Dense 14B; context varies by distribution (model card for a common release lists 32K). For 24 GB VRAM, Q5_K_M/Q6_K is comfortable; mixed-precision runners (non-GGUF) need more.

9) Yi-1.5-9B / 34B — Apache-2.0 bilingual; 4K/16K/32K variants

Why it matters. Competitive EN/zh performance and permissive license; 9B is a strong alternative to Gemma-2-9B; 34B steps toward higher reasoning under Apache-2.0.
Specs. Dense; context variants 4K/16K/32K; open weights under Apache-2.0 with active HF cards/repos. For 9B use Q4/Q5 on 12–16 GB.

10) InternLM 2 / 2.5-7B / 20B — research-friendly; math-tuned branches

Why it matters. An open series with lively research cadence; 7B is a practical local target; 20B moves you toward Gemma-2-27B-class capability (at higher VRAM).
Specs. Dense 7B/20B; multiple chat/base/math variants; active HF presence. GGUF conversions and Ollama packs are common.

source: marktechpost.com

Summary

In local LLMs, the trade-offs are clear: pick dense models for predictable latency and simpler quantization (e.g., Llama 3.1-8B with a documented 128K context; Gemma 2-9B/27B with an explicit 8K window), move to sparse MoE like Mixtral 8×7B when your VRAM and parallelism justify higher throughput per cost, and treat small reasoning models (Phi-4-mini-3.8B, 128K) as the sweet spot for CPU/iGPU boxes. Licenses and ecosystems matter as much as raw scores: Qwen3’s Apache-2.0 releases (dense + MoE) and Meta/Google/Microsoft model cards give the operational guardrails (context, tokenizer, usage terms) you’ll actually live with. On the runtime side, standardize on GGUF/llama.cpp for portability, layer Ollama/LM Studio for convenience and hardware offload, and size quantization (Q4→Q6) to your memory budget. In short: choose by context + license + hardware path, not just leaderboard vibes.

The post Top 10 Local LLMs (2025): Context Windows, VRAM Targets, and Licenses Compared appeared first on MarkTechPost.

The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietar …

Google released updated versions of the Gemini 2.5 Flash and Gemini 2.5 Flash-Lite preview models across AI Studio and Vertex AI, plus rolling aliases—gemini-flash-latest and gemini-flash-lite-latest—that always point to the newest preview in each family. For production stability, Google advises pinning fixed strings (gemini-2.5-flash, gemini-2.5-flash-lite). Google will give two weeks’ email notice before retargeting a -latest alias, and notes that rate limits, features, and cost may vary across alias updates.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

What actually changed?

Flash: Improved agentic tool use and more efficient “thinking” (multi-pass reasoning). Google reports a +5 point lift on SWE-Bench Verified vs. the May preview (48.9% → 54.0%), indicating better long-horizon planning/code navigation.

Flash-Lite: Tuned for stricter instruction following, reduced verbosity, and stronger multimodal/translation. Google’s internal chart shows ~50% fewer output tokens for Flash-Lite and ~24% fewer for Flash, which directly cuts output-token spend and wall-clock time in throughput-bound services.

https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/

Independent Stats from the community thread

Artificial Analysis (the account behind the AI benchmarking site) received pre-release access and published external measurements across intelligence and speed. Highlights from the thread and companion pages:

Throughput: In endpoint tests, Gemini 2.5 Flash-Lite (Preview 09-2025, reasoning) is reported as the fastest proprietary model they track, at roughly 887 output tokens/s on AI Studio in their setup.

Intelligence index deltas: The September previews for Flash and Flash-Lite improve on Artificial Analysis’ aggregate “intelligence” scores compared with prior stable releases (site pages break down reasoning vs. non-reasoning tracks and blended price assumptions).

Token efficiency: The thread reiterates Google’s own reduction claims (−24% Flash, −50% Flash-Lite) and frames the win as cost-per-success improvements for tight latency budgets.

“Google shared pre-release access for the new Gemini 2.5 Flash & Flash-Lite Preview 09-2025 models. We’ve independently benchmarked gains in intelligence (particularly for Flash-Lite), output speed and token efficiency compared to predecessors.” — Artificial Analysis (@ArtificialAnlys), September 25, 2025

Cost surface and context budgets (for deployment choices)

Flash-Lite GA list price is $0.10 / 1M input tokens and $0.40 / 1M output tokens (Google’s July GA post and DeepMind’s model page). That baseline is where verbosity reductions translate to immediate savings.

Context: Flash-Lite supports ~1M-token context with configurable “thinking budgets” and tool connectivity (Search grounding, code execution)—useful for agent stacks that interleave reading, planning, and multi-tool calls.

Browser-agent angle and the o3 claim

A circulating claim says the “new Gemini Flash has o3-level accuracy, but is 2× faster and 4× cheaper on browser-agent tasks.” This is community-reported, not in Google’s official post. It likely traces to private/limited task suites (DOM navigation, action planning) with specific tool budgets and timeouts. Use it as a hypothesis for your own evals; don’t treat it as a cross-bench truth.

“This is insane! The new Gemini Flash model released yesterday has the same accuracy as o3, but it is 2x faster and 4x cheaper for browser agent tasks. I ran evaluations the whole day and could not believe this. The previous gemini-2.5-flash had only 71% on this benchmark.” — Magnus Müller (@mamagnus00), September 26, 2025

Practical guidance for teams

Pin vs. chase -latest: If you depend on strict SLAs or fixed limits, pin the stable strings. If you continuously canary for cost/latency/quality, the -latest aliases reduce upgrade friction (Google provides two weeks’ notice before switching the pointer).

High-QPS or token-metered endpoints: Start with Flash-Lite preview; the verbosity and instruction-following upgrades shrink egress tokens. Validate multimodal and long-context traces under production load.

Agent/tool pipelines: A/B Flash preview where multi-step tool use dominates cost or failure modes; Google’s SWE-Bench Verified lift and community tokens/s figures suggest better planning under constrained thinking budgets.

Model strings (current)

Previews: gemini-2.5-flash-preview-09-2025, gemini-2.5-flash-lite-preview-09-2025

Stable: gemini-2.5-flash, gemini-2.5-flash-lite

Rolling aliases: gemini-flash-latest, gemini-flash-lite-latest (pointer semantics; may change features/limits/pricing).
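A short sketch of how the pinning choice shows up in code with the google-genai SDK: swap the model string between a pinned ID and a rolling alias, and cap reasoning with a thinking budget. The budget value and prompt are illustrative, and alias behavior (limits, pricing, features) can shift when Google retargets the pointer.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY in the environment

PINNED = "gemini-2.5-flash-lite"      # stable string for SLA-sensitive services
ROLLING = "gemini-flash-lite-latest"  # pointer alias; Google may retarget it with notice

def classify(ticket: str, model: str = PINNED) -> str:
    response = client.models.generate_content(
        model=model,
        contents=f"Classify this support ticket as billing, bug, or other: {ticket}",
        config=types.GenerateContentConfig(
            # Illustrative: a zero thinking budget keeps latency and output tokens down.
            thinking_config=types.ThinkingConfig(thinking_budget=0),
            max_output_tokens=16,
        ),
    )
    return response.text

print(classify("I was charged twice for my September invoice."))
```

Canarying the same function with model=ROLLING alongside the pinned string is one way to catch cost or quality drift before an alias update reaches production.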

Summary

Google’s new release update tightens tool-use competence (Flash) and token/latency efficiency (Flash-Lite) and introduces -latest aliases for faster iteration. External benchmarks from Artificial Analysis indicate meaningful throughput and intelligence-index gains for the September 2025 previews, with Flash-Lite now testing as the fastest proprietary model in their harness. Validate on your workload—especially browser-agent stacks—before committing to the aliases in production.

The post The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens appeared first on MarkTechPost.

Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to T …

Hugging Face (HF) has released Smol2Operator, a reproducible, end-to-end recipe that turns a small vision-language model (VLM) with no prior UI grounding into a GUI-operating, tool-using agent. The release covers data transformation utilities, training scripts, transformed datasets, and the resulting 2.2B-parameter model checkpoint—positioned as a complete blueprint for building GUI agents from scratch rather than a single benchmark result.

But what’s new?

Two-phase post-training over a small VLM: Starting from SmolVLM2-2.2B-Instruct—a model that “initially has no grounding capabilities for GUI tasks”—Smol2Operator first instills perception/grounding, then layers agentic reasoning with supervised fine-tuning (SFT).

Unified action space across heterogeneous sources: A conversion pipeline normalizes disparate GUI action taxonomies (mobile, desktop, web) into a single, consistent function API (e.g., click, type, drag, normalized [0,1] coordinates), enabling coherent training across datasets. An Action Space Converter supports remapping to custom vocabularies.
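To illustrate the unified-action-space idea (this is not Smol2Operator's actual converter code), the hypothetical sketch below maps a source-specific action into a shared function signature with coordinates normalized to [0,1], so the same sample survives image resizing. The action names and source schema are assumptions.

```python
# Hypothetical converter: source-specific GUI actions -> unified, resolution-independent API.
UNIFIED_ACTIONS = {"click", "type", "drag"}

def normalize_xy(x_px: float, y_px: float, width: int, height: int) -> tuple[float, float]:
    """Convert pixel coordinates to [0, 1] so actions stay valid under image resizing."""
    return round(x_px / width, 4), round(y_px / height, 4)

def convert(source_action: dict, width: int, height: int) -> dict:
    """Map an assumed mobile-style record (e.g., 'tap') into the unified signature."""
    name_map = {"tap": "click", "input_text": "type", "swipe": "drag"}
    name = name_map.get(source_action["name"], source_action["name"])
    if name not in UNIFIED_ACTIONS:
        raise ValueError(f"unsupported action: {name}")
    x, y = normalize_xy(source_action["x"], source_action["y"], width, height)
    return {"name": name, "x": x, "y": y, "text": source_action.get("text")}

# Example: a tap at (540, 720) on a 1080x2400 screenshot becomes click(x=0.5, y=0.3).
print(convert({"name": "tap", "x": 540, "y": 720}, width=1080, height=2400))
```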

But why Smol2Operator?

Most GUI-agent pipelines are blocked by fragmented action schemas and non-portable coordinates. Smol2Operator’s action-space unification and normalized coordinate strategy make datasets interoperable and training stable under image resizing, which is common in VLM preprocessing. This reduces the engineering overhead of assembling multi-source GUI data and lowers the barrier to reproducing agent behavior with small models.

How it works? Training stack and data path

Data standardization:

Parse and normalize function calls from source datasets (e.g., AGUVIS stages) into a unified signature set; remove redundant actions; standardize parameter names; convert pixel coordinates to normalized coordinates.

Phase 1 (Perception/Grounding):

SFT on the unified action dataset to learn element localization and basic UI affordances, measured on ScreenSpot-v2 (element localization on screenshots).

Phase 2 (Cognition/Agentic reasoning):

Additional SFT to convert grounded perception into step-wise action planning aligned with the unified action API.

The HF team reports a clean performance trajectory on ScreenSpot-v2 (benchmark) as grounding is learned, and shows the same training strategy scaling down to a ~460M “nanoVLM,” indicating the method’s portability across capacities (numbers are presented in the post’s tables).

Scope, limits, and next steps

Not a “SOTA at all costs” push: The HF team frames the work as a process blueprint—owning data conversion → grounding → reasoning—rather than chasing leaderboard peaks.

Evaluation focus: Demonstrations center on ScreenSpot-v2 perception and qualitative end-to-end task videos; broader cross-environment, cross-OS, or long-horizon task benchmarks are future work. The HF team notes potential gains from RL/DPO beyond SFT for on-policy adaptation.

Ecosystem trajectory: ScreenEnv’s roadmap includes wider OS coverage (Android/macOS/Windows), which would increase external validity of trained policies.

Summary

Smol2Operator is a fully open-source, reproducible pipeline that upgrades SmolVLM2-2.2B-Instruct—a VLM with zero GUI grounding—into an agentic GUI coder via a two-phase SFT process. The release standardizes heterogeneous GUI action schemas into a unified API with normalized coordinates, provides transformed AGUVIS-based datasets, publishes training notebooks and preprocessing code, and ships a final checkpoint plus a demo Space. It targets process transparency and portability over leaderboard chasing, and slots into the smolagents runtime with ScreenEnv for evaluation, offering a practical blueprint for teams building small, operator-grade GUI agents.

Check out the Technical details and Full Collection on HF.

The post Hugging Face Releases Smol2Operator: A Fully Open-Source Pipeline to Train a 2.2B VLM into an Agentic GUI Coder appeared first on MarkTechPost.

Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves …

Table of contents
What problem is it actually solving?
Does the sample-efficiency claim hold beyond toy problems?
How does the evolutionary loop look in practice?
What are the concrete results?
How does this compare to AlphaEvolve and related systems?
Summary
FAQs — ShinkaEvolve

Sakana AI has released ShinkaEvolve, an open-source framework that uses large language models (LLMs) as mutation operators in an evolutionary loop to evolve programs for scientific and engineering problems—while drastically cutting the number of evaluations needed to reach strong solutions. On the canonical circle-packing benchmark (n=26 in a unit square), ShinkaEvolve reports a new SOTA configuration using ~150 program evaluations, where prior systems typically burned thousands. The project ships under Apache-2.0, with a research report and public code.

https://sakana.ai/shinka-evolve/

What problem is it actually solving?

Most “agentic” code-evolution systems explore by brute force: they mutate code, run it, score it, and repeat—consuming enormous sampling budgets. ShinkaEvolve targets that waste explicitly with three interacting components:

Adaptive parent sampling to balance exploration/exploitation. Parents are drawn from “islands” via fitness- and novelty-aware policies (power-law or weighted by performance and offspring counts) rather than always climbing the current best.

Novelty-based rejection filtering to avoid re-evaluating near-duplicates. Mutable code segments are embedded; if cosine similarity exceeds a threshold, a secondary LLM acts as a “novelty judge” before execution.

Bandit-based LLM ensembling so the system learns which model (e.g., GPT/Gemini/Claude/DeepSeek families) is yielding the biggest relative fitness jumps and routes future mutations accordingly (UCB1-style update on improvement over parent/baseline).
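A minimal, hypothetical sketch of the bandit idea (not ShinkaEvolve's implementation): treat each LLM as an arm, reward it by the fitness improvement of its offspring over the parent, and pick the next mutator by UCB1 score. The model names and the exploration constant are placeholders.

```python
import math
import random

ARMS = ["gpt-style", "gemini-style", "claude-style", "deepseek-style"]  # placeholder names
counts = {arm: 0 for arm in ARMS}
rewards = {arm: 0.0 for arm in ARMS}

def select_llm(c: float = 1.4) -> str:
    """UCB1: mean reward plus an exploration bonus that shrinks as an arm is pulled more."""
    for arm in ARMS:                      # try every arm once before exploiting
        if counts[arm] == 0:
            return arm
    total = sum(counts.values())
    return max(ARMS, key=lambda a: rewards[a] / counts[a]
               + c * math.sqrt(math.log(total) / counts[a]))

def update(arm: str, parent_fitness: float, child_fitness: float) -> None:
    """Reward = relative improvement of the mutated program over its parent (clipped at zero)."""
    counts[arm] += 1
    rewards[arm] += max(0.0, child_fitness - parent_fitness)

# Toy loop: pretend one model tends to produce larger fitness jumps and watch routing adapt.
for _ in range(50):
    arm = select_llm()
    jump = random.gauss(0.05 if arm == "gemini-style" else 0.01, 0.02)
    update(arm, parent_fitness=1.0, child_fitness=1.0 + jump)
print(counts)
```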

Does the sample-efficiency claim hold beyond toy problems?

The research team evaluates four distinct domains and shows consistent gains with small budgets:

Circle packing (n=26): reaches an improved configuration in roughly 150 evaluations; the research team also validates with stricter exact-constraint checking.

AIME math reasoning (2024 set): evolves agentic scaffolds that trace out a Pareto frontier of accuracy vs. LLM-call budget, outperforming hand-built baselines under limited query budgets and transferring to other AIME years and LLMs.

Competitive programming (ALE-Bench LITE): starting from ALE-Agent solutions, ShinkaEvolve delivers ~2.3% mean improvement across 10 tasks and pushes one task’s solution from 5th → 2nd in an AtCoder leaderboard counterfactual.

LLM training (Mixture-of-Experts): evolves a new load-balancing loss that improves perplexity and downstream accuracy at multiple regularization strengths vs. the widely-used global-batch LBL.

https://sakana.ai/shinka-evolve/

How does the evolutionary loop look in practice?

ShinkaEvolve maintains an archive of evaluated programs with fitness, public metrics, and textual feedback. For each generation: sample an island and parent(s); construct a mutation context with top-K and random “inspiration” programs; then propose edits via three operators—diff edits, full rewrites, and LLM-guided crossovers—while protecting immutable code regions with explicit markers. Executed candidates update both the archive and the bandit statistics that steer subsequent LLM/model selection. The system periodically produces a meta-scratchpad that summarizes recently successful strategies; those summaries are fed back into prompts to accelerate later generations.

What are the concrete results?

Circle packing: combined structured initialization (e.g., golden-angle patterns), hybrid global–local search (simulated annealing + SLSQP), and escape mechanisms (temperature reheating, ring rotations) discovered by the system—not hand-coded a priori.

AIME scaffolds: three-stage expert ensemble (generation → critical peer review → synthesis) that hits the accuracy/cost sweet spot at ~7 calls while retaining robustness when swapped to different LLM backends.

ALE-Bench: targeted engineering wins (e.g., caching kd-tree subtree stats; “targeted edge moves” toward misclassified items) that push scores without wholesale rewrites.

MoE loss: adds an entropy-modulated under-use penalty to the global-batch objective; empirically reduces miss-routing and improves perplexity/benchmarks as layer routing concentrates.

How does this compare to AlphaEvolve and related systems?

AlphaEvolve demonstrated strong closed-source results but at higher evaluation counts. ShinkaEvolve reproduces and surpasses the circle-packing result with orders-of-magnitude fewer samples and releases all components open-source. The research team also contrast variants (single-model vs. fixed ensemble vs. bandit ensemble) and ablate parent selection and novelty filtering, showing each contributes to the observed efficiency.

Summary

ShinkaEvolve is an Apache-2.0 framework for LLM-driven program evolution that cuts evaluations from thousands to hundreds by combining fitness/novelty-aware parent sampling, embedding-plus-LLM novelty rejection, and a UCB1-style adaptive LLM ensemble. It sets a new SOTA on circle packing (~150 evals), finds stronger AIME scaffolds under strict query budgets, improves ALE-Bench solutions (~2.3% mean gain, 5th→2nd on one task), and discovers a new MoE load-balancing loss that improves perplexity and downstream accuracy. Code and report are public.

FAQs — ShinkaEvolve

1) What is ShinkaEvolve? An open-source framework that couples LLM-driven program mutations with evolutionary search to automate algorithm discovery and optimization. Code and report are public.

2) How does it achieve higher sample-efficiency than prior evolutionary systems? Three mechanisms: adaptive parent sampling (explore/exploit balance), novelty-based rejection to avoid duplicate evaluations, and a bandit-based selector that routes mutations to the most promising LLMs.

3) What supports the results? It reaches state-of-the-art circle packing with ~150 evaluations; on AIME-2024 it evolves scaffolds under a 10-query cap per problem; it improves ALE-Bench solutions over strong baselines.

4) Where can I run it and what’s the license? The GitHub repo provides a WebUI and examples; ShinkaEvolve is released under Apache-2.0.

Check out the Technical details, Paper and GitHub Page.

The post Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves Programs for Scientific Discovery with Unprecedented Sample-Efficiency appeared first on MarkTechPost.

Google AI Ships a Model Context Protocol (MCP) Server for Data Commons …

Google released a Model Context Protocol (MCP) server for Data Commons, exposing the project’s interconnected public datasets—census, health, climate, economics—through a standards-based interface that agentic systems can query in natural language. The Data Commons MCP Server is available now with quickstarts for Gemini CLI and Google’s Agent Development Kit (ADK).

What was released

An MCP server that lets any MCP-capable client or AI agent discover variables, resolve entities, fetch time series, and generate reports from Data Commons without hand-coding API calls. Google positions it as “from initial discovery to generative reports,” with example prompts spanning exploratory, analytical, and generative workflows.

Developer on-ramps: a PyPI package, a Gemini CLI flow, and an ADK sample/Colab to embed Data Commons queries inside agent pipelines.

Why MCP now?

MCP is an open protocol for connecting LLM agents to external tools and data with consistent capabilities (tools, prompts, resources) and transport semantics. By shipping a first-party MCP server, Google makes Data Commons addressable through the same interface that agents already use for other sources, reducing per-integration glue code and enabling registry-based discovery alongside other servers.

What you can do with it?

Exploratory: “What health data do you have for Africa?” → enumerate variables, coverage, and sources.

Analytical: “Compare life expectancy, inequality, and GDP growth for BRICS nations.” → retrieve series, normalize geos, align vintages, and return a table or chart payload.

Generative: “Generate a concise report on income vs. diabetes in US counties.” → fetch measures, compute correlations, include provenance.

Integration surface

Gemini CLI / any MCP client: install the Data Commons MCP package, point the client at the server, and issue NL queries; the client coordinates tool calls behind the scenes.

ADK agents: use Google’s sample agent to compose Data Commons calls with your own tools (e.g., visualization, storage) and return sourced outputs.

Docs entry point: MCP — Query data interactively with an AI agent with links to quickstart and user guide.
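For a sense of what "any MCP client" means in practice, here is a hedged sketch using the official MCP Python SDK to spawn the server over STDIO and list its tools. The launch command (datacommons-mcp) and its arguments are assumptions; consult the repository quickstart for the actual package entry point and any required API-key environment variables.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Assumed launch command; the real invocation may differ (see the quickstart).
server = StdioServerParameters(command="datacommons-mcp", args=["serve", "stdio"])

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Expect tools for discovering variables, resolving entities, and fetching series.
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```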

Real-world use case

Google highlights ONE Data Agent, built with the Data Commons MCP Server for the ONE Campaign. It lets policy analysts query tens of millions of health-financing datapoints via natural language, visualize results, and export clean datasets for downstream work.

Summary

In short, Google’s Data Commons MCP Server turns a sprawling corpus of public statistics into a first-class, protocol-native data source for agents—reducing custom glue code, preserving provenance, and fitting cleanly into existing MCP clients like Gemini CLI and ADK.

Check out the GitHub Repository and Try it out in Gemini CLI.

The post Google AI Ships a Model Context Protocol (MCP) Server for Data Commons, Giving AI Agents First-Class Access to Public Stats appeared first on MarkTechPost.

Building health care agents using Amazon Bedrock AgentCore

This blog was co-authored with Kuldeep Singh, Head of AI Platform at Innovaccer.
The integration of agentic AI is ushering in a transformative era in health care, marking a significant departure from traditional AI systems. Agentic AI demonstrates autonomous decision-making capabilities and adaptive learning in complex medical environments, enabling it to monitor patient progress, coordinate care teams, and adjust treatment strategies in real time. These intelligent systems are becoming deeply embedded in healthcare operations, from enhancing diagnostic precision through advanced pattern recognition to optimizing clinical workflows and accelerating drug discovery processes. Agentic AI combines proactive problem-solving abilities with real-time adaptability so that healthcare professionals can focus on high-value, patient-centered activities while the AI handles routine tasks and complex data analysis.
Innovaccer, a pioneering healthcare AI company, recently launched Innovaccer Gravity, built using Amazon Bedrock AgentCore, a new healthcare intelligence platform set to revolutionize data integration and AI-driven healthcare transformation. Building on their impressive track record—where their existing solutions serve more than 1,600 US care locations, manage more than 80 million unified health records, and have generated $1.5B in cost savings—this exemplifies how AWS customers are leading the agentic AI evolution by creating intelligent solutions that transform healthcare delivery while delivering significant ROI.
Health care demands precision and accountability. AI agents operating within this domain must handle sensitive patient data securely, adhere to rigorous compliance regulations (like HIPAA), and maintain consistent interoperability across diverse clinical workflows. Standard, generalized protocols fall short when dealing with complex healthcare systems and patient data protection requirements. Healthcare organizations need a robust service to convert their existing APIs into Model Context Protocol (MCP) compatible tools that can scale effectively while providing built-in authentication, authorization, encryption, and comprehensive audit trails. Amazon Bedrock AgentCore Gateway offers health care providers and digital health companies a straightforward and secure way to build, deploy, discover, and connect to tools at scale that they can use to create AI-powered healthcare solutions while maintaining the highest standards of security and compliance.
Problem
Healthcare organizations face significant data silo challenges because of diverse electronic health record (EHR) formats across different systems, often maintaining multiple systems to serve specialized departmental needs and legacy systems. FHIR (Fast Healthcare Interoperability Resources) solves these interoperability challenges by standardizing healthcare data into exchangeable resources (like patient records and lab results), enabling seamless communication between different systems while maintaining security and improving care coordination. However, implementing FHIR presents its own challenges, including technical complexity in integrating with legacy systems and the need for specialized expertise in healthcare informatics and API development.
The implementation of AI agents introduces new layers of complexity, requiring careful design and maintenance of interfaces with existing systems. AI agents need secure access to the FHIR data and other healthcare tools with authentication (both inbound and outbound) and end-to-end encryption. MCP is a standardized communication framework that enables AI systems to seamlessly interact with external tools, data sources, and services through a unified interface. However, the development and scaling of MCP servers require substantial resources and expertise. Hosting these services demands ongoing development time and attention to maintain optimal performance and reliability. As healthcare organizations navigate this complex terrain, addressing these challenges becomes critical for achieving true interoperability and harnessing the full potential of modern healthcare technology.
Deploy, enhance, and monitor AI agents at scale using Amazon Bedrock AgentCore
By using Amazon Bedrock AgentCore, you can deploy and operate highly capable AI agents securely at scale. It offers infrastructure purpose-built for dynamic agent workloads, powerful tools to enhance agents, and essential controls for real-world deployment. Bedrock AgentCore offers a set of composable services with the services most relevant to the solution in this post mentioned in the following list. For more information, see the Bedrock AgentCore documentation.

AgentCore Runtime provides a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using any open source framework, protocol, and model. Runtime was built to work for agentic workloads with industry-leading extended runtime support, fast cold starts, true session isolation, built-in identity, and support for multi-modal payloads.
AgentCore Gateway provides a secure way for agents to discover and use tools along with straightforward transformation of APIs, AWS Lambda functions, and existing services into agent-compatible tools. Gateway speeds up custom code development, infrastructure provisioning, and security implementation so developers can focus on building innovative agent applications.
AgentCore Identity provides a secure, scalable agent identity and access management capability accelerating AI agent development. It is compatible with existing identity providers, avoiding the need to migrate uses or rebuild authentication flows.
AgentCore Observability helps developers trace, debug, and monitor agent performance in production through unified operational dashboards, with support for OpenTelemetry-compatible telemetry and detailed visualizations of each step of the agent workflow.

In this solution, we demonstrate how the user (a parent) can interact with a Strands or LangGraph agent in conversational style and get information about the immunization history and schedule of their child, inquire about the available slots, and book appointments. With some changes, AI agents can be made event-driven so that they can automatically send reminders, book appointments, and so on. This reduces the administrative burden on healthcare organizations and the parents who no longer need to keep track of the paperwork or make multiple calls to book appointments.

As shown in the preceding diagram, the workflow for the healthcare appointment booking solution built using Amazon Bedrock AgentCore is the following:

User interacts with Strands or LangGraph agent: The solution contains both Strands and LangGraph agents. You can also use other frameworks such as AutoGen and CrewAI.
Reasoning LLM from Amazon Bedrock: Claude 3.5 Sonnet large language model (LLM) is used from Amazon Bedrock. The model demonstrates advanced reasoning by grasping nuances and complex instructions, along with strong tool-calling capabilities that allow it to effectively integrate with external applications and services to automate various tasks such as web browsing, calculations, or data interactions.
Tools exposed using AgentCore Gateway: AgentCore Gateway provides secure access to the necessary tools required for the Strands or LangGraph agent using standard MCP clients (a minimal client sketch follows this list). In this solution, REST APIs are hosted on Amazon API Gateway and exposed as MCP tools using AgentCore Gateway.
Ingress authentication for AgentCore Gateway: AgentCore Gateway is protected with OAuth 2.0 using Amazon Cognito as the identity provider. You can use other OAuth 2.0-compatible identity providers such as Auth0 and Keycloak as needed to fit your use case.
OpenAPI specs converted into tools with AgentCore Gateway: Amazon API Gateway is used as the backend to expose the APIs. By importing the OpenAPI specs, AgentCore Gateway provides an MCP compatible server without additional configuration for tool metadata. The following are the tools used in the solution.

get_patient_emr() – Gets the parent’s and child’s demographics information.
search_immunization_emr() – Gets the immunization history and schedule for the child.
get_available_slots() – Gets the pediatrician’s schedule around parent’s preferred date.
book_appointment() – Books an appointment and returns the confirmation number.

AWS HealthLake as the FHIR server: HealthLake is used to manage patient data related to demographics, immunization history, schedule and appointments, and so on. HealthLake is a HIPAA-eligible service offering healthcare companies a complete view of individual and patient population health data, using FHIR API-based transactions to securely store and transform their data into a queryable format at petabyte scale and further analyze it using machine learning (ML) models.
Egress authentication from AgentCore Gateway to tools: OAuth 2.0 with Amazon Cognito as the identity provider is used to do the authentication between AgentCore Gateway and the tools used in the solution.
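To make the "standard MCP clients" step concrete, here is a hedged sketch (not the solution's actual code) that connects to an AgentCore Gateway endpoint over streamable HTTP with a Cognito-issued bearer token and calls one of the tools listed above. The gateway URL, the token retrieval step, and the tool's argument name are assumptions; the repository's code shows the exact wiring.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

GATEWAY_URL = "https://<gateway-id>.gateway.bedrock-agentcore.us-east-1.amazonaws.com/mcp"  # assumed format
ACCESS_TOKEN = "<oauth-access-token-from-cognito>"  # obtained via your OAuth client-credentials flow

async def main() -> None:
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    async with streamablehttp_client(GATEWAY_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # e.g., get_available_slots, book_appointment
            result = await session.call_tool(
                "get_available_slots",
                {"preferred_date": "2025-10-15"},  # assumed parameter name
            )
            print(result.content)

asyncio.run(main())
```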

Solution setup

Important: The following code example is meant for learning and demonstration purposes only. For production implementations, it is recommended to add required error handling, input validation, logging, and security controls.

The code and instructions to set up and clean up this example solution are available on GitHub. When set up, the solution is targeted towards parents to use for immunization-related appointments.

Customizing the solution
The solution can be customized to extend the same or a different use case through the following mechanisms:

OpenAPI specification: The solution uses a sample OpenAPI specification (named fhir-openapi-spec.yaml) with APIs hosted on API Gateway. The OpenAPI specification can be customized to add more tools or use entirely different tools by editing the YAML file. You must recreate the AgentCore gateway after making changes to the OpenAPI spec.
Agent instructions and LLM: The strands_agent.py or langgraph_agent.py can be modified to make changes to the goal or instructions for the Agent or to work with a different LLM.

Future enhancements
We’re already looking forward and planning future enhancements for this solution.

AgentCore Runtime: Host the Strands or LangGraph agent on AgentCore Runtime.
AgentCore Memory: Use AgentCore Memory to preserve session information short-term (in session) as well as long-term (across sessions) to provide a more personalized experience to the agent users.

Innovaccer’s use case for Bedrock AgentCore
Innovaccer’s Gravity platform includes more than 400 connectors to unify data from EHRs from sources such as Epic, Oracle Cerner, and MEDITECH, more than 20 pre-trained models, 15 pre-built AI agents, 100 FHIR resources, and 60 out-of-the-box solutions with role-based access control, comprehensive audit trail, end-to-end encryption, and secure personal health information (PHI) handling. They also provide a low-code or no-code interface to build additional AI agents with the tools exposed using Healthcare Model Context Protocol (HMCP) servers.
Innovaccer uses Bedrock AgentCore for the following purposes:

AgentCore Gateway to turn their OpenAPI specifications into HMCP compatible tools without the heavy lifting required to build, secure, or scale MCP servers.
AgentCore Identity to handle the inbound and outbound authentication integrating with Innovaccer- or customer-provided OAuth servers.
AgentCore Runtime to deploy and scale the AI agents with multi-agent collaboration, along with logging, traceability, and the ability to plug in custom guardrails.

Bedrock AgentCore supports enterprise-grade security with encryption in transit and at rest, complete session isolation, audit trails using AWS CloudTrail, and comprehensive controls to help Innovaccer agents operate reliably and securely at scale.
Pricing for Bedrock AgentCore Gateway:
AgentCore Gateway offers a consumption-based pricing model with billing based on API invocations (such as ListTools, InvokeTool, and Search API) and indexing of tools. For more information, see the pricing page.
Conclusion
The integration of Amazon Bedrock AgentCore with healthcare systems represents a significant leap forward in the application of AI to improve patient care and streamline healthcare operations. By using the suite of services provided by Bedrock AgentCore, healthcare organizations can deploy sophisticated AI agents that securely interact with existing systems, adhere to strict compliance standards, and scale efficiently.
The solution architecture presented in this post demonstrates the practical application of these technologies, showcasing how AI agents can simplify complex processes such as immunization scheduling and appointment booking. This can reduce administrative burdens on healthcare providers and enhance the patient experience by providing straightforward access to critical health information and services.
As we look to the future, the potential for AI agents in the healthcare industry is vast. From improving diagnostic accuracy to personalizing treatment plans and streamlining clinical workflows, the possibilities are endless. Tools like Amazon Bedrock AgentCore can help healthcare organizations confidently navigate the complexities of implementing AI while maintaining the highest standards of security, compliance, and patient care.
The healthcare industry stands at the cusp of a transformative era, where AI agents will play an increasingly central role in delivering efficient, personalized, and high-quality care. By embracing these technologies and continuing to innovate, we can create a healthcare network that is more responsive, intelligent, and patient-centric than ever before.

About the Authors
Kamal Manchanda is a Senior Solutions Architect at AWS with 17 years of experience in cloud, data, and AI technologies. He works closely with C-level executives and technical teams of AWS customers to drive cloud adoption and digital transformation initiatives. Prior to AWS, he led global teams delivering cloud-centric systems, data-driven applications, and AI/ML solutions across consulting and product organizations. Kamal specializes in translating complex business challenges into scalable, secure solutions that deliver measurable business value.
Kuldeep Singh is AVP and Head of AI Platform at Innovaccer. He leads the work on AI agentic workflow layers for Gravity by Innovaccer, a healthcare intelligence platform designed to unify data, agents, and compliant workflows so health systems can deploy AI at scale. With deep experience in data engineering, AI, and product leadership, Kuldeep focuses on making healthcare more efficient, safe, and patient-centered. He plays a key role in building tools that allow care teams to automate complex, multi-step tasks (like integrating payer or EHR data, orchestrating clinical agents) without heavy engineering. He’s passionate about reducing clinician burnout, improving patient outcomes, and turning pilot projects into enterprise-wide AI solutions.