IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance

IBM just released Granite 4.0, an open-source LLM family that swaps monolithic Transformers for a hybrid Mamba-2/Transformer stack to cut serving memory while keeping quality. Sizes span a 3B dense “Micro,” a 3B hybrid “H-Micro,” a 7B hybrid MoE “H-Tiny” (~1B active), and a 32B hybrid MoE “H-Small” (~9B active). The models are Apache-2.0, cryptographically signed, and—per IBM—the first open models covered by an accredited ISO/IEC 42001:2023 AI management system certification. They are available on watsonx.ai and via Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, with Azure AI Foundry…

So, what is new?

Granite 4.0 introduces a hybrid design that interleaves a small fraction of self-attention blocks with a majority of Mamba-2 state-space layers (a 9:1 ratio). According to IBM's technical blog, relative to conventional Transformer LLMs, Granite 4.0-H can reduce RAM usage by more than 70% for long-context and multi-session inference, translating into lower GPU cost at a given throughput/latency target. IBM's internal comparisons also show the smallest Granite 4.0 models outperforming Granite 3.3-8B despite using fewer parameters.

Tell me, what are the released variants?

IBM is shipping both Base and Instruct variants across four initial models:

Granite-4.0-H-Small: 32B total, ~9B active (hybrid MoE).

Granite-4.0-H-Tiny: 7B total, ~1B active (hybrid MoE).

Granite-4.0-H-Micro: 3B (hybrid dense).

Granite-4.0-Micro: 3B (dense Transformer for stacks that don’t yet support hybrids).

All are Apache-2.0 and cryptographically signed; IBM states Granite is the first open model family with accredited ISO/IEC 42001 coverage for its AI management system (AIMS). Reasoning-optimized (“Thinking”) variants are planned later in 2025.

How was it trained, and what are the context length and dtype?

Granite 4.0 was trained on samples up to 512K tokens and evaluated up to 128K tokens. Public checkpoints on Hugging Face are BF16 (quantized and GGUF conversions are also published), while FP8 is an execution option on supported hardware—not the format of the released weights.
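For local evaluation, the BF16 checkpoints load through standard Hugging Face tooling. The snippet below is a minimal sketch: the model id follows IBM's usual naming but is an assumption (check the Granite 4.0 collection on Hugging Face for the exact checkpoint names), and the hybrid Mamba-2 layers require a recent transformers release.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id -- verify against the Granite 4.0 collection on Hugging Face.
model_id = "ibm-granite/granite-4.0-h-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # published checkpoints are BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the hybrid Mamba-2/Transformer design in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))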

Let's understand its performance signals (enterprise-relevant)

IBM highlights instruction following and tool-use benchmarks:

IFEval (HELM): Granite-4.0-H-Small leads most open-weights models (trailing only Llama 4 Maverick at far larger scale).

https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

BFCLv3 (Function Calling): H-Small is competitive with larger open/closed models at lower price points.

MTRAG (multi-turn RAG): Improved reliability on complex retrieval workflows.

How can I get access?

Granite 4.0 is live on IBM watsonx.ai and distributed via Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, Replicate. IBM notes ongoing enablement for vLLM, llama.cpp, NexaML, and MLX for hybrid serving.

My thoughts/comments

I see Granite 4.0’s hybrid Mamba-2/Transformer stack and active-parameter MoE as a practical path to lower TCO: >70% memory reduction and long-context throughput gains translate directly into smaller GPU fleets without sacrificing instruction-following or tool-use accuracy (IFEval, BFCLv3, MTRAG). The BF16 checkpoints with GGUF conversions simplify local evaluation pipelines, and ISO/IEC 42001 plus signed artifacts address provenance/compliance gaps that typically stall enterprise deployment. Net result: a lean, auditable base model family (1B–9B active) that’s easier to productionize than prior 8B-class Transformers.


ServiceNow AI Releases Apriel-1.5-15B-Thinker: An Open-Weights Multimodal Reasoning Model that Hits Frontier-Level Performance on a Single-GPU Budget

ServiceNow AI Research Lab has released Apriel-1.5-15B-Thinker, a 15-billion-parameter open-weights multimodal reasoning model trained with a data-centric mid-training recipe—continual pretraining followed by supervised fine-tuning—without reinforcement learning or preference optimization. The model attains an Artificial Analysis Intelligence Index score of 52 with 8x cost savings compared to SOTA. The checkpoint ships under an MIT license on Hugging Face.

So, What’s new in it for me?

Frontier-level composite score at small scale. The model reports Artificial Analysis Intelligence Index (AAI) = 52, matching DeepSeek-R1-0528 on that combined metric while being dramatically smaller. AAI aggregates 10 third-party evaluations (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, τ²-Bench Telecom).

Single-GPU deployability. The model card states the 15B checkpoint “fits on a single GPU,” targeting on-premises and air-gapped deployments with fixed memory and latency budgets.

Open weights and reproducible pipeline. Weights, training recipe, and evaluation protocol are public for independent verification.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker

OK, I got it, but what is its training mechanism?

Base and upscaling. Apriel-1.5-15B-Thinker starts from Mistral’s Pixtral-12B-Base-2409 multimodal decoder-vision stack. The research team applies depth upscaling—increasing decoder layers from 40→48—then projection-network realignment to align the vision encoder with the enlarged decoder. This avoids pretraining from scratch while preserving single-GPU deployability.

CPT (Continual Pretraining). Two stages: (1) mixed text+image data to build foundational reasoning and document/diagram understanding; (2) targeted synthetic visual tasks (reconstruction, matching, detection, counting) to sharpen spatial and compositional reasoning. Sequence lengths extend to 32k and 16k tokens respectively, with selective loss placement on response tokens for instruction-formatted samples.

SFT (Supervised Fine-Tuning). High-quality, reasoning-trace instruction data for math, coding, science, tool use, and instruction following; two additional SFT runs (stratified subset; longer-context) are weight-merged to form the final checkpoint. No RL (reinforcement learning) or RLAIF (reinforcement learning from AI feedback).
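For intuition, weight-merging two SFT runs amounts to averaging their parameter tensors. The helper below is a hypothetical illustration under that assumption (the report does not publish its exact merging code); it works on plain PyTorch state dicts.

import torch

def merge_checkpoints(state_dicts, weights=None):
    # Average (optionally weighted) parameter tensors from multiple fine-tuning runs.
    # Hypothetical helper for illustration only; not the authors' released code.
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        acc = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
        merged[name] = acc.to(state_dicts[0][name].dtype)
    return merged

# Example usage (file names are placeholders):
# sd_a = torch.load("sft_stratified.pt", map_location="cpu")
# sd_b = torch.load("sft_long_context.pt", map_location="cpu")
# torch.save(merge_checkpoints([sd_a, sd_b]), "merged.pt")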

Data note. ~25% of the depth-upscaling text mix derives from NVIDIA’s Nemotron collection.

O’ Wow! Tell me about it’s results then?

Key text benchmarks (pass@1 / accuracy).

AIME 2025 (American Invitational Mathematics Examination 2025): 87.5–88%

GPQA Diamond (Graduate-Level Google-Proof Question Answering, Diamond split): ≈71%

IFBench (Instruction-Following Benchmark): ~62

τ²-Bench (Tau-squared Bench) Telecom: ~68

LiveCodeBench (functional code correctness): ~72.8

Using VLMEvalKit for reproducibility, Apriel scores competitively across MMMU / MMMU-Pro (Massive Multi-discipline Multimodal Understanding), LogicVista, MathVision, MathVista, MathVerse, MMStar, CharXiv, AI2D, BLINK, with stronger results on documents/diagrams and text-dominant math imagery.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Let's summarize everything

Apriel-1.5-15B-Thinker demonstrates that careful mid-training (continual pretraining + supervised fine-tuning, no reinforcement learning) can deliver a 52 on the Artificial Analysis Intelligence Index (AAI) while remaining deployable on a single graphics processing unit. Reported task-level scores (for example, AIME 2025 ≈88, GPQA Diamond ≈71, IFBench ≈62, Tau-squared Bench Telecom ≈68) align with the model card and place the 15-billion-parameter checkpoint in the most cost-efficient band of current open-weights reasoners. For enterprises, that combination—open weights, reproducible recipe, and single-GPU latency—makes Apriel a practical baseline to evaluate before considering larger closed systems.

Enhance agentic workflows with enterprise search using Kore.ai and Ama …

This post was written with Meghana Chintalapudi and Surabhi Sankhla of Kore.ai.
As organizations struggle with exponentially growing volumes of data distributed across multiple repositories and applications, employees lose significant time—approximately 30% according to the International Data Corporation (IDC)—searching for information that could be spent on higher-value work. The complexity of modern enterprise data networks demands solutions that can efficiently integrate, process, and deliver actionable insights across disparate systems.
In this post, we demonstrate how organizations can enhance their employee productivity by integrating Kore.ai’s AI for Work platform with Amazon Q Business. We show how to configure AI for Work as a data accessor for Amazon Q index for independent software vendors (ISVs), so employees can search enterprise knowledge and execute end-to-end agentic workflows involving search, reasoning, actions, and content generation. We explore the key benefits of this integration, including advanced search capabilities across more than 90 enterprise connectors and how to extend agentic experiences on top of a search foundation. The post includes a step-by-step implementation guide to help you set up this integration in your environment.
Components of the integration
Kore.ai is a leading Enterprise AI platform consistently recognized by Gartner as a leader in conversational AI. With three key Kore.ai offerings, AI for Work, AI for Process, and AI for Service, enterprises can build and deploy AI solutions based on their business needs. The AI for Work platform helps employees be more productive by making it possible to search across applications, take context-aware actions, generate content, and automate repetitive tasks. The platform goes beyond standalone search to deliver comprehensive agentic orchestration and workflows, helping employees follow up with clients, send weekly updates, or research and write marketing content with a single command. With AI for Work, your employees can create simple no-code agents while your admins have the flexibility to create more advanced low-code or pro-code agents. AI for Process, on the other hand, automates knowledge-intensive business processes end-to-end. AI for Service helps organizations deliver differentiated customer service experiences through self-service, proactive outreach campaigns, and agent assistance.
Amazon Q index for ISVs is a powerful, managed vector search service that supports seamless integration of generative AI applications with customers’ enterprise data through a unified, secure index. ISVs can access and retrieve relevant content through the SearchRelevantContent API for cross-application data retrieval without needing direct access or individual indexing of each data source, while customers retain full control over data access and governance.
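For orientation, a retrieval call against the index looks roughly like the boto3 sketch below. The parameter and response field names reflect our reading of the qbusiness SearchRelevantContent API and should be checked against the current SDK documentation; in the ISV data-accessor flow the credentials come from the token-exchange process described later, not a default session, which is omitted here for brevity.

import boto3

# Simplified sketch: in the data-accessor flow, credentials are obtained via the
# OAuth token exchange, not a default boto3 session as shown here.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

response = qbusiness.search_relevant_content(
    applicationId="<amazon-q-business-application-id>",
    contentSource={"retriever": {"retrieverId": "<retriever-id>"}},
    queryText="Summarize open feature requests from Q3 customer calls",
    maxResults=5,
)

for item in response.get("relevantContent", []):
    # Field names assumed from the API shape; verify against the SDK docs.
    print(item.get("documentTitle"), item.get("documentUri"))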
When combined with additional search connectors offered by the AI for Work platform and its ability to create and orchestrate agents, organizations gain a complete solution that transforms how employees access enterprise data and execute tasks end-to-end. The following video shows one such agentic experience in action, where the AI for Work interface seamlessly orchestrates agents to help a sales executive prepare for a client meeting—compiling information from Amazon Q index and AI for Work connectors, summarizing talking points, and sending them as an email, all from a single query.

Benefits for enterprises
Enterprises often struggle with fragmented data access and repetitive manual tasks that slow down critical business processes. For example, imagine a scenario where a product manager needs to compile quarterly feature requests—with the integration of Kore.ai’s AI for Work and Amazon Q index, they can instantly gather requests from Salesforce, support tickets, and JIRA; automatically generate a structured roadmap; and schedule stakeholder meetings, all with a single query. This seamless integration changes the way enterprises interact with enterprise systems, through multiple key advantages:

Improved search capabilities – Amazon Q index augments the generative AI experience by providing semantically relevant enterprise content across connected systems through its distributed vector database, delivering query responses at enterprise scale. Now, together with AI for Work, your employees can search data from over 90 connectors, integrating with enterprise systems like Microsoft 365, Salesforce, and Workday while also connecting with custom internal knowledge systems and third-party search providers. AI for Work’s orchestrator manages complex query processing and agent routing across multiple data sources, resulting in contextually appropriate and actionable results that significantly reduce search time while also enabling intelligent automations that extend far beyond traditional search capabilities.
Enhanced data processing – The system continuously ingests and analyzes data through the document processing pipeline in Amazon Q index, which automatically handles multiple formats using intelligent chunking algorithms that preserve semantic context. The AI for Work platform unifies search, content generation, and actions in a single interface, to support the creation of multi-step agentic experiences grounded in search. Through real-time incremental indexing that processes only changed content, the system maintains data freshness while converting siloed raw data into actionable insights and multi-step business processes that can be saved and reused across the organization.
Cost optimization – Organizations can achieve significant cost savings by streamlining routine tasks through agents that reduce operational overhead and improve resource allocation. AI for Work supports a wide range of agent-building options, from no-code and low-code to pro-code, for both non-technical employees and technical experts to build agents for themselves and to share across the organization, so teams can accomplish more with existing resources and benefit from sustained productivity improvements.
Security benefits – Security remains paramount, with Amazon Q index implementing vector-level security through end-to-end encryption using AWS Key Management Service (AWS KMS) customer managed keys and document-level access controls that filter search results based on user identity and group membership. The joint solution implements robust role-based access control and audit trails. This zero-trust security approach maintains compliance with industry standards while providing granular control over sensitive enterprise data, making sure users only see information from documents they have explicit permissions to access while maintaining complete data sovereignty. With AI for Work’s robust security and governance tools enterprises can manage permissions and agent access, monitor usage, and enforce guardrails for secure, enterprise-wide deployment of AI solutions at scale.

Solution overview
The Amazon Q Business data accessor provides a secure interface that integrates Kore.ai’s AI for Work platform with Amazon Q index. The integration delivers a robust solution that uses enterprise data across multiple systems to power intelligent agentic actions and content generation capabilities that transform how organizations handle routine tasks and automate complex processes end-to-end.
When a user submits a query through AI for Work, its orchestrator intelligently routes requests between Kore.ai’s native retrievers and Amazon Q index based on predefined routing rules and advanced intent recognition algorithms. For Amazon Q index requests, the architecture implements secure cross-account API calls using OAuth 2.0 tokens that are exchanged for temporary AWS credentials, supporting both security and optimal performance while maintaining strict access controls throughout the entire system. With AI for Work’s agents, users can take follow-up actions, such as drafting proposals or submitting tickets, directly on top of search results for end-to-end task completion in a single interface. Users can also build personalized workflows of pre-defined steps and execute them from a single query to further save time.
This supports use cases such as automated roadmap generation, where a product manager can query feature requests across multiple systems and receive a structured roadmap complete with stakeholder notifications, or RFP response automation, where sales executives can generate comprehensive proposals by pulling compliance documentation and tailoring responses based on client requirements.
The following diagram illustrates the solution architecture.

Prerequisites
Before enabling the Amazon Q index integration with Kore.ai’s AI for Work, you must have the following components in place:

An AWS account with appropriate service access
Amazon Q Business set up with AWS IAM Identity Center for user authentication
Access to Kore.ai’s AI for Work (as a workspace admin)

With these prerequisites met, you can complete the basic configuration steps on both the Amazon Q Business and Kore.ai consoles to get started.
Add Kore.ai as a data accessor
After creating an Amazon Q Business application with AWS IAM Identity Center, administrators can configure Kore.ai as a data accessor through the Amazon Q Business console. Complete the following steps:

On the Amazon Q Business console, choose Data accessors in the navigation pane.
Choose Add data accessor.
Choose Kore.ai as your data accessor. You must retrieve the tenant ID, a unique identifier for your application tenant. Refer to Prerequisites for instructions to retrieve the tenant ID for your application; similar instructions are also listed later in this post.
For Data source access, configure your level of access. You can select specific data sources from your Amazon Q index to be available through the data accessor. This makes it possible to control which content is surfaced in the AI for Work environment.
For User access, specify which users or groups can access the Amazon Q index through the data accessor. This option makes it possible to configure granular permissions for data accessor accessibility and manage organizational access controls.

After you have added the data accessor, the Amazon Q Business console displays configuration details that you need to share with Kore.ai to complete the setup.

Note down the following information for the next step:

Amazon Q Business application ID
AWS Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Configure Amazon Q index in Kore.ai’s AI for Work
Kore.ai’s AI for Work supports flexible integration with Amazon Q index based on your enterprise search needs. There are two configuration options: configuring Amazon Q index as the primary enterprise knowledge source or configuring it as a search agent. We provide instructions for both options in this post.
Option 1: Configure Amazon Q index as the primary enterprise knowledge source
If you want Amazon Q index to act as the primary enterprise knowledge source, serving as the fallback search layer, complete the following steps:

In AI for Work, go to Workspaces on the admin console. Then navigate to Enterprise Workspace, which is the default workspace.

Choose Configure to configure an enterprise knowledge data source.
On the Create New dropdown menu, choose Amazon Q.

Enter a source name and brief description.
Copy the tenant ID displayed—this is required during the setup of the data accessor in AWS, as described in the previous section.
Enter the details captured earlier:

Amazon Q Business application ID
Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Choose Continue to save and complete the configuration.

The new knowledge source now shows as Active.

Option 2: Configure Amazon Q index as a search agent
If you already have a primary search index, you can configure Amazon Q index as a search agent:

In AI for Work, go to Workspaces on the admin console.
Choose the workspace where you want to add Amazon Q index. (Enterprise Workspace is used by default).
Under AI Agents in the navigation pane, choose Search Agent.
Choose Create agent.

Provide an agent name and purpose. This helps define when the search agent should be invoked.
Choose Continue to move to configuration.
For Select Search Index, choose Amazon Q.

Copy the tenant ID displayed—it is required during the setup of the data accessor in AWS.

Preview and test the agent.
After you have validated the agent, publish it to selected users or groups.

Your integration is now complete. You can now access the assistant application and start asking questions in the AI for Work console. If you’ve created a search agent, you can also access it from the list of agents and start interacting with it directly.
Clean up
When you are finished using this solution, clean up your resources to avoid additional costs:

Disable the Amazon Q index configuration within AI for Work’s settings.
Delete the Kore.ai data accessor from the Amazon Q Business console, which will remove permissions and access for users.
Delete the Amazon Q Business application to remove the associated index and data source connectors on your AWS account.

Conclusion
The combination of Kore.ai’s AI for Work and Amazon Q index offers enterprises a transformative approach to boost employee productivity by leveraging comprehensive search capabilities while streamlining repetitive tasks and processes. By integrating Kore.ai’s advanced agentic platform with the robust search infrastructure of Amazon Q index, organizations can now execute context-aware actions by accessing relevant information across disparate systems while maintaining data ownership and security. This supports faster problem-solving, enhanced productivity, and better collaboration across the organization.
In this post, we explored how enterprises can use the integration between Kore.ai’s AI for Work and Amazon Q Business to streamline their operational processes and unlock valuable productivity gains. We demonstrated how organizations can set up this integration using an Amazon Q data accessor, helping teams access critical information securely and cost-effectively.
Unlock the full potential of your organization’s data and agentic workflows today with the Amazon Q index and Kore.ai’s AI for Work’s unified solution by following the steps in Amazon Q integration with AI for Work.

About the authors
Siddhant Gupta is a Software Development Manager on the Amazon Q team based in Seattle, WA. He is driving innovation and development in cutting-edge AI-powered solutions.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps ISVs accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.
Santhosh Urukonda is a Senior PACE (Prototyping & Cloud Engineering) Architect at AWS with two decades of experience. He specializes in helping customers develop innovative, first-to-market solutions with a focus on generative AI.
Nikhil Kumar Goddeti is a Cloud Support Engineer II at AWS. He specializes in AWS Data Analytics services with emphasis on Amazon OpenSearch Service, Amazon Q Business, Amazon Kinesis, Amazon MSK, Amazon AppFlow, and Amazon Kendra. He is a Subject Matter Expert of OpenSearch. Outside of work, he enjoys travelling with his friends and playing cricket.
Meghana Chintalapudi is a Product Manager at Kore.ai, driving the development of search and agentic AI solutions for the AI for Work platform. She has led large-scale AI implementations for Fortune 500 clients, evolving from deterministic NLP and intent-detection models to advanced large language model deployments, with a strong emphasis on enterprise-grade security and scalability. Outside of work, Meghana is a dancer and takes movement workshops in Hyderabad, India.
Surabhi Sankhla is a VP of Product at Kore.ai, where she leads the AI for Work platform to help enterprises boost employee productivity. With over 13 years of experience in product management and technology, she has launched AI products from the ground up and scaled them to millions of users. At Kore.ai, she drives product strategy, client implementations, and go-to-market execution in partnership with cross-functional teams. Based in San Francisco, Surabhi is passionate about making AI accessible and impactful for all.

Accelerate development with the Amazon Bedrock AgentCore MCP server

Today, we’re excited to announce the Amazon Bedrock AgentCore Model Context Protocol (MCP) Server. With built-in support for runtime, gateway integration, identity management, and agent memory, the AgentCore MCP Server is purpose-built to speed up creation of components compatible with Bedrock AgentCore. You can use the AgentCore MCP server for rapid prototyping, production AI solutions, or to scale your agent infrastructure for your enterprise.
Agentic IDEs like Kiro, Claude Code, GitHub Copilot, and Cursor, along with sophisticated MCP servers, are transforming how developers build AI agents. Tasks that typically take significant time and effort, such as learning about Bedrock AgentCore services, integrating Runtime and Tools Gateway, managing security configurations, and deploying to production, can now be completed in minutes through conversational commands with your coding assistant.
In this post, we introduce the new AgentCore MCP server and walk through the installation steps so you can get started.
AgentCore MCP server capabilities
The AgentCore MCP server brings a new agentic development experience to AWS, providing specialized tools that automate the complete agent lifecycle, eliminate the steep learning curve, and reduce development friction that can slow innovation cycles. To address specific agent development challenges, the AgentCore MCP server:

Transforms agents for AgentCore Runtime integration by providing guidance to your coding assistant on the minimum functionality changes needed—adding Runtime library imports, updating dependencies, initializing apps with BedrockAgentCoreApp(), converting entrypoints to decorators, and changing direct agent calls to payload handling—while preserving your existing agent logic and Strands Agents features (a minimal sketch of the resulting code appears after this list).
Automates development environment provisioning by handling the complete setup process through your coding assistant: installing required dependencies (bedrock-agentcore SDK, bedrock-agentcore-starter-toolkit CLI helpers, strands-agents SDK), configuring AWS credentials and AWS Regions, defining execution roles with Bedrock AgentCore permissions, setting up ECR repositories, and creating .bedrock_agentcore.yaml configuration files.
Simplifies tool integration with Bedrock AgentCore Gateway for seamless agent-to-tool communication in the cloud environment.
Enables simple agent invocation and testing by providing natural language commands through your coding assistant to invoke provisioned agents on AgentCore Runtime and verify the complete workflow, including calls to AgentCore Gateway tools when applicable.
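The transformation described in the first point typically leaves an agent looking like the sketch below. It is a minimal illustration of the pattern named above (Runtime import, BedrockAgentCoreApp() initialization, an entrypoint decorator, payload handling); exact module paths and the Strands result object are assumptions and may differ in your setup.

from bedrock_agentcore.runtime import BedrockAgentCoreApp  # AgentCore Runtime SDK (path assumed)
from strands import Agent  # existing Strands agent logic is preserved

app = BedrockAgentCoreApp()
agent = Agent()  # your existing agent, tools, and model configuration

@app.entrypoint
def invoke(payload):
    # Direct agent calls become payload handling: extract the prompt, return a serializable result.
    user_message = payload.get("prompt", "")
    result = agent(user_message)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()  # serves the agent locally; the AgentCore CLI handles deployment to Runtime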

Layered approach
When using the AgentCore MCP server with your favorite client, we encourage you to consider a layered architecture designed to provide comprehensive AI agent development support:

Layer 1: Agentic IDE or client – Use Kiro, Claude Code, Cursor, VS Code extensions, or another natural language interface for developers. For very simple tasks, agentic IDEs are equipped with the right tools to look up documentation and perform tasks specific to Bedrock AgentCore. However, with this layer alone, developers may observe sub-optimal performance across AgentCore developer paths.
Layer 2: AWS service documentation – Install the AWS Documentation MCP Server for comprehensive AWS service documentation, including context about Bedrock AgentCore.
Layer 3: Framework documentation – Install the Strands, LangGraph, or other framework docs MCP servers or use the llms.txt for framework-specific context.
Layer 4: SDK documentation – Install the MCP or use the llms.txt for the Agent Framework SDK and Bedrock AgentCore SDK for a combined documentation layer that covers the Strands Agents SDK documentation and Bedrock AgentCore API references.
Layer 5: Steering files – Task-specific guidance for more complex and repeated workflows. Each IDE has a different approach to using steering files (for example, see Steering in the Kiro documentation).

Each layer builds upon the previous one, providing increasingly specific context so your coding assistant can handle everything from basic AWS operations to complex agent transformations and deployments.
Installation
To get started with the Amazon Bedrock AgentCore MCP server, you can use the one-click install in the GitHub repository.
Each IDE integrates with an MCP server differently using the mcp.json file. Review the MCP documentation for your IDE (such as Kiro, Cursor, Q CLI, or Claude Code) to determine the location of mcp.json.

Client      | Location of mcp.json    | Documentation
Kiro        | .kiro/settings/mcp.json | https://kiro.dev/docs/mcp/
Cursor      | .cursor/mcp.json        | https://cursor.com/docs/context/mcp
Q CLI       | ~/.aws/amazonq/mcp.json | https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/qdev-mcp.html
Claude Code | ~/.claude/mcp.json      | https://docs.claude.com/en/docs/claude-code/mcp

Use the following in your mcp.json:

{
  "mcpServers": {
    "awslabs.amazon-bedrock-agentcore-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.amazon-bedrock-agentcore-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

For example, here is what the IDE looks like on Kiro, with the AgentCore MCP server and the two tools, search_agentcore_docs and fetch_agentcore_doc, connected:

Using the AgentCore MCP server for agent development
While we show demos for various use cases below using the Kiro IDE, the AgentCore MCP server has also been tested to work on Claude Code, Amazon Q CLI, Cursor, and the VS Code Q plugin. First, let’s take a look at a typical agent development lifecycle using AgentCore services (remember that this is only one example with the tools available, and you are free to explore more such use cases simply by instructing the agent in your favorite Agentic IDE):

The agent development lifecycle follows these steps:

The user takes a local set of tools or MCP servers and either:

Creates a Lambda target for AgentCore Gateway; or
Deploys the MCP server as-is on AgentCore Runtime

The user prepares the actual agent code using a preferred framework like Strands Agents or LangGraph. The user can either:

Start from scratch (the server can fetch docs from the Strands Agents or LangGraph documentation)
Start from fully or partially working agent code

The user asks the agent to transform the code into a format compatible with AgentCore Runtime with the intention to deploy the agent later. This causes the agent to:

Write an appropriate requirements.txt file
Import necessary libraries, including bedrock_agentcore
Decorate the main handler (or create one) to access the core agent calling logic or input handler

The user may then ask the agent to deploy to AgentCore Runtime. The agent can look up documentation and can use the AgentCore CLI to deploy the agent code to Runtime
The user can test the agent by asking the agent to do so. The AgentCore CLI command required for this is written and executed by the client
The user then asks to modify the code to use the deployed AgentCore Gateway MCP server within this AgentCore Runtime agent.

The agent modifies the original code to add an MCP client that can call the deployed gateway
The agent then deploys a new version v2 of the agent to Runtime
The agent then tests this integration with a new prompt

Here is a demo of the MCP server working with Cursor IDE. We see the agent perform the following steps:

Transform weather_agent.py to be compatible with AgentCore Runtime
Use the AgentCore CLI to deploy the agent
Test the deployed agent with a successful prompt

Here’s another example of deploying a LangGraph agent to AgentCore Runtime with the Cursor IDE performing similar steps as seen above.

Clean up
If you’d like to uninstall the MCP server, follow the MCP documentation for your IDE, such as Kiro, Cursor, Q CLI, and Claude Code for instructions.
Conclusion
In this post, we showed how you can use the AgentCore MCP server with your favorite Agentic IDE of choice to speed up your development workflows.
We encourage you to review the GitHub repository, as well as read through and use the following resources in your development:

Amazon Bedrock AgentCore CLI documentation
Strands Agents MCP Server
LangGraph llms.txt

We encourage you to try out the AgentCore MCP server and provide any feedback through issues in our GitHub repository.

About the authors

Shreyas Subramanian
Shreyas is a Principal Data Scientist who helps customers solve their business challenges with generative AI on the AWS platform. Shreyas has a background in large-scale optimization and Deep Learning, and he is a researcher studying the use of Machine Learning and Reinforcement Learning for accelerating learning and optimization tasks. Shreyas is also an Amazon best-selling book author with several research papers and patents to his name.

Primo Mu
Primo is a Software Development Engineer on the Agentic AI Foundation team at AWS, where he builds foundational systems and infrastructure that power intelligent AI applications. He has extensive experience working on backend stateless orchestration services behind products like Kiro and Q Dev CLI. He focuses on creating scalable frameworks and robust architectures that enable developers to build sophisticated agentic systems.

Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency

Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It positions itself for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.

https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model

But what’s actually new? a unified backbone with disentangled audio I/O

LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.

On the implementation side, the released checkpoint uses:

Backbone: LFM2 (hybrid conv + attention), 1.2B params (LM only)

Audio encoder: FastConformer (~115M, canary-180m-flash)

Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)

Context: 32,768 tokens; vocab: 65,536 (text) / 2049×8 (audio)

Precision: bfloat16; license: LFM Open License v1.0; languages: English

Two generation modes for real-time agents

Interleaved generation for live, speech-to-speech chat where the model alternates text and audio tokens to minimize perceived latency.

Sequential generation for ASR/TTS (switching modalities turn-by-turn).

Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
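To see why interleaving helps perceived latency, consider the purely illustrative loop below. The step functions are hypothetical placeholders, not the liquid-audio API; the point is only that audio codes start streaming to the codec/vocoder long before the full textual answer is complete.

def interleaved_decode(model, user_audio, text_chunk=8, audio_chunk=24, max_steps=400):
    # Hypothetical sketch: alternate short bursts of text tokens and audio codes
    # so playback can begin almost immediately.
    state = model.ingest(user_audio)                              # hypothetical: encode the spoken query
    for _ in range(max_steps):
        _ = model.generate_text_tokens(state, n=text_chunk)       # hypothetical
        codes = model.generate_audio_codes(state, n=audio_chunk)  # hypothetical
        yield codes                                               # stream newest codes to the vocoder
        if model.turn_finished(state):                            # hypothetical end-of-turn check
            break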

Latency: <100 ms to first audio

The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response—a proxy for perceived responsiveness in interactive use—stating it is faster than models smaller than 1.5B parameters under their setup.

Benchmarks: VoiceBench and ASR results

On VoiceBench—a suite of nine audio-assistant evaluations—Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog’s chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models like Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark introduced in late 2024 for LLM-based voice assistants)

The model card on Hugging Face provides an additional VoiceBench table (with closely related—but not identical—per-task values) and includes classic ASR WERs where LFM2-Audio matches or improves on Whisper-large-v3-turbo for some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.

https://huggingface.co/LiquidAI/LFM2-Audio-1.5B

Alright, but why does it really matter in voice AI trends?

Most “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio’s single-backbone design with continuous input embeddings and discrete output codes reduces glue logic and allows interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.


MLPerf Inference v5.1 (2025): Results Explained for GPUs, CPUs, and AI Accelerators

What Does MLPerf Inference Actually Measure?

MLPerf Inference quantifies how fast a complete system (hardware + runtime + serving stack) executes fixed, pre-trained models under strict latency and accuracy constraints. Results are reported for the Datacenter and Edge suites with standardized request patterns (“scenarios”) generated by LoadGen, ensuring architectural neutrality and reproducibility. The Closed division fixes the model and preprocessing for apples-to-apples comparisons; the Open division allows model changes that are not strictly comparable. Availability tags—Available, Preview, RDI (research/development/internal)—indicate whether configurations are shipping or experimental.

The 2025 Update (v5.0 → v5.1): What Changed?

The v5.1 results (published Sept 9, 2025) add three modern workloads and broaden interactive serving:

DeepSeek-R1 (first reasoning benchmark)

Llama-3.1-8B (summarization) replacing GPT-J

Whisper Large V3 (ASR)

This round recorded 27 submitters and first-time appearances of AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, RTX 4000 Ada-PCIe-20GB, and RTX Pro 6000 Blackwell Server Edition. Interactive scenarios (tight TTFT/TPOT limits) were expanded beyond a single model to capture agent/chat workloads.

Scenarios: The Four Serving Patterns You Must Map to Real Workloads

Offline: maximize throughput, no latency bound—batching and scheduling dominate.

Server: Poisson arrivals with p99 latency bounds—closest to chat/agent backends.

Single-Stream / Multi-Stream (Edge emphasis): strict per-stream tail latency; Multi-Stream stresses concurrency at fixed inter-arrival intervals.

Each scenario has a defined metric (e.g., max Poisson throughput for Server; throughput for Offline).

Latency Metrics for LLMs: TTFT and TPOT Are Now First-Class

LLM tests report TTFT (time-to-first-token) and TPOT (time-per-output-token). v5.0 introduced stricter interactive limits for Llama-2-70B (p99 TTFT 450 ms, TPOT 40 ms) to reflect user-perceived responsiveness. The long-context Llama-3.1-405B keeps higher bounds (p99 TTFT 6 s, TPOT 175 ms) due to model size and context length. These constraints carry into v5.1 alongside new LLM and reasoning tasks.
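As a reference for your own load tests, TTFT and TPOT can be computed from token timestamps as in the sketch below, where stream is any iterator yielding output tokens from your serving stack (an assumed interface, unrelated to MLPerf LoadGen).

import time

def measure_ttft_tpot(stream):
    # stream: an iterable that yields output tokens as the server generates them.
    t_request = time.perf_counter()
    t_first = None
    n_tokens = 0
    for _ in stream:
        n_tokens += 1
        if t_first is None:
            t_first = time.perf_counter()
    t_last = time.perf_counter()
    ttft = t_first - t_request                          # time-to-first-token
    tpot = (t_last - t_first) / max(n_tokens - 1, 1)    # mean time per subsequent output token
    return ttft, tpot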

The 2025 Datacenter Menu (Closed Division Targets You’ll Actually Compare)

Key v5.1 entries and their quality/latency gates (abbrev.):

LLM Q&A – Llama-2-70B (OpenOrca): Conversational 2000 ms/200 ms; Interactive 450 ms/40 ms; 99% and 99.9% accuracy targets.

LLM Summarization – Llama-3.1-8B (CNN/DailyMail): Conversational 2000 ms/100 ms; Interactive 500 ms/30 ms.

Reasoning – DeepSeek-R1: TTFT 2000 ms / TPOT 80 ms; 99% of FP16 (exact-match baseline).

ASR – Whisper Large V3 (LibriSpeech): WER-based quality (datacenter + edge).

Long-context – Llama-3.1-405B: TTFT 6000 ms, TPOT 175 ms.

Image – SDXL 1.0: FID/CLIP ranges; Server has a 20 s constraint.

Legacy CV/NLP (ResNet-50, RetinaNet, BERT-L, DLRM, 3D-UNet) remain for continuity.

Power Results: How to Read Energy Claims

MLPerf Power (optional) reports system wall-plug energy for the same runs (Server/Offline: system power; Single/Multi-Stream: energy per stream). Only measured runs are valid for energy efficiency comparisons; TDPs and vendor estimates are out-of-scope. v5.1 includes datacenter and edge power submissions but broader participation is encouraged.

How To Read the Tables Without Fooling Yourself?

Compare Closed vs Closed only; Open runs may use different models/quantization.

Match accuracy targets (99% vs 99.9%)—throughput often drops at stricter quality.

Normalize cautiously: MLPerf reports system-level throughput under constraints; dividing by accelerator count yields a derived “per-chip” number that MLPerf does not define as a primary metric. Use it only for budgeting sanity checks, not marketing claims (see the sketch after this list).

Filter by Availability (prefer Available) and include Power columns when efficiency matters.
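A minimal filtering-and-normalization sketch follows. The column names are hypothetical stand-ins for whatever schema the MLCommons results export uses in your download, and the derived per-accelerator figure is a budgeting heuristic only.

import pandas as pd

# Hypothetical column and file names; adapt to the actual MLCommons v5.1 results export.
df = pd.read_csv("mlperf_inference_v5_1_datacenter.csv")

comparable = df[
    (df["Division"] == "Closed")
    & (df["Availability"] == "Available")
    & (df["Benchmark"] == "llama2-70b-interactive-99.9")
    & (df["Scenario"] == "Server")
].copy()

# Derived number for budgeting sanity checks only -- not an official MLPerf metric.
comparable["derived_per_accelerator"] = comparable["Result"] / comparable["Accelerator_Count"]
print(
    comparable[["Submitter", "System", "Result", "Accelerator_Count", "derived_per_accelerator"]]
    .sort_values("derived_per_accelerator", ascending=False)
    .head(10)
)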

Interpreting 2025 Results: GPUs, CPUs, and Other Accelerators

GPUs (rack-scale to single-node). New silicon shows up prominently in Server-Interactive (tight TTFT/TPOT) and in long-context workloads where scheduler & KV-cache efficiency matter as much as raw FLOPs. Rack-scale systems (e.g., GB300 NVL72 class) post the highest aggregate throughput; normalize by both accelerator and host counts before comparing to single-node entries, and keep scenario/accuracy identical.

CPUs (standalone baselines + host effects). CPU-only entries remain useful baselines and highlight preprocessing and dispatch overheads that can bottleneck accelerators in Server mode. New Xeon 6 results and mixed CPU+GPU stacks appear in v5.1; check host generation and memory configuration when comparing systems with similar accelerators.

Alternative accelerators. v5.1 increases architectural diversity (GPUs from multiple vendors plus new workstation/server SKUs). Where Open-division submissions appear (e.g., pruned/low-precision variants), validate that any cross-system comparison holds constant division, model, dataset, scenario, and accuracy.

Practical Selection Playbook (Map Benchmarks to SLAs)

Interactive chat/agents → Server-Interactive on Llama-2-70B/Llama-3.1-8B/DeepSeek-R1 (match latency & accuracy; scrutinize p99 TTFT/TPOT).

Batch summarization/ETL → Offline on Llama-3.1-8B; throughput per rack is the cost driver.

ASR front-ends → Whisper V3 Server with tail-latency bound; memory bandwidth and audio pre/post-processing matter.

Long-context analytics → Llama-3.1-405B; evaluate if your UX tolerates 6 s TTFT / 175 ms TPOT.

What Does the 2025 Cycle Signal?

Interactive LLM serving is table-stakes. Tight TTFT/TPOT in v5.x makes scheduling, batching, paged attention, and KV-cache management visible in results—expect different leaders than in pure Offline.

Reasoning is now benchmarked. DeepSeek-R1 stresses control-flow and memory traffic differently from next-token generation.

Broader modality coverage. Whisper V3 and SDXL exercise pipelines beyond token decoding, surfacing I/O and bandwidth limits.

Summary

In summary, MLPerf Inference v5.1 makes inference comparisons actionable only when grounded in the benchmark’s rules: align on the Closed division, match scenario and accuracy (including LLM TTFT/TPOT limits for interactive serving), and prefer Available systems with measured Power to reason about efficiency; treat any per-device splits as derived heuristics because MLPerf reports system-level performance. The 2025 cycle expands coverage with DeepSeek-R1, Llama-3.1-8B, and Whisper Large V3, plus broader silicon participation, so procurement should filter results to the workloads that mirror production SLAs—Server-Interactive for chat/agents, Offline for batch—and validate claims directly in the MLCommons result pages and power methodology.

References:

MLCommons Releases New MLPerf Inference v5.1 Benchmark Results

MLPerf Inference: Datacenter

MLPerf Inference: Edge

https://docs.mlcommons.org/inference/

https://docs.mlcommons.org/inference/power/

Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models

DeepSeek Reasoning for MLPerf Inference v5.1

https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/

NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/README.html

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference5.1-repro/README.html

https://newsroom.intel.com/artificial-intelligence/intel-arc-pro-b-series-gpus-and-xeon-6-shine-in-mlperf-inference-v5-1

https://www.globenewswire.com/news-release/2025/09/09/3147136/0/en/MLCommons-Releases-New-MLPerf-Inference-v5-1-Benchmark-Results.html

https://www.tomshardware.com/pc-components/gpus/nvidia-claims-software-and-hardware-upgrades-allow-blackwell-ultra-gb300-to-dominate-mlperf-benchmarks-touts-45-percent-deepseek-r-1-inference-throughput-increase-over-gb200

https://newsroom.intel.com/tag/intel-arc-pro-b60


The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming

Table of contents: Overview | What MCP standardizes? | Normative authorization controls | Where MCP supports security engineering in practice? | Case study: the first malicious MCP server | Using MCP to structure red-team exercises | Implementation-Focused Security Hardening Checklist | Governance alignment | Current adoption you can test against | Summary | Resources used in the article

Overview

Model Context Protocol (MCP) is an open, JSON-RPC–based standard that formalizes how AI clients (assistants, IDEs, web apps) connect to servers exposing three primitives—tools, resources, and prompts—over defined transports (primarily stdio for local and Streamable HTTP for remote). MCP’s value for security work is that it renders agent/tool interactions explicit and auditable, with normative requirements around authorization that teams can verify in code and in tests. In practice, this enables tight blast-radius control for tool use, repeatable red-team scenarios at clear trust boundaries, and measurable policy enforcement—provided organizations treat MCP servers as privileged connectors subject to supply-chain scrutiny.

What MCP standardizes?

An MCP server publishes: (1) tools (schema-typed actions callable by the model), (2) resources (readable data objects the client can fetch and inject as context), and (3) prompts (reusable, parameterized message templates, typically user-initiated). Distinguishing these surfaces clarifies who is “in control” at each edge: model-driven for tools, application-driven for resources, and user-driven for prompts. Those roles matter in threat modeling, e.g., prompt injection often targets model-controlled paths, while unsafe output handling often occurs at application-controlled joins.
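To make the three surfaces concrete, here is a minimal server sketch assuming the reference Python SDK's FastMCP helper; the server name, tool, resource, and prompt are illustrative only.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("inventory-demo")  # illustrative server name

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Tool: model-controlled action; the schema is derived from the signature."""
    return f"Order {order_id}: shipped"          # placeholder for a real backend call

@mcp.resource("inventory://{sku}")
def inventory(sku: str) -> str:
    """Resource: application-controlled context the client fetches and injects."""
    return f"SKU {sku}: 42 units in stock"       # placeholder data

@mcp.prompt()
def triage(ticket_text: str) -> str:
    """Prompt: user-initiated, parameterized template."""
    return f"Classify this support ticket and propose next steps:\n\n{ticket_text}"

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport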

Transports. The spec defines two standard transports—stdio (Standard Input/Output) and Streamable HTTP—and leaves room for pluggable alternatives. Local stdio reduces network exposure; Streamable HTTP fits multi-client or web deployments and supports resumable streams. Treat the transport choice as a security control: constrain network egress for local servers, and apply standard web authN/Z and logging for remote ones.

Client/server lifecycle and discovery. MCP formalizes how clients discover server capabilities (tools/resources/prompts), negotiate sessions, and exchange messages. That uniformity is what lets security teams instrument call flows, capture structured logs, and assert pre/postconditions without bespoke adapters per integration.

Normative authorization controls

The Authorization approach is unusually prescriptive for an integration protocol and should be enforced as follows:

No token passthrough. “The MCP server MUST NOT pass through the token it received from the MCP client.” Servers are OAuth 2.1 resource servers; clients obtain tokens from an authorization server using RFC 8707 resource indicators so tokens are audience-bound to the intended server. This prevents confused-deputy paths and preserves upstream audit/limit controls.

Audience binding and validation. Servers MUST validate that the access token’s audience matches themselves (resource binding) before serving a request. Operationally, this stops a client-minted token for “Service A” from being replayed to “Service B.” Red teams should include explicit probes for this failure mode.
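In deployments where access tokens are JWTs (the spec does not mandate a token format), audience validation can be enforced at the top of every request handler, roughly as below; the URIs are placeholders and PyJWT is one of several viable libraries.

import jwt                      # PyJWT
from jwt import PyJWKClient

EXPECTED_AUDIENCE = "https://mcp.example.com/server"         # this server's resource identifier (placeholder)
JWKS_URL = "https://auth.example.com/.well-known/jwks.json"  # authorization server keys (placeholder)

def validate_access_token(token: str) -> dict:
    # Reject any token not minted for this server: audience binding per the MCP authorization spec.
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=EXPECTED_AUDIENCE,   # raises jwt.InvalidAudienceError on mismatch
    )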

This is the core of MCP’s security structure: model-side capabilities are powerful, but the protocol insists that servers be first-class principals with their own credentials, scopes, and logs—rather than opaque pass-throughs for a user’s global token.

Where MCP supports security engineering in practice?

Clear trust boundaries. The client-server edge is an explicit, inspectable boundary. You can attach consent UIs, scope prompts, and structured logging at that edge. Many client implementations present permission prompts that enumerate a server’s tools/resources before enabling them—useful for least-privilege and audit—even though UX is not specified by the standard.

Containment and least privilege. Because a server is a separate principal, you can enforce minimal upstream scopes. For example, a secrets-broker server can mint short-lived credentials and expose only constrained tools (e.g., “fetch secret by policy label”), rather than handing broad vault tokens to the model. Public MCP servers from security vendors illustrate this model.

Deterministic attack surfaces for red teaming. With typed tool schemas and replayable transports, red teams can build fixtures that simulate adversarial inputs at tool boundaries and verify post-conditions across models/clients. This yields reproducible tests for classes of failures like prompt injection, insecure output handling, and supply-chain abuse. Pair those tests with recognized taxonomies.

Case study: the first malicious MCP server

In late September 2025, researchers disclosed a trojanized postmark-mcp npm package that impersonated a Postmark email MCP server. Beginning with v1.0.16, the malicious build silently BCC-exfiltrated every email sent through it to an attacker-controlled address/domain. The package was subsequently removed, but guidance urged uninstalling the affected version and rotating credentials. This appears to be the first publicly documented malicious MCP server in the wild, and it underscores that MCP servers often run with high trust and should be vetted and version-pinned like any privileged connector.

Operational takeaways:

Maintain an allowlist of approved servers and pin versions/hashes.

Require code provenance (signed releases, SBOMs) for production servers.

Monitor for anomalous egress patterns consistent with BCC exfiltration.

Practice credential rotation and “bulk disconnect” drills for MCP integrations.

These are not theoretical controls; the incident impact flowed directly from over-trusted server code in a routine developer workflow.

Using MCP to structure red-team exercises

1) Prompt-injection and unsafe-output drills at the tool boundary. Build adversarial corpora that enter via resources (application-controlled context) and attempt to coerce calls to dangerous tools. Assert that the client sanitizes injected outputs and that server post-conditions (e.g., allowed hostnames, file paths) hold. Map findings to LLM01 (Prompt Injection) and LLM02 (Insecure Output Handling).

2) Confused-deputy probes for token misuse. Craft tasks that try to induce a server to use a client-issued token or to call an unintended upstream audience. A compliant server must reject foreign-audience tokens per the authorization spec; clients must request audience-correct tokens with RFC 8707 resource. Treat any success here as a P1.

3) Session/stream resilience. For remote transports, exercise reconnection/resumption flows and multi-client concurrency for session fixation/hijack risks. Validate non-deterministic session IDs and rapid expiry/rotation in load-balanced deployments. (Streamable HTTP supports resumable connections; use it to stress your session model.)

4) Supply-chain kill-chain drills. In a lab, insert a trojaned server (with benign markers) and verify whether your allowlists, signature checks, and egress detection catch it—mirroring the Postmark incident TTPs. Measure time to detection and credential rotation MTTR.

5) Baseline with trusted public servers. Use vetted servers to construct deterministic tasks. Two practical examples: Google’s Data Commons MCP exposes public datasets under a stable schema (good for fact-based tasks/replays), and Delinea’s MCP demonstrates least-privilege secrets brokering for agent workflows. These are ideal substrates for repeatable jailbreak and policy-enforcement tests.

Implementation-Focused Security Hardening Checklist

Client side

Display the exact command or configuration used to start local servers; gate startup behind explicit user consent and enumerate the tools/resources being enabled. Persist approvals with scope granularity. (This is common practice in clients such as Claude Desktop.)

Maintain an allowlist of servers with pinned versions and checksums; deny unknown servers by default.

Log every tool call (name, arguments metadata, principal, decision) and resource fetch with identifiers so you can reconstruct attack paths post-hoc.

Server side

Implement OAuth 2.1 resource-server behavior; validate tokens and audiences; never forward client-issued tokens upstream.

Minimize scopes; prefer short-lived credentials and capabilities that encode policy (e.g., “fetch secret by label” instead of free-form read).

For local deployments, prefer stdio inside a container/sandbox and restrict filesystem/network capabilities; for remote, use Streamable HTTP with TLS, rate limits, and structured audit logs.

Detection & response

Alert on anomalous server egress (unexpected destinations, email BCC patterns) and sudden capability changes between versions.

Prepare break-glass automation to revoke client approvals and rotate upstream secrets quickly when a server is flagged (your “disconnect & rotate” runbook). The Postmark incident showed why time matters.

Governance alignment

MCP’s separation of concerns—clients as orchestrators, servers as scoped principals with typed capabilities—aligns directly with NIST’s AI RMF guidance for access control, logging, and red-team evaluation of generative systems, and with OWASP’s LLM Top-10 emphasis on mitigating prompt injection, unsafe output handling, and supply-chain vulnerabilities. Use those frameworks to justify controls in security reviews and to anchor acceptance criteria for MCP integrations.

Current adoption you can test against

Anthropic/Claude: product docs and ecosystem material position MCP as the way Claude connects to external tools and data; many community tutorials closely follow the spec’s three-primitive model. This provides ready-made client surfaces for permissioning and logging.

Google’s Data Commons MCP: released Sept 24, 2025, it standardizes access to public datasets; its announcement and follow-up posts include production usage notes (e.g., the ONE Data Agent). Useful as a stable “truth source” in red-team tasks.

Delinea MCP: open-source server integrating with Secret Server and Delinea Platform, emphasizing policy-mediated secret access and OAuth alignment with the MCP authorization spec. A practical example of least-privilege tool exposure.

Summary

MCP is not a silver-bullet “security product.” It is a protocol that gives security and red-team practitioners stable, enforceable levers: audience-bound tokens, explicit client-server boundaries, typed tool schemas, and transports you can instrument. Use those levers to (1) constrain what agents can do, (2) observe what they actually did, and (3) replay adversarial scenarios reliably. Treat MCP servers as privileged connectors—vet, pin, and monitor them—because adversaries already do. With those practices in place, MCP becomes a practical foundation for secure agentic systems and a reliable substrate for red-team evaluation.

Resources used in the article

MCP specification & concepts

https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization

https://modelcontextprotocol.io/specification/2025-03-26/basic/transports

https://modelcontextprotocol.io/docs/concepts/architecture

https://modelcontextprotocol.io/docs/concepts/prompts

MCP ecosystem (official)

https://www.anthropic.com/news/model-context-protocol

https://docs.claude.com/en/docs/mcp

https://docs.claude.com/en/docs/claude-code/mcp

https://modelcontextprotocol.io/quickstart/server

https://modelcontextprotocol.io/docs/develop/connect-local-servers

https://modelcontextprotocol.io/docs/develop/connect-remote-servers

Security frameworks

https://owasp.org/www-project-top-10-for-large-language-model-applications/

https://genai.owasp.org/llm-top-10/

https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

https://www.nist.gov/itl/ai-risk-management-framework

Incident: malicious postmark-mcp server

https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft

https://thehackernews.com/2025/09/first-malicious-mcp-server-found.html

https://www.itpro.com/security/a-malicious-mcp-server-is-silently-stealing-user-emails

https://threatprotect.qualys.com/2025/09/30/malicious-mcp-server-on-npm-postmark-mcp-exploited-in-attack/

Example MCP servers referenced

https://developers.googleblog.com/en/datacommonsmcp/

https://blog.google/technology/developers/ai-agents-datacommons/

https://github.com/DelineaXPM/delinea-mcp

https://delinea.com/news/delinea-mcp-server-to-provide-secure-credential-access-for-ai-agents?hs_amp=true

https://delinea.com/blog/unlocking-ai-agents-mcp

The post The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming appeared first on MarkTechPost.

How Hapag-Lloyd improved schedule reliability with ML-powered vessel s …

This post is cowritten with Thomas Voss and Bernhard Hersberger from Hapag-Lloyd.
Hapag-Lloyd is one of the world’s leading shipping companies with more than 308 modern vessels, 11.9 million TEUs (twenty-foot equivalent units) transported per year, and 16,700 motivated employees in more than 400 offices in 139 countries. They connect continents, businesses, and people through reliable container transportation services on the major trade routes across the globe.
In this post, we share how Hapag-Lloyd developed and implemented a machine learning (ML)-powered assistant predicting vessel arrival and departure times that revolutionizes their schedule planning. By using Amazon SageMaker AI and implementing robust MLOps practices, Hapag-Lloyd has enhanced its schedule reliability—a key performance indicator in the industry and quality promise to their customers.
For Hapag-Lloyd, accurate vessel schedule predictions are crucial for maintaining schedule reliability, where schedule reliability is defined as percentage of vessels arriving within 1 calendar day (earlier or later) of their estimated arrival time, communicated around 3 to 4 weeks before arrival.
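As a rough illustration of this KPI, schedule reliability can be computed as the share of port calls whose actual arrival falls within one calendar day of the promised ETA. The column names and toy data below are made up for illustration and are not Hapag-Lloyd's schema.

# Illustrative computation of the schedule-reliability KPI; columns are assumptions.
import pandas as pd

voyages = pd.DataFrame({
    "promised_eta":   pd.to_datetime(["2025-08-18 13:00", "2025-08-22 06:00"]),
    "actual_arrival": pd.to_datetime(["2025-08-19 02:00", "2025-08-25 10:00"]),
})

# A port call counts as "reliable" if the vessel arrives within +/- 1 calendar day
# of the ETA communicated roughly 3 to 4 weeks before arrival.
deviation_days = (voyages["actual_arrival"].dt.normalize()
                  - voyages["promised_eta"].dt.normalize()).dt.days
schedule_reliability = (deviation_days.abs() <= 1).mean()
print(f"Schedule reliability: {schedule_reliability:.0%}")   # 50% in this toy example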
Prior to developing the new ML solution, Hapag-Lloyd relied on simple rule-based and statistical calculations, based on historical transit patterns for vessel schedule predictions. While this statistical method provided basic predictions, it couldn’t effectively account for real-time conditions such as port congestion, requiring significant manual intervention from operations teams.
Developing a new ML solution to replace the existing system presented several key challenges:

Dynamic shipping conditions – The estimated time of arrival (ETA) prediction model needs to account for numerous variables that affect journey duration, including weather conditions, port-related delays such as congestion, labor strikes, and unexpected events that force route changes. For example, when the Suez Canal was blocked by the Ever Given container ship in March 2021, vessels had to be rerouted around Africa, adding approximately 10 days to their journey times.
Data integration at scale – The development of accurate models requires integration of large volumes of historical voyage data with external real-time data sources including port congestion information and vessel position tracking (AIS). The solution needs to scale across 120 vessel services or lines and 1,200 unique port-to-port routes.
Robust MLOps infrastructure – A robust MLOps infrastructure is required to continuously monitor model performance and quickly deploy updates whenever needed. This includes capabilities for regular model retraining to adapt to changing patterns, comprehensive performance monitoring, and maintaining real-time inference capabilities for immediate schedule adjustments.

Hapag-Lloyd’s previous approach to schedule planning couldn’t effectively address these challenges. The company needed a comprehensive solution that could handle the complexity of vessel schedule prediction and provide the infrastructure required to sustain ML operations at global scale.
The Hapag-Lloyd network consists of over 308 vessels and many more partner vessels that continuously circumnavigate the globe on predefined service routes, resulting in more than 3,500 port arrivals per month. Each vessel operates on a fixed service line, making regular round trips between a sequence of ports. For instance, a vessel might repeatedly sail a route from Southampton to Le Havre, Rotterdam, Hamburg, New York, and Philadelphia before starting the cycle again. For each port arrival, an ETA must be provided multiple weeks in advance to arrange critical logistics, including berth windows at ports and onward transportation of containers by sea, land, or air. The following table shows an example where a vessel travels from Southampton to New York through Le Havre, Rotterdam, and Hamburg. The vessel’s time until arrival at the New York port can be calculated as the sum of the ocean-to-port time to Southampton and the respective berth times and port-to-port times for the intermediate ports called while sailing to New York. If this vessel encounters a delay in Rotterdam, the delay affects its arrival in Hamburg and cascades through the entire schedule, impacting arrivals in New York and beyond. This ripple effect can disrupt carefully planned transshipment connections and require extensive replanning of downstream operations.

Port         | Terminal call | Scheduled arrival | Scheduled departure
SOUTHAMPTON  | 1 | 2025-07-29 07:00 | 2025-07-29 21:00
LE HAVRE     | 2 | 2025-07-30 16:00 | 2025-07-31 16:00
ROTTERDAM    | 3 | 2025-08-03 18:00 | 2025-08-05 03:00
HAMBURG      | 4 | 2025-08-07 07:00 | 2025-08-08 07:00
NEW YORK     | 5 | 2025-08-18 13:00 | 2025-08-21 13:00
PHILADELPHIA | 6 | 2025-08-22 06:00 | 2025-08-24 16:30
SOUTHAMPTON  | 7 | 2025-09-01 08:00 | 2025-09-02 20:00

When a vessel departs Rotterdam with a delay, new ETAs must be calculated for the remaining ports. For Hamburg, we only need to estimate the remaining sailing time from the vessel’s current position. However, for subsequent ports like New York, the prediction requires multiple components: the remaining sailing time to Hamburg, the duration of port operations in Hamburg, and the sailing time from Hamburg to New York.
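A simplified sketch of how those components could be composed into an updated ETA is shown below. The function name and the toy values are illustrative only; they do not reflect Hapag-Lloyd's actual implementation, which uses learned models for each leg.

# Hedged sketch of hierarchical ETA composition for a downstream port.
from datetime import datetime, timedelta

def predict_eta_new_york(now: datetime,
                         o2p_hours_to_hamburg: float,
                         berth_hours_hamburg: float,
                         p2p_hours_hamburg_to_ny: float) -> datetime:
    """Sum the remaining legs: sail to Hamburg, work the port, sail to New York."""
    total_hours = o2p_hours_to_hamburg + berth_hours_hamburg + p2p_hours_hamburg_to_ny
    return now + timedelta(hours=total_hours)

# Toy values: 36h to reach Hamburg, 24h at berth, 240h Hamburg -> New York.
eta = predict_eta_new_york(datetime(2025, 8, 5, 12, 0), 36, 24, 240)
print(eta)  # 2025-08-18 00:00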
Solution overview
As an input to the vessel ETA prediction, we process the following two data sources:

Hapag-Lloyd’s internal data, which is stored in a data lake. This includes detailed vessel schedules and routes, port and terminal performance information, real-time port congestion and waiting times, and vessel characteristics datasets. This data is prepared for model training using AWS Glue jobs.
Automatic Identification System (AIS) data, which provides streaming updates on the vessel movements. This AIS data ingestion is batched every 20 minutes using AWS Lambda and includes crucial information such as latitude, longitude, speed, and direction of vessels. New batches are processed using AWS Glue and Iceberg to update the existing AIS database—currently holding around 35 million observations.

These data sources are combined to create training datasets for the ML models. We carefully consider the timing of available data through temporal splitting to avoid data leakage. Data leakage occurs when using information that wouldn’t be available at prediction time in the real world. For example, when training a model to predict arrival time in Hamburg for a vessel currently in Rotterdam, we can’t use actual transit times that were only known after the vessel reached Hamburg.
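The following is a minimal sketch of such a temporal split, with assumed column names: records whose outcomes were only observed after the cutoff never enter the training set.

# Hedged sketch of leakage-safe temporal splitting; column names are assumptions.
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str):
    """Train on voyages completed before `cutoff`; evaluate on later predictions."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df["arrival_time"] < cutoff_ts]       # outcomes fully observed
    test = df[df["prediction_time"] >= cutoff_ts]    # predicted strictly after cutoff
    # Guard: no training label may post-date the cutoff.
    assert (train["arrival_time"] < cutoff_ts).all()
    return train, test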
A vessel’s journey can be divided into different legs, which led us to develop a multi-step solution using specialized ML models for each leg, which are orchestrated as hierarchical models to retrieve the overall ETA:

The Ocean to Port (O2P) model predicts the time needed for a vessel to reach its next port from its current position at sea. The model uses features such as remaining distance to destination, vessel speed, journey progress metrics, port congestion data, and historical sea leg durations.
The Port to Port (P2P) model forecasts sailing time between any two ports for a given date, considering key features such as ocean distance between ports, recent transit time trends, weather, and seasonal patterns.
The Berth Time model estimates how long a vessel will spend at port. The model uses vessel characteristics (such as tonnage and load capacity), planned container load, and historical port performance.
The Combined model takes as input the predictions from the O2P, P2P, and Berth Time models, along with the original schedule. Rather than predicting absolute arrival times, it computes the expected deviation from the original schedule by learning patterns in historical prediction accuracy and specific voyage conditions. These computed deviations are then used to update ETAs for the upcoming ports in a vessel’s schedule.

All four models are trained using the XGBoost algorithm built into SageMaker, chosen for its ability to handle complex relationships in tabular data and its robust performance with mixed numerical and categorical features. Each model has a dedicated training pipeline in SageMaker Pipelines, handling data preprocessing steps and model training. The following diagram shows the data processing pipeline, which generates the input datasets for ML training.

As an example, this diagram shows the training pipeline of the Berth model. The steps in the SageMaker training pipelines of the Berth, P2P, O2P, and Combined models are identical. Therefore, the training pipeline is implemented once as a blueprint and reused across the other models, enabling a fast implementation turnaround.

Because the Combined model depends on outputs from the other three specialized models, we use AWS Step Functions to orchestrate the SageMaker pipelines for training. This helps ensure that the individual models are updated in the correct sequence and maintains prediction consistency across the system. The orchestration of the training pipelines is shown in the following pipeline architecture.
The individual workflow begins with a data processing pipeline that prepares the input data (vessel schedules, AIS data, port congestion, and port performance metrics) and splits it into dedicated datasets. This feeds into three parallel SageMaker training pipelines for our base models (O2P, P2P, and Berth), each following a standardized process of feature encoding, hyperparameter optimization, model evaluation, and registration using SageMaker Processing jobs, SageMaker hyperparameter tuning jobs, and SageMaker Model Registry. After training, each base model runs a SageMaker batch transform job to generate predictions that serve as input features for the Combined model training. The performance of the latest Combined model version is tested on the last 3 months of data with known ETAs, and performance metrics (R², mean absolute error (MAE)) are computed. If the model does not meet the set MAE threshold, the entire training run fails and the model version is automatically discarded, preventing the deployment of models that fall short of the minimum performance requirement.
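
To illustrate the evaluation gate described above, here is a minimal sketch; the threshold value and metric choices are assumptions for illustration, not Hapag-Lloyd's production settings.

# Sketch of a pipeline evaluation gate; threshold and metrics are illustrative.
from sklearn.metrics import mean_absolute_error, r2_score

MAE_THRESHOLD_HOURS = 12.0   # assumed acceptance threshold

def evaluation_gate(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"MAE={mae:.2f}h, R2={r2:.3f}")
    if mae > MAE_THRESHOLD_HOURS:
        # Failing this step fails the orchestrated workflow, so the candidate
        # model version is discarded instead of being registered and deployed.
        raise RuntimeError(f"MAE {mae:.2f}h exceeds threshold {MAE_THRESHOLD_HOURS}h")
    return mae, r2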
All four models are versioned and stored as separate model package groups in the SageMaker Model Registry, enabling systematic version control and deployment. This orchestrated approach helps ensure that our models are trained in the correct sequence using parallel processing, resulting in an efficient and maintainable training process.
The hierarchical model approach also helps maintain a degree of explainability comparable to the current statistical and rule-based solution—avoiding ML black-box behavior. For example, it becomes possible to highlight unusually long berthing time predictions when discussing prediction results with business experts. This increases transparency and builds trust, which in turn increases acceptance within the company.
Inference solution walkthrough
The inference infrastructure implements a hybrid approach combining batch processing with real-time API capabilities as shown in Figure 5. Because most data sources update daily and require extensive preprocessing, the core predictions are generated through nightly batch inference runs. These pre-computed predictions are complemented by a real-time API that implements business logic for schedule changes and ETA updates.

Daily batch inference:

Amazon EventBridge triggers a Step Functions workflow every day.
The Step Functions workflow orchestrates the data and inference process:

Lambda copies internal Hapag-Lloyd data from the data lake to Amazon Simple Storage Service (Amazon S3).
AWS Glue jobs combine the different data sources and prepare inference inputs.
SageMaker inference executes in sequence:

Fallback predictions are computed from historical averages and written to Amazon Relational Database Service (Amazon RDS). Fallback predictions are used in case of missing data or a downstream inference failure.
Preprocessing data for the four specialized ML models.
O2P, P2P, and Berth model batch transforms.
The Combined model batch transform generates final ETA predictions, which are written to Amazon RDS.
Input features and output files are stored in Amazon S3 for analytics and monitoring.

For operational reliability, any failures in the inference pipeline trigger immediate email notifications to the on-call operations team through Amazon Simple Email Service (Amazon SES).

Real-time API:

Amazon API Gateway receives client requests containing the current schedule and an indication for which vessel-port combinations an ETA update is required. By receiving the current schedule through the client request, we can take care of intraday schedule updates while doing daily batch transform updates.
The API Gateway triggers a Lambda function calculating the response. The Lambda function constructs the response by linking the ETA predictions (stored in Amazon RDS) with the current schedule using custom business logic, so that we can take care of short-term schedule changes unknown at inference time. Typical examples of short-term schedule changes are port omissions (for example, due to port congestion) and one-time port calls.

This architecture enables millisecond response times to custom requests while achieving a 99.5% availability (a maximum 3.5 hours downtime per month).

Conclusion
Hapag-Lloyd’s ML-powered vessel scheduling assistant outperforms the current solution in both accuracy and response time. Typical API response times are on the order of hundreds of milliseconds, helping to ensure a real-time user experience and outperforming the current solution by more than 80%. Low response times are crucial because, in addition to fully automated schedule updates, business experts need them to work with the schedule assistant interactively. In terms of accuracy, the MAE of the ML-powered ETA predictions is approximately 12% better than that of the current solution, which on average translates into climbing two positions in the international schedule reliability ranking. This is one of the key performance metrics in liner shipping, and this is a significant improvement within the industry.
To learn more about architecting and governing ML workloads at scale on AWS, see the AWS blog post Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker and the accompanying AWS workshop AWS Multi-Account Data & ML Governance Workshop.
Acknowledgement
We acknowledge the significant and valuable work of Michal Papaj and Piotr Zielinski from Hapag-Lloyd in the data science and data engineering areas of the project.
About the authors
Thomas Voss works at Hapag-Lloyd as a data scientist. With his background in academia and logistics, he takes pride in leveraging data science expertise to drive business innovation and growth through the practical design and modeling of AI solutions.
Bernhard Hersberger works as a data scientist at Hapag-Lloyd, where he heads the AI Hub team in Hamburg. He is enthusiastic about integrating AI solutions across the company, taking comprehensive responsibility from identifying business issues to deploying and scaling AI solutions worldwide.
Gabija Pasiunaite was a Machine Learning Engineer at AWS Professional Services based in Zurich. She specialized in building scalable ML and data solutions for AWS Enterprise customers, combining expertise in data engineering, ML automation and cloud infrastructure. Gabija has contributed to the AWS MLOps Framework used by AWS customers globally. Outside work, Gabija enjoys exploring new destinations and staying active through hiking, skiing, and running.
Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist.
Mousam Majhi is a Senior ProServe Cloud Architect focusing on Data & AI within AWS Professional Services. He works with Manufacturing and Travel, Transportation & Logistics customers in DACH to achieve their business outcomes by leveraging data and AI powered solutions. Outside of work, Mousam enjoys hiking in the Bavarian Alps.

Rox accelerates sales productivity with AI agents powered by Amazon Be …

This post was co-written with Shriram Sridharan, Taeuk Kang, and Santhosh Kumar Manavasi Lakshminarayanan from Rox.
Rox is building a new revenue operating system for the applied AI era.
Modern revenue teams rely on more data than ever before, such as Customer Relationship Management (CRM) systems, marketing automation, finance systems, support tickets, and live product usage. Though each serves its role, together they create silos that slow sellers down and leave insights untapped.
Rox addresses this by providing a revenue operating system: a unified layer that brings these signals together and equips AI agents to execute go-to-market (GTM) workflows. Instead of reconciling reports or updating fields, sellers get real-time intelligence and automation in their daily flow.
Today, we’re excited to announce that Rox is generally available, with Rox infrastructure built on AWS and delivered across web, Slack, macOS, and iOS. In this post, we share how Rox accelerates sales productivity with AI agents powered by Amazon Bedrock.
Solution overview
As noted in Rox is transforming revenue teams with AI-driven integration powered by AWS, modern GTM teams need more than a static database. Revenue data spans dozens of systems, such as product usage, finance, and support, and teams require a system that unifies context and acts on it in real time.
Rox delivers this through a layered architecture on AWS:

System of record – A unified, governed knowledge graph consolidates CRM, finance, support, product telemetry, and web data
Agent swarms – Intelligent, account-aware agents reason over the graph and orchestrate multi-step workflows like research, outreach, opportunity management, and proposal generation
Interfaces across surfaces – Sellers engage these workflows where they work, such as web application, Slack, iOS, and macOS

This converts the CRM from a passive system of record into an active system of action, so teams can act on their data immediately and intelligently.
The following diagram illustrates the solution architecture.

Benefits and features of Rox
Now generally available, Rox extends from intelligence to full execution with Command, a new conversational interface that orchestrates multi-agent workflows. Command coordinates with multiple specialized agents running in parallel. A single request (for example, “prep me for the ACME renewal and draft follow-ups”) expands into a plan: research usage and support signals, identify missing stakeholders, refresh enrichment, propose next-best actions, draft outreach, update the opportunity, and assemble a proposal. Each step is completed through tool calls into your systems and is subject to guardrail approvals. Our comprehensive safety architecture employs a sophisticated multi-layer guardrail system as the first line of defense against inappropriate, harmful, or malicious requests. Incoming requests undergo rigorous analysis through our advanced filtering mechanisms before reaching the inference layer. This preprocessing stage evaluates multiple dimensions of safety and appropriateness, such as legal compliance assessment and business relevance evaluation, to make sure only legitimate, safe, and contextually appropriate requests proceed to model execution.
Command decomposes the request, routes steps to the right agents, sequences external tool invocations (CRM, calendar, enrichment, email), reconciles results into the system of context, and returns one coherent thread that’s ready for consumption on the web, Slack, iOS, or macOS. Every suggestion is explainable (sources and traces), reversible (audit logs), and policy-aware (role-based access control, rate limits, required approvals).
How Amazon Bedrock powers Rox
Command demands a model capable of reasoning across multiple steps, orchestrating tools, and adapting dynamically.
To meet these needs, Rox chose Anthropic’s Claude Sonnet 4 on Amazon Bedrock. Anthropic’s Claude Sonnet 4 has consistently demonstrated unmatched tool-calling and reasoning performance, allowing Rox agents to sequence workflows like account research, enrichment, outreach, opportunity management, and proposal generation with reliability.
Amazon Bedrock provides the foundation to deliver Rox at enterprise scale, offering security, flexibility to integrate with the latest models, and scalability to handle thousands of concurrent agents reliably.
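For readers who want to see what a tool-calling request to Amazon Bedrock looks like at the API level, here is a minimal, generic sketch using the boto3 Converse API. The model ID and the tool definition are illustrative assumptions, not Rox's internal implementation; verify the exact model ID available in your account and Region.

# Minimal Bedrock Converse sketch with a single hypothetical tool; not Rox's code.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "lookup_account",           # hypothetical CRM lookup tool
            "description": "Fetch usage and support signals for an account.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"account_name": {"type": "string"}},
                "required": ["account_name"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # assumed ID; check your console
    messages=[{"role": "user",
               "content": [{"text": "Prep me for the ACME renewal."}]}],
    toolConfig=tool_config,
)

# If the model decides to call the tool, stopReason is "tool_use" and the request
# appears as a toolUse block in the returned message content.
print(response["stopReason"])
print(response["output"]["message"]["content"])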
In addition to Command, Rox includes the following features:

Research – Offers deep account and market research, grounded in unified context (carried over from private beta)
Meet – Makes it possible to record, transcribe, summarize, and turn meetings into actions (carried over from private beta)
Outreach – Provides personalized prospect engagement, contextualized by unified data (new)
Revenue – Helps you track, update, and advance pipelines in the flow of work (new)
Auto-fill proposals – Helps you assemble tailored proposals in seconds from account context (new)
Rox apps – Offers modular extensions that add purpose-built workflows (dashboards, trackers) directly into the system (new)
iOS app – Delivers notifications and meeting prep on the go (new)
Mac app – Brings the ability to transcribe calls and add them to the system of context (new)
Regional expansion – Now live in the AWS Middle East (Bahrain) AWS Region, aligning with data residency and sovereignty needs (new)

Early customer impact
In beta, enterprises saw immediate gains:

50% higher representative productivity
20% faster sales velocity
Twofold revenue per rep

For example, real Rox customers were able to sharpen their focus on high-value opportunities, driving a 40–50% increase in average selling price. Another customer saw 90% reduction in rep prep time and faster closes, plus 15% more six-figure deals uncovered through Rox insights. Rox also shortens ramp time for new reps, with customers reporting 50% quicker ramp time using Rox.
Try Rox today
Our vision is for revenue teams to run with an always-on agent swarm that continuously researches accounts, engages stakeholders, and moves the pipeline forward.
Rox is now generally available. Get started at rox.com or visit the AWS Marketplace. Together with AWS, we will continue to build the AI-based operating system for modern revenue teams.

About the authors
Shriram Sridharan is the Co-Founder/Engineering Head of Rox, a Sequoia-backed AI company. Before Rox, Shriram led the data infrastructure team at Confluent, responsible for making Kafka faster and cheaper across clouds. Prior to that, he was one of the early engineers on Amazon Aurora (pre-launch), reimagining databases for the cloud. Aurora was the fastest-growing AWS service and a recipient of the 2019 SIGMOD Systems Award.
Taeuk Kang is a Founding Engineer at Rox, working across AI research and engineering. He studied Computer Science at Stanford. Prior to Rox, he built large language model agents and retrieval-augmented generation systems at X (formerly Twitter) and designed the distributed LLM infrastructure powering core product features and Trust & Safety, improving overall platform health. Earlier at Stripe, he developed high-performance streaming and batch data processing pipelines integrating Apache Flink, Spark, Kafka, and AWS SQS.
Santhosh Kumar Manavasi Lakshminarayanan leads Platform at Rox. Before Rox, he was Director of Engineering at StreamSets (acquired by IBM), where he led the StreamSets Cloud Platform, making it seamless for large enterprises to run their data pipelines at scale on modern cloud providers. Before StreamSets, he was a senior engineer on the Platform Metadata team at Informatica.
Andrew Brown is an Account Executive for AI Startups at Amazon Web Services (AWS) in San Francisco, CA. With a strong background in cloud computing and a focus on supporting startups, Andrew specializes in helping companies scale their operations using AWS technologies.
Santhan Pamulapati is a Sr. Solutions Architect for GenAI startups at AWS, with deep expertise in designing and building scalable solutions that drive customer growth. He has a strong background in building HPC systems on AWS services and has worked with strategic customers to solve business challenges.

Zhipu AI Releases GLM-4.6: Achieving Enhancements in Real-World Coding …

Zhipu AI has released GLM-4.6, a major update to its GLM series focused on agentic workflows, long-context reasoning, and practical coding tasks. The model raises the input window to 200K tokens with a 128K max output, targets lower token consumption in applied tasks, and ships with open weights for local deployment.

https://z.ai/blog/glm-4.6

So, what exactly is new?

Context + output limits: 200K input context and 128K maximum output tokens.

Real-world coding results: On the extended CC-Bench (multi-turn tasks run by human evaluators in isolated Docker environments), GLM-4.6 is reported near parity with Claude Sonnet 4 (48.6% win rate) and uses ~15% fewer tokens vs. GLM-4.5 to finish tasks. Task prompts and agent trajectories are published for inspection.

Benchmark positioning: Zhipu summarizes “clear gains” over GLM-4.5 across eight public benchmarks and states parity with Claude Sonnet 4/4.5 on several; it also notes GLM-4.6 still lags Sonnet 4.5 on coding—a useful caveat for model selection.

Ecosystem availability: GLM-4.6 is available via Z.ai API and OpenRouter; it integrates with popular coding agents (Claude Code, Cline, Roo Code, Kilo Code), and existing Coding Plan users can upgrade by switching the model name to glm-4.6.

Open weights + license: Hugging Face model card lists License: MIT and Model size: 355B params (MoE) with BF16/F32 tensors. (MoE “total parameters” are not equal to active parameters per token; no active-params figure is stated for 4.6 on the card.)

Local inference: vLLM and SGLang are supported for local serving; weights are on Hugging Face and ModelScope. A minimal example call against a locally served endpoint is sketched below.

https://z.ai/blog/glm-4.6
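
If you serve the open weights locally (for example, behind vLLM's or SGLang's OpenAI-compatible server), a request might look like the following sketch. The base URL and registered model name depend on how you launched the server and are assumptions here.

# Sketch: query a locally served GLM-4.6 through an OpenAI-compatible endpoint.
# Base URL and model name are assumptions about your local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="zai-org/GLM-4.6",   # use the model name your server registered
    messages=[{"role": "user",
               "content": "Write a Python function that merges two sorted lists."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)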

Summary

GLM-4.6 is an incremental but material step: a 200K context window, ~15% token reduction on CC-Bench versus GLM-4.5, near-parity task win-rate with Claude Sonnet 4, and immediate availability via Z.ai, OpenRouter, and open-weight artifacts for local serving.

FAQs

1) What are the context and output token limits?
GLM-4.6 supports a 200K input context and 128K maximum output tokens.

2) Are open weights available and under what license?
Yes. The Hugging Face model card lists open weights with License: MIT and a 357B-parameter MoE configuration (BF16/F32 tensors).

3) How does GLM-4.6 compare to GLM-4.5 and Claude Sonnet 4 on applied tasks?
On the extended CC-Bench, GLM-4.6 reports ~15% fewer tokens vs. GLM-4.5 and near-parity with Claude Sonnet 4 (48.6% win rate).

4) Can I run GLM-4.6 locally?
Yes. Zhipu provides weights on Hugging Face/ModelScope and documents local inference with vLLM and SGLang; community quantizations are appearing for workstation-class hardware.

Check out the GitHub Page, Hugging Face Model Card and Technical details.

The post Zhipu AI Releases GLM-4.6: Achieving Enhancements in Real-World Coding, Long-Context Processing, Reasoning, Searching and Agentic AI appeared first on MarkTechPost.

OpenAI Launches Sora 2 and a Consent-Gated Sora iOS App

OpenAI released Sora 2, a text-to-video-and-audio model focused on physical plausibility, multi-shot controllability, and synchronized dialogue/SFX. The OpenAI team has also launched a new invite-only Sora iOS app (U.S. and Canada first) that enables social creation, remixing, and consent-controlled “cameos” for inserting a verified likeness into generated scenes.

Model capabilities

Sora 2 claims materially better world modeling (e.g., rebounds on missed shots instead of object “teleportation”), maintains state across shots for instruction-following edits, and generates native, time-aligned audio (speech, ambient, effects). These are framed as prerequisites for simulation-grade video generation rather than single-clip “best effort” synthesis.

App architecture and “cameos”

The Sora app is built around cameos: users record a short in-app video+audio to verify identity and capture likeness; cameo owners control who can use their likeness and can revoke or delete any video—including drafts—that includes them. The app is available on iOS, and availability will expand after the initial U.S./Canada rollout.

Safety posture

OpenAI’s Sora 2 documents an iterative rollout with specific launch-time restrictions and provenance controls:

Uploads/Generations: At launch, OpenAI is restricting the use of image uploads that feature a photorealistic person and all video uploads. Sora 2 does not support video-to-video at launch, blocks text-to-video of public figures, and blocks generations that include real people except when a user has opted in via the cameo feature. Additional classifier thresholds apply when a real person appears.

Provenance: All outputs carry C2PA metadata and a visible moving watermark on downloads, with internal detection tools for origin assessment.

Parental controls

In parallel with Sora, OpenAI introduced parental controls integrated via ChatGPT: parents can opt teens into a non-personalized feed, manage DM permissions, and control whether continuous scroll is allowed—aligned with the Sora feed’s “creation-over-consumption” philosophy.

Access and pricing

The Sora iOS app is available to download now; access opens by invite, with Sora 2 initially free under compute-constrained caps. ChatGPT Pro users get access to an experimental Sora 2 Pro tier on sora.com (and coming to the app). API access is planned after the consumer rollout. Existing Sora 1 Turbo content remains available in user libraries.

Summary

Sora 2 pushes text-to-video toward controllable, physics-respecting, audio-synchronized generation—and OpenAI is shipping it inside an invite-only iOS app with consent-gated cameos plus C2PA metadata and visible watermarks for provenance. The initial U.S./Canada rollout prioritizes safety constraints (e.g., restrictions on public-figure depictions) while staging broader access and API plans, signaling a deliberate shift from raw capability demos to governed, production-ready media tooling.

“Sora 2 is here.” pic.twitter.com/hy95wDM5nB — OpenAI (@OpenAI), September 30, 2025

The post OpenAI Launches Sora 2 and a Consent-Gated Sora iOS App appeared first on MarkTechPost.

Delinea Released an MCP Server to Put Guardrails Around AI Agents Cred …

Delinea released a Model Context Protocol (MCP) server that brokers AI-agent access to credentials stored in Delinea Secret Server and the Delinea Platform. The server applies identity checks and policy rules on every call, aiming to keep long-lived secrets out of agent memory while retaining full auditability.

What’s new for me?

The GitHub project DelineaXPM/delinea-mcp (MIT-licensed) exposes a constrained MCP tool surface for credential retrieval and account operations, supports OAuth 2.0 dynamic client registration per the MCP spec, and offers both STDIO and HTTP/SSE transports. The repo includes Docker artifacts and example configs for editor/agent integrations.

How does it work?

The server exposes MCP tools that proxy to Secret Server and (optionally) the Delinea Platform: secret and folder retrieval/search, inbox/access-request helpers, user/session admin, and report execution; secrets themselves remain vaulted and are never presented to the agent. Configuration separates secrets into environment variables (e.g., DELINEA_PASSWORD) and non-secrets into config.json, with scope controls (enabled_tools, allowed object types), TLS certs, and an optional registration pre-shared key.

Why exactly does it matter to me?

Enterprises are rapidly wiring agents to operational systems through MCP. Recent incidents—such as a rogue MCP package exfiltrating email—underscore the need for registration controls, TLS, least-privilege tool surfaces, and traceable identity context on every call. Delinea’s server claims to implement these controls in a PAM-aligned pattern (ephemeral auth + policy checks + audit), reducing credential sprawl and simplifying revocation.

Summary

Delinea’s MIT-licensed MCP server gives enterprises a standard, auditable way for AI agents to access credentials—short-lived tokens, policy evaluation, and constrained tools—reducing secret exposure while integrating with Secret Server and the Delinea Platform. It’s available now on GitHub, with initial coverage and technical details confirming OAuth 2.0, STDIO/HTTP (SSE) transports, and scoped operations.

The post Delinea Released an MCP Server to Put Guardrails Around AI Agents Credential Access appeared first on MarkTechPost.

Modernize fraud prevention: GraphStorm v0.5 for real-time inference

Fraud continues to cause significant financial damage globally, with U.S. consumers alone losing $12.5 billion in 2024—a 25% increase from the previous year according to the Federal Trade Commission. This surge stems not from more frequent attacks, but from fraudsters’ increasing sophistication. As fraudulent activities become more complex and interconnected, conventional machine learning approaches fall short by analyzing transactions in isolation, unable to capture the networks of coordinated activities that characterize modern fraud schemes.
Graph neural networks (GNNs) effectively address this challenge by modeling relationships between entities—such as users sharing devices, locations, or payment methods. By analyzing both network structures and entity attributes, GNNs are effective at identifying sophisticated fraud schemes where perpetrators mask individual suspicious activities but leave traces in their relationship networks. However, implementing GNN-based online fraud prevention in production environments presents unique challenges: achieving sub-second inference responses, scaling to billions of nodes and edges, and maintaining operational efficiency for model updates. In this post, we show you how to overcome these challenges using GraphStorm, particularly the new real-time inference capabilities of GraphStorm v0.5.
Previous solutions required tradeoffs between capability and simplicity. Our initial DGL approach provided comprehensive real-time capabilities but demanded intricate service orchestration—including manually updating endpoint configurations and payload formats after retraining with new hyperparameters. This approach also lacked model flexibility, requiring customization of GNN models and configurations when using architectures beyond relational graph convolutional networks (RGCN). Subsequent in-memory DGL implementations reduced complexity but encountered scalability limitations with enterprise data volumes. We built GraphStorm to bridge this gap, by introducing distributed training and high-level APIs that help simplify GNN development at enterprise scale.
In a recent blog post, we illustrated GraphStorm’s enterprise-scale GNN model training and offline inference capabilities and their simplicity. While offline GNN fraud detection can identify fraudulent transactions after they occur, preventing financial loss requires stopping fraud before it happens. GraphStorm v0.5 makes this possible through native real-time inference support through Amazon SageMaker AI. GraphStorm v0.5 delivers two innovations: streamlined endpoint deployment that reduces weeks of custom engineering—coding SageMaker entry point files, packaging model artifacts, and calling SageMaker deployment APIs—to a single-command operation, and a standardized payload specification that helps simplify client integration with real-time inference services. These capabilities enable sub-second node classification tasks like fraud prevention, empowering organizations to proactively counter fraud threats with scalable, operationally straightforward GNN solutions.
To showcase these capabilities, this post presents a fraud prevention solution. Through this solution, we show how a data scientist can transition a trained GNN model to production-ready inference endpoints with minimal operational overhead. If you’re interested in implementing GNN-based models for real-time fraud prevention or similar business cases, you can adapt the approaches presented here to create your own solutions.
Solution overview
Our proposed solution is a four-step pipeline, as shown in the following figure. The pipeline starts at step 1 with transaction graph export from an online transaction processing (OLTP) graph database to scalable storage (Amazon Simple Storage Service (Amazon S3) or Amazon EFS), followed by distributed model training in step 2. Step 3 is GraphStorm v0.5’s simplified deployment process, which creates SageMaker real-time inference endpoints with one command. After SageMaker AI has deployed the endpoint successfully, a client application that integrates with the OLTP graph database processes live transaction streams in step 4. By querying the graph database, the client prepares a subgraph around each transaction to be predicted, converts the subgraph into the standardized payload format, and invokes the deployed endpoint for real-time prediction.

To provide concrete implementation details for each step in the real-time inference solution, we demonstrate the complete workflow using the publicly available IEEE-CIS fraud detection task.
Note: This example uses a Jupyter notebook as the controller of the overall four-step pipeline for simplicity. For more production-ready design, see the architecture described in Build a GNN-based real-time fraud detection solution.
Prerequisites
To run this example, you need an AWS account that the example’s AWS Cloud Development Kit (AWS CDK) code uses to create required resources, including Amazon Virtual Private Cloud (Amazon VPC), an Amazon Neptune database, Amazon SageMaker AI, Amazon Elastic Container Registry (Amazon ECR), Amazon S3, and related roles and permission.
Note: These resources incur costs during execution (approximately $6 per hour with default settings). Monitor usage carefully and review pricing pages for these services before proceeding. Follow cleanup instructions at the end to avoid ongoing charges.
Hands-on example: Real-time fraud prevention with IEEE-CIS dataset
All implementation code for this example, including Jupyter notebooks and supporting Python scripts, is available in our public repository. The repository provides a complete end-to-end implementation that you can directly execute and adapt for your own fraud prevention use cases.
Dataset and task overview
This example uses the IEEE-CIS fraud detection dataset, containing 500,000 anonymized transactions with approximately 3.5% fraudulent cases. The dataset includes 392 categorical and numerical features, with key attributes like card types, product types, addresses, and email domains forming the graph structure shown in the following figure. Each transaction (with an isFraud label) connects to Card Type, Location, Product Type, and Purchaser and Recipient email domain entities, creating a heterogeneous graph that enables GNN models to detect fraud patterns through entity relationships.
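To give a feel for what this entity extraction looks like, here is a small illustrative sketch that builds Transaction-to-email-domain edges from the raw table. The relation name and file path are assumptions based on the notebook described later in this post; the repository wraps the full logic in its own helper functions.

# Illustrative sketch: derive Transaction -> Purchaser (email domain) edges
# from the raw IEEE-CIS transaction table. Path and relation name are assumptions.
import pandas as pd

tx = pd.read_csv("./input-data/transaction.csv",
                 usecols=["TransactionID", "P_emaildomain", "isFraud"])

edges = (tx.dropna(subset=["P_emaildomain"])
           .rename(columns={"TransactionID": "src", "P_emaildomain": "dst"}))
edges["edge_type"] = "Transaction,purchased_by,Purchaser"   # illustrative relation name

print(edges.head())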

Unlike our previous post that demonstrated GraphStorm plus Amazon Neptune Analytics for offline analysis workflows, this example uses a Neptune database as the OLTP graph store, optimized for the quick subgraph extraction required during real-time inference. Following the graph design, the tabular IEEE-CIS data is converted to a set of CSV files compatible with the Neptune database format, allowing direct loading into both the Neptune database and GraphStorm’s GNN model training pipeline with a single set of files.
Step 0: Environment setup
Step 0 establishes the running environment required for the four-step fraud prevention pipeline. Complete setup instructions are available in the implementation repository.
To run the example solution, you need to deploy an AWS CloudFormation stack through the AWS CDK. This stack creates the Neptune DB instance, the VPC to place it in, and appropriate roles and security groups. It additionally creates a SageMaker AI notebook instance, from which you run the example notebooks that come with the repository.

git clone https://github.com/aws-samples/amazon-neptune-samples.git
cd neptune-database-graphstorm-online-inference/neptune-db-cdk
# Ensure you have CDK installed and have appropriate credentials set up
cdk deploy

When deployment is finished (it takes approximately 10 minutes for required resources to be ready), the AWS CDK prints a few outputs, one of which is the name of the SageMaker notebook instance you use to run through the notebooks:

# Example output
NeptuneInfraStack.NotebookInstanceName = arn:aws:sagemaker:us-east-1:012345678912:notebook-instance/NeptuneNotebook-9KgSB9XXXXXX

You can navigate to the SageMaker AI notebook UI, find the corresponding notebook instance, and select its Open Jupyterlab link to access the notebook.
Alternatively, you can use the AWS Command Line Interface (AWS CLI) to get a pre-signed URL to access the notebook. You will need to replace the <notebook-instance-name> with the actual notebook instance name.

aws sagemaker create-presigned-notebook-instance-url --notebook-instance-name <notebook-instance-name>

When you’re in the notebook instance web console, open the first notebook, 0-Data-Preparation.ipynb, to start going through the example.
Step 1: Graph construction
In the Notebook 0-Data-Preparation, you transform the tabular IEEE-CIS dataset into the heterogeneous graph structure shown in the figure at the start of this section. The provided Jupyter Notebook extracts entities from transaction features, creating Card Type nodes from card1–card6 features, Purchaser and Recipient nodes from email domains, Product Type nodes from product codes, and Location nodes from geographic information. The transformation establishes relationships between transactions and these entities, generating graph data in Neptune import format for direct ingestion into the OLTP graph store. The create_neptune_db_data() function orchestrates this entity extraction and relationship creation process across all node types (which takes approximately 30 seconds).

GRAPH_NAME = "ieee-cis-fraud-detection"
PROCESSED_PREFIX = f"./{GRAPH_NAME}"
ID_COLS = "card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain"
CAT_COLS = "M1,M2,M3,M4,M5,M6,M7,M8,M9"
# Lists of columns to keep from each file
COLS_TO_KEEP = {
    "transaction.csv": (
        ID_COLS.split(",")
        + CAT_COLS.split(",")
        +
        # Numerical features without missing values
        [f"C{idx}" for idx in range(1, 15)]
        + ["TransactionID", "TransactionAmt", "TransactionDT", "isFraud"]
    ),
    "identity.csv": ["TransactionID", "DeviceType"],
}

create_neptune_db_data(
    data_prefix="./input-data/",
    output_prefix=PROCESSED_PREFIX,
    id_cols=ID_COLS,
    cat_cols=CAT_COLS,
    cols_to_keep=COLS_TO_KEEP,
    num_chunks=1,
)

This notebook also generates the JSON configuration file required by GraphStorm’s GConstruct command and executes the graph construction process. This GConstruct command transforms the Neptune-formatted data into a distributed binary graph format optimized for GraphStorm’s training pipeline, which partitions the heterogeneous graph structure across compute nodes to enable scalable model training on industry-scale graphs (measured in billions of nodes and edges). For the IEEE-CIS data, the GConstruct command takes 90 seconds to complete.
In the Notebook 1-Load-Data-Into-Neptune-DB, you load the CSV data into the Neptune database instance (takes approximately 9 minutes), which makes them available for online inference. During online inference, after selecting a transaction node, you query the Neptune database to get the graph neighborhood of the target node, retrieving the features of every node in the neighborhood and the subgraph structure around the target.
Step 2: Model training
After you have converted the data into the distributed binary graph format, it’s time to train a GNN model. GraphStorm provides command-line scripts to train a model without writing code. In the Notebook 2-Model-Training, you train a GNN model using GraphStorm’s node classification command with configuration managed through YAML files. The baseline configuration defines a two-layer RGCN model with 128-dimensional hidden layers, training for 4 epochs with a 0.001 learning rate and 1024 batch size; one epoch of model training and evaluation takes approximately 100 seconds on an ml.m5.4xlarge instance. To improve fraud detection accuracy, the notebook also provides more advanced model configurations, such as the command below.

!python -m graphstorm.run.gs_node_classification \
           --workspace ./ \
           --part-config ieee_gs/ieee-cis.json \
           --num-trainers 1 \
           --cf ieee_nc.yaml \
           --eval-metric roc_auc \
           --save-model-path ./model-simple/ \
           --topk-model-to-save 1 \
           --imbalance-class-weights 0.1,1.0

Arguments in this command address the dataset’s label imbalance challenge, where only 3.5% of transactions are fraudulent, by using AUC-ROC as the evaluation metric and applying class weights. The command also saves the best-performing model along with essential configuration files required for endpoint deployment. Advanced configurations can further enhance model performance through techniques like HGT encoders, multi-head attention, and a class-weighted cross-entropy loss function, though these optimizations increase computational requirements. GraphStorm enables these changes through runtime arguments and YAML configurations, reducing the need for code modifications.
Step 3: Real-time endpoint deployment
In the Notebook 3-GraphStorm-Endpoint-Deployment, you deploy the real-time endpoint through GraphStorm v0.5’s straightforward launch script. The deployment requires three model artifacts generated during training: the saved model file that contains the weights, the updated graph construction JSON file with feature transformation metadata, and the runtime-updated training configuration YAML file. These artifacts enable GraphStorm to recreate the exact training configuration and model for consistent inference behavior. Notably, the updated graph construction JSON and training configuration YAML files contain settings that are essential for restoring the trained model on the endpoint and processing incoming request payloads, so it is crucial to use the updated JSON and YAML files for endpoint deployment.
GraphStorm uses SageMaker AI bring your own container (BYOC) to provide a consistent inference environment. You need to build and push the GraphStorm real-time Docker image to Amazon ECR using the provided shell scripts. This containerized approach provides consistent runtime environments compatible with the SageMaker AI managed infrastructure. The Docker image contains the necessary dependencies for GraphStorm’s real-time inference capabilities in the deployment environment.
To deploy the endpoint, you can use the GraphStorm-provided launch_realtime_endpoint.py script that helps you gather required artifacts and creates the necessary SageMaker AI resources to deploy an endpoint. The script accepts the Amazon ECR image URI, IAM role, model artifact paths, and S3 bucket configuration, automatically handling endpoint provisioning and configuration. By default, the script waits for endpoint deployment to be complete before exiting. When completed, it prints the name and AWS Region of the deployed endpoint for subsequent inference requests. You will need to replace the fields enclosed by <> with the actual values of your environment.

!python ~/graphstorm/sagemaker/launch/launch_realtime_endpoint.py \
        --image-uri <account_id>.dkr.ecr.<aws_region>.amazonaws.com/graphstorm:sagemaker-endpoint-cpu \
        --role arn:aws:iam::<account_id>:role/<your_role> \
        --region <aws_region> \
        --restore-model-path <restore-model-path>/models/epoch-1/ \
        --model-yaml-config-file <restore-model-path>/models/GRAPHSTORM_RUNTIME_UPDATED_TRAINING_CONFIG.yaml \
        --graph-json-config-file <restore-model-path>/models/data_transform_new.json \
        --infer-task-type node_classification \
        --upload-tarfile-s3 s3://<cdk-created-bucket> \
        --model-name ieee-fraud-detect

Step 4: Real-time inference
In the Notebook 4-Sample-Graph-and-Invoke-Endpoint, you build a basic client application that integrates with the deployed GraphStorm endpoint to perform real-time fraud prevention on incoming transactions. The inference process accepts transaction data through standardized JSON payloads, executes node classification predictions in a few hundred milliseconds, and returns fraud probability scores that enable immediate decision-making.
An end-to-end inference call for a node that already exists in the graph has three distinct stages:

Graph sampling from the Neptune database. For a given target node that already exists in the graph, retrieve its k-hop neighborhood with a fanout limit, that is, limiting the number of neighbors retrieved at each hop by a threshold.
Payload preparation for inference. Neptune returns graphs using GraphSON, a specialized JSON-like data format used to describe graph data. At this step, you need to convert the returned GraphSON to GraphStorm’s own JSON specification. This step is performed on the inference client, in this case a SageMaker notebook instance.
Model inference using a SageMaker endpoint. After the payload is prepared, you send an inference request to a SageMaker endpoint that has loaded a previously trained model snapshot. The endpoint receives the request, performs any feature transformations needed (such as converting categorical features to one-hot encoding), creates the binary graph representation in memory, and makes a prediction for the target node using the graph neighborhood and trained model weights. The response is encoded to JSON and sent back to the client. A condensed invocation sketch follows this list.
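
The following is a condensed sketch of stage 3. It assumes the payload built in stage 2 already follows GraphStorm's JSON payload specification and that the endpoint name matches the one printed by the launch script; both are placeholders here.

# Sketch of invoking the deployed GraphStorm real-time endpoint with boto3.
# `payload` is assumed to already follow GraphStorm's request specification,
# produced by the GraphSON-to-JSON conversion in stage 2.
import json
import boto3

# Uses the default credentials and Region configured in the environment.
runtime = boto3.client("sagemaker-runtime")

def predict(endpoint_name: str, payload: dict) -> dict:
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

# Example (endpoint name is a placeholder):
# result = predict("<deployed-endpoint-name>", payload)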

An example response from the endpoint would look like:

{'status_code': 200,
 'request_uid': '877042dbc361fc33',
 'message': 'Request processed successfully.',
 'error': '',
 'data': {
    'results': [
            {
                'node_type': 'Transaction',
                'node_id': '2991260',
                'prediction': [0.995966911315918, 0.004033133387565613]
            }
        ]
    }
}

The data of interest for the single transaction you predicted are in the prediction key and the corresponding node_id. The prediction gives the raw scores the model produces for class 0 (legitimate) and class 1 (fraudulent) at indexes 0 and 1 of the prediction list, respectively. In this example, the model marks the transaction as most likely legitimate. You can find the full GraphStorm response specification in the GraphStorm documentation.
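For downstream decisioning, a client might extract the fraud score and apply a business threshold, as in this minimal sketch; the threshold value is illustrative and should be tuned against your own cost of false positives.

# Sketch: turn the endpoint response into an allow/review decision.
FRAUD_THRESHOLD = 0.5   # illustrative threshold, not a recommended value

def decide(response: dict) -> dict:
    result = response["data"]["results"][0]
    fraud_score = result["prediction"][1]       # index 1 = class "fraudulent"
    return {
        "transaction_id": result["node_id"],
        "fraud_score": fraud_score,
        "action": "review" if fraud_score >= FRAUD_THRESHOLD else "allow",
    }

# With the example response above, decide(...) returns
# {'transaction_id': '2991260', 'fraud_score': 0.004..., 'action': 'allow'}.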
Complete implementation examples, including client code and payload specifications, are provided in the repository to guide integration with production systems.
Clean up
To stop accruing costs on your account, you need to delete the AWS resources that you created with the AWS CDK at the Environment Setup step.
You must first delete the SageMaker endpoint created during Step 3 before cdk destroy can complete. See Delete Endpoints and Resources for more options for deleting an endpoint. When done, you can run the following from the repository’s root:

cd neptune-database-graphstorm-online-inference/neptune-db-cdk
cdk destroy

See the AWS CDK docs for more information about how to use cdk destroy, or see the CloudFormation docs for how to delete a stack from the console UI. By default, the cdk destroy command does not delete the model artifacts and processed graph data stored in the S3 bucket during the training and deployment process. You must remove them manually. See Deleting a general purpose bucket for information about how to empty and delete an S3 bucket the AWS CDK has created.
Conclusion
Graph neural networks address complex fraud prevention challenges by modeling relationships between entities that traditional machine learning approaches miss when analyzing transactions in isolation. GraphStorm v0.5 helps simplify deployment of GNN real-time inference with one command for endpoint creation that previously required coordination of multiple services and a standardized payload specification that helps simplify client integration with real-time inference services. Organizations can now deploy enterprise-scale fraud prevention endpoints through streamlined commands that reduce custom engineering from weeks to single-command operations.
To implement GNN-based fraud prevention with your own data:

Review the GraphStorm documentation for model configuration options and deployment specifications.
Adapt this IEEE-CIS example to your fraud prevention dataset by modifying the graph construction and feature engineering steps using the complete source code and tutorials available in our GitHub repository.
Access step-by-step implementation guidance to build production-ready fraud prevention solutions with GraphStorm v0.5’s enhanced capabilities using your enterprise data.

About the authors
Jian Zhang is a Senior Applied Scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning solutions, particularly graph neural networks, for customers in China, the US, and Singapore. As an advocate for AWS graph capabilities, Zhang has given many public presentations about GraphStorm, GNNs, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.
Theodore Vasiloudis is a Senior Applied Scientist at AWS, where he works on distributed machine learning systems and algorithms. He led the development of GraphStorm Processing, the distributed graph processing library for GraphStorm and is a core developer for GraphStorm. He received his PhD in Computer Science from KTH Royal Institute of Technology, Stockholm, in 2019.
Xiang Song is a Senior Applied Scientist at AWS AI Research and Education (AIRE), where he develops deep learning frameworks including GraphStorm, DGL, and DGL-KE. He led the development of Amazon Neptune ML, a Neptune capability that applies graph neural networks to graphs stored in the graph database. He now leads the development of GraphStorm, an open source graph machine learning framework for enterprise use cases. He received his PhD in computer systems and architecture from Fudan University, Shanghai, in 2014.
Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research, supporting science teams like the graph machine learning group and ML Systems teams working on large-scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.

Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State …

Anthropic released Claude Sonnet 4.5, setting a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash plus file edit; see the illustrative sketch below), averaged over 10 runs, with no extra test-time compute and a 200K-token "thinking" budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises the score to 82.0%.

Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.

Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks — a practical jump over earlier limits and directly relevant to agent reliability.

Reasoning/math. The release notes "substantial gains" across common reasoning and math evals, though exact per-benchmark numbers depend on the evaluation configuration (e.g., the AIME setup). Safety posture is ASL-3, with strengthened defenses against prompt injection.

https://www.anthropic.com/news/claude-sonnet-4-5
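For readers curious what a "two-tool scaffold" looks like in practice, here is an illustrative sketch using the Anthropic Python SDK and the model ID from the announcement. The bash and edit_file tool names and schemas are assumptions for illustration, not Anthropic's published SWE-bench harness.

import anthropic

client = anthropic.Anthropic()

# Hypothetical tool definitions; Anthropic's actual harness is not published here.
tools = [
    {
        "name": "bash",
        "description": "Run a shell command in the repository and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "edit_file",
        "description": "Replace a string in a file with new content.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old": {"type": "string"},
                "new": {"type": "string"},
            },
            "required": ["path", "old", "new"],
        },
    },
]

message = client.messages.create(
    model="claude-sonnet-4-5",  # model ID from the announcement
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_parser.py"}],
)
print(message.stop_reason)  # 'tool_use' when the model wants to call bash or edit_file

A real agent loop would execute the requested tool, return a tool_result block, and repeat until the model stops asking for tools.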

What’s there for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate “using a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where can you run it?

Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.

AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).

Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.

GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is developed for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the “best for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.

"Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math." — Claude (@claudeai), September 29, 2025


Meet oLLM: A Lightweight Python Library that brings 100K-Context LLM I …

oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and the KV cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10 GB while handling up to ~100K tokens of context.

But what's new?

The latest updates include: (1) KV-cache reads/writes that bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) Llama-3 FlashAttention-2 for stability; and (4) GPT-OSS memory reductions via "flash-attention-like" kernels and chunked MLP. The table published by the maintainer reports end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; noted throughput “≈ 1 tok/2 s”.

GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.

Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.
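To see why such long contexts spill out of an 8 GB card, here is a rough, back-of-the-envelope KV-cache estimate; the architecture numbers below are Llama-3.1-8B's published values, and the SSD footprints reported above also include weights and oLLM's own cache layout, so they won't match this naive estimate exactly.

# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens.
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Llama-3.1-8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 values.
gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=100_000) / 1e9
print(f"~{gb:.1f} GB of KV cache at 100K tokens")  # ~13 GB before counting the ~16 GB of fp16 weights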

How it works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. It uses FlashAttention-2 with online softmax so the full attention matrix is never materialized, and chunks large MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the oLLM project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
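As a conceptual illustration of the chunked-MLP idea (this is not oLLM's actual code, and real Llama-style MLPs add a gating projection), the sketch below processes the intermediate projection in slices so only one slice of the large intermediate activation is resident at a time:

import torch
import torch.nn.functional as F

def chunked_mlp(x, w_up, w_down, chunk=4096):
    # x: (tokens, hidden); w_up: (hidden, intermediate); w_down: (intermediate, hidden)
    out = torch.zeros(x.shape[0], w_down.shape[1], dtype=x.dtype, device=x.device)
    for start in range(0, w_up.shape[1], chunk):
        h = F.gelu(x @ w_up[:, start:start + chunk])  # partial intermediate activation
        out += h @ w_down[start:start + chunk, :]     # accumulate the partial output
    return out

# Equivalent to F.gelu(x @ w_up) @ w_down, but peak activation memory is bounded by `chunk`.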

Supported models and GPUs

Out of the box the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper; Qwen3-Next requires a dev build of Transformers (≥ 4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total, ~3B active) that vendors typically position for multi-A100/H100 deployments; oLLM’s claim is that you can execute it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. This stands in contrast to vLLM docs, which suggest multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows Inference(…).DiskCache(…) wiring and generate(…) with a streaming text callback. (PyPI currently lists 0.4.1; the README references 0.4.2 changes.)

Performance expectations and trade-offs

Throughput: The maintainer reports ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti—usable for batch/offline analytics, not for interactive chat. SSD latency dominates.

Storage pressure: Long contexts require very large KV caches; oLLM writes these to SSD to keep VRAM flat. This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach is still storage-bound and workload-specific.

Hardware reality check: Running Qwen3-Next-80B “on consumer hardware” is feasible with oLLM’s disk-centric design, but typical high-throughput inference for this model still expects multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a drop-in replacement for production serving stacks like vLLM/TGI.

Bottom line

oLLM pushes a clear design point: keep precision high, push memory to SSD, and make ultra-long contexts viable on a single 8 GB NVIDIA GPU. It won’t match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it’s a pragmatic way to execute 8B–20B models comfortably and even step up to MoE-80B if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.

