OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work

OpenAI has just introduced GPT-5.2, its most advanced frontier model for professional work and long running agents, and is rolling it out across ChatGPT and the API.

GPT-5.2 is a family of three variants. In ChatGPT, users see ChatGPT-5.2 Instant, Thinking and Pro. In the API, the corresponding models are gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. Instant targets everyday assistance and learning, Thinking targets complex multi step work and agents, and Pro allocates more compute for hard technical and analytical tasks.
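To make the API naming concrete, here is a minimal sketch of selecting a GPT-5.2 variant through the OpenAI Python SDK's Responses API. The prompt is illustrative, and any additional parameters (reasoning effort, tools) should be checked against the official API reference.

```python
# Minimal sketch: calling a GPT-5.2 variant via the OpenAI Responses API.
# The prompt is illustrative; check the official reference for extra parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",  # or "gpt-5.2-chat-latest" / "gpt-5.2-pro" for the other variants
    input="Summarize the key risks in this quarterly report in five bullet points.",
)
print(response.output_text)
```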

Benchmark profile, from GDPval to SWE Bench

GPT-5.2 Thinking is positioned as the main workhorse for real-world knowledge work. On GDPval, an evaluation of well-specified knowledge tasks across 44 occupations in 9 large industries, it beats or ties top industry professionals on 70.9 percent of comparisons, while producing outputs at more than 11 times the speed and under 1 percent of the estimated expert cost. For engineering teams this means the model can reliably generate artifacts such as presentations, spreadsheets, schedules, and diagrams given structured instructions.

On an internal benchmark of junior investment banking spreadsheet modeling tasks, average scores rise from 59.1 percent with GPT-5.1 to 68.4 percent with GPT-5.2 Thinking and 71.7 percent with GPT-5.2 Pro. These tasks include three statement models and leveraged buyout models with constraints on formatting and citations, which is representative of many structured enterprise workflows.

In software engineering, GPT-5.2 Thinking reaches 55.6 percent on SWE-Bench Pro and 80.0 percent on SWE-bench Verified. SWE-Bench Pro evaluates repository level patch generation over multiple languages, while SWE-bench Verified focuses on Python.

Long context and agentic workflows

Long context is a core design target. GPT-5.2 Thinking sets a new state of the art on OpenAI MRCRv2, a benchmark that inserts multiple identical "needle" queries into long dialogue "haystacks" and measures whether the model can reproduce the correct answer. It is the first model reported to reach near 100 percent accuracy on the 4-needle MRCR variant out to 256k tokens.

For workloads that exceed even that context, GPT-5.2 Thinking integrates with the Responses /compact endpoint, which performs context compaction to extend the effective window for tool-heavy, long-running jobs. This is relevant if you are building agents that iteratively call tools over many steps and need to maintain state beyond the raw token limit.
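The post does not spell out the request shape of the /compact endpoint, so the sketch below only illustrates the surrounding pattern: an agent loop that keeps appending turns and, past a self-imposed token budget, hands older context to a compaction step. The compact_context helper is a hypothetical stand-in for a call to the Responses /compact endpoint; consult the OpenAI documentation for the real parameters.

```python
# Illustrative long-running agent loop with context compaction.
# compact_context() is a hypothetical stand-in for the Responses /compact endpoint.
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 200_000  # rough self-imposed budget before compacting

def estimate_tokens(messages) -> int:
    return sum(len(m["content"]) for m in messages) // 4  # crude ~4 chars per token

def compact_context(messages):
    # In practice this would call the Responses /compact endpoint to summarize
    # older turns; here we just keep a placeholder summary plus the recent tail.
    summary = "Compacted summary of earlier tool calls and results."
    return [{"role": "developer", "content": summary}] + messages[-4:]

def agent_step(messages):
    if estimate_tokens(messages) > TOKEN_BUDGET:
        messages = compact_context(messages)
    response = client.responses.create(model="gpt-5.2", input=messages)
    messages.append({"role": "assistant", "content": response.output_text})
    return messages
```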

On tool usage, GPT-5.2 Thinking reaches 98.7 percent on Tau2-bench Telecom, a multi-turn customer support benchmark where the model must orchestrate tool calls across a realistic workflow. The official examples from OpenAI's release post show scenarios like a traveler with a delayed flight, missed connection, lost bag and medical seating requirement, where GPT-5.2 manages rebooking, special assistance seating and compensation in a consistent sequence while GPT-5.1 leaves steps unfinished.

Vision, science and math

Vision quality also moves up. GPT-5.2 Thinking roughly halves error rates on chart reasoning and user interface understanding benchmarks like CharXiv Reasoning and ScreenSpot Pro when a Python tool is enabled. The model also shows improved spatial understanding of images. For example, when labeling motherboard components with approximate bounding boxes, GPT-5.2 identifies more regions with tighter placement than GPT-5.1.

For scientific workloads, GPT-5.2 Pro scores 93.2 percent and GPT-5.2 Thinking 92.4 percent on GPQA Diamond, and GPT-5.2 Thinking solves 40.3 percent of FrontierMath Tier 1 to Tier 3 problems with Python tools enabled. These benchmarks cover graduate level physics, chemistry, biology and expert mathematics, and OpenAI highlights early use where GPT-5.2 Pro contributed to a proof in statistical learning theory under human verification.

Comparison Table

| Model | Primary positioning | Context window / max output | Knowledge cutoff | Notable benchmarks (Thinking / Pro vs GPT-5.1 Thinking) |
| --- | --- | --- | --- | --- |
| GPT-5.1 | Flagship model for coding and agentic tasks with configurable reasoning effort | 400,000 tokens context, 128,000 max output | 2024-09-30 | SWE-Bench Pro 50.8 percent, SWE-bench Verified 76.3 percent, ARC-AGI-1 72.8 percent, ARC-AGI-2 17.6 percent |
| GPT-5.2 (Thinking) | New flagship model for coding and agentic tasks across industries and for long running agents | 400,000 tokens context, 128,000 max output | 2025-08-31 | GDPval wins or ties 70.9 percent vs industry professionals, SWE-Bench Pro 55.6 percent, SWE-bench Verified 80.0 percent, ARC-AGI-1 86.2 percent, ARC-AGI-2 52.9 percent |
| GPT-5.2 Pro | Higher compute version of GPT-5.2 for the hardest reasoning and scientific workloads, produces smarter and more precise responses | 400,000 tokens context, 128,000 max output | 2025-08-31 | GPQA Diamond 93.2 percent vs 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, ARC-AGI-1 90.5 percent, ARC-AGI-2 54.2 percent |

Key Takeaways

GPT-5.2 Thinking is the new default workhorse model: It replaces GPT-5.1 Thinking as the main model for coding, knowledge work and agents, while keeping the same 400k context and 128k max output, but with clearly higher benchmark performance across GDPval, SWE-Bench, ARC-AGI and scientific QA.

Substantial accuracy jump over GPT-5.1 at similar scale: On key benchmarks, GPT-5.2 Thinking moves from 50.8 percent to 55.6 percent on SWE-Bench Pro, from 76.3 percent to 80.0 percent on SWE-bench Verified, from 72.8 percent to 86.2 percent on ARC-AGI-1, and from 17.6 percent to 52.9 percent on ARC-AGI-2, while keeping token limits comparable.

GPT-5.2 Pro is targeted at high end reasoning and science: GPT-5.2 Pro is a higher compute variant that mainly improves hard reasoning and scientific tasks, for example reaching 93.2 percent on GPQA Diamond versus 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, and higher scores on ARC-AGI tiers.


CopilotKit v1.50 Brings AG-UI Agents Directly Into Your App With the New useAgent Hook

Agent frameworks are now good at reasoning and tools, but most teams still write custom code to turn agent graphs into robust user interfaces with shared state, streaming output and interrupts. CopilotKit targets this last mile. It is an open-source framework for building AI copilots and in-app agents directly in your app, with real-time context and UI control. (Check out the CopilotKit GitHub)

The CopilotKit v1.50 release rebuilds the project natively on the Agent User Interaction Protocol (AG-UI). The key idea is simple: let AG-UI define all traffic between agents and UIs as a typed event stream, exposed to any app through a single hook, useAgent.

useAgent, one React hook per AG-UI agent

AG-UI defines how an agent backend and a frontend exchange a single ordered sequence of JSON encoded events. These events include messages, tool calls, state updates and lifecycle signals, and they can stream over any transport, such as HTTP, WebSockets, or even WebRTC.
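In practice, a consumer of such a stream reads the JSON events in order and dispatches on their type. The following is a minimal Python illustration of that idea over Server-Sent Events; the event type names are illustrative rather than taken verbatim from the AG-UI specification, and a real client would use the official @ag-ui/core or ag-ui-protocol SDKs instead.

```python
# Minimal sketch of consuming an AG-UI-style event stream over SSE.
# Event type names are illustrative; real clients should use the AG-UI SDKs.
import json
import requests

def consume_agent_stream(url: str) -> None:
    state: dict = {}
    with requests.get(url, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            event = json.loads(line[len(b"data:"):])
            kind = event.get("type")
            if kind == "text_message_content":
                print(event["delta"], end="", flush=True)  # streaming assistant text
            elif kind == "tool_call_start":
                print(f"\n[tool call] {event['tool_name']}")
            elif kind == "state_snapshot":
                state = event["snapshot"]                   # replace shared state wholesale
            elif kind == "state_delta":
                state.update(event["delta"])                # apply an incremental update
            elif kind == "run_finished":
                break
```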

CopilotKit v1.50 uses this protocol as the native transport layer. Instead of separate adapters for each framework, everything now communicates via AG-UI directly. This is all made accessible by the new useAgent – a React hook that provides programmatic control of any AG-UI agent. It subscribes to the event stream, keeps a local model of messages and shared state, and exposes a small API for sending user input and UI intents.

At a high level, a React component does three things:

Call useAgent with connection details for the backend agent.

Read current state, such as message list, streaming deltas and agent status flags.

Call methods from the hook to send user messages, trigger tools or update shared state.

Because the hook only depends on AG-UI, the same UI code can work with different agent frameworks, as long as they expose an AG-UI endpoint.

Context messaging and shared state

AG-UI assumes that agentic apps are stateful. The protocol standardizes how context moves between UI and agent. 

On the frontend, CopilotKit already lets developers register app data as context, for example with hooks that make parts of React state readable to the agent. In the AG-UI model this becomes explicit. State snapshots and state patch events keep the backend and the UI in sync. The agent sees a consistent view of the application, and the UI can render the same state without custom synchronization logic.

For an early-career engineer this removes a common pattern: you no longer push props into prompts manually on every call. You update state as usual, the AG-UI client encodes those updates as events, and the backend agent consumes the same state through its AG-UI library.

AG-UI, protocol layer between agents and users

AG-UI is defined as an open, lightweight protocol that standardizes how agents connect to user facing applications. It focuses on event semantics rather than transport. Core SDKs provide strongly typed event models and clients in TypeScript, Python and other languages.

The JavaScript package @ag-ui/core implements the streaming event based architecture on the client side. It exposes message and state models, run input types and event utilities, and currently records about 178,751 weekly downloads on npm for version 0.0.41. On the Python side, the ag-ui-protocol package provides the canonical event models, with around 619,035 downloads in the last week and about 2,172,180 in the last month.

CopilotKit v1.50 builds directly on these components. Frontend code uses CopilotKit React primitives, but under the hood the connection to the backend is an AG-UI client that sends and receives standard events.

First party integrations across the 3 hyperscalers

The AG-UI overview lists Microsoft Agent Framework, Google Agent Development Kit (ADK), and AWS Strands Agents as supported frameworks, each with dedicated documentation and demos. These are first party integrations maintained by the protocol and framework owners.

Microsoft published a tutorial that shows how to build both server and client applications using AG-UI with Agent Framework in .NET or Python. Google documents AG-UI under the Agentic UI section of the ADK docs, and CopilotKit provides a full guide on building a stack that combines ADK, AG-UI and CopilotKit. AWS Strands exposes AG-UI integration through official tutorials and a CopilotKit quickstart, which wires a Strands agent backend to a React client in one scaffolded project.

For a React team this means that useAgent can attach to agents defined in any of these frameworks, as long as the backend exposes an AG-UI endpoint. The frontend code stays the same, while the agent logic and hosting environment can change.

Ecosystem growth around CopilotKit and AG-UI

CopilotKit presents itself as the agentic framework for in-app copilots, with more than 20,000 GitHub stars, and it is trusted by over 100,000 developers.

AG-UI itself has moved from a protocol proposal to a shared layer across multiple frameworks. Its partnerships and integrations include LangGraph, CrewAI, Mastra, Pydantic AI, Agno, LlamaIndex and others, plus SDKs in Kotlin, Go, Java, Rust and more. This cross-framework adoption is what makes a generic hook like useAgent viable, because it can rely on a consistent event model.

Key Takeaways

CopilotKit v1.50 standardizes its frontend layer on AG-UI, so all agent to UI communication is a single event stream instead of custom links per backend.

The new useAgent React hook lets a component connect to any AG-UI compatible agent, and exposes messages, streaming tokens, tools and shared state through a typed interface.

AG-UI formalizes context messaging and shared state as replicated stores with event sourced deltas, so both agent and UI share a consistent application view without manual prompt wiring.

AG-UI has first party integrations with Microsoft Agent Framework, Google Agent Development Kit and AWS Strands Agents, which means the same CopilotKit UI code can target agents across all 3 major clouds.

CopilotKit and AG-UI show strong ecosystem traction, with high GitHub adoption and significant weekly downloads for @ag-ui/core on npm and ag-ui-protocol on PyPI, which signals that the protocol is becoming a common layer for agentic applications.


The Machine Learning Divide: Marktechpost's Latest ML Global Impact Report Reveals Geographic Asymmetry Between ML Tool Origins and Research Adoption

Los Angeles, December 11, 2025 — Marktechpost has released the ML Global Impact Report 2025 (AIResearchTrends.com). This educational report's analysis includes over 5,000 articles from more than 125 countries, all published within the Nature family of journals between January 1 and September 30, 2025. The scope of this report is strictly confined to this specific body of work and is not a comprehensive assessment of global research.

The ML Global Impact Report 2025 focuses on three core questions:

In which disciplines has ML become part of the standard methodological toolkit, and where is adoption still sparse?

Which kinds of problems are most likely to rely on ML, such as high-dimensional imaging, sequence data, or complex physical simulations?

How do ML usage patterns differ by geography and research ecosystem, based on the global footprint of these selected 5,000 papers?

ML has most frequently become part of the standard methodological toolkit within the disciplines of applied sciences and health research, where it is often employed as a critical step within a larger experimental workflow rather than being the main subject of research itself. The analysis of the papers indicates that ML’s adoption is concentrated in these domains, with the tools serving to augment existing research pipelines. The report aims to distinguish these areas of common use from other fields where the integration of machine learning remains less frequent.

The kinds of problems most likely to rely on machine learning are those involving complex data analysis tasks, such as high-dimensional imaging, sequence data analysis, and intricate physical simulations. The report tracks the specific task types, including prediction, classification, segmentation, sequence modeling, feature extraction, and simulation, to understand where ML is being applied. This categorization highlights the utility of machine learning across different stages of the research process, from initial data processing to final output generation.

ML usage patterns show a distinct geographical separation between the origins of the tools and the heavy users of the technology. The majority of machine learning tools cited in the corpus originate from organizations based in the United States, which maintains many widely used frameworks and libraries. In contrast, China is identified as the largest contributor to the research papers, accounting for about 40% of all ML-tagged papers, significantly more than the United States' contribution of around 18%. The report also highlights the global ecosystem by citing frequently used non-US tools, such as Scikit-learn (France), U-Net (Germany), and CatBoost (Russia), along with tools originating from Canada, including the GAN and RNN families.

Overall, the ML Global Impact Report 2025 provides deep insights into the global research ecosystem, highlighting that machine learning has become a standard methodological tool primarily within applied sciences and health research. The analysis reveals a concentration of ML usage on complex data challenges, such as high-dimensional imaging and physical simulations. A core finding is the clear geographical split between the origin of ML tools—many maintained by US organizations—and the heaviest users of the technology, with China accounting for a significantly higher number of ML-tagged research papers in the analyzed corpus. These patterns are specific to the 5,000+ Nature family articles analyzed, underscoring the report's focused view on current research workflows.

How Harmonic Security improved their data-leakage detection system wit …

This post was written with Bryan Woolgar-O’Neil, Jamie Cockrill and Adrian Cunliffe from Harmonic Security
Organizations face increasing challenges protecting sensitive data while supporting third-party generative AI tools. Harmonic Security, a cybersecurity company, developed an AI governance and control layer that spots sensitive data in line as employees use AI, giving security teams the power to keep PII, source code, and payroll information safe while the business accelerates.
The following screenshot demonstrates Harmonic Security’s software tool, highlighting the different data leakage detection types, including Employee PII, Employee Financial Information, and Source Code.

Harmonic Security’s solution is also now available on AWS Marketplace, enabling organizations to deploy enterprise-grade data leakage protection with seamless AWS integration. The platform provides prompt-level visibility into GenAI usage, real-time coaching at the point of risk, and detection of high-risk AI applications—all powered by the optimized models described in this post.
The initial version of their system was effective, but with a detection latency of 1–2 seconds, there was an opportunity to further enhance its capabilities and improve the overall user experience. To achieve this, Harmonic Security partnered with the AWS Generative AI Innovation Center to optimize their system with four key objectives:

Reduce detection latency to under 500 milliseconds at the 95th percentile
Maintain detection accuracy across monitored data types
Continue to support EU data residency compliance
Enable scalable architecture for production loads

This post walks through how Harmonic Security used Amazon SageMaker AI, Amazon Bedrock, and Amazon Nova Pro to fine-tune a ModernBERT model, achieving low-latency, accurate, and scalable data leakage detection.
Solution overview
Harmonic Security's initial data leakage detection system relied on an 8 billion (8B) parameter model, which effectively identified sensitive data but incurred 1–2 seconds of latency, close to the threshold of impacting user experience. To achieve sub-500 millisecond latency while maintaining accuracy, we developed two classification approaches using a fine-tuned ModernBERT model.
First, a binary classification model was prioritized to detect Mergers & Acquisitions (M&A) content, a critical category for helping prevent sensitive data leaks. We initially focused on binary classification because it was the simplest approach that would seamlessly integrate within their current system, which invokes multiple binary classification models in parallel. Second, as an extension, we explored a multi-label classification model to detect multiple sensitive data types (such as billing information, financial projections, and employment records) in a single pass, aiming to reduce the computational overhead of running multiple parallel binary classifiers for greater efficiency. Although the multi-label approach showed promise for future scalability, Harmonic Security decided to stick with the binary classification model for the initial version. The solution uses the following key services:

Amazon SageMaker AI – For fine-tuning and deploying the model
Amazon Bedrock – For accessing industry-leading large language models (LLMs)
Amazon Nova Pro – A highly capable multimodal model that balances accuracy, speed, and cost

The following diagram illustrates the solution architecture for low-latency inference and scalability.

The architecture consists of the following components:

Model artifacts are stored in Amazon Simple Storage Service (Amazon S3)
A custom container with inference code is hosted in Amazon Elastic Container Registry (Amazon ECR)
A SageMaker endpoint uses ml.g5.4xlarge instances for GPU-accelerated inference
Amazon CloudWatch monitors invocations, triggering auto scaling to adjust instances (1–5) based on an 830 requests per minute (RPM) threshold.

The solution supports the following features:

Sub-500 milliseconds inference latency
EU AWS Region deployment support
Automatic scaling between 1–5 instances based on demand
Cost optimization during low-usage periods
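As a rough sketch of how an application would call such an endpoint, the snippet below invokes a SageMaker real-time endpoint with boto3 and reads back the classification result. The endpoint name and the JSON request and response shapes are assumptions for illustration; the post does not describe the actual inference contract.

```python
# Sketch: invoking the real-time SageMaker endpoint for data leakage detection.
# Endpoint name, payload and response schema are illustrative assumptions.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-west-1")  # EU Region deployment

def classify(text: str) -> dict:
    response = runtime.invoke_endpoint(
        EndpointName="modernbert-leakage-detector",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())       # for example {"label": "MNA", "score": 0.97}

print(classify("Draft term sheet for the proposed acquisition of Acme Corp ..."))
```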

Synthetic data generation
High-quality training data for sensitive information (such as M&A documents and financial data) is scarce. We used Meta Llama 3.3 70B Instruct and Amazon Nova Pro to generate synthetic data, expanding upon Harmonic’s existing dataset that included examples of data in the following categories: M&A, billing information, financial projection, employment records, sales pipeline, and investment portfolio. The following diagram provides a high-level overview of the synthetic data generation process.

Data generation framework
The synthetic data generation framework comprises a series of steps:

Smart example selection – K-means clustering on sentence embeddings supports diverse example selection
Adaptive prompts – Prompts incorporate domain knowledge, with temperature (0.7–0.85) and top-p sampling adjusted per category
Near-miss augmentation – Negative examples resembling positive cases to improve precision
Validation – An LLM-as-a-judge approach using Amazon Nova Pro and Meta Llama 3 validates examples for relevance and quality
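The snippet below sketches the generation step of this framework using the Amazon Bedrock Converse API to prompt a generator model for positive, negative, or near-miss examples. The model ID, prompt wording, and sampling settings are illustrative assumptions, not Harmonic Security's production prompts.

```python
# Sketch: generating synthetic M&A examples with the Amazon Bedrock Converse API.
# Model ID, prompt wording, and sampling settings are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def generate_example(kind: str, seed_examples: list, temperature: float = 0.8) -> str:
    prompt = (
        f"Write one realistic business document excerpt that is a {kind} example "
        "for an M&A sensitive-data classifier.\n"
        "Reference examples:\n" + "\n---\n".join(seed_examples)
    )
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed model ID for Amazon Nova Pro
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": temperature, "topP": 0.9},
    )
    return response["output"]["message"]["content"][0]["text"]

positive = generate_example("positive", ["Deal terms and synergy estimates for the merger ..."])
near_miss = generate_example("near-miss (strategic partnership, not M&A)", [positive])
```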

Binary classification
For the binary M&A classification task, we generated three distinct types of examples:

Positive examples – These contained explicit M&A information while maintaining realistic document structures and finance-specific language patterns. They included key indicators like “merger,” “acquisition,” “deal terms,” and “synergy estimates.”
Negative examples – We created domain-relevant content that deliberately avoided M&A characteristics while remaining contextually appropriate for business communications.
Near-miss examples – These resembled positive examples but fell just outside the classification boundary. For instance, documents discussing strategic partnerships or joint ventures that didn’t constitute actual M&A activity.

The generation process maintained careful proportions between these example types, with particular emphasis on near-miss examples to address precision requirements.
Multi-label classification
For the more complex multi-label classification task across four sensitive information categories, we developed a sophisticated generation strategy:

Single-label examples – We generated examples containing information relevant to exactly one category to establish clear category-specific features
Multi-label examples – We created examples spanning multiple categories with controlled distributions, covering various combinations (2–4 labels)
Category-specific requirements – For each category, we defined mandatory elements to maintain explicit rather than implied associations:

Financial projections – Forward-looking revenue and growth data
Investment portfolio – Details about holdings and performance metrics
Billing and payment information – Invoices and supplier accounts
Sales pipeline – Opportunities and projected revenue

Our multi-label generation prioritized realistic co-occurrence patterns between categories while maintaining sufficient representation of individual categories and their combinations. As a result, synthetic data increased training examples by 10 times for the binary task and 15 times for the multi-label task. It also improved class balance, because we generated the data with a more balanced label distribution.
Model fine-tuning
We fine-tuned ModernBERT models on SageMaker to achieve low latency and high accuracy. Compared with decoder-only models such as Meta Llama 3.2 3B and Google Gemma 2 2B, ModernBERT's compact size (149M and 395M parameters) translated into lower latency while still delivering higher accuracy. We therefore selected ModernBERT over fine-tuning those alternatives. In addition, ModernBERT is one of the few BERT-based models that supports context lengths of up to 8,192 tokens, which was a key requirement for our project.
Binary classification model
Our first fine-tuned model used ModernBERT-base, and we focused on binary classification of M&A content. We approached this task methodically:

Data preparation – We enriched our M&A dataset with the synthetically generated data
Framework selection – We used the Hugging Face transformers library with the Trainer API in a PyTorch environment, running on SageMaker
Training process – Our process included:

Stratified sampling to maintain label distribution across training and evaluation sets
Specialized tokenization with sequence lengths up to 3,000 tokens to match what the client had in production
Binary cross-entropy loss optimization
Early stopping based on F1 score to prevent overfitting.

The result was a fine-tuned model that could distinguish M&A content from non-sensitive information with a higher F1 score than the 8B parameter model.
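A condensed sketch of that setup with the Hugging Face Trainer API is shown below. The checkpoint is the public answerdotai/ModernBERT-base model, the two-example dataset is a stand-in for the enriched M&A data, and the hyperparameters are placeholders rather than the tuned values.

```python
# Sketch: binary M&A classification fine-tuning of ModernBERT with the Trainer API.
# The toy dataset and hyperparameters are placeholders.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy stand-in for the enriched M&A dataset (label 1 = M&A content).
data = Dataset.from_dict({
    "text": ["Deal terms and synergy estimates for the proposed merger ...",
             "Quarterly marketing newsletter for our retail customers ..."],
    "label": [1, 0],
})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=3000), batched=True)
splits = data.train_test_split(test_size=0.5, seed=42)

def compute_f1(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1))}

args = TrainingArguments(
    output_dir="mna-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_f1,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```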
Multi-label classification model
For our second model, we tackled the more complex challenge of multi-label classification (detecting multiple sensitive data types simultaneously within single text passages). We fine-tuned a ModernBERT-large model to identify various sensitive data types like billing information, employment records, and financial projections in a single pass. This required:

Multi-hot label encoding – We converted our categories into vector format for simultaneous prediction.
Focal loss implementation – Instead of standard cross-entropy loss, we implemented a custom FocalLossTrainer class. Unlike static weighted loss functions, Focal Loss adaptively down-weights straightforward examples during training. This helps the model concentrate on challenging cases, significantly improving performance for less frequent or harder-to-detect classes.
Specialized configuration – We added configurable class thresholds (for example, 0.1 to 0.8) for each class probability to determine label assignment as we observed varying performance in different decision boundaries.

This approach enabled our system to identify multiple sensitive data types in a single inference pass.
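A rough sketch of such a focal loss for multi-label classification, implemented as a custom Trainer subclass, is shown below. The gamma value and the absence of per-class weighting are illustrative choices, not the production FocalLossTrainer.

```python
# Sketch: multi-label focal loss as a custom Hugging Face Trainer subclass.
# Gamma and weighting choices are illustrative, not the production FocalLossTrainer.
import torch
import torch.nn.functional as F
from transformers import Trainer

class FocalLossTrainer(Trainer):
    def __init__(self, *args, gamma: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()              # multi-hot vector per example
        outputs = model(**inputs)
        bce = F.binary_cross_entropy_with_logits(outputs.logits, labels, reduction="none")
        p_t = torch.exp(-bce)                              # probability assigned to the true label
        loss = (((1.0 - p_t) ** self.gamma) * bce).mean()  # down-weight easy examples
        return (loss, outputs) if return_outputs else loss
```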
Hyperparameter optimization
To find the optimal configuration for our models, we used Optuna to optimize key parameters. Optuna is an open-source hyperparameter optimization (HPO) framework that helps find the best hyperparameters for a given machine learning (ML) model by running many experiments (called trials). It uses a Bayesian algorithm called Tree-structured Parzen Estimator (TPE) to choose promising hyperparameter combinations based on past results.
The search space explored numerous combinations of key hyperparameters, as listed in the following table.

| Hyperparameter | Range |
| --- | --- |
| Learning rate | 5e-6 – 5e-5 |
| Weight decay | 0.01–0.5 |
| Warmup ratio | 0.0–0.2 |
| Dropout rates | 0.1–0.5 |
| Batch size | 16, 24, 32 |
| Gradient accumulation steps | 1, 4 |
| Focal loss gamma (multi-label only) | 1.0–3.0 |
| Class threshold (multi-label only) | 0.1–0.8 |

To optimize computational resources, we implemented pruning logic to stop under-performing trials early, so we could discard configurations that were less optimal. As seen in the following Optuna HPO history plot, trial 42 had the most optimal parameters with the highest F1 score for the binary classification, whereas trial 32 was the most optimal for the multi-label.
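The sketch below shows the shape of such a study: a TPE sampler, a pruner that stops weak trials, and a search space mirroring the table above. The train_and_evaluate function is a placeholder for the actual fine-tuning run that reports a per-epoch F1 score.

```python
# Sketch: Optuna study with TPE sampling and pruning over the search space above.
# train_and_evaluate() is a placeholder for the real fine-tuning loop.
import optuna

def train_and_evaluate(params):
    # Placeholder: yield a per-epoch F1 score; replace with the real training run.
    for epoch in range(3):
        yield 0.5 + 0.1 * epoch * params["learning_rate"] / 5e-5

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 5e-6, 5e-5, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.5),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 24, 32]),
        "grad_accum": trial.suggest_categorical("grad_accum", [1, 4]),
    }
    f1 = 0.0
    for epoch, f1 in enumerate(train_and_evaluate(params)):
        trial.report(f1, step=epoch)
        if trial.should_prune():          # stop under-performing trials early
            raise optuna.TrialPruned()
    return f1

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=1),
)
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```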

Moreover, our analysis showed that dropout and learning rate were the most important hyperparameters, accounting for 48% and 21% of the variance of the F1 score for the binary classification model. This explains why we noticed the model overfitting quickly during previous runs and stresses the importance of regularization.

After the optimization experiments, we discovered the following:

We were able to identify the optimal hyperparameters for each task
The models converged faster during training
The final performance metrics showed measurable improvements over configurations we tested manually

This allowed our models to achieve a high F1 score efficiently by running hyperparameter tuning in an automated fashion, which is crucial for production deployment.
Load testing and autoscaling policy
After fine-tuning and deploying the optimized model to a SageMaker real-time endpoint, we performed load testing to validate the performance and autoscaling under pressure to meet Harmonic Security’s latency, throughput, and elasticity needs. The objectives of the load testing were:

Validate the latency SLA, with an average of less than 500 milliseconds and P95 of approximately 1 second under varying loads
Determine throughput capacity with maximum RPM using ml.g5.4xlarge instances within latency SLA
Inform the auto scaling policy design

The methodology involved the following:

Traffic simulation – Locust simulated concurrent user traffic with varying text lengths (50–9,999 characters)
Load pattern – We ran stepped ramp-up tests (60–2,000 RPM, 60 seconds each), identified bottlenecks, and stress-tested limits
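A minimal Locust user class along those lines might look like the following; the target path and payload shape are assumptions for illustration, and the file would be run with the locust CLI against the endpoint's host.

```python
# Sketch: Locust load test simulating concurrent classification traffic
# with varying text lengths. Target path and payload are illustrative.
import random
import string
from locust import HttpUser, task, between

def random_text(min_len: int = 50, max_len: int = 9999) -> str:
    length = random.randint(min_len, max_len)
    return "".join(random.choices(string.ascii_letters + " ", k=length))

class ClassificationUser(HttpUser):
    wait_time = between(0.1, 0.5)  # short think time to reach high request rates

    @task
    def classify(self):
        self.client.post("/invocations", json={"inputs": random_text()})
```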

As shown in the following graph, we found that the maximum throughput under a latency of 1 second was 1,185 RPM, so we decided to set the auto scaling threshold to 70% of that at 830 RPM.

Based on the performance observed during load testing, we configured a target-tracking auto scaling policy for the SageMaker endpoint using Application Auto Scaling. The following figure illustrates this policy workflow.

The key parameters defined were:

Metric – SageMakerVariantInvocationsPerInstance (830 invocations/instance/minute)
Min/Max Instances – 1–5
Cooldown – Scale-out 300 seconds, scale-in 600 seconds
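A sketch of registering that policy with boto3 and Application Auto Scaling is shown below; the endpoint and variant names are placeholders.

```python
# Sketch: target-tracking auto scaling for the SageMaker endpoint variant.
# Endpoint and variant names are placeholders.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/modernbert-leakage-detector/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 830.0,  # about 70% of the observed 1,185 RPM ceiling
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 600,
    },
)
```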

This target-tracking policy adjusts instances based on traffic, maintaining performance and cost-efficiency. The following table summarizes our findings.

| Model | Requests per Minute |
| --- | --- |
| 8B model | 800 |
| ModernBERT with auto scaling (5 instances) | 1,185–5,925 |
| Additional capacity (ModernBERT vs. 8B model) | 48%–640% |

Results
This section showcases the significant impact of the fine-tuning and optimization efforts on Harmonic Security’s data leakage detection system, with a primary focus on achieving substantial latency reductions. Absolute latency improvements are detailed first, underscoring the success in meeting the sub-500 millisecond target, followed by an overview of performance enhancements. The following subsections provide detailed results for binary M&A classification and multi-label classification across multiple sensitive data types.
Binary classification
We evaluated the fine-tuned ModernBERT-base model for binary M&A classification against the baseline 8B model, introduced in the solution overview. The most striking achievement was a transformative reduction in latency, addressing the initial 1–2 second delay that risked disrupting user experience. This leap to sub-500 millisecond latency is detailed in the following table, marking a pivotal enhancement in system responsiveness.

| Model | median_ms | p95_ms | p99_ms | p100_ms |
| --- | --- | --- | --- | --- |
| ModernBERT-base-v2 | 46.03 | 81.19 | 102.37 | 183.11 |
| 8B model | 189.15 | 259.99 | 286.63 | 346.36 |
| Difference | -75.66% | -68.77% | -64.28% | -47.13% |

Building on this latency breakthrough, the following performance metrics reflect percentage improvements in accuracy and F1 score.

| Model | Accuracy Improvement | F1 Improvement |
| --- | --- | --- |
| ModernBERT-base-v2 | +1.56% | +2.26% |
| 8B model | baseline | baseline |

These results highlight that ModernBERT-base-v2 delivers a groundbreaking latency reduction, complemented by modest accuracy and F1 improvements of 1.56% and 2.26%, respectively, aligning with Harmonic Security’s objectives to enhance data leakage detection without impacting user experience.
Multi-label classification
We evaluated the fine-tuned ModernBERT-large model for multi-label classification against the baseline 8B model, with latency reduction as the cornerstone of this approach. The most significant advancement was a substantial decrease in latency across all evaluated categories, achieving sub-500 millisecond responsiveness and addressing the previous 1–2 second bottleneck. The latency results shown in the following table underscore this critical improvement.

| Dataset | Model | median_ms | p95_ms | p99_ms |
| --- | --- | --- | --- | --- |
| Billing and payment | 8B model | 198 | 238 | 321 |
| | ModernBERT-large | 158 | 199 | 246 |
| | Difference | -20.13% | -16.62% | -23.60% |
| Sales pipeline | 8B model | 194 | 265 | 341 |
| | ModernBERT-large | 162 | 243 | 293 |
| | Difference | -16.63% | -8.31% | -13.97% |
| Financial projections | 8B model | 384 | 510 | 556 |
| | ModernBERT-large | 160 | 275 | 310 |
| | Difference | -58.24% | -46.04% | -44.19% |
| Investment portfolio | 8B model | 397 | 498 | 703 |
| | ModernBERT-large | 160 | 259 | 292 |
| | Difference | -59.69% | -47.86% | -58.46% |

This approach also delivered a second key benefit: a reduction in computational parallelism by consolidating multiple classifications into a single pass. However, the multi-label model encountered challenges in maintaining consistent accuracy across all classes. Although categories like Financial Projections and Investment Portfolio showed promising accuracy gains, others such as Billing and Payment and Sales Pipeline experienced significant accuracy declines. This indicates that, despite its latency and parallelism advantages, the approach requires further development to maintain reliable accuracy across data types.
Conclusion
In this post, we explored how Harmonic Security collaborated with the AWS Generative AI Innovation Center to optimize their data leakage detection system, achieving transformative results.
Key performance improvements:

Latency reduction: From 1–2 seconds to under 500 milliseconds (76% reduction at median)
Throughput increase: 48%–640% additional capacity with auto scaling
Accuracy gains: +1.56% for binary classification, with maintained precision across categories

By using SageMaker, Amazon Bedrock, and Amazon Nova Pro, Harmonic Security fine-tuned ModernBERT models that deliver sub-500 millisecond inference in production, meeting stringent performance goals while supporting EU compliance and establishing a scalable architecture.
This partnership showcases how tailored AI solutions can tackle critical cybersecurity challenges without hindering productivity. Harmonic Security’s solution is now available on AWS Marketplace, enabling organizations to adopt AI tools safely while protecting sensitive data in real time. Looking ahead, these high-speed models have the potential to add further controls for additional AI workflows.
To learn more, consider the following next steps:

Try Harmonic Security – Deploy the solution directly from AWS Marketplace to protect your organization’s GenAI usage
Explore AWS services – Dive into SageMaker, Amazon Bedrock, and Amazon Nova Pro to build advanced AI-driven security solutions. Visit the AWS Generative AI page for resources and tutorials.
Deep dive into fine-tuning – Explore the AWS Machine Learning Blog for in-depth guides on fine-tuning LLMs for specialized use cases.
Stay updated – Subscribe to the AWS Podcast for weekly insights on AI innovations and practical applications.
Connect with experts – Join the AWS Partner Network to collaborate with experts and scale your AI initiatives.
Attend AWS events – Register for AWS re:Invent to explore cutting-edge AI advancements and network with industry leaders.

By adopting these steps, organizations can harness AI-driven cybersecurity to maintain robust data protection and seamless user experiences across diverse workflows.

About the authors
Babs Khalidson is a Deep Learning Architect at the AWS Generative AI Innovation Centre in London, where he specializes in fine-tuning large language models, building AI agents, and model deployment solutions. He has over 6 years of experience in artificial intelligence and machine learning across finance and cloud computing, with expertise spanning from research to production deployment.
Vushesh Babu Adhikari is a Data Scientist at the AWS Generative AI Innovation Center in London with extensive expertise in developing generative AI solutions across diverse industries. He has over 7 years of experience spanning a diverse set of industries, including finance, telecom, and information technology, with specialized expertise in machine learning and artificial intelligence.
Zainab Afolabi is a Senior Data Scientist at the AWS Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over nine years of specialized experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Nuno Castro is a Sr. Applied Science Manager at the AWS Generative AI Innovation Center. He leads generative AI customer engagements, helping AWS customers find the most impactful use case from ideation and prototype through to production. He has 19 years of experience in the field in industries such as finance, manufacturing, and travel, leading ML teams for 11 years.
Christelle Xu is a Senior Generative AI Strategist who leads model customization and optimization strategy across EMEA within the AWS Generative AI Innovation Center, working with customers to deliver scalable Generative AI solutions, focusing on continued pre-training, fine-tuning, reinforcement learning, and training and inference optimization. She holds a Master’s degree in Statistics from the University of Geneva and a Bachelor’s degree from Brigham Young University.
Manuel Gomez is a Solutions Architect at AWS supporting generative AI startups across the UK and Ireland. He works with model producers, fine-tuning platforms, and agentic AI applications to design secure and scalable architectures. Before AWS, he worked in startups and consulting, and he has a background in industrial technologies and IoT. He is particularly interested in how multi-modal AI can be applied to real industry problems.
Bryan Woolgar-O’Neil is the co-founder & CTO at Harmonic Security. With over 20 years of software development experience, the last 10 were dedicated to building the Threat Intelligence company Digital Shadows, which was acquired by Reliaquest in 2022. His expertise lies in developing products based on cutting-edge software, focusing on making sense of large volumes of data.
Jamie Cockrill is the Director of Machine Learning at Harmonic Security, where he leads a team focused on building, training, and refining Harmonic’s Small Language Models.
Adrian Cunliffe is a Senior Machine Learning Engineer at Harmonic Security, where he focuses on scaling Harmonic’s Machine Learning engine that powers Harmonic’s proprietary models.

How Swisscom builds enterprise agentic AI for customer support and sal …

This post was written with Arun Sittampalam and Maxime Darcot from Swisscom.
As we navigate the constantly shifting AI ecosystem, enterprises face challenges in translating AI's potential into scalable, production-ready solutions. Swisscom, Switzerland's leading telecommunications provider with an estimated $19B in revenue (2025) and over $37B in market capitalization as of June 2025, exemplifies how organizations can successfully navigate this complexity while maintaining their commitment to sustainability and excellence.
Recognized as the Most Sustainable Company in the Telecom industry for 3 consecutive years by World Finance magazine, Swisscom has established itself as an innovation leader committed to achieving net-zero greenhouse gas emissions by 2035 in alignment with the Paris Climate Agreement. This sustainability-first approach extends to their AI strategy, where they're breaking through what they call the "automation ceiling" – where traditional automation approaches fail to meet modern business demands.
In this post, we'll show how Swisscom implemented Amazon Bedrock AgentCore to build and scale their enterprise AI agents for customer support and sales operations. As an early adopter of Amazon Bedrock in the AWS Europe Region (Zurich), Swisscom leads in enterprise AI implementation with their Chatbot Builder system and various AI initiatives. Their successful deployments include conversational AI powered by Rasa and fine-tuned LLMs on Amazon SageMaker, and the Swisscom myAI assistant, built to meet Swiss data protection standards.
Solution overview: Swisscom’s agentic AI enabler framework
The challenge of enterprise-wide scaling of AI agents lies in managing siloed agentic solutions while facilitating cross-departmental coordination. Swisscom addresses this through Model Context Protocol (MCP) servers and the Agent2Agent protocol (A2A), for seamless agent communication across domains. Operating under Switzerland’s strict data protection laws, they’ve developed a framework that balances compliance requirements with efficient scaling capabilities, helping prevent redundant efforts while maintaining high security standards.
Swisscom’s multi-agent architecture: System design and implementation challenges
Swisscom's vision for enterprise-level agentic AI focuses on addressing fundamental challenges that organizations face when scaling AI solutions. They recognize that successful implementation requires more than just innovative technology; it demands a comprehensive approach to infrastructure and operations. One of the key challenges lies in orchestrating AI agents across different departments and systems while maintaining security and efficiency.
To illustrate these challenges in practice, let’s examine a common customer service scenario where an agent is tasked with helping a customer restore their Internet router connectivity. There are three potential causes for the connectivity loss: 1) a billing issue, 2) a network outage, or 3) a configuration mismatch known as a pairing issue. These issues typically reside in departments different from where the assigned agent operates, highlighting the need for seamless cross-departmental coordination.
The architecture diagram below illustrates the vision and associated challenges for a generic customer agent without the Amazon Bedrock AgentCore. The shared VPC setup of Swisscom is explained in more detail in the blog post, Automated networking with shared VPCs at Swisscom.

This architecture includes the following components:

A customer-facing generic agent deployed as a containerized runtime within a shared VPC, requiring both foundation model invocation capabilities and robust session management.
For task completion, the agent requires access to other agents and MCP servers. These resources are typically distributed across multiple AWS accounts and are deployed as containerized runtimes within the shared VPC.
Internal application access primarily occurs through SAIL (Service and Interface Library), Swisscom’s central system for API hosting and service integration. Corporate network resources are accessible via AWS Direct Connect, with a VPC Transit Gateway facilitating secure cross-network communication.
Security compliance is paramount: each interaction requires temporary access tokens that authenticate both the agent and the customer context. This bidirectional validation is essential across the system: agents, MCP servers, and tools must verify incoming tokens for service requests.
Gaining long-term insights from the stored sessions, such as customer preferences, demands a sophisticated analysis.

To build the solution mentioned above at scale, Swisscom identified several critical challenges that needed to be addressed:

Security and Authentication:

How to implement secure, transitive authentication and authorization that enforces least-privilege access based on intersecting permissions (customer, agent, department)?
How to enable controlled resource sharing across departments, cloud systems, and on-premises networks?

Integration and Interoperability:

How to make MCP servers and other agents centrally available to other use cases?
How to integrate and maintain compatibility with existing agentic use cases across Swisscom’s infrastructure?

Customer Intelligence and Operations:

How to effectively capture and utilize customer insights across multiple agentic interactions?
How to implement standardized evaluation and observability practices across the agents?

How Amazon Bedrock AgentCore addresses the challenges
Amazon Bedrock AgentCore provides Swisscom with a comprehensive solution that addresses their enterprise-scale agentic AI challenges.

AgentCore Runtime: Enables Swisscom's developers to focus on building agents while the system handles secure, cost-efficient hosting and automatic scaling through Docker container deployment that maintains session-level isolation. Hosting the runtime in the shared VPC allows access to internal APIs.
AgentCore Identity: Seamlessly integrates with Swisscom’s existing identity provider, managing both inbound and outbound authentication, alleviating the need for custom token exchange servers and simplifying secure interactions between agents, tools, and data sources.
AgentCore Memory: Delivers a robust solution for managing both session-based and long-term memory storage with custom memory strategies. This is particularly valuable for B2C operations where understanding customer context across interactions is crucial. Keeping each user’s data separate also supports security and compliance efforts.
Strands Agents Framework: Demonstrates high adoption among Swisscom’s developers due to its simplified agent construction, faster development cycles, seamless integration with Bedrock AgentCore services, and built-in capabilities for tracing, evaluation, and OpenTelemetry logging.

This solution does the following:

The client sends a request to the Strands agent running on AgentCore Runtime, passing an authentication token from the Swisscom IdP.
The client’s token is validated and a new token for the agent’s downstream tool usage is generated and passed back to the agent.
The agent invokes the foundation model on Bedrock and stores the sessions in the AgentCore Memory. The traffic traverses the VPC endpoints for Bedrock and Bedrock AgentCore, keeping the traffic private.
The agent accesses internal APIs, MCP & A2A servers inside the shared VPC, authenticating with the temporary token from AgentCore Identity.
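For orientation, the following is a minimal sketch of a Strands agent packaged for AgentCore Runtime, following the public starter patterns for the two SDKs. Class and attribute names should be checked against the current Strands Agents and Amazon Bedrock AgentCore documentation, and tool wiring, AgentCore Identity and AgentCore Memory integration are omitted here.

```python
# Minimal sketch of a Strands agent hosted on Amazon Bedrock AgentCore Runtime.
# Based on public starter patterns; verify names against the current SDK docs.
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent()  # tools, model selection and system prompt omitted in this sketch

@app.entrypoint
def invoke(payload):
    # AgentCore Runtime passes the request payload; identity and memory wiring omitted.
    user_message = payload.get("prompt", "")
    result = agent(user_message)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()
```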

With the flexibility to use a subset of Amazon Bedrock AgentCore features and their Amazon VPC integration, Swisscom could remain secure while using only the Bedrock AgentCore services that fit their specific needs, for example to integrate with existing agents on Amazon EKS. Amazon Bedrock AgentCore integrates with VPC to facilitate secure communication between agents and internal resources.
Results and benefits: Real-world implementation with self-service use case
Swisscom partnered with AWS to implement Amazon Bedrock AgentCore for two B2C use cases: 1) generating personalized sales pitches, and 2) providing automated customer support for technical issues like self-service troubleshooting. Both agents are being integrated into Swisscom's existing generative AI-powered customer chatbot system, SAM, necessitating high-performance agent-to-agent communication protocols due to the high volume of Swisscom customers and strict latency requirements. Throughout the development process, the team created an agent for each use case designed to be shared across the organization through MCP and A2A.
Amazon Bedrock AgentCore has proven instrumental in these implementations. By using Bedrock AgentCore Memory long-term insights, Swisscom can track and analyze customer interactions across different touchpoints, continuously improving the customer experience across domains. AgentCore Identity facilitates robust security, implementing precise access controls that limit agents to only those resources authorized for the specific customer interaction. The scalability of AgentCore Runtime allows these agents to efficiently handle thousands of requests per month each, maintaining low latency while optimizing costs.
The adoption of Strands Agents framework has been particularly valuable in this journey:

Development teams achieved their first business stakeholder demos within 3-4 weeks, despite having no prior experience with Strands Agents.
One project team migrated from their LangGraph implementation to Strands Agents, citing reduced complexity and faster development cycles.
The framework’s native OpenTelemetry integration supported seamless export of performance traces to Swisscom’s existing observability infrastructure, maintaining consistency with enterprise-wide monitoring standards.
The Strands evaluation test cases allowed teams to quickly put an evaluation pipeline together without additional tools, for a quick validation of the PoC.

Conclusion: Enterprise AI at scale – Key insights and Strategic implications
Swisscom’s implementation of Amazon Bedrock AgentCore demonstrates how enterprises can successfully navigate the complexities of production-ready Agentic AI while maintaining regulatory compliance and operational excellence. Swisscom’s journey offers 3 critical insights:

Architectural foundation matters: By addressing the fundamental challenges of secure cross-org authentication, standardized agent orchestration, and comprehensive observability, Swisscom established a scalable foundation that accelerates deployment rather than constraining it. The integration of AgentCore Runtime, Identity, and Memory services accelerated the infrastructure setup so teams could focus on business value.
Framework selection drives velocity: The adoption of Strands Agents framework exemplifies how the right development tools can dramatically reduce time-to-value. Teams achieving stakeholder demos within 3-4 weeks, coupled with successful migrations from alternative frameworks, validates the importance of developer experience in enterprise AI adoption.
Compliance as an enabler: Swisscom proved that regulatory compliance need not impede innovation. The system’s ability to scale while maintaining data sovereignty and user privacy has proven particularly valuable in the Swiss industry, where regulatory compliance is paramount.

As enterprises increasingly recognize AI agents as fundamental to competitive advantage, Swisscom’s implementation provides a proven reference architecture. Their success with high-volume B2C applications—from personalized sales assistance to automated technical support—illustrates that agentic AI can deliver measurable business outcomes at scale when built on appropriate infrastructure. This implementation serves as a blueprint for organizations seeking to deploy enterprise-scale AI solutions, showing how careful architectural planning and the right technology choices can lead to successful outcomes in both customer service and sales operations.
Next steps and looking ahead
The future roadmap focuses on three key areas: agent sharing, cross-domain integration, and governance. A centralized agent registry will facilitate discovery and reuse across the organization, supported by standardized documentation and shared best practices. Cross-domain integration will enable seamless collaboration between different business units, with clear standards for agent communication and interoperability. The implementation of robust governance mechanisms, including version control, usage monitoring, and regular security audits, will facilitate sustainable growth of the system while maintaining compliance with enterprise standards. This comprehensive approach will help drive continuous improvement based on real-world usage patterns and feedback.
Check out these additional links for relevant Agentic related information:

Transforming network operations with AI: How Swisscom built a network assistant using Amazon Bedrock
Introducing Amazon Bedrock AgentCore: Securely deploy and operate AI agents at any scale
Amazon Bedrock AgentCore Runtime, Browser, and Code Interpreter add support for VPC, AWS PrivateLink, CloudFormation, and tagging
Secure ingress connectivity to Amazon Bedrock AgentCore Gateway using interface VPC endpoints

About the authors
Arun Sittampalam, Director of Product Management AI at Swisscom, leads the company’s transformation toward Agentic AI, designing frameworks that scale large language model (LLM)–driven agents across enterprise environments. His team is building Swisscom’s agentic platform, integrating Amazon Bedrock, AgentCore and internal orchestration frameworks to empower Swisscom’s AI product teams to build and scale intelligent agents faster. Arun focuses on operationalizing multi-agent architectures that deliver automation, reliability, and scalability.
Maxime is a System and Security Architect at Swisscom, responsible for the architecture of Conversational and Agentic AI enablement. He is originally a Data Scientist with 10 years of experience in developing, deploying and maintaining NLP solutions which have been helping millions of Swisscom customers.
Julian Grüber is a Data Science Consultant at Amazon Web Services. He partners with strategic customers to scale GenAI solutions that unlock business value, working at both the use case and enterprise architecture level. Drawing on his background in applied mathematics, machine learning, business, and cloud infrastructure, Julian bridges technical depth with business outcomes to address complex AI/ML challenges.
Marco Fischer is a Senior Solutions Architect at Amazon Web Services. He works with leading telecom operators to design and deploy scalable, production-ready solutions. With over two decades of experience spanning software engineering, architecture, and cloud infrastructure, Marco combines deep technical expertise with a passion for solving complex enterprise challenges.
Akarsha Sehwag is a Generative AI Data Scientist for Amazon Bedrock AgentCore GTM team. With over six years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in Generative AI, Deep Learning and Computer Vision domains. Outside of work, she likes to hike, bike or play Badminton.
Ruben Merz is a Principal Solutions Architect at AWS, specializing in digital sovereignty, AI, and networking solutions for enterprise customers. With deep expertise in distributed systems and networking, he architects secure, compliant cloud solutions that help organizations navigate complex regulatory requirements while accelerating their digital transformation journeys.

Scaling MLflow for enterprise AI: What's New in SageMaker AI with MLflow

Today we’re announcing Amazon SageMaker AI with MLflow, now including a serverless capability that dynamically manages infrastructure provisioning, scaling, and operations for artificial intelligence and machine learning (AI/ML) development tasks. It scales resources up during intensive experimentation and down to zero when not in use, reducing operational overhead. It introduces enterprise-scale features including seamless access management with cross-account sharing, automated version upgrades, and integration with SageMaker AI capabilities like model customization and pipelines. With no administrator configuration needed and at no additional cost, data scientists can immediately begin tracking experiments, implementing observability, and evaluating model performance without infrastructure delays, making it straightforward to scale MLflow workloads across your organization while maintaining security and governance.
In this post, we explore how these new capabilities help you run large MLflow workloads—from generative AI agents to large language model (LLM) experimentation—with improved performance, automation, and security using SageMaker AI with MLflow.
Enterprise scale features in SageMaker AI with MLflow
The new MLflow serverless capability in SageMaker AI delivers enterprise-grade management with automatic scaling, default provisioning, seamless version upgrades, simplified AWS Identity and Access Management (IAM) authorization, resource sharing through AWS Resource Access Manager (AWS RAM), and integration with both Amazon SageMaker Pipelines and model customization. The term MLflow Apps replaces the previous MLflow tracking servers terminology, reflecting the simplified, application-focused approach. You can access the new MLflow Apps page in Amazon SageMaker Studio, as shown in the following screenshot.

A default MLflow App is automatically provisioned when you create a SageMaker Studio domain, streamlining the setup process. It’s enterprise-ready out of the box, requiring no additional provisioning or configuration. The MLflow App scales elastically with your usage, alleviating the need for manual capacity planning. Your training, tracking, and experimentation workloads can get the resources they need automatically, simplifying operations while maintaining performance.
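As a minimal sketch of what this looks like for a data scientist, the following assumes the default MLflow App's ARN (copied from the MLflow Apps page in SageMaker Studio) can be used as the tracking URI together with the mlflow and sagemaker-mlflow packages, mirroring the flow used with earlier managed MLflow tracking servers; the ARN below is a placeholder.

import mlflow

# Placeholder ARN, copy the real value for your default MLflow App from SageMaker Studio
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default")

mlflow.set_experiment("customer-churn-experiments")
with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("validation_auc", 0.91)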
Administrators can define a maintenance window during the creation of the MLflow App, during which in-place version upgrades of the MLflow App take place. This helps the MLflow App be standardized, secure, and continuously up to date, minimizing manual maintenance overhead. MLflow version 3.4 is supported with this launch, and as shown in the following screenshot, extends MLflow to ML, generative AI applications, and agent workloads.

Simplified identity management with MLflow Apps
We’ve simplified access control and IAM permissions for ML teams with the new MLflow App. A streamlined permissions set, such as sagemaker:CallMlflowAppApi, now covers common MLflow operations—from creating and searching experiments to updating trace information—making access control more straightforward to enforce.
By enabling simplified IAM permissions boundaries, users and platform administrators can standardize IAM roles across teams, personas, and projects, facilitating consistent and auditable access to MLflow experiments and metadata. For complete IAM permission and policy configurations, see Set up IAM permissions for MLflow Apps.
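As a hedged illustration of how the streamlined permission might appear in an identity policy, the sketch below creates a customer managed policy with boto3; the policy name is hypothetical, Resource is left broad for brevity, and the exact resource-level scoping and any additional required actions should be taken from the linked documentation.

import json
import boto3

# Hypothetical policy granting the streamlined MLflow Apps permission
mlflow_app_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallMlflowAppApi",
            "Resource": "*"  # scope this to your MLflow App in practice
        }
    ]
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="MlflowAppAccess",  # hypothetical name
    PolicyDocument=json.dumps(mlflow_app_policy),
)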
Cross-account sharing of MLflow Apps using AWS RAM
Administrators want to centrally manage their MLflow infrastructure while provisioning access across different AWS accounts. MLflow Apps support AWS cross-account sharing for collaborative enterprise AI development. Using AWS RAM, this feature helps AI platform administrators share an MLflow App seamlessly across data scientists with consumer AWS accounts, as illustrated in the following diagram.

Platform administrators can maintain a centralized, governed SageMaker domain that provisions and manages the MLflow App, and data scientists in separate consuming accounts can launch and interact with the MLflow App securely. Combined with the new simplified IAM permissions, enterprises can launch and manage an MLflow App from a centralized administrative AWS account. Using the shared MLflow App, a downstream data scientist consumer can log their MLflow experimentation and generative AI workloads while maintaining governance, auditability, and compliance from a single platform administrator control plane. To learn more about cross-account sharing, see Getting Started with AWS RAM.
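The sharing flow described above can also be scripted. The following is a hedged boto3 sketch of creating a resource share from the administrative account; the MLflow App ARN format and the consumer account ID are placeholders, and the exact resource type and any additional flags should be confirmed against the AWS RAM documentation.

import boto3

ram = boto3.client("ram")

# Placeholder MLflow App ARN and consumer (data science) account ID
ram.create_resource_share(
    name="shared-mlflow-app",
    resourceArns=["arn:aws:sagemaker:us-east-1:111122223333:mlflow-app/default"],
    principals=["444455556666"],
)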
SageMaker Pipelines and MLflow integration
SageMaker Pipelines, a serverless workflow orchestration service purpose-built for MLOps and LLMOps automation, is now integrated with MLflow. You can seamlessly build, execute, and monitor repeatable end-to-end ML workflows with an intuitive drag-and-drop UI or the Python SDK. From a SageMaker pipeline, a default MLflow App will be created if one doesn’t already exist, an MLflow experiment name can be defined, and metrics, parameters, and artifacts are logged to the MLflow App as defined in your SageMaker pipeline code. The following screenshot shows an example ML pipeline using MLflow.
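In code, a pipeline step that logs to MLflow might look like the following hedged sketch using the SageMaker Python SDK @step decorator; the experiment name, pipeline name, and role ARN are placeholders, execution settings such as instance type are omitted, and how the default MLflow App is resolved may differ from this plain mlflow usage.

import mlflow
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline

@step(name="train")  # instance type, image, and other settings omitted for brevity
def train():
    mlflow.set_experiment("pipeline-demo")  # placeholder experiment name
    with mlflow.start_run():
        mlflow.log_param("epochs", 3)
        mlflow.log_metric("accuracy", 0.95)
    return "done"

pipeline = Pipeline(name="mlflow-demo-pipeline", steps=[train()])
# pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole")
# pipeline.start()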

SageMaker model customization and MLflow integration
By default, SageMaker model customization integrates with MLflow, providing automatic linking between model customization jobs and MLflow experiments. When you run model customization fine-tuning jobs, the default MLflow App is used, an experiment is selected, and metrics, parameters, and artifacts are logged for you automatically. On the SageMaker model customization job page, you can view metrics sourced from MLflow and drill into additional metrics within the MLflow UI, as shown in the following screenshot.

Conclusion
These features make the new MLflow Apps in SageMaker AI ready for enterprise-scale ML and generative AI workloads with minimal administrative burden. You can get started with the examples provided in the GitHub samples repository and AWS workshop.
MLflow Apps are generally available in the AWS Regions where SageMaker Studio is available, except China and US GovCloud Regions. We invite you to explore the new capability and experience the enhanced efficiency and control it brings to your ML projects. Get started now by visiting the SageMaker AI with MLflow product detail page and Accelerate generative AI development using managed MLflow on Amazon SageMaker AI, and send your feedback to AWS re:Post for SageMaker or through your usual AWS support contacts.

About the authors
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, generative AI applications like agents, and scaling generative AI use cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can connect with Sandeep on LinkedIn to learn about generative AI solutions.
Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the Amazon SageMaker AIOps team. With over 20 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.
Jessica Liao is a Senior UX Designer at AWS who leads design for MLflow, model governance, and inference within Amazon SageMaker AI, shaping how data scientists evaluate, govern, and deploy models. She brings expertise in handling complex problems and driving human-centered innovation from her experience designing DNA life science systems, which she now applies to make machine learning tools more accessible and intuitive through cross-functional collaboration.

Mistral AI Ships Devstral 2 Coding Models And Mistral Vibe CLI For Age …

Mistral AI has introduced Devstral 2, a next generation coding model family for software engineering agents, together with Mistral Vibe CLI, an open source command line coding assistant that runs inside the terminal or IDEs that support the Agent Communication Protocol.

https://mistral.ai/news/devstral-2-vibe-cli

Devstral 2 and Devstral Small 2, model sizes, context and benchmarks

Devstral 2 is a 123B parameter dense transformer with a 256K token context window. It reaches 72.2 percent on SWE-bench Verified, which places it among the strongest open weight models for software engineering tasks. The model is released as open weights under a modified MIT license and is currently free to use via the Mistral API.

Devstral Small 2 is a 24B parameter model with the same 256K context window. It scores 68.0 percent on SWE-bench Verified, in the same range as models up to 5 times larger in parameter count. It is released under the Apache 2.0 license, a standard permissive license for production use.

Both models are described as open source and permissively licensed and are positioned as state of the art coding models for agentic workloads. Mistral reports that Devstral 2 is up to 7 times more cost efficient than Claude Sonnet on real world coding tasks at similar quality, which is important for continuous agent workloads.
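Since Devstral 2 is served through the Mistral API, a quick way to try it is the official mistralai Python client. The sketch below is hedged: the model identifier 'devstral-2' is an assumption, so check the Mistral model list for the exact name.

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Model identifier is an assumption, confirm the exact name in the Mistral docs
response = client.chat.complete(
    model="devstral-2",
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
)
print(response.choices[0].message.content)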

https://mistral.ai/news/devstral-2-vibe-cli

In terms of model size relative to frontier systems, Devstral 2 and Devstral Small 2 are 5 times and 28 times smaller than DeepSeek V3.2, and 8 times and 41 times smaller than Kimi K2.

Built for production grade coding workflows

Devstral 2 is designed for software engineering agents that need to explore repositories, track dependencies and orchestrate edits across many files while maintaining architecture level context. The model can detect failures, retry with corrections and support tasks such as bug fixing or modernization of legacy systems at repository scale.

Mistral states that Devstral 2 can be fine tuned to favor specific programming languages or to optimize for very large enterprise codebases. Devstral Small 2 brings the same design goals to a smaller footprint that is suitable for local deployment, tight feedback loops and fully private runtimes. It also supports image inputs and can drive multimodal agents that must reason over both code and visual artifacts such as diagrams or screenshots.

https://mistral.ai/news/devstral-2-vibe-cli

Human evaluations against DeepSeek V3.2 and Claude Sonnet 4.5

To test real world coding behavior, Mistral evaluated Devstral 2 against DeepSeek V3.2 and Claude Sonnet 4.5 using tasks scaffolded through the Cline agent tool. In these human evaluations Devstral 2 shows a clear advantage over DeepSeek V3.2 with a 42.8 percent win rate versus a 28.6 percent loss rate.

Mistral Vibe CLI, a terminal native coding agent

Mistral Vibe CLI is an open source command line coding assistant written in Python and powered by Devstral models. It explores, modifies and executes changes across a codebase using natural language in the terminal, or inside IDEs that support the Agent Communication Protocol, such as Zed, where it is available as an extension. The project is released under the Apache 2.0 license on GitHub.

Vibe CLI provides a chat style interface on top of several key tools:

Project aware context, it scans the file structure and Git status to build a working view of the repository.

Smart references, it supports @ autocomplete for files, ! for shell commands and slash commands for configuration changes.

Multi file orchestration, it reasons over the full codebase, not only the active buffer, to coordinate architecture level changes and reduce pull request cycle time.

Persistent history, autocompletion and themes tuned for daily use in the terminal.

Developers configure Vibe CLI through a simple config.toml file where they can point to Devstral 2 via the Mistral API or to other local or remote models. The tool supports programmatic runs, auto approval toggles for tool execution and granular permissions so that risky operations in sensitive repositories require confirmation.

Key Takeaways

Devstral 2 is a 123B parameter dense coding model with 256K context, it reaches 72.2 percent on SWE bench Verified and is released as open weights under a modified MIT license.

Devstral Small 2 has 24B parameters with the same 256K context, it scores 68.0 percent on SWE bench Verified and uses an Apache 2.0 license for easier production adoption.

Both Devstral models are optimized for agentic coding workloads, they are designed to explore full repositories, track dependencies and apply multi file edits with failure detection and retries.

Mistral Vibe CLI is an open source Python based terminal native coding agent that connects to Devstral, it provides project aware context, smart references and multi file orchestration through a chat style interface in the terminal or IDEs that support the Agent Communication Protocol.

Check out the Full Technical details here.
The post Mistral AI Ships Devstral 2 Coding Models And Mistral Vibe CLI For Agentic, Terminal Native Development appeared first on MarkTechPost.

Implement automated smoke testing using Amazon Nova Act headless mode

Automated smoke testing using Amazon Nova Act headless mode helps development teams validate core functionality in continuous integration and continuous delivery (CI/CD) pipelines. Development teams often deploy code several times daily, so fast testing helps maintain application quality. Traditional end-to-end testing can take hours to complete, creating delays in your CI/CD pipeline.
Smoke testing is a subset of testing that validates the most critical functions of an application work correctly after deployment. These tests focus on key workflows like user login, core navigation, and key transactions rather than exhaustive feature coverage. Smoke tests typically complete in minutes rather than hours, making them ideal for CI/CD pipelines where fast feedback on code changes is essential.
Amazon Nova Act uses AI-powered UI understanding and natural language processing to interact with web applications, replacing traditional CSS selectors. Instead of maintaining brittle CSS selectors and complex test scripts, you can write tests using simple English commands that adapt to UI changes.
This post shows how to implement automated smoke testing using Amazon Nova Act headless mode in CI/CD pipelines. We use SauceDemo, a sample ecommerce application, as our target for demonstration. We demonstrate setting up Amazon Nova Act for headless browser automation in CI/CD environments and creating smoke tests that validate key user workflows. We then show how to implement parallel execution to maximize testing efficiency, configure GitLab CI/CD for automatic test execution on every deployment, and apply best practices for maintainable and scalable test automation.
Solution overview
The solution includes a Python test runner that executes smoke tests, ecommerce workflow validation for complete user journeys, GitLab CI/CD integration for automation, and parallel execution to speed up testing. Headless mode runs browser tests in the background without opening a browser window, which works well for automated testing.
The following diagram illustrates the testing workflow.

We walk through the following steps to implement automated smoke testing with Amazon Nova Act:

Set up your project and dependencies.
Create a smoke test with login validation.
Configure validation for the entire ecommerce workflow.
Configure the automated testing pipeline.
Configure parallel execution.

Prerequisites
To complete this walkthrough, you must have the following:

Access to Amazon Nova Act with API key.
A GitLab repository.
UV package manager. For instructions, refer to Installing uv.
Familiarity with Python and GitLab CI/CD.

Set up project and dependencies
Create your project and install dependencies:

# Create the project
uv init nova-act-smoke-tests
# Open in VS Code
code nova-act-smoke-tests
# Navigate into the project and install required packages
cd nova-act-smoke-tests
uv add nova-act

UV is a fast Python package manager that handles dependency installation and virtual environment management automatically, similar to npm for Node.js projects.

Create a test runner
Create smoke_tests.py:

import os
from nova_act import NovaAct

# Check API key
if not os.getenv("NOVA_ACT_API_KEY"):
    exit("❌ Set NOVA_ACT_API_KEY environment variable")

SAUCEDEMO_URL = "https://www.saucedemo.com/"

with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
    nova.act("Verify you are in the login page")

print("✅ Foundation setup complete!")

Test your setup

Test your setup with the following commands:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

Environment variables like NOVA_ACT_API_KEY keep sensitive information secure and separate from your code.
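If you prefer not to export the key in every shell, one option (not part of the original setup, so treat it as an optional addition) is to load it from a local .env file with the python-dotenv package:

# Requires: uv add python-dotenv, plus a .env file containing NOVA_ACT_API_KEY=...
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory, if present
if not os.getenv("NOVA_ACT_API_KEY"):
    exit("❌ Set NOVA_ACT_API_KEY in your environment or .env file")

Remember to add .env to .gitignore, as noted in the security features that follow.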
This solution implements the following security features:

Stores API keys in environment variables or .env files (add .env to .gitignore)
Uses different API keys for development, staging, and production environments
Implements key rotation every 90 days using automated scripts or calendar reminders
Monitors API key usage through logs to detect unauthorized access

You now have a modern Python project with Amazon Nova Act configured and ready for testing. Next, we show how to create a working smoke test that uses natural language browser automation.
Create smoke test for login validation
Let’s expand your foundation code to include a complete login test with proper structure.
Add main function and login test
Update smoke_tests.py:

import os
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"

def test_login_flow():
    """Test complete login flow and product page verification"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act Smoke Test")

    try:
        test_login_flow()
        print("✅ Login test: PASS")
    except Exception as e:
        print(f"❌ Login test: FAIL - {e}")
        exit(1)

    print("🎉 All tests passed!")

if __name__ == "__main__":
    main()

Test your login flow
Run your complete login test:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

You should see the following output:

🚀 Starting Nova Act Smoke Test
✅ Login test: PASS
🎉 All tests passed!

Your smoke test now validates a complete user journey that uses natural language with Amazon Nova Act. The test handles page verification to confirm you’re on the login page, form interactions that enter user name and password credentials, action execution that clicks the login button, and success validation that verifies the products page loads correctly. The built-in error handling provides retry logic if the login process encounters any issues, showing how the AI-powered automation of Amazon Nova Act adapts to dynamic web applications without the brittleness of traditional CSS selector-based testing frameworks.
Although a login test provides valuable validation, real-world applications require testing complete user workflows that span multiple pages and complex interactions. Next, we expand the testing capabilities by building a comprehensive ecommerce journey that validates the entire customer experience.
Configure ecommerce workflow validation
Let’s build a comprehensive ecommerce workflow that tests the end-to-end customer journey from login to logout.
Add complete ecommerce test
Update smoke_tests.py to include the full workflow:

import os
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"

def test_login_flow():
    """Test complete login flow and product page verification"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    with NovaAct(starting_page=SAUCEDEMO_URL) as nova:
        # Login
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

        # Shopping
        nova.act("Select Sauce Labs Backpack")
        nova.act("Add Sauce Labs Backpack to the cart")
        nova.act("Navigate back to products page")
        nova.act("Select Sauce Labs Onesie")
        nova.act("Add Sauce Labs Onesie to the cart")
        nova.act("Navigate back to products page")

        # Cart verification
        nova.act("Click cart and Navigate to the cart page")
        nova.act("Verify 2 items are in the cart")

        # Checkout process
        nova.act("Click the Checkout button")
        nova.act("Enter 'John' in the First Name field")
        nova.act("Enter 'Doe' in the Last Name field")
        nova.act("Enter '12345' in the Zip/Postal Code field")
        nova.act("Click the Continue button")

        # Order completion
        nova.act("Verify Checkout:Overview page appears")
        nova.act("Click the Finish button")
        nova.act("Verify 'THANK YOU FOR YOUR ORDER' appears on the page")

        # Return and logout
        nova.act("Click the Back Home button")
        nova.act("Click the hamburger menu on the left")
        nova.act("Click the Logout link")
        nova.act("Verify the user is on the login page")

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act E-commerce Tests")

    tests = [
        ("Login Flow", test_login_flow),
        ("E-commerce Workflow", test_ecommerce_workflow)
    ]

    passed = 0
    for test_name, test_func in tests:
        try:
            test_func()
            print(f"✅ {test_name}: PASS")
            passed += 1
        except Exception as e:
            print(f"❌ {test_name}: FAIL - {e}")

    print(f"\n📊 Results: {passed}/{len(tests)} tests passed")

    if passed == len(tests):
        print("🎉 All tests passed!")
    else:
        exit(1)

if __name__ == "__main__":
    main()

Test your ecommerce workflow
Run your comprehensive test suite:

export NOVA_ACT_API_KEY="your-api-key"
uv run smoke_tests.py

You should see the following output:

🚀 Starting Nova Act E-commerce Tests
✅ Login Flow: PASS
✅ E-commerce Workflow: PASS
📊 Results: 2/2 tests passed
🎉 All tests passed!

Understanding the ecommerce journey
The workflow tests a complete customer experience:

Authentication – Login with valid credentials
Product discovery – Browse and select products
Shopping cart – Add items and verify cart contents
Checkout process – Enter shipping information
Order completion – Complete purchase and verify success
Navigation – Return to products and log out

The following screenshot shows the step-by-step visual guide of the user journey.

Your smoke tests now validate complete user journeys that mirror real customer experiences. The ecommerce workflow shows how Amazon Nova Act handles complex, multi-step processes across multiple pages. By testing the entire customer journey from authentication through order completion, you’re validating the primary revenue-generating workflows in your application.
This approach reduces maintenance overhead while providing comprehensive coverage of your application’s core functionality.
Running these tests manually provides immediate value, but the real power comes from integrating them into your development workflow. Automating test execution makes sure code changes are validated against your critical user journeys before reaching production.
Configure automated testing pipeline
With your comprehensive ecommerce workflow in place, you’re ready to integrate these tests into your CI pipeline. This step shows how to configure GitLab CI/CD to automatically run these smoke tests on every code change, making sure key user journeys remain functional throughout your development cycle. We show how to configure headless mode for CI environments while maintaining the visual debugging capabilities for local development.
Add headless mode for CI/CD
Update smoke_tests.py to support headless mode for CI environments by adding the following lines to both test functions:

def test_login_flow():
    """Test complete login flow and product page verification"""
    headless = os.getenv("HEADLESS", "false").lower() == "true"

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # ... rest of your test code remains the same

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    headless = os.getenv("HEADLESS", "false").lower() == "true"

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # ... rest of your test code remains the same

Create the GitLab CI/CD pipeline
GitLab CI/CD is GitLab’s built-in CI system that automatically runs pipelines when code changes occur. Pipelines are defined in YAML files that specify when to run tests and what steps to execute.
Create .gitlab-ci.yml:

stages:
  - test

smoke-tests:
  stage: test
  image: mcr.microsoft.com/playwright/python:v1.40.0-jammy
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
    - if: $CI_COMMIT_BRANCH == "develop"
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_PIPELINE_SOURCE == "web"
  before_script:
    - pip install uv
    - uv sync
    - uv run playwright install chromium
  script:
    - uv run python smoke_tests.py
  variables:
    HEADLESS: 'true'
    NOVA_ACT_SKIP_PLAYWRIGHT_INSTALL: 'true'
    MAX_WORKERS: '2'  # used by the parallel execution framework later in this post

Configure GitLab CI/CD variables
GitLab CI/CD variables provide secure storage for sensitive information like API keys. These values are encrypted and only accessible to your GitLab CI/CD pipelines. Complete the following steps to add a variable:

In your project, choose Settings, CI/CD, and Variables.
Choose Add variable.
For the key, enter NOVA_ACT_API_KEY.
For the value, enter your Amazon Nova Act API key.
Select Mask variable to hide the value in job logs.
Choose Add variable.

Understanding the code changes
The key change is the headless mode configuration:

headless = os.getenv("HEADLESS", "false").lower() == "true"
with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:

This configuration provides flexibility for different development environments. During local development when the HEADLESS environment variable is not set, the headless parameter defaults to False, which opens a browser window so you can see the automation in action. This visual feedback is invaluable for debugging test failures and understanding how Amazon Nova Act interacts with your application. In CI/CD environments where HEADLESS is set to true, the browser runs in the background without opening any windows, making it ideal for automated testing pipelines that don’t have display capabilities and need to run efficiently without visual overhead.
Test your CI/CD setup
Push your code to trigger the workflow:

git add .
git commit -m "Add Nova Act smoke tests with CI/CD"
git push origin main

Check the Pipelines section in your GitLab project to see the tests running.

Your smoke tests now run automatically as part of your CI pipeline, providing immediate feedback on code changes. The GitLab CI/CD integration makes sure critical user journeys are validated before any deployment reaches production, reducing the risk of shipping broken functionality to customers.
The implementation shows how modern package management with UV reduces CI/CD pipeline execution time compared to traditional pip installations. Combined with secure API key management through GitLab CI/CD variables, your testing infrastructure follows enterprise security best practices.
As your test suite grows, you might notice that running tests sequentially can become a bottleneck in your deployment pipeline. The next section addresses this challenge by implementing parallel execution to maximize your CI/CD efficiency.
Configure parallel execution
With your CI/CD pipeline successfully validating individual test cases, the next optimization focuses on performance enhancement through parallel execution. Concurrent test execution can reduce your total testing time by running multiple browser instances simultaneously, maximizing the efficiency of your CI/CD resources while maintaining test reliability and isolation.
Add parallel execution framework
Update smoke_tests.py to support concurrent testing:

import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from nova_act import NovaAct

SAUCEDEMO_URL = "https://www.saucedemo.com/"
headless = os.getenv("HEADLESS", "false").lower() == "true"

def test_login_flow():
    """Test complete login flow and product page verification"""

    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        # nova.act("In case of error, make sure the username and password are correct, if required re-enter the username and password")
        nova.act("Verify Products appear on the page")

def test_ecommerce_workflow():
    """Test complete e-commerce workflow: login → shop → checkout → logout"""
    with NovaAct(starting_page=SAUCEDEMO_URL, headless=headless) as nova:
        # Login
        nova.act("Enter 'standard_user' in the username field")
        nova.act("Enter 'secret_sauce' in the password field")
        nova.act("Click the login button")
        nova.act("Verify Products appear on the page")

        # Shopping
        nova.act("Select Sauce Labs Backpack")
        nova.act("Add Sauce Labs Backpack to the cart")
        nova.act("Navigate back to products page")
        nova.act("Select Sauce Labs Onesie")
        nova.act("Add Sauce Labs Onesie to the cart")
        nova.act("Navigate back to products page")

        # Cart verification
        nova.act("Click cart and Navigate to the cart page")
        nova.act("Verify 2 items are in the cart")

        # Checkout process
        nova.act("Click the Checkout button")
        nova.act("Enter 'John' in the First Name field")
        nova.act("Enter 'Doe' in the Last Name field")
        nova.act("Enter '12345' in the Zip/Postal Code field")
        nova.act("Click the Continue button")

        # Order completion
        nova.act("Verify Checkout:Overview page appears")
        nova.act("Click the Finish button")
        nova.act("Verify 'THANK YOU FOR YOUR ORDER' appears on the page")

        # Return and logout
        nova.act("Click the Back Home button")
        nova.act("Click the hamburger menu on the left")
        nova.act("Click the Logout link")
        nova.act("Verify the user is on the login page")

def run_test(test_name, test_func):
    """Execute a single test and return result"""
    try:
        test_func()
        print(f"✅ {test_name}: PASS")
        return True
    except Exception as e:
        print(f"❌ {test_name}: FAIL - {e}")
        return False

def main():
    # Check API key
    if not os.getenv("NOVA_ACT_API_KEY"):
        exit("❌ Set NOVA_ACT_API_KEY environment variable")

    print("🚀 Starting Nova Act Tests (Parallel)")

    tests = [
        ("Login Flow", test_login_flow),
        ("E-commerce Workflow", test_ecommerce_workflow)
    ]

    # Configure parallel execution
    max_workers = int(os.getenv("MAX_WORKERS", "2"))

    # Run tests in parallel
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_test = {
            executor.submit(run_test, name, func): name
            for name, func in tests
        }

        for future in as_completed(future_to_test):
            results.append(future.result())

    # Report results
    passed = sum(results)
    total = len(results)

    print(f"\n📊 Results: {passed}/{total} tests passed")

    if passed == total:
        print("🎉 All tests passed!")
    else:
        exit(1)

if __name__ == "__main__":
    main()

Update GitLab CI/CD for parallel execution
The parallel execution is already configured in your .gitlab-ci.yml through the MAX_WORKERS: '2' variable. The pipeline automatically uses the parallel framework when running the smoke tests.
Test parallel execution
Run your optimized tests:

export NOVA_ACT_API_KEY="your-api-key"
export MAX_WORKERS="2"
uv run smoke_tests.py

You should see both tests running simultaneously:

🚀 Starting Nova Act Tests (Parallel)
✅ Login Flow: PASS
✅ E-commerce Workflow: PASS
📊 Results: 2/2 tests passed
🎉 All tests passed!

Understanding parallel execution
ThreadPoolExecutor is a Python class that manages a pool of worker threads, allowing multiple tasks to run simultaneously. In this case, each thread runs a separate browser test, reducing total execution time.

# Configure worker count
max_workers = int(os.getenv("MAX_WORKERS", "2"))

# Execute tests concurrently
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_test = {
        executor.submit(run_test, name, func): name
        for name, func in tests
    }

Parallel execution provides benefits such as faster execution (because tests run simultaneously instead of sequentially), configurable workers that adjust based on system resources, resource efficiency that optimizes CI/CD compute time, and scalability that makes it straightforward to add more tests without increasing total runtime.
However, there are important considerations to keep in mind. Each test opens a browser instance (which increases resource usage), tests must be independent of each other to maintain proper isolation, and you must balance worker counts with available CPU and memory limits in CI environments.
Each parallel test uses system resources and incurs API usage. Start with two workers and adjust based on your environment’s capacity and cost requirements. Monitor your Amazon Nova Act usage to optimize the balance between test speed and expenses.
The performance improvement is significant when comparing sequential vs. parallel execution. In sequential execution, tests run one after another with the total time being the sum of all individual test durations. With parallel execution, multiple tests run simultaneously, completing in approximately the time of the longest test, resulting in substantial time savings that become more valuable as your test suite grows.
Your smoke tests now feature concurrent execution that significantly reduces total testing time while maintaining complete test isolation and reliability. The ThreadPoolExecutor implementation allows multiple browser instances to run simultaneously, transforming your sequential test suite into a parallel execution that completes much faster. This performance improvement becomes increasingly valuable as your test suite grows, so comprehensive validation doesn’t become a bottleneck in your deployment pipeline.
The configurable worker count through the MAX_WORKERS environment variable provides flexibility to optimize performance based on available system resources. In CI/CD environments, this allows you to balance test execution speed with resource constraints, and local development can use full system capabilities for faster feedback cycles. The architecture maintains complete test independence, making sure parallel execution doesn’t introduce flakiness or cross-test dependencies that could compromise reliability. As a best practice, keep tests independent—each test should work correctly regardless of execution order or other tests running simultaneously.
Best practices
With your performance-optimized testing framework complete, consider the following practices for production readiness:

Keep tests independent so they are not impacted by execution order or other tests running simultaneously.
Add retry logic by wrapping your test functions in try/except blocks with a retry mechanism for handling transient network issues (see the sketch after this list).
Configure your GitLab CI/CD pipeline with a reasonable timeout and consider adding a scheduled run for daily validation of your production environment.
For ongoing maintenance, establish a rotation schedule for your Amazon Nova Act API keys and monitor your test execution times to catch performance regressions early. As your application grows, you can add new test functions to the parallel execution framework without impacting overall runtime, making this solution highly scalable for future needs.
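A minimal retry wrapper, as a hedged sketch around the run_test helper from the parallel framework (the attempt count and delay are assumptions, tune them for your environment):

import time

def run_with_retries(test_name, test_func, attempts=2, delay_seconds=5):
    """Run a test function, retrying on failure to absorb transient network issues."""
    for attempt in range(1, attempts + 1):
        try:
            test_func()
            print(f"✅ {test_name}: PASS (attempt {attempt})")
            return True
        except Exception as e:
            print(f"⚠️ {test_name}: attempt {attempt} failed - {e}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    print(f"❌ {test_name}: FAIL after {attempts} attempts")
    return False

In main(), submit run_with_retries instead of run_test to the ThreadPoolExecutor to get the same behavior in parallel runs.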

Clean up
To avoid incurring future charges and maintain security, clean up the resources you created:

Remove or disable unused GitLab CI/CD pipelines.
Rotate API keys every 90 days and revoke unused keys.
Delete the repositories provided with this post.
Remove API keys from inactive projects.
Clear cached credentials and temporary files from your local environment.

Conclusion
In this post, we showed how to implement automated smoke testing using Amazon Nova Act headless mode for CI/CD pipelines. We demonstrated how to create comprehensive ecommerce workflow tests that validate user journeys, implement parallel execution for faster test completion, and integrate automated testing with GitLab CI/CD for continuous validation.
The natural language approach using Amazon Nova Act needs less maintenance than traditional frameworks that use CSS selectors. Combined with modern tooling like UV package management and GitLab CI/CD, this solution provides fast, reliable test execution that scales with your development workflow. Your implementation now catches issues before they reach production, providing the fast feedback essential for confident continuous deployment while maintaining high application quality standards.
To learn more about browser automation and testing strategies on AWS, explore the following resources:

Getting Started Using the Nova Act Dev Tools and explore the capability in Nova Act playground
Amazon CloudWatch Synthetics for additional monitoring capabilities
GitLab CI/CD documentation for CI/CD best practices
AWS security best practices for securing your automation infrastructure

Try implementing these smoke tests in your own applications and consider extending the framework with additional test scenarios that match your specific user journeys. Share your experience and any optimizations you discover in the comments section.

About the authors
Sakthi Chellapparimanam Sakthivel is a Solutions Architect at AWS, specializing in .NET modernization and enterprise cloud transformations. He helps GSI and software/services customers build scalable, innovative solutions on AWS. He architects intelligent automation frameworks and GenAI-powered applications that drive measurable business outcomes across diverse industries. Beyond his technical pursuits, Sakthivel enjoys spending quality time with his family and playing cricket.
Shyam Soundar is a Solutions Architect at AWS with an extensive background in security, cost-optimization, and analytics offerings. Shyam works with enterprise customers to help them build and scale applications to achieve their business outcomes with lower cost.
Reena M is an FSI Solutions Architect at AWS, specializing in analytics and generative AI-based workloads, helping capital markets and banking customers create secure, scalable, and efficient solutions on AWS. She architects cutting-edge data platforms and AI-powered applications that transform how financial institutions leverage cloud technologies. Beyond her technical pursuits, Reena is also a writer and enjoys spending time with her family.

Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into Firs …

The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step toward running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek’s NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per chip custom code.

What is LiteRT NeuroPilot Accelerator?

LiteRT is the successor of TensorFlow Lite. It is a high performance runtime that sits on device, runs models in .tflite FlatBuffer format, and can target CPU, GPU and now NPU backends through a unified hardware acceleration layer.
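For orientation, the following is a hedged Python sketch of the baseline LiteRT path using the ai-edge-litert package's Interpreter (the model file name is a placeholder); the MediaTek NPU path itself goes through the Compiled Model API on device, as shown later in C++.

import numpy as np
from ai_edge_litert.interpreter import Interpreter

# Load a .tflite FlatBuffer model (placeholder file name)
interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference with a zero-filled input of the expected shape and dtype
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
print(output.shape)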

LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration to the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands Ahead of Time (AOT) compilation and on device compilation, and exposes both through the same C++ and Kotlin APIs.

On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large part of the Android mid range and flagship device space.

Why Developers Care, A Unified Workflow For Fragmented NPUs

Historically, on device ML stacks were CPU and GPU first. NPU SDKs shipped as vendor specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device specific debugging.

LiteRT NeuroPilot Accelerator replaces that with a three step workflow that is the same regardless of which MediaTek NPU is present:

Convert or load a .tflite model as usual.

Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack that is tied to one or more target SoCs.

Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting, runtime loading, and falls back to GPU or CPU if the NPU is not available.

For you as an engineer, the main change is that device targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.

AOT and on device compilation are both supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user device. On device compilation is better for small models and generic .tflite distribution, at the cost of higher first run latency. The blog shows that for a model such as Gemma-3-270M, pure on device compilation can take more than 1 minute, which makes AOT the realistic option for production LLM use.

Gemma, Qwen, And Embedding Models On MediaTek NPU

The stack is built around open weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production oriented support for:

Qwen3 0.6B, for text generation in markets such as mainland China.

Gemma-3-270M, a compact base model that is easy to fine tune for tasks like sentiment analysis and entity extraction.

Gemma-3-1B, a multilingual text only model for summarization and general reasoning.

Gemma-3n E2B, a multimodal model that handles text, audio and vision for things like real time translation and visual question answering.

EmbeddingGemma 300M, a text embedding model for retrieval augmented generation, semantic search and classification.

On the latest Dimensity 9500, running on a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.

For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text in text out API. A typical C++ flow is to create ModelAssets, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower level LiteRT CompiledModel API in a tensor in tensor out configuration, again with the NPU selected through hardware accelerator options.

Developer Experience, C++ Pipeline And Zero Copy Buffers

LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel and TensorBuffer objects.

For MediaTek NPUs, this API integrates tightly with Android’s AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image processing code feed NPU inputs without an intermediate copy through CPU memory. This is important for real time camera and video processing where multiple copies per frame quickly saturate memory bandwidth.

A typical high level C++ path on device looks like this, omitting error handling for clarity:

// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create compiled model (env is a LiteRT Environment created earlier, omitted here)
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);

The same Compiled Model API is used whether you are targeting CPU, GPU or the MediaTek NPU, which reduces the amount of conditional logic in application code.

Key Takeaways

LiteRT NeuroPilot Accelerator is the new, first class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with AOT and on device compilation on supported Dimensity SoCs.

The stack targets concrete open weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT LM on MediaTek NPUs with a single accelerator abstraction.

AOT compilation is strongly recommended for LLMs, for example Gemma-3-270M can take more than 1 minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On device AI.

On a Dimensity 9500 class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12 times CPU and 10 times GPU for LLM workloads.

For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models and use zero copy tensor buffers, so CPU, GPU and MediaTek NPU targets can share one code path and one deployment workflow.

Check out the Docs and Technical details.
The post Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs appeared first on MarkTechPost.

A Coding Guide to Build a Procedural Memory Agent That Learns, Stores, …

In this tutorial, we explore how an intelligent agent can gradually form procedural memory by learning reusable skills directly from its interactions with an environment. We design a minimal yet powerful framework in which skills behave like neural modules: they store action sequences, carry contextual embeddings, and are retrieved by similarity when a new situation resembles an earlier experience. As we run our agent through multiple episodes, we observe how its behaviour becomes more efficient, moving from primitive exploration to leveraging a library of skills that it has learned on its own. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

class Skill:
    def __init__(self, name, preconditions, action_sequence, embedding, success_count=0):
        self.name = name
        self.preconditions = preconditions
        self.action_sequence = action_sequence
        self.embedding = embedding
        self.success_count = success_count
        self.times_used = 0

    def is_applicable(self, state):
        for key, value in self.preconditions.items():
            if state.get(key) != value:
                return False
        return True

    def __repr__(self):
        return f"Skill({self.name}, used={self.times_used}, success={self.success_count})"

class SkillLibrary:
    def __init__(self, embedding_dim=8):
        self.skills = []
        self.embedding_dim = embedding_dim
        self.skill_stats = defaultdict(lambda: {"attempts": 0, "successes": 0})

    def add_skill(self, skill):
        for existing_skill in self.skills:
            if self._similarity(skill.embedding, existing_skill.embedding) > 0.9:
                existing_skill.success_count += 1
                return existing_skill
        self.skills.append(skill)
        return skill

    def retrieve_skills(self, state, query_embedding=None, top_k=3):
        applicable = [s for s in self.skills if s.is_applicable(state)]
        if query_embedding is not None and applicable:
            similarities = [self._similarity(query_embedding, s.embedding) for s in applicable]
            sorted_skills = [s for _, s in sorted(zip(similarities, applicable), reverse=True)]
            return sorted_skills[:top_k]
        return sorted(applicable, key=lambda s: s.success_count / max(s.times_used, 1), reverse=True)[:top_k]

    def _similarity(self, emb1, emb2):
        return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-8)

    def get_stats(self):
        return {
            "total_skills": len(self.skills),
            "total_uses": sum(s.times_used for s in self.skills),
            "avg_success_rate": np.mean([s.success_count / max(s.times_used, 1) for s in self.skills]) if self.skills else 0
        }

We define how skills are represented and stored in a memory structure. We implement similarity-based retrieval so that the agent can match a new state with past skills using cosine similarity. As we work through this layer, we see how skill reuse becomes possible once skills acquire metadata, embeddings, and usage statistics. Check out the FULL CODES here.

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.reset()

    def reset(self):
        self.agent_pos = [0, 0]
        self.goal_pos = [self.size-1, self.size-1]
        self.objects = {"key": [2, 2], "door": [3, 3], "box": [1, 3]}
        self.inventory = []
        self.door_open = False
        return self.get_state()

    def get_state(self):
        return {
            "agent_pos": tuple(self.agent_pos),
            "has_key": "key" in self.inventory,
            "door_open": self.door_open,
            "at_goal": self.agent_pos == self.goal_pos,
            "objects": {k: tuple(v) for k, v in self.objects.items()}
        }

    def step(self, action):
        reward = -0.1
        if action == "move_up":
            self.agent_pos[1] = min(self.agent_pos[1] + 1, self.size - 1)
        elif action == "move_down":
            self.agent_pos[1] = max(self.agent_pos[1] - 1, 0)
        elif action == "move_left":
            self.agent_pos[0] = max(self.agent_pos[0] - 1, 0)
        elif action == "move_right":
            self.agent_pos[0] = min(self.agent_pos[0] + 1, self.size - 1)
        elif action == "pickup_key":
            if self.agent_pos == self.objects["key"] and "key" not in self.inventory:
                self.inventory.append("key")
                reward = 1.0
        elif action == "open_door":
            if self.agent_pos == self.objects["door"] and "key" in self.inventory:
                self.door_open = True
                reward = 2.0
        done = self.agent_pos == self.goal_pos and self.door_open
        if done:
            reward = 10.0
        return self.get_state(), reward, done

We construct a simple environment in which the agent learns tasks such as picking up a key, opening a door, and reaching a goal. We use this environment as a playground for our procedural memory system, allowing us to observe how primitive actions evolve into more complex, reusable skills. The environment’s structure helps us observe clear, interpretable improvements in behaviour across episodes. Check out the FULL CODES here.

class ProceduralMemoryAgent:
    def __init__(self, env, embedding_dim=8):
        self.env = env
        self.skill_library = SkillLibrary(embedding_dim)
        self.embedding_dim = embedding_dim
        self.episode_history = []
        self.primitive_actions = ["move_up", "move_down", "move_left", "move_right", "pickup_key", "open_door"]

    def create_embedding(self, state, action_seq):
        state_vec = np.zeros(self.embedding_dim)
        state_vec[0] = hash(str(state["agent_pos"])) % 1000 / 1000
        state_vec[1] = 1.0 if state.get("has_key") else 0.0
        state_vec[2] = 1.0 if state.get("door_open") else 0.0
        for i, action in enumerate(action_seq[:self.embedding_dim-3]):
            state_vec[3+i] = hash(action) % 1000 / 1000
        return state_vec / (np.linalg.norm(state_vec) + 1e-8)

    def extract_skill(self, trajectory):
        if len(trajectory) < 2:
            return None
        start_state = trajectory[0][0]
        actions = [a for _, a, _ in trajectory]
        preconditions = {"has_key": start_state.get("has_key", False), "door_open": start_state.get("door_open", False)}
        end_state = self.env.get_state()
        if end_state.get("has_key") and not start_state.get("has_key"):
            name = "acquire_key"
        elif end_state.get("door_open") and not start_state.get("door_open"):
            name = "open_door_sequence"
        else:
            name = f"navigate_{len(actions)}_steps"
        embedding = self.create_embedding(start_state, actions)
        return Skill(name, preconditions, actions, embedding, success_count=1)

    def execute_skill(self, skill):
        skill.times_used += 1
        trajectory = []
        total_reward = 0
        for action in skill.action_sequence:
            state = self.env.get_state()
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            total_reward += reward
            if done:
                skill.success_count += 1
                return trajectory, total_reward, True
        return trajectory, total_reward, False

    def explore(self, max_steps=20):
        trajectory = []
        state = self.env.get_state()
        for _ in range(max_steps):
            action = self._choose_exploration_action(state)
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                return trajectory, True
        return trajectory, False

We focus on building embeddings that encode the context of a state-action sequence, enabling us to meaningfully compare skills. We also extract skills from successful trajectories, transforming raw experience into reusable behaviours. As we run this code, we observe how simple exploration gradually yields structured knowledge that the agent can apply later. Check out the FULL CODES here.

    def _choose_exploration_action(self, state):
        agent_pos = state["agent_pos"]
        if not state.get("has_key"):
            key_pos = state["objects"]["key"]
            if agent_pos == key_pos:
                return "pickup_key"
            if agent_pos[0] < key_pos[0]:
                return "move_right"
            if agent_pos[0] > key_pos[0]:
                return "move_left"
            if agent_pos[1] < key_pos[1]:
                return "move_up"
            return "move_down"
        if state.get("has_key") and not state.get("door_open"):
            door_pos = state["objects"]["door"]
            if agent_pos == door_pos:
                return "open_door"
            if agent_pos[0] < door_pos[0]:
                return "move_right"
            if agent_pos[0] > door_pos[0]:
                return "move_left"
            if agent_pos[1] < door_pos[1]:
                return "move_up"
            return "move_down"
        goal_pos = (4, 4)
        if agent_pos[0] < goal_pos[0]:
            return "move_right"
        if agent_pos[1] < goal_pos[1]:
            return "move_up"
        return np.random.choice(self.primitive_actions)

    def run_episode(self, use_skills=True):
        self.env.reset()
        total_reward = 0
        steps = 0
        trajectory = []
        while steps < 50:
            state = self.env.get_state()
            if use_skills and self.skill_library.skills:
                query_emb = self.create_embedding(state, [])
                skills = self.skill_library.retrieve_skills(state, query_emb, top_k=1)
                if skills:
                    skill_traj, skill_reward, success = self.execute_skill(skills[0])
                    trajectory.extend(skill_traj)
                    total_reward += skill_reward
                    steps += len(skill_traj)
                    if success:
                        return trajectory, total_reward, steps, True
                    continue
            action = self._choose_exploration_action(state)
            next_state, reward, done = self.env.step(action)
            trajectory.append((state, action, reward))
            total_reward += reward
            steps += 1
            if done:
                return trajectory, total_reward, steps, True
        return trajectory, total_reward, steps, False

    def train(self, episodes=10):
        stats = {"rewards": [], "steps": [], "skills_learned": [], "skill_uses": []}
        for ep in range(episodes):
            trajectory, reward, steps, success = self.run_episode(use_skills=True)
            if success and len(trajectory) >= 3:
                segment = trajectory[-min(5, len(trajectory)):]
                skill = self.extract_skill(segment)
                if skill:
                    self.skill_library.add_skill(skill)
            stats["rewards"].append(reward)
            stats["steps"].append(steps)
            stats["skills_learned"].append(len(self.skill_library.skills))
            stats["skill_uses"].append(self.skill_library.get_stats()["total_uses"])
            print(f"Episode {ep+1}: Reward={reward:.1f}, Steps={steps}, Skills={len(self.skill_library.skills)}, Success={success}")
        return stats

We define how the agent chooses between using known skills and exploring with primitive actions. We train the agent across several episodes and record the evolution of learned skills, usage counts, and success rates. As we examine this part, we observe that skill reuse reduces episode length and improves overall rewards. Check out the FULL CODES here.

def visualize_training(stats):
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes[0, 0].plot(stats["rewards"])
    axes[0, 0].set_title("Episode Rewards")
    axes[0, 1].plot(stats["steps"])
    axes[0, 1].set_title("Steps per Episode")
    axes[1, 0].plot(stats["skills_learned"])
    axes[1, 0].set_title("Skills in Library")
    axes[1, 1].plot(stats["skill_uses"])
    axes[1, 1].set_title("Cumulative Skill Uses")
    plt.tight_layout()
    plt.savefig("skill_learning_stats.png", dpi=150, bbox_inches='tight')
    plt.show()

if __name__ == "__main__":
    print("=== Procedural Memory Agent Demo ===\n")
    env = GridWorld(size=5)
    agent = ProceduralMemoryAgent(env)
    print("Training agent to learn reusable skills...\n")
    stats = agent.train(episodes=15)
    print("\n=== Learned Skills ===")
    for skill in agent.skill_library.skills:
        print(f"{skill.name}: {len(skill.action_sequence)} actions, used {skill.times_used} times, {skill.success_count} successes")
    lib_stats = agent.skill_library.get_stats()
    print("\n=== Library Statistics ===")
    print(f"Total skills: {lib_stats['total_skills']}")
    print(f"Total skill uses: {lib_stats['total_uses']}")
    print(f"Avg success rate: {lib_stats['avg_success_rate']:.2%}")
    visualize_training(stats)
    print("\n✓ Skill learning complete! Check the visualization above.")

We bring everything together by running training, printing learned skills, and plotting behaviour statistics. We visualize the trend in rewards and how the skill library grows over time. By running this snippet, we complete the lifecycle of procedural memory formation and confirm that the agent learns to behave more intelligently with experience.

In conclusion, we see how procedural memory emerges naturally when an agent learns to extract skills from its own successful trajectories. We observe how skills gain structure, metadata, embeddings, and usage patterns, allowing the agent to reuse them efficiently in future situations. Lastly, we appreciate how even a small environment and simple heuristics lead to meaningful learning dynamics, giving us a concrete understanding of what it means for an agent to develop reusable internal competencies over time.

Check out the FULL CODES here.
The post A Coding Guide to Build a Procedural Memory Agent That Learns, Stores, Retrieves, and Reuses Skills as Neural Modules Over Time appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI has open sourced the GLM-4.6V series as a pair of vision language models that treat images, video and tools as first class inputs for agents, not as afterthoughts bolted on top of text.

Model lineup and context length

The series has 2 models. GLM-4.6V is a 106B parameter foundation model for cloud and high performance cluster workloads. GLM-4.6V-Flash is a 9B parameter variant tuned for local deployment and low latency use.

GLM-4.6V extends the training context window to 128K tokens. In practice this supports roughly 150 pages of dense documents, 200 slide pages or one hour of video in a single pass because pages are encoded as images and consumed by the visual encoder.

Native multimodal tool use

The main technical change is native multimodal Function Calling. Traditional tool use in LLM systems routes everything through text. Images or pages are first turned into descriptions, the model calls tools using text arguments and then reads textual responses. This wastes information and increases latency.

With GLM-4.6V, images, screenshots and document pages pass directly as tool parameters. Tools can return search result grids, charts, rendered web pages or product images. The model consumes those visual outputs and fuses them with text in the same reasoning chain. This closes the loop from perception to understanding to execution and is explicitly positioned as the bridge between visual perception and executable action for multimodal agents.

To support this, Zhipu AI extends the Model Context Protocol with URL based multimodal handling. Tools receive and return URLs that identify specific images or frames, which avoids file size limits and allows precise selection inside multi image contexts.
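
To make the shape of such an exchange concrete, here is a minimal, hypothetical sketch of a URL-based multimodal tool call and its result. The tool name, argument names and message structure are illustrative assumptions, not the actual GLM-4.6V or MCP schema.

```python
# Hypothetical sketch of a URL-based multimodal tool exchange.
# Field names and the tool itself are illustrative assumptions,
# not the actual GLM-4.6V / MCP schema.
tool_call = {
    "name": "image_search",                       # hypothetical tool
    "arguments": {
        "query": "compare these two laptops",
        "image_urls": [                           # images passed by URL, not inlined bytes
            "https://example.com/session/frame_012.png",
            "https://example.com/session/product_a.jpg",
        ],
    },
}

tool_result = {
    "content": [
        {"type": "text", "text": "Top match: Product B, 14-inch OLED."},
        {"type": "image_url", "url": "https://example.com/results/grid_1.png"},
    ],
}

# The model would consume tool_result's text and image URLs together
# in the same reasoning chain, rather than a text-only summary.
```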

Rich text content, web search and frontend replication

The Zhipu AI research team describes 4 canonical scenarios:

First, rich text content understanding and creation. GLM-4.6V reads mixed inputs such as papers, reports or slide decks and produces structured image text interleaved outputs. It understands text, charts, figures, tables and formulas in the same document. During generation it can crop relevant visuals or retrieve external images through tools, then run a visual audit step that filters low quality images and composes the final article with inline figures.

Second, visual web search. The model can detect user intent, plan which search tools to call and combine text to image and image to text search. It then aligns retrieved images and text, selects the relevant evidence and outputs a structured answer, for example a visual comparison of products or places.

Third, frontend replication and visual interaction. GLM-4.6V is tuned for design to code workflows. From a UI screenshot, it reconstructs pixel accurate HTML, CSS and JavaScript. Developers can then mark a region on the screenshot and issue natural language instructions, for example move this button left or change this card background. The model maps those instructions back to the code and returns an updated snippet.

Fourth, multimodal document understanding at long context. GLM-4.6V can read multi document inputs up to the 128K token context limit by treating pages as images. The research team reports a case where the model processes financial reports from 4 public companies, extracts core metrics and builds a comparison table, and a case where it summarises a full football match while keeping the ability to answer questions about specific goals and timestamps.

Architecture, data and reinforcement learning

The GLM-4.6V models belong to the GLM-V family and build on the tech reports for GLM-4.5V and GLM-4.1V-Thinking. The research team highlights three main technical ingredients.

First, long sequence modeling. GLM-4.6V extends the training context window to 128K tokens and runs continual pre training on massive long context image text corpora. It uses compression alignment ideas from Glyph so that visual tokens can carry dense information that is aligned with language tokens.

Second, world knowledge enhancement. The Zhipu AI team adds a billion scale multimodal perception and world knowledge dataset at pre training time. This covers layered encyclopedic concepts and everyday visual entities. The stated goal is to improve both basic perception and cross modal question answering completeness, not only benchmarks.

Third, agentic data synthesis and extended MCP. The research team generates large synthetic traces where the model calls tools, processes visual outputs and iterates on plans. They extend MCP with URL based multimodal handling and an interleaved output mechanism. The generation stack follows a Draft, Image Selection, Final Polish sequence. The model can autonomously call cropping or search tools between these stages to place images at the right positions in the output.

Tool invocation is part of the reinforcement learning objective. GLM-4.6V uses RL to align planning, instruction following and format adherence in complex tool chains.

Performance

Detailed benchmark results are available in the official release post: https://z.ai/blog/glm-4.6v

Key Takeaways

GLM-4.6V is a 106B multimodal foundation model with a 128K token training context, and GLM-4.6V-Flash is a 9B variant optimized for local and low latency use.

Both models support native multimodal Function Calling so tools can consume and return images, video frames and document pages directly, which links visual perception to executable actions for agents.

GLM-4.6V is trained for long context multimodal understanding and interleaved generation, so it can read large mixed document sets and emit structured text with inline figures and tool selected images in one pass.

The series achieves state of the art performance on major multimodal benchmarks at similar parameter scales and is released as open source weights under the MIT license on Hugging Face and ModelScope.

Check out the Model Card on HF and Technical details.
The post Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling appeared first on MarkTechPost.

Real-world reasoning: How Amazon Nova Lite 2.0 handles complex custome …

Artificial intelligence (AI) reasoning capabilities determine whether models can handle complex, real-world tasks beyond simple pattern matching. With strong reasoning, models can identify problems from ambiguous descriptions, apply policies under competing constraints, adapt tone to sensitive situations, and provide complete solutions that address root causes. Without robust reasoning, AI systems fail when faced with nuanced scenarios requiring judgment, context awareness, and multi-step problem-solving.
This post evaluates the reasoning capabilities of our latest offering in the Nova family, Amazon Nova Lite 2.0, using practical scenarios that test these critical dimensions. We compare its performance against other models in the Nova family—Lite 1.0, Micro, Pro 1.0, and Premier—to elucidate how the latest version advances reasoning quality and consistency.
Solution overview
We evaluate five Amazon Nova models across five customer support scenarios, measuring performance on eight dimensions:

Problem identification
Solution completeness
Policy adherence
Factual accuracy
Empathy and tone
Communication clarity
Logical coherence
Practical utility

An independent evaluator model (gpt-oss-20b) provides automated, unbiased scoring.
The evaluation architecture invokes all models in the same AWS Region (us-east-1) and automatically handles the different API formats: the Converse API for Nova models and the OpenAI Chat Completions format for gpt-oss-20b.
The sample notebook is available in the GitHub repository.
Test scenarios
To generate the scenarios evaluation dataset, we use Claude Sonnet 4.5 by Anthropic on Amazon Bedrock to generate a sample of 100 scenarios that pertain to common customer support interactions. We don’t use any of the Nova models to generate the scenarios to avoid any bias. We then randomly select five scenarios for our testing purposes that evaluate common real-world reasoning challenges:

Angry customer complaint – Tests de-escalation, empathy, and problem resolution when a customer threatens to leave after delayed delivery and poor service.
Software technical problem – Evaluates technical troubleshooting when an app crashes during photo uploads despite basic troubleshooting attempts.
Billing dispute – Assesses investigation skills and security awareness for unrecognized charges potentially indicating unauthorized access.
Product defect report – Measures warranty policy application and customer service for a two-month-old defective product.
Account security concern – Tests urgency response and security protocols for unauthorized password changes and fraudulent purchases.

Each scenario includes key issues to identify, required solutions, and relevant policies—providing objective criteria for evaluation. Depending on your industry/domain/use case, the scenarios and associated context may be different.
Implementation details
The evaluation framework establishes a comprehensive methodology for assessing model performance across multiple dimensions simultaneously. This systematic approach ensures that each model undergoes identical testing conditions, enabling fair comparison of reasoning capabilities across the Nova family. The technical implementation handles the complexity of managing different API formats while maintaining evaluation consistency. The framework assumes an active AWS account, access to Nova models and gpt-oss-20b, along with the availability of the boto3 SDK, and pandas, matplotlib, seaborn, scipy and numpy packages.
Model invocation
The system automatically detects which API format each model requires and routes requests accordingly. Nova models (Lite, Micro, Pro, Premier) use Amazon Bedrock Converse API, which provides a unified interface for conversational interactions. gpt-oss models use the OpenAI Chat Completions format, requiring a different request structure with the InvokeModel API. The invocation function checks the model identifier to determine the appropriate format. For gpt-oss models, it constructs a JSON request body with messages, token limits, and temperature settings, then parses the response to extract the generated content. For Nova models, it uses the Converse API with structured message objects and inference configuration parameters, extracting the response from the output message content. This dual-API approach supports seamless evaluation across different model families without requiring separate code paths or manual configuration changes. The same evaluation logic works for all models regardless of their underlying API requirements, with the system handling format differences transparently. The architecture also allows us to use models from different Regions while maintaining a single evaluation workflow.
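As a rough sketch of this routing logic (not the notebook's exact code; the model ID prefix check and some request fields, such as max_completion_tokens, are assumptions that may differ from the gpt-oss schema on Amazon Bedrock), the dual-API dispatch could look like the following:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke(model_id: str, prompt: str, max_tokens: int = 1024, temperature: float = 0.2) -> str:
    """Route the request to the API format the model expects (illustrative sketch)."""
    if model_id.startswith("openai."):
        # gpt-oss models: OpenAI Chat Completions style body via InvokeModel.
        # Exact field names (e.g., max_completion_tokens vs max_tokens) depend on the model schema.
        body = {
            "messages": [{"role": "user", "content": prompt}],
            "max_completion_tokens": max_tokens,
            "temperature": temperature,
        }
        response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
        payload = json.loads(response["body"].read())
        return payload["choices"][0]["message"]["content"]
    # Nova models: unified Converse API with structured message objects.
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": temperature},
    )
    return response["output"]["message"]["content"][0]["text"]
```
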
The evaluation framework uses optimized prompts generated by the Amazon Bedrock Prompt Optimizer API. The optimizer analyzes and rewrites raw prompts to improve model performance with better structure, clarity, and organization, creating model-specific optimizations for each Nova model.
A scenario with the optimized prompt is shown in the following example:

“`json
{
  "angry_customer": {
    "name": "Angry Customer Complaint",
    "prompt": "# Customer Support Response Task\n\n## Context\nYou are a professional customer support representative for a technology company. You need to respond to an upset customer who has written the following message:\n\n\"I am absolutely furious! I ordered a laptop 3 weeks ago and it still hasn't arrived. When I called last week, your representative was rude and unhelpful. I've been a loyal customer for 5 years and this is how you treat me? I want my money back immediately and I'm considering switching to your competitor. This is unacceptable!\"\n\n## Instructions\nCraft a professional, empathetic response that:\n1. Acknowledges the customer's frustration and validates their feelings\n2. Apologizes sincerely for the specific issues (delayed delivery and poor customer service)\n3. Demonstrates understanding of their value as a loyal 5-year customer\n4. Offers a clear solution to address their refund request\n5. Provides a specific action plan to resolve the delivery issue (if they choose not to cancel)\n6. Includes a concrete step to follow up and rebuild trust\n7. Maintains a respectful, professional tone throughout\n\nYour response should be concise, solution-oriented, and focused on retaining this valuable customer. Avoid making excuses or shifting blame.\n\nProvide your response immediately without any preamble.",
    "key_issues": [
      "Delayed delivery",
      "Poor customer service experience",
      "Customer loyalty concerns",
      "Refund request"
    ],
    "required_solutions": [
      "Apologize sincerely",
      "Investigate delivery status",
      "Offer compensation",
      "Escalate if needed"
    ],
    "policies": [
      "Always acknowledge customer emotions",
      "Provide specific next steps",
      "Offer multiple resolution options"
    ],
    "_optimization_metadata": {
      "original_length": 463,
      "optimized_length": 1330,
      "target_model": "amazon.nova-2-lite-v1:0"
    }
  }
}
“`

Evaluation framework
The evaluator receives the scenario, model response, and evaluation criteria. We employ a two-step scoring process: first, the evaluator assigns a category label that best characterizes the response; then, the evaluator assigns a predetermined score corresponding to that category label. This approach ensures a consistent and uniform scoring methodology across all model responses.
The evaluation prompt structure:

“`python
EVALUATION_PROMPT = """
# Customer Support Response Evaluation Task

You are an expert evaluator assessing customer support responses. Your task is to
provide **detailed, objective scoring** across 8 dimensions with specific reasoning
for each score.

## Context

### Original Customer Scenario
{scenario}

### Model's Response to Evaluate
{response}

## Evaluation Criteria

### Key Issues That Should Be Identified
{key_issues}

### Required Solutions/Actions
{required_solutions}

### Company Policies to Follow
{policies}

## Scoring Instructions

Evaluate the response across **8 dimensions** using a **two-step process**:

### Step 1: Assign Category Label

For each dimension, first determine which category best describes the response:

**EXCELLENT**: Comprehensive, professional, exceeds expectations
– All requirements fully met with exceptional quality
– No significant improvements needed
– Demonstrates mastery of the dimension

**GOOD**: Solid performance with minor room for improvement
– Most requirements met effectively
– Minor gaps or areas for enhancement
– Clearly competent but not exceptional

**ADEQUATE**: Meets basic requirements but has notable gaps
– Core requirements partially met
– Significant room for improvement
– Functional but not impressive

**POOR**: Significant issues requiring major improvements
– Many requirements not met
– Critical gaps in quality
– Barely functional or ineffective

**FAILING**: Critical failures, does not meet requirements
– Fundamental requirements not met
– Unusable or harmful response
– Complete failure on this dimension

### Step 2: Assign Fixed Score

Each category maps to a fixed score:
– **EXCELLENT** → 10
– **GOOD** → 8
– **ADEQUATE** → 6
– **POOR** → 4
– **FAILING** → 2

For **EACH dimension**, provide:
1. **Category label** (EXCELLENT/GOOD/ADEQUATE/POOR/FAILING)
2. **Fixed score** (10/8/6/4/2 based on category)
3. **Specific reasoning** explaining your categorization

## Evaluation Dimensions

### 1. Problem Identification
**Question**: Did the response identify all key issues from the customer’s message?
- Check if all items from "Key Issues" were recognized
– Note any missed or misunderstood problems

### 2. Solution Completeness
**Question**: Are all identified problems addressed with appropriate solutions?
– Verify each issue has a corresponding solution or action
– Check if solutions are practical and actionable

### 3. Policy Adherence
**Question**: Does the response follow all stated company policies?
- Review against "Company Policies to Follow"
– Note any policy violations or omissions

### 4. Factual Accuracy
**Question**: Are technical details, processes, and options stated correctly?
– Check for factual errors or misleading information
– Verify technical accuracy of troubleshooting steps

### 5. Empathy & Tone
**Question**: Does the response demonstrate appropriate emotional intelligence?
– Assess acknowledgment of customer emotions
– Evaluate professionalism and empathy level

### 6. Communication Clarity
**Question**: Is the response clear, well-structured, and actionable?
– Check for clear language and organization
– Verify instructions are easy to follow

### 7. Logical Coherence
**Question**: Is the reasoning sound without contradictions?
– Look for logical flow and consistency
– Identify any contradictory statements

### 8. Practical Utility
**Question**: Would this response actually help the customer resolve their issue?
– Consider real-world effectiveness
– Assess likelihood of customer satisfaction

## Example Evaluation
<>
"""
“`

The evaluator must justify scores, providing transparency into the assessment. To address transparency concerns in AI evaluation, the evaluator provides detailed reasoning for each of the eight dimensions, plus an overall justification. This ensures that scores are not just numerical but backed by specific explanations of why each score was assigned.
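To make the two-step scoring concrete, here is a minimal sketch of how category labels can be mapped to fixed scores once the evaluator's output has been parsed. The JSON shape of the evaluator response shown here is an assumption for illustration, not the notebook's exact format.

```python
# Minimal sketch of the two-step scoring: the evaluator returns a category
# label per dimension, and we map it to a fixed score. The response JSON
# shape below is an assumption for illustration.
CATEGORY_SCORES = {"EXCELLENT": 10, "GOOD": 8, "ADEQUATE": 6, "POOR": 4, "FAILING": 2}

def score_evaluation(evaluator_json: dict) -> dict:
    """Convert per-dimension category labels into fixed numeric scores."""
    scores = {}
    for dimension, result in evaluator_json.items():
        label = result.get("category", "").upper()
        scores[dimension] = CATEGORY_SCORES.get(label, 0)  # 0 marks a failed/unparseable evaluation
    return scores

example = {
    "problem_identification": {"category": "EXCELLENT", "reasoning": "All four key issues recognized."},
    "empathy_tone": {"category": "GOOD", "reasoning": "Empathetic but slightly formulaic."},
}
print(score_evaluation(example))  # {'problem_identification': 10, 'empathy_tone': 8}
```
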
Large language model (LLM)-as-a-judge evaluation
Machine translation-based evaluation techniques like ROUGE and BLEU fall short when it comes to open-ended conversations. LLM-as-a-judge provides scalability, flexibility, and evaluations that agree with human preferences up to 80% of the time.
Refer to the comparison table in the README for further details.
Evaluation process
For each model and scenario combination, we perform 10 runs to measure consistency. This produces 250 evaluations (5 models × 5 scenarios × 10 runs), providing a statistical spread through multiple measurements. The number of runs and scenarios can be increased according to the specific use case. The framework includes diagnostic checks to verify evaluation quality and reliability. Failed evaluations (where the evaluator returns a score of 0 due to technical issues such as JSON parsing errors, or where a model doesn't respond because its output is blocked under Responsible AI criteria) are excluded from mean and standard deviation calculations to ensure accurate performance metrics. This prevents technical failures from artificially lowering model scores.
Results
The chosen scenarios and approach described here enable deep statistical analysis of model performance patterns. By examining both individual scenario outcomes and aggregate metrics, we can identify strengths and potential areas for improvement across the Nova model family. This multi-dimensional analysis approach provides confidence in the reliability of performance rankings.
Statistical analysis
The statistical evaluation we use follows the methods outlined in Miller, 2024. To quantify uncertainty in model performance estimates, we calculate standard error (SE) as:

SE = √(σ^2/n),

where σ^2 is the sample variance, and n is the sample size. SE measures how precise our estimate of the mean is and tells us how much the sample mean would vary if we repeated the evaluation many times. The standard error allows us to construct 95% confidence intervals (CI = μ± 1.96×SE), where μ is the sample mean. This provides plausible ranges for true model performance, facilitating statistical significance testing through interval overlap analysis. In addition, we introduce a coefficient of variation (CV) based consistency score calculated as (100 – CV%), where CV% = (σ/μ)×100, and σ is the standard deviation. This normalizes reliability measurement on a 0-100 scale, thereby providing an intuitive metric for response stability. Finally, zero-exclusion averaging prevents failed evaluations from artificially deflating scores, while error bars on visualizations transparently communicate uncertainty. For the sake of completeness, the code in the GitHub repository calculates other statistics such as a minimum detectable effect that demonstrates the ability to reliably detect meaningful performance differences, a pairwise model comparison metric that identifies correlations between model responses, and a power analysis that validates the chosen sample size. These methodologies transform the evaluation from simple score comparison into rigorous experimental science with quantified uncertainty, enabling confident conclusions about model performance differences.
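
The following minimal NumPy sketch (not the repository code) shows how these quantities can be computed for one model and scenario pair, including the zero-exclusion rule described above:

```python
import numpy as np

def summarize(scores):
    """Compute the summary statistics described above (sketch under the stated
    definitions; zero scores mark failed evaluations and are excluded)."""
    valid = np.array([s for s in scores if s > 0], dtype=float)  # zero-exclusion averaging
    n = len(valid)
    mean = valid.mean()
    se = valid.std(ddof=1) / np.sqrt(n)            # SE = sqrt(sample variance / n)
    ci = (mean - 1.96 * se, mean + 1.96 * se)      # 95% confidence interval
    cv = 100.0 * valid.std(ddof=1) / mean          # coefficient of variation, in percent
    consistency = 100.0 - cv                       # CV-based consistency score
    return {"mean": mean, "se": se, "ci95": ci, "cv_pct": cv, "consistency": consistency}

# Example: ten runs for one model/scenario pair, with one failed evaluation (0).
print(summarize([10, 8, 10, 8, 10, 10, 8, 10, 0, 10]))
```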

Figure 1: Performance of models across the dimensions considered in the study with 95% confidence intervals

Figure 2: Overall performance of Nova Lite 2.0 compared to other models in the Nova family
Figure 1 shows the performance of models with scores averaged across all the runs for each dimension considered in the study; this is also depicted on the radar chart in Figure 2. Table 1 shows the scores across all dimensions considered in the study. Nova Lite 2.0 achieved the highest overall score (9.42/10) with a standard error of 0.08 and a coefficient of variation of 5.55%, demonstrating high-quality reasoning.

| Metric | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro 1.0 | Nova Micro | Nova Premier |
|---|---|---|---|---|---|
| Overall Score | 9.42 | 8.65 | 8.53 | 7.70 | 7.16 |
| Standard Error (SE) | 0.08 | 0.09 | 0.12 | 0.32 | 0.38 |
| 95% Confidence Interval | [9.28, 9.57] | [8.48, 8.82] | [8.30, 8.76] | [7.08, 8.32] | [6.41, 7.91] |
| Consistency Score (CV-based) | 94.45 | 93.05 | 90.46 | 71.37 | 62.96 |
| Coefficient of Variation | 5.55% | 6.95% | 9.54% | 28.63% | 37.04% |

Table 1: Overall Model Performance Summary

| Metric | Nova Lite 2.0 | Nova Lite 1.0 | Nova Pro 1.0 | Nova Micro | Nova Premier |
|---|---|---|---|---|---|
| Problem Identification | 9.63 ± 0.27 | 8.57 ± 0.46 | 8.16 ± 0.44 | 7.59 ± 0.74 | 6.94 ± 0.82 |
| Solution Completeness | 9.59 ± 0.23 | 8.08 ± 0.32 | 8.04 ± 0.42 | 6.78 ± 0.65 | 6.33 ± 0.69 |
| Policy Adherence | 8.82 ± 0.54 | 7.76 ± 0.59 | 7.55 ± 0.64 | 7.02 ± 0.69 | 6.37 ± 0.81 |
| Factual Accuracy | 9.55 ± 0.26 | 9.18 ± 0.30 | 9.10 ± 0.28 | 8.08 ± 0.74 | 8.00 ± 0.89 |
| Empathy & Tone | 8.98 ± 0.33 | 8.57 ± 0.34 | 8.08 ± 0.36 | 7.55 ± 0.65 | 7.10 ± 0.79 |
| Communication Clarity | 9.76 ± 0.19 | 9.14 ± 0.28 | 8.94 ± 0.28 | 8.04 ± 0.69 | 7.63 ± 0.85 |
| Logical Coherence | 9.71 ± 0.35 | 9.67 ± 0.29 | 9.92 ± 0.11 | 8.98 ± 0.74 | 8.16 ± 0.91 |
| Practical Utility | 9.35 ± 0.27 | 8.24 ± 0.22 | 8.45 ± 0.24 | 7.55 ± 0.62 | 6.78 ± 0.70 |

Table 2: Dimension-Level Performance of the Nova models (Mean Scores with 95% Confidence Intervals)
Table 2 shows the performance across the eight dimensions considered in the study. Nova Lite 2.0 achieved consistently high scores across all dimensions.

| Scenario | Nova Lite 2.0 | Nova Lite 1.0 | Nova Micro | Nova Pro 1.0 | Nova Premier |
|---|---|---|---|---|---|
| Account Security Concern | 9.25 | 7.95 | 7.65 | 6.90 | 2.00 |
| Angry Customer Complaint | 9.95 | 9.50 | 9.30 | 8.35 | 8.20 |
| Billing Dispute | 9.15 | 8.75 | 8.60 | 8.85 | 8.20 |
| Product Defect Report | 9.25 | 8.90 | 7.70 | 8.00 | 8.75 |
| Software Technical Problem | 10.00 | 8.20 | 8.55 | 8.75 | 8.60 |

Table 3: Summary of scores (on a scale of 1-10) across models and scenarios considered. A score of 2 for Nova Premier for Account Security Concern is due to Guardrails being invoked for almost all of the responses.
Table 3 summarizes the mean scores corresponding to each scenario considered in the study. Again, Nova Lite 2.0 achieves the highest scores across all scenarios.
Dimension analysis
The dimensional strengths of Nova Lite 2.0 demonstrate balanced capabilities across critical evaluation criteria. High scores in problem identification, communication, and logical reasoning indicate mature performance that translates effectively to real-world applications, distinguishing it from models that excel in individual dimensions but lack consistency.
Problem Identification: Nova Lite 2.0 excelled at identifying all key issues—crucial where missing problems lead to incomplete solutions.
Communication Clarity: The model achieved the highest score in this dimension, producing well-structured, actionable responses customers could follow easily.
Logical Coherence: Strong performance indicates the model maintains sound reasoning without contradictions across complex scenarios.
Empathy and Tone: High scores demonstrate appropriate emotional intelligence, critical for de-escalation and sensitive situations.
Table 4 shows sample evaluator explanations for high-scoring and low-scoring models, illustrating effective scoring methodology.

Nova Lite 2.0 – Score: 10 – Category: "Excellent": The response explicitly recognizes the four key issues: it mentions the delayed delivery ("delay in receiving your laptop"), the poor customer service experience ("unhelpful interaction with our support team"), the customer's loyalty ("a valued customer of five years"), and the refund request ("cancel your order and receive a full refund"). All issues are acknowledged with appropriate language.

Nova Premier – Score: 6 – Category: "Adequate": The response acknowledges frustration and loyalty, but it does not explicitly mention the delayed delivery or the rude customer service representative, two key issues from the customer message.

Table 4: Sample explanations provided by the evaluator for Nova Lite 2.0 and Nova Premier for the Angry Customer scenario along the Problem Identification dimension
Key findings
The evaluation results reveal critical insights for model selection and deployment strategies. These findings emphasize considering multiple performance factors rather than focusing solely on aggregate scores, as optimal choices depend on specific application requirements and operational constraints.

Multi-dimensional reasoning matters: Models scoring well on accuracy but poorly on empathy or clarity are unsuitable for customer-facing applications. The balanced performance of Nova Lite 2.0 across all dimensions makes it production-ready.
Consistency predicts production success: The low variability of Nova Lite 2.0 versus other models indicates reliable performance across diverse scenarios—critical where inconsistent responses damage user trust.
Real-world evaluation reveals practical capabilities: Synthetic benchmarks miss critical dimensions like empathy, policy adherence, and practical utility. This framework surfaces production-relevant capabilities.

Implementation considerations
Successfully implementing this evaluation framework requires attention to operational factors that significantly impact assessment quality and cost-effectiveness. The choice of evaluation methodology, scoring mechanisms, and technical infrastructure directly influences result reliability and scalability.

Evaluator selection: We selected gpt-oss-20b to ensure independence from the Nova family, reducing potential bias. Amazon Bedrock offers built-in LLM-as-a-judge capabilities with standard metrics like correctness, completeness, and harmfulness. The framework presented in this post provides the flexibility to define specialized evaluation criteria and multi-dimensional assessments that can be customized to the specific use case of interest.
Scenario design: Effective scenarios balance realism with measurability. Each includes specific details grounding evaluation in realistic contexts. Objective criteria—key issues to identify, required solutions, relevant policies—enable consistent scoring. Realistic complexity combining multiple problems (billing dispute + security breach) and competing priorities (urgency vs protocols) reveals how models handle real-world ambiguity and surfaces capability gaps.
Statistical validation: Multiple runs per scenario provide confidence intervals and detect inconsistency, ensuring performance differences are statistically significant.

Key takeaways
Amazon Nova Lite 2.0 demonstrates impressive reasoning capabilities in tested real-world scenarios, achieving consistent high performance across diverse problem-solving tasks. Balanced scores across evaluation dimensions—from technical problem identification to empathetic communication—indicate robust reasoning potentially applicable to other domains after comprehensive testing. Multi-dimensional evaluation reveals nuanced model capabilities that single-metric benchmarks miss. Understanding performance across problem identification, solution completeness, policy adherence, empathy, clarity, and logical coherence provides actionable deployment insights. This practical testing methodology provides actionable insights for organizations evaluating AI systems. The framework’s focus on objective criteria, independent evaluation, and statistical validation creates reproducible assessments adaptable to domains requiring contextual judgment and problem-solving. As models advance, assessment methodologies must evolve to capture increasingly sophisticated reasoning capabilities—multi-turn conversations, complex decision-making under uncertainty, and nuanced judgment in ambiguous situations.
Conclusion
This comprehensive evaluation demonstrates that Amazon Nova Lite 2.0 delivers production-ready AI reasoning capabilities with measurable reliability across diverse business applications. The multi-dimensional assessment framework provides organizations with quantitative evidence needed to confidently deploy AI systems in critical operational environments.
Next steps
Evaluate Nova Lite 2.0 for your use case:

Bedrock Model Evaluation: Start with the model evaluation tools of Amazon Bedrock, including the built-in LLM-as-a-judge capabilities for standard metrics, or adapt the custom framework discussed in this post for specialized evaluation criteria.
Implement multi-dimensional testing: Adapt the evaluation framework to your specific domain requirements.
Pilot deployment: Begin with low-risk scenarios to validate performance in your environment.
Scale systematically: Use the statistical validation approach to expand to additional use cases.

Additional resources

Miller, E; Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Amazon Bedrock Documentation
Amazon Nova models
GitHub repository
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

About the authors
Madhu Pai, Ph.D., is a Principal Specialist Solutions Architect for Generative AI and Machine Learning at AWS. He leads strategic AI/ML initiatives that deliver scalable impact across diverse industries by identifying customer needs and building impactful solutions. Previously at AWS, Madhu served as the WW Partner Tech Lead for Manufacturing where he delivered compelling partner solutions that drove strategic outcomes for industrial manufacturing customers. He brings over 18 years of experience across multiple industries, leveraging data, AI, and ML to deliver measurable business results.
Sunita Koppar is a Senior Specialist Solutions Architect in Generative AI and Machine Learning at AWS, where she partners with customers across diverse industries to design solutions, build proof-of-concepts, and drive measurable business outcomes. Beyond her professional role, she is deeply passionate about learning and teaching Sanskrit, actively engaging with student communities to help them upskill and grow.
Satyanarayana Adimula is a Senior Builder in the AWS GenAI Invocation Center. With over 20 years of experience in data and analytics and deep expertise in generative AI, he helps organizations achieve measurable business outcomes. He builds agentic AI systems that automate workflows, accelerate decision-making, reduce costs, increase productivity, and create new revenue opportunities. His work spans large enterprise customers across various industries, including retail, banking, financial services, insurance, healthcare, media and entertainment, and professional services.

Create AI-powered chat assistants for your enterprise with Amazon Quic …

Teams need instant access to enterprise data and intelligent guidance on how to use it. Instead, they get scattered information across multiple systems. This results in employees spending valuable time searching for answers instead of making decisions.
In this post, we show how to build chat agents in Amazon Quick Suite to address this problem. We walk through a three-layer framework—identity, instructions, and knowledge—that transforms Quick Suite chat agents into intelligent enterprise AI assistants. In our example, we demonstrate how our chat agent guides feature discovery, uses enterprise data to inform recommendations, and tailors solutions based on potential impact and your team's adoption readiness.
Benefits of Quick Suite chat agents
Quick Suite chat agents make advanced AI capabilities accessible to non-technical business users. Sales representatives, analysts, and domain experts can create sophisticated AI assistants without requiring deep technical expertise in machine learning or cloud infrastructure.
Quick Suite instances come with their own default system chat agent (My Assistant). Administrators can enable the ability to create custom chat agents for the users. Many users begin their Quick Suite journey by experimenting with My Assistant, discovering its AI capabilities through hands-on exploration. Users can enhance their interactions with contextual configuration: you can point the agent to specific Spaces to filter conversation scope, so responses draw from relevant organizational knowledge. You can also upload response templates or process documents directly into chat sessions to modify how the agent structures its outputs or approaches specific tasks.
Although these approaches offer immediate value and flexibility for individual users and one-off tasks, each conversation requires manual setup—selecting the right Spaces, uploading relevant templates, and providing context-specific instructions. With custom chat agents, you can capture these successful patterns into permanent, shareable solutions. You can preserve the contextual knowledge and behavioral guidelines in the agent’s persona, as well as the resource selections that make individual conversations successful, and package them into consistent, reusable agents that teams can deploy at scale. With this systematic deployment solution, individual insights become organizational assets that drive productivity gains. The solution reduces the cognitive load on users who no longer need to remember specific prompting techniques or locate the right resources for each interaction.
The three-layer foundation: Identity, instructions, and knowledge
Effective chat agents are built on three essential components that work together to create consistent, reliable AI assistants:

Identity – Defines who the agent is and what role it serves
Instructions – Specifies how the agent should think and respond
Knowledge – Provides the information the agent can access for answer search and content generation

Understanding these three layers is crucial because they determine your agent’s behavior, including its communication style and the information it can retrieve.
Identity
Identity defines who your agent is and what role it plays, which shapes how it responds to every request. You can configure an identity through the Agent identity configuration field.
Instructions
Instructions function as behavioral directives that provide granular control over agent response generation, with specificity and consistency being crucial for effectiveness. Effective prompt engineering skills become essential when crafting both identity and instructions, because the precision and clarity of these elements directly impact the agent’s ability to understand context, follow behavioral directives, and maintain consistent, persona-driven responses. You can configure your Quick Suite chat agent with instructions in the Persona instructions, Communication style, and Reference documents fields. Reference documents refer to more specific or detailed instructions, or information attached as files that you require the agent to always have and follow exactly, like templates and process documents.
Knowledge
Large language models (LLMs) power the agents. The custom chat agent provides required context to LLMs through two distinct means: instructions, as discussed in the previous section, and searchable knowledge. Quick Spaces provides the ability to pool searchable knowledge for the chat agent in different forms:

Direct file uploads (indexed knowledge)
Amazon Quick Sight dashboards and topics
Knowledge bases created from data access integrations (indexed knowledge)
Action connectors to take actions on integrated third-party tools

Spaces function as dynamic, searchable knowledge repositories that facilitate real-time access to teams’ information in structured or unstructured form, while maintaining security boundaries and supporting collaborative workflows. These are ideal for enabling semantic search capabilities over evolving knowledge bases like current business data and collaborative knowledge.
Solution overview
The Quick Suite Product Specialist is a custom chat agent to help users identify the right Quick Suite features for their specific needs. My Assistant can answer any questions related to Quick Suite; the Product Specialist chat agent takes a product specialist’s approach to support user questions and requirements. This agent acts as an intelligent advisor that matches business challenges with appropriate Quick Suite capabilities.
The Product Specialist chat agent is configured to follow a three-phased methodology: discovery, analysis, and solution recommendations. This showcases how modern AI agents should balance comprehensive platform knowledge with practical wisdom about right-sizing solutions. It can recommend simple prompts to be used with My Assistant to serve individual users, or architect complex multi-capability workflows for enterprise-wide deployment. It exemplifies the principle of matching solution complexity to actual impact potential while fostering generative AI adoption across organizations and projecting potential ROI for recommended solutions.
In the following sections, we demonstrate how to build a knowledge Space consisting of the Quick Suite User Guide documentation and then configure the Quick Suite Product Specialist chat agent.
Prerequisites
To build a custom chat agent in Quick Suite, you must have the following:

An active Quick Suite instance
A Quick Suite subscription for the required capabilities:

Professional – Create, configure, and share Spaces and custom chat agents
Enterprise (includes Professional capabilities) – Create knowledge bases

For more information about Quick Suite’s subscription tiers, see Amazon Quick Suite pricing.
Create Space with knowledge base
We first set up a Quick Space as part of the context component of the three-layered foundation we discussed previously. This Space contains a searchable knowledge base for the Amazon Quick Suite User Guide.
This step is provided for reference to show how to create indexed, searchable content for specific documentation; Quick Suite chat agents are already aware of all Quick Suite capabilities and associated implementation practices.
We can choose from two options to create our Space: a static file or a live web-crawled knowledge base.
Use a static file
This option is a static snapshot of the official Quick Suite User Guide and must be updated occasionally to incorporate latest changes and additions to the platform documentation. Complete the following steps:

Go to Amazon Quick Suite User Guide.
Choose the PDF download option under the page header to download the User Guide as a PDF file to your local machine.

On the Quick Suite console, choose Spaces in the navigation pane.
Choose Create space to create a new Space:

For Title, enter a title, such as the following:

Amazon Quick Suite Documentation Space

For Description, enter a description, such as the following:

This Quick Space contains Amazon Quick Suite User Guide file.

Choose Add knowledge and choose File uploads.
Upload the User Guide PDF.
Choose Share to manage Viewer/Owner access to the created Space.

Files uploaded to a Space use the same access permissions as the Space.

Use a live web-crawled knowledge base
This is a near real-time option in which you set up a direct connection between the documentation site and Quick Suite through a web crawler integration that indexes the documentation, with automatic refresh set to the default schedule.

On the Quick Suite console, choose Integrations in the navigation pane.
Choose Add and choose Webcrawler to add a webcrawler.

For Name, use the default name.
Select No authentication.
Choose Create and continue.

Configure the knowledge base:

For Name, enter a name, such as the following:

Amazon Quick Suite User Guide Documentation KB

For Add URLs, enter the main documentation URL:

https://docs.aws.amazon.com/quicksuite/latest/userguide/

Choose Add.
Choose Create.
On the Knowledge bases tab, choose the knowledge base you created. The knowledge base refresh is initiated automatically.
To manage access to Knowledge base, choose Add Users & groups on the Permissions tab to search and add people or groups for Viewer access.

Choose Spaces in the navigation pane.
Choose Create space to create a new Space:

For Title, enter a title, such as the following:

Amazon Quick Suite Documentation Space

For Description, enter a description, such as the following:

This Quick Space consists of connection to the web-crawled knowledge base for Amazon Quick Suite’s User Guide from AWS Documentation website.

Choose Add knowledge, then choose Knowledge bases.
Locate the knowledge base you created and choose Add.
Choose Share to manage Viewer/Owner access to the created Space.

Knowledge base permission settings are honored by Quick Suite over Space sharing settings.
The Space is now created and should be syncing the latest Quick Suite User Guide.

Create chat agent
Complete the following steps to build your own Quick Suite Product Specialist:

On the Quick Suite console, choose Chat agents in the navigation pane.
Choose Create chat agent
Choose Skip to enter Builder view to create a custom chat agent, because we know exactly what instructions and assets the chat agent needs.

For Title, enter a title, such as the following:

Quick Suite Product Specialist

For Description, enter a description, such as the following:

A comprehensive expert agent that combines Amazon Quick Suite expertise with GenAI evangelism and prompt engineering mastery. DISCOVERS users’ productivity challenges, GenAI readiness, and solution scalability needs, ANALYZES their competency and impact potential, and provides optimal SOLUTION RECOMMENDATIONS based on Amazon Quick Suite capabilities including Custom Chat Agents, Flows, Automate, Integrations, Extensions, Spaces, Research, and Quick Sight with detailed implementation guidance and projected ROI analysis.

Update the AGENT PERSONA configuration:

For Agent identity, enter details such as the following:

You are a seasoned expert in Amazon Quick Suite’s capabilities with deep knowledge of how its features can solve various internal use cases. You also serve as a GenAI Evangelist, passionate about democratizing AI adoption across organizations, and an expert Prompt Engineer with mastery in crafting effective prompts for various AI systems. You specialize in use case discovery, analyzing productivity challenges, automation opportunities, GenAI solution design, and simple to complex workflow orchestration to recommend optimal Quick Suite solutions with detailed implementation guidance and projected ROI analysis.
The Agent identity field defines the agent’s internal persona, which shapes the decisions it makes. Using the keywords “seasoned expert” establishes authority that influences response confidence and depth, while the multi-role design (“GenAI Evangelist,” “expert Prompt Engineer”) makes sure the agent can pivot between technical guidance, strategic adoption advice, and educational support. The emphasis on “use case discovery” programs the agent to prioritize understanding before recommending, establishing a consultative rather than transactional interaction pattern. The phrase “democratizing AI adoption” internally calibrates the agent to serve users at different skill levels, preventing it from defaulting to overly technical responses that might intimidate beginners. These identity choices program how it interprets queries and structures responses.
For Persona instructions, enter instructions such as the following:

For each user problem follow this 3-phased approach:
A. DISCOVERY
1. Analyze the initial use case details provided
2. Before providing any recommendations, ask clarifying questions to understand:
-Knowledge base platforms and scale of use case relevant to identifying suitable Quick Suite capability
-User’s current experience level with GenAI solutions (Beginner/Intermediate/Advanced)
-Number of potential users who would benefit from this solution (Individual/Team/Department/Organization-wide)
-Available metrics around the problem/challenge (e.g., “it takes 8 hours to do this manually today”)
-Current AI/automation tools in use and satisfaction level
-Team’s technical capabilities and change management readiness
-Wait for user confirmation before proceeding
B. ANALYSIS
1. Analyze all the user provided information including their GenAI maturity, and scalability requirements
2. Assess impact potential: High impact = high user count + significant time/effort savings; Low impact = limited users + minimal savings
3. Right sizing the solution:
-Low impact = Consider simple prompt-based solutions using default Chat Agent (My Assistant)
-High impact = Recommend dedicated Quick Suite capabilities
-Avoid unnecessary complexity when simple solutions suffice
4. Calculate potential ROI in terms of time savings multiplied by user count
5. CAPABILITY VERIFICATION PROTOCOL:
– Before recommending any specific Quick Suite feature, verify the exact capability exists in available documentation
– Clearly distinguish between Quick Flows (interactive, on-demand workflows) and Quick Automate (scheduled automation with triggers)
– If uncertain about a capability, explicitly state limitations and provide documented alternatives
– Never assume features exist without documentation confirmation
– When correcting previous errors, acknowledge the mistake and provide accurate information based on verified documentation
– Use the documentation knowledgebase available through the attached Space to validate capabilities before making recommendations
C. SOLUTION RECOMMENDATIONS
1. List appropriate Quick Suite capabilities with scalability-matched options:
-For low impact: Start with optimized prompts for default chat agent (My Assistant) or basic Quick Sight BI functionalities as suitable for the use case
-For moderate-high impact: assess and recommend dedicated scalable solutions (aligning with the use case) built as custom chat agent, Flows, Automation projects, required Integrations, Extensions for web browser/Slack/Teams/Outlook/Word specific use cases, relevant Spaces, Research, Quick Sight
-Present multiple options when applicable, prioritizing simplicity when impact doesn’t justify complexity
2. Provide clear reasoning for each suggested capability including:
-Impact-to-complexity analysis
-Scalability considerations (user adoption, maintenance, governance)
-Pros & Cons with emphasis on right-sizing the solution
-Detailed ROI projections including potential time savings multiplied by user count and estimated implementation costs (e.g., “suggested solution would save 7 hours per person across 50 users = 350 hours total weekly savings, equivalent to $X in productivity gains”)
-GenAI adoption benefits and change management considerations
-Prompt engineering best practices for Chat Agents when applicable
3. Ask if they want prescriptive implementation guidance, if they do, then provide detailed solution building pathways including:
-Step-by-step implementation approach starting with minimum viable solution
-Scaling pathway from simple to complex as adoption grows
-Prompt engineering templates and best practices
-GenAI adoption strategies and success metrics
-ROI tracking and measurement recommendations
-Change management recommendations
The three-phase methodology (discovery, analysis, solution recommendations) gives the agent best practices and guidelines on the kind of information it needs to collect to inform its recommendations, so its ability to get data about these features is augmented by user-specified context that is relevant to the recommended solutions.

For Tone, enter a description to calibrate emotional intelligence and approachability:

Professional, consultative, thorough, and evangelistic about GenAI potential while emphasizing practical, right-sized solutions. Ask clarifying questions to ensure accurate recommendations while inspiring confidence in AI adoption without over-engineering.

For Response format, configure the structural patterns (conversational vs. prescriptive, lists vs. paragraphs) that match different interaction phases:

Conversational in DISCOVERY phase with competency and scalability assessment questions. Always ask follow-up questions for clarity before concluding suggestions. Prescriptive in SOLUTION RECOMMENDATIONS phase: Provide structured recommendations with clear reasoning, impact analysis, prompt engineering guidance, and GenAI adoption strategies. Use numbered lists for capabilities and bullet points for implementation details.

For Length, set phase-appropriate boundaries to prevent both overwhelming verbosity and insufficient detail:

Succinct and to-the-point in DISCOVERY phase. For SOLUTION RECOMMENDATIONS phase: Comprehensive enough to cover all relevant Quick Suite capabilities with detailed reasoning, scalability analysis, prompt engineering best practices, and GenAI evangelism insights, but organized for easy scanning.

For Reference documents, you can provide reference documents that give additional guidance to the agent on enterprise considerations and guardrails to keep in mind while recommending solutions, as well as additional nuances about the different features to factor for solution complexity. For this example, we don’t upload additional documents.

For KNOWLEDGE SOURCES:

Choose Link spaces
Choose the Space you created earlier and choose Link.

Linking the Space makes sure the agent can verify capabilities against actual product documentation. The Space architecture maintains enterprise security by honoring underlying data source permissions, allowing AI deployment without compromising existing security permissions. The web crawler option for live documentation makes sure the agent’s knowledge stays current as the platform evolves.

For ACTIONS, set up relevant third-party platform integrations. For example, add one of your enterprise collaboration tools, such as Slack or Teams, for sharing the implementation recommendations from this agent with your team.

Action integrations extend capabilities beyond conversation to actual workflow execution. This dynamic knowledge approach configures an adaptive assistant that validates recommendations against current information, accesses real business data, and executes actions, all while respecting organizational security boundaries.

Update CUSTOMIZATION

For Welcome Message, enter a message such as the following:

Hello! I’m your Quick Suite Product Specialist, GenAI Evangelist, and Pro Prompt Engineer. Let’s DISCOVER your productivity challenge, assess its scalability potential and your GenAI readiness, and I’ll recommend the right-sized SOLUTION that maximizes impact, complete with projected ROI analysis.

For Suggested prompts, enter suggestions that end-users of this chat might use as quick start prompts to talk to the agent:

“What Quick Suite capability can help me with my productivity/automation use case?”
“How can I maximize impact with the simplest possible GenAI solution for my use case?”
“I’m new to GenAI – what’s the best Quick Suite solution to start with for my use case?”

Choose Update preview, test the chat agent, and make adjustments as necessary.
Choose Launch chat agent to publish the agent.
Choose Share to share access to the chat agent as necessary.

Test the chat agent
Let’s demonstrate the capabilities of the Quick Suite Product Specialist that you created:

On the Quick Suite console, choose Chat agents in the navigation pane.
Select the Quick Suite Product Specialist chat agent you created.
On the Actions menu, choose the Chat link.
Send the following request to the agent: “I want to get help in formatting my weekly status emails.”

The agent takes the initial prompt and returns a detailed discovery questionnaire to better understand your use case, without jumping to recommendations. You might notice differences from run to run, and might not see the same questionnaire and chat agent responses as shown in the example in this post.

Review and respond to the questionnaire.

The agent returns a comprehensive response including assessment of impact, multiple solution recommendations with reasoning, and high-level implementation pathway options, letting you choose your solution options, and receive prescriptive implementation guidance.

Continue interacting with the agent to get detailed implementation guidance. Try out the chat agent on your own use cases, build out recommended solutions, and learn from your interactions.

Clean up
When you are ready to remove the custom chat agent from your Quick Suite setup, clean up the resources to avoid potential additional indexing costs:

Delete the knowledge base:

On the Quick Suite console, choose Integrations in the navigation pane, then choose Knowledge bases.
Choose the options menu (three dots) next to the knowledge base you created.
Choose Delete knowledge base and follow the prompts to delete the knowledge base.

Delete the Space:

On the Quick Suite console, choose Spaces in the navigation pane.
Choose the options menu (three dots) next to the Space you created.
Choose Delete and follow the prompts to delete the Space.

Delete the chat agent:

On the Quick Suite console, choose Chat agents in the navigation pane.
Choose the options menu (three dots) next to the chat agent you created.
Choose Delete and follow the prompts to delete the chat agent.

Key takeaways
Building effective chat agents requires intentional design across three foundational layers. The Quick Suite Product Specialist demonstrates these principles in action:

Specificity drives consistency – Rather than hoping the LLM will determine the right approach, you can provide explicit identity definitions, behavioral constraints, decision frameworks, and output formats to transform generic AI into reliable expert assistants.
Structure prevents common failures – The three-phase methodology (discovery, analysis, solution recommendations) shows how systematic approaches guide users to right-size solutions, only after understanding the problem.
Dynamic knowledge maintains relevance – Linking live documentation and permission-aware Spaces makes sure agents validate recommendations against current information while respecting organizational security boundaries.

Conclusion
Custom chat agents in Quick Suite can transform how teams access and use enterprise knowledge. By applying the three-layer framework—identity, instructions, and knowledge—you can create AI assistants that deliver instant, accurate answers while maintaining enterprise security and compliance. The Quick Suite Product Specialist example demonstrates how structured methodologies and careful configuration turn generic AI into specialized experts that guide users to the right solutions for their specific needs.
Start with a focused use case that demonstrates clear ROI, then expand as adoption grows. Custom chat agents can deliver measurable productivity gains, helping teams find information faster, automating repetitive workflows, or providing expert guidance at scale. To learn more about creating and deploying Quick Suite chat agents, see Create, customize, and deploy AI-powered chat agents in Amazon Quick Suite.

About the authors
Nitish Chaudhari is a Senior Customer Solutions Manager at AWS, where he partners with customers to architect and implement generative AI solutions. He specializes in building collaborating agents, chat agents, and automation flows with Amazon Quick Suite and Amazon Bedrock that help teams solve real-world productivity challenges at scale. Before joining AWS, Nitish led product teams in the energy sector, and he now works closely with customers and AWS service teams to shape the next generation of generative AI capabilities.
Sindhu Santhanakrishnan is a Senior Product Manager at AWS, where she leads the development of custom agent capabilities in Amazon Quick Suite. She has played a key role in AWS’s automation journey, being part of the Q Apps launch, leading Q Actions in Q Business, and most recently driving the successful launch of chat agents in Quick Suite. She specializes in building business-focused automation solutions, with a background in launching zero-to-one products and customer data platforms. Sindhu holds a Master’s in Product Management from Carnegie Mellon University.
Vinayak Datar is a Senior Solutions Manager based in the Bay Area, helping enterprise customers accelerate their AWS Cloud journey. He focuses on helping customers convert ideas from concept to working prototype to production using AWS generative AI services.

Interview: From CUDA to Tile-Based Programming: NVIDIA’s Stephen Jon …

As AI models grow in complexity and hardware evolves to meet the demand, the software layer connecting the two must also adapt. We recently sat down with Stephen Jones, a Distinguished Engineer at NVIDIA and one of the original architects of CUDA.

Jones, whose background spans from fluid mechanics to aerospace engineering, offered deep insights into NVIDIA’s latest software innovations, including the shift toward tile-based programming, the introduction of “Green Contexts,” and how AI is rewriting the rules of code development.

Here are the key takeaways from our conversation.

The Shift to Tile-Based Abstraction

For years, CUDA programming has revolved around a hierarchy of grids, blocks, and threads. With the latest updates, NVIDIA is introducing a higher level of abstraction: CUDA Tile.

According to Jones, this new approach allows developers to program directly to arrays and tensors rather than managing individual threads. “It extends the existing CUDA,” Jones explained. “What we’ve done is we’ve added a way to talk about and program directly to arrays, tensors, vectors of data… allowing the language and the compiler to see what the high-level data was that you’re operating on opened up a whole realm of new optimizations”.

This shift is partly a response to the rapid evolution of hardware. As Tensor Cores become larger and denser to combat the slowing of Moore’s Law, the mapping of code to silicon becomes increasingly complex.

Future-Proofing: Jones noted that when developers express programs as vector operations (e.g., Tensor A times Tensor B), the compiler takes on the heavy lifting of mapping the data to the specific hardware generation.

Stability: This ensures that program structure remains stable even as the underlying GPU architecture changes from Ampere to Hopper to Blackwell.

Python First, But Not Python Only

Recognizing that Python has become the lingua franca of Artificial Intelligence, NVIDIA launched CUDA Tile support with Python first. “Python’s the language of AI,” Jones stated, adding that an array-based representation is “much more natural to Python programmers” who are accustomed to NumPy.
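As a rough conceptual analogy, plain NumPy (not the CUDA Tile API) already shows the shift Jones describes: moving from per-element loops, where the programmer owns the index arithmetic, to whole-array expressions that leave the mapping to the compiler and runtime.

```python
# Conceptual analogy only: this is plain NumPy, not CUDA Tile. It contrasts
# per-element indexing (the thread-level mental model) with array-level
# expressions (the tile/tensor-level mental model).
import numpy as np

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

# Thread-style: each (i, j) element is computed explicitly, and the programmer
# owns the mapping of work to indices.
C_loop = np.empty_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        C_loop[i, j] = A[i, j] * 2.0 + B[i, j]

# Array/tile-style: express the whole operation on tensors and let the
# compiler and runtime decide how it maps onto the hardware.
C_array = A * 2.0 + B

assert np.allclose(C_loop, C_array)
```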

However, performance purists need not worry. C++ support is arriving next year, maintaining NVIDIA’s philosophy that developers should be able to accelerate their code regardless of the language they choose.

“Green Contexts” and Reducing Latency

For engineers deploying Large Language Models (LLMs) in production, latency and jitter are critical concerns. Jones highlighted a new feature called Green Contexts, which allows for precise partitioning of the GPU.

“Green contexts lets you partition the GPU… into different sections,” Jones said. This allows developers to dedicate specific fractions of the GPU to different tasks, such as running pre-fill and decode operations simultaneously without them competing for resources. This micro-level specialization within a single GPU mirrors the disaggregation seen at the data center scale.
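The partitioning idea itself is simple to picture. The sketch below is purely illustrative plain Python, not the Green Contexts API: it only models carving a GPU's streaming multiprocessors into disjoint sets so that prefill and decode never contend for the same execution resources.

```python
# Purely illustrative: NOT the CUDA Green Contexts API. This only models the idea
# of splitting a GPU's SMs into disjoint partitions for different workloads.
TOTAL_SMS = 132  # assumed SM count for a modern data center GPU

def split_sms(total: int, fraction_for_prefill: float):
    n_prefill = int(total * fraction_for_prefill)
    prefill = set(range(n_prefill))          # SMs dedicated to prefill
    decode = set(range(n_prefill, total))    # SMs dedicated to decode
    assert prefill.isdisjoint(decode)        # the partitions never overlap
    return prefill, decode

prefill_sms, decode_sms = split_sms(TOTAL_SMS, fraction_for_prefill=0.75)
print(len(prefill_sms), len(decode_sms))  # 99 33
```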

No Black Boxes: The Importance of Tooling

One of the pervasive fears regarding high-level abstractions is the loss of control. Jones, drawing on his experience as a CUDA user in the aerospace industry, emphasized that NVIDIA tools will never be black boxes.

“I really believe that the most important part of CUDA is the developer tools,” Jones affirmed. He assured developers that even when using tile-based abstractions, tools like Nsight Compute will allow inspection down to the individual machine language instructions and registers. “You’ve got to be able to tune and debug and optimize… it cannot be a black box,” he added.

Accelerating Time-to-Result

Ultimately, the goal of these updates is productivity. Jones described the objective as “left shifting” the performance curve, enabling developers to reach 80% of potential performance in a fraction of the time.

“If you can come to market [with] 80% of performance in a week instead of a month… then you’re spending the rest of your time just optimizing,” Jones explained. Crucially, this ease of use does not come at the cost of power; the new model still provides a path to 100% of the peak performance the silicon can offer.

Conclusion

As AI algorithms and scientific computing converge, NVIDIA is positioning CUDA not just as a low-level tool for hardware experts, but as a flexible platform that adapts to the needs of Python developers and HPC researchers alike. With support extending from Ampere to the upcoming Blackwell and Rubin architectures, these updates promise to streamline development across the entire GPU ecosystem.

For the full technical details on CUDA Tile and Green Contexts, visit the NVIDIA developer portal.

Jina AI Releases Jina-VLM: A 2.4B Multilingual Vision Language Model F …

Jina AI has released Jina-VLM, a 2.4B parameter vision language model that targets multilingual visual question answering and document understanding on constrained hardware. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention pooling connector to reduce visual tokens while preserving spatial structure. Among open 2B scale VLMs, it reaches state of the art results on multilingual benchmarks such as MMMB and Multilingual MMBench.

https://arxiv.org/pdf/2512.04032

Architecture, overlapping tiles with attention pooling connector

Jina-VLM keeps the standard VLM layout, but optimizes the vision side for arbitrary resolution and low token count. The vision encoder is SigLIP2 So400M/14 384, a 27 layer Vision Transformer with about 400M parameters. It processes 378×378 pixel crops into a 27×27 grid of 14×14 patches, so each tile produces 729 patch tokens.

To handle high resolution images, the model does not resize the full input to a single square. Instead, it constructs a grid of up to 12 overlapping tiles along with a global thumbnail. Each tile is a 378×378 crop, adjacent tiles overlap by 112 pixels, and the stride between tile origins is 266 pixels. A 4×3 grid covers an effective resolution of 1176×910 pixels; larger images are downscaled to fit within the tile budget.
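The tile geometry is easy to verify with a few lines of arithmetic, using the numbers quoted above:

```python
# Reproduces the tiling arithmetic described above (values from the Jina-VLM paper).
TILE = 378                 # tile side in pixels
PATCH = 14                 # ViT patch size
OVERLAP = 112              # overlap between adjacent tiles
STRIDE = TILE - OVERLAP    # 266-pixel stride between tile origins

patches_per_side = TILE // PATCH           # 27
tokens_per_tile = patches_per_side ** 2    # 729 patch tokens per tile

cols, rows = 4, 3                          # the 4x3 grid mentioned above
effective_w = STRIDE * (cols - 1) + TILE   # 1176
effective_h = STRIDE * (rows - 1) + TILE   # 910

print(tokens_per_tile, effective_w, effective_h)  # 729 1176 910
```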

The core design element is the vision language connector. Rather than using the final ViT layer, Jina-VLM concatenates features from two intermediate layers, the third from last and ninth from last, which correspond to layers 24 and 18. This combines high level semantics with mid level spatial detail. The connector then applies attention pooling over 2×2 patch neighborhoods. It computes a mean pooled query for each 2×2 region, attends over the full concatenated feature map, and outputs a single pooled token per neighborhood. This reduces the 729 visual tokens per tile to 182 tokens, roughly a 4 times compression. A SwiGLU projection maps the pooled features to the Qwen3 embedding dimension.
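Below is a minimal PyTorch sketch of an attention pooling connector in this style. It assumes single-head attention, a plain linear output in place of the SwiGLU projection, toy dimensions, and an even patch grid, so it illustrates the mechanism rather than reproducing the exact Jina-VLM module.

```python
# Minimal sketch of attention pooling over 2x2 patch neighborhoods (assumptions:
# single head, linear output instead of SwiGLU, toy dims, even grid size).
import torch


class AttentionPoolingConnector(torch.nn.Module):
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.out = torch.nn.Linear(dim, out_dim)  # stand-in for the SwiGLU projection

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, D) concatenated intermediate ViT features
        B, H, W, D = feats.shape
        # Mean-pool each 2x2 neighborhood to form one query per region.
        pooled = feats.view(B, H // 2, 2, W // 2, 2, D).mean(dim=(2, 4))
        q = self.q_proj(pooled.reshape(B, -1, D))   # (B, H*W/4, D) queries
        k = self.k_proj(feats.reshape(B, -1, D))    # (B, H*W, D) keys over the full map
        v = self.v_proj(feats.reshape(B, -1, D))
        attn = torch.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)
        return self.out(attn @ v)                   # one pooled token per 2x2 neighborhood


# Toy example: a 26x26 grid (676 patch tokens) pools down to 169 tokens, 4x fewer.
x = torch.randn(1, 26, 26, 64)
print(AttentionPoolingConnector(64, 128)(x).shape)  # torch.Size([1, 169, 128])
```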

With the default 12 tile configuration plus thumbnail, a naive connector would feed 9,477 visual tokens into the language model. Attention pooling cuts this to 2,366 visual tokens. The ViT compute does not change, but for the language backbone this yields about 3.9 times fewer prefill FLOPs and a 4 times smaller KV cache. Including the shared ViT cost, overall FLOPs drop by about 2.3 times for the default setting.
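Those token counts follow directly from the per-tile numbers, as a quick check confirms (using the 729 and 182 tokens per tile reported above):

```python
# Token budget for the default configuration: 12 tiles plus 1 global thumbnail.
tiles = 12 + 1
naive_tokens = tiles * 729    # 9,477 visual tokens without pooling
pooled_tokens = tiles * 182   # 2,366 visual tokens after attention pooling
print(naive_tokens, pooled_tokens, round(naive_tokens / pooled_tokens, 2))  # 9477 2366 4.01
```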

The language decoder is Qwen3-1.7B-Base. The model introduces special tokens for images, with <im_start> and <im_end> around the tile sequence and <im_col> to mark rows in the patch grid. Visual tokens from the connector and text embeddings are concatenated and passed to Qwen3 to generate answers.

Training pipeline and multilingual data mix

Training proceeds in 2 stages. All components (encoder, connector, and decoder) are updated jointly, without freezing. The full corpus contains about 5M multimodal samples and 12B text tokens across more than 30 languages. Roughly half of the text is English, and the rest covers high and mid resource languages such as Chinese, Arabic, German, Spanish, French, Italian, Japanese and Korean.

Stage 1 is alignment training. The goal is cross language visual grounding, not instruction following. The team uses caption heavy datasets PixmoCap and PangeaIns, which span natural images, documents, diagrams and infographics. They add 15 percent text only data from the PleiAS common corpus to control degradation on pure language tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder to speed up adaptation without destabilizing the backbones.
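One common way to implement a connector-specific learning rate and warmup is with optimizer parameter groups. The sketch below uses PyTorch with stand-in modules, and the learning rates and warmup lengths are invented, since the exact Jina-VLM values are not given here.

```python
# Hedged sketch: per-component learning rates and warmup via parameter groups.
# The modules are stand-ins and the numeric values are assumptions, not the
# actual Jina-VLM hyperparameters.
import torch

encoder = torch.nn.Linear(1152, 1152)    # stand-in for the SigLIP2 encoder
connector = torch.nn.Linear(1152, 2048)  # stand-in for the attention pooling connector
decoder = torch.nn.Linear(2048, 2048)    # stand-in for the Qwen3 decoder

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(),   "lr": 1e-5},  # backbone: conservative LR (assumed)
    {"params": decoder.parameters(),   "lr": 1e-5},  # backbone: conservative LR (assumed)
    {"params": connector.parameters(), "lr": 1e-4},  # connector: higher LR (assumed)
])

def linear_warmup(total_steps: int):
    # Returns an LR multiplier that ramps linearly from 0 to 1 over total_steps.
    return lambda step: min(1.0, step / total_steps)

# One lambda per parameter group; the connector gets a shorter warmup.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=[linear_warmup(1000), linear_warmup(1000), linear_warmup(200)]
)
```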

Stage 2 is instruction fine tuning. Here Jina-VLM learns to follow prompts for visual question answering and reasoning. The mix combines LLaVA OneVision, Cauldron, Cambrian, PangeaIns and FineVision, plus Aya style multilingual text only instructions. The Jina research team first trains for 30,000 steps with single source batches, then for another 30,000 steps with mixed source batches. This schedule stabilizes learning in the presence of very heterogeneous supervision.

Across pretraining and fine tuning, the model sees about 10B tokens in the first stage and 37B tokens in the second stage, with a total of roughly 1,300 GPU hours reported for the main experiments.

Benchmark profile, 2.4B model with multilingual strength

On standard English VQA tasks that include diagrams, charts, documents, OCR and mixed scenes, Jina-VLM reaches an average score of 72.3 across 8 benchmarks. These are AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED Bench 2 Plus and CharXiv. This is the best average among the 2B scale comparison models in this research paper from Jina AI.

On multimodal comprehension and real world understanding tasks, the model scores 67.4 on the multimodal group, which includes MME, MMB v1.1 and MMStar. It scores 61.9 on the real world group, which includes RealWorldQA, MME RealWorld and R Bench, and it reaches 68.2 accuracy on RealWorldQA itself, which is the best result among the baselines considered.


Multi-image reasoning is a weaker area. On BLINK, MuirBench and MMT, Jina-VLM reaches an average of 47.3. The research team points to limited multi-image training data as the reason. In contrast, hallucination control is strong. On the POPE benchmark, which measures object hallucination, the model scores 90.3, the best score in the comparison table.

For mathematical and structured reasoning, the model uses the same architecture, without thinking mode. It reaches 59.5 on MMMU and an overall math score of 33.3 across MathVista, MathVision, MathVerse, WeMath and LogicVista. Jina-VLM is comparable to InternVL3-2B on this set and clearly ahead of Qwen2-VL-2B, while InternVL3.5-2B remains stronger due to its larger scale and more specialized math training.

On pure text benchmarks, the picture is mixed. The research team reports that Jina-VLM keeps most of the Qwen3-1.7B performance on MMLU, GSM8K, ARC-C and HellaSwag. However, MMLU-Pro drops from 46.4 for the base model to 30.3 after multimodal tuning. The team attributes this to instruction tuning that pushes the model toward very short answers, which clashes with the long multi step reasoning required by MMLU-Pro.

The main highlight is multilingual multimodal understanding. On MMMB across Arabic, Chinese, English, Portuguese, Russian and Turkish, Jina-VLM reaches an average of 78.8. On Multilingual MMBench across the same languages, it reaches 74.3. The research team reports these as state of the art averages among open 2B scale VLMs.

Comparison Table

| Model | Params | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |

Key Takeaways

Jina-VLM is a 2.4B parameter VLM that couples SigLIP2 So400M as vision encoder with Qwen3-1.7B as language backbone through an attention pooling connector that cuts visual tokens by 4 times while keeping spatial structure.

The model uses overlapping 378×378 tiles, 12 tiles plus a global thumbnail, to handle arbitrary resolution images up to roughly 4K, then feeds only pooled visual tokens to the LLM, which reduces prefill FLOPs and KV cache size by about 4 times compared to naive patch token usage.

Training uses about 5M multimodal samples and 12B text tokens across nearly 30 languages in a 2 stage pipeline, first alignment with caption style data, then instruction fine tuning with LLaVA OneVision, Cauldron, Cambrian, PangeaIns, FineVision and multilingual instruction sets.

On English VQA, Jina-VLM reaches 72.3 average across 8 VQA benchmarks, and on multilingual multimodal benchmarks it leads the open 2B scale class with 78.8 on MMMB and 74.3 on Multilingual MMBench while keeping competitive text only performance.

Check out the Paper, Model on Hugging Face, and Technical details.