Google DeepMind Introduces CodeMender: A New AI Agent that Uses Gemini …

What if an AI agent could localize a root cause, prove a candidate fix via automated analysis and testing, and proactively rewrite related code to eliminate the entire vulnerability class—then open an upstream patch for review? Google DeepMind introduces CodeMender, an AI agent that generates, validates, and upstreams fixes for real-world vulnerabilities using Gemini “Deep Think” reasoning and a tool-augmented workflow. In six months of internal deployment, CodeMender contributed 72 security patches across open-source projects, including codebases up to ~4.5M lines, and is designed to act both reactively (patching known issues) and proactively (rewriting code to remove vulnerability classes).

Understanding the Architecture

The agent couples large-scale code reasoning with program-analysis tooling: static and dynamic analysis, differential testing, fuzzing, and satisfiability-modulo-theory (SMT) solvers. A multi-agent design adds specialized “critique” reviewers that inspect semantic diffs and trigger self-corrections when regressions are detected. These components let the system localize root causes, synthesize candidate patches, and automatically regression-test changes before surfacing them for human review.

https://deepmind.google/discover/blog/introducing-codemender-an-ai-agent-for-code-security/

Validation Pipeline and Human Gate

DeepMind emphasizes automatic validation before any human touches a patch: the system tests for root-cause fixes, functional correctness, absence of regressions, and style compliance; only high-confidence patches are proposed for maintainer review. This workflow is explicitly tied to Gemini Deep Think’s planning-centric reasoning over debugger traces, code search results, and test outcomes.

Proactive Hardening: Compiler-Level Guards

Beyond patching, CodeMender applies security-hardening transforms at scale. Example: automated insertion of Clang’s -fbounds-safety annotations in libwebp to enforce compiler-level bounds checks—an approach that would have neutralized the 2023 libwebp heap overflow (CVE-2023-4863) exploited in a zero-click iOS chain and similar buffer over/underflows where annotations are applied.

Case Studies

DeepMind details two non-trivial fixes: (1) a crash initially flagged as a heap overflow traced to incorrect XML stack management; and (2) a lifetime bug requiring edits to a custom C-code generator. In both cases, agent-generated patches passed automated analysis and an LLM-judge check for functional equivalence before proposal.

Deployment Context and Related Initiatives

Google’s broader announcement frames CodeMender as part of a defensive stack that includes a new AI Vulnerability Reward Program (consolidating AI-related bounties) and the Secure AI Framework 2.0 for agent security. The post reiterates the motivation: as AI-powered vulnerability discovery scales (e.g., via BigSleep and OSS-Fuzz), automated remediation must scale in tandem.

Our Comments

CodeMender operationalizes Gemini Deep Think plus program-analysis tools (static/dynamic analysis, fuzzing, SMT) to localize root causes and propose patches that pass automated validation before human review. Reported early data: 72 upstreamed security fixes across open-source projects over six months, including codebases on the order of ~4.5M lines. The system also applies proactive hardening (e.g., compiler-enforced bounds via Clang -fbounds-safety) to reduce memory-safety bug classes rather than only patching instances. No latency or throughput benchmarks are published yet, so impact is best measured by validated fixes and scope of hardened code.

Check out the technical details.
The post Google DeepMind Introduces CodeMender: A New AI Agent that Uses Gemini Deep Think to Automatically Patch Critical Software Vulnerabilities appeared first on MarkTechPost.

Automate Amazon QuickSight data stories creation with agentic AI using …

Amazon QuickSight data stories support global customers by transforming complex data into interactive narratives for faster decisions. However, manual creation of multiple daily data stories consumes significant time and resources, delaying critical decisions and preventing teams from focusing on valuable analysis.
Each organization has multiple business units, and each business unit creates and operates multiple dashboards based on specific reporting requirements. Users create various data stories from these dashboards according to their needs. Because data story creation is currently a manual process and users need to develop multiple narratives, it consumes significant time. By automating this process, organizations can dramatically improve productivity, so users can redirect their time toward making data-driven decisions.
In this post, we demonstrate how Amazon Nova Act automates QuickSight data story creation, saving time so you can focus on making critical, data-driven business decisions.
Amazon Nova Act modernizes web browser automation, which helps in performing complex, real-world tasks through web interfaces. Unlike traditional large language models (LLMs) focused on conversation, Amazon Nova Act emphasizes action-oriented capabilities by breaking down complex tasks into reliable atomic commands. This transformative technology advances autonomous automation with minimal human supervision, making it particularly valuable for business productivity and IT operations.
QuickSight data stories transform complex data into interactive presentations that guide viewers through insights. They automatically combine visualizations, text, and images to bridge the gap between analysts and stakeholders, helping organizations communicate data effectively and make faster decisions while maintaining professional standards.
With the automation capabilities of Amazon Nova Act, you can automatically generate data stories, reducing time-consuming manual efforts. Using browser automation, Amazon Nova Act seamlessly interacts with QuickSight to create customized data narratives. By combining the automation of Amazon Nova Act with the robust visualization capabilities of QuickSight, you can minimize repetitive tasks and accelerate data-driven decision-making across teams.
Solution overview
In our solution, QuickSight transforms complex data into interactive narratives through data stories, enabling faster decisions. Amazon Nova Act transforms web browser automation by enabling AI agents to execute complex tasks autonomously, streamlining operations for enhanced business productivity.
Prompt best practices
Amazon Nova Act achieves optimal results by breaking down prompts into distinct act() calls, similar to providing step-by-step instructions. At the time of writing, this is the recommended approach for building repeatable, reliable, simple-to-maintain workflows. In this section, we discuss some prompt best practices.
First, be prescriptive and succinct in what the agent should do. For example, don’t use the following code:
nova.act("Select the SaaS-Sales dataset")
We recommend the following prompt instead:
nova.act("Click on Datasets option on the left-hand side and then select SaaS-Sales dataset")
Additionally, we recommend breaking up large actions into smaller ones. For example, don’t use the following code:
nova.act("Publish dashboard as 'test-dashboard'")
The following prompt is broken up into separate actions:
nova.act("select Analyses on the left-hand side")
nova.act("select the 'SaaS-Sales analysis'")
nova.act("select 'PUBLISH' from the top right-hand corner")
nova.act("In the 'Publish dashboard' dialog box, locate the input field labeled 'Dashboard name'. Enter 'test_dashboard' into this field")
nova.act("Select PUBLISH DASHBOARD")
Prerequisites
The following prerequisites are needed to create and publish a QuickSight data story using Amazon Nova Act:

An API key for authentication. To generate an API key, refer to Amazon Nova Act.
For Amazon Nova Act prerequisites and installation instructions, refer to the GitHub repo.
A Pro user (author or reader) to create QuickSight data stories.
A published QuickSight dashboard containing the visuals required for your QuickSight data story.

For Windows users, complete the following setup and installation steps in Windows PowerShell:

Create a virtual environment: python -m venv venv
Activate the virtual environment: venv\Scripts\activate
Set your API key as an environment variable: $Env:NOVA_ACT_API_KEY="your_api_key"
Install Amazon Nova Act: pip install nova-act
To run a script (Python file), use the following command, and specify the script name you want to run: python <script_name>.py

To keep it simple, we have hardcoded some of the values. You can implement programming logic using Python features to accept these values as input parameters.
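For example, the following is a minimal sketch of how the sign-in values could be accepted as command-line parameters instead of being hardcoded; the argument names and prompt wording are illustrative assumptions, not part of the original script:

import argparse
from getpass import getpass

from nova_act import NovaAct

# Accept the values that are otherwise hardcoded as command-line parameters.
parser = argparse.ArgumentParser(description="Automate QuickSight data story creation")
parser.add_argument("--account-name", required=True, help="QuickSight account name")
parser.add_argument("--username", required=True, help="QuickSight user name")
parser.add_argument("--story-description", default="Country wide sales data story")
args = parser.parse_args()

nova = NovaAct(starting_page="https://quicksight.aws.amazon.com/")
nova.start()

nova.act(f"enter QuickSight account name {args.account_name} and select Next")
nova.act(f"Enter username {args.username} and click on the password field")
nova.page.keyboard.type(getpass())  # collect the password securely at the prompt
nova.act("Click Sign in")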
There are multiple ways to write prompts. In the following sections, we provide examples demonstrating how to automate QuickSight data story creation and distribution.
Setup
Run the following code to import the NovaAct class from the nova_act module, create an Amazon Nova instance beginning at the QuickSight login page, and initiate an automated browser session:

from nova_act import NovaAct

nova = NovaAct(starting_page="https://quicksight.aws.amazon.com/")

nova.start()

Sign in with credentials
After you have opened the QuickSight login page, complete the following steps to log in with your credentials:

Enter your QuickSight account name and choose Next. (Specify the QuickSight account name in the following code, or implement programming logic to handle it as an input parameter.) nova.act("enter QuickSight account name <Account Name> and select Next")
Enter your user name and move to the password field. (You can configure the user name as an input parameter using programming logic.) nova.act("Enter username and click on the password field")
Collect the password from the command line and enter it using Playwright (this requires from getpass import getpass): nova.page.keyboard.type(getpass())
Now that the user name and password are filled in, choose Sign in. nova.act("Click Sign in")

If the agent is unable to focus on the page element (in this case, the password field), you can use the following code:
nova.act("enter '' in the password field")
nova.page.keyboard.type(getpass())
Create a new data story
On the QuickSight console, choose Data stories in the navigation pane:
nova.act("Select Data stories on the left side menu")
nova.act("Select NEW DATA STORY")

To build the data story, you must complete the following steps:

Describe the data story
Select visuals from the dashboard
Build the data story

import time  # required for the wait below

nova.act("Please enter 'Country wide sales data story' into the 'Describe your data story' field and Click on + ADD")
nova.act("select all the visuals and select BUILD")
time.sleep(300)

In this example, the script defaults to a single dashboard (Demo Dashboard). For multiple dashboards, include a prompt to select the specific dashboard and its visuals for the data story. Additionally, you can describe the data story according to your requirements. If there are multiple visuals, you can select the ones you want to include as part of the data story. Adjust the time.sleep duration based on dashboard data volume and the number of visuals being compiled.
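For example, a hedged sketch of prompts for that multi-dashboard case follows; the dashboard and visual names here are hypothetical and should be adjusted to your environment:

# Hypothetical dashboard and visual names; adjust the prompts to your environment.
nova.act("Select the 'Sales Dashboard' from the list of available dashboards")
nova.act("select the 'Sales by Country' and 'Monthly Revenue' visuals and select BUILD")
time.sleep(300)  # adjust based on data volume and the number of visuals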
To view your data story, choose Data stories in the navigation pane and choose your data story.

Clean up
Complete the following steps to delete the data story you created:

Sign in to the QuickSight console.
Choose Data stories in the navigation pane.
Find the data story you want to delete.
Choose the options menu icon (three dots) next to the story.
Choose Delete from the dropdown menu.

Conclusion
In this post, we demonstrated how to create a QuickSight data story using Amazon Nova Act prompts. This solution showcases how Amazon Nova Act simplifies task automation, significantly boosting productivity and saving valuable time.
To learn more about Amazon Nova Act and QuickSight data stories, check out the following resources:

Amazon Nova Act GitHub repo
Introducing Amazon Nova Act
Working with data stories in Amazon QuickSight

About the author
Satish Bhonsle is a Senior Technical Account Manager at AWS. He is passionate about customer success and technology. He loves working backwards by quickly understanding strategic customer objectives, aligning them to software capabilities and effectively driving customer success.

Implement automated monitoring for Amazon Bedrock batch inference

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API, along with capabilities to build generative AI applications with security, privacy, and responsible AI.
Batch inference in Amazon Bedrock is for larger workloads where immediate responses aren’t critical. With a batch processing approach, organizations can analyze substantial datasets efficiently, with significant cost advantages: you can benefit from a 50% reduction in pricing compared to the on-demand option. This makes batch inference particularly valuable for handling extensive data to get inference from Amazon Bedrock FMs.
As organizations scale their use of Amazon Bedrock FMs for large-volume data processing, implementing effective monitoring and management practices for batch inference jobs becomes an important focus area for optimization. This solution demonstrates how to implement automated monitoring for Amazon Bedrock batch inference jobs using AWS serverless services such as AWS Lambda, Amazon DynamoDB, and Amazon EventBridge, reducing operational overhead while maintaining reliable processing of large-scale batch inference workloads. Through a practical example in the financial services sector, we show how to build a production-ready system that automatically tracks job status, provides real-time notifications, and maintains audit records of processing activities.
Solution overview
Consider a scenario where a financial services company manages millions of customer interactions and data points, including credit histories, spending patterns, and financial preferences. This company recognized the potential of using advanced AI capabilities to deliver personalized product recommendations at scale. However, processing such massive datasets in real time isn’t always necessary or cost-effective.
The solution presented in this post uses batch inference in Amazon Bedrock with automated monitoring to process large volumes of customer data efficiently using the following architecture.

This architecture workflow includes the following steps:

The financial services company uploads customer credit data and product data to be processed to an Amazon Simple Storage Service (Amazon S3) bucket.
The first Lambda function reads the prompt template and data from the S3 bucket, and creates a JSONL file with prompts for the customers along with their credit data and available financial products.
The same Lambda function triggers a new Amazon Bedrock batch inference job using this JSONL file.
In the prompt template, the FM is given the role of an expert in recommendation systems within the financial services industry. This way, the model understands the customer and their credit information and can intelligently recommend the most suitable products.
An EventBridge rule monitors the state changes of the batch inference job. When the job completes or fails, the rule triggers a second Lambda function.
The second Lambda function creates an entry for the job with its status in a DynamoDB table (a minimal sketch of this function follows the list).
After a batch job is complete, its output files (containing personalized product recommendations) will be available in the S3 bucket’s inference_results folder.
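The following is a minimal sketch of that second, EventBridge-triggered Lambda function. The event detail field names are assumptions about the Bedrock state-change event, and only attributes described later in this post (job_arn, job_name, last_processed_timestamp) plus a status field are written; the table name matches the bedrock_batch_job_status table created by the solution.

import os
import boto3

# Table created by the solution; the name can also be passed in as an environment variable.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("TABLE_NAME", "bedrock_batch_job_status"))

def lambda_handler(event, context):
    # EventBridge delivers the Bedrock batch job state change; field names are assumed here.
    detail = event.get("detail", {})
    table.put_item(
        Item={
            "job_arn": detail.get("batchJobArn", "unknown"),
            "job_name": detail.get("batchJobName", "unknown"),
            "status": detail.get("status", "unknown"),
            "last_processed_timestamp": event.get("time", ""),
        }
    )
    return {"statusCode": 200}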

This automated monitoring solution for Amazon Bedrock batch inference offers several key benefits:

Real-time visibility – Integration of DynamoDB and EventBridge provides real-time visibility into the status of batch inference jobs, enabling proactive monitoring and timely decision-making
Streamlined operations – Automated job monitoring and management minimizes manual overhead, reducing operational complexities so teams can focus on higher-value tasks like analyzing recommendation results
Optimized resource allocation – Metrics and insights about token count and latency stored in DynamoDB help organizations optimize resource allocation, facilitating efficient utilization of batch inference capabilities and cost-effectiveness

Prerequisites
To implement this solution, you must have the following:

An active AWS account with appropriate permissions to create resources, including S3 buckets, Lambda functions, and Amazon Bedrock resources.
Access to your selected models hosted on Amazon Bedrock. Make sure the selected model has been enabled in Amazon Bedrock.

Additionally, make sure to deploy the solution in an AWS Region that supports batch inference.
Deploy solution
For this solution, we provide an AWS CloudFormation template that sets up the services included in the architecture, to enable repeatable deployments. This template creates the following resources:

An S3 bucket to store the input and output
AWS Identity and Access Management (IAM) roles for Lambda functions, EventBridge rule, and Amazon Bedrock batch inference job
Amazon Bedrock Prompt Management template
EventBridge rule to trigger the Lambda function
DynamoDB table to store the job execution details

To deploy the CloudFormation template, complete the following steps:

Sign in to the AWS Management Console.
Open the template directly on the Create stack page of the CloudFormation console.
Choose Next and provide the following details:

For Stack name, enter a unique name.
For ModelId, enter the model ID that you want your batch job to use. Only Anthropic Claude family models can be used with the CloudFormation template provided in this post.

Add optional tags, permissions, and other advanced settings if needed.
Review the stack details, select I acknowledge that AWS CloudFormation might create AWS IAM resources, and choose Next.
Choose Submit to initiate the deployment in your AWS account. The stack might take several minutes to complete.

Choose the Resources tab to find the newly created S3 bucket after the deployment succeeds.
Open the S3 bucket and confirm that there are two CSV files in your data folder.

On the Amazon S3 console, go to the data folder and manually create two more folders, prompts and inference_results. This will prepare your S3 bucket to store the prompts and batch inference job results.

On the Lambda console, choose Functions in the navigation pane.
Choose the function that has create-jsonl-file in its name.

On the Test tab, choose Test to run the Lambda function. The function reads the CSV files from the S3 bucket and the prompt template, and creates a JSONL file with prompts for the customers under the prompts folder of your S3 bucket. The JSONL file has 100 prompts using the customers and products data. Lastly, the function submits a batch inference job with the CreateModelInvocationJob API call using the JSONL file (a sketch of this call follows these steps).
On the Amazon Bedrock console, choose Prompt Management under Builder tools in the navigation pane.
Choose the finance-product-recommender-v1 prompt to see the prompt template input for the FM.
Choose Batch inference in the navigation pane under Inference and Assessment to find the submitted job.
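As referenced above, the batch job submission at the end of the create-jsonl-file function looks roughly like the following sketch. The role ARN, bucket, prefixes, and job name are placeholders; the model ID shown is one Anthropic Claude family ID and should match the ModelId you provided to the stack.

import boto3

bedrock = boto3.client("bedrock")

# Placeholders for role, bucket, and job name; the JSONL file is the one written to the prompts folder.
response = bedrock.create_model_invocation_job(
    jobName="product-recommendations-batch-job",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchInferenceRole",
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://your-bucket/prompts/prompts.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://your-bucket/inference_results/"}
    },
)
print(response["jobArn"])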

The job progresses through different statuses: Submitted, Validating, In Progress, and finally Completed or Failed. You can leave this page and check the status after a few hours.
The EventBridge rule will automatically trigger the second Lambda function with event-bridge-trigger in its name on completion of the job. This function will add an entry in the DynamoDB table named bedrock_batch_job_status with details of the execution, as shown in the following screenshot.

This DynamoDB table functions as a state manager for Amazon Bedrock batch inference jobs, tracking the lifecycle of each request. The columns of the table are logically divided into the following categories:

Job identification and core attributes (job_arn, job_name) – These columns provide the unique identifier and a human-readable name for each batch inference request, serving as the primary keys or core attributes for tracking.
Execution and lifecycle management (StartTime, EndTime, last_processed_timestamp, TotalDuration) – This category captures the temporal aspects and the overall progression of the job, allowing for monitoring of its current state, start/end times, and total processing duration. last_processed_timestamp is crucial for understanding the most recent activity or checkpoint.
Processing statistics and performance (TotalRecordCount, ProcessedRecordCount, SuccessRecordCount, ErrorRecordCount) – These metrics provide granular insights into the processing efficiency and outcome of the batch job, highlighting data volume, successful processing rates, and error occurrences.
Cost and resource utilization metrics (InputTokenCount, OutputTokenCount) – Specifically designed for cost analysis, these columns track the consumption of tokens, which is a direct factor in Amazon Bedrock pricing, enabling accurate resource usage assessment.
Data and location management (InputLocation, OutputLocation) – These columns link the inference job to its source and destination data within Amazon S3, maintaining traceability of the data involved in the batch processing.

View product recommendations
Complete the following steps to open the output file and view the recommendations for each customer generated by the FM:

On the Amazon Bedrock console, open the completed batch inference job.
Find the job Amazon Resource Name (ARN) and copy the text after model-invocation-job/, as illustrated in the following screenshot.

Choose the link for S3 location under Output data. A new tab opens with the inference_results folder of the S3 bucket.

Search for the job results folder using the text copied from the previous step.
Open the folder to find two output files:

The file named manifest contains information like number of tokens, number of successful records, and number of errors.
The second output file contains the recommendations.

Download the second output file and open it in a text editor like Visual Studio Code to find the recommendations against each customer.

The example in the following screenshot shows several recommended products and why the FM chose this product for the specific customer.

Best practices
To optimize or enhance your monitoring solution, consider the following best practices:

Set up Amazon CloudWatch alarms for failed jobs to facilitate prompt attention to issues. For more details, see Amazon CloudWatch alarms.
Use appropriate DynamoDB capacity modes based on your workload patterns.
Configure relevant metrics and logging of batch job performance for operational visibility; refer to Publish custom metrics for more details, and see the sketch after this list for an example. The following are some useful metrics:

Average job duration
Token throughput rate: (inputTokenCount + outputTokenCount) / jobDuration
Error rates and types
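As referenced above, the following is a minimal sketch of publishing one of these metrics (token throughput) as a custom CloudWatch metric; the namespace, dimension, and example numbers are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Example values pulled from the DynamoDB job record; the numbers here are illustrative.
input_tokens, output_tokens, job_duration_seconds = 120000, 45000, 3600
throughput = (input_tokens + output_tokens) / job_duration_seconds  # tokens per second

cloudwatch.put_metric_data(
    Namespace="BedrockBatchInference",  # assumed namespace
    MetricData=[
        {
            "MetricName": "TokenThroughput",
            "Dimensions": [{"Name": "JobName", "Value": "product-recommendations-batch-job"}],
            "Value": throughput,
            "Unit": "Count/Second",
        }
    ],
)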

Estimated costs
The cost estimate of running this solution one time is less than $1. The estimate for batch inference jobs assumes Anthropic's Claude 3.5 Sonnet v2 model. Refer to Model pricing details for batch job pricing of other models on Amazon Bedrock.
Clean up
If you no longer need this automated monitoring solution, follow these steps to delete the resources it created to avoid additional costs:

On the Amazon S3 console, choose Buckets in the navigation pane.
Select the bucket you created and choose Empty to delete its contents.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the created stack and choose Delete.

This automatically deletes the deployed stack and the resources created.
Conclusion
In this post, we demonstrated how a financial services company can use an FM to process large volumes of customer records and get specific data-driven product recommendations. We also showed how to implement an automated monitoring solution for Amazon Bedrock batch inference jobs. By using EventBridge, Lambda, and DynamoDB, you can gain real-time visibility into batch processing operations, so you can efficiently generate personalized product recommendations based on customer credit data. The solution addresses key challenges in managing batch inference operations:

Alleviates the need for manual status checking or continuous polling
Provides immediate notifications when jobs complete or fail
Maintains a centralized record of job statuses

This automated monitoring approach significantly enhances the ability to process large amounts of financial data using batch inference for Amazon Bedrock. This solution offers a scalable, efficient, and cost-effective approach to do batch inference for a variety of use cases, such as generating product recommendations, identifying fraud patterns, or analyzing financial trends in bulk, with the added benefit of real-time operational visibility.

About the authors
Durga Prasad is a Senior Consultant at AWS, specializing in the Data and AI/ML. He has over 17 years of industry experience and is passionate about helping customers design, prototype, and scale Big Data and Generative AI applications using AWS native and open-source tech stacks.
Chanpreet Singh is a Senior Consultant at AWS with 18+ years of industry experience, specializing in Data Analytics and AI/ML solutions. He partners with enterprise customers to architect and implement cutting-edge solutions in Big Data, Machine Learning, and Generative AI using AWS native services, partner solutions and open-source technologies. A passionate technologist and problem solver, he balances his professional life with nature exploration, reading, and quality family time.

OpenAI Debuts Agent Builder and AgentKit: A Visual-First Stack for Bui …

OpenAI has released AgentKit, a cohesive platform that packages a visual Agent Builder, an embeddable ChatKit UI, and expanded Evals into a single workflow for shipping production agents. The launch includes Agent Builder in beta and the rest generally available.

What’s new?

Agent Builder (beta). A visual canvas for composing multi-step, multi-agent workflows with drag-and-drop nodes, connectors, per-node guardrails, preview runs, inline eval configuration, and full versioning. Teams can start from templates or a blank canvas; the Responses API powers execution. OpenAI highlights internal and customer usage to compress iteration cycles when moving from prototype to production.

"With Agent Builder, you can drag and drop nodes, connect tools, and publish your agentic workflows with ChatKit and the Agents SDK. Here's @christinaahuang to walk you through it." — OpenAI Developers (@OpenAIDevs), October 6, 2025

Agents SDK. A code-first alternative to the canvas with type-safe libraries in Node, Python, and Go. OpenAI positions the SDK as faster to integrate than manual prompt-and-tool orchestration while sharing the same execution substrate (Responses API).
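As a rough illustration (not taken from OpenAI's launch materials), a minimal Python agent with the Agents SDK looks something like the sketch below, assuming the openai-agents package's Agent and Runner interfaces:

# Minimal sketch assuming the openai-agents package; the agent name and instructions are illustrative.
from agents import Agent, Runner

support_agent = Agent(
    name="Support triage",
    instructions="Answer briefly and flag anything billing-related for human handoff.",
)

result = Runner.run_sync(support_agent, "A customer says they were double charged.")
print(result.final_output)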

"@Albertsons used AgentKit to build an agent. An associate can ask it to create a plan to improve ice cream sales. The agent looks at the full context — seasonality, historical trends, external factors — and gives a recommendation." — OpenAI Developers (@OpenAIDevs), October 6, 2025

ChatKit (GA). A drop-in, brand-customizable chat interface for deploying agentic experiences on the web or in apps. It handles streaming, threads, and “thinking” UIs; the marketing page shows organizations using it for support and internal assistants.

Built-in tools and connectors. Agent workflows can call web search, file search, image generation, code interpreter, “computer use,” and external connectors, including Model Context Protocol (MCP) servers—reducing glue code for common tasks.

Connector Registry (beta). Centralized admin governance across ChatGPT and the API for data sources such as Dropbox, Google Drive, SharePoint, Microsoft Teams, and third-party MCPs. Rollout begins for customers with the Global Admin Console.

Evals (GA) and optimization. New capabilities include datasets, trace grading for end-to-end workflow assessment, automated prompt optimization, and third-party model evaluation. OpenAI emphasizes continuous measurement to raise task accuracy.

Pricing and availability. OpenAI states ChatKit and the new Evals features are GA; Agent Builder is beta. All are included under standard API model pricing (i.e., pay for model/compute usage rather than separate SKUs).

How do the pieces fit together?

Design: Use Agent Builder to visually assemble agents and guardrails, or write agents with the Agents SDK against the Responses API.

Deploy: Embed with ChatKit to deliver a production chat surface without building a frontend from scratch.

Optimize: Instrument with Evals (datasets, trace grading, graders) and iterate prompts based on graded traces.

How is safety included?

OpenAI’s launch materials pair Agent Builder with guardrails (open-source, modular) that can detect jailbreaks, mask/flag PII, and enforce policies at the node/tool boundary. Admins govern connections and data flows through the Connector Registry spanning both ChatGPT and the API.

Our Comments

It is a consolidated stack: AgentKit packages a visual Agent Builder for graph-based workflows, an embeddable ChatKit UI, and an Agents SDK that sits on top of the Responses API; this reduces bespoke orchestration and frontend work while keeping evaluation in-loop via datasets and trace grading. Our assessment: the value is operational—versioned node graphs, built-in tools (web/file search, computer use), connector governance, and standardized eval hooks are production concerns that previously required custom infrastructure.

"Introducing AgentKit—build, deploy, and optimize agentic workflows. ChatKit: embeddable, customizable chat UI. Agent Builder: WYSIWYG workflow creator. Guardrails: safety screening for inputs/outputs. Evals: datasets, trace grading, auto-prompt optimization." — OpenAI Developers (@OpenAIDevs), October 6, 2025

The post OpenAI Debuts Agent Builder and AgentKit: A Visual-First Stack for Building, Deploying, and Evaluating AI Agents appeared first on MarkTechPost.

A New Agency-Focused Supervision Approach Scales Software AI Agents Wi …

Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.

https://arxiv.org/pdf/2509.17567

What exactly is new?

Agency Efficiency Principle: LIMI posits that agentic competence scales more with data quality and structure than with raw sample count. The research team fine-tuned GLM-4.5/GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and reports large gains on AgencyBench and generalization suites (TAU2-bench, EvalPlus-HE/MBPP, DS-1000, SciCode).

Minimal but dense supervision. Each trajectory (~13k–152k tokens; ~42.4k avg.) captures complete multi-turn workflows—model reasoning, tool calls, and environment observations—collected in the SII-CLI execution environment. Tasks span “vibe coding” (interactive software development) and research workflows (search, analysis, experiment design).

How does it work?

Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical configs across comparisons (to isolate data effects).

Data construction: 60 real queries from practitioners + 18 synthesized from high-star GitHub PRs (tight QA by PhD annotators). For each query, LIMI logs the full agent trajectory to successful completion inside SII-CLI.

Evaluation: AgencyBench (R=3 rounds) with FTFC, SR@3, RC@3; plus generalization suites (TAU2-airline/retail Pass^4, EvalPlus HE/MBPP, DS-1000, SciCode).

Results

AgencyBench (avg): 73.5%. LIMI vs. GLM-4.5 (+28.4 pts); FTFC 71.7% vs 37.8%; SR@3 74.6% vs 47.4%.

Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained on AFM-CodeAgent SFT (10,000 samples): 73.5% vs 47.8%, a 25.7-point absolute gain (53.7% relative) with 128× less data. Similar gaps hold vs AFM-WebAgent (7,610) and CC-Bench-Traj (260).

Generalization: Across tool-use/coding/scientific computing, LIMI averages ~57%, exceeding GLM-4.5 and other baselines; without tool access, LIMI still leads slightly (50.0% vs 48.7% for GLM-4.5), indicating intrinsic gains beyond environment tooling.

Key Takeaways

Data efficiency dominates scale. LIMI reaches 73.5% average on AgencyBench using curated trajectories, surpassing GLM-4.5 (45.1%) and showing a 25.7-point (53.7% relative) advantage over a 10k-sample SFT baseline, with 128× fewer samples.

Trajectory quality, not bulk. Training data are long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII-CLI execution stack referenced by the paper.

Across-metric gains. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3, with detailed tables showing large margins over baselines; generalization suites (TAU2, EvalPlus-HE/MBPP, DS-1000, SciCode) average 57.2%.

Works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their base models, indicating that the method is robust to model size.

Our Comments

The research team trains GLM-4.5 variants with 78 curated, long-horizon, tool-grounded trajectories captured in a CLI environment spanning software-engineering and research tasks. It reports 73.5% average on AgencyBench with FTFC, RC@3, and SR@3 metrics; baseline GLM-4.5 is reported at 45.1%. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%; tool-free evaluation indicates intrinsic gains (≈50.0% for LIMI vs 48.7% GLM-4.5). Trajectories are multi-turn and token-dense, emphasizing planning, tool orchestration, and verification.

Check out the Paper, GitHub Page and Model Card on Hugging Face.
The post A New Agency-Focused Supervision Approach Scales Software AI Agents With Only 78 Examples appeared first on MarkTechPost.

StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Inter …

Why treat LLM inference as batched kernels to DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type to encode the tile/order of streams, enabling provably correct inter-kernel streaming and automated insertion/sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of a GPU baseline and up to 1.99× higher energy efficiency.

https://arxiv.org/pdf/2509.13694

What StreamTensor does?

StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design in which intermediate tiles largely avoid off-chip DRAM round-trips through on-chip streaming and fusion; DMAs are inserted only when required, and intermediate results are forwarded through on-chip FIFOs to downstream kernels. The compiler's central abstraction—iterative tensors (itensors)—records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs to avoid stalls or deadlock while minimizing on-chip memory.

What’s actually new?

Hierarchical DSE. The compiler explores three design spaces—(i) tiling/unroll/vectorization/permutation at the Linalg level, (ii) fusion under memory/resource constraints, and (iii) resource allocation/stream widths—optimizing for sustained throughput under bandwidth limits.

End-to-end PyTorch → device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue—no manual RTL assembly.

Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers/consumers disagree.

Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation to avoid stalls/deadlocks while minimizing on-chip memory usage (BRAM/URAM).

Results

Latency: up to 0.76× vs prior FPGA LLM accelerators and 0.64× vs a GPU baseline on GPT-2; Energy efficiency: up to 1.99× vs A100 on emerging LLMs (model-dependent). Platform context: Alveo U55C (HBM2 16 GB, 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2×QSFP28).

Our Comments

The useful contribution here is a PyTorch→Torch-MLIR→dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD’s Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team show geometric-mean latency as low as 0.64× vs. a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.

Check out the Paper.
The post StreamTensor: A PyTorch-to-Accelerator Compiler that Streams LLM Intermediates Across FPGA Dataflows appeared first on MarkTechPost.

Responsible AI: How PowerSchool safeguards millions of students with A …

This post is cowritten with Gayathri Rengarajan and Harshit Kumar Nyati from PowerSchool.
PowerSchool is a leading provider of cloud-based software for K-12 education, serving over 60 million students in more than 90 countries and over 18,000 customers, including more than 90 of the top 100 districts by student enrollment in the United States. When we launched PowerBuddy, our AI assistant integrated across our multiple educational platforms, we faced a critical challenge: implementing content filtering sophisticated enough to distinguish between legitimate academic discussions and harmful content in educational contexts.
In this post, we demonstrate how we built and deployed a custom content filtering solution using Amazon SageMaker AI that achieved approximately 93% accuracy in internal validation while maintaining a low false positive rate. We walk through our technical approach to fine-tuning Llama 3.1 8B, our deployment architecture, and the performance results from internal validations.
PowerSchool’s PowerBuddy
PowerBuddy is an AI assistant that provides personalized insights, fosters engagement, and provides support throughout the educational journey. Educational leaders benefit from PowerBuddy being brought to their data and their users’ most common workflows within the PowerSchool ecosystem – such as Schoology Learning, Naviance CCLR, PowerSchool SIS, Performance Matters, and more – to ensure a consistent experience for students and their network of support providers at school and at home.
The PowerBuddy suite includes several AI solutions: PowerBuddy for Learning functions as a virtual tutor, PowerBuddy for College and Career provides insights for career exploration, and PowerBuddy for Community simplifies access to district and school information, among others. The solution includes built-in accessibility features such as speech-to-text and text-to-speech functionality.
Content filtering for PowerBuddy
As an education technology provider serving millions of students—many of whom are minors—student safety is our highest priority. National data shows that approximately 20% of students ages 12–17 experience bullying, and 16% of high school students have reported seriously considering suicide. With PowerBuddy’s widespread adoption across K-12 schools, we needed robust guardrails specifically calibrated for educational environments.
The out-of-the-box content filtering and safety guardrails solutions available on the market didn’t fully meet PowerBuddy’s requirements, primarily because of the need for domain-specific awareness and fine-tuning within the education context. For example, when a high school student is learning about sensitive historical topics such as World War II or the Holocaust, it’s important that educational discussions aren’t mistakenly flagged for violent content. At the same time, the system must be able to detect and immediately alert school administrators to indications of potential harm or threats. Achieving this nuanced balance requires deep contextual understanding, which can only be enabled through targeted fine-tuning.
We needed to implement a sophisticated content filtering system that could intelligently differentiate between legitimate academic inquiries and truly harmful content—detecting and blocking prompts indicating bullying, self-harm, hate speech, inappropriate sexual content, violence, or harmful material not suitable for educational settings. Our challenge was finding a cloud solution to train and host a custom model that could reliably protect students while maintaining the educational functionality of PowerBuddy.
After evaluating multiple AI providers and cloud services that allow model customization and fine-tuning, we selected Amazon SageMaker AI as the most suitable platform based on these critical requirements:

Platform stability: As a mission-critical service supporting millions of students daily, we require an enterprise-grade infrastructure with high availability and reliability.
Autoscaling capabilities: Student usage patterns in education are highly cyclical, with significant traffic spikes during school hours. Our solution needed to handle these fluctuations without degrading performance.
Control of model weights after fine-tuning: We needed control over our fine-tuned models to enable continuous refinement of our safety guardrails, enabling us to quickly respond to new types of harmful content that might emerge in educational settings.
Incremental training capability: The ability to continually improve our content filtering model with new examples of problematic content was essential.
Cost-effectiveness: We needed a solution that would allow us to protect students without creating prohibitive costs that would limit schools’ access to our educational tools.
Granular control and transparency: Student safety demands visibility into how our filtering decisions are made, requiring a solution that isn’t a black box but provides transparency into model behavior and performance.
Mature managed service: Our team needed to focus on educational applications rather than infrastructure management, making a comprehensive managed service with production-ready capabilities essential.

Solution overview

Our content filtering system architecture, shown in the preceding figure, consists of several key components:

Data preparation pipeline:

Curated datasets of safe and unsafe content examples specific to educational contexts
Data preprocessing and augmentation to ensure robust model training
Secure storage in Amazon S3 buckets with appropriate encryption and access controls. (Note: All training data was fully anonymized and did not include personally identifiable student information.)

Model training infrastructure:

SageMaker training jobs for fine-tuning Llama 3.1 8B

Inference architecture:

Deployment on SageMaker managed endpoints with auto-scaling configured
Integration with PowerBuddy through Amazon API Gateway for real-time content filtering
Monitoring and logging through Amazon CloudWatch for continuous quality assessment

Continuous improvement loop:

Feedback collection mechanism for false positives/negatives
Scheduled retraining cycles to incorporate new data and improve performance
A/B testing framework to evaluate model improvements before full deployment

Development process
After exploring multiple approaches to content filtering, we decided to fine-tune Llama 3.1 8B using Amazon SageMaker JumpStart. This decision followed our initial attempts to develop a content filtering model from scratch, which proved challenging to optimize for consistency across various types of harmful content.
SageMaker JumpStart significantly accelerated our development process by providing pre-configured environments and optimized hyperparameters for fine-tuning foundation models. The platform’s streamlined workflow allowed our team to focus on curating high-quality training data specific to educational safety concerns rather than spending time on infrastructure setup and hyperparameter tuning.
We fine-tuned the Llama 3.1 8B model using the Low-Rank Adaptation (LoRA) technique on Amazon SageMaker AI training jobs, which allowed us to maintain full control over the training process.
After fine-tuning was complete, we deployed the model on a SageMaker AI managed endpoint and integrated it as a critical safety component within our PowerBuddy architecture.
For our production deployment, we selected NVIDIA A10G GPUs available through ml.g5.12xlarge instances, which offered the ideal balance of performance and cost-effectiveness for our model size. The AWS team provided crucial guidance on selecting optimal model serving configuration for our use case. This advice helped us optimize both performance and cost by ensuring we weren’t over-provisioning resources.
Technical implementation
Below is the code snippet to fine-tune the model on the preprocessed dataset. The instruction tuning dataset is first converted into the domain adaptation dataset format, and the scripts use Fully Sharded Data Parallel (FSDP) as well as the Low-Rank Adaptation (LoRA) method to fine-tune the model.
We define an estimator object first. By default, these models train via domain adaptation, so you must indicate instruction tuning by setting the instruction_tuned hyperparameter to True.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
    hyperparameters={
        "instruction_tuned": "True",
        "epoch": "5",
        "max_input_length": "1024",
        "chat_dataset": "False",
    },
    sagemaker_session=session,
    base_job_name="CF-M-0219251",
)

After we define the estimator, we are ready to start training:
estimator.fit({"training": train_data_location})
After training, we created a model using the artifacts stored in Amazon S3 and deployed it to a real-time endpoint for evaluation. We tested the model using our test dataset, which covers key scenarios, to validate performance and behavior. We calculated recall, the F1 score, and the confusion matrix, and inspected misclassifications. If needed, adjust the hyperparameters or prompt template and retrain; otherwise, proceed with production deployment.
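The deployment and evaluation step looked roughly like the following sketch. The instance type matches the one described below; the test data variables, payload format, and label parsing are assumptions that depend on how the endpoint serves the fine-tuned model:

from sklearn.metrics import classification_report, confusion_matrix

# Deploy the fine-tuned model to a real-time endpoint for evaluation.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

# test_prompts and test_labels are assumed: labeled examples of safe/unsafe content.
predictions = []
for prompt in test_prompts:
    response = predictor.predict({"inputs": prompt, "parameters": {"max_new_tokens": 8}})
    text = response[0]["generated_text"] if isinstance(response, list) else str(response)
    predictions.append("unsafe" if "unsafe" in text.lower() else "safe")

print(classification_report(test_labels, predictions))  # per-class precision, recall, F1
print(confusion_matrix(test_labels, predictions, labels=["safe", "unsafe"]))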
You can also check out the sample notebook for fine tuning Llama 3 models on SageMaker JumpStart in SageMaker examples.
We used the Faster autoscaling on Amazon SageMaker realtime endpoints notebook to set up autoscaling on SageMaker AI endpoints.
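A minimal sketch of the target-tracking autoscaling configuration applied to the endpoint follows; the endpoint name, variant name, capacity bounds, and target value are illustrative assumptions:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/content-filter-endpoint/variant/AllTraffic"  # assumed names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="content-filter-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance before scaling out
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)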
Validation of solution
To validate our content filtering solution, we conducted extensive testing across multiple dimensions:

Accuracy testing: In our internal validation testing, the model achieved ~93% accuracy in identifying harmful content across a diverse test set representing various forms of inappropriate material.
False positive analysis: We worked to minimize instances where legitimate educational content was incorrectly flagged as harmful, achieving a false positive rate of less than 3.75% in test environments; results may vary by school context.
Performance testing: Our solution maintained response times averaging 1.5 seconds. Even during peak usage periods simulating real classroom environments, the system consistently delivered a seamless user experience with no failed transactions.
Scalability and reliability validation:

Comprehensive load testing achieved 100% transaction success rate with consistent performance distribution, validating system reliability under sustained educational workload conditions.
Transactions completed successfully without degradation in performance or accuracy, demonstrating the system’s ability to scale effectively for classroom-sized concurrent usage scenarios.

Production deployment: Initial rollout to a select group of schools showed consistent performance in real-world educational environments.
Student safety outcomes: Schools reported a significant reduction in reported incidents of AI-enabled bullying or inappropriate content generation compared to other AI systems without specialized content filtering.

Fine-tuned model metrics compared to out-of-the-box content filtering solutions
The fine-tuned content filtering model demonstrated higher performance than generic, out-of-the-box filtering solutions on key safety metrics. It achieved higher accuracy (0.93 compared to 0.89) and better F1 scores for both the safe (0.95 compared to 0.91) and unsafe (0.90 compared to 0.87) classes. The fine-tuned model also demonstrated a more balanced trade-off between precision and recall, indicating more consistent performance across classes. Importantly, it makes fewer false positive errors, misclassifying only 6 safe cases as unsafe compared to 19 for the out-of-the-box solution on a test set of 160—a significant advantage in safety-sensitive applications. Overall, our fine-tuned content filtering model proved to be more reliable and effective.
Future plans
As the PowerBuddy suite evolves and is integrated into other PowerSchool products and agent flows, the content filter model will be continuously adapted and improved with fine-tuning for other products with specific needs.
We plan to implement additional specialized adapters using the SageMaker AI multi-adapter inference feature alongside our content filtering model, subject to feasibility and compliance considerations. The idea is to deploy fine-tuned small language models (SLMs) for specific problems in cases where large language models (LLMs) are too large and generic to meet the needs of narrower problem domains. For example:

Decision making agents specific to the Education domain
Data domain identification in cases of text to SQL queries

This approach will deliver significant cost savings by eliminating the need for separate model deployments while maintaining the specialized performance of each adapter.
The goal is to create an AI learning environment that is not only safe but also inclusive and responsive to diverse student needs across our global implementations, ultimately empowering students to learn effectively while being protected from harmful content.
Conclusion
The implementation of our specialized content filtering system on Amazon SageMaker AI has been transformative for PowerSchool’s ability to deliver safe AI experiences in educational settings. By building robust guardrails, we’ve addressed one of the primary concerns educators and parents have about introducing AI into classrooms—helping to ensure student safety.
As Shivani Stumpf, our Chief Product Officer, explains: “We’re now tracking around 500 school districts who’ve either purchased PowerBuddy or activated included features, reaching over 4.2 million students approximately. Our content filtering technology ensures students can benefit from AI-powered learning support without exposure to harmful content, creating a safe space for academic growth and exploration.”
The impact extends beyond just blocking harmful content. By establishing trust in our AI systems, we’ve enabled schools to embrace PowerBuddy as a valuable educational tool. Teachers report spending less time monitoring student interactions with technology and more time on personalized instruction. Students benefit from 24/7 learning support without the risks that might otherwise come with AI access.
For organizations requiring domain-specific safety guardrails, consider how the fine-tuning capabilities and managed endpoints of SageMaker AI can be adapted to your use case.
As we continue to expand PowerBuddy’s capabilities with the multi-adapter inference of SageMaker, we remain committed to maintaining the perfect balance between educational innovation and student safety—helping to ensure that AI becomes a positive force in education that parents, teachers, and students can trust.

About the authors
Gayathri Rengarajan is the Associate Director of Data Science at PowerSchool, leading the PowerBuddy initiative. Known for bridging deep technical expertise with strategic business needs, Gayathri has a proven track record of delivering enterprise-grade generative AI solutions from concept to production.
Harshit Kumar Nyati is a Lead Software Engineer at PowerSchool with 10+ years of experience in software engineering and analytics. He specializes in building enterprise-grade Generative AI applications using Amazon SageMaker AI, Amazon Bedrock, and other cloud services. His expertise includes fine-tuning LLMs, training ML models, hosting them in production, and designing MLOps pipelines to support the full lifecycle of AI applications.
Anjali Vijayakumar is a Senior Solutions Architect at AWS with over 9 years of experience helping customers build reliable and scalable cloud solutions. Based in Seattle, she specializes in architectural guidance for EdTech solutions, working closely with Education Technology companies to transform learning experiences through cloud innovation. Outside of work, Anjali enjoys exploring the Pacific Northwest through hiking.
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.

Salesforce AI Research Releases CoDA-1.7B: a Discrete-Diffusion Code M …

Salesforce AI Research released CoDA-1.7B, a diffusion-based language model for code that generates by denoising whole sequences with bidirectional context, updating multiple tokens in parallel rather than left-to-right next-token prediction. The research team published both Base and Instruct checkpoints and an end-to-end training/evaluation/serving stack.

Understanding the architecture and training

CoDA adapts a 1.7B-parameter backbone to discrete diffusion for text: masked sequences are iteratively denoised using full-sequence attention, enabling native infilling and non-autoregressive decoding. The model card documents a three-stage pipeline (pre-training with bidirectional masking, supervised post-training, and progressive denoising at inference) plus reproducible scripts for TPU pre-training, GPU fine-tuning, and evaluation.

Key features surfaced in the release:

Bidirectional context via diffusion denoising (no fixed generation order).

Confidence-guided sampling (entropy-style decoding) to trade quality vs. speed; a minimal sketch of the idea follows this list.

Open training pipeline with deploy scripts and CLI.
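
To illustrate the general idea behind confidence-guided (entropy-style) sampling in parallel denoising, here is a minimal NumPy sketch that commits the most confident masked positions at each step. It is a conceptual illustration under assumed shapes and names, not CoDA's actual decoding code.

import numpy as np

def entropy(probs, axis=-1, eps=1e-12):
    # Shannon entropy per position; lower entropy = higher confidence.
    return -np.sum(probs * np.log(probs + eps), axis=axis)

def confidence_guided_step(probs, tokens, mask, commit_frac=0.25):
    """One illustrative denoising step.

    probs:  (seq_len, vocab) model probabilities for every position
    tokens: (seq_len,) current token ids (ignored where mask is True)
    mask:   (seq_len,) True where the position is still masked
    """
    masked_idx = np.where(mask)[0]
    if masked_idx.size == 0:
        return tokens, mask
    # Rank masked positions by confidence (ascending entropy).
    ent = entropy(probs[masked_idx])
    n_commit = max(1, int(commit_frac * masked_idx.size))
    chosen = masked_idx[np.argsort(ent)[:n_commit]]
    # Commit argmax tokens at the most confident positions; keep the rest masked.
    tokens = tokens.copy()
    tokens[chosen] = probs[chosen].argmax(axis=-1)
    mask = mask.copy()
    mask[chosen] = False
    return tokens, mask

At a high level, committing fewer positions per step means more steps (higher quality, higher latency), which is the quality/speed trade-off the release exposes.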

How do they perform on Benchmarks?

On standard code-gen suites, CoDA-1.7B-Instruct reports: HumanEval 54.3%, HumanEval+ 47.6%, MBPP 47.2%, MBPP+ 63.2%, EvalPlus aggregate 55.4% (pass@1). For context, the model card compares against diffusion baselines including Dream-7B-Instruct (57.9% HumanEval), indicating CoDA’s 1.7B footprint is competitive with some 7B diffusion models on several metrics while using fewer parameters.

https://huggingface.co/Salesforce/CoDA-v0-Instruct

Inference behavior

Generation cost is governed by the number of diffusion steps; CoDA exposes knobs such as STEPS, ALG="entropy", ALG_TEMP, and block length to tune latency/quality trade-offs. Because tokens are updated in parallel under full attention, CoDA targets lower wall-clock latency at small scale compared with larger diffusion models, at comparable step budgets. (Hugging Face)

Deployment and licensing

The repository provides a FastAPI server with OpenAI-compatible APIs and an interactive CLI for local inference; instructions include environment setup and a start_server.sh launcher. Model cards and a Hugging Face collection centralize artifacts. The checkpoints are published under CC BY-NC 4.0 on Hugging Face.
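
Because the server is OpenAI-compatible, a local deployment can be exercised with the standard openai Python client. In the sketch below, the base URL, port, and model identifier are assumptions for illustration; check the repository's serving docs for the actual values exposed by start_server.sh.

from openai import OpenAI

# Assumed local endpoint; api_key is a placeholder for servers that don't check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Salesforce/CoDA-v0-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)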

Our Comments

CoDA-1.7B stands as a clean reference for discrete-diffusion code generation at small scale: 1.7B parameters, bidirectional denoising with parallel token updates, and a reproducible pipeline from pre-training to SFT and serving. The reported pass@1 results—HumanEval 54.3, HumanEval+ 47.6, MBPP 47.2, MBPP+ 63.2, EvalPlus aggregate 55.4—place it competitive with some 7B diffusion baselines (e.g., Dream-7B HumanEval 57.9) while using fewer parameters. Inference latency is explicitly governed by step count and decoding knobs (STEPS, entropy-style guidance), which is operationally useful for tuning throughput/quality. The release includes weights on Hugging Face and a FastAPI server/CLI for local deployment.

Check out the Paper, GitHub Repo and Model on Hugging Face.
The post Salesforce AI Research Releases CoDA-1.7B: a Discrete-Diffusion Code Model with Bidirectional, Parallel Token Generation appeared first on MarkTechPost.

How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognit …

Table of contents
Why WER Isn’t Enough?
What to Measure (and How)?
Benchmark Landscape: What Each Covers
Filling the Gaps: What You Still Need to Add
A Concrete, Reproducible Evaluation Plan
References

Optimizing only for Automatic Speech Recognition (ASR) and Word Error Rate (WER) is insufficient for modern, interactive voice agents. Robust evaluation must measure end-to-end task success, barge-in behavior and latency, and hallucination-under-noise—alongside ASR, safety, and instruction following. VoiceBench offers a multi-facet speech-interaction benchmark across general knowledge, instruction following, safety, and robustness to speaker/environment/content variations, but it does not cover barge-in or real-device task completion. SLUE (and Phase-2) target spoken language understanding (SLU); MASSIVE and Spoken-SQuAD probe multilingual and spoken QA; DSTC tracks add spoken, task-oriented robustness. Combine these with explicit barge-in/endpointing tests, user-centric task-success measurement, and controlled noise-stress protocols to obtain a complete picture.

Why WER Isn’t Enough?

WER measures transcription fidelity, not interaction quality. Two agents with similar WER can diverge widely in dialog success because latency, turn-taking, misunderstanding recovery, safety, and robustness to acoustic and content perturbations dominate user experience. Prior work on real systems shows the need to evaluate user satisfaction and task success directly—e.g., Cortana’s automatic online evaluation predicted user satisfaction from in-situ interaction signals, not only ASR accuracy.

What to Measure (and How)?

1) End-to-End Task Success

Metric: Task Success Rate (TSR) with strict success criteria per task (goal completion, constraints met), plus Task Completion Time (TCT) and Turns-to-Success.
Why. Real assistants are judged by outcomes. Competitions like the Alexa Prize TaskBot explicitly measured users’ ability to finish multi-step tasks (e.g., cooking, DIY) with ratings and completion. A minimal scoring sketch follows the protocol below.

Protocol.

Define tasks with verifiable endpoints (e.g., “assemble shopping list with N items and constraints”).

Use blinded human raters and automatic logs to compute TSR/TCT/Turns.

For multilingual/SLU coverage, draw task intents/slots from MASSIVE.
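
To make these metrics concrete, here is a minimal, illustrative Python sketch for computing TSR, TCT, and Turns-to-Success from per-task logs. The TaskRun fields are assumptions about your logging schema, not a standard format.

from dataclasses import dataclass

@dataclass
class TaskRun:
    # Hypothetical per-episode log record; field names are illustrative.
    success: bool      # strict, verifiable success criterion met
    duration_s: float  # wall-clock task completion time
    turns: int         # user + agent turns until success or abandonment

def summarize(runs):
    """Compute TSR over all runs, and mean TCT / Turns over successful runs."""
    n = len(runs)
    wins = [r for r in runs if r.success]
    tsr = len(wins) / n if n else 0.0
    tct = sum(r.duration_s for r in wins) / len(wins) if wins else float("nan")
    turns = sum(r.turns for r in wins) / len(wins) if wins else float("nan")
    return {"TSR": tsr, "TCT_s": tct, "Turns": turns}

print(summarize([TaskRun(True, 93.0, 7), TaskRun(False, 180.0, 12), TaskRun(True, 71.5, 5)]))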

2) Barge-In and Turn-Taking

Metrics:

Barge-In Detection Latency (ms): time from user onset to TTS suppression.

True/False Barge-In Rates: correct interruptions vs. spurious stops.

Endpointing Latency (ms): time to ASR finalization after user stop.

Why. Smooth interruption handling and fast endpointing determine perceived responsiveness. Research formalizes barge-in verification and continuous barge-in processing; endpointing latency continues to be an active area in streaming ASR. A minimal latency-computation sketch follows the protocol below.

Protocol.

Script prompts where the user interrupts TTS at controlled offsets and SNRs.

Measure suppression and recognition timings with high-precision logs (frame timestamps).

Include noisy/echoic far-field conditions. Classic and modern studies provide recovery and signaling strategies that reduce false barge-ins.
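
These latency metrics reduce to differences between event timestamps. The sketch below assumes each trial log records user onset/offset, TTS suppression, and ASR finalization times; the field names are illustrative, not a standard schema.

def barge_in_metrics(events):
    """Compute barge-in and endpointing latencies from per-trial timestamps (seconds).

    Assumed keys:
      user_onset      - user starts speaking over TTS
      tts_suppressed  - TTS playback actually stops
      user_offset     - user stops speaking
      asr_final       - ASR emits the final hypothesis
    """
    barge_in_latency = events["tts_suppressed"] - events["user_onset"]
    endpointing_latency = events["asr_final"] - events["user_offset"]
    return {"barge_in_ms": 1000 * barge_in_latency,
            "endpointing_ms": 1000 * endpointing_latency}

trial = {"user_onset": 3.20, "tts_suppressed": 3.42, "user_offset": 6.10, "asr_final": 6.58}
print(barge_in_metrics(trial))  # ~220 ms barge-in suppression, ~480 ms endpointing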

3) Hallucination-Under-Noise (HUN)

Metric. HUN Rate: fraction of outputs that are fluent but semantically unrelated to the audio, under controlled noise or non-speech audio.
Why. ASR and audio-LLM stacks can emit “convincing nonsense,” especially with non-speech segments or noise overlays. Recent work defines and measures ASR hallucinations; targeted studies show Whisper hallucinations induced by non-speech sounds.

Protocol.

Construct audio sets with additive environmental noise (varied SNRs), non-speech distractors, and content disfluencies (a minimal SNR-mixing sketch follows this list).

Score semantic relatedness (human judgment with adjudication) and compute HUN.

Track whether downstream agent actions propagate hallucinations to incorrect task steps.
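
As a starting point, the following Python/NumPy sketch shows one way to mix noise into speech at a target SNR and to compute a HUN rate from adjudicated human judgments. The waveform arrays and label convention are assumptions for illustration.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix noise into speech at a target SNR (dB). Inputs are float waveforms."""
    noise = np.resize(noise, speech.shape)          # tile/truncate noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def hun_rate(judgments):
    """judgments: adjudicated booleans, True when the output was fluent but
    semantically unrelated to the audio (a hallucination)."""
    return sum(judgments) / len(judgments) if judgments else 0.0

# Example: mix at 5 dB SNR and compute HUN over three adjudicated trials.
noisy = mix_at_snr(np.random.randn(16000), np.random.randn(8000), snr_db=5)
print(hun_rate([False, True, False]))  # 0.333...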

4) Instruction Following, Safety, and Robustness

Metric Families.

Instruction-Following Accuracy (format and constraint adherence).

Safety Refusal Rate on adversarial spoken prompts.

Robustness Deltas across speaker age/accent/pitch, environment (noise, reverb, far-field), and content noise (grammar errors, disfluencies).

Why. VoiceBench explicitly targets these axes with spoken instructions (real and synthetic) spanning general knowledge, instruction following, and safety; it perturbs speaker, environment, and content to probe robustness.

Protocol.

Use VoiceBench for breadth on speech-interaction capabilities; report aggregate and per-axis scores.

For SLU specifics (NER, dialog acts, QA, summarization), leverage SLUE and Phase-2.

5) Perceptual Speech Quality (for TTS and Enhancement)

Metric. Subjective Mean Opinion Score via ITU-T P.808 (crowdsourced ACR/DCR/CCR).
Why. Interaction quality depends on both recognition and playback quality. P.808 gives a validated crowdsourcing protocol with open-source tooling.

Benchmark Landscape: What Each Covers

VoiceBench (2024)

Scope: Multi-facet voice assistant evaluation with spoken inputs covering general knowledge, instruction following, safety, and robustness across speaker/environment/content variations; uses both real and synthetic speech.
Limitations: Does not benchmark barge-in/endpointing latency or real-world task completion on devices; focuses on response correctness and safety under variations.

SLUE / SLUE Phase-2

Scope: Spoken language understanding tasks: NER, sentiment, dialog acts, named-entity localization, QA, summarization; designed to study end-to-end vs. pipeline sensitivity to ASR errors.
Use: Great for probing SLU robustness and pipeline fragility in spoken settings.

MASSIVE

Scope: >1M virtual-assistant utterances across 51–52 languages with intents/slots; strong fit for multilingual task-oriented evaluation.
Use: Build multilingual task suites and measure TSR/slot F1 under speech conditions (paired with TTS or read speech).

Spoken-SQuAD / HeySQuAD and Related Spoken-QA Sets

Scope: Spoken question answering to test ASR-aware comprehension and multi-accent robustness.
Use: Stress-test comprehension under speech errors; not a full agent task suite.

DSTC (Dialog System Technology Challenge) Tracks

Scope: Robust dialog modeling with spoken, task-oriented data; human ratings alongside automatic metrics; recent tracks emphasize multilinguality, safety, and evaluation dimensionality.
Use: Complementary for dialog quality, DST, and knowledge-grounded responses under speech conditions.

Real-World Task Assistance (Alexa Prize TaskBot)

Scope: Multi-step task assistance with user ratings and success criteria (cooking/DIY).
Use: Gold-standard inspiration for defining TSR and interaction KPIs; the public reports describe evaluation focus and outcomes.

Filling the Gaps: What You Still Need to Add

Barge-In & Endpointing KPIs: Add explicit measurement harnesses. The literature offers barge-in verification and continuous processing strategies; streaming ASR endpointing latency remains an active research topic. Track barge-in detection latency, suppression correctness, endpointing delay, and false barge-ins.

Hallucination-Under-Noise (HUN) Protocols: Adopt emerging ASR-hallucination definitions and controlled noise/non-speech tests; report the HUN rate and its impact on downstream actions.

On-Device Interaction Latency: Correlate user-perceived latency with streaming ASR designs (e.g., transducer variants); measure time-to-first-token, time-to-final, and local processing overhead.

Cross-Axis Robustness Matrices: Combine VoiceBench’s speaker/environment/content axes with your task suite (TSR) to expose failure surfaces (e.g., barge-in under far-field echo; task success at low SNR; multilingual slots under accent shift).

Perceptual Quality for Playback: Use ITU-T P.808 (with the open P.808 toolkit) to quantify user-perceived TTS quality in your end-to-end loop, not just ASR.

A Concrete, Reproducible Evaluation Plan

Assemble the Suite

Speech-Interaction Core: VoiceBench for knowledge, instruction following, safety, and robustness axes.

SLU Depth: SLUE/Phase-2 tasks (NER, dialog acts, QA, summarization) for SLU performance under speech.

Multilingual Coverage: MASSIVE for intent/slot and multilingual stress.

Comprehension Under ASR Noise: Spoken-SQuAD/HeySQuAD for spoken QA and multi-accent readouts.

Add Missing Capabilities

Barge-In/Endpointing Harness: scripted interruptions at controlled offsets and SNRs; log suppression time and false barge-ins; measure endpointing delay with streaming ASR.

Hallucination-Under-Noise: non-speech inserts and noise overlays; annotate semantic relatedness to compute HUN.

Task Success Block: scenario tasks with objective success checks; compute TSR, TCT, and Turns; follow TaskBot style definitions.

Perceptual Quality: P.808 crowdsourced ACR with the Microsoft toolkit.

Report Structure

Primary table: TSR/TCT/Turns; barge-in latency and error rates; endpointing latency; HUN rate; VoiceBench aggregate and per-axis; SLU metrics; P.808 MOS.

Stress plots: TSR and HUN vs. SNR and reverberation; barge-in latency vs. interrupt timing.

References

VoiceBench: first multi-facet speech-interaction benchmark for LLM-based voice assistants (knowledge, instruction following, safety, robustness). (ar5iv)

SLUE / SLUE Phase-2: spoken NER, dialog acts, QA, summarization; sensitivity to ASR errors in pipelines. (arXiv)

MASSIVE: 1M+ multilingual intent/slot utterances for assistants. (Amazon Science)

Spoken-SQuAD / HeySQuAD: spoken question answering datasets. (GitHub)

User-centric evaluation in production assistants (Cortana): predict satisfaction beyond ASR. (UMass Amherst)

Barge-in verification/processing and endpointing latency: AWS/academic barge-in papers, Microsoft continuous barge-in, recent endpoint detection for streaming ASR. (arXiv)

ASR hallucination definitions and non-speech-induced hallucinations (Whisper). (arXiv)

The post How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise appeared first on MarkTechPost.

Agentic Design Methodology: How to Build Reliable and Human-Like AI Ag …

Building robust AI agents differs fundamentally from traditional software development, as it centers on probabilistic model behavior rather than deterministic code execution. This guide provides a neutral overview of methodologies for designing AI agents that are both reliable and adaptable, with an emphasis on creating clear boundaries, effective behaviors, and safe interactions.

What Is Agentic Design?

Agentic design refers to constructing AI systems capable of independent action within defined parameters. Unlike conventional coding, which specifies exact outcomes for inputs, agentic systems require designers to articulate desirable behaviors and trust the model to navigate specifics.

Variability in AI Responses

Traditional software outputs remain constant for identical inputs. In contrast, agentic systems—based on probabilistic models—produce varied yet contextually appropriate responses each time. This makes effective prompt and guideline design critical for both human-likeness and safety.

In an agentic system, a request like “Can you help me reset my password?” might elicit different yet appropriate replies such as “Of course! Please tell me your username,” “Absolutely, let’s get started—what’s your email address?” or “I can assist with that. Do you remember your account ID?” This variability is purposeful, designed to enhance user experience by mimicking the nuance and flexibility of human dialogue. At the same time, this unpredictability requires thoughtful guidelines and safeguards so the system responds safely and consistently across scenarios.

Why Clear Instructions Matter

Language models interpret instructions rather than execute them literally. Vague guidance such as:

agent.create_guideline(
    condition="User expresses frustration",
    action="Try to make them happy"
)

can lead to unpredictable or unsafe behavior, such as unintended offers or promises. Instead, instructions should be concrete, specific, and action-focused:

agent.create_guideline(
    condition="User is upset by a delayed delivery",
    action="Acknowledge the delay, apologize, and provide a status update"
)

This approach ensures the model’s actions align with organizational policy and user expectations.

Building Compliance: Layers of Control

LLMs can’t be fully “controlled,” but you can still guide and constrain their behavior effectively.

Layer 1: Guidelines

Use guidelines to define and shape normal behavior.

await agent.create_guideline(
    condition="Customer asks about topics outside your scope",
    action="Politely decline and redirect to what you can help with"
)

Layer 2: Canned Responses

For high-risk situations (such as policy or medical advice), use pre-approved canned responses to ensure consistency and safety.

await agent.create_canned_response(
    template="I can help with account questions, but for policy details I'll connect you to a specialist."
)

This layered approach minimizes risk and ensures the agent never improvises in sensitive situations.

Tool Calling: When Agents Take Action

When AI agents take action using tools such as APIs or functions, the process involves more complexity than simply executing a command. For example, if a user says, “Schedule a meeting with Sarah for next week,” the agent must interpret several unclear elements: Which Sarah is being referred to? What specific day and time within “next week” should the meeting be scheduled? And on which calendar?

This illustrates the Parameter Guessing Problem, where the agent attempts to infer missing details that weren’t explicitly provided. To address this, tools should be designed with clear purpose descriptions, parameter hints, and contextual examples to reduce ambiguity. Additionally, tool names should be intuitive and parameter types consistent, helping the agent reliably select and populate inputs. Well-structured tools improve accuracy, reduce errors, and make the interactions smoother and more predictable for both the agent and the user.

This thoughtful tool design practice is essential for effective, safe agent functionality in real-world applications.
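
As a framework-agnostic illustration (not Parlant’s exact tool API), a well-specified tool for the meeting example above might look like the following sketch: descriptive names, typed parameters, and a docstring that tells the agent when to ask a clarifying question instead of guessing.

from datetime import date
from typing import Optional

def schedule_meeting(
    attendee_email: str,                 # unambiguous identifier, not a first name
    meeting_date: date,                  # a concrete date instead of "next week"
    start_time: str,                     # "HH:MM" in the user's time zone
    calendar_id: Optional[str] = None,   # defaults to the user's primary calendar
) -> str:
    """Schedule a meeting on the user's calendar.

    Use only when the user has confirmed the attendee, date, and time.
    If any of these are missing or ambiguous, ask a clarifying question
    instead of guessing.
    """
    # Illustrative stub: a real tool would call the calendar API here.
    return f"Scheduled {start_time} on {meeting_date} with {attendee_email}"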

Agent Design Is Iterative

Unlike static software, agent behavior in agentic systems is not fixed; it matures over time through a continuous cycle of observation, evaluation, and refinement. The process typically begins with implementing straightforward, high-frequency user scenarios—those “happy path” interactions where the agent’s responses can be easily anticipated and validated. Once deployed in a safe testing environment, the agent’s behavior is closely monitored for unexpected answers, user confusion, or any breaches of policy guidelines.

As issues are observed, the agent is systematically improved by introducing targeted rules or refining existing logic to address problematic cases. For example, if users repeatedly decline an upsell offer but the agent continues to bring it up, a focused rule can be added to prevent this behavior within the same session. Through this deliberate, incremental tuning, the agent gradually evolves from a basic prototype into a sophisticated conversational system that is responsive, reliable, and well-aligned with both user expectations and operational constraints.

Writing Effective Guidelines

Each guideline has three key parts: a condition that determines when it applies, an action that tells the agent what to do, and the tools (if any) it may call while doing it.

Example:

await agent.create_guideline(
    condition="Customer requests a specific appointment time that's unavailable",
    action="Offer the three closest available slots as alternatives",
    tools=[get_available_slots]
)

Structured Conversations: Journeys

For complex tasks such as booking appointments, onboarding, or troubleshooting, simple guidelines alone are often insufficient. This is where Journeys become essential. Journeys provide a framework to design structured, multi-step conversational flows that guide the user through a process smoothly while maintaining a natural dialogue.

For example, a booking flow can be initiated by creating a journey with a clear title and conditions defining when it applies, such as when a customer wants to schedule an appointment. The journey then progresses through states—first asking the customer what type of service they need, then checking availability using an appropriate tool, and finally offering available time slots. This structured approach balances flexibility and control, enabling the agent to handle complex interactions efficiently without losing the conversational feel.

Example: Booking Flow

booking_journey = await agent.create_journey(
    title="Book Appointment",
    conditions=["Customer wants to schedule an appointment"],
    description="Guide customer through the booking process"
)

t1 = await booking_journey.initial_state.transition_to(
    chat_state="Ask what type of service they need"
)
t2 = await t1.target.transition_to(
    tool_state=check_availability_for_service
)
t3 = await t2.target.transition_to(
    chat_state="Offer available time slots"
)

Balancing Flexibility and Predictability

Balancing flexibility and predictability is essential when designing an AI agent. The agent should feel natural and conversational, rather than overly scripted, but it must still operate within safe and consistent boundaries. 

If instructions are too rigid—for example, telling the agent to “Say exactly: ‘Our premium plan is $99/month‘”—the interaction can feel mechanical and unnatural. On the other hand, instructions that are too vague, such as “Help them understand our pricing“, can lead to unpredictable or inconsistent responses. 

A balanced approach provides clear direction while allowing the agent some adaptability, for example: “Explain our pricing tiers clearly, highlight the value, and ask about the customer’s needs to recommend the best fit.” This ensures the agent remains both reliable and engaging in its interactions.
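
Expressed with the create_guideline pattern used earlier in this article, the balanced pricing instruction might look like the following sketch; the condition and action wording are illustrative.

# A sketch of the "balanced" pricing instruction as a guideline; wording is illustrative.
await agent.create_guideline(
    condition="Customer asks about pricing",
    action=(
        "Explain the pricing tiers clearly, highlight the value of each, "
        "and ask about the customer's needs to recommend the best fit"
    ),
)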

Designing for Real Conversations

Designing for real conversations requires recognizing that, unlike web forms, conversations are non-linear. Users may change their minds, skip steps, or move the discussion in unexpected directions. To handle this effectively, there are several key principles to follow. 

Context preservation ensures the agent keeps track of information already provided so it can respond appropriately. 

Progressive disclosure means revealing options or information gradually, rather than overwhelming the user with everything at once. 

Recovery mechanisms allow the agent to manage misunderstandings or deviations gracefully, for example by rephrasing a response or gently redirecting the conversation for clarity. 

This approach helps create interactions that feel natural, flexible, and user-friendly.

Effective agentic design means starting with core features, focusing on main tasks before tackling rare cases. It involves careful monitoring to spot any issues in the agent’s behavior. Improvements should be based on real observations, adding clear rules to guide better responses. It’s important to balance clear boundaries that keep the agent safe while allowing natural, flexible conversation. For complex tasks, use structured flows called journeys to guide multi-step interactions. Finally, be transparent about what the agent can do and its limits to set proper expectations. This simple process helps create reliable, user-friendly AI agents.
The post Agentic Design Methodology: How to Build Reliable and Human-Like AI Agents using Parlant appeared first on MarkTechPost.

Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mix …

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture)—a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).

https://arxiv.org/pdf/2510.01279

So, what exactly is new?

Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) other agents’ previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses—so stopping matters.

Adaptive early-termination: An LLM-as-Judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.

Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical “sweet spot” is ~12–15 agent styles.

https://arxiv.org/pdf/2510.01279

How does it work?

TUMIX runs a group of heterogeneous agents—text-only Chain-of-Thought, code-executing, web-searching, and guided variants—in parallel, then iterates a small number of refinement rounds where each agent conditions on the original question plus the other agents’ prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus/consistency to decide early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets; empirically, benefits saturate around 12–15 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy
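
The loop can be summarized with the following schematic Python sketch. The agent and judge interfaces (agent.answer, judge) are placeholders for illustration, not the paper’s actual code.

from collections import Counter

def tumix_loop(question, agents, judge, max_rounds=5, min_rounds=2):
    """Schematic TUMIX-style refinement loop (placeholder interfaces, not the paper's code).

    `agents` are heterogeneous policies (CoT, code, search, ...) that expose
    agent.answer(question, shared_notes) -> str, and `judge(question, answers)`
    returns True once consensus is strong enough to stop early.
    """
    notes = {}      # agent name -> latest answer shared with the committee
    answers = []
    for round_idx in range(max_rounds):
        # Each agent sees the question plus the other agents' previous answers.
        notes = {
            agent.name: agent.answer(question, shared_notes=dict(notes))
            for agent in agents
        }
        answers = list(notes.values())
        # LLM-judge early termination after a minimum number of rounds.
        if round_idx + 1 >= min_rounds and judge(question, answers):
            break
    # Simple aggregation, e.g. majority vote over the final answers.
    return Counter(answers).most_common(1)[0][0]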

Let’s discuss the results

Under comparable inference budgets to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

HLE (Humanity’s Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.)

GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset authored by domain experts.)

AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no-scaling for Pro/Flash, respectively.

https://arxiv.org/pdf/2510.01279

Our Comments

TUMIX is a great approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM-judge enables early-stop that preserves diversity and reduces token/tool spend—useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark’s finalized 2,500-question design, and the ~12–15 agent styles “sweet spot” indicates selection—not generation—is the limiting factor.

Check out the Paper.
The post Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture appeared first on MarkTechPost.

Can a Small Language Model Predict Kernel Latency, Memory, and Model A …

Researchers from Cornell and Google introduce a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings—covering GPU kernel latency, program memory usage, and even neural network accuracy and latency—without hand-engineered features. A 300M-parameter encoder–decoder initialized from T5-Gemma achieves strong rank correlations across heterogeneous tasks and languages, using a single text-to-number decoder that emits digits with constrained decoding.

What exactly is new?

Unified code-to-metric regression: One RLM predicts (i) peak memory from high-level code (Python/C/C++ and more), (ii) latency for Triton GPU kernels, and (iii) accuracy and hardware-specific latency from ONNX graphs—by reading raw text representations and decoding numeric outputs. No feature engineering, graph encoders, or zero-cost proxies are required.

Concrete results: Reported correlations include Spearman ρ ≈ 0.93 on APPS LeetCode memory, ρ ≈ 0.52 for Triton kernel latency, ρ > 0.5 average across 17 CodeNet languages, and Kendall τ ≈ 0.46 across five classic NAS spaces—competitive with and in some cases surpassing graph-based predictors.

Multi-objective decoding: Because the decoder is autoregressive, the model conditions later metrics on earlier ones (e.g., accuracy → per-device latencies), capturing realistic trade-offs along Pareto fronts.

https://arxiv.org/abs/2509.26476

Why is this important?

Performance prediction pipelines in compilers, GPU kernel selection, and NAS typically rely on bespoke features, syntax trees, or GNN encoders that are brittle to new ops/languages. Treating regression as next-token prediction over numbers standardizes the stack: tokenize inputs as plain text (source code, Triton IR, ONNX), then decode calibrated numeric strings digit-by-digit with constrained sampling. This reduces maintenance cost and improves transfer to new tasks via fine-tuning.

Data and benchmarks

Code-Regression dataset (HF): Curated to support code-to-metric tasks spanning APPS/LeetCode runs, Triton kernel latencies (KernelBook-derived), and CodeNet memory footprints.

NAS/ONNX suite: Architectures from NASBench-101/201, FBNet, Once-for-All (MB/PN/RN), Twopath, Hiaml, Inception, and NDS are exported to ONNX text to predict accuracy and device-specific latency.

How does it work?

Backbone: Encoder–decoder with a T5-Gemma encoder initialization (~300M params). Inputs are raw strings (code or ONNX). Outputs are numbers emitted as sign/exponent/mantissa digit tokens; constrained decoding enforces valid numerals and supports uncertainty via sampling.
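
To make the text-to-number idea concrete, here is a small, self-contained Python sketch of one way to encode a float as sign/exponent/mantissa digit tokens and decode it back. This is an illustrative scheme only, not the paper’s exact tokenizer or constrained-decoding logic.

import math

def encode_number(x, mantissa_digits=4):
    """Encode a float as sign / exponent / mantissa-digit tokens,
    e.g. 0.00314 -> ['+', 'E-3', '3', '1', '4', '0'].
    Illustrative scheme only, not the paper's exact tokenizer."""
    sign = "+" if x >= 0 else "-"
    x = abs(x)
    if x == 0:
        return [sign, "E+0"] + ["0"] * mantissa_digits
    exp = math.floor(math.log10(x))
    mantissa = x / 10.0 ** exp                     # in [1, 10)
    scaled = round(mantissa * 10 ** (mantissa_digits - 1))
    if scaled >= 10 ** mantissa_digits:            # rounding overflow, e.g. 9.9999 -> 10.000
        scaled //= 10
        exp += 1
    digits = f"{scaled:0{mantissa_digits}d}"
    return [sign, f"E{exp:+d}"] + list(digits)

def decode_number(tokens):
    """Inverse of encode_number under the same illustrative scheme."""
    sign = 1.0 if tokens[0] == "+" else -1.0
    exp = int(tokens[1][1:])
    digits = tokens[2:]
    mantissa = float(digits[0] + "." + "".join(digits[1:]))
    return sign * mantissa * 10.0 ** exp

print(encode_number(0.00314))                           # ['+', 'E-3', '3', '1', '4', '0']
print(round(decode_number(encode_number(0.00314)), 8))  # 0.00314

Constrained decoding in the model restricts generation to exactly this kind of numeric grammar, which is what enables valid numerals and sampling-based uncertainty over the predicted value.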

Ablations: (i) Language pretraining accelerates convergence and improves Triton latency prediction; (ii) decoder-only numeric emission outperforms MSE regression heads even with y-normalization; (iii) learned tokenizers specialized for ONNX operators increase effective context; (iv) longer contexts help; (v) scaling to a larger Gemma encoder further improves correlation with adequate tuning.

Training code. The regress-lm library provides text-to-text regression utilities, constrained decoding, and multi-task pretraining/fine-tuning recipes.

Stats that matter

APPS (Python) memory: Spearman ρ > 0.9.

CodeNet (17 languages) memory: average ρ > 0.5; strongest languages include C/C++ (~0.74–0.75).

Triton kernels (A6000) latency: ρ ≈ 0.52.

NAS ranking: average Kendall τ ≈ 0.46 across NASNet, Amoeba, PNAS, ENAS, DARTS; competitive with FLAN and GNN baselines.

Key Takeaways

Unified code-to-metric regression works. A single ~300M-parameter T5Gemma-initialized model (“RLM”) predicts: (a) memory from high-level code, (b) Triton GPU kernel latency, and (c) model accuracy + device latency from ONNX—directly from text, no hand-engineered features.

The research shows Spearman ρ > 0.9 on APPS memory, ≈0.52 on Triton latency, >0.5 average across 17 CodeNet languages, and Kendall-τ ≈ 0.46 on five NAS spaces.

Numbers are decoded as text with constraints. Instead of a regression head, RLM emits numeric tokens with constrained decoding, enabling multi-metric, autoregressive outputs (e.g., accuracy followed by multi-device latencies) and uncertainty via sampling.

The Code-Regression dataset unifies APPS/LeetCode memory, Triton kernel latency, and CodeNet memory; the regress-lm library provides the training/decoding stack.

Our Comments

It is very interesting how this work reframes performance prediction as text-to-number generation: a compact T5Gemma-initialized RLM reads source (Python/C++), Triton kernels, or ONNX graphs and emits calibrated numerics via constrained decoding. The reported correlations—APPS memory (ρ>0.9), Triton latency on RTX A6000 (~0.52), and NAS Kendall-τ ≈0.46—are strong enough to matter for compiler heuristics, kernel pruning, and multi-objective NAS triage without bespoke features or GNNs. The open dataset and library make replication straightforward and lower the barrier to fine-tuning on new hardware or languages.

Check out the Paper, GitHub Page and Dataset Card.
The post Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes appeared first on MarkTechPost.

A Coding Guide to Build an Autonomous Agentic AI for Time Series Forec …

In this tutorial, we build an advanced agentic AI system that autonomously handles time series forecasting using the Darts library combined with a lightweight HuggingFace model for reasoning. We design the agent to operate in a perception–reasoning–action cycle, where it first analyzes patterns in the data, then selects an appropriate forecasting model, generates predictions, and finally explains and visualizes the results. By walking through this pipeline, we experience how agentic AI can bring together statistical modeling and natural language reasoning to make forecasting both accurate and interpretable. Check out the FULL CODES here.

!pip install darts transformers pandas matplotlib numpy -q

import pandas as pd
import numpy as np
from darts import TimeSeries
from darts.models import ExponentialSmoothing, NaiveSeasonal, LinearRegressionModel
from darts.metrics import mape, rmse
from transformers import pipeline
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

We begin by installing and importing the essential libraries, including Darts for time series forecasting, Transformers for reasoning, and supporting packages like pandas, NumPy, and matplotlib. With these tools in place, we set up the foundation to build and run our autonomous forecasting agent. Check out the FULL CODES here.

class TimeSeriesAgent:
    """Autonomous agent for time series analysis and forecasting"""

    def __init__(self):
        print(" Initializing Agent Brain...")
        self.llm = pipeline("text-generation", model="distilgpt2", max_length=150,
                            do_sample=True, temperature=0.7)

        self.models = {
            'exponential_smoothing': ExponentialSmoothing(),
            'naive_seasonal': NaiveSeasonal(K=12),
            'linear_regression': LinearRegressionModel(lags=12)
        }
        self.selected_model = None
        self.forecast = None

    def perceive(self, data):
        """Agent perceives and analyzes the time series data"""
        print("\n PERCEPTION PHASE")
        self.ts = TimeSeries.from_dataframe(data, 'date', 'value', freq='M')

        trend = "increasing" if data['value'].iloc[-1] > data['value'].iloc[0] else "decreasing"
        volatility = data['value'].std() / data['value'].mean()
        seasonality = self._detect_seasonality(data['value'])

        analysis = {
            'length': len(data),
            'trend': trend,
            'volatility': f"{volatility:.2f}",
            'has_seasonality': seasonality,
            'mean': f"{data['value'].mean():.2f}",
            'range': f"{data['value'].min():.2f} to {data['value'].max():.2f}"
        }

        print(f" Data Points: {analysis['length']}")
        print(f" Trend: {analysis['trend'].upper()}")
        print(f" Volatility: {analysis['volatility']}")
        print(f" Seasonality: {'Detected' if seasonality else 'Not detected'}")

        return analysis

    def _detect_seasonality(self, series, threshold=0.3):
        """Simple seasonality detection"""
        if len(series) < 24:
            return False
        acf = np.correlate(series - series.mean(), series - series.mean(), mode='full')
        acf = acf[len(acf)//2:]
        acf /= acf[0]
        return np.max(acf[12:24]) > threshold if len(acf) > 24 else False

    def reason(self, analysis):
        """Agent reasons about which model to use"""
        print("\n REASONING PHASE")

        prompt = (f"Time series analysis: {analysis['length']} data points, {analysis['trend']} trend, "
                  f"volatility {analysis['volatility']}, seasonality: {analysis['has_seasonality']}. ")

        thought = self.llm(prompt, max_length=100, num_return_sequences=1)[0]['generated_text']
        print(f" Agent Thinking: {thought[:150]}...")

        if analysis['has_seasonality']:
            self.selected_model = 'naive_seasonal'
            reason = "Seasonality detected - using Naive Seasonal model"
        elif float(analysis['volatility']) > 0.3:
            self.selected_model = 'exponential_smoothing'
            reason = "High volatility - using Exponential Smoothing"
        else:
            self.selected_model = 'linear_regression'
            reason = "Stable trend - using Linear Regression"

        print(f" Decision: {reason}")
        return self.selected_model

    def act(self, horizon=12):
        """Agent takes action: trains model and generates forecast"""
        print("\n ACTION PHASE")

        train, val = self.ts[:-12], self.ts[-12:]

        model = self.models[self.selected_model]
        print(f" Training {self.selected_model}...")
        model.fit(train)

        self.forecast = model.predict(horizon)

        if len(val) > 0:
            val_pred = model.predict(len(val))
            accuracy = 100 - mape(val, val_pred)
            print(f" Validation Accuracy: {accuracy:.2f}%")

        print(f" Generated {horizon}-step forecast")
        return self.forecast

    def explain(self):
        """Agent explains its predictions"""
        print("\n EXPLANATION PHASE")

        forecast_values = self.forecast.values().flatten()
        hist_values = self.ts.values().flatten()

        change = ((forecast_values[-1] - hist_values[-1]) / hist_values[-1]) * 100
        direction = "increase" if change > 0 else "decrease"

        explanation = (f"Based on my analysis using {self.selected_model}, "
                       f"I predict a {abs(change):.1f}% {direction} in the next period. "
                       f"Forecast range: {forecast_values.min():.2f} to {forecast_values.max():.2f}. "
                       f"Historical mean was {hist_values.mean():.2f}.")

        print(f" {explanation}")

        prompt = f"Forecast summary: {explanation} Explain implications:"
        summary = self.llm(prompt, max_length=120)[0]['generated_text']
        print(f"\n Agent Summary: {summary[:200]}...")

        return explanation

    def visualize(self):
        """Agent creates visualization of its work"""
        print("\n Generating visualization...")

        plt.figure(figsize=(14, 6))

        self.ts.plot(label='Historical Data', lw=2)

        self.forecast.plot(label=f'Forecast ({self.selected_model})',
                           lw=2, linestyle='--')

        plt.title('Agentic AI Time Series Forecast', fontsize=16, fontweight='bold')
        plt.xlabel('Date', fontsize=12)
        plt.ylabel('Value', fontsize=12)
        plt.legend(loc='best', fontsize=11)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
We define a TimeSeriesAgent that thinks with a lightweight HuggingFace model and acts with a small portfolio of Darts models. We perceive patterns (trend, volatility, seasonality), reason to choose the best model, then train, forecast, and validate. Finally, we explain the prediction in plain language and visualize history versus forecast. Check out the FULL CODES here.

def create_sample_data():
    """Generate sample time series data"""
    dates = pd.date_range(start='2020-01-01', periods=48, freq='M')
    trend = np.linspace(100, 150, 48)
    seasonality = 10 * np.sin(np.linspace(0, 4*np.pi, 48))
    noise = np.random.normal(0, 3, 48)
    values = trend + seasonality + noise

    return pd.DataFrame({'date': dates, 'value': values})

We create a helper function create_sample_data() that generates synthetic time series data with a clear trend, sinusoidal seasonality, and random noise. This allows us to simulate realistic monthly data from 2020 to 2023 for testing and demonstrating the agent’s forecasting workflow. Check out the FULL CODES here.

def main():
    """Main execution: Agent autonomously handles forecasting task"""
    print("="*70)
    print(" AGENTIC AI TIME SERIES FORECASTING SYSTEM")
    print("="*70)

    print("\n Loading data...")
    data = create_sample_data()
    print(f"Loaded {len(data)} data points from 2020-01 to 2023-12")

    agent = TimeSeriesAgent()

    analysis = agent.perceive(data)
    agent.reason(analysis)
    agent.act(horizon=12)
    agent.explain()
    agent.visualize()

    print("\n" + "="*70)
    print(" AGENT COMPLETED FORECASTING TASK SUCCESSFULLY")
    print("="*70)

if __name__ == "__main__":
    main()

We define the main function that runs the full agentic AI pipeline. We load synthetic time series data, let the TimeSeriesAgent perceive patterns, reason to select the best model, act by training and forecasting, explain the results, and finally visualize them. This completes the end-to-end autonomous perception, reasoning, and action cycle.

In conclusion, we see how an autonomous agent can analyze time series data, reason about model selection, generate forecasts, and explain its predictions in natural language. By combining Darts with HuggingFace, we create a compact yet powerful framework that not only produces accurate forecasts but also clearly communicates insights. We complete the cycle with visualization, reinforcing how agentic AI makes forecasting more intuitive and interactive.

Check out the FULL CODES here.
The post A Coding Guide to Build an Autonomous Agentic AI for Time Series Forecasting with Darts and Hugging Face appeared first on MarkTechPost.

Microsoft Releases ‘Microsoft Agent Framework’: An Open-Source SD …

Microsoft released the Microsoft Agent Framework (public preview), an open-source SDK and runtime that unifies core ideas from AutoGen (agent runtime and multi-agent patterns) with Semantic Kernel (enterprise controls, state, plugins) to help teams build, deploy, and observe production-grade AI agents and multi-agent workflows. The framework is available for Python and .NET and integrates directly with Azure AI Foundry’s Agent Service for scaling and operations.

What exactly is Microsoft shipping?

A consolidated agent runtime and API surface. The Agent Framework carries forward AutoGen’s single- and multi-agent abstractions while adding Semantic Kernel’s enterprise features: thread-based state management, type safety, filters, telemetry, and broad model/embedding support. Microsoft positions it as the successor built by the same teams, rather than a replacement that abandons either project.

First-class orchestration modes. It supports agent orchestration (LLM-driven decision-making) and workflow orchestration (deterministic, business-logic multi-agent flows), enabling hybrid systems where creative planning coexists with reliable handoffs and constraints.

Pro-code and platform interoperability. The base AIAgent interface is designed to swap chat model providers and to interoperate with Azure AI Foundry Agents, OpenAI Assistants, and Copilot Studio, reducing vendor lock-in at the application layer.

Open-source, multi-language SDKs under MIT license. The GitHub repo publishes Python and .NET packages with examples and CI/CD-friendly scaffolding. AutoGen remains maintained (bug fixes, security patches) with guidance to consider Agent Framework for new builds.

Where it runs in production?

Azure AI Foundry’s Agent Service provides the managed runtime: it links models, tools, and frameworks; manages thread state; enforces content safety and identity; and wires in observability. It also supports multi-agent orchestration natively and distinguishes itself from Copilot Studio’s low-code approach by targeting complex, pro-code enterprise scenarios.

But how is it connected to ‘AI economics’?

Enterprise AI economics are dominated by token throughput, latency, failure recovery, and observability. Microsoft’s consolidation addresses those by (a) giving one runtime abstraction for agent collaboration and tool use, (b) attaching production controls—telemetry, filters, identity/networking, safety—to the same abstraction, and (c) deploying onto a managed service that handles scaling, policy, and diagnostics. This reduces the “glue code” that typically drives cost and brittleness in multi-agent systems and aligns with Azure AI Foundry’s model-catalog + toolchain approach.

Architectural notes and developer surface

Runtime & state: Agents coordinate via a runtime that handles lifecycles, identities, communication, and security boundaries—concepts inherited and formalized from AutoGen. Threads are the unit of state, enabling reproducible runs, retries, and audits.

Functions & plugins: The framework leans on Semantic Kernel’s plugin architecture and function-calling to bind tools (code interpreters, custom functions) into agent policies with typed contracts.

Model/provider flexibility: The same agent interface can target Azure OpenAI, OpenAI, local runtimes (e.g., Ollama/Foundry Local), and GitHub Models, enabling cost/performance tuning per task without rewriting orchestration logic.

Enterprise context

Microsoft frames the release as part of a broader push toward interoperable, standard-friendly “agentic” systems across Azure AI Foundry—consistent with prior statements about multi-agent collaboration, memory, and structured retrieval. Expect tighter ties to Foundry observability and governance controls as these stabilize.

Our Comments

We like this direction because it collapses two divergent stacks—AutoGen’s multi-agent runtime and Semantic Kernel’s enterprise plumbing—into one API surface with a managed path to production. The thread-based state model and OpenTelemetry hooks address the usual blind spots in agentic systems (repro, latency tracing, failure triage), and Azure AI Foundry’s Agent Service takes on identity, content safety, and tool orchestration so teams can iterate on policies instead of glue code. The Python/.NET parity and provider flexibility (Azure OpenAI, OpenAI, GitHub Models, local runtimes) also make cost/perf tuning practical without rewriting orchestration.

The post Microsoft Releases ‘Microsoft Agent Framework’: An Open-Source SDK and Runtime that Simplifies the Orchestration of Multi-Agent Systems appeared first on MarkTechPost.

AWS Open-Sources an MCP Server for Bedrock AgentCore to Streamline AI …

AWS released an open-source Model Context Protocol (MCP) server for Amazon Bedrock AgentCore, providing a direct path from natural-language prompts in agentic IDEs to deployable agents on AgentCore Runtime. The package ships with automated transformations, environment provisioning, and Gateway/tooling hooks designed to compress typical multi-step integration work into conversational commands.

So, what exactly is it?

The “AgentCore MCP server” exposes task-specific tools to a client (e.g., Kiro, Claude Code, Cursor, Amazon Q Developer CLI, or the VS Code Q plugin) and guides the assistant to: (1) minimally refactor an existing agent to the AgentCore Runtime model; (2) provision and configure the AWS environment (credentials, roles/permissions, ECR, config files); (3) wire up AgentCore Gateway for tool calls; and (4) invoke and test the deployed agent—all from the IDE’s chat surface.

Practically, the server teaches your coding assistant to convert entry points to AgentCore handlers, add bedrock_agentcore imports, generate requirements.txt, and rewrite direct agent calls into payload-based handlers compatible with Runtime. It can then call the AgentCore CLI to deploy and exercise the agent, including end-to-end calls through Gateway tools.

https://aws.amazon.com/blogs/machine-learning/accelerate-development-with-the-amazon-bedrock-agentcore-mcpserver/

How to Install? and what’s the client support?

AWS provides a one-click install flow from the GitHub repository, using a lightweight launcher (uvx) and a standard mcp.json entry that most MCP-capable clients consume. The AWS team lists the expected mcp.json locations for Kiro (.kiro/settings/mcp.json), Cursor (.cursor/mcp.json), Amazon Q CLI (~/.aws/amazonq/mcp.json), and Claude Code (~/.claude/mcp.json).

The repository sits in the awslabs “mcp” mono-repo (license Apache-2.0). While the AgentCore server directory hosts the implementation, the root repo also links to broader AWS MCP resources and documentation.

Architecture guidance and the “layered” context model

AWS recommends a layered approach to give the IDE’s assistant progressively richer context: start with the agentic client, then add the AWS Documentation MCP Server, layer in framework documentation (e.g., Strands Agents, LangGraph), include the AgentCore and agent-framework SDK docs, and finally steer recurrent workflows via per-IDE “steering files.” This arrangement reduces retrieval misses and helps the assistant plan the end-to-end transform/deploy/test loop without manual context switching.

Development workflow (typical path)

Bootstrap: Use local tools or MCP servers. Either provision a Lambda target for AgentCore Gateway or deploy the server directly to AgentCore Runtime.

Author/Refactor: Start from Strands Agents or LangGraph code. The server instructs the assistant to convert handlers, imports, and dependencies for Runtime compatibility.

Deploy: The assistant looks up relevant docs and invokes the AgentCore CLI to deploy.

Test & Iterate: Invoke the agent via natural language; if tools are needed, integrate Gateway (MCP client inside the agent), redeploy (v2), and retest.

https://aws.amazon.com/blogs/machine-learning/accelerate-development-with-the-amazon-bedrock-agentcore-mcpserver/

How does it make a difference?

Most “agent frameworks” still require developers to learn cloud-specific runtimes, credentials, role policies, registries, and deployment CLIs before any useful iteration. AWS’s MCP server shifts that work into the IDE assistant and narrows the “prompt-to-production” gap. Since it’s just another MCP server, it composes with existing doc servers (AWS service docs, Strands, LangGraph) and can ride improvements in MCP-aware clients, making it a low-friction entry point for teams standardizing on Bedrock AgentCore.

Comments from MTP (Marktechpost team)

I like that AWS shipped a real MCP endpoint for AgentCore that my IDE can call directly. The uvx-based mcp.json config makes client hookup trivial (Cursor, Claude Code, Kiro, Amazon Q CLI), and the server’s tooling maps cleanly onto the AgentCore Runtime/Gateway/Memory stack while preserving existing Strands/LangGraph code paths. Practically, this collapses the prompt→refactor→deploy→test loop into a reproducible, scriptable workflow rather than bespoke glue code.

Check out the GitHub Repo and Technical details.
The post AWS Open-Sources an MCP Server for Bedrock AgentCore to Streamline AI Agent Development appeared first on MarkTechPost.