Gemma 3 27B model now available on Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

We are excited to announce the availability of Gemma 3 27B Instruct models through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, developers and data scientists can now deploy Gemma 3, a 27-billion-parameter language model, along with its specialized instruction-following versions, to help accelerate building, experimentation, and scalable deployment of generative AI solutions on AWS.
In this post, we show you how to get started with Gemma 3 27B Instruct on both Amazon Bedrock Marketplace and SageMaker JumpStart, and how to use the model’s powerful instruction-following capabilities in your applications.
Overview of Gemma 3 27B
Gemma 3 27B is a high-performance, open-weight, multimodal language model by Google designed to handle both text and image inputs with efficiency and contextual understanding. It introduces a redesigned attention architecture, enhanced multilingual support, and extended context capabilities. With its optimized memory usage and support for large input sequences, it is well-suited for complex reasoning tasks, long-form interactions, and vision-language applications. With 27 billion parameters and training on up to 6 trillion tokens of text, these models are optimized for tasks requiring advanced reasoning, multilingual capabilities, and instruction following. According to Google, Gemma 3 27B Instruct models are ideal for developers, researchers, and businesses looking to build generative AI applications such as chatbots, virtual assistants, and automated content generation tools. The following are its key features:

Multimodal input – Processes text, images, and short videos for unified reasoning across modalities
Long context support – Handles up to 128,000 tokens, enabling seamless processing of long documents, conversations, and multimedia transcripts
Multilingual support – Offers out-of-the-box support for over 35 languages, with pre-training exposure to more than 140 languages in total
Function calling – Facilitates building agentic workflows by using natural‐language interfaces to APIs
Memory-efficient inference – Offers architectural updates that reduce KV-cache usage and introduce QK-norm for faster and more accurate outputs

Key use cases for Gemma 3, as described by Google, include:

Q&A and summarization – Processing and condensing long documents or articles
Visual understanding – Image captioning, object identification, visual Q&A, and document understanding
Multilingual applications – Building AI assistants and tools across over 140 languages
Document processing – Analyzing multi-page articles or extracting information from large texts
Automated workflows – Using function calling to create AI agents that can interact with other systems

There are two primary methods for deploying Gemma 3 27B on AWS: The first approach involves using Amazon Bedrock Marketplace, which offers a streamlined way of accessing Amazon Bedrock APIs (Invoke and Converse) and tools such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, Amazon Bedrock Flows, Amazon Bedrock Guardrails, and model evaluation. The second approach is using SageMaker JumpStart, a machine learning (ML) hub that offers foundation models (FMs), built-in algorithms, and pre-built ML solutions. You can deploy pre-trained models using either the Amazon SageMaker console or SDK.
Deploy Gemma 3 27B Instruct on Amazon Bedrock Marketplace
Amazon Bedrock Marketplace offers access to over 150 specialized FMs, including Gemma 3 27B Instruct.
Prerequisites
To try the Gemma 3 27B Instruct model using Amazon Bedrock Marketplace, you need the following:

An AWS account that will contain all your AWS resources
Access to accelerated instances (GPUs) for hosting the large language models (LLMs)

Deploy the model
To deploy the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, select Model catalog.
Filter for Gemma as the provider and choose Gemma 3 27B Instruct.

Information about Gemma 3’s features, costs, and setup instructions can be found on its model overview page. This resource includes integration examples, API documentation, and programming samples. The model excels at a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. You can also access deployment guidelines and license details to begin implementing Gemma 3 in your projects.

Review the model details, pricing, and deployment guidelines, and choose Deploy to start the deployment process.

For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters) or leave it as the default name that is pre-populated.
For Number of instances, enter a number of instances (between 1–100).
Select your preferred instance type, with GPU-powered options like ml.g5.48xlarge being particularly well-suited for running Gemma 3 efficiently.

Although default configurations are typically sufficient for basic needs, you have the option to customize security features such as virtual private cloud (VPC) networking, role-based permissions, and data encryption. These advanced settings might require adjustment for production environments to maintain compliance with your organization’s security protocols.

Prior to deploying Gemma 3, verify that your AWS account has sufficient quota allocation for ml.g5.48xlarge instances. A quota set to 0 will trigger deployment failures, as shown in the following screenshot.

To request a quota increase, open the AWS Service Quotas console and search for SageMaker. Locate ml.g5.48xlarge for endpoint usage and choose Request quota increase, then specify your required limit value.
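
If you prefer to check or request this quota programmatically, the following is a minimal sketch using the Service Quotas API through boto3. The quota name string, Region, and desired value are assumptions based on the console wording, so confirm them before submitting a real request.

import boto3

# Sketch: look up the SageMaker endpoint quota for ml.g5.48xlarge and (optionally) request an increase.
# The quota name below is an assumption; verify it in the Service Quotas console first.
quotas = boto3.client("service-quotas", region_name="us-east-2")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "ml.g5.48xlarge for endpoint usage":
            print(quota["QuotaCode"], "current value:", quota["Value"])
            # Uncomment to submit the increase request once you have confirmed the quota code
            # quotas.request_service_quota_increase(
            #     ServiceCode="sagemaker",
            #     QuotaCode=quota["QuotaCode"],
            #     DesiredValue=1,
            # )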

While the deployment is in progress, you can choose Managed deployments in the navigation pane to monitor the deployment status.
When deployment is complete, you can test Gemma 3’s capabilities directly in the Amazon Bedrock playground by selecting the managed deployment and choosing Open in playground.

You can now use the playground to interact with Gemma 3.

For detailed steps and example code for invoking the model using Amazon Bedrock APIs, refer to Submit prompts and generate response using the API and the following code:

import boto3
bedrock_runtime = boto3.client("bedrock-runtime")
endpoint_arn = "arn:aws:sagemaker:us-east-2:061519324070:endpoint/endpoint-quick-start-3t7kp"
response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[
        {
            "role": "user",
            "content": [{"text": "What is Amazon doing in the field of generative AI?"}]
        }
    ],
    inferenceConfig={
        "maxTokens": 256,
        "temperature": 0.1,
        "topP": 0.999
    }
)
print(response["output"]["message"]["content"][0]["text"])
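
Because Gemma 3 is multimodal, the same Converse API can also carry image content blocks alongside text. The following is a rough sketch only; the image file is a placeholder, and whether your Marketplace deployment accepts image input through Converse depends on the endpoint configuration.

# Sketch: send an image plus a text prompt to the same managed deployment (image file is a placeholder)
with open("chart.png", "rb") as f:
    image_bytes = f.read()

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": "Describe what this image shows."}
            ]
        }
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.1}
)
print(response["output"]["message"]["content"][0]["text"])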

Deploy Gemma 3 27B Instruct with SageMaker JumpStart
SageMaker JumpStart offers access to a broad selection of publicly available FMs. These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.
With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and can be isolated within your VPC. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.
There are two ways to deploy the Gemma 3 model using SageMaker JumpStart:

Through the user-friendly SageMaker JumpStart interface
Using the SageMaker Python SDK for programmatic deployment

We examine both deployment methods to help you determine which approach aligns best with your requirements.
Prerequisites
To try the Gemma 3 27B Instruct model in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
Access to Amazon SageMaker Studio and a SageMaker AI notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Deploy the model through the SageMaker JumpStart UI
SageMaker JumpStart provides a user-friendly interface for deploying pre-built ML models with just a few clicks. Through the SageMaker JumpStart UI, you can select, customize, and deploy a wide range of models for various tasks such as image classification, object detection, and natural language processing, without the need for extensive coding or ML expertise.

On the SageMaker AI console, choose Studio in the navigation pane.
First-time users will be prompted to create a domain.
On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

Search for Gemma 3 to view the Gemma 3 model card. Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)
The Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

Choose the model card to view the model details page.

The model details page includes the following information:

The model name and provider information
The Deploy button to deploy the model
About and Notebooks tabs with detailed information. The About tab includes important details, such as:
Model description
License information
Technical specifications
Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

Choose Deploy to proceed with deployment.
For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters) or leave it as default.
For Instance type, choose an instance type (default: ml.g5.48xlarge).
For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.
Deploy the model programmatically using the SageMaker Python SDK
To use Gemma 3 with the SageMaker Python SDK, first make sure you have installed the SDK and set up your AWS permissions and environment correctly. The following is a code example showing how to programmatically deploy and run inference with Gemma 3:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker import Session, image_uris
import boto3
# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()
# Specify model parameters
model_id = "huggingface-vlm-gemma-3-27b-instruct"  # or "huggingface-llm-gemma-2b" for the smaller version
instance_type = "ml.g5.48xlarge"  # Choose appropriate instance based on your needs
# Create and deploy the model
model = JumpStartModel(
    model_id=model_id,
    role=role,
    instance_type=instance_type,
    model_version="*",  # Latest version
)
# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True  # Required for deploying foundation models
)
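
After deploy() returns, the predictor object can invoke the endpoint directly, which is a quick way to smoke-test the deployment before moving to the lower-level runtime API shown in the next section. This is a minimal sketch; the payload schema mirrors the inference example that follows.

# Quick test of the freshly deployed endpoint via the JumpStart predictor (illustrative)
payload = {
    "inputs": "What is Amazon doing in the field of generative AI?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.1}
}
result = predictor.predict(payload)
print(result)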

Run inference using the SageMaker API
With your Gemma 3 model successfully deployed as a SageMaker endpoint, you’re now ready to start making predictions. The SageMaker SDK provides a straightforward way to interact with your model endpoint for inference tasks. The following code demonstrates how to format your input and make API calls to the endpoint. The code handles both sending requests to the model and processing its responses, making it straightforward to integrate Gemma 3 into your applications.

import json
import boto3
# Initialize AWS session (ensure your AWS credentials are configured)
session = boto3.Session()
sagemaker_runtime = session.client("sagemaker-runtime")
# Define the SageMaker endpoint name (replace with your deployed endpoint name)
endpoint_name = "hf-vlm-gemma-3-27b-instruct-2025-05-07-18-09-16-221"

payload = {
    "inputs": "What is Amazon doing in the field of generative AI?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.1,
        "top_p": 0.9,
        "return_full_text": False
    }
}

# Run inference
try:
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )
    # Parse the response
    result = json.loads(response["Body"].read().decode("utf-8"))
    generated_text = result[0]["generated_text"].strip()
    print("Generated Response:")
    print(generated_text)
except Exception as e:
    print(f"Error during inference: {e}")

Clean up
To avoid incurring ongoing charges for AWS resources used during exploration of the Gemma 3 27B Instruct model, it’s important to clean up deployed endpoints and associated resources. Complete the following steps:

Delete SageMaker endpoints:

On the SageMaker console, in the navigation pane, choose Endpoints under Inference.
Select the endpoint associated with the Gemma 3 27B Instruct model (for example, gemma3-27b-instruct-endpoint).
Choose Delete and confirm the deletion. This stops the endpoint and prevents further compute charges.

Delete SageMaker models (if applicable):
On the SageMaker console, choose Models under Inference.
Select the model associated with your endpoint and choose Delete.
Verify Amazon Bedrock Marketplace resources:
On the Amazon Bedrock console, choose Model catalog in the navigation pane.
Make sure no additional endpoints are running for the Gemma 3 27B Instruct model deployed through Amazon Bedrock Marketplace.

Always verify that all endpoints are deleted after experimentation to optimize costs. Refer to the Amazon SageMaker documentation for additional guidance on managing resources.
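
If you deployed programmatically with the SageMaker Python SDK, you can also clean up from code. The following is a minimal sketch that assumes the predictor object from the earlier deployment is still in scope; otherwise, delete the endpoint by name with the low-level client.

# Clean up the endpoint created earlier (assumes `predictor` is still in scope)
predictor.delete_model()      # removes the model registration
predictor.delete_endpoint()   # deletes the endpoint (and, by default, its endpoint config), stopping charges

# Alternative: delete by name with the low-level SageMaker client
import boto3
sm_client = boto3.client("sagemaker")
# sm_client.delete_endpoint(EndpointName="<your-endpoint-name>")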
Conclusion
The availability of Gemma 3 27B Instruct models in Amazon Bedrock Marketplace and SageMaker JumpStart empowers developers, researchers, and businesses to build cutting-edge generative AI applications with ease. With their high performance, multilingual capabilities, and efficient deployment on AWS infrastructure, these models are well-suited for a wide range of use cases, from conversational AI to code generation and content automation. By using the seamless discovery and deployment capabilities of SageMaker JumpStart and Amazon Bedrock Marketplace, you can accelerate your AI innovation while benefiting from the secure, scalable, and cost-effective AWS Cloud infrastructure.
We encourage you to explore the Gemma 3 27B Instruct models today by visiting the SageMaker JumpStart console or Amazon Bedrock Marketplace. Deploy the model and experiment with sample prompts to meet your specific needs. For further learning, explore the AWS Machine Learning Blog, the SageMaker JumpStart GitHub repository, and the Amazon Bedrock documentation. Start building your next generative AI solution with Gemma 3 27B Instruct models and unlock new possibilities with AWS!

About the Authors
Santosh Vallurupalli is a Sr. Solutions Architect at AWS. Santosh specializes in networking, containers, and migrations, and enjoys helping customers in their journey of cloud adoption and building cloud-based solutions for challenging issues. In his spare time, he likes traveling, watching Formula1, and watching The Office on repeat.
Aravind Singirikonda is an AI/ML Solutions Architect at AWS. He works with AWS customers in the healthcare and life sciences domain to provide guidance and technical assistance, helping them improve the value of their AI/ML solutions when using AWS.
Pawan Matta is a Sr. Solutions Architect at AWS. He works with AWS customers in the gaming industry and guides them to deploy highly scalable, performant architectures. His area of focus is management and governance. In his free time, he likes to play FIFA and watch cricket.
Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and GTM. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.

GuardianGamer scales family-safe cloud gaming with AWS

This blog post is co-written with Heidi Vogel Brockmann and Ronald Brockmann from GuardianGamer.

Millions of families face a common challenge: how to keep children safe in online gaming without sacrificing the joy and social connection these games provide.
In this post, we share how GuardianGamer—a member of the AWS Activate startup community—has built a cloud gaming platform that helps parents better understand and engage with their children’s gaming experiences using AWS services. Built specifically for families with children under 13, GuardianGamer uses AWS services including Amazon Nova and Amazon Bedrock to deliver a scalable and efficient supervision platform. The team uses Amazon Nova for intelligent narrative generation to provide parents with meaningful insights into their children’s gaming activities and social interactions, while maintaining a non-intrusive approach to monitoring.
The challenge: Monitoring children’s online gaming experiences
Monitoring children’s online gaming activities has been overwhelming for parents, who have had little visibility and limited control. GuardianGamer fills a significant void in the market by giving parents a way to effectively monitor their children’s gaming activities without being intrusive.
Traditional parental controls were primarily focused on blocking content rather than providing valuable data related to their children’s gaming experiences and social interactions. This led GuardianGamer’s founders to develop a better solution—one that uses AI to summarize gameplay and chat interactions, helping parents better understand and engage with their children’s gaming activities in a non-intrusive way, by using short video reels, while also helping identify potential safety concerns.
Creating connected experiences for parent and child
GuardianGamer is a cloud gaming platform built specifically for families with pre-teen children under 13, combining seamless gaming experiences with comprehensive parental insights. Built on AWS and using Amazon Nova for intelligent narrative generation, the platform streams popular games while providing parents with much-desired visibility into their children’s gaming activities and social interactions. The service prioritizes both safety and social connection through integrated private voice chat, delivering a positive gaming environment that keeps parents informed in a non-invasive way.
There are two connected experiences offered in the platform: one for parents to stay informed and one for kids to play in a highly trusted and safe GuardianGamer space.
For parents, GuardianGamer offers a comprehensive suite of parental engagement tools and insights, empowering them to stay informed and involved in their children’s online activities. Insights are generated from gaming and video understanding, and texted to parents to foster positive conversations between parents and kids. Through these tools, parents can actively manage their child’s gaming experience, enjoying a safe and balanced approach to online entertainment.
For kids, GuardianGamer offers uninterrupted gameplay with minimal latency, all while engaging in social interactions. The platform makes it possible for children to connect and play exclusively within a trusted circle of friends—each vetted and approved by parents—creating a secure digital extension of their real-world relationships. This transforms gaming sessions into natural extensions of friendships formed through school, sports, and community activities, all enhanced by advanced parental AI insights.
By seamlessly blending technology, community, and family, GuardianGamer creates a safer, more enriching digital space, called “The Trusted Way for Kids to Play.”
Solution overview
When the GuardianGamer team set out to build a platform that would help parents supervise their children’s gaming experiences across Minecraft, Roblox, and beyond, they knew they needed a cloud infrastructure partner with global reach and proven scalability. Having worked with AWS on previous projects, the team found it to be the natural choice for their ambitious vision.

“Our goal was to build a solution that could scale from zero to millions of users worldwide while maintaining low latency and high reliability—all with a small, nimble engineering team. AWS serverless architecture gave us exactly what we needed without requiring a massive DevOps investment.”
– Heidi Vogel Brockmann, founder and CEO of GuardianGamer.

The following diagram illustrates the backend’s AWS architecture.

GuardianGamer’s backend uses a fully serverless stack built on AWS Lambda, Amazon DynamoDB, Amazon Cognito, Amazon Simple Storage Service (Amazon S3), and Amazon Simple Notification Service (Amazon SNS), making it possible to expand the platform effortlessly as user adoption grows while keeping operational overhead minimal. This architecture enables the team to focus on their core innovation: AI-powered game supervision for parents, rather than infrastructure management.
The cloud gaming component presented unique challenges, requiring low-latency GPU resources positioned close to users around the world.

“Gaming is an inherently global activity, and latency can make or break the user experience. The extensive Regional presence and diverse Amazon Elastic Compute Cloud (Amazon EC2) instance types give us the flexibility to deploy gaming servers where our users are.”
– Heidi Vogel Brockmann.

The team uses Amazon Elastic File System (Amazon EFS) for efficient game state storage within each AWS Region and Amazon Elastic Container Service (Amazon ECS) for streamlined cluster management.
For the AI analysis capabilities that form the heart of GuardianGamer’s parental supervision features, the team relies on AWS Batch to coordinate analysis jobs, and Amazon Bedrock provides access to powerful large language models (LLMs).

“We’re currently using Amazon Nova Lite for summary generation and highlight video selection, which helps parents quickly understand what’s happening in their children’s gameplay without watching hours of content, just a few minutes a day to keep up to date and start informed conversations with their child,”
– Heidi Vogel Brockmann.

Results
Together, AWS and GuardianGamer have successfully scaled GuardianGamer’s cloud gaming platform to handle thousands of concurrent users across multiple game environments. The company’s recent expansion to support Roblox—in addition to its existing Minecraft capabilities—has broadened its serviceable addressable market to 160 million children and their families.

“What makes our implementation special is how we use Amazon Nova to maintain a continuous record of each child’s gaming activities across sessions. When a parent opens our app, they see a comprehensive view of their child’s digital journey, not just isolated moments.”
– Ronald Brockmann, CTO of GuardianGamer.

Conclusion
GuardianGamer demonstrates how a small, agile team can use AWS services to build a sophisticated, AI-powered gaming platform that prioritizes both child safety and parent engagement. By combining cloud gaming infrastructure across multiple Regions with the capabilities of Amazon Bedrock and Amazon Nova, GuardianGamer is pioneering a new approach to family-friendly gaming. Through continuous parent feedback and responsible AI practices, the platform delivers safer, more transparent gaming experiences while maintaining rapid innovation.

“AWS has been exceptional at bringing together diverse teams and technologies across the company to support our vision. Our state-of-the-art architecture leverages several specialized AI components, including speech analysis, video processing, and game metadata collection. We’re particularly excited about incorporating Amazon Nova, which helps us transform complex gaming data into coherent narratives for parents. With AWS as our scaling partner, we’re confident we can deliver our service to millions of families worldwide.”
–  Heidi Vogel Brockmann.

Learn more about building family-safe gaming experiences on AWS. And for further reading, check out The psychology behind why children are hooked on Minecraft and Keep kids off Roblox if you’re worried, its CEO tells parents.

About the Authors
Heidi Vogel Brockmann is the CEO & Founder at GuardianGamer AI. Heidi is an engineer and a proactive mom of four with a mission to transform digital parenting in the gaming space. Frustrated by the lack of tools available for parents with gaming kids, Heidi built the platform to enable fun for kids and peace of mind for parents.
Ronald Brockmann is the CTO of GuardianGamer AI. With extensive expertise in cloud technology and video streaming, Ronald brings decades of experience in building scalable, secure systems. A named inventor on dozens of patents, he excels at building high-performance teams and deploying products at scale. His leadership combines innovative thinking with precise execution to drive GuardianGamer’s technical vision.
Raechel Frick is a Sr Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and after-school carpool manager, demonstrating her ability to excel both in the corporate world and family life.
John D’Eufemia is an Account Manager at AWS supporting customers within Media, Entertainment, Games, and Sports. With an MBA from Clark University, where he graduated Summa Cum Laude, John brings entrepreneurial spirit to his work, having co-founded multiple ventures at Femia Holdings. His background includes significant leadership experience through his 8-year involvement with DECA Inc., where he served as both an advisor and co-founder of Clark University’s DECA chapter.

Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary Search

Chaotic systems, such as fluid dynamics or brain activity, are highly sensitive to initial conditions, making long-term predictions difficult. Even minor errors in modeling these systems can rapidly grow, which limits the effectiveness of many scientific machine learning (SciML) approaches. Traditional forecasting methods rely on models trained on specific time series or broad datasets lacking true dynamical structure. However, recent work has demonstrated the potential for local forecasting models to predict chaotic systems more accurately over longer timeframes by learning the numerical rules governing these systems. The real challenge is achieving out-of-domain generalization—creating models that can adapt and forecast new, previously unseen dynamical systems. This would require integrating prior knowledge with the ability to adapt locally. Still, the need for task-specific data constrains current methods and often overlooks key dynamical system properties such as ergodicity, channel coupling, and conserved quantities.

Machine learning for dynamical systems (MLDS) utilizes the unique properties of such systems as inductive biases. These include fixed relationships among system variables and invariant statistical measures, like strange attractors or conserved quantities. MLDS models use these properties to build more accurate and generalizable models, sometimes incorporating probabilistic or latent variable techniques. While datasets of dynamical systems have been curated and new systems are often generated by tweaking parameters or using symbolic methods, these approaches typically don’t ensure diverse or stable dynamics. Structural stability is a challenge—small changes may not yield new behaviors, while large ones can cause trivial dynamics. Foundation models aim to address this by enabling transfer learning and zero-shot inference. Still, most current models perform comparably to standard time series models or are limited in generating meaningful, dynamic variety. Some progress has been made through techniques like embedding spaces or symbolic discovery, but a richer, more diverse sampling of dynamical behaviors remains an open challenge. 

Researchers at the Oden Institute, UT Austin, introduce Panda (Patched Attention for Nonlinear Dynamics), a pretrained model trained solely on synthetic data from 20,000 algorithmically-generated chaotic systems. These systems were created using an evolutionary algorithm based on known chaotic ODEs. Despite training only on low-dimensional ODEs, Panda shows strong zero-shot forecasting on real-world nonlinear systems—including fluid dynamics and electrophysiology—and unexpectedly generalizes to PDEs. The model incorporates innovations like masked pretraining, channel attention, and kernelized patching to capture dynamical structure. A neural scaling law also emerges, linking Panda’s forecasting performance to the diversity of training systems. 

The researchers generated 20,000 new chaotic systems using a genetic algorithm that evolves from a curated set of 135 known chaotic ODEs. These systems are mutated and recombined using a skew product approach, with only truly chaotic behaviors retained through rigorous tests. Augmentations like time-delay embeddings and affine transformations expand the dataset while preserving its dynamics. A separate set of 9,300 unseen systems is held out for zero-shot testing. The model, Panda, is built on PatchTST and enhanced with features like channel attention, temporal-channel attention layers, and dynamic embeddings using polynomial and Fourier features, inspired by Koopman operator theory. 

Panda demonstrates strong zero-shot forecasting capabilities on unseen nonlinear dynamical systems, outperforming models like Chronos-SFT across various metrics and prediction horizons. Trained solely on 3D systems, it generalizes to higher-dimensional ones due to channel attention. Despite never encountering PDEs during training, Panda also succeeds on real-world experimental data and chaotic PDEs, such as the Kuramoto-Sivashinsky and von Kármán vortex street. Architectural ablations confirm the importance of channel attention and dynamics embeddings. The model exhibits neural scaling with increased dynamical system diversity and forms interpretable attention patterns, suggesting resonance and attractor-sensitive structure. This indicates Panda’s broad generalization across complex dynamical behaviors. 

In conclusion, Panda is a pretrained model designed to uncover generalizable patterns in dynamical systems. Trained on a large, diverse set of synthetic chaotic systems, Panda demonstrates strong zero-shot forecasting on unseen real-world data and even partial differential equations, despite only being trained on low-dimensional ODEs. Its performance improves with system diversity, revealing a neural scaling law. The model also shows emergent nonlinear resonance in attention patterns. While focused on low-dimensional dynamics, the approach may extend to higher-dimensional systems by leveraging sparse interactions. Future directions include alternative pretraining strategies to improve rollout performance when forecasting chaotic behaviors.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post Researchers at UT Austin Introduce Panda: A Foundation Model for Nonlinear Dynamics Pretrained on 20,000 Chaotic ODE Discovered via Evolutionary Search appeared first on MarkTechPost.

This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks

Neural networks have long been powerful tools for handling complex data-driven tasks. Still, they often struggle to make discrete decisions under strict constraints, like routing vehicles or scheduling jobs. These discrete decision problems, commonly found in operations research, are computationally intensive and difficult to integrate into the smooth, continuous frameworks of neural networks. Such challenges limit the ability to combine learning-based models with combinatorial reasoning, creating a bottleneck in applications that demand both.

A major issue arises when integrating discrete combinatorial solvers with gradient-based learning systems. Many combinatorial problems are NP-hard, meaning exact solutions are generally intractable to compute within a reasonable time for large instances. Existing strategies often depend on exact solvers or introduce continuous relaxations, which may not provide solutions that respect the hard constraints of the original problem. These approaches typically involve heavy computational costs, and when exact oracles are unavailable, the methods fail to deliver consistent gradients for learning. This creates a gap where neural networks can learn representations but cannot reliably make complex, structured decisions in a way that scales.

Commonly used methods rely on exact solvers for structured inference tasks, such as MAP solvers in graphical models or linear programming relaxations. These methods often require repeated oracle calls during each training iteration and depend on specific problem formulations. Techniques like Fenchel-Young losses or perturbation-based methods allow approximate learning, but their guarantees break down when used with inexact solvers like local search heuristics. This reliance on exact solutions hinders their practical use in large-scale, real-world combinatorial tasks, such as vehicle routing with dynamic requests and time windows.

Researchers from Google DeepMind and ENPC propose a novel solution by transforming local search heuristics into differentiable combinatorial layers through the lens of Markov Chain Monte Carlo (MCMC) methods. The researchers create MCMC layers that operate on discrete combinatorial spaces by mapping problem-specific neighborhood systems into proposal distributions. This design allows neural networks to integrate local search heuristics, like simulated annealing or Metropolis-Hastings, as part of the learning pipeline without access to exact solvers. Their approach enables gradient-based learning over discrete solutions by using acceptance rules that correct for the bias introduced by approximate solvers, ensuring theoretical soundness while reducing the computational burden.

In more detail, the researchers construct a framework where local search heuristics propose neighbor solutions based on the problem structure, and the acceptance rules from MCMC methods ensure these moves result in a valid sampling process over the solution space. The resulting MCMC layer approximates the target distribution of feasible solutions and provides unbiased gradients for a single iteration under a target-dependent Fenchel-Young loss. This makes it possible to perform learning even with minimal MCMC iterations, such as using a single sample per forward pass while maintaining theoretical convergence properties. By embedding this layer in a neural network, they can train models that predict parameters for combinatorial problems and improve solution quality over time.
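
To make the general pattern concrete, here is a toy, self-contained sketch of a Metropolis-style step over a discrete solution space: a local-search proposal followed by an acceptance rule. This is a generic illustration of the ingredients described above, not the paper's differentiable MCMC layer.

import math
import random

def metropolis_step(solution, cost, neighbors, temperature=1.0):
    # Propose a neighbor via a local-search move, then accept or reject it
    # with the Metropolis rule so the chain favors low-cost solutions.
    proposal = random.choice(neighbors(solution))
    delta = cost(proposal) - cost(solution)
    accept_prob = min(1.0, math.exp(-delta / temperature))
    return proposal if random.random() < accept_prob else solution

# Tiny example: minimize |x - 7| over the integers 0..20 with wrap-around +/-1 moves
cost = lambda x: abs(x - 7)
neighbors = lambda x: [(x - 1) % 21, (x + 1) % 21]
state = 0
for _ in range(200):
    state = metropolis_step(state, cost, neighbors, temperature=0.5)
print(state)  # typically ends up at or near 7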

The research team evaluated this method on a large-scale dynamic vehicle routing problem with time windows, a complex, real-world combinatorial optimization task. They showed their approach could handle large instances efficiently, significantly outperforming perturbation-based methods under limited time budgets. For example, their MCMC layer achieved a test relative cost of 5.9% compared to anticipative baselines when using a heuristic-based initialization. In comparison, the perturbation-based method achieved 6.3% under the same conditions. Even at extremely low time budgets, such as a 1 ms time limit, their method outperformed perturbation methods by a large margin—achieving 7.8% relative cost versus 65.2% for perturbation-based approaches. They also demonstrated that initializing the MCMC chain with ground-truth solutions or heuristic-enhanced states improved learning efficiency and solution quality, especially when using a small number of MCMC iterations.

This research demonstrates a principled way to integrate NP-hard combinatorial problems into neural networks without relying on exact solvers. The problem of combining learning with discrete decision-making is addressed by using MCMC layers constructed from local search heuristics, enabling theoretically sound, efficient training. The proposed method bridges the gap between deep learning and combinatorial optimization, providing a scalable and practical solution for complex tasks like vehicle routing.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post This AI Paper Introduces Differentiable MCMC Layers: A New AI Framework for Learning with Inexact Combinatorial Solvers in Neural Networks appeared first on MarkTechPost.

Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, utilizing supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it faces significant constraints due to dependence on training queries with verifiable answers. This requirement limits applications to large-scale training on general-domain queries where verification proves intractable. Further, current reward models, categorized into scalar and generative types, cannot effectively scale test-time compute for reward estimation. Existing approaches apply uniform computational resources across all inputs, lacking adaptability to allocate additional resources to challenging queries requiring nuanced analysis.

Formulation strategies and scoring schemes characterize reward models. Numeric approaches assign scalar scores to query-response pairs, while generative methods produce natural language feedback. Scoring follows absolute evaluation of individual pairs or discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies like multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, limiting their effectiveness across diverse query types.

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional computational resources when evaluating responses to complex tasks. RRMs introduce a dimension for enhancing reward modeling by scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs utilize additional test-time compute for complex queries where appropriate rewards are not immediately apparent. This encourages RRMs to self-evolve reward reasoning capabilities without explicit reasoning traces as training data.

RRMs utilize the Qwen2 model with a Transformer-decoder backbone, formulating reward modeling as text completion where RRMs autoregressively generate thinking processes followed by final judgments. Each input contains a query and two responses to determine preference without allowing ties. Researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and detail level. RRMs support multi-response evaluation through ELO rating systems and knockout tournaments, both combinable with majority voting for enhanced test-time compute utilization. This samples RRMs multiple times for pairwise comparisons, performing majority voting to obtain robust comparison results.
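
As a rough sketch of the majority-voting step described above (a generic pattern, not the authors' code), you can sample a stochastic pairwise judge several times on the same response pair and keep the preference that wins most often:

import random
from collections import Counter

def majority_vote(judge, query, response_a, response_b, n_samples=5):
    # `judge` is assumed to be any callable returning "A" or "B" for one pairwise comparison
    votes = Counter(judge(query, response_a, response_b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Toy stand-in for a reward reasoning model: prefers the longer response 80% of the time
def toy_judge(query, a, b):
    longer = "A" if len(a) >= len(b) else "B"
    return longer if random.random() < 0.8 else ("B" if longer == "A" else "A")

print(majority_vote(toy_judge, "Explain overfitting.", "Short answer.", "A longer, more detailed answer."))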

Evaluation results show that RRMs achieve competitive performance against strong baselines on RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. Comparing with DirectJudge models trained on identical data reveals substantial performance gaps, indicating RRMs effectively use test-time compute for complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models without additional test-time compute, with majority voting providing substantial improvements across evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.

In conclusion, researchers introduced RRMs to perform explicit reasoning processes before reward assignment to address computational inflexibility in existing reward modeling approaches. Rule-based-reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision. RRMs efficiently utilize test-time compute through parallel and sequential scaling approaches. The effectiveness of RRMs in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment techniques.

Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment appeared first on MarkTechPost.

Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)

Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution—and it’s already widely used:

LLMs train on AI-generated text

Fraud systems simulate edge cases

Vision models pretrain on fake images

SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

In this tutorial, we’ll use SDV to generate synthetic data step by step.

We will first install the sdv library:

pip install sdv

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames. In this case, we access the main dataset using data['data'].

from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')

We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:

The table name

The primary key

The data type of each column (e.g., categorical, numerical, datetime, etc.)

Optional column formats like datetime patterns or ID patterns

Table relationships (for multi-table setups)

Here is a sample metadata.json format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.

You can control how many rows to generate using the num_rows argument.
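
If you want to reuse the trained synthesizer later or share the generated rows, both can be persisted. The following is a minimal sketch using SDV's save/load helpers (file names are arbitrary):

# Persist the trained synthesizer and the generated rows for later reuse (file names are arbitrary)
synthesizer.save('sales_synthesizer.pkl')
synthetic_data.to_csv('synthetic_sales.csv', index=False)

# Later, reload the synthesizer and draw more rows without retraining
loaded_synthesizer = GaussianCopulaSynthesizer.load('sales_synthesizer.pkl')
more_synthetic_rows = loaded_synthesizer.sample(num_rows=500)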

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A great place to start is by generating a quality report.
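
The returned report object can also be inspected directly. For example, the following sketch prints the overall score and a per-column breakdown (method names follow the current SDV quality report API and may differ slightly between versions):

# Inspect the quality report programmatically
print(quality_report.get_score())  # overall quality score between 0 and 1
print(quality_report.get_details(property_name='Column Shapes'))  # per-column breakdown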

You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')

plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.

In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.

Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV) appeared first on MarkTechPost.

NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks

NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following—while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks.

The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.

Model Architecture and Training Stack

Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design. The model has been optimized for performance in reasoning-intensive workloads while maintaining a lightweight parameter count.

The post-training stack for the model includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments.

This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter sizes.

Performance Benchmarks

Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains.

While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.

Edge-Ready Deployment

One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, or local developer workstations.

For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally—without relying on cloud inference APIs—can provide both cost savings and greater flexibility.

Licensing and Access

The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
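
As a rough illustration of loading the released checkpoint with the Hugging Face Transformers library (standard Transformers calls; the prompt and generation settings are arbitrary and not taken from NVIDIA's documentation):

# Illustrative only: load the open checkpoint with Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the idea of symbolic regression in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))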

Conclusion

Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience—especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks appeared first on MarkTechPost.

A Coding Implementation to Build an AI Agent with Live Python Executio …

In this tutorial, we will discover how to harness the power of an advanced AI Agent, augmented with both Python execution and result-validation capabilities, to tackle complex computational tasks. By integrating LangChain’s ReAct agent framework with Anthropic’s Claude API, we build an end-to-end solution to generate Python code and execute it live, capture its outputs, maintain execution state, and automatically verify results against expected properties or test cases. This seamless loop of “write → run → validate” empowers you to develop robust analyses, algorithms, and simple ML pipelines with confidence in every step.

!pip install langchain langchain-anthropic langchain-core anthropic

We install the core LangChain framework along with the Anthropic integration and its core utilities, ensuring you have both the agent orchestration tools (langchain, langchain-core) and the Claude-specific bindings (langchain-anthropic, anthropic) available in your environment.

import os
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool
from langchain_core.prompts import PromptTemplate
from langchain_anthropic import ChatAnthropic
import sys
import io
import re
import json
from typing import Dict, Any, List

We bring together everything needed to build our ReAct-style agent: OS access for environment variables, LangChain’s agent constructors (create_react_agent, AgentExecutor), and Tool class for defining custom actions, the PromptTemplate for crafting the chain-of-thought prompt, and Anthropic’s ChatAnthropic client for connecting to Claude. Standard Python modules (sys, io, re, json) handle I/O capture, regular expressions, and serialization, while typing provides type hints for clearer, more maintainable code.

class PythonREPLTool:
    def __init__(self):
        self.globals_dict = {
            '__builtins__': __builtins__,
            'json': json,
            're': re
        }
        self.locals_dict = {}
        self.execution_history = []

    def run(self, code: str) -> str:
        try:
            old_stdout = sys.stdout
            old_stderr = sys.stderr
            sys.stdout = captured_output = io.StringIO()
            sys.stderr = captured_error = io.StringIO()

            execution_result = None

            try:
                result = eval(code, self.globals_dict, self.locals_dict)
                execution_result = result
                if result is not None:
                    print(result)
            except SyntaxError:
                exec(code, self.globals_dict, self.locals_dict)

            output = captured_output.getvalue()
            error_output = captured_error.getvalue()

            sys.stdout = old_stdout
            sys.stderr = old_stderr

            self.execution_history.append({
                'code': code,
                'output': output,
                'result': execution_result,
                'error': error_output
            })

            response = f"**Code Executed:**\n```python\n{code}\n```\n\n"
            if error_output:
                response += f"**Errors/Warnings:**\n{error_output}\n\n"
            response += f"**Output:**\n{output if output.strip() else 'No console output'}"

            if execution_result is not None and not output.strip():
                response += f"\n**Return Value:** {execution_result}"

            return response

        except Exception as e:
            sys.stdout = old_stdout
            sys.stderr = old_stderr

            error_info = f"**Code Executed:**\n```python\n{code}\n```\n\n**Runtime Error:**\n{str(e)}\n**Error Type:** {type(e).__name__}"

            self.execution_history.append({
                'code': code,
                'output': '',
                'result': None,
                'error': str(e)
            })

            return error_info

    def get_execution_history(self) -> List[Dict[str, Any]]:
        return self.execution_history

    def clear_history(self):
        self.execution_history = []

This PythonREPLTool encapsulates a stateful in‐process Python REPL: it captures and executes arbitrary code (evaluating expressions or running statements), redirects stdout/stderr to record outputs and errors, and maintains a history of each execution. Returning a formatted summary, including the executed code, any console output or errors, and return values, provides transparent, reproducible feedback for every snippet run within our agent.
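As a quick illustration of that statefulness (not part of the original notebook flow), you can exercise the tool directly before wiring it into the agent:

repl = PythonREPLTool()
print(repl.run("x = 21 * 2"))             # assignment: eval fails, exec runs, state is stored
print(repl.run("x + 1"))                  # expression: evaluated against the stored state
print(len(repl.get_execution_history()))  # -> 2 recorded executions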

class ResultValidator:
    def __init__(self, python_repl: PythonREPLTool):
        self.python_repl = python_repl

    def validate_mathematical_result(self, description: str, expected_properties: Dict[str, Any]) -> str:
        """Validate mathematical computations"""
        validation_code = f"""
# Validation for: {description}
validation_results = {{}}

# Get the last execution results
history = {self.python_repl.execution_history}
if history:
    last_execution = history[-1]
    print(f"Last execution output: {{last_execution['output']}}")

    # Extract numbers from the output
    import re
    numbers = re.findall(r'\\d+(?:\\.\\d+)?', last_execution['output'])
    if numbers:
        numbers = [float(n) for n in numbers]
        validation_results['extracted_numbers'] = numbers

    # Validate expected properties
    for prop, expected_value in {expected_properties}.items():
        if prop == 'count':
            actual_count = len(numbers)
            validation_results['count_check'] = actual_count == expected_value
            print(f"Count validation: Expected {{expected_value}}, Got {{actual_count}}")
        elif prop == 'max_value':
            if numbers:
                max_val = max(numbers)
                validation_results['max_check'] = max_val <= expected_value
                print(f"Max value validation: {{max_val}} <= {{expected_value}} = {{max_val <= expected_value}}")
        elif prop == 'min_value':
            if numbers:
                min_val = min(numbers)
                validation_results['min_check'] = min_val >= expected_value
                print(f"Min value validation: {{min_val}} >= {{expected_value}} = {{min_val >= expected_value}}")
        elif prop == 'sum_range':
            if numbers:
                total = sum(numbers)
                min_sum, max_sum = expected_value
                validation_results['sum_check'] = min_sum <= total <= max_sum
                print(f"Sum validation: {{min_sum}} <= {{total}} <= {{max_sum}} = {{min_sum <= total <= max_sum}}")

print("\\nValidation Summary:")
for key, value in validation_results.items():
    print(f"{{key}}: {{value}}")

validation_results
"""
        return self.python_repl.run(validation_code)

    def validate_data_analysis(self, description: str, expected_structure: Dict[str, Any]) -> str:
        """Validate data analysis results"""
        validation_code = f"""
# Data Analysis Validation for: {description}
validation_results = {{}}

# Check if required variables exist in global scope
required_vars = {list(expected_structure.keys())}
existing_vars = []

for var_name in required_vars:
    if var_name in globals():
        existing_vars.append(var_name)
        var_value = globals()[var_name]
        validation_results[f'{{var_name}}_exists'] = True
        validation_results[f'{{var_name}}_type'] = type(var_value).__name__

        # Type-specific validations
        if isinstance(var_value, (list, tuple)):
            validation_results[f'{{var_name}}_length'] = len(var_value)
        elif isinstance(var_value, dict):
            validation_results[f'{{var_name}}_keys'] = list(var_value.keys())
        elif isinstance(var_value, (int, float)):
            validation_results[f'{{var_name}}_value'] = var_value

        print(f"✓ Variable '{{var_name}}' found: {{type(var_value).__name__}} = {{var_value}}")
    else:
        validation_results[f'{{var_name}}_exists'] = False
        print(f"✗ Variable '{{var_name}}' not found")

print(f"\\nFound {{len(existing_vars)}}/{{len(required_vars)}} required variables")

# Additional structure validation
for var_name, expected_type in {expected_structure}.items():
    if var_name in globals():
        actual_type = type(globals()[var_name]).__name__
        validation_results[f'{{var_name}}_type_match'] = actual_type == expected_type
        print(f"Type check '{{var_name}}': Expected {{expected_type}}, Got {{actual_type}}")

validation_results
"""
        return self.python_repl.run(validation_code)

    def validate_algorithm_correctness(self, description: str, test_cases: List[Dict[str, Any]]) -> str:
        """Validate algorithm implementations with test cases"""
        validation_code = f"""
# Algorithm Validation for: {description}
validation_results = {{}}
test_results = []

test_cases = {test_cases}

for i, test_case in enumerate(test_cases):
    test_name = test_case.get('name', f'Test {{i+1}}')
    input_val = test_case.get('input')
    expected = test_case.get('expected')
    function_name = test_case.get('function')

    print(f"\\nRunning {{test_name}}:")
    print(f"Input: {{input_val}}")
    print(f"Expected: {{expected}}")

    try:
        if function_name and function_name in globals():
            func = globals()[function_name]
            if callable(func):
                if isinstance(input_val, (list, tuple)):
                    result = func(*input_val)
                else:
                    result = func(input_val)

                passed = result == expected
                test_results.append({{
                    'test_name': test_name,
                    'input': input_val,
                    'expected': expected,
                    'actual': result,
                    'passed': passed
                }})

                status = "✓ PASS" if passed else "✗ FAIL"
                print(f"Actual: {{result}}")
                print(f"Status: {{status}}")
            else:
                print(f"✗ ERROR: '{{function_name}}' is not callable")
        else:
            print(f"✗ ERROR: Function '{{function_name}}' not found")

    except Exception as e:
        print(f"✗ ERROR: {{str(e)}}")
        test_results.append({{
            'test_name': test_name,
            'error': str(e),
            'passed': False
        }})

# Summary
passed_tests = sum(1 for test in test_results if test.get('passed', False))
total_tests = len(test_results)
validation_results['tests_passed'] = passed_tests
validation_results['total_tests'] = total_tests
validation_results['success_rate'] = passed_tests / total_tests if total_tests > 0 else 0

print(f"\\n=== VALIDATION SUMMARY ===")
print(f"Tests passed: {{passed_tests}}/{{total_tests}}")
print(f"Success rate: {{validation_results['success_rate']:.1%}}")

test_results
"""
        return self.python_repl.run(validation_code)

This ResultValidator class builds on the PythonREPLTool to automatically generate and run bespoke validation routines, checking numerical properties, verifying data structures, or running algorithm test cases against the agent’s execution history. Emitting Python snippets that extract outputs, compare them to expected criteria, and summarize pass/fail results closes the loop on “execute → validate” within our agent’s workflow.

python_repl = PythonREPLTool()
validator = ResultValidator(python_repl)

Here, we instantiate our interactive Python REPL tool (python_repl) and then create a ResultValidator tied to that same REPL instance. This wiring ensures any code you execute is immediately available for automated validation steps, closing the loop on execution and correctness checking.

python_tool = Tool(
    name="python_repl",
    description="Execute Python code and return both the code and its output. Maintains state between executions.",
    func=python_repl.run
)

validation_tool = Tool(
    name="result_validator",
    description="Validate the results of previous computations with specific test cases and expected properties.",
    func=lambda query: validator.validate_mathematical_result(query, {})
)

Here, we wrap our REPL and validation methods into LangChain Tool objects, assigning them clear names and descriptions. The agent can invoke python_repl to run code and result_validator to check the last execution against your specified criteria automatically.

prompt_template = """You are Claude, an advanced AI assistant with Python execution and result validation capabilities.

You can execute Python code to solve complex problems and then validate your results to ensure accuracy.

Available tools:
{tools}

Use this format:
Question: the input question you must answer
Thought: analyze what needs to be done
Action: {tool_names}
Action Input: [your input]
Observation: [result]
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I should validate my results
Action: [validation if needed]
Action Input: [validation parameters]
Observation: [validation results]
Thought: I now have the complete answer
Final Answer: [comprehensive answer with validation confirmation]

Question: {input}
{agent_scratchpad}"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["input", "agent_scratchpad"],
    partial_variables={
        "tools": "python_repl - Execute Python code\nresult_validator - Validate computation results",
        "tool_names": "python_repl, result_validator"
    }
)

The prompt template above frames Claude as a dual-capability assistant that first reasons ("Thought"), selects from the python_repl and result_validator tools to run code and check outputs, and then iterates until it has a validated solution. By defining a clear chain-of-thought structure with placeholders for tool names and their usage, it guides the agent to: (1) break down the problem, (2) call python_repl to execute the necessary code, (3) call result_validator to confirm correctness, and finally (4) deliver a self-checked "Final Answer." This scaffolding enforces a disciplined "write → run → validate" workflow.

class AdvancedClaudeCodeAgent:
    def __init__(self, anthropic_api_key=None):
        if anthropic_api_key:
            os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

        self.llm = ChatAnthropic(
            model="claude-3-opus-20240229",
            temperature=0,
            max_tokens=4000
        )

        self.agent = create_react_agent(
            llm=self.llm,
            tools=[python_tool, validation_tool],
            prompt=prompt
        )

        self.agent_executor = AgentExecutor(
            agent=self.agent,
            tools=[python_tool, validation_tool],
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=8,
            return_intermediate_steps=True
        )

        self.python_repl = python_repl
        self.validator = validator

    def run(self, query: str) -> str:
        try:
            result = self.agent_executor.invoke({"input": query})
            return result["output"]
        except Exception as e:
            return f"Error: {str(e)}"

    def validate_last_result(self, description: str, validation_params: Dict[str, Any]) -> str:
        """Manually validate the last computation result"""
        if 'test_cases' in validation_params:
            return self.validator.validate_algorithm_correctness(description, validation_params['test_cases'])
        elif 'expected_structure' in validation_params:
            return self.validator.validate_data_analysis(description, validation_params['expected_structure'])
        else:
            return self.validator.validate_mathematical_result(description, validation_params)

    def get_execution_summary(self) -> Dict[str, Any]:
        """Get summary of all executions"""
        history = self.python_repl.get_execution_history()
        return {
            'total_executions': len(history),
            'successful_executions': len([h for h in history if not h['error']]),
            'failed_executions': len([h for h in history if h['error']]),
            'execution_details': history
        }

This AdvancedClaudeCodeAgent class wraps everything into a single, easy-to-use interface: it configures the Anthropic Claude client (using your API key), instantiates a ReAct-style agent with our python_repl and result_validator tools and the custom prompt, and sets up an executor that drives iterative “think → code → validate” loops. Its run() method lets you submit natural-language queries and returns Claude’s final, self-checked answer; validate_last_result() exposes manual hooks for additional checks; and get_execution_summary() provides a concise report on every code snippet you’ve executed (how many succeeded, failed, and their details).
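As a hypothetical usage sketch of the manual hook, you can push a snippet through the shared REPL yourself and then ask the agent to validate it; no model call is made, although the constructor still expects an Anthropic key to be provided:

python_repl.run("primes = [2, 3, 5, 7, 11]\nprint(len(primes), max(primes))")
agent = AdvancedClaudeCodeAgent(anthropic_api_key="YOUR_KEY")  # placeholder key
report = agent.validate_last_result(
    "Prime list sanity check",
    {"count": 2, "max_value": 11}   # two numbers were printed, neither above 11
)
print(report)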

if __name__ == "__main__":
    API_KEY = "Use Your Own Key Here"

    agent = AdvancedClaudeCodeAgent(anthropic_api_key=API_KEY)

    print("Advanced Claude Code Agent with Validation")
    print("=" * 60)

    print("\nExample 1: Prime Number Analysis with Twin Prime Detection")
    print("-" * 60)
    query1 = """
    Find all prime numbers between 1 and 200, then:
    1. Calculate their sum
    2. Find all twin prime pairs (primes that differ by 2)
    3. Calculate the average gap between consecutive primes
    4. Identify the largest prime gap in this range
    After computation, validate that we found the correct number of primes and that all identified numbers are actually prime.
    """
    result1 = agent.run(query1)
    print(result1)

    print("\n" + "=" * 80 + "\n")

    print("Example 2: Advanced Sales Data Analysis with Statistical Validation")
    print("-" * 60)
    query2 = """
    Create a comprehensive sales analysis:
    1. Generate sales data for 12 products across 24 months with realistic seasonal patterns
    2. Calculate monthly growth rates, yearly totals, and trend analysis
    3. Identify top 3 performing products and worst 3 performing products
    4. Perform correlation analysis between different products
    5. Create summary statistics (mean, median, standard deviation, percentiles)
    After analysis, validate the data structure, ensure all calculations are mathematically correct, and verify the statistical measures.
    """
    result2 = agent.run(query2)
    print(result2)

    print("\n" + "=" * 80 + "\n")

    print("Example 3: Advanced Algorithm Implementation with Test Suite")
    print("-" * 60)
    query3 = """
    Implement and validate a comprehensive sorting and searching system:
    1. Implement quicksort, mergesort, and binary search algorithms
    2. Create test data with various edge cases (empty lists, single elements, duplicates, sorted/reverse sorted)
    3. Benchmark the performance of different sorting algorithms
    4. Implement a function to find the kth largest element using different approaches
    5. Test all implementations with comprehensive test cases including edge cases
    After implementation, validate each algorithm with multiple test cases to ensure correctness.
    """
    result3 = agent.run(query3)
    print(result3)

    print("\n" + "=" * 80 + "\n")

    print("Example 4: Machine Learning Model with Cross-Validation")
    print("-" * 60)
    query4 = """
    Build a complete machine learning pipeline:
    1. Generate a synthetic dataset with features and target variable (classification problem)
    2. Implement data preprocessing (normalization, feature scaling)
    3. Implement a simple linear classifier from scratch (gradient descent)
    4. Split data into train/validation/test sets
    5. Train the model and evaluate performance (accuracy, precision, recall)
    6. Implement k-fold cross-validation
    7. Compare results with different hyperparameters
    Validate the entire pipeline by ensuring mathematical correctness of gradient descent, proper data splitting, and realistic performance metrics.
    """
    result4 = agent.run(query4)
    print(result4)

    print("\n" + "=" * 80 + "\n")

    print("Execution Summary")
    print("-" * 60)
    summary = agent.get_execution_summary()
    print(f"Total code executions: {summary['total_executions']}")
    print(f"Successful executions: {summary['successful_executions']}")
    print(f"Failed executions: {summary['failed_executions']}")

    if summary['failed_executions'] > 0:
        print("\nFailed executions details:")
        for i, execution in enumerate(summary['execution_details']):
            if execution['error']:
                print(f"  {i+1}. Error: {execution['error']}")

    print(f"\nSuccess rate: {(summary['successful_executions']/summary['total_executions']*100):.1f}%")

Finally, we instantiate the AdvancedClaudeCodeAgent with your Anthropic API key, run four illustrative example queries (covering prime-number analysis, sales data analytics, algorithm implementations, and a simple ML pipeline), and print each validated result. The script then gathers and displays a concise execution summary (total runs, successes, failures, and error details), demonstrating the agent's live "write → run → validate" workflow.

In conclusion, we have developed a versatile AdvancedClaudeCodeAgent capable of seamlessly blending generative reasoning with precise computational control. At its core, this Agent doesn’t just draft Python snippets; it runs them on the spot and checks their correctness against your specified criteria, closing the feedback loop automatically. Whether you’re performing prime-number analyses, statistical data evaluations, algorithm benchmarking, or end-to-end ML workflows, this pattern ensures reliability and reproducibility.

Check out the Notebook on GitHub.

Step-by-Step Guide to Build a Customizable Multi-Tool AI Agent with LangGraph and Claude for Dynamic Agent Creation

In this comprehensive tutorial, we guide users through creating a powerful multi-tool AI agent using LangGraph and Claude, optimized for diverse tasks including mathematical computations, web searches, weather inquiries, text analysis, and real-time information retrieval. The tutorial begins by simplifying dependency installation to ensure an effortless setup, even for beginners. Users are then introduced to structured implementations of specialized tools, such as a safe calculator, an efficient web-search utility leveraging DuckDuckGo, a mock weather information provider, a detailed text analyzer, and a time-fetching function. The tutorial also clearly delineates how these tools are integrated within a sophisticated agent architecture built using LangGraph, illustrating practical usage through interactive examples and clear explanations so that both beginners and advanced developers can rapidly deploy custom multi-functional AI agents.

import subprocess
import sys

def install_packages():
    packages = [
        "langgraph",
        "langchain",
        "langchain-anthropic",
        "langchain-community",
        "requests",
        "python-dotenv",
        "duckduckgo-search"
    ]

    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package, "-q"])
            print(f"✓ Installed {package}")
        except subprocess.CalledProcessError:
            print(f"✗ Failed to install {package}")

print("Installing required packages...")
install_packages()
print("Installation complete!\n")

We automate the installation of the essential Python packages required for building a LangGraph-based multi-tool AI agent. It uses subprocess to run pip commands quietly and confirms that each package, ranging from LangGraph and LangChain components to web search and environment-handling tools, is installed successfully. This setup streamlines environment preparation, making the notebook portable and beginner-friendly.

import os
import json
import math
import requests
from typing import Dict, List, Any, Annotated, TypedDict
from datetime import datetime
import operator

from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.tools import tool
from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from duckduckgo_search import DDGS

We import all the necessary libraries and modules for constructing the multi-tool AI agent. It includes Python standard libraries such as os, json, math, and datetime for general-purpose functionality and external libraries like requests for HTTP calls and duckduckgo_search for implementing web search. The LangChain and LangGraph ecosystems bring in message types, tool decorators, state graph components, and checkpointing utilities, while ChatAnthropic enables integration with the Claude model for conversational intelligence. These imports form the foundational building blocks for defining tools, agent workflows, and interactions.

os.environ["ANTHROPIC_API_KEY"] = "Use Your API Key Here"

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

We set and retrieve the Anthropic API key required to authenticate and interact with Claude models. The os.environ line assigns your API key (which you should replace with a valid key), while os.getenv securely retrieves it for later use in model initialization. This approach ensures the key is accessible throughout the script without hardcoding it multiple times.

from typing import TypedDict

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]

@tool
def calculator(expression: str) -> str:
    """
    Perform mathematical calculations. Supports basic arithmetic, trigonometry, and more.

    Args:
        expression: Mathematical expression as a string (e.g., "2 + 3 * 4", "sin(3.14159/2)")

    Returns:
        Result of the calculation as a string
    """
    try:
        allowed_names = {
            'abs': abs, 'round': round, 'min': min, 'max': max,
            'sum': sum, 'pow': pow, 'sqrt': math.sqrt,
            'sin': math.sin, 'cos': math.cos, 'tan': math.tan,
            'log': math.log, 'log10': math.log10, 'exp': math.exp,
            'pi': math.pi, 'e': math.e
        }

        expression = expression.replace('^', '**')

        result = eval(expression, {"__builtins__": {}}, allowed_names)
        return f"Result: {result}"
    except Exception as e:
        return f"Error in calculation: {str(e)}"

We define the agent’s internal state and implement a robust calculator tool. The AgentState class uses TypedDict to structure agent memory, specifically tracking messages exchanged during the conversation. The calculator function, decorated with @tool to register it as an AI-usable utility, securely evaluates mathematical expressions. It allows for safe computation by limiting available functions to a predefined set from the math module and replacing common syntax like ^ with Python’s exponentiation operator. This ensures the tool can handle simple arithmetic and advanced functions like trigonometry or logarithms while preventing unsafe code execution.

@tool
def web_search(query: str, num_results: int = 3) -> str:
    """
    Search the web for information using DuckDuckGo.

    Args:
        query: Search query string
        num_results: Number of results to return (default: 3, max: 10)

    Returns:
        Search results as formatted string
    """
    try:
        num_results = min(max(num_results, 1), 10)

        with DDGS() as ddgs:
            results = list(ddgs.text(query, max_results=num_results))

        if not results:
            return f"No search results found for: {query}"

        formatted_results = f"Search results for '{query}':\n\n"
        for i, result in enumerate(results, 1):
            formatted_results += f"{i}. **{result['title']}**\n"
            formatted_results += f"   {result['body']}\n"
            formatted_results += f"   Source: {result['href']}\n\n"

        return formatted_results
    except Exception as e:
        return f"Error performing web search: {str(e)}"

We define a web_search tool that enables the agent to fetch real-time information from the internet using the DuckDuckGo Search API via the duckduckgo_search Python package. The tool accepts a search query and an optional num_results parameter, ensuring that the number of results returned is between 1 and 10. It opens a DuckDuckGo search session, retrieves the results, and formats them neatly for user-friendly display. If no results are found or an error occurs, the function handles it gracefully by returning an informative message. This tool equips the agent with real-time search capabilities, enhancing responsiveness and utility.

@tool
def weather_info(city: str) -> str:
    """
    Get current weather information for a city using OpenWeatherMap API.
    Note: This is a mock implementation for demo purposes.

    Args:
        city: Name of the city

    Returns:
        Weather information as a string
    """
    mock_weather = {
        "new york": {"temp": 22, "condition": "Partly Cloudy", "humidity": 65},
        "london": {"temp": 15, "condition": "Rainy", "humidity": 80},
        "tokyo": {"temp": 28, "condition": "Sunny", "humidity": 70},
        "paris": {"temp": 18, "condition": "Overcast", "humidity": 75}
    }

    city_lower = city.lower()
    if city_lower in mock_weather:
        weather = mock_weather[city_lower]
        return (f"Weather in {city}:\n"
                f"Temperature: {weather['temp']}°C\n"
                f"Condition: {weather['condition']}\n"
                f"Humidity: {weather['humidity']}%")
    else:
        return f"Weather data not available for {city}. (This is a demo with limited cities: New York, London, Tokyo, Paris)"

We define a weather_info tool that simulates retrieving current weather data for a given city. While it does not connect to a live weather API, it uses a predefined dictionary of mock data for major cities like New York, London, Tokyo, and Paris. Upon receiving a city name, the function normalizes it to lowercase and checks for its presence in the mock dataset. It returns temperature, weather condition, and humidity in a readable format if found. Otherwise, it notifies the user that weather data is unavailable. This tool serves as a placeholder and can later be upgraded to fetch live data from an actual weather API.

@tool
def text_analyzer(text: str) -> str:
    """
    Analyze text and provide statistics like word count, character count, etc.

    Args:
        text: Text to analyze

    Returns:
        Text analysis results
    """
    if not text.strip():
        return "Please provide text to analyze."

    words = text.split()
    sentences = text.split('.') + text.split('!') + text.split('?')
    sentences = [s.strip() for s in sentences if s.strip()]

    analysis = f"Text Analysis Results:\n"
    analysis += f"• Characters (with spaces): {len(text)}\n"
    analysis += f"• Characters (without spaces): {len(text.replace(' ', ''))}\n"
    analysis += f"• Words: {len(words)}\n"
    analysis += f"• Sentences: {len(sentences)}\n"
    analysis += f"• Average words per sentence: {len(words) / max(len(sentences), 1):.1f}\n"
    analysis += f"• Most common word: {max(set(words), key=words.count) if words else 'N/A'}"

    return analysis

The text_analyzer tool provides a detailed statistical analysis of a given text input. It calculates metrics such as character count (with and without spaces), word count, sentence count, and average words per sentence, and it identifies the most frequently occurring word. The tool handles empty input gracefully by prompting the user to provide valid text. It uses simple string operations and Python’s set and max functions to extract meaningful insights. It is a valuable utility for language analysis or content quality checks in the AI agent’s toolkit.

@tool
def current_time() -> str:
    """
    Get the current date and time.

    Returns:
        Current date and time as a formatted string
    """
    now = datetime.now()
    return f"Current date and time: {now.strftime('%Y-%m-%d %H:%M:%S')}"

The current_time tool provides a straightforward way to retrieve the current system date and time in a human-readable format. Using Python’s datetime module, it captures the present moment and formats it as YYYY-MM-DD HH:MM:SS. This utility is particularly useful for time-stamping responses or answering user queries about the current date and time within the AI agent’s interaction flow.

tools = [calculator, web_search, weather_info, text_analyzer, current_time]

def create_llm():
    if ANTHROPIC_API_KEY:
        return ChatAnthropic(
            model="claude-3-haiku-20240307",
            temperature=0.1,
            max_tokens=1024
        )
    else:
        class MockLLM:
            def invoke(self, messages):
                last_message = messages[-1].content if messages else ""

                if any(word in last_message.lower() for word in ['calculate', 'math', '+', '-', '*', '/', 'sqrt', 'sin', 'cos']):
                    import re
                    numbers = re.findall(r'[\d+\-*/.()\s\w]+', last_message)
                    expr = numbers[0] if numbers else "2+2"
                    return AIMessage(content="I'll help you with that calculation.",
                                     tool_calls=[{"name": "calculator", "args": {"expression": expr.strip()}, "id": "calc1"}])
                elif any(word in last_message.lower() for word in ['search', 'find', 'look up', 'information about']):
                    query = last_message.replace('search for', '').replace('find', '').replace('look up', '').strip()
                    if not query or len(query) < 3:
                        query = "python programming"
                    return AIMessage(content="I'll search for that information.",
                                     tool_calls=[{"name": "web_search", "args": {"query": query}, "id": "search1"}])
                elif any(word in last_message.lower() for word in ['weather', 'temperature']):
                    city = "New York"
                    words = last_message.lower().split()
                    for i, word in enumerate(words):
                        if word == 'in' and i + 1 < len(words):
                            city = words[i + 1].title()
                            break
                    return AIMessage(content="I'll get the weather information.",
                                     tool_calls=[{"name": "weather_info", "args": {"city": city}, "id": "weather1"}])
                elif any(word in last_message.lower() for word in ['time', 'date']):
                    return AIMessage(content="I'll get the current time.",
                                     tool_calls=[{"name": "current_time", "args": {}, "id": "time1"}])
                elif any(word in last_message.lower() for word in ['analyze', 'analysis']):
                    text = last_message.replace('analyze this text:', '').replace('analyze', '').strip()
                    if not text:
                        text = "Sample text for analysis"
                    return AIMessage(content="I'll analyze that text for you.",
                                     tool_calls=[{"name": "text_analyzer", "args": {"text": text}, "id": "analyze1"}])
                else:
                    return AIMessage(content="Hello! I'm a multi-tool agent powered by Claude. I can help with:\n• Mathematical calculations\n• Web searches\n• Weather information\n• Text analysis\n• Current time/date\n\nWhat would you like me to help you with?")

            def bind_tools(self, tools):
                return self

        print("Note: Using mock LLM for demo. Add your ANTHROPIC_API_KEY for full functionality.")
        return MockLLM()

llm = create_llm()
llm_with_tools = llm.bind_tools(tools)

We initialize the language model that powers the AI agent. If a valid Anthropic API key is available, it uses the Claude 3 Haiku model for high-quality responses. Without an API key, a MockLLM is defined to simulate basic tool-routing behavior based on keyword matching, allowing the agent to function offline with limited capabilities. The bind_tools method links the defined tools to the model, enabling it to invoke them as needed.

def agent_node(state: AgentState) -> Dict[str, Any]:
    """Main agent node that processes messages and decides on tool usage."""
    messages = state["messages"]
    response = llm_with_tools.invoke(messages)
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    """Determine whether to continue with tool calls or end."""
    last_message = state["messages"][-1]
    if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
        return "tools"
    return END

We define the agent’s core decision-making logic. The agent_node function handles incoming messages, invokes the language model (with tools), and returns the model’s response. The should_continue function then evaluates whether the model’s response includes tool calls. If so, it routes control to the tool execution node; otherwise, it directs the flow to end the interaction. These functions enable dynamic and conditional transitions within the agent’s workflow.

def create_agent_graph():
    tool_node = ToolNode(tools)

    workflow = StateGraph(AgentState)

    workflow.add_node("agent", agent_node)
    workflow.add_node("tools", tool_node)

    workflow.add_edge(START, "agent")
    workflow.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    workflow.add_edge("tools", "agent")

    memory = MemorySaver()

    app = workflow.compile(checkpointer=memory)

    return app

print("Creating LangGraph Multi-Tool Agent...")
agent = create_agent_graph()
print("✓ Agent created successfully!\n")

We construct the LangGraph-powered workflow that defines the AI agent’s operational structure. It initializes a ToolNode to handle tool executions and uses a StateGraph to organize the flow between agent decisions and tool usage. Nodes and edges are added to manage transitions: starting with the agent, conditionally routing to tools, and looping back as needed. A MemorySaver is integrated for persistent state tracking across turns. The graph is compiled into an executable application (app), enabling a structured, memory-aware multi-tool agent ready for deployment.

def test_agent():
    """Test the agent with various queries."""
    config = {"configurable": {"thread_id": "test-thread"}}

    test_queries = [
        "What's 15 * 7 + 23?",
        "Search for information about Python programming",
        "What's the weather like in Tokyo?",
        "What time is it?",
        "Analyze this text: 'LangGraph is an amazing framework for building AI agents.'"
    ]

    print("Testing the agent with sample queries...\n")

    for i, query in enumerate(test_queries, 1):
        print(f"Query {i}: {query}")
        print("-" * 50)

        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Response: {last_message.content}\n")

        except Exception as e:
            print(f"Error: {str(e)}\n")

The test_agent function is a validation utility that ensures that the LangGraph agent responds correctly across different use cases. It runs predefined queries, arithmetic, web search, weather, time, and text analysis, and prints the agent’s responses. Using a consistent thread_id for configuration, it invokes the agent with each query. It neatly displays the results, helping developers verify tool integration and conversational logic before moving to interactive or production use.

def chat_with_agent():
    """Interactive chat function."""
    config = {"configurable": {"thread_id": "interactive-thread"}}

    print("Multi-Tool Agent Chat")
    print("Available tools: Calculator, Web Search, Weather Info, Text Analyzer, Current Time")
    print("Type 'quit' to exit, 'help' for available commands\n")

    while True:
        try:
            user_input = input("You: ").strip()

            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            elif user_input.lower() == 'help':
                print("\nAvailable commands:")
                print("• Calculator: 'Calculate 15 * 7 + 23' or 'What's sin(pi/2)?'")
                print("• Web Search: 'Search for Python tutorials' or 'Find information about AI'")
                print("• Weather: 'Weather in Tokyo' or 'What's the temperature in London?'")
                print("• Text Analysis: 'Analyze this text: [your text]'")
                print("• Current Time: 'What time is it?' or 'Current date'")
                print("• quit: Exit the chat\n")
                continue
            elif not user_input:
                continue

            response = agent.invoke(
                {"messages": [HumanMessage(content=user_input)]},
                config=config
            )

            last_message = response["messages"][-1]
            print(f"Agent: {last_message.content}\n")

        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {str(e)}\n")

The chat_with_agent function provides an interactive command-line interface for real-time conversations with the LangGraph multi-tool agent. It supports natural language queries and recognizes commands like “help” for usage guidance and “quit” to exit. Each user input is processed through the agent, which dynamically selects and invokes appropriate response tools. The function enhances user engagement by simulating a conversational experience and showcasing the agent’s capabilities in handling various queries, from math and web search to weather, text analysis, and time retrieval.

if __name__ == "__main__":
    test_agent()

    print("=" * 60)
    print("LangGraph Multi-Tool Agent is ready!")
    print("=" * 60)

    chat_with_agent()

def quick_demo():
    """Quick demonstration of agent capabilities."""
    config = {"configurable": {"thread_id": "demo"}}

    demos = [
        ("Math", "Calculate the square root of 144 plus 5 times 3"),
        ("Search", "Find recent news about artificial intelligence"),
        ("Time", "What's the current date and time?")
    ]

    print("Quick Demo of Agent Capabilities\n")

    for category, query in demos:
        print(f"[{category}] Query: {query}")
        try:
            response = agent.invoke(
                {"messages": [HumanMessage(content=query)]},
                config=config
            )
            print(f"Response: {response['messages'][-1].content}\n")
        except Exception as e:
            print(f"Error: {str(e)}\n")

print("\n" + "=" * 60)
print("Usage Instructions:")
print("1. Add your ANTHROPIC_API_KEY to use the Claude model")
print("   os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-api-key'")
print("2. Run quick_demo() for a quick demonstration")
print("3. Run chat_with_agent() for interactive chat")
print("4. The agent supports: calculations, web search, weather, text analysis, and time")
print("5. Example: 'Calculate 15*7+23' or 'Search for Python tutorials'")
print("=" * 60)

Finally, we orchestrate the execution of the LangGraph multi-tool agent. If the script is run directly, it initiates test_agent() to validate functionality with sample queries, followed by launching the interactive chat_with_agent() mode for real-time interaction. The quick_demo() function also briefly showcases the agent’s capabilities in math, search, and time queries. Clear usage instructions are printed at the end, guiding users on configuring the API key, running demonstrations, and interacting with the agent. This provides a smooth onboarding experience for users to explore and extend the agent’s functionality.

In conclusion, this step-by-step tutorial gives valuable insights into building an effective multi-tool AI agent leveraging LangGraph and Claude’s generative capabilities. With straightforward explanations and hands-on demonstrations, the guide empowers users to integrate diverse utilities into a cohesive and interactive system. The agent’s flexibility in performing tasks, from complex calculations to dynamic information retrieval, showcases the versatility of modern AI development frameworks. Also, the inclusion of user-friendly functions for both testing and interactive chat enhances practical understanding, enabling immediate application in various contexts. Developers can confidently extend and customize their AI agents with this foundational knowledge.

Check out the Notebook on GitHub.

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimize code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. However, select research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, limiting scalability.

In contrast, some newer methods embrace test-based validation, allowing optimization of more complex programs with loops. Learning-based strategies in compiler optimization—like AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks—have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Additionally, frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even across resource-constrained programming languages like Verilog. 

Stanford, UIUC, CMU, and Visa Research researchers explore using LLMs to optimize assembly code performance—an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward balancing correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. Their results show that with RL training, LLMs can effectively outperform conventional compiler optimizations. 

The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to assembly P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified using a test set, and speedup is measured by execution time improvement. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions—Correctness-Guided Speedup and Speedup-Only—are used to guide training based on program validity, correctness, and performance gains. 
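To make the reward design concrete, here is an illustrative sketch (not the paper's exact shaping) of a correctness-guided speedup reward: a candidate only earns a speedup bonus once it passes every test, and otherwise receives partial credit proportional to its pass rate.

def correctness_guided_speedup_reward(pass_rate: float, baseline_time: float, candidate_time: float) -> float:
    """pass_rate: fraction of tests the candidate assembly P' passes.
    baseline_time / candidate_time: measured runtimes of the gcc -O3 output and of P'."""
    if pass_rate < 1.0:
        return pass_rate              # partial credit for correctness only
    speedup = baseline_time / candidate_time
    return 1.0 + speedup              # full correctness plus measured speedup

# A fully correct candidate that runs 1.47x faster than the -O3 baseline:
print(correctness_guided_speedup_reward(1.0, 1.47, 1.0))  # -> 2.47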

The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating their ability to perform semantic-level code transformations beyond traditional compiler capabilities. 

In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include a lack of formal correctness guarantees and variability in hardware performance across systems. 

Check out the Paper.

A Comprehensive Coding Guide to Crafting Advanced Round-Robin Multi-Agent Workflows with Microsoft AutoGen

In this tutorial, we demonstrated how Microsoft’s AutoGen framework empowers developers to orchestrate complex, multi-agent workflows with minimal code. By leveraging AutoGen’s RoundRobinGroupChat and TeamTool abstractions, you can seamlessly assemble specialist assistants, such as Researchers, FactCheckers, Critics, Summarizers, and Editors, into a cohesive “DeepDive” tool. AutoGen handles the intricacies of turn‐taking, termination conditions, and streaming output, allowing you to focus on defining each agent’s expertise and system prompts rather than plumbing together callbacks or manual prompt chains. Whether conducting in‐depth research, validating facts, refining prose, or integrating third‐party tools, AutoGen provides a unified API that scales from simple two‐agent pipelines to elaborate, five‐agent collaboratives.

!pip install -q autogen-agentchat[gemini] autogen-ext[openai] nest_asyncio

We install the AutoGen AgentChat package with Gemini support, the OpenAI extension for API compatibility, and the nest_asyncio library to patch the notebook’s event loop, ensuring you have all the components needed to run asynchronous, multi-agent workflows in Colab.

import os, nest_asyncio
from getpass import getpass

nest_asyncio.apply()
os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")

We import and apply nest_asyncio to enable nested event loops in notebook environments, then securely prompt for your Gemini API key using getpass and store it in os.environ for authenticated model client access.

from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(
    model="gemini-1.5-flash-8b",
    api_key=os.environ["GEMINI_API_KEY"],
    api_type="google",
)

We initialize an OpenAI‐compatible chat client pointed at Google’s Gemini by specifying the gemini-1.5-flash-8b model, injecting your stored Gemini API key, and setting api_type=”google”, giving you a ready-to-use model_client for downstream AutoGen agents.

from autogen_agentchat.agents import AssistantAgent

researcher = AssistantAgent(name="Researcher", system_message="Gather and summarize factual info.", model_client=model_client)
factchecker = AssistantAgent(name="FactChecker", system_message="Verify facts and cite sources.", model_client=model_client)
critic = AssistantAgent(name="Critic", system_message="Critique clarity and logic.", model_client=model_client)
summarizer = AssistantAgent(name="Summarizer", system_message="Condense into a brief executive summary.", model_client=model_client)
editor = AssistantAgent(name="Editor", system_message="Polish language and signal APPROVED when done.", model_client=model_client)

We define five specialized assistant agents, Researcher, FactChecker, Critic, Summarizer, and Editor, each initialized with a role-specific system message and the shared Gemini-powered model client, enabling them, respectively, to gather information, verify accuracy, critique content, condense summaries, and polish language within the AutoGen workflow.

from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination

max_msgs = MaxMessageTermination(max_messages=20)
text_term = TextMentionTermination(text="APPROVED", sources=["Editor"])
termination = max_msgs | text_term
team = RoundRobinGroupChat(
    participants=[researcher, factchecker, critic, summarizer, editor],
    termination_condition=termination
)

We import the RoundRobinGroupChat class along with two termination conditions, then compose a stop rule that fires after 20 total messages or when the Editor agent mentions “APPROVED.” Finally, it instantiates a round-robin team of the five specialized agents with that combined termination logic, enabling them to cycle through research, fact-checking, critique, summarization, and editing until one of the stop conditions is met.

from autogen_agentchat.tools import TeamTool

deepdive_tool = TeamTool(team=team, name="DeepDive", description="Collaborative multi-agent deep dive")

We wrap our RoundRobinGroupChat team in a TeamTool named "DeepDive" with a human-readable description, effectively packaging the entire multi-agent workflow into a single callable tool that other agents can invoke seamlessly.

host = AssistantAgent(
    name="Host",
    model_client=model_client,
    tools=[deepdive_tool],
    system_message="You have access to a DeepDive tool for in-depth research."
)

We create a “Host” assistant agent configured with the shared Gemini-powered model_client, grant it the DeepDive team tool for orchestrating in-depth research, and prime it with a system message that informs it of its ability to invoke the multi-agent DeepDive workflow.

import asyncio

async def run_deepdive(topic: str):
    result = await host.run(task=f"Deep dive on: {topic}")
    print("DeepDive result:\n", result)
    await model_client.close()

topic = "Impacts of Model Context Protocol on Agentic AI"
loop = asyncio.get_event_loop()
loop.run_until_complete(run_deepdive(topic))

Finally, we define an asynchronous run_deepdive function that tells the Host agent to execute the DeepDive team tool on a given topic, prints the comprehensive result, and then closes the model client; it then grabs Colab’s existing asyncio loop and runs the coroutine to completion for a seamless, synchronous execution.

In conclusion, integrating Google Gemini via AutoGen’s OpenAI‐compatible client and wrapping our multi‐agent team as a callable TeamTool gives us a powerful template for building highly modular and reusable workflows. AutoGen abstracts away event loop management (with nest_asyncio), streaming responses, and termination logic, enabling us to iterate quickly on agent roles and overall orchestration. This advanced pattern streamlines the development of collaborative AI systems and lays the foundation for extending into retrieval pipelines, dynamic selectors, or conditional execution strategies.

Check out the Notebook here.

Researchers from the National University of Singapore Introduce 'Thinkless,' an Adaptive Framework that Reduces Unnecessary Reasoning by up to 90% Using DeGRPO

The effectiveness of language models relies on their ability to simulate human-like step-by-step deduction. However, these reasoning sequences are resource-intensive and can be wasteful for simple questions that do not require elaborate computation. This lack of awareness regarding the complexity of the task is one of the core challenges in these models. They often default to detailed reasoning even for queries that could be answered directly. Such an approach increases token usage, extends response time, and increases system latency and memory usage. As a result, there’s a pressing need to equip language models with a mechanism that allows them to make autonomous decisions about whether to think deeply or respond succinctly.

Current tools attempting to solve this issue either rely on manually set heuristics or prompt engineering to switch between short and long responses. Some methods use separate models and route questions based on complexity estimates. Still, these external routing systems often lack insight into the target model’s strengths and fail to make optimal decisions. Other techniques fine-tune models with prompt-based cues like “reasoning on/off,” but these rely on static rules rather than dynamic understanding. Despite some improvements, these approaches fail to enable fully autonomous and context-sensitive control within a single model.

Researchers from the National University of Singapore introduced a new framework called Thinkless, which equips a language model with the ability to dynamically decide between using short or long-form reasoning. The framework is built on reinforcement learning and introduces two special control tokens—<short> for concise answers and <think> for detailed responses. By incorporating a novel algorithm called Decoupled Group Relative Policy Optimization (DeGRPO), Thinkless separates the training focus between selecting the reasoning mode and improving the accuracy of the generated response. This design prevents the model from falling into one-dimensional behavior and enables adaptive reasoning tailored to each query.

The methodology involves two stages: warm-up distillation and reinforcement learning. In the distillation phase, Thinkless is trained using outputs from two expert models—one specializing in short responses and the other in detailed reasoning. This stage helps the model establish a firm link between the control token and the desired reasoning format. The reinforcement learning stage then fine-tunes the model’s ability to decide which reasoning mode to use. DeGRPO decomposes the learning into two separate objectives: one for training the control token and another for refining the response tokens. This approach avoids the gradient imbalances in earlier models, where longer responses would overpower the learning signal, leading to a collapse in reasoning diversity. Thinkless ensures that both <short> and <think> tokens receive balanced updates, promoting stable learning across response types.
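The decoupling can be pictured with a small illustrative sketch (a simplification, not the authors' implementation): the control token and the response tokens contribute separate, individually weighted terms to the policy loss, so a long <think> trace cannot drown out the mode-selection signal.

import torch

def degrpo_style_loss(ctrl_logprob, resp_logprobs, advantage, alpha=1.0):
    """ctrl_logprob: log-prob of the chosen <short>/<think> token (scalar tensor).
    resp_logprobs: log-probs of the sampled response tokens (1-D tensor).
    advantage: group-relative advantage of this rollout."""
    mode_loss = -advantage * ctrl_logprob            # mode-selection term
    resp_loss = -advantage * resp_logprobs.mean()    # length-normalized response term
    return mode_loss + alpha * resp_loss

loss = degrpo_style_loss(torch.tensor(-0.3, requires_grad=True), -torch.rand(50), advantage=0.8)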

When evaluated, Thinkless significantly reduced long-form reasoning while preserving high accuracy. On the Minerva Algebra benchmark, the model used the <think> token in only 25.88% of cases while achieving 94.59% accuracy. In contrast, conventional reasoning models had to use extended chains of thought much more frequently. On the AIME 2024 dataset, Thinkless reached a 27.33% accuracy rate with 100% usage of the reasoning mode, showing that it could maintain performance when full reasoning was necessary. On the GSM8K dataset, it utilized <think> only 13.31% of the time, yet still achieved 84.18% accuracy. These results reflect the model’s ability to handle simple and complex queries with appropriate reasoning depth, cutting down on unnecessary token generation by as much as 90% in some tasks.

Overall, this study from the National University of Singapore researchers presents a compelling solution to the inefficiencies of uniform reasoning in large language models. By introducing a mechanism that enables models to judge task complexity and adjust their inference strategy accordingly, Thinkless optimizes both accuracy and efficiency. The method balances depth of reasoning and response precision without relying on fixed rules, offering a data-driven approach to more intelligent language model behavior.

Check out the Paper and GitHub Page.

Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models

Recent advances in long-context (LC) modeling have unlocked new capabilities for LLMs and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) represent an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags behind. It is still unclear how well current LCVLMs perform in long-context settings, what tasks they struggle with, and how robust they are to input length variation. Current benchmarks face the following problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of context length control, and (d) reliance on a single context length.

Various techniques have extended context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, existing vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.

Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, covering natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme combining vision patches and text tokens. Through benchmarking 46 closed-source and open-source models, the research reveals that single-task performance poorly predicts overall LC capability, both model types struggle with LC tasks, and stronger reasoning models show better LC performance.

The researchers construct long-context examples by inserting gold passages that contain the answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses lead sections from Wikipedia entity pages. Wikipedia pages are split into 100-word passages, and retrieved distractors are added until the desired input length is reached. Many-shot in-context learning tasks use four diverse image classification datasets (Stanford Cars, Food101, SUN397, and iNat2021), accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens from the Llama2 tokenizer with visual tokens processed through 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs.
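As a rough illustration of this cross-modal token counting (an assumption-based sketch, not the benchmark's released code), the snippet below combines text tokens from any tokenizer exposing encode(), such as a Llama-2 tokenizer, with per-image visual tokens derived from 14×14 patches followed by 2×2 pixel unshuffle.

```python
# Illustrative cross-modal token counting: text tokens + visual tokens per image.
# A 14x14 patch grid is compressed 4x by 2x2 pixel unshuffle, so a 896x896 image
# contributes (64 * 64) / 4 = 1024 visual tokens under this scheme.
import math

def count_visual_tokens(height, width, patch=14, unshuffle=2):
    patches = math.ceil(height / patch) * math.ceil(width / patch)
    return patches // (unshuffle * unshuffle)

def count_context_tokens(text, image_sizes, tokenizer):
    """tokenizer can be any object with an encode() method (e.g., a Llama-2 tokenizer)."""
    text_tokens = len(tokenizer.encode(text))
    visual_tokens = sum(count_visual_tokens(h, w) for h, w in image_sizes)
    return text_tokens + visual_tokens
```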

The evaluation on MMLONGBENCH across tasks and context lengths shows that all models struggle, but closed-source models perform better. At the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o achieving an average score of only 62.9. Gemini-2.5-Pro is the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, the Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4). Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models show generalization beyond their training context lengths, with Qwen2-VL-72B achieving an average score of 51.9 at 128K despite a 32K training window.

In conclusion, researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance is an unreliable predictor of overall long-context ability and that frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH provides a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multimodal retrieval and reasoning capabilities.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models appeared first on MarkTechPost.

Principal Financial Group increases Voice Virtual Assistant performanc …

This post was cowritten by Mulay Ahmed, Assistant Director of Engineering, and Ruby Donald, Assistant Director of Engineering at Principal Financial Group. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
Principal Financial Group® is an integrated global financial services company with specialized solutions helping people, businesses, and institutions reach their long-term financial goals and access greater financial security.
With US contact centers that handle millions of customer calls annually, Principal® wanted to further modernize their customer call experience. With a robust AWS Cloud infrastructure already in place, they selected a cloud-first approach to create a more personalized and seamless experience for their customers that would:

Understand customer intents through natural language (vs. touch tone experiences)
Assist customers with self-service offerings where possible
Accurately route customer calls based on business rules
Assist engagement center agents with contextual data

Initially, Principal developed a voice Virtual Assistant (VA) using an Amazon Lex bot to recognize customer intents. The VA can perform self-service transactions or route customers to specific call center queues in the Genesys Cloud contact center platform, based on customer intents and business rules.
As customers interact with the VA, it’s essential to continuously monitor its health and performance. This allows Principal to identify opportunities for fine-tuning, which can enhance the VA’s ability to understand customer intents. Consequently, this will reduce fallback intent rates, improve functional intent fulfillment rates, and lead to better customer experiences.
In this post, we explore how Principal used this opportunity to build an integrated voice VA reporting and analytics solution using an Amazon QuickSight dashboard.
Amazon Lex is a service for building conversational interfaces using voice and text. It provides high-quality speech recognition and language understanding capabilities, enabling the addition of sophisticated, natural language chatbots to new and existing applications.
Genesys Cloud, an omni-channel orchestration and customer relationship platform, provides a contact center platform in a public cloud model that enables quick and simple integration of AWS Contact Center Intelligence (AWS CCI). As part of AWS CCI, Genesys Cloud integrates with Amazon Lex, which enables self-service, intelligent routing, and data collection capabilities.
QuickSight is a unified business intelligence (BI) service that makes it straightforward within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data.
Solution overview
Principal required a reporting and analytics solution that would monitor VA performance based on customer interactions at scale, enabling Principal to improve the Amazon Lex bot performance.
Reporting requirements included analytics on customer and VA interactions and on Amazon Lex bot performance (target metrics and intent fulfillment) to identify and implement tuning and training opportunities.
The solution used a QuickSight dashboard that derives these insights from the following customer interaction data used to measure VA performance:

Genesys Cloud data such as queues and data actions
Business-specific data such as product and call center operations data
Business API-specific data and metrics such as API response codes

The following diagram shows the solution architecture using Genesys, Amazon Lex, and QuickSight.

The solution workflow involves the following steps:

Users call in and interact with Genesys Cloud.
Genesys Cloud calls an AWS Lambda routing function, which returns the data Genesys Cloud needs to route the customer call. To generate this response, the function fetches routing data from an Amazon DynamoDB table and asks an Amazon Lex V2 bot to determine the user's intent (see the sketch after this list).
The Amazon Lex V2 bot processes the customer intent and calls a Lambda fulfillment function to fulfill the intent.
The fulfillment function executes custom logic (routing and session variables logic) and calls necessary APIs to fetch the data required to fulfill the intent.
The APIs process and return the data requested (such as data to perform a self-service transaction).
The Amazon Lex V2 bot’s conversation logs are sent to Amazon CloudWatch (these logs will be used for business analytics, operational monitoring, and alerts).
Genesys Cloud calls a third Lambda function to send customer interaction reports. The Genesys report function pushes these reports to an Amazon Simple Storage Service (Amazon S3) bucket (these reports will be used for business analytics).
An Amazon Data Firehose delivery stream ships the conversation logs from CloudWatch to an S3 bucket.
The Firehose delivery stream transforms the logs into Parquet or CSV format using a Lambda function.
An AWS Glue crawler scans the data in Amazon S3.
The crawler creates or updates the AWS Glue Data Catalog with the schema information.
We use Amazon Athena to query the datasets (customer interaction reports and conversation logs).
QuickSight connects to Athena to query the data from Amazon S3 using the Data Catalog.
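The following is a minimal sketch of what the routing Lambda function in step 2 could look like with boto3. The table name, environment variables, event fields, and attribute names are placeholders for illustration, not Principal's actual resources.

```python
# Illustrative routing Lambda (step 2): look up routing rules in DynamoDB, ask the
# Amazon Lex V2 bot for the caller's intent, and return routing data to Genesys Cloud.
# Resource names and event fields are hypothetical placeholders.
import os
import boto3

dynamodb = boto3.resource("dynamodb")
lex = boto3.client("lexv2-runtime")
routing_table = dynamodb.Table(os.environ["ROUTING_TABLE_NAME"])

def lambda_handler(event, context):
    # Business routing rules keyed by the line or product the caller dialed.
    rules = routing_table.get_item(Key={"lineId": event["lineId"]}).get("Item", {})

    # Ask the Amazon Lex V2 bot which intent the caller expressed.
    lex_response = lex.recognize_text(
        botId=os.environ["BOT_ID"],
        botAliasId=os.environ["BOT_ALIAS_ID"],
        localeId="en_US",
        sessionId=event["sessionId"],
        text=event["callerUtterance"],
    )
    intent = lex_response["sessionState"]["intent"]["name"]

    # Return the data Genesys Cloud needs to route the call (queue, intent, etc.).
    return {
        "intent": intent,
        "targetQueue": rules.get(intent, rules.get("defaultQueue")),
    }
```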

Other design considerations
The following are other key design considerations to implement the VA solution:

Cost optimization – The solution uses Amazon S3 Bucket Keys to optimize costs:

Reduce the number of Amazon S3 requests to AWS Key Management Service (AWS KMS) to complete encryption operations.
Reduce the number of AWS KMS events in AWS CloudTrail logs.

Encryption – The solution encrypts data at rest with AWS KMS and in transit using SSL/TLS.
Genesys Cloud integration – The integration between the Amazon Lex V2 bot and Genesys Cloud is done using AWS Identity and Access Management (IAM). For more details, see Genesys Cloud.
Logging and monitoring – The solution monitors AWS resources with CloudWatch and uses alerts to receive notification upon failure events.
Least privilege access – The solution uses IAM roles and policies to grant the minimum necessary permissions to users and services.
Data privacy – The solution handles customer sensitive data such as personally identifiable information (PII) according to compliance and data protection requirements. It implements data masking when applicable and appropriate.
Secure APIs – APIs implemented in this solution are protected and designed according to compliance and security requirements.
Data types – The solution defines data types, such as timestamps, in the Data Catalog (and Athena) so that SPICE data in QuickSight can be refreshed on a schedule.
DevOps – The solution is version controlled, and changes are deployed using pipelines, to enable faster release cycles.
Analytics on Amazon Lex – Analytics on Amazon Lex empowers teams with data-driven insights to improve the performance of their bots. The overview dashboard provides a single snapshot of key metrics, such as the total number of conversations and intent recognition rates. Principal does not use this capability for the following reasons:

The dashboard can’t integrate with external data:

Genesys Cloud data (such as queues and data actions)
Business-specific data (such as product and call center operations data)
Business API-specific data and metrics (such as response codes)

The dashboard can’t be customized to add additional views and data.

Sample dashboard
With this reporting and analytics solution, Principal can consolidate data from multiple sources and visualize the performance of the VA to identify areas of opportunities for improvement. The following screenshot shows an example of their QuickSight dashboard for illustrative purposes.

Conclusion
In this post, we presented how Principal created a report and analytics solution for their VA solution using Genesys Cloud and Amazon Lex, along with QuickSight to provide customer interaction insights.
The VA solution allowed Principal to maintain its existing contact center solution with Genesys Cloud and achieve better customer experiences. It offers other benefits such as the ability for a customer to receive support on some inquiries without requiring an agent on the call (self-service). It also provides intelligent routing capabilities, leading to reduced call time and increased agent productivity.
With the implementation of this solution, Principal can monitor and derive insights from its VA solution and fine-tune its performance accordingly.
In its 2025 roadmap, Principal will continue to strengthen the foundation of the solution described in this post. In a second post, Principal will present how they automate the deployment and testing of new Amazon Lex bot versions.
AWS and Amazon are not affiliates of any company of the Principal Financial Group®. This communication is intended to be educational in nature and is not intended to be taken as a recommendation.
Insurance products issued by Principal National Life Insurance Co (except in NY) and Principal Life Insurance Company®. Plan administrative services offered by Principal Life. Principal Funds, Inc. is distributed by Principal Funds Distributor, Inc. Securities offered through Principal Securities, Inc., member SIPC and/or independent broker/dealers. Referenced companies are members of the Principal Financial Group®, Des Moines, IA 50392. ©2025 Principal Financial Services, Inc. 4373397-042025

About the Authors
Mulay Ahmed is an Assistant Director of Engineering at Principal and well-versed in architecting and implementing complex enterprise-grade solutions on AWS Cloud.
Ruby Donald is an Assistant Director of Engineering at Principal and leads the Enterprise Virtual Assistants Engineering Team. She has extensive experience in building and delivering software at enterprise scale.

Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype t …

Modern web usage spans many digital interactions, from filling out forms and managing accounts to executing data queries and navigating complex dashboards. Despite the web being deeply intertwined with productivity and work processes, many of these actions still demand repetitive human input. This scenario is especially true for environments that require detailed instructions or decisions beyond mere searches. While artificial intelligence agents have emerged to support task automation, many prioritize complete autonomy. However, this frequently sidelines user control, leading to outcomes that diverge from user expectations. The next leap forward in productivity-enhancing AI involves agents designed not to replace users but to collaborate with them, blending automation with continuous, real-time human input for more accurate and trusted results.

A key challenge in deploying AI agents for web-based tasks is the lack of visibility and intervention. Users often cannot see what steps the agent is planning, how it intends to execute them, or when it might go off track. In scenarios that involve complex decisions, like entering payment information, interpreting dynamic content, or running scripts, users need mechanisms to step in and redirect the process. Without these capabilities, systems risk making irreversible mistakes or misaligning with user goals. This highlights a significant limitation in current AI automation: the absence of structured human-in-the-loop design, where users dynamically guide and supervise agent behavior, without acting merely as spectators.

Previous solutions approached web automation through rule-based scripts or general-purpose AI agents driven by language models. These systems interpret user commands and attempt to carry them out autonomously. However, they often execute plans without surfacing intermediate decisions or allowing meaningful user feedback. A few offer command-line-like interactions, which are inaccessible to the average user and rarely include layered safety mechanisms. Moreover, minimal support for task reuse or performance learning across sessions limits long-term value. These systems also tend to lack adaptability when the context changes mid-task or errors must be corrected collaboratively.

Researchers at Microsoft introduced Magentic-UI, an open-source prototype that emphasizes collaborative human-AI interaction for web-based tasks. Unlike previous systems aiming for full independence, this tool promotes real-time co-planning, execution sharing, and step-by-step user oversight. Magentic-UI is built on Microsoft’s AutoGen framework and is tightly integrated with Azure AI Foundry Labs. It’s a direct evolution from the previously introduced Magentic-One system. With its launch, Microsoft Research aims to address fundamental questions about human oversight, safety mechanisms, and learning in agentic systems by offering an experimental platform for researchers and developers.

Magentic-UI includes four core interactive features: co-planning, co-tasking, action guards, and plan learning. Co-planning lets users view and adjust the agent’s proposed steps before execution begins, offering full control over what the AI will do. Co-tasking enables real-time visibility during operation, letting users pause, edit, or take over specific actions. Action guards are customizable confirmations for high-risk activities like closing browser tabs or clicking “submit” on a form, actions that could have unintended consequences. Plan learning allows Magentic-UI to remember and refine steps for future tasks, improving over time through experience. These capabilities are supported by a modular team of agents: the Orchestrator leads planning and decision-making, WebSurfer handles browser interactions, Coder executes code in a sandbox, and FileSurfer interprets files and data.
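The action-guard idea can be illustrated with a small, hypothetical snippet; this is a sketch of the pattern, not Magentic-UI's actual API. High-risk actions run only after explicit user approval.

```python
# Illustrative action-guard pattern: irreversible browser actions are gated behind
# a user confirmation before they execute. Action names and categories are assumptions.
HIGH_RISK_ACTIONS = {"click_submit", "close_tab", "enter_payment_details"}

def execute_with_guard(action, perform, ask_user):
    """perform() runs the action; ask_user(prompt) returns True only if the user approves."""
    if action in HIGH_RISK_ACTIONS and not ask_user(f"Allow the agent to '{action}'?"):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "result": perform()}
```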

Technically, when a user submits a request, the Orchestrator agent generates a step-by-step plan. Users can modify it through a graphical interface by editing, deleting, or regenerating steps. Once finalized, the plan is delegated across specialized agents. Each agent reports after performing its task, and the Orchestrator determines whether to proceed, repeat, or request user feedback. All actions are visible on the interface, and users can halt execution at any point. This architecture not only ensures transparency but also allows for adaptive task flows. For example, if a step fails due to a broken link, the Orchestrator can dynamically adjust the plan with user consent.
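A hypothetical sketch of that plan/delegate/review loop is shown below; the step format, agent names, and decision values are illustrative assumptions rather than Magentic-UI's real interfaces.

```python
# Illustrative orchestration loop: propose a plan, let the user edit it, delegate each
# step to a specialist agent, and let the orchestrator decide whether to proceed or retry.
def run_task(request, propose_plan, review_plan, agents, review_report):
    """
    propose_plan(request) -> list of {"owner": str, "action": str} steps
    review_plan(plan)     -> user-edited plan (co-planning before execution)
    agents                -> e.g. {"WebSurfer": fn, "Coder": fn, "FileSurfer": fn}
    review_report(report) -> "proceed" or "retry" (user-feedback branch omitted for brevity)
    """
    plan = review_plan(propose_plan(request))   # nothing runs until the user approves
    results = []
    for step in plan:
        report = agents[step["owner"]](step)    # delegate to WebSurfer, Coder, or FileSurfer
        while review_report(report) == "retry": # the orchestrator can repeat a failed step
            report = agents[step["owner"]](step)
        results.append(report)
    return results
```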

In controlled evaluations using the GAIA benchmark, which includes complex tasks like navigating the web and interpreting documents, Magentic-UI’s performance was rigorously tested. GAIA consists of 162 tasks requiring multimodal understanding. When operating autonomously, Magentic-UI completed 30.3% of tasks successfully. However, when supported by a simulated user with access to additional task information, success jumped to 51.9%, a 71% improvement. Another configuration using a smarter simulated user improved the rate to 42.6%. Interestingly, Magentic-UI requested help in only 10% of the enhanced tasks and asked for final answers in 18%. In those cases, the system asked for help an average of just 1.1 times. This shows how minimal but well-timed human intervention significantly boosts task completion without high oversight costs.

Magentic-UI also features a “Saved Plans” gallery that displays strategies reused from past tasks. Retrieval from this gallery is approximately three times faster than generating a new plan. A predictive mechanism surfaces these plans while users type, streamlining repeated tasks like flight searches or form submissions. Safety mechanisms are robust. Every browser or code action runs inside a Docker container, ensuring that no user credentials are exposed. Users can define allow-lists for site access, and every action can be gated behind approval prompts. A red-team evaluation further tested it against phishing attacks and prompt injections, where the system either sought user clarification or blocked execution, reinforcing its layered defense model.
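As a simple illustration of the site allow-list gate (a pattern sketch, not the project's code), navigation requests can be checked against user-approved hosts before the WebSurfer acts:

```python
# Illustrative allow-list check for browser navigation; the hosts here are placeholders.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # user-configured allow-list

def can_navigate(url):
    host = (urlparse(url).hostname or "").lower()
    return any(host == allowed or host.endswith("." + allowed) for allowed in ALLOWED_HOSTS)
```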

Several Key Takeaways from the Research on Magentic-UI:

With simple human input, Magentic-UI boosts task completion by 71% (from 30.3% to 51.9%).

Requests user help in only 10% of enhanced tasks and averages 1.1 help requests per task.

It features a co-planning UI that allows full user control before execution.

Executes tasks via four modular agents: Orchestrator, WebSurfer, Coder, and FileSurfer.

Stores and reuses plans, reducing repeat task latency by up to 3x.

All actions are sandboxed via Docker containers; no user credentials are ever exposed.

Passed red-team evaluations against phishing and injection threats.

Supports fully user-configurable “action guards” for high-risk steps.

Fully open-source and integrated with Azure AI Foundry Labs.

In conclusion, Magentic-UI addresses a long-standing problem in AI automation: the lack of transparency and controllability. Rather than replacing users, it keeps them central to the process. The system performs well even with minimal help and improves with experience. The modular design, robust safeguards, and detailed interaction model create a strong foundation for future intelligent assistants.

Check out the Technical details and GitHub Page. All credit for this research goes to the researchers of this project.
The post Microsoft AI Introduces Magentic-UI: An Open-Source Agent Prototype that Works with People to Complete Complex Tasks that Require Multi-Step Planning and Browser Use appeared first on MarkTechPost.