No-code data preparation for time series forecasting using Amazon Sage …

Time series forecasting helps businesses predict future trends based on historical data patterns, whether it’s for sales projections, inventory management, or demand forecasting. Traditional approaches require extensive knowledge of statistical methods and data science techniques to process raw time series data.
Amazon SageMaker Canvas offers no-code solutions that simplify data wrangling, making time series forecasting accessible to all users regardless of their technical background. In this post, we explore how SageMaker Canvas and SageMaker Data Wrangler provide no-code data preparation techniques that empower users of all backgrounds to prepare data and build time series forecasting models in a single interface with confidence.
Solution overview
Using SageMaker Data Wrangler for data preparation lets you modify data for predictive analytics without programming knowledge. In this solution, we demonstrate the steps associated with this process. The solution includes the following:

Data import from varying sources
Automated no-code algorithmic recommendations for data preparation
Step-by-step processes for preparation and analysis
Visual interfaces for data visualization and analysis
Export capabilities after data preparation
Built-in security and compliance features

In this post, we focus on data preparation for time series forecasting using SageMaker Canvas.
Walkthrough
The following is a walkthrough of the solution for data preparation using Amazon SageMaker Canvas. For the walkthrough, you use the consumer electronics synthetic dataset found in this SageMaker Canvas Immersion Day lab, which we encourage you to try. This consumer electronics related time series (RTS) dataset primarily contains historical price data that corresponds to sales transactions over time. This dataset is designed to complement target time series (TTS) data to improve prediction accuracy in forecasting models, particularly for consumer electronics sales, where price changes can significantly impact buying behavior. The dataset can be used for demand forecasting, price optimization, and market analysis in the consumer electronics sector.
Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account
AWS resources
Prerequisites for accessing SageMaker Canvas through an AWS account
Download the consumer_electronics.csv file from the SageMaker Canvas Immersion Day lab

Solution walkthrough
The following walkthrough explains how to import a dataset, prepare the data without writing code using Data Wrangler, and then train a time series forecasting model using SageMaker Canvas.
Sign in to the AWS Management Console and go to Amazon SageMaker AI, and then to Canvas. On the Get started page, select the Import and prepare option to bring your dataset into SageMaker Data Wrangler. First, select Tabular Data, because we will be using this data for time series forecasting. You will then see the following data sources to choose from:

Local upload
Canvas Datasets
Amazon S3
Amazon Redshift
Amazon Athena
Databricks
MySQL
PostgreSQL
SQL Server
RDS

For this demo, select Local upload. When you use this option, the data is stored on an Amazon Elastic File System (Amazon EFS) storage volume in the SageMaker Studio environment and is tied to that SageMaker Studio instance. For more permanent storage and long-term data management when working with SageMaker Data Wrangler, Amazon Simple Storage Service (Amazon S3) is recommended.

Select the consumer_electronics.csv file from the prerequisites. After selecting the file to import, you can use the Import settings panel to set your desired configurations. For the purposes of this demo, leave the options at their default values.

After the import is complete, use the Data flow options to modify the newly imported data. For future data forecasting, you may need to clean up data for the service to properly understand the values and disregard any errors in the data. SageMaker Canvas has various offerings to accomplish this. Options include Chat for data prep with natural language data modifications and Add Transform. Chat for data prep may be best for users who prefer natural language processing (NLP) interactions and may not be familiar with technical data transformations. Add transform is best for data professionals who know which transformations they want to apply to their data.
For time series forecasting using Amazon SageMaker Canvas, data must be prepared in a certain way for the service to understand and forecast it properly. To make a time series forecast using SageMaker Canvas, the SageMaker Canvas documentation lists the following requirements:

A timestamp column with all values having the datetime type.
A target column that has the values that you’re using to forecast future values.
An item ID column that contains unique identifiers for each item in your dataset, such as SKU numbers.

The datetime values in the timestamp column must use one of the following formats:

YYYY-MM-DD HH:MM:SS
YYYY-MM-DDTHH:MM:SSZ
YYYY-MM-DD
MM/DD/YY
MM/DD/YY HH:MM
MM/DD/YYYY
YYYY/MM/DD HH:MM:SS
YYYY/MM/DD
DD/MM/YYYY
DD/MM/YY
DD-MM-YY
DD-MM-YYYY
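
As an illustration of these requirements, a minimal dataset that SageMaker Canvas could forecast on might look like the following. The rows are hypothetical and use the YYYY-MM-DD format from the preceding list; the consumer electronics dataset in this walkthrough has its own column names and values:

ts,item_id,price
2023-01-01,SKU-1001,19.99
2023-01-02,SKU-1001,18.99
2023-01-01,SKU-2002,549.00
2023-01-02,SKU-2002,539.00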

You can make forecasts for the following intervals:

1 min
5 min
15 min
30 min
1 hour
1 day
1 week
1 month
1 year

For this example, remove the $ from the data by using the Chat for data prep option. Give the chat a prompt such as Can you get rid of the $ in my data, and it will generate code that fulfills your request and modifies the data, giving you a no-code way to prepare the data for modeling and predictive analysis. Choose Add to Steps to accept this code and apply the changes to the data.
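Although you don’t need to write any code yourself, it can help to know what the generated transformation does. The following is a minimal pandas sketch of an equivalent step, assuming a column named price that contains values such as $19.99; the code that Chat for data prep actually generates may differ:

import pandas as pd

# Hypothetical example rows; in the walkthrough the data comes from consumer_electronics.csv
df = pd.DataFrame({"price": ["$19.99", "$549.00"]})

# Remove the dollar sign so the column can then be converted to a numeric type
df["price"] = df["price"].str.replace("$", "", regex=False)
df["price"] = df["price"].astype(float)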

You can also convert values to float data type and check for missing data in your uploaded CSV file using either Chat for data prep or Add Transform options. To drop missing values using Data Transform:

Select Add Transform from the interface
Choose Handle Missing from the transform options
Select Drop missing from the available operations
Choose the columns you want to check for missing values
Select Preview to verify the changes
Choose Add to confirm and apply the transformation

For time series forecasting, inferring missing values and resampling the dataset to a certain frequency (hourly, daily, or weekly) are also important. In SageMaker Data Wrangler, the frequency of the data can be altered by choosing Add Transform, selecting Time Series, selecting Resample from the Transform dropdown, and then selecting the Timestamp dropdown, ts in this example. Then, you can select advanced options; for example, choose Frequency unit and then select the desired frequency from the list.
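
For readers who want to see what the resampling step corresponds to in code, the following is a rough pandas equivalent, assuming the timestamp column is named ts (as in this example), the item identifier column is item_id, and the target column is price. Data Wrangler performs this for you through the Resample transform:

# Parse the timestamp column and resample each item's series to a daily frequency
df["ts"] = pd.to_datetime(df["ts"])
daily = (
    df.set_index("ts")
      .groupby("item_id")["price"]
      .resample("D")
      .mean()
      .reset_index()
)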

SageMaker Data Wrangler offers several methods to handle missing values in time-series data through its Handle missing transform. You can choose from options such as forward fill or backward fill, which are particularly useful for maintaining the temporal structure of the data. These operations can be applied by using natural language commands in Chat for data prep, allowing flexible and efficient handling of missing values in time-series forecasting preparation.
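As a point of reference, forward fill and backward fill correspond roughly to the following pandas operations, again assuming the hypothetical item_id and price columns used earlier; in SageMaker Data Wrangler you apply them through the Handle missing transform or Chat for data prep rather than writing this code:

# Forward fill carries the last observed price forward within each item's series
df["price"] = df.groupby("item_id")["price"].ffill()

# Backward fill uses the next observed price for any gaps that remain
df["price"] = df.groupby("item_id")["price"].bfill()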
To create the data flow, choose Create model. Then, choose Run Validation, which checks the data to make sure the processes were done correctly. After this step of data transformation, you can access additional options by selecting the purple plus sign. The options include Get data insights, Chat for data prep, Combine data, Create model, and Export.
The prepared data can then be connected to SageMaker AI for time series forecasting strategies, in this case, to predict the future demand based on the historical data that has been prepared for machine learning.
When using SageMaker, it is also important to consider data storage and security. For the local import feature, data is stored on Amazon EFS volumes and encrypted by default. For more permanent storage, Amazon S3 is recommended. S3 offers security features such as server-side encryption (SSE-S3, SSE-KMS, or SSE-C), fine-grained access controls through AWS Identity and Access Management (IAM) roles and bucket policies, and the ability to use VPC endpoints for added network security. To help ensure data security in either case, it’s important to implement proper access controls, use encryption for data at rest and in transit, regularly audit access logs, and follow the principle of least privilege when assigning permissions.
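For example, if you export the prepared data to Amazon S3, you can confirm that default encryption is configured on the bucket. The following is a minimal boto3 sketch; the bucket name is a placeholder, and your organization may require SSE-KMS with a specific key instead of the SSE-S3 algorithm shown here:

import boto3

s3 = boto3.client("s3")

# Apply default server-side encryption (SSE-S3) to a placeholder bucket
s3.put_bucket_encryption(
    Bucket="example-timeseries-data-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)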
In this next step, you learn how to train a model using SageMaker Canvas. Continuing from the previous step, select the purple plus sign, choose Create model, and then choose Export to create a model. After selecting a column to predict (price for this example), you go to the Build screen, which offers options such as Quick build and Standard build. Based on the column chosen, the model predicts future values from the data being used.

Clean up
To avoid incurring future charges, delete the SageMaker Data Wrangler data flow and any S3 buckets used for storage.

In the SageMaker console, navigate to Canvas
Select Import and prepare
Find your data flow in the list
Click the three dots (⋮) menu next to your flow
Select Delete to remove the data flow

If you used S3 for storage:

Open the Amazon S3 console
Navigate to your bucket
Select the bucket used for this project
Choose Delete
Type the bucket name to confirm deletion
Select Delete bucket

Conclusion
In this post, we showed you how Amazon SageMaker Data Wrangler offers a no-code solution for time series data preparation, traditionally a task requiring technical expertise. By using the intuitive interface of the Data Wrangler console and natural language-powered tools, even users who don’t have a technical background can effectively prepare their data for future forecasting needs. This democratization of data preparation not only saves time and resources but also empowers a wider range of professionals to engage in data-driven decision-making.

About the author
Muni T. Bondu is a Solutions Architect at Amazon Web Services (AWS), based in Austin, Texas. She holds a Bachelor of Science in Computer Science, with concentrations in Artificial Intelligence and Human-Computer Interaction, from the Georgia Institute of Technology.

Build an agentic multimodal AI assistant with Amazon Nova and Amazon B …

Modern enterprises are rich in data that spans multiple modalities—from text documents and PDFs to presentation slides, images, audio recordings, and more. Imagine asking an AI assistant about your company’s quarterly earnings call: the assistant should not only read the transcript but also “see” the charts in the presentation slides and “hear” the CEO’s remarks. Gartner predicts that by 2027, 40% of generative AI solutions will be multimodal (text, image, audio, video), up from only 1% in 2023. This shift underlines how vital multimodal understanding is becoming for business applications. Achieving this requires a multimodal generative AI assistant—one that can understand and combine text, visuals, and other data types. It also requires an agentic architecture so the AI assistant can actively retrieve information, plan tasks, and make decisions on tool calling, rather than just responding passively to prompts.
In this post, we explore a solution that does exactly that—using Amazon Nova Pro, a multimodal large language model (LLM) from AWS, as the central orchestrator, along with powerful new Amazon Bedrock features like Amazon Bedrock Data Automation for processing multimodal data. We demonstrate how agentic workflow patterns such as Retrieval Augmented Generation (RAG), multi-tool orchestration, and conditional routing with LangGraph enable end-to-end solutions that artificial intelligence and machine learning (AI/ML) developers and enterprise architects can adopt and extend. We walk through an example of a financial management AI assistant that can provide quantitative research and grounded financial advice by analyzing both the earnings call (audio) and the presentation slides (images), along with relevant financial data feeds. We also highlight how you can apply this pattern in industries like finance, healthcare, and manufacturing.
Overview of the agentic workflow
The core of the agentic pattern consists of the following stages:

Reason – The agent (often an LLM) examines the user’s request and the current context or state. It decides what the next step should be—whether that’s providing a direct answer or invoking a tool or sub-task to get more information.
Act – The agent executes that step. This could mean calling a tool or function, such as a search query, a database lookup, or a document analysis using Amazon Bedrock Data Automation.
Observe – The agent observes the result of the action. For instance, it reads the retrieved text or data that came back from the tool.
Loop – With new information in hand, the agent reasons again, deciding if the task is complete or if another step is needed. This loop continues until the agent determines it can produce a final answer for the user.

This iterative decision-making enables the agent to handle complex requests that are impossible to fulfill with a single prompt. However, implementing agentic systems can be challenging. They introduce more complexity in the control flow, and naive agents can be inefficient (making too many tool calls or looping unnecessarily) or hard to manage as they scale. This is where structured frameworks like LangGraph come in. LangGraph makes it possible to define a directed graph (or state machine) of potential actions with well-defined nodes (actions like “Report Writer” or “Query Knowledge Base”) and edges (allowable transitions). Although the agent’s internal reasoning still decides which path to take, LangGraph makes sure the process remains manageable and transparent. This controlled flexibility means the assistant has enough autonomy to handle diverse tasks while making sure the overall workflow is stable and predictable.
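To make this concrete, the following is a minimal LangGraph sketch of such a graph. The state fields and node names are illustrative placeholders, not the exact ones used in the solution’s notebook:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    answer: str

def query_knowledge_base(state: AgentState) -> AgentState:
    # Placeholder node: retrieve supporting passages for the question
    return {"question": state["question"], "answer": "retrieved context"}

def report_writer(state: AgentState) -> AgentState:
    # Placeholder node: compose the final answer from the retrieved context
    return {"question": state["question"], "answer": "final report"}

graph = StateGraph(AgentState)
graph.add_node("query_knowledge_base", query_knowledge_base)
graph.add_node("report_writer", report_writer)
graph.set_entry_point("query_knowledge_base")
graph.add_edge("query_knowledge_base", "report_writer")
graph.add_edge("report_writer", END)

app = graph.compile()
result = app.invoke({"question": "Summarize the key risks in the Q3 earnings report", "answer": ""})

In the actual solution, conditional edges would route between the router, RAG, multi-tool, and report-writing nodes instead of this single linear path.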
Solution overview
This solution is a financial management AI assistant designed to help analysts query portfolios, analyze companies, and generate reports. At its core is Amazon Nova, an LLM that acts as the intelligent orchestrator for inference. Amazon Nova processes text, images, or documents (like earnings call slides), and dynamically decides which tools to use to fulfill requests. Amazon Nova is optimized for enterprise tasks and supports function calling, so the model can plan actions and call tools in a structured way. With a large context window (up to 300,000 tokens in Amazon Nova Lite and Amazon Nova Pro), it can manage long documents or conversation history when reasoning.
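Because Amazon Nova supports function calling through the Amazon Bedrock Converse API, tools can be declared to the model roughly as follows. This is a sketch only: the tool name and schema are hypothetical, and the model ID may need to be adjusted (for example, to an inference profile) for your Region:

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "What is the YTD stock performance?"}]}],
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "get_stock_performance",  # hypothetical tool
                    "description": "Returns YTD stock performance for a ticker symbol",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {"symbol": {"type": "string"}},
                            "required": ["symbol"],
                        }
                    },
                }
            }
        ]
    },
)

If the model decides to call the tool, the response contains a toolUse block that the orchestrating code executes before returning the result to the model for the next reasoning step.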
The workflow consists of the following key components:

Knowledge base retrieval – Both the earnings call audio file and PowerPoint file are processed by Amazon Bedrock Data Automation, a managed service that extracts text, transcribes audio and video, and prepares data for analysis. If the user uploads a PowerPoint file, the system converts each slide into an image (PNG) for efficient search and analysis, a technique inspired by generative AI applications like Manus. Amazon Bedrock Data Automation is effectively a multimodal AI pipeline out of the box. In our architecture, Amazon Bedrock Data Automation acts as a bridge between raw data and the agentic workflow. Then Amazon Bedrock Knowledge Bases converts these chunks extracted from Amazon Bedrock Data Automation into vector embeddings using Amazon Titan Text Embeddings V2, and stores these vectors in an Amazon OpenSearch Serverless database.
Router agent – When a user asks a question—for example, “Summarize the key risks in this Q3 earnings report”—Amazon Nova first determines whether the task requires retrieving data, processing a file, or generating a response. It maintains memory of the dialogue, interprets the user’s request, and plans which actions to take to fulfill it. The “Memory & Planning” module in the solution diagram indicates that the router agent can use conversation history and chain-of-thought (CoT) prompting to determine next steps. Crucially, the router agent determines if the query can be answered with internal company data or if it requires external information and tools.
Multimodal RAG agent – For queries related to audio and video information, Amazon Bedrock Data Automation uses a unified API call to extract insights from such multimedia data and stores the extracted insights in Amazon Bedrock Knowledge Bases. Amazon Nova uses Amazon Bedrock Knowledge Bases to retrieve factual answers using semantic search (a minimal retrieval sketch follows this list). This makes sure responses are grounded in real data, minimizing hallucination. If Amazon Nova generates an answer, a secondary hallucination check cross-references the response against trusted sources to catch unsupported claims.
Hallucination check (quality gate) – To further verify reliability, the workflow can include a postprocessing step using a different foundation model (FM) outside of the Amazon Nova family, such as Anthropic’s Claude, Mistral, or Meta’s Llama, to grade the answer’s faithfulness. For example, after Amazon Nova generates a response, a hallucination detector model or function can compare the answer against the retrieved sources or known facts. If a potential hallucination is detected (the answer isn’t supported by the reference data), the agent can choose to do additional retrieval, adjust the answer, or escalate to a human.
Multi-tool collaboration – This multi-tool collaboration allows the AI to not only find information but also take actions before formulating a final answer. This introduces multi-tool options. The supervisor agent might spawn or coordinate multiple tool-specific agents (for example, a web search agent to do a general web search, a stock search agent to get market data, or other specialized agents for company financial metrics or industry news). Each agent performs a focused task (one might call an API or perform a query on the internet) and returns findings to the supervisor agent. Amazon Nova Pro features a strong reasoning ability that allows the supervisor agent to merge these findings. This multi-agent approach follows the principle of dividing complex tasks among specialist agents, improving efficiency and reliability for complex queries.
Report creation agent – Another notable aspect in the architecture is the use of Amazon Nova Canvas for output generation. Amazon Nova Canvas is a specialized image-generation model in the Amazon Nova family, but in this context, we use the concept of a “canvas” more figuratively to mean a structured template or format for generated content output. For instance, we could define a template for an “investor report” that the assistant fills out: Section 1: Key Highlights (bullet points), Section 2: Financial Summary (table of figures), Section 3: Notable Quotes, and so on. The agent can guide Amazon Nova to populate such a template by providing it with a system prompt containing the desired format (this is similar to few-shot prompting, where the layout is given). The result is that the assistant not only answers ad-hoc questions, but can also produce comprehensive generated reports that look as if a human analyst prepared them, combining text, image, and references to visuals.
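
For reference, the semantic retrieval step performed by the multimodal RAG agent can be reproduced with the Amazon Bedrock Knowledge Bases Retrieve API. The following is a minimal boto3 sketch; the knowledge base ID is a placeholder:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="KB1234567890",  # placeholder knowledge base ID
    retrievalQuery={"text": "Summarize the key risks in the Q3 earnings report"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

# Each result carries the retrieved chunk text and its source location
for result in response["retrievalResults"]:
    print(result["content"]["text"])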

These components are orchestrated in an agentic workflow. Instead of a fixed script, the solution uses a dynamic decision graph (implemented with the open source LangGraph library in the notebook solution) to route between steps. The result is an assistant that feels less like a chatbot and more like a collaborative analyst—one that can parse an earnings call audio recording, critique a slide deck, or draft an investor memo with minimal human intervention.
The following diagram shows the high-level architecture of the agentic AI workflow. Amazon Nova orchestrates various tools—including Amazon Bedrock Data Automation for document and image processing and a knowledge base for retrieval—to fulfill complex user requests. For brevity, we don’t list all the code here; the GitHub repo includes a full working example. Developers can run that to see the agent in action and extend it with their own data.

Example of the multi-tool collaboration workflow
To demonstrate the multi-tool collaboration agent workflow, we explore an example of how a question-answer interaction might flow through our deployed system for multi-tool collaboration:

User prompt – In the chat UI, the end-user asks a question, such as “What is XXX’s stock performance this year, and how does it compare to its rideshare‑industry peers?”
Agent initial response – The agent (Amazon Nova FM orchestrator) receives the question and responds with:

Received your question. Routing to the reasoning engine…

Planning and tool selection – The agent determines that it needs the following:

The ticker symbol for the company (XXX)
Real‑time stock price and YTD changes
Key financial metrics (revenue, net income, price-earnings ratio)
Industry benchmarks (peer YTD performance, average revenue growth)

Planning execution using tool calls – The agent calls tools to perform the following actions:

Look up ticker symbol:

Agent → WebSearchTool.lookupTicker("XXX Inc")
WebSearchTool → Agent: returns "XXX"

Fetch real‑time stock performance using the retrieved ticker symbol:

Agent → StockAnalysisTool.getPerformance(
    symbol="XXX",
    period="YTD"
)
StockAnalysisTool → Agent:
{
    currentPrice:
    ytdChange:
    52wkRange:
    volume:
}

Retrieve company financial metrics using the retrieved ticker symbol:

Agent → CompanyFinancialAnalysisTool.getMetrics("XXX")
CompanyFinancialAnalysisTool → Agent:
{
    revenueQ4_2024: xxx B,
    netIncomeQ4_2024: xxx M,
    peRatio: xxx
}

Gather industry benchmark data using the retrieved ticker symbol:

Agent → IndustryAnalysisTool.comparePeers(
    symbol="XXX",
    sector="Rideshare"
)
IndustryAnalysisTool → Agent:
{
    avgPeerYTD:
    avgRevenueGrowth:
}

Validation loop – The agent runs a validation loop:

Agent: validate()
  ↳ Are all four data points present?
    • Ticker ✓
    • Stock performance ✓
    • Financial metrics ✓
    • Industry benchmark ✓
  ↳ All set; no retry needed.

If anything is missing or a tool encountered an error, the FM orchestrator triggers the error handler (up to three retries), then resumes the plan at the failed step.

Synthesis and final answer – The agent uses Amazon Nova Pro to synthesize the data points and generate final answers based on these data points.

The following figure shows a flow diagram of this multi-tool collaboration agent.

Benefits of using Amazon Bedrock for scalable generative AI agent workflows
This solution is built on Amazon Bedrock because AWS provides an integrated ecosystem for building such sophisticated solutions at scale:

Amazon Bedrock delivers top-tier FMs like Amazon Nova, with managed infrastructure—no need for provisioning GPU servers or handling scaling complexities.
Amazon Bedrock Data Automation offers an out-of-the-box solution to process documents, images, audio, and video into actionable data. Amazon Bedrock Data Automation can convert presentation slides to images, convert audio to text, perform OCR, and generate textual summaries or captions that are then indexed in an Amazon Bedrock knowledge base.
Amazon Bedrock Knowledge Bases can store embeddings from unstructured data and support retrieval operations using similarity search.
In addition to LangGraph (as shown in this solution), you can also use Amazon Bedrock Agents to develop agentic workflows. Amazon Bedrock Agents simplifies the configuration of tool flows and action groups, so you can declaratively manage your agentic workflows.
Applications developed by open source frameworks like LangGraph (an extension of LangChain) can also run and scale with AWS infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2) or Amazon SageMaker instances, so you can define directed graphs for agent orchestration, making it effortless to manage multi-step reasoning and tool chaining.

You don’t need to assemble a dozen disparate systems; AWS provides an integrated network for generative AI workflows.
Considerations and customizations
The architecture demonstrates exceptional flexibility through its modular design principles. At its core, the system uses Amazon Nova FMs, which can be selected based on task complexity. Amazon Nova Micro handles straightforward tasks like classification with minimal latency. Amazon Nova Lite manages moderately complex operations with balanced performance, and Amazon Nova Pro excels at sophisticated tasks requiring advanced reasoning or generating comprehensive responses.
The modular nature of the solution (Amazon Nova, tools, knowledge base, and Amazon Bedrock Data Automation) means each piece can be swapped or adjusted without overhauling the whole system. Solution architects can use this reference architecture as a foundation, implementing customizations as needed. You can seamlessly integrate new capabilities through AWS Lambda functions for specialized operations, and the LangGraph orchestration enables dynamic model selection and sophisticated routing logic. This architectural approach makes sure the system can evolve organically while maintaining operational efficiency and cost-effectiveness.
Bringing it to production requires thoughtful design, but AWS offers scalability, security, and reliability. For instance, you can secure the knowledge base content with encryption and access control, integrate the agent with AWS Identity and Access Management (IAM) to make sure it only performs allowed actions (for example, if an agent can access sensitive financial data, verify it checks user permissions ), and monitor the costs (you can track Amazon Bedrock pricing and tools usage; you might use Provisioned Throughput for consistent high-volume usage). Additionally, with AWS, you can scale from an experiment in a notebook to a full production deployment when you’re ready, using the same building blocks (integrated with proper AWS infrastructure like Amazon API Gateway or Lambda, if deploying as a service).
Vertical industries that can benefit from this solution
The architecture we described is quite general. Let’s briefly look at how this multimodal agentic workflow can drive value in different industries:

Financial services – In the financial sector, the solution integrates multimedia RAG to unify earnings call transcripts, presentation slides (converted to searchable images), and real-time market feeds into a single analytical framework. Multi-agent collaboration enables Amazon Nova to orchestrate tools like Amazon Bedrock Data Automation for slide text extraction, semantic search for regulatory filings, and live data APIs for trend detection. This allows the system to generate actionable insights—such as identifying portfolio risks or recommending sector rebalancing—while automating content creation for investor reports or trade approvals (with human oversight). By mimicking an analyst’s ability to cross-reference data types, the AI assistant transforms fragmented inputs into cohesive strategies.
Healthcare – Healthcare workflows use multimedia RAG to process clinical notes, lab PDFs, and X-rays, grounding responses in peer-reviewed literature and patient audio interview. Multi-agent collaboration excels in scenarios like triage: Amazon Nova interprets symptom descriptions, Amazon Bedrock Data Automation extracts text from scanned documents, and integrated APIs check for drug interactions, all while validating outputs against trusted sources. Content creation ranges from succinct patient summaries (“Severe pneumonia, treated with levofloxacin”) to evidence-based answers for complex queries, such as summarizing diabetes guidelines. The architecture’s strict hallucination checks and source citations support reliability, which is critical for maintaining trust in medical decision-making.
Manufacturing – Industrial teams use multimedia RAG to index equipment manuals, sensor logs, worker audio conversation, and schematic diagrams, enabling rapid troubleshooting. Multi-agent collaboration allows Amazon Nova to correlate sensor anomalies with manual excerpts, and Amazon Bedrock Data Automation highlights faulty parts in technical drawings. The system generates repair guides (for example, “Replace valve Part 4 in schematic”) or contextualizes historical maintenance data, bridging the gap between veteran expertise and new technicians. By unifying text, images, and time series data into actionable content, the assistant reduces downtime and preserves institutional knowledge—proving that even in hardware-centric fields, AI-driven insights can drive efficiency.

These examples highlight a common pattern: the synergy of data automation, powerful multimodal models, and agentic orchestration leads to solutions that closely mimic a human expert’s assistance. The financial AI assistant cross-checks figures and explanations like an analyst would, the clinical AI assistant correlates images and notes like a diligent doctor, and the industrial AI assistant recalls diagrams and logs like a veteran engineer. All of this is made possible by the underlying architecture we’ve built.
Conclusion
The era of siloed AI models that only handle one type of input is drawing to a close. As we’ve discussed, combining multimodal AI with an agentic workflow unlocks a new level of capability for enterprise applications. In this post, we demonstrated how to construct such a workflow using AWS services: we used Amazon Nova as the core AI orchestrator with its multimodal, agent-friendly capabilities, Amazon Bedrock Data Automation to automate the ingestion and indexing of complex data (documents, slides, audio) into Amazon Bedrock Knowledge Bases, and an agentic workflow graph for reasoning and conditional routing (using LangChain and LangGraph) to orchestrate multi-step reasoning and tool usage. The end result is an AI assistant that operates much like a diligent analyst: researching, cross-checking multiple sources, and delivering insights—but at machine speed and scale.
The solution demonstrates that building a sophisticated agentic AI system is no longer an academic dream—it’s practical and achievable with today’s AWS technologies. By using Amazon Nova as a powerful multimodal LLM and Amazon Bedrock Data Automation for multimodal data processing, along with frameworks for tool orchestration like LangGraph (or Amazon Bedrock Agents), developers get a head start. Many challenges (like OCR, document parsing, or conversational orchestration) are handled by these managed services or libraries, so you can focus on the business logic and domain-specific needs.
The solution presented in the BDA_nova_agentic sample notebook is a great starting point to experiment with these ideas. We encourage you to try it out, extend it, and tailor it to your organization’s needs. We’re excited to see what you will build—the techniques discussed here represent only a small portion of what’s possible when you combine modalities and intelligent agents.

About the authors
Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services, currently focused on the Amazon Bedrock team. Her core expertise lies in agentic AI, where she explores the capabilities of foundation models and AI agents to drive productivity in Generative AI applications. With a background in Generative AI, Applied Data Science, and IoT architecture, she partners with customers—from startups to large enterprises—to design and deploy impactful AI solutions.
Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.
Jessie-Lee Fry is a Product and Go-to Market (GTM) Strategy executive specializing in Generative AI and Machine Learning, with over 15 years of global leadership experience in Strategy, Product, Customer success, Business Development, Business Transformation and Strategic Partnerships. Jessie has defined and delivered a broad range of products and cross-industry go-to-market strategies driving business growth, while maneuvering market complexities and C-Suite customer groups. In her current role, Jessie and her team focus on helping AWS customers adopt Amazon Bedrock at scale through enterprise use cases and adoption frameworks, meeting customers where they are in their Generative AI journey.

Building Production-Ready Custom AI Agents for Enterprise Workflows wi …

In this tutorial, we walk you through the design and implementation of a custom agent framework built on PyTorch and key Python tooling, ranging from web intelligence and data science modules to advanced code generators. We’ll learn how to wrap core functionalities in monitored CustomTool classes, orchestrate multiple agents with tailored system prompts, and define end-to-end workflows that automate tasks like competitive website analysis and data-processing pipelines. Along the way, we demonstrate real-world examples, complete with retry logic, logging, and performance metrics, so you can confidently deploy and scale these agents within your organization’s existing infrastructure.

!pip install -q torch transformers datasets pillow requests beautifulsoup4 pandas numpy scikit-learn openai

import os, json, asyncio, threading, time
import torch, pandas as pd, numpy as np
from PIL import Image
import requests
from io import BytesIO, StringIO
from concurrent.futures import ThreadPoolExecutor
from functools import wraps, lru_cache
from typing import Dict, List, Optional, Any, Callable, Union
import logging
from dataclasses import dataclass
import inspect

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_TIMEOUT = 15
MAX_RETRIES = 3

We begin by installing and importing all the core libraries, including PyTorch and Transformers, as well as data handling libraries such as pandas and NumPy, and utilities like BeautifulSoup for web scraping and scikit-learn for machine learning. We configure a standardized logging setup to capture information and error messages, and define global constants for API timeouts and retry limits, ensuring our tools behave predictably in production.

@dataclass
class ToolResult:
    """Standardized tool result structure"""
    success: bool
    data: Any
    error: Optional[str] = None
    execution_time: float = 0.0
    metadata: Dict[str, Any] = None

class CustomTool:
    """Base class for custom tools"""
    def __init__(self, name: str, description: str, func: Callable):
        self.name = name
        self.description = description
        self.func = func
        self.calls = 0
        self.avg_execution_time = 0.0
        self.error_rate = 0.0

    def execute(self, *args, **kwargs) -> ToolResult:
        """Execute tool with monitoring"""
        start_time = time.time()
        self.calls += 1

        try:
            result = self.func(*args, **kwargs)
            execution_time = time.time() - start_time

            self.avg_execution_time = ((self.avg_execution_time * (self.calls - 1)) + execution_time) / self.calls

            return ToolResult(
                success=True,
                data=result,
                execution_time=execution_time,
                metadata={'tool_name': self.name, 'call_count': self.calls}
            )
        except Exception as e:
            execution_time = time.time() - start_time
            self.error_rate = (self.error_rate * (self.calls - 1) + 1) / self.calls

            logger.error(f"Tool {self.name} failed: {str(e)}")
            return ToolResult(
                success=False,
                data=None,
                error=str(e),
                execution_time=execution_time,
                metadata={'tool_name': self.name, 'call_count': self.calls}
            )

We define a ToolResult dataclass to encapsulate every execution’s outcome, whether it succeeded, how long it took, any returned data, and error details if it failed. Our CustomTool base class then wraps individual functions with a unified execute method that tracks call counts, measures execution time, computes an average runtime, and logs any errors. By standardizing tool results and performance metrics this way, we ensure consistency and observability across all our custom utilities.
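
As a quick usage check of the classes above, we can wrap any plain function in a CustomTool and inspect the standardized ToolResult it returns:

def add_numbers(a: int, b: int) -> int:
    return a + b

# Wrap the function so every call is monitored and returns a ToolResult
calc_tool = CustomTool(name="add_numbers", description="Adds two integers", func=add_numbers)
outcome = calc_tool.execute(2, 3)
print(outcome.success, outcome.data, f"{outcome.execution_time:.4f}s")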

class CustomAgent:
    """Custom agent implementation with tool management"""
    def __init__(self, name: str, system_prompt: str = "", max_iterations: int = 5):
        self.name = name
        self.system_prompt = system_prompt
        self.max_iterations = max_iterations
        self.tools = {}
        self.conversation_history = []
        self.performance_metrics = {}

    def add_tool(self, tool: CustomTool):
        """Add a tool to the agent"""
        self.tools[tool.name] = tool

    def run(self, task: str) -> Dict[str, Any]:
        """Execute a task using available tools"""
        logger.info(f"Agent {self.name} executing task: {task}")

        task_lower = task.lower()
        results = []

        if any(keyword in task_lower for keyword in ['analyze', 'website', 'url', 'web']):
            if 'advanced_web_intelligence' in self.tools:
                import re
                url_pattern = r'https?://[^\s]+'
                urls = re.findall(url_pattern, task)
                if urls:
                    result = self.tools['advanced_web_intelligence'].execute(urls[0])
                    results.append(result)

        elif any(keyword in task_lower for keyword in ['data', 'analyze', 'stats', 'csv']):
            if 'advanced_data_science_toolkit' in self.tools:
                if 'name,age,salary' in task:
                    data_start = task.find('name,age,salary')
                    data_part = task[data_start:]
                    result = self.tools['advanced_data_science_toolkit'].execute(data_part, 'stats')
                    results.append(result)

        elif any(keyword in task_lower for keyword in ['generate', 'code', 'api', 'client']):
            if 'advanced_code_generator' in self.tools:
                result = self.tools['advanced_code_generator'].execute(task)
                results.append(result)

        return {
            'agent': self.name,
            'task': task,
            'results': [r.data if r.success else {'error': r.error} for r in results],
            'execution_summary': {
                'tools_used': len(results),
                'success_rate': sum(1 for r in results if r.success) / len(results) if results else 0,
                'total_time': sum(r.execution_time for r in results)
            }
        }

We encapsulate our AI logic in a CustomAgent class that holds a set of tools, a system prompt, and execution history, then routes each incoming task to the right tool based on simple keyword matching. In the run() method, we log the task, select the appropriate tool (web intelligence, data analysis, or code generation), execute it, and aggregate the results into a standardized response that includes success rates and timing metrics. This design enables us to easily extend agents by adding new tools and maintains our orchestration as both transparent and measurable.

print(" Building Advanced Tool Architecture")

def performance_monitor(func):
    """Decorator for monitoring tool performance"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            execution_time = time.time() - start_time
            logger.info(f"{func.__name__} executed in {execution_time:.2f}s")
            return result
        except Exception as e:
            logger.error(f"{func.__name__} failed: {str(e)}")
            raise
    return wrapper

@performance_monitor
def advanced_web_intelligence(url: str, analysis_type: str = "comprehensive") -> Dict[str, Any]:
    """
    Advanced web intelligence gathering with multiple analysis modes.

    Args:
        url: Target URL for analysis
        analysis_type: Type of analysis (comprehensive, sentiment, technical, seo)

    Returns:
        Dict containing structured analysis results
    """
    try:
        response = requests.get(url, timeout=API_TIMEOUT, headers={
            'User-Agent': 'Mozilla/5.0'
        })

        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        title = soup.find('title').text if soup.find('title') else 'No title'
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        meta_desc = meta_desc.get('content') if meta_desc else 'No description'

        if analysis_type == "comprehensive":
            return {
                'title': title,
                'description': meta_desc,
                'word_count': len(soup.get_text().split()),
                'image_count': len(soup.find_all('img')),
                'link_count': len(soup.find_all('a')),
                'headers': [h.text.strip() for h in soup.find_all(['h1', 'h2', 'h3'])[:5]],
                'status_code': response.status_code,
                'content_type': response.headers.get('content-type', 'unknown'),
                'page_size': len(response.content)
            }
        elif analysis_type == "sentiment":
            text = soup.get_text()[:2000]
            positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic']
            negative_words = ['bad', 'terrible', 'awful', 'horrible', 'disappointing']

            pos_count = sum(text.lower().count(word) for word in positive_words)
            neg_count = sum(text.lower().count(word) for word in negative_words)

            return {
                'sentiment_score': pos_count - neg_count,
                'positive_indicators': pos_count,
                'negative_indicators': neg_count,
                'text_sample': text[:200],
                'analysis_type': 'sentiment'
            }

    except Exception as e:
        return {'error': f"Analysis failed: {str(e)}"}

@performance_monitor
def advanced_data_science_toolkit(data: str, operation: str) -> Dict[str, Any]:
    """
    Comprehensive data science operations with statistical analysis.

    Args:
        data: CSV-like string or JSON data
        operation: Type of analysis (stats, correlation, forecast, clustering)

    Returns:
        Dict with analysis results
    """
    try:
        if data.startswith('{') or data.startswith('['):
            parsed_data = json.loads(data)
            df = pd.DataFrame(parsed_data)
        else:
            df = pd.read_csv(StringIO(data))

        if operation == "stats":
            numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()

            result = {
                'shape': df.shape,
                'columns': df.columns.tolist(),
                'dtypes': {col: str(dtype) for col, dtype in df.dtypes.items()},
                'missing_values': df.isnull().sum().to_dict(),
                'numeric_columns': numeric_columns
            }

            if len(numeric_columns) > 0:
                result['summary_stats'] = df[numeric_columns].describe().to_dict()
                if len(numeric_columns) > 1:
                    result['correlation_matrix'] = df[numeric_columns].corr().to_dict()

            return result

        elif operation == "clustering":
            from sklearn.cluster import KMeans
            from sklearn.preprocessing import StandardScaler

            numeric_df = df.select_dtypes(include=[np.number])
            if numeric_df.shape[1] < 2:
                return {'error': 'Need at least 2 numeric columns for clustering'}

            scaler = StandardScaler()
            scaled_data = scaler.fit_transform(numeric_df.fillna(0))

            n_clusters = min(3, max(2, len(numeric_df) // 2))
            kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
            clusters = kmeans.fit_predict(scaled_data)

            return {
                'n_clusters': n_clusters,
                'cluster_centers': kmeans.cluster_centers_.tolist(),
                'cluster_labels': clusters.tolist(),
                'inertia': float(kmeans.inertia_),
                'feature_names': numeric_df.columns.tolist()
            }

    except Exception as e:
        return {'error': f"Data analysis failed: {str(e)}"}

@performance_monitor
def advanced_code_generator(task_description: str, language: str = "python") -> Dict[str, str]:
    """
    Advanced code generation with multiple language support and optimization.

    Args:
        task_description: Description of coding task
        language: Target programming language

    Returns:
        Dict with generated code and metadata
    """
    templates = {
        'python': {
            'api_client': '''
import requests
import json
import time
from typing import Dict, Any, Optional

class APIClient:
    """Production-ready API client with retry logic and error handling"""

    def __init__(self, base_url: str, api_key: Optional[str] = None, timeout: int = 30):
        self.base_url = base_url.rstrip('/')
        self.timeout = timeout
        self.session = requests.Session()

        if api_key:
            self.session.headers.update({'Authorization': f'Bearer {api_key}'})

        self.session.headers.update({
            'Content-Type': 'application/json',
            'User-Agent': 'CustomAPIClient/1.0'
        })

    def _make_request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """Make HTTP request with retry logic"""
        url = f'{self.base_url}/{endpoint.lstrip("/")}'

        for attempt in range(3):
            try:
                response = self.session.request(method, url, timeout=self.timeout, **kwargs)
                response.raise_for_status()
                return response.json() if response.content else {}
            except requests.exceptions.RequestException as e:
                if attempt == 2:  # Last attempt
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def get(self, endpoint: str, params: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('GET', endpoint, params=params)

    def post(self, endpoint: str, data: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('POST', endpoint, json=data)

    def put(self, endpoint: str, data: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('PUT', endpoint, json=data)

    def delete(self, endpoint: str) -> Dict[str, Any]:
        return self._make_request('DELETE', endpoint)
''',
            'data_processor': '''
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
import logging

logger = logging.getLogger(__name__)

class DataProcessor:
    """Advanced data processor with comprehensive cleaning and analysis"""

    def __init__(self, data: pd.DataFrame):
        self.original_data = data.copy()
        self.processed_data = data.copy()
        self.processing_log = []

    def clean_data(self, strategy: str = 'auto') -> 'DataProcessor':
        """Clean data with configurable strategies"""
        initial_shape = self.processed_data.shape

        # Remove duplicates
        self.processed_data = self.processed_data.drop_duplicates()

        # Handle missing values based on strategy
        if strategy == 'auto':
            # For numeric columns, use mean
            numeric_cols = self.processed_data.select_dtypes(include=[np.number]).columns
            self.processed_data[numeric_cols] = self.processed_data[numeric_cols].fillna(
                self.processed_data[numeric_cols].mean()
            )

            # For categorical columns, use mode
            categorical_cols = self.processed_data.select_dtypes(include=['object']).columns
            for col in categorical_cols:
                mode_value = self.processed_data[col].mode()
                if len(mode_value) > 0:
                    self.processed_data[col] = self.processed_data[col].fillna(mode_value[0])

        final_shape = self.processed_data.shape
        self.processing_log.append(f"Cleaned data: {initial_shape} -> {final_shape}")
        return self

    def normalize(self, method: str = 'minmax', columns: Optional[List[str]] = None) -> 'DataProcessor':
        """Normalize numerical columns"""
        cols = columns or self.processed_data.select_dtypes(include=[np.number]).columns.tolist()

        if method == 'minmax':
            # Min-max normalization
            for col in cols:
                col_min, col_max = self.processed_data[col].min(), self.processed_data[col].max()
                if col_max != col_min:
                    self.processed_data[col] = (self.processed_data[col] - col_min) / (col_max - col_min)
        elif method == 'zscore':
            # Z-score normalization
            for col in cols:
                mean_val, std_val = self.processed_data[col].mean(), self.processed_data[col].std()
                if std_val != 0:
                    self.processed_data[col] = (self.processed_data[col] - mean_val) / std_val

        self.processing_log.append(f"Normalized columns {cols} using {method}")
        return self

    def get_insights(self) -> Dict[str, Any]:
        """Generate comprehensive data insights"""
        insights = {
            'basic_info': {
                'shape': self.processed_data.shape,
                'columns': self.processed_data.columns.tolist(),
                'dtypes': {col: str(dtype) for col, dtype in self.processed_data.dtypes.items()}
            },
            'data_quality': {
                'missing_values': self.processed_data.isnull().sum().to_dict(),
                'duplicate_rows': self.processed_data.duplicated().sum(),
                'memory_usage': self.processed_data.memory_usage(deep=True).to_dict()
            },
            'processing_log': self.processing_log
        }

        # Add statistical summary for numeric columns
        numeric_data = self.processed_data.select_dtypes(include=[np.number])
        if len(numeric_data.columns) > 0:
            insights['statistical_summary'] = numeric_data.describe().to_dict()

        return insights
'''
        }
    }

    task_lower = task_description.lower()
    if any(keyword in task_lower for keyword in ['api', 'client', 'http', 'request']):
        code = templates[language]['api_client']
        description = "Production-ready API client with retry logic and comprehensive error handling"
    elif any(keyword in task_lower for keyword in ['data', 'process', 'clean', 'analyze']):
        code = templates[language]['data_processor']
        description = "Advanced data processor with cleaning, normalization, and insight generation"
    else:
        code = f'''# Generated code template for: {task_description}
# Language: {language}

class CustomSolution:
    """Auto-generated solution template"""

    def __init__(self):
        self.initialized = True

    def execute(self, *args, **kwargs):
        """Main execution method - implement your logic here"""
        return {{"message": "Implement your custom logic here", "task": "{task_description}"}}

# Usage example:
# solution = CustomSolution()
# result = solution.execute()
'''
        description = f"Custom template for {task_description}"

    return {
        'code': code,
        'language': language,
        'description': description,
        'complexity': 'production-ready',
        'estimated_lines': len(code.split('\n')),
        'features': ['error_handling', 'logging', 'type_hints', 'documentation']
    }

We wrap each core function in a @performance_monitor decorator so we can log execution times and catch failures, then implement three specialized tools: advanced_web_intelligence for comprehensive or sentiment-driven web scraping, advanced_data_science_toolkit for statistical analysis and clustering on CSV or JSON data, and advanced_code_generator for producing production-ready code templates, ensuring we monitor performance and maintain consistency across all our analytics and code-generation utilities.

print(" Setting up Custom Agent Framework")

class AgentOrchestrator:
    """Manages multiple specialized agents with workflow coordination"""

    def __init__(self):
        self.agents = {}
        self.workflows = {}
        self.results_cache = {}
        self.performance_metrics = {}

    def create_specialist_agent(self, name: str, tools: List[CustomTool], system_prompt: str = None):
        """Create domain-specific agents"""
        agent = CustomAgent(
            name=name,
            system_prompt=system_prompt or f"You are a specialist {name} agent.",
            max_iterations=5
        )

        for tool in tools:
            agent.add_tool(tool)

        self.agents[name] = agent
        return agent

    def execute_workflow(self, workflow_name: str, inputs: Dict) -> Dict:
        """Execute multi-step workflows across agents"""
        if workflow_name not in self.workflows:
            raise ValueError(f"Workflow {workflow_name} not found")

        workflow = self.workflows[workflow_name]
        results = {}
        workflow_start = time.time()

        for step in workflow['steps']:
            agent_name = step['agent']
            task = step['task'].format(**inputs, **results)

            if agent_name in self.agents:
                step_start = time.time()
                result = self.agents[agent_name].run(task)
                step_time = time.time() - step_start

                results[step['output_key']] = result
                results[f"{step['output_key']}_time"] = step_time

        total_time = time.time() - workflow_start

        return {
            'workflow': workflow_name,
            'inputs': inputs,
            'results': results,
            'metadata': {
                'total_execution_time': total_time,
                'steps_completed': len(workflow['steps']),
                'success': True
            }
        }

    def get_system_status(self) -> Dict[str, Any]:
        """Get comprehensive system status"""
        return {
            'agents': {name: {'tools': len(agent.tools)} for name, agent in self.agents.items()},
            'workflows': list(self.workflows.keys()),
            'cache_size': len(self.results_cache),
            'total_tools': sum(len(agent.tools) for agent in self.agents.values())
        }

orchestrator = AgentOrchestrator()

web_tool = CustomTool(
    name="advanced_web_intelligence",
    description="Advanced web analysis and intelligence gathering",
    func=advanced_web_intelligence
)

data_tool = CustomTool(
    name="advanced_data_science_toolkit",
    description="Comprehensive data science and statistical analysis",
    func=advanced_data_science_toolkit
)

code_tool = CustomTool(
    name="advanced_code_generator",
    description="Advanced code generation and architecture",
    func=advanced_code_generator
)

web_agent = orchestrator.create_specialist_agent(
    "web_analyst",
    [web_tool],
    "You are a web analysis specialist. Provide comprehensive website analysis and insights."
)

data_agent = orchestrator.create_specialist_agent(
    "data_scientist",
    [data_tool],
    "You are a data science expert. Perform statistical analysis and machine learning tasks."
)

code_agent = orchestrator.create_specialist_agent(
    "code_architect",
    [code_tool],
    "You are a senior software architect. Generate optimized, production-ready code."
)
We initialize an AgentOrchestrator to manage our suite of AI agents, register each CustomTool implementation for web intelligence, data science, and code generation, and then spin up three domain-specific agents: web_analyst, data_scientist, and code_architect. Each agent is seeded with its respective toolset and a clear system prompt. This setup enables us to coordinate and execute multi-step workflows across specialized expertise areas within a single, unified framework.

print(" Defining Advanced Workflows")

orchestrator.workflows['competitive_analysis'] = {
    'steps': [
        {
            'agent': 'web_analyst',
            'task': 'Analyze website {target_url} with comprehensive analysis',
            'output_key': 'website_analysis'
        },
        {
            'agent': 'code_architect',
            'task': 'Generate monitoring code for website analysis automation',
            'output_key': 'monitoring_code'
        }
    ]
}

orchestrator.workflows['data_pipeline'] = {
    'steps': [
        {
            'agent': 'data_scientist',
            'task': 'Analyze the following CSV data with stats operation: {data_input}',
            'output_key': 'data_analysis'
        },
        {
            'agent': 'code_architect',
            'task': 'Generate data processing pipeline code',
            'output_key': 'pipeline_code'
        }
    ]
}

We define two key multi-agent workflows: competitive_analysis, which involves our web analyst scraping and analyzing a target URL before passing insights to our code architect to generate monitoring scripts, and data_pipeline, where our data scientist runs statistical analyses on CSV inputs. Then our code architect crafts the corresponding ETL pipeline code. These declarative step sequences let us orchestrate complex tasks end-to-end with minimal boilerplate.

print(" Running Production Examples")

print("\n Advanced Web Intelligence Demo")
try:
    web_result = web_agent.run("Analyze https://httpbin.org/html with comprehensive analysis type")
    print(f" Web Analysis Success: {json.dumps(web_result, indent=2)}")
except Exception as e:
    print(f" Web analysis error: {e}")

print("\n Data Science Pipeline Demo")
sample_data = """name,age,salary,department
Alice,25,50000,Engineering
Bob,30,60000,Engineering
Carol,35,70000,Marketing
David,28,55000,Engineering
Eve,32,65000,Marketing"""

try:
    data_result = data_agent.run(f"Analyze this data with stats operation: {sample_data}")
    print(f" Data Analysis Success: {json.dumps(data_result, indent=2)}")
except Exception as e:
    print(f" Data analysis error: {e}")

print("\n Code Architecture Demo")
try:
    code_result = code_agent.run("Generate an API client for data processing tasks")
    print(f" Code Generation Success: Generated {len(code_result['results'][0]['code'].split())} lines of code")
except Exception as e:
    print(f" Code generation error: {e}")

print("\n Multi-Agent Workflow Demo")
try:
    workflow_inputs = {'target_url': 'https://httpbin.org/html'}
    workflow_result = orchestrator.execute_workflow('competitive_analysis', workflow_inputs)
    print(f" Workflow Success: Completed in {workflow_result['metadata']['total_execution_time']:.2f}s")
except Exception as e:
    print(f" Workflow error: {e}")

We run a suite of production demos to validate each component: first, our web_analyst performs a full-site analysis; next, our data_scientist crunches sample CSV stats; then our code_architect generates an API client; and finally we orchestrate the end-to-end competitive analysis workflow, capturing success indicators, outputs, and execution timing for each step.

print("\n System Performance Metrics")

system_status = orchestrator.get_system_status()
print(f"System Status: {json.dumps(system_status, indent=2)}")

print("\nTool Performance:")
for agent_name, agent in orchestrator.agents.items():
    print(f"\n{agent_name}:")
    for tool_name, tool in agent.tools.items():
        print(f" - {tool_name}: {tool.calls} calls, {tool.avg_execution_time:.3f}s avg, {tool.error_rate:.1%} error rate")

print("\n Advanced Custom Agent Framework Complete!")
print(" Production-ready implementation with full monitoring and error handling!")

We finish by retrieving and printing our orchestrator’s overall system status, listing registered agents, workflows, and cache size, then loop through each agent’s tools to display call counts, average execution times, and error rates. This gives us a real-time view of performance and reliability before we log a final confirmation that our production-ready agent framework is complete.

In conclusion, we now have a blueprint for creating specialized AI agents that perform complex analyses and generate production-quality code, and also self-monitor their execution health and resource usage. The AgentOrchestrator ties everything together, enabling you to coordinate multi-step workflows and capture granular performance insights across agents. Whether you’re automating market research, ETL tasks, or API client generation, this framework provides the extensibility, reliability, and observability required for enterprise-grade AI deployments.

Check out the code. All credit for this research goes to the researchers of this project.
The post Building Production-Ready Custom AI Agents for Enterprise Workflows with Monitoring, Orchestration, and Scalability appeared first on MarkTechPost.

EmbodiedGen: A Scalable 3D World Generator for Realistic Embodied AI S …

The Challenge of Scaling 3D Environments in Embodied AI

Creating realistic and accurately scaled 3D environments is essential for training and evaluating embodied AI. However, current methods still rely on manually designed 3D graphics, which are costly and lack realism, thereby limiting scalability and generalization. Unlike internet-scale data used in models like GPT and CLIP, embodied AI data is expensive, context-specific, and difficult to reuse. Reaching general-purpose intelligence in physical settings requires realistic simulations, reinforcement learning, and diverse 3D assets. While recent diffusion models and 3D generation techniques show promise, many still lack key features such as physical accuracy, watertight geometry, and correct scale, making them inadequate for robotic training environments. 

Limitations of Existing 3D Generation Techniques

3D object generation typically follows three main approaches: feedforward generation for fast results, optimization-based methods for high quality, and view reconstruction from multiple images. While recent techniques have improved realism by separating geometry and texture creation, many models still prioritize visual appearance over real-world physics. This makes them less suitable for simulations that require accurate scaling and watertight geometry. For 3D scenes, panoramic techniques have enabled full-view rendering, but they still lack interactivity. Although some tools attempt to enhance simulation environments with generated assets, the quality and diversity remain limited, falling short of complex embodied intelligence research needs. 

Introducing EmbodiedGen: Open-Source, Modular, and Simulation-Ready

EmbodiedGen is an open-source framework developed collaboratively by researchers from Horizon Robotics, the Chinese University of Hong Kong, Shanghai Qi Zhi Institute, and Tsinghua University. It is designed to generate realistic, scalable 3D assets tailored for embodied AI tasks. The platform outputs physically accurate, watertight 3D objects in URDF format, complete with metadata for simulation compatibility. Featuring six modular components, including image-to-3D, text-to-3D, layout generation, and object rearrangement, it enables controllable and efficient scene creation. By bridging the gap between traditional 3D graphics and robotics-ready assets, EmbodiedGen facilitates the scalable and cost-effective development of interactive environments for embodied intelligence research. 

Key Features: Multi-Modal Generation for Rich 3D Content

EmbodiedGen is a versatile toolkit designed to generate realistic and interactive 3D environments tailored for embodied AI tasks. It combines multiple generation modules: transforming images or text into detailed 3D objects, creating articulated items with movable parts, and generating diverse textures to improve visual quality. It also supports full scene construction by arranging these assets in a way that respects real-world physical properties and scale. The output is directly compatible with simulation platforms, making it easier and more affordable to build lifelike virtual worlds. This system helps researchers efficiently simulate real-world scenarios without relying on expensive manual modeling. 

Simulation Integration and Real-World Physical Accuracy

EmbodiedGen is a powerful and accessible platform that enables the generation of diverse, high-quality 3D assets tailored for research in embodied intelligence. It features several key modules that allow users to create assets from images or text, generate articulated and textured objects, and construct realistic scenes. These assets are watertight, photorealistic, and physically accurate, making them ideal for simulation-based training and evaluation in robotics. The platform supports integration with popular simulation environments, including OpenAI Gym, MuJoCo, Isaac Lab, and SAPIEN, enabling researchers to efficiently simulate tasks such as navigation, object manipulation, and obstacle avoidance at a low cost.

RoboSplatter: High-Fidelity 3DGS Rendering for Simulation

A notable feature is RoboSplatter, which brings advanced 3D Gaussian Splatting (3DGS) rendering into physical simulations. Unlike traditional graphics pipelines, RoboSplatter enhances visual fidelity while reducing computational overhead. Through modules like Texture Generation and Real-to-Sim conversion, users can edit the appearance of 3D assets or recreate real-world scenes with high realism. Overall, EmbodiedGen simplifies the creation of scalable, interactive 3D worlds, bridging the gap between real-world robotics and digital simulation. It is openly available as a user-friendly toolkit to support broader adoption and continued innovation in embodied AI research. 

Why This Research Matters

This research addresses a core bottleneck in embodied AI: the lack of scalable, realistic, and physics-compatible 3D environments for training and evaluation. While internet-scale data has driven progress in vision and language models, embodied intelligence demands simulation-ready assets with accurate scale, geometry, and interactivity—qualities often missing in traditional 3D generation pipelines. EmbodiedGen fills this gap by offering an open-source, modular platform capable of producing high-quality, controllable 3D objects and scenes compatible with major robotics simulators. Its ability to convert text and images into physically plausible 3D environments at scale makes it a foundational tool for advancing embodied AI research, digital twins, and real-to-sim learning.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

The post EmbodiedGen: A Scalable 3D World Generator for Realistic Embodied AI Simulations appeared first on MarkTechPost.

Google Researchers Release Magenta RealTime: An Open-Weight Model for …

Google’s Magenta team has introduced Magenta RealTime (Magenta RT), an open-weight, real-time music generation model that brings unprecedented interactivity to generative audio. Licensed under Apache 2.0 and available on GitHub and Hugging Face, Magenta RT is the first large-scale music generation model that supports real-time inference with dynamic, user-controllable style prompts.

Background: Real-Time Music Generation

Real-time control and live interactivity are foundational to musical creativity. While prior Magenta projects like Piano Genie and DDSP emphasized expressive control and signal modeling, Magenta RT extends these ambitions to full-spectrum audio synthesis. It closes the gap between generative models and human-in-the-loop composition by enabling instantaneous feedback and dynamic musical evolution.

Magenta RT builds upon MusicLM and MusicFX’s underlying modeling techniques. However, unlike their API- or batch-oriented modes of generation, Magenta RT supports streaming synthesis with forward real-time factor (RTF) >1—meaning it can generate faster than real-time, even on free-tier Colab TPUs.

Technical Overview

Magenta RT is a Transformer-based language model trained on discrete audio tokens. These tokens are produced via a neural audio codec, which operates at 48 kHz stereo fidelity. The model leverages an 800 million parameter Transformer architecture that has been optimized for:

Streaming generation in 2-second audio segments

Temporal conditioning with a 10-second audio history window

Multimodal style control, using either text prompts or reference audio

To support this, the model architecture adapts MusicLM’s staged training pipeline, integrating a new joint music-text embedding module known as MusicCoCa (a hybrid of MuLan and CoCa). This allows semantically meaningful control over genre, instrumentation, and stylistic progression in real time.

Data and Training

Magenta RT is trained on ~190,000 hours of instrumental stock music. This large and diverse dataset ensures wide genre generalization and smooth adaptation across musical contexts. The training data was tokenized using a hierarchical codec, which enables compact representations without losing fidelity. Each 2-second chunk is conditioned not only on a user-specified prompt but also on a rolling context of 10 seconds of prior audio, enabling smooth, coherent progression.

The model supports two input modalities for style prompts:

Textual prompts, which are converted into embeddings using MusicCoCa

Audio prompts, encoded into the same embedding space via a learned encoder

This fusion of modalities permits real-time genre morphing and dynamic instrument blending—capabilities essential for live composition and DJ-like performance scenarios.

Performance and Inference

Despite the model’s scale (800M parameters), Magenta RT generates each 2 seconds of audio in about 1.25 seconds of compute, a compute-to-audio ratio of roughly 0.625, meaning it runs about 1.6× faster than real time. This is sufficient for real-time usage, and inference can be executed on free-tier TPUs in Google Colab.

The generation process is chunked to allow continuous streaming: each 2s segment is synthesized in a forward pipeline, with overlapping windowing to ensure continuity and coherence. Latency is further minimized via optimizations in model compilation (XLA), caching, and hardware scheduling.
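To make the chunked-streaming idea concrete, here is an illustrative sketch of a streaming loop that crossfades consecutive 2-second segments and carries a rolling 10-second context, plus the real-time-factor arithmetic reported above. generate_chunk() is a hypothetical stand-in for Magenta RT's actual inference call, and the 50 ms crossfade length is an assumed value, not something specified in the release.

import numpy as np

SAMPLE_RATE = 48_000      # 48 kHz, per the model description
CONTEXT_SECONDS = 10.0    # rolling audio-history window used for conditioning
CROSSFADE_SECONDS = 0.05  # assumed overlap length for smoothing seams

def stream(generate_chunk, style_embedding, num_chunks=10):
    """Illustrative streaming loop; generate_chunk(context, style) is a hypothetical
    stand-in for the model's real inference call and returns ~2 s of stereo samples."""
    fade = int(CROSSFADE_SECONDS * SAMPLE_RATE)
    context_len = int(CONTEXT_SECONDS * SAMPLE_RATE)
    output = np.zeros((0, 2), dtype=np.float32)  # stereo buffer
    for _ in range(num_chunks):
        chunk = generate_chunk(output[-context_len:], style_embedding)
        if len(output) >= fade:
            # Linear crossfade over the overlap so consecutive segments stay coherent.
            ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)[:, None]
            output[-fade:] = output[-fade:] * (1.0 - ramp) + chunk[:fade] * ramp
            chunk = chunk[fade:]
        output = np.concatenate([output, chunk])
    return output

# Reported speed: about 1.25 s of compute per 2 s of audio.
rtf = 1.25 / 2.0  # ~0.625 compute-to-audio ratio, i.e. ~1.6x faster than real time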

Applications and Use Cases

Magenta RT is designed for integration into:

Live performances, where musicians or DJs can steer generation on-the-fly

Creative prototyping tools, offering rapid auditioning of musical styles

Educational tools, helping students understand structure, harmony, and genre fusion

Interactive installations, enabling responsive generative audio environments

Google has hinted at upcoming support for on-device inference and personal fine-tuning, which would allow creators to adapt the model to their unique stylistic signatures.

Comparison to Related Models

Magenta RT complements Google DeepMind’s MusicFX (DJ Mode) and Lyria’s RealTime API, but differs critically in being open source and self-hostable. It also stands apart from latent diffusion models (e.g., Riffusion) and autoregressive decoders (e.g., Jukebox) by focusing on codec-token prediction with minimal latency.

Compared to models like MusicGen or MusicLM, Magenta RT delivers lower latency and enables interactive generation, which is often missing from current prompt-to-audio pipelines that require full track generation upfront.

Conclusion

Magenta RealTime pushes the boundaries of real-time generative audio. By blending high-fidelity synthesis with dynamic user control, it opens up new possibilities for AI-assisted music creation. Its architecture balances scale and speed, while its open licensing ensures accessibility and community contribution. For researchers, developers, and musicians alike, Magenta RT represents a foundational step toward responsive, collaborative AI music systems.

Check out the Model on Hugging Face, GitHub Page, Technical Details and Colab Notebook. All credit for this research goes to the researchers of this project.

The post Google Researchers Release Magenta RealTime: An Open-Weight Model for Real-Time AI Music Generation appeared first on MarkTechPost.

This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent …

Multimodal LLMs: Expanding Capabilities Across Text and Vision

Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

The Challenge of Text-Only Forgetting in MLLMs

However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because visual tokens inserted into the language sequence divert the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question-and-answer (Q&A) tasks.

Limitations of Existing Mitigation Strategies

Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to restore text comprehension entirely. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules—visual and textual learners—into each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism. The structure resembles “wings” attached to either side of the attention layers. A routing component controls how much attention each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.

Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. In the first stage of training, only visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and their outputs are combined with those of the main model. This ensures that visual attention doesn’t overwhelm textual understanding.
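Because the summary describes the mechanism only at a high level, the following PyTorch sketch illustrates the general idea of low-rank side learners whose outputs are combined with the main attention output through a per-token router. The rank, tensor shapes, and gating details are illustrative assumptions, not WINGS' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankLearner(nn.Module):
    """Schematic low-rank side learner (visual or textual); rank and mixing are assumed."""
    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, hidden: torch.Tensor, modality_states: torch.Tensor) -> torch.Tensor:
        # Lightweight cross-attention-style mixing in a low-rank subspace.
        q = self.down(hidden)                              # (B, T, r)
        k = self.down(modality_states)                     # (B, S, r)
        attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.up(attn @ self.down(modality_states))  # residual update, (B, T, d)

class WingsLayerSketch(nn.Module):
    """Main attention output plus router-weighted visual/textual learner outputs."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # d_model must be divisible by n_heads for MultiheadAttention.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_learner = LowRankLearner(d_model)
        self.textual_learner = LowRankLearner(d_model)
        self.router = nn.Linear(d_model, 2)  # per-token weights for the two learners

    def forward(self, hidden, visual_states, textual_states):
        main_out, _ = self.attn(hidden, hidden, hidden)
        weights = F.softmax(self.router(hidden), dim=-1)   # (B, T, 2)
        side = (weights[..., :1] * self.visual_learner(hidden, visual_states)
                + weights[..., 1:] * self.textual_learner(hidden, textual_states))
        return hidden + main_out + side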

WINGS Performance Benchmarks Across Text and Multimodal Tasks

In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, representing an improvement of 9.70 points compared to a similar baseline model. For CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. In reasoning tasks like Race-High, it gained 11.9 points, and in WSC, an improvement of 11.12 points was recorded. In multimodal benchmarks like MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also demonstrated robust results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.

Conclusion: Toward More Balanced and Generalizable MLLMs

In summary, the researchers tackled the issue of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners alongside attention routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models appeared first on MarkTechPost.

Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, …

With the frequent release of new large language models (LLMs), there is a persistent quest to minimize repetitive errors, enhance robustness, and significantly improve user interactions. As AI models become integral to more sophisticated computational tasks, developers are consistently refining their capabilities, ensuring seamless integration within diverse, real-world scenarios.

Mistral AI has released Mistral Small 3.2 (Mistral-Small-3.2-24B-Instruct-2506), an updated version of its earlier release, Mistral-Small-3.1-24B-Instruct-2503. Although a minor release, Mistral Small 3.2 introduces fundamental upgrades that aim to enhance the model’s overall reliability and efficiency, particularly in handling complex instructions, avoiding redundant outputs, and maintaining stability under function-calling scenarios.

A significant enhancement in Mistral Small 3.2 is its accuracy in following precise instructions, since successful user interaction often hinges on executing subtle commands correctly. Benchmark scores reflect this improvement: on the Wildbench v2 instruction test, Mistral Small 3.2 achieved 65.33% accuracy, up from 55.6% for its predecessor, while performance on the difficult Arena Hard v2 test more than doubled, from 19.56% to 43.1%, evidence of its improved ability to grasp and execute intricate commands.


On repetition errors, Mistral Small 3.2 greatly reduces instances of infinite or repetitive output, a problem commonly faced in long conversational scenarios. Internal evaluations show that Small 3.2 roughly halves infinite-generation errors, from 2.11% in Small 3.1 to 1.29%. This substantial reduction directly improves the model’s usability and dependability in extended interactions. The new model also demonstrates a greater capability to call functions, making it well suited for automation tasks, and the improved robustness of the function-calling template translates into more stable and dependable integrations.

STEM-related benchmark improvements further demonstrate Small 3.2’s aptitude. For example, accuracy on the HumanEval Plus Pass@5 code test rose from 88.99% in Small 3.1 to 92.90%, MMLU Pro results increased from 66.76% to 69.06%, and GPQA Diamond improved slightly from 45.96% to 46.13%, showing broad competence in scientific and technical use cases.


Vision-based performance outcomes were inconsistent, with certain optimizations being selectively applied. ChartQA accuracy improved from 86.24% to 87.4%, and DocVQA marginally enhanced from 94.08% to 94.86%. In contrast, some tests, such as MMMU and Mathvista, experienced slight dips, indicating specific trade-offs encountered during the optimization process.

The key updates in Mistral Small 3.2 over Small 3.1 include:

Enhanced precision in instruction-following, with Wildbench v2 accuracy rising from 55.6% to 65.33%.

Reduced repetition errors, halving infinite generation instances from 2.11% to 1.29%.

Improved robustness in function calling templates, ensuring more stable integrations.

Notable increases in STEM-related performance, particularly in HumanEval Plus Pass@5 (92.90%) and MMLU Pro (69.06%).

In conclusion, Mistral Small 3.2 offers targeted and practical enhancements over its predecessor, providing users with greater accuracy, reduced redundancy, and improved integration capabilities. These advancements help position it as a reliable choice for complex AI-driven tasks across diverse application areas.

Check out the Model Card on Hugging Face. All credit for this research goes to the researchers of this project.
The post Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, Reduced Repetition, and Stronger Function Calling for AI Integration appeared first on MarkTechPost.

Building Event-Driven AI Agents with UAgents and Google Gemini: A Modu …

In this tutorial, we demonstrate how to use the UAgents framework to build a lightweight, event-driven AI agent architecture on top of Google’s Gemini API. We’ll start by applying nest_asyncio to enable nested event loops, then configure your Gemini API key and instantiate the GenAI client. Next, we’ll define our communication contracts, Question and Answer Pydantic models, and spin up two UAgents: one “gemini_agent” that listens for incoming Question messages, invokes the Gemini “flash” model to generate responses, and emits Answer messages; and one “client_agent” that triggers a query upon startup and handles the incoming answer. Finally, we’ll learn how to run these agents concurrently using Python’s multiprocessing utility and gracefully shut down the event loop once the exchange is complete, illustrating UAgents’ seamless orchestration of inter-agent messaging.

!pip install -q uagents google-genai

We install the UAgents framework and the Google GenAI client library, providing the necessary tooling to build and run your event-driven AI agents with Gemini. The -q flag runs the installation quietly, keeping your notebook output clean. Check out the Notebook here

import os, time, multiprocessing, asyncio
import nest_asyncio
from google import genai
from pydantic import BaseModel, Field
from uagents import Agent, Context

nest_asyncio.apply()

We set up our Python environment by importing essential modules, system utilities (os, time, multiprocessing, asyncio), nest_asyncio for enabling nested event loops (critical in notebooks), the Google GenAI client, Pydantic for schema validation, and core UAgents classes. Finally, nest_asyncio.apply() patches the event loop so you can run asynchronous UAgents workflows seamlessly in interactive environments. Check out the Notebook here

os.environ["GOOGLE_API_KEY"] = "Use Your Own API Key Here"

client = genai.Client()

Here we set our Gemini API key in the environment. Be sure to replace the placeholder with your actual key, and then initialize the GenAI client, which will handle all subsequent requests to Google’s Gemini models. This step ensures our agent has authenticated access to generate content through the API.

class Question(BaseModel):
    question: str = Field(...)

class Answer(BaseModel):
    answer: str = Field(...)

These Pydantic models define the structured message formats that our agents will exchange with each other. The Question model carries a single question string field, and the Answer model carries a single answer string field. By using Pydantic, we get automatic validation and serialization of incoming and outgoing messages, ensuring that each agent always works with well-formed data.

ai_agent = Agent(
    name="gemini_agent",
    seed="agent_seed_phrase",
    port=8000,
    endpoint=["http://127.0.0.1:8000/submit"]
)

@ai_agent.on_event("startup")
async def ai_startup(ctx: Context):
    ctx.logger.info(f"{ai_agent.name} listening on {ai_agent.address}")

def ask_gemini(q: str) -> str:
    resp = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Answer the question: {q}"
    )
    return resp.text

@ai_agent.on_message(model=Question, replies=Answer)
async def handle_question(ctx: Context, sender: str, msg: Question):
    ans = ask_gemini(msg.question)
    await ctx.send(sender, Answer(answer=ans))

In this block, we instantiate the UAgents “gemini_agent” with a unique name, seed phrase (for deterministic identity), listening port, and HTTP endpoint for message submissions. We then register a startup event handler that logs when the agent is ready, ensuring visibility into its lifecycle. The synchronous helper ask_gemini wraps the GenAI client call to Gemini’s “flash” model. At the same time, the @ai_agent.on_message handler deserializes incoming Question messages, invokes ask_gemini, and asynchronously sends back a validated Answer payload to the original sender. Check out the Notebook here

client_agent = Agent(
    name="client_agent",
    seed="client_seed_phrase",
    port=8001,
    endpoint=["http://127.0.0.1:8001/submit"]
)

@client_agent.on_event("startup")
async def ask_on_start(ctx: Context):
    await ctx.send(ai_agent.address, Question(question="What is the capital of France?"))

@client_agent.on_message(model=Answer)
async def handle_answer(ctx: Context, sender: str, msg: Answer):
    print("Answer from Gemini:", msg.answer)
    # Use a more graceful shutdown
    asyncio.create_task(shutdown_loop())

async def shutdown_loop():
    await asyncio.sleep(1)  # Give time for cleanup
    loop = asyncio.get_event_loop()
    loop.stop()

We set up a “client_agent” that, upon startup, sends a Question to the gemini_agent asking for the capital of France, then listens for an Answer, prints the received response, and gracefully shuts down the event loop after a brief delay. Check out the Notebook here

def run_agent(agent):
    agent.run()

if __name__ == "__main__":
    p = multiprocessing.Process(target=run_agent, args=(ai_agent,))
    p.start()
    time.sleep(2)

    client_agent.run()

    p.join()

Finally, we define a helper run_agent function that calls agent.run(), then uses Python’s multiprocessing to launch the gemini_agent in its process. After giving it a moment to spin up, it runs the client_agent in the main process, blocking until the answer round-trip completes, and finally joins the background process to ensure a clean shutdown.

In conclusion, with this UAgents-focused tutorial, we now have a clear blueprint for creating modular AI services that communicate via well-defined event hooks and message schemas. You’ve seen how UAgents simplifies agent lifecycle management, registering startup events, handling incoming messages, and sending structured replies, all without boilerplate networking code. From here, you can expand your UAgents setup to include more sophisticated conversation workflows, multiple message types, and dynamic agent discovery.

Check out the Notebook here. All credit for this research goes to the researchers of this project.
The post Building Event-Driven AI Agents with UAgents and Google Gemini: A Modular Python Implementation Guide appeared first on MarkTechPost.

PoE-World + Planner Outperforms Reinforcement Learning RL Baselines in …

The Importance of Symbolic Reasoning in World Modeling

Understanding how the world works is key to creating AI agents that can adapt to complex situations. While neural network-based models, such as Dreamer, offer flexibility, they require massive amounts of data to learn effectively, far more than humans typically do. On the other hand, newer methods use program synthesis with large language models to generate code-based world models. These are more data-efficient and can generalize well from limited input. However, their use has been mostly limited to simple domains, such as text or grid worlds, as scaling to complex, dynamic environments remains a challenge due to the difficulty of generating large, comprehensive programs.

Limitations of Existing Programmatic World Models

Recent research has investigated the use of programs to represent world models, often leveraging large language models to synthesize Python transition functions. Approaches like WorldCoder and CodeWorldModels generate a single, large program, which limits their scalability in complex environments and their ability to handle uncertainty and partial observability. Some studies focus on high-level symbolic models for robotic planning by integrating visual input with abstract reasoning. Earlier efforts employed restricted domain-specific languages tailored to specific benchmarks or utilized conceptually related structures, such as factor graphs in Schema Networks. Theoretical models, such as AIXI, also explore world modeling using Turing machines and history-based representations.

Introducing PoE-World: Modular and Probabilistic World Models

Researchers from Cornell, Cambridge, The Alan Turing Institute, and Dalhousie University introduce PoE-World, an approach to learning symbolic world models by combining many small, LLM-synthesized programs, each capturing a specific rule of the environment. Instead of creating one large program, PoE-World builds a modular, probabilistic structure that can learn from brief demonstrations. This setup supports generalization to new situations, allowing agents to plan effectively, even in complex games like Pong and Montezuma’s Revenge. While it doesn’t model raw pixel data, it learns from symbolic object observations and emphasizes accurate modeling over exploration for efficient decision-making.

Architecture and Learning Mechanism of PoE-World

PoE-World models the environment as a combination of small, interpretable Python programs called programmatic experts, each responsible for a specific rule or behavior. These experts are weighted and combined to predict future states based on past observations and actions. By treating features as conditionally independent and learning from the full history, the model remains modular and scalable. Hard constraints refine predictions, and experts are updated or pruned as new data is collected. The model supports planning and reinforcement learning by simulating likely future outcomes, enabling efficient decision-making. Programs are synthesized using LLMs and interpreted probabilistically, with expert weights optimized via gradient descent.
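The following toy sketch illustrates the combination step described above: several small expert programs each score a candidate next state, and their weighted log-probabilities are summed (a product of experts) and normalized. The expert signatures, the toy rules, and the weighting scheme are illustrative assumptions, not PoE-World's actual code.

import numpy as np

def combine_experts(experts, weights, state, action, candidate_next_states):
    """Schematic product-of-experts scoring. Each 'expert' here is a plain Python
    callable returning the probability it assigns to a candidate next state; real
    PoE-World experts are LLM-synthesized programs over symbolic object features."""
    scores = []
    for ns in candidate_next_states:
        # Weighted sum of log-probabilities == log of a weighted product of experts.
        log_p = sum(w * np.log(max(expert(state, action, ns), 1e-9))
                    for expert, w in zip(experts, weights))
        scores.append(log_p)
    scores = np.asarray(scores)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()  # normalized distribution over candidates

# Toy experts for a 1-D "paddle position" feature: one rule per program.
expert_move = lambda s, a, ns: 0.9 if ns == s + {"up": 1, "down": -1}.get(a, 0) else 0.05
expert_bounds = lambda s, a, ns: 0.9 if 0 <= ns <= 10 else 0.05

dist = combine_experts([expert_move, expert_bounds], [1.0, 1.0],
                       state=5, action="up", candidate_next_states=[4, 5, 6])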

Empirical Evaluation on Atari Games

The study evaluates their agent, PoE-World + Planner, on Atari’s Pong and Montezuma’s Revenge, including harder, modified versions of these games. Using minimal demonstration data, their method outperforms baselines such as PPO, ReAct, and WorldCoder, particularly in low-data settings. PoE-World demonstrates strong generalization by accurately modeling game dynamics, even in altered environments without new demonstrations. It’s also the only method to consistently score positively in Montezuma’s Revenge. Pre-training policies in PoE-World’s simulated environment accelerate real-world learning. Unlike WorldCoder’s limited and sometimes inaccurate models, PoE-World produces more detailed, constraint-aware representations, leading to better planning and more realistic in-game behavior.

Conclusion: Symbolic, Modular Programs for Scalable AI Planning

In conclusion, understanding how the world works is crucial to building adaptive AI agents; however, traditional deep learning models require large datasets and struggle to update flexibly with limited input. Inspired by how humans and symbolic systems recombine knowledge, the study proposes PoE-World. This method utilizes large language models to synthesize modular, programmatic “experts” that represent different parts of the world. These experts combine compositionally to form a symbolic, interpretable world model that supports strong generalization from minimal data. Tested on Atari games like Pong and Montezuma’s Revenge, this approach demonstrates efficient planning and performance, even in unfamiliar scenarios. Code and demos are publicly available.

Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.
The post PoE-World + Planner Outperforms Reinforcement Learning RL Baselines in Montezuma’s Revenge with Minimal Demonstration Data appeared first on MarkTechPost.

Build an Intelligent Multi-Tool AI Agent Interface Using Streamlit for …

In this tutorial, we’ll build a powerful and interactive Streamlit application that brings together the capabilities of LangChain, the Google Gemini API, and a suite of advanced tools to create a smart AI assistant. Using Streamlit’s intuitive interface, we’ll create a chat-based system that can search the web, fetch Wikipedia content, perform calculations, remember key details, and handle conversation history, all in real time. Whether we’re developers, researchers, or just exploring AI, this setup allows us to interact with a multi-agent system directly from the browser with minimal code and maximum flexibility.

!pip install -q streamlit langchain langchain-google-genai langchain-community
!pip install -q pyngrok python-dotenv wikipedia duckduckgo-search
!npm install -g localtunnel

import streamlit as st
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool, WikipediaQueryRun, DuckDuckGoSearchRun
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.callbacks.streamlit import StreamlitCallbackHandler
from langchain_community.utilities import WikipediaAPIWrapper, DuckDuckGoSearchAPIWrapper
import asyncio
import threading
import time
from datetime import datetime
import json

We begin by installing all the necessary Python and Node.js packages required for our AI assistant app. This includes Streamlit for the frontend, LangChain for agent logic, and tools like Wikipedia, DuckDuckGo, and ngrok/localtunnel for external search and hosting. Once set up, we import all modules to start building our interactive multi-tool AI agent.

GOOGLE_API_KEY = "Use Your API Key Here"
NGROK_AUTH_TOKEN = "Use Your Auth Token Here"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

Next, we configure our environment by setting the Google Gemini API key and the ngrok authentication token. We assign these credentials to variables and set the GOOGLE_API_KEY so the LangChain agent can securely access the Gemini model during execution.

class InnovativeAgentTools:
    """Advanced tool collection for the multi-agent system"""

    @staticmethod
    def get_calculator_tool():
        def calculate(expression: str) -> str:
            """Calculate mathematical expressions safely"""
            try:
                allowed_chars = set('0123456789+-*/.() ')
                if all(c in allowed_chars for c in expression):
                    result = eval(expression)
                    return f"Result: {result}"
                else:
                    return "Error: Invalid mathematical expression"
            except Exception as e:
                return f"Calculation error: {str(e)}"

        return Tool(
            name="Calculator",
            func=calculate,
            description="Calculate mathematical expressions. Input should be a valid math expression."
        )

    @staticmethod
    def get_memory_tool(memory_store):
        def save_memory(key_value: str) -> str:
            """Save information to memory"""
            try:
                key, value = key_value.split(":", 1)
                memory_store[key.strip()] = value.strip()
                return f"Saved '{key.strip()}' to memory"
            except:
                return "Error: Use format 'key: value'"

        def recall_memory(key: str) -> str:
            """Recall information from memory"""
            return memory_store.get(key.strip(), f"No memory found for '{key}'")

        return [
            Tool(name="SaveMemory", func=save_memory,
                 description="Save information to memory. Format: 'key: value'"),
            Tool(name="RecallMemory", func=recall_memory,
                 description="Recall saved information. Input: key to recall")
        ]

    @staticmethod
    def get_datetime_tool():
        def get_current_datetime(format_type: str = "full") -> str:
            """Get current date and time"""
            now = datetime.now()
            if format_type == "date":
                return now.strftime("%Y-%m-%d")
            elif format_type == "time":
                return now.strftime("%H:%M:%S")
            else:
                return now.strftime("%Y-%m-%d %H:%M:%S")

        return Tool(
            name="DateTime",
            func=get_current_datetime,
            description="Get current date/time. Options: 'date', 'time', or 'full'"
        )

Here, we define the InnovativeAgentTools class to equip our AI agent with specialized capabilities. We implement tools such as a Calculator for safe expression evaluation, Memory Tools to save and recall information across turns, and a date and time tool to fetch the current date and time. These tools enable our Streamlit AI agent to reason, remember, and respond contextually, much like a true assistant. Check out the full Notebook here

class MultiAgentSystem:
    """Innovative multi-agent system with specialized capabilities"""

    def __init__(self, api_key: str):
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-pro",
            google_api_key=api_key,
            temperature=0.7,
            convert_system_message_to_human=True
        )
        self.memory_store = {}
        self.conversation_memory = ConversationBufferWindowMemory(
            memory_key="chat_history",
            k=10,
            return_messages=True
        )
        self.tools = self._initialize_tools()
        self.agent = self._create_agent()

    def _initialize_tools(self):
        """Initialize all available tools"""
        tools = []

        tools.extend([
            DuckDuckGoSearchRun(api_wrapper=DuckDuckGoSearchAPIWrapper()),
            WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
        ])

        tools.append(InnovativeAgentTools.get_calculator_tool())
        tools.append(InnovativeAgentTools.get_datetime_tool())
        tools.extend(InnovativeAgentTools.get_memory_tool(self.memory_store))

        return tools

    def _create_agent(self):
        """Create the ReAct agent with advanced prompt"""
        prompt = PromptTemplate.from_template("""
You are an advanced AI assistant with access to multiple tools and persistent memory.

AVAILABLE TOOLS:
{tools}

TOOL USAGE FORMAT:
- Think step by step about what you need to do
- Use Action: tool_name
- Use Action Input: your input
- Wait for Observation
- Continue until you have a final answer

MEMORY CAPABILITIES:
- You can save important information using SaveMemory
- You can recall previous information using RecallMemory
- Always try to remember user preferences and context

CONVERSATION HISTORY:
{chat_history}

CURRENT QUESTION: {input}

REASONING PROCESS:
{agent_scratchpad}

Begin your response with your thought process, then take action if needed.
""")

        agent = create_react_agent(self.llm, self.tools, prompt)
        return AgentExecutor(
            agent=agent,
            tools=self.tools,
            memory=self.conversation_memory,
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=5
        )

    def chat(self, message: str, callback_handler=None):
        """Process user message and return response"""
        try:
            if callback_handler:
                response = self.agent.invoke(
                    {"input": message},
                    {"callbacks": [callback_handler]}
                )
            else:
                response = self.agent.invoke({"input": message})
            return response["output"]
        except Exception as e:
            return f"Error processing request: {str(e)}"

In this section, we build the core of our application, the MultiAgentSystem class. Here, we integrate the Gemini Pro model using LangChain and initialize all essential tools, including web search, memory, and calculator functions. We configure a ReAct-style agent using a custom prompt that guides tool usage and memory handling. Finally, we define a chat method that allows the agent to process user input, invoke tools when necessary, and generate intelligent, context-aware responses. Check out the full Notebook here

def create_streamlit_app():
    """Create the innovative Streamlit application"""

    st.set_page_config(
        page_title="Advanced LangChain Agent with Gemini",
        page_icon="",
        layout="wide",
        initial_sidebar_state="expanded"
    )

    st.markdown("""
    <style>
    .main-header {
        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
        padding: 1rem;
        border-radius: 10px;
        color: white;
        text-align: center;
        margin-bottom: 2rem;
    }
    .agent-response {
        background-color: #f0f2f6;
        padding: 1rem;
        border-radius: 10px;
        border-left: 4px solid #667eea;
        margin: 1rem 0;
    }
    .memory-card {
        background-color: #e8f4fd;
        padding: 1rem;
        border-radius: 8px;
        margin: 0.5rem 0;
    }
    </style>
    """, unsafe_allow_html=True)

    st.markdown("""
    <div class="main-header">
        <h1>Advanced Multi-Agent System</h1>
        <p>Powered by LangChain + Gemini API + Streamlit</p>
    </div>
    """, unsafe_allow_html=True)

    with st.sidebar:
        st.header("Configuration")

        api_key = st.text_input(
            "Google AI API Key",
            type="password",
            value=GOOGLE_API_KEY if GOOGLE_API_KEY != "your-gemini-api-key-here" else "",
            help="Get your API key from https://ai.google.dev/"
        )

        if not api_key:
            st.error("Please enter your Google AI API key to continue")
            st.stop()

        st.success("API Key configured")

        st.header("Agent Capabilities")
        st.markdown("""
        - **Web Search** (DuckDuckGo)
        - **Wikipedia Lookup**
        - **Mathematical Calculator**
        - **Persistent Memory**
        - **Date & Time**
        - **Conversation History**
        """)

        if 'agent_system' in st.session_state:
            st.header("Memory Store")
            memory = st.session_state.agent_system.memory_store
            if memory:
                for key, value in memory.items():
                    st.markdown(f"""
                    <div class="memory-card">
                        <strong>{key}:</strong> {value}
                    </div>
                    """, unsafe_allow_html=True)
            else:
                st.info("No memories stored yet")

    if 'agent_system' not in st.session_state:
        with st.spinner("Initializing Advanced Agent System..."):
            st.session_state.agent_system = MultiAgentSystem(api_key)
        st.success("Agent System Ready!")

    st.header("Interactive Chat")

    if 'messages' not in st.session_state:
        st.session_state.messages = [{
            "role": "assistant",
            "content": """Hello! I'm your advanced AI assistant powered by Gemini. I can:

• Search the web and Wikipedia for information
• Perform mathematical calculations
• Remember important information across our conversation
• Provide current date and time
• Maintain conversation context

Try asking me something like:
- "Calculate 15 * 8 + 32"
- "Search for recent news about AI"
- "Remember that my favorite color is blue"
- "What's the current time?"
"""
        }]

    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    if prompt := st.chat_input("Ask me anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            callback_handler = StreamlitCallbackHandler(st.container())

            with st.spinner("Thinking..."):
                response = st.session_state.agent_system.chat(prompt, callback_handler)

            st.markdown(f"""
            <div class="agent-response">
                {response}
            </div>
            """, unsafe_allow_html=True)

            st.session_state.messages.append({"role": "assistant", "content": response})

    st.header("Example Queries")
    col1, col2, col3 = st.columns(3)

    with col1:
        if st.button("Search Example"):
            example = "Search for the latest developments in quantum computing"
            st.session_state.example_query = example

    with col2:
        if st.button("Math Example"):
            example = "Calculate the compound interest on $1000 at 5% for 3 years"
            st.session_state.example_query = example

    with col3:
        if st.button("Memory Example"):
            example = "Remember that I work as a data scientist at TechCorp"
            st.session_state.example_query = example

    if 'example_query' in st.session_state:
        st.info(f"Example query: {st.session_state.example_query}")

In this section, we bring everything together by building an interactive web interface using Streamlit. We configure the app layout, define custom CSS styles, and set up a sidebar for inputting API keys and configuring agent capabilities. We initialize the multi-agent system, maintain a message history, and enable a chat interface that allows users to interact in real-time. To make it even easier to explore, we also provide example buttons for search, math, and memory-related queries,  all in a beautifully styled, responsive UI. Check out the full Notebook here

def setup_ngrok_auth(auth_token):
    """Setup ngrok authentication"""
    try:
        from pyngrok import ngrok, conf

        conf.get_default().auth_token = auth_token

        try:
            tunnels = ngrok.get_tunnels()
            print("Ngrok authentication successful!")
            return True
        except Exception as e:
            print(f"Ngrok authentication failed: {e}")
            return False

    except ImportError:
        print("pyngrok not installed. Installing...")
        import subprocess
        subprocess.run(['pip', 'install', 'pyngrok'], check=True)
        return setup_ngrok_auth(auth_token)

def get_ngrok_token_instructions():
    """Provide instructions for getting ngrok token"""
    return """
NGROK AUTHENTICATION SETUP:

1. Sign up for an ngrok account:
   - Visit: https://dashboard.ngrok.com/signup
   - Create a free account

2. Get your authentication token:
   - Go to: https://dashboard.ngrok.com/get-started/your-authtoken
   - Copy your authtoken

3. Replace 'your-ngrok-auth-token-here' in the code with your actual token

4. Alternative methods if ngrok fails:
   - Use Google Colab's built-in public URL feature
   - Use localtunnel: !npx localtunnel --port 8501
   - Use serveo.net: !ssh -R 80:localhost:8501 serveo.net
"""

Here, we set up a helper function to authenticate ngrok, which allows us to expose our local Streamlit app to the internet. We use the pyngrok library to configure the authentication token and verify the connection. If the token is missing or invalid, we provide detailed instructions on how to obtain one and suggest alternative tunneling methods, such as LocalTunnel or Serveo, making it easy for us to host and share our app from environments like Google Colab.

def main():
    """Main function to run the application"""
    try:
        create_streamlit_app()
    except Exception as e:
        st.error(f"Application error: {str(e)}")
        st.info("Please check your API key and try refreshing the page")

This main() function acts as the entry point for our Streamlit application. We simply call create_streamlit_app() to launch the full interface. If anything goes wrong, such as a missing API key or a failed tool initialization, we catch the error gracefully and display a helpful message, ensuring the user knows how to recover and continue using the app smoothly.

def run_in_colab():
    """Run the application in Google Colab with proper ngrok setup"""

    print("Starting Advanced LangChain Agent Setup...")

    if NGROK_AUTH_TOKEN == "your-ngrok-auth-token-here":
        print("NGROK_AUTH_TOKEN not configured!")
        print(get_ngrok_token_instructions())

        print("Attempting alternative tunnel methods...")
        try_alternative_tunnels()
        return

    print("Installing required packages...")
    import subprocess

    packages = [
        'streamlit',
        'langchain',
        'langchain-google-genai',
        'langchain-community',
        'wikipedia',
        'duckduckgo-search',
        'pyngrok'
    ]

    for package in packages:
        try:
            subprocess.run(['pip', 'install', package], check=True, capture_output=True)
            print(f"{package} installed")
        except subprocess.CalledProcessError:
            print(f"Failed to install {package}")

    app_content = '''
import streamlit as st
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool, WikipediaQueryRun, DuckDuckGoSearchRun
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.callbacks.streamlit import StreamlitCallbackHandler
from langchain_community.utilities import WikipediaAPIWrapper, DuckDuckGoSearchAPIWrapper
from datetime import datetime

# Configuration - Replace with your actual keys
GOOGLE_API_KEY = "''' + GOOGLE_API_KEY + '''"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

class InnovativeAgentTools:
    @staticmethod
    def get_calculator_tool():
        def calculate(expression: str) -> str:
            try:
                allowed_chars = set('0123456789+-*/.() ')
                if all(c in allowed_chars for c in expression):
                    result = eval(expression)
                    return f"Result: {result}"
                else:
                    return "Error: Invalid mathematical expression"
            except Exception as e:
                return f"Calculation error: {str(e)}"

        return Tool(name="Calculator", func=calculate,
                    description="Calculate mathematical expressions. Input should be a valid math expression.")

    @staticmethod
    def get_memory_tool(memory_store):
        def save_memory(key_value: str) -> str:
            try:
                key, value = key_value.split(":", 1)
                memory_store[key.strip()] = value.strip()
                return f"Saved '{key.strip()}' to memory"
            except:
                return "Error: Use format 'key: value'"

        def recall_memory(key: str) -> str:
            return memory_store.get(key.strip(), f"No memory found for '{key}'")

        return [
            Tool(name="SaveMemory", func=save_memory, description="Save information to memory. Format: 'key: value'"),
            Tool(name="RecallMemory", func=recall_memory, description="Recall saved information. Input: key to recall")
        ]

    @staticmethod
    def get_datetime_tool():
        def get_current_datetime(format_type: str = "full") -> str:
            now = datetime.now()
            if format_type == "date":
                return now.strftime("%Y-%m-%d")
            elif format_type == "time":
                return now.strftime("%H:%M:%S")
            else:
                return now.strftime("%Y-%m-%d %H:%M:%S")

        return Tool(name="DateTime", func=get_current_datetime,
                    description="Get current date/time. Options: 'date', 'time', or 'full'")

class MultiAgentSystem:
    def __init__(self, api_key: str):
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-pro",
            google_api_key=api_key,
            temperature=0.7,
            convert_system_message_to_human=True
        )
        self.memory_store = {}
        self.conversation_memory = ConversationBufferWindowMemory(
            memory_key="chat_history", k=10, return_messages=True
        )
        self.tools = self._initialize_tools()
        self.agent = self._create_agent()

    def _initialize_tools(self):
        tools = []
        try:
            tools.extend([
                DuckDuckGoSearchRun(api_wrapper=DuckDuckGoSearchAPIWrapper()),
                WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
            ])
        except Exception as e:
            st.warning(f"Search tools may have limited functionality: {e}")

        tools.append(InnovativeAgentTools.get_calculator_tool())
        tools.append(InnovativeAgentTools.get_datetime_tool())
        tools.extend(InnovativeAgentTools.get_memory_tool(self.memory_store))
        return tools

    def _create_agent(self):
        prompt = PromptTemplate.from_template("""
You are an advanced AI assistant with access to multiple tools and persistent memory.

AVAILABLE TOOLS:
{tools}

TOOL USAGE FORMAT:
- Think step by step about what you need to do
- Use Action: tool_name
- Use Action Input: your input
- Wait for Observation
- Continue until you have a final answer

CONVERSATION HISTORY:
{chat_history}

CURRENT QUESTION: {input}

REASONING PROCESS:
{agent_scratchpad}

Begin your response with your thought process, then take action if needed.
""")

        agent = create_react_agent(self.llm, self.tools, prompt)
        return AgentExecutor(agent=agent, tools=self.tools, memory=self.conversation_memory,
                             verbose=True, handle_parsing_errors=True, max_iterations=5)

    def chat(self, message: str, callback_handler=None):
        try:
            if callback_handler:
                response = self.agent.invoke({"input": message}, {"callbacks": [callback_handler]})
            else:
                response = self.agent.invoke({"input": message})
            return response["output"]
        except Exception as e:
            return f"Error processing request: {str(e)}"

# Streamlit App
st.set_page_config(page_title="Advanced LangChain Agent", page_icon="", layout="wide")

st.markdown("""
<style>
.main-header {
    background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
    padding: 1rem; border-radius: 10px; color: white; text-align: center; margin-bottom: 2rem;
}
.agent-response {
    background-color: #f0f2f6; padding: 1rem; border-radius: 10px;
    border-left: 4px solid #667eea; margin: 1rem 0;
}
.memory-card {
    background-color: #e8f4fd; padding: 1rem; border-radius: 8px; margin: 0.5rem 0;
}
</style>
""", unsafe_allow_html=True)

st.markdown('<div class="main-header"><h1>Advanced Multi-Agent System</h1><p>Powered by LangChain + Gemini API</p></div>', unsafe_allow_html=True)

with st.sidebar:
    st.header("Configuration")
    api_key = st.text_input("Google AI API Key", type="password", value=GOOGLE_API_KEY)

    if not api_key:
        st.error("Please enter your Google AI API key")
        st.stop()

    st.success("API Key configured")

    st.header("Agent Capabilities")
    # Escaped as \\n so the generated file keeps the literal newline escapes.
    st.markdown("- Web Search\\n- Wikipedia\\n- Calculator\\n- Memory\\n- Date/Time")

    if 'agent_system' in st.session_state and st.session_state.agent_system.memory_store:
        st.header("Memory Store")
        for key, value in st.session_state.agent_system.memory_store.items():
            st.markdown(f'<div class="memory-card"><strong>{key}:</strong> {value}</div>', unsafe_allow_html=True)

if 'agent_system' not in st.session_state:
    with st.spinner("Initializing Agent..."):
        st.session_state.agent_system = MultiAgentSystem(api_key)
    st.success("Agent Ready!")

if 'messages' not in st.session_state:
    st.session_state.messages = [{
        "role": "assistant",
        "content": "Hello! I'm your advanced AI assistant. I can search, calculate, remember information, and more! Try asking me to: calculate something, search for information, or remember a fact about you."
    }]

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask me anything..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        callback_handler = StreamlitCallbackHandler(st.container())
        with st.spinner("Thinking..."):
            response = st.session_state.agent_system.chat(prompt, callback_handler)
        st.markdown(f'<div class="agent-response">{response}</div>', unsafe_allow_html=True)
        st.session_state.messages.append({"role": "assistant", "content": response})

# Example buttons
st.header("Try These Examples")
col1, col2, col3 = st.columns(3)
with col1:
    if st.button("Calculate 15 * 8 + 32"):
        st.rerun()
with col2:
    if st.button("Search AI news"):
        st.rerun()
with col3:
    if st.button("Remember my name is Alex"):
        st.rerun()
'''

    with open('streamlit_app.py', 'w') as f:
        f.write(app_content)

    print("Streamlit app file created successfully!")

    if setup_ngrok_auth(NGROK_AUTH_TOKEN):
        start_streamlit_with_ngrok()
    else:
        print("Ngrok authentication failed. Trying alternative methods...")
        try_alternative_tunnels()

In the run_in_colab() function, we make it easy to deploy the Streamlit app directly from a Google Colab environment. We begin by installing all required packages, then dynamically generate and write the complete Streamlit app code to a streamlit_app.py file. We verify the presence of a valid ngrok token to enable public access to the app from Colab, and if it’s missing or invalid, we guide ourselves through fallback tunneling options. This setup allows us to interact with our AI agent from anywhere, all within a few cells in Colab. Check out the full Notebook here

def start_streamlit_with_ngrok():
    """Start Streamlit with ngrok tunnel"""
    import subprocess
    import threading
    import time
    from pyngrok import ngrok

    def start_streamlit():
        subprocess.run(['streamlit', 'run', 'streamlit_app.py', '--server.port=8501', '--server.headless=true'])

    print(" Starting Streamlit server...")
    thread = threading.Thread(target=start_streamlit)
    thread.daemon = True
    thread.start()

    time.sleep(5)

    try:
        print(" Creating ngrok tunnel...")
        public_url = ngrok.connect(8501)
        print(f" SUCCESS! Access your app at: {public_url}")
        print(" Your Advanced LangChain Agent is now running publicly!")
        print(" You can share this URL with others!")

        print(" Keeping tunnel alive... Press Ctrl+C to stop")
        try:
            ngrok_process = ngrok.get_ngrok_process()
            ngrok_process.proc.wait()
        except KeyboardInterrupt:
            print(" Shutting down...")
            ngrok.kill()

    except Exception as e:
        print(f" Ngrok tunnel failed: {e}")
        try_alternative_tunnels()

def try_alternative_tunnels():
    """Try alternative tunneling methods"""
    print(" Trying alternative tunnel methods...")

    import subprocess
    import threading
    import time

    def start_streamlit():
        subprocess.run(['streamlit', 'run', 'streamlit_app.py', '--server.port=8501', '--server.headless=true'])

    thread = threading.Thread(target=start_streamlit)
    thread.daemon = True
    thread.start()

    time.sleep(3)

    print(" Streamlit is running on http://localhost:8501")
    print("\n ALTERNATIVE TUNNEL OPTIONS:")
    print("1. localtunnel: Run this in a new cell:")
    print("   !npx localtunnel --port 8501")
    print("\n2. serveo.net: Run this in a new cell:")
    print("   !ssh -R 80:localhost:8501 serveo.net")
    print("\n3. Colab public URL (if available):")
    print("   Use the 'Public URL' button in Colab's interface")

    try:
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        print(" Shutting down...")

if __name__ == "__main__":
    try:
        get_ipython()
        print(" Google Colab detected - starting setup...")
        run_in_colab()
    except NameError:
        main()

In this final part, we set up the execution logic to run the app either in a local environment or inside Google Colab. The start_streamlit_with_ngrok() function launches the Streamlit server in the background and uses ngrok to expose it publicly, making it easy to access and share. If ngrok fails, the try_alternative_tunnels() function falls back to alternative tunneling options, such as localtunnel and Serveo. With the __main__ block, we automatically detect whether we're in Colab and launch the appropriate setup, making the entire deployment process smooth, flexible, and shareable from anywhere.

In conclusion, we’ll have a fully functional AI agent running inside a sleek Streamlit interface, capable of answering queries, remembering user inputs, and even sharing its services publicly using ngrok. We’ve seen how easily Streamlit enables us to integrate advanced AI functionalities into an engaging and user-friendly app. From here, we can expand the agent’s tools, plug it into larger workflows, or deploy it as part of our intelligent applications. With Streamlit as the front-end and LangChain agents powering the logic, we’ve built a solid foundation for next-gen interactive AI experiences.

Check out the full Notebook here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Build an Intelligent Multi-Tool AI Agent Interface Using Streamlit for Seamless Real-Time Interaction appeared first on MarkTechPost.

UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation …

Cybersecurity has become a significant area of interest in artificial intelligence, driven by the increasing reliance on large software systems and the expanding capabilities of AI tools. As threats evolve in complexity, ensuring the security of software systems has become more than just a matter of conventional protections; it now intersects with automated reasoning, vulnerability detection, and code-level comprehension. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers have been developing benchmarks and methods to systematically evaluate AI agents’ ability to understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and real-world cybersecurity complexities remains a key challenge.

Problem with Existing Benchmarks

One pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks often dominate current testing methods, which rarely mirror the messy and layered reality of large-scale software repositories. These environments involve intricate input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it’s difficult to determine whether AI agents can be trusted to perform tasks like vulnerability detection or exploit development. More importantly, current benchmarks don’t reflect the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a critical evaluation gap.

Limitations of Current Tools

Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on capture-the-flag-style tasks that offer limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to engage real-world vulnerabilities, but they often do so at a limited scale. Furthermore, many of the tools rely on either synthetic test cases or narrowly scoped challenge problems, which fail to represent the diversity of software inputs, execution paths, and bug types found in actual systems. Even specialized agents created for security analysis have been tested on benchmarks with only tens or a few hundred tasks, far short of the complexity of real-world threat landscapes.

Introducing CyberGym

Researchers introduced CyberGym, a large-scale and comprehensive benchmarking tool specifically designed to evaluate AI agents in real-world cybersecurity contexts. Developed at the University of California, Berkeley, CyberGym includes 1,507 distinct benchmark tasks sourced from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must generate a proof-of-concept test that reproduces the vulnerability in the unpatched version, and CyberGym evaluates success based on whether the vulnerability is triggered in the pre-patch version and absent in the post-patch one. This benchmark uniquely emphasizes the generation of Proof of Concepts (PoCs), a task that requires agents to traverse complex code paths and synthesize inputs to meet specific security conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.
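To make the success criterion concrete, the following is a minimal Python sketch of the kind of check the evaluator performs; the function names and the crash-detection heuristic are illustrative assumptions, not CyberGym's actual harness.

import subprocess

def triggers_crash(binary_path: str, poc_path: str) -> bool:
    # Run the (typically sanitizer-instrumented) target on the candidate
    # proof-of-concept input; a crash usually surfaces as a non-zero exit code.
    try:
        result = subprocess.run([binary_path, poc_path], capture_output=True, timeout=60)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode != 0

def poc_is_successful(pre_patch_binary: str, post_patch_binary: str, poc_path: str) -> bool:
    # A PoC counts as successful only if it triggers the vulnerability in the
    # pre-patch build and no longer does so in the patched build.
    return triggers_crash(pre_patch_binary, poc_path) and not triggers_crash(post_patch_binary, poc_path)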

CyberGym Evaluation Levels

The evaluation pipeline in CyberGym is built around four levels of difficulty, each increasing the amount of input information provided. At level 0, the agent is given only the codebase with no hint of the vulnerability. Level 1 adds a natural language description. Level 2 introduces a ground-truth proof of concept (PoC) and crash stack trace, while Level 3 includes the patch itself and the post-patch codebase. Each level presents a new layer of reasoning and complexity. For instance, in level 1, agents must infer the vulnerability’s location and context purely from its textual description and codebase. To ensure benchmark quality, CyberGym applies filters such as checking the informativeness of patch commit messages, validating proof-of-concept (PoC) reproducibility, and removing redundancy by comparing stack traces. The final dataset comprises codebases with a median of 1,117 files and 387,491 lines of code, ranging up to over 40,000 files and 7 million lines of code. The patch sizes also vary, modifying a median of 1 file and seven lines, but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities target various crash types, with 30.4% related to heap-buffer-overflow READ and 19.0% due to uninitialized value use.
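The level structure can be summarized as a simple mapping of what each task instance exposes to the agent; this is a paraphrase of the description above, not CyberGym's actual schema.

# Inputs provided to the agent at each CyberGym difficulty level
# (illustrative summary of the benchmark description, not its real schema).
CYBERGYM_LEVELS = {
    0: ["pre-patch codebase"],
    1: ["pre-patch codebase", "vulnerability description"],
    2: ["pre-patch codebase", "vulnerability description",
        "ground-truth PoC", "crash stack trace"],
    3: ["pre-patch codebase", "vulnerability description",
        "ground-truth PoC", "crash stack trace", "patch", "post-patch codebase"],
}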

Experimental Results

When tested against this benchmark, existing agents showed limited success. Among the four agent frameworks evaluated (OpenHands, Codex, ENiGMA, and Cybench), the top performer was OpenHands combined with Claude-3.7-Sonnet, which reproduced only 11.9% of target vulnerabilities. This performance dropped significantly when dealing with longer PoC inputs, as success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models, such as DeepSeek-V3, lagged, with only a 3.6% success rate. Even specialized models fine-tuned for code reasoning, like SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Providing richer input information at higher levels increased performance: level 3 saw 17.1% success, while level 0 achieved only 3.5%. Analysis also revealed that most successful PoC reproductions occurred between 20 and 40 execution steps, with many runs exceeding 90 steps and ultimately failing. Despite these challenges, agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched ones across real-world projects, demonstrating their latent capacity for novel discovery.

Key Takeaways

Benchmark Volume and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.

Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring under 5%.

Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.

Length Sensitivity: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.

Discovery Potential: 15 new zero-day vulnerabilities were discovered by agent-generated PoCs, validating their potential use in real-world security analysis.

Model Behavior: Most successful exploits were generated early in the task execution, with diminishing returns after 80 steps.

Tool Interactions: Agents performed better when allowed to interact with tools (e.g., using ‘awk’, ‘grep’, or installing ‘xxd’) and adapt PoCs based on runtime feedback.

Conclusion

In conclusion, this study highlights a critical problem: evaluating AI in cybersecurity is not only challenging but essential for understanding its limitations and capabilities. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the issue with a practical and detailed benchmark that forces agents to reason deeply across entire codebases, generate valid exploits, and adapt through iteration. The results make it clear that while current agents show promise, especially in discovering new bugs, there is still a long road ahead to enable AI to contribute to cybersecurity at scale reliably.

Check out the Paper, GitHub Page, Leaderboard. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework to Evaluate AI Agents on Large-Scale Vulnerabilities Across Massive Codebases appeared first on MarkTechPost.

From Backend Automation to Frontend Collaboration: What’s New in AG- …

Introduction

AI agents are increasingly moving from pure backend automators to visible, collaborative elements within modern applications. However, making agents genuinely interactive—capable of both responding to users and proactively guiding workflows—has long been an engineering headache. Each team ends up building custom communication channels, event handling, and state management, all for similar interaction needs.

The initial release of AG‑UI, announced in May 2025, served as a practical, open‑source proof-of-concept protocol for inline agent-user communication. It introduced a single-stream architecture—typically HTTP POST paired with Server-Sent Events (SSE)—and established a vocabulary of structured JSON events (e.g., TEXT_MESSAGE_CONTENT, TOOL_CALL_START, STATE_DELTA) that could drive interactive front-end components. The first version addressed core integration challenges—real-time streaming, tool orchestration, shared state, and standardized event handling—but users found that further formalization of event types, versioning, and framework support was needed for broader production use.

The latest AG-UI update takes a different approach. Instead of yet another toolkit, it offers a lightweight protocol that standardizes the conversation between agents and user interfaces. This new version brings the protocol closer to production quality, improves event clarity, and expands compatibility with real-world agent frameworks and clients.

What Sets AG-UI’s Latest Update Apart

AG-UI’s latest update is an incremental but meaningful step for agent-driven applications. Unlike earlier ad-hoc attempts at interactivity, the latest update of AG-UI is built around explicit, versioned events. The protocol isn’t tightly coupled to any particular stack; it’s designed to work with multiple agent backends and client types out of the box.

Key features in the latest update of AG-UI include:

A formal set of ~16 event types, covering the full lifecycle of an agent—streamed outputs, tool invocations, state updates, user prompts, and error handling.

Cleaner event schemas, allowing clients and agents to negotiate capabilities and synchronize state more reliably.

More robust support for both direct (native) integration and adapter-based wrapping of legacy agents.

Expanded documentation and SDKs that make the protocol practical for production use, not just experimentation.

Interactive Agents Require Consistency

Many AI agents today remain hidden in the backend, designed to handle requests and return results, with little regard for real-time user interaction. Making agents interactive means solving for several technical challenges:

Streaming: Agents need to send incremental results or messages as soon as they’re available, not just at the end of a process.

Shared State: Both agent and UI should stay in sync, reflecting changes as the task progresses.

Tool Calls: Agents must be able to request external tools (such as APIs or user actions) and get results back in a structured way.

Bidirectional Messaging: Users should be able to respond or guide the agent, not just passively observe.

Security and Control: Tool invocation, cancellations, and error signals should be explicit and managed safely.

Without a shared protocol, every developer ends up reinventing these wheels—often imperfectly.

How the Latest Update of AG-UI Works

AG-UI’s latest update formalizes the agent-user interaction as a stream of typed events. Agents emit these events as they operate; clients subscribe to the stream, interpret the events, and send responses when needed.

The Event Stream

The core of the latest update of AG-UI is its event taxonomy. There are ~16 event types, including:

message: Agent output, such as a status update or a chunk of generated text.

function_call: Agent asks the client to run a function or tool, often requiring an external resource or user action.

state_update: Synchronizes variables or progress information.

input_request: Prompts the user for a value or choice.

tool_result: Sends results from tools back to the agent.

error and control: Signal errors, cancellations, or completion.

All events are JSON-encoded, typed, and versioned. This structure makes it straightforward to parse events, handle errors gracefully, and add new capabilities over time.
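To illustrate what such a stream can look like on the wire, here is a minimal Python sketch of an agent emitting typed, JSON-encoded events as Server-Sent Events; the exact field names and payload shapes are illustrative assumptions rather than the official AG-UI schema.

import json
import time

def agent_event_stream(prompt: str):
    # Each AG-UI-style event is a typed, versioned JSON object
    # serialized into a Server-Sent Events "data:" frame.
    def sse(event: dict) -> str:
        return f"data: {json.dumps(event)}\n\n"

    yield sse({"type": "state_update", "version": "1.0", "payload": {"status": "started"}})
    for chunk in ["Looking up ", "the latest figures..."]:
        yield sse({"type": "message", "version": "1.0", "payload": {"delta": chunk}})
        time.sleep(0.1)  # simulate streaming latency
    yield sse({"type": "function_call", "version": "1.0",
               "payload": {"name": "search_web", "arguments": {"query": prompt}}})
    yield sse({"type": "control", "version": "1.0", "payload": {"status": "completed"}})

A client subscribed to this stream dispatches on the type field to update the interface, execute the requested tool, or close the session.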

Integrating Agents and Clients

There are two main patterns for integration:

Native: Agents are built or modified to emit AG-UI events directly during execution.

Adapter: For legacy or third-party agents, an adapter module can intercept outputs and translate them into AG-UI events.

On the client side, applications open a persistent connection (usually via SSE or WebSocket), listen for events, and update their interface or send structured responses as needed.

The protocol is intentionally transport-agnostic, but supports real-time streaming for responsiveness.
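As a sketch of the client side (assuming the stream above is served from a hypothetical /agent endpoint and that the requests library is available), the consumer loop might look like this:

import json
import requests

def listen(url: str = "http://localhost:8000/agent"):
    # Open a persistent connection and react to each typed event as it arrives.
    with requests.get(url, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            if event["type"] == "message":
                print(event["payload"]["delta"], end="", flush=True)
            elif event["type"] == "function_call":
                # Run the requested tool, then send a tool_result event back.
                print(f"\n[tool requested: {event['payload']['name']}]")
            elif event["type"] == "control":
                break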

Adoption and Ecosystem

Since its initial release, AG-UI has seen adoption among popular agent orchestration frameworks. The latest version's expanded event schema and improved documentation have accelerated integration efforts.

Current or in-progress integrations include:

LangChain, CrewAI, Mastra, AG2, Agno, LlamaIndex: Each offers orchestration for agents that can now interactively surface their internal state and progress.

AWS, A2A, ADK, AgentOps: Work is ongoing to bridge cloud, monitoring, and agent operation tools with AG-UI.

Human Layer (Slack integration): Demonstrates how agents can become collaborative team members in messaging environments.

The protocol has gained traction with developers looking to avoid building custom socket handlers and event schemas for each project. It currently has more than 3,500 GitHub stars and is being used in a growing number of agent-driven products.

Developer Experience

The latest update of AG-UI is designed to minimize friction for both agent builders and frontend engineers.

SDKs and Templates: The CLI tool npx create-ag-ui-app scaffolds a project with all dependencies and sample integrations included.

Clear Schemas: Events are versioned and documented, supporting robust error handling and future extensibility.

Practical Documentation: Real-world integration guides, example flows, and visual assets help reduce trial and error.

All resources and guides are available at AG-UI.com.

Use Cases

Embedded Copilots: Agents that work alongside users in existing apps, providing suggestions and explanations as tasks evolve.

Conversational UIs: Dialogue systems that maintain session state and support multi-turn interactions with tool usage.

Workflow Automation: Agents that orchestrate sequences involving both automated actions and human-in-the-loop steps.

Conclusion

The latest update of AG-UI provides a well-defined, lightweight protocol for building interactive agent-driven applications. Its event-driven architecture abstracts away much of the complexity of agent-user synchronization, real-time communication, and state management. With explicit schemas, broad framework support, and a focus on practical integration, AG‑UI latest update enables development teams to build more reliable, interactive AI systems—without repeatedly solving the same low-level problems.

Developers interested in adopting the latest update of AG-UI can find SDKs, technical documentation, and integration assets at AG-UI.com.

The CopilotKit team is also organizing a webinar.

Support open-source and Star the AG-UI GitHub repo.

Discord Community: https://go.copilotkit.ai/AG-UI-Discord

Thanks to the CopilotKit team for the thought leadership and resources for this article. The CopilotKit team has supported us in this content.
The post From Backend Automation to Frontend Collaboration: What’s New in AG-UI Latest Update for AI Agent-User Interaction appeared first on MarkTechPost.

MiniMax AI Releases MiniMax-M1: A 456B Parameter Hybrid Model for Long …

The Challenge of Long-Context Reasoning in AI Models

Large reasoning models are not only designed to understand language but are also structured to think through multi-step processes that require prolonged attention spans and contextual comprehension. As the expectations from AI grow, especially in real-world and software development environments, researchers have sought architectures that can handle longer inputs and sustain deep, coherent reasoning chains without overwhelming computational costs.

Computational Constraints with Traditional Transformers

The primary difficulty in expanding these reasoning capabilities lies in the excessive computational load that comes with longer generation lengths. Traditional transformer-based models employ a softmax attention mechanism, which scales quadratically with the input size. This limits their capacity to handle long input sequences or extended chains of thought efficiently. This problem becomes even more pressing in areas that require real-time interaction or cost-sensitive applications, where inference expenses are significant.

Existing Alternatives and Their Limitations

Efforts to address this issue have yielded a range of methods, including sparse attention and linear attention variants. Some teams have experimented with state-space models and recurrent networks as alternatives to traditional attention structures. However, these innovations have seen limited adoption in the most competitive reasoning models due to either architectural complexity or a lack of scalability in real-world deployments. Even large-scale systems, such as Tencent’s Hunyuan-T1, which utilizes a novel Mamba architecture, remain closed-source, thereby restricting wider research engagement and validation.

Introduction of MiniMax-M1: A Scalable Open-Weight Model

Researchers at MiniMax AI introduced MiniMax-M1, a new open-weight, large-scale reasoning model that combines a mixture of experts’ architecture with lightning-fast attention. Built as an evolution of the MiniMax-Text-01 model, MiniMax-M1 contains 456 billion parameters, with 45.9 billion activated per token. It supports context lengths of up to 1 million tokens—eight times the capacity of DeepSeek R1. This model addresses compute scalability at inference time, consuming only 25% of the FLOPs required by DeepSeek R1 at 100,000 token generation length. It was trained using large-scale reinforcement learning on a broad range of tasks, from mathematics and coding to software engineering, marking a shift toward practical, long-context AI models.

Hybrid-Attention with Lightning Attention and Softmax Blocks

To optimize this architecture, MiniMax-M1 employs a hybrid attention scheme where every seventh transformer block uses traditional softmax attention, followed by six blocks using lightning attention. This significantly reduces computational complexity while preserving performance. The lightning attention itself is I/O-aware, adapted from linear attention, and is particularly effective at scaling reasoning lengths to hundreds of thousands of tokens. For reinforcement learning efficiency, the researchers introduced a novel algorithm called CISPO. Instead of clipping token updates as traditional methods do, CISPO clips importance sampling weights, enabling stable training and consistent token contributions, even in off-policy updates.
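As a rough illustration of that block schedule (the layer names, count, and representation are placeholders, not MiniMax's implementation), the pattern can be sketched as follows:

def build_hybrid_stack(num_layers: int, softmax_every: int = 7):
    # Every seventh transformer block uses full softmax attention; the other
    # six use the I/O-aware, linear-complexity lightning attention.
    layers = []
    for i in range(num_layers):
        if (i + 1) % softmax_every == 0:
            layers.append("softmax_attention_block")
        else:
            layers.append("lightning_attention_block")
    return layers

# For example, build_hybrid_stack(14) yields twelve lightning-attention blocks
# with softmax-attention blocks at positions 7 and 14.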

The CISPO Algorithm and RL Training Efficiency

The CISPO algorithm proved essential in overcoming the training instability faced in hybrid architectures. In comparative studies using the Qwen2.5-32B baseline, CISPO achieved a 2x speedup compared to DAPO. Leveraging this, the full reinforcement learning cycle for MiniMax-M1 was completed in just three weeks using 512 H800 GPUs, with a rental cost of approximately $534,700. The model was trained on a diverse dataset comprising 41 logic tasks generated via the SynLogic framework and real-world software engineering environments derived from the SWE bench. These environments utilized execution-based rewards to guide performance, resulting in stronger outcomes in practical coding tasks.
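To make the distinction concrete, here is a schematic sketch of the per-token weighting idea; the clip bounds and function shape are illustrative assumptions, not the published CISPO objective.

import math

def cispo_token_weight(logp_new: float, logp_old: float, advantage: float,
                       clip_low: float = 0.0, clip_high: float = 2.0) -> float:
    # Importance sampling ratio between the current policy and the policy
    # that generated the rollout.
    ratio = math.exp(logp_new - logp_old)
    # Clip the importance weight itself (bounds here are illustrative) instead of
    # discarding the token's update as PPO-style clipping effectively does,
    # so every token keeps contributing to the gradient.
    clipped_ratio = max(clip_low, min(ratio, clip_high))
    return clipped_ratio * advantage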

Benchmark Results and Comparative Performance

MiniMax-M1 delivered compelling benchmark results. Compared to DeepSeek-R1 and Qwen3-235B, it excelled in software engineering, long-context processing, and agentic tool use. Although it trailed the latest DeepSeek-R1-0528 in math and coding contests, it surpassed both OpenAI o3 and Claude 4 Opus in long-context understanding benchmarks. Furthermore, it outperformed Gemini 2.5 Pro in the TAU-Bench agent tool use evaluation.

Conclusion: A Scalable and Transparent Model for Long-Context AI

MiniMax-M1 presents a significant step forward by offering both transparency and scalability. By addressing the dual challenge of inference efficiency and training complexity, the research team at MiniMax AI has set a precedent for open-weight reasoning models. This work not only brings a solution to compute constraints but also introduces practical methods for scaling language model intelligence into real-world applications.

Check out the Paper, Model and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post MiniMax AI Releases MiniMax-M1: A 456B Parameter Hybrid Model for Long-Context and Reinforcement Learning RL Tasks appeared first on MarkTechPost.

OpenAI Releases an Open‑Sourced Version of a Customer Service Agent …

OpenAI has open-sourced a new multi-agent customer service demo on GitHub, showcasing how to build domain-specialized AI agents using its Agents SDK. This project—titled openai-cs-agents-demo—models an airline customer service chatbot capable of handling a range of travel-related queries by dynamically routing requests to specialized agents. Built with a Python backend and a Next.js frontend, the system provides both a functional conversational interface and a visual trace of agent handoffs and guardrail activations.

The architecture is divided into two main components. The Python backend handles agent orchestration using the Agents SDK, while the Next.js frontend offers a chat interface and an interactive visualization of agent transitions. This setup provides transparency into the decision-making and delegation process as agents triage, respond to, or reject user queries. The demo operates with several focused agents: a Triage Agent, Seat Booking Agent, Flight Status Agent, Cancellation Agent, and an FAQ Agent. Each of these is configured with specialized instructions and tools to fulfill their specific sub-tasks.

When a user enters a request—such as “change my seat” or “cancel my flight”—the Triage Agent processes the input to determine intent and dispatches the query to the appropriate downstream agent. For example, a booking change request will be routed to the Seat Booking Agent, which can verify confirmation numbers, offer seat map choices, and finalize seat changes. If a cancellation is requested, the system hands off to the Cancellation Agent, which follows a structured flow to confirm and execute the cancellation. The demo also includes a Flight Status Agent for real-time flight inquiries and an FAQ Agent that answers general questions about baggage policies or aircraft types.

A key strength of the system lies in its integration of guardrails for safety and relevance. The demo features two: a Relevance Guardrail and a Jailbreak Guardrail. The Relevance Guardrail filters out off-topic queries—for example, rejecting prompts like “write me a poem about strawberries.” The Jailbreak Guardrail blocks attempts to circumvent system boundaries or manipulate agent behavior, such as asking the model to reveal its internal instructions. When either guardrail is triggered, the system highlights it in the trace and sends a structured error message to the user.

The Agents SDK itself serves as the orchestration backbone. Each agent is defined as a composable unit with prompt templates, tool access, handoff logic, and output schemas. The SDK handles chaining agents via “handoffs,” supports real-time tracing, and allows developers to enforce input/output constraints with guardrails. This framework is the same one powering OpenAI’s internal experiments with tool-using and reasoning agents, but now exposed in an educational and extendable format.
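The wiring roughly follows this pattern; the agents, instructions, and sample input below are a condensed sketch using the Agents SDK's Python interface, not the demo's actual definitions.

from agents import Agent, Runner

faq_agent = Agent(
    name="FAQ Agent",
    instructions="Answer general airline questions about baggage policies and aircraft types.",
)
seat_booking_agent = Agent(
    name="Seat Booking Agent",
    instructions="Verify the confirmation number, offer seat choices, and finalize seat changes.",
)
triage_agent = Agent(
    name="Triage Agent",
    instructions="Classify the customer's request and hand off to the appropriate specialist agent.",
    handoffs=[faq_agent, seat_booking_agent],
)

# The triage agent inspects the request and hands off to the seat booking agent.
result = Runner.run_sync(triage_agent, "I'd like to change my seat on flight 123.")
print(result.final_output)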

Developers can run the demo locally by starting the Python backend server with Uvicorn and launching the frontend with a single npm run dev command. The entire system is configurable—developers can plug in new agents, define their own task routing strategies, and implement custom guardrails. With full transparency into prompts, decisions, and trace logs, the demo offers a practical foundation for real-world conversational AI systems in customer support or other enterprise domains.

By releasing this reference implementation, OpenAI provides a tangible example of how multi-agent coordination, tool use, and safety checks can be combined into a robust service experience. It’s particularly valuable for developers seeking to understand the anatomy of agentic systems—and how to build modular, controllable AI workflows that are both transparent and production-ready.

Check out the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post OpenAI Releases an Open‑Sourced Version of a Customer Service Agent Demo with the Agents SDK appeared first on MarkTechPost.

Build a scalable AI video generator using Amazon SageMaker AI and CogV …

In recent years, the rapid advancement of artificial intelligence and machine learning (AI/ML) technologies has revolutionized various aspects of digital content creation. One particularly exciting development is the emergence of video generation capabilities, which offer unprecedented opportunities for companies across diverse industries. This technology allows for the creation of short video clips that can be seamlessly combined to produce longer, more complex videos. The potential applications of this innovation are vast and far-reaching, promising to transform how businesses communicate, market, and engage with their audiences.
Video generation technology presents a myriad of use cases for companies looking to enhance their visual content strategies. For instance, ecommerce businesses can use this technology to create dynamic product demonstrations, showcasing items from multiple angles and in various contexts without the need for extensive physical photoshoots. In the realm of education and training, organizations can generate instructional videos tailored to specific learning objectives, quickly updating content as needed without re-filming entire sequences. Marketing teams can craft personalized video advertisements at scale, targeting different demographics with customized messaging and visuals. Furthermore, the entertainment industry stands to benefit greatly, with the ability to rapidly prototype scenes, visualize concepts, and even assist in the creation of animated content.
The flexibility offered by combining these generated clips into longer videos opens up even more possibilities. Companies can create modular content that can be quickly rearranged and repurposed for different displays, audiences, or campaigns. This adaptability not only saves time and resources, but also allows for more agile and responsive content strategies. As we delve deeper into the potential of video generation technology, it becomes clear that its value extends far beyond mere convenience, offering a transformative tool that can drive innovation, efficiency, and engagement across the corporate landscape.
In this post, we explore how to implement a robust AWS-based solution for video generation that uses the CogVideoX model and Amazon SageMaker AI.
Solution overview
Our architecture delivers a highly scalable and secure video generation solution using AWS managed services. The data management layer implements three purpose-specific Amazon Simple Storage Service (Amazon S3) buckets—for input videos, processed outputs, and access logging—each configured with appropriate encryption and lifecycle policies to support data security throughout its lifecycle.
For compute resources, we use AWS Fargate for Amazon Elastic Container Service (Amazon ECS) to host the Streamlit web application, providing serverless container management with automatic scaling capabilities. Traffic is efficiently distributed through an Application Load Balancer. The AI processing pipeline uses SageMaker AI processing jobs to handle video generation tasks, decoupling intensive computation from the web interface for cost optimization and enhanced maintainability. User prompts are refined through Amazon Bedrock, which feeds into the CogVideoX-5b model for high-quality video generation, creating an end-to-end solution that balances performance, security, and cost-efficiency.
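For illustration, a video generation job could be launched from the backend with the SageMaker Python SDK roughly as follows; the container image, role ARN, script name, and bucket below are placeholders, not the solution's actual values.

from sagemaker.processing import ScriptProcessor, ProcessingOutput

# Illustrative only: the image URI, role ARN, script, and bucket are placeholders.
video_processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.us-east-1.amazonaws.com/cogvideox-inference:latest",
    command=["python3"],
    role="arn:aws:iam::<account>:role/SageMakerExecutionRole",
    instance_type="ml.g5.4xlarge",
    instance_count=1,
)

video_processor.run(
    code="generate_video.py",
    arguments=["--prompt", "A bee on a flower"],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://<output-bucket>/videos/")],
)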
The following diagram illustrates the solution architecture.

CogVideoX model
CogVideoX is an open source, state-of-the-art text-to-video generation model capable of producing 10-second continuous videos at 16 frames per second with a resolution of 768×1360 pixels. The model effectively translates text prompts into coherent video narratives, addressing common limitations in previous video generation systems.
The model uses three key innovations:

A 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions, improving compression efficiency and video quality
An expert transformer with adaptive LayerNorm that enhances text-to-video alignment through deeper fusion between modalities
Progressive training and multi-resolution frame pack techniques that enable the creation of longer, coherent videos with significant motion elements

CogVideoX also benefits from an effective text-to-video data processing pipeline with various preprocessing strategies and a specialized video captioning method, contributing to higher generation quality and better semantic alignment. The model’s weights are publicly available, making it accessible for implementation in various business applications, such as product demonstrations and marketing content. The following diagram shows the architecture of the model.
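Because the weights are public, the model can also be loaded directly with the Hugging Face diffusers library; the following minimal sketch (generation parameters are illustrative) shows the basic text-to-video call:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")

# Generate a short clip from a text prompt; step count and guidance are illustrative.
frames = pipe(
    prompt="A bee on a flower",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "bee_on_flower.mp4", fps=16)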

Prompt enhancement
To improve the quality of video generation, the solution provides an option to enhance user-provided prompts. This is done by instructing a large language model (LLM), in this case Anthropic’s Claude, to take a user’s initial prompt and expand upon it with additional details, creating a more comprehensive description for video creation. The prompt consists of three parts:

Role section – Defines the AI’s purpose in enhancing prompts for video generation
Task section – Specifies the instructions needed to be performed with the original prompt
Prompt section – Where the user’s original input is inserted

By adding more descriptive elements to the original prompt, this system aims to provide richer, more detailed instructions to video generation models, potentially resulting in more accurate and visually appealing video outputs. We use the following prompt template for this solution:
"""
<Role>
Your role is to enhance the user prompt that is given to you by
providing additional details to the prompt. The end goal is to
convert the user prompt into a short video clip, so it is necessary
to provide as much information as you can.
</Role>
<Task>
You must add details to the user prompt in order to enhance it for
video generation. You must provide a 1 paragraph response. No
more and no less. Only include the enhanced prompt in your response.
Do not include anything else.
</Task>
<Prompt>
{prompt}
</Prompt>
"""
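A rough sketch of how the solution might call Anthropic's Claude through the Amazon Bedrock Converse API with this template is shown below; it assumes the template above is stored in a Python string named PROMPT_TEMPLATE, and the model ID and Region are illustrative.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def enhance_prompt(user_prompt: str) -> str:
    # Insert the user's original prompt into the template and ask Claude,
    # via the Bedrock Converse API, to expand it into a richer description.
    body = PROMPT_TEMPLATE.format(prompt=user_prompt)
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": body}]}],
    )
    return response["output"]["message"]["content"][0]["text"]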
Prerequisites
Before you deploy the solution, make sure you have the following prerequisites:

The AWS CDK Toolkit – Install the AWS CDK Toolkit globally using npm: npm install -g aws-cdk This provides the core functionality for deploying infrastructure as code to AWS.
Docker Desktop – This is required for local development and testing. It makes sure container images can be built and tested locally before deployment.
The AWS CLI – The AWS Command Line Interface (AWS CLI) must be installed and configured with appropriate credentials. This requires an AWS account with necessary permissions. Configure the AWS CLI using aws configure with your access key and secret.
Python Environment – You must have Python 3.11+ installed on your system. We recommend using a virtual environment for isolation. This is required for both the AWS CDK infrastructure and Streamlit application.
Active AWS account – You will need to raise a service quota request for SageMaker ml.g5.4xlarge instances for processing jobs.

Deploy the solution
This solution has been tested in the us-east-1 AWS Region. Complete the following steps to deploy:

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install infrastructure dependencies:

cd infrastructure
pip install -r requirements.txt

Bootstrap the AWS CDK (if not already done in your AWS account):

cdk bootstrap

Deploy the infrastructure:

cdk deploy -c allowed_ips='["'$(curl -s ifconfig.me)'/32"]'
To access the Streamlit UI, choose the link for StreamlitURL in the AWS CDK output logs after deployment is successful. The following screenshot shows the Streamlit UI accessible through the URL.

Basic video generation
Complete the following steps to generate a video:

Input your natural language prompt into the text box at the top of the page.
Copy this prompt to the text box at the bottom.
Choose Generate Video to create a video using this basic prompt.

The following is the output from the simple prompt “A bee on a flower.”

Enhanced video generation
For higher-quality results, complete the following steps:

Enter your initial prompt in the top text box.
Choose Enhance Prompt to send your prompt to Amazon Bedrock.
Wait for Amazon Bedrock to expand your prompt into a more descriptive version.
Review the enhanced prompt that appears in the lower text box.
Edit the prompt further if desired.
Choose Generate Video to initiate the processing job with CogVideoX.

When processing is complete, your video will appear on the page with a download option. The following is an example of an enhanced prompt and output:
“””
A vibrant yellow and black honeybee gracefully lands on a large,
blooming sunflower in a lush garden on a warm summer day. The
bee’s fuzzy body and delicate wings are clearly visible as it
moves methodically across the flower’s golden petals, collecting
pollen. Sunlight filters through the petals, creating a soft,
warm glow around the scene. The bee’s legs are coated in pollen
as it works diligently, its antennae twitching occasionally. In
the background, other colorful flowers sway gently in a light
breeze, while the soft buzzing of nearby bees can be heard
“””

Add an image to your prompt
If you want to include an image with your text prompt, complete the following steps:

Complete the text prompt and optional enhancement steps.
Choose Include an Image.
Upload the photo you want to use.
With both text and image now prepared, choose Generate Video to start the processing job.

The following is an example of the previous enhanced prompt with an included image.

To view more samples, check out the CogVideoX gallery.
Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this post:
cdk destroy
Considerations
Although our current architecture serves as an effective proof of concept, several enhancements are recommended for a production environment. Considerations include implementing an API Gateway with AWS Lambda backed REST endpoints for improved interface and authentication, introducing a queue-based architecture using Amazon Simple Queue Service (Amazon SQS) for better job management and reliability, and enhancing error handling and monitoring capabilities.
Conclusion
Video generation technology has emerged as a transformative force in digital content creation, as demonstrated by our comprehensive AWS-based solution using the CogVideoX model. By combining powerful AWS services like Fargate, SageMaker, and Amazon Bedrock with an innovative prompt enhancement system, we’ve created a scalable and secure pipeline capable of producing high-quality video clips. The architecture’s ability to handle both text-to-video and image-to-video generation, coupled with its user-friendly Streamlit interface, makes it an invaluable tool for businesses across sectors—from ecommerce product demonstrations to personalized marketing campaigns. As showcased in our sample videos, the technology delivers impressive results that open new avenues for creative expression and efficient content production at scale. This solution represents not just a technological advancement, but a glimpse into the future of visual storytelling and digital communication.
To learn more about CogVideoX, refer to CogVideoX on Hugging Face. Try out the solution for yourself, and share your feedback in the comments.

About the Authors
Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.
Natasha Tchir is a Cloud Consultant at the Generative AI Innovation Center, specializing in machine learning. With a strong background in ML, she now focuses on the development of generative AI proof-of-concept solutions, driving innovation and applied research within the GenAIIC.
Katherine Feng is a Cloud Consultant at AWS Professional Services within the Data and ML team. She has extensive experience building full-stack applications for AI/ML use cases and LLM-driven solutions.
Jinzhao Feng is a Machine Learning Engineer at AWS Professional Services. He focuses on architecting and implementing large-scale generative AI and classic ML pipeline solutions. He is specialized in FMOps, LLMOps, and distributed training.