Building Production-Ready Custom AI Agents for Enterprise Workflows with Monitoring, Orchestration, and Scalability

In this tutorial, we walk you through the design and implementation of a custom agent framework built on PyTorch and key Python tooling, ranging from web intelligence and data science modules to advanced code generators. We’ll learn how to wrap core functionalities in monitored CustomTool classes, orchestrate multiple agents with tailored system prompts, and define end-to-end workflows that automate tasks like competitive website analysis and data-processing pipelines. Along the way, we demonstrate real-world examples, complete with retry logic, logging, and performance metrics, so you can confidently deploy and scale these agents within your organization’s existing infrastructure.

!pip install -q torch transformers datasets pillow requests beautifulsoup4 pandas numpy scikit-learn openai

import os, json, asyncio, threading, time
import torch, pandas as pd, numpy as np
from PIL import Image
import requests
from io import BytesIO, StringIO
from concurrent.futures import ThreadPoolExecutor
from functools import wraps, lru_cache
from typing import Dict, List, Optional, Any, Callable, Union
import logging
from dataclasses import dataclass
import inspect

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_TIMEOUT = 15
MAX_RETRIES = 3

We begin by installing and importing all the core libraries, including PyTorch and Transformers, as well as data handling libraries such as pandas and NumPy, and utilities like BeautifulSoup for web scraping and scikit-learn for machine learning. We configure a standardized logging setup to capture information and error messages, and define global constants for API timeouts and retry limits, ensuring our tools behave predictably in production.
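As a quick illustration of how these constants could be used, here is a minimal retry helper built around MAX_RETRIES, API_TIMEOUT, and the logger. It is a sketch rather than part of the tutorial's code, and fetch_status is a hypothetical function added only for the example.

def with_retries(func):
    """Retry a flaky callable up to MAX_RETRIES times with exponential backoff (illustrative sketch)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(MAX_RETRIES):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                logger.warning(f"{func.__name__} attempt {attempt + 1}/{MAX_RETRIES} failed: {e}")
                if attempt == MAX_RETRIES - 1:
                    raise
                time.sleep(2 ** attempt)  # back off between attempts
    return wrapper

@with_retries
def fetch_status(url: str) -> int:
    # Hypothetical helper: returns the HTTP status code for a URL.
    return requests.get(url, timeout=API_TIMEOUT).status_code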

@dataclass
class ToolResult:
    """Standardized tool result structure"""
    success: bool
    data: Any
    error: Optional[str] = None
    execution_time: float = 0.0
    metadata: Optional[Dict[str, Any]] = None

class CustomTool:
    """Base class for custom tools"""
    def __init__(self, name: str, description: str, func: Callable):
        self.name = name
        self.description = description
        self.func = func
        self.calls = 0
        self.avg_execution_time = 0.0
        self.error_rate = 0.0

    def execute(self, *args, **kwargs) -> ToolResult:
        """Execute tool with monitoring"""
        start_time = time.time()
        self.calls += 1

        try:
            result = self.func(*args, **kwargs)
            execution_time = time.time() - start_time

            # Update the running average and keep the error rate current on successful calls too.
            self.avg_execution_time = ((self.avg_execution_time * (self.calls - 1)) + execution_time) / self.calls
            self.error_rate = (self.error_rate * (self.calls - 1)) / self.calls

            return ToolResult(
                success=True,
                data=result,
                execution_time=execution_time,
                metadata={'tool_name': self.name, 'call_count': self.calls}
            )
        except Exception as e:
            execution_time = time.time() - start_time
            self.error_rate = (self.error_rate * (self.calls - 1) + 1) / self.calls

            logger.error(f"Tool {self.name} failed: {str(e)}")
            return ToolResult(
                success=False,
                data=None,
                error=str(e),
                execution_time=execution_time,
                metadata={'tool_name': self.name, 'call_count': self.calls}
            )

We define a ToolResult dataclass to encapsulate every execution’s outcome, whether it succeeded, how long it took, any returned data, and error details if it failed. Our CustomTool base class then wraps individual functions with a unified execute method that tracks call counts, measures execution time, computes an average runtime, and logs any errors. By standardizing tool results and performance metrics this way, we ensure consistency and observability across all our custom utilities.
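As a quick sanity check, here is how we might wrap an ordinary function in a CustomTool and inspect the resulting ToolResult. The word_count tool is only an illustration, not one of the tutorial's tools.

def word_count(text: str) -> int:
    return len(text.split())

wc_tool = CustomTool(name="word_count", description="Counts words in a string", func=word_count)
res = wc_tool.execute("production ready agents")
print(res.success, res.data, res.metadata)
# e.g. True 3 {'tool_name': 'word_count', 'call_count': 1}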

class CustomAgent:
    """Custom agent implementation with tool management"""
    def __init__(self, name: str, system_prompt: str = "", max_iterations: int = 5):
        self.name = name
        self.system_prompt = system_prompt
        self.max_iterations = max_iterations
        self.tools = {}
        self.conversation_history = []
        self.performance_metrics = {}

    def add_tool(self, tool: CustomTool):
        """Add a tool to the agent"""
        self.tools[tool.name] = tool

    def run(self, task: str) -> Dict[str, Any]:
        """Execute a task using available tools"""
        logger.info(f"Agent {self.name} executing task: {task}")

        task_lower = task.lower()
        results = []

        # Route on keywords, but only into branches whose tool this agent actually has.
        if any(keyword in task_lower for keyword in ['analyze', 'website', 'url', 'web']) and 'advanced_web_intelligence' in self.tools:
            import re
            urls = re.findall(r'https?://\S+', task)
            if urls:
                result = self.tools['advanced_web_intelligence'].execute(urls[0])
                results.append(result)

        elif any(keyword in task_lower for keyword in ['data', 'analyze', 'stats', 'csv']) and 'advanced_data_science_toolkit' in self.tools:
            if 'name,age,salary' in task:
                data_start = task.find('name,age,salary')
                data_part = task[data_start:]
                result = self.tools['advanced_data_science_toolkit'].execute(data_part, 'stats')
                results.append(result)

        elif any(keyword in task_lower for keyword in ['generate', 'code', 'api', 'client']) and 'advanced_code_generator' in self.tools:
            result = self.tools['advanced_code_generator'].execute(task)
            results.append(result)

        return {
            'agent': self.name,
            'task': task,
            'results': [r.data if r.success else {'error': r.error} for r in results],
            'execution_summary': {
                'tools_used': len(results),
                'success_rate': sum(1 for r in results if r.success) / len(results) if results else 0,
                'total_time': sum(r.execution_time for r in results)
            }
        }

We encapsulate our AI logic in a CustomAgent class that holds a set of tools, a system prompt, and execution history, then routes each incoming task to the right tool based on simple keyword matching. In the run() method, we log the task, select the appropriate tool (web intelligence, data analysis, or code generation), execute it, and aggregate the results into a standardized response that includes success rates and timing metrics. This design lets us extend agents easily by adding new tools while keeping the orchestration transparent and measurable.
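For instance, a minimal round trip through an agent could look like the sketch below. The placeholder fake_codegen function only stands in for the advanced_code_generator tool defined later in the tutorial.

def fake_codegen(task: str) -> dict:
    # Placeholder standing in for advanced_code_generator, defined later in this tutorial.
    return {"code": "# TODO", "language": "python"}

demo_agent = CustomAgent(name="demo", system_prompt="You are a demo agent.")
demo_agent.add_tool(CustomTool("advanced_code_generator", "Generates code", fake_codegen))
summary = demo_agent.run("generate an api client")
print(summary["execution_summary"])  # e.g. {'tools_used': 1, 'success_rate': 1.0, 'total_time': ...}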

print("Building Advanced Tool Architecture")

def performance_monitor(func):
    """Decorator for monitoring tool performance"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            execution_time = time.time() - start_time
            logger.info(f"{func.__name__} executed in {execution_time:.2f}s")
            return result
        except Exception as e:
            logger.error(f"{func.__name__} failed: {str(e)}")
            raise
    return wrapper

@performance_monitor
def advanced_web_intelligence(url: str, analysis_type: str = "comprehensive") -> Dict[str, Any]:
    """
    Advanced web intelligence gathering with multiple analysis modes.

    Args:
        url: Target URL for analysis
        analysis_type: Type of analysis (comprehensive, sentiment)

    Returns:
        Dict containing structured analysis results
    """
    try:
        response = requests.get(url, timeout=API_TIMEOUT, headers={
            'User-Agent': 'Mozilla/5.0'
        })

        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        title = soup.find('title').text if soup.find('title') else 'No title'
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        meta_desc = meta_desc.get('content') if meta_desc else 'No description'

        if analysis_type == "comprehensive":
            return {
                'title': title,
                'description': meta_desc,
                'word_count': len(soup.get_text().split()),
                'image_count': len(soup.find_all('img')),
                'link_count': len(soup.find_all('a')),
                'headers': [h.text.strip() for h in soup.find_all(['h1', 'h2', 'h3'])[:5]],
                'status_code': response.status_code,
                'content_type': response.headers.get('content-type', 'unknown'),
                'page_size': len(response.content)
            }
        elif analysis_type == "sentiment":
            text = soup.get_text()[:2000]
            positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic']
            negative_words = ['bad', 'terrible', 'awful', 'horrible', 'disappointing']

            pos_count = sum(text.lower().count(word) for word in positive_words)
            neg_count = sum(text.lower().count(word) for word in negative_words)

            return {
                'sentiment_score': pos_count - neg_count,
                'positive_indicators': pos_count,
                'negative_indicators': neg_count,
                'text_sample': text[:200],
                'analysis_type': 'sentiment'
            }
        else:
            return {'error': f"Unsupported analysis_type: {analysis_type}"}

    except Exception as e:
        return {'error': f"Analysis failed: {str(e)}"}

@performance_monitor
def advanced_data_science_toolkit(data: str, operation: str) -> Dict[str, Any]:
    """
    Comprehensive data science operations with statistical analysis.

    Args:
        data: CSV-like string or JSON data
        operation: Type of analysis (stats, clustering)

    Returns:
        Dict with analysis results
    """
    try:
        if data.startswith('{') or data.startswith('['):
            parsed_data = json.loads(data)
            df = pd.DataFrame(parsed_data)
        else:
            df = pd.read_csv(StringIO(data))

        if operation == "stats":
            numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()

            result = {
                'shape': df.shape,
                'columns': df.columns.tolist(),
                'dtypes': {col: str(dtype) for col, dtype in df.dtypes.items()},
                'missing_values': df.isnull().sum().to_dict(),
                'numeric_columns': numeric_columns
            }

            if len(numeric_columns) > 0:
                result['summary_stats'] = df[numeric_columns].describe().to_dict()
                if len(numeric_columns) > 1:
                    result['correlation_matrix'] = df[numeric_columns].corr().to_dict()

            return result

        elif operation == "clustering":
            from sklearn.cluster import KMeans
            from sklearn.preprocessing import StandardScaler

            numeric_df = df.select_dtypes(include=[np.number])
            if numeric_df.shape[1] < 2:
                return {'error': 'Need at least 2 numeric columns for clustering'}

            scaler = StandardScaler()
            scaled_data = scaler.fit_transform(numeric_df.fillna(0))

            n_clusters = min(3, max(2, len(numeric_df) // 2))
            kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
            clusters = kmeans.fit_predict(scaled_data)

            return {
                'n_clusters': n_clusters,
                'cluster_centers': kmeans.cluster_centers_.tolist(),
                'cluster_labels': clusters.tolist(),
                'inertia': float(kmeans.inertia_),
                'feature_names': numeric_df.columns.tolist()
            }
        else:
            return {'error': f"Unsupported operation: {operation}"}

    except Exception as e:
        return {'error': f"Data analysis failed: {str(e)}"}

@performance_monitor
def advanced_code_generator(task_description: str, language: str = "python") -> Dict[str, str]:
    """
    Advanced code generation with multiple language support and optimization.

    Args:
        task_description: Description of coding task
        language: Target programming language

    Returns:
        Dict with generated code and metadata
    """
    templates = {
        'python': {
            'api_client': '''
import requests
import json
import time
from typing import Dict, Any, Optional

class APIClient:
    """Production-ready API client with retry logic and error handling"""

    def __init__(self, base_url: str, api_key: Optional[str] = None, timeout: int = 30):
        self.base_url = base_url.rstrip('/')
        self.timeout = timeout
        self.session = requests.Session()

        if api_key:
            self.session.headers.update({'Authorization': f'Bearer {api_key}'})

        self.session.headers.update({
            'Content-Type': 'application/json',
            'User-Agent': 'CustomAPIClient/1.0'
        })

    def _make_request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """Make HTTP request with retry logic"""
        url = f'{self.base_url}/{endpoint.lstrip("/")}'

        for attempt in range(3):
            try:
                response = self.session.request(method, url, timeout=self.timeout, **kwargs)
                response.raise_for_status()
                return response.json() if response.content else {}
            except requests.exceptions.RequestException as e:
                if attempt == 2:  # Last attempt
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff

    def get(self, endpoint: str, params: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('GET', endpoint, params=params)

    def post(self, endpoint: str, data: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('POST', endpoint, json=data)

    def put(self, endpoint: str, data: Optional[Dict] = None) -> Dict[str, Any]:
        return self._make_request('PUT', endpoint, json=data)

    def delete(self, endpoint: str) -> Dict[str, Any]:
        return self._make_request('DELETE', endpoint)
''',
            'data_processor': '''
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
import logging

logger = logging.getLogger(__name__)

class DataProcessor:
    """Advanced data processor with comprehensive cleaning and analysis"""

    def __init__(self, data: pd.DataFrame):
        self.original_data = data.copy()
        self.processed_data = data.copy()
        self.processing_log = []

    def clean_data(self, strategy: str = 'auto') -> 'DataProcessor':
        """Clean data with configurable strategies"""
        initial_shape = self.processed_data.shape

        # Remove duplicates
        self.processed_data = self.processed_data.drop_duplicates()

        # Handle missing values based on strategy
        if strategy == 'auto':
            # For numeric columns, use mean
            numeric_cols = self.processed_data.select_dtypes(include=[np.number]).columns
            self.processed_data[numeric_cols] = self.processed_data[numeric_cols].fillna(
                self.processed_data[numeric_cols].mean()
            )

            # For categorical columns, use mode
            categorical_cols = self.processed_data.select_dtypes(include=['object']).columns
            for col in categorical_cols:
                mode_value = self.processed_data[col].mode()
                if len(mode_value) > 0:
                    self.processed_data[col] = self.processed_data[col].fillna(mode_value[0])

        final_shape = self.processed_data.shape
        self.processing_log.append(f"Cleaned data: {initial_shape} -> {final_shape}")
        return self

    def normalize(self, method: str = 'minmax', columns: Optional[List[str]] = None) -> 'DataProcessor':
        """Normalize numerical columns"""
        cols = columns or self.processed_data.select_dtypes(include=[np.number]).columns.tolist()

        if method == 'minmax':
            # Min-max normalization
            for col in cols:
                col_min, col_max = self.processed_data[col].min(), self.processed_data[col].max()
                if col_max != col_min:
                    self.processed_data[col] = (self.processed_data[col] - col_min) / (col_max - col_min)
        elif method == 'zscore':
            # Z-score normalization
            for col in cols:
                mean_val, std_val = self.processed_data[col].mean(), self.processed_data[col].std()
                if std_val != 0:
                    self.processed_data[col] = (self.processed_data[col] - mean_val) / std_val

        self.processing_log.append(f"Normalized columns {cols} using {method}")
        return self

    def get_insights(self) -> Dict[str, Any]:
        """Generate comprehensive data insights"""
        insights = {
            'basic_info': {
                'shape': self.processed_data.shape,
                'columns': self.processed_data.columns.tolist(),
                'dtypes': {col: str(dtype) for col, dtype in self.processed_data.dtypes.items()}
            },
            'data_quality': {
                'missing_values': self.processed_data.isnull().sum().to_dict(),
                'duplicate_rows': self.processed_data.duplicated().sum(),
                'memory_usage': self.processed_data.memory_usage(deep=True).to_dict()
            },
            'processing_log': self.processing_log
        }

        # Add statistical summary for numeric columns
        numeric_data = self.processed_data.select_dtypes(include=[np.number])
        if len(numeric_data.columns) > 0:
            insights['statistical_summary'] = numeric_data.describe().to_dict()

        return insights
'''
        }
    }

    task_lower = task_description.lower()
    # Fall back to the Python templates if an unsupported language is requested.
    lang_templates = templates.get(language, templates['python'])

    if any(keyword in task_lower for keyword in ['api', 'client', 'http', 'request']):
        code = lang_templates['api_client']
        description = "Production-ready API client with retry logic and comprehensive error handling"
    elif any(keyword in task_lower for keyword in ['data', 'process', 'clean', 'analyze']):
        code = lang_templates['data_processor']
        description = "Advanced data processor with cleaning, normalization, and insight generation"
    else:
        code = f'''# Generated code template for: {task_description}
# Language: {language}

class CustomSolution:
    """Auto-generated solution template"""

    def __init__(self):
        self.initialized = True

    def execute(self, *args, **kwargs):
        """Main execution method - implement your logic here"""
        return {{"message": "Implement your custom logic here", "task": "{task_description}"}}

# Usage example:
# solution = CustomSolution()
# result = solution.execute()
'''
        description = f"Custom template for {task_description}"

    return {
        'code': code,
        'language': language,
        'description': description,
        'complexity': 'production-ready',
        'estimated_lines': len(code.splitlines()),
        'features': ['error_handling', 'logging', 'type_hints', 'documentation']
    }

We wrap each core function in a @performance_monitor decorator so we can log execution times and catch failures. We then implement three specialized tools: advanced_web_intelligence for comprehensive or sentiment-driven web scraping, advanced_data_science_toolkit for statistical analysis and clustering on CSV or JSON data, and advanced_code_generator for producing production-ready code templates. This keeps performance monitoring and behavior consistent across all our analytics and code-generation utilities.
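Because each tool is an ordinary decorated function, we can also sanity-check one directly before wiring it into an agent. For example, running the stats operation on a tiny two-column CSV should return the keys the function builds above (exact values depend on your environment):

csv_sample = "x,y\n1,10\n2,20\n3,30"
stats = advanced_data_science_toolkit(csv_sample, "stats")
print(sorted(stats.keys()))
# e.g. ['columns', 'correlation_matrix', 'dtypes', 'missing_values', 'numeric_columns', 'shape', 'summary_stats']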

print("Setting up Custom Agent Framework")

class AgentOrchestrator:
    """Manages multiple specialized agents with workflow coordination"""

    def __init__(self):
        self.agents = {}
        self.workflows = {}
        self.results_cache = {}
        self.performance_metrics = {}

    def create_specialist_agent(self, name: str, tools: List[CustomTool], system_prompt: str = None):
        """Create domain-specific agents"""
        agent = CustomAgent(
            name=name,
            system_prompt=system_prompt or f"You are a specialist {name} agent.",
            max_iterations=5
        )

        for tool in tools:
            agent.add_tool(tool)

        self.agents[name] = agent
        return agent

    def execute_workflow(self, workflow_name: str, inputs: Dict) -> Dict:
        """Execute multi-step workflows across agents"""
        if workflow_name not in self.workflows:
            raise ValueError(f"Workflow {workflow_name} not found")

        workflow = self.workflows[workflow_name]
        results = {}
        workflow_start = time.time()

        for step in workflow['steps']:
            agent_name = step['agent']
            task = step['task'].format(**inputs, **results)

            if agent_name in self.agents:
                step_start = time.time()
                result = self.agents[agent_name].run(task)
                step_time = time.time() - step_start

                results[step['output_key']] = result
                results[f"{step['output_key']}_time"] = step_time

        total_time = time.time() - workflow_start

        return {
            'workflow': workflow_name,
            'inputs': inputs,
            'results': results,
            'metadata': {
                'total_execution_time': total_time,
                'steps_completed': len(workflow['steps']),
                'success': True
            }
        }

    def get_system_status(self) -> Dict[str, Any]:
        """Get comprehensive system status"""
        return {
            'agents': {name: {'tools': len(agent.tools)} for name, agent in self.agents.items()},
            'workflows': list(self.workflows.keys()),
            'cache_size': len(self.results_cache),
            'total_tools': sum(len(agent.tools) for agent in self.agents.values())
        }

orchestrator = AgentOrchestrator()

web_tool = CustomTool(
    name="advanced_web_intelligence",
    description="Advanced web analysis and intelligence gathering",
    func=advanced_web_intelligence
)

data_tool = CustomTool(
    name="advanced_data_science_toolkit",
    description="Comprehensive data science and statistical analysis",
    func=advanced_data_science_toolkit
)

code_tool = CustomTool(
    name="advanced_code_generator",
    description="Advanced code generation and architecture",
    func=advanced_code_generator
)

web_agent = orchestrator.create_specialist_agent(
    "web_analyst",
    [web_tool],
    "You are a web analysis specialist. Provide comprehensive website analysis and insights."
)

data_agent = orchestrator.create_specialist_agent(
    "data_scientist",
    [data_tool],
    "You are a data science expert. Perform statistical analysis and machine learning tasks."
)

code_agent = orchestrator.create_specialist_agent(
    "code_architect",
    [code_tool],
    "You are a senior software architect. Generate optimized, production-ready code."
)

We initialize an AgentOrchestrator to manage our suite of AI agents, register each CustomTool implementation for web intelligence, data science, and code generation, and then spin up three domain-specific agents: web_analyst, data_scientist, and code_architect. Each agent is seeded with its respective toolset and a clear system prompt. This setup enables us to coordinate and execute multi-step workflows across specialized expertise areas within a single, unified framework.

print("Defining Advanced Workflows")

orchestrator.workflows['competitive_analysis'] = {
    'steps': [
        {
            'agent': 'web_analyst',
            'task': 'Analyze website {target_url} with comprehensive analysis',
            'output_key': 'website_analysis'
        },
        {
            'agent': 'code_architect',
            'task': 'Generate monitoring code for website analysis automation',
            'output_key': 'monitoring_code'
        }
    ]
}

orchestrator.workflows['data_pipeline'] = {
    'steps': [
        {
            'agent': 'data_scientist',
            'task': 'Analyze the following CSV data with stats operation: {data_input}',
            'output_key': 'data_analysis'
        },
        {
            'agent': 'code_architect',
            'task': 'Generate data processing pipeline code',
            'output_key': 'pipeline_code'
        }
    ]
}

We define two key multi-agent workflows: competitive_analysis, in which our web analyst scrapes and analyzes a target URL before passing insights to our code architect, who generates monitoring scripts; and data_pipeline, in which our data scientist runs statistical analyses on CSV inputs and our code architect then crafts the corresponding ETL pipeline code. These declarative step sequences let us orchestrate complex tasks end to end with minimal boilerplate.
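As a sketch of how the second workflow would be invoked (the CSV here is illustrative; the demo below uses a richer sample, and the exact output depends on the run):

pipeline_inputs = {'data_input': 'name,age,salary\nAlice,25,50000\nBob,30,60000'}
pipeline_result = orchestrator.execute_workflow('data_pipeline', pipeline_inputs)
print(pipeline_result['metadata'])  # total execution time, steps completed, success flag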

print("Running Production Examples")

print("\nAdvanced Web Intelligence Demo")
try:
    web_result = web_agent.run("Analyze https://httpbin.org/html with comprehensive analysis type")
    print(f"Web Analysis Success: {json.dumps(web_result, indent=2, default=str)}")
except Exception as e:
    print(f"Web analysis error: {e}")

print("\nData Science Pipeline Demo")
sample_data = """name,age,salary,department
Alice,25,50000,Engineering
Bob,30,60000,Engineering
Carol,35,70000,Marketing
David,28,55000,Engineering
Eve,32,65000,Marketing"""

try:
    data_result = data_agent.run(f"Analyze this data with stats operation: {sample_data}")
    print(f"Data Analysis Success: {json.dumps(data_result, indent=2, default=str)}")
except Exception as e:
    print(f"Data analysis error: {e}")

print("\nCode Architecture Demo")
try:
    code_result = code_agent.run("Generate an API client for data processing tasks")
    print(f"Code Generation Success: Generated {len(code_result['results'][0]['code'].splitlines())} lines of code")
except Exception as e:
    print(f"Code generation error: {e}")

print("\nMulti-Agent Workflow Demo")
try:
    workflow_inputs = {'target_url': 'https://httpbin.org/html'}
    workflow_result = orchestrator.execute_workflow('competitive_analysis', workflow_inputs)
    print(f"Workflow Success: Completed in {workflow_result['metadata']['total_execution_time']:.2f}s")
except Exception as e:
    print(f"Workflow error: {e}")

We run a suite of production demos to validate each component: first, our web_analyst performs a full-site analysis; next, our data_scientist crunches sample CSV stats; then our code_architect generates an API client; and finally we orchestrate the end-to-end competitive analysis workflow, capturing success indicators, outputs, and execution timing for each step.

print("\nSystem Performance Metrics")

system_status = orchestrator.get_system_status()
print(f"System Status: {json.dumps(system_status, indent=2)}")

print("\nTool Performance:")
for agent_name, agent in orchestrator.agents.items():
    print(f"\n{agent_name}:")
    for tool_name, tool in agent.tools.items():
        print(f"  - {tool_name}: {tool.calls} calls, {tool.avg_execution_time:.3f}s avg, {tool.error_rate:.1%} error rate")

print("\nAdvanced Custom Agent Framework Complete!")
print("Production-ready implementation with full monitoring and error handling!")

We finish by retrieving and printing our orchestrator’s overall system status, listing registered agents, workflows, and cache size, then loop through each agent’s tools to display call counts, average execution times, and error rates. This gives us a real-time view of performance and reliability before we log a final confirmation that our production-ready agent framework is complete.

In conclusion, we now have a blueprint for creating specialized AI agents that perform complex analyses and generate production-quality code, and also self-monitor their execution health and resource usage. The AgentOrchestrator ties everything together, enabling you to coordinate multi-step workflows and capture granular performance insights across agents. Whether you’re automating market research, ETL tasks, or API client generation, this framework provides the extensibility, reliability, and observability required for enterprise-grade AI deployments.

Check out the Codes.

EmbodiedGen: A Scalable 3D World Generator for Realistic Embodied AI Simulations

The Challenge of Scaling 3D Environments in Embodied AI

Creating realistic and accurately scaled 3D environments is essential for training and evaluating embodied AI. However, current methods still rely on manually designed 3D graphics, which are costly and lack realism, thereby limiting scalability and generalization. Unlike internet-scale data used in models like GPT and CLIP, embodied AI data is expensive, context-specific, and difficult to reuse. Reaching general-purpose intelligence in physical settings requires realistic simulations, reinforcement learning, and diverse 3D assets. While recent diffusion models and 3D generation techniques show promise, many still lack key features such as physical accuracy, watertight geometry, and correct scale, making them inadequate for robotic training environments. 

Limitations of Existing 3D Generation Techniques

3D object generation typically follows three main approaches: feedforward generation for fast results, optimization-based methods for high quality, and view reconstruction from multiple images. While recent techniques have improved realism by separating geometry and texture creation, many models still prioritize visual appearance over real-world physics. This makes them less suitable for simulations that require accurate scaling and watertight geometry. For 3D scenes, panoramic techniques have enabled full-view rendering, but they still lack interactivity. Although some tools attempt to enhance simulation environments with generated assets, the quality and diversity remain limited, falling short of complex embodied intelligence research needs. 

Introducing EmbodiedGen: Open-Source, Modular, and Simulation-Ready

EmbodiedGen is an open-source framework developed collaboratively by researchers from Horizon Robotics, the Chinese University of Hong Kong, Shanghai Qi Zhi Institute, and Tsinghua University. It is designed to generate realistic, scalable 3D assets tailored for embodied AI tasks. The platform outputs physically accurate, watertight 3D objects in URDF format, complete with metadata for simulation compatibility. Featuring six modular components, including image-to-3D, text-to-3D, layout generation, and object rearrangement, it enables controllable and efficient scene creation. By bridging the gap between traditional 3D graphics and robotics-ready assets, EmbodiedGen facilitates the scalable and cost-effective development of interactive environments for embodied intelligence research. 

Key Features: Multi-Modal Generation for Rich 3D Content

EmbodiedGen is a versatile toolkit designed to generate realistic and interactive 3D environments tailored for embodied AI tasks. It combines multiple generation modules: transforming images or text into detailed 3D objects, creating articulated items with movable parts, and generating diverse textures to improve visual quality. It also supports full scene construction by arranging these assets in a way that respects real-world physical properties and scale. The output is directly compatible with simulation platforms, making it easier and more affordable to build lifelike virtual worlds. This system helps researchers efficiently simulate real-world scenarios without relying on expensive manual modeling. 

Simulation Integration and Real-World Physical Accuracy

EmbodiedGen is a powerful and accessible platform that enables the generation of diverse, high-quality 3D assets tailored for research in embodied intelligence. It features several key modules that allow users to create assets from images or text, generate articulated and textured objects, and construct realistic scenes. These assets are watertight, photorealistic, and physically accurate, making them ideal for simulation-based training and evaluation in robotics. The platform supports integration with popular simulation environments, including OpenAI Gym, MuJoCo, Isaac Lab, and SAPIEN, enabling researchers to efficiently simulate tasks such as navigation, object manipulation, and obstacle avoidance at a low cost.

RoboSplatter: High-Fidelity 3DGS Rendering for Simulation

A notable feature is RoboSplatter, which brings advanced 3D Gaussian Splatting (3DGS) rendering into physical simulations. Unlike traditional graphics pipelines, RoboSplatter enhances visual fidelity while reducing computational overhead. Through modules like Texture Generation and Real-to-Sim conversion, users can edit the appearance of 3D assets or recreate real-world scenes with high realism. Overall, EmbodiedGen simplifies the creation of scalable, interactive 3D worlds, bridging the gap between real-world robotics and digital simulation. It is openly available as a user-friendly toolkit to support broader adoption and continued innovation in embodied AI research. 

Why This Research Matters?

This research addresses a core bottleneck in embodied AI: the lack of scalable, realistic, and physics-compatible 3D environments for training and evaluation. While internet-scale data has driven progress in vision and language models, embodied intelligence demands simulation-ready assets with accurate scale, geometry, and interactivity—qualities often missing in traditional 3D generation pipelines. EmbodiedGen fills this gap by offering an open-source, modular platform capable of producing high-quality, controllable 3D objects and scenes compatible with major robotics simulators. Its ability to convert text and images into physically plausible 3D environments at scale makes it a foundational tool for advancing embodied AI research, digital twins, and real-to-sim learning.

Check out the Paper and Project Page.


Google Researchers Release Magenta RealTime: An Open-Weight Model for Real-Time AI Music Generation

Google’s Magenta team has introduced Magenta RealTime (Magenta RT), an open-weight, real-time music generation model that brings unprecedented interactivity to generative audio. Licensed under Apache 2.0 and available on GitHub and Hugging Face, Magenta RT is the first large-scale music generation model that supports real-time inference with dynamic, user-controllable style prompts.

Background: Real-Time Music Generation

Real-time control and live interactivity are foundational to musical creativity. While prior Magenta projects like Piano Genie and DDSP emphasized expressive control and signal modeling, Magenta RT extends these ambitions to full-spectrum audio synthesis. It closes the gap between generative models and human-in-the-loop composition by enabling instantaneous feedback and dynamic musical evolution.

Magenta RT builds upon MusicLM and MusicFX’s underlying modeling techniques. However, unlike their API- or batch-oriented modes of generation, Magenta RT supports streaming synthesis with forward real-time factor (RTF) >1—meaning it can generate faster than real-time, even on free-tier Colab TPUs.

Technical Overview

Magenta RT is a Transformer-based language model trained on discrete audio tokens. These tokens are produced via a neural audio codec, which operates at 48 kHz stereo fidelity. The model leverages an 800 million parameter Transformer architecture that has been optimized for:

Streaming generation in 2-second audio segments

Temporal conditioning with a 10-second audio history window

Multimodal style control, using either text prompts or reference audio

To support this, the model architecture adapts MusicLM’s staged training pipeline, integrating a new joint music-text embedding module known as MusicCoCa (a hybrid of MuLan and CoCa). This allows semantically meaningful control over genre, instrumentation, and stylistic progression in real time.

Data and Training

Magenta RT is trained on ~190,000 hours of instrumental stock music. This large and diverse dataset ensures wide genre generalization and smooth adaptation across musical contexts. The training data was tokenized using a hierarchical codec, which enables compact representations without losing fidelity. Each 2-second chunk is conditioned not only on a user-specified prompt but also on a rolling context of 10 seconds of prior audio, enabling smooth, coherent progression.

The model supports two input modalities for style prompts:

Textual prompts, which are converted into embeddings using MusicCoCa

Audio prompts, encoded into the same embedding space via a learned encoder

This fusion of modalities permits real-time genre morphing and dynamic instrument blending—capabilities essential for live composition and DJ-like performance scenarios.

Performance and Inference

Despite the model’s scale (800M parameters), Magenta RT generates 2 seconds of audio in roughly 1.25 seconds of compute. This is sufficient for real-time usage (a real-time factor of about 0.625, i.e., generation takes about 62.5% of the audio’s duration), and inference can be executed on free-tier TPUs in Google Colab.

The generation process is chunked to allow continuous streaming: each 2s segment is synthesized in a forward pipeline, with overlapping windowing to ensure continuity and coherence. Latency is further minimized via optimizations in model compilation (XLA), caching, and hardware scheduling.
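Conceptually, the streaming loop looks something like the sketch below. The generate_chunk function and its noise output are hypothetical stand-ins for the real model call, the 2 s chunk length and 10 s context window are the figures reported above, and the short crossfade is one simple way to realize the overlapping windowing described here.

import numpy as np

SAMPLE_RATE = 48_000        # 48 kHz stereo codec, per the article
CHUNK_SECONDS = 2           # each generated segment
CONTEXT_SECONDS = 10        # rolling audio history used as conditioning
CROSSFADE_SECONDS = 0.04    # small overlap to stitch chunks smoothly (assumed value)

def generate_chunk(context_audio: np.ndarray, style_embedding: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model: returns CHUNK_SECONDS of stereo audio."""
    return (np.random.randn(CHUNK_SECONDS * SAMPLE_RATE, 2) * 0.01).astype(np.float32)

def stream(style_embedding: np.ndarray, total_seconds: int = 30) -> np.ndarray:
    out = np.zeros((0, 2), dtype=np.float32)
    fade = int(CROSSFADE_SECONDS * SAMPLE_RATE)
    ramp = np.linspace(0.0, 1.0, fade)[:, None]
    while out.shape[0] < total_seconds * SAMPLE_RATE:
        context = out[-CONTEXT_SECONDS * SAMPLE_RATE:]      # last 10 s (or less at the start)
        chunk = generate_chunk(context, style_embedding)    # 2 s of new audio
        if out.shape[0] >= fade:
            # Crossfade the tail of the existing audio with the head of the new chunk.
            out[-fade:] = out[-fade:] * (1 - ramp) + chunk[:fade] * ramp
            chunk = chunk[fade:]
        out = np.concatenate([out, chunk], axis=0)
    return out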

Applications and Use Cases

Magenta RT is designed for integration into:

Live performances, where musicians or DJs can steer generation on-the-fly

Creative prototyping tools, offering rapid auditioning of musical styles

Educational tools, helping students understand structure, harmony, and genre fusion

Interactive installations, enabling responsive generative audio environments

Google has hinted at upcoming support for on-device inference and personal fine-tuning, which would allow creators to adapt the model to their unique stylistic signatures.

Comparison to Related Models

Magenta RT complements Google DeepMind’s MusicFX (DJ Mode) and Lyria’s RealTime API, but differs critically in being open source and self-hostable. It also stands apart from latent diffusion models (e.g., Riffusion) and autoregressive decoders (e.g., Jukebox) by focusing on codec-token prediction with minimal latency.

Compared to models like MusicGen or MusicLM, Magenta RT delivers lower latency and enables interactive generation, which is often missing from current prompt-to-audio pipelines that require full track generation upfront.

Conclusion

Magenta RealTime pushes the boundaries of real-time generative audio. By blending high-fidelity synthesis with dynamic user control, it opens up new possibilities for AI-assisted music creation. Its architecture balances scale and speed, while its open licensing ensures accessibility and community contribution. For researchers, developers, and musicians alike, Magenta RT represents a foundational step toward responsive, collaborative AI music systems.

Check out the Model on Hugging Face, GitHub Page, Technical Details and Colab Notebook.


This AI Paper Introduces WINGS: A Dual-Learner Architecture to Prevent Text-Only Forgetting in Multimodal Large Language Models

Multimodal LLMs: Expanding Capabilities Across Text and Vision

Expanding large language models (LLMs) to handle multiple modalities, particularly images and text, has enabled the development of more interactive and intuitive AI systems. Multimodal LLMs (MLLMs) can interpret visuals, answer questions about images, and engage in dialogues that include both text and pictures. Their ability to reason across visual and linguistic domains makes them increasingly valuable for applications such as education, content generation, and interactive assistants.

The Challenge of Text-Only Forgetting in MLLMs

However, integrating vision into LLMs creates a problem. When trained on datasets that mix images with text, MLLMs often lose their ability to handle purely textual tasks. This phenomenon, known as text-only forgetting, occurs because visual tokens inserted into the language sequence divert the model’s attention away from the text. As a result, the MLLM starts prioritizing image-related content and performs poorly on tasks that require only language understanding, such as basic reasoning, comprehension, or textual question-and-answer (Q&A) tasks.

Limitations of Existing Mitigation Strategies

Several methods attempt to address this degradation. Some approaches reintroduce large amounts of text-only data during training, while others alternate between text-only and multimodal fine-tuning. These strategies aim to remind the model of its original language capabilities. Other designs include adapter layers or prompt-based tuning. However, these techniques often increase training costs, require complex switching logic during inference, or fail to restore text comprehension entirely. The problem largely stems from how the model’s attention shifts when image tokens are introduced into the sequence.

Introducing WINGS: A Dual-Learner Approach by Alibaba and Nanjing University

Researchers from Alibaba Group’s AI Business team and Nanjing University have introduced a new approach called WINGS. The design adds two new modules—visual and textual learners—into each layer of the MLLM. These learners work in parallel with the model’s core attention mechanism. The structure resembles “wings” attached to either side of the attention layers. A routing component controls how much attention each learner receives based on the current token mix, allowing the model to balance its focus between visual and textual information dynamically.

Low-Rank Residual Attention (LoRRA): Balancing Efficiency and Modality Awareness

The WINGS architecture uses a mechanism called Low-Rank Residual Attention (LoRRA), which keeps computations lightweight while enabling the learners to capture essential modality-specific information. In the first stage of training, only visual learners are activated to align image features. In the second stage, both visual and textual learners are co-trained with a router module that uses attention weights to allocate responsibility. Each learner uses efficient attention blocks to interact with either the image or the surrounding text, and their outputs are combined with those of the main model. This ensures that visual attention doesn’t overwhelm textual understanding.
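To make the idea concrete, here is a toy PyTorch sketch of low-rank side learners added to an attention block's output and gated by a per-token router. The dimensions, rank, and router design are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn

class LowRankLearner(nn.Module):
    """Toy low-rank residual learner: a rank-r bottleneck applied to the hidden states."""
    def __init__(self, d_model: int = 512, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(x)))

class WingedAttention(nn.Module):
    """Main attention output plus routed visual/textual side learners (conceptual sketch only)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_learner = LowRankLearner(d_model)
        self.textual_learner = LowRankLearner(d_model)
        self.router = nn.Linear(d_model, 2)  # per-token weights for the two learners

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        weights = torch.softmax(self.router(x), dim=-1)          # (batch, seq, 2)
        side = (weights[..., 0:1] * self.visual_learner(x) +
                weights[..., 1:2] * self.textual_learner(x))
        return attn_out + side                                    # residual combination

x = torch.randn(2, 16, 512)
print(WingedAttention()(x).shape)  # torch.Size([2, 16, 512])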

WINGS Performance Benchmarks Across Text and Multimodal Tasks

In terms of performance, WINGS showed strong results. On the MMLU dataset, it achieved a text-only score of 60.53, representing an improvement of 9.70 points compared to a similar baseline model. For CMMLU, it scored 69.82, which is 9.36 points higher than the baseline. In reasoning tasks like Race-High, it gained 11.9 points, and in WSC, an improvement of 11.12 points was recorded. In multimodal benchmarks like MMMU-VAL, WINGS achieved an improvement of 4.78 points. It also demonstrated robust results on the IIT benchmark, handling mixed text-and-image multi-turn dialogues more effectively than other open-source MLLMs at the same scale.

Conclusion: Toward More Balanced and Generalizable MLLMs

In summary, the researchers tackled the issue of catastrophic text-only forgetting in MLLMs by introducing WINGS, an architecture that pairs dedicated visual and textual learners alongside attention routing. By analyzing attention shifts and designing targeted interventions, they maintained text performance while enhancing visual understanding, offering a more balanced and efficient multimodal model.

Check out the Paper.

Mistral AI Releases Mistral Small 3.2: Enhanced Instruction Following, Reduced Repetition, and Stronger Function Calling for AI Integration

With the frequent release of new large language models (LLMs), there is a persistent quest to minimize repetitive errors, enhance robustness, and significantly improve user interactions. As AI models become integral to more sophisticated computational tasks, developers are consistently refining their capabilities, ensuring seamless integration within diverse, real-world scenarios.

Mistral AI has released Mistral Small 3.2 (Mistral-Small-3.2-24B-Instruct-2506), an updated version of its earlier release, Mistral-Small-3.1-24B-Instruct-2503. Although a minor release, Mistral Small 3.2 introduces fundamental upgrades that aim to enhance the model’s overall reliability and efficiency, particularly in handling complex instructions, avoiding redundant outputs, and maintaining stability under function-calling scenarios.

A significant enhancement in Mistral Small 3.2 is its accuracy in executing precise instructions, since successful user interaction often hinges on following subtle commands. Benchmark scores reflect this improvement: on the Wildbench v2 instruction test, Mistral Small 3.2 achieved 65.33% accuracy, up from 55.6% for its predecessor. Meanwhile, performance on the difficult Arena Hard v2 test more than doubled, from 19.56% to 43.1%, evidence of its improved ability to grasp and execute intricate commands precisely.


On repetition errors, Mistral Small 3.2 greatly reduces instances of infinite or repetitive output, a problem commonly faced in long conversational scenarios. Internal evaluations show that Small 3.2 cuts infinite generation errors roughly in half, from 2.11% in Small 3.1 to 1.29%. This reduction directly increases the model’s usability and dependability in extended interactions. The new model also demonstrates greater capability in function calling, making it well suited for automation tasks, and improved robustness in the function-calling template translates into more stable and dependable integrations.

STEM-related benchmark improvements further demonstrate Small 3.2’s aptitude. For example, accuracy on the HumanEval Plus Pass@5 code test increased from 88.99% in Small 3.1 to 92.90%. MMLU Pro results rose from 66.76% to 69.06%, and the GPQA Diamond score improved slightly from 45.96% to 46.13%, showing general competence in scientific and technical use cases.


Vision-based performance outcomes were inconsistent, with certain optimizations being selectively applied. ChartQA accuracy improved from 86.24% to 87.4%, and DocVQA marginally enhanced from 94.08% to 94.86%. In contrast, some tests, such as MMMU and Mathvista, experienced slight dips, indicating specific trade-offs encountered during the optimization process.

The key updates in Mistral Small 3.2 over Small 3.1 include:

Enhanced precision in instruction-following, with Wildbench v2 accuracy rising from 55.6% to 65.33%.

Reduced repetition errors, halving infinite generation instances from 2.11% to 1.29%.

Improved robustness in function calling templates, ensuring more stable integrations.

Notable increases in STEM-related performance, particularly in HumanEval Plus Pass@5 (92.90%) and MMLU Pro (69.06%).

In conclusion, Mistral Small 3.2 offers targeted and practical enhancements over its predecessor, providing users with greater accuracy, reduced redundancy, and improved integration capabilities. These advancements help position it as a reliable choice for complex AI-driven tasks across diverse application areas.

Check out the Model Card on Hugging Face.

Building Event-Driven AI Agents with UAgents and Google Gemini: A Modular Python Implementation Guide

In this tutorial, we demonstrate how to use the UAgents framework to build a lightweight, event-driven AI agent architecture on top of Google’s Gemini API. We’ll start by applying nest_asyncio to enable nested event loops, then configure your Gemini API key and instantiate the GenAI client. Next, we’ll define our communication contracts, Question and Answer Pydantic models, and spin up two UAgents: one “gemini_agent” that listens for incoming Question messages, invokes the Gemini “flash” model to generate responses, and emits Answer messages; and one “client_agent” that triggers a query upon startup and handles the incoming answer. Finally, we’ll learn how to run these agents concurrently using Python’s multiprocessing utility and gracefully shut down the event loop once the exchange is complete, illustrating UAgents’ seamless orchestration of inter-agent messaging.

!pip install -q uagents google-genai

We install the UAgents framework and the Google GenAI client library, providing the necessary tooling to build and run event-driven AI agents with Gemini. The -q flag runs the installation quietly, keeping the notebook output clean. Check out the Notebook here

import os, time, multiprocessing, asyncio
import nest_asyncio
from google import genai
from pydantic import BaseModel, Field
from uagents import Agent, Context

nest_asyncio.apply()

We set up our Python environment by importing essential modules, system utilities (os, time, multiprocessing, asyncio), nest_asyncio for enabling nested event loops (critical in notebooks), the Google GenAI client, Pydantic for schema validation, and core UAgents classes. Finally, nest_asyncio.apply() patches the event loop so you can run asynchronous UAgents workflows seamlessly in interactive environments. Check out the Notebook here

os.environ["GOOGLE_API_KEY"] = "Use Your Own API Key Here"

client = genai.Client()

Here we set our Gemini API key in the environment. Be sure to replace the placeholder with your actual key, and then initialize the GenAI client, which will handle all subsequent requests to Google’s Gemini models. This step ensures our agent has authenticated access to generate content through the API.

class Question(BaseModel):
    question: str = Field(...)

class Answer(BaseModel):
    answer: str = Field(...)

These Pydantic models define the structured message formats that our agents will exchange with each other. The Question model carries a single question string field, and the Answer model carries a single answer string field. By using Pydantic, we get automatic validation and serialization of incoming and outgoing messages, ensuring that each agent always works with well-formed data.
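As a quick sanity check, constructing a message validates its required field (the example question is arbitrary):

q = Question(question="What is the capital of France?")
print(q.question)
# Omitting the required field, e.g. Question(), raises a pydantic ValidationError.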

ai_agent = Agent(
    name="gemini_agent",
    seed="agent_seed_phrase",
    port=8000,
    endpoint=["http://127.0.0.1:8000/submit"]
)

@ai_agent.on_event("startup")
async def ai_startup(ctx: Context):
    ctx.logger.info(f"{ai_agent.name} listening on {ai_agent.address}")

def ask_gemini(q: str) -> str:
    resp = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Answer the question: {q}"
    )
    return resp.text

@ai_agent.on_message(model=Question, replies=Answer)
async def handle_question(ctx: Context, sender: str, msg: Question):
    ans = ask_gemini(msg.question)
    await ctx.send(sender, Answer(answer=ans))
In this block, we instantiate the UAgents “gemini_agent” with a unique name, seed phrase (for deterministic identity), listening port, and HTTP endpoint for message submissions. We then register a startup event handler that logs when the agent is ready, ensuring visibility into its lifecycle. The synchronous helper ask_gemini wraps the GenAI client call to Gemini’s “flash” model. At the same time, the @ai_agent.on_message handler deserializes incoming Question messages, invokes ask_gemini, and asynchronously sends back a validated Answer payload to the original sender. Check out the Notebook here

client_agent = Agent(
    name="client_agent",
    seed="client_seed_phrase",
    port=8001,
    endpoint=["http://127.0.0.1:8001/submit"]
)

@client_agent.on_event("startup")
async def ask_on_start(ctx: Context):
    await ctx.send(ai_agent.address, Question(question="What is the capital of France?"))

@client_agent.on_message(model=Answer)
async def handle_answer(ctx: Context, sender: str, msg: Answer):
    print("Answer from Gemini:", msg.answer)
    # Use a more graceful shutdown
    asyncio.create_task(shutdown_loop())

async def shutdown_loop():
    await asyncio.sleep(1)  # Give time for cleanup
    loop = asyncio.get_event_loop()
    loop.stop()

We set up a “client_agent” that, upon startup, sends a Question to the gemini_agent asking for the capital of France, then listens for an Answer, prints the received response, and gracefully shuts down the event loop after a brief delay. Check out the Notebook here

def run_agent(agent):
    agent.run()

if __name__ == "__main__":
    p = multiprocessing.Process(target=run_agent, args=(ai_agent,))
    p.start()
    time.sleep(2)

    client_agent.run()

    p.join()

Finally, we define a helper run_agent function that calls agent.run(), then uses Python’s multiprocessing to launch the gemini_agent in its process. After giving it a moment to spin up, it runs the client_agent in the main process, blocking until the answer round-trip completes, and finally joins the background process to ensure a clean shutdown.

In conclusion, with this UAgents-focused tutorial, we now have a clear blueprint for creating modular AI services that communicate via well-defined event hooks and message schemas. You’ve seen how UAgents simplifies agent lifecycle management, registering startup events, handling incoming messages, and sending structured replies, all without boilerplate networking code. From here, you can expand your UAgents setup to include more sophisticated conversation workflows, multiple message types, and dynamic agent discovery.

Check out the Notebook here.

PoE-World + Planner Outperforms Reinforcement Learning RL Baselines in Montezuma’s Revenge with Minimal Demonstration Data

The Importance of Symbolic Reasoning in World Modeling

Understanding how the world works is key to creating AI agents that can adapt to complex situations. While neural network-based models, such as Dreamer, offer flexibility, they require massive amounts of data to learn effectively, far more than humans typically do. On the other hand, newer methods use program synthesis with large language models to generate code-based world models. These are more data-efficient and can generalize well from limited input. However, their use has been mostly limited to simple domains, such as text or grid worlds, as scaling to complex, dynamic environments remains a challenge due to the difficulty of generating large, comprehensive programs.

Limitations of Existing Programmatic World Models

Recent research has investigated the use of programs to represent world models, often leveraging large language models to synthesize Python transition functions. Approaches like WorldCoder and CodeWorldModels generate a single, large program, which limits their scalability in complex environments and their ability to handle uncertainty and partial observability. Some studies focus on high-level symbolic models for robotic planning by integrating visual input with abstract reasoning. Earlier efforts employed restricted domain-specific languages tailored to specific benchmarks or utilized conceptually related structures, such as factor graphs in Schema Networks. Theoretical models, such as AIXI, also explore world modeling using Turing machines and history-based representations.

Introducing PoE-World: Modular and Probabilistic World Models

Researchers from Cornell, Cambridge, The Alan Turing Institute, and Dalhousie University introduce PoE-World, an approach to learning symbolic world models by combining many small, LLM-synthesized programs, each capturing a specific rule of the environment. Instead of creating one large program, PoE-World builds a modular, probabilistic structure that can learn from brief demonstrations. This setup supports generalization to new situations, allowing agents to plan effectively, even in complex games like Pong and Montezuma’s Revenge. While it doesn’t model raw pixel data, it learns from symbolic object observations and emphasizes accurate modeling over exploration for efficient decision-making.

Architecture and Learning Mechanism of PoE-World

PoE-World models the environment as a combination of small, interpretable Python programs called programmatic experts, each responsible for a specific rule or behavior. These experts are weighted and combined to predict future states based on past observations and actions. By treating features as conditionally independent and learning from the full history, the model remains modular and scalable. Hard constraints refine predictions, and experts are updated or pruned as new data is collected. The model supports planning and reinforcement learning by simulating likely future outcomes, enabling efficient decision-making. Programs are synthesized using LLMs and interpreted probabilistically, with expert weights optimized via gradient descent.
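The combination step can be pictured as a log-linear product of experts. The toy sketch below, with made-up experts and weights, shows how several per-feature distributions could be multiplied and renormalized; it captures the spirit of the weighting described above rather than the authors' actual code.

import numpy as np

def combine_experts(expert_probs: list[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """Weighted product of experts over a discrete next-state feature, renormalized."""
    log_p = sum(w * np.log(p + 1e-12) for w, p in zip(weights, expert_probs))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# Two toy "experts" predicting a discrete feature with three possible values.
expert_a = np.array([0.7, 0.2, 0.1])   # e.g. "object keeps moving right"
expert_b = np.array([0.3, 0.5, 0.2])   # e.g. "object bounces off walls"
weights = np.array([1.0, 0.5])         # in the real system, weights are optimized by gradient descent

print(combine_experts([expert_a, expert_b], weights))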

Empirical Evaluation on Atari Games

The study evaluates their agent, PoE-World + Planner, on Atari’s Pong and Montezuma’s Revenge, including harder, modified versions of these games. Using minimal demonstration data, their method outperforms baselines such as PPO, ReAct, and WorldCoder, particularly in low-data settings. PoE-World demonstrates strong generalization by accurately modeling game dynamics, even in altered environments without new demonstrations. It’s also the only method to consistently score positively in Montezuma’s Revenge. Pre-training policies in PoE-World’s simulated environment accelerate real-world learning. Unlike WorldCoder’s limited and sometimes inaccurate models, PoE-World produces more detailed, constraint-aware representations, leading to better planning and more realistic in-game behavior.

Conclusion: Symbolic, Modular Programs for Scalable AI Planning

In conclusion, understanding how the world works is crucial to building adaptive AI agents; however, traditional deep learning models require large datasets and struggle to update flexibly with limited input. Inspired by how humans and symbolic systems recombine knowledge, the study proposes PoE-World. This method utilizes large language models to synthesize modular, programmatic “experts” that represent different parts of the world. These experts combine compositionally to form a symbolic, interpretable world model that supports strong generalization from minimal data. Tested on Atari games like Pong and Montezuma’s Revenge, this approach demonstrates efficient planning and performance, even in unfamiliar scenarios. Code and demos are publicly available.

Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.
The post PoE-World + Planner Outperforms Reinforcement Learning RL Baselines in Montezuma’s Revenge with Minimal Demonstration Data appeared first on MarkTechPost.

Build an Intelligent Multi-Tool AI Agent Interface Using Streamlit for …

In this tutorial, we’ll build a powerful and interactive Streamlit application that brings together the capabilities of LangChain, the Google Gemini API, and a suite of advanced tools to create a smart AI assistant. Using Streamlit’s intuitive interface, we’ll create a chat-based system that can search the web, fetch Wikipedia content, perform calculations, remember key details, and handle conversation history, all in real time. Whether we’re developers, researchers, or just exploring AI, this setup allows us to interact with a multi-agent system directly from the browser with minimal code and maximum flexibility.

!pip install -q streamlit langchain langchain-google-genai langchain-community
!pip install -q pyngrok python-dotenv wikipedia duckduckgo-search
!npm install -g localtunnel

import streamlit as st
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool, WikipediaQueryRun, DuckDuckGoSearchRun
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.callbacks.streamlit import StreamlitCallbackHandler
from langchain_community.utilities import WikipediaAPIWrapper, DuckDuckGoSearchAPIWrapper
import asyncio
import threading
import time
from datetime import datetime
import json

We begin by installing all the necessary Python and Node.js packages required for our AI assistant app. This includes Streamlit for the frontend, LangChain for agent logic, and tools like Wikipedia, DuckDuckGo, and ngrok/localtunnel for external search and hosting. Once set up, we import all modules to start building our interactive multi-tool AI agent.

GOOGLE_API_KEY = "Use Your API Key Here"
NGROK_AUTH_TOKEN = "Use Your Auth Token Here"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

Next, we configure our environment by setting the Google Gemini API key and the ngrok authentication token. We assign these credentials to variables and set the GOOGLE_API_KEY so the LangChain agent can securely access the Gemini model during execution.

class InnovativeAgentTools:
    """Advanced tool collection for the multi-agent system"""

    @staticmethod
    def get_calculator_tool():
        def calculate(expression: str) -> str:
            """Calculate mathematical expressions safely"""
            try:
                allowed_chars = set('0123456789+-*/.() ')
                if all(c in allowed_chars for c in expression):
                    result = eval(expression)
                    return f"Result: {result}"
                else:
                    return "Error: Invalid mathematical expression"
            except Exception as e:
                return f"Calculation error: {str(e)}"

        return Tool(
            name="Calculator",
            func=calculate,
            description="Calculate mathematical expressions. Input should be a valid math expression."
        )

    @staticmethod
    def get_memory_tool(memory_store):
        def save_memory(key_value: str) -> str:
            """Save information to memory"""
            try:
                key, value = key_value.split(":", 1)
                memory_store[key.strip()] = value.strip()
                return f"Saved '{key.strip()}' to memory"
            except Exception:
                return "Error: Use format 'key: value'"

        def recall_memory(key: str) -> str:
            """Recall information from memory"""
            return memory_store.get(key.strip(), f"No memory found for '{key}'")

        return [
            Tool(name="SaveMemory", func=save_memory,
                 description="Save information to memory. Format: 'key: value'"),
            Tool(name="RecallMemory", func=recall_memory,
                 description="Recall saved information. Input: key to recall")
        ]

    @staticmethod
    def get_datetime_tool():
        def get_current_datetime(format_type: str = "full") -> str:
            """Get current date and time"""
            now = datetime.now()
            if format_type == "date":
                return now.strftime("%Y-%m-%d")
            elif format_type == "time":
                return now.strftime("%H:%M:%S")
            else:
                return now.strftime("%Y-%m-%d %H:%M:%S")

        return Tool(
            name="DateTime",
            func=get_current_datetime,
            description="Get current date/time. Options: 'date', 'time', or 'full'"
        )

Here, we define the InnovativeAgentTools class to equip our AI agent with specialized capabilities. We implement tools such as a Calculator for safe expression evaluation, Memory Tools to save and recall information across turns, and a date and time tool to fetch the current date and time. These tools enable our Streamlit AI agent to reason, remember, and respond contextually, much like a true assistant. Check out the full Notebook here

class MultiAgentSystem:
    """Innovative multi-agent system with specialized capabilities"""

    def __init__(self, api_key: str):
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-pro",
            google_api_key=api_key,
            temperature=0.7,
            convert_system_message_to_human=True
        )
        self.memory_store = {}
        self.conversation_memory = ConversationBufferWindowMemory(
            memory_key="chat_history",
            k=10,
            return_messages=True
        )
        self.tools = self._initialize_tools()
        self.agent = self._create_agent()

    def _initialize_tools(self):
        """Initialize all available tools"""
        tools = []

        tools.extend([
            DuckDuckGoSearchRun(api_wrapper=DuckDuckGoSearchAPIWrapper()),
            WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
        ])

        tools.append(InnovativeAgentTools.get_calculator_tool())
        tools.append(InnovativeAgentTools.get_datetime_tool())
        tools.extend(InnovativeAgentTools.get_memory_tool(self.memory_store))

        return tools

    def _create_agent(self):
        """Create the ReAct agent with advanced prompt"""
        prompt = PromptTemplate.from_template("""
You are an advanced AI assistant with access to multiple tools and persistent memory.

AVAILABLE TOOLS:
{tools}

TOOL USAGE FORMAT:
- Think step by step about what you need to do
- Use Action: tool_name
- Use Action Input: your input
- Wait for Observation
- Continue until you have a final answer

MEMORY CAPABILITIES:
- You can save important information using SaveMemory
- You can recall previous information using RecallMemory
- Always try to remember user preferences and context

CONVERSATION HISTORY:
{chat_history}

CURRENT QUESTION: {input}

REASONING PROCESS:
{agent_scratchpad}

Begin your response with your thought process, then take action if needed.
""")

        agent = create_react_agent(self.llm, self.tools, prompt)
        return AgentExecutor(
            agent=agent,
            tools=self.tools,
            memory=self.conversation_memory,
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=5
        )

    def chat(self, message: str, callback_handler=None):
        """Process user message and return response"""
        try:
            if callback_handler:
                response = self.agent.invoke(
                    {"input": message},
                    {"callbacks": [callback_handler]}
                )
            else:
                response = self.agent.invoke({"input": message})
            return response["output"]
        except Exception as e:
            return f"Error processing request: {str(e)}"

In this section, we build the core of our application, the MultiAgentSystem class. Here, we integrate the Gemini Pro model using LangChain and initialize all essential tools, including web search, memory, and calculator functions. We configure a ReAct-style agent using a custom prompt that guides tool usage and memory handling. Finally, we define a chat method that allows the agent to process user input, invoke tools when necessary, and generate intelligent, context-aware responses. Check out the full Notebook here

def create_streamlit_app():
    """Create the innovative Streamlit application"""

    st.set_page_config(
        page_title=" Advanced LangChain Agent with Gemini",
        page_icon="",
        layout="wide",
        initial_sidebar_state="expanded"
    )

    st.markdown("""
    <style>
    .main-header {
        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
        padding: 1rem;
        border-radius: 10px;
        color: white;
        text-align: center;
        margin-bottom: 2rem;
    }
    .agent-response {
        background-color: #f0f2f6;
        padding: 1rem;
        border-radius: 10px;
        border-left: 4px solid #667eea;
        margin: 1rem 0;
    }
    .memory-card {
        background-color: #e8f4fd;
        padding: 1rem;
        border-radius: 8px;
        margin: 0.5rem 0;
    }
    </style>
    """, unsafe_allow_html=True)

    st.markdown("""
    <div class="main-header">
        <h1> Advanced Multi-Agent System</h1>
        <p>Powered by LangChain + Gemini API + Streamlit</p>
    </div>
    """, unsafe_allow_html=True)

    with st.sidebar:
        st.header(" Configuration")

        api_key = st.text_input(
            " Google AI API Key",
            type="password",
            value=GOOGLE_API_KEY if GOOGLE_API_KEY != "your-gemini-api-key-here" else "",
            help="Get your API key from https://ai.google.dev/"
        )

        if not api_key:
            st.error("Please enter your Google AI API key to continue")
            st.stop()

        st.success(" API Key configured")

        st.header(" Agent Capabilities")
        st.markdown("""
        - **Web Search** (DuckDuckGo)
        - **Wikipedia Lookup**
        - **Mathematical Calculator**
        - **Persistent Memory**
        - **Date & Time**
        - **Conversation History**
        """)

        if 'agent_system' in st.session_state:
            st.header(" Memory Store")
            memory = st.session_state.agent_system.memory_store
            if memory:
                for key, value in memory.items():
                    st.markdown(f"""
                    <div class="memory-card">
                        <strong>{key}:</strong> {value}
                    </div>
                    """, unsafe_allow_html=True)
            else:
                st.info("No memories stored yet")

    if 'agent_system' not in st.session_state:
        with st.spinner(" Initializing Advanced Agent System..."):
            st.session_state.agent_system = MultiAgentSystem(api_key)
        st.success(" Agent System Ready!")

    st.header(" Interactive Chat")

    if 'messages' not in st.session_state:
        st.session_state.messages = [{
            "role": "assistant",
            "content": """ Hello! I'm your advanced AI assistant powered by Gemini. I can:

• Search the web and Wikipedia for information
• Perform mathematical calculations
• Remember important information across our conversation
• Provide current date and time
• Maintain conversation context

Try asking me something like:
- "Calculate 15 * 8 + 32"
- "Search for recent news about AI"
- "Remember that my favorite color is blue"
- "What's the current time?"
"""
        }]

    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    if prompt := st.chat_input("Ask me anything..."):
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            callback_handler = StreamlitCallbackHandler(st.container())

            with st.spinner(" Thinking..."):
                response = st.session_state.agent_system.chat(prompt, callback_handler)

            st.markdown(f"""
            <div class="agent-response">
                {response}
            </div>
            """, unsafe_allow_html=True)

            st.session_state.messages.append({"role": "assistant", "content": response})

    st.header(" Example Queries")
    col1, col2, col3 = st.columns(3)

    with col1:
        if st.button(" Search Example"):
            example = "Search for the latest developments in quantum computing"
            st.session_state.example_query = example

    with col2:
        if st.button(" Math Example"):
            example = "Calculate the compound interest on $1000 at 5% for 3 years"
            st.session_state.example_query = example

    with col3:
        if st.button(" Memory Example"):
            example = "Remember that I work as a data scientist at TechCorp"
            st.session_state.example_query = example

    if 'example_query' in st.session_state:
        st.info(f"Example query: {st.session_state.example_query}")

In this section, we bring everything together by building an interactive web interface using Streamlit. We configure the app layout, define custom CSS styles, and set up a sidebar for entering API keys and reviewing agent capabilities. We initialize the multi-agent system, maintain a message history, and enable a chat interface that lets users interact in real time. To make it even easier to explore, we also provide example buttons for search, math, and memory-related queries, all in a cleanly styled, responsive UI. Check out the full Notebook here

def setup_ngrok_auth(auth_token):
    """Setup ngrok authentication"""
    try:
        from pyngrok import ngrok, conf

        conf.get_default().auth_token = auth_token

        try:
            tunnels = ngrok.get_tunnels()
            print(" Ngrok authentication successful!")
            return True
        except Exception as e:
            print(f" Ngrok authentication failed: {e}")
            return False

    except ImportError:
        print(" pyngrok not installed. Installing...")
        import subprocess
        subprocess.run(['pip', 'install', 'pyngrok'], check=True)
        return setup_ngrok_auth(auth_token)


def get_ngrok_token_instructions():
    """Provide instructions for getting ngrok token"""
    return """
NGROK AUTHENTICATION SETUP:

1. Sign up for an ngrok account:
   - Visit: https://dashboard.ngrok.com/signup
   - Create a free account

2. Get your authentication token:
   - Go to: https://dashboard.ngrok.com/get-started/your-authtoken
   - Copy your authtoken

3. Replace 'your-ngrok-auth-token-here' in the code with your actual token

4. Alternative methods if ngrok fails:
   - Use Google Colab's built-in public URL feature
   - Use localtunnel: !npx localtunnel --port 8501
   - Use serveo.net: !ssh -R 80:localhost:8501 serveo.net
"""

Here, we set up a helper function to authenticate ngrok, which allows us to expose our local Streamlit app to the internet. We use the pyngrok library to configure the authentication token and verify the connection. If the token is missing or invalid, we provide detailed instructions on how to obtain one and suggest alternative tunneling methods, such as LocalTunnel or Serveo, making it easy for us to host and share our app from environments like Google Colab.

def main():
    """Main function to run the application"""
    try:
        create_streamlit_app()
    except Exception as e:
        st.error(f"Application error: {str(e)}")
        st.info("Please check your API key and try refreshing the page")

This main() function acts as the entry point for our Streamlit application. We simply call create_streamlit_app() to launch the full interface. If anything goes wrong, such as a missing API key or a failed tool initialization, we catch the error gracefully and display a helpful message, ensuring the user knows how to recover and continue using the app smoothly.

def run_in_colab():
    """Run the application in Google Colab with proper ngrok setup"""

    print(" Starting Advanced LangChain Agent Setup...")

    if NGROK_AUTH_TOKEN == "your-ngrok-auth-token-here":
        print(" NGROK_AUTH_TOKEN not configured!")
        print(get_ngrok_token_instructions())

        print(" Attempting alternative tunnel methods...")
        try_alternative_tunnels()
        return

    print(" Installing required packages...")
    import subprocess

    packages = [
        'streamlit',
        'langchain',
        'langchain-google-genai',
        'langchain-community',
        'wikipedia',
        'duckduckgo-search',
        'pyngrok'
    ]

    for package in packages:
        try:
            subprocess.run(['pip', 'install', package], check=True, capture_output=True)
            print(f" {package} installed")
        except subprocess.CalledProcessError:
            print(f" Failed to install {package}")

    app_content = '''
import streamlit as st
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool, WikipediaQueryRun, DuckDuckGoSearchRun
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.callbacks.streamlit import StreamlitCallbackHandler
from langchain_community.utilities import WikipediaAPIWrapper, DuckDuckGoSearchAPIWrapper
from datetime import datetime

# Configuration - Replace with your actual keys
GOOGLE_API_KEY = "''' + GOOGLE_API_KEY + '''"
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

class InnovativeAgentTools:
    @staticmethod
    def get_calculator_tool():
        def calculate(expression: str) -> str:
            try:
                allowed_chars = set('0123456789+-*/.() ')
                if all(c in allowed_chars for c in expression):
                    result = eval(expression)
                    return f"Result: {result}"
                else:
                    return "Error: Invalid mathematical expression"
            except Exception as e:
                return f"Calculation error: {str(e)}"

        return Tool(name="Calculator", func=calculate,
                    description="Calculate mathematical expressions. Input should be a valid math expression.")

    @staticmethod
    def get_memory_tool(memory_store):
        def save_memory(key_value: str) -> str:
            try:
                key, value = key_value.split(":", 1)
                memory_store[key.strip()] = value.strip()
                return f"Saved '{key.strip()}' to memory"
            except Exception:
                return "Error: Use format 'key: value'"

        def recall_memory(key: str) -> str:
            return memory_store.get(key.strip(), f"No memory found for '{key}'")

        return [
            Tool(name="SaveMemory", func=save_memory, description="Save information to memory. Format: 'key: value'"),
            Tool(name="RecallMemory", func=recall_memory, description="Recall saved information. Input: key to recall")
        ]

    @staticmethod
    def get_datetime_tool():
        def get_current_datetime(format_type: str = "full") -> str:
            now = datetime.now()
            if format_type == "date":
                return now.strftime("%Y-%m-%d")
            elif format_type == "time":
                return now.strftime("%H:%M:%S")
            else:
                return now.strftime("%Y-%m-%d %H:%M:%S")

        return Tool(name="DateTime", func=get_current_datetime,
                    description="Get current date/time. Options: 'date', 'time', or 'full'")

class MultiAgentSystem:
    def __init__(self, api_key: str):
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-pro",
            google_api_key=api_key,
            temperature=0.7,
            convert_system_message_to_human=True
        )
        self.memory_store = {}
        self.conversation_memory = ConversationBufferWindowMemory(
            memory_key="chat_history", k=10, return_messages=True
        )
        self.tools = self._initialize_tools()
        self.agent = self._create_agent()

    def _initialize_tools(self):
        tools = []
        try:
            tools.extend([
                DuckDuckGoSearchRun(api_wrapper=DuckDuckGoSearchAPIWrapper()),
                WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
            ])
        except Exception as e:
            st.warning(f"Search tools may have limited functionality: {e}")

        tools.append(InnovativeAgentTools.get_calculator_tool())
        tools.append(InnovativeAgentTools.get_datetime_tool())
        tools.extend(InnovativeAgentTools.get_memory_tool(self.memory_store))
        return tools

    def _create_agent(self):
        prompt = PromptTemplate.from_template("""
You are an advanced AI assistant with access to multiple tools and persistent memory.

AVAILABLE TOOLS:
{tools}

TOOL USAGE FORMAT:
- Think step by step about what you need to do
- Use Action: tool_name
- Use Action Input: your input
- Wait for Observation
- Continue until you have a final answer

CONVERSATION HISTORY:
{chat_history}

CURRENT QUESTION: {input}

REASONING PROCESS:
{agent_scratchpad}

Begin your response with your thought process, then take action if needed.
""")

        agent = create_react_agent(self.llm, self.tools, prompt)
        return AgentExecutor(agent=agent, tools=self.tools, memory=self.conversation_memory,
                             verbose=True, handle_parsing_errors=True, max_iterations=5)

    def chat(self, message: str, callback_handler=None):
        try:
            if callback_handler:
                response = self.agent.invoke({"input": message}, {"callbacks": [callback_handler]})
            else:
                response = self.agent.invoke({"input": message})
            return response["output"]
        except Exception as e:
            return f"Error processing request: {str(e)}"

# Streamlit App
st.set_page_config(page_title=" Advanced LangChain Agent", page_icon="", layout="wide")

st.markdown("""
<style>
.main-header {
    background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
    padding: 1rem; border-radius: 10px; color: white; text-align: center; margin-bottom: 2rem;
}
.agent-response {
    background-color: #f0f2f6; padding: 1rem; border-radius: 10px;
    border-left: 4px solid #667eea; margin: 1rem 0;
}
.memory-card {
    background-color: #e8f4fd; padding: 1rem; border-radius: 8px; margin: 0.5rem 0;
}
</style>
""", unsafe_allow_html=True)

st.markdown('<div class="main-header"><h1> Advanced Multi-Agent System</h1><p>Powered by LangChain + Gemini API</p></div>', unsafe_allow_html=True)

with st.sidebar:
    st.header(" Configuration")
    api_key = st.text_input(" Google AI API Key", type="password", value=GOOGLE_API_KEY)

    if not api_key:
        st.error("Please enter your Google AI API key")
        st.stop()

    st.success(" API Key configured")

    st.header(" Agent Capabilities")
    st.markdown("- Web Search\\n- Wikipedia\\n- Calculator\\n- Memory\\n- Date/Time")

    if 'agent_system' in st.session_state and st.session_state.agent_system.memory_store:
        st.header(" Memory Store")
        for key, value in st.session_state.agent_system.memory_store.items():
            st.markdown(f'<div class="memory-card"><strong>{key}:</strong> {value}</div>', unsafe_allow_html=True)

if 'agent_system' not in st.session_state:
    with st.spinner(" Initializing Agent..."):
        st.session_state.agent_system = MultiAgentSystem(api_key)
    st.success(" Agent Ready!")

if 'messages' not in st.session_state:
    st.session_state.messages = [{
        "role": "assistant",
        "content": " Hello! I'm your advanced AI assistant. I can search, calculate, remember information, and more! Try asking me to: calculate something, search for information, or remember a fact about you."
    }]

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask me anything..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        callback_handler = StreamlitCallbackHandler(st.container())
        with st.spinner(" Thinking..."):
            response = st.session_state.agent_system.chat(prompt, callback_handler)
        st.markdown(f'<div class="agent-response">{response}</div>', unsafe_allow_html=True)
        st.session_state.messages.append({"role": "assistant", "content": response})

# Example buttons
st.header(" Try These Examples")
col1, col2, col3 = st.columns(3)
with col1:
    if st.button(" Calculate 15 * 8 + 32"):
        st.rerun()
with col2:
    if st.button(" Search AI news"):
        st.rerun()
with col3:
    if st.button(" Remember my name is Alex"):
        st.rerun()
'''

    with open('streamlit_app.py', 'w') as f:
        f.write(app_content)

    print(" Streamlit app file created successfully!")

    if setup_ngrok_auth(NGROK_AUTH_TOKEN):
        start_streamlit_with_ngrok()
    else:
        print(" Ngrok authentication failed. Trying alternative methods...")
        try_alternative_tunnels()

In the run_in_colab() function, we make it easy to deploy the Streamlit app directly from a Google Colab environment. We begin by installing all required packages, then dynamically generate and write the complete Streamlit app code to a streamlit_app.py file. We verify the presence of a valid ngrok token to enable public access to the app from Colab, and if it’s missing or invalid, we guide ourselves through fallback tunneling options. This setup allows us to interact with our AI agent from anywhere, all within a few cells in Colab. Check out the full Notebook here

def start_streamlit_with_ngrok():
    """Start Streamlit with ngrok tunnel"""
    import subprocess
    import threading
    from pyngrok import ngrok

    def start_streamlit():
        subprocess.run(['streamlit', 'run', 'streamlit_app.py', '--server.port=8501', '--server.headless=true'])

    print(" Starting Streamlit server...")
    thread = threading.Thread(target=start_streamlit)
    thread.daemon = True
    thread.start()

    time.sleep(5)

    try:
        print(" Creating ngrok tunnel...")
        public_url = ngrok.connect(8501)
        print(f" SUCCESS! Access your app at: {public_url}")
        print(" Your Advanced LangChain Agent is now running publicly!")
        print(" You can share this URL with others!")

        print(" Keeping tunnel alive... Press Ctrl+C to stop")
        try:
            ngrok_process = ngrok.get_ngrok_process()
            ngrok_process.proc.wait()
        except KeyboardInterrupt:
            print(" Shutting down...")
            ngrok.kill()

    except Exception as e:
        print(f" Ngrok tunnel failed: {e}")
        try_alternative_tunnels()


def try_alternative_tunnels():
    """Try alternative tunneling methods"""
    print(" Trying alternative tunnel methods...")

    import subprocess
    import threading

    def start_streamlit():
        subprocess.run(['streamlit', 'run', 'streamlit_app.py', '--server.port=8501', '--server.headless=true'])

    thread = threading.Thread(target=start_streamlit)
    thread.daemon = True
    thread.start()

    time.sleep(3)

    print(" Streamlit is running on http://localhost:8501")
    print("\n ALTERNATIVE TUNNEL OPTIONS:")
    print("1. localtunnel: Run this in a new cell:")
    print("   !npx localtunnel --port 8501")
    print("\n2. serveo.net: Run this in a new cell:")
    print("   !ssh -R 80:localhost:8501 serveo.net")
    print("\n3. Colab public URL (if available):")
    print("   Use the 'Public URL' button in Colab's interface")

    try:
        while True:
            time.sleep(60)
    except KeyboardInterrupt:
        print(" Shutting down...")


if __name__ == "__main__":
    try:
        get_ipython()
        print(" Google Colab detected - starting setup...")
        run_in_colab()
    except NameError:
        main()

In this final part, we set up the execution logic to run the app either in a local environment or inside Google Colab. The start_streamlit_with_ngrok() function launches the Streamlit server in the background and uses ngrok to expose it publicly, making it easy to access and share. If ngrok fails, the try_alternative_tunnels() function activates with alternative tunneling options, such as LocalTunnel and Serveo. With the __main__ block, we automatically detect if we’re in Colab and launch the appropriate setup, making the entire deployment process smooth, flexible, and shareable from anywhere.

In conclusion, we’ll have a fully functional AI agent running inside a sleek Streamlit interface, capable of answering queries, remembering user inputs, and even sharing its services publicly using ngrok. We’ve seen how easily Streamlit enables us to integrate advanced AI functionalities into an engaging and user-friendly app. From here, we can expand the agent’s tools, plug it into larger workflows, or deploy it as part of our intelligent applications. With Streamlit as the front-end and LangChain agents powering the logic, we’ve built a solid foundation for next-gen interactive AI experiences.

Check out the full Notebook here. All credit for this research goes to the researchers of this project.
The post Build an Intelligent Multi-Tool AI Agent Interface Using Streamlit for Seamless Real-Time Interaction appeared first on MarkTechPost.

UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation …

Cybersecurity has become a significant area of interest in artificial intelligence, driven by the increasing reliance on large software systems and the expanding capabilities of AI tools. As threats evolve in complexity, ensuring the security of software systems has become more than just a matter of conventional protections; it now intersects with automated reasoning, vulnerability detection, and code-level comprehension. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers have been developing benchmarks and methods to systematically evaluate AI agents’ ability to understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and real-world cybersecurity complexities remains a key challenge.

Problem with Existing Benchmarks

One pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks often dominate current testing methods, which rarely mirror the messy and layered reality of large-scale software repositories. These environments involve intricate input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it’s difficult to determine whether AI agents can be trusted to perform tasks like vulnerability detection or exploit development. More importantly, current benchmarks don’t reflect the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a critical evaluation gap.

Limitations of Current Tools

Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on capture-the-flag-style tasks that offer limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to engage real-world vulnerabilities, but they often do so at a limited scale. Furthermore, many of the tools rely on either synthetic test cases or narrowly scoped challenge problems, which fail to represent the diversity of software inputs, execution paths, and bug types found in actual systems. Even specialized agents created for security analysis have been tested on benchmarks with only tens or a few hundred tasks, far short of the complexity of real-world threat landscapes.

Introducing CyberGym

Researchers introduced CyberGym, a large-scale and comprehensive benchmarking tool specifically designed to evaluate AI agents in real-world cybersecurity contexts. Developed at the University of California, Berkeley, CyberGym includes 1,507 distinct benchmark tasks sourced from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must generate a proof-of-concept test that reproduces the vulnerability in the unpatched version, and CyberGym evaluates success based on whether the vulnerability is triggered in the pre-patch version and absent in the post-patch one. This benchmark uniquely emphasizes the generation of Proof of Concepts (PoCs), a task that requires agents to traverse complex code paths and synthesize inputs to meet specific security conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.

CyberGym Evaluation Levels

The evaluation pipeline in CyberGym is built around four levels of difficulty, each increasing the amount of input information provided. At level 0, the agent is given only the codebase with no hint of the vulnerability. Level 1 adds a natural language description. Level 2 introduces a ground-truth proof of concept (PoC) and crash stack trace, while Level 3 includes the patch itself and the post-patch codebase. Each level presents a new layer of reasoning and complexity. For instance, in level 1, agents must infer the vulnerability’s location and context purely from its textual description and codebase. To ensure benchmark quality, CyberGym applies filters such as checking the informativeness of patch commit messages, validating proof-of-concept (PoC) reproducibility, and removing redundancy by comparing stack traces. The final dataset comprises codebases with a median of 1,117 files and 387,491 lines of code, ranging up to over 40,000 files and 7 million lines of code. The patch sizes also vary, modifying a median of 1 file and seven lines, but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities target various crash types, with 30.4% related to heap-buffer-overflow READ and 19.0% due to uninitialized value use.
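
To make the success criterion concrete, here is an illustrative sketch in the spirit of CyberGym's check (not its actual harness): a PoC counts as a reproduction only if it crashes the pre-patch executable and leaves the post-patch executable intact. Binary paths and CLI conventions below are assumptions for illustration:

import subprocess

def reproduces_vulnerability(poc_path: str,
                             prepatch_bin: str,
                             postpatch_bin: str,
                             timeout: int = 30) -> bool:
    """Success iff the PoC crashes the pre-patch build but not the post-patch build."""

    def crashes(binary: str) -> bool:
        try:
            result = subprocess.run([binary, poc_path],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # hangs are not counted as crashes in this sketch
        # Sanitizer aborts and signals typically surface as non-zero return codes.
        return result.returncode != 0

    return crashes(prepatch_bin) and not crashes(postpatch_bin)

# Example (hypothetical paths):
# ok = reproduces_vulnerability("poc.bin", "./target_prepatch", "./target_postpatch")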

Experimental Results

When tested against this benchmark, existing agents showed limited success. Among four agent frameworks, OpenHands, Codex, ENiGMA, and Cybench, the top performer was OpenHands combined with Claude-3.7-Sonnet, which reproduced only 11.9% of target vulnerabilities. This performance dropped significantly when dealing with longer PoC inputs, as success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models, such as DeepSeek-V3, lagged, with only a 3.6% success rate. Even specialized models fine-tuned for code reasoning, like SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Surprisingly, richer input information at higher difficulty levels increased performance: level 3 saw 17.1% success, while level 0 achieved only 3.5%. Analysis also revealed that most successful PoC reproductions occurred between 20 and 40 execution steps, with many runs exceeding 90 steps and ultimately failing. Despite these challenges, agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched ones across real-world projects, demonstrating their latent capacity for novel discovery.

Key Takeaways

Benchmark Volume and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.

Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring under 5%.

Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.

Length Sensitivity: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.

Discovery Potential: 15 new zero-day vulnerabilities were discovered by agent-generated PoCs, validating their potential use in real-world security analysis.

Model Behavior: Most successful exploits were generated early in the task execution, with diminishing returns after 80 steps.

Tool Interactions: Agents performed better when allowed to interact with tools (e.g., using ‘awk’, ‘grep’, or installing ‘xxd’) and adapt PoCs based on runtime feedback.

Conclusion

In conclusion, this study highlights a critical problem: evaluating AI in cybersecurity is not only challenging but essential for understanding its limitations and capabilities. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the issue with a practical and detailed benchmark that forces agents to reason deeply across entire codebases, generate valid exploits, and adapt through iteration. The results make it clear that while current agents show promise, especially in discovering new bugs, there is still a long road ahead to enable AI to contribute to cybersecurity at scale reliably.

Check out the Paper, GitHub Page, Leaderboard. All credit for this research goes to the researchers of this project.
The post UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework to Evaluate AI Agents on Large-Scale Vulnerabilities Across Massive Codebases appeared first on MarkTechPost.

From Backend Automation to Frontend Collaboration: What’s New in AG- …

Introduction

AI agents are increasingly moving from pure backend automators to visible, collaborative elements within modern applications. However, making agents genuinely interactive—capable of both responding to users and proactively guiding workflows—has long been an engineering headache. Each team ends up building custom communication channels, event handling, and state management, all for similar interaction needs.

The initial release of AG‑UI, announced in May 2025, served as a practical, open‑source proof-of-concept protocol for inline agent-user communication. It introduced a single-stream architecture—typically HTTP POST paired with Server-Sent Events (SSE)—and established a vocabulary of structured JSON events (e.g., TEXT_MESSAGE_CONTENT, TOOL_CALL_START, STATE_DELTA) that could drive interactive front-end components. The first version addressed core integration challenges—real-time streaming, tool orchestration, shared state, and standardized event handling—but users found that further formalization of event types, versioning, and framework support was needed for broader production use.

AG-UI's latest update takes a different approach. Instead of yet another toolkit, it offers a lightweight protocol that standardizes the conversation between agents and user interfaces. This new version brings the protocol closer to production quality, improves event clarity, and expands compatibility with real-world agent frameworks and clients.

What Sets AG-UI’s Latest Update Apart

AG-UI’s latest update is an incremental but meaningful step for agent-driven applications. Unlike earlier ad-hoc attempts at interactivity, the latest update of AG-UI is built around explicit, versioned events. The protocol isn’t tightly coupled to any particular stack; it’s designed to work with multiple agent backends and client types out of the box.

Key features in the latest update of AG-UI include:

A formal set of ~16 event types, covering the full lifecycle of an agent—streamed outputs, tool invocations, state updates, user prompts, and error handling.

Cleaner event schemas, allowing clients and agents to negotiate capabilities and synchronize state more reliably.

More robust support for both direct (native) integration and adapter-based wrapping of legacy agents.

Expanded documentation and SDKs that make the protocol practical for production use, not just experimentation.

Interactive Agents Require Consistency

Many AI agents today remain hidden in the backend, designed to handle requests and return results, with little regard for real-time user interaction. Making agents interactive means solving for several technical challenges:

Streaming: Agents need to send incremental results or messages as soon as they’re available, not just at the end of a process.

Shared State: Both agent and UI should stay in sync, reflecting changes as the task progresses.

Tool Calls: Agents must be able to request external tools (such as APIs or user actions) and get results back in a structured way.

Bidirectional Messaging: Users should be able to respond or guide the agent, not just passively observe.

Security and Control: Tool invocation, cancellations, and error signals should be explicit and managed safely.

Without a shared protocol, every developer ends up reinventing these wheels—often imperfectly.

How the Latest Update of AG-UI Works

AG-UI’s latest update formalizes the agent-user interaction as a stream of typed events. Agents emit these events as they operate; clients subscribe to the stream, interpret the events, and send responses when needed.

The Event Stream

The core of the latest update of AG-UI is its event taxonomy. There are ~16 event types, including:

message: Agent output, such as a status update or a chunk of generated text.

function_call: Agent asks the client to run a function or tool, often requiring an external resource or user action.

state_update: Synchronizes variables or progress information.

input_request: Prompts the user for a value or choice.

tool_result: Sends results from tools back to the agent.

error and control: Signal errors, cancellations, or completion.

All events are JSON-encoded, typed, and versioned. This structure makes it straightforward to parse events, handle errors gracefully, and add new capabilities over time.
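
As a rough illustration of what consuming such events can look like, the sketch below parses a JSON event and dispatches on its type. The exact field names are defined by the AG-UI spec; the shapes here are assumptions used only to show the pattern:

import json

raw_event = '{"type": "state_update", "version": "1.0", "payload": {"progress": 0.42}}'

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "message":
        print("agent says:", event["payload"].get("text", ""))
    elif etype == "state_update":
        print("sync state:", event["payload"])
    elif etype == "input_request":
        print("agent needs input:", event["payload"])
    elif etype == "error":
        print("agent error:", event["payload"])
    else:
        # Unknown or newer event types can be ignored gracefully,
        # which is what typed, versioned schemas are meant to allow.
        print("unhandled event type:", etype)

handle_event(raw_event)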

Integrating Agents and Clients

There are two main patterns for integration:

Native: Agents are built or modified to emit AG-UI events directly during execution.

Adapter: For legacy or third-party agents, an adapter module can intercept outputs and translate them into AG-UI events.

On the client side, applications open a persistent connection (usually via SSE or WebSocket), listen for events, and update their interface or send structured responses as needed.

The protocol is intentionally transport-agnostic, but supports real-time streaming for responsiveness.
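
A minimal client-side sketch, assuming a hypothetical agent endpoint that streams events as Server-Sent-Events-style "data: {...}" lines (the URL and request payload are placeholders, not part of the AG-UI spec):

import json
import requests

AGENT_URL = "http://localhost:8000/agent/stream"  # placeholder endpoint

def stream_agent_events(user_message: str):
    """Yield parsed event dicts from an SSE-style response stream."""
    with requests.post(AGENT_URL, json={"input": user_message}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue  # skip keep-alives and comments
            yield json.loads(line[len("data:"):].strip())

# for event in stream_agent_events("summarize my open tickets"):
#     print(event["type"], event.get("payload"))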

Adoption and Ecosystem

Since its initial release, AG-UI has seen adoption among popular agent orchestration frameworks. The latest version's expanded event schema and improved documentation have accelerated integration efforts.

Current or in-progress integrations include:

LangChain, CrewAI, Mastra, AG2, Agno, LlamaIndex: Each offers orchestration for agents that can now interactively surface their internal state and progress.

AWS, A2A, ADK, AgentOps: Work is ongoing to bridge cloud, monitoring, and agent operation tools with AG-UI.

Human Layer (Slack integration): Demonstrates how agents can become collaborative team members in messaging environments.

The protocol has gained traction with developers looking to avoid building custom socket handlers and event schemas for each project. It currently has more than 3,500 GitHub stars and is being used in a growing number of agent-driven products.

Developer Experience

The latest update of AG-UI is designed to minimize friction for both agent builders and frontend engineers.

SDKs and Templates: The CLI tool npx create-ag-ui-app scaffolds a project with all dependencies and sample integrations included.

Clear Schemas: Events are versioned and documented, supporting robust error handling and future extensibility.

Practical Documentation: Real-world integration guides, example flows, and visual assets help reduce trial and error.

All resources and guides are available at AG-UI.com.

Use Cases

Embedded Copilots: Agents that work alongside users in existing apps, providing suggestions and explanations as tasks evolve.

Conversational UIs: Dialogue systems that maintain session state and support multi-turn interactions with tool usage.

Workflow Automation: Agents that orchestrate sequences involving both automated actions and human-in-the-loop steps.

Conclusion

The latest update of AG-UI provides a well-defined, lightweight protocol for building interactive agent-driven applications. Its event-driven architecture abstracts away much of the complexity of agent-user synchronization, real-time communication, and state management. With explicit schemas, broad framework support, and a focus on practical integration, the latest AG-UI update enables development teams to build more reliable, interactive AI systems without repeatedly solving the same low-level problems.

Developers interested in adopting the latest update of AG-UI can find SDKs, technical documentation, and integration assets at AG-UI.com.

CopilotKit team is also organizing a Webinar.

Support open-source and Star the AG-UI GitHub repo.

Discord Community: https://go.copilotkit.ai/AG-UI-Discord

Thanks to the CopilotKit team for the thought leadership and resources behind this article. The CopilotKit team supported us in creating this content.
The post From Backend Automation to Frontend Collaboration: What’s New in AG-UI Latest Update for AI Agent-User Interaction appeared first on MarkTechPost.

MiniMax AI Releases MiniMax-M1: A 456B Parameter Hybrid Model for Long …

The Challenge of Long-Context Reasoning in AI Models

Large reasoning models are not only designed to understand language but are also structured to think through multi-step processes that require prolonged attention spans and contextual comprehension. As the expectations from AI grow, especially in real-world and software development environments, researchers have sought architectures that can handle longer inputs and sustain deep, coherent reasoning chains without overwhelming computational costs.

Computational Constraints with Traditional Transformers

The primary difficulty in expanding these reasoning capabilities lies in the excessive computational load that comes with longer generation lengths. Traditional transformer-based models employ a softmax attention mechanism, which scales quadratically with the input size. This limits their capacity to handle long input sequences or extended chains of thought efficiently. This problem becomes even more pressing in areas that require real-time interaction or cost-sensitive applications, where inference expenses are significant.

Existing Alternatives and Their Limitations

Efforts to address this issue have yielded a range of methods, including sparse attention and linear attention variants. Some teams have experimented with state-space models and recurrent networks as alternatives to traditional attention structures. However, these innovations have seen limited adoption in the most competitive reasoning models due to either architectural complexity or a lack of scalability in real-world deployments. Even large-scale systems, such as Tencent’s Hunyuan-T1, which utilizes a novel Mamba architecture, remain closed-source, thereby restricting wider research engagement and validation.

Introduction of MiniMax-M1: A Scalable Open-Weight Model

Researchers at MiniMax AI introduced MiniMax-M1, a new open-weight, large-scale reasoning model that combines a mixture of experts’ architecture with lightning-fast attention. Built as an evolution of the MiniMax-Text-01 model, MiniMax-M1 contains 456 billion parameters, with 45.9 billion activated per token. It supports context lengths of up to 1 million tokens—eight times the capacity of DeepSeek R1. This model addresses compute scalability at inference time, consuming only 25% of the FLOPs required by DeepSeek R1 at 100,000 token generation length. It was trained using large-scale reinforcement learning on a broad range of tasks, from mathematics and coding to software engineering, marking a shift toward practical, long-context AI models.

Hybrid-Attention with Lightning Attention and Softmax Blocks

To optimize this architecture, MiniMax-M1 employs a hybrid attention scheme where every seventh transformer block uses traditional softmax attention, followed by six blocks using lightning attention. This significantly reduces computational complexity while preserving performance. The lightning attention itself is I/O-aware, adapted from linear attention, and is particularly effective at scaling reasoning lengths to hundreds of thousands of tokens. For reinforcement learning efficiency, the researchers introduced a novel algorithm called CISPO. Instead of clipping token updates as traditional methods do, CISPO clips importance sampling weights, enabling stable training and consistent token contributions, even in off-policy updates.
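
The following is a rough, hypothetical sketch of the layer pattern described above (not MiniMax's code): one softmax-attention block followed by six lightning-attention blocks, repeated through the network. The depth and block names are placeholders for illustration:

NUM_LAYERS = 84  # hypothetical depth, for illustration only

def build_layer_schedule(num_layers: int, group_size: int = 7):
    """Return a list like ['softmax', 'lightning', 'lightning', ...] repeating every group."""
    return [
        "softmax" if i % group_size == 0 else "lightning"
        for i in range(num_layers)
    ]

schedule = build_layer_schedule(NUM_LAYERS)
print(schedule[:8])  # one softmax block, then six lightning blocks, then the next softmax
print(schedule.count("softmax"), "softmax blocks,", schedule.count("lightning"), "lightning blocks")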

The CISPO Algorithm and RL Training Efficiency

The CISPO algorithm proved essential in overcoming the training instability faced in hybrid architectures. In comparative studies using the Qwen2.5-32B baseline, CISPO achieved a 2x speedup compared to DAPO. Leveraging this, the full reinforcement learning cycle for MiniMax-M1 was completed in just three weeks using 512 H800 GPUs, with a rental cost of approximately $534,700. The model was trained on a diverse dataset comprising 41 logic tasks generated via the SynLogic framework and real-world software engineering environments derived from the SWE bench. These environments utilized execution-based rewards to guide performance, resulting in stronger outcomes in practical coding tasks.
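
To illustrate the contrast with ratio-clipping methods, here is a simplified, non-authoritative sketch of a CISPO-style loss: the importance-sampling weight is clipped and detached, so gradients still flow through every token's log-probability. Epsilon, shapes, and the exact objective are assumptions, not the published algorithm:

import torch

def cispo_like_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                 # importance-sampling weight
    clipped_weight = torch.clamp(ratio, 1 - eps, 1 + eps).detach()
    # The weight is clipped (not the token update), so every token keeps contributing.
    return -(clipped_weight * advantages * logp_new).mean()

logp_new = torch.log(torch.tensor([0.30, 0.10, 0.55], requires_grad=True))
logp_old = torch.log(torch.tensor([0.25, 0.40, 0.50]))
advantages = torch.tensor([1.0, -0.5, 0.3])
loss = cispo_like_loss(logp_new, logp_old, advantages)
loss.backward()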

Benchmark Results and Comparative Performance

MiniMax-M1 delivered compelling benchmark results. Compared to DeepSeek-R1 and Qwen3-235B, it excelled in software engineering, long-context processing, and agentic tool use. Although it trailed the latest DeepSeek-R1-0528 in math and coding contests, it surpassed both OpenAI o3 and Claude 4 Opus in long-context understanding benchmarks. Furthermore, it outperformed Gemini 2.5 Pro in the TAU-Bench agent tool use evaluation.

Conclusion: A Scalable and Transparent Model for Long-Context AI

MiniMax-M1 presents a significant step forward by offering both transparency and scalability. By addressing the dual challenge of inference efficiency and training complexity, the research team at MiniMax AI has set a precedent for open-weight reasoning models. This work not only brings a solution to compute constraints but also introduces practical methods for scaling language model intelligence into real-world applications.

Check out the Paper, Model and GitHub Page. All credit for this research goes to the researchers of this project.
The post MiniMax AI Releases MiniMax-M1: A 456B Parameter Hybrid Model for Long-Context and Reinforcement Learning RL Tasks appeared first on MarkTechPost.

OpenAI Releases an Open‑Sourced Version of a Customer Service Agent …

OpenAI has open-sourced a new multi-agent customer service demo on GitHub, showcasing how to build domain-specialized AI agents using its Agents SDK. This project—titled openai-cs-agents-demo—models an airline customer service chatbot capable of handling a range of travel-related queries by dynamically routing requests to specialized agents. Built with a Python backend and a Next.js frontend, the system provides both a functional conversational interface and a visual trace of agent handoffs and guardrail activations.

The architecture is divided into two main components. The Python backend handles agent orchestration using the Agents SDK, while the Next.js frontend offers a chat interface and an interactive visualization of agent transitions. This setup provides transparency into the decision-making and delegation process as agents triage, respond to, or reject user queries. The demo operates with several focused agents: a Triage Agent, Seat Booking Agent, Flight Status Agent, Cancellation Agent, and an FAQ Agent. Each of these is configured with specialized instructions and tools to fulfill their specific sub-tasks.

When a user enters a request—such as “change my seat” or “cancel my flight”—the Triage Agent processes the input to determine intent and dispatches the query to the appropriate downstream agent. For example, a booking change request will be routed to the Seat Booking Agent, which can verify confirmation numbers, offer seat map choices, and finalize seat changes. If a cancellation is requested, the system hands off to the Cancellation Agent, which follows a structured flow to confirm and execute the cancellation. The demo also includes a Flight Status Agent for real-time flight inquiries and an FAQ Agent that answers general questions about baggage policies or aircraft types.

A key strength of the system lies in its integration of guardrails for safety and relevance. The demo features two: a Relevance Guardrail and a Jailbreak Guardrail. The Relevance Guardrail filters out off-topic queries—for example, rejecting prompts like “write me a poem about strawberries.” The Jailbreak Guardrail blocks attempts to circumvent system boundaries or manipulate agent behavior, such as asking the model to reveal its internal instructions. When either guardrail is triggered, the system highlights it in the trace and sends a structured error message to the user.

The Agents SDK itself serves as the orchestration backbone. Each agent is defined as a composable unit with prompt templates, tool access, handoff logic, and output schemas. The SDK handles chaining agents via “handoffs,” supports real-time tracing, and allows developers to enforce input/output constraints with guardrails. This framework is the same one powering OpenAI’s internal experiments with tool-using and reasoning agents, but now exposed in an educational and extendable format.
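
A minimal sketch of the handoff pattern the demo is built on, using the OpenAI Agents SDK (pip install openai-agents). The real demo adds tools, guardrails, and more agents; the names and instructions below are simplified stand-ins:

from agents import Agent, Runner

faq_agent = Agent(
    name="FAQ Agent",
    instructions="Answer general airline questions about baggage and aircraft types.",
)

seat_booking_agent = Agent(
    name="Seat Booking Agent",
    instructions="Help the customer change their seat after confirming their booking.",
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route each customer request to the most appropriate specialist agent.",
    handoffs=[faq_agent, seat_booking_agent],
)

result = Runner.run_sync(triage_agent, "I'd like to move to a window seat, please.")
print(result.final_output)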

Developers can run the demo locally by starting the Python backend server with Uvicorn and launching the frontend with a single npm run dev command. The entire system is configurable—developers can plug in new agents, define their own task routing strategies, and implement custom guardrails. With full transparency into prompts, decisions, and trace logs, the demo offers a practical foundation for real-world conversational AI systems in customer support or other enterprise domains.

By releasing this reference implementation, OpenAI provides a tangible example of how multi-agent coordination, tool use, and safety checks can be combined into a robust service experience. It’s particularly valuable for developers seeking to understand the anatomy of agentic systems—and how to build modular, controllable AI workflows that are both transparent and production-ready.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.
The post OpenAI Releases an Open‑Sourced Version of a Customer Service Agent Demo with the Agents SDK appeared first on MarkTechPost.

Build a scalable AI video generator using Amazon SageMaker AI and CogV …

In recent years, the rapid advancement of artificial intelligence and machine learning (AI/ML) technologies has revolutionized various aspects of digital content creation. One particularly exciting development is the emergence of video generation capabilities, which offer unprecedented opportunities for companies across diverse industries. This technology allows for the creation of short video clips that can be seamlessly combined to produce longer, more complex videos. The potential applications of this innovation are vast and far-reaching, promising to transform how businesses communicate, market, and engage with their audiences.
Video generation technology presents a myriad of use cases for companies looking to enhance their visual content strategies. For instance, ecommerce businesses can use this technology to create dynamic product demonstrations, showcasing items from multiple angles and in various contexts without the need for extensive physical photoshoots. In the realm of education and training, organizations can generate instructional videos tailored to specific learning objectives, quickly updating content as needed without re-filming entire sequences. Marketing teams can craft personalized video advertisements at scale, targeting different demographics with customized messaging and visuals. Furthermore, the entertainment industry stands to benefit greatly, with the ability to rapidly prototype scenes, visualize concepts, and even assist in the creation of animated content.
The flexibility offered by combining these generated clips into longer videos opens up even more possibilities. Companies can create modular content that can be quickly rearranged and repurposed for different displays, audiences, or campaigns. This adaptability not only saves time and resources, but also allows for more agile and responsive content strategies. As we delve deeper into the potential of video generation technology, it becomes clear that its value extends far beyond mere convenience, offering a transformative tool that can drive innovation, efficiency, and engagement across the corporate landscape.
In this post, we explore how to implement a robust AWS-based solution for video generation that uses the CogVideoX model and Amazon SageMaker AI.
Solution overview
Our architecture delivers a highly scalable and secure video generation solution using AWS managed services. The data management layer implements three purpose-specific Amazon Simple Storage Service (Amazon S3) buckets—for input videos, processed outputs, and access logging—each configured with appropriate encryption and lifecycle policies to support data security throughout its lifecycle.
For compute resources, we use AWS Fargate for Amazon Elastic Container Service (Amazon ECS) to host the Streamlit web application, providing serverless container management with automatic scaling capabilities. Traffic is efficiently distributed through an Application Load Balancer. The AI processing pipeline uses SageMaker AI processing jobs to handle video generation tasks, decoupling intensive computation from the web interface for cost optimization and enhanced maintainability. User prompts are refined through Amazon Bedrock, which feeds into the CogVideoX-5b model for high-quality video generation, creating an end-to-end solution that balances performance, security, and cost-efficiency.
The following diagram illustrates the solution architecture.

CogVideoX model
CogVideoX is an open source, state-of-the-art text-to-video generation model capable of producing 10-second continuous videos at 16 frames per second with a resolution of 768×1360 pixels. The model effectively translates text prompts into coherent video narratives, addressing common limitations in previous video generation systems.
The model uses three key innovations:

A 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions, improving compression efficiency and video quality
An expert transformer with adaptive LayerNorm that enhances text-to-video alignment through deeper fusion between modalities
Progressive training and multi-resolution frame pack techniques that enable the creation of longer, coherent videos with significant motion elements

CogVideoX also benefits from an effective text-to-video data processing pipeline with various preprocessing strategies and a specialized video captioning method, contributing to higher generation quality and better semantic alignment. The model’s weights are publicly available, making it accessible for implementation in various business applications, such as product demonstrations and marketing content. The following diagram shows the architecture of the model.
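Because the weights are public, you can also experiment with the model outside this solution, for example locally or inside a SageMaker processing script, using the Hugging Face diffusers library. The following is a minimal sketch; the checkpoint name, frame count, and inference settings are assumptions and will differ from the tuned configuration used by the solution's processing job.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the publicly released text-to-video checkpoint (the solution uses CogVideoX-5b)
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for lower GPU memory use

prompt = "A bee on a flower."
result = pipe(
    prompt=prompt,
    num_frames=49,            # assumed value; supported lengths depend on the checkpoint
    num_inference_steps=50,
    guidance_scale=6.0,
)

# Write the generated frames to an MP4 file
export_to_video(result.frames[0], "output.mp4", fps=8)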

Prompt enhancement
To improve the quality of video generation, the solution provides an option to enhance user-provided prompts. This is done by instructing a large language model (LLM), in this case Anthropic’s Claude, to take a user’s initial prompt and expand upon it with additional details, creating a more comprehensive description for video creation. The prompt consists of three parts:

Role section – Defines the AI’s purpose in enhancing prompts for video generation
Task section – Specifies the instructions to be performed on the original prompt
Prompt section – Where the user’s original input is inserted

By adding more descriptive elements to the original prompt, this system aims to provide richer, more detailed instructions to video generation models, potentially resulting in more accurate and visually appealing video outputs. We use the following prompt template for this solution:
"""
<Role>
Your role is to enhance the user prompt that is given to you by
providing additional details to the prompt. The end goal is to
convert the user prompt into a short video clip, so it is necessary
to provide as much information as you can.
</Role>
<Task>
You must add details to the user prompt in order to enhance it for
video generation. You must provide a 1 paragraph response. No
more and no less. Only include the enhanced prompt in your response.
Do not include anything else.
</Task>
<Prompt>
{prompt}
</Prompt>
"""
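In the application, this template is sent to Anthropic's Claude through Amazon Bedrock. A minimal sketch of that call using boto3's Converse API follows; the model ID and Region are assumptions, and the deployed solution may structure the request differently.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# The prompt template shown above, with {prompt} as the placeholder for user input
PROMPT_TEMPLATE = """<Role>...</Role>
<Task>...</Task>
<Prompt>
{prompt}
</Prompt>"""


def enhance_prompt(user_prompt: str) -> str:
    """Ask Claude (via Amazon Bedrock) to expand a short prompt into a richer description."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID
        messages=[{
            "role": "user",
            "content": [{"text": PROMPT_TEMPLATE.format(prompt=user_prompt)}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]


print(enhance_prompt("A bee on a flower."))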
Prerequisites
Before you deploy the solution, make sure you have the following prerequisites:

The AWS CDK Toolkit – Install the AWS CDK Toolkit globally using npm: npm install -g aws-cdk. This provides the core functionality for deploying infrastructure as code to AWS.
Docker Desktop – This is required for local development and testing. It makes sure container images can be built and tested locally before deployment.
The AWS CLI – The AWS Command Line Interface (AWS CLI) must be installed and configured with appropriate credentials. This requires an AWS account with necessary permissions. Configure the AWS CLI using aws configure with your access key and secret.
Python Environment – You must have Python 3.11+ installed on your system. We recommend using a virtual environment for isolation. This is required for both the AWS CDK infrastructure and Streamlit application.
Active AWS account – You will need to raise a SageMaker service quota request for the ml.g5.4xlarge instance type for processing jobs.

Deploy the solution
This solution has been tested in the us-east-1 AWS Region. Complete the following steps to deploy:

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install infrastructure dependencies:

cd infrastructure
pip install -r requirements.txt

Bootstrap the AWS CDK (if not already done in your AWS account):

cdk bootstrap

Deploy the infrastructure:

cdk deploy -c allowed_ips='["'$(curl -s ifconfig.me)'/32"]'
To access the Streamlit UI, choose the link for StreamlitURL in the AWS CDK output logs after deployment is successful. The following screenshot shows the Streamlit UI accessible through the URL.

Basic video generation
Complete the following steps to generate a video:

Input your natural language prompt into the text box at the top of the page.
Copy this prompt to the text box at the bottom.
Choose Generate Video to create a video using this basic prompt.

The following is the output from the simple prompt “A bee on a flower.”

Enhanced video generation
For higher-quality results, complete the following steps:

Enter your initial prompt in the top text box.
Choose Enhance Prompt to send your prompt to Amazon Bedrock.
Wait for Amazon Bedrock to expand your prompt into a more descriptive version.
Review the enhanced prompt that appears in the lower text box.
Edit the prompt further if desired.
Choose Generate Video to initiate the processing job with CogVideoX.

When processing is complete, your video will appear on the page with a download option. The following is an example of an enhanced prompt and output:
"""
A vibrant yellow and black honeybee gracefully lands on a large,
blooming sunflower in a lush garden on a warm summer day. The
bee’s fuzzy body and delicate wings are clearly visible as it
moves methodically across the flower’s golden petals, collecting
pollen. Sunlight filters through the petals, creating a soft,
warm glow around the scene. The bee’s legs are coated in pollen
as it works diligently, its antennae twitching occasionally. In
the background, other colorful flowers sway gently in a light
breeze, while the soft buzzing of nearby bees can be heard
"""
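For reference, a Streamlit front end like the one in the screenshots can be sketched in a few lines. The widget labels and the start_video_job helper below are hypothetical stand-ins for the solution's actual UI code, which hands the prompt off to a SageMaker processing job.

import streamlit as st


def start_video_job(prompt: str) -> str:
    """Hypothetical placeholder: the real app launches a SageMaker processing job here."""
    return "processing-job-id"


st.title("Video generation with CogVideoX")

original = st.text_area("Enter your prompt")

# In the real UI, "Enhance Prompt" sends the text to Amazon Bedrock and fills this box
enhanced = st.text_area("Enhanced prompt (editable)", value=original)

if st.button("Generate Video"):
    job_id = start_video_job(enhanced)
    st.info(f"Started video generation job: {job_id}")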

Add an image to your prompt
If you want to include an image with your text prompt, complete the following steps:

Complete the text prompt and optional enhancement steps.
Choose Include an Image.
Upload the photo you want to use.
With both text and image now prepared, choose Generate Video to start the processing job.

The following is an example of the previous enhanced prompt with an included image.

To view more samples, check out the CogVideoX gallery.
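Combining an image with a text prompt corresponds to CogVideoX's image-to-video variant. As a rough local equivalent of what the processing job does, the following diffusers sketch uses the publicly available THUDM/CogVideoX-5b-I2V checkpoint; the file names and inference settings are assumptions.

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Image-conditioned variant of CogVideoX released alongside the text-to-video model
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

image = load_image("bee_on_flower.png")  # assumed local file: the uploaded reference image
prompt = "A vibrant yellow and black honeybee gracefully lands on a blooming sunflower."

result = pipe(prompt=prompt, image=image, num_inference_steps=50, guidance_scale=6.0)
export_to_video(result.frames[0], "image_to_video_output.mp4", fps=8)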
Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this post:
cdk destroy
Considerations
Although our current architecture serves as an effective proof of concept, several enhancements are recommended for a production environment. Considerations include implementing Amazon API Gateway with AWS Lambda-backed REST endpoints for an improved interface and authentication, introducing a queue-based architecture using Amazon Simple Queue Service (Amazon SQS) for better job management and reliability (see the sketch that follows), and enhancing error handling and monitoring capabilities.
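As one example of the queue-based enhancement, the web layer could enqueue generation requests instead of starting jobs synchronously, with a separate worker polling the queue and launching the SageMaker processing job. The queue URL and message shape below are hypothetical.

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/video-generation-jobs"  # hypothetical


def submit_video_job(prompt: str, image_s3_uri: str | None = None) -> str:
    """Enqueue a video generation request for an asynchronous worker to pick up."""
    body = {"prompt": prompt, "image_s3_uri": image_s3_uri}
    response = sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))
    return response["MessageId"]


def poll_once() -> None:
    """Worker side: receive one message, process it, then delete it from the queue."""
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    for msg in messages.get("Messages", []):
        job = json.loads(msg["Body"])
        # ... launch the SageMaker processing job for job["prompt"] here ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])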
Conclusion
Video generation technology has emerged as a transformative force in digital content creation, as demonstrated by our comprehensive AWS-based solution using the CogVideoX model. By combining powerful AWS services like Fargate, SageMaker, and Amazon Bedrock with an innovative prompt enhancement system, we’ve created a scalable and secure pipeline capable of producing high-quality video clips. The architecture’s ability to handle both text-to-video and image-to-video generation, coupled with its user-friendly Streamlit interface, makes it an invaluable tool for businesses across sectors—from ecommerce product demonstrations to personalized marketing campaigns. As showcased in our sample videos, the technology delivers impressive results that open new avenues for creative expression and efficient content production at scale. This solution represents not just a technological advancement, but a glimpse into the future of visual storytelling and digital communication.
To learn more about CogVideoX, refer to CogVideoX on Hugging Face. Try out the solution for yourself, and share your feedback in the comments.

About the Authors
Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.
Natasha Tchir is a Cloud Consultant at the Generative AI Innovation Center, specializing in machine learning. With a strong background in ML, she now focuses on the development of generative AI proof-of-concept solutions, driving innovation and applied research within the GenAIIC.
Katherine Feng is a Cloud Consultant at AWS Professional Services within the Data and ML team. She has extensive experience building full-stack applications for AI/ML use cases and LLM-driven solutions.
Jinzhao Feng is a Machine Learning Engineer at AWS Professional Services. He focuses on architecting and implementing large-scale generative AI and classic ML pipeline solutions. He is specialized in FMOps, LLMOps, and distributed training.

Building trust in AI: The AWS approach to the EU AI Act

As AI adoption accelerates and reshapes our future, organizations are adapting to evolving regulatory frameworks. In our report commissioned from Strand Partners, Unlocking Europe's AI Potential in the Digital Decade 2025, 68% of European businesses surveyed said they struggle to understand their responsibilities under the EU AI Act. European businesses also reported that an estimated 40% of their IT spend goes toward compliance-related costs, and those uncertain about regulations plan to invest 28% less in AI over the next year. More clarity around regulation and compliance is critical to meeting the competitiveness targets set out by the European Commission.
The EU AI Act
The European Union's Artificial Intelligence Act (EU AI Act) establishes comprehensive regulations for the development, deployment, use, and provision of AI within the EU. It introduces a risk-based regulatory framework with the overarching goal of protecting fundamental rights and safety. The EU AI Act entered into force on August 1, 2024, and will apply in phases, with most requirements becoming applicable over the next 14 months. The first group of obligations, on prohibited AI practices and AI literacy, became applicable on February 2, 2025, with the remaining obligations to follow gradually.
AWS customers across industries use our AI services for a myriad of purposes, such as to provide better customer service, optimize their businesses, or create new experiences for their customers. We are actively evaluating how our services can best support customers to meet their compliance obligations, while maintaining AWS’s own compliance with the applicable provisions of the EU AI Act. As the European Commission continues to publish compliance guidance, such as the Guidelines of Prohibited AI Practices and the Guidelines on AI System Definition, we will continue to provide updates to our customers through our AWS Blog posts and other AWS channels.
The AWS approach to the EU AI Act
AWS has long been committed to AI solutions that are safe and respect fundamental rights. We take a people-centric approach that prioritizes education, science, and our customers’ needs to integrate responsible AI across the end-to-end AI lifecycle. As a leader in AI technology, AWS prioritizes trust in our AI offerings and supports the EU AI Act’s goal of promoting trustworthy AI products and services. We do this in several ways:

Amazon was among the first signatories of the EU’s AI Pact, and the first major cloud service provider to announce ISO/IEC 42001 accredited certification for its AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe.
AWS AI Service Cards and our frontier model safety framework enhance transparency. AWS AI Service Cards provide customers with a single place to find information on the intended use cases and limitations, responsible AI design choices, and performance optimization best practices for our AI services and models. Amazon’s frontier model safety framework focuses on severe risks that are unique to frontier AI models as they scale in size and capability, and which require specialized evaluation methods and safeguards.
Amazon Bedrock Guardrails helps customers implement safeguards tailored to their generative AI applications and aligned with their responsible AI policies.
Our Responsible AI Guide helps customers think through how to develop and design AI systems with responsible considerations across the AI lifecycle.
As part of our AI Ready Commitment, we provide free educational resources that extend beyond technology to encompass people, processes, and culture. Our courses cover aspects like security, compliance, and governance, as well as foundational learning such as an introduction to responsible AI and responsible AI practices.
Our Generative AI Innovation Center offers technical guidance and best practices to help customers establish effective governance frameworks when building with our AI services.

The EU AI Act requires all AI systems to meet certain requirements for fairness, transparency, accountability, and fundamental rights protection. Taking a risk-based approach, the EU AI Act establishes different categories of AI systems with corresponding requirements, and it brings obligations for all actors across the AI supply chain, including providers, deployers, distributors, users, and importers. AI systems deemed to pose unacceptable risks are prohibited. High-risk AI systems are allowed, but they are subject to stricter requirements for documentation, data governance, human oversight, and risk management procedures. In addition, certain AI systems (for example, those intended to interact directly with natural persons) are considered low risk and subject to transparency requirements. Apart from the requirements for AI systems, the EU AI Act also brings a separate set of obligations for providers of general-purpose AI (GPAI) models, depending on whether they pose systemic risks or not. The EU AI Act may apply to activities both inside and outside the EU. Therefore, even if your organization is not established in the EU, you may still be required to comply with the EU AI Act. We encourage all AWS customers to conduct a thorough assessment of their AI activities to determine whether they are subject to the EU AI Act and their specific obligations, regardless of their location.
Prohibited use cases
Since February 2, 2025, the EU AI Act has prohibited certain AI practices deemed to present unacceptable risks to fundamental rights. These prohibitions, listed in full under Article 5 of the EU AI Act, generally focus on manipulative or exploitative practices that can be harmful or abusive, and on the evaluation or classification of individuals based on social behavior, personal traits, or biometric data.
AWS is committed to making sure our AI services meet applicable regulatory requirements, including those of the EU AI Act. Although AWS services support a wide range of customer use case categories, none are designed or intended for practices prohibited under the EU AI Act, and we maintain this commitment through our policies, including the AWS Acceptable Use Policy, Responsible AI Policy, and Responsible Use of AI Guide.
Compliance with the EU AI Act is a shared journey: the regulation sets out responsibilities for both developers (providers) and deployers of AI systems. Although AWS provides the building blocks for compliant solutions, AWS customers remain responsible for assessing how their use of AWS services falls under the EU AI Act, implementing appropriate controls for their AI applications, and making sure their specific use cases comply with the EU AI Act's restrictions. We encourage AWS customers to carefully review the list of prohibited practices under the EU AI Act when building AI solutions using AWS services, and to review the European Commission's recently published guidelines on prohibited practices.
Moving forward with the EU AI Act
As the regulatory landscape continues to evolve, customers should stay informed about the EU AI Act and assess how it applies to their organization’s use of AI. AWS remains engaged with EU institutions and relevant authorities across EU member states on the enforcement of the EU AI Act. We participate in industry dialogues and contribute our knowledge and experience to support balanced outcomes that safeguard against risks of this technology, particularly where AI use cases have the potential to affect individuals’ health and safety or fundamental rights, while enabling continued AI innovation in ways that will benefit all. We will continue to update our customers through our AWS ML Blog posts and other AWS channels as new guidance emerges and additional portions of the EU AI Act take effect.
If you have questions about compliance with the EU AI Act, or if you require additional information on AWS AI governance tools and resources, please contact your account representative or request to be contacted.
If you'd like to join our community of innovators, learn about upcoming events, and gain expert insights, practical guidance, and connections that help you navigate the regulatory landscape, please express interest by registering.

Update on the AWS DeepRacer Student Portal

The AWS DeepRacer Student Portal will no longer be available starting September 15, 2025. This change comes as part of the broader transition of AWS DeepRacer from a service to an AWS Solution, representing an evolution in how we deliver AI & ML education. Since its launch, the AWS DeepRacer Student Portal has helped thousands of learners begin their AI & ML journey through hands-on reinforcement learning experiences. The portal has served as a foundational stepping stone for many who have gone on to pursue career development in AI through the AWS AI & ML Scholars program, which has been re-launched with a generative AI focused curriculum.
Starting July 14, 2025, the AWS DeepRacer Student Portal will enter a maintenance phase in which new registrations are disabled. Until September 15, 2025, existing users will retain full access to their content and training materials, with updates limited to critical security fixes, after which the portal will no longer be available. Going forward, AWS DeepRacer will be available as a solution in the AWS Solutions Library, providing educational institutions and organizations with greater capabilities to build and customize their own DeepRacer learning experiences.
As part of our commitment to advancing AI & ML education, we recently launched the enhanced AWS AI & ML Scholars program on May 28, 2025. This new program embraces the latest developments in generative AI, featuring hands-on experience with AWS PartyRock and Amazon Q. The curriculum focuses on practical applications of AI technologies and emerging skills, reflecting the evolving needs of the technology industry and preparing students for careers in AI. To learn more about the new AI & ML Scholars program and continue your learning journey, visit awsaimlscholars.com. In addition, users can also explore AI learning content and build in-demand cloud skills using AWS Skill Builder.
We’re grateful to the entire AWS DeepRacer Student community for their enthusiasm and engagement, and we look forward to supporting the next chapter of your AI & ML learning journey.

About the author
Jayadev Kalla is a Product Manager with the AWS Social Responsibility and Impact team, focusing on AI & ML education. His goal is to expand access to AI education through hands-on learning experiences. Outside of work, Jayadev is a sports enthusiast and loves to cook.