How to Build a Complete Multi-Domain AI Web Agent Using Notte and Gemini

In this tutorial, we demonstrate a complete, advanced implementation of the Notte AI Agent, integrating the Gemini API to power reasoning and automation. By combining Notte’s browser automation capabilities with structured outputs through Pydantic models, the tutorial shows how an AI web agent can research products, monitor social media, analyze markets, scan job opportunities, and more. It is designed as a practical, hands-on guide, featuring modular functions, demos, and workflows that show how developers can leverage AI-driven automation for real-world tasks such as e-commerce research, competitive intelligence, and content strategy. Check out the FULL CODES here.

!pip install notte python-dotenv pydantic google-generativeai requests beautifulsoup4
!patchright install --with-deps chromium

import os
import json
import time
from typing import List, Optional
from pydantic import BaseModel
import google.generativeai as genai
from dotenv import load_dotenv

GEMINI_API_KEY = "USE YOUR OWN API KEY HERE"
os.environ['GEMINI_API_KEY'] = GEMINI_API_KEY
genai.configure(api_key=GEMINI_API_KEY)

import notte

We begin by installing all the required dependencies, including Notte, Gemini, and supporting libraries, and then configure our Gemini API key for authentication. After setting up the environment, we import Notte to start building and running our AI web agent seamlessly. Check out the FULL CODES here.
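Note that python-dotenv is installed and load_dotenv is imported above but never called. If you prefer not to hardcode the key, a minimal sketch (assuming a local .env file containing a GEMINI_API_KEY entry) looks like this:

import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads key/value pairs from a local .env file into the environment
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # returns None if the variable is missing
if GEMINI_API_KEY:
    genai.configure(api_key=GEMINI_API_KEY)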

class ProductInfo(BaseModel):
    name: str
    price: str
    rating: Optional[float]
    availability: str
    description: str

class NewsArticle(BaseModel):
    title: str
    summary: str
    url: str
    date: str
    source: str

class SocialMediaPost(BaseModel):
    content: str
    author: str
    likes: int
    timestamp: str
    platform: str

class SearchResult(BaseModel):
    query: str
    results: List[dict]
    total_found: int

We define structured Pydantic models that let us capture and validate data consistently. With ProductInfo, NewsArticle, SocialMediaPost, and SearchResult, we ensure that the AI agent outputs reliable, well-structured information for products, news articles, social media posts, and search results. Check out the FULL CODES here.
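As a quick standalone illustration of why these models matter (placeholder data, not scraped output), Pydantic validates and coerces fields, so a numeric rating arriving as a string still comes back as a float, and a missing required field raises a clear error:

sample = ProductInfo(
    name="Example Earbuds",
    price="$49.99",
    rating="4.5",          # coerced from str to float by Pydantic
    availability="In stock",
    description="Illustrative placeholder data.",
)
print(sample.rating)        # 4.5 (float)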

class AdvancedNotteAgent:
    def __init__(self, headless=True, max_steps=20):
        self.headless = headless
        self.max_steps = max_steps
        self.session = None
        self.agent = None

    def __enter__(self):
        self.session = notte.Session(headless=self.headless)
        self.session.__enter__()
        self.agent = notte.Agent(
            session=self.session,
            reasoning_model='gemini/gemini-2.5-flash',
            max_steps=self.max_steps
        )
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            self.session.__exit__(exc_type, exc_val, exc_tb)

    def research_product(self, product_name: str, website: str = "amazon.com") -> ProductInfo:
        """Research a product and extract structured information"""
        task = f"Go to {website}, search for '{product_name}', click on the first relevant product, and extract detailed product information including name, price, rating, availability, and description."

        response = self.agent.run(
            task=task,
            response_format=ProductInfo,
            url=f"https://{website}"
        )
        return response.answer

    def news_aggregator(self, topic: str, num_articles: int = 3) -> List[NewsArticle]:
        """Aggregate news articles on a specific topic"""
        task = f"Search for recent news about '{topic}', find {num_articles} relevant articles, and extract title, summary, URL, date, and source for each."

        response = self.agent.run(
            task=task,
            url="https://news.google.com",
            response_format=List[NewsArticle]
        )
        return response.answer

    def social_media_monitor(self, hashtag: str, platform: str = "twitter") -> List[SocialMediaPost]:
        """Monitor social media for specific hashtags"""
        if platform.lower() == "twitter":
            url = "https://twitter.com"
        elif platform.lower() == "reddit":
            url = "https://reddit.com"
        else:
            url = f"https://{platform}.com"

        task = f"Go to {platform}, search for posts with hashtag '{hashtag}', and extract content, author, engagement metrics, and timestamps from the top 5 posts."

        response = self.agent.run(
            task=task,
            url=url,
            response_format=List[SocialMediaPost]
        )
        return response.answer

    def competitive_analysis(self, company: str, competitors: List[str]) -> dict:
        """Perform competitive analysis by gathering pricing and feature data"""
        results = {}

        for competitor in [company] + competitors:
            task = f"Go to {competitor}'s website, find their pricing page or main product page, and extract key features, pricing tiers, and unique selling points."

            try:
                response = self.agent.run(
                    task=task,
                    url=f"https://{competitor}.com"
                )
                results[competitor] = response.answer
                time.sleep(2)
            except Exception as e:
                results[competitor] = f"Error: {str(e)}"

        return results

    def job_market_scanner(self, job_title: str, location: str = "remote") -> List[dict]:
        """Scan job market for opportunities"""
        task = f"Search for '{job_title}' jobs in '{location}', extract job titles, companies, salary ranges, and application URLs from the first 10 results."

        response = self.agent.run(
            task=task,
            url="https://indeed.com"
        )
        return response.answer

    def price_comparison(self, product: str, websites: List[str]) -> dict:
        """Compare prices across multiple websites"""
        price_data = {}

        for site in websites:
            task = f"Search for '{product}' on this website and find the best price, including any discounts or special offers."

            try:
                response = self.agent.run(
                    task=task,
                    url=f"https://{site}"
                )
                price_data[site] = response.answer
                time.sleep(1)
            except Exception as e:
                price_data[site] = f"Error: {str(e)}"

        return price_data

    def content_research(self, topic: str, content_type: str = "blog") -> dict:
        """Research content ideas and trending topics"""
        if content_type == "blog":
            url = "https://medium.com"
            task = f"Search for '{topic}' articles, analyze trending content, and identify popular themes, engagement patterns, and content gaps."
        elif content_type == "video":
            url = "https://youtube.com"
            task = f"Search for '{topic}' videos, analyze view counts, titles, and descriptions to identify trending formats and popular angles."
        else:
            url = "https://google.com"
            task = f"Search for '{topic}' content across the web and analyze trending discussions and popular formats."

        response = self.agent.run(task=task, url=url)
        return {"topic": topic, "insights": response.answer, "platform": content_type}

We wrap Notte in a context-managed AdvancedNotteAgent that sets up a headless browser session and a Gemini-powered reasoning model, allowing us to automate multi-step web tasks reliably. We then add high-level methods, including product research, news aggregation, social listening, competitive scans, job search, price comparison, and content research, that return clean, structured outputs. This lets us script real-world web workflows while keeping the interface simple and consistent. Check out the FULL CODES here.

def demo_ecommerce_research():
    """Demo: E-commerce product research and comparison"""
    print(" E-commerce Research Demo")
    print("=" * 50)

    with AdvancedNotteAgent(headless=True) as agent:
        product = agent.research_product("wireless earbuds", "amazon.com")
        print(f"Product Research Results:")
        print(f"Name: {product.name}")
        print(f"Price: {product.price}")
        print(f"Rating: {product.rating}")
        print(f"Availability: {product.availability}")
        print(f"Description: {product.description[:100]}...")

        print("\n Price Comparison:")
        websites = ["amazon.com", "ebay.com", "walmart.com"]
        prices = agent.price_comparison("wireless earbuds", websites)
        for site, data in prices.items():
            print(f"{site}: {data}")

def demo_news_intelligence():
    """Demo: News aggregation and analysis"""
    print(" News Intelligence Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        articles = agent.news_aggregator("artificial intelligence", 3)

        for i, article in enumerate(articles, 1):
            print(f"\nArticle {i}:")
            print(f"Title: {article.title}")
            print(f"Source: {article.source}")
            print(f"Summary: {article.summary}")
            print(f"URL: {article.url}")

def demo_social_listening():
    """Demo: Social media monitoring and sentiment analysis"""
    print(" Social Media Listening Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        posts = agent.social_media_monitor("#AI", "reddit")

        for i, post in enumerate(posts, 1):
            print(f"\nPost {i}:")
            print(f"Author: {post.author}")
            print(f"Content: {post.content[:100]}...")
            print(f"Engagement: {post.likes} likes")
            print(f"Platform: {post.platform}")

def demo_market_intelligence():
    """Demo: Competitive analysis and market research"""
    print(" Market Intelligence Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        company = "openai"
        competitors = ["anthropic", "google"]
        analysis = agent.competitive_analysis(company, competitors)

        for comp, data in analysis.items():
            print(f"\n{comp.upper()}:")
            print(f"Analysis: {str(data)[:200]}...")

def demo_job_market_analysis():
    """Demo: Job market scanning and analysis"""
    print(" Job Market Analysis Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        jobs = agent.job_market_scanner("python developer", "san francisco")

        print(f"Found {len(jobs)} job opportunities:")
        for job in jobs[:3]:
            print(f"- {job}")

def demo_content_strategy():
    """Demo: Content research and trend analysis"""
    print(" Content Strategy Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        blog_research = agent.content_research("machine learning", "blog")
        video_research = agent.content_research("machine learning", "video")

        print("Blog Content Insights:")
        print(blog_research["insights"][:300] + "...")

        print("\nVideo Content Insights:")
        print(video_research["insights"][:300] + "...")

We run a suite of demos that showcase real web automation end-to-end, including researching products and comparing prices, aggregating fresh news, and monitoring social chatter. We also conduct competitive scans, analyze the job market, and track blog/video trends, yielding structured, ready-to-use insights from each task. Check out the FULL CODES here.

class WorkflowManager:
    def __init__(self):
        self.agents = []
        self.results = {}

    def add_agent_task(self, name: str, task_func, *args, **kwargs):
        """Add an agent task to the workflow"""
        self.agents.append({
            'name': name,
            'func': task_func,
            'args': args,
            'kwargs': kwargs
        })

    def execute_workflow(self, parallel=False):
        """Execute all agent tasks in the workflow"""
        print(" Executing Multi-Agent Workflow")
        print("=" * 50)

        for agent_task in self.agents:
            name = agent_task['name']
            func = agent_task['func']
            args = agent_task['args']
            kwargs = agent_task['kwargs']

            print(f"\n Executing {name}...")
            try:
                result = func(*args, **kwargs)
                self.results[name] = result
                print(f" {name} completed successfully")
            except Exception as e:
                self.results[name] = f"Error: {str(e)}"
                print(f" {name} failed: {str(e)}")

            if not parallel:
                time.sleep(2)
        return self.results

def market_research_workflow(company_name: str, product_category: str):
    """Complete market research workflow"""
    workflow = WorkflowManager()

    workflow.add_agent_task(
        "Product Research",
        lambda: research_trending_products(product_category)
    )

    workflow.add_agent_task(
        "Competitive Analysis",
        lambda: analyze_competitors(company_name, product_category)
    )

    workflow.add_agent_task(
        "Social Sentiment",
        lambda: monitor_brand_sentiment(company_name)
    )

    return workflow.execute_workflow()

def research_trending_products(category: str):
    """Research trending products in a category"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Research trending {category} products, find top 5 products with prices, ratings, and key features."
        response = agent.agent.run(
            task=task,
            url="https://amazon.com"
        )
        return response.answer

def analyze_competitors(company: str, category: str):
    """Analyze competitors in the market"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Research {company} competitors in {category}, compare pricing strategies, features, and market positioning."
        response = agent.agent.run(
            task=task,
            url="https://google.com"
        )
        return response.answer

def monitor_brand_sentiment(brand: str):
    """Monitor brand sentiment across platforms"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Search for recent mentions of {brand} on social media and news, analyze sentiment and key themes."
        response = agent.agent.run(
            task=task,
            url="https://reddit.com"
        )
        return response.answer

We design a WorkflowManager that chains multiple AI agent tasks into a single orchestrated pipeline. By adding modular tasks like product research, competitor analysis, and sentiment monitoring, we can execute a complete market research workflow in sequence (or parallel). This transforms individual demos into a coordinated multi-agent system that provides holistic insights for informed real-world decision-making. Check out the FULL CODES here.

def main():
    """Main function to run all demos"""
    print(" Advanced Notte AI Agent Tutorial")
    print("=" * 60)
    print("Note: Make sure to set your GEMINI_API_KEY above!")
    print("Get your free API key at: https://makersuite.google.com/app/apikey")
    print("=" * 60)

    if GEMINI_API_KEY == "USE YOUR OWN API KEY HERE":
        print(" Please set your GEMINI_API_KEY in the code above!")
        return

    try:
        print("\n1. E-commerce Research Demo")
        demo_ecommerce_research()

        print("\n2. News Intelligence Demo")
        demo_news_intelligence()

        print("\n3. Social Media Listening Demo")
        demo_social_listening()

        print("\n4. Market Intelligence Demo")
        demo_market_intelligence()

        print("\n5. Job Market Analysis Demo")
        demo_job_market_analysis()

        print("\n6. Content Strategy Demo")
        demo_content_strategy()

        print("\n7. Multi-Agent Workflow Demo")
        results = market_research_workflow("Tesla", "electric vehicles")
        print("Workflow Results:")
        for task, result in results.items():
            print(f"{task}: {str(result)[:150]}...")

    except Exception as e:
        print(f" Error during execution: {str(e)}")
        print(" Tip: Make sure your Gemini API key is valid and you have an internet connection")

def quick_scrape(url: str, instructions: str = "Extract main content"):
    """Quick scraping function for simple data extraction"""
    with AdvancedNotteAgent(headless=True, max_steps=5) as agent:
        response = agent.agent.run(
            task=f"{instructions} from this webpage",
            url=url
        )
        return response.answer

def quick_search(query: str, num_results: int = 5):
    """Quick search function with structured results"""
    with AdvancedNotteAgent(headless=True, max_steps=10) as agent:
        task = f"Search for '{query}' and return the top {num_results} results with titles, URLs, and brief descriptions."
        response = agent.agent.run(
            task=task,
            url="https://google.com",
            response_format=SearchResult
        )
        return response.answer

def quick_form_fill(form_url: str, form_data: dict):
    """Quick form filling function"""
    with AdvancedNotteAgent(headless=False, max_steps=15) as agent:
        data_str = ", ".join([f"{k}: {v}" for k, v in form_data.items()])
        task = f"Fill out the form with this information: {data_str}, then submit it."

        response = agent.agent.run(
            task=task,
            url=form_url
        )
        return response.answer

if __name__ == "__main__":
    print(" Quick Test Examples:")
    print("=" * 30)

    print("1. Quick Scrape Example:")
    try:
        result = quick_scrape("https://news.ycombinator.com", "Extract the top 3 post titles")
        print(f"Scraped: {result}")
    except Exception as e:
        print(f"Error: {e}")

    print("\n2. Quick Search Example:")
    try:
        search_results = quick_search("latest AI news", 3)
        print(f"Search Results: {search_results}")
    except Exception as e:
        print(f"Error: {e}")

    print("\n3. Custom Agent Task:")
    try:
        with AdvancedNotteAgent(headless=True) as agent:
            response = agent.agent.run(
                task="Go to Wikipedia, search for 'artificial intelligence', and summarize the main article in 2 sentences.",
                url="https://wikipedia.org"
            )
            print(f"Wikipedia Summary: {response.answer}")
    except Exception as e:
        print(f"Error: {e}")

    main()

    print("\n Tutorial Complete!")
    print(" Tips for success:")
    print("- Start with simple tasks and gradually increase complexity")
    print("- Use structured outputs (Pydantic models) for reliable data extraction")
    print("- Implement rate limiting to respect API quotas")
    print("- Handle errors gracefully in production workflows")
    print("- Combine scripting with AI for cost-effective automation")

    print("\n Next Steps:")
    print("- Customize the agents for your specific use cases")
    print("- Add error handling and retry logic for production")
    print("- Implement logging and monitoring for agent activities")
    print("- Scale up with Notte's hosted API service for enterprise features")

We wrap everything with a main() function that runs all demos end-to-end, and then add quick helper utilities, including quick_scrape, quick_search, and quick_form_fill, to perform focused tasks with minimal setup. We also include quick tests to validate the helpers and a custom Wikipedia task before invoking the full workflow, ensuring we can iterate fast while still exercising the complete agent pipeline.

In conclusion, the tutorial demonstrates how Notte, when combined with Gemini, can evolve into a powerful, multi-purpose AI web agent for research, monitoring, and analysis. It not only provides individual demos for e-commerce, news, and social media but also scales into advanced multi-agent workflows that combine insights across domains. By following this guide, developers can quickly prototype AI agents in Colab, extend them with custom tasks, and adapt the system for business intelligence, automation, and creative use cases.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Build a Complete Multi-Domain AI Web Agent Using Notte and Gemini appeared first on MarkTechPost.

GibsonAI Releases Memori: An Open-Source SQL-Native Memory Engine for AI Agents

When we think about human intelligence, memory is one of the first things that comes to mind. It’s what enables us to learn from our experiences, adapt to new situations, and make more informed decisions over time. Similarly, AI Agents become smarter with memory. For example, an agent can remember your past purchases, your budget, your preferences, and suggest gifts for your friends based on the learning from the past conversations.

Agents usually break tasks into steps (plan → search → call API → parse → write), but without memory they forget what happened in earlier steps. They repeat tool calls, fetch the same data again, or miss simple rules like “always refer to the user by their name.” Because the same context has to be supplied over and over, agents spend more tokens, respond more slowly, and give inconsistent answers. The industry has collectively spent billions on vector databases and embedding infrastructure to solve what is, at its core, a data persistence problem for AI agents. These solutions create black-box systems where developers cannot inspect, query, or understand why certain memories were retrieved.

The GibsonAI team built Memori to fix this issue. Memori is an open-source memory engine that provides persistent, intelligent memory for any LLM using standard SQL databases (PostgreSQL/MySQL). In this article, we’ll explore how Memori tackles memory challenges and what it offers.

The Stateless Nature of Modern AI: The Hidden Cost

Studies indicate that users spend 23-31% of their time providing context that they’ve already shared in previous conversations. For a development team using AI assistants, this translates to:

Individual Developer: ~2 hours/week repeating context

10-person Team: ~20 hours/week of lost productivity

Enterprise (1000 developers): ~2000 hours/week or $4M/year in redundant communication

Beyond productivity, this repetition breaks the illusion of intelligence. An AI that cannot remember your name after hundreds of conversations doesn’t feel intelligent.

Current Limitations of Stateless LLMs

No Learning from Interactions: Every mistake is repeated, every preference must be restated

Broken Workflows: Multi-session projects require constant context rebuilding

No Personalization: The AI cannot adapt to individual users or teams

Lost Insights: Valuable patterns in conversations are never captured

Compliance Challenges: No audit trail of AI decision-making

The Need for Persistent, Queryable Memory

What AI really needs is persistent, queryable memory, just as every application relies on a database. But you can’t simply use your existing app database as AI memory because it isn’t designed for context selection, relevance ranking, or injecting knowledge back into an agent’s workflow. That’s why we built a memory layer, which is essential for AI agents to feel truly intelligent.

Why SQL Matters for AI Memory

SQL databases have been around for more than 50 years. They are the backbone of almost every application we use today, from banking apps to social networks. Why? Because SQL is simple, reliable, and universal.

Every developer knows SQL. You don’t need to learn a new query language.

Battle-tested reliability. SQL has run the world’s most critical systems for decades.

Powerful queries. You can filter, join, and aggregate data with ease.

Strong guarantees. ACID transactions make sure your data stays consistent and safe.

Huge ecosystem. Tools for migration, backups, dashboards, and monitoring are everywhere.

When you build on SQL, you’re standing on decades of proven tech, not reinventing the wheel.
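To make the “powerful queries” point concrete, here is a minimal sketch of inspecting a SQLite-backed memory store with Python’s built-in sqlite3 module. The file name, table, and column names below are hypothetical illustrations, not Memori’s actual schema:

import sqlite3

# Hypothetical schema for illustration only; the real table and column names may differ.
conn = sqlite3.connect("memori.db")
rows = conn.execute(
    """
    SELECT category, COUNT(*) AS n
    FROM memories
    WHERE created_at >= date('now', '-7 days')
    GROUP BY category
    ORDER BY n DESC
    """
).fetchall()
for category, n in rows:
    print(category, n)
conn.close()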

The Drawbacks of Vector Databases

Most competing AI memory systems today are built on vector databases. On paper, they sound advanced: they let you store embeddings and search by similarity. But in practice, they come with hidden costs and complexity:

Multiple moving parts. A typical setup needs a vector DB, a cache, and a SQL DB just to function.

Vendor lock-in. Your data often lives inside a proprietary system, making it hard to move or audit.

Black-box retrieval. You can’t easily see why a certain memory was pulled.

Expensive. Infrastructure and usage costs add up quickly, especially at scale.

Hard to debug. Embeddings are not human-readable, so you can’t just query with SQL and check results.

Here’s how it compares to Memori’s SQL-first design:

| Aspect | Vector Database / RAG Solutions | Memori’s Approach |
| --- | --- | --- |
| Services Required | 3–5 (Vector DB + Cache + SQL) | 1 (SQL only) |
| Databases | Vector + Cache + SQL | SQL only |
| Query Language | Proprietary API | Standard SQL |
| Debugging | Black box embeddings | Readable SQL queries |
| Backup | Complex orchestration | cp memory.db backup.db or pg_basebackup |
| Data Processing | Embeddings: ~$0.0001 / 1K tokens (OpenAI), cheap upfront | Entity extraction: GPT-4o at ~$0.005 / 1K tokens, higher upfront |
| Storage Costs | $0.10–0.50 / GB / month (vector DBs) | ~$0.01–0.05 / GB / month (SQL) |
| Query Costs | ~$0.0004 / 1K vectors searched | Near zero (standard SQL queries) |
| Infrastructure | Multiple moving parts, higher maintenance | Single database, simple to manage |

Why It Works?

If you think SQL can’t handle memory at scale, think again. SQLite, one of the simplest SQL databases, is the most widely deployed database in the world:

Over 4 billion deployments

Runs on every iPhone, Android device, and web browser

Executes trillions of queries every single day

If SQLite can handle this massive workload with ease, why build AI memory on expensive, distributed vector clusters?

Memori Solution Overview

Memori uses structured entity extraction, relationship mapping, and SQL-based retrieval to create transparent, portable, and queryable AI memory. It also uses multiple agents working together to intelligently promote essential long-term memories to short-term storage for faster context injection.

With a single line of code, memori.enable(), any LLM gains the ability to remember conversations, learn from interactions, and maintain context across sessions. The entire memory system is stored in a standard SQLite database (or PostgreSQL/MySQL for enterprise deployments), making it fully portable, auditable, and owned by the user.
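As a rough illustration of that workflow (a minimal sketch, not official documentation: only memori.enable() and the SQLite/PostgreSQL/MySQL storage are stated above; the import path and constructor argument shown here are assumptions), enabling memory might look like this:

from memori import Memori          # assumed import path
from openai import OpenAI

# Assumed constructor argument; the SQLite connection string is illustrative.
memori = Memori(database_connect="sqlite:///memori.db")
memori.enable()                    # the one-liner described in the article

client = OpenAI()
# Subsequent LLM calls can now be recorded and enriched with relevant past context.
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Remember that I prefer PostgreSQL."}],
)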

Key Differentiators

Radical Simplicity: One line to enable memory for any LLM framework (OpenAI, Anthropic, LiteLLM, LangChain)

True Data Ownership: Memory stored in standard SQL databases that users fully control

Complete Transparency: Every memory decision is queryable with SQL and fully explainable

Zero Vendor Lock-in: Export your entire memory as a SQLite file and move anywhere

Cost Efficiency: 80-90% cheaper than vector database solutions at scale

Compliance Ready: SQL-based storage enables audit trails, data residency, and regulatory compliance

Memori Use Cases

Smart shopping experience with an AI Agent that remembers customer preferences and shopping behavior.

Personal AI assistants that remember user preferences and context

Customer support bots that never ask the same question twice

Educational tutors that adapt to student progress

Team knowledge management systems with shared memory

Compliance-focused applications requiring complete audit trails

Business Impact Metrics

Based on early implementations from our community users, we identified that Memori helps with the following:

Development Time: 90% reduction in memory system implementation (hours vs. weeks)

Infrastructure Costs: 80-90% reduction compared to vector database solutions

Query Performance: 10-50ms response time (2-4x faster than vector similarity search)

Memory Portability: 100% of memory data portable (vs. 0% with cloud vector databases)

Compliance Readiness: Full SQL audit capability from day one

Maintenance Overhead: Single database vs. distributed vector systems

Technical Innovation

Memori introduces three core innovations:

Dual-Mode Memory System: Combining “conscious” working memory with “auto” intelligent search, mimicking human cognitive patterns

Universal Integration Layer: Automatic memory injection for any LLM without framework-specific code

Multi-Agent Architecture: Multiple specialized AI agents working together for intelligent memory

Existing Solutions in the Market

There are already several approaches to giving AI agents some form of memory, each with its own strengths and trade-offs:

Mem0 → A feature-rich solution that combines Redis, vector databases, and orchestration layers to manage memory in a distributed setup.

LangChain Memory → Provides convenient abstractions for developers building within the LangChain framework.

Vector Databases (Pinecone, Weaviate, Chroma) → Focused on semantic similarity search using embeddings, designed for specialized use cases.

Custom Solutions → In-house designs tailored to specific business needs, offering flexibility but requiring significant maintenance.

These solutions demonstrate the various directions the industry is taking to address the memory problem. Memori enters the landscape with a different philosophy, bringing memory into a SQL-native, open-source form that is simple, transparent, and production-ready.

Memori Built on a Strong Database Infrastructure

In addition to memory, AI agents also need a database backbone to make that memory usable and scalable. Think of AI agents that can run queries safely in an isolated database sandbox, optimize queries over time, and autoscale on demand, for example by initiating a new database for a user to keep their relevant data separate.

A robust database infrastructure from GibsonAI backs Memori. This makes memory reliable and production-ready with:

Instant provisioning

Autoscale on demand

Database branching

Database versioning

Query optimization

Point of recovery

Strategic Vision

While competitors chase complexity with distributed vector solutions and proprietary embeddings, Memori embraces the proven reliability of SQL databases that have powered applications for decades.

The goal is not to build the most sophisticated memory system, but the most practical one. By storing AI memory in the same databases that already run the world’s applications, Memori enables a future where AI memory is as portable, queryable, and manageable as any other application data.

Check out the GitHub Page here. Thanks to the GibsonAI team for the thought leadership/Resources and supporting this article.
The post GibsonAI Releases Memori: An Open-Source SQL-Native Memory Engine for AI Agents appeared first on MarkTechPost.

A New MIT Study Shows Reinforcement Learning Minimizes Catastrophic Forgetting Compared to Supervised Fine-Tuning

Table of contents
What is catastrophic forgetting in foundation models?
Why does online reinforcement learning forget less than supervised fine-tuning?
How can forgetting be measured?
What do experiments on large language models reveal?
How does RL compare to SFT in robotics tasks?
What insights come from the ParityMNIST study?
Why do on-policy updates matter?
Are other explanations sufficient?
What are the broader implications?
Conclusion
Key Takeaways

What is catastrophic forgetting in foundation models?

Foundation models excel in diverse domains but are largely static once deployed. Fine-tuning on new tasks often introduces catastrophic forgetting—the loss of previously learned capabilities. This limitation poses a barrier for building long-lived, continually improving AI agents.

Why does online reinforcement learning forget less than supervised fine-tuning?

A new MIT study compares reinforcement learning (RL) and supervised fine-tuning (SFT). Both can achieve high performance on new tasks, but SFT tends to overwrite prior abilities. RL, by contrast, preserves them. The key lies in how each method shifts the model’s output distribution relative to the base policy.

Source: https://arxiv.org/pdf/2509.04259

How can forgetting be measured?

The research team proposes an empirical forgetting law:

Forgetting ∝ KL(π₀ ∥ π)

where π₀ is the base model and π is the fine-tuned model. The forward KL divergence, measured on the new task, strongly predicts the extent of forgetting. This makes forgetting quantifiable without needing data from prior tasks.
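To make the quantity concrete, here is a small, self-contained sketch (our illustration, not the paper’s code) of estimating the forward KL between a base policy and a fine-tuned policy from their output probabilities on a new-task prompt:

import numpy as np

def forward_kl(p_base, p_finetuned, eps=1e-12):
    """KL(pi_0 || pi): expectation under the base policy of log(pi_0 / pi)."""
    p0 = np.asarray(p_base, dtype=float) + eps
    p1 = np.asarray(p_finetuned, dtype=float) + eps
    p0, p1 = p0 / p0.sum(), p1 / p1.sum()
    return float(np.sum(p0 * np.log(p0 / p1)))

# Toy next-token distributions on a new-task prompt (illustrative numbers only).
pi_0   = [0.40, 0.30, 0.20, 0.10]   # base model
pi_rl  = [0.35, 0.33, 0.21, 0.11]   # small shift, e.g. after on-policy RL
pi_sft = [0.05, 0.85, 0.05, 0.05]   # large shift, e.g. after SFT on fixed labels

print(forward_kl(pi_0, pi_rl))    # small KL -> little predicted forgetting
print(forward_kl(pi_0, pi_sft))   # large KL -> more predicted forgetting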

What do experiments on large language models reveal?

Using Qwen 2.5 3B-Instruct as the base model, fine-tuning was performed on:

Math reasoning (Open-Reasoner-Zero),

Science Q&A (SciKnowEval subset),

Tool use (ToolAlpaca).

Performance was evaluated on prior benchmarks such as HellaSwag, MMLU, TruthfulQA, and HumanEval. Results showed that RL improved new-task accuracy while keeping prior-task accuracy stable, whereas SFT consistently sacrificed prior knowledge.

How does RL compare to SFT in robotics tasks?

In robotic control experiments with OpenVLA-7B fine-tuned in SimplerEnv pick-and-place scenarios, RL adaptation maintained general manipulation skills across tasks. SFT, while successful on the new task, degraded prior manipulation abilities—again illustrating RL’s conservatism in preserving knowledge.

What insights come from the ParityMNIST study?

To isolate mechanisms, the research team introduced a toy problem, ParityMNIST. Here, RL and SFT both reached high new-task accuracy, but SFT induced sharper declines on the FashionMNIST auxiliary benchmark. Crucially, plotting forgetting against KL divergence revealed a single predictive curve, validating KL as the governing factor.

Why do on-policy updates matter?

On-policy RL samples from the model’s own outputs, incrementally reweighting them by reward. This process constrains learning to distributions already close to the base model. SFT, in contrast, optimizes against fixed labels that may be arbitrarily distant. Theoretical analysis shows policy gradients converge to KL-minimal optimal solutions, formalizing RL’s advantage.
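A toy calculation (ours, under simplifying assumptions, not the paper’s experiment) illustrates the same point: when several answers earn full reward, the KL-regularized on-policy optimum is a reward-weighted reweighting of the base policy, while SFT toward one fixed correct label pushes the model to a near one-hot distribution that sits much farther from the base in forward KL:

import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

pi_0 = np.array([0.40, 0.30, 0.20, 0.10])   # base policy over 4 candidate answers
reward = np.array([1.0, 1.0, 0.0, 0.0])     # two answers are acceptable

# KL-regularized on-policy optimum: pi(a) proportional to pi_0(a) * exp(reward / beta)
beta = 0.3
pi_rl = pi_0 * np.exp(reward / beta)
pi_rl /= pi_rl.sum()

# SFT toward a single fixed label (answer 0), lightly smoothed to avoid zeros
pi_sft = np.array([0.97, 0.01, 0.01, 0.01])

print(np.dot(pi_rl, reward), np.dot(pi_sft, reward))   # both near 1.0 on the new task
print(kl(pi_0, pi_rl), kl(pi_0, pi_sft))               # RL stays far closer to pi_0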

Are other explanations sufficient?

The research team tested alternatives: weight-space changes, hidden representation drift, sparsity of updates, and alternative distributional metrics (reverse KL, total variation, L2 distance). None matched the predictive strength of forward KL divergence, reinforcing that distributional closeness is the critical factor.

What are the broader implications?

Evaluation: Post-training should consider KL-conservatism, not just task accuracy.

Hybrid methods: Combining SFT efficiency with explicit KL minimization could yield optimal trade-offs.

Continual learning: RL’s Razor offers a measurable criterion for designing adaptive agents that learn new skills without erasing old ones.

Conclusion

The MIT research reframes catastrophic forgetting as a distributional problem governed by forward KL divergence. Reinforcement learning forgets less because its on-policy updates naturally bias toward KL-minimal solutions. This principle—RL’s Razor—provides both an explanation for RL’s robustness and a roadmap for developing post-training methods that support lifelong learning in foundation models.

Key Takeaways

Reinforcement learning (RL) preserves prior knowledge better than Supervised fine-tuning (SFT): Even when both achieve the same accuracy on new tasks, RL retains prior capabilities while SFT erases them.

Forgetting is predictable by KL divergence: The degree of catastrophic forgetting is strongly correlated with the forward KL divergence between the fine-tuned and base policy, measured on the new task.

RL’s Razor principle: On-policy RL converges to KL-minimal solutions, ensuring updates remain close to the base model and reducing forgetting.

Empirical validation across domains: Experiments on LLMs (math, science Q&A, tool use) and robotics tasks confirm RL’s robustness against forgetting, while SFT consistently trades old knowledge for new-task performance.

Controlled experiments confirm generality: In the ParityMNIST toy setting, both RL and SFT showed forgetting aligned with KL divergence, proving the principle holds beyond large-scale models.

Future design axis for post-training: Algorithms should be evaluated not only by new-task accuracy but also by how conservatively they shift distributions in KL space, opening avenues for hybrid RL–SFT methods.

Check out the PAPER and PROJECT PAGE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A New MIT Study Shows Reinforcement Learning Minimizes Catastrophic Forgetting Compared to Supervised Fine-Tuning appeared first on MarkTechPost.

Maximize HyperPod Cluster utilization with HyperPod task governance fi …

We are excited to announce the general availability of fine-grained compute and memory quota allocation with HyperPod task governance. With this capability, customers can optimize Amazon SageMaker HyperPod cluster utilization on Amazon Elastic Kubernetes Service (Amazon EKS), distribute fair usage, and support efficient resource allocation across different teams or projects. For more information, see HyperPod task governance best practices for maximizing the value of SageMaker HyperPod task governance.
Compute quota management is an administrative mechanism that sets and controls compute resource limits across users, teams, and projects. It controls fair resource distribution, preventing a single entity from monopolizing cluster resources, thereby optimizing overall computational efficiency.
Because of budget constraints, customers might want to allocate compute resources across multiple teams fairly. For example, a data scientist might need some GPUs (for example, four H100 GPUs) for model development, but not the entire instance’s compute capacity. In other cases, customers have limited compute resources but many teams, and they want to fairly share compute resources across these teams, so that no idle capacity is left unused.
With HyperPod task governance, administrators can now allocate granular GPU, vCPU, and vCPU memory to teams and projects—in addition to the entire instance resources—based on their preferred strategy. Key capabilities include GPU-level quota allocation by instance type and family, or hardware type—supporting both Trainium and NVIDIA GPUs—and optional CPU and memory allocation for fine-tuned resource control. Administrators can also define the weight (or priority level) a team is given for fair-share idle compute allocation.

“With a wide variety of frontier AI data experiments and production pipelines, being able to maximize SageMaker HyperPod Cluster utilization is extremely high impact. This requires fair and controlled access to shared resources like state-of-the-art GPUs, granular hardware allocation, and more. This is exactly what HyperPod task governance is built for, and we’re excited to see AWS pushing efficient cluster utilization for a variety of AI use cases.”
– Daniel Xu, Director of Product at Snorkel AI, whose AI data technology platform empowers enterprises to build specialized AI applications by leveraging their organizational expertise at scale.

In this post, we dive deep into how to define quotas for teams or projects based on granular or instance-level allocation. We discuss different methods to define such policies, and how data scientists can schedule their jobs seamlessly with this new capability.
Solution overview
Prerequisites
To follow the examples in this blog post, you need to meet the following prerequisites:

An AWS account with access to SageMaker HyperPod.
A running SageMaker HyperPod (EKS-orchestrated) cluster. For more information on how to create and configure a new HyperPod cluster, see the HyperPod workshop or the SageMaker HyperPod cluster creation with Amazon EKS orchestration.
HyperPod task governance add-on version 1.3 or later installed in the cluster. For more information, see Set up HyperPod task governance.

To schedule and execute the example jobs in the Submitting Tasks section, you will also need:

A local environment (either your local machine or a cloud-based compute environment), from which to run the HyperPod CLI and kubectl commands, configured as follows:

OS based on Linux or MacOS
Python 3.8, 3.9, 3.10, or 3.11 installed
AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the above services
HyperPod CLI version 3.1.0
Kubernetes command-line tool, kubectl

HyperPod Training Operator installed in the cluster

Allocating granular compute and memory quota using the AWS console
Administrators are the primary persona interacting with SageMaker HyperPod task governance and are responsible for managing cluster compute allocation in alignment with the organization’s strategic priorities and goals.
Implementing this feature follows the familiar compute allocation creation workflow of HyperPod task governance. To get started, sign in to the AWS Management Console and navigate to Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console. After selecting your HyperPod cluster, select the Policies tab in the cluster detail page. Navigate to Compute allocations and choose Create.

As with existing functionality, you can enable task prioritization and fair-share resource allocation through cluster policies that prioritize critical workloads and distribute idle compute across teams. By using HyperPod task governance, you can define queue admission policies (first-come-first-serve by default or task ranking) and idle compute allocation methods (first-come-first-serve or fair-share by default). In the Compute allocation section, you can create and edit allocations to distribute resources among teams, enable lending and borrowing of idle compute, configure preemption of low-priority tasks, and assign fair-share weights.
The key innovation is in the Allocations section shown in the following figure, where you’ll now find fine-grained options for resource allocation. In addition to the existing instance-level quotas, you can now directly specify GPU quotas by instance type and family or by hardware type. When you define GPU allocations, HyperPod task governance intelligently calculates appropriate default values for vCPUs and memory which are set proportionally.
For example, when allocating 2 GPUs from a single p5.48xlarge instance (which has 8 GPUs, 192 vCPUs, and 2 TiB memory) in your HyperPod cluster, HyperPod task governance assigns 48 vCPUs and 512 GiB memory as default values—which is equivalent to one quarter of the instance’s total resources. Similarly, if your HyperPod cluster contains 2 ml.g5.2xlarge instances (each with 1 GPU, 8 vCPUs, and 32 GiB memory), allocating 2 GPUs would automatically assign 16 vCPUs and 64 GiB memory from both instances as shown in the following image.
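The default values simply scale the instance’s total resources by the fraction of its GPUs you allocate. A small sketch (our illustration of the arithmetic described above, not service code) reproduces the two examples:

def proportional_defaults(gpus_requested, total_gpus, total_vcpus, total_memory_gib):
    """Scale instance-level vCPU and memory by the fraction of GPUs allocated."""
    fraction = gpus_requested / total_gpus
    return fraction * total_vcpus, fraction * total_memory_gib

# 2 of 8 GPUs on one ml.p5.48xlarge (192 vCPUs, 2 TiB = 2048 GiB memory)
print(proportional_defaults(2, 8, 192, 2048))   # (48.0, 512.0)

# 2 of 2 GPUs across two ml.g5.2xlarge (combined: 16 vCPUs, 64 GiB memory)
print(proportional_defaults(2, 2, 16, 64))      # (16.0, 64.0)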

You can either proceed with these automatically calculated default values or customize the allocation by manually adjusting the vCPUs and vCPU memory fields as seen in the following image.

Amazon SageMaker HyperPod supports clusters that include CPU-based instances, GPU-based instances, and AWS Neuron-based hardware (AWS Inferentia and AWS Trainium chips). You can specify resource allocation for your team by instances, GPUs, vCPUs, vCPU memory, or Neuron devices, as shown in the following image.

Quota allocation can be more than capacity. Resources added to the compute allocation policy that aren’t currently available in the cluster represent planning for future capacity upgrades. Jobs that require these unprovisioned resources will be automatically queued and remain in a pending state until the necessary resources become available. It’s important to understand that in SageMaker HyperPod, compute allocations function as quotas, which are verified during workload scheduling to understand if a workload should be admitted or not, regardless of actual capacity availability. When resource requests are within these defined allocation limits and current utilization, the Kubernetes scheduler (kube-scheduler) handles the actual distribution and placement of pods across the HyperPod cluster nodes.
Allocating granular compute and memory quota using AWS CLI
You can also create or update compute quotas using the AWS CLI. The following is an example for creating a compute quota with only GPU count specification using the AWS CLI:

aws sagemaker create-compute-quota \
  --region <aws_region> \
  --name "only-gpu-quota" \
  --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
  --description "test description" \
  --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \
  --activation-state "Enabled" \
  --compute-quota-target "TeamName=onlygputeam2,FairShareWeight=10"

Compute quotas can also be created with mixed quota types, including a certain number of instances and granular compute resources, as shown in the following example:

aws sagemaker create-compute-quota \
  --region <aws_region> \
  --name "mix-quota-type" \
  --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
  --description "Mixed quota allocation" \
  --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2},{InstanceType=ml.p5.48xlarge,Count=3},{InstanceType=ml.c5.2xlarge,VCpu=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \
  --activation-state "Enabled" \
  --compute-quota-target "TeamName=mixquotatype,FairShareWeight=10"
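If you script quota management in Python instead of the AWS CLI, the boto3 SageMaker client exposes the same operation. The sketch below mirrors the first CLI example and assumes a recent boto3 release that includes the create_compute_quota API; the parameter shapes follow the describe-compute-quota output shown later in this post:

import boto3

# Assumes a boto3 version that includes the SageMaker compute-quota APIs.
sagemaker = boto3.client("sagemaker", region_name="<aws_region>")

sagemaker.create_compute_quota(
    Name="only-gpu-quota",
    ClusterArn="arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>",
    Description="test description",
    ComputeQuotaConfig={
        "ComputeQuotaResources": [
            {"InstanceType": "ml.g6.12xlarge", "Accelerators": 2},
        ],
        "ResourceSharingConfig": {"Strategy": "LendAndBorrow", "BorrowLimit": 10},
    },
    ComputeQuotaTarget={"TeamName": "onlygputeam2", "FairShareWeight": 10},
    ActivationState="Enabled",
)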

HyperPod task governance deep dive
SageMaker HyperPod task governance enables allocation of GPU, CPU, and memory resources by integrating with Kueue, a Kubernetes-native system for job queueing.
Kueue doesn’t replace existing Kubernetes scheduling components, but rather integrates with the kube-scheduler, such that Kueue decides whether a workload should be admitted based on the resource quotas and current utilization, and then the kube-scheduler takes care of pod placement on the nodes.
When a workload requests specific resources, Kueue selects an appropriate resource flavor based on availability, node affinity, and job priority. The scheduler then injects the corresponding node labels and tolerations into the PodSpec, allowing Kubernetes to place the pod on nodes with the requested hardware configuration. This supports precise resource governance and efficient allocation for multi-tenant clusters.
When a SageMaker HyperPod task governance compute allocation is created, Kueue creates ClusterQueues that define resource quotas and scheduling policies, along with ResourceFlavors for the selected instance types with their unique resource characteristics.
For example, the following compute allocation policy allocates ml.g6.12xlarge instances with 2 GPUs and 48 vCPUs to the onlygputeam team, implementing a LendAndBorrow strategy with an up to 50% borrowing limit. This configuration enables flexible resource sharing while maintaining priority through a fair share weight of 10 and the ability to preempt lower priority tasks from other teams.

aws sagemaker describe-compute-quota \
  --region <aws_region> \
  --compute-quota-id <compute_quota_id>

# output
{
    "ComputeQuotaArn": "arn:aws:sagemaker:<aws_region>:<account_id>:compute-quota/<compute_quota_id>",
    "ComputeQuotaId": "<compute_quota_id>",
    "Name": "only-gpu-quota",
    "Description": "Only GPU quota allocation",
    "ComputeQuotaVersion": 1,
    "Status": "Created",
    "ClusterArn": "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>",
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [
            {
                "InstanceType": "ml.g6.12xlarge",
                "Accelerators": 2,
                "VCpu": 48.0
            }
        ],
        "ResourceSharingConfig": {
            "Strategy": "LendAndBorrow",
            "BorrowLimit": 50
        },
        "PreemptTeamTasks": "LowerPriority"
    },
    "ComputeQuotaTarget": {
        "TeamName": "onlygputeam",
        "FairShareWeight": 10
    },
    "ActivationState": "Enabled",
    "CreationTime": "2025-07-24T11:12:12.021000-07:00",
    "CreatedBy": {},
    "LastModifiedTime": "2025-07-24T11:15:45.205000-07:00",
    "LastModifiedBy": {}
}

The corresponding Kueue ClusterQueue is configured with the ml.g6.12xlarge flavor, providing quotas for 2 NVIDIA GPUs, 48 CPU cores, and 192 Gi memory.

kubectl describe clusterqueue hyperpod-ns-onlygputeam-clusterqueue

# output
Name:         hyperpod-ns-onlygputeam-clusterqueue
Namespace:
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         ClusterQueue
Metadata:
  …
Spec:
  Cohort:  shared-pool
  Fair Sharing:
    Weight:  10
  Flavor Fungibility:
    When Can Borrow:   TryNextFlavor
    When Can Preempt:  TryNextFlavor
  Namespace Selector:
    Match Labels:
      kubernetes.io/metadata.name:  hyperpod-ns-onlygputeam
  Preemption:
    Borrow Within Cohort:
      Policy:               LowerPriority
    Reclaim Within Cohort:  Any
    Within Cluster Queue:   LowerPriority
  Queueing Strategy:        BestEffortFIFO
  Resource Groups:
    Covered Resources:
      nvidia.com/gpu
      aws.amazon.com/neurondevice
      cpu
      memory
      vpc.amazonaws.com/efa
    Flavors:
      Name:  ml.g6.12xlarge
      Resources:
        Borrowing Limit:  1
        Name:             nvidia.com/gpu
        Nominal Quota:    2
        Borrowing Limit:  0
        Name:             aws.amazon.com/neurondevice
        Nominal Quota:    0
        Borrowing Limit:  24
        Name:             cpu
        Nominal Quota:    48
        Borrowing Limit:  96Gi
        Name:             memory
        Nominal Quota:    192Gi
        Borrowing Limit:  0
        Name:             vpc.amazonaws.com/efa
        Nominal Quota:    1
    …

A Kueue LocalQueue will be also created, and will reference the corresponding ClusterQueue. The LocalQueue acts as the namespace-scoped resource through which users can submit workloads, and these workloads are then admitted and scheduled according to the quotas and policies defined in the ClusterQueue.

kubectl describe localqueue hyperpod-ns-onlygputeam-localqueue -n hyperpod-ns-onlygputeam

# output
Name:         hyperpod-ns-onlygputeam-localqueue
Namespace:    hyperpod-ns-onlygputeam
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         LocalQueue
Metadata:
    …
Spec:
  Cluster Queue:  hyperpod-ns-onlygputeam-clusterqueue
  Stop Policy:    None
Status:
  Admitted Workloads:  0

Submitting tasks
There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: the SageMaker HyperPod CLI and the Kubernetes command-line tool, kubectl. With both options, data scientists need to reference their team’s namespace and task priority class—in addition to the requested GPU and vCPU compute and memory resources—to use their granular allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, then SageMaker HyperPod task governance will automatically assume the lowest priority. The specific GPU type comes from an instance type selection, because data scientists want to use GPUs with certain capabilities (for example, H100 instead of H200) to perform their tasks efficiently.
HyperPod CLI
The HyperPod CLI was created to abstract the complexities of working with kubectl so that developers using SageMaker HyperPod can iterate faster with custom commands. The following is an example of a job submission with the HyperPod CLI requesting both compute and memory resources:

hyp create hyp-pytorch-job \
  --job-name sample-job1 \
  --image <account_id>.dkr.ecr.<aws_region>.amazonaws.com/<image_name>:<tag> \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --priority high-priority \
  --namespace hyperpod-ns-team1 \
  --queue-name hyperpod-ns-team1-localqueue \
  --instance-type ml.g5.8xlarge \
  --accelerators 1 \
  --vcpu 4 \
  --memory 1 \
  --accelerators-limit 1 \
  --vcpu-limit 5 \
  --memory-limit 2

The highlighted parameters enable requesting granular compute and memory resources. The HyperPod CLI requires the HyperPod Training Operator to be installed in the cluster and a container image that includes the HyperPod Elastic Agent. For further instructions on how to build such a container image, refer to the HyperPod Training Operator documentation.
For more information on the supported HyperPod CLI arguments and related description, see the SageMaker HyperPod CLI reference documentation.
Kubectl
The following is an example of a kubectl command to submit a job to the HyperPod cluster using the specified queue. This is a simple PyTorch job that checks for GPU availability and then sleeps briefly. Compute and memory resources are requested using the standard Kubernetes resource management constructs.

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: hyperpod-ns-team1
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team1-localqueue
        kueue.x-k8s.io/priority-class: high-priority
    spec:
      containers:
        - name: training-container
          image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
          command:
            - "python"
            - "-c"
            - "import torch; print('GPU available:', torch.cuda.is_available()); import time; time.sleep(15)"
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "1Gi"
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

Sample commands

Following is a short reference guide for helpful commands when interacting with SageMaker HyperPod task governance:

Describing cluster policy with the AWS CLI – This AWS CLI command is useful for viewing the cluster policy settings for your cluster.
List compute quota allocations with the AWS CLI – Use this AWS CLI command to view the different teams and set up task governance and their respective quota allocation settings.
HyperPod CLI – The HyperPod CLI abstracts common kubectl commands used to interact with SageMaker HyperPod clusters such as submitting, listing, and cancelling tasks. See the SageMaker HyperPod CLI reference documentation for a full list of commands.
kubectl – You can also use kubectl to interact with task governance; some useful commands are:

kubectl get workloads -n hyperpod-ns-<team-name> and kubectl describe workload <workload-name> -n hyperpod-ns-<team-name>. These commands show the workloads running in your cluster per namespace and provide detailed reasoning on Kueue admission. You can use them to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”
Common scenarios
A common use case for more granular allocation of GPU compute is fine-tuning small and medium sized large language models (LLMs). A single H100 or H200 GPU might be sufficient to address such a use case (also depending on the chosen batch size and other factors), and machine learning (ML) platform administrators can choose to allocate a single GPU to each data scientist or ML researcher to optimize the utilization of an instance like ml.p5.48xlarge, which comes with 8 H100 GPUs onboard.
Small language models (SLMs) have emerged as a significant advancement in generative AI, offering lower latency, decreased deployment costs, and enhanced privacy capabilities while maintaining impressive performance on targeted tasks, making them increasingly vital for agentic workflows and edge computing scenarios. The new SageMaker HyperPod task governance with fine-grained GPU, CPU, and memory allocation significantly enhances SLM development by enabling precise matching of resources to model requirements, allowing teams to efficiently run multiple experiments concurrently with different architectures. This resource optimization is particularly valuable as organizations develop specialized SLMs for domain-specific applications, with priority-based scheduling so that critical model training jobs receive resources first while maximizing overall cluster utilization. By providing exactly the right resources at the right time, HyperPod accelerates the development of specialized, domain-specific SLMs that can be deployed as efficient agents in complex workflows, enabling more responsive and cost-effective AI solutions across industries.
With the growing popularity of SLMs, organizations can use granular quota allocation to create targeted quota policies that prioritize GPU resources, addressing the budget-sensitive nature of ML infrastructure where GPUs represent the most significant cost and performance factor. Organizations can now selectively apply CPU and memory limits where needed, creating a granular resource management approach that efficiently supports diverse machine learning workloads regardless of model size.
Similarly, to support inference workloads, multiple teams might not require an entire instance to deploy their models, helping to avoid having entire instances equipped with multiple GPUs allocated to each team and leaving GPU compute sitting idle.
Finally, during experimentation and algorithm development, data scientists and ML researchers can choose to deploy a container hosting their preferred IDE on HyperPod, like JupyterLab or Code-OSS (Visual Studio Code open source). In this scenario, they often experiment with smaller batch sizes before scaling to multi-GPU configurations, hence not needing entire multi-GPU instances to be allocated. Similar considerations apply to CPU instances; for example, an ML platform administrator might decide to use CPU instances for IDE deployment, because data scientists prefer to scale their training or fine-tuning with jobs rather than experimenting with the local IDE compute. In such cases, depending on the instances of choice, partitioning CPU cores across the team might be beneficial.
Conclusion
The introduction of fine-grained compute quota allocation in SageMaker HyperPod represents a significant advancement in ML infrastructure management. By enabling GPU-level resource allocation alongside instance-level controls, organizations can now precisely tailor their compute resources to match their specific workloads and team structures.
This granular approach to resource governance addresses critical challenges faced by ML teams today, balancing budget constraints, maximizing expensive GPU utilization, and ensuring fair access across data science teams of all sizes. Whether fine-tuning SLMs that require single GPUs, running inference workloads with varied resource needs, or supporting development environments that don’t require full instance power, this flexible capability helps ensure that no compute resources sit idle unnecessarily.
ML workloads continue to diversify in their resource requirements and SageMaker HyperPod task governance now provides the adaptability organizations need to optimize their GPU capacity investments. To learn more, visit the SageMaker HyperPod product page and HyperPod task governance documentation.
Give this a try in the Amazon SageMaker AI console and leave your comments here.

About the authors
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Zhenshan Jin is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in the data analytics domain.

Build and scale adoption of AI agents for education with Strands Agent …

Basic AI chat isn’t enough for most business applications. Institutions need AI that can pull from their databases, integrate with their existing tools, handle multi-step processes, and make decisions independently.
This post demonstrates how to quickly build sophisticated AI agents using Strands Agents, scale them reliably with Amazon Bedrock AgentCore, and make them accessible through LibreChat’s familiar interface to drive immediate user adoption across your institution.
Challenges with basic AI chat interfaces
Although basic AI chat interfaces can answer questions and generate content, educational institutions need capabilities that simple chat can’t provide:

Contextual decision-making – A student asking “What courses should I take?” needs an agent that can access their transcript, check prerequisites, verify graduation requirements, and consider schedule conflicts—not just generic course descriptions
Multi-step workflows – Degree planning requires analyzing current progress, identifying remaining requirements, suggesting course sequences, and updating recommendations as students make decisions
Institutional data integration – Effective educational AI must connect to student information systems, learning management services, academic databases, and institutional repositories to provide relevant, personalized guidance
Persistent memory and learning – Agents need to remember previous interactions with students, track their academic journey over semesters, and build understanding of individual learning patterns and needs

Combining open source flexibility with enterprise infrastructure
The integration presented in this post demonstrates how three technologies can work together to address these challenges:

Strands Agents – Build sophisticated multi-agent workflows in just a few lines of code
Amazon Bedrock AgentCore – Scale agents reliably with serverless, pay-per-use deployment
LibreChat – Provide users with a familiar chat interface that drives immediate adoption

Strands Agents overview
Strands Agents is an open source SDK that takes a model-driven approach to building and running AI agents in just a few lines of code. Unlike LibreChat’s simple agent implementation, Strands supports sophisticated patterns including multi-agent orchestration through workflow, graph, and swarm tools; semantic search for managing thousands of tools; and advanced reasoning capabilities with deep analytical thinking cycles. The framework simplifies agent development by embracing the capabilities of state-of-the-art models to plan, chain thoughts, call tools, and reflect, while scaling from local development to production deployment with flexible architectures and comprehensive observability.
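To make this concrete, the following minimal sketch shows how a tool and an agent are typically wired together with the open source Strands Agents SDK. The tool, its stubbed catalog data, and the prompts are illustrative placeholders rather than code from the solution described in this post.

from strands import Agent, tool

@tool
def get_course_prerequisites(course_code: str) -> str:
    """Look up prerequisites for a course (stubbed catalog for illustration)."""
    catalog = {"CS201": "CS101", "MATH250": "MATH150"}  # placeholder data
    return catalog.get(course_code.upper(), "No prerequisites found")

# The model plans, decides when to call the tool, and composes the final answer.
advisor = Agent(
    tools=[get_course_prerequisites],
    system_prompt="You are an academic advising assistant for a university.",
)

advisor("What do I need to take before CS201?")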
Amazon Bedrock AgentCore overview
Amazon Bedrock AgentCore is a comprehensive set of enterprise-grade services that help developers quickly and securely deploy and operate AI agents at scale using the framework and model of your choice, hosted on Amazon Bedrock or elsewhere. The services are composable and work with popular open source frameworks and many models, so you don’t have to choose between open source flexibility and enterprise-grade security and reliability.
Amazon Bedrock AgentCore includes modular services that can be used together or independently: Runtime (secure, serverless runtime for deploying and scaling dynamic agents), Gateway (converts APIs and AWS Lambda functions into agent-compatible tools), Memory (manages both short-term and long-term memory), Identity (provides secure access management), and Observability (offers real-time visibility into agent performance).
The key Amazon Bedrock AgentCore service used in this integration is Amazon Bedrock AgentCore Runtime, a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using the open source framework (including LangGraph, CrewAI, and Strands Agents), protocol, and model of your choosing. Amazon Bedrock AgentCore Runtime was built for agentic workloads, with industry-leading extended runtime support, fast cold starts, true session isolation, built-in identity, and support for multimodal payloads. Rather than the typical serverless model where functions spin up, execute, and immediately terminate, Amazon Bedrock AgentCore Runtime provisions dedicated microVMs that can persist for up to 8 hours, enabling sophisticated multi-step agentic workflows where each subsequent call builds on the accumulated context and state from previous interactions within the same session.
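Building on the Strands sketch above, the following is a hedged sketch of how such an agent can be hosted on Amazon Bedrock AgentCore Runtime. It assumes the bedrock-agentcore Python SDK's BedrockAgentCoreApp wrapper and a simple "prompt" payload key, as seen in AWS samples; verify the exact import path and payload shape against the current SDK documentation.

from strands import Agent
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # assumed import path; check the SDK docs

app = BedrockAgentCoreApp()
agent = Agent(system_prompt="You analyze LibreChat usage logs.")  # illustrative agent

@app.entrypoint
def invoke(payload):
    # "prompt" is an assumed request key; AgentCore Runtime forwards the HTTP payload here.
    result = agent(payload.get("prompt", ""))
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()  # serves locally the same entrypoint that the managed runtime invokes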
LibreChat overview
LibreChat has emerged as a leading open source alternative to commercial AI chat interfaces, offering educational institutions a powerful solution for deploying conversational AI at scale. Built with flexibility and extensibility in mind, LibreChat provides several key advantages for higher education:

Multi-model support – LibreChat supports integration with multiple AI providers, so institutions can choose the most appropriate models for different use cases while avoiding vendor lock-in
User management – Robust authentication and authorization systems help institutions manage access across student populations, faculty, and staff with appropriate permissions and usage controls
Conversation management – Students and faculty can organize their AI interactions into projects and topics, creating a more structured learning environment
Customizable interface – The solution can be branded and customized to match institutional identity and specific pedagogical needs

Integration benefits
Integrating Strands Agents with Amazon Bedrock AgentCore and LibreChat creates benefits that extend far beyond what any of these technologies could achieve independently:

Seamless agent experience through familiar interface – LibreChat’s intuitive chat interface becomes a gateway to sophisticated agentic workflows. Users can trigger complex multi-step processes, data analysis, and external system integrations through natural conversation, without needing to learn new interfaces or complex APIs.
Dynamic agent loading and management – Unlike static AI chat implementations, this integration supports dynamic agent loading with access management. New agentic applications can be deployed separately and made available to users without requiring LibreChat updates or downtime, enabling rapid agent development.
Enterprise-grade security and scaling – Amazon Bedrock AgentCore Runtime provides complete session isolation for each user session, where each session runs with isolated CPU, memory, and filesystem resources. This creates complete separation between user sessions, safeguarding stateful agent reasoning processes and helping prevent cross-session data contamination. The service can scale up to thousands of agent sessions in seconds while developers only pay for actual usage, making it ideal for educational institutions that need to support large student populations with varying usage patterns.
Built-in AWS resource integration – Organizations already running infrastructure on AWS can seamlessly connect their existing resources—databases, data lakes, Lambda functions, and applications—to Strands Agents without complex integrations or data movement. Agents can directly access and surface insights through the LibreChat interface, turning existing AWS investments into intelligent, conversational experiences, such as querying an Amazon Relational Database Service (Amazon RDS) database, analyzing data in Amazon Simple Storage Service (Amazon S3), or integrating with existing microservices.
Cost-effective agentic computing – By using LibreChat’s efficient architecture with the Amazon Bedrock AgentCore pay-per-use model, organizations can deploy sophisticated agentic applications without the high fixed costs typically associated with enterprise AI systems. Users only pay for actual agent computation and tool usage.

Agent use cases in higher education settings
The integration of LibreChat with Strands Agents enables numerous educational applications that demonstrate the solution’s versatility and power:

A course recommendation agent can analyze a student’s academic history, current enrollment, and career interests to suggest relevant courses. By integrating with the student information system, the agent can make sure recommendations consider prerequisites, schedule conflicts, and graduation requirements.
A degree progress tracking agent can interact with students and help them understand their specific degree requirements and provide guidance on remaining coursework, elective options, and timeline optimization.
Agents can be configured with access to academic databases and institutional repositories, helping students and faculty discover relevant research papers and resources, providing guidance on academic writing, citation formats, and research methodology specific to different disciplines.
Agents can handle routine student inquiries about registration, deadlines, and campus resources, freeing up staff time for more complex student support needs.

Refer to the following GitHub repo for Strands Agent code examples for educational use cases.
Solution overview
The following architecture diagram illustrates the overall system design for deploying LibreChat with Strands Agents integration. Strands Agents is deployed using Amazon Bedrock AgentCore Runtime, a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using an open source framework including Strands Agents.

The solution architecture includes several key components:

LibreChat core services – The core chat interface runs in an Amazon Elastic Container Service (Amazon ECS) cluster with AWS Fargate, including LibreChat for the user-facing experience, Meilisearch for enhanced search capabilities, and Retrieval Augmented Generation (RAG) API services for document retrieval.
LibreChat supporting infrastructure – This solution uses Amazon Elastic File System (Amazon EFS) for storing Meilisearch indexes and user-uploaded files; Amazon Aurora PostgreSQL-Compatible Edition as the vector database used by the RAG API; Amazon S3 for storing LibreChat configurations; Amazon DocumentDB for user, session, and conversation data management; and AWS Secrets Manager for managing access to these resources.
Strands Agents integration – This solution integrates Strands Agents (hosted on Amazon Bedrock AgentCore Runtime) with LibreChat through custom endpoints using Lambda and Amazon API Gateway. This integration pattern enables dynamic loading of agents in LibreChat for advanced generative AI capabilities. In particular, the solution showcases a user activity analysis agent that draws insights from LibreChat logs.
Authentication and security – The integration between LibreChat and Strands Agents implements a multi-layered authentication approach that maintains security without compromising user experience or administrative simplicity. When a student or faculty member selects a Strands Agent from LibreChat’s interface, the authentication flow operates seamlessly in the background through several coordinated layers:

User authentication – LibreChat handles user login through your institution’s existing authentication system, with comprehensive options including OAuth, LDAP/AD, or local accounts as detailed in the LibreChat authentication documentation.
API Gateway security – After users are authenticated to LibreChat, the system automatically handles API Gateway security by authenticating each request using preconfigured API keys.
Service-to-service authentication – The underlying Lambda function uses AWS Identity and Access Management (IAM) roles to securely invoke Amazon Bedrock AgentCore Runtime where the Strands Agent is deployed.
Resource access control – Strands Agents operate within defined permissions to access only authorized resources.

Deployment process
This solution uses the AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation to handle the deployment through several automated phases. We will use a log analysis agent as an example to demonstrate the deployment process. The agent makes it possible for the admin to perform LibreChat log analysis through natural language queries.
LibreChat is deployed as a containerized service with ECS Fargate clusters and is integrated with supporting services, including virtual private cloud (VPC) networking, Application Load Balancer (ALB), and the complete data layer with Aurora PostgreSQL-Compatible, DocumentDB, Amazon EFS, and Amazon S3 storage. Security is built in with appropriate IAM roles, security groups, and secrets management.
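To ground this, here is a compact AWS CDK (Python) sketch of just the ECS-on-Fargate piece. The container image tag, sizing, and port are illustrative, and the actual stack in the repository also wires in the data layer, networking, and secrets described above.

# Hedged sketch of the LibreChat ECS-on-Fargate layer using the AWS CDK (Python).
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs, aws_ecs_patterns as ecs_patterns
from constructs import Construct

class LibreChatStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        vpc = ec2.Vpc(self, "LibreChatVpc", max_azs=2)
        cluster = ecs.Cluster(self, "LibreChatCluster", vpc=vpc)
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "LibreChatService",
            cluster=cluster,
            cpu=1024,
            memory_limit_mib=2048,
            desired_count=2,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("ghcr.io/danny-avila/librechat:latest"),  # illustrative tag
                container_port=3080,  # LibreChat's default port
            ),
        )

app = App()
LibreChatStack(app, "LibreChatStack")
app.synth()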
The user activity analysis agent provides valuable insights into how students interact with AI tools, identifying peak usage times, popular topics, and potential areas where students might need additional support. The agent is automatically provisioned by a CloudFormation template that deploys the Strands agent on Amazon Bedrock AgentCore Runtime, provisions a Lambda function that invokes the agent, creates an API Gateway endpoint that exposes the agent as a URL, and adds a second Lambda function that accesses LibreChat logs stored in DocumentDB. The second Lambda function is used as a tool of the agent.
The following code shows how to configure LibreChat to make the agent a custom endpoint:

custom:
  - name: 'log-analysis-assitant'
    apiKey: '{AWS_API_GATEWAY_KEY}'
    baseURL: '{AWS_API_GATEWAY_URL}'
    models:
      default: ['Strands Agent']
      fetch: false
    headers:
      x-api-key: '{AWS_API_GATEWAY_KEY}'
    titleConvo: true
    titleModel: 'us.amazon.nova-lite-v1:0'
    modelDisplayLabel: 'log-analysis-assitant'
    forcePrompt: false
    stream: false
    iconURL: 'https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/product-categories/ai-ml/machine-learning/approved/images/256f3da1-3193-441c-b2641f33fdd6.a045b9b4c4f34545e1c79a405140ac0146699835.jpeg'
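Because LibreChat treats this endpoint like any other OpenAI-compatible provider, you can smoke-test the deployed agent endpoint directly before wiring it into the UI. The snippet below is an illustrative check only; the /chat/completions route and payload shape are assumptions about how the Lambda behind API Gateway is exposed, and the URL and key come from the deployed stack's outputs.

import os
import requests

# Values come from the deployed stack's outputs; the route is an assumption about the Lambda's API.
url = os.environ["AWS_API_GATEWAY_URL"].rstrip("/") + "/chat/completions"
headers = {"x-api-key": os.environ["AWS_API_GATEWAY_KEY"]}
payload = {
    "model": "Strands Agent",
    "messages": [{"role": "user", "content": "Summarize LibreChat usage over the last 7 days."}],
    "stream": False,
}

resp = requests.post(url, json=payload, headers=headers, timeout=120)
resp.raise_for_status()
print(resp.json())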

After the stack is deployed successfully, you can log in to LibreChat, select the agent, and start chatting. The following screenshot shows an example question that the user activity analysis agent can help answer, where it reads the LibreChat user activities from DocumentDB and generates an answer.

Deployment considerations and best practices
When deploying this LibreChat and Strands Agents integration, organizations should carefully consider several key factors that can significantly impact both the success of the implementation and its long-term sustainability.
Security and compliance form the foundation of any successful deployment, particularly in educational environments where data protection is paramount. Organizations must implement robust data classification schemes to maintain appropriate handling of sensitive information, and role-based access controls make sure users only access AI capabilities and data appropriate to their roles. Beyond traditional perimeter security, a layered authorization approach becomes critical when deploying AI systems that might access multiple data sources with varying sensitivity levels. This involves implementing multiple authorization checks throughout the application stack, including service-to-service authorization, trusted identity propagation that carries the end-user’s identity through the system components, and granular access controls that evaluate permissions at each data access point rather than relying solely on broad service-level permissions. Such layered security architectures help mitigate risks like prompt injection vulnerabilities and unauthorized cross-tenant data access, making sure that even if one security layer is compromised, additional controls remain in place to protect sensitive educational data. Regular compliance monitoring becomes essential, with automated audits and checks maintaining continued adherence to relevant data protection regulations throughout the system’s lifecycle, while also validating that layered authorization policies remain effective as the system evolves.
Cost management requires a strategic approach that balances functionality with financial sustainability. Organizations must prioritize their generative AI spending based on business impact and criticality while maintaining cost transparency across customer and user segments. Implementing comprehensive usage monitoring helps organizations track AI service consumption patterns and identify optimization opportunities before costs become problematic.
The human element of deployment often proves more challenging than the technical implementation. Faculty training programs should provide comprehensive guidance on integrating AI tools into teaching practices, focusing not just on how to use the tools but on how to use them effectively for educational outcomes. Student onboarding requires clear guidelines and tutorials that promote both effective AI interaction and academic integrity. Perhaps most importantly, establishing continuous feedback loops makes sure the system evolves based on actual user experiences and measured educational outcomes rather than assumptions about what users need.
Successful deployments also require careful attention to the dynamic nature of AI technology. The architecture's support for dynamic agent loading enables organizations to add specialized agents for new departments or use cases without disrupting existing services. Version control systems should maintain different agent versions for testing and gradual rollout of improvements, and performance monitoring should track both technical metrics and user satisfaction to guide continuous improvement efforts.
Conclusion
The integration of LibreChat with Strands Agents represents a significant step forward in democratizing access to advanced AI capabilities in higher education. By combining the accessibility and customization of open source systems with the sophistication and reliability of enterprise-grade AI services, institutions can provide students and faculty with powerful tools that enhance learning, research, and academic success.
This architecture demonstrates that educational institutions don't need to choose between powerful AI capabilities and institutional control. Instead, they can combine the innovation and flexibility of open source solutions with the scalability and reliability of cloud-based AI services. The integration example showcased in this post illustrates the solution's versatility and potential for customization as institutions expand and adapt the solution to meet evolving educational needs.
For future work, the LibreChat system’s Model Context Protocol (MCP) server integration capabilities offer exciting possibilities for enhanced agent architectures. A particularly promising avenue involves wrapping agents as MCP servers, transforming them into standardized tools that can be seamlessly integrated alongside other MCP-enabled agents. This approach would enable educators to compose sophisticated multi-agent workflows, creating highly personalized educational experiences tailored to individual learning styles.
The future of education is about having the right AI tools, properly integrated and ethically deployed, to enhance human learning and achievement through flexible, interoperable, and extensible solutions that can evolve with educational needs.
Acknowledgement
The authors extend their gratitude to Arun Thangavel, Ashish Rawat and Kosti Vasilakakis for their insightful feedback and review of the post.

About the authors
Dr. Changsha Ma is a Senior AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, and spending time with friends and families.
Sudheer Manubolu is a Solutions Architect at Amazon Web Services (AWS). He specializes in cloud architecture, enterprise solutions, and AI/ML implementations. He provides technical and architectural guidance to customers building transformative solutions on AWS, with particular emphasis on leveraging AWS’s AI/ML and container services to drive innovation and operational excellence.
Abhilash Thallapally is a Solutions Architect at AWS helping public sector customers design and build scalable AI/ML solutions using Amazon SageMaker. His work covers a wide range of ML use cases, with a primary interest in computer vision, Generative AI, IoT, and deploying cost optimized solutions on AWS.
Mary Strain leads strategy for artificial intelligence and machine learning for US education at AWS. Mary began her career as a middle school teacher in the Bronx, NY. Since that time, she has held leadership roles in education and public sector technology organizations. Mary has advised K12, higher education, and state and local government on innovative policies and practices in competency based assessment, curriculum design, micro credentials and workforce development initiatives. As an advisor to The Education Design Studio at The University of Pennsylvania, The Coalition of Schools Educating Boys of Color, and The State of NJ AI Task Force, Mary has been on the leading edge of bringing innovative solutions to education for two decades.

Skai uses Amazon Bedrock Agents to significantly improve customer insi …

This post was written with Lior Heber and Yarden Ron of Skai.
Skai (formerly Kenshoo) is an AI-driven omnichannel advertising and analytics platform designed for brands and agencies to plan, launch, optimize, and measure paid media across search, social, retail media marketplaces and other “walled-garden” channels from a single interface. By unifying data from over 100 publishers and retail networks, Skai applies real-time analytics, predictive modeling, and incremental testing to surface budget and bidding recommendations, connect media spend to sales outcomes, and reduce channel silos, giving marketers full-funnel visibility and higher return on ad spend at scale.
Skai recognized that our customers were spending days (sometimes weeks) manually preparing reports, struggling to query complex datasets, and lacking intuitive visualization tools. Traditional analytics platforms required technical expertise, leaving many users overwhelmed by untapped data potential. But through the partnership with AWS and the adoption of Amazon Bedrock Agents, AI assistants that can autonomously perform complex, multi-step tasks by orchestrating calls to APIs, we've redefined what's possible. Now, customers can analyze their data in natural language, generate reports in minutes instead of days, and visualize insights through natural language conversation.
In this post, we share how Skai used Amazon Bedrock Agents to improve data access and analysis and improve customer insights.
Challenges with data analytics
Before adopting Amazon Bedrock Agents, Skai's customers accessed their data through tables, charts, and predefined business questions. Campaign manager teams looking to do deep research on their data would spend around 1.5 days a week preparing static reports, while individual users struggled to connect the dots between their massive number of data points. Critical business questions, such as where a client should spend their time optimizing campaigns and how, remained hidden in unstructured knowledge and siloed data points.
We identified three systematic challenges:

Time-consuming report generation – Grids display flat and grouped data at specific entity levels, like campaigns, ads, products, and keywords. However, gaining a comprehensive understanding by connecting these different entities and determining relevant time frames is time-consuming. Users must manipulate raw data to construct a complete narrative.
Summarization – Analyzing extracted raw data posed significant challenges in understanding, identifying key patterns, summarizing complex datasets, and drawing insightful conclusions. Users lacked intuitive tools to dynamically explore data dimensions, hindering their ability to gain a holistic view and extract crucial insights for informed decisions.
Recommendations – Presenting data-driven recommendations to stakeholders with varying understanding requires deep data analysis, anticipating perspectives, and clear, persuasive communication to demonstrate ROI and facilitate informed decisions.

How Celeste powered transformation
To address the challenges of time-consuming report generation, the difficulty in summarizing complex data, and the need for data-driven recommendations, Skai used AWS to build Celeste, a generative AI agent. With AI agents, users can ask questions in natural language, and the agent automatically collects data from multiple sources, synthesizes it into a cohesive narrative with actionable insights, and provides data-oriented recommendations.
The Skai Platform absorbs an enormous amount of data about product searches across many retailers and traditional search engines. Sorting through this data can be time-consuming, but the capabilities in Celeste can make this type of exploratory research much easier.
Skai's solution uses Amazon Bedrock Agents to create an AI-driven analytics assistant that transforms how users interact with complex advertising data. The system processes natural language queries such as "Compare ad group performance across low-performing campaigns in Q1," eliminating the need for a database specialist. The agent automatically joins Skai's datasets covering profiles, campaigns, ads, products, keywords, and search terms across multiple advertising publishers. Beyond simple data retrieval, the assistant generates comprehensive insights and case studies while providing actionable recommendations on campaign activity, complete with detailed analytical approaches and ready-to-present stakeholder materials.
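Under the hood, a request like this reaches the agent through the Bedrock InvokeAgent API. The following boto3 sketch illustrates the call pattern only; the agent ID, alias, and region are placeholders rather than Skai's production values.

import uuid
import boto3

# Placeholder identifiers; replace with the agent ID and alias from your own deployment.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = runtime.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",
    agentAliasId="AGENT_ALIAS_PLACEHOLDER",
    sessionId=str(uuid.uuid4()),
    inputText="Compare ad group performance across low-performing campaigns in Q1",
)

# invoke_agent returns an event stream; concatenate the completion chunks into the answer.
answer = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        answer += chunk["bytes"].decode("utf-8")
print(answer)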
For example, consider the following question: “I’m launching a new home security product and want to activate 3 new Sponsored Product campaigns and 2 new Sponsored Brand campaigns on Amazon. What high-performing keywords and their match types are already running in other campaigns that would be good to include in these new activations?”
When asked this question with real client data, Celeste answered quickly, finding a combination of branded and generic category terms that the manufacturer might consider for this new product launch. With just a few follow-up questions, Celeste was able to provide estimated CPCs, budgets, and a high-level testing plan for these hypothetical campaigns, complete with negative keywords to reduce unnecessary conflict with their existing campaigns.
This is a great example of an exploratory question that requires summary analysis, identification of trends and insights, and recommendations. Skai data directly supports these kinds of analyses, and the capabilities within Celeste give the agent the intelligence to provide smart recommendations. Amazon Bedrock makes this possible because it gives Celeste access to strong foundation models (FMs) without exposing clients to the risk of having those models' vendors use sensitive questions for purposes outside of supporting the client directly. Celeste reduces the time needed to build client case studies by 75% on average, transforming a process that often took weeks into one requiring only minutes.
Accelerating time-to-value through managed AI using Amazon Bedrock
One critical element of Skai’s success story was our deliberate choice of Amazon Bedrock as the foundational AI service. Unlike alternatives requiring extensive infrastructure setup and model management, Amazon Bedrock provided a frictionless path from concept to production.
The journey began with a simple question: How can we use generative AI to provide our clients a new and improved experience without building AI infrastructure from scratch? With Amazon Bedrock, Skai could experiment within hours and deliver a working proof of concept in days. The team could test multiple FMs (Anthropic’s Claude, Meta’s Llama, and Amazon Nova) without managing separate environments and iterate rapidly through Amazon Bedrock Agents.
One developer noted, “We went from whiteboard to a working prototype in a single sprint. With traditional approaches, we’d still be configuring infrastructure.”
With Amazon Bedrock Agents, Skai could prioritize customer value and rapid iteration over infrastructure complexity. The managed service minimized DevOps overhead for model deployment and scaling while alleviating the need for specialized ML expertise in FM tuning. This helped the team concentrate on data integration and customer-specific analytics patterns, using cost-effective on-demand models at scale while making sure client data remained private and secure. With Amazon Bedrock Agents, domain experts can focus exclusively on what matters most: translating customer data challenges into actionable insights.
Benefits of Amazon Bedrock Agents
The introduction of Amazon Bedrock Agents dramatically simplified Skai's architecture while reducing the need to build custom code. Built-in action groups replaced thousands of lines of custom integration code that would have required weeks of development time. The platform's native memory and session management capabilities meant the team could focus on business logic rather than infrastructure concerns. Declarative API definitions reduced integration time from weeks to hours. Additionally, the integrated code interpreter simplified the handling of mathematical operations and helped address accuracy and scale issues.
As a solution provider serving many customers, security and compliance were non-negotiable for Skai. Amazon Bedrock addressed these requirements by inheriting AWS's comprehensive compliance certifications, including HIPAA, SOC 2, and ISO 27001. Its commitment to not retaining data for model training proved critical for protecting sensitive customer information, while its seamless integration with existing AWS Identity and Access Management (IAM) policies and VPC configurations simplified deployment.
During every client demonstration of Celeste, initial inquiries consistently centered on privacy, security, and the protection of proprietary data. With the AWS infrastructure, Skai confidently assured clients that their data would not be used to train any models, effectively distinguishing Skai from its competitors.
With the pay-as-you-go model, Skai scaled economically without AI infrastructure investment. The team avoided costly upfront commitments to GPU clusters or specialized instances, instead leveraging automatic scaling based on actual usage patterns. This approach provided granular cost attribution to specific agents, allowing Skai to understand and optimize spending at a detailed level. The flexibility to select the most appropriate model for each specific task further optimized both performance and costs, ensuring resources aligned precisely with business needs.
AWS Enterprise Support as a strategic partner in AI innovation
Working with cutting-edge generative AI agents presents unique challenges that extend far beyond traditional technical support needs. When building Celeste, Skai encountered complex scenarios where solutions didn’t emerge as expected, from managing 200,000-token conversations to optimizing latency in multi-step agent workflows. AWS Enterprise Support proved invaluable as a strategic partner rather than just a support service.
AWS Enterprise Support provided dedicated Technical Account Management (TAM) and Solutions Architect (SA) services that went well beyond reactive problem-solving. Our TAM and SA became an extension of our engineering team, offering the following:

Regular architectural reviews to optimize our Amazon Bedrock Agents implementation
Proactive monitoring recommendations that helped us identify potential bottlenecks before they impacted customer experience
Direct access to AWS service teams when we needed deep technical expertise on the advanced features of Amazon Bedrock Agents
Strategic guidance and optimization as we scaled from prototype to production

When complex issues arose, such as our initial 90-second (or more) latency challenges or session management complexities, Enterprise Support provided immediate escalation paths and expert consultation.
This comprehensive support framework was instrumental in achieving our aggressive KPIs and time-to-market goals. The combination of proactive guidance, rapid issue resolution, and strategic partnership helped us achieve the following:

Reduce proof of concept to production timeline by 50%
Maintain 99.9% uptime during critical customer demonstrations
Scale confidently, knowing we had enterprise-grade support backing our innovation

The value of Enterprise Support provided the confidence and partnership necessary to build our product roadmap on emerging AI technologies, knowing AWS was fully committed to the success of Celeste.
Solution overview
The following diagram illustrates the solution architecture.

Our Amazon Bedrock Agents implementation relies on several core components.
First, a custom layer comprises the following:

Customer Experience UI (CX UI) – The frontend interface that users interact with to submit questions and view responses
Chat Manager – Orchestrates the conversation flow, manages session state, and handles the communication between the UI and the processing layer
Chat Executor – Receives processed requests from the Chat Manager, interfaces with the Amazon Bedrock agent, handles the business logic for determining when and how to invoke the agent, and manages the overall conversation workflow and short-term memory

Second, we used the following in conjunction with Amazon Bedrock:

Amazon Bedrock agent – An orchestrator that receives queries from Chat Executor, determines which tools to invoke based on the query, and manages the tool invocation process.
Anthropic’s Claude 3.5 Sonnet V2 – The FM that generates natural language responses. The model generates queries for the API and processes the structured data returned by tools. It creates coherent, contextual answers for users.

Finally, the data layer consists of the following:

Tool API – A custom API that receives tool invocation requests from the Amazon Bedrock agent and queries the customer data
Customer data – The data storage containing sensitive customer information that remains isolated from Amazon Bedrock

The solution also includes the following key security measures:

Data isolation is enforced between the Tool API and Amazon Bedrock agent
Raw customer data is never shared
Skai can maintain data privacy and compliance requirements

Overcoming critical challenges
Implementing the solution brought with it a few key challenges.
Firstly, early prototypes suffered from 90-second (or more) response times when chaining multiple agents and APIs. By adopting a custom orchestrator and streaming, we reduced median latency by 30%, as illustrated in the following table.

Approach              Average Latency (seconds)   P90   P99
Baseline                      136                 194   215
Optimized Workflow             44                 102   102

Secondly, customers frequently analyzed multi-year datasets, exceeding Anthropic Claude’s 50,000-token window. Our solution uses dynamic session chunking to split conversations while retaining key context, and employs Retrieval Augmented Generation (RAG)-based memory retrieval.
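The sketch below illustrates the chunking idea in its simplest form: conversation turns are packed into chunks under an approximate token budget, and the prompt combines the most recent chunk with snippets retrieved from older ones. The token counting and retrieval here are deliberately naive placeholders, not the production logic.

def chunk_conversation(turns, max_tokens=4000):
    """Greedily pack conversation turns into chunks under an approximate token budget."""
    chunks, current, used = [], [], 0
    for turn in turns:
        tokens = len(turn.split())  # crude whitespace-based token estimate
        if current and used + tokens > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(turn)
        used += tokens
    if current:
        chunks.append(current)
    return chunks

def build_prompt(chunks, retrieved_snippets):
    """Keep the latest chunk verbatim and splice in context retrieved from older chunks."""
    recent = "\n".join(chunks[-1])
    context = "\n".join(retrieved_snippets)  # e.g., top hits from a RAG memory store
    return f"Relevant earlier context:\n{context}\n\nCurrent conversation:\n{recent}"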
Lastly, we implemented the following measures for error handling at scale:

Real-time tracing using WatchDog with Amazon CloudWatch Logs Insights to monitor more than 230 agent metrics
A retry mechanism, in which failed API calls with the 500 error "BEDROCK_MODEL_INVOCATION_SERVICE_UNAVAILABLE" are automatically retried (a minimal sketch follows this list)
Amazon CloudWatch monitoring and alerting
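A minimal version of that retry wrapper might look like the following; the retryable error codes and backoff values are illustrative assumptions, not the exact production configuration.

import time
import botocore.exceptions

# Error codes treated as transient here are assumptions for illustration.
RETRYABLE_CODES = {"BEDROCK_MODEL_INVOCATION_SERVICE_UNAVAILABLE", "ServiceUnavailableException"}

def invoke_with_retry(call, max_attempts=3, base_delay=1.0):
    """Run a Bedrock API call, retrying with exponential backoff on retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except botocore.exceptions.ClientError as err:
            code = err.response.get("Error", {}).get("Code", "")
            if code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage with placeholder parameters: invoke_with_retry(lambda: runtime.invoke_agent(**params))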

Business results
Since deploying with AWS, Skai’s platform has achieved significant results, as shown in the following table.

Metric                           Improvement
Report Generation Time           50% Faster
Case Study Generation Time       75% Faster
QBR Composition Time             80% Faster
Report to Recommendation Time    90% Faster

While the metrics above demonstrate measurable improvements, the true business impact becomes clear through customer feedback. The core challenges Skai addressed (time-consuming report generation, complex data analysis, and the need for actionable recommendations) have been resolved in ways that fundamentally changed how users work with advertising data.
Customer testimonials
“It’s made my life easier. It’s made my team’s life easier. It’s made my clients’ lives easier and better. So we all work in jobs where there’s millions and millions of data points to scour through every day, and being able to do that as efficiently as possible and as fluidly as possible with Celeste AI is always a welcome addition to Skai.” – Aram Howard, Amazon Advertising Executive, Data Analyst | Channel Bakers
“Celeste is saving hours of time. It’s like having another set of eyes to give suggestions. I’m so stoked to see where this could take us.” – Erick Rudloph, Director of Search Marketing, Xcite Media Group
“It truly feels like having a data scientist right next to me to answer questions, even with recommendations for starting an optimization or looking at an account’s performance.” – Director of Search Marketing at Media Agency
Looking ahead: The future of Celeste
We’re expanding Celeste’s capabilities in the following areas:

Personalizing the user experience, retaining memories and preferences across multiple sessions.
Ingestion of custom data assets, so the client can bring their own data into Celeste and seamlessly connect it to Celeste’s existing data and knowledge.
New tools for seamless team integration. These tools will allow Celeste to generate client presentations, build data dashboards, and provide timely notifications.

Conclusion
With Amazon Bedrock Agents, Skai transformed raw data into strategic assets, helping customers make faster, smarter decisions without technical bottlenecks. By combining a robust AWS AI/ML infrastructure with our domain expertise, we’ve created a blueprint other organizations can follow to democratize data analytics.
What truly set our journey apart was the ease with which Amazon Bedrock helped us transition from concept to production. Rather than building complex AI infrastructure, we used a fully managed service that let us focus on our core strengths: understanding customer data challenges and delivering insights at scale. The decision to use Amazon Bedrock resulted in considerable business acceleration, helping us deliver value in weeks rather than quarters while maintaining production grade security, performance, and reliability.
Skai’s journey with Amazon Bedrock continues—follow our series for updates on multi-agent systems and other generative innovations.

About the authors
Lior Heber is the Al Lead Architect at Skai, where he has spent over a decade shaping the company’s technology with a focus on innovation, developer experience, and intelligent Ul design. With a strong background in software architecture and Al-driven solutions, Lior has led transformative projects that push the boundaries of how teams build and deliver products. Beyond his work in tech, he co-founded Colorful Family, a project creating children’s books for diverse families. Lior combines technical expertise with creativity, always looking for ways to bridge technology and human experience.
Yarden Ron is a Software Development Team Lead at Skai, bringing over four years of leadership and engineering experience to the AI-powered commerce media platform. He recently spearheaded the launch of Celeste AI – a GenAI agent designed to revolutionize how marketers engage with their platforms by making insights faster, smarter, and more intuitive. Based in Israel, Yarden blends technical acumen with collaborative drive, leading teams that turn innovative ideas into impactful products.
Tomer Berkovich is a Technical Account Manager at AWS with a specialty focus on Generative AI and Machine Learning. He brings over two decades of technology, engineering, and architecture experience to help organizations navigate their AI/ML journey on AWS. When he isn’t working, he enjoys spending time with his family, exploring emerging technologies, and powerlifting while chasing new personal records.
Dov Amir is a Senior Solutions Architect at AWS, bringing over 20 years of experience in Software, cloud and architecture. In his current role, Dov helps customers accelerate cloud adoption and application modernization by leveraging cloud-native technologies and generative AI.
Gili Nachum is a Principal AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models, and how machine learning is changing the world as we know it. In his spare time, Gili enjoys playing table tennis.

How to Create a Bioinformatics AI Agent Using Biopython for DNA and Pr …

In this tutorial, we demonstrate how to build an advanced yet accessible Bioinformatics AI Agent using Biopython and popular Python libraries, designed to run seamlessly in Google Colab. By combining sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searches into a single streamlined class, the tutorial provides a hands-on approach to explore the full spectrum of biological sequence analysis. Users can start with built-in sample sequences such as the SARS-CoV-2 Spike protein, Human Insulin precursor, and E. coli 16S rRNA, or fetch custom sequences directly from NCBI. With built-in visualization tools powered by Plotly and Matplotlib, researchers and students alike can quickly perform comprehensive DNA and protein analyses without needing prior setup beyond a Colab notebook. Check out the FULL CODES here.

!pip install biopython pandas numpy matplotlib seaborn plotly requests beautifulsoup4 scipy scikit-learn networkx
!apt-get update
!apt-get install -y clustalw

import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from Bio import SeqIO, Entrez, Align, Phylo
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import AlignIO  # needed by create_phylogenetic_tree below to read the ClustalW alignment
from Bio.Align.Applications import ClustalwCommandline  # ClustalW wrapper used below (available in classic Biopython releases)
import warnings
warnings.filterwarnings(‘ignore’)

Entrez.email = “your_email@example.com”

We begin by installing essential bioinformatics and data science libraries, along with ClustalW for sequence alignment. We then import Biopython modules, visualization tools, and supporting packages, while setting up Entrez with our email to fetch sequences from NCBI. This ensures our Colab environment is fully prepared for advanced sequence analysis. Check out the FULL CODES here.

class BioPythonAIAgent:
def __init__(self, email=”your_email@example.com”):
self.email = email
Entrez.email = email
self.sequences = {}
self.analysis_results = {}
self.alignments = {}
self.trees = {}

def fetch_sequence_from_ncbi(self, accession_id, db=”nucleotide”, rettype=”fasta”):
try:
handle = Entrez.efetch(db=db, id=accession_id, rettype=rettype, retmode=”text”)
record = SeqIO.read(handle, “fasta”)
handle.close()
self.sequences[accession_id] = record
return record
except Exception as e:
print(f”Error fetching sequence: {str(e)}”)
return None

def create_sample_sequences(self):
covid_spike = “MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT”

human_insulin = “MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN”

e_coli_16s = “AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGCGTTAAGGTTAATAACCTTGGCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA”

sample_sequences = [
(“COVID_Spike”, covid_spike, “SARS-CoV-2 Spike Protein”),
(“Human_Insulin”, human_insulin, “Human Insulin Precursor”),
(“E_coli_16S”, e_coli_16s, “E. coli 16S rRNA”)
]

for seq_id, seq_str, desc in sample_sequences:
record = SeqRecord(Seq(seq_str), id=seq_id, description=desc)
self.sequences[seq_id] = record

return sample_sequences

def analyze_sequence(self, sequence_id=None, sequence=None):
if sequence_id and sequence_id in self.sequences:
seq_record = self.sequences[sequence_id]
seq = seq_record.seq
description = seq_record.description
elif sequence:
seq = Seq(sequence)
description = “Custom sequence”
else:
return None

analysis = {
‘length’: len(seq),
‘composition’: {}
}

for base in [‘A’, ‘T’, ‘G’, ‘C’]:
analysis[‘composition’][base] = seq.count(base)

if ‘A’ in analysis[‘composition’] and ‘T’ in analysis[‘composition’]:
analysis[‘gc_content’] = round(gc_fraction(seq) * 100, 2)
try:
analysis[‘molecular_weight’] = round(molecular_weight(seq, seq_type=’DNA’), 2)
except:
analysis[‘molecular_weight’] = len(seq) * 650

try:
if len(seq) % 3 == 0:
protein = seq.translate()
analysis[‘translation’] = str(protein)
analysis[‘stop_codons’] = protein.count(‘*’)

if ‘*’ not in str(protein)[:-1]:
prot_analysis = ProteinAnalysis(str(protein)[:-1])
analysis[‘protein_mw’] = round(prot_analysis.molecular_weight(), 2)
analysis[‘isoelectric_point’] = round(prot_analysis.isoelectric_point(), 2)
analysis[‘protein_composition’] = prot_analysis.get_amino_acids_percent()
except:
pass

key = sequence_id if sequence_id else “custom”
self.analysis_results[key] = analysis

return analysis

def visualize_composition(self, sequence_id):
if sequence_id not in self.analysis_results:
return

analysis = self.analysis_results[sequence_id]

fig = make_subplots(
rows=2, cols=2,
specs=[[{“type”: “pie”}, {“type”: “bar”}],
[{“colspan”: 2}, None]],
subplot_titles=(“Nucleotide Composition”, “Base Count”, “Sequence Properties”)
)

labels = list(analysis[‘composition’].keys())
values = list(analysis[‘composition’].values())

fig.add_trace(
go.Pie(labels=labels, values=values, name=”Composition”),
row=1, col=1
)

fig.add_trace(
go.Bar(x=labels, y=values, name=”Count”, marker_color=[‘red’, ‘blue’, ‘green’, ‘orange’]),
row=1, col=2
)

properties = [‘Length’, ‘GC%’, ‘MW (kDa)’]
prop_values = [
analysis[‘length’],
analysis.get(‘gc_content’, 0),
analysis.get(‘molecular_weight’, 0) / 1000
]

fig.add_trace(
go.Scatter(x=properties, y=prop_values, mode=’markers+lines’,
marker=dict(size=10, color=’purple’), name=”Properties”),
row=2, col=1
)

fig.update_layout(
title=f”Comprehensive Analysis: {sequence_id}”,
showlegend=False,
height=600
)

fig.show()

def perform_multiple_sequence_alignment(self, sequence_ids):
if len(sequence_ids) < 2:
return None

sequences = []
for seq_id in sequence_ids:
if seq_id in self.sequences:
sequences.append(self.sequences[seq_id])

if len(sequences) < 2:
return None

from Bio.Align import PairwiseAligner
aligner = PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

alignments = []
for i in range(len(sequences)):
for j in range(i+1, len(sequences)):
alignment = aligner.align(sequences[i].seq, sequences[j].seq)[0]
alignments.append(alignment)

return alignments

def create_phylogenetic_tree(self, alignment_key=None, sequences=None):
if alignment_key and alignment_key in self.alignments:
alignment = self.alignments[alignment_key]
elif sequences:
records = []
for i, seq in enumerate(sequences):
record = SeqRecord(Seq(seq), id=f”seq_{i}”)
records.append(record)
SeqIO.write(records, “temp.fasta”, “fasta”)

try:
clustalw_cline = ClustalwCommandline(“clustalw2″, infile=”temp.fasta”)
stdout, stderr = clustalw_cline()
alignment = AlignIO.read(“temp.aln”, “clustal”)
os.remove(“temp.fasta”)
os.remove(“temp.aln”)
os.remove(“temp.dnd”)
except:
return None
else:
return None

calculator = DistanceCalculator(‘identity’)
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)

tree_key = f”tree_{len(self.trees)}”
self.trees[tree_key] = tree

return tree

def visualize_tree(self, tree):
fig, ax = plt.subplots(figsize=(10, 6))
Phylo.draw(tree, axes=ax)
plt.title(“Phylogenetic Tree”)
plt.tight_layout()
plt.show()

def protein_structure_analysis(self, sequence_id):
if sequence_id not in self.sequences:
return None

seq = self.sequences[sequence_id].seq

try:
if len(seq) % 3 == 0:
protein = seq.translate()
if ‘*’ not in str(protein)[:-1]:
prot_analysis = ProteinAnalysis(str(protein)[:-1])

structure_analysis = {
‘molecular_weight’: prot_analysis.molecular_weight(),
‘isoelectric_point’: prot_analysis.isoelectric_point(),
‘amino_acid_percent’: prot_analysis.get_amino_acids_percent(),
‘secondary_structure’: prot_analysis.secondary_structure_fraction(),
‘flexibility’: prot_analysis.flexibility(),
‘gravy’: prot_analysis.gravy()
}

return structure_analysis
except:
pass

return None

def comparative_analysis(self, sequence_ids):
results = []

for seq_id in sequence_ids:
if seq_id in self.analysis_results:
analysis = self.analysis_results[seq_id].copy()
analysis[‘sequence_id’] = seq_id
results.append(analysis)

df = pd.DataFrame(results)

if len(df) > 1:
fig = make_subplots(
rows=2, cols=2,
subplot_titles=(“Length Comparison”, “GC Content”, “Molecular Weight”, “Composition Heatmap”)
)

fig.add_trace(
go.Bar(x=df[‘sequence_id’], y=df[‘length’], name=”Length”),
row=1, col=1
)

if ‘gc_content’ in df.columns:
fig.add_trace(
go.Scatter(x=df[‘sequence_id’], y=df[‘gc_content’], mode=’markers+lines’, name=”GC%”),
row=1, col=2
)

if ‘molecular_weight’ in df.columns:
fig.add_trace(
go.Bar(x=df[‘sequence_id’], y=df[‘molecular_weight’], name=”MW”),
row=2, col=1
)

fig.update_layout(title=”Comparative Sequence Analysis”, height=600)
fig.show()

return df

def codon_usage_analysis(self, sequence_id):
if sequence_id not in self.sequences:
return None

seq = self.sequences[sequence_id].seq

if len(seq) % 3 != 0:
return None

codons = {}
for i in range(0, len(seq) – 2, 3):
codon = str(seq[i:i+3])
codons[codon] = codons.get(codon, 0) + 1

codon_df = pd.DataFrame(list(codons.items()), columns=[‘Codon’, ‘Count’])
codon_df = codon_df.sort_values(‘Count’, ascending=False)

fig = px.bar(codon_df.head(20), x=’Codon’, y=’Count’,
title=f”Top 20 Codon Usage – {sequence_id}”)
fig.show()

return codon_df

def motif_search(self, sequence_id, motif_pattern):
if sequence_id not in self.sequences:
return []

seq = str(self.sequences[sequence_id].seq)
positions = []

for i in range(len(seq) – len(motif_pattern) + 1):
if seq[i:i+len(motif_pattern)] == motif_pattern:
positions.append(i)

return positions

def gc_content_window(self, sequence_id, window_size=100):
if sequence_id not in self.sequences:
return None

seq = self.sequences[sequence_id].seq
gc_values = []
positions = []

for i in range(0, len(seq) – window_size + 1, window_size//4):
window = seq[i:i+window_size]
gc_values.append(gc_fraction(window) * 100)
positions.append(i + window_size//2)

fig = go.Figure()
fig.add_trace(go.Scatter(x=positions, y=gc_values, mode=’lines+markers’,
name=f’GC Content (window={window_size})’))
fig.update_layout(
title=f”GC Content Sliding Window Analysis – {sequence_id}”,
xaxis_title=”Position”,
yaxis_title=”GC Content (%)”
)
fig.show()

return positions, gc_values

def run_comprehensive_analysis(self, sequence_ids):
results = {}

for seq_id in sequence_ids:
if seq_id in self.sequences:
analysis = self.analyze_sequence(seq_id)
self.visualize_composition(seq_id)

gc_analysis = self.gc_content_window(seq_id)
codon_analysis = self.codon_usage_analysis(seq_id)

results[seq_id] = {
‘basic_analysis’: analysis,
‘gc_window’: gc_analysis,
‘codon_usage’: codon_analysis
}

if len(sequence_ids) > 1:
comparative_df = self.comparative_analysis(sequence_ids)
results[‘comparative’] = comparative_df

return results

We define the BioPythonAIAgent class, which allows us to fetch or create sequences, run core analyses (composition, GC%, translation, and protein properties), and visualize results interactively. We also perform pairwise alignments, build phylogenetic trees, scan motifs, profile codon usage, analyze GC content with sliding windows, and compare multiple sequences, then bundle everything into one comprehensive pipeline. Check out the FULL CODES here.

agent = BioPythonAIAgent()

sample_seqs = agent.create_sample_sequences()

for seq_id, _, _ in sample_seqs:
agent.analyze_sequence(seq_id)

results = agent.run_comprehensive_analysis([‘COVID_Spike’, ‘Human_Insulin’, ‘E_coli_16S’])

print(“BioPython AI Agent Tutorial Complete!”)
print(“Available sequences:”, list(agent.sequences.keys()))
print(“Available methods:”, [method for method in dir(agent) if not method.startswith(‘_’)])

We instantiate the BioPythonAIAgent, generate sample sequences (COVID Spike, Human Insulin, and E. coli 16S), and run a full analysis pipeline. The outputs confirm that our agent successfully performs nucleotide, codon, and GC-content analyses while also preparing comparative visualizations. Finally, we print the list of available sequences and supported methods, indicating that the agent’s full analytical capabilities are now ready for use. Check out the FULL CODES here.
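To go beyond the bundled samples, we can also pull a record directly from NCBI with the agent's fetch_sequence_from_ncbi helper. The accession used below is only an illustrative example, and Entrez requires your real email address (set in the earlier setup cell).

# Optional: fetch a record from NCBI instead of using the bundled samples.
acc = "NM_000207"  # illustrative accession; any valid NCBI nucleotide ID works
record = agent.fetch_sequence_from_ncbi(acc, db="nucleotide", rettype="fasta")
if record:
    agent.analyze_sequence(acc)        # stored under the accession key
    agent.visualize_composition(acc)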

agent.visualize_composition(‘COVID_Spike’)
agent.gc_content_window(‘E_coli_16S’, window_size=50)
agent.codon_usage_analysis(‘COVID_Spike’)

comparative_df = agent.comparative_analysis([‘COVID_Spike’, ‘Human_Insulin’, ‘E_coli_16S’])
print(comparative_df)

motif_positions = agent.motif_search(‘COVID_Spike’, ‘ATG’)
print(f”ATG start codons found at positions: {motif_positions}”)

tree = agent.create_phylogenetic_tree(sequences=[
str(agent.sequences[‘COVID_Spike’].seq[:300]),
str(agent.sequences[‘Human_Insulin’].seq[:300]),
str(agent.sequences[‘E_coli_16S’].seq[:300])
])

if tree:
agent.visualize_tree(tree)

We visualize nucleotide composition, scan E. coli 16S GC% in sliding windows, and profile codon usage for the COVID Spike sequence. We then compare sequences side-by-side, search for the “ATG” motif, and build/plot a quick phylogenetic tree from the first 300 nt of each sequence.

In conclusion, we have a fully functional BioPython AI Agent capable of handling multiple layers of sequence analysis, from basic nucleotide composition to codon usage profiling, GC-content sliding windows, motif searches, and even comparative analyses across species. The integration of visualization and phylogenetic tree construction provides both intuitive and in-depth insights into genetic data. Whether for academic projects, bioinformatics education, or research prototyping, this Colab-friendly workflow showcases how open-source tools like Biopython can be harnessed with modern AI-inspired pipelines to simplify and accelerate biological data exploration.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis appeared first on MarkTechPost.

Meta Superintelligence Labs Introduces REFRAG: Scaling RAG with 16× L …


A team of researchers from Meta Superintelligence Labs, National University of Singapore and Rice University has unveiled REFRAG (REpresentation For RAG), a decoding framework that rethinks retrieval-augmented generation (RAG) efficiency. REFRAG extends LLM context windows by 16× and achieves up to a 30.85× acceleration in time-to-first-token (TTFT) without compromising accuracy.

Why is long context such a bottleneck for LLMs?

The attention mechanism in large language models scales quadratically with input length. If a document is twice as long, the compute and memory cost can grow fourfold. This not only slows inference but also increases the size of the key-value (KV) cache, making large-context applications impractical in production systems. In RAG settings, most retrieved passages contribute little to the final answer, but the model still pays the full quadratic price to process them.
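To make this concrete, here is a back-of-the-envelope sketch (our own illustration, with assumed layer counts and head dimensions rather than any particular model's) of how attention compute and KV-cache size grow with context length.

def attention_cost(seq_len: int, n_layers: int = 32, n_heads: int = 32, head_dim: int = 128) -> dict:
    d_model = n_heads * head_dim
    # The QK^T matmul and the attention-weighted V matmul each cost roughly
    # seq_len^2 * d_model FLOPs per layer, so cost grows quadratically with length.
    attn_flops = 2 * n_layers * (seq_len ** 2) * d_model
    # KV cache: two tensors (K and V) of shape [seq_len, d_model] per layer, fp16 (2 bytes each).
    kv_cache_bytes = 2 * n_layers * seq_len * d_model * 2
    return {"attn_flops": attn_flops, "kv_cache_gb": kv_cache_bytes / 1024**3}

for n in (4_096, 8_192, 16_384):
    print(n, attention_cost(n))

Doubling the sequence length roughly quadruples the attention FLOPs while the KV cache doubles, which is exactly the cost RAG systems pay for passages that often go unused.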

How does REFRAG compress and shorten context?

REFRAG introduces a lightweight encoder that splits retrieved passages into fixed-size chunks (e.g., 16 tokens) and compresses each into a dense chunk embedding. Instead of feeding thousands of raw tokens, the decoder processes this shorter sequence of embeddings. The result is a 16× reduction in sequence length, with no change to the LLM architecture.
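A toy sketch helps illustrate the mechanism: a small encoder pools each fixed-size chunk into a single vector and projects it into the decoder's embedding space. The class name, dimensions, and mean-pooling choice below are illustrative assumptions for this tutorial, not REFRAG's actual architecture.

import torch
import torch.nn as nn

class ChunkCompressor(nn.Module):
    """Toy sketch: compress each k-token chunk of retrieved context into one embedding."""
    def __init__(self, vocab_size=32000, d_enc=256, d_model=4096, chunk_size=16):
        super().__init__()
        self.chunk_size = chunk_size
        self.embed = nn.Embedding(vocab_size, d_enc)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_enc, nhead=4, batch_first=True), num_layers=2
        )
        self.project = nn.Linear(d_enc, d_model)  # map into the decoder's embedding space

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len]; seq_len assumed to be a multiple of chunk_size
        b, t = token_ids.shape
        chunks = token_ids.view(b * (t // self.chunk_size), self.chunk_size)
        h = self.encoder(self.embed(chunks))   # [n_chunks, chunk_size, d_enc]
        pooled = h.mean(dim=1)                  # one vector per chunk
        return self.project(pooled).view(b, t // self.chunk_size, -1)

comp = ChunkCompressor()
ctx = torch.randint(0, 32000, (1, 4096))
print(comp(ctx).shape)  # torch.Size([1, 256, 4096]) -> a 16x shorter sequence for the decoder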

https://arxiv.org/pdf/2509.01092

How is acceleration achieved?

By shortening the decoder’s input sequence, REFRAG reduces the quadratic attention computation and shrinks the KV cache. Empirical results show 16.53× TTFT acceleration at k=16 and 30.85× acceleration at k=32, far surpassing prior state-of-the-art CEPE (which achieved only 2–8×). Throughput also improves by up to 6.78× compared to LLaMA baselines.

How does REFRAG preserve accuracy?

A reinforcement learning (RL) policy supervises compression. It identifies the most information-dense chunks and allows them to bypass compression, feeding raw tokens directly into the decoder. This selective strategy ensures that critical details—such as exact numbers or rare entities—are not lost. Across multiple benchmarks, REFRAG maintained or improved perplexity compared to CEPE while operating at far lower latency.
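The selection mechanics can be sketched as follows; note that in REFRAG the chunk scores come from a learned RL policy, whereas here they are random placeholders used only to show how the highest-scoring chunks bypass compression.

import torch

def mix_compressed_and_raw(chunk_embeddings, raw_chunk_tokens, scores, expand_fraction=0.25):
    """Keep raw tokens for the highest-scoring chunks; use compressed embeddings elsewhere.

    scores: per-chunk information-density scores (learned by an RL policy in REFRAG;
    placeholders here).
    """
    n_expand = max(1, int(expand_fraction * len(raw_chunk_tokens)))
    keep_raw = set(torch.topk(scores, n_expand).indices.tolist())
    decoder_input = []
    for i, emb in enumerate(chunk_embeddings):
        if i in keep_raw:
            decoder_input.extend(raw_chunk_tokens[i])  # pass exact tokens through
        else:
            decoder_input.append(emb)                   # one embedding stands in for the chunk
    return decoder_input

chunks = torch.randn(8, 4096)
raw = [[101 + j for j in range(16)] for _ in range(8)]
mixed = mix_compressed_and_raw(chunks, raw, scores=torch.rand(8))
print(len(mixed))  # shorter than 8*16 raw tokens, longer than 8 pure embeddings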

What do the experiments reveal?

REFRAG was pretrained on 20B tokens from the SlimPajama corpus (Books + arXiv) and tested on long-context datasets including Book, Arxiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversation tasks, and long-document summarization, REFRAG consistently outperformed strong baselines:

16× context extension beyond standard LLaMA-2 (4k tokens).

~9.3% perplexity improvement over CEPE across four datasets.

Better accuracy in weak retriever settings, where irrelevant passages dominate, due to the ability to process more passages under the same latency budget.

Summary

REFRAG shows that long-context LLMs don’t have to be slow or memory-hungry. By compressing retrieved passages into compact embeddings, selectively expanding only the important ones, and rethinking how RAG decoding works, Meta Superintelligence Labs has made it possible to process much larger inputs while running dramatically faster. This makes large-context applications—like analyzing entire reports, handling multi-turn conversations, or scaling enterprise RAG systems—not only feasible but efficient, without compromising accuracy.

FAQs

Q1. What is REFRAG? REFRAG (REpresentation For RAG) is a decoding framework from Meta Superintelligence Labs that compresses retrieved passages into embeddings, enabling faster and longer-context inference in LLMs.

Q2. How much faster is REFRAG compared to existing methods? REFRAG delivers up to 30.85× faster time-to-first-token (TTFT) and 6.78× throughput improvement compared to LLaMA baselines, while outperforming CEPE.

Q3. Does compression reduce accuracy? No. A reinforcement learning policy ensures critical chunks remain uncompressed, preserving key details. Across benchmarks, REFRAG maintained or improved accuracy relative to prior methods.

Q4. Where will the code be available? Meta Superintelligence Labs will release REFRAG on GitHub at facebookresearch/refrag.

Check out the PAPER here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Meta Superintelligence Labs Introduces REFRAG: Scaling RAG with 16× Longer Contexts and 31× Faster Decoding appeared first on MarkTechPost.

Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model w …

Latvian language-tech firm Tilde has released TildeOpen LLM, an open-source foundational large language model (LLM) purpose-built for European languages, with a sharp focus on under-represented and smaller national and regional languages. It’s a strategic leap toward linguistic equity and digital sovereignty within the EU.

Under the Hood: Architecture, Training and Governance

The public release occurred on September 3, 2025, when Tilde deployed the model free to users via Hugging Face.

Built as a 30-billion-parameter dense decoder-only transformer, the model is available under a permissive license (CC-BY-4.0) and includes broad language support—from Latvian and Lithuanian to Ukrainian, Turkish, and beyond.

Training occurred on the EU’s supercomputers: LUMI (Finland) and JUPITER, tapping into 2 million GPU hours awarded via the European Commission’s Large AI Grand Challenge.

Technical details: the model was trained with EleutherAI-inspired GPT-NeoX scripts across 450K updates, consuming ~2 trillion tokens. Training used three-stage sampling: uniform across languages, natural distribution to boost high-data-volume languages, and a final uniform sweep for balance.

Hyperparameters: 60 layers, embedding size 6144, 48 attention heads, 8192-token context window, SwiGLU activations, RoPE positional encoding, RMSNorm layer norms.
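For a quick sanity check, the snippet below collects the reported hyperparameters and applies the standard ~12·L·d² dense-transformer parameter estimate; the constant ignores embeddings and the SwiGLU width choice, so treat it only as an order-of-magnitude check, not Tilde's exact accounting.

# Reported TildeOpen hyperparameters, gathered in one place for rough estimates.
config = {
    "n_layers": 60,
    "d_model": 6144,
    "n_heads": 48,
    "context_window": 8192,
    "activation": "SwiGLU",
    "positional_encoding": "RoPE",
    "norm": "RMSNorm",
}

# Rough dense-transformer estimate: ~12 * n_layers * d_model^2 parameters
# (attention + MLP), ignoring embeddings and norms.
approx_params = 12 * config["n_layers"] * config["d_model"] ** 2
print(f"~{approx_params / 1e9:.1f}B parameters (rough estimate)")  # ~27B, consistent with the ~30B figure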

Language Equity and Data Sovereignty

Mainstream models lean heavily on English and other major languages, causing skewed performance when dealing with Baltic, Slavic, or other smaller European languages. This under-representation leads to poor grammar, awkward phrasing, and hallucinations.

TildeOpen resolves this by embedding an “equitable tokenizer”, engineered to represent text similarly regardless of language—reducing token count and increasing inference efficiency for lesser-represented languages.
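A simple way to check this kind of claim for any tokenizer is to compare token "fertility" (tokens per word) across languages, as in the hedged sketch below; the model identifier is a placeholder, so substitute the actual TildeOpen repository name from Hugging Face, and the sample sentences are our own.

from transformers import AutoTokenizer

# Placeholder model id -- replace with the actual TildeOpen repository name on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-multilingual-model")

samples = {
    "English": "The weather forecast predicts heavy rain tomorrow evening.",
    "Latvian": "Laika prognoze paredz stipru lietu rītvakar.",
    "Lithuanian": "Orų prognozė numato smarkų lietų rytoj vakare.",
}

# Fertility = tokens per whitespace word; lower is cheaper at inference time,
# and an equitable tokenizer should keep this ratio similar across languages.
for lang, text in samples.items():
    n_tokens = len(tokenizer(text)["input_ids"])
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f}")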

Crucially, organizations can self-host—in local data centers or secure EU-compliant clouds—ensuring adherence to GDPR and other data-protection mandates. This addresses sovereignty concerns tied to US- or Asia-hosted models.

Strategic Horizon: From Prototype to European AI Infrastructure

TildeOpen is a foundational “base” model. Tilde expects to build more specialized versions (e.g., instruction-tuned translation models) atop this core.

It’s also a geo-flag planting moment: Latvia, via Tilde, positions itself as a tech exporter, with aspirations to scale European AI infrastructure while preserving linguistic diversity.

For researchers, the release reflects broader findings on multilingual model behavior: gaps still exist. Evaluations show even strong open LLMs can hallucinate or lag in lexical accuracy for Baltic languages, reinforcing the need for localized development.

Summary

TildeOpen LLM reframes EU AI—not just as regulatory compliance, but as technical stewardship. It’s a grounded, high-capacity model with transparent architecture, scalable deployment, and a fierce commitment to linguistic equity. It doesn’t indulge hype; it delivers substance.

FAQs

Q1: What is TildeOpen LLM? TildeOpen is a 30B-parameter multilingual large language model trained on EU supercomputers, optimized for European languages, especially under-represented ones.

Q2: How is it different from mainstream LLMs? Unlike global models that prioritize English, TildeOpen uses an equitable tokenizer and balanced training to ensure fair representation and accuracy across smaller European languages.

Q3: Can organizations self-host the model? Yes. TildeOpen is open-source under CC-BY-4.0 and can be deployed in local data centers or EU-compliant clouds to meet GDPR and data sovereignty requirements.

Q4: What are the main use cases? Government services, translation, education, AI assistants, speech technologies, and multilingual customer support—any domain requiring accurate European language processing.

Check out the Model on Hugging Face and Technical details here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model with Over 30 Billion Parameters and Support Most European Languages appeared first on MarkTechPost.

Implementing DeepSpeed for Scalable Transformers: Advanced Training wi …

In this advanced DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization techniques for training large language models efficiently. By combining ZeRO optimization, mixed-precision training, gradient accumulation, and advanced DeepSpeed configurations, the tutorial demonstrates how to maximize GPU memory utilization, reduce training overhead, and enable scaling of transformer models in resource-constrained environments, such as Colab. Alongside model creation and training, it also covers performance monitoring, inference optimization, checkpointing, and benchmarking different ZeRO stages, providing practitioners with both theoretical insights and practical code to accelerate model development. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserimport subprocess
import sys
import os
import json
import time
from pathlib import Path

def install_dependencies():
“””Install required packages for DeepSpeed in Colab”””
print(” Installing DeepSpeed and dependencies…”)

subprocess.check_call([
sys.executable, “-m”, “pip”, “install”,
“torch”, “torchvision”, “torchaudio”, “–index-url”,
“https://download.pytorch.org/whl/cu118”
])

subprocess.check_call([sys.executable, “-m”, “pip”, “install”, “deepspeed”])

subprocess.check_call([
sys.executable, “-m”, “pip”, “install”,
“transformers”, “datasets”, “accelerate”, “wandb”
])

print(” Installation complete!”)

install_dependencies()

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse

We set up our Colab environment by installing PyTorch with CUDA support, DeepSpeed, and essential libraries like Transformers, Datasets, Accelerate, and Weights & Biases. We ensure everything is ready so we can smoothly build and train models with DeepSpeed. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass SyntheticTextDataset(Dataset):
“””Synthetic dataset for demonstration purposes”””

def __init__(self, size: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
self.size = size
self.seq_length = seq_length
self.vocab_size = vocab_size

self.data = torch.randint(0, vocab_size, (size, seq_length))

def __len__(self):
return self.size

def __getitem__(self, idx):
return {
‘input_ids’: self.data[idx],
‘labels’: self.data[idx].clone()
}

We create a SyntheticTextDataset where we generate random token sequences to mimic real text data. We use these sequences as both inputs and labels, allowing us to quickly test DeepSpeed training without relying on a large external dataset. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass AdvancedDeepSpeedTrainer:
“””Advanced DeepSpeed trainer with multiple optimization techniques”””

def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
self.model_config = model_config
self.ds_config = ds_config
self.model = None
self.engine = None
self.tokenizer = None

def create_model(self):
“””Create a GPT-2 style model for demonstration”””
print(” Creating model…”)

config = GPT2Config(
vocab_size=self.model_config[‘vocab_size’],
n_positions=self.model_config[‘seq_length’],
n_embd=self.model_config[‘hidden_size’],
n_layer=self.model_config[‘num_layers’],
n_head=self.model_config[‘num_heads’],
resid_pdrop=0.1,
embd_pdrop=0.1,
attn_pdrop=0.1,
)

self.model = GPT2LMHeadModel(config)
self.tokenizer = GPT2Tokenizer.from_pretrained(‘gpt2’)

self.tokenizer.pad_token = self.tokenizer.eos_token

print(f” Model parameters: {sum(p.numel() for p in self.model.parameters()):,}”)
return self.model

def create_deepspeed_config(self):
“””Create comprehensive DeepSpeed configuration”””
return {
“train_batch_size”: self.ds_config[‘train_batch_size’],
“train_micro_batch_size_per_gpu”: self.ds_config[‘micro_batch_size’],
“gradient_accumulation_steps”: self.ds_config[‘gradient_accumulation_steps’],

“zero_optimization”: {
“stage”: self.ds_config[‘zero_stage’],
“allgather_partitions”: True,
“allgather_bucket_size”: 5e8,
“overlap_comm”: True,
“reduce_scatter”: True,
“reduce_bucket_size”: 5e8,
“contiguous_gradients”: True,
“cpu_offload”: self.ds_config.get(‘cpu_offload’, False)
},

“fp16”: {
“enabled”: True,
“loss_scale”: 0,
“loss_scale_window”: 1000,
“initial_scale_power”: 16,
“hysteresis”: 2,
“min_loss_scale”: 1
},

“optimizer”: {
“type”: “AdamW”,
“params”: {
“lr”: self.ds_config[‘learning_rate’],
“betas”: [0.9, 0.999],
“eps”: 1e-8,
“weight_decay”: 0.01
}
},

“scheduler”: {
“type”: “WarmupLR”,
“params”: {
“warmup_min_lr”: 0,
“warmup_max_lr”: self.ds_config[‘learning_rate’],
“warmup_num_steps”: 100
}
},

“gradient_clipping”: 1.0,

“wall_clock_breakdown”: True,

“memory_breakdown”: True,

“tensorboard”: {
“enabled”: True,
“output_path”: “./logs/”,
“job_name”: “deepspeed_advanced_tutorial”
}
}

def initialize_deepspeed(self):
“””Initialize DeepSpeed engine”””
print(” Initializing DeepSpeed…”)

parser = argparse.ArgumentParser()
parser.add_argument(‘–local_rank’, type=int, default=0)
args = parser.parse_args([])

self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
args=args,
model=self.model,
config=self.create_deepspeed_config()
)

print(f” DeepSpeed engine initialized with ZeRO stage {self.ds_config[‘zero_stage’]}”)
return self.engine

def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
“””Perform a single training step with DeepSpeed optimizations”””

input_ids = batch[‘input_ids’].to(self.engine.device)
labels = batch[‘labels’].to(self.engine.device)

outputs = self.engine(input_ids=input_ids, labels=labels)
loss = outputs.loss

self.engine.backward(loss)

self.engine.step()

return {
‘loss’: loss.item(),
‘lr’: self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0
}

def train(self, dataloader: DataLoader, num_epochs: int = 2):
“””Complete training loop with monitoring”””
print(f” Starting training for {num_epochs} epochs…”)

self.engine.train()
total_steps = 0

for epoch in range(num_epochs):
epoch_loss = 0.0
epoch_steps = 0

print(f”n Epoch {epoch + 1}/{num_epochs}”)

for step, batch in enumerate(dataloader):
start_time = time.time()

metrics = self.train_step(batch)

epoch_loss += metrics[‘loss’]
epoch_steps += 1
total_steps += 1

if step % 10 == 0:
step_time = time.time() - start_time
print(f" Step {step:4d} | Loss: {metrics['loss']:.4f} | "
f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")

if step % 20 == 0 and hasattr(self.engine, ‘monitor’):
self.log_memory_stats()

if step >= 50:
break

avg_loss = epoch_loss / epoch_steps
print(f” Epoch {epoch + 1} completed | Average Loss: {avg_loss:.4f}”)

print(” Training completed!”)

def log_memory_stats(self):
“””Log GPU memory statistics”””
if torch.cuda.is_available():
allocated = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
print(f” GPU Memory – Allocated: {allocated:.2f}GB | Reserved: {reserved:.2f}GB”)

def save_checkpoint(self, path: str):
“””Save model checkpoint using DeepSpeed”””
print(f” Saving checkpoint to {path}”)
self.engine.save_checkpoint(path)

def demonstrate_inference(self, text: str = “The future of AI is”):
“””Demonstrate optimized inference with DeepSpeed”””
print(f”n Running inference with prompt: ‘{text}'”)

inputs = self.tokenizer.encode(text, return_tensors=’pt’).to(self.engine.device)

self.engine.eval()

with torch.no_grad():
outputs = self.engine.module.generate(
inputs,
max_length=inputs.shape[1] + 50,
num_return_sequences=1,
temperature=0.8,
do_sample=True,
pad_token_id=self.tokenizer.eos_token_id
)

generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f” Generated text: {generated_text}”)

self.engine.train()

We build an end-to-end trainer that creates a GPT-2 model, sets a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, tensorboard), and initializes the engine. We then run efficient training steps with logging and memory statistics, save checkpoints, and demonstrate inference to verify optimization and generation in one place. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef run_advanced_tutorial():
“””Main function to run the advanced DeepSpeed tutorial”””

print(” Advanced DeepSpeed Tutorial Starting…”)
print(“=” * 60)

model_config = {
‘vocab_size’: 50257,
‘seq_length’: 512,
‘hidden_size’: 768,
‘num_layers’: 6,
‘num_heads’: 12
}

ds_config = {
‘train_batch_size’: 16,
‘micro_batch_size’: 4,
‘gradient_accumulation_steps’: 4,
‘zero_stage’: 2,
‘learning_rate’: 1e-4,
‘cpu_offload’: False
}

print(” Configuration:”)
print(f” Model size: ~{sum(np.prod(shape) for shape in [[model_config[‘vocab_size’], model_config[‘hidden_size’]], [model_config[‘hidden_size’], model_config[‘hidden_size’]] * model_config[‘num_layers’]]) / 1e6:.1f}M parameters”)
print(f” ZeRO Stage: {ds_config[‘zero_stage’]}”)
print(f” Batch size: {ds_config[‘train_batch_size’]}”)

trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)

model = trainer.create_model()

engine = trainer.initialize_deepspeed()

print(“n Creating synthetic dataset…”)
dataset = SyntheticTextDataset(
size=200,
seq_length=model_config[‘seq_length’],
vocab_size=model_config[‘vocab_size’]
)

dataloader = DataLoader(
dataset,
batch_size=ds_config[‘micro_batch_size’],
shuffle=True
)

print(“n Pre-training memory stats:”)
trainer.log_memory_stats()

trainer.train(dataloader, num_epochs=2)

print(“n Post-training memory stats:”)
trainer.log_memory_stats()

trainer.demonstrate_inference(“DeepSpeed enables efficient training of”)

checkpoint_path = “./deepspeed_checkpoint”
trainer.save_checkpoint(checkpoint_path)

demonstrate_zero_stages()
demonstrate_memory_optimization()

print(“n Tutorial completed successfully!”)
print(“Key DeepSpeed features demonstrated:”)
print(” ZeRO optimization for memory efficiency”)
print(” Mixed precision training (FP16)”)
print(” Gradient accumulation”)
print(” Learning rate scheduling”)
print(” Checkpoint saving/loading”)
print(” Memory monitoring”)

def demonstrate_zero_stages():
“””Demonstrate different ZeRO optimization stages”””
print(“n ZeRO Optimization Stages Explained:”)
print(” Stage 0: Disabled (baseline)”)
print(” Stage 1: Optimizer state partitioning (~4x memory reduction)”)
print(” Stage 2: Gradient partitioning (~8x memory reduction)”)
print(” Stage 3: Parameter partitioning (~Nx memory reduction)”)

zero_configs = {
1: {“stage”: 1, “reduce_bucket_size”: 5e8},
2: {“stage”: 2, “allgather_partitions”: True, “reduce_scatter”: True},
3: {“stage”: 3, “stage3_prefetch_bucket_size”: 5e8, “stage3_param_persistence_threshold”: 1e6}
}

for stage, config in zero_configs.items():
estimated_memory_reduction = [1, 4, 8, “Nx”][stage]
print(f” Stage {stage}: ~{estimated_memory_reduction}x memory reduction”)

def demonstrate_memory_optimization():
“””Show memory optimization techniques”””
print(“n Memory Optimization Techniques:”)
print(” Gradient Checkpointing: Trade compute for memory”)
print(” CPU Offloading: Move optimizer states to CPU”)
print(” Compression: Reduce communication overhead”)
print(” Mixed Precision: Use FP16 for faster training”)

We orchestrate the full training run: set configs, build the GPT-2 model and DeepSpeed engine, create a synthetic dataset, monitor GPU memory, train for two epochs, run inference, and save a checkpoint. We then explain ZeRO stages and highlight memory-optimization tactics, such as gradient checkpointing and CPU offloading, to understand the trade-offs in practice. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass DeepSpeedConfigGenerator:
“””Utility class to generate DeepSpeed configurations”””

@staticmethod
def generate_config(
batch_size: int = 16,
zero_stage: int = 2,
use_cpu_offload: bool = False,
learning_rate: float = 1e-4
) -> Dict[str, Any]:
“””Generate a complete DeepSpeed configuration”””

config = {
“train_batch_size”: batch_size,
“train_micro_batch_size_per_gpu”: max(1, batch_size // 4),
“gradient_accumulation_steps”: max(1, batch_size // max(1, batch_size // 4)),

“zero_optimization”: {
“stage”: zero_stage,
“allgather_partitions”: True,
“allgather_bucket_size”: 5e8,
“overlap_comm”: True,
“reduce_scatter”: True,
“reduce_bucket_size”: 5e8,
“contiguous_gradients”: True
},

“fp16”: {
“enabled”: True,
“loss_scale”: 0,
“loss_scale_window”: 1000,
“initial_scale_power”: 16,
“hysteresis”: 2,
“min_loss_scale”: 1
},

“optimizer”: {
“type”: “AdamW”,
“params”: {
“lr”: learning_rate,
“betas”: [0.9, 0.999],
“eps”: 1e-8,
“weight_decay”: 0.01
}
},

“scheduler”: {
“type”: “WarmupLR”,
“params”: {
“warmup_min_lr”: 0,
“warmup_max_lr”: learning_rate,
“warmup_num_steps”: 100
}
},

“gradient_clipping”: 1.0,
“wall_clock_breakdown”: True
}

if use_cpu_offload:
config[“zero_optimization”][“cpu_offload”] = True
config[“zero_optimization”][“pin_memory”] = True

if zero_stage == 3:
config[“zero_optimization”].update({
“stage3_prefetch_bucket_size”: 5e8,
“stage3_param_persistence_threshold”: 1e6,
“stage3_gather_16bit_weights_on_model_save”: True
})

return config

def benchmark_zero_stages():
“””Benchmark different ZeRO stages”””
print(“n Benchmarking ZeRO Stages…”)

model_config = {
‘vocab_size’: 50257,
‘seq_length’: 256,
‘hidden_size’: 512,
‘num_layers’: 4,
‘num_heads’: 8
}

results = {}

for stage in [1, 2]:
print(f”n Testing ZeRO Stage {stage}…”)

ds_config = {
‘train_batch_size’: 8,
‘micro_batch_size’: 2,
‘gradient_accumulation_steps’: 4,
‘zero_stage’: stage,
‘learning_rate’: 1e-4
}

try:
trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
model = trainer.create_model()
engine = trainer.initialize_deepspeed()

if torch.cuda.is_available():
torch.cuda.reset_peak_memory_stats()

dataset = SyntheticTextDataset(size=20, seq_length=model_config[‘seq_length’])
dataloader = DataLoader(dataset, batch_size=ds_config[‘micro_batch_size’])

start_time = time.time()
for i, batch in enumerate(dataloader):
if i >= 5:
break
trainer.train_step(batch)

end_time = time.time()
peak_memory = torch.cuda.max_memory_allocated() / 1024**3

results[stage] = {
'peak_memory_gb': peak_memory,
'time_per_step': (end_time - start_time) / 5
}

print(f" Peak Memory: {peak_memory:.2f}GB")
print(f" Time per step: {results[stage]['time_per_step']:.3f}s")

del trainer, model, engine
torch.cuda.empty_cache()

except Exception as e:
print(f” Error with stage {stage}: {str(e)}”)

if len(results) > 1:
print(f"\n Comparison:")
stage_1_memory = results.get(1, {}).get('peak_memory_gb', 0)
stage_2_memory = results.get(2, {}).get('peak_memory_gb', 0)

if stage_1_memory > 0 and stage_2_memory > 0:
memory_reduction = (stage_1_memory - stage_2_memory) / stage_1_memory * 100
print(f" Memory reduction from Stage 1 to 2: {memory_reduction:.1f}%")

def demonstrate_advanced_features():
“””Demonstrate additional advanced DeepSpeed features”””
print(“n Advanced DeepSpeed Features:”)

print(” Dynamic Loss Scaling: Automatically adjusts FP16 loss scaling”)

print(” Gradient Compression: Reduces communication overhead”)

print(” Pipeline Parallelism: Splits model across devices”)

print(” Expert Parallelism: Efficient Mixture-of-Experts training”)

print(” Curriculum Learning: Progressive training strategies”)

if __name__ == “__main__”:
print(f” CUDA Available: {torch.cuda.is_available()}”)
if torch.cuda.is_available():
print(f” GPU: {torch.cuda.get_device_name()}”)
print(f” Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f}GB”)

try:
run_advanced_tutorial()

benchmark_zero_stages()

demonstrate_advanced_features()

except Exception as e:
print(f” Error during tutorial: {str(e)}”)
print(” Tips for troubleshooting:”)
print(” – Ensure you have GPU runtime enabled in Colab”)
print(” – Try reducing batch_size or model size if facing memory issues”)
print(” – Enable CPU offloading in ds_config if needed”)

We generate reusable DeepSpeed configurations, benchmark ZeRO stages to compare memory and speed, and showcase advanced features such as dynamic loss scaling and pipeline/MoE parallelism. We also detect CUDA, run the full tutorial end-to-end, and provide clear troubleshooting tips, allowing us to iterate confidently in Colab.

In conclusion, we gain a comprehensive understanding of how DeepSpeed enhances model training efficiency by striking a balance between performance and memory trade-offs. From leveraging ZeRO stages for memory reduction to applying FP16 mixed precision and CPU offloading, the tutorial showcases powerful strategies that make large-scale training accessible on modest hardware. By the end, learners will have trained and optimized a GPT-style model, benchmarked configurations, monitored GPU resources, and explored advanced features such as pipeline parallelism and gradient compression.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Implementing DeepSpeed for Scalable Transformers: Advanced Training with Gradient Checkpointing and Parallelism appeared first on MarkTechPost.

Meet ARGUS: A Scalable AI Framework for Training Large Recommender Tra …

Yandex has introduced ARGUS (AutoRegressive Generative User Sequential modeling), a large-scale transformer-based framework for recommender systems that scales up to one billion parameters. This breakthrough places Yandex among a small group of global technology leaders — alongside Google, Netflix, and Meta — that have successfully overcome the long-standing technical barriers in scaling recommender transformers.

Breaking Technical Barriers in Recommender Systems

Recommender systems have long struggled with three stubborn constraints: short-term memory, limited scalability, and poor adaptability to shifting user behavior. Conventional architectures trim user histories down to a small window of recent interactions, discarding months or years of behavioral data. The result is a shallow view of intent that misses long-term habits, subtle shifts in taste, and seasonal cycles. As catalogs expand into the billions of items, these truncated models not only lose precision but also choke on the computational demands of personalization at scale. The outcome is familiar: stale recommendations, lower engagement, and fewer opportunities for serendipitous discovery.

Very few companies have successfully scaled recommender transformers beyond experimental setups. Google, Netflix, and Meta have invested heavily in this area, reporting gains from architectures like YouTubeDNN, PinnerFormer, and Meta’s Generative Recommenders. With ARGUS, Yandex joins this select group of companies demonstrating billion-parameter recommender models in live services. By modeling entire behavioral timelines, the system uncovers both obvious and hidden correlations in user activity. This long-horizon perspective allows ARGUS to capture evolving intent and cyclical patterns with far greater fidelity. For example, instead of reacting only to a recent purchase, the model learns to anticipate seasonal behaviors—like automatically surfacing the preferred brand of tennis balls when summer approaches—without requiring the user to repeat the same signals year after year.

Technical Innovations Behind ARGUS

The framework introduces several key advances:

Dual-objective pre-training: ARGUS decomposes autoregressive learning into two subtasks, next-item prediction and feedback prediction. This combination improves both imitation of historical system behavior and modeling of true user preferences (a minimal sketch of this dual-objective setup appears after this list).

Scalable transformer encoders: Models scale from 3.2M to 1B parameters, with consistent performance improvements across all metrics. At the billion-parameter scale, pairwise accuracy uplift increased by 2.66%, demonstrating the emergence of a scaling law for recommender transformers.

Extended context modeling: ARGUS handles user histories up to 8,192 interactions long in a single pass, enabling personalization over months of behavior rather than just the last few clicks.

Efficient fine-tuning: A two-tower architecture allows offline computation of embeddings and scalable deployment, reducing inference cost relative to prior target-aware or impression-level online models.
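To make the dual-objective idea concrete, here is a minimal, illustrative sketch (not Yandex's code) of a shared sequence encoder with two heads, one for next-item prediction and one for feedback prediction. The model sizes, the three feedback classes, and the absence of causal masking are simplifications chosen for brevity.

import torch
import torch.nn as nn

class DualObjectiveModel(nn.Module):
    """Toy sketch of ARGUS-style dual-objective training over a user interaction sequence."""
    def __init__(self, d_model=512, n_items=100_000, n_feedback_classes=3):
        super().__init__()
        self.item_embed = nn.Embedding(n_items, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True), num_layers=2
        )
        self.next_item_head = nn.Linear(d_model, n_items)            # which item comes next
        self.feedback_head = nn.Linear(d_model, n_feedback_classes)  # e.g., skip / listen / like

    def forward(self, item_ids):
        h = self.encoder(self.item_embed(item_ids))  # [batch, seq, d_model]
        return self.next_item_head(h), self.feedback_head(h)

model = DualObjectiveModel()
items = torch.randint(0, 100_000, (4, 64))
next_targets = torch.randint(0, 100_000, (4, 64))
feedback_targets = torch.randint(0, 3, (4, 64))

item_logits, fb_logits = model(items)
loss = (
    nn.functional.cross_entropy(item_logits.reshape(-1, 100_000), next_targets.reshape(-1))
    + nn.functional.cross_entropy(fb_logits.reshape(-1, 3), feedback_targets.reshape(-1))
)
print(loss.item())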

Real-World Deployment and Measured Gains

ARGUS has already been deployed at scale on Yandex’s music platform, serving millions of users. In production A/B tests, the system achieved:

+2.26% increase in total listening time (TLT)

+6.37% increase in like likelihood

These constitute the largest recorded quality improvements in the platform’s history for any deep learning–based recommender model.

Future Directions

Yandex researchers plan to extend ARGUS to real-time recommendation tasks, explore feature engineering for pairwise ranking, and adapt the framework to high-cardinality domains such as large e-commerce and video platforms. The demonstrated ability to scale user-sequence modeling with transformer architectures suggests that recommender systems are poised to follow a scaling trajectory similar to natural language processing.

Conclusion

With ARGUS, Yandex has established itself as one of the few global leaders driving state-of-the-art recommender systems. By openly sharing its breakthroughs, the company is not only improving personalization across its own services but also accelerating the evolution of recommendation technologies for the entire industry.

Check out the PAPER here. Thanks to the Yandex team for the thought leadership and resources for this article.
The post Meet ARGUS: A Scalable AI Framework for Training Large Recommender Transformers to One Billion Parameters appeared first on MarkTechPost.

Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 2 …

Hugging Face has just released FineVision, an open multimodal dataset designed to set a new standard for Vision-Language Models (VLMs). With 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and nearly 10 billion answer tokens, FineVision positions itself as one of the largest and most structured publicly available VLM training datasets.

FineVision aggregates 200+ sources into a unified format, rigorously filtered for duplicates and benchmark contamination. Rated systematically across multiple quality dimensions, the dataset enables researchers and devs to construct robust training mixtures while minimizing data leakage.

Why is FineVision Important for VLM Training?

Most state-of-the-art VLMs rely on proprietary datasets, limiting reproducibility and accessibility for the broader research community. FineVision addresses this gap by:

Scale and Coverage: 5 TB of curated data across 9 categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation.

Benchmark Gains: Across 11 widely used benchmarks (e.g., AI2D, ChartQA, DocVQA, ScienceQA, OCRBench), models trained on FineVision outperform alternatives by significant margins—up to 46.3% over LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian.

New Skill Domains: FineVision introduces data for emerging tasks like GUI navigation, pointing, and counting, expanding the capabilities of VLMs beyond conventional captioning and VQA.

How Was FineVision Built?

The curation pipeline followed a three-step process:

Collection and Augmentation: Over 200 publicly available image-text datasets were gathered. Missing modalities (e.g., text-only data) were reformatted into QA pairs. Underrepresented domains, such as GUI data, were supplemented through targeted collection.

Cleaning

Removed oversized QA pairs (>8192 tokens).

Resized large images to a maximum of 2048 px while preserving aspect ratio.

Discarded corrupted samples.

Quality Rating: Using Qwen3-32B and Qwen2.5-VL-32B-Instruct as judges, every QA pair was rated on four axes: Text Formatting Quality, Question-Answer Relevance, Visual Dependency, and Image-Question Correspondence. These ratings enable selective training mixtures, though ablations show that retaining all samples yields the best performance, even when lower-rated samples are included.
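As a rough illustration of the cleaning rules above (not the actual FineVision pipeline), the sketch below drops over-long QA pairs, discards corrupted images, and resizes large images while preserving aspect ratio; the whitespace word count stands in for a real tokenizer.

from PIL import Image

MAX_TOKENS = 8192
MAX_SIDE = 2048

def clean_sample(image_path: str, question: str, answer: str):
    """Return a cleaned (image, question, answer) tuple, or None if the sample is dropped."""
    # Approximate token-budget check (a real pipeline would use the model's tokenizer).
    if len((question + " " + answer).split()) > MAX_TOKENS:
        return None
    try:
        img = Image.open(image_path)
        img.load()          # force decoding so corrupted files fail here
    except Exception:
        return None         # discard corrupted samples
    # Resize so the longest side is at most MAX_SIDE, preserving aspect ratio.
    if max(img.size) > MAX_SIDE:
        scale = MAX_SIDE / max(img.size)
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    return img, question, answer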

Comparative Analysis: FineVision vs. Existing Open Datasets

Dataset | Images | Samples | Turns | Tokens | Leakage | Perf. Drop After Deduplication
Cauldron | 2.0M | 1.8M | 27.8M | 0.3B | 3.05% | -2.39%
LLaVA-Vision | 2.5M | 3.9M | 9.1M | 1.0B | 2.15% | -2.72%
Cambrian-7M | 5.4M | 7.0M | 12.2M | 0.8B | 2.29% | -2.78%
FineVision | 17.3M | 24.3M | 88.9M | 9.5B | 1.02% | -1.45%

FineVision is not only one of the largest but also the least contaminated dataset, with just 1% overlap with benchmark test sets. This minimizes data leakage and keeps downstream evaluations reliable.

Performance Insights

Model Setup: Ablations were conducted using nanoVLM (460M parameters), combining SmolLM2-360M-Instruct as the language backbone and SigLIP2-Base-512 as the vision encoder.

Training Efficiency: On 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes ~20 hours.

Performance Trends:

FineVision models improve steadily with exposure to diverse data, overtaking baselines after ~12k steps.

Deduplication experiments confirm FineVision’s low leakage compared to Cauldron, LLaVA, and Cambrian.

Multilingual subsets, even when the backbone is monolingual, show slight performance gains, suggesting diversity outweighs strict alignment.

Attempts at multi-stage training (two or 2.5 stages) did not yield consistent benefits, reinforcing that scale + diversity is more critical than training heuristics.

Why FineVision Sets a New Standard

+20% Average Performance Boost: Outperforms all existing open datasets across 10+ benchmarks.

Unprecedented Scale: 17M+ images, 24M+ samples, 10B tokens.

Skill Expansion: GUI navigation, counting, pointing, and document reasoning included.

Lowest Data Leakage: 1% contamination, compared to 2–3% in other datasets.

Fully Open Source: Available on Hugging Face Hub for immediate use via the datasets library.
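Usage with the datasets library is straightforward; the repository name below is an assumption, so verify the exact identifier (and any subset names) on the Hugging Face Hub page before running.

from datasets import load_dataset

# Assumed repo id -- confirm the exact dataset name and available configs on the Hub.
ds = load_dataset("HuggingFaceM4/FineVision", split="train", streaming=True)

# Stream a few samples to inspect the schema without downloading the full ~5 TB.
for i, sample in enumerate(ds):
    print(sample.keys())
    if i >= 2:
        break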

Conclusion

FineVision marks a significant advancement in open multimodal datasets. Its large scale, systematic curation, and transparent quality assessments create a reproducible and extensible foundation for training state-of-the-art Vision-Language Models. By reducing dependence on proprietary resources, it enables researchers and devs to build competitive systems and accelerate progress in areas such as document analysis, visual reasoning, and agentic multimodal tasks.

Check out the Dataset and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs) appeared first on MarkTechPost.

Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framew …


https://arxiv.org/abs/2508.20148v1

What is a Personal Health Agent?

Large language models (LLMs) have demonstrated strong performance across various domains like clinical reasoning, decision support, and consumer health applications. However, most existing platforms are designed as single-purpose tools, such as symptom checkers, digital coaches, or health information assistants. These approaches often fail to address the complexity of real-world health needs, where individuals require integrated reasoning over wearable streams, personal health records, and laboratory test results.

A team of researchers from Google has proposed a Personal Health Agent (PHA) framework. The PHA is designed as a multi-agent system that unifies complementary roles: data analysis, medical knowledge reasoning, and health coaching. Instead of returning isolated outputs from a single model, the PHA employs a central orchestrator to coordinate specialized sub-agents, iteratively synthesize their outputs, and deliver coherent, personalized guidance.

How does the PHA framework operate?

The Personal Health Agent (PHA) is built on top of the Gemini 2.0 model family. It follows a modular architecture consisting of three sub-agents and one orchestrator:

Data Science Agent (DS): The DS agent interprets and analyzes time-series data from wearables (e.g., step counts, heart rate variability, sleep metrics) and structured health records. It is capable of decomposing open-ended user questions into formal analysis plans, executing statistical reasoning, and comparing results against population-level reference data. For example, it can quantify whether physical activity in the past month is associated with improvements in sleep quality.

Domain Expert Agent (DE): The DE agent provides medically contextualized information. It integrates personal health records, demographic information, and wearable signals to generate explanations grounded in medical knowledge. Unlike general-purpose LLMs that may produce plausible but unreliable outputs, the DE agent follows an iterative reasoning-investigation-examination loop, combining authoritative medical resources with personal data. This allows it to provide evidence-based interpretations, such as whether a specific blood pressure measurement is within a safe range for an individual with a particular condition.

Health Coach Agent (HC): The HC agent addresses behavioral change and long-term goal setting. Drawing from established coaching strategies such as motivational interviewing, it conducts multi-turn conversations, identifies user goals, clarifies constraints, and generates structured, personalized plans. For example, it may guide a user through setting a weekly exercise schedule, adapting to individual barriers, and incorporating feedback from progress tracking.

Orchestrator: The orchestrator coordinates these three agents. When a query is received, it assigns a primary agent responsible for generating the main output and supporting agents to provide contextual data or domain knowledge. After collecting the results, the orchestrator runs an iterative reflection loop, checking outputs for coherence and accuracy before synthesizing them into a single response. This ensures that the final output is not merely an aggregation of agent responses but an integrated recommendation.
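The orchestration pattern itself can be sketched in a few lines of Python; the stub below is our own simplification (agent internals replaced with placeholder functions and the reflection loop reduced to a comment), not Google's implementation.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class SubAgent:
    name: str
    handle: Callable[[str, Dict], str]   # (query, context) -> contribution

def data_science_agent(query, ctx):  return f"[DS] trend summary for: {query}"
def domain_expert_agent(query, ctx): return f"[DE] medical interpretation of: {query}"
def health_coach_agent(query, ctx):  return f"[HC] coaching plan for: {query}"

AGENTS = {
    "DS": SubAgent("Data Science", data_science_agent),
    "DE": SubAgent("Domain Expert", domain_expert_agent),
    "HC": SubAgent("Health Coach", health_coach_agent),
}

def orchestrate(query: str, primary: str = "HC") -> str:
    context = {}
    # Supporting agents contribute context first, then the primary agent drafts the answer.
    for key, agent in AGENTS.items():
        if key != primary:
            context[key] = agent.handle(query, context)
    draft = AGENTS[primary].handle(query, context)
    # Reflection loop stub: the real orchestrator critiques and revises the draft
    # for coherence and accuracy before returning it.
    return f"{draft}\n(supporting context from: {list(context)})"

print(orchestrate("How can I improve my sleep based on last month's data?"))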

How was the PHA evaluated?

The research team conducted one of the most comprehensive evaluations of a health AI system to date. Their evaluation framework involved 10 benchmark tasks, 7,000+ human annotations, and 1,100 hours of assessment from health experts and end-users.

Evaluation of the Data Science Agent

The DS agent was assessed on its ability to generate structured analysis plans and produce correct, executable code. Compared to baseline Gemini models, it demonstrated:

A significant increase in analysis plan quality, improving mean expert-rated scores from 53.7% to 75.6%.

A reduction in critical data handling errors from 25.4% to 11.0%.

An improvement in code pass rates from 58.4% to 75.5% on first attempts, with further gains under iterative self-correction.

Evaluation of the Domain Expert Agent

The DE agent was benchmarked across four capabilities: factual accuracy, diagnostic reasoning, contextual personalization, and multimodal data synthesis. Results include:

Factual knowledge: On over 2,000 board-style exam questions across endocrinology, cardiology, sleep medicine, and fitness, the DE agent achieved 83.6% accuracy, outperforming baseline Gemini (81.8%).

Diagnostic reasoning: On 2,000 self-reported symptom cases, it achieved 46.1% top-1 diagnostic accuracy compared to 41.4% for a state-of-the-art Gemini baseline.

Personalization: In user studies, 72% of participants preferred DE agent responses to baseline outputs, citing higher trustworthiness and contextual relevance.

Multimodal synthesis: In expert clinician reviews of health summaries generated from wearable, lab, and survey data, the DE agent’s outputs were rated more clinically significant, comprehensive, and trustworthy than baseline outputs.

Evaluation of the Health Coach Agent

The HC agent was designed and assessed through expert interviews and user studies. Experts emphasized the need for six coaching capabilities: goal identification, active listening, context clarification, empowerment, SMART (Specific, Measurable, Attainable, Relevant, Time-bound) recommendations, and iterative feedback incorporation.

In evaluations, the HC agent demonstrated improved conversation flow and user engagement compared to baseline models. It avoided premature recommendations and instead balanced information gathering with actionable advice, producing outputs more consistent with expert coaching practices.

Evaluation of the Integrated PHA System

At the system level, the orchestrator and three agents were tested together in open-ended, multimodal conversations reflecting realistic health scenarios. Both experts and end-users rated the integrated Personal Health Agent (PHA) significantly higher than baseline Gemini systems across measures of accuracy, coherence, personalization, and trustworthiness.

How does the PHA contribute to health AI?

The introduction of a multi-agent PHA addresses several limitations of existing health AI systems:

Integration of heterogeneous data: Wearable signals, medical records, and lab test results are analyzed jointly rather than in isolation.

Division of labor: Each sub-agent specializes in a domain where single monolithic models often underperform, e.g., numerical reasoning for DS, clinical grounding for DE, and behavioral engagement for HC.

Iterative reflection: The orchestrator’s review cycle reduces inconsistencies that often arise when multiple outputs are simply concatenated.

Systematic evaluation: Unlike most prior work, which relied on small-scale case studies, the Personal Health Agent (PHA) was validated with a large multimodal dataset (the WEAR-ME study) and extensive expert involvement.

What is the larger significance of Google’s PHA blueprint?

The introduction of Personal Health Agent (PHA) demonstrates that health AI can move beyond single-purpose applications toward modular, orchestrated systems capable of reasoning across multimodal data. It shows that breaking down tasks into specialized sub-agents leads to measurable improvements in robustness, accuracy, and user trust.

It is important to note that this work is a research construct, not a commercial product. The research team emphasized that the PHA design is exploratory and that deployment would require addressing regulatory, privacy, and ethical considerations. Nonetheless, the framework and evaluation results represent a significant advance in the technical foundations of personal health AI.

Conclusion

The Personal Health Agent framework provides a comprehensive design for integrating wearable data, health records, and behavioral coaching through a multi-agent system coordinated by an orchestrator. Its evaluation across 10 benchmarks, using thousands of annotations and expert assessments, shows consistent improvements over baseline LLMs in statistical analysis, medical reasoning, personalization, and coaching interactions.

By structuring health AI as a coordinated system of specialized agents rather than a monolithic model, the PHA demonstrates how accuracy, coherence, and trust can be improved in personal health applications. This work establishes a foundation for further research on agentic health systems and highlights a pathway toward integrated, reliable health reasoning tools.

Check out the PAPER here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framework that Enables Personalized Interactions to Address Individual Health Needs appeared first on MarkTechPost.

How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Mod …

In this tutorial, we present a complete end-to-end Natural Language Processing (NLP) pipeline built with Gensim and supporting libraries, designed to run seamlessly in Google Colab. It integrates multiple core techniques in modern NLP, including preprocessing, topic modeling with Latent Dirichlet Allocation (LDA), word embeddings with Word2Vec, TF-IDF-based similarity analysis, and semantic search. The pipeline not only demonstrates how to train and evaluate these models but also showcases practical visualizations, advanced topic analysis, and document classification workflows. By combining statistical methods with machine learning approaches, the tutorial provides a comprehensive framework for understanding and experimenting with text data at scale. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser!pip install --upgrade scipy==1.11.4
!pip install gensim==4.3.2 nltk wordcloud matplotlib seaborn pandas numpy scikit-learn
!pip install --upgrade setuptools

print("Please restart runtime after installation!")
print("Go to Runtime > Restart runtime, then run the next cell")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, TfidfModel, CoherenceModel
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

We install and upgrade the necessary libraries, such as SciPy, Gensim, NLTK, and visualization tools, to ensure compatibility. We then import all required modules for preprocessing, modeling, and analysis. We also download NLTK resources to tokenize and handle stopwords efficiently, thereby setting up the environment for our NLP pipeline. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass AdvancedGensimPipeline:
def __init__(self):
self.dictionary = None
self.corpus = None
self.lda_model = None
self.word2vec_model = None
self.tfidf_model = None
self.similarity_index = None
self.processed_docs = None

def create_sample_corpus(self):
“””Create a diverse sample corpus for demonstration”””
documents = [ “Data science combines statistics, programming, and domain expertise to extract insights”,
“Big data analytics helps organizations make data-driven decisions at scale”,
“Cloud computing provides scalable infrastructure for modern applications and services”,
“Cybersecurity protects digital systems from threats and unauthorized access attempts”,
“Software engineering practices ensure reliable and maintainable code development”,
“Database management systems store and organize large amounts of structured information”,
“Python programming language is widely used for data analysis and machine learning”,
“Statistical modeling helps identify patterns and relationships in complex datasets”,
“Cross-validation techniques ensure robust model performance evaluation and selection”,
“Recommendation systems suggest relevant items based on user preferences and behavior”,
“Text mining extracts valuable insights from unstructured textual data sources”,
“Image classification assigns predefined categories to visual content automatically”,
“Reinforcement learning trains agents through interaction with dynamic environments”
]
return documents

def preprocess_documents(self, documents):
“””Advanced document preprocessing using Gensim filters”””
print(“Preprocessing documents…”)

CUSTOM_FILTERS = [
strip_tags, strip_punctuation, strip_multiple_whitespaces,
strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
]

processed_docs = []
for doc in documents:
processed = preprocess_string(doc, CUSTOM_FILTERS)

stop_words = set(stopwords.words(‘english’))
processed = [word for word in processed if word not in stop_words and len(word) > 2]

processed_docs.append(processed)

self.processed_docs = processed_docs
print(f”Processed {len(processed_docs)} documents”)
return processed_docs

def create_dictionary_and_corpus(self):
“””Create Gensim dictionary and corpus”””
print(“Creating dictionary and corpus…”)

self.dictionary = corpora.Dictionary(self.processed_docs)

self.dictionary.filter_extremes(no_below=2, no_above=0.8)

self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]

print(f”Dictionary size: {len(self.dictionary)}”)
print(f”Corpus size: {len(self.corpus)}”)

def train_word2vec_model(self):
“””Train Word2Vec model for word embeddings”””
print(“Training Word2Vec model…”)

self.word2vec_model = Word2Vec(
sentences=self.processed_docs,
vector_size=100,
window=5,
min_count=2,
workers=4,
epochs=50
)

print(“Word2Vec model trained successfully”)

def analyze_word_similarities(self):
“””Analyze word similarities using Word2Vec”””
print(“n=== Word2Vec Similarity Analysis ===”)

test_words = [‘machine’, ‘data’, ‘learning’, ‘computer’]

for word in test_words:
if word in self.word2vec_model.wv:
similar_words = self.word2vec_model.wv.most_similar(word, topn=3)
print(f”Words similar to ‘{word}’: {similar_words}”)

try:
if all(w in self.word2vec_model.wv for w in [‘machine’, ‘computer’, ‘data’]):
analogy = self.word2vec_model.wv.most_similar(
positive=[‘computer’, ‘data’],
negative=[‘machine’],
topn=1
)
print(f”Analogy result: {analogy}”)
except:
print(“Not enough vocabulary for complex analogies”)

def train_lda_model(self, num_topics=5):
“””Train LDA topic model”””
        print(f"Training LDA model with {num_topics} topics...")

        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto',
            per_word_topics=True,
            eval_every=None
        )

        print("LDA model trained successfully")

    def evaluate_topic_coherence(self):
        """Evaluate topic model coherence"""
        print("Evaluating topic coherence...")

        coherence_model = CoherenceModel(
            model=self.lda_model,
            texts=self.processed_docs,
            dictionary=self.dictionary,
            coherence='c_v'
        )

        coherence_score = coherence_model.get_coherence()
        print(f"Topic Coherence Score: {coherence_score:.4f}")
        return coherence_score

    def display_topics(self):
        """Display discovered topics"""
        print("\n=== Discovered Topics ===")

        topics = self.lda_model.print_topics(num_words=8)
        for idx, topic in enumerate(topics):
            print(f"Topic {idx}: {topic[1]}")

    def create_tfidf_model(self):
        """Create TF-IDF model for document similarity"""
        print("Creating TF-IDF model...")

        self.tfidf_model = TfidfModel(self.corpus)
        corpus_tfidf = self.tfidf_model[self.corpus]

        self.similarity_index = similarities.MatrixSimilarity(corpus_tfidf)

        print("TF-IDF model and similarity index created")

    def find_similar_documents(self, query_doc_idx=0):
        """Find documents similar to a query document"""
        print(f"\n=== Document Similarity Analysis ===")

        query_doc_tfidf = self.tfidf_model[self.corpus[query_doc_idx]]

        similarities_scores = self.similarity_index[query_doc_tfidf]

        sorted_similarities = sorted(enumerate(similarities_scores), key=lambda x: x[1], reverse=True)

        print(f"Documents most similar to document {query_doc_idx}:")
        for doc_idx, similarity in sorted_similarities[:5]:
            print(f"Doc {doc_idx}: {similarity:.4f}")

    def visualize_topics(self):
        """Create visualizations for topic analysis"""
        print("Creating topic visualizations...")

        doc_topic_matrix = []
        for doc_bow in self.corpus:
            doc_topics = dict(self.lda_model.get_document_topics(doc_bow, minimum_probability=0))
            topic_vec = [doc_topics.get(i, 0) for i in range(self.lda_model.num_topics)]
            doc_topic_matrix.append(topic_vec)

        doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f'Topic_{i}' for i in range(self.lda_model.num_topics)])

        plt.figure(figsize=(12, 8))
        sns.heatmap(doc_topic_df.T, annot=True, cmap='Blues', fmt='.2f')
        plt.title('Document-Topic Distribution Heatmap')
        plt.xlabel('Documents')
        plt.ylabel('Topics')
        plt.tight_layout()
        plt.show()

        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()

        for topic_id in range(min(6, self.lda_model.num_topics)):
            topic_words = dict(self.lda_model.show_topic(topic_id, topn=20))

            wordcloud = WordCloud(
                width=300, height=200,
                background_color='white',
                colormap='viridis'
            ).generate_from_frequencies(topic_words)

            axes[topic_id].imshow(wordcloud, interpolation='bilinear')
            axes[topic_id].set_title(f'Topic {topic_id}')
            axes[topic_id].axis('off')

        for i in range(self.lda_model.num_topics, 6):
            axes[i].axis('off')

        plt.tight_layout()
        plt.show()

    def advanced_topic_analysis(self):
        """Perform advanced topic analysis"""
        print("\n=== Advanced Topic Analysis ===")

        topic_distributions = []
        for i, doc_bow in enumerate(self.corpus):
            doc_topics = self.lda_model.get_document_topics(doc_bow)
            dominant_topic = max(doc_topics, key=lambda x: x[1]) if doc_topics else (0, 0)
            topic_distributions.append({
                'doc_id': i,
                'dominant_topic': dominant_topic[0],
                'topic_probability': dominant_topic[1]
            })

        topic_df = pd.DataFrame(topic_distributions)

        plt.figure(figsize=(10, 6))
        topic_counts = topic_df['dominant_topic'].value_counts().sort_index()
        plt.bar(range(len(topic_counts)), topic_counts.values)
        plt.xlabel('Topic ID')
        plt.ylabel('Number of Documents')
        plt.title('Distribution of Dominant Topics Across Documents')
        plt.xticks(range(len(topic_counts)), [f'Topic {i}' for i in topic_counts.index])
        plt.show()

        return topic_df

    def document_classification_demo(self, new_document):
        """Classify a new document using trained models"""
        print(f"\n=== Document Classification Demo ===")
        print(f"Classifying: '{new_document[:50]}...'")

        processed_new = preprocess_string(new_document, [
            strip_tags, strip_punctuation, strip_multiple_whitespaces,
            strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
        ])

        new_doc_bow = self.dictionary.doc2bow(processed_new)

        doc_topics = self.lda_model.get_document_topics(new_doc_bow)

        print("Topic probabilities:")
        for topic_id, prob in doc_topics:
            print(f"  Topic {topic_id}: {prob:.4f}")

        new_doc_tfidf = self.tfidf_model[new_doc_bow]
        similarities_scores = self.similarity_index[new_doc_tfidf]
        most_similar = np.argmax(similarities_scores)

        print(f"Most similar document: {most_similar} (similarity: {similarities_scores[most_similar]:.4f})")

        return doc_topics, most_similar

    def run_complete_pipeline(self):
        """Execute the complete NLP pipeline"""
        print("=== Advanced Gensim NLP Pipeline ===\n")

        raw_documents = self.create_sample_corpus()
        self.preprocess_documents(raw_documents)

        self.create_dictionary_and_corpus()

        self.train_word2vec_model()
        self.train_lda_model(num_topics=5)
        self.create_tfidf_model()

        self.analyze_word_similarities()
        coherence_score = self.evaluate_topic_coherence()
        self.display_topics()

        self.visualize_topics()
        topic_df = self.advanced_topic_analysis()

        self.find_similar_documents(query_doc_idx=0)

        new_doc = "Deep neural networks are powerful machine learning models for pattern recognition"
        self.document_classification_demo(new_doc)

        return {
            'coherence_score': coherence_score,
            'topic_distributions': topic_df,
            'models': {
                'lda': self.lda_model,
                'word2vec': self.word2vec_model,
                'tfidf': self.tfidf_model
            }
        }

We define the AdvancedGensimPipeline class as a modular framework that handles every stage of text analysis in one place. It starts by creating a sample corpus, preprocessing it, and building the dictionary and bag-of-words corpus representations. We then train Word2Vec for embeddings, LDA for topic modeling, and TF-IDF for similarity, followed by visualization, coherence evaluation, and classification of new documents. This way, we bring the complete NLP workflow, from raw text to insights, into a single reusable pipeline. Check out the FULL CODES here.
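If we prefer to drive the stages individually rather than call run_complete_pipeline(), a minimal sketch could look like the following. It only uses the method names defined in the class above; the document index and the ad-hoc query text are illustrative choices, not part of the original code.

# A minimal sketch: running the pipeline stage by stage (assumes the class above is in scope).
pipeline = AdvancedGensimPipeline()
docs = pipeline.create_sample_corpus()
pipeline.preprocess_documents(docs)
pipeline.create_dictionary_and_corpus()

pipeline.train_word2vec_model()
pipeline.train_lda_model(num_topics=5)
pipeline.create_tfidf_model()

pipeline.display_topics()
pipeline.find_similar_documents(query_doc_idx=2)  # any valid document index works

# Classify an ad-hoc document (the text below is purely illustrative)
pipeline.document_classification_demo("Transformer models dominate modern natural language processing")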

def compare_topic_models(pipeline, topic_range=[3, 5, 7, 10]):
    """Compare LDA models trained with different numbers of topics."""
    print("\n=== Topic Model Comparison ===")

    coherence_scores = []
    perplexity_scores = []

    for num_topics in topic_range:
        lda_temp = LdaModel(
            corpus=pipeline.corpus,
            id2word=pipeline.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto'
        )

        coherence_model = CoherenceModel(
            model=lda_temp,
            texts=pipeline.processed_docs,
            dictionary=pipeline.dictionary,
            coherence='c_v'
        )
        coherence = coherence_model.get_coherence()
        coherence_scores.append(coherence)

        # Note: log_perplexity returns Gensim's per-word likelihood bound (a log value), not raw perplexity
        perplexity = lda_temp.log_perplexity(pipeline.corpus)
        perplexity_scores.append(perplexity)

        print(f"Topics: {num_topics}, Coherence: {coherence:.4f}, Perplexity: {perplexity:.4f}")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    ax1.plot(topic_range, coherence_scores, 'bo-')
    ax1.set_xlabel('Number of Topics')
    ax1.set_ylabel('Coherence Score')
    ax1.set_title('Model Coherence vs Number of Topics')
    ax1.grid(True)

    ax2.plot(topic_range, perplexity_scores, 'ro-')
    ax2.set_xlabel('Number of Topics')
    ax2.set_ylabel('Perplexity')
    ax2.set_title('Model Perplexity vs Number of Topics')
    ax2.grid(True)

    plt.tight_layout()
    plt.show()

    return coherence_scores, perplexity_scores

The compare_topic_models function lets us systematically test different numbers of topics for the LDA model and compare their performance. For each topic count in the given range, we calculate a coherence score (to check topic interpretability) and a perplexity score (to check model fit). The results are displayed as line plots, helping us visually pick the most balanced number of topics for our dataset. Check out the FULL CODES here.
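As a quick illustration of how we might consume its return values, the sketch below selects the topic count with the highest coherence and retrains the pipeline's LDA model with it; the helper variables here are our own additions for illustration.

# Example (illustrative): choose the topic count with the best coherence score.
topic_range = [3, 5, 7, 10]
coherence_scores, perplexity_scores = compare_topic_models(pipeline, topic_range=topic_range)

best_idx = max(range(len(topic_range)), key=lambda i: coherence_scores[i])
best_num_topics = topic_range[best_idx]
print(f"Best number of topics by coherence: {best_num_topics} "
      f"(coherence={coherence_scores[best_idx]:.4f})")

# Optionally retrain the pipeline's LDA model with the selected value
pipeline.train_lda_model(num_topics=best_num_topics)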

def semantic_search_engine(pipeline, query, top_k=5):
    """Implement semantic search using trained models"""
    print(f"\n=== Semantic Search: '{query}' ===")

    processed_query = preprocess_string(query, [
        strip_tags, strip_punctuation, strip_multiple_whitespaces,
        strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
    ])

    query_bow = pipeline.dictionary.doc2bow(processed_query)
    query_tfidf = pipeline.tfidf_model[query_bow]

    similarities_scores = pipeline.similarity_index[query_tfidf]

    top_indices = np.argsort(similarities_scores)[::-1][:top_k]

    print("Top matching documents:")
    for i, idx in enumerate(top_indices):
        score = similarities_scores[idx]
        print(f"{i+1}. Document {idx} (Score: {score:.4f})")
        print(f"   Content: {' '.join(pipeline.processed_docs[idx][:10])}...")

    return top_indices, similarities_scores[top_indices]

The semantic_search_engine function adds a search layer to the pipeline by taking a query, preprocessing it, and converting it into bag-of-words and TF-IDF representations. It then compares the query against all documents using the similarity index and returns the top matches. This way, we can quickly retrieve the most relevant documents along with their similarity scores, making the pipeline useful for practical information retrieval and semantic search tasks. Check out the FULL CODES here.
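A short usage sketch follows; the query text is chosen purely for illustration.

# Example (illustrative): query the corpus and inspect the raw scores.
top_indices, top_scores = semantic_search_engine(
    pipeline,
    "topic modeling and document similarity with gensim",
    top_k=3
)
for idx, score in zip(top_indices, top_scores):
    print(f"Document {idx} matched with TF-IDF cosine score {score:.4f}")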

if __name__ == "__main__":
    pipeline = AdvancedGensimPipeline()
    results = pipeline.run_complete_pipeline()

    print("\n" + "=" * 60)
    coherence_scores, perplexity_scores = compare_topic_models(pipeline)

    print("\n" + "=" * 60)
    search_results = semantic_search_engine(
        pipeline,
        "artificial intelligence neural networks deep learning"
    )

    print("\n" + "=" * 60)
    print("Pipeline completed successfully!")
    print(f"Final coherence score: {results['coherence_score']:.4f}")
    print(f"Vocabulary size: {len(pipeline.dictionary)}")
    print(f"Word2Vec model size: {pipeline.word2vec_model.wv.vector_size} dimensions")

    print("\nModels trained and ready for use!")
    print("Access models via: pipeline.lda_model, pipeline.word2vec_model, pipeline.tfidf_model")

This main block ties everything together into a complete, executable pipeline. We initialize the AdvancedGensimPipeline, run the full workflow, and then evaluate topic models with different numbers of topics. Next, we test the semantic search engine with a query about artificial intelligence and deep learning. Finally, we print summary metrics, such as the coherence score, vocabulary size, and Word2Vec embedding dimensions, confirming that all models are trained and ready for further use.
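To reuse the trained models outside this session, Gensim's standard save/load methods apply; the file names below are arbitrary examples.

# Persisting and reloading the trained models (file names are arbitrary examples).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel, Word2Vec

pipeline.lda_model.save("advanced_pipeline_lda.model")
pipeline.word2vec_model.save("advanced_pipeline_w2v.model")
pipeline.tfidf_model.save("advanced_pipeline_tfidf.model")
pipeline.dictionary.save("advanced_pipeline_dictionary.dict")

# Later, in another session:
lda = LdaModel.load("advanced_pipeline_lda.model")
w2v = Word2Vec.load("advanced_pipeline_w2v.model")
tfidf = TfidfModel.load("advanced_pipeline_tfidf.model")
dictionary = Dictionary.load("advanced_pipeline_dictionary.dict")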

In conclusion, we gain a powerful, modular workflow that covers the entire spectrum of text analysis, from cleaning and preprocessing raw documents to discovering hidden topics, visualizing results, comparing models, and performing semantic search. The inclusion of Word2Vec embeddings, TF-IDF similarity, and coherence evaluation ensures that the pipeline is both versatile and robust, while visualizations and classification demos make the results interpretable and actionable. This cohesive design enables learners, researchers, and practitioners to quickly adapt the framework for real-world applications, making it a valuable foundation for advanced NLP experimentation and production-ready text analytics.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis appeared first on MarkTechPost.

Meet Chatterbox Multilingual: An Open-Source Zero-Shot Text To Speech …

Table of contents
What does Chatterbox Multilingual offer?
How does it compare with commercial systems?
How is expressive control implemented?
How does watermarking contribute to responsible AI usage?
What deployment options are available?
What is the significance of Chatterbox Multilingual open release?

Resemble AI has recently released Chatterbox Multilingual, a production grade open-source Text To Speech (TTS) model designed for zero-shot voice cloning in 23 languages. It is distributed under the MIT license, making it freely available for integration and modification. The system builds on the original Chatterbox framework and adds multilingual capability, expressive controls, and built-in watermarking for traceability.

What does Chatterbox Multilingual offer?

Chatterbox Multilingual enables voice cloning without retraining by leveraging zero-shot learning. A short audio sample is enough to generate a synthetic voice that captures the speaker's characteristics. It supports 23 languages, including Arabic, Hindi, Chinese, Swahili, and other widely spoken languages, giving it coverage across diverse linguistic families.

Beyond basic voice cloning, the model integrates emotion and intensity controls, which allow users to specify not just what is said but also how it is delivered. The model also includes PerTh watermarking by default to ensure that every output can be authenticated through neural watermark extraction. These features make the model suitable for tasks where both accuracy and security are important.

How does it compare with commercial systems?

Evaluations indicate that Chatterbox Multilingual performs competitively with leading commercial TTS systems. In blind A/B tests conducted on Podonos, listeners expressed a 63.75% preference for Chatterbox over ElevenLabs. This suggests that, under those test conditions, users found Chatterbox outputs closer to natural, accurate speech reproduction.

Source: https://www.resemble.ai/chatterbox/

It is worth noting that while some reported numbers compare performance on specific languages such as German, the only verifiable public metric is the Podonos listener preference result. This makes preference-based benchmarking the most reliable evidence currently available.

How is expressive control implemented?

Chatterbox Multilingual not only reproduces voice identity but also provides tools for controlling delivery style. The model allows adjustment of emotion categories such as happy, sad, or angry, and includes an exaggeration parameter to regulate intensity. This means a cloned voice can be made more enthusiastic, subdued, or dramatic depending on context.

Such flexibility is useful in interactive media, dialog agents, gaming, and assistive technologies, where emotional nuance affects the effectiveness of communication. Rather than producing static or neutral speech, the system can generate output that adapts to context-specific needs.
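To make these controls concrete, here is a hypothetical sketch based on the original Chatterbox Python package (chatterbox-tts). The class name, reference-audio path, and exaggeration value are assumptions for illustration; the multilingual release is expected to expose a similar generate interface with a language selector, so check the official Resemble AI repository for the exact API.

# Hypothetical sketch based on the original Chatterbox package; verify names
# against the official Resemble AI repository for the multilingual release.
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Thanks for calling, how can I help you today?"

# Zero-shot cloning: condition generation on a short reference clip (path is an example).
# The exaggeration knob pushes delivery toward a more intense, dramatic read.
wav = model.generate(
    text,
    audio_prompt_path="reference_speaker.wav",
    exaggeration=0.7,
)
ta.save("cloned_output.wav", wav, model.sr)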

How does watermarking contribute to responsible AI usage?

Every file generated by Chatterbox Multilingual contains PerTh (Perceptual Threshold) watermarking, a neural technique developed by Resemble AI. The watermark is inaudible to listeners but can be extracted using the provided open-source detector. This enables traceability and verification of generated content, an increasingly important factor as synthetic audio becomes more widespread.

By embedding watermarking at the system level and keeping it always active, Chatterbox helps mitigate risks of misuse without requiring external enforcement mechanisms. This design choice aligns with ongoing discussions about the ethics of generative audio systems.

What deployment options are available?

The open-source release provides a baseline system that can be installed and run by researchers, developers, or hobbyists under the permissive MIT license. For environments where high concurrency, latency targets, or compliance guarantees are necessary, Resemble AI offers a managed variant called Chatterbox Multilingual Pro.

This hosted version offers sub-200 ms latency and fine-tuned voices, and includes SLAs (service-level agreements) along with the compliance features required in enterprise deployments. While the open-source project serves as a general foundation, the Pro service is aimed at production workloads with operational constraints.

What is the significance of Chatterbox Multilingual open release?

Chatterbox Multilingual contributes a multilingual, open, and controllable voice cloning system to the speech synthesis community. It integrates zero-shot cloning, expressivity controls, and watermarking in a framework that is both technically advanced and freely available.

Performance studies suggest it is competitive with leading proprietary solutions, offering a practical platform for further research and application development. Its open-source license makes it accessible to a broad range of users, from academic researchers to independent developers, strengthening the ecosystem of multilingual speech synthesis tools.

Check out the GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Meet Chatterbox Multilingual: An Open-Source Zero-Shot Text To Speech (TTS) Multilingual Model with Emotion Control and Watermarking appeared first on MarkTechPost.