MBZUAI Researchers Release K2 Think: A 32B Open-Source System for Advanced AI Reasoning and Outperforms 20x Larger Reasoning Models

A team of researchers from MBZUAI’s Institute of Foundation Models and G42 has released K2 Think, a 32B-parameter open reasoning system for advanced AI reasoning. It pairs long chain-of-thought supervised fine-tuning with reinforcement learning from verifiable rewards, agentic planning, test-time scaling, and inference optimizations (speculative decoding plus wafer-scale hardware). The result is frontier-level math performance at a markedly lower parameter count, competitive results on code and science, and a transparent, fully open release spanning weights, data, and code.

System overview

K2 Think is built by post-training an open-weight Qwen2.5-32B base model and adding a lightweight test-time compute scaffold. The design emphasizes parameter efficiency: a 32B backbone is deliberately chosen to enable fast iteration and deployment while leaving headroom for post-training gains. The core recipe combines six “pillars”: (1) Long chain-of-thought (CoT) supervised fine-tuning; (2) Reinforcement Learning with Verifiable Rewards (RLVR); (3) agentic planning before solving; (4) test-time scaling via best-of-N selection with verifiers; (5) speculative decoding; and (6) inference on a wafer-scale engine.

The goals are straightforward: raise pass@1 on competition-grade math benchmarks, maintain strong code/science performance, and keep response length and wall-clock latency under control through plan-before-you-think prompting and hardware-aware inference.

Pillar 1: Long CoT SFT

Phase-1 SFT uses curated, long chain-of-thought traces and instruction/response pairs spanning math, code, science, instruction following, and general chat (AM-Thinking-v1-Distilled). The effect is to teach the base model to externalize intermediate reasoning and adopt a structured output format. Rapid pass@1 gains occur early (≈0.5 epoch), with AIME’24 stabilizing around 79% and AIME’25 around 72% on the SFT checkpoint before RL, indicating convergence.

Pillar 2: RL with Verifiable Rewards

K2 Think then trains with RLVR on Guru, a ~92k-prompt, six-domain dataset (Math, Code, Science, Logic, Simulation, Tabular) designed for verifiable end-to-end correctness. The implementation uses the verl library with a GRPO-style policy-gradient algorithm. Notable observation: starting RL from a strong SFT checkpoint yields modest absolute gains and can plateau/degenerate, whereas applying the same RL recipe directly on the base model shows large relative improvements (e.g., ~40% on AIME’24 over training), supporting a trade-off between SFT strength and RL headroom.

A second ablation shows multi-stage RL with a reduced initial context window (e.g., 16k → 32k) underperforms—failing to recover the SFT baseline—suggesting that reducing max sequence length below the SFT regime can disrupt learned reasoning patterns.
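
To make the RLVR idea concrete, below is a minimal sketch of what a verifiable reward for math-style prompts can look like: the policy’s output is parsed for a final boxed answer and compared against a ground-truth string. This is an illustrative stand-in under assumptions, not the Guru/verl reward implementation; the extract_boxed_answer helper and the exact-match criterion are our own.

import re
from typing import Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    """Pull the last \\boxed{...} span out of a model response (illustrative convention)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the reference, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0

# Reward for a single rollout, as would feed a GRPO-style policy-gradient update
print(verifiable_reward("... so the result is \\boxed{42}", "42"))  # 1.0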

Pillars 3–4: Agentic “Plan-Before-You-Think” and Test-time Scaling

At inference, the system first elicits a compact plan before generating a full solution, then performs best-of-N (e.g., N=3) sampling with verifiers to select the most likely-correct answer. Two effects are reported: (i) consistent quality gains from the combined scaffold; and (ii) shorter final responses despite the added plan—average token counts drop across benchmarks, with reductions up to ~11.7% (e.g., Omni-HARD), and overall lengths comparable to much larger open models. This matters for both latency and cost.

Table-level analysis shows K2 Think’s response lengths are shorter than Qwen3-235B-A22B and in the same range as GPT-OSS-120B on math; after adding plan-before-you-think and verifiers, K2 Think’s average tokens fall versus its own post-training checkpoint (e.g., AIME’24 −6.7%, AIME’25 −3.9%, HMMT25 −7.2%, Omni-HARD −11.7%, LCBv5 −10.5%, GPQA-D −2.1%).
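
The test-time scaffold can be pictured as a two-stage loop: elicit a short plan, sample N candidate solutions conditioned on it, and let a verifier pick the winner. The sketch below assumes generic generate and verify_score callables (for example, thin wrappers around an inference endpoint and an answer checker); it mirrors the described plan-before-you-think plus best-of-N flow rather than the released K2 Think code.

from typing import Callable, List, Tuple

def plan_then_best_of_n(
    problem: str,
    generate: Callable[[str], str],              # prompt -> model completion (assumed wrapper)
    verify_score: Callable[[str, str], float],   # (problem, solution) -> score (assumed verifier)
    n: int = 3,
) -> Tuple[str, str]:
    """Plan-before-you-think followed by best-of-N selection with a verifier."""
    # Stage 1: elicit a compact plan before solving.
    plan = generate(f"Outline a short plan (no solution yet) for:\n{problem}")

    # Stage 2: sample N full solutions conditioned on the plan.
    candidates: List[str] = [
        generate(f"Problem:\n{problem}\n\nPlan:\n{plan}\n\nNow solve it step by step.")
        for _ in range(n)
    ]

    # Stage 3: let the verifier pick the most likely-correct candidate.
    best = max(candidates, key=lambda sol: verify_score(problem, sol))
    return plan, best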

Pillars 5–6: Speculative decoding and wafer-scale inference

K2 Think targets Cerebras Wafer-Scale Engine inference with speculative decoding, advertising per-request throughput upwards of 2,000 tokens/sec, which makes the test-time scaffold practical for production and research loops. The hardware-aware inference path is a central part of the release and aligns with the system’s “small-but-fast” philosophy.
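
Speculative decoding itself is model-agnostic: a small draft model proposes tokens, and the larger target model verifies them, accepting each proposal with probability min(1, p_target/p_draft) and resampling from the residual distribution on rejection. The toy NumPy sketch below illustrates only that acceptance rule on hand-made distributions; it is not the Cerebras or K2 Think implementation.

import numpy as np

def speculative_step(p_target: np.ndarray, q_draft: np.ndarray, rng: np.random.Generator) -> int:
    """One token of draft-then-verify sampling over a small vocabulary (toy illustration)."""
    proposal = rng.choice(len(q_draft), p=q_draft)            # draft model proposes a token
    accept_prob = min(1.0, p_target[proposal] / q_draft[proposal])
    if rng.random() < accept_prob:                            # target model accepts the proposal
        return proposal
    residual = np.maximum(p_target - q_draft, 0.0)            # otherwise resample from the residual
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

rng = np.random.default_rng(0)
p_target = np.array([0.6, 0.3, 0.1])   # stand-in for the large model's next-token distribution
q_draft = np.array([0.5, 0.4, 0.1])    # stand-in for the draft model's distribution
print([speculative_step(p_target, q_draft, rng) for _ in range(5)])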

https://k2think-about.pages.dev/assets/tech-report/K2-Think_Tech-Report.pdf

Evaluation protocol

Benchmarking covers competition-level math (AIME’24, AIME’25, HMMT’25, Omni-MATH-HARD), code (LiveCodeBench v5; SciCode sub/main), and science knowledge/reasoning (GPQA-Diamond; HLE). The research team reports a standardized setup: max generation length 64k tokens, temperature 1.0, top-p 0.95, stop marker </answer>, and each score as an average of 16 independent pass@1 evaluations to reduce run-to-run variance.
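
Since each reported score is the mean of 16 independent pass@1 runs, the aggregation is simple; a small sketch with made-up per-run accuracies shows the calculation and the run-to-run spread it is meant to smooth out.

from statistics import mean, stdev

def average_pass_at_1(per_run_accuracies: list[float]) -> tuple[float, float]:
    """Aggregate k independent pass@1 runs into a mean score and a run-to-run spread."""
    return mean(per_run_accuracies), stdev(per_run_accuracies)

# Hypothetical accuracies from 16 independent evaluation runs of one benchmark
runs = [0.90, 0.91, 0.92, 0.90, 0.89, 0.93, 0.91, 0.90,
        0.92, 0.90, 0.91, 0.89, 0.92, 0.91, 0.90, 0.91]
score, spread = average_pass_at_1(runs)
print(f"pass@1 = {score:.4f} ± {spread:.4f}")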

Results

Math (micro-average across AIME’24/’25, HMMT25, Omni-HARD). K2 Think reaches 67.99, leading the open-weight cohort and comparing favorably even to much larger systems; it posts 90.83 (AIME’24), 81.24 (AIME’25), 73.75 (HMMT25), and 60.73 on Omni-HARD—the latter being the most difficult split. The positioning is consistent with strong parameter efficiency relative to DeepSeek V3.1 (671B) and GPT-OSS-120B (120B).

Code. LiveCodeBench v5 score is 63.97, exceeding similarly sized peers and even larger open models (e.g., > Qwen3-235B-A22B at 56.64). On SciCode, K2 Think is 39.2/12.0 (sub/main), tracking the best open systems closely on sub-problem accuracy.

Science. GPQA-Diamond reaches 71.08; HLE is 9.95. The model is not just a math specialist: it stays competitive across knowledge-heavy tasks.

Key numbers at a glance

Backbone: Qwen2.5-32B (open weight), post-trained with long CoT SFT + RLVR (GRPO via verl).

RL data: Guru (~92k prompts) across Math/Code/Science/Logic/Simulation/Tabular.

Inference scaffold: Plan-before-you-think + BoN with verifiers; shorter outputs (e.g., −11.7% tokens on Omni-HARD) at higher accuracy.

Throughput target: ~2,000 tok/s on Cerebras WSE with speculative decoding.

Math micro-avg: 67.99 (AIME’24 90.83, AIME’25 81.24, HMMT’25 73.75, Omni-HARD 60.73).

Code/Science: LCBv5 63.97; SciCode 39.2/12.0; GPQA-D 71.08; HLE 9.95.

Safety-4 macro: 0.75 (Refusal 0.83, Conv. Robustness 0.89, Cybersecurity 0.56, Jailbreak 0.72).

Summary

K2 Think demonstrates that integrative post-training + test-time compute + hardware-aware inference can close much of the gap to larger, proprietary reasoning systems. At 32B, it is tractable to fine-tune and serve; with plan-before-you-think and BoN-with-verifiers, it controls token budgets; with speculative decoding on wafer-scale hardware, it reaches ~2k tok/s per request. K2 Think is presented as a fully open system—weights, training data, deployment code, and test-time optimization code.

Check out the Paper, Model on Hugging Face, GitHub and Direct Access.

Alibaba Qwen Team Releases Qwen3-ASR: A New Speech Recognition Model Built Upon Qwen3-Omni Achieving Robust Speech Recognition Performance

Alibaba Cloud’s Qwen team unveiled Qwen3-ASR Flash, an all-in-one automatic speech recognition (ASR) model, available as an API service and built upon Qwen3-Omni, that simplifies multilingual, noisy, and domain-specific transcription without juggling multiple systems.

Key Capabilities

Multilingual recognition: Supports automatic language detection and transcription across 11 languages: Chinese (zh), English, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, and Russian. That breadth positions Qwen3-ASR for global usage without separate models.

Context injection mechanism: Users can paste arbitrary text—names, domain-specific jargon, even nonsensical strings—to bias transcription. This is especially powerful in scenarios rich in idioms, proper nouns, or evolving lingo.

Robust audio handling: Maintains performance in noisy environments, low-quality recordings, far-field input (e.g., distance mics), and multimedia vocals like songs or raps. Reported Word Error Rate (WER) remains under 8%, which is technically impressive for such diverse inputs.

Single-model simplicity: Eliminates complexity of maintaining different models for languages or audio contexts—one model with an API Service to rule them all.

Use cases span edtech platforms (lecture capture, multilingual tutoring), media (subtitling, voice-over), and customer service (multilingual IVR or support transcription).

https://qwen.ai/blog?id=41e4c0f6175f9b004a03a07e42343eaaf48329e7&from=research.latest-advancements-list

Technical Assessment

Language Detection + Transcription: Automatic language detection lets the model determine the language before transcribing—crucial for mixed-language environments or passive audio capture. This reduces the need for manual language selection and improves usability.

Context Token Injection: Pasting text as “context” biases recognition toward expected vocabulary. Technically, this could operate via prefix tuning or prefix-injection—embedding context in the input stream to influence decoding. It’s a flexible way to adapt to domain-specific lexicons without re-training the model.

WER < 8% Across Complex Scenarios: Holding sub-8% WER across music, rap, background noise, and low-fidelity audio puts Qwen3-ASR in the upper echelon of recognition systems. For comparison, robust models on clean read speech target 3–5% WER, but performance typically degrades significantly in noisy or musical contexts.
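
Word Error Rate is computed as (substitutions + deletions + insertions) divided by the number of reference words; a compact word-level edit-distance implementation is sketched below. This is the standard metric definition, not Qwen’s internal evaluation code, and the example transcripts are made up.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn the music down please", "turn the music town please"))  # 0.2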

Multilingual Coverage: Support for 11 languages, including logographic Chinese and languages with very different phonotactics such as Arabic and Japanese, suggests substantial multilingual training data and cross-lingual modeling capacity. Handling both tonal (Mandarin) and non-tonal languages is non-trivial.

Single-Model Architecture: Operationally elegant, it deploys one model for all tasks. This reduces ops burden—no need to swap or select models dynamically. Everything runs in a unified ASR pipeline with built-in language detection.

Deployment and Demo

The Hugging Face Space for Qwen3-ASR provides a live interface: upload audio, optionally input context, and choose a language or use auto-detect. It is available as an API Service.

Conclusion

Qwen3-ASR Flash (available as an API Service) is a technically compelling, deploy-friendly ASR solution. It offers a rare combination: multilingual support, context-aware transcription, and noise-robust recognition—all in one model.

Check out the API Service, Technical details and Demo on Hugging Face.

Top 7 Model Context Protocol (MCP) Servers for Vibe Coding

Modern software development is shifting from static workflows to dynamic, agent-driven coding experiences. At the center of this transition is the Model Context Protocol (MCP), a standard for connecting AI agents to external tools, data, and services. MCP provides a structured way for large language models (LLMs) to request, consume, and persist context. This makes coding sessions more adaptive, reproducible, and collaborative. In short, MCP acts as the “middleware” that enables Vibe Coding—an interactive style of programming where developers and AI agents co-create in real time.
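
To ground the protocol, here is a minimal MCP server sketch using the FastMCP helper from the official Python SDK (assuming the mcp package is installed); the tool itself is a made-up example, and each of the servers listed below exposes far richer capabilities over the same protocol.

# pip install "mcp[cli]"   (assumed install for the official MCP Python SDK)
from mcp.server.fastmcp import FastMCP

# An MCP server advertises named tools/resources that any MCP-capable agent can call.
mcp = FastMCP("vibe-coding-demo")

@mcp.tool()
def count_todos(source_code: str) -> int:
    """Count TODO markers in a source file the agent is working on (illustrative tool)."""
    return source_code.count("TODO")

if __name__ == "__main__":
    # Serve over stdio so an MCP client (IDE or agent runtime) can connect to it.
    mcp.run()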

Below are seven notable MCP servers that extend developer environments with specialized capabilities for version control, memory, database integration, research, and browser automation for Vibe Coders.

GitMCP – Git Integration for AI Agents

GitMCP focuses on making repositories natively accessible to AI agents. It bridges MCP with Git workflows, allowing models to clone, browse, and interact with codebases directly. This reduces the overhead of manually feeding context to the agent.

Key Features: Direct access to branches, commits, diffs, and pull requests.

Practical Use: Automating code reviews, generating contextual explanations of commits, and preparing documentation.

Developer Value: Keeps the agent aware of project history and structure, avoiding redundant queries.

Supabase MCP – Database-First Coding

Supabase MCP integrates real-time databases and authentication directly into MCP-enabled workflows. By exposing a Postgres-native API to LLMs, it lets agents query live data, run migrations, or even test queries without leaving the coding session.

Key Features: Postgres queries, authentication, storage access.

Practical Use: Rapid prototyping of applications with live data interaction.

Developer Value: Eliminates the need for separate tooling when testing queries or managing schema changes.

Browser MCP – Web Automation Layer

Browser MCP enables agents to launch headless browsers, scrape data, and interact with web applications. It effectively equips an LLM with browsing capabilities inside a coding environment.

Key Features: Navigation, DOM inspection, form interaction, and screenshot capture.

Practical Use: Debugging frontend applications, testing authentication flows, and collecting real-time content.

Developer Value: Simplifies automated QA and lets developers test code against live production environments without custom scripting.

Context7 – Scalable Context Management

Context7, developed by Upstash, is built to handle persistent memory across sessions. It ensures that agents have long-term awareness of projects without repeatedly re-feeding context.

Key Features: Scalable memory storage, context retrieval APIs.

Practical Use: Multi-session projects where state and knowledge must persist across restarts.

Developer Value: Reduces token costs and boosts reliability by avoiding repeated context injection.

21stDev – Experimental Multi-Agent MCP

21stDev MCP is an experimental server that supports orchestration of multiple agents. Instead of a single AI instance managing all tasks, 21stDev coordinates different specialized agents through MCP.

Key Features: Multi-agent orchestration, modular plugin design.

Practical Use: Building pipelines where one agent manages code generation, another handles database validation, and another performs testing.

Developer Value: Enables a distributed agentic system without complex integration overhead.

OpenMemory MCP – Agent Memory Layer

OpenMemory MCP addresses one of the hardest problems in LLM workflows: persistent, inspectable memory. Unlike vector databases that act as black boxes, OpenMemory MCP provides transparent, queryable memory that developers can inspect and debug.

Key Features: Memory persistence, explainable retrieval, developer-level inspection.

Practical Use: Building agents that can remember user preferences, project requirements, or coding styles across sessions.

Developer Value: Improves trust by making memory retrieval transparent, not opaque.

Exa Search MCP – Research-Driven Development

Exa Search, built by Exa AI, is an MCP server specialized for research. It connects developers to live, verifiable information from the web without leaving the coding environment.

Key Features: Retrieves current statistics, bug fixes, and real-world examples.

Practical Use: When coding requires up-to-date references—such as API changes, performance benchmarks, or bug reports—Exa Search finds and integrates them directly.

Developer Value: Reduces the risk of using outdated or hallucinated information, accelerating bug resolution and feature development.

Conclusion

MCP servers are redefining how developers interact with AI systems by embedding context directly into workflows. Whether it’s GitMCP for version control, Supabase MCP for database interaction, Browser MCP for live web testing, Context7 for persistent memory, or Exa Search for research-driven coding, each server targets a different layer of the development stack. Together, these tools make Vibe Coding a practical reality—where human developers and AI agents collaborate seamlessly, grounded in accurate context and real-time feedback.

Powering innovation at scale: How AWS is tackling AI infrastructure challenges

As generative AI continues to transform how enterprises operate—and develop net new innovations—the infrastructure demands for training and deploying AI models have grown exponentially. Traditional infrastructure approaches are struggling to keep pace with today’s computational requirements, network demands, and resilience needs of modern AI workloads.
At AWS, we’re also seeing a transformation across the technology landscape as organizations move from experimental AI projects to production deployments at scale. This shift demands infrastructure that can deliver unprecedented performance while maintaining security, reliability, and cost-effectiveness. That’s why we’ve made significant investments in networking innovations, specialized compute resources, and resilient infrastructure that’s designed specifically for AI workloads.
Accelerating model experimentation and training with SageMaker AI
The gateway to our AI infrastructure strategy is Amazon SageMaker AI, which provides purpose-built tools and workflows to streamline experimentation and accelerate the end-to-end model development lifecycle. One of our key innovations in this area is Amazon SageMaker HyperPod, which removes the undifferentiated heavy lifting involved in building and optimizing AI infrastructure.
At its core, SageMaker HyperPod represents a paradigm shift by moving beyond the traditional emphasis on raw computational power toward intelligent and adaptive resource management. It comes with advanced resiliency capabilities so that clusters can automatically recover from model training failures across the full stack, while automatically splitting training workloads across thousands of accelerators for parallel processing.
The impact of infrastructure reliability on training efficiency is significant. On a 16,000-chip cluster, for instance, every 0.1% decrease in daily node failure rate improves cluster productivity by 4.2%, translating to potential savings of up to $200,000 per day for a 16,000 H100 GPU cluster. To address this challenge, we recently introduced Managed Tiered Checkpointing in HyperPod, leveraging CPU memory for high-performance checkpoint storage with automatic data replication. This innovation helps deliver faster recovery times and is a cost-effective solution compared to traditional disk-based approaches.
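As a back-of-envelope check on that savings figure (our own arithmetic, not an AWS number): if 4.2% of a 16,000-GPU cluster’s daily cost is roughly $200,000, the implied all-in rate is about $12–13 per GPU-hour, which is in the ballpark of typical H100 pricing.

# Back-of-envelope check of the "up to $200,000 per day" figure (assumed, illustrative arithmetic)
gpus = 16_000
productivity_gain = 0.042                      # 4.2% productivity improvement
implied_daily_cluster_cost = 200_000 / productivity_gain
implied_gpu_hour_rate = implied_daily_cluster_cost / (gpus * 24)
print(f"Implied cluster cost: ${implied_daily_cluster_cost:,.0f}/day "
      f"(~${implied_gpu_hour_rate:.2f} per GPU-hour)")
# -> roughly $4.8M/day, i.e. about $12.4 per GPU-hour
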
For those working with today’s most popular models, HyperPod also offers over 30 curated model training recipes, including support for OpenAI GPT-OSS, DeepSeek R1, Llama, Mistral, and Mixtral. These recipes automate key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. And with support for popular tools like Jupyter, vLLM, LangChain, and MLflow, you can manage containerized apps and scale clusters dynamically as you scale your foundation model training and inference workloads.
Overcoming the bottleneck: Network performance
As organizations scale their AI initiatives from proof of concept to production, network performance often becomes the critical bottleneck that can make or break success. This is particularly true when training large language models, where even minor network delays can add days or weeks to training time and significantly increase costs. In 2024, the scale of our networking investments was unprecedented; we installed over 3 million network links to support our latest AI network fabric, or 10p10u infrastructure. Supporting more than 20,000 GPUs while delivering 10s of petabits of bandwidth with under 10 microseconds of latency between servers, this infrastructure enables organizations to train massive models that were previously impractical or impossibly expensive. To put this in perspective: what used to take weeks can now be accomplished in days, allowing companies to iterate faster and bring AI innovations to customers sooner.
At the heart of this network architecture is our revolutionary Scalable Intent Driven Routing (SIDR) protocol and Elastic Fabric Adapter (EFA). SIDR acts as an intelligent traffic control system that can instantly reroute data when it detects network congestion or failures, responding in under one second—ten times faster than traditional distributed networking approaches.
Accelerated computing for AI
The computational demands of modern AI workloads are pushing traditional infrastructure to its limits. Whether you’re fine-tuning a foundation model for your specific use case or training a model from scratch, having the right compute infrastructure isn’t just about raw power—it’s about having the flexibility to choose the most cost-effective and efficient solution for your specific needs.
AWS offers the industry’s broadest selection of accelerated computing options, anchored by both our long-standing partnership with NVIDIA and our custom-built AWS Trainium chips. This year’s launch of P6 instances featuring NVIDIA Blackwell chips demonstrates our continued commitment to bringing the latest GPU technology to our customers. The P6-B200 instances provide 8 NVIDIA Blackwell GPUs with 1.4 TB of high bandwidth GPU memory and up to 3.2 Tbps of EFAv4 networking. In preliminary testing, customers like JetBrains have already seen greater than 85% faster training times on P6-B200 over H200-based P5en instances across their ML pipelines.
To make AI more affordable and accessible, we also developed AWS Trainium, our custom AI chip designed specifically for ML workloads. Using a unique systolic array architecture, Trainium creates efficient computing pipelines that reduce memory bandwidth demands. To simplify access to this infrastructure, EC2 Capacity Blocks for ML also enable you to reserve accelerated compute instances within EC2 UltraClusters for up to six months, giving customers predictable access to the accelerated compute they need.
Preparing for tomorrow’s innovations, today
As AI continues to transform every aspect of our lives, one thing is clear: AI is only as good as the foundation upon which it is built. At AWS, we’re committed to being that foundation, delivering the security, resilience, and continuous innovation needed for the next generation of AI breakthroughs. From our revolutionary 10p10u network fabric to custom Trainium chips, from P6e-GB200 UltraServers to SageMaker HyperPod’s advanced resilience capabilities, we’re enabling organizations of all sizes to push the boundaries of what’s possible with AI. We’re excited to see what our customers will build next on AWS.

About the author
Barry Cooks is a global enterprise technology veteran with 25 years of experience leading teams in cloud computing, hardware design, application microservices, artificial intelligence, and more. As VP of Technology at Amazon, he is responsible for compute abstractions (containers, serverless, VMware, micro-VMs), quantum experimentation, high performance computing, and AI training. He oversees key AWS services including AWS Lambda, Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, and Amazon SageMaker. Barry also leads responsible AI initiatives across AWS, promoting the safe and ethical development of AI as a force for good. Prior to joining Amazon in 2022, Barry served as CTO at DigitalOcean, where he guided the organization through its successful IPO. His career also includes leadership roles at VMware and Sun Microsystems. Barry holds a BS in Computer Science from Purdue University and an MS in Computer Science from the University of Oregon.

Accelerate your model training with managed tiered checkpointing on Amazon SageMaker HyperPod

As organizations scale their AI infrastructure to support trillion-parameter models, they face a difficult trade-off between recovery speed and cost. When they checkpoint frequently to speed up recovery and minimize lost training time, they incur substantially higher storage costs. When they checkpoint infrequently, they reduce costs but risk losing valuable training progress when failures occur.
This challenge is exacerbated in large distributed training environments with thousands of accelerators, where issues occur frequently. According to an article released by Meta, one failure happened every 3 hours during Meta Llama 3 model training. GPU issues accounted for 60% of total failures; network, CPU, and disk issues accounted for the other 40%. With infrequent checkpointing, these accumulated failures can result in losing days of training progress over the course of a complete training run, driving up costs and time to market. Frequent checkpoints, on the other hand, can saturate networks, overload storage, and result in unpredictable performance.
To help solve these challenges, AWS announced managed tiered checkpointing in Amazon SageMaker HyperPod, a purpose-built infrastructure to scale and accelerate generative AI model development across thousands of AI accelerators. Managed tiered checkpointing uses CPU memory for high-performance checkpoint storage with automatic data replication across adjacent compute nodes for enhanced reliability. Although SageMaker HyperPod identifies node issues automatically and replaces those nodes so your training can resume, managed tiered checkpointing helps you implement the best checkpointing strategy and maximize your training throughput.
Managed tiered checkpointing has been tested on large distributed training clusters ranging from hundreds of GPUs to over 15,000 GPUs, with checkpoints saved within seconds.
In this post, we dive deep into those concepts and understand how to use the managed tiered checkpointing feature.
Solution overview
Checkpointing is the method of saving an intermediate model’s state during the training process. You can resume training from a recent checkpoint in the event of an issue by saving the model’s parameters, optimizer states, and other metadata during training. Additionally, you can resolve training problems, such as irregular learning rates, without a full restart by loading an earlier checkpoint state.
Use the following formula to find a rough initial estimate of the total size of the checkpoint for your model without the optimizer state:

Model checkpoint size (GB) = (Number of parameters × Bytes per parameter) ÷ 1024³

For example, if you train a Meta Llama 3 70-billion-parameter model using BFloat16 as the parameter precision, the checkpoint size will be about 130 GB. If you train a DeepSeek-R1 671-billion-parameter model using BFloat16, the checkpoint size will be about 1.25 TB. All without storing optimizer states.

Checkpoints also include optimizer states, training metadata (such as step number), and other additional data, resulting in a larger than expected size. When using an Adam optimizer, the optimizer saves three additional float16 statistics per parameter, adding 6 bytes per parameter. Therefore, with the optimizer state saved, the Meta Llama 3 70B checkpoint will be approximately 521 GB, and the DeepSeek-R1 671B checkpoint will be approximately 5 TB. That is roughly a four-times increase in size, and handling those checkpoints becomes a challenge.
The following table summarizes the checkpoint sizes for each model.

Model name        | Checkpoint size | Checkpoint + optimizer states
Meta Llama 3 70B  | 130 GB          | 521 GB
DeepSeek R1 671B  | 1.43 TB         | 5 TB
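
The formula above is easy to script; the small helper below (our own sketch, using the article’s 2 bytes per BF16 parameter and 6 extra bytes per parameter for Adam state) reproduces the 130 GB and ~521 GB figures for Llama 3 70B.

def checkpoint_size_gb(num_params: float, bytes_per_param: float = 2.0,
                       optimizer_bytes_per_param: float = 0.0) -> float:
    """Rough checkpoint size in GB: params x (weight bytes + optimizer bytes) / 1024^3."""
    return num_params * (bytes_per_param + optimizer_bytes_per_param) / 1024 ** 3

llama3_70b = 70e9
print(f"Llama 3 70B weights only: {checkpoint_size_gb(llama3_70b):.0f} GB")        # ~130 GB
print(f"Llama 3 70B + Adam state: {checkpoint_size_gb(llama3_70b, 2, 6):.0f} GB")  # ~521 GB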

It’s also important to consider the training strategy. In a Fully Sharded Data Parallel (FSDP) scenario, each rank (a single GPU process in a distributed training job) saves its own shard of the checkpoint. This reduces the amount of data each rank has to save during a checkpoint, but it puts stress on the file system. On a Network File System (NFS) shared file system, those concurrent writes become a bottleneck. Using a distributed file system, such as Amazon FSx for Lustre, can help alleviate that pressure at a higher total cost. In a Distributed Data Parallel (DDP) scenario, a single rank writes the complete checkpoint at one time, and all ranks read the checkpoint when loading it back. At the file system level, this means a single writer and multiple readers. On an NFS file system, many readers can be a problem because they are constrained by the file system, network stack, and queue size, and a single writer over the network will not take advantage of all the available network throughput. Here again, a fast distributed file system like FSx for Lustre can help solve those problems at a higher total cost of ownership.
As we can see, traditional checkpointing methods that rely solely on remote persistent storage create a computational overhead during checkpoint creation, because writing terabytes of model parameters to persistent storage might throttle it, consume expensive network bandwidth, and require complex orchestration across distributed systems. By storing checkpoints in fast-access in-memory locations, such as CPU RAM, while maintaining configurable backup to Amazon Simple Storage Service (Amazon S3) for persistence, the system delivers faster recovery times, and is a cost-effective solution compared to traditional disk-based approaches.
Managed tiered checkpointing works as follows:

When training your model, you define the checkpoint frequency.
Model training uses GPU HBM memory to store the model, its parameters, and intermediate results, and do the heavy computation.
Triggering a checkpoint briefly pauses model training. The GPU converts the model weights (tensors) into a state dictionary and copies the data to the instance’s CPU; training then resumes while managed tiered checkpointing copies the data to RAM.
Because RAM is volatile, managed tiered checkpointing copies the data asynchronously from the host RAM to adjacent nodes using RDMA over Elastic Fabric Adapter (EFA). If a node experiences an issue, its checkpoint data will be available on other nodes too.
From time to time, it copies the data to a second layer of persistent storage, such as Amazon S3. This helps both when writing to RAM fails and when you want to persistently store the checkpoint data for future use.

With managed tiered checkpointing, you can configure frequency and retention policies for both the in-memory and persistent storage tiers. You use the first layer (in-memory) to save checkpoints at a high frequency for fast recovery, periodically saving to Amazon S3 for backup. Managed tiered checkpointing provides a file system that integrates seamlessly with PyTorch Distributed Checkpointing (DCP); adding it to your training script requires only a few lines of code. It also improves checkpoint performance by using in-memory storage while using other tiers for persistence. PyTorch DCP solves the problem of saving a model’s checkpoint when training uses distributed resources, such as multiple GPUs across multiple compute nodes: trainers, parameters, and the dataset are partitioned across those nodes and resources, and PyTorch DCP saves and loads from multiple ranks in parallel. PyTorch DCP produces multiple files per checkpoint, at least one per rank. Depending on the number and size of those files, shared and network file systems such as NFS can struggle with inode and metadata management. Managed tiered checkpointing helps by making it possible to use multiple tiers, reducing interruption to training while keeping the benefits of PyTorch DCP, such as deduplication of checkpoint data.
With managed tiered checkpointing in SageMaker HyperPod, you can maintain a high training throughput even in large-scale environments prone to failures. It uses your existing SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and compute nodes, and there are no additional costs to use the library.
In the following sections, we explore how to configure the SageMaker HyperPod cluster’s training scripts to use this new feature.
Configure your SageMaker HyperPod cluster for managed tiered checkpointing
SageMaker HyperPod provisions resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). By reducing the complex work of building and maintaining compute clusters using accelerators like AWS Trainium and NVIDIA H200/B200 GPUs, it speeds up the creation of foundation models. To create a new SageMaker HyperPod cluster, refer to the Amazon SageMaker HyperPod Developer Guide. If you want to accelerate your deployment by using field hardened assets, refer to the following GitHub repo.
The examples shared in this post are intended to help you learn more about this new feature. If you’re considering running them in a production environment, have your security team review the content and make sure they adhere to your security standards. At AWS, security is the top priority, and we understand that every customer has their own security framework.

Before creating or updating a cluster to add the managed tiered checkpointing feature, you must set up the EKS pods to access an S3 bucket, either in your own account or across accounts. When working with buckets in the same account as the SageMaker HyperPod EKS cluster, you can use the following policy (change your bucket name before applying it):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

If the bucket is in a different account, you must authorize an AWS Identity and Access Management (IAM) principal to access those buckets. The following IAM policy will do that for you. Be sure to change both the bucket name and the IAM principal (for example, your AWS account ID).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CheckPointCrossAccountAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<account_id>:root"
            },
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ]
        }
    ]
}

To create a new cluster with managed tiered checkpointing, pass the --tiered-storage-config parameter with Mode set to Enable in an AWS Command Line Interface (AWS CLI) command:

aws sagemaker create-cluster \
    --cluster-name "ml-cluster" \
    --tiered-storage-config '{ "Mode": "Enable" }' \
    --instance-groups '[{
        "InstanceCount": 1,
        ...
    }]'

You can also update an existing cluster using the UpdateCluster API and pass the CachingConfig parameter with the required AllocatedMemory configuration. You can use the CachingConfig parameter to define a fixed value or a percentage of the CPU RAM for checkpointing.

aws sagemaker update-cluster \
    --cluster-name <my-training-cluster> \
    --tiered-storage-config '{
        "Mode": "Enable",
        "InstanceMemoryAllocationPercentage": <percent>
    }'

Now that your SageMaker HyperPod cluster has the managed tiered checkpointing feature, let’s prepare the training scripts and add them.
Install the managed tiered checkpoint libraries and integrate with your training script
Managed tiered checkpointing integrates with PyTorch DCP. You start by installing the sagemaker-checkpointing library. Then you create and configure a namespace to store the checkpoints based on the defined frequency. Finally, you add the checkpoint function inside your training loop.
To install the library, we simply use Python’s pip. Make sure you already have the dependencies installed: Python 3.10 or higher, PyTorch with DCP support, and the AWS credentials configured properly. To integrate Amazon S3 as another storage layer, you also need s3torchconnector installed.

# Install the pre-requisites
pip install torch boto3 botocore tenacity s3torchconnector

# Install the Managed Tiered Checkpointing library
pip install amzn-sagemaker-checkpointing

Now you can import the library on your script and configure the namespace and frequency for checkpointing:

import os
import time

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import async_save, load
from amzn_sagemaker_checkpointing.config.sagemaker_checkpoint_config import SageMakerCheckpointConfig
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
    SageMakerTieredStorageReader
)

checkpoint_config = SageMakerCheckpointConfig(
    # Unique ID for your training job
    # Allowed characters in ID include: alphanumeric, hyphens, and underscores
    namespace=os.environ.get('TRAINING_JOB_NAME', f'job-{int(time.time())}'),

    # Number of distributed processes/available GPUs
    world_size=dist.get_world_size(),

    # Amazon S3 storage location, required for SageMakerTieredStorageReader for read fallbacks
    # Required for SageMakerTieredStorageWriter when save_to_s3 is True
    s3_tier_base_path="s3://<my-bucket>/checkpoints"
)
In the preceding code snippet, we have configured managed tiered checkpointing with the same world_size as the number of ranks in our cluster. When you start a distributed training, each GPU in the cluster is assigned a rank number, and the total number of GPUs available is the world_size. We set up Amazon S3 as our backup persistent storage, setting managed tiered checkpointing to store data in Amazon S3 every 100 training steps. Both world_size and namespace are required parameters; the others are optional.
Now that the configuration is ready, let’s set up PyTorch DCP and integrate managed tiered checkpointing.
First, configure the storage writer. This component is passed to the PyTorch DCP async_save function alongside the model’s state dictionary. We use the SageMakerTieredStorageWriter when writing checkpoints and the SageMakerTieredStorageReader when restoring from them.
Inside your model training loop, you add the storage writer configuration and pass along both the managed tiered checkpointing configuration and the step number:

    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": training_step,
        "epoch": epoch
    }

    # Create storage writer for the current step and decide whether to also save to persistent storage
    checkpoint_config.save_to_s3 = training_step % s3_ckpt_freq == 0
    storage_writer = SageMakerTieredStorageWriter(
        checkpoint_config=checkpoint_config,
        step=training_step
    )

You can define the step number explicitly for the storage writer, or you can let the storage writer identify the step number from the path where the checkpoint is being saved. If you want the storage writer to infer the step number from the base path, don’t set the step parameter and make sure your path contains the step number.
Now you can call the PyTorch DCP asynchronous save function and pass along the state dictionary and the storage writer configuration:

async_save(state_dict=state_dict, storage_writer=storage_writer)
We have set up managed tiered checkpointing to write checkpoints at our desired frequency and location (in-memory). Let’s use the storage reader to restore those checkpoints. First, pass the managed tiered checkpointing configuration to the SageMakerTieredStorageReader, then call the PyTorch DCP load function, passing the model state dictionary and the storage reader configuration:

storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
load(state_dict, storage_reader=storage_reader)

To work through a complete example, refer to the following GitHub repository, where we’ve created a simple training script, including the managed tiered checkpointing feature.
Clean up
After you have worked with managed tiered checkpointing, and you want to clean up the environment, simply remove the amzn-sagemaker-checkpointing library by running pip uninstall amzn-sagemaker-checkpointing.
If you installed the solution in a Python virtual environment, deleting the virtual environment will suffice.

Managed tiered checkpointing is a free feature that doesn’t require additional resources to run. You use your existing SageMaker HyperPod EKS cluster and compute nodes.
Best practices to optimize your checkpoint strategy with managed tiered checkpointing
Managed tiered checkpointing will attempt to write to the in-memory tier first. This optimizes the writing performance because in-memory provides ultra-low latency checkpoint access. You should configure managed tiered checkpointing to write to a second layer, such as Amazon S3, from time to time. For example, configure managed tiered checkpointing to write to the in-memory layer every 10 steps, and configure it to write to Amazon S3 every 100 steps.
If managed tiered checkpointing fails to write to the in-memory layer, and the node experiences an issue, then you still have your checkpoint saved on Amazon S3. While writing to Amazon S3, managed tiered checkpointing uses multiple TCP streams (chunks) to optimize Amazon S3 writes.
In terms of consistency, managed tiered checkpointing uses an all-or-nothing writing strategy. It implements a fallback mechanism that will seamlessly transition between the storage tiers. Checkpoint metadata, such as step number, is stored alongside the data for every tier.
When trying to troubleshoot managed tiered checkpointing, you can check the log written locally to /var/log/sagemaker_checkpointing/{namespace}_checkpointing.log. It publishes data about the training step, rank number, and the operation details. The following is an example output of that file:

[timestamp] [namespace] [logger_name] [INFO] [filename:451] [Rank 0] Step 240: Starting checkpoint write ([SavePlan Items Count] items)
[timestamp] [namespace] [logger_name] [INFO] [filename:498] [Rank 0] Step 240: In-memory write completed in [Latency]s ([Throughput] MB/s)
[timestamp] [namespace] [logger_name] [INFO] [filename:530] [Rank 0] Step 240: S3 batch write completed in [Latency]s ([Size] total, [Throughput] MB/s average)

Managed tiered checkpointing also writes those metrics to the console, so it’s straightforward to troubleshoot during development. They contain information on which step number is being written to which storage layer and the throughput and total time taken to write the data. With that information, you can monitor and troubleshoot managed tiered checkpointing thoroughly.
When you combine those tools with the SageMaker HyperPod observability stack, you get a complete view of all metrics of your training or inference workload.
Conclusion
The new managed tiered checkpointing feature in SageMaker HyperPod augments FM training efficiency by intelligently distributing checkpoints across multiple storage tiers. This advanced approach places model states in fast access locations such as CPU RAM memory, while using persistent storage such as Amazon S3 for cost-effective, long-term persistence. As of the time of this launch, managed tiered checkpointing is supported only on SageMaker HyperPod on Amazon EKS.
Managed tiered checkpointing delivers fast recovery times without increased storage costs, avoiding complex trade-offs between resiliency, training efficiency, and storage costs. It has been validated on large distributed training clusters ranging from hundreds of GPUs to more than 15,000 GPUs, with checkpoints saved within seconds.
Integrating managed tiered checkpointing on your training scripts is straightforward, with just a few lines of code, providing immediate access to sophisticated checkpoint management without requiring deep engineering expertise.
For more information on how managed tiered checkpointing works, how to set it up, and other details, refer to HyperPod managed tier checkpointing.

About the authors
Paulo Aragao is a Principal Worldwide Solutions Architect focused on Generative AI at the Specialist Organisation at AWS. He helps enterprises and startups build their foundation model strategy and innovate faster by leveraging his extensive knowledge of high performance computing and machine learning. A long-time bass player and natural-born rock fan, Paulo enjoys spending time travelling with his family, scuba diving, and playing real-time strategy and role-playing games.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker Hyperpod as the best-in-class choice for Generative AI model’s training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Mandar Kulkarni is a Software Development Engineer II at AWS, where he works on Amazon SageMaker. He specializes in building scalable and performant machine learning libraries and infrastructure solutions, particularly focusing on SageMaker HyperPod. His technical interests span machine learning, artificial intelligence, distributed systems and application security. When not architecting ML solutions, Mandar enjoys hiking, practicing Indian classical music, sports, and spending quality time with his young family.
Vinay Devadiga is a Software Development Engineer II at AWS with a deep passion for artificial intelligence and cloud computing. He focuses on building scalable, high-performance systems that enable the power of AI and machine learning to solve complex problems. Vinay enjoys staying at the forefront of technology, continuously learning, and applying new advancements to drive innovation. Outside of work, he likes playing sports and spending quality time with his family.
Vivek Maran is a Software Engineer at AWS. He currently works on the development of Amazon SageMaker HyperPod, a resilient platform for large scale distributed training and inference. His interests include large scale distributed systems, network systems, and artificial intelligence. Outside of work, he enjoys music, running, and keeping up to date with business & technology trends.

How to Build a Complete Multi-Domain AI Web Agent Using Notte and Gemini API

In this tutorial, we demonstrate a complete, advanced implementation of the Notte AI Agent, integrating the Gemini API to power reasoning and automation. By combining Notte’s browser automation capabilities with structured outputs through Pydantic models, it showcases how an AI web agent can research products, monitor social media, analyze markets, scan job opportunities, and more. The tutorial is designed as a practical, hands-on guide, featuring modular functions, demos, and workflows that demonstrate how developers can leverage AI-driven automation for real-world tasks such as e-commerce research, competitive intelligence, and content strategy. Check out the FULL CODES here.

!pip install notte python-dotenv pydantic google-generativeai requests beautifulsoup4
!patchright install --with-deps chromium

import os
import json
import time
from typing import List, Optional
from pydantic import BaseModel
import google.generativeai as genai
from dotenv import load_dotenv

GEMINI_API_KEY = "USE YOUR OWN API KEY HERE"
os.environ['GEMINI_API_KEY'] = GEMINI_API_KEY
genai.configure(api_key=GEMINI_API_KEY)

import notte

We begin by installing all the required dependencies, including Notte, Gemini, and supporting libraries, and then configure our Gemini API key for authentication. After setting up the environment, we import Notte to start building and running our AI web agent seamlessly. Check out the FULL CODES here.

class ProductInfo(BaseModel):
    name: str
    price: str
    rating: Optional[float]
    availability: str
    description: str

class NewsArticle(BaseModel):
    title: str
    summary: str
    url: str
    date: str
    source: str

class SocialMediaPost(BaseModel):
    content: str
    author: str
    likes: int
    timestamp: str
    platform: str

class SearchResult(BaseModel):
    query: str
    results: List[dict]
    total_found: int

We define structured Pydantic models that let us capture and validate data consistently. With ProductInfo, NewsArticle, SocialMediaPost, and SearchResult, we ensure that the AI agent outputs reliable, well-structured information for products, news articles, social media posts, and search results. Check out the FULL CODES here.

class AdvancedNotteAgent:
    def __init__(self, headless=True, max_steps=20):
        self.headless = headless
        self.max_steps = max_steps
        self.session = None
        self.agent = None

    def __enter__(self):
        self.session = notte.Session(headless=self.headless)
        self.session.__enter__()
        self.agent = notte.Agent(
            session=self.session,
            reasoning_model='gemini/gemini-2.5-flash',
            max_steps=self.max_steps
        )
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            self.session.__exit__(exc_type, exc_val, exc_tb)

    def research_product(self, product_name: str, website: str = "amazon.com") -> ProductInfo:
        """Research a product and extract structured information"""
        task = f"Go to {website}, search for '{product_name}', click on the first relevant product, and extract detailed product information including name, price, rating, availability, and description."

        response = self.agent.run(
            task=task,
            response_format=ProductInfo,
            url=f"https://{website}"
        )
        return response.answer

    def news_aggregator(self, topic: str, num_articles: int = 3) -> List[NewsArticle]:
        """Aggregate news articles on a specific topic"""
        task = f"Search for recent news about '{topic}', find {num_articles} relevant articles, and extract title, summary, URL, date, and source for each."

        response = self.agent.run(
            task=task,
            url="https://news.google.com",
            response_format=List[NewsArticle]
        )
        return response.answer

    def social_media_monitor(self, hashtag: str, platform: str = "twitter") -> List[SocialMediaPost]:
        """Monitor social media for specific hashtags"""
        if platform.lower() == "twitter":
            url = "https://twitter.com"
        elif platform.lower() == "reddit":
            url = "https://reddit.com"
        else:
            url = f"https://{platform}.com"

        task = f"Go to {platform}, search for posts with hashtag '{hashtag}', and extract content, author, engagement metrics, and timestamps from the top 5 posts."

        response = self.agent.run(
            task=task,
            url=url,
            response_format=List[SocialMediaPost]
        )
        return response.answer

    def competitive_analysis(self, company: str, competitors: List[str]) -> dict:
        """Perform competitive analysis by gathering pricing and feature data"""
        results = {}

        for competitor in [company] + competitors:
            task = f"Go to {competitor}'s website, find their pricing page or main product page, and extract key features, pricing tiers, and unique selling points."

            try:
                response = self.agent.run(
                    task=task,
                    url=f"https://{competitor}.com"
                )
                results[competitor] = response.answer
                time.sleep(2)
            except Exception as e:
                results[competitor] = f"Error: {str(e)}"

        return results

    def job_market_scanner(self, job_title: str, location: str = "remote") -> List[dict]:
        """Scan job market for opportunities"""
        task = f"Search for '{job_title}' jobs in '{location}', extract job titles, companies, salary ranges, and application URLs from the first 10 results."

        response = self.agent.run(
            task=task,
            url="https://indeed.com"
        )
        return response.answer

    def price_comparison(self, product: str, websites: List[str]) -> dict:
        """Compare prices across multiple websites"""
        price_data = {}

        for site in websites:
            task = f"Search for '{product}' on this website and find the best price, including any discounts or special offers."

            try:
                response = self.agent.run(
                    task=task,
                    url=f"https://{site}"
                )
                price_data[site] = response.answer
                time.sleep(1)
            except Exception as e:
                price_data[site] = f"Error: {str(e)}"

        return price_data

    def content_research(self, topic: str, content_type: str = "blog") -> dict:
        """Research content ideas and trending topics"""
        if content_type == "blog":
            url = "https://medium.com"
            task = f"Search for '{topic}' articles, analyze trending content, and identify popular themes, engagement patterns, and content gaps."
        elif content_type == "video":
            url = "https://youtube.com"
            task = f"Search for '{topic}' videos, analyze view counts, titles, and descriptions to identify trending formats and popular angles."
        else:
            url = "https://google.com"
            task = f"Search for '{topic}' content across the web and analyze trending discussions and popular formats."

        response = self.agent.run(task=task, url=url)
        return {"topic": topic, "insights": response.answer, "platform": content_type}

We wrap Notte in a context-managed AdvancedNotteAgent that sets up a headless browser session and a Gemini-powered reasoning model, allowing us to automate multi-step web tasks reliably. We then add high-level methods, including product research, news aggregation, social listening, competitive scans, job search, price comparison, and content research, that return clean, structured outputs. This lets us script real-world web workflows while keeping the interface simple and consistent. Check out the FULL CODES here.

def demo_ecommerce_research():
    """Demo: E-commerce product research and comparison"""
    print(" E-commerce Research Demo")
    print("=" * 50)

    with AdvancedNotteAgent(headless=True) as agent:
        product = agent.research_product("wireless earbuds", "amazon.com")
        print(f"Product Research Results:")
        print(f"Name: {product.name}")
        print(f"Price: {product.price}")
        print(f"Rating: {product.rating}")
        print(f"Availability: {product.availability}")
        print(f"Description: {product.description[:100]}...")

        print("\n Price Comparison:")
        websites = ["amazon.com", "ebay.com", "walmart.com"]
        prices = agent.price_comparison("wireless earbuds", websites)
        for site, data in prices.items():
            print(f"{site}: {data}")

def demo_news_intelligence():
    """Demo: News aggregation and analysis"""
    print(" News Intelligence Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        articles = agent.news_aggregator("artificial intelligence", 3)

        for i, article in enumerate(articles, 1):
            print(f"\nArticle {i}:")
            print(f"Title: {article.title}")
            print(f"Source: {article.source}")
            print(f"Summary: {article.summary}")
            print(f"URL: {article.url}")

def demo_social_listening():
    """Demo: Social media monitoring and sentiment analysis"""
    print(" Social Media Listening Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        posts = agent.social_media_monitor("#AI", "reddit")

        for i, post in enumerate(posts, 1):
            print(f"\nPost {i}:")
            print(f"Author: {post.author}")
            print(f"Content: {post.content[:100]}...")
            print(f"Engagement: {post.likes} likes")
            print(f"Platform: {post.platform}")

def demo_market_intelligence():
    """Demo: Competitive analysis and market research"""
    print(" Market Intelligence Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        company = "openai"
        competitors = ["anthropic", "google"]
        analysis = agent.competitive_analysis(company, competitors)

        for comp, data in analysis.items():
            print(f"\n{comp.upper()}:")
            print(f"Analysis: {str(data)[:200]}...")

def demo_job_market_analysis():
    """Demo: Job market scanning and analysis"""
    print(" Job Market Analysis Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        jobs = agent.job_market_scanner("python developer", "san francisco")

        print(f"Found {len(jobs)} job opportunities:")
        for job in jobs[:3]:
            print(f"- {job}")

def demo_content_strategy():
    """Demo: Content research and trend analysis"""
    print(" Content Strategy Demo")
    print("=" * 50)

    with AdvancedNotteAgent() as agent:
        blog_research = agent.content_research("machine learning", "blog")
        video_research = agent.content_research("machine learning", "video")

        print("Blog Content Insights:")
        print(blog_research["insights"][:300] + "...")

        print("\nVideo Content Insights:")
        print(video_research["insights"][:300] + "...")

We run a suite of demos that showcase real web automation end-to-end, including researching products and comparing prices, aggregating fresh news, and monitoring social chatter. We also conduct competitive scans, analyze the job market, and track blog/video trends, yielding structured, ready-to-use insights from each task. Check out the FULL CODES here.

class WorkflowManager:
    def __init__(self):
        self.agents = []
        self.results = {}

    def add_agent_task(self, name: str, task_func, *args, **kwargs):
        """Add an agent task to the workflow"""
        self.agents.append({
            'name': name,
            'func': task_func,
            'args': args,
            'kwargs': kwargs
        })

    def execute_workflow(self, parallel=False):
        """Execute all agent tasks in the workflow"""
        print(" Executing Multi-Agent Workflow")
        print("=" * 50)

        for agent_task in self.agents:
            name = agent_task['name']
            func = agent_task['func']
            args = agent_task['args']
            kwargs = agent_task['kwargs']

            print(f"\n Executing {name}...")
            try:
                result = func(*args, **kwargs)
                self.results[name] = result
                print(f" {name} completed successfully")
            except Exception as e:
                self.results[name] = f"Error: {str(e)}"
                print(f" {name} failed: {str(e)}")

            if not parallel:
                time.sleep(2)
        return self.results

def market_research_workflow(company_name: str, product_category: str):
    """Complete market research workflow"""
    workflow = WorkflowManager()

    workflow.add_agent_task(
        "Product Research",
        lambda: research_trending_products(product_category)
    )

    workflow.add_agent_task(
        "Competitive Analysis",
        lambda: analyze_competitors(company_name, product_category)
    )

    workflow.add_agent_task(
        "Social Sentiment",
        lambda: monitor_brand_sentiment(company_name)
    )

    return workflow.execute_workflow()

def research_trending_products(category: str):
    """Research trending products in a category"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Research trending {category} products, find top 5 products with prices, ratings, and key features."
        response = agent.agent.run(
            task=task,
            url="https://amazon.com"
        )
        return response.answer

def analyze_competitors(company: str, category: str):
    """Analyze competitors in the market"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Research {company} competitors in {category}, compare pricing strategies, features, and market positioning."
        response = agent.agent.run(
            task=task,
            url="https://google.com"
        )
        return response.answer

def monitor_brand_sentiment(brand: str):
    """Monitor brand sentiment across platforms"""
    with AdvancedNotteAgent(headless=True) as agent:
        task = f"Search for recent mentions of {brand} on social media and news, analyze sentiment and key themes."
        response = agent.agent.run(
            task=task,
            url="https://reddit.com"
        )
        return response.answer

We design a WorkflowManager that chains multiple AI agent tasks into a single orchestrated pipeline. By adding modular tasks like product research, competitor analysis, and sentiment monitoring, we can execute a complete market research workflow in sequence (or parallel). This transforms individual demos into a coordinated multi-agent system that provides holistic insights for informed real-world decision-making. Check out the FULL CODES here.

def main():
    """Main function to run all demos"""
    print(" Advanced Notte AI Agent Tutorial")
    print("=" * 60)
    print("Note: Make sure to set your GEMINI_API_KEY above!")
    print("Get your free API key at: https://makersuite.google.com/app/apikey")
    print("=" * 60)

    if GEMINI_API_KEY == "YOUR_GEMINI_API_KEY":
        print(" Please set your GEMINI_API_KEY in the code above!")
        return

    try:
        print("\n1. E-commerce Research Demo")
        demo_ecommerce_research()

        print("\n2. News Intelligence Demo")
        demo_news_intelligence()

        print("\n3. Social Media Listening Demo")
        demo_social_listening()

        print("\n4. Market Intelligence Demo")
        demo_market_intelligence()

        print("\n5. Job Market Analysis Demo")
        demo_job_market_analysis()

        print("\n6. Content Strategy Demo")
        demo_content_strategy()

        print("\n7. Multi-Agent Workflow Demo")
        results = market_research_workflow("Tesla", "electric vehicles")
        print("Workflow Results:")
        for task, result in results.items():
            print(f"{task}: {str(result)[:150]}...")

    except Exception as e:
        print(f" Error during execution: {str(e)}")
        print(" Tip: Make sure your Gemini API key is valid and you have internet connection")

def quick_scrape(url: str, instructions: str = "Extract main content"):
    """Quick scraping function for simple data extraction"""
    with AdvancedNotteAgent(headless=True, max_steps=5) as agent:
        response = agent.agent.run(
            task=f"{instructions} from this webpage",
            url=url
        )
        return response.answer

def quick_search(query: str, num_results: int = 5):
    """Quick search function with structured results"""
    with AdvancedNotteAgent(headless=True, max_steps=10) as agent:
        task = f"Search for '{query}' and return the top {num_results} results with titles, URLs, and brief descriptions."
        response = agent.agent.run(
            task=task,
            url="https://google.com",
            response_format=SearchResult
        )
        return response.answer

def quick_form_fill(form_url: str, form_data: dict):
    """Quick form filling function"""
    with AdvancedNotteAgent(headless=False, max_steps=15) as agent:
        data_str = ", ".join([f"{k}: {v}" for k, v in form_data.items()])
        task = f"Fill out the form with this information: {data_str}, then submit it."

        response = agent.agent.run(
            task=task,
            url=form_url
        )
        return response.answer

if __name__ == "__main__":
    print(" Quick Test Examples:")
    print("=" * 30)

    print("1. Quick Scrape Example:")
    try:
        result = quick_scrape("https://news.ycombinator.com", "Extract the top 3 post titles")
        print(f"Scraped: {result}")
    except Exception as e:
        print(f"Error: {e}")

    print("\n2. Quick Search Example:")
    try:
        search_results = quick_search("latest AI news", 3)
        print(f"Search Results: {search_results}")
    except Exception as e:
        print(f"Error: {e}")

    print("\n3. Custom Agent Task:")
    try:
        with AdvancedNotteAgent(headless=True) as agent:
            response = agent.agent.run(
                task="Go to Wikipedia, search for 'artificial intelligence', and summarize the main article in 2 sentences.",
                url="https://wikipedia.org"
            )
            print(f"Wikipedia Summary: {response.answer}")
    except Exception as e:
        print(f"Error: {e}")

    main()

    print("\n Tutorial Complete!")
    print(" Tips for success:")
    print("- Start with simple tasks and gradually increase complexity")
    print("- Use structured outputs (Pydantic models) for reliable data extraction")
    print("- Implement rate limiting to respect API quotas")
    print("- Handle errors gracefully in production workflows")
    print("- Combine scripting with AI for cost-effective automation")

    print("\n Next Steps:")
    print("- Customize the agents for your specific use cases")
    print("- Add error handling and retry logic for production")
    print("- Implement logging and monitoring for agent activities")
    print("- Scale up with Notte's hosted API service for enterprise features")

We wrap everything with a main() function that runs all demos end-to-end, and then add quick helper utilities, including quick_scrape, quick_search, and quick_form_fill, to perform focused tasks with minimal setup. We also include quick tests to validate the helpers and a custom Wikipedia task before invoking the full workflow, ensuring we can iterate fast while still exercising the complete agent pipeline.

In conclusion, the tutorial demonstrates how Notte, when combined with Gemini, can evolve into a powerful, multi-purpose AI web agent for research, monitoring, and analysis. It walks through individual demos for e-commerce, news, and social media, and then scales them into advanced multi-agent workflows that combine insights across domains. By following this guide, developers can quickly prototype AI agents in Colab, extend them with custom tasks, and adapt the system for business intelligence, automation, and creative use cases.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Build a Complete Multi-Domain AI Web Agent Using Notte and Gemini appeared first on MarkTechPost.

GibsonAI Releases Memori: An Open-Source SQL-Native Memory Engine for AI Agents

When we think about human intelligence, memory is one of the first things that comes to mind. It’s what enables us to learn from our experiences, adapt to new situations, and make more informed decisions over time. Similarly, AI agents become smarter with memory. For example, an agent can remember your past purchases, budget, and preferences, and suggest gifts for your friends based on what it has learned from previous conversations.

Agents usually break tasks into steps (plan → search → call API → parse → write), but without memory they forget what happened in earlier steps. They repeat tool calls, fetch the same data again, or miss simple rules like “always refer to the user by their name.” Because the same context has to be supplied over and over, agents spend more tokens, respond more slowly, and give inconsistent answers. The industry has collectively spent billions on vector databases and embedding infrastructure to solve what is, at its core, a data persistence problem for AI agents. These solutions create black-box systems where developers cannot inspect, query, or understand why certain memories were retrieved.

The GibsonAI team built Memori to fix this issue. Memori is an open-source memory engine that provides persistent, intelligent memory for any LLM using standard SQL databases (PostgreSQL/MySQL). In this article, we’ll explore how Memori tackles memory challenges and what it offers.

The Stateless Nature of Modern AI: The Hidden Cost

Studies indicate that users spend 23-31% of their time providing context that they’ve already shared in previous conversations. For a development team using AI assistants, this translates to:

Individual Developer: ~2 hours/week repeating context

10-person Team: ~20 hours/week of lost productivity

Enterprise (1000 developers): ~2000 hours/week or $4M/year in redundant communication

Beyond productivity, this repetition breaks the illusion of intelligence. An AI that cannot remember your name after hundreds of conversations doesn’t feel intelligent.

Current Limitations of Stateless LLMs

No Learning from Interactions: Every mistake is repeated, every preference must be restated

Broken Workflows: Multi-session projects require constant context rebuilding

No Personalization: The AI cannot adapt to individual users or teams

Lost Insights: Valuable patterns in conversations are never captured

Compliance Challenges: No audit trail of AI decision-making

The Need for Persistent, Queryable Memory

What AI really needs is persistent, queryable memory, just as every application relies on a database. But you can’t simply use your existing app database as AI memory, because it isn’t designed for context selection, relevance ranking, or injecting knowledge back into an agent’s workflow. That’s why we built a memory layer, which is essential for AI agents to feel truly intelligent.

Why SQL Matters for AI Memory

SQL databases have been around for more than 50 years. They are the backbone of almost every application we use today, from banking apps to social networks. Why? Because SQL is simple, reliable, and universal.

Every developer knows SQL. You don’t need to learn a new query language.

Battle-tested reliability. SQL has run the world’s most critical systems for decades.

Powerful queries. You can filter, join, and aggregate data with ease.

Strong guarantees. ACID transactions make sure your data stays consistent and safe.

Huge ecosystem. Tools for migration, backups, dashboards, and monitoring are everywhere.

When you build on SQL, you’re standing on decades of proven tech, not reinventing the wheel.

The Drawbacks of Vector Databases

Most competing AI memory systems today are built on vector databases. On paper, they sound advanced: they let you store embeddings and search by similarity. But in practice, they come with hidden costs and complexity:

Multiple moving parts. A typical setup needs a vector DB, a cache, and a SQL DB just to function.

Vendor lock-in. Your data often lives inside a proprietary system, making it hard to move or audit.

Black-box retrieval. You can’t easily see why a certain memory was pulled.

Expensive. Infrastructure and usage costs add up quickly, especially at scale.

Hard to debug. Embeddings are not human-readable, so you can’t just query with SQL and check results.

Here’s how it compares to Memori’s SQL-first design:

Aspect | Vector Database / RAG Solutions | Memori’s Approach
Services Required | 3–5 (Vector DB + Cache + SQL) | 1 (SQL only)
Databases | Vector + Cache + SQL | SQL only
Query Language | Proprietary API | Standard SQL
Debugging | Black-box embeddings | Readable SQL queries
Backup | Complex orchestration | cp memory.db backup.db or pg_basebackup
Data Processing | Embeddings: ~$0.0001 / 1K tokens (OpenAI) → cheap upfront | Entity extraction: GPT-4o at ~$0.005 / 1K tokens → higher upfront
Storage Costs | $0.10–0.50 / GB / month (vector DBs) | ~$0.01–0.05 / GB / month (SQL)
Query Costs | ~$0.0004 / 1K vectors searched | Near zero (standard SQL queries)
Infrastructure | Multiple moving parts, higher maintenance | Single database, simple to manage

Why It Works

If you think SQL can’t handle memory at scale, think again. SQLite, one of the simplest SQL databases, is the most widely deployed database in the world:

Over 4 billion deployments

Runs on every iPhone, Android device, and web browser

Executes trillions of queries every single day

If SQLite can handle this massive workload with ease, why build AI memory on expensive, distributed vector clusters?

Memori Solution Overview

Memori uses structured entity extraction, relationship mapping, and SQL-based retrieval to create transparent, portable, and queryable AI memory. It uses multiple agents working together to intelligently promote essential long-term memories to short-term storage for faster context injection.

With a single line of code, memori.enable(), any LLM gains the ability to remember conversations, learn from interactions, and maintain context across sessions. The entire memory system is stored in a standard SQLite database (or PostgreSQL/MySQL for enterprise deployments), making it fully portable, auditable, and owned by the user.
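To make the one-line integration concrete, here is a minimal sketch of how such an enable-and-chat flow could look. The memori.enable() call is taken from the article; the Memori constructor argument (database_connect) and the OpenAI client usage are illustrative assumptions rather than confirmed API details.

from memori import Memori
from openai import OpenAI

# Point Memori at a standard SQL database; a local SQLite file is the simplest option.
# The constructor argument name is an assumption used for illustration.
memory = Memori(database_connect="sqlite:///memori.db")
memory.enable()  # the single line described above: conversations are now recorded and recalled

client = OpenAI()

# First session: state a preference.
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "I prefer PostgreSQL for production deployments."}],
)

# Later session: Memori injects the remembered preference as context automatically.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Which database should I use for my new service?"}],
)
print(reply.choices[0].message.content)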

Key Differentiators

Radical Simplicity: One line to enable memory for any LLM framework (OpenAI, Anthropic, LiteLLM, LangChain)

True Data Ownership: Memory stored in standard SQL databases that users fully control

Complete Transparency: Every memory decision is queryable with SQL and fully explainable (see the sketch after this list)

Zero Vendor Lock-in: Export your entire memory as a SQLite file and move anywhere

Cost Efficiency: 80-90% cheaper than vector database solutions at scale

Compliance Ready: SQL-based storage enables audit trails, data residency, and regulatory compliance
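Because the memory lives in an ordinary SQL database, it can be inspected with nothing more than Python’s standard sqlite3 module. The sketch below illustrates the idea; the table and column names are hypothetical placeholders, since Memori’s actual schema is not described in this article.

import sqlite3

conn = sqlite3.connect("memori.db")  # the same file Memori writes to

# Hypothetical table/column names, used only to show that plain SQL is all you need.
rows = conn.execute(
    """
    SELECT created_at, category, summary
    FROM memories
    ORDER BY created_at DESC
    LIMIT 10
    """
).fetchall()

for created_at, category, summary in rows:
    print(created_at, category, summary)

conn.close()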

Memori Use Cases

Smart shopping experience with an AI Agent that remembers customer preferences and shopping behavior.

Personal AI assistants that remember user preferences and context

Customer support bots that never ask the same question twice

Educational tutors who adapt to student progress

Team knowledge management systems with shared memory

Compliance-focused applications requiring complete audit trails

Business Impact Metrics

Based on early implementations from our community users, we identified that Memori helps with the following:

Development Time: 90% reduction in memory system implementation (hours vs. weeks)

Infrastructure Costs: 80-90% reduction compared to vector database solutions

Query Performance: 10-50ms response time (2-4x faster than vector similarity search)

Memory Portability: 100% of memory data portable (vs. 0% with cloud vector databases)

Compliance Readiness: Full SQL audit capability from day one

Maintenance Overhead: Single database vs. distributed vector systems

Technical Innovation

Memori introduces three core innovations:

Dual-Mode Memory System: Combining “conscious” working memory with “auto” intelligent search, mimicking human cognitive patterns

Universal Integration Layer: Automatic memory injection for any LLM without framework-specific code

Multi-Agent Architecture: Multiple specialized AI agents working together for intelligent memory

Existing Solutions in the Market

There are already several approaches to giving AI agents some form of memory, each with its own strengths and trade-offs:

Mem0 → A feature-rich solution that combines Redis, vector databases, and orchestration layers to manage memory in a distributed setup.

LangChain Memory → Provides convenient abstractions for developers building within the LangChain framework.

Vector Databases (Pinecone, Weaviate, Chroma) → Focused on semantic similarity search using embeddings, designed for specialized use cases.

Custom Solutions → In-house designs tailored to specific business needs, offering flexibility but requiring significant maintenance.

These solutions demonstrate the various directions the industry is taking to address the memory problem. Memori enters the landscape with a different philosophy, bringing memory into a SQL-native, open-source form that is simple, transparent, and production-ready.

Memori Built on a Strong Database Infrastructure

In addition to memory, AI agents need a database backbone to make that memory usable and scalable. Think of AI agents that can run queries safely in an isolated database sandbox, optimize queries over time, and autoscale on demand, such as initiating a new database for a user to keep their relevant data separate.

A robust database infrastructure from GibsonAI backs Memori. This makes memory reliable and production-ready with:

Instant provisioning

Autoscale on demand

Database branching

Database versioning

Query optimization

Point-in-time recovery

Strategic Vision

While competitors chase complexity with distributed vector solutions and proprietary embeddings, Memori embraces the proven reliability of SQL databases that have powered applications for decades.

The goal is not to build the most sophisticated memory system, but the most practical one. By storing AI memory in the same databases that already run the world’s applications, Memori enables a future where AI memory is as portable, queryable, and manageable as any other application data.

Check out the GitHub Page here. Thanks to the GibsonAI team for the thought leadership/resources and for supporting this article.
The post GibsonAI Releases Memori: An Open-Source SQL-Native Memory Engine for AI Agents appeared first on MarkTechPost.

A New MIT Study Shows Reinforcement Learning Minimizes Catastrophic Forgetting Compared to Supervised Fine-Tuning


What is catastrophic forgetting in foundation models?

Foundation models excel in diverse domains but are largely static once deployed. Fine-tuning on new tasks often introduces catastrophic forgetting—the loss of previously learned capabilities. This limitation poses a barrier for building long-lived, continually improving AI agents.

Why does online reinforcement learning forget less than supervised fine-tuning?

A new MIT study compares reinforcement learning (RL) and supervised fine-tuning (SFT). Both can achieve high performance on new tasks, but SFT tends to overwrite prior abilities. RL, by contrast, preserves them. The key lies in how each method shifts the model’s output distribution relative to the base policy.

https://arxiv.org/pdf/2509.04259

How can forgetting be measured?

The research team proposes an empirical forgetting law:

Forgetting ∝ KL(π₀ ∥ π)

where π₀ is the base model and π is the fine-tuned model. The forward KL divergence, measured on the new task, strongly predicts the extent of forgetting. This makes forgetting quantifiable without needing data from prior tasks.
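As a rough illustration of how this quantity can be estimated, the sketch below computes a token-averaged forward KL between the base and fine-tuned next-token distributions on new-task prompts. It is a simplified approximation of the paper’s measurement (which evaluates the divergence on the new-task distribution), and the fine-tuned checkpoint path is a placeholder.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-3B-Instruct"
TUNED_ID = "path/to/fine-tuned-checkpoint"  # placeholder

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID)

def forward_kl(prompt: str) -> float:
    """Token-averaged KL(pi_0 || pi) over next-token distributions for one prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logp0 = F.log_softmax(base(ids).logits, dim=-1)   # base policy pi_0
        logp = F.log_softmax(tuned(ids).logits, dim=-1)   # fine-tuned policy pi
    # KL(p0 || p) = sum_x p0(x) * (log p0(x) - log p(x)), averaged over token positions
    return (logp0.exp() * (logp0 - logp)).sum(-1).mean().item()

new_task_prompts = ["Compute the sum of the first 100 positive integers."]
print(sum(forward_kl(p) for p in new_task_prompts) / len(new_task_prompts))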

What do experiments on large language models reveal?

Using Qwen 2.5 3B-Instruct as the base model, fine-tuning was performed on:

Math reasoning (Open-Reasoner-Zero),

Science Q&A (SciKnowEval subset),

Tool use (ToolAlpaca).

Performance was evaluated on prior benchmarks such as HellaSwag, MMLU, TruthfulQA, and HumanEval. Results showed that RL improved new-task accuracy while keeping prior-task accuracy stable, whereas SFT consistently sacrificed prior knowledge.

How does RL compare to SFT in robotics tasks?

In robotic control experiments with OpenVLA-7B fine-tuned in SimplerEnv pick-and-place scenarios, RL adaptation maintained general manipulation skills across tasks. SFT, while successful on the new task, degraded prior manipulation abilities—again illustrating RL’s conservatism in preserving knowledge.

What insights come from the ParityMNIST study?

To isolate mechanisms, the research team introduced a toy problem, ParityMNIST. Here, RL and SFT both reached high new-task accuracy, but SFT induced sharper declines on the FashionMNIST auxiliary benchmark. Crucially, plotting forgetting against KL divergence revealed a single predictive curve, validating KL as the governing factor.

Why do on-policy updates matter?

On-policy RL samples from the model’s own outputs, incrementally reweighting them by reward. This process constrains learning to distributions already close to the base model. SFT, in contrast, optimizes against fixed labels that may be arbitrarily distant. Theoretical analysis shows policy gradients converge to KL-minimal optimal solutions, formalizing RL’s advantage.
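One compact way to write this claim (a schematic formulation consistent with the description above, not a verbatim statement from the paper) is that, among policies that solve the new task, on-policy policy gradient methods settle on the one closest in KL to the base policy:

\pi_{\mathrm{RL}} \;\approx\; \operatorname*{arg\,min}_{\pi \in \Pi^{*}} \ \mathrm{KL}\!\left(\pi_{0} \,\|\, \pi\right),
\qquad
\Pi^{*} = \{\pi : \pi \text{ solves the new task}\},

where π₀ is the base policy. SFT, by contrast, is pulled toward whatever distribution the fixed labels define, which can sit arbitrarily far from π₀.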

Are other explanations sufficient?

The research team tested alternatives: weight-space changes, hidden representation drift, sparsity of updates, and alternative distributional metrics (reverse KL, total variation, L2 distance). None matched the predictive strength of forward KL divergence, reinforcing that distributional closeness is the critical factor.

What are the broader implications?

Evaluation: Post-training should consider KL-conservatism, not just task accuracy.

Hybrid methods: Combining SFT efficiency with explicit KL minimization could yield optimal trade-offs.

Continual learning: RL’s Razor offers a measurable criterion for designing adaptive agents that learn new skills without erasing old ones.

Conclusion

The MIT research reframes catastrophic forgetting as a distributional problem governed by forward KL divergence. Reinforcement learning forgets less because its on-policy updates naturally bias toward KL-minimal solutions. This principle—RL’s Razor—provides both an explanation for RL’s robustness and a roadmap for developing post-training methods that support lifelong learning in foundation models.

Key Takeaways

Reinforcement learning (RL) preserves prior knowledge better than Supervised fine-tuning (SFT): Even when both achieve the same accuracy on new tasks, RL retains prior capabilities while SFT erases them.

Forgetting is predictable by KL divergence: The degree of catastrophic forgetting is strongly correlated with the forward KL divergence between the fine-tuned and base policy, measured on the new task.

RL’s Razor principle: On-policy RL converges to KL-minimal solutions, ensuring updates remain close to the base model and reducing forgetting.

Empirical validation across domains: Experiments on LLMs (math, science Q&A, tool use) and robotics tasks confirm RL’s robustness against forgetting, while SFT consistently trades old knowledge for new-task performance.

Controlled experiments confirm generality: In the ParityMNIST toy setting, both RL and SFT showed forgetting aligned with KL divergence, proving the principle holds beyond large-scale models.

Future design axis for post-training: Algorithms should be evaluated not only by new-task accuracy but also by how conservatively they shift distributions in KL space, opening avenues for hybrid RL–SFT methods.

Check out the PAPER and PROJECT PAGE. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A New MIT Study Shows Reinforcement Learning Minimizes Catastrophic Forgetting Compared to Supervised Fine-Tuning appeared first on MarkTechPost.

Maximize HyperPod Cluster utilization with HyperPod task governance fi …

We are excited to announce the general availability of fine-grained compute and memory quota allocation with HyperPod task governance. With this capability, customers can optimize Amazon SageMaker HyperPod cluster utilization on Amazon Elastic Kubernetes Service (Amazon EKS), distribute fair usage, and support efficient resource allocation across different teams or projects. For more information, see HyperPod task governance best practices for maximizing the value of SageMaker HyperPod task governance.
Compute quota management is an administrative mechanism that sets and controls compute resource limits across users, teams, and projects. It controls fair resource distribution, preventing a single entity from monopolizing cluster resources, thereby optimizing overall computational efficiency.
Because of budget constraints, customers might want to allocate compute resources across multiple teams fairly. For example, a data scientist might need some GPUs (for example, four H100 GPUs) for model development, but not the entire instance’s compute capacity. In other cases, customers have limited compute resources but many teams, and they want to fairly share compute resources across these teams, so that no idle capacity is left unused.
With HyperPod task governance, administrators can now allocate granular GPU, vCPU, and vCPU memory to teams and projects—in addition to the entire instance resources—based on their preferred strategy. Key capabilities include GPU-level quota allocation by instance type and family, or hardware type—supporting both Trainium and NVIDIA GPUs—and optional CPU and memory allocation for fine-tuned resource control. Administrators can also define the weight (or priority level) a team is given for fair-share idle compute allocation.

“With a wide variety of frontier AI data experiments and production pipelines, being able to maximize SageMaker HyperPod Cluster utilization is extremely high impact. This requires fair and controlled access to shared resources like state-of-the-art GPUs, granular hardware allocation, and more. This is exactly what HyperPod task governance is built for, and we’re excited to see AWS pushing efficient cluster utilization for a variety of AI use cases.”
– Daniel Xu, Director of Product at Snorkel AI, whose AI data technology platform empowers enterprises to build specialized AI applications by leveraging their organizational expertise at scale.

In this post, we dive deep into how to define quotas for teams or projects based on granular or instance-level allocation. We discuss different methods to define such policies, and how data scientists can schedule their jobs seamlessly with this new capability.
Solution overview
Prerequisites
To follow the examples in this blog post, you need to meet the following prerequisites:

An AWS account with access to SageMaker HyperPod.
A running SageMaker HyperPod (EKS-orchestrated) cluster. For more information on how to create and configure a new HyperPod cluster, see the HyperPod workshop or SageMaker HyperPod cluster creation with Amazon EKS orchestration.
HyperPod task governance add-on version 1.3 or later installed in the cluster. For more information, see Set up HyperPod task governance.

To schedule and execute the example jobs in the Submitting Tasks section, you will also need:

A local environment (either your local machine or a cloud-based compute environment), from which to run the HyperPod CLI and kubectl commands, configured as follows:

OS based on Linux or MacOS
Python 3.8, 3.9, 3.10, or 3.11 installed
AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the above services
HyperPod CLI version 3.1.0
Kubernetes command-line tool, kubectl

HyperPod Training Operator installed in the cluster

Allocating granular compute and memory quota using the AWS console
Administrators are the primary persona interacting with SageMaker HyperPod task governance and are responsible for managing cluster compute allocation in alignment with the organization’s strategic priorities and goals.
Implementing this feature follows the familiar compute allocation creation workflow of HyperPod task governance. To get started, sign in to the AWS Management Console and navigate to Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console. After selecting your HyperPod cluster, select the Policies tab in the cluster detail page. Navigate to Compute allocations and choose Create.

As with existing functionality, you can enable task prioritization and fair-share resource allocation through cluster policies that prioritize critical workloads and distribute idle compute across teams. By using HyperPod task governance, you can define queue admission policies (first-come-first-serve by default or task ranking) and idle compute allocation methods (first-come-first-serve or fair-share by default). In the Compute allocation section, you can create and edit allocations to distribute resources among teams, enable lending and borrowing of idle compute, configure preemption of low-priority tasks, and assign fair-share weights.
The key innovation is in the Allocations section shown in the following figure, where you’ll now find fine-grained options for resource allocation. In addition to the existing instance-level quotas, you can now directly specify GPU quotas by instance type and family or by hardware type. When you define GPU allocations, HyperPod task governance intelligently calculates appropriate default values for vCPUs and memory, which are set proportionally.
For example, when allocating 2 GPUs from a single p5.48xlarge instance (which has 8 GPUs, 192 vCPUs, and 2 TiB memory) in your HyperPod cluster, HyperPod task governance assigns 48 vCPUs and 512 GiB memory as default values—which is equivalent to one quarter of the instance’s total resources. Similarly, if your HyperPod cluster contains 2 ml.g5.2xlarge instances (each with 1 GPU, 8 vCPUs, and 32 GiB memory), allocating 2 GPUs would automatically assign 16 vCPUs and 64 GiB memory from both instances as shown in the following image.
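The proportional defaults in both examples follow from simple arithmetic: the requested GPU count is divided by the total GPU count, and that fraction is applied to the vCPUs and memory. The short sketch below reproduces the numbers above; it illustrates the arithmetic only and is not the service’s actual implementation, which may round or cap values differently.

def default_allocation(requested_gpus, total_gpus, total_vcpus, total_memory_gib):
    """Scale vCPU and memory defaults by the fraction of GPUs requested."""
    fraction = requested_gpus / total_gpus
    return int(total_vcpus * fraction), int(total_memory_gib * fraction)

# p5.48xlarge: 8 GPUs, 192 vCPUs, 2 TiB (2048 GiB) memory -> 2 GPUs yields 48 vCPUs, 512 GiB
print(default_allocation(2, 8, 192, 2048))   # (48, 512)

# Two ml.g5.2xlarge: 2 GPUs, 16 vCPUs, 64 GiB in total -> 2 GPUs yields 16 vCPUs, 64 GiB
print(default_allocation(2, 2, 16, 64))      # (16, 64)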

You can either proceed with these automatically calculated default values or customize the allocation by manually adjusting the vCPUs and vCPU memory fields as seen in the following image.

Amazon SageMaker HyperPod supports clusters that include CPU-based instances, GPU-based instances, and AWS Neuron-based hardware (AWS Inferentia and AWS Trainium chips). You can specify resource allocation for your team by instances, GPUs, vCPUs, vCPU memory, or Neuron devices, as shown in the following image.

Quota allocation can exceed current capacity. Resources added to the compute allocation policy that aren’t currently available in the cluster represent planning for future capacity upgrades. Jobs that require these unprovisioned resources will be automatically queued and remain in a pending state until the necessary resources become available. It’s important to understand that in SageMaker HyperPod, compute allocations function as quotas, which are verified during workload scheduling to determine whether a workload should be admitted, regardless of actual capacity availability. When resource requests are within these defined allocation limits and current utilization, the Kubernetes scheduler (kube-scheduler) handles the actual distribution and placement of pods across the HyperPod cluster nodes.
Allocating granular compute and memory quota using AWS CLI
You can also create or update compute quotas using the AWS CLI. The following is an example for creating a compute quota with only GPU count specification using the AWS CLI:

aws sagemaker create-compute-quota \
    --region <aws_region> \
    --name "only-gpu-quota" \
    --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
    --description "test description" \
    --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \
    --activation-state "Enabled" \
    --compute-quota-target "TeamName=onlygputeam2,FairShareWeight=10"

Compute quotas can also be created with mixed quota types, including a certain number of instances and granular compute resources, as shown in the following example:

aws sagemaker create-compute-quota \
    --region <aws_region> \
    --name "mix-quota-type" \
    --cluster-arn "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>" \
    --description "Mixed quota allocation" \
    --compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}, {InstanceType=ml.p5.48xlarge,Count=3}, {InstanceType=ml.c5.2xlarge,VCpu=2}],ResourceSharingConfig={Strategy=LendAndBorrow,BorrowLimit=10}" \
    --activation-state "Enabled" \
    --compute-quota-target "TeamName=mixquotatype,FairShareWeight=10"
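The same quotas can also be created programmatically. The sketch below uses boto3 and assumes the SageMaker client exposes a create_compute_quota operation that mirrors the CLI command above; parameter shapes follow the describe-compute-quota output shown later in this post, so verify names against the current SDK documentation before use.

import boto3

sm = boto3.client("sagemaker", region_name="<aws_region>")

# Assumed to mirror `aws sagemaker create-compute-quota`; check the SDK reference for exact fields.
response = sm.create_compute_quota(
    Name="only-gpu-quota",
    Description="test description",
    ClusterArn="arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>",
    ComputeQuotaConfig={
        "ComputeQuotaResources": [{"InstanceType": "ml.g6.12xlarge", "Accelerators": 2}],
        "ResourceSharingConfig": {"Strategy": "LendAndBorrow", "BorrowLimit": 10},
    },
    ComputeQuotaTarget={"TeamName": "onlygputeam2", "FairShareWeight": 10},
    ActivationState="Enabled",
)
print(response["ComputeQuotaArn"])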

HyperPod task governance deep dive
SageMaker HyperPod task governance enables allocation of GPU, CPU, and memory resources by integrating with Kueue, a Kubernetes-native system for job queueing.
Kueue doesn’t replace existing Kubernetes scheduling components, but rather integrates with the kube-scheduler, such that Kueue decides whether a workload should be admitted based on the resource quotas and current utilization, and then the kube-scheduler takes care of pod placement on the nodes.
When a workload requests specific resources, Kueue selects an appropriate resource flavor based on availability, node affinity, and job priority. The scheduler then injects the corresponding node labels and tolerations into the PodSpec, allowing Kubernetes to place the pod on nodes with the requested hardware configuration. This supports precise resource governance and efficient allocation for multi-tenant clusters.
When a SageMaker HyperPod task governance compute allocation is created, Kueue creates ClusterQueues that define resource quotas and scheduling policies, along with ResourceFlavors for the selected instance types with their unique resource characteristics.
For example, the following compute allocation policy allocates ml.g6.12xlarge instances with 2 GPUs and 48 vCPUs to the onlygputeam team, implementing a LendAndBorrow strategy with an up to 50% borrowing limit. This configuration enables flexible resource sharing while maintaining priority through a fair share weight of 10 and the ability to preempt lower priority tasks from other teams.

aws sagemaker describe-compute-quota \
    --region <aws_region> \
    --compute-quota-id <compute_quota_id>

#output
{
    "ComputeQuotaArn": "arn:aws:sagemaker:<aws_region>:<account_id>:compute-quota/<compute_quota_id>",
    "ComputeQuotaId": "<compute_quota_id>",
    "Name": "only-gpu-quota",
    "Description": "Only GPU quota allocation",
    "ComputeQuotaVersion": 1,
    "Status": "Created",
    "ClusterArn": "arn:aws:sagemaker:<aws_region>:<account_id>:cluster/<cluster_id>",
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [
            {
                "InstanceType": "ml.g6.12xlarge",
                "Accelerators": 2,
                "VCpu": 48.0
            }
        ],
        "ResourceSharingConfig": {
            "Strategy": "LendAndBorrow",
            "BorrowLimit": 50
        },
        "PreemptTeamTasks": "LowerPriority"
    },
    "ComputeQuotaTarget": {
        "TeamName": "onlygputeam",
        "FairShareWeight": 10
    },
    "ActivationState": "Enabled",
    "CreationTime": "2025-07-24T11:12:12.021000-07:00",
    "CreatedBy": {},
    "LastModifiedTime": "2025-07-24T11:15:45.205000-07:00",
    "LastModifiedBy": {}
}

The corresponding Kueue ClusterQueue is configured with the ml.g6.12xlarge flavor, providing quotas for 2 NVIDIA GPUs, 48 CPU cores, and 192 Gi memory.

kubectl describe clusterqueue hyperpod-ns-onlygputeam-clusterqueue

# output
Name:         hyperpod-ns-onlygputeam-clusterqueue
Namespace:
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         ClusterQueue
Metadata:
  …
Spec:
  Cohort:  shared-pool
  Fair Sharing:
    Weight:  10
  Flavor Fungibility:
    When Can Borrow:   TryNextFlavor
    When Can Preempt:  TryNextFlavor
  Namespace Selector:
    Match Labels:
      kubernetes.io/metadata.name:  hyperpod-ns-onlygputeam
  Preemption:
    Borrow Within Cohort:
      Policy:               LowerPriority
    Reclaim Within Cohort:  Any
    Within Cluster Queue:   LowerPriority
  Queueing Strategy:        BestEffortFIFO
  Resource Groups:
    Covered Resources:
      nvidia.com/gpu
      aws.amazon.com/neurondevice
      cpu
      memory
      vpc.amazonaws.com/efa
    Flavors:
      Name:  ml.g6.12xlarge
      Resources:
        Borrowing Limit:  1
        Name:             nvidia.com/gpu
        Nominal Quota:    2
        Borrowing Limit:  0
        Name:             aws.amazon.com/neurondevice
        Nominal Quota:    0
        Borrowing Limit:  24
        Name:             cpu
        Nominal Quota:    48
        Borrowing Limit:  96Gi
        Name:             memory
        Nominal Quota:    192Gi
        Borrowing Limit:  0
        Name:             vpc.amazonaws.com/efa
        Nominal Quota:    1
    …

A Kueue LocalQueue is also created, referencing the corresponding ClusterQueue. The LocalQueue acts as the namespace-scoped resource through which users submit workloads, and these workloads are then admitted and scheduled according to the quotas and policies defined in the ClusterQueue.

kubectl describe localqueue hyperpod-ns-onlygputeam-localqueue -n hyperpod-ns-onlygputeam

# output
Name:         hyperpod-ns-onlygputeam-localqueue
Namespace:    hyperpod-ns-onlygputeam
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  <none>
API Version:  kueue.x-k8s.io/v1beta1
Kind:         LocalQueue
Metadata:
    …
Spec:
  Cluster Queue:  hyperpod-ns-onlygputeam-clusterqueue
  Stop Policy:    None
Status:
  Admitted Workloads:  0

Submitting tasks
There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: the SageMaker HyperPod CLI and the Kubernetes command-line tool, kubectl. With both options, data scientists need to reference their team’s namespace and task priority class—in addition to the requested GPU and vCPU compute and memory resources—to use their granular allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, then SageMaker HyperPod task governance will automatically assume the lowest priority. The specific GPU type comes from an instance type selection, because data scientists want to use GPUs with certain capabilities (for example, H100 instead of H200) to perform their tasks efficiently.
HyperPod CLI
The HyperPod CLI was created to abstract the complexities of working with kubectl so that developers using SageMaker HyperPod can iterate faster with custom commands. The following is an example of a job submission with the HyperPod CLI requesting both compute and memory resources:

hyp create hyp-pytorch-job \
    --job-name sample-job1 \
    --image <account_id>.dkr.ecr.<aws_region>.amazonaws.com/<image_name>:<tag> \
    --pull-policy "Always" \
    --tasks-per-node 1 \
    --max-retry 1 \
    --priority high-priority \
    --namespace hyperpod-ns-team1 \
    --queue-name hyperpod-ns-team1-localqueue \
    --instance-type ml.g5.8xlarge \
    --accelerators 1 \
    --vcpu 4 \
    --memory 1 \
    --accelerators-limit 1 \
    --vcpu-limit 5 \
    --memory-limit 2

The accelerators, vcpu, and memory parameters (and their corresponding limits) enable requesting granular compute and memory resources. The HyperPod CLI requires the HyperPod Training Operator to be installed in the cluster and a container image that includes the HyperPod Elastic Agent. For instructions on how to build such a container image, refer to the HyperPod Training Operator documentation.
For more information on the supported HyperPod CLI arguments and related description, see the SageMaker HyperPod CLI reference documentation.
Kubectl
The following is an example of a kubectl command to submit a job to the HyperPod cluster using the specified queue. This is a simple example of a PyTorch job that checks for GPU availability and then sleeps briefly. Compute and memory resources are requested using the standard Kubernetes resource management constructs.

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: hyperpod-ns-team1
spec:
  parallelism: 1
  completions: 1
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team1-localqueue
        kueue.x-k8s.io/priority-class: high-priority
    spec:
      containers:
        - name: training-container
          image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
          command:
            - "python"
            - "-c"
            - "import torch; print('GPU available:', torch.cuda.is_available()); import time; time.sleep(15)"
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "1Gi"
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

Sample commands

Following is a short reference guide for helpful commands when interacting with SageMaker HyperPod task governance:

Describing cluster policy with the AWS CLI – This AWS CLI command is useful for viewing the cluster policy settings for your cluster.
List compute quota allocations with the AWS CLI – Use this AWS CLI command to view the different teams and set up task governance and their respective quota allocation settings.
HyperPod CLI – The HyperPod CLI abstracts common kubectl commands used to interact with SageMaker HyperPod clusters such as submitting, listing, and cancelling tasks. See the SageMaker HyperPod CLI reference documentation for a full list of commands.
kubectl – You can also use kubectl to interact with task governance; some useful commands are:

kubectl get workloads -n hyperpod-ns-<team-name>
kubectl describe workload <workload-name> -n hyperpod-ns-<team-name>
These commands show the workloads running in your cluster per namespace and provide detailed reasoning on Kueue admission decisions. You can use them to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”
Common scenarios
A common use case for more granular allocation of GPU compute is fine-tuning small and medium sized large language models (LLMs). A single H100 or H200 GPU might be sufficient to address such a use case (also depending on the chosen batch size and other factors), and machine learning (ML) platform administrators can choose to allocate a single GPU to each data scientist or ML researcher to optimize the utilization of an instance like ml.p5.48xlarge, which comes with 8 H100 GPUs onboard.
Small language models (SLMs) have emerged as a significant advancement in generative AI, offering lower latency, decreased deployment costs, and enhanced privacy capabilities while maintaining impressive performance on targeted tasks, making them increasingly vital for agentic workflows and edge computing scenarios. The new SageMaker HyperPod task governance with fine-grained GPU, CPU, and memory allocation significantly enhances SLM development by enabling precise matching of resources to model requirements, allowing teams to efficiently run multiple experiments concurrently with different architectures. This resource optimization is particularly valuable as organizations develop specialized SLMs for domain-specific applications, with priority-based scheduling so that critical model training jobs receive resources first while maximizing overall cluster utilization. By providing exactly the right resources at the right time, HyperPod accelerates the development of specialized, domain-specific SLMs that can be deployed as efficient agents in complex workflows, enabling more responsive and cost-effective AI solutions across industries.
With the growing popularity of SLMs, organizations can use granular quota allocation to create targeted quota policies that prioritize GPU resources, addressing the budget-sensitive nature of ML infrastructure where GPUs represent the most significant cost and performance factor. Organizations can now selectively apply CPU and memory limits where needed, creating a granular resource management approach that efficiently supports diverse machine learning workloads regardless of model size.
Similarly, to support inference workloads, multiple teams might not require an entire instance to deploy their models, helping to avoid having entire instances equipped with multiple GPUs allocated to each team and leaving GPU compute sitting idle.
Finally, during experimentation and algorithm development, data scientists and ML researchers can choose to deploy a container hosting their preferred IDE on HyperPod, like JupyterLab or Code-OSS (Visual Studio Code open source). In this scenario, they often experiment with smaller batch sizes before scaling to multi-GPU configurations, hence not needing entire multi-GPU instances to be allocated. Similar considerations apply to CPU instances; for example, an ML platform administrator might decide to use CPU instances for IDE deployment, because data scientists prefer to scale their training or fine-tuning with jobs rather than experimenting with the local IDE compute. In such cases, depending on the instances of choice, partitioning CPU cores across the team might be beneficial.
Conclusion
The introduction of fine-grained compute quota allocation in SageMaker HyperPod represents a significant advancement in ML infrastructure management. By enabling GPU-level resource allocation alongside instance-level controls, organizations can now precisely tailor their compute resources to match their specific workloads and team structures.
This granular approach to resource governance addresses critical challenges faced by ML teams today, balancing budget constraints, maximizing expensive GPU utilization, and ensuring fair access across data science teams of all sizes. Whether fine-tuning SLMs that require single GPUs, running inference workloads with varied resource needs, or supporting development environments that don’t require full instance power, this flexible capability helps ensure that no compute resources sit idle unnecessarily.
ML workloads continue to diversify in their resource requirements and SageMaker HyperPod task governance now provides the adaptability organizations need to optimize their GPU capacity investments. To learn more, visit the SageMaker HyperPod product page and HyperPod task governance documentation.
Give this a try in the Amazon SageMaker AI console and leave your comments here.

About the authors
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Zhenshan Jin is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes engineering team efficiency and productivity.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in the data analytics domain.

Build and scale adoption of AI agents for education with Strands Agent …

Basic AI chat isn’t enough for most business applications. Institutions need AI that can pull from their databases, integrate with their existing tools, handle multi-step processes, and make decisions independently.
This post demonstrates how to quickly build sophisticated AI agents using Strands Agents, scale them reliably with Amazon Bedrock AgentCore, and make them accessible through LibreChat’s familiar interface to drive immediate user adoption across your institution.
Challenges with basic AI chat interfaces
Although basic AI chat interfaces can answer questions and generate content, educational institutions need capabilities that simple chat can’t provide:

Contextual decision-making – A student asking “What courses should I take?” needs an agent that can access their transcript, check prerequisites, verify graduation requirements, and consider schedule conflicts—not just generic course descriptions
Multi-step workflows – Degree planning requires analyzing current progress, identifying remaining requirements, suggesting course sequences, and updating recommendations as students make decisions
Institutional data integration – Effective educational AI must connect to student information systems, learning management services, academic databases, and institutional repositories to provide relevant, personalized guidance
Persistent memory and learning – Agents need to remember previous interactions with students, track their academic journey over semesters, and build understanding of individual learning patterns and needs

Combining open source flexibility with enterprise infrastructure
The integration presented in this post demonstrates how three technologies can work together to address these challenges:

Strands Agents – Build sophisticated multi-agent workflows in just a few lines of code
Amazon Bedrock AgentCore – Scale agents reliably with serverless, pay-per-use deployment
LibreChat – Provide users with a familiar chat interface that drives immediate adoption

Strands Agents overview
Strands Agents is an open source SDK that takes a model-driven approach to building and running AI agents in just a few lines of code. Unlike LibreChat’s simple agent implementation, Strands supports sophisticated patterns including multi-agent orchestration through workflow, graph, and swarm tools; semantic search for managing thousands of tools; and advanced reasoning capabilities with deep analytical thinking cycles. The framework simplifies agent development by embracing the capabilities of state-of-the-art models to plan, chain thoughts, call tools, and reflect, while scaling from local development to production deployment with flexible architectures and comprehensive observability.
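To give a sense of what “a few lines of code” looks like, the following is a minimal Strands agent sketch. The Agent import and call pattern follow the SDK’s basic usage; the tool definition, system prompt, and catalog data are illustrative assumptions, and an institution would typically rely on Amazon Bedrock credentials configured in its environment for model access.

from strands import Agent, tool

@tool
def check_prerequisites(course_code: str) -> str:
    """Look up prerequisites for a course (stubbed here; a real agent would query the SIS)."""
    catalog = {"CS301": "CS201 and MATH150"}  # hypothetical data for illustration
    return catalog.get(course_code, "No prerequisites found")

# The model provider is resolved from the environment (for example, Amazon Bedrock credentials).
advisor = Agent(
    system_prompt="You are a course-planning assistant for university students.",
    tools=[check_prerequisites],
)

result = advisor("Can I take CS301 next semester if I have only completed CS201?")
print(result)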
Amazon Bedrock AgentCore overview
Amazon Bedrock AgentCore is a comprehensive set of enterprise-grade services that help developers quickly and securely deploy and operate AI agents at scale using the framework and model of your choice, hosted on Amazon Bedrock or elsewhere. The services are composable and work with popular open source frameworks and many models, so you don’t have to choose between open source flexibility and enterprise-grade security and reliability.
Amazon Bedrock AgentCore includes modular services that can be used together or independently: Runtime (secure, serverless runtime for deploying and scaling dynamic agents), Gateway (converts APIs and AWS Lambda functions into agent-compatible tools), Memory (manages both short-term and long-term memory), Identity (provides secure access management), and Observability (offers real-time visibility into agent performance).
The key Amazon Bedrock AgentCore service used in this integration is Amazon Bedrock AgentCore Runtime, a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools using an open source framework including LangGraph, CrewAI, and Strands Agents; a protocol; and a model of your choosing. Amazon Bedrock AgentCore Runtime was built to work for agentic workloads with industry-leading extended runtime support, fast cold starts, true session isolation, built-in identity, and support for multimodal payloads. Rather than the typical serverless model where functions spin up, execute, and immediately terminate, Amazon Bedrock AgentCore Runtime provisions dedicated microVMs that can persist for up to 8 hours, enabling sophisticated multi-step agentic workflows where each subsequent call builds upon the accumulated context and state from previous interactions within the same session.
LibreChat overview
LibreChat has emerged as a leading open source alternative to commercial AI chat interfaces, offering educational institutions a powerful solution for deploying conversational AI at scale. Built with flexibility and extensibility in mind, LibreChat provides several key advantages for higher education:

Multi-model support – LibreChat supports integration with multiple AI providers, so institutions can choose the most appropriate models for different use cases while avoiding vendor lock-in
User management – Robust authentication and authorization systems help institutions manage access across student populations, faculty, and staff with appropriate permissions and usage controls
Conversation management – Students and faculty can organize their AI interactions into projects and topics, creating a more structured learning environment
Customizable interface – The solution can be branded and customized to match institutional identity and specific pedagogical needs

Integration benefits
Integrating Strands Agents with Amazon Bedrock AgentCore and LibreChat creates unique benefits that extend the capabilities of both services far beyond what either could achieve independently:

Seamless agent experience through familiar interface – LibreChat’s intuitive chat interface becomes a gateway to sophisticated agentic workflows. Users can trigger complex multi-step processes, data analysis, and external system integrations through natural conversation, without needing to learn new interfaces or complex APIs.
Dynamic agent loading and management – Unlike static AI chat implementations, this integration supports dynamic agent loading with access management. New agentic applications can be deployed separately and made available to users without requiring LibreChat updates or downtime, enabling rapid agent development.
Enterprise-grade security and scaling – Amazon Bedrock AgentCore Runtime provides complete session isolation for each user session, where each session runs with isolated CPU, memory, and filesystem resources. This creates complete separation between user sessions, safeguarding stateful agent reasoning processes and helping prevent cross-session data contamination. The service can scale up to thousands of agent sessions in seconds while developers only pay for actual usage, making it ideal for educational institutions that need to support large student populations with varying usage patterns.
Built-in AWS resource integration – Organizations already running infrastructure on AWS can seamlessly connect their existing resources—databases, data lakes, Lambda functions, and applications—to Strands Agents without complex integrations or data movement. Agents can directly access and surface insights through the LibreChat interface, turning existing AWS investments into intelligent, conversational experiences, such as querying an Amazon Relational Database Service (Amazon RDS) database, analyzing data in Amazon Simple Storage Service (Amazon S3), or integrating with existing microservices.
Cost-effective agentic computing – By using LibreChat’s efficient architecture with the Amazon Bedrock AgentCore pay-per-use model, organizations can deploy sophisticated agentic applications without the high fixed costs typically associated with enterprise AI systems. Users only pay for actual agent computation and tool usage.

Agent use cases in higher education settings
The integration of LibreChat with Strands Agents enables numerous educational applications that demonstrate the solution’s versatility and power:

A course recommendation agent can analyze a student’s academic history, current enrollment, and career interests to suggest relevant courses. By integrating with the student information system, the agent can make sure recommendations consider prerequisites, schedule conflicts, and graduation requirements.
A degree progress tracking agent can interact with students and help them understand their specific degree requirements and provide guidance on remaining coursework, elective options, and timeline optimization.
Agents can be configured with access to academic databases and institutional repositories, helping students and faculty discover relevant research papers and resources, providing guidance on academic writing, citation formats, and research methodology specific to different disciplines.
Agents can handle routine student inquiries about registration, deadlines, and campus resources, freeing up staff time for more complex student support needs.

Refer to the following GitHub repo for Strands Agent code examples for educational use cases.
Solution overview
The following architecture diagram illustrates the overall system design for deploying LibreChat with Strands Agents integration. Strands Agents is deployed on Amazon Bedrock AgentCore Runtime, the secure, serverless runtime described earlier, which supports open source frameworks such as Strands Agents.

The solution architecture includes several key components:

LibreChat core services – The core chat interface runs in an Amazon Elastic Container Service (Amazon ECS) cluster on AWS Fargate, including LibreChat for the user-facing experience, Meilisearch for enhanced search capabilities, and Retrieval Augmented Generation (RAG) API services for document retrieval
LibreChat supporting infrastructure – This solution uses Amazon Elastic File System (Amazon EFS) for storing Meilisearch's indexes and user-uploaded files; Amazon Aurora PostgreSQL-Compatible Edition as the vector database used by the RAG API; Amazon S3 for storing LibreChat configurations; Amazon DocumentDB for user, session, and conversation data management; and AWS Secrets Manager for managing access to these resources
Strands Agents integration – This solution integrates Strands Agents (hosted on Amazon Bedrock AgentCore Runtime) with LibreChat through custom endpoints using Lambda and Amazon API Gateway. This integration pattern enables dynamic loading of agents in LibreChat for advanced generative AI capabilities. In particular, the solution showcases a user activity analysis agent that draws insights from LibreChat logs
Authentication and security – The integration between LibreChat and Strands Agents implements a multi-layered authentication approach that maintains security without compromising user experience or administrative simplicity. When a student or faculty member selects a Strands Agent from LibreChat’s interface, the authentication flow operates seamlessly in the background through several coordinated layers:

User authentication – LibreChat handles user login through your institution’s existing authentication system, with comprehensive options including OAuth, LDAP/AD, or local accounts as detailed in the LibreChat authentication documentation.
API Gateway security – After users are authenticated to LibreChat, the system automatically handles API Gateway security by authenticating each request using preconfigured API keys.
Service-to-service authentication – The underlying Lambda function uses AWS Identity and Access Management (IAM) roles to securely invoke Amazon Bedrock AgentCore Runtime where the Strands Agent is deployed.
Resource access control – Strands Agents operate within defined permissions to access only authorized resources.
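To make the API Gateway and service-to-service steps above concrete, here is a minimal, hedged sketch of the Lambda layer in this flow: API Gateway has already enforced the API key, the function runs under an IAM role that permits invoking the AgentCore runtime, and it translates between LibreChat's OpenAI-style request body and the agent call. The helper name and field names are illustrative, not the solution's actual code.

# Hypothetical Lambda handler sketch for the API Gateway -> agent hop.
# API Gateway validates the x-api-key header before this runs; the function's IAM
# role grants permission to invoke the AgentCore runtime. All names are illustrative.
import json

def invoke_strands_agent(prompt: str) -> str:
    # Placeholder: in the deployed solution this wraps the AgentCore Runtime invocation.
    return f"(agent answer for: {prompt})"

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    # LibreChat custom endpoints send OpenAI-style chat payloads
    user_message = body.get("messages", [{}])[-1].get("content", "")

    answer = invoke_strands_agent(user_message)

    # Return an OpenAI-style (non-streaming) chat completion so LibreChat can render it
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "id": "chatcmpl-log-analysis",
            "object": "chat.completion",
            "choices": [{"index": 0,
                         "message": {"role": "assistant", "content": answer},
                         "finish_reason": "stop"}],
        }),
    }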

Deployment process
This solution uses the AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation to handle the deployment through several automated phases. We will use a log analysis agent as an example to demonstrate the deployment process. The agent makes it possible for the admin to perform LibreChat log analysis through natural language queries.
LibreChat is deployed as a containerized service with ECS Fargate clusters and is integrated with supporting services, including virtual private cloud (VPC) networking, Application Load Balancer (ALB), and the complete data layer with Aurora PostgreSQL-Compatible, DocumentDB, Amazon EFS, and Amazon S3 storage. Security is built in with appropriate IAM roles, security groups, and secrets management.
The user activity analysis agent provides valuable insights into how students interact with AI tools, identifying peak usage times, popular topics, and potential areas where students might need additional support. The agent is automatically provisioned using the following CloudFormation template, which deploys Strands Agents using Amazon Bedrock AgentCore Runtime, provisions a Lambda function that invokes the agent, an API Gateway endpoint that exposes the agent as a URL, and a second Lambda function that accesses LibreChat logs stored in DocumentDB. The second Lambda function serves as a tool for the agent.
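For orientation, the agent side of this setup can look roughly like the following sketch, which uses the Strands Agents SDK's Agent and tool abstractions. The tool body calls a hypothetical query_librechat_logs helper standing in for the second Lambda that reads DocumentDB; the prompts and names are illustrative, not the deployed solution's code.

# Hedged sketch of a Strands Agents log-analysis agent with one tool.
# query_librechat_logs is a hypothetical stand-in for the DocumentDB-reading Lambda.
from strands import Agent, tool

def query_librechat_logs(days: int) -> str:
    # Placeholder: in the deployed solution this would invoke the Lambda that queries DocumentDB.
    return '{"active_users": 0, "conversations": 0, "days": %d}' % days

@tool
def get_librechat_activity(days: int = 7) -> str:
    """Return a JSON summary of LibreChat user activity for the last N days."""
    return query_librechat_logs(days)

log_analysis_agent = Agent(
    system_prompt="You analyze LibreChat usage logs and answer admin questions.",
    tools=[get_librechat_activity],
)

if __name__ == "__main__":
    print(log_analysis_agent("How active were students this week?"))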
The following code shows how to configure LibreChat to make the agent a custom endpoint:

custom:
  - name: 'log-analysis-assitant'
    apiKey: '{AWS_API_GATEWAY_KEY}'
    baseURL: '{AWS_API_GATEWAY_URL}'
    models:
      default: ['Strands Agent']
      fetch: false
    headers:
      x-api-key: '{AWS_API_GATEWAY_KEY}'
    titleConvo: true
    titleModel: 'us.amazon.nova-lite-v1:0'
    modelDisplayLabel: 'log-analysis-assitant'
    forcePrompt: false
    stream: false
    iconURL: 'https://d1.awsstatic.com/onedam/marketing-channels/website/aws/en_US/product-categories/ai-ml/machine-learning/approved/images/256f3da1-3193-441c-b2641f33fdd6.a045b9b4c4f34545e1c79a405140ac0146699835.jpeg'

After the stack is deployed successfully, you can log in to LibreChat, select the agent, and start chatting. The following screenshot shows an example question that the user activity analysis agent can help answer, where it reads the LibreChat user activities from DocumentDB and generates an answer.

Deployment considerations and best practices
When deploying this LibreChat and Strands Agents integration, organizations should carefully consider several key factors that can significantly impact both the success of the implementation and its long-term sustainability.
Security and compliance form the foundation of any successful deployment, particularly in educational environments where data protection is paramount. Organizations must implement robust data classification schemes to maintain appropriate handling of sensitive information, and role-based access controls make sure users only access AI capabilities and data appropriate to their roles. Beyond traditional perimeter security, a layered authorization approach becomes critical when deploying AI systems that might access multiple data sources with varying sensitivity levels. This involves implementing multiple authorization checks throughout the application stack, including service-to-service authorization, trusted identity propagation that carries the end-user’s identity through the system components, and granular access controls that evaluate permissions at each data access point rather than relying solely on broad service-level permissions. Such layered security architectures help mitigate risks like prompt injection vulnerabilities and unauthorized cross-tenant data access, making sure that even if one security layer is compromised, additional controls remain in place to protect sensitive educational data. Regular compliance monitoring becomes essential, with automated audits and checks maintaining continued adherence to relevant data protection regulations throughout the system’s lifecycle, while also validating that layered authorization policies remain effective as the system evolves.
Cost management requires a strategic approach that balances functionality with financial sustainability. Organizations must prioritize their generative AI spending based on business impact and criticality while maintaining cost transparency across customer and user segments. Implementing comprehensive usage monitoring helps organizations track AI service consumption patterns and identify optimization opportunities before costs become problematic.
The human element of deployment often proves more challenging than the technical implementation. Faculty training programs should provide comprehensive guidance on integrating AI tools into teaching practices, focusing not just on how to use the tools but on how to use them effectively for educational outcomes. Student onboarding requires clear guidelines and tutorials that promote both effective AI interaction and academic integrity. Perhaps most importantly, establishing continuous feedback loops makes sure the system evolves based on actual user experiences and measured educational outcomes rather than assumptions about what users need.
Successful deployments also require careful attention to the dynamic nature of AI technology. The architecture's support for dynamic agent loading enables organizations to add specialized agents for new departments or use cases without disrupting existing services. Version control systems should maintain different agent versions for testing and gradual rollout of improvements, and performance monitoring should track both technical metrics and user satisfaction to guide continuous improvement efforts.
Conclusion
The integration of LibreChat with Strands Agents represents a significant step forward in democratizing access to advanced AI capabilities in higher education. By combining the accessibility and customization of open source systems with the sophistication and reliability of enterprise-grade AI services, institutions can provide students and faculty with powerful tools that enhance learning, research, and academic success. This architecture demonstrates that educational institutions don't need to choose between powerful AI capabilities and institutional control. Instead, they can pair the innovation and flexibility of open source solutions with the scalability and reliability of cloud-based AI services. The integration example showcased in this post illustrates the solution's versatility and potential for customization as institutions expand and adapt the solution to meet evolving educational needs.
For future work, the LibreChat system’s Model Context Protocol (MCP) server integration capabilities offer exciting possibilities for enhanced agent architectures. A particularly promising avenue involves wrapping agents as MCP servers, transforming them into standardized tools that can be seamlessly integrated alongside other MCP-enabled agents. This approach would enable educators to compose sophisticated multi-agent workflows, creating highly personalized educational experiences tailored to individual learning styles.
The future of education is about having the right AI tools, properly integrated and ethically deployed, to enhance human learning and achievement through flexible, interoperable, and extensible solutions that can evolve with educational needs.
Acknowledgement
The authors extend their gratitude to Arun Thangavel, Ashish Rawat and Kosti Vasilakakis for their insightful feedback and review of the post.

About the authors
Dr. Changsha Ma is a Senior AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, and spending time with friends and families.
Sudheer Manubolu is a Solutions Architect at Amazon Web Services (AWS). He specializes in cloud architecture, enterprise solutions, and AI/ML implementations. He provides technical and architectural guidance to customers building transformative solutions on AWS, with particular emphasis on leveraging AWS’s AI/ML and container services to drive innovation and operational excellence.
Abhilash Thallapally is a Solutions Architect at AWS helping public sector customers design and build scalable AI/ML solutions using Amazon SageMaker. His work covers a wide range of ML use cases, with a primary interest in computer vision, Generative AI, IoT, and deploying cost optimized solutions on AWS.
Mary Strain leads strategy for artificial intelligence and machine learning for US education at AWS. Mary began her career as a middle school teacher in the Bronx, NY. Since that time, she has held leadership roles in education and public sector technology organizations. Mary has advised K12, higher education, and state and local government on innovative policies and practices in competency based assessment, curriculum design, micro credentials and workforce development initiatives. As an advisor to The Education Design Studio at The University of Pennsylvania, The Coalition of Schools Educating Boys of Color, and The State of NJ AI Task Force, Mary has been on the leading edge of bringing innovative solutions to education for two decades.

Skai uses Amazon Bedrock Agents to significantly improve customer insi …

This post was written with Lior Heber and Yarden Ron of Skai.
Skai (formerly Kenshoo) is an AI-driven omnichannel advertising and analytics platform designed for brands and agencies to plan, launch, optimize, and measure paid media across search, social, retail media marketplaces and other “walled-garden” channels from a single interface. By unifying data from over 100 publishers and retail networks, Skai applies real-time analytics, predictive modeling, and incremental testing to surface budget and bidding recommendations, connect media spend to sales outcomes, and reduce channel silos, giving marketers full-funnel visibility and higher return on ad spend at scale.
Skai recognized that our customers were spending days (sometimes weeks) manually preparing reports, struggling to query complex datasets, and lacking intuitive visualization tools. Traditional analytics platforms required technical expertise, leaving many users overwhelmed by untapped data potential. But through the partnership with AWS and the adoption of Amazon Bedrock Agents (AI assistants that can autonomously perform complex, multi-step tasks by orchestrating calls to APIs), we've redefined what's possible. Now, customers can analyze their data in natural language, generate reports in minutes instead of days, and visualize insights through natural language conversation.
In this post, we share how Skai used Amazon Bedrock Agents to improve data access and analysis and improve customer insights.
Challenges with data analytics
Before adopting Amazon Bedrock Agents, Skai’s customers accessed their data through tables, charts, and predefined business questions. Campaign manager teams, looking to do deep research on their data, would spend around 1.5 days a week preparing static reports, while individual users struggled to connect the dots between their massive amount of data points. Critical business questions, like where should a client spend their time optimizing campaigns, and how, remained hidden in unstructured knowledge and siloed data points.
We identified three systematic challenges:

Time-consuming report generation – Grids display flat and grouped data at specific entity levels, like campaigns, ads, products, and keywords. However, gaining a comprehensive understanding by connecting these different entities and determining relevant time frames is time-consuming. Users must manipulate raw data to construct a complete narrative.
Summarization – Analyzing extracted raw data posed significant challenges in understanding, identifying key patterns, summarizing complex datasets, and drawing insightful conclusions. Users lacked intuitive tools to dynamically explore data dimensions, hindering their ability to gain a holistic view and extract crucial insights for informed decisions.
Recommendations – Presenting data-driven recommendations to stakeholders with varying understanding requires deep data analysis, anticipating perspectives, and clear, persuasive communication to demonstrate ROI and facilitate informed decisions.

How Celeste powered transformation
To address the challenges of time-consuming report generation, the difficulty in summarizing complex data, and the need for data-driven recommendations, Skai used AWS to build Celeste, a generative AI agent. With AI agents, users can ask questions in natural language, and the agent automatically collects data from multiple sources, synthesizes it into a cohesive narrative with actionable insights, and provides data-oriented recommendations.
The Skai Platform absorbs an enormous amount of data about product searches across many retailers and traditional search engines. Sorting through this data can be time-consuming, but the capabilities in Celeste can make this type of exploratory research much easier.
Skai's solution leverages Amazon Bedrock Agents to create an AI-driven analytics assistant that transforms how users interact with complex advertising data. The system processes natural language queries like "Compare ad group performance across low-performing campaigns in Q1," eliminating the need for a database specialist. The agent automatically joins Skai's datasets from profiles, campaigns, ads, products, keywords, and search terms across multiple advertising publishers. Beyond simple data retrieval, the assistant generates comprehensive insights and case studies while providing actionable recommendations on campaign activity, complete with detailed analytical approaches and ready-to-present stakeholder materials.
For example, consider the following question: “I’m launching a new home security product and want to activate 3 new Sponsored Product campaigns and 2 new Sponsored Brand campaigns on Amazon. What high-performing keywords and their match types are already running in other campaigns that would be good to include in these new activations?”
When asked this question with real client data, Celeste answered quickly, finding a combination of branded and generic category terms that the manufacturer might consider for this new product launch. With just a few follow-up questions, Celeste was able to provide estimated CPCs, budgets, and a high-level testing plan for these hypothetical campaigns, complete with negative keywords to reduce unnecessary conflict with their existing campaigns.
This is a great example of an exploratory question that requires summary analysis, identification of trends and insights, and recommendations. Skai data directly supports these kinds of analyses, and the capabilities within Celeste give the agent the intelligence to provide smart recommendations. Amazon Bedrock makes this possible because it gives Celeste access to strong foundation models (FMs) without exposing clients to the risk of having those models' vendors use sensitive questions for purposes outside of supporting the client directly. Celeste reduces the time needed to build client case studies by 75% on average, transforming a process that often took weeks into one requiring only minutes.
Accelerating time-to-value through managed AI using Amazon Bedrock
One critical element of Skai’s success story was our deliberate choice of Amazon Bedrock as the foundational AI service. Unlike alternatives requiring extensive infrastructure setup and model management, Amazon Bedrock provided a frictionless path from concept to production.
The journey began with a simple question: How can we use generative AI to provide our clients a new and improved experience without building AI infrastructure from scratch? With Amazon Bedrock, Skai could experiment within hours and deliver a working proof of concept in days. The team could test multiple FMs (Anthropic’s Claude, Meta’s Llama, and Amazon Nova) without managing separate environments and iterate rapidly through Amazon Bedrock Agents.
One developer noted, “We went from whiteboard to a working prototype in a single sprint. With traditional approaches, we’d still be configuring infrastructure.”
With Amazon Bedrock Agents, Skai could prioritize customer value and rapid iteration over infrastructure complexity. The managed service minimized DevOps overhead for model deployment and scaling while alleviating the need for specialized ML expertise in FM tuning. This helped the team concentrate on data integration and customer-specific analytics patterns, using cost-effective on-demand models at scale while making sure client data remained private and secure. With Amazon Bedrock Agents, domain experts can focus exclusively on what matters most: translating customer data challenges into actionable insights.
Benefits of Amazon Bedrock Agents
The introduction of Amazon Bedrock Agents dramatically simplified Skai's architecture while reducing the need to build custom code. Built-in action groups replaced thousands of lines of custom integration code that would have required weeks of development time. The platform's native memory and session management capabilities meant the team could focus on business logic rather than infrastructure concerns. Declarative API definitions reduced integration time from weeks to hours. Additionally, the integrated code interpreter simplified the handling of mathematical operations and helped address accuracy and scale issues.
As a solution provider serving many customers, Skai considered security and compliance non-negotiable. Amazon Bedrock addressed these requirements by inheriting AWS's comprehensive compliance certifications, including HIPAA, SOC 2, and ISO 27001. Its commitment to not retaining customer data for model training proved critical for protecting sensitive customer information, while its seamless integration with existing AWS Identity and Access Management (IAM) policies and VPC configurations simplified deployment.
During every client demonstration of Celeste, initial inquiries consistently centered on privacy, security, and the protection of proprietary data. With an AWS infrastructure, Skai confidently assured clients that their data would not be used to train any models, effectively distinguishing Skai from its competitors. With the pay-as-you-go model, Skai scaled economically without upfront AI infrastructure investment. The team avoided costly commitments to GPU clusters or specialized instances, instead relying on automatic scaling based on actual usage patterns. This approach provided granular cost attribution to specific agents, allowing Skai to understand and optimize spending at a detailed level. The flexibility to select the most appropriate model for each specific task further optimized both performance and costs, ensuring resources aligned precisely with business needs.
AWS Enterprise Support as a strategic partner in AI innovation
Working with cutting-edge generative AI agents presents unique challenges that extend far beyond traditional technical support needs. When building Celeste, Skai encountered complex scenarios where solutions didn’t emerge as expected, from managing 200,000-token conversations to optimizing latency in multi-step agent workflows. AWS Enterprise Support proved invaluable as a strategic partner rather than just a support service.
AWS Enterprise Support provided dedicated Technical Account Management (TAM) and Solutions Architect (SA) services that went well beyond reactive problem-solving. Our TAM and SA became an extension of our engineering team, offering the following:

Regular architectural reviews to optimize our Amazon Bedrock Agents implementation
Proactive monitoring recommendations that helped us identify potential bottlenecks before they impacted customer experience
Direct access to AWS service teams when we needed deep technical expertise on the advanced features of Amazon Bedrock Agents
Strategic guidance and optimization as we scaled from prototype to production

When complex issues arose, such as our initial 90-second (or more) latency challenges or session management complexities, Enterprise Support provided immediate escalation paths and expert consultation.
This comprehensive support framework was instrumental in achieving our aggressive KPIs and time-to-market goals. The combination of proactive guidance, rapid issue resolution, and strategic partnership helped us achieve the following:

Reduce proof of concept to production timeline by 50%
Maintain 99.9% uptime during critical customer demonstrations
Scale confidently, knowing we had enterprise-grade support backing our innovation

Beyond troubleshooting, Enterprise Support provided the confidence and partnership necessary to build our product roadmap on emerging AI technologies, knowing AWS was fully committed to the success of Celeste.
Solution overview
The following diagram illustrates the solution architecture.

Our Amazon Bedrock Agent operates on several core components.
First, a custom layer comprises the following:

Customer Experience UI (CX UI) – The frontend interface that users interact with to submit questions and view responses
Chat Manager – Orchestrates the conversation flow, manages session state, and handles the communication between the UI and the processing layer
Chat Executor – Receives processed requests from Chat Manager, interfaces with the Amazon Bedrock agent, handles the business logic for determining when and how to invoke the agent, and manages the overall conversation workflow and short-term memory

Second, we used the following in conjunction with Amazon Bedrock:

Amazon Bedrock agent – An orchestrator that receives queries from Chat Executor, determines which tools to invoke based on the query, and manages the tool invocation process.
Anthropic’s Claude 3.5 Sonnet V2 – The FM that generates natural language responses. The model generates queries for the API and processes the structured data returned by tools. It creates coherent, contextual answers for users.

Finally, the data layer consists of the following:

Tool API – A custom API that receives tool invocation requests from the Amazon Bedrock agent and queries the customer data
Customer data – The data storage containing sensitive customer information that remains isolated from Amazon Bedrock

The solution also includes the following key security measures:

Data isolation is enforced between the Tool API and Amazon Bedrock agent
Raw customer data is never shared
Skai can maintain data privacy and compliance requirements
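To illustrate how a Chat Executor layer like the one above can hand a query to the Amazon Bedrock agent, here is a minimal sketch (not Skai's production code) using the boto3 bedrock-agent-runtime client's invoke_agent operation. The agent ID, alias ID, and session ID are placeholders.

# Minimal sketch: invoke an Amazon Bedrock agent and assemble its streamed completion.
# Replace the placeholder IDs with real values before running.
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

def run_agent(question: str, session_id: str) -> str:
    response = runtime.invoke_agent(
        agentId="AGENT_ID_PLACEHOLDER",
        agentAliasId="AGENT_ALIAS_PLACEHOLDER",
        sessionId=session_id,                 # lets the agent keep conversation context
        inputText=question,
    )
    # The completion arrives as an event stream of chunks
    answer = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            answer += chunk["bytes"].decode("utf-8")
    return answer

if __name__ == "__main__":
    print(run_agent("Compare ad group performance across low-performing campaigns in Q1",
                    "demo-session-1"))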

Overcoming critical challenges
Implementing the solution brought with it a few key challenges.
Firstly, early prototypes suffered from 90-second (or more) response times when chaining multiple agents and APIs. By adopting a custom orchestrator and streaming, we reduced median latency by 30%, as illustrated in the following table.

Approach | Average Latency (seconds) | P90 | P99
Baseline | 136 | 194 | 215
Optimized Workflow | 44 | 102 | 102

Secondly, customers frequently analyzed multi-year datasets, exceeding Anthropic Claude’s 50,000-token window. Our solution uses dynamic session chunking to split conversations while retaining key context, and employs Retrieval Augmented Generation (RAG)-based memory retrieval.
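The chunking idea can be sketched in a few lines: split a long conversation into token-bounded segments while always carrying a short running summary forward as retained context. The following is an illustrative sketch only; a real system would count tokens with the model tokenizer rather than by words.

# Illustrative sketch of dynamic session chunking with a carried-forward summary.
def chunk_messages(messages, max_tokens=40_000, summary=""):
    def count_tokens(text):
        return len(text.split())            # crude proxy; use the model tokenizer in practice

    chunks, current = [], []
    budget = max_tokens - count_tokens(summary)
    for msg in messages:
        cost = count_tokens(msg)
        if cost > budget and current:
            chunks.append({"summary": summary, "messages": current})
            current = []
            budget = max_tokens - count_tokens(summary)
        current.append(msg)
        budget -= cost
    if current:
        chunks.append({"summary": summary, "messages": current})
    return chunks

history = [f"turn {i}: " + "spend data " * 500 for i in range(40)]
print(len(chunk_messages(history, summary="Key facts retained from earlier turns.")))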
Lastly, we implemented the following measures for error handling at scale:

Real-time tracing using WatchDog with Amazon CloudWatch Logs Insights to monitor more than 230 agent metrics
A retry mechanism, in which failed API calls with 500 error: “BEDROCK_MODEL_INVOCATION_SERVICE_UNAVAILABLE” are automatically retried
Amazon CloudWatch monitoring and alerting
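The retry mechanism above can be as simple as a backoff wrapper around the agent invocation. The sketch below is illustrative: the error-matching string check stands in for catching the specific botocore exception you see in practice.

# Sketch of retrying transient Bedrock invocation failures with exponential backoff.
import time

def invoke_with_retry(invoke_fn, *args, max_attempts=4, base_delay=1.0, **kwargs):
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke_fn(*args, **kwargs)
        except Exception as exc:            # in practice, catch the specific botocore exception
            transient = "SERVICE_UNAVAILABLE" in str(exc) or "500" in str(exc)
            if not transient or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))   # 1s, 2s, 4s, ...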

Business results
Since deploying with AWS, Skai’s platform has achieved significant results, as shown in the following table.

Metric | Improvement
Report Generation Time | 50% Faster
Case Study Generation Time | 75% Faster
QBR Composition Time | 80% Faster
Report to Recommendation Time | 90% Faster

While the metrics above demonstrate measurable improvements, the true business impact becomes clear through customer feedback. The core challenges Skai addressed (time-consuming report generation, complex data analysis, and the need for actionable recommendations) have been resolved in ways that fundamentally changed how users work with advertising data.
Customer testimonials
“It’s made my life easier. It’s made my team’s life easier. It’s made my clients’ lives easier and better. So we all work in jobs where there’s millions and millions of data points to scour through every day, and being able to do that as efficiently as possible and as fluidly as possible with Celeste AI is always a welcome addition to Skai.” – Aram Howard, Amazon Advertising Executive, Data Analyst | Channel Bakers
“Celeste is saving hours of time. It’s like having another set of eyes to give suggestions. I’m so stoked to see where this could take us.” – Erick Rudloph, Director of Search Marketing, Xcite Media Group
“It truly feels like having a data scientist right next to me to answer questions, even with recommendations for starting an optimization or looking at an account’s performance.” – Director of Search Marketing at Media Agency
Looking ahead: The future of Celeste
We’re expanding Celeste’s capabilities in the following areas:

Personalizing the user experience, retaining memories and preferences across multiple sessions.
Ingestion of custom data assets, so the client can bring their own data into Celeste and seamlessly connect it to Celeste’s existing data and knowledge.
New tools for seamless team integration. These tools will allow Celeste to generate client presentations, build data dashboards, and provide timely notifications.

Conclusion
With Amazon Bedrock Agents, Skai transformed raw data into strategic assets, helping customers make faster, smarter decisions without technical bottlenecks. By combining a robust AWS AI/ML infrastructure with our domain expertise, we’ve created a blueprint other organizations can follow to democratize data analytics.
What truly set our journey apart was the ease with which Amazon Bedrock helped us transition from concept to production. Rather than building complex AI infrastructure, we used a fully managed service that let us focus on our core strengths: understanding customer data challenges and delivering insights at scale. The decision to use Amazon Bedrock resulted in considerable business acceleration, helping us deliver value in weeks rather than quarters while maintaining production-grade security, performance, and reliability.
Skai’s journey with Amazon Bedrock continues—follow our series for updates on multi-agent systems and other generative innovations.

About the authors
Lior Heber is the AI Lead Architect at Skai, where he has spent over a decade shaping the company's technology with a focus on innovation, developer experience, and intelligent UI design. With a strong background in software architecture and AI-driven solutions, Lior has led transformative projects that push the boundaries of how teams build and deliver products. Beyond his work in tech, he co-founded Colorful Family, a project creating children's books for diverse families. Lior combines technical expertise with creativity, always looking for ways to bridge technology and human experience.
Yarden Ron is a Software Development Team Lead at Skai, bringing over four years of leadership and engineering experience to the AI-powered commerce media platform. He recently spearheaded the launch of Celeste AI – a GenAI agent designed to revolutionize how marketers engage with their platforms by making insights faster, smarter, and more intuitive. Based in Israel, Yarden blends technical acumen with collaborative drive, leading teams that turn innovative ideas into impactful products.
Tomer Berkovich is a Technical Account Manager at AWS with a specialty focus on Generative AI and Machine Learning. He brings over two decades of technology, engineering, and architecture experience to help organizations navigate their AI/ML journey on AWS. When he isn’t working, he enjoys spending time with his family, exploring emerging technologies, and powerlifting while chasing new personal records.
Dov Amir is a Senior Solutions Architect at AWS, bringing over 20 years of experience in software, cloud, and architecture. In his current role, Dov helps customers accelerate cloud adoption and application modernization by leveraging cloud-native technologies and generative AI.
Gili Nachum is a Principal AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models, and how machine learning is changing the world as we know it. In his spare time, Gili enjoys playing table tennis.

How to Create a Bioinformatics AI Agent Using Biopython for DNA and Pr …

In this tutorial, we demonstrate how to build an advanced yet accessible Bioinformatics AI Agent using Biopython and popular Python libraries, designed to run seamlessly in Google Colab. By combining sequence retrieval, molecular analysis, visualization, multiple sequence alignment, phylogenetic tree construction, and motif searches into a single streamlined class, the tutorial provides a hands-on approach to explore the full spectrum of biological sequence analysis. Users can start with built-in sample sequences such as the SARS-CoV-2 Spike protein, Human Insulin precursor, and E. coli 16S rRNA, or fetch custom sequences directly from NCBI. With built-in visualization tools powered by Plotly and Matplotlib, researchers and students alike can quickly perform comprehensive DNA and protein analyses without needing prior setup beyond a Colab notebook. Check out the FULL CODES here.

!pip install biopython pandas numpy matplotlib seaborn plotly requests beautifulsoup4 scipy scikit-learn networkx
!apt-get update
!apt-get install -y clustalw

import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from Bio import SeqIO, Entrez, Align, Phylo
from Bio import AlignIO  # needed later to read the ClustalW alignment during tree construction
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import warnings
warnings.filterwarnings('ignore')

Entrez.email = "your_email@example.com"

We begin by installing essential bioinformatics and data science libraries, along with ClustalW for sequence alignment. We then import Biopython modules, visualization tools, and supporting packages, while setting up Entrez with our email to fetch sequences from NCBI. This ensures our Colab environment is fully prepared for advanced sequence analysis. Check out the FULL CODES here.

class BioPythonAIAgent:
    def __init__(self, email="your_email@example.com"):
        self.email = email
        Entrez.email = email
        self.sequences = {}
        self.analysis_results = {}
        self.alignments = {}
        self.trees = {}

    def fetch_sequence_from_ncbi(self, accession_id, db="nucleotide", rettype="fasta"):
        try:
            handle = Entrez.efetch(db=db, id=accession_id, rettype=rettype, retmode="text")
            record = SeqIO.read(handle, "fasta")
            handle.close()
            self.sequences[accession_id] = record
            return record
        except Exception as e:
            print(f"Error fetching sequence: {str(e)}")
            return None

    def create_sample_sequences(self):
        covid_spike = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"

        human_insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

        e_coli_16s = "AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAATGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGCGTTAAGGTTAATAACCTTGGCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA"

        sample_sequences = [
            ("COVID_Spike", covid_spike, "SARS-CoV-2 Spike Protein"),
            ("Human_Insulin", human_insulin, "Human Insulin Precursor"),
            ("E_coli_16S", e_coli_16s, "E. coli 16S rRNA")
        ]

        for seq_id, seq_str, desc in sample_sequences:
            record = SeqRecord(Seq(seq_str), id=seq_id, description=desc)
            self.sequences[seq_id] = record

        return sample_sequences

    def analyze_sequence(self, sequence_id=None, sequence=None):
        if sequence_id and sequence_id in self.sequences:
            seq_record = self.sequences[sequence_id]
            seq = seq_record.seq
            description = seq_record.description
        elif sequence:
            seq = Seq(sequence)
            description = "Custom sequence"
        else:
            return None

        analysis = {
            'length': len(seq),
            'composition': {}
        }

        for base in ['A', 'T', 'G', 'C']:
            analysis['composition'][base] = seq.count(base)

        if 'A' in analysis['composition'] and 'T' in analysis['composition']:
            analysis['gc_content'] = round(gc_fraction(seq) * 100, 2)
            try:
                analysis['molecular_weight'] = round(molecular_weight(seq, seq_type='DNA'), 2)
            except:
                analysis['molecular_weight'] = len(seq) * 650

        try:
            if len(seq) % 3 == 0:
                protein = seq.translate()
                analysis['translation'] = str(protein)
                analysis['stop_codons'] = protein.count('*')

                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])
                    analysis['protein_mw'] = round(prot_analysis.molecular_weight(), 2)
                    analysis['isoelectric_point'] = round(prot_analysis.isoelectric_point(), 2)
                    analysis['protein_composition'] = prot_analysis.get_amino_acids_percent()
        except:
            pass

        key = sequence_id if sequence_id else "custom"
        self.analysis_results[key] = analysis

        return analysis

    def visualize_composition(self, sequence_id):
        if sequence_id not in self.analysis_results:
            return

        analysis = self.analysis_results[sequence_id]

        fig = make_subplots(
            rows=2, cols=2,
            specs=[[{"type": "pie"}, {"type": "bar"}],
                   [{"colspan": 2}, None]],
            subplot_titles=("Nucleotide Composition", "Base Count", "Sequence Properties")
        )

        labels = list(analysis['composition'].keys())
        values = list(analysis['composition'].values())

        fig.add_trace(
            go.Pie(labels=labels, values=values, name="Composition"),
            row=1, col=1
        )

        fig.add_trace(
            go.Bar(x=labels, y=values, name="Count", marker_color=['red', 'blue', 'green', 'orange']),
            row=1, col=2
        )

        properties = ['Length', 'GC%', 'MW (kDa)']
        prop_values = [
            analysis['length'],
            analysis.get('gc_content', 0),
            analysis.get('molecular_weight', 0) / 1000
        ]

        fig.add_trace(
            go.Scatter(x=properties, y=prop_values, mode='markers+lines',
                       marker=dict(size=10, color='purple'), name="Properties"),
            row=2, col=1
        )

        fig.update_layout(
            title=f"Comprehensive Analysis: {sequence_id}",
            showlegend=False,
            height=600
        )

        fig.show()

    def perform_multiple_sequence_alignment(self, sequence_ids):
        if len(sequence_ids) < 2:
            return None

        sequences = []
        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                sequences.append(self.sequences[seq_id])

        if len(sequences) < 2:
            return None

        from Bio.Align import PairwiseAligner
        aligner = PairwiseAligner()
        aligner.match_score = 2
        aligner.mismatch_score = -1
        aligner.open_gap_score = -2
        aligner.extend_gap_score = -0.5

        alignments = []
        for i in range(len(sequences)):
            for j in range(i + 1, len(sequences)):
                alignment = aligner.align(sequences[i].seq, sequences[j].seq)[0]
                alignments.append(alignment)

        return alignments

    def create_phylogenetic_tree(self, alignment_key=None, sequences=None):
        if alignment_key and alignment_key in self.alignments:
            alignment = self.alignments[alignment_key]
        elif sequences:
            records = []
            for i, seq in enumerate(sequences):
                record = SeqRecord(Seq(seq), id=f"seq_{i}")
                records.append(record)
            SeqIO.write(records, "temp.fasta", "fasta")

            try:
                # ClustalwCommandline lives in Bio.Align.Applications (deprecated in newer
                # Biopython releases); the apt package installed above may expose the binary
                # as "clustalw" rather than "clustalw2" on some systems.
                from Bio.Align.Applications import ClustalwCommandline
                clustalw_cline = ClustalwCommandline("clustalw2", infile="temp.fasta")
                stdout, stderr = clustalw_cline()
                alignment = AlignIO.read("temp.aln", "clustal")
                os.remove("temp.fasta")
                os.remove("temp.aln")
                os.remove("temp.dnd")
            except:
                return None
        else:
            return None

        calculator = DistanceCalculator('identity')
        dm = calculator.get_distance(alignment)

        constructor = DistanceTreeConstructor()
        tree = constructor.upgma(dm)

        tree_key = f"tree_{len(self.trees)}"
        self.trees[tree_key] = tree

        return tree

    def visualize_tree(self, tree):
        fig, ax = plt.subplots(figsize=(10, 6))
        Phylo.draw(tree, axes=ax)
        plt.title("Phylogenetic Tree")
        plt.tight_layout()
        plt.show()

    def protein_structure_analysis(self, sequence_id):
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq

        try:
            if len(seq) % 3 == 0:
                protein = seq.translate()
                if '*' not in str(protein)[:-1]:
                    prot_analysis = ProteinAnalysis(str(protein)[:-1])

                    structure_analysis = {
                        'molecular_weight': prot_analysis.molecular_weight(),
                        'isoelectric_point': prot_analysis.isoelectric_point(),
                        'amino_acid_percent': prot_analysis.get_amino_acids_percent(),
                        'secondary_structure': prot_analysis.secondary_structure_fraction(),
                        'flexibility': prot_analysis.flexibility(),
                        'gravy': prot_analysis.gravy()
                    }

                    return structure_analysis
        except:
            pass

        return None

    def comparative_analysis(self, sequence_ids):
        results = []

        for seq_id in sequence_ids:
            if seq_id in self.analysis_results:
                analysis = self.analysis_results[seq_id].copy()
                analysis['sequence_id'] = seq_id
                results.append(analysis)

        df = pd.DataFrame(results)

        if len(df) > 1:
            fig = make_subplots(
                rows=2, cols=2,
                subplot_titles=("Length Comparison", "GC Content", "Molecular Weight", "Composition Heatmap")
            )

            fig.add_trace(
                go.Bar(x=df['sequence_id'], y=df['length'], name="Length"),
                row=1, col=1
            )

            if 'gc_content' in df.columns:
                fig.add_trace(
                    go.Scatter(x=df['sequence_id'], y=df['gc_content'], mode='markers+lines', name="GC%"),
                    row=1, col=2
                )

            if 'molecular_weight' in df.columns:
                fig.add_trace(
                    go.Bar(x=df['sequence_id'], y=df['molecular_weight'], name="MW"),
                    row=2, col=1
                )

            fig.update_layout(title="Comparative Sequence Analysis", height=600)
            fig.show()

        return df

    def codon_usage_analysis(self, sequence_id):
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq

        if len(seq) % 3 != 0:
            return None

        codons = {}
        for i in range(0, len(seq) - 2, 3):
            codon = str(seq[i:i+3])
            codons[codon] = codons.get(codon, 0) + 1

        codon_df = pd.DataFrame(list(codons.items()), columns=['Codon', 'Count'])
        codon_df = codon_df.sort_values('Count', ascending=False)

        fig = px.bar(codon_df.head(20), x='Codon', y='Count',
                     title=f"Top 20 Codon Usage - {sequence_id}")
        fig.show()

        return codon_df

    def motif_search(self, sequence_id, motif_pattern):
        if sequence_id not in self.sequences:
            return []

        seq = str(self.sequences[sequence_id].seq)
        positions = []

        for i in range(len(seq) - len(motif_pattern) + 1):
            if seq[i:i+len(motif_pattern)] == motif_pattern:
                positions.append(i)

        return positions

    def gc_content_window(self, sequence_id, window_size=100):
        if sequence_id not in self.sequences:
            return None

        seq = self.sequences[sequence_id].seq
        gc_values = []
        positions = []

        for i in range(0, len(seq) - window_size + 1, window_size // 4):
            window = seq[i:i+window_size]
            gc_values.append(gc_fraction(window) * 100)
            positions.append(i + window_size // 2)

        fig = go.Figure()
        fig.add_trace(go.Scatter(x=positions, y=gc_values, mode='lines+markers',
                                 name=f'GC Content (window={window_size})'))
        fig.update_layout(
            title=f"GC Content Sliding Window Analysis - {sequence_id}",
            xaxis_title="Position",
            yaxis_title="GC Content (%)"
        )
        fig.show()

        return positions, gc_values

    def run_comprehensive_analysis(self, sequence_ids):
        results = {}

        for seq_id in sequence_ids:
            if seq_id in self.sequences:
                analysis = self.analyze_sequence(seq_id)
                self.visualize_composition(seq_id)

                gc_analysis = self.gc_content_window(seq_id)
                codon_analysis = self.codon_usage_analysis(seq_id)

                results[seq_id] = {
                    'basic_analysis': analysis,
                    'gc_window': gc_analysis,
                    'codon_usage': codon_analysis
                }

        if len(sequence_ids) > 1:
            comparative_df = self.comparative_analysis(sequence_ids)
            results['comparative'] = comparative_df

        return results

We define a BioPython AIAgent that allows us to fetch or create sequences, run core analyses (composition, GC%, translation, and protein properties), and visualize results interactively. We also perform pairwise alignments, build phylogenetic trees, scan motifs, profile codon usage, analyze GC with sliding windows, and compare multiple sequences—then bundle everything into one comprehensive pipeline. Check out the FULL CODES here.

agent = BioPythonAIAgent()

sample_seqs = agent.create_sample_sequences()

for seq_id, _, _ in sample_seqs:
    agent.analyze_sequence(seq_id)

results = agent.run_comprehensive_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])

print("BioPython AI Agent Tutorial Complete!")
print("Available sequences:", list(agent.sequences.keys()))
print("Available methods:", [method for method in dir(agent) if not method.startswith('_')])

We instantiate the BioPythonAIAgent, generate sample sequences (COVID Spike, Human Insulin, and E. coli 16S), and run a full analysis pipeline. The outputs confirm that our agent successfully performs nucleotide, codon, and GC-content analyses while also preparing comparative visualizations. Finally, we print the list of available sequences and supported methods, indicating that the agent’s full analytical capabilities are now ready for use. Check out the FULL CODES here.

agent.visualize_composition('COVID_Spike')
agent.gc_content_window('E_coli_16S', window_size=50)
agent.codon_usage_analysis('COVID_Spike')

comparative_df = agent.comparative_analysis(['COVID_Spike', 'Human_Insulin', 'E_coli_16S'])
print(comparative_df)

motif_positions = agent.motif_search('COVID_Spike', 'ATG')
print(f"ATG start codons found at positions: {motif_positions}")

tree = agent.create_phylogenetic_tree(sequences=[
    str(agent.sequences['COVID_Spike'].seq[:300]),
    str(agent.sequences['Human_Insulin'].seq[:300]),
    str(agent.sequences['E_coli_16S'].seq[:300])
])

if tree:
    agent.visualize_tree(tree)

We visualize nucleotide composition, scan E. coli 16S GC% in sliding windows, and profile codon usage for the COVID Spike sequence. We then compare sequences side-by-side, search for the “ATG” motif, and build/plot a quick phylogenetic tree from the first 300 nt of each sequence.

In conclusion, we have a fully functional BioPython AI Agent capable of handling multiple layers of sequence analysis, from basic nucleotide composition to codon usage profiling, GC-content sliding windows, motif searches, and even comparative analyses across species. The integration of visualization and phylogenetic tree construction provides both intuitive and in-depth insights into genetic data. Whether for academic projects, bioinformatics education, or research prototyping, this Colab-friendly workflow showcases how open-source tools like Biopython can be harnessed with modern AI-inspired pipelines to simplify and accelerate biological data exploration.

Check out the FULL CODES here.
The post How to Create a Bioinformatics AI Agent Using Biopython for DNA and Protein Analysis appeared first on MarkTechPost.

Meta Superintelligence Labs Introduces REFRAG: Scaling RAG with 16× L …


A team of researchers from Meta Superintelligence Labs, National University of Singapore and Rice University has unveiled REFRAG (REpresentation For RAG), a decoding framework that rethinks retrieval-augmented generation (RAG) efficiency. REFRAG extends LLM context windows by 16× and achieves up to a 30.85× acceleration in time-to-first-token (TTFT) without compromising accuracy.

Why is long context such a bottleneck for LLMs?

The attention mechanism in large language models scales quadratically with input length. If a document is twice as long, the compute and memory cost can grow fourfold. This not only slows inference but also increases the size of the key-value (KV) cache, making large-context applications impractical in production systems. In RAG settings, most retrieved passages contribute little to the final answer, but the model still pays the full quadratic price to process them.
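In symbols, using standard back-of-the-envelope estimates rather than figures from the paper, for context length $n$, model width $d$, and layer count $L$:

\text{Attention FLOPs} \;\propto\; n^{2} d,
\qquad
\text{KV cache size} \;\propto\; n \, L \, d .
% Doubling the context length therefore roughly quadruples attention compute:
\frac{\text{cost}(2n)}{\text{cost}(n)} \;=\; \frac{(2n)^{2}}{n^{2}} \;=\; 4 .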

How does REFRAG compress and shorten context?

REFRAG introduces a lightweight encoder that splits retrieved passages into fixed-size chunks (e.g., 16 tokens) and compresses each into a dense chunk embedding. Instead of feeding thousands of raw tokens, the decoder processes this shorter sequence of embeddings. The result is a 16× reduction in sequence length, with no change to the LLM architecture.
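The shape of this transformation is easy to see in a toy sketch. The module below is an illustration of the idea only (mean-pool each 16-token chunk of retrieved-context embeddings and project it into the decoder's embedding space), not the paper's actual encoder; the dimensions are made-up assumptions.

# Toy illustration of chunk compression: 2,048 context token embeddings -> 128 chunk embeddings.
import torch
import torch.nn as nn

class ToyChunkCompressor(nn.Module):
    def __init__(self, d_model=4096, chunk_size=16):
        super().__init__()
        self.chunk_size = chunk_size
        self.proj = nn.Linear(d_model, d_model)   # stand-in for encoder + projection

    def forward(self, token_embeddings):
        # token_embeddings: (batch, n_tokens, d_model), n_tokens divisible by chunk_size
        b, n, d = token_embeddings.shape
        chunks = token_embeddings.view(b, n // self.chunk_size, self.chunk_size, d)
        pooled = chunks.mean(dim=2)                # one vector per chunk
        return self.proj(pooled)                   # (batch, n_tokens / chunk_size, d_model)

ctx = torch.randn(1, 2048, 4096)                   # retrieved-context token embeddings
compressed = ToyChunkCompressor()(ctx)
print(ctx.shape, "->", compressed.shape)           # sequence length shrinks 2048 -> 128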

https://arxiv.org/pdf/2509.01092

How is acceleration achieved?

By shortening the decoder’s input sequence, REFRAG reduces the quadratic attention computation and shrinks the KV cache. Empirical results show 16.53× TTFT acceleration at k=16 and 30.85× acceleration at k=32, far surpassing prior state-of-the-art CEPE (which achieved only 2–8×). Throughput also improves by up to 6.78× compared to LLaMA baselines.

How does REFRAG preserve accuracy?

A reinforcement learning (RL) policy supervises compression. It identifies the most information-dense chunks and allows them to bypass compression, feeding raw tokens directly into the decoder. This selective strategy ensures that critical details—such as exact numbers or rare entities—are not lost. Across multiple benchmarks, REFRAG maintained or improved perplexity compared to CEPE while operating at far lower latency.

What do the experiments reveal?

REFRAG was pretrained on 20B tokens from the SlimPajama corpus (Books + arXiv) and tested on long-context datasets including Book, Arxiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversation tasks, and long-document summarization, REFRAG consistently outperformed strong baselines:

16× context extension beyond standard LLaMA-2 (4k tokens).

~9.3% perplexity improvement over CEPE across four datasets.

Better accuracy in weak retriever settings, where irrelevant passages dominate, due to the ability to process more passages under the same latency budget.

https://arxiv.org/pdf/2509.01092

Summary

REFRAG shows that long-context LLMs don’t have to be slow or memory-hungry. By compressing retrieved passages into compact embeddings, selectively expanding only the important ones, and rethinking how RAG decoding works, Meta Superintelligence Labs has made it possible to process much larger inputs while running dramatically faster. This makes large-context applications—like analyzing entire reports, handling multi-turn conversations, or scaling enterprise RAG systems—not only feasible but efficient, without compromising accuracy.

FAQs

Q1. What is REFRAG?
REFRAG (REpresentation For RAG) is a decoding framework from Meta Superintelligence Labs that compresses retrieved passages into embeddings, enabling faster and longer-context inference in LLMs.

Q2. How much faster is REFRAG compared to existing methods?
REFRAG delivers up to 30.85× faster time-to-first-token (TTFT) and 6.78× throughput improvement compared to LLaMA baselines, while outperforming CEPE.

Q3. Does compression reduce accuracy?
No. A reinforcement learning policy ensures critical chunks remain uncompressed, preserving key details. Across benchmarks, REFRAG maintained or improved accuracy relative to prior methods.

Q4. Where will the code be available?
Meta Superintelligence Labs will release REFRAG on GitHub at facebookresearch/refrag.

Check out the PAPER here.
The post Meta Superintelligence Labs Introduces REFRAG: Scaling RAG with 16× Longer Contexts and 31× Faster Decoding appeared first on MarkTechPost.

Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model w …

Latvian language-tech firm Tilde has released TildeOpen LLM, an open-source foundational large language model (LLM) purpose-built for European languages, with a sharp focus on under-represented and smaller national and regional languages. It’s a strategic leap toward linguistic equity and digital sovereignty within the EU.

Under the Hood: Architecture, Training and Governance

The public release occurred on September 3, 2025, when Tilde deployed the model free to users via Hugging Face.

Built as a 30-billion-parameter dense decoder-only transformer, the model is available under a permissive license (CC-BY-4.0) and includes broad language support—from Latvian and Lithuanian to Ukrainian, Turkish, and beyond.

Training occurred on the EU’s supercomputers: LUMI (Finland) and JUPITER, tapping into 2 million GPU hours awarded via the European Commission’s Large AI Grand Challenge.

On the technical side, the model was trained with EleutherAI-inspired GPT-NeoX scripts for roughly 450K updates, consuming ~2 trillion tokens. Training followed a three-stage sampling schedule: uniform sampling across languages, then natural (data-proportional) sampling to boost high-data-volume languages, and a final uniform sweep for balance.
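The difference between the uniform and natural stages is just a change of per-language sampling weights, as the small sketch below illustrates; the corpus sizes are invented for illustration and are not TildeOpen's actual data statistics.

# Illustrative sketch of the uniform vs. natural (data-proportional) sampling stages.
corpus_tokens = {"lv": 30e9, "lt": 35e9, "uk": 120e9, "tr": 150e9, "en": 600e9}  # made-up sizes

def uniform_weights(sizes):
    return {lang: 1 / len(sizes) for lang in sizes}

def natural_weights(sizes):
    total = sum(sizes.values())
    return {lang: n / total for lang, n in sizes.items()}

for name, weights in [("stages 1 and 3 (uniform)", uniform_weights(corpus_tokens)),
                      ("stage 2 (natural)", natural_weights(corpus_tokens))]:
    print(name, {k: round(v, 3) for k, v in weights.items()})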

Hyperparameters: 60 layers, embedding size 6144, 48 attention heads, 8192-token context window, SwiGLU activations, RoPE positional encoding, RMSNorm layer norms.
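A quick back-of-the-envelope check shows these dimensions do land in the ~30B range. The FFN width and vocabulary size are not stated above, so the values below (SwiGLU hidden size 16384, vocabulary 131072, untied input/output embeddings) are assumptions chosen only for illustration.

# Rough parameter-count estimate from the published hyperparameters (assumed d_ff and vocab).
d_model, n_layers = 6144, 60
d_ff, vocab = 16384, 131072                       # assumed, not published in this excerpt

attn_per_layer = 4 * d_model * d_model            # Q, K, V, O projections
ffn_per_layer = 3 * d_model * d_ff                # SwiGLU: gate, up, and down matrices
per_layer = attn_per_layer + ffn_per_layer
embeddings = 2 * vocab * d_model                  # input + output embeddings (assumed untied)

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.1f}B parameters")          # prints roughly 28.8B, i.e. on the order of 30B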

Language Equity and Data Sovereignty

Mainstream models lean heavily on English and other major languages, causing skewed performance when dealing with Baltic, Slavic, or other smaller European languages. This under-representation leads to poor grammar, awkward phrasing, and hallucinations.

TildeOpen resolves this by embedding an “equitable tokenizer”, engineered to represent text similarly regardless of language—reducing token count and increasing inference efficiency for lesser-represented languages.

Crucially, organizations can self-host—in local data centers or secure EU-compliant clouds—ensuring adherence to GDPR and other data-protection mandates. This addresses sovereignty concerns tied to US- or Asia-hosted models.

Strategic Horizon: From Prototype to European AI Infrastructure

TildeOpen is a foundational "base" model. More specialized versions (for example, instruction-tuned translation models) are expected to be built atop this core in upcoming releases.

It’s also a geo-flag planting moment: Latvia, via Tilde, positions itself as a tech exporter, with aspirations to scale European AI infrastructure while preserving linguistic diversity.

For researchers, the move mirrors broader work on multilingual model behavior, where gaps still exist. Evaluations show even strong open LLMs can hallucinate or lag in lexical accuracy for Baltic languages, reinforcing the need for localized development.

Summary

TildeOpen LLM reframes EU AI—not just as regulatory compliance, but as technical stewardship. It’s a grounded, high-capacity model with transparent architecture, scalable deployment, and a fierce commitment to linguistic equity. It doesn’t indulge hype; it delivers substance.

FAQs

Q1: What is TildeOpen LLM? TildeOpen is a 30B-parameter multilingual large language model trained on EU supercomputers, optimized for European languages, especially under-represented ones.

Q2: How is it different from mainstream LLMs? Unlike global models that prioritize English, TildeOpen uses an equitable tokenizer and balanced training to ensure fair representation and accuracy across smaller European languages.

Q3: Can organizations self-host the model? Yes. TildeOpen is open-source under CC-BY-4.0 and can be deployed in local data centers or EU-compliant clouds to meet GDPR and data sovereignty requirements.

Q4: What are the main use cases? Government services, translation, education, AI assistants, speech technologies, and multilingual customer support—any domain requiring accurate European language processing.

Check out the Model on Hugging Face and Technical details here.
The post Tilde AI Releases TildeOpen LLM: An Open-Source Large Language Model with Over 30 Billion Parameters and Support Most European Languages appeared first on MarkTechPost.

Implementing DeepSpeed for Scalable Transformers: Advanced Training with Gradient Checkpointing and Parallelism

In this advanced DeepSpeed tutorial, we provide a hands-on walkthrough of cutting-edge optimization techniques for training large language models efficiently. By combining ZeRO optimization, mixed-precision training, gradient accumulation, and advanced DeepSpeed configurations, the tutorial demonstrates how to maximize GPU memory utilization, reduce training overhead, and enable scaling of transformer models in resource-constrained environments, such as Colab. Alongside model creation and training, it also covers performance monitoring, inference optimization, checkpointing, and benchmarking different ZeRO stages, providing practitioners with both theoretical insights and practical code to accelerate model development. Check out the FULL CODES here.

import subprocess
import sys
import os
import json
import time
from pathlib import Path

def install_dependencies():
    """Install required packages for DeepSpeed in Colab"""
    print("Installing DeepSpeed and dependencies...")

    # PyTorch built against CUDA 11.8
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "torch", "torchvision", "torchaudio", "--index-url",
        "https://download.pytorch.org/whl/cu118"
    ])

    subprocess.check_call([sys.executable, "-m", "pip", "install", "deepspeed"])

    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "transformers", "datasets", "accelerate", "wandb"
    ])

    print("Installation complete!")

install_dependencies()

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
from typing import Dict, Any
import argparse

We set up our Colab environment by installing PyTorch with CUDA support, DeepSpeed, and essential libraries like Transformers, Datasets, Accelerate, and Weights & Biases. We ensure everything is ready so we can smoothly build and train models with DeepSpeed. Check out the FULL CODES here.

class SyntheticTextDataset(Dataset):
    """Synthetic dataset for demonstration purposes"""

    def __init__(self, size: int = 1000, seq_length: int = 512, vocab_size: int = 50257):
        self.size = size
        self.seq_length = seq_length
        self.vocab_size = vocab_size

        # Random token ids serve as both inputs and labels
        self.data = torch.randint(0, vocab_size, (size, seq_length))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return {
            'input_ids': self.data[idx],
            'labels': self.data[idx].clone()
        }

We create a SyntheticTextDataset where we generate random token sequences to mimic real text data. We use these sequences as both inputs and labels, allowing us to quickly test DeepSpeed training without relying on a large external dataset. Check out the FULL CODES here.

class AdvancedDeepSpeedTrainer:
    """Advanced DeepSpeed trainer with multiple optimization techniques"""

    def __init__(self, model_config: Dict[str, Any], ds_config: Dict[str, Any]):
        self.model_config = model_config
        self.ds_config = ds_config
        self.model = None
        self.engine = None
        self.tokenizer = None

    def create_model(self):
        """Create a GPT-2 style model for demonstration"""
        print("Creating model...")

        config = GPT2Config(
            vocab_size=self.model_config['vocab_size'],
            n_positions=self.model_config['seq_length'],
            n_embd=self.model_config['hidden_size'],
            n_layer=self.model_config['num_layers'],
            n_head=self.model_config['num_heads'],
            resid_pdrop=0.1,
            embd_pdrop=0.1,
            attn_pdrop=0.1,
        )

        self.model = GPT2LMHeadModel(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        self.tokenizer.pad_token = self.tokenizer.eos_token

        print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
        return self.model

    def create_deepspeed_config(self):
        """Create comprehensive DeepSpeed configuration"""
        return {
            "train_batch_size": self.ds_config['train_batch_size'],
            "train_micro_batch_size_per_gpu": self.ds_config['micro_batch_size'],
            "gradient_accumulation_steps": self.ds_config['gradient_accumulation_steps'],

            "zero_optimization": {
                "stage": self.ds_config['zero_stage'],
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 5e8,
                "contiguous_gradients": True,
                "cpu_offload": self.ds_config.get('cpu_offload', False)
            },

            "fp16": {
                "enabled": True,
                "loss_scale": 0,
                "loss_scale_window": 1000,
                "initial_scale_power": 16,
                "hysteresis": 2,
                "min_loss_scale": 1
            },

            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": self.ds_config['learning_rate'],
                    "betas": [0.9, 0.999],
                    "eps": 1e-8,
                    "weight_decay": 0.01
                }
            },

            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": 0,
                    "warmup_max_lr": self.ds_config['learning_rate'],
                    "warmup_num_steps": 100
                }
            },

            "gradient_clipping": 1.0,

            "wall_clock_breakdown": True,

            "memory_breakdown": True,

            "tensorboard": {
                "enabled": True,
                "output_path": "./logs/",
                "job_name": "deepspeed_advanced_tutorial"
            }
        }

    def initialize_deepspeed(self):
        """Initialize DeepSpeed engine"""
        print("Initializing DeepSpeed...")

        parser = argparse.ArgumentParser()
        parser.add_argument('--local_rank', type=int, default=0)
        args = parser.parse_args([])

        self.engine, optimizer, _, lr_scheduler = deepspeed.initialize(
            args=args,
            model=self.model,
            config=self.create_deepspeed_config()
        )

        print(f"DeepSpeed engine initialized with ZeRO stage {self.ds_config['zero_stage']}")
        return self.engine

    def train_step(self, batch: Dict[str, torch.Tensor]) -> Dict[str, float]:
        """Perform a single training step with DeepSpeed optimizations"""

        input_ids = batch['input_ids'].to(self.engine.device)
        labels = batch['labels'].to(self.engine.device)

        outputs = self.engine(input_ids=input_ids, labels=labels)
        loss = outputs.loss

        # DeepSpeed handles loss scaling and gradient accumulation inside backward/step
        self.engine.backward(loss)

        self.engine.step()

        return {
            'loss': loss.item(),
            'lr': self.engine.lr_scheduler.get_last_lr()[0] if self.engine.lr_scheduler else 0
        }

    def train(self, dataloader: DataLoader, num_epochs: int = 2):
        """Complete training loop with monitoring"""
        print(f"Starting training for {num_epochs} epochs...")

        self.engine.train()
        total_steps = 0

        for epoch in range(num_epochs):
            epoch_loss = 0.0
            epoch_steps = 0

            print(f"\nEpoch {epoch + 1}/{num_epochs}")

            for step, batch in enumerate(dataloader):
                start_time = time.time()

                metrics = self.train_step(batch)

                epoch_loss += metrics['loss']
                epoch_steps += 1
                total_steps += 1

                if step % 10 == 0:
                    step_time = time.time() - start_time
                    print(f"Step {step:4d} | Loss: {metrics['loss']:.4f} | "
                          f"LR: {metrics['lr']:.2e} | Time: {step_time:.3f}s")

                if step % 20 == 0 and hasattr(self.engine, 'monitor'):
                    self.log_memory_stats()

                # Keep the demo short in Colab
                if step >= 50:
                    break

            avg_loss = epoch_loss / epoch_steps
            print(f"Epoch {epoch + 1} completed | Average Loss: {avg_loss:.4f}")

        print("Training completed!")

    def log_memory_stats(self):
        """Log GPU memory statistics"""
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024**3
            reserved = torch.cuda.memory_reserved() / 1024**3
            print(f"GPU Memory - Allocated: {allocated:.2f}GB | Reserved: {reserved:.2f}GB")

    def save_checkpoint(self, path: str):
        """Save model checkpoint using DeepSpeed"""
        print(f"Saving checkpoint to {path}")
        self.engine.save_checkpoint(path)

    def demonstrate_inference(self, text: str = "The future of AI is"):
        """Demonstrate optimized inference with DeepSpeed"""
        print(f"\nRunning inference with prompt: '{text}'")

        inputs = self.tokenizer.encode(text, return_tensors='pt').to(self.engine.device)

        self.engine.eval()

        with torch.no_grad():
            outputs = self.engine.module.generate(
                inputs,
                max_length=inputs.shape[1] + 50,
                num_return_sequences=1,
                temperature=0.8,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated text: {generated_text}")

        self.engine.train()

We build an end-to-end trainer that creates a GPT-2 model, sets a DeepSpeed config (ZeRO, FP16, AdamW, warmup scheduler, tensorboard), and initializes the engine. We then run efficient training steps with logging and memory statistics, save checkpoints, and demonstrate inference to verify optimization and generation in one place. Check out the FULL CODES here.

def run_advanced_tutorial():
    """Main function to run the advanced DeepSpeed tutorial"""

    print("Advanced DeepSpeed Tutorial Starting...")
    print("=" * 60)

    model_config = {
        'vocab_size': 50257,
        'seq_length': 512,
        'hidden_size': 768,
        'num_layers': 6,
        'num_heads': 12
    }

    ds_config = {
        'train_batch_size': 16,
        'micro_batch_size': 4,
        'gradient_accumulation_steps': 4,
        'zero_stage': 2,
        'learning_rate': 1e-4,
        'cpu_offload': False
    }

    print("Configuration:")
    # Rough size estimate: embedding matrix plus one hidden x hidden projection per layer
    approx_params = (model_config['vocab_size'] * model_config['hidden_size']
                     + model_config['num_layers'] * model_config['hidden_size'] ** 2) / 1e6
    print(f"Model size: ~{approx_params:.1f}M parameters (rough estimate)")
    print(f"ZeRO Stage: {ds_config['zero_stage']}")
    print(f"Batch size: {ds_config['train_batch_size']}")

    trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)

    model = trainer.create_model()

    engine = trainer.initialize_deepspeed()

    print("\nCreating synthetic dataset...")
    dataset = SyntheticTextDataset(
        size=200,
        seq_length=model_config['seq_length'],
        vocab_size=model_config['vocab_size']
    )

    dataloader = DataLoader(
        dataset,
        batch_size=ds_config['micro_batch_size'],
        shuffle=True
    )

    print("\nPre-training memory stats:")
    trainer.log_memory_stats()

    trainer.train(dataloader, num_epochs=2)

    print("\nPost-training memory stats:")
    trainer.log_memory_stats()

    trainer.demonstrate_inference("DeepSpeed enables efficient training of")

    checkpoint_path = "./deepspeed_checkpoint"
    trainer.save_checkpoint(checkpoint_path)

    demonstrate_zero_stages()
    demonstrate_memory_optimization()

    print("\nTutorial completed successfully!")
    print("Key DeepSpeed features demonstrated:")
    print("  ZeRO optimization for memory efficiency")
    print("  Mixed precision training (FP16)")
    print("  Gradient accumulation")
    print("  Learning rate scheduling")
    print("  Checkpoint saving/loading")
    print("  Memory monitoring")

def demonstrate_zero_stages():
    """Demonstrate different ZeRO optimization stages"""
    print("\nZeRO Optimization Stages Explained:")
    print("  Stage 0: Disabled (baseline)")
    print("  Stage 1: Optimizer state partitioning (~4x memory reduction)")
    print("  Stage 2: Gradient partitioning (~8x memory reduction)")
    print("  Stage 3: Parameter partitioning (~Nx memory reduction)")

    zero_configs = {
        1: {"stage": 1, "reduce_bucket_size": 5e8},
        2: {"stage": 2, "allgather_partitions": True, "reduce_scatter": True},
        3: {"stage": 3, "stage3_prefetch_bucket_size": 5e8, "stage3_param_persistence_threshold": 1e6}
    }

    for stage, config in zero_configs.items():
        estimated_memory_reduction = [1, 4, 8, "Nx"][stage]
        print(f"  Stage {stage}: ~{estimated_memory_reduction}x memory reduction")

def demonstrate_memory_optimization():
    """Show memory optimization techniques"""
    print("\nMemory Optimization Techniques:")
    print("  Gradient Checkpointing: Trade compute for memory")
    print("  CPU Offloading: Move optimizer states to CPU")
    print("  Compression: Reduce communication overhead")
    print("  Mixed Precision: Use FP16 for faster training")

We orchestrate the full training run: set configs, build the GPT-2 model and DeepSpeed engine, create a synthetic dataset, monitor GPU memory, train for two epochs, run inference, and save a checkpoint. We then explain ZeRO stages and highlight memory-optimization tactics, such as gradient checkpointing and CPU offloading, to understand the trade-offs in practice. Check out the FULL CODES here.

class DeepSpeedConfigGenerator:
    """Utility class to generate DeepSpeed configurations"""

    @staticmethod
    def generate_config(
        batch_size: int = 16,
        zero_stage: int = 2,
        use_cpu_offload: bool = False,
        learning_rate: float = 1e-4
    ) -> Dict[str, Any]:
        """Generate a complete DeepSpeed configuration"""

        config = {
            "train_batch_size": batch_size,
            "train_micro_batch_size_per_gpu": max(1, batch_size // 4),
            "gradient_accumulation_steps": max(1, batch_size // max(1, batch_size // 4)),

            "zero_optimization": {
                "stage": zero_stage,
                "allgather_partitions": True,
                "allgather_bucket_size": 5e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 5e8,
                "contiguous_gradients": True
            },

            "fp16": {
                "enabled": True,
                "loss_scale": 0,
                "loss_scale_window": 1000,
                "initial_scale_power": 16,
                "hysteresis": 2,
                "min_loss_scale": 1
            },

            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": learning_rate,
                    "betas": [0.9, 0.999],
                    "eps": 1e-8,
                    "weight_decay": 0.01
                }
            },

            "scheduler": {
                "type": "WarmupLR",
                "params": {
                    "warmup_min_lr": 0,
                    "warmup_max_lr": learning_rate,
                    "warmup_num_steps": 100
                }
            },

            "gradient_clipping": 1.0,
            "wall_clock_breakdown": True
        }

        if use_cpu_offload:
            config["zero_optimization"]["cpu_offload"] = True
            config["zero_optimization"]["pin_memory"] = True

        if zero_stage == 3:
            config["zero_optimization"].update({
                "stage3_prefetch_bucket_size": 5e8,
                "stage3_param_persistence_threshold": 1e6,
                "stage3_gather_16bit_weights_on_model_save": True
            })

        return config

def benchmark_zero_stages():
    """Benchmark different ZeRO stages"""
    print("\nBenchmarking ZeRO Stages...")

    model_config = {
        'vocab_size': 50257,
        'seq_length': 256,
        'hidden_size': 512,
        'num_layers': 4,
        'num_heads': 8
    }

    results = {}

    for stage in [1, 2]:
        print(f"\nTesting ZeRO Stage {stage}...")

        ds_config = {
            'train_batch_size': 8,
            'micro_batch_size': 2,
            'gradient_accumulation_steps': 4,
            'zero_stage': stage,
            'learning_rate': 1e-4
        }

        try:
            trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
            model = trainer.create_model()
            engine = trainer.initialize_deepspeed()

            if torch.cuda.is_available():
                torch.cuda.reset_peak_memory_stats()

            dataset = SyntheticTextDataset(size=20, seq_length=model_config['seq_length'])
            dataloader = DataLoader(dataset, batch_size=ds_config['micro_batch_size'])

            # Time a handful of steps and record peak GPU memory
            start_time = time.time()
            for i, batch in enumerate(dataloader):
                if i >= 5:
                    break
                trainer.train_step(batch)

            end_time = time.time()
            peak_memory = torch.cuda.max_memory_allocated() / 1024**3

            results[stage] = {
                'peak_memory_gb': peak_memory,
                'time_per_step': (end_time - start_time) / 5
            }

            print(f"  Peak Memory: {peak_memory:.2f}GB")
            print(f"  Time per step: {results[stage]['time_per_step']:.3f}s")

            del trainer, model, engine
            torch.cuda.empty_cache()

        except Exception as e:
            print(f"  Error with stage {stage}: {str(e)}")

    if len(results) > 1:
        print("\nComparison:")
        stage_1_memory = results.get(1, {}).get('peak_memory_gb', 0)
        stage_2_memory = results.get(2, {}).get('peak_memory_gb', 0)

        if stage_1_memory > 0 and stage_2_memory > 0:
            memory_reduction = (stage_1_memory - stage_2_memory) / stage_1_memory * 100
            print(f"  Memory reduction from Stage 1 to 2: {memory_reduction:.1f}%")

def demonstrate_advanced_features():
    """Demonstrate additional advanced DeepSpeed features"""
    print("\nAdvanced DeepSpeed Features:")

    print("  Dynamic Loss Scaling: Automatically adjusts FP16 loss scaling")

    print("  Gradient Compression: Reduces communication overhead")

    print("  Pipeline Parallelism: Splits model across devices")

    print("  Expert Parallelism: Efficient Mixture-of-Experts training")

    print("  Curriculum Learning: Progressive training strategies")

if __name__ == "__main__":
    print(f"CUDA Available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name()}")
        print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f}GB")

    try:
        run_advanced_tutorial()

        benchmark_zero_stages()

        demonstrate_advanced_features()

    except Exception as e:
        print(f"Error during tutorial: {str(e)}")
        print("Tips for troubleshooting:")
        print("  - Ensure you have GPU runtime enabled in Colab")
        print("  - Try reducing batch_size or model size if facing memory issues")
        print("  - Enable CPU offloading in ds_config if needed")

We generate reusable DeepSpeed configurations, benchmark ZeRO stages to compare memory and speed, and showcase advanced features such as dynamic loss scaling and pipeline/MoE parallelism. We also detect CUDA, run the full tutorial end-to-end, and provide clear troubleshooting tips, allowing us to iterate confidently in Colab.

In conclusion, we gain a comprehensive understanding of how DeepSpeed enhances model training efficiency by striking a balance between performance and memory trade-offs. From leveraging ZeRO stages for memory reduction to applying FP16 mixed precision and CPU offloading, the tutorial showcases powerful strategies that make large-scale training accessible on modest hardware. By the end, learners will have trained and optimized a GPT-style model, benchmarked configurations, monitored GPU resources, and explored advanced features such as pipeline parallelism and gradient compression.
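
Gradient (activation) checkpointing and CPU offloading are only described in the printed tips above, so as a hedged follow-up, the sketch below shows one way they could be layered onto this tutorial's setup: activation checkpointing on the Hugging Face model via gradient_checkpointing_enable(), and optimizer offloading plus DeepSpeed's activation_checkpointing section in the config. The key names follow current DeepSpeed documentation; adjust them to your installed version.

# Sketch: extra memory-saving options on top of the tutorial's setup (assumes the classes above).
def memory_saving_overrides(base_config):
    """Return a copy of a DeepSpeed config with CPU offloading and activation checkpointing enabled."""
    config = dict(base_config)
    config["zero_optimization"] = {
        **config.get("zero_optimization", {}),
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # optimizer states live in CPU RAM
    }
    config["activation_checkpointing"] = {
        "partition_activations": True,        # shard checkpointed activations across GPUs
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": False,           # set True to push checkpointed activations to CPU as well
    }
    return config

# Usage sketch with the trainer defined earlier in this tutorial:
# trainer = AdvancedDeepSpeedTrainer(model_config, ds_config)
# trainer.create_model()
# trainer.model.gradient_checkpointing_enable()                # HF-side activation checkpointing
# ds_cfg = memory_saving_overrides(trainer.create_deepspeed_config())
# engine, _, _, _ = deepspeed.initialize(model=trainer.model, config=ds_cfg)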

Check out the FULL CODES here.
The post Implementing DeepSpeed for Scalable Transformers: Advanced Training with Gradient Checkpointing and Parallelism appeared first on MarkTechPost.