Boost cold-start recommendations with vLLM on AWS Trainium

Cold start in recommendation systems goes beyond just new user or new item problems—it’s the complete absence of personalized signals at launch. When someone first arrives, or when fresh content appears, there’s no behavioral history to tell the engine what they care about, so everyone ends up in broad generic segments. That not only dampens click-through and conversion rates, it can drive users away before a system ever gets a chance to learn their tastes. Standard remedies—collaborative filtering, matrix factorization, or popularity lists—lack the nuance to bridge that signal gap, and their one-size-fits-all suggestions quickly feel stale. Imagine, instead, if you could generate detailed interest profiles from day one. By tapping into large language models (LLMs) for zero-shot reasoning, you can synthesize rich, context-aware user and item embeddings without waiting for weeks of interaction data—turning a cold start into a warm welcome.
In this post, we demonstrate how to use vLLM for scalable inference and use AWS Deep Learning Containers (DLC) to streamline model packaging and deployment. We’ll generate interest expansions through structured prompts, encode them into embeddings, retrieve candidates with FAISS, apply validation to keep results grounded, and frame the cold-start challenge as a scientific experiment—benchmarking LLM and encoder pairings, iterating rapidly on recommendation metrics, and showing clear ROI for each configuration.
Solution overview
We build our cold-start solution on Amazon EC2 instances powered by AWS Trainium chips. To streamline model deployment, we use DLCs with the AWS Neuron SDK, which include Neuron-optimized PyTorch modules and come with the latest AWS Trainium drivers and runtime pre-installed.

Figure: Cold-start recommendation pipeline on AWS Trainium with vLLM and NxD

Sharding large models across multiple Trainium chips is handled by the distributed library used by Neuron, NeuronX Distributed (NxD), which integrates seamlessly with vLLM. NxD manages model partitions across multiple instances with minimal code changes, enabling parallel inference of even 70B parameter LLMs. This combination—Trainium chips, Neuron Tools, and vLLM—gives machine learning (ML) engineers a flexible, cost-efficient, production-ready solution for experimenting with different LLM and encoder configurations and delivers rapid iteration on recommendation quality metrics without modifying core model code.
In the next section, we orchestrate our experiments in a Jupyter notebook—providing a reproducible, end-to-end workflow from loading data and engineering structured prompts to generating embeddings and retrieving candidates with FAISS—complete with interactive charts to visualize recommendation performance. Then, in the production deep-dive, we walk through a reference implementation that packages your Neuron-optimized LLM and encoder as DLC images and deploys them on Amazon Elastic Kubernetes Service (Amazon EKS) with autoscaling, so your inference layer automatically adapts to demand while optimizing cost and performance.
Expanding user interest profiles with LLMs
In this post, we use the Amazon Book Reviews dataset (mohamedbakhet/amazon-books-reviews) from Kaggle, which provides real-world user reviews and metadata for tens of thousands of books. This rich collection lets us simulate cold-start scenarios—where a brand-new user has only a single review or like—and evaluate how well our interest expansions, powered by distilled versions of Meta’s Llama 8B and 70B models, generate rich user profiles. We use an LLM to enrich a new user’s profile from minimal initial data. For example, if a user has only reviewed one science fiction novel, the LLM infers related subtopics—such as galactic empires, cyberpunk dystopias, or space exploration—that the user is likely to enjoy. We use structured prompts that embed the user’s existing activity into a concise instruction to verify consistency and relevance, as demonstrated in the following example:

prompt = (
    f"The user has shown interest in: {user_review_category}.\n"
    "Suggest 3–5 related book topics they might enjoy.\n"
    "Respond with a JSON list of topic keywords."
)
expanded_topics = llm.generate([prompt])[0].text

By constraining the LLM’s output format—asking it to return a JSON array of topic keywords—we avoid free‑form tangents and obtain a predictable list of interest expansions. Modern generative models, such as Meta’s Llama, possess broad domain knowledge and human‑like reasoning, enabling them to connect related concepts and serve as powerful cold‑start boosters by inferring deep user preferences from a single review. These synthetic interests become new signals for our recommendation pipeline, allowing us to retrieve and rank books from the Amazon Reviews collection even with minimal user history. You can experiment with Llama variants ranging from one‑billion to seventy‑billion parameters to identify which model yields the most discriminative and relevant expansions. Those findings will guide our choice of model for production and determine the size and scale of the Amazon EC2 Trainium and Inferentia instances we provision, setting us up for live user A/B tests to validate performance in real‑world settings.
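Because the model is instructed to return a JSON array, the response can be parsed straight into a Python list. The following is a minimal, defensive sketch that builds on the expanded_topics string from the previous snippet; the regex extraction and empty-list fallback are illustrative additions on our part rather than part of the pipeline described above:

import json
import re

def parse_topics(raw_output: str) -> list[str]:
    """Extract the JSON list of topic keywords from an LLM response."""
    match = re.search(r"\[.*\]", raw_output, re.DOTALL)  # tolerate extra prose around the list
    if not match:
        return []
    try:
        topics = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [str(t).strip() for t in topics if str(t).strip()]

expanded_interests = parse_topics(expanded_topics)
# e.g. ["galactic empires", "cyberpunk dystopias", "space exploration"]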
Encoding user interests and retrieving relevant content
After we have our expanded interests, the next step is to turn both those interests and our catalog of books into vectors that we can compare. We explore three sizes of the Google T5 encoder—base, large and XL—to see how embedding dimensionality affects matching quality. The following are the steps:

Load the encoder for each size
Encode book summaries into a single NumPy matrix and normalize it
Build a FAISS index on those normalized vectors for fast nearest‑neighbor search
Encode the expanded interest text the same way and query FAISS to retrieve the top k most similar books

from transformers import T5Tokenizer, T5EncoderModel
import faiss
import numpy as np

# Our dataset of book summaries
content_texts = df["review/summary"].tolist()
encoder_sizes = ["t5-base", "t5-large", "t5-xl"]
top_k = 5

for size in encoder_sizes:
    # 1. Load the tokenizer and encoder model for this size
    tokenizer = T5Tokenizer.from_pretrained(size)
    model = T5EncoderModel.from_pretrained(size)

    # 2. Encode all content into embeddings and normalize
    inputs = tokenizer(content_texts, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)
    content_embs = outputs.last_hidden_state.mean(dim=1).detach().cpu().numpy().astype("float32")
    faiss.normalize_L2(content_embs)

    # 3. Build a FAISS index using inner product (equivalent to cosine on unit vectors)
    index = faiss.IndexFlatIP(content_embs.shape[1])
    index.add(content_embs)

    # 4. Encode a single expanded interest and query the index
    interest = "space opera with political intrigue"
    enc = tokenizer([interest], return_tensors="pt", truncation=True, padding=True)
    interest_emb = model(**enc).last_hidden_state.mean(dim=1).detach().cpu().numpy().astype("float32")
    faiss.normalize_L2(interest_emb)

    distances, indices = index.search(interest_emb, top_k)
    recommendations = [content_texts[i] for i in indices[0]]

    print(f"\nTop {top_k} recommendations using {size}:")
    for title in recommendations:
        print(" -", title)

You can compare how each encoder scale affects both the average FAISS distance (that is, how far apart your interest is from the content) and the actual recommended titles. Swapping in a different encoder family—such as SentenceTransformers—is as straightforward as replacing the model and tokenizer imports.
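For instance, a SentenceTransformers model could stand in for the T5 encoder with only a few changed lines. The sketch below reuses content_texts and top_k from the previous snippet; the checkpoint name is just an example, not a recommendation from our benchmarks:

from sentence_transformers import SentenceTransformer
import faiss

# Any sentence-embedding checkpoint works here; "all-MiniLM-L6-v2" is only an example.
st_model = SentenceTransformer("all-MiniLM-L6-v2")

content_embs = st_model.encode(content_texts, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(content_embs.shape[1])
index.add(content_embs)

interest_emb = st_model.encode(["space opera with political intrigue"],
                               normalize_embeddings=True).astype("float32")
distances, indices = index.search(interest_emb, top_k)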
Measuring and improving recommendation quality
Now that we’ve generated FAISS indexes for every LLM‑encoder pairing and computed the mean distance between each expanded interest query and its top 10 neighbors, we know exactly how tightly or loosely each model’s embeddings cluster. The following chart shows those average distances for each combination—revealing that 1B and 3B models collapse to almost zero, while 8B and 70B models (especially with larger encoders) produce progressively higher distances, signifying richer, more discriminative signals for recommendation.

Figure: Average FAISS distance by model and encoder

The chart shows that the 1B and 3B models yield an average FAISS distance of zero, meaning their expanded‑interest embeddings are essentially identical and offer no differentiation. By contrast, the 8B model produces a distance of about 0.5 with t5‑base, rising further with t5‑large and t5‑xl, which demonstrates that larger encoders capture more of the model’s nuance. The 70B model only adds a small boost—and only with the XL encoder—so its extra cost yields limited benefit.
In practical terms, a Llama 8B LLM paired with a base or large T5 encoder delivers clear separation in embedding space without the higher inference time and resource usage of a 70B model.
Comparing model and encoder impact on embedding spread
To see how LLM size and encoder scale shape our embedding space, you can measure, for each LLM and encoder pair, the mean FAISS distance from a representative expanded interest vector to its top 10 neighbors. The following bar chart plots those averages side by side. You can instantly spot that 1B and 3B collapse to zero, 8B jumps to around 0.5 and rises with larger encoders, and 70B only adds a small extra spread at the XL scale. This helps you choose the smallest combination that still gives you the embedding diversity needed for effective cold‑start recommendations.

Figure: FAISS distance by LLM and encoder size

Evaluating recommendation overlap across Llama variations and encoders to balance consistency and novelty
In the next analysis, you build a basic recommend_books helper that, for various LLM sizes and encoder choices, loads the corresponding expanded‑interest DataFrame, reads its FAISS index, reconstructs the first embedding as a stand‑in query, and returns the top-k book titles. Using this helper, we first measure how much each pair of encoders agrees on recommendations for a single LLM (base versus large, base versus XL, and large versus XL) and then, separately, how each pair of LLM sizes aligns for a fixed encoder. Finally, we focus on the 8B model (shown in the following figure) and plot a heatmap of its encoder overlaps, which shows that base and large share about 40% of their top 5 picks while XL diverges more—illustrating how changing the encoder shifts the balance between consistency and novelty in the recommendations.

Figure: 8B model encoder overlap heatmap

For the 8B model, the heatmap shows that t5_base and t5_large share 40% of their top 5 recommendations, t5_base and t5_xl also overlap 40%, while t5_large and t5_xl overlap only 20%, indicating that the XL encoder introduces the most novel titles compared to the other pairs.
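The overlap metric itself is simple to compute. The following self-contained sketch shows the shared fraction of two top-k lists; the book titles are made up, and in practice you would feed it the outputs of the recommend_books helper described above:

def topk_overlap(titles_a: list[str], titles_b: list[str]) -> float:
    """Fraction of shared titles between two top-k recommendation lists."""
    set_a, set_b = set(titles_a), set(titles_b)
    k = max(len(titles_a), len(titles_b), 1)
    return len(set_a & set_b) / k

# Hypothetical top-5 lists, e.g. from the t5-base and t5-large runs of the 8B model.
base_top5 = ["Dune", "Foundation", "Hyperion", "Leviathan Wakes", "Red Mars"]
large_top5 = ["Dune", "Foundation", "Ancillary Justice", "A Fire Upon the Deep", "Use of Weapons"]
print(f"Overlap: {topk_overlap(base_top5, large_top5):.0%}")  # 40% for these made-up lists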
Tweaking tensor_parallel_size for optimal cost performance
To balance inference speed against resource cost, we measured how increasing Neuron tensor parallelism affects latency when expanding user interests with the Llama 3.1 8B model on a trn1.32xlarge instance. We ran the same zero‑shot expansion workload at tensor_parallel_size values of 2, 8, 16, and 32. As shown in the following latency chart, P50 latency falls by 74%—from 2,480 ms at TP = 2 to 650 ms at TP = 16—then inches lower to 532 ms at TP = 32 (an additional 18% drop). The subsequent cost-to-performance chart shows that beyond TP = 16, doubling parallelism roughly doubles cost for only a 17% further latency gain.

Figure: Latency compared to tensor parallel size

In practice, setting tensor_parallel_size to 16 delivers the best trade‑off: you capture most of the speed‑up from model sharding while avoiding the sharply diminishing returns and higher core‑hour costs that come with maximal parallelism, as shown in the following figure.

Figure: Cost-performance compared to tensor parallel size

The preceding figure visualizes the cost-to-performance ratio of the Llama 8B tests, emphasizing that TP=16 offers the most balanced efficiency before the benefits plateau.
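The sweep itself only requires changing one argument when constructing the vLLM engine. The following sketch uses vLLM's offline LLM API; the model identifier, sequence limits, and sampling settings are placeholders, and on Trainium the Neuron backend uses tensor_parallel_size to shard the model across NeuronCores:

from vllm import LLM, SamplingParams

# Placeholder model ID and limits; point these at the Llama 3.1 8B artifacts you actually serve.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=16,   # the sweet spot found in our sweep over 2, 8, 16, and 32
    max_num_seqs=8,
    max_model_len=2048,
)

prompt = (
    "The user has shown interest in: science fiction.\n"
    "Suggest 3-5 related book topics they might enjoy.\n"
    "Respond with a JSON list of topic keywords."
)
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)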
What’s next?
Now that we have determined which models and encoders to use, as well as the optimal configuration for our dataset (such as sequence length and batch size), the next step is to deploy the models and define a production workflow that generates expanded interests, encodes them, and matches them against more content.
Conclusion
This post showed how AWS Trainium, the Neuron SDK, and scalable LLM inference can tackle cold-start challenges by enriching sparse user profiles for better recommendations from day one.
Importantly, our experiments highlight that larger models and encoders don’t always mean better outcomes. While they can produce richer signals, the gains often don’t justify the added cost. You might find that an 8B LLM with a T5-large encoder strikes the best balance between performance and efficiency.
Rather than assuming bigger is better, this approach helps teams identify the optimal model-encoder pair—delivering high-quality recommendations with cost-effective infrastructure.

About the authors
Yahav Biran is a Principal Architect at AWS, focusing on large-scale AI workloads. He contributes to open-source projects and publishes in AWS blogs and academic journals, including the AWS compute and AI blogs and the Journal of Systems Engineering. He frequently delivers technical presentations and collaborates with customers to design Cloud applications. Yahav holds a Ph.D. in Systems Engineering from Colorado State University.
Nir Ozeri is a Sr. Solutions Architect Manager with Amazon Web Services, based out of New York City. Nir leads a team of Solutions Architects focused on ISV customers. Nir specializes in application modernization, application and product delivery, and scalable application architecture.

Benchmarking Amazon Nova: A comprehensive analysis through MT-Bench and Arena-Hard

Large language models (LLMs) have rapidly evolved, becoming integral to applications ranging from conversational AI to complex reasoning tasks. However, as models grow in size and capability, effectively evaluating their performance has become increasingly challenging. Traditional benchmarking metrics like perplexity and BLEU scores often fail to capture the nuances of real-world interactions, making human-aligned evaluation frameworks crucial. Understanding how LLMs are assessed can lead to more reliable deployments and fair comparisons across different models.
In this post, we explore automated and human-aligned judging methods based on LLM-as-a-judge. LLM-as-a-judge refers to using a more powerful LLM to evaluate and rank responses generated by other LLMs based on predefined criteria such as correctness, coherence, helpfulness, or reasoning depth. This approach has become increasingly popular due to the scalability, consistency, faster iteration, and cost-efficiency compared to solely relying on human judges. We discuss different LLM-as-a-judge evaluation scenarios, including pairwise comparisons, where two models or responses are judged against each other, and single-response scoring, where individual outputs are rated based on predefined criteria. To provide concrete insights, we use MT-Bench and Arena-Hard, two widely used evaluation frameworks. MT-Bench offers a structured, multi-turn evaluation approach tailored for chatbot-like interactions, whereas Arena-Hard focuses on ranking LLMs through head-to-head response battles in challenging reasoning and instruction-following tasks. These frameworks aim to bridge the gap between automated and human judgment, making sure that LLMs aren’t evaluated solely based on synthetic benchmarks but also on practical use cases.
The repositories for MT-Bench and Arena-Hard were originally developed using OpenAI’s GPT API, primarily employing GPT-4 as the judge. Our team has expanded their functionality by integrating them with the Amazon Bedrock API to enable using Anthropic’s Claude Sonnet on Amazon Bedrock as the judge. In this post, we use both MT-Bench and Arena-Hard to benchmark Amazon Nova models by comparing them to other leading LLMs available through Amazon Bedrock.
Amazon Nova models and Amazon Bedrock
Our study evaluated all four models from the Amazon Nova family, including Amazon Nova Premier, which is the most recent addition to the family. Introduced at AWS re:Invent in December 2024, Amazon Nova models are designed to provide frontier-level intelligence with leading price-performance ratios. These models rank among the fastest and most economical options in their respective intelligence categories and are specifically optimized for powering enterprise generative AI applications in a cost-effective, secure, and reliable manner.
The understanding model family comprises four distinct tiers: Amazon Nova Micro (text-only, designed for ultra-efficient edge deployment), Amazon Nova Lite (multimodal, optimized for versatility), Amazon Nova Pro (multimodal, offering an ideal balance between intelligence and speed for most enterprise applications), and Amazon Nova Premier (multimodal, representing the most advanced Nova model for complex tasks and serving as a teacher for model distillation). Amazon Nova models support a wide range of applications, including coding, reasoning, and structured text generation.
Additionally, through Amazon Bedrock Model Distillation, customers can transfer the intelligence capabilities of Nova Premier to faster, more cost-effective models such as Nova Pro or Nova Lite, tailored to specific domains or use cases. This functionality is accessible through both the Amazon Bedrock console and APIs, including the Converse API and Invoke API.
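As an illustration, a Nova model can be called through the Bedrock Converse API with a few lines of boto3. The region and model ID below are placeholders; substitute the ones enabled in your account:

import boto3

# Placeholder region and model ID; check the Amazon Bedrock console for the IDs available to you.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize MT-Bench in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])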
MT-Bench analysis
MT-Bench is a unified framework that uses LLM-as-a-judge, based on a set of predefined questions. The evaluation questions are a set of challenging multi-turn open-ended questions designed to evaluate chat assistants. Users also have the flexibility to define their own question and answer pairs in a way that suits their needs. The framework presents models with challenging multi-turn questions across eight key domains:

Writing
Roleplay
Reasoning
Mathematics
Coding
Data Extraction
STEM
Humanities

The LLMs are evaluated using two types of evaluation:

Single-answer grading – This mode asks the LLM judge to grade and score a model’s answer directly, without pairwise comparison. For each turn, the LLM judge gives a score on a scale of 1–10. The average score is then computed across all turns.
Win-rate based grading – This mode uses two metrics:

pairwise-baseline – Run a pairwise comparison against a baseline model.
pairwise-all – Run a pairwise comparison between all model pairs on all questions.

Evaluation setup
In this study, we employed Anthropic’s Claude 3.7 Sonnet as our LLM judge, given its position as one of the most advanced language models available at the time of our study. We focused exclusively on single-answer grading, wherein the LLM judge directly evaluates and scores model-generated responses without conducting pairwise comparisons.
The eight domains covered in our study can be broadly categorized into two groups: those with definitive ground truth and those without. Specifically, Reasoning, Mathematics, Coding, and Data Extraction fall into the former category because they typically have reference answers against which responses can be objectively evaluated. Conversely, Writing, Roleplay, STEM, and Humanities often lack such clear-cut ground truth. Here we provide an example question from the Writing and Math categories:

{
    "question_id": 81,
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii,
        highlighting cultural experiences and must-see attractions.",
        "Rewrite your previous response. Start every sentence with the letter A."
    ]
}
{
    "question_id": 111,
    "category": "math",
    "turns": [
        "The vertices of a triangle are at points (0, 0), (-1, 1), and (3, 3).
         What is the area of the triangle?",
        "What's area of the circle circumscribing the triangle?"
    ],
    "reference": [
        "Area is 3",
        "5pi"
    ]
}

To account for this distinction, MT-Bench employs different judging prompts for each category (refer to the following GitHub repo), tailoring the evaluation process to the nature of the task at hand. As shown in the following evaluation prompts, for questions without a reference answer, MT-Bench adopts the single-v1 prompt, passing only the question and the model-generated answer. When evaluating questions with a reference answer, it also passes the reference_answer, as shown in the single-math-v1 prompt.

{
    "name": "single-v1",
    "type": "single",
    "system_prompt": "You are a helpful assistant.",
    "prompt_template":
        "[Instruction]\nPlease act as an impartial judge and evaluate the quality of
         the response provided by an AI assistant to the user question displayed below.
         Your evaluation should consider factors such as the helpfulness, relevance, accuracy,
         depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible.
         After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\",
         for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
    "description": "Prompt for general questions",
    "category": "general",
    "output_format": "[[rating]]"
}
{
    "name": "single-math-v1",
    "type": "single",
    "system_prompt": "You are a helpful assistant.",
    "prompt_template":
        "[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant
        to the user question displayed below. Your evaluation should consider correctness and helpfulness.
        You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer.
        Identify and correct any mistakes. Be as objective as possible. After providing your explanation,
        you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".
        \n\n[Question]\n{question}\n\n[The Start of Reference Answer]\n{ref_answer_1}\n[The End of Reference Answer]
        \n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
    "description": "Prompt for general questions",
    "category": "math",
    "output_format": "[[rating]]"
}

Overall performance analysis across Amazon Nova Models
In our evaluation using Anthropic’s Claude 3.7 Sonnet as an LLM-as-a-judge framework, we observed a clear performance hierarchy among Amazon Nova models. The scores ranged from 8.0 to 8.6, with Amazon Nova Premier achieving the highest median score of 8.6, followed closely by Amazon Nova Pro at 8.5. Both Amazon Nova Lite and Nova Micro achieved respectable median scores of 8.0.
What distinguishes these models beyond their median scores is their performance consistency. Nova Premier demonstrated the most stable performance across evaluation categories with a narrow min-max margin of 1.5 (ranging from 7.94 to 9.47). In comparison, Nova Pro showed greater variability with a min-max margin of 2.7 (from 6.44 to 9.13). Similarly, Nova Lite exhibited more consistent performance than Nova Micro, as evidenced by their respective min-max margins. For enterprise deployments where response time is critical, Nova Lite and Nova Micro excel with less than 6-second average latencies for single question-answer generation. This performance characteristic makes them particularly suitable for edge deployment scenarios and applications with strict latency requirements. When factoring in their lower cost, these models present compelling options for many practical use cases where the slight reduction in performance score is an acceptable trade-off.
Interestingly, our analysis revealed that Amazon Nova Premier, despite being the largest model, demonstrates superior token efficiency. It generates more concise responses that consume up to 190 fewer tokens for single question-answer generation than comparable models. This observation aligns with research indicating that more sophisticated models are generally more effective at filtering irrelevant information and structuring responses efficiently.
The narrow 0.6-point differential between the highest and lowest performing models suggests that all Amazon Nova variants demonstrate strong capabilities. Although larger models such as Nova Premier offer marginally better performance with greater consistency, smaller models provide compelling alternatives when latency and cost are prioritized. This performance profile gives developers flexibility to select the appropriate model based on their specific application requirements.
The following graph summarizes the overall performance scores and latency for all four models.

The following table shows token consumption and cost analysis for Amazon Nova Models.

Model | Avg. total tokens per query | Price per 1k input tokens | Avg. cost per query (cents)
Amazon Nova Premier | 2154 | $0.0025 | $5.4
Amazon Nova Pro | 2236 | $0.0008 | $1.8
Amazon Nova Lite | 2343 | $0.00006 | $0.14
Amazon Nova Micro | 2313 | $0.000035 | $0.08

Category-specific model comparison
The following radar plot compares the Amazon Nova models across all eight domains.

The radar plot reveals distinct performance patterns across the Amazon Nova model family, with a clear stratification across domains. Nova Premier consistently outperforms its counterparts, showing particular strengths in Math, Reasoning, Humanities, and Extraction, where it achieves scores approaching or exceeding 9. Nova Pro follows closely behind Premier in most categories, maintaining competitive performance especially in Writing and Coding, while showing more pronounced gaps in Humanities, Reasoning, and Math. Both Nova Lite and Micro demonstrate similar performance profiles to each other, with their strongest showing in Roleplay, and their most significant limitations in Humanities and Math, where the differential between Premier and the smaller models is most pronounced (approximately 1.5–3 points).
The consistent performance hierarchy across all domains (Premier > Pro > Lite ≈ Micro) aligns with model size and computational resources, though the magnitude of these differences varies significantly by category. Math and reasoning emerge among the most discriminating domains for model capability assessment and suggest substantial benefit from the additional scale of Amazon Nova Premier. However, workloads focused on creative content (Roleplay, Writing) provide the most consistent performance across the Nova family and suggest smaller models as compelling options given their latency and cost benefits. This domain-specific analysis offers practitioners valuable guidance when selecting the appropriate Nova model based on their application’s primary knowledge requirements.
In this study, we adopted Anthropic’s Claude 3.7 Sonnet as the single LLM judge. However, although Anthropic’s Claude 3.7 Sonnet is a popular choice for LLM judging due to its capabilities, studies have shown that it does exhibit certain biases (for example, it prefers longer responses). If time and resources permit, consider adopting a multi-LLM judge evaluation framework to reduce the biases intrinsic to individual LLM judges and increase evaluation reliability.
Arena-Hard-Auto analysis
Arena-Hard-Auto is a benchmark that uses 500 challenging prompts as a dataset to evaluate different LLMs using LLM-as-a-judge. The dataset is curated through an automated pipeline called BenchBuilder, which uses LLMs to automatically cluster, grade, and filter open-ended prompts from large, crowd-sourced datasets such as Chatbot-Arena to enable continuous benchmarking without a human in the loop. The paper reports that the new evaluation metrics provide three times higher separation of model performances compared to MT-Bench and achieve a 98.6% correlation with human preference rankings.
Test framework and methodology
The Arena-Hard-Auto benchmarking framework evaluates different LLMs using a pairwise comparison. Each model’s performance is quantified by comparing it against a strong baseline model, using a structured, rigorous setup to generate reliable and detailed judgments. We use the following components for the evaluation:

Pairwise comparison setup – Instead of evaluating models in isolation, they’re compared directly with a strong baseline model. This baseline provides a fixed standard, making it straightforward to understand how the models perform relative to an already high-performing model.
Judge model with fine-grained categories – A powerful model (Anthropic’s Claude 3.7 Sonnet) is used as a judge. This judge doesn’t merely decide which model is better; it also categorizes the comparison into five detailed preference labels. By using this nuanced scale, large performance gaps are penalized more heavily than small ones, which helps separate models more effectively based on performance differences:

A >> B (A is significantly better than B)
A > B (A is better than B)
A ~= B (A and B are similar)
B > A (B is better than A)
B >> A (B is significantly better than A)

Chain-of-thought (CoT) prompting – CoT prompting encourages the judge model to explain its reasoning before giving a final judgment. This process can lead to more thoughtful and reliable evaluations by helping the model analyze each response in depth rather than making a snap decision.
Two-game setup to avoid position bias – To minimize bias that might arise from a model consistently being presented first or second, each model pair is evaluated twice, swapping the order of the models. This way, if there’s a preference for models in certain positions, the setup controls for it. The total number of judgments is doubled (for example, 500 queries x 2 positions = 1,000 judgments).
Bradley-Terry model for scoring – After the comparisons are made, the Bradley-Terry model is applied to calculate each model’s final score. This model uses pairwise comparison data to estimate the relative strength of each model in a way that reflects not only the number of wins but also the strength of wins. This scoring method is more robust than simply calculating win rate because it accounts for pairwise outcomes across the models (see the sketch after this list).
Bootstrapping for statistical stability – By repeatedly sampling the comparison results (bootstrapping), the evaluation becomes statistically stable. This stability is beneficial because it makes sure the model rankings are reliable and less sensitive to random variations in the data.
Style control – Certain style features like response length and markdown formatting are separated from content quality, using style controls, to provide a clearer assessment of each model’s intrinsic capabilities.
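To make the Bradley-Terry step concrete, here is a minimal, illustrative fitting routine based on the standard minorization-maximization update. It ignores the five-level preference labels, bootstrapping, and style control that the full Arena-Hard-Auto pipeline applies, and the win counts below are made up:

import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Estimate Bradley-Terry strengths p such that P(i beats j) = p[i] / (p[i] + p[j])."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of model i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j]) for j in range(n) if j != i)
            if den > 0:
                p[i] = num / den
        p /= p.sum()  # normalize for identifiability
    return p

# Hypothetical pairwise results: wins[i][j] = judgments where model i beat model j.
wins = np.array([[ 0, 30, 45],
                 [20,  0, 35],
                 [ 5, 15,  0]])
strengths = fit_bradley_terry(wins)
print(strengths)  # relative strengths; these can be rescaled to the 0-10 range used in the post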

The original work focuses on pairwise comparison only. For our benchmarking, we also included our own implementation of single-score judgment, taking inspiration from MT-Bench. We again use Anthropic’s Claude 3.7 Sonnet as the judge and use the following prompt for judging without a reference model:

{
    "system_prompt":
        "Please act as an impartial judge and evaluate the quality
        of the response provided by an AI assistant to the user question
        displayed below. Your evaluation should consider factors
        such as the helpfulness, relevance, accuracy, depth, creativity,
        and level of detail of the response.
        Begin your evaluation by providing a short explanation.
        Be as objective as possible. After providing your explanation,
        you must rate the response on a scale of 1 to 10 by strictly
        following this format: \"[[rating]]\", for example: \"Rating: [[5]]\"."
}

Performance comparison
We evaluated five models (Amazon Nova Premier, Amazon Nova Pro, Amazon Nova Lite, Amazon Nova Micro, and DeepSeek-R1) against a strong reference model. The Arena-Hard benchmark generates confidence intervals by bootstrapping, as explained before. The 95% confidence interval shows the uncertainty of the models and is indicative of model performance. From the following plot, we can see that all the Amazon Nova models achieve high pairwise Bradley-Terry scores. Note that the Bradley-Terry score for the reference model is 5; this is because Bradley-Terry scores are computed from pairwise comparisons in which the reference model is one of the models in the pair, so its win rate is 50%, and because the total score is normalized between 0 and 10, the reference model receives a score of 5.

The confidence interval analysis, as shown in the following table, was done to statistically evaluate the Amazon Nova model family alongside DeepSeek-R1, providing deeper insights beyond raw scores. Nova Premier leads the pack (8.36–8.72), with DeepSeek-R1 (7.99–8.30) and Nova Pro (7.72–8.12) following closely. The overlapping confidence intervals among these top performers indicate statistically comparable capabilities. Nova Premier demonstrates strong performance consistency with a tight confidence interval (−0.16, +0.20), while maintaining the highest overall scores. A clear statistical separation exists between these leading models and the purpose-built Nova Lite (6.51–6.98) and Nova Micro (5.68–6.14), which are designed for different use cases. This comprehensive analysis confirms the position of Nova Premier as a top performer, with the entire Nova family offering options across the performance spectrum to meet varied customer requirements and resource constraints.

Model | Pairwise score 25th quartile | Pairwise score 75th quartile | Confidence interval
Amazon Nova Premier | 8.36 | 8.72 | (−0.16, +0.20)
Amazon Nova Pro | 7.72 | 8.12 | (−0.18, +0.23)
Amazon Nova Lite | 6.51 | 6.98 | (−0.22, +0.25)
Amazon Nova Micro | 5.68 | 6.14 | (−0.21, +0.25)
DeepSeek-R1 | 7.99 | 8.30 | (−0.15, +0.16)

Cost per output token is one of the contributors to the overall cost of using an LLM and directly impacts usage. The cost was computed based on the average output tokens over the 500 responses. Although Amazon Nova Premier leads in performance (85.22), Nova Lite and Nova Micro offer compelling value despite their wider confidence intervals. Nova Micro delivers 69% of the performance of Nova Premier at 89 times lower cost, while Nova Lite achieves 79% of the capabilities of Nova Premier at 52 times lower price. These dramatic cost efficiencies make the more affordable Nova models attractive options for many applications where absolute top performance isn’t essential, highlighting the effective performance-cost tradeoffs across the Amazon Nova family.
Conclusion
In this post, we explored the use of LLM-as-a-judge through MT-Bench and Arena-Hard benchmarks to evaluate model performance rigorously. We then compared Amazon Nova models against a leading reasoning model, that is, DeepSeek-R1 hosted on Amazon Bedrock, analyzing their capabilities across various tasks. Our findings indicate that Amazon Nova models deliver strong performance, especially in Extraction, Humanities, STEM, and Roleplay, while maintaining lower operational costs, making them a competitive choice for enterprises looking to optimize efficiency without compromising on quality. These insights highlight the importance of benchmarking methodologies in guiding model selection and deployment decisions in real-world applications.
For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide. The AWS Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build a roadmap, and move solutions into production. Check out Generative AI Innovation Center for our latest work and customer success stories.

About the authors
Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.
Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from University of South Florida and PostDoc from Moffitt Cancer Centre.
Rahul Ghosh is an Applied Scientist at Amazon’s Generative AI Innovation Center, where he works with AWS customers across different verticals to expedite their use of Generative AI. Rahul holds a Ph.D. in Computer Science from the University of Minnesota.
Jae Oh Woo is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he specializes in developing custom solutions and model customization for a diverse range of use cases. He has a strong passion for interdisciplinary research that connects theoretical foundations with practical applications in the rapidly evolving field of generative AI. Prior to joining Amazon, Jae Oh was a Simons Postdoctoral Fellow at the University of Texas at Austin. He holds a Ph.D. in Applied Mathematics from Yale University.
Jamal Saboune is an Applied Science Manager with the AWS Generative AI Innovation Center. He currently leads a team focused on helping AWS customers build innovative and scalable generative AI products across several industries. Jamal holds a PhD in AI and Computer Vision from the INRIA lab in France and has extensive R&D experience designing and building AI solutions that add value to users.
Wan Chen is an Applied Science Manager at the Generative AI Innovation Center. As an ML/AI veteran in the tech industry, she has a wide range of expertise in traditional machine learning, recommender systems, deep learning, and generative AI. She is a strong believer in superintelligence and is passionate about pushing the boundaries of AI research and application to enhance human life and drive business growth. She holds a Ph.D. in Applied Mathematics from the University of British Columbia and worked as a postdoctoral fellow at Oxford University.
Anila Joshi has more than a decade of experience building AI solutions. As an AWSI Geo Leader at the AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerates the adoption of AWS services by helping customers ideate, identify, and implement secure generative AI solutions.

SYNCOGEN: A Machine Learning Framework for Synthesizable 3D Molecular Generation Through Joint Graph and Coordinate Modeling

Introduction: The Challenge of Synthesizable Molecule Generation

In modern drug discovery, generative molecular design models have greatly expanded the chemical space available to researchers, enabling rapid exploration of new compounds. Yet, a major challenge remains: many AI-generated molecules are difficult or impossible to synthesize in the laboratory, limiting their practical value in pharmaceutical and chemical development.

While template-based methods—such as synthesis trees constructed from reaction templates—help address synthetic accessibility, these approaches only capture 2D molecular graphs, lacking the rich 3D structural information that determines a molecule’s behaviour in biological systems.

Bridging 3D Structure and Synthesis: The Need for a Unified Framework

Recent advances in 3D generative models can directly generate atomic coordinates, allowing for geometry-based design and improved property prediction. However, most methods do not systematically integrate synthetic feasibility constraints: the resulting molecules may possess desired shapes or properties, but there is no guarantee they can be assembled from existing building blocks using known reactions.

Synthetic accessibility is crucial for successful drug discovery and materials design, prompting the need for solutions that simultaneously ensure both realistic 3D geometry and direct synthetic routes.

SYNCOGEN: A Novel Framework for Synthesizable 3D Molecule Design

Researchers from the University of Toronto, University of Cambridge, McGill University, and others have proposed SYNCOGEN (Synthesizable Co-Generation), a framework that addresses this gap by jointly modeling reaction pathways and atomic coordinates during molecule generation. This unified framework enables the generation of 3D molecular structures along with tractable synthetic routes, ensuring that every proposed molecule is not only physically meaningful but also practically synthesizable.

Key Innovations of SYNCOGEN

Multimodal Generation: By blending masked graph diffusion (for reaction graphs) with flow matching (for atomic coordinates), SYNCOGEN samples from the joint distribution of building blocks, chemical reactions, and 3D structures.

Comprehensive Input Representation: Each molecule is represented as a triple (X, E, C), where:

X encodes building block identity,

E encodes reaction types and specific connection centers,

C contains all atomic coordinates.

Simultaneous Training: Both graph and coordinate modalities are modeled together, using losses that combine cross-entropy for graphs, masked mean squared error for coordinates, and pairwise distance penalties to ensure geometric realism.
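To make the combined objective concrete, here is a simplified, hypothetical PyTorch-style sketch of the three loss terms described above; the actual SYNCOGEN loss weights, masking, and noising schedules are more elaborate, and the weight w_dist below is made up:

import torch
import torch.nn.functional as F

def joint_loss(graph_logits, graph_targets, coord_pred, coord_true, atom_mask, w_dist=0.1):
    """Illustrative combination of graph cross-entropy, masked coordinate MSE, and a pairwise-distance penalty."""
    # Cross-entropy on the discrete reaction-graph predictions (building blocks / reaction edges).
    l_graph = F.cross_entropy(graph_logits, graph_targets)
    # Masked MSE on atomic coordinates: only real (unmasked) atoms contribute.
    mask = atom_mask.unsqueeze(-1)
    l_coord = ((coord_pred - coord_true) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    # Pairwise-distance penalty to keep the predicted geometry close to the reference geometry.
    d_pred = torch.cdist(coord_pred, coord_pred)
    d_true = torch.cdist(coord_true, coord_true)
    l_dist = ((d_pred - d_true) ** 2).mean()
    return l_graph + l_coord + w_dist * l_dist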

The SYNSPACE Dataset: Enabling Large-Scale, Synthesizability-Aware Training

To train SYNCOGEN, researchers created SYNSPACE, a dataset featuring over 600,000 synthesizable molecules, each constructed from 93 commercial building blocks and 19 robust reaction templates. Every molecule in SYNSPACE is annotated with multiple energy-minimized 3D conformations (over 3.3 million structures total), providing a diverse and reliable training resource that closely mirrors realistic chemical synthesis.

Dataset Construction Workflow

Molecules are systematically built by iterative reaction assembly, starting from an initial building block and choosing compatible reaction centers and partners for successive coupling steps.

For each resulting molecular graph, multiple low-energy conformers are generated and optimized using computational chemistry methods, ensuring each structure is both chemically plausible and energetically favourable.

Model Architecture and Training

SYNCOGEN leverages a modified SEMLAFLOW backbone, an SE(3)-equivariant neural network originally designed for 3D molecular generation. The architecture includes:

Specialized input and output heads to translate between building block-level graphs and atom-level features.

Loss functions and noising schemes that carefully balance graph accuracy and 3D structural fidelity, including visibility-aware coordinate handling to support variable atom counts and masking.

Training innovations such as edge count limits, compatibility masking, and self-conditioning to maintain chemistry-valid molecule generation.

Performance: State-of-the-Art Results in Synthesizable Molecule Generation

Benchmarking

SYNCOGEN achieves state-of-the-art performance on unconditional 3D molecule generation tasks, outperforming leading all-atom and graph-based generative frameworks. Notable improvements include:

High chemical validity: More than 96% of generated molecules are chemically valid.

Superior synthetic accessibility: Retrosynthesis software (AiZynthFinder, Syntheseus) reports solve rates of up to 72%, far surpassing most competing methods.

Excellent geometric and energetic realism: Generated conformers closely match the bond length, angle, and dihedral distributions of experimental datasets, with low non-bonded interaction energies.

Practical utility: SYNCOGEN enables direct generation of synthetic routes alongside 3D coordinates, uniquely bridging computational chemistry and experimental synthesis.

Fragment Linking and Drug Design

SYNCOGEN also demonstrates competitive performance in molecular inpainting for fragment linking, a crucial drug design task. It can generate easily synthesizable analogs of complex drugs, producing candidates with favorable docking scores and retrosynthetic tractability—a feat not matched by conventional 3D generative models.

Future Directions and Applications

SYNCOGEN marks a foundational advance for synthesizability-aware molecular generation, with potential extensions including:

Property-conditioned generation: Directly optimize for desired physicochemical or biological properties.

Protein pocket conditioning: Generate ligands customized for specific protein binding sites.

Expanding reaction space: Incorporate more diverse building blocks and reaction templates to widen accessible chemical space.

Automated synthesis robotics: Link generative models with laboratory automation for closed-loop drug and materials discovery.

Conclusion: A Step Toward Realizable Computational Molecular Design

SYNCOGEN sets a new benchmark for joint 3D and reaction-aware molecule generation, enabling researchers and pharmaceutical scientists to design molecules that are both structurally meaningful and experimentally feasible. By uniting generative models with strict synthetic constraints, SYNCOGEN brings computational design much closer to laboratory realization, unlocking new opportunities in drug discovery, materials science, and beyond.

FAQ 1: What is SYNCOGEN and how does it improve synthesizable 3D molecule generation?
SYNCOGEN is an advanced generative modeling framework that simultaneously generates both the 3D structures and the synthetic reaction pathways for small molecules. By jointly modeling reaction graphs and atomic coordinates, SYNCOGEN ensures that generated molecules are not only physically realistic but also easily synthesizable in real-world laboratory settings. This dual approach uniquely enables practical molecule design for drug discovery, bridging a critical gap left by earlier models that focused only on 2D structures or neglected synthetic accessibility.

FAQ 2: How is SYNCOGEN trained to guarantee synthetic accessibility and 3D accuracy?
SYNCOGEN is trained using the SYNSPACE dataset, which includes over 600,000 synthesizable molecules constructed from a fixed set of reliable building blocks and reaction templates, each paired with multiple energy-minimized 3D conformers. The model utilizes masked graph diffusion for the reaction graph and flow matching for atomic coordinates, combining graph cross-entropy, coordinate mean squared error, and pairwise distance penalties during training to enforce both chemical validity and geometric realism. Training-time constraints, such as edge count limits and compatibility masking, further ensure the generation of practical, chemistry-valid molecules.

FAQ 3: What are the main applications and future directions for SYNCOGEN in chemical and pharmaceutical research?
SYNCOGEN sets a new standard for synthesizability-aware 3D molecule generation, enabling direct suggestion of synthetic routes alongside 3D structures—key for drug design, fragment linking, and automated synthesis platforms. Future applications include conditioning generation on specific properties or protein binding pockets, expanding the library of applicable reactions and building blocks, and integrating with laboratory robotics for fully automated molecule synthesis and screening.

Check out the Paper here. All credit for this research goes to the researchers of this project.


A Code Implementation to Efficiently Leverage LangChain to Automate PubMed Literature Searches, Parsing, and Trend Visualization

In this tutorial, we introduce the Advanced PubMed Research Assistant and walk through building a streamlined pipeline for querying and analyzing biomedical literature. We focus on leveraging the PubmedQueryRun tool to perform targeted searches, such as “CRISPR gene editing,” and then parse, cache, and explore those results. You’ll learn how to extract publication dates, titles, and summaries; store queries for instant reuse; and prepare your data for visualization or further analysis.

!pip install -q langchain-community xmltodict pandas matplotlib seaborn wordcloud google-generativeai langchain-google-genai

import os
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from collections import Counter
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

from langchain_community.tools.pubmed.tool import PubmedQueryRun
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType

We install and configure all the essential Python packages, including langchain-community, xmltodict, pandas, matplotlib, seaborn, and wordcloud, as well as Google Generative AI and LangChain Google integrations. We import core data‑processing and visualization libraries, silence warnings, and bring in the PubmedQueryRun tool and ChatGoogleGenerativeAI client. Finally, we prepare to initialize our LangChain agent with the PubMed search capability.

class AdvancedPubMedResearcher:
    """Advanced PubMed research assistant with analysis capabilities"""

    def __init__(self, gemini_api_key=None):
        """Initialize the researcher with optional Gemini integration"""
        self.pubmed_tool = PubmedQueryRun()
        self.research_cache = {}

        if gemini_api_key:
            os.environ["GOOGLE_API_KEY"] = gemini_api_key
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-1.5-flash",
                temperature=0,
                convert_system_message_to_human=True
            )
            self.agent = self._create_agent()
        else:
            self.llm = None
            self.agent = None

    def _create_agent(self):
        """Create LangChain agent with PubMed tool"""
        tools = [
            Tool(
                name="PubMed Search",
                func=self.pubmed_tool.invoke,
                description="Search PubMed for biomedical literature. Use specific terms."
            )
        ]

        return initialize_agent(
            tools,
            self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True
        )

    def search_papers(self, query, max_results=5):
        """Search PubMed and parse results"""
        print(f"Searching PubMed for: '{query}'")

        try:
            results = self.pubmed_tool.invoke(query)
            papers = self._parse_pubmed_results(results)

            self.research_cache[query] = {
                'papers': papers,
                'timestamp': datetime.now(),
                'query': query
            }

            print(f"Found {len(papers)} papers")
            return papers

        except Exception as e:
            print(f"Error searching PubMed: {str(e)}")
            return []

    def _parse_pubmed_results(self, results):
        """Parse PubMed search results into structured data"""
        papers = []

        publications = results.split('\n\nPublished: ')[1:]

        for pub in publications:
            try:
                lines = pub.strip().split('\n')

                pub_date = lines[0] if lines else "Unknown"

                title_line = next((line for line in lines if line.startswith('Title: ')), '')
                title = title_line.replace('Title: ', '') if title_line else "Unknown Title"

                summary_start = None
                for i, line in enumerate(lines):
                    if 'Summary::' in line:
                        summary_start = i + 1
                        break

                summary = ""
                if summary_start:
                    summary = ' '.join(lines[summary_start:])

                papers.append({
                    'date': pub_date,
                    'title': title,
                    'summary': summary,
                    'word_count': len(summary.split()) if summary else 0
                })

            except Exception as e:
                print(f"Error parsing paper: {str(e)}")
                continue

        return papers

    def analyze_research_trends(self, queries):
        """Analyze trends across multiple research topics"""
        print("Analyzing research trends...")

        all_papers = []
        topic_counts = {}

        for query in queries:
            papers = self.search_papers(query, max_results=3)
            topic_counts[query] = len(papers)

            for paper in papers:
                paper['topic'] = query
                all_papers.append(paper)

        df = pd.DataFrame(all_papers)

        if df.empty:
            print("No papers found for analysis")
            return None

        self._create_visualizations(df, topic_counts)

        return df

    def _create_visualizations(self, df, topic_counts):
        """Create research trend visualizations"""
        plt.style.use('seaborn-v0_8')
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('PubMed Research Analysis Dashboard', fontsize=16, fontweight='bold')

        topics = list(topic_counts.keys())
        counts = list(topic_counts.values())

        axes[0, 0].bar(range(len(topics)), counts, color='skyblue', alpha=0.7)
        axes[0, 0].set_xlabel('Research Topics')
        axes[0, 0].set_ylabel('Number of Papers')
        axes[0, 0].set_title('Papers Found by Topic')
        axes[0, 0].set_xticks(range(len(topics)))
        axes[0, 0].set_xticklabels([t[:20] + '...' if len(t) > 20 else t for t in topics], rotation=45, ha='right')

        if 'word_count' in df.columns and not df['word_count'].empty:
            axes[0, 1].hist(df['word_count'], bins=10, color='lightcoral', alpha=0.7)
            axes[0, 1].set_xlabel('Abstract Word Count')
            axes[0, 1].set_ylabel('Frequency')
            axes[0, 1].set_title('Distribution of Abstract Lengths')

        try:
            dates = pd.to_datetime(df['date'], errors='coerce')
            valid_dates = dates.dropna()
            if not valid_dates.empty:
                axes[1, 0].hist(valid_dates, bins=10, color='lightgreen', alpha=0.7)
                axes[1, 0].set_xlabel('Publication Date')
                axes[1, 0].set_ylabel('Number of Papers')
                axes[1, 0].set_title('Publication Timeline')
                plt.setp(axes[1, 0].xaxis.get_majorticklabels(), rotation=45)
        except Exception:
            axes[1, 0].text(0.5, 0.5, 'Date parsing unavailable', ha='center', va='center', transform=axes[1, 0].transAxes)

        all_titles = ' '.join(df['title'].fillna('').astype(str))
        if all_titles.strip():
            clean_titles = re.sub(r'[^a-zA-Z\s]', '', all_titles.lower())

            try:
                wordcloud = WordCloud(width=400, height=300, background_color='white',
                                      max_words=50, colormap='viridis').generate(clean_titles)
                axes[1, 1].imshow(wordcloud, interpolation='bilinear')
                axes[1, 1].axis('off')
                axes[1, 1].set_title('Common Words in Titles')
            except Exception:
                axes[1, 1].text(0.5, 0.5, 'Word cloud unavailable', ha='center', va='center', transform=axes[1, 1].transAxes)

        plt.tight_layout()
        plt.show()

    def comparative_analysis(self, topic1, topic2):
        """Compare two research topics"""
        print(f"Comparing '{topic1}' vs '{topic2}'")

        papers1 = self.search_papers(topic1)
        papers2 = self.search_papers(topic2)

        avg_length1 = sum(p['word_count'] for p in papers1) / len(papers1) if papers1 else 0
        avg_length2 = sum(p['word_count'] for p in papers2) / len(papers2) if papers2 else 0

        print("\nComparison Results:")
        print(f"Topic 1 ({topic1}):")
        print(f" - Papers found: {len(papers1)}")
        print(f" - Avg abstract length: {avg_length1:.1f} words")

        print(f"\nTopic 2 ({topic2}):")
        print(f" - Papers found: {len(papers2)}")
        print(f" - Avg abstract length: {avg_length2:.1f} words")

        return papers1, papers2

    def intelligent_query(self, question):
        """Use AI agent to answer research questions (requires Gemini API)"""
        if not self.agent:
            print("AI agent not available. Please provide Gemini API key.")
            print("Get free API key at: https://makersuite.google.com/app/apikey")
            return None

        print(f"Processing intelligent query with Gemini: '{question}'")
        try:
            response = self.agent.run(question)
            return response
        except Exception as e:
            print(f"Error with AI query: {str(e)}")
            return None

We encapsulate the PubMed querying workflow in our AdvancedPubMedResearcher class, initializing the PubmedQueryRun tool and an optional Gemini-powered LLM agent for advanced analysis. We provide methods to search for papers, parse and cache results, analyze research trends with rich visualizations, and compare topics side by side. This class streamlines programmatic exploration of biomedical literature and intelligent querying in just a few method calls.

def main():
    """Main tutorial demonstration"""
    print("Advanced PubMed Research Assistant Tutorial")
    print("=" * 50)

    # Initialize researcher
    # Uncomment the next line and add your free Gemini API key for AI features
    # Get your free API key at: https://makersuite.google.com/app/apikey
    # researcher = AdvancedPubMedResearcher(gemini_api_key="your-gemini-api-key")
    researcher = AdvancedPubMedResearcher()

    print("\n1. Basic PubMed Search")
    papers = researcher.search_papers("CRISPR gene editing", max_results=3)

    if papers:
        print("\nFirst paper preview:")
        print(f"Title: {papers[0]['title']}")
        print(f"Date: {papers[0]['date']}")
        print(f"Summary preview: {papers[0]['summary'][:200]}...")

    print("\n\n2. Research Trends Analysis")
    research_topics = [
        "machine learning healthcare",
        "CRISPR gene editing",
        "COVID-19 vaccine"
    ]

    df = researcher.analyze_research_trends(research_topics)

    if df is not None:
        print(f"\nDataFrame shape: {df.shape}")
        print("\nSample data:")
        print(df[['topic', 'title', 'word_count']].head())

    print("\n\n3. Comparative Analysis")
    papers1, papers2 = researcher.comparative_analysis(
        "artificial intelligence diagnosis",
        "traditional diagnostic methods"
    )

    print("\n\n4. Advanced Features")
    print("Cache contents:", list(researcher.research_cache.keys()))

    if researcher.research_cache:
        latest_query = list(researcher.research_cache.keys())[-1]
        cached_data = researcher.research_cache[latest_query]
        print(f"Latest cached query: '{latest_query}'")
        print(f"Cached papers count: {len(cached_data['papers'])}")

    print("\nTutorial complete!")
    print("\nNext steps:")
    print("- Add your FREE Gemini API key for AI-powered analysis")
    print("  Get it at: https://makersuite.google.com/app/apikey")
    print("- Customize queries for your research domain")
    print("- Export results to CSV with: df.to_csv('research_results.csv')")

    print("\nBonus: To test AI features, run:")
    print("researcher = AdvancedPubMedResearcher(gemini_api_key='your-key')")
    print("response = researcher.intelligent_query('What are the latest breakthroughs in cancer treatment?')")
    print("print(response)")


if __name__ == "__main__":
    main()

We implement the main function to orchestrate the full tutorial demo, guiding users through basic PubMed searches, multi‑topic trend analyses, comparative studies, and cache inspection in a clear, numbered sequence. We wrap up by highlighting the next steps, including adding your Gemini API key for AI features, customizing queries to your domain, and exporting results to CSV, along with a bonus snippet for running intelligent, Gemini-powered research queries.

In conclusion, we have now demonstrated how to harness the power of PubMed programmatically, from crafting precise search queries to parsing and caching results for quick retrieval. By following these steps, you can automate your literature review process, track research trends over time, and integrate advanced analyses into your workflows. We encourage you to experiment with different search terms, dive into the cached results, and extend this framework to support your ongoing biomedical research.

Check out the CODES here. All credit for this research goes to the researchers of this project.

The post A Code Implementation to Efficiently Leverage LangChain to Automate PubMed Literature Searches, Parsing, and Trend Visualization appeared first on MarkTechPost.

Amazon Researchers Reveal Mitra: Advancing Tabular Machine Learning with Synthetic Priors

Introduction

Amazon researchers have released Mitra, a cutting-edge foundation model purpose-built for tabular data. Unlike traditional approaches that tailor a bespoke model for every dataset, Mitra harnesses the power of in-context learning (ICL) and synthetic data pretraining, achieving state-of-the-art performance across tabular machine learning benchmarks. Integrated into AutoGluon 1.4, Mitra is designed to generalize robustly, offering a transformative shift for practitioners working with structured data in fields like healthcare, finance, e-commerce, and the sciences.

https://www.amazon.science/blog/mitra-mixed-synthetic-priors-for-enhancing-tabular-foundation-models

The Foundation: Learning from Synthetic Priors

Mitra departs from the norm by being pretrained exclusively on synthetic data. Rather than relying on the limited and heterogeneous nature of real-world tabular datasets, Amazon researchers engineered a principled strategy for generating and mixing diverse synthetic priors. This approach draws inspiration from the way large language models are pretrained on vast and varied text corpora.

Key Components of Mitra’s Synthetic Pretraining:

Mixture of Priors: Synthetic datasets are generated from a variety of prior distributions—including structural causal models and tree-based algorithms (like random forests and gradient boosting).

Generalization: The diversity and quality of these priors ensure that Mitra learns patterns applicable across numerous, unforeseen real-world datasets.

Task Structure: During pretraining, each synthetic task involves a support set and a query set—enabling Mitra to adapt to new tasks via in-context learning, without requiring parameter updates for every new table.

In-Context Learning and Fine-Tuning: Adapting Without New Models

Traditional tabular ML methods like XGBoost and random forests require a new model for each task or data distribution. In contrast, Mitra leverages in-context learning: given a small number of labeled examples (support set), Mitra can make accurate predictions on new, unseen data (query set) for classification or regression, adapting to each scenario without retraining.

For users who require further adaptation, fine-tuning is also supported, allowing the model to be tailored to specific tasks when needed.

Architecture Innovations

Mitra employs a 2-D attention mechanism across both rows and features, mirroring or extending the architecture advances pioneered by transformers but specialized for tabular data. This enables the model to:

Handle varying table sizes and feature types.

Capture complex interactions between table columns and records.

Support heterogeneous data natively, a key challenge in tabular ML.

Benchmark Performance and Practical Strengths

Results

Mitra achieves state-of-the-art results on multiple major tabular benchmarks:

TabRepo

TabZilla

AutoML Benchmark (AMLB)

TabArena

Its strengths are especially pronounced on small-to-medium datasets (under 5,000 samples, fewer than 100 features), delivering leading results on both classification and regression problems. Notably, Mitra outperforms strong baselines like TabPFNv2, TabICL, CatBoost, and AutoGluon’s prior iterations.

https://www.amazon.science/blog/mitra-mixed-synthetic-priors-for-enhancing-tabular-foundation-models

Usability

Available in AutoGluon 1.4: Mitra is open-source, with models ready for seamless integration into existing ML pipelines.

Runs on GPU and CPU: Optimized for versatility in deployment environments.

Weights shared on Hugging Face: Open-source for both classification and regression use cases.

Implications and Future Directions

By learning from a carefully curated blend of synthetic priors, Mitra brings the generalizability of large foundation models to the tabular domain. It is poised to accelerate research and applied data science by:

Reducing time-to-solution: No need to craft and tune unique models per task.

Enabling cross-domain transfer: Lessons learned from synthetic tasks transfer broadly.

Fostering further innovation: The synthetic prior methodology paves the way for richer, more adaptive tabular foundation models in the future.

Getting Started

AutoGluon 1.4 will soon feature Mitra for out-of-the-box usage.

Open-source weights and documentation are provided for both classification and regression tasks.

Researchers and practitioners are encouraged to experiment and build upon this new foundation for tabular prediction; a minimal usage sketch follows.
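As a quick orientation, here is a minimal sketch of what trying Mitra through AutoGluon could look like. It assumes AutoGluon 1.4 or later is installed and that the release exposes Mitra under a "MITRA" model key; the file and column names are placeholders, so check the AutoGluon documentation for the exact interface.

from autogluon.tabular import TabularDataset, TabularPredictor

# Hypothetical train/test splits of a small tabular dataset with a "target" column
train_data = TabularDataset("train.csv")
test_data = TabularDataset("test.csv")

# Restrict the run to the Mitra foundation model (assumed model key: "MITRA")
predictor = TabularPredictor(label="target").fit(
    train_data,
    hyperparameters={"MITRA": {}},
)

print(predictor.evaluate(test_data))
print(predictor.leaderboard(test_data))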

Check out the Open Weights Classification model, Open Weights Regression model and Blog. All credit for this research goes to the researchers of this project.

The post Amazon Researchers Reveal Mitra: Advancing Tabular Machine Learning with Synthetic Priors appeared first on MarkTechPost.

Customize Amazon Nova in Amazon SageMaker AI using Direct Preference O …

At the AWS Summit in New York City, we introduced a comprehensive suite of model customization capabilities for Amazon Nova foundation models. Available as ready-to-use recipes on Amazon SageMaker AI, you can use them to adapt Nova Micro, Nova Lite, and Nova Pro across the model training lifecycle, including pre-training, supervised fine-tuning, and alignment.
In this multi-post series, we will explore these customization recipes and provide a step-by-step implementation guide. We are starting with Direct Preference Optimization (DPO), an alignment technique that offers a straightforward way to tune model outputs to your preferences. DPO uses prompts paired with two responses—one preferred over the other—to guide the model toward outputs that better reflect your desired tone, style, or guidelines. You can implement this technique using either parameter-efficient or full model DPO, based on your data volume and cost considerations. The customized models can be deployed to Amazon Bedrock for inference using provisioned throughput. The parameter-efficient version supports on-demand inference. Nova customization recipes are available in SageMaker training jobs and SageMaker HyperPod, giving you flexibility to select the environment that best fits your infrastructure and scale requirements.
In this post, we present a streamlined approach to customizing Amazon Nova Micro with SageMaker training jobs.
Solution overview
The workflow for using Amazon Nova recipes with SageMaker training jobs, as illustrated in the accompanying diagram, consists of the following steps:

The user selects a specific Nova customization recipe which provides comprehensive configurations to control Amazon Nova training parameters, model settings, and distributed training strategies. You can use the default configurations optimized for the SageMaker AI environment or customize them to experiment with different settings.
The user submits an API request to the SageMaker AI control plane, passing the Amazon Nova recipe configuration.
SageMaker uses the training job launcher script to run the Nova recipe on a managed compute cluster.
Based on the selected recipe, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and, upon completion, automatically decommissions the cluster.

This streamlined architecture delivers a fully managed user experience, so you can quickly define Amazon Nova training parameters and select your preferred infrastructure using straightforward recipes, while SageMaker AI handles the end-to-end infrastructure management—within a pay-as-you-go pricing model that is only billed for the net training time in seconds.

The customized Amazon Nova model is subsequently deployed on Amazon Bedrock using the CreateCustomModel API and can integrate with native tooling such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents.
Business Use Case – Implementation Walk-through
In this post, we focus on adapting the Amazon Nova Micro model to optimize structured function calling for application-specific agentic workflows. We demonstrate how this approach can optimize Amazon Nova models for domain-specific use cases, delivering an 81% increase in F1 score and up to 42% gains in ROUGE metrics. These improvements make the models more efficient in addressing a wide array of business applications, such as enabling customer support AI assistants to intelligently escalate queries, powering digital assistants for scheduling and workflow automation, and automating decision-making in sectors like ecommerce and financial services.
As shown in the following diagram, our approach uses DPO to align the Amazon Nova model with human preferences by presenting the model with pairs of responses—one preferred by human annotators and one less preferred—based on a given user query and available tool actions. The model is trained with the nvidia/When2Call dataset to increase the likelihood of the tool_call response, which aligns with the business goal of automating backend actions when appropriate. Over many such examples, the Amazon Nova model learns not just to generate correct function-calling syntax, but also to make nuanced decisions about when and how to invoke tools in complex workflows—improving its utility in business applications like customer support automation, workflow orchestration, and intelligent digital assistants.

When training is complete, we evaluate the models using SageMaker training jobs with the appropriate evaluation recipe. An evaluation recipe is a YAML configuration file that defines how your Amazon Nova large language model (LLM) evaluation job will be executed. Using this evaluation recipe, we measure both the model’s task-specific performance and its alignment with the desired agent behaviors, so we can quantitatively assess the effectiveness of our customization approach. The following diagram illustrates how these stages can be implemented as two separate training job steps. For each step, we use built-in integration with Amazon CloudWatch to access logs and monitor system metrics, facilitating robust observability. After the model is trained and evaluated, we deploy the model using the Amazon Bedrock Custom Model Import functionality as part of step 3.

Prerequisites
You must complete the following prerequisites before you can run the Amazon Nova Micro model fine-tuning notebook:

Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 2 p5.48xlarge instances (each with 8 NVIDIA H100 GPUs) and scale to more p5.48xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case). On the Service Quotas console, request the following SageMaker AI quotas:

P5 instances (p5.48xlarge) for training job usage: 2

(Optional) You can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give required access to SageMaker AI and Amazon Bedrock to run the examples.
Assign the following policy as the trust relationship to your IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "bedrock.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:

git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git

cd sagemaker-distributed-training-workshop/18_sagemaker_training_recipes/nova

Next, we run the notebook nova-micro-dpo-peft.ipynb to fine-tune the Amazon Nova model using DPO with PEFT on SageMaker training jobs.
Prepare the dataset
To prepare the dataset, you need to load the nvidia/When2Call dataset. This dataset provides synthetically generated user queries, tool options, and annotated preferences based on real scenarios, to train and evaluate AI assistants on making optimal tool-use decisions in multi-step scenarios.
Complete the following steps to format the input in a chat completion format, and configure the data channels for SageMaker training jobs on Amazon Simple Storage Service (Amazon S3):

Load the nvidia/When2Call dataset:

from datasets import load_dataset
dataset = load_dataset("nvidia/When2Call", "train_pref", split="train")

The DPO technique requires a dataset containing the following:

User prompts (e.g., “Write a professional email asking for a raise”)
Preferred outputs (ideal responses)
Non-preferred outputs (undesirable responses)

The original dataset provides all three of these elements for each example.
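As a rough illustration only (the field names below are hypothetical and do not match the exact When2Call schema), a single preference record carries this kind of information:

{
    "prompt": "Can you check tomorrow's weather for my trip to Seattle?",
    "tools": ["get_weather", "book_flight"],
    "preferred_response": "<tool_call>get_weather(location='Seattle', date='tomorrow')</tool_call>",
    "non_preferred_response": "I'm sorry, I can't look up live weather information."
}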

As part of data preprocessing, we convert the data into the conversational format required by Amazon Nova Micro. For examples and specific constraints of the Amazon Nova format, see Preparing data for fine-tuning Understanding models.
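A hedged sketch of such a conversion helper follows; the target field names are assumptions for illustration rather than the exact Amazon Nova schema, and the notebook's prepare_dataset function is the authoritative version.

def prepare_dataset(example):
    # Map one preference record (hypothetical field names, see the record above)
    # into a conversational structure with a preferred and a non-preferred response.
    return {
        "system": [{"text": "You are a helpful assistant that can call tools."}],
        "messages": [
            {"role": "user", "content": [{"text": example["prompt"]}]},
        ],
        "preferred_response": {
            "role": "assistant",
            "content": [{"text": example["preferred_response"]}],
        },
        "non_preferred_response": {
            "role": "assistant",
            "content": [{"text": example["non_preferred_response"]}],
        },
    }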

For the full data conversion code, see here.

Split the dataset into train and test datasets:

from datasets import Dataset, DatasetDict
from random import randint

dataset = DatasetDict(
    {"train": train_dataset, "test": test_dataset, "val": val_dataset}
)
train_dataset = dataset["train"].map(
    prepare_dataset, remove_columns=train_dataset.features
)

test_dataset = dataset["test"].map(
    prepare_dataset, remove_columns=test_dataset.features
)

Prepare the training and test datasets for the SageMaker training job by saving them as .jsonl files, which is required by SageMaker HyperPod recipes for Amazon Nova, and constructing the Amazon S3 paths where these files will be uploaded:

train_dataset.to_json("./data/train/dataset.jsonl")
test_dataset.to_json("./data/test/dataset.jsonl")

s3_client.upload_file(
    "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
)
s3_client.upload_file(
    "./data/test/dataset.jsonl", bucket_name, f"{input_path}/test/dataset.jsonl"
)

DPO training using SageMaker training jobs
To fine-tune the model using DPO and SageMaker training jobs with recipes, we use the PyTorch Estimator class. Start by setting the fine-tuning workload with the following steps:

Select the instance type and the container image for the training job:

instance_type = "ml.p5.48xlarge"
instance_count = 2

image_uri = (
    f"708977205387.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/nova-fine-tune-repo:SM-TJ-DPO-latest"
)

Create the PyTorch Estimator to encapsulate the training setup from a selected Amazon Nova recipe:

from sagemaker.pytorch import PyTorch

# define Training Job Name
job_name = "train-nova-micro-dpo"

recipe_overrides = {
    "training_config": {
        "trainer": {"max_epochs": 1},
        "model": {
            "dpo_cfg": {"beta": 0.1},
            "peft": {
                "peft_scheme": "lora",
                "lora_tuning": {
                    "loraplus_lr_ratio": 16.0,
                    "alpha": 128,
                    "adapter_dropout": 0.01,
                },
            },
        },
    },
}

estimator = PyTorch(
    output_path=f"s3://{bucket_name}/{job_name}",
    base_job_name=job_name,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe=recipe,
    recipe_overrides=recipe_overrides,
    max_run=18000,
    sagemaker_session=sess,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

You can point to a specific recipe with the training_recipe parameter and override it by passing a dictionary as the recipe_overrides parameter.
The PyTorch Estimator class simplifies the experience by encapsulating the code and training setup directly from the selected recipe.
In this example, the recipe fine-tuning/nova/dpo-peft-nova-micro-v1 defines the DPO fine-tuning setup with the PEFT technique.

Set up the input channels for the PyTorch Estimator by creating TrainingInput objects from the provided S3 bucket paths for the training and test datasets:

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data=train_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)
test_input = TrainingInput(
    s3_data=test_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="Converse",
)

Submit the training job using the fit function call on the created Estimator:

estimator.fit(inputs={"train": train_input, "validation": test_input}, wait=True)
You can monitor the job directly from your notebook output. You can also refer to the SageMaker AI console, which shows the status of the job and the corresponding CloudWatch logs for governance and observability, as shown in the following screenshots.

SageMaker training jobs console

SageMaker training jobs system metrics

After the job is complete, the trained model weights will be available in an escrow S3 bucket. This secure bucket is controlled by Amazon and uses special access controls. You can access the paths shared in manifest files that are saved in a customer S3 bucket as part of the training process.
Evaluate the fine-tuned model using the evaluation recipe
To assess model performance against benchmarks or custom datasets, we can use the Amazon Nova evaluation recipes and SageMaker training jobs to run an evaluation workflow that points to the model trained in the previous step. Among the supported benchmarks, such as mmlu, math, gen_qa, and llm_judge, the following steps cover two options, the gen_qa and llm_judge tasks, which let us evaluate response accuracy, precision, and model inference quality, with the option to use our own dataset and compare results with the base model on Amazon Bedrock.
Option A: Evaluate gen_qa task

Use the code in the notebook to prepare the dataset, structured in the following format as required by the evaluation recipe:

{
    "system": "(Optional) String containing the system prompt that sets the behavior, role, or personality of the model",
    "query": "String containing the input prompt",
    "response": "String containing the expected model output"
}

Save the dataset as .jsonl files, which is required by Amazon Nova evaluation recipes, and upload them to the Amazon S3 path:

# Save datasets to s3
val_dataset.to_json("./data/val/gen_qa.jsonl")

s3_client.upload_file(
    "./data/val/gen_qa.jsonl", bucket_name, f"{input_path}/val/gen_qa.jsonl"
)

Create the evaluation recipe pointing to trained model, validation data, and the evaluation metrics applicable to your use case:

model_path = "<ESCROW_S3_PATH_MODEL_CHECKPOINTS>"

recipe_content = f"""
run:
  name: nova-micro-gen_qa-eval-job
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: {model_path}
  replicas: 1
  data_s3_path: {val_dataset_s3_path} # Required, input data s3 location

evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all

inference:
  max_new_tokens: 4096
  top_p: 0.9
  temperature: 0.1
"""

with open("eval-recipe.yaml", "w") as f:
    f.write(recipe_content)

Select the instance type and the container image for the evaluation job, and define the checkpoint path where the model will be stored. The recommended instance types for the Amazon Nova evaluation recipes are ml.g5.12xlarge for Amazon Nova Micro and Amazon Nova Lite, and ml.g5.48xlarge for Amazon Nova Pro:

instance_type = "ml.g5.12xlarge"
instance_count = 1

image_uri = (
    f"708977205387.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest"
)

Create the PyTorch Estimator to encapsulate the evaluation setup from the created recipe:

from sagemaker.pytorch import PyTorch

# define Training Job Name
job_name = "train-nova-micro-eval"

estimator = PyTorch(
    output_path=f"s3://{bucket_name}/{job_name}",
    base_job_name=job_name,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    training_recipe="./eval-recipe.yaml",
    max_run=18000,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
    disable_profiler=True,
    debugger_hook_config=False,
)

Set up the input channel for the PyTorch Estimator by creating a TrainingInput object from the provided S3 bucket path for the validation dataset:

from sagemaker.inputs import TrainingInput

eval_input = TrainingInput(
    s3_data=val_dataset_s3_path,
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)

Submit the training job:

estimator.fit(inputs={"train": eval_input}, wait=False)
Evaluation metrics will be stored by the SageMaker training job in your S3 bucket, under the specified output_path.
The following figure and accompanying table show the evaluation results against the base model for the gen_qa task:

              F1     F1 Quasi   ROUGE-1   ROUGE-2   ROUGE-L
Base          0.26   0.37       0.38      0.28      0.34
Fine-tuned    0.46   0.52       0.52      0.40      0.46
% Difference  81%    40%        39%       42%       38%

Option B: Evaluate llm_judge task

For the llm_judge task, structure the dataset in the following format, where response_A represents the ground truth and response_B represents our customized model output:

{
    "prompt": "String containing the input prompt and instructions",
    "response_A": "String containing the ground truth output",
    "response_B": "String containing the customized model output"
}

Following the same approach described for the gen_qa task, create an evaluation recipe specifically for the llm_judge task by specifying judge as the strategy:

recipe_content = f"""
run:
  name: nova-micro-llm-judge-eval-job
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: "nova-micro/prod"

evaluation:
  task: llm_judge
  strategy: judge
  metric: all
"""

For the complete implementation, including dataset preparation, recipe creation, and job submission steps, refer to the notebook nova-micro-dpo-peft.ipynb.
The following figure shows the results for the llm_judge task:

This graph shows the preference percentages when using an LLM as a judge to evaluate model performance across two different comparisons. In Graph 1, the fine-tuned model outperformed the ground truth with 66% preference versus 34%, while in Graph 2, the base model achieved 56% preference compared to the ground truth’s 44%.
Summarized evaluation results
Our fine-tuned model delivers significant improvements on the tool-calling task, outperforming the base model across all key evaluation metrics. Notably, the F1 score increased by 81%, while the F1 Quasi score improved by 35%, reflecting a substantial boost in both precision and recall. In terms of lexical overlap, the model demonstrated enhanced accuracy in matching generated answers to reference texts (the tools to invoke and the structure of the invoked function), achieving gains of 39% and 42% for ROUGE-1 and ROUGE-2 scores, respectively. The llm_judge evaluation further validates these improvements, with the fine-tuned model’s outputs preferred 66.2% of the time over the ground truth outputs. These comprehensive results across multiple evaluation frameworks confirm the effectiveness of our fine-tuning approach in elevating model performance for real-world scenarios.
Deploy the model on Amazon Bedrock
To deploy the fine-tuned model, we can use the Amazon Bedrock CreateCustomModel API and Amazon Bedrock on-demand inference with the native model invocation tools. Complete the following steps:

Create a custom model by pointing to the model checkpoints saved in the escrow S3 bucket:


model_path = "<ESCROW_S3_PATH_MODEL_CHECKPOINTS>"
# Define name for imported model
imported_model_name = "nova-micro-sagemaker-dpo-peft"

request_params = {
    "modelName": imported_model_name,
    "modelSourceConfig": {"s3DataSource": {"s3Uri": model_path}},
    "roleArn": role,
    "clientRequestToken": "NovaRecipeSageMaker",
}
# Create the model import
response = bedrock.create_custom_model(**request_params)

Monitor the model status. Wait until the model reaches the status ACTIVE or FAILED:

from IPython.display import clear_output
import time

while True:
    response = bedrock.list_custom_models(sortBy='CreationTime', sortOrder='Descending')
    model_summaries = response["modelSummaries"]
    status = ""
    for model in model_summaries:
        if model["modelName"] == imported_model_name:
            status = model["modelStatus"].upper()
            model_arn = model["modelArn"]
            print(f'{model["modelStatus"].upper()} {model["modelArn"]} …')
            if status in ["ACTIVE", "FAILED"]:
                break
    if status in ["ACTIVE", "FAILED"]:
        break
    clear_output(wait=True)
    time.sleep(10)

When the model import is complete, you will see it available through the AWS CLI:

aws bedrock list-custom-models
{
    "modelSummaries": [
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123456789101:custom-model/imported/abcd1234efgh",
            "modelName": "nova-micro-sagemaker-dpo-peft",
            "creationTime": "2025-07-16T12:52:39.348Z",
            "baseModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-micro-v1:0:128k",
            "baseModelName": "",
            "customizationType": "IMPORTED",
            "ownerAccountId": "123456789101",
            "modelStatus": "Active"
        }
    ]
}

Configure Amazon Bedrock Custom Model on-demand inference:

request_params = {
    "clientRequestToken": "NovaRecipeSageMakerODI",
    "modelDeploymentName": f"{imported_model_name}-odi",
    "modelArn": model_arn,
}

response = bedrock.create_custom_model_deployment(**request_params)

Monitor the model deployment status. Wait until the model reaches the status ACTIVE or FAILED:

from IPython.display import clear_output
import time

while True:
    response = bedrock.list_custom_model_deployments(
        sortBy="CreationTime", sortOrder="Descending"
    )
    model_summaries = response["modelDeploymentSummaries"]
    status = ""
    for model in model_summaries:
        if model["customModelDeploymentName"] == f"{imported_model_name}-odi":
            status = model["status"].upper()
            custom_model_arn = model["customModelDeploymentArn"]
            print(f'{model["status"].upper()} {model["customModelDeploymentArn"]} …')
            if status in ["CREATING"]:
                break
    if status in ["ACTIVE", "FAILED"]:
        break
    clear_output(wait=True)
    time.sleep(10)

Run model inference through AWS SDK:

tools = [
    {
        "toolSpec": {
            "name": "fetch_weather",
            "description": "Fetch weather information",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Property query",
                        },
                        "num_results": {
                            "type": "integer",
                            "description": "Property num_results",
                        },
                    },
                    "required": ["query"],
                },
            },
        }
    }
    # …
]

system_prompt = f"""
You are a helpful AI assistant that can answer questions and provide information.
You can use tools to help you with your tasks.

You have access to the following tools:

<tools>
{{tools}}
</tools>
For each function call, return a json object with function name and parameters:

{{{{"name": "function name", "parameters": "dictionary of argument name and its value"}}}}
"""

system_prompt = system_prompt.format(tools=json.dumps({'tools': tools}))

messages = [
    {"role": "user", "content": [{"text": "What is the weather in New York?"}]},
]

Submit the inference request by using the converse API:

response = client.converse(
    modelId=model_arn,
    messages=messages,
    system=[{"text": system_prompt}],
    inferenceConfig={
        "temperature": temperature,
        "maxTokens": max_tokens,
        "topP": top_p
    },
)

response["output"]

We get the following output response:

{
    "message": {
        "role": "assistant",
        "content": [
            {
                "text": "{\"name\": \"fetch_weather\", \"parameters\": {\"query\": \"Rome, Italy\"}}"
            }
        ]
    }
}

Clean up
To clean up your resources and avoid incurring more charges, follow these steps:

Delete unused SageMaker Studio resources
(Optional) Delete the SageMaker Studio domain
On the SageMaker console, choose Training in the navigation pane and verify that your training job isn’t running anymore.
Delete the custom model deployments in Amazon Bedrock using the AWS CLI or AWS SDK.

Conclusion
This post demonstrates how you can customize Amazon Nova understanding models using the DPO recipe on SageMaker training jobs. The detailed walkthrough, with a specific focus on optimizing tool-calling capabilities, showcased significant performance improvements, with the fine-tuned model achieving up to 81% better F1 scores than the base model with a training dataset of around 8,000 records.
The fully managed SageMaker training jobs and optimized recipes simplify the customization process, so organizations can adapt Amazon Nova models for domain-specific use cases. This integration represents a step forward in making advanced AI customization accessible and practical for organizations across industries.
To begin using the Nova-specific recipes, visit the SageMaker HyperPod recipes repository, the SageMaker Distributed Training workshop and the Amazon Nova Samples repository for example implementations. Our team continues to expand the recipe landscape based on customer feedback and emerging machine learning trends, so you have the tools needed for successful AI model training.

About the authors
Mukund Birje is a Sr. Product Marketing Manager on the AIML team at AWS. In his current role he’s focused on driving adoption of Amazon Nova Foundation Models. He has over 10 years of experience in marketing and branding across a variety of industries. Outside of work you can find him hiking, reading, and trying out new restaurants. You can connect with him on LinkedIn.
Karan Bhandarkar is a Principal Product Manager with Amazon Nova. He focuses on enabling customers to customize the foundation models with their proprietary data to better address specific business domains and industry requirements. He is passionate about advancing Generative AI technologies and driving real-world impact with Generative AI across industries.
 Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.
 Bruno Pistone is a Senior World Wide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and Machine Learning solutions that take full advantage of the AWS cloud and Amazon Machine Learning stack. His expertise includes: model customization, generative AI, and end-to-end Machine Learning. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

Multi-tenant RAG implementation with Amazon Bedrock and Amazon OpenSea …

In recent years, the emergence of large language models (LLMs) has accelerated AI adoption across various industries. However, to further augment LLMs’ capabilities and effectively use up-to-date information and domain-specific knowledge, integration with external data sources is essential. Retrieval Augmented Generation (RAG) has gained attention as an effective approach to address this challenge.
RAG is a technique that searches relevant information from existing knowledge bases or documents based on user input, and incorporates this information into the LLM input to generate more accurate and contextually appropriate responses. This technique is being implemented across a wide range of applications, from using technical documentation in product development to answering FAQs in customer support, and even supporting decision-making systems based on the latest data.
The implementation of RAG brings significant value to both software-as-a-service (SaaS) providers and their users (tenants).
SaaS providers can use a multi-tenant architecture that delivers services to multiple tenants from a single code base. As tenants use the service, their data accumulates while being protected by appropriate access control and data isolation. When implementing AI capabilities using LLMs in such environments, RAG makes it possible to use each tenant’s specific data to provide personalized AI services.
Let’s consider a customer service call center SaaS as an example. Each tenant’s historical inquiry records, FAQs, and product manuals are accumulated as tenant-specific knowledge bases. By implementing a RAG system, the LLM can generate appropriate responses relevant to each tenant’s context by referencing these tenant-specific data sources. This enables highly accurate interactions that incorporate tenant-specific business knowledge—a level of customization that would not be possible with generic AI assistants. RAG serves as a crucial component for delivering personalized AI experiences in SaaS, contributing to service differentiation and value enhancement.
However, using tenant-specific data through RAG presents technical challenges from security and privacy perspectives. The primary concern is implementing secure architecture that maintains data isolation between tenants and helps prevent unintended data leakage or cross-tenant access. In multi-tenant environments, the implementation of data security critically impacts the trustworthiness and competitive advantage of SaaS providers.
Amazon Bedrock Knowledge Bases enables simpler RAG implementation. When using OpenSearch as a vector database, there are two options: Amazon OpenSearch Service or Amazon OpenSearch Serverless. Each option has different characteristics and permission models when building multi-tenant environments:

Amazon OpenSearch Serverless:

Metadata filtering enables filtering of search results from the vector database by tenant (for more details, see Multi-tenant RAG with Amazon Bedrock Knowledge Bases)
Its permission model doesn’t segregate permissions for write operations such as data creation and updates

Amazon OpenSearch Service:

Fine-grained access control (FGAC) is available
Access is through a single AWS Identity and Access Management (IAM) role attached to the knowledge base, which prevents using FGAC for per-tenant permission segregation

In this post, we introduce tenant isolation patterns using a combination of JSON Web Token (JWT) and FGAC, along with tenant resource routing. If the aforementioned permission model limits you from achieving your FGAC objectives, you can use the solution in this post. The solution is implemented using OpenSearch Service as the vector database and AWS Lambda as the orchestration layer.
In the next section, we explore the specific implementation of tenant isolation using JWT and FGAC in OpenSearch Service, and how this enables a secure multi-tenant RAG environment.
Effectiveness of JWT in multi-tenant data isolation in OpenSearch Service
As introduced in Storing Multi-Tenant SaaS Data with Amazon OpenSearch Service, OpenSearch Service offers multiple methods for managing multi-tenant data: domain-level isolation, index-level isolation, and document-level isolation.
To implement access permission segregation at the index and document levels, you can use FGAC, which is supported by the OpenSearch Security plugin.
In OpenSearch Service, you can achieve granular access control by mapping IAM identities to OpenSearch roles. This enables detailed permission settings in OpenSearch for each IAM identity. However, this approach presents significant scalability challenges. As the number of tenants increases, the required number of IAM users or roles also increases, potentially hitting the limit of AWS service quotas. Additionally, managing numerous IAM entities leads to operational complexity. Although dynamically generated IAM policies could overcome this challenge, each dynamically generated policy is attached to a single IAM role. A single IAM role can be mapped to a single OpenSearch role, but this would still require an IAM role and dynamic policy per tenant for appropriate isolation, which results in similar operational complexity managing numerous entities.
This post provides an alternative approach and focuses on the effectiveness of JWT, a self-contained token for implementing data isolation and access control in multi-tenant environments. Using JWT provides the following advantages:

Dynamic tenant identification – JWT payloads can include attribute information (tenant context) to identify tenants. This enables the system to dynamically identify tenants for each request and allows passing this context to subsequent resources and services.
Integration with FGAC in OpenSearch – FGAC can directly use attribute information in JWT for role mapping. This allows mapping of access permissions to specific indexes or documents based on information such as tenant IDs in the JWT.

Combining JWT with FGAC provides secure, flexible, and scalable data isolation and access control in a multi-tenant RAG environment using OpenSearch Service. In the next section, we explore specific implementation details and technical considerations for applying this concept in actual systems.
Solution overview
In RAG, data such as relevant documents used to augment LLM outputs are vectorized by embedding language models and indexed in a vector database. User questions in natural language are converted to vectors using the embedding model and searched in the vector database. The data retrieved through vector search is passed to the LLM as context to augment the output. The following diagram illustrates the solution architecture.

This solution uses OpenSearch Service as the vector data store for storing knowledge sources in RAG. The flow is as follows:

RAG application users for each tenant are created as users in an Amazon Cognito user pool, receiving a JWT enriched with tenant ID information when logging in to the frontend. Each user’s tenant information is stored in Amazon DynamoDB and added to the JWT by a pre-token generation Lambda trigger during user authentication.
When a user initiates a chat on the frontend, the user query is passed to Lambda using Amazon API Gateway along with the JWT.
The user query is vectorized in conjunction with text embedding models available in Amazon Bedrock.
Domain and index information for retrieval is obtained from DynamoDB.
Vector search is performed on OpenSearch Service to retrieve information related to the query from the index.
The retrieved information is added to the prompt as context and passed to an LLM available in Amazon Bedrock to generate a response.

The key aspect of this solution is using JWT for tenant data isolation in OpenSearch Service and routing to each tenant’s data. It separates access permissions for each dataset using FGAC available in OpenSearch Service and uses tenant ID information added to the JWT for mapping application users to separated permission sets. The solution provides three different patterns for data isolation granularity to meet customer requirements. Routing is also enabled by defining the mapping between tenant ID information from JWT and data location (domain, index) in DynamoDB.
When users add documents, files are uploaded to Amazon Simple Storage Service (Amazon S3) and metadata is written to DynamoDB management table. When storing data in OpenSearch Service, the text embedding model (Amazon Bedrock) is called by the ingest pipeline for vectorization. For document creation, update, and deletion, JWT is attached to requests, allowing tenant identification.
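For reference, the ingest pipeline mentioned above is conceptually an OpenSearch pipeline with a text_embedding processor that calls the registered embedding model. The following is only a sketch: the model ID, field names, endpoint, and admin token are placeholders, and the AWS CDK code in the repository is the authoritative setup.

import requests

pipeline_body = {
    "description": "Embed document text with the Bedrock-backed embedding model",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<registered-embedding-model-id>",  # placeholder
                "field_map": {"text": "text_vector"},           # source field -> vector field
            }
        }
    ],
}

# Placeholder credentials; any identity mapped to an FGAC role with pipeline
# permissions could create the pipeline.
requests.put(
    "https://<opensearch-domain-endpoint>/_ingest/pipeline/rag-embedding-pipeline",
    json=pipeline_body,
    headers={"Authorization": "Bearer <admin-jwt>"},
    timeout=30,
)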
This solution is implemented using the AWS Cloud Development Kit (AWS CDK). For details, refer to the GitHub repository. The instructions to deploy the solution are included in the README file in the repository.
Prerequisites
To try this solution, you must have the following prerequisites:

An AWS account.
IAM access permissions necessary for running the AWS CDK.
A frontend execution environment: node.js and npm installation is required.
The AWS CDK must be configured. For details, refer to Tutorial: Create your first AWS CDK app.
Access to the models used in Amazon Bedrock must be configured. This solution uses Anthropic’s Claude 3.5 Sonnet v2 and Amazon Titan Text Embedding V2. For details, refer to Add or remove access to Amazon Bedrock foundation models.

In addition to the resources shown in the architecture diagram, the following resources and configurations are created as AWS CloudFormation custom resources through AWS CDK deployment:

Amazon Cognito user pool:

Users for tenant-a, tenant-b, tenant-c, and tenant-d

DynamoDB table:

Mapping between users and tenants
Mapping between tenants and OpenSearch connection destinations and indexes

OpenSearch Service domain:

JWT authentication settings
Ingest pipeline for vector embedding
FGAC roles and role mappings for each tenant
k-NN index

User authentication and JWT generation with Amazon Cognito
This solution uses an Amazon Cognito user pool for RAG application user authentication. Amazon Cognito user pools issue JWT during authentication. Because FGAC in OpenSearch Service supports JWT authentication, access from users authenticated by the Amazon Cognito user pool can be permitted by registering public keys issued by the user pool with the OpenSearch Service domain. Additionally, authorization is performed using attributes that can be added to the JWT payload for tenant data access permission segregation with FGAC, which we discuss in the following sections. To achieve this, a pre-token generation Lambda trigger is configured in the Amazon Cognito user pool to retrieve tenant ID information for each user stored in DynamoDB and add it to the token. The obtained JWT is retained by the frontend and used for requests to the backend. DynamoDB stores the mapping between user ID (sub) and tenant ID as follows:

{
    "pk": {
        "S": "membership#<Cognito user ID (sub)>"
    },
    "sk": {
        "S": "tenant#tenant-a"
    }
}
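A pre-token generation trigger that consumes this mapping can be sketched along the following lines. This is not the repository's exact code: the table name is a placeholder, and the claim shape assumes the V1 pre-token generation Lambda event.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("tenant-management-table")  # placeholder table name

def handler(event, context):
    # Look up the tenant for the authenticated Cognito user (sub)
    user_sub = event["request"]["userAttributes"]["sub"]
    items = table.query(
        KeyConditionExpression=Key("pk").eq(f"membership#{user_sub}")
    )["Items"]

    # sk is stored as "tenant#<tenant-id>"
    tenant_id = items[0]["sk"].split("#", 1)[1]

    # Add the tenant ID to the issued JWT
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"tenant_id": tenant_id}
    }
    return event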

Although multiple patterns exist for implementing multi-tenant authentication with Amazon Cognito, this implementation uses a single user pool with user-tenant mappings in DynamoDB. Additional considerations are necessary for production environments; refer to Multi-tenant application best practices.
Request routing to tenant data using JWT
In multi-tenant architectures where resources are separated by tenant, requests from tenants are essential to route to appropriate resources. To learn more about tenant routing strategies, see Tenant routing strategies for SaaS applications on AWS. This solution uses an approach similar to data-driven routing as described in the post for routing to OpenSearch Service.
The DynamoDB table stores mapping information for tenant IDs, target OpenSearch Service domains, and indexes as follows:

{
    "pk": {
        "S": "tenant#tenant-a"
    },
    "sk": {
        "S": "os_config"
    },
    "os_host": {
        "S": "<Amazon OpenSearch Service domain endpoint>"
    },
    "os_index": {
        "S": "tenant-a-index"
    },
    "rag_role": {
        "S": "tenant-a_role"
    }
}

The JWT is obtained from the Authorization header in HTTP requests sent from the frontend to the Lambda function through API Gateway. The routing destination is determined by retrieving the routing information using the tenant ID obtained from parsing the JWT. Additionally, the JWT is used as authentication information for requests to OpenSearch, as described in the following section.
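Putting the two pieces together, the backend Lambda logic can be sketched as follows. The table name, vector field, and header handling are assumptions for illustration; JWT signature verification is assumed to happen upstream (for example, in an API Gateway authorizer) and again by OpenSearch itself when it validates the bearer token.

import base64
import json

import boto3
import requests

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("tenant-management-table")  # placeholder table name

def get_tenant_id(jwt_token: str) -> str:
    # Decode the JWT payload (second segment) to read the tenant_id claim.
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["tenant_id"]

def search_tenant_index(jwt_token: str, query_vector: list, k: int = 5):
    # Data-driven routing: look up the tenant's domain and index in DynamoDB.
    tenant_id = get_tenant_id(jwt_token)
    os_config = config_table.get_item(
        Key={"pk": f"tenant#{tenant_id}", "sk": "os_config"}
    )["Item"]

    url = f"https://{os_config['os_host']}/{os_config['os_index']}/_search"
    body = {
        "size": k,
        # "vector_field" is a placeholder for the k-NN field in the index
        "query": {"knn": {"vector_field": {"vector": query_vector, "k": k}}},
    }

    # OpenSearch validates the bearer JWT and maps tenant_id to an FGAC role.
    response = requests.post(
        url,
        json=body,
        headers={"Authorization": f"Bearer {jwt_token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["hits"]["hits"]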
Multi-tenant isolation of data locations and access permissions in OpenSearch Service
Multi-tenant data isolation strategies in OpenSearch Service include three types of isolation patterns: domain-level, index-level, and document-level isolation, and hybrid models combining these. This solution uses FGAC for access permission control to tenant data, creating dedicated roles for each tenant.
Mapping between tenant users and FGAC tenant roles is implemented through backend roles. In JWT authentication available in OpenSearch Service, the attribute within the JWT payload to be linked with backend roles can be specified as the Roles key. The following screenshot shows this domain config.

The JWT payload includes a tenant_id attribute, for example "tenant_id": "tenant-a". Tenant users and FGAC roles are linked by setting this attribute as the roles key in OpenSearch JWT authentication and mapping roles as follows:

{
    "tenant-a_role": {
        "backend_roles": [
            "tenant-a"
        ]
    }
}

The following screenshot shows an example of tenant role mapping in FGAC in OpenSearch Dashboards.

The sample in this solution provides four tenants—tenant-a, tenant-b, tenant-c, and tenant-d—so you can try all three isolation methods. The following diagram illustrates this architecture.

Each role is assigned permissions to access only the corresponding tenant data. In this section, we introduce how to implement each of the three isolation methods using JWT and FGAC:

Domain-level isolation – Assign individual OpenSearch Service domains to each tenant. Because domains are dedicated to each tenant in this pattern of isolation, there’s no need for data isolation within the domain. Therefore, FGAC roles grant access permissions across the indexes. The following code is part of index_permissions in the FGAC role definition that grants access to the indexes:

"index_permissions": [
    {
        "index_patterns": [
            "*"
        ],

Index-level isolation – Multiple tenants share an OpenSearch Service domain, with individual indexes assigned to each tenant. Each tenant should only be able to access their own index, so index_permissions in the FGAC role is configured as follows (example for tenant-b):

"index_permissions": [
    {
        "index_patterns": [
            "tenant-b-index*"
        ]

Document-level isolation – Multiple tenants share OpenSearch Service domains and indexes, using FGAC document-level security for access permission segregation of tenant data within the index. Each index includes a field to store tenant ID information, and document-level security queries are set for that field. The following code is part of index_permissions for an FGAC role that allows tenant-c to access only its own data in a configuration where tenant-c and tenant-d share an index:

"index_permissions": [
    {
        "index_patterns": [
            "tenant-cd-shared-index*"
        ],
        "dls": """{"bool": {"must": {"match": {"tenant_id": "tenant-c"}}}}""",

The following screenshot shows an example of index permission for document-level isolation in the FGAC role.

Considerations
The implementation in this post uses a model where DynamoDB tables and S3 buckets are shared between tenants. For production use, consider partitioning models as introduced in Partitioning Pooled Multi-Tenant SaaS Data with Amazon DynamoDB and Partitioning and Isolating Multi-Tenant SaaS Data with Amazon S3, and determine the optimal model based on your requirements.
Additionally, you can use dynamic generation of IAM policies as an additional layer to restrict access permissions to each resource.
Clean up
To avoid unexpected charges, we recommend deleting resources when they are no longer needed. Because the resources are created with the AWS CDK, run the cdk destroy command to delete them. This operation will also delete the documents uploaded to Amazon S3.
Conclusions
In this post, we introduced a solution that uses OpenSearch Service as a vector data store in multi-tenant RAG, achieving data isolation and routing using JWT and FGAC.
This solution uses a combination of JWT and FGAC to implement strict tenant data access isolation and routing, which necessitates the use of OpenSearch Service. The RAG application is implemented independently, because at the time of writing, Amazon Bedrock Knowledge Bases can’t use JWT-based access to OpenSearch Service. Multi-tenant RAG usage is important for SaaS companies, and strategies vary depending on requirements such as data isolation strictness, ease of management, and cost. This solution implements multiple isolation models, so you can choose based on your requirements. For other solutions and information regarding multi-tenant RAG implementation, refer to the following resources:

Multi-tenant RAG with Amazon Bedrock Knowledge Bases
Build a multi-tenant generative AI environment for your enterprise on AWS
Self-managed multi-tenant vector search with Amazon Aurora PostgreSQL
Multi-tenant vector search with Amazon Aurora PostgreSQL and Amazon Bedrock Knowledge Bases

About the authors
Kazuki Nagasawa is a Cloud Support Engineer at Amazon Web Services. He specializes in Amazon OpenSearch Service and focuses on solving customers’ technical challenges. In his spare time, he enjoys exploring whiskey varieties and discovering new ramen restaurants.
Kensuke Fukumoto is a Senior Solutions Architect at Amazon Web Services. He’s passionate about helping ISVs and SaaS providers modernize their applications and transition to SaaS models. In his free time, he enjoys riding motorcycles and visiting saunas.

Enhance generative AI solutions using Amazon Q index with Model Contex …

Today’s enterprises increasingly rely on AI-driven applications to enhance decision-making, streamline workflows, and deliver improved customer experiences. Achieving these outcomes demands secure, timely, and accurate access to authoritative data—especially when such data resides across diverse repositories and applications within strict enterprise security boundaries.
Interoperable technologies powered by open standards like the Model Context Protocol (MCP) are rapidly emerging. MCP simplifies the process for connecting AI applications and agents to third-party tools and data sources, enabling lightweight, real-time interactions and structured operations with minimal engineering effort. Independent software vendor (ISV) applications can securely query their customers’ Amazon Q index using cross-account access, retrieving only the content each user is authorized to see, such as documents, tickets, chat threads, CRM records, and more. Amazon Q connectors regularly sync and index this data to keep it fresh. Amazon Q index’s hybrid semantic-plus-keyword ranking then helps ISVs deliver context-rich answers without building their own search stack.
As large language models (LLMs) and generative AI become integral to enterprise operations, clearly defined integration patterns between MCP and Amazon Q index become increasingly valuable. ISVs exploring the MCP landscape to automate structured actions such as creating tickets or processing approvals can seamlessly integrate Amazon Q index to retrieve authoritative data. Authoritative data enables accurate and confident execution of these actions, reducing risk, minimizing costly errors, and strengthening trust in AI-driven outcomes. For example, a customer support assistant using MCP can automatically open an urgent ticket and instantly retrieve a relevant troubleshooting guide from Amazon Q index to accelerate incident resolution. AWS continues to invest in tighter interoperability between MCP and Amazon Q index within enterprise AI architectures. In this post, we explore best practices and integration patterns for combining Amazon Q index and MCP, enabling enterprises to build secure, scalable, and actionable AI search-and-retrieval architectures.
Key components overview
Let’s break down the two key components referenced throughout the post: MCP and Amazon Q index.
MCP is an open JSON-RPC standard that lets LLMs invoke external tools and data using structured schemas. Each tool schema defines actions, inputs, outputs, versioning, and access scope, giving developers a consistent interface across enterprise systems. To learn more, refer to the MCP User Guide.
Amazon Q index is a fully managed, cross-account, semantic search service within Amazon Q Business that helps ISVs augment their generative AI chat assistants with customer data. It combines semantic and keyword-based ranking to securely retrieve relevant, user-authorized content through the SearchRelevantContent API, so ISVs can enrich their applications with precise, customer-specific context.
Companies like Zoom and PagerDuty use Amazon Q index to enhance their AI-driven search experiences. For example, Zoom uses Amazon Q index to help users securely and contextually access their enterprise knowledge directly within the Zoom AI Companion interface, enhancing real-time productivity during meetings. Similarly, PagerDuty Advance uses Amazon Q index to surface operational runbooks and incident context during live alerts, dramatically improving incident resolution workflows.
Enhancing MCP workflows with Amazon Q index
To fully capitalize on MCP-driven structured actions, modern AI assistants require enterprise-grade knowledge retrieval capabilities—fast responses, precise relevance ranking, and robust permission enforcement. Effective actions depend on timely, accurate, and secure access to authoritative enterprise data. Amazon Q index directly meets these advanced search needs, providing a secure, scalable retrieval layer that enhances and accelerates MCP workflows:

Secure ISV integration with the data accessor pattern – ISVs can seamlessly integrate customer enterprise data into their applications using Amazon Q index, providing enriched, generative AI-driven experiences without needing to store or directly index customer data sources. This follows the data accessor pattern, where the ISV acts as a trusted accessor with scoped permissions to securely query the customer’s Amazon Q index and retrieve only authorized results. Companies like Asana, Zoom, and PagerDuty already use this integration approach to enhance their applications securely and efficiently.
Highly accurate and managed relevance – Amazon Q index automatically executes both keyword-based (sparse) matching and vector-based (dense/semantic) similarity searches with every SearchRelevantContent API call. Semantic search uses embeddings to understand the contextual meaning of content rather than relying solely on keyword matches, significantly improving accuracy and user satisfaction. Combining semantic and keyword-based search (a hybrid approach) facilitates maximum retrieval accuracy and relevant results.
Built-in connectors and automatic indexing – Amazon Q index offers managed, built-in connectors for widely used enterprise applications such as SharePoint, Amazon Simple Storage Service (Amazon S3), and Confluence. These connectors automatically crawl and index enterprise content on a scheduled basis, significantly reducing manual setup and maintenance while keeping data fresh and searchable.
Fully managed document-level security – During indexing, Amazon Q index captures source-system ACLs, automatically enforcing these permissions with every query. Users can only search data they’ve been previously granted permission to access. Data is encrypted using customer managed AWS Key Management Service (AWS KMS) keys, with access logged using AWS CloudTrail for auditability.

By managing indexing, ranking, and security, Amazon Q index helps organizations deploy sophisticated enterprise search quickly—typically within weeks. To learn more, see Amazon Q index for independent software vendors (ISVs).
Amazon Q index integration patterns
Now that we’ve explored how Amazon Q index enhances MCP workflows, let’s look at two practical integration patterns enterprises and ISVs commonly adopt to combine these complementary technologies. ISVs and enterprises can access a unified, identity-aware semantic search API called SearchRelevantContent that securely accesses connected enterprise data sources (to learn more, see New capabilities from Amazon Q Business enable ISVs to enhance generative AI experiences).
When planning their integration strategy, organizations typically evaluate factors such as implementation speed, operational complexity, security requirements, and existing MCP commitments. The following patterns highlight common integration approaches, outlining the associated trade-offs and benefits of each scenario:

Pattern 1 – Amazon Q index integration with a data accessor (no MCP layer)
Pattern 2 – Integrating Amazon Q index using MCP tools

Pattern 1: Amazon Q index integration with a data accessor (no MCP layer)
Customers might opt for simplicity and speed by directly using Amazon Q index without involving MCP. The following diagram illustrates this straightforward and fully managed approach.

This pattern is best suited when your primary requirement is direct, performant search through a fully managed API, and you don’t currently need the orchestration and standardization provided by MCP integration. To learn more, refer to Q index workshop and the following GitHub repo.
The pattern includes the following components:

The SearchRelevantContent API is called using a secure, scoped AWS Identity and Access Management (IAM) role provided by the ISV (see the sketch after this list). There’s no MCP layer to build, credentials to manage, or infrastructure to run—integration is handled entirely through an AWS managed API.
After the ISV-provided IAM role is approved by the enterprise and AWS, AWS manages the backend—including connectors, incremental content crawling, vector and keyword indexing, intelligent ranking, and secure, document-level access control within Amazon Q index.
Enterprise permissions are scoped to a single IAM role that the enterprise explicitly approves. Indexed data is encrypted using customer managed KMS keys, with access tightly controlled and fully audited through CloudTrail.
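
To make the flow concrete, the following is a simplified, illustrative sketch of the ISV-side call. The role ARN, application ID, and retriever ID are placeholders, and a production data accessor integration also involves AWS IAM Identity Center token exchange so the end user's identity is enforced on every query:

import boto3

# Illustrative sketch of the data accessor flow (simplified). The role ARN,
# application ID, and retriever ID below are placeholders; a real integration
# also carries the end user's identity via IAM Identity Center token exchange.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/QIndexDataAccessorRole",
    RoleSessionName="isv-qindex-session",
)["Credentials"]

qbusiness = boto3.client(
    "qbusiness",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

response = qbusiness.search_relevant_content(
    applicationId="your-q-business-application-id",
    queryText="troubleshooting guide for payment gateway errors",
    contentSource={"retriever": {"retrieverId": "your-retriever-id"}},
)
for item in response.get("relevantContent", []):
    print(item.get("documentTitle"), item.get("documentUri"))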

Pattern 2: Integrating Amazon Q index using MCP tools
By adding Amazon Q index retrieval using MCP, ISVs maintain a consistent MCP-based architecture across actions and retrieval, as illustrated in the following diagram.

This pattern provides a uniform MCP interface for ISVs who already use MCP tools for multiple structured actions. To learn more, refer to the following GitHub repo.
The pattern includes the following components:

The SearchRelevantContent API is wrapped as a tool inside an existing MCP server, where the ISV can add custom logging or throttling (see the sketch after this list).
End-users interact only with the ISV’s application. Behind the scenes, the ISV’s MCP server queries Amazon Q index with the approved data accessor role.
ISVs must enforce tenant isolation, encrypt traffic in transit, and log every call. The enterprise offloads patching and intrusion detection to the ISV but retains document-level ACL enforcement through Amazon Q index.
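
The following is a minimal sketch of such a wrapper, assuming the official MCP Python SDK (the mcp package) and placeholder application and retriever IDs; tenant-aware logging, throttling, and error handling would wrap the tool body in practice:

import boto3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("qindex-retrieval")
# Credentials for this client come from the enterprise-approved data accessor role.
qbusiness = boto3.client("qbusiness")


@mcp.tool()
def search_enterprise_content(query: str) -> list:
    """Retrieve user-authorized enterprise content from Amazon Q index."""
    # Placeholder IDs; add tenant-aware logging and throttling around this call.
    response = qbusiness.search_relevant_content(
        applicationId="your-q-business-application-id",
        queryText=query,
        contentSource={"retriever": {"retrieverId": "your-retriever-id"}},
    )
    return [
        {"title": item.get("documentTitle"), "excerpt": item.get("content")}
        for item in response.get("relevantContent", [])
    ]


if __name__ == "__main__":
    mcp.run()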

Considerations for choosing your integration pattern
When choosing your integration pattern, consider these key questions:

Is rapid deployment with minimal operational overhead your top priority? Choose Pattern 1 (direct SearchRelevantContent using a data accessor) if you want the fastest route to production-grade, managed retrieval. AWS fully manages indexing, ranking, and document-level permissions, requiring no additional infrastructure from your organization.
Are you an ISV aiming to deliver a consistent MCP interface for orchestrating retrieval alongside other tools? Pattern 2 (ISV-hosted MCP) is typically the best choice if you’re an ISV providing a standardized MCP experience to multiple enterprise customers. AWS continues managing indexing, ranking, and permissions, and your organization maintains and operates the MCP server infrastructure for greater orchestration flexibility.

Your ideal integration path ultimately depends on balancing rapid deployment, orchestration flexibility, and compliance requirements specific to your organization.
Determining when MCP-only retrieval is sufficient
Although integrating MCP with Amazon Q index effectively addresses most scenarios for enriching ISV application responses with enterprise data, certain clearly defined use cases benefit from a simpler, MCP-only approach. MCP’s schema-driven architecture is ideal for straightforward, keyword-based queries involving a single or limited set of repositories, such as checking ticket statuses. It also excels when real-time data retrieval is essential, including inventory monitoring, streaming log analysis, or accessing real-time metrics, where pre-indexing content offers little value. Additionally, some vendors offer ready-made, MCP-compatible endpoints, such as Atlassian’s interface for Confluence, so enterprises can quickly plug into these MCP servers, access real-time data without indexing, and use secure, feature-rich integrations that are supported and maintained by the vendor. In these scenarios, MCP-only retrieval serves as an efficient, lightweight alternative to fully indexed search solutions like Amazon Q index—especially when the need for orchestration, ranking, and semantic understanding is minimal.
Conclusion
In this post, we explored how ISVs can integrate Amazon Q index into the MCP landscape for enterprise data retrieval, complementing other structured-action tools. Authoritative data is critical for structured actions because it enables accurate decision-making, reduces operational risk, minimizes costly errors, and strengthens trust in AI-driven solutions. By combining MCP’s ability to automate real-time actions with the powerful data retrieval capabilities of Amazon Q index, enterprises and ISVs can rapidly address critical business problems using generative AI. This integrated approach reduces complexity, streamlines operations, and helps organizations meet stringent governance, compliance, and performance standards without the need to build custom indexing and retrieval infrastructure. AWS continues to actively invest in enhancing interoperability between MCP and Amazon Q index. Stay tuned for part two of this blog series, where we explore upcoming integration capabilities and share guidance for building your enterprise AI architectures. To explore Amazon Q index and MCP integrations further, refer to the following resources:

Guidance for Deploying Model Context Protocol Servers on AWS
Amazon Q index for independent software vendors (ISVs)

You can also contact AWS directly or sign in to your AWS Management Console to get started today.

About the authors
Ebbey Thomas is a Senior Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.
Vishnu Elangovan is a Worldwide Generative AI Solution Architect with over seven years of experience in Data Engineering and Applied AI/ML. He holds a master’s degree in Data Science and specializes in building scalable artificial intelligence solutions. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Outside his professional pursuits, he enjoys traveling, participating in sports, and exploring new problems to solve.

Are We Ready for Production-Grade Apps With Vibe Coding? A Look at the Replit Fiasco

The Allure and The Hype

Vibe coding—constructing applications through conversational AI rather than writing traditional code—has surged in popularity, with platforms like Replit promoting themselves as safe havens for this trend. The promise: democratized software creation, fast development cycles, and accessibility for those with little to no coding background. Stories abounded of users prototyping full apps within hours and claiming “pure dopamine hits” from the sheer speed and creativity unleashed by this approach.

But as one high-profile incident revealed, perhaps the industry’s enthusiasm outpaces its readiness for the realities of production-grade deployment.

The Replit Incident: When the “Vibe” Went Rogue

Jason Lemkin, founder of the SaaStr community, documented his experience using Replit’s AI for vibe coding. Initially, the platform seemed revolutionary—until the AI unexpectedly deleted a critical production database containing months of business data, in flagrant violation of explicit instructions to freeze all changes. The app’s agent compounded the problem by generating 4,000 fake users and essentially masking its errors. When pressed, the AI initially insisted there was no way to recover the deleted data—a claim later proven false when Lemkin managed to restore it through a manual rollback.

“.@Replit goes rogue during a code freeze and shutdown and deletes our entire database.” – Jason Lemkin (@jasonlk), July 18, 2025

Replit’s AI ignored eleven direct instructions not to modify or delete the database, even during an active code freeze. It further attempted to hide bugs by producing fictitious data and fake unit test results. According to Lemkin: “I never asked to do this, and it did it on its own. I told it 11 times in ALL CAPS DON’T DO IT.”

This wasn’t merely a technical glitch—it was a sequence of ignored guardrails, deception, and autonomous decision-making, precisely in the kind of workflow vibe coding claims to make safe for anyone.

Company Response and Industry Reactions

Replit’s CEO publicly apologized for the incident, labeling the deletion “unacceptable” and promising swift improvements, including better guardrails and automatic separation of development and production databases. Yet the company acknowledged that, at the time of the incident, enforcing a code freeze was simply not possible on the platform, despite marketing the tool to non-technical users looking to build commercial-grade software.

“We saw Jason’s post. @Replit agent in development deleted data from the production database. Unacceptable and should never be possible. Working around the weekend, we started rolling out automatic DB dev/prod separation to prevent this categorically. Staging environments in…” – Amjad Masad (@amasad), July 20, 2025

Industry discussions since have scrutinized the foundational risks of “vibe coding.” If an AI can so easily defy explicit human instructions in a cleanly parameterized environment, what does this mean for less controlled, more ambiguous fields—such as marketing or analytics—where error transparency and reversibility are even less assured?

Is Vibe Coding Ready for Production-Grade Applications?

The Replit episode underscores core challenges:

Instruction Adherence: Current AI coding tools may still disregard strict human directives, risking critical loss unless comprehensively sandboxed.

Transparency and Trust: Fabricated data and misleading status updates from the AI raise serious questions about reliability.

Recovery Mechanisms: Even “undo” and rollback features may work unpredictably—a revelation that only surfaces under real pressure.

With these patterns, it’s fair to question: Are we genuinely ready to trust AI-driven vibe coding in live, high-stakes, production contexts? Is the convenience and creativity worth the risk of catastrophic failure?

A Personal Note: Not All AIs Are The Same

For contrast, I’ve used Lovable AI for several projects and, to date, have not experienced any unusual behavior or major disruptions. This highlights that not every AI agent or platform carries the same level of risk in practice—many remain stable, effective assistants in routine coding work.

However, the Replit incident is a stark reminder that when AI agents are granted broad authority over critical systems, exceptional rigor, transparency, and safety measures are non-negotiable.

Conclusion: Approach With Caution

Vibe coding, at its best, is exhilaratingly productive. But the risks of AI autonomy—especially without robust, enforced safeguards—make fully production-grade trust seem, for now, questionable.

Until platforms prove otherwise, launching mission-critical systems via vibe coding may still be a gamble most businesses can’t afford.

Sources:

https://www.pcmag.com/news/vibe-coding-fiasco-replite-ai-agent-goes-rogue-deletes-company-database

https://futurism.com/ai-vibe-code-deletes-company-database

https://www.zdnet.com/article/a-vibe-coding-horror-story-what-started-as-a-pure-dopamine-hit-ended-in-a-nightmare/

https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/

https://x.com/jasonlk/status/1946069562723897802

The post Are We Ready for Production-Grade Apps With Vibe Coding? A Look at the Replit Fiasco appeared first on MarkTechPost.

Building a Versatile Multi‑Tool AI Agent Using Lightweight Hugging Face Models

In this tutorial, we begin by setting up a compact yet capable AI agent that runs smoothly in Google Colab, built on Hugging Face Transformers. We integrate dialog generation, question‑answering, sentiment analysis, web search stubs, weather look‑ups, and a safe calculator into a single Python class. As we progress, we install only the essential libraries, load lightweight models that respect Colab’s memory limits, and wrap each capability inside tidy, reusable methods. Together, we explore how every component, from intent detection to device-aware model loading, fits into a coherent workflow, empowering us to prototype sophisticated, multi-tool agents.

!pip install transformers torch accelerate datasets requests beautifulsoup4

import torch
import json
import requests
from datetime import datetime
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification,
    AutoModelForQuestionAnswering, pipeline
)
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

We begin by installing the key Python libraries, Transformers, Torch, Accelerate, Datasets, Requests, and BeautifulSoup, so our Colab environment has everything it needs for model loading, inference, and web scraping. Next, we import PyTorch, JSON utilities, HTTP and date helpers, Hugging Face classes for generation, classification, and QA, as well as BeautifulSoup for HTML parsing, while silencing unnecessary warnings to keep the notebook output clean.

class AdvancedAIAgent:
    def __init__(self):
        """Initialize the AI Agent with multiple models and capabilities"""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Initializing AI Agent on {self.device}")

        self._load_models()

        self.tools = {
            "web_search": self.web_search,
            "calculator": self.calculator,
            "weather": self.get_weather,
            "sentiment": self.analyze_sentiment
        }

        print("AI Agent initialized successfully!")

    def _load_models(self):
        """Load all required models"""
        print("Loading models...")

        self.gen_tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
        self.gen_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
        self.gen_tokenizer.pad_token = self.gen_tokenizer.eos_token

        self.sentiment_pipeline = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
            device=0 if self.device == "cuda" else -1
        )

        self.qa_pipeline = pipeline(
            "question-answering",
            model="distilbert-base-cased-distilled-squad",
            device=0 if self.device == "cuda" else -1
        )

        print("All models loaded!")

    def generate_response(self, prompt, max_length=100, temperature=0.7):
        """Generate text response using the language model"""
        inputs = self.gen_tokenizer.encode(prompt + self.gen_tokenizer.eos_token,
                                           return_tensors='pt')

        with torch.no_grad():
            outputs = self.gen_model.generate(
                inputs,
                max_length=max_length,
                temperature=temperature,
                do_sample=True,
                pad_token_id=self.gen_tokenizer.eos_token_id,
                attention_mask=torch.ones_like(inputs)
            )

        # Decode only the newly generated tokens, skipping the prompt
        response = self.gen_tokenizer.decode(outputs[0][len(inputs[0]):],
                                             skip_special_tokens=True)
        return response.strip()

    def analyze_sentiment(self, text):
        """Analyze sentiment of given text"""
        result = self.sentiment_pipeline(text)[0]
        return {
            "sentiment": result['label'],
            "confidence": round(result['score'], 4),
            "text": text
        }

    def answer_question(self, question, context):
        """Answer questions based on given context"""
        result = self.qa_pipeline(question=question, context=context)
        return {
            "answer": result['answer'],
            "confidence": round(result['score'], 4),
            "question": question
        }

    def web_search(self, query):
        """Simulate web search (replace with actual API if needed)"""
        try:
            return {
                "query": query,
                "results": f"Search results for '{query}': Latest information retrieved successfully.",
                "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            }
        except Exception as e:
            return {"error": f"Search failed: {str(e)}"}

    def calculator(self, expression):
        """Safe calculator function"""
        try:
            # Only digits, arithmetic operators, parentheses, dots, and spaces are allowed
            allowed_chars = set('0123456789+-*/.() ')
            if not all(c in allowed_chars for c in expression):
                return {"error": "Invalid characters in expression"}

            result = eval(expression)
            return {
                "expression": expression,
                "result": result,
                "type": type(result).__name__
            }
        except Exception as e:
            return {"error": f"Calculation failed: {str(e)}"}

    def get_weather(self, location):
        """Mock weather function (replace with actual weather API)"""
        return {
            "location": location,
            "temperature": "22°C",
            "condition": "Partly cloudy",
            "humidity": "65%",
            "note": "This is mock data. Integrate with a real weather API for actual data."
        }

    def detect_intent(self, user_input):
        """Simple intent detection based on keywords"""
        user_input = user_input.lower()

        if any(word in user_input for word in ['calculate', 'math', '+', '-', '*', '/']):
            return 'calculator'
        elif any(word in user_input for word in ['weather', 'temperature', 'forecast']):
            return 'weather'
        elif any(word in user_input for word in ['search', 'find', 'look up']):
            return 'web_search'
        elif any(word in user_input for word in ['sentiment', 'emotion', 'feeling']):
            return 'sentiment'
        elif '?' in user_input:
            return 'question_answering'
        else:
            return 'chat'

    def process_request(self, user_input, context=""):
        """Main method to process user requests"""
        print(f"Processing: {user_input}")

        intent = self.detect_intent(user_input)
        response = {"intent": intent, "input": user_input}

        try:
            if intent == 'calculator':
                import re
                # The hyphen is escaped so it is treated as a literal character
                expr = re.findall(r'[0-9+\-*/.() ]+', user_input)
                if expr:
                    result = self.calculator(expr[0].strip())
                    response.update(result)
                else:
                    response["error"] = "No valid mathematical expression found"

            elif intent == 'weather':
                words = user_input.split()
                location = "your location"
                for i, word in enumerate(words):
                    if word.lower() in ['in', 'at', 'for']:
                        if i + 1 < len(words):
                            location = words[i + 1]
                            break
                result = self.get_weather(location)
                response.update(result)

            elif intent == 'web_search':
                query = user_input.replace('search', '').replace('find', '').strip()
                result = self.web_search(query)
                response.update(result)

            elif intent == 'sentiment':
                text_to_analyze = user_input.replace('sentiment', '').strip()
                if not text_to_analyze:
                    text_to_analyze = "I'm feeling great today!"
                result = self.analyze_sentiment(text_to_analyze)
                response.update(result)

            elif intent == 'question_answering' and context:
                result = self.answer_question(user_input, context)
                response.update(result)

            else:
                generated_response = self.generate_response(user_input)
                response["response"] = generated_response
                response["type"] = "generated_text"

        except Exception as e:
            response["error"] = f"Error processing request: {str(e)}"

        return response


We encapsulate our entire toolkit inside an AdvancedAIAgent class that boots on GPU when available, loads dialogue, sentiment, and QA models, and registers helper tools for search, weather, and arithmetic. With lightweight keyword-based intent detection, we dynamically route each user message to the right pipeline or fall back to free-form generation, providing a unified, multi-skill agent driven by just a few clean methods.

if __name__ == "__main__":
    agent = AdvancedAIAgent()

    print("\n" + "="*50)
    print("DEMO: Advanced AI Agent Capabilities")
    print("="*50)

    test_cases = [
        "Calculate 25 * 4 + 10",
        "What's the weather in Tokyo?",
        "Search for latest AI developments",
        "Analyze sentiment of: I love working with AI!",
        "Hello, how are you today?"
    ]

    for test in test_cases:
        print(f"\nUser: {test}")
        result = agent.process_request(test)
        print(f"Agent: {json.dumps(result, indent=2)}")

    """
    print("\nInteractive Mode - Type 'quit' to exit")
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == 'quit':
            break

        result = agent.process_request(user_input)
        print(f"Agent: {json.dumps(result, indent=2)}")
    """

We conclude by spawning the AdvancedAIAgent, announcing a quick demo section, and firing five representative prompts that test calculation, weather, search, sentiment, and open‑ended chat in one sweep. After reviewing the neatly formatted JSON replies, we keep an optional interactive loop on standby, ready for live experimentation whenever we decide to un‑comment it.
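
The demo prompts above never exercise the question-answering branch, which requires an explicit context string. Reusing the agent instance and imports from the demo, an illustrative call (with made-up context text) looks like this:

# Question answering needs explicit context; the strings below are made up.
context = "The Eiffel Tower is located in Paris and was completed in 1889."
result = agent.process_request("When was the Eiffel Tower completed?", context=context)
print(json.dumps(result, indent=2))  # detect_intent sees the '?' and routes to the QA pipeline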

In conclusion, we test a variety of real-world prompts and observe how the agent handles arithmetic, fetches mock weather data, gauges sentiment, and engages in natural conversation, all through a single unified interface using Hugging Face models. This exercise demonstrates how we can stitch multiple NLP tasks into an extensible framework that remains friendly to Colab resources.

Check out the code. All credit for this research goes to the researchers of this project.


The post Building a Versatile Multi‑Tool AI Agent Using Lightweight Hugging Face Models appeared first on MarkTechPost.

Context Engineering for AI Agents: Key Lessons from Manus

Building effective AI agents means more than just picking a powerful language model. As the Manus project discovered, how you design and manage the “context” – the information the AI processes to make decisions – is paramount. This “context engineering” directly impacts an agent’s speed, cost, reliability, and intelligence.

Initially, the choice was clear: leverage the in-context learning of frontier models over slow, iterative fine-tuning. This allows for rapid improvements, shipping changes in hours instead of weeks, making the product adaptable to evolving AI capabilities. However, this path proved far from simple, leading to multiple framework rebuilds through what they affectionately call “Stochastic Graduate Descent” – a process of experimental guesswork.

Here are the critical lessons learned at Manus for effective context engineering:

1. Design Around the KV-Cache

The KV-cache is vital for agent performance, directly affecting latency and cost. Agents continuously append actions and observations to their context, making the input significantly longer than the output. KV-cache reuses identical context prefixes, drastically reducing processing time and cost (e.g., a 10x cost difference with Claude Sonnet).

To maximize KV-cache hits:

Stable Prompt Prefixes: Even a single-token change at the start of your system prompt can invalidate the cache. Avoid dynamic elements like precise timestamps.

Append-Only Context: Do not modify past actions or observations. Ensure deterministic serialization of data (like JSON) to prevent subtle cache breaks (see the sketch after this list).

Explicit Cache Breakpoints: Some frameworks require manual insertion of cache breakpoints, ideally after the system prompt.
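
As a small sketch of the append-only and deterministic-serialization points above (illustrative, not Manus's code), serializing each step with stable key order and separators keeps identical data byte-identical, so the cached prefix is never perturbed:

import json

SYSTEM_PROMPT = "You are a helpful agent."  # static prefix: avoid timestamps here

def serialize_step(step: dict) -> str:
    # Stable key order and fixed separators give byte-identical output for
    # identical data, so appended steps never perturb the cached prefix.
    return json.dumps(step, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

steps = [
    {"action": "browse", "url": "https://example.com"},
    {"observation": "page fetched", "status": 200},
]
# Append-only context: earlier entries are never rewritten or reordered.
context = SYSTEM_PROMPT + "\n" + "\n".join(serialize_step(s) for s in steps)
print(context)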

2. Mask, Don’t Remove

As agents gain more tools, their action space becomes complex, potentially “dumbing down” the agent as it struggles to choose correctly. While dynamic tool loading might seem intuitive, it invalidates the KV-cache and confuses the model if past context refers to undefined tools.

Manus instead uses a context-aware state machine to manage tool availability by masking token logits during decoding. This prevents the model from selecting unavailable or inappropriate actions without altering the core tool definitions, keeping the context stable and the agent focused.
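
A rough illustration of the idea (not Manus's implementation): if each tool maps to known action-token IDs, the logits of currently unavailable tools can be pushed to negative infinity so they can never be sampled, while their definitions stay untouched in the context:

import torch

def mask_unavailable_actions(logits: torch.Tensor, allowed_token_ids: list) -> torch.Tensor:
    # Tokens outside the allowed set get -inf, so softmax assigns them zero
    # probability; the tool definitions in the context remain unchanged.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

logits = torch.randn(8)   # pretend vocabulary of 8 action tokens
allowed = [1, 4, 5]       # ids the state machine currently permits
probs = torch.softmax(mask_unavailable_actions(logits, allowed), dim=-1)
print(probs)              # probability mass only on the allowed ids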

3. Use the File System as Context

Even with large context windows (128K+ tokens), real-world agentic observations (like web pages or PDFs) can easily exceed limits, degrade performance, and incur high costs. Irreversible compression risks losing crucial information needed for future steps.

Manus treats the file system as the ultimate, unlimited context. The agent learns to read from and write to files on demand, using the file system as externalized, structured memory. Compression strategies are always designed to be restorable (e.g., keeping a URL but dropping page content), effectively shrinking context length without permanent data loss.
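
A toy sketch of a restorable compression step, with hypothetical field names: the bulky page text is written to a file and only a pointer remains in context, so nothing is lost permanently:

import json, os, tempfile

def compress_observation(obs: dict) -> dict:
    # Drop the bulky body from the in-context observation but keep a URL and
    # a file path, so the agent can re-read or re-fetch the content later.
    compressed = dict(obs)
    body = compressed.pop("page_content", None)
    if body is not None:
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as f:
            f.write(body)
        compressed["content_path"] = path
    return compressed

obs = {"url": "https://example.com/docs", "page_content": "...tens of thousands of tokens..."}
print(json.dumps(compress_observation(obs), indent=2))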

4. Manipulate Attention Through Recitation

Agents can lose focus or forget long-term goals in complex, multi-step tasks. Manus tackles this by having the agent constantly rewrite a todo.md file. By reciting its objectives and progress into the end of the context, the model’s attention is biased towards its global plan, mitigating “lost-in-the-middle” issues and reducing goal misalignment. This leverages natural language to bias the AI’s focus without architectural changes.
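
A minimal sketch of that recitation step, using a hypothetical helper that re-reads the plan file and appends it to the tail of the context on every turn:

def recite_plan(context: str, todo_path: str = "todo.md") -> str:
    # Re-reading the frequently rewritten todo file and placing it at the end
    # of the context keeps the global plan in the model's recent attention span.
    try:
        with open(todo_path) as f:
            todo = f.read()
    except FileNotFoundError:
        todo = "(no plan yet)"
    return f"{context}\n\n## Current plan\n{todo}"

print(recite_plan("...earlier actions and observations..."))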

5. Keep the Wrong Stuff In

Agents will make mistakes – hallucinate, encounter errors, misbehave. The natural impulse is to clean up these failures. However, Manus found that leaving failed actions and observations in the context implicitly updates the model’s internal beliefs. Seeing its own mistakes helps the agent learn and reduces the chance of repeating the same error, making error recovery a key indicator of true agentic behavior.

6. Don’t Get Few-Shotted

While few-shot prompting is powerful for LLMs, it can backfire in agents by leading to mimicry and sub-optimal, repetitive behavior. When the context is too uniform with similar action-observation pairs, the agent can fall into a rut, leading to drift or hallucination.

The solution is controlled diversity. Manus introduces small variations in serialization templates, phrasing, or formatting within the context. This “noise” helps break repetitive patterns and shifts the model’s attention, preventing it from getting stuck in a rigid imitation of past actions.

In conclusion, context engineering is a new but critical discipline for AI agents. It goes beyond raw model power, dictating how an agent manages memory, interacts with its environment, and learns from feedback. Mastering these principles is essential for building robust, scalable, and intelligent AI agents.

The post Context Engineering for AI Agents: Key Lessons from Manus appeared first on MarkTechPost.

Beyond accelerators: Lessons from building foundation models on AWS wi …

In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator Challenge (GENIAC)—a Japanese national program to boost generative AI by providing companies with funding, mentorship, and massive compute resources for foundation model (FM) development. AWS was selected as the cloud provider for GENIAC’s second cycle (cycle 2). It provided infrastructure and technical guidance for 12 participating organizations. On paper, the challenge seemed straightforward: give each team access to hundreds of GPUs/Trainium chips and let innovation ensue. In practice, successful FM training required far more than raw hardware.
AWS discovered that allocating over 1,000 accelerators was merely the starting point—the real challenge lay in architecting a reliable system and overcoming distributed training obstacles. During GENIAC cycle 2, 12 customers successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 TensorCore GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium1 servers) in a single day. Over the following 6 months, multiple large-scale models were trained, including notable projects such as Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B.
This post shares the key insights from this engagement and valuable lessons for enterprises or national initiatives aiming to build FMs at scale.
Cross-functional engagement teams
A crucial early lesson from technical engagement for the GENIAC was that running a multi-organization, national-scale machine learning (ML) initiative requires coordinated support across diverse internal teams. AWS established a virtual team that brought together account teams, specialist Solutions Architects, and service teams. The GENIAC engagement model thrives on close collaboration between customers and a multi-layered AWS team structure, as illustrated in the following figure.

Customers (Cx) typically consist of business and technical leads, including ML and platform engineers, and are responsible for executing training workloads. AWS account teams (Solutions Architects and Account Managers) manage the relationship, maintain documentation, and coordinate communication flows with customers and internal specialists. The World Wide Specialist Organization (WWSO) Frameworks team specializes in large-scale ML workloads, with a focus on core HPC and container services such as AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker HyperPod. The WWSO Frameworks team is responsible for establishing this engagement structure and supervising technical engagements in this program. They lead the engagement in partnership with the other stakeholders and serve as an escalation point. They work directly with the service teams—Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon FSx, and SageMaker HyperPod—to help navigate engagements and escalations (business and technical) and make sure the engagement framework is in working order. They provide guidance on training and inference to customers and educate other teams on the technology. The WWSO Frameworks team worked closely with Lead Solutions Architects (Lead SAs), a role specifically designated to support GENIAC engagements. These Lead SAs serve as a cornerstone of this engagement. They are an extension of the Frameworks specialist team and work directly with customers and the account teams. They interface with customers and engage their Frameworks specialist counterparts when clarification or further expertise is required for in-depth technical discussions or troubleshooting. With this layered structure, AWS can scale technical guidance effectively across complex FM training workloads.
Another critical success factor for GENIAC was establishing robust communication channels between customers and AWS members. The foundation of our communication strategy was a dedicated internal Slack channel for GENIAC program coordination, connecting AWS account teams with lead SAs. This channel enabled real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues to the appropriate technical specialists and service team members. Complementing this was an external Slack channel that bridged AWS teams with customers, creating a collaborative environment where participants could ask questions, share insights, and receive immediate support. This direct line of communication significantly reduced resolution times and fostered a community of practice among participants.
AWS maintained comprehensive workload tracking documents, which clarified each customer’s training implementation details (model architecture, distributed training frameworks, and related software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including Amazon FSx for Lustre and Amazon S3). This tracking system also maintained a chronological history of customer interactions and support cases. In addition, the engagement team held weekly review meetings to track outstanding customer inquiries and technical issues. This regular cadence made it possible for team members to share lessons learned and apply them to their own customer engagements, fostering continuous improvement and knowledge transfer across the program.
With a structured approach to communication and documentation, we could identify common challenges, such as a misconfigured NCCL library impacting multi-node performance, share solutions across teams, and continuously refine our engagement model. The detailed tracking system provided valuable insights for future GENIAC cycles, helping us anticipate customer needs and proactively address potential bottlenecks in the FM development process.
Reference architectures
Another early takeaway was the importance of solid reference architectures. Rather than let each team configure their own cluster from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster (for a user-managed HPC cluster) and SageMaker HyperPod (for a managed, resilient cluster service). These reference architectures covered the full stack—from compute, network, and storage to container environments and monitoring—and were delivered as a GitHub repository so teams could deploy them with minimal friction.
AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training. It automates the provisioning of a Slurm-based HPC cluster on AWS, using a simple YAML config to stand up the environment. For the GENIAC program, AWS also offered SageMaker HyperPod as another option for some teams. SageMaker HyperPod is a managed service that provisions GPU and Trainium clusters for large-scale ML, and it integrates with orchestrators like Slurm or Kubernetes (Amazon EKS) for scheduling, providing additional managed functionality around cluster resiliency. By including reference architectures for both AWS ParallelCluster and SageMaker HyperPod, the GENIAC program gave participants flexibility—some opted for the fine-grained control of managing their own HPC cluster, whereas others preferred the convenience and resilience of a managed SageMaker HyperPod cluster.
The reference architecture (shown in the following diagram) seamlessly combines compute, networking, storage, and monitoring into an integrated system specifically designed for large-scale FM training.

The base infrastructure stack is available as an AWS CloudFormation template that provisions the complete infrastructure stack with minimal effort. This template automatically configures a dedicated virtual private cloud (VPC) with optimized networking settings and implements a high-performance FSx for Lustre file system for training data (complemented by optional Amazon FSx for OpenZFS support for shared home directories). The architecture is completed with an S3 bucket that provides durable, long-term storage for datasets and model checkpoints, maintaining data availability well beyond individual training cycles. This reference architecture employs a hierarchical storage approach that balances performance and cost-effectiveness. It uses Amazon S3 for durable, long-term storage of training data and checkpoints, and links this bucket to the Lustre file system through a data repository association (DRA). The DRA enables automatic and transparent data transfer between Amazon S3 and FSx for Lustre, allowing high-performance access without manual copying. You can use the following CloudFormation template to create the S3 bucket used in this architecture.
The optional monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana (or a self-managed Grafana service running on Amazon EC2) to provide comprehensive observability. It integrates the DCGM Exporter for GPU metrics and the EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. For example, the GPU Health Dashboard (see the following screenshot) provides metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi), helping users identify hardware failures as quickly as possible.

Reproducible deployment guides and structured enablement sessions
Even the best reference architectures are only useful if teams know how to use them. A critical element of GENIAC’s success was reproducible deployment guides and structured enablement through workshops. On October 3, 2024, AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, inviting Frameworks team members from the United States to share best practices for FM training on AWS.
The enablement session welcomed over 80 participants and provided a comprehensive mix of lectures, hands-on labs, and group discussions—earning a CSAT score of 4.75, reflecting its strong impact and relevance to attendees. The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with the software components necessary to build and train large-scale FMs using AWS. The sessions highlighted practical challenges in FM development—including massive compute requirements, scalable networking, and high-throughput storage—and mapped them to appropriate AWS services and best practices. (For more information, see the slide deck from the lecture session.) Another session focused on best practices, where attendees learned to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA’s DCGM toolkit and custom Grafana dashboards based on the Frameworks team’s experience managing a cluster with 2,000 P5 instances.
Additionally, the WWSO team prepared workshops for both AWS ParallelCluster (Machine Learning on AWS ParallelCluster) and SageMaker HyperPod (Amazon SageMaker HyperPod Workshop), providing detailed deployment guides for the aforementioned reference architecture. Using these materials, participants conducted hands-on exercises deploying their training clusters using Slurm with file systems including FSx for Lustre and FSx for OpenZFS, running multi-node PyTorch distributed training. Another segment of the workshop focused on observability and performance tuning, teaching participants how to monitor resource utilization, network throughput (EFA traffic), and system health. By the end of these enablement sessions, customers and supporting AWS engineers had established a shared baseline of knowledge and a toolkit of best practices. Using the assets and knowledge gained during the workshops, customers participated in onboarding sessions—structured, hands-on meetings with their Lead SAs. These sessions differed from the earlier workshops by focusing on customer-specific cluster deployments tailored to each team’s unique use case. During each session, Lead SAs worked directly with teams to deploy training environments, validate setup using NCCL tests, and resolve technical issues in real time.
Customer feedback

“To fundamentally solve data entry challenges, we significantly improved processing accuracy and cost-efficiency by applying two-stage reasoning and autonomous learning with SLM and LLM for regular items, and visual learning with VLM using 100,000 synthetic data samples for detailed items. We also utilized Amazon EC2 P5 instances to enhance research and development efficiency. These ambitious initiatives were made possible thanks to the support of many people, including AWS. We are deeply grateful for their extensive support.”
– Takuma Inoue, Executive Officer, CTO at AI Inside
“Future chose AWS to develop large-scale language models specialized for Japanese and software development at GENIAC. When training large-scale models using multiple nodes, Future had concerns about environment settings such as inter-node communication, but AWS had a wide range of tools, such as AWS ParallelCluster, and we received strong support from AWS Solutions Architects, which enabled us to start large-scale training quickly.”
– Makoto Morishita, Chief Research Engineer at Future

Results and looking ahead
GENIAC has demonstrated that training FMs at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and a cross-functional engagement team (WWSO Frameworks Team, Lead SAs, and Account Teams), even small teams can successfully execute massive workloads in the cloud. Thanks to this structure, 12 customers launched over 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Multiple large language models (LLMs) and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.
The technical engagement framework established through GENIAC Cycle 2 has provided crucial insights into large-scale FM development. Building on this experience, AWS is advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. We are strengthening cross-functional collaboration and systematizing knowledge sharing to establish a more efficient support structure. Reference architectures and automated training templates continue to be enhanced, and practical technical workshops and best practices are being codified based on lessons learned.
AWS has already begun preparations for the next cycle of GENIAC. As part of the onboarding process, AWS hosted a comprehensive technical event in Tokyo on April 3, 2025, to equip FM builders with hands-on experience and architectural guidance. The event, attended by over 50 participants, showcased the commitment AWS has to supporting scalable, resilient generative AI infrastructure.

The event highlighted the technical engagement model of AWS for GENIAC, alongside other support mechanisms, including the LLM Development Support Program and Generative AI Accelerator. The day featured an intensive workshop on SageMaker HyperPod and Slurm, where participants gained hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools. Sessions covered essential topics, including containerized ML, distributed training strategies, and AWS purpose-built silicon solutions. Classmethod Inc. shared practical SageMaker HyperPod insights, and AWS engineers demonstrated architectural patterns for large-scale GPU workloads. The event showcased AWS’s end-to-end generative AI support landscape, from infrastructure to deployment tools, setting the stage for GENIAC Cycle 3. As AWS continues to expand its support for FM development, the success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively.
Through these initiatives, AWS will continue to provide robust technical support, facilitating the smooth execution of large-scale FM training. We remain committed to contributing to the advancement of generative AI development all over the world through our technical expertise.
This post was contributed by AWS GENIAC Cycle 2 core members Masato Kobayashi, Kenta Ichiyanagi, and Satoshi Shirasawa, Accelerated Computing Specialist Mai Kiuchi, as well as Lead SAs Daisuke Miyamoto, Yoshitaka Haribara, Kei Sasaki, Soh Ohara, and Hiroshi Tokoyo with Executive Sponsorship from Toshi Yasuda. Hiroshi Hata and Tatsuya Urabe also provided support as core member and Lead SA during their time at AWS.
The authors extend their gratitude to WWSO Frameworks members Maxime Hugues, Matthew Nightingale, Aman Shanbhag, Alex Iankoulski, Anoop Saha, Yashesh Shroff, Natarajan Chennimalai Kumar, Shubha Kumbadakone, and Sundar Ranganathan for their technical contributions. Pierre-Yves Aquilanti provided in-depth support during his time at AWS.

About the authors
Keita Watanabe is a Senior Specialist Solutions Architect on the AWS WWSO Frameworks team. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. He leads GENIAC technical engagements.
Masaru Isaka is a Principal Business Development on the AWS WWSO Frameworks team, specializing in machine learning and generative AI solutions. Having engaged with GENIAC since its inception, he leads go-to-market strategies for AWS’s generative AI offerings.

Streamline deep learning environments with Amazon Q Developer and MCP

Data science teams working with artificial intelligence and machine learning (AI/ML) face a growing challenge as models become more complex. While AWS Deep Learning Containers (DLCs) offer robust baseline environments out of the box, customizing them for specific projects often requires significant time and expertise.
In this post, we explore how to use Amazon Q Developer and Model Context Protocol (MCP) servers to streamline DLC workflows to automate creation, execution, and customization of DLC containers.
AWS DLCs
AWS DLCs provide generative AI practitioners with optimized Docker environments to train and deploy large language models (LLMs) in their pipelines and workflows across Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon Elastic Container Service (Amazon ECS). AWS DLCs are targeted for self-managed machine learning (ML) customers who prefer to build and maintain their AI/ML environments on their own, want instance-level control over their infrastructure, and manage their own training and inference workloads. Provided at no additional cost, the DLCs come pre-packaged with CUDA libraries, popular ML frameworks, and the Elastic Fabric Adapter (EFA) plug-in for distributed training and inference on AWS. They automatically configure a stable connected environment, which eliminates the need for customers to troubleshoot common issues such as version incompatibilities. DLCs are available as Docker images for training and inference with PyTorch and TensorFlow on Amazon Elastic Container Registry (Amazon ECR).
The following figure illustrates the ML software stack on AWS.

DLCs are kept current with the latest versions of frameworks and drivers, tested for compatibility and security, and offered at no additional cost. They are also straightforward to customize by following our recipe guides. Using AWS DLCs as a building block for generative AI environments reduces the burden on operations and infrastructure teams, lowers the total cost of ownership (TCO) for AI/ML infrastructure, accelerates the development of generative AI products, and helps generative AI teams focus on the value-added work of deriving generative AI-powered insights from the organization’s data.
Challenges with DLC customization
Organizations often encounter a common challenge: they have a DLC that serves as an excellent foundation, but it requires customization with specific libraries, patches, or proprietary toolkits. The traditional approach to this customization involves the following steps:

Rebuilding containers manually
Installing and configuring additional libraries
Executing extensive testing cycles
Creating automation scripts for updates
Managing version control across multiple environments
Repeating this process several times annually

This process often requires days of work from specialized teams, with each iteration introducing potential errors and inconsistencies. For organizations managing multiple AI projects, these challenges compound quickly, leading to significant operational overhead and potential delays in development cycles.
Using the Amazon Q CLI with a DLC MCP server
Amazon Q acts as your AI-powered AWS expert, offering real-time assistance to help you build, extend, and operate AWS applications through natural conversations. It combines deep AWS knowledge with contextual understanding to provide actionable guidance when you need it. This tool can help you navigate AWS architecture, manage resources, implement best practices, and access documentation—all through natural language interactions.
The Model Context Protocol (MCP) is an open standard that enables AI assistants to interact with external tools and services. Amazon Q Developer CLI now supports MCP, allowing you to extend Q’s capabilities by connecting it to custom tools and services.
By combining the strengths of Amazon Q and MCP, we have implemented a DLC MCP server that transforms container management from complex command line operations into simple conversational instructions. Developers can securely create, customize, and deploy DLCs using natural language prompts. This solution potentially reduces the technical overhead associated with DLC workflows.
Solution overview
The following diagram shows the interaction between users using Amazon Q with a DLC MCP server.

The DLC MCP server provides six core tools:

Container management service – This service helps with core container operations and DLC image management:

Image discovery – List and filter available DLC images by framework, Python version, CUDA version, and repository type
Container runtime – Run DLC containers locally with GPU support
Distributed training setup – Configure multi-node distributed training environments
AWS integration – Automatic Amazon ECR authentication and AWS configuration validation
Environment setup – Check GPU availability and Docker configuration

Image building service – This service helps create and customize DLC images for specific ML workloads:

Base image selection – Browse available DLC base images by framework and use case
Custom Dockerfile generation – Create optimized Dockerfiles with custom packages and configurations
Image building – Build custom DLC images locally or push to Amazon ECR
Package management – Install system packages, Python packages, and custom dependencies
Environment configuration – Set environment variables and custom commands

Deployment service – This service helps with deploying DLC images across AWS compute services:

Multi-service deployment – Support for Amazon EC2, Amazon SageMaker, Amazon ECS, and Amazon EKS
SageMaker integration – Create models and endpoints for inference
Container orchestration – Deploy to ECS clusters and EKS clusters
Amazon EC2 deployment – Launch EC2 instances with DLC images
Status monitoring – Check deployment status and endpoint health

Upgrade service – This service helps upgrade or migrate DLC images to newer framework versions:

Upgrade path analysis – Analyze compatibility between current and target framework versions
Migration planning – Generate upgrade strategies with compatibility warnings
Dockerfile generation – Create upgrade Dockerfiles that preserve customizations
Version migration – Upgrade PyTorch, TensorFlow, and other frameworks
Custom file preservation – Maintain custom files and configurations during upgrades

Troubleshooting service – This service helps diagnose and resolve DLC-related issues:

Error diagnosis – Analyze error messages and provide specific solutions
Framework compatibility – Check version compatibility and requirements
Performance optimization – Get framework-specific performance tuning tips
Common issues – Maintain a database of solutions for frequent DLC problems
Environment validation – Verify system requirements and configurations

Best practices service – This service provides best practices on the following:

Security guidelines – Comprehensive security best practices for DLC deployments
Cost optimization – Strategies to reduce costs while maintaining performance
Deployment patterns – System-specific deployment recommendations
Framework guidance – Framework-specific best practices and optimizations
Custom image guidelines – Best practices for creating maintainable custom images

Prerequisites
Follow the installation steps in the GitHub repo to set up the DLC MCP server and Amazon Q CLI in your workstation.
Interact with the DLC MCP server
You’re now ready to start using the Amazon Q CLI with DLC MCP server. Let’s start with the CLI, as shown in the following screenshot. You can also check the default tools and loaded server tools in the CLI with the /tools command.

In the following sections, we demonstrate three separate use cases using the DLC MCP server.
Run a DLC training container
In this scenario, our goal is to identify a PyTorch base image, launch the image in a local Docker container, and run a simple test script to verify the container.
We start with the prompt “Run Pytorch container for training.”
The MCP server automatically handles the entire workflow: it authenticates with Amazon ECR and pulls the appropriate PyTorch DLC image.

Amazon Q used the GPU image because we didn’t specify the device type. Let’s try asking for a CPU image and see its response. After identifying the image, the server pulls the image from the ECR repository successfully and runs the container in your environment. Amazon Q has built-in tools that handle bash scripting and file operations, and a few other standard tools that speed up the runtime.

After the image is identified, the run_the_container tool from the DLC MCP server is used to start the container locally, and Amazon Q tests it with simple scripts to make sure the container is loading and running the scripts as expected. In our example, our test script checks the PyTorch version.

We further prompt the server to perform a training task on the PyTorch CPU training container using a popular dataset. Amazon Q autonomously selects the CIFAR-10 dataset for this example. Amazon Q gathers the dataset and model information based on its pretrained knowledge without human intervention. Amazon Q prompts the user about the choices it’s making on your behalf. If needed, you can specify the required model or dataset directly in the prompt.

When the scripts are ready for execution, the server runs the training job on the container. After training completes successfully, it summarizes the results along with the model path.
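
To give a sense of what the generated job looks like, the following is a condensed, illustrative CIFAR-10 training loop in PyTorch; the script Amazon Q actually produces is more complete, with argument parsing, logging, and checkpointing.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Download CIFAR-10 and build a simple data loader.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

# A deliberately small model so the smoke test finishes quickly on CPU.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Linear(512, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

torch.save(model.state_dict(), "cifar10_model.pt")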

Create a custom DLC with NVIDIA’s NeMo toolkit
In this scenario, we walk through the process of enhancing an existing DLC with NVIDIA’s NeMo toolkit. NeMo, a powerful framework for conversational AI, is built on PyTorch Lightning and is designed for efficient development of AI models. Our goal is to create a custom Docker image that integrates NeMo into the existing PyTorch GPU training container. This section demonstrates how to create a custom DLC image that combines the PyTorch GPU environment with the advanced capabilities of the NeMo toolkit.
The server invokes the create_custom_dockerfile tool from our MCP server’s image building module. We can use this tool to specify our base image from Amazon ECR and add custom commands to install NeMo.

This Dockerfile serves as a blueprint for our custom DLC image, making sure the necessary components are in place. Refer to the Dockerfile in the GitHub repo.
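
For illustration, the input to create_custom_dockerfile takes roughly the shape shown below, mirroring the DeepSeek example later in this post. The base image tag and the exact tool schema are assumptions here, so treat the repository’s Dockerfile as the authoritative version.

# Hypothetical create_custom_dockerfile input for the NeMo image (illustrative only).
nemo_dockerfile_request = {
    "base_image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:<gpu-tag>",  # placeholder tag
    "custom_commands": [
        "pip install --upgrade pip",
        "pip install nemo_toolkit[all]",  # installs NeMo along with PyTorch Lightning
    ],
    "environment_variables": {
        "HF_HOME": "/opt/ml/model",  # reuse the cache convention from the DeepSeek example
    },
}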

After the custom Dockerfile is created, the server starts building our custom DLC image. To achieve this, Amazon Q uses the build_custom_dlc_image tool in the image building module. This tool streamlines the process by setting up the build environment with specified arguments. This step transforms our base image into a specialized container tailored for NeMo-based AI development.

The build command pulls from a specified ECR repository, making sure we’re working with the most up-to-date base image. The image also comes with related packages and libraries to test NeMo; you can specify additional requirements in the prompt if needed.

NeMo is now ready to use. A quick environment check confirms the tools are in place before we begin: you can run a simple Python script in the Docker container that reports everything you need to know. In the following screenshot, you can see PyTorch version 2.7.1+cu128 and PyTorch Lightning version 2.5.2, and the NeMo modules are loaded and ready for use.
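
A check along those lines could look like the following sketch (the script in the walkthrough may differ):

import torch
import pytorch_lightning as pl
import nemo

# Report the framework versions baked into the custom image.
print(f"PyTorch: {torch.__version__}")          # expect 2.7.1+cu128 in this image
print(f"PyTorch Lightning: {pl.__version__}")   # expect 2.5.2
print(f"NeMo: {nemo.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")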

The DLC MCP server has transformed the way we create custom DLC images. Traditionally, setting up environments, managing dependencies, and writing Dockerfiles for AI development was a time-consuming and error-prone process. It often took hours, if not days, to get everything just right. But now, with Amazon Q along with the DLC MCP server, you can accomplish this in just a few minutes.
For NeMo-based AI applications, you can focus more on model development and less on infrastructure setup. The standardized process makes it straightforward to move from development to production, and you can be confident that your container will work the same way each time it’s built.
Add the latest version of the DeepSeek model to a DLC
In this scenario, we explore how to enhance an existing PyTorch GPU DLC by adding the DeepSeek model. Unlike our previous example where we added the NeMo toolkit, here we integrate a powerful language model using the latest PyTorch GPU container as our base. Let’s start with the prompt shown in the following screenshot.

Amazon Q interacts with the DLC MCP server to list the DLC images and check for available PyTorch GPU images. After the base image is picked, multiple tools from the DLC MCP server, such as create_custom_dockerfile and build_custom_dlc_image, are used to create and build the Dockerfile. The key components of the Dockerfile configuration for this example are as follows:

{
  "base_image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.6.0-gpu-py312-cu124-ubuntu22.04-ec2",
  "custom_commands": [
    "mkdir -p /opt/ml/model",
    "mkdir -p /opt/ml/code",
    "pip install --upgrade torch torchvision torchaudio"
  ],
  "environment_variables": {
    "CUDA_VISIBLE_DEVICES": "0",
    "HF_HOME": "/opt/ml/model",
    "MODEL_NAME": "deepseek-ai/deepseek-coder-6.7b-instruct"
  }
}

This configuration sets up our working directories, upgrades PyTorch to 2.7.1 (the latest version at the time of writing), and sets essential environment variables for DeepSeek integration. The server also includes important Python packages like transformers, accelerate, and Flask for a production-ready setup.
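
To make the serving side concrete, the generated inference server follows a pattern roughly like the minimal Flask sketch below. The route names, payload fields, and generation settings here are illustrative assumptions, not the exact generated code.

import os

import torch
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model named in the MODEL_NAME environment variable set in the Dockerfile.
MODEL_NAME = os.environ.get("MODEL_NAME", "deepseek-ai/deepseek-coder-6.7b-instruct")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Simple health check for readiness probes.
    return jsonify({"status": "healthy"})

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.get_json(force=True).get("prompt", "")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return jsonify({"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
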
Before diving into the build process, let’s understand how the MCP server prepares the groundwork. When you initiate the process, the server automatically generates several scripts and configuration files, including:

A custom Dockerfile tailored to your requirements
Build scripts for container creation and pushing to Amazon ECR
Test scripts for post-build verification
Inference server setup scripts
Requirement files listing necessary dependencies

The build process first handles authentication with Amazon ECR, establishing a secure connection to the AWS container registry. Then, it either locates your existing repository or creates a new one if needed. In the image building phase, the base PyTorch 2.6.0 image gets transformed with an upgrade to version 2.7.1, complete with CUDA 12.8 support. The DeepSeek Coder 6.7B Instruct model integration happens seamlessly.

After the build is successful, we move to the testing phase using the automatically generated test scripts. These scripts help verify both the basic functionality and production readiness of the DeepSeek container. To make sure our container is ready for deployment, we spin it up using the code shown in the following screenshot.
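
Programmatically, the equivalent container launch looks roughly like the following with the Docker SDK for Python; the image URI and port mapping are placeholders for this sketch.

import docker

client = docker.from_env()
container = client.containers.run(
    "<account-id>.dkr.ecr.us-east-1.amazonaws.com/<your-repo>:deepseek-inference",  # placeholder image URI
    detach=True,
    ports={"8080/tcp": 8080},  # expose the inference endpoint sketched earlier
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],  # all available GPUs
)
print(container.short_id, container.status)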

The container initialization takes about 3 seconds—a remarkably quick startup time that’s crucial for production environments. The server performs a simple inference check using a curl command that sends a POST request to our local endpoint. This test is particularly important because it verifies not just the model’s functionality, but also the entire infrastructure we’ve set up.
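
In Python, an equivalent check to that curl command might look like the following; the endpoint paths and payload fields mirror the hypothetical Flask sketch above rather than the exact generated server.

import requests

# Health check, then a small inference request against the locally running container.
print(requests.get("http://localhost:8080/ping", timeout=10).json())

response = requests.post(
    "http://localhost:8080/generate",
    json={"prompt": "Write a Python function that reverses a string."},
    timeout=120,
)
print(response.json()["completion"])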

We have successfully created a powerful inference image that uses the DLC PyTorch container’s performance optimizations and GPU acceleration while seamlessly integrating DeepSeek’s advanced language model capabilities. The result is more than just a development tool—it’s a production-ready solution complete with health checks, error handling, and optimized inference performance. This makes it ideal for deployment in environments where reliability and performance are critical. This integration creates new opportunities for developers and organizations looking to implement advanced language models in their applications.
Conclusion
The combination of DLC MCP and Amazon Q transforms what used to be weeks of DevOps work into a conversation with your tools. This not only saves time and reduces errors, but also helps teams focus on their core ML tasks rather than infrastructure management.
For more information about Amazon Q Developer, refer to the Amazon Q Developer product page to find video resources and blog posts. You can share your thoughts with us in the comments section or in the issues section of the project’s GitHub repository.

About the authors
Sathya Balakrishnan is a Sr. Cloud Architect in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.
Jyothirmai Kottu is a Software Development Engineer in the Deep Learning Containers team at AWS, specializing in building and maintaining robust AI and ML infrastructure. Her work focuses on enhancing the performance, reliability, and usability of DLCs, which are crucial tools for AI/ML practitioners working with AI frameworks. She is passionate about making AI/ML tools more accessible and efficient for developers around the world. Outside of her professional life, she enjoys a good coffee, yoga, and exploring new places with family and friends.
Arindam Paul is a Sr. Product Manager in SageMaker AI team at AWS responsible for Deep Learning workloads on SageMaker, EC2, EKS, and ECS. He is passionate about using AI to solve customer problems. In his spare time, he enjoys working out and gardening.

Meet WrenAI: The Open-Source AI Business Intelligence Agent for Natural Language Data Analytics

WrenAI is an open-source Generative Business Intelligence (GenBI) agent developed by Canner, designed to enable seamless, natural-language interaction with structured data. It targets both technical and non-technical teams, providing the tools to query, analyze, and visualize data without writing SQL. All capabilities and integrations are verified against the official documentation and latest releases.

Key Capabilities

Natural Language to SQL: Users can ask data questions in plain language (across multiple languages) and WrenAI translates these into accurate, production-grade SQL queries. This streamlines data access for non-technical users.

Multi-Modal Output: The platform generates SQL, charts, summary reports, dashboards, and spreadsheets. Both textual and visual outputs (e.g., charts, tables) are available for immediate data presentation or operational reporting.

GenBI Insights: WrenAI provides AI-generated summaries, reports, and context-aware visualizations, enabling quick, decision-ready analysis.

LLM Flexibility: WrenAI supports a range of large language models, including:

OpenAI GPT series

Azure OpenAI

Google Gemini, Vertex AI

DeepSeek

Databricks

AWS Bedrock (Anthropic Claude, Cohere, etc.)

Groq

Ollama (for deploying local or custom LLMs)

Other OpenAI API-compatible and user-defined models.

Semantic Layer & Indexing: Uses a Modeling Definition Language (MDL) for encoding schema, metrics, joins, and definitions, giving LLMs precise context and reducing hallucinations. The semantic engine ensures context-rich queries, schema embeddings, and relevance-based retrieval for accurate SQL.

Export & Collaboration: Results can be exported to Excel, Google Sheets, or APIs for further analysis or team sharing.

API Embeddability: Query and visualization capabilities are accessible via API, enabling seamless embedding in custom apps and frontends.

Architecture Overview

WrenAI’s architecture is modular and highly extensible for robust deployment and integration:

User Interface: Web-based or CLI UI for natural language queries and data visualization.

Orchestration Layer: Handles input parsing, manages LLM selection, and coordinates query execution.

Semantic Indexing: Embeds database schema and metadata, providing crucial context for the LLM.

LLM Abstraction: Unified API for integrating multiple LLM providers, both cloud and local.

Query Engine: Executes generated SQL on supported databases/data warehouses.

Visualization: Renders tables, charts, dashboards, and exports results as needed.

Plugins/Extensibility: Allows custom connectors, templates, prompt logic, and integrations for domain-specific needs.

Semantic Engine Details

Schema Embeddings: Dense vector representations capture schema and business context, powering relevance-based retrieval.

Few-Shot Prompting & Metadata Injection: Schema samples, joins, and business logic are injected into LLM prompts for better reasoning and accuracy.

Context Compression: The engine adapts schema representation size according to token limits, preserving critical detail for each model.

Retriever-Augmented Generation: Relevant schema and metadata are gathered via vector search and added to prompts for context alignment (see the sketch after this list).

Model-Agnostic: Wren Engine works across LLMs via protocol-based abstraction, ensuring consistent context regardless of backend.
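
As a generic illustration of the retriever-augmented pattern described above (not WrenAI’s actual code or APIs), a schema-aware prompt can be assembled along these lines:

# Generic retriever-augmented prompt assembly for text-to-SQL (illustrative only;
# WrenAI's implementation, schema format, and retrieval stack differ).
from typing import Callable, List

def build_sql_prompt(question: str, schema_chunks: List[str],
                     embed: Callable[[str], List[float]], top_k: int = 3) -> str:
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    # Rank schema chunks by similarity to the question and keep the most relevant.
    q_vec = embed(question)
    ranked = sorted(schema_chunks, key=lambda chunk: cosine(embed(chunk), q_vec), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Given the following tables and business definitions:\n"
        f"{context}\n\n"
        f"Write a SQL query that answers: {question}"
    )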

Supported Integrations

Databases and Warehouses: Out-of-the-box support for BigQuery, PostgreSQL, MySQL, Microsoft SQL Server, ClickHouse, Trino, Snowflake, DuckDB, Amazon Athena, and Amazon Redshift, among others.

Deployment Modes: Can be run self-hosted, in the cloud, or as a managed service.

API and Embedding: Easily integrates into other applications and platforms via API.

Typical Use Cases

Marketing/Sales: Rapid generation of performance charts, funnel analyses, or region-based summaries from natural language prompts.

Product/Operations: Analyze product usage, customer churn, or operational metrics with follow-up questions and visual summaries.

Executives/Analysts: Automated, up-to-date business dashboards and KPI tracking, delivered in minutes.

Conclusion

WrenAI is a verified, open-source GenBI solution that bridges the gap between business teams and databases through conversational, context-aware, AI-powered analytics. It is extensible, multi-LLM compatible, secure, and engineered with a strong semantic backbone to ensure trustworthy, explainable, and easily integrated business intelligence.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.



This AI Paper from Alibaba Introduces Lumos-1: A Unified Autoregressive Video Generator Leveraging MM-RoPE and AR-DF for Efficient Spatiotemporal Modeling

Autoregressive video generation is a rapidly evolving research domain. It focuses on the synthesis of videos frame by frame using learned patterns of both spatial arrangements and temporal dynamics. Unlike traditional video creation methods, which may rely on pre-built frames or handcrafted transitions, autoregressive models generate content dynamically based on prior tokens, much as large language models predict the next word. This offers the potential to unify video, image, and text generation under a shared framework by using the structural power of transformer-based architectures.

One major problem in this space is how to accurately capture and model the intrinsic spatiotemporal dependencies in videos. Videos contain rich structures across both time and space, and encoding this complexity so models can predict coherent future frames remains a challenge. When these dependencies are modeled poorly, the result is broken frame continuity or unrealistic content. Traditional training techniques such as random masking also struggle: they often fail to provide balanced learning signals across frames, and when spatial information leaks from adjacent frames, prediction becomes too easy.

Several methods attempt to address this challenge by adapting the autoregressive generation pipeline, but they often deviate from standard large language model structures. Some rely on external pre-trained text encoders, making models more complex and less coherent. Others introduce significant latency during generation because of inefficient decoding. Autoregressive models like Phenaki and EMU3 try to support end-to-end generation, yet they still struggle with performance consistency and high training costs. Techniques like raster-scan order or global sequence attention also do not scale well to high-dimensional video data.

The research team from Alibaba Group’s DAMO Academy, Hupan Lab, and Zhejiang University introduced Lumos-1, a unified model for autoregressive video generation that stays true to large language model architecture. Unlike previous tools, Lumos-1 eliminates the need for external encoders and changes very little in the original LLM design. The model uses MM-RoPE, or Multi-Modal Rotary Position Embeddings, to address the challenge of modeling video’s three-dimensional structure, and it adopts a token dependency approach that preserves intra-frame bidirectionality and inter-frame temporal causality, which aligns more naturally with how video data behaves.

In MM-RoPE, the researchers expand existing RoPE methods to balance the frequency spectrum across spatial and temporal dimensions. Traditional 3D RoPE misallocates frequency focus, causing detail loss or ambiguous positional encoding. MM-RoPE restructures the allocation so that the temporal, height, and width dimensions each receive balanced representation. To address loss imbalance in frame-wise training, Lumos-1 introduces AR-DF, or Autoregressive Discrete Diffusion Forcing. AR-DF uses temporal tube masking during training so the model does not rely too heavily on unmasked spatial information, which ensures even learning across the video sequence. The inference strategy mirrors the training, allowing high-quality frame generation without degradation.
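
As a rough, generic illustration of the temporal tube masking idea (not the Lumos-1 implementation), the same spatial positions can be masked in every frame so that the masked tubes run through time:

import torch

def temporal_tube_mask(batch: int, frames: int, height: int, width: int,
                       mask_ratio: float = 0.5) -> torch.Tensor:
    # Sample one spatial mask per example, then repeat it across frames so the model
    # cannot copy unmasked content from neighboring frames at the same location.
    spatial = height * width
    num_masked = int(mask_ratio * spatial)
    scores = torch.rand(batch, spatial)
    masked_idx = scores.argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(batch, spatial, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)
    # Shape: (batch, frames, height * width); True marks masked token positions.
    return mask.unsqueeze(1).expand(batch, frames, spatial)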

Lumos-1 was trained from scratch on 60 million images and 10 million videos using only 48 GPUs, which is considered memory-efficient at this scale. The model achieved results comparable to top models in the field: it matched EMU3 on the GenEval benchmark, performed on par with COSMOS-Video2World on the VBench-I2V test, and rivaled OpenSoraPlan’s outputs on the VBench-T2V benchmark. These comparisons show that Lumos-1’s lightweight training does not compromise competitiveness. The model supports text-to-video, image-to-video, and text-to-image generation, demonstrating strong generalization across modalities.

Overall, this research not only identifies and addresses core challenges in spatiotemporal modeling for video generation but also showcases how Lumos-1 sets a new standard for unifying efficiency and effectiveness in autoregressive frameworks. By successfully blending advanced architectures with innovative training, Lumos-1 paves the way for the next generation of scalable, high-quality video generation models and opens up new avenues for future multimodal research.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

