Meet ARGUS: A Scalable AI Framework for Training Large Recommender Transformers to One Billion Parameters

Yandex has introduced ARGUS (AutoRegressive Generative User Sequential modeling), a large-scale transformer-based framework for recommender systems that scales up to one billion parameters. This breakthrough places Yandex among a small group of global technology leaders — alongside Google, Netflix, and Meta — that have successfully overcome the long-standing technical barriers in scaling recommender transformers.

Breaking Technical Barriers in Recommender Systems

Recommender systems have long struggled with three stubborn constraints: short-term memory, limited scalability, and poor adaptability to shifting user behavior. Conventional architectures trim user histories down to a small window of recent interactions, discarding months or years of behavioral data. The result is a shallow view of intent that misses long-term habits, subtle shifts in taste, and seasonal cycles. As catalogs expand into the billions of items, these truncated models not only lose precision but also choke on the computational demands of personalization at scale. The outcome is familiar: stale recommendations, lower engagement, and fewer opportunities for serendipitous discovery.

Very few companies have successfully scaled recommender transformers beyond experimental setups. Google, Netflix, and Meta have invested heavily in this area, reporting gains from architectures like YouTubeDNN, PinnerFormer, and Meta’s Generative Recommenders. With ARGUS, Yandex joins this select group of companies demonstrating billion-parameter recommender models in live services. By modeling entire behavioral timelines, the system uncovers both obvious and hidden correlations in user activity. This long-horizon perspective allows ARGUS to capture evolving intent and cyclical patterns with far greater fidelity. For example, instead of reacting only to a recent purchase, the model learns to anticipate seasonal behaviors—like automatically surfacing the preferred brand of tennis balls when summer approaches—without requiring the user to repeat the same signals year after year.

Technical Innovations Behind ARGUS

The framework introduces several key advances:

Dual-objective pre-training: ARGUS decomposes autoregressive learning into two subtasks — next-item prediction and feedback prediction. This combination improves both imitation of historical system behavior and modeling of true user preferences (a minimal sketch of this dual objective follows the list below).

Scalable transformer encoders: Models scale from 3.2M to 1B parameters, with consistent performance improvements across all metrics. At the billion-parameter scale, pairwise accuracy uplift increased by 2.66%, demonstrating the emergence of a scaling law for recommender transformers.

Extended context modeling: ARGUS handles user histories up to 8,192 interactions long in a single pass, enabling personalization over months of behavior rather than just the last few clicks.

Efficient fine-tuning: A two-tower architecture allows offline computation of embeddings and scalable deployment, reducing inference cost relative to prior target-aware or impression-level online models.
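The exact ARGUS losses and their weighting are not spelled out in this summary, so the following PyTorch snippet is only a minimal sketch of what a dual-objective head could look like; the module structure, the alpha weighting, and the binary like/skip feedback label are illustrative assumptions, not the Yandex implementation.

# Hypothetical sketch of dual-objective pre-training: a shared encoder output
# feeds two heads, one for next-item prediction and one for feedback prediction.
import torch
import torch.nn as nn

class DualObjectiveHead(nn.Module):
    def __init__(self, hidden_dim: int, num_items: int):
        super().__init__()
        self.next_item_head = nn.Linear(hidden_dim, num_items)  # next-item prediction
        self.feedback_head = nn.Linear(hidden_dim, 1)            # feedback (e.g., like) prediction

    def forward(self, hidden_states: torch.Tensor):
        return self.next_item_head(hidden_states), self.feedback_head(hidden_states).squeeze(-1)

def dual_objective_loss(item_logits, feedback_logits, next_items, feedback_labels, alpha=0.5):
    """Weighted sum of the two pre-training losses (alpha is an assumed knob)."""
    next_item_loss = nn.functional.cross_entropy(item_logits, next_items)
    feedback_loss = nn.functional.binary_cross_entropy_with_logits(feedback_logits, feedback_labels)
    return alpha * next_item_loss + (1 - alpha) * feedback_loss

# Toy usage: random vectors stand in for transformer encodings of user histories
hidden = torch.randn(8, 64)
head = DualObjectiveHead(hidden_dim=64, num_items=1000)
item_logits, feedback_logits = head(hidden)
loss = dual_objective_loss(item_logits, feedback_logits,
                           next_items=torch.randint(0, 1000, (8,)),
                           feedback_labels=torch.randint(0, 2, (8,)).float())
loss.backward()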

Real-World Deployment and Measured Gains

ARGUS has already been deployed at scale on Yandex’s music platform, serving millions of users. In production A/B tests, the system achieved:

+2.26% increase in total listening time (TLT)

+6.37% increase in like likelihood

These constitute the largest recorded quality improvements in the platform’s history for any deep learning–based recommender model.

Future Directions

Yandex researchers plan to extend ARGUS to real-time recommendation tasks, explore feature engineering for pairwise ranking, and adapt the framework to high-cardinality domains such as large e-commerce and video platforms. The demonstrated ability to scale user-sequence modeling with transformer architectures suggests that recommender systems are poised to follow a scaling trajectory similar to natural language processing.

Conclusion

With ARGUS, Yandex has established itself as one of the few global leaders driving state-of-the-art recommender systems. By openly sharing its breakthroughs, the company is not only improving personalization across its own services but also accelerating the evolution of recommendation technologies for the entire industry.

Check out the PAPER here. Thanks to the Yandex team for the thought leadership and resources for this article.
The post Meet ARGUS: A Scalable AI Framework for Training Large Recommender Transformers to One Billion Parameters appeared first on MarkTechPost.

How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis

In this tutorial, we present a complete end-to-end Natural Language Processing (NLP) pipeline built with Gensim and supporting libraries, designed to run seamlessly in Google Colab. It integrates multiple core techniques in modern NLP, including preprocessing, topic modeling with Latent Dirichlet Allocation (LDA), word embeddings with Word2Vec, TF-IDF-based similarity analysis, and semantic search. The pipeline not only demonstrates how to train and evaluate these models but also showcases practical visualizations, advanced topic analysis, and document classification workflows. By combining statistical methods with machine learning approaches, the tutorial provides a comprehensive framework for understanding and experimenting with text data at scale. Check out the FULL CODES here.

!pip install --upgrade scipy==1.11.4
!pip install gensim==4.3.2 nltk wordcloud matplotlib seaborn pandas numpy scikit-learn
!pip install --upgrade setuptools

print("Please restart runtime after installation!")
print("Go to Runtime > Restart runtime, then run the next cell")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, TfidfModel, CoherenceModel
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

We install and upgrade the necessary libraries, such as SciPy, Gensim, NLTK, and visualization tools, to ensure compatibility. We then import all required modules for preprocessing, modeling, and analysis. We also download NLTK resources to tokenize and handle stopwords efficiently, thereby setting up the environment for our NLP pipeline. Check out the FULL CODES here.

class AdvancedGensimPipeline:
    def __init__(self):
        self.dictionary = None
        self.corpus = None
        self.lda_model = None
        self.word2vec_model = None
        self.tfidf_model = None
        self.similarity_index = None
        self.processed_docs = None

    def create_sample_corpus(self):
        """Create a diverse sample corpus for demonstration"""
        documents = [
            "Data science combines statistics, programming, and domain expertise to extract insights",
            "Big data analytics helps organizations make data-driven decisions at scale",
            "Cloud computing provides scalable infrastructure for modern applications and services",
            "Cybersecurity protects digital systems from threats and unauthorized access attempts",
            "Software engineering practices ensure reliable and maintainable code development",
            "Database management systems store and organize large amounts of structured information",
            "Python programming language is widely used for data analysis and machine learning",
            "Statistical modeling helps identify patterns and relationships in complex datasets",
            "Cross-validation techniques ensure robust model performance evaluation and selection",
            "Recommendation systems suggest relevant items based on user preferences and behavior",
            "Text mining extracts valuable insights from unstructured textual data sources",
            "Image classification assigns predefined categories to visual content automatically",
            "Reinforcement learning trains agents through interaction with dynamic environments"
        ]
        return documents

    def preprocess_documents(self, documents):
        """Advanced document preprocessing using Gensim filters"""
        print("Preprocessing documents...")

        CUSTOM_FILTERS = [
            strip_tags, strip_punctuation, strip_multiple_whitespaces,
            strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
        ]

        processed_docs = []
        for doc in documents:
            processed = preprocess_string(doc, CUSTOM_FILTERS)

            stop_words = set(stopwords.words('english'))
            processed = [word for word in processed if word not in stop_words and len(word) > 2]

            processed_docs.append(processed)

        self.processed_docs = processed_docs
        print(f"Processed {len(processed_docs)} documents")
        return processed_docs

    def create_dictionary_and_corpus(self):
        """Create Gensim dictionary and corpus"""
        print("Creating dictionary and corpus...")

        self.dictionary = corpora.Dictionary(self.processed_docs)

        self.dictionary.filter_extremes(no_below=2, no_above=0.8)

        self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]

        print(f"Dictionary size: {len(self.dictionary)}")
        print(f"Corpus size: {len(self.corpus)}")

    def train_word2vec_model(self):
        """Train Word2Vec model for word embeddings"""
        print("Training Word2Vec model...")

        self.word2vec_model = Word2Vec(
            sentences=self.processed_docs,
            vector_size=100,
            window=5,
            min_count=2,
            workers=4,
            epochs=50
        )

        print("Word2Vec model trained successfully")

    def analyze_word_similarities(self):
        """Analyze word similarities using Word2Vec"""
        print("\n=== Word2Vec Similarity Analysis ===")

        test_words = ['machine', 'data', 'learning', 'computer']

        for word in test_words:
            if word in self.word2vec_model.wv:
                similar_words = self.word2vec_model.wv.most_similar(word, topn=3)
                print(f"Words similar to '{word}': {similar_words}")

        try:
            if all(w in self.word2vec_model.wv for w in ['machine', 'computer', 'data']):
                analogy = self.word2vec_model.wv.most_similar(
                    positive=['computer', 'data'],
                    negative=['machine'],
                    topn=1
                )
                print(f"Analogy result: {analogy}")
        except KeyError:
            print("Not enough vocabulary for complex analogies")

    def train_lda_model(self, num_topics=5):
        """Train LDA topic model"""
        print(f"Training LDA model with {num_topics} topics...")

        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto',
            per_word_topics=True,
            eval_every=None
        )

        print("LDA model trained successfully")

    def evaluate_topic_coherence(self):
        """Evaluate topic model coherence"""
        print("Evaluating topic coherence...")

        coherence_model = CoherenceModel(
            model=self.lda_model,
            texts=self.processed_docs,
            dictionary=self.dictionary,
            coherence='c_v'
        )

        coherence_score = coherence_model.get_coherence()
        print(f"Topic Coherence Score: {coherence_score:.4f}")
        return coherence_score

    def display_topics(self):
        """Display discovered topics"""
        print("\n=== Discovered Topics ===")

        topics = self.lda_model.print_topics(num_words=8)
        for idx, topic in enumerate(topics):
            print(f"Topic {idx}: {topic[1]}")

    def create_tfidf_model(self):
        """Create TF-IDF model for document similarity"""
        print("Creating TF-IDF model...")

        self.tfidf_model = TfidfModel(self.corpus)
        corpus_tfidf = self.tfidf_model[self.corpus]

        self.similarity_index = similarities.MatrixSimilarity(corpus_tfidf)

        print("TF-IDF model and similarity index created")

    def find_similar_documents(self, query_doc_idx=0):
        """Find documents similar to a query document"""
        print("\n=== Document Similarity Analysis ===")

        query_doc_tfidf = self.tfidf_model[self.corpus[query_doc_idx]]

        similarities_scores = self.similarity_index[query_doc_tfidf]

        sorted_similarities = sorted(enumerate(similarities_scores), key=lambda x: x[1], reverse=True)

        print(f"Documents most similar to document {query_doc_idx}:")
        for doc_idx, similarity in sorted_similarities[:5]:
            print(f"Doc {doc_idx}: {similarity:.4f}")

    def visualize_topics(self):
        """Create visualizations for topic analysis"""
        print("Creating topic visualizations...")

        doc_topic_matrix = []
        for doc_bow in self.corpus:
            doc_topics = dict(self.lda_model.get_document_topics(doc_bow, minimum_probability=0))
            topic_vec = [doc_topics.get(i, 0) for i in range(self.lda_model.num_topics)]
            doc_topic_matrix.append(topic_vec)

        doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f'Topic_{i}' for i in range(self.lda_model.num_topics)])

        plt.figure(figsize=(12, 8))
        sns.heatmap(doc_topic_df.T, annot=True, cmap='Blues', fmt='.2f')
        plt.title('Document-Topic Distribution Heatmap')
        plt.xlabel('Documents')
        plt.ylabel('Topics')
        plt.tight_layout()
        plt.show()

        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()

        for topic_id in range(min(6, self.lda_model.num_topics)):
            topic_words = dict(self.lda_model.show_topic(topic_id, topn=20))

            wordcloud = WordCloud(
                width=300, height=200,
                background_color='white',
                colormap='viridis'
            ).generate_from_frequencies(topic_words)

            axes[topic_id].imshow(wordcloud, interpolation='bilinear')
            axes[topic_id].set_title(f'Topic {topic_id}')
            axes[topic_id].axis('off')

        for i in range(self.lda_model.num_topics, 6):
            axes[i].axis('off')

        plt.tight_layout()
        plt.show()

    def advanced_topic_analysis(self):
        """Perform advanced topic analysis"""
        print("\n=== Advanced Topic Analysis ===")

        topic_distributions = []
        for i, doc_bow in enumerate(self.corpus):
            doc_topics = self.lda_model.get_document_topics(doc_bow)
            dominant_topic = max(doc_topics, key=lambda x: x[1]) if doc_topics else (0, 0)
            topic_distributions.append({
                'doc_id': i,
                'dominant_topic': dominant_topic[0],
                'topic_probability': dominant_topic[1]
            })

        topic_df = pd.DataFrame(topic_distributions)

        plt.figure(figsize=(10, 6))
        topic_counts = topic_df['dominant_topic'].value_counts().sort_index()
        plt.bar(range(len(topic_counts)), topic_counts.values)
        plt.xlabel('Topic ID')
        plt.ylabel('Number of Documents')
        plt.title('Distribution of Dominant Topics Across Documents')
        plt.xticks(range(len(topic_counts)), [f'Topic {i}' for i in topic_counts.index])
        plt.show()

        return topic_df

    def document_classification_demo(self, new_document):
        """Classify a new document using trained models"""
        print("\n=== Document Classification Demo ===")
        print(f"Classifying: '{new_document[:50]}...'")

        processed_new = preprocess_string(new_document, [
            strip_tags, strip_punctuation, strip_multiple_whitespaces,
            strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
        ])

        new_doc_bow = self.dictionary.doc2bow(processed_new)

        doc_topics = self.lda_model.get_document_topics(new_doc_bow)

        print("Topic probabilities:")
        for topic_id, prob in doc_topics:
            print(f"  Topic {topic_id}: {prob:.4f}")

        new_doc_tfidf = self.tfidf_model[new_doc_bow]
        similarities_scores = self.similarity_index[new_doc_tfidf]
        most_similar = np.argmax(similarities_scores)

        print(f"Most similar document: {most_similar} (similarity: {similarities_scores[most_similar]:.4f})")

        return doc_topics, most_similar

    def run_complete_pipeline(self):
        """Execute the complete NLP pipeline"""
        print("=== Advanced Gensim NLP Pipeline ===\n")

        raw_documents = self.create_sample_corpus()
        self.preprocess_documents(raw_documents)

        self.create_dictionary_and_corpus()

        self.train_word2vec_model()
        self.train_lda_model(num_topics=5)
        self.create_tfidf_model()

        self.analyze_word_similarities()
        coherence_score = self.evaluate_topic_coherence()
        self.display_topics()

        self.visualize_topics()
        topic_df = self.advanced_topic_analysis()

        self.find_similar_documents(query_doc_idx=0)

        new_doc = "Deep neural networks are powerful machine learning models for pattern recognition"
        self.document_classification_demo(new_doc)

        return {
            'coherence_score': coherence_score,
            'topic_distributions': topic_df,
            'models': {
                'lda': self.lda_model,
                'word2vec': self.word2vec_model,
                'tfidf': self.tfidf_model
            }
        }

We define the AdvancedGensimPipeline class as a modular framework to handle every stage of text analysis in one place. It starts with creating a sample corpus, preprocessing it, and then building a dictionary and corpus representations. We train Word2Vec for embeddings, LDA for topic modeling, and TF-IDF for similarity, followed by visualization, coherence evaluation, and classification of new documents. This way, we bring together the complete NLP workflow, from raw text to insights, into a single reusable pipeline. Check out the FULL CODES here.

def compare_topic_models(pipeline, topic_range=[3, 5, 7, 10]):
    print("\n=== Topic Model Comparison ===")

    coherence_scores = []
    perplexity_scores = []

    for num_topics in topic_range:
        lda_temp = LdaModel(
            corpus=pipeline.corpus,
            id2word=pipeline.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto'
        )

        coherence_model = CoherenceModel(
            model=lda_temp,
            texts=pipeline.processed_docs,
            dictionary=pipeline.dictionary,
            coherence='c_v'
        )
        coherence = coherence_model.get_coherence()
        coherence_scores.append(coherence)

        perplexity = lda_temp.log_perplexity(pipeline.corpus)
        perplexity_scores.append(perplexity)

        print(f"Topics: {num_topics}, Coherence: {coherence:.4f}, Perplexity: {perplexity:.4f}")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    ax1.plot(topic_range, coherence_scores, 'bo-')
    ax1.set_xlabel('Number of Topics')
    ax1.set_ylabel('Coherence Score')
    ax1.set_title('Model Coherence vs Number of Topics')
    ax1.grid(True)

    ax2.plot(topic_range, perplexity_scores, 'ro-')
    ax2.set_xlabel('Number of Topics')
    ax2.set_ylabel('Perplexity')
    ax2.set_title('Model Perplexity vs Number of Topics')
    ax2.grid(True)

    plt.tight_layout()
    plt.show()

    return coherence_scores, perplexity_scores

This function compare_topic_models lets us systematically test different numbers of topics for the LDA model and compare their performance. We calculate coherence scores (to check topic interpretability) and perplexity scores (to check model fit) for each topic count in the given range. The results are displayed as line plots, helping us visually decide the most balanced number of topics for our dataset. Check out the FULL CODES here.

def semantic_search_engine(pipeline, query, top_k=5):
    """Implement semantic search using trained models"""
    print(f"\n=== Semantic Search: '{query}' ===")

    processed_query = preprocess_string(query, [
        strip_tags, strip_punctuation, strip_multiple_whitespaces,
        strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
    ])

    query_bow = pipeline.dictionary.doc2bow(processed_query)
    query_tfidf = pipeline.tfidf_model[query_bow]

    similarities_scores = pipeline.similarity_index[query_tfidf]

    top_indices = np.argsort(similarities_scores)[::-1][:top_k]

    print("Top matching documents:")
    for i, idx in enumerate(top_indices):
        score = similarities_scores[idx]
        print(f"{i+1}. Document {idx} (Score: {score:.4f})")
        print(f"   Content: {' '.join(pipeline.processed_docs[idx][:10])}...")

    return top_indices, similarities_scores[top_indices]

The semantic_search_engine function adds a search layer to the pipeline by taking a query, preprocessing it, and converting it into bag-of-words and TF-IDF representations. It then compares the query against all documents using the similarity index and returns the top matches. This way, we can quickly retrieve the most relevant documents along with their similarity scores, making the pipeline useful for practical information retrieval and semantic search tasks. Check out the FULL CODES here.

if __name__ == "__main__":
    pipeline = AdvancedGensimPipeline()
    results = pipeline.run_complete_pipeline()

    print("\n" + "="*60)
    coherence_scores, perplexity_scores = compare_topic_models(pipeline)

    print("\n" + "="*60)
    search_results = semantic_search_engine(
        pipeline,
        "artificial intelligence neural networks deep learning"
    )

    print("\n" + "="*60)
    print("Pipeline completed successfully!")
    print(f"Final coherence score: {results['coherence_score']:.4f}")
    print(f"Vocabulary size: {len(pipeline.dictionary)}")
    print(f"Word2Vec model size: {pipeline.word2vec_model.wv.vector_size} dimensions")

    print("\nModels trained and ready for use!")
    print("Access models via: pipeline.lda_model, pipeline.word2vec_model, pipeline.tfidf_model")

This main block ties everything together into a complete, executable pipeline. We initialize the AdvancedGensimPipeline, run the full workflow, and then evaluate topic models with different numbers of topics. Next, we test the semantic search engine with a query about artificial intelligence and deep learning. Finally, it prints out summary metrics, such as the coherence score, vocabulary size, and Word2Vec embedding dimensions, confirming that all models are trained and ready for further use.

In conclusion, we gain a powerful, modular workflow that covers the entire spectrum of text analysis, from cleaning and preprocessing raw documents to discovering hidden topics, visualizing results, comparing models, and performing semantic search. The inclusion of Word2Vec embeddings, TF-IDF similarity, and coherence evaluation ensures that the pipeline is both versatile and robust, while visualizations and classification demos make the results interpretable and actionable. This cohesive design enables learners, researchers, and practitioners to quickly adapt the framework for real-world applications, making it a valuable foundation for advanced NLP experimentation and production-ready text analytics.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis appeared first on MarkTechPost.

Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framework that Enables Personalized Interactions to Address Individual Health Needs

Table of contents
What is a Personal Health Agent?
How does the PHA framework operate?
How was the PHA evaluated?
Evaluation of the Data Science Agent
Evaluation of the Domain Expert Agent
Evaluation of the Health Coach Agent
Evaluation of the Integrated PHA System
How does the PHA contribute to health AI?
What is the larger significance of Google’s PHA blueprint?
Conclusion


What is a Personal Health Agent?

Large language models (LLMs) have demonstrated strong performance across various domains like clinical reasoning, decision support, and consumer health applications. However, most existing platforms are designed as single-purpose tools, such as symptom checkers, digital coaches, or health information assistants. These approaches often fail to address the complexity of real-world health needs, where individuals require integrated reasoning over wearable streams, personal health records, and laboratory test results.

A team of researchers from Google has proposed a Personal Health Agent (PHA) framework. The PHA is designed as a multi-agent system that unifies complementary roles: data analysis, medical knowledge reasoning, and health coaching. Instead of returning isolated outputs from a single model, the PHA employs a central orchestrator to coordinate specialized sub-agents, iteratively synthesize their outputs, and deliver coherent, personalized guidance.


How does the PHA framework operate?

The Personal Health Agent (PHA) is built on top of the Gemini 2.0 model family. It follows a modular architecture consisting of three sub-agents and one orchestrator:

Data Science Agent (DS): The DS agent interprets and analyzes time-series data from wearables (e.g., step counts, heart rate variability, sleep metrics) and structured health records. It is capable of decomposing open-ended user questions into formal analysis plans, executing statistical reasoning, and comparing results against population-level reference data. For example, it can quantify whether physical activity in the past month is associated with improvements in sleep quality.

Domain Expert Agent (DE): The DE agent provides medically contextualized information. It integrates personal health records, demographic information, and wearable signals to generate explanations grounded in medical knowledge. Unlike general-purpose LLMs that may produce plausible but unreliable outputs, the DE agent follows an iterative reasoning-investigation-examination loop, combining authoritative medical resources with personal data. This allows it to provide evidence-based interpretations, such as whether a specific blood pressure measurement is within a safe range for an individual with a particular condition.

Health Coach Agent (HC): The HC agent addresses behavioral change and long-term goal setting. Drawing from established coaching strategies such as motivational interviewing, it conducts multi-turn conversations, identifies user goals, clarifies constraints, and generates structured, personalized plans. For example, it may guide a user through setting a weekly exercise schedule, adapting to individual barriers, and incorporating feedback from progress tracking.

Orchestrator: The orchestrator coordinates these three agents. When a query is received, it assigns a primary agent responsible for generating the main output and supporting agents to provide contextual data or domain knowledge. After collecting the results, the orchestrator runs an iterative reflection loop, checking outputs for coherence and accuracy before synthesizing them into a single response. This ensures that the final output is not merely an aggregation of agent responses but an integrated recommendation.
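To make the division of labor concrete, here is a minimal Python sketch of the orchestrator pattern described above; the class names, keyword routing, and canned string outputs are illustrative assumptions, and a real system would call an LLM (the paper uses the Gemini 2.0 family) inside each agent rather than returning templates.

from dataclasses import dataclass

class DataScienceAgent:
    def run(self, query: str, context: dict) -> str:
        return f"[DS] trend analysis of wearable data for: {query}"

class DomainExpertAgent:
    def run(self, query: str, context: dict) -> str:
        return f"[DE] medically grounded interpretation for: {query}"

class HealthCoachAgent:
    def run(self, query: str, context: dict) -> str:
        return f"[HC] personalized coaching plan for: {query}"

@dataclass
class Orchestrator:
    ds: DataScienceAgent
    de: DomainExpertAgent
    hc: HealthCoachAgent

    def route(self, query: str):
        """Pick a primary agent and supporting agents (toy keyword routing)."""
        if "plan" in query or "goal" in query:
            return self.hc, [self.ds, self.de]
        if "blood" in query or "diagnos" in query:
            return self.de, [self.ds]
        return self.ds, [self.de]

    def answer(self, query: str) -> str:
        primary, supporting = self.route(query)
        context = {"support": [agent.run(query, {}) for agent in supporting]}
        draft = primary.run(query, context)
        # In the paper, an iterative reflection loop re-checks the draft for
        # coherence and accuracy before synthesis; here it is a single pass.
        return draft + "\n" + "\n".join(context["support"])

pha = Orchestrator(DataScienceAgent(), DomainExpertAgent(), HealthCoachAgent())
print(pha.answer("Help me set a weekly exercise goal"))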

How was the PHA evaluated?

The research team conducted one of the most comprehensive evaluations of a health AI system to date. Their evaluation framework involved 10 benchmark tasks, 7,000+ human annotations, and 1,100 hours of assessment from health experts and end-users.

Evaluation of the Data Science Agent

The DS agent was assessed on its ability to generate structured analysis plans and produce correct, executable code. Compared to baseline Gemini models, it demonstrated:

A significant increase in analysis plan quality, improving mean expert-rated scores from 53.7% to 75.6%.

A reduction in critical data handling errors from 25.4% to 11.0%.

An improvement in code pass rates from 58.4% to 75.5% on first attempts, with further gains under iterative self-correction.


Evaluation of the Domain Expert Agent

The DE agent was benchmarked across four capabilities: factual accuracy, diagnostic reasoning, contextual personalization, and multimodal data synthesis. Results include:

Factual knowledge: On over 2,000 board-style exam questions across endocrinology, cardiology, sleep medicine, and fitness, the DE agent achieved 83.6% accuracy, outperforming baseline Gemini (81.8%).

Diagnostic reasoning: On 2,000 self-reported symptom cases, it achieved 46.1% top-1 diagnostic accuracy compared to 41.4% for a state-of-the-art Gemini baseline.

Personalization: In user studies, 72% of participants preferred DE agent responses to baseline outputs, citing higher trustworthiness and contextual relevance.

Multimodal synthesis: In expert clinician reviews of health summaries generated from wearable, lab, and survey data, the DE agent’s outputs were rated more clinically significant, comprehensive, and trustworthy than baseline outputs.

Evaluation of the Health Coach Agent

The HC agent was designed and assessed through expert interviews and user studies. Experts emphasized the need for six coaching capabilities: goal identification, active listening, context clarification, empowerment, SMART (Specific, Measurable, Attainable, Relevant, Time-bound) recommendations, and iterative feedback incorporation.

In evaluations, the HC agent demonstrated improved conversation flow and user engagement compared to baseline models. It avoided premature recommendations and instead balanced information gathering with actionable advice, producing outputs more consistent with expert coaching practices.

Evaluation of the Integrated PHA System

At the system level, the orchestrator and three agents were tested together in open-ended, multimodal conversations reflecting realistic health scenarios. Both experts and end-users rated the integrated Personal Health Agent (PHA) significantly higher than baseline Gemini systems across measures of accuracy, coherence, personalization, and trustworthiness.

How does the PHA contribute to health AI?

The introduction of a multi-agent PHA addresses several limitations of existing health AI systems:

Integration of heterogeneous data: Wearable signals, medical records, and lab test results are analyzed jointly rather than in isolation.

Division of labor: Each sub-agent specializes in a domain where single monolithic models often underperform, e.g., numerical reasoning for DS, clinical grounding for DE, and behavioral engagement for HC.

Iterative reflection: The orchestrator’s review cycle reduces inconsistencies that often arise when multiple outputs are simply concatenated.

Systematic evaluation: Unlike most prior work, which relied on small-scale case studies, the Personal Health Agent (PHA) was validated with a large multimodal dataset (the WEAR-ME study) and extensive expert involvement.

What is the larger significance of Google’s PHA blueprint?

The introduction of Personal Health Agent (PHA) demonstrates that health AI can move beyond single-purpose applications toward modular, orchestrated systems capable of reasoning across multimodal data. It shows that breaking down tasks into specialized sub-agents leads to measurable improvements in robustness, accuracy, and user trust.

It is important to note that this work is a research construct, not a commercial product. The research team emphasized that the PHA design is exploratory and that deployment would require addressing regulatory, privacy, and ethical considerations. Nonetheless, the framework and evaluation results represent a significant advance in the technical foundations of personal health AI.

Conclusion

The Personal Health Agent framework provides a comprehensive design for integrating wearable data, health records, and behavioral coaching through a multi-agent system coordinated by an orchestrator. Its evaluation across 10 benchmarks, using thousands of annotations and expert assessments, shows consistent improvements over baseline LLMs in statistical analysis, medical reasoning, personalization, and coaching interactions.

By structuring health AI as a coordinated system of specialized agents rather than a monolithic model, the PHA demonstrates how accuracy, coherence, and trust can be improved in personal health applications. This work establishes a foundation for further research on agentic health systems and highlights a pathway toward integrated, reliable health reasoning tools.

Check out the PAPER here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framework that Enables Personalized Interactions to Address Individual Health Needs appeared first on MarkTechPost.

Meet Chatterbox Multilingual: An Open-Source Zero-Shot Text To Speech (TTS) Multilingual Model with Emotion Control and Watermarking

Table of contents
What does Chatterbox Multilingual offer?
How does it compare with commercial systems?
How is expressive control implemented?
How does watermarking contribute to responsible AI usage?
What deployment options are available?
What is the significance of Chatterbox Multilingual open release?

Resemble AI has recently released Chatterbox Multilingual, a production-grade open-source Text To Speech (TTS) model designed for zero-shot voice cloning in 23 languages. It is distributed under the MIT license, making it freely available for integration and modification. The system builds on the original Chatterbox framework and adds multilingual capability, expressive controls, and built-in watermarking for traceability.

What does Chatterbox Multilingual offer?

Chatterbox Multilingual enables voice cloning without retraining by leveraging zero-shot learning. You can generate a synthetic voice from a short audio sample that captures the speaker’s characteristics. It supports 23 languages, including Arabic, Hindi, Chinese, Swahili, and other widely spoken languages, giving it coverage across diverse linguistic families.

Apart from basic voice cloning, the model integrates emotion and intensity controls, which allow users to specify not just what is said, but also how it is delivered. The model also includes PerTh watermarking by default to ensure that every output can be authenticated through neural watermark extraction. These features make the model suitable for tasks where both accuracy and security are important.

How does it compare with commercial systems?

Evaluations indicate that Chatterbox Multilingual performs competitively with most commercial TTS models. In blind A/B tests conducted on Podonos, listeners expressed a 63.75% preference for Chatterbox over ElevenLabs. This suggests that in certain conditions, users found Chatterbox outputs closer to natural or accurate speech reproduction.


It is worth noting that while some reported numbers compare performance on specific languages such as German, the only verifiable public metric is the Podonos listener preference result. This makes preference-based benchmarking the most reliable evidence currently available.

How is expressive control implemented?

Chatterbox Multilingual not only reproduces voice identity but also provides tools for controlling delivery style. The model allows adjustment of emotion categories such as happy, sad, or angry, and includes an exaggeration parameter to regulate intensity. This means a cloned voice can be made more enthusiastic, subdued, or dramatic depending on context.

Such flexibility is useful in interactive media, dialog agents, gaming, and assistive technologies, where emotional nuance affects the effectiveness of communication. Rather than producing static or neutral speech, the system can generate output that adapts to context-specific needs.
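As a rough usage sketch, cloning a voice with an emotion setting might look like the following; the module path, class, and parameter names (ChatterboxMultilingualTTS, from_pretrained, generate, language_id, audio_prompt_path, exaggeration) are assumptions inferred from the project description rather than a verified API reference, so consult the Resemble AI repository for the actual interface.

# Hypothetical sketch of zero-shot cloning with expressive control.
import torchaudio
from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # assumed module path

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

wav = model.generate(
    "Bienvenue sur le circuit !",             # text to synthesize
    language_id="fr",                          # one of the 23 supported languages
    audio_prompt_path="speaker_sample.wav",    # short reference clip for zero-shot cloning
    exaggeration=0.7,                          # higher values push a more dramatic delivery
)

# The saved audio carries the always-on PerTh watermark described below.
torchaudio.save("cloned_output.wav", wav, model.sr)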

How does watermarking contribute to responsible AI usage?

Every file generated by Chatterbox Multilingual contains PerTh (Perceptual Threshold) watermarking, a neural technique developed by Resemble AI. The watermark is inaudible to listeners but can be extracted using the provided open-source detector. This enables traceability and verification of generated content, an increasingly important factor as synthetic audio becomes more widespread.

By embedding watermarking at the system level and keeping it always active, Chatterbox helps mitigate risks of misuse without requiring external enforcement mechanisms. This design choice aligns with ongoing discussions about the ethics of generative audio systems.

What deployment options are available?

The open-source release provides a baseline system that can be installed and run by researchers, developers, or hobbyists under the permissive MIT license. For environments where high concurrency, latency targets, or compliance guarantees are necessary, Resemble AI offers a managed variant called Chatterbox Multilingual Pro.

This hosted version supports sub-200 ms latency, fine-tuned voices, and includes SLAs (service-level agreements) along with compliance features required in enterprise deployments. While the open-source project serves as a general foundation, the Pro service is aimed at production workloads with operational constraints.

What is the significance of Chatterbox Multilingual open release?

Chatterbox Multilingual contributes a multilingual, open, and controllable voice cloning system to the speech synthesis community. It integrates zero-shot cloning, expressivity controls, and watermarking in a framework that is both technically advanced and freely available.

Performance studies suggest it is competitive with leading proprietary solutions, offering a practical platform for further research and application development. Its open-source license makes it accessible to a broad range of users, from academic researchers to independent developers, strengthening the ecosystem of multilingual speech synthesis tools.

Check out the GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Meet Chatterbox Multilingual: An Open-Source Zero-Shot Text To Speech (TTS) Multilingual Model with Emotion Control and Watermarking appeared first on MarkTechPost.

Accelerating HPC and AI research in universities with Amazon SageMaker HyperPod

This post was written with Mohamed Hossam of Brightskies.
Research universities engaged in large-scale AI and high-performance computing (HPC) often face significant infrastructure challenges that impede innovation and delay research outcomes. Traditional on-premises HPC clusters come with long GPU procurement cycles, rigid scaling limits, and complex maintenance requirements. These obstacles restrict researchers’ ability to iterate quickly on AI workloads such as natural language processing (NLP), computer vision, and foundation model (FM) training. Amazon SageMaker HyperPod alleviates the undifferentiated heavy lifting involved in building AI models. It helps quickly scale model development tasks such as training, fine-tuning, or inference across a cluster of hundreds or thousands of AI accelerators (such as NVIDIA H100 and A100 GPUs) integrated with preconfigured HPC tools and automated scaling.
In this post, we demonstrate how a research university implemented SageMaker HyperPod to accelerate AI research by using dynamic SLURM partitions, fine-grained GPU resource management, budget-aware compute cost tracking, and multi-login node load balancing—all integrated seamlessly into the SageMaker HyperPod environment.
Solution overview
Amazon SageMaker HyperPod is designed to support large-scale machine learning operations for researchers and ML scientists. The service is fully managed by AWS, removing operational overhead while maintaining enterprise-grade security and performance.
The following architecture diagram illustrates how to access SageMaker HyperPod to submit jobs. End users can use AWS Site-to-Site VPN, AWS Client VPN, or AWS Direct Connect to securely access the SageMaker HyperPod cluster. These connections terminate on the Network Load Balancer that efficiently distributes SSH traffic to login nodes, which are the primary entry points for job submission and cluster interaction. At the core of the architecture is SageMaker HyperPod compute, a controller node that orchestrates cluster operations, and multiple compute nodes arranged in a grid configuration. This setup supports efficient distributed training workloads with high-speed interconnects between nodes, all contained within a private subnet for enhanced security.
The storage infrastructure is built around two main components: Amazon FSx for Lustre, which provides high-performance file system capabilities, and Amazon S3, which provides dedicated storage for datasets and checkpoints. This dual-storage approach provides both fast data access for training workloads and secure persistence of valuable training artifacts.

The implementation consisted of several stages. In the following steps, we demonstrate how to deploy and configure the solution.
Prerequisites
Before deploying Amazon SageMaker HyperPod, make sure the following prerequisites are in place:

AWS configuration:

The AWS Command Line Interface (AWS CLI) configured with appropriate permissions
Cluster configuration files prepared: cluster-config.json and provisioning-parameters.json

Network setup:

A virtual private cloud (VPC) configured for cluster resources.
Security groups with Elastic Fabric Adapter (EFA) communication enabled.
An Amazon FSx for Lustre file system for shared, high-performance storage

An AWS Identity and Access Management (IAM) role with permissions for the following:

Amazon Elastic Compute Cloud (Amazon EC2) instance and Amazon SageMaker cluster management
FSx for Lustre and Amazon Simple Storage Service (Amazon S3) access
Amazon CloudWatch Logs and AWS Systems Manager integration
EFA network configuration

Launch the CloudFormation stack
We launched an AWS CloudFormation stack to provision the necessary infrastructure components, including a VPC and subnet, FSx for Lustre file system, S3 bucket for lifecycle scripts and training data, and IAM roles with scoped permissions for cluster operation. Refer to the Amazon SageMaker HyperPod workshop for CloudFormation templates and automation scripts.
Customize SLURM cluster configuration
To align compute resources with departmental research needs, we created SLURM partitions that reflect the organizational structure, for example NLP, computer vision, and deep learning teams. We defined the custom partitions in slurm.conf and enabled SLURM accounting by configuring slurmdbd and linking usage to departmental accounts and supervisors.
To support fractional GPU sharing and efficient utilization, we enabled the Generic Resource (GRES) configuration. With GPU sharing through GRES, multiple users can access GPUs on the same node without contention. The GRES setup followed the guidelines from the Amazon SageMaker HyperPod workshop.
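For orientation, the excerpt below shows what such partition, accounting, and GRES settings could look like; the node names, counts, GPU type, and partition names are placeholders rather than values from the deployment described in this post.

# Illustrative slurm.conf excerpt (placeholder node and partition names)
GresTypes=gpu
NodeName=hyperpod-gpu-[001-016] Gres=gpu:8 State=UNKNOWN
PartitionName=nlp Nodes=hyperpod-gpu-[001-004] MaxTime=48:00:00 State=UP
PartitionName=vision Nodes=hyperpod-gpu-[005-008] MaxTime=48:00:00 State=UP
PartitionName=deeplearning Nodes=hyperpod-gpu-[009-016] Default=YES MaxTime=72:00:00 State=UP

# Accounting via slurmdbd so usage can be attributed to departmental accounts
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<slurmdbd-host>

# Illustrative gres.conf entry exposing eight GPUs per compute node
NodeName=hyperpod-gpu-[001-016] Name=gpu File=/dev/nvidia[0-7]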
Provision and validate the cluster
We validated the cluster-config.json and provisioning-parameters.json files using the AWS CLI and a SageMaker HyperPod validation script:

$curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/validate-config.py

$pip3 install boto3

$python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning-parameters.json

Then we created the cluster:

$aws sagemaker create-cluster \
--cli-input-json file://cluster-config.json \
--region us-west-2

Implement cost tracking and budget enforcement
To monitor usage and control costs, each SageMaker HyperPod resource (for example, Amazon EC2, FSx for Lustre, and others) was tagged with a unique ClusterName tag. AWS Budgets and AWS Cost Explorer reports were configured to track monthly spending per cluster. Additionally, alerts were set up to notify researchers if they approached their quota or budget thresholds.
This integration helped facilitate efficient utilization and predictable research spending.
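To illustrate, the following Cost Explorer query is one way to pull a month of tag-filtered spend from the command line; the date range and cluster name are placeholders, and the ClusterName tag must first be activated as a cost allocation tag before it appears in results.

$aws ce get-cost-and-usage \
--time-period Start=2025-01-01,End=2025-02-01 \
--granularity MONTHLY \
--metrics UnblendedCost \
--filter '{"Tags": {"Key": "ClusterName", "Values": ["university-hyperpod"]}}'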
Enable load balancing for login nodes
As the number of concurrent users increased, the university adopted a multi-login node architecture. Two login nodes were deployed in EC2 Auto Scaling groups. A Network Load Balancer was configured with target groups to route SSH and Systems Manager traffic. Lastly, AWS Lambda functions enforced session limits per user using Run-As tags with Session Manager, a capability of Systems Manager.
For details about the full implementation, see Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience.
Configure federated access and user mapping
To facilitate secure and seamless access for researchers, the institution integrated AWS IAM Identity Center with their on-premises Active Directory (AD) using AWS Directory Service. This allowed for unified control and administration of user identities and access privileges across SageMaker HyperPod accounts. The implementation consisted of the following key components:

Federated user integration – We mapped AD users to POSIX user names using Session Manager run-as tags, allowing fine-grained control over compute node access
Secure session management – We configured Systems Manager to make sure users access compute nodes using their own accounts, not the default ssm-user
Identity-based tagging – Federated user names were automatically mapped to user directories, workloads, and budgets through resource tags

For full step-by-step guidance, refer to the Amazon SageMaker HyperPod workshop.
This approach streamlined user provisioning and access control while maintaining strong alignment with institutional policies and compliance requirements.
Post-deployment optimizations
To help prevent unnecessary consumption of compute resources by idle sessions, the university configured SLURM with Pluggable Authentication Modules (PAM). This setup enforces automatic logout for users after their SLURM jobs are complete or canceled, supporting prompt availability of compute nodes for queued jobs.
The configuration improved job scheduling throughput by freeing idle nodes immediately and reduced administrative overhead in managing inactive sessions.
Additionally, QoS policies were configured to control resource consumption, limit job durations, and enforce fair GPU access across users and departments. For example (a hedged sacctmgr sketch follows this list):

MaxTRESPerUser – Makes sure GPU or CPU usage per user stays within defined limits
MaxWallDurationPerJob – Helps prevent excessively long jobs from monopolizing nodes
Priority weights – Aligns priority scheduling based on research group or project
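As a rough illustration, QoS limits like these can be created with sacctmgr; the QoS names, numeric limits, and account name below are placeholders rather than the university's actual policy.

$sacctmgr add qos nlp_standard MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=24:00:00
$sacctmgr add qos vision_priority MaxTRESPerUser=gres/gpu=8 Priority=100
$sacctmgr modify account where name=nlp set qos=nlp_standard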

These enhancements facilitated an optimized, balanced HPC environment that aligns with the shared infrastructure model of academic research institutions.
Clean up
To delete the resources and avoid incurring ongoing charges, complete the following steps:

Delete the SageMaker HyperPod cluster:

$aws sagemaker delete-cluster --cluster-name <name>

Delete the CloudFormation stack used for the SageMaker HyperPod infrastructure:

$aws cloudformation delete-stack --stack-name <stack-name> --region <region>

This will automatically remove associated resources, such as the VPC and subnets, FSx for Lustre file system, S3 bucket, and IAM roles. If you created these resources outside of CloudFormation, you must delete them manually.
Conclusion
SageMaker HyperPod provides research universities with a powerful, fully managed HPC solution tailored for the unique demands of AI workloads. By automating infrastructure provisioning, scaling, and resource optimization, institutions can accelerate innovation while maintaining budget control and operational efficiency. Through customized SLURM configurations, GPU sharing using GRES, federated access, and robust login node balancing, this solution highlights the potential of SageMaker HyperPod to transform research computing, so researchers can focus on science, not infrastructure.
For more details on making the most of SageMaker HyperPod, check out the SageMaker HyperPod workshop and explore further blog posts about SageMaker HyperPod.

About the authors
Tasneem Fathima is Senior Solutions Architect at AWS. She supports Higher Education and Research customers in the United Arab Emirates to adopt cloud technologies, improve their time to science, and innovate on AWS.
Mohamed Hossam is a Senior HPC Cloud Solutions Architect at Brightskies, specializing in high-performance computing (HPC) and AI infrastructure on AWS. He supports universities and research institutions across the Gulf and Middle East in harnessing GPU clusters, accelerating AI adoption, and migrating HPC/AI/ML workloads to the AWS Cloud. In his free time, Mohamed enjoys playing video games.

Exploring the Real-Time Race Track with Amazon Nova

This post is co-written by Jake Friedman, President + Co-founder of Wildlife.
Amazon Nova is enhancing sports fan engagement through an immersive Formula 1 (F1)-inspired experience that turns traditional spectators into active participants. This post explores the Real-Time Race Track (RTRT), an interactive experience built using Amazon Nova in Amazon Bedrock, that lets fans design, customize, and share their own racing circuits. We highlight how generative AI capabilities come together to deliver strategic racing insights such as pit timing and tire choices, and interactive features like an AI voice assistant and a retro-style racing poster.
Evolving fan expectations and the technical barriers to real-time, multimodal engagement
Today’s sports audiences expect more than passive viewing—they want to participate, customize, and share. As fan expectations evolve, delivering engaging and interactive experiences has become essential to keeping audiences invested. Static digital content no longer holds attention; fans are drawn to immersive formats that make it possible to influence or co-create aspects of the event. For brands and rights holders, this shift presents both an opportunity and a challenge: how to deliver dynamic, meaningful engagement at scale. Delivering this level of interactivity comes with a unique set of technical challenges. It requires support for multiple modalities—text, speech, image, and data—working together in real time to create a seamless and immersive experience. Because fan-facing experiences are often offered for free, cost-efficiency becomes critical to sustain engagement at scale. And with users expecting instant responses, maintaining low-latency performance across interactions is essential to avoid disrupting the experience.
Creating immersive fan engagement with the RTRT using Amazon Nova
To foster an engaging and immersive experience, we developed the Real-Time Race Track, allowing F1 fans to design their own custom racing circuit using Amazon Nova. You can draw your track in different lengths and shapes while receiving real-time AI recommendations to modify your racing conditions. You can choose any location around the world for your race track and Amazon Nova Pro will use it to generate your track’s name and simulate realistic track conditions using that region’s weather and climate data. When your track is complete, Amazon Nova Pro analyzes the track to produce metrics like top speed and projected lap time, and offers two viable race strategies focused on tire management. You can also consult with Amazon Nova Sonic, a speech-to-speech model, for strategic track design recommendations. The experience culminates with Amazon Nova Canvas generating a retro-inspired racing poster of your custom track design that you can share or download. The following screenshots show some examples of the RTRT interface.

Amazon Nova models are cost-effective and deliver among the best price-performance in their respective class, helping enterprises create scalable fan experiences while managing costs effectively. With fast speech processing and high efficiency, Amazon Nova provides seamless, real-time, multimodal interactions that meet the demands of interactive fan engagement. Additionally, Amazon Nova comes with built-in controls to maintain the safe and responsible use of AI. Combining comprehensive capabilities, cost-effectiveness, low latency, and trusted reliability, Amazon Nova is the ideal solution for applications requiring real-time, dynamic engagement.
Prompts, inputs, and system design behind the RTRT experience
The RTRT uses the multimodal capabilities of Amazon Nova Pro to effectively lead users from a single line path drawing to a fully viable race track design, including strategic racing recommendations and a bold visual representation of their circuit in the style of a retro racing poster.
The following diagram gives an overview of the system architecture.

Prompt engineering plays a crucial role in delivering structured output that can flow seamlessly into the UI, which has been optimized for at-a-glance takeaways that use Amazon Nova Pro to quickly analyze multiple data inputs to accelerate users’ decision making. In the RTRT, this extends to the input images provided to Amazon Nova Pro for vision analysis. Each time the user adds new segments to their racing circuit, a version of the path is relayed to Amazon Nova Pro with visible coordinate markers that produce accurate path analysis (see the following screenshot) and corresponding output data, which can be visually represented back to users with color-coded track sectors.

This is paired with multiple system prompts to define the role of Amazon Nova Pro at each stage of the app, as well as to return responses that are ready to be consumed by the front end.
The following is a prompt example:

The system is designed to analyze the input image of a completed racetrack path outline.
You must always return valid JSON.

The prompts also use sets of examples to produce consistent results across a diverse range of possible track designs and locations:

Using the input data craft a track title for a fictional Formula 1 track.
Use the names of existing tracks from <example/> as a framework of how to format the
title.
The title must not infringe on any existing track names or copyrighted material.
The title should take into account the location of the track when choosing what
language certain components of the track title are in.

This is also a key stage in which to employ responsible use of AI, instructing the model not to generate content that might infringe on existing race tracks or other copyrighted material.

These considerations are essential when working with creative models like Amazon Nova Canvas. Race cars commonly feature liveries that contain a dozen or more sponsor logos. To avoid concern, and to provide the cleanest, most aesthetically appealing retro racing poster designs, Amazon Nova Canvas was given a range of conditioning images that facilitate vehicle accuracy and consistency. The images work in tandem with our prompt for a bold illustration style featuring cinematic angles.
The following is a prompt example:

Use a bold vector-style illustration approach with flat color fills, bold outlines,
stylized gradients. Maintain a vintage racing poster aesthetic with minimal texture.
Position the viewer to emphasize motion and speed.

The following images show the output.

Conclusion
The Real-Time Race Track showcases how generative AI can deliver personalized, interactive moments that resonate with modern sports audiences. Amazon Nova models power each layer of the experience, from speech and image generation to strategy and analysis, delivering rich, low-latency interactions at scale. This collaboration highlights how brands can use Amazon Nova to build tailored and engaging experiences.

About the authors
Raechel Frick is a Sr. Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs.
Anuj Jauhari is a Sr. Product Marketing Manager at AWS, enabling customers to innovate and achieve business impact with generative AI solutions built on Amazon Nova models.
Jake Friedman is the President and Co-founder at Wildlife, where he leads a team launching interactive experiences and content campaigns for global brands. His work has been recognized with the Titanium Grand Prix at the Cannes Lions International Festival of Creativity for “boundary-busting, envy-inspiring work that marks a new direction for the industry and moves it forward.”

Google AI Releases EmbeddingGemma: A 308M Parameter On-Device Embedding Model with State-of-the-Art MTEB Results

EmbeddingGemma is Google’s new open text embedding model optimized for on-device AI, designed to balance efficiency with state-of-the-art retrieval performance.

How compact is EmbeddingGemma compared to other models?

At just 308 million parameters, EmbeddingGemma is lightweight enough to run on mobile devices and offline environments. Despite its size, it performs competitively with much larger embedding models. Inference latency is low (sub-15 ms for 256 tokens on EdgeTPU), making it suitable for real-time applications.

How well does it perform on multilingual benchmarks?

EmbeddingGemma was trained across 100+ languages and achieved the highest ranking on the Massive Text Embedding Benchmark (MTEB) among models under 500M parameters. Its performance rivals or exceeds embedding models nearly twice its size, particularly in cross-lingual retrieval and semantic search.


What is the underlying architecture?

EmbeddingGemma is built on a Gemma 3–based encoder backbone with mean pooling. Importantly, the architecture does not use the multimodal-specific bidirectional attention layers that Gemma 3 applies for image inputs. Instead, EmbeddingGemma employs a standard transformer encoder stack with full-sequence self-attention, which is typical for text embedding models.

This encoder produces 768-dimensional embeddings and supports sequences up to 2,048 tokens, making it well-suited for retrieval-augmented generation (RAG) and long-document search. The mean pooling step ensures fixed-length vector representations regardless of input size.


What makes its embeddings flexible?

EmbeddingGemma employs Matryoshka Representation Learning (MRL). This allows embeddings to be truncated from 768 dimensions down to 512, 256, or even 128 dimensions with minimal loss of quality. Developers can tune the trade-off between storage efficiency and retrieval precision without retraining.

Can it run entirely offline?

Yes. EmbeddingGemma was specifically designed for on-device, offline-first use cases. Since it shares a tokenizer with Gemma 3n, the same embeddings can directly power compact retrieval pipelines for local RAG systems, with privacy benefits from avoiding cloud inference.

What tools and frameworks support EmbeddingGemma?

It integrates seamlessly with:

Hugging Face (transformers, Sentence-Transformers, transformers.js)

LangChain and LlamaIndex for RAG pipelines

Weaviate and other vector databases

ONNX Runtime for optimized deployment across platforms

This ecosystem ensures developers can slot it directly into existing workflows.

How can it be implemented in practice?

(1) Load and Embed

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
emb = model.encode(["example text to embed"])

(2) Adjust Embedding Size

Use the full 768 dims for maximum accuracy or truncate to 512/256/128 dims for lower memory or faster retrieval.

(3) Integrate into RAG

Run similarity search locally (cosine similarity) and feed top results into Gemma 3n for generation. This enables a fully offline RAG pipeline.
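The following is a minimal sketch of steps 2 and 3 using NumPy; the toy document list and the slice-and-renormalize truncation are illustrative assumptions rather than code from the EmbeddingGemma release.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
docs = ["how to reset a router", "best hiking trails near Zurich", "install python on windows"]
query = "wifi keeps disconnecting"

# Step 2: encode, then truncate the Matryoshka embeddings from 768 to 256 dims and renormalize
doc_emb = model.encode(docs)[:, :256]
doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
q_emb = model.encode([query])[:, :256]
q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)

# Step 3: local cosine-similarity search; the top document would then be passed to Gemma 3n for generation
scores = (q_emb @ doc_emb.T).ravel()
print(docs[int(np.argmax(scores))])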

Why EmbeddingGemma?

Efficiency at scale – High multilingual retrieval accuracy in a compact footprint.

Flexibility – Adjustable embedding dimensions via MRL.

Privacy – End-to-end offline pipelines without external dependencies.

Accessibility – Open weights, permissive licensing, and strong ecosystem support.

EmbeddingGemma proves that smaller embedding models can achieve best-in-class retrieval performance while being light enough for offline deployment. It marks an important step toward efficient, privacy-conscious, and scalable on-device AI.

Check out the model and technical details for more information.

Google DeepMind Finds a Fundamental Bug in RAG: Embedding Limits Break …

Retrieval-Augmented Generation (RAG) systems generally rely on dense embedding models that map queries and documents into fixed-dimensional vector spaces. While this approach has become the default for many AI applications, recent research from the Google DeepMind team describes a fundamental architectural limitation that cannot be solved by larger models or better training alone.

What Is the Theoretical Limit of Embedding Dimensions?

At the core of the issue is the representational capacity of fixed-size embeddings. An embedding of dimension d cannot represent all possible combinations of relevant documents once the database grows beyond a critical size. This follows from results in communication complexity and sign-rank theory.

For embeddings of size 512, retrieval breaks down around 500K documents.

For 1024 dimensions, the limit extends to about 4 million documents.

For 4096 dimensions, the theoretical ceiling is 250 million documents.

These values are best-case estimates derived under free embedding optimization, where vectors are directly optimized against test labels. Real-world language-constrained embeddings fail even earlier.

https://arxiv.org/pdf/2508.21038

How Does the LIMIT Benchmark Expose This Problem?

To test this limitation empirically, Google DeepMind Team introduced LIMIT (Limitations of Embeddings in Information Retrieval), a benchmark dataset specifically designed to stress-test embedders. LIMIT has two configurations:

LIMIT full (50K documents): In this large-scale setup, even strong embedders collapse, with recall@100 often falling below 20%.

LIMIT small (46 documents): Despite the simplicity of this toy-sized setup, models still fail to solve the task. Performance varies widely but remains far from reliable:

Promptriever Llama3 8B: 54.3% recall@2 (4096d)

GritLM 7B: 38.4% recall@2 (4096d)

E5-Mistral 7B: 29.5% recall@2 (4096d)

Gemini Embed: 33.7% recall@2 (3072d)

Even with just 46 documents, no embedder reaches full recall, highlighting that the limitation is not dataset size alone but the single-vector embedding architecture itself.

In contrast, BM25, a classical sparse lexical model, does not suffer from this ceiling. Sparse models operate in effectively unbounded dimensional spaces, allowing them to capture combinations that dense embeddings cannot.


Why Does This Matter for RAG?

Current RAG implementations typically assume that embeddings can scale indefinitely with more data. The Google DeepMind research team shows that this assumption is incorrect: embedding size inherently constrains retrieval capacity. This affects:

Enterprise search engines handling millions of documents.

Agentic systems that rely on complex logical queries.

Instruction-following retrieval tasks, where queries define relevance dynamically.

Even advanced benchmarks like MTEB fail to capture these limitations because they test only a narrow slice of possible query-document combinations.

What Are the Alternatives to Single-Vector Embeddings?

The research team suggested that scalable retrieval will require moving beyond single-vector embeddings:

Cross-Encoders: Achieve perfect recall on LIMIT by directly scoring query-document pairs, but at the cost of high inference latency.

Multi-Vector Models (e.g., ColBERT): Offer more expressive retrieval by assigning multiple vectors per sequence, improving performance on LIMIT tasks.

Sparse Models (BM25, TF-IDF, neural sparse retrievers): Scale better in high-dimensional search but lack semantic generalization.

The key insight is that architectural innovation is required, not simply larger embedders.
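To make the multi-vector alternative concrete, the following is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring; the random arrays stand in for per-token embeddings that a real multi-vector encoder would produce, so treat this as an illustration of the scoring rule only.

import numpy as np

rng = np.random.default_rng(0)

def maxsim_score(query_tokens, doc_tokens):
    # Late interaction: for each query token, take its best-matching document token,
    # then sum those maxima. This is the ColBERT scoring rule.
    sims = query_tokens @ doc_tokens.T          # (num_query_tokens, num_doc_tokens)
    return sims.max(axis=1).sum()

query = rng.normal(size=(4, 128))                        # 4 query tokens, 128-dim each
docs = [rng.normal(size=(30, 128)) for _ in range(3)]    # 3 documents, 30 tokens each

scores = [maxsim_score(query, d) for d in docs]
print(int(np.argmax(scores)), scores)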

What is the Key Takeaway?

The research team’s analysis shows that dense embeddings, despite their success, are bound by a mathematical limit: they cannot capture all possible relevance combinations once corpus sizes exceed limits tied to embedding dimensionality. The LIMIT benchmark demonstrates this failure concretely:

On LIMIT full (50K docs): recall@100 drops below 20%.

On LIMIT small (46 docs): even the best models max out at ~54% recall@2.

Classical techniques like BM25, or newer architectures such as multi-vector retrievers and cross-encoders, remain essential for building reliable retrieval engines at scale.

Check out the paper for more information.

What is OLMoASR and How Does It Compare to OpenAI’s Whisper in Speec …

The Allen Institute for AI (AI2) has released OLMoASR, a suite of open automatic speech recognition (ASR) models that rival closed-source systems such as OpenAI’s Whisper. Beyond just releasing model weights, AI2 has published training data identifiers, filtering steps, training recipes, and benchmark scripts—an unusually transparent move in the ASR space. This makes OLMoASR one of the most transparent and extensible platforms for speech recognition research.

Why Open Automatic Speech Recognition (ASR)?

Most speech recognition models available today—whether from OpenAI, Google, or Microsoft—are only accessible via APIs. While these services provide high performance, they operate as black boxes: the training datasets are opaque, the filtering methods are undocumented, and the evaluation protocols are not always aligned with research standards.

This lack of transparency poses challenges for reproducibility and scientific progress. Researchers cannot verify claims, test variations, or adapt models to new domains without re-building large datasets themselves. OLMoASR addresses this problem by opening the entire pipeline. The release is not just about enabling practical transcription—it’s about pushing ASR toward a more open, scientific foundation.

Model Architecture and Scaling

OLMoASR uses a transformer encoder–decoder architecture, the dominant paradigm in modern ASR.

The encoder ingests audio waveforms and produces hidden representations.

The decoder generates text tokens conditioned on the encoder’s outputs.

This design is similar to Whisper, but OLMoASR makes the implementation fully open.

The family of models covers six sizes, all trained on English:

tiny.en – 39M parameters, designed for lightweight inference

base.en – 74M parameters

small.en – 244M parameters

medium.en – 769M parameters

large.en-v1 – 1.5B parameters, trained on 440K hours

large.en-v2 – 1.5B parameters, trained on 680K hours

This range allows developers to trade off between inference cost and accuracy. Smaller models are suited for embedded devices or real-time transcription, while the larger models maximize accuracy for research or batch workloads.

Data: From Web Scraping to Curated Mixes

One of the core contributions of OLMoASR is the open release of training datasets, not just the models.

OLMoASR-Pool (~3M hours)

This massive collection contains weakly supervised speech paired with transcripts scraped from the web. It includes around 3 million hours of audio and 17 million text transcripts. Like Whisper’s original dataset, it is noisy, containing misaligned captions, duplicates, and transcription errors.

OLMoASR-Mix (~1M hours)

To address quality issues, AI2 applied rigorous filtering:

Alignment heuristics to ensure audio and transcripts match

Fuzzy deduplication to remove repeated or low-diversity examples

Cleaning rules to eliminate duplicate lines and mismatched text

The result is a high-quality, 1M-hour dataset that boosts zero-shot generalization—critical for real-world tasks where data may differ from training distributions.

This two-tiered data strategy mirrors practices in large-scale language model pretraining: use vast noisy corpora for scale, then refine with filtered subsets to improve quality.

Performance Benchmarks

AI2 benchmarked OLMoASR against Whisper across both short-form and long-form speech tasks, using datasets like LibriSpeech, TED-LIUM3, Switchboard, AMI, and VoxPopuli.

Medium Model (769M)

12.8% WER (word error rate) on short-form speech

11.0% WER on long-form speech

This nearly matches Whisper-medium.en, which achieves 12.4% and 10.5% respectively.

Large Models (1.5B)

large.en-v1 (440K hours): 13.0% WER short-form vs Whisper large-v1 at 12.2%

large.en-v2 (680K hours): 12.6% WER, closing the gap to less than 0.5%

Smaller Models

Even the tiny and base versions perform competitively:

tiny.en: ~20.5% WER short-form, ~15.6% WER long-form

base.en: ~16.6% WER short-form, ~12.9% WER long-form

This gives developers flexibility to choose models based on compute and latency requirements.
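Word error rate can be reproduced on your own transcripts with a few lines of code; the jiwer package used below is a common third-party choice and is not part of the OLMoASR release.

import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")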

How to use?

Transcribing audio takes just a few lines of code:

import olmoasr

model = olmoasr.load_model("medium", inference=True)
result = model.transcribe("audio.mp3")
print(result)

The output includes both the transcription and time-aligned segments, making it useful for captioning, meeting transcription, or downstream NLP pipelines.

Fine-Tuning and Domain Adaptation

Since AI2 provides full training code and recipes, OLMoASR can be fine-tuned for specialized domains:

Medical speech recognition – adapting models on datasets like MIMIC-III or proprietary hospital recordings

Legal transcription – training on courtroom audio or legal proceedings

Low-resource accents – fine-tuning on dialects not well covered in OLMoASR-Mix

This adaptability is critical: ASR performance often drops when models are used in specialized domains with domain-specific jargon. Open pipelines make domain adaptation straightforward.

Applications

OLMoASR opens up exciting opportunities across academic research and real-world AI development:

Educational Research: Researchers can explore the intricate relationships between model architecture, dataset quality, and filtering techniques to understand their effects on speech recognition performance.

Human-Computer Interaction: Developers gain the freedom to embed speech recognition capabilities directly into conversational AI systems, real-time meeting transcription platforms, and accessibility applications—all without dependency on proprietary APIs or external services.

Multimodal AI Development: When combined with large language models, OLMoASR enables the creation of advanced multimodal assistants that can seamlessly process spoken input and generate intelligent, contextually-aware responses.

Research Benchmarking: The open availability of both training data and evaluation metrics positions OLMoASR as a standardized reference point, allowing researchers to compare new approaches against a consistent, reproducible baseline in future ASR studies.

Conclusion

The release of OLMoASR shows that high-quality speech recognition can be developed and released in a way that prioritizes transparency and reproducibility. While the models are currently limited to English and still demand significant compute for training, they provide a solid foundation for adaptation and extension. This release sets a clear reference point for future work in open ASR and makes it easier for researchers and developers to study, benchmark, and apply speech recognition models in different domains.

Check out the model on Hugging Face, the GitHub page, and the technical details for more information.

Build character consistent storyboards using Amazon Nova in Amazon Bed …

Although careful prompt crafting can yield good results, achieving professional-grade visual consistency often requires adapting the underlying model itself. Building on the prompt engineering and character development approach covered in Part 1 of this two-part series, we now push the consistency level for specific characters by fine-tuning an Amazon Nova Canvas foundation model (FM). Through fine-tuning techniques, creators can instruct the model to maintain precise control over character appearances, expressions, and stylistic elements across multiple scenes.
In this post, we take an animated short film, Picchu, produced by FuzzyPixel from Amazon Web Services (AWS), prepare training data by extracting key character frames, and fine-tune a character-consistent model for the main character Mayu and her mother, so we can quickly generate storyboard concepts for new sequels like the following images.

Solution overview
To implement an automated workflow, we propose the following comprehensive solution architecture that uses AWS services for an end-to-end implementation.

The workflow consists of the following steps:

The user uploads a video asset to an Amazon Simple Storage Service (Amazon S3) bucket.
Amazon Elastic Container Service (Amazon ECS) is triggered to process the video asset.
Amazon ECS downsamples the frames, selects those containing the character, and then center-crops them to produce the final character images.
Amazon ECS invokes an Amazon Nova model (Amazon Nova Pro) from Amazon Bedrock to create captions from the images.
Amazon ECS writes the image captions and metadata to the S3 bucket.
The user uses a notebook environment in Amazon SageMaker AI to invoke the model training job.
The user fine-tunes a custom Amazon Nova Canvas model by invoking Amazon Bedrock create_model_customization_job and create_model_provisioned_throughput API calls to create a custom model available for inference.

This workflow is structured in two distinct phases. The initial phase, in Steps 1–5, focuses on preparing the training data. In this post, we walk through an automated pipeline to extract images from an input video and then generate labeled training data. The second phase, in Steps 6–7, focuses on fine-tuning the Amazon Nova Canvas model and performing test inference using the custom-trained model. For these latter steps, we provide the preprocessed image data and comprehensive example code in the following GitHub repository to guide you through the process.
Prepare the training data
Let’s begin with the first phase of our workflow. In our example, we build an automated video object/character extraction pipeline to extract high-resolution images with accurate caption labels using the following steps.
Creative character extraction
We recommend first sampling video frames at fixed intervals (for example, 1 frame per second). Then, apply Amazon Rekognition label detection and face collection search to identify frames and characters of interest. Label detection can identify over 2,000 unique labels and locate their positions within frames, making it ideal for initial detection of general character categories or non-human characters. To distinguish between different characters, we then use the Amazon Rekognition feature to search faces in a collection. This feature identifies and tracks characters by matching their faces against a pre-populated face collection. If these two approaches aren’t precise enough, we can use Amazon Rekognition Custom Labels to train a custom model for detecting specific characters. The following diagram illustrates this workflow.
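As a concrete sketch of the detection step described above, the following uses the Amazon Rekognition detect_labels and search_faces_by_image APIs via boto3; the frame path, bucket, and collection ID are illustrative placeholders, and the thresholds are example values to tune for your footage.

import boto3

rekognition = boto3.client("rekognition")

# Illustrative frame location; in the pipeline this comes from the downsampled video frames
frame = {"S3Object": {"Bucket": "media-ip-dataset", "Name": "frames/frame_0042.jpg"}}

# Label detection: does this frame contain a character category of interest?
labels = rekognition.detect_labels(Image=frame, MaxLabels=10, MinConfidence=80)
label_names = [label["Name"] for label in labels["Labels"]]

# Face search: which known character (if any) appears in the frame?
# Assumes a face collection named "picchu-characters" was populated beforehand.
if "Person" in label_names:
    matches = rekognition.search_faces_by_image(
        CollectionId="picchu-characters",
        Image=frame,
        FaceMatchThreshold=90,
        MaxFaces=1,
    )
    for match in matches.get("FaceMatches", []):
        print(match["Face"]["ExternalImageId"], match["Similarity"])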

After detection, we center-crop each character with appropriate pixel padding and then run a deduplication algorithm using the Amazon Titan Multimodal Embeddings model to remove semantically similar images above a threshold value. Doing so helps us build a diverse dataset because redundant or nearly identical frames could lead to model overfitting (when a model learns the training data too precisely, including its noise and fluctuations, making it perform poorly on new, unseen data). We can calibrate the similarity threshold to fine-tune what we consider to be identical images, so we can better control the balance between dataset diversity and redundancy elimination.
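A minimal sketch of the deduplication step follows; the Titan Multimodal Embeddings model ID and request/response shapes are assumptions to verify against the current Amazon Bedrock documentation, and the 0.95 threshold is only an example starting point.

import base64
import json

import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")

def titan_image_embedding(image_path):
    # Assumed model ID and request shape for Titan Multimodal Embeddings; verify against current docs
    with open(image_path, "rb") as f:
        body = json.dumps({"inputImage": base64.b64encode(f.read()).decode("utf-8")})
    response = bedrock_runtime.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
    return np.array(json.loads(response["body"].read())["embedding"])

def deduplicate(image_paths, threshold=0.95):
    # Keep an image only if its cosine similarity to every previously kept image is below the threshold
    kept, kept_embeddings = [], []
    for path in image_paths:
        emb = titan_image_embedding(path)
        emb = emb / np.linalg.norm(emb)
        if all(float(emb @ existing) < threshold for existing in kept_embeddings):
            kept.append(path)
            kept_embeddings.append(emb)
    return kept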
Data labeling
We generate captions for each image using Amazon Nova Pro in Amazon Bedrock and then upload the image and label manifest file to an Amazon S3 location. This process focuses on two critical aspects of prompt engineering: character description to help the FM identify and name the characters based on their unique attributes, and varied description generation that avoids repetitive patterns in the caption (for example, “an animated character”). The following is an example prompt template used during our data labeling process:

system_prompt = """
You are an expert image description specialist who creates concise, natural alt
text that makes visual content accessible while maintaining clarity and focus.
Your task is to analyze the provided image and provide a creative description
(20-30 words) that emphasizes the Three main characters, capturing the essential
elements of their interaction while avoiding unnecessary details.
"""

prompt = """

1. Identify the main characters in the image: Character 1, Character 2, and
Character 3 at least one will be in the picture so provide at a minimum a
description with at least one character name.
– “Character 1” describe the first character, key traits, background, attributes.
– “Character 2” describe the second character, key traits, background, attributes.
– “Character 3” describe the third character, key traits, background, attributes.
2. Just state their name WITHOUT adding any standard characteristics.
3. Only capture visual element outside the standard characteristics
4. Capture the core interaction between them
5. Include only contextual details that are crucial for understanding the scene
6. Create a natural, flowing description using everyday language

Here are some examples

[Identify the main characters]
[Assessment of their primary interaction]
[Selection of crucial contextual elements]
[Crafting of concise, natural description]

{
“alt_text”: “[Concise, natural description focusing on the main characters]”
}

Note: Provide only the JSON object as the final response.
"""

The data labeling output is formatted as a JSONL file, where each line pairs an image reference Amazon S3 path with a caption generated by Amazon Nova Pro. This JSONL file is then uploaded to Amazon S3 for training. The following is an example of the file:

{"image_ref": "s3://media-ip-dataset/characters/blue_character_01.jpg", "alt_text": "This animated character features a round face with large expressive eyes. The character has a distinctive blue color scheme with a small tuft of hair on top. The design is stylized with clean lines and a minimalist approach typical of modern animation."}
{"image_ref": "s3://media-ip-dataset/props/iconic_prop_series1.jpg", "alt_text": "This object appears to be an iconic prop from the franchise. It has a metallic appearance with distinctive engravings and a unique shape that fans would immediately recognize. The lighting highlights its dimensional qualities and fine details that make it instantly identifiable."}

Human verification
For enterprise use cases, we recommend incorporating a human-in-the-loop process to verify labeled data before proceeding with model training. This verification can be implemented using Amazon Augmented AI (Amazon A2I), a service that helps annotators verify both image and caption quality. For more details, refer to Get Started with Amazon Augmented AI.
Fine-tune Amazon Nova Canvas
Now that we have the training data, we can fine-tune the Amazon Nova Canvas model in Amazon Bedrock. Amazon Bedrock requires an AWS Identity and Access Management (IAM) service role to access the S3 bucket where you stored your model customization training data. For more details, see Model customization access and security. You can perform the fine-tuning task directly on the Amazon Bedrock console or use the Boto3 API. We explain both approaches in this post, and you can find the end-to-end code sample in picchu-finetuning.ipynb.
Create a fine-tuning job on the Amazon Bedrock console
Let’s start by creating an Amazon Nova Canvas fine-tuning job on the Amazon Bedrock console:

On the Amazon Bedrock console, in the navigation pane, choose Custom models under Foundation models.
Choose Customize model and then Create Fine-tuning job.

On the Create Fine-tuning job details page, choose the model you want to customize and enter a name for the fine-tuned model.
In the Job configuration section, enter a name for the job and optionally add tags to associate with it.
In the Input data section, enter the Amazon S3 location of the training dataset file.
In the Hyperparameters section, enter values for hyperparameters, as shown in the following screenshot.

In the Output data section, enter the Amazon S3 location where Amazon Bedrock should save the output of the job.
Choose Fine-tune model job to begin the fine-tuning process.

This hyperparameter combination yielded good results during our experimentation. In general, increasing the learning rate makes the model train more aggressively, which often presents an interesting trade-off: we might achieve character consistency more quickly, but it might impact overall image quality. We recommend a systematic approach to adjusting hyperparameters. Start with the suggested batch size and learning rate, and try increasing or decreasing the number of training steps first. If the model struggles to learn your dataset even after 20,000 steps (the maximum allowed in Amazon Bedrock), then we suggest either increasing the batch size or adjusting the learning rate upward. These adjustments, though subtle, can make a significant difference in the model’s performance. For more details about the hyperparameters, refer to Hyperparameters for Creative Content Generation models.
Create a fine-tuning job using the Python SDK
The following Python code snippet creates the same fine-tuning job using the create_model_customization_job API:

import boto3

bedrock = boto3.client("bedrock")
jobName = "picchu-canvas-v0"

# Set hyperparameters
hyperParameters = {
    "stepCount": "14000",
    "batchSize": "64",
    "learningRate": "0.000001",
}

# Create the fine-tuning job
response_ft = bedrock.create_model_customization_job(
    jobName=jobName,
    customModelName=jobName,
    roleArn=roleArn,
    baseModelIdentifier="amazon.nova-canvas-v1:0",
    hyperParameters=hyperParameters,
    trainingDataConfig={"s3Uri": training_path},
    outputDataConfig={"s3Uri": f"s3://{bucket}/{prefix}"}
)

jobArn = response_ft.get("jobArn")
print(jobArn)

When the job is complete, you can retrieve the new customModelARN using the following code:

custom_model_arn = bedrock.list_model_customization_jobs(
    nameContains=jobName
)["modelCustomizationJobSummaries"][0]["customModelArn"]

Deploy the fine-tuned model
With the preceding hyperparameter configuration, this fine-tuning job might take up to 12 hours to complete. When it’s complete, you should see a new model in the custom models list. You can then create provisioned throughput to host the model. For more details on provisioned throughput and different commitment plans, see Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.
Deploy the model on the Amazon Bedrock console
To deploy the model from the Amazon Bedrock console, complete the following steps:

On the Amazon Bedrock console, choose Custom models under Foundation models in the navigation pane.
Select the new custom model and choose Purchase provisioned throughput.

In the Provisioned Throughput details section, enter a name for the provisioned throughput.
Under Select model, choose the custom model you just created.
Then specify the commitment term and model units.

After you purchase provisioned throughput, a new model Amazon Resource Name (ARN) is created. You can invoke this ARN when the provisioned throughput is in service.

Deploy the model using the Python SDK
The following Python code snippet creates provisioned throughput using the create_provisioned_model_throughput API:

custom_model_name = "picchu-canvas-v0"

# Create the provisioned throughput and retrieve the provisioned model ARN
provisioned_model_id = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    # Create a name for your provisioned throughput model
    provisionedModelName=custom_model_name,
    modelId=custom_model_arn
)["provisionedModelArn"]
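Provisioning can take several minutes. A simple polling loop like the following sketch, which uses the get_provisioned_model_throughput API, waits until the endpoint is in service before you invoke it; the 60-second interval is an arbitrary choice.

import time

# Poll until the provisioned throughput endpoint leaves the "Creating" state
while True:
    status = bedrock.get_provisioned_model_throughput(
        provisionedModelId=provisioned_model_id
    )["status"]
    print(f"Provisioned throughput status: {status}")
    if status != "Creating":
        break
    time.sleep(60)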

Test the fine-tuned model
When the provisioned throughput is live, we can use the following code snippet to test the custom model and experiment with generating some new images for a sequel to Picchu:

import base64
import io
import json
import random

import boto3
from PIL import Image

bedrock_runtime = boto3.client("bedrock-runtime")

def decode_base64_image(img_b64):
    return Image.open(io.BytesIO(base64.b64decode(img_b64)))

def generate_image(prompt,
                   negative_prompt="text, ugly, blurry, distorted, low quality, pixelated, watermark, text, deformed",
                   num_of_images=3,
                   seed=1):
    """
    Generate an image using Amazon Nova Canvas.
    """

    image_gen_config = {
        "numberOfImages": num_of_images,
        "quality": "premium",
        "width": 1024,   # Maximum resolution 2048 x 2048
        "height": 1024,  # 1:1 ratio
        "cfgScale": 8.0,
        "seed": seed,
    }

    # Prepare the request body
    request_body = {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {
            "text": prompt,
            "negativeText": negative_prompt,  # List things to avoid
        },
        "imageGenerationConfig": image_gen_config
    }

    response = bedrock_runtime.invoke_model(
        modelId=provisioned_model_id,
        body=json.dumps(request_body)
    )

    # Parse the response
    response_body = json.loads(response["body"].read())

    if "images" in response_body:
        # Extract the images
        return [decode_base64_image(img) for img in response_body["images"]]
    else:
        return

seed = random.randint(1, 858993459)
print(f"seed: {seed}")

images = generate_image(prompt=prompt, seed=seed)

Mayu’s face shows a mix of nervousness and determination. Mommy kneels beside her, gently holding her. A landscape is visible in the background.
A steep cliff face with a long wooden ladder extending downwards. Halfway down the ladder is Mayu with a determined expression on her face. Mayu’s small hands grip the sides of the ladder tightly as she carefully places her feet on each rung. The surrounding environment shows a rugged, mountainous landscape.
Mayu standing proudly at the entrance of a simple school building. Her face beams with a wide smile, expressing pride and accomplishment.

Clean up
To avoid incurring AWS charges after you are done testing, complete the cleanup steps in picchu-finetuning.ipynb and delete the following resources:

Amazon SageMaker Studio domain
Fine-tuned Amazon Nova model and provisioned throughput endpoint

Conclusion
In this post, we demonstrated how to elevate character and style consistency in storyboarding from Part 1 by fine-tuning Amazon Nova Canvas in Amazon Bedrock. Our comprehensive workflow combines automated video processing, intelligent character extraction using Amazon Rekognition, and precise model customization using Amazon Bedrock to create a solution that maintains visual fidelity and dramatically accelerates the storyboarding process. By fine-tuning the Amazon Nova Canvas model on specific characters and styles, we’ve achieved a level of consistency that surpasses standard prompt engineering, so creative teams can produce high-quality storyboards in hours rather than weeks. Start experimenting with Nova Canvas fine-tuning today, so you can also elevate your storytelling with better character and style consistency.

About the authors
Dr. Achin Jain is a Senior Applied Scientist at Amazon AGI, where he works on building multi-modal foundation models. He brings over 10+ years of combined industry and academic research experience. He has led the development of several modules for Amazon Nova Canvas and Amazon Titan Image Generator, including supervised fine-tuning (SFT), model customization, instant customization, and guidance with color palette.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.
Randy Ridgley is a Principal Solutions Architect focused on real-time analytics and AI. With expertise in designing data lakes and pipelines, Randy helps organizations transform diverse data streams into actionable insights. He specializes in IoT solutions, analytics, and infrastructure-as-code implementations. As an open-source contributor and technical leader, Randy provides deep technical knowledge to deliver scalable data solutions across enterprise environments.

Build character consistent storyboards using Amazon Nova in Amazon Bed …

The art of storyboarding stands as the cornerstone of modern content creation, weaving its essential role through filmmaking, animation, advertising, and UX design. Though traditionally, creators have relied on hand-drawn sequential illustrations to map their narratives, today’s AI foundation models (FMs) are transforming this landscape. FMs like Amazon Nova Canvas and Amazon Nova Reel offer capabilities in transforming text and image inputs into professional-grade visuals and short clips that promise to revolutionize preproduction workflows.
This technological leap forward, however, presents its own set of challenges. Although these models excel at generating diverse concepts rapidly—a boon for creative exploration—maintaining consistent character designs and stylistic coherence across scenes remains a significant hurdle. Even subtle modifications to prompts or model configurations can yield dramatically different visual outputs, potentially disrupting narrative continuity and creating additional work for content creators.
To address these challenges, we’ve developed this two-part series exploring practical solutions for achieving visual consistency. In Part 1, we deep dive into prompt engineering and character development pipelines, sharing tested prompt patterns that deliver reliable, consistent results with Amazon Nova Canvas and Amazon Nova Reel. Part 2 explores techniques like fine-tuning Amazon Nova Canvas to achieve exceptional visual consistency and precise character control.

Consistent character design with Amazon Nova Canvas
The foundation of effective storyboarding begins with establishing well-defined character designs. Amazon Nova Canvas offers several powerful techniques to create and maintain character consistency throughout your visual narrative. To help you implement these techniques in your own projects, we’ve provided comprehensive code examples and resources in our GitHub repository. We encourage you to follow along as we walk through each step in detail. If you’re new to Amazon Nova Canvas, we recommend first reviewing Generating images with Amazon Nova to familiarize yourself with the basic concepts.
Basic text prompting
Amazon Nova Canvas transforms text descriptions into visual representations. Unlike large language models (LLMs), image generation models don’t interpret commands or engage in reasoning—they respond best to descriptive captions. Including specific details in your prompts, such as physical attributes, clothing, and styling elements, directly influences the generated output.
For example, “A 7-year-old Peruvian girl with dark hair in two low braids wearing a school uniform” provides clear visual elements for the model to generate an initial character concept, as shown in the following example image.

Visual style implementation
Consistency in storyboarding requires both character features and unified visual style. Our approach separates style information into two key components in the prompt:

Style description – An opening phrase that defines the visual medium (for example, “A graphic novel style illustration of”)
Style details – A closing phrase that specifies artistic elements (for example, “Bold linework, dramatic shadows, flat color palettes”)

This structured technique enables exploration of various artistic styles, including graphic novels, sketches, and 3D illustrations, while maintaining character consistency throughout the storyboard. The following is an example prompt template and some style information you can experiment with:

{style_description} A 7 year old Peruvian girl with dark hair in two low braids wearing a school uniform. {style_details}

styles = [
    {
        "name": "graphic-novel",
        "description": "A graphic novel style illustration of",
        "details": "Bold linework, dramatic shadows, and flat color palettes. Use high contrast lighting and cinematic composition typical of comic book panels. Include expressive line work to convey emotion and movement.",
    },
    {
        "name": "sketch",
        "description": "A simple black and white line sketch of",
        "details": "Rough, sketch-like lines create a storyboard aesthetic. High contrast. No color",
    },
    {
        "name": "digital-illustration",
        "description": "A 3D digital drawing of",
        "details": "High contrast. Rounded character design. Smooth rendering. Soft texture. Luminous lighting",
    },
]

Character variation through seed values
The seed parameter serves as a tool for generating character variations while adhering to the same prompt. By keeping the text description constant and varying only the seed value, creators can explore multiple interpretations of their character design without starting from scratch, as illustrated in the following example images.

Seed = 1
Seed = 20
Seed = 57
Seed = 139
Seed = 12222

Prompt adherence control with cfgScale
The cfgScale parameter is another tool for maintaining character consistency, controlling how strictly Amazon Nova Canvas follows your prompt. Operating on a scale from 1.1–10, lower values give the model more creative freedom and higher values enforce strict prompt adherence. The default value of 6.5 typically provides an optimal balance, but as demonstrated in the following images, finding the right setting is crucial. Too low a value can result in inconsistent character representations, whereas too high a value might overemphasize prompt elements at the cost of natural composition.

Seed = 57, cfgScale = 1.1
Seed = 57, cfgScale = 3.5
Seed = 57, cfgScale = 6.5
Seed = 57, cfgScale = 8.0
Seed = 57, cfgScale = 10

Scene integration with consistent parameters
Now we can put these techniques together to test for character consistency across different narrative contexts, as shown in the following example images. We maintain consistent input for style, seed, and cfgScale, varying only the scene description to make sure character remains recognizable throughout the scene sequences.

Seed = 57, Cfg_scale: 6.5
Seed = 57, Cfg_scale: 6.5
Seed = 57, Cfg_scale: 6.5

A graphic novel style illustration of a 7 year old Peruvian girl with dark hair in two low braids wearing a school uniform riding a bike on a mountain pass Bold linework, dramatic shadows, and flat color palettes. Use high contrast lighting and cinematic composition typical of comic book panels. Include expressive line work to convey emotion and movement.
A graphic novel style illustration of a 7 year old Peruvian girl with dark hair in two low braids wearing a school uniform walking on a path through tall grass in the Andes Bold linework, dramatic shadows, and flat color palettes. Use high contrast lighting and cinematic composition typical of comic book panels. Include expressive line work to convey emotion and movement.
A graphic novel style illustration of a 7 year old Peruvian girl with dark hair in two low braids wearing a school uniform eating ice cream at the beach Bold linework, dramatic shadows, and flat color palettes. Use high contrast lighting and cinematic composition typical of comic book panels. Include expressive line work to convey emotion and movement.
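A minimal sketch of this loop is shown below; it reuses the TEXT_IMAGE request shape from the Amazon Nova Canvas API as used elsewhere in this series, and the scene list, client setup, and fixed seed/cfgScale values are illustrative assumptions.

import base64
import io
import json

import boto3
from PIL import Image

bedrock_runtime = boto3.client("bedrock-runtime")

style_description = "A graphic novel style illustration of"
character = "a 7 year old Peruvian girl with dark hair in two low braids wearing a school uniform"
style_details = ("Bold linework, dramatic shadows, and flat color palettes. Use high contrast "
                 "lighting and cinematic composition typical of comic book panels.")
scenes = ["riding a bike on a mountain pass",
          "walking on a path through tall grass in the Andes",
          "eating ice cream at the beach"]

panels = []
for scene in scenes:
    body = {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": f"{style_description} {character} {scene} {style_details}"},
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "width": 1024,
            "height": 1024,
            "cfgScale": 6.5,   # same adherence setting for every panel
            "seed": 57,        # same seed for every panel to keep the character consistent
        },
    }
    response = bedrock_runtime.invoke_model(modelId="amazon.nova-canvas-v1:0", body=json.dumps(body))
    img_b64 = json.loads(response["body"].read())["images"][0]
    panels.append(Image.open(io.BytesIO(base64.b64decode(img_b64))))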

Storyboard development pipeline
Building upon the character consistency techniques we’ve discussed, we can now implement an end-to-end storyboard development pipeline that transforms written scene and character descriptions into visually coherent storyboards. This systematic approach uses our established parameters for style descriptions, seed values, and cfgScale values to provide character consistency while adapting to different narrative contexts. The following are some example scene and character descriptions:

“scenes”:[
{
“description”: “Mayu stands at the edge of a mountainous path, clutching
a book. Her mother, Maya, kneels beside her, offering words of encouragement
and handing her the book. Mayu looks nervous but determined as she prepares
to start her journey.”
},
{
“description”: “Mayu encounters a ‘danger’ sign with a drawing of a
snake. She looks scared, but then remembers her mother’s words. She takes a
deep breath, looks at her book for reassurance, and then searches for a stick
on the ground.”
},
{
“description”: “Mayu bravely makes her way through tall grass, swinging
her stick and making noise to scare off potential snakes. Her face shows a
mix of fear and courage as she pushes forward on her journey.”
}
],
“characters”:{
“Mayu”: “A 7-year-old Peruvian girl with dark hair in two low braids wearing a
school uniform”,
“Maya”: “An older Peruvian woman with long dark hair tied back in a bun, wearing
traditional Peruvian clothing”
}

Our pipeline uses Amazon Nova Lite to first craft optimized image prompts incorporating our established best practices, which are then passed to Amazon Nova Canvas for image generation. By setting numberOfImages higher (typically three variations), while maintaining consistent seed and cfgScale values, we give creators multiple options that preserve character consistency. We used the following prompt for Amazon Nova Lite to generate optimized image prompts:

Describe an image that best represents the scene described. Here are some examples:
scene: Rosa is in the kitchen, rummaging through the pantry, looking for a snack. She
hears a strange noise coming from the back of the pantry and becomes startled.
imagery: A dimly lit pantry with shelves stocked with various food items, and Rosa
peering inside, her face expressing curiosity and a hint of fear.
scene: Rosa says goodbye to her mother, Maya. Maya offers her words of encouragement.
imagery: A wide shot of Rosa’s determined face, facing Maya and receiving a small wrapped
gift.
Only describe the imagery. Use no more than 60 words.
scene: {scene_description}
imagery:
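As a sketch, this prompt-optimization step can be called through the Amazon Bedrock Converse API roughly as follows; the Amazon Nova Lite model ID, the helper name, and the inference settings are assumptions for illustration and should be checked against your account and Region.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def optimize_image_prompt(scene_description, prompt_template):
    # prompt_template is the few-shot "scene: ... imagery: ..." template shown above
    response = bedrock_runtime.converse(
        modelId="amazon.nova-lite-v1:0",   # assumed Amazon Nova Lite model ID
        messages=[{
            "role": "user",
            "content": [{"text": prompt_template.format(scene_description=scene_description)}],
        }],
        inferenceConfig={"maxTokens": 120, "temperature": 0.5},
    )
    return response["output"]["message"]["content"][0]["text"]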

Our pipeline generated the following storyboard panels.

Mayu stands at the edge of a mountainous path, clutching a book. Her mother, Maya, kneels beside her, offering words of encouragement and handing her the book. Mayu looks nervous but determined as she prepares to start her journey.

Mayu encounters a ‘danger’ sign with a drawing of a snake. She looks scared, but then remembers her mother’s words. She takes a deep breath, looks at her book for reassurance, and then searches for a stick on the ground.

Mayu bravely makes her way through tall grass, swinging her stick and making noise to scare off potential snakes. Her face shows a mix of fear and courage as she pushes forward on her journey.

Although these techniques noticeably improve character consistency, they aren’t perfect. Upon closer inspection, you will notice that even images within the same scene show variations in character consistency. Using consistent seed values helps control these variations, and the techniques outlined in this post significantly improve consistency compared to basic prompt engineering. However, if your use case requires near-perfect character consistency, we recommend proceeding to Part 2, where we explore advanced fine-tuning techniques.
Video generation for animated storyboards
If you want to go beyond static scene images to transform your storyboard into short, animated video clips, you can use Amazon Nova Reel. We use Amazon Nova Lite to convert image prompts into video prompts, adding subtle motion and camera movements optimized for the Amazon Nova Reel model. These prompts, along with the original images, serve as creative constraints for Amazon Nova Reel to generate the final animated sequences. The following is the example prompt and its resulting animated scene in GIF format:

A sunlit forest path with a ‘Danger’ sign featuring a snake. A 7-year-old Peruvian girl
stands, visibly scared but resolute. Bold linework, dramatic shadows, and flat color
palettes. High contrast lighting and cinematic composition. Mist slowly drifting.
Camera dolly in.

Input Image
Output Video

Conclusion
In this first part of our series, we explored fundamental techniques for achieving character and style consistency using Amazon Nova Canvas, from structured prompt engineering to building an end-to-end storyboarding pipeline. We demonstrated how combining style descriptions, seed values, and careful cfgScale parameter control can significantly improve character consistency across different scenes. We also showed how integrating Amazon Nova Lite with Amazon Nova Reel can enhance the storyboarding workflow, enabling both optimized prompt generation and animated sequences.
Although these techniques provide a solid foundation for consistent storyboard generation, they aren’t perfect—subtle variations might still occur. We invite you to continue to Part 2, where we explore advanced model fine-tuning techniques that can help achieve near-perfect character consistency and visual fidelity.

About the authors
Alex Burkleaux is a Senior AI/ML Specialist Solution Architect at AWS. She helps customers use AI Services to build media solutions using Generative AI. Her industry experience includes over-the-top video, database management systems, and reliability engineering.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.
Vladimir Budilov is a Principal Solutions Architect at AWS focusing on agentic & generative AI, and software architecture. He leads large-scale GenAI implementations, bridging cutting-edge AI capabilities with production-ready business solutions, while optimizing for cost and solution resilience.
Nora Shannon Johnson is a Solutions Architect at Amazon Music focused on discovery and growth through AI/ML. In the past, she supported AWS through the development of generative AI prototypes and tools for developers in financial services, health care, retail, and more. She has been an engineer and consultant in various industries including DevOps, fintech, industrial AI/ML, and edtech in the United States, Europe, and Latin America.
Ehsan Shokrgozar is a Senior Solutions Architect specializing in Media and Entertainment at AWS. He is passionate about helping M&E customers build more efficient workflows. He combines his previous experience as Technical Director and Pipeline Engineer at various Animation/VFX studios with his knowledge of building M&E workflows in the cloud to help customers achieve their business goals.

Google Brings Gemini CLI to GitHub Actions: Secure, Free, and Enterpri …

How do developers integrate AI coding capabilities directly into their GitHub repositories? Google has recently introduced Gemini CLI GitHub Actions, a new way to bring Gemini’s coding capabilities into GitHub workflows. Built on top of GitHub’s workflow automation framework, the release turns Gemini from a terminal-only coding assistant into a collaborative teammate that participates in issue triage, pull request reviews, and repository maintenance.

How does it differ from Microsoft’s GitHub Copilot? Unlike Copilot’s advanced features, which require paid subscriptions, Google’s integration is available at no cost. This is particularly useful for open-source developers, small teams, and enterprises that want to embed AI into their workflows without additional licensing overhead.

From Terminal to Repository Integration

Google first released Gemini CLI earlier this year as a command-line interface that connected developers directly to the Gemini 2.5 Pro model. With a one-million-token context window, built-in tools, and open-source licensing, Gemini CLI was designed for local, developer-focused workflows.

The new GitHub Actions integration extends those capabilities to collaborative environments. Instead of operating only on a developer’s machine, Gemini can now participate in repository-level automation, assisting teams during code reviews, issue management, and continuous integration, which saves developer time and speeds up code deployment.

Core Capabilities

Gemini CLI GitHub Actions comes with three key use cases:

Automated Issue Triage: New issues are automatically labeled, categorized, and prioritized. This reduces the time maintainers spend manually managing backlogs and helps teams focus on critical bugs or features.

AI-Powered Pull Request Reviews: Every new pull request can be reviewed by Gemini before human reviewers step in. The system checks code for style adherence, potential bugs, and correctness, allowing maintainers to focus on design-level concerns rather than surface-level errors.

On-Demand Collaboration via Commands: Developers can interact with Gemini directly in GitHub comments. By mentioning @gemini-cli and issuing commands such as /review, /triage, or /write-tests, they can trigger specific actions. This makes Gemini act like a conversational collaborator inside the repository, much as developers interact with each other in Slack or Jira.

Setup and Configuration

Integrating Gemini CLI GitHub Actions is very straightforward. Developers need Gemini CLI version 0.1.18 or higher. Running the command /setup-github inside the CLI scaffolds the necessary workflow files under .github/workflows and ensures configuration settings are properly managed.

For authentication, Google provides two methods:

API Key Authentication: Developers can store a GEMINI_API_KEY in GitHub Secrets. This method is simple and sufficient for most individual and team projects.

Workload Identity Federation (WIF): For enterprise users, WIF provides a more secure option by replacing long-lived credentials with short-lived, federated tokens. This approach aligns with modern security best practices for CI/CD pipelines.

Gemini’s behavior can be further customized using a GEMINI.md file placed in the repository. This file can contain coding guidelines, documentation links, or project-specific rules. The AI model then uses this context to tailor its reviews and responses.
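As an illustration, a GEMINI.md might look like the following; the specific rules are hypothetical and should be replaced with your project’s actual conventions.

# Project guidelines for Gemini

- Follow PEP 8 and the project's existing module layout.
- All new functions require type hints and a docstring.
- Prefer pytest fixtures over setUp/tearDown in tests.
- When reviewing changes to the service layer, reference the architecture overview in docs/architecture.md.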

Security Model

Beyond these capabilities, how secure is Gemini CLI GitHub Actions? Commands executed by the model run in isolated environments: the system supports multiple sandboxing technologies, including Docker, Podman, and macOS Seatbelt.

Additionally, since version 0.1.14 of Gemini CLI, all executions are logged for auditability. Any commands flagged as unusual or potentially unsafe require explicit developer confirmation before execution. For production environments, Google strongly recommends using WIF authentication to avoid risks associated with static API keys.

Example Workflow

The following minimal YAML configuration enables Gemini to automatically review pull requests. This workflow ensures that every new or updated pull request is analyzed by Gemini before merging, providing consistent automated review across the repository.

name: Gemini Pull Request Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  gemini-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/run-gemini-cli@v0.1
        with:
          args: review --files .
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

Summary

Gemini CLI GitHub Actions represents a significant step in Google’s effort to embed AI into collaborative software development. By combining free access, flexible configuration, and strong security practices, the release lowers the barrier for teams to experiment with AI-driven automation inside their repositories.

AI and the Brain: How DINOv3 Models Reveal Insights into Human Visual …

Introduction

Understanding how the brain builds internal representations of the visual world is one of the most fascinating challenges in neuroscience. Over the past decade, deep learning has reshaped computer vision, producing neural networks that not only perform at human-level accuracy on recognition tasks but also seem to process information in ways that resemble our brains. This unexpected overlap raises an intriguing question: can studying AI models help us better understand how the brain itself learns to see?

Researchers at Meta AI and École Normale Supérieure set out to explore this question by focusing on DINOv3, a self-supervised vision transformer trained on billions of natural images. They compared DINOv3’s internal activations with human brain responses to the same images, using two complementary neuroimaging techniques. fMRI provided high-resolution spatial maps of cortical activity, while MEG captured the precise timing of brain responses. Together, these datasets offered a rich view of how the brain processes visual information.

https://arxiv.org/pdf/2508.18226

Technical Details

The research team explores three factors that might drive brain-model similarity: model size, the amount of training data, and the type of images used for training. To do this, the team trained multiple versions of DINOv3, varying these factors independently.

Brain-Model Similarity

The research team found strong evidence of convergence while looking at how well DINOv3 matched brain responses. The model’s activations predicted fMRI signals in both early visual regions and higher-order cortical areas. Peak voxel correlations reached R = 0.45, and MEG results showed that alignment started as early as 70 milliseconds after image onset and lasted up to three seconds. Importantly, early DINOv3 layers aligned with regions like V1 and V2, while deeper layers matched activity in higher-order regions, including parts of the prefrontal cortex.
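The study’s exact analysis pipeline is not reproduced here, but a typical voxel-wise encoding analysis of this kind looks like the following sketch; the file names, array shapes, and the use of scikit-learn ridge regression are assumptions for illustration.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumed shapes: one row per image
model_activations = np.load("dinov3_layer_activations.npy")   # (n_images, n_features)
fmri_responses = np.load("voxel_responses.npy")               # (n_images, n_voxels)

X_train, X_test, y_train, y_test = train_test_split(
    model_activations, fmri_responses, test_size=0.2, random_state=0
)

# Fit a linear encoding model mapping model activations to all voxels at once
encoder = Ridge(alpha=1000.0).fit(X_train, y_train)
pred = encoder.predict(X_test)

# Score each voxel by the correlation between predicted and measured responses
r = [np.corrcoef(pred[:, v], y_test[:, v])[0, 1] for v in range(y_test.shape[1])]
print(f"peak voxel correlation R = {max(r):.2f}")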

Training Trajectories

Tracking these similarities over the course of training revealed a developmental trajectory. Low-level visual alignments emerged very early, after only a small fraction of training, while higher-level alignments required billions of images. This mirrors the way the human brain develops, with sensory areas maturing earlier than associative cortices. The study showed that temporal alignment emerged fastest, spatial alignment more slowly, and encoding similarity in between, highlighting the layered nature of representational development.

Role of Model Factors

The role of model factors was equally telling. Larger models consistently achieved higher similarity scores, especially in higher-order cortical regions. Longer training improved alignment across the board, with high-level representations benefiting most from extended exposure. The type of images mattered as well: models trained on human-centric images produced the strongest alignment. Those trained on satellite or cellular images showed partial convergence in early visual regions but much weaker similarity in higher-level brain areas. This suggests that ecologically relevant data are crucial for capturing the full range of human-like representations.

Links to Cortical Properties

Interestingly, the timing of when DINOv3’s representations emerged also lined up with structural and functional properties of the cortex. Regions with greater developmental expansion, thicker cortex, or slower intrinsic timescales aligned later in training. Conversely, highly myelinated regions aligned earlier, reflecting their role in fast information processing. These correlations suggest that AI models can offer clues about the biological principles underlying cortical organization.

Nativism vs. Empiricism

The study highlights a balance between innate structure and learning. DINOv3’s architecture gives it a hierarchical processing pipeline, but full brain-like similarity only emerged with prolonged training on ecologically valid data. This interplay between architectural priors and experience echoes debates in cognitive science about nativism and empiricism.

Developmental Parallels

The parallels to human development are striking. Just as sensory cortices in the brain mature quickly and associative areas develop more slowly, DINOv3 aligned with sensory regions early in training and with prefrontal areas much later. This suggests that training trajectories in large-scale AI models may serve as computational analogues for the staged maturation of human brain functions.

Beyond the Visual Pathway

The results also extended beyond traditional visual pathways. DINOv3 showed alignment in prefrontal and multimodal regions, raising questions about whether such models capture higher-order features relevant for reasoning and decision-making. While this study focused only on DINOv3, it points toward exciting possibilities for using AI as a tool to test hypotheses about brain organization and development.


Conclusion

In conclusion, this research shows that self-supervised vision models like DINOv3 are more than just powerful computer vision systems. They also approximate aspects of human visual processing, revealing how size, training, and data shape convergence between brains and machines. By studying how models learn to “see,” we gain valuable insights into how the human brain itself develops the ability to perceive and interpret the world.


Tencent Hunyuan Open-Sources Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B: …

Introduction

Tencent’s Hunyuan team has released Hunyuan-MT-7B (a translation model) and Hunyuan-MT-Chimera-7B (an ensemble model). Both models are designed specifically for multilingual machine translation and were introduced in conjunction with Tencent’s participation in the WMT2025 General Machine Translation shared task, where Hunyuan-MT-7B ranked first in 30 out of 31 language pairs.

https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf

Model Overview

Hunyuan-MT-7B

A 7B parameter translation model.

Supports mutual translation across 33 languages, including Chinese ethnic minority languages such as Tibetan, Mongolian, Uyghur, and Kazakh.

Optimized for both high-resource and low-resource translation tasks, achieving state-of-the-art results among models of comparable size.

Hunyuan-MT-Chimera-7B

An integrated weak-to-strong fusion model.

Combines multiple translation outputs at inference time and produces a refined translation using reinforcement learning and aggregation techniques.

Represents the first open-source translation model of this type, improving translation quality beyond single-system outputs.

https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf
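As a rough usage sketch, the 7B checkpoint can be loaded like any causal language model with Hugging Face transformers. The repository id and the prompt wording below are assumptions; consult the official model card for the exact prompt template.

```python
# Hypothetical usage sketch: loading Hunyuan-MT-7B as a causal LM with
# Hugging Face transformers. The repo id and prompt wording are assumptions;
# check the official model card for the recommended template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-MT-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Translate the following text from English to Chinese:\n\nThe weather is lovely today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Print only the newly generated tokens (the translation).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```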

Training Framework

The models were trained using a five-stage framework designed for translation tasks:

General Pre-training

1.3 trillion tokens covering 112 languages and dialects.

Multilingual corpora assessed for knowledge value, authenticity, and writing style.

Diversity maintained through disciplinary, industry, and thematic tagging systems.

MT-Oriented Pre-training

Monolingual corpora from mC4 and OSCAR, filtered using fastText (language ID), minLSH (deduplication), and KenLM (perplexity filtering).

Parallel corpora from OPUS and ParaCrawl, filtered with CometKiwi.

Replay of general pre-training data (20%) to avoid catastrophic forgetting.

Supervised Fine-Tuning (SFT)

Stage I: ~3M parallel pairs (Flores-200, WMT test sets, curated Mandarin–minority data, synthetic pairs, instruction-tuning data).

Stage II: ~268k high-quality pairs selected through automated scoring (CometKiwi, GEMBA) and manual verification.

Reinforcement Learning (RL)

Algorithm: GRPO.

Reward functions:

XCOMET-XXL and DeepSeek-V3-0324 scoring for quality.

Terminology-aware rewards (TAT-R1).

Repetition penalties to avoid degenerate outputs.

Weak-to-Strong RL

Multiple candidate outputs are generated and aggregated through reward-based output fusion (a schematic sketch follows this list).

Applied in Hunyuan-MT-Chimera-7B, improving translation robustness and reducing repetitive errors.
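To illustrate the idea behind this reward-based aggregation, here is a schematic sketch that samples several candidate translations and keeps the one with the best combined reward. The scoring functions are toy stand-ins for the XCOMET-XXL, terminology, and repetition signals described above, not the actual training rewards.

```python
# Schematic sketch of reward-based candidate aggregation in the spirit of the
# weak-to-strong stage: score each candidate translation with a quality reward
# minus a simple repetition penalty, and keep the best one. The scoring
# functions are toy stand-ins, not the real XCOMET-XXL / TAT-R1 rewards.
from collections import Counter
from typing import Callable, List


def repetition_penalty(text: str) -> float:
    """Penalize degenerate outputs dominated by a single repeated token."""
    tokens = text.split()
    if not tokens:
        return 1.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)  # 1.0 means one token dominates


def aggregate(candidates: List[str], quality_fn: Callable[[str], float],
              rep_weight: float = 0.5) -> str:
    """Return the candidate with the highest combined reward."""
    def reward(text: str) -> float:
        return quality_fn(text) - rep_weight * repetition_penalty(text)
    return max(candidates, key=reward)


if __name__ == "__main__":
    # Toy quality function: more distinct words -> higher score.
    toy_quality = lambda t: len(set(t.split())) / 10.0
    cands = ["the the the the", "The weather is lovely today.", "Weather nice."]
    print(aggregate(cands, toy_quality))
```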

Benchmark Results

Automatic Evaluation

WMT24pp (English⇔XX): Hunyuan-MT-7B achieved 0.8585 (XCOMET-XXL), surpassing larger models like Gemini-2.5-Pro (0.8250) and Claude-Sonnet-4 (0.8120).

FLORES-200 (33 languages, 1056 pairs): Hunyuan-MT-7B scored 0.8758 (XCOMET-XXL), outperforming open-source baselines including Qwen3-32B (0.7933).

Mandarin⇔Minority Languages: Scored 0.6082 (XCOMET-XXL), higher than Gemini-2.5-Pro (0.5811), showing significant improvements in low-resource settings.

Comparative Results

Outperforms Google Translate by 15–65% across evaluation categories.

Outperforms specialized translation models such as Tower-Plus-9B and Seed-X-PPO-7B despite having fewer parameters.

Chimera-7B adds ~2.3% improvement on FLORES-200, particularly in Chinese⇔Other and non-English⇔non-Chinese translations.

Human Evaluation

A custom evaluation set (covering social, medical, legal, and internet domains) compared Hunyuan-MT-7B with state-of-the-art models:

Hunyuan-MT-7B: Avg. 3.189

Gemini-2.5-Pro: Avg. 3.223

DeepSeek-V3: Avg. 3.219

Google Translate: Avg. 2.344

This shows that Hunyuan-MT-7B, despite being smaller at 7B parameters, approaches the quality of much larger proprietary models.

Case Studies

The report highlights several real-world cases:

Cultural References: Correctly translates “小红薯” (literally “little sweet potato”) as the platform “REDnote,” unlike Google Translate’s “sweet potatoes.”

Idioms: Interprets “You are killing me” as “你真要把我笑死了” (roughly, “you’re really going to make me die laughing,” expressing amusement), avoiding literal misinterpretation.

Medical Terms: Translates “uric acid kidney stones” precisely, while baselines generate malformed outputs.

Minority Languages: For Kazakh and Tibetan, Hunyuan-MT-7B produces coherent translations, where baselines fail or output nonsensical text.

Chimera Enhancements: Adds improvements in gaming jargon, intensifiers, and sports terminology.

Conclusion

Tencent’s release of Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B establishes a new standard for open-source translation. By combining a carefully designed training framework with a specialized focus on low-resource and minority-language translation, the models achieve quality on par with or exceeding larger closed-source systems. The launch of these two models gives the AI research community accessible, high-performance tools for multilingual translation research and deployment.


Authenticate Amazon Q Business data accessors using a trusted token is …

Since its general availability in 2024, Amazon Q Business (Amazon Q) has enabled independent software vendors (ISVs) to enhance their software-as-a-service (SaaS) solutions with secure access to customers’ enterprise data by becoming Amazon Q Business data accessors. To learn more about data accessors, see the Amazon Q Business documentation. Data accessors now support trusted identity propagation: with trusted token issuer (TTI) authorization support, ISVs acting as data accessors can integrate with the Amazon Q index while maintaining enterprise-grade security standards for their SaaS solutions.
Prior to TTI support, data accessors needed to implement an authorization code flow with AWS IAM Identity Center integration when accessing the Amazon Q index. With TTI support, ISVs can now use their own OpenID Provider to authenticate enterprise users, removing the need for double authentication while maintaining security standards.
In this blog post, we show you how to implement TTI authorization for data accessors, compare authentication options, and provide step-by-step guidance for both ISVs and enterprises.
Prerequisites
Before you begin, make sure you have the following requirements:

An AWS account with administrator access
Access to Amazon Q Business
For ISVs:

An OpenID Connect (OIDC) compatible authorization server

For enterprises:

Amazon Q Business administrator access
Permission to create trusted token issuers

Solution Overview
This solution demonstrates how to implement TTI authentication for Amazon Q Business data accessors. The overall flow spans three phases: the ISV becoming a data accessor, the customer enabling the ISV data accessor, and the ISV accessing the customer’s Amazon Q index.

Understanding Trusted Token Issuer Authentication
Trusted Token Issuer is an identity integration capability for Amazon Q. At its core, TTI is a token exchange API that propagates end-user identity information into IAM role sessions, enabling AWS services to make authorization decisions based on the actual user’s identity and group memberships. This simplifies the identity integration process while maintaining robust security standards, so access to Amazon Q respects user-level permissions and group memberships, enabling fine-grained access control and proper security governance within Amazon Q implementations.
Understanding Data Accessors
A data accessor is an ISV that has registered with AWS and is authorized to use its customers’ Amazon Q index in the ISV’s Large Language Model (LLM) solution. The process begins with ISV registration, where the ISV provides configuration information including a display name, business logo, and OpenID Connect (OIDC) configuration details for TTI support.
During ISV registration, providers must specify their tenantId configuration – a unique identifier for their application tenant. This identifier might be known by different names in various applications (such as Workspace ID in Slack or Domain ID in Asana) and is required for proper customer isolation in multi-tenant environments.
Amazon Q customers then add the ISV as a data accessor to their environment, granting access to their Amazon Q index based on specific permissions and data source selections. Once authorized, the ISV can query the customers’ index through API requests using their TTI authentication flow, creating a secure and controlled pathway for accessing customer data.
Implementing TTI Authentication for Amazon Q index Access
This section explains how to implement TTI authentication for accessing the Amazon Q index. The implementation involves initial setup by the customer and subsequent authentication flow implemented by data accessors for user access.
TTI provides capabilities that enable identity-enhanced IAM role sessions through Trusted Identity Propagation (TIP), allowing AWS services to make authorization decisions based on authenticated user identities and group memberships. Here’s how it works:
To enable data accessor access to a customer’s Amazon Q index through TTI, customers must perform an initial one-time setup by adding the data accessor in their Amazon Q Business application. During setup, a TTI with the data accessor’s identity provider information is created in the customer’s AWS IAM Identity Center, allowing the data accessor’s identity provider to authenticate access to the customer’s Amazon Q index.

The process to set up an ISV data accessor with TTI authentication consists of the following steps:

The customer’s IT administrator accesses their Amazon Q Business application and creates a trusted token issuer with the ISV’s OAuth information. This returns a TrustedTokenIssuer (TTI) Amazon Resource Name (ARN).
The IT administrator creates an ISV data accessor with the TTI ARN received in Step 1.
Amazon Q Business confirms the provided TTI ARN with AWS IAM Identity Center and creates a data accessor application.
Upon successful creation of the ISV data accessor, the IT administrator receives data accessor details to share with the ISV.
The IT administrator provides these details to the ISV application.

Once the data accessor setup is complete in the customer’s Amazon Q environment, users can access the Amazon Q index through the ISV application by authenticating only against the data accessor’s identity provider.

The authentication flow proceeds as follows:

A user authenticates against the data accessor’s identity provider through the ISV application. The ISV application receives an ID token for that user, generated from the ISV’s identity provider with the same client ID registered on their data accessor.
The ISV application assumes the AWS Identity and Access Management (IAM) role it created during data accessor onboarding by calling the AssumeRole API, then makes a CreateTokenWithIAM API request to the customer’s AWS IAM Identity Center with the ID token. AWS IAM Identity Center validates the ID token with the ISV’s identity provider and returns an IAM Identity Center token.
The ISV application calls the AssumeRole API again with the IAM Identity Center token, the extracted identity context, and the tenantId (a hedged boto3 sketch of this sequence follows the list). The tenantId is a security control jointly established between the ISV and their customer, with the customer maintaining control over how it is used in their trust relationships. This combination facilitates secure access to the correct customer environment.
The ISV application calls the SearchRelevantContent API with the session credentials and receives relevant content from the customer’s Amazon Q index.
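To make the sequence concrete, here is a hedged boto3 sketch under placeholder identifiers. The identity-context claim name and the session-tag key for tenantId are assumptions that should be checked against the data accessor documentation, so treat this as an outline rather than a drop-in implementation.

```python
# Hedged sketch of the data accessor call sequence with boto3. Every ARN, ID,
# and Region below is a placeholder, and details such as the identity-context
# claim name and the tenantId session tag key are assumptions to verify
# against the official data accessor documentation.
import base64
import json

import boto3

REGION = "us-east-1"  # customer's Region (placeholder)
DATA_ACCESSOR_ROLE_ARN = "arn:aws:iam::111122223333:role/isv-data-accessor"       # placeholder
IDC_APP_CLIENT_ID = "arn:aws:sso::111122223333:application/ssoins-xxxx/apl-xxxx"  # placeholder
QBUSINESS_APP_ID = "placeholder-qbusiness-application-id"
RETRIEVER_ID = "placeholder-retriever-id"


def decode_identity_context(id_token: str) -> str:
    """Extract the identity context claim from the Identity Center idToken (no signature check)."""
    payload = id_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims["sts:identity_context"]  # claim name assumed; verify in the docs


def search_customer_index(isv_id_token: str, query: str, tenant_id: str) -> dict:
    sts = boto3.client("sts", region_name=REGION)

    # Step 2a: assume the IAM role created during data accessor onboarding.
    base = sts.assume_role(
        RoleArn=DATA_ACCESSOR_ROLE_ARN, RoleSessionName="data-accessor-base"
    )["Credentials"]

    # Step 2b: exchange the ISV-issued ID token for an IAM Identity Center token.
    oidc = boto3.client(
        "sso-oidc",
        region_name=REGION,
        aws_access_key_id=base["AccessKeyId"],
        aws_secret_access_key=base["SecretAccessKey"],
        aws_session_token=base["SessionToken"],
    )
    idc_token = oidc.create_token_with_iam(
        clientId=IDC_APP_CLIENT_ID,
        grantType="urn:ietf:params:oauth:grant-type:jwt-bearer",
        assertion=isv_id_token,
    )

    # Step 3: assume the role again with the identity context and tenantId.
    tip = sts.assume_role(
        RoleArn=DATA_ACCESSOR_ROLE_ARN,
        RoleSessionName="data-accessor-tip",
        ProvidedContexts=[{
            "ProviderArn": "arn:aws:iam::aws:contextProvider/IdentityCenter",
            "ContextAssertion": decode_identity_context(idc_token["idToken"]),
        }],
        # tenantId passed as a session tag; exact tag key per the data accessor docs.
        Tags=[{"Key": "qbusiness-dataaccessor:ExternalId", "Value": tenant_id}],
    )["Credentials"]

    # Step 4: query the customer's Amazon Q index with the identity-enhanced session.
    qbusiness = boto3.client(
        "qbusiness",
        region_name=REGION,
        aws_access_key_id=tip["AccessKeyId"],
        aws_secret_access_key=tip["SecretAccessKey"],
        aws_session_token=tip["SessionToken"],
    )
    return qbusiness.search_relevant_content(
        applicationId=QBUSINESS_APP_ID,
        queryText=query,
        contentSource={"retriever": {"retrieverId": RETRIEVER_ID}},
    )
```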

Choosing between TTI and Authorization Code
When implementing Amazon Q integration, ISVs need to consider two approaches, each with its own benefits and considerations:

Trusted Token Issuer
Advantages: Single authentication on the ISV system; enables backend-only access to the SearchRelevantContent API without user interaction.
Considerations: Requires ISVs to host and maintain an OpenID Provider; some enterprises may prefer authentication flows that require explicit user consent for each session, providing additional control over API access timing and duration.

Authorization Code
Advantages: Enhanced security through mandatory user initiation for each session.
Considerations: Requires double authentication on the ISV system.

TTI excels in providing a seamless user experience through single authentication on the ISV system and enables backend-only implementations for SearchRelevantContent API access without requiring direct user interaction. However, this approach requires ISVs to maintain their own OIDC authorization server, which may present implementation challenges for some organizations. Additionally, some enterprises might have concerns about ISVs having persistent ability to make API requests on behalf of their users without explicit per-session authorization.
Next Steps
For ISVs: Becoming a Data Accessor with TTI Authentication
Getting started with the Amazon Q data accessor registration process using TTI authentication is straightforward. If you already have an OIDC-compatible authorization server for your application’s authentication, you’re most of the way there.
To begin the registration process, you’ll need to provide the following information:

Display name and business logo that will be displayed on AWS Management Console
OIDC configuration details (OIDC ClientId and discovery endpoint URL)
TenantID configuration details that specify how your application identifies different customer environments

For details, see Information to be provided to the Amazon Q Business team.
For ISVs using Amazon Cognito as their OIDC authorization server, here’s how to retrieve the required OIDC configuration details:

To get the OIDC ClientId: navigate to the Amazon Cognito console, select your User Pool, go to “Applications” > “App clients”, and read the value listed under “Client ID” for your app client.
To get the discovery endpoint URL: the URL follows the format https://cognito-idp.{region}.amazonaws.com/{userPoolId}/.well-known/openid-configuration. Replace {region} with your AWS Region (e.g., us-east-1) and {userPoolId} with your Cognito User Pool ID. For example, if your User Pool is in us-east-1 with ID ‘us-east-1_abcd1234’, the discovery endpoint URL would be https://cognito-idp.us-east-1.amazonaws.com/us-east-1_abcd1234/.well-known/openid-configuration (a quick sanity check of this endpoint is sketched after the note below).

Note: While this example uses Amazon Cognito, the process will vary depending on your OIDC provider. Common providers like Auth0, Okta, or custom implementations will have their own methods for accessing these configuration details.
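If you want to sanity-check a discovery endpoint before submitting it, a few lines of Python are enough. The snippet below uses the illustrative Cognito URL from the example above (a placeholder pool ID, so substitute your own) and only confirms that the expected OIDC fields are present.

```python
# Quick sanity check of an OIDC discovery endpoint: fetch the document and
# confirm the issuer and JWKS URI are present. The URL reuses the illustrative
# Cognito example above; replace it with your own provider's endpoint.
import json
from urllib.request import urlopen

url = ("https://cognito-idp.us-east-1.amazonaws.com/"
       "us-east-1_abcd1234/.well-known/openid-configuration")

with urlopen(url) as resp:
    config = json.load(resp)

print("issuer:", config.get("issuer"))
print("jwks_uri:", config.get("jwks_uri"))
print("supported grants:", config.get("grant_types_supported"))
```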
Once registered, you can enhance your generative AI application with the powerful capabilities of Amazon Q, allowing your customers to access their enterprise knowledge base through your familiar interface. AWS provides comprehensive documentation and support to help you implement the authentication flow and API integration efficiently.
For Enterprises: Enabling TTI-authenticated Data Accessor
To enable a TTI-authenticated data accessor, your IT administrator needs to complete the following steps in the Amazon Q console:

Create a trusted token issuer using the ISV’s OAuth information
Set up the data accessor with the generated TTI ARN
Configure appropriate data source access permissions

This streamlined setup allows your users to access Amazon Q index through the ISV’s application using their existing ISV application credentials, alleviating the need for multiple logins while maintaining security controls over your enterprise data.
Both ISVs and enterprises benefit from AWS’s comprehensive documentation and support throughout the implementation process, facilitating a smooth and secure integration experience.
Clean up resources
To avoid unused resources, follow these steps if you no longer need the data accessor:

Delete the data accessor:

On the Amazon Q Business console, choose Data accessors in the navigation pane
Select your data accessor and choose Delete.

Delete the TTI:

On the IAM Identity Center console, choose Trusted Token Issuers in the navigation pane.
Select the associated issuer and choose Delete.

Conclusion
The introduction of Trusted Token Issuer (TTI) authentication for Amazon Q data accessors marks a significant advancement in how ISVs integrate with Amazon Q Business. By enabling data accessors to use their existing OIDC infrastructure, we’ve alleviated the need for double authentication while maintaining enterprise-grade security standards through TTI’s robust tenant isolation mechanisms and secure multi-tenant access controls, making sure each customer’s data remains protected within their dedicated environment. This streamlined approach not only enhances the end-user experience but also simplifies the integration process for ISVs building generative AI solutions.
In this post, we showed how to implement TTI authentication for Amazon Q data accessors. We covered the setup process for both ISVs and enterprises and demonstrated how TTI authentication simplifies the user experience while maintaining security standards.
To learn more about Amazon Q Business and data accessor integration, refer to Share your enterprise data with data accessors using Amazon Q index and Information to be provided to the Amazon Q Business team. You can also contact your AWS account team for personalized guidance. Visit the Amazon Q Business console to begin using these enhanced authentication capabilities today.

About the Authors
Takeshi Kobayashi is a Senior AI/ML Solutions Architect within the Amazon Q Business team, responsible for developing advanced AI/ML solutions for enterprise customers. With over 14 years of experience at Amazon in AWS, AI/ML, and technology, Takeshi is dedicated to leveraging generative AI and AWS services to build innovative solutions that address customer needs. Based in Seattle, WA, Takeshi is passionate about pushing the boundaries of artificial intelligence and machine learning technologies.
Siddhant Gupta is a Software Development Manager on the Amazon Q team based in Seattle, WA. He is driving innovation and development in cutting-edge AI-powered solutions.
Akhilesh Amara is a Software Development Engineer on the Amazon Q team based in Seattle, WA. He is contributing to the development and enhancement of intelligent and innovative AI tools.