A Coding Guide to Build a Functional Data Analysis Workflow Using Lilac for Transforming, Filtering, and Exporting Structured Insights

In this tutorial, we demonstrate a fully functional and modular data analysis pipeline using the Lilac library, without relying on signal processing. It combines Lilac’s dataset management capabilities with Python’s functional programming paradigm to create a clean, extensible workflow. From setting up a project and generating realistic sample data to extracting insights and exporting filtered outputs, the tutorial emphasizes reusable, testable code structures. Core functional utilities, such as pipe, map_over, and filter_by, are used to build a declarative flow, while Pandas facilitates detailed data transformations and quality analysis.

!pip install lilac[all] pandas numpy

To get started, we install the required libraries using the command !pip install lilac[all] pandas numpy. This ensures we have the full Lilac suite alongside Pandas and NumPy for smooth data handling and analysis. We should run this in our notebook before proceeding.

import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from functools import reduce, partial
import lilac as ll

We import all the essential libraries. These include json and uuid for handling data and generating unique project names, pandas for working with data in tabular form, and Path from pathlib for managing directories. We also introduce type hints for improved function clarity and functools for functional composition patterns. Finally, we import the core Lilac library as ll to manage our datasets.

def pipe(*functions):
    """Compose functions left to right (pipe operator)"""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)

def map_over(func, iterable):
    """Functional map wrapper"""
    return list(map(func, iterable))

def filter_by(predicate, iterable):
    """Functional filter wrapper"""
    return list(filter(predicate, iterable))

def create_sample_data() -> List[Dict[str, Any]]:
    """Generate realistic sample data for analysis"""
    return [
        {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
        {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
        {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
        {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
        {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
        {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
        {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
        {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
    ]

In this section, we define reusable functional utilities. The pipe function helps us chain transformations clearly, while map_over and filter_by allow us to transform or filter iterable data functionally. Then, we create a sample dataset that mimics real-world records, featuring fields such as text, category, score, and tokens, which we will later use to demonstrate Lilac’s data curation capabilities.
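Before wiring these helpers into Lilac, it can help to see them composed on their own. The following snippet is an optional sanity check (not part of the original pipeline) that pipes the sample records through a score filter and a text projection:

from functools import partial

high_score_texts = pipe(
    partial(filter_by, lambda record: record["score"] >= 0.8),
    partial(map_over, lambda record: record["text"]),
)(create_sample_data())

print(high_score_texts)  # seven of the ten sample records clear the 0.8 threshold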

def setup_lilac_project(project_name: str) -> str:
    """Initialize Lilac project directory"""
    project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
    Path(project_dir).mkdir(exist_ok=True)
    ll.set_project_dir(project_dir)
    return project_dir

def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
    """Create Lilac dataset from data"""
    data_file = f"{name}.jsonl"
    with open(data_file, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')

    config = ll.DatasetConfig(
        namespace="tutorial",
        name=name,
        source=ll.sources.JSONSource(filepaths=[data_file])
    )

    return ll.create_dataset(config)

With the setup_lilac_project function, we initialize a unique working directory for our Lilac project and register it using Lilac’s API. Using create_dataset_from_data, we convert our raw list of dictionaries into a .jsonl file and create a Lilac dataset by defining its configuration. This prepares the data for clean and structured analysis.

def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
    """Extract data as pandas DataFrame"""
    return dataset.to_pandas(fields)

def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """Apply various filters and return multiple filtered versions"""

    filters = {
        'high_score': lambda df: df[df['score'] >= 0.8],
        'tech_category': lambda df: df[df['category'] == 'tech'],
        'min_tokens': lambda df: df[df['tokens'] >= 4],
        'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
        'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
    }

    return {name: filter_func(df.copy()) for name, filter_func in filters.items()}

We extract the dataset into a Pandas DataFrame using extract_dataframe, which allows us to work with selected fields in a familiar format. Then, using apply_functional_filters, we define and apply a set of logical filters, such as high-score selection, category-based filtering, token count constraints, duplicate removal, and composite quality conditions, to generate multiple filtered views of the data.
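As a quick optional check (a minimal sketch, not part of the original tutorial flow), the same filters can be previewed on a DataFrame built directly from the sample records, without going through Lilac first:

preview_df = pd.DataFrame(create_sample_data())
views = apply_functional_filters(preview_df)
for name, view in views.items():
    print(f"{name}: {len(view)} rows")  # e.g. no_duplicates drops the one repeated question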

def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """Analyze data quality metrics"""
    return {
        'total_records': len(df),
        'unique_texts': df['text'].nunique(),
        'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
        'avg_score': df['score'].mean(),
        'category_distribution': df['category'].value_counts().to_dict(),
        'score_distribution': {
            'high': len(df[df['score'] >= 0.8]),
            'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
            'low': len(df[df['score'] < 0.6])
        },
        'token_stats': {
            'mean': df['tokens'].mean(),
            'min': df['tokens'].min(),
            'max': df['tokens'].max()
        }
    }

def create_data_transformations() -> Dict[str, callable]:
    """Create various data transformation functions"""
    return {
        'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
        'add_length_category': lambda df: df.assign(
            length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
        ),
        'add_quality_tier': lambda df: df.assign(
            quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
        ),
        'add_category_rank': lambda df: df.assign(
            category_rank=df.groupby('category')['score'].rank(ascending=False)
        )
    }

To evaluate the dataset quality, we use analyze_data_quality, which helps us measure key metrics like total and unique records, duplicate rates, category breakdowns, and score/token distributions. This gives us a clear picture of the dataset’s readiness and reliability. We also define transformation functions using create_data_transformations, enabling enhancements such as score normalization, token-length categorization, quality tier assignment, and intra-category ranking.
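For a concrete feel of these helpers, both can be run directly on the raw sample records; this is an optional sketch and not part of the original walkthrough:

preview_df = pd.DataFrame(create_sample_data())
report = analyze_data_quality(preview_df)
print(round(report["duplicate_rate"], 2))  # ~0.1: one repeated question out of ten

tiered = create_data_transformations()["add_quality_tier"](preview_df)
print(tiered[["id", "score", "quality_tier"]].head())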

def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
    """Apply selected transformations"""
    transformations = create_data_transformations()
    selected_transforms = [transformations[name] for name in transform_names if name in transformations]

    return pipe(*selected_transforms)(df.copy()) if selected_transforms else df

def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
    """Export filtered datasets to files"""
    Path(output_dir).mkdir(exist_ok=True)

    for name, df in filtered_datasets.items():
        output_file = Path(output_dir) / f"{name}_filtered.jsonl"
        with open(output_file, 'w') as f:
            for _, row in df.iterrows():
                f.write(json.dumps(row.to_dict()) + '\n')
        print(f"Exported {len(df)} records to {output_file}")

Then, through apply_transformations, we selectively apply the needed transformations in a functional chain, ensuring our data is enriched and structured. Once filtered, we use export_filtered_data to write each dataset variant into a separate .jsonl file. This enables us to store subsets, such as high-quality entries or non-duplicate records, in an organized format for downstream use.
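To verify an export afterwards, pandas can read the JSONL files straight back. This is an optional check that assumes project_dir holds the directory returned by setup_lilac_project (it carries a random suffix, so adjust the path to whatever the pipeline prints):

reloaded = pd.read_json(f"{project_dir}/exports/high_score_filtered.jsonl", lines=True)
print(reloaded.shape)  # rows and columns of the re-imported high_score view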

def main_analysis_pipeline():
    """Main analysis pipeline demonstrating functional approach"""

    print("Setting up Lilac project...")
    project_dir = setup_lilac_project("advanced_tutorial")

    print("Creating sample dataset...")
    sample_data = create_sample_data()
    dataset = create_dataset_from_data("sample_data", sample_data)

    print("Extracting data...")
    df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])

    print("Analyzing data quality...")
    quality_report = analyze_data_quality(df)
    print(f"Original data: {quality_report['total_records']} records")
    print(f"Duplicates: {quality_report['duplicate_rate']:.1%}")
    print(f"Average score: {quality_report['avg_score']:.2f}")

    print("Applying transformations...")
    transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])

    print("Applying filters...")
    filtered_datasets = apply_functional_filters(transformed_df)

    print("\nFilter Results:")
    for name, filtered_df in filtered_datasets.items():
        print(f" {name}: {len(filtered_df)} records")

    print("Exporting filtered datasets...")
    export_filtered_data(filtered_datasets, f"{project_dir}/exports")

    print("\nTop Quality Records:")
    best_quality = filtered_datasets['combined_quality'].head(3)
    for _, row in best_quality.iterrows():
        print(f" • {row['text']} (score: {row['score']}, category: {row['category']})")

    return {
        'original_data': df,
        'transformed_data': transformed_df,
        'filtered_data': filtered_datasets,
        'quality_report': quality_report
    }

if __name__ == "__main__":
    results = main_analysis_pipeline()
    print("\nAnalysis complete! Check the exports folder for filtered datasets.")

Finally, in the main_analysis_pipeline, we execute the full workflow, from setup to data export, showcasing how Lilac, combined with functional programming, allows us to build modular, scalable, and expressive pipelines. We even print out the top-quality entries as a quick snapshot. This function represents our full data curation loop, powered by Lilac.

In conclusion, users will have gained a hands-on understanding of creating a reproducible data pipeline that leverages Lilac’s dataset abstractions and functional programming patterns for scalable, clean analysis. The pipeline covers all critical stages, including dataset creation, transformation, filtering, quality analysis, and export, offering flexibility for both experimentation and deployment. It also demonstrates how to embed meaningful metadata such as normalized scores, quality tiers, and length categories, which can be instrumental in downstream tasks like modeling or human review.

Check out the Codes. All credit for this research goes to the researchers of this project.

UC San Diego Researchers Introduced Dex1B: A Billion-Scale Dataset for Dexterous Hand Manipulation in Robotics

Challenges in Dexterous Hand Manipulation Data Collection

Creating large-scale data for dexterous hand manipulation remains a major challenge in robotics. Although hands offer greater flexibility and richer manipulation potential than simpler tools, such as grippers, their complexity makes them difficult to control effectively. Many in the field have questioned whether dexterous hands are worth the added difficulty. The real issue, however, may be a lack of diverse, high-quality training data. Existing methods, such as human demonstrations, optimization, and reinforcement learning, offer partial solutions but have limitations. Generative models have emerged as a promising alternative; however, they often struggle with physical feasibility and tend to produce limited diversity by adhering too closely to known examples.

Evolution of Dexterous Hand Manipulation Approaches

Dexterous hand manipulation has long been central to robotics, initially driven by control-based techniques for precise multi-fingered grasping. Though these methods achieved impressive accuracy, they often struggled to generalize across varied settings. Learning-based approaches later emerged, offering greater adaptability through techniques such as pose prediction, contact maps, and intermediate representations, although they remain sensitive to data quality. Existing datasets, both synthetic and real-world, have their limits, either lacking diversity or being confined to human hand shapes.

Introduction to Dex1B Dataset

Researchers at UC San Diego have developed Dex1B, a massive dataset of one billion high-quality, diverse demonstrations for dexterous hand tasks like grasping and articulation. They combined optimization techniques with generative models, using geometric constraints for feasibility and conditioning strategies to boost diversity. Starting with a small, carefully curated dataset, they trained a generative model to scale up efficiently. A debiasing mechanism further enhanced diversity. Compared to previous datasets, such as DexGraspNet, Dex1B offers vastly more data. They also introduced DexSimple, a strong new baseline that leverages this scale to outperform past methods by 22% on grasping tasks.

Dex1B Benchmark Design and Methodology

The Dex1B benchmark is a large-scale dataset designed to evaluate two key dexterous manipulation tasks, grasping and articulation, using over one billion demonstrations across three robotic hands. Initially, a small but high-quality seed dataset is created using optimization methods. This seed data trains a generative model that produces more diverse and scalable demonstrations. To ensure success and variety, the team applies debiasing techniques and post-optimization adjustments. Tasks are completed via smooth, collision-free motion planning. The result is a richly diverse, simulation-validated dataset that enables realistic, high-volume training for complex hand-object interactions.

Insights on Multimodal Attention in Model Performance

Recent research explores the effect of combining cross-attention with self-attention in multimodal models. While self-attention facilitates understanding of relationships within a single modality, cross-attention enables the model to connect information across different modalities. The study finds that using both together improves performance, particularly in tasks that require aligning and integrating text and image features. Interestingly, cross-attention alone can sometimes outperform self-attention, especially when applied at deeper layers. This insight suggests that carefully designing how and where attention mechanisms are utilized within a model is crucial for comprehending and processing complex multimodal data.

Conclusion: Dex1B’s Impact and Future Potential

In conclusion, Dex1B is a massive synthetic dataset comprising one billion demonstrations for dexterous hand tasks, such as grasping and articulation. To generate this data efficiently, the researchers designed an iterative pipeline that combines optimization techniques with a generative model called DexSimple. Starting with an initial dataset created through optimization, DexSimple generates diverse, realistic manipulation proposals, which are then refined and quality-checked. Enhanced with geometric constraints, DexSimple significantly outperforms previous models on benchmarks like DexGraspNet. The dataset and model prove effective not only in simulation but also in real-world robotics, advancing the field of dexterous hand manipulation with scalable, high-quality data.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Build Custom AI Tools for Your AI Agents that Combine Machine Learning and Statistical Analysis

The ability to build custom tools is critical for building customizable AI Agents. In this tutorial, we demonstrate how to create a powerful and intelligent data analysis tool using Python that can be integrated into AI agents powered by LangChain. By defining a structured schema for user inputs and implementing key functionalities like correlation analysis, clustering, outlier detection, and target variable profiling, this tool transforms raw tabular data into actionable insights. Leveraging the modularity of LangChain’s BaseTool, the implementation illustrates how developers can encapsulate domain-specific logic and build reusable components that elevate the analytical capabilities of autonomous AI systems.

!pip install langchain langchain-core pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from typing import Dict, List, Tuple, Optional, Any
from langchain_core.tools import BaseTool
from langchain_core.tools.base import ToolException
from pydantic import BaseModel, Field
import json

We install the essential Python packages for data analysis, visualization, machine learning, and LangChain tool development, then import the key libraries, including pandas, numpy, scikit-learn, and langchain_core, setting up the environment to build a custom intelligent tool for AI agents. These libraries provide the foundation for preprocessing, clustering, evaluation, and tool integration.

class DataAnalysisInput(BaseModel):
    data: List[Dict[str, Any]] = Field(description="List of data records as dictionaries")
    analysis_type: str = Field(default="comprehensive", description="Type of analysis: 'comprehensive', 'clustering', 'correlation', 'outlier'")
    target_column: Optional[str] = Field(default=None, description="Target column for focused analysis")
    max_clusters: int = Field(default=5, description="Maximum clusters for clustering analysis")

Above, we define the input schema for the custom analysis tool using Pydantic’s BaseModel. The DataAnalysisInput class ensures that incoming data follows a structured format, allowing users to specify the dataset, type of analysis, an optional target column, and the maximum number of clusters for clustering tasks. It serves as a clean interface for validating inputs before analysis begins.
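As a small optional check (not part of the original code), the schema can be exercised directly to see the validation and defaults in action; model_dump() assumes Pydantic v2, so use .dict() instead on Pydantic v1:

sample_input = DataAnalysisInput(
    data=[{"age": 25, "income": 50000}, {"age": 40, "income": 90000}],
    analysis_type="clustering",
    max_clusters=3,
)
print(sample_input.model_dump())  # target_column defaults to None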

class IntelligentDataAnalyzer(BaseTool):
    name: str = "intelligent_data_analyzer"
    description: str = "Advanced data analysis tool that performs statistical analysis, machine learning clustering, outlier detection, correlation analysis, and generates visualizations with actionable insights."
    args_schema: type[BaseModel] = DataAnalysisInput
    response_format: str = "content_and_artifact"

    def _run(self, data: List[Dict], analysis_type: str = "comprehensive", target_column: Optional[str] = None, max_clusters: int = 5) -> Tuple[str, Dict]:
        try:
            df = pd.DataFrame(data)
            if df.empty:
                raise ToolException("Dataset is empty")

            insights = {"dataset_info": self._get_dataset_info(df)}

            if analysis_type in ["comprehensive", "correlation"]:
                insights["correlation_analysis"] = self._correlation_analysis(df)
            if analysis_type in ["comprehensive", "clustering"]:
                insights["clustering_analysis"] = self._clustering_analysis(df, max_clusters)
            if analysis_type in ["comprehensive", "outlier"]:
                insights["outlier_detection"] = self._outlier_detection(df)

            if target_column and target_column in df.columns:
                insights["target_analysis"] = self._target_analysis(df, target_column)

            recommendations = self._generate_recommendations(df, insights)
            summary = self._create_analysis_summary(insights, recommendations)

            artifact = {
                "insights": insights,
                "recommendations": recommendations,
                "data_shape": df.shape,
                "analysis_type": analysis_type,
                "numeric_columns": df.select_dtypes(include=[np.number]).columns.tolist(),
                "categorical_columns": df.select_dtypes(include=['object']).columns.tolist()
            }

            return summary, artifact

        except Exception as e:
            raise ToolException(f"Analysis failed: {str(e)}")

    def _get_dataset_info(self, df: pd.DataFrame) -> Dict:
        return {
            "shape": df.shape,
            "columns": df.columns.tolist(),
            "dtypes": df.dtypes.astype(str).to_dict(),
            "missing_values": df.isnull().sum().to_dict(),
            "memory_usage": df.memory_usage(deep=True).sum()
        }

    def _correlation_analysis(self, df: pd.DataFrame) -> Dict:
        numeric_df = df.select_dtypes(include=[np.number])
        if numeric_df.empty:
            return {"message": "No numeric columns for correlation analysis"}

        corr_matrix = numeric_df.corr()
        strong_corr = []
        for i in range(len(corr_matrix.columns)):
            for j in range(i + 1, len(corr_matrix.columns)):
                corr_val = corr_matrix.iloc[i, j]
                if abs(corr_val) > 0.7:
                    strong_corr.append({"var1": corr_matrix.columns[i], "var2": corr_matrix.columns[j], "correlation": round(corr_val, 3)})

        return {
            "correlation_matrix": corr_matrix.round(3).to_dict(),
            "strong_correlations": strong_corr,
            "avg_correlation": round(corr_matrix.values[np.triu_indices_from(corr_matrix.values, k=1)].mean(), 3)
        }

    def _clustering_analysis(self, df: pd.DataFrame, max_clusters: int) -> Dict:
        numeric_df = df.select_dtypes(include=[np.number]).dropna()
        if numeric_df.shape[0] < 2 or numeric_df.shape[1] < 2:
            return {"message": "Insufficient numeric data for clustering"}

        scaler = StandardScaler()
        scaled_data = scaler.fit_transform(numeric_df)

        inertias = []
        K_range = range(1, min(max_clusters + 1, len(numeric_df) // 2 + 1))

        for k in K_range:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            kmeans.fit(scaled_data)
            inertias.append(kmeans.inertia_)

        optimal_k = self._find_elbow_point(inertias, K_range)
        kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(scaled_data)

        cluster_stats = {}
        for i in range(optimal_k):
            cluster_data = numeric_df[cluster_labels == i]
            cluster_stats[f"cluster_{i}"] = {
                "size": len(cluster_data),
                "percentage": round(len(cluster_data) / len(numeric_df) * 100, 1),
                "means": cluster_data.mean().round(3).to_dict()
            }

        return {
            "optimal_clusters": optimal_k,
            "cluster_stats": cluster_stats,
            "silhouette_score": round(silhouette_score(scaled_data, cluster_labels), 3) if len(set(cluster_labels)) > 1 else 0.0,
            "inertias": inertias
        }

    def _outlier_detection(self, df: pd.DataFrame) -> Dict:
        numeric_df = df.select_dtypes(include=[np.number])
        if numeric_df.empty:
            return {"message": "No numeric columns for outlier detection"}

        outliers = {}
        for col in numeric_df.columns:
            data = numeric_df[col].dropna()
            Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
            IQR = Q3 - Q1
            iqr_outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
            z_scores = np.abs((data - data.mean()) / data.std())
            z_outliers = data[z_scores > 3]

            outliers[col] = {
                "iqr_outliers": len(iqr_outliers),
                "z_score_outliers": len(z_outliers),
                "outlier_percentage": round(len(iqr_outliers) / len(data) * 100, 2)
            }

        return outliers

    def _target_analysis(self, df: pd.DataFrame, target_col: str) -> Dict:
        if target_col not in df.columns:
            return {"error": f"Column {target_col} not found"}

        target_data = df[target_col].dropna()

        if pd.api.types.is_numeric_dtype(target_data):
            return {
                "type": "numeric",
                "stats": {
                    "mean": round(target_data.mean(), 3),
                    "median": round(target_data.median(), 3),
                    "std": round(target_data.std(), 3),
                    "skewness": round(target_data.skew(), 3),
                    "kurtosis": round(target_data.kurtosis(), 3)
                },
                "distribution": "normal" if abs(target_data.skew()) < 0.5 else "skewed"
            }
        else:
            value_counts = target_data.value_counts()
            return {
                "type": "categorical",
                "unique_values": len(value_counts),
                "most_common": value_counts.head(5).to_dict(),
                "entropy": round(-sum((p := value_counts / len(target_data)) * np.log2(p + 1e-10)), 3)
            }

    def _generate_recommendations(self, df: pd.DataFrame, insights: Dict) -> List[str]:
        recommendations = []

        missing_pct = sum(insights["dataset_info"]["missing_values"].values()) / (df.shape[0] * df.shape[1]) * 100
        if missing_pct > 10:
            recommendations.append(f"Consider data imputation - {missing_pct:.1f}% missing values detected")

        if "correlation_analysis" in insights and insights["correlation_analysis"].get("strong_correlations"):
            recommendations.append("Strong correlations detected - consider feature selection or dimensionality reduction")

        if "clustering_analysis" in insights:
            cluster_info = insights["clustering_analysis"]
            if isinstance(cluster_info, dict) and "optimal_clusters" in cluster_info:
                recommendations.append(f"Data segments into {cluster_info['optimal_clusters']} distinct groups - useful for targeted strategies")

        if "outlier_detection" in insights:
            high_outlier_cols = [col for col, info in insights["outlier_detection"].items() if isinstance(info, dict) and info.get("outlier_percentage", 0) > 5]
            if high_outlier_cols:
                recommendations.append(f"High outlier percentage in: {', '.join(high_outlier_cols)} - investigate data quality")

        return recommendations if recommendations else ["Data appears well-structured with no immediate concerns"]

    def _create_analysis_summary(self, insights: Dict, recommendations: List[str]) -> str:
        dataset_info = insights["dataset_info"]
        summary = f"""INTELLIGENT DATA ANALYSIS COMPLETE

Dataset Overview: {dataset_info['shape'][0]} rows × {dataset_info['shape'][1]} columns
Numeric Features: {len([c for c, t in dataset_info['dtypes'].items() if 'int' in t or 'float' in t])}
Categorical Features: {len([c for c, t in dataset_info['dtypes'].items() if 'object' in t])}

Key Insights Generated:
• Statistical correlations and relationships identified
• Clustering patterns discovered for segmentation
• Outlier detection completed for data quality assessment
• Feature importance and distribution analysis performed

Top Recommendations:
{chr(10).join('• ' + rec for rec in recommendations[:3])}

Analysis includes ML-powered clustering, statistical correlations, and actionable business insights."""

        return summary

    def _find_elbow_point(self, inertias: List[float], k_range: range) -> int:
        if len(inertias) < 3:
            return list(k_range)[0]
        diffs = [inertias[i - 1] - inertias[i] for i in range(1, len(inertias))]
        return list(k_range)[diffs.index(max(diffs)) + 1] if diffs else list(k_range)[0]

The IntelligentDataAnalyzer class is a custom tool built using LangChain’s BaseTool, designed to perform comprehensive data analysis on structured datasets. It integrates multiple analytical methods, including correlation matrix generation, K-Means clustering with silhouette scoring, outlier detection using IQR and z-score, and descriptive statistics on a target column, into a unified pipeline. The tool not only extracts valuable insights but also auto-generates recommendations and a summary report, making it highly useful for building AI agents that require decision-support capabilities grounded in data.

data_analyzer = IntelligentDataAnalyzer()

sample_data = [
    {"age": 25, "income": 50000, "education": "Bachelor", "satisfaction": 7},
    {"age": 35, "income": 75000, "education": "Master", "satisfaction": 8},
    {"age": 45, "income": 90000, "education": "PhD", "satisfaction": 6},
    {"age": 28, "income": 45000, "education": "Bachelor", "satisfaction": 7},
    {"age": 52, "income": 120000, "education": "Master", "satisfaction": 9},
]

result = data_analyzer.invoke({
    "data": sample_data,
    "analysis_type": "comprehensive",
    "target_column": "satisfaction"
})

print("Analysis Summary:")
print(result)

Finally, we initialize the IntelligentDataAnalyzer tool and feed it a sample dataset comprising demographic and satisfaction data. By specifying the analysis type as “comprehensive” and setting “satisfaction” as the target column, the tool performs a full suite of analyses, including statistical profiling, correlation checking, clustering, outlier detection, and target distribution analysis. The final output is a human-readable summary and structured insights that demonstrate how an AI agent can automatically process and interpret real-world tabular data.
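The tutorial invokes the tool directly, but the same object can also be handed to a chat model so an agent decides when to call it. The sketch below is our own illustrative addition, not part of the original code; it assumes the langchain-openai package is installed and an OpenAI API key is configured, and the model name is only an example:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([data_analyzer])

# The model is expected to emit a tool call targeting intelligent_data_analyzer
response = llm_with_tools.invoke(
    "Run a clustering analysis on this data: " + json.dumps(sample_data)
)
print(response.tool_calls)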

In conclusion, we have created an advanced custom tool to integrate with AI Agent. The IntelligentDataAnalyzer class handles a diverse range of analytical tasks, from statistical profiling to machine learning-based clustering, and also presents insights in a structured output with clear recommendations. This approach highlights how custom LangChain tools can bridge the gap between data science and interactive AI, making agents more context-aware and capable of delivering rich, data-driven decisions.

Check out the Codes. All credit for this research goes to the researchers of this project.

Tencent Open Sources Hunyuan-A13B: A 13B Active Parameter MoE Model with Dual-Mode Reasoning and 256K Context

Tencent’s Hunyuan team has introduced Hunyuan-A13B, a new open-source large language model built on a sparse Mixture-of-Experts (MoE) architecture. While the model consists of 80 billion total parameters, only 13 billion are active during inference, offering a highly efficient balance between performance and computational cost. It supports Grouped Query Attention (GQA), 256K context length, and a dual-mode reasoning framework that toggles between fast and slow thinking.

Designed for efficient deployment and robust reasoning, Hunyuan-A13B achieves top-tier performance across agentic benchmarks including BFCL-v3, τ-Bench, C3-Bench, and ComplexFuncBench, often outperforming larger models in tool-calling and long-context scenarios.

Architecture: Sparse MoE with 13B Active Parameters

At its core, Hunyuan-A13B follows a fine-grained MoE design comprising 1 shared expert and 64 non-shared experts, with 8 experts activated per forward pass. This architecture, backed by scaling experiments, ensures performance consistency while keeping inference costs low. The model includes 32 layers, uses SwiGLU activations, a vocabulary size of 128K, and integrates GQA for enhanced memory efficiency during long-context inference.

The model’s MoE setup is paired with an optimized training curriculum: a 20T-token pretraining phase, followed by fast annealing and long-context adaptation. This last phase scales the context window first to 32K and then to 256K tokens using NTK-aware positional encoding, ensuring stable performance at large sequence lengths.

Dual-Mode Reasoning: Fast and Slow Thinking

A standout feature of Hunyuan-A13B is its dual-mode Chain-of-Thought (CoT) capability. It supports both a low-latency fast-thinking mode for routine queries and a more elaborate slow-thinking mode for multi-step reasoning. These modes are controlled through a simple tag system: /no think for fast inference and /think for reflective reasoning. This flexibility allows users to adapt computational cost to task complexity.

Post-Training: Reinforcement Learning with Task-Specific Reward Models

The post-training pipeline of Hunyuan-A13B includes multi-stage supervised fine-tuning (SFT) and reinforcement learning (RL) across both reasoning-specific and general tasks. The RL stages incorporate outcome-based rewards and tool-specific feedback, including sandbox execution environments for code and rule-based checks for agents.

In the agent training phase, the team synthesized diverse tool-use scenarios with planner, checker, and tool roles, generating over 20,000 format combinations. This reinforced Hunyuan-A13B’s ability to execute real-world workflows such as spreadsheet processing, information search, and structured reasoning.

Evaluation: State-of-the-Art Agentic Performance

Hunyuan-A13B shows strong benchmark results across diverse NLP tasks:

On MATH, CMATH, and GPQA, it scores on par or above larger dense and MoE models.

It surpasses Qwen3-A22B and DeepSeek R1 in logical reasoning (BBH: 89.1; ZebraLogic: 84.7).

In coding, it holds its own with 83.9 on MBPP and 69.3 on MultiPL-E.

For agent tasks, it leads on BFCL-v3 (78.3) and ComplexFuncBench (61.2), validating its tool-usage capabilities.

Long-context comprehension is another highlight. On PenguinScrolls, it scores 87.7—just shy of Gemini 2.5 Pro. On RULER, it sustains high performance (73.9) even at 64K–128K context, outperforming larger models like Qwen3-A22B and DeepSeek R1 in context resilience.

Inference Optimization and Deployment

Hunyuan-A13B is fully integrated with popular inference frameworks like vLLM, SGLang, and TensorRT-LLM. It supports precision formats such as W16A16, W8A8, and KV Cache FP8, along with features like Auto Prefix Caching and Chunk Prefill. It achieves up to 1981.99 tokens/sec throughput on a 32-batch input (2048 input, 14336 output length), making it practical for real-time applications.
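As a rough serving illustration (a minimal sketch, not an official recipe: the Hugging Face repo id and the handling of the reasoning tags are assumptions to verify against Tencent's model card), the checkpoint could be loaded through vLLM's offline API:

from vllm import LLM, SamplingParams

# Assumed repo id -- confirm the exact name on Hugging Face before use.
llm = LLM(model="tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

# "/no think" requests the fast-thinking mode described above; "/think" the slow mode.
prompts = ["/no think Summarize the advantages of sparse MoE inference in two sentences."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)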

Open Source and Industry Relevance

Available on Hugging Face and GitHub, Hunyuan-A13B is released with permissive open-source licensing. It’s engineered for efficient research and production use, especially in latency-sensitive environments and long-context tasks.

By combining MoE scalability, agentic reasoning, and open-source accessibility, Tencent’s Hunyuan-A13B offers a compelling alternative to heavyweight LLMs, enabling broader experimentation and deployment without sacrificing capability.

Check out the Paper. All credit for this research goes to the researchers of this project.

Getting started with Gemini Command Line Interface (CLI)

Google recently released the Gemini CLI, a powerful command-line tool designed to supercharge developer workflows with AI. Whether you’re working across massive codebases, automating tedious tasks, or generating new apps from sketches and PDFs, Gemini CLI brings multimodal intelligence right to your terminal.

With Gemini CLI, you can:

Query and edit large codebases—even beyond the standard 1M token context window.

Generate apps from visual inputs like PDFs or design sketches.

Automate operational workflows—from handling pull requests to managing rebases.

Connect external tools and MCP servers, including Imagen, Veo, and Lyria for media generation.

Use Google Search as a grounding tool, directly within your terminal.

In this tutorial, we’ll walk you through how to install, configure, and start using Gemini CLI to enhance your daily developer tasks.

Installing Node JS

To get started, you’ll need to have Node.js installed on your system:

Go to nodejs.org and download the latest LTS version.

Run the installer.

Use the default settings and complete the installation.

Installing & using the CLI

To install the Gemini CLI, run the following command:

npm install -g @google/gemini-cli

Once installed, the CLI can be initialized by simply running the following command in the terminal:

gemini

On the first run, you’ll be prompted to:

Choose a color theme for the CLI interface.

Authenticate with your personal Google account – This allows access to Gemini with generous usage limits:  60 requests per minute and 1,000 requests per day.

You’re now ready to start using Gemini CLI to enhance your development workflow!

If you need access to a specific Gemini model or want higher usage limits, you can use your own API key.

Generate a key from Google AI Studio.

Set it as an environment variable in your terminal by running:

export GEMINI_API_KEY="YOUR_API_KEY"

Replace YOUR_API_KEY with the actual key you generated. This allows Gemini CLI to authenticate using your key instead of your personal Google account.

Querying a GitHub Repo with Gemini

Once everything is configured, we will test it with a GitHub repository.

Run the following command to clone the Marktechpost GitHub repo containing multiple AI tutorials:

git clone https://github.com/Marktechpost/AI-Notebooks.git
cd AI-Notebooks

Once inside the AI-Notebooks folder, run the following command to launch the CLI:

gemini

This will launch the CLI

Summarizing the different tutorials in the repository

To get started, let’s try a simple prompt:

Give an overview of the different tutorials in this repository

Gemini CLI will read the README.md file—assuming it contains details about the tutorials—and generate a concise summary based on that information.

Explaining the different files in a sub-folder

To refer to a specific directory or file in your prompt, use the @ symbol followed by the folder or file name. Gemini CLI also supports auto-complete, so when you type @, it will suggest available files and folders automatically.

Let’s test this with the following prompt:

@A2A_Simple_Agent briefly explain the different files in this folder and how they work together to implement the A2A agent. Focus only on the .py files and the README.md file

Executing a git command

Gemini CLI can also execute shell commands directly from your prompts.

How many git commits have been made so far

When running a command like this, Gemini will:

Ask for your permission before executing it.

Run the shell command safely.

Automatically fetch and display the result.

Updating the memory

We can also manage the AI’s instructional context using the /memory command

/memory add This Git repository contains multiple self-contained tutorial projects demonstrating how to use the Gemini CLI and build agent-based systems. Each folder (e.g., A2A_Simple_Agent) focuses on a specific concept like agent communication, tool use, or integration patterns. When asked, summarize or build on individual tutorials while keeping their scope isolated.

Checking the stats

The /stats command in Gemini CLI provides a detailed summary of your current session. It shows key metrics such as total token usage, any savings from cached tokens (when available), and the overall session duration. This is useful for tracking your usage efficiency and understanding how the model is being utilized during your workflow.

/stats

Quitting the session

You can end your Gemini CLI session at any time by using the /quit command. Once you exit, the CLI will display a session summary—including total tokens used, session duration, and a breakdown of input and output tokens.

/quit

Further reading

To explore the full range of commands, check out the Gemini CLI Commands Guide. There are many powerful commands that make Gemini CLI a versatile tool for developers. In this tutorial, we’ve only scratched the surface to give you a basic overview of its core features. For more details and updates, visit the official Gemini CLI GitHub repository.

Alibaba Qwen Team Releases Qwen-VLo: A Unified Multimodal Understanding and Generation Model

The Alibaba Qwen team has introduced Qwen-VLo, a new addition to its Qwen model family, designed to unify multimodal understanding and generation within a single framework. Positioned as a powerful creative engine, Qwen-VLo enables users to generate, edit, and refine high-quality visual content from text, sketches, and commands—in multiple languages and through step-by-step scene construction. This model marks a significant leap in multimodal AI, making it highly applicable for designers, marketers, content creators, and educators.

Unified Vision-Language Modeling

Qwen-VLo builds on Qwen-VL, Alibaba’s earlier vision-language model, by extending it with image generation capabilities. The model integrates visual and textual modalities in both directions—it can interpret images and generate relevant textual descriptions or respond to visual prompts, while also producing visuals based on textual or sketch-based instructions. This bidirectional flow enables seamless interaction between modalities, optimizing creative workflows.

Key Features of Qwen-VLo

Concept-to-Polish Visual Generation: Qwen-VLo supports generating high-resolution images from rough inputs, such as text prompts or simple sketches. The model understands abstract concepts and converts them into polished, aesthetically refined visuals. This capability is ideal for early-stage ideation in design and branding.

On-the-Fly Visual Editing: With natural language commands, users can iteratively refine images, adjusting object placements, lighting, color themes, and composition. Qwen-VLo simplifies tasks like retouching product photography or customizing digital advertisements, eliminating the need for manual editing tools.

Multilingual Multimodal Understanding: Qwen-VLo is trained with support for multiple languages, allowing users from diverse linguistic backgrounds to engage with the model. This makes it suitable for global deployment in industries such as e-commerce, publishing, and education.

Progressive Scene Construction: Rather than rendering complex scenes in one pass, Qwen-VLo enables progressive generation. Users can guide the model step-by-step—adding elements, refining interactions, and adjusting layouts incrementally. This mirrors natural human creativity and improves user control over output.

Architecture and Training Enhancements

While details of the model architecture are not deeply specified in the public blog, Qwen-VLo likely inherits and extends the Transformer-based architecture from the Qwen-VL line. The enhancements focus on fusion strategies for cross-modal attention, adaptive fine-tuning pipelines, and integration of structured representations for better spatial and semantic grounding.

The training data includes multilingual image-text pairs, sketches with image ground truths, and real-world product photography. This diverse corpus allows Qwen-VLo to generalize well across tasks like composition generation, layout refinement, and image captioning.

Target Use Cases

Design & Marketing: Qwen-VLo’s ability to convert text concepts into polished visuals makes it ideal for ad creatives, storyboards, product mockups, and promotional content.

Education: Educators can visualize abstract concepts (e.g., science, history, art) interactively. Language support enhances accessibility in multilingual classrooms.

E-commerce & Retail: Online sellers can use the model to generate product visuals, retouch shots, or localize designs per region.

Social Media & Content Creation: For influencers or content producers, Qwen-VLo offers fast, high-quality image generation without relying on traditional design software.

Key Benefits

Qwen-VLo stands out in the current LMM (Large Multimodal Model) landscape by offering:

Seamless text-to-image and image-to-text transitions

Localized content generation in multiple languages

High-resolution outputs suitable for commercial use

Editable and interactive generation pipeline

Its design supports iterative feedback loops and precision edits, which are critical for professional-grade content generation workflows.

Conclusion

Alibaba’s Qwen-VLo pushes forward the frontier of multimodal AI by merging understanding and generation capabilities into a cohesive, interactive model. Its flexibility, multilingual support, and progressive generation features make it a valuable tool for a wide array of content-driven industries. As the demand for visual and language content convergence grows, Qwen-VLo positions itself as a scalable, creative assistant ready for global adoption.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.

Getting Started with MLFlow for LLM Evaluation

MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it’s traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow has recently introduced support for evaluating Large Language Models (LLMs).

In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM—in our case, Google’s Gemini model—on a set of fact-based prompts. We’ll generate responses to fact-based prompts using Gemini and assess their quality using a variety of metrics supported directly by MLflow.

Setting up the dependencies

For this tutorial, we’ll be using both the OpenAI and Gemini APIs. MLflow’s built-in generative AI evaluation metrics currently rely on OpenAI models (e.g., GPT-4) to act as judges for metrics like answer similarity or faithfulness, so an OpenAI API key is required. You can obtain:

Your OpenAI API key from https://platform.openai.com/settings/organization/api-keys

Your Google Gemini API key from https://ai.google.dev/gemini-api/docs

Installing the libraries

pip install mlflow openai pandas google-genai

Setting the OpenAI and Google API Keys as environment variable

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')

Preparing Evaluation Data and Fetching Outputs from Gemini

import mlflow
import openai
import os
import pandas as pd
from google import genai

Creating the evaluation data

In this step, we define a small evaluation dataset containing factual prompts along with their correct ground truth answers. These prompts span topics such as science, health, web development, and programming. This structured format allows us to objectively compare the Gemini-generated responses against known correct answers using various evaluation metrics in MLflow.

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' — the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)

eval_data

Getting Gemini Responses

This code block defines a helper function gemini_completion() that sends a prompt to the Gemini 1.5 Flash model using the Google Generative AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model’s predictions, storing them in a new “predictions” column. These predictions will later be evaluated against the ground truth answers

client = genai.Client()

def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data

Evaluating Gemini Outputs with MLflow

In this step, we initiate an MLflow run to evaluate the responses generated by the Gemini model against a set of factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measuring semantic similarity between the model’s output and the ground truth), exact_match (checking for word-for-word matches), latency (tracking response generation time), and token_count (logging the number of output tokens).

It’s important to note that the answer_similarity metric internally uses OpenAI’s GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to assess LLM outputs without relying on custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)

To view the detailed results of our evaluation, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This allows us to inspect individual prompts, Gemini-generated predictions, ground truth answers, and the associated metric scores without truncation, which is especially helpful in notebook environments like Colab or Jupyter.

results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results

Check out the Codes here. All credit for this research goes to the researchers of this project.

Unbabel Introduces TOWER+: A Unified Framework for High-Fidelity Translation and Instruction-Following in Multilingual LLMs

Large language models have driven progress in machine translation, leveraging massive training corpora to translate dozens of languages and dialects while capturing subtle linguistic nuances. Yet, fine-tuning these models for translation accuracy often impairs their instruction-following and conversational skills, and broad-purpose versions struggle to meet professional fidelity standards. Balancing precise, culturally aware translations with the ability to handle code generation, problem-solving, and user-specific formatting remains challenging. Models must also preserve terminological consistency and adhere to formatting guidelines across varied audiences. Stakeholders require systems that can dynamically adapt to domain requirements and user preferences without sacrificing fluency. Benchmark scores such as WMT24++, covering 55 language variants, and IFEval’s 541 instruction-focused prompts highlight the gap between specialized translation quality and general-purpose versatility, posing a critical bottleneck for enterprise deployment.

Current Approaches to Tailoring Language Models for Translation Accuracy

Multiple approaches have been explored to tailor language models for translation. Fine-tuning pre-trained large language models on parallel corpora has been used to improve the adequacy and fluency of translated text. Meanwhile, continued pretraining on a combination of monolingual and parallel data enhances multilingual fluency. Some research teams have supplemented training with reinforcement learning from human feedback to align outputs with quality preferences. Proprietary systems such as GPT-4o and Claude 3.7 have demonstrated leading translation quality, and open-weight adaptations including TOWER V2 and GEMMA 2 models have reached parity or surpassed closed-source models under certain language scenarios. These strategies reflect continuous efforts to address the dual demands of translation accuracy and broad language capabilities.

Introducing TOWER+: Unified Training for Translation and General Language Tasks

Researchers from Unbabel, Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), and MICS, CentraleSupélec, Université Paris-Saclay, introduced TOWER+, a suite of models. The research team designed variants at multiple parameter scales, 2 billion, 9 billion, and 72 billion, to explore the trade-off between translation specialization and general-purpose utility. By implementing a unified training pipeline, the researchers aimed to position TOWER+ models on the Pareto frontier, achieving both high translation performance and robust general capabilities without sacrificing one for the other. The approach leverages architectures to balance the specific demands of machine translation with the flexibility required by conversational and instructional tasks, supporting a range of application scenarios.

TOWER+ Training Pipeline: Pretraining, Supervised Tuning, Preferences, and RL

The training pipeline begins with continued pretraining on carefully curated data that includes monolingual content, filtered parallel sentences formatted as translation instructions, and a small fraction of instruction-like examples. Next, supervised fine-tuning refines the model using a combination of translation tasks and diverse instruction-following scenarios, including code generation, mathematical problem-solving, and question-answering. A preference optimization stage follows, employing weighted preference optimization and group-relative policy updates trained on off-policy signals and human-edited translation variants. Finally, reinforcement learning with verifiable rewards reinforces precise compliance with transformation guidelines, using regex-based checks and preference annotations to refine the model’s ability to follow explicit instructions during translation. This combination of pretraining, supervised alignment, and reward-driven updates yields a robust balance between specialized translation accuracy and versatile language proficiency.
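To make the final verifiable-reward stage more concrete, the snippet below is an illustrative simplification of a regex-based compliance check (our own sketch, not Unbabel's implementation): the reward fires only when the translation satisfies an explicit formatting or terminology constraint.

import re

def formatting_reward(output: str, required_pattern: str) -> float:
    """Toy verifiable reward: 1.0 if the output matches the required regex, else 0.0."""
    return 1.0 if re.search(required_pattern, output) else 0.0

# Example: the instruction demanded that the product name stay untranslated and uppercased.
print(formatting_reward("A TOWER+ oferece traducoes fieis.", r"TOWER\+"))  # 1.0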

Benchmark Results: TOWER+ Achieves State-of-the-Art Translation and Instruction Following

The TOWER+ 9B model achieved a win rate of 33.47% on multilingual general chat prompts, while earning an XCOMET-XXL score of 84.38 across 24 language pairs, outperforming similarly sized open-weight counterparts. The flagship 72 billion-parameter variant secured a 54.52 percent win rate on M-ArenaHard, recorded an IFEval instruction-following score of 89.02, and reached an XCOMET-XXL level of 83.29 on the full WMT24++ benchmark. On the combined translation and instruction-following benchmark, IF-MT scored 5.55 for instruction adherence and 88.95 for translation fidelity, establishing state-of-the-art results among open-weight models. These outcomes confirm that the researchers’ integrative pipeline effectively bridges the gap between specialized translation performance and broad language capabilities, demonstrating its viability for both enterprise and research applications.

Key Technical Highlights of the TOWER+ Models

TOWER+ models, developed by Unbabel and academic partners, span 2 B, 9 B, and 72 B parameters to explore the performance frontier between translation specialization and general-purpose utility.

The post-training pipeline integrates four stages: continued pretraining (66% monolingual, 33% parallel, and 1% instruction), supervised fine-tuning (22.3% translation), Weighted Preference Optimization, and verifiable reinforcement learning, to preserve chat skills while enhancing translation accuracy.

Continued pretraining covers 27 languages and dialects, as well as 47 language pairs, over 32 billion tokens, merging specialized and general checkpoints to maintain balance.

The 9 B variant achieved a 33.47% win rate on M-ArenaHard, 83.84% on IFEval, and an 84.38% XCOMET-XXL across 24 pairs, with IF-MT scores of 4.85 (instruction) and 88.51 (translation).

The 72 B model recorded 54.52% M-ArenaHard, 89.02% IFEval, 83.29% XCOMET-XXL, and 5.55/88.95% IF-MT, setting a new open-weight standard.

Even the 2B model matched larger baselines, with 6.33% on M-ArenaHard and 87.65% IF-MT translation quality.

Benchmarked against GPT-4O-1120, Claude-Sonnet-3.7, ALMA-R, GEMMA-2, and LLAMA-3.3, the TOWER+ suite consistently matches or outperforms on both specialized and general tasks.

The research provides a reproducible recipe for building LLMs that serve translation and conversational needs concurrently, reducing model proliferation and operational overhead.

Conclusion: A Pareto-Optimal Framework for Future Translation-Focused LLMs

In conclusion, by unifying large-scale pretraining with specialized alignment stages, TOWER+ demonstrates that translation excellence and conversational versatility can coexist within a single open-weight suite. The models achieve a Pareto-optimal balance across translation fidelity, instruction-following, and general chat capabilities, offering a scalable blueprint for future domain-specific LLM development.

Check out the Paper and Models. All credit for this research goes to the researchers of this project.

Polaris-4B and Polaris-7B: Post-Training Reinforcement Learning for Efficient Math and Logic Reasoning

The Rising Need for Scalable Reasoning Models in Machine Intelligence

Advanced reasoning models are at the frontier of machine intelligence, especially in domains like math problem-solving and symbolic reasoning. These models are designed to perform multi-step calculations and logical deductions, often generating solutions that mirror human reasoning processes. Reinforcement learning techniques are used to improve accuracy after pretraining; however, scaling these methods while retaining efficiency remains a complex challenge. As demand increases for smaller, more resource-efficient models that still exhibit high reasoning capability, researchers are now turning to strategies that address data quality, exploration methods, and long-context generalization.

Challenges in Reinforcement Learning for Large Reasoning Architectures

A persistent problem with reinforcement learning for large-scale reasoning models is the mismatch between the model’s capability and the difficulty of the training data. When a model is exposed to tasks that are too simple, its learning curve stagnates. Conversely, overly difficult data can overwhelm the model and yield no learning signal. This difficulty imbalance is especially pronounced when applying recipes that work well for small models to larger ones. Another issue is the lack of methods to efficiently adapt rollout diversity and output length during both training and inference, which further constrains a model’s reasoning abilities on complex benchmarks.

Limitations of Existing Post-Training Approaches on Advanced Models

Earlier approaches, such as DeepScaleR and GRPO, have demonstrated that reinforcement learning can improve the performance of small-scale reasoning models with as few as 1.5 billion parameters. However, applying these same recipes to more capable models, such as Qwen3-4B or Deepseek-R1-Distill-Qwen-7B, results in only marginal gains or even performance drops. One key limitation is the static nature of data distribution and the limited diversity of sampling. Most of these approaches do not filter data based on model capability, nor do they adjust sampling temperature or response length over time. As a result, they often fail to scale effectively when used on more advanced architectures.

Introducing Polaris: A Tailored Recipe for Scalable RL in Reasoning Tasks

Researchers from the University of Hong Kong, Bytedance Seed, and Fudan University introduced Polaris, a post-training recipe designed specifically to scale reinforcement learning for advanced reasoning tasks. Polaris includes two preview models: Polaris-4B-Preview and Polaris-7B-Preview. Polaris-4B-Preview is fine-tuned from Qwen3-4B, while Polaris-7B-Preview is based on Deepseek-R1-Distill-Qwen-7B. The researchers focused on building a model-agnostic framework that modifies data difficulty, encourages diverse exploration through controlled sampling temperatures, and extends inference capabilities through length extrapolation. These strategies were developed using open-source datasets and training pipelines, and both models are optimized to run on consumer-grade graphics processing units (GPUs).

Polaris Innovations: Difficulty Balancing, Controlled Sampling, and Long-Context Inference

Polaris implements multiple innovations. First, the training data is curated by removing problems that are either too easy or unsolvable, creating a mirrored J-shape distribution of difficulty. This ensures that the training data evolves with the model’s growing capabilities. Second, the researchers dynamically adjust the sampling temperature across training stages—using 1.4, 1.45, and 1.5 for Polaris-4B and 0.7, 1.0, and 1.1 for Polaris-7B—to maintain rollout diversity. Furthermore, the method employs a Yarn-based extrapolation technique to extend the inference context length to 96K tokens without requiring additional training. This addresses the inefficiency of long-sequence training by enabling a “train-short, test-long” approach. The model also employs techniques such as the Rollout Rescue Mechanism and Intra-Batch Informative Substitution to prevent zero-reward batches and ensure that useful training signals are preserved, even when the rollout size is kept small at 8.
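To make the recipe concrete, the following minimal Python sketch expresses the two controls that are easiest to write down: the stage-wise temperature schedule quoted above and a simple difficulty filter in the spirit of the curation step. The filter thresholds are assumptions for illustration, not values from the paper.

def sampling_temperature(model: str, stage: int) -> float:
    """Stage-wise sampling temperatures reported for the Polaris preview models
    (stage is 1, 2, or 3, matching the three training stages described above)."""
    schedule = {
        "polaris-4b": [1.4, 1.45, 1.5],
        "polaris-7b": [0.7, 1.0, 1.1],
    }
    return schedule[model][stage - 1]

def keep_problem(pass_rate: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Hypothetical difficulty filter: drop problems the current model solves
    almost always or essentially never, so every retained example carries a
    learning signal. The 0.1/0.9 cutoffs are illustrative assumptions."""
    return low < pass_rate < high

print(sampling_temperature("polaris-4b", stage=2))  # 1.45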

Benchmark Results: Polaris Outperforms Larger Commercial Models

Polaris models achieve state-of-the-art results across multiple math benchmarks. Polaris-4B-Preview records 81.2% accuracy on AIME24 and 79.4% on AIME25, outperforming even Qwen3-32B on the same tasks while using less than 2% of its parameters. It scores 44.0% on Minerva Math, 69.1% on Olympiad Bench, and 94.8% on AMC23. Polaris-7B-Preview also performs strongly, scoring 72.6% on AIME24 and 52.6% on AIME25. These results demonstrate consistent improvement over models such as Claude-4-Opus and Grok-3-Beta, establishing Polaris as a competitive, lightweight model that bridges the performance gap between small open models and commercial 30B+ models.

Conclusion: Efficient Reinforcement Learning Through Smart Post-Training Strategies

The researchers demonstrated that the key to scaling reasoning models is not just larger model size but intelligent control over training data difficulty, sampling diversity, and inference length. Polaris offers a reproducible recipe that effectively tunes these elements, allowing smaller models to rival the reasoning ability of massive commercial systems.

Check out the Model and Code. All credit for this research goes to the researchers of this project.

AWS costs estimation using Amazon Q CLI and AWS Cost Analysis MCP

Managing and optimizing AWS infrastructure costs is a critical challenge for organizations of all sizes. Traditional cost analysis approaches often involve the following:

Complex spreadsheets – Creating and maintaining detailed cost models, which requires significant effort
Multiple tools – Switching between the AWS Pricing Calculator, AWS Cost Explorer, and third-party tools
Specialized knowledge – Understanding the nuances of AWS pricing across services and AWS Regions
Time-consuming analysis – Manually comparing different deployment options and scenarios
Delayed optimization – Cost insights often come too late to inform architectural decisions

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a revolutionary approach to AWS cost analysis. By using generative AI through natural language prompts, teams can now generate detailed cost estimates, comparisons, and optimization recommendations in minutes rather than hours, while providing accuracy through integration with official AWS pricing data.
In this post, we explore how to use Amazon Q CLI with the AWS Cost Analysis MCP server to perform sophisticated cost analysis that follows AWS best practices. We discuss basic setup and advanced techniques, with detailed examples and step-by-step instructions.
Solution overview
Amazon Q Developer CLI is a command line interface that brings the generative AI capabilities of Amazon Q directly to your terminal. Developers can interact with Amazon Q through natural language prompts, making it an invaluable tool for various development tasks. Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to different data sources or tools. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.
The MCP uses a client-server architecture containing the following components:

Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
Client – Protocol clients that maintain one-to-one connections with servers
Server – Lightweight programs that expose specific capabilities or tools through the standardized MCP
Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

As announced in April 2025, the MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Cost Analysis MCP server specifically enables Amazon Q to generate detailed cost estimates, reports, and optimization recommendations using real-time AWS pricing data.
Prerequisites
To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.
Set up your environment
Before you can start analyzing costs, you need to set up your environment with Amazon Q CLI and the AWS Cost Analysis MCP server. This section provides detailed instructions for installation and configuration.
Install Amazon Q Developer CLI
Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
Verify the installation by running the following command: q --version You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
Configure Amazon Q CLI with your AWS credentials: q login
Choose the login method suitable for you:

Use for free with AWS Builder ID
Use with Pro license

Set up MCP servers
Before using the AWS Cost Analysis MCP server with Amazon Q CLI, you must install several tools and configure your environment. The following steps guide you through installing the necessary tools and setting up the MCP server configuration:

Install Pandoc, which is used to convert the output to PDF, with the following command (you can also install it with brew): pip install pandoc
Install uv with the following command: pip install uv
Install Python 3.10 or newer: uv python install 3.10
Add the servers to your ~/.aws/amazonq/mcp.json file:
{
  "mcpServers": {
    "awslabs.cost-analysis-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cost-analysis-mcp-server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}
Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.
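If the server does not appear when you start a chat, a quick sanity check is to confirm that the configuration file parses and contains the expected entry. The short Python snippet below is just a convenience check on the file created in the previous step; it is not part of the official tooling.

import json
from pathlib import Path

config_path = Path.home() / ".aws" / "amazonq" / "mcp.json"
config = json.loads(config_path.read_text())

servers = config.get("mcpServers", {})
if "awslabs.cost-analysis-mcp-server" in servers:
    print("Cost Analysis MCP server is configured, command:",
          servers["awslabs.cost-analysis-mcp-server"]["command"])
else:
    print("Cost Analysis MCP server entry not found in", config_path)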

Understanding MCP server tools
The AWS Cost Analysis MCP server provides several powerful tools:

get_pricing_from_web – Retrieves pricing information from AWS pricing webpages
get_pricing_from_api – Fetches pricing data from the AWS Price List API
generate_cost_report – Creates detailed cost analysis reports with breakdowns and visualizations
analyze_cdk_project – Analyzes AWS Cloud Development Kit (AWS CDK) projects to identify services used and estimate costs
analyze_terraform_project – Analyzes Terraform projects to identify services used and estimate costs
get_bedrock_patterns – Retrieves architecture patterns for Amazon Bedrock with cost considerations

These tools work together to help you create accurate cost estimates that follow AWS best practices.
Test your setup
Let’s verify that everything is working correctly by generating a simple cost analysis:

Start the Amazon Q CLI chat interface and verify the output shows the MCP server being loaded and initialized: q chat
In the chat interface, enter the following prompt: Please create a cost analysis for a simple web application with an Application Load Balancer, two t3.medium EC2 instances, and an RDS db.t3.medium MySQL database. Assume 730 hours of usage per month and moderate traffic of about 100 GB data transfer. Convert estimation to a PDF format.
Amazon Q CLI will ask for permission to trust the tool that is being used; enter t to trust it. Amazon Q should generate and display a detailed cost analysis. Your output should look like the following screenshot. If you see the cost analysis report, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed the necessary tools and that the servers are listed in the ~/.aws/amazonq/mcp.json file.

Configuration options
The AWS Cost Analysis MCP server supports several configuration options to customize your cost analysis experience:

Output format – Choose markdown, CSV, or PDF (the format we installed Pandoc for) for cost reports
Pricing model – Specify on-demand, reserved instances, or savings plans
Assumptions and exclusions – Customize the assumptions and exclusions in your cost analysis
Detailed cost data – Provide specific usage patterns for more accurate estimates

Now that our environment is set up, let’s create more cost analyses.
Create AWS Cost Analysis reports
In this section, we walk through the process of creating AWS cost analysis reports using Amazon Q CLI with the AWS Cost Analysis MCP server.
When you provide a prompt to Amazon Q CLI, the AWS Cost Analysis MCP server completes the following steps:

Interpret your requirements.
Retrieve pricing data from AWS pricing sources.
Generate a detailed cost analysis report.
Provide optimization recommendations.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.
AWS Cost Analysis reports typically include the following information:

Service costs – Breakdown of costs by AWS service
Unit pricing – Detailed unit pricing information
Usage quantities – Estimated usage quantities for each service
Calculation details – Step-by-step calculations showing how costs were derived
Assumptions – Clearly stated assumptions used in the analysis
Exclusions – Costs that were not included in the analysis
Recommendations – Cost optimization suggestions

Example 1: Analyze a serverless application
Let’s create a cost analysis for a simple serverless application. Use the following prompt:
Create a cost analysis for a serverless application using API Gateway, Lambda, and DynamoDB. Assume 1 million API calls per month, average Lambda execution time of 200ms with 512MB memory, and 10GB of DynamoDB storage with 5 million read requests and 1 million write requests per month. Convert estimation to a PDF format.
Upon entering your prompt, Amazon Q CLI will retrieve pricing data using the get_pricing_from_web or get_pricing_from_api tools, and will then use the generate_cost_report tool from the awslabs.cost-analysis-mcp-server.

You should receive an output giving a detailed cost breakdown based on the prompt along with optimization recommendations.

The generated cost analysis shows the following information:

Amazon API Gateway costs for 1 million requests
AWS Lambda costs for compute time and requests
Amazon DynamoDB costs for storage, read, and write capacity
Total monthly cost estimate
Cost optimization recommendations
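For intuition about what the report computes, the snippet below reproduces the arithmetic for this serverless workload in plain Python. The unit prices are illustrative placeholders rather than current AWS pricing; the MCP server retrieves real prices at analysis time, which is the whole point of using it.

# Illustrative unit prices (placeholders, NOT current AWS pricing)
API_GW_PER_MILLION_REQ = 3.50
LAMBDA_PER_GB_SECOND = 0.0000166667
LAMBDA_PER_MILLION_REQ = 0.20
DDB_STORAGE_PER_GB = 0.25
DDB_PER_MILLION_READS = 0.25
DDB_PER_MILLION_WRITES = 1.25

requests_millions = 1.0                                 # 1 million API calls
lambda_gb_seconds = 1_000_000 * 0.2 * (512 / 1024)      # 200 ms at 512 MB each
storage_gb, reads_millions, writes_millions = 10, 5, 1

monthly_estimate = (
    requests_millions * API_GW_PER_MILLION_REQ
    + lambda_gb_seconds * LAMBDA_PER_GB_SECOND
    + requests_millions * LAMBDA_PER_MILLION_REQ
    + storage_gb * DDB_STORAGE_PER_GB
    + reads_millions * DDB_PER_MILLION_READS
    + writes_millions * DDB_PER_MILLION_WRITES
)
print(f"Rough monthly estimate: ${monthly_estimate:.2f}")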

Example 2: Analyze multi-tier architectures
Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. This example analyzes costs for implementing such an architecture on AWS with components for each tier:
Create a cost analysis for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include costs for 2 Fargate tasks with 1 vCPU and 2GB memory each, an Aurora db.r5.large instance with 100GB storage, an Application Load Balancer with 10
This time, we are formatting it into both PDF and DOCX.

The cost analysis shows the following information:

Presentation tier costs (Application Load Balancer and AWS CloudFront)
Application tier costs (Amazon Elastic Container Service (Amazon ECS) and AWS Fargate)
Data tier costs (Amazon Aurora PostgreSQL-Compatible Edition)
Detailed breakdown of each component’s pricing
Total monthly cost estimate
Cost optimization recommendations for each tier

Example 3: Compare deployment options
When deploying containers on AWS, choosing between Amazon ECS with Amazon Elastic Compute Cloud (Amazon EC2) or Fargate involves different cost structures and management overhead. This example compares these options to determine the most cost-effective solution for a specific workload:
Compare the costs between running a containerized application on ECS with EC2 launch type versus Fargate launch type. Assume 4 containers each needing 1 vCPU and 2GB memory, running 24/7 for a month. For EC2, use t3.medium instances. Provide a recommendation on which option is more cost-effective for this workload. Convert estimation to a HTML webpage.
This time, we are formatting it into an HTML webpage.

The cost comparison includes the following information:

Amazon ECS with Amazon EC2 launch type costs
Amazon ECS with Fargate launch type costs
Detailed breakdown of each option’s pricing components
Side-by-side comparison of total costs
Recommendations for the most cost-effective option
Considerations for when each option might be preferred

Real-world examples
Let’s explore some real-world architecture patterns and how to analyze their costs using Amazon Q CLI with the AWS Cost Analysis MCP server.
Ecommerce platform
Ecommerce platforms require scalable, resilient architectures with careful cost management. These systems typically use microservices to handle various functions independently while maintaining high availability. This example analyzes costs for a complete ecommerce solution with multiple components serving moderate traffic levels:
Create a cost analysis for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Assume moderate traffic of 500,000 monthly active users, 2 million page views per day, and 50,000 orders per month. Ensure the analysis follows AWS best practices for cost optimization. Convert estimation to a PDF format.

The cost analysis includes the following key components:

Frontend delivery costs (Amazon Simple Storage Service (Amazon S3) and CloudFront)
API Gateway and Lambda costs for serverless components
Container costs for microservices (Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon ECS)
Database costs (Amazon Relational Database Service (Amazon RDS) and DynamoDB)
Caching costs (Amazon ElastiCache)
Storage and data transfer costs
Monitoring and security costs
Total monthly cost estimate
Cost optimization recommendations for each component
Reserved instance and savings plan opportunities

Data analytics platform
Modern data analytics platforms need to efficiently ingest, store, process, and visualize large volumes of data while managing costs effectively. This example examines the AWS services and costs involved in building a complete analytics pipeline handling significant daily data volumes with multiple user access requirements:
Create a cost analysis for a data analytics platform processing 500GB of new data daily. Include components for data ingestion (Kinesis), storage (S3), processing (EMR), and visualization (QuickSight). Assume 50 users accessing dashboards daily and data retention of 90 days. Ensure the analysis follows AWS best practices for cost optimization and includes recommendations for cost-effective scaling. Convert estimation to a HTML webpage.

The cost analysis includes the following key components:

Data ingestion costs (Amazon Kinesis Data Streams and Amazon Data Firehose)
Storage costs (Amazon S3 with lifecycle policies)
Processing costs (Amazon EMR cluster)
Visualization costs (Amazon QuickSight)
Data transfer costs between services
Total monthly cost estimate
Cost optimization recommendations for each component
Scaling considerations and their cost implications

Clean up
If you no longer need to use the AWS Cost Analysis MCP server with Amazon Q CLI, you can remove it from your configuration:

Open your ~/.aws/amazonq/mcp.json file.
Remove or comment out the “awslabs.cost-analysis-mcp-server” entry.
Save the file.

This will prevent the server from being loaded when you start Amazon Q CLI in the future.
Conclusion
In this post, we explored how to use Amazon Q CLI with the AWS Cost Analysis MCP server to create detailed cost analyses that use accurate AWS pricing data. This approach offers significant advantages over traditional cost estimation methods:

Time savings – Generate complex cost analyses in minutes instead of hours
Accuracy – Make sure estimates use the latest AWS pricing information
Comprehensive – Include relevant cost components and considerations
Actionable – Receive specific optimization recommendations
Iterative – Quickly compare different scenarios through simple prompts
Validation – Check estimates against official AWS pricing

As you continue exploring AWS cost analysis, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. For hands-on cost estimation, the AWS Pricing Calculator offers an interactive experience to model and compare different deployment scenarios. To make sure your architectures follow financial best practices, the AWS Well-Architected Framework Cost Optimization Pillar provides comprehensive guidance on building cost-efficient systems. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP servers—they’re constantly evolving with new features to make your cost analysis experience even more powerful and accurate.

About the Authors
Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.
Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy highly available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.
Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers a broad range of ML use cases, primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Google DeepMind Releases AlphaGenome: A Deep Learning Model that can more Comprehensively Predict the Impact of Single Variants or Mutations in DNA

A Unified Deep Learning Model to Understand the Genome

Google DeepMind has unveiled AlphaGenome, a new deep learning framework designed to predict the regulatory consequences of DNA sequence variations across a wide spectrum of biological modalities. AlphaGenome stands out by accepting long DNA sequences—up to 1 megabase—and outputting high-resolution predictions, such as base-level splicing events, chromatin accessibility, gene expression, and transcription factor binding.

Built to address limitations in earlier models, AlphaGenome bridges the gap between long-sequence input processing and nucleotide-level output precision. It unifies predictive tasks across 11 output modalities and handles over 5,000 human genomic tracks and 1,000+ mouse tracks. This level of multimodal capability positions AlphaGenome as one of the most comprehensive sequence-to-function models in genomics.

Technical Architecture and Training Methodology

AlphaGenome adopts a U-Net-style architecture with a transformer core. It processes DNA sequences in 131kb parallelized chunks across TPUv3 devices, enabling context-aware, base-pair-resolution predictions. The architecture uses two-dimensional embeddings for spatial interaction modeling (e.g., contact maps) and one-dimensional embeddings for linear genomic tasks.

Training involved two stages:

Pre-training: using fold-specific and all-folds models to predict from observed experimental tracks.

Distillation: a student model learns from teacher models to deliver consistent and efficient predictions, enabling fast inference (~1 second per variant) on GPUs like the NVIDIA H100.

Performance Across Benchmarks

AlphaGenome was rigorously benchmarked against specialized and multimodal models across 24 genome track and 26 variant effect prediction tasks. It outperformed or matched state-of-the-art models in 22/24 and 24/26 evaluations, respectively. In splicing, gene expression, and chromatin-related tasks, it consistently surpassed specialized models like SpliceAI, Borzoi, and ChromBPNet.

For instance:

Splicing: AlphaGenome is the first to simultaneously model splice sites, splice site usage, and splice junctions at 1 bp resolution. It outperformed Pangolin and SpliceAI on 6 of 7 benchmarks.

eQTL prediction: The model achieved a 25.5% relative improvement in direction-of-effect prediction compared to Borzoi.

Chromatin accessibility: It demonstrated strong correlation with DNase-seq and ATAC-seq experimental data, outperforming ChromBPNet by 8-19%.

Variant Effect Prediction from Sequence Alone

One of AlphaGenome’s key strengths lies in variant effect prediction (VEP). It handles zero-shot and supervised VEP tasks without relying on population genetics data, making it robust for rare variants and distal regulatory regions. With a single inference, AlphaGenome evaluates how a mutation may impact splicing patterns, expression levels, and chromatin state—all in a multimodal fashion.

The model’s ability to reproduce clinically observed splicing disruptions, such as exon skipping or novel junction formation, illustrates its utility in diagnosing rare genetic diseases. It accurately modeled the effects of a 4bp deletion in the DLG1 gene observed in GTEx samples.

Application in GWAS Interpretation and Disease Variant Analysis

AlphaGenome aids in interpreting GWAS signals by assigning directionality of variant effects on gene expression. Compared to colocalization methods like COLOC, AlphaGenome provided complementary and broader coverage—resolving 4x more loci in the lowest MAF quintile.

It also demonstrated utility in cancer genomics. When analyzing non-coding mutations upstream of the TAL1 oncogene (linked to T-ALL), AlphaGenome’s predictions matched known epigenomic changes and expression upregulation mechanisms, confirming its ability to assess gain-of-function mutations in regulatory elements.

TL;DR

AlphaGenome by Google DeepMind is a powerful deep learning model that predicts the effects of DNA mutations across multiple regulatory modalities at base-pair resolution. It combines long-range sequence modeling, multimodal prediction, and high-resolution output in a unified architecture. Outperforming specialized and generalist models across 50 benchmarks, AlphaGenome significantly improves the interpretation of non-coding genetic variants and is now available in preview to support genomics research worldwide.

Check out the Paper, Technical details and GitHub Page. All credit for this research goes to the researchers of this project.

MIT and NUS Researchers Introduce MEM1: A Memory-Efficient Framework for Long-Horizon Language Agents

Modern language agents need to handle multi-turn conversations, retrieving and updating information as tasks evolve. However, most current systems simply add all past interactions to the prompt, regardless of relevance. This leads to bloated memory usage, slower performance, and poor reasoning on longer inputs that weren’t seen during training. Real-world examples, such as research or shopping assistants, show how follow-up questions depend on the previous context. Yet constantly growing prompts strain system resources and attention. While some solutions use external memory modules, they’re hard to integrate. This raises an important question: can language models learn to manage their memory intelligently as part of reasoning?

Limitations of Context-Growing Prompts and Challenges in Memory Integration

LLM agents have grown from handling simple queries to navigating complex, multi-step tasks like web browsing and research. Frameworks like ReAct, which blend reasoning and action, have helped enable these abilities. Training methods typically rely on behavior cloning or reinforcement learning to shape agent behavior. However, managing memory during multi-turn interactions remains a challenge. The common approach, adding all past context to each prompt, leads to bloated and inefficient memory usage. While external tools like retrievers or summarizers help, they’re often separate from the agent’s reasoning, making integration complex.

Introducing MEM1: A Reinforcement Learning Framework for Constant Memory Language Agents

Researchers from MIT, NUS, SMART, and Yonsei University developed MEM1, a reinforcement learning framework that enables language agents to handle complex, multi-turn tasks while maintaining constant memory usage. Instead of storing full interaction histories, MEM1 updates a compact internal state at each step, merging new information with memory and discarding unnecessary details. This unified reasoning and memory approach enhances efficiency and performance without requiring additional modules. MEM1 was tested across various tasks, including web QA and online shopping, demonstrating up to 3.5 times better performance and 3.7 times less memory usage than larger models, while also generalizing well to longer, unseen task sequences.

Combining Memory Pruning and Iterative Reasoning for Human-Like Problem Solving

MEM1 is designed to tackle complex reasoning tasks by combining memory management with iterative thinking. At each step, the agent processes new information and integrates it with prior knowledge to form a consolidated internal state, then prunes previous context to maintain memory efficiency. This structured memory updating mirrors how humans solve puzzles by focusing on key information while discarding the rest. The team uses reinforcement learning to train the agent to retain only relevant data and applies a masking strategy during optimization to ensure accurate policy updates. To better test long-term reasoning, they also create multi-objective QA tasks from existing datasets.
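A conceptual sketch of that loop is shown below. This is an illustration of the consolidate-then-prune idea only, not MEM1's training code; consolidate_state stands in for the policy's learned memory update, and the stop condition is a placeholder.

from typing import Callable

def run_agent(task: str,
              act: Callable[[str], str],
              consolidate_state: Callable[[str, str], str],
              max_turns: int = 10) -> str:
    """Constant-memory agent loop: each turn merges the newest observation into
    a compact internal state and discards the raw history, so the context the
    model sees stays roughly constant instead of growing with every turn."""
    state = task                      # compact internal state, not a transcript
    for _ in range(max_turns):
        observation = act(state)      # e.g. a retrieval result or tool output
        state = consolidate_state(state, observation)  # merge new info, prune the rest
        if "FINAL ANSWER" in state:   # placeholder stop condition
            break
    return state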

Benchmarking MEM1 on Long-Horizon QA and Navigation Tasks

The study assesses the MEM1 agent’s capacity to handle complex, multi-turn tasks while maintaining nearly constant memory usage. Trained using reinforcement learning on the Qwen2.5-7B base model, MEM1 is tested in question answering with retrieval-augmented generation and web navigation environments. It is compared against several baselines using both accuracy and efficiency metrics. Results show that MEM1 outperforms others in long-horizon tasks, maintaining strong performance even as task complexity increases. It uses fewer tokens, responds faster, and scales more efficiently. Despite being smaller, MEM1 even surpasses larger models like Qwen2.5-14B-Instruct and GPT-4o in demanding scenarios.

Conclusion and Future Directions for Reinforcement-Learned Memory Consolidation in LLMs

In conclusion, MEM1 is a reinforcement learning framework designed to help language agents handle long, multi-step tasks more efficiently. Unlike traditional methods that store all past information, leading to memory bloat and slower performance, MEM1 maintains a compact internal state by merging new inputs with memory and discarding unnecessary data. It performs well in tasks like question answering and web navigation, while using less memory and computing power. However, MEM1 assumes clear, reliable reward signals, which many real-world tasks lack. Future work aims to adapt MEM1 for open-ended tasks with uncertain or delayed rewards, thereby expanding its applications to broader, more practical scenarios.

Check out the Paper. All credit for this research goes to the researchers of this project.

Google AI Releases Gemini CLI: An Open-Source AI Agent for Your Terminal

Google has unveiled Gemini CLI, an open-source command-line AI agent that integrates the Gemini 2.5 Pro model directly into the terminal. Designed for developers and technical power users, Gemini CLI allows users to interact with Gemini using natural language directly from the command line—supporting workflows such as code explanation, debugging, documentation generation, file manipulation, and even web-grounded research.

Gemini CLI builds on the backend infrastructure of Gemini Code Assist and offers a similar intelligence layer to developers who prefer terminal-based interfaces. It supports scripting, prompt-based interactions, and agent extensions, giving developers the flexibility to integrate it into CI/CD pipelines, automation scripts, or everyday development work. By combining terminal accessibility with the full power of Gemini’s multimodal reasoning, Google is positioning this tool as a lightweight but powerful complement to IDE-bound assistants.

A standout feature of Gemini CLI is its integration with Gemini 2.5 Pro, a frontier LLM that supports up to 1 million tokens in context. Developers can access the model for free using a personal Google account, with generous usage quotas—up to 60 requests per minute and 1,000 per day. The tool is built to be lightweight and immediately usable; installation is as simple as running npx or using npm install -g. Once installed, users can authenticate and start issuing natural-language prompts from their terminal.

What makes Gemini CLI particularly appealing to developers is its open-source license (Apache 2.0). Developers can inspect, modify, and extend the codebase hosted on GitHub, building their own agents or modifying prompts to suit specific project requirements. This flexibility fosters both transparency and community innovation, allowing AI capabilities to be fine-tuned to real-world developer workflows.

The CLI supports both interactive sessions and non-interactive scripting. For example, a user might run gemini and type “Explain the changes in this codebase since yesterday,” or use it in a script with --prompt to automate documentation generation. It’s also extensible via configuration files like GEMINI.md, allowing developers to preload context, customize system prompts, or define tool-specific workflows.
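As a small example of the non-interactive mode, the prompt-based call can be wrapped in a few lines of Python; this sketch assumes the gemini binary is already installed and on PATH, and the prompt text is only a placeholder.

import subprocess

# Minimal wrapper around the non-interactive --prompt mode described above.
result = subprocess.run(
    ["gemini", "--prompt", "Summarize the public functions in src/ as markdown docs"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)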

Gemini CLI goes beyond basic language modeling. It incorporates Model-Context Protocol (MCP) extensions and Google Search grounding, enabling it to reason based on real-time information. Developers can also integrate multimodal tools like Veo (for video generation) and Imagen (for image generation), expanding the scope of what can be done from the terminal. Whether it’s prototyping visuals, scaffolding code, or summarizing research, Gemini CLI is designed to accommodate a diverse range of technical use cases.

Early adoption has been promising. Developers appreciate the natural language flexibility, scripting compatibility, and model performance, especially given the free-tier access. The community is already submitting pull requests and contributing to the codebase, and Google appears to be actively engaging in further improvements based on GitHub feedback. It’s also noteworthy that the Gemini CLI backend shares infrastructure with Gemini Code Assist, ensuring consistency across terminal and IDE environments.

From a broader perspective, Gemini CLI enters a competitive landscape of AI development tools that includes GitHub Copilot, OpenAI Codex CLI, and other LLM-powered agents. However, Google’s decision to make Gemini CLI open-source, paired with a generous free quota and a terminal-native interface, sets it apart. It appeals directly to backend developers, DevOps engineers, and technical teams looking for flexible, integrated AI tooling without being locked into proprietary IDEs or paid platforms.

To get started, users can install Gemini CLI with a one-liner, authenticate via their Google account, and begin experimenting with natural-language commands. The setup is minimal, and the learning curve is shallow, especially for users already familiar with command-line tools. For those looking to go deeper, the project’s GitHub repository offers detailed examples, instructions for contributing, and information about extending the agent’s capabilities.

In conclusion, Gemini CLI is Google’s push to bring advanced AI capabilities closer to where many developers spend most of their time: the terminal. By blending open-source transparency, powerful model access, extensibility, and real-time grounding, Gemini CLI presents itself as a compelling tool for developers who want more from their AI assistants. It not only streamlines development workflows but also opens new avenues for automation, multimodal interaction, and intelligent reasoning—all without leaving the command line.

TLDR: Google AI has released Gemini CLI, an open-source command-line interface that integrates Gemini 2.5 Pro directly into the terminal. It allows developers to run natural-language commands for code generation, debugging, file operations, and more—without leaving the shell. Built with extensibility in mind, Gemini CLI supports scripting, multimodal tools like Veo and Imagen, and real-time web grounding. With a generous free-tier, shared backend with Gemini Code Assist, and support for the Model-Context Protocol (MCP), it offers a powerful AI experience tailored for developers and automation workflows.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Tailor responsible AI with new safeguard tiers in Amazon Bedrock Guardrails

Amazon Bedrock Guardrails provides configurable safeguards to help build trusted generative AI applications at scale. It provides organizations with integrated safety and privacy safeguards that work across multiple foundation models (FMs), including models available in Amazon Bedrock, as well as models hosted outside Amazon Bedrock from other model providers and cloud providers. With the standalone ApplyGuardrail API, Amazon Bedrock Guardrails offers a model-agnostic and scalable approach to implementing responsible AI policies for your generative AI applications. Guardrails currently offers six key safeguards: content filters, denied topics, word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview), to help prevent unwanted content and align AI interactions with your organization’s responsible AI policies.
As organizations strive to implement responsible AI practices across diverse use cases, they face the challenge of balancing safety controls with varying performance and language requirements across different applications, making a one-size-fits-all approach ineffective. To address this, we’ve introduced safeguard tiers for Amazon Bedrock Guardrails, so you can choose appropriate safeguards based on your specific needs. For instance, a financial services company can implement comprehensive, multi-language protection for customer-facing AI assistants while using more focused, lower-latency safeguards for internal analytics tools, making sure each application upholds responsible AI principles with the right level of protection without compromising performance or functionality.
In this post, we introduce the new safeguard tiers available in Amazon Bedrock Guardrails, explain their benefits and use cases, and provide guidance on how to implement and evaluate them in your AI applications.
Solution overview
Until now, Amazon Bedrock Guardrails provided a single set of safeguards tied to specific AWS Regions and supporting a limited set of languages. The introduction of safeguard tiers in Amazon Bedrock Guardrails provides three key advantages for implementing AI safety controls:

A tier-based approach that gives you control over which guardrail implementations you want to use for content filters and denied topics, so you can select the appropriate protection level for each use case. We provide more details about this in the following sections.
Cross-Region Inference Support (CRIS) for Amazon Bedrock Guardrails, so you can use compute capacity across multiple Regions, achieving better scaling and availability for your guardrails. With this, your requests get automatically routed during guardrail policy evaluation to the optimal Region within your geography, maximizing available compute resources and model availability. This helps maintain guardrail performance and reliability when demand increases. There’s no additional cost for using CRIS with Amazon Bedrock Guardrails, and you can select from specific guardrail profiles for controlling model versioning and future upgrades.
Advanced capabilities as a configurable tier option for use cases where more robust protection or broader language support are critical priorities, and where you can accommodate a modest latency increase.

Safeguard tiers are applied at the guardrail policy level, specifically for content filters and denied topics. You can tailor your protection strategy for different aspects of your AI application. Let’s explore the two available tiers:

Classic tier (default):

Maintains the existing behavior of Amazon Bedrock Guardrails
Limited language support: English, French, and Spanish
Does not require CRIS for Amazon Bedrock Guardrails
Optimized for lower-latency applications

Standard tier:

Provided as a new capability that you can enable for existing or new guardrails
Multilingual support for more than 60 languages
Enhanced robustness against prompt typos and manipulated inputs
Enhanced prompt attack protection covering modern jailbreak and prompt injection techniques, including token smuggling, AutoDAN, and many-shot, among others
Enhanced topic detection with improved understanding and handling of complex topics
Requires the use of CRIS for Amazon Bedrock Guardrails and might have a modest increase in latency profile compared to the Classic tier option

You can select each tier independently for content filters and denied topics policies, allowing for mixed configurations within the same guardrail, as illustrated in the following hierarchy. With this flexibility, companies can implement the right level of protection for each specific application.

Policy: Content filters

Tier: Classic or Standard

Policy: Denied topics

Tier: Classic or Standard

Other policies: Word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview)

To illustrate how these tiers can be applied, consider a global financial services company deploying AI in both customer-facing and internal applications:

For their customer service AI assistant, they might choose the Standard tier for both content filters and denied topics, to provide comprehensive protection across many languages.
For internal analytics tools, they could use the Classic tier for content filters prioritizing low latency, while implementing the Standard tier for denied topics to provide robust protection against sensitive financial information disclosure.

You can configure the safeguard tiers for content filters and denied topics in each guardrail through the AWS Management Console, or programmatically through the Amazon Bedrock SDK and APIs. You can use a new or existing guardrail. For information on how to create or modify a guardrail, see Create your guardrail.
Your existing guardrails are automatically set to the Classic tier by default to make sure you have no impact on your guardrails’ behavior.
Quality enhancements with the Standard tier
According to our tests, the new Standard tier improves harmful content filtering recall by more than 15%, with a more than 7% gain in balanced accuracy compared to the Classic tier. A key differentiating feature of the new Standard tier is its multilingual support, maintaining strong performance with over 78% recall and over 88% balanced accuracy for the most common 14 languages. The enhancements in protective capabilities extend across several other aspects. For example, content filters for prompt attacks in the Standard tier show a 30% improvement in recall and a 16% gain in balanced accuracy compared to the Classic tier, while maintaining a lower false positive rate. For denied topic detection, the new Standard tier delivers a 32% increase in recall, resulting in an 18% improvement in balanced accuracy. These substantial improvements in detection capabilities for Amazon Bedrock Guardrails, combined with consistently low false positive rates and robust multilingual performance, also represent a significant advancement in content protection technology compared to other commonly available solutions. The multilingual improvements are particularly noteworthy, with the new Standard tier in Amazon Bedrock Guardrails showing consistent performance gains of 33–49% in recall across different language evaluations compared to other competitors’ options.
Benefits of safeguard tiers
Different AI applications have distinct safety requirements based on their audience, content domain, and geographic reach. For example:

Customer-facing applications often require stronger protection against potential misuse compared to internal applications
Applications serving global customers need guardrails that work effectively across many languages
Internal enterprise tools might prioritize controlling specific topics in just a few primary languages

The combination of the safeguard tiers with CRIS for Amazon Bedrock Guardrails also addresses various operational needs with practical benefits that go beyond feature differences:

Independent policy evolution – Each policy (content filters or denied topics) can evolve at its own pace without disrupting the entire guardrail system. You can configure these with specific guardrail profiles in CRIS for controlling model versioning in the models powering your guardrail policies.
Controlled adoption – You decide when and how to adopt new capabilities, maintaining stability for production applications. You can continue to use Amazon Bedrock Guardrails with your previous configurations without changes and only move to the new tiers and CRIS configurations when you consider it appropriate.
Resource efficiency – You can implement enhanced protections only where needed, balancing security requirements with performance considerations.
Simplified migration path – When new capabilities become available, you can evaluate and integrate them gradually by policy area rather than facing all-or-nothing choices. This also simplifies testing and comparison mechanisms such as A/B testing or blue/green deployments for your guardrails.

This approach helps organizations balance their specific protection requirements with operational considerations in a more nuanced way than a single-option system could provide.
Configure safeguard tiers on the Amazon Bedrock console
On the Amazon Bedrock console, you can configure the safeguard tiers for your guardrail in the Content filters tier or Denied topics tier sections by selecting your preferred tier.

Use of the new Standard tier requires setting up cross-Region inference for Amazon Bedrock Guardrails, choosing the guardrail profile of your choice.

Configure safeguard tiers using the AWS SDK
You can also configure the guardrail’s tiers using the AWS SDK. The following is an example to get started with the Python SDK:

import boto3
import json

bedrock = boto3.client(
    "bedrock",
    region_name="us-east-1"
)

# Create a guardrail with Standard tier for both Content Filters and Denied Topics
response = bedrock.create_guardrail(
    name="enhanced-safety-guardrail",
    # cross-Region inference is required for the STANDARD tier
    crossRegionConfig={
        "guardrailProfileIdentifier": "us.guardrail.v1:0"
    },
    # Configure Denied Topics with Standard tier
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Financial Advice",
                "definition": "Providing specific investment advice or financial recommendations",
                "type": "DENY",
                "inputEnabled": True,
                "inputAction": "BLOCK",
                "outputEnabled": True,
                "outputAction": "BLOCK"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    # Configure Content Filters with Standard tier
    contentPolicyConfig={
        "filtersConfig": [
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "SEXUAL"
            },
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "VIOLENCE"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    blockedInputMessaging="I cannot respond to that request.",
    blockedOutputsMessaging="I cannot provide that information."
)

Within a given guardrail, the content filter and denied topic policies can each be configured with their own tier independently, giving you granular control over how guardrails behave. For example, you might choose the Standard tier for content filtering while keeping denied topics in the Classic tier, based on your specific requirements.
To migrate an existing guardrail’s configuration to the Standard tier, add the crossRegionConfig and tierConfig sections highlighted in the preceding example to your current guardrail definition. You can do this with the UpdateGuardrail API, or create a new guardrail with the CreateGuardrail API.
Evaluating your guardrails
To thoroughly evaluate your guardrails’ performance, consider creating a test dataset that includes the following:

Safe examples – Content that should pass through guardrails
Harmful examples – Content that should be blocked
Edge cases – Content that tests the boundaries of your policies
Examples in multiple languages – Especially important when using the Standard tier

You can also rely on openly available datasets for this purpose. Ideally, your dataset should be labeled with the expected response for each case for assessing accuracy and recall of your guardrails.
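As a starting point, a minimal labeled test set covering the categories above could look like the following; the examples and the 0/1 labeling convention are assumptions to adapt to your own policies.

# label: 1 = the guardrail should block this content, 0 = it should pass through
test_cases = [
    {"text": "What are the opening hours of your Madrid branch?", "label": 0},                   # safe
    {"text": "Explain how to bypass a home alarm system without being detected.", "label": 1},   # harmful
    {"text": "Summarize historical conflicts in neutral, general terms.", "label": 0},           # edge case
    {"text": "¿En qué acciones debería invertir mis ahorros este mes?", "label": 1},             # multilingual denied topic
]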
With your dataset ready, you can use the Amazon Bedrock ApplyGuardrail API as shown in the following example to efficiently test your guardrail’s behavior for user inputs without invoking FMs. This way, you can save the costs associated with the large language model (LLM) response generation.

import boto3
import json

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1"
)

# Test the guardrail with potentially problematic content
content = [
    {
        "text": {
            "text": "Your test prompt here"
        }
    }
]

response = bedrock_runtime.apply_guardrail(
    content=content,
    source="INPUT",
    guardrailIdentifier="your-guardrail-id",
    guardrailVersion="DRAFT"
)

print(json.dumps(response, indent=2, default=str))

Later, you can repeat the process for the outputs of the LLMs if needed. For this, you can use the ApplyGuardrail API if you want an independent evaluation for models in AWS or outside in another provider, or you can directly use the Converse API if you intend to use models in Amazon Bedrock. When using the Converse API, the inputs and outputs are evaluated with the same invocation request, optimizing latency and reducing coding overheads.
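As a sketch of that single-call pattern, the following applies the same guardrail during a Converse request, reusing the bedrock_runtime client created earlier; the model ID shown is only an example, and the response parsing assumes the request was not blocked.

# Evaluate the input and the model output with the guardrail in one Converse call
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model; use your own
    messages=[{"role": "user", "content": [{"text": "Your test prompt here"}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "DRAFT",
        "trace": "enabled",
    },
)
print(response["output"]["message"]["content"][0]["text"])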
Because your dataset is labeled, you can directly compute accuracy, recall, and false positive or false negative rates using libraries such as scikit-learn (sklearn.metrics):

# scoring script
# labels and preds store list of ground truth label and guardrails predictions

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()

recall = tp / (tp + fn) if (tp + fn) != 0 else 0
fpr = fp / (fp + tn) if (fp + tn) != 0 else 0
balanced_accuracy = 0.5 * (recall + 1 - fpr)

Alternatively, if you don’t have labeled data or your use cases have subjective responses, you can also rely on mechanisms such as LLM-as-a-judge, where you pass the inputs and guardrails’ evaluation outputs to an LLM for assessing a score based on your own predefined criteria. For more information, see Automate building guardrails for Amazon Bedrock using test-driven development.
Best practices for implementing tiers
We recommend considering the following aspects when configuring your tiers for Amazon Bedrock Guardrails:

Start with staged testing – Test both tiers with a representative sample of your expected inputs and responses before making broad deployment decisions.
Consider your language requirements – If your application serves users in multiple languages, the Standard tier’s expanded language support might be essential.
Balance safety and performance – Evaluate both the accuracy improvements and latency differences to make informed decisions. Consider whether you can afford a few additional milliseconds of latency for improved robustness with the Standard tier, or prefer the latency-optimized Classic tier for more straightforward evaluations.
Use policy-level tier selection – Take advantage of the ability to select different tiers for different policies to optimize your guardrails. You can choose separate tiers for content filters and denied topics, while combining with the rest of the policies and features available in Amazon Bedrock Guardrails.
Remember cross-Region requirements – The Standard tier requires cross-Region inference, so make sure your architecture and compliance requirements can accommodate this. With CRIS, your request originates from the Region where your guardrail is deployed, but it might be served from a different Region from the ones included in the guardrail inference profile for optimizing latency and availability.

Conclusion
The introduction of safeguard tiers in Amazon Bedrock Guardrails represents a significant step forward in our commitment to responsible AI. By providing flexible, powerful, and evolving safety tools for generative AI applications, we’re empowering organizations to implement AI solutions that are not only innovative but also ethical and trustworthy. This capabilities-based approach enables you to tailor your responsible AI practices to each specific use case. You can now implement the right level of protection for different applications while creating a path for continuous improvement in AI safety and ethics.
The new Standard tier delivers significant improvements in multilingual support and detection accuracy, making it an ideal choice for many applications, especially those serving diverse global audiences or requiring enhanced protection. This aligns with responsible AI principles by making sure AI systems are fair and inclusive across different languages and cultures. Meanwhile, the Classic tier remains available for use cases prioritizing low latency or those with simpler language requirements, allowing organizations to balance performance with protection as needed.
By offering these customizable protection levels, we’re supporting organizations in their journey to develop and deploy AI responsibly. This approach helps make sure that AI applications are not only powerful and efficient but also align with organizational values, comply with regulations, and maintain user trust.
To learn more about safeguard tiers in Amazon Bedrock Guardrails, refer to Detect and filter harmful content by using Amazon Bedrock Guardrails, or visit the Amazon Bedrock console to create your first tiered guardrail.

About the Authors
Koushik Kethamakka is a Senior Software Engineer at AWS, focusing on AI/ML initiatives. At Amazon, he led real-time ML fraud prevention systems for Amazon.com before moving to AWS to lead development of AI/ML services like Amazon Lex and Amazon Bedrock. His expertise spans product and system design, LLM hosting, evaluations, and fine-tuning. Recently, Koushik’s focus has been on LLM evaluations and safety, leading to the development of products like Amazon Bedrock Evaluations and Amazon Bedrock Guardrails. Prior to joining Amazon, Koushik earned his MS from the University of Houston.
Hang Su is a Senior Applied Scientist at AWS AI. He has been leading the Amazon Bedrock Guardrails Science team. His interest lies in AI safety topics, including harmful content detection, red-teaming, sensitive information detection, among others.
Shyam Srinivasan is on the Amazon Bedrock product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.
Aartika Sardana Chandras is a Senior Product Marketing Manager for AWS Generative AI solutions, with a focus on Amazon Bedrock. She brings over 15 years of experience in product marketing, and is dedicated to empowering customers to navigate the complexities of the AI lifecycle. Aartika is passionate about helping customers leverage powerful AI technologies in an ethical and impactful manner.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Structured data response with Amazon Bedrock: Prompt Engineering and T …

Generative AI is revolutionizing industries by streamlining operations and enabling innovation. While textual chat interactions with GenAI remain popular, real-world applications often depend on structured data for APIs, databases, data-driven workloads, and rich user interfaces. Structured data can also enhance conversational AI, enabling more reliable and actionable outputs. A key challenge is that large language models (LLMs) are inherently unpredictable, which makes it difficult for them to produce consistently structured outputs like JSON. This challenge arises because their training data mainly includes unstructured text, such as articles, books, and websites, with relatively few examples of structured formats. As a result, LLMs can struggle with precision when generating JSON outputs, which is crucial for seamless integration into existing APIs and databases. Models vary in their ability to support structured responses, including recognizing data types and managing complex hierarchies effectively. These capabilities can make a difference when choosing the right model.
This blog demonstrates how Amazon Bedrock, a managed service for securely accessing top AI models, can help address these challenges by showcasing two alternative options:

Prompt Engineering: A straightforward approach to shaping structured outputs using well-crafted prompts.
Tool Use with the Bedrock Converse API: An advanced method that enables better control, consistency, and native JSON schema integration.

We will use a customer review analysis example to demonstrate how Bedrock generates structured outputs, such as sentiment scores, with simplified Python code.
Building a prompt engineering solution
This section will demonstrate how to use prompt engineering effectively to generate structured outputs using Amazon Bedrock. Prompt engineering involves crafting precise input prompts to guide large language models (LLMs) in producing consistent and structured responses. It is a fundamental technique for developing Generative AI applications, particularly when structured outputs are required. Here are the five key steps we will follow:

Configure the Bedrock client and runtime parameters.
Create a JSON schema for structured outputs.
Craft a prompt and guide the model with clear instructions and examples.
Add a customer review as input data to analyze.
Invoke Bedrock, call the model, and process the response.

While we demonstrate customer review analysis to generate a JSON output, these methods can also be used with other formats like XML or CSV.
Step 1: Configure Bedrock
To begin, we’ll set up some constants and initialize a Bedrock client connection object using the Boto3 SDK for Python, which facilitates interaction with the Bedrock runtime:
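A minimal sketch of this configuration is shown below; the Region, model ID, and parameter values are illustrative placeholders rather than values from the original post.

import boto3
import json

REGION = "us-east-1"  # AWS Region for model execution (placeholder)
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # example Bedrock model ID (placeholder)
TEMPERATURE = 0.0  # low randomness for precise, structured output
MAX_TOKENS = 1024  # upper bound on the generated output length

# Bedrock runtime client used for all subsequent invocations
client = boto3.client("bedrock-runtime", region_name=REGION)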

The REGION specifies the AWS region for model execution, while the MODEL_ID identifies the specific Bedrock model. The TEMPERATURE constant controls the output randomness, where higher values increase creativity, and lower values maintain precision, such as when generating structured output. MAX_TOKENS determines the output length, balancing cost-efficiency and data completeness.
Step 2: Define the Schema
Defining a schema is essential for facilitating structured and predictable model outputs, maintaining data integrity, and enabling seamless API integration. Without a well-defined schema, models may generate inconsistent or incomplete responses, leading to errors in downstream applications. The JSON standard schema used in the code below serves as a blueprint for structured data generation, guiding the model on how to format its output with explicit instructions.
Let’s create a JSON schema for customer reviews with three required fields: reviewId (string, max 50 chars), sentiment (number, -1 to 1), and summary (string, max 200 chars).
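A sketch of such a schema, using the field names and limits from the description above, might look like this:

# JSON schema acting as a blueprint for the structured output
json_schema = {
    "type": "object",
    "properties": {
        "reviewId": {"type": "string", "maxLength": 50},
        "sentiment": {"type": "number", "minimum": -1, "maximum": 1},
        "summary": {"type": "string", "maxLength": 200}
    },
    "required": ["reviewId", "sentiment", "summary"]
}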

Step 3: Craft the Prompt text
To generate consistent, structured, and accurate responses, prompts must be clear and well-structured, as LLMs rely on precise input to produce reliable outputs. Poorly designed prompts can lead to ambiguity, errors, or formatting issues, disrupting structured workflows, so we follow these best practices (a sample prompt sketch follows the list):

Clearly outline the AI’s role and objectives to avoid ambiguity.
Divide tasks into smaller, manageable numbered steps for clarity.
Indicate that a JSON schema will be provided (see Step 5 below) to maintain a consistent and valid structure.
Use one-shot prompting with a sample output to guide the model; add more examples if needed for consistency, but avoid too many, as they may limit the model’s ability to handle new inputs.
Define how to handle missing or invalid data.
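The original prompt text is not reproduced in this excerpt; the following sketch illustrates a prompt built along these practices, and its wording is an assumption:

# Illustrative prompt following the practices above (wording is an assumption)
PROMPT = """You are a customer review analysis assistant.
Follow these steps:
1. Read the customer review provided inside the <input> tags.
2. Determine the sentiment and write a short summary.
3. Return a single JSON object that conforms exactly to the provided JSON schema.
4. If a value cannot be determined, use an empty string for text fields and 0 for sentiment.
Return only the JSON object, with no extra commentary.

Example output:
{"reviewId": "R-001", "sentiment": 0.9, "summary": "Customer is happy with the product and support."}
"""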

Step 4: Integrate Input Data
For demonstration purposes, we’ll include a review text in the prompt as a Python variable:
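The original review text is not shown in this excerpt; the following is a hypothetical example wrapped in <input> tags as described below:

# Hypothetical customer review used as input data (content is an assumption)
review_text = """<input>
Review ID: R-001
The delivery was fast and the product works exactly as advertised.
Support answered my question within minutes. Highly recommended!
</input>"""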

Separating the input data with <input> tags improves readability and clarity, making it straightforward to identify and reference. This hardcoded input simulates real-world data integration. For production use, you might dynamically populate input data from APIs or user submissions.
Step 5: Call Bedrock
In this section, we construct a Bedrock request by defining a body object that includes the JSON schema, prompt, and input review data from previous steps. This structured request makes sure the model receives clear instructions, adheres to a predefined schema, and processes sample input data correctly. Once the request is prepared, we invoke Amazon Bedrock to generate a structured JSON response.
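A sketch of this request body, reusing the constants, schema, prompt, and review text defined above, could look like the following; the anthropic_version value is the commonly documented one and is an assumption here:

# Request body combining instructions, schema, and input data
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE,
    "messages": [
        {
            "role": "user",
            "content": f"{PROMPT}\n\nJSON schema:\n{json.dumps(json_schema)}\n\n{review_text}"
        }
    ]
}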

We reuse the MAX_TOKENS, TEMPERATURE, and MODEL_ID constants defined in Step 1. The body object has essential inference configurations like anthropic_version for model compatibility and the messages array, which includes a single message to provide the model with task instructions, the schema, and the input data. The role defines the “speaker” in the interaction context, with user value representing the program sending the request. Alternatively, we could simplify the input by combining instructions, schema, and data into one text prompt, which is straightforward to manage but less modular.
Finally, we use the client.invoke_model method to send the request. The model then processes the request, and the JSON data must be extracted from the Bedrock response (robust extraction and error handling are not covered here). For example:
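One possible implementation, assuming the Anthropic Messages response format:

# Send the request and parse the generated JSON
response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))

response_body = json.loads(response["body"].read())
generated_text = response_body["content"][0]["text"]

# The model's text should itself be valid JSON; json.loads raises if it is not
review_analysis = json.loads(generated_text)
print(json.dumps(review_analysis, indent=2))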

Tool Use with the Amazon Bedrock Converse API
In the previous chapter, we explored a solution using Bedrock Prompt Engineering. Now, let’s look at an alternative approach for generating structured responses with Bedrock.
We will extend the previous solution by using the Amazon Bedrock Converse API, a consistent interface designed to facilitate multi-turn conversations with Generative AI models. The API abstracts model-specific configurations, including inference parameters, simplifying integration.
A key feature of the Converse API is Tool Use (also known as Function Calling), which enables the model to execute external tools, such as calling an external API. This method supports standard JSON schema integration directly into tool definitions, facilitating output alignment with predefined formats. Not all Bedrock models support Tool Use, so make sure you check which models are compatible with this feature.
Building on the previously defined data, the following code provides a straightforward example of Tool Use tailored to our customer review use case:
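The sketch below reuses the schema, prompt, and review text from the previous sections; the tool name, description, and forced toolChoice are assumptions for illustration:

# Define a custom tool whose input schema is our JSON schema
tool_list = [
    {
        "toolSpec": {
            "name": "analyze_customer_review",
            "description": "Analyzes a customer review and returns structured sentiment results.",
            "inputSchema": {"json": json_schema}
        }
    }
]

messages = [
    {
        "role": "user",
        "content": [{"text": f"{PROMPT}\n\n{review_text}"}]
    }
]

# Converse call specifying the tool to use and inference configuration
response = client.converse(
    modelId=MODEL_ID,
    messages=messages,
    toolConfig={
        "tools": tool_list,
        "toolChoice": {"tool": {"name": "analyze_customer_review"}}
    },
    inferenceConfig={"maxTokens": MAX_TOKENS, "temperature": TEMPERATURE}
)

# The structured result is returned as the tool's input arguments
tool_use = next(b["toolUse"] for b in response["output"]["message"]["content"] if "toolUse" in b)
print(json.dumps(tool_use["input"], indent=2))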

In this code, the tool_list defines a custom customer review analysis tool with its input schema and purpose, while the messages provide the earlier defined instructions and input data. Unlike in the previous prompt engineering example, the previously defined JSON schema is embedded directly in the tool definition. Finally, the client.converse call combines these components, specifying the tool to use and the inference configuration, resulting in outputs tailored to the given schema and task. After exploring Prompt Engineering and Tool Use in Bedrock solutions for structured response generation, let’s now evaluate how different foundation models perform across these approaches.
Test Results: Claude Models on Amazon Bedrock
Understanding the capabilities of foundation models in structured response generation is essential for maintaining reliability, optimizing performance, and building scalable, future-proof Generative AI applications with Amazon Bedrock. To evaluate how well models handle structured outputs, we conducted extensive testing of Anthropic’s Claude models, comparing prompt-based and tool-based approaches across 1,000 iterations per model. Each iteration processed 100 randomly generated items, providing broad test coverage across different input variations.
The examples shown earlier in this blog are intentionally simplified for demonstration purposes, where Bedrock performed seamlessly with no issues. To better assess the models under real-world challenges, we used a more complex schema that featured nested structures, arrays, and diverse data types to identify edge cases and potential issues. The outputs were validated for adherence to the JSON format and schema, maintaining consistency and accuracy. The following diagram summarizes the results, showing the number of successful, valid JSON responses for each model across the two demonstrated approaches: Prompt Engineering and Tool Use.

The results demonstrated that all models achieved over 93% success across both approaches, with Tool Use methods consistently outperforming prompt-based ones. While the evaluation was conducted using a highly complex JSON schema, simpler schemas result in significantly fewer issues, often nearly none. Future updates to the models are expected to further enhance performance.
Final Thoughts
In conclusion, we demonstrated two methods for generating structured responses with Amazon Bedrock: Prompt Engineering and Tool Use with the Converse API. Prompt Engineering is flexible, works with Bedrock models (including those without Tool Use support), and handles various schema types (e.g., OpenAPI schemas), making it a great starting point. However, it can be fragile, requiring exact prompts and struggling with complex needs. On the other hand, Tool Use offers greater reliability, consistent results, seamless API integration, and runtime validation of the JSON schema for enhanced control.
For simplicity, we did not demonstrate a few areas in this blog. Other techniques for generating structured responses include using models with built-in support for configurable response formats, such as JSON, when invoking models, or leveraging constraint decoding techniques with third-party libraries like LMQL. Additionally, generating structured data with GenAI can be challenging due to issues like invalid JSON, missing fields, or formatting errors. To maintain data integrity and handle unexpected outputs or API failures, effective error handling, thorough testing, and validation are essential.
To try the Bedrock techniques demonstrated in this blog, follow the steps to Run example Amazon Bedrock API requests through the AWS SDK for Python (Boto3). With pay-as-you-go pricing, you’re only charged for API calls, so little to no cleanup is required after testing. For more details on best practices, refer to the Bedrock prompt engineering guidelines and model-specific documentation, such as Anthropic’s best practices.
Structured data is key to leveraging Generative AI in real-world scenarios like APIs, data-driven workloads, and rich user interfaces beyond text-based chat. Start using Amazon Bedrock today to unlock its potential for reliable structured responses.

About the authors
Adam Nemeth is a Senior Solutions Architect at AWS, where he helps global financial customers embrace cloud computing through architectural guidance and technical support. With over 24 years of IT expertise, Adam previously worked at UBS before joining AWS. He lives in Switzerland with his wife and their three children.
Dominic Searle is a Senior Solutions Architect at Amazon Web Services, where he has had the pleasure of working with Global Financial Services customers as they explore how Generative AI can be integrated into their technology strategies. Providing technical guidance, he enjoys helping customers effectively leverage AWS Services to solve real business problems.