BARE: A Synthetic Data Generation AI Method that Combines the Diversity of Base Models with the Quality of Instruct-Tuned Models

As the need for high-quality training data grows, synthetic data generation has become essential for improving LLM performance. Instruction-tuned models are commonly used for this task, but they often struggle to produce the diverse outputs that model generalization requires. Prompting techniques that encourage variation, such as conditioning on past outputs or assuming different personas, deliver only limited gains in diversity. In contrast, base models, which lack post-training biases, generate more diverse responses but tend to be lower in quality. Studies show that base models produce outputs with lower pairwise cosine similarity, indicating greater diversity, while instruct-tuned models risk mode collapse.
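
As a rough illustration of the diversity measurement mentioned above, the sketch below computes the mean pairwise cosine similarity of a set of generated outputs from their embeddings; lower values indicate a more diverse set. The embedding function and the sample usage are placeholders for whatever embedding model you use, not code from the paper.

import numpy as np
from numpy.linalg import norm

def mean_pairwise_cosine_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of embeddings (lower = more diverse)."""
    # Normalize rows so dot products become cosine similarities.
    normalized = embeddings / norm(embeddings, axis=1, keepdims=True)
    sims = normalized @ normalized.T
    # Average the upper triangle, excluding the diagonal (self-similarity).
    upper = sims[np.triu_indices(len(embeddings), k=1)]
    return float(upper.mean())

# Hypothetical usage, where `embed` is any sentence-embedding function you supply:
# embeddings = np.stack([embed(text) for text in generated_samples])
# print(mean_pairwise_cosine_similarity(embeddings))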

Synthetic data is widely used to train state-of-the-art models for reasoning, coding, and problem-solving tasks. Still, its overuse can lead to issues such as iterative degradation, where models generate increasingly homogenized outputs. Existing approaches to enhancing diversity, such as temperature scaling, nucleus sampling, and multi-stage generation, offer partial solutions but often require significant manual effort. While downstream performance is the standard metric for evaluating synthetic data, embedding-based measures like BERTScore provide better insight into semantic diversity. Additionally, assessing the quality of individual synthetic samples remains a challenge, necessitating more robust evaluation frameworks.

Researchers from UC Berkeley, Stanford, Foundry, Microsoft Research, and Princeton propose a synthetic data generation method that integrates base and instruct-tuned models to balance diversity and quality. Their approach, Base-Refine (BARE), follows a two-stage process where base model outputs are refined using instruct-tuned models, enhancing dataset quality while preserving diversity. Fine-tuning with just 1,000 BARE-generated samples achieves performance comparable to top models on LiveCodeBench and improves GSM8K accuracy by 101% over instruct-only data. BARE also boosts RAFT-based fine-tuning by 18.4%, demonstrating its effectiveness in generating high-quality, diverse data for various machine-learning tasks.

BARE is a synthetic data generation method that enhances dataset quality by refining diverse base model outputs with instruct-tuned models. The process begins with a base model generating an initial dataset with minimal few-shot examples. Then, an instruct-tuned model improves each sample by correcting errors and enhancing clarity while preserving diversity. This two-stage approach ensures high-quality yet varied data, making BARE particularly effective in data-scarce domains. With only three few-shot examples and general prompts, BARE minimizes human effort while maximizing flexibility. Experimental results show its potential to generate more accurate and diverse synthetic datasets for machine learning tasks.
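
The two-stage flow can be pictured with a short sketch. The base_generate and instruct_refine callables below stand in for whatever inference stack you use (the paper pairs a base model with its instruct-tuned counterpart); this is an illustrative outline under those assumptions, not the authors' implementation.

from typing import Callable, List

def bare_generate(
    base_generate: Callable[[str], str],    # samples a draft from a base model
    instruct_refine: Callable[[str], str],  # refines a draft with an instruct-tuned model
    few_shot_prompt: str,
    n_samples: int,
) -> List[str]:
    """Two-stage Base-Refine: diverse base-model drafts, then instruct-model refinement."""
    refined = []
    for _ in range(n_samples):
        # Stage 1: the base model continues a minimal few-shot prompt,
        # producing a diverse but possibly noisy draft.
        draft = base_generate(few_shot_prompt)
        # Stage 2: the instruct-tuned model fixes errors and improves clarity
        # while keeping the draft's content, preserving diversity.
        refine_prompt = (
            "Improve the following synthetic training example. "
            "Fix errors and unclear wording but keep its topic and style:\n\n" + draft
        )
        refined.append(instruct_refine(refine_prompt))
    return refined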

The evaluation of BARE focuses on diversity, data quality, and downstream performance across the same domains and baselines discussed earlier. Using Llama-3.1-70B-Base for initial generation and Llama-3.1-70B-Instruct for refinement, BARE maintains data diversity while improving generation quality. Fine-tuning experiments show that BARE-generated data outperforms data generated by base or instruct models alone, enhancing model accuracy across multiple datasets. Notably, refining with GPT-4o further boosts performance. Ablation studies confirm that using a base model is essential for diversity, as refining instruct-only outputs lowers accuracy. Overall, BARE effectively integrates base and instruct-tuned models to generate high-quality synthetic data for improved downstream tasks.

In conclusion, the study quantitatively examines synthetic data generation methods, revealing that base models ensure diversity while instruct-tuned models enhance quality. BARE integrates both to generate high-quality, diverse data. Extensive experiments validate its effectiveness, improving downstream tasks like GSM8K, LiveCodeBench, and RAFT, setting a new state-of-the-art. Future work could refine the process through fine-tuned refiners, additional stages, or alternative training objectives. Beyond synthetic training data, BARE can also create diverse evaluation datasets. As synthetic data becomes essential for model training, BARE offers a scalable solution that balances diversity and quality, outperforming existing methods in various domains.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Meta AI Introduces ParetoQ: A Unified Machine Learning Framework for Sub-4-Bit Quantization in Large Language Models

As deep learning models continue to grow, quantization has become an essential compression technique, and the need for effective low-bit methods is increasingly pressing. Low-bit quantization reduces model size while attempting to retain accuracy. Researchers have been trying to determine the bit-width that maximizes efficiency without compromising performance. Various studies have explored different bit-width settings, but conflicting conclusions have arisen in the absence of a standardized evaluation framework. This ongoing pursuit influences the development of large-scale artificial intelligence models and determines their feasibility for deployment in memory-constrained environments.

A major challenge in low-bit quantization is identifying the optimal trade-off between computational efficiency and model accuracy. The debate over which bit-width is most effective remains unresolved, with some arguing that 4-bit quantization provides the best balance, while others claim that 1.58-bit models can achieve comparable results. However, prior research has lacked a unified methodology for comparing different quantization settings, leading to inconsistent conclusions. This knowledge gap complicates establishing reliable scaling laws for low-bit quantization. Moreover, achieving stable training in extremely low-bit settings poses a technical hurdle, because lower-bit models often experience significant representational shifts compared to their higher-bit counterparts.

Quantization approaches vary in their implementation and effectiveness. Post-training quantization (PTQ) quantizes a model after it has been trained in full precision, which makes it easy to deploy but prone to accuracy degradation at low bit-widths. Quantization-aware training (QAT), on the other hand, integrates quantization into the training process, allowing models to adapt to low-bit representations more effectively. Other techniques, such as learnable quantization and mixed-precision strategies, have been explored to fine-tune the balance between accuracy and model size. However, these methods lack a universal framework for systematic evaluation, making it difficult to compare their efficiency under different conditions.
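
To make the PTQ/QAT distinction concrete, the following sketch shows quantization-aware training with a generic symmetric k-bit "fake quantizer" and a straight-through estimator, so gradients can flow through the rounding step. It is a textbook-style illustration under simple assumptions, not ParetoQ's bit-specific quantization functions.

import torch
import torch.nn as nn

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric k-bit fake quantization (bits >= 2) with a straight-through estimator."""

    @staticmethod
    def forward(ctx, weight, bits):
        qmax = 2 ** (bits - 1) - 1                       # e.g., 7 for 4-bit, 1 for 2-bit
        scale = weight.abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
        return q * scale                                 # dequantized weights used in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as identity for the gradient.
        return grad_output, None

class QATLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized during training."""

    def __init__(self, in_features, out_features, bits=2, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuantSTE.apply(self.weight, self.bits)
        return nn.functional.linear(x, w_q, self.bias)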

Researchers at Meta have introduced ParetoQ, a structured framework designed to unify the assessment of sub-4-bit quantization techniques. This framework allows rigorous comparisons across different bit-width settings, including 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization. By refining training schemes and bit-specific quantization functions, ParetoQ achieves improved accuracy and efficiency over previous methodologies. Unlike prior works that independently optimize for specific bit levels, ParetoQ establishes a consistent evaluation process that objectively compares quantization trade-offs.

ParetoQ employs an optimized quantization-aware training strategy to minimize accuracy loss while maintaining model compression efficiency. The framework refines bit-specific quantization functions and tailors training strategies for each bit-width. A critical finding from this study is the distinct learning transition observed between 2-bit and 3-bit quantization. Models trained at 3-bit precision and higher maintain representation similarities with their original pre-trained distributions, while models trained at 2-bit or lower experience drastic representational shifts. To overcome this challenge, the framework systematically optimizes the quantization grid, training allocation, and bit-specific learning strategies.

Extensive experiments confirm the superior performance of ParetoQ over existing quantization methods. A ternary 600M-parameter model developed using ParetoQ outperforms the previous state-of-the-art ternary 3B-parameter model in accuracy while utilizing only one-fifth of the parameters. The study demonstrates that 2-bit quantization achieves an accuracy improvement of 1.8 percentage points over a comparable 4-bit model of the same size, establishing its viability as an alternative to conventional 4-bit quantization. Further, ParetoQ enables a more hardware-friendly implementation, with optimized 2-bit CPU kernels achieving higher speed and memory efficiency compared to 4-bit quantization. The experiments also reveal that ternary, 2-bit and 3-bit quantization models achieve better accuracy-size trade-offs than 1-bit and 4-bit quantization, reinforcing the significance of sub-4-bit approaches.

The findings of this study provide a strong foundation for optimizing low-bit quantization in large language models. By introducing a structured framework, the research effectively addresses the challenges of accuracy trade-offs and bit-width optimization. The results indicate that while extreme low-bit quantization is viable, 2-bit and 3-bit quantization currently offer the best balance between performance and efficiency. Future advancements in hardware support for low-bit computation will further enhance the practicality of these techniques, enabling more efficient deployment of large-scale machine learning models in resource-constrained environments.

Check out the Paper. All credit for this research goes to the researchers of this project.

Sundial: A New Era for Time Series Foundation Models with Generative AI

Time series forecasting presents a fundamental challenge due to its intrinsic non-determinism, which makes it difficult to predict future values accurately. Traditional methods generally employ point forecasting, providing a single deterministic value that cannot describe the range of possible outcomes. Although recent deep learning methods have improved forecasting precision, they require task-specific training and do not generalize beyond the distributions seen during training. Most models impose strict parametric assumptions or rely on discrete tokenization, which can give rise to out-of-vocabulary issues and quantization errors. Overcoming these constraints is key to creating scalable, transferable, and generalizable time series forecasting models that can function across domains without extensive re-training.

Current forecasting models can be roughly divided into two categories: statistical models and deep learning-based models. Statistical models, such as ARIMA and Exponential Smoothing, are interpretable but cannot capture the complex dependencies in large datasets. Transformer-based deep learning models display impressive predictive ability; however, they are not robust, require extensive in-distribution training, and depend heavily on discrete tokenization. This tokenization scheme, used in frameworks such as TimesFM, Timer, and Moirai, embeds time series data into categorical token sequences, discarding fine-grained information, imposing rigid representation learning, and introducing potential quantization inconsistencies. In addition, most forecasting models rely on prior probabilistic distributions, such as Gaussian priors, that limit their ability to capture the rich and highly variable nature of real-world data. These constraints limit the ability of existing methods to provide accurate and reliable probabilistic forecasts that adequately reflect uncertainty in decision-making applications.

To overcome these challenges, Sundial offers a generative, scalable, and flexible time series foundation model that learns complex patterns directly from raw data. In contrast to discrete tokenization, it uses continuous tokenization with native patching, which maintains time series continuity and enables more expressive representation learning. One of the innovations behind its forecasting power is TimeFlow Loss, a flow-matching-based generative training objective that enables the model to learn predictive distributions without prior probabilistic assumptions. This approach avoids mode collapse and yields multiple plausible future trajectories instead of a single deterministic prediction. In addition, the model is trained on TimeBench, a large-scale dataset of one trillion time points sampled from real-world and synthetic time series, which endows it with strong generalization across a wide range of forecasting tasks.
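
For intuition on the training objective, a generic conditional flow-matching loss regresses a velocity field that transports noise toward the observed future values, as sketched below. This is the standard recipe, not Sundial's exact TimeFlow formulation, and velocity_net is a placeholder network.

import torch

def flow_matching_loss(velocity_net, x_future, context):
    """Generic (rectified) flow-matching objective.

    x_future: future values to be generated, shape (batch, horizon)
    context:  embedding of the observed history, shape (batch, d_model)
    """
    noise = torch.randn_like(x_future)                           # sample from the source distribution
    t = torch.rand(x_future.size(0), 1, device=x_future.device)  # interpolation times in [0, 1]
    x_t = (1.0 - t) * noise + t * x_future                       # point on the straight noise-to-data path
    target_velocity = x_future - noise                           # velocity of that path
    pred_velocity = velocity_net(x_t, t, context)                # model predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)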

Sundial combines several innovations in tokenization, architecture, and training. Its native patching-based continuous tokenization processes time series as continuous segments rather than mapping them to discrete categorical tokens. A re-normalization method improves generalization by managing variability in the dataset and distribution shifts. The backbone is a decoder-only Transformer that uses causal self-attention and rotary position embeddings, improving its ability to model temporal dependencies. Training stability and inference efficiency are enhanced through Pre-LN, FlashAttention, and KV Cache optimizations. TimeFlow Loss enables probabilistic forecasting through flow matching, allowing the model to learn non-parametric distributions without being constrained by fixed assumptions. Rather than producing a single point estimate, the model generates multiple possible outcomes, improving decision-making in uncertain environments. Training is conducted on TimeBench, a trillion-scale dataset spanning finance, weather, IoT, healthcare, and more, ensuring wide applicability and robustness across domains.
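
A minimal sketch of the continuous, patch-based tokenization idea follows: the raw series is re-normalized per sample, split into fixed-length patches, and linearly projected into the Transformer's embedding space. The patch length and embedding width are illustrative assumptions, not Sundial's published configuration.

import torch
import torch.nn as nn

class ContinuousPatchEmbedding(nn.Module):
    """Turns a univariate series into a sequence of continuous patch embeddings."""

    def __init__(self, patch_len=16, d_model=256):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)  # continuous tokens: no discrete vocabulary

    def forward(self, series):
        # series: (batch, length); length is assumed to be a multiple of patch_len.
        batch, length = series.shape
        # Instance re-normalization to handle variability and distribution shift.
        mean = series.mean(dim=1, keepdim=True)
        std = series.std(dim=1, keepdim=True).clamp(min=1e-6)
        patches = ((series - mean) / std).reshape(batch, length // self.patch_len, self.patch_len)
        return self.proj(patches)  # (batch, num_patches, d_model), fed to a decoder-only Transformer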

Sundial achieves state-of-the-art performance on a variety of zero-shot forecasting benchmarks, demonstrating superior accuracy, efficiency, and scalability. In long-term forecasting, it consistently outperforms previous time series foundation models, with substantial reductions in Mean Squared Error and Mean Absolute Error. In probabilistic forecasting, Sundial ranks among the top-performing models on key metrics such as MASE and CRPS while holding a substantial advantage in inference speed. The model scales well, with larger configurations delivering better accuracy, and TimeFlow Loss proves more effective than standard MSE- or diffusion-based objectives. Sundial also provides flexible inference, allowing users to trade off computational cost against forecasting accuracy, which makes it particularly useful for practical applications that require reliable and adaptive time series forecasts.

Sundial is a significant breakthrough in time series forecasting, with a generative modeling framework that combines continuous tokenization, Transformer models, and a novel probabilistic training objective. With TimeFlow Loss, it surpasses conventional parametric forecasting methods by learning highly flexible and unconstrained predictive distributions. Trained on the trillion-scale TimeBench dataset, it achieves state-of-the-art results on a variety of forecasting tasks with strong zero-shot generalization. Its ability to generate multiple plausible future trajectories, combined with its efficiency, makes it a powerful decision-making tool in many industries, thereby reimagining the promise of time series foundation models.

Check out the Paper. All credit for this research goes to the researchers of this project.

Fine-Tuning of Llama-2 7B Chat for Python Code Generation: Using QLoRA, SFTTrainer, and Gradient Checkpointing on the Alpaca-14k Dataset

In this tutorial, we demonstrate how to efficiently fine-tune the Llama-2 7B Chat model for Python code generation using advanced techniques such as QLoRA, gradient checkpointing, and supervised fine-tuning with the SFTTrainer. Leveraging the Alpaca-14k dataset, we walk through setting up the environment, configuring LoRA parameters, and applying memory optimization strategies to train a model that excels in generating high-quality Python code. This step-by-step guide is designed for practitioners seeking to harness the power of LLMs with minimal computational overhead.

!pip install -q accelerate
!pip install -q peft
!pip install -q transformers
!pip install -q trl

First, install the required libraries for our project. They include accelerate, peft, transformers, and trl from the Python Package Index. The -q flag (quiet mode) keeps the output minimal.

import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Import the essential modules for our training setup. They include utilities for dataset loading, model/tokenizer, training arguments, logging, LoRA configuration, and the SFTTrainer.

# The model to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"
# The instruction dataset to use
dataset_name = "user/minipython-Alpaca-14k"

# Fine-tuned model name
new_model = "/kaggle/working/llama-2-7b-codeAlpaca"

We specify the base model from the Hugging Face hub, the instruction dataset, and the new model’s name.

# QLoRA parameters
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1

Define the LoRA parameters for our model. `lora_r` sets the LoRA attention dimension, `lora_alpha` scales LoRA updates, and `lora_dropout` controls dropout probability.

# TrainingArguments parameters

# Output directory where the model predictions and checkpoints will be stored
output_dir = "/kaggle/working/llama-2-7b-codeAlpaca"
# Number of training epochs
num_train_epochs = 1
# Enable fp16 training (set to True for mixed precision training)
fp16 = True
# Batch size per GPU for training
per_device_train_batch_size = 8
# Batch size per GPU for evaluation
per_device_eval_batch_size = 8
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "adamw_torch"
# Learning rate schedule
lr_scheduler_type = "constant"
# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True
# Ratio of steps for a linear warmup
warmup_ratio = 0.03
# Save checkpoint every X updates steps
save_steps = 100
# Log every X updates steps
logging_steps = 10

These parameters configure the training process. They include output paths, number of epochs, precision (fp16), batch sizes, gradient accumulation, and checkpointing. Additional settings like learning rate, optimizer, and scheduling help fine-tune training behavior. Warmup and logging settings control how the model starts training and how we monitor progress.

import torch
print("PyTorch Version:", torch.__version__)
print("CUDA Version:", torch.version.cuda)

Import PyTorch and print both the installed PyTorch version and the corresponding CUDA version.

!nvidia-smi

This command shows the GPU information, including driver version, CUDA version, and current GPU usage.

# SFT parameters

# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on the GPU 0
device_map = {"": 0}

Define SFT parameters, such as the maximum sequence length, whether to pack multiple examples, and mapping the entire model to GPU 0.

# SFT parameters

# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load dataset
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Load base model in half precision (fp16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Prepare model for training
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

Set additional SFT parameters and load our dataset and tokenizer. We configure padding tokens for the tokenizer and load the base model in half precision. Finally, we enable gradient checkpointing and ensure the model requires input gradients for training.

from peft import get_peft_model

Import the `get_peft_model` function, which applies parameter-efficient fine-tuning (PEFT) to our base model.

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

Configure and apply LoRA to our model using `LoraConfig` and `get_peft_model`. We then create `TrainingArguments` for model training, specifying epoch counts, batch sizes, and optimization settings. Lastly, we set up the `SFTTrainer`, passing it the model, dataset, tokenizer, and training arguments.

# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)

Initiate the supervised fine-tuning process (`trainer.train()`) and then save the trained LoRA model to the specified directory.

# Run text generation pipeline with the fine-tuned model
prompt = "How can I write a Python program that calculates the mean, standard deviation, and coefficient of variation of a dataset from a CSV file?"
pipe = pipeline(task="text-generation", model=trainer.model, tokenizer=tokenizer, max_length=400)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])

Create a text generation pipeline using our fine-tuned model and tokenizer. Then, we provide a prompt, generate text using the pipeline, and print the output.

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HF_TOKEN")

Access Kaggle Secrets to retrieve a stored Hugging Face token (`HF_TOKEN`). This token is used for authentication with the Hugging Face Hub.

# Empty VRAM
# del model
# del pipe
# del trainer
# del dataset
del tokenizer
import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()

The above snippet shows how to free up GPU memory by deleting references and clearing caches. We delete the tokenizer, run garbage collection, and empty the CUDA cache to reduce VRAM usage.

import torch

# Check the number of GPUs available
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs available: {num_gpus}")

# Check if CUDA device 1 is available
if num_gpus > 1:
    print("cuda:1 is available.")
else:
    print("cuda:1 is not available.")

We import PyTorch and check the number of GPUs detected. Then, we print the count and conditionally report whether the GPU with ID 1 is available.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Specify the device ID for your desired GPU (e.g., 0 for the first GPU, 1 for the second GPU)
device_id = 1  # Change this based on your available GPUs
device = f"cuda:{device_id}"
# Load the base model on the specified GPU
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",  # Use auto to load on the available device
)
# Load the LoRA weights
lora_model = PeftModel.from_pretrained(base_model, new_model)
# Move LoRA model to the specified GPU
lora_model.to(device)
# Merge the LoRA weights with the base model weights
model = lora_model.merge_and_unload()
# Ensure the merged model is on the correct device
model.to(device)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Select a GPU device (device_id 1) and load the base model with specified precision and memory optimizations. Then, load and merge LoRA weights into the base model, ensuring the merged model is moved to the designated GPU. Finally, load the tokenizer and configure it with appropriate padding settings.

In conclusion, following this tutorial, you have successfully fine-tuned the Llama-2 7B Chat model to specialize in Python code generation. Integrating QLoRA, gradient checkpointing, and SFTTrainer demonstrates a practical approach to managing resource constraints while achieving high performance.

Download the Colab Notebook here. All credit for this research goes to the researchers of this project.

Governing the ML lifecycle at scale, Part 4: Scaling MLOps with securi …

Data science teams often face challenges when transitioning models from the development environment to production. These include difficulties integrating data science teams' models into the IT team's production environment, the need to retrofit data science code to meet enterprise security and governance standards, gaining access to production-grade data, and maintaining repeatability and reproducibility in machine learning (ML) pipelines, which can be difficult without proper platform infrastructure and standardized templates.
This post, part of the “Governing the ML lifecycle at scale” series (Part 1, Part 2, Part 3), explains how to set up and govern a multi-account ML platform that addresses these challenges. The platform provides self-service provisioning of secure environments for ML teams, accelerated model development with predefined templates, a centralized model registry for collaboration and reuse, and standardized model approval and deployment processes.
An enterprise might have the following roles involved in the ML lifecycle. The functions for each role can vary from company to company. In this post, we assign the ML lifecycle functions to each role as follows:

Lead data scientist – Provision accounts for ML development teams, govern access to the accounts and resources, and promote a standardized model development and approval process to eliminate repeated engineering effort. Usually, there is one lead data scientist for a data science group in a business unit, such as marketing.
Data scientists – Perform data analysis, model development, model evaluation, and registering the models in a model registry.
ML engineers – Develop model deployment pipelines and control the model deployment processes.
Governance officer – Review the model’s performance including documentation, accuracy, bias and access, and provide final approval for models to be deployed.
Platform engineers – Define a standardized process for creating development accounts that conform to the company’s security, monitoring, and governance standards; create templates for model development; and manage the infrastructure and mechanisms for sharing model artifacts.

This ML platform provides several key benefits. First, it enables every step in the ML lifecycle to conform to the organization's security, monitoring, and governance standards, reducing overall risk. Second, the platform gives data science teams the autonomy to create accounts and to provision and access ML resources as needed, reducing the resource constraints that often hinder their work.
Additionally, the platform automates many of the repetitive manual steps in the ML lifecycle, allowing data scientists to focus their time and effort on building ML models and discovering insights from the data rather than managing infrastructure. The centralized model registry also promotes collaboration across teams and enables centralized model governance, increasing visibility into models developed throughout the organization and reducing duplicated work.
Finally, the platform standardizes the process for business stakeholders to review and consume models, smoothing the collaboration between the data science and business teams. This makes sure models can be quickly tested, approved, and deployed to production to deliver value to the organization.
Overall, this holistic approach to governing the ML lifecycle at scale provides significant benefits in terms of security, agility, efficiency, and cross-functional alignment.
In the next section, we provide an overview of the multi-account ML platform and how the different roles collaborate to scale MLOps.
Solution overview
The following architecture diagram illustrates the solutions for a multi-account ML platform and how different personas collaborate within this platform.

There are five accounts illustrated in the diagram:

ML Shared Services Account – This is the central hub of the platform. This account manages templates for setting up new ML Dev Accounts, as well as SageMaker Projects templates for model development and deployment, in AWS Service Catalog. It also hosts a model registry to store ML models developed by data science teams, and provides a single location to approve models for deployment.
ML Dev Account – This is where data scientists perform their work. In this account, data scientists can create new SageMaker notebooks based on their needs, connect to data sources such as Amazon Simple Storage Service (Amazon S3) buckets, analyze data, build models, create model artifacts (for example, a container image), and more. The SageMaker projects, provisioned using the templates in the ML Shared Services Account, can speed up the model development process because they come with common steps (such as connecting to an S3 bucket) preconfigured. The diagram shows one ML Dev Account, but there can be multiple ML Dev Accounts in an organization.
ML Test Account – This is the test environment for new ML models, where stakeholders can review and approve models before deployment to production.
ML Prod Account – This is the production account for new ML models. After the stakeholders approve the models in the ML Test Account, the models are automatically deployed to this production account.
Data Governance Account – This account hosts data governance services for data lake, central feature store, and fine-grained data access.

Key activities and actions are numbered in the preceding diagram. Some of these activities are performed by various personas, whereas others are automatically triggered by AWS services.

ML engineers create the pipelines in GitHub repositories, and the platform engineer converts them into two different Service Catalog portfolios: the ML Admin Portfolio and the SageMaker Projects Portfolio. The ML Admin Portfolio is used by the lead data scientist to create AWS resources (for example, SageMaker domains). The SageMaker Projects Portfolio contains SageMaker projects that data scientists and ML engineers can use to accelerate model training and deployment.
The platform engineer shares the two Service Catalog portfolios with workload accounts in the organization.
The data engineer prepares and governs datasets using services such as Amazon S3, AWS Lake Formation, and Amazon DataZone for ML.
The lead data scientist uses the ML Admin Portfolio to set up SageMaker domains and the SageMaker Projects Portfolio to set up SageMaker projects for their teams.
Data scientists subscribe to datasets, and use SageMaker notebooks to analyze data and develop models.
Data scientists use the SageMaker projects to build model training pipelines. These SageMaker projects automatically register the models in the model registry.
The lead data scientist approves the model locally in the ML Dev Account.
This step consists of the following sub-steps:

After the data scientist approves the model, an approval event is published to an Amazon EventBridge event bus that forwards it to the ML Shared Services Account.
The event in EventBridge triggers an AWS Lambda function that copies the model artifacts (managed by SageMaker, or Docker images) from the ML Dev Account into the ML Shared Services Account, creates a model package there, and registers the new model in the central model registry (a simplified sketch of such a function appears after this list).

ML engineers review and approve the new model in the ML Shared Services Account for testing and deployment. This action triggers a pipeline that was set up using a SageMaker project.
The approved models are first deployed to the ML Test Account. Integration tests are run and the endpoint is validated before the model is approved for production deployment.
After testing, the governance officer approves the new model in CodePipeline.
After the model is approved, the pipeline continues to deploy the new model into the ML Prod Account and creates a SageMaker endpoint.
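
As noted in step 8, a much-simplified sketch of such an EventBridge-triggered Lambda function is shown below. The event fields, the central model package group name, and the cross-account permissions are assumptions for illustration, and the copying of Amazon S3 and Amazon ECR artifacts is omitted; refer to the sample repository for the complete implementation.

import boto3

sm = boto3.client("sagemaker")  # runs in the ML Shared Services Account

def handler(event, context):
    """Triggered by a model-approval event forwarded from an ML Dev Account."""
    source_package_arn = event["detail"]["ModelPackageArn"]  # assumed field in the forwarded event
    group_name = "central-model-group"                       # hypothetical central package group

    # Look up the approved model package in the dev account's registry
    # (assumes cross-account read permissions have been granted).
    source = sm.describe_model_package(ModelPackageName=source_package_arn)

    # Register a copy in the central model registry of the ML Shared Services Account.
    response = sm.create_model_package(
        ModelPackageGroupName=group_name,
        InferenceSpecification=source["InferenceSpecification"],
        ModelApprovalStatus="PendingManualApproval",
    )
    return {"central_model_package_arn": response["ModelPackageArn"]}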

The following sections provide details on the key components of this diagram, how to set them up, and sample code.
Set up the ML Shared Services Account
The ML Shared Services Account helps the organization standardize management of artifacts and resources across data science teams. This standardization also helps enforce controls across resources consumed by data science teams.
The ML Shared Services Account has the following features:
Service Catalog portfolios – This includes the following portfolios:

ML Admin Portfolio – This is intended to be used by the project admins of the workload accounts. It is used to create AWS resources for their teams. These resources can include SageMaker domains, Amazon Redshift clusters, and more.
SageMaker Projects Portfolio – This portfolio contains the SageMaker products to be used by the ML teams to accelerate their ML models’ development while complying with the organization’s best practices.
Central model registry – This is the centralized place for ML models developed and approved by different teams. For details on setting this up, refer to Part 2 of this series.

The following diagram illustrates this architecture.

As the first step, the cloud admin sets up the ML Shared Services Account by using one of the blueprints for customizations in AWS Control Tower account vending, as described in Part 1.
In the following sections, we walk through how to set up the ML Admin Portfolio. The same steps can be used to set up the SageMaker Projects Portfolio.
Bootstrap the infrastructure for two portfolios
After the ML Shared Services Account has been set up, the ML platform admin can bootstrap the infrastructure for the ML Admin Portfolio using sample code in the GitHub repository. The code contains AWS CloudFormation templates that can be later deployed to create the SageMaker Projects Portfolio.
Complete the following steps:

Clone the GitHub repo to a local directory:

git clone https://github.com/aws-samples/data-and-ml-governance-workshop.git

Change the directory to the portfolio directory:

cd data-and-ml-governance-workshop/module-3/ml-admin-portfolio

Install dependencies in a separate Python environment using your preferred Python packages manager:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Bootstrap your deployment target account using the following command:

cdk bootstrap aws://<target account id>/<target region> --profile <target account profile>
If you already have a role and AWS Region from the account set up, you can use the following command instead:

cdk bootstrap

Lastly, deploy the stack:

cdk deploy --all --require-approval never

When it’s ready, you can see the MLAdminServicesCatalogPipeline pipeline in AWS CloudFormation.
Navigate to AWS CodeStar Connections in the console; you will see a connection named codeconnection-service-catalog. If you open the connection, you will notice that it still needs to be connected to GitHub before your pipelines can push code. Choose Update pending connection to complete the integration with your GitHub account.
Once that is done, create the empty GitHub repositories that the code will be pushed to. For example, you can create a repository called ml-admin-portfolio-repo. Every project you deploy needs a corresponding GitHub repository created beforehand.
Trigger CodePipeline to deploy the ML Admin Portfolio
Complete the following steps to trigger the pipeline to deploy the ML Admin Portfolio. We recommend creating a separate folder for the different repositories that will be created in the platform.

Get out of the cloned repository and create a parallel folder called platform-repositories:

cd ../../.. # (as many .. as directories you have moved in)
mkdir platform-repositories

Clone and fill the empty created repository:

cd platform-repositories
git clone https://github.com/example-org/ml-admin-service-catalog-repo.git
cd ml-admin-service-catalog-repo
cp -aR ../../ml-platform-shared-services/module-3/ml-admin-portfolio/. .

Push the code to the Github repository to create the Service Catalog portfolio:

git add .
git commit -m "Initial commit"
git push -u origin main

After it is pushed, the GitHub repository we created earlier is no longer empty. The new code push triggers the pipeline named cdk-service-catalog-pipeline to build and deploy artifacts to Service Catalog.
It takes about 10 minutes for the pipeline to finish running. When it’s complete, you can find a portfolio named ML Admin Portfolio on the Portfolios page on the Service Catalog console.

Repeat the same steps to set up the SageMaker Projects Portfolio; make sure you use the corresponding sample code (sagemaker-projects-portfolio) and create a new code repository (with a name such as sm-projects-service-catalog-repo).
Share the portfolios with workload accounts
You can share the portfolios with workload accounts in Service Catalog. Again, we use ML Admin Portfolio as an example.

On the Service Catalog console, choose Portfolios in the navigation pane.
Choose the ML Admin Portfolio.
On the Share tab, choose Share.
In the Account info section, provide the following information:

For Select how to share, select Organization node.
Choose Organizational Unit, then enter the organizational unit (OU) ID of the workloads OU.

In the Share settings section, select Principal sharing.
Choose Share. Selecting the Principal sharing option allows you to specify the AWS Identity and Access Management (IAM) roles, users, or groups by name for which you want to grant permissions in the shared accounts.
On the portfolio details page, on the Access tab, choose Grant access.
For Select how to grant access, select Principal Name.
In the Principal Name section, choose role/ for Type and enter the name of the role that the ML admin will assume in the workload accounts for Name.
Choose Grant access.
Repeat these steps to share the SageMaker Projects Portfolio with workload accounts.

Confirm available portfolios in workload accounts
If the sharing was successful, you should see both portfolios available on the Service Catalog console, on the Portfolios page under Imported portfolios.

Now that the service catalogs in the ML Shared Services Account have been shared with the workloads OU, the data science team can provision resources such as SageMaker domains using the templates and set up SageMaker projects to accelerate ML models’ development while complying with the organization’s best practices.
We demonstrated how to create and share portfolios with workload accounts. However, the journey doesn’t stop here. The ML engineer can continue to evolve existing products and develop new ones based on the organization’s requirements.
The following sections describe the processes involved in setting up ML Development Accounts and running ML experiments.
Set up the ML Development Account
The ML Development account setup consists of the following tasks and stakeholders:

The team lead requests the cloud admin to provision the ML Development Account.
The cloud admin provisions the account.
The team lead uses the shared Service Catalog portfolios to provision SageMaker domains, set up IAM roles and grant access, and get access to data in Amazon S3, Amazon DataZone, AWS Lake Formation, or a central feature group, depending on which solution the organization decides to use.

Run ML experiments
Part 3 in this series described multiple ways to share data across the organization. The current architecture allows data access using the following methods:

Option 1: Train a model using Amazon DataZone – If the organization has Amazon DataZone in the central governance account or data hub, a data publisher can create an Amazon DataZone project to publish the data. Then the data scientist can subscribe to the Amazon DataZone published datasets from Amazon SageMaker Studio, and use the dataset to build an ML model. Refer to the sample code for details on how to use subscribed data to train an ML model.
Option 2: Train a model using Amazon S3 – Make sure the user has access to the dataset in the S3 bucket. Follow the sample code to run an ML experiment pipeline using data stored in an S3 bucket.
Option 3: Train a model using a data lake with Athena – Part 2 introduced how to set up a data lake. Follow the sample code to run an ML experiment pipeline using data stored in a data lake with Amazon Athena.
Option 4: Train a model using a central feature group – Part 2 introduced how to set up a central feature group. Follow the sample code to run an ML experiment pipeline using data stored in a central feature group.

You can choose which option to use depending on your setup. For options 2, 3, and 4, the SageMaker Projects Portfolio provides project templates to run ML experiment pipelines, steps including data ingestion, model training, and registering the model in the model registry.
In the following example, we use option 2 to demonstrate how to build and run an ML pipeline using a SageMaker project that was shared from the ML Shared Services Account.

On the SageMaker Studio domain, under Deployments in the navigation pane, choose Projects.
Choose Create project.
There is a list of projects that serve various purposes. Because we want to access data stored in an S3 bucket for training the ML model, choose the project that uses data in an S3 bucket on the Organization templates tab.
Follow the steps to provide the necessary information, such as Name, Tooling Account (the ML Shared Services account ID), and S3 bucket (for MLOps), and then create the project.

It takes a few minutes to create the project.

After the project is created, a SageMaker pipeline is triggered to perform the steps specified in the SageMaker project. Choose Pipelines in the navigation pane to see the pipeline. You can choose the pipeline to see the Directed Acyclic Graph (DAG) of the pipeline. When you choose a step, its details show in the right pane.

The last step of the pipeline is registering the model in the current account’s model registry. As the next step, the lead data scientist will review the models in the model registry, and decide if a model should be approved to be promoted to the ML Shared Services Account.
Approve ML models
The lead data scientist should review the trained ML models and approve the candidate model in the model registry of the development account. After an ML model is approved, it triggers a local event, and the event buses in EventBridge send the model approval event to the ML Shared Services Account, where the model artifacts are copied to the central model registry. A model card is created if the model is new; otherwise, the existing model card is updated with the new version.
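
Programmatically, approving a model version in the SageMaker model registry is a single API call, as in the minimal sketch below (the package ARN is a placeholder). This status change is what emits the event that drives the promotion flow.

import boto3

sm = boto3.client("sagemaker")

# Placeholder ARN of the candidate model version in the ML Dev Account's registry.
model_package_arn = "arn:aws:sagemaker:<region>:<dev-account-id>:model-package/<group>/1"

sm.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Reviewed by the lead data scientist",
)
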
The following architecture diagram shows the flow of model approval and model promotion.

Model deployment
After the previous step, the model is available in the central model registry in the ML Shared Services Account. ML engineers can now deploy the model.
If you used the sample code to bootstrap the SageMaker Projects Portfolio, you can use the Deploy real-time endpoint from ModelRegistry – Cross account, test and prod option in SageMaker Projects to set up a project whose pipeline deploys the model to the target test account and production account.

On the SageMaker Studio console, choose Projects in the navigation pane.
Choose Create project.
On the Organization templates tab, you can view the templates that were populated earlier from Service Catalog when the domain was created.
Select the template Deploy real-time endpoint from ModelRegistry – Cross account, test and prod and choose Select project template.
Fill in the template:

The SageMakerModelPackageGroupName is the model group name of the model promoted from the ML Dev Account in the previous step.
Enter the Deployments Test Account ID for PreProdAccount, and the Deployments Prod Account ID for ProdAccount.

The pipeline for deployment is ready. The ML engineer reviews the newly promoted model in the ML Shared Services Account. If the ML engineer approves the model, it triggers the deployment pipeline. You can see the pipeline on the CodePipeline console.

 
The pipeline first deploys the model to the test account, and then pauses for manual approval before deploying to the production account. The ML engineer can test performance and the governance officer can validate the model results in the test account. If the results are satisfactory, the governance officer approves the model in CodePipeline to deploy it to the production account.

Conclusion
This post provided detailed steps for setting up the key components of a multi-account ML platform. This includes configuring the ML Shared Services Account, which manages the central templates, model registry, and deployment pipelines; sharing the ML Admin and SageMaker Projects Portfolios from the central Service Catalog; and setting up the individual ML Development Accounts where data scientists can build and train models.
The post also covered the process of running ML experiments using the SageMaker Projects templates, as well as the model approval and deployment workflows. Data scientists can use the standardized templates to speed up their model development, and ML engineers and stakeholders can review, test, and approve the new models before promoting them to production.
This multi-account ML platform design follows a federated model, with a centralized ML Shared Services Account providing governance and reusable components, and a set of development accounts managed by individual lines of business. This approach gives data science teams the autonomy they need to innovate, while providing enterprise-wide security, governance, and collaboration.
We encourage you to test this solution by following the AWS Multi-Account Data & ML Governance Workshop to see the platform in action and learn how to implement it in your own organization.

About the authors
Jia (Vivian) Li is a Senior Solutions Architect at AWS, specializing in AI/ML. She currently supports customers in the financial industry. Prior to joining AWS in 2022, she had 7 years of experience helping enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from the University of Southern California. In her spare time, she enjoys all water activities and hiking in the beautiful mountains of her home state, Colorado.
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys riding his motorcycle and walking with his dogs.
Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.
Alberto Menendez is a DevOps Consultant in Professional Services at AWS. He helps accelerate customers’ journeys to the cloud and achieve their digital transformation goals. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.
Sovik Kumar Nath is an AI/ML and Generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He holds master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Viktor Malesevic is a Senior Machine Learning Engineer within AWS Professional Services, leading teams to build advanced machine learning solutions in the cloud. He’s passionate about making AI impactful, overseeing the entire process from modeling to production. In his spare time, he enjoys surfing, cycling, and traveling.

Accelerate your Amazon Q implementation: starter kits for SMBs

Whether you’re a small or medium-sized business (SMB) or a managed service provider at the beginning of your cloud journey, you might be wondering how to get started. Questions like “Am I following best practices?”, “Am I optimizing my cloud costs?”, and “How difficult is the learning curve?” are quite common. To help, AWS provides a concept called starter kits.
Starter kits are complete, deployable solutions that address common, repeatable business problems. They deploy the services that make up a solution according to best practices, helping you optimize costs and become familiar with these kinds of architectural patterns without a large investment in training. Most of all, starter kits save you time—time that can be better spent on your business or with your customers.
In this post, we showcase a starter kit for Amazon Q Business. If you have a repository of documents that you need to turn into a knowledge base quickly, or simply want to test out the capabilities of Amazon Q Business without a large investment of time at the console, then this solution is for you.
This deployment guide covers the steps to set up an Amazon Q solution that connects to Amazon Simple Storage Service (Amazon S3) and a web crawler data source, and integrates with AWS IAM Identity Center for authentication. An AWS CloudFormation template automates the deployment of this solution.
Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.
Solution overview
The following diagram illustrates the solution architecture.

The workflow involves the following steps:

The user authenticates with their user name and password through AWS IAM Identity Center before accessing the Amazon Q web application.
Upon successful authentication, the user can access the Amazon Q web UI and ask a question.
Amazon Q retrieves relevant information from its index, which is populated using data from the connected data sources (Amazon S3 and a web crawler).
Amazon Q then generates a response using its internal large language model (LLM) and presents it to the user through the Amazon Q web UI.
The user can provide feedback on the response through the Amazon Q web UI.

Prerequisites
Before deploying the solution, make sure you have the following in place:

AWS account – You will need an active AWS account with the necessary permissions to deploy CloudFormation stacks and create the required resources.
Amazon S3 bucket – Make sure you have an existing S3 bucket to be used as the data source for Amazon Q. To create an S3 bucket, refer to Create your first S3 bucket.
AWS IAM Identity Center – Configure AWS IAM Identity Center in your AWS environment. You will need to provide the necessary details, such as the IAM Identity Center instance Amazon Resource Name (ARN), during the deployment process.

Deploy the solution using AWS CloudFormation
Complete the following steps to deploy the CloudFormation template (a scripted alternative using boto3 is sketched after these steps):

Sign in to the AWS Management Console.
Choose one of the following Launch Stack options for your desired AWS Region to open the AWS CloudFormation console and create a new stack. Please note that this stack will default to us-east-1.
For Stack name, enter a name for your application (for example, AMAZON-Q-STARTER-KIT).
In the Parameters section, for IAMIdentityCenterARN, enter the ARN of your IAM Identity Center instance.
For QBusinessApplicationName, enter a name for the Amazon Q Business application.
For S3DataSourceBucket, enter the name of the S3 bucket you created earlier.
For WebCrawlerDataSourceUrl, enter the URL of the web crawler data source.
Choose Next.

On the Configure stack options page, leave everything as default, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Next.

On the Review and create page, choose Submit.
On the Amazon Q Business console, you will see the new application you created.
Choose the new Amazon Q Business application, and in the Data sources section, select the data source s3_datasource and choose Sync now.
Select the data source webpage-datasource and choose Sync now.
To add groups and users to your Amazon Q application, refer to the instructions in the Amazon Q Business documentation.
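If you prefer to script the deployment rather than use the console, the following is a minimal boto3 sketch of the same stack creation. The template URL, identity center ARN, bucket name, and crawler URL are placeholders; the parameter keys match the ones described in the preceding steps.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="AMAZON-Q-STARTER-KIT",
    # Placeholder: use the template URL behind the Launch Stack button in this post
    TemplateURL="https://example-bucket.s3.amazonaws.com/amazon-q-starter-kit.yaml",
    Parameters=[
        {"ParameterKey": "IAMIdentityCenterARN", "ParameterValue": "arn:aws:sso:::instance/ssoins-EXAMPLE"},
        {"ParameterKey": "QBusinessApplicationName", "ParameterValue": "my-q-business-app"},
        {"ParameterKey": "S3DataSourceBucket", "ParameterValue": "my-existing-data-bucket"},
        {"ParameterKey": "WebCrawlerDataSourceUrl", "ParameterValue": "https://example.com"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
)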

Test the solution
To validate that the Amazon Q solution is functioning as expected, perform the following tests:

Test data ingestion:

Upload a test file to the S3 bucket.
Verify that the file is successfully ingested and processed by Amazon Q.
Check the Amazon Q web experience UI for the processed data.

Test web crawler functionality:
Verify that the web crawler is able to retrieve and ingest the data from the website.
Make sure the data is displayed correctly in the Amazon Q web experience UI.
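You can also script these checks. The sketch below uploads a test file with boto3 and then starts a sync job; the application, index, and data source IDs are placeholders to copy from the Amazon Q Business console, and the qbusiness call shown is our assumption to verify against the current API reference.
import boto3

s3 = boto3.client("s3")
s3.upload_file("test-document.pdf", "my-existing-data-bucket", "test-document.pdf")

qbusiness = boto3.client("qbusiness")
qbusiness.start_data_source_sync_job(
    applicationId="YOUR_APPLICATION_ID",   # placeholder
    indexId="YOUR_INDEX_ID",               # placeholder
    dataSourceId="YOUR_S3_DATASOURCE_ID",  # placeholder
)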

Clean up
To clean up, delete the CloudFormation stack and the S3 bucket you created.
Conclusion
The Amazon Q starter kit provides a streamlined solution for SMBs to use the power of generative AI and intelligent question-answering. By automating the deployment and integration with key data sources, this kit eases the complexity of setting up Amazon Q, empowering businesses to quickly unlock insights and improve productivity.
If your SMB has a repository of documents that need to be transformed into a valuable knowledge base, or you simply want to explore the capabilities of Amazon Q, we encourage you to take advantage of this starter kit. Get started today and experience the transformative benefits of enterprise-grade question-answering tailored for your business needs, and let us know what you think in the comments. To explore more generative AI use cases, refer to AI Use Case Explorer.

About the Authors
Nneoma Okoroafor is a Partner Solutions Architect focused on AI/ML and generative AI. Nneoma is passionate about providing guidance to AWS Partners on using the latest technologies and techniques to deliver innovative solutions to customers.
Joshua Amah is a Partner Solutions Architect with Amazon Web Services. He primarily serves consulting partners, providing architectural guidance and recommendations for new and existing workloads. Outside of work, he enjoys playing soccer, golf, and spending time with family and friends.
Jason Brown is a Partner Solutions Architect focused on helping AWS Distribution Partners and their Seller Partners build and grow their AWS practices. Jason is passionate about building solutions for MSPs and VARs in the small business space. Outside the office, Jason is an avid traveler and enjoys offshore fishing.

Building the future of construction analytics: CONXAI’s AI inference …

This is a guest post co-written with Tim Krause, Lead MLOps Architect at CONXAI.
CONXAI Technology GmbH is pioneering the development of an advanced AI platform for the Architecture, Engineering, and Construction (AEC) industry. Our platform uses advanced AI to empower construction domain experts to create complex use cases efficiently.
Construction sites typically employ multiple CCTV cameras, generating vast amounts of visual data. These camera feeds can be analyzed using AI to extract valuable insights. However, to comply with GDPR regulations, all individuals captured in the footage must be anonymized by masking or blurring their identities.
In this post, we dive deep into how CONXAI hosts the state-of-the-art OneFormer segmentation model on AWS using Amazon Simple Storage Service (Amazon S3), Amazon Elastic Kubernetes Service (Amazon EKS), KServe, and NVIDIA Triton.
Our AI solution is offered in two forms:

Model as a service (MaaS) – Our AI model is accessible through an API, enabling seamless integration. Pricing is based on processing batches of 1,000 images, offering flexibility and scalability for users.
Software as a service (SaaS) – This option provides a user-friendly dashboard, acting as a central control panel. Users can add and manage new cameras, view footage, perform analytical searches, and enforce GDPR compliance with automatic person anonymization.

Our AI model, fine-tuned with a proprietary dataset of over 50,000 self-labeled images from construction sites, achieves significantly greater accuracy compared to other MaaS solutions. With the ability to recognize more than 40 specialized object classes—such as cranes, excavators, and portable toilets—our AI solution is uniquely designed and optimized for the construction industry.
Our journey to AWS
Initially, CONXAI started with a small cloud provider specializing in offering affordable GPUs. However, it lacked essential services required for machine learning (ML) applications, such as frontend and backend infrastructure, DNS, load balancers, scaling, blob storage, and managed databases. At that time, the application was deployed as a single monolithic container, which included Kafka and a database. This setup was neither scalable nor maintainable.
After migrating to AWS, we gained access to a robust ecosystem of services. Initially, we deployed the all-in-one AI container on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. Although this provided a basic solution, it wasn’t scalable, necessitating the development of a new architecture.
Our choice of AWS was driven primarily by the team’s extensive experience with the platform. Additionally, the initial cloud credits provided by AWS were invaluable for us as a startup. We now use AWS managed services wherever possible, particularly for data-related tasks, to minimize maintenance overhead and pay only for the resources we actually use.
At the same time, we aimed to remain cloud-agnostic. To achieve this, we chose Kubernetes, enabling us to deploy our stack directly on a customer’s edge—such as on construction sites—when needed. Some customers are potentially very compliance-restrictive, not allowing data to leave the construction site. Another opportunity is federated learning, training on the customer’s edge and only transferring model weights, without sensitive data, into the cloud. In the future, this approach might lead to having one model fine-tuned for each camera to achieve the best accuracy, which requires hardware resources on-site. For the time being, we use Amazon EKS to offload the management overhead to AWS, but we could easily deploy on a standard Kubernetes cluster if needed.
Our previous model was running on TorchServe. With our new model, we first tried performing inference in Python with Flask and PyTorch, as well as with BentoML. Achieving high inference throughput with high GPU utilization for cost-efficiency was very challenging. Exporting the model to ONNX format was particularly difficult because the OneFormer model lacks strong community support. It took us some time to identify why the OneFormer model was so slow in the ONNX Runtime with NVIDIA Triton. We ultimately resolved the issue by converting ONNX to TensorRT.
Defining the final architecture, training the model, and optimizing costs took approximately 2–3 months. Currently, we improve our model by incorporating increasingly accurate labeled data, a process that takes around 3–4 weeks of training on a single GPU. Deployment is fully automated with GitLab CI/CD pipelines, Terraform, and Helm, requiring less than an hour to complete without any downtime. New model versions are typically rolled out in shadow mode for 1–2 weeks to provide stability and accuracy before full deployment.
Solution overview
The following diagram illustrates the solution architecture.

 The architecture consists of the following key components:

The S3 bucket (1) is the most important data source. It is cost-effective, scalable, and provides almost unlimited blob storage. We encrypt the S3 bucket and delete all data with privacy concerns after processing is complete. Almost all microservices read and write files from and to Amazon S3, which ultimately triggers (2) Amazon EventBridge (3). The process begins when a customer uploads an image to Amazon S3 using a presigned URL provided by our API, which handles user authentication and authorization through Amazon Cognito.
The S3 bucket is configured in such a way that it forwards (2) all events into EventBridge.
TriggerMesh is a Kubernetes controller; we use its AWSEventBridgeSource (6). It abstracts the infrastructure automation and automatically creates an Amazon Simple Queue Service (Amazon SQS) queue (5), which acts as a processing buffer. Additionally, it creates an EventBridge rule (4) to forward the S3 event from the event bus into the SQS processing queue. Finally, TriggerMesh creates a Kubernetes Pod that polls events from the processing queue and feeds them into the Knative broker (7). The resources in the Kubernetes cluster are deployed in a private subnet.
The central place for Knative Eventing is the Knative broker (7). It is backed by Amazon Managed Streaming for Apache Kafka (Amazon MSK) (8).
The Knative trigger (9) polls the Knative broker based on a specific CloudEventType and forwards it accordingly to the KServe InferenceService (10).
KServe is a standard model inference platform on Kubernetes that uses Knative Serving as its foundation and is fully compatible with Knative Eventing. It also pulls models from a model repository into the container before the model server starts, eliminating the need to build a new container image for each model version.
We use KServe’s “Collocate transformer and predictor in same pod” feature to maximize inference speed and throughput, because containers within the same pod communicate over localhost and the traffic never leaves the node.
After many performance tests, we achieved the best performance with the NVIDIA Triton Inference Server (11) after converting our model first into ONNX and then into TensorRT.
Our transformer (12) uses Flask with Gunicorn and is optimized for the number of workers and CPU cores to maintain GPU utilization over 90%. The transformer receives a CloudEvent containing the Amazon S3 path of the image, downloads it, and performs model inference over HTTP. After getting back the model results, it performs postprocessing and finally uploads the processed results back to Amazon S3 (a simplified sketch of this flow follows the list below).
We use Karpenter as the cluster auto scaler. Karpenter is responsible for scaling the inference component to handle high user request loads. Karpenter launches new EC2 instances when the system experiences increased demand. This allows the system to automatically scale up computing resources to meet the increased workload.
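The following is a minimal sketch of that transformer flow, assuming a generic Triton HTTP endpoint and placeholder bucket prefixes; build_triton_request and postprocess are hypothetical helpers, and this is not CONXAI’s production code.
import boto3
import requests

s3 = boto3.client("s3")
TRITON_URL = "http://localhost:8000/v2/models/oneformer/infer"  # collocated predictor, reached over localhost

def handle_event(bucket: str, key: str) -> None:
    # Download the image referenced by the incoming CloudEvent
    image_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Preprocess and call the Triton Inference Server over HTTP (request schema simplified)
    payload = build_triton_request(image_bytes)  # hypothetical helper
    response = requests.post(TRITON_URL, json=payload, timeout=30)
    response.raise_for_status()

    # Postprocess the segmentation output and write the results back to Amazon S3
    results = postprocess(response.json())  # hypothetical helper
    s3.put_object(Bucket=bucket, Key=key.replace("input/", "output/"), Body=results)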

All this divides our architecture into two main parts: the AWS managed data services and the Kubernetes cluster:

The S3 bucket, EventBridge, and SQS queue as well as Amazon MSK are all fully managed services on AWS. This keeps our data management effort low.
We use Amazon EKS for everything else. TriggerMesh, AWSEventBridgeSource, Knative Broker, Knative Trigger, KServe with our Python transformer, and the Triton Inference Server are also within the same EKS cluster on a dedicated EC2 instance with a GPU. Because our EKS cluster is just used for processing, it is fully stateless.

Summary
From initially having our own highly customized model, transitioning to AWS, improving our architecture, and introducing our new OneFormer model, CONXAI is now proud to provide scalable, reliable, and secure ML inference to customers, enabling improvements and acceleration on construction sites. We achieved a GPU utilization of over 90%, and the number of processing errors has dropped almost to zero in recent months. One of the major design choices was the separation of the model from the preprocessing and postprocessing code in the transformer. With this technology stack, we gained the ability to scale down to zero on Kubernetes using the Knative serverless feature, while our scale-up time from a cold state is just 5–10 minutes, which can save significant infrastructure costs for potential batch inference use cases.
The next important step is to use these model results with proper analytics and data science. These results can also serve as a data source for generative AI features such as automated report generation. Furthermore, we want to label more diverse images and train the model on additional construction domain classes as part of a continuous improvement process. We also work closely with AWS specialists to bring our model to AWS Inferentia chips for better cost-efficiency.
To learn more about the services in this solution, refer to the following resources:

Getting Started with Karpenter
Manage scale-to-zero scenarios with Karpenter and Serverless
Get started with AWS Fargate for your cluster
How Conxai uses Knative Eventing to provide AI APIs in the construction industry

About the Authors
Tim Krause is Lead MLOps Architect at CONXAI. He takes care of all activities when AI meets infrastructure. He joined the company with previous Platform, Kubernetes, DevOps, and Big Data knowledge and was training LLMs from scratch.
Mehdi Yosofie is a Solutions Architect at AWS, working with startup customers and leveraging his expertise to help them design their workloads on AWS.

Microsoft AI Researchers Introduce Advanced Low-Bit Quantization Techn …

Edge devices like smartphones, IoT gadgets, and embedded systems process data locally, improving privacy, reducing latency, and enhancing responsiveness, and AI is rapidly being integrated into these devices. However, deploying large language models (LLMs) on such devices is difficult and complex due to their high computational and memory demands.

LLMs are massive in size and power requirements. With billions of parameters, they demand significant memory and processing capacity that exceeds the capabilities of most edge devices. While quantization techniques reduce model size and power consumption, conventional hardware is optimized for symmetric computations, limiting support for mixed-precision arithmetic. This lack of native hardware support for low-bit computations restricts deployment across mobile and embedded platforms. 

Prior methods for running LLMs on edge devices use high-bit precision formats like FP32 and FP16, which improve numerical stability but require significant memory and energy. Some approaches use lower-bit quantization (e.g., int8 or int4) to reduce resource demands, but compatibility issues arise with existing hardware. Another technique, dequantization, re-expands compressed models before computation but introduces latency and negates efficiency gains. Also, traditional matrix multiplication (GEMM) requires uniform precision levels, which makes performance optimization across different hardware architectures complex.

Microsoft researchers introduced a series of advancements to enable efficient low-bit quantization for LLMs on edge devices. Their approach includes three major innovations: 

Ladder data type compiler 

T-MAC mpGEMM library

LUT Tensor Core hardware architecture 

These techniques aim to overcome hardware limitations by facilitating mixed-precision general matrix multiplication (mpGEMM) and reducing computational overhead. With these solutions, researchers propose a practical framework that supports efficient LLM inference without requiring specialized GPUs or high-power accelerators.

The first component, the Ladder data type compiler, bridges the gap between low-bit model representations and hardware constraints. It converts unsupported data formats into hardware-compatible representations while maintaining efficiency. This approach ensures modern deep learning architectures can utilize custom data types without sacrificing performance.


The T-MAC mpGEMM library optimizes mixed-precision computations using a lookup table (LUT)–based method instead of traditional multiplication operations. This innovation eliminates the need for dequantization and significantly enhances CPU computational efficiency. 
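To build intuition for the lookup-table idea (this is an illustration, not Microsoft’s T-MAC kernels), the sketch below computes a dot product with 1-bit (+1/-1) weights: for each group of four activations, all 16 possible signed sums are precomputed once, and each weight group then indexes into that table instead of performing multiplications.
import numpy as np

def lut_matvec_1bit(weight_bits: np.ndarray, activations: np.ndarray, group: int = 4) -> float:
    """Dot product with 1-bit (+1/-1) weights via a lookup table instead of multiplications.

    weight_bits: 0/1 array, where 1 means +1 and 0 means -1; length divisible by `group`.
    activations: float array of the same length.
    """
    result = 0.0
    signs = np.array([[1 if (idx >> b) & 1 else -1 for b in range(group)]
                      for idx in range(2 ** group)])          # all 2^group sign patterns
    for g in range(0, len(activations), group):
        acts = activations[g:g + group]
        table = signs @ acts                                   # precompute all 16 partial sums once
        idx = sum(int(weight_bits[g + b]) << b for b in range(group))
        result += table[idx]                                   # a lookup replaces 4 multiply-adds
    return float(result)

# Quick check against the explicit multiplication
w_bits = np.random.randint(0, 2, size=16)
acts = np.random.randn(16).astype(np.float32)
w_signed = np.where(w_bits == 1, 1.0, -1.0)
assert np.isclose(lut_matvec_1bit(w_bits, acts), float(w_signed @ acts), atol=1e-5)
In a real mixed-precision kernel the same table is reused across many weight rows, which is where the savings over conventional multiply-accumulate GEMM come from.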


Also, the LUT Tensor Core hardware architecture introduces a specialized accelerator designed for low-bit quantization. It leverages an optimized instruction set to improve performance while reducing power consumption.


In evaluations, the Ladder data type compiler outperforms conventional deep neural network (DNN) compilers by up to 14.6 times for specific low-bit computations. When tested on edge devices like the Surface Laptop 7 with the Qualcomm Snapdragon X Elite chipset, the T-MAC library achieved 48 tokens per second for the 3B BitNet-b1.58 model, outperforming existing inference libraries. On lower-end devices such as the Raspberry Pi 5, it achieved 11 tokens per second, demonstrating significant efficiency improvements. Meanwhile, the LUT Tensor Core hardware achieved an 11.2-fold increase in energy efficiency and a 20.9-fold boost in computational density.

Several key takeaways from the research by Microsoft include: 

Low-bit quantization reduces model size, enabling efficient execution on edge devices.

The T-MAC library enhances inference speed by eliminating traditional multiplication operations.

The Ladder compiler ensures seamless integration of custom low-bit data formats with existing hardware.

Optimized techniques reduce power usage, making LLMs feasible for low-energy devices.

These methods allow LLMs to operate effectively on a wide range of hardware, from high-end laptops to low-power IoT devices.

These innovations achieve 48 tokens per second on the Snapdragon X Elite, 30 tokens per second for a 2-bit 7B Llama model, and 20 tokens per second for a 4-bit 7B Llama model.

They also enable AI-driven applications across mobile, robotic, and embedded AI systems by making LLMs more accessible.

In conclusion, the study highlights the importance of hardware-aware quantization techniques for deploying LLMs on edge devices. The proposed solutions effectively address the long-standing challenges of memory consumption, computational efficiency, and hardware compatibility. By implementing Ladder, T-MAC, and LUT Tensor Core, researchers have paved the way for next-generation AI applications that are faster, more energy-efficient, and more scalable across various platforms.

Check out the Details and Paper. All credit for this research goes to the researchers of this project.


s1: A Simple Yet Powerful Test-Time Scaling Approach for LLMs

Language models (LMs) have significantly progressed through increased computational power during training, primarily through large-scale self-supervised pretraining. While this approach has yielded powerful models, a new paradigm called test-time scaling has emerged, focusing on improving performance by increasing computation at inference time. OpenAI’s o1 model has validated this approach, showing enhanced reasoning capabilities through test-time compute scaling. However, replicating these results has proven challenging, with various attempts using techniques like Monte Carlo Tree Search (MCTS), multi-agent approaches, and reinforcement learning. Even models like DeepSeek R1 have used millions of samples and complex training stages, yet none have replicated the test-time scaling behavior in o1.

Various methods have been developed to tackle the test-time scaling challenge. Sequential scaling approaches enable models to generate successive solution attempts, with each iteration building upon previous outcomes. Tree-based search methods combine sequential and parallel scaling, implementing techniques like MCTS and guided beam search. REBASE has emerged as a notable approach, utilizing a process reward model to optimize tree search through balanced exploitation and pruning, showing superior performance compared to sampling-based methods and MCTS. These approaches heavily rely on reward models, which come in two forms: outcome reward models for evaluating complete solutions in Best-of-N selection, and process reward models for assessing individual reasoning steps in tree-based search methods.

Researchers from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI have proposed a streamlined approach to achieve test-time scaling and enhanced reasoning capabilities. Their method centers on two key innovations: the carefully curated s1K dataset comprising 1,000 questions with reasoning traces, selected based on difficulty, diversity, and quality criteria, and a novel technique called budget forcing. This budget-forcing mechanism controls test-time computation by either cutting short or extending the model’s thinking process through strategic “Wait” insertions, enabling the model to review and correct its reasoning. The approach was implemented by fine-tuning the Qwen2.5-32B-Instruct language model on the s1K dataset.
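The budget-forcing idea can be sketched as a simple decoding loop (assuming a generic generate helper and the <think>/</think> token names; this is an illustration, not the authors’ implementation): if the model tries to stop thinking before the budget is spent, the end-of-thinking delimiter is suppressed and “Wait” is appended so reasoning continues; once the budget is exceeded, the delimiter is forced so the model moves on to its answer.
def generate_with_budget_forcing(generate, prompt: str, min_thinking_tokens: int, max_thinking_tokens: int) -> str:
    """Sketch of budget forcing for test-time scaling.

    `generate(text, stop)` is a hypothetical helper that continues `text` until it emits
    the `stop` string or its own end-of-thinking marker, returning only the new text.
    """
    text = prompt + "<think>"
    thinking_tokens = 0
    while True:
        continuation = generate(text, stop="</think>")
        thinking_tokens += len(continuation.split())  # crude token count for illustration
        text += continuation
        if thinking_tokens >= max_thinking_tokens:
            # Budget exhausted: force the end-of-thinking delimiter and move to the answer.
            text += "</think>\nFinal answer:"
            break
        if thinking_tokens < min_thinking_tokens:
            # Too little thinking: suppress the stop and extend reasoning with "Wait".
            text += " Wait"
            continue
        text += "</think>"
        break
    return generate(text, stop=None)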

The data selection process follows a three-stage filtering approach based on quality, difficulty, and diversity criteria. The quality filtering stage begins by removing samples with API errors and formatting issues, reducing the initial dataset to 51,581 examples, from which 384 high-quality samples are initially selected. The difficulty assessment employs two key metrics: model performance evaluation using Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models, with correctness verified by Claude 3.5 Sonnet, and reasoning trace length measured by the Qwen2.5 tokenizer. For diversity, questions are classified into specific domains using the Mathematics Subject Classification system through Claude 3.5 Sonnet. This comprehensive filtering process results in a final dataset of 1,000 samples spanning 50 domains.
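As a rough illustration of the three-stage idea (the field names and ordering below are assumptions, not the paper’s exact criteria), a filtering pipeline might look like the following.
import random
from collections import defaultdict

def select_s1k(samples, target_size=1000):
    # Stage 1 - quality: drop samples with API errors or formatting issues
    pool = [s for s in samples if not s.get("api_error") and s.get("well_formatted")]

    # Stage 2 - difficulty: keep questions both grader models answered incorrectly,
    # preferring longer reasoning traces as a proxy for difficulty
    pool = [s for s in pool if not s["qwen7b_correct"] and not s["qwen32b_correct"]]
    pool.sort(key=lambda s: s["trace_length"], reverse=True)

    # Stage 3 - diversity: sample across subject domains until the target size is reached
    by_domain = defaultdict(list)
    for s in pool:
        by_domain[s["domain"]].append(s)
    selected, domains = [], list(by_domain)
    while len(selected) < target_size and domains:
        domain = random.choice(domains)
        if by_domain[domain]:
            selected.append(by_domain[domain].pop(0))
        else:
            domains.remove(domain)
    return selected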

The s1-32B model demonstrates significant performance improvements through test-time compute scaling with budget forcing. s1-32B operates in a superior scaling paradigm compared to the base Qwen2.5-32B-Instruct model using majority voting, validating the effectiveness of sequential scaling over parallel approaches. Moreover, s1-32B emerges as the most sample-efficient open data reasoning model, showing marked improvement over the base model with just 1,000 additional training samples. While r1-32B achieves better performance, it requires 800 times more training data. Notably, s1-32B approaches Gemini 2.0 Thinking’s performance on AIME24, suggesting successful knowledge distillation.

This paper shows that Supervised Fine-Tuning (SFT) with just 1,000 carefully selected examples can create a competitive reasoning model that matches the o1-preview’s performance and achieves optimal efficiency. The introduced budget forcing technique, when combined with the reasoning model, successfully reproduces OpenAI’s test-time scaling behavior. The effectiveness of such minimal training data suggests that the model’s reasoning capabilities are largely present from pretraining on trillions of tokens, with the fine-tuning process merely activating these latent abilities. This aligns with the “Superficial Alignment Hypothesis” from LIMA research, suggesting that a relatively small number of examples can effectively align a model’s behavior with desired outcomes.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Enhancing Mobile Ad Hoc Network Security: A Hybrid Deep Learning Model …

Ad hoc networks are decentralized, self-configuring networks where nodes communicate without fixed infrastructure. They are commonly used in military, disaster recovery, and IoT applications. Each node acts as both a host and a router, dynamically forwarding data.

Flooding attacks in ad hoc networks occur when a malicious node excessively transmits fake route requests or data packets, overwhelming the network. This leads to resource exhaustion, increased latency, and potential network failure.

Recent works on flooding attack mitigation in ad hoc networks focus on trust-based routing, machine learning classification, and adaptive intrusion detection. Techniques like SVM, neural networks, and optimization algorithms improve attack detection, reliability, and network performance. Hybrid models further enhance accuracy and reduce false alarms. Despite notable progress in mitigating such attacks in mobile ad hoc networks (MANETs), current methods struggle to balance detection accuracy, maintain energy efficiency, and adapt to rapidly changing network conditions.

In response to these challenges, a recently published paper proposes an energy-efficient hybrid routing protocol to mitigate flooding attacks in MANETs using a CNN-LSTM/GRU model for classification. The hybrid approach integrates machine learning with the routing protocol to optimize energy efficiency while preventing attacks. The model classifies nodes as trusted or untrusted based on their packet transmission behavior, blacklisting those that exceed predefined thresholds. Training involves extracting features from both benign and malicious nodes, with classification relying on learned patterns.

To enhance accuracy, the model applies a CNN for feature extraction, followed by an LSTM or GRU for sequence learning, optimizing decision-making in real time. The protocol eliminates malicious nodes upon detecting RREQ flooding attacks, ensuring energy conservation. MATLAB is used to create a training dataset and implement a Euclidean distance-based classification. Trust estimation uses link expiration time (LET) and residual energy (RE), with nodes requiring a minimum trust value of 0.5 to participate in routing. Finally, the ML-based AODV protocol selects nodes with the highest trust values to optimize packet delivery and minimize rerouting.
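A minimal PyTorch sketch of the CNN-LSTM classifier idea is shown below; the layer sizes, input features, and two-class output (trusted vs. untrusted) are illustrative assumptions, not the paper’s exact configuration.
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """1D CNN for feature extraction followed by an LSTM for sequence learning."""

    def __init__(self, num_features: int = 8, hidden_size: int = 64, num_classes: int = 2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(num_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, num_features) of per-node traffic statistics
        feats = self.cnn(x.transpose(1, 2))        # (batch, 32, time_steps // 2)
        out, _ = self.lstm(feats.transpose(1, 2))  # (batch, time_steps // 2, hidden)
        return self.head(out[:, -1, :])            # trusted vs. untrusted logits

model = CNNLSTMClassifier()
logits = model(torch.randn(4, 20, 8))  # 4 nodes, 20 time steps, 8 traffic features each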

To evaluate the proposed approach, the research team conducted simulations in MATLAB R2023a to assess the performance of a hybrid deep learning model for flooding attack detection in MANETs. The simulation environment accurately modeled the physical layer of MANETs to ensure realistic evaluation conditions. Key performance metrics were analyzed, including packet delivery ratio, throughput, routing overhead, stability time of cluster heads, and attack detection time.

The results demonstrated that the proposed model outperformed existing DBN, CNN, and LSTM approaches. It achieved a higher packet delivery ratio (96.10% for 60 nodes), improved throughput (263 kbps for 100 nodes), and lower routing overhead. Moreover, it exhibited faster attack detection times, outperforming LSTM, CNN, and DBN. Classification performance metrics further confirmed its superiority, with 95% accuracy, 90% specificity, and 100% sensitivity. These findings validate the model’s effectiveness in enhancing MANET security.

The proposed hybrid deep learning model shows promise in mitigating flooding attacks but has limitations. Its computational complexity increases with network size, limiting real-time use in large networks, and it requires substantial memory and processing power. Additionally, relying on MATLAB simulations may not fully reflect real-world MANET dynamics. Regular updates and retraining are also needed to adapt to evolving attack strategies.

In conclusion, while the hybrid models (CNN-LSTM and CNN-GRU) outperform baseline approaches, challenges like computational overhead and evolving attacks remain.

Check out the Paper. All credit for this research goes to the researchers of this project.


Protect your DeepSeek model deployments with Amazon Bedrock Guardrails

The rapid advancement of generative AI has brought powerful publicly available large language models (LLMs), such as DeepSeek-R1, to the forefront of innovation. The DeepSeek-R1 models are now accessible through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart, and distilled variants are available through Amazon Bedrock Custom Model Import. According to DeepSeek AI, these models offer strong capabilities in reasoning, coding, and natural language understanding. However, their deployment in production environments—like all models—requires careful consideration of data privacy requirements, appropriate management of bias in output, and the need for robust monitoring and control mechanisms.
Organizations adopting open source, open weights models such as DeepSeek-R1 have important opportunities to address several key considerations:

Enhancing security measures to prevent potential misuse, guided by resources such as OWASP LLM Top 10 and MITRE Atlas
Making sure to protect sensitive information
Fostering responsible content generation practices
Striving for compliance with relevant industry regulations

These concerns become particularly critical in highly regulated industries such as healthcare, finance, and government services, where data privacy and content accuracy are paramount.
This blog post provides a comprehensive guide to implementing robust safety protections for DeepSeek-R1 and other open weight models using Amazon Bedrock Guardrails. We’ll explore:

How to use the security features offered by Amazon Bedrock to protect your data and applications
Practical implementation of guardrails to prevent prompt attacks and filter harmful content
Implementing a robust defense-in-depth strategy

By following this guide, you’ll learn how to use the advanced capabilities of DeepSeek models while maintaining strong security controls and promoting ethical AI practices. Whether developing customer-facing generative AI applications or internal tools, these implementation patterns will help you meet your requirements for secure and responsible AI. By following this step-by-step approach, organizations can deploy open weights LLMs such as DeepSeek-R1 in line with best practices for AI safety and security.
DeepSeek models and deployment on Amazon Bedrock
DeepSeek AI, a company specializing in open weights foundation AI models, recently launched their DeepSeek-R1 models, which according to their paper have shown outstanding reasoning abilities and performance in industry benchmarks. According to third-party evaluations, these models consistently achieve top three rankings across various metrics, including quality index, scientific reasoning and knowledge, quantitative reasoning, and coding (HumanEval).
The company has further developed their portfolio by releasing six dense models derived from DeepSeek-R1, built on Llama and Qwen architectures, which they’ve made open weight models. These models are now accessible through AWS generative AI solutions: DeepSeek-R1 is available through Amazon Bedrock Marketplace and SageMaker JumpStart, while the Llama-based distilled versions can be implemented through Amazon Bedrock Custom Model Import.
Amazon Bedrock offers comprehensive security features to help secure hosting and operation of open source and open weights models while maintaining data privacy and regulatory compliance. Key features include data encryption at rest and in transit, fine-grained access controls, secure connectivity options, and various compliance certifications. Additionally, Amazon Bedrock provides guardrails for content filtering and sensitive information protection to support responsible AI use. AWS enhances these capabilities with extensive platform-wide security and compliance measures:

Data encryption at rest and in transit using AWS Key Management Service (AWS KMS)
Access management through AWS Identity and Access Management (IAM)
Network security through Amazon Virtual Private Cloud (Amazon VPC) deployment, VPC endpoints, and AWS Network Firewall for TLS inspection and strict policy rules
Service control policies (SCPs) for AWS account-level governance
Security groups and network access control lists (NACLs) for access restriction
Compliance certifications including HIPAA, SOC, ISO, and GDPR
FedRAMP High authorization in AWS GovCloud (US-West) for Amazon Bedrock
Monitoring and logging through Amazon CloudWatch and AWS CloudTrail

Organizations should customize these security settings based on their specific compliance and security needs when deploying to production environments. AWS conducts vulnerability scanning of all model containers as part of its security process and accepts only models in Safetensors format to help prevent unsafe code execution.
Amazon Bedrock Guardrails
Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. Amazon Bedrock Guardrails can also be integrated with other Amazon Bedrock tools including Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases to build safer and more secure generative AI applications aligned with responsible AI policies. To learn more, see the AWS Responsible AI page.
Core functionality
Amazon Bedrock Guardrails can be used in two ways. First, it can be integrated directly with the InvokeModel and Converse API calls, where guardrails are applied to both input prompts and model outputs during the inference process. This method is suitable for models hosted on Amazon Bedrock through the Amazon Bedrock Marketplace and Amazon Bedrock Custom Model Import. Alternatively, the ApplyGuardrail API offers a more flexible approach, allowing for independent evaluation of content without invoking a model. This second method is useful for assessing inputs or outputs at various stages of an application, or for working with custom or third-party models outside of Amazon Bedrock. Both approaches enable developers to implement safeguards customized to their use cases and aligned with responsible AI policies, ensuring secure and compliant interactions in generative AI applications.
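As a rough illustration of the second path, the ApplyGuardrail API can evaluate text independently of any model invocation. The guardrail ID and version below are placeholders, and the exact request and response shapes should be checked against the Amazon Bedrock API reference.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="YOUR_GUARDRAIL_ID",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="INPUT",  # evaluate a user prompt; use "OUTPUT" for model responses
    content=[{"text": {"text": "Tell me how to bypass the company firewall."}}],
)

if response["action"] == "GUARDRAIL_INTERVENED":
    print("Blocked:", response["outputs"])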
Key Amazon Bedrock Guardrails policies
Amazon Bedrock Guardrails provides the following configurable guardrail policies to help safely build generative AI applications at scale:

Content filters

Adjustable filtering intensity for harmful content
Predefined categories: Hate, Insults, Sexual Content, Violence, Misconduct, and Prompt Attacks
Multi-modal content including text and images (preview)

Topic filters

Capability to restrict specific topics
Prevention of unauthorized topics in both queries and responses

Word filters

Blocks specific words, phrases, and profanity
Custom filters for offensive language or competitor references

Sensitive information filters

Personally identifiable information (PII) blocking or masking
Support for custom regex patterns
Probabilistic detection for standard formats (such as SSN, DOB, and addresses)

Contextual grounding checks

Hallucination detection through source grounding
Query relevance validation

Automated Reasoning checks for hallucination prevention (gated preview)

Other capabilities
Model-agnostic implementation:

Compatible with all Amazon Bedrock foundation models
Supports fine-tuned models
Extends to external custom and third-party models through the ApplyGuardrail API

This comprehensive framework helps customers implement responsible AI, maintaining content safety and user privacy across diverse generative AI applications.
Solution overview

Guardrail configuration

Create a guardrail with specific policies tailored to your use case and configure the policies.

Integration with InvokeModel API

Call the Amazon Bedrock InvokeModel API with the guardrail identifier in your request.
When you make the API call, Amazon Bedrock applies the specified guardrail to both the input and output.
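For illustration, a hedged boto3 sketch of such a call follows; the imported model ARN, guardrail ID and version, and the request body format are placeholders to replace with your own values.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="arn:aws:bedrock:us-east-1:111122223333:imported-model/EXAMPLE",  # placeholder imported model ARN
    guardrailIdentifier="YOUR_GUARDRAIL_ID",  # placeholder
    guardrailVersion="1",                     # placeholder
    body=json.dumps({
        "prompt": "Summarize our internal security policy.",  # example prompt
        "max_gen_len": 512,                                   # body format depends on the imported model
    }),
)

result = json.loads(response["body"].read())
print(result)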

Guardrail evaluation process

Input evaluation: Before sending the prompt to the model, the guardrail evaluates the user input against the configured policies.
Parallel policy checking: For improved latency, the input is evaluated in parallel for each configured policy.
Input intervention: If the input violates any guardrail policies, a pre-configured blocked message is returned, and the model inference is discarded.
Model inference: If the input passes the guardrail checks, the prompt is sent to the specified model for inference.
Output evaluation: After the model generates a response, the guardrail evaluates the output against the configured policies.
Output intervention: If the model response violates any guardrail policies, it will be either blocked with a pre-configured message or have sensitive information masked, depending on the policy.
Response delivery: If the output passes all guardrail checks, the response is returned to the application without modifications.

Prerequisites
Before setting up guardrails for models imported using the Amazon Bedrock Custom Model Import feature, make sure you meet these prerequisites:

An AWS account with access to Amazon Bedrock along with the necessary IAM role with the required permissions. For centralized access management, we recommend that you use AWS IAM Identity Center.
Make sure that a custom model is already imported using the Amazon Bedrock Custom Model Import service. For illustration, we’ll use DeepSeek-R1-Distill-Llama-8B, which can be imported using Amazon Bedrock Custom Model Import. You have two options for deploying this model:

Follow the instructions in Deploy DeepSeek-R1 distilled Llama models to deploy DeepSeek’s distilled Llama model.
Use the notebook available from aws-samples for deployment.

You can create the guardrail using the AWS Management Console as explained in this blog post. Alternatively, you can follow this notebook for a programmatic example of how to create the guardrail in this solution. This notebook does the following:

Install the required dependencies
Create a guardrail using the boto3 API with filters that meet the use case mentioned previously
Configure the tokenizer for the imported model
Test Amazon Bedrock Guardrails using prompts that show the various guardrail filters in action

This approach integrates guardrails into both the user inputs and the model outputs. This makes sure that any potentially harmful or inappropriate content is intercepted during both phases of the interaction. For open weight distilled models imported using Amazon Bedrock Custom Model Import, Amazon Bedrock Marketplace, and Amazon SageMaker JumpStart, critical filters to implement include those for prompt attacks, content moderation, topic restrictions, and sensitive information protection.
Implementing a defense-in-depth strategy with AWS services
While Amazon Bedrock Guardrails provides essential content and prompt safety controls, implementing a comprehensive defense-in-depth strategy is crucial when deploying any foundation model, especially open weights models such as DeepSeek-R1. For detailed guidance on defense-in-depth approaches aligned with OWASP Top 10 for LLMs, see our previous blog post on architecting secure generative AI applications.
Key highlights include:

Developing organizational resiliency by starting with security in mind
Building on a secure cloud foundation using AWS services
Applying a layered defense strategy across multiple trust boundaries
Addressing the OWASP Top 10 risks for LLM applications
Implementing security best practices throughout the AI/ML lifecycle
Using AWS security services in conjunction with AI and machine learning (AI/ML)-specific features
Considering diverse perspectives and aligning security with business objectives
Preparing for and mitigating risks such as prompt injection and data poisoning

The combination of model-level controls (guardrails) with a defense-in-depth strategy creates a robust security posture that can help protect against:

Data exfiltration attempts
Unauthorized access to fine-tuned models or training data
Potential vulnerabilities in model implementation
Malicious use of AI agents and integrations

We recommend conducting thorough threat modeling exercises using AWS guidance for generative AI workloads before deploying any new AI/ML solutions. This helps align security controls with specific risk scenarios and business requirements.
Conclusion
Implementing safety protection for LLMs, including DeepSeek-R1 models, is crucial for maintaining a secure and ethical AI environment. By using Amazon Bedrock Guardrails with the Amazon Bedrock InvokeModel API and the ApplyGuardrails API, you can help mitigate the risks associated with advanced language models while still harnessing their powerful capabilities. However, it’s important to recognize that model-level protections are just one component of a comprehensive security strategy.
The strategies outlined in this post address several key security concerns that are common across various open weights models hosted on Amazon Bedrock using Amazon Bedrock Custom Model Import, Amazon Bedrock Marketplace, and through Amazon SageMaker JumpStart. These include potential vulnerabilities to prompt injection attacks, the generation of harmful content, and other risks identified in recent assessments. By implementing these guardrails alongside a defense-in-depth approach, organizations can significantly reduce the risk of misuse and better align their AI applications with ethical standards and regulatory requirements.
As AI technology continues to evolve, it’s essential to prioritize safety and responsible use of generative AI. Amazon Bedrock Guardrails provides a configurable and robust framework for implementing these safeguards, allowing developers to customize protection measures according to their specific use cases and organizational policies. We strongly recommend conducting thorough threat modeling of your AI workloads using AWS guidance to evaluate security risks and implementing appropriate controls across your entire technology stack.
Remember to regularly review and update not only your guardrails but all security controls to address new potential vulnerabilities and help maintain protection against emerging threats in the rapidly evolving landscape of AI security. While today we focus on DeepSeek-R1 models, the AI landscape is continuously evolving with new models emerging regularly. Amazon Bedrock Guardrails, combined with AWS security services and best practices, provides a consistent security framework that can adapt to protect your generative AI applications across various open weights models, both current and future. By treating security as a continuous process of assessment, improvement, and adaptation, organizations can confidently deploy innovative AI solutions while maintaining robust security controls.

About the Authors
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Fine-tune and host SDXL models cost-effectively with AWS Inferentia2

Building upon a previous Machine Learning Blog post to create personalized avatars by fine-tuning and hosting the Stable Diffusion 2.1 model at scale using Amazon SageMaker, this post takes the journey a step further. As technology continues to evolve, newer models are emerging, offering higher quality, increased flexibility, and faster image generation capabilities. One such groundbreaking model is Stable Diffusion XL (SDXL), released by StabilityAI, advancing the text-to-image generative AI technology to unprecedented heights. In this post, we demonstrate how to efficiently fine-tune the SDXL model using SageMaker Studio. We show how to then prepare the fine-tuned model to run on AWS Inferentia2 powered Amazon EC2 Inf2 instances, unlocking superior price performance for your inference workloads.
Solution overview
SDXL 1.0 is a text-to-image generation model developed by Stability AI, consisting of over 3 billion parameters. It comprises several key components, including a text encoder that converts input prompts into latent representations, and a U-Net model that generates images based on these latent representations through a diffusion process. Despite the impressive capabilities of a model trained on a public dataset, app builders sometimes need to generate images for a specific subject or style that is difficult or inefficient to describe in words. In that situation, fine-tuning is a great option to improve relevance using your own data.
One popular approach to fine-tuning SDXL is to use DreamBooth and Low-Rank Adaptation (LoRA) techniques. You can use DreamBooth to personalize the model by embedding a subject into its output domain using a unique identifier, effectively expanding its language-vision dictionary. This process uses a technique called prior preservation, which retains the model’s existing knowledge about the subject class (such as humans) while incorporating new information from the provided subject images. LoRA is an efficient fine-tuning method that attaches small adapter networks to specific layers of the pre-trained model, freezing most of its weights. By combining these techniques, you can generate a personalized model while tuning an order-of-magnitude fewer parameters, resulting in faster fine-tuning times and optimized storage requirements.
After the model is fine-tuned, you can compile and host the fine-tuned SDXL on Inf2 instances using the AWS Neuron SDK. By doing this, you can benefit from the higher performance and cost-efficiency offered by these specialized AI chips while taking advantage of the seamless integration with popular deep learning frameworks such as TensorFlow and PyTorch. To learn more, visit our Neuron documentation.
Prerequisites
Before you get started, review the list of services and instance types required to run the sample notebooks provided at this GitHub location.

Basic understanding of Stable Diffusion models. Refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker for more information.
General knowledge about foundation models (FMs) and how fine-tuning brings value. Read more on Fine-tune a foundation model.
An Amazon Web Services (AWS) account. Confirm your AWS identity has the requisite permissions, including the ability to create SageMaker resources (domain, model, and endpoints) and Amazon Simple Storage Service (Amazon S3) access to upload model artifacts. Alternatively, you can attach the AmazonSageMakerFullAccess managed policy to your AWS Identity and Access Management (IAM) user or role.
This notebook is tested using the default Python 3 kernel on SageMaker Studio. A GPU instance such as ml.g5.2xlarge is recommended. Refer to the documentation on setting up a domain for SageMaker Studio.
For compiling the fine-tuned model, an inf2.8xlarge or larger Amazon Elastic Compute Cloud (Amazon EC2) instance with the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) is required. The instance comes with the required Neuron drivers, libraries, and JupyterLab preinstalled.

By following these prerequisites, you will have the necessary knowledge and AWS resources to run the sample notebooks and work with Stable Diffusion models and FMs on Amazon SageMaker.
Fine-tuning SDXL on SageMaker
To fine-tune SDXL on SageMaker, follow the steps in the next sections.
Prepare the images
The first step in fine-tuning the SDXL model is to prepare your training images. Using the DreamBooth technique, you need as few as 10–12 images for fine-tuning. It’s recommended to provide a variety of images to help the model better understand and generalize your facial features.
The training images should include selfies taken from different angles, covering various perspectives of your face. Include images with different facial expressions, such as smiling, frowning, and neutral. Preferably, use images with different backgrounds to help the model identify the subject more effectively. By providing a diverse set of images, DreamBooth can better identify the subject from the pictures and generalize your facial features. The following set of images demonstrate this.

Additionally, use 1024×1024 pixel square images for fine-tuning. To simplify the process of preparing the images, there is a utility function that automatically crops and adjusts your images to the correct dimensions.
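That utility function is not reproduced here; a minimal center-crop-and-resize sketch with Pillow, assuming 1024×1024 squares are the target and hypothetical input and output folders, might look like the following.
from pathlib import Path
from PIL import Image

def prepare_image(src: str, dst: str, size: int = 1024) -> None:
    """Center-crop an image to a square and resize it to size x size pixels."""
    img = Image.open(src).convert("RGB")
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side)).resize((size, size), Image.LANCZOS)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    img.save(dst)

for path in Path("raw_selfies").glob("*.jpg"):       # hypothetical input folder
    prepare_image(str(path), f"training_images/{path.name}")  # hypothetical output folder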
Train the personalized model
After the images are prepared, you can begin the fine-tuning process. To achieve this, you use the autoTrain library from Hugging Face, an automatic and user-friendly approach to training and deploying state-of-the-art machine learning (ML) models. Seamlessly integrated with the Hugging Face ecosystem, autoTrain is designed to be accessible, and individuals can train custom models without extensive technical expertise or coding proficiency. To use autoTrain, use the following example code:
!autotrain dreambooth \
--prompt "${INSTANCE_PROMPT}" \
--class-prompt "${CLASS_PROMPT}" \
--model ${MODEL_NAME} \
--project-name ${PROJECT_NAME} \
--image-path "${IMAGE_PATH}" \
--resolution ${RESOLUTION} \
--batch-size ${BATCH_SIZE} \
--num-steps ${NUM_STEPS} \
--gradient-accumulation ${GRADIENT_ACCUMULATION} \
--lr ${LEARNING_RATE} \
--fp16 \
--gradient-checkpointing
First, you need to set the prompt and class-prompt. The prompt should include a unique identifier or token that the model can associate with the subject. The class-prompt is used to supplement the training with other subjects of the same class, a requirement of the DreamBooth technique that helps the model better associate the new token with the subject of interest. This is why the DreamBooth technique can generate exceptional fine-tuned results with fewer input images. Additionally, you’ll notice that even though you didn’t provide examples of the top or back of your head, the model still knows how to generate them because of the class prompt. In this example, you are using <<TOK>> as a unique identifier to avoid a name that the model might already be familiar with.
instance_prompt = "photo of <<TOK>>"
class_prompt = "photo of a person"
Next, you need to provide the model, image-path, and project-name. The model name specifies the base model to load from the Hugging Face Hub or a local path. The image-path is the location of the training images. By default, autoTrain uses LoRA, a parameter-efficient way to fine-tune. Unlike traditional fine-tuning, LoRA fine-tunes by attaching a small transformer adapter model to the base model. Only the adapter weights are updated during training to achieve fine-tuning behavior. Additionally, these adapters can be attached and detached at any time, making them highly efficient for storage as well. These supplementary LoRA adapters are 98% smaller than the original model, allowing us to store and share the LoRA adapters without having to duplicate the base model repeatedly. The following diagram illustrates these concepts.

The rest of the configuration parameters are as follows. We recommend starting with these values and adjusting them only if the fine-tuning results don’t meet your expectations.
resolution = 1024 # resolution or size of the generated images
batch_size = 1 # number of samples in one forward and backward pass
num_steps = 500 # number of training steps
gradient_accumulation = 4 # accumulating gradients over number of batches
learning_rate = 1e-4 # step size
fp16 # half-precision
gradient-checkpointing # technique to reduce memory consumption during training
The entire training process takes about 30 minutes with the preceding configuration. After the training is done, you can load the LoRA adapter, as shown in the following code, and generate fine-tuned images.
from diffusers import DiffusionPipeline, StableDiffusionXLImg2ImgPipeline
import random
import torch

seed = random.randint(0, 100000)

# load the base model (model_name_base and device are defined earlier in the notebook)
pipeline = DiffusionPipeline.from_pretrained(
    model_name_base,
    torch_dtype=torch.float16,
).to(device)

# attach the LoRA adapter
pipeline.load_lora_weights(
    project_name,
    weight_name="pytorch_lora_weights.safetensors",
)

# generate fine-tuned images (prompt and negative_prompt are defined earlier in the notebook)
generator = torch.Generator(device).manual_seed(seed)
base_image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    generator=generator,
    height=1024,
    width=1024,
    output_type="pil",
).images[0]
base_image

Deploy on Amazon EC2 Inf2 instances
In this section, you learn to compile and host the fine-tuned SDXL model on Inf2 instances. To begin, you need to clone the repository and upload the LoRA adapter onto the Inf2 instance created in the prerequisites section. Then, run the compilation notebook to compile the fine-tuned SDXL model using the Optimum Neuron library. Visit the Optimum Neuron page for more details.
The NeuronStableDiffusionXLPipeline class in Optimum Neuron now has direct support for LoRA. All you need to do is supply the base model, the LoRA adapters, and the model input shapes to start the compilation process. The following code snippet illustrates how to compile and then export the compiled model to a local directory.
from optimum.neuron import NeuronStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
adapter_id = "lora"
input_shapes = {"batch_size": 1, "height": 1024, "width": 1024, "num_images_per_prompt": 1}

# Compile
pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id,
    export=True,
    lora_model_ids=adapter_id,
    lora_weight_names="pytorch_lora_weights.safetensors",
    lora_adapter_names="sttirum",
    **input_shapes,
)

# Save locally or upload to the HuggingFace Hub
save_directory = "sd_neuron_xl/"
pipe.save_pretrained(save_directory)
The compilation process takes about 35 minutes. After the process is complete, you can use the NeuronStableDiffusionXLPipeline again to load the compiled model back.
from optimum.neuron import NeuronStableDiffusionXLPipeline

stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl")
You can then test the model on Inf2 and make sure that you can still generate the fine-tuned results.
import torch

# Run pipeline
prompt = """
photo of <<TOK>> , 3d portrait, ultra detailed, gorgeous, 3d zbrush, trending on dribbble, 8k render
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred,
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile,
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = 491057365
generator = [torch.Generator().manual_seed(seed)]
image = stable_diffusion_xl(prompt,
    num_inference_steps=50,
    guidance_scale=7,
    negative_prompt=negative_prompt,
    generator=generator).images[0]

Here are a few avatar images generated using the fine-tuned model on Inf2. The corresponding prompts are the following:

emoji of << TOK >>, astronaut, space ship background
oil painting of << TOK >>, business woman, suit
photo of << TOK >> , 3d portrait, ultra detailed, 8k render
anime of << TOK >>, ninja style, dark hair

Clean up
To avoid incurring AWS charges after you finish testing this example, make sure you delete the following resources:

Amazon SageMaker Studio Domain
Amazon EC2 Inf2 instance

Conclusion
This post has demonstrated how to fine-tune the Stable Diffusion XL (SDXL) model using DreamBooth and LoRA techniques on Amazon SageMaker, enabling enterprises to generate highly personalized and domain-specific images tailored to their unique requirements using as few as 10–12 training images. By using these techniques, businesses can rapidly adapt the SDXL model to their specific needs, unlocking new opportunities to enhance customer experiences and differentiate their offerings. Moreover, we showcased the process of compiling and deploying the fine-tuned SDXL model for inference on AWS Inferentia2 powered Amazon EC2 Inf2 instances, which deliver an unparalleled price-to-performance ratio for generative AI workloads, enabling enterprises to host fine-tuned SDXL models at scale in a cost-efficient manner. We encourage you to try the example and share your creations with us using hashtags #sagemaker #mme #genai on social platforms. We would love to see what you make.
For more examples about AWS Neuron, refer to aws-neuron-samples.

About the Authors
Deepti Tirumala is a Senior Solutions Architect at Amazon Web Services, specializing in Machine Learning and Generative AI technologies. With a passion for helping customers advance their AWS journey, she works closely with organizations to architect scalable, secure, and cost-effective solutions that leverage the latest innovations in these areas.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.
Diwakar Bansal is a Principal GenAI Specialist focused on business development and go-to-market for GenAI and machine learning accelerated computing services. Diwakar has led product definition, global business development, and marketing of technology products in the fields of IoT, edge computing, and autonomous driving, focusing on bringing AI and machine learning to these domains. Diwakar is passionate about public speaking and thought leadership in the Cloud and GenAI space.

How Aetion is using generative AI and Amazon Bedrock to translate scie …

This post is co-written with Javier Beltrán, Ornela Xhelili, and Prasidh Chhabri from Aetion. 
For decision-makers in healthcare, it is critical to gain a comprehensive understanding of patient journeys and health outcomes over time. Scientists, epidemiologists, and biostatisticians implement a vast range of queries to capture complex, clinically relevant patient variables from real-world data. These variables often involve complex sequences of events, combinations of occurrences and non-occurrences, as well as detailed numeric calculations or categorizations that accurately reflect the diverse nature of patient experiences and medical histories. Expressing these variables as natural language queries allows users to express scientific intent and explore the full complexity of the patient timeline.
Aetion is a leading provider of decision-grade real-world evidence software to biopharma, payors, and regulatory agencies. The company provides comprehensive solutions to healthcare and life science customers to rapidly and transparently transform real-world data into real-world evidence.
At the core of the Aetion Evidence Platform (AEP) are Measures—logical building blocks used to flexibly capture complex patient variables, enabling scientists to customize their analyses to address the nuances and challenges presented by their research questions. AEP users can use Measures to build cohorts of patients and analyze their outcomes and characteristics.
A user asking a scientific question aims to translate scientific intent, such as “I want to find patients with a diagnosis of diabetes and a subsequent metformin fill,” into algorithms that capture these variables in real-world data. To facilitate this translation, Aetion developed a Measures Assistant to turn users’ natural language expressions of scientific intent into Measures.
In this post, we review how Aetion is using Amazon Bedrock to help streamline the analytical process toward producing decision-grade real-world evidence and enable users without data science expertise to interact with complex real-world datasets.
Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI startups and Amazon through a unified API. It offers a wide range of FMs, allowing you to choose the model that best suits your specific use case.
Aetion’s technology
Aetion is a healthcare software and services company that uses the science of causal inference to generate real-world evidence on the safety, effectiveness, and value of medications and clinical interventions. Aetion has partnered with the majority of top 20 biopharma, leading payors, and regulatory agencies.
Aetion brings deep scientific expertise and technology to life sciences, regulatory agencies (including FDA and EMA), payors, and health technology assessment (HTA) customers in the US, Canada, Europe, and Japan with analytics that can achieve the following:

Optimize clinical trials by identifying target populations, creating external control arms, and contextualizing settings and populations underrepresented in controlled settings
Expand industry access through label changes, pricing, coverage, and formulary decisions
Conduct safety and effectiveness studies for medications, treatments, and diagnostics

Aetion’s applications, including Discover and Substantiate, are powered by the AEP, a core longitudinal analytic engine capable of applying rigorous causal inference and statistical methods to hundreds of millions of patient journeys.
AetionAI, Aetion’s set of generative AI capabilities, is embedded across the AEP and applications. Measures Assistant is an AetionAI feature in Substantiate.
The following figure illustrates the organization of Aetion’s services.

Measures Assistant
Users build analyses in Aetion Substantiate to turn real-world data into decision-grade real-world evidence. The first step is capturing patient variables from real-world data. Substantiate offers a wide range of Measures, as illustrated in the following screenshot. Measures can often be chained together to capture complex variables.

Suppose the user is assessing a therapy’s cost-effectiveness to help negotiate drug coverage with payors. The first step in this analysis is to filter out negative cost values that might appear in claims data. The user can ask AetionAI how to implement this, as shown in the following screenshot.

In another scenario, a user might want to define an outcome in their analysis as the change in hemoglobin over successive lab tests following the start of treatment. A user asks Measures Assistant a question expressed in natural language and receives instructions on how to implement this.

Solution overview
Patient datasets are ingested into the AEP and transformed into a longitudinal (timeline) format. AEP references this data to generate cohorts and run analyses. Measures are the variables that determine conditions for cohort entry, inclusion or exclusion, and the characteristics of a study.
The following diagram illustrates the solution architecture.

Measures Assistant is a microservice deployed in a Kubernetes environment on AWS and accessed through a REST API. Data transmitted to the service is encrypted using Transport Layer Security (TLS) 1.2. When a user asks a question through the assistant UI, Substantiate initiates a request containing the question and the previous message history, if available. Measures Assistant incorporates the question into a prompt template and calls the Amazon Bedrock API to invoke Anthropic’s Claude 3 Haiku. The user-provided prompts and the requests sent to the Amazon Bedrock API are also encrypted using TLS 1.2.
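To make this flow concrete, the following minimal sketch shows what such a call to Amazon Bedrock could look like using boto3 and the Anthropic Messages request format; it is not Aetion’s production code, and the model ID, prompt template, helper name, and token limit are illustrative assumptions.
import json
import boto3

# Assumed Amazon Bedrock model ID for Anthropic's Claude 3 Haiku; confirm availability in your Region
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

bedrock = boto3.client("bedrock-runtime")

def ask_measures_assistant(question, history):
    # Hypothetical prompt template: the task definition, documentation extracts,
    # retrieved question-and-answer examples, and behavioral rules would be filled in here
    prompt = (
        "You are an assistant that translates scientific intent into AEP Measures.\n"
        f"Conversation so far: {json.dumps(history)}\n"
        f"Question: {question}"
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    # invoke_model sends the request over TLS and returns a streaming response body
    response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]

print(ask_measures_assistant("How do I filter out negative cost values?", history=[]))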
Aetion chose to use Amazon Bedrock for working with large language models (LLMs) due to its vast model selection from multiple providers, security posture, extensibility, and ease of use. Anthropic’s Claude 3 Haiku LLM was found to be more efficient in runtime and cost than available alternatives.
Measures Assistant maintains a local knowledge base about AEP Measures, curated by scientific experts at Aetion, and incorporates this information into its responses as guardrails. These guardrails make sure the service returns valid instructions to the user and compensate for logical reasoning errors that the core model might exhibit.
The Measures Assistant prompt template contains the following information:

A general definition of the task the LLM is performing.
Extracts of AEP documentation, describing each Measure type covered, its input and output types, and how to use it.
An in-context learning technique that includes semantically relevant solved questions and answers in the prompt.
Rules to condition the LLM to behave in a certain manner. For example, how to react to unrelated questions, keep sensitive data secure, or restrict its creativity in developing invalid AEP settings.

To streamline the process, Measures Assistant uses templates composed of two parts:

Static – Fixed instructions used with every user question, covering a broad range of well-defined guidance for Measures Assistant.
Dynamic – Questions and answers dynamically selected from a local knowledge base based on semantic proximity to the user question. These examples improve the quality of the generated answers by incorporating similar, previously asked and answered questions into the prompt. This technique models a small-scale, optimized, in-process knowledge base for a Retrieval Augmented Generation (RAG) pattern.

Mixedbread’s mxbai-embed-large-v1 Sentence Transformer was fine-tuned to generate sentence embeddings for the question-and-answer local knowledge base and for users’ questions. Question similarity is calculated as the cosine similarity between embedding vectors.
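As a rough illustration of this retrieval step, the following sketch uses the publicly available mixedbread-ai/mxbai-embed-large-v1 checkpoint with the sentence-transformers library; the knowledge-base entries, helper name, and top-k value are invented for the example, and in practice Aetion’s fine-tuned model and expert-curated question-and-answer pool would take their place.
from sentence_transformers import SentenceTransformer, util

# Hypothetical question-and-answer knowledge base; Aetion's pool is curated by subject matter experts
knowledge_base = [
    {"q": "How do I filter out negative cost values?", "a": "Chain a filter Measure that keeps cost values greater than zero ..."},
    {"q": "How do I capture the change in hemoglobin after treatment start?", "a": "Combine lab-test Measures to compute the difference across successive tests ..."},
]

# Off-the-shelf checkpoint; the production service uses a fine-tuned version of this model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
kb_embeddings = model.encode([item["q"] for item in knowledge_base], convert_to_tensor=True)

def retrieve_examples(user_question, top_k=2):
    # Embed the user question and rank knowledge-base questions by cosine similarity
    query_embedding = model.encode(user_question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, kb_embeddings)[0]
    best = scores.argsort(descending=True)[:top_k]
    return [knowledge_base[i] for i in best.tolist()]

# The selected pairs are appended to the static instructions before the prompt is sent to the LLM
print(retrieve_examples("How can I exclude claims with negative costs?"))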
The generation and maintenance of the question-and-answer pool involve a human in the loop. Subject matter experts continuously test Measures Assistant, and question-and-answer pairs are used to refine it continually to optimize the user experience.
Outcomes
Our implementation of AetionAI capabilities enables users to express scientific intent in natural language and have it translated into algorithms that capture these variables in real-world data. Users can now turn questions expressed in natural language into Measures in a matter of minutes rather than days, without the need for support staff or specialized training.
Conclusion
In this post, we covered how Aetion uses AWS services to streamline the user’s path from defining scientific intent to running a study and obtaining results. Measures Assistant enables scientists to implement complex studies and iterate on study designs, instantaneously receiving guidance through responses to quick, natural language queries.
Aetion is continuing to refine the knowledge base available to Measures Assistant and expand innovative generative AI capabilities across its product suite to help improve the user experience and ultimately accelerate the process of turning real-world data into real-world evidence.
With Amazon Bedrock, the future of innovation is at your fingertips. Explore Generative AI Application Builder on AWS to learn more about building generative AI capabilities to unlock new insights, build transformative solutions, and shape the future of healthcare today.

About the Authors
Javier Beltrán is a Senior Machine Learning Engineer at Aetion. His career has focused on natural language processing, and he has experience applying machine learning solutions to various domains, from healthcare to social media.
Ornela Xhelili is a Staff Machine Learning Architect at Aetion. Ornela specializes in natural language processing, predictive analytics, and MLOps, and holds a Master’s of Science in Statistics. Ornela has spent the past 8 years building AI/ML products for tech startups across various domains, including healthcare, finance, analytics, and ecommerce.
Prasidh Chhabri is a Product Manager at Aetion, leading the Aetion Evidence Platform, core analytics, and AI/ML capabilities. He has extensive experience building quantitative and statistical methods to solve problems in human health.
Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare life sciences customers and specializes in data analytics services. Mikhail has more than 20 years of industry experience covering a wide range of technologies and sectors.

4 Open-Source Alternatives to OpenAI’s $200/Month Deep Research AI Agent

OpenAI’s Deep Research AI Agent offers a powerful research assistant at a premium price of $200 per month. However, the open-source community has stepped up to provide cost-effective and customizable alternatives. Here are four fully open-source AI research agents that can rival OpenAI’s offering:

1. Deep-Research

Overview: Deep-Research is an iterative research agent that autonomously generates search queries, scrapes websites, and processes information using AI reasoning models. It aims to provide a structured approach to deep research tasks.

Key Features:

Query Generation: Dynamically generates optimized search queries.

Web Scraping with Firecrawl: Extracts useful information from websites.

o3-Mini Model for Reasoning: Uses OpenAI’s o3-mini model for intelligent processing.

100% Open Source: Fully accessible and modifiable.

GitHub Repository: https://github.com/dzhng/deep-research

2. OpenDeepResearcher

Overview: OpenDeepResearcher is an asynchronous AI research agent designed to conduct comprehensive research iteratively. It utilizes multiple search engines, content extraction tools, and LLM APIs to provide detailed insights.

Key Features:

SERP API Integration: Automates iterative search queries.

Jina AI for Content Extraction: Extracts and summarizes webpage content.

OpenRouter LLM Processing: Utilizes various open LLMs for reasoning.

100% Open Source: Offers flexibility in customization and deployment.

GitHub Repository: https://github.com/mshumer/OpenDeepResearcher

3. Open Deep Research by Firecrawl

Overview: Open Deep Research is a lightweight and efficient AI research agent that leverages Firecrawl search and extraction mechanisms. It allows users to reason with any LLM of their choice rather than relying on a fine-tuned proprietary model.

Key Features:

Firecrawl Search + Extract: Fetches and extracts relevant content efficiently.

Customizable AI Reasoning: Supports any LLM via the AI SDK.

Open Source & Self-Hostable: Full control over deployment and customization.

GitHub Repository: https://github.com/nickscamara/open-deep-research

4. DeepResearch by Jina AI

Overview: DeepResearch by Jina AI is an advanced AI research assistant that replicates OpenAI’s agentic search, read, and reasoning workflow. It integrates multiple search engines and employs an AI-driven approach to extract and summarize relevant information.

Key Features:

Search Integration: Uses Gemini Flash, Brave, and DuckDuckGo for diverse search results.

AI-Powered Reading: Implements Jina Reader to extract and summarize content efficiently.

Reasoning Process: Uses advanced AI models for contextual understanding.

100% Open Source: Fully customizable and self-hostable.

GitHub Repository: https://github.com/jina-ai/node-DeepResearch

Conclusion

These four open-source AI research agents provide powerful alternatives to OpenAI’s Deep Research AI Agent. With robust search capabilities, AI-powered extraction, and reasoning features, they enable researchers to automate and optimize their workflows without incurring high costs. Since all options are open-source, users have complete flexibility to modify, extend, and self-host these tools based on their specific needs.

This article is inspired by this Tweet.


Meet Satori: A New AI Framework for Advancing LLM Reasoning through Deep Thinking without a Strong Teacher Model

Large Language Models (LLMs) have demonstrated notable reasoning capabilities in mathematical problem-solving, logical inference, and programming. However, their effectiveness is often contingent on two approaches: supervised fine-tuning (SFT) with human-annotated reasoning chains and inference-time search strategies guided by external verifiers. While supervised fine-tuning offers structured reasoning, it requires significant annotation effort and is constrained by the quality of the teacher model. Inference-time search techniques, such as verifier-guided sampling, enhance accuracy but increase computational demands. This raises an important question: Can an LLM develop reasoning capabilities independently, without relying on extensive human supervision or external verifiers? To address this, researchers have introduced Satori, a 7B parameter LLM designed to internalize reasoning search and self-improvement mechanisms.

Introducing Satori: A Model for Self-Reflective and Self-Exploratory Reasoning

Researchers from MIT, Singapore University of Technology and Design, Harvard, MIT-IBM Watson AI Lab, IBM Research, and UMass Amherst propose Satori, a model that employs autoregressive search—a mechanism enabling it to refine its reasoning steps and explore alternative strategies autonomously. Unlike models that rely on extensive fine-tuning or knowledge distillation, Satori enhances reasoning through a novel Chain-of-Action-Thought (COAT) reasoning paradigm. Built upon Qwen-2.5-Math-7B, Satori follows a two-stage training framework: small-scale format tuning (FT) and large-scale self-improvement via reinforcement learning (RL).

Technical Details and Benefits of Satori

Satori’s training framework consists of two stages:

Format Tuning (FT) Stage:

A small-scale dataset (~10K samples) is used to introduce COAT reasoning, which includes three meta-actions:

Continue (<|continue|>): Extends the reasoning trajectory.

Reflect (<|reflect|>): Prompts a self-check on previous reasoning steps.

Explore (<|explore|>): Encourages the model to consider alternative approaches.

Unlike conventional CoT training, which follows predefined reasoning paths, COAT enables dynamic decision-making during reasoning, as illustrated in the sketch following the training overview below.

Reinforcement Learning (RL) Stage:

A large-scale self-improvement process using Reinforcement Learning with Restart and Explore (RAE).

The model restarts reasoning from intermediate steps, refining its problem-solving approach iteratively.

A reward model assigns scores based on self-corrections and exploration depth, leading to progressive learning.
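To make the COAT format concrete, here is a small illustrative sketch, not taken from the Satori paper or code, of how a reasoning trace punctuated by the <|continue|>, <|reflect|>, and <|explore|> meta-action tokens could be segmented into labeled steps; the example trace and helper name are invented.
import re

# The three COAT meta-action tokens introduced during format tuning
META_ACTIONS = ("<|continue|>", "<|reflect|>", "<|explore|>")

def segment_coat_trace(trace):
    # Split on the meta-action tokens while keeping them, then pair each token with its text
    pattern = "(" + "|".join(re.escape(token) for token in META_ACTIONS) + ")"
    parts = re.split(pattern, trace)
    segments, current_action = [], "<|continue|>"  # assume the trace starts by continuing
    for part in parts:
        if part in META_ACTIONS:
            current_action = part
        elif part.strip():
            segments.append((current_action, part.strip()))
    return segments

# Invented example trace for illustration only
trace = (
    "Compute 12 * 7 = 84. <|reflect|> Check: 12 * 7 is indeed 84, so the step holds. "
    "<|explore|> Alternatively, 12 * 7 = 12 * (5 + 2) = 60 + 24 = 84."
)
for action, text in segment_coat_trace(trace):
    print(action, "->", text)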

Insights

Evaluations show that Satori performs strongly on multiple benchmarks, often surpassing models that rely on supervised fine-tuning or knowledge distillation. Key findings include:

Mathematical Benchmark Performance:

Satori outperforms Qwen-2.5-Math-7B-Instruct on datasets such as GSM8K, MATH500, OlympiadBench, AMC2023, and AIME2024.

Self-improvement capability: With additional reinforcement learning rounds, Satori demonstrates continuous refinement without additional human intervention.

Out-of-Domain Generalization:

Despite training primarily on mathematical reasoning, Satori exhibits strong generalization to diverse reasoning tasks, including logical reasoning (FOLIO, BoardgameQA), commonsense reasoning (StrategyQA), and tabular reasoning (TableBench).

This suggests that RL-driven self-improvement enhances adaptability beyond mathematical contexts.

Efficiency Gains:

Compared to conventional supervised fine-tuning, Satori achieves similar or better reasoning performance with significantly fewer annotated training samples (10K vs. 300K for comparable models).

This approach reduces reliance on extensive human annotations while maintaining effective reasoning capabilities.

Conclusion: A Step Toward Autonomous Learning in LLMs

Satori presents a promising direction in LLM reasoning research, demonstrating that models can refine their own reasoning without external verifiers or high-quality teacher models. By integrating COAT reasoning, reinforcement learning, and autoregressive search, Satori shows that LLMs can iteratively improve their reasoning abilities. This approach not only enhances problem-solving accuracy but also broadens generalization to unseen tasks. Future work may explore refining meta-action frameworks, optimizing reinforcement learning strategies, and extending these principles to broader domains.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
