Open Thoughts: An Open Source Initiative Advancing AI Reasoning with High-Quality Datasets and Models Like OpenThoughts-114k and OpenThinker-7B

Restricted access to high-quality reasoning datasets has limited open-source progress in AI-driven logical and mathematical reasoning. While proprietary models have leveraged structured reasoning demonstrations to enhance performance, these datasets and methodologies remain closed, restricting independent research and innovation. The lack of open, scalable reasoning datasets has created a bottleneck for AI development.

Over recent years, models such as SkyT1, STILL-2, and DeepSeek-R1 have demonstrated that a relatively small set of high-quality reasoning demonstrations, on the order of hundreds of thousands of examples, can substantially enhance a model’s ability to perform complex logical and mathematical reasoning tasks. Still, most reasoning datasets and the methodologies behind their creation remain proprietary, limiting access to crucial resources necessary for further exploration in the field.

The Open Thoughts initiative, led by Bespoke Labs and the DataComp community from Stanford, UC Berkeley, UT Austin, UW, UCLA, UNC, TRI, and LAION, is an ambitious open-source project that aims to curate and develop high-quality reasoning datasets to address these concerns about dataset availability. The project seeks to establish the best open reasoning datasets for enhancing language models’ cognitive capabilities, providing publicly available, state-of-the-art reasoning datasets along with the data generation strategies behind them. As a first step, the team has released the OpenThoughts-114k reasoning dataset and the associated OpenThinker-7B model. Let’s look at each in turn.


The OpenThoughts-114k Dataset: A New Standard in Open Reasoning Data

This dataset was designed to provide a large-scale, high-quality corpus of reasoning demonstrations to improve language models’ reasoning abilities. OpenThoughts-114k is an extension of previous datasets like Bespoke-Stratos-17k, which only contained 17,000 examples. By scaling up to 114,000 reasoning examples, this dataset has improved performance on various reasoning benchmarks. OpenThoughts-114k was generated using reasoning distillation techniques inspired by DeepSeek-R1, which showed that synthetic reasoning demonstrations could be produced efficiently and at scale. This dataset incorporates diverse reasoning challenges, ranging from mathematical problem-solving to logical deduction, thereby serving as a valuable resource for improving model robustness across multiple reasoning domains.

OpenThinker-7B: A Model for Advanced Reasoning

Alongside the release of OpenThoughts-114k, the Open Thoughts team also introduced OpenThinker-7B, a fine-tuned version of Qwen-2.5-7B-Instruct. The model was trained specifically on OpenThoughts-114k and improves substantially over its predecessors. Training took roughly 20 hours on four 8xH100 nodes, using the Transformers 4.46.1 library and PyTorch 2.3.0 to ensure compatibility with widely used ML frameworks.

In some reasoning tasks, OpenThinker-7B outperforms comparable models such as Bespoke-Stratos-7B, DeepSeek-R1-Distill-Qwen-7B, and even GPT-4o. Benchmarked using Evalchemy, it scored 43.3% on AIME24, 83.0% on MATH500, 42.4% on GPQA-D, 75.3% on LCB Easy, and 28.6% on LCB Medium. These results indicate that OpenThinker-7B is a formidable open-source alternative to proprietary reasoning models.


Fully Open-Source: Weights, Data, and Code

A defining feature of the Open Thoughts project is its commitment to full transparency. Unlike proprietary models such as GPT-4o and o1-mini, which keep their datasets and training methodologies closed, OpenThinker-7B and OpenThoughts-114k are entirely open-source. This means:

Open Model Weights: The OpenThinker-7B model weights are publicly accessible, allowing researchers and developers to fine-tune and build upon the model.

Open Data: The OpenThoughts-114k dataset is freely available for anyone to use, modify, and expand.

Open Code: The data generation, evaluation, and training code for OpenThinker-7B are all hosted on GitHub, ensuring complete transparency and reproducibility.
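As a minimal illustration of that openness, the dataset and model can be pulled directly from the Hugging Face Hub using the repository IDs from the project links below. This is a hedged sketch: the split name and generation settings are assumptions, not documented defaults.

# Sketch: load the open dataset and model from the Hugging Face Hub.
# Repo IDs come from the project links; split name and generation settings are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
print(dataset[0])  # inspect one reasoning demonstration

tokenizer = AutoTokenizer.from_pretrained("open-thoughts/OpenThinker-7B")
model = AutoModelForCausalLM.from_pretrained("open-thoughts/OpenThinker-7B", device_map="auto")

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))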

The Open Thoughts project is only in its early stages, with plans for further expansion. Some potential future directions include:

Future iterations of OpenThoughts could incorporate millions of reasoning examples, covering a broader spectrum of cognitive challenges.

OpenThinker-7B is an excellent starting point, but larger models fine-tuned on even more data could further push the boundaries of reasoning capabilities.

The team also hopes to encourage more researchers, engineers, and AI enthusiasts to contribute to dataset creation, model training, and evaluation methodologies.

In conclusion, Open Thoughts represents a transformative effort to democratize AI reasoning. By launching OpenThoughts-114k and OpenThinker-7B as open-source resources, the project empowers the AI community with high-quality data and models to advance reasoning research. With continued collaboration and expansion, Open Thoughts has the potential to redefine how AI approaches logical, mathematical, and cognitive reasoning tasks.

Sources

https://github.com/open-thoughts/open-thoughts 

https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k 

https://huggingface.co/open-thoughts/OpenThinker-7B 

https://www.open-thoughts.ai/blog/launch 

“We are announcing Open Thoughts, our large-scale open-source effort to curate the best open reasoning datasets! DeepSeek-R1 is amazing but we still don’t have access to high-quality open reasoning datasets. These datasets are crucial if you want to build your reasoning models!” — Mahesh Sathiamoorthy (@madiator), January 28, 2025



Decoupling Tokenization: How Over-Tokenized Transformers Redefine Vocabulary Scaling in Language Models

Tokenization plays a fundamental role in the performance and scalability of Large Language Models (LLMs). Despite being a critical component, its influence on model training and efficiency remains underexplored. While larger vocabularies can compress sequences and reduce computational costs, existing approaches tie input and output vocabularies together, creating trade-offs where scaling benefits larger models but harms smaller ones. This paper introduces a framework called Over-Tokenized Transformers that reimagines vocabulary design by decoupling input and output tokenization, unlocking new pathways for model efficiency and performance.

Reference: https://arxiv.org/pdf/2501.16975

Traditional tokenization methods use identical vocabularies for input processing and output prediction. While larger vocabularies allow models to process longer n-gram tokens (e.g., multi-character sequences), they force smaller models to handle overly granular output predictions, increasing underfitting risks. For instance, a 3-gram tokenizer reduces sequence length by 66% but requires predicting three characters jointly—a task manageable for large models but overwhelming for smaller ones. Previous work like multi-token prediction (MTP) attempted to address this by predicting future tokens in parallel, but these methods still entangled input/output granularity and struggled with smaller architectures.

The research team identified a critical insight through synthetic experiments with context-free grammars: input and output vocabularies influence models differently. Larger input vocabularies consistently improved all model sizes by enriching context representations through multi-gram embeddings. Conversely, larger output vocabularies introduced fine-grained prediction tasks that only benefited sufficiently large models. This dichotomy motivated their Over-Tokenized framework, which separates input encoding (Over-Encoding) and output decoding (Over-Decoding) vocabularies.

Over-Encoding (OE) scales input vocabularies exponentially using hierarchical n-gram embeddings. Instead of a single token ID, each input token is represented as the sum of 1-, 2-, and 3-gram embeddings. For example, the word “cat” might decompose into embeddings for “c,” “ca,” and “cat,” allowing the model to capture multi-scale contextual cues. To avoid impractical memory costs from large n-gram tables (e.g., 100k³ entries), the team used parameter-efficient techniques:

Modulo-based token hashing: Maps n-gram tokens to a fixed-size embedding table using modular arithmetic, enabling dynamic vocabulary expansion without storing all possible combinations.

Embedding decomposition: Splits high-dimensional embeddings into smaller, stacked matrices, reducing memory access costs while preserving representational capacity.
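To make the hashing idea above more concrete, here is a minimal sketch of hierarchical 1-/2-/3-gram input embeddings with modulo hashing. This is an illustration rather than the paper’s implementation; the table size, tensor shapes, and way the n-gram IDs are composed are assumptions.

# Illustrative sketch (not the paper's code) of Over-Encoding-style input embeddings:
# each position sums its 1-, 2-, and 3-gram embeddings, with higher-order n-grams
# folded into a fixed-size table via modular hashing.
import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    def __init__(self, base_vocab_size, hashed_table_size, dim):
        super().__init__()
        self.uni = nn.Embedding(base_vocab_size, dim)    # exact 1-gram embeddings
        self.bi = nn.Embedding(hashed_table_size, dim)   # hashed 2-gram embeddings
        self.tri = nn.Embedding(hashed_table_size, dim)  # hashed 3-gram embeddings
        self.base = base_vocab_size
        self.table = hashed_table_size

    def forward(self, token_ids):  # token_ids: (batch, seq) integer IDs
        pad = torch.zeros_like(token_ids[:, :1])
        prev1 = torch.cat([pad, token_ids[:, :-1]], dim=1)  # previous token
        prev2 = torch.cat([pad, prev1[:, :-1]], dim=1)      # token two positions back
        # Compose n-gram IDs, then fold them into fixed-size tables with modular arithmetic
        bigram = (prev1 * self.base + token_ids) % self.table
        trigram = ((prev2 * self.base + prev1) * self.base + token_ids) % self.table
        return self.uni(token_ids) + self.bi(bigram) + self.tri(trigram)

embed = OverEncodedEmbedding(base_vocab_size=32000, hashed_table_size=500000, dim=768)
hidden = embed(torch.randint(0, 32000, (2, 16)))  # shape (2, 16, 768)

The higher-order tables never materialize every n-gram combination; collisions introduced by the modulo are tolerated in exchange for a bounded parameter budget.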

Over-Decoding (OD) approximates larger output vocabularies by predicting multiple future tokens sequentially, a refinement of earlier MTP methods. For instance, instead of predicting one token at a time, OD trains the model to predict the next two tokens conditioned on the first prediction. Crucially, OD is selectively applied—only larger models benefit from this granular supervision, while smaller ones retain single-token decoding to avoid underfitting.

The researchers performed experiments on OLMo and OLMoE architectures and demonstrated three key findings:

Log-Linear Scaling: Training loss decreased linearly as input vocabulary size grew exponentially (Figure 1). A 400M parameter model with a 12.8M-entry input vocabulary matched the performance of a 1B-parameter baseline, achieving 2.5× effective scaling at equal computational cost.

Convergence Acceleration: Over-Encoding reduced training steps needed for convergence by 3–5× across tasks like MMLU and PIQA, suggesting richer input representations accelerate learning.

Sparse Parameter Efficiency: Despite using 128× larger input vocabularies, memory and computation overheads increased by <5% due to sparse embedding access and optimized sharding strategies.

On evaluations, the framework demonstrated consistent performance improvements across various model types. For dense models, a 151M Over-Encoded (OE) model achieved a 14% reduction in perplexity compared to its baseline. Similarly, in sparse Mixture-of-Experts (MoE) models, the OLMoE-1.3B with OE reduced validation loss by 0.12 points, although the gains were less pronounced as the benefits of sparse experts diluted the impact of embedding enhancements. Beyond synthetic experiments, real-world evaluations on large-scale datasets further validated these findings. Over-Encoded models consistently improved performance across multiple benchmarks, including MMLU-Var, Hellaswag, ARC-Challenge, ARC-Easy, and PIQA. Notably, the framework accelerated convergence, achieving a 5.7× speedup in training loss reduction. Additionally, downstream evaluations showed significant acceleration, with OE delivering speedups of 3.2× on MMLU-Var, 3.0× on Hellaswag, 2.6× on ARC-Challenge, 3.1× on ARC-Easy, and 3.9× on PIQA, highlighting its efficiency and effectiveness across diverse tasks.

In conclusion, this work redefines tokenization as a scalable dimension in language model design. By decoupling input and output vocabularies, Over-Tokenized Transformers break traditional trade-offs, enabling smaller models to benefit from compressed input sequences without grappling with overly complex prediction tasks. The log-linear relationship between input vocabulary size and performance suggests embedding parameters represent a new axis for scaling laws, complementing existing work on model depth and width. Practically, the framework offers a low-cost upgrade path for existing architectures—integrating Over-Encoding requires minimal code changes but yields immediate efficiency gains. Future research could explore hybrid tokenization strategies or dynamic vocabulary adaptation, further solidifying tokenization’s role in the next generation of efficient, high-performing LLMs.

Check out the Paper. All credit for this research goes to the researchers of this project.


Yandex Develops and Open-Sources Perforator: An Open-Source Tool that can Save Businesses Billions of Dollars a Year on Server Infrastructure

Yandex, a global tech company, develops and open-sources Perforator, an innovative tool for continuous real-time monitoring and analysis of servers and applications.

Perforator helps developers identify the most resource-intensive sections of code and provides detailed statistics for subsequent optimization. By identifying code inefficiencies and supporting profile-guided optimization, Perforator delivers accurate data that enables businesses to manually optimize their applications and reduce infrastructure costs by up to 20%. Depending on company size, this could translate to millions or even billions saved annually. 

“Perforator helps businesses get the most out of their servers without sacrificing performance,” said Sergey Skvortsov, a senior developer at Yandex who leads the team behind the tool. “Using Perforator, businesses can optimize their code, reduce server load, and ultimately lower energy and equipment costs.”

Why use Perforator?

Resource optimization is crucial for large data centers, big tech corporations, as well as small businesses and startups with limited resources. Instead of investing in additional equipment, companies can leverage Perforator to optimize their existing infrastructure without sacrificing performance. The tool has already been used for profiling in many Yandex services for over a year, and now it is accessible to companies, developers, and researchers worldwide.

Companies can deploy Perforator on their own servers, minimizing reliance on external cloud providers while maintaining full control over their data. This makes Perforator a strong fit for organizations with stringent data security requirements operating within closed infrastructures.

“Perforator can benefit companies of all sizes, from small businesses with 10-100 servers, which can save millions of dollars per year, to larger enterprises with thousands of servers and more, where savings can reach hundreds of millions or even billions of dollars annually,” Sergey Skvortsov noted. “Regardless of your company size, Perforator can help you reduce infrastructure costs, freeing up resources for further innovation and growth.”

How Perforator works

Perforator provides detailed insights into server resource usage and analyzes the impact of code on performance, highlighting which applications consume the most system resources. Perforator uses eBPF technology to run small programs within the Linux kernel in a way that is safe and does not slow down the system. eBPF allows for improved monitoring, security, and performance optimization without changing the source code.

Perforator supports native programming languages such as C, C++, Go, Rust, Python, and Java. The solution enables in-depth analytics and data visualization with flame graphs, making problem diagnostics much more manageable.

An example of a flame graph generated by Perforator

“Perforator has been battle-tested in Yandex’s demanding environment for over a year and provides a wide range of features that make it a reliable and versatile solution for monitoring and optimizing server performance,” Sergey Skvortsov added.

One of Perforator’s key advantages is its support for profile-guided optimization (PGO), which automatically accelerates C++ programs by up to 10%. Additionally, Perforator is designed to run seamlessly on individual computers, making it accessible not only to large businesses but also to startups and tech enthusiasts. Furthermore, Perforator offers essential features tailored for large organizations, including A/B testing capabilities that help make better-informed decisions.

Open-source solution for developers and businesses

The decision to make Perforator open source reflects Yandex’s commitment to fostering community collaboration in developing system technologies.

“We believe that open-sourcing such fundamental system technologies helps drive tech innovation worldwide,” said Sergey Skvortsov. “We aim for our technologies to benefit the world and provide value to both developers and businesses. Additionally, the openness of the technology enables us to make decisions regarding the development of the profiling infrastructure together with the community.”

What’s next?

In the near future, Perforator will be enhanced with additional capabilities, including improved integration with Python and Java and more precise analysis of events.

Perforator’s source code is now available on GitHub, alongside other Yandex open-source solutions, such as YaFSDP, a tool designed to accelerate the training of large language models. 

Perforator is the latest addition to Yandex’s collection of open-source tools. You can view all of the company’s open-source projects, including YaFSDP, AQLM, YTsaurus, and more, on this page.

About Yandex

Yandex is a global technology company that builds intelligent products and services powered by machine learning. Its goal is to help consumers and businesses better navigate the online and offline world. Since 1997, Yandex has delivered world-class, locally relevant search and information services and developed market-leading on-demand transportation services, navigation products, and other mobile applications for millions of consumers worldwide.

Key Takeaways:

Yandex introduces Perforator, a tool that can identify and evaluate code inefficiencies across a company’s entire code base.

Perforator helps developers identify the most resource-intensive sections of code and provides detailed statistics for subsequent optimization.

The solution can help businesses reduce CPU resource usage by 20% annually.

By leveraging Perforator, companies can potentially save millions or even billions, depending on company size, and allocate resources for further innovation and growth.

Perforator can be accessed for free on GitHub.

Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team has supported and sponsored this content.

DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Today, we are announcing that DeepSeek AI’s first-generation frontier model, DeepSeek-R1, is available through Amazon SageMaker JumpStart and Amazon Bedrock Marketplace to deploy for inference. You can now use DeepSeek-R1 to build, experiment, and responsibly scale your generative AI ideas on AWS.
In this post, we demonstrate how to get started with DeepSeek-R1 on Amazon Bedrock and SageMaker JumpStart.
Overview of DeepSeek-R1
DeepSeek-R1 is a large language model (LLM) developed by DeepSeek-AI that uses reinforcement learning to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning (RL) step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to produce more accurate, transparent, and detailed answers. The model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert “clusters.” This approach allows the model to specialize in different problem domains while maintaining overall efficiency. DeepSeek-R1 requires at least 800 GB of HBM memory in FP8 format for inference. In this post, we will use an ml.p5e.48xlarge instance to deploy the model. ml.p5e.48xlarge comes with 8 Nvidia H200 GPUs providing 1128 GB of GPU memory.
You can deploy DeepSeek-R1 model either through SageMaker JumpStart or Bedrock Marketplace. Because DeepSeek-R1 is an emerging model, we recommend deploying this model with guardrails in place. In this blog, we will use Amazon Bedrock Guardrails to introduce safeguards, prevent harmful content, and evaluate models against key safety criteria. At the time of writing this blog, for DeepSeek-R1 deployments on SageMaker JumpStart and Bedrock Marketplace, Bedrock Guardrails supports only the ApplyGuardrail API. You can create multiple guardrails tailored to different use cases and apply them to the DeepSeek-R1 model, improving user experiences and standardizing safety controls across your generative AI applications.
Prerequisites
To deploy the DeepSeek-R1 model, you need access to an ml.p5e instance. To check if you have quotas for P5e, open the Service Quotas console and under AWS Services, choose Amazon SageMaker, and confirm you’re using ml.p5e.48xlarge for endpoint usage. Make sure that you have at least one ml.p5e.48xlarge instance in the AWS Region you are deploying in. To request a limit increase, create a limit increase request and reach out to your account team.
Because you will be deploying this model with Amazon Bedrock Guardrails, make sure you have the correct AWS Identity and Access Management (IAM) permissions to use Amazon Bedrock Guardrails. For instructions, see Set up permissions to use guardrails for content filtering.
Implementing guardrails with the ApplyGuardrail API
Amazon Bedrock Guardrails allows you to introduce safeguards, prevent harmful content, and evaluate models against key safety criteria. You can implement safety measures for the DeepSeek-R1 model using the Amazon Bedrock ApplyGuardrail API. This allows you to apply guardrails to evaluate user inputs and model responses deployed on Amazon Bedrock Marketplace and SageMaker JumpStart. You can create a guardrail using the Amazon Bedrock console or the API. For the example code to create the guardrail, see the GitHub repo.
The general flow involves the following steps: First, the system receives an input for the model. This input is then processed through the ApplyGuardrail API. If the input passes the guardrail check, it’s sent to the model for inference. After receiving the model’s output, another guardrail check is applied. If the output passes this final check, it’s returned as the final result. However, if either the input or output is intervened by the guardrail, a message is returned indicating the nature of the intervention and whether it occurred at the input or output stage. The examples showcased in the following sections demonstrate inference using this API.
Deploy DeepSeek-R1 in Amazon Bedrock Marketplace
Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access DeepSeek-R1 in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane. At the time of writing this post, you can use the InvokeModel API to invoke the model. It doesn’t support Converse APIs and other Amazon Bedrock tooling.
Filter for DeepSeek as a provider and choose the DeepSeek-R1 model. The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration. The model supports various text generation tasks, including content creation, code generation, and question answering, using its reinforcement learning optimization and CoT reasoning capabilities. The page also includes deployment options and licensing information to help you get started with DeepSeek-R1 in your applications.
To begin using DeepSeek-R1, choose Deploy. You will be prompted to configure the deployment details for DeepSeek-R1. The model ID will be pre-populated.
For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
For Number of instances, enter a number of instances (between 1–100).
For Instance type, choose your instance type. For optimal performance with DeepSeek-R1, a GPU-based instance type like ml.p5e.48xlarge is recommended. Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.
Choose Deploy to begin using the model. When the deployment is complete, you can test DeepSeek-R1’s capabilities directly in the Amazon Bedrock playground.
Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with any Amazon Bedrock APIs, you need to get the endpoint ARN.
Run inference using guardrails with the deployed DeepSeek-R1 endpoint
The following code example demonstrates how to perform inference using a deployed DeepSeek-R1 model through Amazon Bedrock using the invoke_model and ApplyGuardrail API. You can create a guardrail using the Amazon Bedrock console or the API. For the example code to create the guardrail, see the GitHub repo. After you have created the guardrail, use the following code to implement guardrails. The script initializes the bedrock_runtime client, configures inference parameters, and sends a request to generate text based on a user prompt.

import boto3
import json

# Initialize Bedrock client
bedrock_runtime = boto3.client("bedrock-runtime")

# Configuration
MODEL_ID = "your-model-id"  # Bedrock model ID
GUARDRAIL_ID = "your-guardrail-id"
GUARDRAIL_VERSION = "your-guardrail-version"

def invoke_with_guardrails(prompt, max_tokens=1000, temperature=0.6, top_p=0.9):
    """
    Invoke Bedrock model with input and output guardrails
    """
    # Apply input guardrails
    input_guardrail = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=GUARDRAIL_ID,
        guardrailVersion=GUARDRAIL_VERSION,
        source='INPUT',
        content=[{"text": {"text": prompt}}]
    )

    if input_guardrail['action'] == 'GUARDRAIL_INTERVENED':
        return f"Input blocked: {input_guardrail['outputs'][0]['text']}"

    # Prepare model input
    request_body = {
        "inputs": f"""You are an AI assistant. Do as the user asks.
### Instruction: {prompt}
### Response: <think>""",
        "parameters": {
            "max_new_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature
        }
    }

    # Invoke model
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(request_body)
    )

    # Parse model response
    model_output = json.loads(response['body'].read())['generated_text']

    # Apply output guardrails
    output_guardrail = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=GUARDRAIL_ID,
        guardrailVersion=GUARDRAIL_VERSION,
        source='OUTPUT',
        content=[{"text": {"text": model_output}}]
    )

    if output_guardrail['action'] == 'GUARDRAIL_INTERVENED':
        return f"Output blocked: {output_guardrail['outputs'][0]['text']}"

    return model_output

# Example usage
if __name__ == "__main__":
    prompt = "What's 1+1?"
    result = invoke_with_guardrails(prompt)
    print(result)

Deploy DeepSeek-R1 with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub with FMs, built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can customize pre-trained models to your use case, with your data, and deploy them into production using either the UI or SDK.
Deploying DeepSeek-R1 model through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.
Deploy DeepSeek-R1 through SageMaker JumpStart UI
Complete the following steps to deploy DeepSeek-R1 using SageMaker JumpStart:

On the SageMaker console, choose Studio in the navigation pane.
First-time users will be prompted to create a domain.
On the SageMaker Studio console, choose JumpStart in the navigation pane. The model browser displays available models, with details like the provider name and model capabilities.
Search for DeepSeek-R1 to view the DeepSeek-R1 model card. Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)
Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, allowing you to use Amazon Bedrock APIs to invoke the model

Choose the model card to view the model details page. The model details page includes the following information:

The model name and provider information
Deploy button to deploy the model
About and Notebooks tabs with detailed information
The About tab includes important details, such as:

Model description
License information
Technical specifications
Usage guidelines
Before you deploy the model, it’s recommended to review the model details and license terms to confirm compatibility with your use case.
Choose Deploy to proceed with deployment.
For Endpoint name, use the automatically generated name or create a custom one.
For Instance type, choose an instance type (default: ml.p5e.48xlarge).
For Initial instance count, enter the number of instances (default: 1). Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.
Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
Deploy DeepSeek-R1 using the SageMaker Python SDK
To get started with DeepSeek-R1 using the SageMaker Python SDK, you will need to install the SageMaker Python SDK and make sure you have the necessary AWS permissions and environment setup. The following is a step-by-step code example that demonstrates how to deploy and use DeepSeek-R1 for inference programmatically. The code for deploying the model is provided in the GitHub repo. You can clone the notebook and run it from SageMaker Studio.

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "deepseek-llm-r1"

gpu_instance_type = "ml.p5e.48xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR,
)

model = model_builder.build()
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)

predictor.predict(sample_input)

You can run additional requests against the predictor:

new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}

prediction = predictor.predict(new_input)
print(prediction)

Implement guardrails and run inference with your SageMaker JumpStart predictor
Similar to Amazon Bedrock, you can also use the ApplyGuardrail API with your SageMaker JumpStart predictor. You can create a guardrail using the Amazon Bedrock console or the API, and implement it as shown in the following code:

import boto3
import json

bedrock_runtime = boto3.client('bedrock-runtime')
sagemaker_runtime = boto3.client('sagemaker-runtime')

# Add your guardrail identifier and version created from the Bedrock console or AWS CLI
guardrail_id = ""  # Your Guardrail ID
guardrail_version = ""  # Your Guardrail Version
endpoint_name = ""  # Endpoint Name

prompt = "What's 1+1 equal?"

# Apply guardrail to input before sending to model
input_guardrail_response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier=guardrail_id,
    guardrailVersion=guardrail_version,
    source='INPUT',
    content=[{"text": {"text": prompt}}]
)

# If input guardrail passes, proceed with model inference
if input_guardrail_response['action'] != 'GUARDRAIL_INTERVENED':
    # Prepare the input for the SageMaker endpoint
    template = f"""You are an AI assistant. Do as the user asks.
### Instruction: {prompt}
### Response: <think>"""

    input_payload = {
        "inputs": template,
        "parameters": {
            "max_new_tokens": 1000,
            "top_p": 0.9,
            "temperature": 0.6
        }
    }

    # Convert the payload to JSON string
    input_payload_json = json.dumps(input_payload)

    # Invoke the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=input_payload_json
    )

    # Get the response from the model
    model_response = json.loads(response['Body'].read().decode())

    # Apply guardrail to output
    output_guardrail_response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source='OUTPUT',
        content=[{"text": {"text": model_response['generated_text']}}]
    )

    # Check if output passes guardrails
    if output_guardrail_response['action'] != 'GUARDRAIL_INTERVENED':
        print(model_response['generated_text'])
    else:
        print("Output blocked: ", output_guardrail_response['outputs'][0]['text'])
else:
    print("Input blocked: ", input_guardrail_response['outputs'][0]['text'])

Clean up
To avoid unwanted charges, complete the steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, choose Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:

Endpoint name
Model name
Endpoint status

Choose Delete to delete the endpoint.
In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor
The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we explored how you can access and deploy the DeepSeek-R1 model using Bedrock Marketplace and SageMaker JumpStart. Visit SageMaker JumpStart in SageMaker Studio or Amazon Bedrock Marketplace now to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, SageMaker JumpStart pretrained models, Amazon SageMaker JumpStart Foundation Models, Amazon Bedrock Marketplace, and Getting started with Amazon SageMaker JumpStart.

About the Authors
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Jonathan Evans is a Specialist Solutions Architect working on generative AI with the Third-Party Model Science team at AWS.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, SageMaker’s machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Streamline grant proposal reviews using Amazon Bedrock

Government and non-profit organizations evaluating grant proposals face a significant challenge: sifting through hundreds of detailed submissions, each with unique merits, to identify the most promising initiatives. This arduous, time-consuming process is typically the first step in the grant management process, which is critical to driving meaningful social impact.
The AWS Social Responsibility & Impact (SRI) team recognized an opportunity to augment this function using generative AI. The team developed an innovative solution to streamline grant proposal review and evaluation by using the natural language processing (NLP) capabilities of Amazon Bedrock. Amazon Bedrock is a fully managed service that lets you use your choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities that you need to build generative AI applications with security, privacy, and responsible AI.
Historically, AWS Health Equity Initiative applications were reviewed manually by a review committee. It took 14 or more days each cycle for all applications to be fully reviewed. On average, the program received 90 applications per cycle. The June 2024 AWS Health Equity Initiative application cycle received 139 applications, the program’s largest influx to date. It would have taken an estimated 21 days for the review committee to manually process this many applications. The Amazon Bedrock centered approach reduced the review time to 2 days (a 90% reduction).
The goal was to enhance the efficiency and consistency of the review process, empowering customers to build impactful solutions faster. By combining the advanced NLP capabilities of Amazon Bedrock with thoughtful prompt engineering, the team created a dynamic, data-driven, and equitable solution demonstrating the transformative potential of large language models (LLMs) in the social impact domain.
In this post, we explore the technical implementation details and key learnings from the team’s Amazon Bedrock powered grant proposal review solution, providing a blueprint for organizations seeking to optimize their grants management processes.
Building an effective prompt for reviewing grant proposals using generative AI
Prompt engineering is the art of crafting effective prompts to instruct and guide generative AI models, such as LLMs, to produce the desired outputs. By thoughtfully designing prompts, practitioners can unlock the full potential of generative AI systems and apply them to a wide range of real-world scenarios.
When building a prompt for our Amazon Bedrock model to review grant proposals, we used multiple prompt engineering techniques to make sure the model’s responses were tailored, structured, and actionable. This included assigning the model a specific persona, providing step-by-step instructions, and specifying the desired output format.
First, we assigned the model the persona of an expert in public health, with a focus on improving healthcare outcomes for underserved populations. This context helps prime the model to evaluate the proposal from the perspective of a subject matter expert (SME) who thinks holistically about global challenges and community-level impact. By clearly defining the persona, we make sure the model’s responses are tailored to the desired evaluation lens.

Your task is to review a proposal document from the perspective of a given persona, and assess it based on dimensions defined in a rubric. Here are the steps to follow:

1. Review the provided proposal document: {PROPOSAL}

2. Adopt the perspective of the given persona: {PERSONA}

Multiple personas can be assigned against the same rubric to account for various perspectives. For example, when the persona “Public Health Subject Matter Expert” was assigned, the model provided keen insights on the project’s impact potential and evidence basis. When the persona “Venture Capitalist” was assigned, the model provided more robust feedback on the organization’s articulated milestones and sustainability plan for post funding. Similarly, when the persona “Software Development Engineer” was assigned, the model relayed subject matter expertise on the proposed use of AWS technology.
Next, we broke down the review process into a structured set of instructions for the model to follow. This includes reviewing the proposal, assessing it across specific dimensions (impact potential, innovation, feasibility, sustainability), and then providing an overall summary and score. Outlining these step-by-step directives gives the model clear guidance on the required task elements and helps produce a comprehensive and consistent assessment.

3. Assess the proposal based on each dimension in the provided rubric: {RUBRIC}

For each dimension, follow this structure:
<Dimension Name>
<Summary> Provide a brief summary (2-3 sentences) of your assessment of how well the proposal meets the criteria for this dimension from the perspective of the given persona. </Summary>
<Score> Provide a score from 0 to 100 for this dimension. Start with a default score of 0 and increase it based on the information in the proposal. </Score>
<Recommendations> Provide 2-3 specific recommendations for how the author could improve the proposal in this dimension. </Recommendations>
</Dimension Name>

4. After assessing each dimension, provide an <Overall Summary> section with:
– An overall assessment summary (3-4 sentences) of the proposal’s strengths and weaknesses across all dimensions from the persona’s perspective
– Any additional feedback beyond the rubric dimensions
– Identification of any potential risks or biases in the proposal or your assessment

5. Finally, calculate the <Overall Weighted Score> by applying the weightings specified in the rubric to your scores for each dimension.

Finally, we specified the desired output format as JSON, with distinct sections for the dimensional assessments, overall summary, and overall score. Prescribing this structured response format makes sure that the model’s output can be ingested, stored, and analyzed by our grant review team, rather than being delivered in free-form text. This level of control over the output helps streamline the downstream use of the model’s evaluations.

6. Return your assessment in JSON format with the following structure:

{{ "dimensions": [ {{ "name": "<Dimension Name>", "summary": "<Summary>", "score": <Score>, "recommendations": [ "<Recommendation 1>", "<Recommendation 2>", ... ] }}, ... ], "overall_summary": "<Overall Summary>", "overall_score": <Overall Weighted Score> }}

Do not include any other commentary beyond following the specified structure. Focus solely on providing the assessment based on the given inputs.
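To make step 5 concrete, the overall weighted score the model is asked to compute is simply a weight-and-sum over the rubric dimensions. The dimension names below follow the dimensions mentioned earlier, while the scores and weights are hypothetical values rather than the actual rubric.

# Hypothetical illustration of the overall weighted score in step 5;
# scores and weights are made-up values, not the actual rubric.
dimension_scores = {"impact potential": 80, "innovation": 70, "feasibility": 60, "sustainability": 75}
rubric_weights = {"impact potential": 0.4, "innovation": 0.2, "feasibility": 0.2, "sustainability": 0.2}

overall_weighted_score = sum(dimension_scores[d] * rubric_weights[d] for d in dimension_scores)
print(overall_weighted_score)  # 80*0.4 + 70*0.2 + 60*0.2 + 75*0.2 = 73.0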

By combining these prompt engineering techniques—role assignment, step-by-step instructions, and output formatting—we were able to craft a prompt that elicits thorough, objective, and actionable grant proposal assessments from our generative AI model. This structured approach enables us to effectively use the model’s capabilities to support our grant review process in a scalable and efficient manner.
Building a dynamic proposal review application with Streamlit and generative AI
To demonstrate and test the capabilities of a dynamic proposal review solution, we built a rapid prototype implementation using Streamlit, Amazon Bedrock, and Amazon DynamoDB. It’s important to note that this implementation isn’t intended for production use, but rather serves as a proof of concept and a starting point for further development. The application allows users to define and save various personas and evaluation rubrics, which can then be dynamically applied when reviewing proposal submissions. This approach enables a tailored and relevant assessment of each proposal, based on the specified criteria.
The application’s architecture consists of several key components, which we discuss in this section.
The team used DynamoDB, a NoSQL database, to store the personas, rubrics, and submitted proposals. The data stored in DynamoDB was sent to Streamlit, a web application interface. On Streamlit, the team added the persona and rubric to the prompt and sent the prompt to Amazon Bedrock.
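Before the prompt is assembled, the selected persona and rubric would be read from DynamoDB. The following is a hedged sketch of that lookup; the table names, key schema, and helper functions are assumptions rather than details from the source.

# Hypothetical DynamoDB reads for personas and rubrics; table and key names are assumptions.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

def load_persona(persona_id):
    table = dynamodb.Table("personas")  # assumed table name
    return table.get_item(Key={"id": persona_id}).get("Item")

def load_rubric(rubric_id):
    table = dynamodb.Table("rubrics")  # assumed table name
    return table.get_item(Key={"id": rubric_id}).get("Item")

The prototype’s own prompt-construction code is shown next.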

import boto3
import json

from api.personas import Persona
from api.rubrics import Rubric
from api.submissions import Submission

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def _construct_prompt(persona: Persona, rubric: Rubric, submission: Submission):
    rubric_dimensions = [
        f"{dimension['name']}|{dimension['description']}|{dimension['weight']}"
        for dimension in rubric.dimensions
    ]

    # Add the table headers the prompt is expecting to the front of the dimensions list
    rubric_dimensions[:0] = ["dimension_name|dimension_description|dimension_weight"]
    rubric_string = "\n".join(rubric_dimensions)
    print(rubric_string)

    with open("prompt/prompt_template.txt", "r") as prompt:
        prompt = prompt.read()
    print(prompt)
    return prompt.format(
        PROPOSAL=submission.content,
        PERSONA=persona.description,
        RUBRIC=rubric_string,
    )

Amazon Bedrock used Anthropic’s Claude 3 Sonnet FM to evaluate the submitted proposals against the prompt. The model’s prompts are dynamically generated based on the selected persona and rubric. Amazon Bedrock would send the evaluation results back to Streamlit for team review.

def get_assessment(submission: Submission, persona: Persona, rubric: Rubric):
    prompt = _construct_prompt(persona, rubric, submission)

    body = json.dumps(
        {
            "anthropic_version": "",
            "max_tokens": 2000,
            "temperature": 0.5,
            "top_p": 1,
            "messages": [{"role": "user", "content": prompt}],
        }
    )
    response = bedrock.invoke_model(
        body=body, modelId="anthropic.claude-3-haiku-20240307-v1:0"
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("content")[0].get("text")

The following diagram illustrates the workflow.

The workflow consists of the following steps:

Users can create and manage personas and rubrics through the Streamlit application. These are stored in the DynamoDB database.
When a user submits a proposal for review, they choose the desired persona and rubric from the available options.
The Streamlit application generates a dynamic prompt for the Amazon Bedrock model, incorporating the selected persona and rubric details.
The Amazon Bedrock model evaluates the proposal based on the dynamic prompt and returns the assessment results.
The evaluation results are stored in the DynamoDB database and presented to the user through the Streamlit application.
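Putting these steps together, a hypothetical Streamlit front end might look like the sketch below. The widget labels, the load_persona and load_rubric helpers from the earlier sketch, and the Submission constructor signature are all assumptions; get_assessment comes from the snippet above.

# Hypothetical Streamlit wiring for the workflow above; helper names and the
# Submission constructor are assumptions, get_assessment is from the earlier snippet.
import streamlit as st

from api.submissions import Submission

persona_id = st.selectbox("Persona", ["Public Health Subject Matter Expert", "Venture Capitalist"])
rubric_id = st.selectbox("Rubric", ["Default rubric"])
proposal_text = st.text_area("Proposal text")

if st.button("Evaluate proposal"):
    persona = load_persona(persona_id)  # assumed to return the Persona object used by get_assessment
    rubric = load_rubric(rubric_id)     # assumed to return the Rubric object used by get_assessment
    submission = Submission(content=proposal_text)  # assumed constructor signature
    st.json(get_assessment(submission, persona, rubric))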

Impact
This rapid prototype demonstrates the potential for a scalable and flexible proposal review process, allowing organizations to:

Reduce application processing time by up to 90%
Streamline the review process by automating the evaluation tasks
Capture structured data on the proposals and assessments for further analysis
Incorporate diverse perspectives by enabling the use of multiple personas and rubrics

Throughout the implementation, the AWS SRI team focused on creating an interactive and user-friendly experience. By working hands-on with the Streamlit application and observing the impact of dynamic persona and rubric selection, users can gain practical experience in building AI-powered applications that address real-world challenges.
Considerations for a production-grade implementation
Although the rapid prototype demonstrates the potential of this solution, a production-grade implementation requires additional considerations and the use of additional AWS services. Some key considerations include:

Scalability and performance – For handling large volumes of proposals and concurrent users, a serverless architecture using AWS Lambda, Amazon API Gateway, DynamoDB, and Amazon Simple Storage Service (Amazon S3) would provide improved scalability, availability, and reliability.
Security and compliance – Depending on the sensitivity of the data involved, additional security measures such as encryption, authentication and access control, and auditing are necessary. Services like AWS Key Management Service (KMS), Amazon Cognito, AWS Identity and Access Management (IAM), and AWS CloudTrail can help meet these requirements.
Monitoring and logging – Implementing robust monitoring and logging mechanisms using services like Amazon CloudWatch and AWS X-Ray enable tracking performance, identifying issues, and maintaining compliance.
Automated testing and deployment – Implementing automated testing and deployment pipelines using services like AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy help provide consistent and reliable deployments, reducing the risk of errors and downtime.
Cost optimization – Implementing cost optimization strategies, such as using AWS Cost Explorer and AWS Budgets, can help manage costs and help maintain efficient resource utilization.
Responsible AI considerations – Implementing safeguards—such as Amazon Bedrock Guardrails—and monitoring mechanisms can help enforce the responsible and ethical use of the generative AI model, including bias detection, content moderation, and human oversight. Although the AWS Health Equity Initiative application form collected customer information such as name, email address, and country of operation, this was systematically omitted when sent to the Amazon Bedrock enabled tool to avoid bias in the model and protect customer data.

By using the full suite of AWS services and following best practices for security, scalability, and responsible AI, organizations can build a production-ready solution that meets their specific requirements while achieving compliance, reliability, and cost-effectiveness.
Conclusion
Amazon Bedrock—coupled with effective prompt engineering—enabled AWS SRI to review grant proposals and deliver awards to customers in days instead of weeks. The skills developed in this project—such as building web applications with Streamlit, integrating with NoSQL databases like DynamoDB, and customizing generative AI prompts—are highly transferable and applicable to a wide range of industries and use cases.

About the authors
Carolyn Vigil is a Global Lead for AWS Social Responsibility & Impact Customer Engagement. She drives strategic initiatives that leverage cloud computing for social impact worldwide. A passionate advocate for underserved communities, she has co-founded two non-profit organizations serving individuals with developmental disabilities and their families. Carolyn enjoys mountain adventures with her family and friends in her free time.
Lauren Hollis is a Program Manager for AWS Social Responsibility and Impact. She leverages her background in economics, healthcare research, and technology to help mission-driven organizations deliver social impact using AWS cloud technology. In her free time, Lauren enjoys reading and playing the piano and cello.
Ben West is a hands-on builder with experience in machine learning, big data analytics, and full-stack software development. As a technical program manager on the AWS Social Responsibility & Impact team, Ben leverages a wide variety of cloud, edge, and Internet of Things (IoT) technologies to develop innovative prototypes and help public sector organizations make a positive impact in the world. Ben is an Army veteran who enjoys cooking and being outdoors.
Mike Haggerty is a Senior Systems Development Engineer (Sr. SysDE) at Amazon Web Services (AWS), working within the PACE-EDGE team. In this role, he contributes to AWS’s edge computing initiatives as part of the Worldwide Public Sector (WWPS) organization’s PACE (Prototyping and Customer Engineering) team. Beyond his professional duties, Mike is a pet therapy volunteer who, together with his dog Gnocchi, provides support services at local community facilities.

How Aetion is using generative AI and Amazon Bedrock to unlock hidden insights about patient populations

The real-world data collected and derived from patient journeys offers a wealth of insights into patient characteristics and outcomes and the effectiveness and safety of medical innovations. Researchers ask questions about patient populations in the form of structured queries; however, without the right choice of structured query and deep familiarity with complex real-world patient datasets, many trends and patterns can remain undiscovered.
Aetion is a leading provider of decision-grade real-world evidence software to biopharma, payors, and regulatory agencies. The company provides comprehensive solutions to healthcare and life science customers to transform real-world data into real-world evidence.
The use of unsupervised learning methods on semi-structured data along with generative AI has been transformative in unlocking hidden insights. With Aetion Discover, users can conduct rapid, exploratory analyses with real-world data while experiencing a structured approach to research questions. To help accelerate data exploration and hypothesis generation, Discover uses unsupervised learning methods to uncover Smart Subgroups. These subgroups of patients within a larger population display similar characteristics or profiles across a vast range of factors, including diagnoses, procedures, and therapies.
In this post, we review how Aetion’s Smart Subgroups Interpreter enables users to interact with Smart Subgroups using natural language queries. Powered by Amazon Bedrock and Anthropic’s Claude 3 large language models (LLMs), the interpreter responds to user questions expressed in conversational language about patient subgroups and provides insights to generate further hypotheses and evidence. Aetion chose to use Amazon Bedrock for working with LLMs due to its vast model selection from multiple providers, security posture, extensibility, and ease of use.
Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI startups and Amazon through a unified API. It offers a wide range of FMs, allowing you to choose the model that best suits your specific use case.
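For illustration, the following minimal sketch shows how an application might call an Anthropic Claude 3 model through the Amazon Bedrock runtime Converse API using the AWS SDK for Python (Boto3). The Region, model ID, and question are placeholders for this example, not Aetion's actual implementation.

```python
import boto3

# Bedrock runtime client; Region and model ID are illustrative placeholders.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {
            "role": "user",
            "content": [{"text": "What are the most common characteristics "
                                 "of patients in the cataract disorders subgroup?"}],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```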
Aetion’s technology
Aetion uses the science of causal inference to generate real-world evidence on the safety, effectiveness, and value of medications and clinical interventions. Aetion has partnered with the majority of the top 20 biopharma companies, as well as leading payors and regulatory agencies.
Aetion brings deep scientific expertise and technology to life sciences, regulatory agencies (including FDA and EMA), payors, and health technology assessment (HTA) customers in the US, Canada, Europe, and Japan with analytics that can achieve the following:

Optimize clinical trials by identifying target populations, creating external control arms, and contextualizing settings and populations underrepresented in controlled settings
Expand industry access through label changes, pricing, coverage, and formulary decisions
Conduct safety and effectiveness studies for medications, treatments, and diagnostics

Aetion’s applications, including Discover and Aetion Substantiate, are powered by the Aetion Evidence Platform (AEP), a core longitudinal analytic engine capable of applying rigorous causal inference and statistical methods to hundreds of millions of patient journeys.
AetionAI is a set of generative AI capabilities embedded across the core environment and applications. Smart Subgroups Interpreter is an AetionAI feature in Discover.
The following figure illustrates the organization of Aetion’s services.

Smart Subgroups
For a user-specified patient population, the Smart Subgroups feature identifies clusters of patients with similar characteristics (for example, similar prevalence profiles of diagnoses, procedures, and therapies).
These subgroups are further classified and labeled by generative AI models based on each subgroup’s prevalent characteristics. For example, as shown in the following generated heat map, the first two Smart Subgroups within a population of patients who were prescribed GLP-1 agonists are labeled “Cataract and Retinal Disease” and “Inflammatory Skin Conditions,” respectively, to capture their defining characteristics.

After the subgroups are displayed, a user engages with AetionAI to probe further with inquiries expressed in natural language. The user can express questions about the subgroups, such as “What are the most common characteristics for patients in the cataract disorders subgroup?” As shown in the following screenshot, AetionAI responds to the user in natural language, citing relevant subgroup statistics in its response.

A user might also ask AetionAI detailed questions such as “Compare the prevalence of cardiovascular diseases or conditions among the ‘Dulaglutide’ group vs the overall population.” The following screenshot shows AetionAI’s response.

In this example, the insights enable the user to hypothesize that Dulaglutide patients might experience fewer circulatory signs and symptoms. They can explore this further in Aetion Substantiate to produce decision-grade evidence with causal inference to assess the effectiveness of Dulaglutide use in cardiovascular disease outcomes.
Solution overview
Smart Subgroups Interpreter combines elements of unsupervised machine learning with generative AI to uncover hidden patterns in real-world data. The following diagram illustrates the workflow.
Let’s review each step in detail:

Create the patient population – Users define a patient population using the Aetion Measure Library (AML) features. The AML feature store standardizes variable definitions using scientifically validated algorithms. The user selects the AML features that define the patient population for analysis.
Generate features for the patient population – The AEP computes over 1,000 AML features for each patient across various categories, such as diagnoses, therapies, and procedures.
Build clusters and summarize cluster features – The Smart Subgroups component trains a topic model using the patient features to determine the optimal number of clusters and assign patients to clusters. The prevalences of the most distinctive features within each cluster, as determined by a trained classification model, are used to describe the cluster characteristics (a minimal clustering sketch follows this list).
Generate cluster names and answer user queries – A prompt engineering technique for Anthropic’s Claude 3 Haiku on Amazon Bedrock generates descriptive cluster names and answers user queries. Amazon Bedrock provides access to LLMs from a variety of model providers. Anthropic’s Claude 3 Haiku was selected as the model due to its speed and satisfactory intelligence level.
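The following is a minimal, hypothetical sketch of the clustering step referenced above: it fits a topic model (here, scikit-learn's LatentDirichletAllocation) over a patient-by-feature count matrix and reports the most prevalent features per cluster. The synthetic data, cluster count, and feature names are assumptions for illustration; Aetion's actual pipeline selects the number of clusters and ranks features with a trained classification model.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical patient-by-feature matrix: rows are patients, columns are AML
# feature counts (diagnoses, procedures, therapies).
feature_names = [f"FEATURE_{i:03d}" for i in range(50)]
patient_features = rng.poisson(lam=0.3, size=(1000, len(feature_names)))

# Fit a topic model; each "topic" plays the role of a Smart Subgroup.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
patient_topic = lda.fit_transform(patient_features)
assignments = patient_topic.argmax(axis=1)  # hard cluster assignment per patient

# Summarize each cluster by its most prevalent features.
for cluster_id in range(lda.n_components):
    members = patient_features[assignments == cluster_id]
    prevalence = (members > 0).mean(axis=0)  # share of patients with each feature
    top = prevalence.argsort()[::-1][:5]
    summary = ", ".join(f"{feature_names[i]} ({prevalence[i]:.1%})" for i in top)
    print(f"Smart Subgroup {cluster_id}: {summary}")
```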

The solution uses Amazon Simple Storage Service (Amazon S3) and Amazon Aurora for data persistence and data exchange, and Amazon Bedrock with Anthropic’s Claude 3 Haiku models for cluster names generation. Discover and its transactional and batch applications are deployed and scaled on a Kubernetes on AWS cluster to optimize performance, user experience, and portability.
The following diagram illustrates the solution architecture.

The workflow includes the following steps:

Users create Smart Subgroups for their patient population of interest.
AEP uses real-world data and a custom query language to compute over 1,000 science-validated features for the user-selected population. The features are stored in Amazon S3 and encrypted with AWS Key Management Service (AWS KMS) for downstream use.
The Smart Subgroups component trains the clustering algorithm and summarizes the most important features of each cluster. The cluster feature summaries are stored in Amazon S3 and displayed as a heat map to the user. Smart Subgroups is deployed as a Kubernetes job and is run on demand.
Users interact with the Interpreter API microservice by using questions expressed in natural language to retrieve descriptive subgroup names. The data transmitted to the service is encrypted using Transport Layer Security (TLS) 1.2. The Interpreter API uses composite prompt engineering techniques with Anthropic's Claude 3 Haiku to answer user queries (a prompt-assembly sketch follows these workflow steps):

Versioned prompt templates generate descriptive subgroup names and answer user queries.
AML features are added to the prompt template. For example, the description of the feature “Benign Ovarian Cyst” is expanded in a prompt to the LLM as “This measure covers different types of cysts that can form in or on a woman’s ovaries, including follicular cysts, corpus luteum cysts, endometriosis, and unspecified ovarian cysts.”
Lastly, the top feature prevalences of each subgroup are added to the prompt template. For example: “In Smart Subgroup 1 the relative prevalence of ‘Cornea and external disease (EYE001)’ is 30.32% In Smart Subgroup 1 the relative prevalence of ‘Glaucoma (EYE003)’ is 9.94%…”

Amazon Bedrock returns its response to the application, which displays the heat map to the user.
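To make the composite prompt construction in the workflow above concrete, here is a hedged sketch of how a versioned template, AML feature descriptions, and per-subgroup prevalences might be assembled into a single prompt before it is sent to Anthropic's Claude 3 Haiku (for example, with the converse call shown earlier). The template text and data structures are illustrative assumptions, not Aetion's actual prompts.

```python
# Versioned prompt template (illustrative).
PROMPT_TEMPLATE_V1 = (
    "You are assisting a researcher exploring patient subgroups.\n"
    "Feature definitions:\n{feature_definitions}\n\n"
    "Subgroup statistics:\n{subgroup_stats}\n\n"
    "Question: {question}\n"
)

# Hypothetical AML feature descriptions and subgroup prevalences.
feature_definitions = {
    "EYE001": "Cornea and external disease",
    "EYE003": "Glaucoma",
}
subgroup_prevalences = {"Smart Subgroup 1": {"EYE001": 30.32, "EYE003": 9.94}}

def build_prompt(question: str) -> str:
    definitions = "\n".join(f"- {code}: {desc}" for code, desc in feature_definitions.items())
    stats = "\n".join(
        f"In {subgroup} the relative prevalence of '{desc} ({code})' is {prev[code]:.2f}%."
        for subgroup, prev in subgroup_prevalences.items()
        for code, desc in feature_definitions.items()
    )
    return PROMPT_TEMPLATE_V1.format(
        feature_definitions=definitions, subgroup_stats=stats, question=question
    )

print(build_prompt("Which subgroup is most associated with glaucoma?"))
```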

Outcomes
Smart Subgroups Interpreter enables users of the AEP who are unfamiliar with real-world data to discover patterns among patient populations using natural language queries. Users can now turn these discoveries into hypotheses for further analyses across Aetion's software and generate decision-grade evidence in minutes rather than days, without the need for support staff.
Conclusion
In this post, we demonstrated how Aetion uses Amazon Bedrock and other AWS services to help users uncover meaningful patterns within patient populations, even without prior expertise in real-world data. These discoveries lay the groundwork for deeper analysis within Aetion’s Evidence Platform, generating decision-grade evidence that drives smarter, data-informed outcomes.
As we continue expanding our generative AI capabilities, Aetion remains committed to enhancing user experiences and accelerating the journey from real-world data to real-world evidence.
With Amazon Bedrock, the future of innovation is at your fingertips. Explore Generative AI Application Builder on AWS to learn more about building generative AI capabilities to unlock new insights, build transformative solutions, and shape the future of healthcare today.

About the Authors
Javier Beltrán is a Senior Machine Learning Engineer at Aetion. His career has focused on natural language processing, and he has experience applying machine learning solutions to various domains, from healthcare to social media.
Ornela Xhelili is a Staff Machine Learning Architect at Aetion. Ornela specializes in natural language processing, predictive analytics, and MLOps, and holds a Master’s of Science in Statistics. Ornela has spent the past 8 years building AI/ML products for tech startups across various domains, including healthcare, finance, analytics, and ecommerce.
Prasidh Chhabri is a Product Manager at Aetion, leading the Aetion Evidence Platform, core analytics, and AI/ML capabilities. He has extensive experience building quantitative and statistical methods to solve problems in human health.
Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare life sciences customers and specializes in data analytics services. Mikhail has more than 20 years of industry experience covering a wide range of technologies and sectors.

Meta AI Introduces MR.Q: A Model-Free Reinforcement Learning Algorithm …

Reinforcement learning (RL) trains agents to make sequential decisions by maximizing cumulative rewards. It has diverse applications, including robotics, gaming, and automation, where agents interact with environments to learn optimal behaviors. Traditional RL methods fall into two categories: model-free and model-based approaches. Model-free techniques prioritize simplicity but require extensive training data, while model-based methods introduce structured learning but are computationally demanding. A growing area of research aims to bridge these approaches and develop more versatile RL frameworks that function efficiently across different domains.

A persistent challenge in RL is the absence of a universal algorithm capable of performing consistently across multiple environments without exhaustive parameter tuning. Most RL algorithms are designed for specific applications, necessitating adjustments to work effectively in new settings. Model-based RL methods generally demonstrate superior generalization but at the cost of greater complexity and slower execution speeds. On the other hand, model-free methods are easier to implement but often lack efficiency when applied to unfamiliar tasks. Developing an RL framework that integrates the strengths of both approaches without compromising computational feasibility remains a key research objective.

Several RL methodologies have emerged, each with trade-offs between performance and efficiency. Model-based solutions such as DreamerV3 and TD-MPC2 have achieved substantial results across different tasks but rely heavily on complex planning mechanisms and large-scale simulations. Model-free alternatives, including TD3 and PPO, offer reduced computational demands but require domain-specific tuning. This disparity underscores the need for an RL algorithm that combines adaptability and efficiency, enabling seamless application across various tasks and environments.

A research team from Meta FAIR introduced MR.Q, a model-free RL algorithm incorporating model-based representations to improve learning efficiency and generalization. Unlike traditional model-free approaches, MR.Q leverages a representation learning phase inspired by model-based objectives, enabling the algorithm to function effectively across different RL benchmarks with minimal tuning. This approach allows MR.Q to benefit from the structured learning signals of model-based methods while avoiding the computational overhead associated with full-scale planning and simulated rollouts.

The MR.Q framework maps state-action pairs into embeddings that maintain an approximately linear relationship with the value function. These embeddings are then processed through a non-linear function to retain consistency across different environments. The system integrates an encoder that extracts relevant features from state and action inputs, enhancing learning stability. Further, MR.Q employs a prioritized sampling technique and a reward scaling mechanism to improve training efficiency. The algorithm achieves robust performance across multiple RL benchmarks while maintaining computational efficiency by focusing on an optimized learning strategy.
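As a rough, hedged sketch of the idea described above (not the authors' implementation), the snippet below defines an encoder that maps a state-action pair to an embedding and a value head that is linear in that embedding, so the value function is approximately linear in the learned representation. All dimensions are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

class StateActionEncoder(nn.Module):
    """Maps a state-action pair to an embedding (illustrative dimensions)."""
    def __init__(self, state_dim: int, action_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
            nn.ReLU(),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

class LinearValueHead(nn.Module):
    """Value estimate kept (approximately) linear in the embedding."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(embedding)

# Example forward pass with toy dimensions.
encoder, value_head = StateActionEncoder(17, 6), LinearValueHead()
state, action = torch.randn(32, 17), torch.randn(32, 6)
q_value = value_head(encoder(state, action))  # shape: (32, 1)
```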

Experiments conducted across RL benchmarks, including Gym locomotion tasks, the DeepMind Control Suite, and Atari, demonstrate that MR.Q achieves strong results with a single set of hyperparameters. The algorithm outperforms conventional model-free baselines such as PPO and DQN while remaining competitive with DreamerV3 and TD-MPC2, and it does so using significantly fewer computational resources, making it a practical choice for real-world applications. In the Atari benchmark, MR.Q performs particularly well in discrete-action spaces, surpassing existing methods, and it also shows strong performance in continuous control environments. The evaluation further highlights MR.Q's ability to generalize effectively without extensive reconfiguration for new tasks.

The study underscores the benefits of incorporating model-based representations into model-free RL algorithms. MR.Q marks a step toward developing a truly versatile RL framework by enhancing efficiency and adaptability. Future advancements could refine its approach to address challenges such as hard exploration problems and non-Markovian environments. The findings contribute to the broader goal of making RL techniques more accessible and effective for many applications, positioning MR.Q as a promising tool for researchers and practitioners seeking robust RL solutions.

Check out the Paper. All credit for this research goes to the researchers of this project.


Optimization Using FP4 Quantization For Ultra-Low Precision Language M …

Large Language Models (LLMs) have emerged as transformative tools in research and industry, with their performance directly correlating to model size. However, training these massive models presents significant challenges related to computational resources, time, and cost. The training process for state-of-the-art models like Llama 3 405B requires extensive hardware infrastructure, utilizing up to 16,000 H100 GPUs over 54 days. Similarly, models like GPT-4, estimated to have one trillion parameters, demand extraordinary computational power. These resource requirements create barriers to entry and development in the field, highlighting the critical need for more efficient training methodologies for advancing LLM technology while reducing the associated computational burden.

Various approaches have been explored to address the computational challenges in LLM training and inference. Mixed Precision Training has been widely adopted to accelerate model training while maintaining accuracy, initially focusing on CNNs and DNNs before expanding to LLMs. For inference optimization, Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) have achieved significant compression using 4-bit, 2-bit, and even 1-bit quantization. While differentiable quantization techniques have been proposed using learnable parameters updated through backpropagation, they face limitations in handling activation outliers effectively. Existing solutions for managing outliers depend on offline pre-processing methods, making them impractical for direct application in training scenarios.

Researchers from the University of Science and Technology of China, Microsoft SIGMA Team, and Microsoft Research Asia have proposed a framework for training language models using the FP4 format, marking the first comprehensive validation of this ultra-low precision representation. The framework addresses quantization errors through two key innovations: 

A differentiable quantization estimator for weights that enhances gradient updates in FP4 computations by incorporating correction terms

An outlier handling mechanism for activations that combines clamping with a sparse auxiliary matrix. 

These techniques help to maintain model performance while enabling efficient training in ultra-low precision formats, representing a significant advancement in efficient LLM training.

The framework primarily targets General Matrix Multiplication (GeMM) operations, which account for over 95% of LLM training computation. The architecture implements 4-bit quantization for GeMM operations using distinct quantization approaches: token-wise quantization for activation tensors and channel-wise quantization for weight tensors. Due to hardware limitations, the system’s performance is validated using Nvidia H-series GPUs’ FP8 Tensor Cores, which can accurately simulate FP4’s dynamic range. The framework employs FP8 gradient communication and a mixed-precision Adam optimizer for memory efficiency. The system was validated using the LLaMA 2 architecture, trained from scratch on the DCLM dataset, with carefully tuned hyperparameters including a warm-up and cosine decay learning rate schedule, and specific parameters for the FP4 method’s unique components.
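The sketch below illustrates, under simplifying assumptions, the quantization granularity described above: per-token scaling for activations and per-channel scaling for weights, with values rounded to the representable FP4 (E2M1) magnitude levels. It is a plain-PyTorch simulation for intuition, not the paper's actual kernels, differentiable estimator, or outlier-handling mechanism.

```python
import torch

# Representable magnitudes of an E2M1 (FP4) format; an assumption for illustration.
FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_to_fp4(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Simulated FP4 quantization with absmax scaling along `dim`."""
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / FP4_LEVELS.max()
    scaled = x / scale
    # Round each magnitude to the nearest representable FP4 level, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_LEVELS).abs().argmin(dim=-1)
    return torch.sign(scaled) * FP4_LEVELS[idx] * scale

activations = torch.randn(8, 1024)            # (tokens, hidden)
weights = torch.randn(1024, 4096)             # (in_features, out_features)

act_q = quantize_to_fp4(activations, dim=-1)  # token-wise scaling
w_q = quantize_to_fp4(weights, dim=0)         # channel-wise scaling (per output column)

output = act_q @ w_q                          # simulated low-precision GeMM
```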

The proposed FP4 training framework shows that training curves for LLaMA models of 1.3B, 7B, and 13B parameters have similar patterns between FP4 and BF16 implementations, with FP4 showing marginally higher training losses: 2.55 vs. 2.49 (1.3B), 2.17 vs. 2.07 (7B), and 1.97 vs. 1.88 (13B) after 100B tokens of training. Zero-shot evaluations across diverse downstream tasks, including Arc, BoolQ, HellaSwag, LogiQA, PiQA, SciQ, OpenbookQA, and Lambada, reveal that FP4-trained models achieve competitive or occasionally superior performance compared to their BF16 counterparts. The results demonstrate that larger models achieve higher accuracy, validating the scalability of the FP4 training approach.

In conclusion, researchers have successfully developed and validated the first FP4 pretraining framework for LLMs, marking a significant advancement in ultra-low-precision computing. The framework achieves performance comparable to higher-precision formats across various model scales through innovative solutions like the differentiable gradient estimator and outlier compensation mechanism. However, the current implementation faces a notable limitation: the lack of dedicated FP4 Tensor Cores in existing hardware necessitates simulation-based testing, which introduces computational overhead and prevents direct measurement of potential efficiency gains. This limitation underscores the need for hardware advancement to realize the benefits of FP4 computation.

Check out the Paper. All credit for this research goes to the researchers of this project.


TensorLLM: Enhancing Reasoning and Efficiency in Large Language Models …

LLMs based on transformer architectures, such as GPT and LLaMA series, have excelled in NLP tasks due to their extensive parameterization and large training datasets. However, research indicates that not all learned parameters are necessary to retain performance, prompting the development of post-training compression techniques to enhance efficiency without significantly reducing inference quality. For example, the LASER model uses singular value decomposition (SVD) to compress feedforward network (FFN) weight matrices by removing factors with minimal singular values, reducing weight noise from training. However, LASER only targets individual weight matrices, limiting its ability to utilize shared information between them.

Researchers at Imperial College London introduced a novel framework to enhance the reasoning abilities of LLMs by compressing the Multi-Head Attention (MHA) block through multi-head tensorisation and Tucker decomposition. This approach enforces a shared higher-dimensional subspace across attention heads, enabling structured denoising and compression, with compression rates reaching up to 250x without requiring additional data or fine-tuning. Unlike existing methods focused on FFN weights, this method addresses MHA limitations by leveraging domain knowledge about attention heads’ shared and specialized roles. Extensive tests on benchmark datasets demonstrated improved reasoning in both encoder and decoder architectures, alongside compatibility with FFN-based techniques.

The study adopts the mathematical notation commonly used in previous works, with distinct symbols for scalars, vectors, matrices, and tensors. Operations such as matrix transpose, Frobenius norm, and tensor mode-n product are defined for computational tasks. Tensors, which are multidimensional arrays, extend from simple scalars (0D) to higher dimensions by stacking lower-dimensional structures. The mode-n product links a tensor and a matrix along a specific dimension. SVD decomposes matrices into rank-1 components, enabling noise reduction through low-rank approximations by discarding insignificant values. Tucker decomposition extends SVD to tensors, breaking them into smaller core tensors and factor matrices, which aids in efficient data representation and dimensionality reduction.

The study proposes a method to reshape MHA weight matrices in transformers into 3D tensors instead of the conventional 2D format. Tucker decomposition decomposes these tensors into core tensors and shared factor matrices across all attention heads within a transformer layer. This technique ensures that attention heads function within the same subspace, improving reasoning capabilities and reducing noise. Compared to existing methods such as LASER and TRAWL, this approach leverages shared low-rank structures to enhance performance and efficiency while reducing the number of parameters.
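A minimal sketch of this idea, assuming the TensorLy library and toy dimensions, is shown below: per-head attention weight matrices are stacked into a 3D tensor, Tucker-decomposed into a small core with factor matrices shared across modes, and reconstructed as a denoised, low-rank approximation. The ranks and shapes are illustrative choices, not the paper's settings.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("numpy")

# Toy MHA weights: 12 heads, each a 64 x 768 projection matrix (illustrative sizes).
num_heads, head_dim, hidden_dim = 12, 64, 768
per_head_weights = [np.random.randn(head_dim, hidden_dim) for _ in range(num_heads)]

# Reshape the 2D per-head matrices into a single 3D tensor: (heads, head_dim, hidden_dim).
weight_tensor = tl.tensor(np.stack(per_head_weights, axis=0))

# Tucker decomposition: a small core tensor plus factor matrices shared across modes.
core, factors = tucker(weight_tensor, rank=[num_heads, 16, 64])

# Reconstruct a denoised, low-rank approximation of the stacked weights.
denoised = tl.tucker_to_tensor((core, factors))
print("relative error:", np.linalg.norm(denoised - weight_tensor) / np.linalg.norm(weight_tensor))
```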

Extensive experiments validated the proposed framework on four benchmark reasoning datasets using three LLMs: RoBERTa, GPT-J, and LLaMA2, encompassing both encoder-only and decoder-only architectures. The framework, applied selectively to transformer layers, significantly enhanced reasoning abilities while achieving parameter compression. Results showed compatibility with FFN-only compression methods like LASER, achieving improved accuracy and loss reduction. A hybrid approach combining LASER and the proposed method usually yielded the best performance. Ablation studies confirmed the effectiveness of compressing all MHA weights together, outperforming separate compression of query, key, value, and output weights, further validating the framework’s design.

In conclusion, the study introduced a framework to enhance reasoning in LLMs while achieving significant parameter compression. By leveraging domain knowledge about MHA and employing a unique multi-head tensorisation with Tucker decomposition, the framework denoises MHA weights and encodes diverse information within a shared higher-dimensional subspace. This approach improves reasoning in encoder-only and decoder-only LLMs with up to 250x compression, requiring no additional training or fine-tuning. The method can also complement FFN-based denoising techniques for further gains. While hyperparameter tuning varies across datasets, future work will focus on developing generalizable settings for broader applicability.

Check out the Paper. All credit for this research goes to the researchers of this project.


Deploy DeepSeek-R1 Distilled Llama models in Amazon Bedrock

Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end-users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.
In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.
DeepSeek-R1 distilled variations
From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta’s Llama and Qwen architectures, ranging from 1.5–70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher—essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain closer performance to the original while still offering meaningful efficiency gains.
Solution overview
In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repo, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.

In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.
You can deploy through the Amazon Bedrock console by following the instructions in this post, or alternatively deploy programmatically with the Amazon Bedrock SDK by using the accompanying notebook.
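For the programmatic path, a hedged sketch of starting an import job with Boto3 is shown below. The bucket path, IAM role ARN, and names are placeholders you would replace, and the exact parameters should be confirmed against the current Amazon Bedrock Custom Model Import documentation.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholders: replace with your own S3 path, IAM role, and naming scheme.
response = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import-v1",
    importedModelName="deepseek-r1-distill-llama-8b-v1",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://your-bucket/folder-with-model-artifacts/"}
    },
)
print("Import job ARN:", response["jobArn"])
```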
Prerequisites
You should have the following prerequisites:

An AWS account with access to Amazon Bedrock.
Appropriate AWS Identity and Access Management (IAM) roles and permissions for Amazon Bedrock and Amazon S3. For more information, see Create a service role for model import.
An S3 bucket prepared to store the custom model. For more information, see Creating a bucket.
Sufficient local storage space, at least 17 GB for the 8B model or 135 GB for the 70B model.

Prepare the model package
Complete the following steps to prepare the model package:

Download the DeepSeek-R1-Distill-Llama model artifacts from Hugging Face, from one of the following links, depending on the model you want to deploy:

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/tree/main
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B/tree/main

For more information, you can follow Hugging Face’s Downloading models or Download files from the hub instructions.
You typically need the following files:

Model configuration file: config.json
Tokenizer files: tokenizer.json, tokenizer_config.json, and tokenizer.model
Model weights files in .safetensors format

Upload these files to a folder in your S3 bucket, in the same AWS Region where you plan to use Amazon Bedrock. Take note of the S3 path you’re using.

Import the model
Complete the following steps to import the model:

On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.

Choose Import model.

For Model name, enter a name for your model (it’s recommended to use a versioning scheme in your name, for tracking your imported model).
For Import job name, enter a name for your import job.
For Model import settings, select Amazon S3 bucket as your import source, and enter the S3 path you noted earlier (provide the full path in the form s3://<your-bucket>/folder-with-model-artifacts/).
For Encryption, optionally choose to customize your encryption settings.
For Service access role, choose to either create a new IAM role or provide your own.
Choose Import model.

Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).

Watch this video demo for a step-by-step guide.

Test the imported model
After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:

On the Amazon Bedrock console, choose Chat / Text under Playgrounds in the navigation pane.
From the model selector, choose your imported model name.
Adjust the inference parameters as needed and write your test prompt. For example: <|begin▁of▁sentence|><|User|>Given the following financial data: – Company A’s revenue grew from $10M to $15M in 2023 – Operating costs increased by 20% – Initial operating costs were $7M Calculate the company’s operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>

As we’re using an imported model in the playground, we must include the “beginning_of_sentence” and “user/assistant” tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses. If you’re following the programmatic approach in the accompanying notebook, this formatting is handled automatically when the model is configured (see the invocation sketch after these steps).

Review the model response and metrics provided.
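Programmatically, the same prompt format can be sent to the imported model through the Amazon Bedrock runtime InvokeModel API. The sketch below is an assumption-laden illustration: the model ARN is a placeholder, and the request body follows the Llama-style prompt and max_gen_len fields that Llama-architecture imports typically accept; check the model's documentation for the exact schema.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN of the imported model (shown on the Imported models page).
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"

prompt = (
    "<|begin▁of▁sentence|><|User|>Calculate the operating margin for a company "
    "whose revenue grew from $10M to $15M in 2023 while operating costs of $7M "
    "increased by 20%. Reason step by step.<|Assistant|>"
)

# Assumed Llama-style body; field names may differ for other architectures.
body = {"prompt": prompt, "max_gen_len": 1024, "temperature": 0.6, "top_p": 0.9}

response = bedrock_runtime.invoke_model(modelId=model_arn, body=json.dumps(body))
print(json.loads(response["body"].read()))
```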

Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The restoration time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
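As an example of customizing that retry behavior, the following sketch configures the Boto3 client with a botocore Config object; the retry count and mode are arbitrary example values, not recommended settings.

```python
import boto3
from botocore.config import Config

# Retry up to 10 times with the SDK's standard backoff strategy (example values).
retry_config = Config(retries={"max_attempts": 10, "mode": "standard"})

bedrock_runtime = boto3.client(
    "bedrock-runtime", region_name="us-east-1", config=retry_config
)
```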
When you’re ready to import your own model, this step-by-step video demo can help you get started.
Pricing
Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import does not charge for the import itself; you are charged for inference based on two factors: the number of active model copies and their duration of activity.
Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The Custom Model Units required for hosting depend on the model’s architecture, parameter count, and context length, with examples ranging from 2 Units for a Llama 3.1 8B 128K model to 8 Units for a Llama 3.1 70B 128K model.
Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
Consider the following pricing example: an application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570, and the model storage costs will be $3.90 for the month.
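To make the arithmetic concrete, the snippet below computes the cost for a hypothetical usage pattern under the figures quoted above (a single model copy active for a total of 10 hours during the month). Actual bills depend on your real invocation pattern and current pricing.

```python
price_per_copy_per_minute = 0.1570   # quoted example rate for 2 Custom Model Units
storage_per_month = 3.90             # quoted example storage cost

# Hypothetical usage: one model copy active for 10 hours total over the month.
active_minutes = 10 * 60
inference_cost = active_minutes * price_per_copy_per_minute

total = inference_cost + storage_per_month
print(f"Inference: ${inference_cost:.2f}, storage: ${storage_per_month:.2f}, total: ${total:.2f}")
# Inference: $94.20, storage: $3.90, total: $98.10
```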
For more information, see Amazon Bedrock pricing.
Benchmarks
DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80% and 90% of the original model’s reasoning capabilities, while the 8B version achieves between 59% and 92% of that performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models in specific reasoning tasks.
Other considerations
When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:

Model versioning is essential. Because Custom Model Import creates unique models for each import, implement a clear versioning strategy in your model names to track different versions and variations.
The current supported model formats focus on Llama-based architectures. Although DeepSeek-R1 distilled versions offer excellent performance, the AI ecosystem continues evolving rapidly. Keep an eye on the Amazon Bedrock model catalog as new architectures and larger models become available through the platform.
Evaluate your use case requirements carefully. Although larger models like DeepSeek-R1-Distill-Llama-70B provide better performance, the 8B version might offer sufficient capability for many applications at a lower cost.
Consider implementing monitoring and observability. Amazon CloudWatch provides metrics for your imported models, helping you track usage patterns and performance. You can monitor costs with AWS Cost Explorer.
Start with a lower concurrency quota and scale up based on actual usage patterns. The default limit of three concurrent model copies per account is suitable for most initial deployments.

Conclusion
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek’s innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are crucial. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI strategies that can adapt as their requirements evolve.
For more information, refer to the Amazon Bedrock User Guide.

About the Authors
Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Morgan Rankey is a Solutions Architect based in New York City, specializing in Hedge Funds. He excels in assisting customers to build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He began his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.
Harsh Patel is an AWS Solutions Architect supporting 200+ SMB customers across the United States to drive digital transformation through cloud-native solutions. As an AI&ML Specialist, he focuses on Generative AI, Computer Vision, Reinforcement Learning and Anomaly Detection. Outside the tech world, he recharges by hitting the golf course and embarking on scenic hikes with his dog.

Generative AI operating models in enterprise organizations with Amazon …

Generative AI can revolutionize organizations by enabling the creation of innovative applications that offer enhanced customer and employee experiences. Intelligent document processing, translation and summarization, flexible and insightful responses for customer support agents, personalized marketing content, and image and code generation are a few use cases using generative AI that organizations are rolling out in production.
Large organizations often have many business units with multiple lines of business (LOBs), with a central governing entity, and typically use AWS Organizations with an Amazon Web Services (AWS) multi-account strategy. They implement landing zones to automate secure account creation and streamline management across accounts, including logging, monitoring, and auditing. Although LOBs operate their own accounts and workloads, a central team, such as the Cloud Center of Excellence (CCoE), manages identity, guardrails, and access policies.
As generative AI adoption grows, organizations should establish a generative AI operating model. An operating model defines the organizational design, core processes, technologies, roles and responsibilities, governance structures, and financial models that drive a business’s operations.
In this post, we evaluate different generative AI operating model architectures that could be adopted.
Operating model patterns
Organizations can adopt different operating models for generative AI, depending on their priorities around agility, governance, and centralized control. Governance in the context of generative AI refers to the frameworks, policies, and processes that streamline the responsible development, deployment, and use of these technologies. It encompasses a range of measures aimed at mitigating risks, promoting accountability, and aligning generative AI systems with ethical principles and organizational objectives. Three common operating model patterns are decentralized, centralized, and federated, as shown in the following diagram.

Decentralized model
In a decentralized approach, generative AI development and deployment are initiated and managed by the individual LOBs themselves. LOBs have autonomy over their AI workflows, models, and data within their respective AWS accounts.
This enables faster time-to-market and agility because LOBs can rapidly experiment and roll out generative AI solutions tailored to their needs. However, even in a decentralized model, LOBs must often align with central governance controls and obtain approvals from the CCoE team for production deployment, adhering to global enterprise standards for areas such as access policies, model risk management, data privacy, and compliance posture, which can introduce governance complexities.
Centralized model
In a centralized operating model, all generative AI activities go through a central generative artificial intelligence and machine learning (AI/ML) team that provisions and manages end-to-end AI workflows, models, and data across the enterprise.
LOBs interact with the central team for their AI needs, trading off agility and potentially increased time-to-market for stronger top-down governance. A centralized model may introduce bottlenecks that slow down time-to-market, so organizations need to adequately resource the team with sufficient personnel and automated processes to meet the demand from various LOBs efficiently. Failure to scale the team can negate the governance benefits of a centralized approach.
Federated model
A federated model strikes a balance by having key activities of the generative AI processes managed by a central generative AI/ML platform team.
While LOBs drive their AI use cases, the central team governs guardrails, model risk management, data privacy, and compliance posture. This enables agile LOB innovation while providing centralized oversight on governance areas.
Generative AI architecture components
Before diving deeper into the common operating model patterns, this section provides a brief overview of a few components and AWS services used in the featured architectures.
Large language models
Large language models (LLMs) are large-scale ML models that contain billions of parameters and are pre-trained on vast amounts of data. LLMs may hallucinate, which means a model can provide a confident but factually incorrect response. Furthermore, the data that the model was trained on might be out of date, which can also lead to inaccurate responses. One way to keep LLMs from giving incorrect information is to use a technique known as Retrieval Augmented Generation (RAG). RAG is an advanced natural language processing technique that combines the power of pre-trained language models with a retrieval-based approach to generate more informed and accurate responses. To set up RAG, you need to have a vector database to provide your model with related source documents. Using RAG, the relevant document segments or other texts are retrieved and shared with LLMs to generate targeted responses with enhanced content quality and relevance.
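The following is a deliberately simplified sketch of the RAG flow just described: embed the documents and the question, retrieve the nearest chunks by cosine similarity, and prepend them to the prompt. The toy hashing "embedder" is a stand-in for a real embedding model (such as one available through Amazon Bedrock), and the in-memory list stands in for a production vector database.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedder: hashed bag-of-words. Replace with a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# In-memory "vector store": (chunk text, embedding) pairs.
chunks = [
    "Our travel policy reimburses economy airfare for trips under six hours.",
    "Expense reports must be filed within 30 days of travel.",
    "Security reviews are required before deploying new workloads.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 2) -> list[str]:
    q = toy_embed(question)
    scored = sorted(store, key=lambda item: float(q @ item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

question = "How soon do I need to submit my expense report?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # This prompt would then be sent to an LLM, for example through Amazon Bedrock.
```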
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon using a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
Amazon SageMaker JumpStart provides access to proprietary FMs from third-party providers such as AI21 Labs, Cohere, and LightOn. In addition, Amazon SageMaker JumpStart onboards and maintains open source FMs from third-party sources such as Hugging Face.
Data sources, embeddings, and vector store
Organizations’ domain-specific data, which provides context and relevance, typically resides in internal databases, data lakes, unstructured data repositories, or document stores, collectively referred to as organizational data sources or proprietary data stores.
A vector store is a system you can use to store and query vectors at scale, with efficient nearest neighbor query algorithms and appropriate indexes to improve data retrieval. It stores not only the embeddings of an organization’s data (mathematical representations of the data in the form of vectors) but also the raw text of the data in chunks. These embeddings are generated by specialized embedding LLMs, which process the organization’s text chunks into numerical representations that are stored alongside the chunks in the vector store. For a comprehensive read about vector stores and embeddings, you can refer to The role of vector databases in generative AI applications.
With Amazon Bedrock Knowledge Bases, you securely connect FMs in Amazon Bedrock to your company data for RAG. Amazon Bedrock Knowledge Bases facilitates data ingestion from various supported data sources; manages data chunking, parsing, and embeddings; and populates the vector store with the embeddings. With all that provided as a service, you can think of Amazon Bedrock Knowledge Bases as a fully managed and serverless option to build powerful conversational AI systems using RAG.
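As a brief illustration, an application can query a knowledge base and generate a grounded answer in one call through the Bedrock agent runtime's RetrieveAndGenerate API; the knowledge base ID and model ARN below are placeholders for this sketch.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our parental leave policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])
```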
Guardrails
Content filtering mechanisms are implemented as safeguards to control user-AI interactions, aligning with application requirements and responsible AI policies by minimizing undesirable and harmful content. Guardrails can check user inputs and FM outputs and filter or deny topics that are unsafe, redact personally identifiable information (PII), and enhance content safety and privacy in generative AI applications.
Amazon Bedrock Guardrails is a feature of Amazon Bedrock that you can use to put safeguards in place. You determine what qualifies based on your company policies. These safeguards are FM agnostic. You can create multiple guardrails with different configurations tailored to specific use cases. For a review on Amazon Bedrock Guardrails, you can refer to these blog posts: Guardrails for Amazon Bedrock helps implement safeguards customized to your use cases and responsible AI policies and Guardrails for Amazon Bedrock now available with new safety filters and privacy controls.
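For example, a guardrail created in Amazon Bedrock can be applied to a Converse API call by referencing its identifier and version, as in the hedged sketch below (the guardrail identifier and version are placeholders for resources created in your account).

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    # Placeholder guardrail identifier and version created in your account.
    guardrailConfig={"guardrailIdentifier": "gr-abc123", "guardrailVersion": "1"},
)
print(response["output"]["message"]["content"][0]["text"])
```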
Operating model architectures
This section provides an overview of the three kinds of operating models.
Decentralized operating model
In a decentralized operating model, LOB teams maintain control and ownership of their AWS accounts. Each LOB configures and orchestrates generative AI components, common functionalities, applications, and Amazon Bedrock configurations within their respective AWS accounts. This model empowers LOBs to tailor their generative AI solutions according to their specific requirements, while taking advantage of the power of Amazon Bedrock.
With this model, the LOBs configure the core components, such as LLMs and guardrails, and the Amazon Bedrock service account manages the hosting, execution, and provisioning of interface endpoints. These endpoints enable LOBs to access and interact with the Amazon Bedrock services they’ve configured.
Each LOB performs monitoring and auditing of their configured Amazon Bedrock services within their account, using Amazon CloudWatch Logs and AWS CloudTrail for log capture, analysis, and auditing tailored to their needs. Amazon Bedrock cost and usage will be recorded in each LOB’s AWS accounts. By adopting this decentralized model, LOBs retain control over their generative AI solutions through a decentralized configuration, while benefiting from the scalability, reliability, and security of Amazon Bedrock.
The following diagram shows the architecture of the decentralized operating model.

Centralized operating model
The centralized AWS account serves as the primary hub for configuring and managing the core generative AI functionalities, including reusable agents, prompt flows, and shared libraries. LOB teams contribute their business-specific requirements and use cases to the centralized team, which then integrates and orchestrates the appropriate generative AI components within the centralized account.
Although the orchestration and configuration of generative AI solutions reside in the centralized account, they often require interaction with LOB-specific resources and services. To facilitate this, the centralized account uses API gateways or other integration points provided by the LOBs’ AWS accounts. These integration points enable secure and controlled communication between the centralized generative AI orchestration and the LOBs’ business-specific applications, data sources, or services. This centralized operating model promotes consistency, governance, and scalability of generative AI solutions across the organization.
The centralized team maintains adherence to common standards, best practices, and organizational policies, while also enabling efficient sharing and reuse of generative AI components. Furthermore, the core components of Amazon Bedrock, such as LLMs and guardrails, continue to be hosted and executed by AWS in the Amazon Bedrock service account, promoting secure, scalable, and high-performance execution environments for these critical components. In this centralized model, monitoring and auditing of Amazon Bedrock can be achieved within the centralized account, allowing for comprehensive monitoring, auditing, and analysis of all generative AI activities and configurations. Amazon CloudWatch Logs provides a unified view of generative AI operations across the organization.
By consolidating the orchestration and configuration of generative AI solutions in a centralized account while enabling secure integration with LOB-specific resources, this operating model promotes standardization, governance, and centralized control over generative AI operations. It uses the scalability, reliability, security, and centralized monitoring capabilities of AWS managed infrastructure and services, while still allowing for integration with LOB-specific requirements and use cases.
The following is the architecture for a centralized operating model.

Federated operating model
In a federated model, Amazon Bedrock enables a collaborative approach where LOB teams can develop and contribute common generative AI functionalities within their respective AWS accounts. These common functionalities, such as reusable agents, prompt flows, or shared libraries, can then be migrated to a centralized AWS account managed by a dedicated team or CCoE.
The centralized AWS account acts as a hub for integrating and orchestrating these common generative AI components, providing a unified platform for action groups and prompt flows. Although the orchestration and configuration of generative AI solutions remain within the LOBs’ AWS accounts, they can use the centralized Amazon Bedrock agents, prompt flows, and other shared components defined in the centralized account.
This federated model allows LOBs to retain control over their generative AI solutions, tailoring them to specific business requirements while benefiting from the reusable and centrally managed components. The centralized account maintains consistency, governance, and scalability of these shared generative AI components, promoting collaboration and standardization across the organization.
Organizations frequently prefer storing sensitive data, including Payment Card Industry (PCI), PII, General Data Protection Regulation (GDPR), and Health Insurance Portability and Accountability Act (HIPAA) information, within their respective LOB AWS accounts. This approach makes sure that LOBs maintain control over their sensitive business data in the vector store while preventing centralized teams from accessing it without proper governance and security measures.
A federated model combines decentralized development, centralized integration, and centralized monitoring. This operating model fosters collaboration, reusability, and standardization while empowering LOBs to retain control over their generative AI solutions. It uses the scalability, reliability, security, and centralized monitoring capabilities of AWS managed infrastructure and services, promoting a harmonious balance between autonomy and governance.
The following is the architecture for a federated operating model.

Cost management
Organizations may want to analyze Amazon Bedrock usage and costs per LOB. To track the cost and usage of FMs across LOBs’ AWS accounts, solutions that record model invocations per LOB can be implemented.
Amazon Bedrock now supports model invocation resources that use inference profiles. Inference profiles can be defined to track Amazon Bedrock usage metrics, monitor model invocation requests, or route model invocation requests to multiple AWS Regions for increased throughput.
There are two types of inference profiles. Cross-Region inference profiles are predefined in Amazon Bedrock and include multiple AWS Regions to which requests for a model can be routed. Application inference profiles are user created to track cost and model usage when submitting on-demand model invocation requests. You can attach custom tags, such as cost allocation tags, to your application inference profiles. When submitting a prompt, you can include an inference profile ID or its Amazon Resource Name (ARN). This capability enables organizations to track and monitor costs for various LOBs, cost centers, or applications. For a detailed explanation of application inference profiles, refer to this post: Track, allocate, and manage your generative AI cost and usage with Amazon Bedrock.
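A hedged sketch of this pattern is shown below: create an application inference profile that copies a model (or cross-Region profile) ARN, tag it for cost allocation, and pass the profile ARN as the model ID at invocation time. The parameter names and the tag key are assumptions to verify against the current Amazon Bedrock API reference.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create an application inference profile tagged for a specific line of business
# (parameter names and tag key are assumptions for illustration).
profile = bedrock.create_inference_profile(
    inferenceProfileName="claims-lob-claude-haiku",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
    },
    tags=[{"key": "cost-center", "value": "claims-lob"}],
)

# Invoke through the profile so usage and cost roll up to the tagged profile.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.converse(
    modelId=profile["inferenceProfileArn"],
    messages=[{"role": "user", "content": [{"text": "Draft a claims status update."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```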
Conclusion
Although enterprises often begin with a centralized operating model, the rapid pace of development in generative AI technologies, the need for agility, and the desire to quickly capture value often lead organizations to converge on a federated operating model.
In a federated operating model, lines of business have the freedom to innovate and experiment with generative AI solutions, taking advantage of their domain expertise and proximity to business problems. Key aspects of the AI workflow, such as data access policies, model risk management, and compliance monitoring, are managed by a central cloud governance team. Successful generative AI solutions developed by a line of business can be promoted and productionized by the central team for enterprise-wide re-use.
This federated model fosters innovation from the lines of business closest to domain problems. Simultaneously, it allows the central team to curate, harden, and scale those solutions adherent to organizational policies, then redeploy them efficiently to other relevant areas of the business.
To sustain this operating model, enterprises often establish a dedicated product team with a business owner that works in partnership with lines of business. This team is responsible for continually evolving the operating model, refactoring and enhancing the generative AI services to help meet the changing needs of the lines of business and keep up with the rapid advancements in LLMs and other generative AI technologies.
Federated operating models strike a balance, mitigating the risks of fully decentralized initiatives while minimizing bottlenecks from overly centralized approaches. By empowering business agility with curation by a central team, enterprises can accelerate compliant, high-quality generative AI capabilities aligned with their innovation goals, risk tolerances, and need for rapid value delivery in the evolving AI landscape.
As enterprises look to capitalize on the generative AI revolution, Amazon Bedrock provides the ideal foundation to establish a flexible operating model tailored to their organization’s needs. Whether you’re starting with a centralized, decentralized, or federated approach, AWS offers a comprehensive suite of services to support the full generative AI lifecycle.
Try Amazon Bedrock and let us know your feedback on how you’re planning to implement the operating model that suits your organization.

About the Authors
Martin Tunstall is a Principal Solutions Architect at AWS. With over three decades of experience in the finance sector, he helps global finance and insurance customers unlock the full potential of Amazon Web Services (AWS).
Yashar Araghi is a Senior Solutions Architect at AWS. He has over 20 years of experience designing and building infrastructure and application security solutions. He has worked with customers across various industries such as government, education, finance, energy, and utilities. In the last 6 years at AWS, Yashar has helped customers design, build, and operate their cloud solutions that are secure, reliable, performant and cost optimized in the AWS Cloud.

ByteDance Introduces UI-TARS: A Native GUI Agent Model that Integrates Perception, Action, Reasoning, and Memory into a Scalable and Adaptive Framework

GUI agents seek to perform real tasks in digital environments by understanding and interacting with graphical interfaces such as buttons and text boxes. The biggest open challenges lie in enabling agents to process complex, evolving interfaces, plan effective actions, and execute precision tasks such as finding clickable areas or filling text boxes. These agents also need memory systems to recall past actions and adapt to new scenarios. A significant problem for modern, unified end-to-end models is the lack of high-quality data that integrates perception, reasoning, and action within seamless workflows; without such data, these systems struggle to adapt to diverse, dynamic environments and to scale.

Current approaches to GUI agents depend heavily on predefined rules, frameworks, and human involvement, which makes them inflexible and hard to scale. Rule-based agents, like Robotic Process Automation (RPA), operate in structured environments using human-defined heuristics and require direct access to systems, making them unsuitable for dynamic or restricted interfaces. Framework-based agents use foundation models like GPT-4 for multi-step reasoning but still depend on manual workflows, prompts, and external scripts. These methods are fragile, need constant updates for evolving tasks, and lack seamless integration of learning from real-world interactions. Native agent models try to bring perception, reasoning, memory, and action together under one roof, reducing human engineering through end-to-end learning. Still, these models rely on curated data and training guidance, which limits their adaptability. These approaches do not allow agents to learn autonomously, adapt efficiently, or handle unpredictable scenarios without manual intervention.

To address these challenges in GUI agent development, researchers from ByteDance Seed and Tsinghua University proposed the UI-TARS framework to advance native GUI agent models. It integrates enhanced perception, unified action modeling, advanced reasoning, and iterative training, which helps reduce human intervention while improving generalization. The framework enables detailed understanding through precise captioning of interface elements, trained on a large dataset of GUI screenshots. It introduces a unified action space to standardize interactions across platforms and uses extensive action traces to enhance multi-step execution. UI-TARS also incorporates System-2 reasoning for deliberate decision-making and iteratively refines its capabilities through online interaction traces.

The researchers designed the framework around several key principles. Enhanced perception ensures that GUI elements are recognized accurately, using curated datasets for tasks such as element description and dense captioning. Unified action modeling links element descriptions with spatial coordinates to achieve precise grounding. System-2 reasoning incorporates diverse logical patterns and explicit thought processes to guide deliberate actions. Iterative training supports dynamic data gathering, interaction refinement, error identification, and adaptation through reflection tuning, enabling robust and scalable learning with less human involvement.
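To make the idea of a unified action space concrete, here is a toy sketch in Python; the primitives, field names, and the ground() helper are hypothetical illustrations, not the paper's actual schema.

from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical unified action space: web, desktop, and mobile interactions all
# reduce to the same small set of primitives, so a single policy can act anywhere.
@dataclass
class Click:
    x: float  # normalized [0, 1] screen coordinates produced by grounding
    y: float

@dataclass
class TypeText:
    text: str

@dataclass
class Scroll:
    direction: str  # "up" or "down"

Action = Union[Click, TypeText, Scroll]

def ground(element_description: str) -> Optional[Click]:
    """Map a natural-language element description (for example, 'the blue Submit
    button') to screen coordinates. In UI-TARS this mapping is learned end to end;
    here it is only a placeholder."""
    raise NotImplementedError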

The researchers tested UI-TARS, trained on a corpus of about 50B tokens, along various axes, including perception, grounding, and agent capabilities. The model was developed in three variants, UI-TARS-2B, UI-TARS-7B, and UI-TARS-72B, with extensive experiments validating their advantages. Compared to baselines like GPT-4o and Claude-3.5, UI-TARS performed better on perception benchmarks such as VisualWebBench and WebSRC. It outperformed models like UGround-V1-7B in grounding across multiple datasets, demonstrating robust capabilities in high-complexity scenarios. On agent tasks, UI-TARS excelled in Multimodal Mind2Web and Android Control, as well as in environments like OSWorld and AndroidWorld. The results highlighted the importance of both System-1 and System-2 reasoning, with System-2 reasoning proving beneficial in diverse, real-world scenarios, although it required multiple candidate outputs for optimal performance. Scaling the model size improved reasoning and decision-making, particularly in online tasks.

In conclusion, the proposed method, UI-TARS, advances GUI automation by integrating enhanced perception, unified action modeling, system-2 reasoning, and iterative training. It achieves state-of-the-art performance, surpassing previous systems like Claude and GPT-4o, and effectively handles complex GUI tasks with minimal human oversight. This work establishes a strong baseline for future research, particularly in active and lifelong learning areas, where agents can autonomously improve through continuous real-world interactions, paving the way for further advancements in GUI automation.

Microsoft AI Introduces CoRAG (Chain-of-Retrieval Augmented Generation): An AI Framework for Iterative Retrieval and Reasoning in Knowledge-Intensive Tasks

Retrieval-Augmented Generation (RAG) is a key technique in enterprise applications that combines large foundation models with external retrieval systems to generate responses that are both accurate and grounded in factual information. Unlike traditional foundation models, which are trained on massive datasets and remain static post-deployment, RAG enhances reliability by incorporating real-time or domain-specific information during the generation process. This integration addresses common issues like hallucinations or gaps in long-tail factual knowledge. RAG systems typically follow a sequential pipeline where retrieved information is provided as input to the generative model, with the overall performance depending heavily on the quality of the retrieval process. To ensure scalability, dense retrievers often use bi-encoder architectures for compressing documents and queries into fixed-size vectors, enabling efficient search algorithms. However, this efficiency comes at the cost of reduced flexibility for handling complex or multi-hop queries, which require iterative reasoning and retrieval steps based on dynamically evolving information.

Recent advancements in RAG have introduced iterative retrieval-generation methods to overcome the limitations of a single retrieval step. Approaches like FLARE and ITER-RETGEN enable models to decide when and what to retrieve during generation, enhancing performance in complex reasoning tasks. Methods like IRCoT adopt chain-of-thought reasoning, refining retrieval steps recursively, while Self-RAG integrates retrieval, generation, and critique for improved factual accuracy. Scaling test-time computing has also been explored to boost RAG performance, with strategies such as retrieving more documents or using long-context LLMs, as seen in LongRAG and IterDRAG. Tree-of-Thought (ToT) and STaR extend reasoning capabilities by leveraging structured exploration and intermediate training states, though these approaches increase token consumption and response latency. Newer methods, like Search-o1, integrate open-source models with active search mechanisms, further advancing RAG’s potential in knowledge-intensive tasks.

Researchers from Microsoft Corporation and the Renmin University of China introduced CoRAG (Chain-of-Retrieval Augmented Generation), a method for training RAG models to iteratively retrieve and reason before generating answers. Unlike conventional RAG systems, CoRAG dynamically reformulates queries based on the evolving reasoning state. The approach uses rejection sampling to augment datasets with intermediate retrieval chains, enabling fine-tuning of open-source models. CoRAG achieves state-of-the-art results on benchmarks like KILT, particularly excelling in multi-hop reasoning tasks by addressing retrieval bottlenecks. It supports diverse decoding strategies, adjusts test-time retrieval dynamically, and demonstrates robustness to varying retriever quality, offering a pathway to more grounded and factual AI models.

The CoRAG framework enhances RAG models through three key components: retrieval chain generation, model training, and test-time scaling strategies. Retrieval chains are generated using rejection sampling, where intermediate sub-queries and sub-answers are iteratively formed, and the chain with the highest log-likelihood score is selected to augment datasets. Using a multi-task learning framework, the model is trained on these augmented datasets for sub-query, sub-answer, and final answer prediction. At test time, decoding strategies like greedy decoding, best-of-N sampling, and tree search allow for controlling token consumption and retrieval steps. These approaches optimize the trade-off between performance and compute efficiency.
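As a schematic illustration of the test-time loop described above (the greedy decoding case, without the best-of-N or tree-search variants), the sketch below uses placeholder methods such as generate_sub_query and a generic retriever; it is an illustration of the described procedure, not the authors' implementation.

def corag_answer(question, llm, retriever, max_steps=6):
    """Greedy-decoding sketch of a CoRAG-style retrieval chain: at each step the
    model reformulates a sub-query from the state so far, retrieves documents for
    it, and records a sub-answer, until it signals that the chain is complete."""
    chain = []  # list of (sub_query, retrieved_docs, sub_answer) tuples
    for _ in range(max_steps):
        sub_query = llm.generate_sub_query(question, chain)  # placeholder API
        if sub_query is None:  # the model decides no further retrieval is needed
            break
        docs = retriever.search(sub_query, top_k=5)          # placeholder API
        sub_answer = llm.generate_sub_answer(sub_query, docs, chain)
        chain.append((sub_query, docs, sub_answer))
    # The final answer conditions on the full chain of intermediate steps.
    return llm.generate_final_answer(question, chain)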

The evaluation of CoRAG was conducted using two benchmarks: (1) multi-hop QA datasets, including 2WikiMultihopQA, HotpotQA, Bamboogle, and MuSiQue, to test multi-hop reasoning, and (2) the KILT benchmark for generalization across knowledge-intensive tasks. Fine-tuning was performed on Llama-3.1-8B-Instruct using retrieval chain-augmented datasets. CoRAG-8B significantly outperformed baselines in most multi-hop QA datasets, except Bamboogle, where limited instances and outdated retrieval data caused variability. In the KILT benchmark, CoRAG achieved state-of-the-art performance across tasks, except for FEVER, where a larger model slightly surpassed it. Performance scaling experiments showed improvements with increased retrieval chain lengths and sampling strategies.

In conclusion, the study presents CoRAG, a framework that trains LLMs to retrieve and reason through complex queries iteratively. Unlike traditional RAG methods that rely on a single retrieval step, CoRAG dynamically reformulates queries during retrieval, enhancing accuracy. Intermediate retrieval chains are automatically generated using rejection sampling, eliminating the need for manual annotations. At test time, adaptive decoding strategies balance performance with computational efficiency. CoRAG achieves state-of-the-art results on multi-hop QA datasets and the KILT benchmark, outperforming larger models. Detailed analysis highlights its scaling and generalization capabilities, paving the way for advancing factual, grounded, and trustworthy AI systems in challenging tasks.

Leveraging Hallucinations in Large Language Models to Enhance Drug Discovery

Researchers have highlighted concerns regarding hallucinations in LLMs due to their generation of plausible but inaccurate or unrelated content. However, these hallucinations hold potential in creativity-driven fields like drug discovery, where innovation is essential. LLMs have been widely applied in scientific domains, such as materials science, biology, and chemistry, aiding tasks like molecular description and drug design. While traditional models like MolT5 offer domain-specific accuracy, LLMs often produce hallucinated outputs when not fine-tuned. Despite their lack of factual consistency, such outputs can provide valuable insights, such as high-level molecular descriptions and potential compound applications, thereby supporting exploratory processes in drug discovery.

Drug discovery, a costly and time-intensive process, involves evaluating vast chemical spaces and identifying novel solutions to biological challenges. Previous studies have used machine learning and generative models to assist in this field, with researchers exploring the integration of LLMs for molecule design, dataset curation, and prediction tasks. Hallucinations in LLMs, often viewed as a drawback, can mimic creative processes by recombining knowledge to generate novel ideas. This perspective aligns with creativity’s role in innovation, exemplified by groundbreaking accidental discoveries like penicillin. By leveraging hallucinated insights, LLMs could advance drug discovery by identifying molecules with unique properties and fostering high-level innovation.

ScaDS.AI and Dresden University of Technology researchers hypothesize that hallucinations can enhance LLM performance in drug discovery. Using seven instruction-tuned LLMs, including GPT-4o and Llama-3.1-8B, they incorporated hallucinated natural language descriptions of molecules’ SMILES strings into prompts for classification tasks. The results confirmed their hypothesis, with Llama-3.1-8B achieving an 18.35% ROC-AUC improvement over the baseline. Larger models and Chinese-generated hallucinations demonstrated the greatest gains. Analyses revealed that hallucinated text provides unrelated yet insightful information, aiding predictions. This study highlights hallucinations’ potential in pharmaceutical research and offers new perspectives on leveraging LLMs for innovative drug discovery.

To generate hallucinations, SMILES strings of molecules are translated into natural language using a standardized prompt where the system is defined as an “expert in drug discovery.” The generated descriptions are evaluated for factual consistency using the HHM-2.1-Open Model, with MolT5-generated text as the reference. Results show low factual consistency across LLMs, with ChemLLM scoring 20.89% and others averaging 7.42–13.58%. Drug discovery tasks are formulated as binary classification problems, predicting specific molecular properties via next-token prediction. Prompts include SMILES, descriptions, and task instructions, with models constrained to output “Yes” or “No” based on the highest probability.
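A rough sketch of how such a prompt might be assembled is shown below; the wording, the example molecule, and the stand-in description are illustrative rather than the paper's exact template, and in practice the final 'Yes'/'No' decision is taken from whichever answer token the model assigns higher probability.

def build_prompt(smiles, hallucinated_description, task):
    """Assemble a binary-classification prompt from a SMILES string, an unverified
    LLM-generated description of the molecule, and a task instruction. The wording
    is illustrative, not the paper's exact template."""
    return (
        "You are an expert in drug discovery.\n"
        f"SMILES: {smiles}\n"
        f"Description: {hallucinated_description}\n"
        f"Task: {task}\n"
        "Answer with 'Yes' or 'No' only."
    )

# Example usage (aspirin, with a made-up description standing in for the
# hallucinated text an LLM would actually produce):
prompt = build_prompt(
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",
    hallucinated_description="An aromatic ester reported to modulate inflammatory pathways.",
    task="Does this molecule inhibit HIV replication?",
)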

The study examines how hallucinations generated by different LLMs impact performance in molecular property prediction tasks. Experiments use a standardized prompt format to compare predictions based on SMILES strings alone, SMILES with MolT5-generated descriptions, and hallucinated descriptions from various LLMs. Five MoleculeNet datasets were analyzed using ROC-AUC scores. Results show that hallucinations generally improve performance over SMILES or MolT5 baselines, with GPT-4o achieving the highest gains. Larger models benefit more from hallucinations, but improvements plateau beyond 8 billion parameters. Temperature settings influence hallucination quality, with intermediate values yielding the best performance enhancements.

In conclusion, the study explores the potential benefits of hallucinations in LLMs for drug discovery tasks. By hypothesizing that hallucinations can enhance performance, the research evaluates seven LLMs across five datasets using hallucinated molecule descriptions integrated into prompts. Results confirm that hallucinations improve LLM performance compared to baseline prompts without hallucinations. Notably, Llama-3.1-8B achieved an 18.35% ROC-AUC gain. GPT-4o-generated hallucinations provided consistent improvements across models. Findings reveal that larger model sizes generally benefit more from hallucinations, while factors like generation temperature have minimal impact. The study highlights hallucinations’ creative potential in AI and encourages further exploration of drug discovery applications.

Develop a RAG-based application using Amazon Aurora with Amazon Kendra

Generative AI and large language models (LLMs) are helping organizations across diverse sectors enhance customer experience in ways that would traditionally have taken years of effort. Every organization has data stored in data stores, either on premises or with a cloud provider.
You can embrace generative AI and enhance customer experience by converting your existing data into an index that generative AI can search. When you ask a question of an open source LLM, you get publicly available information as a response. Although that is helpful, generative AI becomes far more useful when it can reason over your own data together with additional context from the LLM. This is achieved through Retrieval Augmented Generation (RAG).
RAG retrieves data from a preexisting knowledge base (your data), combines it with the LLM’s knowledge, and generates responses in more human-like language. However, for generative AI to understand your data, some amount of data preparation is required, which typically involves a steep learning curve.
Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.
In this post, we walk you through how to convert your existing Aurora data into an index without needing data preparation for Amazon Kendra to perform data search and implement RAG that combines your data along with LLM knowledge to produce accurate responses.
Solution overview
In this solution, you use your existing data in Aurora as a data source, create an intelligent search service by connecting and syncing that data source to an Amazon Kendra index, and perform a generative AI data search that uses RAG to produce accurate responses by combining your data with the LLM’s knowledge. For this post, we use Anthropic’s Claude on Amazon Bedrock as our LLM.
The following are the high-level steps for the solution:

Create an Amazon Aurora PostgreSQL-Compatible Edition cluster.
Ingest data to Aurora PostgreSQL-Compatible.
Create an Amazon Kendra index.
Set up the Amazon Kendra Aurora PostgreSQL connector.
Invoke the RAG application.

The following diagram illustrates the solution architecture.

Prerequisites
To follow this post, the following prerequisites are required:

The AWS Command Line Interface (AWS CLI) installed and configured
An AWS account and appropriate permissions to interact with resources in your AWS account
The AWS managed AWS Identity and Access Management (IAM) policy AmazonKendraReadOnlyAccess should be part of an Amazon SageMaker IAM role
An Aurora DB cluster where the current data is present
Your preferred interactive development environment (IDE) to run the Python script (such as SageMaker, or VS Code)
The pgAdmin tool for data loading and validation

Create an Aurora PostgreSQL cluster
Run the following AWS CLI commands to create an Aurora PostgreSQL Serverless v2 cluster:

aws rds create-db-cluster \
--engine aurora-postgresql \
--engine-version 15.4 \
--db-cluster-identifier genai-kendra-ragdb \
--master-username postgres \
--master-user-password XXXXX \
--db-subnet-group-name dbsubnet \
--vpc-security-group-ids "sg-XXXXX" \
--serverless-v2-scaling-configuration "MinCapacity=2,MaxCapacity=64" \
--enable-http-endpoint \
--region us-east-2

aws rds create-db-instance \
--db-cluster-identifier genai-kendra-ragdb \
--db-instance-identifier genai-kendra-ragdb-instance \
--db-instance-class db.serverless \
--engine aurora-postgresql
The following screenshot shows the created instance.

Ingest data to Aurora PostgreSQL-Compatible
Connect to the Aurora instance using the pgAdmin tool. Refer to Connecting to a DB instance running the PostgreSQL database engine for more information. To ingest your data, complete the following steps:

Run the following PostgreSQL statements in pgAdmin to create the database, schema, and table:

CREATE DATABASE genai;

-- Connect to the genai database, then run:
CREATE SCHEMA employees;
SET SCHEMA 'employees';

CREATE TABLE employees.amazon_review(
pk int GENERATED ALWAYS AS IDENTITY NOT NULL,
id varchar(50) NOT NULL,
name varchar(300) NULL,
asins Text NULL,
brand Text NULL,
categories Text NULL,
keys Text NULL,
manufacturer Text NULL,
reviews_date Text NULL,
reviews_dateAdded Text NULL,
reviews_dateSeen Text NULL,
reviews_didPurchase Text NULL,
reviews_doRecommend varchar(100) NULL,
reviews_id varchar(150) NULL,
reviews_numHelpful varchar(150) NULL,
reviews_rating varchar(150) NULL,
reviews_sourceURLs Text NULL,
reviews_text Text NULL,
reviews_title Text NULL,
reviews_userCity varchar(100) NULL,
reviews_userProvince varchar(100) NULL,
reviews_username Text NULL,
PRIMARY KEY
(
pk
)
) ;

In your pgAdmin Aurora PostgreSQL connection, navigate to Databases, genai, Schemas, employees, Tables.
Choose (right-click) Tables and choose PSQL Tool to open a PSQL client connection.
Place the csv file under your pgAdmin location and run the following command:

\copy employees.amazon_review (id, name, asins, brand, categories, keys, manufacturer, reviews_date, reviews_dateadded, reviews_dateseen, reviews_didpurchase, reviews_dorecommend, reviews_id, reviews_numhelpful, reviews_rating, reviews_sourceurls, reviews_text, reviews_title, reviews_usercity, reviews_userprovince, reviews_username) FROM 'C:\Program Files\pgAdmin 4\runtime\amazon_review.csv' DELIMITER ',' CSV HEADER ENCODING 'utf8';

Run the following PSQL query to verify the number of records copied:

Select count (*) from employees.amazon_review;

Create an Amazon Kendra index
The Amazon Kendra index holds the contents of your documents and is structured in a way to make the documents searchable. It has three index types:

Generative AI Enterprise Edition index – Offers the highest accuracy for the Retrieve API operation and for RAG use cases (recommended)
Enterprise Edition index – Provides semantic search capabilities and offers a high-availability service that is suitable for production workloads
Developer Edition index – Provides semantic search capabilities for you to test your use cases

To create an Amazon Kendra index, complete the following steps:

On the Amazon Kendra console, choose Indexes in the navigation pane.
Choose Create an index.
On the Specify index details page, provide the following information:

For Index name, enter a name (for example, genai-kendra-index).
For IAM role, choose Create a new role (Recommended).
For Role name, enter an IAM role name (for example, genai-kendra). Your role name will be prefixed with AmazonKendra-<region>- (for example, AmazonKendra-us-east-2-genai-kendra).

Choose Next.
On the Add additional capacity page, select Developer edition (for this demo) and choose Next.
On the Configure user access control page, provide the following information:

Under Access control settings, select No.
Under User-group expansion, select None.

Choose Next.
On the Review and create page, verify the details and choose Create.

It might take some time for the index to be created. Check the list of indexes to watch the progress of index creation. When the status of the index is ACTIVE, your index is ready to use.
Set up the Amazon Kendra Aurora PostgreSQL connector
Complete the following steps to set up your data source connector:

On the Amazon Kendra console, choose Data sources in the navigation pane.
Choose Add data source.
Choose Aurora PostgreSQL connector as the data source type.
On the Specify data source details page, provide the following information:

For Data source name, enter a name (for example, data_source_genai_kendra_postgresql).
For Default language, choose English (en).
Choose Next.

On the Define access and security page, under Source, provide the following information:

For Host, enter the host name of the PostgreSQL instance (cvgupdj47zsh.us-east-2.rds.amazonaws.com).
For Port, enter the port number of the PostgreSQL instance (5432).
For Instance, enter the database name of the PostgreSQL instance (genai).

Under Authentication, if you already have credentials stored in AWS Secrets Manager, choose it from the dropdown menu. Otherwise, choose Create and add new secret.
In the Create an AWS Secrets Manager secret pop-up window, provide the following information:

For Secret name, enter a name (for example, AmazonKendra-Aurora-PostgreSQL-genai-kendra-secret).
For Database user name, enter the name of your database user.
For Password, enter the user password.

Choose Add Secret.
Under Configure VPC and security group, provide the following information:

For Virtual Private Cloud, choose your virtual private cloud (VPC).
For Subnet, choose your subnet.
For VPC security groups, choose the VPC security group to allow access to your data source.

Under IAM role, if you have an existing role, choose it from the dropdown menu. Otherwise, choose Create a new role.
On the Configure sync settings page, under Sync scope, provide the following information:

For SQL query, enter the SQL query and column values as follows: select * from employees.amazon_review.
For Primary key, enter the primary key column (pk).
For Title, enter the title column that provides the name of the document title within your database table (reviews_title).
For Body, enter the body column on which your Amazon Kendra search will happen (reviews_text).

Under Sync mode, select Full sync to convert the entire table data into a searchable index.

Under Sync run schedule, choose Run on demand.
Choose Next.
On the Set field mappings page, leave the default settings and choose Next.
Review your settings and choose Add data source.

Your data source will appear on the Data sources page after it has been created successfully. After the sync completes, your Amazon Kendra index will contain the data from the specified Aurora PostgreSQL table, and you can use that index for intelligent search and RAG applications.

Invoke the RAG application
The Amazon Kendra index sync can take minutes to hours depending on the volume of your data. When the sync completes without error, you are ready to develop your RAG solution in your preferred IDE. Complete the following steps:

Configure your AWS credentials to allow Boto3 to interact with AWS services. You can do this by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or by using the ~/.aws/credentials file:

# Install the dependencies first, for example: pip install boto3 langchain langchain-community
import boto3

# Create a Boto3 session
session = boto3.Session(
    aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
    region_name='YOUR_AWS_REGION'
)

Import LangChain and the necessary components:

from langchain_community.llms import Bedrock
from langchain_community.retrievers import AmazonKendraRetriever
from langchain.chains import RetrievalQA

Create an instance of the LLM (Anthropic’s Claude):

llm = Bedrock(
    region_name="bedrock_region_name",
    model_kwargs={
        "max_tokens_to_sample": 300,
        "temperature": 1,
        "top_k": 250,
        "top_p": 0.999,
        "anthropic_version": "bedrock-2023-05-31"
    },
    model_id="anthropic.claude-v2"
)

Create your prompt template, which provides instructions for the LLM:

from langchain_core.prompts import PromptTemplate

prompt_template = """
You are a <persona>Product Review Specialist</persona>, and you provide detailed product review insights.
You have access to the product reviews in the <context> XML tags below and nothing else.

<context>
{context}
</context>

<question>
{question}
</question>
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

Initialize the AmazonKendraRetriever with the Amazon Kendra client and your index ID, replacing Kendra_Index_ID with the ID of the index that you created earlier:

session = boto3.Session(region_name='Kendra_region_name')
kendra_client = session.client('kendra')

# Create an instance of AmazonKendraRetriever
kendra_retriever = AmazonKendraRetriever(
    kendra_client=kendra_client,
    index_id="Kendra_Index_ID"
)

Combine Anthropic’s Claude and the Amazon Kendra retriever into a RetrievalQA chain:

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=kendra_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

Invoke the chain with your own query:

query = "What are some products that have bad quality reviews? Summarize the reviews."
result_ = qa.invoke(query)
result_
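Because return_source_documents=True is set on the chain, the returned dictionary should contain the generated answer under the result key and the Amazon Kendra passages that were used as context under source_documents, which is useful if you want to surface citations alongside the response.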

Clean up
To avoid incurring future charges, delete the resources you created as part of this post:

Delete the Aurora DB cluster and DB instance.
Delete the Amazon Kendra index.

Conclusion
In this post, we discussed how to convert your existing Aurora data into an Amazon Kendra index and implement a RAG-based solution for data search. This solution drastically reduces the data preparation needed for Amazon Kendra search. It also speeds up generative AI application development by reducing the learning curve behind data preparation.
Try out the solution, and if you have any comments or questions, leave them in the comments section.

About the Authors
Aravind Hariharaputran is a Data Consultant with the Professional Services team at Amazon Web Services. He is passionate about data and AI/ML in general, with extensive experience managing database technologies. He helps customers transform legacy databases and applications into modern data platforms and generative AI applications. He enjoys spending time with family and playing cricket.
Ivan Cui is a Data Science Lead with AWS Professional Services, where he helps customers build and deploy solutions using ML and generative AI on AWS. He has worked with customers across diverse industries, including software, finance, pharmaceutical, healthcare, IoT, and entertainment and media. In his free time, he enjoys reading, spending time with his family, and traveling.