Fractional Reasoning in LLMs: A New Way to Control Inference Depth

What is included in this article:
The limitations of current test-time compute strategies in LLMs.
Introduction of Fractional Reasoning (FR) as a training-free, model-agnostic framework.
Techniques for latent state manipulation using reasoning prompts and adjustable scaling.
Breadth- and depth-based scaling benefits demonstrated across GSM8K, MATH500, and GPQA.
Evaluation results showing FR’s superiority over Best-of-N and Majority Vote.
Analysis of FR’s behavior across different models, including DeepSeek-R1.

Introduction: Challenges in Uniform Reasoning During Inference

LLMs have shown improvements in various domains, with test-time compute playing a crucial role in their performance. This approach enhances reasoning during inference by allocating extra computational resources, such as generating multiple candidate responses and selecting the most suitable one, or refining answers iteratively through self-reflection. However, current test-time compute strategies treat all problems uniformly, applying the same depth of reasoning regardless of query difficulty or structure. In reality, reasoning needs are highly variable, and underthinking, overthinking, or excessive reflection can lead to degraded answers or unnecessary computational cost. Therefore, LLMs must be capable of adjusting their reasoning depth or level of reflection dynamically.

Prior Work: Latent Steering and Representation Control

Existing research has explored various methods to enhance LLM reasoning through inference-time scaling and latent state control. The Chain-of-Thought (CoT) prompting technique guides models to decompose complex problems into intermediate steps to improve reasoning performance. Outcome reward models (ORMs) and process reward models (PRMs) evaluate generated responses based on correctness or quality of internal reasoning. Moreover, representation engineering methods use steering vectors in LLM latent spaces for controlled generation, while methods like In-Context Vectors (ICV) extract latent vectors from demonstrations to steer internal states at inference time, and Representation Finetuning (ReFT) learns task-specific low-rank interventions over latent representations.

The Proposed Framework: Fractional Reasoning for Adaptive Inference

Researchers from Stanford University have proposed Fractional Reasoning (FR), a training-free and model-agnostic framework for improving test-time compute through adaptive reasoning control. FR adjusts reasoning behavior by directly modifying the model’s internal representations, extracting the latent shift induced by reasoning-promoting inputs such as CoT or reflection prompts, and reapplying this shift with a tunable scaling factor. This enables models to adjust the depth of reasoning during inference without modifying the input text or requiring fine-tuning. FR supports and enhances two key forms of test-time scaling: (a) breadth-based scaling, such as Best-of-N and majority vote, and (b) depth-based scaling, such as self-reflection.
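As a rough picture of the mechanism (a minimal sketch, not the authors' code), the steering step can be expressed as taking the difference between hidden states computed with and without a reasoning-promoting prompt and re-applying it with a scaling factor; the tensor names and the choice of layer at which the shift is applied are assumptions.

import torch

def fractional_reasoning_states(h_plain: torch.Tensor,
                                h_prompted: torch.Tensor,
                                alpha: float) -> torch.Tensor:
    """Steer hidden states by a scaled reasoning shift.

    h_plain:    hidden states for the query alone
    h_prompted: hidden states for the same query with a CoT/reflection prompt
    alpha:      scaling factor; < 1 dampens, > 1 amplifies the prompted behavior
    """
    shift = h_prompted - h_plain      # latent shift induced by the reasoning prompt
    return h_plain + alpha * shift    # re-apply the shift with tunable strength

With alpha = 1 the model behaves roughly as if it had received the full reasoning prompt; smaller or larger values let the same prompt exert a fractional or amplified influence, which is the knob FR tunes at inference time.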

Benchmarking: Performance Gains on Reasoning Tasks

FR is evaluated on three benchmarks that require multi-step reasoning: GSM8K, MATH500, and GPQA. The evaluation utilizes test sets for GSM8K and MATH500 while using the diamond split for GPQA. Main experiments use two competitive open-source instruction-tuned models: Qwen2.5-7B-Instruct and LLaMA-3.1-8B-Instruct, both of which demonstrate strong reasoning capabilities and provide access to the latent state representations required by the proposed method. FR outperforms standard test-time compute methods on all benchmarks and models, showing that it can strongly enhance performance. Adjusting the influence of prompts enables broader exploration of the solution space, increasing the efficiency of traditional test-time compute methods.

Behavior and Model-Agnostic Generality of Fractional Reasoning

Researchers further analyzed FR to understand its behavioral dynamics, generality across models, and scaling behavior. Analysis reveals that increasing the scaling parameter leads to longer outputs with more detailed multi-step reasoning, confirming that the framework steers model behavior predictably and continuously. FR remains effective even when applied to reasoning-specialized models such as DeepSeek-R1-Distill-Qwen-7B, improving accuracy over standard prompting baselines and showing its generality across both general-purpose and specialized LLMs. Performance scaling analysis shows consistent improvements with an increasing number of generations, and FR achieves higher accuracy than the majority vote baseline across most sampling budgets.

Conclusion: Towards More Dynamic and Efficient LLM Inference

In conclusion, researchers from Stanford University introduced Fractional Reasoning (FR), a training-free and model-agnostic framework that improves test-time compute through adaptive control of reasoning behavior in LLMs. It offers a general and interpretable approach for more precise and efficient allocation of computational effort during inference, overcoming the limitation of uniform reasoning application in current test-time compute strategies. However, the framework currently depends on predefined reasoning directions and lacks automatic selection of scaling factors, indicating future research directions toward adaptive policies for fully dynamic inference.

Check out the Paper. All credit for this research goes to the researchers of this project.


Liquid AI Open-Sources LFM2: A New Generation of Edge LLMs

What is included in this article:
Performance breakthroughs – 2x faster inference and 3x faster training
Technical architecture – Hybrid design with convolution and attention blocks
Model specifications – Three size variants (350M, 700M, 1.2B parameters)
Benchmark results – Superior performance compared to similar-sized models
Deployment optimization – Edge-focused design for various hardware
Open-source accessibility – Apache 2.0-based licensing
Market implications – Impact on edge AI adoption

The landscape of on-device artificial intelligence has taken a significant leap forward with Liquid AI’s release of LFM2, their second-generation Liquid Foundation Models. This new series of generative AI models represents a paradigm shift in edge computing, delivering unprecedented performance optimizations specifically designed for on-device deployment while maintaining competitive quality standards.

Revolutionary Performance Gains

LFM2 establishes new benchmarks in the edge AI space by achieving remarkable efficiency improvements across multiple dimensions. The models deliver 2x faster decode and prefill performance compared to Qwen3 on CPU architectures, a critical advancement for real-time applications. Perhaps more impressively, the training process itself has been optimized to achieve 3x faster training compared to the previous LFM generation, making LFM2 the most cost-effective path to building capable, general-purpose AI systems.

These performance improvements are not merely incremental but represent a fundamental breakthrough in making powerful AI accessible on resource-constrained devices. The models are specifically engineered to unlock millisecond latency, offline resilience, and data-sovereign privacy – capabilities essential for phones, laptops, cars, robots, wearables, satellites, and other endpoints that must reason in real time.

Hybrid Architecture Innovation

The technical foundation of LFM2 lies in its novel hybrid architecture that combines the best aspects of convolution and attention mechanisms. The model employs a sophisticated 16-block structure consisting of 10 double-gated short-range convolution blocks and 6 blocks of grouped query attention (GQA). This hybrid approach draws from Liquid AI’s pioneering work on Liquid Time-constant Networks (LTCs), which introduced continuous-time recurrent neural networks with linear dynamical systems modulated by nonlinear input interlinked gates.

At the core of this architecture is the Linear Input-Varying (LIV) operator framework, which enables weights to be generated on-the-fly from the input they are acting on. This allows convolutions, recurrences, attention, and other structured layers to fall under one unified, input-aware framework. The LFM2 convolution blocks implement multiplicative gates and short convolutions, creating linear first-order systems that converge to zero after a finite time.
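The post does not include reference code, but the flavor of a double-gated short convolution can be sketched as follows; this is an illustrative PyTorch module, with gate placement, kernel size, and layer names chosen for clarity rather than taken from LFM2.

import torch
import torch.nn as nn

class GatedShortConvBlock(nn.Module):
    """Illustrative double-gated short-range convolution block."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_gate = nn.Linear(dim, dim)    # input-dependent gate before the conv
        self.out_gate = nn.Linear(dim, dim)   # input-dependent gate after the conv
        # depthwise causal short convolution over the sequence dimension
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        g_in = torch.sigmoid(self.in_gate(x))
        h = (g_in * x).transpose(1, 2)                        # -> (batch, dim, seq)
        h = self.conv(h)[..., : x.shape[1]].transpose(1, 2)   # trim padding to keep causality
        g_out = torch.sigmoid(self.out_gate(x))
        return self.proj(g_out * h)

Because the gates are computed from the input itself, the effective mixing applied by the block varies with the input, which is the input-aware behavior the LIV framework describes.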

The architecture selection process utilized STAR, Liquid AI’s neural architecture search engine, which was modified to evaluate language modeling capabilities beyond traditional validation loss and perplexity metrics. Instead, it employs a comprehensive suite of over 50 internal evaluations that assess diverse capabilities including knowledge recall, multi-hop reasoning, understanding of low-resource languages, instruction following, and tool use.

Comprehensive Model Lineup

LFM2 is available in three strategically sized configurations: 350M, 700M, and 1.2B parameters, each optimized for different deployment scenarios while maintaining the core efficiency benefits. All models were trained on 10 trillion tokens drawn from a carefully curated pre-training corpus comprising approximately 75% English, 20% multilingual content, and 5% code data sourced from web and licensed materials.

The training methodology incorporates knowledge distillation using the existing LFM1-7B as a teacher model, with cross-entropy between LFM2’s student outputs and the teacher outputs serving as the primary training signal throughout the entire 10T token training process. The context length was extended to 32k during pretraining, enabling the models to handle longer sequences effectively.
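The post states only that cross-entropy between the student's outputs and the teacher's outputs serves as the primary training signal; a generic logit-level distillation loss of that kind might look like the sketch below (the temperature term is a common addition and an assumption here, not something LFM2 is documented to use).

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Cross-entropy of the student's distribution against the teacher's
    soft targets, averaged over all token positions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()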

Superior Benchmark Performance

Evaluation results demonstrate that LFM2 significantly outperforms similarly-sized models across multiple benchmark categories. The LFM2-1.2B model performs competitively with Qwen3-1.7B despite having 47% fewer parameters. Similarly, LFM2-700M outperforms Gemma 3 1B IT, while the smallest LFM2-350M checkpoint remains competitive with Qwen3-0.6B and Llama 3.2 1B Instruct.

Beyond automated benchmarks, LFM2 demonstrates superior conversational capabilities in multi-turn dialogues. Using the WildChat dataset and LLM-as-a-Judge evaluation framework, LFM2-1.2B showed significant preference advantages over Llama 3.2 1B Instruct and Gemma 3 1B IT while matching Qwen3-1.7B performance despite being substantially smaller and faster.

Edge-Optimized Deployment

The models excel in real-world deployment scenarios, having been exported to multiple inference frameworks including PyTorch’s ExecuTorch and the open-source llama.cpp library. Testing on target hardware including Samsung Galaxy S24 Ultra and AMD Ryzen platforms demonstrates that LFM2 dominates the Pareto frontier for both prefill and decode inference speed relative to model size.

The strong CPU performance translates effectively to accelerators such as GPU and NPU after kernel optimization, making LFM2 suitable for a wide range of hardware configurations. This flexibility is crucial for the diverse ecosystem of edge devices that require on-device AI capabilities.

Conclusion

The release of LFM2 addresses a critical gap in the AI deployment landscape where the shift from cloud-based to edge-based inference is accelerating. By enabling millisecond latency, offline operation, and data-sovereign privacy, LFM2 unlocks new possibilities for AI integration across consumer electronics, robotics, smart appliances, finance, e-commerce, and education sectors.

The technical achievements represented in LFM2 signal a maturation of edge AI technology, where the trade-offs between model capability and deployment efficiency are being successfully optimized. As enterprises pivot from cloud LLMs to cost-efficient, fast, private, and on-premises intelligence, LFM2 positions itself as a foundational technology for the next generation of AI-powered devices and applications.

Check out the Technical Details and Model on Hugging Face. All credit for this research goes to the researchers of this project.

Build AI-driven policy creation for vehicle data collection and automa …

Vehicle data is critical for original equipment manufacturers (OEMs) to drive continuous product innovation and performance improvements and to support new value-added services. Similarly, the increasing digitalization of vehicle architectures and adoption of software-configurable functions allow OEMs to add new features and capabilities efficiently. Sonatus’s Collector AI and Automator AI products address these two aspects of the move towards Software-Defined Vehicles (SDVs) in the automotive industry.
Collector AI lowers the barrier to using data across the entire vehicle lifecycle using data collection policies that can be created without changes to vehicle electronics or modifications to embedded code. However, OEM engineers and other consumers of vehicle data struggle to choose among the thousands of vehicle signals available to drive their specific use cases and outcomes. Likewise, Automator AI’s no-code methodology for automating vehicle functions using intuitive if-then-style scripted workflows can also be challenging, especially for OEM users who aren’t well-versed in the events and signals available on vehicles to incorporate into a desired automated action.
To address these challenges, Sonatus partnered with the AWS Generative AI Innovation Center to develop a natural language interface to generate data collection and automation policies using generative AI. This innovation aims to reduce the policy generation process from days to minutes while making it accessible to both engineers and non-experts alike.
In this post, we explore how we built this system using Sonatus’s Collector AI and Amazon Bedrock. We discuss the background, challenges, and high-level solution architecture.
Collector AI and Automator AI
Sonatus has developed a sophisticated vehicle data collection and automation workflow tool, which comprises two main products:

Collector AI – Gathers and transmits precise vehicle data based on configurable trigger events
Automator AI – Executes automated actions within the vehicle based on analyzed data and trigger conditions

The current process requires engineers to create data collection or automation policies manually. Depending on the range of an OEM’s use cases, there could be hundreds of policies for a given vehicle model. Also, identifying the correct data to collect for a given intent requires sifting through multiple layers of information and navigating organizational challenges. Our goal was to develop a more intelligent and intuitive way to accomplish the following:

Generate policies from the user’s natural language input
Significantly reduce policy creation time from days to minutes
Provide complete control over the intermediate steps in the generation process
Expand policy creation capabilities to non-engineers such as vehicle product owners, product planners, and even procurement
Implement a human-in-the-loop review process for both existing and newly created policies

Key challenges
During implementation, we encountered several challenges:

Complex event structures – Vehicle models and different policy entities use diverse representations and formats, requiring flexible policy generation
Labeled data limitations – Labeled data mapping natural language inputs to desired policies is limited
Format translation – The solution must handle different data formats and schemas across customers and vehicle models
Quality assurance – Generated policies must be accurate and consistent
Explainability – Clear explanations for how policies are generated can help build trust

Success metrics
We defined the following key metrics to measure the success of our solution:

Business metrics:

Reduced policy generation time
Increased number of policies per customer
Expanded user base for policy creation

Technical metrics:

Accuracy of generated policies
Quality of results for modified prompts

Operational metrics:

Reduced policy generation effort and turnaround time compared to manual process
Successful integration with existing systems

Solution overview
The Sonatus Advanced Technology team and Generative AI Innovation Center team built an automated policy generation system, as shown in the following diagram.

This is a chain of large language models (LLMs) that perform individual tasks, including entity extraction, signal translation, and signal parametrization.
Entity extraction
A fully generated vehicle policy consists of multiple parts, all of which can be captured in a single user statement. These are triggers and target data for collector policies, and triggers, actions, and associated tasks for automator policies. The user’s statement is first broken down into its entities using the following steps and rules:

Few-shot examples are provided for each entity
Trigger outputs must be self-contained with the appropriate signal value and comparison operator information:

Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
Trigger output: <response>vehicle speed above 0, vehicle signal</response>

Triggers and actions are secondarily verified using a classification prompt
For Automator AI, triggers and actions must be associated with their corresponding tasks
The final output of this process is an intermediate structured XML representation of the natural language user query:

Query example: “Generate an automation policy that locks the doors automatically when the car is moving”
Generated XML:

<response>
<task> Lock doors when moving </task>
<triggers> vehicle speed above 0, vehicle signal </triggers>
<actions> lock doors, vehicle signal </actions>
</response>

The following is a diagram of our improved solution, which converts a user query into XML output.

Signal translation and parametrization
To get to the final JSON policy structure from the intermediate structured XML output, the correct signals must be identified, the signal parameters need to be generated, and this information must be combined to follow the application’s expected JSON schema.
The output signal format of choice at this stage is Vehicle Signal Specification (VSS), an industry-standard specification driven by COVESA. VSS is a standard specifying vehicle signal naming conventions and strategies that make vehicle signals descriptive and understandable when compared to their physical Controller Area Network (CAN) signal counterparts. This makes VSS not only suitable but essential for the generative AI pipeline, which depends on descriptive signal names and readily available signal descriptions.
The VSS signals, along with their descriptions and other necessary metadata, are embedded into a vector index. For every XML structure requiring a lookup of a vehicle signal, the process of signal translation includes the following steps:

Available signal data is preprocessed and stored into a vector database.
Each XML representation—triggers, actions, and data—is converted into its corresponding embedding. In some cases, the XML phrases can also be enhanced for better embedding representation.
For each of the preceding entities:

Top-k similar vector embeddings are identified (for example, k = 20).
Candidate signals are reranked based on name and descriptions.
The final signal is selected using an LLM selection prompt.

In the case of triggers, after the selection of the correct signal, the trigger value and condition comparator operator are also generated using few-shot examples.
This retrieved and generated information is combined into a predefined trigger, action, data, and task JSON object structure.
Individual JSON objects are assembled to construct the final JSON policy.
This is run through a policy schema validator before it is saved.

The following diagram illustrates the step-by-step process of signal translation. To generate the JSON output from the intermediate XML structure, correct signals are identified using vector-based lookups and reranking techniques.
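In code, that lookup path can be condensed into something like the sketch below; the embedding function, vector index client, reranker, and LLM selection call are placeholders for components the post does not name.

from typing import Dict, List

def translate_to_vss_signal(phrase: str, vector_index, embed, rerank, llm_select,
                            k: int = 20) -> str:
    """Map an extracted XML phrase (e.g. 'vehicle speed above 0') to a VSS signal name."""
    query_vec = embed(phrase)                                         # embed the XML phrase
    candidates: List[Dict] = vector_index.search(query_vec, top_k=k)  # top-k similar signals
    candidates = rerank(phrase, candidates)                           # rerank by name/description
    return llm_select(phrase, candidates)                             # final LLM selection prompt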

Solution highlights
In this section, we discuss key components and features of the solution.
Improvement of task adjacency
In automator policies, a task is a discrete unit of work within a larger process. It has a specific purpose and performs a defined set of actions—both within and outside a vehicle. It can also define a set of trigger conditions; when these evaluate to true, the defined actions start executing. The larger process—the workflow—defines a dependency graph of tasks and the order in which they are executed. The workflow adheres to the following rules:

Every automator policy starts with exactly one task
A task can point to one or more next tasks
One task can only initiate one other task
Multiple possible next tasks can exist, but only one can be triggered at a time
Each policy workflow runs one task at a given time
Tasks can be arranged in linear or branching patterns
If none of the conditions are satisfied, the workflow defaults to monitoring the trigger conditions of the next available tasks

For example:

# Linear Task Adjacency
t1 → t2 → t3 → t4 → t1*
# Branching Task Adjacency
t1 → t2, t3, t4
t3 → t5
t5 → t4

*Loops back to start.
In some of the generated outputs, we identified pairs of adjacent tasks in which one lacked an action and the other lacked a trigger. To address this, we implemented task merging, which combines such incomplete tasks into a single task, using Anthropic’s Claude on Amazon Bedrock. Our outcomes were as follows:

Solve the task merging issue, where multiple tasks with incomplete information are merged into one task
Properly generate tasks that point to multiple next tasks
Change the prompt style to decision tree-based planning to make it more flexible

Multi-agent approach for parameter generation
During the signal translation process, an exhaustive list of signals is fed into a vector store, and when corresponding triggers or actions are generated, they are used to search the vector store and select the signal with the highest relevancy. However, this sometimes generates less accurate or ambiguous results.
For example, the following policy asks to cool down the car:
Action: <response> cool down the car </response>
The corresponding signal should try to cool the car cabin, as shown in the following signal:
Vehicle.Cabin.HVAC.Station.Row1.Driver.Temperature
It should not cool the car engine, as shown in the following incorrect signal:
Vehicle.Powertrain.CombustionEngine.EngineCoolant.Temperature
We mitigated this issue by introducing a multi-agent approach. Our approach has two agents:

ReasoningAgent – Proposes initial signal names based on the query and knowledge base
JudgeAgent – Evaluates and refines the proposed signals

The agents interact iteratively up to a set cycle threshold before claiming success for signal identification.
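A minimal sketch of that interaction loop is shown below; the agent objects and their method names are hypothetical stand-ins for the prompt-driven components built on Amazon Bedrock.

def identify_signal(query: str, knowledge_base, reasoning_agent, judge_agent,
                    max_cycles: int = 3) -> str:
    """Alternate between a proposing agent and a judging agent until the judge
    accepts a signal or the cycle budget is exhausted."""
    candidate = reasoning_agent.propose(query, knowledge_base)
    for _ in range(max_cycles):
        verdict = judge_agent.evaluate(query, candidate)
        if verdict.accepted:
            return candidate
        # Feed the judge's critique back into the next proposal.
        candidate = reasoning_agent.propose(query, knowledge_base,
                                            feedback=verdict.feedback)
    return candidate  # best effort after reaching the cycle threshold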
Reduce redundant LLM calls
To reduce latency, we identified parts of the pipeline that could be merged into a single LLM call. For example, trigger condition value generation and trigger condition operator generation were separate LLM calls. We addressed this by introducing the faster Anthropic Claude 3 Haiku model and merging prompts where possible. The following is an example of a set of prompts before and after merging. The first example is before merging, with the trigger set to when the temperature is above 20 degrees Celsius:

Operator response: <operator> > </operator>
Parameter response: <value> 20 </value>

The following is the combined response for the same trigger:

<response>
<operator> > </operator>
<value> 20 </value>
</response>

Context-driven policy generation
The goal here is to disambiguate the signal translation, similar to the multi-agent approach for parameter generation. To make policy generation more context-aware, we proposed a customer intent clarifier that carries out the following tasks:

Retrieves relevant subsystems using knowledge base lookups
Identifies the intended target subsystem
Allows user verification and override

This approach works by using external and preprocessed information like available vehicle subsystems, knowledge bases, and signals to guide the signal selection. Users can also clarify or override intent in cases of ambiguity early on to reduce wasted iterations and achieve the desired result more quickly. For example, in the case of the previously stated example on an ambiguous generation of “cool the car,” users are asked to clarify which subsystem they meant—to choose from “Engine” or “Cabin.”
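As a rough sketch, the clarifier can be pictured as a small routine that narrows the query to a subsystem and asks the user to confirm before signal selection proceeds; the helper objects below are hypothetical.

def clarify_intent(query: str, subsystem_index, llm, ask_user) -> str:
    """Resolve an ambiguous request (e.g. 'cool down the car') to a vehicle subsystem."""
    candidates = subsystem_index.search(query, top_k=3)   # e.g. ["Cabin", "Engine"]
    proposed = llm.pick_subsystem(query, candidates)      # model's best guess
    # Let the user confirm or override the guess early, before any signals are selected.
    return ask_user(f"Which subsystem did you mean? (suggested: {proposed})",
                    options=candidates)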
Conclusion
Combining early feedback loops and a multi-agent approach has transformed Sonatus’s policy creation system into a more automated and efficient solution. By using Amazon Bedrock, we created a system that not only automates policy creation, reducing the time taken by 70%, but also provides accuracy through context-aware generation and validation. Organizations can achieve similar efficiency gains by implementing this multi-agent approach with Amazon Bedrock for their own complex policy creation workflows. Developers can use these techniques to build natural language interfaces that dramatically reduce technical complexity while maintaining precision in business-critical systems.

About the authors
Giridhar Akila Dhakshinamoorthy is the Senior Staff Engineer and AI/ML Tech Lead in the CTO Office at Sonatus.
Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services who helps customers solve their business problems using generative AI and machine learning. He holds an MS with thesis in machine learning from the University of Illinois and has extensive experience solving customer problems in the field of data science.
Parth Patwa is a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He has co-authored research papers at top AI/ML venues and has 1000+ citations.
Yingwei Yu is an Applied Science Manager at Generative AI Innovation Center, AWS, where he leverages machine learning and generative AI to drive innovation across industries. With a PhD in Computer Science from Texas A&M University and years of working experience, Yingwei brings extensive expertise in applying cutting-edge technologies to real-world applications.
Hamed Yazdanpanah was a Data Scientist in the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.

How Rapid7 automates vulnerability risk scores with ML pipelines using …

This post is cowritten with Jimmy Cancilla from Rapid7.
Organizations are managing increasingly distributed systems, which span on-premises infrastructure, cloud services, and edge devices. As systems become interconnected and exchange data, the potential pathways for exploitation multiply, and vulnerability management becomes critical to managing risk. Vulnerability management (VM) is the process of identifying, classifying, prioritizing, and remediating security weaknesses in software, hardware, virtual machines, Internet of Things (IoT) devices, and similar assets. When new vulnerabilities are discovered, organizations are under pressure to remediate them. Delayed responses can open the door to exploits, data breaches, and reputational harm. For organizations with thousands or millions of software assets, effective triage and prioritization for the remediation of vulnerabilities are critical.
To support this process, the Common Vulnerability Scoring System (CVSS) has become the industry standard for evaluating the severity of software vulnerabilities. CVSS v3.1, published by the Forum of Incident Response and Security Teams (FIRST), provides a structured and repeatable framework for scoring vulnerabilities across multiple dimensions: exploitability, impact, attack vector, and others. With new threats emerging constantly, security teams need standardized, near real-time data to respond effectively. CVSS v3.1 is used by organizations such as NIST and major software vendors to prioritize remediation efforts, support risk assessments, and comply with standards.
There is, however, a critical gap that emerges before a vulnerability is formally standardized. When a new vulnerability is disclosed, vendors aren’t required to include a CVSS score alongside the disclosure. Additionally, third-party organizations such as NIST aren’t obligated or bound by specific timelines to analyze vulnerabilities and assign CVSS scores. As a result, many vulnerabilities are made public without a corresponding CVSS score. This situation can leave customers uncertain about how to respond: should they patch the newly discovered vulnerability immediately, monitor it for a few days, or deprioritize it? Our goal with machine learning (ML) is to provide Rapid7 customers with a timely answer to this critical question.
Rapid7 helps organizations protect what matters most so innovation can thrive in an increasingly connected world. Rapid7’s comprehensive technology, services, and community-focused research remove complexity, reduce vulnerabilities, monitor for malicious behavior, and shut down attacks. In this post, we share how Rapid7 implemented end-to-end automation for the training, validation, and deployment of ML models that predict CVSS vectors. As a result, Rapid7 customers have the information they need to accurately understand their risk and prioritize remediation measures.
Rapid7’s solution architecture
Rapid7 built their end-to-end solution using Amazon SageMaker AI, the Amazon Web Services (AWS) fully managed ML service to build, train, and deploy ML models into production environments. SageMaker AI provides powerful compute for ephemeral tasks, orchestration tools for building automated pipelines, a model registry for tracking model artifacts and versions, and scalable deployment to configurable endpoints.
Rapid7 integrated SageMaker AI with their DevOps tools (GitHub for version control and Jenkins for build automation) to implement continuous integration and continuous deployment (CI/CD) for the ML models used for CVSS scoring. By automating model training and deployment, Rapid7’s CVSS scoring solutions stay up to date with the latest data without additional operational overhead.
The following diagram illustrates the solution architecture.

Orchestrating with SageMaker AI Pipelines
The first step in the journey toward end-to-end automation was removing manual activities previously performed by data scientists. This meant migrating experimental code from Jupyter notebooks to production-ready Python scripts. Rapid7 established a project structure to support both development and production. Each step in the ML pipeline—data download, preprocessing, training, evaluation, and deployment—was defined as a standalone Python module in a common directory.
Designing the pipeline
After refactoring, pipeline steps were moved to SageMaker Training and Processing jobs for remote execution. Steps in the pipeline were defined using Docker images with the required libraries, and orchestrated using SageMaker Pipelines in the SageMaker Python SDK.
CVSS v3.1 vectors consist of eight independent metrics combined into a single vector. To produce an accurate CVSS vector, eight separate models were trained in parallel. However, the data used to train these models was identical. This meant that the training process could share common download and preprocessing steps, followed by separate training, validation, and deployment steps for each metric. The following diagram illustrates the high-level architecture of the implemented pipeline.

Data loading and preprocessing
The data used to train the model comprised existing vulnerabilities and their associated CVSS vectors. This data source is updated constantly, which is why Rapid7 decided to download the most recent data available at training time and upload it to Amazon S3 for use by subsequent steps. After the data is downloaded, Rapid7 implemented a preprocessing step to:

Structure the data to facilitate ingestion and use in training.
Split the data into three sets: training, validation, and testing (80%, 10%, and 10%).

The preprocessing step was defined with a dependency on the data download step so that the new dataset was available before a new preprocessing job was started. The outputs of the preprocessing job—the resulting training, validation, and test sets—are also uploaded to Amazon S3 to be consumed by the training steps that follow.
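The preprocessing code itself isn't shown in the post; a minimal sketch of the 80/10/10 split and upload might look like the following, where the bucket name, key prefix, and one-record-per-line text format are assumptions.

import random

import boto3

def split_and_upload(records: list, bucket: str, prefix: str = "cvss/preprocessed") -> dict:
    """Shuffle and split records 80/10/10, then upload each split to Amazon S3."""
    random.shuffle(records)
    n = len(records)
    splits = {
        "train": records[: int(0.8 * n)],
        "validation": records[int(0.8 * n): int(0.9 * n)],
        "test": records[int(0.9 * n):],
    }
    s3 = boto3.client("s3")
    for name, subset in splits.items():
        body = "\n".join(subset).encode("utf-8")   # one plain-text record per line
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{name}.txt", Body=body)
    return splits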
Model training, evaluation, and deployment
For the remaining pipeline steps, Rapid7 executed each step eight times—one time for each metric in the CVSS vector. Rapid7 iterated through each of the eight metrics to define the corresponding training, evaluation, and deployment steps using the SageMaker Pipelines SDK.
The loop follows a similar pattern for each metric. The process starts with a training job using PyTorch framework images provided by Amazon SageMaker AI. The following is a sample script for defining a training job.

from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    role=role,
    instance_count=1,
    instance_type=TRAINING_INSTANCE_TYPE,
    output_path=f"s3://{s3_bucket}/cvss/trained-model",
    framework_version="2.2",
    py_version="py310",
    disable_profiler=True,
    environment={"METRIC": cvss_metric}
)
step_train = TrainingStep(
    name=f"TrainModel_{cvss_metric}",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=<<INPUT_DATA_S3_URI>>,
            content_type="text/plain"
        ),
        "validation": TrainingInput(
            s3_data=<<VALIDATION_DATA_S3_URI>>,
            content_type="text/plain"
        )
    }
)
training_steps.append(step_train)

The PyTorch Estimator creates model artifacts that are automatically uploaded to the Amazon S3 location defined in the output path parameter. The same script is used for each of the CVSS v3.1 metrics; each training job focuses on a different metric by passing a different cvss_metric to the training script as an environment variable.
The SageMaker Pipeline is configured to trigger the execution of a model evaluation step when the model training job for that CVSS v3.1 metric is finished. The model evaluation job takes the newly trained model and test data as inputs, as shown in the following step definition.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

script_eval = Processor(…)
eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=<<MODEL_ARTIFACTS_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=<<TEST_DATA_IN_AMAZON_S3>>,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation/",
            destination=f"s3://{s3_bucket}/cvss/evaluation/{cvss_metric}/"
        )
    ],
    source_dir="src",
    code="evaluate.py"
)
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)
step_eval = ProcessingStep(
    name=f"Evaluate_{cvss_metric}",
    step_args=eval_args,
    property_files=[evaluation_report],
)
evaluation_steps.append(step_eval)

The processing job is configured to create a PropertyFile object to store the results from the evaluation step. Here is a sample of what might be found in this file:

{
    "ac": {
        "metrics": {
            "accuracy": 99
        }
    }
}

This information is critical in the last step of the sequence followed for each metric in the CVSS vector. Rapid7 wants to ensure that models deployed in production meet quality standards, and they do that by using a ConditionStep that allows only models whose accuracy is above a critical value to be registered in the SageMaker Model Registry. This process is repeated for all eight models.

from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path=f"{cvss_metric}.metrics.accuracy"
    ),
    right=accuracy_threshold_param
)
step_cond = ConditionStep(
    name=f"CVSS_{cvss_metric}_Accuracy_Condition",
    conditions=[cond_gte],
    if_steps=[step_model_create],
    else_steps=[]
)
conditional_steps.append(step_cond)

Defining the pipeline
With all the steps defined, a pipeline object is created with all the steps for all eight models. The graph for the pipeline definition is shown in the following image.
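The assembly can be sketched with the SageMaker Python SDK roughly as follows; step_download, step_preprocess, the pipeline name, and pipeline_session are assumed names rather than values taken from Rapid7's code.

from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="cvss-vector-scoring",                 # assumed pipeline name
    parameters=[accuracy_threshold_param],
    steps=[step_download, step_preprocess]      # shared data steps (assumed variable names)
          + training_steps + evaluation_steps + conditional_steps,
    sagemaker_session=pipeline_session,         # assumed session object
)
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # run all steps for the eight CVSS metrics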

Managing models with SageMaker Model Registry
SageMaker Model Registry is a repository for storing, versioning, and managing ML models throughout the machine learning operations (MLOps) lifecycle. The model registry enables the Rapid7 team to track model artifacts and their metadata (such as performance metrics), and streamline model version management as their CVSS models evolve. Each time a new model is added, a new version is created under the same model group, which helps track model iterations over time. Because new versions are evaluated for accuracy before registration, they’re registered with an Approved status. If a model’s accuracy falls below this threshold, the automated deployment pipeline will detect this and send an alert to notify the team about the failed deployment. This enables Rapid7 to maintain an automated pipeline that serves the most accurate model available to date without requiring manual review of new model artifacts.
Deploying models with inference components
When a set of CVSS scoring models has been selected, they can be deployed in a SageMaker AI endpoint for real-time inference, allowing them to be invoked to calculate a CVSS vector as soon as new vulnerability data is available. SageMaker AI endpoints are accessible URLs where applications can send data and receive predictions. Internally, the CVSS v3.1 vector is prepared using predictions from the eight scoring models, followed by postprocessing logic. Because each invocation runs each of the eight CVSS scoring models one time, their deployment can be optimized for efficient use of compute resources.
When the deployment script runs, it checks the model registry for new versions. If it detects an update, it immediately deploys the new version to a SageMaker endpoint.
Ensuring cost efficiency
Cost efficiency was a key consideration in designing this workflow. Usage patterns for vulnerability scoring are bursty, with periods of high activity followed by long idle intervals. Maintaining dedicated compute resources for each model would be unnecessarily expensive given these idle times. To address this issue, Rapid7 implemented Inference Components in their SageMaker endpoint. Inference components allow multiple models to share the same underlying compute resources, significantly improving cost efficiency—particularly for bursty inference patterns. This approach enabled Rapid7 to deploy all eight models on a single instance. Performance tests showed that inference requests could be processed in parallel across all eight models, consistently achieving sub-second response times (100-200ms).
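As an illustration of what registering one of the eight models as an inference component could look like with boto3, consider the sketch below; the endpoint, variant, and model names and the resource figures are placeholders, not Rapid7's actual configuration.

import boto3

sm = boto3.client("sagemaker")

# Attach one CVSS metric model to a shared endpoint as an inference component,
# so all eight models can share the same underlying instance.
sm.create_inference_component(
    InferenceComponentName="cvss-av-scorer",          # placeholder names
    EndpointName="cvss-scoring-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "cvss-av-model-v12",
        "ComputeResourceRequirements": {
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 2048,
        },
    },
    RuntimeConfig={"CopyCount": 1},                   # one copy per model is an assumption
)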
Monitoring models in production
Rapid7 continually monitors the models in production to ensure high availability and efficient use of compute resources. The SageMaker AI endpoint automatically uploads logs and metrics into Amazon CloudWatch, which are then forwarded and visualized in Grafana. As part of regular operations, Rapid7 monitors these dashboards to visualize metrics such as model latency, the number of instances behind the endpoint, and invocations and errors over time. Additionally, alerts are configured on response time metrics to maintain system responsiveness and prevent delays in the enrichment pipeline. For more information on the various metrics and their usage, refer to the AWS blog post, Best practices for load testing Amazon SageMaker real-time inference endpoints.
Conclusion
End-to-end automation of vulnerability scoring model development and deployment has given Rapid7 a consistent, fully automated process. The previous manual process for retraining and redeploying these models was fragile, error-prone, and time-intensive. By implementing an automated pipeline with SageMaker, the engineering team now saves at least 2–3 days of maintenance work each month. By eliminating 20 manual operations, Rapid7 software engineers can focus on delivering higher-impact work for their customers. Furthermore, by using inference components, all models can be consolidated onto a single ml.m5.2xlarge instance, rather than deploying a separate endpoint (and instance) for each model. This approach nearly halves the hourly compute cost, resulting in approximately 50% cloud compute savings for this workload. In building this pipeline, Rapid7 benefited from features that reduced time and cost across multiple steps. For example, using custom containers with the necessary libraries improved startup times, while inference components enabled efficient resource utilization—both were instrumental in building an effective solution.
Most importantly, this automation means that Rapid7 customers always receive the most recently published CVEs with a CVSSv3.1 score assigned. This is especially important for InsightVM because Active Risk Scores, Rapid7’s latest risk strategy for understanding vulnerability impact, rely on the CVSSv3.1 score as a key component in their calculation. Providing accurate and meaningful risk scores is critical for the success of security teams, empowering them to prioritize and address vulnerabilities more effectively.
In summary, automating model training and deployment with Amazon SageMaker Pipelines has enabled Rapid7 to deliver scalable, reliable, and efficient ML solutions. By embracing these best practices and lessons learned, teams can streamline their workflows, reduce operational overhead, and remain focused on driving innovation and value for their customers.

About the authors
Jimmy Cancilla is a Principal Software Engineer at Rapid7, focused on applying machine learning and AI to solve complex cybersecurity challenges. He leads the development of secure, cloud-based solutions that use automation and data-driven insights to improve threat detection and vulnerability management. He is driven by a vision of AI as a tool to augment human work, accelerating innovation, enhancing productivity, and enabling teams to achieve more with greater speed and impact.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Build secure RAG applications with AWS serverless data lakes

Data is your generative AI differentiator, and successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Traditional data architectures often struggle to meet the unique demands of generative AI applications. An effective generative AI data strategy requires several key components, such as seamless integration of diverse data sources, real-time processing capabilities, comprehensive data governance frameworks that maintain data quality and compliance, and secure access patterns that respect organizational boundaries. In particular, Retrieval Augmented Generation (RAG) applications have emerged as one of the most promising developments in this space. RAG is the process of optimizing the output of a foundation model (FM) so that it references a knowledge base outside of its training data sources before generating a response. Such systems require secure, scalable, and flexible data ingestion and access patterns to enterprise data.
In this post, we explore how to build a secure RAG application using serverless data lake architecture, an important data strategy to support generative AI development. We use Amazon Web Services (AWS) services including Amazon S3, Amazon DynamoDB, AWS Lambda, and Amazon Bedrock Knowledge Bases to create a comprehensive solution supporting unstructured data assets which can be extended to structured data. The post covers how to implement fine-grained access controls for your enterprise data and design metadata-driven retrieval systems that respect security boundaries. These approaches will help you maximize the value of your organization’s data while maintaining robust security and compliance.
Use case overview
As an example, consider a RAG-based generative AI application. The following diagram shows the typical conversational workflow that is initiated with a user prompt, for example, operations specialists in a retail company querying internal knowledge to get procurement and supplier details. Each user prompt is augmented with relevant context from data residing in an enterprise data lake.

In the solution, the user interacts with the Streamlit frontend, which serves as the application interface. Amazon Cognito enables identity provider (IdP) integration through IAM Identity Center, so that only authorized users can access the application. For production use, we recommend that you use a more robust frontend framework such as AWS Amplify, which provides a comprehensive set of tools and services for building scalable and secure web applications. After the user has successfully signed in, the application retrieves the list of datasets associated with the user’s ID from the DynamoDB table. The list of datasets is used as a filter when querying the knowledge base, so that answers come only from datasets the user is authorized to access. This is made possible because, when the datasets are ingested, the knowledge base is prepopulated with metadata files containing the user principal-dataset mapping stored in Amazon S3. The knowledge base returns the relevant results, which are then sent back to the application and displayed to the user.
The datasets reside in a serverless data lake on Amazon S3 and are governed using Amazon S3 Access Grants with IAM Identity Center trusted identity propagation enabling automated data permissions at scale. When an access grant is created or deleted for a user or group, the information is added to the DynamoDB table through event-driven architecture using AWS CloudTrail and Amazon EventBridge.
The workflow includes the following key data foundation steps:

Apply access policies to extract permissions for relevant data and filter results based on the prompting user’s role and permissions.
Enforce data privacy policies such as personally identifiable information (PII) redactions.
Enforce fine-grained access control.
Grant the user role permissions for sensitive information and compliance policies based on dataset classification in the data lake.
Extract, transform, and load multimodal data assets into a vector store.

In the following sections, we explain why a modern data strategy is important for generative AI and what challenges it solves.
Serverless data lakes powering RAG applications
Organizations implementing RAG applications face several critical challenges that impact both functionality and cost-effectiveness. At the forefront is security and access control. Applications must carefully balance broad data access with strict security boundaries. These systems need to allow access to data sources by only authorized users, apply dynamic filtering based on permissions and classifications, and maintain security context throughout the entire retrieval and generation process. This comprehensive security approach helps prevent unauthorized information exposure while still enabling powerful AI capabilities.
Data discovery and relevance present another significant hurdle. When dealing with petabytes of enterprise data, organizations must implement sophisticated systems for metadata management and advanced indexing. These systems need to understand query context and intent while efficiently ranking retrieval results to make sure users receive the most relevant information. Without proper attention to these aspects, RAG applications risk returning irrelevant or outdated information that diminishes their utility.
Performance considerations become increasingly critical as these systems scale. RAG applications must maintain consistently low latency while processing large document collections, handling multiple concurrent users, integrating data from distributed sources, and retrieving relevant data. The challenge of balancing real-time and historical data access adds another layer of complexity to maintaining responsive performance at scale.

Cost management represents a final key challenge that organizations must address. Without careful architectural planning, RAG implementations can lead to unnecessary expenses through duplicate data storage, excessive vector database operations, and inefficient data transfer patterns. Organizations need to optimize their resource utilization carefully to help prevent these costs from escalating while maintaining system performance and functionality.
A modern data strategy addresses the complex challenges of RAG applications through comprehensive governance frameworks and robust architectural components. At its core, the strategy implements sophisticated governance mechanisms that go beyond traditional data management approaches. These frameworks enable AI systems to dynamically access enterprise information while maintaining strict control over data lineage, access patterns, and regulatory compliance. By implementing comprehensive provenance tracking, usage auditing, and compliance frameworks, organizations can operate their RAG applications within established ethical and regulatory boundaries.
Serverless data lakes serve as the foundational component of this strategy, offering an elegant solution to both performance and cost challenges. Their inherent scalability automatically handles varying workloads without requiring complex capacity planning, and pay-per-use pricing models facilitate cost efficiency. The ability to support multiple data formats—from structured to unstructured—makes them particularly well-suited for RAG applications that need to process and index diverse document types.
To address security and access control challenges, the strategy implements enterprise-level data sharing mechanisms. These include sophisticated cross-functional access controls and federated access management systems that enable secure data exchange across organizational boundaries. Fine-grained permissions at the row, column, and object levels enforce security boundaries while maintaining necessary data access for AI systems.

Data discoverability challenges are met through centralized cataloging systems that help prevent duplicate efforts and enable efficient resource utilization. This comprehensive approach includes business glossaries, technical catalogs, and data lineage tracking, so that teams can quickly locate and understand available data assets. The catalog system is enriched with quality metrics that help maintain data accuracy and consistency across the organization.
Finally, the strategy implements a structured data classification framework that addresses security and compliance concerns. By categorizing information into clear sensitivity levels from public to restricted, organizations can create RAG applications that only retrieve and process information appropriate to user access levels. This systematic approach to data classification helps prevent unauthorized information disclosure while maintaining the utility of AI systems across different business contexts.

Our solution uses AWS services to create a secure, scalable foundation for enterprise RAG applications. The components are explained in the following sections.
Data lake structure using Amazon S3
Our data lake will use Amazon S3 as the primary storage layer, organized with the following structure:

s3://amzn-s3-demo-enterprise-datalake/
├── retail/
│   ├── product-catalog/
│   ├── customer-data/
│   └── sales-history/
├── finance/
│   ├── financial-statements/
│   ├── tax-documents/
│   └── budget-forecasts/
├── supply-chain/
│   ├── inventory-reports/
│   ├── supplier-contracts/
│   └── logistics-data/
└── shared/
    ├── company-policies/
    ├── knowledge-base/
    └── public-data/

Each business domain has dedicated folders containing domain-specific data, with common data stored in a shared folder.
Data sharing options
There are two options for data sharing. The first option is Amazon S3 Access Points, which provide a dedicated access endpoint policy for different applications or user groups. This approach enables fine-grained control without modifying the base bucket policy.
The following code is an example access point configuration. This policy grants the RetailAnalyticsRole read-only access (GetObject and ListBucket permissions) to data in both the retail-specific directory and the shared directory, but it restricts access to other business domain directories. The policy is attached to a dedicated S3 access point, allowing users with this role to retrieve only data relevant to retail operations and commonly shared resources:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailAnalyticsRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/retail/*",
        "arn:aws:s3:us-east-1:123456789012:accesspoint/retail-access-point/object/shared/*"
      ]
    }
  ]
}

The second option for data sharing is using bucket policies with path-based access control. Bucket policies can implement path-based restrictions to control which user roles can access specific data directories.

The following code is an example bucket policy. This bucket policy implements domain-based access control by granting different permissions based on user roles and data paths. The FinanceUserRole can only access data within the finance and shared directories, and the RetailUserRole can only access data within the retail and shared directories. This pattern enforces data isolation between business domains while facilitating access to common resources. Each role is limited to read-only operations (GetObject and ListBucket) on their authorized directories, which means users can only retrieve data relevant to their business functions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/FinanceUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/finance/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/RetailUserRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/retail/*",
        "arn:aws:s3:::amzn-s3-demo-enterprise-datalake/shared/*"
      ]
    }
  ]
}

As your number of datasets and use cases scales, you might require more policy space. Bucket policies work as long as the necessary policies fit within the policy size limits of S3 bucket policies (20 KB) and AWS Identity and Access Management (IAM) policies (5 KB), and within the number of IAM principals allowed per account. With an increasing number of datasets, access points offer a better alternative because each access point has its own dedicated policy. You can define quite granular access control patterns because you can have thousands of access points per AWS Region per account, with a policy up to 20 KB in size for each access point. Although S3 Access Points increase the amount of policy space available, they require a mechanism for clients to discover the right access point for the right dataset. To manage scale, S3 Access Grants provides a simplified model to map identities in directories such as Active Directory, or IAM principals, to datasets in Amazon S3 by prefix, bucket, or object. With the simplified access scheme in S3 Access Grants, you can grant read-only, write-only, or read-write access by Amazon S3 prefix to both IAM principals and directly to users or groups from a corporate directory. As a result, you can manage automated data permissions at scale.
An Amazon Comprehend PII redaction job identifies and redacts (or masks) sensitive data in documents residing in Amazon S3. After redaction, documents are verified for redaction effectiveness using Amazon Macie. Documents flagged by Macie are sent to another bucket for manual review, and cleared documents are moved to a redacted bucket ready for ingestion. For more details, refer to Protect sensitive data in RAG applications with Amazon Comprehend.
User-dataset mapping with DynamoDB
To dynamically manage access permissions, you can use DynamoDB to store mapping information between users or roles and datasets. You can automate the mapping from AWS Lake Formation to DynamoDB using CloudTrail and event-driven Lambda invocation. The DynamoDB structure consists of a table named UserDatasetAccess. Its primary key structure is:

Partition key – UserIdentifier (string) – IAM role Amazon Resource Name (ARN) or user ID
Sort key – DatasetID (string) – Unique identifier for each dataset

Additional attributes consist of:

DatasetPath (string) – S3 path to the dataset
AccessLevel (string) – READ, WRITE, or ADMIN
Classification (string) – PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED
Domain (string) – Business domain (such as retail or finance)
ExpirationTime (number) – Optional Time To Live (TTL) for temporary access

The following DynamoDB item represents an access mapping between a user role (RetailAnalyst) and a specific dataset (retail-products). It defines that this role has READ access to product catalog data in the retail domain with an INTERNAL security classification. When the RAG application processes a query, it references this mapping to determine which datasets the querying user can access, and the application only retrieves and uses data appropriate for the user’s permissions level.

{
  "UserIdentifier": "arn:aws:iam::123456789012:role/RetailAnalyst",
  "DatasetID": "retail-products",
  "DatasetPath": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/",
  "AccessLevel": "READ",
  "Classification": "INTERNAL",
  "Domain": "retail"
}

This approach provides a flexible, programmatic way to control which users can access specific datasets, enabling fine-grained permission management for RAG applications.
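
To illustrate how an application can consume this table, the following sketch queries DynamoDB for the dataset mappings recorded for a given role. The table and attribute names follow the structure described above; the role ARN is a placeholder.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserDatasetAccess")

def get_accessible_datasets(user_identifier):
    """Return the dataset mappings recorded for a user or role."""
    response = table.query(
        KeyConditionExpression=Key("UserIdentifier").eq(user_identifier)
    )
    return response["Items"]

items = get_accessible_datasets("arn:aws:iam::123456789012:role/RetailAnalyst")
allowed_paths = [item["DatasetPath"] for item in items if item["AccessLevel"] in ("READ", "WRITE", "ADMIN")]
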
Amazon Bedrock Knowledge Bases for unstructured data
Amazon Bedrock Knowledge Bases provides a managed solution for organizing, indexing, and retrieving unstructured data to support RAG applications. For our solution, we use this service to create domain-specific knowledge bases. With the metadata filtering feature provided by Amazon Bedrock Knowledge Bases, you can retrieve not only semantically relevant chunks but a well-defined subset of those relevant chunks based on applied metadata filters and associated values. In the next sections, we show how you can set this up.
Configuring knowledge bases with metadata filtering
We organize our knowledge bases to support filtering based on:

Business domain (such as finance, retail, or supply-chain)
Security classification (such as public, internal, confidential, or restricted)
Document type (such as policy, report, or guide)

Each document ingested into our knowledge base includes a standardized metadata structure:

{
  "source_uri": "s3://amzn-s3-demo-enterprise-datalake/retail/product-catalog/shoes-inventory-2023.pdf",
  "title": "Shoes Inventory Report 2023",
  "language": "en",
  "last_updated": "2023-12-15T14:30:00Z",
  "author": "Inventory Management Team",
  "business_domain": "retail",
  "security_classification": "internal",
  "document_type": "inventory_report",
  "departments": ["retail", "supply-chain"],
  "tags": ["footwear", "inventory", "2023"],
  "version": "1.2"
}
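
When you use an Amazon S3 data source, one common way to attach these attributes is a sidecar file named after the document with a .metadata.json suffix, wrapping the attributes in a metadataAttributes object. The following sketch shows that pattern; the bucket, keys, and attribute values are illustrative and mirror the structure above.

import json
import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-enterprise-datalake"
document_key = "retail/product-catalog/shoes-inventory-2023.pdf"

# Attributes the knowledge base can later filter on
metadata = {
    "metadataAttributes": {
        "business_domain": "retail",
        "security_classification": "internal",
        "document_type": "inventory_report"
    }
}

# Upload the document, then place the metadata file next to it
s3.upload_file("shoes-inventory-2023.pdf", bucket, document_key)
s3.put_object(
    Bucket=bucket,
    Key=f"{document_key}.metadata.json",
    Body=json.dumps(metadata).encode("utf-8")
)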

Code examples shown throughout this post are for reference only and highlight key API calls and logic. Additional implementation code is required for production deployments.
Amazon Bedrock Knowledge Bases API integration
To demonstrate how our RAG application will interact with the knowledge base, here’s a Python sample using the AWS SDK:

import boto3

# Amazon Bedrock Agents runtime client used for knowledge base retrieval
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')

# High-level logic for querying knowledge base with security filters
def query_knowledge_base(query_text, user_role, business_domain=None):
    # Get permitted classifications based on user role
    permitted_classifications = get_permitted_classifications(user_role)

    # Build security filter expression
    filter_expression = build_security_filters(permitted_classifications, business_domain)

    # Key API call for retrieval with security filtering
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId='your-kb-id',
        retrievalQuery={'text': query_text},
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 5,
                'filter': filter_expression  # Apply security filters here
            }
        }
    )

    return response['retrievalResults']
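
The helper functions referenced above are not part of the Bedrock API. The following sketch shows one possible implementation; the role-to-classification mapping is an assumption for illustration, and the filter keys correspond to the metadata attributes defined earlier.

# Assumed mapping of roles to the classifications they are allowed to read
ROLE_CLASSIFICATIONS = {
    'RetailAnalyst': ['public', 'internal'],
    'FinanceAnalyst': ['public', 'internal', 'confidential'],
}

def get_permitted_classifications(user_role):
    # Unknown roles fall back to public-only access
    return ROLE_CLASSIFICATIONS.get(user_role, ['public'])

def build_security_filters(permitted_classifications, business_domain=None):
    # Restrict results to classifications the caller may see
    conditions = [
        {'in': {'key': 'security_classification', 'value': permitted_classifications}}
    ]
    # Optionally restrict results to a single business domain
    if business_domain:
        conditions.append({'equals': {'key': 'business_domain', 'value': business_domain}})
    # A single condition is passed as-is; multiple conditions are combined with andAll
    return conditions[0] if len(conditions) == 1 else {'andAll': conditions}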

Conclusion
In this post, we’ve explored how to build a secure RAG application using a serverless data lake architecture. The approach we’ve outlined provides several key advantages:

Security-first design – Fine-grained access controls at scale mean that users only access data they’re authorized for
Scalability – Serverless components automatically handle varying workloads
Cost-efficiency – Pay-as-you-go pricing models optimize expenses
Flexibility – Seamless adaptation to different business domains and use cases

By implementing a modern data strategy with proper governance, security controls, and serverless architecture, organizations can make the most of their data assets for generative AI applications while maintaining security and compliance. The RAG architecture we’ve described enables contextualized, accurate responses that respect security boundaries, providing a powerful foundation for enterprise AI applications across diverse business domains. For next steps, consider implementing monitoring and observability to track performance and usage patterns.
For performance and usage monitoring:

Deploy Amazon CloudWatch metrics and dashboards to track key performance indicators such as query latency, throughput, and error rates
Set up CloudWatch Logs Insights to analyze usage patterns and identify optimization opportunities
Implement AWS X-Ray tracing to visualize request flows across your serverless components

For security monitoring and defense:

Enable Amazon GuardDuty to detect potential threats targeting your S3 data lake, Lambda functions, and other application resources
Implement Amazon Inspector for automated vulnerability assessments of your Lambda functions and container images
Configure AWS Security Hub to consolidate security findings and measure cloud security posture across your RAG application resources
Use Amazon Macie for continuous monitoring of S3 data lake contents to detect sensitive data exposures

For authentication and activity auditing:

Analyze AWS CloudTrail logs to audit API calls across your application stack
Implement CloudTrail Lake to create SQL-queryable datasets for security investigations
Enable Amazon Cognito advanced security features to detect suspicious sign-in activities

For data access controls:

Set up CloudWatch alarms to send alerts about unusual data access patterns
Configure AWS Config rules to monitor for compliance with access control best practices
Implement AWS IAM Access Analyzer to identify unintended resource access

Other important considerations include:

Adding feedback loops to continuously improve retrieval quality
Exploring multi-Region deployment for improved resilience
Implementing caching layers to optimize frequently accessed content
Extending the solution to support structured data assets using AWS Glue and AWS Lake Formation for data transformation and data access

With these foundations in place, your organization will be well-positioned to use generative AI technologies securely and effectively across the enterprise.

About the authors
Venkata Sistla is a Senior Specialist Solutions Architect in the Worldwide team at Amazon Web Services (AWS), with over 12 years of experience in cloud architecture. He specializes in designing and implementing enterprise-scale AI/ML systems across financial services, healthcare, mining and energy, independent software vendors (ISVs), sports, and retail sectors. His expertise lies in helping organizations transform their data challenges into competitive advantages through innovative cloud solutions while mentoring teams and driving technological excellence. He focuses on architecting highly scalable infrastructures that accelerate machine learning initiatives and deliver measurable business outcomes.
Aamna Najmi is a Senior GenAI and Data Specialist in the Worldwide team at Amazon Web Services (AWS). She assists customers across industries and Regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations, bringing a unique perspective of modern data strategies to complement the field of AI. In her spare time, she pursues her passion of experimenting with food and discovering new places.
 

Google DeepMind Releases GenAI Processors: A Lightweight Python Library that Enables Efficient and Parallel Content Processing

Google DeepMind recently released GenAI Processors, a lightweight, open-source Python library built to simplify the orchestration of generative AI workflows—especially those involving real-time multimodal content. Available under an Apache-2.0 license, the library provides a high-throughput, asynchronous stream framework for building advanced AI pipelines.

Stream‑Oriented Architecture

At the heart of GenAI Processors is the concept of processing asynchronous streams of ProcessorPart objects. These parts represent discrete chunks of data—text, audio, images, or JSON—each carrying metadata. By standardizing inputs and outputs into a consistent stream of parts, the library enables seamless chaining, combining, or branching of processing components while maintaining bidirectional flow. Internally, the use of Python’s asyncio enables each pipeline element to operate concurrently, dramatically reducing latency and improving overall throughput.
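
To illustrate the underlying idea of chained processors consuming an asynchronous stream of parts, the following is a generic asyncio sketch. It is not the GenAI Processors API; the Part class and processor names are invented for illustration.

import asyncio
from dataclasses import dataclass, field

@dataclass
class Part:
    """A discrete chunk of data plus metadata, loosely analogous to a stream part."""
    data: str
    metadata: dict = field(default_factory=dict)

async def source(texts):
    # Upstream stage: emit parts as soon as they are available
    for text in texts:
        yield Part(data=text, metadata={"mime": "text/plain"})
        await asyncio.sleep(0)  # yield control so downstream work can overlap

async def uppercase(parts):
    # Downstream stage: begins consuming before the source has finished
    async for part in parts:
        yield Part(data=part.data.upper(), metadata={**part.metadata, "stage": "uppercase"})

async def main():
    async for part in uppercase(source(["hello", "streaming", "world"])):
        print(part.data, part.metadata)

asyncio.run(main())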

Efficient Concurrency

GenAI Processors is engineered to optimize latency by minimizing “Time To First Token” (TTFT). As soon as upstream components produce pieces of the stream, downstream processors begin work. This pipelined execution ensures that operations—including model inference—overlap and proceed in parallel, achieving efficient utilization of system and network resources.

Plug‑and‑Play Gemini Integration

The library comes with ready-made connectors for Google’s Gemini APIs, including both synchronous text-based calls and the Gemini Live API for streaming applications. These “model processors” abstract away the complexity of batching, context management, and streaming I/O, enabling rapid prototyping of interactive systems—such as live commentary agents, multimodal assistants, or tool-augmented research explorers.

Modular Components & Extensions

GenAI Processors prioritizes modularity. Developers build reusable units—processors—each encapsulating a defined operation, from MIME-type conversion to conditional routing. A contrib/ directory encourages community extensions for custom features, further enriching the ecosystem. Common utilities support tasks such as splitting/merging streams, filtering, and metadata handling, enabling complex pipelines with minimal custom code.

Notebooks and Real‑World Use Cases

Included with the repository are hands-on examples demonstrating key use cases:

Real‑Time Live agent: Connects audio input to Gemini and optionally a tool like web search, streaming audio output—all in real time.

Research agent: Orchestrates data collection, LLM querying, and dynamic summarization in sequence.

Live commentary agent: Combines event detection with narrative generation, showcasing how different processors sync to produce streamed commentary.

These examples, provided as Jupyter notebooks, serve as blueprints for engineers building responsive AI systems.

Comparison and Ecosystem Role

GenAI Processors complements tools like the google-genai SDK (the GenAI Python client) and Vertex AI, but elevates development by offering a structured orchestration layer focused on streaming capabilities. Unlike LangChain—which is focused primarily on LLM chaining—or NeMo—which constructs neural components—GenAI Processors excels in managing streaming data and coordinating asynchronous model interactions efficiently.

Broader Context: Gemini’s Capabilities

GenAI Processors leverages Gemini’s strengths. Gemini, DeepMind’s multimodal large language model, supports processing of text, images, audio, and video—most recently seen in the Gemini 2.5 rollout. GenAI Processors enables developers to create pipelines that match Gemini’s multimodal skillset, delivering low-latency, interactive AI experiences.

Conclusion

With GenAI Processors, Google DeepMind provides a stream-first, asynchronous abstraction layer tailored for generative AI pipelines. By enabling:

Bidirectional, metadata-rich streaming of structured data parts

Concurrent execution of chained or parallel processors

Integration with Gemini model APIs (including Live streaming)

Modular, composable architecture with an open extension model

…this library bridges the gap between raw AI models and deployable, responsive pipelines. Whether you’re developing conversational agents, real-time document extractors, or multimodal research tools, GenAI Processors offers a lightweight yet powerful foundation.

Check out the Technical Details and GitHub Page. All credit for this research goes to the researchers of this project.
The post Google DeepMind Releases GenAI Processors: A Lightweight Python Library that Enables Efficient and Parallel Content Processing appeared first on MarkTechPost.

Meta AI Introduces UMA (Universal Models for Atoms): A Family of Unive …

Density Functional Theory (DFT) serves as the foundation of modern computational chemistry and materials science. However, its high computational cost severely limits its usage. Machine Learning Interatomic Potentials (MLIPs) have the potential to closely approximate DFT accuracy while significantly improving performance, reducing computation time from hours to less than a second with O(n) versus O(n³) scaling. However, training MLIPs that generalize across different chemical tasks remains an open challenge, as traditional methods rely on smaller problem-specific datasets instead of using the scaling advantages that have driven significant advances in language and vision models.

Existing attempts to address these challenges have focused on developing Universal MLIPs trained on larger datasets, with datasets like Alexandria and OMat24 leading to improved performance on the Matbench-Discovery leaderboard. Moreover, researchers have explored scaling relations to understand relationships between compute, data, and model size, taking inspiration from empirical scaling laws in LLMs that motivated training on more tokens with larger models for predictable performance improvements. These scaling relations help in determining optimal resource allocation between the dataset and model size. However, their application to MLIPs remains limited compared to the transformative impact seen in language modeling.

Researchers from FAIR at Meta and Carnegie Mellon University have proposed a family of Universal Models for Atoms (UMA) designed to test the limits of accuracy, speed, and generalization for a single model across chemistry and materials science. To address these challenges, they developed empirical scaling laws relating compute, data, and model size to determine optimal model sizing and training strategies. This helped overcome the challenge of balancing accuracy and efficiency that arises from training on an unprecedented dataset of ~500 million atomic systems. Moreover, UMA performs similarly to or better than specialized models in both accuracy and inference speed on a wide range of material, molecular, and catalysis benchmarks, without fine-tuning to specific tasks.

The UMA architecture builds upon eSEN, an equivariant graph neural network, with crucial modifications to enable efficient scaling and handle additional inputs, including total charge, spin, and DFT settings for emulation. It also incorporates a new embedding that allows UMA models to integrate charge, spin, and DFT-related tasks. Each of these inputs generates an embedding of the same dimension as the spherical channels used. The training follows a two-stage approach: the first stage directly predicts forces for faster training, and the second stage removes the force head and fine-tunes the model to predict conserving forces and stresses using auto-grad, ensuring energy conservation and smooth potential energy landscapes.

The results show that UMA models exhibit log-linear scaling behavior across the tested FLOP ranges. This indicates that greater model capacity is required to fit the UMA dataset, with these scaling relationships used to select accurate model sizes and show MoLE’s advantages over dense architectures. In multi-task training, a significant improvement is observed in loss when moving from 1 expert to 8 experts, smaller gains with 32 experts, and negligible improvements at 128 experts. Moreover, UMA models demonstrate exceptional inference efficiency despite having large parameter counts, with UMA-S capable of simulating 1000 atoms at 16 steps per second and fitting system sizes up to 100,000 atoms in memory on a single 80GB GPU.

In conclusion, researchers introduced a family of Universal Models for Atoms (UMA) that shows strong performance across a wide range of benchmarks, including materials, molecules, catalysts, molecular crystals, and metal-organic frameworks. It achieves new state-of-the-art results on established benchmarks such as AdsorbML and Matbench Discovery. However, it fails to handle long-range interactions due to the standard 6Å cutoff distance. Moreover, it uses separate embeddings for discrete charge or spin values, which limits generalization to unseen charges or spins. Future research aims to advance toward universal MLIPs and unlock new possibilities in atomic simulations, while highlighting the need for more challenging benchmarks to drive future progress.

Check out the Paper, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post Meta AI Introduces UMA (Universal Models for Atoms): A Family of Universal Models for Atoms appeared first on MarkTechPost.

Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused on Long Context, Code, Reasoning, and Agentic Behavior

Kimi K2, launched by Moonshot AI in July 2025, is a purpose-built, open-source Mixture-of-Experts (MoE) model—1 trillion total parameters, with 32 billion active parameters per token. It’s trained using the custom MuonClip optimizer on 15.5 trillion tokens, achieving stable training at this unprecedented scale without the typical instabilities seen in ultra-large models.

Unlike traditional chatbots, K2 is architected specifically for agentic workflows. It features native Model Context Protocol (MCP) support and was trained on simulated multi-step tool interactions, enabling it to autonomously decompose tasks, execute tool sequences, write and debug code, analyze data, and orchestrate workflows—all with minimal human oversight.

Why Agentic over Conversational?

While advanced models like GPT-4 and Claude 4 Sonnet excel at language reasoning, Kimi K2 moves from reasoning to action. It doesn’t just respond—it executes. The core shift lies in enabling real-world workflows:

Autonomous code execution

Data analysis with charts and interfaces

End-to-end web application development

Orchestration of 17+ tools per session without human input

K2’s training incorporated millions of synthetic dialogues, each rated by an LLM-based evaluator. These dialogues simulate realistic tool-use scenarios, giving K2 a practical edge in tool selection and multi-step execution.

Architecture and Training Innovations

K2’s technical design demonstrates several novel elements:

MoE Transformer Design: 384 experts with routing to 8 active experts per token, plus 1 shared expert for global context. The model uses 64 attention heads and supports a 128K-token context window.

MuonClip Optimizer: A modified version of Muon that stabilizes training at scale. It uses qk-clipping to constrain attention scores by rescaling Q/K matrices, effectively preventing instability in deep layers.

Training Dataset: Over 15.5 trillion tokens from multilingual and multimodal sources, giving K2 robust generalization and tool-use reasoning across diverse domains.

The model comes in two variants: Kimi-K2-Base, the foundational model ideal for fine-tuning and building customized solutions; and Kimi-K2-Instruct, the post-trained version optimized for immediate use in general-purpose chat and tool-using agentic tasks. Instruct is reflex-grade—optimized for fast, low-latency interaction rather than long-form deliberation. On benchmarks, Kimi K2 outperforms Claude Sonnet 4 and GPT-4.1 in coding and agentic reasoning, with 71.6% on SWE-bench, 65.8% on agentic tasks, and 53.7% on LiveCodeBench.

Performance Benchmarks

Kimi K2 not only matches but often surpasses closed-source models on key benchmarks:

Benchmark | Kimi K2 | GPT-4.1 | Claude Sonnet 4
SWE-bench Verified | 71.6% | 54.6% | ~72.7%
Agentic Coding (Tau2) | 65.8% | 45.2% | ~61%
LiveCodeBench v6 (Pass@1) | 53.7% | 44.7% | 47.4%
MATH-500 | 97.4% | 92.4% | –
MMLU | 89.5% | ~90.4% | ~92.9%

Its performance in agentic benchmarks like Tau2 and LiveCodeBench demonstrates its superior capacity to handle multi-step, real-world coding tasks—outperforming many proprietary models.

Cost Efficiency

Perhaps the most disruptive element is pricing:

Claude 4 Sonnet: $3 input / $15 output per million tokens

Gemini 2.5 Pro: $2.5 input / $15 output

Kimi K2: $0.60 input / $2.50 output

Kimi K2 is roughly 5x cheaper than Claude or Gemini while offering equal or better performance on several metrics. The cost advantage, combined with open access and support for local deployment, positions K2 as an economically viable alternative for developers, enterprises, and research teams.

Strategic Shift: From Thinking to Acting

Kimi K2 marks a pivotal moment in AI’s evolution—from thinking agents to acting systems. With native tool-use capabilities and built-in support for multi-agent protocols, it goes far beyond static chat interfaces. It is capable of triggering workflows, making decisions, executing API calls, and delivering tangible outputs autonomously.

Moreover, its release comes at a time when most such capabilities are either locked behind expensive APIs or limited to research labs. K2 is:

Open-source, requiring no subscription

Globally accessible, not limited to US-based deployment

Designed for developers, not just end-users

Broader Implications

Will agentic architecture become the norm? K2’s strong performance on tool use tasks could push proprietary players to rethink their architectures.

Can open-source efforts from Asia compete at global scale? With K2, Moonshot AI joins others like DeepSeek in showing that top-tier performance doesn’t have to originate from Silicon Valley.

What’s next in the agentic evolution? Future models may combine video, robotics, and embodied reasoning to further expand the scope of what agentic AI can accomplish.

Conclusion

Kimi K2 isn’t just a bigger model—it’s a blueprint for what comes after the reasoning race: execution-first AI. By combining trillion-parameter scale, low inference costs, and deeply integrated agentic capabilities, Kimi K2 opens the door for AI systems that do more than generate—they build, act, and solve autonomously.

Check out the Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.
The post Moonshot AI Releases Kimi K2: A Trillion-Parameter MoE Model Focused on Long Context, Code, Reasoning, and Agentic Behavior appeared first on MarkTechPost.

From Perception to Action: The Role of World Models in Embodied AI Systems

Introduction to Embodied AI Agents

Embodied AI agents are systems that exist in physical or virtual forms, such as robots, wearables, or avatars, and can interact with their surroundings. Unlike static web-based bots, these agents perceive the world and act meaningfully within it. Their embodiment enhances physical interaction, human trust, and human-like learning. Recent advances in large language and vision-language models have powered more capable, autonomous agents that can plan, reason, and adapt to users’ needs. These agents understand context, retain memory, and can collaborate or request clarification when needed. Despite progress, challenges remain, especially with generative models that often prioritize detail over efficient reasoning and decision-making.

World Modeling and Applications

Researchers at Meta AI are exploring how embodied AI agents, such as avatars, wearables, and robots, can interact more naturally with users and their surroundings by sensing, learning, and acting within real or virtual environments. Central to this is “world modeling,” which combines perception, reasoning, memory, and planning to help agents understand both physical spaces and human intentions. These agents are reshaping industries such as healthcare, entertainment, and labor. The study highlights future goals, such as enhancing collaboration, social intelligence, and ethical safeguards, particularly around privacy and anthropomorphism, as these agents become increasingly integrated into our lives.

Types of Embodied Agents

Embodied AI agents come in three forms: virtual, wearable, and robotic, and are designed to interact with the world in much the same way as humans. Virtual agents, such as therapy bots or avatars in the metaverse, simulate emotions to foster empathetic interactions. Wearable agents, such as those in smart glasses, share the user’s view and assist with real-time tasks or provide cognitive support. Robotic agents operate in physical spaces, assisting with complex or high-risk tasks such as caregiving or disaster response. These agents not only enhance daily life but also push us closer to general AI by learning through real-world experience, perception, and physical interaction.

Importance of World Models

World models are crucial for embodied AI agents, enabling them to perceive, understand, and interact with their environment like humans. These models integrate various sensory inputs, such as vision, sound, and touch, with memory and reasoning capabilities to form a cohesive understanding of the world. This enables agents to anticipate outcomes, plan effective actions, and adapt to new situations. By incorporating both physical surroundings and user intentions, world models facilitate more natural and intuitive interactions between humans and AI agents, enhancing their ability to perform complex tasks autonomously.

To enable truly autonomous learning in Embodied AI, future research must integrate passive observation (such as vision-language learning) with active interaction (like reinforcement learning). Passive systems excel at understanding structure from data but lack grounding in real-world actions. Active systems learn through doing, but are often inefficient. By combining both, AI can gain abstract knowledge and apply it through goal-driven behavior. Looking ahead, collaboration among multiple agents adds complexity, requiring effective communication, coordination, and conflict resolution. Strategies like emergent communication, negotiation, and multi-agent reinforcement learning will be key. Ultimately, the aim is to build adaptable, interactive AI that learns like humans through experience.

Conclusion

In conclusion, the study examines how embodied AI agents, such as virtual avatars, wearable devices, and robots, can interact with the world more like humans by perceiving, learning, and acting within their environments. Central to their success is building “world models” that help them understand context, predict outcomes, and plan effectively. These agents are already reshaping areas like therapy, entertainment, and real-time assistance. As they become more integrated into daily life, ethical issues such as privacy and human-like behavior require careful attention. Future work will focus on improving learning, collaboration, and social intelligence, aiming for more natural, intuitive, and responsible human-AI interaction.

Check out the Paper here. All credit for this research goes to the researchers of this project.
The post From Perception to Action: The Role of World Models in Embodied AI Systems appeared first on MarkTechPost.

This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion

Understanding the Link Between Body Movement and Visual Perception

The study of human visual perception through egocentric views is crucial in developing intelligent systems capable of understanding and interacting with their environment. This area emphasizes how movements of the human body—ranging from locomotion to arm manipulation—shape what is seen from a first-person perspective. Understanding this relationship is essential for enabling machines and robots to plan and act with a human-like sense of visual anticipation, particularly in real-world scenarios where visibility is dynamically influenced by physical motion.

Challenges in Modeling Physically Grounded Perception

A major hurdle in this domain arises from the challenge of teaching systems how body actions affect perception. Actions such as turning or bending change what is visible in subtle and often delayed ways. Capturing this requires more than simply predicting what comes next in a video—it involves linking physical movements to the resulting changes in visual input. Without the ability to interpret and simulate these changes, embodied agents struggle to plan or interact effectively in dynamic environments.

Limitations of Prior Models and the Need for Physical Grounding

Until now, tools designed to predict video from human actions have been limited in scope. Models have often used low-dimensional input, such as velocity or head direction, and overlooked the complexity of whole-body motion. These simplified approaches overlook the fine-grained control and coordination required to simulate human actions accurately. Even in video generation models, body motion has usually been treated as the output rather than the driver of prediction. This lack of physical grounding has restricted the usefulness of these models for real-world planning.

Introducing PEVA: Predicting Egocentric Video from Action

Researchers from UC Berkeley, Meta’s FAIR, and New York University introduced a new framework called PEVA to overcome these limitations. The model predicts future egocentric video frames based on structured full-body motion data, derived from 3D body pose trajectories. PEVA aims to demonstrate how entire-body movements influence what a person sees, thereby grounding the connection between action and perception. The researchers employed a conditional diffusion transformer to learn this mapping and trained it using Nymeria, a large dataset comprising real-world egocentric videos synchronized with full-body motion capture.

Structured Action Representation and Model Architecture

The foundation of PEVA lies in its ability to represent actions in a highly structured manner. Each action input is a 48-dimensional vector that includes the root translation and joint-level rotations across 15 upper body joints in 3D space. This vector is normalized and transformed into a local coordinate frame centered at the pelvis to remove any positional bias. By utilizing this comprehensive representation of body dynamics, the model captures the continuous and nuanced nature of real motion. PEVA is designed as an autoregressive diffusion model that uses a video encoder to convert frames into latent state representations and predicts subsequent frames based on prior states and body actions. To support long-term video generation, the system introduces random time-skips during training, allowing it to learn from both immediate and delayed visual consequences of motion.

Performance Evaluation and Results

In terms of performance, PEVA was evaluated on several metrics that test both short-term and long-term video prediction capabilities. The model was able to generate visually consistent and semantically accurate video frames over extended periods of time. For short-term predictions, evaluated at 2-second intervals, it achieved lower LPIPS scores and higher DreamSim consistency compared to baselines, indicating superior perceptual quality. The system also decomposed human movement into atomic actions such as arm movements and body rotations to assess fine-grained control. Furthermore, the model was tested on extended rollouts of up to 16 seconds, successfully simulating delayed outcomes while maintaining sequence coherence. These experiments confirmed that incorporating full-body control led to substantial improvements in video realism and controllability.

Conclusion: Toward Physically Grounded Embodied Intelligence

This research highlights a significant advancement in predicting future egocentric video by grounding the model in physical human movement. The problem of linking whole-body action to visual outcomes is addressed with a technically robust method that uses structured pose representations and diffusion-based learning. The solution introduced by the team offers a promising direction for embodied AI systems that require accurate, physically grounded foresight.

Check out the Paper here. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion appeared first on MarkTechPost.

Advanced fine-tuning methods on Amazon SageMaker AI

This post provides the theoretical foundation and practical insights needed to navigate the complexities of LLM development on Amazon SageMaker AI, helping organizations make optimal choices for their specific use cases, resource constraints, and business objectives.
We also address the three fundamental aspects of LLM development: the core lifecycle stages, the spectrum of fine-tuning methodologies, and the critical alignment techniques that provide responsible AI deployment. We explore how Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA have democratized model adaptation, so organizations of all sizes can customize large models to their specific needs. Additionally, we examine alignment approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which help make sure these powerful systems behave in accordance with human values and organizational requirements. Finally, we focus on knowledge distillation, which enables efficient model training through a teacher/student approach, where a smaller model learns from a larger one, while mixed precision training and gradient accumulation techniques optimize memory usage and batch processing, making it possible to train large AI models with limited computational resources.
Throughout the post, we focus on practical implementation while addressing the critical considerations of cost, performance, and operational efficiency. We begin with pre-training, the foundational phase where models gain their broad language understanding. Then we examine continued pre-training, a method to adapt models to specific domains or tasks. Finally, we discuss fine-tuning, the process that hones these models for particular applications. Each stage plays a vital role in shaping large language models (LLMs) into the sophisticated tools we use today, and understanding these processes is key to grasping the full potential and limitations of modern AI language models.
If you’re just getting started with large language models or looking to get more out of your current LLM projects, we’ll walk you through everything you need to know about fine-tuning methods on Amazon SageMaker AI.
Pre-training
Pre-training represents the foundation of LLM development. During this phase, models learn general language understanding and generation capabilities through exposure to massive amounts of text data. This process typically involves training from scratch on diverse datasets, often consisting of hundreds of billions of tokens drawn from books, articles, code repositories, webpages, and other public sources.
Pre-training teaches the model broad linguistic and semantic patterns, such as grammar, context, world knowledge, reasoning, and token prediction, using self-supervised learning techniques like masked language modeling (for example, BERT) or causal language modeling (for example, GPT). At this stage, the model is not tailored to any specific downstream task but rather builds a general-purpose language representation that can be adapted later using fine-tuning or PEFT methods.
Pre-training is highly resource-intensive, requiring substantial compute (often across thousands of GPUs or AWS Trainium chips), large-scale distributed training frameworks, and careful data curation to balance performance with bias, safety, and accuracy concerns.
Continued pre-training (also known as domain-adaptive pre-training or intermediate pre-training) is the process of taking a pre-trained language model and further training it on domain-specific or task-relevant corpora before fine-tuning. Unlike full pre-training from scratch, this approach builds on the existing capabilities of a general-purpose model, allowing it to internalize new patterns, vocabulary, or context relevant to a specific domain.
This step is particularly useful when the models must handle specialized terminology or unique syntax, particularly in fields like law, medicine, or finance. This approach is also essential when organizations need to align AI outputs with their internal documentation standards and proprietary knowledge bases. Additionally, it serves as an effective solution for addressing gaps in language or cultural representation by allowing focused training on underrepresented dialects, languages, or regional content.
To learn more, refer to the following resources:

Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker
Customize models in Amazon Bedrock with your own data using fine-tuning and continued pre-training

Alignment methods for LLMs
The alignment of LLMs represents a crucial step in making sure these powerful systems behave in accordance with human values and preferences. AWS provides comprehensive support for implementing various alignment techniques, each offering distinct approaches to achieving this goal. The following are the key approaches.
Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is one of the most established approaches to model alignment. This method transforms human preferences into a learned reward signal that guides model behavior. The RLHF process consists of three distinct phases. First, we collect comparison data, where human annotators choose between different model outputs for the same prompt. This data forms the foundation for training a reward model, which learns to predict human preferences. Finally, we fine-tune the language model using Proximal Policy Optimization (PPO), optimizing it to maximize the predicted reward.
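
As a concrete illustration of the reward-modeling phase, the following PyTorch-style sketch shows the pairwise preference loss typically used to train the reward model; the scores are illustrative tensors rather than outputs of a real model.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Push the reward of the preferred response above the rejected one (Bradley-Terry)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example scores for a batch of three human comparisons
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.5, 1.0])
loss = reward_model_loss(chosen, rejected)
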
Constitutional AI represents an innovative approach to alignment that reduces dependence on human feedback by enabling models to critique and improve their own outputs. This method involves training models to internalize specific principles or rules, then using these principles to guide generation and self-improvement. The reinforcement learning phase is similar to RLHF, except that pairs of responses are generated and evaluated by an AI model, as opposed to a human.
To learn more, refer to the following resources:

Fine-tune large language models with reinforcement learning from human or AI feedback
Improving your LLMs with RLHF on Amazon SageMaker
High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

Direct Preference Optimization
Direct Preference Optimization (DPO) is an alternative to RLHF, offering a more straightforward path to model alignment. DPO alleviates the need for explicit reward modeling and complex RL training loops, instead directly optimizing the model’s policy to align with human preferences through a modified supervised learning approach.
The key innovation of DPO lies in its formulation of preference learning as a classification problem. Given pairs of responses where one is preferred over the other, DPO trains the model to assign higher probability to preferred responses. This approach maintains theoretical connections to RLHF while significantly simplifying the implementation process. When implementing alignment methods, the effectiveness of DPO heavily depends on the quality, volume, and diversity of the preference dataset. Organizations must establish robust processes for collecting and validating human feedback while mitigating potential biases in label preferences.
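
To make the classification view concrete, the following PyTorch-style sketch expresses the DPO objective over a batch of preference pairs. The per-response log-probabilities and the beta value are illustrative inputs; computing them from the policy and a frozen reference model is omitted.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Log-ratios of the policy relative to the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Preferred responses should receive a higher implicit reward than rejected ones
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
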
For more information about DPO, see Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Studio, and Amazon SageMaker Ground Truth.
Fine-tuning methods on AWS
Fine-tuning transforms a pre-trained model into one that excels at specific tasks or domains. This phase involves training the model on carefully curated datasets that represent the target use case. Fine-tuning can range from updating all model parameters to more efficient approaches that modify only a small subset of parameters. Amazon SageMaker HyperPod offers fine-tuning capabilities for supported foundation models (FMs), and Amazon SageMaker Model Training offers flexibility for custom fine-tuning implementations along with training the models at scale without the need to manage infrastructure.
At its core, fine-tuning is a transfer learning process where a model’s existing knowledge is refined and redirected toward specific tasks or domains. This process involves carefully balancing the preservation of the model’s general capabilities while incorporating new, specialized knowledge.
Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) involves updating model parameters using a curated dataset of input-output pairs that reflect the desired behavior. SFT enables precise behavioral control and is particularly effective when the model needs to follow specific instructions, maintain tone, or deliver consistent output formats, making it ideal for applications requiring high reliability and compliance. In regulated industries like healthcare or finance, SFT is often used after continued pre-training, which exposes the model to large volumes of domain-specific text to build contextual understanding. Although continued pre-training helps the model internalize specialized language (such as clinical or legal terms), SFT teaches it how to perform specific tasks such as generating discharge summaries, filling documentation templates, or complying with institutional guidelines. Both steps are typically essential: continued pre-training makes sure the model understands the domain, and SFT makes sure it behaves as required. However, because it updates the full model, SFT requires more compute resources and careful dataset construction. The dataset preparation process requires careful curation and validation to make sure the model learns the intended patterns and avoids undesirable biases.
For more details about SFT, refer to the following resources:

Supervised fine-tuning on SageMaker training jobs
SageMaker HyperPod recipes

Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) represents a significant advancement in model adaptation, helping organizations customize large models while dramatically reducing computational requirements and costs. The following table summarizes the different types of PEFT.

PEFT Type | AWS Service | How It Works | Benefits
LoRA (Low-Rank Adaptation) | SageMaker Training (custom implementation) | Instead of updating all model parameters, LoRA injects trainable rank decomposition matrices into transformer layers, reducing trainable parameters | Memory efficient, cost-efficient, opens up possibility of adapting larger models
QLoRA (Quantized LoRA) | SageMaker Training (custom implementation) | Combines model quantization with LoRA, loading the base model in 4-bit precision while adapting it with trainable LoRA parameters | Further reduces memory requirements compared to standard LoRA
Prompt Tuning (additive) | SageMaker Training (custom implementation) | Prepends a small set of learnable prompt tokens to the input embeddings; only these tokens are trained | Lightweight and fast tuning, good for task-specific adaptation with minimal resources
P-Tuning (additive) | SageMaker Training (custom implementation) | Uses a deep prompt (tunable embedding vector passed through an MLP) instead of discrete tokens, enhancing expressiveness of prompts | More expressive than prompt tuning, effective in low-resource settings
Prefix Tuning (additive) | SageMaker Training (custom implementation) | Prepends trainable continuous vectors (prefixes) to the attention keys and values in every transformer layer, leaving the base model frozen | Effective for long-context tasks, avoids full model fine-tuning, and reduces compute needs

The selection of a PEFT method significantly impacts the success of model adaptation. Each technique presents distinct advantages that make it particularly suitable for specific scenarios. In the following sections, we provide a comprehensive analysis of when to employ different PEFT approaches.
Low-Rank Adaptation
Low-Rank Adaptation (LoRA) excels in scenarios requiring substantial task-specific adaptation while maintaining reasonable computational efficiency. It’s particularly effective in the following use cases:

Domain adaptation for enterprise applications – When adapting models to specialized industry vocabularies and conventions, such as legal, medical, or financial domains, LoRA provides sufficient capacity for learning domain-specific patterns while keeping training costs manageable. For instance, a healthcare provider might use LoRA to adapt a base model to medical terminology and clinical documentation standards.
Multi-language adaptation – Organizations extending their models to new languages find LoRA particularly effective. It allows the model to learn language-specific nuances while preserving the base model’s general knowledge. For example, a global ecommerce platform might employ LoRA to adapt their customer service model to different regional languages and cultural contexts.
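
As a minimal illustration of a custom LoRA implementation on SageMaker Training, the following sketch uses the open-source Hugging Face PEFT library; the model name, rank, and target modules are illustrative choices rather than recommended values.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a base model; any causal language model checkpoint could be substituted
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Inject low-rank adapters into the attention projections; the base weights stay frozen
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank of the decomposition matrices
    lora_alpha=32,        # scaling factor applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters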

To learn more, refer to the following resources:

Accelerating Mixtral MOE fine-tuning on Amazon SageMaker with QLoRA
Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium
PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium
Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

Prompt tuning
Prompt tuning is ideal in scenarios requiring lightweight, switchable task adaptations. With prompt tuning, you can store multiple prompt vectors for different tasks without modifying the model itself. A primary use case could be when different customers require slightly different versions of the same basic functionality: prompt tuning allows efficient switching between customer-specific behaviors without loading multiple model versions. It’s useful in the following scenarios:

Personalized customer interactions – Companies offering software as a service (SaaS) platform with customer support or virtual assistants can use prompt tuning to personalize response behavior for different clients without retraining the model. Each client’s brand tone or service nuance can be encoded in prompt vectors.
Task switching in multi-tenant systems – In systems where multiple natural language processing (NLP) tasks (for example, summarization, sentiment analysis, classification) need to be served from a single model, prompt tuning enables rapid task switching with minimal overhead.
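
A minimal sketch of this approach with the open-source Hugging Face PEFT library follows; the base model, initialization text, and number of virtual tokens are illustrative.

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

model_name = "bigscience/bloomz-560m"  # illustrative small base model
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Learn a small set of virtual prompt tokens; the base model stays frozen
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=16,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Respond in this customer's brand tone:",
    tokenizer_name_or_path=model_name,
)

model = get_peft_model(base_model, prompt_config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable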

For more information, see Prompt tuning for causal language modeling.
P-tuning
P-tuning extends prompt tuning by representing prompts as continuous embeddings passed through a small trainable neural network (typically an MLP). Unlike prompt tuning, which directly learns token embeddings, P-tuning enables more expressive and non-linear prompt representations, making it suitable for complex tasks and smaller models. It’s useful in the following use cases:

Low-resource domain generalization – A common use case includes low-resource settings where labeled data is limited, yet the task requires nuanced prompt conditioning to steer model behavior. For example, organizations operating in low-data regimes (such as niche scientific research or regional dialect processing) can use P-tuning to extract better task-specific performance without the need for large fine-tuning datasets.

To learn more, see P-tuning.
Prefix tuning
Prefix tuning prepends trainable continuous vectors, also called prefixes, to the key-value pairs in each attention layer of a transformer, while keeping the base model frozen. This provides control over the model’s behavior without altering its internal weights. Prefix tuning excels in tasks that benefit from conditioning across long contexts, such as document-level summarization or dialogue modeling. It provides a powerful compromise between performance and efficiency, especially when serving multiple tasks or clients from a single frozen base model. Consider the following use case:

Dialogue systems – Companies building dialogue systems with varied tones (for example, friendly vs. formal) can use prefix tuning to control the persona and coherence across multi-turn interactions without altering the base model.

For more details, see Prefix tuning for conditional generation.
LLM optimization
LLM optimization represents a critical aspect of the model development lifecycle, enabling more efficient training, reduced computational costs, and improved deployment flexibility. AWS provides a comprehensive suite of tools and techniques for implementing these optimizations effectively.
Quantization
Quantization is a process of mapping a large set of input values to a smaller set of output values. In digital signal processing and computing, it involves converting continuous values to discrete values and reducing the precision of numbers (for example, from 32-bit to 8-bit). In machine learning (ML), quantization is particularly important for deploying models on resource-constrained devices, because it can significantly reduce model size while maintaining acceptable performance. One of the most widely used techniques is Quantized Low-Rank Adaptation (QLoRA). QLoRA is an efficient fine-tuning technique for LLMs that combines quantization and LoRA approaches. It uses 4-bit quantization to reduce model memory usage while maintaining model weights in 4-bit precision during training and employs double quantization for further memory reduction. The technique integrates LoRA by adding trainable rank decomposition matrices and keeping adapter parameters in 16-bit precision, enabling PEFT. QLoRA offers significant benefits, including up to 75% reduced memory usage, the ability to fine-tune large models on consumer GPUs, performance comparable to full fine-tuning, and cost-effective training of LLMs. This has made it particularly popular in the open-source AI community because it makes working with LLMs more accessible to developers with limited computational resources.
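
The following sketch shows how this combination is commonly set up with the open-source Transformers, bitsandbytes, and PEFT libraries; the model name and hyperparameters are illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

# Load the frozen base model in 4-bit NF4 precision with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
)

# Attach 16-bit LoRA adapters on top of the quantized base model
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
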
To learn more, refer to the following resources:

Interactively fine-tune Falcon-40B and other LLMs on Amazon SageMaker Studio notebooks using QLoRA
Fine-tune Llama 2 using QLoRA and deploy it on Amazon SageMaker with AWS Inferentia2

Knowledge distillation
Knowledge distillation is a groundbreaking model compression technique in the world of AI, where a smaller student model learns to emulate the sophisticated behavior of a larger teacher model. This innovative approach has revolutionized the way we deploy AI solutions in real-world applications, particularly where computational resources are limited. By learning not only from ground truth labels but also from the teacher model’s probability distributions, the student model can achieve remarkable performance while maintaining a significantly smaller footprint. This makes it invaluable for various practical applications, from powering AI features on mobile devices to enabling edge computing solutions and Internet of Things (IoT) implementations. The key feature of distillation lies in its ability to democratize AI deployment—making sophisticated AI capabilities accessible across different platforms without compromising too much on performance. With knowledge distillation, you can run real-time speech recognition on smartphones, implement computer vision systems in resource-constrained environments, optimize NLP tasks for faster inference, and more.
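
As a simple illustration of learning from the teacher's probability distributions, the following PyTorch-style sketch blends a soft, temperature-scaled KL term with the usual hard-label cross-entropy; the temperature and weighting are illustrative hyperparameters.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend the soft-target loss from the teacher with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Multiplying by temperature^2 keeps gradients comparable to the hard loss
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random logits for a batch of 4 examples and 10 classes
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
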
For more information about knowledge distillation, refer to the following resources:

A guide to Amazon Bedrock Model Distillation (preview)
Use Llama 3.1 405B for synthetic data generation and distillation to fine-tune smaller models

Mixed precision training
Mixed precision training is a cutting-edge optimization technique in deep learning that balances computational efficiency with model accuracy. By intelligently combining different numerical precisions—primarily 32-bit (FP32) and 16-bit (FP16) floating-point formats—this approach revolutionizes how we train complex AI models. Its key feature is selective precision usage: maintaining critical operations in FP32 for stability while using FP16 for less sensitive calculations, resulting in a balance of performance and accuracy. This technique has become a game changer in the AI industry, enabling up to three times faster training speeds, a significantly reduced memory footprint, and lower power consumption. It’s particularly valuable for training resource-intensive models like LLMs and complex computer vision systems. For organizations using cloud computing and GPU-accelerated workloads, mixed precision training offers a practical solution to optimize hardware utilization while maintaining model quality. This approach has effectively democratized the training of large-scale AI models, making it more accessible and cost-effective for businesses and researchers alike.
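
The following PyTorch sketch illustrates the pattern with a toy model; the automatic mixed precision (AMP) autocast context runs the forward pass in FP16 where it is numerically safe, while the gradient scaler guards against underflow. A CUDA device is assumed.

import torch
from torch import nn

# Toy model and data for illustration; replace with your own training setup
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(32, 128, device="cuda")
labels = torch.randint(0, 10, (32,), device="cuda")

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for step in range(10):
    optimizer.zero_grad()
    # Forward pass in mixed precision; master weights remain in FP32
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then updates weights
    scaler.update()
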
To learn more, refer to the following resources:

Mixed precision training with FP8 on P5 instances using Transformer Engine
Mixed precision training with half-precision data types using PyTorch FSDP
Efficiently train models with large sequence lengths using Amazon SageMaker model parallel

Gradient accumulation
Gradient accumulation is a powerful technique in deep learning that addresses the challenges of training large models with limited computational resources. Developers can simulate larger batch sizes by accumulating gradients over multiple smaller forward and backward passes before performing a weight update. Think of it as breaking down a large batch into smaller, more manageable mini batches while maintaining the effective training dynamics of the larger batch size. This method has become particularly valuable in scenarios where memory constraints would typically prevent training with optimal batch sizes, such as when working with LLMs or high-resolution image processing networks. By accumulating gradients across several iterations, developers can achieve the benefits of larger batch training—including more stable updates and potentially faster convergence—without requiring the enormous memory footprint typically associated with such approaches. This technique has democratized the training of sophisticated AI models, making it possible for researchers and developers with limited GPU resources to work on cutting-edge deep learning projects that would otherwise be out of reach. For more information, see the following resources:

Efficiently fine-tune the ESM 2 protein language model with Amazon SageMaker
End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium
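The following is a minimal, generic PyTorch sketch of the accumulation pattern described above (the toy model, synthetic data, and an accumulation factor of 4 are illustrative assumptions): each micro-batch loss is divided by the number of accumulation steps so that the summed gradient matches what one large batch would produce, and the optimizer steps only once per effective batch.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 4  # four micro-batches of 8 behave like one batch of 32

optimizer.zero_grad()
for step in range(100):
    inputs = torch.randn(8, 512)
    targets = torch.randint(0, 10, (8,))

    loss = nn.functional.cross_entropy(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so the accumulated gradient matches the large batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one weight update per effective batch
        optimizer.zero_grad()
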

Conclusion
When fine-tuning ML models on AWS, you can choose the right tool for your specific needs. AWS provides a comprehensive suite of tools for data scientists, ML engineers, and business users to achieve their ML goals. AWS has built solutions to support various levels of ML sophistication, from simple SageMaker training jobs for FM fine-tuning to the power of SageMaker HyperPod for cutting-edge research.
We invite you to explore these options, starting with what suits your current needs, and evolve your approach as those needs change. Your journey with AWS is just beginning, and we’re here to support you every step of the way.

About the authors
Ilan Gleiser is a Principal GenAI Specialist at AWS on the WWSO Frameworks team, focusing on developing scalable generative AI architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 30 blog posts and delivered more than 100 machine learning and HPC prototypes globally over the last 5 years. Ilan holds a master’s degree in mathematical economics.
Prashanth Ramaswamy is a Senior Deep Learning Architect at the AWS Generative AI Innovation Center, where he specializes in model customization and optimization. In his role, he works on fine-tuning, benchmarking, and optimizing models by using generative AI as well as traditional AI/ML solutions. He focuses on collaborating with Amazon customers to identify promising use cases and accelerate the impact of AI solutions to achieve key business outcomes.
Deeksha Razdan is an Applied Scientist at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. Her work revolves around conducting research and developing generative AI solutions for various industries. She holds a master's in computer science from UMass Amherst. Outside of work, Deeksha enjoys being in nature.

Streamline machine learning workflows with SkyPilot on Amazon SageMake …

This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.
SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.
In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.
Challenges of orchestrating machine learning workloads
Kubernetes has become popular for ML workloads due to its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.
ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster management can pose significant challenges, potentially slowing down development cycles and resource utilization.
Furthermore, AI infrastructure teams face the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They require a solution that offers both high-level control and ease of use for day-to-day operations.
SageMaker HyperPod with SkyPilot
To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.
Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.
SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring in their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workloads on the best infrastructure: find the available GPUs, provision the GPU, run the job, and manage its lifecycle.
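For readers who prefer a programmatic view, the following is a minimal sketch using SkyPilot's Python API (sky.Task, sky.Resources, and sky.launch); the setup command, entry point, and cluster name are hypothetical placeholders, and the YAML files used later in this post express the same idea. Depending on your SkyPilot version, sky.launch may return a request handle for asynchronous execution rather than blocking until the job starts.

import sky

# Describe the job: what to install, what to run, and what hardware it needs.
task = sky.Task(
    name="example-job",
    setup="pip install -r requirements.txt",  # hypothetical setup step
    run="python train.py",                    # hypothetical entry point
)
task.set_resources(sky.Resources(accelerators="H100:1"))

# SkyPilot finds infrastructure that can satisfy the request (for example, the
# HyperPod EKS cluster enabled via `sky check k8s`), provisions it, and runs the job.
sky.launch(task, cluster_name="example")
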

Solution overview
Implementing this solution is straightforward, whether you’re working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).
The following diagram illustrates the solution architecture.

In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.
Prerequisites
You must have the following prerequisites:

An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You must provision a single ml.p5.48xlarge instance for the code samples in the following sections.
Access to the AWS CLI and kubectl command line tools.
A Python environment for installing SkyPilot.

Create a SageMaker HyperPod cluster
You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:

cat > cluster-config.json << EOL
{
    "ClusterName": "hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
  ….
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL

You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.
Connect to your SageMaker HyperPod EKS cluster
From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
You can verify that you are connected to the EKS cluster by running the following command:
kubectl config current-context
Install SkyPilot with Kubernetes support
Use the following code to install SkyPilot with Kubernetes support using pip:
pip install skypilot[kubernetes]
This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.
Verify SkyPilot’s connection to the EKS cluster
Check if SkyPilot can connect to your Kubernetes cluster:
sky check k8s
The output should look similar to the following code:

Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled [compute]

To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html

🎉 Enabled clouds 🎉
Kubernetes [compute]
Active context: arn:aws:eks:us-east-2:XXXXXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster

Using SkyPilot API server: http://127.0.0.1:46580

If this is your first time using SkyPilot with this Kubernetes cluster, you might see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:
python -m sky.utils.kubernetes.gpu_labeler --context <your-eks-context>
This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
Discover available GPUs in the cluster
To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:
sky show-gpus --cloud k8s
This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:

Kubernetes GPUs
GPU    REQUESTABLE_QTY_PER_NODE   TOTAL_GPUS   TOTAL_FREE_GPUS
H100   1, 2, 4, 8                 16           16

Kubernetes per node accelerator availability
NODE_NAME                      GPU_NAME   TOTAL_GPUS   FREE_GPUS
hyperpod-i-00baa178bc31afde3   H100       8            8
hyperpod-i-038beefa954efab84   H100       8            8

Launch an interactive development environment
With SkyPilot, you can launch a SkyPilot cluster for interactive development:
sky launch -c dev --gpus H100
This command creates an interactive development environment (IDE) with a single H100 GPU and will sync the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.

Considered resources (1 node):
------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                               COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--H100:1   2       8         H100:1         arn:aws:eks:us-east-2:XXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster   0.00       ✔
------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'dev'. Proceed? [Y/n]: Y
• Launching on Kubernetes.
Pod is up.
✔ Cluster launched: dev. View logs: sky api logs -l sky-2025-05-05-15-28-47-523797/provision.log
• Syncing files.
Run commands not specified or empty.
Useful Commands
Cluster name: dev
To log into the head VM:   ssh dev
To submit a job:           sky exec dev yaml_file
To stop the cluster:       sky stop dev
To teardown the cluster:   sky down dev

After it’s launched, you can connect to your IDE:
ssh dev
This gives you an interactive shell in your IDE, where you can run your code, install packages, and perform ML experiments.
Run training jobs
With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.
First, create a file named train.yaml with your training job configuration:

resources:
    accelerators: H100

num_nodes: 1

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv venv --python 3.10
    source .venv/bin/activate
    uv pip install -r requirements.txt "numpy<2" "torch"

run: |
    cd examples
    source .venv/bin/activate
    cd mingpt
    export LOGLEVEL=INFO

    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py

Then launch your training job:
sky launch -c train train.yaml
This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:
sky logs train
Running multi-node training jobs with EFA
Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This enables applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.
The following code snippet shows how to incorporate this into your SkyPilot job:

name: nccl-test-efa

resources:
  cloud: kubernetes
  accelerators: H100:8
  image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest

num_nodes: 2

envs:
  USE_EFA: "true"

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
    export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
    export NCCL_HOME=/opt/nccl
    export CUDA_HOME=/usr/local/cuda-12.2
    export NCCL_DEBUG=INFO
    export NCCL_BUFFSIZE=8388608
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

    if [ "${USE_EFA}" == "true" ]; then
      export FI_PROVIDER="efa"
    else
      export FI_PROVIDER=""
    fi

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x FI_PROVIDER \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_BUFFSIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_TUNER_PLUGIN \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
        - resources:
            limits:
              vpc.amazonaws.com/efa: 32
            requests:
              vpc.amazonaws.com/efa: 32

Clean up
To delete your SkyPilot cluster, run the following command:
sky down <cluster_name>
To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:
aws sagemaker delete-cluster --cluster-name <cluster_name>
Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:
aws cloudformation delete-stack --stack-name <stack_name>
Conclusion
By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot’s user-friendly interface, we’ve showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

About the authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.
Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency and low-latency trading and business development for Amazon Alexa.

Intelligent document processing at scale with generative AI and Amazon …

Extracting information from unstructured documents at scale is a recurring business task. Common use cases include creating product feature tables from descriptions, extracting metadata from documents, and analyzing legal contracts, customer reviews, news articles, and more. A classic approach to extracting information from text is named entity recognition (NER). NER identifies entities from predefined categories, such as persons and organizations. Although various AI services and solutions support NER, this approach is limited to text documents and only supports a fixed set of entities. Furthermore, classic NER models can’t handle other data types such as numeric scores (such as sentiment) or free-form text (such as summary). Generative AI unlocks these possibilities without costly data annotation or model training, enabling more comprehensive intelligent document processing (IDP).
AWS recently announced the general availability of Amazon Bedrock Data Automation, a feature of Amazon Bedrock that automates the generation of valuable insights from unstructured multimodal content such as documents, images, video, and audio. This service offers pre-built capabilities for IDP and information extraction through a unified API, alleviating the need for complex prompt engineering or fine-tuning, and making it an excellent choice for document processing workflows at scale. To learn more about Amazon Bedrock Data Automation, refer to Simplify multimodal generative AI with Amazon Bedrock Data Automation.
Amazon Bedrock Data Automation is the recommended approach for IDP use cases due to its simplicity, industry-leading accuracy, and managed service capabilities. It handles the complexity of document parsing, context management, and model selection automatically, so developers can focus on their business logic rather than IDP implementation details.
Although Amazon Bedrock Data Automation meets most IDP needs, some organizations require additional customization in their IDP pipelines. For example, companies might need to use self-hosted foundation models (FMs) for IDP due to regulatory requirements. Some customers have builder teams who might prefer to maintain full control over the IDP pipeline instead of using a managed service. Finally, organizations might operate in AWS Regions where Amazon Bedrock Data Automation is not available (available in us-west-2 and us-east-1 as of June 2025). In such cases, builders might use Amazon Bedrock FMs directly or perform optical character recognition (OCR) with Amazon Textract.
This post presents an end-to-end IDP application powered by Amazon Bedrock Data Automation and other AWS services. It provides a reusable AWS infrastructure as code (IaC) that deploys an IDP pipeline and provides an intuitive UI for transforming documents into structured tables at scale. The application only requires the user to provide the input documents (such as contracts or emails) and a list of attributes to be extracted. It then performs IDP with generative AI.
The application code and deployment instructions are available on GitHub under the MIT license.
Solution overview
The IDP solution presented in this post is deployed as IaC using the AWS Cloud Development Kit (AWS CDK). Amazon Bedrock Data Automation serves as the primary engine for information extraction. For cases requiring further customization, the solution also provides alternative processing paths using Amazon Bedrock FMs and Amazon Textract integration.
We use AWS Step Functions to orchestrate the IDP workflow and parallelize processing for multiple documents. As part of the workflow, we use AWS Lambda functions to call Amazon Bedrock Data Automation or Amazon Textract and Amazon Bedrock (depending on the selected parsing mode). Processed documents and extracted attributes are stored in Amazon Simple Storage Service (Amazon S3).
A Step Functions workflow with the business logic is invoked through an API call performed using an AWS SDK. We also build a containerized web application running on Amazon Elastic Container Service (Amazon ECS) that is available to end-users through Amazon CloudFront to simplify their interaction with the solution. We use Amazon Cognito for authentication and secure access to the APIs.
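To make the orchestration step concrete, the following is a minimal boto3 sketch of how a client application could upload a document through a pre-signed S3 URL and start the Step Functions state machine. The bucket name, object key, state machine ARN, and input field names are hypothetical placeholders; the deployed stack defines the real values and input schema.

import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Hypothetical bucket, key, and state machine ARN; the deployed stack exposes the real ones.
bucket, key = "idp-input-bucket", "uploads/contract.pdf"

# The web application PUTs the document to this URL instead of streaming it through the backend.
upload_url = s3.generate_presigned_url(
    "put_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)

# Start the IDP state machine with the document location and extraction settings as input.
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:idp-pipeline",
    input=json.dumps({
        "documents": [f"s3://{bucket}/{key}"],
        "parsing_mode": "bedrock_data_automation",   # hypothetical setting name
        "attributes": [
            {"name": "customer_name", "description": "Full name of the customer"},
        ],
    }),
)
print(execution["executionArn"])
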
The following diagram illustrates the architecture and workflow of the IDP solution.

The IDP workflow includes the following steps:

A user logs in to the web application using credentials managed by Amazon Cognito, selects input documents, and defines the fields to be extracted from them in the UI. Optionally, the user can specify the parsing mode, LLM to use, and other settings.
The user starts the IDP pipeline.
The application creates a pre-signed S3 URL for the documents and uploads them to Amazon S3.
The application triggers Step Functions to start the state machine with the S3 URIs and IDP settings as inputs. The Map state starts to process the documents concurrently.
Depending on the document type and the parsing mode, it branches to different Lambda functions that perform IDP, save results to Amazon S3, and send them back to the UI:

Amazon Bedrock Data Automation – Documents are directed to the “Run Data Automation” Lambda function. The Lambda function creates a blueprint with the user-defined fields schema and launches an asynchronous Amazon Bedrock Data Automation job. Amazon Bedrock Data Automation handles the complexity of document processing and attribute extraction using optimized prompts and models. When the job results are ready, they’re saved to Amazon S3 and sent back to the UI. This approach provides the best balance of accuracy, ease of use, and scalability for most IDP use cases.
Amazon Textract – If the user specifies Amazon Textract as a parsing mode, the IDP pipeline splits into two steps. First, the “Perform OCR” Lambda function is invoked to run an asynchronous document analysis job. The OCR outputs are processed using the amazon-textract-textractor library and formatted as Markdown. Second, the text is passed to the “Extract attributes” Lambda function (Step 6), which invokes an Amazon Bedrock FM given the text and the attributes schema. The outputs are saved to Amazon S3 and sent to the UI.
Handling office documents – Documents with suffixes like .doc, .ppt, and .xls are processed by the “Parse office” Lambda function, which uses LangChain document loaders to extract the text content. The outputs are passed to the “Extract attributes” Lambda function (Step 6) to proceed with the IDP pipeline.

If the user chooses an Amazon Bedrock FM for IDP, the document is sent to the “Extract attributes” Lambda function. It converts a document into a set of images, which are sent to a multimodal FM with the attributes schema as part of a custom prompt. It parses the LLM response to extract JSON outputs, saves them to Amazon S3, and sends them back to the UI. This flow supports .pdf, .png, and .jpg documents (a minimal sketch of this kind of multimodal extraction call follows the workflow steps).
The web application checks the state machine execution results periodically and returns the extracted attributes to the user when they are available.
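As a simplified illustration of the Amazon Bedrock FM path above (not the solution's actual Lambda code), the Amazon Bedrock Converse API accepts image and text content blocks in a single message; the model ID, file name, and attribute schema below are placeholder assumptions.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical attribute schema and one document page already rendered to PNG.
attributes = [{"name": "operating_profit_2019", "description": "Operating profit for 2019"}]
with open("page-1.png", "rb") as f:
    page_bytes = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-3-7-sonnet-20250219-v1:0",  # example ID; use a model enabled in your account
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": page_bytes}}},
            {"text": "Extract these attributes and reply with JSON only:\n" + json.dumps(attributes)},
        ],
    }],
    inferenceConfig={"temperature": 0.0},
)
print(response["output"]["message"]["content"][0]["text"])
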

Prerequisites
You can deploy the IDP solution from your local computer or from an Amazon SageMaker notebook instance. The deployment steps are detailed in the solution README file.
If you choose to deploy using a SageMaker notebook, which is recommended, you will need access to an AWS account with permissions to create and launch a SageMaker notebook instance.
Deploy the solution
To deploy the solution to your AWS account, complete the following steps:

Open the AWS Management Console and choose the Region in which you want to deploy the IDP solution.
Launch a SageMaker notebook instance. Provide the notebook instance name and notebook instance type, which you can set to ml.m5.large. Leave other options as default.
Navigate to the notebook instance and open the IAM role attached to the notebook. Open the role on the AWS Identity and Access Management (IAM) console.
Attach an inline policy to the role and insert the following policy JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:*",
                "s3:*",
                "iam:*",
                "sts:AssumeRole"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters"
            ],
            "Resource": "arn:aws:ssm:*:*:parameter/cdk-bootstrap/*"
        }
    ]
}

When the notebook instance status is marked as InService, choose Open JupyterLab.
In the JupyterLab environment, choose File, New, and Terminal.
Clone the solution repository by running the following commands:

cd SageMaker
git clone https://github.com/aws-samples/intelligent-document-processing-with-amazon-bedrock.git

Navigate to the repository folder and run the script to install requirements:

cd intelligent-document-processing-with-amazon-bedrock
sh install_deps.sh

Run the script to create a virtual environment and install dependencies:

sh install_env.sh
source .venv/bin/activate

Within the repository folder, copy the config-example.yml to a config.yml to specify your stack name. Optionally, configure the services and indicate the modules you want to deploy (for example, to disable deploying a UI, change deploy_streamlit to False). Make sure you add your user email to the Amazon Cognito users list.
Configure Amazon Bedrock model access by opening the Amazon Bedrock console in the Region specified in the config.yml file. In the navigation pane, choose Model Access and make sure to enable access for the model IDs specified in config.yml.
Bootstrap and deploy the AWS CDK in your account:

cdk bootstrap
cdk deploy

Note that this step may take some time, especially on the first deployment. Once deployment is complete, you should see the message as shown in the following screenshot. You can access the Streamlit frontend using the CloudFront distribution URL provided in the AWS CloudFormation outputs. The temporary login credentials will be sent to the email specified in config.yml during the deployment.

Using the solution
This section guides you through two examples to showcase the IDP capabilities.
Example 1: Analyzing financial documents
In this scenario, we extract key features from a multi-page financial statement using Amazon Bedrock Data Automation. We use a sample document in PDF format with a mixture of tables, images, and text, and extract several financial metrics. Complete the following steps:

Upload a document by attaching a file through the solution UI.

On the Describe Attributes tab, either manually list the names and descriptions of the attributes or upload these fields in JSON format. We want to find the following metrics:

Current cash in assets in 2018
Current cash in assets in 2019
Operating profit in 2018
Operating profit in 2019

Choose Extract attributes to start the IDP pipeline.

The provided attributes are integrated into a custom blueprint with the inferred attributes list, which is then used to invoke a data automation job on the uploaded documents.
After the IDP pipeline is complete, you will see a table of results in the UI. It includes an index for each document in the _doc column, a column for each of the attributes you defined, and a file_name column that contains the document name.

From the following statement excerpts, we can see that Amazon Bedrock Data Automation was able to correctly extract the values for current assets and operating profit.

The IDP solution is also able to do complex calculations beyond well-defined entities. Let’s say we want to calculate the following accounting metrics:

Liquidity ratios (Current assets/Current liabilities)
Working capitals (Current assets – Current liabilities)
Revenue increase ((Revenue year 2/Revenue year 1) – 1)

We define the attributes and their formulas as parts of the attributes’ schema. This time, we choose an Amazon Bedrock LLM as a parsing mode to demonstrate how the application can use a multimodal FM for IDP. When using an Amazon Bedrock LLM, starting the IDP pipeline will now combine the attributes and their description into a custom prompt template, which is sent to the LLM with the documents converted to images. As a user, you can specify the LLM powering the extraction and its inference parameters, such as temperature.

The output, including the full results, is shown in the following screenshot.

Example 2: Processing customer emails
In this scenario, we want to extract multiple features from a list of emails with customer complaints due to delays in product shipments using Amazon Bedrock Data Automation. For each email, we want to find the following:

Customer name
Shipment ID
Email language
Email sentiment
Shipment delay (in days)
Summary of issue
Suggested response

Complete the following steps:

Upload input emails as .txt files. You can download sample emails from GitHub.

On the Describe Attributes tab, list names and descriptions of the attributes.

You can add few-shot examples for some fields (such as delay) to explain to the LLM how these field values should be extracted. You can do this by adding an example input and the expected output for the attribute to the description.

Choose Extract attributes to start the IDP pipeline.

The provided attributes and their descriptions will be integrated into a custom blueprint with the inferred attributes list, which is then used to invoke a data automation job on the uploaded documents. When the IDP pipeline is complete, you will see the results.

The application allows downloading the extraction results as a CSV or a JSON file. This makes it straightforward to use the results for downstream tasks, such as aggregating customer sentiment scores.
Pricing
In this section, we calculate cost estimates for performing IDP on AWS with our solution.
Amazon Bedrock Data Automation provides a transparent pricing schema depending on the input document size (number of pages, images, or minutes). When using Amazon Bedrock FMs, pricing depends on the number of input and output tokens used as part of the information extraction call. Finally, when using Amazon Textract, OCR is performed and priced separately based on the number of pages in the documents.
Using the preceding scenarios as examples, we can approximate the costs depending on the selected parsing mode. In the following table, we show costs using two datasets: 100 20-page financial documents, and 100 1-page customer emails. We ignore costs of Amazon ECS and Lambda.

AWS service | Use case 1 (100 20-page financial documents) | Use case 2 (100 1-page customer emails)

IDP option 1: Amazon Bedrock Data Automation
Amazon Bedrock Data Automation (custom output) | $20.00 | $1.00

IDP option 2: Amazon Bedrock FM
Amazon Bedrock (FM invocation, Anthropic's Claude 4 Sonnet) | $1.79 | $0.09

IDP option 3: Amazon Textract and Amazon Bedrock FM
Amazon Textract (document analysis job with layout) | $30.00 | $1.50
Amazon Bedrock (FM invocation, Anthropic's Claude 3.7 Sonnet) | $1.25 | $0.06

Orchestration and storage (shared costs)
Amazon S3 | $0.02 | $0.02
Amazon CloudFront | $0.09 | $0.09
Amazon ECS | (ignored) | (ignored)
AWS Lambda | (ignored) | (ignored)

Total cost: Amazon Bedrock Data Automation | $20.11 | $1.11
Total cost: Amazon Bedrock FM | $1.90 | $0.20
Total cost: Amazon Textract and Amazon Bedrock FM | $31.36 | $1.67

The cost analysis suggests that using Amazon Bedrock FMs with a custom prompt template is a cost-effective method for IDP. However, this approach carries greater operational overhead, because the pipeline needs to be tuned for the chosen LLM and requires manual security and privacy management. Amazon Bedrock Data Automation offers a managed service that uses a choice of high-performing FMs through a single API.
Clean up
To remove the deployed resources, complete the following steps:

On the AWS CloudFormation console, delete the created stack. Alternatively, run the following command:

cdk destroy --region <YOUR_DEPLOY_REGION>

On the Amazon Cognito console, delete the user pool.

Conclusion
Extracting information from unstructured documents at scale is a recurring business task. This post discussed an end-to-end IDP application that performs information extraction using multiple AWS services. The solution is powered by Amazon Bedrock Data Automation, which provides a fully managed service for generating insights from documents, images, audio, and video. Amazon Bedrock Data Automation handles the complexity of document processing and information extraction, optimizing for both performance and accuracy without requiring expertise in prompt engineering. For extended flexibility and customizability in specific scenarios, our solution also supports IDP using Amazon Bedrock custom LLM calls and Amazon Textract for OCR.
The solution supports multiple document types, including text, images, PDF, and Microsoft Office documents. At the time of writing, accurate understanding of information in documents rich with images, tables, and other visual elements is only available for PDF and images. We recommend converting complex Office documents to PDFs or images for best performance. Another solution limitation is the document size. As of June 2025, Amazon Bedrock Data Automation supports documents up to 20 pages for custom attributes extraction. When using custom Amazon Bedrock LLMs for IDP, the 300,000-token context window of Amazon Nova LLMs allows processing documents with up to roughly 225,000 words. To extract information from larger documents, you would currently need to split the file into multiple documents.
In the next versions of the IDP solution, we plan to keep adding support for state-of-the-art language models available through Amazon Bedrock and iterate on prompt engineering to further improve the extraction accuracy. We also plan to implement techniques for extending the size of supported documents and providing users with a precise indication of where exactly in the document the extracted information is coming from.
To get started with IDP with the described solution, refer to the GitHub repository. To learn more about Amazon Bedrock, refer to the documentation.

About the authors
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he works on the frontier of AI research and business. With rich experience in Generative AI and diverse areas of ML, Nikita is enthusiastic about using AI to solve challenging real-world business problems across industries.
Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over eight years of specialised experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.
Liza (Elizaveta) Zinovyeva is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.
Nuno Castro is a Sr. Applied Science Manager at AWS Generative AI Innovation Center. He leads Generative AI customer engagements, helping hundreds of AWS customers find the most impactful use case from ideation, prototype through to production. He has 19 years experience in AI in industries such as finance, manufacturing, and travel, leading AI/ML teams for 12 years.
Ozioma Uzoegwu is a Principal Solutions Architect at Amazon Web Services. In his role, he helps financial services customers across EMEA to transform and modernize on the AWS Cloud, providing architectural guidance and industry best practices. Ozioma has many years of experience with web development, architecture, cloud and IT management. Prior to joining AWS, Ozioma worked with an AWS Advanced Consulting Partner as the Lead Architect for the AWS Practice. He is passionate about using latest technologies to build a modern financial services IT estate across banking, payment, insurance and capital markets.
Eren Tuncer is a Solutions Architect at Amazon Web Services focused on Serverless and building Generative AI applications. With more than fifteen years experience in software development and architecture, he helps customers across various industries achieve their business goals using cloud technologies with best practices. As a builder, he’s passionate about creating solutions with state-of-the-art technologies, sharing knowledge, and helping organizations navigate cloud adoption.
Francesco Cerizzi is a Solutions Architect at Amazon Web Services exploring tech frontiers while spreading generative AI knowledge and building applications. With a background as a full stack developer, he helps customers across different industries in their journey to the cloud, sharing insights on AI’s transformative potential along the way. He’s passionate about Serverless, event-driven architectures, and microservices in general. When not diving into technology, he’s a huge F1 fan and loves Tennis.

Mistral AI Releases Devstral 2507 for Code-Centric Language Modeling

Mistral AI, in collaboration with All Hands AI, has released updated versions of its developer-focused large language models under the Devstral 2507 label. The release includes two models—Devstral Small 1.1 and Devstral Medium 2507—designed to support agent-based code reasoning, program synthesis, and structured task execution across large software repositories. These models are optimized for performance and cost, making them applicable for real-world use in developer tools and code automation systems.

Devstral Small 1.1: Open Model for Local and Embedded Use

Devstral Small 1.1 (also called devstral-small-2507) is based on the Mistral-Small-3.1 foundation model and contains approximately 24 billion parameters. It supports a 128k token context window, which allows it to handle multi-file code inputs and long prompts typical in software engineering workflows.

The model is fine-tuned specifically for structured outputs, including XML and function-calling formats. This makes it compatible with agent frameworks such as OpenHands and suitable for tasks like program navigation, multi-step edits, and code search. It is licensed under Apache 2.0 and available for both research and commercial use.

Source: https://mistral.ai/news/devstral-2507

Performance: SWE-Bench Results

Devstral Small 1.1 achieves 53.6% on the SWE-Bench Verified benchmark, which evaluates the model’s ability to generate correct patches for real GitHub issues. This represents a noticeable improvement over the previous version (1.0) and places it ahead of other openly available models of comparable size. The results were obtained using the OpenHands scaffold, which provides a standard test environment for evaluating code agents.

While not at the level of the largest proprietary models, this version offers a balance between size, inference cost, and reasoning performance that is practical for many coding tasks.

Deployment: Local Inference and Quantization

The model is released in multiple formats. Quantized versions in GGUF are available for use with llama.cpp, vLLM, and LM Studio. These formats make it possible to run inference locally on high-memory GPUs (e.g., RTX 4090) or Apple Silicon machines with 32GB RAM or more. This is beneficial for developers or teams that prefer to operate without dependency on hosted APIs.

Mistral also makes the model available via their inference API. The current pricing is $0.10 per million input tokens and $0.30 per million output tokens, the same as other models in the Mistral-Small line.

Source: https://mistral.ai/news/devstral-2507
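As a brief illustration of the hosted option, the sketch below calls the model through Mistral's inference API, assuming the official mistralai Python SDK (v1 or later); the API key is a placeholder, and the model name devstral-small-2507 comes from the release notes above.

from mistralai import Mistral  # assumes: pip install mistralai (v1+ SDK)

client = Mistral(api_key="YOUR_API_KEY")  # placeholder key

response = client.chat.complete(
    model="devstral-small-2507",
    messages=[{
        "role": "user",
        "content": "Write a Python function that parses a unified diff and lists the files it touches.",
    }],
)
print(response.choices[0].message.content)
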

Devstral Medium 2507: Higher Accuracy, API-Only

Devstral Medium 2507 is not open-sourced and is only available through the Mistral API or through enterprise deployment agreements. It offers the same 128k token context length as the Small version but with higher performance.

The model scores 61.6% on SWE-Bench Verified, outperforming several commercial models, including Gemini 2.5 Pro and GPT-4.1, in the same evaluation framework. Its stronger reasoning capacity over long contexts makes it a candidate for code agents that operate across large monorepos or repositories with cross-file dependencies.

API pricing is set at $0.40 per million input tokens and $2 per million output tokens. Fine-tuning is available for enterprise users via the Mistral platform.

Comparison and Use Case Fit

Model              | SWE-Bench Verified | Open Source | Input Cost | Output Cost | Context Length
Devstral Small 1.1 | 53.6%              | Yes         | $0.10/M    | $0.30/M     | 128k tokens
Devstral Medium    | 61.6%              | No          | $0.40/M    | $2.00/M     | 128k tokens

Devstral Small is more suitable for local development, experimentation, or integrating into client-side developer tools where control and efficiency are important. In contrast, Devstral Medium provides stronger accuracy and consistency in structured code-editing tasks and is intended for production services that benefit from higher performance despite increased cost.

Integration with Tooling and Agents

Both models are designed to support integration with code agent frameworks such as OpenHands. The support for structured function calls and XML output formats allows them to be integrated into automated workflows for test generation, refactoring, and bug fixing. This compatibility makes it easier to connect Devstral models to IDE plugins, version control bots, and internal CI/CD pipelines.

For example, developers can use Devstral Small for prototyping local workflows, while Devstral Medium can be used in production services that apply patches or triage pull requests based on model suggestions.

Conclusion

The Devstral 2507 release reflects a targeted update to Mistral’s code-oriented LLM stack, offering users a clearer tradeoff between inference cost and task accuracy. Devstral Small provides an accessible, open model with sufficient performance for many use cases, while Devstral Medium caters to applications where correctness and reliability are critical.

The availability of both models under different deployment options makes them relevant across various stages of the software engineering workflow—from experimental agent development to deployment in commercial environments.

Check out the technical details and the Devstral Small model weights on Hugging Face. Devstral Medium is available through the Mistral API and will also be available on Mistral Code for enterprise customers and through the fine-tuning API. All credit for this research goes to the researchers of this project.
The post Mistral AI Releases Devstral 2507 for Code-Centric Language Modeling appeared first on MarkTechPost.

Google AI Releases Vertex AI Memory Bank: Enabling Persistent Agent Conversations

Developers are actively working to bring AI agents to market, but a significant hurdle has been the lack of memory. Without the ability to recall past interactions, agents treat each conversation as if it’s the first, leading to repetitive questions, an inability to remember user preferences, and a general lack of personalization. This results in frustration for both users and developers.

Historically, developers have attempted to mitigate this by inserting entire session dialogues directly into an LLM’s context window. However, this approach is expensive and computationally inefficient, leading to higher inference costs and slower response times. Furthermore, feeding too much information, especially irrelevant details, can degrade the model’s output quality, causing issues like “lost in the middle” and “context rot”.

Introducing Vertex AI Memory Bank

To overcome these limitations, Google Cloud has announced the public preview of Memory Bank, a new managed service within the Vertex AI Agent Engine. Memory Bank is designed to help you build highly personalized conversational agents that facilitate more natural, contextual, and continuous engagements.

For instance, consider a personalized healthcare agent: key information about a user's allergy and previous symptoms mentioned in past sessions is needed to provide a more informed response in the current session.

Memory Bank addresses the fundamental memory problem in several key ways:

Personalize interactions: It goes beyond generic scripts by remembering user preferences, key events, and past choices to tailor every response.

Maintain continuity: Conversations can pick up seamlessly where they left off, even across multiple sessions that might span days or weeks.

Provide better context: Agents are armed with the necessary background on a user, leading to more relevant, insightful, and helpful responses.

Improve user experience: It eliminates the frustration of users repeating information, creating more natural, efficient, and engaging conversations.

How Memory Bank Works

Memory Bank operates through an intelligent, multi-stage process, leveraging Google’s Gemini models and novel research:

Understands and Extracts Memories: Memory Bank analyzes a user’s conversation history (stored in Agent Engine Sessions) to extract key facts, preferences, and context. This process happens asynchronously in the background, generating new memories without requiring developers to build complex extraction pipelines.

Stores and Updates Memories Intelligently: Key information, such as “I prefer sunny days” is stored and organized by a defined scope, like a user ID. When new information emerges, Memory Bank, using Gemini, can consolidate it with existing memories, resolving contradictions and ensuring the memories remain up to date.

Recalls Relevant Information: When a new conversation session begins, the agent can retrieve these stored memories. This retrieval can be a simple recall of all facts or a more advanced similarity search using embeddings to find the memories most relevant to the current topic (a generic sketch of this retrieval step follows below). This ensures the agent is always equipped with the right context.

This entire process is grounded in a novel method from Google Research, accepted at ACL 2025, which provides an intelligent, topic-based approach to how agents learn and recall information, setting a new standard for agent memory performance. An example is how a personal beauty companion agent can remember a user's evolving skin type to make personalized product recommendations.
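To illustrate the similarity-based recall step in isolation (a generic sketch, not the Vertex AI Memory Bank API), the example below embeds stored memories and a new query, scores them by cosine similarity, and keeps the top matches. The embed function is a random placeholder for a real embedding model, so the ranking here is arbitrary; with a real model, the allergy memory would surface for the medication question.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call an embedding model here. This fake
    # version just returns a pseudo-random unit vector per string.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

memories = [
    "User is allergic to penicillin.",
    "User prefers morning appointments.",
    "User reported mild headaches last month.",
]
memory_vectors = np.stack([embed(m) for m in memories])

query = "Which medication should we avoid for this patient?"
scores = memory_vectors @ embed(query)   # cosine similarity, since all vectors are unit-length
top_k = np.argsort(scores)[::-1][:2]     # recall the two highest-scoring memories
print([memories[i] for i in top_k])
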

Getting Started with Memory Bank

Memory Bank is integrated with the Agent Development Kit (ADK) and Agent Engine Sessions. Developers can define an agent using ADK and enable Agent Engine Sessions to manage conversation history within individual sessions. Memory Bank can then be enabled to provide long-term memory across multiple sessions.

You can integrate Memory Bank into your agent in two primary ways:

Develop an agent with Google Agent Development Kit (ADK) for an out-of-the-box experience.

Develop an agent that orchestrates API calls to Memory Bank if you are building your agent with any other framework, including popular ones like LangGraph and CrewAI.

For those new to Google Cloud but using ADK, an express mode registration for Agent Engine Sessions and Memory Bank allows you to sign up with a Gmail account to receive an API key and build within free tier usage quotas before seamlessly upgrading to a full Google Cloud project for production.
The post Google AI Releases Vertex AI Memory Bank: Enabling Persistent Agent Conversations appeared first on MarkTechPost.