This AI Paper Introduces PEVA: A Whole-Body Conditioned Diffusion Model for Predicting Egocentric Video from Human Motion

Understanding the Link Between Body Movement and Visual Perception

The study of human visual perception through egocentric views is crucial in developing intelligent systems capable of understanding and interacting with their environment. This area emphasizes how movements of the human body—ranging from locomotion to arm manipulation—shape what is seen from a first-person perspective. Understanding this relationship is essential for enabling machines and robots to plan and act with a human-like sense of visual anticipation, particularly in real-world scenarios where visibility is dynamically influenced by physical motion.

Challenges in Modeling Physically Grounded Perception

A major hurdle in this domain arises from the challenge of teaching systems how body actions affect perception. Actions such as turning or bending change what is visible in subtle and often delayed ways. Capturing this requires more than simply predicting what comes next in a video—it involves linking physical movements to the resulting changes in visual input. Without the ability to interpret and simulate these changes, embodied agents struggle to plan or interact effectively in dynamic environments.

Limitations of Prior Models and the Need for Physical Grounding

Until now, tools designed to predict video from human actions have been limited in scope. Models have often relied on low-dimensional inputs, such as velocity or head direction, and ignored the complexity of whole-body motion. These simplified approaches miss the fine-grained control and coordination required to simulate human actions accurately. Even in video generation models, body motion has usually been treated as the output rather than the driver of prediction. This lack of physical grounding has restricted the usefulness of these models for real-world planning.

Introducing PEVA: Predicting Egocentric Video from Action

Researchers from UC Berkeley, Meta’s FAIR, and New York University introduced a new framework called PEVA to overcome these limitations. The model predicts future egocentric video frames based on structured full-body motion data, derived from 3D body pose trajectories. PEVA aims to demonstrate how entire-body movements influence what a person sees, thereby grounding the connection between action and perception. The researchers employed a conditional diffusion transformer to learn this mapping and trained it using Nymeria, a large dataset comprising real-world egocentric videos synchronized with full-body motion capture.

Structured Action Representation and Model Architecture

The foundation of PEVA lies in its ability to represent actions in a highly structured manner. Each action input is a 48-dimensional vector that includes the root translation and joint-level rotations across 15 upper body joints in 3D space. This vector is normalized and transformed into a local coordinate frame centered at the pelvis to remove any positional bias. By utilizing this comprehensive representation of body dynamics, the model captures the continuous and nuanced nature of real motion. PEVA is designed as an autoregressive diffusion model that uses a video encoder to convert frames into latent state representations and predicts subsequent frames based on prior states and body actions. To support long-term video generation, the system introduces random time-skips during training, allowing it to learn from both immediate and delayed visual consequences of motion.

Performance Evaluation and Results

In terms of performance, PEVA was evaluated on several metrics that test both short-term and long-term video prediction capabilities. The model was able to generate visually consistent and semantically accurate video frames over extended periods of time. For short-term predictions, evaluated at 2-second intervals, it achieved lower LPIPS scores and higher DreamSim consistency compared to baselines, indicating superior perceptual quality. The system also decomposed human movement into atomic actions such as arm movements and body rotations to assess fine-grained control. Furthermore, the model was tested on extended rollouts of up to 16 seconds, successfully simulating delayed outcomes while maintaining sequence coherence. These experiments confirmed that incorporating full-body control led to substantial improvements in video realism and controllability.

Conclusion: Toward Physically Grounded Embodied Intelligence

This research highlights a significant advancement in predicting future egocentric video by grounding the model in physical human movement. The problem of linking whole-body action to visual outcomes is addressed with a technically robust method that uses structured pose representations and diffusion-based learning. The solution introduced by the team offers a promising direction for embodied AI systems that require accurate, physically grounded foresight.


Advanced fine-tuning methods on Amazon SageMaker AI

This post provides the theoretical foundation and practical insights needed to navigate the complexities of LLM development on Amazon SageMaker AI, helping organizations make optimal choices for their specific use cases, resource constraints, and business objectives.
We also address the three fundamental aspects of LLM development: the core lifecycle stages, the spectrum of fine-tuning methodologies, and the critical alignment techniques that provide responsible AI deployment. We explore how Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA have democratized model adaptation, so organizations of all sizes can customize large models to their specific needs. Additionally, we examine alignment approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which help make sure these powerful systems behave in accordance with human values and organizational requirements. Finally, we focus on knowledge distillation, which enables efficient model training through a teacher/student approach, where a smaller model learns from a larger one, while mixed precision training and gradient accumulation techniques optimize memory usage and batch processing, making it possible to train large AI models with limited computational resources.
Throughout the post, we focus on practical implementation while addressing the critical considerations of cost, performance, and operational efficiency. We begin with pre-training, the foundational phase where models gain their broad language understanding. Then we examine continued pre-training, a method to adapt models to specific domains or tasks. Finally, we discuss fine-tuning, the process that hones these models for particular applications. Each stage plays a vital role in shaping large language models (LLMs) into the sophisticated tools we use today, and understanding these processes is key to grasping the full potential and limitations of modern AI language models.
If you’re just getting started with large language models or looking to get more out of your current LLM projects, we’ll walk you through everything you need to know about fine-tuning methods on Amazon SageMaker AI.
Pre-training
Pre-training represents the foundation of LLM development. During this phase, models learn general language understanding and generation capabilities through exposure to massive amounts of text data. This process typically involves training from scratch on diverse datasets, often consisting of hundreds of billions of tokens drawn from books, articles, code repositories, webpages, and other public sources.
Pre-training teaches the model broad linguistic and semantic patterns, such as grammar, context, world knowledge, reasoning, and token prediction, using self-supervised learning techniques like masked language modeling (for example, BERT) or causal language modeling (for example, GPT). At this stage, the model is not tailored to any specific downstream task but rather builds a general-purpose language representation that can be adapted later using fine-tuning or PEFT methods.
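To make the self-supervised objective concrete, here is a minimal PyTorch sketch of the causal language modeling loss (next-token prediction) mentioned above; the model name and text are placeholders, not something the post prescribes.

# Minimal sketch of the causal language modeling objective used in pre-training.
# Assumes Hugging Face transformers and PyTorch are installed; the model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; real pre-training uses far larger models and corpora
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = ["Pre-training teaches the model broad linguistic and semantic patterns."]
batch = tokenizer(text, return_tensors="pt")

# For causal LM, passing the input IDs as labels makes the model compute the
# shifted next-token prediction loss internally.
outputs = model(**batch, labels=batch["input_ids"])
print(f"next-token prediction loss: {outputs.loss.item():.4f}")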
Pre-training is highly resource-intensive, requiring substantial compute (often across thousands of GPUs or AWS Trainium chips), large-scale distributed training frameworks, and careful data curation to balance performance with bias, safety, and accuracy concerns.
Continued pre-training (also known as domain-adaptive pre-training or intermediate pre-training) is the process of taking a pre-trained language model and further training it on domain-specific or task-relevant corpora before fine-tuning. Unlike full pre-training from scratch, this approach builds on the existing capabilities of a general-purpose model, allowing it to internalize new patterns, vocabulary, or context relevant to a specific domain.
This step is particularly useful when models must handle specialized terminology or unique syntax, as in fields like law, medicine, or finance. This approach is also essential when organizations need to align AI outputs with their internal documentation standards and proprietary knowledge bases. Additionally, it serves as an effective solution for addressing gaps in language or cultural representation by allowing focused training on underrepresented dialects, languages, or regional content.
To learn more, refer to the following resources:

Pre-training genomic language models using AWS HealthOmics and Amazon SageMaker
Customize models in Amazon Bedrock with your own data using fine-tuning and continued pre-training

Alignment methods for LLMs
The alignment of LLMs represents a crucial step in making sure these powerful systems behave in accordance with human values and preferences. AWS provides comprehensive support for implementing various alignment techniques, each offering distinct approaches to achieving this goal. The following are the key approaches.
Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is one of the most established approaches to model alignment. This method transforms human preferences into a learned reward signal that guides model behavior. The RLHF process consists of three distinct phases. First, we collect comparison data, where human annotators choose between different model outputs for the same prompt. This data forms the foundation for training a reward model, which learns to predict human preferences. Finally, we fine-tune the language model using Proximal Policy Optimization (PPO), optimizing it to maximize the predicted reward.
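As an illustration of the reward-modeling phase, the following sketch implements the standard pairwise preference loss, which pushes the reward of the chosen response above that of the rejected one; the reward scores are random stand-ins rather than outputs of a real reward model.

# Sketch of the pairwise loss used to train a reward model from human comparisons.
# r_chosen and r_rejected would normally come from a reward head over (prompt, response) pairs;
# here they are random stand-ins to keep the example self-contained.
import torch
import torch.nn.functional as F

r_chosen = torch.randn(8, requires_grad=True)    # reward scores for preferred responses
r_rejected = torch.randn(8, requires_grad=True)  # reward scores for rejected responses

# Bradley-Terry style objective: maximize the margin between chosen and rejected rewards.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(f"reward model loss: {loss.item():.4f}")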
Constitutional AI represents an innovative approach to alignment that reduces dependence on human feedback by enabling models to critique and improve their own outputs. This method involves training models to internalize specific principles or rules, then using these principles to guide generation and self-improvement. The reinforcement learning phase is similar to RLHF, except that pairs of responses are generated and evaluated by an AI model, as opposed to a human.
To learn more, refer to the following resources:

Fine-tune large language models with reinforcement learning from human or AI feedback
Improving your LLMs with RLHF on Amazon SageMaker
High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

Direct Preference Optimization
Direct Preference Optimization (DPO) is an alternative to RLHF, offering a more straightforward path to model alignment. DPO alleviates the need for explicit reward modeling and complex RL training loops, instead directly optimizing the model’s policy to align with human preferences through a modified supervised learning approach.
The key innovation of DPO lies in its formulation of preference learning as a classification problem. Given pairs of responses where one is preferred over the other, DPO trains the model to assign higher probability to preferred responses. This approach maintains theoretical connections to RLHF while significantly simplifying the implementation process. When implementing alignment methods, the effectiveness of DPO heavily depends on the quality, volume, and diversity of the preference dataset. Organizations must establish robust processes for collecting and validating human feedback while mitigating potential biases in label preferences.
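The following sketch shows the DPO objective itself: log-probability ratios of the policy against a frozen reference model for preferred and rejected responses. The log-probability tensors are stand-ins; in practice they are summed token log-likelihoods from the two models.

# Sketch of the DPO loss. policy_logp_* and ref_logp_* stand in for log p(y|x) under the
# trainable policy and the frozen reference model, respectively.
import torch
import torch.nn.functional as F

beta = 0.1  # strength of the implicit regularization toward the reference model

policy_logp_w, policy_logp_l = torch.randn(8), torch.randn(8)  # preferred (y_w) and rejected (y_l) under the policy
ref_logp_w, ref_logp_l = torch.randn(8), torch.randn(8)        # same responses under the reference model

logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
dpo_loss = -F.logsigmoid(logits).mean()
print(f"DPO loss: {dpo_loss.item():.4f}")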
For more information about DPO, see Align Meta Llama 3 to human preferences with DPO, Amazon SageMaker Studio, and Amazon SageMaker Ground Truth.
Fine-tuning methods on AWS
Fine-tuning transforms a pre-trained model into one that excels at specific tasks or domains. This phase involves training the model on carefully curated datasets that represent the target use case. Fine-tuning can range from updating all model parameters to more efficient approaches that modify only a small subset of parameters. Amazon SageMaker HyperPod offers fine-tuning capabilities for supported foundation models (FMs), and Amazon SageMaker Model Training offers flexibility for custom fine-tuning implementations along with training the models at scale without the need to manage infrastructure.
At its core, fine-tuning is a transfer learning process where a model’s existing knowledge is refined and redirected toward specific tasks or domains. This process involves carefully balancing the preservation of the model’s general capabilities while incorporating new, specialized knowledge.
Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) involves updating model parameters using a curated dataset of input-output pairs that reflect the desired behavior. SFT enables precise behavioral control and is particularly effective when the model needs to follow specific instructions, maintain tone, or deliver consistent output formats, making it ideal for applications requiring high reliability and compliance. In regulated industries like healthcare or finance, SFT is often used after continued pre-training, which exposes the model to large volumes of domain-specific text to build contextual understanding. Although continued pre-training helps the model internalize specialized language (such as clinical or legal terms), SFT teaches it how to perform specific tasks such as generating discharge summaries, filling documentation templates, or complying with institutional guidelines. Both steps are typically essential: continued pre-training makes sure the model understands the domain, and SFT makes sure it behaves as required.
However, because it updates the full model, SFT requires more compute resources and careful dataset construction. The dataset preparation process requires careful curation and validation to make sure the model learns the intended patterns and avoids undesirable biases.
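A common implementation detail in SFT is masking the prompt tokens so that only the response contributes to the loss. The sketch below shows this with a Hugging Face causal LM; the model name and the prompt/response pair are illustrative, not taken from the post.

# SFT sketch: train only on response tokens by setting prompt labels to -100,
# which the cross-entropy loss ignores. Model name and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize the discharge note:\n"
response = "The patient was discharged in stable condition."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt portion in the loss

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()  # one optimizer step would follow in a real training loop
print(f"SFT loss on response tokens: {loss.item():.4f}")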
For more details about SFT, refer to the following resources:

Supervised fine-tuning on SageMaker training jobs
SageMaker HyperPod recipes

Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) represents a significant advancement in model adaptation, helping organizations customize large models while dramatically reducing computational requirements and costs. The following table summarizes the different types of PEFT.

| PEFT Type | AWS Service | How It Works | Benefits |
| --- | --- | --- | --- |
| LoRA (Low-Rank Adaptation) | SageMaker Training (custom implementation) | Instead of updating all model parameters, LoRA injects trainable rank decomposition matrices into transformer layers, reducing trainable parameters | Memory efficient, cost-efficient, opens up possibility of adapting larger models |
| QLoRA (Quantized LoRA) | SageMaker Training (custom implementation) | Combines model quantization with LoRA, loading the base model in 4-bit precision while adapting it with trainable LoRA parameters | Further reduces memory requirements compared to standard LoRA |
| Prompt Tuning (additive) | SageMaker Training (custom implementation) | Prepends a small set of learnable prompt tokens to the input embeddings; only these tokens are trained | Lightweight and fast tuning, good for task-specific adaptation with minimal resources |
| P-Tuning (additive) | SageMaker Training (custom implementation) | Uses a deep prompt (tunable embedding vector passed through an MLP) instead of discrete tokens, enhancing expressiveness of prompts | More expressive than prompt tuning, effective in low-resource settings |
| Prefix Tuning (additive) | SageMaker Training (custom implementation) | Prepends trainable continuous vectors (prefixes) to the attention keys and values in every transformer layer, leaving the base model frozen | Effective for long-context tasks, avoids full model fine-tuning, and reduces compute needs |

The selection of a PEFT method significantly impacts the success of model adaptation. Each technique presents distinct advantages that make it particularly suitable for specific scenarios. In the following sections, we provide a comprehensive analysis of when to employ different PEFT approaches.
Low-Rank Adaptation
Low-Rank Adaptation (LoRA) excels in scenarios requiring substantial task-specific adaptation while maintaining reasonable computational efficiency. It’s particularly effective in the following use cases:

Domain adaptation for enterprise applications – When adapting models to specialized industry vocabularies and conventions, such as legal, medical, or financial domains, LoRA provides sufficient capacity for learning domain-specific patterns while keeping training costs manageable. For instance, a healthcare provider might use LoRA to adapt a base model to medical terminology and clinical documentation standards.
Multi-language adaptation – Organizations extending their models to new languages find LoRA particularly effective. It allows the model to learn language-specific nuances while preserving the base model’s general knowledge. For example, a global ecommerce platform might employ LoRA to adapt their customer service model to different regional languages and cultural contexts.
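To make this concrete, here is a minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model name and target modules are placeholders and depend on your architecture rather than anything prescribed by this post.

# LoRA sketch with Hugging Face peft: only the low-rank adapter matrices are trainable.
# Model name and target_modules are illustrative and architecture-specific.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    r=16,                       # rank of the decomposition matrices
    lora_alpha=32,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layers to adapt (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters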

To learn more, refer to the following resources:

Accelerating Mixtral MOE fine-tuning on Amazon SageMaker with QLoRA
Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium
PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium
Efficient and cost-effective multi-tenant LoRA serving with Amazon SageMaker

Prompt tuning
Prompt tuning is ideal in scenarios requiring lightweight, switchable task adaptations. With prompt tuning, you can store multiple prompt vectors for different tasks without modifying the model itself. A primary use case could be when different customers require slightly different versions of the same basic functionality: prompt tuning allows efficient switching between customer-specific behaviors without loading multiple model versions. It’s useful in the following scenarios:

Personalized customer interactions – Companies offering a software as a service (SaaS) platform with customer support or virtual assistants can use prompt tuning to personalize response behavior for different clients without retraining the model. Each client’s brand tone or service nuance can be encoded in prompt vectors.
Task switching in multi-tenant systems – In systems where multiple natural language processing (NLP) tasks (for example, summarization, sentiment analysis, classification) need to be served from a single model, prompt tuning enables rapid task switching with minimal overhead.
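As an illustration, the following sketch configures prompt tuning with the Hugging Face peft library, so that only a small set of virtual prompt tokens is trained per task or tenant; the model name and token count are placeholders.

# Prompt tuning sketch: a small set of virtual prompt tokens is learned while the
# base model stays frozen. Names and sizes are illustrative.
from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # only these 20 virtual token embeddings are trained
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Each task or tenant can keep its own tiny prompt checkpoint and swap it in
# at inference time without reloading the base model.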

For more information, see Prompt tuning for causal language modeling.
P-tuning
P-tuning extends prompt tuning by representing prompts as continuous embeddings passed through a small trainable neural network (typically an MLP). Unlike prompt tuning, which directly learns token embeddings, P-tuning enables more expressive and non-linear prompt representations, making it suitable for complex tasks and smaller models. It’s useful in the following use cases:

Low-resource domain generalization – A common use case includes low-resource settings where labeled data is limited, yet the task requires nuanced prompt conditioning to steer model behavior. For example, organizations operating in low-data regimes (such as niche scientific research or regional dialect processing) can use P-tuning to extract better task-specific performance without the need for large fine-tuning datasets.
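For comparison with plain prompt tuning, a minimal peft configuration for P-tuning looks like the following; here the virtual prompts come from a small trainable prompt encoder, and all names and sizes are illustrative.

# P-tuning sketch: virtual prompts are produced by a small trainable prompt encoder,
# giving more expressive prompts than directly learned embeddings.
from peft import PromptEncoderConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder

config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    encoder_hidden_size=128,  # hidden size of the prompt encoder network
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()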

To learn more, see P-tuning.
Prefix tuning
Prefix tuning prepends trainable continuous vectors, also called prefixes, to the key-value pairs in each attention layer of a transformer, while keeping the base model frozen. This provides control over the model’s behavior without altering its internal weights. Prefix tuning excels in tasks that benefit from conditioning across long contexts, such as document-level summarization or dialogue modeling. It provides a powerful compromise between performance and efficiency, especially when serving multiple tasks or clients from a single frozen base model. Consider the following use case:

Dialogue systems – Companies building dialogue systems with varied tones (for example, friendly vs. formal) can use prefix tuning to control the persona and coherence across multi-turn interactions without altering the base model.
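A minimal prefix tuning setup with the peft library is sketched below; the prefix length and model name are illustrative assumptions, not values from the post.

# Prefix tuning sketch: trainable prefix vectors are added to the keys and values of
# every attention layer; the base model weights remain frozen.
from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder

config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,  # length of the prefix prepended in each attention layer
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()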

For more details, see Prefix tuning for conditional generation.
LLM optimization
LLM optimization is a critical aspect of the model development lifecycle, enabling more efficient training, reduced computational costs, and improved deployment flexibility. AWS provides a comprehensive suite of tools and techniques for implementing these optimizations effectively.
Quantization
Quantization is a process of mapping a large set of input values to a smaller set of output values. In digital signal processing and computing, it involves converting continuous values to discrete values and reducing the precision of numbers (for example, from 32-bit to 8-bit). In machine learning (ML), quantization is particularly important for deploying models on resource-constrained devices, because it can significantly reduce model size while maintaining acceptable performance. One of the most used techniques is Quantized Low-Rank Adaptation (QLoRA).
QLoRA is an efficient fine-tuning technique for LLMs that combines quantization and LoRA approaches. It uses 4-bit quantization to reduce memory usage, keeping the base model weights in 4-bit precision during training, and employs double quantization for further memory reduction. The technique integrates LoRA by adding trainable rank decomposition matrices and keeping adapter parameters in 16-bit precision, enabling PEFT. QLoRA offers significant benefits, including up to 75% reduced memory usage, the ability to fine-tune large models on consumer GPUs, performance comparable to full fine-tuning, and cost-effective training of LLMs. This has made it particularly popular in the open-source AI community because it makes working with LLMs more accessible to developers with limited computational resources.
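A minimal QLoRA-style setup with transformers, bitsandbytes, and peft is sketched below; the model name, target modules, and ranks are illustrative assumptions, and running it requires a GPU plus access to the chosen model.

# QLoRA sketch: load the base model in 4-bit (NF4) with double quantization, then attach
# LoRA adapters kept in higher precision. Model name and modules are illustrative;
# requires a GPU, bitsandbytes, and accelerate.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,          # second quantization pass for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; requires model access
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()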
To learn more, refer to the following resources:

Interactively fine-tune Falcon-40B and other LLMs on Amazon SageMaker Studio notebooks using QLoRA
Fine-tune Llama 2 using QLoRA and deploy it on Amazon SageMaker with AWS Inferentia2

Knowledge distillation
Knowledge distillation is a groundbreaking model compression technique in the world of AI, where a smaller student model learns to emulate the sophisticated behavior of a larger teacher model. This innovative approach has revolutionized the way we deploy AI solutions in real-world applications, particularly where computational resources are limited. By learning not only from ground truth labels but also from the teacher model’s probability distributions, the student model can achieve remarkable performance while maintaining a significantly smaller footprint. This makes it invaluable for various practical applications, from powering AI features on mobile devices to enabling edge computing solutions and Internet of Things (IoT) implementations. The key feature of distillation lies in its ability to democratize AI deployment—making sophisticated AI capabilities accessible across different platforms without compromising too much on performance. With knowledge distillation, you can run real-time speech recognition on smartphones, implement computer vision systems in resource-constrained environments, optimize NLP tasks for faster inference, and more.
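The core of the teacher/student approach is a loss that blends the ground-truth labels with the teacher’s softened probability distribution. The sketch below uses random logits as stand-ins for real teacher and student outputs.

# Knowledge distillation sketch: the student matches the teacher's softened distribution
# (KL divergence at temperature T) in addition to the hard labels.
import torch
import torch.nn.functional as F

batch, num_classes, T, alpha = 4, 10, 2.0, 0.5

teacher_logits = torch.randn(batch, num_classes)                      # stand-in teacher outputs
student_logits = torch.randn(batch, num_classes, requires_grad=True)  # stand-in student outputs
labels = torch.randint(0, num_classes, (batch,))

soft_targets = F.log_softmax(teacher_logits / T, dim=-1)
soft_preds = F.log_softmax(student_logits / T, dim=-1)

distill_loss = F.kl_div(soft_preds, soft_targets, log_target=True, reduction="batchmean") * T * T
hard_loss = F.cross_entropy(student_logits, labels)

loss = alpha * distill_loss + (1 - alpha) * hard_loss
loss.backward()
print(f"combined distillation loss: {loss.item():.4f}")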
For more information about knowledge distillation, refer to the following resources:

A guide to Amazon Bedrock Model Distillation (preview)
Use Llama 3.1 405B for synthetic data generation and distillation to fine-tune smaller models

Mixed precision training
Mixed precision training is a cutting-edge optimization technique in deep learning that balances computational efficiency with model accuracy. By intelligently combining different numerical precisions—primarily 32-bit (FP32) and 16-bit (FP16) floating-point formats—this approach revolutionizes how we train complex AI models. Its key feature is selective precision usage: maintaining critical operations in FP32 for stability while using FP16 for less sensitive calculations, resulting in a balance of performance and accuracy. This technique has become a game changer in the AI industry, enabling up to three times faster training speeds, a significantly reduced memory footprint, and lower power consumption. It’s particularly valuable for training resource-intensive models like LLMs and complex computer vision systems. For organizations using cloud computing and GPU-accelerated workloads, mixed precision training offers a practical solution to optimize hardware utilization while maintaining model quality. This approach has effectively democratized the training of large-scale AI models, making it more accessible and cost-effective for businesses and researchers alike.
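The pattern described here maps directly onto PyTorch automatic mixed precision: run the forward and backward passes under autocast while a gradient scaler guards against FP16 underflow. The toy model below is a stand-in for a real network.

# Mixed precision sketch with PyTorch AMP. The Linear layer stands in for a large model.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(3):  # a few illustrative steps
    x = torch.randn(32, 1024, device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if device == "cuda" else torch.bfloat16):
        loss = model(x).pow(2).mean()   # forward pass in reduced precision
    scaler.scale(loss).backward()       # scaled backward pass to avoid FP16 underflow
    scaler.step(optimizer)              # unscales gradients, then steps
    scaler.update()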
To learn more, refer to the following resources:

Mixed precision training with FP8 on P5 instances using Transformer Engine
Mixed precision training with half-precision data types using PyTorch FSDP
Efficiently train models with large sequence lengths using Amazon SageMaker model parallel

Gradient accumulation
Gradient accumulation is a powerful technique in deep learning that addresses the challenges of training large models with limited computational resources. Developers can simulate larger batch sizes by accumulating gradients over multiple smaller forward and backward passes before performing a weight update. Think of it as breaking down a large batch into smaller, more manageable mini batches while maintaining the effective training dynamics of the larger batch size. This method has become particularly valuable in scenarios where memory constraints would typically prevent training with optimal batch sizes, such as when working with LLMs or high-resolution image processing networks. By accumulating gradients across several iterations, developers can achieve the benefits of larger batch training—including more stable updates and potentially faster convergence—without requiring the enormous memory footprint typically associated with such approaches. This technique has democratized the training of sophisticated AI models, making it possible for researchers and developers with limited GPU resources to work on cutting-edge deep learning projects that would otherwise be out of reach. For more information, see the following resources:

Efficiently fine-tune the ESM 2 protein language model with Amazon SageMaker
End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium
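To illustrate the mechanics of gradient accumulation described above, here is a minimal PyTorch sketch; the model and batch sizes are stand-ins chosen only for the example.

# Gradient accumulation sketch: accumulate gradients over several micro-batches and
# update weights once, simulating a larger effective batch size in the same memory budget.
import torch

model = torch.nn.Linear(512, 512)                    # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 4                               # effective batch = 4 x micro-batch

optimizer.zero_grad()
for step in range(16):                               # illustrative micro-batches
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean() / accumulation_steps  # scale so gradients average correctly
    loss.backward()                                     # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()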

Conclusion
When fine-tuning ML models on AWS, you can choose the right tool for your specific needs. AWS provides a comprehensive suite of tools for data scientists, ML engineers, and business users to achieve their ML goals. AWS has built solutions to support various levels of ML sophistication, from simple SageMaker training jobs for FM fine-tuning to the power of SageMaker HyperPod for cutting-edge research.
We invite you to explore these options, starting with what suits your current needs, and evolve your approach as those needs change. Your journey with AWS is just beginning, and we’re here to support you every step of the way.

About the authors
Ilan Gleiser is a Principal GenAI Specialist at AWS on the WWSO Frameworks team, focusing on developing scalable generative AI architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 30 blog posts and delivered more than 100 machine learning and HPC prototypes globally over the last 5 years. Ilan holds a master’s degree in mathematical economics.
Prashanth Ramaswamy is a Senior Deep Learning Architect at the AWS Generative AI Innovation Center, where he specializes in model customization and optimization. In his role, he works on fine-tuning, benchmarking, and optimizing models by using generative AI as well as traditional AI/ML solutions. He focuses on collaborating with Amazon customers to identify promising use cases and accelerate the impact of AI solutions to achieve key business outcomes.
Deeksha Razdan is an Applied Scientist at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. Her work revolves around conducting research and developing generative AI solutions for various industries. She holds a master’s in computer science from UMass Amherst. Outside of work, Deeksha enjoys being in nature.

Streamline machine learning workflows with SkyPilot on Amazon SageMaker HyperPod

This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.
SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.
In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.
Challenges of orchestrating machine learning workloads
Kubernetes has become popular for ML workloads due to its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.
ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster management can pose significant challenges, potentially slowing down development cycles and resource utilization.
Furthermore, AI infrastructure teams face the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They require a solution that can offer both high-level control and ease of use for day-to-day operations.
SageMaker HyperPod with SkyPilot
To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.
Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.
SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring in their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workload on the best infrastructure: find the available GPUs, provision them, run the job, and manage its lifecycle.

Solution overview
Implementing this solution is straightforward, whether you’re working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).
The following diagram illustrates the solution architecture.

In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.
Prerequisites
You must have the following prerequisites:

An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). The code samples in the following sections use ml.p5.48xlarge instances (the multi-node example uses two such instances).
Access to the AWS CLI and kubectl command line tools.
A Python environment for installing SkyPilot.

Create a SageMaker HyperPod cluster
You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:

cat > cluster-config.json << EOL
{
    "ClusterName": "hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        ...
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL

You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:

aws sagemaker create-cluster \
--cli-input-json file://cluster-config.json

You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.
Connect to your SageMaker HyperPod EKS cluster
From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
You can verify that you are connected to the EKS cluster by running the following command:
kubectl config current-context
Install SkyPilot with Kubernetes support
Use the following code to install SkyPilot with Kubernetes support using pip:
pip install skypilot[kubernetes]
This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.
Verify SkyPilot’s connection to the EKS cluster
Check if SkyPilot can connect to your Kubernetes cluster:
sky check k8s
The output should look similar to the following code:

Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled [compute]

To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html

🎉 Enabled clouds 🎉
Kubernetes [compute]
Active context: arn:aws:eks:us-east-2:XXXXXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster

Using SkyPilot API server: http://127.0.0.1:46580

If this is your first time using SkyPilot with this Kubernetes cluster, you might see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:
python -m sky.utils.kubernetes.gpu_labeler --context <your-eks-context>
This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
Discover available GPUs in the cluster
To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:
sky show-gpus --cloud k8s
This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:

Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
H100 1, 2, 4, 8 16 16

Kubernetes per node accelerator availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
hyperpod-i-00baa178bc31afde3 H100 8 8
hyperpod-i-038beefa954efab84 H100 8 8

Launch an interactive development environment
With SkyPilot, you can launch a SkyPilot cluster for interactive development:
sky launch -c dev --gpus H100
This command creates an interactive development environment (IDE) with a single H100 GPU and will sync the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.

Considered resources (1 node):
-------------------------------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                                 COST ($)   CHOSEN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--H100:1   2       8         H100:1         arn:aws:eks:us-east-2:XXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster    0.00          ✔
-------------------------------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'dev'. Proceed? [Y/n]: Y
• Launching on Kubernetes.
Pod is up.
✔ Cluster launched: dev. View logs: sky api logs -l sky-2025-05-05-15-28-47-523797/provision.log
• Syncing files.
Run commands not specified or empty.
Useful Commands
Cluster name: dev
To log into the head VM:   ssh dev
To submit a job:           sky exec dev yaml_file
To stop the cluster:       sky stop dev
To teardown the cluster:   sky down dev

After it’s launched, you can connect to your IDE:
ssh dev
This gives you an interactive shell in your IDE, where you can run your code, install packages, and perform ML experiments.
Run training jobs
With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.
First, create a file named train.yaml with your training job configuration:

resources:
    accelerators: H100

num_nodes: 1

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv venv --python 3.10
    source .venv/bin/activate
    uv pip install -r requirements.txt "numpy<2" "torch"

run: |
    cd examples
    source .venv/bin/activate
    cd mingpt
    export LOGLEVEL=INFO

    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py

Then launch your training job:
sky launch -c train train.yaml
This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:
sky logs train
Running multi-node training jobs with EFA
Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This enables applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.
The following code snippet shows how to incorporate this into your SkyPilot job:

name: nccl-test-efa

resources:
  cloud: kubernetes
  accelerators: H100:8
  image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest

num_nodes: 2

envs:
  USE_EFA: "true"

run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    echo "Head node"

    # Total number of processes, NP should be the total number of GPUs in the cluster
    NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

    # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
    nodes=""
    for ip in $SKYPILOT_NODE_IPS; do
      nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
    done
    nodes=${nodes::-1}
    echo "All nodes: ${nodes}"

    # Set environment variables
    export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
    export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
    export NCCL_HOME=/opt/nccl
    export CUDA_HOME=/usr/local/cuda-12.2
    export NCCL_DEBUG=INFO
    export NCCL_BUFFSIZE=8388608
    export NCCL_P2P_NET_CHUNKSIZE=524288
    export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so

    if [ "${USE_EFA}" == "true" ]; then
      export FI_PROVIDER="efa"
    else
      export FI_PROVIDER=""
    fi

    /opt/amazon/openmpi/bin/mpirun \
      --allow-run-as-root \
      --tag-output \
      -H $nodes \
      -np $NP \
      -N $SKYPILOT_NUM_GPUS_PER_NODE \
      --bind-to none \
      -x FI_PROVIDER \
      -x PATH \
      -x LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x NCCL_BUFFSIZE \
      -x NCCL_P2P_NET_CHUNKSIZE \
      -x NCCL_TUNER_PLUGIN \
      --mca pml ^cm,ucx \
      --mca btl tcp,self \
      --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
      /opt/nccl-tests/build/all_reduce_perf \
      -b 8 \
      -e 2G \
      -f 2 \
      -g 1 \
      -c 5 \
      -w 5 \
      -n 100
  else
    echo "Worker nodes"
  fi

config:
  kubernetes:
    pod_config:
      spec:
        containers:
        - resources:
            limits:
              vpc.amazonaws.com/efa: 32
            requests:
              vpc.amazonaws.com/efa: 32

Clean up
To delete your SkyPilot cluster, run the following command:
sky down <cluster_name>
To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:
aws sagemaker delete-cluster --cluster-name <cluster_name>
Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:
aws cloudformation delete-stack --stack-name <stack_name>
Conclusion
By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot’s user-friendly interface, we’ve showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.

About the authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.
Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency and low-latency trading and business development for Amazon Alexa.

Intelligent document processing at scale with generative AI and Amazon Bedrock Data Automation

Extracting information from unstructured documents at scale is a recurring business task. Common use cases include creating product feature tables from descriptions, extracting metadata from documents, and analyzing legal contracts, customer reviews, news articles, and more. A classic approach to extracting information from text is named entity recognition (NER). NER identifies entities from predefined categories, such as persons and organizations. Although various AI services and solutions support NER, this approach is limited to text documents and only supports a fixed set of entities. Furthermore, classic NER models can’t handle other data types such as numeric scores (such as sentiment) or free-form text (such as summary). Generative AI unlocks these possibilities without costly data annotation or model training, enabling more comprehensive intelligent document processing (IDP).
AWS recently announced the general availability of Amazon Bedrock Data Automation, a feature of Amazon Bedrock that automates the generation of valuable insights from unstructured multimodal content such as documents, images, video, and audio. This service offers pre-built capabilities for IDP and information extraction through a unified API, alleviating the need for complex prompt engineering or fine-tuning, and making it an excellent choice for document processing workflows at scale. To learn more about Amazon Bedrock Data Automation, refer to Simplify multimodal generative AI with Amazon Bedrock Data Automation.
Amazon Bedrock Data Automation is the recommended approach for IDP use cases due to its simplicity, industry-leading accuracy, and managed service capabilities. It handles the complexity of document parsing, context management, and model selection automatically, so developers can focus on their business logic rather than IDP implementation details.
Although Amazon Bedrock Data Automation meets most IDP needs, some organizations require additional customization in their IDP pipelines. For example, companies might need to use self-hosted foundation models (FMs) for IDP due to regulatory requirements. Some customers have builder teams who might prefer to maintain full control over the IDP pipeline instead of using a managed service. Finally, organizations might operate in AWS Regions where Amazon Bedrock Data Automation is not available (available in us-west-2 and us-east-1 as of June 2025). In such cases, builders might use Amazon Bedrock FMs directly or perform optical character recognition (OCR) with Amazon Textract.
This post presents an end-to-end IDP application powered by Amazon Bedrock Data Automation and other AWS services. It provides a reusable AWS infrastructure as code (IaC) that deploys an IDP pipeline and provides an intuitive UI for transforming documents into structured tables at scale. The application only requires the user to provide the input documents (such as contracts or emails) and a list of attributes to be extracted. It then performs IDP with generative AI.
The application code and deployment instructions are available on GitHub under the MIT license.
Solution overview
The IDP solution presented in this post is deployed as IaC using the AWS Cloud Development Kit (AWS CDK). Amazon Bedrock Data Automation serves as the primary engine for information extraction. For cases requiring further customization, the solution also provides alternative processing paths using Amazon Bedrock FMs and Amazon Textract integration.
We use AWS Step Functions to orchestrate the IDP workflow and parallelize processing for multiple documents. As part of the workflow, we use AWS Lambda functions to call Amazon Bedrock Data Automation or Amazon Textract and Amazon Bedrock (depending on the selected parsing mode). Processed documents and extracted attributes are stored in Amazon Simple Storage Service (Amazon S3).
A Step Functions workflow with the business logic is invoked through an API call performed using an AWS SDK. We also build a containerized web application running on Amazon Elastic Container Service (Amazon ECS) that is available to end-users through Amazon CloudFront to simplify their interaction with the solution. We use Amazon Cognito for authentication and secure access to the APIs.
The following diagram illustrates the architecture and workflow of the IDP solution.

The IDP workflow includes the following steps:

A user logs in to the web application using credentials managed by Amazon Cognito, selects input documents, and defines the fields to be extracted from them in the UI. Optionally, the user can specify the parsing mode, LLM to use, and other settings.
The user starts the IDP pipeline.
The application creates a pre-signed S3 URL for the documents and uploads them to Amazon S3.
The application triggers Step Functions to start the state machine with the S3 URIs and IDP settings as inputs. The Map state starts to process the documents concurrently.
Depending on the document type and the parsing mode, it branches to different Lambda functions that perform IDP, save results to Amazon S3, and send them back to the UI:

Amazon Bedrock Data Automation – Documents are directed to the “Run Data Automation” Lambda function. The Lambda function creates a blueprint with the user-defined fields schema and launches an asynchronous Amazon Bedrock Data Automation job. Amazon Bedrock Data Automation handles the complexity of document processing and attribute extraction using optimized prompts and models. When the job results are ready, they’re saved to Amazon S3 and sent back to the UI. This approach provides the best balance of accuracy, ease of use, and scalability for most IDP use cases.
Amazon Textract – If the user specifies Amazon Textract as a parsing mode, the IDP pipeline splits into two steps. First, the “Perform OCR” Lambda function is invoked to run an asynchronous document analysis job. The OCR outputs are processed using the amazon-textract-textractor library and formatted as Markdown. Second, the text is passed to the “Extract attributes” Lambda function (Step 6), which invokes an Amazon Bedrock FM given the text and the attributes schema. The outputs are saved to Amazon S3 and sent to the UI.
Handling office documents – Documents with suffixes like .doc, .ppt, and .xls are processed by the “Parse office” Lambda function, which uses LangChain document loaders to extract the text content. The outputs are passed to the “Extract attributes” Lambda function (Step 6) to proceed with the IDP pipeline.

If the user chooses an Amazon Bedrock FM for IDP, the document is sent to the “Extract attributes” Lambda function. It converts a document into a set of images, which are sent to a multimodal FM with the attributes schema as part of a custom prompt. It parses the LLM response to extract JSON outputs, saves them to Amazon S3, and sends them back to the UI. This flow supports .pdf, .png, and .jpg documents.
The web application checks the state machine execution results periodically and returns the extracted attributes to the user when they are available.
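For orientation, the following boto3 sketch mirrors steps 3 and 4 above: uploading a document through a pre-signed URL and starting the Step Functions state machine. The bucket name, state machine ARN, file name, and input payload shape are hypothetical placeholders, not the solution’s actual contract.

# Illustrative sketch only: upload a document via a pre-signed URL and start the
# IDP state machine. All resource names and the payload format are placeholders.
import json

import boto3
import requests  # assumes the requests library is available

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

bucket, key = "my-idp-bucket", "uploads/contract.pdf"  # placeholders
upload_url = s3.generate_presigned_url(
    "put_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)
with open("contract.pdf", "rb") as f:  # placeholder local file
    requests.put(upload_url, data=f)

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:idp-pipeline",  # placeholder ARN
    input=json.dumps({
        "documents": [f"s3://{bucket}/{key}"],
        "attributes": [{"name": "customer_name", "description": "Name of the customer"}],
        "parsing_mode": "bedrock_data_automation",  # assumed setting name
    }),
)
print(execution["executionArn"])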

Prerequisites
You can deploy the IDP solution from your local computer or from an Amazon SageMaker notebook instance. The deployment steps are detailed in the solution README file.
If you choose to deploy using a SageMaker notebook, which is recommended, you will need access to an AWS account with permissions to create and launch a SageMaker notebook instance.
Deploy the solution
To deploy the solution to your AWS account, complete the following steps:

Open the AWS Management Console and choose the Region in which you want to deploy the IDP solution.
Launch a SageMaker notebook instance. Provide the notebook instance name and notebook instance type, which you can set to ml.m5.large. Leave other options as default.
Navigate to the Notebook instance and open the IAM role attached to the notebook. Open the role on the AWS Identity and Access Management (IAM) console.
Attach an inline policy to the role and insert the following policy JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:*",
                "s3:*",
                "iam:*",
                "sts:AssumeRole"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters"
            ],
            "Resource": "arn:aws:ssm:*:*:parameter/cdk-bootstrap/*"
        }
    ]
}

When the notebook instance status is marked as InService, choose Open JupyterLab.
In the JupyterLab environment, choose File, New, and Terminal.
Clone the solution repository by running the following commands:

cd SageMaker
git clone https://github.com/aws-samples/intelligent-document-processing-with-amazon-bedrock.git

Navigate to the repository folder and run the script to install requirements:

cd intelligent-document-processing-with-amazon-bedrock
sh install_deps.sh

Run the script to create a virtual environment and install dependencies:

sh install_env.sh
source .venv/bin/activate

Within the repository folder, copy the config-example.yml to a config.yml to specify your stack name. Optionally, configure the services and indicate the modules you want to deploy (for example, to disable deploying a UI, change deploy_streamlit to False). Make sure you add your user email to the Amazon Cognito users list.
Configure Amazon Bedrock model access by opening the Amazon Bedrock console in the Region specified in the config.yml file. In the navigation pane, choose Model Access and make sure to enable access for the model IDs specified in config.yml.
Bootstrap and deploy the AWS CDK in your account:

cdk bootstrap
cdk deploy

Note that this step may take some time, especially on the first deployment. Once deployment is complete, you should see the message as shown in the following screenshot. You can access the Streamlit frontend using the CloudFront distribution URL provided in the AWS CloudFormation outputs. The temporary login credentials will be sent to the email specified in config.yml during the deployment.

Using the solution
This section guides you through two examples to showcase the IDP capabilities.
Example 1: Analyzing financial documents
In this scenario, we extract key features from a multi-page financial statement using Amazon Bedrock Data Automation. We use a sample document in PDF format with a mixture of tables, images, and text, and extract several financial metrics. Complete the following steps:

Upload a document by attaching a file through the solution UI.

On the Describe Attributes tab, either manually list the names and descriptions of the attributes or upload these fields in JSON format. We want to find the following metrics:

Current cash in assets in 2018
Current cash in assets in 2019
Operating profit in 2018
Operating profit in 2019

Choose Extract attributes to start the IDP pipeline.

The provided attributes are integrated into a custom blueprint with the inferred attributes list, which is then used to invoke a data automation job on the uploaded documents.
After the IDP pipeline is complete, you will see a table of results in the UI. It includes an index for each document in the _doc column, a column for each of the attributes you defined, and a file_name column that contains the document name.

From the following statement excerpts, we can see that Amazon Bedrock Data Automation was able to correctly extract the values for current assets and operating profit.

The IDP solution can also perform complex calculations that go beyond extracting well-defined entities. Let’s say we want to calculate the following accounting metrics:

Liquidity ratios (Current assets/Current liabilities)
Working capital (Current assets – Current liabilities)
Revenue increase ((Revenue year 2/Revenue year 1) – 1)

We define the attributes and their formulas as part of the attributes schema. This time, we choose an Amazon Bedrock LLM as the parsing mode to demonstrate how the application can use a multimodal FM for IDP. When using an Amazon Bedrock LLM, starting the IDP pipeline combines the attributes and their descriptions into a custom prompt template, which is sent to the LLM along with the documents converted to images. As a user, you can specify the LLM powering the extraction and its inference parameters, such as temperature.
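
As an illustration, the attribute definitions for such derived metrics can be expressed as plain name/description pairs. The snippet below is a hypothetical example of what the uploaded definitions (or their JSON equivalent) might look like, not the solution’s exact schema.

# Illustrative attribute definitions for derived accounting metrics; names and wording are assumptions.
derived_attributes = [
    {
        "name": "liquidity_ratio",
        "description": "Current assets divided by current liabilities for the reporting year",
    },
    {
        "name": "working_capital",
        "description": "Current assets minus current liabilities for the reporting year",
    },
    {
        "name": "revenue_increase",
        "description": "(Revenue in year 2 / Revenue in year 1) - 1, expressed as a fraction",
    },
]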

The output, including the full results, is shown in the following screenshot.

Example 2: Processing customer emails
In this scenario, we want to extract multiple features from a list of emails with customer complaints due to delays in product shipments using Amazon Bedrock Data Automation. For each email, we want to find the following:

Customer name
Shipment ID
Email language
Email sentiment
Shipment delay (in days)
Summary of issue
Suggested response

Complete the following steps:

Upload input emails as .txt files. You can download sample emails from GitHub.

On the Describe Attributes tab, list names and descriptions of the attributes.

You can add few-shot examples for some fields (such as delay) to explain to the LLM how these field values should be extracted. You can do this by adding an example input and the expected output for the attribute to the description, as in the sketch below.
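
The following snippet is a hypothetical illustration of embedding a few-shot example into an attribute description; the exact phrasing and field names are assumptions, not a required format.

# Illustrative attribute with a few-shot example embedded in its description.
email_attributes = [
    {
        "name": "delay",
        "description": (
            "Shipment delay in days, as an integer. "
            "Example input: 'My package was supposed to arrive on May 2 but came on May 7.' "
            "Expected output: 5"
        ),
    },
]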

Choose Extract attributes to start the IDP pipeline.

The provided attributes and their descriptions will be integrated into a custom blueprint with the inferred attributes list, which is then used to invoke a data automation job on the uploaded documents. When the IDP pipeline is complete, you will see the results.

The application allows downloading the extraction results as a CSV or a JSON file. This makes it straightforward to use the results for downstream tasks, such as aggregating customer sentiment scores.
Pricing
In this section, we calculate cost estimates for performing IDP on AWS with our solution.
Amazon Bedrock Data Automation provides a transparent pricing schema depending on the input document size (number of pages, images, or minutes). When using Amazon Bedrock FMs, pricing depends on the number of input and output tokens used as part of the information extraction call. Finally, when using Amazon Textract, OCR is performed and priced separately based on the number of pages in the documents.
Using the preceding scenarios as examples, we can approximate the costs depending on the selected parsing mode. In the following table, we show costs using two datasets: 100 20-page financial documents, and 100 1-page customer emails. We ignore costs of Amazon ECS and Lambda.

| AWS service | Use case 1 (100 20-page financial documents) | Use case 2 (100 1-page customer emails) |
| IDP option 1: Amazon Bedrock Data Automation | | |
| Amazon Bedrock Data Automation (custom output) | $20.00 | $1.00 |
| IDP option 2: Amazon Bedrock FM | | |
| Amazon Bedrock (FM invocation, Anthropic’s Claude 4 Sonnet) | $1.79 | $0.09 |
| IDP option 3: Amazon Textract and Amazon Bedrock FM | | |
| Amazon Textract (document analysis job with layout) | $30.00 | $1.50 |
| Amazon Bedrock (FM invocation, Anthropic’s Claude 3.7 Sonnet) | $1.25 | $0.06 |
| Orchestration and storage (shared costs) | | |
| Amazon S3 | $0.02 | $0.02 |
| Amazon CloudFront | $0.09 | $0.09 |
| Amazon ECS | – | – |
| AWS Lambda | – | – |
| Total cost: Amazon Bedrock Data Automation | $20.11 | $1.11 |
| Total cost: Amazon Bedrock FM | $1.90 | $0.20 |
| Total cost: Amazon Textract and Amazon Bedrock FM | $31.36 | $1.67 |
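
As a quick sanity check on how these totals combine, the following sketch recomputes the use case 1 totals by adding each parsing option’s cost to the shared Amazon S3 and CloudFront costs; all figures are taken directly from the table above.

# Recompute the use case 1 totals from the table's per-service costs.
shared = 0.02 + 0.09  # Amazon S3 + Amazon CloudFront

totals_use_case_1 = {
    "Bedrock Data Automation": 20.00 + shared,        # 20.11
    "Bedrock FM": 1.79 + shared,                      # 1.90
    "Textract + Bedrock FM": 30.00 + 1.25 + shared,   # 31.36
}
print(totals_use_case_1)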

The cost analysis suggests that using Amazon Bedrock FMs with a custom prompt template is a cost-effective method for IDP. However, this approach involves greater operational overhead, because the pipeline must be optimized for the chosen LLM, and it requires manual security and privacy management. Amazon Bedrock Data Automation offers a managed service that uses a choice of high-performing FMs through a single API.
Clean up
To remove the deployed resources, complete the following steps:

On the AWS CloudFormation console, delete the created stack. Alternatively, run the following command:

cdk destroy --region <YOUR_DEPLOY_REGION>

On the Amazon Cognito console, delete the user pool.

Conclusion
Extracting information from unstructured documents at scale is a recurring business task. This post discussed an end-to-end IDP application that performs information extraction using multiple AWS services. The solution is powered by Amazon Bedrock Data Automation, which provides a fully managed service for generating insights from documents, images, audio, and video. Amazon Bedrock Data Automation handles the complexity of document processing and information extraction, optimizing for both performance and accuracy without requiring expertise in prompt engineering. For extended flexibility and customizability in specific scenarios, our solution also supports IDP using Amazon Bedrock custom LLM calls and Amazon Textract for OCR.
The solution supports multiple document types, including text, images, PDF, and Microsoft Office documents. At the time of writing, accurate understanding of information in documents rich with images, tables, and other visual elements is only available for PDF and images. We recommend converting complex Office documents to PDFs or images for best performance. Another solution limitation is the document size. As of June 2025, Amazon Bedrock Data Automation supports documents up to 20 pages for custom attributes extraction. When using custom Amazon Bedrock LLMs for IDP, the 300,000-token context window of Amazon Nova LLMs allows processing documents with up to roughly 225,000 words. To extract information from larger documents, you would currently need to split the file into multiple documents.
In the next versions of the IDP solution, we plan to keep adding support for state-of-the-art language models available through Amazon Bedrock and iterate on prompt engineering to further improve the extraction accuracy. We also plan to implement techniques for extending the size of supported documents and providing users with a precise indication of where exactly in the document the extracted information is coming from.
To get started with IDP with the described solution, refer to the GitHub repository. To learn more about Amazon Bedrock, refer to the documentation.

About the authors
Nikita Kozodoi, PhD, is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he works on the frontier of AI research and business. With rich experience in Generative AI and diverse areas of ML, Nikita is enthusiastic about using AI to solve challenging real-world business problems across industries.
Zainab Afolabi is a Senior Data Scientist at the Generative AI Innovation Centre in London, where she leverages her extensive expertise to develop transformative AI solutions across diverse industries. She has over eight years of specialised experience in artificial intelligence and machine learning, as well as a passion for translating complex technical concepts into practical business applications.
Aiham Taleb, PhD, is a Senior Applied Scientist at the Generative AI Innovation Center, working directly with AWS enterprise customers to leverage Gen AI across several high-impact use cases. Aiham has a PhD in unsupervised representation learning, and has industry experience that spans across various machine learning applications, including computer vision, natural language processing, and medical imaging.
Liza (Elizaveta) Zinovyeva is an Applied Scientist at AWS Generative AI Innovation Center and is based in Berlin. She helps customers across different industries to integrate Generative AI into their existing applications and workflows. She is passionate about AI/ML, finance and software security topics. In her spare time, she enjoys spending time with her family, sports, learning new technologies, and table quizzes.
Nuno Castro is a Sr. Applied Science Manager at AWS Generative AI Innovation Center. He leads Generative AI customer engagements, helping hundreds of AWS customers find the most impactful use case from ideation, prototype through to production. He has 19 years experience in AI in industries such as finance, manufacturing, and travel, leading AI/ML teams for 12 years.
Ozioma Uzoegwu is a Principal Solutions Architect at Amazon Web Services. In his role, he helps financial services customers across EMEA to transform and modernize on the AWS Cloud, providing architectural guidance and industry best practices. Ozioma has many years of experience with web development, architecture, cloud and IT management. Prior to joining AWS, Ozioma worked with an AWS Advanced Consulting Partner as the Lead Architect for the AWS Practice. He is passionate about using latest technologies to build a modern financial services IT estate across banking, payment, insurance and capital markets.
Eren Tuncer is a Solutions Architect at Amazon Web Services focused on Serverless and building Generative AI applications. With more than fifteen years experience in software development and architecture, he helps customers across various industries achieve their business goals using cloud technologies with best practices. As a builder, he’s passionate about creating solutions with state-of-the-art technologies, sharing knowledge, and helping organizations navigate cloud adoption.
Francesco Cerizzi is a Solutions Architect at Amazon Web Services exploring tech frontiers while spreading generative AI knowledge and building applications. With a background as a full stack developer, he helps customers across different industries in their journey to the cloud, sharing insights on AI’s transformative potential along the way. He’s passionate about Serverless, event-driven architectures, and microservices in general. When not diving into technology, he’s a huge F1 fan and loves Tennis.

Mistral AI Releases Devstral 2507 for Code-Centric Language Modeling

Mistral AI, in collaboration with All Hands AI, has released updated versions of its developer-focused large language models under the Devstral 2507 label. The release includes two models—Devstral Small 1.1 and Devstral Medium 2507—designed to support agent-based code reasoning, program synthesis, and structured task execution across large software repositories. These models are optimized for performance and cost, making them applicable for real-world use in developer tools and code automation systems.

Devstral Small 1.1: Open Model for Local and Embedded Use

Devstral Small 1.1 (also called devstral-small-2507) is based on the Mistral-Small-3.1 foundation model and contains approximately 24 billion parameters. It supports a 128k token context window, which allows it to handle multi-file code inputs and long prompts typical in software engineering workflows.

The model is fine-tuned specifically for structured outputs, including XML and function-calling formats. This makes it compatible with agent frameworks such as OpenHands and suitable for tasks like program navigation, multi-step edits, and code search. It is licensed under Apache 2.0 and available for both research and commercial use.

Source: https://mistral.ai/news/devstral-2507

Performance: SWE-Bench Results

Devstral Small 1.1 achieves 53.6% on the SWE-Bench Verified benchmark, which evaluates the model’s ability to generate correct patches for real GitHub issues. This represents a noticeable improvement over the previous version (1.0) and places it ahead of other openly available models of comparable size. The results were obtained using the OpenHands scaffold, which provides a standard test environment for evaluating code agents.

While not at the level of the largest proprietary models, this version offers a balance between size, inference cost, and reasoning performance that is practical for many coding tasks.

Deployment: Local Inference and Quantization

The model is released in multiple formats. Quantized versions in GGUF are available for use with llama.cpp, vLLM, and LM Studio. These formats make it possible to run inference locally on high-memory GPUs (e.g., RTX 4090) or Apple Silicon machines with 32GB RAM or more. This is beneficial for developers or teams that prefer to operate without dependency on hosted APIs.

Mistral also makes the model available via their inference API. The current pricing is $0.10 per million input tokens and $0.30 per million output tokens, the same as other models in the Mistral-Small line.
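
As an illustration of hosted API access, the sketch below queries devstral-small-2507 with the mistralai Python client. The interface shown (Mistral.chat.complete and the response shape) reflects the v1 SDK and should be treated as an assumption to verify against current documentation.

# A hedged sketch of calling Devstral Small 1.1 through Mistral's hosted API.
import os
from mistralai import Mistral  # v1 SDK interface; verify against current docs

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="devstral-small-2507",
    messages=[{"role": "user", "content": "Write a unit test for a binary search function."}],
)
print(response.choices[0].message.content)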

Source: https://mistral.ai/news/devstral-2507

Devstral Medium 2507: Higher Accuracy, API-Only

Devstral Medium 2507 is not open-sourced and is only available through the Mistral API or through enterprise deployment agreements. It offers the same 128k token context length as the Small version but with higher performance.

The model scores 61.6% on SWE-Bench Verified, outperforming several commercial models, including Gemini 2.5 Pro and GPT-4.1, in the same evaluation framework. Its stronger reasoning capacity over long contexts makes it a candidate for code agents that operate across large monorepos or repositories with cross-file dependencies.

API pricing is set at $0.40 per million input tokens and $2 per million output tokens. Fine-tuning is available for enterprise users via the Mistral platform.

Comparison and Use Case Fit

| Model | SWE-Bench Verified | Open Source | Input Cost | Output Cost | Context Length |
| Devstral Small 1.1 | 53.6% | Yes | $0.10/M | $0.30/M | 128k tokens |
| Devstral Medium | 61.6% | No | $0.40/M | $2.00/M | 128k tokens |

Devstral Small is more suitable for local development, experimentation, or integrating into client-side developer tools where control and efficiency are important. In contrast, Devstral Medium provides stronger accuracy and consistency in structured code-editing tasks and is intended for production services that benefit from higher performance despite increased cost.

Integration with Tooling and Agents

Both models are designed to support integration with code agent frameworks such as OpenHands. The support for structured function calls and XML output formats allows them to be integrated into automated workflows for test generation, refactoring, and bug fixing. This compatibility makes it easier to connect Devstral models to IDE plugins, version control bots, and internal CI/CD pipelines.

For example, developers can use Devstral Small for prototyping local workflows, while Devstral Medium can be used in production services that apply patches or triage pull requests based on model suggestions.

Conclusion

The Devstral 2507 release reflects a targeted update to Mistral’s code-oriented LLM stack, offering users a clearer tradeoff between inference cost and task accuracy. Devstral Small provides an accessible, open model with sufficient performance for many use cases, while Devstral Medium caters to applications where correctness and reliability are critical.

The availability of both models under different deployment options makes them relevant across various stages of the software engineering workflow—from experimental agent development to deployment in commercial environments.

Check out the technical details and the Devstral Small model weights on Hugging Face. Devstral Medium will also be available on Mistral Code for enterprise customers and on the fine-tuning API.
The post Mistral AI Releases Devstral 2507 for Code-Centric Language Modeling appeared first on MarkTechPost.

Google AI Releases Vertex AI Memory Bank: Enabling Persistent Agent Co …

Developers are actively working to bring AI agents to market, but a significant hurdle has been the lack of memory. Without the ability to recall past interactions, agents treat each conversation as if it’s the first, leading to repetitive questions, an inability to remember user preferences, and a general lack of personalization. This results in frustration for both users and developers.

Historically, developers have attempted to mitigate this by inserting entire session dialogues directly into an LLM’s context window. However, this approach is expensive and computationally inefficient, leading to higher inference costs and slower response times. Furthermore, feeding too much information, especially irrelevant details, can degrade the model’s output quality, causing issues like “lost in the middle” and “context rot”.

Introducing Vertex AI Memory Bank

To overcome these limitations, Google Cloud has announced the public preview of Memory Bank, a new managed service within the Vertex AI Agent Engine. Memory Bank is designed to help you build highly personalized conversational agents that facilitate more natural, contextual, and continuous engagements.

For instance, consider a personalized healthcare agent: key information about a user’s allergies and previous symptoms mentioned in past sessions is needed to provide a more informed response in the current session.

Memory Bank addresses the fundamental memory problem in several key ways:

Personalize interactions: It goes beyond generic scripts by remembering user preferences, key events, and past choices to tailor every response.

Maintain continuity: Conversations can pick up seamlessly where they left off, even across multiple sessions that might span days or weeks.

Provide better context: Agents are armed with the necessary background on a user, leading to more relevant, insightful, and helpful responses.

Improve user experience: It eliminates the frustration of users repeating information, creating more natural, efficient, and engaging conversations.

How Memory Bank Works

Memory Bank operates through an intelligent, multi-stage process, leveraging Google’s Gemini models and novel research:

Understands and Extracts Memories: Memory Bank analyzes a user’s conversation history (stored in Agent Engine Sessions) to extract key facts, preferences, and context. This process happens asynchronously in the background, generating new memories without requiring developers to build complex extraction pipelines.

Stores and Updates Memories Intelligently: Key information, such as “I prefer sunny days,” is stored and organized by a defined scope, such as a user ID. When new information emerges, Memory Bank, using Gemini, can consolidate it with existing memories, resolving contradictions and keeping the memories up to date.

Recalls Relevant Information: When a new conversation session begins, the agent can retrieve these stored memories. This retrieval can be a simple recall of all facts or a more advanced similarity search using embeddings to find memories most relevant to the current topic. This ensures the agent is always equipped with the right context.
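
The retrieval step can be pictured as an embedding similarity search over stored memories. The sketch below is framework-agnostic and does not use the Memory Bank API itself; the embed function is a stand-in for any sentence-embedding model.

# A framework-agnostic sketch of similarity-based memory recall (not the Memory Bank API).
from typing import Callable, List, Tuple
import numpy as np

def recall_memories(
    query: str,
    memories: List[str],
    embed: Callable[[str], np.ndarray],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k stored memories most similar to the current query."""
    q = embed(query)
    scored = []
    for memory in memories:
        m = embed(memory)
        score = float(np.dot(q, m) / (np.linalg.norm(q) * np.linalg.norm(m)))  # cosine similarity
        scored.append((memory, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]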

This entire process is grounded in a novel method from Google Research, accepted at ACL 2025, which provides an intelligent, topic-based approach to how agents learn and recall information, setting a new standard for agent memory performance. For example, a personal beauty companion agent can remember a user’s evolving skin type to make personalized product recommendations.

Getting Started with Memory Bank

Memory Bank is integrated with the Agent Development Kit (ADK) and Agent Engine Sessions. Developers can define an agent using ADK and enable Agent Engine Sessions to manage conversation history within individual sessions. Memory Bank can then be enabled to provide long-term memory across multiple sessions.

You can integrate Memory Bank into your agent in two primary ways:

Develop an agent with Google Agent Development Kit (ADK) for an out-of-the-box experience.

Develop an agent that orchestrates API calls to Memory Bank if you are building your agent with any other framework, including popular ones like LangGraph and CrewAI.

For those new to Google Cloud but using ADK, an express mode registration for Agent Engine Sessions and Memory Bank allows you to sign up with a Gmail account to receive an API key and build within free tier usage quotas before seamlessly upgrading to a full Google Cloud project for production.
The post Google AI Releases Vertex AI Memory Bank: Enabling Persistent Agent Conversations appeared first on MarkTechPost.

Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context …

Phi-4-mini-Flash-Reasoning, the latest addition to Microsoft’s Phi-4 model family, is an open, lightweight language model designed to excel at long-context reasoning while maintaining high inference efficiency. Released on Hugging Face, this 3.8B parameter model is a distilled version of Phi-4-mini, fine-tuned for dense reasoning tasks like math problem solving and multi-hop question answering. Built using Microsoft’s new SambaY decoder-hybrid-decoder architecture, it achieves state-of-the-art performance among compact models and operates up to 10× faster than its predecessor on long-generation tasks.

Architecture: Gated Memory Meets Hybrid Decoding

At the core of Phi-4-mini-Flash-Reasoning is the SambaY architecture, a novel decoder-hybrid-decoder model that integrates State Space Models (SSMs) with attention layers using a lightweight mechanism called the Gated Memory Unit (GMU). This structure enables efficient memory sharing between layers, significantly reducing inference latency in long-context and long-generation scenarios.

Unlike Transformer-based architectures that rely heavily on memory-intensive attention computations, SambaY leverages Samba (a hybrid SSM architecture) in the self-decoder and replaces roughly half of the cross-attention layers in the cross-decoder with GMUs. GMUs serve as cheap, element-wise gating functions that reuse the hidden state from the final SSM layer, thereby avoiding redundant computation. This results in a linear-time prefill complexity and lower decoding I/O, yielding substantial speedups during inference.
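
Conceptually, a gated memory unit can be pictured as an element-wise gate applied to a hidden state reused from an earlier SSM layer. The PyTorch-style sketch below is a simplified illustration of that idea, not the SambaY implementation; the layer sizes and sigmoid gating are assumptions.

# A simplified, conceptual gated memory unit: the current layer's input gates a memory
# state shared from the final SSM layer (element-wise), avoiding a fresh attention pass.
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x: current hidden states; memory: hidden state reused from the final SSM layer
        gate = torch.sigmoid(self.gate_proj(x))   # element-wise gate in [0, 1]
        return self.out_proj(gate * memory)       # per-token cost scales with d, not sequence length

x = torch.randn(2, 16, 512)       # (batch, sequence, hidden)
memory = torch.randn(2, 16, 512)
out = GatedMemoryUnit(512)(x, memory)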

Training Pipeline and Reasoning Capabilities

The Phi-4-mini-Flash model is pre-trained on 5T tokens from high-quality synthetic and filtered real data, consistent with the rest of the Phi-4-mini family. Post pretraining, it undergoes multi-stage supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) using reasoning-focused instruction datasets. Notably, unlike Phi-4-mini-Reasoning, it excludes reinforcement learning (RLHF) entirely.

Despite this, Phi-4-mini-Flash-Reasoning outperforms Phi-4-mini-Reasoning on a suite of complex reasoning tasks. On the Math500 benchmark, it achieves a pass@1 accuracy of 92.45%, outperforming Phi-4-mini-Reasoning (91.2%) and surpassing other open models like Qwen-1.5B and Bespoke-Stratos-7B. On AIME24/25, it shows strong gains as well, with over 52% accuracy on AIME24.

This performance leap is attributed to the architecture’s capacity for long Chain-of-Thought (CoT) generation. With 64K context length support and optimized inference under the vLLM framework, the model can generate and reason across multi-thousand-token contexts without bottlenecks. In latency benchmarks with 2K-token prompts and 32K-token generations, Phi-4-mini-Flash-Reasoning delivers up to 10× higher throughput than its predecessor.

Efficient Long-Context Processing

Efficiency gains in Phi-4-mini-Flash-Reasoning aren’t just theoretical. Through the decoder-hybrid-decoder design, the model achieves competitive performance on long-context benchmarks like Phonebook and RULER. For instance, with a sliding window attention (SWA) size as small as 256, it maintains high retrieval accuracy, indicating that long-range token dependencies are well captured via SSMs and GMU-based memory sharing.

These architectural innovations lead to reduced compute and memory overhead. For example, during decoding, GMU layers replace attention operations that would otherwise cost O(N·d) time per token, cutting that down to O(d), where N is sequence length and d is hidden dimension. The result is real-time inference capability even in multi-turn or document-level scenarios.

Open Weights and Use Cases

Microsoft has open-sourced the model weights and configuration through Hugging Face, providing full access to the community. The model supports 64K context length, operates under standard Hugging Face and vLLM runtimes, and is optimized for fast token throughput on A100 GPUs.

Potential use cases for Phi-4-mini-Flash-Reasoning include:

Mathematical Reasoning (e.g., SAT, AIME-level problems)

Multi-hop QA

Legal and Scientific Document Analysis

Autonomous Agents with Long-Term Memory

High-throughput Chat Systems

Its combination of open access, reasoning ability, and efficient inference makes it a strong candidate for deployment in environments where compute resources are constrained but task complexity is high.

Conclusion

Phi-4-mini-Flash-Reasoning exemplifies how architectural innovation—particularly hybrid models leveraging SSMs and efficient gating—can bring transformative gains in reasoning performance without ballooning model size or cost. It marks a new direction in efficient long-context language modeling, paving the way for real-time, on-device reasoning agents and scalable open-source alternatives to commercial LLMs.

Check out the paper, code, and model on Hugging Face, along with the technical details.
The post Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture appeared first on MarkTechPost.

New capabilities in Amazon SageMaker AI continue to transform how orga …

As AI models become increasingly sophisticated and specialized, the ability to quickly train and customize models can mean the difference between industry leadership and falling behind. That is why hundreds of thousands of customers use the fully managed infrastructure, tools, and workflows of Amazon SageMaker AI to scale and advance AI model development. Since launching in 2017, SageMaker AI has transformed how organizations approach AI model development by reducing complexity while maximizing performance. Since then, we’ve continued to relentlessly innovate, adding more than 420 new capabilities since launch to give customers the best tools to build, train, and deploy AI models quickly and efficiently. Today, we’re pleased to announce new innovations that build on the rich features of SageMaker AI to accelerate how customers build and train AI models.
Amazon SageMaker HyperPod: The infrastructure of choice for developing AI models
AWS launched Amazon SageMaker HyperPod in 2023 to reduce complexity and maximize performance and efficiency when building AI models. With SageMaker HyperPod, you can quickly scale generative AI model development across thousands of AI accelerators and reduce foundation model (FM) training and fine-tuning development costs by up to 40%. Many of today’s top models are trained on SageMaker HyperPod, including models from Hugging Face, Luma AI, Perplexity AI, Salesforce, Thomson Reuters, Writer, and Amazon. By training Amazon Nova FMs on SageMaker HyperPod, Amazon saved months of work and increased utilization of compute resources to more than 90%.

To further streamline workflows and make it faster to develop and deploy models, a new command line interface (CLI) and software development kit (SDK) provide a single, consistent interface that simplifies infrastructure management, unifies job submission across training and inference, and supports both recipe-based and custom workflows with integrated monitoring and control. Today, we are also adding two capabilities to SageMaker HyperPod that can help you reduce training costs and accelerate AI model development.
Reduce the time to troubleshoot performance issues from days to minutes with SageMaker HyperPod observability
To bring new AI innovations to market as quickly as possible, organizations need visibility across AI model development tasks and compute resources to optimize training efficiency and detect and resolve interruptions or performance bottlenecks as soon as possible. For example, to investigate if a training or fine-tuning job failure was the result of a hardware issue, data scientists and machine learning (ML) engineers want to quickly filter to review the monitoring data of the specific GPUs that performed the job rather than manually browsing through the hardware resources of an entire cluster to establish the correlation between the job failure and a hardware issue.
The new observability capability in SageMaker HyperPod transforms how you can monitor and optimize your model development workloads. Through a unified dashboard preconfigured in Amazon Managed Grafana, with the monitoring data automatically published to an Amazon Managed Service for Prometheus workspace, you can now see generative AI task performance metrics, resource utilization, and cluster health in a single view. Teams can now quickly spot bottlenecks, prevent costly delays, and optimize compute resources. You can define automated alerts, specify use case-specific task metrics and events, and publish them to the unified dashboard with just a few clicks.
By reducing troubleshooting time from days to minutes, this capability can help you accelerate your path to production and maximize the return on your AI investments.

DatologyAI builds tools to automatically select the best data on which to train deep learning models.

“We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics—from task-specific GPU utilization to file system (FSx for Lustre) performance—without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems.” –Josh Wills, Member of Technical Staff at DatologyAI

Articul8 helps companies build sophisticated enterprise generative AI applications.

“With SageMaker HyperPod observability, we can now deploy our metric collection and visualization systems in a single click, saving our teams days of otherwise manual setup and enhancing our cluster observability workflows and insights. Our data scientists can quickly monitor task performance metrics, such as latency, and identify hardware issues without manual configuration. SageMaker HyperPod observability will help streamline our foundation model development processes, allowing us to focus on advancing our mission of delivering accessible and reliable AI-powered innovation to our customers.” –Renato Nascimento, head of technology at Articul8


Deploy Amazon SageMaker JumpStart models on SageMaker HyperPod for fast, scalable inference
After developing generative AI models on SageMaker HyperPod, many customers import these models to Amazon Bedrock, a fully managed service for building and scaling generative AI applications. However, some customers want to use their SageMaker HyperPod compute resources to speed up their evaluation and move models into production faster.
Now, you can deploy open-weights models from Amazon SageMaker JumpStart, as well as fine-tuned custom models, on SageMaker HyperPod within minutes with no manual infrastructure setup. Data scientists can run inference on SageMaker JumpStart models with a single click, simplifying and accelerating model evaluation. This straightforward, one-time provisioning reduces manual infrastructure setup, providing a reliable and scalable inference environment with minimal effort. Large model downloads are reduced from hours to minutes, accelerating model deployments and shortening the time to market.

H.AI exists to push the boundaries of superintelligence with agentic AI.

“With Amazon SageMaker HyperPod, we used the same high-performance compute to build and deploy the foundation models behind our agentic AI platform. This seamless transition from training to inference streamlined our workflow, reduced time to production, and delivered consistent performance in live environments. SageMaker HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.” –Laurent Sifre, Co-founder & CTO at H.AI


Seamlessly access the powerful compute resources of SageMaker AI from local development environments
Today, many customers choose from the broad set of fully managed integrated development environments (IDEs) available in SageMaker AI for model development, including JupyterLab, Code Editor based on Code-OSS, and RStudio. Although these IDEs enable secure and efficient setups, some developers prefer to use local IDEs on their personal computers for their debugging capabilities and extensive customization options. However, customers using a local IDE, such as Visual Studio Code, couldn’t easily run their model development tasks on SageMaker AI until now.
With new remote connections to SageMaker AI, developers and data scientists can quickly and seamlessly connect to SageMaker AI from their local VS Code, maintaining access to the custom tools and familiar workflows that help them work most efficiently. Developers can build and train AI models using their local IDE while SageMaker AI manages remote execution, so you can work in your preferred environment while still benefiting from the performance, scalability, and security of SageMaker AI. You can now choose your preferred IDE—whether that is a fully managed cloud IDE or VS Code—to accelerate AI model development using the powerful infrastructure and seamless scalability of SageMaker AI.

CyberArk is a leader in Identity Security, which provides a comprehensive approach centered on privileged controls to protect against advanced cyber threats.
“With remote connections to SageMaker AI, our data scientists have the flexibility to choose the IDE that makes them most productive. Our teams can leverage their customized local setup while accessing the infrastructure and security controls of SageMaker AI. As a security first company, this is extremely important to us as it ensures sensitive data stays protected, while allowing our teams to securely collaborate and boost productivity.” –Nir Feldman, Senior Vice President of Engineering at CyberArk


Build generative AI models and applications faster with fully managed MLflow 3.0
As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Customers such as Cisco, SonRai, and Xometry are already using managed MLflow on SageMaker AI to efficiently manage ML model experiments at scale. The introduction of fully managed MLflow 3.0 on SageMaker AI makes it straightforward to track experiments, monitor training progress, and gain deeper insights into the behavior of models and AI applications using a single tool, helping you accelerate generative AI development.
Conclusion
In this post, we shared some of the new innovations in SageMaker AI to accelerate how you can build and train AI models.
To learn more about these new features, SageMaker AI, and how companies are using this service, refer to the following resources:

Accelerate foundation model development with one click observability in Amazon SageMaker HyperPod
Supercharge your AI workflows by connecting to SageMaker Studio from Visual Studio Code
Accelerating generative AI development with fully managed MLflow 3.0 on Amazon SageMaker AI
Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle
Amazon SageMaker AI
Amazon SageMaker AI customers

About the author
Ankur Mehrotra joined Amazon back in 2008 and is currently the General Manager of Amazon SageMaker AI. Before Amazon SageMaker AI, he worked on building Amazon.com’s advertising systems and automated pricing technology.

Accelerate foundation model development with one-click observability i …

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box dashboard that delivers insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards, optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance.
With a one-click installation of the Amazon Elastic Kubernetes Service (Amazon EKS) add-on for SageMaker HyperPod observability, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter (EFA), integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators. With this unified view, you can trace model development task performance to cluster resources with aggregation of resource metrics at the task level. The solution also abstracts management of collector agents and scrapers across clusters, offering automatic scalability of collectors across nodes as the cluster grows. The dashboards feature intuitive navigation across metrics and visualizations to help users diagnose problems and take action faster. They are also fully customizable, supporting additional PromQL metric imports and custom Grafana layouts.
These capabilities save teams valuable time and resources during FM development, helping accelerate time-to-market and reduce the cost of generative AI innovations. Instead of spending hours or days configuring, collecting, and analyzing cluster telemetry systems, data scientists and machine learning (ML) engineers can now quickly identify training, tuning, and inference disruptions, underutilization of valuable GPU resources, and hardware performance issues. The pre-built, actionable insights of SageMaker HyperPod observability can be used in several common scenarios when operating FM workloads, such as:

Data scientists can monitor resource utilization of submitted training and inference tasks at the per-GPU level, with insights into GPU memory and FLOPs
AI researchers can troubleshoot sub-optimal time-to-first-token (TTFT) for their inferencing workloads by correlating the deployment metrics with the corresponding resource bottlenecks
Cluster administrators can configure customizable alerts to send notifications to multiple destinations such as Amazon Simple Notification Service (Amazon SNS), PagerDuty, and Slack when hardware falls outside of recommended health thresholds
Cluster administrators can quickly identify inefficient resource queuing patterns across teams or namespaces to reconfigure allocation and prioritization policies

In this post, we walk you through installing and using the unified dashboards of the out-of-the-box observability feature in SageMaker HyperPod. We cover the one-click installation from the Amazon SageMaker AI console, navigating the dashboard and metrics it consolidates, and advanced topics such as setting up custom alerts. If you have a running SageMaker HyperPod EKS cluster, then this post will help you understand how to quickly visualize key health and performance telemetry data to derive actionable insights.
Prerequisites
To get started with SageMaker HyperPod observability, you first need to enable AWS IAM Identity Center to use Amazon Managed Grafana. If IAM Identity Center isn’t already enabled in your account, refer to Getting started with IAM Identity Center. Additionally, create at least one user in the IAM Identity Center.
SageMaker HyperPod observability is available for SageMaker HyperPod clusters with an Amazon EKS orchestrator. If you don’t already have a SageMaker HyperPod cluster with an Amazon EKS orchestrator, refer to Amazon SageMaker HyperPod quickstart workshops for instructions to create one.
Enable SageMaker HyperPod observability
To enable SageMaker HyperPod observability, follow these steps:

On the SageMaker AI console, choose Cluster management in the navigation pane.
Open the cluster detail page from the SageMaker HyperPod clusters list.
On the Dashboard tab, in the HyperPod Observability section, choose Quick installation.

SageMaker AI will create a new Prometheus workspace, a new Grafana workspace, and install the SageMaker HyperPod observability add-on to the EKS cluster. The installation typically completes within a few minutes.

When the installation process is complete, you can view the add-on details and metrics available.

Choose Manage users to assign a user to a Grafana workspace.
Choose Open dashboard in Grafana to open the Grafana dashboard.

When prompted, sign in with IAM Identity Center with the user you configured as a prerequisite.

After signing in successfully, you will see the SageMaker HyperPod observability dashboard on Grafana.
SageMaker HyperPod observability dashboards
You can choose from multiple dashboards, including Cluster, Tasks, Inference, Training, and File system.
The Cluster dashboard shows cluster-level metrics such as Total Nodes and Total GPUs, and cluster node-level metrics such as GPU Utilization and Filesystem space available. By default, the dashboard shows metrics for the entire cluster, but you can apply filters to show metrics only for a specific hostname or GPU ID.

The Tasks dashboard is helpful if you want to see resource allocation and utilization metrics at the task level (PyTorchJob, ReplicaSet, and so on). For example, you can compare GPU utilization by multiple tasks running on your cluster and identify which task should be improved.
You can also choose an aggregation level from multiple options (Namespace, Task Name, Task Pod), and apply filters (Namespace, Task Type, Task Name, Pod, GPU ID). You can use these aggregation and filtering capabilities to view metrics at the appropriate granularity and drill down into the specific issue you are investigating.

The Inference dashboard shows inference application specific metrics such as Incoming Requests, Latency, and Time to First Byte (TTFB). The Inference dashboard is particularly useful when you use SageMaker HyperPod clusters for inference and need to monitor the traffic of the requests and performance of models.

Advanced installation
The Quick installation option will create a new workspace for Prometheus and Grafana and select default metrics. If you want to reuse an existing workspace, select additional metrics, or enable Pod logging to Amazon CloudWatch Logs, use the Custom installation option. For more information, see Amazon SageMaker HyperPod.
Set up alerts
Amazon Managed Grafana includes access to an updated alerting system that centralizes alerting information in a single, searchable view (in the navigation pane, choose Alerts to create an alert). Alerting is useful when you want to receive timely notifications, such as when GPU utilization drops unexpectedly, when disk usage of your shared file system exceeds 90%, or when multiple instances become unavailable at the same time. The HyperPod observability dashboard in Amazon Managed Grafana has preconfigured alerts for a few of these key metrics. You can create additional alert rules based on metrics or queries and set up multiple notification channels, such as email and Slack messages. For instructions on setting up alerts with Slack messages, see the Setting Up Slack Alerts for Amazon Managed Grafana GitHub page.
The number of alerts is limited to 100 per Grafana workspace. If you need a more scalable solution, check out the alerting options in Amazon Managed Service for Prometheus.
High-level overview
The following diagram illustrates the architecture of the new HyperPod observability capability.

Clean up
If you want to uninstall the SageMaker HyperPod observability feature (for example, to reconfigure it), clean up the resources in the following order:

Remove the SageMaker HyperPod observability add-on, either using the SageMaker AI console or Amazon EKS console.
Delete the Grafana workspace on the Amazon Managed Grafana console.
Delete the Prometheus workspace on the Amazon Managed Service for Prometheus console.

Conclusion
This post provided an overview and usage instructions for SageMaker HyperPod observability, a newly released observability feature for SageMaker HyperPod. This feature reduces the heavy lifting involved in setting up cluster observability and provides centralized visibility into cluster health status and performance metrics.
For more information about SageMaker HyperPod observability, see Amazon SageMaker HyperPod. Please leave your feedback on this post in the comments section.

About the authors
Tomonori Shimomura is a Principal Solutions Architect on the Amazon SageMaker AI team, where he provides in-depth technical consultation to SageMaker AI customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in Cloud side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelors degree from University of Virginia and is based in Boston, Massachusetts.
Eric Saleh is a Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He is partnering with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions, which included frontier GenAI services for fine-tuning, RAG, and managed inference. He holds a master’s degree in Business Analytics from UCLA Anderson.
Piyush Kadam is a Senior Product Manager on the Amazon SageMaker AI team, where he specializes in LLMOps products that empower both startups and enterprise customers to rapidly experiment with and efficiently govern foundation models. With a Master’s degree in Computer Science from the University of California, Irvine, specializing in distributed systems and artificial intelligence, Piyush brings deep technical expertise to his role in shaping the future of cloud AI products.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
Bhaskar Pratap is a Senior Software Engineer with the Amazon SageMaker AI team. He is passionate about designing and building elegant systems that bring machine learning to people’s fingertips. Additionally, he has extensive experience with building scalable cloud storage services.
Gopi Sekar is an Engineering Leader for the Amazon SageMaker AI team. He is dedicated to assisting customers and developing products that simplify the adaptation of machine learning to address real-world customer challenges.

Accelerating generative AI development with fully managed MLflow 3.0 o …

Amazon SageMaker now offers fully managed support for MLflow 3.0 that streamlines AI experimentation and accelerates your generative AI journey from idea to production. This release transforms managed MLflow from experiment tracking to providing end-to-end observability, reducing time-to-market for generative AI development.
As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Data scientists and developers struggle to effectively analyze the performance of their models and AI applications from experimentation to production, making it hard to find root causes and resolve issues. Teams spend more time integrating tools than improving the quality of their models or generative AI applications.
With the launch of fully managed MLflow 3.0 on Amazon SageMaker AI, you can accelerate generative AI development by making it easier to track experiments and observe behavior of models and AI applications using a single tool. Tracing capabilities in fully managed MLflow 3.0 provide customers the ability to record the inputs, outputs, and metadata at every step of a generative AI application, so developers can quickly identify the source of bugs or unexpected behaviors. By maintaining records of each model and application version, fully managed MLflow 3.0 offers traceability to connect AI responses to their source components, which means developers can quickly trace an issue directly to the specific code, data, or parameters that generated it. With these capabilities, customers using Amazon SageMaker HyperPod to train and deploy foundation models (FMs) can now use managed MLflow to track experiments, monitor training progress, gain deeper insights into the behavior of models and AI applications, and manage their machine learning (ML) lifecycle at scale. This reduces troubleshooting time and enables teams to focus more on innovation.
This post walks you through the core concepts of fully managed MLflow 3.0 on SageMaker and provides technical guidance on how to use the new features to help accelerate your next generative AI application development.
Getting started
You can get started with fully managed MLflow 3.0 on Amazon SageMaker to track experiments, manage models, and streamline your generative AI/ML lifecycle through the AWS Management Console, AWS Command Line Interface (AWS CLI), or API.
Prerequisites
To get started, you need:

An AWS account with billing enabled
An Amazon SageMaker Studio AI domain. To create a domain, refer to Guide to getting set up with Amazon SageMaker AI.

Configure your environment to use SageMaker managed MLflow Tracking Server
To perform the configuration, follow these steps:

In the SageMaker Studio UI, in the Applications pane, choose MLflow and choose Create.

Enter a unique name for your tracking server and specify the Amazon Simple Storage Service (Amazon S3) URI where your experiment artifacts will be stored. When you’re ready, choose Create. By default, SageMaker will select version 3.0 to create the MLflow tracking server.
Optionally, you can choose Update to adjust settings such as server size, tags, or AWS Identity and Access Management (IAM) role.

The server will now be provisioned and started automatically, typically within 25 minutes. After setup, you can launch the MLflow UI from SageMaker Studio to start tracking your ML and generative AI experiments. For more details on tracking server configurations, refer to Machine learning experiments using Amazon SageMaker AI with MLflow in the SageMaker Developer Guide.
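
If you prefer to script the setup instead of using the Studio UI, the SageMaker API exposes an equivalent call. The sketch below uses the boto3 SageMaker client; the server name, S3 URI, and IAM role ARN are placeholders, and the parameters shown should be verified against the current boto3 documentation.

# A scripted alternative to the Studio UI for creating the tracking server.
import boto3

sm = boto3.client("sagemaker")

sm.create_mlflow_tracking_server(
    TrackingServerName="my-mlflow-server",                       # placeholder
    ArtifactStoreUri="s3://my-bucket/mlflow-artifacts",          # placeholder S3 URI
    RoleArn="arn:aws:iam::123456789012:role/MyMlflowRole",       # placeholder IAM role
    TrackingServerSize="Small",
)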
To begin tracking your experiments with your newly created SageMaker managed MLflow tracking server, you need to install both MLflow and the AWS SageMaker MLflow Python packages in your environment. You can use SageMaker Studio managed JupyterLab, SageMaker Studio Code Editor, a local integrated development environment (IDE), or another supported environment where your AI workloads run to track with the SageMaker managed MLflow tracking server.
To install both Python packages using pip:

pip install mlflow==3.0 sagemaker-mlflow==0.1.0

To connect and start logging your AI experiments, parameters, and models directly to the managed MLflow on SageMaker, replace the Amazon Resource Name (ARN) of your SageMaker MLflow tracking server:

import mlflow

# SageMaker MLflow ARN
tracking_server_arn = "arn:aws:sagemaker:<Region>:<Account_id>:mlflow-tracking-server/<Name>"  # Enter ARN
mlflow.set_tracking_uri(tracking_server_arn)
mlflow.set_experiment("customer_support_genai_app")

Now your environment is configured and ready to track your experiments with your SageMaker Managed MLflow tracking server.
Implement generative AI application tracing and version tracking
Generative AI applications have multiple components, including code, configurations, and data, which can be challenging to manage without systematic versioning. A LoggedModel entity in managed MLflow 3.0 represents your AI model, agent, or generative AI application within an experiment. It provides unified tracking of model artifacts, execution traces, evaluation metrics, and metadata throughout the development lifecycle. A trace is a log of inputs, outputs, and intermediate steps from a single application execution. Traces provide insights into application performance, execution flow, and response quality, enabling debugging and evaluation. With LoggedModel, you can track and compare different versions of your application, making it easier to identify issues, deploy the best version, and maintain a clear record of what was deployed and when.
To implement version tracking and tracing with managed MLflow 3.0 on SageMaker, you can establish a versioned model identity using a Git commit hash and set it as the active model context, so all subsequent traces are automatically linked to that specific version. You can then enable automatic logging for Amazon Bedrock interactions and make an API call to Anthropic’s Claude 3.5 Sonnet that is fully traced, with inputs, outputs, and metadata automatically captured within the established model context. Managed MLflow 3.0 tracing is already integrated with various generative AI libraries and provides a one-line automatic tracing experience for all supported libraries. For information about supported libraries, refer to Supported Integrations in the MLflow documentation.

import boto3
import mlflow

# 1. Define your application version using the git commit
git_commit = "abc1234"  # replace with your application's short git commit hash
logged_model = "customer_support_agent"
logged_model_name = f"{logged_model}-{git_commit}"

# 2. Set the active model context - traces will be linked to this
mlflow.set_active_model(name=logged_model_name)

# 3. Set auto logging for your model provider
mlflow.bedrock.autolog()

# 4. Chat with your LLM provider
# Ensure that your boto3 client has the necessary auth information
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="<REPLACE_WITH_YOUR_AWS_REGION>",
)

model = "anthropic.claude-3-5-sonnet-20241022-v2:0"
messages = [{"role": "user", "content": [{"text": "Hello!"}]}]
# All intermediate executions within the chat session will be logged
bedrock.converse(modelId=model, messages=messages)

After logging this information, you can track these generative AI experiments and the logged model for the agent in the managed MLflow 3.0 tracking server UI, as shown in the following screenshot.

In addition to the one-line auto tracing functionality, MLflow offers a Python SDK for manually instrumenting your code and manipulating traces. Refer to the code sample notebook sagemaker_mlflow_strands.ipynb in the aws-samples GitHub repository, where we use MLflow manual instrumentation to trace Strands Agents. With tracing capabilities in fully managed MLflow 3.0, you can record the inputs, outputs, and metadata associated with each intermediate step of a request, so you can pinpoint the source of bugs and unexpected behaviors.
These capabilities provide observability in your AI workload by capturing detailed information about the execution of the workload services, nodes, and tools that you can see under the Traces tab.
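As a minimal sketch of what manual instrumentation can look like (the function and span names here are illustrative, not taken from the sample notebook), you can wrap your own functions with the tracing decorator and span context manager provided by the MLflow SDK:

import mlflow

@mlflow.trace(name="retrieve_context")  # records inputs, outputs, and latency for this step
def retrieve_context(question: str) -> str:
    # Placeholder retrieval logic for illustration only
    return f"Context for: {question}"

def answer_question(question: str) -> str:
    with mlflow.start_span(name="answer_question") as span:
        span.set_inputs({"question": question})
        context = retrieve_context(question)
        answer = f"Answer based on: {context}"  # replace with your LLM call
        span.set_outputs({"answer": answer})
        return answer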

You can inspect each trace, as shown in the following image, by choosing the request ID in the traces tab for the desired trace.

Fully managed MLflow 3.0 on Amazon SageMaker also introduces the capability to tag traces. Tags are mutable key-value pairs you can attach to traces to add valuable metadata and context. Trace tags make it straightforward to organize, search, and filter traces based on criteria such as user session, environment, model version, or performance characteristics. You can add, update, or remove tags at any stage—during trace execution using mlflow.update_current_trace() or after a trace is logged using the MLflow APIs or UI. Managed MLflow 3.0 makes it seamless to search and analyze traces, helping teams quickly pinpoint issues, compare agent behaviors, and optimize performance. The tracing UI and Python API both support powerful filtering, so you can drill down into traces based on attributes such as status, tags, user, environment, or execution time as shown in the screenshot below. For example, you can instantly find all traces with errors, filter by production environment, or search for traces from a specific request. This capability is essential for debugging, cost analysis, and continuous improvement of generative AI applications.
The following screenshot displays the traces returned when searching for the tag ‘Production’.

The following code snippet shows how you can search for all traces in production with a successful status:

# Search for traces in production environment with successful status
traces = mlflow.search_traces(filter_string="attributes.status = 'OK' AND tags.environment = 'production'")
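For completeness, the following is a minimal sketch of how tags such as environment could be attached while a trace is being recorded; the tag keys and values shown are illustrative:

import mlflow

@mlflow.trace
def handle_request(question: str) -> str:
    # Attach metadata to the currently active trace so it can be filtered later
    mlflow.update_current_trace(tags={"environment": "production", "model_version": "v1"})
    return f"Response to: {question}"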

Generative AI use case walkthrough with MLflow tracing
Building and deploying generative AI agents such as chat-based assistants, code generators, or customer support assistants requires deep visibility into how these agents interact with large language models (LLMs) and external tools. In a typical agentic workflow, the agent loops through reasoning steps, calling LLMs and using tools or subsystems such as search APIs or Model Context Protocol (MCP) servers until it completes the user’s task. These complex, multistep interactions make debugging, optimization, and cost tracking especially challenging.
Traditional observability tools fall short in generative AI because agent decisions, tool calls, and LLM responses are dynamic and context-dependent. Managed MLflow 3.0 tracing provides comprehensive observability by capturing every LLM call, tool invocation, and decision point in your agent’s workflow. You can use this end-to-end trace data to:

Debug agent behavior – Pinpoint where an agent’s reasoning deviates or why it produces unexpected outputs.
Monitor tool usage – Discover how and when external tools are called and analyze their impact on quality and cost.
Track performance and cost – Measure latency, token usage, and API costs at each step of the agentic loop.
Audit and govern – Maintain detailed logs for compliance and analysis.

Imagine a real-world scenario using the managed MLflow 3.0 tracing UI for a sample finance customer support agent equipped with a tool to retrieve financial data from a datastore. While developing a generative AI customer support agent or analyzing its behavior in production, you can observe how the agent responds and whether its execution optionally calls a product database tool for more accurate recommendations. For illustration, the first trace, shown in the following screenshot, shows the agent handling a user query without invoking any tools. The trace captures the prompt, the agent response, and the agent's decision points. The agent's response lacks product-specific details, and the trace makes it clear that no external tool was called, so you can quickly identify this behavior in the agent's reasoning chain.

The second trace, shown in the following screenshot, captures the same agent, but this time it decides to call the product database tool. The trace logs the tool invocation, the returned product data, and how the agent incorporates this information into its final response. Here, you can observe improved answer quality, a slight increase in latency, and additional API cost with higher token usage.

By comparing these traces side by side, you can debug why the agent sometimes skips using the tool, optimize when and how tools are called, and balance quality against latency and cost. MLflow's tracing UI makes these agentic loops transparent, actionable, and seamless to analyze at scale. This post's sample agent and all necessary code are available on the aws-samples GitHub repository, where you can replicate and adapt it for your own applications.
Cleanup
After it's created, a SageMaker managed MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop tracking servers when they're not in use to save costs, or you can delete them using the API or the SageMaker Studio UI. For more details on pricing, refer to Amazon SageMaker pricing.
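As a hedged sketch (assuming the boto3 SageMaker client's MLflow tracking server operations and an illustrative server name), stopping or deleting a tracking server programmatically could look like this:

import boto3

sm = boto3.client("sagemaker")

# Stop the tracking server to pause billing while keeping its data
sm.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Or delete it entirely when it is no longer needed
# sm.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")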
Conclusion
Fully managed MLflow 3.0 on Amazon SageMaker AI is now available. Get started with sample code in the aws-samples GitHub repository. We invite you to explore this new capability and experience the enhanced efficiency and control it brings to your ML projects. To learn more, visit Machine Learning Experiments using Amazon SageMaker with MLflow.
For more information, visit the SageMaker Developer Guide and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

About the authors
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his three-year-old sheepadoodle!
Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
Amit Modi is the product leader for SageMaker AIOps and Governance, and Responsible AI at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.
Rahul Easwar is a Senior Product Manager at AWS, leading managed MLflow and Partner AI Apps within the SageMaker AIOps team. With over 15 years of experience spanning startups to enterprise technology, he leverages his entrepreneurial background and MBA from Chicago Booth to build scalable ML platforms that simplify AI adoption for organizations worldwide. Connect with Rahul on LinkedIn to learn more about his work in ML platforms and enterprise AI solutions.

Hugging Face Releases SmolLM3: A 3B Long-Context, Multilingual Reasoni …

Hugging Face just released SmolLM3, the latest version of its “Smol” language models, designed to deliver strong multilingual reasoning over long contexts using a compact 3B-parameter architecture. While most high-context capable models typically push beyond 7B parameters, SmolLM3 manages to offer state-of-the-art (SoTA) performance with significantly fewer parameters—making it more cost-efficient and deployable on constrained hardware, without compromising on capabilities like tool usage, multi-step reasoning, and language diversity.

Overview of SmolLM3

SmolLM3 stands out as a compact, multilingual, and dual-mode long-context language model capable of handling sequences up to 128k tokens. It was trained on 11 trillion tokens, positioning it competitively against models like Mistral, LLaMA 2, and Falcon. Despite its size, SmolLM3 achieves surprisingly strong tool usage performance and few-shot reasoning ability—traits more commonly associated with models double or triple its size.

SmolLM3 was released in two variants:

SmolLM3-3B-Base: The base language model trained on the 11T-token corpus.

SmolLM3-3B-Instruct: An instruction-tuned variant optimized for reasoning and tool use.

Both models are publicly available under the Apache 2.0 license on Hugging Face’s Model Hub.
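The following is a minimal usage sketch with the Hugging Face transformers library; the repository ID shown (HuggingFaceTB/SmolLM3-3B) is an assumption based on the release naming and should be checked against the Model Hub:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed Hub ID for the instruct variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a short reply
messages = [{"role": "user", "content": "Summarize the benefits of small long-context models."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))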

Key Features

1. Long Context Reasoning (up to 128k tokens)

SmolLM3 utilizes a modified attention mechanism to efficiently process extremely long contexts—up to 128,000 tokens. This capability is crucial for tasks involving extended documents, logs, or structured records where context length directly affects comprehension and accuracy.

2. Dual Mode Reasoning

The instruction-tuned SmolLM3-3B supports dual-mode reasoning:

Instruction-following for chat-style and tool-augmented tasks.

Multilingual QA and generation for tasks in multiple languages.

This bifurcation allows the model to excel in both open-ended generation and structured reasoning, making it suitable for applications ranging from RAG pipelines to agent workflows.

3. Multilingual Capabilities

Trained on a multilingual corpus, SmolLM3 supports six languages: English, French, Spanish, German, Italian, and Portuguese. It performs well on benchmarks like XQuAD and MGSM, demonstrating its ability to generalize across linguistic boundaries with minimal performance drop.

4. Compact Size with SoTA Performance

At just 3 billion parameters, SmolLM3 achieves performance close to or on par with larger models such as Mistral-7B on multiple downstream tasks. This is made possible by the scale and quality of its training data (11T tokens) and careful architectural tuning.

5. Tool Use and Structured Outputs

The model demonstrates impressive performance on tool-calling tasks—both in prompt-based workflows and with structured outputs. It correctly follows schema-driven input-output constraints and interfaces well with systems requiring deterministic behavior, such as autonomous agents and API-driven environments.

Technical Training Details

SmolLM3 was trained on an internal mixture curated by Hugging Face, consisting of high-quality web content, code, academic papers, and multilingual sources. The 11T-token training run was done using multi-node distributed training strategies on GPU clusters, employing optimizations like Flash Attention v2 for efficient long-sequence training. The tokenizer is a 128k-token SentencePiece model, shared across all supported languages.

For long context support, Hugging Face employed linear and grouped attention mechanisms that minimize quadratic complexity while retaining performance. This enabled the model to handle context lengths up to 128k during both training and inference—without memory bottlenecks that plague dense transformers at this scale.

The SmolLM3-3B instruction-tuned variant was further trained using Hugging Face’s trlx library for alignment with chat instructions, reasoning tasks, and tool usage demonstrations.

Performance Benchmarks

SmolLM3 performs strongly on multiple multilingual and reasoning benchmarks:

XQuAD (Multilingual QA): Competitive scores in all six supported languages.

MGSM (Multilingual Grade School Math): Outperforms several larger models in zero-shot settings.

ToolQA and MultiHopQA: Shows strong multi-step reasoning and context grounding.

ARC and MMLU: High accuracy in commonsense and professional knowledge domains.

While it does not surpass the latest 7B and 13B models on every benchmark, SmolLM3’s performance-to-parameter ratio remains one of the highest in its class.

Use Cases and Applications

SmolLM3 is particularly suited for:

Low-cost, multilingual AI deployments in chatbots, helpdesk systems, and document summarizers.

Lightweight RAG and retrieval-based systems that benefit from long-context understanding.

Tool-augmented agents requiring schema adherence and deterministic tool invocation.

Edge deployments and private environments where smaller models are necessary due to hardware or data privacy constraints.

Conclusion

SmolLM3 exemplifies a new generation of small-yet-capable language models. Its combination of multilingual support, long-context handling, and strong reasoning—all within a 3B parameter footprint—marks a significant step forward in model efficiency and accessibility. Hugging Face’s release demonstrates that with the right training recipe and architectural design, smaller models can still deliver robust performance in complex tasks traditionally reserved for much larger LLMs.

Check out the SmolLM3-3B-Base and SmolLM3-3B-Instruct models. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and YouTube, and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
The post Hugging Face Releases SmolLM3: A 3B Long-Context, Multilingual Reasoning Model appeared first on MarkTechPost.

A Code Implementation for Designing Intelligent Multi-Agent Workflows …

In this tutorial, we explore the power and flexibility of the beeai-framework by building a fully functional multi-agent system from the ground up. We walk through the essential components, custom agents, tools, memory management, and event monitoring, to show how BeeAI simplifies the development of intelligent, cooperative agents. Along the way, we demonstrate how these agents can perform complex tasks, such as market research, code analysis, and strategic planning, using a modular, production-ready pattern.

import subprocess
import sys
import asyncio
import json
from typing import Dict, List, Any, Optional
from datetime import datetime
import os

def install_packages():
    packages = [
        "beeai-framework",
        "requests",
        "beautifulsoup4",
        "numpy",
        "pandas",
        "pydantic"
    ]

    print("Installing required packages...")
    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f" {package} installed successfully")
        except subprocess.CalledProcessError as e:
            print(f" Failed to install {package}: {e}")
    print("Installation complete!")

install_packages()

try:
    from beeai_framework import ChatModel
    from beeai_framework.agents import Agent
    from beeai_framework.tools import Tool
    from beeai_framework.workflows import Workflow
    BEEAI_AVAILABLE = True
    print(" BeeAI Framework imported successfully")
except ImportError as e:
    print(f" BeeAI Framework import failed: {e}")
    print("Falling back to custom implementation...")
    BEEAI_AVAILABLE = False

We begin by installing all the required packages, including the beeai-framework, to ensure our environment is ready for multi-agent development. Once installed, we attempt to import BeeAI’s core modules. If the import fails, we gracefully fall back to a custom implementation to maintain workflow functionality.

class MockChatModel:
    """Mock LLM for demonstration purposes"""
    def __init__(self, model_name: str = "mock-llm"):
        self.model_name = model_name

    async def generate(self, messages: List[Dict[str, str]]) -> str:
        """Generate a mock response"""
        last_message = messages[-1]['content'] if messages else ""

        if "market" in last_message.lower():
            return "Market analysis shows strong growth in AI frameworks with 42% YoY increase. Key competitors include LangChain, CrewAI, and AutoGen."
        elif "code" in last_message.lower():
            return "Code analysis reveals good structure with async patterns. Consider adding more error handling and documentation."
        elif "strategy" in last_message.lower():
            return "Strategic recommendation: Focus on ease of use, strong documentation, and enterprise features to compete effectively."
        else:
            return f"Analyzed: {last_message[:100]}... Recommendation: Implement best practices for scalability and maintainability."


class CustomTool:
    """Base class for custom tools"""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    async def run(self, input_data: str) -> str:
        """Override this method in subclasses"""
        raise NotImplementedError

We define a MockChatModel to simulate LLM behavior when BeeAI is unavailable, allowing us to test and prototype workflows without relying on external APIs. Alongside it, we create a CustomTool base class, which serves as a blueprint for task-specific tools that our agents can use, laying the foundation for modular, tool-augmented agent capabilities.

class MarketResearchTool(CustomTool):
    """Custom tool for market research and competitor analysis"""

    def __init__(self):
        super().__init__(
            name="market_research",
            description="Analyzes market trends and competitor information"
        )
        self.market_data = {
            "AI_frameworks": {
                "competitors": ["LangChain", "CrewAI", "AutoGen", "Haystack", "Semantic Kernel"],
                "market_size": "$2.8B",
                "growth_rate": "42% YoY",
                "key_trends": ["Multi-agent systems", "Production deployment", "Tool integration", "Enterprise adoption"]
            },
            "enterprise_adoption": {
                "rate": "78%",
                "top_use_cases": ["Customer support", "Data analysis", "Code generation", "Document processing"],
                "challenges": ["Reliability", "Cost control", "Integration complexity", "Governance"]
            }
        }

    async def run(self, query: str) -> str:
        """Simulate market research based on query"""
        query_lower = query.lower()

        if "competitor" in query_lower or "competition" in query_lower:
            data = self.market_data["AI_frameworks"]
            return f"""Market Analysis Results:

Key Competitors: {', '.join(data['competitors'])}
Market Size: {data['market_size']}
Growth Rate: {data['growth_rate']}
Key Trends: {', '.join(data['key_trends'])}

Recommendation: Focus on differentiating features like simplified deployment, better debugging tools, and enterprise-grade security."""

        elif "adoption" in query_lower or "enterprise" in query_lower:
            data = self.market_data["enterprise_adoption"]
            return f"""Enterprise Adoption Analysis:

Adoption Rate: {data['rate']}
Top Use Cases: {', '.join(data['top_use_cases'])}
Main Challenges: {', '.join(data['challenges'])}

Recommendation: Address reliability and cost control concerns through better monitoring and resource management features."""

        else:
            return "Market research available for: competitor analysis, enterprise adoption, or specific trend analysis. Please specify your focus area."

We implement the MarketResearchTool as a specialized extension of our CustomTool base class. This tool simulates real-world market intelligence by returning pre-defined insights on AI framework trends, key competitors, adoption rates, and industry challenges. With this, we equip our agents to make informed, data-driven recommendations during workflow execution.

class CodeAnalysisTool(CustomTool):
    """Custom tool for analyzing code patterns and suggesting improvements"""

    def __init__(self):
        super().__init__(
            name="code_analysis",
            description="Analyzes code structure and suggests improvements"
        )

    async def run(self, code_snippet: str) -> str:
        """Analyze code and provide insights"""
        analysis = {
            "lines": len(code_snippet.split('\n')),
            "complexity": "High" if len(code_snippet) > 500 else "Medium" if len(code_snippet) > 200 else "Low",
            "async_usage": "Yes" if "async" in code_snippet or "await" in code_snippet else "No",
            "error_handling": "Present" if "try:" in code_snippet or "except:" in code_snippet else "Missing",
            "documentation": "Good" if '"""' in code_snippet or "'''" in code_snippet else "Needs improvement",
            "imports": "Present" if "import " in code_snippet else "None detected",
            "classes": len([line for line in code_snippet.split('\n') if line.strip().startswith('class ')]),
            "functions": len([line for line in code_snippet.split('\n') if line.strip().startswith('def ') or line.strip().startswith('async def ')])
        }

        suggestions = []
        if analysis["error_handling"] == "Missing":
            suggestions.append("Add try-except blocks for error handling")
        if analysis["documentation"] == "Needs improvement":
            suggestions.append("Add docstrings and comments")
        if "print(" in code_snippet:
            suggestions.append("Consider using proper logging instead of print statements")
        if analysis["async_usage"] == "Yes" and "await" not in code_snippet:
            suggestions.append("Ensure proper await usage with async functions")
        if analysis["complexity"] == "High":
            suggestions.append("Consider breaking down into smaller functions")

        return f"""Code Analysis Report:

Structure:
- Lines of code: {analysis['lines']}
- Complexity: {analysis['complexity']}
- Classes: {analysis['classes']}
- Functions: {analysis['functions']}

Quality Metrics:
- Async usage: {analysis['async_usage']}
- Error handling: {analysis['error_handling']}
- Documentation: {analysis['documentation']}

Suggestions:
{chr(10).join(f"• {suggestion}" for suggestion in suggestions) if suggestions else "• Code looks good! Following best practices."}

Overall Score: {10 - len(suggestions) * 2}/10"""


class CustomAgent:
    """Custom agent implementation"""

    def __init__(self, name: str, role: str, instructions: str, tools: List[CustomTool], llm=None):
        self.name = name
        self.role = role
        self.instructions = instructions
        self.tools = tools
        self.llm = llm or MockChatModel()
        self.memory = []

    async def run(self, task: str) -> Dict[str, Any]:
        """Execute agent task"""
        print(f" {self.name} ({self.role}) processing task...")

        self.memory.append({"type": "task", "content": task, "timestamp": datetime.now()})

        task_lower = task.lower()
        tool_used = None
        tool_result = None

        for tool in self.tools:
            if tool.name == "market_research" and ("market" in task_lower or "competitor" in task_lower):
                tool_result = await tool.run(task)
                tool_used = tool.name
                break
            elif tool.name == "code_analysis" and ("code" in task_lower or "analyze" in task_lower):
                tool_result = await tool.run(task)
                tool_used = tool.name
                break

        messages = [
            {"role": "system", "content": f"You are {self.role}. {self.instructions}"},
            {"role": "user", "content": task}
        ]

        if tool_result:
            messages.append({"role": "system", "content": f"Tool {tool_used} provided: {tool_result}"})

        response = await self.llm.generate(messages)

        self.memory.append({"type": "response", "content": response, "timestamp": datetime.now()})

        return {
            "agent": self.name,
            "task": task,
            "tool_used": tool_used,
            "tool_result": tool_result,
            "response": response,
            "success": True
        }

We now implement the CodeAnalysisTool, which enables our agents to assess code snippets based on structure, complexity, documentation, and error handling. This tool generates insightful suggestions to improve code quality. We also define the CustomAgent class, equipping each agent with its own role, instructions, memory, tools, and access to an LLM. This design allows each agent to decide whether a tool is needed intelligently and then synthesize responses using both analysis and LLM reasoning, ensuring adaptable and context-aware behavior.

class WorkflowMonitor:
    """Monitor and log workflow events"""

    def __init__(self):
        self.events = []
        self.start_time = datetime.now()

    def log_event(self, event_type: str, data: Dict[str, Any]):
        """Log workflow events"""
        timestamp = datetime.now()
        self.events.append({
            "timestamp": timestamp,
            "duration": (timestamp - self.start_time).total_seconds(),
            "event_type": event_type,
            "data": data
        })
        print(f"[{timestamp.strftime('%H:%M:%S')}] {event_type}: {data.get('agent', 'System')}")

    def get_summary(self):
        """Get monitoring summary"""
        return {
            "total_events": len(self.events),
            "total_duration": (datetime.now() - self.start_time).total_seconds(),
            "event_types": list(set([e["event_type"] for e in self.events])),
            "events": self.events
        }


class CustomWorkflow:
    """Custom workflow implementation"""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.agents = []
        self.monitor = WorkflowMonitor()

    def add_agent(self, agent: CustomAgent):
        """Add agent to workflow"""
        self.agents.append(agent)
        self.monitor.log_event("agent_added", {"agent": agent.name, "role": agent.role})

    async def run(self, tasks: List[str]) -> Dict[str, Any]:
        """Execute workflow with tasks"""
        self.monitor.log_event("workflow_started", {"tasks": len(tasks)})

        results = []
        context = {"shared_insights": []}

        for i, task in enumerate(tasks):
            agent = self.agents[i % len(self.agents)]

            if context["shared_insights"]:
                enhanced_task = f"{task}\n\nContext from previous analysis:\n" + "\n".join(context["shared_insights"][-2:])
            else:
                enhanced_task = task

            result = await agent.run(enhanced_task)
            results.append(result)

            context["shared_insights"].append(f"{agent.name}: {result['response'][:200]}...")

            self.monitor.log_event("task_completed", {
                "agent": agent.name,
                "task_index": i,
                "success": result["success"]
            })

        self.monitor.log_event("workflow_completed", {"total_tasks": len(tasks)})

        return {
            "workflow": self.name,
            "results": results,
            "context": context,
            "summary": self._generate_summary(results)
        }

    def _generate_summary(self, results: List[Dict[str, Any]]) -> str:
        """Generate workflow summary"""
        summary_parts = []

        for result in results:
            summary_parts.append(f"• {result['agent']}: {result['response'][:150]}...")

        return f"""Workflow Summary for {self.name}:

{chr(10).join(summary_parts)}

Key Insights:
• Market opportunities identified in AI framework space
• Technical architecture recommendations provided
• Strategic implementation plan outlined
• Multi-agent collaboration demonstrated successfully"""

We implement the WorkflowMonitor to log and track events throughout the execution, giving us real-time visibility into the actions taken by each agent. With the CustomWorkflow class, we orchestrate the entire multi-agent process, assigning tasks, preserving shared context across agents, and capturing all relevant insights. This structure ensures that we not only execute tasks in a coordinated and transparent way but also generate a comprehensive summary that highlights collaboration and key outcomes.

async def advanced_workflow_demo():
    """Demonstrate advanced multi-agent workflow"""

    print(" Advanced Multi-Agent Workflow Demo")
    print("=" * 50)

    workflow = CustomWorkflow(
        name="Advanced Business Intelligence System",
        description="Multi-agent system for comprehensive business analysis"
    )

    market_agent = CustomAgent(
        name="MarketAnalyst",
        role="Senior Market Research Analyst",
        instructions="Analyze market trends, competitor landscape, and business opportunities. Provide data-driven insights with actionable recommendations.",
        tools=[MarketResearchTool()],
        llm=MockChatModel()
    )

    tech_agent = CustomAgent(
        name="TechArchitect",
        role="Technical Architecture Specialist",
        instructions="Evaluate technical solutions, code quality, and architectural decisions. Focus on scalability, maintainability, and best practices.",
        tools=[CodeAnalysisTool()],
        llm=MockChatModel()
    )

    strategy_agent = CustomAgent(
        name="StrategicPlanner",
        role="Strategic Business Planner",
        instructions="Synthesize market and technical insights into comprehensive strategic recommendations. Focus on ROI, risk assessment, and implementation roadmaps.",
        tools=[],
        llm=MockChatModel()
    )

    workflow.add_agent(market_agent)
    workflow.add_agent(tech_agent)
    workflow.add_agent(strategy_agent)

    tasks = [
        "Analyze the current AI framework market landscape and identify key opportunities for a new multi-agent framework targeting enterprise users.",
        """Analyze this code architecture pattern and provide technical assessment:

async def multi_agent_workflow():
    agents = [ResearchAgent(), AnalysisAgent(), SynthesisAgent()]
    context = SharedContext()

    for agent in agents:
        try:
            result = await agent.run(context.get_task())
            if result.success:
                context.add_insight(result.data)
            else:
                context.add_error(result.error)
        except Exception as e:
            logger.error(f"Agent {agent.name} failed: {e}")

    return context.synthesize_recommendations()""",
        "Based on the market analysis and technical assessment, create a comprehensive strategic plan for launching a competitive AI framework with focus on multi-agent capabilities and enterprise adoption."
    ]

    print("\n Executing Advanced Workflow...")
    result = await workflow.run(tasks)

    print("\n Workflow Completed Successfully!")
    print("=" * 50)
    print(" COMPREHENSIVE ANALYSIS RESULTS")
    print("=" * 50)
    print(result["summary"])

    print("\n WORKFLOW MONITORING SUMMARY")
    print("=" * 30)
    summary = workflow.monitor.get_summary()
    print(f"Total Events: {summary['total_events']}")
    print(f"Total Duration: {summary['total_duration']:.2f} seconds")
    print(f"Event Types: {', '.join(summary['event_types'])}")

    return workflow, result


async def simple_tool_demo():
    """Demonstrate individual tool functionality"""

    print("\n Individual Tool Demo")
    print("=" * 30)

    market_tool = MarketResearchTool()
    code_tool = CodeAnalysisTool()

    print("Available Tools:")
    print(f"• {market_tool.name}: {market_tool.description}")
    print(f"• {code_tool.name}: {code_tool.description}")

    print("\n Market Research Analysis:")
    market_result = await market_tool.run("competitor analysis in AI frameworks")
    print(market_result)

    print("\n Code Analysis:")
    sample_code = '''
import asyncio
from typing import List, Dict

class AgentManager:
    """Manages multiple AI agents"""

    def __init__(self):
        self.agents = []
        self.results = []

    async def add_agent(self, agent):
        """Add agent to manager"""
        self.agents.append(agent)

    async def run_all(self, task: str) -> List[Dict]:
        """Run task on all agents"""
        results = []
        for agent in self.agents:
            try:
                result = await agent.execute(task)
                results.append(result)
            except Exception as e:
                print(f"Agent failed: {e}")
                results.append({"error": str(e)})
        return results
'''

    code_result = await code_tool.run(sample_code)
    print(code_result)

We demonstrate two powerful workflows. First, in the individual tool demo, we directly test the capabilities of our MarketResearchTool and CodeAnalysisTool, ensuring they generate relevant insights independently. Then, we bring everything together in the advanced workflow demo, where we deploy three specialized agents, MarketAnalyst, TechArchitect, and StrategicPlanner, to tackle business analysis tasks collaboratively.

async def main():
    """Main demo function"""

    print(" Advanced BeeAI Framework Tutorial")
    print("=" * 40)
    print("This tutorial demonstrates:")
    print("• Multi-agent workflows")
    print("• Custom tool development")
    print("• Memory management")
    print("• Event monitoring")
    print("• Production-ready patterns")

    if BEEAI_AVAILABLE:
        print("• Using real BeeAI Framework")
    else:
        print("• Using custom implementation (BeeAI not available)")

    print("=" * 40)

    await simple_tool_demo()

    print("\n" + "=" * 50)
    await advanced_workflow_demo()

    print("\n Tutorial Complete!")
    print("\nNext Steps:")
    print("1. Install BeeAI Framework properly: pip install beeai-framework")
    print("2. Configure your preferred LLM (OpenAI, Anthropic, local models)")
    print("3. Explore the official BeeAI documentation")
    print("4. Build custom agents for your specific use case")
    print("5. Deploy to production with proper monitoring")


if __name__ == "__main__":
    try:
        import nest_asyncio
        nest_asyncio.apply()
        print(" Applied nest_asyncio for Colab compatibility")
    except ImportError:
        print(" nest_asyncio not available - may not work in some environments")

    asyncio.run(main())

We wrap up our tutorial with the main() function, which ties together everything we’ve built, demonstrating both tool-level capabilities and a full multi-agent business intelligence workflow. Whether we’re running BeeAI natively or using a fallback setup, we ensure compatibility with environments like Google Colab using nest_asyncio. With this structure in place, we’re ready to scale our agent systems, explore deeper use cases, and confidently deploy production-ready AI workflows.

In conclusion, we’ve built and executed a robust multi-agent workflow using the BeeAI framework (or a custom equivalent), showcasing its potential in real-world business intelligence applications. We’ve seen how easy it is to create agents with specific roles, attach tools for task augmentation, and monitor execution in a transparent way.

Check out the code. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube, and Spotify, and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
The post A Code Implementation for Designing Intelligent Multi-Agent Workflows with the BeeAI Framework appeared first on MarkTechPost.

Accelerate AI development with Amazon Bedrock API keys

Today, we’re excited to announce a significant improvement to the developer experience of Amazon Bedrock: API keys. API keys provide quick access to the Amazon Bedrock APIs, streamlining the authentication process so that developers can focus on building rather than configuration.
CamelAI is an open-source, modular framework for building intelligent multi-agent systems for data generation, world simulation, and task automation.

“As a startup with limited resources, streamlined customer onboarding is critical to our success. The Amazon Bedrock API keys enable us to onboard enterprise customers in minutes rather than hours. With Bedrock, our customers can quickly provision access to leading AI models and seamlessly integrate them into CamelAI,”
said Miguel Salinas, CTO, CamelAI.

In this post, we explore how API keys work and how you can start using them today.
API key authentication
Amazon Bedrock now provides API key access to streamline integration with tools and frameworks that expect API key-based authentication. The Amazon Bedrock and Amazon Bedrock runtime SDKs support API key authentication for methods including on-demand inference, provisioned throughput inference, model fine-tuning, distillation, and evaluation.
The diagram compares the default authentication process to Amazon Bedrock (in orange) with the API keys approach (in blue). In the default process, you must create an identity in AWS IAM Identity Center or IAM, attach IAM policies to provide permissions to perform API operations, and generate credentials, which you can then use to make API calls. The grey boxes in the diagram highlight the steps that Amazon Bedrock now streamlines when generating an API key. Developers can now authenticate and access Amazon Bedrock APIs with minimal setup overhead.

You can generate API keys in the Amazon Bedrock console, choosing between two types.
With long-term API keys, you can set expiration times ranging from 1 day to no expiration. These keys are associated with an IAM user that Amazon Bedrock automatically creates for you. The system attaches the AmazonBedrockLimitedAccess managed policy to this IAM user, and you can then modify permissions as needed through the IAM service. We recommend using long-term keys primarily for exploration of Amazon Bedrock.
Short-term API keys use the IAM permissions from your current IAM principal and expire when your account’s session ends or can last up to 12 hours. Short-term API keys use AWS Signature Version 4 for authentication. For continuous application use, you can implement API key refreshing with a script as shown in this example. We recommend that you use short-term API keys for setups that require a higher level of security.
Making Your First API Call
Once you have access to foundation models, getting started with an Amazon Bedrock API key is straightforward. Here’s how to make your first API call using the AWS SDK for Python (Boto3) and API keys:
Generate an API key
To generate an API key, follow these steps:

Sign in to the AWS Management Console and open the Amazon Bedrock console
In the left navigation panel, select API keys
Choose either Generate short-term API key or Generate long-term API key
For long-term keys, set your desired expiration time and optionally configure advanced permissions
Choose Generate and copy your API key

Set Your API Key as Environment Variable
You can set your API key as an environment variable so that it’s automatically recognized when you make API requests:

# To set the API key as an environment variable, you can open a terminal and run the following command:
export AWS_BEARER_TOKEN_BEDROCK=${api-key}
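
If you prefer to configure the key inside your Python process instead of the shell (for example, in a notebook), a simple alternative is to set the same environment variable before creating the client; the key value here is a placeholder:

import os

# Set the Bedrock API key for this process only
os.environ["AWS_BEARER_TOKEN_BEDROCK"] = "<your-api-key>"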

The Boto3 SDK automatically detects your environment variable when you create an Amazon Bedrock client.
Make Your First API Call
You can now make API calls to Amazon Bedrock in multiple ways:

Using curl

curl -X POST "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [{"text": "Hello"}]
      }
    ]
  }'

Using the Amazon Bedrock SDK:

import boto3

# Create an Amazon Bedrock client
client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"  # If you've configured a default region, you can omit this line
)

# Define the model and message
model_id = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
messages = [{"role": "user", "content": [{"text": "Hello"}]}]

response = client.converse(
    modelId=model_id,
    messages=messages,
)

# Print the response
print(response['output']['message']['content'][0]['text'])

You can also use native libraries like Python Requests:

import requests
import os

url = "https://bedrock-runtime.us-east-1.amazonaws.com/model/us.anthropic.claude-3-5-haiku-20241022-v1:0/converse"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [{"text": "Hello"}]
        }
    ]
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['AWS_BEARER_TOKEN_BEDROCK']}"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.text)

Bridging developer experience and enterprise security requirements
Enterprise administrators can now streamline their user onboarding to Amazon Bedrock foundation models. With setups that require a higher level of security, administrators can enable short-term API keys for their users. Short-term API keys use AWS Signature Version 4 and existing IAM principals, maintaining established access controls implemented by administrators.
For audit and compliance purposes, all API calls are logged in AWS CloudTrail. API keys are passed as authorization headers to API requests and aren’t logged.
Conclusion
Amazon Bedrock API keys are available in 20 AWS Regions where Amazon Bedrock is available: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Hyderabad, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Milan, Paris, Spain, Stockholm, Zurich), and South America (São Paulo). To learn more about API keys in Amazon Bedrock, visit the API Keys documentation in the Amazon Bedrock user guide.
Give API keys a try in the Amazon Bedrock console today and send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

About the Authors
Sofian Hamiti is a technology leader with over 10 years of experience building AI solutions and leading high-performing teams to maximize customer outcomes. He is passionate about empowering diverse talent to drive global impact and achieve their career aspirations.
Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and go-to-market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.
Nakul Vankadari Ramesh is a Software Development Engineer with over 7 years of experience building large-scale distributed systems. He currently works on the Amazon Bedrock team, helping accelerate the development of generative AI capabilities. Previously, he contributed to Amazon Managed Blockchain, focusing on scalable and reliable infrastructure.
Huong Nguyen is a Principal Product Manager at AWS. She is a product leader at Amazon Bedrock, with 18 years of experience building customer-centric and data-driven products. She is passionate about democratizing responsible machine learning and generative AI to enable customer experience and business innovation. Outside of work, she enjoys spending time with family and friends, listening to audiobooks, traveling, and gardening.
Massimiliano Angelino is Lead Architect for the EMEA Prototyping team. For the last three and a half years, he has been an IoT Specialist Solutions Architect with a particular focus on edge computing, and he contributed to the launch of the AWS IoT Greengrass v2 service and its integration with Amazon SageMaker Edge Manager. Based in Stockholm, he enjoys skating on frozen lakes.

Accelerating data science innovation: How Bayer Crop Science used AWS …

The world’s population is expanding at a rapid rate. The growing global population requires innovative solutions to produce food, fiber, and fuel, while restoring natural resources like soil and water and addressing climate change. Bayer Crop Science estimates farmers need to increase crop production by 50% by 2050 to meet these demands. To support their mission, Bayer Crop Science is collaborating with farmers and partners to promote and scale regenerative agriculture—a future where farming can produce more while restoring the environment.
Regenerative agriculture is a sustainable farming philosophy that aims to improve soil health by incorporating nature to create healthy ecosystems. It’s based on the idea that agriculture should restore degraded soils and reverse degradation, rather than sustain current conditions. The Crop Science Division at Bayer believes regenerative agriculture is foundational to the future of farming. Their vision is to produce 50% more food by restoring nature and scaling regenerative agriculture. To make this mission a reality, Bayer Crop Science is driving model training with Amazon SageMaker and accelerating code documentation with Amazon Q.
In this post, we show how Bayer Crop Science manages large-scale data science operations by training models for their data analytics needs and maintaining high-quality code documentation to support developers. Through these solutions, Bayer Crop Science projects up to a 70% reduction in developer onboarding time and up to a 30% improvement in developer productivity.
Challenges
Bayer Crop Science faced the challenge of scaling genomic predictive modeling to increase its speed to market. It also needed data scientists to focus on building the high-value foundation models (FMs), rather than worrying about constructing and engineering the solution itself. Prior to building their solution, the Decision Science Ecosystem, provisioning a data science environment could take days for a data team within Bayer Crop Science.
Solution overview
Bayer Crop Science’s Decision Science Ecosystem (DSE) is a next-generation machine learning operations (MLOps) solution built on AWS to accelerate data-driven decision making for data science teams at scale across the organization. AWS services assist Bayer Crop Science in creating a connected decision-making system accessible to thousands of data scientists. The company is using the solution for generative AI, product pipeline advancements, geospatial imagery analytics of field data, and large-scale genomic predictive modeling that will allow Bayer Crop Science to become more data-driven and increase speed to market. This solution helps the data scientist at every step, from ideation to model output, including the entire business decision record made using DSE. Other divisions within Bayer are also beginning to build a similar solution on AWS based on the success of DSE.
Bayer Crop Science teams’ DSE integrates cohesively with SageMaker, a fully managed service that lets data scientists quickly build, train, and deploy machine learning (ML) models for different use cases so they can make data-informed decisions quickly. This boosts collaboration within Bayer Crop Science across product supply, R&D, and commercial. Their data science strategy no longer needs self-service data engineering, but rather provides an effective resource to drive fast data engineering at scale. Bayer Crop Science chose SageMaker because it provides a single cohesive experience where data scientists can focus on building high-value models, without having to worry about constructing and engineering the resource itself. With the help of AWS services, cross-functional teams can align quickly to reduce operational costs by minimizing redundancy, addressing bugs early and often, and quickly identifying issues in automated workflows. The DSE solution uses SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), AWS Lambda, and Amazon Simple Storage Service (Amazon S3) to accelerate innovation at Bayer Crop Science and to create a customized, seamless, end-to-end user experience.
The following diagram illustrates the DSE architecture.

Solution walkthrough
Bayer Crop Science had two key challenges in managing large-scale data science operations: maintaining high-quality code documentation and optimizing existing documentation across multiple repositories. With Amazon Q, Bayer Crop Science tackled both challenges, which empowered them to onboard developers more rapidly and improve developer productivity.
The company’s first use case focused on automatically creating high-quality code documentation. When a developer pushes code to a GitHub repository, a webhook—a lightweight, event-driven communication that automatically sends data between applications using HTTP—triggers a Lambda function through Amazon API Gateway. This function then uses Amazon Q to analyze the code changes and generate comprehensive documentation and change summaries. The updated documentation is then stored in Amazon S3. The same Lambda function also creates a pull request with the AI-generated summary of code changes. To maintain security and flexibility, Bayer Crop Science uses Parameter Store, a capability of AWS Systems Manager, to manage prompts for Amazon Q, allowing for quick updates without redeployment, and AWS Secrets Manager to securely handle repository tokens.
This automation significantly reduces the time developers spend creating documentation and pull request descriptions. The generated documentation is also ingested into Amazon Q, so developers can quickly answer questions they have about a repository and onboard onto projects.
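A highly simplified sketch of this webhook-handling Lambda function is shown below. It is illustrative only: the event parsing assumes a GitHub push payload, the generate_documentation helper stands in for the Amazon Q call described above, and the bucket name is hypothetical.

import json
import boto3

s3 = boto3.client("s3")
DOCS_BUCKET = "example-code-docs-bucket"  # hypothetical bucket name

def generate_documentation(diff_text: str) -> str:
    # Placeholder for the Amazon Q-based documentation generation described in the post
    return f"Generated documentation for change:\n{diff_text[:200]}"

def lambda_handler(event, context):
    body = json.loads(event.get("body", "{}"))  # webhook payload delivered via API Gateway
    repo = body.get("repository", {}).get("name", "unknown-repo")
    commits = body.get("commits", [])

    summary = generate_documentation(json.dumps(commits))
    s3.put_object(
        Bucket=DOCS_BUCKET,
        Key=f"{repo}/docs/{context.aws_request_id}.md",
        Body=summary.encode("utf-8"),
    )
    return {"statusCode": 200, "body": json.dumps({"documented_commits": len(commits)})}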
The second use case addresses the challenge of maintaining and improving existing code documentation quality. An AWS Batch job, triggered by Amazon EventBridge, processes the code repository. Amazon Q generates new documentation for each code file, which is then indexed along with the source code. The system also generates high-level documentation for each module or functionality and compares the AI-generated documentation with existing human-written documentation. This process makes it possible for Bayer Crop Science to systematically evaluate and enhance their documentation quality over time.
To improve search capabilities, Bayer Crop Science added repository names as custom attributes in the Amazon Q index and prefixed them to indexed content. This enhancement improved the accuracy and relevance of documentation searches. The development team also implemented strategies to handle API throttling and variability in AI responses, maintaining robustness in production environments. Bayer Crop Science is considering developing a management plane to streamline the addition of new repositories and centralize the management of settings, tokens, and prompts. This would further enhance the scalability and ease of use of the system.
Organizations looking to replicate Bayer Crop Science’s success can implement similar webhook-triggered documentation generation, use Amazon Q Business for both generating and evaluating documentation quality, and integrate the solution with existing version control and code review processes. By using AWS services like Lambda, Amazon S3, and Systems Manager, companies can create a scalable and manageable architecture for their documentation needs. Amazon Q Developer also helps organizations further accelerate their development timelines by providing real-time code suggestions and a built-in next-generation chat experience.

“One of the lessons we’ve learned over the last 10 years is that we want to write less code. We want to focus our time and investment on only the things that provide differentiated value to Bayer, and we want to leverage everything we can that AWS provides out of the box. Part of our goal is reducing the development cycles required to transition a model from proof-of-concept phase, to production, and ultimately business adoption. That’s where the value is.”
– Will McQueen, VP, Head of CS Global Data Assets and Analytics at Bayer Crop Science.

Summary
Bayer Crop Science’s approach aligns with modern MLOps practices, enabling data science teams to focus more on high-value modeling tasks rather than time-consuming documentation processes and infrastructure management. By adopting these practices, organizations can significantly reduce the time and effort required for code documentation while improving overall code quality and team collaboration.
Learn more about Bayer Crop Science’s generative AI journey, and discover how Bayer Crop Science is redesigning sustainable practices through cutting-edge technology.
About Bayer
Bayer is a global enterprise with core competencies in the life science fields of health care and nutrition. In line with its mission, “Health for all, Hunger for none,” the company’s products and services are designed to help people and the planet thrive by supporting efforts to understand the major challenges presented by a growing and aging global population. Bayer is committed to driving sustainable development and generating a positive impact with its businesses. At the same time, Bayer aims to increase its earning power and create value through innovation and growth. The Bayer brand stands for trust, reliability, and quality throughout the world. In fiscal 2023, the Group employed around 100,000 people and had sales of 47.6 billion euros. R&D expenses before special items amounted to 5.8 billion euros. For more information, go to www.bayer.com.

About the authors
Lance Smith is a Senior Solutions Architect and part of the Global Healthcare and Life Sciences industry division at AWS. He has spent the last 2 decades helping life sciences companies apply technology in pursuit of their missions to help patients. Outside of work, he loves traveling, backpacking, and spending time with his family.
Kenton Blacutt is an AI Consultant within the Amazon Q Customer Success team. He works hands-on with customers, helping them solve real-world business problems with cutting-edge AWS technologies. In his free time, he likes to travel and run an occasional marathon.
Karthik Prabhakar is a Senior Applications Architect within the AWS Professional Services team. In this role, he collaborates with customers to design and implement cutting-edge solutions for their mission-critical business systems, focusing on areas such as scalability, reliability, and cost optimization in digital transformation and modernization projects.
Jake Malmad is a Senior DevOps Consultant within the AWS Professional Services team, specializing in infrastructure as code, security, containers, and orchestration. As a DevOps consultant, he uses this expertise to work collaboratively with customers, architecting and implementing solutions for automation, scalability, reliability, and security across a wide variety of cloud adoption and transformation engagements.
Nicole Brown is a Senior Engagement Manager within the AWS Professional Services team based in Minneapolis, MN. With over 10 years of professional experience, she has led multidisciplinary, global teams across the healthcare and life sciences industries. She is also a supporter of women in tech and currently holds a board position within the Women at Global Services affinity group.

Combat financial fraud with GraphRAG on Amazon Bedrock Knowledge Bases

Financial fraud detection isn’t just important to banks—it’s essential. With global fraud losses surpassing $40 billion annually and sophisticated criminal networks constantly evolving their tactics, financial institutions face an increasingly complex threat landscape. Today’s fraud schemes operate across multiple accounts, institutions, and channels, creating intricate webs designed specifically to evade detection systems.
Financial institutions have invested heavily in detection capabilities, but the core challenge remains: how to connect the dots across fragmented information landscapes where the evidence of fraud exists not within individual documents or transactions, but in the relationships between them.
In this post, we show how to use Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics to build a financial fraud detection solution.
The limitations of traditional RAG systems
In recent years, Retrieval Augmented Generation (RAG) has emerged as a promising approach for building AI systems grounded in organizational knowledge. However, traditional RAG-based systems have limitations when it comes to complex financial fraud detection.

The fundamental limitation lies in how conventional RAG processes information. Standard RAG retrieves and processes document chunks as isolated units, looking for semantic similarities between a query and individual text passages. This approach works well for straightforward information retrieval, but falls critically short in the following scenarios:

Evidence is distributed across multiple documents and systems
The connections between entities matter more than the entities themselves
Complex relationship chains require multi-hop reasoning
Structural context (like hierarchical document organization) provides critical clues
Entity resolution across disparate references is essential

A fraud analyst intuitively follows connection paths—linking an account to a phone number, that phone number to another customer, and that customer to a known fraud ring. Traditional RAG systems, however, lack this relational reasoning capability, leaving sophisticated fraud networks undetected until losses have already occurred.
Amazon Bedrock Knowledge Bases with GraphRAG for financial fraud detection
Amazon Bedrock Knowledge Bases GraphRAG helps financial institutions implement fraud detection systems without building complex graph infrastructure from scratch. By offering a fully managed service that seamlessly integrates knowledge graph construction, maintenance, and querying with powerful foundation models (FMs), Amazon Bedrock Knowledge Bases dramatically lowers the technical barriers to implementing relationship-aware fraud detection. Financial organizations can now use their existing transaction data, customer profiles, and risk signals within a graph context that preserves the critical connections between entities while benefiting from the natural language understanding of FMs. This powerful combination enables fraud analysts to query complex financial relationships using intuitive natural language to detect suspicious patterns that can result in financial fraud.
Example fraud detection use case
To demonstrate this use case, we use a fictitious bank (AnyCompany Bank) in Australia whose customers hold savings, checking, and credit card accounts with the bank. These customers perform transactions to buy goods and services from merchants across the country using their debit and credit cards. AnyCompany Bank is looking to use the latest advancements in GraphRAG and generative AI technologies to detect subtle patterns of fraudulent behavior, yielding higher accuracy and fewer false positives. A fraud analyst at AnyCompany Bank wants to use natural language queries to get answers to the following types of queries:

Basic queries – For example, “Show me all the transactions processed by ABC Electronics” or “What accounts does Michael Green own?”
Relationship exploration queries – For example, “Which devices have accessed account A003?” or “Show all relationships between Jane Smith and her devices.”
Temporal pattern detection queries – For example, “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?”
Fraud detection queries – For example, “Find unusual transaction amounts compared to account history” or “Are there any accounts with failed transactions followed by successful ones within 24 hours?”

Solution overview
To help illustrate the core GraphRAG principles, we have simplified the data model to six key tables: accounts, transactions, individuals, devices, merchants, and relationships. Real-world financial fraud detection systems are much more complex, with hundreds of entity types and intricate relationships, but this example demonstrates the essential concepts that scale to enterprise implementations. The following figure is an example of the accounts table.

The following figure is an example of the individuals table.

The following figure is an example of the devices table.

The following figure is an example of the transactions table.

The following figure is an example of the merchants table.

The following figure is an example of the relationships table.

The following diagram shows the relationships among these entities: accounts, individuals, devices, transactions, and merchants. For example, the individual John Doe uses device D001 to access account A001 to execute transaction T001, which is processed by merchant ABC Electronics.
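The following sketch (Python with pandas) shows one hypothetical way to lay out these six tables as Excel sheets before uploading them. The column names and rows are illustrative stand-ins for the spreadsheets shown in the preceding figures, kept consistent with the John Doe example above.

import pandas as pd

# Hypothetical sample rows for the simplified data model
accounts = pd.DataFrame([{"account_id": "A001", "account_type": "checking", "owner_id": "I001"}])
individuals = pd.DataFrame([{"individual_id": "I001", "name": "John Doe"}])
devices = pd.DataFrame([{"device_id": "D001", "device_type": "mobile"}])
merchants = pd.DataFrame([{"merchant_id": "M001", "name": "ABC Electronics"}])
transactions = pd.DataFrame([{"transaction_id": "T001", "account_id": "A001",
                              "merchant_id": "M001", "amount": 250.00, "status": "successful"}])
relationships = pd.DataFrame([
    {"source": "I001", "relation": "USES", "target": "D001"},
    {"source": "D001", "relation": "ACCESSES", "target": "A001"},
    {"source": "A001", "relation": "EXECUTES", "target": "T001"},
    {"source": "T001", "relation": "PROCESSED_BY", "target": "M001"},
])

# Write each table to its own Excel file for upload to the S3 data source
tables = {"accounts": accounts, "individuals": individuals, "devices": devices,
          "merchants": merchants, "transactions": transactions, "relationships": relationships}
for name, df in tables.items():
    df.to_excel(f"{name}.xlsx", index=False)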

In the following sections, we demonstrate how to upload documents to Amazon Simple Storage Service (Amazon S3), create a knowledge base using Amazon Bedrock Knowledge Bases, and test the knowledge base by running natural language queries.
Prerequisites
To follow along with this post, make sure you have an active AWS account with appropriate permissions to access Amazon Bedrock and create an S3 bucket to be the data source. Additionally, verify that you have enabled access to both Anthropic’s Claude 3.5 Haiku and an embeddings model, such as Amazon Titan Text Embeddings V2.
Upload documents to Amazon S3
In this step, you create an S3 bucket as the data source and upload the six tables (accounts, individuals, devices, transactions, merchants, and relationships) as Excel data sheets. The following screenshot shows our S3 bucket and its contents.
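If you prefer to script this step, the following is a minimal boto3 sketch. The bucket name and Region are hypothetical placeholders; replace them with your own.

import boto3

s3 = boto3.client("s3")
bucket = "anycompany-bank-fraud-kb-source"  # hypothetical bucket name

# Create the data source bucket (omit CreateBucketConfiguration in us-east-1)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)

# Upload the six Excel sheets
for name in ["accounts", "individuals", "devices", "merchants", "transactions", "relationships"]:
    s3.upload_file(f"{name}.xlsx", bucket, f"{name}.xlsx")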

Create a knowledge base
Complete the following steps to create the knowledge base:

On the Amazon Bedrock console, choose Knowledge Bases under Builder tools in the navigation pane.
Choose Create, then choose Knowledge Base with vector store.

In the Knowledge Base details section, provide the following information:

Enter a meaningful name for the knowledge base.
For IAM permissions, select Create and use a new service role to create a new AWS Identity and Access Management (IAM) role.
For Choose data source, select Amazon S3.
Choose Next.

In the Configure data source section, provide the following information:

Enter a data source name.
For Data source location, select the location of your data source (in this example, we select This AWS account).
For S3 source, choose Browse S3 and choose the location where you uploaded the files.
For Parsing strategy, select Amazon Bedrock default parser.
For Chunking strategy, choose Default chunking.
Choose Next.

In the Configure data storage and processing section, provide the following information:

For Embeddings model, choose Titan Text Embeddings V2.
For Vector store creation method, select Quick create a new vector store.
For Vector store type, select Amazon Neptune Analytics (GraphRAG).
Choose Next.

Amazon Bedrock selects Anthropic’s Claude 3 Haiku v1 as the FM that automatically builds the graph for your knowledge base, and contextual enrichment is enabled automatically as part of this step.

Choose Create knowledge base.
Choose the knowledge base when it’s in Available status.

Select the data source and choose Sync, then wait for the sync process to complete.

In the sync process, Amazon Bedrock ingests data files from Amazon S3, creates chunks and embeddings, and automatically extracts entities and relationships, creating the graph.
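The console steps above can also be performed programmatically. The following boto3 sketch shows the equivalent API calls under two assumptions: the names, role ARN, and graph ARN are hypothetical placeholders, and the exact field names of the Neptune Analytics (GraphRAG) storage configuration should be verified against the current Amazon Bedrock API reference. Note that the console’s Quick create flow provisions the Neptune Analytics graph for you, whereas the API path shown here assumes a graph already exists.

import boto3

bedrock_agent = boto3.client("bedrock-agent")
region = boto3.Session().region_name

# Create the knowledge base backed by a Neptune Analytics graph (GraphRAG)
kb = bedrock_agent.create_knowledge_base(
    name="fraud-detection-graphrag-kb",                      # hypothetical name
    roleArn="arn:aws:iam::111122223333:role/BedrockKBRole",   # hypothetical service role
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": f"arn:aws:bedrock:{region}::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "NEPTUNE_ANALYTICS",
        "neptuneAnalyticsConfiguration": {  # field names are assumptions; check the API reference
            "graphArn": "arn:aws:neptune-graph:ap-southeast-2:111122223333:graph/g-example",  # hypothetical
            "fieldMapping": {"textField": "text", "metadataField": "metadata"},
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]

# Attach the S3 data source containing the six Excel sheets
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="fraud-tables-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::anycompany-bank-fraud-kb-source"},
    },
)

# Sync: ingest the files, create chunks and embeddings, and build the graph
bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds["dataSource"]["dataSourceId"],
)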

Test the knowledge base and run natural language queries
When the sync is complete, you can test the knowledge base.

In the Test Knowledge Base section, choose Select model.
Choose Anthropic’s Claude 3.5 Haiku (or another model of your choice), then choose Apply.

Enter a sample query and choose Run.

Let’s start with some basic queries, such as “Show me all transactions processed by ABC Electronics” or “What accounts does Michael Green own?” The generated responses are shown in the following screenshot.

We can also run some relationship exploration queries, such as “Which devices have accessed account A003?” or “Show all relationships between Jane Smith and her devices.” The generated responses are shown in the following screenshot. To arrive at the response, the model performs multi-hop reasoning, traversing multiple files.

The model can also perform temporal pattern detection queries, such as “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?” The generated responses are shown in the following screenshot.

Let’s try out some fraud detection queries, such as “Find unusual transaction amounts compared to account history” or “Are there any accounts with failed transactions followed by successful ones within 24 hours?” The generated responses are shown in the following screenshot.

The GraphRAG solution also enables complex relationship queries, such as “Show the complete path from Emma Brown to Pacific Fresh Market” or “Map all connections between the individuals and merchants in the system.” The generated responses are shown in the following screenshot.
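The same queries can also be issued programmatically through the Amazon Bedrock agent runtime. The following sketch assumes a hypothetical knowledge base ID and that Anthropic’s Claude 3.5 Haiku is enabled in your Region; the exact model identifier may differ in your account.

import boto3

runtime = boto3.client("bedrock-agent-runtime")
region = boto3.Session().region_name

# Ask one of the fraud detection questions in natural language
response = runtime.retrieve_and_generate(
    input={"text": "Are there any accounts with failed transactions followed by successful ones within 24 hours?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBEXAMPLE123",  # hypothetical knowledge base ID
            "modelArn": f"arn:aws:bedrock:{region}::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0",
        },
    },
)
print(response["output"]["text"])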

Clean up
To avoid incurring additional costs, clean up the resources you created. This includes deleting the Amazon Bedrock knowledge base, its associated IAM role, and the S3 bucket used for source documents. Additionally, you must separately delete the Neptune Analytics graph that was automatically created by Amazon Bedrock Knowledge Bases during the setup process.
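The following boto3 sketch automates this cleanup. All identifiers are hypothetical placeholders for the resources you created, and the Neptune Analytics call should be verified against the current neptune-graph API reference.

import boto3

# Delete the data source and the knowledge base
bedrock_agent = boto3.client("bedrock-agent")
bedrock_agent.delete_data_source(knowledgeBaseId="KBEXAMPLE123", dataSourceId="DSEXAMPLE123")
bedrock_agent.delete_knowledge_base(knowledgeBaseId="KBEXAMPLE123")

# Empty and delete the S3 source bucket
s3 = boto3.resource("s3")
bucket = s3.Bucket("anycompany-bank-fraud-kb-source")
bucket.objects.all().delete()
bucket.delete()

# Delete the Neptune Analytics graph created by the Quick create flow
neptune_graph = boto3.client("neptune-graph")
neptune_graph.delete_graph(graphIdentifier="g-example", skipSnapshot=True)

# Remember to also delete the service role created for the knowledge base (via the IAM console or API).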
Conclusion
GraphRAG in Amazon Bedrock is a game-changing feature in the fight against financial fraud. By automatically connecting relationships across transaction data, customer profiles, historical patterns, and fraud reports, it significantly enhances financial institutions’ ability to detect complex fraud schemes that traditional systems might miss. Its capability to understand and link information across multiple documents and data sources proves invaluable when investigating sophisticated fraud patterns that span multiple touchpoints and time periods.

For financial institutions and fraud detection teams, GraphRAG’s intelligent document processing means faster, more accurate fraud investigations. It can quickly piece together related incidents, identify common patterns in fraud reports, and connect seemingly unrelated activities that might indicate organized fraud rings. This deeper level of insight, combined with comprehensive, context-aware responses, helps security teams stay one step ahead of fraudsters who continuously evolve their tactics.

As financial crimes become increasingly sophisticated, GraphRAG in Amazon Bedrock stands as a powerful tool for fraud prevention, transforming how you can analyze, connect, and act on fraud-related information. The future of fraud detection demands tools that can reason over connections the way humans do, and GraphRAG is leading the way in making that possible.

About the Authors
Senaka Ariyasinghe is a Senior Partner Solutions Architect at AWS. He collaborates with Global Systems Integrators to drive cloud innovation across the Asia-Pacific and Japan region. He specializes in helping AWS partners develop and implement scalable, well-architected solutions, with particular emphasis on generative AI, machine learning, cloud migration strategies, and the modernization of enterprise applications.
Senthil Nathan is a Senior Partner Solutions Architect working with Global Systems Integrators at AWS. In his role, Senthil works closely with global partners to help them maximize the value and potential of the AWS Cloud landscape. He is passionate about using the transformative power of cloud computing and emerging technologies to drive innovation and business impact.
Deependra Shekhawat is a Senior Energy and Utilities Industry Specialist Solutions Architect based in Sydney, Australia. In his role, Deependra helps energy companies across the Asia-Pacific and Japan region use cloud technologies to drive sustainability and operational efficiency. He specializes in creating robust data foundations and advanced workflows that enable organizations to harness the power of big data, analytics, and machine learning for solving critical industry challenges.
Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large-scale complex integration and event-driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI-assisted business automation.
Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge generative AI and graph analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.
JaiPrakash Dave is a Partner Solutions Architect working with Global Systems Integrators at AWS based in India. In his role, JaiPrakash guides AWS partners in the India region to design and scale well-architected solutions, focusing on generative AI, machine learning, DevOps, and application and data modernization initiatives.