Falcon 180B foundation model from TII is now available via Amazon Sage …

Today, we are excited to announce that the Falcon 180B foundation model developed by Technology Innovation Institute (TII) and trained on Amazon SageMaker is available for customers through Amazon SageMaker JumpStart to deploy with one-click for running inference. With a 180-billion-parameter size and trained on a massive 3.5-trillion-token dataset, Falcon 180B is the largest and one of the most performant models with openly accessible weights. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Falcon 180B model via SageMaker JumpStart.
What is Falcon 180B
Falcon 180B is a model released by TII that follows previous releases in the Falcon family. It’s a scaled-up version of Falcon 40B, and it uses multi-query attention for better scalability. It’s an auto-regressive language model that uses an optimized transformer architecture. It was trained on 3.5 trillion tokens of data, primarily consisting of web data from RefinedWeb (approximately 85%). The model has two versions: 180B and 180B-Chat. 180B is a raw, pre-trained model, which should be further fine-tuned for most use cases. 180B-Chat is better suited to taking generic instructions. The Chat model has been fine-tuned on chat and instructions datasets together with several large-scale conversational datasets.
The model is made available under the Falcon-180B TII License and Acceptable Use Policy.
Falcon 180B was trained by TII on Amazon SageMaker, on a cluster of approximately 4K A100 GPUs. It used a custom distributed training codebase named Gigatron, which uses 3D parallelism with ZeRO, and custom, high-performance Triton kernels. The distributed training architecture used Amazon Simple Storage Service (Amazon S3) as the sole unified service for data loading and checkpoint writing and reading, which particularly contributed to the workload reliability and operational simplicity.
What is SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances within a network isolated environment, and customize models using Amazon SageMaker for model training and deployment.
You can now discover and deploy Falcon 180B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Falcon 180B is discoverable and can be deployed in Regions where the requisite instances are available. At present, ml.p4de instances are available in US East (N. Virginia) and US West (Oregon).
Discover models
You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Falcon 180B in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Generation Models or searching for Falcon.

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You will also find two buttons, Deploy and Open Notebook, which will help you use the model (the following screenshot shows the Deploy option).

Deploy models
When you choose Deploy, the model deployment will start. Alternatively, you can deploy through the example notebook that shows up by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.
To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id="huggingface-llm-falcon-180b-chat-bf16")
predictor = my_model.deploy()

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the API documentation. After it’s deployed, you can run inference against the deployed endpoint through a SageMaker predictor. See the following code:

payload = {
    "inputs": "User: Hello!\nFalcon: ",
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload)

Inference parameters control the text generation process at the endpoint. The max_new_tokens parameter controls the size of the output generated by the model. Note that this is not the same as the number of words, because the model's vocabulary is not the same as the English-language vocabulary and each token may not be an English word. Temperature controls the randomness of the output: higher temperatures result in more creative, but also more hallucination-prone, outputs. All the inference parameters are optional.
This 180B-parameter model is 335 GB and requires even more GPU memory to sufficiently perform inference in 16-bit precision. Currently, JumpStart only supports this model on ml.p4de.24xlarge instances. It is possible to deploy an 8-bit quantized model on an ml.p4d.24xlarge instance by providing the env={"HF_MODEL_QUANTIZE": "bitsandbytes"} keyword argument to the JumpStartModel constructor and specifying instance_type="ml.p4d.24xlarge" to the deploy method. However, please note that per-token latency is approximately 5x slower for this quantized configuration.
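Putting these options together, a deployment of the quantized configuration might look like the following sketch; it simply mirrors the keyword arguments mentioned above, so adjust it for your own account limits and Region.

from sagemaker.jumpstart.model import JumpStartModel

# 8-bit quantized deployment on ml.p4d.24xlarge; expect roughly 5x higher
# per-token latency than the bf16 deployment on ml.p4de.24xlarge.
my_model = JumpStartModel(
    model_id="huggingface-llm-falcon-180b-chat-bf16",
    env={"HF_MODEL_QUANTIZE": "bitsandbytes"},
)
predictor = my_model.deploy(instance_type="ml.p4d.24xlarge")
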
The following table lists all the Falcon models available in SageMaker JumpStart along with the model IDs, default instance types, maximum number of total tokens (sum of the number of input tokens and number of generated tokens) supported, and the typical response latency per token for each of these models.

| Model Name | Model ID | Default Instance Type | Max Total Tokens | Latency per Token* |
| --- | --- | --- | --- | --- |
| Falcon 7B | huggingface-llm-falcon-7b-bf16 | ml.g5.2xlarge | 2048 | 34 ms |
| Falcon 7B Instruct | huggingface-llm-falcon-7b-instruct-bf16 | ml.g5.2xlarge | 2048 | 34 ms |
| Falcon 40B | huggingface-llm-falcon-40b-bf16 | ml.g5.12xlarge | 2048 | 57 ms |
| Falcon 40B Instruct | huggingface-llm-falcon-40b-instruct-bf16 | ml.g5.12xlarge | 2048 | 57 ms |
| Falcon 180B | huggingface-llm-falcon-180b-bf16 | ml.p4de.24xlarge | 2048 | 45 ms |
| Falcon 180B Chat | huggingface-llm-falcon-180b-chat-bf16 | ml.p4de.24xlarge | 2048 | 45 ms |

*per-token latency is provided for the median response time of the example prompts provided in this blog; this value will vary based on length of input and output sequences.
Inference and example prompts for Falcon 180B
Falcon models can be used for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as answering questions, language translation, sentiment analysis, and many more. The endpoint accepts the following input payload schema:

payload = {
    "inputs": str,
    (optional) "parameters": {
        "do_sample": bool,
        "max_new_tokens": int,
        "repetition_penalty": float,
        "return_full_text": bool,
        "stop": List[str],
        "seed": int,
        "temperature": float,
        "top_k": int,
        "top_p": float,
        "truncate": int,
        "typical_p": float,
        "best_of": int,
        "watermark": bool,
        "details": bool,
        "decoder_input_details": bool,
    }
}

You can explore the definition of these client parameters and their default values within the text-generation-inference repository.
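As an illustration, the following sketch sends a request that sets several of these parameters and reads back the generated text. The response is typically a list of dictionaries with a generated_text field, but the exact schema can vary by container version, so treat the parsing line as an assumption to verify against your own endpoint.

payload = {
    "inputs": "Building a website can be done in 10 simple steps:",
    "parameters": {
        "do_sample": True,                 # sample instead of greedy decoding
        "max_new_tokens": 256,             # cap on the number of generated tokens
        "temperature": 0.6,                # lower values are more deterministic
        "top_p": 0.9,                      # nucleus sampling threshold
        "stop": ["<|endoftext|>", "###"],  # stop sequences
    },
}
response = predictor.predict(payload)

# The text-generation-inference container usually returns a list of
# dictionaries; adjust this if your container version returns a different schema.
print(response[0]["generated_text"])
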
The following are some example prompts and the text generated by the model. All outputs here are generated with inference parameters {"max_new_tokens": 768, "stop": ["<|endoftext|>", "###"]}.

Input
Building a website can be done in 10 simple steps:

Output
– Choose a domain name
– Register your domain name
– Choose a web host
– Choose a website builder
– Create your website
– Add content to your website
– Add pages to your website
– Add images to your website
– Add videos to your website
– Add social media to your website
1. Choose a domain name
A domain name is the address of your website. It is what people will type into their browser to find your website. For example, the domain name of this website is.
When choosing a domain name, you should choose something that is easy to remember and that represents your business or website. You should also make sure that the domain name is available. You can check if a domain name is available by using a domain name search tool.
2. Register your domain name
Once you have chosen a domain name, you need to register it. This can be done through a domain name registrar. A domain name registrar is a company that manages the registration of domain names.
When registering your domain name, you will need to provide your contact information and choose a payment method. You will also need to choose a domain name extension. The most common domain name extensions are .com, .net, and .org.
3. Choose a web host
A web host is a company that provides space on a server for your website. When choosing a web host, you should choose one that is reliable and that offers good customer support. You should also make sure that the web host offers the features that you need for your website.
4. Choose a website builder
A website builder is a tool that allows you to create a website without having to learn how to code. There are many different website builders available, so you should choose one that is easy to use and that offers the features that you need.
5. Create your website
Once you have chosen a website builder, you can start creating your website. You will need to choose a template for your website and then add content to your website.
6. Add content to your website
The content of your website is what will make your website unique. You should add content that is relevant to your business or website. You can add text, images, videos, and other types of content to your website.
7. Add pages to your website
A website is made up of pages. You can add as many pages as you want to your website. Each page should have a different purpose. For example, you might have a page for your home page, a page for your about page, and a page for your contact page.
8. Add images to your website
Images can make your website more visually appealing. You can add images to your website by uploading them to your website or by using a image hosting service.
9. Add videos to your website
Videos can also make your website more visually appealing. You can add videos to your website by uploading them to your website or by using a video hosting service.
10. Add social media to your website
Social media can help you promote your website. You can add social media buttons to your website so that people can share your website on social media. You can also add social media feeds to your website so that people can see your latest social media posts.

You may notice this pretrained model generates long text sequences that are not necessarily ideal for dialog use cases. Before we show how the fine-tuned chat model performs for a larger set of dialog-based prompts, the next two examples illustrate how to use Falcon models with few-shot in-context learning, where we provide a few training samples to the model within the input. Note that "few-shot learning" does not adjust model weights; we only perform inference on the deployed model while providing a few examples within the input context to help guide the model's output.

Input
Tweet: “I hate it when my phone battery dies.”
Sentiment: Negative
###
Tweet: “My day has been :+1:”
Sentiment: Positive
###
Tweet: “This is the link to the article”
Sentiment: Neutral
###
Tweet: “This new music video was incredible”
Sentiment:

Output
Positive
###

Input
Translate English to French:
sea otter => loutre de mer
###
peppermint => menthe poivrée
###
plush girafe => girafe peluche
###
cheese =>

Output
fromage
###
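
Few-shot prompts like the two above can also be assembled programmatically. The following is a minimal sketch built around the sentiment example; the helper function is illustrative and not part of the SageMaker SDK.

def build_few_shot_prompt(examples, query, input_label="Tweet", output_label="Sentiment"):
    """Join labeled examples with '###' separators and append the new query."""
    blocks = [
        f'{input_label}: "{text}"\n{output_label}: {label}'
        for text, label in examples
    ]
    blocks.append(f'{input_label}: "{query}"\n{output_label}:')
    return "\n###\n".join(blocks)

examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been :+1:", "Positive"),
    ("This is the link to the article", "Neutral"),
]
payload = {
    "inputs": build_few_shot_prompt(examples, "This new music video was incredible"),
    "parameters": {"max_new_tokens": 16, "stop": ["###"]},
}
response = predictor.predict(payload)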

Inference and example prompts for Falcon 180B-Chat
With Falcon 180B-Chat models, optimized for dialogue use cases, the input to the chat model endpoints may contain previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas, which define the chat assistant’s behavior. Input payload to the endpoint is the same as the Falcon 180B model except the inputs string value should use the following format:

System: <An optional system prompt>
User: <Optional user message history>
Falcon: <Optional chat history Falcon response>
User: <Current user message>
Falcon:
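
A small helper can assemble this layout from a message history; the sketch below is illustrative, and the function and argument names are not part of the SageMaker SDK.

def build_falcon_chat_prompt(current_user_message, history=None, system=None):
    """Format prior (user, falcon) turns plus the current user message into
    the prompt layout expected by the Falcon 180B-Chat endpoint."""
    lines = []
    if system:
        lines.append(f"System: {system}")
    for user_turn, falcon_turn in (history or []):
        lines.append(f"User: {user_turn}")
        lines.append(f"Falcon: {falcon_turn}")
    lines.append(f"User: {current_user_message}")
    lines.append("Falcon:")
    return "\n".join(lines)

payload = {
    "inputs": build_falcon_chat_prompt(
        "What is Amazon SageMaker JumpStart?",
        system="You are a helpful assistant",
    ),
    "parameters": {"max_new_tokens": 256, "stop": ["\nUser:", "<|endoftext|>"]},
}
response = predictor.predict(payload)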

The following are some example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens": 256, "stop": ["\nUser:", "<|endoftext|>", " User:", "###"]}.

Input
System: You are a helpful assistant
User: What is Amazon SageMaker JumpStart?
Falcon:

Output
Amazon SageMaker JumpStart is a capability of Amazon SageMaker that provides a set of solutions to common machine learning problems. These solutions include pre-built machine learning models, algorithms, and pre-configured model training and deployment settings. JumpStart also provides sample notebooks that demonstrate how to use these solutions for various use cases. The goal of JumpStart is to help users get started with machine learning quickly and easily, without the need for extensive machine learning expertise.
User:

Input
User: Simply put, the theory of relativity states that
Falcon:

Output
1. The laws of physics are the same for all observers in uniform motion relative to one another.
2. The speed of light in a vacuum is the same for all observers, regardless of their motion or the motion of the light source.
3. Time and space are relative, meaning that they can vary depending on the observer’s frame of reference.
4. Mass and energy are interchangeable, as expressed by the famous equation E=mc².
User:

Input
User: Building a website can be done in 10 simple steps:
Falcon:

Output
1. Determine the purpose and goals of your website.
2. Choose a domain name and register it.
3. Select a web hosting provider and sign up for a plan.
4. Choose a website builder or content management system (CMS) to use.
5. Select a theme or template for your website.
6. Customize the design and layout of your website.
7. Add content to your website, including text, images, and videos.
8. Set up navigation menus and links to make your website easy to navigate.
9. Test your website to ensure it is functioning properly and is mobile-friendly.
10. Launch your website and promote it to attract visitors.
User:

Input
System: Always answer with a Haiku.
User: I am deploying a large language model on Amazon SageMaker JumpStart. What should I do?
Falcon:

Output
“JumpStart” your dreams,
Deploy with ease, scale with glee,
SageMaker awaits.

Input
System: Generate a table summarizing the options outlined in this email.
User: Team, we have to make a decision on the product plan for Goocrux v2. The first option will take us 3 weeks of development, and will achieve performance of 80ms, but means that the UI has to change. The second option will take us 4 weeks of development, and will achieve performance of 60ms, without any UI change. The third option will take the longest — 6 weeks of development, will achieve performance of 30ms, without any UI change. A markdown table with four columns: (1) Option ID, (2) Development time, (3) Performance, (4) UI Change.
Falcon:

Output
| Option ID | Development time | Performance | UI Change |
| — | — | — | — |
| 1 | 3 weeks | 80ms | Yes |
| 2 | 4 weeks | 60ms | No |
| 3 | 6 weeks | 30ms | No |

Input
System: You always translate English to French.
User: sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
Falcon:

Output
=> fromage

Input
System: Classify the user prompt as Negative, Positive, or Neutral sentiment. Use a short response.
User: Tweet: “This new music video was awesome”
Falcon:

Output
“Positive”

Input
User: What is the recipe of mayonnaise?
Falcon:

Output
1. In a mixing bowl, whisk together 1 egg yolk, 1 tablespoon of Dijon mustard, and 1 tablespoon of lemon juice until well combined.
2. While whisking continuously, slowly drizzle in 1 cup of vegetable oil until the mixture thickens and emulsifies.
3. Season with salt and pepper to taste.
4. Store in an airtight container in the refrigerator for up to one week.
User:

In the following example, the user has had a conversation with the assistant about tourist sites in Paris. Next, the user is inquiring about the first option recommended by the chat assistant.

Input
System: You are a helpful trip planner.
User: I am going to Paris, what should I see?
Falcon: 1. Eiffel Tower: No trip to Paris is complete without a visit to the iconic Eiffel Tower.
2. Louvre Museum: Home to some of the world’s most famous art, including the Mona Lisa.
3. Notre-Dame Cathedral: A stunning example of Gothic architecture and a must-see for history buffs.
4. Champs-Élysées: One of the most famous avenues in the world, lined with shops, cafes, and theaters.
5. Montmartre: A charming neighborhood known for its bohemian vibe, street art, and stunning views of the city.
6. Palace of Versailles: A short train ride from Paris, this opulent palace was once the seat of French royalty.
7. Musée d’Orsay: Housed in a former train station, this museum boasts an impressive collection of Impressionist and Post-Impressionist art.
8. Sainte-Chapelle: A hidden gem in the heart of Paris, this medieval chapel is famous for its stunning stained glass windows.
9. Luxembourg Gardens: A beautiful park in the heart of the city, perfect for a picnic or a leisurely stroll.
User: What is so great about #1?
Falcon:

Output
#1 is the Eiffel Tower, which is one of the most famous landmarks in the world. It is a symbol of Paris and a must-see attraction for many visitors. The tower offers stunning views of the city from its observation decks, and it is especially beautiful when lit up at night. Additionally, the Eiffel Tower has a rich history and is an engineering marvel, making it a fascinating destination for anyone interested in architecture or history.
User:

Clean up
After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we showed you how to get started with Falcon 180B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.
Resources

SageMaker JumpStart documentation
SageMaker JumpStart Foundation Models documentation
SageMaker JumpStart product detail page
SageMaker JumpStart model catalog

About the Authors
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Olivier Cruchant is a Principal Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.
Karl Albertsen leads Amazon SageMaker’s foundation model hub, algorithms, and partnerships teams.

Amazon SageMaker Domain in VPC only mode to support SageMaker Studio w …

Amazon SageMaker Domain supports SageMaker machine learning (ML) environments, including SageMaker Studio and SageMaker Canvas. SageMaker Studio is a fully integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models, improving data science team productivity by up to 10x. SageMaker Canvas expands access to machine learning by providing business analysts with a visual interface that allows them to generate accurate ML predictions on their own—without requiring any ML experience or having to write a single line of code.
HashiCorp Terraform is an infrastructure as code (IaC) tool that lets you organize your infrastructure in reusable code modules. AWS customers rely on IaC to design, develop, and manage their cloud infrastructure, such as SageMaker Domains. IaC ensures that customer infrastructure and services are consistent, scalable, and reproducible while following best practices in the area of development operations (DevOps). Using Terraform, you can develop and manage your SageMaker Domain and its supporting infrastructure in a consistent and repeatable manner.
In this post, we demonstrate the Terraform implementation to deploy a SageMaker Domain and the Amazon Virtual Private Cloud (Amazon VPC) it associates with. The solution will use Terraform to create:

A VPC with subnets, security groups, as well as VPC endpoints to support VPC only mode for the SageMaker Domain.
A SageMaker Domain in VPC only mode with a user profile.
An AWS Key Management Service (AWS KMS) key to encrypt the SageMaker Studio’s Amazon Elastic File System (Amazon EFS) volume.
A Lifecycle Configuration attached to the SageMaker Domain to automatically shut down idle Studio notebook instances.
A SageMaker Domain execution role and IAM policies to enable SageMaker Studio and Canvas functionalities.

The solution described in this post is available at this GitHub repo.
Solution overview
The following image shows SageMaker Domain in VPC only mode.

By launching SageMaker Domain in your VPC, you can control the data flow from your SageMaker Studio and Canvas environments. This allows you to restrict internet access, monitor and inspect traffic using standard AWS networking and security capabilities, and connect to other AWS resources through VPC endpoints.
VPC requirements to use VPC only mode
Creating a SageMaker Domain in VPC only mode requires a VPC with the following configurations:

At least two private subnets, each in a different Availability Zone, to ensure high availability.
Ensure your subnets have the required number of IP addresses. We recommend between two and four IP addresses per user. The total IP address capacity for a Studio domain is the sum of available IP addresses for each subnet provided when the domain is created.
Set up one or more security groups with inbound and outbound rules that together allow the following traffic:

NFS traffic over TCP on port 2049 between the domain and the Amazon EFS volume.
TCP traffic within the security group. This is required for connectivity between the JupyterServer app and the KernelGateway apps. You must allow access to at least the ports in the range 8192–65535.

Create a gateway endpoint for Amazon Simple Storage Service (Amazon S3). SageMaker Studio needs to access Amazon S3 from your VPC using Gateway VPC endpoints. After you create the gateway endpoint, you need to add it as a target in your route table for traffic destined from your VPC to Amazon S3.
Create interface VPC endpoints (AWS PrivateLink) to allow Studio to access the following services with the corresponding service names. You must also associate a security group for your VPC with these endpoints to allow all inbound traffic from port 443:

SageMaker API: com.amazonaws.region.sagemaker.api. This is required to communicate with the SageMaker API.
SageMaker runtime: com.amazonaws.region.sagemaker.runtime. This is required to run Studio notebooks and to train and host models.
SageMaker Feature Store: com.amazonaws.region.sagemaker.featurestore-runtime. This is required to use SageMaker Feature Store.
SageMaker Projects: com.amazonaws.region.servicecatalog. This is required to use SageMaker Projects.

Additional VPC endpoints to use SageMaker Canvas
In addition to the previously mentioned VPC endpoints, to use SageMaker Canvas, you need to also create the following interface VPC endpoints:

Amazon Forecast and Amazon Forecast Query: com.amazonaws.region.forecast and com.amazonaws.region.forecastquery. These are required to use Amazon Forecast.
Amazon Rekognition: com.amazonaws.region.rekognition. This is required to use Amazon Rekognition.
Amazon Textract: com.amazonaws.region.textract. This is required to use Amazon Textract.
Amazon Comprehend: com.amazonaws.region.comprehend. This is required to use Amazon Comprehend.
AWS Security Token Service (AWS STS): com.amazonaws.region.sts. This is required because SageMaker Canvas uses AWS STS to connect to data sources.
Amazon Athena and AWS Glue: com.amazonaws.region.athena and com.amazonaws.region.glue. This is required to connect to AWS Glue Data Catalog through Amazon Athena.
Amazon Redshift: com.amazonaws.region.redshift-data. This is required to connect to the Amazon Redshift data source.

To view all VPC endpoints for each service you can use with SageMaker Canvas, please go to Configure Amazon SageMaker Canvas in a VPC without internet access.
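The Terraform solution creates these endpoints for you. For readers who prefer to script the equivalent call directly, the following is a minimal boto3 sketch for a single interface endpoint; the VPC, subnet, and security group IDs are placeholders, and the same call pattern applies to the other service names listed above.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the SageMaker API interface endpoint; repeat for the other
# service names listed above (runtime, featurestore-runtime, servicecatalog, ...).
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                      # placeholder
    ServiceName="com.amazonaws.us-east-1.sagemaker.api",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholders
    SecurityGroupIds=["sg-0123456789abcdef0"],          # placeholder
    PrivateDnsEnabled=True,
)
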
AWS KMS encryption for SageMaker Studio’s EFS volume
The first time a user on your team onboards to SageMaker Studio, SageMaker creates an EFS volume for the team. A home directory is created in the volume for each user who onboards to Studio as part of your team. Notebook files and data files are stored in these directories.
You can encrypt your SageMaker Studio’s EFS volume with a KMS key so your home directories’ data are encrypted at rest. This Terraform solution creates a KMS key and uses it to encrypt SageMaker Studio’s EFS volume.
SageMaker Domain Lifecycle Configuration to automatically shut down idle Studio notebooks

Lifecycle Configurations are shell scripts triggered by Amazon SageMaker Studio lifecycle events, such as starting a new Studio notebook. You can use Lifecycle Configurations to automate customization for your Studio environment.
This Terraform solution creates a SageMaker Lifecycle Configuration to detect and stop idle resources that incur costs within Studio using an auto-shutdown Jupyter extension. Under the hood, the following resources are created or configured to achieve the desired result:

Create an S3 bucket and upload the latest version of the auto-shutdown extension sagemaker_studio_autoshutdown-0.1.5.tar.gz. Later, the auto-shutdown script will run the s3 cp command to download the extension file from the S3 bucket on Jupyter Server start-ups. Please refer to the following GitHub repos for more information regarding the auto-shutdown extension and auto-shutdown script.
Create an aws_sagemaker_studio_lifecycle_config resource named "auto_shutdown". This resource base64-encodes autoshutdown-script.sh and creates a Lifecycle Configuration for the SageMaker Domain.
For the SageMaker Domain default user settings, specify the Lifecycle Configuration ARN and set it as the default.
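
For reference, the same base64 encoding and Lifecycle Configuration registration that the Terraform resource performs can be expressed as a direct SageMaker API call. The following boto3 sketch is illustrative; the script path and configuration name are assumptions based on the repository layout described above.

import base64
import boto3

sm = boto3.client("sagemaker")

# Base64-encode the auto-shutdown script, mirroring what the Terraform
# resource does before creating the Lifecycle Configuration.
with open("assets/auto_shutdown_template/autoshutdown-script.sh", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

response = sm.create_studio_lifecycle_config(
    StudioLifecycleConfigName="auto-shutdown",    # illustrative name
    StudioLifecycleConfigContent=content,
    StudioLifecycleConfigAppType="JupyterServer",
)
print(response["StudioLifecycleConfigArn"])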

SageMaker execution role IAM permissions
As a managed service, SageMaker performs operations on your behalf on the AWS hardware that is managed by SageMaker. SageMaker can perform only operations that the user permits.
A SageMaker user can grant these permissions with an IAM role (referred to as an execution role). When you create a SageMaker Studio domain, SageMaker allows you to create the execution role by default. You can restrict access to user profiles by changing the SageMaker user profile role. This Terraform solution attaches the following IAM policies to the SageMaker execution role:

SageMaker managed AmazonSageMakerFullAccess policy. This policy grants the execution role full access to use SageMaker Studio.
A customer managed IAM policy to access the KMS key used to encrypt the SageMaker Studio’s EFS volume.
SageMaker managed AmazonSageMakerCanvasFullAccess and AmazonSageMakerCanvasAIServicesAccess policies. These policies grant the execution role full access to use SageMaker Canvas.
In order to enable time series analysis in SageMaker Canvas, you also need to add the IAM trust policy for Amazon Forecast.

Solution walkthrough
In this blog post, we demonstrate how to deploy the Terraform solution. Prior to the deployment, ensure you satisfy the following prerequisites:
Prerequisites

An AWS account
An IAM user with administrative access

Deployment steps
To give users following this guide a unified deployment experience, we demonstrate the deployment process with AWS CloudShell. Using CloudShell, a browser-based shell, you can quickly run scripts with the AWS Command Line Interface (AWS CLI), experiment with service APIs using the AWS CLI, and use other tools to increase your productivity.
To deploy the Terraform solution, complete the following steps:
CloudShell launch settings

Sign in to the AWS Management Console and select the CloudShell service.
In the navigation bar, in the Region selector, choose US East (N. Virginia).

Your browser will open the CloudShell terminal.
Install Terraform
The next steps should be executed in a CloudShell terminal.
Check this Hashicorp guide for up-to-date instructions to install Terraform for Amazon Linux:

Install yum-config-manager to manage your repositories.

sudo yum install -y yum-utils

Use yum-config-manager to add the official HashiCorp Linux repository.

sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo

Install Terraform from the new repository.

sudo yum -y install terraform

Verify that the installation worked by listing Terraform’s available subcommands.

terraform -help

Expected output:

Usage: terraform [-version] [-help] <command> [args]

The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.

Clone the code repo
Perform the following steps in a CloudShell terminal.

Clone the repo and navigate to the sagemaker-domain-vpconly-canvas-with-terraform folder:

git clone https://github.com/aws-samples/sagemaker-domain-vpconly-canvas-with-terraform.git

cd sagemaker-domain-vpconly-canvas-with-terraform

Download the auto-shutdown extension and place it in the assets/auto_shutdown_template folder:

wget https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension/raw/main/sagemaker_studio_autoshutdown-0.1.5.tar.gz -P assets/auto_shutdown_template

Deploy the Terraform solution
In the CloudShell terminal, run the following Terraform commands:

terraform init

You should see a success message like:

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work…

Now you can run:

terraform plan

After you are satisfied with the resources the plan outlines to be created, you can run:

terraform apply

Enter "yes" when prompted to confirm the deployment.
If successfully deployed, you should see an output that looks like:

Apply complete! Resources: X added, 0 changed, 0 destroyed.

Accessing SageMaker Studio and Canvas
We now have a Studio domain associated with our VPC and a user profile in this domain.

To use the SageMaker Studio console, on the Studio Control Panel, locate your user name (it should be defaultuser) and choose Open Studio.
We made it! Now you can use your browser to connect to the SageMaker Studio environment. After a few minutes, Studio finishes creating your environment, and you’re greeted with the launcher screen.

To use the SageMaker Canvas console, on the Canvas Control Panel, locate your user name (should be defaultuser) and choose Open Canvas.
Now you can use your browser to connect to the SageMaker Canvas environment. After a few minutes, Canvas finishes creating your environment, and you’re greeted with the launcher screen.

Feel free to explore the full functionality that SageMaker Studio and Canvas have to offer! Please refer to the Conclusion section for additional workshops and tutorials you can use to learn more about SageMaker.
Clean up
Run the following command to clean up your resources:

terraform destroy

Tip: If you set the Amazon EFS retention policy to "Retain" (the default), you will run into issues during "terraform destroy" because Terraform tries to delete the subnets and VPC while the EFS volume and its associated security groups (created by SageMaker) still exist. To fix this, first delete the EFS volume manually, and then delete the subnets and VPC manually in the AWS console.
Conclusion
The solution in this post provides you the ability to create a SageMaker Domain to support ML environments, including SageMaker Studio and SageMaker Canvas with Terraform. SageMaker Studio provides a fully managed IDE that removes the heavy lifting in the ML process. With SageMaker Canvas, our business users can easily explore and build ML models to make accurate predictions without writing any code. With the ability to launch Studio and Canvas inside a VPC and the use of a KMS key to encrypt the EFS volume, customers can use SageMaker ML environments with enhanced security. Auto shutdown Lifecycle Configuration helps customers save costs on idle Studio notebook instances.
Go test this solution and let us know what you think. For more information about how to use SageMaker Studio and SageMaker Canvas, see the following:

Amazon SageMaker Immersion Day
SageMaker Canvas Immersion Day

About the Author
Chen Yang is a Machine Learning Engineer at Amazon Web Services. She is part of the AWS Professional Services team, and has been focusing on building secure machine learning environments for customers. In her spare time, she enjoys running and hiking in the Pacific Northwest.

A New AI Research from Apple and Equall AI Uncovers Redundancies in Tr …

The Transformer architecture, which has recently become popular, has taken over as the standard method for Natural Language Processing (NLP) tasks, particularly Machine Translation (MT). This architecture has displayed impressive scaling qualities, meaning that adding more model parameters results in better performance on a variety of NLP tasks; a number of studies and investigations have validated this observation. Though transformers excel in terms of scalability, there is a parallel movement to make these models more effective and deployable in the real world. This entails taking care of issues with latency, memory use, and disk space.

Researchers have been actively investigating methods to address these issues, including component trimming, parameter sharing, and dimensionality reduction. The widely utilized Transformer architecture comprises a number of essential parts, of which two of the most important ones are the Feed Forward Network (FFN) and Attention.

Attention: The Attention mechanism allows the model to capture relationships and dependencies between words in a sentence, irrespective of their positions. It functions as a sort of mechanism to aid the model in determining which portions of the input text are most pertinent to each word it is currently analyzing. Understanding the context and connections between words in a phrase depends on this.

Feed Forward Network (FFN): The FFN is responsible for non-linearly transforming each input token independently. It adds complexity and expressiveness to the model’s comprehension of each word by performing specific mathematical operations on the representation of each word.

In recent research, a team of researchers has focused on investigating the role of the FFN within the Transformer architecture. They have discovered that the FFN exhibits a high level of redundancy while being a large component of the model and consuming a significant number of parameters. They have found that they could scale back the model’s parameter count without significantly compromising accuracy. They have achieved this by removing the FFN from the decoder layers and instead using a single shared FFN across the encoder layers.

Decoder Layers: Each encoder and decoder in a standard Transformer model has its own FFN. The researchers eliminated the FFN from the decoder layers.

Encoder Layers: They used a single FFN that was shared by all of the encoder layers rather than having individual FFNs for each encoder layer.
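
The parameter-sharing scheme can be illustrated with a short PyTorch sketch: one FFN module is instantiated once and reused by every encoder layer, while decoder layers would simply omit theirs. This is only an illustration of the idea described above, not the authors' implementation, and all dimensions are arbitrary.

import torch
import torch.nn as nn

d_model, d_ff, n_layers = 512, 2048, 6

# A single FFN instance shared by all encoder layers.
shared_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
)

class SharedFFNEncoderLayer(nn.Module):
    def __init__(self, ffn):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ffn = ffn                    # shared module, not a fresh copy per layer
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

encoder = nn.ModuleList([SharedFFNEncoderLayer(shared_ffn) for _ in range(n_layers)])

x = torch.randn(2, 16, d_model)           # (batch, tokens, d_model)
for layer in encoder:
    x = layer(x)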

The researchers have shared the benefits that have accompanied this approach, which are as follows.

Parameter Reduction: They drastically decreased the amount of parameters in the model by deleting and sharing the FFN components.

The model’s accuracy only decreased by a modest amount despite removing a sizable number of its parameters. This shows that the encoder’s numerous FFNs and the decoder’s FFN have some degree of functional redundancy.

Scaling Back: They expanded the hidden dimension of the shared FFN to restore the architecture to its previous size while maintaining or even enhancing the performance of the model. Compared to the previous large-scale Transformer model, this resulted in considerable improvements in accuracy and model processing speed, i.e., latency.

In conclusion, this research shows that the Feed Forward Network in the Transformer design, especially in the decoder layers, may be streamlined and shared without significantly affecting model performance. This not only lessens the model's computational load but also improves its effectiveness and applicability for diverse NLP applications.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.


The post A New AI Research from Apple and Equall AI Uncovers Redundancies in Transformer Architecture: How Streamlining the Feed Forward Network Boosts Efficiency and Accuracy appeared first on MarkTechPost.

Unlocking the Secrets of Catalytic Performance with Deep Learning: A D …

The way a catalyst's surface is shaped matters for certain chemical reactions due to various properties of the catalyst, which we study in Surface Chemistry. Researchers think that high-speed testing using Deep Learning models can help us understand these effects better and speed up catalyst development. But the existing models aren't good at making accurate predictions about how a catalyst will perform. Catalysts in Surface Chemistry are described by graphs as well as by their characteristics; however, the characteristics don't capture how the atoms are connected, which makes it hard for the model to pick up the details of shape and how it plays out in a reaction. Graph-based ML models also lose important details about where things are placed when molecules stick to a surface, and it becomes too complicated to figure out certain predictions. So, we need an easier way to understand how materials work in a chemical reaction.

Researchers from Zhejiang University in China have come up with a solution to this. They created a special program called GLCNN. The program looks at the fine details of how molecules sit on a surface and analyzes them. It does this by turning the surface and the spots where molecules attach into simple grids and lists of numbers. This new model helps researchers understand the tiny details of every chemical reaction on the surface. This is a step forward in making computers smarter at predicting how materials will behave in certain chemical processes. It is also easy to see that this could be a big help in designing new catalysts for various applications.

Adding data augmentation (DA) to the GLCNN method helps create a bigger dataset and prevents the computer from making predictions based on limited data. GLCNN is a computer program that’s good at predicting how molecules stick to surfaces. It did a fantastic job in predicting how OH molecules stick to certain catalysts, with very tiny errors, which is better than other computer models used in the past. This combination helps it understand both the shape and the chemical properties of the catalysts. So, GLCNN is like a super-smart tool that can figure out why some materials work better in chemical reactions. It’s a step forward in making computers good at chemistry.
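
To make the "grid plus descriptor" idea concrete, the sketch below combines a convolutional branch over a surface grid with a small MLP over a descriptor vector and regresses a single adsorption energy. It illustrates the general two-branch style only; it is not the published GLCNN architecture, and every size in it is arbitrary.

import torch
import torch.nn as nn

class GridPlusDescriptorNet(nn.Module):
    """Toy two-branch model: a CNN reads the surface grid, an MLP reads
    the descriptor vector, and the two are fused for regression."""
    def __init__(self, grid_channels=1, descriptor_dim=16):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(grid_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (batch, 16)
        )
        self.mlp = nn.Sequential(nn.Linear(descriptor_dim, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, 1)               # predicted adsorption energy

    def forward(self, grid, descriptors):
        return self.head(torch.cat([self.cnn(grid), self.mlp(descriptors)], dim=1))

model = GridPlusDescriptorNet()
energy = model(torch.randn(4, 1, 32, 32), torch.randn(4, 16))  # shape (4, 1)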

In the descriptor part of the analysis, we found that the way atoms are arranged and their electronic properties are super important for predicting how well a catalyst works. The type of metal used is also crucial, even more so than how the atoms are arranged around it. When we looked at the different layers of GLCNN, we saw that it’s pretty smart. It starts by picking out the obvious details of the chemical structures and then goes deeper to find more complex information that helps it make accurate predictions about how well a catalyst will work. So, this GLCNN method is a handy tool for quickly and accurately testing catalysts. It can handle a wide range of different catalysts, making it a great solution for finding the best ones.

Check out the Paper and Reference Article. All Credit For This Research Goes To the Researchers on This Project.


The post Unlocking the Secrets of Catalytic Performance with Deep Learning: A Deep Dive into the ‘Global + Local’ Convolutional Neural Network for High-Precision Screening of Heterogeneous Catalysts appeared first on MarkTechPost.

Walk me Through Time: SceNeRFlow is an AI Method That Generates Time C …

Neural Radiance Fields (NeRF) recently emerged as a transformative concept in the 3D domain. It reshaped how we handle 3D object visualization and opened new possibilities. It bridges the gap between digital and physical reality by enabling machines to regenerate scenes with realism.

In this digital age, where visuals play a central role in communication, entertainment, and decision-making, NeRF stands as a testament to the power of machine learning to simulate the physical world in ways previously thought unimaginable. 

With NeRF, you can walk through virtual environments, though time is frozen. You view the same scene from different angles, but there is no movement.

Of course, those who are not satisfied with static 3D NeRFs and want to bring time into the equation have started working on 4D. This new frontier, 4D scene reconstruction, has emerged recently. The goal here is to not only capture 3D scenes but also to chronicle their change through time. This is achieved through the intricate interplay of correspondences across time, aka "time consistency."

The concept of reconstructing dynamic scenes in a manner that maintains correspondences across time is a gateway to numerous possibilities. While the challenge of reconstructing general dynamic objects from RGB inputs in a time-consistent manner remains relatively underexplored, its significance cannot be overstated. So, let us meet with SceNeRFlow.

SceNeRFlow can reconstruct a general non-rigid scene from multi-view video. Source: https://arxiv.org/pdf/2308.08258.pdf

SceNeRFlow offers the ability to not only view a scene from various angles but also to experience its temporal change seamlessly. It extracts more than just visual data; it encapsulates the very essence of scenes, their transformations, and their interactions.

The biggest challenge lies in establishing correspondences, a process to decode the underlying structure of a dynamic scene. It’s like assigning object locations in different time steps. SceNeRFlow tackles this problem using a time-invariant geometric model. 

Overview of SceNeRFlow. Source: https://arxiv.org/pdf/2308.08258.pdf

SceNeRFlow explores time consistency for large motions and dense 3D correspondences. Previous methods have mainly focused on novel-view synthesis, but SceNeRFlow takes a new approach. It seeks to understand scenes and their transformations holistically. It uses backward deformation modeling, a complex technique, to achieve this goal. It proposes a new method that allows backward deformation modeling to handle substantial non-rigid motion. This breakthrough bridges the gap between theory and practice.

SceNeRFlow starts with a series of multi-view RGB images captured over consecutive timestamps from fixed cameras with established extrinsic and intrinsic. This method enables reconstructing the scene’s essence. With a commitment to maintaining temporal alignment, SceNeRFlow forges a time-invariant NeRF-style canonical model that encapsulates both geometry and appearance, underpinned by time-evolving deformations. Operating in an online fashion, the method constructs an initial canonical model based on the first timestamp and then continuously tracks its change across the temporal input sequence. The outcome is a meticulously reconstructed scene that marries fluid motion with steadfast consistency, offering an intricate portrayal of the scene’s transformation over time.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.


The post Walk me Through Time: SceNeRFlow is an AI Method That Generates Time Consistent NeRFs appeared first on MarkTechPost.

How Can We Measure Uncertainty in Neural Radiance Fields? Introducing …

Creating 3D models provides a more immersive and realistic representation of scenes than 2D images. They allow viewers to explore and interact with the scene from different angles, providing a better understanding of the spatial layout and depth of information. 

These are fundamental for virtual reality (VR) and augmented reality (AR) applications. They enable the overlay of digital information onto the real world (AR) or the creation of entirely virtual environments (VR), enhancing user experiences in gaming, education, training, and various industries.

Neural Radiance Fields (NeRF) is a computer vision technique for 3D scene reconstruction and rendering. NeRF treats a scene as a 3D volume where each point in the volume has a corresponding color (radiance) and density. The neural network learns to predict the color and density of each point based on the 2D images taken from different viewpoints.

NeRFs have multiple applications like view synthesis and depth estimation, but learning from multiview images has inherent uncertainties. Current methods to quantify them are either heuristic or computationally demanding. Researchers at Google DeepMind, Adobe Research, and the University of Toronto introduced a new technique called BayesRays. 

It consists of a framework to evaluate uncertainty in any pretrained NeRF without modifying the training process. By adding a volumetric uncertainty field using spatial perturbations and a Bayesian Laplace approximation, they were able to overcome the limitations of NeRFs. Bayesian Laplace approximation is a mathematical method to approximate complex probability distributions with simpler multivariate Gaussian distributions.
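
As a reminder of what a Laplace approximation does in general, independent of this paper's uncertainty field, the sketch below fits a Gaussian at the mode of a one-dimensional log-density using its local curvature; the density itself is an arbitrary illustrative example.

import numpy as np
from scipy.optimize import minimize_scalar

# Unnormalized log-density of some 1D posterior (illustrative example).
def log_p(theta):
    return -0.5 * (theta - 2.0) ** 2 / 0.3 - 0.1 * theta ** 4

# 1. Find the mode (MAP estimate) by minimizing the negative log-density.
res = minimize_scalar(lambda t: -log_p(t))
theta_map = res.x

# 2. Estimate the curvature at the mode with a finite difference.
h = 1e-4
second_deriv = (log_p(theta_map + h) - 2 * log_p(theta_map) + log_p(theta_map - h)) / h ** 2

# 3. The Laplace approximation is a Gaussian centered at the mode whose
#    variance is the negative inverse curvature.
variance = -1.0 / second_deriv
print(f"N(mean={theta_map:.3f}, var={variance:.3f})")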

Their calculated uncertainties are statistically meaningful and can be rendered as additional color channels. Their method also outperforms previous works on key metrics like correlation to reconstructed depth errors. They use a plug-and-play probabilistic approach to quantify the uncertainty of any pre-trained NeRFs independent of their architecture. Their work provides a threshold to remove artifacts from pre-trained NeRFs in real time.

They say the intuition behind their method comes from using volumetric fields to model 3D scenes. Volumetric deformation fields are often used to manipulate implicitly represented objects. Their work is also similar to photogrammetry, where reconstruction uncertainty is often modeled by placing Gaussian distributions on the identified spatial positions.

At last, they say that their algorithm is limited to quantifying the uncertainty of NeRFs and cannot be trivially translated to other frameworks. However, their future work involves a similar deformation-based Laplace approximation formulated for more recent spatial representations like 3D Gaussian splatting.

Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.


The post How Can We Measure Uncertainty in Neural Radiance Fields? Introducing BayesRays: A Revolutionary Post-Hoc Framework for NeRFs appeared first on MarkTechPost.

Apple Researchers Propose a New Tensor Decomposition Model for Collabo …

The capacity to infer user preferences from past behaviors is crucial for effective personalized suggestions. The fact that many products don’t have star ratings makes this task exponentially more challenging. Past actions are generally interpreted in a binary form to indicate whether or not a user has interacted with a certain object in the past. Additional assumptions must be made based on this binary data to deduce the users’ preferences from such covert input. 

It’s reasonable to assume that viewers enjoy the content with which they’ve engaged and dismiss the content that hasn’t piqued their attention. This assumption, however, is rarely correct in actual use. It’s possible that a consumer isn’t engaging with a product because they are unaware it even exists. Therefore, it is more plausible to assume that users simply ignore or don’t care about the aspects that can’t be interacted with. 

Studies have assumed a tendency to favor products with which one is already familiar over those with which one is not. This idea formed the basis for Bayesian Personalized Ranking (BPR), a technique for making tailored recommendations. In BPR, the data is transformed into a three-dimensional binary tensor called D, where the first dimension represents the users.

A new Apple study created a variant of the popular Bayesian Personalized Ranking (BPR) model that does not rely on transitivity. For generalization, they propose an alternative tensor decomposition. They introduce Sliced Anti-symmetric Decomposition (SAD), a novel implicit-feedback-based model for collaborative filtering. Using a novel three-way tensor perspective of user-item interactions, SAD adds one more latent vector to each item, unlike conventional methods that estimate a latent representation of users (user vectors) and items (item vectors). To produce interactions between items when evaluating relative preferences, this new vector generalizes the preferences derived by regular dot products to generic inner products. When the vector collapses to 1, SAD becomes a state-of-the-art (SOTA) collaborative filtering model; in this research, the authors permit its value to be determined from data. The decision to allow the new item vector's values to exceed 1 has far-reaching consequences. The existence of cycles in pairwise comparisons is interpreted as evidence that users' mental models are not linear.
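
To ground the comparison, the sketch below shows a BPR-style pairwise preference built from user and item vectors, plus an extra per-item vector that reweights the inner product; when that extra vector is all ones, the score reduces to the usual dot product, mirroring the collapse behavior described above. This is an illustration of the idea only, not the paper's exact SAD parameterization.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 100, 50, 8

U = rng.normal(size=(n_users, d))   # user vectors
V = rng.normal(size=(n_items, d))   # item vectors
T = np.ones((n_items, d))           # extra per-item vectors; all ones -> plain dot product

def score(u, i):
    """Generalized inner product between user u and item i."""
    return np.sum(U[u] * T[i] * V[i])

def prefers(u, i, j):
    """BPR-style pairwise comparison: does user u prefer item i over item j?"""
    return score(u, i) > score(u, j)

print(prefers(0, 3, 7))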

The team presents a fast group coordinate descent method for SAD parameter estimation. Simple stochastic gradient descent (SGD) is used to obtain accurate parameter estimates rapidly. Using a simulation study, they first demonstrate the efficacy of SGD and the expressiveness of SAD. Then, using three freely available datasets, they pit SAD against seven alternative SOTA recommendation models. This work also shows that by incorporating previously ignored data and relationships between entities, the updated model provides more reliable and accurate results.

For this work, the researchers focus on collaborative filtering with implicit feedback. However, the applications of SAD are not limited to the aforementioned data types. Datasets with explicit ratings, for instance, contain partial orders that can be used directly during model fitting, as opposed to the current practice of evaluating model consistency post hoc.

Check out the Paper and Github link. All Credit For This Research Goes To the Researchers on This Project.


The post Apple Researchers Propose a New Tensor Decomposition Model for Collaborative Filtering with Implicit Feedback appeared first on MarkTechPost.

This AI Research Introduces Point-Bind: A 3D Multi-Modality Model Alig …

In the current technological landscape, 3D vision has emerged as a star on the rise, capturing the spotlight due to its rapid growth and evolution. This surge in interest can be largely attributed to the soaring demand for autonomous driving, enhanced navigation systems, advanced 3D scene comprehension, and the burgeoning field of robotics. To extend its application scenarios, numerous efforts have been made to incorporate 3D point clouds with data from other modalities, allowing for improved 3D understanding, text-to-3D generation, and 3D question answering. 

https://arxiv.org/abs/2309.00615

Researchers have introduced Point-Bind, a revolutionary 3D multi-modality model designed to seamlessly integrate point clouds with various data sources such as 2D images, language, audio, and video. Guided by the principles of ImageBind, this model constructs a unified embedding space that bridges the gap between 3D data and multi-modalities. This breakthrough enables a multitude of exciting applications, including but not limited to any-to-3D generation, 3D embedding arithmetic, and comprehensive 3D open-world understanding.

In the above image, we can see the overall pipeline of Point-Bind. Researchers first collect 3D-image-audio-text data pairs for contrastive learning, which aligns the 3D modality with the others, guided by ImageBind. With a joint embedding space, Point-Bind can be utilized for 3D cross-modal retrieval, any-to-3D generation, 3D zero-shot understanding, and developing a 3D large language model, Point-LLM.
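
The alignment step can be pictured as a standard contrastive (InfoNCE-style) objective between paired embeddings. The sketch below is a generic illustration with random stand-in embeddings, not the Point-Bind implementation.

import torch
import torch.nn.functional as F

batch, dim = 8, 512

# Stand-in embeddings: in Point-Bind these would come from a 3D point-cloud
# encoder and a frozen ImageBind encoder for the paired image/text/audio.
point_emb = F.normalize(torch.randn(batch, dim), dim=-1)
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)

temperature = 0.07
logits = point_emb @ image_emb.t() / temperature   # pairwise similarities
targets = torch.arange(batch)                      # i-th point matches i-th image

# Symmetric InfoNCE loss pulls matched pairs together in the joint space.
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
print(loss.item())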

The main contributions of Point-Bind in this study include:

Aligning 3D with ImageBind: Within a joint embedding space, Point-Bind firstly aligns 3D point clouds with multi-modalities guided by ImageBind, including 2D images, video, language, audio, etc. 

Any-to-3D Generation: Based on existing text-to-3D generative models, Point-Bind enables 3D shape synthesis conditioned on any modality, i.e., text/image/audio/point-to-mesh generation.

3D Embedding-space Arithmetic: The researchers observe that 3D features from Point-Bind can be added to features from other modalities to incorporate their semantics, achieving composed cross-modal retrieval (a toy sketch of this idea follows the list below). 

3D Zero-shot Understanding: Point-Bind attains state-of-the-art performance for 3D zero-shot classification. The approach also supports audio-referred 3D open-world understanding, in addition to text reference.
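
The embedding-arithmetic idea is easiest to picture with a toy example: add a 3D feature to an audio feature in a shared space and retrieve whichever candidate embedding lies closest by cosine similarity. The snippet below is a generic numpy illustration of that retrieval pattern, not Point-Bind's actual code.

import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

# Pretend these came from a shared multi-modal embedding space.
point_cloud_emb = normalize(rng.normal(size=512))   # feature of a 3D shape
audio_emb = normalize(rng.normal(size=512))         # feature of an audio clip
candidates = {name: normalize(rng.normal(size=512)) for name in ["boat", "car", "piano"]}

# Composed query: add the two modalities, renormalize, then retrieve by cosine similarity.
query = normalize(point_cloud_emb + audio_emb)
best = max(candidates, key=lambda name: float(query @ candidates[name]))
print(best)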

https://arxiv.org/abs/2309.00615

The researchers leverage Point-Bind to develop a 3D large language model (LLM), termed Point-LLM, which fine-tunes LLaMA to achieve 3D question answering and multi-modal reasoning. The overall pipeline of Point-LLM can be seen in the image above. 

The main contributions of Point-LLM include:

Point-LLM for 3D Question Answering: Using Point-Bind, the researchers introduce Point-LLM, the first 3D LLM that responds to instructions with 3D point cloud conditions, supporting both English and Chinese.

Data and Parameter Efficiency: The method uses only public vision-language data for tuning, without any 3D instruction data, and adopts parameter-efficient fine-tuning techniques, saving extensive resources.

3D and Multi-modal Reasoning: Via the joint embedding space, Point-LLM can generate descriptive responses by reasoning over a combination of 3D and multi-modal input, e.g., a point cloud together with an image or audio. 

Future work will focus on aligning multi-modality with more diverse 3D data, such as indoor and outdoor scenes, allowing for wider application scenarios. 


The post This AI Research Introduces Point-Bind: A 3D Multi-Modality Model Aligning Point Clouds with 2D Image, Language, Audio, and Video appeared first on MarkTechPost.

Top AI Tools for Data Analysts 2023

Tableau

Tableau is an interactive analytics and data visualization platform, and one of its main selling points is that it can be used by people unfamiliar with programming. Users can build and distribute reports across desktop and mobile devices, and reports can be viewed and analyzed in a web browser or embedded in an app. This works whether Tableau is hosted in the cloud or on your own servers. VizQL, Tableau's query language, translates the dashboard and visualization components that users drag and drop into database queries, so little performance tuning is required on the end user's part. 

Microsoft Power BI

Microsoft Power BI, a powerful business intelligence platform that lets users filter through data and visualize it for insights, is another top AI tool for data analysis. Users may import data from practically anywhere into the platform and immediately create reports and dashboards. To further aid in data analysis, Microsoft Power BI also allows the development of machine learning models. Multiple integrations are supported, including a built-in connection to Excel and one with Azure Machine Learning. Businesses that already use Microsoft Office products will find it simple to use Power BI for reporting, visualization, and dashboard creation. 

Polymer

Polymer is a strong AI technology that data analysts may use to streamline and adapt data into a powerful database with minimal human intervention. One of the finest things about Polymer is that it doesn’t require any code, just like other fantastic AI tools. The software employs AI to examine information and improve understanding. All of this is possible with Polymer without requiring a lengthy learning curve. Users may easily gain insights from their data by uploading a spreadsheet to the platform and instantly converting it into a simplified database. Polymer boasts that it is the only software that “instantly” makes a user’s spreadsheets “searchable, intelligent, and interactive.” Data analysts, digital marketers, content developers, and more use the service. 

Akkio

Akkio is a business analytics and forecasting tool, and an entry-level option designed for users who are just getting started with their data. Users upload their data and choose the variable they wish to predict, and Akkio constructs a neural network specifically for that variable. It has great potential as a tool for predictive analysis, advertising, and closing deals. Like many of the other tools on this list, Akkio requires no programming knowledge. Akkio uses 80 percent of the submitted data for training and the remaining 20 percent for validation. Rather than simply returning predictions, the tool provides an accuracy grade for its models and filters out false positives. 

MonkeyLearn

MonkeyLearn is another no-code platform with AI data analysis features that help users visualize and rearrange data. Multiple artificial intelligence (AI) powered text analysis tools are available in MonkeyLearn, allowing users to analyze and visualize data immediately. In addition, it can be used to implement text classifiers and text extractors, which facilitate automated topic- and intent-based data sorting and the extraction of product features and user data, respectively. MonkeyLearn's use of machine learning to streamline business processes and analyze text eliminates the need for countless man-hours of data entry. The ability to automatically pull data from incoming requests is a popular feature in MonkeyLearn. Data is sorted and categorized based on keywords and advanced text analysis, and relevant content is highlighted and filed away accordingly. 

IBM Watson Analytics

IBM Watson Analytics is a cloud-based business intelligence and analytics tool designed to aid organizations in making more informed decisions based on the results of analyses performed on their data. Natural language processing (NLP), predictive analytics (PA), and text mining are only some of the AI and ML methods used by Watson Analytics. Watson Analytics can be utilized to analyze semi-structured data found in documents and webpages and unstructured data found in social media and sensor data. Watson Analytics can also aggregate and analyze data from many sources to reveal previously unseen patterns. IBM Watson Analytics is a potent resource for organizations looking to understand their data better and make informed decisions. The platform is adaptable and can be scaled to accommodate businesses of any size.

RapidMiner 

RapidMiner is an all-inclusive data science platform that streamlines the development, deployment, and management of data science projects. Its data preparation, modeling, evaluation, and deployment tools are useful for businesses of all sizes. One of its most appealing features is the visual workflow designer, which makes it simple for users without coding skills to design and share data science projects. An extensive library of pre-built algorithms and machine learning models also saves time and effort when creating projects. These capabilities make RapidMiner a robust platform of choice for creating, deploying, and overseeing data science initiatives of any size.

DataRobot 

DataRobot is a web-based service that streamlines the entire machine learning model development and deployment process. It's an effective way for firms of any size to improve their data-driven decision-making. By automating the time-consuming steps involved in developing and releasing ML models, DataRobot allows organizations to construct models without needing to hire or train data scientists. The platform is simple enough for those unfamiliar with machine learning, and its drag-and-drop interface simplifies model development and deployment. Fraud detection, churn prediction, and product suggestions are just some of its many commercial uses.

KNIME 

KNIME is a free and open-source IDE for those interested in machine learning and data science. Users can construct data analysis and transformation workflows covering data cleaning and preparation, data exploration and visualization, machine learning model construction, predictive analytics, and dashboard generation. In addition to data scientists and analysts, KNIME is also useful for engineers. It's simple to pick up and use, and a sizable community of users and developers improves it and can provide assistance if problems arise. KNIME is a wonderful option if you need a data analytics solution that is both robust and versatile.

Alteryx 

Alteryx is a powerful data analytics platform useful for businesses of all sizes. Its features include data mining, machine learning, and predictive analytics. Alteryx's strength is its ability to automate mundane processes, which can improve the precision of a company's data analysis and save a lot of time and money. It also connects easily to external data sources, so businesses can analyze data from databases, spreadsheets, social media, and other unstructured sources. Alteryx is a flexible program with many applications, such as optimizing marketing efforts, detecting fraud more accurately, and making more informed financial decisions.


The post Top AI Tools for Data Analysts 2023 appeared first on MarkTechPost.

Implement smart document search index with Amazon Textract and Amazon …

For modern companies that deal with enormous volumes of documents such as contracts, invoices, resumes, and reports, efficiently processing and retrieving pertinent data is critical to maintaining a competitive edge. However, traditional methods of storing and searching for documents can be time-consuming and often result in a large effort to find a specific document, especially when it includes handwriting. What if there was a way to process documents intelligently and make them searchable with high accuracy?
This is made possible with Amazon Textract, AWS’s Intelligent Document Processing service, coupled with the fast search capabilities of OpenSearch. In this post, we’ll take you on a journey to rapidly build and deploy a document search indexing solution that helps your organization to better harness and extract insights from documents.
Whether you’re in Human Resources looking for specific clauses in employee contracts, or a financial analyst sifting through a mountain of invoices to extract payment data, this solution is tailored to empower you to access the information you need with unprecedented speed and accuracy.
With the proposed solution, your documents are automatically ingested, their content parsed and subsequently indexed into a highly responsive and scalable OpenSearch index.
We’ll cover how technologies such as Amazon Textract, AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service can be integrated into a workflow that seamlessly processes documents. Then we dive into indexing this data into OpenSearch and demonstrate the search capabilities that become available at your fingertips.
Whether your organization is taking the first steps into the digital transformation era or is an established giant seeking to turbocharge information retrieval, this guide is your compass to navigating the opportunities that AWS Intelligent Document Processing and OpenSearch offer.
The implementation used in this post utilizes the Amazon Textract IDP CDK constructs – AWS Cloud Development Kit (CDK) components to define infrastructure for Intelligent Document Processing (IDP) workflows – which allow you to build use case-specific, customizable IDP workflows. The IDP CDK constructs and samples are a collection of components to enable definition of IDP processes on AWS and are published to GitHub. The main concepts used are the AWS Cloud Development Kit (CDK) constructs, the actual CDK stacks, and AWS Step Functions. The workshop Use machine learning to automate and process documents at scale is a good starting point to learn more about customizing workflows and using the other sample workflows as a base for your own.
Solution overview
In this solution, we focus on indexing documents into an OpenSearch index for quick search-and-retrieval of information and documents. Documents in PDF, TIFF, JPEG or PNG format are put in an Amazon Simple Storage Service (Amazon S3) bucket and subsequently indexed into OpenSearch using this Step Functions workflow.

Figure 1: The Step Functions OpenSearch workflow

The OpenSearchWorkflow-Decider looks at the document and verifies that it is one of the supported MIME types (PDF, TIFF, PNG, or JPEG). It consists of one AWS Lambda function.
The DocumentSplitter generates chunks of at most 2,500 pages from each document. This means that even though Amazon Textract supports documents of up to 3,000 pages, you can pass in documents with many more pages; the process still works, puts the pages into OpenSearch, and creates correct page numbers. The DocumentSplitter is implemented as an AWS Lambda function.
The Map State processes each chunk in parallel.
The TextractAsync task calls Amazon Textract using the asynchronous Application Programming Interface (API), following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and OutputConfig to store the Amazon Textract JSON output in a customer Amazon S3 bucket. It consists of two AWS Lambda functions: one that submits the document for processing and one that is triggered by the Amazon SNS notification (a minimal code sketch of this asynchronous call follows the workflow steps below).
Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.
The Step Functions context is enriched with information that should also be searchable in the OpenSearch index in the SetMetaData step. The sample implementation adds ORIGIN_FILE_NAME, START_PAGE_NUMBER, and ORIGIN_FILE_URI. You can add any information to enrich the search experience, like information from other backend systems, specific IDs or classification information.
The GenerateOpenSearchBatch takes the generated Amazon Textract output JSON, combines it with the information from the context set by SetMetaData and prepares a file that is optimized for batch import into OpenSearch.
In the OpenSearchPushInvoke, this batch import file is sent into the OpenSearch index and available for search. This AWS Lambda function is connected with the aws-lambda-opensearch construct from the AWS Solutions library, using m6g.large.search instances, OpenSearch version 2.7, and an Amazon Elastic Block Store (Amazon EBS) volume configured as General Purpose 2 (GP2) with 200 GB. You can change the OpenSearch configuration according to your requirements.
The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions Quota of Maximum input or output size for a task, state, or execution.
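
For orientation, the following is a minimal boto3 sketch of the kind of asynchronous Amazon Textract call that the TextractAsync step wraps, with an Amazon SNS notification channel and an OutputConfig pointing at your own bucket. The bucket, key, topic, and role values are placeholders; in the deployed solution, the IDP CDK constructs make this call for you.

import boto3

textract = boto3.client("textract")

# Submit the document asynchronously; Textract writes JSON output to the
# bucket/prefix in OutputConfig and notifies the SNS topic when done.
response = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-upload-bucket", "Name": "uploads/sample.pdf"}},
    OutputConfig={"S3Bucket": "my-textract-output-bucket", "S3Prefix": "textract-output/"},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-done",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSPublishRole",
    },
)
print(response["JobId"])  # pass this JobId to GetDocumentTextDetection after the SNS notification
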
Prerequisites
To deploy the samples, you need an AWS account, the AWS Cloud Development Kit (AWS CDK), a current Python version, and Docker. You need permissions to deploy AWS CloudFormation templates, push to the Amazon Elastic Container Registry (Amazon ECR), and create AWS Identity and Access Management (IAM) roles, AWS Lambda functions, Amazon S3 buckets, AWS Step Functions state machines, an Amazon OpenSearch Service cluster, and an Amazon Cognito user pool. Make sure your AWS CLI environment is set up with the corresponding permissions.
You can also spin up an AWS Cloud9 instance with AWS CDK, Python, and Docker pre-installed to initiate the deployment.
Walkthrough
Deployment

After you set up the prerequisites, you need to first clone the repository:

git clone https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws.git

Then cd into the repository folder and install the dependencies:

cd guidance-for-low-code-intelligent-document-processing-on-aws/

pip install -r requirements.txt

Deploy the OpenSearchWorkflow stack:

cdk deploy OpenSearchWorkflow

The deployment takes around 25 minutes with the default configuration settings from the GitHub samples, and creates a Step Functions workflow, which is invoked when a document is put into an Amazon S3 bucket/prefix and is subsequently processed until the content of the document is indexed in an OpenSearch cluster.
The following is sample output, including useful links and information, generated by the cdk deploy OpenSearchWorkflow command:

OpenSearchWorkflow.CognitoUserPoolLink = https://us-east-1.console.aws.amazon.com/cognito/v2/idp/user-pools/us-east-1_1234abcdef/users?region=us-east-1
OpenSearchWorkflow.DocumentQueueLink = https://us-east-1.console.aws.amazon.com/sqs/v2/home?region=us-east-1#/queues/https%3A%2F%2Fsqs.us-east-1.amazonaws.com%2F123412341234%2FOpenSearchWorkflow-ExecutionThrottleDocumentQueueABC1234-ABCDEFG1234.fifo
OpenSearchWorkflow.DocumentUploadLocation = s3://opensearchworkflow-opensearchworkflowbucketabcdef1234/uploads/
OpenSearchWorkflow.OpenSearchDashboard = https://search-idp-cdk-opensearch-abcdef1234.us-east-1.es.amazonaws.com/states/_dashboards
OpenSearchWorkflow.OpenSearchLink = https://us-east-1.console.aws.amazon.com/aos/home?region=us-east-1#/opensearch/domains/idp-cdk-opensearch
OpenSearchWorkflow.StepFunctionFlowLink = https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/arn:aws:states:us-east-1:123412341234:stateMachine:OpenSearchWorkflow12341234

This information is also available in the AWS CloudFormation Console.
When a new document is placed under the OpenSearchWorkflow.DocumentUploadLocation, a new Step Functions workflow is started for this document.
To check the status of this document, the OpenSearchWorkflow.StepFunctionFlowLink provides a link to the list of Step Functions executions in the AWS Management Console, displaying the status of the document processing for each document uploaded to Amazon S3. The tutorial Viewing and debugging executions on the Step Functions console provides an overview of the components and views in the AWS Console.
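
If you prefer to check execution status programmatically instead of in the console, a minimal boto3 sketch such as the following lists recent executions and their status; the state machine ARN shown is the one from the sample output above and is a placeholder for your own.

import boto3

sfn = boto3.client("stepfunctions")
# Taken from the sample OpenSearchWorkflow.StepFunctionFlowLink output above (replace with yours).
state_machine_arn = "arn:aws:states:us-east-1:123412341234:stateMachine:OpenSearchWorkflow12341234"

# Print the status of the most recent executions (RUNNING, SUCCEEDED, FAILED, ...).
response = sfn.list_executions(stateMachineArn=state_machine_arn, maxResults=50)
for execution in response["executions"]:
    print(execution["name"], execution["status"], execution["startDate"])
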
Testing

First, test using a sample file.

aws s3 cp s3://amazon-textract-public-content/idp-cdk-samples/moby-dick-hidden-paystub-and-w2.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`OpenSearchWorkflow-DocumentUploadLocation`].Value' --output text)

After selecting the link to the Step Functions workflow, or opening the AWS Management Console and going to the Step Functions service page, you can look at the different workflow invocations.

Figure 2: The Step Functions executions list

Take a look at the currently running sample document execution, where you can follow the execution of the individual workflow tasks.

Figure 3: One document Step Functions workflow execution

Search
Once the process has finished, we can validate that the document is indexed in the OpenSearch index.

To do so, we first create an Amazon Cognito user. Amazon Cognito is used for authentication of users against the OpenSearch index. Select the link named OpenSearchWorkflow.CognitoUserPoolLink in the output from cdk deploy (or look at the AWS CloudFormation output in the AWS Management Console).

Figure 4: The Cognito user pool

Next, select the Create user button, which directs you to a page to enter a username and a password for accessing the OpenSearch Dashboard.

Figure 5: The Cognito create user dialog

After choosing Create user, you can continue to the OpenSearch Dashboard by choosing the OpenSearchWorkflow.OpenSearchDashboard link from the CDK deployment output. Log in using the previously created username and password. The first time you log in, you have to change the password.
Once logged in to the OpenSearch Dashboard, select the Stack Management section, followed by Index Patterns, to create a search index.

Figure 6: OpenSearch Dashboards Stack Management

Figure 7: OpenSearch Index Patterns overview

The default name for the index is papers-index and an index pattern name of papers-index* will match that.

Figure 8: Define the OpenSearch index pattern

After clicking Next step, select timestamp as the Time field and Create index pattern.

Figure 9: OpenSearch index pattern time field

Now, from the menu, select Discover.

Figure 10: OpenSearch Discover

In most cases, you need to change the time span according to your last ingest. The default is 15 minutes, and often there has been no activity in the last 15 minutes. In this example, we changed it to 15 days to visualize the ingest.

Figure 11: OpenSearch timespan change

Now you can start to search. Because a novel was indexed, you can search for terms like call me Ishmael and see the results.

Figure 12: OpenSearch search term

In this case, the term call me Ishmael appears on page 6 of the document at the given Uniform Resource Identifier (URI), which points to the Amazon S3 location of the file. This makes it faster to identify documents and find information across a large corpus of PDF, TIFF or image documents, compared to manually skipping through them.
Running at scale
To estimate the scale and duration of an indexing process, the implementation was tested with 93,997 documents and a total of 1,583,197 pages (an average of 16.84 pages per document, with the largest file having 3,755 pages), all of which were indexed into OpenSearch. Processing all files and indexing them into OpenSearch took 5.5 hours in the US East (N. Virginia, us-east-1) Region using default Amazon Textract service quotas. The graph below shows an initial test at 18:00, followed by the main ingest at 21:00, with everything done by 2:30.

Figure 13: OpenSearch indexing overview

For the processing, the tcdk.SFExecutionsStartThrottle was set to an executions_concurrency_threshold=550, which means that concurrent document processing workflows are capped at 550 and excess requests are queued to an Amazon SQS First-In-First-Out (FIFO) queue, which is subsequently drained when current workflows finish. The threshold of 550 is based on the Amazon Textract service quota of 600 in the us-east-1 Region. Therefore, the queue depth and age of oldest message are metrics worth monitoring.

Figure 14: Amazon SQS monitoring

In this test, all documents were uploaded to Amazon S3 at once, therefore the Approximate Number of Messages Visible has a steep increase and then a slow decline as no new documents are ingested. The Approximate Age Of Oldest Message increases until all messages are processed. The Amazon SQS MessageRetentionPeriod is set to 14 days. For very long running backlog processing that could exceed 14 days processing, start with processing a smaller subset of representative documents and monitor the duration of execution to estimate how many documents you can pass in before exceeding 14 days. The Amazon SQS CloudWatch metrics look similar for a use case of processing a large backlog of documents, which is ingested at once then processed fully. If your use case is a steady flow of documents, both metrics, the Approximate Number of Messages Visible and the Approximate Age Of Oldest Message will be more linear. You can also use the threshold parameter to mix a steady load with backlog processing and allocate capacity according to your processing needs.
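
As a starting point for such monitoring, the following is a minimal boto3 sketch that pulls both queue metrics from Amazon CloudWatch; the queue name is a placeholder taken from the sample stack output above.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
queue_name = "OpenSearchWorkflow-ExecutionThrottleDocumentQueueABC1234-ABCDEFG1234.fifo"  # placeholder

# Maximum value per 5-minute period over the last 24 hours for both queue metrics.
for metric in ["ApproximateNumberOfMessagesVisible", "ApproximateAgeOfOldestMessage"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=datetime.utcnow() - timedelta(hours=24),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    print(metric, [int(d["Maximum"]) for d in datapoints])
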
Another metric to monitor is the health of the OpenSearch cluster, which you should set up according to the Operational best practices for Amazon OpenSearch Service. The default deployment uses m6g.large.search instances.

Figure 15: OpenSearch monitoring

Here is a snapshot of the key performance indicators (KPIs) for the OpenSearch cluster: no errors, and a constant indexing data rate and latency.
The Step Functions workflow executions show the state of processing for each individual document. If you see executions in Failed state, then select the details. A good metric to monitor is the AWS CloudWatch Automatic Dashboard for Step Functions, which exposes some of the Step Functions CloudWatch metrics.

Figure 16: Step Functions monitoring executions succeeded

In this AWS CloudWatch Dashboard graph, you see the successful Step Functions executions over time.

Figure 17: Step Functions monitoring executions failed

And this one shows the failed executions. These are worth investigating through the AWS Console Step Functions overview.
The following screenshot shows one example of a failed execution due to the origin file having a size of 0, which makes sense because the file has no content and could not be processed. It is important to filter failed processes and visualize failures so that you can go back to the source document and validate the root cause.

Figure 18: Step Functions failed workflow

Other failures might include documents that are not of MIME type application/pdf, image/png, image/jpeg, or image/tiff, because other document types are not supported by Amazon Textract.
Cost
The total cost of ingesting 1,583,278 pages was split across AWS services used for the implementation. The following list serves as approximate numbers, because your actual cost and processing duration vary depending on the size of documents, the number of pages per document, the density of information in the documents, and the AWS Region. Amazon DynamoDB was consuming $0.55, Amazon S3 $3.33, OpenSearch Service $14.71, Step Functions $17.92, AWS Lambda $28.95, and Amazon Textract $1,849.97. Also, keep in mind that the deployed Amazon OpenSearch Service cluster is billed by the hour and will accumulate higher cost when run over a period of time.
Modifications
Most likely, you want to modify the implementation and customize it for your use case and documents. The workshop Use machine learning to automate and process documents at scale presents a good overview of how to manipulate the actual workflows, changing the flow, and adding new components. To add custom fields to the OpenSearch index, look at the SetMetaData task in the workflow, which uses the set-manifest-meta-data-opensearch AWS Lambda function to add metadata to the context; this metadata is then added as fields to the OpenSearch index. Any metadata information will become part of the index.
Cleaning up
Delete the example resources if you no longer need them, to avoid incurring future costs, using the following command:

cdk destroy OpenSearchWorkflow

in the same environment as the cdk deploy command. Beware that this removes everything, including the OpenSearch cluster, all documents, and the Amazon S3 bucket. If you want to maintain that information, back up your Amazon S3 bucket and create an index snapshot from your OpenSearch cluster. If you processed many files, you may have to empty the Amazon S3 bucket first using the AWS Management Console (after taking a backup or syncing the files to a different bucket if you want to retain the information), because the cleanup function can otherwise time out and the destruction of the AWS CloudFormation stack can fail.
Conclusion
In this post, we showed you how to deploy a full-stack solution that ingests a large number of documents into an OpenSearch index, ready to be used for search use cases. The individual components of the implementation were discussed, as well as scaling considerations, cost, and modification options. All code is accessible as open source on GitHub as IDP CDK samples and IDP CDK constructs to build your own solutions from scratch. As a next step, you can start to modify the workflow, add information to the documents in the search index, and explore the IDP workshop. Please comment below on your experience and ideas to expand the current solution.

About the Author
Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. Currently, he’s obsessed with extracting information from documents.

Semantic image search for articles using Amazon Rekognition, Amazon Sa …

Digital publishers are continuously looking for ways to streamline and automate their media workflows in order to generate and publish new content as rapidly as they can.
Publishers can have repositories containing millions of images and in order to save money, they need to be able to reuse these images across articles. Finding the image that best matches an article in repositories of this scale can be a time-consuming, repetitive, manual task that can be automated. It also relies on the images in the repository being tagged correctly, which can also be automated (for a customer success story, refer to Aller Media Finds Success with KeyCore and AWS).
In this post, we demonstrate how to use Amazon Rekognition, Amazon SageMaker JumpStart, and Amazon OpenSearch Service to solve this business problem. Amazon Rekognition makes it easy to add image analysis capability to your applications without any machine learning (ML) expertise and comes with various APIs to fulfill use cases such as object detection, content moderation, face detection and analysis, and text and celebrity recognition, which we use in this example. SageMaker JumpStart is a low-code service that comes with pre-built solutions, example notebooks, and many state-of-the-art, pre-trained models from publicly available sources that are straightforward to deploy with a single click into your AWS account. These models have been packaged to be securely and easily deployable via Amazon SageMaker APIs. The new SageMaker JumpStart Foundation Hub allows you to easily deploy large language models (LLMs) and integrate them with your applications. OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch. OpenSearch Service allows you to store vectors and other data types in an index, and offers rich functionality that allows you to search for documents using vectors and measure their semantic relatedness, which we use in this post.
The end goal of this post is to show how we can surface a set of images that are semantically similar to some text, be that an article or TV synopsis.
The following screenshot shows an example of taking a mini article as your search input, rather than using keywords, and being able to surface semantically similar images.
Overview of solution
The solution is divided into two main sections. First, you extract label and celebrity metadata from the images using Amazon Rekognition. You then generate an embedding of the metadata using an LLM. You store the celebrity names and the embedding of the metadata in OpenSearch Service. In the second main section, you have an API to query your OpenSearch Service index for images, using OpenSearch's intelligent search capabilities to find images that are semantically similar to your text.
This solution uses our event-driven services Amazon EventBridge, AWS Step Functions, and AWS Lambda to orchestrate the process of extracting metadata from the images using Amazon Rekognition. Amazon Rekognition will perform two API calls to extract labels and known celebrities from the image.
The Amazon Rekognition celebrity recognition (RecognizeCelebrities) API returns a number of elements in the response. For this post, you use the following:

Name, Id, and Urls – The celebrity name, a unique Amazon Rekognition ID, and list of URLs such as the celebrity’s IMDb or Wikipedia link for further information.
MatchConfidence – A match confidence score that can be used to control API behavior. We recommend applying a suitable threshold to this score in your application to choose your preferred operating point. For example, by setting a threshold of 99%, you can eliminate more false positives but may miss some potential matches.

Your second API call, the Amazon Rekognition label detection (DetectLabels) API, also returns a number of elements in the response. You use the following:

Name – The name of the detected label
Confidence – The level of confidence in the label assigned to a detected object
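
As a minimal sketch of what these two Amazon Rekognition calls look like with boto3 (the bucket and key are placeholders):

import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-image-bucket", "Name": "images/example.jpg"}}  # placeholder location

# Celebrity recognition: Name, Id, Urls, and MatchConfidence per recognized face.
celebs = rekognition.recognize_celebrities(Image=image)
for celebrity in celebs["CelebrityFaces"]:
    print(celebrity["Name"], celebrity["Id"], celebrity["MatchConfidence"], celebrity["Urls"])

# Label detection: Name and Confidence per detected label.
labels = rekognition.detect_labels(Image=image, MaxLabels=20, MinConfidence=80)
for label in labels["Labels"]:
    print(label["Name"], label["Confidence"])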

A key concept in semantic search is embeddings. A word embedding is a numerical representation of a word or group of words, in the form of a vector. When you have many vectors, you can measure the distance between them, and vectors which are close in distance are semantically similar. Therefore, if you generate an embedding of all of your images’ metadata, and then generate an embedding of your text, be that an article or tv synopsis for example, using the same model, you can then find images which are semantically similar to your given text.
There are many models available within SageMaker JumpStart to generate embeddings. For this solution, you use GPT-J 6B Embedding from Hugging Face. It produces high-quality embeddings and has one of the top performance metrics according to Hugging Face’s evaluation results. Amazon Bedrock is another option, still in preview, where you could choose Amazon Titan Text Embeddings model to generate the embeddings.
You use the GPT-J pre-trained model from SageMaker JumpStart to create an embedding of the image metadata and store this as a k-NN vector in your OpenSearch Service index, along with the celebrity name in another field.
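As a minimal sketch of what such an index could look like, the following creates it with the opensearch-py client, assuming a 4,096-dimensional embedding (the hidden size of GPT-J 6B) and the same field names used in the query snippet later in this post; the host is a placeholder, and in this solution the domain sits behind Amazon Cognito authentication.

from opensearchpy import OpenSearch

# Placeholder connection details; authentication is handled separately in the solution.
os_client = OpenSearch(hosts=[{"host": "search-my-domain.us-east-1.es.amazonaws.com", "port": 443}],
                       use_ssl=True)

index_body = {
    "mappings": {
        "properties": {
            "image_vector": {"type": "knn_vector", "dimension": 4096},  # GPT-J embedding size (assumed)
            "celebrities": {"type": "text"},                            # celebrity names, if any
        }
    }
}
os_client.indices.create(index="images", body=index_body)
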
The second part of the solution is to return the top 10 images to the user that are semantically similar to their text, be this an article or tv synopsis, including any celebrities if present. When choosing an image to accompany an article, you want the image to resonate with the pertinent points from the article. SageMaker JumpStart hosts many summarization models which can take a long body of text and reduce it to the main points from the original. For the summarization model, you use the AI21 Labs Summarize model. This model provides high-quality recaps of news articles and the source text can contain roughly 10,000 words, which allows the user to summarize the entire article in one go.
To detect if the text contains any names, potentially known celebrities, you use Amazon Comprehend which can extract key entities from a text string. You then filter by the Person entity, which you use as an input search parameter.
Then you take the summarized article and generate an embedding to use as another input search parameter. It’s important to note that you use the same model deployed on the same infrastructure to generate the embedding of the article as you did for the images. You then use Exact k-NN with scoring script so that you can search by two fields: celebrity names and the vector that captured the semantic information of the article. Refer to this post, Amazon OpenSearch Service’s vector database capabilities explained, on the scalability of Score script and how this approach on large indexes may lead to high latencies.
Walkthrough
The following diagram illustrates the solution architecture.

Following the numbered labels:

You upload an image to an Amazon S3 bucket
Amazon EventBridge listens to this event, and then triggers an AWS Step Functions execution
The Step Functions workflow takes the image as input and extracts the label and celebrity metadata
The AWS Lambda function takes the image metadata and generates an embedding
The Lambda function then inserts the celebrity name (if present) and the embedding as a k-NN vector into an OpenSearch Service index
Amazon S3 hosts a simple static website, served by an Amazon CloudFront distribution. The front-end user interface (UI) allows you to authenticate with the application using Amazon Cognito to search for images
You submit an article or some text via the UI
Another Lambda function calls Amazon Comprehend to detect any names in the text
The function then summarizes the text to get the pertinent points from the article
The function generates an embedding of the summarized article
The function then searches OpenSearch Service image index for any image matching the celebrity name and the k-nearest neighbors for the vector using cosine similarity
Amazon CloudWatch and AWS X-Ray give you observability into the end-to-end workflow to alert you of any issues.

Extract and store key image metadata
The Amazon Rekognition DetectLabels and RecognizeCelebrities APIs give you the metadata from your images—text labels you can use to form a sentence to generate an embedding from. The article gives you a text input that you can use to generate an embedding.
Generate and store word embeddings
The following figure demonstrates plotting the vectors of our images in a 2-dimensional space, where for visual aid, we have classified the embeddings by their primary category.

You also generate an embedding of this newly written article, so that you can search OpenSearch Service for the nearest images to the article in this vector space. Using the k-nearest neighbors (k-NN) algorithm, you define how many images to return in your results.
Zooming in on the preceding figure, the vectors are ranked based on their distance from the article, and the K-nearest images are returned, where K is 10 in this example.

OpenSearch Service offers the capability to store large vectors in an index, and also offers the functionality to run queries against the index using k-NN, such that you can query with a vector to return the k-nearest documents that have vectors in close distance using various measurements. For this example, we use cosine similarity.
Detect names in the article
You use Amazon Comprehend, an AI natural language processing (NLP) service, to extract key entities from the article. In this example, you use Amazon Comprehend to extract entities and filter by the entity Person, which returns any names that Amazon Comprehend can find in the journalist story, with just a few lines of code:

import boto3

comprehend_client = boto3.client("comprehend")

def get_celebrities(payload):
    # Detect entities in the combined input text and keep only PERSON entities
    response = comprehend_client.detect_entities(
        Text=" ".join(payload["text_inputs"]),
        LanguageCode="en",
    )
    celebrities = ""
    for entity in response["Entities"]:
        if entity["Type"] == "PERSON":
            celebrities += entity["Text"] + " "
    return celebrities

In this example, you upload an image to Amazon Simple Storage Service (Amazon S3), which triggers a workflow where you are extracting metadata from the image including labels and any celebrities. You then transform that extracted metadata into an embedding and store all of this data in OpenSearch Service.
Summarize the article and generate an embedding
Summarizing the article is an important step to make sure that the word embedding is capturing the pertinent points of the article, and therefore returning images that resonate with the theme of the article.
The AI21 Labs Summarize model is very simple to use, requiring no prompt and just a few lines of code:

import os
import ai21

def summarise_article(payload):
    # The SageMaker endpoint name for the AI21 Summarize model is read from the environment
    sagemaker_endpoint_summarise = os.environ["SAGEMAKER_ENDPOINT_SUMMARIZE"]
    response = ai21.Summarize.execute(
        source=payload,
        sourceType="TEXT",
        destination=ai21.SageMakerDestination(sagemaker_endpoint_summarise),
    )
    response_summary = response.summary
    return response_summary

You then use the GPT-J model to generate the embedding:

import json
import os
import boto3

sm_runtime_client = boto3.client("sagemaker-runtime")

def get_vector(payload_summary):
    # Invoke the GPT-J embedding endpoint deployed via SageMaker JumpStart
    sagemaker_endpoint = os.environ["SAGEMAKER_ENDPOINT_VECTOR"]
    response = sm_runtime_client.invoke_endpoint(
        EndpointName=sagemaker_endpoint,
        ContentType="application/json",
        Body=json.dumps(payload_summary).encode("utf-8"),
    )
    response_body = json.loads(response["Body"].read())
    return response_body["embedding"][0]

You then search OpenSearch Service for your images. The following is an example snippet of that query:

import awswrangler as wr

# os_client is an OpenSearch connection created elsewhere in the solution
def search_document_celeb_context(person_names, vector):
    # Exact k-NN with a scoring script: match on celebrity names, then rank by
    # cosine similarity between the article vector and the stored image vectors
    results = wr.opensearch.search(
        client=os_client,
        index="images",
        search_body={
            "size": 10,
            "query": {
                "script_score": {
                    "query": {
                        "match": {"celebrities": person_names}
                    },
                    "script": {
                        "lang": "knn",
                        "source": "knn_score",
                        "params": {
                            "field": "image_vector",
                            "query_value": vector,
                            "space_type": "cosinesimil",
                        },
                    },
                }
            },
        },
    )
    return results.drop(columns=["image_vector"]).to_dict()

The architecture contains a simple web app to represent a content management system (CMS).
For an example article, we used the following input:
“Werner Vogels loved travelling around the globe in his Toyota. We see his Toyota come up in many scenes as he drives to go and meet various customers in their home towns.”

None of the images have any metadata with the word “Toyota,” but the semantics of the word “Toyota” are synonymous with cars and driving. Therefore, with this example, we can demonstrate how we can go beyond keyword search and return images that are semantically similar. In the above screenshot of the UI, the caption under the image shows the metadata Amazon Rekognition extracted.
You could include this solution in a larger workflow where you use the metadata you already extracted from your images to start using vector search along with other key terms, such as celebrity names, to return the best resonating images and documents for your search query.
Conclusion
In this post, we showed how you can use Amazon Rekognition, Amazon Comprehend, SageMaker, and OpenSearch Service to extract metadata from your images and then use ML techniques to discover them automatically using celebrity and semantic search. This is particularly important within the publishing industry, where speed matters in getting fresh content out quickly and to multiple platforms.
For more information about working with media assets, refer to Media intelligence just got smarter with Media2Cloud 3.0.

About the Author
Mark Watkins is a Solutions Architect within the Media and Entertainment team, supporting his customers in solving many data and ML problems. Away from professional life, he loves spending time with his family and watching his two little ones growing up.

Improving asset health and grid resilience using machine learning

This post is co-written with Travis Bronson and Brian L. Wilkerson from Duke Energy.
Machine learning (ML) is transforming every industry, process, and business, but the path to success is not always straightforward. In this blog post, we demonstrate how Duke Energy, a Fortune 150 company headquartered in Charlotte, NC., collaborated with the AWS Machine Learning Solutions Lab (MLSL) to use computer vision to automate the inspection of wooden utility poles and help prevent power outages, property damage and even injuries.
The electric grid is made up of poles, lines, and power plants to generate and deliver electricity to millions of homes and businesses. These utility poles are critical infrastructure components and are subject to various environmental factors such as wind, rain, and snow, which can cause wear and tear on assets. It's critical that utility poles are regularly inspected and maintained to prevent failures that can lead to power outages, property damage, and even injuries. Most power utility companies, including Duke Energy, use manual visual inspection of utility poles to identify anomalies related to their transmission and distribution network. But this method can be costly and time-consuming, and it requires that power transmission lineworkers follow rigorous safety protocols.
Duke Energy has used artificial intelligence in the past to create efficiencies in day-to-day operations to great success. The company has used AI to inspect generation assets and critical infrastructure and has been exploring opportunities to apply AI to the inspection of utility poles as well. Over the course of the AWS Machine Learning Solutions Lab engagement with Duke Energy, the utility progressed its work to automate the detection of anomalies in wood poles using advanced computer vision techniques.
Goals and use case
The goal of this engagement between Duke Energy and the Machine Learning Solutions Lab is to leverage machine learning to inspect hundreds of thousands of high-resolution aerial images to automate the identification and review process of all wood pole-related issues across 33,000 miles of transmission lines. This goal will further help Duke Energy to improve grid resiliency and comply with government regulations by identifying the defects in a timely manner. It will also reduce fuel and labor costs, as well as reduce carbon emissions by minimizing unnecessary truck rolls. Finally, it will also improve safety by minimizing miles driven, poles climbed and physical inspection risks associated with compromising terrain and weather conditions.
In the following sections, we present the key challenges associated with developing robust and efficient models for anomaly detection related to wood utility poles. We also describe the key challenges and suppositions associated with various data preprocessing techniques employed to achieve the desired model performance. Next, we present the key metrics used for evaluating the model performance along with the evaluation of our final models. And finally, we compare various state-of-the-art supervised and unsupervised modeling techniques.
Challenges
One of the key challenges associated with training a model for detecting anomalies using aerial images is non-uniform image size. The following figure shows the distribution of image height and width for a sample data set from Duke Energy. It can be observed that the images vary widely in size. The absolute size of the images also poses a significant challenge: input images are thousands of pixels wide and thousands of pixels tall, which is not ideal for training a model to identify the small anomalous regions in an image.

Distribution of image height and width for a sample data set

Also, the input images contain a large amount of irrelevant background information, such as vegetation, cars, and farm animals, which can result in suboptimal model performance. Based on our assessment, only 5% of each image contains the wood poles, and the anomalies are even smaller. This is a major challenge for identifying and localizing anomalies in high-resolution images. The number of anomalies is also significantly smaller compared to the entire data set: only 0.12% of images are anomalous (i.e., 1.2 anomalies per 1,000 images). Finally, there is no labeled data available for training a supervised machine learning model. Next, we describe how we address these challenges and explain our proposed method.
Solution overview
Modeling techniques
The following figure demonstrates our image processing and anomaly detection pipeline. We first imported the data into Amazon Simple Storage Service (Amazon S3) using Amazon SageMaker Studio. We further employed various data processing techniques to address some of the challenges highlighted above to improve the model performance. After data preprocessing, we employed Amazon Rekognition Custom Labels for data labeling. The labeled data is further used to train supervised ML models such as Vision Transformer, Amazon Lookout for Vision, and AutoGluon for anomaly detection.

Image processing and anomaly detection pipeline

The following figure demonstrates the detailed overview of our proposed approach that includes the data processing pipeline and various ML algorithms employed for anomaly detection. First, we will describe the steps involved in the data processing pipeline. Next, we will explain the details and intuition related to various modeling techniques employed during this engagement to achieve the desired performance goals.

Data preprocessing
The proposed data preprocessing pipeline includes data standardization, identification of region of interest (ROI), data augmentation, data segmentation, and finally data labeling. The purpose of each step is described below:
Data standardization
The first step in our data processing pipeline includes data standardization. In this step, each image is cropped and divided into non-overlapping patches of 224 x 224 pixels. The goal of this step is to generate patches of uniform size that can be used for training an ML model and localizing the anomalies in high-resolution images.
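
A minimal sketch of this cropping step, assuming the aerial images can be opened with Pillow (the file name is illustrative), could look like the following:

from PIL import Image

PATCH = 224  # patch size in pixels

def make_patches(image_path):
    # Split a high-resolution image into non-overlapping 224 x 224 patches.
    image = Image.open(image_path)
    width, height = image.size
    patches = []
    for top in range(0, height - PATCH + 1, PATCH):
        for left in range(0, width - PATCH + 1, PATCH):
            patches.append(image.crop((left, top, left + PATCH, top + PATCH)))
    return patches

patches = make_patches("aerial_image.jpg")  # hypothetical file name
print(f"{len(patches)} patches of {PATCH}x{PATCH} pixels")
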
Identification of region of interest (ROI)
The input data consists of high-resolution images containing a large amount of irrelevant background information (e.g., vegetation, houses, cars, horses, and cows). Our goal is to identify anomalies related to wood poles. To identify the ROI (i.e., patches containing the wood pole), we employed Amazon Rekognition Custom Labels. We trained an Amazon Rekognition Custom Labels model using 3,000 labeled images containing both ROI and background patches; the model performs a binary classification between ROI and background. Patches identified as background are discarded, while crops predicted as ROI are used in the next step. The following figure demonstrates the pipeline that identifies the ROI. We generated non-overlapping crops from a sample of 1,110 images, which produced 244,673 crops. We then used these crops as input to the Amazon Rekognition custom model, which identified 11,356 crops as ROI. Finally, we manually verified each of these 11,356 patches and found that the model correctly predicted 10,969 of them as wood patches. In other words, the model achieved 96% precision.

Identification of region of interest

Data labeling
During the manual inspection of the images, we also labeled each image with its associated label. The labels include wood patch, non-wood patch, non-structure, and wood patch with anomalies. The following figure demonstrates the nomenclature of the images using Amazon Rekognition Custom Labels.

Data augmentation
Given the limited amount of labeled data that was available for training, we augmented the training data set by making horizontal flips of all of the patches. This effectively doubled the size of our data set.
Segmentation
We labeled the objects in 600 images (poles, wires, and metal railing) using the bounding box object detection labeling tool in Amazon Rekognition Custom Labels and trained a model to detect the three main objects of interest. We used the trained model to remove the background from all the images by identifying and extracting the poles in each image while removing all other objects and the background. The resulting dataset had fewer images than the original data set because all images that don't contain wood poles were removed. In addition, false positive images were also removed from the dataset.
Anomaly detection
Next, we use the preprocessed data for training the machine learning model for anomaly detection. We employed three different methods for anomaly detection: AWS managed machine learning services (Amazon Lookout for Vision [L4V] and Amazon Rekognition), AutoGluon, and a Vision Transformer-based self-distillation method.
AWS Services
Amazon Lookout for Vision (L4V)
Amazon Lookout for Vision is a managed AWS service that enables swift training and deployment of ML models and provides anomaly detection capabilities. It requires fully labeled data, which we provided by pointing to the image paths in Amazon S3. Training the model is as simple as a single API (application programming interface) call or console button click, and L4V takes care of model selection and hyperparameter tuning under the hood.
Amazon Rekognition
Amazon Rekognition is a managed AI/ML service similar to L4V, which hides modeling details and provides many capabilities such as image classification, object detection, custom labeling, and more. It provides the ability to use the built-in models to apply to previously known entities in images (e.g., from ImageNet or other large open datasets). However, we used Amazon Rekognition's Custom Labels functionality to train the ROI detector, as well as an anomaly detector on the specific images that Duke Energy has. We also used Amazon Rekognition Custom Labels to train a model to put bounding boxes around wood poles in each image.
AutoGluon
AutoGluon is an open-source automated machine learning (AutoML) library developed by Amazon. AutoGluon includes a multi-modal component which allows easy training on image data. We used AutoGluon Multimodal to train models on the labeled image patches to establish a baseline for identifying anomalies.
Vision Transformer
Many of the most exciting new AI breakthroughs have come from two recent innovations: self-supervised learning, which allows machines to learn from random, unlabeled examples; and Transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. Both methods have been a sustained focus for the Machine learning community, and we’re pleased to share that we used them in this engagement.
In particular, working in collaboration with researchers at Duke Energy, we used pre-trained self-distillation ViT (Vision Transformer) models as feature extractors for the downstream anomaly detection application using Amazon SageMaker. The pre-trained self-distillation Vision Transformer models are trained on a large amount of training data stored on Amazon S3 in a self-supervised manner using Amazon SageMaker. We leverage the transfer learning capabilities of ViT models pre-trained on large-scale datasets (e.g., ImageNet). This helped us achieve a recall of 83% on an evaluation set using only a few thousand labeled images for training.
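
The general pattern of using a pre-trained self-distilled ViT purely as a frozen feature extractor can be sketched as follows; we use the publicly available DINO ViT-B/16 checkpoint from Hugging Face as a stand-in for the models trained on SageMaker in this engagement, and the downstream classifier step is only indicated in a comment.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Publicly available self-distilled ViT (DINO) used here as a stand-in feature extractor.
processor = ViTImageProcessor.from_pretrained("facebook/dino-vitb16")
vit = ViTModel.from_pretrained("facebook/dino-vitb16").eval()

def extract_feature(image: Image.Image) -> torch.Tensor:
    # Return the [CLS] token embedding as a fixed-length feature vector for a patch.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token, 768-dim for ViT-B/16

# A lightweight classifier (e.g., logistic regression or k-NN) can then be fit on these
# frozen features using the small set of labeled patches to score anomalies.
patch = Image.open("wood_patch.png").convert("RGB")  # hypothetical 224 x 224 ROI crop
feature = extract_feature(patch)
print(feature.shape)
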
Evaluation metrics
The following figure shows the key metrics used to evaluate model performance and its impacts. The key goal of the model is to maximize anomaly detection (i.e., true positives) and minimize the number of false negatives, or times when anomalies that could lead to outages are being misclassified.
Once the anomalies are identified, technicians can address them, preventing future outages and ensuring compliance with government regulations. There’s another benefit to minimizing false positives: you avoid the unnecessary effort of going through images again.

Keeping these metrics in mind, we track the model performance in terms of the following metrics, which encapsulate all four quantities defined above.
Precision
The percent of anomalies detected that are actual anomalies for objects of interest. Precision measures how well our algorithm identifies only anomalies. For this use case, high precision means low false alarms (i.e., the algorithm falsely identifies a woodpecker hole while there isn’t any in the image).

Recall
The percent of all anomalies that are recovered for each object of interest. Recall measures how well we identify all anomalies. This set captures some percentage of the full set of anomalies, and that percentage is the recall. For this use case, high recall means that we're good at catching woodpecker holes when they occur. Recall is therefore the right metric to focus on in this POC, because false alarms are at best annoying while missed anomalies could lead to serious consequences if left unattended.

Lower recall can lead to outages and government regulation violations, while lower precision leads to wasted human effort. The primary goal of this engagement is to identify all the anomalies to comply with government regulations and avoid any outages; hence, we prioritize improving recall over precision.
Evaluation and model comparison
In the following section, we demonstrate the comparison of various modeling techniques employed during this engagement. We evaluated the performance of two AWS services Amazon Rekognition and Amazon Lookout for Vision. We also evaluated various modeling techniques using AutoGluon. Finally, we compare the performance with state-of-the-art ViT based self-distillation method.
The following figure shows the model improvement for AutoGluon using different data processing techniques over the period of this engagement. The key observation is that as we improved the data quality and quantity, the recall of the model improved from below 30% to 78%.

Next, we compare the performance of AutoGluon with AWS services. We also employed various data processing techniques that helped improve the performance. However, the major improvement came from increasing the data quantity and quality. We increased the total dataset size from 11,000 images to 60,000 images.

Next, we compare the performance of AutoGluon and AWS services with the ViT-based method. The following figure demonstrates that the ViT-based method, AutoGluon, and AWS services performed on par in terms of recall. One key observation is that beyond a certain point, increasing data quality and quantity does not further improve recall; however, we do observe improvements in precision.

Precision versus recall comparison

Amazon AutoGluon    Predicted anomalies    Predicted normal
Anomalies           15,600                 4,400
Normal              3,659                  38,341

Next, we present the confusion matrices for AutoGluon, Amazon Rekognition, and the ViT-based method on our dataset, which contains 62,000 samples. Of these, 20,000 samples are anomalous and the remaining 42,000 images are normal. The ViT-based method captures the largest number of anomalies (16,600), followed by Amazon Rekognition (16,000) and AutoGluon (15,600). Similarly, AutoGluon has the fewest false positives (3,659 images), followed by Amazon Rekognition (5,918) and ViT (15,323). These results demonstrate that Amazon Rekognition achieves the highest AUC (area under the curve).

Amazon Rekognition  Predicted anomalies    Predicted normal
Anomalies           16,000                 4,000
Normal              5,918                  36,082

ViT                 Predicted anomalies    Predicted normal
Anomalies           16,600                 3,400
Normal              15,323                 26,677

Conclusion
In this post, we showed you how the MLSL and Duke Energy teams worked together to develop a computer vision-based solution to automate anomaly detection in wood poles using high-resolution images collected via helicopter flights. The proposed solution employed a data processing pipeline to crop the high-resolution images for size standardization. The cropped images are further processed using Amazon Rekognition Custom Labels to identify the region of interest (i.e., crops containing the patches with poles). Amazon Rekognition achieved 96% precision in correctly identifying the patches with poles. The ROI crops are then used for anomaly detection with the ViT-based self-distillation model, AutoGluon, and AWS services. We used a standard dataset to evaluate the performance of all three methods. The ViT-based model achieved 83% recall and 52% precision. AutoGluon achieved 78% recall and 81% precision. Finally, Amazon Rekognition achieved 80% recall and 73% precision. The goal of using three different methods is to compare the performance of each method with different numbers of training samples, training times, and deployment times. All of these methods take less than 2 hours to train and deploy using a single A100 GPU instance or managed services on AWS. Next steps for further improvement in model performance include adding more training data to improve model precision.
Overall, the end-to-end pipeline proposed in this post helps achieve significant improvements in anomaly detection while minimizing operational costs, safety incidents, regulatory risks, carbon emissions, and potential power outages.
The solution developed can be employed for other anomaly detection and asset health-related use cases across transmission and distribution networks, including defects in insulators and other equipment. For further assistance in developing and customizing this solution, please feel free to get in touch with the MLSL team.

About the Authors
Travis Bronson is a Lead Artificial Intelligence Specialist with 15 years of experience in technology and 8 years specifically dedicated to artificial intelligence. Over his 5-year tenure at Duke Energy, Travis has advanced the application of AI for digital transformation by bringing unique insights and creative thought leadership to his company’s leading edge. Travis currently leads the AI Core Team, a community of AI practitioners, enthusiasts, and business partners focused on advancing AI outcomes and governance. Travis gained and refined his skills in multiple technological fields, starting in the US Navy and US Government, then transitioning to the private sector after more than a decade of service.
Brian Wilkerson is an accomplished professional with two decades of experience at Duke Energy. With a degree in computer science, he has spent the past 7 years excelling in the field of Artificial Intelligence. Brian is a co-founder of Duke Energy’s MADlab (Machine Learning, AI and Deep learning team). He currently holds the position of Director of Artificial Intelligence & Transformation at Duke Energy, where he is passionate about delivering business value through the implementation of AI.
Ahsan Ali is an Applied Scientist at the Amazon Generative AI Innovation Center, where he works with customers from different domains to solve their urgent and expensive problems using Generative AI.
Tahin Syed is an Applied Scientist with the Amazon Generative AI Innovation Center, where he works with customers to help realize business outcomes with generative AI solutions. Outside of work, he enjoys trying new food, traveling, and teaching taekwondo.
Dr. Nkechinyere N. Agu is an Applied Scientist in the Generative AI Innovation Center at AWS. Her expertise is in Computer Vision AI/ML methods, Applications of AI/ML to healthcare, as well as the integration of semantic technologies (Knowledge Graphs) in ML solutions. She has a Masters and a Doctorate in Computer Science.
Aldo Arizmendi is a Generative AI Strategist in the AWS Generative AI Innovation Center based out of Austin, Texas. Having received his B.S. in Computer Engineering from the University of Nebraska-Lincoln, over the last 12 years, Mr. Arizmendi has helped hundreds of Fortune 500 companies and start-ups transform their business using advanced analytics, machine learning, and generative AI.
Stacey Jenks is a Principal Analytics Sales Specialist at AWS, with more than two decades of experience in Analytics and AI/ML. Stacey is passionate about diving deep on customer initiatives and driving transformational, measurable business outcomes with data. She is especially enthusiastic about the mark that utilities will make on society, via their path to a greener planet with affordable, reliable, clean energy.
Mehdi Noor is an Applied Science Manager at the Generative AI Innovation Center. With a passion for bridging technology and innovation, he assists AWS customers in unlocking the potential of generative AI, turning potential challenges into opportunities for rapid experimentation and innovation by focusing on scalable, measurable, and impactful uses of advanced AI technologies, and streamlining the path to production.

Optimize equipment performance with historical data, Ray, and Amazon S …

Efficient control policies enable industrial companies to increase their profitability by maximizing productivity while reducing unscheduled downtime and energy consumption. Finding optimal control policies is a complex task because physical systems, such as chemical reactors and wind turbines, are often hard to model and because drift in process dynamics can cause performance to deteriorate over time. Offline reinforcement learning is a control strategy that allows industrial companies to build control policies entirely from historical data without the need for an explicit process model. This approach does not require interaction with the process directly in an exploration stage, which removes one of the barriers for the adoption of reinforcement learning in safety-critical applications. In this post, we will build an end-to-end solution to find optimal control policies using only historical data on Amazon SageMaker using Ray’s RLlib library. To learn more about reinforcement learning, see Use Reinforcement Learning with Amazon SageMaker.
Use cases
Industrial control involves the management of complex systems, such as manufacturing lines, energy grids, and chemical plants, to ensure efficient and reliable operation. Many traditional control strategies are based on predefined rules and models, which often require manual optimization. It is standard practice in some industries to monitor performance and adjust the control policy when, for example, equipment starts to degrade or environmental conditions change. Retuning can take weeks and may require injecting external excitations in the system to record its response in a trial-and-error approach.
Reinforcement learning has emerged as a new paradigm in process control to learn optimal control policies through interacting with the environment. This process requires breaking down data into three categories: 1) measurements available from the physical system, 2) the set of actions that can be taken upon the system, and 3) a numerical metric (reward) of equipment performance. A policy is trained to find the action, at a given observation, that is likely to produce the highest future rewards.
In offline reinforcement learning, one can train a policy on historical data before deploying it into production. The algorithm trained in this blog post is called “Conservative Q Learning” (CQL). CQL contains an “actor” model and a “critic” model and is designed to conservatively predict its own performance after taking a recommended action. In this post, the process is demonstrated with an illustrative cart-pole control problem. The goal is to train an agent to balance a pole on a cart while simultaneously moving the cart towards a designated goal location. The training procedure uses the offline data, allowing the agent to learn from preexisting information. This cart-pole case study demonstrates the training process and its effectiveness in potential real-world applications.
Solution overview
The solution presented in this post automates the deployment of an end-to-end workflow for offline reinforcement learning with historical data. The following diagram describes the architecture used in this workflow. Measurement data is produced at the edge by a piece of industrial equipment (here simulated by an AWS Lambda function). The data is put into an Amazon Kinesis Data Firehose, which stores it in Amazon Simple Storage Service (Amazon S3). Amazon S3 is a durable, performant, and low-cost storage solution that allows you to serve large volumes of data to a machine learning training process.
AWS Glue catalogs the data and makes it queryable using Amazon Athena. Athena transforms the measurement data into a form that a reinforcement learning algorithm can ingest and then unloads it back into Amazon S3. Amazon SageMaker loads this data into a training job and produces a trained model. SageMaker then serves that model in a SageMaker endpoint. The industrial equipment can then query that endpoint to receive action recommendations.

Figure 1: Architecture diagram showing the end-to-end reinforcement learning workflow.

In this post, we will break down the workflow in the following steps:

Formulate the problem. Decide which actions can be taken, which measurements to make recommendations based on, and determine numerically how well each action performed.
Prepare the data. Transform the measurements table into a format the machine learning algorithm can consume.
Train the algorithm on that data.
Select the best training run based on training metrics.
Deploy the model to a SageMaker endpoint.
Evaluate the performance of the model in production.

Prerequisites
To complete this walkthrough, you need to have an AWS account and a command line interface with AWS SAM installed. Follow these steps to deploy the AWS SAM template to run this workflow and generate training data:

Download the code repository with the command

git clone https://github.com/aws-samples/sagemaker-offline-reinforcement-learning-ray-cql

Change directory to the repo:

cd sagemaker-offline-reinforcement-learning-ray-cql

Build the repo:

sam build --use-container

Deploy the repo

sam deploy --guided --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND

Use the following commands to call a bash script, which generates mock data using an AWS Lambda function.

sudo yum install jq
cd utils
sh generate_mock_data.sh

Solution walkthrough
Formulate problem
Our system in this blog post is a cart with a pole balanced on top. The system performs well when the pole is upright, and the cart position is close to the goal position. In the prerequisite step, we generated historical data from this system.
The following table shows historical data gathered from the system.

Cart position   Cart velocity   Pole angle   Pole angular velocity   Goal position   External force   Reward   Time
0.53            -0.79           -0.08        0.16                    0.50            -0.04            11.5     5:37:54 PM
0.51            -0.82           -0.07        0.17                    0.50            -0.04            11.9     5:37:55 PM
0.50            -0.84           -0.07        0.18                    0.50            -0.03            12.2     5:37:56 PM
0.48            -0.85           -0.07        0.18                    0.50            -0.03            10.5     5:37:57 PM
0.46            -0.87           -0.06        0.19                    0.50            -0.03            10.3     5:37:58 PM

You can query historical system information using Amazon Athena with the following query:

SELECT *
FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table"
ORDER BY episode_id, epoch_time ASC
LIMIT 10;

The state of this system is defined by the cart position, cart velocity, pole angle, pole angular velocity, and goal position. The action taken at each time step is the external force applied to the cart. The simulated environment outputs a reward value that is higher when the cart is closer to the goal position and the pole is more upright.
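For reference, the following minimal sketch shows how this state and action could be expressed as Gymnasium spaces. The bounds are illustrative assumptions; the actual environment implementation lives in the sample repository.

import numpy as np
from gymnasium import spaces

# Observation: [cart position, cart velocity, pole angle, pole angular velocity, goal position]
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)

# Action: the external force applied to the cart, a single continuous value
action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(observation_space.sample())
print(action_space.sample())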
Prepare data
To present the system information to the reinforcement learning model, transform it into JSON objects with keys that categorize values into the state (also called observation), action, and reward categories. Store these objects in Amazon S3. Here’s an example of JSON objects produced from time steps in the previous table.

{"obs": [[0.53, -0.79, -0.08, 0.16, 0.5]], "action": [[-0.04]], "reward": [11.5], "next_obs": [[0.51, -0.82, -0.07, 0.17, 0.5]]}
{"obs": [[0.51, -0.82, -0.07, 0.17, 0.5]], "action": [[-0.04]], "reward": [11.9], "next_obs": [[0.50, -0.84, -0.07, 0.18, 0.5]]}
{"obs": [[0.50, -0.84, -0.07, 0.18, 0.5]], "action": [[-0.03]], "reward": [12.2], "next_obs": [[0.48, -0.85, -0.07, 0.18, 0.5]]}

The AWS CloudFormation stack contains an output called AthenaQueryToCreateJsonFormatedData. Run this query in Amazon Athena to perform the transformation and store the JSON objects in Amazon S3. The reinforcement learning algorithm uses the structure of these JSON objects to understand which values to base recommendations on and the outcome of taking actions in the historical data.
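The Athena query performs this reshaping for you, but the transformation itself is straightforward. The following Python sketch illustrates the same logic for the first three rows of the table above; it is for illustration only and is not part of the deployed workflow.

import json

# Consecutive measurement rows (state columns, action, reward) from the table above
rows = [
    {"obs": [0.53, -0.79, -0.08, 0.16, 0.50], "action": [-0.04], "reward": 11.5},
    {"obs": [0.51, -0.82, -0.07, 0.17, 0.50], "action": [-0.04], "reward": 11.9},
    {"obs": [0.50, -0.84, -0.07, 0.18, 0.50], "action": [-0.03], "reward": 12.2},
]

# Pair each row with the next row's observation to form (obs, action, reward, next_obs) records
records = []
for current, nxt in zip(rows, rows[1:]):
    records.append({
        "obs": [current["obs"]],
        "action": [current["action"]],
        "reward": [current["reward"]],
        "next_obs": [nxt["obs"]],
    })

for record in records:
    print(json.dumps(record))  # one JSON line per transition, matching the format shown above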
Train agent
Now we can start a training job to produce a trained action recommendation model. Amazon SageMaker lets you quickly launch multiple training jobs to see how various configurations affect the resulting trained model. Call the Lambda function named TuningJobLauncherFunction to start a hyperparameter tuning job that experiments with four different sets of hyperparameters when training the algorithm.
Select best training run
To find which of the training jobs produced the best model, examine loss curves produced during training. CQL’s critic model estimates the actor’s performance (called a Q value) after taking a recommended action. Part of the critic’s loss function includes the temporal difference error. This metric measures the critic’s Q value accuracy. Look for training runs with a high mean Q value and a low temporal difference error. This paper, A Workflow for Offline Model-Free Robotic Reinforcement Learning, details how to select the best training run. The code repository has a file, /utils/investigate_training.py, that creates a plotly html figure describing the latest training job. Run this file and use the output to pick the best training run.
We can use the mean Q value to predict the performance of the trained model. The Q values are trained to conservatively predict the sum of discounted future reward values. For long-running processes, we can convert this number to an exponentially weighted average by multiplying the Q value by (1 - discount rate). The best training run in this set achieved a mean Q value of 539. Our discount rate is 0.99, so the model is predicting at least 5.39 average reward per time step. You can compare this value to historical system performance for an indication of whether the new model will outperform the historical control policy. In this experiment, the historical data’s average reward per time step was 4.3, so the CQL model is predicting 25 percent better performance than the system achieved historically.
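The conversion from a mean Q value to an expected per-time-step reward is simple arithmetic; the following snippet reproduces the numbers above:

discount_rate = 0.99
mean_q_value = 539            # best training run
historical_avg_reward = 4.3   # average reward per time step in the historical data

predicted_avg_reward = mean_q_value * (1 - discount_rate)          # 5.39
improvement = predicted_avg_reward / historical_avg_reward - 1     # ~0.25

print(f"Predicted reward per time step: {predicted_avg_reward:.2f}")
print(f"Predicted improvement over historical policy: {improvement:.0%}")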
Deploy model
Amazon SageMaker endpoints let you serve machine learning models in several different ways to meet a variety of use cases. In this post, we’ll use the serverless endpoint type so that our endpoint automatically scales with demand, and we only pay for compute usage when the endpoint is generating an inference. To deploy a serverless endpoint, include a ProductionVariantServerlessConfig in the production variant of the SageMaker endpoint configuration. The following code snippet shows how the serverless endpoint in this example is deployed using the Amazon SageMaker software development kit for Python. Find the sample code used to deploy the model at sagemaker-offline-reinforcement-learning-ray-cql.

from sagemaker.serverless import ServerlessInferenceConfig

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=200
    ),
    <…>
)

The trained model files are located at the S3 model artifacts for each training run. To deploy the machine learning model, locate the model files of the best training run, and call the Lambda function named “ModelDeployerFunction” with an event that contains this model data. The Lambda function will launch a SageMaker serverless endpoint to serve the trained model. Sample event to use when calling the “ModelDeployerFunction”:

{
  "DescribeTrainingJob": {
    "ModelArtifacts": {
      "S3ModelArtifacts": "s3://your-bucket/training/my-training-job/output/model.tar.gz"
    }
  }
}

Evaluate trained model performance
It’s time to see how our trained model is doing in production! To check the performance of the new model, call the Lambda function named “RunPhysicsSimulationFunction” with the SageMaker endpoint name in the event. This will run the simulation using the actions recommended by the endpoint. Sample event to use when calling the RunPhysicsSimulationFunction:

{"random_action_fraction": 0.0, "inference_endpoint_name": "sagemaker-endpoint-name"}

Use the following Athena query to compare the performance of the trained model with historical system performance.

WITH
sum_reward_by_episode AS (
SELECT SUM(reward) as sum_reward, m_temp.action_source
FROM "<AWS CloudFormation Stack Name>_glue_db"."measurements_table" m_temp
GROUP BY m_temp.episode_id, m_temp.action_source
)

SELECT sre.action_source, AVG(sre.sum_reward) as avg_total_reward_per_episode
FROM sum_reward_by_episode sre
GROUP BY sre.action_source
ORDER BY avg_total_reward_per_episode DESC

Here is an example results table. We see the trained model achieved 2.5x more reward than the historical data! Additionally, the true performance of the model was 2x better than the conservative performance prediction.

Action source   Average reward per time step
trained_model   10.8
historic_data   4.3

The following animations show the difference between a sample episode from the training data and an episode where the trained model was used to pick which action to take. In the animations, the blue box is the cart, the blue line is the pole, and the green rectangle is the goal location. The red arrow shows the force applied to the cart at each time step. The red arrow in the training data jumps back and forth quite a bit because the data was generated using 50 percent expert actions and 50 percent random actions. The trained model learned a control policy that moves the cart quickly to the goal position, while maintaining stability, entirely from observing nonexpert demonstrations.

 Clean up
To delete resources used in this workflow, navigate to the Resources section of the AWS CloudFormation stack and delete the S3 buckets and IAM roles. Then delete the CloudFormation stack itself.
Conclusion
Offline reinforcement learning can help industrial companies automate the search for optimal policies without compromising safety by using historical data. To implement this approach in your operations, start by identifying the measurements that make up the system state, the actions you can control, and the metrics that indicate desired performance. Then, access this GitHub repository for the implementation of an automatic end-to-end solution using Ray and Amazon SageMaker.
The post just scratches the surface of what you can do with Amazon SageMaker RL. Give it a try, and please send us feedback, either in the Amazon SageMaker discussion forum or through your usual AWS contacts.

About the Authors
Walt Mayfield is a Solutions Architect at AWS and helps energy companies operate more safely and efficiently. Before joining AWS, Walt worked as an Operations Engineer for Hilcorp Energy Company. He likes to garden and fly fish in his spare time.
Felipe Lopez is a Senior Solutions Architect at AWS with a concentration in Oil & Gas Production Operations. Prior to joining AWS, Felipe worked with GE Digital and Schlumberger, where he focused on modeling and optimization products for industrial applications.
Yingwei Yu is an Applied Scientist at Generative AI Incubator, AWS. He has experience working with several organizations across industries on various proofs of concept in machine learning, including natural language processing, time series analysis, and predictive maintenance. In his spare time, he enjoys swimming, painting, hiking, and spending time with family and friends.
Haozhu Wang is a research scientist in Amazon Bedrock focusing on building Amazon’s Titan foundation models. Previously he worked in Amazon ML Solutions Lab as a co-lead of the Reinforcement Learning Vertical and helped customers build advanced ML solutions with the latest research on reinforcement learning, natural language processing, and graph learning. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Enable pod-based GPU metrics in Amazon CloudWatch

In February 2022, Amazon Web Services added support for NVIDIA GPU metrics in Amazon CloudWatch, making it possible to push metrics from the Amazon CloudWatch Agent to Amazon CloudWatch and monitor your code for optimal GPU utilization. Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. To obtain instance-level metrics of GPU utilization, you can use Packer or EC2 Image Builder to bootstrap your own custom AMI and use it in various managed service offerings like AWS Batch, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). However, for many container-based service offerings and workloads, it’s ideal to capture utilization metrics at the container, pod, or namespace level.
This post details how to set up container-based GPU metrics and provides an example of collecting these metrics from EKS pods.
Solution overview
To demonstrate container-based GPU metrics, we create an EKS cluster with g5.2xlarge instances; however, this will work with any supported NVIDIA accelerated instance family.
We deploy the NVIDIA GPU operator to enable use of GPU resources and the NVIDIA DCGM Exporter to enable GPU metrics collection. Then we explore two architectures. The first one connects the metrics from NVIDIA DCGM Exporter to CloudWatch via a CloudWatch agent, as shown in the following diagram.

The second architecture (see the following diagram) connects the metrics from DCGM Exporter to Prometheus, then we use a Grafana dashboard to visualize those metrics.

Prerequisites
To simplify reproducing the entire stack from this post, we use a container that has all the required tooling (aws cli, eksctl, helm, etc.) already installed. In order to clone the container project from GitHub, you will need git. To build and run the container, you will need Docker. To deploy the architecture, you will need AWS credentials. To enable access to Kubernetes services using port-forwarding, you will also need kubectl.
These prerequisites can be installed on your local machine, an EC2 instance with NICE DCV, or AWS Cloud9. In this post, we use a c5.2xlarge Cloud9 instance with a 40 GB local storage volume. When using Cloud9, disable AWS managed temporary credentials by visiting Cloud9 -> Preferences -> AWS Settings, as shown in the following screenshot.

Build and run the aws-do-eks container
Open a terminal shell in your preferred environment and run the following commands:

git clone https://github.com/aws-samples/aws-do-eks
cd aws-do-eks
./build.sh
./run.sh
./exec.sh

The result is as follows:

root@e5ecb162812f:/eks#

You now have a shell in a container environment that has all the tools needed to complete the tasks below. We will refer to it as “aws-do-eks shell”. You will be running the commands in the following sections in this shell, unless specifically instructed otherwise.
Create an EKS cluster with a node group
The node group includes a GPU instance family of your choice; in this example, we use the g5.2xlarge instance type.
The aws-do-eks project comes with a collection of cluster configurations. You can set your desired cluster configuration with a single configuration change.

In the container shell, run ./env-config.sh and then set CONF=conf/eksctl/yaml/eks-gpu-g5.yaml
To verify the cluster configuration, run ./eks-config.sh

You should see the following cluster manifest:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: do-eks-yaml-g5
  version: "1.25"
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
managedNodeGroups:
  - name: sys
    instanceType: m5.xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
  - name: g5
    instanceType: g5.2xlarge
    instancePrefix: g5-2xl
    privateNetworking: true
    efaEnabled: false
    minSize: 0
    desiredCapacity: 1
    maxSize: 10
    volumeSize: 80
    iam:
      withAddonPolicies:
        cloudWatch: true
iam:
  withOIDC: true

To create the cluster, run the following command in the container

./eks-create.sh

The output is as follows:

root@e5ecb162812f:/eks# ./eks-create.sh
/eks/impl/eksctl/yaml /eks

./eks-create.sh

Mon May 22 20:50:59 UTC 2023
Creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml …

eksctl create cluster -f /eks/conf/eksctl/yaml/eks-gpu-g5.yaml

2023-05-22 20:50:59 [ℹ] eksctl version 0.133.0
2023-05-22 20:50:59 [ℹ] using region us-east-1
2023-05-22 20:50:59 [ℹ] subnets for us-east-1a – public:192.168.0.0/19 private:192.168.128.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1b – public:192.168.32.0/19 private:192.168.160.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1c – public:192.168.64.0/19 private:192.168.192.0/19
2023-05-22 20:50:59 [ℹ] subnets for us-east-1d – public:192.168.96.0/19 private:192.168.224.0/19
2023-05-22 20:50:59 [ℹ] nodegroup “sys” will use “” [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ] nodegroup “g5” will use “” [AmazonLinux2/1.25]
2023-05-22 20:50:59 [ℹ] using Kubernetes version 1.25
2023-05-22 20:50:59 [ℹ] creating EKS cluster “do-eks-yaml-g5” in “us-east-1” region with managed nodes
2023-05-22 20:50:59 [ℹ] 2 nodegroups (g5, sys) were included (based on the include/exclude rules)
2023-05-22 20:50:59 [ℹ] will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2023-05-22 20:50:59 [ℹ] will create a CloudFormation stack for cluster itself and 2 managed nodegroup stack(s)
2023-05-22 20:50:59 [ℹ] if you encounter any issues, check CloudFormation console or try ‘eksctl utils describe-stacks –region=us-east-1 –cluster=do-eks-yaml-g5’
2023-05-22 20:50:59 [ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster “do-eks-yaml-g5” in “us-east-1”
2023-05-22 20:50:59 [ℹ] CloudWatch logging will not be enabled for cluster “do-eks-yaml-g5” in “us-east-1”
2023-05-22 20:50:59 [ℹ] you can enable it with ‘eksctl utils update-cluster-logging –enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} –region=us-east-1 –cluster=do-eks-yaml-g5’
2023-05-22 20:50:59 [ℹ]
2 sequential tasks: { create cluster control plane “do-eks-yaml-g5”,
2 sequential sub-tasks: {
4 sequential sub-tasks: {
wait for control plane to become ready,
associate IAM OIDC provider,
2 sequential sub-tasks: {
create IAM role for serviceaccount “kube-system/aws-node”,
create serviceaccount “kube-system/aws-node”,
},
restart daemonset “kube-system/aws-node”,
},
2 parallel sub-tasks: {
create managed nodegroup “sys”,
create managed nodegroup “g5”,
},
}
}
2023-05-22 20:50:59 [ℹ] building cluster stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:51:00 [ℹ] deploying stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:51:30 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:52:00 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:53:01 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:54:01 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:55:01 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:56:02 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:57:02 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:58:02 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 20:59:02 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 21:00:03 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 21:01:03 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 21:02:03 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 21:03:04 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-cluster”
2023-05-22 21:05:07 [ℹ] building iamserviceaccount stack “eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node”
2023-05-22 21:05:10 [ℹ] deploying stack “eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node”
2023-05-22 21:05:10 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node”
2023-05-22 21:05:40 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-addon-iamserviceaccount-kube-system-aws-node”
2023-05-22 21:05:40 [ℹ] serviceaccount “kube-system/aws-node” already exists
2023-05-22 21:05:41 [ℹ] updated serviceaccount “kube-system/aws-node”
2023-05-22 21:05:41 [ℹ] daemonset “kube-system/aws-node” restarted
2023-05-22 21:05:41 [ℹ] building managed nodegroup stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:05:41 [ℹ] building managed nodegroup stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:05:42 [ℹ] deploying stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:05:42 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:05:42 [ℹ] deploying stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:05:42 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:06:12 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:06:12 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:06:55 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:07:11 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:08:29 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:08:45 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-sys”
2023-05-22 21:09:52 [ℹ] waiting for CloudFormation stack “eksctl-do-eks-yaml-g5-nodegroup-g5”
2023-05-22 21:09:53 [ℹ] waiting for the control plane to become ready
2023-05-22 21:09:53 [✔] saved kubeconfig as “/root/.kube/config”
2023-05-22 21:09:53 [ℹ] 1 task: { install Nvidia device plugin }
W0522 21:09:54.155837 1668 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the “priorityClassName” field instead
2023-05-22 21:09:54 [ℹ] created “kube-system:DaemonSet.apps/nvidia-device-plugin-daemonset”
2023-05-22 21:09:54 [ℹ] as you are using the EKS-Optimized Accelerated AMI with a GPU-enabled instance type, the Nvidia Kubernetes device plugin was automatically installed.
to skip installing it, use –install-nvidia-plugin=false.
2023-05-22 21:09:54 [✔] all EKS cluster resources for “do-eks-yaml-g5” have been created
2023-05-22 21:09:54 [ℹ] nodegroup “sys” has 1 node(s)
2023-05-22 21:09:54 [ℹ] node “ip-192-168-18-137.ec2.internal” is ready
2023-05-22 21:09:54 [ℹ] waiting for at least 1 node(s) to become ready in “sys”
2023-05-22 21:09:54 [ℹ] nodegroup “sys” has 1 node(s)
2023-05-22 21:09:54 [ℹ] node “ip-192-168-18-137.ec2.internal” is ready
2023-05-22 21:09:55 [ℹ] kubectl command should work with “/root/.kube/config”, try ‘kubectl get nodes’
2023-05-22 21:09:55 [✔] EKS cluster “do-eks-yaml-g5” in “us-east-1” region is ready

Mon May 22 21:09:55 UTC 2023
Done creating cluster using /eks/conf/eksctl/yaml/eks-gpu-g5.yaml

/eks

To verify that your cluster is created successfully, run the following command

kubectl get nodes -L node.kubernetes.io/instance-type

The output is similar to the following:

NAME STATUS ROLES AGE VERSION INSTANCE_TYPE
ip-192-168-18-137.ec2.internal Ready <none> 47m v1.25.9-eks-0a21954 m5.xlarge
ip-192-168-214-241.ec2.internal Ready <none> 46m v1.25.9-eks-0a21954 g5.2xlarge

In this example, we have one m5.xlarge and one g5.2xlarge instance in our cluster; therefore, we see two nodes listed in the preceding output.
During the cluster creation process, the NVIDIA device plugin will get installed. You will need to remove it after cluster creation because we will use the NVIDIA GPU Operator instead.

Delete the plugin with the following command

kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

We get the following output:

daemonset.apps “nvidia-device-plugin-daemonset” deleted

Install the NVIDIA Helm repo
Install the NVIDIA Helm repo with the following command:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

Deploy the DCGM exporter with the NVIDIA GPU Operator
To deploy the DCGM exporter, complete the following steps:

Prepare the DCGM exporter GPU metrics configuration

curl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv > dcgm-metrics.csv

You have the option to edit the dcgm-metrics.csv file. You can add or remove any metrics as needed.

Create the gpu-operator namespace and DCGM exporter ConfigMap

kubectl create namespace gpu-operator && \
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

The output is as follows:

namespace/gpu-operator created
configmap/metrics-config created

Apply the GPU operator to the EKS cluster

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator \
  --set dcgmExporter.config.name=metrics-config \
  --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
  --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv \
  --set toolkit.enabled=false

The output is as follows:

NAME: gpu-operator-1684795140
LAST DEPLOYED: Day Month Date HH:mm:ss YYYY
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Confirm that the DCGM exporter pod is running

kubectl -n gpu-operator get pods | grep dcgm

The output is as follows:

nvidia-dcgm-exporter-lkmfr       1/1     Running    0   1m

If you inspect the logs, you should see the “Starting webserver” message:

kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1)

The output is as follows:

Defaulted container “nvidia-dcgm-exporter” out of: nvidia-dcgm-exporter, toolkit-validation (init)
time=”2023-05-22T22:40:08Z” level=info msg=”Starting dcgm-exporter”
time=”2023-05-22T22:40:08Z” level=info msg=”DCGM successfully initialized!”
time=”2023-05-22T22:40:08Z” level=info msg=”Collecting DCP Metrics”
time=”2023-05-22T22:40:08Z” level=info msg=”No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv”
time=”2023-05-22T22:40:08Z” level=info msg=”Initializing system entities of type: GPU”
time=”2023-05-22T22:40:09Z” level=info msg=”Initializing system entities of type: NvSwitch”
time=”2023-05-22T22:40:09Z” level=info msg=”Not collecting switch metrics: no switches to monitor”
time=”2023-05-22T22:40:09Z” level=info msg=”Initializing system entities of type: NvLink”
time=”2023-05-22T22:40:09Z” level=info msg=”Not collecting link metrics: no switches to monitor”
time=”2023-05-22T22:40:09Z” level=info msg=”Kubernetes metrics collection enabled!”
time=”2023-05-22T22:40:09Z” level=info msg=”Pipeline starting”
time=”2023-05-22T22:40:09Z” level=info msg=”Starting webserver”

NVIDIA DCGM Exporter exposes a Prometheus metrics endpoint, which can be ingested by the CloudWatch agent. To see the endpoint, use the following command:

kubectl -n gpu-operator get services | grep dcgm

We get the following output:

nvidia-dcgm-exporter    ClusterIP   10.100.183.207   <none>   9400/TCP   10m

To generate some GPU utilization, we deploy a pod that runs the gpu-burn binary

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/gpu-metrics/gpu-burn-deployment.yaml

The output is as follows:

deployment.apps/gpu-burn created

This deployment uses a single GPU to produce a continuous pattern of 100% utilization for 20 seconds followed by 0% utilization for 20 seconds.

To make sure the endpoint works, you can run a temporary container that uses curl to read the content of http://nvidia-dcgm-exporter:9400/metrics

kubectl -n gpu-operator run -it --rm curl --restart='Never' --image=curlimages/curl --command -- curl http://nvidia-dcgm-exporter:9400/metrics

We get the following output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 1455
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 6250
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 65
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 299.437000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 15782796862
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 100
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 38
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 2230
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 20501
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0.808369
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 0.315787
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The rate of data transmitted over the PCIe bus – including both protocol headers and data payloads – in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES gauge
DCGM_FI_PROF_PCIE_TX_BYTES{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 3985328
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The rate of data received over the PCIe bus – including both protocol headers and data payloads – in bytes per second.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES gauge
DCGM_FI_PROF_PCIE_RX_BYTES{gpu=”0″,UUID=”GPU-ff76466b-22fc-f7a9-abe2-ce3ac453b8b3″,device=”nvidia0″,modelName=”NVIDIA A10G”,Hostname=”nvidia-dcgm-exporter-48cwd”,DCGM_FI_DRIVER_VERSION=”470.182.03″,container=”main”,namespace=”kube-system”,pod=”gpu-burn-c68d8c774-ltg9s”} 21715174
pod “curl” deleted

Configure and deploy the CloudWatch agent
To configure and deploy the CloudWatch agent, complete the following steps:

Download the YAML file and edit it

curl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/k8s/1.3.15/k8s-deployment-manifest-templates/deployment-mode/service/cwagent-prometheus/prometheus-eks.yaml

The file contains a cwagent configmap and a prometheus configmap. For this post, we edit both.

Edit the prometheus-eks.yaml file

Open the prometheus-eks.yaml file in your favorite editor and replace the cwagentconfig.json section with the following content:

apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["Service"],
                  "label_matcher": ".*dcgm.*",
                  "dimensions": [["Service","Namespace","ClusterName","job","pod"]],
                  "metric_selectors": [
                    "^DCGM_FI_DEV_GPU_UTIL$",
                    "^DCGM_FI_DEV_DEC_UTIL$",
                    "^DCGM_FI_DEV_ENC_UTIL$",
                    "^DCGM_FI_DEV_MEM_CLOCK$",
                    "^DCGM_FI_DEV_MEM_COPY_UTIL$",
                    "^DCGM_FI_DEV_POWER_USAGE$",
                    "^DCGM_FI_DEV_ROW_REMAP_FAILURE$",
                    "^DCGM_FI_DEV_SM_CLOCK$",
                    "^DCGM_FI_DEV_XID_ERRORS$",
                    "^DCGM_FI_PROF_DRAM_ACTIVE$",
                    "^DCGM_FI_PROF_GR_ENGINE_ACTIVE$",
                    "^DCGM_FI_PROF_PCIE_RX_BYTES$",
                    "^DCGM_FI_PROF_PCIE_TX_BYTES$",
                    "^DCGM_FI_PROF_PIPE_TENSOR_ACTIVE$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }

In the prometheus config section, append the following job definition for the DCGM exporter

- job_name: 'kubernetes-pod-dcgm-exporter'
  sample_limit: 10000
  metrics_path: /api/v1/metrics/prometheus
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: keep
      regex: '^DCGM.*$'
    - source_labels: [__address__]
      action: replace
      regex: ([^:]+)(?::\d+)?
      replacement: ${1}:9400
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - action: replace
      source_labels:
        - __meta_kubernetes_namespace
      target_label: Namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_container_name
      target_label: container_name
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_controller_name
      target_label: pod_controller_name
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_controller_kind
      target_label: pod_controller_kind
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_phase
      target_label: pod_phase
    - action: replace
      source_labels:
        - __meta_kubernetes_pod_node_name
      target_label: NodeName

Save the file and apply the cwagent-dcgm configuration to your cluster

kubectl apply -f ./prometheus-eks.yaml

We get the following output:

namespace/amazon-cloudwatch created
configmap/prometheus-cwagentconfig created
configmap/prometheus-config created
serviceaccount/cwagent-prometheus created
clusterrole.rbac.authorization.k8s.io/cwagent-prometheus-role created
clusterrolebinding.rbac.authorization.k8s.io/cwagent-prometheus-role-binding created
deployment.apps/cwagent-prometheus created

Confirm that the CloudWatch agent pod is running

kubectl -n amazon-cloudwatch get pods

We get the following output:

NAME READY STATUS RESTARTS AGE
cwagent-prometheus-7dfd69cc46-s4cx7 1/1 Running 0 15m

Visualize metrics on the CloudWatch console
To visualize the metrics in CloudWatch, complete the following steps:

On the CloudWatch console, under Metrics in the navigation pane, choose All metrics
In the Custom namespaces section, choose the new entry for ContainerInsights/Prometheus

For more information about the ContainerInsights/Prometheus namespace, refer to Scraping additional Prometheus sources and importing those metrics.

Drill down to the metric names and choose DCGM_FI_DEV_GPU_UTIL
On the Graphed metrics tab, set Period to 5 seconds

Set the refresh interval to 10 seconds

You will see the metrics collected from DCGM exporter that visualize the gpu-burn pattern on and off each 20 seconds.

On the Browse tab, you can see the data, including the pod name for each metric.

The EKS API metadata has been combined with the DCGM metrics data, resulting in the provided pod-based GPU metrics.
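If you prefer to retrieve these pod-based GPU metrics programmatically instead of through the console, you can query the ContainerInsights/Prometheus namespace with the CloudWatch API. The following boto3 sketch is illustrative only; the Region is an assumption, and the available dimensions correspond to the metric_declaration configured earlier.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed Region

# Discover the pod-level GPU utilization metrics scraped from the DCGM exporter
metrics = cloudwatch.list_metrics(
    Namespace="ContainerInsights/Prometheus",
    MetricName="DCGM_FI_DEV_GPU_UTIL",
)["Metrics"]

for metric in metrics:
    dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
    print("GPU utilization metric for pod:", dims.get("pod"))

    # Pull the last 10 minutes of data points for this dimension combination
    stats = cloudwatch.get_metric_statistics(
        Namespace="ContainerInsights/Prometheus",
        MetricName="DCGM_FI_DEV_GPU_UTIL",
        Dimensions=metric["Dimensions"],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])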
This concludes the first approach of exporting DCGM metrics to CloudWatch via the CloudWatch agent.
In the next section, we configure the second architecture, which exports the DCGM metrics to Prometheus, and we visualize them with Grafana.
Use Prometheus and Grafana to visualize GPU metrics from DCGM
Complete the following steps:

Add the Prometheus community helm chart

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

This chart deploys both Prometheus and Grafana. We need to make some edits to the chart before running the install command.

Save the chart configuration values to a file in /tmp

helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values

Edit the chart configuration file

Edit the saved file (/tmp/kube-prometheus-stack.values), locate the following setting, and set its value:

prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

Add the following scrape configuration to the additionalScrapeConfigs section

additionalScrapeConfigs:
  - job_name: gpu-metrics
    scrape_interval: 1s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - gpu-operator
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node

Deploy the Prometheus stack with the updated values

helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values /tmp/kube-prometheus-stack.values

We get the following output:

NAME: kube-prometheus-stack-1684965548
LAST DEPLOYED: Wed May 24 21:59:14 2023
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl –namespace prometheus get pods -l “release=kube-prometheus-stack-1684965548″

Visit https://github.com/prometheus-operator/kube-prometheus
for instructions on how to create & configure Alertmanager
and Prometheus instances using the Operator.

Confirm that the Prometheus pods are running

kubectl get pods -n prometheus

We get the following output:

NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-1684-alertmanager-0 2/2 Running 0 6m55s
kube-prometheus-stack-1684-operator-6c87649878-j7v55 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-grafana-dcd7b4c96-bzm8p 3/3 Running 0 6m58s
kube-prometheus-stack-1684965548-kube-state-metrics-7d856dptlj5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-2fbl5 1/1 Running 0 6m58s
kube-prometheus-stack-1684965548-prometheus-node-exporter-m7zmv 1/1 Running 0 6m58s
prometheus-kube-prometheus-stack-1684-prometheus-0 2/2 Running 0 6m55s

Prometheus and Grafana pods are in the Running state.
Next, we validate that DCGM metrics are flowing into Prometheus.

Port-forward the Prometheus UI

There are different ways to expose the Prometheus UI running in EKS to requests originating outside of the cluster. We will use kubectl port-forwarding. So far, we have been executing commands inside the aws-do-eks container. To access the Prometheus service running in the cluster, while the aws-do-eks container is running, execute the following command outside of the container, in a new terminal shell on the host. We will refer to this as the “host shell”.

kubectl -n prometheus port-forward svc/$(kubectl -n prometheus get svc | grep prometheus | grep -v alertmanager | grep -v operator | grep -v grafana | grep -v metrics | grep -v exporter | grep -v operated | cut -d ' ' -f 1) 8080:9090 &

While the port-forwarding process is running, we are able to access the Prometheus UI from the host as described below.

Open the Prometheus UI

If you are using Cloud9, please navigate to Preview->Preview Running Application to open the Prometheus UI in a tab inside the Cloud9 IDE, then click the icon in the upper-right corner of the tab to pop out in a new window.
If you are on your local host or connected to an EC2 instance via remote desktop, open a browser and visit the URL http://localhost:8080.

Enter DCGM to see the DCGM metrics that are flowing into Prometheus
Select DCGM_FI_DEV_GPU_UTIL, choose Execute, and then navigate to the Graph tab to see the expected GPU utilization pattern

Stop the Prometheus port-forwarding process

Run the following command line in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

Now we can visualize the DCGM metrics via Grafana Dashboard.

Retrieve the password to log in to the Grafana UI

kubectl -n prometheus get secret $(kubectl -n prometheus get secrets | grep grafana | cut -d ' ' -f 1) -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Port-forward the Grafana service

Run the following command line in your host shell:

kubectl port-forward -n prometheus svc/$(kubectl -n prometheus get svc | grep grafana | cut -d ' ' -f 1) 8080:80 &

Log in to the Grafana UI

Access the Grafana UI login screen the same way you accessed the Prometheus UI earlier. If using Cloud9, select Preview -> Preview Running Application, then pop the tab out into a new window. If using your local host or an EC2 instance with remote desktop, visit the URL http://localhost:8080. Log in with the user name admin and the password you retrieved earlier.

In the navigation pane, choose Dashboards

Choose New and Import

We are going to import the default DCGM Grafana dashboard described in NVIDIA DCGM Exporter Dashboard.

In the field import via grafana.com, enter 12239 and choose Load
Choose Prometheus as the data source
Choose Import

You will see a dashboard similar to the one in the following screenshot.

To demonstrate that these metrics are pod-based, we are going to modify the GPU Utilization pane in this dashboard.

Choose the pane and the options menu (three dots)
Expand the Options section and edit the Legend field
Replace the value there with Pod {{pod}}, then choose Save

The legend now shows the gpu-burn pod name associated with the displayed GPU utilization.

Stop port-forwarding the Grafana UI service

Run the following in your host shell:

kill -9 $(ps -aef | grep port-forward | grep -v grep | grep prometheus | awk '{print $2}')

In this post, we demonstrated using open-source Prometheus and Grafana deployed to the EKS cluster. If desired, this deployment can be substituted with Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Clean up
To clean up the resources you created, run the following script from the aws-do-eks container shell:

./eks-delete.sh

Conclusion
In this post, we utilized NVIDIA DCGM Exporter to collect GPU metrics and visualize them with either CloudWatch or Prometheus and Grafana. We invite you to use the architectures demonstrated here to enable GPU utilization monitoring with NVIDIA DCGM in your own AWS environment.
Additional resources

Amazon EC2 GPU instances
NVIDIA DCGM: Manage and Monitor GPUs in Cluster Environments
kube-prometheus-stack GitHub repo

About the authors
Amr Ragab is a former Principal Solutions Architect, EC2 Accelerated Computing at AWS. He is devoted to helping customers run computational workloads at scale. In his spare time, he likes traveling and finding new ways to integrate technology into daily life.
Alex Iankoulski is a Principal Solutions Architect, Self-managed Machine Learning at AWS. He’s a full-stack software and infrastructure engineer who likes to do deep, hands-on work. In his role, he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open-source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.
Keita Watanabe is a Senior Solutions Architect of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry’s best cloud-based self-managed machine learning solutions. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the e-commerce industry. Keita holds a Ph.D. in Science from the University of Tokyo.

Best practices and design patterns for building machine learning workf …

Amazon SageMaker Pipelines is a fully managed AWS service for building and orchestrating machine learning (ML) workflows. SageMaker Pipelines offers ML application developers the ability to orchestrate different steps of the ML workflow, including data loading, data transformation, training, tuning, and deployment. You can use SageMaker Pipelines to orchestrate ML jobs in SageMaker, and its integration with the larger AWS ecosystem also allows you to use resources like AWS Lambda functions, Amazon EMR jobs, and more. This enables you to build a customized and reproducible pipeline for specific requirements in your ML workflows.
In this post, we provide some best practices to maximize the value of SageMaker Pipelines and make the development experience seamless. We also discuss some common design scenarios and patterns when building SageMaker Pipelines and provide examples for addressing them.
Best practices for SageMaker Pipelines
In this section, we discuss some best practices that can be followed while designing workflows using SageMaker Pipelines. Adopting them can improve the development process and streamline the operational management of SageMaker Pipelines.
Use Pipeline Session for lazy loading of the pipeline
Pipeline Session enables lazy initialization of pipeline resources (the jobs are not started until pipeline runtime). The PipelineSession context inherits from the SageMaker Session and implements convenient methods for interacting with other SageMaker entities and resources, such as training jobs, endpoints, and input datasets in Amazon Simple Storage Service (Amazon S3). When defining SageMaker Pipelines, use PipelineSession rather than the regular SageMaker Session:

import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()
pipeline_session = PipelineSession()
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=pipeline_session,
)
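
Because the processor is bound to a PipelineSession, calling run() does not start a job; it returns delayed step arguments that are only executed when the pipeline itself runs. The following minimal sketch illustrates this; the script name and step name are hypothetical.

from sagemaker.workflow.steps import ProcessingStep

# run() with a PipelineSession returns step arguments instead of starting a job
processor_args = sklearn_processor.run(
    code="preprocess.py",  # hypothetical preprocessing script
)

step_process = ProcessingStep(
    name="AbaloneProcess",
    step_args=processor_args,
)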

Run pipelines in local mode for cost-effective and quick iterations during development
You can run a pipeline in local mode using the LocalPipelineSession context. In this mode, the pipeline and jobs are run locally using resources on the local machine, instead of SageMaker managed resources. Local mode provides a cost-effective way to iterate on the pipeline code with a smaller subset of data. After the pipeline is tested locally, it can be scaled to run using the PipelineSession context.

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import LocalPipelineSession

local_pipeline_session = LocalPipelineSession()
role = sagemaker.get_execution_role()
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=local_pipeline_session,
)
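
A pipeline defined with this session runs on the local machine when started. The following is a minimal sketch; the pipeline name is hypothetical and step_process is assumed to be a step built with the processor above.

from sagemaker.workflow.pipeline import Pipeline

# Steps built with local_pipeline_session run on the local machine
pipeline = Pipeline(
    name="LocalAbalonePipeline",  # hypothetical pipeline name
    steps=[step_process],  # assumed ProcessingStep built with the processor above
    sagemaker_session=local_pipeline_session,
)
pipeline.create(role_arn=role)
execution = pipeline.start()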

Manage a SageMaker pipeline through versioning
Versioning of artifacts and pipeline definitions is a common requirement in the development lifecycle. You can create multiple versions of the pipeline by naming pipeline objects with a unique prefix or suffix, the most common being a timestamp, as shown in the following code:

import time

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession

current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
pipeline_name = "pipeline-" + current_time  # pipeline names may contain only letters, digits, and hyphens
pipeline_session = PipelineSession()
pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=pipeline_session,
)

Organize and track SageMaker pipeline runs by integrating with SageMaker Experiments
SageMaker Pipelines integrates with SageMaker Experiments for organizing and tracking pipeline runs. You enable this by specifying a PipelineExperimentConfig when creating the pipeline object. With this configuration object, you can specify an experiment name and a trial name, and the run details of the pipeline are organized under that experiment and trial. If you don’t explicitly specify an experiment name, the pipeline name is used as the experiment name. Similarly, if you don’t explicitly specify a trial name, the pipeline run ID is used as the trial (run group) name. See the following code:

from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig

Pipeline(
    name="MyPipeline",
    parameters=[...],
    pipeline_experiment_config=PipelineExperimentConfig(
        experiment_name=ExecutionVariables.PIPELINE_NAME,
        trial_name=ExecutionVariables.PIPELINE_EXECUTION_ID,
    ),
    steps=[...],
)

Securely run SageMaker pipelines within a private VPC
To secure the ML workloads, it’s a best practice to deploy the jobs orchestrated by SageMaker Pipelines in a secure network configuration within a private VPC, private subnets, and security groups. To ensure and enforce the usage of this secure environment, you can implement the following AWS Identity and Access Management (IAM) policy for the SageMaker execution role (this is the role assumed by the pipeline during its run). You can also add the policy to run the jobs orchestrated by SageMaker Pipelines in network isolation mode.

# IAM Policy to enforce execution within a private VPC

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "sagemaker:CreateProcessingJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateModel"
            ],
            "Resource": "*",
            "Condition": {
                "Null": {
                    "sagemaker:VpcSubnets": "true"
                }
            }
        }
    ]
}

# IAM Policy to enforce execution in network isolation mode
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "sagemaker:Create*"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "sagemaker:NetworkIsolation": "true"
                }
            }
        }
    ]
}

For an example of pipeline implementation with these security controls in place, refer to Orchestrating Jobs, Model Registration, and Continuous Deployment with Amazon SageMaker in a secure environment.
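In addition to these guardrail policies, you can attach the VPC configuration directly to the jobs that the pipeline orchestrates. The following is a minimal sketch of passing a NetworkConfig to a processor so that it runs inside your private VPC; the subnet and security group IDs are placeholders, and role and pipeline_session are assumed to be defined as in the earlier examples.

from sagemaker.network import NetworkConfig
from sagemaker.sklearn.processing import SKLearnProcessor

# Placeholder subnet and security group IDs; replace with your own VPC resources
network_config = NetworkConfig(
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
    enable_network_isolation=True,
)

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=pipeline_session,
    network_config=network_config,  # jobs created by this processor run in the VPC
)
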
Monitor the cost of pipeline runs using tags
Using SageMaker Pipelines by itself is free; you pay for the compute and storage resources you spin up as part of the individual pipeline steps, such as processing, training, and batch inference. To aggregate the costs per pipeline run, you can include tags in every pipeline step that creates a resource. These tags can then be referenced in AWS Cost Explorer to filter and aggregate the total pipeline run cost (note that you need to activate them as cost allocation tags in the billing console before they appear in Cost Explorer), as shown in the following example:

sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-process",
    role=role,
    tags=[{"Key": "pipeline-cost-tag", "Value": "<<tag_parameter>>"}],
)

step_process = ProcessingStep(
    name="AbaloneProcess",
    processor=sklearn_processor,
    # additional step arguments elided
)

Using the Cost Explorer API, you can then retrieve the cost filtered by the tag:

import boto3

# Cost Explorer client
client = boto3.client("ce")

response = client.get_cost_and_usage(
    TimePeriod={
        "Start": "2023-07-01",
        "End": "2023-07-15"
    },
    Metrics=["BLENDED_COST", "USAGE_QUANTITY", "UNBLENDED_COST"],
    Granularity="MONTHLY",
    # A Cost Explorer Expression can have only one root operator,
    # so the dimension and tag filters are combined with "And"
    Filter={
        "And": [
            {
                "Dimensions": {
                    "Key": "USAGE_TYPE",
                    "Values": ["SageMaker:Pipeline"]
                }
            },
            {
                "Tags": {
                    "Key": "keyName",
                    "Values": ["keyValue"]
                }
            }
        ]
    }
)

Design patterns for some common scenarios
In this section, we discuss design patterns for some common use cases with SageMaker Pipelines.
Run a lightweight Python function using a Lambda step
Python functions are omnipresent in ML workflows; they are used in preprocessing, postprocessing, evaluation, and more. Lambda is a serverless compute service that lets you run code without provisioning or managing servers. With Lambda, you can run code in your preferred language, including Python, which makes it a convenient way to run custom Python code as part of your pipeline. A Lambda step enables you to run Lambda functions as part of your SageMaker pipeline. Start with the following code:

%%writefile lambdafunc.py

import json

def lambda_handler(event, context):
    str1 = event["str1"]
    str2 = event["str2"]
    str3 = str1 + str2
    return {
        "str3": str3
    }

Create the Lambda function using the SageMaker Python SDK’s Lambda helper:

from sagemaker.lambda_helper import Lambda

def create_lambda(function_name, script, handler):
    response = Lambda(
        function_name=function_name,
        execution_role_arn=role,  # assumes an IAM role with Lambda permissions is defined
        script=script,
        handler=handler,
        timeout=600,
        memory_size=10240,
    ).upsert()

    function_arn = response["FunctionArn"]
    return function_arn

fn_arn = create_lambda("func", "lambdafunc.py", handler="lambdafunc.lambda_handler")

Call the Lambda step:

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import (
    LambdaStep,
    LambdaOutput,
    LambdaOutputTypeEnum
)

str3 = LambdaOutput(output_name="str3", output_type=LambdaOutputTypeEnum.String)

# Lambda step
step_lambda1 = LambdaStep(
    name="LambdaStep1",
    lambda_func=Lambda(
        function_arn=fn_arn
    ),
    inputs={
        "str1": "Hello",
        "str2": " World"
    },
    outputs=[str3],
)

Pass data between steps
Input data for a pipeline step is either an accessible data location or data generated by one of the previous steps in the pipeline. You can provide this information as a ProcessingInput parameter. Let’s look at a few scenarios of how you can use ProcessingInput.
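Before diving into the scenarios, here is a quick illustration of referencing the output of an upstream processing step as the input of a downstream step; the output name and container path below are illustrative, and step_process is assumed to be a previously defined ProcessingStep.

from sagemaker.processing import ProcessingInput

# Reference the "train" output of a previous ProcessingStep (step_process)
# as the input of a subsequent step
input_from_previous_step = ProcessingInput(
    source=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
    destination="/opt/ml/processing/input/train",
)
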
Scenario 1: Pass the output (primitive data types) of a Lambda step to a processing step
Primitive data types refer to scalar data types like string, integer, Boolean, and float.
The following code snippet defines a Lambda function that returns a dictionary of variables with primitive data types. When invoked from the Lambda step within the SageMaker pipeline, the Lambda function code returns a JSON object of key-value pairs.

def handler(event, context):
    return {
        "output1": "string_value",
        "output2": 1,
        "output3": True,
        "output4": 2.0,
    }

In the pipeline definition, you can then define SageMaker pipeline parameters that are of a specific data type and set the variable to the output of the Lambda function:

import sagemaker
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import (
    LambdaStep,
    LambdaOutput,
    LambdaOutputTypeEnum
)
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()
pipeline_session = PipelineSession()

# 1. Define the output params of the Lambda step

str_outputParam = LambdaOutput(output_name="output1", output_type=LambdaOutputTypeEnum.String)
int_outputParam = LambdaOutput(output_name="output2", output_type=LambdaOutputTypeEnum.Integer)
bool_outputParam = LambdaOutput(output_name="output3", output_type=LambdaOutputTypeEnum.Boolean)
float_outputParam = LambdaOutput(output_name="output4", output_type=LambdaOutputTypeEnum.Float)

# 2. Lambda step that invokes the Lambda function and returns the outputs

step_lambda = LambdaStep(
    name="MyLambdaStep",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:us-west-2:123456789012:function:sagemaker_test_lambda",
        session=PipelineSession(),
    ),
    inputs={"arg1": "foo", "arg2": "foo1"},
    outputs=[
        str_outputParam, int_outputParam, bool_outputParam, float_outputParam
    ],
)

# 3. Extract the output of the Lambda

str_outputParam = step_lambda.properties.Outputs["output1"]

# 4. Use it in a subsequent step, for example a processing step

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=pipeline_session,
    role=role
)

processor_args = sklearn_processor.run(
    code="code/preprocess.py",  # Python script to run
    arguments=["--input-args", str_outputParam]
)

step_process = ProcessingStep(
    name="processstep1",
    step_args=processor_args,
)

Scenario 2: Pass the output (non-primitive data types) of a Lambda step to a processing step
Non-primitive data types refer to non-scalar data types (for example, NamedTuple). You may have a scenario in which you have to return a non-primitive data type from a Lambda function. To do this, you have to convert the non-primitive data type to a string:

# Lambda function code returning a non-primitive data type

from collections import namedtuple

def lambda_handler(event, context):
    Outputs = namedtuple("Outputs", "sample_output")
    named_tuple = Outputs(
        [
            {"output1": 1, "output2": 2},
            {"output3": "foo", "output4": "foo1"}
        ]
    )
    return {
        "named_tuple_string": str(named_tuple)
    }

# Pipeline step that uses the Lambda output as a "Parameter Input"

output_ref = step_lambda.properties.Outputs["named_tuple_string"]

Then you can use this string as an input to a subsequent step in the pipeline. To use the named tuple in the code, use eval() to parse the Python expression in the string:

# Decipher the string in your processing logic code

import argparse
from collections import namedtuple

Outputs = namedtuple("Outputs", "sample_output")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--named_tuple_string", type=str, required=True)
    args = parser.parse_args()
    # Use eval to obtain the named tuple from the string
    named_tuple = eval(args.named_tuple_string)

Scenario 3: Pass the output of a step through a property file
You can also store the output of a processing step in a JSON property file for downstream consumption in a ConditionStep or another ProcessingStep. You can use the JsonGet function to query a property file. See the following code:

import sagemaker
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep

# 1. Define a Processor with a ProcessingOutput
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-preprocess",
    sagemaker_session=session,
    role=sagemaker.get_execution_role(),
)

step_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(
            output_name="hyperparam",
            source="/opt/ml/processing/evaluation"
        ),
    ],
    code="./local/preprocess.py",
    arguments=["--input-data", "s3://my-input"],
)

# 2. Define a PropertyFile whose output_name matches the one used in the Processor

hyperparam_report = PropertyFile(
    name="AbaloneHyperparamReport",
    output_name="hyperparam",
    path="hyperparam.json",
)

# Attach the property file to the ProcessingStep so that it can be queried later
step_process = ProcessingStep(
    name="AbaloneProcess",
    step_args=step_args,
    property_files=[hyperparam_report],
)

Let’s assume the property file’s contents were the following:

{
“hyperparam”: {
“eta”: {
“value”: 0.6
}
}
}

In this case, it can be queried for a specific value and used in subsequent steps using the JsonGet function:

from sagemaker.workflow.functions import JsonGet

# 3. Query the property file
eta = JsonGet(
    step_name=step_process.name,
    property_file=hyperparam_report,
    json_path="hyperparam.eta.value",
)
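
The value returned by JsonGet can then drive branching logic in the pipeline. The following is a minimal sketch of using the queried eta value in a ConditionStep; the threshold and the downstream step_train are assumptions for illustration.

from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo

# Hypothetical branch: continue to training only if eta is small enough
cond_lte = ConditionLessThanOrEqualTo(left=eta, right=0.7)

step_cond = ConditionStep(
    name="CheckEtaCondition",
    conditions=[cond_lte],
    if_steps=[step_train],  # assumed training step defined elsewhere
    else_steps=[],
)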

Parameterize a variable in pipeline definition
Parameterizing variables so that they can be used at runtime is often desirable—for example, to construct an S3 URI. You can parameterize a string such that it is evaluated at runtime using the Join function. The following code snippet shows how to define the variable using the Join function and use that to set the output location in a processing step:

from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep

# Define the variable to store the S3 URI
s3_location = Join(
    on="/",
    values=[
        "s3:/",
        ParameterString(
            name="MyBucket",
            default_value=""
        ),
        "training",
        ExecutionVariables.PIPELINE_EXECUTION_ID
    ]
)

# Define the processing step
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/sklearn-abalone-preprocess",
    sagemaker_session=pipeline_session,
    role=role,
)

# Use the S3 URI as the output location in the processing step
processor_run_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=s3_location,
        ),
    ],
    code="code/preprocess.py"
)

step_process = ProcessingStep(
    name="PreprocessingJob",
    step_args=processor_run_args,
)

Run parallel code over an iterable
Some ML workflows run code in parallel for-loops over a static set of items (an iterable). It can either be the same code that gets run on different data or a different piece of code that needs to be run for each item. For example, if you have a very large number of rows in a file and want to speed up the processing time, you can rely on the former pattern. If you want to perform different transformations on specific sub-groups in the data, you might have to run a different piece of code for every sub-group in the data. The following two scenarios illustrate how you can design SageMaker pipelines for this purpose.
Scenario 1: Implement a processing logic on different portions of data
You can run a processing job with multiple instances by setting instance_count to a value greater than 1 and sharding the input data from Amazon S3 across the processing instances. You can then use a script (process.py) to work on a specific portion of the data based on the instance number and the corresponding element in the list of items. The programming logic in process.py can be written such that a different module or piece of code is run depending on the item it processes. The following example defines a processor that can be used in a ProcessingStep, followed by a sketch of process.py:

from sagemaker.processing import FrameworkProcessor, ProcessingInput
from sagemaker.sklearn.estimator import SKLearn

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="0.23-1",
    instance_type="ml.m5.4xlarge",
    instance_count=4,  # number of parallel executions / instances
    base_job_name="parallel-step",
    sagemaker_session=session,
    role=role,
)

step_args = sklearn_processor.run(
    code="process.py",
    arguments=[
        "--items",
        list_of_items,  # data structure containing a list of items
    ],
    inputs=[
        ProcessingInput(
            source="s3://sagemaker-us-east-1-xxxxxxxxxxxx/abalone/abalone-dataset.csv",
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # optional: shard the S3 input across instances instead of fully replicating it
        )
    ],
)
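
For completeness, the following is a minimal sketch of what process.py could look like; the argument handling and per-item logic are illustrative. It reads the resource configuration file that SageMaker Processing writes into each container to determine which instance is running and which item to work on.

# process.py - minimal sketch; argument names and per-item logic are illustrative
import argparse
import json

# SageMaker Processing writes the cluster layout of the job to this file
with open("/opt/ml/config/resourceconfig.json") as f:
    resource_config = json.load(f)

hosts = sorted(resource_config["hosts"])  # for example, ["algo-1", ..., "algo-4"]
instance_index = hosts.index(resource_config["current_host"])

parser = argparse.ArgumentParser()
parser.add_argument("--items", nargs="+", required=True)
args = parser.parse_args()

# Each instance works on the item (or data shard) that matches its index
item = args.items[instance_index]
print(f"Instance {resource_config['current_host']} processing item: {item}")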

Scenario 2: Run a sequence of steps
When you have a sequence of steps that need to be run in parallel, you can define each sequence as an independent SageMaker pipeline. The run of these SageMaker pipelines can then be triggered from a Lambda function that is part of a LambdaStep in the parent pipeline. The following piece of code illustrates the scenario where two different SageMaker pipeline runs are triggered:

import boto3

def lambda_handler(event, context):
    items = [1, 2]
    # SageMaker client
    sm_client = boto3.client("sagemaker")

    # Name of the pipeline that needs to be triggered.
    # If there are multiple, you can fetch the available pipelines using the boto3 API
    # and trigger the appropriate one based on your logic.

    # Trigger a pipeline run for every item
    pipeline_name = "child-pipeline-1"
    response_ppl = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDisplayName=pipeline_name + "-item-%d" % (items[0]),
    )
    pipeline_name = "child-pipeline-2"
    response_ppl = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDisplayName=pipeline_name + "-item-%d" % (items[1]),
    )
    return
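
To wire this into the parent pipeline, the Lambda function above can be invoked from a LambdaStep, similar to the earlier Lambda step examples. The following is a minimal sketch that assumes the function has already been deployed; the function ARN is a placeholder.

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep

# Parent-pipeline step that fans out runs of the child pipelines
step_trigger_children = LambdaStep(
    name="TriggerChildPipelines",
    lambda_func=Lambda(
        function_arn="arn:aws:lambda:us-west-2:123456789012:function:trigger-child-pipelines",  # placeholder ARN
    ),
    inputs={},
)

Note that the Lambda function's execution role needs permission to call sagemaker:StartPipelineExecution on the child pipelines.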

Conclusion
In this post, we discussed some best practices for the efficient use and maintenance of SageMaker pipelines. We also provided certain patterns that you can adopt while designing workflows with SageMaker Pipelines, whether you are authoring new pipelines or are migrating ML workflows from other orchestration tools. To get started with SageMaker Pipelines for ML workflow orchestration, refer to the code samples on GitHub and Amazon SageMaker Model Building Pipelines.

About the Authors
Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or watching sports.
Meenakshisundaram Thandavarayan works for AWS as an AI/ML Specialist. He has a passion for designing, creating, and promoting human-centered data and analytics experiences. Meena focuses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector and design thinker, and strives to drive businesses to new ways of working through innovation, incubation, and democratization.