Meet Neuralangelo: Nvidia’s AI Revolutionizing 2D to 3D Video Conver …

Nvidia, the multinational technology corporation known for its advancements in artificial intelligence (AI), has recently unveiled Neuralangelo, a groundbreaking AI system that can convert 2D video into immersive 3D scenes. This pioneering technology was introduced in an Nvidia blog post dated June 1, 2023.

Translating Two Dimensions into Three

Neuralangelo uses a novel AI algorithm to transform traditional 2D videos into immersive, detailed 3D environments. The process involves extrapolating depth and perspective from the spatial and temporal clues embedded in the 2D footage, rendering realistic 3D models from these clues.

Unlike some previous methods, Neuralangelo doesn’t rely on depth-sensing cameras or specialized capture rigs. According to Nvidia, it works from ordinary 2D video of an object or scene, such as a clip captured on a smartphone, and selects frames that show the subject from different viewpoints to infer its depth, size, and shape. This makes the system versatile and adaptable to a variety of applications.

Power of Deep Learning

The system leverages deep learning, a subfield of AI that teaches computers to learn by example. Nvidia trained Neuralangelo on a diverse array of videos covering a wide range of scenes, objects, and activities, helping the AI to understand how depth and space work in many different contexts.

The algorithm makes use of several neural networks, each with a specialized function. Some networks are trained to estimate depth, while others fill in unseen details, creating comprehensive 3D models from flat images. The sophisticated interaction of these networks allows Neuralangelo to construct an impressively accurate 3D scene from 2D footage.

Potential Applications and Implications

The potential applications for Neuralangelo are vast. In entertainment, this technology could revolutionize the movie industry by making 3D conversion cheaper and more accessible, even for older movies shot in 2D. It could also enhance video game experiences by adding an extra dimension of depth and realism.

In the realm of professional technology, Neuralangelo could prove invaluable in fields such as architecture and real estate, allowing for 3D tours of properties based on 2D footage. In the medical world, Neuralangelo could potentially transform 2D medical images into 3D models, aiding diagnostics and surgical planning.

A Milestone for Nvidia

The development of Neuralangelo is a notable achievement for Nvidia and a significant leap forward in the field of AI and machine learning. By effectively bridging the gap between 2D and 3D visuals, Nvidia is pushing the boundaries of what is possible in visual rendering and AI technologies.

As Neuralangelo continues to be refined and developed, its impact on various industries is set to be profound. From entertainment to professional applications, the ability to transform 2D footage into 3D scenes opens up a world of possibilities, enhancing realism, depth, and immersive experiences across the board.

In conclusion, Neuralangelo is another example of Nvidia’s commitment to pioneering AI technologies, continuing its tradition of leading the pack in digital innovation. As we look towards a future increasingly shaped by AI, technologies like Neuralangelo will be at the forefront of this exciting revolution.

Nvidia’s breakthrough paints a vibrant picture of the potential AI holds for transforming our visual and digital landscapes. Hence, it’s safe to say that, with Neuralangelo, the future of video has never looked more three-dimensional.

Check out the NVIDIA Blog. Don’t forget to join our 22k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com.

Check Out 100’s AI Tools in AI Tools Club
The post Meet Neuralangelo: Nvidia’s AI Revolutionizing 2D to 3D Video Conversion appeared first on MarkTechPost.

Implement a multi-object tracking solution on a custom dataset with Am …

The demand for multi-object tracking (MOT) in video analysis has increased significantly in many industries, such as live sports, manufacturing, and traffic monitoring. For example, in live sports, MOT can track soccer players in real time to analyze physical performance such as real-time speed and moving distance.
Since its introduction in 2021, ByteTrack has remained one of the best-performing methods on various benchmark datasets among recent MOT models. In ByteTrack, the authors proposed a simple, effective, and generic data association method (referred to as BYTE) for matching detection boxes to tracklets. Rather than keeping only the high-score detection boxes, it also keeps the low-score detection boxes, which can help recover unmatched tracklets when occlusion, motion blur, or size changes occur. The BYTE association strategy can also be used in other Re-ID based trackers, such as FairMOT. The experiments showed improvements compared to the vanilla tracker algorithms. For example, when BYTE was applied for data association, FairMOT achieved an improvement of 1.3% on MOTA (FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking), which is one of the main metrics in the MOT task.
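The two-pass idea behind BYTE can be illustrated with a minimal sketch. This is not the ByteTrack implementation: the real tracker matches Kalman-predicted boxes with the Hungarian algorithm, whereas this sketch swaps in a simple greedy IoU matcher. The function names and thresholds (high_thresh, low_thresh, min_iou) are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], det_ids: List[int],
                 boxes: List[Box], min_iou: float) -> Dict[int, int]:
    """Greedily pair track ids with detection indices, best IoU first."""
    pairs = sorted(
        ((iou(tbox, boxes[d]), tid, d) for tid, tbox in tracks.items() for d in det_ids),
        reverse=True,
    )
    matched: Dict[int, int] = {}
    used = set()
    for score, tid, d in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be a match
        if tid not in matched and d not in used:
            matched[tid] = d
            used.add(d)
    return matched


def byte_associate(tracks: Dict[int, Box], boxes: List[Box], scores: List[float],
                   high_thresh: float = 0.5, low_thresh: float = 0.1,
                   min_iou: float = 0.3) -> Dict[int, int]:
    """Return {track_id: detection_index} using two association passes."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if low_thresh <= s < high_thresh]
    # Pass 1: match every track against the high-score detections
    first = greedy_match(tracks, high, boxes, min_iou)
    # Pass 2: give still-unmatched tracks a chance with the low-score detections
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, boxes, min_iou)
    return {**first, **second}
```

In this sketch, a detector that reports an occluded object with a low score still lets the existing track survive, because the second pass matches it to the low-score box instead of discarding that box outright.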
In the post Train and deploy a FairMOT model with Amazon SageMaker, we demonstrated how to train and deploy a FairMOT model with Amazon SageMaker on the MOT challenge datasets. When applying a MOT solution in real-world cases, you need to train or fine-tune a MOT model on a custom dataset. With Amazon SageMaker Ground Truth, you can effectively create labels on your own video dataset.
Following on the previous post, we have added the following contributions and modifications:

Generate labels for a custom video dataset using Ground Truth
Preprocess the Ground Truth generated label to be compatible with ByteTrack and other MOT solutions
Train the ByteTrack algorithm with a SageMaker training job (with the option to extend a pre-built container)
Deploy the trained model with various deployment options, including asynchronous inference

We also provide the code sample on GitHub, which uses SageMaker for labeling, building, training, and inference.
SageMaker is a fully managed service that provides every developer and data scientist with the ability to prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker provides several built-in algorithms and container images that you can use to accelerate training and deployment of ML models. Additionally, custom algorithms such as ByteTrack can also be supported via custom-built Docker container images. For more information about deciding on the right level of engagement with containers, refer to Using Docker containers with SageMaker.
SageMaker provides plenty of options for model deployment, such as real-time inference, serverless inference, and asynchronous inference. In this post, we show how to deploy a tracking model with different deployment options, so that you can choose the most suitable deployment method for your own use case.
Overview of solution
Our solution consists of the following high-level steps:

Label the dataset for tracking, with a bounding box on each object (for example, pedestrian, car, and so on). Set up the resources for ML code development and execution.
Train a ByteTrack model and tune hyperparameters on a custom dataset.
Deploy the trained ByteTrack model with different deployment options depending on your use case: real-time processing, asynchronous, or batch prediction.

The following diagram illustrates the architecture in each step.
Prerequisites
Before getting started, complete the following prerequisites:

Create an AWS account or use an existing AWS account.
We recommend running the source code in the us-east-1 Region.
Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge for single-GPU training, or ml.p3.16xlarge for distributed training) for the training job. Other types of GPU instances are also supported, with various performance differences.
Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for the inference endpoint.
Make sure that you have a minimum of one GPU instance (for example, ml.p3.2xlarge) for running batch prediction with processing jobs.

If this is your first time running SageMaker services on the aforementioned instance types, you may have to request a quota increase for the required instances.
Set up your resources
After you complete all the prerequisites, you’re ready to deploy the solution.

Create a SageMaker notebook instance. For this task, we recommend using the ml.t3.medium instance type. While running the code, we use docker build to extend the SageMaker training image with the ByteTrack code (the docker build command runs locally within the notebook instance environment). Therefore, we recommend increasing the volume size to 100 GB (the default volume size is 5 GB) from the advanced configuration options. For your AWS Identity and Access Management (IAM) role, choose an existing role or create a new role, and attach the AmazonS3FullAccess, AmazonSNSFullAccess, AmazonSageMakerFullAccess, and AmazonElasticContainerRegistryPublicFullAccess policies to the role.
Clone the GitHub repo to the /home/ec2-user/SageMaker folder on the notebook instance you created.
Create a new Amazon Simple Storage Service (Amazon S3) bucket or use an existing bucket.

Label the dataset
In the data-preparation.ipynb notebook, we download an MOT16 test video file and split the video file into small video files with 200 frames. Then we upload those video files to the S3 bucket as the data source for labeling.
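As a rough illustration of the splitting step, the following sketch computes the frame ranges for each 200-frame clip. It is not the notebook's code; the helper name chunk_frame_ranges is hypothetical, and the actual video decoding and writing (for example, with a video library) is left out.

```python
from typing import List, Tuple


def chunk_frame_ranges(total_frames: int, chunk_size: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, that cover the video."""
    if total_frames < 0 or chunk_size <= 0:
        raise ValueError("total_frames must be >= 0 and chunk_size must be > 0")
    # The last clip may be shorter than chunk_size
    return [
        (start, min(start + chunk_size, total_frames))
        for start in range(0, total_frames, chunk_size)
    ]
```

Each range can then be written out as its own short video file before uploading to Amazon S3.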

To label the dataset for the MOT task, refer to Getting started. When the labeling job is complete, we can access the following annotation directory at the job output location in the S3 bucket.

The manifests directory should contain an output folder if we finished labeling all the files. We can see the file output.manifest in the output folder. This manifest file contains information about the video and video tracking labels that you can use later to train and test a model.
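A manifest file of this kind is stored as JSON Lines (one JSON object per line), so it can be read with the standard library. The sketch below is a minimal example under that assumption; the helper name is hypothetical, and the key names in the usage note (source-ref and my-job-ref) are illustrative, because the label key depends on the name of your labeling job.

```python
import json
from typing import Iterable, List


def parse_manifest_lines(lines: Iterable[str]) -> List[dict]:
    """Parse a JSON Lines manifest into a list of records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```

For example, after downloading the file you could run `with open("output.manifest") as f: records = parse_manifest_lines(f)` and then inspect each record for the S3 location of the labeled sequence, such as `records[0]["my-job-ref"]` when the label attribute is named my-job.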
Train a ByteTrack model and tune hyperparameters on the custom dataset
To train your ByteTrack model, we use the bytetrack-training.ipynb notebook. The notebook consists of the following steps:

Initialize the SageMaker setting.
Perform data preprocessing.
Build and push the container image.
Define a training job.
Launch the training job.
Tune hyperparameters.

In the data preprocessing step, we convert the labeled dataset from the Ground Truth output format to the MOT17 format, and then convert the MOT17 format dataset to the MSCOCO format (as shown in the following figure) so that we can train a YOLOX model on the custom dataset. Because we keep both the MOT format dataset and the MSCOCO format dataset, you can also use the MOT format dataset to train other MOT algorithms that don’t separate detection and tracking. You can easily change the detector to other algorithms such as YOLOv7 to use your existing object detection algorithm.
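The MOT-to-MSCOCO part of this conversion can be sketched as follows. This is a simplified illustration, not the notebook's converter: it assumes MOT ground-truth rows of the form frame,id,left,top,width,height,... and produces only the annotation entries (a full COCO file also needs images and categories sections). The function name and the extra track_id field are assumptions.

```python
from typing import Dict, List


def mot_to_coco(mot_lines: List[str], category_id: int = 1) -> List[Dict]:
    """Convert MOT-format rows into COCO-style annotation dictionaries."""
    annotations: List[Dict] = []
    for line in mot_lines:
        fields = line.strip().split(",")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        frame, track_id = int(fields[0]), int(fields[1])
        left, top, width, height = (float(v) for v in fields[2:6])
        annotations.append({
            "id": len(annotations) + 1,     # running annotation id
            "image_id": frame,              # here the frame number doubles as the image id
            "category_id": category_id,
            "track_id": track_id,           # kept so tracking identity is not lost
            "bbox": [left, top, width, height],  # COCO uses [x, y, width, height]
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations
```

Both formats describe boxes by top-left corner plus width and height, so the box values carry over unchanged; the main work is restructuring the rows into JSON.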

Deploy the trained ByteTrack model
After we train the YOLOX model, we deploy the trained model for inference. SageMaker provides several options for model deployment, such as real-time inference, asynchronous inference, serverless inference, and batch inference. In our post, we use the sample code for real-time inference, asynchronous inference, and batch inference. You can choose the suitable code from these options based on your own business requirements.
Because SageMaker batch transform requires the data to be partitioned and stored on Amazon S3 as input and the invocations are sent to the inference endpoints concurrently, it doesn’t meet the requirements in object tracking tasks where the targets need to be sent in a sequential manner. Therefore, we don’t use the SageMaker batch transform jobs to run the batch inference. In this example, we use SageMaker processing jobs to do batch inference.
The following table summarizes the configuration for our inference jobs.

Inference Type | Payload | Processing Time | Auto Scaling
Real-time | Up to 6 MB | Up to 1 minute | Minimum instance count is 1 or higher
Asynchronous | Up to 1 GB | Up to 15 minutes | Minimum instance count can be zero
Batch (with processing job) | No limit | No limit | Not supported
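The payload column of the table can be turned into a small selection helper. The sketch below is only an illustration of that decision rule; the function name is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def choose_inference_option(payload_bytes: int) -> str:
    """Suggest a deployment option based only on the input payload size."""
    if payload_bytes <= 6 * MB:
        return "real-time"
    if payload_bytes <= 1 * GB:
        return "asynchronous"
    return "batch (processing job)"
```

In practice, processing time and scaling needs from the other columns also influence the choice, so this rule is a starting point and not a complete decision procedure.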

Deploy a real-time inference endpoint
To deploy a real-time inference endpoint, we can run the bytetrack-inference-yolox.ipynb notebook. We separate ByteTrack inference into object detection and tracking. In the inference endpoint, we only run the YOLOX model for object detection. In the notebook, we create a tracking object, receive the result of object detection from the inference endpoint, and update trackers.
We use the PyTorchModel class from the SageMaker Python SDK to create and deploy a ByteTrack model as follows:

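from sagemaker.pytorch.model import PyTorchModel

# Wrap the trained model artifact and the inference script in a SageMaker model
pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    source_dir="sagemaker-serving/code",
    entry_point="inference.py",
    framework_version="1.7.1",
    py_version="py3",
)

endpoint_name = "<endpoint name>"  # replace with your own endpoint name
# Deploy the model to a real-time endpoint on a single GPU instance
pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    endpoint_name=endpoint_name,
)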

After we deploy the model to an endpoint successfully, we can invoke the inference endpoint with the following code snippet:

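import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")  # runtime client used to call the endpoint

# Read one frame and send it to the endpoint as raw image bytes
with open(f"datasets/frame_{frame_id}.png", "rb") as f:
    payload = f.read()

response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/x-image", Body=payload
)
outputs = json.loads(response["Body"].read().decode())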

We run the tracking task on the client side after accepting the detection result from the endpoint (see the following code). By drawing the tracking results in each frame and saving as a tracking video, you can confirm the tracking result on the tracking video.

aspect_ratio_thresh = 1.6  # skip boxes whose width/height ratio exceeds this value
min_box_area = 10  # skip boxes smaller than this area (in pixels)
tracker = BYTETracker(
    frame_rate=30,
    track_thresh=0.5,
    track_buffer=30,
    mot20=False,
    match_thresh=0.8
)

# Update the tracker with the detections returned by the endpoint for this frame
online_targets = tracker.update(torch.as_tensor(outputs[0]), [height, width], (800, 1440))
online_tlwhs = []
online_ids = []
online_scores = []
for t in online_targets:
    tlwh = t.tlwh  # (top-left x, top-left y, width, height)
    tid = t.track_id
    vertical = tlwh[2] / tlwh[3] > aspect_ratio_thresh
    if tlwh[2] * tlwh[3] > min_box_area and not vertical:
        online_tlwhs.append(tlwh)
        online_ids.append(tid)
        online_scores.append(t.score)
        # Append one MOT-format result row per tracked object
        results.append(
            f"{frame_id},{tid},{tlwh[0]:.2f},{tlwh[1]:.2f},{tlwh[2]:.2f},{tlwh[3]:.2f},{t.score:.2f},-1,-1,-1\n"
        )
# Draw the tracked boxes and IDs on the current frame
online_im = plot_tracking(
    frame, online_tlwhs, online_ids, frame_id=frame_id + 1, fps=1. / timer.average_time
)

Deploy an asynchronous inference endpoint
SageMaker asynchronous inference is the ideal option for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near-real-time latency requirements. For MOT tasks, it’s common that a video file is beyond 6 MB, which is the payload limit of a real-time endpoint. Therefore, we deploy an asynchronous inference endpoint. Refer to Asynchronous inference for more details of how to deploy an asynchronous endpoint. We can reuse the model created for the real-time endpoint; for this post, we put a tracking process into the inference script so that we can get the final tracking result directly for the input video.
To use scripts related to ByteTrack on the endpoint, we need to put the tracking script and model into the same folder and compress the folder as the model.tar.gz file, and then upload it to the S3 bucket for model creation. The following diagram shows the structure of model.tar.gz.

We need to explicitly set the request size, response size, and response timeout as the environment variables, as shown in the following code. The name of the environment variable varies depending on the framework. For more details, refer to Create an Asynchronous Inference Endpoint.

pytorch_model = PyTorchModel(
    model_data=s3_model_uri,
    role=role,
    entry_point="inference.py",
    framework_version="1.7.1",
    sagemaker_session=sm_session,
    py_version="py3",
    env={
        "TS_MAX_REQUEST_SIZE": "1000000000",  # default max request size is 6 MB for TorchServe; raise it to support the 1 GB input payload
        "TS_MAX_RESPONSE_SIZE": "1000000000",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "900",  # max timeout is 15 minutes (900 seconds)
    },
)

pytorch_model.create(
    instance_type="ml.p3.2xlarge",
)

When invoking the asynchronous endpoint, instead of sending the payload in the request, we send the Amazon S3 URL of the input video. When the model inference finishes processing the video, the results will be saved on the S3 output path. We can configure Amazon Simple Notification Service (Amazon SNS) topics so that when the results are ready, we can receive an SNS message as a notification.
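The invocation described above can be sketched with the SageMaker runtime client's invoke_endpoint_async call. This is a minimal example and not the post's notebook code: the wrapper name invoke_async and the video/mp4 content type are assumptions, and the client is passed in so that the function can be exercised without AWS credentials.

```python
from typing import Any


def invoke_async(runtime_client: Any, endpoint_name: str, input_s3_uri: str,
                 content_type: str = "video/mp4") -> str:
    """Submit an S3-hosted input to an asynchronous endpoint.

    Returns the S3 location where the result will be written once inference finishes.
    """
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # the S3 URL of the input video, not the video bytes
        ContentType=content_type,
    )
    return response["OutputLocation"]
```

With boto3, the client would typically be created as `boto3.client("sagemaker-runtime")`. The call returns immediately, so the caller either polls the returned output location or waits for the SNS notification.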
Run batch inference with SageMaker processing
For video files bigger than 1 GB, we use a SageMaker processing job to do batch inference. We define a custom Docker container to run a SageMaker processing job (see the following code). We draw the tracking result on the input video. You can find the result video in the S3 bucket defined by s3_output.

from sagemaker.processing import ProcessingInput, ProcessingOutput

# script_processor is the processor object built on the custom Docker container
script_processor.run(
    code="./container-batch-inference/predict.py",
    inputs=[
        ProcessingInput(source=s3_input, destination="/opt/ml/processing/input"),
        ProcessingInput(source=s3_model_uri, destination="/opt/ml/processing/model"),
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output", destination=s3_output),
    ],
)

Clean up
To avoid unnecessary costs, delete the resources you created as part of this solution, including the inference endpoint.
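One way to remove the inference resources programmatically is through the SageMaker client's delete calls. The following is a minimal sketch, assuming you know the names of the endpoint, endpoint configuration, and model that were created; the helper name is hypothetical, and other resources such as S3 objects and the notebook instance still need to be removed separately.

```python
from typing import Any


def delete_inference_resources(sm_client: Any, endpoint_name: str,
                               endpoint_config_name: str, model_name: str) -> None:
    """Delete an endpoint together with its configuration and model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With boto3, the client would typically be created as `boto3.client("sagemaker")`.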
Conclusion
This post demonstrated how to implement a multi-object tracking solution on a custom dataset using one of the state-of-the-art algorithms on SageMaker. We also demonstrated three deployment options on SageMaker so that you can choose the optimal option for your own business scenario. If the use case requires low latency and needs a model to be deployed on an edge device, you can deploy the MOT solution at the edge with AWS Panorama.
For more information, refer to Multi Object Tracking using YOLOX + BYTE-TRACK and data analysis.

About the Authors
Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, Generative AI and MLOps. In his spare time, he loves running and hiking.
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.
Guang Yang is a Senior Applied Scientist at the Amazon ML Solutions Lab where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

CMU Researchers Propose GILL: An AI Method To Fuse LLMs With Image Enc …

With the release of OpenAI’s GPT-4, multimodality has been introduced in Large Language Models. Unlike the previous version, GPT-3.5, which lets the well-known ChatGPT accept only textual inputs, GPT-4 accepts both text and images as input. Recently, a team of researchers from Carnegie Mellon University proposed an approach called Generating Images with Large Language Models (GILL), which focuses on extending multimodal language models to generate novel images.

The GILL method enables the processing of inputs that are mixed with images and text to produce text, retrieve images, and create new images. GILL accomplishes this despite the models utilizing distinct text encoders by transferring the output embedding space of a frozen text-only LLM to that of a frozen image-generating model. Unlike other methods that call for interleaved image-text data, the mapping is accomplished by fine-tuning a small number of parameters utilizing image-caption pairings.

The team states that this method combines frozen text-only large language models with image encoding and decoding models that have already been trained. It can provide a wide range of multimodal capabilities, such as image retrieval, novel image generation, and multimodal dialogue. This is done by mapping between the modalities’ embedding spaces in order to fuse them. GILL can be conditioned on mixed image and text inputs and produces outputs that are both coherent and readable.

This method provides an effective mapping network that grounds the LLM to a text-to-image generation model in order to obtain strong image-generation performance. This mapping network converts hidden text representations into the visual model’s embedding space. In doing so, it uses the LLM’s powerful text representations to produce visually consistent outputs.

With this approach, the model can retrieve images from a specified dataset in addition to creating new images. The model chooses whether to generate or retrieve an image at inference time. A learned decision module conditioned on the LLM’s hidden representations makes this choice. This approach is computationally efficient because it doesn’t need to run the image generation model during training.

This method performs better than baseline generation models, especially for tasks requiring longer and more sophisticated language. In comparison, GILL outperforms the Stable Diffusion method in processing longer-form text, including dialogue and discourse. GILL performs better in dialogue-conditioned image generation than non-LLM-based generation models, benefiting from multimodal context and generating images that better match the given text. Unlike conventional text-to-image models that only process textual input, GILL can also process arbitrarily interleaved image-text inputs.

In conclusion, GILL (Generating Images with Large Language Models) seems promising because it demonstrates a wider range of abilities than previous multimodal language models. Its ability to outperform non-LLM-based generation models in various text-to-image tasks that measure context dependence makes it a powerful solution for multimodal tasks.

Check out the Paper and Project Page.

The post CMU Researchers Propose GILL: An AI Method To Fuse LLMs With Image Encoder And Decoder Models appeared first on MarkTechPost.

Meta Unveils An AI Model for Code Generation, Comparable to Copilot

Meta, the company behind a popular social media platform, has unveiled an advanced AI model called CodeCompose that can generate code comparable to GitHub’s Copilot. The newly developed AI model is designed to support developers by offering automated code snippets and recommendations while they write their programs, aiming to boost efficiency and streamline the coding process.

By harnessing the power of artificial intelligence, Meta’s code-generating model (CodeCompose) enables developers to receive real-time assistance during coding sessions. As developers type their code, the AI model analyzes the context and provides suggestions and code snippets that align with the desired functionality. This feature can accelerate the coding process by reducing the time spent searching for relevant code examples and implementing repetitive tasks.

Meta’s code-generating AI model is a significant step forward in aiding developers and enhancing their productivity. By leveraging the vast amounts of code available on platforms like GitHub, the AI model has been trained to recognize patterns and understand the intent behind the code. This enables it to generate relevant and context-specific suggestions, empowering developers to focus more on high-level design and problem-solving rather than getting bogged down in repetitive coding tasks.

The introduction of this AI-powered code generation model by Meta highlights the increasing reliance on artificial intelligence in the software development industry. It demonstrates the potential of AI to augment human capabilities and transform traditional coding practices. The aim is not to replace developers but to provide them with intelligent tools that can assist in various aspects of the coding process.

However, as with any AI-based technology, there are considerations regarding the quality and reliability of the generated code. While the code suggestions from the AI model can be a valuable resource, developers still need to review and validate the generated code to ensure its correctness and adherence to best practices. Code quality and security remain critical factors that require human oversight and expertise.

In conclusion, Meta’s development of a code-generating AI model similar to GitHub’s Copilot represents a significant advancement in software development. This AI-powered tool can potentially enhance developer productivity by providing real-time code suggestions and snippets. Although it is essential to approach generated code with caution, this technology showcases the potential of AI to revolutionize coding practices and empower developers with intelligent assistance.

Check out the Paper and Blog.

The post Meta Unveils An AI Model for Code Generation, Comparable to Copilot appeared first on MarkTechPost.

MIT Researchers Developed an AI Tool that Eliminates a Source of Bias …

Researchers at MIT have formulated a method to eliminate bias in trace-driven simulation, an approach regularly used by scientists and analysts to design algorithms for various use cases. Using machine learning algorithms and statistics that rely on causality principles, the researchers created a tool called CausalSim, which enables unbiased simulations. This method is a crucial development because it can improve algorithm design significantly, eventually leading to better-trained, better-evaluated, and better-suited models in areas such as video quality enhancement and data processing system performance.

Analysts, scientists, and researchers often depend on simulation-based approaches to test any new algorithm because of the high cost and risk of real-world experimentation. These trace-driven simulations, which involve recreating a miniature scenario from real-world data (traces) while activating and testing the targeted components, can unknowingly include biases and lead to suboptimal algorithm selection.

Researchers at MIT have tackled this challenge by creating an approach and a tool that helps to overcome the bias unknowingly introduced in these test simulations. Their machine learning model uses simple inference principles to understand better how the simulation’s behavior affects the data traces. This approach helps accurately replicate unbiased data traces during the simulation test process.

Video streaming applications were chosen as a compelling use case for the experimentation by the researchers as it is time-sensitive data and will add to the complexity of the problem, making it more realistic to examine. In this use case, an adaptive bitrate algorithm is used to determine the quality of video that will be delivered based on real-time data about the user’s bandwidth. By collecting real data points from end users during the video streaming process and using those data points as traces in simulations, researchers can then closely examine the impact of differently tweaked adaptive bitrate algorithms on the overall network performance.
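
To make the adaptive bitrate idea concrete, here is a minimal rate-based sketch. It is a generic textbook-style rule and not the algorithm studied by the MIT researchers; the function name, the bitrate ladder values, and the safety factor are all assumptions for illustration.

```python
from typing import Sequence


def pick_bitrate(bandwidth_kbps: float, ladder_kbps: Sequence[int],
                 safety: float = 0.8) -> int:
    """Pick the highest bitrate that fits within a fraction of the estimated bandwidth."""
    budget = bandwidth_kbps * safety  # leave headroom for bandwidth fluctuations
    affordable = [rate for rate in ladder_kbps if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget
    return max(affordable) if affordable else min(ladder_kbps)
```

A rule like this also hints at why traces can be biased: the throughput a client observes depends on which bitrate the algorithm chose, so replaying the recorded throughput under a different algorithm treats an affected quantity as if it were fixed.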

Previously, researchers assumed that the trace data were exogenous, that is, unaffected by the factors that are manipulated and changed during the simulation process. However, this assumption often leads to biased and suboptimal outcomes in real-world scenarios and can render the entire test invalid. Recognizing the impact of these errors, the MIT researchers sought a fix. Instead of approaching the issue conventionally, they framed it as a causal inference challenge.

While collecting unbiased traces, it is important to differentiate between the intrinsic properties of the system and the effects of a specific course of action taken on it. The researchers came up with CausalSim to tackle this issue. This machine learning model learns the underlying characteristics of a system using only the trace data. CausalSim estimates the underlying functions that produce the data. It helps researchers analyze how a new algorithm would affect the result under the same conditions the user experienced.

The actual effectiveness of CausalSim was showcased when the researchers used it to design an improved bitrate adaptation algorithm. In striking contrast to the predictions of a conventional trace-driven simulator, CausalSim helped them select a new variant that reduced the stall rate (the time spent rebuffering) by nearly 1.4 times compared to a well-accepted competing algorithm while maintaining the same video quality. Real-world tests confirmed the robust performance and accuracy of CausalSim’s predictions.

The performance of CausalSim was further highlighted as it consistently improved simulation accuracy over a 10-month experiment, resulting in algorithms that had significantly fewer errors than the baseline. The researchers have high hopes for this approach, claiming that it can revolutionize algorithm design and lead to further advancements.

Looking forward, the researchers at MIT plan to apply CausalSim to use cases where randomized data is unavailable or where recovering the causal dynamics of the system is significantly challenging. It will be interesting to watch how it is adopted in existing algorithms and whether it becomes an established approach to algorithm design.

Check out the Paper and Blog.

The post MIT Researchers Developed an AI Tool that Eliminates a Source of Bias in Simulations, Leading to Improved Algorithms that can Boost the Performance of an Application appeared first on MarkTechPost.

Translate documents in real time with Amazon Translate

A critical component of business success is the ability to connect with customers. Businesses today want to connect with their customers by offering their content across multiple languages in real time. For most customers, the content creation process is disconnected from the localization effort of translating content into multiple target languages. These disconnected processes delay a business’s ability to publish content in multiple languages simultaneously, inhibiting its outreach efforts, which negatively impacts time to market and revenue.
Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Now, Amazon Translate offers real-time document translation to seamlessly integrate and accelerate content creation and localization. You can submit a document from the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK and receive the translated document in real time while maintaining the format of the original document. This feature eliminates the wait for documents to be translated in asynchronous batch mode.
Real-time document translation currently supports plain text and HTML documents. You can use other Amazon Translate features such as custom terminology, profanity masking, and formality as part of the real-time document translation.
In this post, we will show you how to use this new feature.
Solution overview
This post walks you through the steps required to use real-time document translation with the console, AWS CLI, and Amazon Translate SDK. As an example, we will translate this sample text file from English to French.
Use Amazon Translate via the console
Follow these steps to try out real-time document translation on the console:

On the Amazon Translate console, choose Real-time translation in the navigation pane.
Choose the Document tab.
Specify the language of the source file as English.
Specify the language of the target file as French.

Note: Either the source or the target language must be English for real-time document translation.

Select Choose file and upload the file you want to translate.
Specify the document type.

Text and HTML formats are supported at the time of this writing.

Under Additional settings, you can use other Amazon Translate features in conjunction with real-time document translation.

For more information about Amazon Translate features, refer to the following resources:

Custom terminology – Customize Amazon Translate output to meet your domain and organization specific vocabulary
Formality – Select Formal or Informal for the translated text
Profanity masking – Apply profanity masking in Amazon Translate

Choose Translate and Download.

The translated file is automatically saved to your browser’s download folder, usually Downloads. The target language code is prefixed to the translated file’s name. For example, if your source file name is lang.txt and your target language is French (fr), then the translated file will be named fr.lang.txt.

Use Amazon Translate with the AWS CLI
You can translate the contents of a file using the following AWS CLI command. In this example, the contents of source-lang.txt will be translated into target-lang.txt.
aws translate translate-document --source-language-code en --target-language-code es \
    --document-content fileb://source-lang.txt \
    --document ContentType=text/plain \
    --query "TranslatedDocument.Content" \
    --output text | base64 \
    --decode > target-lang.txt

Use the Amazon Translate SDK (Python Boto3)
You can use the following Python code to invoke Amazon Translate SDK API to translate text or HTML documents synchronously:
import argparse

import boto3

# Parse the three positional arguments: source language, target language, file path
parser = argparse.ArgumentParser()
parser.add_argument("SourceLanguageCode")
parser.add_argument("TargetLanguageCode")
parser.add_argument("SourceFile")
args = parser.parse_args()

translate = boto3.client("translate")

localFile = args.SourceFile
# Read the document as raw bytes, which is what Document.Content expects
with open(localFile, "rb") as file:
    data = file.read()

# Choose the content type from the file extension (text and HTML are supported)
contentType = "text/html" if localFile.lower().endswith((".html", ".htm")) else "text/plain"

result = translate.translate_document(
    Document={
        "Content": data,
        "ContentType": contentType
    },
    SourceLanguageCode=args.SourceLanguageCode,
    TargetLanguageCode=args.TargetLanguageCode
)
if "TranslatedDocument" in result:
    fileName = localFile.split("/")[-1]
    tmpfile = f"{args.TargetLanguageCode}-{fileName}"
    # The translated content is returned as bytes, so decode it before writing
    with open(tmpfile, "w", encoding="utf-8") as f:
        f.write(result["TranslatedDocument"]["Content"].decode("utf-8"))

    print("Translated document ", tmpfile)

This program accepts three arguments: source language, target language, and file path. Use the following command to invoke this program:
python syncDocumentTranslation.py en es source-lang.txt
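
The optional features mentioned earlier (custom terminology, profanity masking, and formality) can be passed in the same request. The following is a minimal sketch, not part of the original program: the helper name build_translate_request and its arguments are hypothetical, and it only assembles the keyword arguments so you can inspect them before calling translate_document.

```python
from typing import List, Optional


def build_translate_request(
    data: bytes,
    source_lang: str,
    target_lang: str,
    content_type: str = "text/plain",
    formality: Optional[str] = None,
    mask_profanity: bool = False,
    terminology_names: Optional[List[str]] = None,
) -> dict:
    """Assemble keyword arguments for translate_document (hypothetical helper)."""
    request = {
        "Document": {"Content": data, "ContentType": content_type},
        "SourceLanguageCode": source_lang,
        "TargetLanguageCode": target_lang,
    }
    settings = {}
    if formality:
        settings["Formality"] = formality  # "FORMAL" or "INFORMAL"
    if mask_profanity:
        settings["Profanity"] = "MASK"
    if settings:
        request["Settings"] = settings
    if terminology_names:
        # Names of custom terminologies that already exist in your account
        request["TerminologyNames"] = list(terminology_names)
    return request


if __name__ == "__main__":
    kwargs = build_translate_request(
        b"Hello, world", "en", "fr", formality="FORMAL", mask_profanity=True
    )
    print(kwargs["Settings"])
    # With AWS credentials configured, the request could then be sent as:
    #   import boto3
    #   result = boto3.client("translate").translate_document(**kwargs)
```

Keeping the request assembly separate from the API call makes it easy to log or test the settings without sending a document. Not every language pair supports every setting, so check the Amazon Translate documentation for your target language.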

Conclusion
The real-time document translation feature in Amazon Translate can expedite time to market by integrating translation directly into the content creation and localization process.
For more information about Amazon Translate, visit Amazon Translate resources to find video resources and blog posts, and refer to AWS Translate FAQs.

About the Authors
Sathya Balakrishnan is a Senior Consultant in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.
RG Thiyagarajan is a Senior Consultant in Professional Services at AWS, specializing in application migration, security, and resiliency with US federal financial clients.
Sid Padgaonkar is the Senior Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him playing squash and exploring the food scene in the Pacific Northwest.

Scale your machine learning workloads on Amazon ECS powered by AWS Tra …

Running machine learning (ML) workloads with containers is becoming a common practice. Containers can fully encapsulate not just your training code, but the entire dependency stack down to the hardware libraries and drivers. What you get is an ML development environment that is consistent and portable. With containers, scaling on a cluster becomes much easier.
In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are purpose built for high-performance deep learning training. Trn1 instances deliver up to 50% savings on training costs over other comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. The AWS Neuron SDK was also released to support this acceleration, giving developers a compiler, runtime, and profiling tools to achieve high-performance and cost-effective model training.
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that simplifies your deployment, management, and scaling of containerized applications. Simply describe your application and the resources required, and Amazon ECS will launch, monitor, and scale your application across flexible compute options with automatic integrations to other supporting AWS services that your application needs.
In this post, we show you how to run your ML training jobs in a container using Amazon ECS to deploy, manage, and scale your ML workload.
Solution overview
We walk you through the following high-level steps:

Provision an ECS cluster of Trn1 instances with AWS CloudFormation.
Build a custom container image with the Neuron SDK and push it to Amazon Elastic Container Registry (Amazon ECR).
Create a task definition to define an ML training job to be run by Amazon ECS.
Run the ML task on Amazon ECS.

Prerequisites
To follow along, familiarity with core AWS services such as Amazon EC2 and Amazon ECS is assumed.
Provision an ECS cluster of Trn1 instances
To get started, launch the provided CloudFormation template, which will provision required resources such as a VPC, ECS cluster, and EC2 Trainium instance.
We use the Neuron SDK to run deep learning workloads on AWS Inferentia and Trainium-based instances. It supports you in your end-to-end ML development lifecycle to create new models, optimize them, then deploy them for production. To train your model with Trainium, you need the Neuron SDK in two places: on the EC2 instances where the ECS tasks will run, to map the NeuronDevice associated with the hardware, and in the Docker image that will be pushed to Amazon ECR, to access the commands to train your model.
Standard versions of Amazon Linux 2 or Ubuntu 20 don’t come with AWS Neuron drivers installed. Therefore, we have two different options.
The first option is to use a Deep Learning Amazon Machine Image (DLAMI) that has the Neuron SDK already installed. A sample is available on the GitHub repo. You can choose a DLAMI based on the operating system. Then run the following command to get the AMI ID:

aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text

The output will be as follows:
ami-06c40dd4f80434809
This AMI ID can change over time, so make sure to use the command to get the right AMI ID.
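
If you prefer to look up the AMI from Python, the same query can be sketched with Boto3. The function name latest_image_id is hypothetical; it takes an EC2 client as an argument and applies the same name and state filters as the CLI command, then picks the most recently created image.

```python
def latest_image_id(ec2_client, name_pattern: str) -> str:
    """Return the ID of the newest available Amazon-owned image matching name_pattern."""
    response = ec2_client.describe_images(
        Owners=["amazon"],
        Filters=[
            {"Name": "name", "Values": [name_pattern]},
            {"Name": "state", "Values": ["available"]},
        ],
    )
    # CreationDate is an ISO 8601 string, so lexicographic order is chronological
    images = sorted(
        response["Images"], key=lambda image: image["CreationDate"], reverse=True
    )
    if not images:
        raise LookupError(f"No image matches {name_pattern!r}")
    return images[0]["ImageId"]


# Example usage (requires AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   pattern = "Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????"
#   print(latest_image_id(ec2, pattern))
```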
Now you can change this AMI ID in the CloudFormation script and use the ready-to-use Neuron SDK. To do this, look for EcsAmiId in Parameters:

"EcsAmiId": {
    "Type": "String",
    "Description": "AMI ID",
    "Default": "ami-09def9404c46ac27c"
}

The second option is to install the Neuron SDK when the instance launches, by filling the userdata field during stack creation. You don’t need to install it manually because CloudFormation will set this up. For more information, refer to the Neuron Setup Guide.
For this post, we use option 2, in case you need to use a custom image. Complete the following steps:

Launch the provided CloudFormation template.
For KeyName, enter a name of your desired key pair, and it will preload the parameters. For this post, we use trainium-key.
Enter a name for your stack.
If you’re running in the us-east-1 Region, you can keep the values for ALBName and AZIds at their default.

To check what Availability Zone in the Region has Trn1 available, run the following command:

aws ec2 describe-instance-type-offerings --region us-east-1 --location-type availability-zone --filters Name=instance-type,Values=trn1.2xlarge

Choose Next and finish creating the stack.

When the stack is complete, you can move to the next step.
Prepare and push an ECR image with the Neuron SDK
Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. We use Amazon ECR to store a custom Docker image containing our scripts and Neuron packages needed to train a model with ECS jobs running on Trn1 instances. You can create an ECR repository using the AWS Command Line Interface (AWS CLI) or AWS Management Console. For this post, we use the console. Complete the following steps:

On the Amazon ECR console, create a new repository.
For Visibility settings, select Private.
For Repository name, enter a name.
Choose Create repository.

Now that you have a repository, let’s build and push an image, which can be built locally (on your laptop) or in an AWS Cloud9 environment. We are training a multi-layer perceptron (MLP) model. For the original code, refer to Multi-Layer Perceptron Training Tutorial.

Copy the train.py and model.py files into a project.

It’s already compatible with Neuron, so you don’t need to change any code.

5. Create a Dockerfile that has the commands to install the Neuron SDK and training scripts:

FROM amazonlinux:2

RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install aws-neuronx-collectives-2.* -y
RUN yum install aws-neuronx-runtime-lib-2.* -y
RUN yum install aws-neuronx-tools-2.* -y
RUN yum install -y tar gzip pip
RUN yum install -y python3 python3-pip
RUN yum install -y python3.7-venv gcc-c++
RUN python3.7 -m venv aws_neuron_venv_pytorch

# Activate Python venv
ENV PATH="/aws_neuron_venv_pytorch/bin:$PATH"
RUN python -m pip install -U pip
RUN python -m pip install wget
RUN python -m pip install awscli

RUN python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
RUN python -m pip install torchvision tqdm torch-neuronx neuronx-cc==2.* pillow
RUN mkdir -p /opt/ml/mnist_mlp
COPY model.py /opt/ml/mnist_mlp/model.py
COPY train.py /opt/ml/mnist_mlp/train.py
RUN chmod +x /opt/ml/mnist_mlp/train.py
CMD ["python3", "/opt/ml/mnist_mlp/train.py"]

To create your own Dockerfile using Neuron, refer to Develop on AWS ML accelerator instance, where you can find guides for other OS and ML frameworks.

6. Build an image and then push it to Amazon ECR using the following code (provide your Region, account ID, and ECR repository):

aws ecr get-login-password --region {your-region} | docker login --username AWS --password-stdin {your-account-id}.dkr.ecr.{your-region}.amazonaws.com

docker build -t mlp_trainium .

docker tag mlp_trainium:latest {your-account-id}.dkr.ecr.{your-region}.amazonaws.com/{your-ecr-repo-name}:latest

docker push {your-account-id}.dkr.ecr.{your-region}.amazonaws.com/{your-ecr-repo-name}:latest

After this, your image version should be visible in the ECR repository that you created.
Run the ML training job as an ECS task
To run the ML training task on Amazon ECS, you first need to create a task definition. A task definition is required to run Docker containers in Amazon ECS.

On the Amazon ECS console, choose Task definitions in the navigation pane.
On the Create new task definition menu, choose Create new task definition with JSON.

You can use the following task definition template as a baseline. Note that in the image field, you can use the one generated in the previous step. Make sure it includes your account ID and ECR repository name.
To make sure that Neuron is installed, you can check if the volume /dev/neuron0 is mapped in the devices block. This maps to a single NeuronDevice running on the trn1.2xlarge instance with two cores.

Create your task definition using the following template:

{
    "family": "mlp_trainium",
    "containerDefinitions": [
        {
            "name": "mlp_trainium",
            "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
            "cpu": 0,
            "memoryReservation": 1000,
            "portMappings": [],
            "essential": true,
            "environment": [],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "capabilities": {
                    "add": [
                        "IPC_LOCK"
                    ]
                },
                "devices": [
                    {
                        "hostPath": "/dev/neuron0",
                        "containerPath": "/dev/neuron0",
                        "permissions": [
                            "read",
                            "write"
                        ]
                    }
                ]
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/task-logs",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "placementConstraints": [
        {
            "type": "memberOf",
            "expression": "attribute:ecs.os-type == linux"
        },
        {
            "type": "memberOf",
            "expression": "attribute:ecs.instance-type == trn1.2xlarge"
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "1024",
    "memory": "3072"
}
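
Before registering a task definition, it can help to confirm programmatically that the Neuron device is mapped. The following sketch uses only the Python standard library; the function name neuron_devices is hypothetical, and the embedded JSON is a trimmed-down stand-in for the template, not a complete task definition.

```python
import json


def neuron_devices(task_definition: dict) -> list:
    """Return the host paths of Neuron devices mapped into any container."""
    paths = []
    for container in task_definition.get("containerDefinitions", []):
        devices = container.get("linuxParameters", {}).get("devices", [])
        for device in devices:
            host_path = device.get("hostPath", "")
            if host_path.startswith("/dev/neuron"):
                paths.append(host_path)
    return paths


# Trimmed example with the same structure as the task definition template
SAMPLE = """
{
    "family": "mlp_trainium",
    "containerDefinitions": [
        {
            "name": "mlp_trainium",
            "linuxParameters": {
                "devices": [
                    {
                        "hostPath": "/dev/neuron0",
                        "containerPath": "/dev/neuron0",
                        "permissions": ["read", "write"]
                    }
                ]
            }
        }
    ]
}
"""

if __name__ == "__main__":
    mapped = neuron_devices(json.loads(SAMPLE))
    if mapped:
        print("Neuron devices mapped:", mapped)
    else:
        print("No Neuron device mapping found")
```

Parsing the file with json.loads also catches syntax problems, such as a stray comma, before Amazon ECS rejects the definition.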

You can also complete this step with the AWS CLI using the following command:

aws ecs register-task-definition \
    --family mlp-trainium \
    --container-definitions '[{
        "name": "my-container-1",
        "image": "{your-account-id}.dkr.ecr.us-east-1.amazonaws.com/{your-ecr-repo-name}",
        "cpu": 0,
        "memoryReservation": 1000,
        "portMappings": [],
        "essential": true,
        "environment": [],
        "mountPoints": [],
        "volumesFrom": [],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-create-group": "true",
                "awslogs-group": "/ecs/task-logs",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "ecs"
            }
        },
        "linuxParameters": {
            "capabilities": {
                "add": [
                    "IPC_LOCK"
                ]
            },
            "devices": [{
                "hostPath": "/dev/neuron0",
                "containerPath": "/dev/neuron0",
                "permissions": ["read", "write"]
            }]
        }
    }]' \
    --requires-compatibilities EC2 \
    --cpu "8192" \
    --memory "16384" \
    --placement-constraints '[{
        "type": "memberOf",
        "expression": "attribute:ecs.instance-type == trn1.2xlarge"
    }, {
        "type": "memberOf",
        "expression": "attribute:ecs.os-type == linux"
    }]'

Run the task on Amazon ECS
After we have created the ECS cluster, pushed the image to Amazon ECR, and created the task definition, we run the task definition to train a model on Amazon ECS.

On the Amazon ECS console, choose Clusters in the navigation pane.
Open your cluster.
On the Tasks tab, choose Run new task.

For Launch type, choose EC2.

For Application type, select Task.
For Family, choose the task definition you created.

In the Networking section, specify the VPC created by the CloudFormation stack, subnet, and security group.

Choose Create.

You can monitor your task on the Amazon ECS console.

You can also run the task using the AWS CLI:

aws ecs run-task --cluster <your-cluster-name> --task-definition <your-task-name> --count 1 --network-configuration '{"awsvpcConfiguration": {"subnets": ["<your-subnet-name>"], "securityGroups": ["<your-sg-name>"]}}'

The result will look like the following screenshot.

You can also check the details of the training job through the Amazon CloudWatch log group.

After you train your models, you can store them in Amazon Simple Storage Service (Amazon S3).
Clean up
To avoid additional expenses, you can set both Minimum capacity and Desired capacity of the Auto Scaling group to zero to shut down the Trainium instances. To do a complete cleanup, delete the CloudFormation stack to remove all resources created by this template.
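
The scale-down step can also be scripted. The following is a minimal sketch assuming a Boto3 Auto Scaling client; the function name scale_to_zero is hypothetical, and you need to supply the name of the Auto Scaling group that the CloudFormation stack created.

```python
def scale_to_zero(autoscaling_client, group_name: str) -> None:
    """Set minimum and desired capacity to zero so the instances are terminated."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=0,
        DesiredCapacity=0,
    )


# Example usage (requires AWS credentials):
#   import boto3
#   scale_to_zero(boto3.client("autoscaling"), "<your-auto-scaling-group-name>")
```

This only stops the instances. The other resources in the stack remain until you delete the stack itself.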
Conclusion
In this post, we showed how to use Amazon ECS to deploy your ML training jobs. We created a CloudFormation template to create the ECS cluster of Trn1 instances, built a custom Docker image, pushed it to Amazon ECR, and ran the ML training job on the ECS cluster using a Trainium instance.
For more information about Neuron and what you can do with Trainium, check out the following resources:

Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters
Scaling distributed training with AWS Trainium and Amazon EKS
Neuron FAQ

About the Authors
Guilherme Ricci is a Senior Startup Solutions Architect on Amazon Web Services, helping startups modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working with a team of AI/ML specialists.
Evandro Franco is an AI/ML Specialist Solutions Architect working on Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 15 years working with technology, from software development, infrastructure, serverless, to machine learning.
Matthew McClean leads the Annapurna ML Solution Architecture team that helps customers adopt AWS Trainium and AWS Inferentia products. He is passionate about generative AI and has been helping customers adopt AWS technologies for the last 10 years.

Host ML models on Amazon SageMaker using Triton: CV model with PyTorch …

PyTorch is a machine learning (ML) framework based on the Torch library, used for applications such as computer vision and natural language processing. One of the primary reasons that customers are choosing a PyTorch framework is its simplicity and the fact that it’s designed and assembled to work with Python. PyTorch supports dynamic computational graphs, enabling network behavior to be changed at runtime. This provides a major flexibility advantage over the majority of ML frameworks, which require neural networks to be defined as static objects before runtime. In this post, we dive deep to see how Amazon SageMaker can serve these PyTorch models using NVIDIA Triton Inference Server.
SageMaker provides several options for customers who are looking to host their ML models. One of the key available features is SageMaker real-time inference endpoints. Real-time workloads can have varying levels of performance expectations and service level agreements (SLAs), which materialize as latency and throughput requirements.
With real-time endpoints, different deployment options adjust to different tiers of expected performance. For example, your business may rely on a model that must meet very strict SLAs for latency and throughput with predictable performance. In this case, SageMaker provides single-model endpoints (SMEs), allowing you to deploy a single ML model to a logical endpoint, which will use the underlying server’s networking and compute capacity. For other use cases where you need a better balance between performance and cost, multi-model endpoints (MMEs) allow you to deploy multiple models behind a logical endpoint and invoke them individually, while abstracting their loading and unloading from memory.
SageMaker provides support for single-model and multi-model endpoints through NVIDIA Triton Inference Server. Triton supports various backends as engines to power the running and serving of different framework models, like PyTorch, TensorFlow, TensorRT, or ONNX Runtime. For any Triton deployment, it’s crucial to understand how the backend behavior impacts your workload and what to expect from its unique configuration parameters. In this post, we help you understand the Triton PyTorch backend in depth.
Triton with PyTorch backend
The PyTorch backend is designed to run TorchScript models using the PyTorch C++ API. TorchScript is a static subset of Python that captures the structure of a PyTorch model. To use this backend, you need to convert your PyTorch model to TorchScript using Just-In-Time (JIT) compilation. JIT compiles the TorchScript code into an optimized intermediate representation, making it suitable for deployment in non-Python environments. Triton uses TorchScript for improved performance and flexibility.
Each model deployed with Triton requires a configuration file (config.pbtxt) that specifies model metadata, such as input and output tensors, model name, and platform. The configuration file is essential for Triton to understand how to load, run, and optimize the model. For PyTorch models, the platform field in the configuration file should be set to pytorch_libtorch. You can load Triton PyTorch models on GPU and CPU (see Multiple Model Instances) and model weights will be kept either in GPU memory/VRAM or in host memory/RAM correspondingly.
Note that only the model’s forward method will be called when using the PyTorch backend; if you rely on more complex logic to prepare, iterate, and transform your raw model’s predictions to respond to a request, you should wrap it as a custom model forward. Alternatively, you can use ensemble models or business logic scripting.
You can optimize PyTorch model performance on Triton by using a combination of available configuration-based features. Some of these are backend-agnostic, like dynamic batching and concurrent model runs (see Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker to learn more), and some are PyTorch-specific. Let’s take a deeper look into these configuration parameters and how you should use them:

DISABLE_OPTIMIZED_EXECUTION – Use this parameter to disable the optimized execution of TorchScript models, which is enabled by default. Optimized execution slows down the initial calls to a loaded TorchScript model, and might not benefit or even hinder model performance in some cases. Set to true if your tolerance to scaling or cold start latency is very low.
INFERENCE_MODE – Use this parameter to toggle PyTorch inference mode. In inference mode, computations aren’t recorded in the backward graph, and it allows PyTorch to speed up your model. This better runtime comes with a drawback: you won’t be able to use tensors created in inference mode in computations to be recorded by autograd after exiting inference mode. Set to true if the preceding conditions apply to your use case (mostly true for inference workloads).
ENABLE_NVFUSER – Use this parameter to enable NvFuser (CUDA Graph Fuser) optimization for TorchScript models. If not specified, the default PyTorch fuser is used.
ENABLE_WEIGHT_SHARING – Use this parameter to allow model instances (copies) on the same device to share weights. This can reduce memory usage of model loading and inference. It should not be used with models that maintain state.
ENABLE_CACHE_CLEANING – Use this parameter to enable CUDA cache cleaning after each model run (only has an effect if the model is on GPU). Setting this flag to true will negatively impact the performance due to additional CUDA cache cleaning operations after each model run. You should only use this flag if you serve multiple models with Triton and encounter CUDA out of memory issues during model runs.
ENABLE_JIT_EXECUTOR, ENABLE_JIT_PROFILING, and ENABLE_TENSOR_FUSER – Use these parameters to disable certain PyTorch optimizations that can sometimes cause latency regressions in models with complex run modes and dynamic shapes.
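
These flags are set as string-valued entries in the parameters section of config.pbtxt. As a rough illustration, the following sketch renders such entries from a Python dictionary. The helper name render_parameters is hypothetical, and the flag values shown are examples rather than recommendations.

```python
def render_parameters(flags: dict) -> str:
    """Render boolean backend flags as config.pbtxt parameters entries."""
    blocks = []
    for key, enabled in flags.items():
        value = "true" if enabled else "false"
        blocks.append(
            "parameters: {\n"
            f'  key: "{key}"\n'
            f'  value: {{ string_value: "{value}" }}\n'
            "}"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Example values only; tune them against your own latency measurements
    print(render_parameters({"INFERENCE_MODE": True, "ENABLE_CACHE_CLEANING": False}))
```

The rendered text can be appended to the model’s config.pbtxt. Generating it from a dictionary keeps the flags for several models in one place when you manage many models behind an MME.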

Triton Inference on SageMaker
SageMaker allows you to deploy both SMEs and MMEs with NVIDIA Triton Inference Server. The following figure shows Triton’s high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server via HTTPS and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests and the outputs are then returned.
When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, based on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for SMEs, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For MMEs, we recommend deploying similar models behind a given endpoint to have more steady, predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple MMEs, or spend added time fine-tuning their auto scaling group policy to obtain the best cost and performance balance. See Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints for more information on auto scaling policy considerations for MMEs. (Note that although the MMS configurations don’t apply in this case, the policy considerations still do.)
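
To make the scaling guidance concrete, the following sketch builds the two requests that a target tracking policy on SageMakerVariantInvocationsPerInstance typically needs with Application Auto Scaling. The function name, policy name, capacity limits, and target value are all assumptions for illustration; choose values based on load tests of your own models.

```python
def build_scaling_requests(
    endpoint_name: str,
    variant_name: str = "AllTraffic",
    min_instances: int = 1,
    max_instances: int = 4,
    invocations_per_instance: float = 100.0,
):
    """Return (scalable target, scaling policy) keyword arguments for an endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return target, policy


# Example usage (requires AWS credentials and an existing endpoint):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   target, policy = build_scaling_requests("<your-endpoint-name>")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```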

For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.
Solution overview
In the following sections, we walk through an example available on GitHub to understand how we can use Triton and SageMaker MMEs on GPU to deploy a ResNet model for image classification. For demonstration purposes, we use a pre-trained ResNet50 model that can classify images into 1,000 categories.
Prerequisites
You first need an AWS account and an AWS Identity and Access Management (IAM) administrator user. For instructions on how to set up an AWS account, see How do I create and activate a new AWS account. For instructions on how to secure your account with an IAM administrator user, see Creating your first IAM admin user and user group.
SageMaker needs access to the Amazon Simple Storage Service (Amazon S3) bucket that stores your model. Create an IAM role with a policy that gives SageMaker read access to your bucket.
If you plan to run the notebook in Amazon SageMaker Studio, refer to Get Started for setup instructions.
Set up your environment
To set up your environment, complete the following steps:
Launch a SageMaker notebook instance with a g5.xlarge instance.
You can also run this example on a Studio notebook instance.

Select Clone a public git repository to this notebook instance only and specify the GitHub repository URL.
When JupyterLab is ready, launch the resnet_pytorch_python_backend_MME.ipynb notebook with the conda_python3 conda kernel and run through this notebook step by step.

Install the dependencies and import the required library
Use the following code to install dependencies and import the required library:

!pip install nvidia-pyindex --quiet
!pip install tritonclient[http] --quiet

# imports
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import numpy as np
from PIL import Image
import tritonclient.http as httpclient
# variables
s3_client = boto3.client("s3")

# sagemaker variables
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

Prepare the model artifacts
The generate_model_pytorch.sh file in the workspace directory contains scripts to load and save a PyTorch model. First, we load a pre-trained ResNet50 model using the torchvision models package. We save the model as a model.pt file in TorchScript optimized and serialized format. Converting the model to TorchScript requires example inputs for a forward pass, so we pass one instance of an RGB image with three color channels of dimension 224x224. The script for exporting this model can be found on the GitHub repo.

!docker run --gpus=all --rm -it \
            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.02-py3 \
            /bin/bash generate_model_pytorch.sh

Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. In our case, the subdirectory named 1 represents version 1 of our PyTorch model. Each model is run by its specific backend, so each version subdirectory must contain the model artifact required by that backend. Because we’re using a PyTorch backend, a model.pt file is required within the version directory. For more details on naming conventions for model files, refer to Model Files.
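
The layout described above can be sketched with a few lines of Python. This example is illustrative only: the function name create_model_repository is hypothetical, the repository and model names (triton-serve-pt and resnet) are taken from the commands and configuration in this post, and the files are empty placeholders rather than a real TorchScript model or configuration.

```python
import tempfile
from pathlib import Path


def create_model_repository(root: Path, model_name: str = "resnet", version: int = 1) -> Path:
    """Create <root>/<model_name>/<version>/model.pt and <root>/<model_name>/config.pbtxt."""
    model_dir = root / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "model.pt").touch()    # placeholder for the TorchScript artifact
    (model_dir / "config.pbtxt").touch()  # placeholder for the model configuration
    return model_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "triton-serve-pt"
        create_model_repository(repo)
        # Print every entry relative to the parent of the repository root
        for path in sorted(repo.rglob("*")):
            print(path.relative_to(repo.parent))
```

Note that config.pbtxt sits next to the version directory, not inside it, and that the model directory name should match the name field in the configuration.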

Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Our config.pbtxt file specifies the platform as pytorch_libtorch, and defines input and output tensor shapes and data type information. We also specify that we want to run this model on the GPU via the instance_group parameter. See the following code:

name: "resnet"
platform: "pytorch_libtorch"

max_batch_size: 128
input {
  name: "INPUT__0"
  data_type: TYPE_FP32
  dims: 3
  dims: 224
  dims: 224
}
output {
  name: "OUTPUT__0"
  data_type: TYPE_FP32
  dims: 1000
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

For the instance_group config, when only a count is specified, Triton loads that many instances of the model on each available GPU device. If you want to control which GPU devices to load your models on, you can do so explicitly by specifying the GPU device IDs. Note that for MMEs, explicitly specifying such GPU device IDs might lead to poor memory management because multiple models may explicitly try to allocate the same GPU device.
We then tar.gz the model artifacts, which is the format expected by SageMaker:

!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz resnet
model_uri_pt = sagemaker_session.upload_data(path="resnet_pt_v0.tar.gz", key_prefix=prefix)

Now that we have uploaded the model artifacts to Amazon S3, we can create a SageMaker multi-model endpoint.
Deploy the model
We now deploy the Triton model to a SageMaker MME. In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. We set the container with an image that supports deploying MMEs with GPU (refer to the MME container images for more details). Set Mode to MultiModel to indicate that SageMaker will create the endpoint with MME container specifications. This parameter is the key differentiator from a single-model endpoint.

container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}

Using the SageMaker Boto3 client, create the model using the create_model API. We pass the container definition to the create_model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)
print("Model Arn: " + create_model_response["ModelArn"])

Create MME configurations using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (for this post, we use a g4dn.4xlarge instance). We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.4xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])
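
The call above returns immediately, so the waiting step has to be done separately. One possible way to wait is to poll describe_endpoint until the status becomes InService, as in the sketch below. The function name wait_for_endpoint and its polling defaults are assumptions for illustration; Boto3 also ships a built-in waiter, sm_client.get_waiter("endpoint_in_service"), that can be used instead.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService (hypothetical helper).

    Raises RuntimeError if creation fails and TimeoutError if it takes too long.
    The `sleep` argument is injectable so the loop can be exercised without waiting.
    """
    waited = 0
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "unknown")
            raise RuntimeError("Endpoint creation failed: " + reason)
        if waited >= timeout_seconds:
            raise TimeoutError("Endpoint not InService after %d seconds" % waited)
        sleep(poll_seconds)
        waited += poll_seconds


# Example call, using the client and name defined earlier:
# wait_for_endpoint(sm_client, endpoint_name)
```
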

Invoke the model and run predictions
The following methods transform the sample image we use for inference into a payload that can be sent to the Triton server.
The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference:

s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", "shiba_inu_dog.jpg"
)

def get_sample_image():
    image_path = "./shiba_inu_dog.jpg"
    img = Image.open(image_path).convert("RGB")
    img = img.resize((224, 224))
    # Scale to [0, 1], then normalize each channel with the mean and std below
    img = (np.array(img).astype(np.float32) / 255) - np.array(
        [0.485, 0.456, 0.406], dtype=np.float32
    ).reshape(1, 1, 3)
    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
    # Reorder from height-width-channel to channel-height-width
    img = np.transpose(img, (2, 0, 1))
    return img.tolist()

def _get_sample_image_binary(input_name, output_name):
    inputs = []
    outputs = []
    inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], "FP32"))
    input_data = np.array(get_sample_image(), dtype=np.float32)
    # Add the batch dimension expected by the model
    input_data = np.expand_dims(input_data, axis=0)
    inputs[0].set_data_from_numpy(input_data, binary_data=True)
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

def get_sample_image_binary_pt():
    return _get_sample_image_binary("INPUT__0", "OUTPUT__0")

After the endpoint is successfully created, we can send inference requests to the MME using the invoke_endpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type:

request_body, header_length = get_sample_image_binary_pt()
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="resnet_pt_v0.tar.gz",
)
# Parse the JSON header size from the response content type
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]
# Read the response body
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
output0_data = result.as_numpy("OUTPUT__0")
print(output0_data)
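
The content type string carries the JSON header size in both directions: the request announces it, and the response must be parsed to recover it. As a small sketch, the two helpers below isolate that string handling so it can be reused and checked on its own. The function names are hypothetical; only the content type prefix comes from the invocation code above.

```python
TRITON_BINARY_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def triton_binary_content_type(header_length: int) -> str:
    """Build the ContentType value for a binary Triton request."""
    return TRITON_BINARY_PREFIX + str(header_length)


def parse_json_header_size(content_type: str) -> int:
    """Extract the JSON header size from a response ContentType value."""
    if not content_type.startswith(TRITON_BINARY_PREFIX):
        # Fail loudly rather than slicing an unrelated content type.
        raise ValueError("Unexpected content type: " + content_type)
    return int(content_type[len(TRITON_BINARY_PREFIX):])


# Round trip: the size written into the header is the size read back.
content_type = triton_binary_content_type(128)
print(content_type)
print(parse_json_header_size(content_type))
```

Checking the prefix before slicing guards against silently misreading a response that came back with a different content type, for example an error payload.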

Additionally, SageMaker MMEs provide instance-level metrics to monitor using Amazon CloudWatch:

LoadedModelCount – Number of models loaded in the containers
GPUUtilization – Percentage of GPU units that are used by the containers
GPUMemoryUtilization – Percentage of GPU memory used by the containers
DiskUtilization – Percentage of disk space used by the containers

SageMaker MMEs also provide model loading metrics such as the following:

ModelLoadingWaitTime – Time interval for the model to be downloaded or loaded
ModelUnloadingTime – Time interval to unload the model from the container
ModelDownloadingTime – Time to download the model from Amazon S3
ModelCacheHit – Number of invocations sent to a model that was already loaded onto the container

For more details, refer to Monitor Amazon SageMaker with Amazon CloudWatch.
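
As a rough sketch of how these metrics could be read programmatically, the helper below builds the keyword arguments for the CloudWatch get_metric_statistics call in Boto3. The helper name metric_query is hypothetical. The two namespaces and the EndpointName and VariantName dimensions are assumptions drawn from the SageMaker CloudWatch documentation as we understand it, so verify them against the page referenced above before relying on them.

```python
from datetime import datetime, timedelta, timezone

# Assumed namespaces: instance-level metrics vs. model loading metrics.
INSTANCE_NAMESPACE = "/aws/sagemaker/Endpoints"
LOADING_NAMESPACE = "AWS/SageMaker"


def metric_query(metric_name, namespace, endpoint_name, variant_name="AllTraffic",
                 minutes=15, period=60, statistic="Average", now=None):
    """Build kwargs for cloudwatch.get_metric_statistics (hypothetical helper)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": [statistic],
    }


# Example usage (requires AWS credentials, so it is left as a comment):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# query = metric_query("GPUMemoryUtilization", INSTANCE_NAMESPACE, endpoint_name)
# for point in cloudwatch.get_metric_statistics(**query)["Datapoints"]:
#     print(point["Timestamp"], point["Average"])
```
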
Clean up
In order to avoid incurring charges, delete the model endpoint:

sm_client.delete_model(ModelName=sm_model_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_endpoint(EndpointName=endpoint_name)

Best practices
When using the PyTorch backend, most optimization decisions will depend on your specific workload latency or throughput requirements and what model architecture you are using. In general, in order to do a data-driven comparison of configuration parameters to improve performance, you should use Triton’s Performance Analyzer. With this tool, you should adopt the following decision logic:

Experiment and check if your model architecture can be transformed into a TensorRT engine and deployed with the Triton TensorRT backend. This is the preferable way to deploy models with NVIDIA GPUs because both the TensorRT model format and runtime make the best use of the underlying hardware capabilities.
Always set INFERENCE_MODE to true for pure inference workloads where no autograd calculations are required.
If deploying SMEs, maximize hardware utilization by properly defining instance group configuration according to the available GPU memory or RAM (use the Performance Analyzer tool to find the right size).

For more MME-specific best practices, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.
Conclusion
In this post, we dove deep into the PyTorch backend supported by Triton Inference Server, which provides acceleration for both CPU- and GPU-based models. We went through some of the configuration parameters you can adjust to optimize model performance. Finally, we walked through an example notebook that demonstrates deploying a model to a SageMaker multi-model endpoint. Be sure to try it out!

About the Authors
Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.
Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.