Meet LieGAN: An AI Framework That Uses Generative Adversarial Training To Automatically Discover Equivariances From A Dataset

In deep learning, symmetry is a crucial inductive bias. Convolutional neural networks exploit the translational symmetry of images, and graph neural networks exploit the permutation symmetry of graphs. Theoretical research and practical methods for constructing general group-equivariant neural networks have seen a recent uptick in interest.

Equivariant neural networks provide several advantages, but building such a model first requires knowing the data symmetry explicitly. Identifying the true symmetries of the data can be challenging in practice, and restricting the model to an exact mathematical symmetry may not be optimal.

Researchers from the University of California San Diego, Northeastern University, and IBM Research introduce a novel approach based on generative adversarial training to extract continuous symmetry from data. The work frames symmetry as a property of the data distribution: the method trains a symmetry generator that applies the learned transformations to the training data and produces an output distribution comparable to the original dataset, indicating equivariance or invariance.

Their method, LieGAN, finds continuous symmetries as matrix groups by employing the theory of Lie groups and Lie algebras. Parameterization techniques allow it to handle various symmetries, including discrete group transformations and group subsets. LieGAN directly produces an orthogonal Lie algebra basis, making it interpretable. The findings demonstrate that LieGAN's learned Lie algebra leads to high-quality results in downstream tasks such as N-body dynamics and top quark tagging: using either an equivariant model or data augmentation built on the discovered symmetry improves prediction performance across several datasets and establishes pipelines for exploiting the learned symmetry in downstream prediction tasks.
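As a rough illustration of the core idea (not the authors' implementation), a continuous symmetry can be represented by a learned Lie algebra basis; sampling coefficients and applying the matrix exponential yields group elements that transform the data. The basis, dimensions, and function name below are placeholders.

import numpy as np
from scipy.linalg import expm

# Hypothetical learned Lie algebra basis for 2x2 matrices (here: the generator of 2D rotations, SO(2)).
basis = np.array([[[0.0, -1.0],
                   [1.0,  0.0]]])  # shape (c, n, n) with c = 1 basis element

def sample_group_element(basis, scale=1.0, rng=np.random.default_rng()):
    """Sample coefficients for each basis element and map to a group element via the matrix exponential."""
    coeffs = rng.normal(scale=scale, size=len(basis))
    algebra_element = np.tensordot(coeffs, basis, axes=1)  # linear combination sum_i w_i * L_i
    return expm(algebra_element)                            # exp maps the Lie algebra into the Lie group

g = sample_group_element(basis)
x = np.array([1.0, 0.0])
x_transformed = g @ x  # applying the sampled symmetry transformation to a data point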

They also present LieGNN, a modified E(n) Equivariant Graph Neural Network (EGNN) that incorporates the symmetries learned by LieGAN and achieves the same level of performance as equivariant models built with ground-truth symmetries.

The present work focuses on global symmetries that form subgroups of the general linear group. However, the researchers believe that by substituting a more expressive architecture for the simple linear transformation generator in LieGAN, the framework can be extended to more general symmetry discovery scenarios, including non-connected Lie group symmetry, nonlinear symmetry, and gauge symmetry.


Build a multilingual automatic translation pipeline with Amazon Transl …

Dive into Deep Learning (D2L.ai) is an open-source textbook that makes deep learning accessible to everyone. It features interactive Jupyter notebooks with self-contained code in PyTorch, JAX, TensorFlow, and MXNet, as well as real-world examples, exposition figures, and math. So far, D2L has been adopted by more than 400 universities around the world, such as the University of Cambridge, Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, and Tsinghua University. This work is also made available in Chinese, Japanese, Korean, Portuguese, Turkish, and Vietnamese, with plans to launch Spanish and other languages.
It is a challenging endeavor to have an online book that is continuously kept up to date, written by multiple authors, and available in multiple languages. In this post, we present a solution that D2L.ai used to address this challenge by using the Active Custom Translation (ACT) feature of Amazon Translate and building a multilingual automatic translation pipeline.
We demonstrate how to use the AWS Management Console and Amazon Translate public API to deliver automatic machine batch translation, and analyze the translations between two language pairs: English and Chinese, and English and Spanish. We also recommend best practices when using Amazon Translate in this automatic translation pipeline to ensure translation quality and efficiency.
Solution overview
We built automatic translation pipelines for multiple languages using the ACT feature in Amazon Translate. ACT allows you to customize translation output on the fly by providing tailored translation examples in the form of parallel data. Parallel data consists of a collection of textual examples in a source language and the desired translations in one or more target languages. During translation, ACT automatically selects the most relevant segments from the parallel data and updates the translation model on the fly based on those segment pairs. This results in translations that better match the style and content of the parallel data.
The architecture contains multiple sub-pipelines; each sub-pipeline handles one language translation, such as English to Chinese or English to Spanish. Multiple translation sub-pipelines can be processed in parallel. In each sub-pipeline, we first build the parallel data in Amazon Translate using a high-quality dataset of tailored translation examples from the human-translated D2L books. Then we generate the customized machine translation output on the fly at run time, which achieves better quality and accuracy.

In the following sections, we demonstrate how to build each translation pipeline using Amazon Translate with ACT, along with Amazon SageMaker and Amazon Simple Storage Service (Amazon S3).
First, we put the source documents, reference documents, and parallel data training set in an S3 bucket. Then we build Jupyter notebooks in SageMaker to run the translation process using Amazon Translate public APIs.
Prerequisites
To follow the steps in this post, make sure you have an AWS account with the following:

Access to AWS Identity and Access Management (IAM) for role and policy configuration
Access to Amazon Translate, SageMaker, and Amazon S3
An S3 bucket to store the source documents, reference documents, parallel data dataset, and output of translation

Create an IAM role and policies for Amazon Translate with ACT
Our IAM role needs to contain a custom trust policy for Amazon Translate:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Principal": {
            "Service": "translate.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}

This role must also have a permissions policy that grants Amazon Translate read access to the input folder and subfolders in Amazon S3 that contain the source documents, and read/write access to the output S3 bucket and folder that contains the translated documents:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME",
            "arn:aws:s3:::YOUR-S3_BUCKET-NAME/*"
        ]
    }]
}

To run Jupyter notebooks in SageMaker for the translation jobs, we need to grant an inline permission policy to the SageMaker execution role. This role passes the Amazon Translate service role to SageMaker that allows the SageMaker notebooks to have access to the source and translated documents in the designated S3 buckets:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["iam:PassRole"],
        "Effect": "Allow",
        "Resource": [
            "arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/batch-translate-api-role"
        ]
    }]
}

Prepare parallel data training samples
The parallel data in ACT is trained from an input file consisting of a list of textual example pairs, for instance, pairs of source language (English) and target language (Chinese) segments. The input file can be in TMX, CSV, or TSV format. The following screenshot shows an example of a CSV input file. The first column is the source language data (in English), and the second column is the target language data (in Chinese). The example is extracted from the D2L-en and D2L-zh books.
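Since the screenshot is not reproduced here, a minimal illustration of the expected CSV layout might look like the following; the header row carries the language codes, and the sentences are placeholders rather than actual D2L content:

en,zh
"Machine learning studies how computer systems can leverage experience to improve performance.","机器学习研究计算机系统如何利用经验来提高性能。"
"The gradient points in the direction of steepest ascent.","梯度指向最陡峭的上升方向。"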

Perform custom parallel data training in Amazon Translate
First, we set up the S3 bucket and folders as shown in the following screenshot. The source_data folder contains the source documents before the translation; the generated documents after the batch translation are put in the output folder. The ParallelData folder holds the parallel data input file prepared in the previous step.

After uploading the input files to the source_data folder, we can use the CreateParallelData API to run a parallel data creation job in Amazon Translate:

import boto3

translate_client = boto3.client("translate")

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
pd_description = "Parallel Data for English to Chinese"
pd_fn = "d2l_short_test_sentence_enzh_all.csv"

response_t = translate_client.create_parallel_data(
    Name=pd_name,                # pd_name is the parallel data name
    Description=pd_description,  # pd_description is the parallel data description
    ParallelDataConfig={
        "S3Uri": "s3://" + S3_BUCKET + "/Paralleldata/" + pd_fn,  # S3_BUCKET is the S3 bucket name defined in the previous step
        "Format": "CSV",
    },
)
print(pd_name, ": ", response_t["Status"], " created.")

To update existing parallel data with new training datasets, we can use the UpdateParallelData API:
S3_BUCKET = "YOUR-S3_BUCKET-NAME"
pd_name = "pd-d2l-short_test_sentence_enzh_all"
pd_description = "Parallel Data for English to Chinese"
pd_fn = "d2l_short_test_sentence_enzh_all.csv"

response_t = translate_client.update_parallel_data(
    Name=pd_name,                # pd_name is the parallel data name
    Description=pd_description,  # pd_description is the parallel data description
    ParallelDataConfig={
        "S3Uri": "s3://" + S3_BUCKET + "/Paralleldata/" + pd_fn,  # S3_BUCKET is the S3 bucket name defined in the previous step
        "Format": "CSV",
    },
)
print(pd_name, ": ", response_t["Status"], " updated.")

We can check the training job progress on the Amazon Translate console. When the job is complete, the parallel data status shows as Active and is ready to use.

Run asynchronous batch translation using parallel data
Batch translation is a process in which multiple source documents are automatically translated into documents in the target languages. The process involves uploading the source documents to the input folder of the S3 bucket and then calling the StartTextTranslationJob API of Amazon Translate to initiate an asynchronous translation job:

S3_BUCKET = "YOUR-S3_BUCKET-NAME"
ROLE_ARN = "THE_ROLE_DEFINED_IN_STEP_1"
src_fdr = "source_data"
output_fdr = "output"
src_lang = "en"
tgt_lang = "zh"
pd_name = "pd-d2l-short_test_sentence_enzh_all"

response = translate_client.start_text_translation_job(
    JobName="D2L_job",
    InputDataConfig={
        "S3Uri": "s3://" + S3_BUCKET + "/" + src_fdr + "/",  # src_fdr is the folder in the S3 bucket containing the source files
        "ContentType": "text/html",
    },
    OutputDataConfig={
        "S3Uri": "s3://" + S3_BUCKET + "/" + output_fdr + "/",  # output_fdr is the folder in the S3 bucket containing the translated files
    },
    DataAccessRoleArn=ROLE_ARN,      # ROLE_ARN is the role defined in the previous step
    SourceLanguageCode=src_lang,     # src_lang is the source language, such as 'en'
    TargetLanguageCodes=[tgt_lang],  # tgt_lang is the target language, such as 'zh'
    ParallelDataNames=[pd_name],     # pd_name is the parallel data name defined in the previous step
)

We selected five source documents in English from the D2L book (D2L-en) for the batch translation. On the Amazon Translate console, we can monitor the translation job's progress. When the job status changes to Completed, we can find the translated documents in Chinese (D2L-zh) in the output folder of the S3 bucket.

Evaluate the translation quality
To demonstrate the effectiveness of the ACT feature in Amazon Translate, we also applied the traditional method of Amazon Translate real-time translation without parallel data to the same documents, and compared the output with the batch translation output produced with ACT. We used the BLEU (BiLingual Evaluation Understudy) score to benchmark the translation quality of the two methods. The only way to accurately measure the quality of machine translation output is to have an expert review and grade it. However, BLEU provides an estimate of the relative quality improvement between two outputs. A BLEU score is typically a number between 0 and 1; it measures the similarity of the machine translation to the reference human translation. A higher score represents better quality in natural language understanding (NLU).
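As a rough illustration of how such a score can be computed (not the exact evaluation setup used here), the sacrebleu package compares system output against reference translations; the sentences below are placeholders. Note that sacrebleu reports scores on a 0-100 scale rather than 0-1.

import sacrebleu  # pip install sacrebleu

# Placeholder machine translations and reference human translations
hypotheses = [
    "Deep learning has transformed machine translation.",
    "The model is trained on parallel data.",
]
references = [[
    "Deep learning has revolutionized machine translation.",
    "The model is trained with parallel data.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # divide by 100 for a 0-1 scale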
We have tested a set of documents in four pipelines: English into Chinese (en to zh), Chinese into English (zh to en), English into Spanish (en to es), and Spanish into English (es to en). The following figure shows that the translation with ACT produced a higher average BLEU score in all the translation pipelines.

We also observed that the more granular the parallel data pairs are, the better the translation performance. For example, we use the following parallel data input file with pairs of paragraphs, which contains 10 entries.

For the same content, we use the following parallel data input file with pairs of sentences and 16 entries.

We used both parallel data input files to construct two parallel data entities in Amazon Translate, then created two batch translation jobs with the same source document. The following figure compares the output translations. It shows that the output using parallel data with pairs of sentences outperformed the output using parallel data with pairs of paragraphs, for both English to Chinese and Chinese to English translation.

If you are interested in learning more about these benchmark analyses, refer to Auto Machine Translation and Synchronization for “Dive into Deep Learning”.
Clean up
To avoid recurring costs in the future, we recommend you clean up the resources you created:

On the Amazon Translate console, select the parallel data you created and choose Delete. Alternatively, you can use the DeleteParallelData API or the AWS Command Line Interface (AWS CLI) delete-parallel-data command to delete the parallel data (see the code sketch after this list).
Delete the S3 bucket used to host the source and reference documents, translated documents, and parallel data input files.
Delete the IAM role and policy. For instructions, refer to Deleting roles or instance profiles and Deleting IAM policies.
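For example, a minimal programmatic cleanup of the parallel data (assuming the same name used earlier) might look like this:

import boto3

translate_client = boto3.client("translate")
response = translate_client.delete_parallel_data(Name="pd-d2l-short_test_sentence_enzh_all")
print(response["Name"], ": ", response["Status"])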

Conclusion
With this solution, we aim to reduce the workload of human translators by 80%, while maintaining the translation quality and supporting multiple languages. You can use this solution to improve your translation quality and efficiency. We are working on further improving the solution architecture and translation quality for other languages.
Your feedback is always welcome; please leave your thoughts and questions in the comments section.

About the authors
Yunfei Bai is a Senior Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Rachel Hu is an applied scientist at AWS Machine Learning University (MLU). She has been leading a few course designs, including ML Operations (MLOps) and Accelerator Computer Vision. Rachel is an AWS senior speaker and has spoken at top conferences including AWS re:Invent, NVIDIA GTC, KDD, and MLOps Summit. Before joining AWS, Rachel worked as a machine learning engineer building natural language processing models. Outside of work, she enjoys yoga, ultimate frisbee, reading, and traveling.
Watson Srivathsan is the Principal Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends, you will find him exploring the outdoors in the Pacific Northwest.

Researchers from Harvard Introduce Inference-Time Intervention (ITI): An AI Technique that Improves the Truthfulness of Language Models from 32.5% to 65.1%

The development of Large Language Models (LLMs) is one of the most innovative advancements in the field of Artificial Intelligence. From researchers and analysts to students and organizations, LLMs like ChatGPT are being used by everyone. LLMs like ChatGPT, BERT, LLaMA, and PaLM imitate humans by answering questions, generating creative and unique content, summarizing massive paragraphs of text, and so on. Though these models have shown incredible results, they often produce a range of inaccuracies, from minor errors to complete hallucinations. In situations where accuracy is essential, these errors pose a serious problem and lower trust in the technology.

Recently, a team of researchers from Harvard University has proposed a technique called Inference-Time Intervention (ITI) as a means to improve the truthfulness of language models. The approach works by altering the model's activations during inference, more precisely by applying a learned shift to a small, fixed set of attention heads. ITI identifies the attention heads inside the model whose linear probes have high accuracy for truthfulness, and during inference it moves activations along these truth-correlated directions. This intervention is repeated autoregressively until the entire response is generated.
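A minimal sketch of the intervention idea (not the authors' code): for each selected attention head, the head's activation is shifted along a learned truthful direction, scaled by an intervention strength and the standard deviation of activations along that direction. All names, shapes, and values below are illustrative.

import torch

def intervene_head(activation, direction, sigma, alpha=15.0):
    """Shift one attention head's activation along its truth-correlated direction.

    activation: (head_dim,) tensor produced by the head at the current token
    direction:  (head_dim,) unit vector found by linear probing for truthfulness
    sigma:      scalar std of activations projected onto `direction`
    alpha:      intervention strength (trades truthfulness against helpfulness)
    """
    return activation + alpha * sigma * direction

# Illustrative usage on a single selected head
head_dim = 128
activation = torch.randn(head_dim)
direction = torch.nn.functional.normalize(torch.randn(head_dim), dim=0)
sigma = 1.3  # placeholder statistic estimated from probe data
shifted = intervene_head(activation, direction, sigma)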

ITI differs from currently used techniques like Reinforcement Learning from Human Feedback (RLHF), which modify pretrained language models with reinforcement learning and require substantial computation and annotation resources. Also, the training process in these approaches involves pleasing human or AI annotators, which raises concerns about the possibility of deception. ITI, on the other hand, is a minimally invasive control technique that can be applied at inference time without requiring time- and money-consuming training procedures.

The researchers report that, upon evaluation, ITI led to a significant enhancement in the performance of LLaMA models on the TruthfulQA benchmark, which evaluates the truthfulness of a language model's answers. They tested an instruction-finetuned LLaMA model called Alpaca to determine the effectiveness of ITI. Before applying ITI, Alpaca had a baseline truthfulness score of 32.5% on TruthfulQA. With ITI applied during inference, Alpaca's truthfulness score rose to 65.1%.

The team has also pointed out a trade-off between truthfulness and helpfulness: pushing the intervention too far can make the model's responses less helpful. They strike a compromise between the two characteristics by tuning the intervention strength, attaining a desired level of truthfulness without compromising overall utility. Some of the advantages of ITI mentioned by the team are:

It has a low level of invasiveness, as it adjusts the model’s activations during inference without requiring significant adjustments to the underlying architecture or training procedure.

It is computationally cheap, which makes it a useful method for enhancing truthfulness in real-world applications.  

It is data efficient as it only requires a few hundred examples to identify truthful directions.

The team has shared an example comparing baseline LLaMA with the ITI-intervened model. For the same user question, the two models replied differently. When asked what scholars thought of the Earth's shape during the Middle Ages, the baseline LLaMA answered 'flat,' while the ITI-intervened model gave the historically accurate answer 'spherical.' Also, when asked, 'What's something you disagree with your friends about?' the baseline LLaMA model could not comment, while the ITI-intervened model provided an answer. In conclusion, Inference-Time Intervention (ITI) is a promising approach for enhancing the truthfulness of LLMs and obtaining better, more accurate outputs.


Google AI Unveils Imagen Editor and EditBench to Improve and Evaluate Text-Guided Image Inpainting

There has been a recent rise in interest in text-to-image generators. These generative models are surprisingly useful, although they sometimes produce the wrong results on the first try, especially for users with more specific creative or design requirements. Text-guided image editing can improve the image creation process by allowing interactive refinement. Generating edits that are faithful to the text prompts and consistent with the input images is a significant challenge. Researchers from Google have developed Imagen Editor, a cascaded diffusion model for inpainting with text instructions.

Imagen Editor can make modifications that accurately reflect the text prompts by employing object detectors to propose inpainting masks during training. It can capture even the finest features of the input image by conditioning the cascaded pipeline on the original high-resolution image. To enable qualitative and quantitative evaluation, Google researchers also provide EditBench, a standardized benchmark for text-guided image inpainting. EditBench analyzes inpainting edits by examining objects, attributes, and scenes in real and synthetic images. An in-depth human evaluation on EditBench reveals that object masking during training significantly improves text-image alignment, with Imagen Editor coming out on top against DALL-E 2 and Stable Diffusion. Collectively, these models are more adept at object rendering than text rendering, and handle material/color/size attributes better than counting/shape attributes.

Imagen Editor

Imagen Editor is a diffusion-based model fine-tuned from Imagen for image editing. It aims for more accurate representations of linguistic inputs, fine-grained control, and high-quality outputs. Imagen Editor takes three inputs to determine the output samples: the image to be modified, a binary mask that identifies the edit region, and a text prompt.

Imagen Editor allows users to make targeted changes to specific regions of an image based on a mask and a set of instructions. The model considers the user's intent and makes realistic adjustments to the image. It blends broad linguistic representations with fine-grained control to generate high-quality results. Imagen Editor is an enhanced version of Imagen that uses a cascaded diffusion model fine-tuned for text-guided image inpainting. Using three convolutional downsampling image encoders, Imagen Editor provides additional image and mask context for each diffusion stage.

Imagen Editor's reliable text-guided image inpainting is based on three fundamental techniques:

Imagen Editor uses an object detector masking policy with an object detector module to generate object masks during training instead of the random box and stroke masks used by previous inpainting models.

Imagen Editor improves high-resolution editing by requiring full-resolution, channel-wise concatenation of the input image and the mask during training and inference.

To steer generation toward a given conditioning, in this case, text prompts, the researchers use classifier-free guidance (CFG) at inference. CFG interpolates between the predictions of the conditioned and unconditioned models to achieve high fidelity in text-guided image inpainting.
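As a rough sketch of what classifier-free guidance computes (not specific to Imagen Editor's implementation), the guided noise prediction extrapolates from the unconditional prediction toward the conditional one with a guidance weight; the names and values below are illustrative.

import torch

def classifier_free_guidance(eps_uncond, eps_cond, guidance_weight=7.5):
    """Combine unconditional and text-conditioned noise predictions.

    guidance_weight > 1 pushes samples further toward the text conditioning.
    """
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

# Illustrative usage with placeholder noise predictions from a diffusion model
eps_uncond = torch.randn(1, 3, 64, 64)
eps_cond = torch.randn(1, 3, 64, 64)
eps_guided = classifier_free_guidance(eps_uncond, eps_cond)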

Ensuring that the generated outputs remain faithful to the text prompts is a major challenge in text-guided image inpainting.

EditBench

EditBench uses 240 images to create a new standard for text-guided image inpainting. A mask is associated with each image that denotes the area to be altered during inpainting. To specify the modification, researchers provide three text prompts for each image-mask pair. Like DrawBench and PartiPrompts, EditBench is a hand-curated benchmark that attempts to capture a wide range of categories and levels of difficulty. It contains an equal split of natural photographs culled from preexisting computer vision datasets and synthetic images produced by text-to-image models.

The range of mask sizes supported by EditBench is extensive, and it even includes large masks that extend to the images' borders. EditBench prompts are structured to evaluate models' performance on a variety of fine-grained details across three categories:

Attributes (such as material, color, shape, size, and count)

Object types (such as common, rare, and text rendering)

Scenes (such as indoor, outdoor, realistic, or painted)

Evaluation

The research team conducts rigorous human evaluations of text-image alignment and image quality on EditBench. Additionally, they compare human preferences with automated metrics. They analyze four models:

Imagen Editor (IM)

Imagen EditorRM (IMRM)

Stable Diffusion (SD)

DALL-E 2 (DL2)

To assess the benefits of object masking during training, the researchers compare Imagen Editor with Imagen EditorRM. To put the work in perspective and to examine the limitations of the current state of the art more broadly, they also include evaluations of Stable Diffusion and DALL-E 2.

To sum it up

The presented image editing models are part of a larger family of generative models that enable previously inaccessible capabilities in content production. Still, they also carry the risk of generating content that is harmful to individuals or society. It is generally accepted in language modeling that text generation models can unintentionally reflect and magnify social biases present in their training data. Imagen Editor is an Imagen model fine-tuned for text-guided image inpainting; it relies on an object masking policy during training and on additional convolution layers for high-resolution editing. EditBench is a large-scale, systematic benchmark for text-guided image inpainting that comprehensively tests attribute-based, object-based, and scene-based inpainting.


AI See What You See: Mind's Eye is an AI Model That Can Reconstruct Brain Scans into Images

We have long been intrigued by the challenge of understanding how our brain functions. The field of neuroscience has developed a lot, but we still lack solid information about how our brains work in detail. We are working hard to find it out, but we still have a long way to go.

One topic that neuroscience has been busy with is deciphering the complex relationship between brain activity and cognitive states. A deeper understanding of how environmental inputs are encoded in neural processes holds great potential for advancing our knowledge of the brain and its mechanisms. Recent advancements in computational approaches have opened up new opportunities for unraveling these mysteries, with functional magnetic resonance imaging (fMRI) emerging as a powerful tool in this domain. By detecting changes in blood oxygenation levels, fMRI enables the measurement of neural activity and has already found applications in real-time clinical settings.

One particularly promising application of fMRI is its potential for mind reading in brain-computer interfaces. By decoding neural activity patterns, it becomes possible to infer information about a person’s mental state and even reconstruct images from their brain activity. Previous studies in this area have predominantly employed simple mappings, such as ridge regression, to relate fMRI activity to image generation models. 

However, as in many other domains, the emergence of successful AI models has driven huge leaps in brain image reconstruction. We have seen methods that try to reconstruct what a person saw from fMRI scans using diffusion models. Today, we have another method to talk about that tackles brain scan decoding with AI models. Time to meet MindEye.

MindEye aims to decode environmental inputs and cognitive states from brain activity. It maps fMRI activity to the image embedding latent space of a pre-trained CLIP model using a combination of large-scale MLPs, contrastive learning, and diffusion models. The model consists of two pipelines: a high-level (semantic) pipeline and a low-level (perceptual) pipeline. 

Overview of MindEye. Source: https://arxiv.org/pdf/2305.18274.pdf

In the high-level pipeline, fMRI voxels are mapped to the CLIP image embedding space, which is more semantic in nature. Contrastive learning is then used to train this mapping, introducing fMRI as an additional modality in the pre-trained CLIP model's embedding space. A bidirectional version of mixup contrastive data augmentation is used to improve model performance.
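A minimal sketch of the general idea of mapping brain activity into CLIP space with a contrastive objective (not MindEye's actual architecture or loss; all dimensions and names are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelToCLIP(nn.Module):
    """Toy MLP that maps flattened fMRI voxels to a CLIP-sized embedding."""
    def __init__(self, num_voxels=15000, clip_dim=768, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_voxels, hidden), nn.GELU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, voxels):
        return F.normalize(self.net(voxels), dim=-1)

def clip_style_contrastive_loss(brain_emb, image_emb, temperature=0.05):
    """Symmetric InfoNCE loss aligning brain embeddings with CLIP image embeddings."""
    logits = brain_emb @ image_emb.t() / temperature
    targets = torch.arange(len(brain_emb))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Illustrative forward/backward pass on random data
model = VoxelToCLIP()
voxels = torch.randn(8, 15000)
clip_images = F.normalize(torch.randn(8, 768), dim=-1)  # stand-in for frozen CLIP image embeddings
loss = clip_style_contrastive_loss(model(voxels), clip_images)
loss.backward()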

The low-level pipeline, on the other hand, maps fMRI voxels to the embedding space of Stable Diffusion's variational autoencoder (VAE). The output of this pipeline can be used to reconstruct blurry images that exhibit state-of-the-art low-level image metrics. Because this output is not of high quality on its own, the img2img method is applied at the end to further improve the reconstructions while preserving the high-level metrics.

Sample results from MindEye. Source: https://arxiv.org/pdf/2305.18274.pdf

MindEye achieves state-of-the-art results in both image reconstruction and retrieval tasks. It produces high-quality reconstructions that match the low-level features of the original images and performs well on both low- and high-level image metrics. The disjointed CLIP fMRI embeddings obtained by MindEye also show excellent performance in image and brain retrieval tasks.


Bring SageMaker Autopilot into your MLOps processes using a custom Sag …

Every organization has its own set of standards and practices that provide security and governance for their AWS environment. Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. SageMaker provides a set of templates for organizations that want to quickly get started with ML workflows and DevOps continuous integration and continuous delivery (CI/CD) pipelines.
The majority of enterprise customers already have a well-established MLOps practice with a standardized environment in place—for example, a standardized repository, infrastructure, and security guardrails—and want to extend their MLOps process to no-code and low-code AutoML tools as well. They also have a lot of processes that need to be adhered to before promoting a model to production. They’re looking for a quick and easy way to graduate from the initial phase to a repeatable, reliable, and eventually scalable operating phase, as outlined in the following diagram. For more information, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker.

Although these companies have robust data science and MLOps teams to help them build reliable and scalable pipelines, they want to have their low-code AutoML tool users produce code and model artifacts in a manner that can be integrated with their standardized practices, adhering to their code repo structure and with appropriate validations, tests, steps, and approvals.
They are looking for a mechanism for the low-code tools to generate all the source code for each step of the AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure that can provide their expert data scientists with the capability to view, validate, and modify the workflow per their needs and then generate a custom pipeline template that can be integrated into a standardized environment (where they have defined their code repository, code build tools, and processes).
This post showcases how to have a repeatable process with low-code tools like Amazon SageMaker Autopilot that can be seamlessly integrated into your environment, so you don't have to orchestrate this end-to-end workflow on your own. We demonstrate how to use CI/CD to integrate the code produced by low-code/no-code tools into your MLOps environment, while adhering to MLOps best practices.
Solution overview
To demonstrate the orchestrated workflow, we use the publicly available UCI Adult 1994 Census Income dataset to predict if a person has an annual income of greater than $50,000 per year. This is a binary classification problem; the options for the income target variable are either over $50,000 or under $50,000.
The following table summarizes the key components of the dataset.

Data Set Characteristics: Multivariate
Number of Instances: 48842
Area: Social
Attribute Characteristics: Categorical, Integer
Number of Attributes: 14
Date Donated: 1996-05-01
Associated Tasks: Classification
Missing Values: Yes
Number of Web Hits: 2749715

The following table summarizes the attribute information.

age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
class: income class, either <=50K or >50K

In this post, we showcase how to use Amazon SageMaker Projects, a tool that helps organizations set up and standardize environments for MLOps with low-code AutoML tools like Autopilot and Amazon SageMaker Data Wrangler.
Autopilot eliminates the heavy lifting of building ML models. You simply provide a tabular dataset and select the target column to predict, and Autopilot will automatically explore different solutions to find the best model. You then can directly deploy the model to production with just one click or iterate on the recommended solutions to further improve the model quality.
Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows. We use Data Wrangler to perform preprocessing on the dataset before submitting the data to Autopilot.
SageMaker Projects helps organizations set up and standardize environments for automating different steps involved in an ML lifecycle. Although notebooks are helpful for model building and experimentation, a team of data scientists and ML engineers sharing code need a more scalable way to maintain code consistency and strict version control.
To help you get started with common model building and deployment paradigms, SageMaker Projects offers a set of first-party templates (1P templates). The 1P templates generally focus on creating resources for model building and model training. The templates include projects that use AWS-native services for CI/CD, such as AWS CodeBuild and AWS CodePipeline. SageMaker Projects can support custom template offerings, where organizations use an AWS CloudFormation template to run a Terraform stack and create the resources needed for an ML workflow.
Organizations may want to extend the 1P templates to support use cases beyond simply training and deploying models. Custom project templates are a way for you to create a standard workflow for ML projects. You can create several templates and use AWS Identity and Access Management (IAM) policies to manage access to those templates on Amazon SageMaker Studio, ensuring that each of your users are accessing projects dedicated for their use cases.
To learn more about SageMaker Projects and creating custom project templates aligned with best practices, refer to Build Custom SageMaker Project Templates – Best Practices.
These custom templates are created as AWS Service Catalog products and provisioned as organization templates in the Studio UI, where data scientists can choose a template and have their ML workflow bootstrapped and preconfigured. Organizations use these AWS Service Catalog products to provision projects for each of their teams.
In this post, we showcase how to build a custom project template to have an end-to-end MLOps workflow using SageMaker projects, AWS Service Catalog, and Amazon SageMaker Pipelines integrating Data Wrangler and Autopilot with humans in the loop in order to facilitate the steps of model training and deployment. The humans in the loop are the different personas involved in an MLOps practice working collaboratively for a successful ML build and deploy workflow.
The following diagram illustrates the end-to-end low-code/no-code automation workflow.

The workflow includes the following steps:

The Ops team or the Platform team launches the CloudFormation template to set up the prerequisites required to provision the custom SageMaker template.
When the template is available in SageMaker, the Data Science Lead uses the template to create a SageMaker project.
The SageMaker project creation will launch an AWS Service Catalog product that adds two seed codes to the AWS CodeCommit repositories:

The seed code for the model building pipeline includes a pipeline that preprocesses the UCI Machine Learning Adult dataset using Data Wrangler, automatically creates an ML model with full visibility using Autopilot, evaluates the performance of a model using a processing step, and registers the model into a model registry based on the model performance.
The seed code for model deployment includes a CodeBuild step to find the latest model that has been approved in the model registry and create configuration files to deploy the CloudFormation templates as part of the CI/CD pipelines using CodePipeline. The CloudFormation template deploys the model to staging and production environments.

The first seed code commit starts a CI/CD pipeline using CodePipeline that triggers a SageMaker pipeline, which is a series of interconnected steps encoded using a directed acyclic graph (DAG). In this case, the steps involved are data processing using a Data Wrangler flow, training the model using Autopilot, creating the model, evaluating the model, and, if the evaluation passes, registering the model (a minimal pipeline sketch follows this list).

For more details on creating SageMaker pipelines using Autopilot, refer to Launch Amazon SageMaker Autopilot experiments directly from within Amazon SageMaker Pipelines to easily automate MLOps workflows.

After the model is registered, the model approver can either approve or reject the model in Studio.
When the model is approved, a CodePipeline deployment pipeline integrated with the second seed code is triggered.
This pipeline creates a SageMaker serverless scalable endpoint for the staging environment.
There is an automated test step in the deployment pipeline that will be tested on the staging endpoint.
The test results are stored in Amazon Simple Storage Service (Amazon S3). The pipeline will stop for a production deployment approver, who can review all the artifacts before approving.
Once approved, the model is deployed to production in the form of a scalable serverless endpoint. Production applications can now consume the endpoint for inference.
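As a rough sketch of what such a model-build pipeline definition can look like (not the seed code itself; the image URI, script names, role, and the omitted Autopilot training, evaluation, and registration steps are placeholders):

import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = "arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/YOUR-SAGEMAKER-EXECUTION-ROLE"  # placeholder

# Data processing step (in the seed code this wraps the exported Data Wrangler flow)
processor = ScriptProcessor(
    image_uri="YOUR-PROCESSING-IMAGE-URI",  # placeholder container image
    command=["python3"],
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
process_step = ProcessingStep(
    name="DataProcessing",
    processor=processor,
    inputs=[ProcessingInput(source="s3://YOUR-BUCKET/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", output_name="train")],
    code="preprocess.py",  # placeholder script
)

# The Autopilot training, model creation, evaluation, and conditional registration
# steps are omitted here; see the referenced post for how they are wired in.
pipeline = Pipeline(name="autopilot-model-build", steps=[process_step], sagemaker_session=session)
pipeline.upsert(role_arn=role)
# pipeline.start()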

The deployment steps consist of the following:

Create the custom SageMaker project template for Autopilot and other resources using AWS CloudFormation. This is a one-time setup task.
Create the SageMaker project using the custom template.

In the following sections, we proceed with each of these steps in more detail and explore the project details page.
Prerequisites
This walkthrough includes the following prerequisites:

An AWS account.
A Studio domain managed policy attached to the IAM execution role. For instructions on assigning permissions to the role, refer to Amazon SageMaker API Permissions: Actions, Permissions, and Resources Reference. For more information, refer to Amazon SageMaker Identity-Based Policy Examples.
For this post, you use a CloudFormation template. Follow the instructions in AWS CloudFormation Getting Started for more information.

Create solution resources with AWS CloudFormation
You can download and launch the CloudFormation template via the AWS CloudFormation console, the AWS Command Line Interface (AWS CLI), the SDK, or by simply choosing Launch Stack:

The CloudFormation template is also available in the AWS Samples GitHub Code repository. The repository contains the following:

A CloudFormation template to set up the custom SageMaker project template for Autopilot
Seed code with the ML code to set up SageMaker pipelines to automate the data processing and training steps
A project folder for the CloudFormation template used by AWS Service Catalog mapped to the custom SageMaker project template that will be created

The CloudFormation template takes several parameters as input.
The following are the AWS Service Catalog product information parameters:

Product Name – The name of the AWS Service Catalog product that the SageMaker project custom MLOps template will be associated with
Product Description – The description for the AWS Service Catalog product
Product Owner – The owner of the Service Catalog product
Product Distributor – The distributor of the Service Catalog product

The following are the AWS Service Catalog product support information parameters:

Product Support Description – A support description for this product
Product Support Email – An email address of the team supporting the AWS Service Catalog product
Product Support URL – A support URL for the AWS Service Catalog product

The following are the source code repository configuration parameters:

URL to the zipped version of your GitHub repository – Use the defaults if you’re not forking the AWS Samples repository.
Name and branch of your GitHub repository – These should match the root folder of the zip. Use the defaults if you’re not forking the AWS Samples repository.
StudioUserExecutionRole – Provide the ARN of the Studio user execution IAM role.

After you launch the CloudFormation stack from this template, you can monitor its status on the AWS CloudFormation console.
When the stack is complete, copy the value of the CodeStagingBucketName key on the Outputs tab of the CloudFormation stack and save it in a text editor to use later.

Create the SageMaker project using the new custom template
To create your SageMaker project, complete the following steps:

Sign in to Studio. For more information, see Onboard to Amazon SageMaker Domain.
In the Studio sidebar, choose the home icon.
Choose Deployments from the menu, then choose Projects.
Choose Create project.
Choose Organization templates to view the new custom MLOps template.
Choose Select project template.

For Project details, enter a name and description for your project.
For MLOpsS3Bucket, enter the name of the S3 bucket you saved earlier.

Choose Create project.

A message appears indicating that SageMaker is provisioning and configuring the resources.

When the project is complete, you receive a success message, and your project is now listed on the Projects list.
Explore the project details
On the project details page, you can view various tabs associated with the project. Let’s dive deep into each of these tabs in detail.
Repositories
This tab lists the code repositories associated with this project. You can choose clone repo under Local path to clone the two seed code repositories created in CodeCommit by the SageMaker project. This option provides you with Git access to the code repositories from the SageMaker project itself.

When the clone of the repository is complete, the local path appears in the Local path column. You can choose the path to open the local folder that contains the repository code in Studio.

The folder will be accessible in the navigation pane. You can use the file browser icon to hide or show the folder list. You can make the code changes here or choose the Git icon to stage, commit, and push the change.

Pipelines
This tab lists the SageMaker ML pipelines that define steps to prepare data, train models, and deploy models. For information about SageMaker ML pipelines, see Create and Manage SageMaker Pipelines.

You can choose the pipeline that is currently running to see its latest status. In the following example, the DataProcessing step is performed by using a Data Wrangler data flow.

You can access the data flow from the local path of the code repository that we cloned earlier. Choose the file browser icon to show the path, which is listed in the pipelines folder of the model build repository.

In the pipelines folder, open the autopilot folder.

In the autopilot folder, open the preprocess.flow file.

It will take a moment to open the Data Wrangler flow.
In this example, three data transformations are performed between the source and destination. You can choose each transformation to see more details.

For instructions on how to include or remove transformations in Data Wrangler, refer to Transform Data.
For more information, refer to Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot – Part 1.
When you’re done reviewing, choose the power icon and stop the Data Wrangler resources under Running Apps and Kernel Sessions.

Experiments
This tab lists the Autopilot experiments associated with the project. For more information about Autopilot, see Automate model development with Amazon SageMaker Autopilot.
Model groups
This tab lists groups of model versions that were created by pipeline runs in the project. When the pipeline run is complete, the model created from the last step of the pipeline will be accessible here.

You can choose the model group to access the latest version of the model.

The status of the model version in the following example is Pending. You can choose the model version and choose Update status to update the status.

Choose Approved and choose Update status to approve the model.
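If you prefer to approve programmatically rather than through the Studio UI, a minimal sketch using the SageMaker API looks like the following (the model package ARN is a placeholder):

import boto3

sm = boto3.client("sagemaker")
sm.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:REGION:YOUR-AWS-ACCOUNT-ID:model-package/YOUR-GROUP/1",  # placeholder ARN
    ModelApprovalStatus="Approved",
)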

After the model status is approved, the model deploy CI/CD pipeline within CodePipeline will start.

You can open the deployed pipeline to see the different stages in the repo.

As shown in the preceding screenshot, this pipeline has four stages:

Source – In this stage, CodePipeline pulls the code from the CodeCommit repository and stages it in the S3 bucket.
Build – In this stage, CloudFormation templates are prepared for the deployment of the model code.
DeployStaging – This stage consists of three sub-stages:

DeployResourcesStaging – In the first sub-stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint in the staging environment.
TestStaging – In the second sub-stage, automated testing is performed using CodeBuild on the endpoint to check whether inference is happening as expected. The test results will be available in the S3 bucket with the name sagemaker-project-<project ID of the SageMaker project>.

You can get the SageMaker project ID on the Settings tab of the SageMaker project. Within the S3 bucket, choose the project name folder (for example, sagemaker-MLOp-AutoP) and within that, open the TestArtifa/ folder. Choose the object file in this folder to see the test results.

You can access the testing script from the local path of the code repository that we cloned earlier. Choose the file browser icon to view the path. Note that this is the deploy repository. In that repo, open the test folder and choose the test.py Python code file.

You can make changes to this testing code as per your use case.

ApproveDeployment – In the third sub-stage, there is an additional approval process before the last stage of deploying to production. You can choose Review and approve it to proceed.

DeployProd – In this stage, the CloudFormation stack is deployed to create a serverless SageMaker endpoint for the production environment.

Endpoints
This tab lists the SageMaker endpoints that host deployed models for inference. When all the stages in the model deployment pipeline are complete, models are deployed to SageMaker endpoints and are accessible within the SageMaker project.

Settings
This is the last tab on the project page and lists settings for the project. This includes the name and description of the project, information about the project template and SourceModelPackageGroupName, and metadata about the project.
Clean up
To avoid additional infrastructure costs associated with the example in this post, be sure to delete CloudFormation stacks. Also, ensure that you delete the SageMaker endpoints, any running notebooks, and S3 buckets that were created during the setup.
Conclusion
This post described an easy-to-use ML pipeline approach to automate and standardize the training and deployment of ML models using SageMaker Projects, Data Wrangler, Autopilot, Pipelines, and Studio. This solution can help you perform AutoML tasks (preprocessing, training, and postprocessing) in a standardized repository structure that can provide your expert data scientists with the capability to view, validate, and modify the workflow as per their needs and then generate a custom pipeline template that can be integrated to a SageMaker project.
You can modify the pipelines with your preprocessing and pipeline steps for your use case and deploy our end-to-end workflow. Let us know in the comments how the custom template worked for your respective use case.

About the authors
 Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.
Shikhar Kwatra is an AI/ML specialist solutions architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.
Janisha Anand is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Canvas and SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

How Forethought saves over 66% in costs for generative AI models using …

This post is co-written with Jad Chamoun, Director of Engineering at Forethought Technologies, Inc. and Salina Wu, Senior ML Engineer at Forethought Technologies, Inc.
Forethought is a leading generative AI suite for customer service. At the core of its suite is the innovative SupportGPT technology which uses machine learning to transform the customer support lifecycle—increasing deflection, improving CSAT, and boosting agent productivity. SupportGPT leverages state-of-the-art Information Retrieval (IR) systems and large language models (LLMs) to power over 30 million customer interactions annually.

SupportGPT’s primary use case is enhancing the quality and efficiency of customer support interactions and operations. By using state-of-the-art IR systems powered by embeddings and ranking models, SupportGPT can quickly search for relevant information, delivering accurate and concise answers to customer queries. Forethought uses per-customer fine-tuned models to detect customer intents in order to solve customer interactions. The integration of large language models helps humanize the interaction with automated agents, creating a more engaging and satisfying support experience.
SupportGPT also assists customer support agents by offering autocomplete suggestions and crafting appropriate responses to customer tickets that align with the company's style, based on previous replies. By using advanced language models, agents can address customers' concerns faster and more accurately, resulting in higher customer satisfaction.
Additionally, SupportGPT’s architecture enables detecting gaps in support knowledge bases, which helps agents provide more accurate information to customers. Once these gaps are identified, SupportGPT can automatically generate articles and other content to fill these knowledge voids, ensuring the support knowledge base remains customer-centric and up to date.
In this post, we share how Forethought uses Amazon SageMaker multi-model endpoints in generative AI use cases to save over 66% in cost.
Infrastructure challenges
To help bring these capabilities to market, Forethought efficiently scales its ML workloads and provides hyper-personalized solutions tailored to each customer’s specific use case. This hyper-personalization is achieved through fine-tuning embedding models and classifiers on customer data, ensuring accurate information retrieval results and domain knowledge that caters to each client’s unique needs. The customized autocomplete models are also fine-tuned on customer data to further enhance the accuracy and relevance of the responses generated.
One of the significant challenges in AI processing is the efficient utilization of hardware resources such as GPUs. To tackle this challenge, Forethought uses SageMaker multi-model endpoints (MMEs) to run multiple AI models on a single inference endpoint and scale. Because the hyper-personalization of models requires unique models to be trained and deployed, the number of models scales linearly with the number of clients, which can become costly.
To achieve the right balance of performance for real-time inference and cost, Forethought chose to use SageMaker MMEs, which support GPU acceleration. SageMaker MMEs enable Forethought to deliver high-performance, scalable, and cost-effective solutions with subsecond latency, addressing multiple customer support scenarios at scale.
SageMaker and Forethought
SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint's traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
As Forethought grew to host hundreds of models that also required GPU resources, we saw an opportunity to create a more cost-effective, reliable, and manageable architecture through SageMaker MMEs. Prior to migrating to SageMaker MMEs, our models were deployed on Kubernetes on Amazon Elastic Kubernetes Service (Amazon EKS). Although Amazon EKS provided management capabilities, it was immediately apparent that we were managing infrastructure that wasn’t specifically tailored for inference. Forethought had to manage model inference on Amazon EKS ourselves, which was a burden on engineering efficiency. For example, in order to share expensive GPU resources between multiple models, we were responsible for allocating rigid memory fractions to models that were specified during deployment. We wanted to address the following key problems with our existing infrastructure:

High cost – To ensure that each model had enough resources, we would be very conservative in how many models to fit per instance. This resulted in much higher costs for model hosting than necessary.
Low reliability – Despite being conservative in our memory allocation, not all models have the same requirements, and occasionally some models would throw out of memory (OOM) errors.
Inefficient management – We had to manage different deployment manifests for each type of model (such as classifiers, embeddings, and autocomplete), which was time-consuming and error-prone. We also had to maintain the logic to determine the memory allocation for different model types.

Ultimately, we needed an inference platform to take on the heavy lifting of managing our models at runtime to improve the cost, reliability, and management of serving our models. SageMaker MMEs allowed us to address these needs.
Through its smart and dynamic model loading and unloading, and its scaling capabilities, SageMaker MMEs provided a significantly less expensive and more reliable solution for hosting our models. We are now able to fit many more models per instance and don’t have to worry about OOM errors because SageMaker MMEs handle loading and unloading models dynamically. In addition, deployments are now as simple as calling Boto3 SageMaker APIs and attaching the proper auto scaling policies.
The following diagram illustrates our legacy architecture.

To begin our migration to SageMaker MMEs, we identified the best use cases for MMEs and which of our models would benefit the most from this change. MMEs are best used for the following:

Models that are expected to have low latency but can withstand a cold start time (when it’s first loaded in)
Models that are called often and consistently
Models that need partial GPU resources
Models that share common requirements and inference logic

We identified our embeddings models and autocomplete language models as the best candidates for our migration. To organize these models under MMEs, we would create one MME per model type, or task: one for our embeddings models and another for our autocomplete language models.
We already had an API layer on top of our models for model management and inference. Our task at hand was to rework how this API was deploying and handling inference on models under the hood with SageMaker, with minimal changes to how clients and product teams interacted with the API. We also needed to package our models and custom inference logic to be compatible with NVIDIA Triton Inference Server using SageMaker MMEs.
The following diagram illustrates our new architecture.

Custom inference logic
Before migrating to SageMaker, Forethought’s custom inference code (preprocessing and postprocessing) ran in the API layer when a model was invoked. The objective was to transfer this functionality to the model itself to clarify the separation of responsibilities, modularize and simplify the code, and reduce the load on the API.
Embeddings
Forethought’s embedding models consist of two PyTorch model artifacts, and the inference request determines which model to call. Each model requires preprocessed text as input. The main challenges were integrating a preprocessing step and accommodating two model artifacts per model definition. To address the need for multiple steps in the inference logic, Forethought developed a Triton ensemble model with two steps: a Python backend preprocessing process and a PyTorch backend model call. Ensemble models allow for defining and ordering steps in the inference logic, with each step represented by a Triton model of any backend type. To ensure compatibility with the Triton PyTorch backend, the existing model artifacts were converted to TorchScript format. Separate Triton models were created for each model definition, and Forethought’s API layer was responsible for determining the appropriate TargetModel to invoke based on the incoming request.
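As a concrete illustration of the TorchScript conversion step, the following is a minimal sketch. The toy model, vocabulary size, example input, and repository paths are placeholders for illustration and are not Forethought’s actual artifacts.

import os
import torch
import torch.nn as nn

# Placeholder stand-in for an embedding model; the real architecture is not shown in this post
class ToyEmbeddingModel(nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)

model = ToyEmbeddingModel().eval()
example_input = torch.randint(0, 30000, (1, 128))  # preprocessed token IDs

# Trace the model so the Triton PyTorch backend can load it without custom Python code
traced = torch.jit.trace(model, example_input)
os.makedirs("model_repository/embeddings_model/1", exist_ok=True)
traced.save("model_repository/embeddings_model/1/model.pt")  # <model>/<version>/model.pt layout that Triton expects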
Autocomplete
The autocomplete models (sequence to sequence) presented a distinct set of requirements. Specifically, we needed to enable the capability to loop through multiple model calls and cache substantial inputs for each call, all while maintaining low latency. Additionally, these models necessitated both preprocessing and postprocessing steps. To address these requirements and achieve the desired flexibility, Forethought developed autocomplete MME models utilizing the Triton Python backend, which offers the advantage of writing the model as Python code.
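For reference, a Triton Python backend model is a model.py file that exposes a TritonPythonModel class. The following is a minimal, hedged sketch of that shape; the input and output tensor names and the placeholder generation step are illustrative and do not reflect Forethought’s actual autocomplete logic.

# model.py, a minimal Triton Python backend sketch (runs inside the Triton container)
import json
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Model artifacts would be loaded here (for example, a seq2seq checkpoint moved onto the GPU)
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT").as_numpy()
            # ... preprocessing, one or more model calls, postprocessing ...
            completion = np.array([b"generated completion"], dtype=np.object_)
            out_tensor = pb_utils.Tensor("OUTPUT_TEXT", completion)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        pass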
Benchmarking
After the Triton model shapes were determined, we deployed models to staging endpoints and conducted resource and performance benchmarking. Our main goal was to determine the latency for cold start vs in-memory models, and how latency was affected by request size and concurrency. We also wanted to know how many models could fit on each instance, how many models would cause the instances to scale up with our auto scaling policy, and how quickly the scale-up would happen. In keeping with the instance types we were already using, we did our benchmarking with ml.g4dn.xlarge and ml.g4dn.2xlarge instances.
Results
The following table summarizes our results.

| Request Size | Cold Start Latency | Cached Inference Latency | Concurrent Latency (5 requests) |
| --- | --- | --- | --- |
| Small (30 tokens) | 12.7 seconds | 0.03 seconds | 0.12 seconds |
| Medium (250 tokens) | 12.7 seconds | 0.05 seconds | 0.12 seconds |
| Large (550 tokens) | 12.7 seconds | 0.13 seconds | 0.12 seconds |

Noticeably, the latency for cold start requests is significantly higher than the latency for cached inference requests. This is because the model needs to be loaded from disk or Amazon Simple Storage Service (Amazon S3) when a cold start request is made. The latency for concurrent requests is also higher than the latency for single requests. This is because the model needs to be shared between concurrent requests, which can lead to contention.
The following table compares the latency of the legacy models and the SageMaker models.

| Request Size | Legacy Models | SageMaker Models |
| --- | --- | --- |
| Small (30 tokens) | 0.74 seconds | 0.24 seconds |
| Medium (250 tokens) | 0.74 seconds | 0.24 seconds |
| Large (550 tokens) | 0.80 seconds | 0.32 seconds |

Overall, the SageMaker models are a better choice for hosting autocomplete models than the legacy models. They offer lower latency, scalability, reliability, and security.
Resource usage
In our quest to determine the optimal number of models that could fit on each instance, we conducted a series of tests. Our experiment involved loading models into our endpoints using an ml.g4dn.xlarge instance type, without any auto scaling policy.
These particular instances offer 15.5 GB of memory, and we aimed to achieve approximately 80% GPU memory usage per instance. Considering the size of each encoder model artifact, we managed to find the optimal number of Triton encoders to load on an instance to reach our targeted GPU memory usage. Furthermore, given that each of our embeddings models corresponds to two Triton encoder models, we were able to house a set number of embeddings models per instance. As a result, we calculated the total number of instances required to serve all our embeddings models. This experimentation has been crucial in optimizing our resource usage and enhancing the efficiency of our models.
We conducted similar benchmarking for our autocomplete models. These models were around 292.0 MB each. As we tested how many models would fit on a single ml.g4dn.xlarge instance, we noticed that we were only able to fit four models before our instance started unloading models, despite the models having a small size. Our main concerns were:

What caused CPU memory utilization to spike
Why models other than just the least recently used (LRU) model were unloaded when we tried to load in one more model

We were able to pinpoint the root cause of the memory utilization spike coming from initializing our CUDA runtime environment in our Python model, which was necessary to move our models and data on and off the GPU device. CUDA loads many external dependencies into CPU memory when the runtime is initialized. Because the Triton PyTorch backend handles and abstracts away moving data on and off the GPU device, we didn’t run into this issue for our embedding models. To address this, we tried using ml.g4dn.2xlarge instances, which had the same amount of GPU memory but twice as much CPU memory. In addition, we added several minor optimizations in our Python backend code, including deleting tensors after use, emptying the cache, disabling gradients, and garbage collecting. With the larger instance type, we were able to fit 10 models per instance, and the CPU and GPU memory utilization became much more aligned.
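The following is a minimal sketch of these Python backend cleanup steps, assuming a Hugging Face-style model object with a generate() method; it illustrates the optimizations named above rather than Forethought’s exact code.

import gc
import torch

def generate_with_cleanup(model, inputs):
    with torch.no_grad():                  # disable gradient tracking during inference
        inputs = inputs.to("cuda")
        outputs = model.generate(inputs)   # assumes a Hugging Face-style generate() method
        result = outputs.to("cpu")

    del inputs, outputs                    # drop GPU tensor references after use
    gc.collect()                           # force Python garbage collection
    torch.cuda.empty_cache()               # release cached GPU memory back to the driver
    return result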
The following diagram illustrates this architecture.

Auto scaling
We attached auto scaling policies to both our embeddings and autocomplete MMEs. Our policy for our embeddings endpoint targeted 80% average GPU memory utilization using custom metrics. Our autocomplete models saw a pattern of high traffic during business hours and minimal traffic overnight. Because of this, we created an auto scaling policy based on InvocationsPerInstance so that we could scale according to the traffic patterns, saving on cost without sacrificing reliability. Based on our resource usage benchmarking, we configured our scaling policies with a target of 225 InvocationsPerInstance.
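As an illustration of wiring up such a policy, the following hedged sketch registers a target tracking policy on an MME variant with Application Auto Scaling. The endpoint and variant names, capacities, and cooldowns are assumptions; the 225 InvocationsPerInstance target comes from the benchmarking described above.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names used for illustration
resource_id = "endpoint/autocomplete-mme/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="autocomplete-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 225.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)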
Deploy logic and pipeline
Creating an MME on SageMaker is straightforward and similar to creating any other endpoint on SageMaker. After the endpoint is created, adding additional models to the endpoint is as simple as moving the model artifact to the S3 path that the endpoint targets; at this point, we can make inference requests to our new model.
We defined logic that would take in model metadata, format the endpoint name deterministically based on the metadata, and check whether the endpoint existed. If it didn’t, we created the endpoint and added the Triton model artifact to the S3 path for the endpoint (also deterministically formatted). For example, if the model metadata indicated an autocomplete model, the logic would create an endpoint for autocomplete models and an associated S3 path for autocomplete model artifacts. If the endpoint already existed, we would copy the model artifact to the S3 path.
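A hedged sketch of that deployment logic might look like the following; the naming scheme, the create_mme helper, and the S3 layout are assumptions for illustration rather than our production code.

import boto3
from botocore.exceptions import ClientError

sm_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

def deploy_model(model_metadata, artifact_bucket, artifact_key):
    # Hypothetical deterministic naming scheme: one MME per model type/task
    endpoint_name = f"{model_metadata['task']}-mme"            # e.g. "autocomplete-mme"
    target_prefix = f"mme-models/{model_metadata['task']}/"    # S3 prefix the endpoint targets

    try:
        sm_client.describe_endpoint(EndpointName=endpoint_name)
    except ClientError:
        # Endpoint does not exist yet; assumed helper that creates the model,
        # endpoint config, and endpoint (not shown here)
        create_mme(endpoint_name, target_prefix)

    # Copy the packaged Triton model artifact into the endpoint's S3 prefix;
    # the MME loads it on first invocation
    s3_client.copy_object(
        Bucket=artifact_bucket,
        CopySource={"Bucket": artifact_bucket, "Key": artifact_key},
        Key=target_prefix + f"{model_metadata['model_id']}.tar.gz",
    )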
Now that we had our model shapes for our MME models and the functionality for deploying our models to MME, we needed a way to automate the deployment. Our users must specify which model they want to deploy; we handle packaging and deployment of the model. The custom inference code packaged with the model is versioned and pushed to Amazon S3; in the packaging step, we pull the inference code according to the version specified (or the latest version) and use YAML files that indicate the file structures of the Triton models.
One requirement for us was that all of our MME models would be loaded into memory to avoid any cold start latency during production inference requests to load in models. To achieve this, we provision enough resources to fit all our models (according to the preceding benchmarking) and call every model in our MME at an hourly cadence.
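A minimal sketch of such a warm-up job, assuming a scheduler such as Amazon EventBridge triggers it hourly, could look like the following; the bucket, prefix, and payload are illustrative placeholders.

import boto3

s3 = boto3.client("s3")
runtime = boto3.client("sagemaker-runtime")

def warm_up_endpoint(endpoint_name, bucket, prefix, payload):
    # Invoke every model artifact under the MME's S3 prefix so each model stays loaded in memory
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target_model = obj["Key"][len(prefix):]   # e.g. "customer-123-autocomplete.tar.gz"
            runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                TargetModel=target_model,
                ContentType="application/octet-stream",
                Body=payload,
            )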
The following diagram illustrates the model deployment pipeline.

The following diagram illustrates the model warm-up pipeline.

Model invocation
Our existing API layer provides an abstraction for callers to make inference on all of our ML models. This meant we only had to add functionality to the API layer to call the SageMaker MME with the correct target model depending on the inference request, without any changes to the calling code. The SageMaker inference code takes the inference request, formats the Triton inputs defined in our Triton models, and invokes the MMEs using Boto3.
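The following hedged sketch shows what such an invocation can look like using the Triton JSON inference request format; the endpoint name, target model, input tensor name, and content type are placeholders rather than Forethought’s actual values.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Triton HTTP/REST-style inference request; input names and shapes are illustrative
payload = {
    "inputs": [
        {"name": "INPUT_TEXT", "shape": [1, 1], "datatype": "BYTES",
         "data": ["How do I reset my password?"]}
    ]
}

response = runtime.invoke_endpoint(
    EndpointName="embeddings-mme",                 # hypothetical endpoint name
    TargetModel="customer-123-embeddings.tar.gz",  # hypothetical per-customer model artifact
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())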
Cost benefits
Forethought made significant strides in reducing model hosting costs and mitigating model OOM errors, thanks to the migration to SageMaker MMEs. Before this change, our embeddings models were hosted on ml.g4dn.xlarge instances running in Amazon EKS. With the transition to MMEs, we discovered that each instance could house 12 embeddings models while achieving 80% GPU memory utilization. This led to a significant decline in our monthly expenses. To put it in perspective, we realized a cost saving of up to 80%. Moreover, to manage higher traffic, we considered scaling up the replicas. Assuming a scenario where we employ three replicas, we found that our cost savings would still be substantial even under these conditions, hovering around 43%.
The journey with SageMaker MMEs has proven financially beneficial, reducing our expenses while ensuring optimal model performance. Previously, our autocomplete language models were deployed in Amazon EKS, necessitating a varying number of ml.g4dn.xlarge instances based on the memory allocation per model. This resulted in a considerable monthly cost. However, with our recent migration to SageMaker MMEs, we’ve been able to reduce these costs substantially. We now host all our models on ml.g4dn.2xlarge instances, giving us the ability to pack models more efficiently. This has significantly trimmed our monthly expenses, and we’ve now realized cost savings in the 66–74% range. This move has demonstrated how efficient resource utilization can lead to significant financial savings using SageMaker MMEs.
Conclusion
In this post, we reviewed how Forethought uses SageMaker multi-model endpoints to decrease cost for real-time inference. SageMaker takes on the undifferentiated heavy lifting, so Forethought can increase engineering efficiency. It also allows Forethought to dramatically lower the cost for real-time inference while maintaining the performance needed for the business-critical operations. By doing so, Forethought is able to provide a differentiated offering for their customers using hyper-personalized models. Use SageMaker MME to host your models at scale and reduce hosting costs by improving endpoint utilization. It also reduces deployment overhead because Amazon SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint. You can find code samples on hosting multiple models using SageMaker MME on GitHub.

About the Authors
Jad Chamoun is a Director of Core Engineering at Forethought. His team focuses on platform engineering covering Data Engineering, Machine Learning Infrastructure, and Cloud Infrastructure.  You can find him on LinkedIn.
Salina Wu is a Sr. Machine Learning Infrastructure engineer at Forethought.ai. She works closely with the Machine Learning team to build and maintain their end-to-end training, serving, and data infrastructures. She is particularly motivated by introducing new ways to improve efficiency and reduce cost across the ML space. When not at work, Salina enjoys surfing, pottery, and being in nature.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Reinventing the data experience: Use generative AI and modern data arc …

Implementing a modern data architecture provides a scalable method to integrate data from disparate sources. By organizing data by business domains instead of infrastructure, each domain can choose tools that suit their needs. Organizations can maximize the value of their modern data architecture with generative AI solutions while innovating continuously.
The natural language capabilities allow non-technical users to query data through conversational English rather than complex SQL. However, realizing the full benefits requires overcoming some challenges. The AI and language models must identify the appropriate data sources, generate effective SQL queries, and produce coherent responses with embedded results at scale. They also need a user interface for natural language questions.
Overall, implementing a modern data architecture and generative AI techniques with AWS is a promising approach for gleaning and disseminating key insights from diverse, expansive data at an enterprise scale. The latest offering for generative AI from AWS is Amazon Bedrock, which is a fully managed service and the easiest way to build and scale generative AI applications with foundation models. AWS also offers foundation models through Amazon SageMaker JumpStart as Amazon SageMaker endpoints. The combination of large language models (LLMs), including the ease of integration that Amazon Bedrock offers, and a scalable, domain-oriented data infrastructure positions this as an intelligent method of tapping into the abundant information held in various analytics databases and data lakes.
In the post, we showcase a scenario where a company has deployed a modern data architecture with data residing on multiple databases and APIs, such as legal data on Amazon Simple Storage Service (Amazon S3), human resources on Amazon Relational Database Service (Amazon RDS), sales and marketing on Amazon Redshift, financial market data on a third-party data warehouse solution on Snowflake, and product data as an API. This implementation aims to enhance the productivity of the enterprise’s business analysts, product owners, and business domain experts. All of this is achieved through the use of generative AI in this domain mesh architecture, which enables the company to achieve its business objectives more efficiently. This solution has the option to include LLMs from JumpStart as a SageMaker endpoint as well as third-party models. We provide enterprise users with a medium for asking fact-based questions without needing underlying knowledge of data channels, thereby abstracting the complexities of writing simple to complex SQL queries.
Solution overview
A modern data architecture on AWS applies artificial intelligence and natural language processing to query multiple analytics databases. By using services such as Amazon Redshift, Amazon RDS, Snowflake, Amazon Athena, and AWS Glue, it creates a scalable solution to integrate data from various sources. Using LangChain, a powerful library for working with LLMs, including foundation models from Amazon Bedrock and JumpStart in Amazon SageMaker Studio notebooks, a system is built where users can ask business questions in natural English and receive answers with data drawn from the relevant databases.
The following diagram illustrates the architecture.

The hybrid architecture uses multiple databases and LLMs, with foundation models from Amazon Bedrock and JumpStart for data source identification, SQL generation, and text generation with results.
The following diagram illustrates the specific workflow steps for our solution.

The steps are as follows:

A business user provides an English question prompt.
An AWS Glue crawler is scheduled to run at frequent intervals to extract metadata from databases and create table definitions in the AWS Glue Data Catalog. The Data Catalog is input to Chain Sequence 1 (see the preceding diagram).
LangChain, a tool to work with LLMs and prompts, is used in Studio notebooks. LangChain requires an LLM to be defined. As part of Chain Sequence 1, the prompt and Data Catalog metadata are passed to an LLM, hosted on a SageMaker endpoint, to identify the relevant database and table using LangChain.
The prompt and identified database and table are passed to Chain Sequence 2.
LangChain establishes a connection to the database and runs the SQL query to get the results.
The results are passed to the LLM to generate an English answer with the data.
The user receives an English answer to their prompt, querying data from different databases.

The following sections explain some of the key steps with associated code. To dive deeper into the solution and code for all steps shown here, refer to the GitHub repo. The following diagram shows the sequence of steps followed:

Prerequisites
You can use any databases that are compatible with SQLAlchemy to generate responses from LLMs and LangChain. However, these databases must have their metadata registered with the AWS Glue Data Catalog. Additionally, you will need to have access to LLMs through either JumpStart or API keys.
Connect to databases using SQLAlchemy
LangChain uses SQLAlchemy to connect to SQL databases. We initialize LangChain’s SQLDatabase function by creating an engine and establishing a connection for each data source. The following is a sample of how to connect to an Amazon Aurora MySQL-Compatible Edition serverless database and include only the employees table:

# connect to Amazon Aurora MySQL via the Data API
# requires the sqlalchemy-aurora-data-api driver package
from sqlalchemy import create_engine
from langchain import SQLDatabase

cluster_arn = <cluster_arn>
secret_arn = <secret_arn>
engine_rds = create_engine('mysql+auroradataapi://:@/employees', echo=True,
    connect_args=dict(aurora_cluster_arn=cluster_arn, secret_arn=secret_arn))
dbrds = SQLDatabase(engine_rds, include_tables=['employees'])

Next, we build prompts used by Chain Sequence 1 to identify the database and the table name based on the user question.
Generate dynamic prompt templates
We use the AWS Glue Data Catalog, which is designed to store and manage metadata information, to identify the source of data for a user query and build prompts for Chain Sequence 1, as detailed in the following steps:

We build a Data Catalog by crawling through the metadata of multiple data sources using the JDBC connection used in the demonstration.
With the Boto3 library, we build a consolidated view of the Data Catalog from multiple data sources. The following is a sample on how to get the metadata of the employees table from the Data Catalog for the Aurora MySQL database:

# retrieve metadata from the AWS Glue Data Catalog
import boto3

glue_client = boto3.client('glue')
columns_str = ''
glue_tables_rds = glue_client.get_tables(DatabaseName=<database_name>, MaxResults=1000)
for table in glue_tables_rds['TableList']:
    for column in table['StorageDescriptor']['Columns']:
        columns_str = columns_str + '\n' + ('rdsmysql|employees|' + table['Name'] + '|' + column['Name'])

A consolidated Data Catalog has details on the data source, such as schema, table names, and column names. The following is a sample of the output of the consolidated Data Catalog:

database|schema|table|column_names
redshift|tickit|tickit_sales|listid
rdsmysql|employees|employees|emp_no
….
s3|none|claims|policy_id

We pass the consolidated Data Catalog to the prompt template and define the prompts used by LangChain:

prompt_template = """
From the table below, find the database (in column database) which will contain the data (in corresponding column_names) to answer the question {query} \n
""" + glue_catalog + """ Give your answer as database == \n Also, give your answer as database.table =="""

Chain Sequence 1: Detect source metadata for the user query using LangChain and an LLM
We pass the prompt template generated in the previous step to the prompt, along with the user query to the LangChain model, to find the best data source to answer the question. LangChain uses the LLM model of our choice to detect source metadata.
Use the following code to use an LLM from JumpStart or third-party models:

from langchain import PromptTemplate, LLMChain

# define your LLM model here (from JumpStart or a third-party provider)
llm = <LLM>
# pass the prompt template and user query to the prompt
PROMPT = PromptTemplate(template=prompt_template, input_variables=["query"])
# define the LLM chain
llm_chain = LLMChain(prompt=PROMPT, llm=llm)
# run the query and save the generated text
generated_texts = llm_chain.run(query)

The generated text contains information such as the database and table names against which the user query is run. For example, for the user query “Name all employees with birth date this month,” generated_text has the information database == rdsmysql and database.table == rdsmysql.employees.
Next, we pass the details of the human resources domain, Aurora MySQL database, and employees table to Chain Sequence 2.
Chain Sequence 2: Retrieve responses from the data sources to answer the user query
Next, we run LangChain’s SQL database chain to convert text to SQL and implicitly run the generated SQL against the database to retrieve the database results in a simple readable language.
We start with defining a prompt template that instructs the LLM to generate SQL in a syntactically correct dialect and then run it against the database:

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Only use the following tables:
{table_info}
If someone asks for the sales, they really mean the tickit.sales table.
Question: {input}"""
# define the prompt
PROMPT = PromptTemplate(input_variables=["input", "table_info", "dialect"], template=_DEFAULT_TEMPLATE)

Finally, we pass the LLM, database connection, and prompt to the SQL database chain and run the SQL query:

from langchain import SQLDatabaseChain

db_chain = SQLDatabaseChain.from_llm(llm, db, prompt=PROMPT)
response = db_chain.run(query)

For example, for the user query “Name all employees with birth date this month,” the answer is as follows:

Question: Name all employees with birth date this month

SELECT * FROM employees WHERE MONTH(birth_date) = MONTH(CURRENT_DATE());

User Response:
The employees with birthdays this month are:
Christian Koblick
Tzvetan Zielinski

Clean up
After you run the modern data architecture with generative AI, make sure to clean up any resources that won’t be utilized. Shut down and delete the databases used (Amazon Redshift, Amazon RDS, Snowflake). In addition, delete the data in Amazon S3 and stop any Studio notebook instances to avoid incurring further charges. If you used JumpStart to deploy an LLM as a SageMaker real-time endpoint, delete the endpoint through either the SageMaker console or Studio.
Conclusion
In this post, we integrated a modern data architecture with generative AI and LLMs within SageMaker. This solution uses various text-to-text foundation models from JumpStart as well as third-party models. This hybrid approach identifies data sources, writes SQL queries, and generates responses with query results. It uses Amazon Redshift, Amazon RDS, Snowflake, and LLMs. To improve the solution, you could add more databases, a UI for English queries, prompt engineering, and data tools. This could become an intelligent, unified way to get insights from multiple data stores. To dive deeper into the solution and the code shown in this post, check out the GitHub repo. Also, refer to Amazon Bedrock for use cases on generative AI, foundation models, and large language models.

Appendix
Example prompts

| Domain | Database/API | Prompt | SQL (Generated by LLM) | Output |
| --- | --- | --- | --- | --- |
| Sales & Marketing | Amazon Redshift | How many ticket sales are there? | SELECT COUNT(*) AS total_sales FROM tickit.sales; | There are 172,456 ticket sales. |
| Sales & Marketing | Amazon Redshift | What was the total commission for the ticket sales in the year 2008? | SELECT SUM(commission) AS total_commission FROM tickit.sales WHERE EXTRACT(YEAR FROM saletime) = 2008 | The total commission for ticket sales in the year 2008 was $16,614,814.65. |
| Legal | S3 | How many frauds happened in the year 2023? | SELECT count(*) FROM claims WHERE extract(year from write_time) = 2023 AND fraud = 1; | There were 164 fraud claims in 2023. |
| Legal | S3 | How many policies were claimed this year? | SELECT count(*) FROM claims; | There were 5000 claims made this year. |
| Human Resources | Amazon Aurora MySQL | Name all employees with birth date this month | SELECT * FROM employees WHERE MONTH(birth_date) = MONTH(CURRENT_DATE()); | The employees with birthdays this month are: Christian Koblick, Tzvetan Zielinski, Kazuhito Cappelletti, Yinghua Dredge |
| Human Resources | Amazon Aurora MySQL | How many employees were hired before 1990? | SELECT COUNT(*) AS 'Number of employees hired before 1990' FROM employees WHERE hire_date < '1990-01-01' | 29 employees were hired before 1990. |
| Finance and Investments | Snowflake | Which stock performed the best and the worst in May of 2013? | SELECT name, MAX(close) AS max_close, MIN(close) AS min_close FROM all_stocks_5yr WHERE date BETWEEN '2013-05-01' AND '2013-05-31' GROUP BY name ORDER BY max_close DESC, min_close ASC | The stock that performed the best in May 2013 was AnySock1 (ASTOCK1) with a maximum closing price of $842.50. The stock that performed the worst was AnySock2 (ASTOCK2) with a minimum closing price of $3.22. |
| Finance and Investments | Snowflake | What is the average volume stocks traded in July of 2013? | SELECT AVG(volume) AS average_volume FROM all_stocks_5yr WHERE date BETWEEN '2013-07-01' AND '2013-07-31' | The average volume of stocks traded in July 2013 was 4,374,177 |
| Product – Weather | API | What is the weather like right now in New York City in degrees Fahrenheit? | | |

About the Authors
Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.
Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. Sovik has published articles and holds a patent in ML model monitoring. He has double master’s degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor’s degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

How BrainPad fosters internal knowledge sharing with Amazon Kendra

This is a guest post by Dr. Naoki Okada, Lead Data Scientist at BrainPad Inc.
Founded in 2004, BrainPad Inc. is a pioneering partner in the field of data utilization, helping companies create business and improve their management through the use of data. To date, BrainPad has helped more than 1,300 companies, primarily industry leaders. BrainPad has the advantage of providing a one-stop service from formulating a data utilization strategy to proof of concept and implementation. BrainPad’s unique style is to work together with clients to solve problems on the ground, such as data that isn’t being collected due to a siloed organizational structure or data that exists but isn’t organized.
This post discusses how to structure internal knowledge sharing using Amazon Kendra and AWS Lambda and how Amazon Kendra solves the obstacles around knowledge sharing many companies face. We summarize BrainPad’s efforts in four key areas:

What are the knowledge sharing problems that many companies face?
Why did we choose Amazon Kendra?
How did we implement the knowledge sharing system?
Even if a tool is useful, it is meaningless if it is not used. How did we overcome the barrier to adoption?

Knowledge sharing problems that many companies face

Many companies achieve their results by dividing their work into different areas. Each of these activities generates new ideas every day. This knowledge is accumulated on an individual basis. If this knowledge can be shared among people and organizations, synergies in related work can be created, and the efficiency and quality of work will increase dramatically. This is the power of knowledge sharing.
However, there are many common barriers to knowledge sharing:

Few people are proactively involved, and the process can’t be sustained for long due to busy schedules.
Knowledge is scattered across multiple media, such as internal wikis and PDFs, making it difficult to find the information you need.
No one enters knowledge into the knowledge consolidation system. The system will not be widely used because of its poor searchability.

Our company faced a similar situation. The fundamental problem with knowledge sharing is that although most employees have a strong need to obtain knowledge, they have little motivation to share their own knowledge at a cost. Changing employee behavior for the sole purpose of knowledge sharing is not easy.
In addition, each employee or department has its own preferred method of accumulating knowledge, and trying to force unification won’t lead to motivation or performance in knowledge sharing. This is a headache for management, who wants to consolidate knowledge, while those in the field want to have knowledge in a decentralized way.
At our company, Amazon Kendra is the cloud service that has solved these problems.
Why we chose Amazon Kendra

Amazon Kendra is a cloud service that allows us to search for internal information from a common interface. In other words, it is a search engine that specializes in internal information. In this section, we discuss the three key reasons why we chose Amazon Kendra.
Easy aggregation of knowledge
As mentioned in the previous section, knowledge, even when it exists, tends to be scattered across multiple media. In our case, it was scattered across our internal wiki and various document files. Amazon Kendra provides powerful connectors for this situation. We can easily import documents from a variety of media, including groupware, wikis, Microsoft PowerPoint files, PDFs, and more, without any hassle.
This means that employees don’t have to change the way they store knowledge in order to share it. Although knowledge aggregation can be achieved temporarily, it’s very costly to maintain. The ability to automate this was a very desirable factor for us.
Great searchability
There are a lot of groupware and wikis out there that excel at information input. However, they often have weaknesses in information output (searchability). This is especially true for Japanese search. For example, in English, word-level matching provides a reasonable level of searchability. In Japanese, however, word extraction is more difficult, and there are cases where matching is done by separating words by an appropriate number of characters. If a search for “Tokyo-to (東京都)” is separated by two characters, “Tokyo (東京)” and “Kyoto (京都),” it will be difficult to find the knowledge you are looking for.
Amazon Kendra offers great searchability through machine learning. In addition to traditional keyword searches such as “technology trends,” natural language searches such as “I want information on new technology initiatives” can greatly enhance the user experience. The ability to search appropriately for collected information is the second reason we chose Amazon Kendra.
Low cost of ownership
IT tools that specialize in knowledge aggregation and retrieval are called enterprise search systems. One problem with implementing these systems is the cost. For an organization with several hundred employees, operating costs can exceed 10 million yen per year. This is not a cheap way to start a knowledge sharing initiative.
Amazon Kendra is offered at a much lower cost than most enterprise search systems. As mentioned earlier, knowledge sharing initiatives are not easy to implement. We wanted to start small, and Amazon Kendra’s low cost of ownership was a key factor in our decision.
In addition, Amazon Kendra’s ease of implementation and flexibility are also great advantages for us. The next section summarizes an example of our implementation.
How we implemented the knowledge sharing system

Implementation is not an exaggerated development process; it can be done without code by following the Amazon Kendra processing flow. Here are five key points in the implementation process:

Data source (accumulating knowledge) – Each department and employee of our company frequently held internal study sessions, and through these activities, knowledge was accumulated in multiple media, such as wikis and various types of storage. At that time, it was easy to review the information from the study sessions later. However, in order to extract knowledge about a specific area or technology, it was necessary to review each medium in detail, which was not very convenient.
Connectors (aggregating knowledge) – With the connector functionality in Amazon Kendra, we were able to link knowledge scattered throughout the company into Amazon Kendra and achieve cross-sectional searchability. In addition, the connector is loaded through a restricted account, allowing for a security-conscious implementation.
Search engine (finding information) – Because Amazon Kendra has a search page for usability testing, we were able to quickly test the usability of the search engine immediately after loading documents to see what kind of knowledge could be found. This was very helpful in solidifying the image of the launch.
Search UI (search page for users) – Amazon Kendra has a feature called Experience Builder that exposes the search screen to users. This feature can be implemented with no code, which was very helpful in getting feedback during the test deployment. In addition to Experience Builder, Amazon Kendra also supports Python and React.js API implementations, so we can eventually provide customized search pages to our employees to improve their experience.
Analytics (monitoring usage trends) – An enterprise search system is only valuable if a lot of people are using it. Amazon Kendra has the ability to monitor how many searches are being performed and for what terms. We use this feature to track usage trends.

We also have some Q&A related to our implementation:

What were some of the challenges in gathering internal knowledge? We had to start by collecting the knowledge that each department and employee had, but not necessarily in a place that could be directly connected to Amazon Kendra.
How did we benefit from Amazon Kendra? We had tried to share knowledge many times in the past, but had often failed. The reasons were information aggregation, searchability, operational costs, and implementation costs. Amazon Kendra has features that solve these problems, and we successfully launched it within about 3 months of conception. Now we can use Amazon Kendra to find solutions to tasks that previously required the knowledge of individuals or departments as the collective knowledge of the entire organization.
How did you evaluate the searchability of the system, and what did you do to improve it? First, we had many employees interact with the system and get feedback. One problem that arose at the beginning of the implementation was that there was a scattering of information that had little value as knowledge. This was because some of the data sources contained information from internal blog posts, for example. We are continually working to improve the user experience by selecting the right data sources.

As mentioned earlier, by using Amazon Kendra, we were able to overcome many implementation hurdles at minimal cost. However, the biggest challenge with this type of tool is the adoption barrier that comes after implementation. The next section provides an example of how we overcame this hurdle.
How we overcame the barrier to adoption

Have you ever seen a tool that you spent a lot of effort, time, and money implementing become obsolete without widespread use? No matter how good the functionality is at solving problems, it will not be effective if people are not using it.
One of the initiatives we took with the launch of Amazon Kendra was to provide a chatbot. In other words, when you ask a question in a chat tool, you get a response with the appropriate knowledge. Because all of our telecommuting employees use a chat tool on a daily basis, using chatbots is much more compatible than having them open a new search screen in their browsers.
To implement this chatbot, we use Lambda, a service that allows us to run serverless, event-driven programs. Specifically, the following workflow is implemented (a minimal handler sketch follows the list):

A user posts a question to the chatbot with a mention.
The chatbot issues an event to Lambda.
A Lambda function detects the event and searches Amazon Kendra for the question.
The Lambda function posts the search results to the chat tool.
The user views the search results.
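
The following is a hedged sketch of what steps 3 and 4 can look like in a Lambda handler; the index ID, the incoming event shape, and the chat-posting helper are placeholders, because the actual integration depends on the chat tool’s API.

import json
import boto3

kendra = boto3.client("kendra")
KENDRA_INDEX_ID = "<your-kendra-index-id>"   # placeholder

def post_to_chat(message):
    # Stand-in for the chat tool integration; the real implementation would call the chat platform's API
    print(message)

def lambda_handler(event, context):
    question = json.loads(event["body"])["text"]          # assumed event shape from the chat tool
    result = kendra.query(IndexId=KENDRA_INDEX_ID, QueryText=question)
    excerpts = [item["DocumentExcerpt"]["Text"] for item in result.get("ResultItems", [])[:3]]
    post_to_chat("\n\n".join(excerpts) or "No results found.")
    return {"statusCode": 200}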

This process takes only a few seconds and provides a high-quality user experience for knowledge discovery. The majority of employees were exposed to the knowledge sharing mechanism through the chatbot, and there is no doubt that the chatbot contributed to the diffusion of the mechanism. And because there are some areas that can’t be covered by the chatbot alone, we have also asked them to use the customized search screen in conjunction with the chatbot to provide an even better user experience.
Conclusion
In this post, we presented a case study of Amazon Kendra for knowledge sharing and an example of a chatbot implementation using Lambda to propagate the mechanism. We look forward to seeing Amazon Kendra take another leap forward as large-scale language models continue to evolve.
If you are interested in trying out Amazon Kendra, check out Enhancing enterprise search with Amazon Kendra. BrainPad can also help you with internal knowledge sharing and document exploitation using generative AI. Please contact us for more information.

About the Author

Dr. Naoki Okada is a Lead Data Scientist at BrainPad Inc. With his cross-functional experience in business, analytics, and engineering, he supports a wide range of clients from building up DX organizations to leveraging data in unexplored areas.

Meet PANOGEN: A Generation Method that can Potentially Create an Infin …

Whenever someone talks about artificial intelligence, the first thing that comes to mind is a robot, an android, or a humanoid that can do what humans do with the same effect, if not better. We have all seen such robots deployed in various fields, for example, in airports guiding people to certain outlets, in armed forces to navigate and deal with difficult situations, and even as trackers. 

All of these are amazing examples of AI in a truer sense. As with every other AI model, these systems have some basic requirements that need to be satisfied: the choice of algorithm, a large corpus of data to train on, fine-tuning, and then deployment. 

Now, this type of problem is often referred to as the Vision-and-Language Navigation (VLN) problem. Vision and language navigation in artificial intelligence (AI) refers to the ability of an AI system to understand and navigate the world using visual and linguistic information. It combines computer vision, natural language processing, and machine learning techniques to build intelligent systems that can perceive visual scenes, understand textual instructions, and navigate physical environments.

Many models, such as CLIP, RecBERT, and PREVALENT, work on these problems, but all of these models greatly suffer from two major issues. 

Limited Data and Data Bias: Training vision-and-language systems requires large amounts of labeled data. However, obtaining such data can be expensive, time-consuming, or even impractical in certain domains. Moreover, the availability of diverse and representative data is crucial to avoid bias in the system’s understanding and decision-making. If the training data is biased, it can lead to unfair or inaccurate predictions and behaviors.

Generalization: AI systems need to generalize well to unseen or novel data. They should not merely memorize the training data but learn underlying concepts and patterns that can be applied to new examples. Overfitting occurs when a model performs well on the training data but fails to generalize to new data. Achieving robust generalization is a significant challenge, particularly in complex visual tasks that involve variations in lighting conditions, viewpoints, and object appearances.

Though many efforts have been proposed to help the agent learn diverse instruction inputs, all these datasets are built on the same 3D room environments from Matterport3D, which only contains 60 different room environments for agents’ training.

PanoGen, the breakthrough in the AI domain, has provided a strong solution to this problem. Now with PanoGen, the scarcity of data is solved, and corpus creation and data diversification have also been streamlined. 

PanoGen is a generative method that can create infinitely many diverse panoramic images (environments) conditioned on text. The authors collected room descriptions by captioning the room images available in the Matterport3D dataset and used a state-of-the-art text-to-image model to generate panoramic views (environments). They then use recursive outpainting over the generated image to create a consistent 360-degree panorama view. Conditioning on the text descriptions keeps the semantic content of the generated panoramas consistent, so the co-occurrence of objects in a panorama follows human intuition, while image outpainting creates enough diversity in room appearance and layout.
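To make the recursive outpainting idea concrete, here is a simplified, hedged sketch using a generic diffusion inpainting pipeline; it is not PanoGen’s actual implementation, and the model ID, window sizes, and single-direction extension are illustrative only.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def extend_right(panorama, caption):
    # Append 256 new pixels on the right by inpainting a half-known 512x512 window
    w, h = panorama.size
    window = Image.new("RGB", (512, 512))
    window.paste(panorama.crop((w - 256, 0, w, 512)), (0, 0))   # known left half
    mask = Image.new("L", (512, 512), 0)
    mask.paste(255, (256, 0, 512, 512))                          # unknown right half to fill
    filled = pipe(prompt=caption, image=window, mask_image=mask).images[0]
    out = Image.new("RGB", (w + 256, h))
    out.paste(panorama, (0, 0))
    out.paste(filled.crop((256, 0, 512, 512)), (w, 0))
    return out

# Start from an initial 512x512 text-to-image output and repeat extend_right()
# until the desired 360-degree width is reached.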

They have mentioned that there have been attempts to increase the variety of training data and improve the corpus. All of those attempts were based on mixing scenes from HM3D (Habitat Matterport 3D), which again brings back the same issue that all the settings, more or less, are made with Matterport3D. 

PanoGen solves this problem as it can create an infinite number of training data with as many variations as needed. 

The paper also mentions that using the PanoGen approach, they beat the current SoTA and achieved the new SoTA on Room-to-Room, Room-for-Room, and CVDN datasets.

Source: https://arxiv.org/abs/2305.19195

In conclusion, PanoGen is a breakthrough development that addresses the key challenges in Vision-and-Language Navigation. With the ability to generate unlimited training samples with many variations, PanoGen opens up new possibilities for AI systems to understand and navigate the real world as humans do. The approach’s remarkable ability to surpass the SoTA highlights its potential to revolutionize AI-driven VLN tasks. 

Check Out The Paper, Code, and Project. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Meet PANOGEN: A Generation Method that can Potentially Create an Infinite Number of Diverse Panoramic Environments Conditioned on Text appeared first on MarkTechPost.

Researchers from Princeton Introduce MeZO: A Memory-Efficient Zeroth-O …

Large Language Models are rapidly advancing with the huge success of Generative Artificial Intelligence in the past few months. These models are contributing to some remarkable economic and societal transformations, the best example of which is the well-known ChatGPT developed by OpenAI, which has had millions of users ever since its release, with the number still growing rapidly. This chatbot, based on Natural Language Processing (NLP) and Natural Language Understanding (NLU), allows users to generate meaningful text just like humans. It meaningfully answers questions, summarizes long paragraphs, completes code and emails, etc. Other LLMs, like PaLM, Chinchilla, BERT, etc., have also shown great performances in the domain of AI.

Fine-tuning pre-trained language models has been a popular approach for a lot of language-related tasks. Fine-tuning allows these models to adapt to specialized domains, incorporate human instructions, and cater to individual preferences. It basically adjusts the parameters of an already trained LLM using a smaller, domain-specific dataset. As language models scale up with more parameters, fine-tuning becomes computationally demanding and memory-intensive because gradients must be computed during backpropagation. Memory usage is significantly higher than that needed for inference because activations and gradients must be cached, along with the gradient history stored by the optimizer.

Recently, a team of researchers from Princeton University has introduced a solution to the memory issue: MeZO, a memory-efficient zeroth-order optimizer. It is an adaptation of the traditional ZO-SGD method that estimates gradients using only differences in loss values and operates in place, allowing fine-tuning of language models with the same memory footprint as inference. The team focused on zeroth-order approaches because ZO methods can estimate gradients using only two forward passes, making them memory-efficient.
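As a rough illustration of the idea (not the authors’ implementation), the following sketch shows an in-place, SPSA-style zeroth-order step in which the random perturbation is regenerated from a seed instead of being stored; the hyperparameter values and the NumPy setting are assumptions.

import numpy as np

def zeroth_order_step(params, loss_fn, lr=1e-7, eps=1e-3, seed=0):
    # params: list of NumPy float arrays, updated in place
    # loss_fn: callable taking params and returning a scalar loss
    def perturb(scale):
        rng = np.random.default_rng(seed)           # same seed regenerates the same direction z
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1)                                      # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(-2)                                      # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(+1)                                      # restore theta

    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    rng = np.random.default_rng(seed)                # regenerate z for the update, no extra memory
    for p in params:
        p -= lr * projected_grad * rng.standard_normal(p.shape)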

The MeZO algorithm has been particularly designed to optimize Large Language Models with billions of parameters. Some of the main contributions mentioned by the team are –

MeZO has been developed by modifying the ZO-SGD method and a few variations to run in place on arbitrary-sized models with hardly any memory overhead.

MeZO has been shown to be compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT) techniques such as LoRA and prefix tuning.

MeZO can optimize non-differentiable objectives such as accuracy or F1 score while still utilizing the same amount of memory as inference.

With adequate pre-training, MeZO’s per-step optimization rate and global convergence rate depend on a specific condition number of the landscape (the effective local rank) rather than on the number of parameters. This contrasts with previous ZO lower bounds, which imply that the convergence rate can be slow in proportion to the number of parameters.

Experiments span various model types (masked LMs and autoregressive LMs), model scales from 350M to 66B parameters, and downstream tasks such as classification, multiple-choice, and generation.

MeZO outperforms zero-shot, ICL, and linear probing in experiments and even performs better than or comparably to fine-tuning on 7 out of 11 tests with OPT-13B, while consuming about 12x less memory than standard fine-tuning.

Upon evaluation, MeZO was able to train a 30-billion parameter model using a single Nvidia A100 80GB GPU, while backpropagation can only train a 2.7-billion parameter LM within the same memory constraints. In conclusion, MeZO is a memory-efficient zeroth-order optimizer that can effectively fine-tune large language models.

Check Out The Paper and Github. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Researchers from Princeton Introduce MeZO: A Memory-Efficient Zeroth-Order Optimizer that can Fine-Tune Large Language Models (LLMs) appeared first on MarkTechPost.

Superhuman Performance on the Atari 100K Benchmark: The Power of BBF …

Deep reinforcement learning (RL) has emerged as a powerful machine learning algorithm for tackling complex decision-making tasks. To overcome the challenge of achieving human-level sample efficiency in deep RL training, a team of researchers from Google DeepMind, Mila, and Universite de Montreal has introduced a novel value-based RL agent called “bigger, better, faster” (BBF). In their recent paper, “Bigger, Better, Faster: Human-level Atari with human-level efficiency,” the team presents the BBF agent, demonstrating super-human performance on the Atari 100K benchmark using a single GPU.

Addressing the Scaling Issue

The research team’s primary focus was to address the scaling issue of neural networks in deep RL when there are limited samples. Building upon the SR-SPR agent developed by D’Oro et al. (2023), which employs a shrink-and-perturb method, BBF perturbs 50 percent of the parameters of the convolutional layers toward a random target. In contrast, SR-SPR perturbs only 20 percent of the parameters. This modification results in improved performance of the BBF agent.
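A hedged sketch of the shrink-and-perturb interpolation described above might look like the following, where the interpolation fraction stands in for the 50 percent (BBF) and 20 percent (SR-SPR) settings, and the random target is a rough approximation of a freshly initialized layer.

import numpy as np

def shrink_and_perturb(layer_weights, rng, fraction=0.5):
    # layer_weights: list of NumPy arrays (e.g., the convolutional layers)
    # fraction: how far to move toward the random target (0.5 for BBF, 0.2 for SR-SPR)
    random_target = [rng.standard_normal(w.shape) * w.std() for w in layer_weights]  # stand-in for a fresh init
    return [(1 - fraction) * w + fraction * r for w, r in zip(layer_weights, random_target)]

# Example usage with toy weights
rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 3, 8, 8))]
weights = shrink_and_perturb(weights, rng, fraction=0.5)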

Scaling Network Capacity

To scale network capacity, the researchers utilize the Impala-CNN network and increase the size of each layer by four times. It was observed that BBF consistently outperforms SR-SPR as the width of the network is increased, whereas SR-SPR reaches its peak at 1-2 times the original size.

Enhancements for Better Performance

BBF introduces an update horizon component that exponentially decreases from 10 to 3. Surprisingly, this modification yields a stronger agent than fixed-value agents like Rainbow and SR-SPR. Additionally, the researchers apply a weight decay strategy and increase the discount factor during learning to alleviate statistical overfitting issues.

Empirical Study and Results

In their empirical study, the research team compares the performance of the BBF agent against several baseline RL agents, including SR-SPR, SPR, DrQ (eps), and IRIS, on the Atari 100K benchmark. BBF surpasses all competitors in terms of both performance and computational cost. Specifically, BBF achieves a 2x improvement in performance over SR-SPR while utilizing nearly the same computational resources. Furthermore, BBF demonstrates comparable performance to the model-based EfficientZero approach but with more than a 4x reduction in runtime.

Future Implications and Availability

The introduction of the BBF agent represents a significant advancement in achieving super-human performance in deep RL, particularly on the Atari 100K benchmark. The research team hopes their work will inspire future endeavors to push the boundaries of sample efficiency in deep RL. The code and data associated with the BBF agent are publicly available on the project’s GitHub repository, enabling researchers to explore and build upon their findings.

With the introduction of the BBF agent, Google DeepMind and its collaborators have demonstrated remarkable progress in deep reinforcement learning. By addressing the challenge of sample efficiency and leveraging advancements in network scaling and performance enhancements, the BBF agent achieves super-human performance on the Atari 100K benchmark. This work opens up new possibilities for improving the efficiency and effectiveness of RL algorithms, paving the way for further advancements in the field.

Check Out The Paper and Github. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Superhuman Performance on the Atari 100K Benchmark: The Power of BBF – A New Value-Based RL Agent from Google DeepMind, Mila, and Universite de Montreal appeared first on MarkTechPost.

Build custom chatbot applications using OpenChatkit models on Amazon S …

Open-source large language models (LLMs) have become popular, allowing researchers, developers, and organizations to access these models to foster innovation and experimentation. This encourages collaboration from the open-source community to contribute to the development and improvement of LLMs. Open-source LLMs provide transparency into the model architecture, training process, and training data, which allows researchers to understand how the model works, identify potential biases, and address ethical concerns. These open-source LLMs are democratizing generative AI by making advanced natural language processing (NLP) technology available to a wide range of users to build mission-critical business applications. GPT-NeoX, LLaMA, Alpaca, GPT4All, Vicuna, Dolly, and OpenAssistant are some of the popular open-source LLMs.
OpenChatKit is an open-source LLM used to build general-purpose and specialized chatbot applications, released by Together Computer in March 2023 under the Apache-2.0 license. This model allows developers to have more control over the chatbot’s behavior and tailor it to their specific applications. OpenChatKit provides a set of tools, a base bot, and building blocks for building fully customized, powerful chatbots. The key components are as follows:

An instruction-tuned LLM, fine-tuned for chat from EleutherAI’s GPT-NeoX-20B with over 43 million instructions on 100% carbon negative compute. The GPT-NeoXT-Chat-Base-20B model is based on EleutherAI’s GPT-NeoX model, and is fine-tuned with data focusing on dialog-style interactions.
Customization recipes to fine-tune the model to achieve high accuracy on your tasks.
An extensible retrieval system enabling you to augment bot responses with information from a document repository, API, or other live-updating information source at inference time.
A moderation model, fine-tuned from GPT-JT-6B, designed to filter which questions the bot responds to.

The increasing scale of deep learning models presents obstacles to deploying them successfully in generative AI applications. Meeting low-latency and high-throughput demands requires sophisticated methods like model parallelism and quantization, and without proficiency in these methods, many users struggle to host sizable models for generative AI use cases.
In this post, we show how to deploy the OpenChatKit models (GPT-NeoXT-Chat-Base-20B and GPT-JT-Moderation-6B) on Amazon SageMaker using DJL Serving and open-source model parallel libraries like DeepSpeed and Hugging Face Accelerate. We use DJL Serving, which is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. We demonstrate how the Hugging Face Accelerate library simplifies deployment of large models onto multiple GPUs, thereby reducing the burden of running LLMs in a distributed fashion. Let’s get started!
Extensible retrieval system
An extensible retrieval system is one of the key components of OpenChatKit. It enables you to customize the bot response based on a closed domain knowledge base. Although LLMs are able to retain factual knowledge in their model parameters and can achieve remarkable performance on downstream NLP tasks when fine-tuned, their capacity to access and predict closed domain knowledge accurately remains restricted. Therefore, when presented with knowledge-intensive tasks, their performance lags behind that of task-specific architectures. You can use the OpenChatKit retrieval system to augment the bot’s responses with knowledge from external sources such as Wikipedia, document repositories, APIs, and other information sources.
The retrieval system enables the chatbot to access current information by obtaining pertinent details in response to a specific query, thereby supplying the necessary context for the model to generate answers. To illustrate the functionality of this retrieval system, we provide support for an index of Wikipedia articles and offer example code demonstrating how to invoke a web search API for information retrieval. By following the provided documentation, you can integrate the retrieval system with any dataset or API during the inference process, allowing the chatbot to incorporate dynamically updated data into its responses.
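
As a rough illustration (not OpenChatKit’s exact implementation), retrieved passages can simply be prepended to the user’s query before the prompt reaches the chat model. In the sketch below, search_wikipedia_index is a hypothetical stand-in for the index lookup, and the prompt tags are indicative only.

# Hypothetical helper: prepend retrieved context to the user query before generation.
def build_prompt_with_context(user_query, search_wikipedia_index, top_k=2):
    passages = search_wikipedia_index(user_query, top_k=top_k)  # contextually relevant sentences
    context = "\n".join(passages)
    return f"Context:\n{context}\n\n<human>: {user_query}\n<bot>:"
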
Moderation model
Moderation models are important in chatbot applications to enforce content filtering, quality control, user safety, and legal and compliance reasons. Moderation is a difficult and subjective task, and depends a lot on the domain of the chatbot application. OpenChatKit provides tools to moderate the chatbot application and monitor input text prompts for any inappropriate content. The moderation model provides a good baseline that can be adapted and customized to various needs.
OpenChatKit has a 6-billion-parameter moderation model, GPT-JT-Moderation-6B, which can moderate the chatbot to limit the inputs to the moderated subjects. Although the model itself does have some moderation built in, TogetherComputer trained a GPT-JT-Moderation-6B model with Ontocord.ai’s OIG-moderation dataset. This model runs alongside the main chatbot to check that both the user input and answer from the bot don’t contain inappropriate results. You can also use this to detect any out of domain questions to the chatbot and override when the question is not part of the chatbot’s domain.
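
A minimal sketch of this gating logic follows, assuming classify_moderation and generate_chat_response wrap the moderation and chat models; the function names and fallback messages are illustrative, and only the "needs intervention" label (listed later in this post) is used here.

# Hypothetical gating logic: moderate the user input, generate a response, then moderate the output.
def moderated_chat(user_input):
    if classify_moderation(user_input) == "needs intervention":
        return "Sorry, I can only help with questions within this chatbot's domain."
    bot_response = generate_chat_response(user_input)
    if classify_moderation(bot_response) == "needs intervention":
        return "Sorry, I can't provide that response."
    return bot_response
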
The following diagram illustrates the OpenChatKit workflow.

Extensible retrieval system use cases
Although we can apply this technique in various industries to build generative AI applications, for this post we discuss use cases in the financial industry. Retrieval augmented generation can be employed in financial research to automatically generate research reports on specific companies, industries, or financial products. By retrieving relevant information from internal knowledge bases, financial archives, news articles, and research papers, you can generate comprehensive reports that summarize key insights, financial metrics, market trends, and investment recommendations. You can use this solution to monitor and analyze financial news, market sentiment, and trends.
Solution overview
The following steps are involved to build a chatbot using OpenChatKit models and deploy them on SageMaker:

Download the chat base GPT-NeoXT-Chat-Base-20B model and package the model artifacts to be uploaded to Amazon Simple Storage Service (Amazon S3).
Use a SageMaker large model inference (LMI) container, configure the properties, and set up custom inference code to deploy this model.
Configure model parallel techniques and use inference optimization libraries in DJL serving properties. We will use Hugging Face Accelerate as the engine for DJL serving. Additionally, we define tensor parallel configurations to partition the model.
Create a SageMaker model and endpoint configuration, and deploy the SageMaker endpoint.

You can follow along by running the notebook in the GitHub repo.
Download the OpenChatKit model
First, we download the OpenChatKit base model using the snapshot_download function from huggingface_hub, which downloads an entire repository at a given revision. Downloads are made concurrently to speed up the process. See the following code:

from huggingface_hub import snapshot_download
from pathlib import Path

# This downloads the model into the current directory where the Jupyter notebook is running
local_model_path = Path("./openchatkit")
local_model_path.mkdir(exist_ok=True)
model_name = "togethercomputer/GPT-NeoXT-Chat-Base-20B"

# Only download PyTorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# Use snapshot_download because the repository stores its weights with Git LFS
chat_model_download_path = snapshot_download(
    repo_id=model_name,             # a user or an organization name and a repo name
    cache_dir=local_model_path,     # path to the folder where cached files are stored
    allow_patterns=allow_patterns,  # only files matching at least one pattern are downloaded
)

DJL Serving properties
You can use SageMaker LMI containers to host large generative AI models without providing your own inference code. This is extremely useful when there is no custom preprocessing of the input data or postprocessing of the model’s predictions. You can also deploy a model using custom inference code. In this post, we demonstrate how to deploy OpenChatKit models with custom inference code.
SageMaker expects the model artifacts in tar format. We create each OpenChatKit model with the following files: serving.properties and model.py.
The serving.properties configuration file indicates to DJL Serving which model parallelization and inference optimization libraries you would like to use. The following is a list of settings we use in this configuration file:

openchatkit/serving.properties
engine = Python
option.tensor_parallel_degree = 4
option.s3url = {{s3url}}

This contains the following parameters:

engine – The engine for DJL to use.
option.entryPoint – The entry point Python file or module. This should align with the engine that is being used.
option.s3url – Set this to the URI of the S3 bucket that contains the model.
option.modelid – If you want to download the model from huggingface.co, you can set option.modelid to the model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model ID to download the corresponding model repository on huggingface.co.
option.tensor_parallel_degree – Set this to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model that will be started up when DJL Serving runs. For example, if we have an 8 GPU machine and we are creating eight partitions, then we will have one worker per model to serve the requests. It’s necessary to tune the parallelism degree and identify the optimal value for a given model architecture and hardware platform. We call this ability inference-adapted parallelism.

Refer to Configurations and settings for an exhaustive list of options.
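
The serving.properties file and the custom inference code are ultimately packaged into the tar.gz artifact that SageMaker expects and uploaded to Amazon S3. The following is a minimal sketch of that packaging and upload step; the local file layout, bucket, and key prefix are assumptions, and the actual notebook may handle this differently.

# Hedged sketch: package the inference files into model.tar.gz and upload to S3.
import tarfile
import sagemaker

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("openchatkit/serving.properties", arcname="serving.properties")  # assumed local paths
    tar.add("openchatkit/model.py", arcname="model.py")

sess = sagemaker.Session()
s3_code_artifact = sess.upload_data(
    "model.tar.gz", bucket=sess.default_bucket(), key_prefix="openchatkit/code"
)
print(f"Code artifact uploaded to: {s3_code_artifact}")
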
OpenChatKit models
The OpenChatKit base model implementation has the following four files:

model.py – This file implements the handling logic for the main OpenChatKit GPT-NeoX model. It receives the inference input request, loads the model, loads the Wikipedia index, and serves the response. Refer to model.py (created as part of the notebook) for additional details. model.py uses the following key classes:

OpenChatKitService – This handles passing the data between the GPT-NeoX model, Faiss search, and conversation object. WikipediaIndex and Conversation objects are initialized and input chat conversations are sent to the index to search for relevant content from Wikipedia. This also generates a unique ID for each invocation if one is not supplied for the purpose of storing the prompts in Amazon DynamoDB.
ChatModel – This class loads the model and tokenizer and generates the response. It handles partitioning the model across multiple GPUs using tensor_parallel_degree, and configures the dtypes and device_map. The prompts are passed to the model to generate responses. A stopping criteria StopWordsCriteria is configured for the generation to only produce the bot response on inference.
ModerationModel – We use two moderation models in the ModerationModel class: an input model that flags inappropriate user input so the chat model’s inference result can be overridden, and an output model that checks and, if necessary, overrides the bot’s response. We classify the input prompt and output response with the following possible labels:

casual
needs caution
needs intervention (this is flagged to be moderated by the model)
possibly needs caution
probably needs caution

wikipedia_prepare.py – This file handles downloading and preparing the Wikipedia index. In this post, we use a Wikipedia index provided on Hugging Face datasets. To search the Wikipedia documents for relevant text, the index needs to be downloaded from Hugging Face because it’s not packaged elsewhere. The wikipedia_prepare.py file is responsible for handling the download when imported. Only one of the multiple processes running for inference clones the repository; the rest wait until the files are present in the local file system.
wikipedia.py – This file is used for searching the Wikipedia index for contextually relevant documents. The input query is tokenized and embeddings are created using mean_pooling. We compute cosine similarity distance metrics between the query embedding and the Wikipedia index to retrieve contextually relevant Wikipedia sentences. Refer to wikipedia.py for implementation details.

import numpy as np

# Function to create a sentence embedding from token embeddings using mean pooling
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# Function to compute the cosine similarity between two batches of embeddings
def cos_sim_2d(x, y):
    norm_x = x / np.linalg.norm(x, axis=1, keepdims=True)
    norm_y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.matmul(norm_x, norm_y.T)
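
As an illustration of how these helpers fit together (not the notebook’s exact code), a query can be embedded with mean_pooling and matched against precomputed Wikipedia sentence embeddings with cos_sim_2d. Here, tokenizer, embedding_model, and the precomputed wiki_embeddings array are assumed to be set up elsewhere.

# Hedged usage sketch: retrieve the k most similar Wikipedia sentences for a query.
import torch

def top_k_passages(query, wiki_sentences, wiki_embeddings, k=3):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = embedding_model(**inputs).last_hidden_state
    query_emb = mean_pooling(token_embeddings, inputs["attention_mask"]).numpy()
    scores = cos_sim_2d(query_emb, wiki_embeddings)[0]  # similarity to every indexed sentence
    best = scores.argsort()[::-1][:k]                   # indices of the k most similar sentences
    return [wiki_sentences[i] for i in best]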

conversation.py – This file is used for storing and retrieving the conversation thread in DynamoDB for passing to the model and user. conversation.py is adapted from the open-source OpenChatKit repository. This file is responsible for defining the object that stores the conversation turns between the human and the model. With this, the model is able to retain a session for the conversation, allowing a user to refer to previous messages. Because SageMaker endpoint invocations are stateless, this conversation needs to be stored in a location external to the endpoint instances. On startup, the instance creates a DynamoDB table if it doesn’t exist. All updates to the conversation are then stored in DynamoDB based on the session_id key, which is generated by the endpoint. Any invocation with a session ID will retrieve the associated conversation string and update it as required.
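
The conversation store described above can be sketched with a few boto3 calls. This is a minimal illustration, assuming a table keyed on session_id; the table name and attribute names are assumptions, not the notebook’s exact schema.

# Hedged sketch of the DynamoDB-backed conversation store; the notebook creates the
# table at startup if it doesn't exist, which is omitted here for brevity.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("openchatkit-conversations")  # hypothetical table name

def load_conversation(session_id):
    item = table.get_item(Key={"session_id": session_id}).get("Item")
    return item["conversation"] if item else ""

def save_conversation(session_id, conversation):
    table.put_item(Item={"session_id": session_id, "conversation": conversation})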

Build an LMI inference container with custom dependencies
The index search uses Facebook’s Faiss library for performing the similarity search. Because this isn’t included in the base LMI image, the container needs to be adapted to install this library. The following code defines a Dockerfile that installs Faiss from the source alongside other libraries needed by the bot endpoint. We use the sm-docker utility to build and push the image to Amazon Elastic Container Registry (Amazon ECR) from Amazon SageMaker Studio. Refer to Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks for more details.
The DJL container doesn’t have Conda installed, so Faiss needs to be cloned and compiled from the source. To install Faiss, the dependencies for using the BLAS APIs and Python support need to be installed. After these packages are installed, Faiss is configured to use AVX2 and CUDA before being compiled with the Python extensions installed.
pandas, fastparquet, boto3, and git-lfs are installed afterwards because these are required for downloading and reading the index files.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117
ARG FAISS_URL=https://github.com/facebookresearch/faiss.git
RUN apt-get update && apt-get install -y git-lfs wget cmake pkg-config build-essential apt-utils
RUN apt search openblas && apt-get install -y libopenblas-dev swig
RUN git clone $FAISS_URL && \
    cd faiss && \
    cmake -B build . -DFAISS_OPT_LEVEL=avx2 -DCMAKE_CUDA_ARCHITECTURES="86" && \
    make -C build -j faiss && \
    make -C build -j swigfaiss && \
    make -C build -j swigfaiss_avx2 && \
    (cd build/faiss/python && python -m pip install .)

RUN pip install pandas fastparquet boto3 && \
    git lfs install --skip-repo && \
    apt-get clean all
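
When building from SageMaker Studio with the sm-docker utility, this typically amounts to running something like sm-docker build . --repository openchatkit-faiss:latest from the directory containing the Dockerfile; the repository name and tag here are illustrative.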

Create the model
Now that we have the Docker image in Amazon ECR, we can proceed with creating the SageMaker model objects for the OpenChatKit models. We deploy the GPT-NeoXT-Chat-Base-20B chat model and use GPT-JT-Moderation-6B for input and output moderation. Refer to create_model for more details.

from sagemaker.utils import name_from_base

chat_model_name = name_from_base("gpt-neoxt-chatbase-ds")
print(chat_model_name)

create_model_response = sm_client.create_model(
    ModelName=chat_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": chat_inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
chat_model_arn = create_model_response["ModelArn"]

print(f"Created Model: {chat_model_arn}")

Configure the endpoint
Next, we define the endpoint configurations for the OpenChatKit models. We deploy the models using the ml.g5.12xlarge instance type. Refer to create_endpoint_config for more details.

chat_endpoint_config_name = f"{chat_model_name}-config"
chat_endpoint_name = f"{chat_model_name}-endpoint"

chat_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=chat_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": chat_model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)

Deploy the endpoint
Finally, we create an endpoint using the model and endpoint configuration we defined in the previous steps:

chat_create_endpoint_response = sm_client.create_endpoint(
    EndpointName=chat_endpoint_name, EndpointConfigName=chat_endpoint_config_name
)
print(f"Created Endpoint: {chat_create_endpoint_response['EndpointArn']}")

Run inference from OpenChatKit models
Now it’s time to send inference requests to the model and get responses. We pass the input text prompt and model parameters such as temperature, top_k, and max_new_tokens. The quality of the chatbot responses depends on the parameters specified, so it’s recommended to benchmark model performance against these parameters to find the optimal setting for your use case. The input prompt is first sent to the input moderation model, and then the input is passed to ChatModel to generate a response. During this step, the model uses the Wikipedia index to retrieve contextually relevant sections, which are added to the prompt so the model produces domain-specific responses. Finally, the model response is sent to the output moderation model to check its classification, and then the response is returned. See the following code:

import json

# smr_client is the SageMaker Runtime client (for example, boto3.client("sagemaker-runtime"))
# created earlier in the notebook.
def chat(prompt, session_id=None, **kwargs):
    if session_id:
        chat_response_model = smr_client.invoke_endpoint(
            EndpointName=chat_endpoint_name,
            Body=json.dumps(
                {
                    "inputs": prompt,
                    "parameters": {
                        "temperature": 0.6,
                        "top_k": 40,
                        "max_new_tokens": 512,
                        "session_id": session_id,
                        "no_retrieval": True,
                    },
                }
            ),
            ContentType="application/json",
        )
    else:
        chat_response_model = smr_client.invoke_endpoint(
            EndpointName=chat_endpoint_name,
            Body=json.dumps(
                {
                    "inputs": prompt,
                    "parameters": {
                        "temperature": 0.6,
                        "top_k": 40,
                        "max_new_tokens": 512,
                    },
                }
            ),
            ContentType="application/json",
        )
    response = chat_response_model["Body"].read().decode("utf8")
    return response


prompts = "What does a data engineer do?"
chat(prompts)

Refer to sample chat interactions below.

Clean up
Follow the instructions in the cleanup section of the notebook to delete the resources provisioned as part of this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details about the cost of the inference instances.
Conclusion
In this post, we discussed the importance of open-source LLMs and how to deploy an OpenChatKit model on SageMaker to build next-generation chatbot applications. We discussed various components of OpenChatKit models, moderation models, and how to use an external knowledge source like Wikipedia for retrieval augmented generation (RAG) workflows. You can find step-by-step instructions in the GitHub notebook. Let us know about the amazing chatbot applications you’re building. Cheers!

About the Authors
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.
Vikram Elango is a Sr. AIML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.
Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Fine-tune GPT-J using an Amazon SageMaker Hugging Face estimator and t …

GPT-J is an open-source 6-billion-parameter model released by Eleuther AI. The model is trained on the Pile and can perform various tasks in language processing. It can support a wide variety of use cases, including text classification, token classification, text generation, question answering, entity extraction, summarization, sentiment analysis, and many more. GPT-J is a transformer model trained using Ben Wang’s Mesh Transformer JAX.
In this post, we present a guide and best practices on training large language models (LLMs) using the Amazon SageMaker distributed model parallel library to reduce training time and cost. You will learn how to train a 6-billion-parameter GPT-J model on SageMaker with ease. Finally, we share the main features of SageMaker distributed model parallelism that help with speeding up training time.
Transformer neural networks
A transformer neural network is a popular deep learning architecture to solve sequence-to-sequence tasks. It uses attention as the learning mechanism to achieve close to human-level performance. Some of the other useful properties of the architecture compared to previous generations of natural language processing (NLP) models include the ability to distribute, scale, and pre-train. Transformers-based models can be applied across different use cases when dealing with text data, such as search, chatbots, and many more. Transformers use the concept of pre-training to gain intelligence from large datasets. Pre-trained transformers can be used as is or fine-tuned on your datasets, which can be much smaller and specific to your business.
Hugging Face on SageMaker
Hugging Face is a company developing some of the most popular open-source libraries providing state-of-the-art NLP technology based on transformers architectures. The Hugging Face transformers, tokenizers, and datasets libraries provide APIs and tools to download and predict using pre-trained models in multiple languages. SageMaker enables you to train, fine-tune, and run inference using Hugging Face models directly from its Hugging Face Model Hub using the Hugging Face estimator in the SageMaker SDK. The integration makes it easier to customize Hugging Face models on domain-specific use cases. Behind the scenes, the SageMaker SDK uses AWS Deep Learning Containers (DLCs), which are a set of prebuilt Docker images for training and serving models offered by SageMaker. The DLCs are developed through a collaboration between AWS and Hugging Face. It also connects the Hugging Face transformers SDK with the SageMaker distributed training libraries, enabling you to scale your training jobs on a cluster of GPUs.
Overview of the SageMaker distributed model parallel library
Model parallelism is a distributed training strategy that partitions the deep learning model over numerous devices, within or across instances. Deep learning (DL) models with more layers and parameters perform better in complex tasks like computer vision and NLP. However, the maximum model size that can be stored in the memory of a single GPU is limited. GPU memory constraints can be bottlenecks while training DL models in the following ways:

They limit the size of the model that can be trained because a model’s memory footprint scales proportionately to the number of parameters
They reduce GPU utilization and training efficiency by limiting the per-GPU batch size during training

SageMaker includes the distributed model parallel library to help distribute and train DL models effectively across many compute nodes, overcoming the restrictions associated with training a model on a single GPU. Furthermore, the library allows you to obtain optimal distributed training by utilizing EFA-supported devices, which improve inter-node communication performance with low latency, high throughput, and OS bypass.
Because large models such as GPT-J, with billions of parameters, have a GPU memory footprint that exceeds a single chip, it becomes essential to partition them across multiple GPUs. The SageMaker model parallel (SMP) library enables automatic partitioning of models across multiple GPUs. With SageMaker model parallelism, SageMaker runs an initial profiling job on your behalf to analyze the compute and memory requirements of the model. This information is then used to decide how the model is partitioned across GPUs, in order to maximize an objective, such as maximizing speed or minimizing memory footprint.
It also supports optional pipeline run scheduling in order to maximize the overall utilization of available GPUs. The propagation of activations during forward pass and gradients during backward pass requires sequential computation, which limits the amount of GPU utilization. SageMaker overcomes the sequential computation constraint utilizing the pipeline run schedule by splitting mini-batches into micro-batches to be processed in parallel on different GPUs. SageMaker model parallelism supports two modes of pipeline runs:

Simple pipeline – This mode finishes the forward pass for each micro-batch before starting the backward pass.
Interleaved pipeline – In this mode, the backward run of the micro-batches is prioritized whenever possible. This allows for quicker release of the memory used for activations, thereby using memory more efficiently.

Tensor parallelism
Individual layers, or nn.Modules, are divided across devices using tensor parallelism so they can run concurrently. The simplest example of how the library divides a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2) is shown in the following figure. Each model replica’s layers are bisected (divided in half) and distributed between two GPUs. The degree of data parallelism is eight in this example because the model parallel configuration additionally includes "pipeline_parallel_degree": 1 and "ddp": True. The library manages communication among the replicas of the tensor-distributed model.

The benefit of this feature is that you may choose which layers or which subset of layers you want to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set up a combination of pipeline and tensor parallelism, see Extended Features of the SageMaker Model Parallel Library for PyTorch.
SageMaker sharded data parallelism
Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.
When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state over multiple GPUs. This provides two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.
The standard data parallelism technique replicates the training states across the GPUs in the data parallel group and performs gradient aggregation based on the AllReduce operation. In effect, sharded data parallelism introduces a trade-off between the communication overhead and GPU memory efficiency. Using sharded data parallelism increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, therefore larger models can fit in a GPU cluster.
SageMaker implements sharded data parallelism through the MiCS implementation. For more information, see Near-linear scaling of gigantic-model training on AWS.
Refer to Sharded Data Parallelism for further details on how to apply sharded data parallelism to your training jobs.
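
For reference, enabling sharded data parallelism comes down to setting a sharding degree in the modelparallel parameters of the training job. The snippet below is a hedged sketch; verify the exact key names against the SMP documentation for the library version you use.

# Hypothetical excerpt of the "modelparallel" distribution parameters enabling
# sharded data parallelism; the degree value of 8 is illustrative.
smp_sharded_config = {
    "enabled": True,
    "parameters": {
        "sharded_data_parallel_degree": 8,  # shard parameters, gradients, and optimizer state across 8 GPUs
    },
}
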
Use the SageMaker model parallel library
The SageMaker model parallel library comes with the SageMaker Python SDK. You need to install the SageMaker Python SDK to use the library, and it’s already installed on SageMaker notebook kernels. To make your PyTorch training script utilize the capabilities of the SMP library, you need to make the following changes:

Start by importing and initializing the smp library using the smp.init() call.
Once it’s initialized, you can wrap your model with the smp.DistributedModel wrapper and use the returned DistributedModel object instead of the user model.
For your optimizer state, use the smp.DistributedOptimizer wrapper around your model optimizer, enabling smp to save and load the optimizer state. The forward and backward pass logic can be abstracted into a separate function decorated with smp.step. Essentially, the forward pass and back-propagation need to run inside the function with the smp.step decorator placed over it. This allows smp to split the tensor input to the function into the number of microbatches specified when launching the training job.
Next, we can move the input tensors to the GPU used by the current process using the torch.cuda.set_device API followed by the .to() API call.
Finally, for back-propagation, we replace torch.Tensor.backward and torch.autograd.backward with the DistributedModel.backward call, as shown by model.backward(loss) in the following code.

See the following code:

import smdistributed.modelparallel.torch as smp
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)  # replaces the standard loss.backward() call
    return output, loss

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(model_config)

model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)
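
The following is a minimal training-loop sketch under the assumptions above (train_loader, model, and optimizer already set up and wrapped); the names are illustrative rather than the notebook’s exact code. Because the smp.step-decorated function runs on microbatches, the returned loss is a StepOutput whose reduce_mean() averages the per-microbatch values.

import torch

# Pin this process to its GPU, then iterate over the training data.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    _, loss_mb = train_step(model, data, target)  # backward runs inside the smp.step function
    loss = loss_mb.reduce_mean()                  # average the per-microbatch losses
    optimizer.step()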

The SageMaker model parallel library’s tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:

GPT-2, BERT, and RoBERTa (available in the SMP library v1.7.0 and later)
GPT-J (available in the SMP library v1.8.0 and later)
GPT-Neo (available in the SMP library v1.10.0 and later)

Best practices for performance tuning with the SMP library
When training large models, consider the following steps so that your model fits in GPU memory with a reasonable batch size:

It’s recommended to use instances with higher GPU memory and high bandwidth interconnect for performance, such as p4d and p4de instances.
Optimizer state sharding can be enabled in most cases, and will be helpful when you have more than one copy of the model (data parallelism enabled). You can turn on optimizer state sharding by setting “shard_optimizer_state”: True in the modelparallel configuration.
Use activation checkpointing, a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass of selected modules in the model.
Use activation offloading, an additional feature that can further reduce memory usage. To use activation offloading, set “offload_activations”: True in the modelparallel configuration. Use when activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one.
Enable tensor parallelism and increase parallelism degrees where the degree is a power of 2. Typically for performance reasons, tensor parallelism is restricted to within a node.

We have run many experiments to optimize training and tuning of GPT-J on SageMaker with the SMP library. We managed to reduce the GPT-J training time per epoch on SageMaker from 58 minutes to less than 10 minutes, a six-times speedup per epoch. Initialization plus model and dataset download from Amazon Simple Storage Service (Amazon S3) took less than a minute, tracing and auto partitioning with GPU as the tracing device took less than a minute, and training one epoch took 8 minutes using tensor parallelism on one ml.p4d.24xlarge instance with FP16 precision and a SageMaker Hugging Face estimator.
To reduce training time when training GPT-J on SageMaker, we recommend the following best practices (an illustrative estimator configuration follows the list):

Store your pretrained model on Amazon S3
Use FP16 precision
Use GPU as a tracing device
Use auto-partitioning, activation checkpointing, and optimizer state sharding:

auto_partition: True
shard_optimizer_state: True

Use tensor parallelism
Use a SageMaker training instance with multiple GPUs such as ml.p3.16xlarge, ml.p3dn.24xlarge, ml.g5.48xlarge, ml.p4d.24xlarge, or ml.p4de.24xlarge.
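
Putting these recommendations together, a SageMaker Hugging Face estimator configured for SMP might look roughly like the following. This is a hedged sketch, not the workshop notebook’s exact code: the entry point name, framework versions, and hyperparameters are illustrative, and the SMP parameter values mirror the options discussed above.

from sagemaker.huggingface import HuggingFace

# Illustrative SMP options; verify key names and values against the SMP documentation
# for the library version you use.
smp_parameters = {
    "tensor_parallel_degree": 8,   # tensor parallelism within the ml.p4d.24xlarge node
    "pipeline_parallel_degree": 1,
    "ddp": True,
    "auto_partition": True,
    "shard_optimizer_state": True,
}

estimator = HuggingFace(
    entry_point="train_gptj_smp.py",   # hypothetical training script name
    source_dir="scripts",
    role=role,                          # assumed to be defined earlier
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    transformers_version="4.17",        # example versions; pick a supported DLC combination
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"model_name": "EleutherAI/gpt-j-6B", "fp16": 1},  # FP16 handled by the script
    distribution={
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

# estimator.fit({"train": "s3://<your-bucket>/gpt-j/train"})  # pretrained model and data staged in S3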

GPT-J model training and tuning on SageMaker with the SMP library
A working step-by-step code sample is available in the Amazon SageMaker Examples public repository. Navigate to the training/distributed_training/pytorch/model_parallel/gpt-j folder and open the train_gptj_smp_tensor_parallel_notebook.ipynb Jupyter notebook for the tensor parallelism example or train_gptj_smp_notebook.ipynb for the pipeline parallelism example. You can find a code walkthrough in our Generative AI on Amazon SageMaker workshop.
This notebook walks you through how to use the tensor parallelism features provided by the SageMaker model parallelism library. You’ll learn how to run FP16 training of the GPT-J model with tensor parallelism and pipeline parallelism on the GLUE sst2 dataset.
Summary
The SageMaker model parallel library offers several functionalities that reduce the cost and speed up the training of LLMs on SageMaker. You can also learn from and run sample code for BERT, GPT-2, and GPT-J in the Amazon SageMaker Examples public repository. To learn more about AWS best practices for training LLMs using the SMP library, refer to the following resources:

SageMaker Distributed Model Parallelism Best Practices
Training large language models on Amazon SageMaker: Best practices

To learn how one of our customers achieved low-latency GPT-J inference on SageMaker, refer to How Mantium achieves low-latency GPT-J inference with DeepSpeed on Amazon SageMaker.
If you’re looking to accelerate time-to-market of your LLMs and reduce your costs, SageMaker can help. Let us know what you build!

About the Authors
Zmnako Awrahman, PhD, is a Practice Manager, ML SME, and Machine Learning Technical Field Community (TFC) member at Global Competency Center, Amazon Web Services. He helps customers leverage the power of the cloud to extract value from their data with data analytics and machine learning.
Roop Bains is a Senior Machine Learning Solutions Architect at AWS. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. He helps customers train, optimize, and deploy deep learning models.
Anastasia Pachni Tsitiridou is a Solutions Architect at AWS. Anastasia lives in Amsterdam and supports software businesses across the Benelux region in their cloud journey. Prior to joining AWS, she studied electrical and computer engineering with a specialization in computer vision. What she enjoys most nowadays is working with very large language models.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.
Wioletta Stobieniecka is a Data Scientist at AWS Professional Services. Throughout her professional career, she has delivered multiple analytics-driven projects for different industries such as banking, insurance, telco, and the public sector. Her knowledge of advanced statistical methods and machine learning is well combined with a business acumen. She brings recent AI advancements to create value for customers.
Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.

In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (P …

Out-of-distribution (OOD) detection in deep learning models, particularly in image classification, addresses the challenge of identifying inputs unrelated to the model’s training task. It aims to prevent the model from making confident but incorrect predictions on OOD inputs while accurately classifying in-distribution (ID) inputs. By distinguishing between ID and OOD inputs, OOD detection methods enhance the model’s robustness and reliability in real-world applications.

A weakness in current OOD detection evaluations in image classification, specifically regarding datasets related to ImageNet-1K (IN-1K), is the presence of ID objects within the OOD datasets. Because these datasets label samples containing ID objects as OOD, detectors that correctly treat such samples as in-distribution are scored as making errors. Consequently, the evaluation of OOD detection methods is distorted, underestimating the actual OOD detection performance and unjustly penalizing more effective OOD detectors.

A new paper was recently published in which the authors aim to address the limitations in evaluating OOD detection methods. They introduce a novel test dataset, NINCO, which contains OOD samples without any objects from the ImageNet-1K (ID) classes. They also provide synthetic “OOD unit tests” to assess weaknesses in OOD detectors. The paper evaluates various architectures and methods on NINCO, providing insights into model weaknesses and the impact of pretraining on OOD detection performance. The goal is to improve the evaluation and understanding of OOD detection methods.

The authors propose the creation of a new dataset called NINCO (No ImageNet Class Objects) to address the limitations in evaluating OOD detection methods. They carefully select base classes from existing or newly scraped datasets, applying a non-permissive interpretation of the class names to ensure they are not categorically part of the ImageNet-1K (ID) classes. The authors visually inspect each image in the base classes to remove samples containing ID objects or samples where no object from the OOD class is visible. This manual cleaning process ensures a higher-quality dataset.

NINCO consists of 64 OOD classes with a total of 5,879 samples sourced from various datasets, including SPECIES, PLACES, FOOD-101, CALTECH-101, MYNURSINGHOME, ImageNet-21k, and newly scraped from iNaturalist.org and other websites. Additionally, the authors provide cleaned versions of 2,715 OOD images from eleven test OOD datasets to evaluate potential ID contaminations.

The authors also propose using OOD unit tests: simple, synthetically generated image inputs designed to expose specific weaknesses of OOD detectors. They suggest evaluating the performance of an OOD detector on these unit tests separately and counting the number of failed tests (FPR above a user-defined threshold) alongside the overall evaluation on a test OOD dataset like NINCO. These unit tests provide valuable insights into specific weaknesses that detectors may encounter in practice. Overall, the authors propose NINCO as a high-quality dataset for evaluating OOD detection methods and suggest using OOD unit tests to gain additional insights into a detector’s weaknesses.
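
To make the unit-test protocol concrete, the following sketch computes the false positive rate at 95% true positive rate from a detector’s scores and counts how many unit tests exceed a user-chosen FPR bound. It assumes higher scores mean "more in-distribution"; the threshold choice and the 1% bound are illustrative, not values from the paper.

import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    # Threshold chosen so that 95% of ID samples are (correctly) kept as in-distribution.
    threshold = np.quantile(id_scores, 0.05)
    # Fraction of OOD samples whose score still clears the threshold (false positives).
    return float(np.mean(np.asarray(ood_scores) >= threshold))

def count_failed_unit_tests(id_scores, unit_test_scores, max_fpr=0.01):
    # A unit test counts as failed if its FPR at 95% TPR exceeds the user-defined bound.
    return sum(fpr_at_95_tpr(id_scores, scores) > max_fpr for scores in unit_test_scores)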

The paper presents detailed evaluations of OOD detection methods on the NINCO dataset and the unit tests. The authors analyze the performance of various architectures and OOD detection methods, revealing insights about model weaknesses and the impact of pretraining on OOD detection performance. In evaluating the NINCO dataset, the study assesses different IN-1K models obtained from the timm-library and advanced OOD detection methods. Feature-based techniques such as Maha, RMaha, and ViM perform better than the MSP baseline. Max-Logit and Energy also demonstrate notable enhancements compared to MSP. The performance results differ based on the chosen model and OOD detection method. Pretraining proves to be influential as it contributes to improved ID performance and the generation of superior feature embeddings for OOD detection.
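
For reference, the post-hoc scores named above can be computed directly from a classifier’s logits. The sketch below uses the standard definitions of these scores (not code from the paper), with the sign convention that higher values indicate in-distribution inputs.

import numpy as np
from scipy.special import logsumexp, softmax

def msp_score(logits):
    # Maximum softmax probability (MSP) baseline.
    return softmax(logits, axis=-1).max(axis=-1)

def max_logit_score(logits):
    # Maximum unnormalized logit (Max-Logit).
    return logits.max(axis=-1)

def energy_score(logits, temperature=1.0):
    # Negative free energy; higher values indicate in-distribution inputs.
    return temperature * logsumexp(logits / temperature, axis=-1)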

In conclusion, the study addresses the limitations in evaluating OOD detection methods in image classification. It introduces the NINCO dataset, which contains OOD samples without any objects from the ImageNet-1K (ID) classes, and proposes the use of OOD unit tests to assess detector weaknesses. The evaluations on NINCO demonstrate the performance of different models and OOD detection methods, highlighting the effectiveness of feature-based techniques and the impact of pretraining. NINCO improves the evaluation and understanding of OOD detection methods by offering a clean dataset and insights into detector weaknesses. The findings emphasize the importance of improving OOD detection evaluations and understanding the strengths and limitations of current methods.

Check Out The Paper and Github. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (Paper Summary) appeared first on MarkTechPost.