Overcoming Hallucinations in AI: How Factually Augmented RLHF Optimizes Vision-Language Alignment in Large Multimodal Models

Large Language Models (LLMs) can be extended into the multimodal domain through additional pre-training on image-text pairs or fine-tuning on specialized visual instruction tuning datasets, giving rise to powerful Large Multimodal Models (LMMs). Building LMMs, however, faces obstacles, chief among them the gap in quantity and quality between multimodal data and text-only datasets. Consider the LLaVA model, which is initialized from a pre-trained visual encoder and an instruction-tuned language model: it is trained on only 150K synthetic image-based conversations, far fewer examples than text-only models that draw on more than 100M examples across 1,800 tasks. Such data limitations can leave the visual and language modalities misaligned.

As a result, LMMs may produce hallucinated outputs that are not accurately grounded in the context provided by the images. To address the issues caused by the lack of high-quality visual instruction tuning data for LMM training, researchers from UC Berkeley, CMU, UIUC, UW–Madison, UMass Amherst, Microsoft Research, and MIT-IBM Watson AI Lab present LLaVA-RLHF, a vision-language model trained for improved multimodal alignment. One of their major contributions is adapting Reinforcement Learning from Human Feedback (RLHF), a universal and scalable alignment paradigm that has proven remarkably effective for text-based AI agents, to multimodal alignment for LMMs. The approach collects human preferences that focus on recognizing hallucinations and uses those preferences to fine-tune the LMM with reinforcement learning.

This strategy can improve multimodal alignment at a relatively low annotation cost, such as $3,000 for gathering 10K human preferences on image-based conversations. To the authors' knowledge, it is the first effective use of RLHF for multimodal alignment. A potential problem with the current RLHF paradigm is reward hacking: obtaining high scores from the reward model does not always translate into better human judgments. Previous work suggested iteratively collecting fresh human feedback to prevent reward hacking, but this approach is typically expensive and cannot properly reuse existing human preference data. This study proposes a more data-efficient alternative that tries to make the reward model capable of leveraging existing human-annotated data and the knowledge already present in larger language models.

Figure 1: A diagram illustrating the possibility of hallucinations during the Supervised Fine-Tuning (SFT) phase of LMM training and the way Factually Augmented RLHF addresses the problem of low capacity in the reward model, which is initialized from the SFT model.

First, they use a superior visual encoder with higher resolution and a larger language model to enhance the reward model's overall capability. Second, they present the Factually Augmented RLHF algorithm, which, as shown in Figure 1, calibrates the reward signals by supplementing them with additional information such as image captions or a ground-truth multi-choice option. To improve the general capabilities of LMMs during the Supervised Fine-Tuning stage, they further augment the synthetic visual instruction tuning data with existing high-quality human-annotated multimodal data in conversation format. Specifically, they convert Flickr30k into a spotting-captioning task and VQA-v2 and A-OKVQA into multi-round QA tasks, and train the LLaVA-SFT+ models on the resulting dataset.

Finally, they consider how to evaluate the multimodal alignment of LMMs in real-world generation scenarios, paying particular attention to penalizing hallucinations. The benchmark they develop, MMHAL-BENCH, covers all 12 of COCO's main object categories and comprises eight question types. According to their analysis, this benchmark closely matches human assessments, especially when scores are weighted toward anti-hallucination. As the first LMM trained with RLHF, LLaVA-RLHF performs well in their experiments: they report an improvement of 94% on LLaVA-Bench and 60% on MMHAL-BENCH, and set new performance records for LLaVA with 52.4% on MMBench and an 82.7% F1 score on POPE. Their code, model, and data are publicly available on GitHub.

Check out the Paper and Project for more details.

All You Need To Know About The Qwen Large Language Models (LLMs) Series

Large language models (LLMs) have significantly reshaped the landscape of Artificial Intelligence (AI) since their emergence. These models provide a strong framework for challenging reasoning and problem-solving tasks, revolutionizing numerous AI disciplines. Thanks to their capacity to compress huge amounts of knowledge into neural networks, LLMs are adaptable agents capable of a wide range of tasks. When given access to a chat interface, they can carry out jobs previously thought to be reserved for humans, such as creative work and expert-level problem-solving. This shift has given rise to applications ranging from chatbots and virtual assistants to language translation and summarization tools.

LLMs perform as generalist agents, working with other systems, resources, and models to achieve goals set by people. This includes their ability to follow multimodal instructions, run programs, use tools, and more, opening up new possibilities for AI applications in areas such as autonomous vehicles, healthcare, and finance. Despite these impressive capabilities, LLMs have been criticized for their lack of reproducibility, steerability, and accessibility to service providers.

In recent research, a team has introduced QWEN, the initial release of their comprehensive large language model series. QWEN is not one particular model but rather a collection of models with varied parameter counts. The two primary categories in this series are QWEN, the base pretrained language models, and QWEN-CHAT, the chat models refined with human alignment methods.

The base language models, represented by QWEN, have consistently displayed strong performance on a variety of downstream tasks. Thanks to extensive training on diverse textual and coding datasets, these models have a thorough understanding of many domains, and their adaptability makes them valuable assets for a wide range of applications.

The QWEN-CHAT models, by contrast, are designed specifically for natural language conversation. They have undergone thorough fine-tuning using human alignment methodologies, including supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF). RLHF in particular has been quite successful at improving the capabilities of these chat models.

In addition to QWEN and QWEN-CHAT, the team has also introduced two specialized variants in the model series, specifically designed for coding-related tasks. Called CODE-QWEN and CODE-QWEN-CHAT, these models have undergone rigorous pre-training on large datasets of code, followed by fine-tuning to excel in tasks involving code comprehension, creation, debugging, and interpretation. While they may slightly lag behind proprietary models, these models vastly outperform open-source counterparts in terms of performance, making them an invaluable tool for academics and developers.

Similarly, MATH-QWEN-CHAT has been developed to focus on mathematical problem-solving. On math-related tasks, these models perform far better than open-source models and come close to matching the capabilities of commercial models. In conclusion, QWEN marks an important turning point in the creation of large-scale language models. It includes a wide variety of models that collectively reveal the transformational potential of LLMs in the field of AI, exhibiting their superior performance over open-source alternatives.

Check out the Paper for more details.

Personalize your generative AI applications with Amazon SageMaker Feature Store

Large language models (LLMs) are revolutionizing fields like search engines, natural language processing (NLP), healthcare, robotics, and code generation. The applications also extend into retail, where they can enhance customer experiences through dynamic chatbots and AI assistants, and into digital marketing, where they can organize customer feedback and recommend products based on descriptions and purchase behaviors.
The personalization of LLM applications can be achieved by incorporating up-to-date user information, which typically involves integrating several components. One such component is a feature store, a tool that stores, shares, and manages features for machine learning (ML) models. Features are the inputs used during training and inference of ML models. For instance, in an application that recommends movies, features could include previous ratings, preference categories, and demographics. Amazon SageMaker Feature Store is a fully managed repository designed specifically for storing, sharing, and managing ML model features. Another essential component is an orchestration tool suitable for prompt engineering and managing different types of subtasks. Generative AI developers can use frameworks like LangChain, which offers modules for integrating with LLMs and orchestration tools for task management and prompt engineering.
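As a quick illustration of the feature store component, a user-profile feature group could be defined and created with the SageMaker Python SDK roughly as follows. This is a minimal sketch rather than code from the solution: the per-genre feature schema and the S3 offline-store location are assumptions for illustration only, although the feature group name matches the one used later in this post.

from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum
import sagemaker

sess = sagemaker.Session()

# Hypothetical schema: one record per user with per-genre interest scores
feature_definitions = [
    FeatureDefinition(feature_name="user_id", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="event_time", feature_type=FeatureTypeEnum.STRING),
    FeatureDefinition(feature_name="action", feature_type=FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition(feature_name="romance", feature_type=FeatureTypeEnum.FRACTIONAL),
    FeatureDefinition(feature_name="sci_fi", feature_type=FeatureTypeEnum.FRACTIONAL),
]

feature_group = FeatureGroup(
    name="user-profile-feature-group",
    feature_definitions=feature_definitions,
    sagemaker_session=sess,
)

# Creates the offline (S3) store and an online store for real-time lookups
feature_group.create(
    s3_uri="s3://<your-bucket>/feature-store",  # placeholder bucket
    record_identifier_name="user_id",
    event_time_feature_name="event_time",
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True,
)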
Building on the concept of dynamically fetching up-to-date data to produce personalized content, the use of LLMs has garnered significant attention in recent research for recommender systems. The underlying principle of these approaches involves the construction of prompts that encapsulate the recommendation task, user profiles, item attributes, and user-item interactions. These task-specific prompts are then fed into the LLM, which is tasked with predicting the likelihood of interaction between a particular user and item. As stated in the paper Personalized Recommendation via Prompting Large Language Models, recommendation-driven and engagement-guided prompting components play a crucial role in enabling LLMs to focus on relevant context and align with user preferences.
In this post, we elucidate the simple yet powerful idea of combining user profiles and item attributes to generate personalized content recommendations using LLMs. As demonstrated throughout the post, these models hold immense potential in generating high-quality, context-aware input text, which leads to enhanced recommendations. To illustrate this, we guide you through the process of integrating a feature store (representing user profiles) with an LLM to generate these personalized recommendations.
Solution overview
Let’s imagine a scenario where a movie entertainment company promotes movies to different users via an email campaign. The promotion contains 25 well-known movies, and we want to select the top three recommendations for each user based on their interests and previous rating behaviors.
For example, given a user’s interest in different movie genres like action, romance, and sci-fi, we could have an AI system determine the top three recommended movies for that particular user. In addition, the system might generate personalized messages for each user in a tone tailored to their preferences. We include some examples of personalized messages later in this post.
This AI application would include several components working together, as illustrated in the following diagram:

A user profiling engine takes in a user’s previous behaviors and outputs a user profile reflecting their interests.
A feature store maintains user profile data.
A media metadata store keeps the promotion movie list up to date.
A language model takes the current movie list and user profile data, and outputs the top three recommended movies for each user, written in their preferred tone.
An orchestrating agent coordinates the different components.

In summary, intelligent agents could construct prompts using user- and item-related data and deliver customized natural language responses to users. This would represent a typical content-based recommendation system, which recommends items to users based on their profiles. The user’s profile is stored and maintained in the feature store and revolves around their preferences and tastes. It is commonly derived based on their previous behaviors, such as ratings.
The following diagram illustrates how it works.

The application follows these steps to provide responses to a user’s recommendation:

The user profiling engine takes a user's historical movie ratings as input, outputs the user's interests, and stores the features in SageMaker Feature Store. This process can be updated on a schedule.
The agent takes the user ID as input, searches for the user interest, and completes the prompt template following the user’s interests.
The agent takes the promotion item list (movie name, description, genre) from a media metadata store.
The interests prompt template and promotion item list are fed into an LLM for email campaign messages.
The agent sends the personalized email campaign to the end user.

The user profiling engine builds a profile for each user, capturing their preferences and interests. This profile can be represented as a vector with elements mapping to features like movie genres, with values indicating the user’s level of interest. The user profiles in the feature store allow the system to suggest personalized recommendations matching their interests. User profiling is a well-studied domain within recommendation systems. To simplify, you can build a regression algorithm using a user’s previous ratings across different categories to infer their overall preferences. This can be done with algorithms like XGBoost.
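For example, per-genre interest scores could be inferred from historical ratings with a gradient-boosted regressor. The following sketch is illustrative only: the feature columns and target values are made-up assumptions, and the actual profiling logic in the solution may differ.

import pandas as pd
import xgboost as xgb

# Hypothetical engineered features per (user, genre) pair:
# how many movies the user rated in the genre, their average rating, and recency
X = pd.DataFrame({
    "genre_rating_count": [12, 3, 25, 0],
    "genre_avg_rating": [4.5, 2.0, 3.8, 0.0],
    "days_since_last_rating": [5, 90, 12, 365],
})
y = [0.9, 0.2, 0.7, 0.05]  # target interest score in [0, 1]

model = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)

# Predicted interest scores become the user-profile features stored in the feature store
interest_scores = model.predict(X)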
Code walkthrough
In this section, we provide examples of the code. The full code walkthrough is available in the GitHub repo.
After obtaining the user interests feature from the user profiling engine, we can store the results in the feature store. SageMaker Feature Store supports batch feature ingestion and online storage for real-time inference. For ingestion, data can be updated in an offline mode, whereas inference needs to happen in milliseconds. SageMaker Feature Store ensures that offline and online datasets remain in sync.
For data ingestion, we use the following code:

from sagemaker.feature_store.feature_group import FeatureGroup

feature_group_name = 'user-profile-feature-group'
feature_group = FeatureGroup(name=feature_group_name, feature_definitions=feature_definitions, sagemaker_session=sess)

#Ingest data
feature_group.ingest(data_frame=data_frame, max_workers=6, wait=True)

For real-time online storage, we could use the following code to extract the user profile based on the user ID:

feature_record = featurestore_runtime_client.get_record(FeatureGroupName=feature_group_name, RecordIdentifierValueAsString=customer_id)
print(feature_record)

Then we rank the top three interested movie categories to feed the downstream recommendation engine:

User ID: 42 Top3 Categories: ['Animation', 'Thriller', 'Adventure']
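One simple way to produce such a ranking, shown here as an illustrative sketch rather than the exact notebook logic, is to sort the per-genre interest features for a user and keep the three highest-scoring categories (the variable name top_categories_df mirrors the one referenced later in the prompt template class):

import pandas as pd

# Hypothetical user-profile record: genre -> interest score
user_profile = {"Animation": 0.91, "Thriller": 0.84, "Adventure": 0.78, "Romance": 0.33}

top_categories_df = (
    pd.Series(user_profile)
    .sort_values(ascending=False)
    .head(3)
)
print(f"User ID: 42 Top3 Categories: {top_categories_df.index.tolist()}")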

Our application employs two primary components. The first component retrieves data from a feature store, and the second component acquires a list of movie promotions from the metadata store. The coordination between these components is managed by Chains from LangChain, which represent a sequence of calls to components.
It’s worth mentioning that in complex scenarios, the application may need more than a fixed sequence of calls to LLMs or other tools. Agents, equipped with a suite of tools, use an LLM to determine the sequence of actions to be taken. Whereas Chains encode a hardcoded sequence of actions, agents use the reasoning power of a language model to dictate the order and nature of actions.
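For comparison, a tool-using agent in LangChain (as of the library version available at the time of writing) could be set up roughly as follows. This is a hedged sketch and not part of the solution: sm_llm refers to the SageMaker endpoint LLM created later in this post, and the two lookup functions are hypothetical stand-ins for the feature store and metadata store helpers.

from langchain.agents import initialize_agent, Tool, AgentType

tools = [
    Tool(
        name="UserProfileLookup",
        func=lambda user_id: fetch_user_preference_from_feature_store(user_id),  # hypothetical helper
        description="Returns the stored interest profile for a given user ID.",
    ),
    Tool(
        name="PromotionCatalog",
        func=lambda _: read_promotion_list(),  # hypothetical helper
        description="Returns the current list of promoted movies with descriptions and genres.",
    ),
]

# The agent lets the LLM decide which tool to call and in what order,
# instead of following a hardcoded chain of steps.
agent = initialize_agent(tools, sm_llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)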
The connection between different data sources, including SageMaker Feature Store, is demonstrated in the following code. All the retrieved data is consolidated to construct an extensive prompt, serving as input for the LLM. We dive deep into the specifics of prompt design in the subsequent section. The following is a prompt template definition that interfaces with multiple data sources:

import boto3
from langchain.prompts import StringPromptTemplate


class FeatureStorePromptTemplate(StringPromptTemplate):

    feature_group_name = 'user-profile-feature-group'

    def format(self, **kwargs) -> str:
        user_id = kwargs.pop("user_id")
        feature_record = self.fetch_user_preference_from_feature_store(user_id)
        user_preference = self.rank_user_preference(feature_record)

        kwargs["promotion_movie_list"] = self.read_promotion_list()
        kwargs["user_preference"] = user_preference
        # 'prompt' is the base prompt template defined elsewhere in the notebook
        return prompt.format(**kwargs)

    def fetch_user_preference_from_feature_store(self, user_id):
        boto_session = boto3.Session()
        featurestore_runtime_client = boto_session.client('sagemaker-featurestore-runtime')
        feature_record = featurestore_runtime_client.get_record(
            FeatureGroupName=self.feature_group_name,
            RecordIdentifierValueAsString=str(user_id),
        )
        return feature_record['Record']

    # Rank Top_3_Categories for given user's preference
    def rank_user_preference(self, data) -> str:
        # refer to the details in the notebook
        return str(top_categories_df.values.tolist())

    # Get promotion movie list from metadata store
    def read_promotion_list(self) -> str:
        # refer to the details in the notebook
        return output_string

In addition, we use Amazon SageMaker to host our LLM model and expose it as the LangChain SageMaker endpoint. To deploy the LLM, we use Amazon SageMaker JumpStart (for more details, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart). After the model is deployed, we can create the LLM module:

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler


class ContentHandler(LLMContentHandler):

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # refer to the details in the notebook
        ...

    def transform_output(self, output: bytes) -> str:
        # refer to the details in the notebook
        ...


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=aws_region,
    model_kwargs=parameters,
    endpoint_kwargs={"CustomAttributes": 'accept_eula=true'},
    content_handler=content_handler,
)

In the context of our application, the agent runs a sequence of steps, called an LLMChain. It integrates a prompt template, model, and guardrails to format the user input, pass it to the model, get a response, and then validate (and, if necessary, rectify) the model output.

from langchain.chains import LLMChain

llmchain = LLMChain(llm=sm_llm, prompt=prompt_template)
email_content = llmchain.run({'user_id': 4})
print(email_content)

In the next section, we walk through the prompt engineering for the LLM to output expected results.
LLM recommendation prompting and results
Following the high-level concept of engagement-guided prompting as described in the research study Personalized Recommendation via Prompting Large Language Models, the fundamental principle of our prompting strategy is to integrate user preferences in creating prompts. These prompts are designed to guide the LLM towards more effectively identifying attributes within the content description that align with user preferences. To elaborate further, our prompt comprises several components:

Contextual relevance – The initial part of our prompt template incorporates media metadata such as item name (movie title), description (movie synopsis), and attribute (movie genre). By incorporating this information, the prompt provides the LLM with a broader context and a more comprehensive understanding of the content. This contextual information aids the LLM in better understanding the item through its description and attributes, thereby enhancing its utility in content recommendation scenarios.
User preference alignment – By taking into account a user profile that signifies user preferences, potential recommendations are better positioned to identify content characteristics and features that resonate with target users. This alignment augments the utility of the item descriptions because it enhances the efficiency of recommending items that are relevant and in line with user preferences.
Enhanced recommendation quality – The engagement-guided prompt uses user preferences to identify relevant promotional items. We can also use user preference to adjust the tone of the LLM for the final output. This can result in an accurate, informative, and personalized experience, thereby improving the overall performance of the content recommendation system.

The following code shows an example prompt template:
prompt_template = """Our company, "Classic Cinema", frequently promotes movies that we aim to recommend to our customers. This month, we have several popular movies on promotion.
As an AI agent, you are tasked to assist "Classic Cinema" in crafting an email campaign to recommend relevant movies to users. The recommendations should adhere to several guidelines, including contextual relevance, ensuring the recommendations are strictly from our promotional movie list. Additionally, the recommendations should align with user preferences, suggesting items that are relevant and in harmony with the user's preferred categories. You are to provide precisely three top recommended movies. Finally, please draft the email to reflect the tone of the user's preferred categories. The email should not exceed 100 words.
The recommended movies should be sourced from this contextual relevance movie list: {promotion_movie_list}.
The user has expressed interest in {user_preference}.
Please ensure the recommendations are relevant, and the tone of the email reflects the tastes of those interested in the {user_preference} movie category.
Ensure the letter appeals to those interested in the {user_preference} movie category, and keep the email campaign within a 100-word limit."""
The following is an example result with user preferences of sci-fi, adventure, and war genres:

Subject: Explore the Frontier of Classic Cinema with Our Sci-Fi, Adventure, and War Movies!
Dear [Name], Are you ready to embark on a journey through time and space, experience heart-pumping action, and witness the bravery of heroes on the battlefield? Look no further! Classic Cinema proudly presents our carefully curated selection of movies, tailored specifically to your interests in Sci-Fi, Adventure, and War.
First up, we have the iconic “Star Wars: Episode V – The Empire Strikes Back” (1980). This classic installment in the Star Wars saga follows Luke Skywalker’s journey to becoming a Jedi Master, while his friends face off against the evil Empire. With its groundbreaking special effects and memorable characters, this movie is a must-see for any fan of the genre.
Next, we recommend “Interstellar” (2014), a thought-provoking and visually stunning film that delves into the mysteries of time and space. As a group of explorers embark on a quest to save humanity, they encounter breathtaking landscapes and uncover the secrets of the universe.
Lastly, we have “Saving Private Ryan” (1998), a gripping and intense war drama that follows a group of soldiers on a dangerous mission to find and rescue a paratrooper whose brothers have been killed in action. This critically acclaimed film is a powerful tribute to the heroes of World War II. Don’t miss out on these cinematic masterpieces! Watch them now and experience the thrill of adventure, the wonder of sci-fi, and the bravery of war heroes. Happy viewing, and may the force be with you!
Best regards, Classic Cinema Team

The following is another example result with a user preference of documentary, musical, and drama:

Subject: Classic Cinema’s Recommendations for Documentary, Musical, and Drama Lovers Dear [Name], We hope this email finds you well and that you’re enjoying the variety of movies available on our platform. At Classic Cinema, we take pride in catering to the diverse tastes of our customers, and we’ve selected three exceptional movies that we believe will resonate with your interest in Documentary, Musical, and Drama. First up, we have “The Shawshank Redemption” (1994), a powerful and uplifting drama that follows the journey of two prisoners as they find hope and redemption in a corrupt and unforgiving prison system. With its gripping storyline, outstanding performances, and timeless themes, this movie is a must-see for anyone who loves a well-crafted drama. Next, we recommend “The Lord of the Rings: The Fellowship of the Ring” (2001), an epic adventure that combines breathtaking visuals, memorable characters, and a richly detailed world. This movie is a masterclass in storytelling, with a deep sense of history and culture that will transport you to Middle-earth and leave you wanting more. Lastly, we suggest “The Pianist” (2002), a profound and moving documentary that tells the true story of Władysław Szpilman, a Polish Jewish pianist who struggled to survive the destruction of the Warsaw ghetto during World War II. This film is a powerful reminder of the human spirit’s capacity for resilience and hope, even in the face of unimaginable tragedy. We hope these recommendations resonate with your interests and provide you with an enjoyable and enriching movie experience. Don’t miss out on these timeless classics – watch them now and discover the magic of Classic Cinema! Best regards, The Classic Cinema Team

We have carried out tests with both Llama 2 7B-Chat (see the following code sample) and Llama 70B for comparison. Both models performed well, yielding consistent conclusions. By using a prompt template filled with up-to-date data, we found it easier to test arbitrary LLMs, helping us choose the right balance between performance and cost. We have also made several shared observations that are worth noting.
Firstly, we can see that the recommendations provided genuinely align with user preferences. The movie recommendations are guided by various components within our application, most notably the user profile stored in the feature store.
Additionally, the tone of the emails corresponds to user preferences. Thanks to the advanced language understanding capabilities of the LLM, we can customize the movie descriptions and email content, tailoring them to each individual user.
Furthermore, the final output format can be designed into the prompt. For example, in our case, the salutation “Dear [Name]” needs to be filled in by the email service. It’s important to note that although we avoid exposing personally identifiable information (PII) within our generative AI application, this information can be reintroduced during postprocessing, assuming the right level of permissions is granted.
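As a minimal illustration of such a post-processing step (the helper name and recipient value here are hypothetical, and the actual lookup of the recipient's name would happen in the email service):

def personalize_salutation(email_body: str, recipient_name: str) -> str:
    # Replace the placeholder that the prompt instructed the LLM to leave generic
    return email_body.replace("[Name]", recipient_name)

final_email = personalize_salutation(email_content, "Alex")  # "Alex" is a stand-in value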
Clean up
To avoid unnecessary costs, delete the resources you created as part of this solution, including the feature store and LLM inference endpoint deployed with SageMaker JumpStart.
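The following sketch shows what that cleanup could look like with the SageMaker Python SDK and boto3; the variable names mirror the ones used earlier in this post, and the assumption that the endpoint configuration shares the endpoint name may not hold for your deployment.

import boto3

# Delete the feature group (the offline data in Amazon S3 is not removed automatically)
feature_group.delete()

# Delete the real-time LLM inference endpoint deployed from SageMaker JumpStart
sm_client = boto3.client("sagemaker")
sm_client.delete_endpoint(EndpointName=endpoint_name)
# Assumes the endpoint config was created with the same name as the endpoint
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)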
Conclusion
The power of LLMs in generating personalized recommendations is immense and transformative, particularly when coupled with the right tools. By integrating SageMaker Feature Store and LangChain for prompt engineering, developers can construct and manage highly tailored user profiles. This results in high-quality, context-aware inputs that significantly enhance recommendation performance. In our illustrative scenario, we saw how this can be applied to tailor movie recommendations to individual user preferences, resulting in a highly personalized experience.
As the LLM landscape continues to evolve, we anticipate seeing more innovative applications that use these models to deliver even more engaging, personalized experiences. The possibilities are boundless, and we are excited to see what you will create with these tools. With resources such as SageMaker JumpStart and Amazon Bedrock now available to accelerate the development of generative AI applications, we strongly recommend exploring the construction of recommendation solutions using LLMs on AWS.

About the Authors
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.
Michelle Hong, PhD, works as Prototyping Solutions Architect at Amazon Web Services, where she helps customers build innovative applications using a variety of AWS components. She demonstrated her expertise in machine learning, particularly in natural language processing, to develop data-driven solutions that optimize business processes and improve customer experiences.
Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems. Outside of his professional life, he enjoys music, reading, and traveling.

Build an image-to-text generative AI application using multimodality models on Amazon SageMaker

As we delve deeper into the digital era, the development of multimodality models has been critical in enhancing machine understanding. These models process and generate content across various data forms, like text and images. A key feature of these models is their image-to-text capabilities, which have shown remarkable proficiency in tasks such as image captioning and visual question answering.
By translating images into text, we unlock and harness the wealth of information contained in visual data. For instance, in ecommerce, image-to-text can automate product categorization based on images, enhancing search efficiency and accuracy. Similarly, it can assist in generating automatic photo descriptions, providing information that might not be included in product titles or descriptions, thereby improving user experience.
In this post, we provide an overview of popular multimodality models. We also demonstrate how to deploy these pre-trained models on Amazon SageMaker. Furthermore, we discuss the diverse applications of these models, focusing particularly on several real-world scenarios, such as zero-shot tag and attribution generation for ecommerce and automatic prompt generation from images.
Background of multimodality models
Machine learning (ML) models have achieved significant advancements in fields like natural language processing (NLP) and computer vision, where models can exhibit human-like performance in analyzing and generating content from a single source of data. More recently, there has been increasing attention in the development of multimodality models, which are capable of processing and generating content across different modalities. These models, such as the fusion of vision and language networks, have gained prominence due to their ability to integrate information from diverse sources and modalities, thereby enhancing their comprehension and expression capabilities.
In this section, we provide an overview of two popular multimodality models: CLIP (Contrastive Language-Image Pre-training) and BLIP (Bootstrapping Language-Image Pre-training).
CLIP model
CLIP is a multi-modal vision and language model, which can be used for image-text similarity and for zero-shot image classification. CLIP is trained on a dataset of 400 million image-text pairs collected from a variety of publicly available sources on the internet. The model architecture consists of an image encoder and a text encoder, as shown in the following diagram.

During training, an image and corresponding text snippet are fed through the encoders to get an image feature vector and text feature vector. The goal is to make the image and text features for a matched pair have a high cosine similarity, while features for mismatched pairs have low similarity. This is done through a contrastive loss. This contrastive pre-training results in encoders that map images and text to a common embedding space where semantics are aligned.
The encoders can then be used for zero-shot transfer learning for downstream tasks. At inference time, the image and text pre-trained encoder processes its respective input and transforms it into a high-dimensional vector representation, or an embedding. The embeddings of the image and text are then compared to determine their similarity, such as cosine similarity. The text prompt (image classes, categories, or tags) whose embedding is most similar (for example, has the smallest distance) to the image embedding is considered the most relevant, and the image is classified accordingly.
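To make this concrete, here is a minimal zero-shot classification sketch using the Hugging Face transformers implementation of CLIP. The checkpoint name, image URL, and candidate labels are illustrative choices, not values prescribed by this post.

from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder URL
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the candidate labels
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(candidate_labels, probs[0].tolist())))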
BLIP model
Another popular multimodality model is BLIP. It introduces a novel model architecture capable of adapting to diverse vision-language tasks and employs a unique dataset bootstrapping technique to learn from noisy web data. BLIP architecture includes an image encoder and text encoder: the image-grounded text encoder injects visual information into the transformer block of the text encoder, and the image-grounded text decoder incorporates visual information into the transformer decoder block. With this architecture, BLIP demonstrates outstanding performance across a spectrum of vision-language tasks that involve the fusion of visual and linguistic information, from image-based search and content generation to interactive visual dialog systems. In a previous post, we proposed a content moderation solution based on the BLIP model that addressed multiple challenges using computer vision unimodal ML approaches.
Use case 1: Zero-shot tag or attribute generation for an ecommerce platform
Ecommerce platforms serve as dynamic marketplaces teeming with ideas, products, and services. With millions of products listed, effective sorting and categorization poses a significant challenge. This is where the power of auto-tagging and attribute generation comes into its own. By harnessing advanced technologies like ML and NLP, these automated processes can revolutionize the operations of ecommerce platforms.
One of the key benefits of auto-tagging or attribute generation lies in its ability to enhance searchability. Products tagged accurately can be found by customers swiftly and efficiently. For instance, if a customer is searching for a “cotton crew neck t-shirt with a logo in front,” auto-tagging and attribute generation enable the search engine to pinpoint products that match not merely the broader “t-shirt” category, but also the specific attributes of “cotton” and “crew neck.” This precise matching can facilitate a more personalized shopping experience and boost customer satisfaction. Moreover, auto-generated tags or attributes can substantially improve product recommendation algorithms. With a deep understanding of product attributes, the system can suggest more relevant products to customers, thereby increasing the likelihood of purchases and enhancing customer satisfaction.
CLIP offers a promising solution for automating the process of tag or attribute generation. It takes a product image and a list of descriptions or tags as input, generating a vector representation, or embedding, for each tag. These embeddings exist in a high-dimensional space, with their relative distances and directions reflecting the semantic relationships between the inputs. CLIP is pre-trained on a large scale of image-text pairs to encapsulate these meaningful embeddings. If a tag or attribute accurately describes an image, their embeddings should be relatively close in this space. To generate corresponding tags or attributes, a list of potential tags can be inputted into the text part of the CLIP model, and the resulting embeddings stored. Ideally, this list should be exhaustive, covering all potential categories and attributes relevant to the products on the ecommerce platform. The following figure shows some examples.

To deploy the CLIP model on SageMaker, you can follow the notebook in the following GitHub repo. We use the SageMaker pre-built large model inference (LMI) containers to deploy the model. The LMI containers use DJL Serving to serve your model for inference. To learn more about hosting large models on SageMaker, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference and Deploy large models at high performance using FasterTransformer on Amazon SageMaker.
In this example, we provide the files serving.properties, model.py, and requirements.txt to prepare the model artifacts and store them in a tarball file.

serving.properties is the configuration file that can be used to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration. For more details on the configuration options and an exhaustive list, refer to Configurations and settings.
model.py is the script that handles any requests for serving.
requirements.txt is the text file containing any additional pip wheels to install.

If you want to download the model from Hugging Face directly, you can set the option.model_id parameter in the serving.properties file to the model ID of a pre-trained model hosted in a model repository on huggingface.co. The container uses this model ID to download the corresponding model during deployment. If you instead set model_id to an Amazon Simple Storage Service (Amazon S3) URL, DJL Serving will download the model artifacts from Amazon S3 and swap model_id to the actual location of the model artifacts. In your script, you can point to this value to load the pre-trained model. In our example, we use the latter option, because the LMI container uses s5cmd to download data from Amazon S3, which significantly speeds up model loading during deployment. See the following code:

# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running
template = jinja_env.from_string(Path("clip/serving.properties").open().read())
Path("clip/serving.properties").open("w").write(
    template.render(s3url=pretrained_model_location)
)
!pygmentize clip/serving.properties | cat -n
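After rendering, the serving.properties file might look roughly like the following. This is an illustrative sketch, not the file shipped with the example; the exact engine and options depend on your configuration.

engine=Python
# option.model_id is rendered from the s3url template variable above
option.model_id=s3://<your-bucket>/clip/model-artifacts/
option.tensor_parallel_degree=1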

In the model.py script, we load the model path using the model ID provided in the property file:

def load_clip_model(self, properties):
    if self.config.caption_model is None:
        model_path = properties["model_id"]

        # ... (additional logic omitted; refer to the details in the notebook)

        print(f'model path: {model_path}')
        model = CLIPModel.from_pretrained(model_path, cache_dir="/tmp")
        self.caption_processor = CLIPProcessor.from_pretrained(model_path)

After the model artifacts are prepared and uploaded to Amazon S3, you can deploy the CLIP model to SageMaker hosting with a few lines of code:

from sagemaker.model import Model

model = Model(
    image_uri=inference_image_uri,
    model_data=s3_code_artifact,
    role=role,
    name=model_name,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=endpoint_name,
)

When the endpoint is in service, you can invoke the endpoint with an input image and a list of labels as the input prompt to generate the label probabilities:

import base64
import json

def encode_image(img_file):
    with open(img_file, "rb") as image_file:
        img_str = base64.b64encode(image_file.read())
        base64_string = img_str.decode("latin1")
    return base64_string

def run_inference(endpoint_name, inputs):
    # smr_client is the boto3 'sagemaker-runtime' client created earlier in the notebook
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name, Body=json.dumps(inputs)
    )
    return response["Body"].read().decode('utf-8')

base64_string = encode_image(test_image)
inputs = {"image": base64_string, "prompt": ["a photo of cats", "a photo of dogs"]}
output = run_inference(endpoint_name, inputs)
print(json.loads(output)[0])

Use case 2: Automatic prompt generation from images
One innovative application using the multimodality models is to generate informative prompts from an image. In generative AI, a prompt refers to the input provided to a language model or other generative model to instruct it on what type of content or response is desired. The prompt is essentially a starting point or a set of instructions that guides the model’s generation process. It can take the form of a sentence, question, partial text, or any input that conveys the context or desired output to the model. The choice of a well-crafted prompt is pivotal in generating high-quality images with precision and relevance. Prompt engineering is the process of optimizing or crafting a textual input to achieve desired responses from a language model, often involving wording, format, or context adjustments.
Prompt engineering for image generation poses several challenges, including the following:

Defining visual concepts accurately – Describing visual concepts in words can sometimes be imprecise or ambiguous, making it difficult to convey the exact image desired. Capturing intricate details or complex scenes through textual prompts might not be straightforward.
Specifying desired styles effectively – Communicating specific stylistic preferences, such as mood, color palette, or artistic style, can be challenging through text alone. Translating abstract aesthetic concepts into concrete instructions for the model can be tricky.
Balancing complexity to prevent overloading the model – Elaborate prompts could confuse the model or lead to overloading it with information, affecting the generated output. Striking the right balance between providing sufficient guidance and avoiding overwhelming complexity is essential.

Therefore, crafting effective prompts for image generation is time consuming; it requires iterative experimentation and refinement to strike the right balance between precision and creativity, making it a resource-intensive task that relies heavily on human expertise.
The CLIP Interrogator is an automatic prompt engineering tool for images that combines CLIP and BLIP to optimize text prompts to match a given image. You can use the resulting prompts with text-to-image models like Stable Diffusion to create cool art. The prompts created by CLIP Interrogator offer a comprehensive description of the image, covering not only its fundamental elements but also the artistic style, the potential inspiration behind the image, the medium where the image could have been or might be used, and beyond. You can easily deploy the CLIP Interrogator solution on SageMaker to streamline the deployment process, and take advantage of the scalability, cost-efficiency, and robust security provided by the fully managed service. The following diagram shows the flow logic of this solution.

You can use the following notebook to deploy the CLIP Interrogator solution on SageMaker. Similarly, for CLIP model hosting, we use the SageMaker LMI container to host the solution on SageMaker using DJL Serving. In this example, we provided an additional input file with the model artifacts that specifies the models deployed to the SageMaker endpoint. You can choose different CLIP or BLIP models by passing the caption model name and the clip model name through the model_name.json file created with the following code:

model_names = {
    "caption_model_name": 'blip2-2.7b',  #@param ["blip-base", "blip-large", "git-large-coco"]
    "clip_model_name": 'ViT-L-14/openai'  #@param ["ViT-L-14/openai", "ViT-H-14/laion2b_s32b_b79k"]
}
with open("clipinterrogator/model_name.json", 'w') as file:
    json.dump(model_names, file)

The inference script model.py contains a handle function; DJL Serving runs your request by invoking this function. To prepare this entry point script, we adapted the code from the original clip_interrogator.py file and modified it to work with DJL Serving on SageMaker hosting. One update concerns the loading of the BLIP model. The BLIP and CLIP models are loaded via the load_caption_model() and load_clip_model() functions during the initialization of the Interrogator object. To load the BLIP model, we first download the model artifacts from Hugging Face and upload them to Amazon S3 as the target value of model_id in the properties file. This is because the BLIP model can be a large file; the blip2-opt-2.7b model, for example, is more than 15 GB in size. Downloading the model from Hugging Face during deployment would add time to endpoint creation. Therefore, we point model_id to the Amazon S3 location of the BLIP2 model and load the model from the model path specified in the properties file. Note that, during deployment, the model path is swapped to the local container path where the model artifacts were downloaded by DJL Serving from the Amazon S3 location. See the following code:

if "model_id" in properties and any(os.listdir(properties["model_id"])):
    model_path = properties["model_id"]

    # ... (additional logic omitted; refer to the details in the notebook)

    caption_model = Blip2ForConditionalGeneration.from_pretrained(model_path, torch_dtype=self.dtype)

Because the CLIP model isn’t very big in size, we use open_clip to load the model directly from Hugging Face, which is the same as the original clip_interrogator implementation:

self.clip_model, _, self.clip_preprocess = open_clip.create_model_and_transforms(
    clip_model_name,
    pretrained=clip_model_pretrained_name,
    precision='fp16' if config.device == 'cuda' else 'fp32',
    device=config.device,
    jit=False,
    cache_dir=config.clip_model_path,
)

We use similar code to deploy the CLIP Interrogator solution to a SageMaker endpoint and invoke the endpoint with an input image to get the prompts that can be used to generate similar images.
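As a hedged sketch of that invocation, reusing the encode_image helper and smr_client from the earlier CLIP example, the call could look like the following. The endpoint name and the payload key are assumptions for illustration and should match what model.py in the repository expects.

import json

base64_string = encode_image(test_image)
inputs = {"image": base64_string}

response = smr_client.invoke_endpoint(
    EndpointName=interrogator_endpoint_name,  # hypothetical name of the CLIP Interrogator endpoint
    Body=json.dumps(inputs),
)
generated_prompt = response["Body"].read().decode("utf-8")
print(generated_prompt)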
Let’s take the following image as an example. Using the deployed CLIP Interrogator endpoint on SageMaker, it generates the following text description: croissant on a plate, pexels contest winner, aspect ratio 16:9, cgsocietywlop, 8 h, golden cracks, the artist has used bright, picture of a loft in morning, object features, stylized border, pastry, french emperor.

We can further combine the CLIP Interrogator solution with Stable Diffusion and prompt engineering techniques—a whole new dimension of creative possibilities emerges. This integration allows us to not only describe images with text, but also manipulate and generate diverse variations of the original images. Stable Diffusion ensures controlled image synthesis by iteratively refining the generated output, and strategic prompt engineering guides the generation process towards desired outcomes.
In the second part of the notebook, we detail the steps to use prompt engineering to restyle images with the Stable Diffusion model (Stable Diffusion XL 1.0). We use the Stability AI SDK to deploy this model from SageMaker JumpStart after subscribing to this model on the AWS marketplace. Because this is a newer and better version for image generation provided by Stability AI, we can get high-quality images based on the original input image. Additionally, if we prefix the preceding description and add an additional prompt mentioning a known artist and one of his works, we get amazing results with restyling. The following image uses the prompt: This scene is a Van Gogh painting with The Starry Night style, croissant on a plate, pexels contest winner, aspect ratio 16:9, cgsocietywlop, 8 h, golden cracks, the artist has used bright, picture of a loft in morning, object features, stylized border, pastry, french emperor.

The following image uses the prompt: This scene is a Hokusai painting with The Great Wave off Kanagawa style, croissant on a plate, pexels contest winner, aspect ratio 16:9, cgsocietywlop, 8 h, golden cracks, the artist has used bright, picture of a loft in morning, object features, stylized border, pastry, french emperor.

Conclusion
The emergence of multimodality models, like CLIP and BLIP, and their applications are rapidly transforming the landscape of image-to-text conversion. Bridging the gap between visual and semantic information, they are providing us with the tools to unlock the vast potential of visual data and harness it in ways that were previously unimaginable.
In this post, we illustrated different applications of multimodality models. These range from enhancing the efficiency and accuracy of search in ecommerce platforms through automatic tagging and categorization to the generation of prompts for text-to-image models like Stable Diffusion. These applications open new horizons for creating unique and engaging content. We encourage you to learn more by exploring the various multimodality models on SageMaker and building a solution that is innovative for your business.

About the Authors
Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Sam Edwards, is a Cloud Engineer (AI/ML) at AWS Sydney specialized in machine learning and Amazon SageMaker. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. Outside of work, he enjoys playing racquet sports and traveling.
Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.
Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Meet Concept2Box: Bridging the Gap Between High-Level Concepts and Fine-Grained Entities in Knowledge Graphs – A Dual Geometric Approach

A great deal of research has gone into representing large sets of connected data, such as knowledge graphs. These methods, known as Knowledge Graph Embeddings (KGE), make such data usable for a range of practical, real-world applications.

Traditional methods have often overlooked a significant aspect of knowledge graphs, which is the presence of two distinct types of information: high-level concepts that relate to the overall structure (ontology view) and specific individual entities (instance view). Typically, these methods treat all nodes in the knowledge graph as vectors within a single hidden space. 

The accompanying figure illustrates a two-view knowledge graph, which comprises (1) an ontology-view knowledge graph containing high-level concepts and meta-relations, (2) an instance-view knowledge graph containing specific, detailed instances and relations, and (3) a collection of connections (cross-view links) between the two views. Given such a graph, Concept2Box is designed to learn dual geometric embeddings: each concept is represented as a geometric box in the latent space, while entities are represented as point vectors.

In contrast to using a single geometric representation that cannot adequately capture the structural distinctions between two perspectives within a knowledge graph and lacks probabilistic meaning in relation to the granularity of concepts, the authors introduce Concept2Box. This innovative approach simultaneously embeds both views of a knowledge graph by employing dual geometric representations. Concepts are represented using box embeddings, enabling the learning of hierarchical structures and complex relationships like overlap and disjointness.

The volume of these boxes corresponds to the granularity of concepts, whereas entities are represented as vectors. To bridge the gap between concept box embeddings and entity vector embeddings, a novel vector-to-box distance metric is proposed, and both embeddings are learned jointly. Experimental evaluations on both the publicly available DBpedia knowledge graph and a newly created industrial knowledge graph underscore the effectiveness of Concept2Box. The model is built to handle differences in how information is structured within a knowledge graph. Today's knowledge graphs, however, pose a further challenge: different parts of a graph can not only have different structures but also use different languages, making them even more complex to understand and work with. Future work is expected to address this multilingual setting.

Check out the Paper for more details.

Researchers at the Shibaura Institute of Technology Revolutionize Face Direction Detection with Deep Learning: Navigating Challenges of Hidden Facial Features and Expanding Horizon Angles

In computer vision and human-computer interaction, the critical task of face orientation estimation has emerged as a pivotal component with multifaceted applications. One particularly notable domain where this technology plays a vital role is in driver monitoring systems aimed at enhancing road safety. These systems harness the power of machine learning models to continuously analyze a driver’s face orientation in real-time, determining their attentiveness to the road or any distractions that may be at play, such as texting or drowsiness. When deviations from the desired orientation are detected, these systems can issue alerts or activate safety mechanisms, significantly reducing the risk of accidents.

Traditionally, face orientation estimation relied upon recognizing distinctive facial features and tracking their movements to infer orientation. However, these conventional methods encountered limitations, such as privacy concerns and their susceptibility to failure when individuals wore masks or when their heads assumed unexpected positions.

In response to these challenges, researchers from the Shibaura Institute of Technology in Japan have pioneered a novel AI solution. Their groundbreaking approach leverages deep learning techniques and integrates an additional sensor into the model training process. This innovative addition accurately identifies any facial orientation from point cloud data and achieves this remarkable feat using a relatively small training data set.

The researchers harnessed the capabilities of a 3D depth camera, similar to previous methods, but introduced a game-changer during the training process: gyroscopic sensors. As data flowed in, the point clouds captured by the depth camera were meticulously paired with precise information on face orientation acquired from a gyroscopic sensor attached to the back of the head. This combination yielded an accurate, consistent measure of the head's horizontal rotation angle.

The key to their success lay in the vast dataset they amassed, representing a diverse array of head angles. This comprehensive data pool enabled the training of a highly accurate model capable of recognizing a broader spectrum of head orientations than traditional methods, which were limited to just a handful. Moreover, thanks to the gyroscopic sensor's precision, only a relatively modest number of samples were required to achieve this remarkable versatility.

In conclusion, the fusion of deep learning techniques with gyroscopic sensors has ushered in a new era of face orientation estimation, transcending the limitations of traditional methods. With its ability to recognize an extensive range of head orientations and maintain privacy, this innovative approach holds great promise not only for driver monitoring systems but also for revolutionizing human-computer interaction and healthcare applications. As research in this field advances, we can look forward to safer roads, more immersive virtual experiences, and enhanced healthcare diagnostics, all thanks to the ingenuity of those pushing the boundaries of technology.

Check out the Paper and Reference Article. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 31k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.
The post Researchers at the Shibaura Institute of Technology Revolutionize Face Direction Detection with Deep Learning: Navigating Challenges of Hidden Facial Features and Expanding Horizon Angles appeared first on MarkTechPost.

Researchers from ETH Zurich and Microsoft Introduce SCREWS: An Artific …

Large Language Models (LLMs) have succeeded in several different reasoning tasks. Because the output is only occasionally accurate on the first try, it is sometimes necessary to iteratively adjust the LLM results to guarantee that the intended aim is met. These refinement techniques assume that consecutive results (from the same model, an external model, or some tool) lead to improved performance. However, there is no assurance that later versions will always be better; as Figure 1 shows, refinement can turn a correct answer into an incorrect one. This motivates a selection technique that lets the model fall back to an earlier outcome. Furthermore, prior research on iterative refining frequently uses a single, fixed reasoning technique, whereas humans are more adaptable.

Figure 1: A case study illustrating how Conditional Resampling (also known as refinement) may improperly modify the initial response. A selection module can instead choose the original response, which in this case is the correct one.

A product manager may use a brainstorming technique to generate several ideas before switching to a prioritization technique to rank them according to their viability or effect. Similarly, a student preparing for an exam might use deductive reasoning to solve problems and inductive reasoning to confirm the results. The authors therefore suggest a modular strategy for answer refinement that makes it possible to try various tactics. In this paper, researchers from ETH Zurich and Microsoft Semantic Machines present SCREWS, a modular framework for reasoning about revisions. Sampling, Conditional Resampling, and Selection are the three core components of the architecture, introduced in detail in Figure 2. They instantiate SCREWS by fixing the submodules for each module (for example, choosing Chain of Thought for Sampling) for a given task and input sequence.

Figure 2 presents a high-level picture of the modular SCREWS system for reasoning about revisions. The three substantial boxes (or “modules”) each contain a number of choices (or “submodules”). Many previous efforts, including Self-Refine, Least to Most, LLMs Know (Mostly), Self-Consistency, Self-Improve, PHP CoT, Self-Correct, Socratic CoT, Programme of Thoughts, and many more, may be seen as examples of the framework. (…) denotes additional sub-components that may be added to each module, including, but not limited to, cached memory or online search for the Sampling module, a fine-tuned model or an external verifier for Conditional Resampling, and selection based on humans or an oracle for the Selection module.

Sampling’s first outputs are handed on to Conditional Resampling, which determines whether to create a revision based on the original sample and does so if necessary. The Selection module then chooses the best from all the samples and revisions. Given the modular design of their framework, additional framework elements can be used to enhance several newly suggested self-refining approaches. One example is the combination of their model-based selection technique and self-refinement method, which can improve overall performance. They use ChatGPT or GPT-4 to assess SCREWS on various reasoning tasks, including multi-hop question answering, arithmetic reasoning, and code debugging. 
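To make the modular flow concrete, here is a minimal Python sketch of the three-module pipeline; the interfaces are assumptions for illustration (the paper does not prescribe these exact signatures), and each callable would typically wrap an LLM such as ChatGPT or GPT-4.

from typing import Callable, List, Optional

def screws(question: str,
           sample: Callable[[str], List[str]],
           conditional_resample: Callable[[str, str], Optional[str]],
           select: Callable[[str, List[str]], str]) -> str:
    # Sampling: draw one or more initial answers (for example, chain-of-thought samples).
    candidates = sample(question)
    # Conditional Resampling: decide per answer whether a revision is needed;
    # a submodule may return None to keep the original answer.
    revisions = [r for r in (conditional_resample(question, a) for a in candidates) if r is not None]
    # Selection: choose among both originals and revisions, so a bad revision
    # can be rejected in favor of the earlier, more reliable answer.
    return select(question, candidates + revisions)

Because each argument is just a callable, swapping in a different sampling strategy, an external verifier for resampling, or a model-based selector only changes one submodule, which is the heterogeneity the framework argues for.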

Compared to the standard sample and resampling procedures, their suggested solutions produce significant improvements (10–15%). They show the value of heterogeneous resampling, showing how it may influence the model’s logic and substantially improve the baselines at a very low total cost. They also explain the significance of a model-based selection approach, a crucial element of contemporary LLMs that enables the model to revert to earlier, more certain outputs.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 31k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.
The post Researchers from ETH Zurich and Microsoft Introduce SCREWS: An Artificial Intelligence Framework for Enhancing the Reasoning in Large Language Models appeared first on MarkTechPost.

Improve prediction quality in custom classification models with Amazon …

Artificial intelligence (AI) and machine learning (ML) have seen widespread adoption across enterprise and government organizations. Processing unstructured data has become easier with the advancements in natural language processing (NLP) and user-friendly AI/ML services like Amazon Textract, Amazon Transcribe, and Amazon Comprehend. Organizations have started to use AI/ML services like Amazon Comprehend to build classification models with their unstructured data to get deep insights that they didn't have before. Although you can use pre-trained models with minimal effort, without proper data curation and model tuning you can't realize the full benefits of AI/ML models.
In this post, we explain how to build and optimize a custom classification model using Amazon Comprehend. We demonstrate this using an Amazon Comprehend custom classification to build a multi-label custom classification model, and provide guidelines on how to prepare the training dataset and tune the model to meet performance metrics such as accuracy, precision, recall, and F1 score. We use the Amazon Comprehend model training output artifacts like a confusion matrix to tune model performance and guide you on improving your training data.
Solution overview
This solution presents an approach to building an optimized custom classification model using Amazon Comprehend. We go through several steps, including data preparation, model creation, model performance metric analysis, and optimizing inference based on our analysis. We use an Amazon SageMaker notebook and the AWS Management Console to complete some of these steps.
We also go through best practices and optimization techniques during data preparation, model building, and model tuning.
Prerequisites
If you don’t have a SageMaker notebook instance, you can create one. For instructions, refer to Create an Amazon SageMaker Notebook Instance.
Prepare the data
For this analysis, we use the Toxic Comment Classification dataset from Kaggle. This dataset contains 6 labels and 158,571 data points. However, each label has less than 10% of the total data as positive examples, and two of the labels have less than 1%.
We convert the existing Kaggle dataset to the Amazon Comprehend two-column CSV format with the labels split using a pipe (|) delimiter. Amazon Comprehend expects at least one label for each data point. In this dataset, we encounter several data points that don’t fall under any of the provided labels. We create a new label called clean and assign any of the data points that aren’t toxic to be positive with this label. Finally, we split the curated datasets into training and test datasets using an 80/20 ratio split per label.
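As a rough illustration of what the Data-Preparation notebook does, the following pandas sketch builds the two-column, pipe-delimited CSV, adds the clean label, and performs the per-label 80/20 split; the column names come from the public Kaggle file, and the output file names are placeholders.

import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")  # Kaggle Toxic Comment Classification training file

# Build the pipe-delimited label column; rows with no toxic label become "clean".
def to_labels(row):
    active = [label for label in LABELS if row[label] == 1]
    return "|".join(active) if active else "clean"

df["labels"] = df.apply(to_labels, axis=1)
out = df[["labels", "comment_text"]]  # Amazon Comprehend expects labels first, then the text

# Approximate 80/20 split per label combination so each label keeps its proportion.
train_parts, test_parts = [], []
for _, group in out.groupby("labels"):
    group = group.sample(frac=1.0, random_state=42)  # shuffle within the label group
    cut = int(len(group) * 0.8)
    train_parts.append(group.iloc[:cut])
    test_parts.append(group.iloc[cut:])

pd.concat(train_parts).to_csv("toxic-train.csv", index=False, header=False)
pd.concat(test_parts).to_csv("toxic-test.csv", index=False, header=False)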
We will be using the Data-Preparation notebook. The following steps use the Kaggle dataset and prepare the data for our model.

On the SageMaker console, choose Notebook instances in the navigation pane.
Select the notebook instance you have configured and choose Open Jupyter.
On the New menu, choose Terminal.

Run the following commands in the terminal to download the required artifacts for this post:

cd SageMaker
wget https://aws-ml-blog.s3.amazonaws.com/artifacts/amazon-comprehend-improve-prediction-quality/comprehend-blog-artifacts.zip
unzip comprehend-blog-artifacts.zip
rm comprehend-blog-artifacts.zip
mkdir assets

Close the terminal window.

You should see three notebooks and a train.csv file.

Choose the notebook Data-Preparation.ipynb.
Run all the steps in the notebook.

These steps prepare the raw Kaggle dataset to serve as curated training and test datasets. Curated datasets will be stored in the notebook and Amazon Simple Storage Service (Amazon S3).
Consider the following data preparation guidelines when dealing with large-scale multi-label datasets:

Datasets must have a minimum of 10 samples per label.
Amazon Comprehend accepts a maximum of 100 labels. This is a soft limit that can be increased.
Ensure the dataset file is correctly formatted with the proper delimiter. Incorrect delimiters can introduce blank labels.
All the data points must have labels.
Training and test datasets should have balanced data distribution per label. Don’t use random distribution because it might introduce bias in the training and test datasets.

Build a custom classification model
We use the curated training and test datasets we created during the data preparation step to build our model. The following steps create an Amazon Comprehend multi-label custom classification model:

On the Amazon Comprehend console, choose Custom classification in the navigation pane.
Choose Create new model.
For Model name, enter toxic-classification-model.
For Version name, enter 1.
For Annotation and data format, choose Using Multi-label mode.
For Training dataset, enter the location of the curated training dataset on Amazon S3.
Choose Customer provided test dataset and enter the location of the curated test data on Amazon S3.
For Output data, enter the Amazon S3 location.
For IAM role, select Create an IAM role and specify the name suffix as “comprehend-blog”.
Choose Create to start the custom classification model training and model creation.

The following screenshot shows the custom classification model details on the Amazon Comprehend console.
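If you prefer to script this step instead of using the console, a minimal boto3 sketch looks like the following; the bucket paths and role ARN are placeholders to replace with your own values.

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_document_classifier(
    DocumentClassifierName="toxic-classification-model",
    VersionName="1",
    LanguageCode="en",
    Mode="MULTI_LABEL",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/comprehend-blog",  # placeholder role ARN
    InputDataConfig={
        "DataFormat": "COMPREHEND_CSV",
        "S3Uri": "s3://YOUR_BUCKET/toxic-train.csv",     # curated training dataset
        "TestS3Uri": "s3://YOUR_BUCKET/toxic-test.csv",  # customer-provided test dataset
        "LabelDelimiter": "|",
    },
    OutputDataConfig={"S3Uri": "s3://YOUR_BUCKET/output/"},
)
print(response["DocumentClassifierArn"])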

Tune for model performance
The following screenshot shows the model performance metrics. It includes key metrics like precision, recall, F1 score, accuracy, and more.

After the model is trained and created, it generates the output.tar.gz file, which contains the labels from the dataset as well as the confusion matrix for each of the labels. To further tune the model's prediction performance, you have to understand your model's prediction probabilities for each class. To do this, you need to create an analysis job to identify the scores Amazon Comprehend assigned to each of the data points.
Complete the following steps to create an analysis job:

On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
Choose Create job.
For Name, enter toxic_train_data_analysis_job.
For Analysis type, choose Custom classification.
For Classification models and flywheels, specify toxic-classification-model.
For Version, specify 1.
For Input data S3 location, enter the location of the curated training data file.
For Input format, choose One document per line.
For Output data S3 location, enter the location.
For Access Permissions, select Use an existing IAM Role and pick the role created previously.
Choose Create job to start the analysis job.
Select the analysis job to view the job details, and take note of the job ID under Job details. We will use this job ID in the next step.

Repeat these steps to start an analysis job for the curated test data. We use the prediction outputs from our analysis jobs to learn about our model's prediction probabilities. Make note of the job IDs of both the training and test analysis jobs.
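The same analysis jobs can also be started programmatically; here is a minimal boto3 sketch with placeholder ARNs and bucket paths.

import boto3

comprehend = boto3.client("comprehend")

job = comprehend.start_document_classification_job(
    JobName="toxic_train_data_analysis_job",
    DocumentClassifierArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/toxic-classification-model/version/1",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://YOUR_BUCKET/toxic-train.csv",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://YOUR_BUCKET/analysis-output/"},
    DataAccessRoleArn="arn:aws:iam::111122223333:role/comprehend-blog",  # placeholder role ARN
)
print(job["JobId"])  # note this job ID for the threshold analysis notebook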
We use the Model-Threshold-Analysis.ipynb notebook to test the outputs at all possible thresholds and score the output based on the prediction probability using scikit-learn's precision_recall_curve function. Additionally, we can compute the F1 score at each threshold.
We need the Amazon Comprehend analysis job IDs as input for the Model-Threshold-Analysis notebook. You can get the job IDs from the Amazon Comprehend console. Run all the steps in the Model-Threshold-Analysis notebook to observe the thresholds for all the classes.
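Conceptually, the per-label threshold search in the notebook looks something like the following sketch (variable names are illustrative): for each label, compute the precision-recall curve from the ground truth and the scores returned by the analysis job, then keep the threshold with the highest F1 score.

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true, y_scores):
    # Return the probability threshold that maximizes F1 for one label.
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision and recall have one more entry than thresholds; drop the last point.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

# y_true: 0/1 ground truth for the threat label on the test set
# y_scores: the score Amazon Comprehend assigned to threat for each document
# threshold, f1 = best_threshold(y_true, y_scores)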

Notice how precision goes up as the threshold goes up, while the inverse occurs with recall. To find the balance between the two, we use the F1 score, which has visible peaks in its curve. The peaks in the F1 score correspond to a particular threshold that can improve the model's performance. Notice how most of the labels fall around the 0.5 mark for the threshold, except for the threat label, which has a threshold around 0.04.

We can then use this threshold for specific labels that are underperforming with the default 0.5 threshold. Using the optimized thresholds, the results of the model on the test data improve for the threat label from 0.00 to 0.24. Instead of a common benchmark (a standard value like > 0.7) for all the labels, we use the threshold with the maximum F1 score for each label to determine positive vs. negative for that label.

Handling underrepresented classes
Another approach that’s effective for an imbalanced dataset is oversampling. By oversampling the underrepresented class, the model sees the underrepresented class more often and emphasizes the importance of those samples. We use the Oversampling-underrepresented.ipynb notebook to optimize the datasets.
For this dataset, we tested how the model’s performance on the evaluation dataset changes as we provide more samples. We use the oversampling technique to increase the occurrence of underrepresented classes to improve the performance.

In this particular case, we tested on 10, 25, 50, 100, 200, and 500 positive examples. Notice that although we are repeating data points, we are inherently improving the performance of the model by emphasizing the importance of the underrepresented class.
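A minimal pandas sketch of the oversampling idea is shown below; the column and label names are illustrative, and the notebook's actual implementation may differ.

import pandas as pd

def oversample(df, label_col, label, target_count, seed=42):
    # Repeat positive examples of an underrepresented label until it has roughly
    # target_count positives, leaving the rest of the data unchanged.
    positives = df[df[label_col].str.contains(label, na=False)]
    if len(positives) >= target_count:
        return df
    extra = positives.sample(n=target_count - len(positives), replace=True, random_state=seed)
    return pd.concat([df, extra]).sample(frac=1.0, random_state=seed)  # shuffle after appending

# Example: grow the threat label to 500 positive examples before training.
# train_df = oversample(train_df, label_col="labels", label="threat", target_count=500)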
Cost
With Amazon Comprehend, you pay as you go based on the number of text characters processed. Refer to Amazon Comprehend Pricing for actual costs.
Clean up
When you’re finished experimenting with this solution, clean up your resources to delete all the resources deployed in this example. This helps you avoid continuing costs in your account.
Conclusion
In this post, we have provided best practices and guidance on data preparation, model tuning using prediction probabilities and techniques to handle underrepresented data classes. You can use these best practices and techniques to improve the performance metrics of your Amazon Comprehend custom classification model.
For more information about Amazon Comprehend, visit Amazon Comprehend developer resources to find video resources and blog posts, and refer to AWS Comprehend FAQs.

About the Authors
Sathya Balakrishnan is a Sr. Customer Delivery Architect in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.
Prince Mallari is an NLP Data Scientist in the Professional Services team at AWS, specializing in applications of NLP for public sector customers. He is passionate about using ML as a tool to allow customers to be more productive. In his spare time, he enjoys playing video games and developing one with his friends.

Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

Large language models (LLMs) have captured the imagination and attention of developers, scientists, technologists, entrepreneurs, and executives across several industries. These models can be used for question answering, summarization, translation, and more in applications such as conversational agents for customer support, content creation for marketing, and coding assistants.
Recently, Meta released Llama 2 for both researchers and commercial entities, adding to the list of other LLMs, including MosaicML MPT and Falcon. In this post, we walk through how to fine-tune Llama 2 on AWS Trainium, a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron-LM), the various configurations we used, and the throughput results we saw.
About the Llama 2 model
Similar to the previous Llama 1 model and other models like GPT, Llama 2 uses the Transformer’s decoder-only architecture. It comes in three sizes: 7 billion, 13 billion, and 70 billion parameters. Compared to Llama 1, Llama 2 doubles context length from 2,000 to 4,000, and uses grouped-query attention (only for 70B). Llama 2 pre-trained models are trained on 2 trillion tokens, and its fine-tuned models have been trained on over 1 million human annotations.
Distributed training of Llama 2
To accommodate Llama 2 with 2,000 and 4,000 sequence lengths, we implemented the script using NeMo Megatron for Trainium, which supports data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). Specifically, with the new implementation of features like untied word embeddings, rotary embeddings, RMSNorm, and SwiGLU activation, we use the generic GPT Neuron Megatron-LM script to support the Llama 2 training script.
Our high-level training procedure is as follows: for our training environment, we use a multi-instance cluster managed by the SLURM system for distributed training and scheduling under the NeMo framework.
First, download the Llama 2 model and training datasets and preprocess them using the Llama 2 tokenizer. For example, to use the RedPajama dataset, use the following command:

wget https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl

python nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py

For detailed guidance on downloading models and the arguments of the preprocessing script, refer to Download LlamaV2 dataset and tokenizer.
Next, compile the model:

sbatch --nodes 4 compile.slurm ./llama_7b.sh

After the model is compiled, launch the training job with the following script that is already optimized with the best configuration and hyperparameters for Llama 2 (included in the example code):

sbatch --nodes 4 run.slurm ./llama_7b.sh

Lastly, we monitor TensorBoard to keep track of training progress:

tensorboard --logdir ./

For the complete example code and scripts we mentioned, refer to the Llama 7B tutorial and NeMo code in the Neuron SDK to walk through more detailed steps.
Fine-tuning experiments
We fine-tuned the 7B model on the OSCAR (Open Super-large Crawled ALMAnaCH coRpus) and QNLI (Question-answering NLI) datasets in a Neuron 2.12 environment (PyTorch). For each of the 2,000 and 4,000 sequence lengths, we optimized some configurations, such as batch size and gradient accumulation, for training efficiency. As a fine-tuning strategy, we adopted full fine-tuning of all parameters (about 500 steps), which can be extended to pre-training with longer steps and larger datasets (for example, 1T RedPajama). Sequence parallelism can also be enabled to allow NeMo Megatron to successfully fine-tune models with a larger sequence length of 4,000. The following table shows the configuration and throughput results of the Llama 7B fine-tuning experiment. The throughput scales almost linearly as the number of instances increases up to 4.

Distributed Library | Dataset | Sequence Length | Number of Instances | Tensor Parallel | Data Parallel | Pipeline Parallel | Global Batch Size | Throughput (seq/s)
Neuron NeMo Megatron | OSCAR | 4096 | 1 | 8 | 4 | 1 | 256 | 3.7
Neuron NeMo Megatron | OSCAR | 4096 | 2 | 8 | 4 | 1 | 256 | 7.4
Neuron NeMo Megatron | OSCAR | 4096 | 4 | 8 | 4 | 1 | 256 | 14.6
Neuron NeMo Megatron | QNLI | 4096 | 4 | 8 | 4 | 1 | 256 | 14.1
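To make the near-linear scaling concrete, the following snippet computes speedup and scaling efficiency from the OSCAR rows of the preceding table.

# Throughput (seq/s) for 1, 2, and 4 instances on OSCAR at sequence length 4096.
measurements = {1: 3.7, 2: 7.4, 4: 14.6}
baseline = measurements[1]

for instances, throughput in measurements.items():
    speedup = throughput / baseline
    efficiency = speedup / instances
    print(f"{instances} instance(s): {speedup:.2f}x speedup, {efficiency:.0%} scaling efficiency")
# 1 instance(s): 1.00x speedup, 100% scaling efficiency
# 2 instance(s): 2.00x speedup, 100% scaling efficiency
# 4 instance(s): 3.95x speedup, 99% scaling efficiency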

The last step is to verify the accuracy with the base model. We implemented a reference script for GPU experiments and confirmed the training curves for GPU and Trainium matched as shown in the following figure. The figure illustrates loss curves over the number of training steps on the QNLI dataset. Mixed-precision was adopted for GPU (blue), and bf16 with default stochastic rounding for Trainium (orange).

Conclusion
In this post, we showed that Trainium delivers high performance and cost-effective fine-tuning of Llama 2. For more resources on using Trainium for distributed pre-training and fine-tuning your generative AI models using NeMo Megatron, refer to AWS Neuron Reference for NeMo Megatron.

About the Authors
Hao Zhou is a Research Scientist with Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization, and generative AI techniques to various real-world problems. He holds a PhD in Electrical Engineering from Northwestern University.
Karthick Gopalswamy is an Applied Scientist with AWS. Before AWS, he worked as a scientist in Uber and Walmart Labs with a major focus on mixed integer optimization. At Uber, he focused on optimizing the public transit network with on-demand SaaS products and shared rides. At Walmart Labs, he worked on pricing and packing optimizations. Karthick has a PhD in Industrial and Systems Engineering with a minor in Operations Research from North Carolina State University. His research focuses on models and methodologies that combine operations research and machine learning.
Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.
Youngsuk Park is a Sr. Applied Scientist at AWS Annapurna Labs, working on developing and training foundation models on AI accelerators. Prior to that, Dr. Park worked on R&D for Amazon Forecast in AWS AI Labs as a lead scientist. His research lies in the interplay between machine learning, foundational models, optimization, and reinforcement learning. He has published over 20 peer-reviewed papers in top venues, including ICLR, ICML, AISTATS, and KDD, with the service of organizing workshop and presenting tutorials in the area of time series and LLM training. Before joining AWS, he obtained a PhD in Electrical Engineering from Stanford University.
Yida Wang is a principal scientist in the AWS AI team of Amazon. His research interest is in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. The mission is to bridge the high-level models from various frameworks and low-level hardware platforms including CPUs, GPUs, and AI accelerators, so that different models can run in high performance on different devices.
Jun (Luke) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and Data Science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and worked as the CEO and Chief Scientist in 2019–2021. Before joining the industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015–2018, he worked as a program director at the US NSF in charge of its big data program.
Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Simplify medical image classification using Amazon SageMaker Canvas

Analyzing medical images plays a crucial role in diagnosing and treating diseases. The ability to automate this process using machine learning (ML) techniques allows healthcare professionals to more quickly diagnose certain cancers, coronary diseases, and ophthalmologic conditions. However, one of the key challenges faced by clinicians and researchers in this field is the time-consuming and complex nature of building ML models for image classification. Traditional methods require coding expertise and extensive knowledge of ML algorithms, which can be a barrier for many healthcare professionals.
To address this gap, we used Amazon SageMaker Canvas, a visual tool that allows medical clinicians to build and deploy ML models without coding or specialized knowledge. This user-friendly approach eliminates the steep learning curve associated with ML, which frees up clinicians to focus on their patients.
Amazon SageMaker Canvas provides a drag-and-drop interface for creating ML models. Clinicians can select the data they want to use, specify the desired output, and then watch as it automatically builds and trains the model. Once the model is trained, it generates accurate predictions.
This approach is ideal for medical clinicians who want to use ML to improve their diagnosis and treatment decisions. With Amazon SageMaker Canvas, they can use the power of ML to help their patients, without needing to be an ML expert.
Medical image classification directly impacts patient outcomes and healthcare efficiency. Timely and accurate classification of medical images allows for early detection of diseases, which aids in effective treatment planning and monitoring. Moreover, the democratization of ML through accessible interfaces like Amazon SageMaker Canvas enables a broader range of healthcare professionals, including those without extensive technical backgrounds, to contribute to the field of medical image analysis. This inclusive approach fosters collaboration and knowledge sharing and ultimately leads to advancements in healthcare research and improved patient care.
In this post, we’ll explore the capabilities of Amazon SageMaker Canvas in classifying medical images, discuss its benefits, and highlight real-world use cases that demonstrate its impact on medical diagnostics.
Use case
Skin cancer is a serious and potentially deadly disease, and the earlier it is detected, the better the chance of successful treatment. Statistically, skin cancer (for example, basal and squamous cell carcinomas) is one of the most common cancer types and leads to hundreds of thousands of deaths worldwide each year. It manifests itself through the abnormal growth of skin cells.
However, early diagnosis drastically increases the chances of recovery. Moreover, it may render surgical, radiographic, or chemotherapeutic therapies unnecessary or lessen their overall usage, helping to reduce healthcare costs.
The process of diagnosing skin cancer starts with a procedure called a dermoscopy[1], which inspects the general shape, size, and color characteristics of skin lesions. Suspected lesions then undergo further sampling and histological tests for confirmation of the cancer cell type. Doctors use multiple methods to detect skin cancer, starting with visual detection. The American Center for the Study of Dermatology developed a guide for the possible shape of melanoma, which is called ABCD (asymmetry, border, color, diameter) and is used by doctors for initial screening of the disease. If a suspected skin lesion is found, then the doctor takes a biopsy of the visible lesion on the skin and examines it microscopically for a benign or malignant diagnosis and the type of skin cancer. Computer vision models can play a valuable role in helping to identify suspicious moles or lesions, which enables earlier and more accurate diagnosis.
Creating a cancer detection model is a multi-step process, as outlined below:

Gather a large dataset of images from healthy skin and skin with various types of cancerous or precancerous lesions. This dataset needs to be carefully curated to ensure accuracy and consistency.
Use computer vision techniques to preprocess the images and extract relevant features to differentiate between healthy and cancerous skin.
Train an ML model on the preprocessed images, using a supervised learning approach to teach the model to distinguish between different skin types.
Evaluate the performance of the model using a variety of metrics, such as precision and recall, to ensure that it accurately identifies cancerous skin and minimizes false positives.
Integrate the model into a user-friendly tool that could be used by dermatologists and other healthcare professionals to aid in the detection and diagnosis of skin cancer.

Overall, the process of developing a skin cancer detection model from scratch typically requires significant resources and expertise. This is where Amazon SageMaker Canvas can help reduce the time and effort for steps 2–5.
Solution overview
To demonstrate the creation of a skin cancer computer vision model without writing any code, we use a dermatoscopy skin cancer image dataset published by Harvard Dataverse. We use the dataset, which can be found at HAM10000 and consists of 10,015 dermatoscopic images, to build a skin cancer classification model that predicts skin cancer classes. A few key points about the dataset:

The dataset serves as a training set for academic ML purposes.
It includes a representative collection of all important diagnostic categories in the realm of pigmented lesions.
A few categories in the dataset are: Actinic keratoses and intraepithelial carcinoma / Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc)
More than 50% of the lesions in the dataset are confirmed through histopathology (histo).
The ground truth for the rest of the cases is determined through follow-up examination (follow_up), expert consensus (consensus), or confirmation by in vivo confocal microscopy (confocal).
The dataset includes lesions with multiple images, which can be tracked using the lesion_id column within the HAM10000_metadata file.

We showcase how to simplify image classification for multiple skin cancer categories without writing any code using Amazon SageMaker Canvas. Given an image of a skin lesion, SageMaker Canvas image classification automatically classifies the image as benign or possibly cancerous.
Prerequisites

Access to an AWS account with permissions to create the resources described in the steps section.
An AWS Identity and Access Management (AWS IAM) user with full permissions to use Amazon SageMaker.

Walkthrough

Set-up SageMaker domain

Create an Amazon SageMaker domain using steps outlined here.
Download the HAM10000 dataset.

Set-up datasets

Create an Amazon Simple Storage Service (Amazon S3) bucket with a unique name, such as image-classification-<ACCOUNT_ID>, where ACCOUNT_ID is your unique AWS account number.

Figure 1 Creating bucket

In this bucket create two folders: training-data and test-data.

Figure 2 Create folders

Under training-data, create seven folders for each of the skin cancer categories identified in the dataset: akiec, bcc, bkl, df, mel, nv, and vasc.

Figure 3 Folder View

The dataset includes lesions with multiple images, which can be tracked using the lesion_id column within the HAM10000_metadata file. Using this metadata, copy the corresponding images into the correct class folder (for example, you may start with 100 images for each classification).
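A small boto3 and pandas sketch of this copy step is shown below; it assumes the metadata file exposes image_id and dx (diagnosis) columns, as in the public HAM10000 release, and the bucket name and local image paths are placeholders.

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "image-classification-111122223333"  # placeholder account ID

meta = pd.read_csv("HAM10000_metadata.csv")   # includes image_id and dx columns

# Upload, for example, the first 100 images of each diagnosis class into
# training-data/<class>/ so SageMaker Canvas can label them by folder name.
for dx, group in meta.groupby("dx"):
    for image_id in group["image_id"].head(100):
        local_path = f"images/{image_id}.jpg"  # local download location (placeholder)
        s3.upload_file(local_path, bucket, f"training-data/{dx}/{image_id}.jpg")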

Figure 4 Listing Objects to import (Sample Images)

Use Amazon SageMaker Canvas

Go to the Amazon SageMaker service in the console and select Canvas from the list. Once you are on the Canvas page, choose the Open Canvas button.

Figure 5 Navigate to Canvas

Once you are on the Canvas page, select My models and then choose New Model on the right of your screen.

Figure 6 Creation of Model

A new pop-up window opens, where we enter image_classify as the model name and select Image analysis under Problem type.

Import the dataset

On the next page, select Create dataset, name the dataset image_classify in the pop-up box, and choose the Create button.

Figure 7 Creating dataset

On the next page, change the Data Source to Amazon S3. You can also directly upload the images (i.e., Local upload).

Figure 8 Import Dataset from S3 buckets

When you select Amazon S3, you'll get the list of buckets present in your account. Select the parent bucket that holds the dataset in subfolders (for example, image-classify-2023) and choose the Import data button. This allows Amazon SageMaker Canvas to quickly label the images based on the folder names.
Once the dataset is successfully imported, you'll see the value in the Status column change from Processing to Ready.
Now select your dataset by choosing Select dataset at the bottom of your page.

Build your model

On the Build page, you should see your data imported and labelled as per the folder name in Amazon S3.

Figure 9 Labelling of Amazon S3 data

Choose the build option (the red-highlighted content in the following image) and you'll see two ways to build the model: Quick build and Standard build. As the name suggests, the Quick build option favors speed over accuracy and takes around 15 to 30 minutes to build the model. Standard build prioritizes accuracy over speed, with model building taking from 45 minutes to 4 hours to complete. Standard build runs experiments using different combinations of hyperparameters, generates many models in the backend (using SageMaker Autopilot functionality), and then picks the best model.
Select Standard build to start building the model. It takes around 2–5 hours to complete.

Figure 10 Doing Standard build

Once model build is complete, you can see an estimated accuracy as shown in Figure 11.

Figure 11 Model prediction

If you select the Scoring tab, it should provide you with insights into the model accuracy. We can also select the Advanced metrics button on the Scoring tab to view the precision, recall, and F1 score (a balanced measure of accuracy that takes class balance into account).
The advanced metrics that Amazon SageMaker Canvas shows you depend on whether your model performs numeric, categorical, image, text, or time series forecasting predictions on your data. In this case, we believe recall is more important than precision because missing a cancer detection is far more dangerous than a false positive. Categorical prediction, such as 2-category or 3-category prediction, refers to the mathematical concept of classification. The advanced metric recall is the fraction of true positives (TP) out of all the actual positives (TP + false negatives). It measures the proportion of positive instances that were correctly predicted as positive by the model. Refer to A deep dive into Amazon SageMaker Canvas advanced metrics for a deep dive on the advanced metrics.
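As a quick illustration of why recall is the metric to watch here, consider hypothetical counts for one cancer class (the numbers below are made up for the example).

# Hypothetical counts for the mel (melanoma) class on a test set.
true_positives = 80    # melanomas correctly flagged
false_negatives = 20   # melanomas the model missed
false_positives = 30   # benign lesions incorrectly flagged

recall = true_positives / (true_positives + false_negatives)     # 0.80
precision = true_positives / (true_positives + false_positives)  # about 0.73
f1 = 2 * precision * recall / (precision + recall)               # about 0.76

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")

A low recall would mean many missed melanomas, which is exactly the failure mode we most want to avoid in this use case.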

Figure 12 Advanced metrics
This completes the model creation step in Amazon SageMaker Canvas.

Test your model

You can now choose the Predict button, which takes you to the Predict page, where you can upload your own images through Single prediction or Batch prediction. Please set the option of your choice and select Import to upload your image and test the model.

Figure 13 Test your own images

Let's start by doing a single image prediction. Make sure you are on the Single prediction option and choose Import image. This takes you to a dialog box where you can choose to upload your image from Amazon S3, or do a Local upload. In our case, we select Amazon S3, browse to the directory where we have the test images, and select any image. Then select Import data.

Figure 14 Single Image Prediction

Once selected, you should see the screen say Generating prediction results. Your results should appear in a few minutes, as shown below.
Now let's try a batch prediction. Select Batch prediction under Run predictions, choose the Import new dataset button, name it BatchPrediction, and choose the Create button.

Figure 15 Single image prediction results

On the next window, make sure you have selected Amazon S3 upload, browse to the directory where we have our test set, and choose the Import data button.

Figure 16 Batch Image Prediction

Once the images are in Ready status, select the radio button for the created dataset and choose Generate predictions. You should see the batch prediction status change to Generating predictions. Wait a few minutes for the results.
Once the status is Ready, choose the dataset name, which takes you to a page showing the detailed predictions for all the images.

Figure 17 Batch image prediction results

Another important feature of batch prediction is the ability to verify the results and download the predictions as a ZIP or CSV file for further use or sharing.

Figure 18 Download prediction

With this, you have successfully created a model, trained it, and tested its predictions with Amazon SageMaker Canvas.
Cleaning up
Choose Log out in the left navigation pane to log out of the Amazon SageMaker Canvas application to stop the consumption of SageMaker Canvas workspace instance hours and release all resources.
Citation
[1]Fraiwan M, Faouri E. On the Automatic Detection and Classification of Skin Cancer Using Deep Transfer Learning. Sensors (Basel). 2022 Jun 30;22(13):4963. doi: 10.3390/s22134963. PMID: 35808463; PMCID: PMC9269808.
Conclusion
In this post, we showed you how medical image analysis using ML techniques can expedite the diagnosis of skin cancer, and how the same approach applies to diagnosing other diseases. However, building ML models for image classification is often complex and time-consuming, requiring coding expertise and ML knowledge. Amazon SageMaker Canvas addresses this challenge by providing a visual interface that eliminates the need for coding or specialized ML skills. This empowers healthcare professionals to use ML without a steep learning curve, allowing them to focus on patient care.
The traditional process of developing a cancer detection model is cumbersome and time-consuming. It involves gathering a curated dataset, preprocessing images, training an ML model, evaluating its performance, and integrating it into a user-friendly tool for healthcare professionals. Amazon SageMaker Canvas simplifies the steps from preprocessing to integration, reducing the time and effort required to build a skin cancer detection model.
In this post, we delved into the powerful capabilities of Amazon SageMaker Canvas in classifying medical images, shedding light on its benefits and presenting real-world use cases that showcase its profound impact on medical diagnostics. One such compelling use case we explored was skin cancer detection and how early diagnosis often significantly enhances treatment outcomes and reduces healthcare costs.
It is important to acknowledge that the accuracy of the model can vary depending on factors, such as the size of the training dataset and the specific type of model employed. These variables play a role in determining the performance and reliability of the classification results.
Amazon SageMaker Canvas can serve as an invaluable tool that assists healthcare professionals in diagnosing diseases with greater accuracy and efficiency. However, it is vital to note that it isn’t intended to replace the expertise and judgment of healthcare professionals. Rather, it empowers them by augmenting their capabilities and enabling more precise and expedient diagnoses. The human element remains essential in the decision-making process, and the collaboration between healthcare professionals and artificial intelligence (AI) tools, including Amazon SageMaker Canvas, is pivotal in providing optimal patient care.

About the authors
 Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.
Jake Wen is a Solutions Architect at AWS, driven by a passion for Machine Learning, Natural Language Processing, and Deep Learning. He assists Enterprise customers in achieving modernization and scalable deployment in the Cloud. Beyond the tech world, Jake finds delight in skateboarding, hiking, and piloting air drones.
Sonu Kumar Singh is an AWS Solutions Architect, with a specialization in analytics domain. He has been instrumental in catalyzing transformative shifts in organizations by enabling data-driven decision-making thereby fueling innovation and growth. He enjoys it when something he designed or created brings a positive impact. At AWS his intention is to help customers extract value out of AWS’s 200+ cloud services and empower them in their cloud journey.
Dariush Azimi is a Solution Architect at AWS, with specialization in Machine Learning, Natural Language Processing (NLP), and microservices architecture with Kubernetes. His mission is to empower organizations to harness the full potential of their data through comprehensive end-to-end solutions encompassing data storage, accessibility, analysis, and predictive capabilities.

Automate prior authorization using CRD with CDS Hooks and AWS HealthLa …

Prior authorization is a crucial process in healthcare that involves the approval of medical treatments or procedures before they are carried out. This process is necessary to ensure that patients receive the right care and that healthcare providers are following the correct procedures. However, prior authorization can be a time-consuming and complex process that requires a lot of paperwork and communication between healthcare providers, insurance companies, and patients.
The prior authorization process for electronic health records (EHRs) consists of five steps:

Determine whether prior authorization is required.
Gather information necessary to support the prior authorization request.
Submit the request for prior authorization.
Monitor the prior authorization request for resolution.
If needed, supplement the prior authorization request with additional required information (and resume at Step 4).

The Da Vinci Burden Reduction project has rearranged these steps for prior authorization into three interrelated implementation guides that are focused on reducing the clinician and payer burden:

Coverage Requirements Discovery (CRD) – This provides decision support to providers at the time they’re ordering diagnostics, specifying treatments, making referrals, scheduling appointments, and so on.
Documentation Templates and Rules (DTR) – This allows providers to download smart questionnaires and rules, such as Clinical Quality Language (CQL), and provides a SMART on FHIR app or EHR app that runs the questionnaires and rules to gather information relevant to a performed or planned service. Running the questionnaires and rules may also be performed by an application that is part of the provider’s EHR.
Prior Authorization Support (PAS) – This allows provider systems to send (and payer systems to receive) prior authorization requests using FHIR, while still meeting regulatory mandates to have X12 278 used, where required, to transport the prior authorization, potentially simplifying processing for either exchange partner (or both).

In this post, we focus on the CRD implementation guide to determine prior authorization requirements and explain how CDS (Clinical Decision Support) Hooks uses AWS HealthLake to determine if prior authorization is required or not.
Solution overview
CRD is a protocol within the electronic prior authorization workflow that facilitates calls between EHRs and the payers using CDS services. When utilized, it provides information on coverage requirements to providers while patient care decisions are in progress. This enables provider staff to make more informed decisions and meet the requirements of their patient’s insurance coverage. Interaction between providers and payers is done seamlessly using CDS Hooks.
CDS Hooks is a Health Level Seven International (HL7) specification. CDS Hooks provides a way to embed additional, near-real-time functionality within a clinician’s workflow of an EHR. With CDS Hooks, eligibility practices like prior authorization can be properly optimized, along with other pre-certification requirements like the physician’s network participation. This function assists providers in making informed decisions by providing them with information on their patient’s condition, treatment options, and the forms that must be completed to facilitate their care. The strategic use of CDS Hooks allows clinicians to quickly develop more patient-centered care plans and assist the prior authorization process by disclosing critical administrative and clinical requirements. For more information on CDS Hooks and its specification, refer to the CDS Hooks website.
The following diagram illustrates how the CRD workflow is automated using HealthLake.

The workflow steps are as follows:

A provider staff member logs into the EHR system to open the patient chart.
The EHR system validates user credentials and invokes the patient-view hook to retrieve patient condition information.
Amazon API Gateway invokes the Patient View Hooks AWS Lambda function.
The Lambda function validates and retrieves the patient ID from the request and gets the patient condition information from HealthLake.
After reviewing the patient condition, the user invokes the order-select hook to retrieve coverage requirements information for the respective drug.
API Gateway invokes the Coverage Requirements Hooks Lambda function.
The Lambda function retrieves claims information for the patient, runs CQL rules based on the medication submitted and claims information retrieved from HealthLake, and determines whether prior authorization is required.

The solution is available in the Determine Coverage Requirements Discovery using CDS Hooks with AWS HealthLake GitHub repo.
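The GitHub repository contains the full implementation; the sketch below only illustrates the shape of a CDS Hooks response that such a Lambda function returns. The card fields (summary, indicator, source) follow the CDS Hooks specification, while the coverage check itself is a placeholder for the HealthLake query and CQL evaluation described above.

import json

def lambda_handler(event, context):
    # Illustrative order-select CDS service: return a card stating whether
    # prior authorization is required for the selected medication.
    request = json.loads(event.get("body") or "{}")
    patient_id = request.get("context", {}).get("patientId")  # would drive the HealthLake/claims lookup

    # Placeholder for the real logic: query HealthLake for the patient's
    # conditions and claims, then evaluate the payer's CQL rules.
    prior_auth_required = True

    card = {
        "summary": ("Prior authorization is required for this medication"
                    if prior_auth_required else "No prior authorization required"),
        "indicator": "warning" if prior_auth_required else "info",
        "source": {"label": "Coverage Requirements Discovery (CRD)"},
    }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"cards": [card]}),
    }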
Prerequisites
This post assumes familiarity with the following services:

API Gateway
HealthLake
Lambda
AWS Serverless Application Model (AWS SAM)

Deploy the application using the AWS SAM CLI
You can deploy the template using the AWS Management Console or the AWS SAM CLI. To use the CLI, complete the following steps:

Install the AWS SAM CLI.
Download the sample code from the AWS samples repository to your local system:

git clone https://github.com/aws-samples/aws-crd-hooks-with-awshealthlake-api
cd aws-crd-hooks-with-awshealthlake-api/

Build the application using AWS SAM:

sam build

Deploy the application using the guided process:

sam deploy --guided

# Replace MY_VALUE with proper resource names
Configuring SAM deploy

======================

Looking for config file [samconfig.toml] : Not found

Setting default arguments for ‘sam deploy’

=========================================

Stack Name [sam-app]: aws-cds-hooks-with-healthlake

AWS Region [us-east-1]: us-east-2

#Shows you resources changes to be deployed and require a ‘Y’ to initiate deploy

Confirm changes before deploy [y/N]:

#SAM needs permission to be able to create roles to connect to the resources in your template

Allow SAM CLI IAM role creation [Y/n]:

#Preserves the state of previously provisioned resources when an operation fails

Disable rollback [y/N]:

cdsDemoServicesFunction has no authentication. Is this okay? [y/N]: y

cqlQueryFunction has no authentication. Is this okay? [y/N]: y

cqlQueryOrderFunction has no authentication. Is this okay? [y/N]: y

Save arguments to configuration file [Y/n]: y

SAM configuration file [samconfig.toml]:

SAM configuration environment [default]:
The deployment may take 30 minutes or more while AWS creates a HealthLake data store and related resources in your AWS account. AWS SAM may time out and return you to your command line. This timeout stops AWS SAM from showing you the progress in the cloud, but doesn’t stop the deployment happening in the cloud. If you see a timeout, go to the AWS CloudFormation console and verify the CloudFormation stack deployment status. Integrate CDS Hooks with your clinical workflow when the CloudFormation stack deployment is complete.
Determine coverage requirements for prior authorization
The solution has two hooks, patient-view and order-select, to determine if prior authorization is required or not based on prior authorization rules from payer. CQL is used to evaluate prior authorization rules.
CDS Hooks can be integrated with any EHR that supports the CDS Hooks specification. Alternatively, if you don't have an EHR available for testing, you can use the publicly available sandbox as described in the GitHub repo. Note that the CDS Hooks sandbox is being used solely for the purpose of testing.
After your hooks are integrated with the EHR, when a user navigates to the clinical workflow, the patient-view hook is run for the configured patient. Note that the patient ID from the clinical workflow should exist in HealthLake. The cards returned from the API indicate that the patient has a sinus infection health condition and that the doctor may need to order a prescription.

You can navigate to the RX View tab to order a prescription. Acting as the doctor, choose the appropriate medication and enter other details as shown in the following screenshot.

The order-select hook is returned with the prior authorization eligibility card.

The next step is to submit a prior authorization using the SMART app or other mechanisms available to the provider.
Clean up
If you no longer need the AWS resources that you created by running this example, you can remove them by deleting the CloudFormation stack that you deployed:
sam delete --stack-name <<your-stack-name>>
Conclusion
In this post, we showed how HealthLake with CDS Hooks can help reduce the burden on providers and improve the member experience by determining coverage requirements for prior authorization as part of the prescription order clinical workflow. CDS Hooks along with HealthLake can help providers at the time they’re ordering diagnostics, specifying treatments, making referrals, and scheduling appointments.
If you are interested in implementing coverage requirements discovery on AWS using this solution or want to learn more about implementing prior authorization on AWS, you can contact an AWS Representative.

About the Authors
Manish Patel, a Global Partner Solutions Architect supporting Healthcare and Life Sciences at AWS. He has more than 20 years of experience building solutions for Medicare, Medicaid, Payers, Providers and Life Sciences customers. He drives go-to-market strategies along with partners to accelerate solution developments in areas such as Electronics Health Records, Medical Imaging, multi-model data solutions and Generative AI. He is passionate about using technology to transform the healthcare industry to drive better patient care outcomes.
Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures, and provides advice on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions.

Researchers from Google and Cornell Propose RealFill: A Novel Generati …

Researchers have introduced a novel framework called RealFill to address the problem of Authentic Image Completion. This challenge arises when users want to enhance or complete missing parts of a photograph, ensuring that the added content remains faithful to the original scene. The motivation behind this work is to provide a solution for situations where a single image fails to capture the perfect angle, timing, or composition. For instance, consider a scenario where a precious moment was nearly captured in a photograph, but a crucial detail was left out, such as a child’s intricate crown during a dance performance. RealFill aims to fill in these gaps by generating content that “should have been there” instead of what “could have been there.”

Existing approaches for image completion typically rely on geometric-based pipelines or generative models. However, these methods face limitations when the scene’s structure cannot be accurately estimated, especially in cases with complex geometry or dynamic objects. On the other hand, generative models, like diffusion models, have shown promise in image inpainting and outpainting tasks but struggle to recover fine details and scene structure due to their reliance on text prompts.

To address these challenges, the researchers propose RealFill, a referenced-driven image completion framework that personalizes a pre-trained diffusion-based inpainting model using a small set of reference images. This personalized model learns not only the scene’s image prior but also its contents, lighting, and style. The process involves fine-tuning the model on both the reference and target images and then using it to fill in the missing regions in the target image through a standard diffusion sampling process.

One key innovation in RealFill is Correspondence-Based Seed Selection, which automatically selects high-quality generations by leveraging the correspondence between generated content and reference images. This method greatly reduces the need for human intervention in selecting the best model outputs.

The researchers have created a dataset called RealBench to evaluate RealFill, covering both inpainting and outpainting tasks in diverse and challenging scenarios. They compare RealFill with two baselines: Paint-by-Example, which relies on a CLIP embedding of a single reference image, and Stable Diffusion Inpainting, which uses a manually written prompt. RealFill outperforms these baselines by a significant margin across various image similarity metrics.

In conclusion, RealFill addresses the problem of Authentic Image Completion by personalizing a diffusion-based inpainting model with reference images. This approach enables the generation of content that is both high-quality and faithful to the original scene, even when reference and target images have significant differences. While RealFill exhibits promising results, it is not without limitations, such as its computational demands and challenges in cases with dramatic viewpoint changes. Nonetheless, RealFill represents a significant advancement in image completion technology, offering a powerful tool for enhancing and completing photographs with missing elements.

Meet Colossal-LLaMA-2: An Open-Sourced Artificial Intelligence Approach with a Full-Flow Solution for LLaMA2 with High Scalability

In the ever-evolving field of artificial intelligence, the pursuit of large-scale deep-learning models capable of handling complex tasks has been at the forefront. These models, often powered by billions of parameters, have demonstrated remarkable capabilities in various applications, from natural language understanding to computer vision. However, there’s a catch – building and training such colossal models traditionally demands astronomical costs and substantial computational resources, often rendering them inaccessible to smaller companies, independent developers, and researchers. Enter Colossal-AI, a pioneering research team committed to democratizing access to large models through innovative training techniques.

The problem is the exorbitant cost of training large-scale deep-learning models from scratch. Conventional approaches necessitate vast amounts of data, computational power, and financial resources. This prohibitive barrier to entry has long discouraged many from venturing into the realm of large models. It’s not uncommon for industry insiders to humorously refer to this domain as reserved only for those with “50 million dollars” to spare. This situation has stifled innovation and limited the accessibility of state-of-the-art AI models.

Colossal-AI’s solution comes in the form of Colossal-LLaMA-2, an approach to training large models that defies convention. Unlike traditional methods that consume trillions of data tokens and incur astronomical costs, Colossal-LLaMA-2 reports strong results on a training budget of just a few hundred dollars. This opens up the possibility of building a capable, language-adapted large model without the cost of training from scratch.

The success of Colossal-LLaMA-2 can be attributed to several key strategies. First, the research team significantly expanded the model’s vocabulary. This expansion improved the efficiency of encoding string sequences and enriched the encoded sequences with more meaningful information, enhancing document-level encoding and understanding. The team was careful, however, to keep the expanded vocabulary at a moderate size, since an excessively large vocabulary would inflate the number of embedding-related parameters and hurt training efficiency.
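As a rough illustration of what vocabulary expansion involves, the sketch below uses Hugging Face transformers to add tokens and resize the embedding matrices; the checkpoint name and example tokens are placeholders, and Colossal-AI’s own tooling may instead merge a separately trained tokenizer.

```python
# A minimal sketch of vocabulary expansion with Hugging Face transformers.
# The checkpoint name and the handful of new tokens are illustrative only.
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "meta-llama/Llama-2-7b-hf"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["人工智能", "大模型"]  # example Chinese tokens; a real expansion is far larger
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embedding matrices to match the enlarged vocabulary.
# The new rows are randomly initialized and learned during continual pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```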

To further reduce training costs and improve efficiency, high-quality data played a crucial role. The team developed a complete data-cleaning system and toolkit for selecting higher-quality data for continual pre-training. This helped elicit the model’s capabilities while mitigating catastrophic forgetting.
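The toy filter below illustrates the kind of heuristics such a toolkit might contain (exact-duplicate removal plus length and symbol-ratio checks); it is not the team’s actual cleaning system.

```python
# A toy sketch of heuristic data cleaning, not Colossal-AI's actual toolkit:
# exact-duplicate removal plus simple length and symbol-ratio filters.
import hashlib

def clean(docs, min_chars=200, max_symbol_ratio=0.3):
    seen = set()
    for text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue            # drop exact duplicates
        seen.add(digest)
        if len(text) < min_chars:
            continue            # drop very short fragments
        symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue            # drop markup- or boilerplate-heavy documents
        yield text
```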

Colossal-LLaMA-2’s training strategy is another critical component of its success. It uses a multi-stage, hierarchical, continual pre-training scheme that progresses through three stages: large-scale pre-training, Chinese knowledge injection, and relevant knowledge replay. This approach ensures that the model develops effectively in both Chinese and English, making it versatile and capable of handling a wide range of tasks.

Balanced data distribution is paramount in continual pre-training. To achieve it, the team designed a data-bucketing strategy that divides each type of data into ten bins, ensuring that the model consumes every kind of data evenly.
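A simple sketch of that bucketing idea, under the assumption that each data category is an in-memory list of examples, might look like this:

```python
# An illustrative sketch of the bucketing idea: split each data category into
# ten bins, then interleave the bins so every kind of data is consumed evenly.
import itertools
import random

def bucket(examples, n_bins=10, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_bins] for i in range(n_bins)]

def balanced_stream(datasets_by_type, n_bins=10):
    """datasets_by_type: dict mapping a category name to a list of examples."""
    bins = {name: bucket(data, n_bins) for name, data in datasets_by_type.items()}
    for i in range(n_bins):
        # Within each bin index, round-robin across categories so no single
        # data type dominates any stretch of training.
        for group in itertools.zip_longest(*(bins[name][i] for name in bins)):
            yield from (ex for ex in group if ex is not None)
```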

Performance is assessed comprehensively through the ColossalEval framework, which evaluates large language models along multiple dimensions, including knowledge-reserve capability, multiple-choice questions, content generation, and more. Colossal-LLaMA-2 consistently outperforms its competitors in these evaluations, showcasing its robustness and versatility.
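For context, multiple-choice benchmarks of this kind are often scored by comparing per-option log-likelihoods under the model; the sketch below shows that generic recipe with Hugging Face transformers. The checkpoint id is an assumption, and ColossalEval’s actual protocol may differ.

```python
# A generic multiple-choice scoring sketch: pick the option the model assigns
# the highest log-likelihood to. Not ColossalEval's exact protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "hpcai-tech/Colossal-LLaMA-2-7b-base"  # assumed id; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities assigned to the option tokens."""
    prompt_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    per_token = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Assumes the question tokens form a prefix of the full token sequence,
    # which holds approximately for typical tokenizers.
    return per_token[:, prompt_ids.shape[1] - 1:].sum().item()

question = "Which city is the capital of France?"
options = ["Paris", "Lyon", "Marseille", "Nice"]
print(max(options, key=lambda o: option_logprob(question, o)))
```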

In conclusion, Colossal-LLaMA-2 represents a remarkable step forward for large-scale deep-learning models. By drastically reducing training costs and enhancing accessibility, it brings the power of state-of-the-art models to a wider audience. The implications are profound: smaller companies, independent developers, and researchers no longer face insurmountable barriers to leveraging the capabilities of large models. This democratization of AI has the potential to spark innovation across various domains and accelerate the development and deployment of AI applications.

Spotify’s Newest Feature: Using AI to Clone and Translate Podcast Voices Across Languages

In the ever-evolving world of podcasting, language barriers have long stood as a formidable obstacle to the global reach of audio content. However, recent developments signal a promising solution to this challenge. Spotify, the streaming giant, has partnered with OpenAI to introduce a groundbreaking AI-powered voice translation tool that has the potential to revolutionize the way podcast episodes are consumed around the world.

Traditionally, podcasts have faced linguistic limitations, with content primarily accessible to audiences fluent in the language of the podcast. While subtitles and dubbing have been employed to bridge this gap, they often fail to deliver an authentic experience. This longstanding problem has prompted content creators and platforms to seek innovative solutions.

Spotify’s voice translation technology is a remarkable development that leverages OpenAI’s cutting-edge voice technology. This tool transcends conventional translation methods by crafting synthetic voices that mimic the podcast hosts’ cadence, tone, and inflection. It promises to maintain the essence of the original content while breaking down language barriers and expanding the global audience for podcasts.

The technology uses just a few seconds of a host’s real speech to create translated podcast episodes that sound remarkably authentic and personalized. The feature, tested with prominent podcasters, aims to offer listeners the same distinctive voice experience in Spanish, French, and German. As the pilot program progresses, more shows and languages are expected to be added, marking a significant stride toward making podcasts accessible to a broader global audience.

Spotify’s commitment to democratizing podcast content is evident in its decision to offer these translated episodes to free and Premium users. This inclusivity underscores the company’s dedication to enhancing creator expression and building connections between talent and fans worldwide. The success and user reception of these AI-powered episodes will shape the direction of future refinements, promising even more innovative solutions for the podcasting landscape.

In conclusion, Spotify’s introduction of AI-powered voice translation technology signifies a monumental step in overcoming the longstanding barriers to storytelling imposed by language differences. By preserving the authenticity of podcast hosts’ voices in translated content, Spotify aims to bring global listeners closer to their favorite podcasters. As Spotify continues to expand its podcast catalog, innovations like voice translation could make this captivating medium more accessible and inclusive globally, marking a promising new chapter in the world of podcasting.

Shanghai Jiao Tong University Researchers Unveil RH20T: The Ultimate Robotic Dataset Boasting 110K Sequences, Multimodal Data, and 147 Diverse Tasks

Robotic manipulation is advancing towards the goal of enabling robots to swiftly acquire new skills through one-shot imitation learning and foundational models. While the field has made strides in simple tasks like object manipulation, hurdles impede progress in more complex scenarios. The scarcity of large and diverse robotic manipulation datasets and a reliance on visual guidance are key challenges. To address these issues, researchers from Shanghai Jiao Tong University introduce an innovative data collection approach employing force-torque sensors and haptic devices.

The work addresses three critical areas in robotic manipulation research: the scarcity of comprehensive datasets, the promise of one-shot imitation learning and foundational models, and the necessity of integrating visual and tactile perception for complex skill acquisition. The researchers see untapped potential in one-shot learning and foundational models to elevate robotic manipulation skills by harnessing the power of demonstrations.

The researchers tackle the challenge of equipping robots with diverse, adaptable skills for open-domain tasks using one-shot imitation learning and foundational robotic models. While current efforts primarily revolve around straightforward tasks such as pushing or picking objects, guided mainly by visual cues, more complex skills that combine visual and tactile perception remain underexplored. Their data collection pipeline integrates a force-torque sensor and a haptic device, and the resulting dataset comprises over 110,000 robot manipulation sequences spanning various skills, scenarios, robots, and camera angles, encompassing visual, force, audio, and action data.
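To make the dataset’s composition concrete, the dataclass below sketches what a single sequence might bundle together; the field names and shapes are illustrative assumptions, not RH20T’s actual file schema.

```python
# An illustrative data structure for one multimodal manipulation sequence; the
# field names and shapes are assumptions for exposition, not RH20T's schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class ManipulationSequence:
    task_id: str                          # which skill/task this sequence demonstrates
    robot: str                            # robot platform used for the demonstration
    rgb_frames: dict[str, np.ndarray]     # camera name -> (T, H, W, 3) image stack
    force_torque: np.ndarray              # (T, 6) wrench readings from the wrist sensor
    audio: np.ndarray                     # (S,) mono waveform recorded during execution
    actions: np.ndarray                   # (T, D) commanded end-effector poses / gripper states
    timestamps: np.ndarray                # (T,) synchronization timestamps in seconds
```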

Their research also highlights the importance of intuitive teleoperation, emphasizing its role in avoiding collisions and safely applying significant contact forces during data collection. The resulting dataset, designed to be representative, diverse, and true to real-world scenarios, promises to be a valuable asset for research on general skill learning. The primary focus is on demonstrating how the dataset improves the transferability of a baseline model within a few-shot learning framework.

Their experiments showcase the model’s performance across various training configurations, highlighting the substantial benefits of the diverse dataset for robotic manipulation. Pretraining on the dataset, even under differing conditions, significantly boosts success rates, and incorporating data from diverse tasks during pre-training further improves overall performance and accelerates convergence. Notably, the dataset proves its value in few-shot learning: pretrained models consistently outperform their non-pretrained counterparts even with fewer demonstrations, and they generalize better, consistently outshining non-pretrained models when tested in new environments.
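The snippet below sketches the generic few-shot recipe behind such results: load policy weights pretrained on the large dataset, then fine-tune by behavior cloning on a handful of target-task demonstrations. The tiny MLP, MSE loss, and checkpoint path are placeholders rather than the paper’s baseline model.

```python
# A generic sketch of few-shot adaptation by behavior cloning; the network,
# loss, and checkpoint path are placeholders, not the paper's exact baseline.
import torch
from torch import nn

policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))
policy.load_state_dict(torch.load("pretrained_policy.pt"))  # hypothetical pretrained weights

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# `few_shot_demos` holds (observation_features, expert_action) pairs from a
# handful of demonstrations of the new task; dummy tensors stand in here.
few_shot_demos = [(torch.randn(512), torch.randn(7)) for _ in range(10)]

for epoch in range(50):
    for obs, action in few_shot_demos:
        pred = policy(obs)               # predict the action from the observation
        loss = loss_fn(pred, action)     # match the expert's demonstrated action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```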

In conclusion, the dataset is a valuable resource for diverse robotic skill learning, particularly for manipulation in novel environments. It offers contact-rich robot manipulation sequences across various skills, contexts, robots, and camera viewpoints, together with multimodal perception information. While acknowledging limitations, such as the high cost of data collection and the need for further evaluation with robotic foundation models, the researchers have open-sourced the dataset to foster collaboration and progress in the field. Future work aims to expand the dataset to a wider range of robotic manipulation tasks, including dual-arm and multi-finger dexterous manipulation.
