Use Stable Diffusion XL with Amazon SageMaker JumpStart in Amazon SageMaker Studio

Today we are excited to announce that Stable Diffusion XL 1.0 (SDXL 1.0) is available for customers through Amazon SageMaker JumpStart. SDXL 1.0 is the latest image generation model from Stability AI. SDXL 1.0 enhancements include native 1024-pixel image generation at a variety of aspect ratios. It’s designed for professional use, and calibrated for high-resolution photorealistic images. SDXL 1.0 offers a variety of preset art styles ready to use in marketing, design, and image generation use cases across industries. You can easily try out these models and use them with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.
In this post, we walk through how to use SDXL 1.0 models via SageMaker JumpStart.
What is Stable Diffusion XL 1.0 (SDXL 1.0)
SDXL 1.0 is the evolution of Stable Diffusion and the next frontier for generative AI for images. SDXL is capable of generating stunning images with complex concepts in various art styles, including photorealism, at quality levels that exceed the best image models available today. Like the original Stable Diffusion series, SDXL is highly customizable (in terms of parameters) and can be deployed on Amazon SageMaker instances.
The following image of a lion was generated using SDXL 1.0 using a simple prompt, which we explore later in this post.

The SDXL 1.0 model includes the following highlights:

Freedom of expression – Best-in-class photorealism, as well as an ability to generate high-quality art in virtually any art style. Distinct images are made without having any particular feel that is imparted by the model, ensuring absolute freedom of style.
Artistic intelligence – Best-in-class ability to generate concepts that are notoriously difficult for image models to render, such as hands and text, or spatially arranged objects and people (for example, a red box on top of a blue box).
Simpler prompting – Unlike other generative image models, SDXL requires only a few words to create complex, detailed, and aesthetically pleasing images. No more need for paragraphs of qualifiers.
More accurate – Prompting in SDXL is not only simple, but more true to the intention of prompts. SDXL’s improved CLIP model understands text so effectively that concepts like “The Red Square” are understood to be different from “a red square.” This accuracy allows much more to be done to get the perfect image directly from text, even before using the more advanced features or fine-tuning that Stable Diffusion is famous for.

What is SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a broad selection of state-of-the-art models for use cases such as content writing, image generation, code generation, question answering, copywriting, summarization, classification, information retrieval, and more. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. The SDXL model is discoverable today in Amazon SageMaker Studio and, as of this writing, is available in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, and ap-southeast-2 Regions.
Solution overview
In this post, we demonstrate how to deploy SDXL 1.0 to SageMaker and use it to generate images using both text-to-image and image-to-image prompts.
SageMaker Studio is a web-based integrated development environment (IDE) for ML that lets you build, train, debug, deploy, and monitor your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
Once you are in the SageMaker Studio UI, access SageMaker JumpStart and search for Stable Diffusion XL. Choose the SDXL 1.0 model card, which opens an example notebook. You are only responsible for compute costs; there is no associated model cost. The closed weight SDXL 1.0 offers SageMaker-optimized scripts and a container with faster inference time, and can run on a smaller instance compared to the open weight SDXL 1.0. The example notebook walks you through the steps, but we also discuss how to discover and deploy the model later in this post.
In the following sections, we show how you can use SDXL 1.0 to create photorealistic images with shorter prompts and generate text within images. Stable Diffusion XL 1.0 offers enhanced image composition and face generation with stunning visuals and realistic aesthetics.
Stable Diffusion XL 1.0 parameters
The following are the parameters used by SDXL 1.0:

cfg_scale – How strictly the diffusion process adheres to the prompt text.
height and width – The height and width of the image, in pixels.
steps – The number of diffusion steps to run.
seed – Random noise seed. If a seed is provided, the resulting generated image will be deterministic.
sampler – The sampler to use in the diffusion process to denoise the generation.
text_prompts – An array of text prompts to use for generation.
weight – Assigns each prompt a specific weight.

For more information, refer to Stability AI's text-to-image documentation.
The following code is a sample of the input data provided with the prompt:

{
  "cfg_scale": 7,
  "height": 1024,
  "width": 1024,
  "steps": 50,
  "seed": 42,
  "sampler": "K_DPMPP_2M",
  "text_prompts": [
    {
      "text": "A photograph of fresh pizza with basil and tomatoes, from a traditional oven",
      "weight": 1
    }
  ]
}
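
If you prefer to send this JSON payload to a deployed endpoint directly instead of using the SDK helpers shown later, a minimal Boto3 sketch could look like the following (the endpoint name is a placeholder, and the exact response format depends on the model container):

import json

import boto3

endpoint_name = "sdxl-1-0-jumpstart-example"  # placeholder; use your endpoint name

payload = {
    "cfg_scale": 7,
    "height": 1024,
    "width": 1024,
    "steps": 50,
    "seed": 42,
    "sampler": "K_DPMPP_2M",
    "text_prompts": [
        {
            "text": "A photograph of fresh pizza with basil and tomatoes, from a traditional oven",
            "weight": 1,
        }
    ],
}

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())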

All examples in this post are based on the sample notebook for Stable Diffusion XL 1.0, which can be found in Stability AI's GitHub repo.
Generate images using SDXL 1.0
In the following examples, we focus on the capabilities of Stable Diffusion XL 1.0 models, including superior photorealism, enhanced image composition, and the ability to generate realistic faces. We also explore the significantly improved visual aesthetics, resulting in visually appealing outputs. Additionally, we demonstrate the use of shorter prompts, enabling the creation of descriptive imagery with greater ease. Lastly, we illustrate how text in images is now more legible, further enriching the overall quality of the generated content.
The following example shows using a simple prompt to get detailed images. Using only a few words in the prompt, it was able to create a complex, detailed, and aesthetically pleasing image that resembles the provided prompt.

text = "photograph of latte art of a cat"

output = deployed_model.predict(GenerationRequest(text_prompts=[TextPrompt(text=text)],
                                                  seed=5,
                                                  height=640,
                                                  width=1536,
                                                  sampler="DDIM",
                                                  ))
decode_and_show(output)
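
The decode_and_show helper is defined in the sample notebook; a minimal sketch is shown below, assuming the response object follows the Stability SDK convention of returning base64-encoded image artifacts:

import base64
import io

from PIL import Image


def decode_and_show(model_response) -> None:
    """Decode the first base64-encoded artifact in the response and display it."""
    image_b64 = model_response.artifacts[0].base64  # assumes a GenerationResponse-style object
    image = Image.open(io.BytesIO(base64.b64decode(image_b64)))
    image.show()  # use display(image) inside a notebook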

Next, we show the use of the style_preset input parameter, which is only available on SDXL 1.0. Passing in a style_preset parameter guides the image generation model towards a particular style.
Some of the available style_preset parameters are enhance, anime, photographic, digital-art, comic-book, fantasy-art, line-art, analog-film, neon-punk, isometric, low-poly, origami, modeling-compound, cinematic, 3d-model, pixel-art, and tile-texture. This list of style presets is subject to change; refer to the latest release and documentation for updates.
For this example, we use a prompt to generate a teapot with a style_preset of origami. The model was able to generate a high-quality image in the provided art style.

output = deployed_model.predict(GenerationRequest(text_prompts=[TextPrompt(text="teapot")],
                                                  style_preset="origami",
                                                  seed=3,
                                                  height=1024,
                                                  width=1024,
                                                  ))

Let's try some more style presets with different prompts. The next example shows a style preset for portrait generation using style_preset="photographic" with the prompt "portrait of an old and tired lion real pose."

text = "portrait of an old and tired lion real pose"

output = deployed_model.predict(GenerationRequest(text_prompts=[TextPrompt(text=text)],
                                                  style_preset="photographic",
                                                  seed=111,
                                                  height=640,
                                                  width=1536,
                                                  ))

Now let's try the same prompt ("portrait of an old and tired lion real pose") with modeling-compound as the style preset. The output is a distinct image without any particular feel imparted by the model, demonstrating the freedom of style the model offers.
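
A sketch of that call, reusing the deployed_model helper and imports from the previous examples:

text = "portrait of an old and tired lion real pose"

output = deployed_model.predict(GenerationRequest(text_prompts=[TextPrompt(text=text)],
                                                  style_preset="modeling-compound",
                                                  seed=111,
                                                  height=640,
                                                  width=1536,
                                                  ))
decode_and_show(output)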

Multi-prompting with SDXL 1.0
As we have seen, one of the core foundations of the model is the ability to generate images via prompting. SDXL 1.0 supports multi-prompting. With multi-prompting, you can mix concepts together by assigning each prompt a specific weight. As you can see in the following generated image, it has a jungle background with tall bright green grass. This image was generated using the following prompts. You can compare this to a single prompt from our earlier example.

text1 = "portrait of an old and tired lion real pose"
text2 = "jungle with tall bright green grass"

output = deployed_model.predict(GenerationRequest(
    text_prompts=[TextPrompt(text=text1),
                  TextPrompt(text=text2, weight=0.7)],
    style_preset="photographic",
    seed=111,
    height=640,
    width=1536,
))
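
Prompt weights can also be negative, which steers the generation away from a concept. The following illustration (our sketch, not from the original notebook) reuses the same objects to de-emphasize dry, yellow grass:

output = deployed_model.predict(GenerationRequest(
    text_prompts=[TextPrompt(text="portrait of an old and tired lion real pose"),
                  TextPrompt(text="jungle with tall bright green grass", weight=0.7),
                  TextPrompt(text="dry yellow grass", weight=-1.0)],  # negative weight suppresses this concept
    style_preset="photographic",
    seed=111,
    height=640,
    width=1536,
))
decode_and_show(output)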

Spatially aware generated images and negative prompts
Next, we look at poster design with a detailed prompt. As we saw earlier, multi-prompting allows you to combine concepts to create new and unique results.
In this example, the prompt is very detailed in terms of subject position, appearance, expectations, and surroundings. With the help of a negative prompt, the model also avoids images that are distorted or poorly rendered. The generated image shows spatially arranged objects and subjects.
text = "A cute fluffy white cat stands on its hind legs, peering curiously into an ornate golden mirror. But in the reflection, the cat sees not itself, but a mighty lion. The mirror illuminated with a soft glow against a pure white background."

negative_prompts = ['distorted cat features', 'distorted lion features', 'poorly rendered']

output = deployed_model.predict(GenerationRequest(
    text_prompts=[TextPrompt(text=text)],
    style_preset="enhance",
    seed=43,
    height=640,
    width=1536,
    steps=100,
    cfg_scale=7,
    negative_prompts=negative_prompts,
))

Let’s try another example, where we keep the same negative prompt but change the detailed prompt and style preset. As you can see, the generated image not only spatially arranges objects, but also changes the style presets with attention to details like the ornate golden mirror and reflection of the subject only.

text = "A cute fluffy white cat stands on its hind legs, peering curiously into an ornate golden mirror. In the reflection the cat sees itself."

negative_prompts = ['distorted cat features', 'distorted lion features', 'poorly rendered']

output = deployed_model.predict(GenerationRequest(
    text_prompts=[TextPrompt(text=text)],
    style_preset="neon-punk",
    seed=4343434,
    height=640,
    width=1536,
    steps=150,
    cfg_scale=7,
    negative_prompts=negative_prompts,
))

Face generation with SDXL 1.0
In this example, we show how SDXL 1.0 creates enhanced image composition and face generation with realistic features such as hands and fingers. The generated image is of a human figure created by AI with clearly raised hands. Note the details in the fingers and the pose; with earlier models, an AI-generated image like this would often have come out amorphous.

text = "Photo of an old man with hands raised, real pose."

output = deployed_model.predict(GenerationRequest(
    text_prompts=[TextPrompt(text=text)],
    style_preset="photographic",
    seed=11111,
    height=640,
    width=1536,
    steps=100,
    cfg_scale=7,
))

Text generation using SDXL 1.0
SDXL is primed for complex image design workflows that include generation of text within images. This example prompt showcases this capability. Observe how clear the text generation is using SDXL and notice the style preset of cinematic.

text = "Write the following word: Dream"

output = deployed_model.predict(GenerationRequest(text_prompts=[TextPrompt(text=text)],
                                                  style_preset="cinematic",
                                                  seed=15,
                                                  height=640,
                                                  width=1536,
                                                  sampler="DDIM",
                                                  steps=32,
                                                  ))

Discover SDXL 1.0 from SageMaker JumpStart
SageMaker JumpStart onboards and maintains foundation models for you to access, customize, and integrate into your ML lifecycles. Some models are open weight models that allow you to access and modify the model weights and scripts, whereas others are closed weight models that don't allow you to access them, to protect the IP of the model providers. Closed weight models require you to subscribe to the model from the AWS Marketplace model detail page, and SDXL 1.0 is a closed weight model at this time. In this section, we go over how to discover, subscribe to, and deploy a closed weight model from SageMaker Studio.
You can access SageMaker JumpStart by choosing JumpStart under Prebuilt and automated solutions on the SageMaker Studio Home page.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. The following screenshot shows an example of the landing page with solutions and foundation models listed.

Each model has a model card, as shown in the following screenshot, which contains the model name, if it is fine-tunable or not, the provider name, and a short description about the model. You can find the Stable Diffusion XL 1.0 model in the Foundation Model: Image Generation carousel or search for it in the search box.

You can choose Stable Diffusion XL 1.0 to open an example notebook that walks you through how to use the SDXL 1.0 model. The example notebook opens in read-only mode; you need to choose Import notebook to run it.

After importing the notebook, you need to select the appropriate notebook environment (image, kernel, instance type, and so on) before running the code.
Deploy SDXL 1.0 from SageMaker JumpStart
In this section, we walk through how to subscribe and deploy the model.

Open the model listing page in AWS Marketplace using the link available from the example notebook in SageMaker JumpStart.
On the AWS Marketplace listing, choose Continue to subscribe.

If you don’t have the necessary permissions to view or subscribe to the model, reach out to your AWS administrator or procurement point of contact. Many enterprises may limit AWS Marketplace permissions to control the actions that someone can take in the AWS Marketplace Management Portal.

Choose Continue to Subscribe.
On the Subscribe to this software page, review the pricing details and End User Licensing Agreement (EULA). If agreeable, choose Accept offer.
Choose Continue to configuration to start configuring your model.
Choose a supported Region.

You will see a product ARN displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Copy the ARN corresponding to your Region and specify it in the notebook's cell instruction.

ARN information may already be available in the example notebook.

Now you’re ready to start following the example notebook.

You can also continue from AWS Marketplace, but we recommend following the example notebook in SageMaker Studio to better understand how deployment works.
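
For reference, the deployment step in the notebook typically looks something like the following sketch using the SageMaker Python SDK. The ARN, instance type, and endpoint name below are placeholders; use the values from your Marketplace subscription and the example notebook.

from sagemaker import ModelPackage, Session, get_execution_role

session = Session()
role = get_execution_role()

# Placeholder ARN; copy the model package ARN for your Region from AWS Marketplace
model_package_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/sdxl-1-0-example"

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# The instance type is an assumption; choose one listed as supported on the listing page
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="sdxl-1-0-jumpstart-example",
)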

Clean up
When you’ve finished working, you can delete the endpoint to release the Amazon Elastic Compute Cloud (Amazon EC2) instances associated with it and stop billing.
Get your list of SageMaker endpoints using the AWS CLI as follows:

!aws sagemaker list-endpoints

Then delete the endpoints:

deployed_model.sagemaker_session.delete_endpoint(endpoint_name)

Conclusion
In this post, we showed you how to get started with the new SDXL 1.0 model in SageMaker Studio. With this model, you can take advantage of the different features offered by SDXL to create realistic images. Because foundation models are pre-trained, they can also help lower training and infrastructure costs and enable customization for your use case.
Resources

SageMaker JumpStart
JumpStart Foundation Models
SageMaker JumpStart product page
SageMaker JumpStart model catalog

About the authors
June Won is a product manager with SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications.
Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges on AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge, and has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.
Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS with experience in software engineering, enterprise architecture, and AI/ML. He works with customers to help them build well-architected applications on the AWS platform. He is passionate about solving technology challenges and helping customers with their cloud journey.
Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on Machine Learning and Modernization. Leveraging his expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.
Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Flag harmful language in spoken conversations with Amazon Transcribe Toxicity Detection

The increase in online social activities such as social networking or online gaming is often riddled with hostile or aggressive behavior that can lead to unsolicited manifestations of hate speech, cyberbullying, or harassment. For example, many online gaming communities offer voice chat functionality to facilitate communication among their users. Although voice chat often supports friendly banter and trash talking, it can also lead to problems such as hate speech, cyberbullying, harassment, and scams. Flagging harmful language helps organizations keep conversations civil and maintain a safe and inclusive online environment for users to create, share, and participate freely. Today, many companies rely solely on human moderators to review toxic content. However, scaling human moderators to meet these needs at a sufficient quality and speed is expensive. As a result, many organizations risk facing high user attrition rates, reputational damage, and regulatory fines. In addition, moderators are often psychologically impacted by reviewing the toxic content.
Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Today, we are excited to announce Amazon Transcribe Toxicity Detection, a machine learning (ML)-powered capability that uses both audio and text-based cues to identify and classify voice-based toxic content across seven categories, including sexual harassment, hate speech, threats, abuse, profanity, insults, and graphic language. In addition to text, Toxicity Detection uses speech cues such as tones and pitch to hone in on toxic intent in speech.
This is an improvement over standard content moderation systems that are designed to focus only on specific terms, without accounting for intention. Most enterprises have an SLA of 7–15 days to review content reported by users because moderators must listen to lengthy audio files to evaluate if and when the conversation became toxic. With Amazon Transcribe Toxicity Detection, moderators only review the specific portion of the audio file flagged for toxic content (vs. the entire audio file). The content human moderators must review is reduced by 95%, enabling customers to reduce their SLA to just a few hours and proactively moderate more content beyond just what's flagged by users. It allows enterprises to automatically detect and moderate content at scale, provide a safe and inclusive online environment, and take action before it can cause user churn or reputational damage. The models used for toxic content detection are maintained by Amazon Transcribe and updated periodically to maintain accuracy and relevance.
In this post, you’ll learn how to:

Identify harmful content in speech with Amazon Transcribe Toxicity Detection
Use the Amazon Transcribe console for toxicity detection
Create a transcription job with toxicity detection using the AWS Command Line Interface (AWS CLI) and Python SDK
Use the Amazon Transcribe toxicity detection API response

Detect toxicity in audio chat with Amazon Transcribe Toxicity Detection
Amazon Transcribe now provides a simple, ML-based solution for flagging harmful language in spoken conversations. This feature is especially useful for social media, gaming, and general needs, eliminating the need for customers to provide their own data to train the ML model. Toxicity Detection classifies toxic audio content into the following seven categories and provides a confidence score (0–1) for each category:

Profanity – Speech that contains words, phrases, or acronyms that are impolite, vulgar, or offensive.
Hate speech – Speech that criticizes, insults, denounces, or dehumanizes a person or group on the basis of an identity (such as race, ethnicity, gender, religion, sexual orientation, ability, and national origin).
Sexual – Speech that indicates sexual interest, activity, or arousal using direct or indirect references to body parts, physical traits, or sex.
Insults – Speech that includes demeaning, humiliating, mocking, insulting, or belittling language. This type of language is also labeled as bullying.
Violence or threat – Speech that includes threats seeking to inflict pain, injury, or hostility toward a person or group.
Graphic – Speech that uses visually descriptive and unpleasantly vivid imagery. This type of language is often intentionally verbose to amplify a recipient’s discomfort.
Harassment or abusive – Speech intended to affect the psychological well-being of the recipient, including demeaning and objectifying terms.

You can access Toxicity Detection either via the Amazon Transcribe console or by calling the APIs directly using the AWS CLI or the AWS SDKs. On the Amazon Transcribe console, you can upload the audio files you want to test for toxicity and get results in just a few clicks. Amazon Transcribe will identify and categorize toxic content, such as harassment, hate speech, sexual content, violence, insults, and profanity. Amazon Transcribe also provides a confidence score for each category, providing valuable insights into the content’s toxicity level. Toxicity Detection is currently available in the standard Amazon Transcribe API for batch processing and supports US English language.
Amazon Transcribe console walkthrough
To get started, sign in to the AWS Management Console and go to Amazon Transcribe. To create a new transcription job, you need to upload your recorded files into an Amazon Simple Storage Service (Amazon S3) bucket before they can be processed. On the audio settings page, as shown in the following screenshot, enable Toxicity detection and proceed to create the new job. Amazon Transcribe will process the transcription job in the background. As the job progresses, you can expect the status to change to COMPLETED when the process is finished.

To review the results of a transcription job, choose the job from the job list to open it. Scroll down to the Transcription preview section to check results on the Toxicity tab. The UI shows color-coded transcription segments to indicate the level of toxicity, determined by the confidence score. To customize the display, you can use the toggle bars in the Filters pane. These bars allow you to adjust the thresholds and filter the toxicity categories accordingly.
The following screenshot has covered portions of the transcription text due to the presence of sensitive or toxic information.

Transcription API with a toxicity detection request
In this section, we guide you through creating a transcription job with toxicity detection using programming interfaces. If the audio file is not already in an S3 bucket, upload it to ensure access by Amazon Transcribe. Similar to creating a transcription job on the console, when invoking the job, you need to provide the following parameters:

TranscriptionJobName – Specify a unique job name.
MediaFileUri – Enter the URI location of the audio file on Amazon S3. Amazon Transcribe supports the following audio formats: MP3, MP4, WAV, FLAC, AMR, OGG, or WebM.
LanguageCode – Set to en-US. As of this writing, Toxicity Detection only supports US English language.
ToxicityCategories – Pass the ALL value to include all supported toxicity detection categories.

The following is an example of starting a transcription job with toxicity detection enabled using Python 3:

import time
import boto3

transcribe = boto3.client('transcribe', 'us-east-1')
job_name = "toxicity-detection-demo"
job_uri = "s3://my-bucket/my-folder/my-file.wav"

# start a transcription job
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    OutputBucketName='doc-example-bucket',
    OutputKey='my-output-files/',
    LanguageCode='en-US',
    ToxicityDetection=[{'ToxicityCategories': ['ALL']}]
)

# wait for the transcription job to complete
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

print(status)

You can invoke the same transcription job with toxicity detection using the following AWS CLI command:

aws transcribe start-transcription-job \
    --region us-east-1 \
    --transcription-job-name toxicity-detection-demo \
    --media MediaFileUri=s3://my-bucket/my-folder/my-file.wav \
    --output-bucket-name doc-example-bucket \
    --output-key my-output-files/ \
    --language-code en-US \
    --toxicity-detection ToxicityCategories=ALL

Transcription API with toxicity detection response
The Amazon Transcribe toxicity detection JSON output will include the transcription results in the results field. Enabling toxicity detection adds an extra field called toxicityDetection under the results field. toxicityDetection includes a list of transcribed items with the following parameters:

text – The raw transcribed text
toxicity – A confidence score of detection (a value between 0–1)
categories – A confidence score for each category of toxic speech
start_time – The start position of detection in the audio file (seconds)
end_time – The end position of detection in the audio file (seconds)

The following is a sample abbreviated toxicity detection response you can download from the console:

{
  "results": {
    "transcripts": [...],
    "items": [...],
    "toxicityDetection": [
      {
        "text": "A TOXIC TRANSCRIPTION SEGMENT GOES HERE.",
        "toxicity": 0.8419,
        "categories": {
          "PROFANITY": 0.7041,
          "HATE_SPEECH": 0.0163,
          "SEXUAL": 0.0097,
          "INSULT": 0.8532,
          "VIOLENCE_OR_THREAT": 0.0031,
          "GRAPHIC": 0.0017,
          "HARASSMENT_OR_ABUSE": 0.0497
        },
        "start_time": 16.298,
        "end_time": 20.35
      },
      ...
    ]
  },
  "status": "COMPLETED"
}
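
As a quick illustration of working with this output, the following sketch loads the transcript JSON (downloaded from your output S3 location; the file name is a placeholder) and prints the segments whose overall toxicity score exceeds a chosen threshold:

import json

THRESHOLD = 0.5  # tune to your moderation policy

with open("toxicity-detection-demo.json") as f:  # downloaded transcription output
    data = json.load(f)

for segment in data["results"].get("toxicityDetection", []):
    if segment["toxicity"] >= THRESHOLD:
        top_category = max(segment["categories"], key=segment["categories"].get)
        print(f"{segment['start_time']:.2f}s-{segment['end_time']:.2f}s "
              f"toxicity={segment['toxicity']:.2f} top category={top_category}")
        print(f"  text: {segment['text']}")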

Summary
In this post, we provided an overview of the new Amazon Transcribe Toxicity Detection feature. We also described how you can parse the toxicity detection JSON output. For more information, check out the Amazon Transcribe console and try out the Transcription API with Toxicity Detection.
Amazon Transcribe Toxicity Detection is now available in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Europe (Ireland), and Europe (London). To learn more, visit Amazon Transcribe.
Learn more about content moderation on AWS and our content moderation ML use cases. Take the first step towards streamlining your content moderation operations with AWS.

About the author
Lana Zhang is a Senior Solutions Architect on the AWS WWSO AI Services team, specializing in AI and ML for content moderation, computer vision, and natural language processing. With her expertise, she is dedicated to promoting AWS AI/ML solutions and assisting customers in transforming their business solutions across diverse industries, including social media, gaming, e-commerce, and advertising & marketing.
Sumit Kumar is a Sr. Product Manager, Technical on the AWS AI Language Services team. He has 10 years of product management experience across a variety of domains and is passionate about AI/ML. Outside of work, Sumit loves to travel and enjoys playing cricket and lawn tennis.

Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Generative AI models have been experiencing rapid growth in recent months due to their impressive capabilities in creating realistic text, images, code, and audio. Among these models, Stable Diffusion models stand out for their unique strength in creating high-quality images based on text prompts. Stable Diffusion can generate a wide variety of high-quality images, including realistic portraits, landscapes, and even abstract art. And, like other generative AI models, Stable Diffusion models require powerful computing to provide low-latency inference.
In this post, we show how you can run Stable Diffusion models and achieve high performance at the lowest cost in Amazon Elastic Compute Cloud (Amazon EC2) using Amazon EC2 Inf2 instances powered by AWS Inferentia2. We look at the architecture of a Stable Diffusion model and walk through the steps of compiling a Stable Diffusion model using AWS Neuron and deploying it to an Inf2 instance. We also discuss the optimizations that the Neuron SDK automatically makes to improve performance. You can run both Stable Diffusion 2.1 and 1.5 versions on AWS Inferentia2 cost-effectively. Lastly, we show how you can deploy a Stable Diffusion model to an Inf2 instance with Amazon SageMaker.
The Stable Diffusion 2.1 model size is 5 GB in floating point 32 (FP32) and 2.5 GB in bfloat16 (BF16). A single inf2.xlarge instance has one AWS Inferentia2 accelerator with 32 GB of HBM memory, so the Stable Diffusion 2.1 model can fit on a single inf2.xlarge instance. Stable Diffusion is a text-to-image model that you can use to create images of different styles and content simply by providing a text prompt as an input. To learn more about the Stable Diffusion model architecture, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.

How the Neuron SDK optimizes Stable Diffusion performance
Before we can deploy the Stable Diffusion 2.1 model on AWS Inferentia2 instances, we need to compile the model components using the Neuron SDK. The Neuron SDK, which includes a deep learning compiler, runtime, and tools, compiles and automatically optimizes deep learning models so they can run efficiently on Inf2 instances and extract full performance of the AWS Inferentia2 accelerator. We have examples available for Stable Diffusion 2.1 model on the GitHub repo. This notebook presents an end-to-end example of how to compile a Stable Diffusion model, save the compiled Neuron models, and load it into the runtime for inference.
We use StableDiffusionPipeline from the Hugging Face diffusers library to load and compile the model. We then compile all the components of the model for Neuron using torch_neuronx.trace() and save the optimized model as TorchScript. Compilation processes can be quite memory-intensive, requiring a significant amount of RAM. To circumvent this, before tracing each model, we create a deepcopy of the part of the pipeline that’s being traced. Following this, we delete the pipeline object from memory using del pipe. This technique is particularly useful when compiling on instances with low RAM.
Additionally, we also perform optimizations to the Stable Diffusion models. UNet holds the most computationally intensive aspect of the inference. The UNet component operates on input tensors that have a batch size of two, generating a corresponding output tensor also with a batch size of two, to produce a single image. The elements within these batches are entirely independent of each other. We can take advantage of this behavior to get optimal latency by running one batch on each Neuron core. We compile the UNet for one batch (by using input tensors with one batch), then use the torch_neuronx.DataParallel API to load this single batch model onto each core. The output of this API is a seamless two-batch module: we can pass to the UNet the inputs of two batches, and a two-batch output is returned, but internally, the two single-batch models are running on the two Neuron cores. This strategy optimizes resource utilization and reduces latency.
Compile and deploy a Stable Diffusion model on an Inf2 EC2 instance
To compile and deploy the Stable Diffusion model on an Inf2 EC2 instance, sign in to the AWS Management Console and create an inf2.8xlarge instance. Note that an inf2.8xlarge instance is required only for the compilation of the model because compilation requires higher host memory. The Stable Diffusion model can be hosted on an inf2.xlarge instance. You can find the latest AMI with Neuron libraries using the following AWS Command Line Interface (AWS CLI) command:

aws ec2 describe-images --region us-east-1 --owners amazon \
    --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch 1.13.? (Amazon Linux 2) ????????' 'Name=state,Values=available' \
    --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
    --output text

For this example, we created an EC2 instance using the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04). You can then create a JupyterLab environment by connecting to the instance and running the following steps:

source /opt/aws_neuron_venv_pytorch/bin/activate
pip install jupyterlab
jupyter-lab

A notebook with all the steps for compiling and hosting the model is located on GitHub.
Let’s look at the compilation steps for one of the text encoder blocks. Other blocks that are part of the Stable Diffusion pipeline can be compiled similarly.
The first step is to load the pre-trained model from Hugging Face. The StableDiffusionPipeline.from_pretrained method loads the pre-trained model into our pipeline object, pipe. We then create a deepcopy of the text encoder from our pipeline, effectively cloning it. The del pipe command is then used to delete the original pipeline object, freeing up the memory that was consumed by it. Here, we are quantizing the model to BF16 weights:

model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
text_encoder = copy.deepcopy(pipe.text_encoder)
del pipe

This step involves wrapping our text encoder with the NeuronTextEncoder wrapper. The output of the compiled text encoder module will be a dict; this wrapper converts it to a list type:

text_encoder = NeuronTextEncoder(text_encoder)

We initialize PyTorch tensor emb with some values. The emb tensor is used as example input for the torch_neuronx.trace function. This function traces our text encoder and compiles it into a format optimized for Neuron. The directory path for the compiled model is constructed by joining COMPILER_WORKDIR_ROOT with the subdirectory text_encoder:

emb = torch.tensor([...])
text_encoder_neuron = torch_neuronx.trace(
    text_encoder.neuron_text_encoder,
    emb,
    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),
)

The compiled text encoder is saved using torch.jit.save. It’s stored under the file name model.pt in the text_encoder directory of our compiler’s workspace:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')
torch.jit.save(text_encoder_neuron, text_encoder_filename)

The notebook includes similar steps to compile other components of the model: UNet, VAE decoder, and VAE post_quant_conv. After you have compiled all the models, you can load and run the model following these steps:

Define the paths for the compiled models.
Load a pre-trained StableDiffusionPipeline model, with its configuration specified to use the bfloat16 data type.
Load the UNet model onto two Neuron cores using the torch_neuronx.DataParallel API. This allows data parallel inference to be performed, which can significantly speed up model performance.
Load the remaining parts of the model (text_encoder, decoder, and post_quant_conv) onto a single Neuron core.

You can then run the pipeline by providing input text as prompts (a minimal invocation sketch follows the prompts). The following are some pictures generated by the model for the prompts:

Portrait of renaud sechan, pen and ink, intricate line drawings, by craig mullins, ruan jia, kentaro miura, greg rutkowski, loundraw

Portrait of old coal miner in 19th century, beautiful painting, with highly detailed face painting by greg rutkowski

A castle in the middle of a forest
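
As a minimal sketch, generating one of these images is a single call on the pipe object assembled in the notebook (pipe here is the StableDiffusionPipeline with the compiled Neuron components loaded):

prompt = "a castle in the middle of a forest"
image = pipe(prompt).images[0]
image.save("castle.png")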

Host Stable Diffusion 2.1 on AWS Inferentia2 and SageMaker
Hosting Stable Diffusion models with SageMaker also requires compilation with the Neuron SDK. You can complete the compilation ahead of time or during runtime using Large Model Inference (LMI) containers. Compilation ahead of time allows for faster model loading times and is the preferred option.
SageMaker LMI containers provide two ways to deploy the model:

A no-code option where we just provide a serving.properties file with the required configurations
Bring your own inference script

We look at both solutions and go over the configurations and the inference script (model.py). In this post, we demonstrate the deployment using a pre-compiled model stored in an Amazon Simple Storage Service (Amazon S3) bucket. You can use this pre-compiled model for your deployments.
Configure the model with a provided script
In this section, we show how to configure the LMI container to host the Stable Diffusion models. The SD2.1 notebook is available on GitHub. The first step is to create the model configuration package with the following directory structure. Our aim is to use the minimal model configuration needed to host the model. The directory structure needed is as follows:

<config-root-directory>/
├── serving.properties
└── model.py [OPTIONAL]

Next, we create the serving.properties file with the following parameters:

%%writefile code_sd/serving.properties
engine=Python
option.entryPoint=djl_python.transformers-neuronx
option.use_stable_diffusion=True
option.model_id=s3url
option.tensor_parallel_degree=2
option.dtype=bf16

The parameters specify the following:

option.model_id – The LMI containers use s5cmd to load the model from the S3 location, so we need to specify the S3 location of our compiled weights.
option.entryPoint – To use the built-in handlers, we specify the transformers-neuronx class. If you have a custom inference script, you need to provide that instead.
option.dtype – This specifies the data type in which to load the weights. For this post, we use BF16, which reduces our memory requirements compared to FP32 and lowers our latency as a result.
option.tensor_parallel_degree – This parameter specifies the number of accelerators we use for this model. The AWS Inferentia2 chip accelerator has two Neuron cores, so specifying a value of 2 means we use one accelerator (two cores). This means we can create multiple workers to increase the throughput of the endpoint.
option.engine – This is set to Python to indicate we will not be using other engines like DeepSpeed or FasterTransformer for this hosting.

Bring your own script
If you want to bring your own custom inference script, you need to remove the option.entryPoint from serving.properties. The LMI container in that case will look for a model.py file in the same location as the serving.properties and use that to run the inferencing.
Create your own inference script (model.py)
Creating your own inference script is relatively straightforward using the LMI container. The container requires your model.py file to have an implementation of the following method:

def handle(inputs: Input), which returns an object of type Output
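
A bare-bones sketch of such a file is shown below. The Input and Output classes come from the djl_python module provided by the LMI container; the model loading and image encoding details here are simplified placeholders rather than the exact notebook code.

import base64
import io

from djl_python import Input, Output

pipe = None  # populated lazily on the first invocation


def get_model(properties):
    # Placeholder: build the StableDiffusionPipeline and load the compiled
    # Neuron components here (see the loading snippets later in this section)
    raise NotImplementedError


def handle(inputs: Input) -> Output:
    global pipe
    if pipe is None:
        pipe = get_model(inputs.get_properties())

    if inputs.is_empty():
        # The container sends an empty request at startup to warm up the model
        return None

    request = inputs.get_as_json()
    image = pipe(request["prompt"]).images[0]

    # Return the image as a base64-encoded PNG inside a JSON payload
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    payload = {"generated_image": base64.b64encode(buffer.getvalue()).decode("utf-8")}
    return Output().add_as_json(payload)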

Let's examine some of the critical areas of the attached notebook, which demonstrates the bring-your-own-script option.
Replace the cross_attention module with the optimized version:

# Replace original cross-attention module with custom cross-attention module for better performance
CrossAttention.get_attention_scores = get_attention_scores

Load the compiled weights for the following components:

text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder.pt')
decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder.pt')
unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet.pt')
post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv.pt')

These are the names of the compiled weights files we used when creating the compilations. Feel free to change the file names, but make sure your weights file names match what you specify here.
Then we need to load them using the Neuron SDK and set these in the actual model weights. When loading the UNet optimized weights, note we are also specifying the number of Neuron cores we need to load these onto. Here, we load to a single accelerator with two cores:

# Load the compiled UNet onto two Neuron cores
pipe.unet = NeuronUNet(UNetWrap(pipe.unet))
logging.info("Loading model: unet:created")
device_ids = [idx for idx in range(tensor_parallel_degree)]

pipe.unet.unetwrap = torch_neuronx.DataParallel(torch.jit.load(unet_filename), device_ids, set_dynamic_batching=False)

# Load other compiled models onto a single Neuron core

# - load encoders
pipe.text_encoder = NeuronTextEncoder(pipe.text_encoder)
clip_compiled = torch.jit.load(text_encoder_filename)
pipe.text_encoder.neuron_text_encoder = clip_compiled

# - load decoders
pipe.vae.decoder = torch.jit.load(decoder_filename)
pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)

Running the inference with a prompt invokes the pipe object to generate an image.
Create the SageMaker endpoint
We use Boto3 APIs to create a SageMaker endpoint. Complete the following steps:

Create the tarball with just the serving.properties and optional model.py files and upload it to Amazon S3 (see the sketch after this list).
Create the model using the image container and the model tarball uploaded earlier.
Create the endpoint config using the following key parameters:

Use an ml.inf2.xlarge instance.
Set ContainerStartupHealthCheckTimeoutInSeconds to 240 to ensure the health check starts after the model is deployed.
Set VolumeSizeInGB to a larger value so it can be used for loading the model weights, which are 32 GB in size.
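
A sketch of the first step, packaging and uploading the configuration (the paths and S3 prefix are placeholders):

import tarfile

import sagemaker

sess = sagemaker.Session()

# Package serving.properties (and the optional model.py) into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code_sd/serving.properties", arcname="serving.properties")
    # tar.add("code_sd/model.py", arcname="model.py")  # only when bringing your own script

# Upload the tarball to Amazon S3; this URI is the s3_code_artifact used in later steps
s3_code_artifact = sess.upload_data(
    "model.tar.gz", bucket=sess.default_bucket(), key_prefix="inf2-sd/code"
)
print(s3_code_artifact)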

Create a SageMaker model
After you create the model.tar.gz file and upload it to Amazon S3, we need to create a SageMaker model. We use the LMI container and the model artifact from the previous step to create the SageMaker model. SageMaker allows us to customize and inject various environment variables. For this workflow, we can leave everything as default. See the following code:

inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0 djl-serving-inf2"
)

Create the model object, which essentially creates a lockdown container that is loaded onto the instance and used for inferencing:

model_name = name_from_base("inf2-sd")
create_model_response = boto3_sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)

Create a SageMaker endpoint
In this demo, we use an ml.inf2.xlarge instance. We need to set the VolumeSizeInGB parameter to provide the necessary disk space to load the model and the weights. This parameter is applicable to instances supporting Amazon Elastic Block Store (Amazon EBS) volume attachment. We can leave the model download timeout and container startup health check at a higher value, which will give adequate time for the container to pull the weights from Amazon S3 and load them into the AWS Inferentia2 accelerators. For more details, refer to CreateEndpointConfig.

endpoint_config_response = boto3_sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.inf2.xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 360,
            "VolumeSizeInGB": 400,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = boto3_sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Invoke the model endpoint
This is a generative model, so we pass in the prompt that the model uses to generate the image. The payload is of the type JSON:

response_model = boto3_sm_run_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "prompt": "Mountain Landscape",
            "parameters": {},
        }
    ),
    ContentType="application/json",
)
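
The response schema depends on your inference script. Assuming a handler that returns a base64-encoded PNG under a generated_image key (as in the sketch earlier in this post), decoding it could look like the following:

import base64
import json

result = json.loads(response_model["Body"].read())
image_bytes = base64.b64decode(result["generated_image"])

with open("mountain_landscape.png", "wb") as f:
    f.write(image_bytes)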

Benchmarking the Stable Diffusion model on Inf2
We ran a few tests to benchmark the Stable Diffusion model with the BF16 data type on Inf2, and we are able to derive latency numbers that rival or exceed some of the other accelerators for Stable Diffusion. This, coupled with the lower cost of AWS Inferentia2 chips, makes this an extremely valuable proposition.
The following numbers are from the Stable Diffusion model deployed on an inf2.xl instance. For more information about costs, refer to Amazon EC2 Inf2 Instances.

| Model | Resolution | Data type | Iterations | P95 Latency (ms) | Inf2.xl On-Demand cost per hour | Inf2.xl cost per image |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512×512 | bf16 | 50 | 2,427.4 | $0.76 | $0.0005125 |
| Stable Diffusion 1.5 | 768×768 | bf16 | 50 | 8,235.9 | $0.76 | $0.0017387 |
| Stable Diffusion 1.5 | 512×512 | bf16 | 30 | 1,456.5 | $0.76 | $0.0003075 |
| Stable Diffusion 1.5 | 768×768 | bf16 | 30 | 4,941.6 | $0.76 | $0.0010432 |
| Stable Diffusion 2.1 | 512×512 | bf16 | 50 | 1,976.9 | $0.76 | $0.0004174 |
| Stable Diffusion 2.1 | 768×768 | bf16 | 50 | 6,836.3 | $0.76 | $0.0014432 |
| Stable Diffusion 2.1 | 512×512 | bf16 | 30 | 1,186.2 | $0.76 | $0.0002504 |
| Stable Diffusion 2.1 | 768×768 | bf16 | 30 | 4,101.8 | $0.76 | $0.0008659 |

Conclusion
In this post, we dove deep into the compilation, optimization, and deployment of the Stable Diffusion 2.1 model using Inf2 instances. We also demonstrated deployment of Stable Diffusion models using SageMaker. Inf2 instances also deliver great price performance for Stable Diffusion 1.5. To learn more about why Inf2 instances are great for generative AI and large language models, refer to Amazon EC2 Inf2 Instances for Low-Cost, High-Performance Generative AI Inference are Now Generally Available. For performance details, refer to Inf2 Performance. Check out additional examples on the GitHub repo.
Special thanks to Matthew Mcclain, Beni Hegedus, Kamran Khan, Shruti Koparkar, and Qing Lan for reviewing and providing valuable inputs.

About the Authors
Vivek Gangasani is a Senior Machine Learning Solutions Architect at Amazon Web Services. He works with machine learning startups to build and deploy AI/ML applications on AWS. He is currently focused on delivering solutions for MLOps, ML inference, and low-code ML. He has worked on projects in different domains, including natural language processing and computer vision.
K.C. Tung is a Senior Solution Architect in AWS Annapurna Labs. He specializes in large deep learning model training and deployment at scale in the cloud. He has a Ph.D. in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas. He has spoken at AWS Summits and AWS re:Invent. Today he helps customers train and deploy large PyTorch and TensorFlow models in the AWS Cloud. He is the author of two books: Learn TensorFlow Enterprise and TensorFlow 2 Pocket Reference.
Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Dream First, Learn Later: DECKARD is an AI Approach That Uses LLMs for Training Reinforcement Learning (RL) Agents

Reinforcement learning (RL) is a popular approach to training autonomous agents that can learn to perform complex tasks by interacting with their environment. RL enables them to learn the best action in different conditions and adapt to their environment using a reward system.

A major challenge in RL is how to efficiently explore the vast state space of many real-world problems. This challenge arises because RL agents learn by interacting with their environment through exploration. Think of an agent that tries to play Minecraft. If you have heard of it before, you know how complicated the Minecraft crafting tree is: there are hundreds of craftable objects, and you might need to craft one to craft another, and so on. It is a really complex environment.

As the environment can have a large number of possible states and actions, it can become difficult for the agent to find the optimal policy through random exploration alone. The agent must balance exploiting the current best policy against exploring new parts of the state space to potentially find a better policy. Finding efficient exploration methods that can balance exploration and exploitation is an active area of research in RL.

It's known that practical decision-making systems need to use prior knowledge about a task efficiently. By having prior information about the task itself, the agent can better adapt its policy and avoid getting stuck in sub-optimal policies. However, most reinforcement learning methods currently train without any prior training or external knowledge.

But why is that the case? In recent years, there has been growing interest in using large language models (LLMs) to aid RL agents in exploration by providing external knowledge. This approach has shown promise, but there are still many challenges to overcome, such as grounding the LLM knowledge in the environment and dealing with the accuracy of LLM outputs.

So, should we give up on using LLMs to aid RL agents? If not, how can we fix those problems and then use them again to guide RL agents? The answer has a name, and it’s DECKARD.

Overview of DECKARD. Source: https://arxiv.org/abs/2301.12050

DECKARD is trained for Minecraft, as crafting a specific item in Minecraft can be a challenging task if one lacks expert knowledge of the game. This has been demonstrated by studies that have shown that achieving a goal in Minecraft can be made easier through the use of dense rewards or expert demonstrations. As a result, item crafting in Minecraft has become a persistent challenge in the field of AI.

DECKARD utilizes a few-shot prompting technique on a large language model (LLM) to generate an Abstract World Model (AWM) for subgoals. It uses the LLM to hypothesize an AWM, which means it dreams about the task and the steps to solve it. Then, it wakes up and learns a modular policy of subgoals that it generates during dreaming. Since this is done in the real environment, DECKARD can verify the hypothesized AWM. The AWM is corrected during the waking phase, and discovered nodes are marked as verified to be used again in the future.

Experiments show us that LLM guidance is essential to exploration in DECKARD, with a version of the agent without LLM guidance taking over twice as long to craft most items during open-ended exploration. When exploring a specific task, DECKARD improves sample efficiency by orders of magnitude compared to comparable agents, demonstrating the potential for robustly applying LLMs to RL.

Check out the research paper, code, and project.



Researchers from China Propose a Data Augmentation Approach CarveMix for Brain Lesion Segmentation

Automated brain lesion segmentation using convolutional neural networks (CNNs) has become a valuable clinical diagnosis and research tool. However, CNN-based approaches still face challenges in accurately segmenting brain lesions due to the scarcity of annotated training data. Data augmentation strategies that mix pairs of annotated images have been developed to improve the training of CNNs. However, existing methods based on image mixing are not designed for brain lesions and may not perform well for brain lesion segmentation. 

Before using CNN-based approaches, previous studies on automated brain lesion segmentation relied on traditional machine-learning techniques. Recent developments in CNNs have resulted in substantial enhancements in segmentation performance. Examples of these recent developments include 3D DenseNet, U-Net, Context-Aware Network (CANet), and uncertainty-aware CNN, which have been proposed for segmenting various types of brain lesions. However, despite these advancements, accurately segmenting brain lesions remains challenging.

Thus, a research team from China recently proposed a simple and effective data augmentation approach called CarveMix, which is lesion-aware and preserves the lesion information during image combination.

CarveMix, a data augmentation approach, is lesion-aware and designed specifically for CNN-based brain lesion segmentation. It stochastically combines two annotated images to obtain new labeled samples. CarveMix carves a region of interest (ROI) from one annotated image according to the lesion location and geometry with a variable ROI size. The carved ROI then replaces the corresponding voxels in a second annotated image to synthesize new labeled images for network training. The method also applies additional harmonization steps for heterogeneous data from different sources and models the mass effect unique to whole brain tumor segmentation during image mixing.

Concretely, the main steps of the proposed approach for brain lesion segmentation are the following (a toy sketch of the mixing step follows the list):

The authors use a set of 3D annotated images with brain lesions to train a CNN for automated brain lesion segmentation.

From the annotated images, the data augmentation is performed using CarveMix, which is based on lesion-aware image mixing.

To perform image mixing, the authors take an annotated image pair and extract a 3D ROI from one image according to the lesion location and geometry given by the annotation.

Then the ROI is mixed into the other image, replacing the corresponding region, and the annotation is adjusted accordingly.

Finally, synthetic images and annotations are obtained that can be used to improve network training. The authors repeat the process to generate diverse annotated training data.
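
To make the mixing step concrete, the following toy NumPy sketch (our simplified illustration, not the authors' implementation) carves a lesion ROI from one labeled volume and pastes it into another:

import numpy as np


def carvemix_pair(image_a, label_a, image_b, label_b, max_pad=4, rng=None):
    """Toy lesion-aware mixing: carve a box around the lesion in (image_a, label_a)
    and paste it into (image_b, label_b). The real CarveMix samples the ROI size from
    a distribution, follows the lesion geometry, and adds harmonization and
    mass-effect modeling."""
    rng = np.random.default_rng() if rng is None else rng
    pad = int(rng.integers(0, max_pad + 1))  # crude stand-in for the sampled ROI size

    # Bounding box of the lesion voxels, padded by the sampled amount
    coords = np.argwhere(label_a > 0)
    lo = np.maximum(coords.min(axis=0) - pad, 0)
    hi = np.minimum(coords.max(axis=0) + pad + 1, np.array(label_a.shape))

    # Carving mask: True inside the ROI carved from volume A
    mask = np.zeros(label_a.shape, dtype=bool)
    mask[tuple(slice(l, h) for l, h in zip(lo, hi))] = True

    # Replace the corresponding voxels in volume B and adjust its annotation
    mixed_image = np.where(mask, image_a, image_b)
    mixed_label = np.where(mask, label_a, label_b)
    return mixed_image, mixed_label


# Example with random 3D volumes of matching shape
a_img, b_img = np.random.rand(32, 32, 32), np.random.rand(32, 32, 32)
a_lbl, b_lbl = np.zeros((32, 32, 32), int), np.zeros((32, 32, 32), int)
a_lbl[10:15, 10:15, 10:15] = 1  # synthetic "lesion" in volume A
new_img, new_lbl = carvemix_pair(a_img, a_lbl, b_img, b_lbl)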

The proposed method was evaluated on several datasets for brain lesion segmentation and compared to traditional data augmentation (TDA), Mixup, and CutMix. Results show that CarveMix+TDA outperformed the competing methods regarding Dice coefficient, Hausdorff distance, precision, and recall. The proposed method reduced false negative predictions and under-segmentation of lesions. The benefit of CarveMix alone without online TDA was also shown.

To summarize, CarveMix was proposed as a data augmentation technique for brain lesion segmentation. CarveMix combines annotated training images to create synthetic training images. The combination is lesion-aware, taking into account the location and shape of the lesions with a randomly sampled size parameter. To ensure consistency when combining data from different sources, harmonization steps are introduced. Additionally, mass-effect modeling is incorporated to improve CarveMix specifically for whole brain tumor segmentation. The experimental results on four brain lesion segmentation tasks show that CarveMix improves accuracy and outperforms other data augmentation strategies.

Check out the Paper.



Meet ShortGPT: A Powerful AI Framework for Automating Content Creation …

In the fast-paced world of digital content creation, efficiency and creativity are paramount. Meet ShortGPT, a robust framework designed to automate content creation and streamline the video production process. Leveraging the capabilities of Large Language Models (LLMs) and cutting-edge technologies, ShortGPT simplifies video creation, footage sourcing, voiceover synthesis, and editing tasks like never before.

The Automated Editing Framework

At the heart of ShortGPT lies an innovative LLM-oriented video editing language, which serves as the backbone of the automated editing framework. This language breaks down the editing process into manageable and customizable blocks, making it comprehensible to Large Language Models. This enables ShortGPT to efficiently generate scripts and prompts for various automated editing processes, providing ready-to-use resources for creators.

Multi-Language Voiceover and Content Creation

ShortGPT is designed to support multiple languages, ensuring a global reach for content creators. ShortGPT’s voiceover synthesis capabilities empower creators to deliver content in their preferred language, breaking language barriers and reaching diverse audiences worldwide, from English and Spanish to Arabic, French, Polish, German, Italian, and Portuguese.

Automated Caption Generation and Asset Sourcing

Captioning is a critical aspect of video content, enhancing accessibility and engagement. With ShortGPT’s automatic caption generation, creators can effortlessly add captions to their videos, saving time and effort. Additionally, ShortGPT sources images and video footage from the internet, connecting with the web and utilizing the Pexels API to access a vast library of high-quality visuals. This feature streamlines the process of finding relevant assets, further expediting the content creation workflow.

Memory and Persistency for Seamless Editing

ShortGPT ensures the long-term persistence of automated editing variables through the use of TinyDB, a lightweight database. This feature enables the framework to remember user preferences and settings, allowing seamless and consistent editing experiences across multiple sessions.
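
For illustration only (not ShortGPT’s actual code), a tool can persist editing preferences across sessions with TinyDB along these lines; the file name and keys here are hypothetical:

from tinydb import TinyDB, Query

db = TinyDB("editing_state.json")                      # hypothetical state file
Pref = Query()
db.upsert({"user": "default", "voice": "en-US", "caption_style": "bold"},
          Pref.user == "default")                      # remember preferences across runs
print(db.search(Pref.user == "default"))               # restore them in the next session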

Easy Implementation with Google Colab

ShortGPT offers a Google Colab notebook option for those who prefer a hassle-free setup, eliminating the need to install prerequisites on local systems. This web-based interface is free and readily accessible, enabling users to run ShortGPT without any installation requirements.

Installation Steps and API Integration

ShortGPT’s detailed installation guide provides step-by-step instructions for setting up ImageMagick and FFmpeg and cloning the repository. Additionally, the framework integrates with the OpenAI and ElevenLabs APIs, requiring users to input their API keys for smooth automation of tasks.

Customizable and Flexible

The flexibility of ShortGPT shines through its various engines: ContentShortEngine, ContentVideoEngine, and Automated EditingEngine. Creators can choose the engine that best suits their project, whether they are creating short videos, producing longer content, or require customizable editing options.

Open-Source and Evolving

As an open-source project, ShortGPT actively encourages contributions from the community. The developers value new features, improved infrastructure, and better documentation to keep ShortGPT at the forefront of content creation with AI.

ShortGPT is a game-changer in content creation, revolutionizing video production with AI automation. Its robust framework, multi-language support, automated captioning, and asset-sourcing capabilities empower creators to efficiently produce engaging, high-quality content. With a user-friendly interface and continuous development, ShortGPT promises to drive the future of AI-powered content creation, inspiring creativity and simplifying the video production process for creators worldwide.

Check out the GitHub.


AWS Reaffirms its Commitment to Responsible Generative AI

As a pioneer in artificial intelligence and machine learning, AWS is committed to developing and deploying generative AI responsibly
As one of the most transformational innovations of our time, generative AI continues to capture the world’s imagination, and we remain as committed as ever to harnessing it responsibly. With a team of dedicated responsible AI experts, complemented by our engineering and development organization, we continually test and assess our products and services to define, measure, and mitigate concerns about accuracy, fairness, intellectual property, appropriate use, toxicity, and privacy. And while we don’t have all of the answers today, we are working alongside others to develop new approaches and solutions to address these emerging challenges. We believe we can drive innovation in AI while continuing to implement the necessary safeguards to protect our customers and consumers.
At AWS, we know that generative AI technology and how it is used will continue to evolve, posing new challenges that will require additional attention and mitigation. That’s why Amazon is actively engaged with organizations and standards bodies focused on the responsible development of next-generation AI systems, including NIST, ISO, the Responsible AI Institute, and the Partnership on AI. In fact, last week at the White House, Amazon signed voluntary commitments to foster the safe, responsible, and effective development of AI technology. We are eager to share knowledge with policymakers, academics, and civil society, as we recognize the unique challenges posed by generative AI will require ongoing collaboration.
This commitment is consistent with our approach to developing our own generative AI services, including building foundation models (FMs) with responsible AI in mind at each stage of our comprehensive development process. Throughout design, development, deployment, and operations we consider a range of factors including 1/ accuracy, e.g., how closely a summary matches the underlying document or whether a biography is factually correct; 2/ fairness, e.g., whether outputs treat demographic groups similarly; 3/ intellectual property and copyright considerations; 4/ appropriate usage, e.g., filtering out user requests for legal advice, medical diagnoses, or illegal activities; 5/ toxicity, e.g., hate speech, profanity, and insults; and 6/ privacy, e.g., protecting personal information and customer prompts. We build solutions to address these issues into our processes for acquiring training data, into the FMs themselves, and into the technology that we use to pre-process user prompts and post-process outputs. For all our FMs, we invest actively to improve our features, and to learn from customers as they experiment with new use cases.
For example, Amazon’s Titan FMs are built to detect and remove harmful content in the data that customers provide for customization, reject inappropriate content in the user input, and filter the model’s outputs containing inappropriate content (such as hate speech, profanity, and violence).
To help developers build applications responsibly, Amazon CodeWhisperer provides a reference tracker that displays the licensing information for a code recommendation and provides a link to the corresponding open-source repository when necessary. This makes it easier for developers to decide whether to use the code in their project and make the relevant source code attributions as they see fit. In addition, Amazon CodeWhisperer filters out code recommendations that include toxic phrases, and recommendations that indicate bias.
Through innovative services like these, we will continue to help our customers realize the benefits of generative AI, while collaborating across the public and private sectors to ensure we’re doing so responsibly. Together, we will build trust among customers and the broader public, as we harness this transformative new technology as a force for good.

About the Author
Peter Hallinan leads initiatives in the science and practice of Responsible AI at AWS AI, alongside a team of responsible AI experts. He has deep expertise in AI (PhD, Harvard) and entrepreneurship (Blindsight, sold to Amazon). His volunteer activities have included serving as a consulting professor at the Stanford University School of Medicine, and as the president of the American Chamber of Commerce in Madagascar. When possible, he’s off in the mountains with his children: skiing, climbing, hiking, and rafting.

Use generative AI foundation models in VPC mode with no internet conne …

With recent advancements in generative AI, there is a lot of discussion about how to use generative AI across different industries to solve specific business problems. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is all backed by very large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). These FMs can perform a wide range of tasks that span multiple domains, like writing blog posts, generating images, solving math problems, engaging in dialog, and answering questions based on a document. The size and general-purpose nature of FMs make them different from traditional ML models, which typically perform specific tasks, like analyzing text for sentiment, classifying images, and forecasting trends.
While organizations are looking to use the power of these FMs, they also want the FM-based solutions to run in their own protected environments. Organizations operating in heavily regulated spaces like global financial services and healthcare and life sciences have audit and compliance requirements to run their environment in their VPCs. In fact, direct internet access is often disabled in these environments to avoid exposure to any unintended traffic, both ingress and egress.
Amazon SageMaker JumpStart is an ML hub offering algorithms, models, and ML solutions. With SageMaker JumpStart, ML practitioners can choose from a growing list of best performing open source FMs. It also provides the ability to deploy these models in your own Virtual Private Cloud (VPC).
In this post, we demonstrate how to use JumpStart to deploy a Flan-T5 XXL model in a VPC with no internet connectivity. We discuss the following topics:

How to deploy a foundation model using SageMaker JumpStart in a VPC with no internet access
Advantages of deploying FMs via SageMaker JumpStart models in VPC mode
Alternate ways to customize deployment of foundation models via JumpStart

Apart from FLAN-T5 XXL, JumpStart provides a lot of different foundation models for various tasks. For the complete list, check out Getting started with Amazon SageMaker JumpStart.
Solution overview
As part of the solution, we cover the following steps:

Set up a VPC with no internet connection.
Set up Amazon SageMaker Studio using the VPC we created.
Deploy the generative AI Flan-T5 XXL foundation model using JumpStart in the VPC with no internet access.

The following is an architecture diagram of the solution.

Let’s walk through the different steps to implement this solution.
Prerequisites
To follow along with this post, you need the following:

Access to an AWS account. For details, check out Creating an AWS account.
An AWS Identity and Access Management (IAM) role with permissions to deploy the AWS CloudFormation templates used in this solution and manage resources as part of the solution.

Set up a VPC with no internet connection
Create a new CloudFormation stack by using the 01_networking.yaml template. This template creates a new VPC and adds two private subnets across two Availability Zones with no internet connectivity. It then deploys gateway VPC endpoints for accessing Amazon Simple Storage Service (Amazon S3) and interface VPC endpoints for SageMaker and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink.
Provide a stack name, such as No-Internet, and complete the stack creation process.

This solution is not highly available because the CloudFormation template creates interface VPC endpoints only in one subnet to reduce costs when following the steps in this post.
Set up Studio using the VPC
Create another CloudFormation stack using 02_sagemaker_studio.yaml, which creates a Studio domain, Studio user profile, and supporting resources like IAM roles. Choose a name for the stack; for this post, we use the name SageMaker-Studio-VPC-No-Internet. Provide the name of the VPC stack you created earlier (No-Internet) as the CoreNetworkingStackName parameter and leave everything else as default.

Wait until AWS CloudFormation reports that the stack creation is complete. You can confirm the Studio domain is available to use on the SageMaker console.

To verify the Studio domain user has no internet access, launch Studio using the SageMaker console. Choose File, New, and Terminal, then attempt to access an internet resource. As shown in the following screenshot, the terminal will keep waiting for the resource and eventually time out.

This proves that Studio is operating in a VPC that doesn’t have internet access.
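
If you prefer to verify this from a notebook instead of the terminal, a quick check like the following (the URL is just an example of a public endpoint) should time out rather than return a response:

import urllib.request

try:
    urllib.request.urlopen("https://aws.amazon.com", timeout=5)  # example public URL
    print("Internet access is available")
except Exception as err:
    print(f"No internet access: {err}")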
Deploy the generative AI foundation model Flan-T5 XXL using JumpStart
We can deploy this model via Studio as well as via API. JumpStart provides all the code to deploy the model via a SageMaker notebook accessible from within Studio. For this post, we showcase this capability from Studio.

On the Studio welcome page, choose JumpStart under Prebuilt and automated solutions.

Choose the Flan-T5 XXL model under Foundation Models.

By default, it opens the Deploy tab. Expand the Deployment Configuration section to change the hosting instance and endpoint name, or add any additional tags. There is also an option to change the S3 bucket location where the model artifact will be stored for creating the endpoint. For this post, we leave everything at its default values. Make a note of the endpoint name to use while invoking the endpoint for making predictions.

Expand the Security Settings section, where you can specify the IAM role for creating the endpoint. You can also specify the VPC configurations by providing the subnets and security groups. The subnet IDs and security group IDs can be found from the VPC stack’s Outputs tab on the AWS CloudFormation console. SageMaker JumpStart requires at least two subnets as part of this configuration. The subnets and security groups control access to and from the model container.

NOTE: Irrespective of whether the SageMaker JumpStart model is deployed in the VPC or not, the model always runs in network isolation mode, which isolates the model container so no inbound or outbound network calls can be made to or from the model container. Because we’re using a VPC, SageMaker downloads the model artifact through our specified VPC. Running the model container in network isolation doesn’t prevent your SageMaker endpoint from responding to inference requests. A server process runs alongside the model container and forwards it the inference requests, but the model container doesn’t have network access.

Choose Deploy to deploy the model. We can see the near-real-time status of the endpoint creation in progress. The endpoint creation may take 5–10 minutes to complete.

Observe the value of the field Model data location on this page. All the SageMaker JumpStart models are hosted on a SageMaker managed S3 bucket (s3://jumpstart-cache-prod-{region}). Therefore, irrespective of which model is picked from JumpStart, the model gets deployed from the publicly accessible SageMaker JumpStart S3 bucket and the traffic never goes to the public model zoo APIs to download the model. This is why the model endpoint creation started successfully even when we’re creating the endpoint in a VPC that doesn’t have direct internet access.
The model artifact can also be copied to any private model zoo or your own S3 bucket to control and secure model source location further. You can use the following command to download the model locally using the AWS Command Line Interface (AWS CLI):
aws s3 cp s3://jumpstart-cache-prod-eu-west-1/huggingface-infer/prepack/v1.0.2/infer-prepack-huggingface-text2text-flan-t5-xxl.tar.gz .

After a few minutes, the endpoint gets created successfully and shows the status as In Service. Choose Open Notebook in the Use Endpoint from Studio section. This is a sample notebook provided as part of the JumpStart experience to quickly test the endpoint.

In the notebook, choose the image as Data Science 3.0 and the kernel as Python 3. When the kernel is ready, you can run the notebook cells to make predictions on the endpoint. Note that the notebook uses the invoke_endpoint() API from the AWS SDK for Python to make predictions. Alternatively, you can use the SageMaker Python SDK’s predict() method to achieve the same result.
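
As a rough sketch of what the notebook does, the following boto3 call invokes the endpoint you noted earlier; the request and response keys shown here follow the JumpStart sample notebook for this model family and may differ for other models:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="<your-flan-t5-xxl-endpoint>",        # the endpoint name you noted earlier
    ContentType="application/json",
    Body=json.dumps({"text_inputs": "Translate to German: How are you?"}),
)
print(json.loads(response["Body"].read()))             # e.g., a dict with generated_texts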

This concludes the steps to deploy the Flan-T5 XXL model using JumpStart within a VPC with no internet access.
Advantages of deploying SageMaker JumpStart models in VPC mode
The following are some of the advantages of deploying SageMaker JumpStart models in VPC mode:

Because SageMaker JumpStart doesn’t download the models from a public model zoo, it can be used in fully locked-down environments as well where there is no internet access
Because the network access can be limited and scoped down for SageMaker JumpStart models, this helps teams improve the security posture of the environment
Due to the VPC boundaries, access to the endpoint can also be limited via subnets and security groups, which adds an extra layer of security

Alternate ways to customize deployment of foundation models via SageMaker JumpStart
In this section, we share some alternate ways to deploy the model.
Use SageMaker JumpStart APIs from your preferred IDE
Models provided by SageMaker JumpStart don’t require you to access Studio. You can deploy them to SageMaker endpoints from any IDE, thanks to the JumpStart APIs. You could skip the Studio setup step discussed earlier in this post and use the JumpStart APIs to deploy the model. These APIs provide arguments where VPC configurations can be supplied as well. The APIs are part of the SageMaker Python SDK itself. For more information, refer to Pre-trained models.
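
As an illustrative sketch (not the post’s exact code), the following deploys the model from any IDE with the SageMaker Python SDK and passes VPC settings; the subnet, security group, and instance type values are placeholders, and the exact keyword arguments may vary with the SDK version:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="huggingface-text2text-flan-t5-xxl",
    vpc_config={                                       # keep endpoint traffic inside your VPC
        "Subnets": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
print(predictor.endpoint_name)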
Use notebooks provided by SageMaker JumpStart from SageMaker Studio
SageMaker JumpStart also provides notebooks to deploy the model directly. On the model detail page, choose Open notebook to open a sample notebook containing the code to deploy the endpoint. The notebook uses SageMaker JumpStart APIs that allow you to list and filter the models, retrieve the artifacts, and deploy and query the endpoints. You can also edit the notebook code per your use case-specific requirements.

Clean up resources
Check out the CLEANUP.md file to find detailed steps to delete the Studio, VPC, and other resources created as part of this post.
Troubleshooting
If you encounter any issues in creating the CloudFormation stacks, refer to Troubleshooting CloudFormation.
Conclusion
Generative AI powered by large language models is changing how people acquire and apply insights from information. However, organizations operating in heavily regulated spaces are required to use the generative AI capabilities in a way that allows them to innovate faster but also simplifies the access patterns to such capabilities.
We encourage you to try out the approach provided in this post to embed generative AI capabilities in your existing environment while still keeping it inside your own VPC with no internet access. For further reading on SageMaker JumpStart foundation models, check out the following:

Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data
Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models

About the authors
Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.
Mehran Nikoo is a Senior Solutions Architect at AWS, working with Digital Native businesses in the UK and helping them achieve their goals. Passionate about applying his software engineering experience to machine learning, he specializes in end-to-end machine learning and MLOps practices.

How Patsnap used GPT-2 inference on Amazon SageMaker with low latency …

This blog post was co-authored, and includes an introduction, by Zilong Bai, senior natural language processing engineer at Patsnap.
You’re likely familiar with the autocomplete suggestion feature when you search for something on Google or Amazon. Although the search terms in these scenarios are pretty common keywords or expressions that we use in daily life, in some cases search terms are very specific to the scenario. Patent search is one of them. Recently, the AWS Generative AI Innovation Center collaborated with Patsnap to implement a feature to automatically suggest search keywords as an innovation exploration to improve user experiences on their platform.
Patsnap provides a global one-stop platform for patent search, analysis, and management. They use big data (such as a history of past search queries) to provide many powerful yet easy-to-use patent tools. These tools have enabled Patsnap’s global customers to have a better understanding of patents, track recent technological advances, identify innovation trends, and analyze competitors in real time.
At the same time, Patsnap is embracing the power of machine learning (ML) to develop features that can continuously improve user experiences on the platform. A recent initiative is to simplify the difficulty of constructing search expressions by autofilling patent search queries using state-of-the-art text generation models. Patsnap had trained a customized GPT-2 model for such a purpose. Because, to the best of their knowledge, no such feature exists in any patent search engine, Patsnap believes adding this feature will increase end-user stickiness.
However, in their recent experiments, the inference latency and queries per second (QPS) of a PyTorch-based GPT-2 model couldn’t meet certain thresholds that can justify its business value. To tackle this challenge, AWS Generative AI Innovation Center scientists explored a variety of solutions to optimize GPT-2 inference performance, resulting in lowering the model latency by 50% on average and improving the QPS by 200%.
Large language model inference challenges and optimization approaches
In general, applying such a large model in a real-world production environment is non-trivial. The prohibitive computation cost and latency of PyTorch-based GPT-2 made it difficult to be widely adopted from a business operation perspective. In this project, our objective is to significantly improve the latency with reasonable computation costs. Specifically, Patsnap requires the following:

The average latency of model inference for generating search expressions needs to be controlled within 600 milliseconds in real-time search scenarios
The model requires high throughput and QPS to do a large number of searches per second during peak business hours

In this post, we discuss our findings using Amazon Elastic Compute Cloud (Amazon EC2) instances, featuring GPU-based instances using NVIDIA TensorRT.
In short, we use NVIDIA TensorRT to optimize the latency of GPT-2 and deploy it to an Amazon SageMaker endpoint for model serving, which reduces the average latency from 1,172 milliseconds to 531 milliseconds.
In the following sections, we go over the technical details of the proposed solutions with key code snippets and show comparisons with the customer’s status quo based on key metrics.
GPT-2 model overview
OpenAI’s GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on the WebText dataset, which contains 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and let it generate a lengthy continuation. In this situation, we exploit it to generate search queries. As GPT models keep growing larger, inference costs are continuously rising, which increases the need to deploy these models with acceptable cost.
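
For orientation only, the following snippet generates a continuation with the stock Hugging Face GPT-2 (not Patsnap’s customized model), which is the kind of PyTorch baseline that the optimization below starts from:

from transformers import pipeline

# Prime the model with a prompt and let it generate a short continuation.
generator = pipeline("text-generation", model="gpt2")
print(generator("solar panel mounting bracket for", max_new_tokens=20))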
Achieve low latency on GPU instances via TensorRT
TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators, supporting major deep learning frameworks such as PyTorch and TensorFlow. Previous studies have shown great performance improvement in terms of model latency. Therefore, it’s an ideal choice for us to reduce the latency of the target model on NVIDIA GPUs.
We are able to achieve a significant reduction in GPT-2 model inference latency with a TensorRT-based model on NVIDIA GPUs. The TensorRT-based model is deployed via SageMaker for performance tests. In this post, we show the steps to convert the original PyTorch-based GPT-2 model to a TensorRT-based model.
Converting the PyTorch-based GPT-2 to the TensorRT-based model is not difficult via the official tool provided by NVIDIA. In addition, with such straightforward conversions, no obvious model accuracy degradation has been observed. In general, there are three steps to follow:

Analyze your GPT-2 model. As of this writing, NVIDIA’s conversion tool only supports Hugging Face’s version of the GPT-2 model. If the current GPT-2 model isn’t the original version, you need to modify it accordingly. It’s recommended to strip out custom code from the original Hugging Face GPT-2 implementation, which is very helpful for the conversion.
Install the required Python packages. The conversion process first converts the PyTorch-based model to the ONNX model and then converts the ONNX-based model to the TensorRT-based model. The following Python packages are needed for this two-step conversion:

tabulate
toml
torch
sentencepiece==0.1.95
onnx==1.9.0
onnx_graphsurgeon
polygraphy
transformers

Convert your model. The following code contains the functions for the two-step conversion:

# The helpers used below (NetworkMetadata, Precision, GPT2Metadata, GPT2TorchFile,
# GPT2ONNXFile, GPT2TRTDecoder, Profile) come from NVIDIA's TensorRT demo/HuggingFace
# utilities; the exact import paths depend on the TensorRT OSS version you use.
def torch2onnx():
    metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=True), other=GPT2Metadata(kv_cache=False))
    gpt2 = GPT2TorchFile(model.to('cpu'), metadata)
    onnx_path = 'Your own path to save ONNX-based model'  # e.g., ./model_fp16.onnx
    gpt2.as_onnx_model(onnx_path, force_overwrite=False)
    return onnx_path, metadata

def onnx2trt(onnx_path, metadata):
    trt_path = 'Your own path to save TensorRT-based model'  # e.g., ./model_fp16.onnx.engine
    batch_size = 10
    max_sequence_length = 42
    profiles = [Profile().add(
        "input_ids",
        min=(1, 1),
        opt=(batch_size, max_sequence_length // 2),
        max=(batch_size, max_sequence_length),
    )]
    gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=trt_path, profiles=profiles)
    gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config, max_sequence_length=42, batch_size=10)
    return gpt2_trt

Latency comparison: PyTorch vs. TensorRT
JMeter is used for performance benchmarking in this project. JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services. We record the QPS and latency of the original PyTorch-based model and our converted TensorRT-based GPT-2 model on an AWS P3.2xlarge instance. As we show later in this post, due to the powerful acceleration ability of TensorRT, the latency of GPT-2 is significantly reduced. When the request concurrency is 1, the average latency has been reduced by 274 milliseconds (2.9 times faster). From the perspective of QPS, it is increased to 7 from 2.4, which is around a 2.9 times boost compared to the original PyTorch-based model. Moreover, as the concurrency increases, QPS keeps increasing. This suggests lower costs with acceptable latency increase (but still much faster than the original model).
The following table compares latency:

| Version | Concurrency | QPS | Maximum Latency (ms) | Minimum Latency (ms) | Average Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| Customer PyTorch version (on p3.2xlarge) | 1 | 2.4 | 632 | 105 | 417 |
| Customer PyTorch version (on p3.2xlarge) | 2 | 3.1 | 919 | 168 | 636 |
| Customer PyTorch version (on p3.2xlarge) | 3 | 3.4 | 1911 | 222 | 890 |
| Customer PyTorch version (on p3.2xlarge) | 4 | 3.4 | 2458 | 277 | 1172 |
| AWS TensorRT version (on p3.2xlarge) | 1 | 7 (+4.6) | 275 | 22 | 143 (-274 ms) |
| AWS TensorRT version (on p3.2xlarge) | 2 | 7.2 (+4.1) | 274 | 51 | 361 (-275 ms) |
| AWS TensorRT version (on p3.2xlarge) | 3 | 7.3 (+3.9) | 548 | 49 | 404 (-486 ms) |
| AWS TensorRT version (on p3.2xlarge) | 4 | 7.5 (+4.1) | 765 | 62 | 531 (-641 ms) |

Deploy TensorRT-based GPT-2 with SageMaker and a custom container
TensorRT-based GPT-2 requires a relatively recent TensorRT version, so we choose the bring your own container (BYOC) mode of SageMaker to deploy our model. BYOC mode provides a flexible way to deploy the model, and you can build customized environments in your own Docker container. In this section, we show how to build your own container, deploy your own GPT-2 model, and test with the SageMaker endpoint API.
Build your own container
The container’s file directory is presented in the following code. Specifically, Dockerfile and build.sh are used to build the Docker container. gpt2 and predictor.py implement the model and the inference API. serve, nginx.conf, and wsgi.py provide the configuration for the NGINX web server.

container
├── Dockerfile # build our Docker image from this file
├── build.sh # create our own image and push it to Amazon ECR
├── gpt2 # model directory
├── predictor.py # backend function for invoking the model
├── serve # web server setting file
├── nginx.conf # web server setting file
└── wsgi.py # web server setting file

You can run sh ./build.sh to build the container.
Deploy to a SageMaker endpoint
After you have built a container to run the TensorRT-based GPT-2, you can enable real-time inference via a SageMaker endpoint. Use the following code snippets to create the endpoint and deploy the model to the endpoint using the corresponding SageMaker APIs:

import boto3
from time import gmtime, strftime
from sagemaker import get_execution_role

sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
s3_bucket = '${Your s3 bucket}'
role = get_execution_role()
model_name = '${Your Model Name}'
# you need to push your container image to Amazon ECR first (see build.sh)
container = '${Your Image Path}'
instance_type = 'ml.p3.2xlarge'
container = {
    'Image': container
}
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[container])

# Endpoint configuration
endpoint_config_name = '${Your Endpoint Config Name}'
print('Endpoint config name: ' + endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])
print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

# Deploy the model
endpoint_name = '${Your Endpoint Name}'
print('Endpoint name: ' + endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

Test the deployed model
After the model is successfully deployed, you can test the endpoint via the SageMaker notebook instance with the following code:

import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='us-east-2')
endpoint_name = "${Your Endpoint Name}"
request_body = {"input": "amazon"}
payload = json.dumps(request_body)
content_type = "application/json"
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload  # Replace with your own data.
)
result = json.loads(response['Body'].read().decode())
print(result)

Conclusion
In this post, we described how to enable low-latency GPT-2 inference on SageMaker to create business value. Specifically, with the support of NVIDIA TensorRT, we can achieve 2.9 times acceleration on the NVIDIA GPU instances with SageMaker for a customized GPT-2 model.
If you want help with accelerating the use of GenAI models in your products and services, please contact the AWS Generative AI Innovation Center. The AWS Generative AI Innovation Center can help you make your ideas a reality faster and more effectively. To get started with the Generative AI Innovation Center, visit here.

About the Authors
Hao Huang is an applied scientist at the AWS Generative AI Innovation Center. He specializes in Computer Vision (CV) and Visual-Language Model (VLM). Recently, he has developed a strong interest in generative AI technologies and has already collaborated with customers to apply these cutting-edge technologies to their business. He is also a reviewer for AI conferences such as ICCV and AAAI.
Zilong Bai is a senior natural language processing engineer at Patsnap. He is passionate about research and proof-of-concept work on cutting-edge techniques for generative language models.
Yuanjun Xiao is a Solution Architect at AWS. He is responsible for AWS architecture consulting and design. He is also passionate about building AI and analytic solutions.
Xuefei Zhang is an applied scientist at the AWS Generative AI Innovation Center, works in NLP and AGI areas to solve industry problems with customers.
Guang Yang is a senior applied scientist at the AWS Generative AI Innovation Center where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Meet Chapyter: A New Jupyter Extension That Lets ChatGPT Assist You in …

Chapyter, developed by a group of language modeling researchers, is a new Jupyter plugin that integrates ChatGPT to help users write Python notebooks. The system can also read the results of previously executed cells.

Chapyter is an add-on for JupyterLab, allowing the integration of GPT-4 into the development environment without hassle. It has an interpreter that can take the description written in natural language and turn it into Python code that can be automatically executed. Chapyter can increase productivity and allow one to try new things by enabling “natural language programming” in the preferred IDE.

Essential features

The process of automatically generating code from natural language and running it.

The production of new code based on past code and the results of previous executions.

Code correction and bug fixing on the fly.

Customization options and full visibility into the AI’s setting prompts.

Prioritize privacy when utilizing cutting-edge AI technology.

The library’s prompts and settings are made public, and the researchers are working to simplify the customization of those prompts and settings. You can view them in Chapyter/programs.py.

Check out OpenAI’s API data usage policies for more information on how OpenAI handles training data. In contrast, anytime one uses Copilot or ChatGPT, part of the data will be cached and used in the training and analysis of those services. Chapyter comprises two main parts: an IPython magic command that manages the prompts and calls the GPT-X models, and a user interface that monitors Chapyter cell execution, runs freshly created cells, and updates cell styles automatically.

Many programmers prefer to work in notebooks in a “fragmented” fashion, writing only a few lines of code at a time before moving on to the next cell. Each cell’s mission or purpose is relatively modest and autonomous from those of neighboring cells. Subsequent work may have little in common with the preceding one. Adding the dataset loader, for instance, while creating a neural network, demands different ways of thinking and writing code. Constantly switching between tasks is not only inefficient but also potentially exhausting. It would be useful to simply type a command such as “Please load the dataset in a way to test the neural network” and let the machine do the rest.

Chapyter’s cell-level code development and autonomous execution facilitate a solution to this problem. When one creates a new cell, Chapyter will automatically invoke the GPT-X model to build the code and run it for them based on the text they write. Unlike systems like Copilot, which focus on supporting micro-tasks that span only a few lines of code but are highly relevant to ongoing work (such as finishing a function call), Chapyter aims to take over entire tasks, some of which may differ from the existing code.

Chapyter is a lightweight Python tool that integrates perfectly with JupyterLab after a local installation. By default, the OpenAI API is set up to discard the interaction data and code after calling the GPT-X models. The library contains all the standard prompts and “programs,” along with the option to load personalized prompts. By analyzing the previous coding decisions and runtime data, Chapyter can make intelligent recommendations. Files can be loaded if desired, and suggestions for additional processing and analysis will be provided.

Given the limitations of today’s AI, Chapyter was built so that its generated code may be easily debugged and improved.

The three-step installation process is straightforward to follow. In GitHub, at https://github.com/chapyter/chapyter, one may find further information.

The researchers will shortly release major enhancements to Chapyter that will make it even more flexible and secure in code generation and execution. They plan to put it through its paces on some of the most demanding and complex real-world coding tasks, like ensuring a Jupyter notebook with 300 cell executions has all the help it needs. They encourage users to try the tool, share their thoughts and opinions, and stay tuned for further improvements.

Check out the GitHub and Reference Article.



Use a generative AI foundation model for summarization and question an …

Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own dataset. Once you have a solid LLM, you’ll want to expose that LLM to business users to process new documents, which could be hundreds of pages long. In this post, we demonstrate how to construct a real-time user interface to let business users process a PDF document of arbitrary length. Once the file is processed, you can summarize the document or ask questions about the content. The sample solution described in this post is available on GitHub.
Working with financial documents
Financial statements like quarterly earnings reports and annual reports to shareholders are often tens or hundreds of pages long. These documents contain a lot of boilerplate language like disclaimers and legal language. If you want to extract the key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. And of course, you can’t ask an LLM questions about a document it has never seen.
LLMs used for summarization have a limit on the number of tokens (word pieces, not characters) passed into the model, and with some exceptions, these are typically no more than a few thousand tokens. That normally precludes the ability to summarize longer documents.
Our solution handles documents that exceed an LLM’s maximum token sequence length and makes those documents available to the LLM for question answering.
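
To see why limits are counted in tokens rather than characters, a quick check with a Hugging Face tokenizer (an illustration unrelated to the solution’s models) is enough:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # tokenizer choice is arbitrary here
text = "Quarterly earnings rose 12% year over year."
print(len(text), "characters ->", len(tokenizer.encode(text)), "tokens")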
Solution overview
Our design has three important pieces:

It has an interactive web application for business users to upload and process PDFs
It uses the langchain library to split a large PDF into more manageable chunks
It uses the retrieval augmented generation technique to let users ask questions about new data that the LLM hasn’t seen before

As shown in the following diagram, we use a front end implemented with React JavaScript hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that job is done, you can invoke an API that summarizes the text or answers questions about it.

Because some of these steps may take some time, the architecture uses a decoupled asynchronous approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls the Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is done, the front-end application can pick up the results from an Amazon DynamoDB table.
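
A minimal sketch of the decoupled pattern described above might look like the following Lambda handler, which only enqueues a summarization request and returns; the queue URL and event fields are placeholders:

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/summarize-requests"  # placeholder

def handler(event, context):
    # event is assumed to carry the S3 key of the extracted text
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"document_key": event["document_key"]}),
    )
    return {"statusCode": 202, "body": "summarization request queued"}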
For summarization, we use AI21’s Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. Although this model handles documents of up to 10,000 words (approximately 40 pages), we use langchain’s text splitter to make sure that each summarization call to the LLM is no more than 10,000 words long. For text generation, we use Cohere’s Medium model, and we use GPT-J for embeddings, both via JumpStart.
Summarization processing
When handling larger documents, we need to define how to split the document into smaller pieces. When we get the text extraction results back from Amazon Textract, we insert markers for larger chunks of text (a configurable number of pages), individual pages, and line breaks. Langchain will split based on those markers and assemble smaller documents that are under the token limit. See the following code:

text_splitter = RecursiveCharacterTextSplitter(
    separators=["<CHUNK>", "<PAGE>", "\n"],
    chunk_size=int(chunk_size),
    chunk_overlap=int(chunk_overlap))

with open(local_path) as f:
    doc = f.read()
texts = text_splitter.split_text(doc)
print(f"Number of splits: {len(texts)}")

llm = SageMakerLLM(endpoint_name=endpoint_name)

responses = []
for t in texts:
    r = llm(t)
    responses.append(r)
summary = "\n".join(responses)

The LLM in the summarization chain is a thin wrapper around our SageMaker endpoint:

class SageMakerLLM(LLM):

    endpoint_name: str

    @property
    def _llm_type(self) -> str:
        return "summarize"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = ai21.Summarize.execute(
            source=prompt,
            sourceType="TEXT",
            sm_endpoint=self.endpoint_name
        )
        return response.summary

Question answering
In the retrieval augmented generation method, we first split the document into smaller segments. We create embeddings for each segment and store them in the open-source Chroma vector database via langchain’s interface. We save the database in an Amazon Elastic File System (Amazon EFS) file system for later use. See the following code:

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
    chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(f"Number of splits: {len(texts)}")

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_name,
)
vectordb = Chroma.from_documents(texts, embeddings,
    persist_directory=persist_directory)
vectordb.persist()

When the embeddings are ready, the user can ask a question. We search the vector database for the text chunks that most closely match the question:

embeddings = SMEndpointEmbeddings(
    endpoint_name=endpoint_embed
)
vectordb = Chroma(persist_directory=persist_directory,
    embedding_function=embeddings)
docs = vectordb.similarity_search_with_score(question)

We take the closest matching chunk and use it as context for the text generation model to answer the question:

cohere_client = Client(endpoint_name=endpoint_qa)
context = docs[high_score_idx][0].page_content.replace("\n", "")
qa_prompt = f'Context={context}\nQuestion={question}\nAnswer='
response = cohere_client.generate(prompt=qa_prompt,
    max_tokens=512,
    temperature=0.25,
    return_likelihoods='GENERATION')
answer = response.generations[0].text.strip().replace('\n', '')

User experience
Although LLMs represent advanced data science, most of the use cases for LLMs ultimately involve interaction with non-technical users. Our example web application handles an interactive use case where business users can upload and process a new PDF document.
The following diagram shows the user interface. A user starts by uploading a PDF. After the document is stored in Amazon S3, the user is able to start the text extraction job. When that’s complete, the user can invoke the summarization task or ask questions. The user interface exposes some advanced options like the chunk size and chunk overlap, which would be useful for advanced users who are testing the application on new documents.

Next steps
LLMs provide significant new information retrieval capabilities. Business users need convenient access to those capabilities. There are two directions for future work to consider:

Take advantage of the powerful LLMs already available in JumpStart foundation models. With just a few lines of code, our sample application could deploy and make use of advanced LLMs from AI21 and Cohere for text summarization and generation.
Make these capabilities accessible to non-technical users. A prerequisite to processing PDF documents is extracting text from the document, and summarization jobs may take several minutes to run. That calls for a simple user interface with asynchronous backend processing capabilities, which is easy to design using cloud-native services like Lambda and Fargate.

We also note that a PDF document is semi-structured information. Important cues like section headings are difficult to identify programmatically, because they rely on font sizes and other visual indicators. Identifying the underlying structure of information helps the LLM process the data more accurately, at least until such time that LLMs can handle input of unbounded length.
Conclusion
In this post, we showed how to build an interactive web application that lets business users upload and process PDF documents for summarization and question answering. We saw how to take advantage of JumpStart foundation models to access advanced LLMs, and use text splitting and retrieval augmented generation techniques to process longer documents and make them available as information to the LLM.
At this point in time, there is no reason not to make these powerful capabilities available to your users. We encourage you to start using the JumpStart foundation models today.

About the author
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Integrate Amazon SageMaker Model Cards with the model registry

Amazon SageMaker Model Cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model, from design and building through training and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a factsheet of the model that is important for model governance.
Until now, model cards were logically associated with a model in the Amazon SageMaker Model Registry using model name match. However, when solving a business problem through a machine learning (ML) model, customers iterate on the problem, creating multiple versions of the model that they need to operationalize and govern. Therefore, they need the ability to associate a model card with a particular model version.
In this post, we discuss a new feature that supports integrating model cards with the model registry at the deployed model version level. We discuss the solution architecture and best practices for managing model card versions, and walk through how to set up, operationalize, and govern the model card integration with the model version in the model registry.
Solution overview
SageMaker model cards help you standardize documenting your models from a governance perspective, and the SageMaker model registry helps you deploy and operationalize ML models. The model registry supports a hierarchical structure for organizing and storing ML models with model metadata information.
When an organization solves a business problem using ML, such as a customer churn prediction, we recommend the following steps:

Create a model card for the business problem to be solved.
Create a model package group for the business problem to be solved.
Build, train, evaluate, and register the first version of the model package version (for example, Customer Churn V1).
Update the model card linking the model package version to the model card.
As you iterate on new model package version, clone the model card from the previous version and link to the new model package version (for example, Customer Churn V2).

The following figure illustrates how a SageMaker model card integrates with the model registry.

As illustrated in the preceding diagram, the integration of SageMaker model cards and the model registry allows you to associate a model card with a specific model version in the model registry. This enables you to establish a single source of truth for your registered model versions, with comprehensive and standardized documentation across all stages of the model’s journey on SageMaker, facilitating discoverability and promoting governance, compliance, and accountability throughout the model lifecycle.
Best practices for managing model cards
Operating in machine learning with governance is a critical requirement for many enterprise organizations today, notably in highly regulated industries. As part of those requirements, AWS provides several services that enable reliable operation of the ML environment.
SageMaker model cards document critical details about your ML models in a single place for streamlined governance and reporting. Model cards help you capture details such as the intended use and risk rating of a model, training details and metrics, evaluation results and observations, and additional call-outs such as considerations, recommendations, and custom information.
Model cards need to be managed and updated as part of your development process, throughout the ML lifecycle. They are an important part of continuous delivery and pipelines in ML. In the same way that a Well-Architected ML project implements continuous integration and continuous delivery (CI/CD) under the umbrella of MLOps, a continuous ML documentation process is a critical capability in a lot of regulated industries or for higher risk use cases. Model cards are part of the best practices for responsible and transparent ML development.
The following diagram shows how model cards should be part of a development lifecycle.

Consider the following best practices:

We recommend creating model cards early in your project lifecycle. In the first phase of the project, when you are working on identifying the business goal and framing the ML problem, you should initiate the creation of the model card. As you work through the different steps of business requirements and important performance metrics, you can create the model card in a draft status and determine the business details and intended uses.
As part of your model development lifecycle phase, you should use the model registry to catalog models for production, manage model versions, and associate metadata with a model. The model registry enables lineage tracking.
After you have iterated successfully and are ready to deploy your model to production, it’s time to update the model card. In the deployment lifecycle phase, you can update the model details of the model card. You should also update training details, evaluation details, ethical considerations, and caveats and recommendations.

Model cards have versions associated with them. A given model card version is immutable across all attributes other than the model card status. If you make any other changes to the model card, such as evaluation metrics, description, or intended uses, SageMaker creates a new version of the model card to reflect the updated information. This is to ensure that a model card, once created, can’t be tampered with. Additionally, each unique model name can have only one associated model card, and it can’t be changed after you create the model card.
ML models are dynamic and workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.
Therefore, the lifecycle of your model cards will look as described in the following diagram. Every time you update your model card through the model lifecycle, you automatically create a new version of the model card. Every time you iterate on a new model version, you create a new model card that can inherit some model card information of the previous model versions and follow the same lifecycle.

Pre-requisites
This post assumes that you already have models in your model registry. If you want to follow along, you can use the following SageMaker example on GitHub to populate your model registry: SageMaker Pipelines integration with Model Monitor and Clarify.
Integrate a model card with the model version in the model registry
In this example, we have the model-monitor-clarify-group package in our model registry.

In this package, two model versions are available.

For this example, we link Version 1 of the model to a new model card. In the model registry, you can see the details for Version 1.

We can now use the new feature in the SageMaker Python SDK. Using the ModelPackage class from the sagemaker.model_card module, you can select the specific model version in the model registry that you would like to link the model card to.

You can now create a new model card for the model version and specify the model_package_details parameter with the previous model package retrieved. You need to populate the model card with all the additional details necessary. For this post, we create a simple model card as an example.

You can then use that definition to create a model card using the SageMaker Python SDK.
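The following is a minimal sketch of that flow; the card name and intended-uses text are illustrative, and exact class and parameter names may vary with your version of the SageMaker Python SDK:

import sagemaker
from sagemaker.model_card import (
    IntendedUses,
    ModelCard,
    ModelCardStatusEnum,
    ModelPackage,
)

session = sagemaker.Session()

# Model package retrieved from the model registry (ARN is a placeholder)
mp_details = ModelPackage.from_model_package_arn("<model_package_arn>")

# A simple draft model card linked to that model version
my_card = ModelCard(
    name="model-monitor-clarify-model-card",  # illustrative name
    status=ModelCardStatusEnum.DRAFT,
    model_package_details=mp_details,
    intended_uses=IntendedUses(
        purpose_of_model="Demonstrate linking a model card to a registered model version",
        intended_uses="Demonstration only",
    ),
    sagemaker_session=session,
)
my_card.create()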

When loading the model card again, you can see the associated model package under "__model_package_details".

You also have the option to update an existing model card with the model package, as shown in the following code snippet:

from sagemaker.model_card import ModelCard, ModelPackage

my_card = ModelCard.load("<model_card_name>")
mp_details = ModelPackage.from_model_package_arn("<arn>")
my_card.model_package_details = mp_details
my_card.update()

Finally, when creating or updating a new model package version in an existing model package group, if a model card already exists in that model package group, some information such as the business details and intended uses can be carried over to the new model card.
Clean up
If you created resources using the notebook mentioned in the prerequisites section, you are responsible for cleaning them up. Follow the instructions in the notebook to delete those resources.
Conclusion
In this post, we discussed how to integrate a SageMaker model card with a model version in the model registry. We shared the solution architecture with best practices for implementing a model card and showed how to set up and operationalize a model card to improve your model governance posture. We encourage you to try out this solution and share your feedback in the comments section.

About the Authors
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!
Natacha Fort is the Government Data Science Lead for Public Sector Australia and New Zealand, Principal SA at AWS. She helps organizations navigate their machine learning journey, supporting them from framing the machine learning problem to deploying into production, all the while making sure the best architecture practices are in place to ensure their success. Natacha focuses with organizations on MLOps and responsible AI.

Top 50+ AI Coding Assistant Tools in 2023

ChatGPT

ChatGPT is capable of writing code without relying on existing code references. Furthermore, it can efficiently debug the user’s code. By incorporating a code interpreter, ChatGPT has expanded its capabilities to include self-testing of its own code.

Bard

Google’s Bard, just like ChatGPT, is capable of interacting in a conversational manner and is suitable for writing and debugging code.

GitHub Copilot

GitHub Copilot is an AI-powered code completion tool that analyzes contextual code and delivers real-time feedback and recommendations by suggesting relevant code snippets. 

Tabnine

Tabnine is an AI-based code completion tool that offers an alternative to GitHub Copilot. It stands out with its expertise in providing full-function AI code completion capabilities.

Code Snippets AI

Code Snippets allows users to turn their questions into code. It is an all-in-one tool with features like code explanation, snippets library, etc.

MutableAI

MutableAI is a great choice for developers who frequently use boilerplate code and want efficient autocompletion capabilities. It offers code completion and the ability to organize and tidy code into logical groups. 

Cogram

Cogram is an SQL code generation tool that allows users to write efficient SQL queries using natural language.

Amazon CodeWhisperer

CodeWhisperer is also a code completion tool developed by AWS that can make intelligent completions based on comments and existing code.

Replit

Replit is an online coding platform that is known for its browser-based IDE. One of its features, Ghostwriter, provides relevant code suggestions based on the context.

Warp AI

Warp is an AI-powered terminal that speeds up the development process.

GitFluence

This tool allows users to find the right Git command quickly.

AskCodi

AskCodi can generate code, answer programming questions, and provide helpful code suggestions.

Codiga

Codiga is a code analysis tool that inspects code and finds potential errors and vulnerabilities.

Bugasura

Bugasura is a bug-tracking tool that streamlines the bug management process.

CodeWP

CodeWP is an AI-powered code generator that has been designed to simplify the coding process for WordPress developers.

AI Helper Bot

This tool is an SQL query generator that can write queries based on the user prompt.

Android Studio Bot

This tool generates code, fixes errors, and answers questions related to Android development.

SinCode

SinCode is an AI assistant that helps users with tasks like AI writing and code generation.

WPCode

WPCode is a snippet deployment tool for WordPress websites.

Sourcegraph Cody

Sourcegraph Cody is an assistant that lives inside the user’s editor and is best at explaining code.

Codeium

Codeium provides AI code completion for more than 20 programming languages.

K.Explorer

This tool suggests code completions as well as completes the body of functions.

Codacy

Codacy is a code review tool that identifies issues using static code analysis.

Hacker AI

This tool scans the code to find potential security vulnerabilities.

Figstack

Figstack allows users to read and write code across multiple programming languages.

Kodezi

This tool auto-corrects the user’s code and helps remove bugs.

DevKit

DevKit combines a ChatGPT-based assistant (DevGPT) with other mini-tools to help users test public APIs, query databases, and generate code.

Taiga

Taiga offers real-time feedback, guidance, and tailored recommendations.

MarsAI

This tool is a low-code platform for building mobile and web applications.

Safurai

Safurai helps users write better, safer, and more optimized code.

Phind

Phind is the search engine for developers.

AI Reality

This tool generates augmented reality prototypes from the entered text.

AutoRegex

AutoRegex is an English-to-regex converter.

Lookup

Lookup is an analytics tool that helps users analyze their data just by writing in plain English.

Whybug

Whybug helps explain the user’s errors.

How2

This tool provides code completion for the Unix terminal.

Cheat Layer

This tool helps businesses solve their automation problems using an ML model.

WiseData

WiseData is an AI assistant for Python data analytics.

Deepnote

Deepnote’s AI Copilot gives contextual code suggestions.

Generate JSON

This tool allows users to quickly make dummy JSON for software testing.

TimeComplexity

This tool analyzes the runtime complexity of code.

Quest

Quest is a tool for building React applications.

DocumentationLab

This tool helps users manage and document software development.

GitPoet

This tool suggests accurate commit messages.

Checksum

Checksum helps users maintain end-to-end tests.

LogicLoop

This is an SQL query generator and optimizer.

CodeMate AI

CodeMate allows users to quickly write error-free code.

Codium AI

Codium generates meaningful tests for developers.

CodeSquire

CodeSquire is a coding assistant for data scientists, engineers, and analysts.

Refact

Refact is a code assistant for VS Code and JetBrains.

Zeus Notebook

Zeus Notebook is a browser-based Python notebook with an AI assistant.


The post Top 50+ AI Coding Assistant Tools in 2023 appeared first on MarkTechPost.

Stanford and Mila Researchers Propose Hyena: An Attention-Free Drop-in Replacement to the Core Building Block of Many Large-Scale Language Models

While the race to develop generative models such as ChatGPT and Bard, and the underlying technology such as GPT-3 and GPT-4, has taken the AI world by storm, many challenges remain around the accessibility, training, and practical feasibility of these models for everyday use cases.

Anyone who has experimented with these sequence models has likely run into one frustrating limitation: the length of the input they can use to prompt the model.

And for enthusiasts who want to dig into the core of these technologies and train a custom model, the cost of optimization makes it nearly impossible.

At the heart of these problems lies the quadratic cost of the attention mechanism that sequence models rely on. The computation and resources required grow rapidly with sequence length, which makes scaling extremely expensive and leaves only a handful of well-resourced organizations with real understanding of and control over these algorithms.

Simply put, attention exhibits quadratic cost in sequence length, which limits the amount of accessible context and makes scaling a costly affair.
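As a rough back-of-the-envelope illustration (the constants below are arbitrary assumptions, chosen only to show how the ratio grows with sequence length), compare quadratic attention against an FFT-based long convolution:

import math

def attention_ops(L, d):
    # QK^T plus attention-weighted values scale roughly as L^2 * d
    return 2 * L * L * d

def fft_long_conv_ops(L, d):
    # An FFT-based long convolution scales roughly as L * log2(L) * d
    return 10 * L * math.log2(L) * d

for L in (1_000, 8_000, 64_000):
    ratio = attention_ops(L, 512) / fft_long_conv_ops(L, 512)
    print(f"sequence length {L}: attention is ~{ratio:.0f}x more expensive")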

However, worry not: a new architecture called Hyena is now making waves in the NLP community, and many see it as the rescuer the field needs. It challenges the dominance of existing attention mechanisms, and the research paper demonstrates its potential to topple the existing system.

Developed by researchers at Stanford and Mila, Hyena delivers impressive performance on a range of NLP tasks at subquadratic cost. In this article, we take a closer look at Hyena’s claims.

The paper suggests that subquadratic operators can match the quality of attention models at scale without the same cost in parameters and optimization. Based on targeted reasoning tasks, the authors distill the three properties that contribute most to attention’s performance:

Data control

Sublinear parameter scaling

Unrestricted context

With these points in mind, they introduce the Hyena hierarchy. This new operator combines long convolutions with element-wise multiplicative gating to match the quality of attention at scale while reducing the computational cost.
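A highly simplified, single-channel NumPy sketch of that recurrence (alternating a learned long convolution, computed with FFTs, with element-wise gating) might look like the following; it omits the implicit filter parameterization, batching, multiple channels, and other details of the actual Hyena operator:

import numpy as np

def causal_long_conv(u, k):
    # O(L log L) causal long convolution via zero-padded FFTs
    L = u.shape[-1]
    U = np.fft.rfft(u, n=2 * L)
    K = np.fft.rfft(k, n=2 * L)
    return np.fft.irfft(U * K, n=2 * L)[..., :L]

def hyena_like_operator(v, gates, filters):
    # Hyena-style recurrence: z <- gate * (filter convolved with z), repeated N times
    z = v
    for x, h in zip(gates, filters):
        z = x * causal_long_conv(z, h)
    return z

# Toy usage: sequence length 1024, order-2 recurrence, random projections and filters
L, order = 1024, 2
v = np.random.randn(L)
gates = [np.random.randn(L) for _ in range(order)]
filters = [np.random.randn(L) * 0.01 for _ in range(order)]
y = hyena_like_operator(v, gates, filters)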

The experiments reveal impressive results.

Language modeling. 

Hyena’s scaling was tested on autoregressive language modeling. Evaluated on perplexity on the benchmark datasets WikiText103 and The Pile, Hyena is the first attention-free, convolution-based architecture to match GPT quality with a 20% reduction in total FLOPS.

Table caption: Perplexity on WikiText103 (same tokenizer). ∗ denotes results from Dao et al. (2022c). Deeper and thinner models (Hyena-slim) achieve lower perplexity.

Table caption: Perplexity on The Pile for models trained to a total number of tokens (for example, 5 billion), with different runs for each token total. All models use the same tokenizer (GPT-2). The FLOP count is for the 15-billion-token run.

Large-scale image classification

The paper also demonstrates the potential of Hyena as a general deep learning operator for image classification: the authors drop-in replace the attention layers in the Vision Transformer (ViT) with the Hyena operator and match ViT’s performance.

On CIFAR-2D, the authors test a 2D version of Hyena long convolution filters in a standard convolutional architecture, which improves on the 2D long convolutional model S4ND (Nguyen et al., 2022) in accuracy with an 8% speedup and 25% fewer parameters.

The promising results at the sub-billion parameter scale suggest that attention may not be all we need and that simpler subquadratic designs such as Hyena, informed by simple guiding principles and evaluation on mechanistic interpretability benchmarks, form the basis for efficient large models.

With the waves this architecture is creating in the community, it will be interesting to see whether Hyena has the last laugh.

Check out the Paper and Github link.


The post Stanford and Mila Researchers Propose Hyena: An Attention-Free Drop-in Replacement to the Core Building Block of Many Large-Scale Language Models appeared first on MarkTechPost.

Meet DreamIdentity: An Optimization-Free AI Method for Each Face Identity Keeping the Editability for Text-to-Image Models

The discipline of creating visual content has recently been transformed by diffusion-based large-scale text-to-image (T2I) models. These T2I models make it simple to produce engaging, expressive, and human-centered graphics. An intriguing use of these models is generating, from a photo of a specific person’s face (a family member, a friend, and so on), various scenes linked to that identity using natural language descriptions. This identity re-contextualization challenge, which deviates from the typical T2I task illustrated in Fig. 1, requires the model to preserve the identity of the input face (i.e., ID preservation) while adhering to the textual prompt.

Figure 1 shows how DreamIdentity creates a large number of identity-preserving and text-coherent pictures in various contexts from a single face image, without the need for test-time optimization.

Personalizing a pre-trained T2I model for each face identity is one workable approach. It entails learning to associate a particular word with the identity by optimizing its word embedding or fine-tuning the model parameters. Because of the per-identity optimization, these optimization-based approaches are inefficient. To avoid time-consuming per-identity optimization, various optimization-free methods map the image features obtained from a pre-trained image encoder (usually CLIP) directly into a word embedding. However, this compromises ID preservation. These techniques also risk impairing the original T2I model’s editing capabilities, because they either require fine-tuning the parameters of the pre-trained T2I model or change the original structure to inject extra grid image features.

Put simply, all concurrent optimization-free efforts struggle to maintain identity while preserving the model’s editability. The authors contend that two problems are the root causes of this difficulty in existing optimization-free studies: (1) inaccurate identity feature representation and (2) an inconsistent objective between training and testing. On the one hand, the fact that the best current CLIP model still performs much worse than a face recognition model on top-1 face identification accuracy (80.95% vs. 87.61%) indicates that the encoder commonly used by concurrent efforts (i.e., CLIP) is inadequate for the identity re-contextualization task. Furthermore, CLIP’s final-layer feature, which largely focuses on high-level semantics rather than precise face descriptions, fails to preserve the identity information.

On the other hand, the editability of the input face is negatively impacted because all concurrent works use a vanilla reconstruction objective to learn the word embedding. To address these difficulties of identity preservation and editability, the authors propose an optimization-free framework (named DreamIdentity) with accurate identity representation and a consistent training/inference objective. More precisely, they design a Multi-word Multi-scale ID encoder (M2 ID encoder), built on a Vision Transformer architecture, for accurate identity representation. This encoder is pre-trained on a large face dataset and projects multi-scale features into multi-word embeddings.
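To make the idea concrete, the following is a purely hypothetical PyTorch sketch of the projection step (not the authors’ exact architecture): features taken from several depths of a face ViT are mapped into a handful of pseudo-word embeddings that a T2I text encoder could consume. All layer sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

class M2IDEncoderSketch(nn.Module):
    # Hypothetical sketch: project multi-scale ViT face features into multi-word embeddings
    def __init__(self, num_scales=3, feat_dim=768, word_dim=768, num_words=2):
        super().__init__()
        # One projection per selected ViT depth (the "multi-scale" features)
        self.proj = nn.ModuleList(
            [nn.Linear(feat_dim, word_dim) for _ in range(num_scales)]
        )
        # Fuse the projected features into `num_words` pseudo-word embeddings
        self.to_words = nn.Linear(num_scales * word_dim, num_words * word_dim)
        self.num_words, self.word_dim = num_words, word_dim

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of [batch, feat_dim] tokens from different ViT depths
        projected = [p(f) for p, f in zip(self.proj, multi_scale_feats)]
        fused = torch.cat(projected, dim=-1)
        words = self.to_words(fused)
        # Returned embeddings would be injected into the T2I text encoder as pseudo-words
        return words.view(-1, self.num_words, self.word_dim)

encoder = M2IDEncoderSketch()
feats = [torch.randn(4, 768) for _ in range(3)]  # stand-ins for real ViT activations
word_embeddings = encoder(feats)                 # shape: [4, 2, 768]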

The researchers, from the University of Science and Technology of China and ByteDance, also propose a novel self-augmented editability learning method that moves the editing task into the training phase. This method uses the T2I model itself to build a self-augmented dataset by generating celebrity faces and various target-edited celebrity images; the M2 ID encoder is trained on this dataset to improve the model’s editability. Their contributions are as follows: they argue that, because of inaccurate representation and inconsistent training/inference objectives, existing optimization-free approaches fall short on ID preservation and editability.

Technically, (1) they propose the M2 ID encoder, which produces ID-aware multi-scale features with a multi-embedding projection, for accurate representation, and (2) they incorporate self-augmented editability learning, in which the underlying T2I model generates a high-quality editing dataset, to achieve a consistent training/inference objective. Comprehensive studies demonstrate the effectiveness of their approach, which achieves identity preservation while permitting flexible text-guided modification, that is, identity re-contextualization.

Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

The post Meet DreamIdentity: An Optimization-Free AI Method for Each Face Identity Keeping the Editability for Text-to-Image Models appeared first on MarkTechPost.