Meet NerfDiff: An AI Framework To Enable High-Quality and Consistent Multiple Views Synthesis From a Single Image

The synthesis of new views is a hot topic in computer graphics and vision applications, such as virtual and augmented reality, immersive photography, and the development of digital replicas. The objective is to generate additional views of an object or a scene based on limited initial viewpoints. This task is particularly demanding because the newly synthesized views must consider occluded areas and previously unseen regions. 

Recently, neural radiance fields (NeRF) have demonstrated exceptional results in generating high-quality novel views. However, NeRF relies on a significant number of images, ranging from tens to hundreds, to effectively capture the scene, making it susceptible to overfitting and lacking the ability to generalize to new scenes. 

Previous attempts have introduced generalizable NeRF models that condition the NeRF representation based on the projection of 3D points and extracted image features. These approaches yield satisfactory results, particularly for views close to the input image. However, when the target views significantly differ from the input, these methods produce blurry outcomes. The challenge lies in resolving the uncertainty associated with large unseen regions in the novel views. 

An alternative approach to tackle the uncertainty problem in single-image view synthesis involves utilizing 2D generative models that predict novel views while conditioning on the input view. However, the risk for these methods is the lack of consistency in image generation with the underlying 3D structure.

For this purpose, a new technique called NerfDiff has been presented. NerfDiff is a framework designed for synthesizing high-quality multi-view consistent images based on single-view input. An overview of the workflow is presented in the figure below. 

The proposed approach consists of two stages: training and finetuning. 

During the training stage, a camera-space triplane-based NeRF model and a 3D-aware conditional diffusion model (CDM) are jointly trained on a collection of scenes. At the finetuning stage, the NeRF representation is initialized from the input image, and the parameters of the NeRF model are then adjusted based on a set of virtual images generated by the CDM, which is conditioned on the NeRF-rendered outputs. However, a straightforward finetuning strategy that optimizes the NeRF parameters directly on the CDM outputs produces low-quality renderings because the CDM outputs are not multi-view consistent. To address this issue, the researchers propose NeRF-guided distillation, an alternating process that updates the NeRF representation and guides the multi-view diffusion process. This approach resolves the uncertainty of single-image view synthesis by leveraging the additional information provided by the CDM, while the NeRF model in turn guides the CDM to ensure multi-view consistency during the diffusion process.
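
To make the alternating procedure concrete, here is an illustrative pseudocode sketch; the nerf and cdm objects and their methods are hypothetical stand-ins rather than the authors' actual API:

def nerf_guided_distillation(nerf, cdm, input_image, virtual_cameras, num_steps):
    # Initialize the camera-space triplane NeRF from the single input view.
    nerf.initialize_from(input_image)
    for _ in range(num_steps):
        # Render the current NeRF at a set of virtual camera poses.
        rendered_views = [nerf.render(camera) for camera in virtual_cameras]
        # Let the renderings guide the conditional diffusion model, which
        # fills in plausible detail for the unseen regions.
        refined_views = [cdm.refine(view, condition=input_image) for view in rendered_views]
        # Fit the NeRF parameters to the refined virtual views, enforcing
        # multi-view consistency for the next round of guidance.
        nerf.fit(virtual_cameras, refined_views)
    return nerf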

Some of the results obtained with NerfDiff are shown below (NGD stands for NeRF-guided distillation).

This was a summary of NerfDiff, a novel AI framework that enables high-quality and consistent multi-view synthesis from a single input image. If you are interested, you can learn more about this technique in the links below.

Check out the Paper and Project.

Meet PromptingWhisper: Using Prompt Engineering to Adapt the Whisper Model to Unseen Tasks, the Proposed Prompts Enhance Performance by 10% to 45% on Three Zero-Shot Tasks

All the advancements recently taking place in the field of Artificial Intelligence have enabled us to build intelligent systems with a better and more articulate understanding of language than ever before. With each upgrade and release, Large Language Models are becoming more capable of catering to different needs across applications and scenarios. For any robust and efficient model, it is important to have a well-designed training prompt. Prompt engineering involves designing a prompt that enables the user to receive a suitable response from the model. Its main objective is to feed the model a good-quality prompt so that it can easily find patterns and trends in the data.

Focusing specifically on the domain of audio and speech processing, the study of prompt engineering has gained attention but is relatively new compared to other domains. The Whisper model, released by OpenAI, is a transformer-based encoder-decoder automatic speech recognition model that comes in two groups: English-only and multilingual. It was trained on a large dataset consisting of 680,000 hours of web-scraped speech data.

In a recently released research paper, a team of researchers discussed adapting the Whisper model to unseen tasks using simple prompts. In their approach, called PromptingWhisper, the researchers investigate the zero-shot task generalization abilities of the Whisper model by analyzing its strengths and weaknesses. To adapt Whisper to unseen tasks, the team uses prompt engineering to design task-specific prompts. They mainly discuss three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) involving unseen language pairs.

In AVSR, the team found that Whisper is robust to the length and noisiness of the visual prompt, and that its efficiency with visual prompts differs between the English-only and the multilingual models. In CS-ASR, some performance gaps were found between different accents. Lastly, in ST, it was found that the task token in the prompts can be effectively used to instruct the model to perform translation. To customize the prompts to the specific requirements of each task, the team manipulated the special tokens within the default prompts provided by Whisper or used another large-scale model.
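
As a rough illustration of this kind of special-token manipulation (not the paper's code), the Hugging Face transformers interface to Whisper lets you force task and language tokens at the start of decoding, for example to steer the multilingual model toward translation:

import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One second of silence as a stand-in for real 16 kHz audio.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Force the decoder prompt to <|fr|><|translate|>-style special tokens,
# instructing the model to translate French speech instead of transcribing it.
forced_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))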

The team has conducted experiments to evaluate the performance of the Whisper model. After comparing the default prompts to their proposed task-specific prompts, the results showed that their prompts significantly improved performance across the three zero-shot tasks, with performance gains ranging from 10% to 45%. In some cases, the proposed prompts even outperformed the SOTA-supervised models on certain datasets.

In conclusion, the researchers have investigated the Whisper model in great depth. While evaluating it, they observed how Whisper responds to different prompts, uncovered biases related to accents, and identified the model's ability to understand multiple languages within its latent space. They studied and analyzed Whisper's hidden strengths and weaknesses in detail by focusing on the gradient-free zero-shot task generalization abilities of web-scale speech models.

Check out the Paper and Code.

Adobe has Integrated Firefly Directly into Photoshop: Marrying the Speed and Ease of Generative AI with the Power and Precision of Photoshop

Adobe has introduced a new feature called “Generative Fill” in its popular photo editing software, Photoshop. This AI-powered tool leverages advanced machine learning algorithms to generate realistic imagery and seamlessly fill missing or empty areas with visually plausible content.

Generative Fill analyzes an image’s surrounding elements and textures, enabling it to intelligently create new pixels and blend them seamlessly into the existing composition. This results in visually appealing and realistic outcomes, saving users time and effort by eliminating the need for manual recreation. One notable application of Generative Fill is its ability to remove undesirable elements from images. Users can easily select distracting objects within a photograph and utilize Generative Fill to generate visually harmonious replacements. This feature is particularly valuable to photographers, designers, and content creators, streamlining the editing process while preserving the overall integrity of the composition.

Adobe strongly emphasizes ethical considerations in the development of Generative Fill. The company has trained its AI models using diverse datasets to ensure inclusivity and reduce bias in the generated images. Moreover, Adobe acknowledges the importance of user control, allowing users to refine and customize the results according to their preferences.

Generative Fill has the potential to make a significant impact across various industries. Graphic designers, marketers, and advertisers, in particular, can benefit greatly from this tool as it enables them to manipulate images more efficiently, creating compelling visuals that meet their specific requirements.

Generative Fill in Adobe Photoshop not only saves users a lot of time and effort by automating the process of removing unwanted elements or filling in gaps, but it also opens up new creative possibilities. Users can experiment with different compositions, explore alternative visual options, and push the boundaries of their artistic expression. Including Generative Fill in Photoshop signifies Adobe’s ongoing commitment to providing innovative tools that empower professionals in digital image editing.

While Generative Fill offers immense potential, ongoing debates surround the ethical implications and potential misuse of AI-generated content. However, responsible use of such tools, transparent guidelines, and user awareness can help mitigate these concerns.

In conclusion, Adobe’s introduction of Generative Fill in Photoshop represents a major advancement in AI-driven image editing. By harnessing the power of ML, this feature empowers users to remove objects and fill missing areas with realistic content seamlessly. With a focus on ethical considerations and user control, Generative Fill has the potential to revolutionize the way professionals approach image editing, enhancing their creativity and productivity.

Check out the Reference Article.

Create high-quality images with Stable Diffusion models and deploy them cost-effectively with Amazon SageMaker

Text-to-image generation is a task in which a machine learning (ML) model generates an image from a textual description. The goal is to generate an image that closely matches the description, capturing the details and nuances of the text. This task is challenging because it requires the model to understand the semantics and syntax of the text and to generate photorealistic images. There are many practical applications of text-to-image generation in AI photography, concept art, building architecture, fashion, video games, graphic design, and much more.
Stable Diffusion is a text-to-image model that empowers you to create high-quality images within seconds. When real-time interaction with this type of model is the goal, ensuring a smooth user experience depends on the use of accelerated hardware for inference, such as GPUs or AWS Inferentia2, Amazon’s own ML inference accelerator. The steep costs involved in using GPUs typically require optimizing the utilization of the underlying compute, even more so when you need to deploy different architectures or personalized (fine-tuned) models. Amazon SageMaker multi-model endpoints (MMEs) address this problem by letting you scale to thousands of models behind one endpoint. By using a shared serving container, you can host multiple models in a cost-effective, scalable manner within the same endpoint, and even on the same GPU.
In this post, you will learn about Stable Diffusion model architectures, different types of Stable Diffusion models, and techniques to enhance image quality. We also show you how to deploy Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server.

Prompt: portrait of a cute bernese dog, art by elke Vogelsang, 8k ultra realistic, trending on artstation, 4 k
Prompt: architecture design of living room, 8 k ultra-realistic, 4 k, hyperrealistic, focused, extreme details
Prompt: New York skyline at night, 8k, long shot photography, unreal engine 5, cinematic, masterpiece

Stable Diffusion architecture
Stable Diffusion is an open-source text-to-image model that you can use to create images of different styles and content simply by providing a text prompt. In the context of text-to-image generation, a diffusion model is a generative model that produces high-quality images from textual descriptions by capturing the complex dependencies between the input and output modalities, text and images.
The following diagram shows a high-level architecture of a Stable Diffusion model.

It consists of the following key elements:

Text encoder – CLIP is a transformer-based text encoder model that takes the input prompt text and converts it into token embeddings that represent each word in the text. CLIP, a combination of an image encoder and a text encoder, is trained on a dataset of images and their captions.
U-Net – A U-Net model takes the token embeddings from CLIP along with an array of noisy inputs and produces a denoised output. This happens through a series of iterative steps, where each step processes an input latent tensor and produces a new latent space tensor that better represents the input text.
Auto encoder-decoder – This model creates the final images. It takes the final denoised latent output from the U-Net model and converts it into an image that represents the text input.

Types of Stable Diffusion models
In this post, we explore the following pre-trained Stable Diffusion models by Stability AI from the Hugging Face model hub.
stable-diffusion-2-1-base
Use this model to generate images based on a text prompt. This base version of the model was trained on a subset of the large-scale LAION-5B dataset, mainly with English captions. We use StableDiffusionPipeline from the diffusers library to generate images from text prompts; a minimal usage sketch follows the parameter list below. This model can create images of dimension 512 x 512. It uses the following parameters:

prompt – A prompt can be a word, phrase, sentence, or paragraph.
negative_prompt – You can also pass a negative prompt to exclude specified elements from the image generation process and to enhance the quality of the generated images.
guidance_scale – A higher guidance scale results in an image more closely related to the prompt, at the expense of image quality. If specified, it must be a float.
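
For local experimentation (outside the Triton serving stack described later), a minimal text-to-image call with this model might look like the following; the prompt values are only examples:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="portrait of a cute bernese dog, 8k ultra realistic",
    negative_prompt="blur, low detail, low quality",
    guidance_scale=7.5,        # higher values follow the prompt more closely
    num_inference_steps=50,    # more steps usually means higher quality, slower inference
).images[0]
image.save("dog.png")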

stable-diffusion-2-depth
This model is used to generate new images from existing ones while preserving the shape and depth of the objects in the original image. The stable-diffusion-2-depth model is fine-tuned from stable-diffusion-2-base with an extra input channel that processes the (relative) depth prediction. We use StableDiffusionDepth2ImgPipeline from the diffusers library to load the pipeline and generate depth-conditioned images. The following are the additional parameters specific to the depth model:

image – The initial image to condition the generation of new images.
num_inference_steps (optional) – The number of denoising steps. More denoising steps usually leads to a higher-quality image at the expense of slower inference. This parameter is modulated by strength.
strength (optional) – Conceptually, this indicates how much to transform the reference image. The value must be between 0 and 1. image is used as a starting point, with more noise added to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is at its maximum and the denoising process runs for the full number of iterations specified in num_inference_steps; a value of 1, therefore, essentially ignores image. For more details, refer to the code in the GitHub repo.

stable-diffusion-2-inpainting
You can use this model for AI image restoration use cases. You can also use it to create novel designs and images from prompts and additional arguments. This model is also derived from the base model and uses a mask generation strategy: a mask of the original image specifies the segments to be changed and the segments to leave unchanged. We use StableDiffusionInpaintPipeline from the diffusers library to apply inpainting changes to the original image; a short usage sketch follows the parameter description below. The following additional parameter is specific to the inpainting model:

mask_image – An image where the blacked-out portion remains unchanged during image generation and the white portion is replaced.
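
As a local sketch (again outside the Triton serving path, with placeholder file names), inpainting with the diffusers library looks like this:

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("original.png").convert("RGB")  # image to edit (placeholder path)
mask = Image.open("mask.png").convert("RGB")            # white pixels are regenerated, black pixels are kept

result = pipe(
    prompt="building, facade, paint, windows",
    negative_prompt="tree, obstruction",
    image=init_image,
    mask_image=mask,
).images[0]
result.save("inpainted.png")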

stable-diffusion-x4-upscaler
This model is also derived from the base model and is additionally trained on a 10M subset of LAION containing 2048 x 2048 images. As the name implies, it can be used to upscale lower-resolution images to higher resolutions.
Use case overview
For this post, we deploy an AI image service with multiple capabilities, including generating novel images from text, changing the styles of existing images, removing unwanted objects from images, and upscaling low-resolution images to higher resolutions. Using several variations of Stable Diffusion models, you can address all of these use cases within a single SageMaker endpoint. This means that you need to host a large number of models in a performant, scalable, and cost-efficient way. In this post, we show how to deploy multiple Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server. You will learn about the implementation details, optimization techniques, and best practices for working with text-to-image models.
The following table summarizes the Stable Diffusion models that we deploy to a SageMaker MME.

Model Name | Model Size in GB
stabilityai/stable-diffusion-2-1-base | 2.5
stabilityai/stable-diffusion-2-depth | 2.7
stabilityai/stable-diffusion-2-inpainting | 2.5
stabilityai/stable-diffusion-x4-upscaler | 7

Solution overview
The following steps are involved in deploying Stable Diffusion models to SageMaker MMEs:

Use the Hugging Face hub to download the Stable Diffusion models to a local directory. This will download scheduler, text_encoder, tokenizer, unet, and vae for each Stable Diffusion model into its corresponding local directory. We use the revision="fp16" version of the model.
Set up the NVIDIA Triton model repository, model configurations, and model serving logic model.py. Triton uses these artifacts to serve predictions.
Package the conda environment with additional dependencies, and package the model repository to be deployed to the SageMaker MME.
Package the model artifacts in an NVIDIA Triton-specific format and upload model.tar.gz to Amazon Simple Storage Service (Amazon S3). The model will be used for generating images.
Configure a SageMaker model, endpoint configuration, and deploy the SageMaker MME.
Run inference and send prompts to the SageMaker endpoint to generate images using the Stable Diffusion model. We specify the TargetModel variable and invoke different Stable Diffusion models to compare the results visually.

We have published the code to implement this solution architecture in the GitHub repo. Follow the README instructions to get started.
Serve models with an NVIDIA Triton Inference Server Python backend
We use a Triton Python backend to deploy the Stable Diffusion pipeline model to a SageMaker MME. The Python backend lets you serve models written in Python with Triton Inference Server. To use the Python backend, you need to create a Python file model.py that has the following structure:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    @staticmethod
    def auto_complete_config(auto_complete_model_config): ...

    def initialize(self, args): ...

    def execute(self, requests): ...

    def finalize(self): ...

Every Python backend can implement four main functions in the TritonPythonModel class: auto_complete_config, initialize, execute, and finalize.
initialize is called when the model is being loaded. Implementing initialize is optional; it allows you to do any necessary initialization before running inference. In the initialize function, we create a pipeline and load it using from_pretrained checkpoints. We configure the scheduler from the pipeline scheduler config pipe.scheduler.config. Finally, we enable memory-efficient attention from the xformers library via enable_xformers_memory_efficient_attention. We provide more details on xformers later in this post. You can refer to the model.py of each model to understand the different pipeline details. This file can be found in the model repository.
The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you are given a list of InferenceRequest objects. We pass the input text prompt to the pipeline to get an image from the model. Images are decoded and the generated image is returned from this function call.
We get the input tensor from the name defined in the model configuration config.pbtxt file. From the inference request, we get prompt, negative_prompt, and gen_args, and decode them. We pass all the arguments to the model pipeline object, and the generated image is encoded before being returned. You can refer to the config.pbtxt file of each model to understand the different pipeline details. This file can be found in the model repository. Finally, we wrap the generated image in an InferenceResponse and return the response.
Implementing finalize is optional. This function allows you to do any cleanups necessary before the model is unloaded from Triton Inference Server.
When working with the Python backend, it’s the user’s responsibility to ensure that the inputs are processed in a batched manner and that responses are sent back accordingly. To achieve this, we recommend following these steps:

Loop through all requests in the requests object to form a batched_input.
Run inference on the batched_input.
Split the results into multiple InferenceResponse objects and concatenate them as the responses.

Refer to the Triton Python backend documentation or Host ML models on Amazon SageMaker using Triton: Python backend for more details.
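To make this concrete, here is a simplified sketch of what a text-to-image model.py could look like; it is not the repo's exact implementation, it handles requests one at a time rather than building a true batched input, and the base64 encoding of the output is an assumption that simply has to match whatever the client decodes. The repo's model.py files additionally handle negative_prompt, gen_args, and the image inputs used by the depth, inpainting, and upscaler pipelines.

import base64
import io

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from diffusers import StableDiffusionPipeline


class TritonPythonModel:
    def initialize(self, args):
        # Load the pipeline once when Triton loads this model onto the GPU.
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
        ).to("cuda")
        self.pipe.enable_xformers_memory_efficient_attention()

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            prompt = prompt_tensor.as_numpy()[0][0].decode("utf8")
            image = self.pipe(prompt).images[0]

            # Serialize the PIL image to base64-encoded PNG bytes for the response.
            buffer = io.BytesIO()
            image.save(buffer, format="PNG")
            encoded = base64.b64encode(buffer.getvalue())
            out_tensor = pb_utils.Tensor("generated_image", np.array([[encoded]], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
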
NVIDIA Triton model repository and configuration
The model repository contains the model serving script, model artifacts and tokenizer artifacts, a packaged conda environment (with dependencies needed for inference), the Triton config file, and the Python script used for inference. The latter is mandatory when you use the Python backend, and you should use the Python file model.py. Let’s explore the configuration file of the inpaint Stable Diffusion model and understand the different options specified:

name: "sd_inpaint"
backend: "python"
max_batch_size: 8
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "negative_prompt"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "image"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "mask_image"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "gen_args"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  }
]
output [
  {
    name: "generated_image"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/tmp/conda/sd_env.tar.gz"}
}

The following list explains the various parameters and values:

name – It's not required to include the model configuration name property. If the configuration doesn't specify the model's name, it's presumed to be identical to the name of the model repository directory where the model is stored. However, if a name is provided, it must match the name of that directory. sd_inpaint is the config property name.

backend – This specifies the Triton framework used to serve model predictions. This is a mandatory parameter. We specify python, because we'll be using the Triton Python backend to host the Stable Diffusion models.

max_batch_size – This indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton.

input→ prompt – Text prompt of type string. Specify -1 to accept dynamic tensor shape.

input→ negative_prompt – Negative text prompt of type string. Specify -1 to accept dynamic tensor shape.

input→ mask_image – Base64 encoded mask image of type string. Specify -1 to accept dynamic tensor shape.

input→ image – Base64 encoded image of type string. Specify -1 to accept dynamic tensor shape.

input→ gen_args – JSON encoded additional arguments of type string. Specify -1 to accept dynamic tensor shape.

output→ generated_image – Generated image of type string. Specify -1 to accept dynamic tensor shape.

instance_group – You can use this setting to place multiple run instances of a model on every GPU or on only certain GPUs. We specify KIND_GPU to make copies of the model on available GPUs.

parameters – We set the conda environment path to EXECUTION_ENV_PATH.

For details about the model repository and configurations of other Stable Diffusion models, refer to the code in the GitHub repo. Each directory contains artifacts for the specific Stable Diffusion models.
Package a conda environment and extend the SageMaker Triton container
SageMaker NVIDIA Triton container images don't contain libraries such as transformers, accelerate, and diffusers, which are needed to deploy and serve Stable Diffusion models. However, Triton allows you to bring additional dependencies using conda-pack. Let's start by creating the conda environment with the necessary dependencies outlined in the environment.yml file and creating a tar model artifact sd_env.tar.gz containing the conda environment with the dependencies installed in it. Create the environment from the following YAML file, package it with conda-pack, and copy the artifact to the local directory from where it will be uploaded to Amazon S3. Note that we upload the conda artifact as one of the models in the MME and invoke this model to set up the conda environment on the SageMaker hosting ML instance.

%%writefile environment.yml
name: mme_env
dependencies:
  - python=3.8
  - pip
  - pip:
    - numpy
    - torch --extra-index-url https://download.pytorch.org/whl/cu118
    - accelerate
    - transformers
    - diffusers
    - xformers
    - conda-pack

!conda env create -f environment.yml --force
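
The environment can then be packaged into the sd_env.tar.gz artifact with conda-pack; the exact commands in the repo may differ slightly, but a typical invocation looks like this:

!conda pack -n mme_env -o sd_env.tar.gz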

Upload model artifacts to Amazon S3
SageMaker expects the .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. Therefore, we create a tar artifact with content from the Triton model repository. We can use this S3 bucket to host thousands of model artifacts, and the SageMaker MME will use models from this location to dynamically load and serve a large number of models. We store all the Stable Diffusion models in this Amazon S3 location.
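As an illustrative sketch (the bucket name, prefix, and local paths below are placeholders rather than the repo's exact values), packaging and uploading one Triton model repository could look like this:

import tarfile
import boto3

# Archive one Triton model repository, e.g. the base text-to-image model.
with tarfile.open("sd_base.tar.gz", "w:gz") as tar:
    tar.add("model_repository/sd_base", arcname="sd_base")

# Upload the archive to the S3 prefix that the MME's ModelDataUrl will point to.
s3 = boto3.client("s3")
s3.upload_file("sd_base.tar.gz", "my-mme-bucket", "stable-diffusion-mme/sd_base.tar.gz")
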
Deploy the SageMaker MME
In this section, we walk through the steps to deploy the SageMaker MME by defining the container specification, the SageMaker model, and the endpoint configuration.
Define the serving container
In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker will create the endpoint with the MME container specifications. We set the container with an image that supports deploying MMEs with GPU. See Supported algorithms, frameworks, and instances for more details.
We can see the model artifacts in the following Amazon S3 ModelDataUrl location:

container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

Create an MME object
We use the SageMaker Boto3 client to create the model using the create_model API. We pass the container definition to the create_model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=container,
)

Define configurations for the MME
Create an MME configuration using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (we use the same instance type that we are using to host our SageMaker notebook). We recommend configuring your endpoints with at least two instances for real-life use cases. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

Create an MME
Use the preceding endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

The status will change to InService when the deployment is successful.
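Rather than polling the console, you can block until the endpoint is ready with a Boto3 waiter, for example:

# Wait until the multi-model endpoint reaches the InService status.
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
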
Generate images using different versions of Stable Diffusion models
Let’s start by invoking the base model with a prompt and getting the generated image. We pass the inputs to the base model with prompt, negative_prompt, and gen_args as a dictionary. We set the data type and shape of each input item in the dictionary and pass it as input to the model.

inputs = dict(
    prompt="Infinity pool on top of a high rise overlooking Central Park",
    negative_prompt="blur, low detail, low quality",
    gen_args=json.dumps(dict(num_inference_steps=50, guidance_scale=8)),
)
payload = {
    "inputs": [
        {"name": name, "shape": [1, 1], "datatype": "BYTES", "data": [data]}
        for name, data in inputs.items()
    ]
}
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="sd_base.tar.gz",
)
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Prompt: Infinity pool on top of a high rise overlooking Central Park
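The encode_image and decode_image helpers used throughout these snippets are defined in the repo; a minimal sketch, assuming images travel as base64-encoded PNG bytes, might look like this:

import base64
import io
from PIL import Image

def encode_image(img):
    # Serialize a PIL image to base64-encoded PNG bytes for the request payload.
    buffer = io.BytesIO()
    img.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue())

def decode_image(data):
    # Reverse of encode_image: base64 string from the response back to a PIL image.
    return Image.open(io.BytesIO(base64.b64decode(data)))
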
Working with this image, we can modify it with the versatile Stable Diffusion depth model. For example, we can change the style of the image to an oil painting, or change the setting from Central Park to Yellowstone National Park simply by passing the original image along with a prompt describing the changes we would like to see.
We invoke the depth model by specifying sd_depth.tar.gz in the TargetModel of the invoke_endpoint function call. In the outputs, notice how the orientation of the original image is preserved, but for one example, the NYC buildings have been transformed into rock formations of the same shape.

inputs = dict(
    prompt="highly detailed oil painting of an infinity pool overlooking central park",
    image=image,
    gen_args=json.dumps(dict(num_inference_steps=50, strength=0.9)),
)
payload = {
    "inputs": [
        {"name": name, "shape": [1, 1], "datatype": "BYTES", "data": [data]}
        for name, data in inputs.items()
    ]
}
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="sd_depth.tar.gz",
)
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
print("original image")
display(original_image)
print("generated image")
display(decode_image(output[0]["data"][0]))

Original image
Oil painting
Yellowstone Park

Another useful model is Stable Diffusion inpainting, which we can use to remove certain parts of the image. Let’s say you want to remove the tree in the following example image. We can do so by invoking the inpaint model sd_inpaint.tar.gz. To remove the tree, we need to pass a mask_image, which indicates which regions of the image should be retained and which should be filled in. The black pixel portion of the mask image indicates the regions that should remain unchanged, and the white pixels indicate what should be replaced.

image = encode_image(original_image).decode("utf8")
mask_image = encode_image(Image.open("sample_images/bertrand-gabioud-mask.png")).decode("utf8")
inputs = dict(
    prompt="building, facade, paint, windows",
    image=image,
    mask_image=mask_image,
    negative_prompt="tree, obstruction, sky, clouds",
    gen_args=json.dumps(dict(num_inference_steps=50, guidance_scale=10)),
)
payload = {
    "inputs": [
        {"name": name, "shape": [1, 1], "datatype": "BYTES", "data": [data]}
        for name, data in inputs.items()
    ]
}
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="sd_inpaint.tar.gz",
)
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Original image
Mask image
Inpaint image

In our final example, we downsize the original image that was generated earlier from its 512 x 512 resolution to 128 x 128. We then invoke the Stable Diffusion upscaler model to upscale the image back to 512 x 512. We use the same prompt to upscale the image as the one we used to generate the initial image. While not necessary, providing a prompt that describes the image helps guide the upscaling process and should lead to better results.

low_res_image = output_image.resize((128, 128))
inputs = dict(
    prompt="Infinity pool on top of a high rise overlooking Central Park",
    image=encode_image(low_res_image).decode("utf8"),
)

payload = {
    "inputs": [
        {"name": name, "shape": [1, 1], "datatype": "BYTES", "data": [data]}
        for name, data in inputs.items()
    ]
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="sd_upscale.tar.gz",
)
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
upscaled_image = decode_image(output[0]["data"][0])

Low-resolution image
Upscaled image

Although the upscaled image is not as detailed as the original, it’s a marked improvement over the low-resolution one.
Optimize for memory and speed
The xformers library is a way to speed up image generation and lower VRAM usage; this optimization is only available for NVIDIA GPUs. We use xformers for memory-efficient attention. When the enable_xformers_memory_efficient_attention option is enabled, you should observe lower GPU memory usage and a potential speedup at inference time.
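In the diffusers API, this optimization is a one-line opt-in on the pipeline object, for example:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")
# Use xformers memory-efficient attention for lower VRAM usage and faster inference.
pipe.enable_xformers_memory_efficient_attention()
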
Clean Up
Follow the instructions in the clean-up section of the notebook to delete the resources provisioned as part of this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.
Conclusion
In this post, we discussed Stable Diffusion models and how you can deploy different versions of them cost-effectively using SageMaker multi-model endpoints. You can use this approach to build an image generation and editing tool for creators. Check out the code samples in the GitHub repo to get started, and let us know about the cool generative AI tool that you build.

About the Authors
Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.
Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and architecture to build and deploy ML applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Meet Surv_ai: An Open Source Framework for Modeling and Comparative Analysis Using AI Agents

Most of us have used generative models such as ChatGPT at least once; if you haven't yet, trying one will help with what we are going to discuss today. While the results these large language models provide seem appropriate, and are frankly more realistic and natural than expected, there are still a few things we tend to overlook that can be quite important if we use these models as a source of ground truth or as a reference elsewhere.

The above paragraph raises an obvious question: what is wrong when nothing seems to be at fault?

Nothing specifically is wrong, but there are a few questions that we need to ask while using these models, or at least the results produced by them somewhere else, such as

What is the source of ground truth for these models? Where do they source their information from? It has to come from somewhere.

What about the bias? Are these models biased? And, if so, can we estimate this bias? Can we counter it?

What are the alternatives to the Model you are using, and what if those perform better in certain fact-checking scenarios? 

These are the exact issues that Daniel Balsam and the team have tackled with their project surv_ai. 

Surv_ai is a framework for multi-agent modeling with large language models. It enables large language models to be used as engines for research, bias estimation, hypothesis testing, and comparative analysis in a more efficient way, all under one hood.

To understand what it does, it is important to understand the core philosophy of the approach. The framework was inspired by a common predictive analytics technique called bagging (bootstrap aggregating), a classic ensemble technique. The idea is that instead of a single learner with a vast amount of information, many weak learners, each with limited information, when aggregated often perform better and give higher-quality net results.

Similarly, multi-agent modeling involves generating multiple statistical models based on the actions of numerous agents. In the case of Surv_ai, these models are built by agents querying and processing text from a data corpus. The agents then test and reason about the hypothesis (in simple terms, whatever you have asked them to verify or give an opinion on) and generate a suitable response.

Due to the stochastic nature of large language models, individual data points can vary. This can be countered by increasing the number of agents employed.

Surv_ai offers two approaches a user can opt for based on their requirements. One component in the provided repository, called Survey, produces multi-agent data points: a Survey takes a statement as input and returns the percentage of agents who agree with it.

A more complex implementation is called a Model, which can do what a Survey does but with more control and nuance. A Model exposes many more input parameters, so you can tune it to increase the precision of the results you wish to see.

These implementations help us approximate ground truth through the consolidated opinions of different agents. They can help us analyze how sentiment toward a hypothesis changes over time, and they can also allow us to estimate bias both in the source information and in the large language model itself.

Rapid advancements in large language models and generative engines are guaranteed to continue. A multi-agent modeling framework like this is promising and valuable for such use cases. As claimed, it can also serve as an indispensable tool for researchers investigating complex issues that depend on numerous factors. It will be interesting to see how it evolves and adapts over time.

Check out the Project.

Take My Video to Another Dimension: HOSNeRF is an AI Model That Can Generate Dynamic Neural Radiance Fields from a Single Video

Immersive media has become a hot topic recently thanks to advancements in 3D reconstruction methods. In particular, video reconstruction and free-viewpoint rendering have emerged as powerful technologies, enabling enhanced user engagement and the generation of realistic environments. These methods have found applications in various domains, including virtual reality, telepresence, the metaverse, and 3D animation production.

However, reconstructing videos comes with its fair share of challenges, especially when dealing with monocular viewpoints and complex human-environment interactions. Simple scenes pose little difficulty, but in reality our interactions with the environment are quite unpredictable and therefore hard to tackle.

Significant progress has been made in the field of view synthesis, with Neural Radiance Fields (NeRF) playing a pivotal role. NeRF was originally proposed to reconstruct static 3D scenes from multi-view images. Its success has attracted attention, and since then it has been extended to address the challenge of dynamic view synthesis. Researchers have proposed several approaches to incorporate dynamic elements, such as deformation fields and spatiotemporal radiance fields. Additionally, there has been a specific focus on dynamic neural human modeling, leveraging estimated human poses as prior information. While these advancements have shown promise, accurately reconstructing challenging monocular videos with fast and complex human-object-scene motions and interactions remains a significant challenge.

What if we want to advance NeRFs further so that they can accurately reconstruct complex human-environment interactions? How can we utilize NeRFs in environments with complex object movement? Time to meet HOSNeRF.

Overview of HOSNeRF. Source: https://arxiv.org/pdf/2304.12281.pdf

Human-Object-Scene Neural Radiance Fields (HOSNeRF) is introduced to overcome the limitations of NeRF. HOSNeRF tackles the challenges associated with complex object motions in human-object interactions and the dynamic interaction between humans and different objects at different times. By incorporating object bones attached to the human skeleton hierarchy, HOSNeRF enables accurate estimation of object deformations during human-object interactions. Additionally, two new learnable object state embeddings have been introduced to handle the dynamic removal and addition of objects in the static background model and the human-object model.

Overview of the proposed method. Source: https://arxiv.org/pdf/2304.12281.pdf

The development of HOSNeRF involved the exploration and identification of effective training objectives and strategies. Key considerations included deformation cycle consistency, optical flow supervision, and foreground-background rendering. HOSNeRF achieves high-fidelity dynamic novel view synthesis. It also allows for pausing monocular videos at any time and rendering all scene details, including dynamic humans, objects, and backgrounds, from arbitrary viewpoints. So you could, in principle, recreate the famous bullet-dodging scene from The Matrix.

HOSNeRF presents a groundbreaking framework that achieves 360° free-viewpoint high-fidelity novel view synthesis for dynamic scenes with human-environment interactions, all from a single video. The introduction of object bones and state-conditional representations enables HOSNeRF to effectively handle the complex non-rigid motions and interactions between humans, objects, and the environment.

Check out the Paper and Project.

New Google AI Report Shows Data Improvements And Scaling Insights That Have Enabled Its New Palm2 Large Language Model

For a long time, next-word prediction has been the go-to method for capturing the linguistic information present in text, making language modeling a vital study area. Over the past few years, large language models (LLMs) have demonstrated impressive performance in reasoning, math, science, and language problems thanks to greater scale and the Transformer architecture. Expanding the model size and data quantity has played a critical role in these breakthroughs. Most LLMs still stick to a tried-and-true formula, including primarily monolingual corpora and a language modeling objective.

Recent Google research presents PaLM 2, an updated version of the PaLM language model that incorporates new modeling, data, and scaling developments. PaLM 2 integrates a wide variety of new findings from several fields of study, including: 

Compute-optimal scaling: Data size has recently been shown to be at least as relevant as model size through compute-optimal scaling. This study debunks the conventional wisdom that it's better to scale the model three times as quickly as the dataset if users want optimal performance for their training compute.

Improved dataset mixtures: Most of the text in previous large pre-trained language models was in English. With hundreds of languages and domains in mind (such as programming, mathematics, and parallel multilingual texts), the team has developed a more multilingual and diverse pretraining mixture. The findings demonstrate that more capable models can effectively deal with more diverse non-English datasets and employ deduplication to decrease memorization without negatively impacting English language understanding ability.

In the past, LLMs have typically relied on a single causal or masked language modeling objective. The proposed model architecture is based on the Transformer, with improvements to both the architecture and the training objectives. The researchers used a carefully balanced combination of pretraining objectives to train this model to comprehend a wide range of linguistic facets.

The findings reveal that PaLM 2 models perform much better than PaLM on a wide range of tasks, such as generating natural language, translating it, and reasoning. Even though it requires more training compute than the largest PaLM model, the PaLM 2-L model, the largest in the PaLM 2 family, is much smaller. These findings point to alternatives to model scaling for enhancing performance, such as carefully selecting the data and having efficient architecture/objectives that can unlock performance. Having a smaller model that is nevertheless high quality improves inference efficiency, decreases serving costs, and opens the door for the model to be used in more downstream applications and by more users. 

The language, code production, and reasoning abilities of PaLM 2 across languages are impressive. It outperforms its predecessor on advanced language proficiency tests in the wild by a wide margin. 

By altering only a subset of pretraining, PaLM 2 allows inference-time control over toxicity through control tokens. PaLM 2’s pretraining data were augmented with novel ‘canary’ token sequences to facilitate better cross-lingual memorization evaluations. After comparing PaLM and PaLM 2, the researchers found that the latter has lower average rates of verbatim memorization. For tail languages, memorization rates only rise above those for English when data is repeated numerous times across documents. The group demonstrates that PaLM 2 has enhanced multilingual toxicity classification capabilities and assesses the risks and biases associated with several potential applications.

The team believes that changes to the architecture and objective, as well as additional scaling of model parameters and dataset size and quality, can continue to generate advancements in language interpretation and generation.

Check out the Paper.

Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain

One of the most common applications of generative AI and large language models (LLMs) in an enterprise environment is answering questions based on the enterprise’s knowledge corpus. Amazon Lex provides the framework for building AI-based chatbots. Pre-trained foundation models (FMs) perform well at natural language understanding (NLU) tasks such as summarization, text generation, and question answering on a broad variety of topics, but either struggle to provide accurate (non-hallucinated) answers or completely fail at answering questions about content that they haven’t seen as part of their training data. Furthermore, FMs are trained on a point-in-time snapshot of data and have no inherent ability to access fresh data at inference time; without this ability, they might provide responses that are potentially incorrect or inadequate.
A commonly used approach to address this problem is to use a technique called Retrieval Augmented Generation (RAG). In the RAG-based approach we convert the user question into vector embeddings using an LLM and then do a similarity search for these embeddings in a pre-populated vector database holding the embeddings for the enterprise knowledge corpus. A small number of similar documents (typically three) is added as context along with the user question to the “prompt” provided to another LLM and then that LLM generates an answer to the user question using information provided as context in the prompt. RAG models were introduced by Lewis et al. in 2020 as a model where parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. To understand the overall structure of a RAG-based approach, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.
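Conceptually, the RAG loop can be sketched as follows; the embed, vector_db, and generate callables stand in for the SageMaker embedding endpoint, the OpenSearch Service index, and the text-generation endpoint described later, and are assumptions rather than the post's exact code:

def answer_with_rag(question, embed, vector_db, generate, k=3):
    # 1) Convert the user question into a vector embedding.
    query_vector = embed(question)
    # 2) Retrieve the k most similar chunks from the knowledge corpus.
    context_chunks = vector_db.similarity_search(query_vector, k=k)
    # 3) Build a prompt that grounds the LLM in the retrieved context.
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    # 4) Generate the final answer with the text-generation LLM.
    return generate(prompt)

The Lambda function described later implements essentially this loop, with LangChain handling the retrieval and prompt assembly.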
In this post, we provide a step-by-step guide with all the building blocks for creating an enterprise-ready RAG application such as a question answering bot. We use a combination of different AWS services, open-source foundation models (FLAN-T5 XXL for text generation and GPT-J-6B for embeddings), and packages such as LangChain for interfacing with all the components and Streamlit for building the bot frontend.
We provide an AWS CloudFormation template to stand up all the resources required for building this solution. We then demonstrate how to use LangChain to tie everything together:

Interfacing with LLMs hosted on Amazon SageMaker.
Chunking of knowledge base documents.
Ingesting document embeddings into Amazon OpenSearch Service.
Implementing the question answering task.

We can use the same architecture to swap the open-source models with the Amazon Titan models. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar generative AI applications using Amazon Bedrock, so stay tuned.
Solution overview
We use the SageMaker docs as the knowledge corpus for this post. We convert the HTML pages on this site into smaller overlapping chunks (to retain some context continuity between chunks) of information, convert these chunks into embeddings using the gpt-j-6b model, and store the embeddings in OpenSearch Service. We implement the RAG functionality inside an AWS Lambda function, with Amazon API Gateway handling the routing of all requests to the Lambda function. We implement a chatbot application in Streamlit that invokes the function via the API Gateway; the function does a similarity search in the OpenSearch Service index for the embeddings of the user question. The matching documents (chunks) are added to the prompt as context by the Lambda function, which then uses the flan-t5-xxl model deployed as a SageMaker endpoint to generate an answer to the user question. All the code for this post is available in the GitHub repo.
The following figure represents the high-level architecture of the proposed solution.

Figure 1: Architecture

Step-by-step explanation:

The User provides a question via the Streamlit web application.
The Streamlit application invokes the API Gateway endpoint REST API.
The API Gateway invokes the Lambda function.
The function invokes the SageMaker endpoint to convert user question into embeddings.
The function invokes an OpenSearch Service API to find similar documents to the user question.
The function creates a “prompt” with the user query and the “similar documents” as context and asks the SageMaker endpoint to generate a response.
The response is provided from the function to the API Gateway.
The API Gateway provides the response to the Streamlit application.
The User is able to view the response on the Streamlit application.

As illustrated in the architecture diagram, we use the following AWS services:

SageMaker and Amazon SageMaker JumpStart for hosting the two LLMs.
OpenSearch Service for storing the embeddings of the enterprise knowledge corpus and doing similarity search with user questions.
Lambda for implementing the RAG functionality and exposing it as a REST endpoint via the API Gateway.
Amazon SageMaker Processing jobs for large scale data ingestion into OpenSearch.
Amazon SageMaker Studio for hosting the Streamlit application.
AWS Identity and Access Management roles and policies for access management.
AWS CloudFormation for creating the entire solution stack through infrastructure as code.

In terms of open-source packages used in this solution, we use LangChain for interfacing with OpenSearch Service and SageMaker, and FastAPI for implementing the REST API interface in the Lambda.
The workflow for instantiating the solution presented in this post in your own AWS account is as follows:

Run the CloudFormation template provided with this post in your account. This will create all the necessary infrastructure resources needed for this solution:

SageMaker endpoints for the LLMs
OpenSearch Service cluster
API Gateway
Lambda function
SageMaker Notebook
IAM roles

Run the data_ingestion_to_vectordb.ipynb notebook in the SageMaker notebook to ingest data from SageMaker docs into an OpenSearch Service index.
Run the Streamlit application on a terminal in Studio and open the URL for the application in a new browser tab.
Ask your questions about SageMaker via the chat interface provided by the Streamlit app and view the responses generated by the LLM.

These steps are discussed in detail in the following sections.
Prerequisites
To implement the solution provided in this post, you should have an AWS account and familiarity with LLMs, OpenSearch Service and SageMaker.
We need access to accelerated instances (GPUs) for hosting the LLMs. This solution uses one instance each of ml.g5.12xlarge and ml.g5.24xlarge; you can check the availability of these instances in your AWS account and request them as needed via a Service Quotas increase request, as shown in the following screenshot.

Figure 2: Service Quota Increase Request

Use AWS CloudFormation to create the solution stack
We use AWS CloudFormation to create a SageMaker notebook called aws-llm-apps-blog and an IAM role called LLMAppsBlogIAMRole. Choose Launch Stack for the Region you want to deploy resources to. All parameters needed by the CloudFormation template have default values already filled in, except for the OpenSearch Service password, which you have to provide. Make a note of the OpenSearch Service username and password; we use those in subsequent steps. This template takes about 15 minutes to complete.

The Launch Stack links are available for the following AWS Regions: us-east-1, us-west-2, eu-west-1, and ap-northeast-1.

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the values for OpenSearchDomainEndpoint and LLMAppAPIEndpoint. We use those in the subsequent steps.

Figure 3: Cloud Formation Stack Outputs

Ingest the data into OpenSearch Service
To ingest the data, complete the following steps:

On the SageMaker console, choose Notebooks in the navigation pane.
Select the notebook aws-llm-apps-blog and choose Open JupyterLab.

Figure 4: Open JupyterLab

Choose data_ingestion_to_vectordb.ipynb to open it in JupyterLab. This notebook will ingest the SageMaker docs to an OpenSearch Service index called llm_apps_workshop_embeddings.

Figure 5: Open Data Ingestion Notebook

When the notebook is open, on the Run menu, choose Run All Cells to run the code in this notebook. This will download the dataset locally into the notebook and then ingest it into the OpenSearch Service index. This notebook takes about 20 minutes to run. The notebook also ingests the data into another vector database called FAISS. The FAISS index files are saved locally and then uploaded to Amazon Simple Storage Service (Amazon S3) so that they can optionally be used by the Lambda function as an illustration of using an alternate vector database.

Figure 6: Notebook Run All Cells

Now we’re ready to split the documents into chunks, which can then be converted into embeddings to be ingested into OpenSearch. We use the LangChain RecursiveCharacterTextSplitter class to chunk the documents and then use the LangChain SagemakerEndpointEmbeddingsJumpStart class to convert these chunks into embeddings using the gpt-j-6b LLM. We store the embeddings in OpenSearch Service via the LangChain OpenSearchVectorSearch class. We package this code into Python scripts that are provided to the SageMaker Processing Job via a custom container. See the data_ingestion_to_vectordb.ipynb notebook for the full code.
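The following condensed sketch shows the shape of that pipeline. Note that SagemakerEndpointEmbeddingsJumpStart is a small wrapper defined in the post's repository; this sketch instead uses the stock LangChain SagemakerEndpointEmbeddings class with an illustrative content handler, and the JSON keys expected by the gpt-j-6b endpoint (text_inputs and embedding), the chunk sizes, and the endpoint name are assumptions you may need to adapt.

import json
from typing import List

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.vectorstores import OpenSearchVectorSearch

class GPTJContentHandler(EmbeddingsContentHandler):
    # Request/response JSON shapes are assumptions for the JumpStart gpt-j-6b endpoint
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: List[str], model_kwargs: dict) -> bytes:
        return json.dumps({"text_inputs": inputs, **model_kwargs}).encode("utf-8")

    def transform_output(self, output) -> List[List[float]]:
        return json.loads(output.read().decode("utf-8"))["embedding"]

# 1. Chunk the SageMaker docs into overlapping pieces
raw_pages = ["SageMaker is a fully managed machine learning service. ..."]  # placeholder text
docs = [Document(page_content=t) for t in raw_pages]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
chunks = splitter.split_documents(docs)

# 2. Convert the chunks into embeddings via the gpt-j-6b SageMaker endpoint
embeddings = SagemakerEndpointEmbeddings(
    endpoint_name="gpt-j-6b-embeddings-endpoint",  # placeholder endpoint name
    region_name="us-east-1",
    content_handler=GPTJContentHandler(),
)

# 3. Store the embeddings in the OpenSearch Service index
OpenSearchVectorSearch.from_documents(
    chunks,
    embeddings,
    opensearch_url="https://your-opensearch-domain-endpoint",
    http_auth=("opensearch_user", "opensearch_password"),
    index_name="llm_apps_workshop_embeddings",
)
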

Create a custom container and install the LangChain and opensearch-py Python packages in it.
Upload this container image to Amazon Elastic Container Registry (Amazon ECR).
Use the SageMaker ScriptProcessor class to create a SageMaker Processing job that runs on multiple nodes.

The data files available in Amazon S3 are automatically distributed across the SageMaker Processing job instances by setting s3_data_distribution_type='ShardedByS3Key' as part of the ProcessingInput provided to the processing job.
Each node processes a subset of the files and this brings down the overall time required to ingest the data into OpenSearch Service.
Each node also uses Python multiprocessing to parallelize the file processing internally. Therefore, there are two levels of parallelization: one at the cluster level, where individual nodes distribute the work (files) amongst themselves, and another at the node level, where the files assigned to a node are split between multiple processes running on that node (a minimal sketch of this node-level parallelism follows the ScriptProcessor code below).

# setup the ScriptProcessor with the above parameters
processor = ScriptProcessor(base_job_name=base_job_name,
                            image_uri=image_uri,
                            role=aws_role,
                            instance_type=instance_type,
                            instance_count=instance_count,
                            command=["python3"],
                            tags=tags)

# setup input from S3, note the ShardedByS3Key, this ensures that
# each instance gets a random and equal subset of the files in S3.
inputs = [ProcessingInput(source=f"s3://{bucket}/{app_name}/{DOMAIN}",
                          destination="/opt/ml/processing/input_data",
                          s3_data_distribution_type="ShardedByS3Key",
                          s3_data_type="S3Prefix")]

logger.info(f"creating an opensearch index with name={opensearch_index}")
# ready to run the processing job
st = time.time()
processor.run(code="container/load_data_into_opensearch.py",
              inputs=inputs,
              outputs=[],
              arguments=["--opensearch-cluster-domain", opensearch_domain_endpoint,
                         "--opensearch-secretid", os_creds_secretid_in_secrets_manager,
                         "--opensearch-index-name", opensearch_index,
                         "--aws-region", aws_region,
                         "--embeddings-model-endpoint-name", embeddings_model_endpoint_name,
                         "--chunk-size-for-doc-split", str(CHUNK_SIZE_FOR_DOC_SPLIT),
                         "--chunk-overlap-for-doc-split", str(CHUNK_OVERLAP_FOR_DOC_SPLIT),
                         "--input-data-dir", "/opt/ml/processing/input_data",
                         "--create-index-hint-file", CREATE_OS_INDEX_HINT_FILE,
                         "--process-count", "2"])
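Inside the processing script (container/load_data_into_opensearch.py), the node-level parallelism can be expressed with a standard multiprocessing pool sized by the --process-count argument passed above. The following is a simplified sketch of that pattern, not the repository's actual script; the process_file helper is a placeholder.

import glob
import multiprocessing as mp

def process_file(path: str) -> int:
    """Placeholder: chunk one file, embed the chunks, and bulk-index them into OpenSearch."""
    # ... chunking, embedding, and indexing logic would go here ...
    return 1

if __name__ == "__main__":
    process_count = 2  # passed in via the --process-count job argument
    # Each node only sees its shard of the S3 files (ShardedByS3Key) under this prefix
    files = glob.glob("/opt/ml/processing/input_data/**/*", recursive=True)

    # Split the node's shard again across local worker processes
    with mp.Pool(processes=process_count) as pool:
        results = pool.map(process_file, files)
    print(f"processed {sum(results)} files on this node")
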

Close the notebook after all cells run without any error. Your data is now available in OpenSearch Service. Enter the following URL in your browser’s address bar to get a count of documents in the llm_apps_workshop_embeddings index. Use the OpenSearch Service domain endpoint from the CloudFormation stack outputs in the URL below. You’ll be prompted for the OpenSearch Service username and password; these are available from the CloudFormation stack.

https://your-opensearch-domain-endpoint/llm_apps_workshop_embeddings/_count

The browser window should show an output similar to the following, indicating that 5,667 documents were ingested into the llm_apps_workshop_embeddings index:

{"count":5667,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0}}
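If you prefer to run this check from a notebook or script instead of the browser, the same _count API can be called with a few lines of Python; the endpoint, username, and password placeholders below come from the CloudFormation stack:

import requests

opensearch_endpoint = "https://your-opensearch-domain-endpoint"  # from the stack outputs
count_url = f"{opensearch_endpoint}/llm_apps_workshop_embeddings/_count"

# Basic auth with the OpenSearch Service username and password noted earlier
resp = requests.get(count_url, auth=("opensearch_user", "opensearch_password"))
print(resp.json())  # e.g. {"count": 5667, "_shards": {...}}
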
Run the Streamlit application in Studio
Now we’re ready to run the Streamlit web application for our question answering bot. This application allows the user to ask a question and then fetches the answer via the /llm/rag REST API endpoint provided by the Lambda function.
Studio provides a convenient platform to host the Streamlit web application. The following steps describe how to run the Streamlit app on Studio. Alternatively, you can follow the same procedure to run the app on your laptop.

Open Studio and then open a new terminal.
Run the following commands on the terminal to clone the code repository for this post and install the Python packages needed by the application:

git clone https://github.com/aws-samples/llm-apps-workshop
cd llm-apps-workshop/blogs/rag/app
pip install -r requirements.txt

The API Gateway endpoint URL available from the CloudFormation stack outputs needs to be set in the webapp.py file, which is done with the following sed command. Replace replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs in the shell commands with the value of the LLMAppAPIEndpoint field from the CloudFormation stack outputs, and then run the commands to start the Streamlit app on Studio.

EP=replace-with-LLMAppAPIEndpoint-value-from-cloudformation-stack-outputs
# replace __API_GW_ENDPOINT__ with output from the cloud formation stack
sed -i "s|__API_GW_ENDPOINT__|$EP|g" webapp.py
streamlit run webapp.py

When the application runs successfully, you’ll see an output similar to the following (the IP addresses you see will differ from the ones shown in this example). Note the port number (typically 8501) from the output to use as part of the URL for the app in the next step.

sagemaker-user@studio$ streamlit run webapp.py

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

You can now view your Streamlit app in your browser.

Network URL: http://169.255.255.2:8501
External URL: http://52.4.240.77:8501

You can access the app in a new browser tab using a URL that is similar to your Studio domain URL. For example, if your Studio URL is https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/lab? then the URL for your Streamlit app will be https://d-randomidentifier.studio.us-east-1.sagemaker.aws/jupyter/default/proxy/8501/webapp (notice that lab is replaced with proxy/8501/webapp). If the port number noted in the previous step is different from 8501 then use that instead of 8501 in the URL for the Streamlit app.

The following screenshot shows the app with a couple of user questions.

A closer look at the RAG implementation in the Lambda function
Now that we have the application working end to end, let’s take a closer look at the Lambda function. The Lambda function uses FastAPI to implement the REST API for RAG and the Mangum package to wrap the API with a handler that we package and deploy in the function. We use API Gateway to route all incoming requests to invoke the function and handle the routing internally within our application.
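Conceptually, the wiring between FastAPI, Mangum, and Lambda looks like the following minimal sketch. The module layout is illustrative rather than a copy of the repository code, and the /llm router prefix is inferred from the /llm/rag path used by the application.

from typing import Any, Dict

from fastapi import FastAPI, APIRouter
from mangum import Mangum
from pydantic import BaseModel

class Request(BaseModel):
    q: str                      # the user question
    max_matching_docs: int = 3  # number of similar chunks to retrieve
    verbose: bool = False       # include the matching docs in the response

router = APIRouter()

@router.post("/rag")
async def rag_handler(req: Request) -> Dict[str, Any]:
    # ... similarity search, prompt construction, and LLM call (shown below) ...
    return {"question": req.q, "answer": "..."}

app = FastAPI()
app.include_router(router, prefix="/llm")

# Mangum adapts API Gateway events into ASGI requests; this is the Lambda handler
handler = Mangum(app)
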
The following code snippet shows how we find documents in the OpenSearch index that are similar to the user question and then create a prompt by combining the question and the similar documents. This prompt is then provided to the LLM for generating an answer to the user question.

@router.post("/rag")
async def rag_handler(req: Request) -> Dict[str, Any]:
    # dump the received request for debugging purposes
    logger.info(f"req={req}")

    # initialize vector db and SageMaker Endpoint
    _init(req)

    # Use the vector db to find similar documents to the query
    # the vector db call would automatically convert the query text
    # into embeddings
    docs = _vector_db.similarity_search(req.q, k=req.max_matching_docs)
    logger.info(f'here are the {req.max_matching_docs} closest matching docs to the query="{req.q}"')
    for d in docs:
        logger.info("---------")
        logger.info(d)
        logger.info("---------")

    # now that we have the matching docs, let's pack them as a context
    # into the prompt and ask the LLM to generate a response
    prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""

    prompt = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    logger.info(f'prompt sent to llm = "{prompt}"')
    chain = load_qa_chain(llm=_sm_llm, prompt=prompt)
    answer = chain({"input_documents": docs, "question": req.q}, return_only_outputs=True)['output_text']
    logger.info(f'answer received from llm,\nquestion: "{req.q}"\nanswer: "{answer}"')
    resp = {'question': req.q, 'answer': answer}
    if req.verbose is True:
        resp['docs'] = docs

    return resp

Clean up
To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack as shown in the following screenshot.

Figure 7: Cleaning Up

Conclusion
In this post, we showed how to create an enterprise-ready RAG solution using a combination of AWS services, open-source LLMs, and open-source Python packages.
We encourage you to learn more by exploring JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service and building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.

About the Authors
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.
Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Get insights on your user’s search behavior from Amazon Kendra using …

Amazon Kendra is a highly accurate and intelligent search service that enables users to search unstructured and structured data using natural language processing (NLP) and advanced search algorithms. With Amazon Kendra, you can find relevant answers to your questions quickly, without sifting through documents. However, just enabling end-users to get the answers to their queries is not enough in today’s world. We need to constantly understand the end-user’s search behavior, such as what the top queries for the month are, whether any new queries have appeared recently, what percentage of queries received an instant answer, and more.
Although the Amazon Kendra console comes equipped with an analytics dashboard, many of our customers prefer to build a custom dashboard. This allows you to create unique views and filters, and grants management teams access to a streamlined, one-click dashboard without needing to log in to the AWS Management Console and search for the appropriate dashboard. In addition, you can enhance your dashboard’s functionality by adding preprocessing logic, such as grouping similar top queries. For example, you may want to group similar queries such as “What is Amazon Kendra” and “What is the purpose of Amazon Kendra” together so that you can effectively analyze the metrics and gain a deeper understanding of the data. Such grouping of similar queries can be done using the concept of semantic similarity.
This post discusses an end-to-end solution to implement this use case, which includes using AWS Lambda to extract the summarized metrics from Amazon Kendra, calculating the semantic similarity score using a Hugging Face model hosted on an Amazon SageMaker Serverless Inference endpoint to group similar queries, and creating an Amazon QuickSight dashboard to display the user insights effectively.
Solution overview
The following diagram illustrates our solution architecture.

The high-level workflow is as follows:

An Amazon EventBridge scheduler triggers Lambda functions once a month to extract last month’s search metrics from Amazon Kendra.
The Lambda functions upload the search metrics to an Amazon Simple Storage Service (Amazon S3) bucket.
The Lambda functions group similar queries in the uploaded file based on the semantic similarity score from a Hugging Face model hosted on a SageMaker inference endpoint.
An AWS Glue crawler creates or updates the AWS Glue Data Catalog from the uploaded file in the S3 bucket for an Amazon Athena table.
QuickSight uses the Athena table dataset to create analyses and dashboards.

For this solution, we deploy the infrastructure resources to create the QuickSight analysis and dashboard using an AWS CloudFormation template.
Prerequisites
Complete the following prerequisite steps:

If you’re a first-time user of QuickSight in your AWS account, sign up for QuickSight.
Get the Amazon Kendra index ID whose search metrics you want to visualize. You will have to use the search engine for a while (for example, a few weeks) to collect enough data to extract meaningful insights.
Clone the GitHub repo to create the container image:

app.py
Dockerfile
requirements.txt

Create an Amazon Elastic Container Registry (Amazon ECR) repository in us-east-1 and push the container image created by the downloaded Dockerfile. For instructions, refer to Creating a private repository.
Run the following commands in the directory of your local environment to create and push the container image to the ECR repository you created:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
docker build -t <YOUR_ECR_REPOSITORY_NAME> .
docker tag <YOUR_ECR_REPOSITORY_NAME>:latest <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/<YOUR_ECR_REPOSITORY_NAME>:latest
docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/<YOUR_ECR_REPOSITORY_NAME>:latest

Deploy the CloudFormation template
Complete the following steps to deploy the CloudFormation template:

Download the CloudFormation template kendrablog-sam-template.yml.
On the AWS CloudFormation console, create a new stack.

Use the us-east-1 Region for this deployment.

Upload the template directly or through your preferred S3 bucket.
For KendraIndex, enter the Amazon Kendra index ID from the prerequisites.
For LambdaECRRepository, enter the ECR repository from the prerequisites.
For QSIdentityRegion, enter the identity Region of QuickSight. The identity Region aligns with the Region you selected when you signed up for your QuickSight subscription.
For QSUserDefaultPassward, enter the default password to use for your QuickSight user.

You’ll be prompted to change this password when you first sign in to the QuickSight console.

For QSUserEmail, enter the email address to use for the QuickSight user.
Choose Next.
Leave other settings as default and choose Next.
Select the acknowledgement check boxes and choose Create stack.

When the deployment is complete, you can confirm all the generated resources on the stack’s Resources tab on the AWS CloudFormation console.
We walk through some of the key components of this solution in the following sections.
Get insights from Amazon Kendra search metrics
We can get the metrics data from Amazon Kendra using the GetSnapshots API. There are 10 metrics for analyzing what information the users are searching for: 5 metrics include trends data for us to look for patterns over time, and 5 metrics use just a snapshot or aggregated data. The metrics with the daily trend data are clickthrough rate, zero click rate, zero search results rate, instant answer rate, and total queries. The metrics with aggregated data are top queries, top queries with zero clicks, top queries with zero search results, top clicked on documents, and total documents.
We use Lambda functions to get the search metrics data from Amazon Kendra. The functions extract the metrics from Amazon Kendra and store them in Amazon S3. You can find the functions in the GitHub repo.
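As a simplified illustration of what such a function can do, the following sketch calls the GetSnapshots API for one aggregated metric and writes the rows to Amazon S3 as a CSV file. The index ID, bucket, key, and the Interval and MetricType values are illustrative assumptions (check the boto3 Kendra documentation for the exact enums), not the exact values used in the repository.

import csv
import io

import boto3

kendra = boto3.client("kendra")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Illustrative parameters; adjust the metric and period to what you need
    response = kendra.get_snapshots(
        IndexId="your-kendra-index-id",
        Interval="ONE_MONTH_AGO",        # last month's data
        MetricType="QUERIES_BY_COUNT",   # top queries by count (aggregated metric)
    )

    header = response["SnapshotsDataHeader"]  # column names
    rows = response["SnapshotsData"]          # one list per row

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)

    s3.put_object(
        Bucket="kendra-search-metrics-bucket",  # placeholder bucket
        Key="raw/QUERIES_BY_COUNT.csv",         # placeholder key
        Body=buf.getvalue(),
    )
    return {"rows": len(rows)}
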
Create a SageMaker serverless endpoint and host a Hugging Face model to calculate semantic similarity
After the metrics are extracted, the next step is to complete the preprocessing for the aggregated metrics. The preprocessing step checks the semantic similarity between the query texts and groups them together to show the total counts for the similar queries. For example, if there are three queries of “What is S3” and two queries of “What is the purpose of S3,” it will group them together and show that there are five queries of “What is S3” or “What is the purpose of S3.”
To calculate semantic similarity, we use a model from the Hugging Face model library. Hugging Face is a popular open-source platform that provides a wide range of NLP models, including transformers, which have been trained on a variety of NLP tasks. These models can be easily integrated with SageMaker and take advantage of its rich training and deployment options. The Hugging Face Deep Learning Containers (DLCs), which come pre-packaged with the necessary libraries, make it easy to deploy the model in SageMaker with just a few lines of code. In our use case, we first get the vector embedding using the Hugging Face pre-trained model flax-sentence-embeddings/all_datasets_v4_MiniLM-L6, and then use cosine similarity to calculate the similarity score between the vector embeddings.
To get the vector embedding from the Hugging Face model, we create a serverless endpoint in SageMaker. Serverless endpoints help save cost because you only pay for the amount of time the inference runs. To create a serverless endpoint, you first define the max concurrent invocations for a single endpoint, known as MaxConcurrency, and the memory size. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. SageMaker Serverless Inference auto-assigns compute resources proportional to the memory you select.
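As a rough sketch (the DLC versions and sizing below are assumptions to adapt, not prescriptions), the sentence-embedding model can be deployed to a SageMaker serverless endpoint as follows:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = sagemaker.get_execution_role()

# The pre-trained sentence-embedding model from the Hugging Face Hub
model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6",
        "HF_TASK": "feature-extraction",
    },
    role=role,
    transformers_version="4.26",  # illustrative DLC versions
    pytorch_version="1.13",
    py_version="py39",
)

# Serverless config: pick a memory size (1024-6144 MB) and a max concurrency
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=5,
)

predictor = model.deploy(serverless_inference_config=serverless_config)
embedding = predictor.predict({"inputs": "What is S3"})
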
We also need to pad one of the vectors with zeros so that the sizes of the two vectors match and we can calculate the cosine similarity as a dot product of the two vectors. We can set a threshold for cosine similarity (for example, 0.6), and if the similarity score is more than the threshold, we group the queries together. After the queries are grouped, we can understand the top queries better. We put all this logic in a Lambda function and deploy the function using a container image. The container image contains code to invoke the SageMaker Serverless Inference endpoint, as well as the Python libraries needed to run the Lambda function, such as NumPy, pandas, and scikit-learn. The following file is an example of the output from the Lambda function: HF_QUERIES_BY_COUNT.csv.
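Stripped of the Lambda plumbing, the grouping logic reduces to computing cosine similarity between query embeddings and merging queries whose score exceeds the threshold. The following is a minimal sketch of that logic with the 0.6 threshold mentioned above; the embeddings here are random placeholders standing in for real endpoint responses.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity as the dot product of the two vectors divided by their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_similar_queries(queries, embeddings, counts, threshold=0.6):
    """Greedily merge queries whose embedding similarity exceeds the threshold."""
    groups = []  # each group: {"queries": [...], "count": int, "embedding": np.ndarray}
    for query, emb, count in zip(queries, embeddings, counts):
        for group in groups:
            if cosine_similarity(emb, group["embedding"]) >= threshold:
                group["queries"].append(query)
                group["count"] += count
                break
        else:
            groups.append({"queries": [query], "count": count, "embedding": emb})
    return groups

# Toy example with placeholder embeddings (real ones come from the serverless endpoint)
queries = ["What is S3", "What is the purpose of S3", "What is Kendra"]
counts = [3, 2, 4]
embeddings = [np.random.rand(384) for _ in queries]
for g in group_similar_queries(queries, embeddings, counts):
    print(g["queries"], g["count"])
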
Create a dashboard using QuickSight
After you have collected the metrics and preprocessed the aggregated metrics, you can visualize the data to get the business insights. For this solution, we use QuickSight for the business intelligence (BI) dashboard and Athena as the data source for QuickSight.
QuickSight is a fully managed enterprise-grade BI service that you can use to create analyses and dashboards to deliver easy-to-understand insights. You can choose various types of charts and graphs to deliver the business insights effectively through a QuickSight dashboard. QuickSight connects to your data and combines data from many different sources, such as Amazon S3 and Athena. For our solution, we use Athena as the data source.
Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. You can use Athena queries to create your custom views from data stored in an S3 bucket before visualizing it with QuickSight. This solution uses an AWS Glue crawler to create the AWS Glue Data Catalog for the Athena table from the files in the S3 bucket.
The CloudFormation template runs the first crawler during resource creation. The following screenshot shows the Data Catalog schema.

The following screenshot shows the Athena table sample you will see after the deployment.

Access permissions to the AWS Glue databases and tables are managed by AWS Lake Formation. The CloudFormation template already attached the necessary Lake Formation permissions to the generated AWS Identity and Access Management (IAM) user for QuickSight. If you see permission issues with your IAM principal, grant at least the SELECT permission on the AWS Glue tables to your IAM principal in Lake Formation. You can find the AWS Glue database name on the Outputs tab of the CloudFormation stack. For more information, refer to Granting Data Catalog permissions using the named resource method.
We have completed the data preparation step. The last step is to create an analysis and dashboard using QuickSight.

Sign in to the QuickSight console with the QuickSight user that the CloudFormation template generated.
In the navigation pane, choose Datasets.
Choose Dataset.
Choose Athena as the data source.
Enter a name for the data source and choose kendrablog for the Athena workgroup.
Choose Create data source.
Choose AWSDataCatalog for Catalog and kendra-search-analytics-database for Database, and select one of the tables you want to use for analysis.
Choose Select.
Select Import to SPICE for quicker analytics and choose Edit/Preview data.
Optionally, choose Add data to join additional data.
You can also modify the data schema, such as column name or data type, and join multiple datasets, if needed.
Choose Publish & Visualize to move on to creating visuals.
Choose your visual type and set dimensions to create your visual.
You can optionally configure additional features for the chart using the navigation pane, such as filters, actions, and themes.

The following screenshots show a sample QuickSight dashboard for your reference. “Search Queries group by similar queries” in the screenshot shows how the search queries have been consolidated using semantic similarity.

Clean up
Delete the QuickSight resources (dashboard, analysis, and dataset) that you created and the infrastructure resources that AWS CloudFormation generated to avoid unwanted charges. You can delete the infrastructure resources and the QuickSight user created by the stack via the AWS CloudFormation console.

Conclusion
This post showed an end-to-end solution to get business insights from Amazon Kendra. The solution provided the serverless stack to deploy a custom dashboard for Amazon Kendra search analytics metrics using Lambda and QuickSight. We also solved common challenges relating to analyzing similar queries using a SageMaker Hugging Face model. You could further enhance the dashboard by adding more insights such as the key phrases or the named entities in the queries using Amazon Comprehend and displaying those in the dashboard. Please try out the solution and let us know your feedback.

About the Authors
Genta Watanabe is a Senior Technical Account Manager at Amazon Web Services. He spends his time working with strategic automotive customers to help them achieve operational excellence. His areas of interest are machine learning and artificial intelligence. In his spare time, Genta enjoys spending time with his family and traveling.
Abhijit Kalita is a Senior AI/ML Evangelist at Amazon Web Services. He spends his time working with public sector partners in Asia Pacific, enabling them on their AI/ML workloads. He has many years of experience in data analytics, AI, and machine learning across different verticals such as automotive, semiconductor manufacturing, and financial services. His areas of interest are machine learning and artificial intelligence, especially NLP and computer vision. In his spare time, Abhijit enjoys spending time with his family, biking, and playing with his little hamster.

How OCX Cognition reduced ML model development time from weeks to days …

This post was co-authored by Brian Curry (Founder and Head of Products at OCX Cognition) and Sandhya MN (Data Science Lead at InfoGain)
OCX Cognition is a San Francisco Bay Area-based startup, offering a commercial B2B software as a service (SaaS) product called Spectrum AI. Spectrum AI is a predictive (generative) CX analytics platform for enterprises. OCX’s solutions are developed in collaboration with Infogain, an AWS Advanced Tier Partner. Infogain works with OCX Cognition as an integrated product team, providing human-centered software engineering services and expertise in software development, microservices, automation, Internet of Things (IoT), and artificial intelligence.
The Spectrum AI platform combines customer attitudes with customers’ operational data and uses machine learning (ML) to generate continuous insight on CX. OCX built Spectrum AI on AWS because AWS offered a wide range of tools, elastic computing, and an ML environment that would keep pace with evolving needs.
In this post, we discuss how OCX Cognition with the support of Infogain and OCX’s AWS account team improved their end customer experience and reduced time to value by automating and orchestrating ML functions that supported Spectrum AI’s CX analytics. Using AWS Step Functions, the AWS Step Functions Data Science SDK for Python, and Amazon SageMaker Experiments, OCX Cognition reduced ML model development time from 6 weeks to 2 weeks and reduced ML model update time from 4 days to near-real time.
Background
The Spectrum AI platform has to produce models tuned for hundreds of different generative CX scores for each customer, and these scores need to be uniquely computed for tens of thousands of active accounts. As time passes and new experiences accumulate, the platform has to update these scores based on new data inputs. After new scores are produced, OCX and Infogain compute the relative impact of each underlying operational metric in the prediction. Amazon SageMaker is a web-based integrated development environment (IDE) that allows you to build, train, and deploy ML models for any use case with fully managed infrastructure, tools, and workflows. With SageMaker, the OCX-Infogain team developed their solution using shared code libraries across individually maintained Jupyter notebooks in Amazon SageMaker Studio.
The problem: Scaling the solution for multiple customers
While the initial R&D proved successful, scaling posed a challenge. OCX and Infogain’s ML development involved multiple steps: feature engineering, model training, prediction, and the generation of analytics. The code for these modules resided in multiple notebooks, and running the notebooks was manual, with no orchestration tool in place. For every new customer, the OCX-Infogain team spent 6 weeks on model development because libraries couldn’t be reused. Due to the amount of time spent on model development, the OCX-Infogain team needed an automated and scalable solution that operated as a singular platform using unique configurations for each of their customers.
The following architecture diagram depicts OCX’s initial ML model development and update processes.

Solution overview
To simplify the ML process, the OCX-Infogain team worked with the AWS account team to develop a custom declarative ML framework to replace all repetitive code. This reduced the need to develop new low-level ML code. New libraries could be reused for multiple customers by configuring the data appropriately for each customer through YAML files.
While this high-level code continues to be developed initially in Studio using Jupyter notebooks, it’s then converted to Python (.py files), and the SageMaker platform is used to build a Docker image with BYO (bring your own) containers. The Docker images are then pushed to Amazon Elastic Container Registry (Amazon ECR) as a preparatory step. Finally, the code is run using Step Functions.
The AWS account team recommended the Step Functions Data Science SDK and SageMaker Experiments to automate feature engineering, model training, and model deployment. The Step Functions Data Science SDK was used to generate the state machines programmatically. The OCX-Infogain team learned how to use features like Parallel and Map states within Step Functions to orchestrate a large number of training and processing jobs in parallel, which reduces the runtime. This was combined with Experiments, which functions as an analytics tool, tracking multiple ML candidates and hyperparameter tuning variations. These built-in analytics allowed the OCX-Infogain team to compare multiple metrics at runtime and identify best-performing models on the fly.
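A minimal sketch of how parallel training branches can be expressed with the Step Functions Data Science SDK is shown below. The estimator objects, data locations, names, and IAM role are placeholders that would come from the pipeline configuration; the actual OCX workflow contains many more branches and steps.

from stepfunctions.steps import Chain, Parallel, TrainingStep
from stepfunctions.workflow import Workflow

# Assumptions: xgb_estimator and linear_estimator are preconfigured SageMaker estimators,
# and train_s3_uri points at the prepared training data in Amazon S3.
parallel_training = Parallel(state_id="TrainCandidateModels")
parallel_training.add_branch(
    TrainingStep(state_id="TrainXGBoost", estimator=xgb_estimator,
                 data={"train": train_s3_uri}, job_name="cx-xgb-training")
)
parallel_training.add_branch(
    TrainingStep(state_id="TrainLinear", estimator=linear_estimator,
                 data={"train": train_s3_uri}, job_name="cx-linear-training")
)

workflow = Workflow(
    name="cx-model-training",
    definition=Chain([parallel_training]),
    role="arn:aws:iam::123456789012:role/StepFunctionsWorkflowRole",  # placeholder role
)
workflow.create()
execution = workflow.execute()
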
The following architecture diagram shows the MLOps pipeline developed for the model creation cycle.

The Step Functions Data Science SDK is used to analyze and compare multiple model training algorithms. The state machine runs multiple models in parallel, and each model output is logged into Experiments. When model training is complete, the results of multiple experiments are retrieved and compared using the SDK. The following screenshots show how the best performing model is selected for each stage.
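Programmatically, the retrieval and comparison can be done with the SageMaker Experiments analytics helpers. The sketch below pulls all trial metrics for one experiment into a DataFrame and picks the best candidate by a validation metric; the experiment name and metric column are placeholders that depend on what the training jobs log.

from sagemaker.analytics import ExperimentAnalytics

# Placeholder experiment name; in practice one experiment tracks the candidates for a stage
analytics = ExperimentAnalytics(experiment_name="cx-score-training")

df = analytics.dataframe()  # one row per trial component, with parameters and metrics

# Pick the best-performing candidate by a validation metric logged during training
metric_col = "validation:rmse - Last"  # column name depends on the metrics you emit
best = df.sort_values(metric_col, ascending=True).iloc[0]
print("best trial:", best["TrialComponentName"], "->", best[metric_col])
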

The following are the high-level steps of the ML lifecycle:

ML developers push their code into libraries on the Gitlab repository when development in Studio is complete.
AWS CodePipeline is used to check out the appropriate code from the Gitlab repository.
A Docker image is prepared using this code and pushed to Amazon ECR for serverless computing.
Step Functions is used to run steps using Amazon SageMaker Processing jobs. Here, multiple independent tasks are run in parallel:

Feature engineering is performed, and the features are stored in the feature store.
Model training is run, with multiple algorithms and several combinations of hyperparameters utilizing the YAML configuration file.
The training step function is designed to have heavy parallelism. The models for each journey stage are run in parallel. This is depicted in the following diagram.

Model results are then logged in Experiments. The best-performing model is selected and pushed to the model registry.
Predictions are made using the best-performing models for each CX analytic we generate.
Hundreds of analytics are generated and then handed off for publication in a data warehouse hosted on AWS.

Results
With this approach, OCX Cognition has automated and accelerated their ML processing. By replacing labor-intensive manual processes and highly repetitive development burdens, the cost per customer is reduced by over 60%. This also allows OCX to scale their software business by tripling overall capacity and doubling capacity for simultaneous onboarding of customers. Automating their ML processing unlocks new potential for OCX to grow through customer acquisition. Using SageMaker Experiments to track model training is critical to identifying the best set of models to use and take to production. For their customers, this new solution provides not only an 8% improvement in ML performance, but a 63% improvement in time to value. New customer onboarding and the initial model generation has improved from 6 weeks to 2 weeks. Once built and in place, OCX begins to continuously regenerate the CX analytics as new input data arrives from the customer. These update cycles have improved from 4 days to near-real time.
Conclusion
In this post, we showed how OCX Cognition and Infogain utilized Step Functions, the Step Functions Data Science SDK for Python, and SageMaker Experiments in conjunction with SageMaker Studio to reduce time to value for the OCX-Infogain team in developing and updating CX analytics models for their customers.
To get started with these services, refer to Amazon SageMaker, AWS Step Functions Data Science Python SDK, AWS Step Functions, and Manage Machine Learning with Amazon SageMaker Experiments.

About the Authors
Brian Curry is currently a founder and Head of Products at OCX Cognition, where we are building a machine learning platform for customer analytics. Brian has more than a decade of experience leading cloud solutions and design-centered product organizations.
Sandhya M N is part of Infogain and leads the Data Science team for OCX. She is a seasoned software development leader with extensive experience across multiple technologies and industry domains. She is passionate about staying up to date with technology and using it to deliver business value to customers.
Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.
Sabha Parameswaran is a Senior Solutions Architect at AWS with over 20 years of deep experience in enterprise application integration, microservices, containers and distributed systems performance tuning, prototyping, and more. He is based out of the San Francisco Bay Area. At AWS, he is focused on helping customers in their cloud journey and is also actively involved in microservices and serverless-based architecture and frameworks.
Vaishnavi Ganesan is a Solutions Architect at AWS based in the San Francisco Bay Area. She is focused on helping Commercial Segment customers on their cloud journey and is passionate about security in the cloud. Outside of work, Vaishnavi enjoys traveling, hiking, and trying out various coffee roasters.
Ajay Swaminathan is an Account Manager II at AWS. He is an advocate for Commercial Segment customers, providing the right financial, business innovation, and technical resources in accordance with his customers’ goals. Outside of work, Ajay is passionate about skiing, dubstep and drum and bass music, and basketball.

Perp-Neg: Unveiling Image Potential with Negative Prompts and Stable D …

Despite the remarkable capabilities demonstrated by advancements in generating images from text using diffusion models, the accuracy of the generated images in conveying the intended meaning of the original text prompt is not always guaranteed, as found by recent research. Generating images that effectively align with the semantic content of the text query is a challenging task that necessitates a deep understanding of textual concepts and their meaning in visual representations.

Due to the challenges of acquiring detailed annotations, current text-to-image models struggle to fully comprehend the intricate relationship between text and images. Consequently, these models tend to generate images that resemble frequently occurring text-image pairs in the training datasets. As a result, the generated images often lack requested attributes or contain undesired ones. While recent research efforts have focused on addressing this issue by reintroducing missing objects or attributes to modify images based on well-crafted text prompts, there is a limited exploration of techniques for removing redundant attributes or explicitly instructing the model to exclude unwanted objects using negative prompts. 

Based on this research gap, a new approach has been proposed to address the current limitations of the existing algorithm for negative prompts. According to the authors of this work, the current implementation of negative prompts can lead to unsatisfactory results, particularly when there is an overlap between the main prompt and the negative prompts. 

To address this issue, they propose a novel algorithm called Perp-Neg, which does not require any training and can be applied to a pre-trained diffusion model. The architecture is reported below. 

The name “Perp-Neg” is derived from the concept of utilizing the perpendicular score estimated by the denoiser for the negative prompt. This choice of name reflects the key principle behind the Perp-Neg algorithm. Specifically, Perp-Neg employs a denoising process that is restricted to be perpendicular to the direction of the main prompt. This geometric constraint plays a crucial role in achieving the desired outcome.

Perp-Neg effectively addresses the issue of undesired perspectives in the negative prompts by limiting the denoising process to be perpendicular to the main prompt. It ensures that the model focuses on eliminating aspects that are orthogonal or unrelated to the main semantics of the prompt. In other words, Perp-Neg enables the model to remove undesirable attributes or objects not aligned with the text’s intended meaning while preserving the main prompt’s core essence.

This approach helps in enhancing the overall quality and coherence of the generated images, ensuring a stronger alignment with the original text input.

Some results obtained via Perp-Neg are presented in the figure below.

Beyond image synthesis, Perp-Neg is also extended to DreamFusion, an advanced text-to-3D model. Furthermore, in this context, the authors demonstrate its effectiveness in mitigating the Janus problem. The Janus (or multi-faced) problem refers to situations where a 3D-generated object is primarily rendered according to its canonical view rather than other perspectives. This problem mainly happens because the training dataset is unbalanced. For instance, animals or people are usually depicted from their front view and only sporadically from the side or back views.

This was the summary of Perp-Neg, a novel AI algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative prompts algorithm. If you are interested, you can learn more about this technique in the links below.

Check out the Paper, Project, and Github. Don’t forget to join our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Perp-Neg: Unveiling Image Potential with Negative Prompts and Stable Diffusion appeared first on MarkTechPost.

Meet XTREME-UP: A Benchmark for Evaluating Multilingual Models with Sc …

The fields of Artificial Intelligence and Machine Learning are solely dependent upon data. Everyone is deluged with data from different sources like social media, healthcare, finance, etc., and this data is of great use to applications involving Natural Language Processing. But even with so much data, readily usable data is scarce for training an NLP model for a particular task. Finding high-quality data with usefulness and good-quality filters is a difficult task. Specifically talking about developing NLP models for different languages, the lack of data for most languages comes as a limitation that hinders progress in NLP for under-represented languages (ULs). 

The emerging tasks like news summarization, sentiment analysis, question answering, or the development of a virtual assistant all heavily rely on data availability in high-resource languages. These tasks depend on technologies like language identification, automatic speech recognition (ASR), or optical character recognition (OCR), which are mostly unavailable for under-represented languages. To overcome this, it is important to build datasets and evaluate models on tasks that would be beneficial for UL speakers. 

Recently, a team of researchers from GoogleAI has proposed a benchmark called XTREME-UP (Under-Represented and User-Centric with Paucal Data) that evaluates multilingual models on user-centric tasks in a few-shot learning setting. It primarily focuses on activities that technology users often perform in their day-to-day lives, such as information access and input/output activities that enable other technologies. The three main features that distinguish XTREME-UP are – its use of scarce data, its user-centric design, and its focus on under-represented languages.

With XTREME-UP, the researchers have introduced a standardized multilingual in-language fine-tuning setting in place of the conventional cross-lingual zero-shot option. This method considers the amount of data that can be generated or annotated in an 8-hour period for a particular language, thus aiming to give the ULs a more useful evaluation setup. 

XTREME-UP assesses the performance of language models across 88 under-represented languages in 9 significant user-centric technologies, some of which include Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Machine Translation (MT), and information access tasks that have general utility. The researchers have developed new datasets specifically for operations like OCR, autocomplete, semantic parsing, and transliteration in order to evaluate the capabilities of the language models. They have also improved and polished the currently existing datasets for other tasks in the same benchmark.

One of XTREME-UP’s key abilities is to assess various modeling scenarios, including both text-only and multi-modal settings with visual, audio, and text inputs. It also offers methods for supervised parameter tuning and in-context learning, allowing for a thorough assessment of various modeling approaches. The tasks in XTREME-UP involve enabling access to language technology, enabling information access as part of a larger system (such as question answering, information extraction, and virtual assistants), and making information accessible in the speaker’s language.

Consequently, XTREME-UP is a great benchmark that addresses the data scarcity challenge in highly multilingual NLP systems. It is a standardized evaluation framework for under-represented language and seems really useful for future NLP research and developments.

Check out the Paper and Github. Don’t forget to join our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Meet XTREME-UP: A Benchmark for Evaluating Multilingual Models with Scarce Data Evaluation, Focusing on Under-Represented Languages appeared first on MarkTechPost.

Meet ONE-PEACE: A General Representation Model Towards Unlimited Modal …

Representation models have gotten much attention in computer vision, voice, natural language processing, etc. Representation models exhibit high generalization in various downstream tasks after learning from vast data. Furthermore, there is a growing demand for representation models due to the spectacular rise of large-scale language models (LLMs). Representation models have recently demonstrated their fundamental importance in enabling LLMs to comprehend, experience, and engage with other modalities (like vision). Previous research has mostly focused on developing uni-modal representation models with unique topologies and pretraining tasks due to the various properties of various modalities. 

Recent efforts in vision-language and audio-language learning have shown promising results thanks to the development of unified architectures and effective pretraining activities. However, research on creating universal models that can be used for language, audio, and visual modalities is still lacking. Despite producing outstanding results, uni-modal representation models cannot efficiently use multi-modal data, such as image-text and audio-text pairings, making it difficult to apply them to multi-modal tasks. One line of prior work uses a single masked prediction task with the Multiway Transformer to analyze text and image modalities for pretraining. 

Its scalability to other modalities, such as audio, is constrained since the masked prediction task requires a pretrained CLIP model to discretize the image input. Another line of work offers a broad pretraining approach that can be used for language, audio, and visual modalities without external models (like CLIP), but it still needs to be extended to multi-modal data. In this study, they investigate a scalable method to develop a general representation model that can accommodate any number of modalities. They promote the following requirements for a broad representation model: 1. The model design must be adaptable enough to handle multi-modal interaction and multiple modalities. 2. Pretraining tasks should promote alignment across modalities and information extraction within each modality. 3. Pretraining tasks should be broad and uncomplicated so they may be used with various modalities. 

Due to these incentives, researchers from DAMO Academy and Huazhong University of Science and Technology suggest ONE-PEACE, a model with 4B parameters that can smoothly align and integrate representations across visual, audio, and language modalities. The architecture of ONE-PEACE comprises a modality fusion encoder and several modality adapters. Each modality includes an adapter to transform the raw inputs into feature sequences, and the modality fusion encoder processes these feature sequences with a Transformer-based architecture. A common self-attention layer and several modality Feed Forward Networks (FFNs) are present in each Transformer block. The modality FFNs aid in information extraction within each modality, while the self-attention layer uses the attention mechanism to enable interaction between the multi-modal features. 

This architecture’s clear division of labor makes adding new modalities simple and merely calls for adding adapters and FFNs. They provide two modality-independent pretraining tasks for ONE-PEACE. The first is cross-modal contrastive learning, which combines vision-language contrastive learning and audio-language contrastive learning to align the semantic spaces of the three modalities of vision, audio, and language. The second is intra-modal denoising contrastive learning, which can be thought of as combining masked prediction and contrastive learning. Contrastive loss is applied between the fine-grained masked features and visible features, such as image patches, language tokens, or audio waveform features. 

ONE-PEACE can be expanded to unlimited modalities thanks to the scaling-friendly model design and pretraining tasks. Together, these tasks improve the model’s performance during fine-tuning while preserving cross-modal retrieval capacity. They also eliminate the requirement for modality-specific plans because they are universal for all modalities. They carry out in-depth studies on various tasks in various modalities, such as vision, audio, vision-language, and audio-language tasks. ONE-PEACE achieves industry-leading results without using vision or language pre-trained models for initialization in uni-modal and multi-modal tasks. The code is publicly available on GitHub.

Check out the Paper and Github. Don’t forget to join our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com

Check Out 100’s AI Tools in AI Tools Club
The post Meet ONE-PEACE: A General Representation Model Towards Unlimited Modalities Across Different Modalities appeared first on MarkTechPost.

Dialogue-guided intelligent document processing with foundation models …

Intelligent document processing (IDP) is a technology that automates the processing of high volumes of unstructured data, including text, images, and videos. IDP offers a significant improvement over manual methods and legacy optical character recognition (OCR) systems by addressing challenges such as cost, errors, low accuracy, and limited scalability, ultimately leading to better outcomes for organizations and stakeholders.
Natural language processing (NLP) is one of the recent developments in IDP that has improved accuracy and user experience. However, despite these advances, there are still challenges to overcome. For instance, many IDP systems are not user-friendly or intuitive enough for easy adoption by users. Additionally, several existing solutions lack the capability to adapt to changes in data sources, regulations, and user requirements through continuous improvement and updates.
Enhancing IDP through dialogue involves incorporating dialogue capabilities into IDP systems. By enabling users to interact with IDP systems in a more natural and intuitive way (through multi-round dialogue, adjusting inaccurate information, or adding missing information, aided by task automation), these systems can become more efficient, accurate, and user-friendly.

In this post, we explore an innovative approach to IDP that utilizes a dialogue-guided query solution using Amazon Foundation Models and SageMaker JumpStart.
Solution overview
This innovative solution combines OCR for information extraction, a locally deployed large language model (LLM) for dialogue and autonomous tasking, VectorDB for embedding subtasks, and LangChain-based task automation for integration with external data sources to transform the way businesses process and analyze document contexts. By harnessing generative AI technologies, organizations can streamline IDP workflows, enhance user experience, and boost overall efficiency.
The following video highlights the dialogue-guided IDP system by processing an article authored by the Federal Reserve Board of Governors, discussing the collapse of Silicon Valley Bank in March 2023.

The system is capable of processing images, large PDFs, and documents in other formats, and answering questions derived from the content via interactive text or voice inputs. If a user needs to inquire beyond the document’s context, the dialogue-guided IDP can create a chain of tasks from the text prompt and then reference external and up-to-date data sources for relevant answers. Additionally, it supports multi-round conversations and accommodates multilingual exchanges, all managed through dialogue.
Deploy your own LLM using Amazon foundation models
One of the most promising developments in generative AI is the integration of LLMs into dialogue systems, opening up new avenues for more intuitive and meaningful exchanges. An LLM is a type of AI model designed to understand and generate human-like text. These models are trained on massive amounts of data and consist of billions of parameters, allowing them to perform various language-related tasks with high accuracy. This transformative approach facilitates a more natural and productive interaction, bridging the gap between human intuition and machine intelligence. A key advantage of local LLM deployment lies in its ability to enhance data security without sending data to third-party APIs. Moreover, you can fine-tune your chosen LLM with domain-specific data, resulting in a more accurate, context-aware, and natural language understanding experience.
The Jurassic-2 series from AI21 Labs, which are based on the instruct-tuned 178-billion-parameter Jurassic-1 LLM, are integral parts of the Amazon foundation models available through Amazon Bedrock. The Jurassic-2 instruct was specifically trained to manage prompts that are instructions only, known as zero-shot, without the need for examples, or few-shot. This method provides the most intuitive interaction with LLMs, and it’s the best approach to understand the ideal output for your task without requiring any examples. You can efficiently deploy the pre-trained J2-jumbo-instruct, or other Jurassic-2 models available on AWS Marketplace, into your own virtual private cloud (VPC) using Amazon SageMaker. See the following code:
import ai21, sagemaker
from sagemaker import ModelPackage

# Define endpoint name
endpoint_name = "sagemaker-soln-j2-jumbo-instruct"
# Define real-time inference instance type. You can also choose g5.48xlarge or p4de.24xlarge instance types
# Please request P instance quota increase via the Service Quotas console or your account manager
real_time_inference_instance_type = ("ml.p4d.24xlarge")

# Create a SageMaker endpoint, then deploy a pre-trained J2-jumbo-instruct-v1 model from AWS Marketplace.
model_package_arn = "arn:aws:sagemaker:us-east-1:865070037744:model-package/j2-jumbo-instruct-v1-0-20-8b2be365d1883a15b7d78da7217cdeab"
model = ModelPackage(
    role=sagemaker.get_execution_role(),
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker.Session()
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type,
                         endpoint_name=endpoint_name,
                         model_data_download_timeout=3600,
                         container_startup_health_check_timeout=600,
                         )
After the endpoint has been successfully deployed within your own VPC, you can initiate an inference task to verify that the deployed LLM is functioning as anticipated:
response_jumbo_instruct = ai21.Completion.execute(
    sm_endpoint=endpoint_name,
    prompt="Explain deep learning algorithms to 8th graders",
    numResults=1,
    maxTokens=100,
    temperature=0.01  # subject to reduce "hallucination" by using common words.
)
Document processing, embedding, and indexing
We delve into the process of building an efficient and effective search index, which forms the foundation for intelligent and responsive dialogues to guide document processing. To begin, we convert documents from various formats into text content using OCR and Amazon Textract. We then read this content and fragment it into smaller pieces, ideally around the size of a sentence each. This granular approach allows for more precise and relevant search results, because it enables better matching of queries against individual segments of a page rather than the entire document. To further enhance the process, we use embeddings such as the sentence transformers library from Hugging Face, which generates vector representations (encoding) of each sentence. These vectors serve as a compact and meaningful representation of the original text, enabling efficient and accurate semantic matching functionality. Finally, we store these vectors in a vector database for similarity search. This combination of techniques lays the groundwork for a novel document processing framework that delivers accurate and intuitive results for users. The following diagram illustrates this workflow.

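To make the sentence-level embedding and matching step concrete, here is a minimal, hypothetical sketch using the sentence-transformers library (the model name mirrors the one used for sharded embedding later in this post; the example sentences and query are made up):

from sentence_transformers import SentenceTransformer, util

# Encode a few example sentences and a query, then score them by relevance.
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
sentences = [
    "The borrower must provide two years of W-2 forms.",
    "Paystubs from the last 30 days are required.",
]
query = "Which income documents are needed?"

sentence_vectors = model.encode(sentences, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

# Dot-product scoring suits the multi-qa-* models; higher means more relevant.
print(util.dot_score(query_vector, sentence_vectors))

In the full solution, these vectors are stored in a vector database rather than compared in memory.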
OCR serves as a crucial element in the solution, allowing for the retrieval of text from scanned documents or pictures. We can use Amazon Textract for extracting text from PDF or image files. This managed OCR service is capable of identifying and examining text in multi-page documents, including those in PDF, JPEG or TIFF formats, such as invoices and receipts. The processing of multi-page documents occurs asynchronously, making it advantageous for handling extensive, multi-page documents. See the following code:
import os
import time
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")
textract_client = boto3.client("textract")
default_bucket_name = "<your-s3-bucket>"  # S3 bucket that receives the uploaded PDFs
output_file = "extracted_text.txt"  # local file that receives the extracted text

def pdf_2_text(input_pdf_file, history):
    history = history or []
    key = "input-pdf-files/{}".format(os.path.basename(input_pdf_file.name))
    try:
        s3_client.upload_file(input_pdf_file.name, default_bucket_name, key)
    except ClientError as e:
        print("Error uploading file to S3:", e)

    # Start an asynchronous Textract analysis job and poll until it finishes.
    s3_object = {"Bucket": default_bucket_name, "Name": key}
    response = textract_client.start_document_analysis(
        DocumentLocation={"S3Object": s3_object},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = response["JobId"]
    while True:
        response = textract_client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status in ["SUCCEEDED", "FAILED"]:
            break
        time.sleep(5)

    if status == "SUCCEEDED":
        # Write the detected text blocks to a local text file.
        with open(output_file, "w") as output_file_io:
            for block in response["Blocks"]:
                if block["BlockType"] in ["LINE", "WORD"]:
                    output_file_io.write(block["Text"] + "\n")
        # Preview the first 512 characters of the converted document.
        with open(output_file, "r") as file:
            first_512_chars = file.read(512).replace("\n", "").replace("\r", "").replace("[", "").replace("]", "") + " [...]"
        history.append(("Document conversion", first_512_chars))
    return history, history
When dealing with large documents, it’s crucial to break them down into more manageable pieces for easier processing. In the case of LangChain, this means dividing each document into smaller segments, such as 1,000 characters per chunk with an overlap of 100 characters. To achieve this smoothly, LangChain utilizes specialized splitters designed specifically for this purpose:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

separator = "\n"
overlap_count = 100  # overlap (in characters) between the splits
chunk_size = 1000  # fixed split unit size
loader = TextLoader(output_file)
documents = loader.load()
text_splitter = CharacterTextSplitter(separator=separator, chunk_overlap=overlap_count, chunk_size=chunk_size, length_function=len)
texts = text_splitter.split_documents(documents)
The duration needed for embedding can fluctuate based on the size of the document; for example, it could take roughly 10 minutes to finish. Although this time frame may not be substantial when dealing with a single document, the ramifications become more notable when indexing hundreds of gigabytes as opposed to just hundreds of megabytes. To expedite the embedding process, you can implement sharding, which enables parallelization and consequently enhances efficiency:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
import numpy as np
import ray
from embeddings import LocalHuggingFaceEmbeddings

# Define number of splits
db_shards = 10

loader = TextLoader(output_file)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)

@ray.remote
def process_shard(shard):
    # Embed one shard of document chunks and index it in a Chroma vector store.
    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = Chroma.from_documents(shard, embeddings)
    return result

# Read the doc content and split it into chunks.
chunks = text_splitter.create_documents([doc.page_content for doc in documents], metadatas=[doc.metadata for doc in documents])
# Embed the doc chunks into vectors, one Ray task per shard.
shards = np.array_split(chunks, db_shards)
futures = [process_shard.remote(shards[i]) for i in range(db_shards)]
texts = ray.get(futures)
Now that we have obtained the smaller segments, we can represent them as vectors through embeddings. Embeddings, an NLP technique, generate vector representations of text. LangChain’s Embeddings class serves as a unified interface for interacting with various embedding providers, such as SageMaker, Cohere, Hugging Face, and OpenAI, which streamlines the process across different platforms. These embeddings are numeric representations of concepts, allowing computers to capture the relationships between them. See the following code:
from langchain.embeddings import SagemakerEndpointEmbeddings

# Choose a SageMaker-deployed local LLM endpoint for embedding.
# content_handler transforms requests and responses for the endpoint (defined later in this post).
llm_embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=<endpoint_name>,
    region_name=<region>,
    content_handler=content_handler,
)
After creating the embeddings, we need to utilize a vectorstore to store the vectors. Vectorstores like Chroma are specially engineered to construct indexes for quick searches in high-dimensional spaces later on, making them perfectly suited for our objectives. As an alternative, you can use FAISS, an open-source library for efficient vector similarity search, to store the vectors. See the following code:
from langchain.vectorstores import Chroma
# Store vectors in a Chroma vector DB
docsearch_chroma = Chroma.from_documents(texts, llm_embeddings)
# Alternatively, you can choose the FAISS vectorstore
from langchain.vectorstores import FAISS
docsearch_faiss = FAISS.from_documents(texts, llm_embeddings)
You can also use Amazon Kendra to index enterprise content and produce precise answers. As a fully managed service, Amazon Kendra offers ready-to-use semantic search features for advanced document and passage ranking. With the high-accuracy search in Amazon Kendra, you can obtain the most pertinent content and documents to optimize the quality of your payload. This results in superior LLM responses compared to traditional or keyword-focused search methods. For more information, refer to Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models.
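If you go the Amazon Kendra route, LangChain also provides a retriever wrapper. The following is a minimal sketch; the index ID and query are placeholders, and the exact parameters may vary by LangChain version:

from langchain.retrievers import AmazonKendraRetriever

# Retrieve the top-ranked passages from an existing Kendra index.
kendra_retriever = AmazonKendraRetriever(
    index_id="<your_kendra_index_id>",
    region_name="us-east-1",
    top_k=3,
)
docs = kendra_retriever.get_relevant_documents("What is the notice period in the lease agreement?")

The returned documents can then feed the same QA chain used later in this post in place of the vectorstore-based search.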
Interactive multilingual voice input
Incorporating interactive voice input into document search offers a myriad of advantages that enhance the user experience. By enabling users to verbally articulate search terms, document search becomes more natural and intuitive, making it simpler and quicker for users to find the information they need. Voice input can bolster the precision of search results, because spoken search terms are less susceptible to spelling or grammatical errors. Interactive voice input also renders document search more inclusive, catering to a broader spectrum of users who speak different languages and come from different cultural backgrounds.
The Amazon Transcribe Streaming SDK enables you to perform speech-to-text transcription by integrating directly with Amazon Transcribe, using only a stream of audio bytes and a basic handler. As an alternative, you can deploy the whisper-large model locally from Hugging Face using SageMaker, which offers improved data security and better performance. For details, refer to the sample notebook published on the GitHub repo.
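For the Amazon Transcribe streaming path, the following is a minimal sketch based on the amazon-transcribe streaming SDK; it assumes the amazon-transcribe and aiofile packages and a 16 kHz PCM recording named voice_query.wav:

import asyncio

import aiofile
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintTranscriptHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Print each transcript segment as it arrives from the service.
        for result in transcript_event.transcript.results:
            for alternative in result.alternatives:
                print(alternative.transcript)


async def transcribe_file(audio_path: str = "voice_query.wav"):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,  # assumes 16 kHz PCM audio
        media_encoding="pcm",
    )

    async def write_chunks():
        # Feed the raw audio bytes to the service in small chunks.
        async with aiofile.AIOFile(audio_path, "rb") as afp:
            reader = aiofile.Reader(afp, chunk_size=1024 * 16)
            async for chunk in reader:
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintTranscriptHandler(stream.output_stream)
    await asyncio.gather(write_chunks(), handler.handle_events())


# asyncio.run(transcribe_file())

The locally hosted Whisper alternative is shown next.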
# Choose ASR using a locally deployed Whisper-large model from Hugging Face
image = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    image_scope="inference",
    version="1.12",
    instance_type="ml.g4dn.xlarge",
)

model_name = f"sagemaker-soln-whisper-model-{int(time.time())}"
whisper_model_sm = sagemaker.model.Model(
    model_data=model_uri,
    image_uri=image,
    role=sagemaker.get_execution_role(),
    entry_point="inference.py",
    source_dir="src",
    name=model_name,
)

# Deploy the model to a real-time endpoint
whisper_endpoint = whisper_model_sm.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

# Audio transcribe
transcribe = whisper_endpoint.predict(audio.numpy())
The demonstration video shows how voice commands, in conjunction with text input, can facilitate document summarization through interactive conversation.
Guiding NLP tasks through multi-round conversations
Memory in language models maintains a concept of state throughout a user’s interactions. This involves processing a sequence of chat messages to extract and transform knowledge. Memory types vary, but each can be understood using standalone functions and within a chain. Memory can return multiple data points, such as recent messages or message summaries, in the form of strings or lists. This post focuses on the simplest memory form, buffer memory, which stores all prior messages, and demonstrates its usage with modular utility functions and chains.
LangChain’s ChatMessageHistory class is a crucial utility for memory modules, providing convenient methods to save and retrieve human and AI messages by remembering all previous chat interactions. It’s ideal for managing memory externally from a chain. The following code is an example of applying a simple concept in a chain by introducing ConversationBufferMemory, a wrapper for ChatMessageHistory. This wrapper extracts messages into a variable, allowing them to be represented as a string:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(return_messages=True)
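To make the buffer behavior concrete, the following short example (the messages are made up) saves a conversation turn and reads it back, using both the standalone ChatMessageHistory utility and the ConversationBufferMemory wrapper:

from langchain.memory import ChatMessageHistory, ConversationBufferMemory

# Standalone history: save and retrieve raw human/AI messages.
history = ChatMessageHistory()
history.add_user_message("Summarize the uploaded lease agreement.")
history.add_ai_message("The lease runs for 12 months starting June 1, 2023.")
print(history.messages)

# The buffer wrapper exposes the same kind of history to a chain.
memory = ConversationBufferMemory(return_messages=True)
memory.save_context(
    {"input": "Summarize the uploaded lease agreement."},
    {"output": "The lease runs for 12 months starting June 1, 2023."},
)
print(memory.load_memory_variables({}))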
LangChain works with many popular LLM providers such as AI21 Labs, OpenAI, Cohere, Hugging Face, and more. For this example, we use a SageMaker-deployed AI21 Labs Jurassic-2 LLM through LangChain’s wrapper. AI21 Studio also provides API access to Jurassic-2 LLMs.
import json
from typing import Dict

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain.chains import VectorDBQA
from langchain.memory import ConversationBufferMemory

# prompt_template holds the instruction template text defined for your QA use case.
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

class ContentHandler(ContentHandlerBase):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"prompt": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]

content_handler = ContentHandler()

llm_ai21 = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    credentials_profile_name="aws-credentials-profile-name",
    region_name="us-east-1",
    model_kwargs={"temperature": 0},
    content_handler=content_handler,
)

qa_chain = VectorDBQA.from_chain_type(
    llm=llm_ai21,
    chain_type="stuff",
    vectorstore=docsearch_chroma,
    verbose=True,
    memory=ConversationBufferMemory(return_messages=True),
)

# query_input holds the user's question.
response = qa_chain(
    {"query": query_input},
    return_only_outputs=True,
)
If the process cannot find an appropriate response to a user’s inquiry in the original documents, integrating a third-party URL, or better, a task-driven autonomous agent with external data sources, significantly expands the information the system can access, improving context and yielding more accurate and current results.
With AI21’s preconfigured Summarize run method, a query can access a predetermined URL, condense its content, and then carry out question and answer tasks based on the summarized information:
# Call the AI21 API to query the context of a specific URL for Q&A
ai21.api_key = "<YOUR_API_KEY>"
url_external_source = "<your_source_url>"
response_url = ai21.Summarize.execute(
    source=url_external_source,
    sourceType="URL",
)
context = "<concatenated_document_and_response_url>"
question = "<query>"
response = ai21.Answer.execute(
    context=context,
    question=question,
    sm_endpoint=endpoint_name,
    maxTokens=100,
)
For additional details and code examples, refer to the LangChain LLM integration document as well as the task-specific API documents provided by AI21.
Task automation using BabyAGI
The task automation mechanism allows the system to process complex queries and generate relevant responses, which greatly improves the validity and authenticity of document processing. LangChain’s BabyAGI is a powerful AI-powered task management system that can autonomously create, prioritize, and run tasks. One of its key features is the ability to interface with external sources of information, such as the web, databases, and APIs. One way to use this feature is to integrate BabyAGI with SerpApi, a search engine API that provides access to search engines. This integration allows BabyAGI to search the web for information related to its tasks, giving it access to a wealth of information beyond the input documents.
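As an illustration of the SerpApi integration, the following sketch wraps the search API as a LangChain tool; it assumes a SERPAPI_API_KEY and the serpapi package, and how the tool is wired into the execution agent depends on the BabyAGI variant you use:

import os
from langchain.agents import Tool
from langchain.utilities import SerpAPIWrapper

# SerpAPIWrapper reads the API key from the environment.
os.environ["SERPAPI_API_KEY"] = "<your_serpapi_api_key>"

search = SerpAPIWrapper()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful for answering questions about current events beyond the indexed documents.",
    )
]

The execution agent that BabyAGI drives can then choose the Search tool whenever a task requires information that is not in the input documents.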
BabyAGI’s autonomous tasking capacity is fueled by an LLM, a vector search database, an API wrapper to external links, and the LangChain framework, allowing it to run a broad spectrum of tasks across various domains. This enables the system to proactively carry out tasks based on user interactions, streamlining the document processing pipeline that incorporates external sources and creating a more efficient, smooth experience. The following diagram illustrates the task automation process.

This process includes the following components:

Memory – The memory stores all the information that BabyAGI needs to complete its tasks. This includes the task itself, as well as any intermediate results or data that BabyAGI has generated.
Execution agent – The execution agent is responsible for carrying out the tasks that are stored in the memory. It does this by accessing the memory, retrieving the relevant information, and then taking the necessary steps to complete the task.
Task creation agent – The task creation agent is responsible for generating new tasks for BabyAGI to complete. It does this by analyzing the current state of the memory and identifying any gaps in knowledge or understanding. When a gap has been identified, the task creation agent generates a new task that will help BabyAGI fill that gap.
Task queue – The task queue is a list of all of the tasks that BabyAGI has been assigned. The tasks are added to the queue in the order in which they were received.
Task prioritization agent – The task prioritization agent is responsible for determining the order in which BabyAGI should complete its tasks. It does this by analyzing the tasks in the queue and identifying the ones that are most important or urgent. The tasks that are most important are placed at the front of the queue, and the tasks that are least important are placed at the back of the queue.

See the following code:
from typing import Optional

import faiss
from babyagi import BabyAGI
from langchain.docstore import InMemoryDocstore
from langchain.vectorstores import FAISS

# The LLM was configured with temperature=0 to generate the most frequent words, instead of more "poetically free" behavior.
new_query = """
What happened to the First Republic Bank? Will the FED take the same action as it did on SVB's failure?
"""

# Enable verbose logging and use a fixed embedding size.
verbose = True
embedding_size = 1536

# Use a FAISS index as the vector store
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(llm_embeddings.embed_query, index, InMemoryDocstore({}), {})

# Choose 1 iteration for a demo and 1<N<10 for real use. If None, it will loop indefinitely.
max_iterations: Optional[int] = 2

# Call the BabyAGI class for task automation, using the SageMaker-hosted Jurassic-2 LLM defined earlier.
baby_agi = BabyAGI.from_llm(
    llm=llm_ai21, vectorstore=vectorstore, verbose=verbose, max_iterations=max_iterations
)

response = baby_agi({"objective": new_query})
Let’s examine the tasks gathered and their outcomes from a single iteration, used for demonstration purposes, to accomplish the objective in response to the user’s inquiry. BabyAGI operates through a continuous cycle of the following steps:

A task creation agent formulates a new task.
The new task is incorporated into the task queue.
The task prioritization agent establishes the sequence in which tasks should be tackled.
The execution agent accomplishes the task.
The task outcome is saved in the memory.
The cycle repeats.

This iterative process enables BabyAGI to learn and develop over time. As it successfully completes an increasing number of tasks, BabyAGI’s knowledge and comprehension of the world expand. Consequently, it can generate and efficiently run more intricate and advanced tasks. By augmenting the iteration count, you can enhance the outcome through the reorganization of the task list, which focuses on the goal and the outcome of preceding tasks, ultimately yielding more pertinent and precise responses.
See the following example output:
*****TASK LIST*****
1: Make a todo list
*****NEXT TASK*****
1: Make a todo list
> Entering new AgentExecutor chain…
Thought: I need to research the First Republic Bank and the FED’s action
Action: Search
Action Input: What happened to the First Republic Bank? Will the FED take the same
action as it did on SVB’s failure?
Observation: First Republic’s failure was due to a run on deposits following the
collapse of Silicon Valley Bank and Signature Bank. JPMorgan Chase acquired
First Republic Bank on May 1, 2023. If you had insured or uninsured money in First Republic accounts, your funds are safe and now managed by JPMorgan Chase.
Thought: I need to research the FED’s action on SVB’s failure
Action: Search
Action Input: What action did the FED take on SVB’s failure?
Observation: The Hutchins Center explains how the Federal Reserve has responded to the
March 2023 failures of Silicon Valley Bank and Signature Bank.
Thought: I now know the final answer
Final Answer: The FED responded to the March 2023 failures of Silicon Valley Bank and Signature Bank by providing liquidity to the banking system. JPMorgan
Chase acquired First Republic Bank on May 1, 2023, and if you had insured
or uninsured money in First Republic accounts, your funds are safe and
now managed by JPMorgan Chase.
> Finished chain.
*****TASK RESULT*****
The Federal Reserve responded to the March 2023 failures of Silicon Valley Bank and Signature Bank by providing liquidity to the banking system. It is unclear what action the FED will take in response to the failure of First Republic Bank.

***TASK LIST***

2: Research the timeline of First Republic Bank’s failure.
3: Analyze the Federal Reserve’s response to the failure of Silicon Valley Bank and Signature Bank.
4: Compare the Federal Reserve’s response to the failure of Silicon Valley Bank and Signature Bank to the Federal Reserve’s response to the failure of First Republic Bank.
5: Investigate the potential implications of the Federal Reserve’s response to the failure of First Republic Bank.
6: Identify any potential risks associated with the Federal Reserve’s response to the failure of First Republic Bank.

*****NEXT TASK*****

2: Research the timeline of First Republic Bank’s failure.

> Entering new AgentExecutor chain…
Will the FED take the same action as it did on SVB’s failure?
Thought: I should search for information about the timeline of First Republic Bank’s failure and the FED’s action on SVB’s failure.
Action: Search
Action Input: Timeline of First Republic Bank’s failure and FED’s action on SVB’s failure
Observation: March 20: The FDIC decides to break up SVB and hold two separate auctions for its traditional deposits unit and its private bank after failing …
Thought: I should look for more information about the FED’s action on SVB’s failure.
Action: Search
Action Input: FED’s action on SVB’s failure
Observation: The Fed blamed failures on mismanagement and supervisory missteps, compounded by a dose of social media frenzy.
Thought: I now know the final answer.
Final Answer: The FED is likely to take similar action on First Republic Bank’s failure as it did on SVB’s failure, which was to break up the bank and hold two separate auctions for its traditional deposits unit and its private bank.

> Finished chain.

*****TASK RESULT*****
The FED responded to the March 2023 failures of Silicon Valley Bank and Signature Bank
by providing liquidity to the banking system. JPMorgan Chase acquired First Republic
Bank on May 1, 2023, and if you had insured or uninsured money in First Republic
accounts, your funds are safe and now managed by JPMorgan Chase.

*****TASK ENDING*****
With BabyAGI for task automation, the dialogue-guided IDP system demonstrated its effectiveness by going beyond the original document’s context to address the user’s query about the Federal Reserve’s potential actions concerning First Republic Bank’s failure, an event that occurred in late April 2023, one month after the sample document was published, and to compare it with SVB’s failure. To achieve this, the system generated a to-do list and completed the tasks sequentially. It investigated the circumstances surrounding First Republic Bank’s failure, pinpointed potential risks tied to the Federal Reserve’s response, and compared it to the response to SVB’s failure.
Although BabyAGI remains a work in progress, it carries the promise of revolutionizing machine interactions, inventive thinking, and problem resolution. As BabyAGI’s learning and enhancement persist, it will be capable of producing more precise, insightful, and inventive responses. By empowering machines to learn and evolve autonomously, BabyAGI could facilitate their assistance in a broad spectrum of tasks, ranging from mundane chores to intricate problem-solving.
Constraints and limitations
Dialogue-guided IDP offers a promising approach to enhancing the efficiency and effectiveness of document analysis and extraction. However, we must acknowledge its current constraints and limitations, such as the need for data bias avoidance, hallucination mitigation, the challenge of handling complex and ambiguous language, and difficulties in understanding context or maintaining coherence in longer conversations.
Additionally, it’s important to consider confabulations and hallucinations in AI-generated responses, which may lead to the creation of inaccurate or fabricated information. To address these challenges, ongoing developments are focusing on refining LLMs with better natural language understanding capabilities, incorporating domain-specific knowledge and developing more robust context-aware models. Building an LLM from scratch can be costly and time-consuming; however, you can employ several strategies to improve existing models:

Fine-tuning a pre-trained LLM on specific domains for more accurate and relevant outputs
Integrating external data sources known to be safe during inference for enhanced contextual understanding
Designing better prompts to elicit more precise responses from the model
Using ensemble models to combine outputs from multiple LLMs, averaging out errors and minimizing hallucination chances
Building guardrails to prevent models from veering off into undesired areas while ensuring apps respond with accurate and appropriate information
Conducting supervised fine-tuning with human feedback, iteratively refining the model for increased accuracy and reduced hallucination

By adopting these approaches, AI-generated responses can be made more reliable and valuable.
The task-driven autonomous agent offers significant potential across various applications, but it is vital to consider key risks before adopting the technology. These risks include:

Data privacy and security breaches due to reliance on the selected LLM provider and vectorDB
Ethical concerns arising from biased or harmful content generation
Dependence on model accuracy, which may lead to ineffective task completion or undesired results
System overload and scalability issues if task generation outpaces completion, requiring proper task sequencing and parallel management
Misinterpretation of task prioritization based on the LLM’s understanding of task importance
The authenticity of the data it received from the web

Addressing these risks is crucial for responsible and successful application, allowing us to maximize the benefits of AI-powered language models while minimizing potential risks.
Conclusions
The dialogue-guided solution for IDP presents a groundbreaking approach to document processing by integrating OCR, automatic speech recognition, LLMs, task automation, and external data sources. This comprehensive solution enables businesses to streamline their document processing workflows, making them more efficient and intuitive. By incorporating these cutting-edge technologies, organizations can not only revolutionize their document management processes, but also bolster decision-making capabilities and considerably boost overall productivity. The solution offers a transformative and innovative means for businesses to unlock the full potential of their document workflows, ultimately driving growth and success in the era of generative AI. Refer to SageMaker Jumpstart for other solutions and Amazon Bedrock for additional generative AI models.
The authors would like to sincerely express their appreciation to Ryan Kilpatrick, Ashish Lal, and Kristine Pearce for their valuable inputs and contributions to this work. They also acknowledge Clay Elmore for the code sample provided on GitHub.

About the authors
Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.
Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.
Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.
Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, mentoring college students for entrepreneurship, and spending time with friends and families.

Automate document validation and fraud detection in the mortgage under …

In this three-part series, we present a solution that demonstrates how you can automate detecting document tampering and fraud at scale using AWS AI and machine learning (ML) services for a mortgage underwriting use case.
This solution responds to a significant global rise in mortgage fraud, which is worsening as more applicants submit fraudulent documents to qualify for loans. Data suggests high-risk and suspected fraudulent mortgage activity is on the rise, with a 52% increase in suspected fraudulent mortgage applications since 2013. (Source: Equifax)
Part 1 of this series discusses the most common challenges associated with the manual lending process. We provide concrete guidance on addressing this issue with AWS AI and ML services to detect document tampering, identify and categorize patterns for fraudulent scenarios, and integrate with business-defined rules while minimizing human expertise for fraud detection.
In Part 2, we demonstrate how to train and host a computer vision model for tampering detection and localization on Amazon SageMaker. In Part 3, we show how to automate detecting fraud in mortgage documents with an ML model and business-defined rules using Amazon Fraud Detector.
Challenges associated with the manual lending process
Organizations in the lending and mortgage industry receive thousands of applications, ranging from new mortgage applications to refinancing an existing mortgage. These documents are increasingly susceptible to document fraud as fraudsters attempt to exploit the system and qualify for mortgages in several illegal ways. To be eligible for a mortgage, the applicant must provide the lender with documents verifying their employment, assets, and debts. Changing borrowing rules and interest rates can drastically alter an applicant’s credit affordability. Fraudsters range from blundering novices to near-perfect masters when creating fraudulent loan application documents. Fraudulent paperwork includes but is not limited to altering or falsifying paystubs, inflating information about income, misrepresenting job status, and forging letters of employment and other key mortgage underwriting documents. These fraud attempts can be challenging for mortgage lenders to capture.
The significant challenges associated with the manual lending process include, but are not limited to, the following:

The necessity for a borrower to visit the branch
Operational overhead
Data entry errors
Automation and time to resolution

Finally, the underwriting process, or the analysis of creditworthiness and the loan decision, takes additional time if done manually. That said, the manual consumer lending process has some advantages, such as approving loans that require human judgment. The solution provides automation and risk mitigation in mortgage underwriting, which helps reduce time and cost compared to the manual process.
Solution overview
Document validation is a critical type of input for mortgage fraud decisions. Understanding the risk profile of the supporting mortgage documents and driving insights from this data can significantly improve risk decisions and is central to any underwriter’s fraud management strategy.
The following diagram represents each stage in a mortgage document fraud detection pipeline. We walk through each of these stages and how they improve underwriting accuracy: capturing documents to classify and extract the required content, detecting tampered documents, and finally using an ML model to detect potential fraud classified according to business-defined rules.

In the following sections, we discuss the stages of the process in detail.
Document classification
With intelligent document processing (IDP), we can automatically process financial documents using AWS AI services such as Amazon Textract and Amazon Comprehend.
Additionally, we can use the Amazon Textract Analyze Lending API in processing mortgage documents. Analyze Lending uses pre-trained ML models to automatically extract, classify, and validate information in mortgage-related documents with high speed and accuracy while reducing human error. As depicted in the following figure, Analyze Lending receives a loan document and then splits it into pages, classifying them according to the type of document. The document pages are then automatically routed to Amazon Textract text processing operations for accurate data extraction and analysis.

The Analyze Lending API offers the following benefits:

Automated end-to-end processing of mortgage packages
Pre-trained ML models across a variety of document types in a mortgage application package
Ability to scale on demand and reduce reliance on human reviewers
Improved decision-making and significantly lower operating costs
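
As a rough illustration of the asynchronous Analyze Lending flow described above, the following boto3 sketch starts a lending analysis job and polls for the classified pages; the bucket and document name are placeholders, and production code would rely on an SNS completion notification rather than polling:

import time
import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Start an asynchronous lending analysis job on an uploaded mortgage package.
response = textract.start_lending_analysis(
    DocumentLocation={"S3Object": {"Bucket": "<your-bucket>", "Name": "mortgage-package.pdf"}}
)
job_id = response["JobId"]

# Poll until the job finishes.
while True:
    result = textract.get_lending_analysis(JobId=job_id)
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(5)

# Each result carries the page number, its predicted document type, and the extracted fields.
for page_result in result.get("Results", []):
    print(page_result["Page"], page_result["PageClassification"]["PageType"])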

Tampering detection
We use a computer vision model deployed on SageMaker for our end-to-end image forgery detection and localization solution, which means it takes a testing image as input and predicts pixel-level forgery likelihood as output.
Most research studies focus on four image forgery techniques: splicing, copy-move, removal, and enhancement. Both splicing and copy-move involve adding image content to the target (forged) image; in splicing, the added content comes from a different image, whereas in copy-move it comes from the target image itself. Removal, or inpainting, removes a selected image region (for example, hiding an object) and fills the space with new pixel values estimated from the background. Finally, image enhancement is a broad collection of local manipulations, such as sharpening and brightness adjustment.
Depending on the characteristics of the forgery, different clues can be used as the foundation for detection and localization. These clues include JPEG compression artifacts, edge inconsistencies, noise patterns, color consistency, visual similarity, EXIF consistency, and camera model. However, real-life forgeries are more complex and often use a sequence of manipulations to hide the forgery. Most existing methods focus on image-level detection, whether or not an image is forged, and not on localizing or highlighting a forged area of the document image to aid the underwriter in making informed decisions.
We walk through the implementation details of training and hosting a computer vision model for tampering detection and localization on SageMaker in Part 2 of this series. The conceptual CNN-based architecture of the model is depicted in the following diagram. The model extracts image manipulation trace features for a testing image and identifies anomalous regions by assessing how different a local feature is from its reference features. It detects forged pixels by identifying local anomalous features as a predicted mask of the testing image.

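The following toy sketch is not the model from Part 2; it only illustrates the anomaly idea, scoring each pixel's feature vector by how far it deviates from the image's own reference statistics and thresholding the scores into a forgery mask:

import numpy as np

def anomaly_mask(features: np.ndarray, threshold: float = 2.5) -> np.ndarray:
    # features has shape (H, W, C): one manipulation-trace feature vector per pixel.
    flat = features.reshape(-1, features.shape[-1])
    mean = flat.mean(axis=0)
    std = flat.std(axis=0) + 1e-8
    # Distance of each local feature from the image-level reference statistics.
    scores = np.linalg.norm((flat - mean) / std, axis=1) / np.sqrt(features.shape[-1])
    return (scores.reshape(features.shape[:2]) > threshold).astype(np.uint8)  # 1 = predicted forged pixel

# Random features stand in for the real manipulation traces extracted by the CNN.
mask = anomaly_mask(np.random.rand(256, 256, 64))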
Fraud detection
We use Amazon Fraud Detector, a fully managed AI service, to automate the generation, evaluation, and detection of fraudulent activities. This is achieved by scoring the data extracted from the mortgage documents against ML fraud models trained with the customer’s historical (fraud) data to generate fraud predictions. You can use the prediction to trigger business rules in relation to underwriting decisions.

Defining the fraud prediction logic involves the following components:

Event types – Define the structure of the event
Models – Define the algorithm and data requirements for predicting fraud
Variables – Represent a data element associated with the fraud detection event
Rules – Tell Amazon Fraud Detector how to interpret the variable values during fraud prediction
Outcomes – The results generated from a fraud prediction
Detector version – Contains fraud prediction logic for the fraud detection event

The following diagram illustrates the architecture of this component.

After you deploy your model, you may evaluate its performance scores and metrics based on the prediction explanations. This helps identify top risk indicators and analyze fraud patterns across the data.
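To illustrate how the extracted document values feed a prediction, the following boto3 sketch calls the GetEventPrediction API; the detector name, event type, entities, and variables are hypothetical and must match what you configured in Amazon Fraud Detector:

import datetime
import boto3

frauddetector = boto3.client("frauddetector", region_name="us-east-1")

# Score one mortgage application event using values extracted from its documents.
prediction = frauddetector.get_event_prediction(
    detectorId="mortgage_fraud_detector",
    eventId="application-12345",
    eventTypeName="mortgage_application",
    entities=[{"entityType": "applicant", "entityId": "applicant-001"}],
    eventTimestamp=datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
    eventVariables={
        "stated_income": "85000",
        "employer_name": "Acme Corp",
        "document_tampering_score": "0.12",
    },
)
print(prediction["modelScores"])
print(prediction["ruleResults"])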
Third-party validation
We integrate the solution with third-party providers (via API) to validate the extracted information from the documents, such as personal and employment information. This is particularly useful to cross-validate details in addition to document tampering detection and fraud detection based on the historical pattern of applications.
The following architecture diagram illustrates a batch-oriented fraud detection pipeline in mortgage application processing using various AWS services.

The workflow includes the following steps:

The user uploads the scanned documents into Amazon Simple Storage Service (Amazon S3).
The upload triggers an AWS Lambda function (Invoke Document Analysis) that calls the Amazon Textract API for text extraction; an illustrative sketch of such a handler follows this list. Additionally, we can use the Amazon Textract Analyze Lending API to automatically extract, classify, and validate information.
On completion of text extraction, a notification is sent via Amazon Simple Notification Service (Amazon SNS).
The notification triggers a Lambda function (Get Document Analysis), which invokes Amazon Comprehend for custom document classification.
Document analysis results that have a low confidence score are routed to human reviewers using Amazon Augmented AI (Amazon A2I).
Output from Amazon Textract and Amazon Comprehend is aggregated using a Lambda function (Analyze & Classify Document).
A SageMaker inference endpoint is called for a fraud prediction mask of the input documents.
Amazon Fraud Detector is called for a fraud prediction score using the data extracted from the mortgage documents.
The results from Amazon Fraud Detector and the SageMaker inference endpoint are aggregated into the loan origination application.
The status of the document processing job is tracked in Amazon DynamoDB.

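The following is an illustrative sketch of the Invoke Document Analysis Lambda function from step 2; the SNS topic and role ARNs are placeholders, and error handling is omitted for brevity:

import json
import urllib.parse
import boto3

textract = boto3.client("textract")

def lambda_handler(event, context):
    # Triggered by the S3 upload; start asynchronous text extraction and
    # have Textract notify an SNS topic when the job completes.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={
            "SNSTopicArn": "<your_sns_topic_arn>",
            "RoleArn": "<your_textract_publish_role_arn>",
        },
    )
    return {"statusCode": 200, "body": json.dumps({"JobId": response["JobId"]})}
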
Conclusion
This post walked through an automated solution to detect document tampering and fraud in the mortgage underwriting process using Amazon Fraud Detector and other Amazon AI and ML services. This solution allows you to detect fraudulent attempts closer to the time of fraud occurrence and helps underwriters with an effective decision-making process. The flexibility of the implementation allows you to define business-driven rules to classify and capture the fraudulent attempts customized to specific business needs.
In Part 2 of this series, we provide the implementation details for detecting document tampering using SageMaker. In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.

About the authors
Anup Ravindranath is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada working with Financial Services organizations. He helps customers to transform their businesses and innovate on cloud.
Vinnie Saini is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. She has been helping Financial Services customers transform on cloud, with AI and ML driven solutions laid on strong foundational pillars of Architectural Excellence.