Hunyuan-DiT: A Text-to-Image Diffusion Transformer with Fine-Grained Understanding of Both English and Chinese

Hunyuan-DiT is a recently developed text-to-image diffusion transformer designed to understand both English and Chinese text prompts at a fine-grained level. Its creation involved several key components and procedures aimed at high-quality image generation and fine-grained language comprehension.

The primary components of Hunyuan-DiT are as follows.

Transformer Structure: Hunyuan-DiT’s transformer architecture is designed to maximize the model’s ability to produce images from textual descriptions. This includes improving its handling of intricate linguistic inputs and making sure it can capture precise visual detail.

Bilingual and Multilingual Encoding: Hunyuan-DiT’s ability to correctly interpret prompts depends largely on its text encoders. The model combines the strengths of two encoders, a bilingual CLIP that handles both English and Chinese and a multilingual T5 encoder, to improve understanding and context handling.

Enhanced Positional Encoding: Hunyuan-DiT’s positional encoding has been adjusted to handle the sequential nature of text and the spatial characteristics of images more effectively. This helps the model correctly map tokens to the corresponding image attributes while preserving token order.

To support and enhance Hunyuan-DiT’s capabilities, the team developed an extensive data pipeline consisting of the following components.

Data Curation and Collection: Assembling a sizable and varied dataset of text-image pairings.

Data Augmentation and Filtering: Adding more examples to the dataset and removing redundant or low-quality data.

Iterative Model Optimization: Continuously updating and enhancing the model’s performance based on fresh data and user feedback by employing the ‘data convoy’ technique.

To improve the model’s language understanding precision, the team trained a multimodal large language model (MLLM) to refine the captions that accompany the training images. By utilizing contextual knowledge, this model produces accurate and detailed captions, which in turn improves the quality of the generated images.

Hunyuan-DiT supports multi-turn dialogues that enable interactive image generation. Over multiple rounds of interaction, users can offer feedback and refine the generated images, producing more accurate and satisfying results.

To evaluate Hunyuan-DiT, the team created a rigorous evaluation protocol involving more than 50 qualified evaluators. The protocol measures subject clarity, visual quality, absence of AI artifacts, text-image consistency, and other aspects of the generated images. The evaluations showed that, compared with other open-source models, Hunyuan-DiT delivers state-of-the-art performance in Chinese-to-image generation, producing crisp, semantically correct images in response to Chinese prompts.

In conclusion, Hunyuan-DiT is a major step forward in text-to-image generation, especially for Chinese prompts. It delivers strong performance in producing detailed and contextually accurate images through the careful design of its transformer architecture, text encoders, and positional encoding, as well as a reliable data pipeline. Its support for interactive, multi-turn dialogue further increases its usefulness, making it an effective tool for a range of applications.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post Hunyuan-DiT: A Text-to-Image Diffusion Transformer with Fine-Grained Understanding of Both English and Chinese appeared first on MarkTechPost.

Accelerate Mixtral 8x7B pre-training with expert parallelism on Amazon SageMaker

Mixture of Experts (MoE) architectures for large language models (LLMs) have recently gained popularity due to their ability to increase model capacity and computational efficiency compared to fully dense models. By utilizing sparse expert subnetworks that process different subsets of tokens, MoE models can effectively increase the number of parameters while requiring less computation per token during training and inference. This enables more cost-effective training of larger models within fixed compute budgets compared to dense architectures.
Despite their computational benefits, training and fine-tuning large MoE models efficiently presents some challenges. MoE models can struggle with load balancing if the tokens aren’t evenly distributed across experts during training, and some experts may become overloaded while others are under-utilized. MoE models have high memory requirements, because all expert parameters need to be loaded into memory even though only a subset is used for each input.
In this post, we highlight new features of the Amazon SageMaker model parallelism library (SMP) that enable efficient training of MoE models using expert parallelism. Expert parallelism is a type of parallelism that handles splitting experts of an MoE model across separate workers or devices, similar to how tensor parallelism can partition dense model layers. We demonstrate how to use these new features of SMP by pre-training the 47 billion parameter Mixtral 8x7B MoE model using expert parallelism. To learn more, refer to our GitHub repo and Expert parallelism.
Expert parallelism
The Mixtral 8x7B model has a sparse MoE architecture, containing eight expert subnetworks with around 7 billion parameters each. A trainable gate network called a router determines which input tokens are sent to which expert. With this architecture, the experts specialize in processing different aspects of the input data. The complete Mixtral 8x7B model has a total of 47 billion parameters, but only around 12.9 billion (two experts, for this model architecture) are activated for any given input token; this results in improved computational efficiency relative to a dense model of the same total size. To learn more about the MoE architecture in general, refer to Applying Mixture of Experts in LLM Architectures.
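To make the routing concrete, the following is a minimal, self-contained sketch of top-2 expert routing in PyTorch. It is illustrative only (single device, toy dimensions, no load-balancing loss) and is not the Mixtral or SMP implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)  # the trainable gate network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        logits = self.router(x)  # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the two selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):  # only the selected experts run for each token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

moe = ToyMoELayer()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])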
SMP adds support for expert parallelism
SMP now supports expert parallelism, which is essential to performant MoE model training. With expert parallelism, different expert subnetworks that comprise the MoE layers are placed on separate devices. During training, different data is routed to the different devices, with each device handling the computation for the experts it contains. By distributing experts across workers, expert parallelism addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster. The following figure offers a simplified look at how expert parallelism works on a multi-GPU cluster.

The SMP library uses NVIDIA Megatron to implement expert parallelism and support training MoE models, and runs on top of PyTorch Fully Sharded Data Parallel (FSDP) APIs. You can keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models. SMP offers a simplified workflow where you need to specify the expert_parallel_degree parameter, which will evenly divide experts across the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, you can set the expert_parallel_degree to 2, 4, or 8. We recommend that you start with a small number and gradually increase it until the model fits in the GPU memory.
SMP’s expert parallelism is compatible with sharded data parallelism
SMP’s expert parallel implementation is compatible with sharded data parallelism, enabling more memory-efficient and faster training. To understand how this works, consider an MoE model in the following example with eight experts (N=8) training on a simple cluster with one node containing 4 GPUs.
SMP’s expert parallelism splits the MoE experts across GPUs. You control how many experts are instantiated on each device by using the expert_parallel_degree parameter. For example, if you set the degree to 2, SMP will assign half of the eight experts to each data parallel group. The degree value must be a factor of the number of GPUs in your cluster and the number of experts in your model. Data is dynamically routed to and from the GPU or GPUs hosting the selected expert using all-to-all GPU communication.
Next, sharded data parallelism partitions and distributes the experts as well as the non-MoE layers of the model, like attention or routers, across your cluster to reduce the memory footprint of the model. The hybrid_shard_degree parameter controls this. For example, a hybrid_shard_degree of 2 will shard the model states (including experts and non-MoE layers) across half of the GPUs in our cluster. The product of expert_parallel_degree and hybrid_shard_degree should not exceed the world size of the cluster. In the following example, hybrid_shard_degree * expert_parallel_degree = 4 is a valid configuration.
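As a quick sanity check of these constraints, the following snippet tests whether a combination of degrees is valid for the 4-GPU example above. The variable names mirror the SMP parameters, but the check itself is plain arithmetic.

world_size = 4   # GPUs in the example cluster
num_experts = 8  # N=8 experts in the example MoE model

def is_valid(expert_parallel_degree, hybrid_shard_degree):
    return (
        world_size % expert_parallel_degree == 0
        and num_experts % expert_parallel_degree == 0
        and expert_parallel_degree * hybrid_shard_degree <= world_size
    )

print(is_valid(2, 2))  # True: the configuration used in this example
print(is_valid(2, 4))  # False: the product 8 exceeds the world size of 4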

Solution overview
With the background out of the way, let’s dig into the components of our distributed training architecture. The following diagram illustrates the solution architecture.

In this example, we use SageMaker training jobs. With SageMaker training jobs, you can launch and manage clusters of high-performance instances with simple API calls. For example, you can use the SageMaker Estimator to specify the type and quantity of instances to use in your distributed systems with just a few lines of code. Later in this post, we use a cluster of two ml.p4d.24xlarge instances to train our model by specifying these parameters in our Estimator. To learn about SageMaker training jobs, see Train a Model with Amazon SageMaker.
In this post, we use the SMP library to efficiently distribute the workload across the cluster using hybrid sharded data parallelism and expert parallelism. In addition to these implementations, SMP offers many other performance-improving and memory-saving techniques, such as:

Mixed precision training and fp8 support for dense Llama models (which accelerates distributed training and takes advantage of the performance improvements on P5 instances)
Tensor parallelism composable with sharded data parallelism
Delayed parameter initialization
Activation checkpointing (a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass)

For the latest updates, refer to SageMaker model parallelism library v2.
Along with SMP, this example also uses the SageMaker distributed data parallel library (SMDDP). As you scale your workload and add instances to your cluster, the overhead of communication between instances also increases, which can lead to a drop in overall computational performance and training efficiency. This is where SMDDP helps. SMDDP includes optimized communication collectives such as AllGather that are designed for AWS network infrastructure. Because of this, SMDDP can outperform other more general communications libraries such as NCCL when training on SageMaker.
Together, the SMP and SMDDP libraries can accelerate large distributed training workloads by up to 20%. Additionally, these libraries are compatible with standard PyTorch APIs and capabilities, which makes it convenient to adapt any existing PyTorch FSDP training script to the SageMaker training platform and take advantage of the performance improvements that SMP and SMDDP provide. To learn more, see SageMaker model parallelism library v2 and Run distributed training with the SageMaker distributed data parallelism library.
In the following sections, we showcase how you can accelerate distributed training of the Hugging Face Transformers Mixtral 8x7B model on P4 instances using SMP and SMDDP.
Prerequisites
You need to complete some prerequisites before you can run the Mixtral notebook.
First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer to be used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 2 P4d instances and up to a maximum of 8 P4d instances (depending on the time-to-train and cost-to-train trade-offs for your use case).
On the Service Quotas console, request the following SageMaker quotas:

P4 instances (ml.p4d.24xlarge) for training job usage: 2–8

It may take up to 24 hours for the quota increase to get approved.
Now that you’re ready to begin the process to pre-train the Mixtral model, we start with dataset preparation in the next step.
Prepare the dataset
We begin the tutorial by preparing the dataset. This covers loading the GLUE/SST2 dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:

You first need to load the GLUE/SST2 dataset and split it into training and validation datasets:

from datasets import load_dataset

hyperparameters = {
    "cache_dir": "tmp",
    "dataset_config_name": "sst2",
    "dataset_name": "glue",
    "do_train": True,
    "do_eval": True,
}

raw_datasets = load_dataset(
    hyperparameters["dataset_name"],
    hyperparameters["dataset_config_name"],
)

# Drop the original validation split so a fresh one is carved out of the training data below.
del raw_datasets["validation"]

if "validation" not in raw_datasets.keys():
    validation_percentage = "10%"

    raw_datasets["validation"] = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
        split=f"train[:{validation_percentage}]",
        cache_dir=hyperparameters["cache_dir"],
    )

    raw_datasets["train"] = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
        split=f"train[{validation_percentage}:]",
        cache_dir=hyperparameters["cache_dir"],
    )

Load the Mixtral-8x7B tokenizer from the Hugging Face Transformers library:

from transformers import AutoTokenizer

# tokenizer_kwargs is defined earlier in the notebook
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1", **tokenizer_kwargs)

Next, you define two utility functions: tokenize_function() and group_texts(). The tokenize_function() runs the tokenizer on the text data. The group_texts() function concatenates all texts from the dataset and generates chunks of a block size corresponding to the model’s input length (2048 in this example). By chunking the text data into smaller pieces, you make sure the model can process the entire dataset during training, even if some text examples are longer than the input length.

Define the functions with the following code:

def tokenize_function(examples):
    # text_column_name and block_size are defined earlier in the notebook
    output = tokenizer(examples[text_column_name])
    return output


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Call the preceding utility functions on your dataset to tokenize and generate chunks suitable for the model:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=1, remove_columns=column_names)
lm_datasets = tokenized_datasets.map(group_texts, batched=True)

Prepare the training and validation datasets for SageMaker training by saving them as JSON files and constructing the S3 paths where these files will be uploaded:

train_dataset = lm_datasets["train"]
train_dataset.to_json("./training.json")
training_dataset_location = f"s3://{default_bucket}/dataset/train/"

eval_dataset = lm_datasets["validation"]
eval_dataset.to_json("./validation.json")
validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"

Finally, set up the data channels for SageMaker training by creating TrainingInput objects from the provided S3 bucket paths for the training and test/validation datasets:

train = sagemaker.inputs.TrainingInput(
    s3_train_bucket,
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train}

test = sagemaker.inputs.TrainingInput(
    s3_test_bucket,
    distribution="FullyReplicated",
    s3_data_type="S3Prefix",
)
data_channels["test"] = test

You’re now ready to run pre-training or fine-tuning on the dataset.
Pre-train Mixtral 8x7B with expert parallelism on SMP
To pre-train the Mixtral 8x7B model, complete the following steps:

Initialize the script with torch.sagemaker.init() to activate the SMP library:

import torch.sagemaker as tsm
tsm.init()

Import the MoEConfig class from the torch.sagemaker.transform API. We use the MoEConfig class to enable the model to use the SMP implementation of MoE:

from torch.sagemaker.moe.moe_config import MoEConfig

Create a model configuration for the Mixtral 8x7B model. This will be passed to AutoModelForCausalLM.from_config(model_config, attn_implementation="flash_attention_2") from the Hugging Face Transformers library to initialize the model with random weights. If you want to fine-tune, you can provide the path to the pre-trained weights instead of the model configuration.

from transformers import AutoModelForCausalLM, MixtralConfig

model_config = MixtralConfig(
    vocab_size=args.vocab_size,  # 32000,
    hidden_size=args.hidden_width,  # 4096,
    intermediate_size=args.intermediate_size,  # 14336,
    num_hidden_layers=args.num_layers,  # 32,
    num_attention_heads=args.num_heads,  # 32,
    num_key_value_heads=args.num_key_value_heads,  # 8,
    hidden_act="silu",
    max_position_embeddings=args.max_context_width,  # 4096 * 32,
    initializer_range=args.initializer_range,  # 0.02,
    rms_norm_eps=1e-5,
    use_cache=False,
    pad_token_id=None,
    bos_token_id=1,
    eos_token_id=2,
    tie_word_embeddings=False,
    rope_theta=1e6,
    sliding_window=args.sliding_window,  # None,
    attention_dropout=0.0,
    num_experts_per_tok=args.num_experts_per_tok,  # 2,
    num_local_experts=args.num_local_experts,  # 8,
    output_router_logits=False,
    router_aux_loss_coef=0.001,
)

# dtype is defined earlier in the training script
model = AutoModelForCausalLM.from_config(model_config, dtype=dtype, attn_implementation="flash_attention_2")

In the example Jupyter Notebook, you use a create_model() function that invokes the AutoModelForCausalLM.from_config() function.

Create the SMP MoE configuration class. The values used in the following code come from script arguments, which you set through the training estimator’s hyperparameters in subsequent steps. To learn more about the SMP MoEConfig class, see torch.sagemaker.moe.moe_config.MoEConfig.

moe_config = MoEConfig(
    smp_moe=args.use_smp_implementation > 0,  # Whether to use the SMP implementation of MoE. The default value is True.
    random_seed=args.seed,  # A seed number for the random operations in expert-parallel distributed modules. This seed is added to the expert parallel rank to set the actual seed for each rank, so it is unique per expert parallel rank. The default value is 12345.
    moe_load_balancing=args.moe_load_balancing,  # The load balancing type of the MoE router. Valid options are aux_loss, sinkhorn, balanced, and none. The default value is sinkhorn.
    global_token_shuffle=args.global_token_shuffle > 0,  # Whether to shuffle tokens across EP ranks within the same expert parallel group. The default value is False.
    moe_all_to_all_dispatcher=args.moe_all_to_all_dispatcher > 0,  # Whether to use the all-to-all dispatcher for MoE communications. The default value is True.
)

With the model and MoE configuration ready, you wrap the model with the SMP transform API and pass the MoE configuration. Here, the tsm.transform method adapts the model from Hugging Face format to SMP format. For more information, refer to torch.sagemaker.transform.

model = tsm.transform(
    model,
    config=moe_config,
)

Define the training hyperparameters, including the MoE configuration and other settings specific to the model and training setup:

hyperparameters = {
    # MoE config
    "moe": 1,
    "moe_load_balancing": "sinkhorn",
    "moe_all_to_all_dispatcher": 1,
    "seed": 12345,
    # Rest of the hyperparameters
    "model_type": "mixtral",
    "sharding_strategy": "hybrid_shard",
    "delayed_param": 1,
    "epochs": 100,
    "activation_checkpointing": 1,
    "beta1": 0.9,
    "bf16": 1,
    "fp8": 0,
    "checkpoint_dir": "/opt/ml/checkpoints",
}

We enable delayed parameter initialization in SMP, which allows initializing large models on a meta device without attaching data. This can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization.
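For intuition, here is a hedged sketch of meta-device initialization using plain PyTorch and Hugging Face Transformers. SMP’s delayed_param feature handles this for you; the snippet only illustrates the underlying idea of building the module graph without allocating real weight storage.

import torch
from transformers import AutoModelForCausalLM, MixtralConfig

config = MixtralConfig(num_hidden_layers=2)  # a deliberately tiny config, just for illustration
with torch.device("meta"):
    # Parameters are created on the meta device: shapes and dtypes only, no memory is allocated.
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta
# Real weights are materialized later, shard by shard, when FSDP/SMP wraps the model.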
SMP supports various routing strategies, including sinkhorn, balanced, and aux_loss. Each provides distinct load balancing approaches to achieve equitable token assignment among experts, thereby maintaining balanced workload distribution.
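As a concrete example of the aux_loss option, a widely used auxiliary load-balancing loss (in the style of Switch Transformers; SMP’s exact formulation may differ) penalizes uneven routing:

\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i

where N is the number of experts, f_i is the fraction of tokens dispatched to expert i, P_i is the mean router probability assigned to expert i over the batch, and \alpha is a small coefficient (compare the router_aux_loss_coef=0.001 entry in the Mixtral configuration earlier).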

Specify the parameters for expert_parallel_degree and hybrid_shard_degree:

expert_parallel_degree = 2  # An integer in [1, world_size]
hybrid_shard_degree = 8  # An integer in [0, world_size // expert_parallel_degree]; the default value is 0.

Hybrid sharding is a memory saving technique between `FULL_SHARD` and `NO_SHARD`, with `FULL_SHARD` saving the most memory and `NO_SHARD` not saving any. This technique shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to `world_size`.
An HSD of 8 applies `FULL_SHARD` within a node and then replicates parameters across nodes because there are 8 GPUs in the nodes we are using. This results in reduced communication volume because expensive all-gathers and reduce-scatters are only done within a node, which can be more performant for medium-sized models. Generally, you want to use the smallest HSD that doesn’t cause out of memory (OOM) errors. If you’re experiencing OOM, try increasing the hybrid shard degree to reduce memory usage on each node.
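For readers familiar with PyTorch FSDP, the following hedged sketch shows what the hybrid strategy corresponds to in raw FSDP terms: shard within a group, replicate across groups. It assumes torch.distributed is already initialized with 16 ranks (2 nodes x 8 GPUs) and is not SMP code; SMP configures this for you through hybrid_shard_degree.

import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# 2 replica groups x 8 shard ranks: FULL_SHARD within each node, replicate across nodes.
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

model = nn.Transformer()  # stand-in for the real model
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    device_mesh=mesh,
)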

With all the necessary configurations in place, you now create the PyTorch estimator to encapsulate the training setup and launch the training job. We run the pre-training on 2 ml.p4d.24xlarge instances, each containing 8 NVIDIA A100 GPUs:

from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
    entry_point="train.py",
    hyperparameters=hyperparameters,
    role=role,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path=hyperparameters["checkpoint_dir"],
    instance_type="ml.p4d.24xlarge",
    volume_size=400,
    instance_count=2,
    sagemaker_session=sagemaker_session,
    distribution={
        "torch_distributed": {
            "enabled": True,
        },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "activation_loading_horizon": activation_loading_horizon,
                    "hybrid_shard_degree": hybrid_shard_degree,
                    "sm_activation_offloading": offload_activations,
                    "expert_parallel_degree": expert_parallel_degree,
                },
            }
        },
    },
    py_version="py310",
    framework_version="2.2.0",
    output_path=s3_output_bucket,
)

Finally, launch the pre-training workload:

smp_estimator.fit(inputs=data_channels)

Clean up
As part of cleanup, you can delete the SageMaker default bucket created to host the GLUE/SST2 dataset.
Conclusion
Training large MoE language models like the 47 billion parameter Mixtral 8x7B can be challenging due to high computational and memory requirements. By using expert parallelism and sharded data parallelism from the SageMaker model parallelism library, you can effectively scale these MoE architectures across multiple GPUs and workers.
SMP’s expert parallelism implementation seamlessly integrates with PyTorch and the Hugging Face Transformers library, allowing you to enable MoE training using simple configuration flags without changing your existing model code. Additionally, SMP provides performance optimizations like hybrid sharding, delayed parameter initialization, and activation offloading and recomputation to further improve training efficiency.
For the complete sample to pre-train and fine-tune Mixtral 8x7B, see the GitHub repo.
Special thanks
Special thanks to Rahul Huilgol, Gautam Kumar, and Luis Quintela for their guidance and engineering leadership in developing this new capability.

About the Authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.
Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.
Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.
Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.

Safe Reinforcement Learning: Ensuring Safety in RL

Reinforcement Learning (RL) has gained substantial traction over recent years, driven by its successes in complex tasks such as game playing, robotics, and autonomous systems. However, deploying RL in real-world applications necessitates addressing safety concerns, which has led to the emergence of Safe Reinforcement Learning (Safe RL). Safe RL aims to ensure that RL algorithms operate within predefined safety constraints while optimizing performance. Let’s explore key features, use cases, architectures, and recent advancements in Safe RL.

Key Features of Safe RL

Safe RL focuses on developing algorithms to navigate environments safely, avoiding actions that could lead to catastrophic failures. The main features include:

Constraint Satisfaction: Ensuring that the policies learned by the RL agent adhere to safety constraints. These constraints are often domain-specific and can be hard (absolute) or soft (probabilistic).

Robustness to Uncertainty: Safe RL algorithms must be robust to environmental uncertainties, which can arise from partial observability, dynamic changes, or model inaccuracies.

Balancing Exploration and Exploitation: While standard RL algorithms focus on exploration to discover optimal policies, Safe RL must carefully balance exploration to prevent unsafe actions during the learning process.

Safe Exploration: This involves strategies to explore the environment without violating safety constraints, such as using conservative policies or shielding techniques that prevent unsafe actions.

Architectures in Safe RL

Safe RL leverages various architectures and methods to achieve safety. Some of the prominent architectures include:

Constrained Markov Decision Processes (CMDPs): CMDPs extend standard Markov Decision Processes (MDPs) by incorporating constraints that the policy must satisfy. These constraints are typically expressed in terms of expected cumulative costs (a formal statement of the objective is sketched after this list).

Shielding: This involves using an external mechanism to prevent the RL agent from taking unsafe actions. For example, a “shield” can block actions that violate safety constraints, ensuring that only safe actions are executed (a minimal code sketch follows after this list).

Barrier Functions: These mathematical functions ensure the system states remain within a safe set. Barrier functions penalize the agent for approaching unsafe states, thus guiding it to remain in safe regions.

Model-based Approaches: These methods use models of the environment to predict the outcomes of actions and assess their safety before execution. By simulating future states, the agent can avoid actions that might lead to unsafe conditions.
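For concreteness, the CMDP objective mentioned above can be written as follows (generic notation, not tied to any particular paper): maximize expected discounted reward subject to budgets d_i on the expected discounted costs.

\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} c_i(s_t, a_t)\Big] \le d_i, \qquad i = 1, \dots, m

And here is a minimal Python sketch of the shielding idea: the shield vetoes an unsafe proposed action and substitutes a safe fallback. The is_unsafe() check is a placeholder for a domain-specific safety monitor, not an implementation from any particular paper.

def is_unsafe(state, action):
    # Placeholder safety rule, e.g. a braking-distance or geofence check.
    return state + action > 10

def shielded_step(state, proposed_action, safe_actions):
    if is_unsafe(state, proposed_action):
        # Fall back to the first action the shield certifies as safe.
        for a in safe_actions:
            if not is_unsafe(state, a):
                return a
    return proposed_action

state = 9
action = shielded_step(state, proposed_action=5, safe_actions=[0, 1, 2])
print(action)  # 0: the shield blocked the unsafe action 5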

Recent Advances and Research Directions

Recent research has made significant strides in Safe RL, addressing various challenges and proposing innovative solutions. Some notable advancements include:

Feasibility Consistent Representation Learning: This approach addresses the difficulty of estimating safety constraints by learning representations consistent with feasibility constraints. This method helps better approximate the safety boundaries in high-dimensional spaces.

Policy Bifurcation in Safe RL: This technique involves splitting the policy into safe and exploratory components, allowing the agent to explore new strategies while ensuring safety through a conservative baseline policy. This bifurcation helps balance exploration and exploitation while maintaining safety.

Shielding for Probabilistic Safety: Leveraging approximate model-based shielding, this approach provides probabilistic safety guarantees in continuous environments. This method uses simulations to predict unsafe states and preemptively avoid them.

Off-Policy Risk Assessment: This involves assessing the risk of policies in off-policy settings, where the agent learns from historical data rather than direct interactions with the environment. Off-policy risk assessment helps in evaluating the safety of new policies before deployment.

Use Cases of Safe RL

Safe RL has significant applications in several critical domains:

Autonomous Vehicles: Ensuring that self-driving cars can make decisions that prioritize passenger and pedestrian safety, even in unpredictable conditions.

Healthcare: Applying RL to personalized treatment plans while ensuring recommended actions do not harm patients.

Industrial Automation: Deploying robots in manufacturing settings where safety is crucial for human workers and equipment.

Finance: Developing trading algorithms that maximize returns while adhering to regulatory and risk management constraints.

Challenges for Safe RL

Despite the progress, several open challenges remain in Safe RL:

Scalability: Developing scalable Safe RL algorithms that efficiently handle high-dimensional state and action spaces.

Generalization: Ensuring Safe RL policies generalize well to unseen environments and conditions is crucial for real-world deployment.

Human-in-the-Loop Approaches: Integrating human feedback into Safe RL to improve safety and trustworthiness, particularly in critical applications like healthcare and autonomous driving.

Multi-agent Safe RL: Addressing safety in multi-agent settings where multiple RL agents interact introduces additional complexity and safety concerns.

Conclusion

Safe Reinforcement Learning is a vital area of research aimed at making RL algorithms viable for real-world applications by ensuring their safety and robustness. With ongoing advancements and research, Safe RL continues to evolve, addressing new challenges and expanding its applicability across various domains. By incorporating safety constraints, robust architectures, and innovative methods, Safe RL is paving the way for RL’s safe and reliable deployment in critical, real-world scenarios.

Sources

https://arxiv.org/abs/2405.12063

https://arxiv.org/abs/2403.12564

https://arxiv.org/abs/2402.12345

https://paperswithcode.com/task/safe-reinforcement-learning/latest

The post Safe Reinforcement Learning: Ensuring Safety in RL appeared first on MarkTechPost.

This AI Paper by the National University of Singapore Introduces MambaOut: Streamlining Visual Models for Improved Accuracy

In recent years, computer vision has made significant strides by leveraging advanced neural network architectures to tackle complex tasks such as image classification, object detection, and semantic segmentation. Models like Transformers and Convolutional Neural Networks (CNNs) have become fundamental tools, driving substantial improvements in visual recognition performance. These advancements have paved the way for more efficient and accurate systems in various applications, from autonomous driving to medical imaging.

One of the crucial challenges in computer vision is the quadratic complexity of the attention mechanism used in transformers, which hinders their efficiency in handling long sequences. This issue is particularly critical in vision tasks where the sequence length, defined by the number of image patches, can significantly impact computational resources and processing time. Addressing this problem is crucial for improving the scalability and performance of vision models, especially when dealing with high-resolution images or videos that require extensive computational power.

Existing research includes various token mixers with linear complexity, such as dynamic convolution, Linformer, Longformer, and Performer. Furthermore, RNN-like models such as RWKV and Mamba have been developed to handle long sequences efficiently. Vision models incorporating Mamba include Vision Mamba, VMamba, LocalMamba, and PlainMamba. These models leverage structured state space models (SSMs) for improved performance in visual recognition tasks, demonstrating their potential to address the quadratic complexity challenges posed by traditional attention mechanisms in transformers.

Researchers from the National University of Singapore have introduced MambaOut, an architecture derived from the Gated CNN block, designed to evaluate the necessity of Mamba for vision tasks. Unlike traditional Mamba models, MambaOut removes the state space model (SSM) component, focusing on simplifying the architecture while maintaining performance. This innovative approach seeks to determine whether the complexities introduced by Mamba are indeed necessary for achieving high performance in vision tasks, particularly in image classification on ImageNet.

The MambaOut architecture utilizes Gated CNN blocks, integrating token mixing through depthwise convolution. This approach allows MambaOut to maintain a lower computational complexity than traditional Mamba models. By stacking these blocks, MambaOut constructs a hierarchical model, similar to ResNet, to handle various visual recognition tasks efficiently. The researchers implemented MambaOut with PyTorch and timm libraries, training the models on TPU v3 with a batch size of 4096 and an initial learning rate of 0.004. The training scheme followed DeiT without distillation, incorporating data augmentation techniques such as random resized crop, horizontal flip, and regularization techniques like weight decay and stochastic depth.
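For illustration, the following is a simplified Gated CNN block in PyTorch that follows the recipe described above: depthwise convolution for token mixing, a multiplicative gate, and a residual connection. It is a hedged sketch, not the exact MambaOut implementation.

import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    def __init__(self, dim, expansion=2, kernel_size=7):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, 2 * hidden)  # produces the gate and value branches
        self.conv = nn.Conv2d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2, groups=hidden)  # depthwise conv
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):  # x: [B, H, W, C], channels-last
        shortcut = x
        x = self.norm(x)
        gate, value = self.fc1(x).chunk(2, dim=-1)
        value = value.permute(0, 3, 1, 2)  # to [B, C, H, W] for the depthwise conv
        value = self.conv(value).permute(0, 2, 3, 1)  # back to channels-last
        x = self.fc2(self.act(gate) * value)  # gating via elementwise product
        return x + shortcut  # residual connection

block = GatedCNNBlock(dim=64)
print(block(torch.randn(2, 14, 14, 64)).shape)  # torch.Size([2, 14, 14, 64])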

Empirical results indicate that MambaOut surpasses all visual Mamba models in ImageNet image classification. Specifically, the MambaOut-Small model achieves a top-1 accuracy of 84.1%, 0.4% higher than LocalVMamba-S, while requiring only 79% of the multiply-accumulate operations (MACs). For object detection and instance segmentation on COCO, MambaOut serves as the backbone within Mask R-CNN, initialized with ImageNet pre-trained weights. Although MambaOut surpasses some visual Mamba models in this setting, it still lags behind state-of-the-art models like VMamba and LocalVMamba by 1.4 APb and 1.1 APm, respectively. This performance gap highlights the benefits of integrating Mamba in long-sequence visual tasks, reinforcing the hypothesis that Mamba is better suited to tasks with long-sequence characteristics.

In conclusion, the researchers demonstrated that while MambaOut effectively simplifies the architecture for image classification, the Mamba model’s strengths lie in handling long-sequence tasks like object detection and segmentation. This study underscores Mamba’s potential for specific visual tasks, guiding future research directions in optimizing vision models. The findings suggest that further exploration of Mamba’s application in long-sequence visual tasks is warranted, as it offers a promising avenue for enhancing the performance and efficiency of vision models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post This AI Paper by the National University of Singapore Introduces MambaOut: Streamlining Visual Models for Improved Accuracy appeared first on MarkTechPost.

Apple Researchers Propose KV-Runahead: An Efficient Parallel LLM Inference Technique to Minimize the Time-to-First-Token

Large language models (LLMs), particularly Generative Pre-trained Transformer (GPT) models, have demonstrated strong performance across various language tasks. However, challenges persist in their decoder architecture, specifically in time-to-first-token (TTFT) and time-per-output token (TPOT). TTFT depends on processing extensive user context, while TPOT governs how quickly subsequent tokens are generated; both have spurred research into memory-bound solutions such as sparsification and speculative decoding. Parallelization, through tensor and sequential methods, addresses compute-bound TTFT but still lacks optimization for scalable LLM inference due to inefficiencies in attention computation and communication.

Generative LLM inference entails a prompt phase, where initial tokens are generated after receiving user context, and an extension phase, using cached key-value embeddings to expedite subsequent token generation. To minimize TTFT for long contexts, efficient KV-cache management and fast attention map computation are vital. Various optimization approaches, such as PagedAttention and CacheGen, address these challenges. Parallelization techniques like tensor and sequence parallelism aim to optimize compute-bound TTFT, with innovations like KV-Runahead further enhancing scalability and load balancing for improved inference efficiency.
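To make the role of the KV cache concrete, here is a toy single-head attention example in PyTorch that shows the two phases. It is purely illustrative and is not the KV-Runahead implementation.

import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # q: [1, d]; K, V: [t, d]. The new token attends over everything cached so far.
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# Prompt phase: encode the user context once and populate the KV cache.
context = torch.randn(128, d)  # embeddings of 128 prompt tokens
K_cache, V_cache = context @ Wk, context @ Wv

# Extension phase: each new token computes only its own projections and appends
# them to the cache, instead of re-processing the whole sequence.
new_token = torch.randn(1, d)
K_cache = torch.cat([K_cache, new_token @ Wk])
V_cache = torch.cat([V_cache, new_token @ Wv])
out = attend(new_token @ Wq, K_cache, V_cache)
print(out.shape)  # torch.Size([1, 64])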

Apple researchers present a parallelization technique, KV-Runahead, tailored specifically for LLM inference to minimize TTFT. Utilizing the existing KV cache mechanism, KV-Runahead optimizes by distributing the KV-cache population across processes, ensuring context-level load-balancing. By capitalizing on causal attention computation inherent in KV-cache, KV-Runahead effectively reduces computation and communication costs, resulting in lower TTFT compared to existing methods. Importantly, its implementation entails minimal engineering effort, as it repurposes the KV-cache interface without significant modifications.

KV-Runahead is contrasted with Tensor/Sequence Parallel Inference (TSP), which evenly distributes computation across processes. Unlike TSP, KV-Runahead utilizes multiple processes to populate KV-caches for the final process, necessitating effective context partitioning for load-balancing. Each process then executes layers, awaiting KV-cache from the preceding process via local communication rather than global synchronization. 

Researchers conducted experiments on a single node equipped with 8× NVidia A100 GPUs, under both high (300GB/s) and low (10GB/s) bandwidth conditions. KV-Runahead, utilizing FP16 for inference, was compared against Tensor/Sequence Parallelization (TSP) and demonstrated superior performance, consistently outperforming TSP in various scenarios. Different variants of KV-Runahead, including KVR-E with even context partitioning, KVR-S with searched partitioning, and KVR-P with predicted partitioning, were evaluated for efficiency. KV-Runahead achieves significant speedups, particularly with longer contexts and more GPUs, even outperforming TSP on low bandwidth networks. Also, KV-Runahead exhibits robustness against non-uniform network bandwidth, showcasing the benefits of its communication mechanism.

In this work, Apple researchers introduced KV-Runahead, an effective parallel LLM inference method aimed at reducing time-to-first-token. KV-Runahead achieved a speedup of over 60% in first-token generation compared to existing parallelization methods. Also, KV-Runahead demonstrates increased resilience in scenarios with non-uniform bandwidth environments.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Apple Researchers Propose KV-Runahead: An Efficient Parallel LLM Inference Technique to Minimize the Time-to-First-Token appeared first on MarkTechPost.

Generating fashion product descriptions by fine-tuning a vision-language model with SageMaker and Amazon Bedrock

In the world of online retail, creating high-quality product descriptions for millions of products is a crucial but time-consuming task. Using machine learning (ML) and natural language processing (NLP) to automate product description generation has the potential to save manual effort and transform the way ecommerce platforms operate. One of the main advantages of high-quality product descriptions is the improvement in searchability. Customers can more easily locate products that have correct descriptions, because accurate text allows the search engine to identify products that match not just the general category but also the specific attributes mentioned in the product description. For example, a product whose description includes words such as “long sleeve” and “cotton neck” will be returned if a consumer is looking for a “long sleeve cotton shirt.” Furthermore, having factual product descriptions can increase customer satisfaction by enabling a more personalized buying experience and improving the algorithms for recommending more relevant products to users, which raises the probability that users will make a purchase.
With the advancement of Generative AI, we can use vision-language models (VLMs) to predict product attributes directly from images. Pre-trained image captioning or visual question answering (VQA) models perform well at describing everyday images but can’t capture the domain-specific nuances of ecommerce products that are needed to achieve satisfactory performance in all product categories. To solve this problem, this post shows you how to predict domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then using Amazon Bedrock to generate product descriptions with the predicted attributes as input. So you can follow along, we’re sharing the code in a GitHub repository.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
You can use a managed service, such as Amazon Rekognition, to predict product attributes as explained in Automating product description generation with Amazon Bedrock. However, if you’re trying to extract specifics and detailed characteristics of your product or your domain (industry), fine-tuning a VLM on Amazon SageMaker is necessary.
Vision-language models
Since 2021, there has been a rise in interest in vision-language models (VLMs), which led to the release of solutions such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP). When it comes to tasks such as image captioning, text-guided image generation, and visual question answering, VLMs have demonstrated state-of-the-art performance.
In this post, we use BLIP-2, which was introduced in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, as our VLM. BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model (LLM). We use a version of BLIP-2 that contains Flan-T5-XL as the LLM.
The following diagram illustrates the overview of BLIP-2:

Figure 1: BLIP-2 overview
The pre-trained version of the BLIP-2 model has been demonstrated in Build an image-to-text generative AI application using multimodality models on Amazon SageMaker and Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart. In this post, we demonstrate how to fine-tune BLIP-2 for a domain-specific use case.
Solution overview
The following diagram illustrates the solution architecture.

Figure 2: High-level solution architecture
The high-level overview of the solution is:

An ML scientist uses SageMaker notebooks to process and split the data into training and validation datasets.
The datasets are uploaded to Amazon Simple Storage Service (Amazon S3) using the S3 client (a wrapper around an HTTP call).
Then the SageMaker client is used to launch a SageMaker training job, again a wrapper for an HTTP call.
The training job manages copying the datasets from S3 to the training container, training the model, and saving its artifacts to S3.
Then, through another call of the SageMaker client, an endpoint is created, copying the model artifacts into the endpoint hosting container.
The inference workflow is then invoked through an AWS Lambda request, which first makes an HTTP request to the SageMaker endpoint, and then uses the result to make another request to Amazon Bedrock.

In the following sections, we demonstrate how to:

Set up the development environment
Load and prepare the dataset
Fine-tune the BLIP-2 model to learn product attributes using SageMaker
Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker
Generate product descriptions from predicted product attributes using Amazon Bedrock

Set up the development environment
An AWS account is needed with an AWS Identity and Access Management (IAM) role that has permissions to manage resources created as part of the solution. For details, see Creating an AWS account.
We use Amazon SageMaker Studio with the ml.t3.medium instance and the Data Science 3.0 image. However, you can also use an Amazon SageMaker notebook instance or any integrated development environment (IDE) of your choice.
Note: Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, see Configure the AWS CLI.
An ml.g5.2xlarge instance is used for SageMaker training jobs, and an ml.g5.2xlarge instance is used for the SageMaker endpoint. Ensure sufficient capacity for this instance type in your AWS account by requesting a quota increase if required. Also check the pricing of on-demand instances.
You need to clone this GitHub repository for replicating the solution demonstrated in this post. First, launch the notebook main.ipynb in SageMaker Studio by selecting the Image as Data Science and Kernel as Python 3. Install all the required libraries mentioned in the requirements.txt.
Load and prepare the dataset
For this post, we use the Kaggle Fashion Images Dataset, which contains 44,000 products with multiple category labels, descriptions, and high-resolution images. In this post, we want to demonstrate how to fine-tune a model to learn attributes such as fabric, fit, collar, pattern, and sleeve length of a shirt using the image and a question as inputs.
Each product is identified by an ID such as 38642, and there is a map to all the products in styles.csv. From here, we can fetch the image for this product from images/38642.jpg and the complete metadata from styles/38642.json. To fine-tune our model, we need to convert our structured examples into a collection of question and answer pairs. Our final dataset has the following format after processing for each attribute:
Id    | Question                                            | Answer
38642 | What is the fabric of the clothing in this picture? | Fabric: Cotton
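
As an illustration of this conversion, the following hedged sketch turns one product's metadata into question-answer rows of the format shown above. The attribute keys and question wording are assumed for illustration; adapt them to the actual fields in the styles/<id>.json files.

ATTRIBUTE_QUESTIONS = {
    "fabric": "What is the fabric of the clothing in this picture?",
    "sleeve_length": "What is the sleeve length of the shirt in this picture?",
    "pattern": "What is the pattern of the clothing in this picture?",
}

def to_qa_rows(product_id, metadata):
    rows = []
    for attribute, question in ATTRIBUTE_QUESTIONS.items():
        value = metadata.get(attribute)
        if value:
            rows.append({
                "id": product_id,
                "question": question,
                "answer": f"{attribute.replace('_', ' ').title()}: {value}",
            })
    return rows

metadata = {"fabric": "Cotton", "sleeve_length": "Long Sleeves"}  # e.g. parsed from styles/38642.json
print(to_qa_rows(38642, metadata))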

After we process the dataset, we split it into training and validation sets, create CSV files, and upload the dataset to Amazon S3.

Fine-tune the BLIP-2 model to learn product attributes using SageMaker
To launch a SageMaker Training job, we need the HuggingFace Estimator. SageMaker starts and manages all of the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate Hugging Face container, uploads the specified scripts, and downloads data from our S3 bucket to the container to /opt/ml/input/data.
We fine-tune BLIP-2 using the Low-Rank Adaptation (LoRA) technique, which adds trainable rank decomposition matrices to every Transformer structure layer while keeping the pre-trained model weights in a static state. This technique can increase training throughput and reduce the amount of GPU RAM required by 3 times and the number of trainable parameters by 10,000 times. Despite using fewer trainable parameters, LoRA has been demonstrated to perform as well as or better than the full fine-tuning technique.
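Schematically, LoRA keeps the pre-trained weight matrix W_0 frozen and learns a low-rank update (generic notation, not specific to this script):

h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

where only A and B are trained. In the configuration below, r corresponds to the LoraConfig parameter r=8, and the update is applied to the attention query and value projections (target_modules).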
We prepared entrypoint_vqa_finetuning.py which implements fine-tuning of BLIP-2 with the LoRA technique using Hugging Face Transformers, Accelerate, and Parameter-Efficient Fine-Tuning (PEFT). The script also merges the LoRA weights into the model weights after training. As a result, you can deploy the model as a normal model without any additional code.

from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    device_map="auto",
    cache_dir="/tmp",
    load_in_8bit=True,
)

config = LoraConfig(
    r=8,  # LoRA attention dimension.
    lora_alpha=32,  # The alpha parameter for LoRA scaling.
    lora_dropout=0.05,  # The dropout probability for LoRA layers.
    bias="none",  # The bias type for LoRA.
    target_modules=["q", "v"],
)

model = get_peft_model(model, config)
We reference entrypoint_vqa_finetuning.py as the entry_point in the Hugging Face Estimator.

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "epochs": 10,
    "file-name": "vqa_train.csv",
}

estimator = HuggingFace(
    entry_point="entrypoint_vqa_finetuning.py",
    source_dir="../src",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    base_job_name="VQA",
    sagemaker_session=sagemaker_session,
    output_path=f"{output_path}/models",
    code_location=f"{output_path}/code",
    volume_size=60,
    metric_definitions=[
        {"Name": "batch_loss", "Regex": "Loss: ([0-9\.]+)"},
        {"Name": "epoch_loss", "Regex": "Epoch Loss: ([0-9\.]+)"},
    ],
)

We can start the training job by calling the .fit() method and passing the Amazon S3 paths for our images and input file.

estimator.fit({"images": images_input, "input_file": input_file})

Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker
We deploy the fine-tuned BLIP-2 model to a SageMaker real-time endpoint using the Hugging Face Inference Container. You can also use the large model inference (LMI) container, which is described in more detail in Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart, which deploys a pre-trained BLIP-2 model. Here, we reference our fine-tuned model in Amazon S3 instead of the pre-trained model available in the Hugging Face hub. We first create the model and deploy the endpoint.

from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data=estimator.model_data,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    model_server_workers=1,
    sagemaker_session=sagemaker_session,
)

endpoint_name = "endpoint-finetuned-blip2"
model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name)

When the endpoint status becomes InService, we can invoke the endpoint for the instructed vision-to-language generation task with an input image and a question as a prompt:

inputs = {
    "prompt": "What is the sleeve length of the shirt in this picture?",
    "image": image,  # image encoded in Base64
}

The output response looks like the following:
{“Sleeve Length”: “Long Sleeves”}
Generate product descriptions from predicted product attributes using Amazon Bedrock
To get started with Amazon Bedrock, request access to the foundation models (they are not enabled by default). You can follow the steps in the documentation to enable model access. In this post, we use Anthropic’s Claude in Amazon Bedrock to generate product descriptions. Specifically, we use the model anthropic.claude-3-sonnet-20240229-v1 because it provides good performance and speed.
After creating the boto3 client for Amazon Bedrock, we create a prompt string that specifies that we want to generate product descriptions using the product attributes.
You are an expert in writing product descriptions for shirts. Use the data below to create product description for a website. The product description should contain all given attributes. Provide some inspirational sentences, for example, how the fabric moves. Think about what a potential customer wants to know about the shirts. Here are the facts you need to create the product descriptions: [Here we insert the predicted attributes by the BLIP-2 model]
The prompt and model parameters, including the maximum number of tokens used in the response and the temperature, are passed to the body. The JSON response must be parsed before the resulting text is printed in the final line.

import json
import boto3

bedrock = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")

model_id = "anthropic.claude-3-sonnet-20240229-v1"

body = json.dumps(
    {"system": prompt, "messages": attributes_content, "max_tokens": 400, "temperature": 0.1, "anthropic_version": "bedrock-2023-05-31"}
)

response = bedrock.invoke_model(
    body=body,
    modelId=model_id,
    accept="application/json",
    contentType="application/json",
)
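
The response body can then be read and decoded, and the generated text extracted; this snippet assumes the Anthropic Messages API response format, in which the text lives in the first content block:

response_body = json.loads(response.get("body").read())
product_description = response_body["content"][0]["text"]
print(product_description)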

The generated product description response looks like the following:
“Classic Striped Shirt Relax into comfortable casual style with this classic collared striped shirt. With a regular fit that is neither too slim nor too loose, this versatile top layers perfectly under sweaters or jackets.”
Conclusion
We’ve shown you how the combination of VLMs on SageMaker and LLMs on Amazon Bedrock presents a powerful solution for automating fashion product description generation. By fine-tuning the BLIP-2 model on a fashion dataset using Amazon SageMaker, you can predict domain-specific and nuanced product attributes directly from images. Then, using the capabilities of Amazon Bedrock, you can generate product descriptions from the predicted product attributes, enhancing the searchability and personalization of ecommerce platforms. As we continue to explore the potential of generative AI, LLMs and VLMs emerge as a promising avenue for revolutionizing content generation in the ever-evolving landscape of online retail. As a next step, you can try fine-tuning this model on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.

About the Authors 
Antonia Wiebeler is a Data Scientist at the AWS Generative AI Innovation Center, where she enjoys building proofs of concept for customers. Her passion is exploring how generative AI can solve real-world problems and create value for customers. While she is not coding, she enjoys running and competing in triathlons.
Daniel Zagyva is a Data Scientist at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI, and machine learning operations.
Lun Yeh is a Machine Learning Engineer at AWS Professional Services. She specializes in NLP, forecasting, MLOps, and generative AI and helps customers adopt machine learning in their businesses. She graduated from TU Delft with a degree in Data Science & Technology.
Fotinos Kyriakides is an AI/ML Consultant at AWS Professional Services specializing in developing production-ready ML solutions and platforms for AWS customers. In his free time Fotinos enjoys running and exploring.

Gradient AI Introduces Llama-3 8B Gradient Instruct 1048k: Setting New Standards in Long-Context AI

Language models are designed to understand and generate human language. These models are crucial for applications like chatbots, automated content creation, and data analysis. Their ability to comprehend and generate text depends on the context length they can handle, making advancements in long-context models particularly significant for enhancing AI capabilities.

Among many challenges, one major challenge in AI language models is efficiently processing and understanding long text sequences. Traditional models often struggle with context lengths beyond a few thousand tokens, leading to difficulty maintaining coherence and relevance in longer interactions. This limitation hinders the application of AI in areas requiring extensive context, such as legal document analysis, lengthy conversations, and detailed technical writing.

Most language models use fixed context windows, which limit their ability to handle long text sequences. Techniques like positional encodings are employed to manage context, but they often lead to performance degradation when the context exceeds the predefined length. Models like GPT-3 and earlier versions of Llama have made strides but still face significant challenges in extending context length without compromising accuracy and relevance.

With sponsorship support for computing from Crusoe Energy, researchers at Gradient introduced the Llama-3 8B Gradient Instruct 1048k model, a groundbreaking advancement in language models. This model extends the context length from 8,000 to over 1,048,000 tokens, showcasing the ability to manage long contexts with minimal additional training. Utilizing techniques like NTK-aware interpolation and Ring Attention, the researchers significantly improved training efficiency and speed, enabling the model to handle extensive data without the typical performance drop associated with longer contexts.

The researchers employed techniques such as NTK-aware interpolation and Ring Attention to efficiently scale the training of long-context models. They achieved a significant speedup in model training by progressively increasing the context length during training and using advanced computational strategies. This approach allowed them to create a model capable of handling extensive data without the typical performance drop associated with longer contexts.
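
Gradient's exact recipe isn't reproduced here, but a commonly used formulation of NTK-aware interpolation rescales the rotary position embedding (RoPE) base frequency so the model can attend over a much longer window without retraining from scratch. The following is a minimal sketch under that standard parameterization; the scale factor and dimensions are illustrative only.

import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> np.ndarray:
    # NTK-aware scaling: stretch the rotary base so low-frequency components span a
    # longer context while the highest frequency stays roughly unchanged. This is the
    # common community formulation, not necessarily Gradient's exact implementation.
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (ntk_base ** (np.arange(0, head_dim, 2) / head_dim))

# Example: extend an 8K-context model toward a ~128x longer context window
inv_freq = rope_inv_freq(head_dim=128, scale=128.0)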

The new Llama-3 8B model with a context length of over 1 million tokens performed exceptionally well in evaluations. It achieved perfect scores on the Needle-in-a-Haystack (NIAH) test, demonstrating its ability to identify and utilize specific information within vast amounts of data. This model’s performance surpasses previous benchmarks, making it a leading option for applications requiring long-context comprehension and generation.

Use Cases of Llama-3 8B Gradient Instruct 1048k:

Code Generation: Generating code suggestions based on the context of an entire repository.

Investment Analysis: Synthesizing nuanced investment analysis from company reports spanning different periods and sectors.

Data Analysis: Automating the analysis of large sets of poorly structured tabular data.

Legal Analysis: Generating legal analysis using historical precedent from previous court proceedings.

These use cases highlight the model’s ability to effectively handle detailed and context-rich tasks.

In conclusion, the introduction of the Llama-3 8B Gradient Instruct 1048k model marks a significant milestone in developing long-context language models. By addressing the challenge of processing extensive text sequences, the researchers have opened new possibilities for AI applications in various fields. This advancement improves the coherence and relevance of AI-generated content and enhances the overall utility of language models in real-world scenarios.

Sources

https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k

https://x.com/Gradient_AI_/status/1785036209468907796

https://gradient.ai/blog/evaluating-models-beyond-niah

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models

The post Gradient AI Introduces Llama-3 8B Gradient Instruct 1048k: Setting New Standards in Long-Context AI appeared first on MarkTechPost.

Researchers from the University of Maryland Introduce an Automatic Tex …

Protecting the privacy of users who engage in online communities is a significant challenge. This is a key reason why websites like Reddit let users post under fictitious names. There is strong evidence that disclosing an online user’s identity can be damaging, especially for vulnerable groups, even though anonymity can occasionally encourage abusive behavior.

Still, there are situations where choosing a pseudonym rather than your true name may not offer enough privacy. Even anonymous posts may contain stylistic elements that identify the author despite these safeguards. Research on stylometry, the study of language style, shows that these cues can be used to recognize writers across a variety of genres. This creates a serious privacy concern, because it makes it feasible to track an author’s writing across multiple texts and platforms.

Authorship obfuscation techniques automatically rewrite text to obscure the identity of the original author in an effort to protect people’s privacy in online conversations. These methods show promise because they enable users to preserve their anonymity, which is essential for participating in online areas safely. 

Conventional methods of obfuscation in the literature on Natural Language Processing (NLP) have frequently been restricted to certain environments and have depended on basic, surface-level modifications. These techniques can produce strange or odd writing, which could impair the effectiveness of the privacy protection measures as well as the quality of communication.

In a recent study, a team of researchers from the University of Maryland, College Park, has come up with an automatic text privatization framework that fine-tunes a Large Language Model to produce rewrites that balance soundness, sense, and privacy. It makes use of a sizable language model that has been refined using reinforcement learning to attain an improved equilibrium between safeguarding privacy, keeping the text’s meaning or soundness, and preserving naturalness or sense. The original content’s coherence and readability are preserved while the author’s identity is concealed through an automatic rewriting system.

The team has conducted a thorough evaluation of this technique’s effectiveness using a huge dataset of English posts from Reddit, which includes texts from 68,000 authors. These entries range in length from brief to medium, mirroring the usual content of Internet discussion boards. The study looks at how the obfuscation approach performs differently depending on factors like authorship detection strategies and the length of the author’s profile.

Both automatic measurements and human reviews demonstrate that this strategy maintains good text quality. This indicates that readers will still be able to understand and relate to the revised text. The technique successfully avoids several automated authorship attacks, indicating how reliable it is in safeguarding user privacy.

This method offers a major improvement over prior approaches by fine-tuning a huge language model using reinforcement learning. It offers a more advanced and practical method of masking authorship, guaranteeing that people can converse openly and safely in virtual spaces without sacrificing the caliber of their work or their privacy.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.

The post Researchers from the University of Maryland Introduce an Automatic Text Privatization Framework that Fine-Tunes a Large Language Model via Reinforcement Learning appeared first on MarkTechPost.

The AI-Powered Code Revolution: Bridging Traditional and Neurosymbolic …

Generative AI models, particularly Large Language Models (LLMs), have seen a surge in adoption across various industries, transforming the software development landscape. As enterprises and startups increasingly integrate LLMs into their workflows, the future of programming is set to undergo significant changes.

Historically, symbolic programming has dominated, where developers use symbolic code to express logic for tasks or problem-solving. However, the rapid adoption of LLMs has sparked interest in a new paradigm, Neurosymbolic programming, which combines neural networks and traditional symbolic code to create sophisticated algorithms and applications.

LLMs operate by processing text inputs and generating text outputs, with prompt engineering currently the primary method of programming these models. This approach relies heavily on constructing the right input prompts, a task that can be complex and tedious. The intricacies of generating appropriate prompts from existing code constructs can reduce code readability and maintainability. To address these challenges, several open-source libraries and research efforts, such as LangChain, Guidance, LMQL, and SGLang, have emerged. These tools aim to simplify prompt construction and facilitate LLM programming, but they still require developers to manually decide the type of prompts and the information to include.

The complexity of LLM programming largely stems from the need for more abstraction when interfacing with these models. In conventional symbolic programming, operations are conducted directly on variables or typed values. However, LLMs operate on text strings, necessitating the conversion of variables to prompts and the parsing of LLM outputs back into variables. This process introduces additional logic and complexity, highlighting a fundamental mismatch between LLM abstractions and conventional symbolic programming.

To address this, a new approach proposes treating LLMs as native code constructs and providing syntax support at the programming language level. This approach introduces a new type of “meaning” to serve as the abstraction for LLM interactions. “Meaning” refers to the semantic purpose behind the symbolic data (strings) used as LLM inputs and outputs. The language runtime should automate the process of translating conventional code constructs and meanings, termed Meaning-type Transformations (MTT), to reduce developer complexity.

A novel language feature, Semantic Strings (semstrings), is introduced to enable developers to annotate existing code constructs with additional context. Semstrings allow for the seamless integration of LLMs by providing necessary context and information, facilitating the Automatic Meaning-type Transformation (A-MTT). This automation abstracts the complexity of prompt generation and response parsing, making it easier for developers to leverage LLMs in their code.

Through real code examples, the concept of A-MTT is demonstrated to streamline common symbolic code operations, such as instantiating custom type objects, standalone function calls, and class member methods. Introducing these new abstractions and language features represents a significant contribution to the programming paradigm, enabling more efficient and maintainable integration of LLMs into conventional symbolic programming. This advancement promises to transform the future of programming, making it more accessible and less cumbersome for developers working with generative AI models.
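
To make the meaning-type transformation problem concrete, the following hypothetical sketch shows the glue code developers currently write by hand: converting a typed value into a prompt string and parsing the LLM’s text output back into a typed value. The paper’s semstring syntax itself is not reproduced here, and all function names are illustrative only.

import json

def llm(prompt: str) -> str:
    """Placeholder for any text-in/text-out LLM call (hypothetical)."""
    raise NotImplementedError

def extract_invoice_total(invoice_text: str) -> float:
    # Manual meaning-type transformation: turn a typed intent into a prompt string...
    prompt = (
        "Extract the total amount due from the invoice below. "
        'Respond with JSON of the form {"total": <number>}.\n\n'
        + invoice_text
    )
    raw = llm(prompt)
    # ...and parse the LLM's string output back into a typed value.
    return float(json.loads(raw)["total"])

This boilerplate conversion logic is exactly what the proposed A-MTT abstraction aims to automate at the language level.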

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.

The post The AI-Powered Code Revolution: Bridging Traditional and Neurosymbolic Programming appeared first on MarkTechPost.

Create a multimodal assistant with advanced RAG and Amazon Bedrock

Retrieval Augmented Generation (RAG) models have emerged as a promising approach to enhance the capabilities of language models by incorporating external knowledge from large text corpora. However, despite their impressive performance in various natural language processing tasks, RAG models still face several limitations that need to be addressed.
Naive RAG models face limitations such as missing content, reasoning mismatch, and challenges in handling multimodal data. Although they can retrieve relevant information, they may struggle to generate complete and coherent responses when required information is absent, leading to incomplete or inaccurate outputs. Additionally, even with relevant information retrieved, the models may have difficulty correctly interpreting and reasoning over the content, resulting in inconsistencies or logical errors. Furthermore, effectively understanding and reasoning over multimodal data remains a significant challenge for these primarily text-based models.
In this post, we present a new approach named multimodal RAG (mmRAG) that tackles these limitations for practical generative artificial intelligence (AI) assistant use cases. Additionally, we examine potential solutions to enhance the capabilities of large language models (LLMs) and visual language models (VLMs) with advanced LangChain capabilities, enabling them to generate more comprehensive, coherent, and accurate outputs while effectively handling multimodal data. The solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, providing a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Solution architecture
The mmRAG solution is based on a straightforward concept: to extract different data types separately, you generate text summarization using a VLM from different data types, embed text summaries along with raw data accordingly to a vector database, and store raw unstructured data in a document store. The query will prompt the LLM to retrieve relevant vectors from both the vector database and document store and generate meaningful and accurate answers.
The following diagram illustrates the solution architecture.

The architecture diagram depicts the mmRAG architecture that integrates advanced reasoning and retrieval mechanisms. It combines text, table, and image (including chart) data into a unified vector representation, enabling cross-modal understanding and retrieval. The process begins with diverse data extractions from various sources such as URLs and PDF files by parsing and preprocessing text, table, and image data types separately, while table data is converted into raw text and image data into captions.
These parsed data streams are then fed into a multimodal embedding model, which encodes the various data types into uniform, high dimensional vectors. The resulting vectors, representing the semantic content regardless of original format, are indexed in a vector database for efficient approximate similarity searches. When a query is received, the reasoning and retrieval component performs similarity searches across this vector space to retrieve the most relevant information from the vast integrated knowledge base.
The retrieved multimodal representations are then used by the generation component to produce outputs such as text, images, or other modalities. The VLM component generates vector representations specifically for textual data, further enhancing the system’s language understanding capabilities. Overall, this architecture facilitates advanced cross-modal reasoning, retrieval, and generation by unifying different data modalities into a common semantic space.
Developers can access the mmRAG source code in the GitHub repo.
Configure Amazon Bedrock with LangChain
You start by configuring Amazon Bedrock to integrate with various components from the LangChain Community library. This allows you to work with the core FMs. You use the BedrockEmbeddings class to create two different embedding models: one for text (embedding_bedrock_text) and one for images (embeddings_bedrock_image). These embeddings represent textual and visual data in a numerical format, which is essential for various natural language processing (NLP) tasks.
Additionally, you use the LangChain Bedrock and BedrockChat classes to create a VLM model instance (llm_bedrock_claude3_haiku) from Anthropic Claude 3 Haiku and a chat instance based on a different model, Sonnet (chat_bedrock_claude3_sonnet). These instances are used for advanced query reasoning, argumentation, and retrieval tasks. See the following code snippet:
import boto3
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.chat_models.bedrock import BedrockChat

boto3_bedrock = boto3.client("bedrock-runtime")

# Embedding models for text and for images
embedding_bedrock_text = BedrockEmbeddings(client=boto3_bedrock, model_id="amazon.titan-embed-g1-text-02")
embeddings_bedrock_image = BedrockEmbeddings(client=boto3_bedrock, model_id="amazon.titan-embed-image-v1")

model_kwargs = {
    "max_tokens": 2048,
    "temperature": 0.0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\n\n"],
}

# VLM instance (Claude 3 Haiku) and chat instance (Claude 3 Sonnet)
llm_bedrock_claude3_haiku = BedrockChat(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    client=boto3_bedrock,
    model_kwargs=model_kwargs,
)

chat_bedrock_claude3_sonnet = BedrockChat(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    client=boto3_bedrock,
    model_kwargs=model_kwargs,
)
Parse content from data sources and embed both text and image data
In this section, we explore how to harness the power of Python to parse text, tables, and images from URLs and PDFs efficiently, using two powerful packages: Beautiful Soup and PyMuPDF. Beautiful Soup, a library designed for web scraping, makes it straightforward to sift through HTML and XML content, allowing you to extract the desired data from web pages. PyMuPDF offers an extensive set of functionalities for interacting with PDF files, enabling you to extract not just text but also tables and images with ease. See the following code:
import requests
import fitz  # PyMuPDF
from bs4 import BeautifulSoup as Soup

def parse_tables_images_from_urls(url: str):
    # Fetch the page and parse the HTML content using Beautiful Soup
    response = requests.get(url)
    soup = Soup(response.content, "html.parser")

    # Find all table elements
    tables = soup.find_all("table")
    # Find all image elements
    images = soup.find_all("img")
    return tables, images

def parse_images_tables_from_pdf(pdf_path: str):
    pdf_file = fitz.open(pdf_path)

    # Iterate through each page
    for page_index in range(len(pdf_file)):
        # Select the page
        page = pdf_file[page_index]

        # Search for tables on the page and convert each one to a pandas DataFrame
        tables = page.find_tables()
        for table in tables.tables:
            df = table.to_pandas()

        # Search for images on the page and extract the raw image bytes
        images = page.get_images()
        for image in images:
            xref = image[0]
            image_info = pdf_file.extract_image(xref)
            image_data = image_info["image"]

The following code snippets demonstrate how to generate image captions using Anthropic Claude 3 by invoking the bedrock_get_img_description utility function. Additionally, they showcase how to embed image pixels along with the image caption using the Amazon Titan image embedding model amazon.titan-embed-image-v1 by calling the get_text_embedding function.
image_caption = bedrock_get_img_description(
    model_id,
    prompt="You are an expert at analyzing images in great detail. Your task is to carefully examine the provided "
           "image and generate a detailed, accurate textual description capturing all of the important elements and "
           "context present in the image. Pay close attention to any numbers, data, or quantitative information visible, "
           "and be sure to include those numerical values along with their semantic meaning in your description. "
           "Thoroughly read and interpret the entire image before providing your detailed caption describing the "
           "image content in text format. Strive for a truthful and precise representation of what is depicted.",
    image=image_byteio,
    max_token=max_token,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    stop_sequences="Human:",
)

image_sum_vectors = get_text_embedding(image_base64=image_base64, text_description=image_caption, embd_model_id=embd_model_id)
Embedding and vectorizing multimodality data
You can harness the capabilities of the newly released Anthropic Claude 3 Sonnet and Haiku on Amazon Bedrock, combined with the Amazon Titan image embedding model and LangChain. This powerful combination allows you to generate comprehensive text captions for tables and images, seamlessly integrating them into your content. Additionally, you can store vectors, objects, raw image file names, and source documents in an Amazon OpenSearch Serverless vector store and object store. Use the following code snippets to create image captions by invoking the utility function bedrock_get_img_description, and to embed image pixels along with image captions using the Amazon Titan image embedding model amazon.titan-embed-image-v1 by calling the get_text_embedding function.
import json

def get_text_embedding(image_base64=None, text_description=None, embd_model_id: str = "amazon.titan-embed-image-v1"):
    input_data = {}
    if image_base64 is not None:
        input_data["inputImage"] = image_base64
    if text_description is not None:
        input_data["inputText"] = text_description
    if not input_data:
        raise ValueError("At least one of image_base64 or text_description must be provided")
    body = json.dumps(input_data)
    response = boto3_bedrock.invoke_model(
        body=body,
        modelId=embd_model_id,
        accept="application/json",
        contentType="application/json",
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")

# Generate an image caption with Claude 3, then embed the image pixels together with the caption
image_caption = bedrock_get_img_description(
    model_id,
    prompt="You are an expert at analyzing images in great detail. Your task is to carefully examine the provided "
           "image and generate a detailed, accurate textual description capturing all of the important elements and "
           "context present in the image. Pay close attention to any numbers, data, or quantitative information visible, "
           "and be sure to include those numerical values along with their semantic meaning in your description. "
           "Thoroughly read and interpret the entire image before providing your detailed caption describing the "
           "image content in text format. Strive for a truthful and precise representation of what is depicted.",
    image=image_byteio,
    max_token=max_token,
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    stop_sequences="Human:",
)

image_sum_vectors = get_text_embedding(image_base64=image_base64, text_description=image_caption, embd_model_id=embd_model_id)
You can consult the provided code examples for more information on how to embed multimodal content and insert vector documents into the OpenSearch Serverless vector store. For more information about data access, refer to Data access control for Amazon OpenSearch Serverless.
import json
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

# Form a data dictionary with image metadata, the raw image object store location, and base64-encoded image data
document = {
    "doc_source": image_url,
    "image_filename": s3_image_path,
    "embedding": image_base64,
}

# Parse out only the image name from the full temp path
filename = f"jsons/{image_path.split('/')[-1].split('.')[0]}.json"

# Write the data dict into a JSON file
with open(filename, "w") as file:
    json.dump(document, file, indent=4)

# Load all JSON files from the temp directory
loader = DirectoryLoader("./jsons", glob="**/*.json", show_progress=False, loader_cls=TextLoader)

# loader = DirectoryLoader("./jsons", glob="**/*.json", show_progress=True, loader_cls=JSONLoader, loader_kwargs={"jq_schema": ".content"})
new_documents = loader.load()
new_docs = text_splitter.split_documents(new_documents)

# Insert into Amazon OpenSearch Serverless (AOSS)
new_docsearch = OpenSearchVectorSearch.from_documents(
    new_docs,
    bedrock_embeddings,
    opensearch_url=host,
    http_auth=auth,
    timeout=100,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name=new_index_name,
    engine="faiss",
)
Advanced RAG with fusion and decomposition
Fusion in RAG presents an innovative search strategy designed to transcend the limitations of conventional search techniques, aligning more closely with the complex nature of human inquiries. This initiative elevates the search experience by integrating multi-faceted query generation and using Reciprocal Rank Fusion for an enhanced re-ranking of search outcomes. This approach offers a more nuanced and effective way to navigate the vast expanse of available information, catering to the intricate and varied demands of users’ searches.
The following diagram illustrates this workflow.

We use the Anthropic Claude 3 Sonnet and Haiku models, which can process both visual and language data, enabling them to handle the query decomposition (Haiku) and answer fusion (Sonnet) stages effectively. The following code snippet demonstrates how to create a retriever using OpenSearch Serverless:
from langchain.vectorstores import OpenSearchVectorSearch

retriever = OpenSearchVectorSearch(
    opensearch_url="{}.{}.aoss.amazonaws.com".format(<collection_id>, <my_region>),
    index_name=<index_name>,
    embedding_function=embd,
)
The combination of decomposition and fusion intends to address the limitations of the chain-of-thought (CoT) method in language models. It involves breaking down complex problems into simpler, sequential sub-problems, where each sub-problem builds upon the solution of the previous one. This technique significantly enhances the problem-solving abilities of language models in areas such as symbolic manipulation, compositional generalization, and mathematical reasoning.
The RAG-decomposition approach, which uses the decomposition step (see the following code), underscores the potential of a technique called least-to-most prompting. This technique not only improves upon existing methods but also paves the way for more advanced, interactive learning frameworks for language models. The ultimate goal is to move towards a future where language models can learn from bidirectional conversations, enabling more effective reasoning and problem-solving capabilities.
from langchain import hub
from langchain.load import dumps, loads
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Decomposition
prompt_rag = hub.pull("rlm/rag-prompt")
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries semantically related to: {question} \n
Output (5 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)
generate_queries_decomposition = (prompt_decomposition | llm_bedrock | StrOutputParser() | (lambda x: x.split("\n")))
questions = generate_queries_decomposition.invoke({"question": question})

def reciprocal_rank_fusion(results: list[list], k=60):

    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Retrieve the current score of the document, if any
            previous_score = fused_scores[doc_str]
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results

def retrieve_and_rag(question, prompt_rag, sub_question_generator_chain):
    sub_questions = sub_question_generator_chain.invoke({"question": question})
    # Initialize a list to hold RAG chain results
    rag_results = []
    for sub_question in sub_questions:
        # Retrieve documents for each sub-question with reciprocal reranking
        retrieved_docs = retrieval_chain_rag_fusion.invoke({"question": sub_question})
        # Use the retrieved documents and the sub-question in the RAG chain
        answer = (prompt_rag
                  | chat_bedrock
                  | StrOutputParser()
                  ).invoke({"context": retrieved_docs, "question": sub_question})
        rag_results.append(answer)
    return rag_results, sub_questions

def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""

    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()

# Decomposing and reciprocal reranking
retrieval_chain_rag_fusion = generate_queries_decomposition | retriever.map() | reciprocal_rank_fusion

# Run retrieval and RAG over the decomposed sub-questions
answers, questions = retrieve_and_rag(question, prompt_rag, generate_queries_decomposition)
context = format_qa_pairs(questions, answers)

# Prompt to fuse the sub-answers into a final response
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the question: {question}
"""
prompt_fusion = ChatPromptTemplate.from_template(template)
final_rag_chain = (prompt_fusion | llm_bedrock | StrOutputParser())

final_rag_chain.invoke({"context": context, "question": question})
The RAG process is further enhanced by integrating a reciprocal re-ranker, which uses sophisticated NLP techniques. This makes sure the retrieved results are relevant and also semantically aligned with the user’s intended query. This multimodal retrieval approach seamlessly operates across vector databases and object stores, marking a significant advancement in the quest for more efficient, accurate, and contextually aware search mechanisms.
Multimodality retrievals
The mmRAG architecture enables the system to understand and process multimodal queries, retrieve relevant information from various sources, and generate multimodal answers by combining textual, tabular, and visual information in a unified manner. The following diagram highlights the data flows from queries to answers by using an advanced RAG and a multimodal retrieval engine powered by a multimodal embedding model (amazon.titan-embed-image-v1), an object store (Amazon S3), and a vector database (OpenSearch Serverless). For tables, the system retrieves relevant table locations and metadata, and computes the cosine similarity between the multimodal embedding and the vectors representing the table and its summary. Similarly, for images, the system retrieves relevant image locations and metadata, and computes the cosine similarity between the multimodal embedding and the vectors representing the image and its caption.

import base64

# Connect to AOSS with the given host and index name
docsearch = OpenSearchVectorSearch(
    index_name=index_name,          # TODO: use the same index name used in the ingestion script
    embedding_function=bedrock_embeddings,
    opensearch_url=host,            # TODO: e.g. use the AWS OpenSearch domain instantiated previously
    http_auth=auth,
    timeout=100,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    engine="faiss",
)

# Query for images with text (text-to-image search)
query = "What is the math and reasoning score MMMU (val) for Anthropic Claude 3 Sonnet?"
t2i_results = docsearch.similarity_search_with_score(query, k=3)  # return the 3 most relevant docs

# Or query AOSS with an image (image-to-image search)
with open(obj_image_path, "rb") as image_file:
    image_data = image_file.read()
image_base64 = base64.b64encode(image_data).decode("utf8")
image_vectors = get_image_embedding(image_base64=image_base64)
i2i_results = docsearch.similarity_search_with_score_by_vector(image_vectors, k=3)  # return the 3 most relevant docs
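
Under the hood, the relevance scores returned above are based on vector similarity; the following minimal sketch shows the cosine similarity computation between a query embedding and a stored embedding produced by the earlier get_text_embedding helper. It is illustrative only; the vector store performs this comparison internally, and the query text is a hypothetical example.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between a query embedding and a stored table/image embedding
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage: compare a multimodal query vector against a stored image-caption vector
query_vec = np.array(get_text_embedding(text_description="MMMU (val) math score for Claude 3 Sonnet"))
image_vec = np.array(image_sum_vectors)  # from the earlier embedding step
score = cosine_similarity(query_vec, image_vec)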

The following screenshot illustrates the improved accuracy and comprehensive understanding of the user’s query with multimodality capability. The mmRAG approach is capable of grasping the intent behind the query, extracting relevant information from the provided chart, and estimating the overall costs, including the estimated output token size. Furthermore, it can perform mathematical calculations to determine the cost difference. The output includes the source chart and a link to its original location.

Use cases and limitations
Amazon Bedrock offers a comprehensive set of generative AI models for enhancing content comprehension across various modalities. By using the latest advancements in VLMs, such as Anthropic Claude 3 Sonnet and Haiku, as well as the Amazon Titan image embedding model, Amazon Bedrock enables you to expand your document understanding beyond text to include tables, charts, and images. The integration of OpenSearch Serverless provides enterprise-grade vector storage and approximate k-NN search capabilities, enabling efficient retrieval of relevant information. With advanced LangChain decomposition and fusion techniques, you can use multi-step querying across different LLMs to improve accuracy and gain deeper insights. This powerful combination of cutting-edge technologies allows you to unlock the full potential of multimodal content comprehension, enabling you to make informed decisions and drive innovation across various data sources.
The reliance on visual language models and image embedding models for comprehensive and accurate image captions has its limitations. Although these models excel at understanding visual and textual data, the multi-step query decomposition, reciprocal ranking, and fusion processes involved can lead to increased inference latency. This makes such solutions less suitable for real-time applications or scenarios that demand instantaneous responses. However, these solutions can be highly beneficial in use cases where higher accuracy and less time-sensitive responses are required, allowing for more detailed and accurate analysis of complex visual and textual data.
Conclusion
In this post, we discussed how you can use multimodal RAG to address limitations in multimodal generative AI assistants. We invite you to explore mmRAG and take advantage of the advanced features of Amazon Bedrock. These powerful tools can assist your business in gaining deeper insights, making well-informed decisions, and fostering innovation driven by more accurate data. Ongoing research efforts are focused on developing an agentic, graph-based pipeline to streamline the processes of parsing, injection, and retrieval. These approaches hold the promise of enhancing the reliability and reusability of the mmRAG system.
Acknowledgement
The authors would like to express their sincere gratitude to Nausheen Sayed, Karen Twelves, Li Zhang, Sophia Shramko, Mani Khanuja, Santhosh Kuriakose, and Theresa Perkins for their comprehensive reviews.

About the Authors
Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.
Changsha Ma is a generative AI Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, mentoring college students for entrepreneurship, and spending time with friends and families.
Julianna Delua is a Principal Specialist for AI/ML and generative AI. She serves the financial services industry customers including those in Capital Markets, Fintech and Payments. Julianna enjoys helping businesses turn new ideas into solutions and transform the organizations with AI-powered solutions.

How 20 Minutes empowers journalists and boosts audience engagement wit …

This post is co-written with Aurélien Capdecomme and Bertrand d’Aure from 20 Minutes.
With 19 million monthly readers, 20 Minutes is a major player in the French media landscape. The media organization delivers useful, relevant, and accessible information to an audience that consists primarily of young and active urban readers. Every month, nearly 8.3 million 25–49-year-olds choose 20 Minutes to stay informed. Established in 2002, 20 Minutes consistently reaches more than a third (39 percent) of the French population each month through print, web, and mobile platforms.
As 20 Minutes’s technology team, we’re responsible for developing and operating the organization’s web and mobile offerings and driving innovative technology initiatives. For several years, we have been actively using machine learning and artificial intelligence (AI) to improve our digital publishing workflow and to deliver a relevant and personalized experience to our readers. With the advent of generative AI, and in particular large language models (LLMs), we have now adopted an AI by design strategy, evaluating the application of AI for every new technology product we develop.
One of our key goals is to provide our journalists with a best-in-class digital publishing experience. Our newsroom journalists work on news stories using Storm, our custom in-house digital editing experience. Storm serves as the front end for Nova, our serverless content management system (CMS). These applications are a focus point for our generative AI efforts.
In 2023, we identified several challenges where we see the potential for generative AI to have a positive impact. These include new tools for newsroom journalists, ways to increase audience engagement, and a new way to ensure advertisers can confidently assess the brand safety of our content. To implement these use cases, we rely on Amazon Bedrock.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon Web Services (AWS) through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
This blog post outlines various use cases where we’re using generative AI to address digital publishing challenges. We dive into the technical aspects of our implementation and explain our decision to choose Amazon Bedrock as our foundation model provider.
Identifying challenges and use cases
Today’s fast-paced news environment presents both challenges and opportunities for digital publishers. At 20 Minutes, a key goal of our technology team is to develop new tools for our journalists that automate repetitive tasks, improve the quality of reporting, and allow us to reach a wider audience. Based on this goal, we have identified three challenges and corresponding use cases where generative AI can have a positive impact.
The first use case is to use automation to minimize the repetitive manual tasks that journalists perform as part of the digital publishing process. The core work of developing a news story revolves around researching, writing, and editing the article. However, when the article is complete, supporting information and metadata must be defined, such as an article summary, categories, tags, and related articles.
While these tasks can feel like a chore, they are critical to search engine optimization (SEO) and therefore the audience reach of the article. If we can automate some of these repetitive tasks, this use case has the potential to free up time for our newsroom to focus on core journalistic work while increasing the reach of our content.
The second use case is how we republish news agency dispatches at 20 Minutes. Like most news outlets, 20 Minutes subscribes to news agencies, such as the Agence France-Presse (AFP) and others, that publish a feed of news dispatches covering national and international news. 20 Minutes journalists select stories relevant to our audience and rewrite, edit, and expand on them to fit the editorial standards and unique tone our readership is used to. Rewriting these dispatches is also necessary for SEO, as search engines rank duplicate content low. Because this process follows a repeatable pattern, we decided to build an AI-based tool to simplify the republishing process and reduce the time spent on it.
The third and final use case we identified is to improve transparency around the brand safety of our published content. As a digital publisher, 20 Minutes is committed to providing a brand-safe environment for potential advertisers. Content can be classified as brand-safe or not brand-safe based on its appropriateness for advertising and monetization. Depending on the advertiser and brand, different types of content might be considered appropriate. For example, some advertisers might not want their brand to appear next to news content about sensitive topics such as military conflicts, while others might not want to appear next to content about drugs and alcohol.
Organizations such as the Interactive Advertising Bureau (IAB) and the Global Alliance for Responsible Media (GARM) have developed comprehensive guidelines and frameworks for classifying the brand safety of content. Based on these guidelines, data providers such as the IAB and others conduct automated brand safety assessments of digital publishers by regularly crawling websites such as 20minutes.fr and calculating a brand safety score.
However, this brand safety score is site-wide and doesn’t break down the brand safety of individual news articles. Given the reasoning capabilities of LLMs, we decided to develop an automated per-article brand safety assessment based on industry-standard guidelines to provide advertisers with a real-time, granular view of the brand safety of 20 Minutes content.
Our technical solution
At 20 Minutes, we’ve been using AWS since 2017, and we aim to build on top of serverless services whenever possible.
The digital publishing frontend application Storm is a single-page application built using React and Material Design and deployed using Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront. Our CMS backend Nova is implemented using Amazon API Gateway and several AWS Lambda functions. Amazon DynamoDB serves as the primary database for 20 Minutes articles. New articles and changes to existing articles are captured using DynamoDB Streams, which invokes processing logic in AWS Step Functions and feeds our search service based on Amazon OpenSearch.
We integrate Amazon Bedrock using AWS PrivateLink, which allows us to create a private connection between our Amazon Virtual Private Cloud (VPC) and Amazon Bedrock without traversing the public internet.
When working on articles in Storm, journalists have access to several AI tools implemented using Amazon Bedrock. Storm is a block-based editor that allows journalists to combine multiple blocks of content, such as title, lede, text, image, social media quotes, and more, into a complete article. With Amazon Bedrock, journalists can use AI to generate an article summary suggestion block and place it directly into the article. We use a single-shot prompt with the full article text in context to generate the summary.
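The exact production prompt isn’t shown here, but a single-shot summary request to a Claude model on Amazon Bedrock could look like the following minimal sketch; the model ID, prompt wording, and parameters are illustrative assumptions rather than 20 Minutes’ implementation.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def suggest_summary(article_text: str) -> str:
    # Single-shot prompt with the full article text in context (illustrative only)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "temperature": 0.2,
        "system": "Tu es un assistant de rédaction. Résume l'article en trois phrases, en français.",
        "messages": [{"role": "user", "content": article_text}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumption: any Claude model available in Bedrock
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["content"][0]["text"]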
Storm CMS also gives journalists suggestions for article metadata. This includes recommendations for appropriate categories, tags, and even in-text links. These references to other 20 Minutes content are critical to increasing audience engagement, as search engines rank content with relevant internal and external links higher.
To implement this, we use a combination of Amazon Comprehend and Amazon Bedrock to extract the most relevant terms from an article’s text and then perform a search against our internal taxonomic database in OpenSearch. Based on the results, Storm provides several suggestions of terms that should be linked to other articles or topics, which users can accept or reject.
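A simplified sketch of this extract-then-search pattern follows; the taxonomy index and field names are hypothetical, and the actual implementation combines Amazon Comprehend with Amazon Bedrock rather than Comprehend alone.
import boto3
from opensearchpy import OpenSearch

comprehend = boto3.client("comprehend")
opensearch = OpenSearch(hosts=["https://my-search-domain.example.com"])  # hypothetical endpoint

def suggest_internal_links(article_text: str, top_n: int = 5) -> list[str]:
    # Extract candidate terms from the (French) article text, truncated defensively
    phrases = comprehend.detect_key_phrases(Text=article_text[:5000], LanguageCode="fr")
    terms = [p["Text"] for p in phrases["KeyPhrases"][:top_n]]
    # Look each term up in an internal taxonomy index (index and field names are assumptions)
    suggestions = []
    for term in terms:
        hits = opensearch.search(
            index="taxonomy",
            body={"query": {"match": {"label": term}}, "size": 1},
        )["hits"]["hits"]
        suggestions.extend(hit["_source"]["label"] for hit in hits)
    return suggestions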

News dispatches become available in Storm as soon as we receive them from our partners such as AFP. Journalists can browse the dispatches and select them for republication on 20minutes.fr. Every dispatch is manually reworked by our journalists before publication. To do so, journalists first invoke a rewrite of the article by an LLM using Amazon Bedrock. For this, we use a low-temperature single-shot prompt that instructs the LLM not to reinterpret the article during the rewrite, and to keep the word count and structure as similar as possible. The rewritten article is then manually edited by a journalist in Storm like any other article.
To implement our new brand safety feature, we process every new article published on 20minutes.fr. Currently, we use a single-shot prompt that includes both the article text and the IAB brand safety guidelines in context to get a sentiment assessment from the LLM. We then parse the response, store the sentiment, and make it publicly available for each article to be accessed by ad servers.
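A hedged sketch of such a classification call is shown below, reusing the Bedrock runtime client and json import from the summarization sketch above; the verdict labels and prompt wording are illustrative assumptions, not the production prompt.
def assess_brand_safety(article_text: str, guidelines: str) -> str:
    # Single-shot prompt that places both the guidelines and the article in context
    # and asks for a structured verdict (illustrative only)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "temperature": 0.0,
        "system": "You assess brand safety. Answer with one of: SAFE, SENSITIVE, UNSAFE, "
                  "followed by a one-sentence rationale.",
        "messages": [{"role": "user", "content": f"Guidelines:\n{guidelines}\n\nArticle:\n{article_text}"}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["content"][0]["text"]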
Lessons learned and outlook
When we started working on generative AI use cases at 20 Minutes, we were surprised at how quickly we were able to iterate on features and get them into production. Thanks to the unified Amazon Bedrock API, it’s easy to switch between models for experimentation and find the best model for each use case.
For the use cases described above, we use Anthropic’s Claude in Amazon Bedrock as our primary LLM because of its overall high quality and, in particular, its quality in recognizing French prompts and generating French completions. Because 20 Minutes content is almost exclusively French, these multilingual capabilities are key for us. We have found that careful prompt engineering is a key success factor and we closely adhere to Anthropic’s prompt engineering resources to maximize completion quality.
Even without relying on approaches like fine-tuning or retrieval-augmented generation (RAG) to date, we can implement use cases that deliver real value to our journalists. Based on data collected from our newsroom journalists, our AI tools save them an average of eight minutes per article. With around 160 pieces of content published every day, this is already a significant amount of time that can now be spent reporting the news to our readers, rather than performing repetitive manual tasks.
The success of these use cases depends not only on technical efforts, but also on close collaboration between our product, engineering, newsroom, marketing, and legal teams. Together, representatives from these roles make up our AI Committee, which establishes clear policies and frameworks to ensure the transparent and responsible use of AI at 20 Minutes. For example, every use of AI is discussed and approved by this committee, and all AI-generated content must undergo human validation before being published.
We believe that generative AI is still in its infancy when it comes to digital publishing, and we look forward to bringing more innovative use cases to our platform this year. We’re currently working on deploying fine-tuned LLMs using Amazon Bedrock to accurately match the tone and voice of our publication and further improve our brand safety analysis capabilities. We also plan to use Bedrock models to tag our existing image library and provide automated suggestions for article images.
Why Amazon Bedrock?
Based on our evaluation of several generative AI model providers and our experience implementing the use cases described above, we selected Amazon Bedrock as our primary provider for all our foundation model needs. The key reasons that influenced this decision were:

Choice of models: The market for generative AI is evolving rapidly, and the AWS approach of working with multiple leading model providers ensures that we have access to a large and growing set of foundational models through a single API.
Inference performance: Amazon Bedrock delivers low-latency, high-throughput inference. With on-demand and provisioned throughput, the service can consistently meet all of our capacity needs.
Private model access: We use AWS PrivateLink to establish a private connection to Amazon Bedrock endpoints without traversing the public internet, ensuring that we maintain full control over the data we send for inference.
Integration with AWS services: Amazon Bedrock is tightly integrated with AWS services such as AWS Identity and Access Management (IAM) and the AWS Software Development Kit (AWS SDK). As a result, we were able to quickly integrate Bedrock into our existing architecture without having to adapt any new tools or conventions.

Conclusion and outlook
In this blog post, we described how 20 Minutes is using generative AI on Amazon Bedrock to empower our journalists in the newsroom, reach a broader audience, and make brand safety transparent to our advertisers. With these use cases, we’re using generative AI to bring more value to our journalists today, and we’ve built a foundation for promising new AI use cases in the future.
To learn more about Amazon Bedrock, start with Amazon Bedrock Resources for documentation, blog posts, and more customer success stories.

About the authors
Aurélien Capdecomme is the Chief Technology Officer at 20 Minutes, where he leads the IT development and infrastructure teams. With over 20 years of experience in building efficient and cost-optimized architectures, he has a strong focus on serverless strategy, scalable applications and AI initiatives. He has implemented innovation and digital transformation strategies at 20 Minutes, overseeing the complete migration of digital services to the cloud.
Bertrand d’Aure is a software developer at 20 Minutes. An engineer by training, he designs and implements the backend of 20 Minutes applications, with a focus on the software used by journalists to create their stories. Among other things, he is responsible for adding generative AI features to the software to simplify the authoring process.
Dr. Pascal Vogel is a Solutions Architect at Amazon Web Services. He collaborates with enterprise customers across EMEA to build cloud-native solutions with a focus on serverless and generative AI. As a cloud enthusiast, Pascal loves learning new technologies and connecting with like-minded customers who want to make a difference in their cloud journey.

Efficient and cost-effective multi-tenant LoRA serving with Amazon Sag …

In the rapidly evolving landscape of artificial intelligence (AI), the rise of generative AI models has ushered in a new era of personalized and intelligent experiences. Organizations are increasingly using the power of these language models to drive innovation and enhance their services, from natural language processing to content generation and beyond.
Using generative AI models in the enterprise environment, however, requires taming their intrinsic power and enhancing their skills to address specific customer needs. In cases where an out-of-the-box model is missing knowledge of domain- or organization-specific terminologies, a custom fine-tuned model, also called a domain-specific large language model (LLM), might be an option for performing standard tasks in that domain or micro-domain. BloombergGPT is an example of an LLM that was trained from scratch to have a better understanding of the highly specialized vocabulary found in the financial domain. In the same sense, domain specificity can be addressed through fine-tuning at a smaller scale. Customers are fine-tuning generative AI models based on domains including finance, sales, marketing, travel, IT, HR, procurement, healthcare and life sciences, customer service, and many more. Additionally, independent software vendors (ISVs) are building secure, managed, multi-tenant, end-to-end generative AI platforms with models that are customized and personalized based on their customers’ datasets and domains. For example, Forethought introduced SupportGPT, a generative AI platform for customer support.
As the demands for personalized and specialized AI solutions grow, businesses often find themselves grappling with the challenge of efficiently managing and serving a multitude of fine-tuned models across diverse use cases and customer segments. With the need to serve a wide range of AI-powered use cases, from resume parsing and job skill matching to domain-specific email generation and natural language understanding, these businesses are often left with the daunting task of managing hundreds of fine-tuned models, each tailored to specific customer needs or use cases. The complexities of this challenge are compounded by the inherent scalability and cost-effectiveness concerns that come with deploying and maintaining such a diverse model ecosystem. Traditional approaches to model serving can quickly become unwieldy and resource intensive, leading to increased infrastructure costs, operational overhead, and potential performance bottlenecks.
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage and switching cost for hosting independent instances for different tasks. LoRA (Low-Rank Adaptation) is an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows for quick task switching when deployed as a service by sharing the vast majority of the model parameters.
In this post, we explore a solution that addresses these challenges head-on using LoRA serving with Amazon SageMaker. By using the new performance optimizations of LoRA techniques in SageMaker large model inference (LMI) containers along with inference components, we demonstrate how organizations can efficiently manage and serve their growing portfolio of fine-tuned models, while optimizing costs and providing seamless performance for their customers.
The latest SageMaker LMI container offers unmerged-LoRA inference, sped up with our LMI-Dist inference engine and OpenAI style chat schema. To learn more about LMI, refer to LMI Starting Guide, LMI handlers Inference API Schema, and Chat Completions API Schema.
New LMI features for serving LoRA adapters at scale on SageMaker
There are two ways LoRA adapters can be served by these engines:

Merged LoRA – This applies the adapter by modifying the base model in place. It has zero added latency while running, but has a cost to apply or unapply the merge. It works best when there are only a few adapters and for single-adapter batches, and it doesn’t support multi-adapter batches.
Unmerged LoRA – This alters the model operators to factor in the adapters without changing the base model. It has a higher inference latency for the additional adapter operations. However, it does support multi-adapter batches, and it works best for use cases with a large number of adapters. A minimal sketch contrasting the two variants follows this list.
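The following minimal NumPy sketch illustrates the arithmetic behind the two variants under the standard LoRA formulation; the dimensions are toy values and the code is not tied to any particular serving engine.

import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8                      # hidden size, LoRA rank, scaling (toy values)
W = rng.standard_normal((d, d))             # frozen base weight
A = rng.standard_normal((r, d)) * 0.01      # LoRA down-projection
B = rng.standard_normal((d, r)) * 0.01      # LoRA up-projection
x = rng.standard_normal(d)

# Merged LoRA: fold the adapter into the base weight once; inference then uses W_merged only
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

# Unmerged LoRA: keep W untouched and add the low-rank path at runtime,
# which is what allows batching requests that target different adapters
y_unmerged = W @ x + (alpha / r) * (B @ (A @ x))

assert np.allclose(y_merged, y_unmerged)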

The new LMI container offers out-of-the-box integration and abstraction with SageMaker for hosting multiple unmerged LoRA adapters with higher performance (low latency and high throughput). It does this through the LMI-Dist backend, which builds on vLLM, which in turn uses S-LoRA and Punica. The LMI container offers two backends for serving LoRA adapters: the LMI-Dist backend (recommended) and the vLLM backend. Both backends are based on the open source vLLM library for serving LoRA adapters, but the LMI-Dist backend provides an additional optimized continuous (rolling) batching implementation. You are not required to configure these libraries separately; the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends. We recommend starting with the LMI-Dist backend because of its additional performance optimizations related to continuous (rolling) batching.
S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes unified paging. Unified paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead.
Punica is designed to efficiently serve multiple LoRA models on a shared GPU cluster. It achieves this by following three design guidelines:

Consolidating multi-tenant LoRA serving workloads to a small number of GPUs to increase overall GPU utilization
Enabling batching for different LoRA models to improve performance and GPU utilization
Focusing on the decode stage performance, which is the predominant factor in the cost of model serving

Punica uses a new CUDA kernel design called Segmented Gather Matrix-Vector Multiplication (SGMV) to batch GPU operations for concurrent runs of multiple LoRA models, significantly improving GPU efficiency in terms of memory and computation. Punica also implements a scheduler that routes requests to active GPUs and migrates requests for consolidation, optimizing GPU resource allocation. Overall, Punica achieves high throughput and low latency in serving multi-tenant LoRA models on a shared GPU cluster. For more information, read the Punica whitepaper.
The following figure shows the multi LoRA adapter serving stack of the LMI container on SageMaker.

As shown in the preceding figure, the LMI container provides the higher-level abstraction through the vLLM and LMI-Dist backends to serve LoRA adapters at scale on SageMaker. As a result, you’re not required to configure the underlying libraries (S-LoRA, Punica, or vLLM) separately. However, there might be cases where you want to control some of the performance-driving parameters depending on your use case and application performance requirements. The following are the common configuration options the LMI container provides to tune LoRA serving (a sample environment configuration combining them follows this list). For more details on configuration options specific to each backend, refer to vLLM Engine User Guide and LMI-Dist Engine User Guide.

option.enable_lora: This config enables support for LoRA adapters.
option.max_loras: This config determines the maximum number of LoRA adapters that can be active at once. GPU memory is allocated for that number of adapters.
option.max_lora_rank: This config determines the maximum rank allowed for a LoRA adapter. Set it to the largest rank among your adapters; a larger value supports higher-rank adapters at the cost of more memory.
option.lora_extra_vocab_size: This config determines the maximum additional vocabulary that can be added through a LoRA adapter.
option.max_cpu_loras: This config determines the maximum number of LoRA adapters to cache in CPU memory; adapters beyond this number are evicted to disk.
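As a sample, the snippet below combines these options as OPTION_* environment variables, following the same convention the deployment code later in this post uses (each option.x key maps to an OPTION_X variable). The values are illustrative, not prescriptive, and should be tuned to your adapters and instance type.

# Illustrative LoRA-serving environment for the LMI container; values are examples only
lora_serving_env = {
    "OPTION_ROLLING_BATCH": "lmi-dist",     # recommended backend for LoRA serving
    "OPTION_ENABLE_LORA": "true",           # turn on LoRA adapter support
    "OPTION_MAX_LORAS": "4",                # adapters kept resident in GPU memory at once
    "OPTION_MAX_LORA_RANK": "64",           # must cover the largest rank among your adapters
    "OPTION_LORA_EXTRA_VOCAB_SIZE": "256",  # extra vocabulary slots an adapter may add
    "OPTION_MAX_CPU_LORAS": "8",            # adapters cached in CPU memory before disk eviction
}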

Design patterns for serving fine-tuned LLMs at scale
Enterprises grappling with the complexities of managing generative AI models often encounter scenarios where a robust and flexible design pattern is crucial. One common use case involves a single base model with multiple LoRA adapters, each tailored to specific customer needs or use cases. This approach allows organizations to use a foundational language model while maintaining the agility to fine-tune and deploy customized versions for their diverse customer base.
Single-base model with multiple fine-tuned LoRA adapters
An enterprise offering a resume parsing and job skill matching service may use a single high-performance base model, such as Mistral 7B. The Mistral 7B base model is particularly well-suited for job-related content generation tasks, such as creating personalized job descriptions and tailored email communications. Mistral’s strong performance in natural language generation and its ability to capture industry-specific terminology and writing styles make it a valuable asset for such an enterprise’s customers in the HR and recruitment space. By fine-tuning Mistral 7B with LoRA adapters, enterprises can make sure the generated content aligns with the unique branding, tone, and requirements of each customer, delivering a highly personalized experience.
Multi-base models with multiple fine-tuned LoRA adapters
On the other hand, the same enterprise may use the Llama 3 base model for more general natural language processing tasks, such as resume parsing, skills extraction, and candidate matching. Llama 3’s broad knowledge base and robust language understanding capabilities enable it to handle a wide range of documents and formats, making sure their services can effectively process and analyze candidate information, regardless of the source. By fine-tuning Llama 3 with LoRA adapters, such enterprises can tailor the model’s performance to specific customer requirements, such as regional dialects, industry-specific terminology, or unique data formats. By employing a multi-base model, multi-adapter design pattern, enterprises can take advantage of the unique strengths of each language model to deliver a comprehensive and highly personalized job profile to a candidate resume matching service. This approach allows enterprises to cater to the diverse needs of their customers, making sure each client receives tailored AI-powered solutions that enhance their recruitment and talent management processes.
Effectively implementing and managing these design patterns, where multiple base models are coupled with numerous LoRA adapters, is a key challenge that enterprises must address to unlock the full potential of their generative AI investments. A well-designed and scalable approach to model serving is crucial in delivering cost-effective, high-performance, and personalized experiences to customers.

Solution overview
The following sections outline the coding steps to deploy a base LLM, TheBloke/Llama-2-7B-Chat-fp16, with LoRA adapters on SageMaker. It involves preparing a compressed archive with the base model files and LoRA adapter files, uploading it to Amazon Simple Storage Service (Amazon S3), selecting and configuring the SageMaker LMI container to enable LoRA support, creating a SageMaker endpoint configuration and endpoint, defining an inference component for the model, and sending inference requests specifying different LoRA adapters like Spanish (“es”) and French (“fr”) in the request payload to use those fine-tuned language capabilities. For more information on deploying models using SageMaker inference components, see Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.
To showcase multi-base models with their LoRA adapters, we add another base model, mistralai/Mistral-7B-v0.1, and its LoRA adapter to the same SageMaker endpoint, as shown in the following diagram.

Prerequisites
You need to complete some prerequisites before you can run the notebook:

Have a Hugging Face user access token and authorization to access the mistralai/Mistral-7B-v0.1 model
Have a SageMaker quota for one ml.g5.12xlarge instance for endpoint usage

Upload your LoRA adapters to Amazon S3
To prepare the LoRA adapters, create an adapters.tar.gz compressed archive containing the LoRA adapters directory. The adapters directory should contain a subdirectory for each LoRA adapter, with each adapter subdirectory containing the adapter_model.bin file (the adapter weights) and the adapter_config.json file (the adapter configuration). These adapter files are typically produced by the PeftModel.save_pretrained() method from the PEFT library (see the sketch after the directory layout). After you assemble the adapters directory with the adapter files, compress it into an adapters.tar.gz archive and upload it to an S3 bucket for deployment or sharing. We include the LoRA adapters in the adapters directory as follows:

|- model_dir
  |- adapters/
    |- <adapter_1>/
    |- <adapter_2>/
    |- ...
    |- <adapter_n>/
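If you fine-tune your own adapters, the following is a hypothetical sketch of how PeftModel.save_pretrained() produces the files this layout expects; the base model ID, LoRA rank, target modules, and adapter name are placeholders for your own setup.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder fine-tuning setup; swap in your own base model and LoRA configuration
base = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-Chat-fp16")
peft_model = get_peft_model(base, LoraConfig(r=16, target_modules=["q_proj", "v_proj"]))

# ... fine-tune peft_model on your data ...

# Writes adapter_config.json plus the adapter weights (adapter_model.bin or
# adapter_model.safetensors, depending on your PEFT version) into the adapter directory
peft_model.save_pretrained("llama-lora-multi-adapter/adapters/my_adapter")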

Download LoRA adapters, compress them, and upload the compressed file to Amazon S3:

from huggingface_hub import snapshot_download

# Download the Spanish, French, and Russian adapters into the adapters/ directory
snapshot_download("UnderstandLing/llama-2-7b-chat-es", local_dir="llama-lora-multi-adapter/adapters/es", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-fr", local_dir="llama-lora-multi-adapter/adapters/fr", local_dir_use_symlinks=False)
snapshot_download("UnderstandLing/llama-2-7b-chat-ru", local_dir="llama-lora-multi-adapter/adapters/ru", local_dir_use_symlinks=False)

# Compress the adapters and upload the archive to Amazon S3
!tar czvf adapters.tar.gz -C llama-lora-multi-adapter .
s3_code_artifact_accelerate = sess.upload_data("adapters.tar.gz", model_bucket, s3_code_prefix)

Select an LMI container and configure it to enable LoRA
SageMaker provides optimized containers for LMI that support different frameworks for model parallelism, allowing the deployment of LLMs across multiple GPUs. For this post, we employ the DeepSpeed container, which encompasses frameworks such as DeepSpeed and vLLM, among others. See the following code:

from sagemaker import image_uris

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)
# This LMI image is used as the container image when creating the SageMaker models below
inference_image_uri = deepspeed_image_uri

env_generation = {
    "OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
    "OPTION_DTYPE": "fp16",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
    "OPTION_MAX_LORA_RANK": "64",
    "OPTION_MAX_CPU_LORAS": "4"
}

Create a SageMaker endpoint configuration
Create an endpoint configuration using the appropriate instance type. Set ContainerStartupHealthCheckTimeoutInSeconds to account for the time taken to download the LLM weights from Amazon S3 or the model hub, and the time taken to load the model on the GPUs:

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {
                "RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
            },
        },
    ],
)

Create a SageMaker endpoint
Create a SageMaker endpoint based on the endpoint configuration defined in the previous step. You use this endpoint to host the inference components (models) and to make invocations.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
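Endpoint creation is asynchronous. As a small optional addition (not part of the original walkthrough), you can block until the endpoint is in service by using the standard boto3 SageMaker waiter:

# Wait until the endpoint reaches the InService state before creating inference components
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])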

Create a SageMaker inference component (model)
Now that you have created a SageMaker endpoint, let’s create our model as an inference component. The SageMaker inference component enables you to deploy one or more foundation models (FMs) on the same SageMaker endpoint and control how many accelerators and how much memory is reserved for each FM. See the following code:

model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_generation,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

prefix = sagemaker.utils.unique_name_from_base("lmi-llama2-7b")
inference_component_name = f"{prefix}-inference-component"

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        # Alternatively, the container image and model artifacts can be specified here
        # directly instead of referencing a SageMaker model:
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 1200,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Make inference requests using different LoRA adapters
With the endpoint and inference component ready, you can now send requests that exercise the LoRA adapters for Spanish, French, and Russian. The specific LoRA adapter is specified in the request payload under the "adapters" field; we use "es" for the Spanish adapter, "fr" for the French adapter, and "ru" for the Russian adapter, as shown in the following code:

# Testing Spanish (es) adapter
# Prompt: "Think of a creative excuse to say that I don't need to go to the party."
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Piensa en una excusa creativa para decir que no necesito ir a la fiesta."],
                     "adapters": ["es"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing French (fr) adapter
# Prompt: "Think of a creative excuse to say that I don't need to go to the party."
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

# Testing Russian (ru) adapter
# Prompt: "Come up with a creative ..." (truncated in the original notebook)
# `params` is assumed to be a dict of generation parameters (for example, max_new_tokens)
# defined earlier in the notebook.
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Придумайте креативное "],
                     "parameters": params,
                     "adapters": ["ru"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Add another base model, inference component, and LoRA adapter
Let’s add another base model and its LoRA adapter to the same SageMaker endpoint for multi-base models with multiple fine-tuned LoRA adapters. The code is very similar to the previous code for creating the Llama base model and its LoRA adapter.
Configure the SageMaker LMI container to host the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)
inference_image_uri = deepspeed_image_uri

my_hf_token = "<YOUR_HuggingFacePersonalAccessToken_HERE>"

# Environment for the Mistral base model (named env_mistral here so it doesn't
# overwrite the Llama configuration defined earlier)
env_mistral = {
    "HF_TOKEN": my_hf_token,
    "OPTION_MODEL_ID": "mistralai/Mistral-7B-v0.1",
    "OPTION_TRUST_REMOTE_CODE": "true",
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
    "OPTION_MAX_LORA_RANK": "64",
    "OPTION_MAX_CPU_LORAS": "4"
}

Create a new SageMaker model and inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

model_name2 = sagemaker.utils.name_from_base("lmi-mistral-7b")

create_model_response = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env_mistral,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)

# Name the second inference component analogously to the first one
prefix2 = sagemaker.utils.unique_name_from_base("lmi-mistral-7b")
inference_component_name2 = f"{prefix2}-inference-component"

sm_client.create_inference_component(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name2,
        # Alternatively, the container image and model artifacts can be specified here
        # directly instead of referencing a SageMaker model:
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 2,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Invoke the same SageMaker endpoint for the newly created inference component for the base model (mistralai/Mistral-7B-v0.1) and its LoRA adapter (mistral-lora-multi-adapter/adapters/fr):

# Testing French (fr) adapter with the Mistral base model
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Clean up
Delete the SageMaker inference components, models, endpoint configuration, and endpoint to avoid incurring unnecessary costs:

sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
sm_client.delete_inference_component(InferenceComponentName=inference_component_name2)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
sm_client.delete_model(ModelName=model_name2)

Conclusion
The ability to efficiently manage and serve a diverse portfolio of fine-tuned generative AI models is paramount if you want your organization to deliver personalized and intelligent experiences at scale in today’s rapidly evolving AI landscape. With the inference capabilities of SageMaker LMI coupled with the performance optimizations of LoRA techniques, you can overcome the challenges of multi-tenant fine-tuned LLM serving. This solution enables you to consolidate AI workloads, batch operations across multiple models, and optimize resource utilization for cost-effective, high-performance delivery of tailored AI solutions to your customers. As demand for specialized AI experiences continues to grow, we’ve shown how the scalable infrastructure and cutting-edge model serving techniques of SageMaker position AWS as a powerful platform for unlocking generative AI’s full potential. To start exploring the benefits of this solution for yourself, we encourage you to use the code example and resources we’ve provided in this post.

About the authors
Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains, and helps customers achieve high-performance model inference on SageMaker.
Vivek Gangasani is an AI/ML Startup Solutions Architect for Generative AI startups at AWS. He helps emerging GenAI startups build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and high-performance logging systems. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising under very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Abacus AI Releases Smaug-Llama-3-70B-Instruct: The New Benchmark in Open-Source Conversational AI Rivaling GPT-4 Turbo

Artificial intelligence (AI) has revolutionized various fields by introducing advanced models for natural language processing (NLP). NLP enables computers to understand, interpret, and respond to human language in a valuable way. This field encompasses text generation, translation, and sentiment analysis applications, significantly impacting industries like healthcare, finance, and customer service. The evolution of NLP models has driven these advancements, continually pushing the boundaries of what AI can achieve in understanding and generating human language.

Despite these advancements, developing models that can effectively handle complex multi-turn conversations remains a persistent challenge. Existing models often fail to maintain context and coherence over long interactions, leading to suboptimal performance in real-world applications. Maintaining a coherent conversation over multiple turns is crucial for applications like customer service bots, virtual assistants, and interactive learning platforms. 

Current methods for improving AI conversation models include fine-tuning diverse datasets and integrating reinforcement learning techniques. Popular models like GPT-4-Turbo and Claude-3-Opus have set benchmarks in performance, yet they still need to improve in handling intricate dialogues and maintaining consistency. These models often rely on large-scale datasets and complex algorithms to enhance their conversational abilities. However, maintaining context over long conversations remains a significant hurdle despite these efforts. While impressive, the performance of these models indicates the potential for further improvement in handling dynamic and contextually rich interactions.

Researchers from Abacus.AI have introduced Smaug-Llama-3-70B-Instruct, claimed to be one of the best open-source models and a rival to GPT-4 Turbo. This new model aims to enhance performance in multi-turn conversations by leveraging a novel training recipe. Abacus.AI’s approach focuses on improving the model’s ability to understand and generate contextually relevant responses, surpassing previous models in the same category. Smaug-Llama-3-70B-Instruct builds on the Meta-Llama-3-70B-Instruct foundation, incorporating advancements that enable it to outperform its predecessors.

The Smaug-Llama-3-70B-Instruct model uses advanced techniques and new datasets to achieve superior performance. Researchers employed a specific training protocol emphasizing real-world conversational data, ensuring the model can handle diverse and complex interactions. The model integrates with popular frameworks like transformers and can be deployed for various text-generation tasks, allowing it to produce accurate and contextually appropriate responses (a minimal usage sketch follows). Transformers enable efficient processing of large datasets, contributing to the model’s ability to generate detailed and nuanced conversational responses.
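As an illustration of that transformers integration, here is a minimal sketch; the repository ID abacusai/Smaug-Llama-3-70B-Instruct is our assumption of the published checkpoint, and a 70B-parameter model requires several high-memory GPUs.

import torch
from transformers import pipeline

# Assumed Hugging Face repo ID; bfloat16 and device_map="auto" spread the 70B weights across GPUs
generator = pipeline(
    "text-generation",
    model="abacusai/Smaug-Llama-3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the plot of Dune in two sentences."}]
print(generator(messages, max_new_tokens=128)[0]["generated_text"])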


The performance of the Smaug-Llama-3-70B-Instruct model is demonstrated through benchmarks such as MT-Bench and Arena Hard. On MT-Bench, the model scored 9.4 in the first turn, 9.0 in the second turn, and an average of 9.2, outperforming Llama-3 70B and GPT-4 Turbo, which scored 9.2 and 9.18, respectively. These scores indicate the model’s robustness in maintaining context and delivering coherent responses over extended dialogues. The MT-Bench results, correlated with human evaluations, highlight Smaug’s ability to handle simple prompts effectively.

However, real-world tasks require complex reasoning and planning, which MT-Bench does not fully address. Arena Hard, a new benchmark measuring an LLM’s ability to solve complex tasks, showed significant gains for Smaug over Llama-3, with Smaug scoring 56.7 compared to Llama-3’s 41.1. This improvement underscores the model’s capability to tackle more sophisticated and agentic tasks, reflecting its advanced understanding and processing of multi-turn interactions.

In conclusion, Smaug-Llama-3-70B-Instruct by Abacus.AI addresses the challenges of maintaining context and coherence. The research team has developed a tool that improves performance and sets a new standard for future developments in the field. The detailed evaluation metrics and superior performance scores highlight the model’s potential to transform applications requiring advanced conversational AI. This new model represents a promising advancement, paving the way for more sophisticated and reliable AI-driven communication tools.

MARKLLM: An Open-Source Toolkit for LLM Watermarking

LLM watermarking embeds subtle, detectable signals in AI-generated text to identify its origin, addressing misuse concerns like impersonation, ghostwriting, and fake news. Despite its promise to distinguish humans from AI text and prevent misinformation, the field faces challenges. The numerous and complex watermarking algorithms, alongside varied evaluation methods, make it difficult for researchers and the public to experiment with and understand these technologies. Consensus and support are crucial for advancing LLM watermarking to ensure reliable identification of AI-generated content and maintain the integrity of digital communication.

Researchers from Tsinghua University, Shanghai Jiao Tong University, The University of Sydney, UC Santa Barbara, CUHK, and HKUST have developed MARKLLM, an open-source toolkit for LLM watermarking. MARKLLM provides a unified, extensible framework for implementing watermarking algorithms, supporting nine specific methods from two major algorithm families. It offers user-friendly interfaces for algorithm loading, text watermarking, detection, and data visualization. The toolkit includes 12 evaluation tools and two automated pipelines for assessing detectability, robustness, and impact on text quality. MARKLLM’s modular design enhances scalability and flexibility, making it a valuable resource for researchers and the general public to advance LLM watermarking technology.

LLM watermarking algorithms fall into two main categories: the KGW Family and the Christ Family. The KGW method modifies LLM logits to prefer certain tokens, creating watermarked text identified by a statistical threshold. Variations of this method improve performance, reduce text quality impact, increase watermark capacity, resist removal attacks, and enable public detection. The Christ Family uses pseudo-random sequences to guide token sampling, with methods like EXP-sampling correlating text with these sequences for detection. Evaluating watermarking algorithms involves assessing detectability, robustness against tampering, and impact on text quality using metrics like perplexity and diversity.
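To make the KGW idea concrete, the following is a toy illustration of ours (not MARKLLM's implementation): the preceding token seeds a pseudo-random split of the vocabulary into a "green" list, green-token logits receive a small positive bias during generation, and detection counts green tokens and tests the count against what unwatermarked text would produce.

import math
import random

VOCAB_SIZE, GREEN_FRACTION, DELTA = 50_000, 0.5, 2.0

def green_list(prev_token: int) -> set[int]:
    # Pseudo-randomly partition the vocabulary, seeded on the preceding token
    rng = random.Random(prev_token)
    return set(rng.sample(range(VOCAB_SIZE), int(GREEN_FRACTION * VOCAB_SIZE)))

def bias_logits(logits: list[float], prev_token: int) -> list[float]:
    # Add a small bias to green-list logits so watermarked generation prefers them
    greens = green_list(prev_token)
    return [l + DELTA if i in greens else l for i, l in enumerate(logits)]

def detect(tokens: list[int]) -> float:
    # Count green tokens and return a z-score; large values suggest watermarked text
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std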

MARKLLM provides a unified framework to address issues in LLM watermarking algorithms, including lack of standardization, uniformity, and code quality. It allows easy invocation and switching between algorithms, offering a well-designed class structure. MARKLLM features a KGW and Christ family algorithms visualization module, highlighting token preferences and correlations. It includes 12 evaluation tools and two automated pipelines for assessing watermark detectability, robustness, and text quality impact. The toolkit supports flexible configurations, facilitating thorough and automated evaluations of watermarking algorithms across various metrics and attack scenarios.

Using MARKLLM, nine watermarking algorithms were evaluated for detectability, robustness, and impact on text quality. The C4 dataset was used for general text generation, WMT16 for machine translation, and HumanEval for code generation. OPT-1.3b and Starcoder served as language models. Dynamic threshold adjustment and various text tampering attacks were used for assessments, with metrics including PPL, log diversity, BLEU, pass@1, and GPT-4 Judge. Results showed high detection accuracy, algorithm-specific strengths, and varying results based on metrics and attacks. MARKLLM’s user-friendly design facilitates comprehensive evaluations, offering valuable insights for future research.

In conclusion, MARKLLM is an open-source toolkit designed for LLM watermarking, offering flexible configurations for various algorithms, text watermarking, detection, and visualization. It includes convenient evaluation tools and customizable pipelines for thorough assessments from multiple perspectives. While it supports a subset of methods, excluding recent approaches embedding watermarks in model parameters, future contributions are expected to expand its capabilities. The visualization solutions provided are useful but could benefit from more diversity. Additionally, while it covers key evaluation aspects, some scenarios, like retranslation and CWRA attacks, still need to be fully addressed. Developers and researchers are encouraged to contribute to MARKLLM’s robustness and versatility.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.



FinTextQA: A Long-Form Question Answering (LFQA) Dataset Specifically Designed for the Financial Domain

The expansion of question-answering (QA) systems driven by artificial intelligence (AI) results from the increasing demand for financial data analysis and management. In addition to bettering customer service, these technologies aid in risk management and provide individualized stock suggestions. Accurate and useful replies to financial data necessitate a thorough understanding of the financial domain because of the data’s complexity, domain-specific terminology and concepts, market uncertainty, and decision-making processes. Due to the complex tasks involved, such as information retrieval, summarization, analysis of data, comprehension, and reasoning, long-form question answering (LFQA) scenarios have added significance in this setting.

While there are several LFQA datasets available in the public domain, such as ELI5, WikiHowQA, and WebCPM, none of them are tailored to the financial sector. This gap in the market is significant, as complex, open-domain questions often require extensive paragraph-length replies and relevant document retrievals. Current financial QA standards, which heavily rely on numerical calculation and sentiment analysis, often struggle to handle the diversity and complexity of these questions.

In light of these difficulties, researchers from HSBC Lab, Hong Kong University of Science and Technology (Guangzhou), and Harvard University present FinTextQA, a new dataset for testing QA models on questions about general finance, regulation, and policy. This dataset is composed of LFQAs taken from textbooks in the field as well as government agencies’ websites. The 1,262 question-answer pairs and document contexts that make up FinTextQA are of high quality, with sources attributed. Selected through five rounds of human screening, it covers six question categories with an average text length of 19.7k words. By incorporating financial rules and regulations into LFQA, this dataset challenges models with more complex content and represents ground-breaking work in the field.

The team introduced the dataset and benchmarked state-of-the-art (SOTA) models using FinTextQA to set standards for future studies. Many existing LFQA systems depend on fine-tuned pre-trained language models such as GPT-3.5-turbo, LLaMA2, and Baichuan2. However, these models aren’t always capable of answering complex financial inquiries or providing thorough answers, so the team adopts a retrieval-augmented generation (RAG) framework in response. The RAG system can improve LLMs’ performance and explanation capabilities by pre-processing documents in multiple steps and supplying them with the most relevant information.

The researchers highlight that FinTextQA has fewer QA pairs despite its professional curation and high quality in contrast to bigger AI-generated datasets. Because of this restriction, models trained on it may not be able to be extended to more general real-world scenarios. Acquiring high-quality data is difficult, and copyright constraints frequently hinder sharing it. Consequently, cutting-edge approaches to data scarcity and data augmentation should be the focus of future studies. It may also be useful to investigate more sophisticated RAG capabilities and retrieval methods and broaden the dataset to include more diverse sources.

Nevertheless, the team believes that this work presents a significant step forward in improving financial concept understanding and support by introducing the first LFQA financial dataset and performing extensive benchmark trials on it. FinTextQA provides a robust and thorough framework for developing and testing LFQA systems in general finance. In addition to demonstrating the effectiveness of different model configurations, the experimental research stresses the importance of improving existing approaches to make financial question-answering systems more accurate and easier to understand.  

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.

