Unlocking AI Transparency: How Anthropic’s Feature Grouping Enhances …

In a recent paper, “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning,” researchers have addressed the challenge of understanding complex neural networks, specifically language models, which are increasingly being used in various applications. The problem they sought to tackle was the lack of interpretability at the level of individual neurons within these models, which makes it challenging to comprehend their behavior fully.

The existing methods and frameworks for interpreting neural networks were discussed, highlighting the limitations associated with analyzing individual neurons due to their polysemantic nature. Neurons often respond to mixtures of seemingly unrelated inputs, making it difficult to reason about the overall network’s behavior by focusing on individual components.

The research team proposed a novel approach to address this issue. They introduced a framework that leverages sparse autoencoders, a weak dictionary learning algorithm, to generate interpretable features from trained neural network models. This framework aims to identify more monosemantic units within the network, which are easier to understand and analyze than individual neurons.

The paper provides an in-depth explanation of the proposed method, detailing how sparse autoencoders are applied to decompose a one-layer transformer model with a 512-neuron MLP layer into interpretable features. The researchers conducted extensive analyses and experiments, training the model on a vast dataset to validate the effectiveness of their approach.
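
For intuition, here is a minimal PyTorch sketch of this kind of sparse autoencoder: a single hidden layer trained to reconstruct MLP activations under an L1 sparsity penalty. The dimensions, hyperparameters, and training loop are illustrative, not the paper's actual setup.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Decomposes MLP activations (d_mlp) into an overcomplete set of features (d_feat > d_mlp).
    def __init__(self, d_mlp=512, d_feat=4096):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, d_feat)
        self.decoder = nn.Linear(d_feat, d_mlp)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction from the learned dictionary
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    recon = ((x - x_hat) ** 2).mean()     # reconstruction error
    sparsity = f.abs().mean()             # L1 penalty encourages few active features
    return recon + l1_coeff * sparsity

# Training-step sketch: in practice, `mlp_activations` would be activations collected
# from the transformer's 512-neuron MLP layer over a large text corpus.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
mlp_activations = torch.randn(1024, 512)  # placeholder batch
x_hat, f = sae(mlp_activations)
loss = loss_fn(mlp_activations, x_hat, f)
loss.backward()
opt.step()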

The results of their work were presented in several sections of the paper:

1. Problem Setup: The paper outlined the motivation for the research and described the neural network models and sparse autoencoders used in their study.

2. Detailed Investigations of Individual Features: The researchers offered evidence that the features they identified were functionally specific causal units distinct from neurons. This section served as an existence proof for their approach.

3. Global Analysis: The paper argued that the typical features were interpretable and explained a significant portion of the MLP layer, thus demonstrating the practical utility of their method.

4. Phenomenology: This section described various properties of the features, such as feature splitting, universality, and how they can form complex systems resembling “finite state automata.”

The researchers also provided comprehensive visualizations of the features, enhancing the understandability of their findings.

In conclusion, the paper revealed that sparse autoencoders can successfully extract interpretable features from neural network models, making them more comprehensible than individual neurons. This breakthrough can enable the monitoring and steering of model behavior, enhancing safety and reliability, particularly in the context of large language models. The research team expressed their intention to further scale this approach to more complex models, emphasizing that the primary obstacle to interpreting such models is now more of an engineering challenge than a scientific one.

Check out the Research Article and Project Page. All Credit For This Research Goes To the Researchers on This Project.

The post Unlocking AI Transparency: How Anthropic’s Feature Grouping Enhances Neural Network Interpretability appeared first on MarkTechPost.

How Veriff decreased deployment time by 80% using Amazon SageMaker mul …

Veriff is an identity verification platform partner for innovative growth-driven organizations, including pioneers in financial services, FinTech, crypto, gaming, mobility, and online marketplaces. They provide advanced technology that combines AI-powered automation with human feedback, deep insights, and expertise.

Veriff delivers a proven infrastructure that enables their customers to have trust in the identities and personal attributes of their users across all the relevant moments in their customer journey. Veriff is trusted by customers such as Bolt, Deel, Monese, Starship, Super Awesome, Trustpilot, and Wise.
As an AI-powered solution, Veriff needs to create and run dozens of machine learning (ML) models in a cost-effective way. These models range from lightweight tree-based models to deep learning computer vision models, which need to run on GPUs to achieve low latency and improve the user experience. Veriff is also currently adding more products to its offering, targeting a hyper-personalized solution for its customers. Serving different models for different customers adds to the need for a scalable model serving solution.
In this post, we show you how Veriff standardized their model deployment workflow using Amazon SageMaker, reducing costs and development time.
Infrastructure and development challenges
Veriff’s backend architecture is based on a microservices pattern, with services running on different Kubernetes clusters hosted on AWS infrastructure. This approach was initially used for all company services, including microservices that run expensive computer vision ML models.
Some of these models required deployment on GPU instances. Conscious of the comparatively higher cost of GPU-backed instance types, Veriff developed a custom solution on Kubernetes to share a given GPU’s resources between different service replicas. A single GPU typically has enough VRAM to hold multiple of Veriff’s computer vision models in memory.
Although the solution did alleviate GPU costs, it also came with the constraint that data scientists needed to indicate beforehand how much GPU memory their model would require. Furthermore, DevOps engineers were burdened with manually provisioning GPU instances in response to demand patterns. This caused operational overhead and overprovisioning of instances, which resulted in a suboptimal cost profile.
Apart from GPU provisioning, this setup also required data scientists to build a REST API wrapper for each model, which was needed to provide a generic interface for other company services to consume, and to encapsulate preprocessing and postprocessing of model data. These APIs required production-grade code, which made it challenging for data scientists to productionize models.
Veriff’s data science platform team looked for alternatives to this approach. The main objective was to support the company’s data scientists with a better transition from research to production by providing simpler deployment pipelines. The secondary objective was to reduce the operational costs of provisioning GPU instances.
Solution overview
Veriff required a new solution that solved two problems:

Allow building REST API wrappers around ML models with ease
Allow managing provisioned GPU instance capacity optimally and, if possible, automatically

Ultimately, the ML platform team converged on the decision to use SageMaker multi-model endpoints (MMEs). This decision was driven by MMEs’ support for NVIDIA’s Triton Inference Server (an ML-focused server that makes it easy to wrap models as REST APIs; Veriff was also already experimenting with Triton), as well as their capability to natively manage the auto scaling of GPU instances via simple auto scaling policies.
Two MMEs were created at Veriff, one for staging and one for production. This approach allows them to run testing steps in a staging environment without affecting the production models.
SageMaker MMEs
SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
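
For reference, the snippet below is a hedged sketch of how a client might invoke an MME with boto3; the endpoint name, artifact name, and payload are placeholders, and the request body must follow whatever input format the hosted model expects (a Triton-style JSON request is shown here).

import json
import boto3

smr = boto3.client("sagemaker-runtime")

# One endpoint serves many artifacts from a shared S3 prefix; TargetModel selects which
# <model_name>_<model_version>.tar.gz SageMaker should load (if not already cached) and invoke.
payload = {"inputs": [{"name": "image", "shape": [1, 3, 224, 224], "datatype": "FP32",
                       "data": [0.0] * (3 * 224 * 224)}]}   # placeholder request body
response = smr.invoke_endpoint(
    EndpointName="veriff-staging-mme",                      # placeholder endpoint name
    TargetModel="screen_detection_pipeline_1.tar.gz",       # placeholder artifact name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read())
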
Custom Triton ensemble models
There were several reasons why Veriff decided to use Triton Inference Server, the main ones being:

It allows data scientists to build REST APIs from models by arranging model artifact files in a standard directory format (no code solution)
It’s compatible with all major AI frameworks (PyTorch, TensorFlow, XGBoost, and more)
It provides ML-specific low-level and server optimizations such as dynamic batching of requests

Using Triton allows data scientists to deploy models with ease because they only need to build formatted model repositories instead of writing code to build REST APIs (Triton also supports Python models if custom inference logic is required). This decreases model deployment time and gives data scientists more time to focus on building models instead of deploying them.
Another important feature of Triton is that it allows you to build model ensembles, which are groups of models that are chained together. These ensembles can be run as if they were a single Triton model. Veriff currently employs this feature to deploy preprocessing and postprocessing logic with each ML model using Python models (as mentioned earlier), ensuring that there are no mismatches in the input data or model output when models are used in production.
The following is what a typical Triton model repository looks like for this workload:
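
The original post shows the repository as a screenshot; since it isn’t reproduced here, the illustrative layout below (directory names other than those mentioned in the description that follows are hypothetical) shows how such a Triton model repository could be organized:

screen_detection_pipeline/          # ensemble definition; config.pbtxt maps inputs/outputs between steps
    config.pbtxt
    1/
screen_detection_preprocessing/     # hypothetical Python model directory holding model.py (pre/postprocessing)
    config.pbtxt                    # points to python_env.tar.gz via the EXECUTION_ENV_PATH directive
    python_env.tar.gz               # conda-packed environment with the Python model's dependencies
    1/
        model.py
screen_detection_inferencer/        # trained model weights, under model version 1 (ONNX in this example)
    config.pbtxt
    1/
        model.onnx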

The model.py file contains preprocessing and postprocessing code. The trained model weights are in the screen_detection_inferencer directory, under model version 1 (the model is in ONNX format in this example, but it could also be in TensorFlow, PyTorch, or other formats). The ensemble model definition is in the screen_detection_pipeline directory, where inputs and outputs between steps are mapped in a configuration file.
Additional dependencies needed to run the Python models are detailed in a requirements.txt file, and need to be conda-packed to build a Conda environment (python_env.tar.gz). For more information, refer to Managing Python Runtime and Libraries. Also, config files for Python steps need to point to python_env.tar.gz using the EXECUTION_ENV_PATH directive.
The model folder then needs to be TAR compressed and renamed using model_version.txt. Finally, the resulting <model_name>_<model_version>.tar.gz file is copied to the Amazon Simple Storage Service (Amazon S3) bucket connected to the MME, allowing SageMaker to detect and serve the model.
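
As a rough illustration of this packaging step (bucket, prefix, and model names are placeholders, not values from the post), the archive can be built and uploaded with standard Python tooling:

import tarfile
import boto3

model_name, model_version = "screen_detection_pipeline", "1"   # placeholders
archive = f"{model_name}_{model_version}.tar.gz"

# Compress the Triton model repository folder into the artifact format SageMaker expects.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(model_name, arcname=model_name)

# Copy the artifact to the S3 prefix backing the multi-model endpoint.
boto3.client("s3").upload_file(archive, "veriff-mme-staging-bucket", f"models/{archive}")
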
Model versioning and continuous deployment
As the previous section made apparent, building a Triton model repository is straightforward. However, running all the necessary steps to deploy it is tedious and error prone, if run manually. To overcome this, Veriff built a monorepo containing all models to be deployed to MMEs, where data scientists collaborate in a Gitflow-like approach. This monorepo has the following features:

It’s managed using Pants.
Code quality tools such as Black and MyPy are applied using Pants.
Unit tests are defined for each model, which check that the model output is the expected output for a given model input.
Model weights are stored alongside model repositories. These weights can be large binary files, so DVC is used to sync them with Git in a versioned manner.

This monorepo is integrated with a continuous integration (CI) tool. For every new push to the repo or new model, the following steps are run:

Pass the code quality check.
Download the model weights.
Build the Conda environment.
Spin up a Triton server using the Conda environment and use it to process requests defined in unit tests.
Build the final model TAR file (<model_name>_<model_version>.tar.gz).

These steps make sure that models have the quality required for deployment, so for every push to a repo branch, the resulting TAR file is copied (in another CI step) to the staging S3 bucket. When pushes are done in the main branch, the model file is copied to the production S3 bucket. The following diagram depicts this CI/CD system.

Cost and deployment speed benefits
Using MMEs allows Veriff to use a monorepo approach to deploy models to production. In summary, Veriff’s new model deployment workflow consists of the following steps:

Create a branch in the monorepo with the new model or model version.
Define and run unit tests in a development machine.
Push the branch when the model is ready to be tested in the staging environment.
Merge the branch into main when the model is ready to be used in production.

With this new solution in place, deploying a model at Veriff is a straightforward part of the development process. New model development time has decreased from 10 days to an average of 2 days.
The managed infrastructure provisioning and auto scaling features of SageMaker brought Veriff added benefits. They used the InvocationsPerInstance CloudWatch metric to scale according to traffic patterns, saving on costs without sacrificing reliability. To define the threshold value for the metric, they performed load testing on the staging endpoint to find the best trade-off between latency and cost.
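
A sketch of such a target-tracking policy using the Application Auto Scaling API is shown below; the endpoint name, capacity limits, and target value are placeholders standing in for the numbers Veriff derived from load testing.

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/veriff-production-mme/variant/AllTraffic"   # placeholder endpoint/variant

# Register the endpoint variant's instance count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance, the metric mentioned above.
aas.put_scaling_policy(
    PolicyName="invocations-per-instance-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,   # placeholder threshold chosen from load-test results
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)
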
After deploying seven production models to MMEs and analyzing spend, Veriff reported a 75% cost reduction in GPU model serving as compared to the original Kubernetes-based solution. Operational costs were reduced as well, because the burden of provisioning instances manually was lifted from the company’s DevOps engineers.
Conclusion
In this post, we reviewed why Veriff chose SageMaker MMEs over self-managed model deployment on Kubernetes. SageMaker takes on the undifferentiated heavy lifting, allowing Veriff to decrease model development time, increase engineering efficiency, and dramatically lower the cost for real-time inference while maintaining the performance needed for their business-critical operations. Finally, we showcased Veriff’s simple yet effective model deployment CI/CD pipeline and model versioning mechanism, which can be used as a reference implementation of combining software development best practices and SageMaker MMEs. You can find code samples on hosting multiple models using SageMaker MMEs on GitHub.

About the Authors
Ricard Borràs holds a senior machine learning role at Veriff, where he leads MLOps efforts in the company. He helps data scientists build faster and better AI/ML products by building a Data Science Platform at the company and combining several open source solutions with AWS services.
João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with large-scale deep learning model training and inference optimization, and more broadly with building large-scale ML platforms on AWS.
Miguel Ferreira works as a Sr. Solutions Architect at AWS based in Helsinki, Finland. AI/ML has been a lifelong interest and he has helped multiple customers integrate Amazon SageMaker into their ML workflows.

Researchers from Yale and Google Introduce HyperAttention: An Approxim …

The rapid advancement of large language models has paved the way for breakthroughs in natural language processing, enabling applications ranging from chatbots to machine translation. However, these models often struggle to process long sequences efficiently, which is essential for many real-world tasks. As the length of the input sequence grows, the attention mechanisms in these models become increasingly computationally expensive. Researchers have been exploring ways to address this challenge and make large language models more practical for various applications.

A research team recently introduced a groundbreaking solution called “HyperAttention.” This innovative algorithm aims to efficiently approximate attention mechanisms in large language models, particularly when dealing with long sequences. It simplifies existing algorithms and leverages various techniques to identify dominant entries in attention matrices, ultimately accelerating computations.

HyperAttention’s approach to solving the efficiency problem in large language models involves several key elements. Let’s dive into the details:

Spectral Guarantees: HyperAttention focuses on achieving spectral guarantees to ensure the reliability of its approximations. By using parameterizations based on the condition number, it reduces the need for certain assumptions typically made in this domain.

SortLSH for Identifying Dominant Entries: HyperAttention uses the Hamming sorted Locality-Sensitive Hashing (LSH) technique to enhance efficiency. This method allows the algorithm to identify the most significant entries in attention matrices, aligning them with the diagonal for more efficient processing.
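
As a rough intuition for this bucketing-and-sorting idea (not HyperAttention’s actual algorithm, which also samples residual entries and comes with spectral guarantees), the toy sketch below hashes queries and keys with random hyperplanes, sorts tokens by bucket so similar pairs move toward the diagonal, and attends only within local blocks.

import torch

def sortlsh_block_attention(q, k, v, n_hashes=8, block=64):
    # q, k, v: (n_tokens, dim) tensors for a single attention head.
    n, d = q.shape
    planes = torch.randn(d, n_hashes)
    codes_q = ((q @ planes) > 0).int()          # Hamming-style hash codes for queries
    codes_k = ((k @ planes) > 0).int()          # and for keys
    weights = 2 ** torch.arange(n_hashes)
    bucket_q = (codes_q * weights).sum(-1)      # interpret bit pattern as a bucket id
    bucket_k = (codes_k * weights).sum(-1)
    perm_q, perm_k = bucket_q.argsort(), bucket_k.argsort()
    q, k, v = q[perm_q], k[perm_k], v[perm_k]   # sorting aligns dominant entries near the diagonal
    out = torch.zeros_like(q)
    for s in range(0, n, block):                # attend only within local blocks (the approximation)
        e = min(s + block, n)
        attn = torch.softmax(q[s:e] @ k[s:e].T / d ** 0.5, dim=-1)
        out[s:e] = attn @ v[s:e]
    return out[perm_q.argsort()]                # undo the query permutation

out = sortlsh_block_attention(torch.randn(512, 64), torch.randn(512, 64), torch.randn(512, 64))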

Efficient Sampling Techniques: HyperAttention efficiently approximates diagonal entries in the attention matrix and optimizes the matrix product with the values matrix. This step ensures that large language models can process long sequences without significantly dropping performance.

Versatility and Flexibility: HyperAttention is designed to offer flexibility in handling different use cases. As demonstrated in the paper, it can be effectively applied when using a predefined mask or generating a mask using the sortLSH algorithm.

The performance of HyperAttention is impressive. It allows for substantial speedups in both inference and training, making it a valuable tool for large language models. By simplifying complex attention computations, it addresses the problem of long-range sequence processing, enhancing the practical usability of these models.

In conclusion, the research team behind HyperAttention has made significant progress in tackling the challenge of efficient long-range sequence processing in large language models. Their algorithm simplifies the complex computations involved in attention mechanisms and offers spectral guarantees for its approximations. By leveraging techniques like Hamming sorted LSH, HyperAttention identifies dominant entries and optimizes matrix products, leading to substantial speedups in inference and training.

This breakthrough is a promising development for natural language processing, where large language models play a central role. It opens up new possibilities for scaling self-attention mechanisms and makes these models more practical for various applications. As the demand for efficient and scalable language models continues to grow, HyperAttention represents a significant step in the right direction, ultimately benefiting researchers and developers in the NLP community.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Researchers from Yale and Google Introduce HyperAttention: An Approximate Attention Mechanism Accelerating Large Language Models for Efficient Long-Range Sequence Processing appeared first on MarkTechPost.

Meet PIXART-α: A Transformer-Based T2I Diffusion Model Whose Image Ge …

A new era of photorealistic image synthesis has just begun thanks to the development of text-to-image (T2I) generative models like DALLE 2, Imagen, and Stable Diffusion. This has significantly influenced many downstream applications, including image editing, video production, the creation of 3D assets, and more. However, these sophisticated models require significant processing power to train. For example, training SDv1.5 requires 6K A100 GPU days, which costs around $320,000. The more recent, larger model RAPHAEL requires 60K A100 GPU days, which costs about $3,080,000. Additionally, training causes significant CO2 emissions that put the environment under stress; for instance, RAPHAEL’s training produces 35 tonnes of CO2 emissions, equivalent to the emissions one person produces over 7 years, as seen in Figure 1. 

Figure 1: Comparison of CO2 emissions and training costs among T2I generators. Training PIXART-α costs a remarkable $26,000; its CO2 emissions and training expenses are just 1.1% and 0.85% of RAPHAEL’s, respectively.

Such a high price places major restrictions on access to these models for both the research community and businesses, which significantly impedes progress in the AIGC community. A crucial question arises from these difficulties: can a high-quality image generator be built with manageable resource usage? Researchers from Huawei Noah’s Ark Lab, Dalian University of Technology, HKU, and HKUST present PIXART-α, which dramatically lowers the computing requirements of training while keeping image-generation quality competitive with the most recent state-of-the-art generators. They propose three main designs to achieve this. The first is decomposition of the training strategy: they break the challenging text-to-image generation problem into three simple subtasks:

Learning the distribution of pixels in natural pictures

Learning text-image alignment

Improving the aesthetic appeal of images

They suggest drastically lowering the learning cost of the first subtask by initializing the T2I model with a low-cost class-condition model. For the second and third subtasks, they use a training paradigm consisting of pretraining and fine-tuning: pretraining on text-image pair data with high information density, followed by fine-tuning on data with higher aesthetic quality, which increases training effectiveness. The second design is an efficient T2I Transformer. Building on the Diffusion Transformer (DiT), they use cross-attention modules to inject text conditions and simplify the computationally demanding class-condition branch to increase efficiency. Additionally, they present a reparameterization method that enables the modified text-to-image model to directly import the parameters of the original class-condition model. 

They can thus use ImageNet’s prior knowledge of natural image distributions to give the T2I Transformer a reasonable initialization and speed up its training. The third design is high-quality data. Their analysis reveals significant flaws in existing text-image pair datasets, with LAION as an example: textual captions frequently suffer from a severe long-tail effect (i.e., many nouns appear with extremely low frequency) and a lack of informative content (i.e., they typically describe only a portion of the objects in an image). These flaws greatly reduce the effectiveness of T2I training and require millions of iterations to obtain reliable text-image alignments. To overcome these issues, they propose an auto-labeling pipeline that uses a state-of-the-art vision-language model to produce captions for the SAM dataset. 

The SAM dataset has the benefit of containing a large and diverse collection of objects, which makes it a perfect source for producing text-image pairs with high information density, well suited for text-image alignment learning. These choices make the model’s training extremely efficient, using just 675 A100 GPU days and $26,000. Figure 1 shows how their approach uses less training data (0.2% vs. Imagen) and less training time (2% vs. RAPHAEL). Their training expenses are about 1% of RAPHAEL’s, saving roughly $3,000,000 ($26,000 vs. $3,080,000). 

Regarding generation quality, their user studies show that PIXART-α delivers better image quality and semantic alignment than existing state-of-the-art T2I models such as Stable Diffusion; moreover, its performance on T2I-CompBench demonstrates its advantage in semantic control. They anticipate that their efforts to train T2I models efficiently will provide the AIGC community with useful insights and help more independent researchers and companies produce their own high-quality T2I models at more affordable prices.

Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.

The post Meet PIXART-α: A Transformer-Based T2I Diffusion Model Whose Image Generation Quality is Competitive with State-of-the-Art Image Generators appeared first on MarkTechPost.

This AI Paper Proposes a NeRF-based Mapping Method that Enables Higher …

In this paper, researchers have introduced a NeRF-based mapping method called H2-Mapping, aimed at addressing the need for high-quality, dense maps in real-time applications, such as robotics, AR/VR, and digital twins. The key problem they tackle is the efficient generation of detailed maps in real-time, particularly on edge computers with limited computational power.

They highlight that previous mapping methods have struggled to balance memory efficiency, mapping accuracy, and novel view synthesis, making them unsuitable for some applications. NeRF-based methods have shown promise in overcoming these limitations but are generally time-consuming, even on powerful edge computers. To meet the four key requirements for real-time mapping, namely adaptability, high detail, real-time capability, and novel view synthesis, the authors propose a novel hierarchical hybrid representation.

The proposed method combines explicit octree SDF priors for coarse scene geometry and implicit multiresolution hash encoding for high-resolution details. This approach speeds up scene geometry initialization and makes it easier to learn. They also introduce a coverage-maximizing keyframe selection strategy to enhance mapping quality, particularly in marginal areas.

The results of their experiments demonstrate that H2-Mapping outperforms existing NeRF-based mapping methods in terms of geometry accuracy, texture realism, and time consumption. The paper presents comprehensive details about the method’s architecture and performance evaluation.

In conclusion, the researchers have introduced H2-Mapping, a NeRF-based mapping method with a hierarchical hybrid representation that achieves high-quality real-time mapping even on edge computers. Their approach addresses the limitations of existing methods and showcases promising results in terms of both accuracy and efficiency.

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project.

The post This AI Paper Proposes a NeRF-based Mapping Method that Enables Higher-Quality Reconstruction and Real-Time Capability Even on Edge Computers appeared first on MarkTechPost.

Google AI Introduces SANPO: A Multi-Attribute Video Dataset for Outdoo …

For tasks like self-driving, an AI model must understand not only the 3D structure of roads and sidewalks but also identify and recognize street signs and stop lights. This task is made easier by a laser sensor (lidar) mounted on the car that captures 3D data. Comprehending the environment from one’s own perspective in this way is called egocentric scene understanding. The problem is that there are no publicly available datasets beyond the autonomous driving domain that generalize to egocentric human scene understanding.

Researchers at Google have introduced SANPO (Scene understanding, Accessibility, Navigation, Pathfinding, Obstacle avoidance) dataset, which is a multi-attribute video dataset for human egocentric scene understanding. SANPO consists of both real-world as well as synthetic data, called SANPO-Real and SANPO-Synthetic, respectively. SANPO-Real covers diverse environments and has videos from two stereo cameras to support multi-view methods. The real dataset also includes 11.4 hours of video captured at 15 frames per second (FPS) with dense annotations. 

SANPO is a large-scale video dataset for human egocentric scene understanding, consisting of more than 600K real-world and more than 100K synthetic frames with dense prediction annotations.

Google’s researchers have prioritized privacy protection. They’ve collected data while following the laws at the local, city, and state levels. They’ve also made sure to remove any personal information, like faces and vehicle license plates, before sending the data for annotation.

To overcome imperfections in captured video, such as motion blur and human rating mistakes, SANPO-Synthetic was introduced to augment the real dataset. The researchers partnered with Parallel Domain to create a high-quality synthetic dataset optimized to match real-world conditions. SANPO-Synthetic consists of 1,961 sessions, recorded using virtualized Zed cameras with an even split between head-mounted and chest-mounted positions.

The synthetic dataset and a part of the real dataset have been annotated using panoptic instance masks, which assign a class and an ID to each pixel. In SANPO-Real, only a few frames have more than 20 instances per frame. In contrast, SANPO-Synthetic features many more instances per frame than the real dataset.

Some of the other important video datasets in this field are  SCAND, MuSoHu, Ego4D, VIPSeg, and Waymo Open. SANPO was compared to these datasets, and it is the first dataset with panoptic masks, depth, camera pose, multi-view stereo, and both real and synthetic data. Apart from SANPO, only Waymo Open has both panoptic segmentation and depth maps.

The researchers trained two state-of-the-art models, BinsFormer (for depth estimation) and kMaX-DeepLab (for panoptic segmentation), on the SANPO dataset. They observed that the dataset is quite challenging for both dense prediction tasks. Moreover, models achieve better accuracy on the synthetic dataset than on the real one. This is mainly because real-world environments are quite complex compared to synthetic data, and segmentation annotations are more precise for synthetic data.

Introduced to tackle the lack of datasets for human egocentric scene understanding, SANPO is a significant advancement that encompasses both real-world and synthetic datasets. Its dense annotations, multi-attribute features, and unique combination of panoptic segmentation and depth information set it apart from other datasets in the field. Furthermore, the researchers’ commitment to privacy allows the dataset to support fellow researchers in creating visual navigation systems for the visually impaired and push the boundaries of advanced visual scene understanding.

Check out the Paper and Google Blog. All Credit For This Research Goes To the Researchers on This Project.

The post Google AI Introduces SANPO: A Multi-Attribute Video Dataset for Outdoor Human Egocentric Scene Understanding appeared first on MarkTechPost.

How Can We Effectively Compress Large Language Models with One-Bit Wei …

Partially-Binarized LLM (PB-LLM) is a cutting-edge technique for achieving extreme low-bit quantization in large language models (LLMs) without sacrificing their language reasoning capabilities. PB-LLM strategically filters salient weights during binarization, reserving them for higher-bit storage. Moreover, it introduces post-training quantization (PTQ) and quantization-aware training (QAT) methods to recover the reasoning capacity of quantized LLMs. This approach represents a significant advancement in network binarization for LLMs.

Researchers from the Illinois Institute of Technology, Huomo AI, and UC Berkeley introduced PB-LLM as an innovative approach for extreme low-bit quantization that preserves language reasoning capacity. Their approach addresses the limitations of existing binarization algorithms and emphasizes the significance of salient weights. Their study further explores PTQ and QAT techniques to recover reasoning capacity in quantized LLMs. Their findings contribute to advancements in LLM network binarization, with the PB-LLM code available for further exploration and implementation.

Their method delves into the challenge of deploying LLMs on memory-constrained devices. It explores network binarization, reducing weight bit-width to one bit to compress LLMs. Their proposed approach, PB-LLM, aims to achieve extremely low-bit quantization while preserving language reasoning capacity. Their research also investigates the salient-weight property of LLM quantization and employs PTQ and QAT techniques to regain reasoning capacity in quantized LLMs.

Their approach introduces PB-LLM as an innovative method for achieving extremely low-bit quantization in LLMs while preserving their language reasoning capacity. It addresses the limitations of existing binarization algorithms by emphasizing the importance of salient weights. PB-LLM selectively reserves a small fraction of salient weights for higher-bit storage, enabling partial binarization. 
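
For intuition, the toy sketch below shows one simple way to partially binarize a weight tensor: keep the largest-magnitude fraction in full precision and binarize the rest to a sign times a per-tensor scale. The salience criterion and scaling used here are simplifications, not the paper’s exact method.

import torch

def partially_binarize(w, salient_frac=0.1):
    # Keep the top `salient_frac` of weights (by magnitude) untouched; binarize the rest.
    flat = w.abs().flatten()
    k = max(1, int(salient_frac * flat.numel()))
    threshold = flat.topk(k).values.min()
    salient = w.abs() >= threshold
    scale = w[~salient].abs().mean()        # per-tensor scale for the binarized part
    w_bin = torch.sign(w) * scale           # one-bit representation: sign times scale
    return torch.where(salient, w, w_bin), salient

w = torch.randn(512, 512)
w_q, salient_mask = partially_binarize(w)
print(salient_mask.float().mean())          # roughly 0.1 of weights kept at higher precision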

PB-LLM binarizes the majority of weights while assigning the salient fraction to higher-bit storage. The paper extends PB-LLM’s capabilities through PTQ and QAT methodologies, revitalizing the performance of low-bit quantized LLMs. These advancements contribute significantly to network binarization for LLMs, and the accompanying code is available for further exploration. The study also examines the viability of existing binarization techniques for quantizing LLMs: current binarization algorithms struggle to quantize LLMs effectively, suggesting the need for innovative approaches.

Their research underscores the role of salient weights in effective binarization and proposes optimal scaling strategies. The combined use of PTQ and QAT can restore quantized LLM capacities. The provided PB-LLM code encourages research and development in LLM network binarization, particularly in resource-constrained environments.

In conclusion, the paper introduces PB-LLM as an innovative solution for extreme low-bit quantization in LLMs while preserving language reasoning capabilities. It addresses the limitations of existing binarization algorithms and emphasizes the importance of salient weights. PB-LLM selectively binarizes salient weights, allocating them to higher-bit storage. Their research extends PB-LLM through PTQ and QAT methodologies, revitalizing low-bit quantized LLMs’ performance. These advancements significantly contribute to network binarization for LLMs.

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project.

The post How Can We Effectively Compress Large Language Models with One-Bit Weights? This Artificial Intelligence Research Proposes PB-LLM: Exploring the Potential of Partially-Binarized LLMs appeared first on MarkTechPost.

Researchers from Princeton and Meta AI Introduce MemWalker: A New Meth …

Adopting the Transformer architecture with self-attention, together with increases in model size and pre-training data, has led to significant progress in large language models (LLMs). As LLMs improve in capacity, users increasingly want to use longer input sequences during inference. As a result, there is a growing need for services that facilitate the analysis of lengthy texts, such as legal or scientific documents, and the management of lengthy conversations. The ability to process longer contexts is very useful for such tasks, which require consuming a massive volume of information. 

Despite this progress, the limitations of the self-attention mechanism become more obvious as sequence length grows and the amount of context it must keep track of increases. Several methods have been used to deal with this issue, such as developing more compact and effective attention schemes, fine-tuning with extrapolated or interpolated positional embeddings, using recurrence to carry forward information from one text segment into the next, and retrieving pertinent passages. However, these methods still have inherent constraints: however far the context window is extended, it remains finite in size, and not every position within it receives equal attention. Although recurrence can handle sequences of indefinite length, it frequently forgets details from earlier parts of the sequence. 

Instead of analyzing the full sequence at once, researchers from Princeton University and Meta AI created a radically new method that approaches the model with a finite context window as an interactive agent, thereby resolving the problems above. To achieve this goal, they present MEMWALKER, a method that guides the model through the lengthy text in an iterative LLM-based manner. 

MEMWALKER is a two-step process that involves:

Building a memory tree

Using that tree to guide the way.

In the first phase, the lengthy material is broken into manageable segments that the LLM can process. The LLM then condenses the information from each segment into a summary node. A tree is constructed from these summary nodes, which are themselves summarized into higher-level summary nodes. When processing a user inquiry, the LLM starts at the root of the tree, examines each branch, and analyzes the summaries to find the path that answers the question. This allows MEMWALKER to process long texts rapidly and to identify the crucial parts of a long text without requiring any fine-tuning. A schematic sketch of the tree-building step follows below. 
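
The following is a schematic sketch of the memory-tree construction; llm_summarize is a hypothetical placeholder for a call to an instruction-tuned LLM, and the segment size and fanout are illustrative.

from textwrap import wrap

def llm_summarize(text: str) -> str:
    # Hypothetical helper: in practice this would prompt an instruction-tuned LLM.
    return text[:200] + "..."

def build_memory_tree(document: str, segment_chars=2000, fanout=4):
    # Leaf level: split the long document into segments the LLM can process.
    nodes = [{"summary": llm_summarize(seg), "children": [], "text": seg}
             for seg in wrap(document, segment_chars)]
    # Repeatedly summarize groups of nodes until a single root remains.
    while len(nodes) > 1:
        nodes = [{"summary": llm_summarize(" ".join(n["summary"] for n in group)),
                  "children": group, "text": None}
                 for group in (nodes[i:i + fanout] for i in range(0, len(nodes), fanout))]
    return nodes[0]

# To answer a query, the LLM starts at the root, reads the children's summaries,
# and chooses which branch to descend into until it reaches a relevant leaf segment.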

In their analysis of MEMWALKER, the team finds that the system outperforms recurrence, retrieval, and vanilla LLM baselines when asked to answer three different types of extended context questions. Other open long context systems that can handle 8,000 to 16,000 tokens couldn’t compare to MEMWALKER’s performance. They evaluate MEMWALKER’s performance, demonstrating that it can reason about navigation decisions, use working memory while traversing, and rectify mistakes committed in the early stages of navigation.

The team also discussed three significant shortcomings with MEMWALKER:

The memory tree generation might not scale very well if the sequence gets long.

The study’s results show that the LLM must be large (over 70B) and instruction-tuned for MEMWALKER to be effective. 

MEMWALKER’s interactive reading capabilities are limited to zero-shot prompting, and it does not use fine-tuning in any way.

Nevertheless, the team believes that MEMWALKER paves the way for a lot of exciting research in the future, including expanding its use to data structures other than trees and optimizing its performance for the interactive reading task.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Researchers from Princeton and Meta AI Introduce MemWalker: A New Method that First Processes the Long Context into a Tree of Summary Nodes appeared first on MarkTechPost.

Meet ToolJet: An Open-Source Low-Code Framework to Build and Deploy …

In the world of software development, one common challenge that organizations face is the need to rapidly build and deploy internal tools without expending excessive engineering effort. These tools are essential for streamlining various processes and improving organizational efficiency. However, the traditional approach to building such tools often requires significant time and resources, leading to delays in addressing critical business needs.

Existing solutions for this problem include low-code and no-code platforms that aim to simplify application development. While these platforms offer a degree of convenience, they often come with limitations in terms of customization, flexibility, and integration capabilities. Organizations may find the functionality insufficient or face challenges when integrating with external data sources, APIs, and SaaS tools.

Meet ToolJet, an open-source low-code framework that presents a compelling solution to these challenges. ToolJet’s drag-and-drop frontend builder empowers users to create complex and responsive frontends within minutes, eliminating the need for extensive coding. What sets ToolJet apart is its robust ability to integrate with a range of data sources, including databases like PostgreSQL, MongoDB, and Elasticsearch, API endpoints with OpenAPI spec and OAuth2 support, SaaS tools such as Stripe, Slack, Google Sheets, Airtable, and Notion, and object storage services like S3, GCS, and Minio.

The metrics associated with ToolJet demonstrate its capabilities. With over 40 built-in responsive components, it provides a rich library for designing user interfaces. It also offers a built-in no-code database, supports multi-page applications, and even allows multiplayer editing, facilitating collaboration among developers. ToolJet’s versatility extends to its compatibility with various hosting options, including Docker, Kubernetes, Heroku, AWS EC2, Google Cloud Run, and more. Additionally, it boasts granular access control, the ability to run custom JavaScript and Python code, and support for single sign-on (SSO) providers, enhancing both security and customization.

In conclusion, ToolJet offers a powerful solution to the problem of building and deploying internal tools with minimal engineering effort. Its impressive features, extensive integration capabilities, and ease of use make it a valuable asset for organizations looking to accelerate their internal tool development processes. By leveraging ToolJet’s capabilities, businesses can address their unique needs and drive productivity while minimizing development time and complexity.

Check out the Github. All Credit For This Research Goes To the Researchers on This Project.

The post Meet ToolJet: An Open-Source Low-Code Framework to Build and Deploy Internal Tools with Minimal Engineering Effort appeared first on MarkTechPost.

Meta AI Researchers Introduce a Machine Learning Model that Explores D …

Deciphering speech from brain activity, a longstanding goal in healthcare and neuroscience, has recently seen progress with invasive devices. Deep-learning algorithms trained on intracranial recordings can decode basic linguistic elements. However, extending this to natural speech and non-invasive brain recordings poses a challenge. Researchers from Meta introduce a machine learning model employing contrastive learning to decode perceived speech representations from non-invasive recordings. Their method combines four datasets and achieves promising results, offering a potential pathway for language decoding from brain activity without invasive procedures, with implications for healthcare and neuroscience.

Researchers explore decoding speech from non-invasive brain activity recordings, building upon recent successes with invasive devices in decoding linguistic elements. Their method introduces a contrastive learning model trained to decode self-supervised speech representations. Comparisons with invasive studies highlight their larger vocabulary, and potential applications in speech production are discussed. Ethical approvals were obtained for healthy adult volunteers’ datasets involving passive listening.

Decoding speech from non-invasive brain recordings is a significant challenge in healthcare and neuroscience. While invasive devices have progressed, extending this to natural speech remains difficult. Their approach presents a model trained with contrastive learning to decode self-supervised speech representations from non-invasive data. Their advancement offers promise in decoding language from brain activity without invasive procedures.

Their method introduces a neural decoding task: deciphering perceived speech from non-invasive brain recordings. The model is trained and evaluated on four public datasets comprising 175 volunteers recorded via MEG or EEG while listening to stories. It employs a common convolutional architecture trained simultaneously on multiple participants. Comparative analysis with baselines underscores the significance of the contrastive objective and pretrained speech representations. Additionally, the decoder’s predictions primarily rely on lexical and contextual semantic representations.
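
Conceptually, the contrastive objective resembles a standard symmetric InfoNCE loss between time-aligned brain and speech embeddings; the sketch below is a generic illustration with made-up shapes and temperature, not the paper’s exact formulation.

import torch
import torch.nn.functional as F

def contrastive_loss(brain_emb, speech_emb, temperature=0.1):
    # Both inputs: (batch, dim) embeddings for time-aligned brain and speech segments.
    brain_emb = F.normalize(brain_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = brain_emb @ speech_emb.T / temperature   # similarity of every brain/speech pair
    targets = torch.arange(logits.size(0))            # the matching segment is the positive
    # Symmetric InfoNCE: identify the right speech clip for each brain segment and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

brain = torch.randn(32, 256)    # e.g., output of a convolutional MEG/EEG encoder
speech = torch.randn(32, 256)   # e.g., pretrained self-supervised speech representations
print(contrastive_loss(brain, speech))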

Decoding accuracy varied among participants and datasets. Word-level predictions showed accurate identification of correct words and discrimination from negative candidates. Comparisons with baselines underscored the significance of the contrastive objective, pretrained speech representations, and a shared convolutional architecture in enhancing decoding accuracy. Decoder predictions primarily relied on lexical and contextual semantic representations.

Researchers introduce a contrastive learning-based model for decoding perceived speech from non-invasive brain recordings. Their model demonstrates promising results, achieving an average accuracy of up to 41% in speech segment identification and up to 80% accuracy in the best-performing participants. Comparison with baselines underscores the significance of contrastive objectives, pretrained speech representations, and a shared convolutional architecture in enhancing decoding accuracy. Decoder predictions primarily rely on lexical and contextual semantics. Their work holds potential for non-invasive language decoding in healthcare and neuroscience applications.

Future research should elucidate the factors contributing to decoding accuracy variations among participants and datasets. Investigating the model’s performance in solving more intricate linguistic attributes and real-time speech perception scenarios is essential. Assessing the model’s generalizability to diverse brain recording or imaging techniques is imperative. Exploring its capacity to capture prosody and phonetic features would offer a comprehensive insight into speech decoding.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Meta AI Researchers Introduce a Machine Learning Model that Explores Decoding Speech Perception from Non-Invasive Brain Recordings appeared first on MarkTechPost.

IBM Announces AI-Powered Threat Detection and Response Services to Rev …

In the ever-evolving landscape of cybersecurity threats, organizations face an increasingly daunting challenge – the overwhelming volume of security alerts. Security teams find themselves outnumbered by attackers and buried beneath an avalanche of vulnerabilities, warnings, and security tools. This problem has led to delayed response times, missed critical threats, and an urgent need for a scalable and efficient solution.

Existing solutions have offered some relief but often fail to address the sheer scale and complexity of modern cybersecurity threats. IBM, a global technology leader, unveiled a groundbreaking solution to this problem: Threat Detection and Response (TDR) Services.

IBM’s TDR Services leverage cutting-edge AI technologies, continuously learning from real-world client data, including security analyst responses. This intelligent system can automatically escalate or close up to 85% of alerts, allowing security teams to focus on the most critical threats. With its ability to assess and auto-recommend the most effective detection rules, the TDR Services have reduced low-value SIEM alerts by 45% and escalated 79% more high-value alerts requiring immediate attention.

Moreover, organizations can now assess their security posture compared to their industry peers, thanks to the MITRE ATT&CK assessment. The TDR Services apply AI to reconcile multiple detection tools and policies, providing a comprehensive view of how to detect threats and assess gaps within an ATT&CK framework. This framework ensures a proactive and adaptable approach to security.

One of the standout features of IBM’s TDR Services is its seamless end-to-end integration. It boasts an open API approach, enabling swift integration with a client’s existing security assets, whether on-premise or in the cloud. This co-managed portal offers a unified enterprise view, precise remediation capabilities, and consistent enforcement of security policies across IT & OT.

Additionally, organizations can rely on global support from IBM Cybersecurity Services professionals worldwide. 

In conclusion, IBM’s Threat Detection and Response Services represent a significant leap forward in addressing the escalating challenges of modern cybersecurity. Its AI-powered capabilities, MITRE ATT&CK assessment, seamless integration, and global support offer a holistic and efficient solution for organizations looking to bolster their security defenses and stay ahead of evolving threats. In an era where the stakes for cybersecurity have never been higher, IBM’s TDR Services provide hope for organizations seeking to protect their digital assets and reputations.

Check out the Reference Article. All Credit For This Research Goes To the Researchers on This Project.

The post IBM Announces AI-Powered Threat Detection and Response Services to Revolutionize Cybersecurity appeared first on MarkTechPost.

This AI Research Proposes SMPLer-X: A Generalist Foundation Model for …

The animation, gaming, and fashion sectors may all benefit from the cutting-edge field of expressive human pose and shape estimation (EHPS) from monocular photos or videos. To accurately portray the complex human anatomy, face, and hands, this job often uses parametric human models (like SMPL-X). Recent years have seen an influx of unique datasets, giving the community additional opportunities to research topics like capture environment, position distribution, body visibility, and camera viewpoints. However, the state-of-the-art approaches are still constrained to a small number of these datasets, causing a performance bottleneck in various scenarios and impeding generalization to uncharted terrain. 

To build reliable, globally applicable models for EHPS, their goal in this work is to analyze the available datasets thoroughly. To do this, they created the first systematic benchmark for EHPS using 32 datasets and assessed their performance against four key standards. This reveals significant inconsistencies between benchmarks, highlighting the complexity of the overall EHPS landscape, and calls for data scaling to address the domain gaps between scenarios. This in-depth analysis highlights the need to reevaluate the use of existing datasets for EHPS, arguing for a switch to more aggressive alternatives that provide better generalization abilities. 

Their research emphasizes the value of using several datasets to benefit from their complementary nature, and it thoroughly examines the relevant aspects affecting these datasets’ transferability. The study provides helpful advice for future dataset collection: 1) datasets do not need to be particularly large to be beneficial, as long as they contain more than 100K instances; 2) if an in-the-wild (including outdoor) collection is not feasible, diverse indoor scenes are an excellent alternative; 3) synthetic datasets are becoming surprisingly effective despite detectable domain gaps; 4) in the absence of SMPL-X annotations, pseudo-SMPL-X labels are helpful.

Using the information from the benchmark, researchers from Nanyang Technological University, SenseTime Research, Shanghai AI Laboratory, The University of Tokyo, and the International Digital Economy Academy (IDEA) created SMPLer-X. This generalist foundation model is trained on a variety of datasets and delivers remarkably balanced results across circumstances, demonstrating the power of scaling up carefully chosen data. They developed SMPLer-X with a minimalist design philosophy to decouple it from algorithmic research: SMPLer-X has a very basic architecture with only the most crucial components for EHPS. Rather than rigorously analyzing the algorithmic side, SMPLer-X is intended to permit large data and parameter scaling and to serve as a basis for future research in the field. 

Experiments with various data combinations and model sizes yield a comprehensive model that outperforms prior results across benchmarks and challenges the widespread practice of training on restricted datasets. Their foundation models reduce the mean primary errors on five major benchmarks (AGORA, UBody, EgoBody, 3DPW, and EHF) from over 110 mm to below 70 mm and show impressive generalization capabilities by successfully adapting to new scenarios like RenBody and ARCTIC. Additionally, they demonstrate the effectiveness of fine-tuning their generalist foundation models into domain-specific experts, producing exceptional performance across the board. 

Specifically, the same data selection methodology enables their specialized models to achieve SOTA performance on EgoBody, UBody, and EHF, in addition to becoming the first model to attain 107.2 mm NMVE (an 11.0% improvement) and break new records on the AGORA leaderboard. They make three distinct contributions: 1) they construct the first systematic benchmark using extensive EHPS datasets, which offers crucial direction for scaling up training data toward reliable and transferable EHPS; 2) they investigate both data and model scaling to construct a generalist foundation model that offers balanced results across many scenarios and extends effectively to unexplored datasets; 3) they refine their foundation model into a powerful specialist across various benchmarks by extending the data selection technique.

Check out the Paper, Project Page, and Github. All Credit For This Research Goes To the Researchers on This Project.

The post This AI Research Proposes SMPLer-X: A Generalist Foundation Model for 3D/4D Human Motion Capture from Monocular Inputs appeared first on MarkTechPost.

Improve performance of Falcon models with Amazon SageMaker

What is the optimal framework and configuration for hosting large language models (LLMs) for text-generating generative AI applications? Despite the abundance of options for serving LLMs, this is a hard question to answer due to the size of the models, varying model architectures, performance requirements of applications, and more. The Amazon SageMaker Large Model Inference (LMI) container makes it straightforward to serve LLMs by bringing together a host of different frameworks and techniques that optimize the deployment of LLMs. The LMI container has a powerful serving stack called DJL serving that is agnostic to the underlying LLM. It provides system-level configuration parameters that can be tuned for extracting the best performance of the hosting infrastructure for a given LLM. It also has support for recent optimizations like continuous batching, also known as iterative batching or rolling batching, which provides significant improvements in throughput.
In an earlier post, we showed how you can use the LMI container to deploy the Falcon family of models on SageMaker. In this post, we demonstrate how to improve the throughput and latency of serving Falcon-40B with techniques like continuous batching. We also provide an intuitive understanding of configuration parameters provided by the SageMaker LMI container that can help you find the best configuration for your real-world application.
Fundamentals of text-generative inference for LLMs
Let’s first look at a few fundamentals on how to perform inference for LLMs for text generation.
Forward pass, activations, and the KV cache
Given an input sequence of tokens, the tokens are run in a forward pass through all the layers of the LLM (such as Falcon) to generate the next token. A forward pass refers to the process of input data being passed through a neural network to produce an output. In the case of text generation, the forward pass involves feeding an initial seed or context into the language model and generating the next character or token in the sequence. To generate a sequence of text, the process is done iteratively, meaning it is repeated for each step or position in the output sequence. At each iteration, the model generates the next character or token, which becomes part of the generated text, and this process continues until the desired length of text is generated.
Text generation with language models like Falcon or GPT is autoregressive. This means that the model generates one token at a time while conditioning on the previously generated tokens. In other words, at each iteration, the model takes the previously generated text as input and predicts the next token based on that context. As mentioned in vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, in this autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache.
Prefill and decode phases
In an autoregressive decoding process, like the one used in text generation with language models such as Falcon, there are typically two main phases: the prefill phase and the decode phase. These phases are crucial for generating coherent and contextually relevant text.
The prefill phase includes the following:

Initial context – The prefill phase begins with an initial context, or seed text, provided by the user. This can be a sentence, a phrase, or even just a single word. It sets the starting point for text generation and provides the context for everything generated afterward.
Prompt processing – Because every token of the prompt is known up front, the model processes the entire prompt in a single forward pass, computing the attention key and value tensors (the KV cache) for all input tokens at once.
First token – The output of the prefill phase is the prediction for the first generated token, which is appended to the context.

The prefill phase runs once per request and ends as soon as that first new token has been produced; everything after that point is handled by the decode phase.
The decode phase includes the following:

Continuation from the last token – The decode phase starts from the last token produced during the prefill phase, taking the prompt plus that token as its context.
Iterative generation – The model then generates one token at a time, conditioning on all previously generated tokens and reusing the cached key and value tensors so that each step only requires a forward pass for the newest token.
Stopping condition – The decode phase continues until a stopping condition is met, such as reaching a maximum output length or generating an end-of-text token. When this condition is met, the generation process stops.

The combination of the prefill and decode phases allows autoregressive models to generate text that builds on an initial context and produces coherent, contextually relevant sequences of text.
Refer to A Distributed Serving System for Transformer-Based Generative Models for a detailed explanation of the process.
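As a minimal, hands-on illustration of the prefill and decode phases and the KV cache, the following sketch runs greedy decoding with the Hugging Face transformers library. It uses the small gpt2 model purely so that it runs anywhere; the same pattern applies to Falcon-class models, and the prompt and number of steps are arbitrary.

```python
# Minimal sketch of prefill + decode with an explicit KV cache, using a small
# model (gpt2) for illustration; the pattern is the same for Falcon-class models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt in one forward pass and keep the KV cache.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(10):
        # Decode: feed only the newest token; the cached K/V tensors cover the rest.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```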
Optimizing throughput using dynamic batching
So far, we’ve only talked about a single input. In practice, we expect to deal with multiple requests coming in randomly from the application clients for inference concurrently or in a staggered fashion. In the traditional way, basic batching can be used to increase the throughput and the utilization of the computing resources of the GPU. Batching is effectively combining the numerical representations of more than one request in a batch and performing parallel runs of the autoregressive forward passes. This intelligent batching is done at the serving side. SageMaker LMI’s DJLServing server can be configured to batch together multiple requests to process them in parallel by setting the following parameters in serving.properties:

max_batch_delay = 100 – The maximum delay for batch aggregation in milliseconds. The default value is 100 milliseconds.
batch_size = 32 – The dynamic batch size. The default is 1.

With these settings, DJLServing queues incoming requests for up to 100 milliseconds; if the number of queued requests reaches the specified batch_size before that delay expires, the batch is scheduled immediately and sent to the backend for inference. This is known as dynamic batching. It’s dynamic because the batch size may change across batches depending on how many requests were added in that time window. However, because requests can have different characteristics (for example, some requests might have 20 tokens of input and 500 tokens of output, whereas others might be reversed, with 500 tokens of input but only 20 of output), some requests might complete processing faster than others in the same batch. This can result in underutilization of the GPU while it waits for all in-flight requests in the batch to complete their decode stage, even if there are additional requests waiting to be processed in the queue. The following diagram illustrates this process.

Dynamic Batching Visual – notice the idle windows at the end of Request 2 and 3
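To make this configuration concrete, the following is a minimal sketch that writes a serving.properties file for dynamic batching from Python. The parameter values mirror the dynamic batching configuration used later in this post's benchmarks; the model ID and output directory are illustrative assumptions.

```python
# Sketch: write a minimal serving.properties for dynamic batching with DJLServing.
# Values mirror the dynamic batching benchmark configuration later in this post;
# the model ID and output directory are illustrative assumptions.
from pathlib import Path

dynamic_batching_properties = "\n".join([
    "engine=Python",
    "option.model_id=tiiuae/falcon-40b",   # pull weights from the Hugging Face Hub
    "option.tensor_parallel_degree=8",     # shard the model across 8 GPUs
    "option.dtype=fp16",
    "batch_size=4",                        # dynamic batch size
    "max_batch_delay=100",                 # max wait (ms) while aggregating a batch
    "option.trust_remote_code=true",
])

model_dir = Path("falcon-40b-dynamic-batching")
model_dir.mkdir(exist_ok=True)
(model_dir / "serving.properties").write_text(dynamic_batching_properties + "\n")
print((model_dir / "serving.properties").read_text())
```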

Optimizing throughput using continuous batching
With continuous batching, also known as iterative or rolling batching, we take advantage of the differences between the prefill and decode stages. To activate continuous batching, DJLServing provides the following additional configurations in serving.properties:

engine=MPI – We encourage you to use the MPI engine for continuous batching.
option.rolling_batch=auto or lmi-dist – We recommend using auto because it will automatically pick the most appropriate rolling batch algorithm along with other optimizations in the future.
option.max_rolling_batch_size=32 – This limits the number of concurrent requests. The default is 32.

With continuous batching, the serving stack (DJLServing) doesn’t wait for all in-flight requests in a batch to complete their decode stage. Rather, at logical breaks (the end of one iteration of the decode stage), it pulls in additional requests that are waiting in the queue while the current batch is still processing (hence the name rolling batch). It does this check for pending requests at the end of each iteration of the decode stage. Remember, for each request, we need to run the prefill stage followed by the sequential decode stage. Because we can process all the tokens of a request's initial prompt in parallel during its prefill stage, any time a new request is pulled in, we temporarily pause the decode stage of the in-flight requests in the batch, save their KV cache and activations in memory, and run the prefill stage of the new requests.
The size of this cache can be configured with the following option:

option.max_rolling_batch_prefill_tokens=1024 – Limits the number of simultaneous prefill tokens saved in the cache for the rolling batch (between the decode and the prefill stages)

When the prefill is complete, we combine the new requests and the old paused requests in a new rolling batch, which can proceed with their decode stage in parallel. Note that the old paused requests can continue their decode stage where they left off and the new requests will start from their first new token.

Continuous or Iterative Batching Visual – notice that the idle times are replaced with follow on requests

You might have already noticed that continuous batching is similar to how we naturally parallelize tasks in our daily lives. Messages, emails, and phone notifications (potentially new requests) come in at random times (analogous to multiple requests arriving in a random, staggered fashion for GPUs). This all happens while we go about completing our in-flight tasks, such as composing emails, coding, or participating in meetings (analogous to the tasks currently being processed on the GPUs). At logical breaks, we pause our in-flight tasks and check our notifications to decide whether some action is required on our part; if there is, we add it to our in-flight tasks (a real-life rolling batch) or put it on a to-do list (the queue).
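To make the scheduling idea concrete, here is a toy simulation of a rolling batch loop: at each decode iteration it admits queued requests (up to max_rolling_batch_size), generates one token for every in-flight request, and retires finished requests. This is a conceptual sketch only, not DJLServing's actual scheduler, and the request sizes are made up.

```python
# Conceptual sketch of a rolling (continuous) batch scheduler.
# Not the actual DJLServing implementation; request sizes are illustrative.
from collections import deque

class Request:
    def __init__(self, name, prompt_tokens, max_new_tokens):
        self.name = name
        self.prompt_tokens = prompt_tokens
        self.max_new_tokens = max_new_tokens
        self.generated = 0

queue = deque([
    Request("req-1", prompt_tokens=32, max_new_tokens=4),
    Request("req-2", prompt_tokens=16, max_new_tokens=2),
    Request("req-3", prompt_tokens=8, max_new_tokens=3),
])
in_flight = []
max_rolling_batch_size = 2  # cap on concurrent requests, as in option.max_rolling_batch_size

step = 0
while queue or in_flight:
    # At each iteration boundary, pull waiting requests into the rolling batch.
    while queue and len(in_flight) < max_rolling_batch_size:
        req = queue.popleft()
        in_flight.append(req)
        print(f"step {step}: prefilled {req.name} ({req.prompt_tokens} prompt tokens)")

    # Decode: every in-flight request generates exactly one token this iteration.
    for req in in_flight:
        req.generated += 1
    for req in [r for r in in_flight if r.generated >= r.max_new_tokens]:
        print(f"step {step}: {req.name} finished after {req.generated} new tokens")
    # Finished requests leave the batch immediately, freeing slots for queued ones.
    in_flight = [r for r in in_flight if r.generated < r.max_new_tokens]
    step += 1
```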
Putting it all together: How to think about memory utilization of GPUs
It’s recommended to load test your model to see which configuration is the most cost-effective for your business use case. To build an understanding, let’s visualize the memory footprint of the GPUs as the model is loaded and as successive requests are processed in a rolling batch. For this post, let’s assume we are loading the Falcon-40B model onto one of the G5 instance types, which come with NVIDIA A10G GPUs, each with 24 GB of memory. A similar line of reasoning applies to the p3, p4, and p5 instance types, which come with the V100, A100, and H100 GPU series.
The following is the overview of getting an approximate value of total memory required to serve Falcon-40B:

Model size = Number of model parameters (40 billion for Falcon-40B) x 4 bytes per parameter (for FP32) = 160 GB
Approximate total memory required to load Falcon-40B for inference = Model size (160 GB) + KV cache (attention cache) (approximately 20 GB) + additional memory overhead from ML frameworks (approximately 2 GB)

Memory Visual – Understanding the memory footprint of a loaded Falcon-40B model

For Falcon-40B, if we compress the model by converting the weights to the bfloat16 (2 bytes) data type, the model size becomes approximately 80 GB. As you can see, this is still larger than the memory of a single accelerator device, so we need to adopt a model partitioning (sharding) technique such as tensor parallelism (TP) and shard the model across multiple accelerator devices. Let’s assume that we have chosen g5.24xlarge, which has 4 A10G GPU devices. If we configure DJLServing (serving.properties) with the following, we can expect the 80 GB of model weights to be divided equally across all 4 GPUs:

option.tensor_parallel_degree = 4 or 8, or use max (maximum GPUs detected on the instance)

With tensor_parallel_degree set to 4, about 20 GB of the 24 GB GPU memory (approximately 84%) is already utilized even before a single request has been processed. The remaining 16% of GPU memory is available for the KV cache of incoming requests. For your business scenario and its latency and throughput requirements, 2–3 GB of remaining memory might be more than enough. If not, you can increase the instance size to g5.48xlarge, which has 8 GPUs, and use tensor_parallel_degree set to 8. In that case, only approximately 10 GB of the available 24 GB memory of each GPU is utilized for model weights, and about 60% of each GPU remains for activations and the KV cache. Intuitively, we can see that this configuration may allow us to achieve higher throughput. Additionally, because we have a larger buffer now, we can increase the max_rolling_batch_prefill_tokens and max_rolling_batch_size parameters to further optimize throughput. Together, these two parameters control the preallocations of the activation prefills and KV cache for the model. Larger values for these two parameters correlate with higher throughput, assuming you have enough buffer for the KV cache in GPU memory.
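The arithmetic above can be wrapped in a short estimator. The sketch below reproduces the back-of-the-envelope numbers used in this post for Falcon-40B; the KV cache allowance and framework overhead are the rough approximations quoted here, not measured values.

```python
# Back-of-the-envelope memory estimate for serving Falcon-40B, following the
# approximations in this post (overheads are rough estimates, not measurements).

def model_size_gb(num_params_billions: float, bytes_per_param: int) -> float:
    # 1 billion parameters at N bytes each is roughly N GB.
    return num_params_billions * bytes_per_param

params_b = 40                                # Falcon-40B
fp32_gb = model_size_gb(params_b, 4)         # ~160 GB in FP32
bf16_gb = model_size_gb(params_b, 2)         # ~80 GB in bfloat16/fp16
print(f"FP32 weights: ~{fp32_gb:.0f} GB, bfloat16 weights: ~{bf16_gb:.0f} GB")

gpu_mem_gb = 24                              # NVIDIA A10G on G5 instances

for tp_degree in (4, 8):                     # g5.24xlarge has 4 GPUs, g5.48xlarge has 8
    weights_per_gpu = bf16_gb / tp_degree
    free_per_gpu = gpu_mem_gb - weights_per_gpu
    print(
        f"tensor_parallel_degree={tp_degree}: ~{weights_per_gpu:.0f} GB weights per GPU "
        f"({weights_per_gpu / gpu_mem_gb:.0%} of {gpu_mem_gb} GB), "
        f"~{free_per_gpu:.0f} GB left per GPU for KV cache, activations, and framework overhead"
    )
```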
Continuous batching with PagedAttention
PagedAttention is a new optimization algorithm developed at UC Berkeley that improves the continuous batching process by allowing the attention cache (KV cache) to be non-contiguous: memory is allocated in fixed-size pages or blocks, an approach inspired by the virtual memory and paging concepts used by operating systems.
As per the vLLM paper, the attention cache of each sequence of tokens is partitioned into blocks and mapped to physical blocks through a block table. During the computation of attention, a PagedAttention kernel can use the block table to efficiently fetch the blocks from physical memory. This results in a significant reduction of memory waste and allows for larger batch size, increased GPU utilization, and higher throughput.
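To make the block-table idea concrete, here is a minimal sketch of paged KV cache bookkeeping: each sequence's logical token positions are mapped to fixed-size physical blocks that don't need to be contiguous, and blocks are returned to a free pool when a sequence finishes. This is a conceptual illustration of the idea from the vLLM paper, not vLLM's actual implementation, and the block sizes are made up.

```python
# Minimal sketch of paged KV-cache bookkeeping (conceptual, not vLLM's implementation).
BLOCK_SIZE = 16          # tokens per physical block (illustrative)
NUM_PHYSICAL_BLOCKS = 8  # small pool for illustration

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))  # pool of physical block IDs
block_tables = {}                               # sequence ID -> list of physical block IDs

def append_token(seq_id, token_index):
    """Record that a token's K/V tensors were written; allocate a block on demand."""
    table = block_tables.setdefault(seq_id, [])
    if token_index % BLOCK_SIZE == 0:           # current block is full (or first token)
        table.append(free_blocks.pop(0))        # grab any free block; no contiguity needed
    block = table[token_index // BLOCK_SIZE]
    slot = token_index % BLOCK_SIZE
    return block, slot                          # where a kernel would read/write K/V

def free_sequence(seq_id):
    """Return a finished sequence's blocks to the pool for reuse by other requests."""
    free_blocks.extend(block_tables.pop(seq_id, []))

# Two sequences interleave their allocations; their blocks need not be contiguous.
for t in range(20):
    append_token("seq-A", t)
for t in range(10):
    append_token("seq-B", t)
print(block_tables)     # e.g. {'seq-A': [0, 1], 'seq-B': [2]}
free_sequence("seq-B")  # block 2 returns to the free pool
```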
Performance comparison
To ensure effective load testing of your deployment configuration, it’s recommended to begin by considering the business scenario and clearly defining the characteristics of the input and output for the LLM-based application. For instance, if you are working on a call center summarization use case, the input could consist of larger text, such as a 500-token chat transcript between a customer service agent and a customer, but the output might be relatively smaller, around 100 tokens, representing a summary of the transcript. On the other hand, if you’re working on a code generation scenario, the input could be as short as 15 tokens, like “write an efficient implementation in Python for describing all EC2 resources, including pagination,” but the output could be much larger, reaching 500 tokens. It’s also important to consider whether achieving lower latency or maximizing throughput is the top priority for your specific scenario.
After gaining a comprehensive understanding of the business scenario, you can analyze and determine the optimal configuration for your hosting environment. In this context, the hosting environment encompasses various key elements, including the instance type and other configuration parameters such as tensor_parallel_degree, max_rolling_batch_size, max_rolling_batch_prefill_tokens, and more. Our objective is to identify the most effective setup to support our response time, throughput, and model output quality requirements.
In our analysis, we benchmarked the performance to illustrate the benefits of continuous batching over traditional dynamic batching. We used the serving.properties configurations detailed in the following table for dynamic batching and for continuous batching (with and without PagedAttention), using an LMI container on SageMaker.

Dynamic Batching:
engine=Python
option.model_id=tiiuae/falcon-40b
option.tensor_parallel_degree=8
option.dtype=fp16
batch_size=4
max_batch_delay=100
option.trust_remote_code=true

Continuous Batching:
engine=MPI
option.model_id={{s3_url}}
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.dtype=fp16
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=False

Continuous Batching with PagedAttention:
engine=MPI
option.model_id={{s3_url}}
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.dtype=fp16
option.max_rolling_batch_prefill_tokens=1024
option.paged_attention=True
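To tie these configurations to a running endpoint, the following is a minimal sketch of deploying an LMI container with the SageMaker Python SDK, assuming the serving.properties file has been packaged into a model.tar.gz and uploaded to Amazon S3. The container version, S3 paths, role ARN, and endpoint name are placeholders, not values from this post.

```python
# Sketch: deploy an LMI (DJLServing) container serving Falcon-40B on SageMaker.
# The container version, S3 paths, role ARN, and endpoint name are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/YourSageMakerExecutionRole"  # placeholder

# Retrieve an LMI container image; check the SageMaker docs for the image
# version that matches your SDK and Region (the version here is an assumption).
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=session.boto_region_name,
    version="0.23.0",
)

model = Model(
    image_uri=image_uri,
    model_data="s3://your-bucket/falcon-40b/code/model.tar.gz",  # contains serving.properties
    role=role,
    sagemaker_session=session,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",                # 8 x A10G, matching tensor_parallel_degree=8
    endpoint_name="falcon-40b-lmi",
    container_startup_health_check_timeout=900,    # large models can take a while to load
)

# Invoke the endpoint with the JSON payload format expected by the LMI container.
predictor = Predictor(
    endpoint_name="falcon-40b-lmi",
    sagemaker_session=session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "Amazon SageMaker is", "parameters": {"max_new_tokens": 64}}))
```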

These configurations were benchmarked for Falcon-40B with the FP16 data type, deployed on ml.g5.48xlarge, in two different scenarios that represent real-world applications:

A small number of input tokens with a large number of tokens being generated – In this scenario, the number of input tokens was fixed at 32 and 128 new tokens were generated

Batching Strategy                           Throughput (tokens/sec)    Latency p90 (secs)
Dynamic Batching                            5.53                       58.34
Continuous Batching                         56.04                      4.74
Continuous Batching with PagedAttention     59.18                      4.76

A large input with a small number of tokens being generated – Here, we fix the number of input tokens at 256 and prompt the LLM to summarize the input to 32 tokens

Batching Strategy                           Throughput (tokens/sec)    Latency p90 (secs)
Dynamic Batching                            19.96                      59.31
Continuous Batching                         46.69                      3.88
Continuous Batching with PagedAttention     44.75                      2.67

We can see that continuous batching with PagedAttention delivers roughly 10 times the throughput of dynamic batching in scenario 1 and about 2.3 times in scenario 2 when using the LMI container on SageMaker.
Conclusion
In this post, we looked at how LLMs use memory and explained how continuous batching improves the throughput using an LMI container on SageMaker. We demonstrated the benefits of continuous batching for Falcon-40B using an LMI container on SageMaker by showing benchmark results. You can find the code on the GitHub repo.

About the Authors
Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as Artificial Intelligence, distributed computing, networking, and storage. His expertise lies in Deep Learning in the domains of Natural Language Processing (NLP) and Computer Vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.
Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or watching sports.
Abhi Sodhani holds the position of Senior AI/ML Solutions Architect at AWS, where he specializes in offering technical expertise and guidance on Generative AI and ML solutions to customers. His primary focus is to assist Digital Native Businesses in realizing the full potential of Generative AI and ML technologies, enabling them to achieve their business objectives effectively. Beyond his professional endeavors, Abhi exhibits a strong passion for intellectual pursuits such as reading, as well as engaging in activities that promote physical and mental well-being, such as yoga, meditation.
Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Index your web crawled content using the new Web Crawler for Amazon Ke …

Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.
Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to provide you with a fully managed experience and simplify the process of indexing your content from a variety of data sources in the enterprise.
One such unstructured data repository is the website, whether internal or external. Sites may need to be crawled to create news feeds, analyze language use, or create bots to answer questions based on the website data.
We’re excited to announce that you can now use the new Amazon Kendra Web Crawler to search for answers from content stored in internal and external websites or create chatbots. In this post, we show how to index information stored in websites and use the intelligent search in Amazon Kendra to search for answers from content stored in internal and external websites. In addition, the ML-powered intelligent search can accurately get answers for your questions from unstructured documents with natural language narrative content, for which keyword search is not very effective.
The Web Crawler offers the following new features:

Support for Basic, NTLM/Kerberos, Form, and SAML authentication
The ability to specify up to 100 seed URLs and store the connection configuration in Amazon Simple Storage Service (Amazon S3)
Support for a web and internet proxy with the ability to provide proxy credentials
Support for crawling dynamic content, such as a website containing JavaScript
Field mapping and regex filtering features

Solution overview
With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a crawled website using the Amazon Kendra Web Crawler. The solution consists of the following steps:

Choose an authentication mechanism for the website (if required) and store the details in AWS Secrets Manager.
Create an Amazon Kendra index.
Create a Web Crawler data source V2 via the Amazon Kendra console.
Run a sample query to test the solution.

Prerequisites
To try out the Amazon Kendra Web Crawler, you need the following:

A website to crawl.
An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Basic knowledge of AWS.

Gather authentication details
For protected and secure websites, the following authentication types and standards are supported:

Basic
NTLM/Kerberos
Form authentication
SAML

You need the authentication information when you set up the data source.
For basic or NTLM authentication, you need to provide a Secrets Manager secret containing your user name and password.
Form and SAML authentication require additional information, as shown in the following screenshot. Some of the fields, such as the user name button XPath, are optional and depend on whether the site you are crawling uses a button after entering the user name. Also note that you need to know how to determine the XPath of the user name and password fields and the submit buttons.
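If you prefer to create this secret programmatically rather than through the console, the following is a minimal boto3 sketch. The secret name and the key names inside the secret are illustrative placeholders; align them with the fields Amazon Kendra expects for your chosen authentication type.

```python
# Sketch: store website credentials in AWS Secrets Manager with boto3.
# The secret name and key names below are illustrative placeholders; align them
# with the fields Amazon Kendra expects for Basic, NTLM, Form, or SAML authentication.
import json
import boto3

secrets = boto3.client("secretsmanager")

response = secrets.create_secret(
    Name="kendra-web-crawler-credentials",          # hypothetical secret name
    SecretString=json.dumps({
        "userName": "crawler-user",                 # placeholder credentials
        "password": "example-password",
    }),
)
print(response["ARN"])  # reference this ARN when configuring the data source
```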

Create an Amazon Kendra index
To create an Amazon Kendra index, complete the following steps:

On the Amazon Kendra console, choose Create an Index.
For Index name, enter a name for the index (for example, Web Crawler).
Enter an optional description.
For Role name, enter an IAM role name.
Configure optional encryption settings and tags.
Choose Next.
In the Configure user access control section, leave the settings at their defaults and choose Next.
For Provisioning editions, select Developer edition and choose Next.
On the review page, choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
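If you would rather create the index programmatically, the following is a minimal boto3 sketch. The index name, description, and role ARN are placeholders; the role must already exist and grant Amazon Kendra the required permissions.

```python
# Sketch: create a Developer Edition Amazon Kendra index with boto3.
# The index name, description, and role ARN are placeholders for your own values.
import boto3

kendra = boto3.client("kendra")

response = kendra.create_index(
    Name="Web Crawler",
    Edition="DEVELOPER_EDITION",
    RoleArn="arn:aws:iam::123456789012:role/YourKendraIndexRole",
    Description="Index for content crawled with the Kendra Web Crawler",
)
index_id = response["Id"]
print(f"Creating index {index_id}; this can take up to 30 minutes.")
```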

Create an Amazon Kendra Web Crawler data source
Complete the following steps to create your data source:

On the Amazon Kendra console, choose Data sources in the navigation pane.
Locate the WebCrawler connector V2.0 tile and choose Add connector.
For Data source name, enter a name (for example, crawl-fda).
Enter an optional description.
Choose Next.
In the Source section, select Source URL and enter a URL. For this post, we use https://www.fda.gov/ as an example source URL.
In the Authentication section, choose the appropriate authentication based on the site that you want to crawl. For this post, we select No authentication because it’s a public site and doesn’t need authentication.
In the Web proxy section, you can specify a Secrets Manager secret (if required).

Choose Create and Add New Secret.
Enter the authentication details that you gathered previously.
Choose Save.

In the IAM role section, choose Create a new role and enter a name (for example, AmazonKendra-Web Crawler-datasource-role).
Choose Next.
In the Sync scope section, configure your sync settings based on the site you are crawling. For this post, we leave all the default settings.
For Sync mode, choose how you want to update your index. For this post, we select Full sync.
For Sync run schedule, choose Run on demand.
Choose Next.
Optionally, you can set field mappings. For this post, we keep the defaults for now.

Mapping fields is a useful exercise where you can substitute field names to values that are user-friendly and that fit in your organization’s vocabulary.

Choose Next.
Choose Add data source.
To sync the data source, choose Sync now on the data source details page.
Wait for the sync to complete.

Example of an authenticated website
If you want to crawl a site that has authentication, then in the Authentication section in the previous steps, you need to specify the authentication details. The following is an example if you selected Form authentication.

In the Source section, select Source URL and enter a URL. For this example, we use https://accounts.autodesk.com.
In the Authentication section, select Form authentication.
In the Web proxy section, specify your Secrets Manager secret. This is required for any option other than No authentication.

Choose Create and Add New Secret.
Enter the authentication details that you gathered previously.
Choose Save.

Test the solution
Now that you have ingested the content from the site into your Amazon Kendra index, you can test some queries.

Go to your index and choose Search indexed content.
Enter a sample search query and test out your search results (the results will vary based on the contents of the site you crawled and the query you enter).

Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from the site you crawled.
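You can also run the same kind of query programmatically. The following minimal boto3 sketch assumes you have the ID of the index created earlier; the query text is just an example for the site crawled in this post.

```python
# Sketch: query the Amazon Kendra index with boto3 and print the top results.
import boto3

kendra = boto3.client("kendra")
index_id = "your-index-id"  # placeholder: use the ID of the index you created

response = kendra.query(
    IndexId=index_id,
    QueryText="What is the FDA responsible for?",  # example query for the crawled site
)

for item in response.get("ResultItems", [])[:3]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
    print(f"{title}\n{excerpt}\n")
```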
Clean up
To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra Web Crawler V2, delete that data source.
Conclusion
With the new Amazon Kendra Web Crawler V2, organizations can crawl any website that is public or behind authentication and use it for intelligent search powered by Amazon Kendra.
To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide. For more information on how you can create, modify, or delete metadata and content when ingesting your data, refer to Enriching your documents during ingestion and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.

About the Authors
Jiten Dedhia is a Sr. Solutions Architect with over 20 years of experience in the software industry. He has worked with global financial services clients, providing them advice on modernizing by using services provided by AWS.
Gunwant Walbe is a Software Development Engineer at Amazon Web Services. He is an avid learner and keen to adopt new technologies. He develops complex business applications, and Java is his primary language of choice.

Meet Mistral-7B-v0.1: A New Large Language Model on the Block

Mistral-7B-v0.1 is one of the most recent advancements in artificial intelligence (AI) for large language models (LLMs). Mistral AI’s latest LLM packs 7 billion parameters, making it one of the more capable openly available models of its size.

Mistral-7B-v0.1 is a transformer model, a type of neural network that is especially useful for NLP applications. Its ability to generate text, translate languages, write various forms of creative content, and answer questions informatively results from its training on a large dataset consisting of text and code.

Compared to other openly available LLMs of a similar size, Mistral-7B-v0.1 reports stronger results across a range of standard language understanding and generation benchmarks, indicating that it is among the more capable models currently accessible at this scale.

The following architectural options were used to create the Mistral-7B-v0.1 transformer model.

Grouped-query attention (GQA)

Sliding-window attention

Byte-fallback BPE tokenizer

Some examples of where Mistral-7B-v0.1 could be useful are as follows:

Mistral-7B-v0.1 is useful for various natural language processing (NLP) applications, including machine translation, text summarization, and question-answering.

Mistral-7B-v0.1 can be used for creative writing, generating poems, code, screenplays, musical pieces, emails, letters, and more.

Mistral-7B-v0.1 can be used for code generation in many different languages.

Mistral-7B-v0.1 can be used in education to deliver individualized lessons to students.

As a customer care tool, Mistral-7B-v0.1 can be used to develop chatbots and other assistance applications.

For more details, see the model card at https://huggingface.co/mistralai/Mistral-7B-v0.1.
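If you want to try the model yourself, the following is a minimal sketch using the Hugging Face transformers library. The dtype, device placement, and generation settings are illustrative assumptions; running the 7B model comfortably requires a GPU with enough memory (or quantization), and device_map="auto" additionally requires the accelerate package.

```python
# Minimal sketch: load Mistral-7B-v0.1 from the Hugging Face Hub and generate text.
# dtype/device settings are illustrative; a GPU with sufficient memory (or
# quantization) is assumed, and device_map="auto" requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce the memory footprint
    device_map="auto",           # place weights on the available GPU(s)
)

inputs = tokenizer("Large language models are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```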

Although Mistral-7B-v0.1 is still early in its development, it already shows considerable promise. It represents a meaningful step forward in the evolution of openly available language models and could change how we build and interact with AI-powered applications.

Check out the Project Page. All Credit For This Research Goes To the Researchers on This Project.

The post Meet Mistral-7B-v0.1: A New Large Language Model on the Block appeared first on MarkTechPost.