Large Language Models (LLMs) have taken the world by storm thanks to their remarkable performance and potential across a diverse range of tasks, including text generation, language understanding, and summarization. The downside to their widespread adoption is the enormous number of model parameters, which demands significant memory capacity and specialized hardware for inference. As a result, deploying these models has been quite challenging.
One way to reduce the computational cost of inference is quantization, i.e., lowering the numerical precision of the weights and activations of a neural network. INT8 and weight-only quantization are two common ways to cut inference cost. These methods, however, are generally optimized for CUDA GPUs and do not necessarily carry over to CPUs.
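To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 weight quantization, the simplest form of the technique mentioned above. The function names and the per-tensor granularity are illustrative assumptions for this sketch, not the paper's method:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map FP32 weights to [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0  # one FP32 scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by scale / 2
```

Storing `q` takes a quarter of the memory of the FP32 tensor, at the cost of a rounding error no larger than half the scale; production kernels refine this basic scheme with finer granularity.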
The authors of this research paper from Intel propose an effective way to deploy LLMs efficiently on CPUs. Their approach supports an automatic INT4 weight-only quantization flow, in which low precision is applied only to the model weights while the activations are kept at higher precision. They have also designed a dedicated LLM runtime with highly optimized kernels that accelerate inference on CPUs.
The quantization flow is built on Intel Neural Compressor and allows tuning over different quantization recipes, granularities, and group sizes to generate an INT4 model that meets the accuracy target. The model is then passed to the LLM runtime, a specialized environment designed to evaluate the performance of the quantized model and to provide efficient inference of LLMs on CPUs.
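The group sizes mentioned above refer to how many consecutive weights share one scale: smaller groups track the weight distribution more closely at the cost of storing more scales. A NumPy sketch of symmetric group-wise INT4 quantization follows; the group size of 32 and the [-7, 7] range are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 32):
    """Symmetric group-wise INT4 quantization.

    Each contiguous group of `group_size` weights along the last axis
    gets its own FP32 scale; values are mapped to the 4-bit range [-7, 7].
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.max(np.abs(groups), axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Expand each group back to FP32 using its per-group scale."""
    rows = q.shape[0]
    return (q.astype(np.float32) * scales).reshape(rows, -1)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 64)).astype(np.float32)
q, scales = quantize_int4_groupwise(w, group_size=32)
w_hat = dequantize_groupwise(q, scales)
err = np.max(np.abs(w - w_hat))  # bounded by half the largest group scale
```

A tuning flow like the one described would sweep the group size (and other recipe choices) and keep the smallest configuration whose accuracy still meets the target; in practice two INT4 values are also packed into a single byte for storage, which this sketch omits for clarity.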
For their experiments, the researchers selected popular LLMs spanning a diverse range of parameter sizes (from 7B to 20B). They evaluated the FP32 and INT4 models on open-source datasets and observed that the accuracy of the quantized model was nearly on par with that of the FP32 model. Additionally, they compared the latency of next-token generation and found that the LLM runtime outperforms the ggml-based solution by up to 1.6x.
In conclusion, this research paper presents a solution to one of the biggest challenges associated with LLMs: inference on CPUs. Traditionally, these models require specialized hardware like GPUs, which renders them inaccessible to many organizations. This paper presents an INT4 quantization flow along with a specialized LLM runtime to provide efficient inference of LLMs on CPUs. When evaluated on a set of popular LLMs, the method demonstrated an advantage over ggml-based solutions and achieved accuracy on par with FP32 models. There is, however, scope for further improvement, and the researchers plan to empower generative AI on PCs to meet the growing demand for AI-generated content.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.
The post Intel Researchers Propose a New Artificial Intelligence Approach to Deploy LLMs on CPUs More Efficiently appeared first on MarkTechPost.