By capitalizing on representations shared across languages, cross-lingual learning is known to improve the accuracy of NLP models on low-resource languages (LRLs), which have limited data for model training. Even for state-of-the-art (SOTA) models, however, there is a significant accuracy gap between high-resource languages (HRLs) and LRLs, driven by the relative scarcity of LRL pre-training data. In commercial settings, language-level accuracy targets are frequently imposed. This is where techniques such as neural machine translation, transliteration, and label propagation on similar data become useful, since they can synthetically augment the existing training data.
These methods can increase the quantity and quality of training data without resorting to prohibitively expensive manual annotation. However, because of the limitations of machine translation, translation alone may still fall short of commercial accuracy targets, even though it usually improves LRL accuracy.
A team of researchers from Amazon proposes an approach to improving LRL accuracy by using active learning to collect labeled data selectively. Active learning for multilingual data has been studied before, but most prior work focuses on training a model for a single language, whereas the goal here is a single multilingual model that performs well across languages. The proposed method, Language Aware Active Learning for Multilingual Models (LAMM), builds on earlier work showing that active learning can improve model performance across languages while using a single model; that earlier work, however, offers no way to specifically target and improve an LRL’s accuracy. Because they keep acquiring labels for languages that have already exceeded their accuracy targets, today’s state-of-the-art active learning algorithms waste manual annotation in settings where meeting language-level targets is essential. To improve LRL accuracy without negatively impacting HRL performance, the researchers present an active-learning-based strategy for collecting labeled data strategically. LAMM increases the likelihood of achieving accuracy targets across all relevant languages.
The researchers frame LAMM as a multi-objective optimization problem. The objective is to pick unlabeled examples that are (a simplified sketch follows the list):
Uncertain (the model has low confidence in its predictions)
From languages where the classifier’s performance still falls short of its accuracy targets
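The paper formalizes this selection as an optimization problem; purely as an illustrative sketch (not the authors’ actual formulation), one could score each unlabeled example by its predictive entropy weighted by how far its language is from its accuracy target. All names below, such as select_batch and accuracy_target, are hypothetical.

```python
import numpy as np

def entropy(probs):
    """Predictive entropy of each example's class distribution (higher = more uncertain)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_batch(probs, languages, val_accuracy, accuracy_target, budget):
    """Pick `budget` unlabeled examples, favoring uncertain ones from languages
    that are still below their accuracy targets.

    probs:            (n_examples, n_classes) predicted class probabilities on the unlabeled pool
    languages:        per-example language codes
    val_accuracy:     dict mapping language -> current validation accuracy
    accuracy_target:  dict mapping language -> desired accuracy for that language
    budget:           total number of annotations to request
    """
    uncertainty = entropy(probs)
    # Remaining accuracy gap per language (0 once the target is already met).
    gap = {lang: max(accuracy_target[lang] - val_accuracy[lang], 0.0) for lang in accuracy_target}
    # Weight uncertainty by the language's gap, so languages that already hit
    # their targets receive essentially none of the labeling budget.
    scores = np.array([uncertainty[i] * gap[lang] for i, lang in enumerate(languages)])
    return np.argsort(-scores)[:budget]
```

The actual LAMM objective balances these criteria through its multi-objective formulation rather than a simple product, but the weighting above captures the core idea: annotation effort flows to uncertain examples in languages that have not yet met their targets.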
Amazon researchers compare LAMM’s performance against two baselines on four multilingual classification datasets in the standard pool-based active learning setup. Two of the datasets are public (Amazon Reviews and MLDoc), and two are internal Amazon multilingual product classification datasets. The baseline strategies, sketched in code after the list, are:
Least Confidence (LC): collects the samples with the highest predictive entropy, i.e., those the model is least certain about.
Equal Allocation (EA): divides the annotation budget equally across languages and collects high-entropy samples within each language to fill its per-language budget.
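For comparison, here is a rough sketch of how these two baselines could be implemented on top of the same entropy-based uncertainty scores; the helper names are again illustrative rather than taken from the paper.

```python
import numpy as np
from collections import defaultdict

def least_confidence(uncertainty, budget):
    """LC: take the `budget` most uncertain examples, regardless of language."""
    uncertainty = np.asarray(uncertainty)
    return np.argsort(-uncertainty)[:budget]

def equal_allocation(uncertainty, languages, budget):
    """EA: split the budget evenly across languages, then take the most
    uncertain examples within each language's share."""
    uncertainty = np.asarray(uncertainty)
    langs = sorted(set(languages))
    per_lang = budget // len(langs)
    # Group unlabeled-pool indices by language.
    pool = defaultdict(list)
    for idx, lang in enumerate(languages):
        pool[lang].append(idx)
    selected = []
    for lang in langs:
        idxs = np.array(pool[lang])
        ranked = idxs[np.argsort(-uncertainty[idxs])]
        selected.extend(ranked[:per_lang].tolist())
    return selected
```

Neither baseline looks at per-language accuracy: LC tends to over-sample whatever languages the model is most uncertain about, and EA keeps spending budget on languages that have already met their targets, which is exactly the waste LAMM is designed to avoid.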
They find that LAMM outperforms the baselines on all LRLs while only slightly underperforming on HRLs. LAMM reduces the share of HRL labels by 62.1%, while its AUC drops by only 1.2% relative to LC. Across the four classification datasets, two publicly available and two proprietary, they show that LAMM improves LRL performance by 4–11% relative to strong baselines.
Check out the Paper. All credit for this research goes to the researchers on this project.