Meet Med-PaLM Multimodal (Med-PaLM M): A Large Multimodal Generative M …

Large Language Models (LLMs) have advanced in almost every domain, from healthcare and finance to education and social media. In medicine, clinicians rely on a wide variety of data sources to deliver high-quality care, including clinical notes, lab results, vital signs and observations, medical images, and genomics data. Despite steady progress in biomedical Artificial Intelligence (AI), most AI models in use today are restricted to a single task and analyze data from a single modality.

Foundation models offer a chance to transform medical AI. Because they are trained on enormous volumes of data with self-supervised or unsupervised learning objectives, they can be adapted to different tasks and settings through in-context learning or few-shot fine-tuning. Unified biomedical AI systems that can interpret data from several modalities with complex structures, and handle a variety of medical problems, are now being developed. Such models are expected to have an impact on everything from basic biomedical research to patient care.

To facilitate the development of generalist biomedical AI systems, a team of researchers from Google Research and Google DeepMind has introduced MultiMedBench, a benchmark made up of 14 different biomedical tasks. These tasks include answering medical questions, analyzing dermatology and mammography images, generating and summarizing radiology reports, and identifying genomic variants.

The authors have also provided a proof of concept called Med-PaLM Multimodal (Med-PaLM M), a large multimodal generative model that flexibly encodes and interprets many kinds of biomedical data, such as clinical language, medical images, and genomic data. Compared with state-of-the-art models, Med-PaLM M achieved competitive or better performance on all MultiMedBench tasks, and in many cases it performed noticeably better than specialized models.

The team has also reported several notable Med-PaLM M capabilities. They give evidence of positive transfer learning across tasks and of zero-shot generalization to new medical concepts and tasks. The system also exhibits an emergent capacity for zero-shot medical reasoning, meaning it can reason about medical situations for which it was not specifically trained. Despite these encouraging results, the team stresses that additional work is needed before such generalist biomedical AI systems can be used in practical settings. Still, the published results mark a considerable step forward and offer encouraging possibilities for future AI-powered medical solutions.

The team has summarized the contributions in the following way.

The work demonstrates the potential of generalist biomedical AI systems for medical applications, though access to large-scale biomedical data for training, and for validating real-world performance, remains a challenge.

MultiMedBench comprises 14 unique tasks spanning a range of biomedical modalities. Med-PaLM M is introduced as the first multitask generalist biomedical AI system that does not require task-specific modifications.

The AI system demonstrates emergent abilities, such as generalization to new medical concepts and zero-shot medical reasoning.

Potential clinical utility is indicated by a human review of Med-PaLM M’s outputs, particularly in producing chest X-ray reports.

In a side-by-side comparison, radiologists preferred Med-PaLM M reports over reports written by radiologists in up to 40.50% of cases, and the generated reports showed low average error rates.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 27k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Meet Med-PaLM Multimodal (Med-PaLM M): A Large Multimodal Generative Model that Flexibly Encodes and Interprets Biomedical Data appeared first on MarkTechPost.

This AI Paper from China Proposes HQTrack: An AI Framework for High-Qu …

Visual object tracking is the backbone of numerous subfields within computer vision, including robot vision and autonomous driving. The task is to reliably locate a target object throughout a video sequence. Many state-of-the-art algorithms compete in the Visual Object Tracking (VOT) challenge, one of the most important competitions in the tracking field.

The Visual Object Tracking and Segmentation competition (VOTS2023) removes some of the restrictions imposed by previous VOT challenges so that participants can think about object tracking more broadly. VOTS2023 combines short-term and long-term tracking, and single-target and multi-target tracking, using target segmentation as the only position specification. This introduces new difficulties, such as precise mask estimation, multi-target trajectory tracking, and recognizing relationships between objects.

A new study by Dalian University of Technology, China, and DAMO Academy, Alibaba Group, presents a system called HQTrack, which stands for High-Quality Tracking. It consists mainly of a video multi-object segmenter (VMOS) and a mask refiner (MR). VMOS is an improved variant of DeAOT: to better perceive small objects in complex scenes, the researchers cascade a gated propagation module (GPM) at 1/8 scale, and they use Intern-T as the feature extractor to improve the ability to distinguish between different types of objects. VMOS keeps only the most recently used frame in long-term memory and discards older ones to save space. The tracking masks can be improved further with a large segmentation model, but SAM has particular difficulty predicting objects with complicated structures, and such objects appear frequently in the VOTS challenge.

The team therefore uses a pre-trained HQ-SAM model to further refine the tracking masks. The outer enclosing boxes of the predicted masks are fed into HQ-SAM as box prompts, together with the original images, to obtain refined masks, and the final tracking results are then selected from the VMOS and MR outputs. HQTrack finished in second place in the VOTS2023 competition with a quality score of 0.615 on the test set.
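
The box-prompt step described above can be illustrated with a minimal sketch. It only shows how an outer enclosing box might be computed from a binary mask. The function name `mask_to_box` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration and are not taken from the HQTrack code. The call into HQ-SAM itself is deliberately left out.

```python
def mask_to_box(mask):
    """Return the outer enclosing box (x_min, y_min, x_max, y_max) of a binary mask.

    `mask` is a list of rows, each a list of 0/1 values.
    Returns None when the mask contains no foreground pixel.
    (Hypothetical helper; the box layout is an assumption.)
    """
    if not mask:
        return None
    # Rows and columns that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
example_box = mask_to_box(example_mask)
```

In a pipeline like the one described, a box of this kind would be passed to the refinement model as a prompt alongside the original frame.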

The post This AI Paper from China Proposes HQTrack: An AI Framework for High-Quality Tracking Anything in Videos appeared first on MarkTechPost.

Stack Overflow Launches Overflow: The Integration of Developers Commu …

Stack Overflow, the renowned platform for developers seeking answers and knowledge, has taken a momentous step forward by announcing its new roadmap, ushering in a new era marked by the integration of generative AI. Aptly named OverflowAI, this visionary initiative promises to enhance the platform’s capabilities, improve search functionalities, and provide a seamless experience for developers across the globe.

The cornerstone of this venture is the introduction of semantic search, an upgrade from the traditional lexical search method. By using a vector database, Stack Overflow aims to return more intelligent responses to user queries and match them closely to the topics users are researching. The objective is a conversational, human-centric search experience, powered by GenAI, in which developers can quickly find reliable and accurate solutions to their problems. What sets this approach apart is its focus on trust and attribution, ensuring that contributors’ efforts are recognized and rewarded.
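
To make the difference from lexical search concrete, here is a minimal sketch of vector-based ranking. It is not Stack Overflow's implementation: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `semantic_search`, `INDEX`, and `QUERY_VEC` are hypothetical names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, index, top_k=2):
    """Rank (title, vector) pairs by similarity to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:top_k]]


# Toy "vector database": the vectors are made up for illustration only.
INDEX = [
    ("How do I reverse a list in Python?", [0.9, 0.1, 0.0]),
    ("Undo the last Git commit", [0.0, 0.2, 0.9]),
    ("Flip the order of array elements", [0.8, 0.3, 0.1]),
]

# Stand-in for the embedding of a query such as "invert sequence order".
QUERY_VEC = [0.85, 0.2, 0.05]

results = semantic_search(QUERY_VEC, INDEX)
```

The query shares no keywords with either of the two list-reversal questions, yet both rank above the Git question because their vectors point in a similar direction. A purely lexical match would have found none of them.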

The benefits of OverflowAI extend beyond the public platform as the same enhanced search capabilities are set to be integrated into Stack Overflow for Teams. This means customers can swiftly find relevant answers while leveraging trusted sources, including Stack Overflow for Teams, the public platform, and other knowledge repositories like Confluence and GitHub.

One of the most exciting aspects of OverflowAI is the introduction of “enterprise knowledge ingestion” for Stack Overflow for Teams. This groundbreaking feature enables users to build a comprehensive knowledge base in minutes by leveraging existing, accurate, and trusted content. Utilizing AI and machine learning algorithms, the system will create initial tagging structures and recommend relevant questions and answers based on the team’s most frequent areas of inquiry. This AI-powered process efficiently kick-starts a Stack Overflow community, liberating developers to focus on curating and refining content to ensure accuracy and relevance. With indicators of quality and accuracy, such as votes, edits, comments, and views, all knowledge remains discoverable and reusable within the internal community, creating a vibrant hub of valuable information.

To further enhance accessibility, Stack Overflow is bringing the knowledge base of Stack Overflow for Teams into StackPlusOne, its new chatbot for Slack. The integration gives instant access to solutions for technical challenges, drawing on both the Teams instance and the community-validated sources of Stack Overflow’s public platform. GenAI delivers responses in a conversational format, so that even less technical members of an organization can readily understand the information.

Acknowledging that developers spend a significant portion of their time within an Integrated Development Environment (IDE), Stack Overflow endeavors to facilitate the coding process with its IDE extension for Visual Studio Code, powered by OverflowAI. This innovative extension will deliver validated content from both the public platform and Stack Overflow for Teams, providing developers with personalized solutions for efficient and effective problem-solving. The extension also allows users to document new learnings and solutions, contributing to collective knowledge.

Not stopping at integrating AI into the platform, Stack Overflow is actively nurturing a community of knowledge-sharing centered around AI. GenAI Stack Exchange is the designated hub for discussions about prompt engineering, AI optimization, and staying up-to-date with the ever-evolving GenAI tools. Additionally, the introduction of “Discussions” in Stack Overflow’s Natural Language Processing (NLP) Collective provides a dedicated space for debating technical approaches, exploring implementation strategies, and sharing perspectives to aid developers in making well-informed technical decisions.

Stack Overflow’s commitment to fostering trust and transparency is the driving force behind this groundbreaking venture. Through extensive research, including a Developer Survey with over 90,000 participants, the platform recognizes the prevailing concerns surrounding AI technologies and seeks to alleviate them. By grounding their AI responses in the vast knowledge base of over 58 million questions and answers on Stack Overflow, as well as proprietary knowledge within Stack Overflow for Teams, the company ensures that users can rely on the output of these cutting-edge technologies with confidence.

The post Stack Overflow Launches Overflow: The Integration of Developers Community and AI appeared first on MarkTechPost.

A New AI Research from China Proposes SHIP: A Plug-and-Play Generative …

This paper proposes SyntHesIzed Prompts (SHIP), a novel approach to improving existing fine-tuning methods.

Fine-tuning: After pre-training, the model is then fine-tuned on a smaller, task-specific dataset. This involves continuing the training process on the new data, often with a smaller learning rate. The idea is to tweak the generalized knowledge the model has gained from pre-training to make it more applicable to the specific task.

The researchers tackle the scenario in which some classes have no data. They train a generative model that synthesizes features from class names, which makes it possible to generate features for categories without data.

Generating features for categories without data refers to the process of synthesizing representations for classes or categories that are not present in the training dataset. This is particularly useful in scenarios where collecting real data for certain classes might be challenging or impossible.

The researchers then fine-tuned CLIP using both the original labeled and the newly synthesized features with off-the-shelf methods. However, a major obstacle is that generative models typically require a substantial amount of data to train, which contradicts their goal of data efficiency. They proposed to utilize a variational autoencoder (VAE) as the framework, which is easier to train and more effective in low-data scenarios compared to models that require adversarial training.

While both GANs and VAEs are generative models capable of creating new data samples, they differ significantly in their architecture, objectives, and training methods. GANs are known for their ability to generate high-quality, realistic samples but can be challenging to train. VAEs, on the other hand, provide a probabilistic framework that can be easier to work with, especially in scenarios with limited data, but might not produce as sharp or realistic samples as GANs.

CLIP (Contrastive Language-Image Pretraining) is a model developed by OpenAI that learns a shared embedding space for images and text, so that images can be matched with textual descriptions and vice versa. It has been pretrained on a large-scale dataset and has aligned visual and language representations. The pretrained language encoder aids in generating more realistic features. The paper aims to enhance the performance of CLIP fine-tuning methods by utilizing synthesized data. The authors conducted comprehensive experiments on base-to-new generalization, cross-dataset transfer learning, and generalized zero-shot learning, reporting state-of-the-art performance.

The proposed model architecture leverages the VAE framework to encode and generate features, integrating with CLIP to extract image features and reconstruct them. During training, the model learns to encode the features into a latent space and then reconstruct them. During the generating stage, it uses this learned encoding to synthesize features for new classes, allowing for fine-tuning of CLIP even when some classes have no data. The novel CLIP-based generator, comprising a lightweight MLP and a frozen CLIP text encoder, plays a key role in transforming the latent code and constructing the final prompts for feature reconstruction. 
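
The encode-then-reconstruct training described above rests on standard VAE machinery. The following is a minimal sketch of those generic pieces only, written in plain Python. It is not the SHIP code: the CLIP feature extractor, the MLP, and the frozen text encoder are omitted, and the function names and the squared-error reconstruction term are assumptions made for illustration.

```python
import math
import random


def reparameterize(mu, logvar, eps=None):
    """Sample z = mu + sigma * eps, the usual VAE reparameterization trick."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def reconstruction_error(x, x_hat):
    """Squared error between a feature vector and its reconstruction (an assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction term plus KL regularizer."""
    return reconstruction_error(x, x_hat) + kl_to_standard_normal(mu, logvar)


# Deterministic check values: with eps fixed to zero, z equals mu.
z_example = reparameterize([1.0, 2.0], [0.0, 0.0], eps=[0.0, 0.0])
loss_example = vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
```

At generation time, a model of this kind would draw the latent code from the prior instead of the encoder and decode it for a class name that has no training data.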

Experimental Results observed by the researchers:

Base-to-New Generalization: The experiments were conducted on 11 diverse image classification datasets, including ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101. The datasets were partitioned into base classes and new classes, with training performed on base classes with 16 samples per class. The evaluation was done on both base and new classes.

Generalized Zero-Shot Setting: The paper also evaluated base-to-new generalization under a more realistic generalized zero-shot setting, where the base and new data are mixed together in the test dataset. The results indicated a significant decrease in performance for previous methods, but the proposed method, SHIP, continued to improve performance in new classes.

Comparison with Other Methods: The results were compared with other methods, including CLIP, CoOp, CLIP-Adapter, and Tip-Adapter. The proposed method, SHIP, showed improved performance in new classes across various datasets.


The paper proposed a novel SyntHesIzed Prompts (SHIP) approach to improve existing fine-tuning methods, particularly in scenarios where some classes have no data. The method achieved state-of-the-art performance on various tasks by synthesizing features for categories without data and fine-tuning CLIP using both original labeled and newly synthesized features. The paper acknowledged additional training costs as a limitation and expressed an intention to explore the applicability of SHIP in dense prediction tasks in future research.

Overall, the paper presents a significant contribution to the field by addressing the challenge of data scarcity for certain classes and enhancing the performance of CLIP fine-tuning methods using synthesized data.

The post A New AI Research from China Proposes SHIP: A Plug-and-Play Generative AI Approach to Improve Existing Fine-Tuning Methods appeared first on MarkTechPost.

ETH Zurich Researchers Introduce LMQL: A Programming Language For Lang …

The performance of large language models on various tasks, including question answering and code generation, has been impressive. Given an input, a language model automatically generates a statistically likely continuation of the sequence. Users exploit this to instruct these models through natural-language instructions or examples, allowing them to perform various downstream tasks. More complex prompting techniques can involve interaction between the language model, the user, and external tools such as calculators. However, achieving state-of-the-art performance or adapting language models to specific tasks still requires implementing complicated task-specific and model-specific programs, which may in turn require ad hoc interaction.

In light of this, researchers from Switzerland introduced the concept of language model programming (LMP). LMP extends language model prompting beyond pure text prompts by combining text prompting with scripting. In addition, LMP lets users constrain the outputs the language model produces. This makes it easy to adapt to many tasks while abstracting away the internals of the language model. To enable LMP, the researchers implement LMQL (for Language Model Query Language). LMQL uses the constraints and control flow from an LMP prompt to generate an efficient inference procedure that reduces the number of costly calls to the underlying language model. They show that LMQL can easily capture a variety of state-of-the-art prompting methods, notably those with interactive flows that are difficult to implement with existing high-level APIs. The evaluation shows that LMQL maintains or improves accuracy on several downstream tasks while substantially reducing computation or, in the case of pay-to-use APIs, cost.

How does it work?

LMQL is declarative: a query states the desired outcome of a task, while control flow is written in ordinary scripting code. It borrows ideas from SQL but builds them on top of Python. Queries can combine text prompts with scripted logic.

The paper identifies five primary components of the language’s grammar. The first is the decoder clause, which specifies the decoding procedure used to generate text, for example argmax, sampling, or beam search. The choice of decoder affects the quality and diversity of the generated wording.

The main tool for interacting with the language model is the query block, which is written in Python syntax. Each top-level string in the query block is a direct query to the language model. The model/from clause identifies the query’s target model, that is, the language model used to generate text. The where clause lets users state constraints on the output: it specifies the properties that the text produced by the language model must satisfy.

LMQL users can place sophisticated logical constraints on the results generated by the language model. Token-level prediction masks are generated automatically from these constraints so they can be enforced eagerly during text generation. As a result, the constraints are strictly enforced, and the model only produces content that meets the criteria. The stronger guarantees on output format also make multi-part prompting and integration with other tools easier.
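
The idea of a token-level mask can be sketched in plain Python. This is an illustration of the general technique and not LMQL syntax or LMQL's implementation: the vocabulary, the fixed scores, and the helper names `allowed_token_mask` and `constrained_greedy_decode` are all hypothetical. The constraint used here is simply "the output must be one of a fixed set of strings".

```python
def allowed_token_mask(prefix, vocab, allowed_outputs):
    """One boolean per vocabulary token.

    A token is allowed if prefix + token can still be completed
    to one of the allowed outputs.
    """
    return [
        any(target.startswith(prefix + token) for target in allowed_outputs)
        for token in vocab
    ]


def constrained_greedy_decode(score_fn, vocab, allowed_outputs, max_steps=10):
    """Greedy decoding that only ever picks tokens permitted by the mask."""
    text = ""
    for _ in range(max_steps):
        if text in allowed_outputs:
            break
        mask = allowed_token_mask(text, vocab, allowed_outputs)
        scores = score_fn(text)
        candidates = [(s, tok) for s, tok, ok in zip(scores, vocab, mask) if ok]
        if not candidates:
            break
        text += max(candidates)[1]
    return text


VOCAB = ["pos", "neg", "itive", "ative", "maybe"]
ALLOWED = ["positive", "negative"]


def toy_scores(_prefix):
    """Stand-in for a language model: fixed scores that favour 'maybe'."""
    return [0.1, 0.3, 0.2, 0.2, 0.9]


decoded = constrained_greedy_decode(toy_scores, VOCAB, ALLOWED)
```

Without the mask, greedy decoding would pick "maybe" at the first step because it has the highest score. With the mask, invalid tokens are never considered, so no output has to be generated and then discarded.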

Main Contributions

Several problems with current LM prompting methods have been identified and addressed by the authors of this study, who introduce the innovative paradigm of language model programming.

Scripted prompting and output restricting are two features that LMQL, a high-level query language for LMs, offers.

A formal description of final and follow abstractions for eager, partial evaluation semantics. With these, a model-specific token mask for LM decoding can be generated automatically from high-level constraints alone.

A thorough evaluation of LMQL showing that a variety of basic and sophisticated prompting approaches can be expressed as short, easy-to-understand LMQL programs, which lower inference cost and execution time by as much as 80%.

Case studies done by researchers show that:

LMQL’s high level of expressivity means that many modern, state-of-the-art techniques can be implemented with significantly fewer lines of code than their comparable Python-based counterparts.

LMQL greatly reduces the number of model queries, and therefore improves efficiency and run time. Thanks to its token-level validation, constraints can be enforced dynamically without resorting to chunk-wise decoding and backtracking.

LMQL does not reduce the model’s accuracy. In some cases, the imposed constraints lead to marginally higher accuracy.

In addition, the researchers show that LMQL can provide significant monetary savings with paid, API-gated models, because it reduces the number of billable tokens. Finally, they point out that these case studies are not a substitute for a comprehensive user study of LMQL, in which the impact and usability of the language would be evaluated with real-world prompt engineers. The lack of such a study remains a threat to the validity of the claims about usability.

To conclude, the researchers present Language Model Programming as a fresh approach to interacting with (large) language models. They introduce LMQL, a high-level query language with a straightforward syntax, together with efficient evaluation semantics that allow fast query processing. Their case studies show how sophisticated prompting methods can be translated into simple, clear, and fast LMQL code that can cut computing expenses by as much as 80 percent.

The post ETH Zurich Researchers Introduce LMQL: A Programming Language For Language Model Interaction appeared first on MarkTechPost.

Meet Advanced Reasoning Benchmark (ARB): A New Benchmark To Evaluate L …

Natural Language Processing has evolved significantly in recent years, especially with the creation of sophisticated language models. Well-known models such as GPT 3.5, GPT 4, BERT, and PaLM have shown notable advances on almost all natural language tasks, including translation and reasoning. A number of benchmarks are used to assess and evaluate these developments in the field of Artificial Intelligence. A benchmark is a collection of standardized tasks designed to test the abilities of large language models (LLMs).

GLUE and SuperGLUE were among the first language understanding benchmarks. Models such as BERT and GPT-2 soon began beating them, sparking a race between model development and benchmark difficulty. Scaling up models, by making them bigger and training them on larger datasets, has been the key to better performance. LLMs now perform so well on many benchmarks of knowledge and quantitative reasoning that, as scores approach the ceiling, these benchmarks are no longer useful for assessing the models’ capabilities.

To address these limitations, a team of researchers has proposed a new benchmark called ARB (Advanced Reasoning Benchmark). ARB contains more difficult problems in a variety of subject areas, such as mathematics, physics, biology, chemistry, and law. In contrast to earlier benchmarks, ARB focuses on complex reasoning problems in order to give a more demanding test of LLM performance. The team has also introduced a set of math and physics questions as a subset of ARB that demand sophisticated symbolic reasoning and in-depth subject knowledge. These problems are exceptionally difficult and beyond the reach of LLMs as they exist today.

The team evaluated recent models, including GPT-4 and Claude, on ARB. The results show that these models struggle with the harder problems, scoring well below 50% on the more difficult tasks in ARB. The team also demonstrated a rubric-based evaluation approach in which GPT-4 evaluates its own intermediate reasoning steps as it attempts ARB problems. This broadens the scope of the evaluation and sheds light on the model’s problem-solving strategy.
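
A rubric-based score can be pictured as partial credit over intermediate steps. The sketch below is a hypothetical illustration of that idea and not the procedure or rubric format used by the ARB authors: the criteria, point values, and the `rubric_score` helper are all invented for this example, and the step of asking a model to fill in the awarded flags is left out.

```python
def rubric_score(rubric, awarded):
    """Fraction of rubric points earned.

    `rubric` is a list of (criterion, points) pairs.
    `awarded` maps a criterion to True when the grader judged it satisfied.
    """
    total = sum(points for _, points in rubric)
    earned = sum(points for criterion, points in rubric if awarded.get(criterion, False))
    return earned / total


# Hypothetical rubric for a single physics problem.
EXAMPLE_RUBRIC = [
    ("sets up the correct conservation law", 2),
    ("carries out the algebra correctly", 2),
    ("states the final answer with units", 1),
]

# Hypothetical grader judgments for one solution attempt.
EXAMPLE_AWARDED = {
    "sets up the correct conservation law": True,
    "carries out the algebra correctly": False,
    "states the final answer with units": True,
}

example_score = rubric_score(EXAMPLE_RUBRIC, EXAMPLE_AWARDED)
```

Scoring of this kind gives credit for sound intermediate reasoning even when the final answer is wrong, which a single right-or-wrong check cannot do.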

The symbolic subset of ARB has been subjected to human review as well. Human annotators were asked to solve the problems and provide their own evaluations. The human evaluators and GPT-4’s rubric-based evaluation scores showed promising agreement, suggesting that the model’s self-assessment aligns reasonably well with human judgment. With hundreds of problems requiring expert reasoning in quantitative fields, where LLMs have typically had difficulty, the new dataset is considerably more demanding than previous benchmarks.

In contrast to the multiple-choice questions in past benchmarks, a sizable share of the problems are short-answer and open-response questions, which makes LLM outputs harder to evaluate. The combination of expert-level reasoning tasks and more realistic question formats allows a more accurate evaluation of the models’ capacity to handle complicated, real-world problems.

Announcement from DuckAI (@TheDuckAI), July 26, 2023: "We are releasing the Advanced Reasoning Benchmark dataset for LLMs (ARB)! – Evaluates SotA LLMs on ARB, on which even GPT4 struggles – Explores the feasibility of letting LLMs generate and use rubrics to evaluate generated solutions. proj:"

The post Meet Advanced Reasoning Benchmark (ARB): A New Benchmark To Evaluate Large Language Models appeared first on MarkTechPost.

Meet Mentat: An AI Tool That Assists You With Any Coding Task From You …

The term ‘Mentat’ originates from the Dune science fiction novels by Frank Herbert. In the novels, Mentats are humans trained to perform complex computation and analysis, tasks comparable to those handled by Artificial Intelligence, without the use of computers. They develop these abilities through extensive training and serve as advisors and analysts.

Mentat is an AI tool that assists with coding tasks from the command line and can coordinate edits across multiple files. It is still under active development, and users have reported some issues. The main one was an invalid-syntax error that appeared after installing the tool from GitHub, which was resolved by using a newer version of Python. The second was an SSL certificate error. Such errors typically arise from expired certificates, mismatched domains, self-signed certificates, incomplete certificate chains, revoked certificates, or weak cipher protocols. Suggested remedies are to confirm that the correct website is being accessed, to clear the browser’s cache and cookies, and to try a different browser if the issue persists.

Another limitation concerns large codebases, which cannot be included in full in the prompt sent to the LLM. The prompt is the text input through which the tool communicates with the model, so the suggested approach is to retrieve only a small, relevant portion of the codebase and include that in the prompt. A further question concerned the API used. According to the team, a user could also use a local Llama model instead of the OpenAI API. Mentat has continued to develop as these problems were addressed.
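
One simple way to picture retrieving a portion of a codebase for the prompt is to rank files by overlap with the request and stop when a size budget is reached. The sketch below does exactly that and nothing more. It is not how Mentat works: `select_files_for_prompt`, the keyword-overlap ranking, and the word-count token estimate are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase identifier-like words; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z_]+", text.lower())


def select_files_for_prompt(files, query, token_budget):
    """Pick the files most relevant to `query` that fit within `token_budget`.

    `files` maps a path to its source text. Relevance is the number of
    query words that also occur in the path or the file contents.
    """
    query_terms = set(tokenize(query))

    def relevance(item):
        path, text = item
        return len(query_terms & (set(tokenize(path)) | set(tokenize(text))))

    selected, used = [], 0
    for path, text in sorted(files.items(), key=relevance, reverse=True):
        if relevance((path, text)) == 0:
            break  # remaining files share no words with the query
        cost = len(tokenize(text))
        if used + cost > token_budget:
            continue  # too large for the remaining budget, try a smaller file
        selected.append(path)
        used += cost
    return selected


# Hypothetical miniature codebase.
EXAMPLE_FILES = {
    "auth/login.py": "def login(user, password): check password hash",
    "billing/invoice.py": "def total(invoice): sum line items",
    "auth/tokens.py": "def refresh token for user session",
}

small_context = select_files_for_prompt(EXAMPLE_FILES, "login user password", 10)
large_context = select_files_for_prompt(EXAMPLE_FILES, "login user password", 20)
```

Real tools are likely to use stronger signals than keyword overlap, such as embeddings or the project's import graph, but the budget-constrained selection step has the same shape.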

According to its developers, Mentat has applications in a variety of areas. It is particularly helpful on large projects, where it can coordinate changes across files, and it can also fix failing tests and clean up test code. Other application areas mentioned include financial analysis and forecasting, cyber security and threat analysis, healthcare, NLP, research, optimization, autonomous vehicles, gaming, and fraud detection.

Announcement from BioBootloader (@bio_bootloader), July 25, 2023: "Introducing Mentat – an open source, GPT-4 powered coding assistant! Mentat runs in your command line, giving it the context of your projects and allowing it to coordinate edits across multiple files! More videos and a link to github below:"

The post Meet Mentat: An AI Tool That Assists You With Any Coding Task From Your Command-Line Allowing It To Coordinate Edits Across Multiple Files appeared first on MarkTechPost.

Meet GETMusic: A Unified Representation and Diffusion Framework that c …

In recent years, significant progress has been made in music generation using machine learning models. However, achieving efficiency and substantial control over the results remains challenging. Previous attempts have struggled primarily because of limitations in music representations and model architectures.

Because source and target tracks can be combined in a vast number of ways, there is a need for a unified model that can handle comprehensive track generation tasks and produce the desired results. Current research in symbolic music generation falls into two categories based on the music representation adopted: sequence-based and image-based. The sequence-based approach represents music as a sequence of discrete tokens, while the image-based approach represents music as 2D images, with the piano roll as the typical choice. Piano rolls represent musical notes as horizontal lines, where the vertical position represents the pitch and the length of the line represents the duration.

To address the need for a unified model capable of generating arbitrary tracks, a team of researchers from China has developed a framework called GETMusic (GET stands for GEnerate music Tracks). GETMusic generates music track by track, conditioned on the tracks it is given as input. Users can create rhythms and add further elements to build the tracks they want. The framework can create music from scratch, and it can also produce guided and mixed tracks.

GETMusic uses a representation called GETScore and a discrete diffusion model called GETDiff. GETScore represents tracks in a 2D structure in which tracks are stacked vertically and progress horizontally over time. Each musical note is represented by a pitch token and a duration token. During training, GETDiff randomly selects tracks as either targets or sources and runs two processes: a forward process and a denoising process. In the forward process, GETDiff corrupts the target tracks by masking their tokens while leaving the source tracks intact as ground truth. In the denoising process, GETDiff learns to predict the masked target tokens based on the provided sources.
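The forward (masking) step described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the GETMusic code: the score is stored as a dictionary of tracks, each cell holds a (pitch, duration) pair, and the `[MASK]` token name, the note names, and the duration symbols are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical name for the mask token

def forward_mask(score, target_tracks, mask_ratio, rng):
    """Mask cells of the target tracks only; source tracks are copied unchanged."""
    corrupted = {name: list(cells) for name, cells in score.items()}
    for name in target_tracks:
        for step in range(len(corrupted[name])):
            if rng.random() < mask_ratio:
                corrupted[name][step] = (MASK, MASK)
    return corrupted

def masked_positions(corrupted):
    """Positions a denoising model would be asked to predict."""
    return [(name, step)
            for name, cells in corrupted.items()
            for step, cell in enumerate(cells)
            if cell == (MASK, MASK)]

# Toy score: tracks stacked vertically, time running horizontally.
score = {
    "melody": [("C4", "q"), ("E4", "q"), ("G4", "h")],
    "bass":   [("C2", "h"), ("C2", "h"), ("G2", "h")],
}
corrupted = forward_mask(score, ["melody"], 1.0, random.Random(0))
```

Here the melody is the target and is fully masked, while the bass track is kept as the source. A denoising model would then be trained to recover the cells listed by `masked_positions`.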

The researchers highlight that this innovative framework provides explicit control over generating desired target tracks starting from scratch or based on user-provided source tracks. Additionally, GETScore stands out as a concise multi-track music representation, streamlining the model learning process and enabling harmonious music generation. Moreover, the pitch tokens utilized in this representation effectively retain polyphonic dependencies, fostering the creation of harmonically rich musical compositions.

In addition to its track-wise generation capabilities, the advanced mask and denoising mechanism of GETDiff empowers zero-shot infilling. This remarkable feature allows for the seamless denoising of masked tokens at any arbitrary positions within GETScore, pushing the boundaries of creativity and enhancing the overall versatility of the framework.

Overall GETMusic performs well, outperforming many other similar models, demonstrating superior melodic, rhythmic, and structural matching between the target tracks and the provided source tracks. In the future, the researchers are looking to explore the potential of this framework, with a particular focus on incorporating lyrics as an additional track. This integration aims to enable impressive lyric-to-melody generation capabilities, further advancing the versatility and expressive power of the model. Seamlessly combining textual and musical elements could open up new creative possibilities and enhance the overall musical experience.

Check out the Paper, Project, and Github. All Credit For This Research Goes To the Researchers on This Project.

The post Meet GETMusic: A Unified Representation and Diffusion Framework that can Generate any Music Tracks with a Unified Representation and Diffusion Framework appeared first on MarkTechPost.

Breaking Barriers in Source-Free Domain Adaptation: NOTELA’s Impact …

Deep learning has made significant progress in a wide range of application areas. An important contributing factor has been the availability of increasingly larger datasets and models. However, a downside of this trend is that training state-of-the-art models has also become increasingly expensive, leading to environmental concerns and accessibility issues for some practitioners. Additionally, directly reusing pre-trained models can result in performance degradation when facing distribution shifts during deployment. Researchers have explored Source-Free Domain Adaptation (SFDA) to address these challenges. This technique adapts pre-trained models to new target domains without access to the original training data. This article focuses on the problem of SFDA and introduces a novel method, NOTELA, designed to tackle distribution shifts in the audio domain, specifically in bioacoustics. 

The bioacoustics dataset Xeno-Canto (XC), which is widely used for bird species classification, includes both focal recordings, which target individual birds in natural conditions, and soundscape recordings, which are obtained through omnidirectional microphones.

It poses unique challenges, as soundscape recordings have a lower signal-to-noise ratio, multiple birds vocalizing simultaneously, and significant distractors like environmental noise. Furthermore, soundscape recordings are collected from different geographical locations, leading to extreme label shifts since only a small subset of species in XC may appear in a specific area. Additionally, both the source and target domains exhibit class imbalance, and the problem is a multi-label classification task due to the presence of multiple bird species within each recording.

In this study, Google researchers first evaluate several existing SFDA methods on the bioacoustics dataset, including entropy minimization, pseudo-labeling, denoising teacher-student, and manifold regularization. The evaluation results show that while these methods have demonstrated success in traditional vision tasks, their performance in bioacoustics varies significantly. In some cases, they perform worse than having no adaptation at all. This result highlights the need for specialized methods to handle the bioacoustics domain’s unique challenges.

To address this limitation, the researchers propose a new and innovative method named NOisy student TEacher with Laplacian Adjustment (NOTELA). This novel approach combines principles from denoising teacher-student (DTS) methods and manifold regularization (MR) techniques. NOTELA introduces a mechanism for adding noise to the student model (inspired by DTS) while enforcing the cluster assumption in the feature space (similar to MR). This combination helps stabilize the adaptation process and enhances the model’s generalizability across different domains. The method leverages the model’s feature space as an additional source of truth, allowing it to succeed in the challenging bioacoustics dataset and achieve state-of-the-art performance.

In the bioacoustics domain, NOTELA demonstrated substantial improvements over the source model and outperformed other SFDA methods across multiple test target domains. It achieved impressive mean average precision (mAP) and class-wise mean average precision (cmAP) values, standard metrics for multi-label classification. Its notable performances on various target domains, such as S. Nevada (mAP 66.0, cmAP 40.0), Powdermill (mAP 62.0, cmAP 34.7), and SSW (mAP 67.1, cmAP 42.7), highlight its effectiveness in handling the challenges of the bioacoustics dataset.
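To make the two metrics concrete, here is a small sketch of how they can be computed for multi-label predictions. It assumes that mAP averages the average precision over recordings (rows) and that cmAP averages it over classes (columns); the scores and labels are made-up toy values, not results from the paper.

```python
def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive label appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, index in enumerate(order, start=1):
        if labels[index]:
            hits += 1
            precisions.append(hits / rank)
    # Undefined when there are no positives; such rows or columns are skipped.
    return sum(precisions) / len(precisions) if precisions else None

def mean_ap(score_rows, label_rows):
    aps = [average_precision(s, l) for s, l in zip(score_rows, label_rows)]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps)

def transpose(rows):
    return [list(column) for column in zip(*rows)]

# Toy data: 2 recordings (rows) x 3 species (columns).
scores = [[0.9, 0.2, 0.6],
          [0.1, 0.8, 0.7]]
labels = [[1, 0, 1],
          [0, 1, 0]]

map_value = mean_ap(scores, labels)                         # per recording
cmap_value = mean_ap(transpose(scores), transpose(labels))  # per class
```

In this toy case every recording ranks its own species correctly, so the per-recording value is 1.0, but the third species is scored higher in the recording where it is absent, which lowers the class-wise value to about 0.83.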

In the context of vision tasks, NOTELA consistently demonstrated strong performance, outperforming other SFDA baselines. It achieved notable top-1 accuracy results on various vision datasets, including CIFAR-10 (90.5%) and S. Nevada (73.5%). Although it showed slightly lower performance on ImageNet-Sketch (29.1%) and VisDA-C (43.9%), NOTELA’s overall effectiveness and stability in handling the SFDA problem across bioacoustics and vision domains are evident.

The above figure shows the evolution of test mean average precision (mAP) for multi-label classification on six soundscape datasets. It compares NOTELA and Dropout Student (DS) with SHOT, AdaBN, Tent, NRC, DUST, and Pseudo-Labelling, demonstrating that NOTELA is the only method that consistently improves the source model, setting it apart.

Overall, this research highlights the importance of considering different modalities and problem settings when evaluating and designing SFDA methods. The authors propose the bioacoustics task as a valuable avenue for studying SFDA. It emphasizes the need for consistent and generalizable performance, especially without domain-specific validation data. Their findings suggest that NOTELA emerges as a compelling baseline for SFDA, showcasing its ability to deliver reliable performance across diverse domains. These valuable insights open new doors for advancing SFDA techniques and enabling more effective and versatile deep-learning applications.

Check out the Paper and Google Blog. All Credit For This Research Goes To the Researchers on This Project.

The post Breaking Barriers in Source-Free Domain Adaptation: NOTELA’s Impact on Bioacoustics and Vision Domains appeared first on MarkTechPost.

Researchers from Imperial College London and DeepMind Designed an AI F …

In recent years, there have been significant breakthroughs in Deep Learning, particularly in popular sub-fields of Artificial Intelligence such as Natural Language Processing (NLP), Natural Language Understanding (NLU), and Computer Vision (CV). In NLP, Large Language Models (LLMs) demonstrate language processing and text generation skills that are on par with human ability. In CV, Vision Transformers (ViTs) have been able to learn meaningful representations from images and videos without any explicit guidance. Vision-Language Models (VLMs) have also been developed, which can connect visual inputs with linguistic descriptions and vice versa.

Foundation Models, which power a wide range of downstream applications across input modalities, are pre-trained on vast amounts of textual and visual data. This has led to the emergence of significant capabilities such as common-sense reasoning, proposing and sequencing sub-goals, and visual understanding. Researchers are now studying how to use these capabilities to build more effective and general reinforcement learning (RL) agents. RL agents typically learn by interacting with their environment and receiving rewards as feedback, but this trial-and-error learning can be time-consuming and impractical.

To address these limitations, a team of researchers has proposed a framework that places language at the core of reinforcement learning robotic agents, particularly in scenarios where learning from scratch is required. The core contribution of their work is to demonstrate that LLMs and VLMs can effectively address several fundamental problems in four RL settings:

Efficient Exploration in Sparse-Reward Settings: RL agents often struggle to learn good behavior when rewards are rare, because such environments are hard to explore. The proposed approach makes exploration and learning in these settings more effective by using the knowledge stored in Foundation Models.

Reusing Gathered Data for Sequential Learning: The framework allows RL agents to build on previously gathered data rather than starting from scratch each time a new task is encountered, which aids the sequential learning of new tasks.

Scheduling Learned Abilities for New Tasks: The framework supports the scheduling of learned abilities, enabling agents to handle novel tasks efficiently with their existing knowledge.

Learning from Observations of Expert Agents: By using Foundation Models to learn from observations of expert agents, learning processes can become more efficient and quick.

The team has summarized the main contributions as follows –

The framework has been made in a way that enables the RL agent to reason and make judgments more effectively based on textual information by using language models and vision language models as the fundamental reasoning tools. The agent’s capacity to comprehend challenging tasks and settings is improved by this method.

The proposed framework shows its efficiency in resolving fundamental RL problems that in the past needed distinct, specially created algorithms.

The new framework outperforms conventional baseline techniques in the sparse-reward robotic manipulation setting.

The framework also shows that it can efficiently use previously taught skills to complete tasks. The RL agent’s generalization and adaptability are enhanced by the ability to transfer learned information to new situations.

It demonstrates that the RL agent can learn accurately from observed demonstrations by imitating videos of human experts.

In conclusion, the study shows that language models and vision language models have the ability to serve as the core components of reinforcement learning agents’ reasoning.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Researchers from Imperial College London and DeepMind Designed an AI Framework that Uses Language as the Core Reasoning Tool of an RL Agent appeared first on MarkTechPost.

CMU Researchers Introduce WebArena: A Realistic and Reproducible Web E …

Given the potential for increased efficiency and broader accessibility, autonomous agents that can do ordinary tasks via human natural language instructions could considerably complement human skills. To fully use the potential of these independent agents, it is essential to comprehend their behavior in a genuine and reproducible setting.

Today’s environments tend to oversimplify complex problems. Many of their features are watered-down versions of real-world equivalents, resulting in a lack of task variety. In other cases, the environment is presented as a static resource, which limits agents to exploring only the states cached during data collection.

New research from Carnegie Mellon University and Inspired Cognition presents WebArena, a simulated web environment with reproducible conditions that can be used to build and benchmark autonomous agents that carry out tasks on the web. The environment consists of four live, self-hosted web apps, one each for e-commerce, online discussion forums, collaborative software development, and enterprise content management. WebArena also includes several helpful tools, including a map, a calculator, and a scratchpad, to support task execution that is as human-like as possible. In addition, WebArena is supported by supplementary knowledge resources, including guides for using the integrated development tool and sites such as the English Wikipedia. The content of these websites is drawn directly from their real-world counterparts, which keeps it realistic. The environment is hosted in Docker containers and exposed through gym APIs, making WebArena easy to use and reproducible.

In addition to WebArena, the team also open-sources a fully operational benchmark of 812 long-horizon web-based tasks. Each task is described as a high-level natural language intent, modeled after the abstract language usage patterns commonly adopted by humans. The evaluation focuses on functional correctness, that is, whether the goal of the task was actually achieved. This is more reliable than comparing plain action sequences, and it accounts for the fact that there are often multiple legitimate routes to the same goal (a common situation in sufficiently complex tasks).
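The difference between comparing action sequences and checking the outcome can be shown with a toy example. This is only a sketch of the general idea and does not use the WebArena API: the shopping-cart state, the action format, and the function names are hypothetical.

```python
def run(initial_state, actions):
    """Apply (operation, item) actions to a copy of a toy shopping-cart state."""
    cart = list(initial_state)
    for operation, item in actions:
        if operation == "add":
            cart.append(item)
        elif operation == "remove" and item in cart:
            cart.remove(item)
    return sorted(cart)

def matches_reference(actions, reference_actions):
    # Sequence comparison: passes only if the exact same steps were taken.
    return actions == reference_actions

def functionally_correct(initial_state, actions, goal_state):
    # Outcome comparison: passes whenever the final state satisfies the goal.
    return run(initial_state, actions) == sorted(goal_state)

reference = [("add", "pen"), ("add", "notebook")]
alternative = [("add", "notebook"), ("add", "ink"), ("remove", "ink"), ("add", "pen")]
goal = ["notebook", "pen"]

alternative_matches = matches_reference(alternative, reference)
alternative_correct = functionally_correct([], alternative, goal)
```

The alternative route takes a detour and differs from the reference sequence, so a sequence comparison rejects it, yet it ends in the same cart contents and therefore passes the outcome check.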

The team uses this benchmark to compare the performance of several agents that perform web-based operations in response to natural language commands. The agents are built with a range of methods, from predicting the next step based on current observations and history to more complex approaches such as step-by-step reasoning. They are powered by large language models (LLMs) such as GPT-3.5 and GPT-4, used in a few-shot in-context learning setup. The findings show that the best GPT-4 agent achieved an overall task success rate of only 10.59 percent. The authors hypothesize that current LLMs lack key capabilities, including active exploration and failure recovery, and that this is the root cause of their inability to complete complicated tasks effectively.

Check out the Paper, Project Page, and Github. All Credit For This Research Goes To the Researchers on This Project.

The post CMU Researchers Introduce WebArena: A Realistic and Reproducible Web Environment with 4+ Real-World Web Apps for Benchmarking Useful Agents appeared first on MarkTechPost.

Meet REPLUG: a Retrieval-Augmented Language Modeling LM Framework that …

In recent years, language models have become one of the fastest-growing fields in Artificial Intelligence. These models, developed to process and produce natural language text, drive some of the most innovative AI applications and are at the forefront of a new era in AI. One language model in particular, GPT-3, has caused a buzz worldwide due to its capabilities and performance. GPT-3 uses a transformer architecture to process text, resulting in a model that can answer questions much as a human would. The model can also summarize long paragraphs, complete code, and carry out tasks quickly and accurately.

Language models like GPT-3 are still far from perfect and have limitations when it comes to generating precise and appropriate responses to new prompts. This is where REPLUG comes in. REPLUG is a retrieval-augmented language model framework: a method for improving the performance of black-box language models by pairing them with a retrieval component. The retrieval system finds the passages in a large text corpus that best match a given prompt, and the retrieved passages are then prepended to the prompt given to the language model, which itself remains frozen. This allows the language model to produce more accurate answers, especially when the prompt involves information not seen in its training data.

The REPLUG method consists of two primary steps: document retrieval and input reformulation. First, a retriever identifies relevant documents from an external corpus. Then, each retrieved document is separately prepended to the original input context, the language model is run once per document, and the output probabilities from these passes are combined. Because the language model is treated as a black box, only its inputs and output probabilities are used.
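The two steps can be sketched in a few lines. This is an illustration rather than the REPLUG implementation: the language model is replaced by a hypothetical stub (`toy_lm`), the retrieval scores are made up, and weighting each pass by a softmax over retrieval scores is an assumption of this example, since the text above only says that the probabilities are combined.

```python
import math

def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_next_token(query, docs, retrieval_scores, lm_next_token_probs):
    """Run the LM once per retrieved document and mix the output distributions."""
    weights = softmax(retrieval_scores)
    combined = {}
    for doc, weight in zip(docs, weights):
        # Input reformulation: prepend one document to the original query.
        probs = lm_next_token_probs(doc + "\n\n" + query)
        for token, p in probs.items():
            combined[token] = combined.get(token, 0.0) + weight * p
    return combined

def toy_lm(context):
    # Hypothetical stand-in for a black-box LM returning next-token probabilities.
    if "Paris" in context:
        return {"Paris": 0.9, "Lyon": 0.1}
    return {"Paris": 0.4, "Lyon": 0.6}

docs = ["Paris is the capital of France.", "Lyon is a city in France."]
combined = ensemble_next_token("The capital of France is", docs, [2.0, 0.0], toy_lm)
best_token = max(combined, key=combined.get)
```

Because each pass returns a valid distribution and the weights sum to one, the combined result is again a distribution. The language model is only ever called through `lm_next_token_probs`, which mirrors the black-box setting.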

REPLUG was tested on various benchmark datasets and showed better results than existing systems. One of its key advantages is that it does not require any alteration to the underlying language model architecture: current models like GPT-3 can be enhanced simply by adding a retrieval system, which makes REPLUG easy to adopt and implement. REPLUG with the tuned retriever improves the performance of GPT-3 (175B) on language modeling by 6.3% and the performance of Codex on five-shot MMLU by 5.1%.

Consequently, REPLUG looks like a significant step forward for NLP. It combines the strengths of black-box language models and retrieval systems into a hybrid that outperforms the language model alone. Because it works as a plug-in that leaves the language model untouched, it is suited to real-world applications that need to process large volumes of data. The potential applications for REPLUG seem promising.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project.

The post Meet REPLUG: a Retrieval-Augmented Language Modeling LM Framework that Combines a Frozen Language Model with A Frozen/Tunable Retriever Improving the Performance of GPT-3 (175B) on Language Modeling by 6.3% appeared first on MarkTechPost.

This AI Paper Proposes to Inject the 3D World into Large Language Mode …

Over the last several years, we have seen a rise in large language models (LLMs), such as GPT-4, that excel at various tasks, including communication and common-sense reasoning. Recent research has looked at how to align images and videos with LLMs to create a new breed of multi-modal LLMs (like Flamingo and BLIP-2) that can comprehend and reason about 2D visuals. However, despite their effectiveness in communicating and making decisions, these models are not grounded in the richer concepts found in the real 3D physical world, such as spatial relationships, affordances, physics, and interaction. As a result, such LLMs fall far short of the robotic helpers shown in science fiction films, which can comprehend 3D environments and reason and plan based on that understanding. To close this gap, the authors propose injecting the 3D world into large language models, introducing a brand-new class of 3D-LLMs that take 3D representations (i.e., 3D point clouds with associated features) as input and perform a range of 3D-related tasks.

Figure 1

LLMs gain two advantages when they take 3D representations of scenes as input: (1) long-term memories of the complete scene can be stored in the holistic 3D representation rather than in episodic partial-view observations; (2) 3D properties such as affordances and spatial relationships can be inferred from the 3D representation, going well beyond the capabilities of language-based or 2D image-based LLMs. Data collection is a significant barrier to training the proposed 3D-LLMs. In contrast to the abundance of paired 2D image and text data on the Internet, 3D data is scarce, which makes it difficult to build foundation models on it. 3D data paired with language descriptions is even harder to obtain.

To solve this, the authors propose a set of data generation pipelines that produce large-scale 3D data paired with language. Specifically, they use ChatGPT and devise three effective prompting procedures for communication between 3D data and language. As illustrated in Figure 1, they obtain 300k 3D-language data points in this way, covering tasks such as 3D captioning, dense captioning, 3D question answering, 3D task decomposition, 3D grounding, 3D-assisted dialogue, navigation, and more. The next difficulty is obtaining meaningful 3D features that align with language features for 3D-LLMs. One option is to train 3D encoders from scratch using a contrastive learning paradigm similar to CLIP, which aligns language and 2D images. This approach, however, consumes a great deal of data, time, and GPU resources. Taking a different angle, several recent works (such as ConceptFusion and 3D-CLR) construct 3D features from 2D multi-view images. Following this idea, the authors also use a 3D feature extractor that builds 3D features from the 2D pretrained features of rendered multi-view images.

Many vision-language models (such as BLIP-2 and Flamingo) have recently started using 2D pretrained CLIP features to train their VLMs. Because the extracted 3D features are mapped to the same feature space as the 2D pretrained features, the authors can readily use 2D VLMs as backbones and feed in the 3D features to train 3D-LLMs efficiently. 3D-LLMs differ from traditional LLMs and 2D VLMs in one important respect: they are expected to have an underlying sense of 3D space. For this reason, the researchers, from UCLA, Shanghai Jiao Tong University, South China University of Technology, University of Illinois Urbana-Champaign, MIT, UMass Amherst, and the MIT-IBM Watson AI Lab, develop a 3D localization mechanism that connects language to spatial locations. They add 3D position embeddings to the extracted 3D features to encode spatial information more effectively. Additionally, they add a set of location tokens to the 3D-LLMs. Localization can then be trained by outputting location tokens given language descriptions of specific objects in the scenes. This enables 3D-LLMs to capture 3D spatial information more effectively.
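One plausible way to turn a 3D location into discrete tokens is to quantize the corners of an axis-aligned bounding box. The sketch below illustrates that idea only; the `<locN>` token naming, the number of bins, and the box format are assumptions of this example and are not taken from the paper.

```python
def to_location_tokens(box, scene_min, scene_max, num_bins=256):
    """Quantize a box (xmin, ymin, zmin, xmax, ymax, zmax) into location tokens."""
    tokens = []
    for i, value in enumerate(box):
        axis = i % 3  # 0 = x, 1 = y, 2 = z
        low, high = scene_min[axis], scene_max[axis]
        fraction = (value - low) / (high - low)
        # Clamp so that the upper scene boundary falls into the last bin.
        index = min(num_bins - 1, max(0, int(fraction * num_bins)))
        tokens.append(f"<loc{index}>")
    return tokens

# Hypothetical scene of 10 x 10 x 5 metres and one object box, with 10 bins per axis.
tokens = to_location_tokens(
    box=(0.0, 5.0, 0.0, 10.0, 10.0, 2.5),
    scene_min=(0.0, 0.0, 0.0),
    scene_max=(10.0, 10.0, 5.0),
    num_bins=10,
)
```

With tokens of this kind added to the vocabulary, a model can be trained to emit the six tokens for the object that a sentence describes, which turns localization into ordinary token prediction.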

In conclusion, their paper makes the following contributions: 

• They present a new family of 3D-based Large Language Models (3D-LLMs) that can handle a range of 3D-related tasks, taking 3D points with features and language prompts as input. They concentrate on tasks beyond the scope of conventional LLMs or 2D-LLMs, such as those involving knowledge of a whole scene, 3D spatial relationships, affordances, and 3D planning.

• They create novel data collection pipelines that can generate large-scale 3D-language data. Using these pipelines, they gather a dataset with more than 300,000 3D-language data points spanning a wide range of 3D-related tasks, such as 3D grounding, dense captioning, 3D question answering, task decomposition, 3D-assisted dialogue, navigation, etc.

• They employ a 3D feature extractor that takes rendered multi-view images and extracts meaningful 3D features. They build their training framework on 2D pre-trained VLMs. To train the 3D-LLMs to capture 3D spatial information better, they add a 3D localization mechanism.

• In experiments on the held-out evaluation dataset ScanQA, 3D-LLMs perform much better than state-of-the-art baselines (e.g., by 9% on BLEU-1). On held-in datasets for 3D captioning, task creation, and 3D-assisted dialogue, their approach beats 2D VLMs. Qualitative studies further show that their approach can handle a diverse range of tasks.

• They plan to release their 3D-LLMs, the 3D-language dataset, and the language-aligned 3D features of the dataset for future research.

Check out the Paper, Project, and Github. All Credit For This Research Goes To the Researchers on This Project.

The post This AI Paper Proposes to Inject the 3D World into Large Language Models and Introduce a Whole New Family of 3D-LLMs appeared first on MarkTechPost.

Fast and Accurate Acoustic Hologram Generation Using a Deep Learning-B …

A team led by Professor Hwang Jae-Yoon of the DGIST Department of Electrical Engineering and Computer Science has developed a deep learning-based framework for generating ultrasonic holograms, which allows the shape of focused ultrasound to be configured freely in real time. In the future, it is expected to serve as a fundamental technology for precise brain stimulation and therapy.

Ultrasound is a safe technology that is used even for prenatal examinations. Ultrasound techniques for brain stimulation and therapy have recently been studied because they can stimulate deep brain regions without surgery. According to earlier studies, ultrasonic brain stimulation can treat conditions including Alzheimer’s disease, depression, and pain.

To overcome these constraints, Professor Hwang Jae-Yoon’s team at DGIST proposed a deep learning-based framework that enables free and accurate ultrasound focusing in real time. The team showed that ultrasound could be focused into the required shape more precisely, with a hologram generation time that is close to real time and up to 400 times faster than the existing ultrasonic hologram generation algorithm.

The team’s deep learning-based framework learns to generate ultrasonic holograms through self-supervised learning, a technique in which a model learns on its own to find patterns in data that has no ground-truth answers. The team proposed a training approach for ultrasonic hologram generation, a deep learning network tailored to generating ultrasonic holograms, and a new loss function, and demonstrated the validity and superiority of each component through simulations and real experiments.

Problem and Solution

The issue is that the current technology concentrates ultrasound into a single small point or a large circle for stimulation, which makes it difficult to selectively stimulate related regions of the brain when several regions interact at the same time. A technique that uses the hologram principle to focus ultrasound freely on desired locations has been proposed as a solution to this problem, but it has drawbacks, including low accuracy and the long computation time needed to generate a hologram.

To sum it up –

Acoustic holography is gaining popularity for various applications. However, there are still few studies on how to generate acoustic holograms, and conventional acoustic hologram algorithms cannot generate holograms quickly and accurately enough, which impedes the development of new applications. Professor Hwang Jae-Yoon’s team at DGIST proposes a deep learning-based framework that generates acoustic holograms quickly and accurately. The framework’s autoencoder-like design enables unsupervised training without the need for ground truth. The team also presents the holographic ultrasound generation network (HU-Net), a newly developed hologram generator network suited to unsupervised learning of hologram generation, and a new loss function designed for energy-efficient holograms.

Check out the Paper and Reference Article. All Credit For This Research Goes To Researchers on This Project.

The post Fast and Accurate Acoustic Hologram Generation Using a Deep Learning-Based Framework appeared first on MarkTechPost.

This Artificial Intelligence (AI) Paper From South Korea Proposes FFNe …

Research on neural fields, which represent signals by mapping coordinates to their quantities (e.g., scalars or vectors) with neural networks, has exploded recently. This has sparked increased interest in using this technology to handle a variety of signals, including audio, image, 3D shape, and video. The universal approximation theorem and coordinate encoding techniques provide the theoretical foundations for the accurate signal representation of neural fields. Recent investigations have shown their adaptability in data compression, generative models, signal manipulation, and basic signal representation.

Figure 1 shows the general structure of (a) pixel-wise video representations, (b) frame-wise video representations, and (c) the proposed flow-guided frame-wise representations (FFNeRV).

In the frame-wise approach, each time coordinate is mapped to a video frame generated by a stack of MLP and convolutional layers. Compared to the basic neural field design, this approach considerably cut the encoding time and outperformed common video compression techniques. The recently proposed E-NeRV follows this paradigm while also boosting video quality. As shown in Figure 1, the authors propose flow-guided frame-wise neural representations for videos (FFNeRV). Drawing inspiration from common video codecs, they embed optical flows into the frame-wise representation to exploit temporal redundancy. FFNeRV creates a video frame by combining nearby frames guided by flows, which enforces the reuse of pixels from those frames. Encouraging the network not to memorize the same pixel values again across frames dramatically improves parameter efficiency.
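The flow-guided reuse of pixels can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes integer flow offsets, nearest-pixel lookup with border clamping, and one scalar blend weight per frame, and the names `warp_frame` and `blend` are hypothetical.

```python
def warp_frame(frame, flow):
    """Build a frame by fetching each pixel from `frame` at a flow-shifted location.

    frame: 2D list of pixel values (H x W).
    flow:  2D list of (dy, dx) integer offsets; output pixel (y, x) is read
           from (y + dy, x + dx), clamped to the image border.
    """
    height, width = len(frame), len(frame[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = flow[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            row.append(frame[src_y][src_x])
        warped.append(row)
    return warped

def blend(frames, weights):
    """Combine several same-sized frames with one scalar weight per frame."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(w * f[y][x] for f, w in zip(frames, weights)) for x in range(width)]
        for y in range(height)
    ]
```

In this sketch, a neighbouring frame is warped with `warp_frame` and then mixed with an independently generated frame through `blend`, so pixels that already exist in the neighbouring frame do not have to be stored again in the network weights.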

According to experimental results on the UVG dataset, FFNeRV outperforms other frame-wise algorithms in video compression and frame interpolation. To improve compression performance further, the authors replace the MLP with multi-resolution temporal grids of fixed spatial resolution, which map continuous temporal coordinates to the corresponding latent features; this is motivated by grid-based neural representations. They also propose a more compact convolutional architecture: inspired by generative models that produce high-quality images and by lightweight neural networks, they use group and pointwise convolutions in the proposed frame-wise flow representations. With quantization-aware training and entropy coding, FFNeRV outperforms popular video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms. The code implementation is based on NeRV and is available on GitHub.
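As a rough illustration of how a multi-resolution temporal grid can map a continuous time coordinate to latent features, the sketch below linearly interpolates each grid along time and concatenates the results. It assumes flat latent vectors rather than the spatial feature maps a real model would store, and the names `sample_grid` and `temporal_features` are hypothetical.

```python
def sample_grid(grid, t):
    """Linearly interpolate a latent vector at continuous time t in [0, 1].

    grid: list of T latent vectors (T >= 2) placed at evenly spaced time steps.
    """
    position = t * (len(grid) - 1)
    lower = min(int(position), len(grid) - 2)  # keep lower + 1 inside the grid
    frac = position - lower
    return [(1.0 - frac) * a + frac * b for a, b in zip(grid[lower], grid[lower + 1])]

def temporal_features(grids, t):
    """Concatenate the features sampled from grids of different temporal resolution."""
    features = []
    for grid in grids:
        features.extend(sample_grid(grid, t))
    return features
```

Coarse grids capture slowly changing content with few entries, while finer grids add detail for fast changes; the concatenated features would then be passed to the convolutional layers that produce the frame.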

The post This Artificial Intelligence (AI) Paper From South Korea Proposes FFNeRV: A Novel Frame-Wise Video Representation Using Frame-Wise Flow Maps And Multi-Resolution Temporal Grids appeared first on MarkTechPost.