Reinforcement learning (RL) is a popular approach to training autonomous agents that learn to perform complex tasks by interacting with their environment. Guided by a reward signal, RL agents learn which actions work best in different situations and adapt to their environment.
A major challenge in RL is exploring the vast state space of many real-world problems efficiently. This challenge arises because RL agents learn by interacting with their environment, which means they have to explore it. Think of an agent that tries to play Minecraft. If you have played it before, you know how complicated Minecraft's crafting tree is. There are hundreds of craftable objects, and you often need to craft one item before you can craft another, and so on. It is a really complex environment.
Because the environment can have a huge number of possible states and actions, it can be difficult for the agent to find the optimal policy through random exploration alone. The agent must balance exploiting its current best policy against exploring new parts of the state space that might lead to a better policy. Finding exploration methods that strike this balance efficiently is an active area of research in RL.
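To make the exploration-exploitation trade-off concrete, here is a minimal sketch of epsilon-greedy action selection, a classic generic strategy that is not DECKARD's method. It assumes the agent already keeps a value estimate for each action; the action names and values are hypothetical placeholders.

```python
import random


def epsilon_greedy(q_values: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Return a random action with probability epsilon, else the greedy action.

    q_values maps each (hypothetical) action name to its current value estimate.
    """
    if rng.random() < epsilon:
        # Explore: sample uniformly among all known actions.
        return rng.choice(list(q_values))
    # Exploit: pick the action with the highest current value estimate.
    return max(q_values, key=q_values.get)


# Hypothetical value estimates for three actions.
q = {"mine_log": 0.8, "craft_table": 0.5, "wander": 0.1}
rng = random.Random(42)
choices = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
print(choices.count("mine_log"), "greedy picks out of", len(choices))
```

Setting `epsilon=0` gives pure exploitation and `epsilon=1` gives pure random exploration. In a space as large as Minecraft's, uniform random exploration of this kind rarely stumbles onto long crafting chains, which is the gap that prior knowledge is meant to fill.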
Practical decision-making systems are known to benefit from using prior knowledge about a task efficiently. With prior information about the task, an agent can adapt its policy faster and avoid getting stuck in sub-optimal policies. However, most RL methods today train from scratch, without any prior training or external knowledge.
Why is that the case? Injecting external knowledge into an RL agent is not straightforward. In recent years, there has been growing interest in using large language models (LLMs) to aid RL agents in exploration by supplying that knowledge. This approach has shown promise, but many challenges remain, such as grounding the LLM's knowledge in the environment and coping with inaccurate LLM outputs.
So, should we give up on using LLMs to aid RL agents? If not, how can we fix these problems so that LLMs can reliably guide RL agents? The answer has a name, and it's DECKARD.
Overview of DECKARD. Source: https://arxiv.org/abs/2301.12050
DECKARD targets Minecraft, where crafting a specific item is challenging without expert knowledge of the game. Previous studies have shown that reaching a goal in Minecraft becomes much easier with dense rewards or expert demonstrations, which underlines how hard the task is without such help. As a result, item crafting in Minecraft has become a persistent challenge in AI.
DECKARD uses few-shot prompting of a large language model (LLM) to generate an Abstract World Model (AWM) of subgoals. In the dreaming phase, the LLM hypothesizes the AWM; in other words, the agent "dreams" about the task and the steps needed to solve it. Then, in the waking phase, it learns a modular policy for the subgoals generated during dreaming. Because this happens in the real environment, DECKARD can verify the hypothesized AWM: errors are corrected during the waking phase, and discovered nodes are marked as verified so they can be reused in the future.
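The dream-then-verify loop can be illustrated with a toy sketch. This is not DECKARD's implementation: it assumes the AWM is a simple map from each item to its prerequisite items, hard-codes a hypothetical "dreamed" AWM in place of real LLM output, and replaces RL subgoal policies with a stub environment (`TRUE_RECIPES`) that simply reveals each item's real prerequisites. It also assumes the dreamed graph has no cycles.

```python
from typing import Optional

# Hypothetical "Dream" output: item -> prerequisites the LLM believes are needed.
# The pickaxe entry deliberately omits the crafting table, mimicking an LLM error.
dreamed_awm = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["stick", "planks"],
}

# Hypothetical stand-in for the real game: the true prerequisites of each item.
TRUE_RECIPES = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["stick", "planks", "crafting_table"],
}


def wake(awm: dict, target: str, true_recipes: dict,
         verified: Optional[set] = None) -> set:
    """Toy "Wake" phase: pursue target, correcting awm in place from experience."""
    if verified is None:
        verified = set()
    if target in verified:
        return verified  # already grounded in the environment, reuse it
    # First pursue the prerequisites that the dreamed AWM claims are needed.
    for prereq in awm.get(target, []):
        wake(awm, prereq, true_recipes, verified)
    # "Attempt" the subgoal; the stub environment reveals the real recipe.
    observed = true_recipes[target]
    if sorted(observed) != sorted(awm.get(target, [])):
        awm[target] = list(observed)  # correct the hypothesized edges
        for prereq in observed:
            wake(awm, prereq, true_recipes, verified)
    verified.add(target)  # mark the node as verified for future reuse
    return verified


awm = {item: list(prereqs) for item, prereqs in dreamed_awm.items()}
verified = wake(awm, "wooden_pickaxe", TRUE_RECIPES)
print(sorted(verified))        # ['crafting_table', 'log', 'planks', 'stick', 'wooden_pickaxe']
print(awm["wooden_pickaxe"])   # ['stick', 'planks', 'crafting_table']
```

The point of the sketch is the division of labor: the dreamed graph decides which subgoals to try first, while experience in the environment has the final say on which edges are kept.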
Experiments show that LLM guidance is essential to exploration in DECKARD: a version of the agent without LLM guidance takes more than twice as long to craft most items during open-ended exploration. When pursuing a specific task, DECKARD improves sample efficiency by orders of magnitude compared with similar agents, demonstrating the potential for applying LLMs robustly to RL.
Check out the Research Paper, Code, and Project.
The post Dream First, Learn Later: DECKARD is an AI Approach That Uses LLMs for Training Reinforcement learning (RL) Agents appeared first on MarkTechPost.