Paper -> Embodied AI Agents: Modeling the World

In this paper, the authors survey AI agents embodied in different forms and argue that embodiment makes their interactions with humans and the environment more natural. They propose that world models are central to the reasoning and planning of these agents. In their framing, world modelling combines multimodal perception, planning through reasoning for action and control, and memory. They also propose that agents maintain a mental model of the user to facilitate better human-agent collaboration. They see future work in embodied AI learning, collaboration between multiple agents, and human-agent teaming.

According to them, embodiment serves two purposes: physical interaction, in the form of direct action or indirect awareness, and enhanced human-machine interaction. A growing area of interest is the potential of embodied agents to develop the way humans do, learning from ever richer sensory information. The authors also note the increasing autonomy of agents: multi-step thinking, access to external resources, and a collaborative approach grounded in understanding user needs.

They suggest that the agent should be able to converse with the user for clarification, confirmation, and understanding context. For both physical and mental world modelling, they argue that an agent needs short- and long-term memory. LLMs are fine-tuned on conversational data and improved with RLHF to learn human preferences. Contextual AI is enabled by using LLMs/VLMs for perception, reasoning, and planning. VLMs can be instruction-tuned to produce step-by-step plans, which can also be used for robots, as sketched below.
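To make the instruction-tuned planning idea concrete, here is a minimal Python sketch. The `vlm_generate` function is a hypothetical stand-in for whatever VLM backend you use (the paper does not prescribe an API); the point is the shape of the interaction: an image plus a task, with the model asked for a numbered, executable plan.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    index: int
    action: str

def vlm_generate(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in: in practice this would call an
    # instruction-tuned VLM. Canned output keeps the sketch runnable.
    return "1. Walk to the counter\n2. Pick up the mug\n3. Place it in the sink"

def plan_task(image_path: str, task: str) -> list[PlanStep]:
    prompt = (
        "You are an embodied assistant. Given the scene in the image, "
        f"produce a numbered, step-by-step plan to: {task}. "
        "One concrete, executable action per line."
    )
    raw = vlm_generate(image_path, prompt)
    steps = []
    for line in raw.splitlines():
        action = line.strip().lstrip("0123456789. ")
        if action:
            steps.append(PlanStep(index=len(steps) + 1, action=action))
    return steps

for step in plan_task("kitchen.jpg", "tidy the counter"):
    print(step.index, step.action)
```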

LLMs are generative models and are inefficient relative to their size: they do well on creative tasks but are weaker at reasoning and planning. The authors therefore propose a shift to a world-modelling approach, arguing it improves both accuracy and efficiency.

The authors describe both generative world models and predictive world models. They classify embodied AI agents into three types: virtual embodied agents, wearable agents, and robotic agents. Collaboration between agents is also of interest.
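To give the "predictive world model" idea some shape, here is a toy sketch of my own (not from the paper): a learned transition function predicts the next state, and the agent plans by rolling candidate action sequences through the model and keeping the cheapest one, a simple random-shooting planner.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Stand-in for a learned transition model s_{t+1} = f(s_t, a_t).
    return state + 0.1 * action

def cost(state: np.ndarray, goal: np.ndarray) -> float:
    return float(np.sum((state - goal) ** 2))

def plan(state: np.ndarray, goal: np.ndarray, horizon=5, n_candidates=256):
    """Sample action sequences, simulate each through the world model,
    and return the sequence whose final state is closest to the goal."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s = state.copy()
        for a in seq:
            s = dynamics(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq

actions = plan(np.zeros(2), np.array([1.0, -0.5]))
print("first planned action:", actions[0])
```

The key contrast with a purely generative approach is that the model here is only asked to predict consequences of actions, not to generate full observations, which is what makes planning with it cheap.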

Virtual embodied agents (VEAs) can be used in therapy and as AI studio avatars, and could eventually power compelling characters in movies and games. Their ability to perceive the virtual world and act on it is exciting. The authors see potential in personalized education, customer service, and healthcare, and point to Meta Motivo and the Seamless project as work of interest.

Wearable agents blur the line between humans and machines, and they must learn to plan actions in the physical world through reasoning. LLMs/VLMs do poorly on planning; VLMs outperform LLMs but still hallucinate. Meta researchers are exploring different hypotheses for world modelling based on transformer and JEPA architectures, with the hope of achieving efficient and effective long-horizon action planning. VLMs can also be trained to predict a wearer's goals directly from egocentric context, and there are benchmarks for testing this. The authors see two main ways wearables can aid humans: coaching and tutoring. Developing AI tutors that provide personalized guidance and feedback, without simply presenting solutions, is an area of interest.
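Since the paper points to JEPA-style architectures, here is a minimal sketch of the core training step as I understand it, a simplified, I-JEPA-flavoured toy rather than the authors' implementation: a context encoder and a predictor are trained so the predicted embedding matches a target encoder's embedding, with the target encoder updated as an exponential moving average. Prediction happens in representation space, not pixel space.

```python
import torch
import torch.nn as nn

dim = 64
context_enc = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
target_enc = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# The target encoder is not trained by gradient descent; it starts as a
# copy of the context encoder and tracks it via an EMA.
target_enc.load_state_dict(context_enc.state_dict())
for p in target_enc.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-3
)

def jepa_step(context_view: torch.Tensor, target_view: torch.Tensor, ema=0.99):
    # Predict the target view's embedding from the context view's embedding.
    pred = predictor(context_enc(context_view))
    with torch.no_grad():
        tgt = target_enc(target_view)
    loss = nn.functional.mse_loss(pred, tgt)  # loss lives in latent space
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update of the target encoder.
    with torch.no_grad():
        for pt, pc in zip(target_enc.parameters(), context_enc.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

x = torch.randn(32, 128)            # e.g. features of a masked/context view
y = x + 0.05 * torch.randn_like(x)  # features of the full/target view
print(jepa_step(x, y))
```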

Robotic agents are AI systems that operate robots in physical environments, performing tasks independently or in collaboration with humans and other agents. They typically sense the world through RGB cameras, tactile sensors, IMUs, force/torque sensors, and audio sensors, and control their bodies to achieve desired tasks. The authors see their potential in two abstract ways: robots can learn to perform tasks and operate alongside us, and they can form the foundation for learning general-purpose intelligent agents via embodied interaction in the real world.
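As a small illustration of what a multi-sensor observation might look like in code (my own sketch; the field names and shapes are assumptions, not from the paper), a robot's control loop typically consumes a timestamped bundle of the modalities listed above:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One timestamped reading from the sensors listed above.
    All shapes are illustrative assumptions."""
    timestamp: float
    rgb: np.ndarray           # (H, W, 3) camera image
    tactile: np.ndarray       # per-fingertip pressure readings
    imu: np.ndarray           # (6,) angular velocity + linear acceleration
    force_torque: np.ndarray  # (6,) wrist wrench: fx, fy, fz, tx, ty, tz
    audio: np.ndarray         # (N,) mono waveform chunk

def act(obs: Observation) -> np.ndarray:
    # Stand-in policy: a real agent would feed obs into a learned model.
    return np.zeros(7)  # e.g. a 7-DoF arm joint velocity command

obs = Observation(
    timestamp=0.0,
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    tactile=np.zeros(5),
    imu=np.zeros(6),
    force_torque=np.zeros(6),
    audio=np.zeros(1600),
)
print(act(obs).shape)
```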

Robots operating autonomously in unstructured environments is a longstanding human dream. Robots could take on physically and mentally demanding tasks, such as rescue operations and elderly care, and buy more time for humans. The embodiment hypothesis, as the authors state it, posits that enabling robots to learn by interacting with the real world is the only way to obtain a general agent that can reason about the real world.

Read till page 7. Link to paper: https://arxiv.org/pdf/2506.22355

