SLAM Handbook Prelude
I started reading the SLAM handbook, which is in the process of being published and is freely available for everyone to read. It focuses on various topics within SLAM, its history, latest trends, and its potential in the development of Spatial AI.
SLAM stands for simultaneous localization and mapping. It has a dual goal: building a map of the environment and localizing the agent (robot or human) within that environment. SLAM offers an understanding of the geometry, semantics, and physical aspects of the environment.
When they mentioned that localization is an issue in SLAM, often because high-end, accurate sensors are not available, I wondered if this problem could be solved by a swarm of people acting in the environment and recording observations. An IMU is a proprioceptive sensor that reports on the agent’s own motion, while exteroceptive sensors like cameras, LiDAR, and radar report on external features and stimuli. They describe SLAM as an inverse problem: the sensors act as a function of the environment, sensors(environment) = measurements, and we have to recover the environment from the measurements.
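A minimal sketch of the inverse-problem view, assuming a toy setup that is not from the handbook: one unknown 2D landmark, range-only sensors at known positions, and noise-free measurements. The names forward_model and solve_inverse are hypothetical. The forward model maps the environment to measurements, and Gauss-Newton iterations run it backwards to recover the landmark.

```python
import numpy as np

def forward_model(landmark, sensor_positions):
    """Forward direction: predict the range from each sensor to the landmark."""
    return np.linalg.norm(sensor_positions - landmark, axis=1)

def solve_inverse(measurements, sensor_positions, initial_guess, iterations=20):
    """Inverse direction: Gauss-Newton search for the landmark that explains the ranges."""
    estimate = np.asarray(initial_guess, dtype=float)
    for _ in range(iterations):
        predicted = forward_model(estimate, sensor_positions)
        residual = measurements - predicted
        # Jacobian of each range w.r.t. the landmark: unit vector from sensor to landmark.
        jacobian = (estimate - sensor_positions) / predicted[:, None]
        step, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        estimate = estimate + step
    return estimate

# Toy data (assumed for illustration): four sensors at known positions.
sensor_positions = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
true_landmark = np.array([2.0, 1.0])
measurements = forward_model(true_landmark, sensor_positions)

estimate = solve_inverse(measurements, sensor_positions, initial_guess=[1.0, 1.0])
print(estimate)
```

Real SLAM has noisy measurements and estimates the agent poses together with many landmarks, but the structure is the same: a forward model plus an optimizer that inverts it.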
Then the authors mention indirect methods and direct methods. Indirect methods first extract features and match them across images, while direct methods work on the raw pixel intensities. Once point correspondences are established, the problem becomes a classical bundle adjustment problem where solvers, approximation methods, and parallelization can come in handy. This makes me want to study and experiment more with these tools.
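A minimal sketch of the quantity that bundle adjustment minimizes, assuming a simplified pinhole camera with a single focal length and the principal point at the image origin. The function names and the toy points are hypothetical. A real solver would adjust the camera poses and 3D points to drive these residuals toward zero.

```python
import numpy as np

def project(point_world, rotation, translation, focal_length):
    """Pinhole projection of a 3D world point into image coordinates."""
    point_cam = rotation @ point_world + translation
    return focal_length * point_cam[:2] / point_cam[2]

def reprojection_residuals(points_world, observations, rotation, translation, focal_length):
    """Stack the image-plane error of every correspondence into one vector."""
    predicted = np.array(
        [project(p, rotation, translation, focal_length) for p in points_world]
    )
    return (predicted - observations).ravel()

# Toy scene (assumed): three points in front of one camera.
points_world = np.array([[0.0, 0.0, 4.0], [1.0, -1.0, 5.0], [-1.0, 0.5, 6.0]])
rotation = np.eye(3)
translation = np.array([0.1, 0.0, 0.0])
focal_length = 500.0

# Synthetic observations generated from the true camera pose.
observations = np.array(
    [project(p, rotation, translation, focal_length) for p in points_world]
)

# Residuals vanish at the true pose and grow when the pose is perturbed.
perfect = reprojection_residuals(points_world, observations, rotation, translation, focal_length)
wrong_translation = translation + np.array([0.05, 0.0, 0.0])
shifted = reprojection_residuals(points_world, observations, rotation, wrong_translation, focal_length)
print(np.linalg.norm(perfect), np.linalg.norm(shifted))
```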
Visual SLAM is meant to generate a trajectory and a sparse 3D point cloud map. I wonder why the point cloud is sparse; that might be because only sparse key features and correspondences are extracted in the initial stage. The pipeline is supposed to follow measurements -> correspondences -> trajectory + sparse 3D point cloud map -> dense map. ICP (iterative closest point) is used to compute the relative pose between two scans. Odometry is the estimation of the agent’s relative motion between two timestamps. Loop closure improves SLAM performance by reducing the effects of drift; it relies on the agent revisiting the same places at different times and recognizing them. Based on the odometry data, assisted by loop closures, the trajectory and map are computed through pose graph optimization.
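A minimal sketch of how a loop closure pulls drift out of an odometry chain, assuming a 1D world so that pose graph optimization reduces to linear least squares. The function name, the edge format, and the numbers are hypothetical. The agent truly moves +1, +1, -1, -1 and returns to its start, but each odometry reading is off by 0.1.

```python
import numpy as np

def optimize_pose_graph(num_poses, edges):
    """Solve a 1D pose graph by linear least squares.

    edges is a list of (i, j, measured), each meaning x_j - x_i ~ measured.
    """
    rows = [np.eye(num_poses)[0]]  # prior that anchors x_0 at 0
    targets = [0.0]
    for i, j, measured in edges:
        row = np.zeros(num_poses)
        row[i], row[j] = -1.0, 1.0
        rows.append(row)
        targets.append(measured)
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return solution

# Odometry edges between consecutive poses, each with a 0.1 bias (assumed values).
odometry = [(0, 1, 1.1), (1, 2, 1.1), (2, 3, -0.9), (3, 4, -0.9)]
# Loop closure: the agent recognizes that pose 4 is the same place as pose 0.
loop_closure = [(0, 4, 0.0)]

# Dead reckoning just accumulates the odometry, so the drift accumulates too.
dead_reckoning = np.concatenate([[0.0], np.cumsum([m for _, _, m in odometry])])
optimized = optimize_pose_graph(5, odometry + loop_closure)

print(dead_reckoning)  # final pose is 0.4 away from the true value 0
print(optimized)       # final pose is 0.08 away from the true value 0
```

With equal weights, the 0.4 of accumulated error is spread evenly over the five edges of the loop, which is why every pose moves closer to the truth and not only the last one.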
What made me curious was their point that the main role of SLAM is to serve downstream tasks, while I used to worry about how to make use of just the sparse point cloud. SLAM is intended to run in real time. Structure from Motion is another paradigm, where the focus has been on generating 3D reconstructions offline, usually from an unordered set of images. They then mention that solutions need to be designed around the requirements; for example, local odometry might suffice if the agent operates over small distances.
According to them, long-term operation requires memory, and I think that’s exciting, as it means the SLAM community has been thinking for a while about memory, that is, how to store a representation of the world. Perceptual aliasing is when two different places look similar, which can lead to places being matched incorrectly. They mention that it’s important that the map is easy to query, inspect, and visualize. They also mention that the goal of training the agent is to turn the observed input data into actions, using tools like reinforcement learning.
After this portion, they go over the history of mapping and mention Gauss’s use of triangulation in the Kingdom of Hannover, Sir George Everest’s Great Trigonometric Survey, and Cholesky’s matrix decomposition, developed while surveying Crete and North Africa. I want to learn how SLAM differs from photogrammetry and Structure from Motion.
They mention a couple of insights about how to avoid drift in an unknown environment: one needs to simultaneously estimate the robot poses and the positions of fixed external entities (landmarks). The Extended Kalman Filter has been used for pose estimation for a long time, although it is sensitive to outliers and misdetections, struggles with drift, and scales poorly, since the equations it has to solve become too expensive for real-time use. The community then shifted to particle filter-based approaches, also grounded in estimation theory, which have lower error. Optimization-based approaches, initially disregarded as too slow, picked up in popularity once innovations around the covariance matrix and its inverse (the information matrix, which is sparse in SLAM) made them faster, solvable, and scalable. The factor graph-based approach to SLAM is the dominant paradigm today and has also shaped visual and visual-inertial odometry. The deep learning revolution began around 2012 and slowly permeated robotics.
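A minimal sketch of why the inverse of the covariance matrix matters for speed, assuming a toy 1D chain of six poses linked only by odometry plus a prior on the first pose. This setup and the variable names are my own illustration, not taken from the handbook. The information matrix (inverse covariance) only couples neighbouring poses, so it is sparse, while the covariance itself is dense.

```python
import numpy as np

num_poses = 6

# Measurement Jacobian: one row for the prior on x_0, one row per odometry edge.
rows = [np.eye(num_poses)[0]]
for i in range(num_poses - 1):
    row = np.zeros(num_poses)
    row[i], row[i + 1] = -1.0, 1.0  # odometry measures x_{i+1} - x_i
    rows.append(row)
jacobian = np.array(rows)

# Information matrix (unit noise assumed) and its inverse, the covariance.
information = jacobian.T @ jacobian
covariance = np.linalg.inv(information)

information_nonzeros = np.count_nonzero(np.abs(information) > 1e-9)
covariance_nonzeros = np.count_nonzero(np.abs(covariance) > 1e-9)

print(information_nonzeros)  # 16: tridiagonal, only neighbouring poses interact
print(covariance_nonzeros)   # 36: every pose is correlated with every other pose
```

The nonzero count of the information matrix grows linearly with the number of poses in this chain, while the covariance grows quadratically. Sparse solvers can exploit the former, which is one way to see why optimization-based and factor graph methods scale.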
Going from SLAM to Spatial AI requires an agent that can listen to statements made by humans, which means understanding both geometry (where) and semantics (what’s what, and which). A spatial perception system, with SLAM as one part of it, is supposed to reason about the geometric, semantic, and physical aspects of the scene to build a multi-faceted map representation (a “world model”) that enables the robot to understand and execute complex instructions.