Henrik Kretzschmar

I am a software engineer and researcher specializing in machine learning, with applications to robotics, computer vision, and computer graphics.

Most recently, I led an applied machine learning research team at Waymo, formerly known as the Google self-driving car project, in Mountain View, California.

I organized academic workshops on autonomous driving at the computer vision conference CVPR in 2020, 2021, 2022, and 2023.

I was awarded a PhD by the University of Freiburg in October 2014, advised by Prof. Dr. Wolfram Burgard.

Visit my Google Scholar and LinkedIn profiles for more information. Feel free to contact me via LinkedIn.

Publications

Find a selection of my scientific publications below. My research interests focus on machine learning, computer vision, and robotics.

Superpixel Transformers for Efficient Semantic Segmentation

Alex Zihao Zhu, Jieru Mei, Siyuan Qiao, Hang Yan, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.

[pdf] [arXiv]


Semantic segmentation, which aims to classify every pixel in an image, is a key task in machine perception, with many applications across robotics and autonomous driving. Due to the high dimensionality of this task, most existing approaches use local operations, such as convolutions, to generate per-pixel features. However, these methods are typically unable to effectively leverage global context information due to the high computational costs of operating on a dense image. In this work, we propose a solution to this issue by leveraging the idea of superpixels, an over-segmentation of the image, and applying them within a modern transformer framework. In particular, our model learns to decompose the pixel space into a spatially low-dimensional superpixel space via a series of local cross-attentions. We then apply multi-head self-attention to the superpixels to enrich the superpixel features with global context and then directly produce a class prediction for each superpixel. Finally, we directly project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features. Reasoning in the superpixel space allows our method to be substantially more computationally efficient compared to convolution-based decoder methods. Yet, our method achieves state-of-the-art performance in semantic segmentation due to the rich superpixel features generated by the global self-attention mechanism. Our experiments on Cityscapes and ADE20K demonstrate that our method matches the state of the art in terms of accuracy, while requiring fewer model parameters and achieving lower latency.
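
To make the pipeline concrete, here is a heavily simplified sketch in Python/PyTorch. It is illustrative only: it uses dense rather than local cross-attention, random queries instead of learned ones, and made-up shapes; it is not the architecture from the paper.

```python
import torch
import torch.nn.functional as F

def superpixel_segmentation(pixel_feats, num_superpixels=256, num_classes=19):
    """Toy sketch: pool pixels into a small set of superpixel tokens via
    cross-attention, enrich them with global self-attention, classify each
    superpixel, and project the class scores back to the pixels through the
    soft pixel-to-superpixel associations."""
    n, c = pixel_feats.shape                       # [num_pixels, channels]
    queries = torch.randn(num_superpixels, c)      # stand-in for learned queries

    # Cross-attention: superpixel tokens attend to (and pool) pixel features.
    assoc = F.softmax(queries @ pixel_feats.t() / c ** 0.5, dim=1)    # [S, N]
    sp_feats = assoc @ pixel_feats                                    # [S, C]

    # Global self-attention among the few superpixel tokens is cheap.
    attn = F.softmax(sp_feats @ sp_feats.t() / c ** 0.5, dim=1)
    sp_feats = attn @ sp_feats

    # Classify superpixels, then scatter the predictions back to the pixels.
    classifier = torch.nn.Linear(c, num_classes)   # illustrative, untrained
    sp_logits = classifier(sp_feats)                                  # [S, K]
    pixel_logits = assoc.t() @ sp_logits                              # [N, K]
    return pixel_logits

print(superpixel_segmentation(torch.randn(64 * 64, 32)).shape)  # [4096, 19]
```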

CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection

Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty, Nicholas Armstrong-Crews, Tiffany Chen, Dragomir Anguelov

European Conference on Computer Vision (ECCV), 2022.

[pdf] [arXiv]


Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to the camera and elevation is unknown to the radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.
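
The ray-constrained attention idea can be illustrated with a toy sketch: candidate depths are placed along a pixel's viewing ray, radar features are gathered at those 3D points, and the pixel feature attends over them to resolve its depth. Everything below (function names, the radar_feat_at lookup, the scaling) is a simplifying assumption, not the paper's model.

```python
import torch
import torch.nn.functional as F

def ray_constrained_depth(pixel_feat, ray_dir, radar_feat_at, depth_candidates):
    """Attend over radar features sampled along a camera pixel's viewing ray
    and return a soft depth estimate. `radar_feat_at(point)` is a placeholder
    that returns the radar feature vector associated with a 3D point."""
    points = torch.stack([d * ray_dir for d in depth_candidates])     # [K, 3]
    radar_feats = torch.stack([radar_feat_at(p) for p in points])     # [K, C]
    scores = radar_feats @ pixel_feat / pixel_feat.shape[0] ** 0.5    # [K]
    attn = F.softmax(scores, dim=0)
    return (attn * torch.tensor(depth_candidates)).sum()              # meters

# Toy usage with random features and candidate depths of 10, 20, and 30 m.
depth = ray_constrained_depth(torch.randn(16), torch.tensor([0.0, 0.0, 1.0]),
                              lambda p: torch.randn(16), [10.0, 20.0, 30.0])
```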

Instance Segmentation with Cross-Modal Consistency

Alex Zihao Zhu, Vincent Casser, Reza Mahjourian, Henrik Kretzschmar, Sören Pirk

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022.

[pdf] [arXiv]


Segmenting object instances is a key task in machine perception, with safety-critical applications in robotics and autonomous driving. We introduce a novel approach to instance segmentation that jointly leverages measurements from multiple sensor modalities, such as cameras and LiDAR. Our method learns to predict embeddings for each pixel or point that give rise to a dense segmentation of the scene. Specifically, our technique applies contrastive learning to points in the scene both across sensor modalities and the temporal domain. We demonstrate that this formulation encourages the models to learn embeddings that are invariant to viewpoint variations and consistent across sensor modalities. We further demonstrate that the embeddings are stable over time as objects move around the scene. This not only provides stable instance masks, but can also provide valuable signals to downstream tasks, such as object tracking. We evaluate our method on the Cityscapes and KITTI-360 datasets. We further conduct a number of ablation studies, demonstrating benefits when applying additional inputs for the contrastive loss.
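
The cross-modal consistency term can be written as a standard InfoNCE-style loss between camera-pixel and LiDAR-point embeddings of the same instance. The sketch below is a generic contrastive loss under that assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(cam_emb, lidar_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of `cam_emb` and `lidar_emb` is assumed
    to come from the same object instance, so matching pairs sit on the
    diagonal of the similarity matrix."""
    cam = F.normalize(cam_emb, dim=1)
    lidar = F.normalize(lidar_emb, dim=1)
    logits = cam @ lidar.t() / temperature           # [N, N] similarities
    targets = torch.arange(cam.shape[0])             # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```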

Block-NeRF: Scalable Large Scene Neural View Synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, Henrik Kretzschmar

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[pdf] [arXiv] [project page] [video] [thecvf]


We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.
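
At render time, the outputs of nearby block NeRFs have to be combined into one image. The snippet below sketches a plausible inverse-distance compositing step under strong simplifying assumptions (the real system also filters blocks by visibility and aligns their appearance codes); the `render_fn` interface is hypothetical.

```python
import numpy as np

def composite_block_renders(camera_origin, blocks, radius=100.0, power=4.0):
    """Blend the outputs of all block NeRFs whose centers lie within `radius`
    of the camera, weighted by inverse distance. `blocks` is a list of
    (center, render_fn) pairs where render_fn(origin) returns an RGB image as
    a numpy array; at least one block is assumed to be in range."""
    renders, weights = [], []
    for center, render_fn in blocks:
        dist = np.linalg.norm(camera_origin - center)
        if dist < radius:
            renders.append(render_fn(camera_origin))
            weights.append((dist + 1e-6) ** -power)
    weights = np.array(weights) / np.sum(weights)
    return np.tensordot(weights, np.stack(renders), axes=1)
```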

Waymo Open Dataset: Panoramic Video Panoptic Segmentation

Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar, Dragomir Anguelov

European Conference on Computer Vision (ECCV), 2022.

[pdf] [arXiv]


Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community thereby relies on publicly available benchmark datasets to advance the state of the art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation Dataset, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset, leveraging the diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We will make the benchmark and the code publicly available. Find the dataset at waymo.com/open.

LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection

Wei-Chih Hung, Henrik Kretzschmar, Vincent Casser, Jyh-Jing Hwang, Dragomir Anguelov

arXiv, 2022.

[pdf] [arXiv]


The popular object detection metric 3D Average Precision (3D AP) relies on the intersection over union between predicted bounding boxes and ground truth bounding boxes. However, depth estimation based on cameras has limited accuracy, which may cause otherwise reasonable predictions that suffer from such longitudinal localization errors to be treated as false positives and false negatives. We therefore propose variants of the popular 3D AP metric that are designed to be more permissive with respect to depth estimation errors. Specifically, our novel longitudinal error tolerant metrics, LET-3D-AP and LET-3D-APL, allow longitudinal localization errors of the predicted bounding boxes up to a given tolerance. The proposed metrics have been used in the Waymo Open Dataset 3D Camera-Only Detection Challenge. We believe that they will facilitate advances in the field of camera-only 3D detection by providing more informative performance signals.
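
The matching criterion behind the metric can be sketched compactly: the prediction error is decomposed into a component along the line of sight to the ground truth box (tolerated up to a fraction of the range) and a lateral component (not tolerated). The snippet below is a simplified illustration; the official metric additionally defines an affinity and the localization penalty used by LET-3D-APL, and its default tolerances may differ.

```python
import numpy as np

def let_match(pred_center, gt_center, max_rel_long_err=0.1, max_lat_err=2.0):
    """Longitudinal-error-tolerant matching of a predicted 3D box center to a
    ground truth center (simplified)."""
    gt_range = np.linalg.norm(gt_center)
    unit_los = gt_center / gt_range              # line-of-sight direction
    err = pred_center - gt_center
    long_err = np.dot(err, unit_los)             # signed error along the ray
    lat_err = np.linalg.norm(err - long_err * unit_los)
    tol = max_rel_long_err * gt_range            # tolerance grows with range
    return abs(long_err) <= tol and lat_err <= max_lat_err

# A camera-only detection that is 4 m too deep but laterally accurate
# still matches at 50 m range, since the tolerance grows with distance.
print(let_match(np.array([0.0, 0.0, 54.0]), np.array([0.0, 0.0, 50.0])))
```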

Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

Longlong Jing, Ruichi Yu, Henrik Kretzschmar, Kang Li, Charles R. Qi, Hang Zhao, Alper Ayvaci, Xu Chen, Dillon Cower, Yingwei Li, Yurong You, Han Deng, Congcong Li, Dragomir Anguelov

IEEE International Conference on Robotics and Automation (ICRA), 2022.

[pdf] [arXiv]


Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception including detection and tracking, however, often yield inferior performance when compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves the state-of-the-art performance of per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing estimated depth with fusion-enhanced depth, we can achieve significant improvements in monocular 3D perception tasks, including detection and tracking.

GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting

Zhao Chen, Vincent Casser, Henrik Kretzschmar, Dragomir Anguelov

arXiv, 2022.

[pdf] [arXiv]


We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers, which operate on converged (and possibly overfit) models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.
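
A rough sketch of gradient-based sample weighting in this spirit: compute per-sample gradients, measure how well each agrees with the batch-average gradient, and upweight the poorly aligned (long-tail) samples. The weighting function, temperature, and per-sample gradient loop below are illustrative assumptions, not the algorithm as published.

```python
import torch

def gradtail_style_weights(model, loss_fn, xs, ys, temperature=5.0):
    """Return one weight per sample; samples whose gradients disagree with
    the batch-average gradient receive larger weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    per_sample = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        per_sample.append(torch.cat([g.flatten() for g in grads]))
    grads = torch.stack(per_sample)                    # [batch, num_params]
    mean_grad = grads.mean(dim=0, keepdim=True)
    cos = torch.nn.functional.cosine_similarity(grads, mean_grad, dim=1)
    # Low agreement -> long-tail candidate -> higher weight
    # (normalized so the weights sum to the batch size).
    return torch.softmax(-temperature * cos, dim=0) * len(per_sample)

# The returned weights would multiply the per-sample losses in the next step.
```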

Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov

Conference on Neural Information Processing Systems (NeurIPS), 2020.

[pdf] [arXiv] [NeurIPS]


The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
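
The masking rule itself is short. Below is a minimal sketch of the sign-sampling step, assuming the per-task gradients with respect to a shared activation are already available; integrating it as a layer with the backward pass described in the paper is omitted.

```python
import torch

def grad_drop(task_grads, eps=1e-8):
    """Sample a sign per element according to how consistently the task
    gradients agree, mask each task gradient to that sign, and return the
    sum. `task_grads` is a list of same-shaped gradient tensors."""
    grads = torch.stack(task_grads)                          # [tasks, ...]
    # Probability of keeping the positive sign at each element.
    p_pos = 0.5 * (1.0 + grads.sum(0) / (grads.abs().sum(0) + eps))
    keep_positive = torch.rand_like(p_pos) < p_pos           # sampled sign mask
    masked = [g * torch.where(keep_positive, g > 0, g < 0).to(g.dtype)
              for g in grads]
    return torch.stack(masked).sum(0)
```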

SoDA: Multi-Object Tracking with Soft Data Association

Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang, Dragomir Anguelov

arXiv, 2020.

[pdf] [arXiv]


Robust multi-object tracking (MOT) is a prerequisite for the safe deployment of self-driving cars. Tracking objects, however, remains a highly challenging problem, especially in cluttered autonomous driving scenes in which objects tend to interact with each other in complex ways and frequently get occluded. We propose a novel approach to MOT that uses attention to compute track embeddings that encode the spatiotemporal dependencies between observed objects. This attention measurement encoding allows our model to relax hard data associations, which may lead to unrecoverable errors. Instead, our model aggregates information from all object detections via soft data associations. The resulting latent space representation allows our model to learn to reason about occlusions in a holistic data-driven way and maintain track estimates for objects even when they are occluded. Our experimental results on the Waymo Open Dataset suggest that our approach leverages modern large-scale datasets and performs favorably compared to the state of the art in visual multi-object tracking.

SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving

Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, Henrik Kretzschmar

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[pdf] [arXiv] [thecvf]


Autonomous driving system development is critically dependent on the ability to replay complex and diverse traffic scenarios in simulation. In such scenarios, the ability to accurately simulate the vehicle sensors such as cameras, lidar or radar is hugely helpful. However, current sensor simulators leverage gaming engines such as Unreal or Unity, requiring manual creation of environments, objects, and material properties. Such approaches have limited scalability and fail to produce realistic approximations of camera, lidar, and radar data without significant additional work. In this paper, we present a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of lidar and camera data collected by an autonomous vehicle. Our approach uses texture-mapped surfels to efficiently reconstruct the scene from an initial vehicle pass or set of passes, preserving rich information about object 3D geometry and appearance, as well as the scene conditions. We then leverage a SurfelGAN network to reconstruct realistic camera images for novel positions and orientations of the self-driving vehicle and moving objects in the scene. We demonstrate our approach on the Waymo Open Dataset and show that it can synthesize realistic camera data for simulated scenarios. We also create a novel dataset that contains cases in which two self-driving vehicles observe the same scene at the same time. We use this dataset to provide additional evaluation and demonstrate the usefulness of our SurfelGAN model.

Scalability in Perception for Autonomous Driving: Waymo Open Dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, Dragomir Anguelov

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[pdf] [arxiv] [thecvf] [github] [Waymo blog]


The research community has increasing interest in autonomous driving research, despite the resource intensity of obtaining representative real world data. Existing self-driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is crucial to the overall viability of the technology. In an effort to help align the research community's contributions with real-world self-driving problems, we introduce a new large scale, high quality, diverse dataset. Our new dataset consists of 1150 scenes, each spanning 20 seconds, with well synchronized and calibrated high quality LiDAR and camera data captured across a range of urban and suburban geographies. It is 15x more diverse than the largest camera+LiDAR dataset available based on our proposed diversity metric. We exhaustively annotated this data with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further study the effects of dataset size and generalization across geographies on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.

Socially Compliant Mobile Robot Navigation via Inverse Reinforcement Learning

Henrik Kretzschmar, Markus Spies, Christoph Sprunk, Wolfram Burgard

The International Journal of Robotics Research (IJRR), 2016.

[pdf] [sagepub]


Mobile robots are increasingly populating our human environments. To interact with humans in a socially compliant way, these robots need to understand and comply with mutually accepted rules. In this paper, we present a novel approach to model the cooperative navigation behavior of humans. We model their behavior in terms of a mixture distribution that captures both the discrete navigation decisions, such as going left or going right, as well as the natural variance of human trajectories. Our approach learns the model parameters of this distribution that match, in expectation, the observed behavior in terms of user-defined features. To compute the feature expectations over the resulting high-dimensional continuous distributions, we use Hamiltonian Markov chain Monte Carlo sampling. Furthermore, we rely on a Voronoi graph of the environment to efficiently explore the space of trajectories from the robot's current position to its target position. Using the proposed model, our method is able to imitate the behavior of pedestrians or, alternatively, to replicate a specific behavior that was taught by tele-operation in the target environment of the robot. We implemented our approach on a real mobile robot and demonstrate that it is able to successfully navigate in an office environment in the presence of humans. An extensive set of experiments suggests that our technique outperforms state-of-the-art methods to model the behavior of pedestrians, which makes it also applicable to fields such as behavioral science or computer graphics.
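
The learning step at the heart of this approach is maximum entropy feature matching: adjust the feature weights until trajectories sampled from the learned distribution reproduce, in expectation, the features of the demonstrations. The sketch below assumes a sampler `sample_trajectories(w, n)` that draws n trajectories from p(traj) ∝ exp(w · f(traj)), standing in for the Hamiltonian Markov chain Monte Carlo sampler described in the paper; names and defaults are illustrative.

```python
import numpy as np

def maxent_feature_matching(demos, feature_fn, sample_trajectories,
                            lr=0.05, iterations=200, num_samples=100):
    """Gradient ascent on the demonstration log-likelihood: the gradient is
    the difference between empirical and expected feature counts."""
    empirical = np.mean([feature_fn(t) for t in demos], axis=0)
    w = np.zeros_like(empirical)
    for _ in range(iterations):
        samples = sample_trajectories(w, num_samples)
        expected = np.mean([feature_fn(t) for t in samples], axis=0)
        w += lr * (empirical - expected)
    return w
```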

Learning Probabilistic Models for Mobile Robot Navigation

Henrik Kretzschmar

PhD Thesis, University of Freiburg, 2014.

[pdf]


Mobile robots are envisioned to revolutionize how people live and work by dealing with everyday tasks such as cleaning and by providing services such as transportation. Such robotic systems need to function autonomously over extended periods of time in a socially compliant way for unobtrusive integration with humans. The robots thereby depend on accurate models, such as a map of the environment for autonomous navigation and, at the same time, models of human behavior for socially compliant human-robot interaction. In most instances, however, these models cannot be provided by human experts. The robots rather need to autonomously learn the models using their on-board sensors.

The contribution of this thesis is a set of novel techniques that enable a robot to learn probabilistic models for socially compliant mobile robot navigation. Learning accurate models from sensor data during long-term operation, however, is challenging owing to computational constraints and the inherent uncertainty in the measurements.

In the context of learning maps for robot navigation, we present a technique that reasons about the information gained from the measurements that the robot obtains. This enables the robot to discard highly redundant measurements, which facilitates mapping during long-term operation. In order to build a consistent map that is suitable for navigation, the robot needs to recognize previously observed places, which, however, is challenging, especially in ambiguous environments. We mitigate this problem by having the robot deploy uniquely identifiable artificial landmarks in the environment. Our approach learns an efficient landmark deployment policy that facilitates place recognition when the robot returns.

In the context of socially compliant human-robot interaction, we first explore how humans give route directions to others who are unfamiliar with the environment. Our goal is to enable robots to engage in such conversations by imitating humans. Our approach learns a model of the style of a set of descriptions given by a group of humans, which then allows the robot to give natural and intuitive directions just as well, even to goal locations in new environments. To seamlessly integrate mobile robots into everyday life, they require a model of acceptable navigation behavior. In this respect, we first accurately capture the movements of humans by combining readings of inertial measurement units worn by the humans with observations from a mobile robot. Our approach compensates for the drift in the inertial measurements, thereby obtaining accurate estimates of the human movements, even in large areas. The robot can then use these estimates to learn a model of the underlying human navigation behavior. Learning a behavior model, however, is especially challenging when only a limited set of imperfect training examples is available. We propose a framework for learning such a model under these circumstances, which then enables the robot to predict the movements of nearby pedestrians in new situations and to imitate their behavior in order to seamlessly blend in with the humans.

We evaluate the presented methods for learning probabilistic models on real mobile robots and demonstrate that they outperform the state of the art in robotics. Our approaches enable mobile robots to build highly accurate maps of the environment, even in ambiguous environments and during long-term operation. We furthermore demonstrate that our techniques enable socially compliant mobile robot navigation in populated environments. The approaches presented in this thesis are therefore useful for developing flexible mobile robots that autonomously collaborate with humans in a socially compliant way over extended periods of time even in previously unknown environments.

Identifying Vegetation from Laser Data in Structured Outdoor Environments

Kai M. Wurm, Henrik Kretzschmar, Rainer Kümmerle, Cyrill Stachniss, Wolfram Burgard

Robotics and Autonomous Systems (RAS), 2014.

[pdf]


The ability to reliably detect vegetation is an important requirement for outdoor navigation with mobile robots as it enables the robot to navigate more efficiently and safely. In this paper, we present an approach to detect flat vegetation, such as grass, which cannot be identified using range measurements. This type of vegetation is typically found in structured outdoor environments such as parks or campus sites.  Our approach classifies the terrain in the vicinity of the robot based on laser scans and makes use of the fact that plants exhibit specific reflection properties. It uses a support vector machine to learn a classifier for distinguishing vegetation from streets based on laser reflectivity, measured distance, and the incidence angle. In addition, it employs a vibration-based classifier to acquire training data in a self-supervised way and thus reduces manual work. Our approach has been evaluated extensively in real world experiments using several mobile robots. We furthermore evaluated it with different types of sensors and in the context of mapping, autonomous navigation, and exploration experiments. In addition, we compared it to an approach based on linear discriminant analysis. In our real world experiments, our approach yields a classification accuracy close to 100%.
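
The classifier itself is a small model over three features per laser return. Here is a minimal scikit-learn sketch under assumed feature and label conventions; in the paper, the training labels come from the self-supervised vibration-based classifier.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed feature layout per laser return:
# [reflectivity, range_m, incidence_angle_rad]
vegetation_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# vegetation_clf.fit(X_train, y_train)     # y: 1 = vegetation, 0 = street
# labels = vegetation_clf.predict(X_scan)  # classify the returns of a new scan
```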

Learning to Predict Trajectories of Cooperatively Navigating Agents

Henrik Kretzschmar, Markus Kuderer, Wolfram Burgard

IEEE International Conference on Robotics and Automation (ICRA), 2014.

[pdf]


The problem of modeling the navigation behavior of multiple interacting agents arises in different areas including robotics, computer graphics, and behavioral science. In this paper, we present an approach to learn the composite navigation behavior of interacting agents from demonstrations. The decision process that ultimately leads to the observed continuous trajectories of the agents often also comprises discrete decisions, which partition the space of composite trajectories into homotopy classes. Therefore, our method uses a mixture probability distribution that consists of a discrete distribution over the homotopy classes and continuous distributions over the composite trajectories for each homotopy class. Our approach learns the model parameters of this distribution that match, in expectation, the observed behavior in terms of user-defined features. To compute the feature expectations over the high-dimensional continuous distributions, we use Hamiltonian Markov chain Monte Carlo sampling. We exploit that the distributions are highly structured due to physical constraints and guide the sampling process to regions of high probability. We apply our approach to learning the behavior of pedestrians and demonstrate that it outperforms state-of-the-art methods.

Learning to Give Route Directions from Human Demonstrations

Stefan Oßwald, Henrik Kretzschmar, Wolfram Burgard, Cyrill Stachniss

IEEE International Conference on Robotics and Automation (ICRA), 2014.

Best Cognitive Robotics Paper - Finalist.

[pdf]


For several applications, robots and other computer systems must provide route descriptions to humans. These descriptions should be natural and intuitive for the human users. In this paper, we present an algorithm that learns how to provide good route descriptions from a corpus of human-written directions. Using inverse reinforcement learning, our algorithm learns how to select the information for the description depending on the context of the route segment. The algorithm then uses the learned policy to generate directions that imitate the style of the descriptions provided by humans, thus taking into account personal as well as cultural preferences and special requirements of the particular user group providing the learning demonstrations. We evaluate our approach in a user study and show that the directions generated by our policy sound similar to human-given directions and substantially more natural than directions provided by commercial web services.

Online Generation of Homotopically Distinct Navigation Paths

Markus Kuderer, Christoph Sprunk, Henrik Kretzschmar, Wolfram Burgard

IEEE International Conference on Robotics and Automation (ICRA), 2014.

[pdf]


In mobile robot navigation, cost functions are a popular approach to generate feasible, safe paths that avoid obstacles and that allow the robot to get from its starting position to the goal position. Alternative ways to navigate around the obstacles typically correspond to different local minima in the cost function. In this paper we present a highly effective approach to overcome such local minima and to quickly propose a set of alternative, topologically different and optimized paths. We furthermore describe how to maintain a set of optimized trajectory alternatives to reduce optimization efforts when the robot has to adapt to changes in the environment. We demonstrate in experiments that our method outperforms a state-of-the-art approach by an order of magnitude in computation time, which allows a robot to use our method online during navigation. We furthermore demonstrate that the approach of using a set of qualitatively different trajectories is beneficial in shared autonomy settings, where a user operating a wheelchair can quickly switch between topologically different trajectories.

Teaching Mobile Robots to Cooperatively Navigate in Populated Environments

Markus Kuderer, Henrik Kretzschmar, Wolfram Burgard

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013.

[pdf]


Mobile service robots are envisioned to operate in environments that are populated by humans and therefore ought to navigate in a socially compliant way. Since the desired behavior of the robots highly depends on the application, we need flexible means for teaching a robot a certain navigation policy. We present an approach that allows a mobile robot to learn how to navigate in the presence of humans while it is being teleoperated in its designated environment. Our method applies feature-based maximum entropy learning to derive a navigation policy from the interactions with the humans. The resulting policy maintains a probability distribution over the trajectories of all the agents that allows the robot to cooperatively avoid collisions with humans. In particular, our method reasons about multiple homotopy classes of the agents' trajectories, i.e., on which sides the agents pass each other. We implemented our approach on a real mobile robot and demonstrate that it is able to successfully navigate in an office environment in the presence of humans relying only on on-board sensors.

Deploying Artificial Landmarks to Foster Data Association in Simultaneous Localization and Mapping

Maximilian Beinhofer, Henrik Kretzschmar, Wolfram Burgard

IEEE International Conference on Robotics and Automation (ICRA), 2013.

[pdf]


Data association is an essential problem in simultaneous localization and mapping. It is hard to solve correctly, especially in ambiguous environments. We consider a scenario where the robot can ease the data association problem by deploying a limited number of uniquely identifiable artificial landmarks along its path and use them afterwards as fixed anchors. Obviously, the choice of the positions where the robot should drop these markers is crucial as poor choices might prevent the robot from establishing accurate data associations. In this paper, we present a novel approach for learning when to drop the landmarks so as to optimize the data association performance. We use Monte Carlo reinforcement learning for computing an optimal policy and apply a statistical convergence test to decide if the policy is converged and the learning process can be stopped. Extensive experiments also carried out with a real robot demonstrate that the data association performance using landmarks deployed according to our learned policies is significantly higher compared to other strategies.

Learning Manipulation Actions from a Few Demonstrations

Nichola Abdo, Henrik Kretzschmar, Luciano Spinello, Cyrill Stachniss

IEEE International Conference on Robotics and Automation (ICRA), 2013.

[pdf]


To efficiently plan complex manipulation tasks, robots need to reason on a high level. Symbolic planning, however, requires knowledge about the preconditions and effects of the individual actions. In this work, we present a practical approach to learn manipulation skills, including preconditions and effects, based on teacher demonstrations. We believe that requiring only a small number of demonstrations is essential for robots operating in the real world. Therefore, our main focus and contribution is the ability to infer the preconditions and effects of actions based on a small number of demonstrations. Our system furthermore expresses the acquired manipulation actions as planning operators and is therefore able to use symbolic planners to solve new tasks. We implemented our approach on a PR2 robot and present real world manipulation experiments that illustrate that our system allows non-experts to transfer knowledge to robots.

Predicting Human Navigation Behavior via Inverse Reinforcement Learning

Henrik Kretzschmar, Markus Kuderer, Wolfram Burgard

The 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2013.

[pdf]


We present an approach that allows a mobile robot to learn the behavior of pedestrians from observed trajectories. Our method maintains probability distributions over composite trajectories of all the pedestrians and represents these distributions as a mixture model. The upper level of this model represents a discrete distribution over classes of trajectories that are equivalent according to a set of features, such as passing on the left or passing on the right side. The lower level comprises continuous probability distributions over trajectories for each of these classes and captures physical features of the trajectories, such as velocities and accelerations. For each level, our method learns maximum entropy distributions that match the feature values of the observations. To estimate the expected feature values with respect to the high-dimensional probability distributions over the composite trajectories, our approach applies Hamiltonian Markov Chain Monte Carlo sampling. The extensive experimental evaluation suggests that our method models human navigation behavior more accurately than state-of-the-art techniques.

Information-Theoretic Compression of Pose Graphs for Laser-Based SLAM

Henrik Kretzschmar, Cyrill Stachniss

The International Journal of Robotics Research (IJRR), 2012.

[pdf] [sagepub]


In graph-based simultaneous localization and mapping (SLAM), the pose graph grows over time as the robot gathers information about the environment. An ever growing pose graph, however, prevents long-term mapping with mobile robots. In this paper, we address the problem of efficient information-theoretic compression of pose graphs. Our approach estimates the mutual information between the laser measurements and the map to discard the measurements that are expected to provide only a small amount of information. Our method subsequently marginalizes out the nodes from the pose graph that correspond to the discarded laser measurements. To maintain a sparse pose graph that allows for efficient map optimization, our approach applies an approximate marginalization technique that is based on Chow–Liu trees. Our contributions allow the robot to effectively restrict the size of the pose graph. Alternatively, the robot is able to maintain a pose graph that does not grow unless the robot explores previously unobserved parts of the environment. Real-world experiments demonstrate that our approach to pose graph compression is well suited for long-term mobile robot mapping.
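
The scan-selection criterion can be approximated very compactly: a scan is worth keeping if the occupancy grid cells it observes are still uncertain. The sketch below replaces the paper's mutual-information computation with the summed entropy of the observed cells and omits the Chow-Liu tree marginalization; thresholds and interfaces are assumptions.

```python
import numpy as np

def scan_information_gain(observed_cell_probs):
    """Approximate the information a laser scan provides about the map by the
    entropy of the occupancy probabilities of the cells it observes."""
    p = np.clip(observed_cell_probs, 1e-6, 1.0 - 1e-6)
    return float(np.sum(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)))

def select_scans(scans, observed_cells_per_scan, min_gain=5.0):
    """Keep only informative scans; the pose nodes of discarded scans would
    then be marginalized out of the pose graph (approximately, via local
    Chow-Liu trees, as described in the paper)."""
    return [scan for scan, cells in zip(scans, observed_cells_per_scan)
            if scan_information_gain(cells) >= min_gain]
```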

Videos: kretzschmar12ijrr-fr079.mp4 (Freiburg Building 079), kretzschmar12ijrr-intel.mp4 (Intel Building), kretzschmar12ijrr-fhw.mp4 (FHW Building).

Feature-Based Prediction of Trajectories for Socially Compliant Navigation

Markus Kuderer, Henrik Kretzschmar, Christoph Sprunk, Wolfram Burgard

Robotics: Science and Systems (RSS), 2012.

[pdf] [rss]


Mobile robots that operate in a shared environment with humans need the ability to predict the movements of people to better plan their navigation actions. In this paper, we present a novel approach to predict the movements of pedestrians. Our method reasons about entire trajectories that arise from interactions between people in navigation tasks. It applies a maximum entropy learning method based on features that capture relevant aspects of the trajectories to determine the probability distribution that underlies human navigation behavior. Hence, our approach can be used by mobile robots to predict forthcoming interactions with pedestrians and thus react in a socially compliant way. In extensive experiments, we evaluate the capability and accuracy of our approach and demonstrate that our algorithm outperforms the popular social forces method, a state-of-the-art approach. Furthermore, we show how our algorithm can be used for autonomous robot navigation using a real robot.

Accurate Human Motion Capture in Large Areas by Combining IMU- and Laser-Based People Tracking

Jakob Ziegler, Henrik Kretzschmar, Cyrill Stachniss, Giorgio Grisetti, Wolfram Burgard

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.

[pdf]


A large number of applications use motion capture systems to track the location and the body posture of people. For instance, the movie industry captures actors to animate virtual characters that perform stunts. Today's tracking systems either operate with statically mounted cameras and thus can be used in confined areas only or rely on inertial sensors that allow for free and large-scale motion but suffer from drift in the pose estimate. This paper presents a novel tracking approach that aims to provide globally aligned full body posture estimates by combining a mobile robot and an inertial motion capture system. In our approach, a mobile robot equipped with a laser scanner is used to anchor the pose estimates of a person given a map of the environment. It uses a particle filter to globally localize a person wearing a motion capture suit and to robustly track the person's position. To obtain a smooth and globally aligned trajectory of the person, we solve a least squares optimization problem formulated from the motion capture suit and tracking data. Our approach has been implemented on a real robot and exhaustively tested. As the experimental evaluation shows, our system is able to provide locally precise and globally aligned estimates of the person's full body posture.

Efficient Information-Theoretic Graph Pruning for Graph-Based SLAM with Laser Range Finders

Henrik Kretzschmar, Cyrill Stachniss, Giorgio Grisetti

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.

[pdf]


In graph-based SLAM, the pose graph encodes the poses of the robot during data acquisition as well as spatial constraints between them. The size of the pose graph has a substantial influence on the runtime and the memory requirements of a SLAM system, which hinders long-term mapping. In this paper, we address the problem of efficient information-theoretic compression of pose graphs. Our approach estimates the expected information gain of laser measurements with respect to the resulting occupancy grid map. It allows for restricting the size of the pose graph depending on the information that the robot acquires about the environment or based on a given memory limit, which results in an any-space SLAM system. When discarding laser scans, our approach marginalizes out the corresponding pose nodes from the graph. To avoid a densely connected pose graph, which would result from exact marginalization, we propose an approximation to marginalization that is based on local Chow-Liu trees and maintains a sparse graph. Real world experiments suggest that our approach effectively reduces the growth of the pose graph while minimizing the loss of information in the resulting grid map.

Pose Graph Compression for Laser-Based SLAM

Cyrill Stachniss, Henrik Kretzschmar

International Symposium of Robotics Research (ISRR), 2011.

Invited presentation.

[pdf]


The pose graph is a central data structure in graph-based SLAM approaches. It encodes the poses of the robot during data acquisition as well as spatial constraints between them. The size of the pose graph has a direct influence on the runtime and the memory requirements of a SLAM system since it is typically used to make data associations and within the optimization procedure. In this paper, we address the problem of efficient, information-theoretic compression of such pose graphs. The central question is which sensor measurements can be removed from the graph without losing too much information. Our approach estimates the expected information gain of laser measurements with respect to the resulting occupancy grid map. It allows us to restrict the size of the pose graph depending on the information that the robot acquires about the environment. Alternatively, we can enforce a maximum number of laser scans the robot is allowed to store, which results in an any-space SLAM system. Real world experiments suggest that our approach efficiently reduces the growth of the pose graph while minimizing the loss of information in the resulting grid map.

Lifelong Map Learning for Graph-based SLAM in Static Environments

Henrik Kretzschmar, Giorgio Grisetti, Cyrill Stachniss

KI - Künstliche Intelligenz (KI), 2010.


In this paper, we address the problem of lifelong map learning in static environments with mobile robots using the graph-based formulation of the simultaneous localization and mapping problem. The pose graph, which stores the poses of the robot and spatial constraints between them, is the central data structure in graph-based SLAM. The size of the pose graph has a direct influence on the runtime and the memory complexity of the SLAM system and typically grows over time. A robot that performs lifelong mapping in a bounded environment has to limit the memory and computational complexity of its mapping system. We present a novel approach to prune the pose graph so that it only grows when the robot acquires relevant new information about the environment in terms of expected information gain. As a result, our approach scales with the size of the environment and not with the length of the trajectory, which is an important prerequisite for lifelong map learning. The experiments presented in this paper illustrate the properties of our method using real robots.

Estimating Landmark Locations from Geo-Referenced Photographs

Henrik Kretzschmar, Cyrill Stachniss, Christian Plagemann, Wolfram Burgard

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008.

[pdf]


The problem of estimating the positions of landmarks using a mobile robot equipped with a camera has intensively been studied in the past. In this paper, we consider a variant of this problem in which the robot should estimate the locations of observed landmarks based on a sparse set of geo-referenced images for which no heading information is available. Sources for such kind of data are image portals such as Flickr or Google Image Search. We formulate the problem of estimating the landmark locations as an optimization problem and show that it is possible to accurately localize the landmarks in real world settings.

Domänenspezifische Sprachmodelle und Konfidenzmaße für die Domänendetektion und die Verringerung von Erkennungsfehlern (Domain-Specific Language Models and Confidence Measures for Domain Detection and the Reduction of Recognition Errors)

André Berton, Harald Hüning, Henrik Kretzschmar

Konferenz Elektronische Sprachsignalverarbeitung (ESSV), 2003.


In the context of in-car speech interfaces, this paper uses language models and confidence measures to investigate whether small, domain-specific language models for individual application domains (e.g., phone control, navigation, booking services) offer advantages over a single, cross-domain language model. Following the language models of the VICO project, the navigation and hotel reservation domains are considered. With separate language models, and consequently two recognition results for the two domains, sentence-level confidence scores decide which recognizer's result is used (domain detection). The resulting error rate after this decision is compared to the error rate of a single cross-domain recognizer. This comparison shows that the new approach yields significantly lower word error rates and very high domain detection accuracy.

Patent Applications

Find a selection of the patent applications on which I am a named inventor below.

Training instance segmentation neural networks through contrastive learning

Alex Zihao Zhu, Vincent Michael Casser, Henrik Kretzschmar, Reza Mahjourian, Soeren Pirk

United States Patent Application, US 2023/0334842 A1, October 19, 2023.

[pdf]


Methods, systems, and apparatus for processing inputs that include video frames using neural networks. In one aspect, a system comprises one or more computers configured to obtain a set of one or more training images and, for each training image, ground truth instance data that identifies, for each of one or more object instances, a corresponding region of the training image that depicts the object instance. For each training image in the set, the one or more computers process the training image using an instance segmentation neural network to generate an embedding output comprising a respective embedding for each of a plurality of output pixels. The one or more computers then train the instance segmentation neural network to minimize a loss function.

Generating panoptic segmentation labels

Jieru Mei, Hang Yan, Liang-Chieh Chen, Siyuan Qiao, Yukun Zhu, Alex Zihao Zhu, Xinchen Yan, Henrik Kretzschmar

United States Patent Application, US 2023/0281824 A1, September 09, 2023.

[pdf]


Methods, systems, and apparatus for generating a panoptic segmentation label for a sensor data sample. In one aspect, a system comprises one or more computers configured to obtain a sensor data sample characterizing a scene in an environment. The one or more computers obtain a 3D bounding box annotation at each time point for a point cloud characterizing the scene at the time point. The one or more computers obtain, for each camera image and each time point, annotation data identifying object instances depicted in the camera image, and the one or more computers generate a panoptic segmentation label for the sensor data sample characterizing the scene in the environment.

Camera-radar sensor fusion using local attention mechanism

Jyh-Jing Hwang, Henrik Kretzschmar, Dragomir Anguelov

United States Patent Application, US 2023/0213643 A1, July 06, 2023.

[pdf]


Methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for processing sensor data. In one aspect, a method includes obtaining image data representing a camera sensor measurement of a scene; obtaining radar data representing a radar sensor measurement of the scene; generating a feature representation of the image data; generating a respective initial depth estimate for each of a subset of the plurality of pixels; generating a feature representation of the radar data; for each of the subset of the plurality of pixels, generating a respective adjusted depth estimate for the pixel using the initial depth estimate for the pixel and the radar feature vectors for a corresponding subset of the plurality of radar reflection points; generating a fused point cloud that includes a plurality of three-dimensional data points; and processing the fused point cloud to generate an output that characterizes the scene.

Large scene neural view synthesis

Vincent Michael Casser, Henrik Kretzschmar, Matthew Justin Tancik, Sabeek Mani Pradhan, Benjamin Joseph Mildenhall, Pratul Preeti Srinivasan, Jonathan Tilton Barron

United States Patent Application, US 2023/0177822 A1, June 08, 2023.

[pdf]


Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for rendering a new image that depicts a scene from a perspective of a camera at a new camera viewpoint.

Time-line based object tracking annotation

Yulai Shen, Henrik Kretzschmar, Jeffrey Sham, Jeffrey Carlson, Lo Po Tsui, Dragomir Anguelov

United States Patent Application, US 2022/0358314 A1, November 10, 2022.

[pdf]


Methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for generating and editing object track labels for objects detected in video data. One of the methods includes obtaining a video segment comprising multiple image frames associated with multiple time points; obtaining object track data specifying a set of object tracks; providing, for presentation to a user, a user interface for modifying the object track data, the user interface displaying object timeline representations of the object tracks; receiving one or more user inputs that indicate one or more modifications to the object timeline representations; updating the object timeline representations displayed in the timeline display area; and updating the object track data according to the updated object timeline representations.

Three-dimensional location prediction from images

Longlong Jing, Ruichi Yu, Jiyang Gao, Henrik Kretzschmar, Kang Li, Ruizhongtai Qi, Hang Zhao, Alper Ayvaci, Xu Chen, Dillon Cower, Congcong Li

United States Patent Application, US 2022/0180549 A1, June 09, 2022.

[pdf]


Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicting three-dimensional object locations from images. One of the methods includes obtaining a sequence of images that comprises, at each of a plurality of time steps, a respective image that was captured by a camera at the time step; generating, for each image in the sequence, respective pseudo-lidar features of a respective pseudo-lidar representation of a region in the image that has been determined to depict a first object; generating, for a particular image at a particular time step in the sequence, image patch features of the region in the particular image that has been determined to depict the first object; and generating, from the respective pseudo-lidar features and the image patch features, a prediction that characterizes a location of the first object in a three-dimensional coordinate system at the particular time step in the sequence.

Training perspective computer vision models using view synthesis

Vincent Michael Casser, Yuning Chai, Dragomir Anguelov, Hang Zhao, Henrik Kretzschmar, Reza Mahjourian, Anelia Angelova, Ariel Gordon, Soeren Pirk

United States Patent Application, US 2021/0390407 A1, December 16, 2021.

[pdf]


Methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for training a perspective computer vision model. The model is configured to receive input data characterizing an input scene in an environment from an input viewpoint and to process the input data in accordance with a set of model parameters to generate an output perspective representation of the scene from the input viewpoint. The system trains the model based on first data characterizing a scene in the environment from a first viewpoint and second data characterizing the scene in the environment from a second, different viewpoint.

Long-range object detection, localization, tracking and classification for autonomous vehicles

Ruichi Yu, Kang Li, Tao Han, Robert Cosgriff, Henrik Kretzschmar

United States Patent Application, US 2021/0390407 A1, December 16, 2021.

[pdf]


Aspects of the disclosure relate to controlling a vehicle. For instance, using a camera, a first camera image including a first object may be captured. A first bounding box for the first object and a distance to the first object may be identified. A second camera image including a second object may be captured. A second bounding box for the second image and a distance to the second object may be identified. Whether the first object is the second object may be determined using a plurality of models to compare visual similarity of the two bounding boxes, to compare a three-dimensional location based on the distance to the first object and a three-dimensional location based on the distance to the second object, and to compare results from the first and second models. The vehicle may be controlled in an autonomous driving mode based on a result of the third model.

Phrase recognition model for autonomous vehicles

Victoria Dean, Abhijit S. Ogale, Henrik Kretzschmar, David Harrison Silver, Carl Kershaw, Pankaj Chaudhari, Chen Wu, Congcong Li

United States Patent, US 10,699,141 B2,  June 30, 2020.

[pdf]


Aspects of the disclosure relate to training and using a phrase recognition model to identify phrases in images. As an example, a selected phrase list may include a plurality of phrases is received. Each phrase of the plurality of phrases includes text. An initial plurality of images may be received. A training image set may be selected from the initial plurality of images by identifying the phrase-containing images that include one or more phrases from the selected phrase list. Each given phrase-containing image of the training image set may be labeled with information identifying the one or more phrases from the selected phrase list included in the given phrase-containing images. The model may be trained based on the training image set such that the model is configured to, in response to receiving an input image, output data indicating whether a phrase of the plurality of phrases is included in the input image.

Multi object tracking using memory attention

Wei-Chih Hung, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov

International Patent Application, WO 2021/097429 A1, 20 May 2021.

[pdf]


Methods, systems, and apparatus, including computer programs encoded on computer storage media, for multi object tracking using memory attention.

Generating environmental data

Zhenpei Yang, Yuning Chai, Yin Zhou, Pei Sun, Henrik Kretzschmar, Sean Rafferty, Dumitru Erhan, Dragomir Anguelov

International Patent Application, WO 2021/097409 A1, 20 May 2021.

[pdf]


Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating simulated sensor data. One of the methods includes obtaining a surfel map generated from sensor observations of a real-world environment and generating, for each surfel in the surfel map, a respective grid having a plurality of grid cells, wherein each grid has an orientation matching an orientation of a corresponding surfel, and wherein each grid cell within each grid is assigned a respective color value. For a simulated location within a simulated representation of the real-world environment, a textured surfel rendering is generated, including combining color information from grid cells visible from the simulated location within the simulated representation of the real-world environment.

Cyclist hand signal detection by an autonomous vehicle

Henrik Kretzschmar, Jiajun Zhu

United States Patent, US 9,014,905 B1, April 21, 2015.

[pdf]


Methods and systems for detecting hand signals of a cyclist by an autonomous vehicle are described. An example method may involve a computing device receiving a plurality of data points corresponding to an environment of an autonomous vehicle. The computing device may then determine one or more subsets of data points from the plurality of data points indicative of at least a body region of a cyclist. Further, based on an output of a comparison of the one or more subsets with one or more predetermined sets of cycling signals, the computing device may determine an expected adjustment of one or more of a speed of the cyclist and a direction of movement of the cyclist. Still further, based on the expected adjustment, the computing device may provide instructions to adjust one or more of a speed of the autonomous vehicle and a direction of movement of the autonomous vehicle.