Why don't self-driving cars need radar? Tesla's chief AI scientist explains

Companies and researchers disagree about what technology stack fully autonomous vehicles need. There are many proposed routes to autonomous driving: some rely only on cameras and computer vision, while others combine computer vision with additional sensors. Tesla has long been an advocate of purely vision-based autonomous driving. At this year’s Computer Vision and Pattern Recognition (CVPR) conference, the company’s chief AI scientist, Andrej Karpathy, explained why.

For the past few years, Karpathy has led Tesla’s autonomous driving research and development. At the CVPR 2021 autonomous driving workshop, he detailed how the company is developing a deep learning system that understands the environment around the car from video input alone, and explained why Tesla is best positioned to make vision-based autonomous driving a reality.


General Computer Vision System

Deep neural networks are one of the main components of the autonomous driving stack. They analyze roads, signs, cars, obstacles, and pedestrians in the video captured by the car’s onboard cameras. But deep learning can also make mistakes when detecting objects in images. For this reason, most self-driving car companies, including Alphabet subsidiary Waymo, use lidar, a device that builds a 3D map of the car’s surroundings by emitting laser beams in all directions. Lidar provides additional information that can fill the gaps left by the neural networks.

However, adding lidar to the autonomous driving stack brings its own complications. Karpathy said: “You have to pre-map the environment with the lidar and then create a high-definition map; you have to insert all the lanes, figure out how they connect, and annotate all the traffic lights. At test time, you are simply driving around on that map.” At the same time, it is extremely difficult to create a precise map of every location where a self-driving car might operate. Karpathy said: “Collecting, building, and maintaining these high-definition lidar maps is not scalable. It is also extremely difficult to keep this infrastructure up to date.”

Tesla’s self-driving cars do not use lidar or high-definition maps. Karpathy said: “Everything that happens, happens for the first time, in the car, based on the videos from the eight cameras that surround the car.”

The autonomous driving technology must figure out where the lanes are, where the traffic lights are, what state they are in, and which ones apply to the vehicle, and it must do all of this without any predetermined road navigation information. Karpathy acknowledged that vision-based autonomous driving is technically harder to achieve, because it requires neural networks that function extremely well from video input alone. But he said: “Once the system actually works, it becomes a general-purpose computer vision system that can be deployed anywhere on the planet.”

With a general-purpose computer vision system, the car no longer needs complementary sensors. Karpathy said Tesla is already moving in this direction. The company previously used a combination of radar and cameras to support its driver-assistance system, but it has recently begun shipping cars without radar. Karpathy said: “We removed the radar, and these cars are driving on vision alone. That is because Tesla’s deep learning system has reached the point where it is a hundred times better than the radar, and the radar was starting to hold it back.”

Supervised learning


The main argument against a purely vision-based approach is the uncertainty over whether neural networks can do range-finding and depth estimation without help from lidar depth maps. Karpathy said: “Obviously humans drive around with vision, so our neural net is able to process visual input to understand the depth and velocity of the objects around us. But the big question is whether synthetic neural networks can do the same. Our work over the past few months has shown that they can.”

Tesla’s engineers wanted to create a deep learning system that could detect objects along with their depth, velocity, and acceleration. They decided to treat the challenge as a supervised learning problem, in which a neural network learns to detect objects and their associated properties after being trained on annotated data.
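To make the idea concrete, here is a minimal, illustrative sketch of such a supervised setup in PyTorch. The layers, tensor shapes, and loss are assumptions for the sake of the example, not Tesla’s actual model: a small backbone extracts features from object crops, and a regression head predicts depth, velocity, and acceleration against annotated targets.

```python
import torch
import torch.nn as nn

class PerObjectBackbone(nn.Module):
    """Toy convolutional feature extractor for a single object crop."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class ObjectStateHead(nn.Module):
    """Regresses depth, velocity, and acceleration for each object."""
    def __init__(self, in_dim: int = 128):
        super().__init__()
        self.head = nn.Linear(in_dim, 3)  # [depth, velocity, acceleration]

    def forward(self, feats):
        return self.head(feats)

backbone, head = PerObjectBackbone(), ObjectStateHead()
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
loss_fn = nn.SmoothL1Loss()

# One supervised step on (dummy) annotated data: object crops paired with
# labeled depth / velocity / acceleration targets, as a plain regression problem.
crops = torch.randn(8, 3, 128, 128)
targets = torch.randn(8, 3)
optimizer.zero_grad()
loss = loss_fn(head(backbone(crops)), targets)
loss.backward()
optimizer.step()
```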

To train their deep learning architecture, the Tesla team needed a massive dataset of millions of videos, carefully annotated with the objects they contain and those objects’ properties. Creating datasets for self-driving cars is especially tricky: engineers must make sure to include a diverse range of road settings and rarely occurring edge cases. Karpathy said: “When you have a large, clean, diverse dataset and you train a large neural network on it, what I have seen in practice is that success is guaranteed.”

Automatically labeling the dataset

Tesla has sold millions of camera-equipped cars around the world, which puts it in a strong position to collect the data needed to train its car vision deep learning models. Its self-driving team has accumulated 1.5 petabytes of data, consisting of one million 10-second videos and 6 billion objects annotated with bounding boxes, depth, and velocity. But labeling such a huge dataset is a huge challenge. One approach is to label it manually through data-labeling companies or online platforms such as Amazon Mechanical Turk, but that would require an enormous amount of manual work, could cost a fortune, and would be very slow.

Instead, the Tesla team used an auto-labeling technique that combines neural networks, radar data, and human review. Because the dataset is annotated offline, the neural networks can run the videos back and forth, compare their predictions with the ground truth, and adjust their parameters. This contrasts with test-time inference, where everything happens in real time and the deep learning models cannot look back.

Offline labeling also allowed the engineers to apply very powerful, compute-intensive object detection networks that cannot be deployed on cars, where inference must be real-time and low-latency. They used the radar sensor data to further verify the neural network’s inferences. All of this improved the precision of the labeling network. Karpathy said: “If you are offline, you have the benefit of hindsight, so you can do a much better job of fusing the different sensor data. And in addition, you can involve humans, who can do cleaning, verification, and editing.”
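A rough sketch of how this kind of offline auto-labeling could be organized, under a deliberately simplified assumption: per-frame depth estimates from a heavy offline detector are smoothed over time (using both past and future frames, which is only possible offline) and cross-checked against radar ranges, with disagreements queued for human review. All names, types, and thresholds here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    depth: float  # estimated distance to the object, in metres

def temporally_smooth(dets: List[Detection], window: int = 3) -> List[Detection]:
    """Offline smoothing: each frame can average over past *and* future frames."""
    out = []
    for i in range(len(dets)):
        lo, hi = max(0, i - window), min(len(dets), i + window + 1)
        out.append(Detection(depth=sum(d.depth for d in dets[lo:hi]) / (hi - lo)))
    return out

def auto_label_clip(vision_depths: List[float],
                    radar_depths: List[float],
                    disagreement_m: float = 0.5) -> Tuple[List[Detection], List[int]]:
    """Smooth the heavy offline detector's per-frame depth estimates, then flag
    frames where vision and radar disagree so a human can verify or correct them."""
    smoothed = temporally_smooth([Detection(d) for d in vision_depths])
    review_queue = [i for i, (det, r) in enumerate(zip(smoothed, radar_depths))
                    if abs(det.depth - r) > disagreement_m]
    return smoothed, review_queue
```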

Karpathy did not say how much human effort is needed to make the final corrections to the auto-labeling system, but human judgment has played a key role in steering it in the right direction.

While developing the dataset, the Tesla team found more than 200 triggers indicating that the object detection needed adjustment. These included problems such as inconsistent detections between different cameras, or between a camera and the radar. They also identified situations that might need special attention, such as tunnel entrances and exits, or cars with objects on their roofs. It took Tesla four months to develop and master all of these triggers. As the labeling network improved, it was deployed in “shadow mode”: it is installed in consumer cars and runs silently without sending any commands to the car, while its output is compared with that of the legacy network, the radar, and the driver’s behavior.

The Tesla team went through seven iterations of this data engineering loop. They started with an initial dataset and trained their neural network on it. Then they deployed the model in shadow mode on real cars and used the triggers to detect inconsistencies, errors, and special scenarios. The errors were corrected and, where necessary, new data was added to the dataset. Karpathy said: “We repeat this cycle over and over again until the neural network becomes good enough.”
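The loop Karpathy describes can be summarized in a short sketch. The function parameters stand in for steps (training, shadow-mode deployment, trigger mining, human review) whose internals are not public, so this is only an illustration of the overall cycle.

```python
def data_engine(dataset, train, deploy_in_shadow_mode, collect_triggered_clips,
                human_review, max_iterations=7):
    """Iterative data-engine loop: train, deploy silently in shadow mode, mine the
    scenarios that fire triggers, have humans correct them, fold them back into
    the dataset, and repeat until the network is good enough."""
    model = None
    for _ in range(max_iterations):
        model = train(dataset)
        shadow_logs = deploy_in_shadow_mode(model)           # runs on cars, issues no commands
        hard_clips = collect_triggered_clips(shadow_logs)    # inconsistencies, errors, edge cases
        if not hard_clips:                                   # nothing left to fix: good enough
            break
        dataset = dataset + human_review(hard_clips)         # corrected clips re-enter the dataset
    return model
```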

The architecture is therefore better described as a semi-automatic labeling system with a clever division of labor: the neural networks handle the repetitive work, while humans handle the high-level cognitive problems and the rare cases.

Interestingly, when asked whether the generation of triggers could be automated, Karpathy replied: “Automating the triggers is a very tricky problem, because you can have generic triggers, but they will not correctly represent the error modes. It would be very hard, for example, to automatically come up with a trigger for entering and exiting tunnels. That is something humans recognize through intuition, and it is still unclear how it could be automated.”

Hierarchical deep learning architecture


Tesla’s self-driving team needed efficient, well-designed neural networks to make the most of the high-quality dataset it had gathered. The company created a hierarchical deep learning architecture composed of different neural networks that process information and feed their output to the next set of networks.

The deep learning model uses convolutional neural networks to extract features from the videos of the eight cameras installed around the car and fuses them with a transformer network. It then fuses this information over time, which is important for tasks such as trajectory prediction and for smoothing out inconsistencies in the inferences. The spatial and temporal features are then fed into a branching hierarchy of neural networks, which Karpathy described as heads, trunks, and terminals. He said: “The reason you want this branching structure is that there is a huge number of outputs you are interested in, and you cannot afford a separate neural network for every one of them.”

The hierarchical structure makes it possible to reuse components across tasks and enables feature sharing between different inference paths. Another benefit of the network’s modular architecture is that it allows distributed development. Tesla employs a large team of machine learning engineers working on its self-driving neural networks; each of them works on a small component of the network and plugs their results into the larger network. Karpathy said: “We have a team of roughly 20 people who train neural networks full-time. They all cooperate on a single neural network.”
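A simplified sketch of this kind of branching, multi-task design: per-camera features from a shared convolutional backbone are fused by a small transformer, and several task heads share the resulting trunk. The layer sizes, head names, and fusion details are illustrative assumptions, not Tesla’s actual architecture.

```python
import torch
import torch.nn as nn

class MultiCameraMultiTaskNet(nn.Module):
    """Branching multi-task sketch: a shared per-camera CNN, a transformer that
    fuses the eight camera features, and several task heads on the common trunk."""
    def __init__(self, num_cameras: int = 8, feat_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(               # shared per-camera feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fusion = nn.TransformerEncoder(          # fuses features across cameras
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.heads = nn.ModuleDict({                  # each team can own one head
            "lanes": nn.Linear(feat_dim, 16),
            "traffic_lights": nn.Linear(feat_dim, 8),
            "objects": nn.Linear(feat_dim, 32),
        })

    def forward(self, images):                        # images: (batch, cameras, 3, H, W)
        b, c, ch, h, w = images.shape
        feats = self.backbone(images.view(b * c, ch, h, w)).view(b, c, -1)
        fused = self.fusion(feats).mean(dim=1)        # one shared trunk representation
        return {name: head(fused) for name, head in self.heads.items()}

net = MultiCameraMultiTaskNet()
outputs = net(torch.randn(2, 8, 3, 64, 64))           # dict of per-task outputs
```

Because each head depends only on the fused trunk features, an output can be added or retrained without touching the shared backbone, which is what makes the kind of distributed, per-component development described above possible.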

Vertical integration

In his CVPR talk, Karpathy also shared details of the supercomputer Tesla uses to train and fine-tune its deep learning models. The compute cluster consists of 720 nodes, each with eight NVIDIA A100 GPUs carrying 80 GB of memory, for a total of 5,760 GPUs and more than 450 TB of GPU memory. The supercomputer also has 10 petabytes of NVMe ultra-fast storage and 640 Tbps of networking capacity connecting the nodes, which enables efficient distributed training of the neural networks.
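Those headline figures are consistent with one another, as a quick back-of-the-envelope check shows:

```python
# Sanity check of the quoted cluster figures (illustrative arithmetic only).
nodes, gpus_per_node, mem_per_gpu_gb = 720, 8, 80
total_gpus = nodes * gpus_per_node                 # 5,760 GPUs
total_mem_tb = total_gpus * mem_per_gpu_gb / 1000  # about 460 TB of GPU memory
print(total_gpus, total_mem_tb)                    # 5760 460.8
```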

Tesla also makes its own AI chips for its cars. Karpathy said: “These chips are specifically designed for the neural networks we want to run for full self-driving applications.”

Tesla’s biggest advantage is its vertical integration. The company owns the entire self-driving stack: it builds its own cars and the hardware for its self-driving features, and it is uniquely positioned to collect a wide variety of telemetry and video data from the millions of cars it has sold. It also creates and trains its neural networks on its proprietary datasets, and validates and fine-tunes them through shadow testing on its cars. And, of course, it has an outstanding team of machine learning engineers, researchers, and hardware designers to put all the pieces together.

Karpathy said: “You can design and engineer at every layer of the stack; there is no third party holding you back. You are in full control of your own destiny, which I think is incredible.”

This vertical integration, and the cycle of creating data, tuning machine learning models, and deploying them to many cars, gives Tesla an edge in pursuing vision-only self-driving. In his presentation, Karpathy showed several examples in which the new vision-only neural network outperformed the legacy ML model that worked in combination with radar information. He said that if the system continues to improve, it could eliminate the need for lidar altogether, and he believes no other company can replicate Tesla’s approach.

Unresolved problems

But questions remain, such as whether the current state of deep learning is enough to overcome all the challenges of autonomous driving. Object detection and velocity and range estimation certainly play a big role in driving. But human vision performs many other complex functions, which scientists call the “dark matter” of vision. These are all important components of the conscious and subconscious analysis of visual input and of navigating different environments.

Deep learning models also struggle to make causal inferences, which can be a huge obstacle when the models face new situations they have not seen before. So while Tesla has managed to create a very large and diverse dataset, open roads are highly complex environments where new and unpredicted situations can occur at any time.

The AI community is divided over whether causality and reasoning need to be explicitly built into deep neural networks, or whether the causal barrier can be overcome through “direct fitting”. Tesla’s vision-based self-driving team appears to favor the latter, but the technology will have to stand the test of time.