Jitendra Malik: Computer Vision | Lex Fridman Podcast #110 | Summary and Q&A

56.8K views
July 21, 2020
by Lex Fridman Podcast

TL;DR

Computer vision is a complex field that is often underestimated due to the effortless nature of human vision, but it poses significant challenges that require a deeper understanding of perception and cognition.


Key Insights

  • Computer vision is often underestimated because human vision feels effortless, but a closer look at the complexity behind it reveals significant challenges.
  • Vision involves both bottom-up (sensory) and top-down (cognitive) processes, and the interplay between the two is crucial for accurate perception.
  • Autonomous driving is a particularly challenging vision task that requires not only vision but also control and cognitive reasoning, making full autonomy a difficult goal to achieve.
  • Learning the way children learn, through multimodal and incremental learning, can provide valuable insights for machine vision.
  • Integrating vision with other modalities, such as language and touch, can enhance learning and provide stronger supervision signals.
  • Language, although a challenging aspect of intelligence, builds on the spatial intelligence developed through vision and physical exploration.
  • Simulations and physical robots can enable more interactive learning experiences, bridging the gap between perception and action.

Transcript

The following is a conversation with Jitendra Malik, a professor at Berkeley and one of the seminal figures in the field of computer vision, the kind before the deep learning revolution and the kind after. He has been cited over 180,000 times and has mentored many world-class researchers in computer science. Quick summary of the ads: two sponsors...

Questions & Answers

Q: Why is computer vision often underestimated?

Computer vision is underestimated because human vision appears effortless and is performed at a subconscious level, leading to the misconception that it is easy to implement on a computer. However, a deeper understanding of the complexity of vision from a neuroscience and psychology perspective reveals the challenges involved.

Q: What is the difference between bottom-up and top-down processes in computer vision?

Bottom-up processes refer to the sensory input of visual information that is processed from raw pixels to higher-level features. Top-down processes, on the other hand, involve cognitive feedback and prior knowledge to guide and influence perception. Both processes are necessary for accurate vision.
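
As a rough sketch of that interplay (an illustration of mine, not something from the conversation), bottom-up evidence can be fused with a top-down prior via Bayes' rule; the class names and probabilities below are hypothetical:

```python
import numpy as np

# Hypothetical classes and numbers, purely for illustration.
labels = ["cat", "dog", "car"]

# Bottom-up: per-class evidence computed from raw pixels alone,
# e.g. the output of a feedforward classifier.
bottom_up = np.array([0.30, 0.25, 0.45])

# Top-down: prior knowledge from context ("this is a living room"),
# which makes "car" far less plausible than the pixels suggest.
top_down = np.array([0.45, 0.45, 0.10])

# Fuse the two streams: posterior is proportional to likelihood x prior.
posterior = bottom_up * top_down
posterior /= posterior.sum()

for label, p in zip(labels, posterior):
    print(f"{label}: {p:.3f}")
# The pixel evidence favored "car"; context flips the decision to "cat".
```

The numbers are arbitrary, but the division of labor is the point: sensory evidence flows up, contextual expectations flow down, and the decision comes from their product.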

Q: How important is cognitive reasoning in computer vision tasks like autonomous driving?

Cognitive reasoning is crucial in certain computer vision tasks, autonomous driving among them. Some driving tasks can be handled by lower-level vision processes, but complex situations and edge cases call for cognitive reasoning. Full autonomy therefore requires sophisticated cognitive capabilities, which is why it remains difficult to achieve in the near future.

Q: How does the human brain learn visual perception differently from current artificial vision systems?

Human learning involves a combination of visual experience, manipulation of objects, and cognitive processes. In contrast, current artificial vision systems rely heavily on supervised learning from large datasets. To achieve human-like learning, we need to develop learning models that imitate the human brain's ability to integrate different aspects of learning, including perception, action, and cognition.
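
For contrast, here is a minimal sketch of what "supervised learning from large datasets" means in practice (synthetic data, not any particular system): the only learning signal is a human-provided label per image, with no action, manipulation, or exploration in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a labeled dataset: tiny flattened "images" paired with
# human-provided class labels. Real systems use millions of such pairs.
images = rng.normal(size=(1000, 64))      # 1,000 images of 8x8 pixels
labels = rng.integers(0, 3, size=1000)    # 3 hypothetical classes

weights = np.zeros((64, 3))
for step in range(200):
    # Forward pass: class scores, then softmax probabilities.
    scores = images @ weights
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)

    # Gradient of cross-entropy: the label is the entire supervision.
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    weights -= 0.1 * images.T @ grad / len(labels)

loss = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(f"cross-entropy after training: {loss:.3f}")
```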

More Insights

  • The combination of bottom-up and top-down processes in computer vision, rather than treating them as separate tasks, can lead to more robust and efficient systems.

Summary

In this conversation, Jitendra Malik, a professor at Berkeley and a renowned figure in computer vision, discusses the challenges and misconceptions in the field. He explains that computer vision often appears easy because most of what humans do in vision is unconscious or subconscious. However, when we examine the complexity of human vision from a neuroscience or psychology perspective, it becomes clear that the problem is challenging. Malik also addresses the difficulty of autonomous driving and the role of perception and cognition in vision tasks. He suggests that learning from a child-like perspective and incorporating knowledge and reasoning are essential for advancing computer vision.

Questions & Answers

Q: Why do we underestimate the difficulty of computer vision?

Most of what we do in vision happens subconsciously or unconsciously, giving the impression that it should be easy to implement on a computer. From a neuroscience or psychology perspective, however, the complexity of human vision becomes evident: a large share of the cerebral cortex in humans and other primates is devoted to visual processing, which indicates that the problem is hard.

Q: Why does the computer vision community often underestimate the difficulty of the problem?

In the early days of AI, nearly every aspect of the field was regarded as easy. That misconception was excusable then; it is less excusable today. One reason people still fall for it is the "fallacy of the successful first step": some vision problems can be solved to a certain level of accuracy quite quickly, but progressing beyond that level requires far more time and effort.

Q: Is language processing seen as a more challenging problem than computer vision?

Yes, language processing is generally perceived as more challenging than computer vision. Natural language understanding remains difficult, while vision is often regarded as easier because so much of it appears to be handled by early, peripheral processing. Humans intuitively grasp the complexity of language understanding but tend to underestimate the level of understanding that vision requires.

Q: How much understanding is required to solve vision problems?

Vision operates at various levels, and challenges exist at all of them. While lower and mid-level tasks are better suited for current techniques, higher-level cognitive tasks demand deeper understanding. Depending on the application, the level of understanding needed can differ. For tasks like image search, where some tolerance for errors exists, the requirements are lower. However, for highly critical tasks like autonomous driving, more sophisticated cognitive reasoning may be necessary.

Q: Can driving be converted into a purely vision problem and solved through learning?

Certain subsets of vision-based driving tasks, like driving in freeway conditions, are generally solvable. However, achieving full autonomy under all driving conditions is challenging due to the need for control and addressing edge cases. While vision plays a crucial role in driving, factors like interaction with the environment, predicting others' behavior, and cognitive reasoning go beyond just vision.

Q: What do you think about Tesla's approach to autopilot and vision-based driving?

Tesla's Autopilot system relies mainly on vision, with a single neural network trained to perform multiple tasks. Driving in certain conditions is relatively solvable, but full autonomy across all driving conditions involves challenges beyond perception: handling control, and the edge cases that demand more sophisticated cognitive reasoning, remains a hurdle.

Q: Is vision a fundamental problem in autonomous driving, or are action and interaction equally important?

Vision is a fundamental problem in autonomous driving, but action and interaction with the environment play significant roles as well. Perception and cognition, which include building predictive models of other agents and understanding their behavior, are crucial for driving. However, achieving full autonomy requires integrating perception, cognition, control, and addressing various levels of difficulty in both vision and interaction.

Q: How does computer vision relate to action and guiding behavior?

Computer vision should be connected to action, as perception alone doesn't hold much value unless it is coupled with action. Just like in biological systems, where perception guides action, computer vision should serve the purpose of guiding behavior. Perception and action are inherently connected, and vision's role is to help make informed decisions and interact effectively with the environment.

Q: What should be the benchmark or test for computer vision that mimics a child's development?

Currently, there aren't enough benchmarks or tests that mimic a child's development in computer vision. The challenge lies in collecting the right kind of data that a child encounters while growing up, both linguistically and visually. This can help develop learning schemes based on these experiences, enabling computer vision systems to learn like a child. However, privacy concerns should be considered when collecting such data.

Q: How much can we learn from early vision and image statistics?

Early vision, focused on image statistics, has taught us that there's a lot of redundancy in images, allowing for significant compression. Understanding image statistics helps compress images and videos more effectively. Some companies are using neural network techniques to enhance image compression. By studying image statistics further, researchers can continue improving compression techniques.
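
As a minimal demonstration of that redundancy (my own sketch, with a synthetic image standing in for real data): spatially smooth pixel data compresses far better than uncorrelated noise, even under a generic byte-level compressor.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# A smooth synthetic "image": neighboring pixels are highly
# correlated, as they are in natural images.
x = np.linspace(0, 255, 256)
smooth = (np.add.outer(x, x) / 2).astype(np.uint8)

# Pure pixel noise: no spatial correlation, hence little redundancy.
noise = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)

for name, img in [("smooth image", smooth), ("pixel noise ", noise)]:
    raw = img.tobytes()
    packed = zlib.compress(raw, level=9)
    print(f"{name}: {len(raw):6d} -> {len(packed):6d} bytes "
          f"({len(raw) / len(packed):5.1f}x)")
```

Learned codecs push further by modeling higher-order image statistics that a generic byte-level compressor cannot see.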

Q: Does simulation have the potential to simulate the principles of existing in the world as a physical being?

Yes, simulation has great potential here. The computer graphics community has made significant progress in creating realistic simulations, not only visually but also in terms of physical interaction. Computational challenges remain, but advances in simulators are bringing us closer to accurately reproducing physical interaction, which is crucial for building interactive computer vision systems.

Q: Is active learning important for advancing computer vision, and can it help break the correlation versus causation barrier?

Active learning, where agents interact with the world and conduct controlled experiments, is crucial for advancing computer vision. It helps break the correlation-versus-causation barrier by letting agents build causal models: through physical embodiment or simulation environments, an agent can learn causal relationships and refine its models from the experience gained by acting on the world.
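
A toy sketch of the correlation-versus-causation point (my construction, not from the episode): in purely observational data, a hidden confounder can make an action look harmful, while intervening, i.e. choosing the action at random as an embodied agent can, recovers its true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder, e.g. "scene difficulty": hard scenes both trigger
# the action more often and hurt the outcome on their own.
difficulty = rng.random(n)

def outcome(action):
    # Ground truth: the action truly helps (+0.5); difficulty hurts.
    return 0.5 * action - 2.0 * difficulty + rng.normal(0, 0.1, n)

# Observational data: the agent happens to act mostly in hard scenes.
act_obs = (rng.random(n) < difficulty).astype(float)
y_obs = outcome(act_obs)
naive = y_obs[act_obs == 1].mean() - y_obs[act_obs == 0].mean()

# Interventional data: the action is chosen at random (an experiment),
# severing the link between difficulty and action.
act_int = rng.integers(0, 2, n).astype(float)
y_int = outcome(act_int)
causal = y_int[act_int == 1].mean() - y_int[act_int == 0].mean()

print(f"observational estimate: {naive:+.2f}")    # biased: looks harmful
print(f"interventional estimate: {causal:+.2f}")  # ~ +0.50, the true effect
```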

Takeaways

Computer vision is a challenging field that is often underestimated because human vision is subconscious and effortless. Advancing it requires integrating perception, cognition, and action, with a focus on understanding real-world dynamics. Learning from a child-like perspective and incorporating knowledge, reasoning, and active learning, through physical embodiment and simulation, can lead to more robust computer vision systems. Simulation environments have the potential to accurately reproduce what it is like to exist in the world as a physical being, contributing to the development of interactive computer vision systems.

