Glossary

Computer Vision

Computer Vision is the automatic extraction of meaningful information from images and video, and it has been one of the most visibly successful application areas of deep learning. Core tasks include image classification (assigning a category to a whole image), object detection (locating and classifying objects with bounding boxes), semantic segmentation (labelling every pixel with a class), instance segmentation (separating individual object instances, even of the same class), pose estimation, optical flow, depth estimation, and 3D reconstruction.
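The difference between several of these tasks is easiest to see in the shape of their outputs. The following is a minimal sketch in plain Python, assuming a made-up 4x6 toy annotation; the masks, class ids, and helper names (`boxes_from_instances`, `image_level_classes`) are hypothetical and chosen only for illustration, not taken from any dataset or library.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy annotation (illustrative data only).
# Semantic segmentation: one class id per pixel (0 = background, 1 = "cat").
semantic_mask: List[List[int]] = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# Instance segmentation: one object id per pixel (0 = background).
# Both objects belong to the same class but receive different ids.
instance_mask: List[List[int]] = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def boxes_from_instances(mask: List[List[int]]) -> Dict[int, Box]:
    """Object-detection-style output: the tightest box around each instance."""
    boxes: Dict[int, Box] = {}
    for y, row in enumerate(mask):
        for x, obj_id in enumerate(row):
            if obj_id == 0:
                continue  # skip background pixels
            if obj_id not in boxes:
                boxes[obj_id] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[obj_id]
                boxes[obj_id] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def image_level_classes(mask: List[List[int]]) -> Set[int]:
    """Classification-style output: which non-background classes appear at all."""
    return {class_id for row in mask for class_id in row if class_id != 0}


if __name__ == "__main__":
    print(image_level_classes(semantic_mask))   # image-level label set
    print(boxes_from_instances(instance_mask))  # one box per object instance
```

In this toy case the semantic mask cannot tell the two objects apart, because both carry class id 1, whereas the instance mask can. The bounding boxes are a coarser summary of the same instances, and the image-level class set discards location entirely.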

The modern era began with AlexNet's victory in the 2012 ImageNet Large Scale Visual Recognition Challenge, which established deep convolutional networks as the dominant approach. The subsequent decade saw rapid architectural innovation (VGG, GoogLeNet, ResNet, DenseNet, EfficientNet) and the emergence of large-scale pretrained models that transfer across tasks. More recently, Vision Transformers have challenged the long dominance of CNNs, and hybrid architectures combining convolution with attention are now among the state-of-the-art models.

Computer vision powers a wide range of applications: medical imaging (tumour detection, radiograph analysis, surgical planning), autonomous driving (perception from cameras, lidar, radar), security (facial recognition, anomaly detection), agriculture (crop monitoring, pest detection), manufacturing (quality inspection, defect detection), and content creation (generative image synthesis, video generation, visual effects). Generative vision models like DALL·E, Stable Diffusion, and Midjourney have made text-to-image generation a mainstream capability. Each application raises its own engineering and ethical challenges, including robustness under distribution shift, privacy, and bias in high-stakes domains like hiring and law enforcement.

Related terms: Convolutional Neural Network, Object Detection, Semantic Segmentation, Vision Transformer

Also defined in: Textbook of AI