Community Computer Vision Course documentation

Introduction


Welcome to the Video and Video Processing unit. As you may have noticed, our course content so far has focused mainly on standard, static 2D images. Of course, the real world of Computer Vision has much more to offer. Video is one of the most widely used media today, thanks to applications such as social media, broadcasting, and surveillance cameras.

Given their importance in society and research, we also want to cover videos in this course. In this introductory chapter, you will learn some basic theory behind videos before taking a closer look at video processing.

Let’s go! 🤓

What is a Video?

An image is a two-dimensional (2D) representation of visual data. A video is a multimedia format that displays a sequence of such images, called frames.

Technically speaking, the frames are separate pictures. Storing and playing these frames sequentially at a conventional speed creates a video, giving the illusion of motion (just like a flipbook). Video is a popular and widely used medium for communicating information, entertainment, and conversation. Videos and photos are captured with image-acquisition equipment such as video cameras and smartphones.

Aspects of a Video

  • Resolution: The resolution of a video refers to the number of pixels in each frame, or equivalently, the size of each frame. It doesn’t need to be a standard size, but there are common sizes for video. Common video resolutions include HD (1280x720 pixels), Full HD (1920x1080 pixels), and Ultra HD or 4K (3840x2160 pixels). When a video is said to have a resolution of 1920x1080 pixels, it means each frame is 1920 pixels wide and 1080 pixels high. Higher-resolution videos contain more detail but also require more storage space and processing power.

  • Frame Rate: A video is composed of multiple separate frames, or images. To give the impression of motion, these frames are displayed quickly one after the other. The number of frames displayed per second is called the “frame rate.” Common frame rates include 24, 30, and 60 frames per second (fps), also expressed in hertz (Hz), the general unit of frequency. Higher frame rates result in smoother motion.

  • Bitrate: The amount of data used per second of audio and video is called the bitrate. Higher bitrates yield better quality, but streaming them requires more storage and bandwidth.

Bitrates for videos are commonly expressed in megabits per second (Mbps) or kilobits per second (kbps).

  • Codecs: Codecs, short for “compressor-decompressor”, are software or hardware components that compress and decompress digital media, reducing the size of media files to make them more manageable for storage and transmission while maintaining an acceptable level of quality. There are two main types of codecs: “lossless codecs” and “lossy codecs”. Lossless codecs compress data without any loss of quality, while lossy codecs achieve higher compression by discarding some of the data, resulting in a loss of quality.
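To get a feel for why codecs are essential, we can do a quick back-of-the-envelope calculation: multiplying resolution, frame rate, and bits per pixel gives the raw (uncompressed) bitrate of a video. The compressed bitrate used for comparison below is an assumed, illustrative figure, not a measurement.

```python
def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> int:
    """Raw bitrate in bits per second for uncompressed video.

    24 bits per pixel = 8 bits for each of the R, G, B channels.
    """
    return width * height * bits_per_pixel * fps

# Full HD (1920x1080) at 30 fps
raw = raw_bitrate_bps(1920, 1080, 30)
print(f"Raw Full HD @ 30 fps: {raw / 1e6:.0f} Mbps")  # 1493 Mbps

# A typical lossy-compressed stream at this resolution might use
# around 5 Mbps (an assumed figure for illustration).
compressed = 5_000_000
print(f"Compression ratio: roughly {raw / compressed:.0f}x")
```

Nearly 1.5 gigabits per second of raw data for a single Full HD stream shows why virtually all stored and streamed video is compressed with a lossy codec.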

In summary, a video is a dynamic multimedia format that combines a series of individual frames, audio, and often additional metadata. It is used in a wide range of applications and can be tailored for different purposes, whether for entertainment, education, communication, or analysis.
