A 10,000 Foot View of Computer Vision
This is a living document giving a very high level (breadth focused) view of computer vision, written in “notes” style, which means it might have bad grammar/spelling. Also, this document is not suitable for computer vision beginners; it is good for people already familiar with computer vision who want a high level summary/picture.
image representation
- 2d array/matrix of pixel values or function of 2 variables
- color images can be thought of as 3 functions/matrices (one for each channel)
- or they can be thought of as a vector valued function of two variables (output is a vector of 3 components, one for each channel)
- color images can be converted to black/white
- just take one of the channels
- “combine” 2 or more channels (some definition of combine - average? weighted average?)
- better definition of combine: add the 3 channels, then divide by sum(max possible value for each channel)
- image range becomes 0 to 1 now (see the sketch after this section)
- range of pixel values
- some maximum possible value
- some minimum possible value
- maximum possible value represents (visually) the whitest white
- minimum possible value represents (visually) the blackest black
- values between the maximum possible and minimum possible represent some shade of gray
- e.g. your range can be 0 to 255, 0 to 1, -1 to 1, etc
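A minimal numpy sketch of the “combine” definition above (the random image is just to keep it runnable):

```python
import numpy as np

# assume an 8-bit RGB image: shape (H, W, 3), each channel in 0..255
rgb = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# add the 3 channels, then divide by sum(max possible value for each channel)
gray = rgb.sum(axis=2) / (255 + 255 + 255)

print(gray.min(), gray.max())  # the range is now 0 to 1
```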
normalization
- you have a set of numbers, you’d like them to be in some other range
- for example, you want them to be in the range 0 to 1
- or you’d like them to be in the range 0 to 10
- or 4 to 16
- this is all possible
- to make them 0 to 1 (min-max normalization): subtract the minimum from each value, then divide each result by (max - min)
- to make them 0 to 16: do the 0 to 1 normalization, then multiply each value by 16
- to make them 4 to 16: do the 0 to 1 normalization, then multiply each value by (16 - 4) and add 4
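A quick numpy sketch of the min-max normalization above (the function and variable names are my own):

```python
import numpy as np

def normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalize values into the range [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    zero_to_one = (values - values.min()) / (values.max() - values.min())
    return zero_to_one * (new_max - new_min) + new_min

print(normalize([2, 5, 8]))         # [0.  0.5 1. ]
print(normalize([2, 5, 8], 4, 16))  # [ 4. 10. 16.]
```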
filtering (convolution/cross correlation)
- essentially the same thing: convolution flips the kernel in both directions, so convolving with a kernel gives the same result as cross correlating with the flipped kernel
- represents the coefficients or weights of neighboring pixels to use to determine the center pixel
- linear operations
- $$ f(a + b) = f(a) + f(b) $$, aka additivity
- $$ f(ka) = kf(a) $$, aka multiplicative scalability
- aka filter, mask, kernel
- for smoothing filters, values have to be normalized (add up to 1) so the overall image brightness is preserved
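A small scipy sketch of the flip relationship above (the kernel is an arbitrary normalized example):

```python
import numpy as np
from scipy.ndimage import convolve, correlate

image = np.random.rand(5, 5)
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])  # weights add up to 1

# convolution == cross correlation with the kernel flipped in both directions
a = convolve(image, kernel)
b = correlate(image, kernel[::-1, ::-1])
print(np.allclose(a, b))  # True
```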
adding noise
- to add salt and pepper noise, randomly select locations in the image and set them to the maximum (white) or minimum (black) value
- to add gaussian noise:
- for each pixel, offset it by a normally distributed random amount
- this is the same as taking a 2d array the same size as the image, populating it with i.i.d normally distributed values (mean being 0), then adding the array to the image
- the higher the standard deviation for your randomly generated offset, the more noise you’re adding
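A numpy sketch of both noise models (the noise fraction and sigma are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng()
image = np.full((100, 100), 0.5)  # flat gray float image, range 0 to 1

# salt and pepper: randomly pick ~5% of locations, set each to white or black
noisy_sp = image.copy()
mask = rng.random(image.shape) < 0.05
noisy_sp[mask] = rng.choice([0.0, 1.0], size=mask.sum())

# gaussian: offset every pixel by an i.i.d. zero-mean normal amount;
# a higher sigma (standard deviation) means more noise
noisy_gauss = image + rng.normal(loc=0.0, scale=0.1, size=image.shape)
```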
removing noise
- to remove salt and pepper noise: median filtering
- the center pixel should be the median of all the neighbors
- this is not a linear operation because median is not linear
- to remove gaussian noise
- option 1 (bad) - box filtering
- make center pixel an average of neighbors (but all neighbors get equal weight)
- bad because center pixel more closely resembles nearer neighbors, but we give all neighbors equal weight!
- option 2 - gaussian filter
- give closer neighbors higher weights
- can generate the weights by evaluating the 2d gaussian formula at each position in the kernel
- area under the gaussian is always 1, which means the weights do add up to 1!
- by decreasing the $$ \sigma $$, you are concentrating the values to be mostly in the center (which means you are smoothing a narrower area)
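A scipy sketch of both denoising options (the filter size and sigma are arbitrary):

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

rng = np.random.default_rng()
noisy = np.full((100, 100), 0.5) + rng.normal(0.0, 0.1, size=(100, 100))

# for salt and pepper: center pixel becomes the median of its
# neighborhood (not a linear operation)
denoised_median = median_filter(noisy, size=3)

# option 2 for gaussian noise: closer neighbors get higher weights and the
# weights sum to 1; a smaller sigma smooths a narrower area
denoised_gauss = gaussian_filter(noisy, sigma=1.0)
```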
template matching
- aka normalized cross correlation
- your filter (i.e. mask, i.e. kernel) is or resembles some sub region of the image
- normalize your filter
- definition of normalize used here:
- make the standard deviation of the filter 1
- how? find the current standard deviation and divide every value by it. Let’s say it is 0.5: multiplying all values by 2 makes the new standard deviation 1 (scaling all values by k scales the standard deviation by k, so this logic checks out)
- in practice you usually also subtract the filter’s mean first, so the filter is zero-mean
- normalize the region of the image under the filter (using same definition of normalize as above)
- in the output image (formed by cross correlating the normalized filter with the input image, with each window under the filter normalized as well), the brightest pixels represent locations where the template best matches regions in the original image
- as long as your filter has the same scale, orientation, and general intensities as the pattern you are looking for, you will find it using template matching (i.e. your filter doesn’t have to EXACTLY match a region of the image)
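A sketch using OpenCV’s built-in version (TM_CCOEFF_NORMED subtracts the mean and normalizes, matching the definition above; the random scene is just to keep it runnable):

```python
import cv2
import numpy as np

image = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in scene
template = image[40:60, 40:60].copy()                      # crop a sub region

# each output pixel scores how well the normalized template matches the
# normalized window at that location; the brightest pixel is the best match
scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)
print("best match at", max_loc, "score", max_val)  # (40, 40), score ~1.0
```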
image gradients
- an image can be visualized as a surface
- getting the gradient means, for each pixel, finding the direction in which the biggest intensity change occurs (and how big that change is)
- the gradient value for a particular pixel is a vector, it has both a direction and a magnitude
- the gradient magnitude represents the overall amount of change at this pixel location
- the gradient direction means the direction at which this change occurs (it can occur only in the x direction, only in the y, or some combination)
- a noisy image produces a noisy gradient (i.e. taking the derivative of a noisy function produces a noisy result)
- smooth image first
- more efficient: add smoothing to your derivative kernel (takes advantage of the associative property of convolution and the fact that the derivative is linear)
- the sobel operator is a kernel that produces the gradient image of an image
- sobel_x produces the x gradient image (the gradient image only taking the gradient in the x direction)
- sobel_y produces the y gradient image (the gradient image only taking the gradient in the y direction)
- you have the x and y components; now you can calculate the magnitude $$ \sqrt{g_x^2 + g_y^2} $$ and the direction $$ \operatorname{atan2}(g_y, g_x) $$
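A scipy sketch of the full gradient pipeline (the random input image is just to keep it runnable):

```python
import numpy as np
from scipy.ndimage import sobel

image = np.random.rand(64, 64)

gx = sobel(image, axis=1)  # x gradient image (changes along columns)
gy = sobel(image, axis=0)  # y gradient image (changes along rows)

magnitude = np.hypot(gx, gy)    # sqrt(gx^2 + gy^2)
direction = np.arctan2(gy, gx)  # angle of steepest intensity change
```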
edge detection
- canny
- apply smoothing to first derivative kernel, apply kernel to image to produce derivative image
- so you’ll have 2 first derivative images:
- one for x dir, one for y
- take the magnitude of the derivative (i.e. $$ \sqrt{g_x^2 + g_y^2} $$)
- threshold the magnitude of derivatives image (so all magnitudes smaller than some amount disappear)
- thin (i.e. non maximal suppression)
- high threshold to produce an image with only strong edge pixels
- low threshold to get rid of all super weak pixels
- all pixels in between will be considered for converting to “strong pixels”
- for each of these pixels, look at neighbors, if there is at least 1 neighboring strong pixel, convert to strong pixel
- laplacian
- apply gaussian to second derivative kernel
- take the zero crossings (one side positive, the other negative); these are your edges
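OpenCV bundles the whole canny pipeline above into one call (the thresholds here are arbitrary; the stand-in image is just to keep it runnable):

```python
import cv2
import numpy as np

image = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in image

# threshold1/threshold2 are the low/high hysteresis thresholds: pixels above
# the high one become strong edges; pixels in between survive only if
# connected to a strong edge
edges = cv2.Canny(image, threshold1=50, threshold2=150)
```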
finding lines
- hough lines
- every edge pixel can be part of some line
- let every edge pixel vote for what lines it can be a part of
- keep the lines that receive the most votes
- more details
- every point in the image is a line in hough space
- because there are many m,b’s (lines) that can pass through that point
- if you have two points in the image, you get two lines in hough space
- intersection of these lines is an m,b that satisfies both points in the image
- i.e. this m,b line goes through both points in the image
- even more details
- representing lines as m,b presents a problem because the slope (m) would have to be infinite for vertical lines
- we can use a polar scheme to represent lines (theta, r)
- theta is the angle of the line’s normal
- r is the perpendicular distance from the origin to the line
- the line is perpendicular to the r direction, giving the line equation $$ x\cos\theta + y\sin\theta = r $$
- algorithm now becomes
- initialize hough space table (2d table of theta,r pairs) to all 0 votes
- for every edge pixel
- for every theta
- compute $$ r = x\cos\theta + y\sin\theta $$; this is the one line at this angle that passes through the edge pixel (x,y), so increment the vote for that (r, theta)
- (cheaper than also looping over every r and checking which r,theta pairs satisfy the line equation)
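A minimal numpy sketch of that voting loop (the bin resolution and the fake edge image are arbitrary choices, just to keep it runnable):

```python
import numpy as np

edges = np.zeros((64, 64), dtype=bool)
edges[:, 32] = True  # a fake vertical edge at x = 32, just to have votes

thetas = np.deg2rad(np.arange(-90, 90))  # one theta bin per degree
max_r = int(np.hypot(*edges.shape))      # longest possible distance r
accumulator = np.zeros((2 * max_r, len(thetas)), dtype=int)

ys, xs = np.nonzero(edges)
for x, y in zip(xs, ys):
    for t_idx, theta in enumerate(thetas):
        # the one r at this theta whose line passes through (x, y)
        r = int(round(x * np.cos(theta) + y * np.sin(theta)))
        accumulator[r + max_r, t_idx] += 1  # offset: r can be negative

# the (r, theta) bins with the most votes are the detected lines
r_idx, t_idx = np.unravel_index(accumulator.argmax(), accumulator.shape)
print("strongest line: r =", r_idx - max_r,
      "theta =", np.rad2deg(thetas[t_idx]), "degrees")
```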
TODO finish