In this project, I experiment with Image Homography: projecting an image onto a plane surface other than the image plane. An alternative name for this is the Perspective Projection Transformation. Finally, I take four groups of three photographs (all at the same Center of Projection) and create an image mosaic (similar to a panorama taken on an iPhone) by registering, projective warping, resampling, and compositing them.
In the later parts of this project, we detect corner features in an image, extract a Feature Descriptor for each feature point, match these feature descriptors between two images, use a robust method (RANSAC) to compute a homography, and then proceed as in the first part to produce a mosaic.
In this step, I used my trusted iPhone to take the pictures. I decided to take four groups of three pictures each, which I will warp and mosaic together later in this report. The most important technique that I employed during my photo-taking endeavour was to make sure that the lens of my phone camera (serving as the center of projection) always remained fixed: I simply rotated the phone about an axis passing through the center of projection. The images of the Church were taken by my brother, who studies at Purdue. Here are the original images.
The next step involved calculating the homography matrices between pairs of images to recover the transformations required to warp one image into another. I implemented a function called computeH, which uses point correspondences between two images to compute a homography matrix.
The homography matrix was used to warp images into the same perspective, which allowed for the subsequent blending step to create a seamless mosaic.
To compute the homography matrix, we use a set of corresponding points between the source and destination images. Let's denote the source points as \( (x, y) \) and the destination points as \( (x', y') \). The goal is to find a transformation matrix \( H \) such that:
\[ \begin{bmatrix} w'x' \\ w'y' \\ w' \end{bmatrix} = H \cdot \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]
The homography matrix \( H \) is a 3x3 matrix that relates the coordinates in one image to the coordinates in the other image:
\[ H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \]
To find \( H \), we need to set up a system of linear equations using multiple point correspondences. For each pair of corresponding points, we get two equations:
\[ \begin{cases} ax + by + c = wx' \\ dx + ey + f = wy' \\ gx + hy + 1 = w \end{cases} \implies \begin{cases} ax + by + c = (gx + hy + 1) x' \\ dx + ey + f = (gx + hy + 1) y' \end{cases} \implies \begin{cases} ax + by + c - gx x' - hy x' = x' \\ dx + ey + f - gx y' - hy y' = y' \end{cases} \]
Rearranging these equations gives us a system of equations that can be represented in matrix form:
\[ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} \]
\[ A \mathbf{h} = \mathbf{b} \]
Where \( A \) is a matrix that depends on the source and destination coordinates, \( \mathbf{h} \) is a vector containing the unknown elements of the homography matrix, and \( \mathbf{b} \) is a vector of destination coordinates. Given enough correspondences, we can solve this system of equations for \( \mathbf{h} \) using a least squares approach.
To solve for the homography matrix elements, we need at least four pairs of corresponding points, which provides eight equations. More correspondences provide an overdetermined system, which we can solve using the least squares method. The least squares solution minimizes the error between the predicted destination coordinates and the actual coordinates.
The least squares solution can be found by solving the following equation:
\[ \mathbf{h} = (A^T A)^{-1} A^T \mathbf{b} \]
This equation provides the best fit for the homography matrix by minimizing the squared differences between the actual and projected points.
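As a concrete reference, here is a minimal sketch of how computeH might look in NumPy (the exact signature and array layout are my assumptions, not the code verbatim):

```python
import numpy as np

def computeH(src_pts, dst_pts):
    """Compute the homography mapping src_pts to dst_pts.

    Both inputs are (N, 2) arrays of (x, y) coordinates with N >= 4.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        # Two rows of the linear system per correspondence (see derivation above).
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    # Least-squares solution: equivalent to (A^T A)^{-1} A^T b, but more stable.
    h, *_ = np.linalg.lstsq(np.array(A, dtype=float), np.array(b, dtype=float), rcond=None)
    return np.append(h, 1).reshape(3, 3)  # fix the last entry to 1
```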
Using the computed homography matrix, I applied inverse warping to transform the images into a common perspective. The inverseWarpImage function was used to project an image onto the final canvas and align it properly. The results of the individual warped images were visualized as intermediate steps.
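Here is a simplified sketch of this step, with a hypothetical helper rather than my exact inverseWarpImage, using nearest-neighbor sampling for brevity (a real implementation would interpolate):

```python
import numpy as np

def inverse_warp(image, H, out_shape):
    """Warp `image` into an (out_shape[0], out_shape[1]) canvas by inverse mapping."""
    H_inv = np.linalg.inv(H)
    out = np.zeros(tuple(out_shape) + image.shape[2:], dtype=image.dtype)
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    # Send every output pixel back through H^{-1} to find its source location.
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = H_inv @ coords
    src /= src[2]                                   # homogeneous -> Cartesian
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    valid = (sx >= 0) & (sx < image.shape[1]) & (sy >= 0) & (sy < image.shape[0])
    out[ys.ravel()[valid], xs.ravel()[valid]] = image[sy[valid], sx[valid]]
    return out
```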
After aligning the images, the next step was to blend them together seamlessly. I used a distance transform-based blending technique to ensure smooth transitions between overlapping images. The distance transform values were normalized and used as blending weights, providing a feathered transition across the overlapping areas. Here is an example of the final mosaic.
The distance transform sets each pixel inside a region (here, inside a projected image's bounding box) to its distance to the nearest edge. I use Euclidean distance as the metric, and the final result is normalized to the range 0 to 1.
These are the normalized distance transforms for the images of the view from the patio.
One note on implementation detail: my algorithm for generating the canvas of the final stitched image was as follows. I first compute the eventual extent of the mosaic, which I call min_x_mosaic, min_y_mosaic, max_x_mosaic, and max_y_mosaic. I then place each warped image on a canvas of those dimensions, so all three images share the same final canvas. Taking a weighted average of three images with identical dimensions is then straightforward.
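Assuming each warped image already sits on the shared canvas with a binary mask of its valid pixels, the blending step might look like this sketch (scipy's distance_transform_edt computes the Euclidean distance transform):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def blend(warped_images, masks):
    """Feather-blend colour images that share the same canvas dimensions."""
    weights = []
    for mask in masks:
        d = distance_transform_edt(mask)            # distance to the nearest edge
        weights.append(d / d.max() if d.max() > 0 else d)  # normalize to [0, 1]
    weights = np.stack(weights)                     # (num_images, H, W)
    total = weights.sum(axis=0)
    total[total == 0] = 1                           # avoid division by zero
    stacked = np.stack(warped_images).astype(float) # (num_images, H, W, 3)
    # Weighted average across images, pixel by pixel.
    return (stacked * weights[..., None]).sum(axis=0) / total[..., None]
```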
One of the applications of homographies is image rectification. This involves transforming an image so that a particular plane becomes fronto-parallel, effectively correcting the perspective distortion. Below is an example where I rectified an image of a room.
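Rectification reuses the same machinery: pick the four corners of the planar surface in the photo and map them to a rectangle. A minimal sketch, using the computeH and inverse_warp sketches above and made-up corner coordinates:

```python
import numpy as np

# Four corners of the plane in the photo (hypothetical pixel values),
# listed in the same order as the target corners below.
src = np.array([[120, 80], [410, 95], [405, 330], [115, 310]], dtype=float)
# Target: a fronto-parallel rectangle with the desired aspect ratio.
dst = np.array([[0, 0], [300, 0], [300, 225], [0, 225]], dtype=float)

H = computeH(src, dst)
rectified = inverse_warp(image, H, (225, 300))  # `image` is the loaded photo
```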
In the first step to achieve automatic feature matching, we need to find interest points. The Harris Corner detector focuses on finding the "corners" of an image. Below is an example of the Harris interest points for two of the images of the view from my patio. Key points to note: I updated the provided Harris Corner Detection code by increasing the minimum distance between corners from 1 to 2 (so that the corners are not cluttered) and adding a threshold parameter to filter the corners based on the h value. This threshold acts like a hyperparameter and needs to be tuned for each image; for images of my patio, I chose a threshold of 100,000. In addition, all the image processing in this part and the ones ahead is done on grey-scaled versions of the images, as we do not need multiple channels to compute feature descriptors and correspondences.
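As an approximation of that modified detector, here is a sketch built on skimage (note that the scale of the Harris response, and hence the right threshold, depends on the implementation; 100,000 is what worked with my code on the patio images):

```python
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

def get_harris_corners(image, min_distance=2, threshold=100_000):
    """Harris corners, thinned by min_distance and thresholded on the h value."""
    gray = rgb2gray(image)
    h = corner_harris(gray)                      # Harris response map
    coords = peak_local_max(h, min_distance=min_distance)
    keep = h[coords[:, 0], coords[:, 1]] > threshold
    return h, coords[keep]
```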
Harris interest points are very clustered together, with a lot of redundant information. Adaptive non-maximal suppression (ANMS) is an algorithm to more evenly distribute the interest points around the image, so that feature matching is a lot more effective.
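A sketch of ANMS as described in the MOPS paper: each point's suppression radius is the distance to the nearest point that is "robustly" stronger (c_robust = 0.9 follows the paper; n_keep is a free parameter):

```python
import numpy as np

def anms(coords, strengths, n_keep=500, c_robust=0.9):
    """Keep the n_keep interest points with the largest suppression radii."""
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        # Points whose strength robustly exceeds this one's.
        stronger = strengths * c_robust > strengths[i]
        if stronger.any():
            dists = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = dists.min()
    keep = np.argsort(radii)[::-1][:n_keep]       # largest radii first
    return coords[keep]
```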
Once we have interest points selected for each of our images, we need a representation for them in order to accurately match them with their correspondences in the warped image space. To do this, I first build a Gaussian pyramid (not a stack) for each image, and then sample 40x40 patches centred at the interest points from Part 7, but at pyramid level 2 (with coordinates appropriately recalculated). For reference, the original image is at Level 0.
Once we have the 40x40 image patches, I subsample them (picking one in every 5 pixels) to get 8x8 patch representations. Finally, I normalize these patches by subtracting the mean pixel intensity and dividing by the standard deviation of the pixel intensities in each 8x8 patch. Here is an example of a feature descriptor for an image of the view from my patio.
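A sketch of the descriptor extraction, assuming skimage's pyramid_gaussian builds the pyramid (interest points whose 40x40 window would fall off the image are simply dropped here):

```python
import numpy as np
from skimage.transform import pyramid_gaussian

def extract_descriptors(gray, coords, level=2):
    """8x8 normalized descriptors from 40x40 patches at the given pyramid level."""
    img = list(pyramid_gaussian(gray, max_layer=level))[level]
    descriptors = []
    for r, c in coords // (2 ** level):          # rescale level-0 coordinates
        if 20 <= r < img.shape[0] - 20 and 20 <= c < img.shape[1] - 20:
            patch = img[r - 20:r + 20, c - 20:c + 20][::5, ::5]    # 40x40 -> 8x8
            patch = (patch - patch.mean()) / (patch.std() + 1e-8)  # bias/gain normalize
            descriptors.append(patch.ravel())
    return np.array(descriptors)
```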
With features extracted, the next step is to match features between two images to form correspondences. To do this, I use a nearest-neighbor approach. To find feature correspondences between, say, image 1 and image 2, I iterate through each feature descriptor in image 1 and compute the Euclidean distance between it and every feature descriptor in image 2. I only keep those matches whose 1st-NN to 2nd-NN distance ratio is less than 0.1, roughly following the suggestion from the MOPS paper.
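A sketch of the matching loop with the ratio test (brute-force distances; a k-d tree would be faster, but the logic is the same):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.1):
    """Return index pairs (i, j) whose 1-NN/2-NN distance ratio passes the test."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # distance to every descriptor
        nn1, nn2 = np.argsort(dists)[:2]            # two nearest neighbors
        if dists[nn1] < ratio * dists[nn2]:
            matches.append((i, nn1))
    return matches
```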
Although Lowe's trick does reduce the number of outliers, there will still inevitably be several incorrect pairs remaining. The issue with these incorrect pairs in the correspondence points is that least-squares is not very robust against outliers; a single pair of incorrect points can skew the least-squares result wildly, which also skews the computed homography.
To increase the robustness of least-squares when finding the desired homography, we can use RANSAC; we randomly sample 4 pairs of points, and compute the exact homography through these points, keeping track of the inliers (points that are correctly transformed by the homography) as we go. At the very end, we keep the homography that produces the most inliers. This ignores the correspondence points that do not follow the majority transformation, increasing the robustness against outliers. Below is an illustration of the result of performing RANSAC on the matched features. In my experience through this project, RANSAC almost always gave me accurate feature maps.
This could be because the RANSAC threshold I use is very small (0.8), so I am only left with accurate feature matches.
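A sketch of the RANSAC loop, reusing the computeH sketch from above (the iteration count is a free parameter; 0.8 is the inlier threshold mentioned above):

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=1000, eps=0.8):
    """4-point RANSAC: keep the homography that produces the most inliers."""
    best_inliers = np.array([], dtype=int)
    pts1_h = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coordinates
    for _ in range(n_iters):
        idx = np.random.choice(len(pts1), 4, replace=False)
        H = computeH(pts1[idx], pts2[idx])       # exact fit through 4 pairs
        proj = pts1_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]        # back to Cartesian coordinates
        errors = np.linalg.norm(proj - pts2, axis=1)
        inliers = np.where(errors < eps)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit with least squares on all inliers for the final homography.
    return computeH(pts1[best_inliers], pts2[best_inliers]), best_inliers
```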
Comparing manual results with automatic results. Please note: the only surprising result was the image of the church building. The RANSAC result is worse than the manual result, but both have a prominent edge seam. The rest of the results look good.
To make feature descriptors rotation-invariant, we align each 8x8 patch according to the dominant gradient direction at the interest point. Below, we outline the mathematical steps.
Given an interest point at coordinates \( (x, y) \), we calculate the gradient of the image intensity around this point using the Sobel operator. The Sobel operator provides partial derivatives of intensity in the x and y directions:
\[ \nabla_x = \frac{\partial I}{\partial x}, \quad \nabla_y = \frac{\partial I}{\partial y} \]
where \( \nabla_x \) and \( \nabla_y \) represent the rate of change of intensity along the x and y axes, respectively.
Using the partial derivatives \( \nabla_x \) and \( \nabla_y \), we compute the orientation \( \theta \) of the gradient at \( (x, y) \). The orientation is given by:
\[ \theta = \arctan\left(\frac{\nabla_y}{\nabla_x}\right) \]
where \( \theta \) represents the angle of the gradient with respect to the x-axis. In practice, the two-argument form \( \operatorname{atan2}(\nabla_y, \nabla_x) \) is used so that \( \theta \) covers the full range \( [-\pi, \pi] \).
To align the patch according to the orientation \( \theta \), we construct a 2D rotation matrix \( R(\theta) \). For a rotation by an angle \( \theta \), the rotation matrix is:
\[ R(\theta) = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \]
This matrix rotates a point \( (x, y) \) around the origin by angle \( \theta \), producing \( (x', y') \). In our case, we want to rotate the 40x40 patch around its center.
To rotate the patch, we first center it on the interest point. Then, we use an affine transformation with the rotation matrix \( R(\theta) \) to adjust the patch's orientation. The rotated patch is sampled as follows:
\[ \begin{bmatrix} x' \\ y' \end{bmatrix} = R(\theta) \begin{bmatrix} x - x_c \\ y - y_c \end{bmatrix} + \begin{bmatrix} x_c \\ y_c \end{bmatrix} \]
where \( (x_c, y_c) \) is the center of the patch, and \( (x', y') \) are the new coordinates after rotation.
After rotating the patch, we downsample it to 8x8 by taking every 5th pixel. This reduces the size and maintains the key structural information of the patch in its new orientation, ensuring it remains rotation-invariant.
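Putting these steps together, here is a sketch of extracting one rotation-aligned descriptor patch (scipy's sobel and rotate do the heavy lifting; the interest point is assumed to lie far enough from the image border, and the rotation sign depends on the row/column convention):

```python
import numpy as np
from scipy import ndimage

def rotation_invariant_patch(gray, r, c, half=20):
    """Extract an 8x8 descriptor from a 40x40 patch aligned to the local gradient."""
    gx = ndimage.sobel(gray, axis=1)                 # dI/dx
    gy = ndimage.sobel(gray, axis=0)                 # dI/dy
    theta = np.degrees(np.arctan2(gy[r, c], gx[r, c]))
    # Grab an oversized window so rotation does not expose empty corners.
    window = gray[r - 2 * half:r + 2 * half, c - 2 * half:c + 2 * half]
    rotated = ndimage.rotate(window, -theta, reshape=False)
    mid = rotated.shape[0] // 2
    patch = rotated[mid - half:mid + half, mid - half:mid + half]  # 40x40
    patch = patch[::5, ::5]                          # downsample to 8x8
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```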
When compared to the results of the automatic mosaic without rotation-invariant features, these results are marginally better. However, the manual point selection still seems to do better. The presence of the seam in all three cases could potentially mean that there was a slight shift in the centre of projection of the camera between image 2 and image 3 of the church. Still, rotation-invariant feature descriptors do bring a marginal improvement in image quality.
Using the power of homography to create a clone of my roommate!
Clearly, the coolest thing I learned was the patience to read a convoluted section of a research paper. It took me about 15 readings of the Adaptive Non-Maximal Suppression Section of the paper until I finally understood what was going on. This was a useful experience.