# Introduction to Pinholes
Pinhole cameras let light converge into a tiny hole (a pinhole!) and then diverge on the other side onto an image plane. In old cameras, this plane was the film. In digital cameras, it's a little more complicated, but the premise remains the same. This allows us to create this lovely diagram that forms the core of our vision systems:
# Interpreting the Triangles
Ok, so what the heck does that mean? This is a vertical side-view of a camera, where the intersection of the two dashed lines is the pinhole of the camera. The section of the horizontal dashed line on the camera's side (labelled as f) is the distance between the pinhole and the image plane.
The pinhole of the camera, where light converges, is known as the focal point, and the distance between it and the image plane is known as the focal length!
The image plane's height is marked as py, indicating pixel y-axis length, and it represents a single column of pixels in the captured images.
This diagram doesn't have to be a vertical side-view, though; we can simply rename the py line to px and it is now a horizontal side-view, where the px line represents a single row of pixels in the captured image.
Don't get confused: we can't actually say that py is the same as px. Renaming the line is just a simple way to show that the camera's horizontal side-view shares proportionality with its vertical side-view. When we put actual math into it, you'll have to consider the py and px sides separately.
# Applications of the Triangles
These triangles allow us to approximate the location of a target, provided that we can detect it with computer vision, and that we have some knowns about it. Let's draw in a few more things on the diagram:
That was a lot. Let's break it down part by part: the point labelled target is whatever we're targeting, for example a piece of retroreflective tape on the wall. target (in image) indicates the position of the pixel that shows the target in the camera. You'll notice that it's upside down on the diagram; this is a side-effect of the pinhole structure.
The image gets flipped right-side up before it ever passes through to the code. However, drawing it like this makes it more intuitive to look at. Just remember that the py line is upside down!
Next, the segment marked h is the height of the target relative to the camera (the vertical distance from the camera to the target). The segment marked d is the horizontal distance from the camera to the target. The segment marked ph (written p_h in the equations below) is the distance, in pixels, from the centre of the image to the pixel that the target appears on in the image.
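As a rough sketch of how ph might be computed in code: the snippet below assumes the image origin is at the top-left with rows increasing downward (as in most image libraries), and that the optical centre sits exactly in the middle of the image, which is only approximately true for a real camera. The function name and the 480-row image are hypothetical, chosen purely for illustration.

```python
def pixel_height_from_centre(target_row: float, image_height: int) -> float:
    """Signed vertical offset, in pixels, of the target from the image centre.

    Positive values mean the target appears above the centre of the image.
    Assumes row 0 is the top of the image and that the optical centre is
    the middle row (an idealisation, not a calibrated value).
    """
    centre_row = (image_height - 1) / 2.0
    return centre_row - target_row


# Hypothetical example: a target detected at row 39.5 of a 480-row image.
p_h = pixel_height_from_centre(39.5, 480)
print(p_h)  # 200.0
```

The sign convention here is a design choice: flipping the subtraction so that "up" is positive keeps ph and h pointing the same way, which matches the right-side-up image the code actually receives.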
So how do we use this diagram? Well, we can use the principle of similar triangles to solve for various unknowns: the triangle inside the camera is similar to the one outside the camera, so their corresponding sides are proportional. Here's what that looks like in an equation:
\frac{h}{p_h} = \frac{d}{f}
Since p_h and f are known (p_h because we can measure it in the image, and f because it is a constant intrinsic to the camera), we can solve for either d or h as long as we know the other one. For a real-world application, if we know that the retroreflective tape is 8 feet higher than the camera (remember that h is measured relative to the camera, not the floor), we can solve for how far away it is from the camera. For example's sake, let's say that its pixel height is 200 px, and the camera's focal length is 678 px. We can solve for d by cross-multiplying:
\begin{split} \frac{8 \textrm{ft}}{200 \textrm{px}} &= \frac{d}{678 \textrm{px}} \\ \frac{8 \textrm{ft} \times 678 \textrm{px}}{200 \textrm{px}} &= d \\ 27.12 \textrm{ft} &= d \end{split}
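The same cross-multiplication can be sketched in a few lines of Python. The function names below are hypothetical helpers rather than part of any library, and they assume an ideal pinhole camera with h and d in the same unit of length and p_h and f both in pixels.

```python
def distance_to_target(h: float, p_h: float, f: float) -> float:
    """Solve h / p_h = d / f for d (same length unit as h)."""
    if p_h == 0:
        # The target is level with the camera, so the triangle is
        # degenerate and the distance cannot be recovered this way.
        raise ValueError("p_h must be non-zero to solve for distance")
    return h * f / p_h


def height_of_target(d: float, p_h: float, f: float) -> float:
    """Solve h / p_h = d / f for h (same length unit as d)."""
    return d * p_h / f


# The worked example from the text: h = 8 ft, p_h = 200 px, f = 678 px.
d = distance_to_target(h=8.0, p_h=200.0, f=678.0)
print(round(d, 2))  # 27.12
```

Note that the pixel units cancel, so the result comes out in whatever length unit the known quantity was given in (feet here).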
It's as easy as that! No complicated trigonometry or linear algebra. Things get quite a bit harder when you know neither the distance nor the height, or when your target is more complex than a simple piece of retroreflective tape or an AprilTag that can be represented by a single point.