In working on projects and advocacy to help protect our privacy from surveillance, I realized pretty quickly it would help me to learn more about the fundamentals of the technology that underpins some of the fastest-growing areas of AI’s effect on our rights: computer vision, or how computers see.
I’ve decided that as I work to learn more about how computers see, I want to share what I find that is new to me, along with familiar concepts I didn’t realize played such an important role in such a little-understood technology. If you’re like me and you care about these issues, I think knowing more about how this technology actually works, instead of treating these systems as an all-powerful, omniscient black box, can help us be both practical and tactical in our work to combat surveillance.
This piece is going to restate some things that are fairly obvious to people who work with artificial intelligence or computer vision for a living. But as I worked to become literate in these concepts so I could find exploits for them, I realized that almost all of the foundations of this technology were easier to understand than I assumed they would be.
Before working hands-on with computer vision myself, I presumed such things were the domain of technological wizardry that requires a PhD in computer science to understand. It turns out the workings of these systems are simple enough that I could not only understand them with no such degree, but develop an understanding deep enough to inform my own point of view on questions around how to regulate these technologies to protect our privacy.
1. Computer vision is functionally just math.
Science fiction has constructed in our cultural imagination images of artificial intelligence, supercomputers, and androids with very human-like eyes, maybe seen through the kind of cameras we use in our daily lives on our phones or in our homes. We imagine the output of these cameras as an image processed as a whole scene with context, as it appears to us on our computer screens, like a photograph or video. That is, after all, how we humans see.
As I’ll get into in a later post, how we see is actually more interesting and complicated than even this “camera” and “photograph” analogy suggests. But regardless of our “looking through a camera” style of human experience, computers actually “see” pixel by pixel, row by row, experiencing the whole picture as one long list of numbers.
How do computers do this when pixels appear to us as colors, and numbers are numbers? Well, each pixel on, say, our computer screen is a very small electronic cell with a tiny bunch of colored lights packed together in a set. This might be 3 component colors for something called RGB, or Red Green Blue, screens. It might be more if the colors instead follow a schema like CMYK: Cyan, Magenta, Yellow, Key (black).
This is what a close-up of an RGB screen often looks like:
As we can see, these pixels are each made of 3 lights: red, green, and blue. So when we want to tell the screen which lights to keep lit, a file in the computer holds a list of values, typically 8 bits per light, that tell which lights have to be on and how brightly, to create the impression of a particular color on the color wheel. These lists of numbers also tell each pixel when to light up or not over time, and in what order.
When we look at the screen from a sufficient distance, our human eyes combine those 3 or more components into something closer to the color we expect to see, like orange or blue.
Pretty neat optical illusion, right? But since computers do not have biological eyes, they don’t get to enjoy the fruits of their light-show labor. They (for now) only “see” those 8-bit numeric values for each pixel in a given picture, like the pixels that make up this pigeon:
A computer knows a picture of a pigeon by the numeric value for the color and intensity of the very first pixel, then the pixel next to it, then the pixel next to it, and so on through each row after, until all of the values have been read off into one long list of numbers. When I give a computer a picture of, say, this pigeon to look at, I have really given it a list of numbers with the word “pigeon” on it.
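To make that concrete, here is a minimal sketch in Python of what the computer actually receives. It assumes the Pillow library and a hypothetical file called pigeon.jpg; the sizes and values in the comments are just illustrative.

```python
# A minimal sketch, assuming the Pillow library and a hypothetical "pigeon.jpg":
# to the computer, the picture is just its pixel values read off row by row.
from PIL import Image

img = Image.open("pigeon.jpg").convert("RGB")
pixels = list(img.getdata())   # one (R, G, B) tuple per pixel, row by row

print(img.size)      # e.g. (640, 480) -- width and height in pixels
print(pixels[:3])    # e.g. [(142, 137, 129), (141, 136, 128), (139, 135, 126)]
print(len(pixels))   # e.g. 640 * 480 = 307,200 entries -- and that list *is* the "pigeon"
```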
Unfortunately for computers, a different image of the same pigeon wearing a bread necklace will have different numbers in the parts of the list that represent the piece of bread, making things a bit more complicated for them. This second list will be different from the first even though it is, as I have again told the computer, also a picture of a pigeon.
If a computer wants to compare these two pictures, it can do some surprisingly basic math (our good old friends addition and subtraction) to say: are the contents of these two lists the same? Certainly in this case they will not be, as there is a very bread-shaped difference between the numbers. And so the computer says no, these are not the exact same image.
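Here is a small illustration of that comparison, with two made-up 2x2 “images” standing in for our pigeon photos. Real photos have millions of pixels, but the arithmetic is the same: subtract, and check whether anything is left over.

```python
# A minimal sketch: two tiny 2x2 "images", each pixel an (R, G, B) value from 0-255.
# The pixel values here are invented for illustration.
pigeon            = [(200, 180, 170), (90, 90, 100),   (60, 55, 50), (230, 225, 220)]
pigeon_with_bread = [(200, 180, 170), (240, 220, 160), (60, 55, 50), (230, 225, 220)]

# "Are these the exact same image?" is just subtraction, pixel by pixel.
differences = [
    sum(abs(a - b) for a, b in zip(p1, p2))
    for p1, p2 in zip(pigeon, pigeon_with_bread)
]

print(differences)                       # [0, 340, 0, 0] -- a very bread-shaped difference
print(all(d == 0 for d in differences))  # False: not the exact same image
```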
But what if despite being different pictures, and two different lists, I want the computer to tell me whether they both might still contain a pigeon despite the addition of bread? How does that work?
2. The math used in image recognition can be done without a computer.
Computer vision is very dependent on the notion of what a computer “expects to see” based on its prior experience with other images, and some rules we try to give it to develop its own logic of what makes two pictures the same or different.
Image recognition is how the computer tries to tell whether the list of numbers that represents the pixels of one image is similar enough to the list that represents the pixels of another. For our two pigeons, not all of the numbers are the same or in the same order, but the numbers representing the pigeon’s head and wings can probably be found in a group near each other on both lists. So this might help give us a clue that they both contain a pigeon. How do I find areas of the list that are important enough to compare to the lists that make up other pictures?
The process of comparing the two lists of numbers that represent two images is certainly made easier with the aid of a computer, but I was surprised to find out that the math used to compare these lists could hypothetically be done by a person. This kind of calculation can even be completed with the same kind of algebra that many of us learned in high school.
Let’s take a look at one way we try to teach computers how to see, identifying Haar-like features:
These two images represent a kind of math used in some kinds of image matching, for example facial recognition, to tell where there’s a feature in the picture we might need to note, so we can expect to see it in any matching pictures: eyes, a nose, or a hairline. “Features” are designated (in this particular method) as areas of high contrast between light and dark pixels. Looking for this difference in “pixel intensity” in a kind of greyscale saves time; we don’t have to compare all the different colors that might come up in our different images.
You might imagine that the math used to find features in a picture would be complex. But finding a feature using this method just means clipping a section of a picture, adding up all the values of the light pixels, adding up all the values of the dark pixels, and then subtracting the light sum from the dark sum.
A “feature” in this method is any section of the picture where the difference between the dark and light pixel intensities is as close to 1 as we prefer. I then have an ideal numeric value for a feature I would “expect to see” when looking for it in another image.
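As a worked example, here is a minimal Python sketch of that calculation on a made-up 4x4 greyscale patch. The pixel values are invented, and I’ve scaled them to a 0-to-1 range so the resulting feature value can be compared to 1 as described above.

```python
# A minimal sketch of a Haar-like feature on a tiny 4x4 greyscale patch.
# The pixel values are made up; 0.0 is black, 1.0 is white.
patch = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.8, 0.2, 0.1],
    [0.9, 0.9, 0.1, 0.2],
    [0.8, 0.9, 0.1, 0.1],
]

# Lay a two-rectangle Haar-like feature over the patch:
# a "light" left half and a "dark" right half, straddling an edge.
light = [row[:2] for row in patch]   # left two columns
dark  = [row[2:] for row in patch]   # right two columns

light_sum = sum(sum(row) for row in light)
dark_sum  = sum(sum(row) for row in dark)

# The feature value is just one sum minus the other, divided by the number of
# pixels in each half. (Subtracting dark from light or light from dark only
# flips the sign; the size of the gap is what matters.)
feature = (light_sum - dark_sum) / 8

print(light_sum, dark_sum)   # 7.0 and 1.0 with these made-up values
print(feature)               # 0.75 -- close to 1, so this patch holds a strong edge
```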
There are many ways to calculate, using math as advanced as calculus or as foundational as arithmetic, whether these features and intensities are similar in new images I’m evaluating. Some algorithms are very dependent on finding and matching these high-contrast areas of images, while others lean on other ideas of what math we could use to see an “expected” feature based on pixel color, nearness to other pixels that suggest an edge of an object or its texture, or distance from other pixels of a different feature.
For example, this Tuna Scope app looks not only for a certain intensity of pixels in a photo, but also at their color and placement, to tell whether a cross-section of a tuna tail is as close as possible to what it would expect to see in high-grade tuna.
But our world often gives us more concern about the impact of image recognition on people than on sushi. So let’s look at a low-resolution picture of someone’s face. It might be more of a pain to hand-calculate, say, Haar features on this image and compare them to other similarly-sized photos than it was for our 4x4 pixel sample. But we could still do it with pen and paper, given enough time.
Recognize this person? While I’m sure that a computer would have a bit more of a difficult time matching this former president to other images with a high degree of certainty, as humans we don’t have a hard time seeing how something even this low-resolution could, with the right context, be personally identifying.
As you can imagine, lower-resolution, fuzzy pictures contain fewer pixels to help the computer make comparisons, often resulting in mistaken matches. Indeed, even on high-resolution images with lots of pixels to compare, facial recognition in many current applications is found to be less effective at matching Black people, elderly or very young people, and often women. The math being used on all pictures is hypothetically the same, but the pictures we gave the computer to teach it what to expect in an image often contain mostly white people, and mostly men. This impacts how easy or hard it is for the computer to “expect to see” a person who doesn’t closely match the people in the training photos.
These examples are why even, say, the common-sense policy of banning facial recognition use by governments is something that must be written carefully into the text of those laws. If I can do “facial recognition” technically by hand, is the policy I wrote phrased as if I were just attempting to ban someone from doing a certain kind of math? Could that then be harder to enforce than I expect, require unforeseen exceptions, or let a bad actor sidestep regulations on a technicality?
We have historical evidence that trying to use policy to ban math that technically anyone can do, like encryption, doesn’t work very well. It certainly doesn’t tend to get at the problems we’re trying to solve in the use and abuse of image recognition technology.
Any laws we write, from my perspective, must be durable to how “use math to compare two numbers representing a real-world thing” technologies operate. They must preserve people’s rights to privacy and due process regardless of what math I use to calculate anything about a person or their life, no matter if it is simple or complex math. Most importantly, those rights should be preserved whether this math in practice is riddled with errors or perfectly correct.
We would hope any computer system built to do those high-stakes calculations on our behalf should also be able to talk to us and give us a reasonable idea of how it came to the conclusion that, say, a picture of a person committing a crime allegedly matches a different person’s driver’s license photo.
And that leads me to the third very important thing I learned about how computers see:
3. We’re not really sure what parts of a picture computers are looking at when they match images to each other.
Yes, you read that right. When I first started running image recognition processes on my computer to create clothing that triggers automated license plate readers, I was sure my first task would be to figure out what the creators of that system had designated as important to the algorithms that decide whether a picture is indeed of a license plate. Surely they had told the algorithm to prioritize certain license plate fonts, the shapes of the letters, and how far apart they sit on a standard plate.
After all, don’t people who build facial recognition tell the computer to look for something that has the features of, say, eyes, a mouth, and a nose, all at the right distances from one another?
As it turns out, those creators definitely do not tell the computer what to look at specifically, but instead try to train it to figure out what to expect to see based on prior experience. Creators of these systems show a computer lots of license plates or faces, sometimes millions of them. They hope that by learning the statistical relationships among those sample pictures, the computer will build a model of the features it should expect to see in any new pictures.
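To give a flavor of what that training step looks like, here is a minimal, heavily simplified sketch in Python using scikit-learn. The images and labels are randomly generated stand-ins, and real systems use far more elaborate models, but the point is the same: we hand over lists of numbers and labels, never a description of the features.

```python
# A minimal sketch, assuming scikit-learn and made-up training data.
# Nobody tells the model "look for the letters" or "look for the eyes";
# it only ever receives flattened lists of pixel values and a label for each.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is one image's pixels, already flattened.
X = np.random.rand(200, 64 * 64)       # 200 fake 64x64 greyscale images
y = np.array([0] * 100 + [1] * 100)    # 0 = "not a plate", 1 = "plate"

model = LogisticRegression(max_iter=1000)
model.fit(X, y)                        # the model works out its own weights from the examples

new_image = np.random.rand(1, 64 * 64) # another long list of numbers
print(model.predict(new_image))        # e.g. [0] -- its best statistical guess
```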
In researching for my projects, I found out that not only do the researchers who create these image recognition systems not tell the computer what to look for during that training process, but being able to tell what features helped a computer decide two images are the same is itself an open area of research.
This is because computers can’t yet talk back to us to tell us what they were looking at and why, often because the volume of calculations is so enormous and so fast that it wouldn’t make much sense for a human to read. So there are entire fields of research trying to figure out how to get these systems to show or tell us what on earth they were thinking when they decided images matched each other.
Indeed, this is one of the major reasons issues like racism in facial recognition systems are so frustrating. Computers cannot tell us directly why they mismatch Black people’s faces more frequently than other people’s, or what logic led to those erroneous conclusions. These areas of research are, as a result, both timely and critical as governments and corporations push ahead and install these error-prone systems in every aspect of our society.
If you’re not a computer scientist, as I am not, from where you sit image recognition systems probably carry a veneer of futuristic technology, a black box filled with mystery. You might have the same assumption I did: that under the hood, they must be using components someone would have to be a genius to understand.
I am sure this impression of complexity is true of many of the technologies we interact with, but it seems image recognition is not one of them. As I work to grow my understanding of computer vision systems, I look forward to continuing to share things I come across that might make it easier for us to decide where image recognition should or should not fit into our world.