AI’s Next Step: Deep Reality?

Ron Lunde
Apr 1, 2023

Hollywood has a classic TV trope so common it’s often parodied these days: the Enhance Button. Our intrepid detective looks over the shoulder of a techie sitting in front of a computer monitor that displays a grainy still image from a surveillance camera, and says “Enhance”. The techie presses a button, and the computer resolves the blurry image into a crisp picture of the perp. Mysteriously, the investigators can’t press the “enhance” button themselves, nor does the techie ever think to press it unprompted; they always have to be asked.

The reason real detectives don’t do that is that it’s impossible. The information isn’t there. There’s only so much signal buried in all that noise.

These days there are lots of cool AI-based image enhancers and upscalers, and they sure look like they’re doing exactly that — but they’re actually just guessing, based on statistics accumulated from similar images. It’s the same kind of thing diffusion models do.

However…

I always wondered why we don’t see distant objects more clearly if we just stare at them for a while. Obviously, we’re getting a lot more information over time than would be contained in a still image with the same resolution. Why doesn’t our brain just interpolate and give us vision that would turn even Legolas green with envy?

I’m certainly no expert, but I suspect there are three main reasons. First, our visual cortex is much, much older, evolutionarily, than the majority of our prefrontal cortex, and doesn’t differ much at all among primates. So “seeing” hasn’t kept up with “thinking”. Second, there really wasn’t any point in evolution going to all that trouble, since you can usually just wander over to something and take a closer look. (Although my optometrist always gets annoyed with me when I do that with the eye chart.) Third, I don’t know if you can really add information to what you already have without having a cognitive framework to put it in. Your eyes see color and light and dark and patterns, but what your brain sees is a cat asleep in a sunbeam.

Cognitive framework?

The Swiss psychologist Jean Piaget was the first to study the development of “object permanence” in babies. Since then, many other psychologists have joined in the fun (babies are presumably more fun to study than, say, psychopaths). Object permanence has complicated nuances, but it is basically the idea that the world contains “things” that continue to exist whether or not you keep looking at them, even when they roll behind the couch.

In other words, well before babies learn the names of things, they understand that “things” exist. They have developed a mental model of objects that persist in the external world.

That takes us back to our TV trope. Suppose that you wanted to make an “enhance” button that worked on a video, rather than on a single frame.

In the first frame, perhaps there is a shape that is consistent with it being a cat.

In the next few frames, you can test that hypothesis — yep, still a cat, though perhaps not in exactly the same location. You have a mental model of a cat, built from looking at lots and lots of cats, and the shapes in the frames match it. Cats have noses. You could imagine a very detailed cat nose! You could also (here’s the key) imagine any number of detailed cat noses, and then see which of them is most consistent with many frames.

Once you’ve done that and gotten as detailed as you like, you can apply the model you’ve built to the data in all the frames, including the first one. Now you’ve got a cat with a pink nose and a brown freckle on it, built up by comparing many alternative versions of every part of the object model across all the frames. In this case, the “enhanced” first frame isn’t a match for any old “generic” cat; it’s a match for the specific cat that best fits all the frames.

So one way you could enhance the image is by making many detailed models, then comparing the models to a lot of frames to find which is the best fit, then applying that specific model to the first image.
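
If you want to see the shape of that idea in code, here is a toy sketch in Python (numpy only; the candidate patches, downsampling factor, and noise level are all invented for illustration, not taken from any real system). It imagines many detailed versions of a patch, scores each one against every blurry frame, and keeps the one that best explains them all:

```python
import numpy as np

def downsample(patch, factor=4):
    """Average-pool a detailed patch down to the observed (blurry) resolution."""
    h, w = patch.shape
    return patch.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def best_candidate(candidates, frames, factor=4):
    """Pick the detailed candidate whose downsampled version best explains
    *all* of the observed frames, not just one of them."""
    scores = [sum(np.sum((downsample(c, factor) - f) ** 2) for f in frames)
              for c in candidates]
    return candidates[int(np.argmin(scores))]

# Toy demo: the true detail is invisible in any single noisy frame,
# but it is the only hypothesis consistent with all of them.
rng = np.random.default_rng(0)
truth = rng.random((32, 32))                       # the "real" detailed cat nose
candidates = [rng.random((32, 32)) for _ in range(50)] + [truth]
frames = [downsample(truth) + rng.normal(0, 0.05, (8, 8)) for _ in range(20)]

picked = best_candidate(candidates, frames)
print(np.allclose(picked, truth))                  # True: the best-fit model wins
```

A real system would of course generate its candidates with a learned model rather than at random, and refine them instead of picking from a fixed list, but the “compare every hypothesis against every frame” step is the part the trope leaves out.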

In theory, this could be used for anything that you can model. Want to restore an old audio recording of an orchestra? Make a model of many instruments producing sound (you can use what else you know about how that music was made to make a good guess), then use the model to re-create the music.

In the olden days of classical AI, we started with models that we painstakingly made by hand. (For some reason, those usually involved colored blocks.) However, babies don’t start with fixed models, and neither should the next generation of AI. The next generation of AI should figure out how to make models itself. (Image recognition — naming things — is more like mapping your own internal models to convention.)

“All models are wrong, but some are useful” — George Box

A prerequisite of this process is working memory. You need to remember the model from one frame to the next in order to refine it. That’s one of the things that LLMs like GPT-4 work around by prepending the conversation so far (both input and output tokens) to the current prompt. The tokens themselves constitute the “memory”, but that’s not the same thing as having a separate working memory that is independent of position (in the stream of tokens) and time. From what I’ve read, there are approximately seventy bazillion incredibly smart people working on this type of problem right now, so it seems unlikely that this “next step” is going to take very long.
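
As a concrete (and heavily simplified) illustration of that workaround, here is a sketch in which the chat’s only “memory” is the growing transcript itself, re-sent with every request; `call_model` is a hypothetical stand-in for whatever LLM API you would actually use:

```python
def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (an API request would go here)."""
    return f"(reply generated from {len(prompt)} characters of context)"

history: list[str] = []              # the only "working memory" there is

def chat(user_message: str) -> str:
    history.append(f"User: {user_message}")
    prompt = "\n".join(history)      # the whole conversation goes back in every time
    reply = call_model(prompt)
    history.append(f"Assistant: {reply}")
    return reply

print(chat("Is that blurry shape a cat?"))
print(chat("What color is its nose?"))   # only answerable because turn 1 was re-sent
```

The model itself carries nothing forward between calls; everything it “remembers” has to ride along inside the token stream, which is exactly the limitation a separate working memory would remove.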

What could we do with a stream of data interpreted through the lens of a detailed model (one constructed using both prior knowledge of the domain and the whole of the data stream)?

Perhaps we could:

  1. convert a blurry, low-resolution old family movie taken by your grandfather into a detailed, high-resolution video
  2. create a pair of glasses that gives you elf-like vision (bonus if the glasses’ temples make you look like you have elf-like ears)
  3. construct a picture of a planet orbiting a nearby star from a great many JWST images

Taken beyond vision, what kind of models could we (potentially) make? Would we ever be able to hypothesize a model of a mind that predicts its observed behavior?

Remember that a model doesn’t have to be right — it just needs to be useful.

That’s what we do all the time, even as children: you construct a mental model of your mother that lets you hypothesize what she will do if you track mud on the carpet. Being able to do that is useful.

People are making great progress on this right now, and we’re likely to see things soon that make everything we’ve seen from generative AI like ChatGPT or GPT-4 seem a bit quaint by comparison. For example, there’s a cool paper from March 2023 by Seunghoon Lee et al. called “TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation” that seems to be related.

Another related tool that you can try yourself is the Segment Anything tool from Meta. As far as I can tell, it doesn’t recognize objects across frames, but it does recognize objects without needing to name them.
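
If you want to try it from Python rather than the web demo, something along these lines should work. The model-registry key, checkpoint filename, and mask-dictionary fields below reflect my reading of the segment-anything repo’s README, so treat them as assumptions and check the repo for the current details:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a pretrained SAM checkpoint (download it from the segment-anything repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 image; OpenCV loads BGR, so convert.
frame = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(frame)   # one dict per unnamed "thing" it found

# Each mask is a nameless object hypothesis: a pixel mask plus some metadata.
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:5]:
    print(m["area"], m["bbox"])
```

Tracking those nameless objects from one frame to the next, the way object permanence would require, is the part you would have to add yourself.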

AI, in its current state, is limited and often wrong. And while the question I find most interesting is “Is it useful?”, I have to admit that another very urgent question is “Useful for what?” I deliberately titled this article “Deep Reality” as a contrast to “Deep Fake” — and the same technology can be used for both.

Powerful tools can always be used for good or evil. I’m a bit cynical I suppose, but I think calls for a moratorium or “pause” on AI development are likely to be simply ignored by the people with bad intentions. A better question than “how do we stop this” may be “how do we use AI to prevent bad people from using it for evil?”

Post-script — and Pre-apology: Articles are supposed to be static, like a frame in a video. I may change this as I learn more.

Person standing behind me: “Enhance”

