It has seemed to me almost axiomatic for some time that 'intelligence' can only be an emergent property of systems made of many independently operating actors, cooperating, competing, and communicating.
The ideal place to test that concept is in the realm of visual perception.
Machine Vision is not computer programming
In normal computer programming, the problem domain is completely understandable. If, for example, you are writing a web server, you can go learn all there is to know about the HTTP specification. HTTP is a finite logical system created by humans. So is the internet. So is the Transmission Control Protocol. Everything you will interact with as you write your code will be a finite logical system created by humans. You'll still have plenty of trouble getting it done, but at least you'll be able to tell yourself, "Look, this has to be possible."
But when you point a digital camera at the natural world, then start writing software to try to understand what it sees--you have entered a whole new ballgame.
The difference is that the natural world is not a finite logical system. You cannot write frozen logic to deal with input that has arbitrary variability. Of course you can try -- as I did for many years -- but what you will find is that you are forced to keep adding alternative pathways to your logic, as ever more algorithm-breaking possibilities in your input come to light (sorry), until your code is so complex that every new feature breaks an old one and development grinds to a halt.
The architecture of normal computer programs--the set of techniques that underlie them all--consists of functions that generate results and that are connected to each other in a topology that is fixed at the time the program is written. For Machine Vision we will need something much more flexible.
The architecture we describe here will automatically adapt to significant changes in its inputs, and learn to improve its processing efficiency over time.
Abstractors
I will call the basic units of functionality in the Machine Vision architecture Abstractors. In normal programming, the fundamental units of logic are called functions, by analogy with mathematics. They are chunks of code that transform a set of inputs into a set of outputs. Abstractors have a more specific purpose: to creatively refine large quantities of data into smaller and more valuable representations--Abstractions.
Most Abstractors can only function when an appropriate set of inputs becomes available--except for the lowest-level Abstractors in the vision system, which take a physical input (like light) and abstract it into images. These physical Abstractors are called Sensors. In special circumstances they may be controlled by higher-level Abstractors, but normally they simply operate on a timer--creating a new image, for example, every forty milliseconds.
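The timer-driven behavior of a Sensor can be sketched in a few lines of Go. Everything here is illustrative -- `runSensor`, the `post` callback, and the field names are my own inventions -- and the `Abstraction` map type is the one defined in the next section:

```go
package main

import (
	"fmt"
	"time"
)

// Abstraction is the arbitrary map type this architecture exchanges.
type Abstraction map[string]interface{}

// runSensor is a hypothetical Sensor: it fires on a timer and posts a
// new image Abstraction once per frame period (here, every 40 ms).
func runSensor(post func(Abstraction), frames int) {
	ticker := time.NewTicker(40 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < frames; i++ {
		<-ticker.C
		post(Abstraction{
			"type":  "image",
			"frame": i,
			// a "pixels" field would hold the captured image data
		})
	}
}

func main() {
	// Stand-in for sending to the Bulletin Board: just print each frame.
	runSensor(func(a Abstraction) {
		fmt.Println("posted frame", a["frame"])
	}, 3)
}
```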
Abstractions are arbitrary Maps
In Go, it looks like this:
type Abstraction map[string]interface{}
A map using strings for keys and storing values that can be anything at all.
Abstractors that want to consume a particular type of Abstraction are written to understand its various fields and their datatypes.
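As a sketch of what that understanding looks like in practice, here is a hypothetical consumer using Go type assertions to read the fields it expects. The field names (`width`, `height`) are invented for illustration; there is no fixed schema:

```go
package main

import "fmt"

// Abstraction is the arbitrary map type from the snippet above.
type Abstraction map[string]interface{}

// regionArea reads two hypothetical fields from a "region" Abstraction.
// A consuming Abstractor must know the field names and datatypes of the
// Abstraction types it cares about; type assertions guard against
// missing or mistyped fields.
func regionArea(a Abstraction) (int, bool) {
	w, ok := a["width"].(int)
	if !ok {
		return 0, false
	}
	h, ok := a["height"].(int)
	if !ok {
		return 0, false
	}
	return w * h, true
}

func main() {
	region := Abstraction{"type": "region", "width": 12, "height": 5}
	if area, ok := regionArea(region); ok {
		fmt.Println("area:", area) // area: 60
	}
}
```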
The Bulletin Board
Unlike functions in a program, Abstractors are not connected to each other in a topology that is fixed when the code is written. Instead, Abstractors exchange Abstractions through a communications switchboard called the Bulletin Board.
When a new Abstractor starts running it registers with the Bulletin Board a function that will be used as a filter to determine which Abstractions will be of interest to it. Every time a new Abstraction is sent to the Bulletin Board, all such functions will be run. The new Abstraction will be sent to every Abstractor whose selector function returns true.
Abstractors may register a new selector function with the Bulletin Board at any time, replacing the old one.
This breaks the fixed topology of standard programming. Abstractors that produce outputs do not know where they may end up. Abstractors that consume inputs do not know whence they come--only that they match the Abstractor's current criteria.
When an Abstractor starts up it also tells the Bulletin Board what type of Abstractions it is capable of creating.
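A minimal, non-concurrent sketch of the Bulletin Board in Go might look like the following. The names, and the per-subscriber inbox mechanism, are assumptions of mine, not a specification:

```go
package main

import "fmt"

// Abstraction is the arbitrary map type defined earlier.
type Abstraction map[string]interface{}

// Selector decides whether a posted Abstraction interests an Abstractor.
type Selector func(Abstraction) bool

// BulletinBoard holds each subscriber's current selector function and
// fans out every posted Abstraction to all subscribers that match.
type BulletinBoard struct {
	subscribers map[string]Selector
	inboxes     map[string][]Abstraction
}

func NewBulletinBoard() *BulletinBoard {
	return &BulletinBoard{
		subscribers: make(map[string]Selector),
		inboxes:     make(map[string][]Abstraction),
	}
}

// Register installs an Abstractor's selector. Calling it again with the
// same name replaces the old selector, as the text describes.
func (b *BulletinBoard) Register(name string, sel Selector) {
	b.subscribers[name] = sel
}

// Post runs every registered selector against the new Abstraction and
// delivers it to each Abstractor whose selector returns true.
func (b *BulletinBoard) Post(a Abstraction) {
	for name, sel := range b.subscribers {
		if sel(a) {
			b.inboxes[name] = append(b.inboxes[name], a)
		}
	}
}

func main() {
	bb := NewBulletinBoard()
	bb.Register("tracker", func(a Abstraction) bool {
		return a["type"] == "image"
	})
	bb.Post(Abstraction{"type": "image", "frame": 0})
	bb.Post(Abstraction{"type": "region"}) // ignored by tracker
	fmt.Println(len(bb.inboxes["tracker"])) // 1
}
```

A real implementation would run subscribers concurrently and deliver over channels, but the producer/consumer decoupling is already visible here: `Post` knows nothing about who receives the Abstraction.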
Bottom-Up, and Top-Down
Initial activity in the system is bottom-up. It begins when one or more Sensors create images (or other sensory Abstractions) and send them to the Bulletin Board. These Abstractions will match the selection criteria of some Abstractors in the system, which will then receive them from the Bulletin Board and process them into higher-level Abstractions.
But as activity progresses, higher-level Abstractors may request specific work from lower-level Abstractors.
For example, imagine a system whose top-level task is to locate faint moving objects against a stellar background. It processes stacks of images and tries to locate the moving object--say, an asteroid--in each image. The processing technique that it employs works as long as the asteroid always reflects approximately the same amount of light.
But now it is trying to track an asteroid that is significantly elongated. As the object rotates, it sometimes presents its long axis to our system's sensor and the amount of light that it reflects is greatly reduced. Our tracker-Abstractor fails to find the object's location in several images of the stack.
But Tracker wants to confirm the asteroid's position in those few images, or the quality of the Abstraction that it eventually posts will be significantly degraded. So Tracker calls for help.
To request assistance, Tracker issues a new kind of communication to the Bulletin Board: a description of work that it wants done -- what to look for, and where to look for it. The Bulletin Board matches this request against what it knows about the kinds of work the other Abstractors around the system can do, and sends the request to all that look like they might be able to do the work. If any of those Abstractors decide to do the work, and if they succeed, Tracker will incorporate their work into its Abstraction, which will then be posted.
The work that Tracker initiated with its request is an example of top-down activity in the system. Such requests may come all the way from the very topmost Abstractors whose Abstractions constitute the purpose of the system, and reach all the way to the bottommost Sensors.
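The shape of such a work-request, and the Bulletin Board's matching of it against the capabilities each Abstractor declared at startup, might be sketched like this. All the names and fields here are illustrative assumptions:

```go
package main

import "fmt"

// WorkRequest is a hypothetical top-down message: a description of the
// work wanted, plus where to look for it.
type WorkRequest struct {
	Kind   string // e.g. "region-finding"
	Frame  int    // which image to search
	Bounds [4]int // bounding box: x, y, width, height
}

// dispatch returns the names of the Abstractors whose declared
// capabilities include the requested kind of work -- the Bulletin
// Board's side of matching a request to potential contractors.
func dispatch(req WorkRequest, capabilities map[string][]string) []string {
	var able []string
	for name, kinds := range capabilities {
		for _, k := range kinds {
			if k == req.Kind {
				able = append(able, name)
			}
		}
	}
	return able
}

func main() {
	// Capabilities each Abstractor reported when it started up.
	caps := map[string][]string{
		"edge-finder":   {"edge-finding"},
		"region-finder": {"region-finding"},
	}
	req := WorkRequest{Kind: "region-finding", Frame: 17, Bounds: [4]int{40, 60, 16, 16}}
	fmt.Println(dispatch(req, caps)) // [region-finder]
}
```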
The combination of data-driven bottom-up and goal-driven top-down action makes for chaotic--but meaningful--patterns of activity.
Positive and Negative Belief
In a standard computer program, a function will either produce a result or an error message. If there is no error, then you can trust the result. You can have that kind of perfect certainty when you're working with human logic: an HTTP server, a Kubernetes cluster, a file was opened or it wasn't.
When our code is pointed at the natural world, we can't have much of that kind of certainty. At the lowest levels we can: when you are doing image-processing operations--an image goes in and a different kind of image comes out--then, yes, you can be as certain of your result as of any common program's function.
But as we approach the higher levels of actual vision -- in which the outer world is being meaningfully modeled -- abstracted into representations useful for further cognition -- from that point upward we can never again have perfect certainty.
Rather than just answers or error codes, Abstractions come with belief values attached. A belief can be positive or negative (disbelief), and a given Abstraction always has attached to it a value for both. Belief spans the range from 0 to 1, disbelief from 0 to -1. If a given Abstraction is merely uncertain, you will see a low belief number. But if there is actual contradictory evidence, you will see a nonzero disbelief number alongside the positive belief.
These values may be influenced by Abstractors other than the one that posted the original Abstraction. That is, another Abstractor may come along and add some disbelief to a given Abstraction, because it is considering a kind of evidence that the original Abstractor did not -- and in that new view, something does not look right.
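A minimal sketch of these paired values in Go, with field names invented for illustration:

```go
package main

import "fmt"

// Belief pairs the two values every Abstraction carries: positive
// belief in [0, 1] and disbelief in [-1, 0].
type Belief struct {
	For     float64 // 0 to 1: evidence in favor
	Against float64 // -1 to 0: contradictory evidence
}

// AddDisbelief lets a second Abstractor weigh in with contradictory
// evidence, pushing Against further toward -1 without touching For.
// Both values coexist: strong belief and some disbelief at once.
func (b *Belief) AddDisbelief(amount float64) {
	b.Against -= amount
	if b.Against < -1 {
		b.Against = -1
	}
}

func main() {
	b := Belief{For: 0.8, Against: 0}
	// Another Abstractor, considering different evidence, objects.
	b.AddDisbelief(0.3)
	fmt.Printf("belief %.1f, disbelief %.1f\n", b.For, b.Against)
}
```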
The Market
Higher-level Abstractors can request any amount of extra activity they want. If a given Abstractor becomes overloaded with work, I think there will at some point be a mechanism that allows more copies of it to be instantiated to share the load. With different experience, different settings, and different relationships, I think those new copies can have a distinct life of their own.
There may even be a way, someday, to create new Abstractors doing new types of processing, entirely from scratch, without human programmer intervention.
In short, we have a system here that can expand to fill up any amount of computational resources you care to provide.
And that's a problem. We have finite resources, but a perceptual system that can expand without limit. And those resources are needed for other things besides visual perception! There is still higher-level cognition that the perceptual system is supposed to be supporting! That's the part that decides what the system should be doing, based on what it is seeing. It needs to have some compute power too.
We need a vision system that can somehow keep itself within some bounds of resource use, and yet handle new perceptual issues as they arise. We need a system that can quickly expand to use new resources if that is what is necessary to keep us alive, but which can then quickly 'learn' to solve that new perceptual issue more efficiently, freeing up computational power for other equally vital realms of cognition.
We need a system where overall efficiencies can arise as an emergent property of the interaction of many independent Abstractors.
We need a market.
The Abstractor Economy
And here I come to the end of what I know so far about this system. Everything up to this point I believe I know how to implement, but not this, not quite.
What provides the overall total resources that the system can use? Are those resources apportioned from the top down? But then how do the Sensors fire? Are they separate?
Is there a 'mainstream' of processing that happens without negotiation? Does the marketplace only affect top-down transactions in which a higher-level Abstractor requests specific work? But then how is that Abstractor compensated for the Abstraction that it posts?
Do Abstractors get compensated when one at a higher level uses its work? Do lower-level Abstractors 'bid' on work? Or do they do the work 'on consignment', hoping to be paid? Or both, at their discretion?
Do Abstractors have 'savings'? If they amass enough savings, is that what controls when they can reproduce?
If an Abstractor is 'thinking' about cost/benefit issues in this way, that implies a substantial amount of logic that is completely independent of its core processing logic -- the stuff it gets paid for. How is that logic implemented? And do different Abstractors have different 'styles' for this type of logic? Some being bigger risk-takers, hoping for greater rewards?
What happens when an Abstractor 'goes broke'? Can it ever function again?
Can it go on the dole and try to build up from there?
I can show one example of a system like this in action, based on our earlier example of an Abstractor trying to find a sequence of asteroid locations across a stack of images.
When Tracker discovers that it cannot detect the asteroid's location in several of its images, it creates a work-request and transmits it to the Bulletin Board. That request specifies the type of work wanted, which is region-finding; the region to be found is the asteroid in that image. It gives the prospective contractor a small bounding box in which to look -- interpolated from the other images where Tracker was able to locate the asteroid -- and says approximately how big the region should be.
Only one contractor manages to find the asteroid, and it is an exotic and expensive one. The problem is that the asteroid is reflecting so little light that normal grayvalue techniques cannot detect it. But this contractor succeeds by doing statistical analysis of the background noise. It discovers a region of just the right size in the right spot. The average grayvalue of that region is no greater than that of the background, but the standard deviation of its grayvalue distribution is three times greater than that of the typical background.
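That contractor's test -- same average grayvalue as the background, but a much wider spread -- can be sketched in a few lines of Go. The function names and the factor of three are taken from this example only; a real contractor would work on actual pixel regions:

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and standard deviation of a set of grayvalues.
func meanStd(vals []float64) (mean, std float64) {
	for _, v := range vals {
		mean += v
	}
	mean /= float64(len(vals))
	for _, v := range vals {
		std += (v - mean) * (v - mean)
	}
	std = math.Sqrt(std / float64(len(vals)))
	return
}

// noisier reports whether a candidate region's grayvalue distribution is
// at least `factor` times more spread out than the background's -- even
// when the two means are indistinguishable.
func noisier(region, background []float64, factor float64) bool {
	_, rs := meanStd(region)
	_, bs := meanStd(background)
	return rs >= factor*bs
}

func main() {
	background := []float64{100, 101, 99, 100, 102, 98, 100, 100}
	// Same mean (100) as the background, but far wider spread.
	region := []float64{94, 106, 92, 108, 110, 90, 105, 95}
	fmt.Println(noisier(region, background, 3)) // true
}
```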
Tracker accepts its work, and pays the contractor. They form an alliance that lasts for hundreds of frames.
Eventually, however, another Abstractor and potential contractor notices the high prices that Tracker is paying for region detection in the problem images. This guy thinks "I can do that cheaper!" He could not bid on the initial solicitation because his technique did not work. But since then he has tried his technique again with many modifications -- spending his own savings to do so. He has discovered a set of modifications that allow it to work on the problem images, much more cheaply than the prices Tracker is paying.
Based on that research, the new contractor makes a bid on the work-request that Tracker submitted hundreds of frames ago. Seeing the lower-cost bid, Tracker tries out the new guy, and finds that it works well. Tracker switches contractors.
This kind of activity is going on all over the system. The net effect is to gradually reduce the overall resource consumption of the system, while maintaining effectiveness.
Next Steps
In the above description, I have one tiny little snippet of code. The next steps will be to make a lot more of those snippets, and finally to see how much of the behavior described above can be shown in a running system.
The ultimate goal is to show system-wide 'intelligent' behavior emerging from the chaotic interactions of these many independent Abstractors.
But, for now, I am leaving the Market behavior out of it. That still needs a lot of designing.