In my M.A.Sc. work on stereo camera network calibration, I made use of a series of graphs to describe the relationships between nodes. Existing work had already produced the vision graph and what I will call the transition graph (after Farrell and Davis‘ transition model). In both graphs, each vertex represents a node (camera or camera pair, usually) in the multi-camera network.

The Vision Graph

In a typical sensor network, the sensing modality is simple and omnidirectional, and thus adjacency is a function of proximity. A consequence of this is that the sensing graph is usually a subset of, equal to, or a superset of the communication graph. Therefore, it is usually possible to cheat by initializing the sensing graph model using either physical topology (from, say, GPS) or communication topology.

Camera networks break from this in that their sensing modality is complex and directional, so sensing adjacency has relatively little to do with communication adjacency. Devarajan and Radke realized this early on, and proposed the vision graph. An edge in the vision graph represents information about shared field of view between the two nodes represented by its vertices; there is an edge wherever there is significant overlap of the observed scene. But what does significant mean? Assuming we’re referring to a classic, unweighted graph for the moment, in order for it to be useful, significant must mean that there is sufficient overlap that we can – or at least, that it is reasonable to expect that we could – achieve whatever task is intended for the shared data.

Since every published use of the vision graph I am aware of is a calibration application, let’s talk about that. Originally, Devarajan and Radke had to specify the vision graph manually a priori in order to use it for calibration; a follow-up with Cheng proposed a method to build it by propagating digests of SIFT features from the various images to other nodes. I myself only used it in concept, as a theoretical upper limit for the final calibration graph.

Now, for a digression…

The Transition Graph

Meanwhile, folks working on surveillance and tracking applications were also using a graph formalism – well, in most cases, a reflexive binary relation on the set of nodes that translates trivially into a graph formalism. An edge in the transition graph represents a different, although related, kind of adjacency between the two nodes represented by its vertices: the possibility of a target moving from the field of view of one node to that of the other. This, of course, includes overlapping fields of view, but importantly, it also extends to the non-overlapping case.

It quickly became clear to someone that it was worthwhile to represent not just the possibility, but the probability, of a transition. This may have been an obvious consequence of some of the methods used to actually determine this graph or relation – for example, Marinakis et al. used a Monte Carlo expectation-maximization method, and Farrell and Davis used sequential Bayesian estimation, both probability-oriented. Though intended to properly capture the information presented, it turns out that weighting the graph with these probabilities allows for some good optimization when it comes time to do predictions or camera hand-offs.

The Fuzzy Vision Graph

Seeing this, we thought, what if we similarly retain the uncertainty in the vision graph (rather than thresholding it out on an task-specific basis, as in all prior work) and put it to work optimizing calibration and other applications? Probability is, in general, ill-suited to the job, because we aren’t talking about transitions here. So, we considered another representation of uncertainty: fuzzy sets. A graph is an ordered pair $(V, E)$ of a set of vertices $V$ and a set of edged $E$; a fuzzy graph (in most definitions) is simply a generalization where $V$ and $E$ are fuzzy sets.

One nice thing immediately apparent is that, with the inherent, well, fuzziness of fuzzy graphs, there’s no longer any need to set task-specific thresholds in advance when constructing the model; instead of summarily judging whether overlap is significant, we simply capture the degree of significance, which if we really want can be thresholded later via task-specific alpha cuts. Beyond this, a number of opportunities for automatically optimizing various tasks in more advanced ways presented themselves as I poked at the idea. (This is the time when FuzzPy was born.)

The major problem, so far, is that there is no real understanding of what the ideal fuzzy vision graph actually represents. We can build them in various ways: using feature digests like Cheng, for example. But in order to decouple the practical optimization research from the practical modeling research, we need some definition of what a perfect vision graph for a camera network actually is.

Modeling Camera Network Coverage

To approach the problem of modeling the topology of camera coverage overlap, what is really needed is a proper model of camera network coverage. Much of the existing work on this has been developed specifically for optimal camera placement algorithms; it provides some inspiration, but as these models are rather simplistic (discrete, two-dimensional, triangular in shape, etc.) they don’t provide much in the way of an ideal model. Delving deeper, some decades-old work on task-oriented camera placement by Cowan and Kovesi provides an idea for a more realistic model of camera coverage, if you invert it, but is only for single cameras and uses task-specific thresholds.

Coverage, like overlap in the vision graph, is an imprecise concept when it is not tied to a specific task. Hence, fuzzy sets once again presented themselves as a possible framework for the representation: one can assign a membership degree in the set of covered points to every point in three-dimensional Euclidean space. Like Cowan and Kovesi, I identified visibility, focus, and resolution as the major factors in single-camera coverage (in the absence of a priori scene knowledge), developed “trapezoidal” fuzzy sets for each, then combined them via algebraic product fuzzy intersection to get a complete model that, given the intrinsic parameters of a camera and a couple of intuitive application (not necessarily single-task) parameters, would tell me how well any point in its local 3-space is covered. This is called the fuzzy coverage model.

It becomes evident that a fourth factor, direction, is needed when looking at the multi-camera case, where cameras are situated in a common coordinate system based on their extrinsic parameters, and we are interested in part in their field-of-view overlap. If two cameras face the same general volume of space, but from opposite sides, is the volume in common covered by both cameras? If you’re tempted to think yes, consider a task such as feature matching. Cameras can only see opaque surfaces with normals in the interior half-space defined by a plane tangent to the point in question and perpendicular to the principal axis. Furthermore, a number of studies have shown that, in practice, feature matching performance degrades over rotation of the viewpoint. Thus, if we extend our three-dimensional concept of space to include direction as well, making the fuzzy coverage model 5-dimensional, we consider points at the same location in 3-space but with different direction not the same point at all. Some geometry and a fuzzifying parameter gives us another component fuzzy set for the single-camera model, which we can incorporate by algebraic product intersection.

Another major consideration for the multi-camera case, which various work on camera network topology and placement optimization has attempted to tackle, is occlusion from the static scene. This may be known a priori (e.g. CAD layout of a building) or estimated by the camera network itself. It is possible to do this in continuous space, like ﻿Tarabanis et al., and so ideally you’d have a complete 3D model of the unoccluded field of view for each camera, and all other points have $\mu = 0$. However, the complexity seems prohibitive in practice for large camera networks. If we instead move to a discrete representation of the fuzzy coverage model, we can do what I did: make the scene finite, mark a subset of its voxels opaque, then use the voxel traversal of the line from a given point to the principal point of the camera to determine whether it is occluded (i.e. if the traversal includes any opaque voxels).

The easy part is the total network coverage. For simple multi-camera networks, where one camera covering a point (remember, we account for direction implicitly) is enough, simply combine all the appropriately transformed, discretized, and occlusion-ified single-camera coverage models together via standard fuzzy union. For 3D multi-camera networks, where we want at least two cameras covering a point, first generate a coverage model for each pair of cameras via standard fuzzy intersection on the pair, then combine all these via union.

Okay, now what?

This is about where I’m at. There’s plenty more that could be done with the fuzzy coverage model itself: solving the optimal camera placement problem in new and exciting ways is one really obvious possibility that I talk about in an upcoming paper.

As far as getting back to the graph stuff, I defined an overlap metric involving the scalar cardinality, which could be used to derive the fuzzy vision graph from the discrete fuzzy coverage model; this is significant in that it indicates a direct link from the $\mu$ values of the edges all the way back to the basic parameters of the camera network and the scene (and we know where all the simplifications are). Even better, though, might be something more analytic and general, like using something along the lines of the Hausdorff distance between two cameras’ coverage models, which might not only allow us to find the fuzzy vision graph, but also start unifying it with the non-overlapping concept of the transition graph.

Back to the future…

May 23rd, 2010
Tags: