On Wednesday, October 2, 2013 4:37:33 PM UTC-5, cl...@claysturner.com wrote:
> I'm aware of that. You can do two photos in one night and achieve it.
> I'm more interested in his technique performing this with a single image,
> but I know it likely won't work.
In fact, it wouldn't be worth talking about here if it were about conversion from two or more photos, since the triangulation problem is well-understood and solutions are well-established by now (even in hardware).
Ironically, though, that is more difficult to program than single-photo conversion.
I put a demo in [1] below. I also do another run at wide camera-angle shots with the backwards ray casting algorithm in [2] below. You will see both how far it works and where its limitations and breaking points are.
If you want to try your hand at this, I put everything you need in [1], though YouTube's conversion mucked up the pictures a bit. But the original picture should be workable.
After the photo comes a sampled depth map. As mentioned at the outset of the thread, this can be acquired either by hand-drawing, or by algorithm (e.g. conversion of focal blur to depth) or both. Focal blur to depth conversion is fairly well-understood these days. But I haven't done anything with it yet.
The color-coding used for the sampled depth map has no index per se, but is tied to a silver-grey plumb line which I added to the map. The line terminates on the horizon, and all the colored lines have at least one intersection point on the plumb line, for reference. So, you can use that to convert the colors to depth coordinates.
Using BBGGRR (hexadecimal) format, all colors have BB, GG, RR in the range { ff, e0, c0, a0, 80, 40, 00 }, so you should be able to recover the lost bits.
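For concreteness, here is a rough Python sketch of that recovery step (mine, not from the videos); the palette values are the ones listed above, and the function names are just illustrative:

```python
# Sketch: snap each 8-bit channel of a pixel from the sampled depth map back
# onto the allowed palette, undoing damage from YouTube's lossy compression.

ALLOWED = (0x00, 0x40, 0x80, 0xA0, 0xC0, 0xE0, 0xFF)

def snap_channel(v):
    """Return the allowed channel value nearest to v (0..255)."""
    return min(ALLOWED, key=lambda a: abs(a - v))

def snap_color(bb, gg, rr):
    """Recover the intended BBGGRR color from a compression-damaged pixel."""
    return snap_channel(bb), snap_channel(gg), snap_channel(rr)

# e.g. a pixel that comes back as (0x3e, 0x82, 0xfd) snaps to (0x40, 0x80, 0xff)
print(snap_color(0x3E, 0x82, 0xFD))
```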
After the depth samples comes the interpolated depth map. The color code is for inverse depth, which ranges from 1 (= on screen) to 0 (= on the horizon at infinity) and runs in equal intervals from ffffff -> 0000ff -> 00ffff -> 00ff00 -> ffff00 -> ff0000 -> ff00ff -> 000000. There should be enough precision in the compressed image to recover the original bits, since the colors are all constrained to take one of the forms:
XXXXff, 00XXff, 00ffXX, XXff00, ffXX00, ff00XX, XX00XX
as XX ranges from 00 to ff.
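To make the decoding concrete, here is a small Python sketch (again mine, not from the videos) that walks the ramp above and maps a ramp color back to an inverse-depth value in [0, 1]; it assumes the channels have already been cleaned up to sit on (or very near) the ramp:

```python
# Stops of the inverse-depth ramp in BBGGRR order, from 1 (on screen) to 0 (horizon).
STOPS = [
    (0xFF, 0xFF, 0xFF),  # inverse depth 1
    (0x00, 0x00, 0xFF),
    (0x00, 0xFF, 0xFF),
    (0x00, 0xFF, 0x00),
    (0xFF, 0xFF, 0x00),
    (0xFF, 0x00, 0x00),
    (0xFF, 0x00, 0xFF),
    (0x00, 0x00, 0x00),  # inverse depth 0
]

def inverse_depth(b, g, r, tol=2):
    """Map a BBGGRR color lying on the ramp back to inverse depth in [0, 1]."""
    n = len(STOPS) - 1                      # 7 equal segments
    for i in range(n):
        b0, g0, r0 = STOPS[i]
        b1, g1, r1 = STOPS[i + 1]
        # Use the first channel that changes across this segment to get the fraction t.
        for c, c0, c1 in ((b, b0, b1), (g, g0, g1), (r, r0, r1)):
            if c0 != c1:
                t = (c - c0) / (c1 - c0)
                break
        # Accept only if the color really sits on this segment (within tol).
        bb = b0 + t * (b1 - b0)
        gg = g0 + t * (g1 - g0)
        rr = r0 + t * (r1 - r0)
        if max(abs(bb - b), abs(gg - g), abs(rr - r)) <= tol:
            return 1.0 - (i + t) / n
    return None                             # color is not on the ramp

# e.g. 00ff80 (about halfway between 00ffff and 00ff00) decodes to roughly 4.5/7
print(inverse_depth(0x00, 0xFF, 0x80))
```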
The interpolation used for the depth samples, as mentioned at the outset of this thread, converts a depth sample into a finite spread for a likelihood function given by f(z, Z, K) f(r, R, L), where z is the distance in color space of a pixel from the pixel associated with the depth sample and r is the pixel distance; Z, R, K, L are parameters, and the function is
f(x, X, k) = (X - x)/(1 + kx) if |x| < X; f(x, X, k) = 0 otherwise.
The interpolation is relatively insensitive to the parameters. Segmentation of objects from background comes about almost as an automatic result -- as you can see with the trees in the previously-mentioned video ([3] below).
The conditions on the density of the samples and on (Z, R) are that enough samples be used, and that (Z, R) be large enough, so that every point has a non-zero accumulated total likelihood after all the likelihood spreads are added up.
This is basically just Bayes rule.
You can use your own interpolation functions.
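As one concrete (if simplified) reading of the above, here is a Python sketch of the interpolation: each sample spreads its likelihood f(z, Z, K) f(r, R, L) over nearby pixels, and each pixel's depth is taken as the likelihood-weighted average of the sample depths. The weighted average is my reading of the "Bayes rule" remark, and the data layout and names are only illustrative:

```python
import math

def f(x, X, k):
    """The spread function from above: (X - x)/(1 + k*x) for |x| < X, else 0."""
    return (X - x) / (1.0 + k * x) if abs(x) < X else 0.0

def interpolate_depth(pixels, samples, Z, K, R, L):
    """
    pixels : dict mapping (x, y) -> (bb, gg, rr) image color
    samples: list of ((x, y), (bb, gg, rr), depth) depth samples
    Returns a dict mapping (x, y) -> interpolated depth.
    """
    out = {}
    for (px, py), pcol in pixels.items():
        total_w = 0.0
        total_wd = 0.0
        for (sx, sy), scol, depth in samples:
            z = math.dist(pcol, scol)            # distance in color space
            r = math.dist((px, py), (sx, sy))    # pixel distance
            w = f(z, Z, K) * f(r, R, L)          # this sample's likelihood at this pixel
            total_w += w
            total_wd += w * depth
        # The density/(Z, R) condition above is what guarantees total_w > 0 everywhere.
        out[(px, py)] = total_wd / total_w if total_w > 0.0 else None
    return out
```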
In [1], after the conversion, the entire cityscape literally rocks "low rider" style in sync with the beat.
Pay careful attention to the TOPS of the buildings in the foreground in both [1] and [2]. They DO foreshorten as you lower in altitude. That's coming from the depth sample interpolation.
By way of comparison...
The methods you see portrayed on YouTube generally involve something along the lines of (a) use Photoshop to carve up layers by hand, (b) create a diorama, (c) maybe do some interpolation.
The only REAL conversion displayed on YouTube, I think, is the one from Carnegie-Mellon's AI lab, which should have a "related videos" link to mine. But take a close look at what they did in their demo: (a) no small foreground objects (like the trees in my Rushmore scenes), (b) no real occlusions (like the buildings in my cityscapes), (c) the trees in their street scenes tend to be pasted onto the buildings they're standing in front of. And no complicated structure (like the hundreds of buildings in Chicago).
In other words, they got impressive results by cherry-picking scenes and camera angles in such a way as to avoid showing the breaking points and limitations of their process. And they used high-resolution graphics to make it pretty.
I think it's more useful to see where a method breaks down and where its limits are, not merely where it works. So it would have been better for them to show us their process in applications like what I've been doing.
Video References:
[1] The Beast Stomp Rocks Chicago -- Literally.
(+ How to convert single photo to 3D).
http://www.youtube.com/watch?v=JdN76XYvvg0
Sequel to the Meat for the Beast intro sequence and the Beast Stomp. Chicago rocks with a sound so loud they measure it in megatons. Video taken from the BeastCam. Once again, the Cyborg Ninja (inspired partly by Robocop 3) takes up the narration.
[2] Chicago single photo to 3D. Helicopter style landing & flyover.
http://www.youtube.com/watch?v=ILYE5TqEEk4
Wide-ranging flyovers of Chicago derived from single photo. Using backwards raycasting and taking it all the way down to the street level or to the breaking point, whichever comes first.
[3] Single [Computer Generated] Photo To 3D To Video.
http://www.youtube.com/watch?v=NHhIu9bPgZg
Mount "Rushmary", computer generated girl (one of the faces on the mountain), Chicago and railroad tracks -- using backwards ray casting.