# Conversion of Single Photo to 3D to Video -- Trial Runs

Started September 16, 2013
```This is a short video segment showing some trial runs at an early stage in the development process:

One set of segments takes place over Chicago, the other around Mount Rushmore. Since it's only a first trial run (basically to see how far one can move in different directions before distortion sets in), the images were reprojected at reduced resolution relative to the original photos.

This is part of a larger endeavor to develop the tools needed for doing digital cinematography. I'm keeping a running account of some of this activity via Twitter at #Federation2005 as well as the NinjaDarth YouTube channel.

A few notes on the matter:

A conversion from a single photo is carried out in 2 stages.

The first is to acquire a relatively small set of samples for a "depth map". The method used presently is just to estimate it by eye. In a more automated system, one may try to use the blurring level of edges in an image to estimate focal blur, which in turn may be used to estimate depth at a sampling of points concentrated around edges. A good system will allow input from both sources as well as others.

The weak part of doing extrapolation by edges is that edges tend to be interfaces between foreground and background objects, so one then has to decide which side of the edge gets the indicated depth and which side is in the background.

The second is the extrapolation of the depth map. For this, I used a simple heuristic to generate a "likelihood" distribution for each depth marker, with the function taking the form L = f(R,k;r) f(Z,K;z), where r is the distance of a pixel from the depth marker and z is the distance of that pixel's color from the pixel color under the depth marker; (R,k) and (Z,K) are parameters, and the function f(X,k;x) is defined as (X - x)/(1 + kx) if |x| < X, and 0 otherwise.

This performs a natural gradation of the depth, while having the side effect of segmenting objects from the image. Sparse objects, like trees, tend to be handled rather nicely by the process (as seen in the Rushmore segments of the video).

The video has numerous rendering artifacts, as well as artifacts associated with the limitations of trying to look around objects in a single photo. Corrective measures to handle that haven't yet been included.

The rendering process basically works like this to convert the "source image" to a "target image":
(1) Each point in the target image is represented as a "projection line" in the source image. (One of the artifacts you see in the video came when the projection line became "dead on").

(2) Then you run down the projection line from nearest point to furthest point, and grab whatever bits of the scenery come close to the line. To determine "closeness", each point is projected back on the target image and compared with the pixel undergoing projection.

Step 2 is a basic form of ray tracing that effectively accumulates a diffraction pattern from nearby objects, before landing on whatever target object the target image's pixel is looking toward.

To do the computations, one may define a 3x2 matrix Pi whose rows are (1 0), (0 1) and (0 0), and a column vector Mu = (0 0 1)^T. This effects a decomposition of the 3x3 identity 1_3 into 1_3 = Pi Pi^T + Mu Mu^T, with Pi^T Pi = 1_2 the 2x2 identity, Mu^T Mu = 1, Pi^T Mu = 0 = Mu^T Pi.

The target image is associated with a 3x3 rotation matrix R and a displacement a. For the video, R = 1_3 was used (more interesting results come from allowing a rolling "yaw" motion or change in orientation, but I didn't include these).

Then associated with the target position 2-vector r is the line of 3-vectors, parameterized by the depth 1/u, given by:
(Pi r + Mu Z)/u + a
where Z is used to scale the 3rd dimension.

The depth parameter u ranges from 0 (at the horizon) to 1 (on the screen) to infinity (at the focal point). For the source image, the corresponding spatial vector is
(Pi r_0 + Mu Z)/u_0.

To get the corresponding line, both r_0 and u can be solved in terms of u_0.

For the image, it's assumed that a depth map is given as a function or table u(r_0) in the interval [0, 1], as r_0 ranges over the source image screen coordinates.

Different offsets can be used for r and r_0 to get the center of focus on the center of the image or on one of its corners -- both are illustrated in the video. (That's the beginnings of what's known as a Camera Model).

To find the depth along the projection line and use the diffraction model, one needs to get the actual depth associated with each point r_0 -- u_0' = u(r_0), which leads to the point
(Pi r_0 + Mu Z)/u_0' = (Pi r' + Mu Z)/u'
and the resulting differences
delta r = r' - r, delta u = u' - u
which are both functions of delta u_0 = u_0' - u_0 as well as the other parameters.

The goal is to find the value of u_0 along the projection line that makes delta u_0 = 0 -- but to do so in a way that doesn't end up shooting in the space between pixels. This is where the comparison between the reprojected target point r' versus original target point r comes into play.
```
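
The likelihood-weighted extrapolation described in the post above can be sketched in Python roughly as follows. This is a minimal sketch and not the poster's code: the names `spread` and `interpolate_depth`, the use of NumPy, the Euclidean distance in RGB for z, and the likelihood-weighted average used to combine the markers are all assumptions. The post itself only gives the form L = f(R,k;r) f(Z,K;z).

```python
import numpy as np

def spread(x, X, k):
    """f(X,k;x) = (X - x)/(1 + k x) for |x| < X, and 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < X, (X - x) / (1.0 + k * x), 0.0)

def interpolate_depth(image, markers, R, k, Z, K):
    """Extrapolate sparse depth markers over an image.

    image   -- (H, W, 3) float array of pixel colors
    markers -- list of (row, col, u) depth samples, u being inverse depth
    Returns an (H, W) array of inverse depths; NaN where no marker reaches.
    """
    h, w, _ = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    total = np.zeros((h, w))      # accumulated likelihood
    weighted = np.zeros((h, w))   # accumulated likelihood * depth
    for mr, mc, u in markers:
        r = np.hypot(rows - mr, cols - mc)                  # pixel distance
        z = np.linalg.norm(image - image[mr, mc], axis=2)   # color distance
        L = spread(r, R, k) * spread(z, Z, K)
        total += L
        weighted += L * u
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total > 0, weighted / total, np.nan)
```

Because the color factor drops to zero once a pixel's color is further than Z from the marker's color, a marker placed on one object does not leak onto a differently colored neighbor, which is one way to read the segmentation side effect mentioned in the post.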
```On Monday, September 16, 2013 1:04:48 PM UTC-5, federat...@netzero.com wrote:
> This is a short video segment showing some trial runs at an early stage
> in the development process:

> One set of segments takes place over Chicago,
> the other around Mount Rushmore [...] only a first trial run [...] to see
> how far one can move in different directions before distortion sets in

A second trial run with an improved line-tracing algorithm.

This time, there are variations in the bearing, pitch and yaw, along with motion in each of the three dimensions.
The colored zones seen are:
Cyan = region outside the photo boundaries
Blue = region behind objects in the photo
Green = sideways/reverse view
as well as other colored zones.

The "diffraction" model mentioned in the previous article was meant (a) to fill in the blue zones by interpolation, (b) to anti-alias edges where new occlusion occurs and (c) to see through partial transparencies.

Most of the distortion seen on buildings in the wide angle shots comes from the diffraction model. Without it, and without any other interpolation model, these areas would all be part of the "blue zone".

Only a small amount of the blue zone remains, as a result. This needs an interpolation algorithm, and I might toy with just getting rid of the diffraction model and deferring the blue-zone handling to that algorithm.
```
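
For the simplest case used in the first video (R = 1_3), equating the target vector (Pi r + Mu Z)/u + a with the source vector (Pi r_0 + Mu Z)/u_0 gives 1/u = 1/u_0 - a_z/Z and r_0 = r + u_0 (Pi^T a - r a_z/Z), where a_z = Mu^T a. So the projection line is straight in source screen coordinates and ends at r_0 = r on the horizon. A rough Python sketch of walking that line against a depth map follows. The names `projection_line` and `first_hit`, the fixed step count, the nearest-pixel lookup and the choice of screen origin at pixel (0, 0) are assumptions. It also stops at the first surface it reaches, so it leaves out the "diffraction" accumulation and the colored zones described in the thread.

```python
import numpy as np

def projection_line(r, a, Z, u0):
    """Point on the projection line of target pixel r, for R = identity.

    Returns (r0, u): the source screen position and the target inverse
    depth matching source inverse depth u0. Valid while Z - a_z*u0 > 0.
    """
    r = np.asarray(r, dtype=float)
    a = np.asarray(a, dtype=float)
    r0 = r + u0 * (a[:2] - r * a[2] / Z)
    u = u0 * Z / (Z - a[2] * u0)
    return r0, u

def first_hit(depth, r, a, Z, steps=256):
    """Walk the line from near (u0 = 1) to the horizon (u0 = 0).

    depth -- (H, W) array of inverse depths u(r_0) in [0, 1]
    Returns the first source pixel (x, y) whose scenery is at least as
    near as the current point on the line, or None if there is none.
    """
    h, w = depth.shape
    for u0 in np.linspace(1.0, 0.0, steps):
        if Z - a[2] * u0 <= 0:
            continue                      # point lies behind the target viewpoint
        r0, _ = projection_line(r, a, Z, u0)
        x, y = int(round(r0[0])), int(round(r0[1]))
        if not (0 <= x < w and 0 <= y < h):
            continue                      # outside the photo boundaries
        if depth[y, x] >= u0:
            return x, y
    return None
```

With a = 0 every target pixel maps back to itself, which is a quick sanity check on the parameterization.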
```Monday, September 16, 2013 1:04:48 PM UTC-5:
> trial runs at an early stage in the [rapid] development process:
> a first trial run [...] to see how far one can move in different directions before distortion sets in

Tuesday, September 17, 2013 6:53:41 PM UTC-5:
> trial run with an improved line-tracing algorithm.
> ...variations in the bearing, pitch and yaw...
> Blue = region behind objects in the photo [and other colored zones]

A trial run, this time, with the interpolation model included to better handle the "blue zones".

Mount Rushmore morphs into "Mount Rushmary", in one segment; in addition to the other segments.

The diffraction model was fixed up to remove the astigmatism and blurring in the 2nd trial run; but it was not eliminated, since it's needed to keep projection lines from passing through an image between pixels.

> Most of the distortion seen [...] comes from the diffraction model.

Much of the problem is resolved, but the problem can never be entirely removed, by the very nature of the task.
```
```On Wednesday, September 18, 2013 5:38:03 PM UTC-5, federat...@netzero.com wrote:
> Mount Rushmore morphs into "Mount Rushmary", in one segment; in addition to the other segments.

Single Photo to 3D With "Independence Day" Shadow Effect.

The monument remains stationary. This is meant to show an application of 2D->3D conversion: to produce more realistic changes in lighting. The shadow goes across the monument -- and even across the embedded subtitle, leaving behind a "parallel universe" version in its wake.

What I wanted to do here is describe how the effect is brought about. This is basically a method for doing 3-dimensional cross-fading, where the cross-fade moves across the image while the image itself remains intact.

The two images involved are a and b. The brightness for each one is selectively controlled on a pixel by pixel basis by functions of time, t, where t = t(r) is a function of the pixel location r in each image:

a(t) = 1 for t <= 0; f(t/(1-d)) for 0 < t < 1-d; 0 for 1-d <= t
b(t) = 0 for t <= d; f((1-t)/(1-d)) for d < t < 1; 1 for 1 <= t.

The corresponding cross-fading functions are used to mix the two
A(t) = 1 for t <= 0; (1-t) for 0 < t < 1; 0 for 1 <= t
B(t) = 0 for t <= 0; t for 0 < t < 1; 1 for 1 <= t

The function f(t) is chosen so as to make aA + bB continuously differentiable in t. Thus (writing c_0, ..., c_3 for the polynomial coefficients, to keep them apart from the cross-fading functions A and B):
f(t) = c_0 + c_1 t + c_2 t^2 + c_3 t^3 + t^2 (1 - t)^2 g(t)
f(0) = 1, f(1) = 0, f'(0) = 1 - d, f'(1) = 0
=> f(t) = (1 - t)^2 (1 + (3 - d) t + t^2 g(t))
For a cubic, g(t) = 0.

Each point in the photo(s) that the 3-D model is derived from is a 2-vector r that maps (for a given configuration (R,t)) to:
r_a = R (Pi r + mu Z)/u_a(r) + t, r_b = R (Pi r + mu Z)/u_b(r) + t
The configuration (R,t) consists respectively of a rotation matrix R and translation vector t. Since the demo keeps everything stationary, R = 1_3 the 3x3 identity and t = 0.

The depth maps for the two images u_a(r) and u_b(r) are reciprocal depths. So 0 means "at the horizon", 1 means "on the screen".

The shadow effect seen is with a planar shadow, given by a function
t(X) = t_0 + n.X
where X is a 3D position (here either r_a or r_b) and n is a fixed 3-vector determining the normal to the plane, along with the rate at which the shadow moves. This results in two separate time readings:
t_a = t(r_a), t_b = t(r_b).
Therefore, because the depths in the two images may differ in places, the sum A(t_a) + B(t_b) need not be 1.

The method used in the demo to normalize this is:
* replace A by min(A, 1 - B),
* replace B by max(1 - A, B).
This is suitable if the shadow is moving away, but may still cause a visibly slow fade-in for new foreground objects -- as seen with the hair when Teddy Roosevelt and his spectacles are replaced by the cool-looking goddess "Tedhi" with her double ponytail and sunglasses.

If the shadow is moving toward you, then a more appropriate adjustment is:
* replace A by max(A, 1 - B),
* replace B by min(1 - A, B).

In all the cases where the depths differ, one has a non-zero u_a + u_b. Thus, a method that combines these two would be:
* replace A by (A u_a + (1 - B) u_b)/(u_a + u_b)
* replace B by ((1 - A) u_a + B u_b)/(u_a + u_b).
For the present case, it makes no visible difference whether this is used instead.

There's no real good way to remove the visible cross-fading, since that's inherent to the effect. For the most part, the shadow masks the cross-fading, so it looks mostly like a cloud shadow passing over the monument and leaving behind something totally different in its wake; in other words: like the phase boundary between two parallel universes moving across the landscape in a smooth fashion.

It's a really cool and novel effect, obtained without too much compromise on the photorealism.
```
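
The fade and cross-fade functions from the post above can be written out in Python as a small sketch. The function names are made up for illustration, d is assumed to satisfy 0 <= d < 1, and only the cubic case g(t) = 0 is covered. The three `weights_*` helpers correspond to the three normalizations listed in the post (shadow moving away, shadow moving toward the viewer, and the depth-weighted combination).

```python
def f(t, d):
    """Cubic case (g = 0): f(t) = (1 - t)^2 (1 + (3 - d) t)."""
    return (1.0 - t) ** 2 * (1.0 + (3.0 - d) * t)

def a(t, d):
    """Brightness of image a."""
    if t <= 0:
        return 1.0
    if t >= 1 - d:
        return 0.0
    return f(t / (1 - d), d)

def b(t, d):
    """Brightness of image b."""
    if t <= d:
        return 0.0
    if t >= 1:
        return 1.0
    return f((1 - t) / (1 - d), d)

def A(t):
    """Cross-fade weight of image a: 1 -> 0 linearly over 0 < t < 1."""
    return min(1.0, max(0.0, 1.0 - t))

def B(t):
    """Cross-fade weight of image b: 0 -> 1 linearly over 0 < t < 1."""
    return min(1.0, max(0.0, t))

def shadow_time(t0, n, X):
    """Planar shadow: t(X) = t_0 + n.X for a 3D position X."""
    return t0 + sum(ni * xi for ni, xi in zip(n, X))

def weights_receding(t_a, t_b):
    """Shadow moving away: A -> min(A, 1 - B), B -> max(1 - A, B)."""
    wa, wb = A(t_a), B(t_b)
    return min(wa, 1 - wb), max(1 - wa, wb)

def weights_approaching(t_a, t_b):
    """Shadow moving toward the viewer: A -> max(A, 1 - B), B -> min(1 - A, B)."""
    wa, wb = A(t_a), B(t_b)
    return max(wa, 1 - wb), min(1 - wa, wb)

def weights_blended(t_a, t_b, u_a, u_b):
    """Depth-weighted combination of the two; requires u_a + u_b > 0."""
    wa, wb = A(t_a), B(t_b)
    s = u_a + u_b
    return (wa * u_a + (1 - wb) * u_b) / s, ((1 - wa) * u_a + wb * u_b) / s
```

In each of the three normalizations the two returned weights add up to 1, which is the point of the adjustment when t_a and t_b differ.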
```On Tuesday, October 1, 2013 7:06:56 PM UTC-4, federat...@netzero.com wrote:
> On Wednesday, September 18, 2013 5:38:03 PM UTC-5, federat...@netzero.com wrote:
> > Mount Rushmore morphs into "Mount Rushmary", in one segment; in addition to the other segments.
>
> Single Photo to 3D With "Independence Day" Shadow Effect.
>
> [...]
>
> It's a really cool and novel effect, obtained without too much compromise on the photorealism.

Try doing this with a picture of the full moon or of the great galaxy in Andromeda. To convert these to 3-D would be impressive! There are plenty of source photos on the web.

```
```On 02/10/13 19:11, clay@claysturner.com wrote:
> Try doing this with a picture of the full moon or of the great galaxy in Andromeda. To convert these to 3-D would be impressive!

There are Victorian/Edwardian stereoscopic photos
of the full moon - they used the libration to get
the two viewpoints.

If http://apod.nasa.gov/apod/astropix.html wasn't
down for the count, you could probably find a
version on there using red/blue anaglyphs.

```
```On Wednesday, October 2, 2013 3:35:04 PM UTC-4, Tom Gardner wrote:
> On 02/10/13 19:11, clay wrote:
> > Try doing this with a picture of the full moon or of the great galaxy in Andromeda. To convert these to 3-D would be impressive!
>
> There are Victorian/Edwardian stereoscopic photos
> of the full moon - they used the libration to get
> the two viewpoints.

I'm aware of that. You can do two photos in one night and achieve it. I'm more interested in his technique performing this with a single image, but I know it likely won't work.

> If http://apod.nasa.gov/apod/astropix.html wasn't
> down for the count, you could probably find a
> version on there using red/blue anaglyphs.

```
```Another trial run -- this time with "backward" ray casting (i.e. from image to viewer, rather than from viewer to image).

Single [Computer Generated] Photo To 3D To Video.

There's still pixel leakage, as seen from the breakup when moving in. That's a rendering problem, not a 3D conversion issue. The "female" in the 2nd segment ("Taamuz" Jefferson) was computer generated as a *2D* "photo", only later made 3D.

I'll post the remapping analyses done to recolor the images some other time: it's more than just "color equalization". I use 3 algorithms in tandem, one of which fakes texture remapping.

The first railroad run is low resolution with a corrected depth map. The second is higher resolution with the original uncorrected depth map.

I'll also put up a simple demo showing the stages of conversion from 2D to 3D later on. The process can work with fully automated depth estimation (e.g. by focal blur) or with hand-drawn/edited depth maps, or combinations of the two.

On Wednesday, October 2, 2013 1:11:19 PM UTC-5, cl...@claysturner.com wrote:
> Try doing this with a picture of the full moon or of the great galaxy in Andromeda. To convert these to 3-D would be impressive! There are plenty of source photos on the web.

The Wikipedia page for stereophotography has a red-cyan stereo of the moon (which dates from 1897).

I already have topographic maps of the moon (which are now available over the net) that I've been sitting on because of an unresolved problem: what color are the various parts of the moon? Photography shows shadows and lighting, which would be a pain to reverse-engineer, especially since the lighting model for lunar features is still an active area of research!

A wild lunar pass-by (& launch from Earth, approach of Mars or Ganymede, actual starfield, etc.) was originally slated to go with these YouTubes where I was experimenting with sound sculpting software:

Right now there are only stubs in there for the video tracks; and the Outer Limits narrator voice is going to be completely replaced by machine along with actual real-time frequency-"relocated" sound spectrograph to replace the Outer Limits oscillator waveform.
```
```On Wednesday, October 2, 2013 4:37:33 PM UTC-5, cl...@claysturner.com wrote:
> I'm aware of that. You can do two photos in one night and achieve it.
> I'm more interested in his technique performing this with a single image,
> but I know it likely won't work.

In fact, it wouldn't be worth talking about here if it was about conversion from 2 or more photos, since the triangulation problem is well-understood and solutions well-established now (even in hardware).

But ironically, multi-photo conversion is more difficult to program than single-photo conversion.

I put a demo here in [1] below. Also, I do another run at wide-angle camera shots with the backwards ray casting algorithm in [2] below. You will see both how far it works and where its limitations and breaking points are.

If you want to try your hand at this, I put everything you need in [1], though YouTube's conversion mucked up the pictures a bit. But the original picture should be workable.

After the photo comes a sampled depth map. As mentioned at the outset of the thread, this can be acquired either by hand-drawing, or by algorithm (e.g. conversion of focal blur to depth) or both. Focal blur to depth conversion is fairly well-understood these days. But I haven't done anything with it yet.

The color-coding used for the sampled depth map has no index per se, but is tied to a silver-grey plumb line which I added to the map. The line terminates on the horizon, and all the colored lines have at least one intersection point on the plumb line, for reference. So, you can use that to convert the colors to depth coordinates.

Using BBGGRR (hexadecimal) format, all colors have BB, GG, RR in the range { ff, e0, c0, a0, 80, 40, 00 }, so you should be able to recover the lost bits.

After the depth samples comes the interpolated depth map. The color code is for inverse depth, ranging from 1 (= on screen) to 0 (= on the horizon at infinity) and runs in equal intervals from ffffff -> 0000ff -> 00ffff -> 00ff00 -> ffff00 -> ff0000 -> ff00ff -> 000000. There should be enough precision in the compressed image to recover the original bits, since the colors are all constrained to have one of the values:
XXXXff, 00XXff, 00ffXX, XXff00, ffXX00, ff00XX, XX00XX
as XX ranges from 00 to ff.

The interpolation used for the depth samples, as mentioned at the outset of this thread, converts a depth sample into a finite spread for a likelihood function given by f(z, Z, K) f(r, R, L), where z is the distance in color space of a pixel from the pixel associated with the depth sample, and r is the pixel distance; Z, R, K, L are parameters, and the function is
f(x, X, k) = (X - x)/(1 + kx) if |x| < X; f(x, X, k) = 0 else.
The interpolation is relatively insensitive to the parameters. Segmentation of objects from background comes about almost as an automatic result -- as you can see with the trees in the previously-mentioned video ([3] below).

The conditions used on the density of the samples and on (Z, R) are that there should be enough samples used and (Z, R) should be large enough that every point should have a non-zero accumulated total likelihood, after all the likelihood spreads are added up.

This is basically just Bayes' rule.

You can use your own interpolation functions.

After that, the entire cityscape literally rocks "low rider" style in sync with the beat.

Pay careful attention to the TOPS of the buildings in the foreground in both [1] and [2]. They DO foreshorten as you lower in altitude. That's coming from the depth sample interpolation.

By way of comparison...

The methods you see portrayed on YouTube generally involve something along the lines of (a) use Photoshop to carve up layers by hand, (b) create diorama, (c) maybe do some interpolation.

The only REAL conversion displayed on YouTube, I think, is the one from Carnegie Mellon's AI lab, which should have a "related videos" link to mine. But take a close look at what they did in their demo: (a) no small foreground objects (like the trees in my Rushmore scenes), (b) no real occlusions (like the buildings in my cityscapes), (c) the trees in their street scenes tend to be pasted onto the buildings they're standing in front of. And no complicated structure (like the hundreds of buildings in Chicago).

In other words, they got impressive results by cherry-picking scenes and camera angles in such a way as to avoid showing the breaking points and limitations of their process. And they used high-resolution graphics to make it pretty.

I think it's more useful to see where a method breaks down and where its limits are, not merely where it works. So it would have been better for them to show us their process in applications like what I've been doing.

Video References:
[1] The Beast Stomp Rocks Chicago -- Literally.
(+ How to convert single photo to 3D).

Sequel to the Meat for the Beast intro sequence and the Beast Stomp. Chicago rocks with a sound so loud they measure it in megatons. Video taken from the BeastCam. Once again, the Cyborg Ninja (inspired partly by Robocop 3) takes up the narration.

[2] Chicago single photo to 3D. Helicopter style landing & flyover.

Wide-ranging flyovers of Chicago derived from single photo. Using backwards raycasting and taking it all the way down to the street level or to the breaking point, whichever comes first.

[3] Single [Computer Generated] Photo To 3D To Video.

Mount "Rushmary", computer generated girl (one of the faces on the mountain), Chicago and railroad tracks -- using backwards ray casting.
```
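
For anyone trying the demo in [1], the color code of the interpolated depth map described in the post above can be decoded along the following lines. This is only a sketch under stated assumptions: the names `KEYS`, `encode` and `decode`, the linear blend between key colors, and the nearest-color table lookup used to absorb compression noise are all illustrative choices. The post gives only the key colors (in BBGGRR order) and says the intervals are equal.

```python
import numpy as np

# Key colors of the ramp as (BB, GG, RR), from u = 1 (on screen)
# down to u = 0 (on the horizon), in the order listed in the post.
KEYS = np.array([
    (0xff, 0xff, 0xff), (0x00, 0x00, 0xff), (0x00, 0xff, 0xff),
    (0x00, 0xff, 0x00), (0xff, 0xff, 0x00), (0xff, 0x00, 0x00),
    (0xff, 0x00, 0xff), (0x00, 0x00, 0x00),
], dtype=float)

def encode(u):
    """Inverse depth u in [0, 1] -> (BB, GG, RR) on the ramp."""
    s = (1.0 - u) * (len(KEYS) - 1)      # position along the 7 equal intervals
    i = min(int(s), len(KEYS) - 2)       # index of the interval containing s
    frac = s - i
    return (1.0 - frac) * KEYS[i] + frac * KEYS[i + 1]

# Dense lookup table along the ramp: 255 steps per interval.
_U = np.linspace(1.0, 0.0, 7 * 255 + 1)
_TABLE = np.array([encode(u) for u in _U])

def decode(bgr):
    """Inverse depth of the ramp color nearest to a (BB, GG, RR) triple."""
    diff = _TABLE - np.asarray(bgr, dtype=float)
    return float(_U[np.argmin(np.sum(diff ** 2, axis=1))])
```

Snapping to the nearest ramp color, rather than inverting one channel at a time, is meant to tolerate the small color shifts that video compression introduces.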
```Me, October 1, 2013 7:06:56 PM UTC-4:
> Single Photo to 3D With "Independence Day" Shadow Effect.
> The monument remains stationary...
> It's a really cool and novel effect, obtained without too much compromise on the photorealism.

On Wednesday, October 2, 2013 at 1:11:19 PM UTC-5, cl...@claysturner.com wrote:
> Try doing this with a picture of the full moon or of the great galaxy
> in Andromeda. To convert these to 3-D would be impressive!

OK. Here you go!

And, here, we do it with both the planets, starfields AND single photo 2D->3D.

Invasion From Planet Chicago: Independence Day II: https://www.youtube.com/watch?v=f0PGwY_he6Y

That's right: a stealth trailer for ID4 II. You heard it here first (or second): They're Back!

This time the scene (Chicago scene #1) does move. Chicago scene #2 moves a little, too (but mostly zoom). Chicago scene #3 has the shadow and the alien ship. It is also moving (slightly). I'll have more to say on everything below.

Oh, and by the way: the original Mount Rushmore/Rushmary with motion
Cosmetological Singularity:

I reused the Chicago scenes. All the sprites are derived from single photos as well.

Here, the conversion is fully automated.

=======

Some of the DSP used for ID4 II.
* The camera motions make heavy use of interpolation where both order 0 and order 1 constraints are used (in order to remove jerkiness). In some cases, order 2 constraints are used (e.g. the motion away from the Earth).

An earlier experiment with camera motions, here, is also working off a single photo

* Going from Chicago #1 to Chicago #2, there is a recoloring. This is derived by color-matching #1 to #2 and using a combination of statistical fitting and histogram fitting.

The recoloring is used here too to turn the woman's outfit into flesh-color.