Structure from motion

In AR, we often run into a variation of the following problem:

If I wanted to anchor content to a specific location on a specific espresso machine (e.g. "at the top of this espresso machine"), how would I go about it? How would I store/save/publish that association (e.g. notation, format, content)?

I was struggling to understand what kind of information one would have to publish to anchor content to real-world objects.

I ran into something called "structure from motion", and decided to see how far it could take me.

Before I go into the details, here is what this coffee maker looks like in 3D space.


I started by recording a movie of my espresso machine at work using my mobile phone (~20 secs, ~90MB).

I then sliced it into still frames using ffmpeg (~3 images per second, leading to ~100 images).
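The slicing step can be sketched roughly as below. This is a minimal sketch, not the exact command I ran: the file name `espresso.mp4`, the output folder `images` and the helper names are placeholders for illustration. The only ffmpeg feature it relies on is the `fps` video filter, which samples a fixed number of frames per second.

```python
import subprocess
from pathlib import Path


def ffmpeg_frame_command(video, out_dir, fps=3):
    """Build an ffmpeg command line that samples `fps` frames per second.

    `video` and `out_dir` are placeholder paths chosen for this sketch.
    """
    # frame_0001.jpg, frame_0002.jpg, ... inside out_dir
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", pattern]


def extract_frames(video, out_dir, fps=3):
    """Create the output folder and run ffmpeg (ffmpeg must be on the PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_frame_command(video, out_dir, fps), check=True)
```

Calling `extract_frames("espresso.mp4", "images")` would then write the numbered frames into `images/`, ready to hand over to the reconstruction step.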

I fed them into the OpenSfM reconstruction pipeline, which is able to reconstruct a 3D point cloud from correspondences between images.

What structure from motion gives you is a cloud of points for which a Euclidean correspondence can be made across images: assuming this is a rigid object, which points match between them.

The image below shows the points that could be found in at least two images.

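The position of the camera for each image is computed from the points by minimizing the reprojection error, that is, the distance between where each 3D point would project into the image and where it was actually observed.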
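To make the "error" concrete, here is a small sketch of the quantity usually called the reprojection error. It assumes an ideal pinhole camera without lens distortion and the convention x_cam = R * X + t. The function names and the single `focal` parameter are my own simplifications, not the pipeline's API, and a real pipeline would minimize this value over all cameras and points at once with a nonlinear solver.

```python
def project(point, R, t, focal):
    """Project a 3D point into an ideal pinhole camera (no distortion)."""
    # Move the point into the camera frame: x_cam = R * X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division, then scale by the focal length
    return (focal * xc[0] / xc[2], focal * xc[1] / xc[2])


def reprojection_error(points, observations, R, t, focal):
    """Sum of squared distances between observed and predicted 2D positions."""
    total = 0.0
    for X, (u, v) in zip(points, observations):
        pu, pv = project(X, R, t, focal)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total
```

A candidate pose (R, t) that explains the observations well yields a small value; the pose kept for each image is the one where this value stops improving.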

From the perspective of the coffee maker, you can see the path that the camera took (the white triangles forming a path):

At the end, for each image taken, the translation and the rotation of the camera relative to the point cloud are available.
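As a sketch of how a rotation and translation turn into a camera position: assuming the rotation is stored as an angle-axis vector and the pose maps points from the cloud into the camera frame (x_cam = R * X + t), the camera sits at C = -R^T * t in the cloud's frame. Both the storage format and the convention are assumptions worth checking against the pipeline's output; the helper names are made up for this example.

```python
import math


def rotation_matrix(rvec):
    """Convert an angle-axis vector into a 3x3 rotation matrix (Rodrigues)."""
    theta = math.sqrt(sum(c * c for c in rvec))
    if theta < 1e-12:
        # No rotation: return the identity
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (c / theta for c in rvec)  # unit rotation axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k = 1.0 - cos_t
    return [
        [cos_t + x * x * k, x * y * k - z * sin_t, x * z * k + y * sin_t],
        [y * x * k + z * sin_t, cos_t + y * y * k, y * z * k - x * sin_t],
        [z * x * k - y * sin_t, z * y * k + x * sin_t, cos_t + z * z * k],
    ]


def camera_center(rvec, t):
    """Camera position in the point-cloud frame: C = -R^T * t."""
    R = rotation_matrix(rvec)
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]
```

Computing `camera_center` for every image, in order, gives the kind of path that the white triangles trace around the coffee maker.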

You can play with the visualization here.

I'm wondering if this can be used as an authoring tool: one that automatically recovers the camera position and pose from a video, and lets publishers place a frame of reference in the real world using nothing but images.