Exploring Image Tracking

Google and Apple recently announced that their AR platforms can track images in the real world.

This is a fundamental stepping stone because it lets new use cases appear without recompiling or re-releasing the platform: it is a use-case-agnostic building block.

For example, without any anticipation or permission from the platform, you can associate content with video games, business cards and banknotes, all substantially different verticals.

This is a walkthrough of how this works, mostly for my own education.

Apple's ARKit

Let me start with Apple's "Recognizing Images in an AR Experience" guide. The goal is well summarized:

Detect known 2D images in the user’s environment, and use their positions to place AR content.

Let's get into the mechanism. Here is an overview of how it works:

Your app provides known 2D images, and ARKit tells you when and where those images are detected during an AR session

That is, you provide examples of 2D images and the system notifies your app when they are detected. Digging deeper, the ARReferenceImage documentation says:

To accurately detect the position and orientation of a 2D image in the real world, ARKit requires preprocessed image data and knowledge of the image's real-world dimensions. For each reference image, use the Xcode inspector panel to provide the real-world size at which you want ARKit to recognize the image.

It appears that, in addition to the image files, you also tell the system their real-world dimensions (presumably width and height in a physical unit, say centimeters?).
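One way to see why the physical size matters: a single camera frame only gives the image's apparent size in pixels, and a small print held close looks the same as a large poster far away. Knowing the real width resolves that ambiguity. The sketch below uses the textbook pinhole camera relation, not ARKit's actual estimator (which the quoted documentation does not describe); the function name and the example numbers are my own assumptions.

```python
def estimate_distance_m(real_width_m: float,
                        apparent_width_px: float,
                        focal_length_px: float) -> float:
    """Distance to an image facing the camera, under the pinhole model.

    distance = focal_length_px * real_width_m / apparent_width_px
    """
    if real_width_m <= 0 or apparent_width_px <= 0 or focal_length_px <= 0:
        raise ValueError("all inputs must be positive")
    return focal_length_px * real_width_m / apparent_width_px


# Hypothetical numbers: a 9 cm wide business card spanning 300 px
# in a camera whose focal length is 1500 px.
card = estimate_distance_m(0.09, 300.0, 1500.0)

# Same apparent size, but declared as a 90 cm wide poster:
# the estimate lands ten times farther away.
poster = estimate_distance_m(0.90, 300.0, 1500.0)

print(card, poster)
```

The same pixels give very different distances depending on the declared size, which would explain why the platform asks for it up front.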

Once that's done, the system takes it from there and notifies your app when it detects the image, with its position and orientation:

When ARKit detects one of your reference images, the session automatically adds a corresponding ARImageAnchor to its list of anchors. To use the detected image as a trigger for AR content, you’ll need to know its position and orientation, its size, and which reference image it is.
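To make the "which image, where, how big" idea concrete, here is a minimal sketch in plain Python. It is not ARKit code: `DetectedImage`, `position_of` and `content_for` are hypothetical names standing in for the information an image anchor carries, and I'm assuming the pose is a standard 4x4 homogeneous transform whose last column holds the translation.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the data an image anchor carries; not ARKit's own type.
@dataclass
class DetectedImage:
    name: str               # which reference image was matched
    transform: list         # 4x4 row-major homogeneous matrix (pose in world space)
    physical_size_m: tuple  # (width, height) in meters


def position_of(anchor: DetectedImage) -> tuple:
    """Translation part of a homogeneous transform: the last column."""
    t = anchor.transform
    return (t[0][3], t[1][3], t[2][3])


def content_for(anchor: DetectedImage, catalog: dict):
    """Pick content by reference image name; return None for unknown images."""
    content = catalog.get(anchor.name)
    if content is None:
        return None
    return {
        "content": content,
        "position": position_of(anchor),
        "size_m": anchor.physical_size_m,
    }


# Hypothetical usage: a business card detected 0.5 m in front of the origin.
catalog = {"business-card": "contact-details-panel"}
anchor = DetectedImage(
    name="business-card",
    transform=[[1, 0, 0, 0.1],
               [0, 1, 0, 0.0],
               [0, 0, 1, -0.5],
               [0, 0, 0, 1]],
    physical_size_m=(0.09, 0.05),
)
print(content_for(anchor, catalog))
```

The app-side logic stays the same whatever the image is: look up content by name, then place it using the reported pose and size.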

Google's ARCore

Google's "Recognize and Augment Images" guide addresses a similar need:

Augmented Images in ARCore lets you build AR apps that can respond to 2D images, such as posters or product packaging, in the user's environment.

The mechanism is also similar:

You provide a set of reference images, and ARCore tracking tells you where those images are physically located in an AR session, once they are detected in the camera view.

There are certain restrictions that apply:

Each image database can store feature point information for up to 1000 reference images. ARCore can track up to 20 images simultaneously in the environment, but it cannot track multiple instances of the same image. The physical image in the environment must be at least 15cm x 15cm and must be flat (for example, not wrinkled or wrapped around a bottle)
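These limits are easy to encode as a pre-flight check. The sketch below is plain Python rather than ARCore's API: the constants come from the quote above, while `ReferenceImage`, `check_database` and `exceeds_tracking_budget` are hypothetical helpers. Counting distinct names against the limit of 20 is my own reading of "cannot track multiple instances of the same image".

```python
from dataclasses import dataclass

# Limits quoted above from the ARCore documentation.
MAX_IMAGES_PER_DATABASE = 1000
MAX_SIMULTANEOUSLY_TRACKED = 20
MIN_PHYSICAL_SIDE_M = 0.15


@dataclass(frozen=True)
class ReferenceImage:
    name: str
    width_m: float   # physical width of the printed image, in meters
    height_m: float  # physical height of the printed image, in meters


def check_database(images):
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if len(images) > MAX_IMAGES_PER_DATABASE:
        problems.append(
            f"{len(images)} images exceeds the limit of {MAX_IMAGES_PER_DATABASE}"
        )
    for img in images:
        if min(img.width_m, img.height_m) < MIN_PHYSICAL_SIDE_M:
            problems.append(f"{img.name}: physical print smaller than 15cm x 15cm")
    return problems


def exceeds_tracking_budget(visible_names):
    """True if more distinct images are in view than can be tracked at once."""
    return len(set(visible_names)) > MAX_SIMULTANEOUSLY_TRACKED


# Hypothetical usage: a poster passes, a business-card-sized print is flagged.
print(check_database([ReferenceImage("poster", 0.5, 0.7)]))
print(check_database([ReferenceImage("card", 0.09, 0.05)]))
```

Note that a business-card-sized print falls under the quoted 15cm minimum, so that particular use case may need a larger physical target on ARCore.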

But ultimately, the result is the same:

Once tracked, ARCore provides estimates for position, orientation, and physical size. These estimates are continuously refined as ARCore gathers more data.

Interestingly, everything happens on the device, which I'd assume is also true of Apple's implementation:

All tracking happens on the device, so no internet connection is required. Reference images can be updated on-device or over the network without requiring an app update.

This is pretty neat and, I think, a fundamental stepping stone.