I just saw an interesting question on Reddit:

"Why does Quest 3 not allow hand tracking to show your actual hands in passthrough similar to Apple Vision Pro?"

Here is my attempt at answering that (not affiliated with either Apple or Meta, so this is not a definitive answer).

tl;dr: It's too hard to do well on the current hardware, so why waste resources doing it badly?

There are basically two ways of doing it; both require the headset to know exactly which parts of your hands are in front of the virtual content and which are behind it.

Method 1: how Apple does it

Apple has a high-resolution depth sensor in the headset that can determine quite accurately how far away each camera pixel is from the headset (it's not literally down to each pixel, but close enough).

So all they need to do is ignore the pixels that don't belong to your hand, and for each pixel that does, check whether it is closer to the headset than the virtual content or further away, then show either the camera pixel or the pixel from the virtual scene.
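To make that concrete, here is a rough sketch of that per-pixel decision (all the array names and the function are made up for illustration; this is obviously not Apple's actual pipeline):

```python
import numpy as np

def composite_hand_occlusion(camera_rgb, virtual_rgb, sensor_depth, virtual_depth, hand_mask):
    """Per-pixel hand occlusion, illustrative only.

    camera_rgb    -- HxWx3 passthrough camera image
    virtual_rgb   -- HxWx3 rendered virtual scene
    sensor_depth  -- HxW distance (meters) measured by the depth sensor
    virtual_depth -- HxW distance (meters) of the rendered virtual scene
    hand_mask     -- HxW bool, True where a pixel belongs to a hand
    """
    # A hand pixel wins only if it is closer to the headset than the
    # virtual geometry at that pixel; everything else shows the scene.
    show_camera = hand_mask & (sensor_depth < virtual_depth)
    return np.where(show_camera[..., None], camera_rgb, virtual_rgb)
```

In reality this happens every frame with a lot more filtering and on dedicated hardware, but the core decision is that simple comparison.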

Method 2: how some apps on the Quest do it

You can use the normal hand tracking information (just the hand skeleton bones and a rough size) to place a virtual hand model over your real hands, with a material that "punches through" the virtual scene to show the camera image.

This works, but the virtual hand model does not fit every hand all that well, so in places it will be too small or too big, showing camera pixels where it should show the virtual scene or the scene where it should show the camera. Think of someone trying to cut a person out of a magazine and doing a really bad job of it.
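As far as I know, the "punch through" part works because passthrough is composited underneath the app's layer, so a material that writes zero alpha over the hand mesh lets the camera image show through. A toy version of that blend (names made up, not the actual compositor):

```python
import numpy as np

def compose_app_over_passthrough(app_rgba, passthrough_rgb):
    """Toy compositor blend, illustrative only.

    app_rgba        -- HxWx4 rendered app layer; the hand proxy mesh is
                       drawn with a material that writes alpha = 0
    passthrough_rgb -- HxWx3 passthrough camera image underneath
    """
    alpha = app_rgba[..., 3:4]
    # Wherever the app layer is transparent (the punched-out hand
    # shape), the camera image shows through; everywhere else you see
    # the virtual scene.
    return alpha * app_rgba[..., :3] + (1.0 - alpha) * passthrough_rgb
```

Note that there is no real depth test here: the cutout is only as good as the proxy mesh, which is why it looks like the bad magazine cutout.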

You may point out that Quest 3 does have a depth sensor as well and could do something similar to Method 1.

But it is much lower resolution, and any additional processing on the headset takes resources, which costs battery and produces heat. So keeping that active all the time just to get a low-resolution approximation of the hand cutout while draining the battery faster is probably not a good trade-off for Meta.

Looking into the crystal ball

  • I assume Meta has tests running inside Reality Labs that do exactly the same thing visionOS does, but they are not happy with the results

  • In the future they may try to combine the low-resolution depth sensor with standard camera-based segmentation (maybe fed directly from the hand tracking) to "fill in the gaps". There are smart ways to do that, but it also costs processing resources, so who knows (one naive version is sketched below)
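To make that last bullet a bit more concrete, here is one naive way you could picture "filling in the gaps" (purely my own illustration, all names made up, definitely not Meta's actual approach):

```python
import numpy as np

def refine_hand_cutout(low_res_depth, seg_mask, virtual_depth, scale):
    """Combine a coarse depth map with a high-res hand segmentation mask.

    low_res_depth -- (H/scale)x(W/scale) depth from the depth sensor,
                     NaN where there is no valid reading
    seg_mask      -- HxW bool hand mask from camera-based segmentation
    virtual_depth -- HxW depth of the rendered virtual scene
    scale         -- integer upsampling factor
    """
    # Nearest-neighbour upsample the coarse depth to camera resolution.
    depth = np.kron(low_res_depth, np.ones((scale, scale)))
    # Fill invalid readings with the median depth of the valid hand
    # pixels so the mask edges still get a usable depth value.
    valid = seg_mask & np.isfinite(depth)
    fill = np.median(depth[valid]) if valid.any() else np.inf
    depth = np.where(np.isfinite(depth), depth, fill)
    # The segmentation decides which pixels are hand, the fused depth
    # decides whether each of them is in front of the virtual scene.
    return seg_mask & (depth < virtual_depth)
```

A real solution would use something much smarter than nearest-neighbour upsampling (edge-aware filtering, learned fusion, etc.), which is exactly the processing cost mentioned above.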