#Apple filed a new patent application called “FILE FORMAT FOR SPATIAL AUDIO” (US Patent Application 20200288258). Here is a bit of my breakdown (while holding my thoughts on its innovative merits 🙃) #SpatialAudio (thread)
As noted earlier, Apple doesn’t want any part of the current nomenclature, and is referring to the entire gamut of various “Realities” as SR (Simulated Reality)
In general terms, the application describes what is essentially a new flavor of an object-based audio format, and how this format can be stored, edited, and reproduced.

There will be a sound asset library, where each sound is composed of the actual SOUND DATA plus METADATA.
SOUND DATA can be mono, multi-channel, another spatial audio format like ambisonics, or “synthesized audio data for producing one or more sounds”.
I found this last point especially interesting, since it leaves the door open for real-time synthesis, which in my opinion is the biggest breakthrough still required in the audio industry.
METADATA can contain some of the following:
- attenuation (distance-based and listener-angle-based)
- directionality (from common directivity patterns to totally arbitrary ones)
- transformation controls for modifying sound properties (RTPCs for my game audio peeps)
Metadata can describe the entire sound, but also each discrete sound channel - this is useful for virtually reproducing something like a 5.1 setup, where the audio file is a single multi-channel file, but each channel is positioned on its own virtual speaker in space.
Metadata can also describe the sound at the time of recording: position, rotation, SPL and the distance at which this SPL was measured, size and/or shape of the sound source (using a polygonal mesh or a volumetric size), and the microphone configuration at the time of sound capture.
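To make all of that a bit more concrete, here's a rough sketch of how such an asset might be modeled. To be clear: every type and property name below is my own invention for illustration; the patent describes the data, not an API.

```swift
import simd

// Hypothetical model of a spatial audio asset as described in the application.
// All names here are my own guesses, not from the patent.
struct SpatialAudioAsset {
    var soundData: SoundData
    var assetMetadata: Metadata       // describes the sound as a whole
    var channelMetadata: [Metadata]   // optional per-channel overrides, e.g. one
                                      // virtual speaker per channel of a 5.1 file
}

enum SoundData {
    case mono(samples: [Float])
    case multichannel(channels: [[Float]])
    case ambisonics(order: Int, channels: [[Float]])
    case synthesized(parameters: [String: Float])   // a real-time synthesis recipe
}

struct Metadata {
    // Attenuation: distance-based and listener-angle-based curves
    var distanceAttenuationCurve: [(distance: Float, gainDB: Float)]
    var angleAttenuationCurve: [(angleDegrees: Float, gainDB: Float)]

    // Directionality: a named pattern or a totally arbitrary one
    var directivity: Directivity

    // Transformation controls (RTPC-style parameter ranges)
    var transformations: [String: ClosedRange<Float>]

    // Capture-time information
    var position: SIMD3<Float>?
    var rotation: simd_quatf?
    var measuredSPL: Float?             // in dB SPL
    var splMeasurementDistance: Float?  // distance at which SPL was measured
    var shape: SourceShape?
    var microphoneConfiguration: String?
}

enum Directivity {
    case omnidirectional
    case cardioid
    case arbitrary(pattern: [Float])    // gain per sampled direction
}

enum SourceShape {
    case point
    case sphere(radius: Float)
    case mesh(vertices: [SIMD3<Float>], indices: [UInt32])
}
```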
OK, this is where we’re getting into interesting stuff…
Why would Apple want the SPL data?
The holy grail of putting virtual objects into the real world is to make them indistinguishable from real objects in the way they look, sound, and behave (see the recent addition of reflection to ARKit).
This also means a virtual person needs to be no louder or quieter than a real person standing right next to them. This is quite a hard problem to solve, because never before has audio needed to precisely match real-world SPL levels.
But storing this information with the metadata is a core building block of the solution. So if I record a sound that was 10 ft away, and it measured 60 dB SPL, an algorithm could estimate how the same sound would appear at 1 ft or 100 ft.
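A minimal sketch of that estimate, assuming simple free-field inverse-square spreading (SPL drops by about 6 dB per doubling of distance). The patent doesn't spell out the algorithm; this is just the textbook math, with names of my choosing:

```swift
import Foundation

/// Estimates SPL at a target distance from a reference measurement,
/// assuming free-field inverse-square spreading:
/// SPL drops by 20 * log10(d2 / d1) when moving from distance d1 to d2.
func estimatedSPL(referenceSPL: Float,
                  referenceDistance: Float,
                  targetDistance: Float) -> Float {
    return referenceSPL - 20 * log10(targetDistance / referenceDistance)
}

// The example from above: 60 dB SPL measured at 10 ft.
let atOneFoot    = estimatedSPL(referenceSPL: 60, referenceDistance: 10, targetDistance: 1)   // ≈ 80 dB
let atHundredFt  = estimatedSPL(referenceSPL: 60, referenceDistance: 10, targetDistance: 100) // ≈ 40 dB
```

Real sources in real rooms won't follow the ideal curve exactly, which is presumably why the directivity and shape metadata matter too.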
The other interesting thing is the volumetric information. Since the dawn of time… (ok maybe not for that long), modern spatial audio engines have treated every mono sound as a point source when it comes to reproducing the actual direct sound.
I’m not talking about attenuation, but about defining the actual position of the sound in space. This might be wishful thinking on my part, but if Apple cracked the nut of accurately reproducing a volumetric sound source… well - I’d love to hear it.
Then the application proceeds to describe an authoring environment, where the sound data and metadata can be further manipulated and saved back into the library, and finally how this information is used during application playback. What stood out for me here is this bit:
An application can write usage information back into the sound library, so that “a historical record of use of the audio asset in any one or more SR applications is maintained [...] for a developer to know where the sound of the audio asset was previously used”.
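A sketch of what such a usage record might look like. This is purely my guess at the shape of the data; the patent doesn't define a schema:

```swift
import Foundation

// Hypothetical usage record written back into the sound library,
// giving the asset creator a history of where the sound was used.
struct AssetUsageRecord: Codable {
    let assetID: UUID
    let applicationID: String      // which SR application used the asset
    let sceneIdentifier: String?   // where in the experience it was played
    let timestamp: Date
}

// Appending to a per-asset history kept alongside the asset in the library.
var usageHistory: [AssetUsageRecord] = []
usageHistory.append(AssetUsageRecord(assetID: UUID(),
                                     applicationID: "com.example.sr-app",
                                     sceneIdentifier: "lobby",
                                     timestamp: Date()))
```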
Hard to tell what the exact play is here, but it almost looks like it’s opening the door for some type of “asset store” or “stock library” setup, where audio asset creators can track and monetize usage.
To conclude: I’m not surprised that Apple is rolling their own object-based audio format. It certainly saves them the cost of licensing something like Dolby Atmos or MPEG-H, and gives them control to tailor the format to their ecosystem and use cases.
This is all well and good, but what I would REALLY like to see is a move to properly integrate audio into visual 3D formats (and no, a small handful of audio properties in USD ain’t it). (end thread)
You can follow @AnastasiaDevana.