MPEG Standard for Efficiently Representing AI Models

Discover how the MPEG Standard revolutionizes neural network compression, enabling efficient representation of AI models for multimedia tasks. Explore the intersection of AI and multimedia with XReco in our latest blog article!

What is MPEG?

MPEG, the Moving Picture Experts Group, is an ISO/IEC standardisation group well known for its multimedia coding standards, which are widely used for video and audio content. Responding to the trend of adopting AI methods based on neural networks for many multimedia tasks, MPEG has been working on a standard for coding neural networks over the past few years. The working group behind it, chaired by Werner Bailer from JOANNEUM RESEARCH, recently published the second edition of this standard [1]. XReco uses a number of AI-based methods, for example for content analysis and 3D reconstruction. Neural network compression enables more efficient storage and transmission of the trained networks for these methods.

Why do we need neural network coding?

Artificial neural networks, which aim to mimic biological networks of neurons, have been known for decades. However, the ability to scale them to many layers (i.e., making them “deep neural networks”), together with the availability of massive amounts of data, has enabled the recent success of AI. The description of a network consists of its topology, i.e. which types of layers are used and how they are connected, and its parameters. The topology is static and can be described quite compactly. The parameters are what is modified while training the network, and with many layers and potentially large layer sizes (e.g. when using a high-resolution image with three colour channels as input) they take up considerable storage space.

The network parameters may be weights (i.e. values to be multiplied with the input value at a particular position), biases (values to be added) and some more specific kinds for certain layer types. The parameters of each layer are represented as tensors, a tensor being a generalisation of a matrix to higher dimensions. Think of a matrix as a rectangular grid of numbers, then extend this to a box to get a 3D tensor (and possibly add further dimensions). The numbers are typically floating-point values, because the training process needs to cover wide value ranges with high accuracy.
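To make the notion of parameter tensors a bit more concrete, here is a minimal sketch that counts the values in a few tensors and estimates their storage size. The layer names and shapes are hypothetical and chosen only for illustration; they are not taken from any XReco model or from the standard. The sketch assumes each value is stored as a 32-bit float.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: parameters stored as 32-bit floats

# Hypothetical layer shapes, for illustration only.
layers = {
    "conv weights (out_ch, in_ch, kernel_h, kernel_w)": (64, 3, 7, 7),  # 4D tensor
    "conv biases (out_ch)": (64,),                                      # 1D tensor
    "dense weights (outputs, inputs)": (1000, 2048),                    # 2D tensor (matrix)
    "dense biases (outputs)": (1000,),
}


def num_elements(shape):
    """Number of values in a tensor is the product of its dimensions."""
    return math.prod(shape)


total = 0
for name, shape in layers.items():
    count = num_elements(shape)
    total += count
    print(f"{name}: shape {shape}, {count:,} values")

total_bytes = total * BYTES_PER_FLOAT32
print(f"total: {total:,} parameters, {total_bytes / 1e6:.1f} MB uncompressed")
```

Even these two small layers hold about two million values; networks with dozens of layers quickly reach many millions of parameters.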

However, not all numbers in the parameter tensors are equally meaningful. Some have such small values that treating them as zero changes little. Others could be represented with lower precision, or there may be repeated patterns of values in a neighbourhood. This is where neural network coding comes in: making use of redundancy in the parameter tensors in order to represent them more efficiently.

What does the MPEG NNC standard specify?

The NNC standard focuses on the compression of parameters of trained neural networks by specifying a toolbox of compression tools, from which a subset can be chosen for a particular application. These tools include parameter reduction tools, such as pruning (i.e. removing sets of parameters with low contribution), parameter quantisation tools (i.e. representing values on a coarser scale) and entropy coding tools (i.e. representing more frequent values or sequences of values with a more compact code).
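The three kinds of tools can be illustrated with a toy sketch. This is not the codec defined by the standard: the random "parameters", the threshold and the step size are made-up assumptions, pruning is done by a simple magnitude threshold, quantisation is uniform, and instead of a real entropy coder the sketch only computes the Shannon entropy, which estimates how many bits per value an ideal entropy coder would need if values were coded independently.

```python
import math
import random
from collections import Counter

random.seed(42)
# Hypothetical stand-in for a trained parameter tensor, flattened to a list.
params = [random.gauss(0.0, 0.1) for _ in range(10_000)]


def prune(values, threshold):
    """Parameter reduction: set low-magnitude values to zero."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def quantise(values, step):
    """Uniform quantisation: map each value to an integer index on a coarser scale."""
    return [round(v / step) for v in values]


def dequantise(indices, step):
    """Map integer indices back to approximate parameter values."""
    return [i * step for i in indices]


def entropy_bits(symbols):
    """Shannon entropy in bits per symbol, assuming independently coded symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


THRESHOLD = 0.05  # illustrative pruning threshold
STEP = 0.02       # illustrative quantisation step

indices = quantise(prune(params, THRESHOLD), STEP)
restored = dequantise(indices, STEP)

zero_fraction = indices.count(0) / len(indices)
coded_bits = entropy_bits(indices)
max_error = max(abs(a - b) for a, b in zip(params, restored))

print(f"values set to zero: {zero_fraction:.0%}")
print(f"estimated bits per value: {coded_bits:.2f} (vs. 32 for float32)")
print(f"largest change to any parameter: {max_error:.3f}")
```

The threshold and step size control the trade-off: coarser settings give fewer bits per value but change the parameters more, which is why the impact on the network's accuracy has to be checked for each application.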

While the first edition of the standard, published in 2022, focused on coding complete neural networks, the second edition adds support for coding incremental updates efficiently. For example, let’s assume you have an object recognition app on your mobile phone, and the model is trained further to improve performance. The updated model is likely to have similarities in the parameter structure with the previous version, so the differential update can be coded more efficiently than the complete model. Your app update would contain the coded model update, which is decoded and added to the base model you already have. This feature is particularly useful in cases where many similar versions of a neural network need to be sent back and forth, for example, in federated learning, where a model is collaboratively trained by a set of computing nodes.
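The idea behind differential updates can be sketched in a few lines. Everything here is an assumption for illustration: the base and updated "models" are random numbers, the step size is arbitrary, and Python's zlib serves as a generic stand-in for the entropy coding tools of the standard. The point is only that the difference between two similar models has a much smaller value range than the models themselves and therefore compresses better.

```python
import random
import struct
import zlib

random.seed(7)
# Hypothetical base model and a further-trained version that differs only slightly.
base = [random.gauss(0.0, 0.1) for _ in range(5000)]
updated = [b + random.gauss(0.0, 0.002) for b in base]

STEP = 0.001  # illustrative quantisation step


def encode(values, step=STEP):
    """Quantise to 16-bit integers and compress with zlib (stand-in entropy coder)."""
    indices = [round(v / step) for v in values]
    return zlib.compress(struct.pack(f"<{len(indices)}h", *indices))


def decode(blob, step=STEP):
    """Inverse of encode(): decompress and map the integers back to values."""
    raw = zlib.decompress(blob)
    indices = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [i * step for i in indices]


# Option 1: send the complete updated model.
full_blob = encode(updated)

# Option 2: send only the difference to the base model the client already has.
delta_blob = encode([u - b for u, b in zip(updated, base)])

# On the client: decode the update and add it to the base model.
reconstructed = [b + d for b, d in zip(base, decode(delta_blob))]
max_error = max(abs(r - u) for r, u in zip(reconstructed, updated))

print(f"complete model: {len(full_blob)} bytes")
print(f"differential update: {len(delta_blob)} bytes")
print(f"largest reconstruction error: {max_error:.6f}")
```

In a federated learning setting, the same exchange would happen in every training round, so the savings per update add up.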

The NNC standard has been tested on a wide range of models and is typically able to compress a neural network to 10-20% of the original size without performance loss. If an application can afford slight degradation in performance, much higher compression rates are possible. More details can be found in the NNC White Paper [2] and a report on applying the standard to different types of neural networks, including large language models (LLMs) [3]. An open-source implementation covering the first edition of the standard is available [4].

In an ecosystem approach, as envisaged by XReco, ensuring the interoperability between components provided by different organisations is crucial. Standards provide the necessary specification to establish this interoperability.

Why is this relevant for multimedia applications, including XR?

There are a number of multimedia applications that make use of neural networks, for example, for content analysis (e.g. classification, object detection), content improvement (e.g., super resolution) or content representation (e.g., image compression). In many cases a trained neural network needs to be sent to other devices where it is used. If the number of parameters of a neural network can be reduced, and parameters can be represented with lower precision, this also speeds up inference with the neural network. This is of course particularly relevant on devices with limited computing power and where energy consumption is a concern.

Why is this particularly relevant for XR?

XR applications require 3D representations of objects and scenes. Unlike traditional representations such as point clouds or meshes, recent scene representations, such as Neural Radiance Fields (NeRFs) or Gaussian Splatting, are learned from the data. Like a neural network, such a scene representation is a set of parameters that have been trained to represent the scene, so that the pixel values of views can be derived from them. Thus, the same methods used for compressing neural networks can be applied to compressing these scene representations, which are also a core technology used in XReco.

Conclusion

The MPEG NNC standard provides a toolbox for compressing many types of neural networks and has been tested for many multimedia tasks. The recently completed second edition adds support for efficiently coding incremental updates, which is particularly useful for sending new versions of trained models to a number of clients or in federated learning scenarios. Emerging types of 3D scene representation, such as NeRFs and Gaussian Splatting, also make this a relevant technology for representing content for XR applications.

References

[1] ISO/IEC 15938-17:2024, Information technology – Multimedia content description interface – Part 17: Compression of neural networks for multimedia content description and analysis, https://www.iso.org/standard/85545.html

[2] White Paper on Neural Network Coding, https://www.mpeg.org/wp-content/uploads/mpeg_meetings/145_OnLine/w23564.zip

[3] ISO/IEC JTC1 SC29 WG04 N0450, “Application and Verification of NNC in Different Use Cases,” Jan. 2024, https://www.mpeg.org/wp-content/uploads/mpeg_meetings/145_OnLine/w23554.zip

[4] NNCodec, open-source implementation, https://github.com/fraunhoferhhi/nncodec