If I could tell you how to make your AI system do nearly ten times as much work on the same hardware, would that be worth something to you?
Transformers – more than meets the eye
So how can we make our AI ten times more efficient? Well, compression of data can help us store more in a fixed space, so let’s start there. Can compression help?
Popular compression technologies for digital media (think MP3, JPEG or H.265) all rely on the idea that natural signals (sounds, photos, movies) are highly structured and information rich. That structure means the genuinely important information in the signal can be isolated. If we focus on the parts with the greatest information content, we can throw away the rest and get a vastly smaller file. These ‘lossy’ compression technologies can produce files one tenth or less the size of the original – far better than the roughly one-half compression ratio you would be fortunate to achieve with ‘lossless’ methods (think ZIP, or media-specific examples like FLAC).
This apparent magic is achieved using mathematics dating all the way back to 1822, and Joseph Fourier’s realisation that complex waveforms can be described as the sum of simple sinusoids, or harmonics. The Fourier transform (FT) takes a seemingly complex signal and reveals the component frequencies which, if their sinusoids were added back together, would reconstitute the original signal. Components that contribute little to the original signal can then be ignored.
Compression standards like JPEG for images use this approach (in two dimensions, and strictly via a close relative of the FT called the discrete cosine transform) first to reveal the frequencies carrying the real information content, and then to “throw away”, or zero out, any values below a certain threshold. The resulting much-smaller dataset can still be used (via an inverse transform) to reconstitute the original image, or something indistinguishably close to it. Compression of datasets to one tenth or less of their original size is commonplace using these methods. Such is their usefulness for this purpose that ‘fast Fourier transform’ (FFT) routines have been an important aspect of computing for decades. Without them, your iPod could never have played all those MP3 files.
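The thresholding step described above can be sketched in a few lines. This is a simplified illustration, not a real JPEG codec: the coefficient values and the threshold below are made up purely to show the idea of zeroing out small transform coefficients.

```python
# Minimal sketch of the thresholding step used in lossy compression.
# The 'transform coefficients' below are illustrative values, not real JPEG data.
coefficients = [152.0, -0.4, 31.5, 0.02, -0.8, 12.1, 0.001, -45.0, 0.3, 0.05]

THRESHOLD = 1.0  # discard anything with a magnitude below this

# Zero out every coefficient whose magnitude falls below the threshold.
compressed = [c if abs(c) >= THRESHOLD else 0.0 for c in coefficients]

kept = sum(1 for c in compressed if c != 0.0)
print(f"kept {kept} of {len(coefficients)} coefficients")  # kept 4 of 10
```

Only the four large coefficients survive; the six near-zero ones are discarded, and an inverse transform of the surviving values would reconstruct something very close to the original signal.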
Even more effective compression can be obtained using wavelet transforms. In layman’s terms, these are methods of ‘spending’ more of your information budget on higher-frequency components than lower-frequency ones. The logic is that the long wavelength of lower-frequency components means there cannot be many of them in the signal, whereas the shorter wavelength of higher-frequency components means there can be many more. Wavelets will comfortably outperform a simple Fourier transform on compression ratio, albeit at additional processing cost to compress and decompress the file.
Regardless of method, there is a trade-off between compression and quality. To compress more, and output a smaller file, the system simply raises the threshold below which a value is discarded. However, set the threshold too high and quality starts to suffer – your image will lack detail when decompressed, or your MP3 will sound ‘muddy’ and indistinct when played back.
The core idea is that most signals can be represented in ways that leave many values at or very near zero, making them essentially redundant. Such zero-filled representations are called ‘sparse’: only a small percentage of values carry relevant information.
Sparsity in AI
OK – but what’s the connection between lossy compression and AI? An AI is an active system, not just a data store.
Let’s imagine a machine learning system based upon a neural network. It consists of several layers of neurons, and every neuron in each layer is connected to all neurons in the layers immediately before and after. Two things distinguish the neurons from each other:
- first, the strengths, or ‘weights’, of the connections between neurons are not all equal; and
- second, each neuron has a ‘bias’, a pre-determined tendency towards a particular activation level.
It is the carefully optimised pattern of weights and biases that represents a trained network.
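The weighted-sum behaviour described above can be sketched for a single neuron. This is a deliberately minimal model (no activation function) with made-up numbers, just to show how weights multiply incoming activations and how the bias is added.

```python
# Sketch of a single neuron in the network described above.
# Simplified: a weighted sum plus bias, with no activation function applied.
def neuron_output(activations, weights, bias):
    # Each incoming activation is multiplied by its connection's weight,
    # the results are summed, and the neuron's own bias is added.
    return sum(a * w for a, w in zip(activations, weights)) + bias

prev_layer = [0.5, 1.0, 0.2]   # activations from the previous layer
weights = [0.8, -0.3, 0.0]     # per-connection weights (the third is a 'dead' link)
bias = 0.1

print(round(neuron_output(prev_layer, weights, bias), 3))  # 0.2
```

Note that the third connection, with weight zero, contributes nothing at all: its multiplication is pure wasted work, which is exactly the observation sparsity exploits.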
Just as images, sounds or movies are highly structured and information rich, so is a trained AI. The process of training is really a process of optimisation to create information content. In a typical convolutional neural network, after training, the weights of many of the links between nodes will be at or near zero. Only a relatively small percentage will have a value meaningfully above or below zero. Each connection’s weight acts as a multiplier applied to the activation of the neuron in the earlier layer before it contributes to the neuron in the subsequent layer. Any weight at or near zero therefore represents an essentially ‘dead connection’ within the network: any value multiplied by zero is zero, and even a fully activated neuron connected by a very low-weighted connection will contribute only a tiny amount.
Running an AI with all of those zero-weighted links and very low-weighted links is a waste of computational resources. The system ends up carrying out large numbers of calculations which have a negligible effect. The outcome is almost certainly decided by the signal passing through the links with significant weights.
Once trained, the AI can be run far more efficiently in production if all very low-weight connections are simply zeroed out and ignored altogether. This might reduce the number of links between neurons to a tiny fraction of those at the outset, perhaps one tenth or fewer. With one tenth of the links to process, running the AI takes roughly one tenth of the computing power – or, put another way, your current hardware can now do ten times as much.
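This pruning step is, at heart, the same thresholding operation as in compression, applied to a weight matrix. The sketch below uses a randomly generated, illustrative weight distribution (real trained networks have their own distributions) to show how magnitude pruning zeroes out low-weight connections and how few survive.

```python
import random

random.seed(0)  # deterministic illustration

# An illustrative 100x100 weight matrix; the normal distribution here is
# made up for demonstration, not taken from any real trained network.
weights = [[random.gauss(0.0, 0.1) for _ in range(100)] for _ in range(100)]

THRESHOLD = 0.15  # connections weaker than this are treated as 'dead'

# Magnitude pruning: zero out every weight below the threshold.
pruned = [[w if abs(w) >= THRESHOLD else 0.0 for w in row] for row in weights]

total = sum(len(row) for row in pruned)
nonzero = sum(1 for row in pruned for w in row if w != 0.0)
print(f"{nonzero}/{total} connections survive pruning")
```

In a production deployment the zeroed entries would not merely be stored as zeros: a sparse matrix format (such as compressed sparse row) skips them entirely, so the corresponding multiplications are never performed at all – which is where the computational saving comes from.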
Consider just one example – an insurer running a machine-learning driven fraud detection system. The original (non-sparse) system requires ten high power servers to process the required volume of transactions. Taking the same AI model and pruning it to be sparse by removing zero and very-near-zero values results in a model that produces apparently indistinguishable results, but now only requires one server to process the workload. Since the infrastructure running the model is procured under an Infrastructure-as-a-Service cloud contract, the month-on-month savings resulting from deploying the sparse model are significant.
Crossing the threshold – where sparsity intersects with regulation
Just as with compression, setting too aggressive a threshold when making a previously-trained AI sparse can compromise its behaviour – in this case, producing a greater tendency for the AI system to make incorrect classifications.
It is clear that transparency and explainability are key principles that will underpin much of the forthcoming regulation of AI. In the EU’s recently-published draft AI Regulation, the obligations attaching to any ‘high-risk’ AI include strict requirements for transparency-by-design, effective human oversight and that an AI is ‘accurate, robust and secure’ (see Chapter 2 (Articles 8-15) of the AI Regulation – more detail can be found in DLA Piper’s AI Regulation Handbook here).
There is an obvious commercial imperative for operators to make AIs as sparse as possible. A sparser system takes less computing power to run, and can therefore deliver a greater return on investment for customers. This could lead to some zealous approaches: over-pruning, or setting the threshold as high as possible.
It isn’t difficult to see how an overly-aggressive attempt to make a trained AI sparse could have serious consequences. Setting the threshold too high will remove links that could affect the AI’s decisions – especially in edge cases. Whilst the effect of each very low-value weight may be individually tiny, with thousands of neurons in each layer the collective impact can become significant. The resulting (over-)sparse AI then produces significantly different results for the same input from those produced by the trained AI with all weights and links intact. If that happens, any material created to meet regulatory obligations of transparency-by-design, accuracy and so on would be false. Similarly, the resulting human oversight would be at best less effective, and potentially ineffective altogether.
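The point about collective impact can be made concrete with a little arithmetic. The figures below are purely illustrative: a few thousand connections that each contribute almost nothing, but which together shift a neuron’s input materially – and which over-aggressive pruning would silently remove.

```python
# Sketch of how individually negligible weights can matter collectively.
# All numbers are illustrative, not drawn from any real network.
n_small = 2000          # number of very low-weight connections
small_weight = 0.001    # weight of each such connection
activation = 1.0        # assume each upstream neuron is fully activated

# Each link alone contributes almost nothing...
single_contribution = small_weight * activation

# ...but together they shift the neuron's input by a full 2.0,
# a contribution that pruning all of them would silently delete.
collective = sum(small_weight * activation for _ in range(n_small))
print(round(collective, 6))  # 2.0
```

A threshold that discards every one of these weights as ‘negligible’ could therefore push the neuron – and, in edge cases, the overall classification – to a different outcome than the fully-trained model would have produced.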
Returning to the proposed legal frameworks, legislators appear determined to ensure regulations in this space have real teeth. The fines for the most serious breaches of the draft EU AI Regulation are set at 6% of group turnover, although for many contraventions of the rules for high-risk AI the upper limit would be a still-sizable 4% of group turnover. With fines of that magnitude a potential consequence if over-pruning compromises the AI, operators will need to ensure they very carefully assess when a sparse AI becomes too sparse.
Sparse design and training
Some of these thresholding issues might be resolved if systems could be designed to be sparse from the outset. Training the full network first, and then applying a threshold to make it sparse once trained, creates a more efficient AI at run time, but does nothing to reduce the computational burden during the training phase – and training, not inference, is by far the more processor-intensive activity. As methods are found to determine at the outset which links will be important, sparse networks can be created during the training phase itself. This will enable huge gains in efficiency.
Whilst (at the time of writing) we’re still at a relatively early stage with sparse training methods (Google’s RigL (Rigging the Lottery) method being a notable example), it will be interesting to see how this develops. At its best, sparse training could help to reduce the regulatory risks associated with aggressive pruning / thresholding. On the other hand, with current methods of sparse training often dependent on multiple mid-training pruning activities, these methods could potentially create additional challenges from an accuracy, transparency and explainability perspective.
Join the Conversation
We’ll be discussing the implications of this exciting time in AI and hardware at the European Technology Summit in our ‘Hardware Renaissance’ panel. To find out more and register to attend the summit, visit the event website.
You can find more views from the DLA Piper team on the topics of hardware, AI, systems integration and the related legal issues on our blog Technology’s Legal Edge.
If you’d like to discuss any of the issues discussed in this article, get in touch with Gareth Stokes or your usual DLA Piper contact.