On 2nd February 2024 the House of Lords Communications and Digital Committee (HoL) published its report on large language models (LLMs).
That report covered a variety of topics, with attention on two in particular:
- what the HoL refers to as the ‘Goldilocks problem’ – or the challenges of getting the balance of investment and risk just right, especially in the context of open and closed AI models; and
- liability and copyright issues arising from development, training and use of LLMs.
This article is the second in a two-part series. The first looked at the ‘Goldilocks problem’; in this article we take a deeper dive into the liability and copyright issues surrounding LLMs, as examined by the HoL, and where the UK government currently stands in addressing them.
Liability
In Chapter 7 of the report, the HoL considers where liability should sit within the AI supply chain, as between upstream developers (those that build the foundation model) and downstream actors (those that modify or use the foundation model for a specific use case).
The UK currently tasks sector regulators with using their existing powers to hold players in the AI supply chain accountable. Having canvassed their views, the HoL highlights the different approaches being taken by different regulators. For instance, the ICO looks at the entire value chain, while Ofcom tends to concentrate on downstream actors.
The HoL found that it’s common ground amongst regulators that holding actors to account for LLM development and deployment isn’t easy. But what is it that makes it so difficult, and what more could the UK do to get it right?
Many hands make hard work
The HoL report immediately mentions the “many hands” problem that regulators are faced with – i.e. the involvement of many different actors in the complex AI supply chain. The LLM supply chain has the capacity to compound this problem, because one model can be fine-tuned for a wide array of applications. As the Ada Lovelace Institute points out[1], this means that:
- Upstream providers are less able to envision all potential downstream use-cases at the build stage.
- The UK’s sectoral approach could mean that many different sector-specific regulators are responsible for a given model’s use (which could result in a single model being regulated differently across different sectors).
- A single mistake or vulnerability introduced by the upstream developer could create a domino effect for all downstream actors.
The complexity is further amplified by open access models, where upstream developer control is naturally much reduced. If the UK is to pursue a healthy balance between open and closed models, as recommended by the HoL and discussed in the first article of this series, appreciating the nuances of regulating these two different release strategies becomes key.
Upstream v. downstream / open v. closed
The HoL notes that upstream developers have the greatest insight into, and control over, the foundation model, which makes them a natural starting point in the supply chain when it comes to accountability. This becomes more persuasive when considering the current state of the market, with market share concentrated in just a handful of proprietary AI providers whose models are being used by downstream actors across the globe. That concentration gives those providers greater leverage when securing robust contractual terms with their customers, and it should be borne in mind when deciding which actors to hold to account.
In the report, the HoL emphasises its view that “extensive primary legislation aimed solely at LLMs is not currently appropriate” given the frontier nature of LLMs and generative AI more broadly, citing the risk of stifling innovation. Recently we have seen astonishing leaps forward, with impressive text-to-video generative AI models and LLMs with far larger context windows announced by the large technology vendors, demonstrating how proprietary AI models are often at the cutting edge in moving the state of the art forward. Grand medium to long-term predictions for the social and economic impacts of AI technology are already being realised, and it will be interesting to see the impact the EU’s AI Act has in the realm of LLMs.
Despite the HoL’s warning, much can be learnt from the AI Act and its approach to regulating upstream providers. In particular, increased transparency from upstream providers – such as the provision of data sheets and risk assessments – is crucial to enable downstream actors to understand the model and the risks that could flow from a particular use case. This would mitigate the concern raised by Dr Zoë Webster, Director of Data and AI Solutions at BT, who was quoted in the report warning that downstream actors could be held accountable “for issues with a foundation model where we have no idea what data it was trained on, how it was tested and what the limitations are on how and when it can be used”.
But these responsibilities shouldn’t end with the upstream provider. Transparency only works if it permeates the entire LLM value chain. This could be achieved by imposing an obligation on downstream actors to request the information from upstream providers, and matching obligations to pay it forward further down the value chain. It also shouldn’t be viewed as a ‘once-and-done’ responsibility, but a continuous cycle of information flow as risks become clear or new risks emerge.
For true open-source LLMs, a degree of transparency is already baked into the concept. Whilst calls for open source to be all but excluded from regulatory oversight probably go too far, the benefits of open source in enabling the UK to compete with the large providers should be at the forefront of regulators’ minds. Worth bearing in mind in all this is that open-source AI isn’t necessarily significantly more transparent or explainable than closed AI models. In contrast to ‘open source’ software, where access to the source code allows the underlying logic to be examined, ‘open’ AI tends simply to mean that a set of weights representing the AI model can be downloaded. Being able to examine those weights does not make the model’s behaviour appreciably more transparent or explicable. Even so, there is still greater transparency in the open source / open weights AI domain, so it seems likely that this is an area where the downstream actors – who use open source to commercialise their own models – will be required to take more responsibility.
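To illustrate the distinction, the following is a minimal, hedged sketch of our own (not drawn from the HoL report) of what ‘open weights’ access typically looks like in practice: the downloadable artefact is a collection of numerical parameters rather than human-readable logic. The model name and the Hugging Face transformers library used here are illustrative assumptions, not anything the report refers to.

```python
# Minimal sketch: downloading openly published weights yields arrays of numbers,
# not source logic that explains any individual output.
# Assumes the Hugging Face 'transformers' and 'torch' packages are installed;
# "gpt2" is simply one example of a model with publicly downloadable weights.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # fetch the published weights

# Inspect a few of the "open" parameters: each is just a tensor of floats.
for name, param in list(model.named_parameters())[:3]:
    print(name, tuple(param.shape))
# e.g. transformer.wte.weight with shape roughly (50257, 768): tens of millions of
# numbers whose individual values say little about why the model behaves as it does.
```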
Next steps for the UK
In an environment where the state of the art is rapidly pushing forward, the HoL’s concrete and achievable recommendations in this space are welcome.
While acknowledging that upstream developers should probably take some responsibility, particularly in the arena of information disclosure, the HoL’s main recommendation in the liability space is to engage the Law Commission to review liability across the entire value chain. With the Law Commissions’ excellent work in the realm of autonomous vehicles[2] still fresh in the memory, a similar approach for foundation models more generally would be very welcome.
Away from legal theory, the HoL also recommends:
- introducing standardised powers for main regulators who are expected to lead on AI oversight, ensuring that these powers are buttressed by meaningful sanctions; and
- hastening the Government’s efforts to establish a central support function to assist in co-ordinating regulators’ efforts, and producing cross-sector guidance on AI issues that could fall outside of individual sector remits.
What’s clear from the Government’s consultation response to its March 2023 pro-innovation approach to AI regulation, released just days after the HoL report, is that it could be some time before more detailed guidance is available and before the central support function is properly up and running. There was no mention of a Law Commission review and there remains no near-term plan to introduce primary legislation that specifically regulates AI.
Copyright
The use of copyright material to train LLMs has been hotly debated worldwide. Rights holders consider the use of copyrighted data to train models without permission to be both unlawful and unethical. Developers, inevitably, disagree, arguing that AI systems “would simply collapse” if they did not have a legal exemption enabling them to train models on human-generated content; part of their argument rests, for example, on the societal value of LLMs[3].
There is a wider philosophical question to consider. Human artists are free to be ‘inspired’ by works they encounter (provided that inspiration does not stray into outright copying). Some have argued that the mechanisms of training transformer-based or diffusion-based generative AI models are, at their deepest level, no different to exposing an art student to a wide range of examples as part of their training. However, because such activity is likely to involve making copies of (at least fragments of) many works, the legal treatment of AI training may be very different despite such philosophical positioning.
Data mining
The HoL explored various arguments around copyright questions in Chapter 8 of the report. It first considered the issue of text and data mining (TDM), which involves the use of extensive datasets to discern patterns and trends for training AI. Currently, this activity requires a licence unless an exemption applies (e.g. non-commercial research), and there are cases before the English courts which will no doubt test the boundaries of such exemptions. In 2022, the Intellectual Property Office considered allowing commercial data mining through a broad copyright exception, but this was abandoned due to concerns about undermining the business model of creative industries that rely on copyright to protect their works. Instead, the Government initiated a working group to create a new code of practice, targeted for completion by the summer of 2023. Since then, no further progress has been made, and it was very recently confirmed in the Government’s white paper consultation response that plans to publish a voluntary code had been shelved. This was seen as a major blow to the creative industries, which had been looking for more certainty.
So, what is the current state of play? The Government has emphasised its continued commitment “to promote and reward investment in creativity” and to ensure rightsholder content is “appropriately protected”, while also supporting AI innovation. Given the HoL report’s warning that this debate cannot continue indefinitely, we expect legislative change will be required to resolve the dispute.
Holding / copying data
The next point of consideration for the HoL was the extent to which LLMs are in fact “holding” a set of copyrighted works, given that texts from books and articles are transformed into billions of numerical parameters (the ‘weights’ of the links between neurons, and the ‘biases’ applied to neurons, in a neural network). It was commented that the resulting model holds only statistical representations of the original data and that tracing the origin of a specific word back to a source is technically impossible. Nonetheless, the act of extracting data from websites and moving it to processing platforms to train a model might create at least temporary copies which constitute infringement.
There is still no consensus on whether this temporary copying is an exempt activity under the Copyright, Designs and Patents Act 1988. The Act excludes from infringement the creation of transient or incidental copies, but expert commentary suggests this would not extend to LLMs given the narrowness of the defences. While scholars will likely continue to debate this matter, the HoL has indicated disappointment that the Government did not articulate its current legal understanding, preferring instead to wait for sufficient case law. As such, again, it is clear that legislation needs to evolve to address the realities facing copyright owners in the era of AI. We agree that copyright law should be adaptable and future-oriented, focussing on the “why”, not the “how”.
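The ‘statistical representation’ point above can be made more concrete with a short illustrative sketch of our own (not drawn from the report): before any training takes place, a sentence is reduced to a sequence of integer token IDs, and what the trained model ultimately retains are numerical parameters adjusted across vast quantities of such sequences, rather than the text itself. The tokenizer named below is simply a publicly available example.

```python
# Illustrative sketch only: a sentence becomes a list of integer token IDs before
# it ever reaches a model; training then nudges numerical parameters in response
# to many such sequences, rather than storing the sentence itself.
# Assumes the Hugging Face 'transformers' package; "gpt2" is an example tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer.encode("The quick brown fox jumps over the lazy dog.")
print(token_ids)  # a list of integers (token IDs), not the original words
```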
Further complicating matters, there are other forms of training or prompting generative AI models that could present a far clearer-cut case of infringement, and as such it becomes difficult to suggest that ‘activity X to train an AI is never infringing’ or ‘activity Y is always infringing’. Many cases will be fact-specific, meaning it could be some time before a sufficiently broad coverage of case law provides a reasonably definitive guide to what activities are or are not permitted when using works to train an AI.
Potential solution(s) / way forward
Overall, the HoL explained that, while LLMs may provide significant societal benefits, this should not justify infringing copyright or undermining its fundamental principles. To ensure that companies do not exploit rightsholder data for commercial gain without obtaining a licence and compensating copyright owners, the HoL report considered various approaches:
- expanding existing licensing systems (noting content aggregators already run businesses which reportedly offer access to trillions of words) and developing new, commercially attractive high-quality curated datasets at the scale required for LLM training;
- restricting the use of “products built upon the infringement of UK creators’ rights” (particularly in the public sector space where the Government has more control, albeit there is a question as to how practicable this is to enforce); and
- requiring “developers to maintain records, which can be accessed by rightsholders”. In other words, LLM vendors would be required to deploy ‘transparency-by-design’, so that proper data management is integral from the beginning, thus giving rightsholders an ability to understand what an LLM has used or at least had access to.
UK at a crossroads
Whilst it will be difficult to get right, there is no doubting that the UK has a great opportunity to forge its own path when it comes to LLMs and generative AI more generally. Its recent focus on safety is commendable, but this needs to be coupled with innovation for the UK to keep its seat at the table. Much like with open and closed models, the key will be striking a balance between equipping regulators with the tools they need to hold those with the most control to account and fostering that innovation – all whilst ensuring the UK’s creative (human) industries continue to thrive too.
Next Steps
If you’ve not read part 1 of this series, where we considered the ‘Goldilocks problem’ and the wider challenge of open and closed models, you can read that here.
You can find more views from the DLA Piper team on the topics of AI systems, regulation and the related legal issues on our blog, Technology’s Legal Edge.
If your organisation is deploying AI solutions, you can download DLA Piper’s AI Act App and our AI Report, a survey of real-world AI use.
If you’d like to discuss any of the issues discussed in this article, get in touch with Kurt Davies, Linzi Penman, Huw Cookson, Gareth Stokes, Parman Dhillon, or your usual DLA Piper contact.
[1] Ian Brown, ‘Allocating accountability in AI supply chains: a UK-centred regulatory perspective’, Ada Lovelace Institute (June 2023), adalovelaceinstitute.org.
[2] Law Commission of England and Wales and Scottish Law Commission, ‘Automated vehicles: joint report’ (26 January 2022).
[3] Written evidence from the Society of Authors (LLM0044)