Indian Copyright Law and Generative AI- Part 5: Right to exclude access?

In a host of the cases filed in the US against Gen AI developers, one of the primary allegations concerns unauthorized access to, and use of, publicly available works for training. The allegations concern scraping of publicly available works, which are protected by copyright, to extract meta-information that is used to train Gen AI models and make them capable of responding to prompts. Note that the issue is not access circumvention by jumping paywalls, but access to and use of publicly available works (which is not the same as works in the "public domain", i.e., works no longer protected by copyright law).

The further allegation is that, while a person buys a book from a store to read and enjoy it, Gen AI models have scraped publicly available content and stored it without paying a penny, and have used it for training in order to outcompete the producers of the very works they were trained on.

A simple analogy helps explain the issue better:

    Imagine that, while reading this article on LawPickle, you press Ctrl+P on your laptop to save a copy of the article (as a PDF). You then read the article later and learn from it. Imagine you do the same for a host of articles you find across the web, on publicly accessible platforms. [You do not distribute or communicate your copies of the articles to anyone else.] You then write an article on the same issue, incorporating various arguments you read in these pieces, without incorporating any substantially similar expression. You amalgamate the ideas or the underlying content you learned from those articles and articulate them in your own words. Would your access be unauthorized, your use for training yourself be illegal, and your expression, i.e., your output, be infringing?

Gen AI technology does this on a large scale, much beyond the capability of any single human. Gen AI technology, however, does not read to extract in the traditional sense. It in fact tokenizes the scraped expression (i.e., converts it into the language of numbers, which is what it understands) and identifies patterns within these tokens. It then maps those patterns onto the patterns of a prompt in order to produce a reasonable continuation of the prompt. Given the large volume of the data sets used, it is improbable that the output expression will be substantially similar to any single tokenized work.

The question is whether this would infringe any right granted under Section 14 of the Indian Copyright Act.
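The tokenize-and-continue idea described above can be illustrated with a deliberately tiny sketch. This is not how any production model works: real systems use subword tokenizers and neural networks trained on enormous corpora, whereas the toy below uses whole words and simple next-word counts. All function names and the three-sentence corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def build_vocab(texts):
    """Assign each distinct lowercase word an integer ID (a toy 'tokenizer')."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into a list of integer token IDs, skipping unknown words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def count_patterns(texts, vocab):
    """Count how often each token ID is followed by each other token ID."""
    follows = defaultdict(Counter)
    for text in texts:
        ids = tokenize(text, vocab)
        for current, nxt in zip(ids, ids[1:]):
            follows[current][nxt] += 1
    return follows

def continue_prompt(prompt, vocab, follows, max_new_tokens=3):
    """Extend the prompt by repeatedly choosing the most frequent next token."""
    id_to_word = {i: w for w, i in vocab.items()}
    ids = tokenize(prompt, vocab)
    for _ in range(max_new_tokens):
        if not ids or not follows[ids[-1]]:
            break  # nothing learned about what follows the last token
        ids.append(follows[ids[-1]].most_common(1)[0][0])
    return " ".join(id_to_word[i] for i in ids)

# Hypothetical miniature "training set"
corpus = [
    "copyright protects expression",
    "copyright protects expression not ideas",
    "ideas are free",
]
vocab = build_vocab(corpus)
follows = count_patterns(corpus, vocab)
result = continue_prompt("copyright", vocab, follows, max_new_tokens=2)
print(result)
```

Note that what the toy retains after "training" is a table of numeric co-occurrence counts, not a stored copy of any sentence. With only three sentences the continuation will of course echo the corpus; the article's point about improbable similarity depends on the very large volume of training data, which this sketch does not attempt to reproduce.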

Copyright does not protect underlying elements, i.e., the ideas, arguments, philosophies, themes, styles and meta-information contained within the packaging or expressive form of a work. It protects the copyright owner against anyone forming a market using their particular expression or a substantially similar expression. What constitutes forming a market is clearly specified within Section 14 of the Act, which does not include a right to exclude or control access to a work, more so a work that has been made publicly available. Extracting unprotectable elements from copies of protected works, where those copies are not exposed to any other human being, fundamentally does not erode any right of the copyright owner. Under the guise of controlling access to expression, the copyright owner cannot claim a right over use of the underlying elements. Access to publicly available expression is in fact necessary, analogous to an easement by necessity, to be able to extract the unprotected elements. No argument of trespass applies: first, the work has been published and made publicly available by choice; second, there is no right to exclude concerning the underlying elements. Moreover, such use of a digitally available work does not deprive the copyright owner of any use of the work (unlike takings of physical property such as physical books).

Thus, web scraping that is used to extract information or learn from a work, without reproducing its expressive form by exposing it to another human, arguably falls outside the traditional scope of copyright protection. The AI system is not copying the protected expression to reproduce it, but rather learning from the underlying elements contained within the copyrighted work.

Would contractual provisions on websites hosting publicly available works, stating "enter the website only if you will not scrape unprotectable elements", work? Possibly not, as a contractual right cannot be extended to exclude material over which one has no pre-existing claim. There is no claim over underlying ideas that anyone could contractually secure by using the expression as a covering tool.

But what if this scraping was of data obtained through databases of shadow libraries?

Here as well, publicly available datasets are being scraped to extract unprotectable meta-information. The GenAI model is not circumventing paywalls to reproduce expression. Thus, imputing any kind of liability to the model developer is a herculean task, as is proving that the data was actually sourced from the publicly available datasets of shadow libraries rather than from a legitimate publicly available source or a legitimate lender, unless the copy used is watermarked with the identity of the shadow library.

Finally, what if the data was actually sourced by circumventing a paywall?

Section 65A of the Copyright Act, which protects against circumvention of technological protection measures (TPMs), is significantly different from S. 1201 of the DMCA in the US. S. 1201 of the DMCA prohibits circumventing paywalls to obtain access, per se. Section 65A of the Copyright Act, by contrast, has an important nexus requirement. The Section states:

    “Any person who circumvents an effective technological measure applied for the purpose of protecting any of the rights conferred by this Act, with the intention of infringing such rights (Note: infringing such rights again means infringing one of the rights provided under the copyright section: reproduction, distribution, etc.), shall be punishable with imprisonment which may extend to two years and shall also be liable to fine.”

Importantly, Section 65A was brought in to protect against circumvention of TPMs for the purpose of violating rights, as opposed to protecting against circumvention per se. The purpose was to protect against piracy, not to prevent lawful extraction.

Moreover, the section, being penal in nature, also has an intention (mens rea) requirement. This connotes a requirement of purpose and knowledge that the circumvention would amount to infringement. Knowledge here requires clear consciousness and awareness, coupled with a purpose in furtherance of that knowledge, and not a mere possibility or likelihood. This nexus to the intention to infringe is important, and missing, in most GenAI cases: even where access is circumvented, the purpose is extraction of unprotected elements, whereas reproduction, communication or distribution of the expressive form of the work for exposure to someone else (i.e., the market of the copyright owner) is beyond the realm of GenAI. Thus, there is no need even to reach Section 65A(2), which expressly exempts actions that circumvent TPMs for permitted purposes.

This approach in Indian law makes it significantly more challenging to impose liability on those who compile training datasets through web scraping, even when paywalls are involved, as long as the purpose is not to infringe on the rights protected by copyright under Section 14. This focus on the purpose and intent of the circumvention, rather than the act of circumvention itself, provides a more flexible framework that may be better suited to the realities of AI and machine learning in the digital age.

The Indian approach, in particular, thus appears more accommodating of web scraping and information extraction for AI training purposes. As long as the intention is to learn from or extract non-expressive information from the works, rather than to reproduce or distribute the copyrighted expression, access and use are less likely to be considered copyright infringement under Indian law.

What this tells us is that the challenges in the AI context in fact extend beyond the domain of copyright law as an institutional tool. The main concern is not lack of compensation for unorthodox markets, but rather the fear of being outcompeted by machines. Unlike humans, who can only learn from and digest a limited number of sources, AI can process innumerable publicly available works at very low cost and in very little time. This capability raises concerns about the potential for AI systems to outcompete human creators in certain domains, challenging traditional notions of creativity and authorship.

This fear of machine competition is not just about the speed and scale of AI learning, but also about the potential impact on creative industries and knowledge-based professions. If AI systems can rapidly assimilate vast amounts of information and produce outputs that rival or surpass human-created works, it could potentially disrupt entire industries and change the nature of creative work or dissuade creativity for the sake of creativity. It may even estrange disruptive expressive productions.

Importantly, copyright licensing, even if hypothetically extended through copyright law to the use of underlying ideas or to the act of access circumvention per se, is not going to resolve this existential crisis. It would merely create market access barriers. The existential crisis remains even for models that are built by seeking licenses: they will potentially outcompete humans while paying a low rate of compensation per human. The solution is to look for institutional tools that support creative activity for the sake of creative activity, and to enable such creators through beyond-copyright tools that consider new paradigms capable of balancing the needs of creators, innovators, and society at large in the rapidly evolving digital landscape.