

CHENNAI: There have been plenty of discussions about the pitfalls of artificial intelligence models, but topping the legal and ethical questions is copyright infringement: training general-purpose large language models (LLMs) on protected works and profiting from the content thus generated. A recently released report from Patronus AI, based on its CopyrightCatcher application programming interface (API), raised many important issues. The tool is designed to detect potential copyright violations in LLMs, and its findings were striking.
OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts, and Mistral AI’s Mixtral-8x7B on 22%. The report also reveals that Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts, while Meta’s Llama-2-70b-chat did so on 10%. Currently, the training data behind these models is not disclosed; only the scale of the models is shared, in terms of billions of parameters. This makes copyrighted content difficult to detect. The findings of a recent research paper published on arXiv reveal that all four model families it tested appear to have been trained on copyrighted materials.
Furthermore, the paper states that its new method for detecting copyright violations in language models, DE-COP, performs better than earlier methods and human evaluators. Identifying copyrighted content in the training data of LLMs is essential to addressing ethical and legal concerns. It also has the potential to increase transparency and accountability in AI models.
A characteristic of generative AI models is the massive consumption of data -- be it text, images, audio, or video. From a copyright perspective, an important question is whether material from the training data sets is retained in LLM outputs. Research is ongoing to find a solution to plagiarism in LLMs; one noteworthy effort is GenFace, which detects linguistic patterns. Concerned educators are also trying to find ways to prevent students from passing off AI-generated work as their own. But all of these tools are immature and have their limitations.
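The retention question above can be made concrete with a simple check: compare a model's output against a reference text and measure how much of it is reproduced verbatim. The sketch below is purely illustrative -- it is not the method used by Patronus AI's CopyrightCatcher or by DE-COP, and the n-gram size is an assumption -- but it shows the basic idea of flagging word-for-word reuse.

```python
# Illustrative sketch (an assumption, not the actual CopyrightCatcher or
# DE-COP method): flag verbatim reproduction of a reference text in a
# model's output using word n-gram overlap.

def ngrams(text: str, n: int):
    """Return the set of word n-grams in the text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the reference."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    ref = ngrams(reference, n)
    return len(out & ref) / len(out)

# A copied passage scores high; unrelated text scores zero.
reference = ("it was the best of times it was the worst of times "
             "it was the age of wisdom")
copied = "it was the best of times it was the worst of times"
original = "the weather in chennai was pleasant for most of the week"
print(verbatim_overlap(copied, reference))    # → 1.0
print(verbatim_overlap(original, reference))  # → 0.0
```

In practice, detection tools must also handle paraphrase, which simple verbatim matching misses; that is precisely the gap methods like DE-COP aim to close.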
Currently, the proposed solutions range from avoiding non-transparent models to pressing AI vendors to make their models cite sources, especially models that deliver news and current affairs. There are also suggestions to avoid deploying AI models in enterprises and in official work by individuals, at least without human verification.