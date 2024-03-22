CHENNAI: There have been plenty of discussions about the pitfalls of artificial intelligence models, but topping the legal and ethical questions is copyright infringement: training general-purpose large language models (LLMs) on protected content and profiting from the content thus generated. A recently released report from Patronus, based on its CopyrightCatcher application programming interface (API), raised many important issues. The API is a model designed to detect potential copyright violations in LLMs. Its findings were shocking.

OpenAI’s GPT-4 produced copyrighted content on 44% of the prompts, and Mistral AI’s Mixtral-8x7B did so on 22% of the prompts. The report also reveals that Anthropic’s Claude-2.1 produced copyrighted content on 8% of the prompts, while Meta’s Llama-2-70b-chat did so on 10% of the prompts. Currently, the training data behind the models is not disclosed; only their scale is shared, expressed in billions of parameters and growing. This makes it difficult to detect copyrighted content. A recent research paper published on arXiv finds that all four model families it tested appear to have been trained on copyrighted materials.

Furthermore, the paper states that DE-COP, its new method for detecting copyright violations in language models, performs better than other methods and human evaluators. Finding copyrighted content in the training data of LLMs is essential to meeting ethical and legal obligations. It also has the potential to increase transparency and accountability in AI models.