Fuelling AI with local lingo

A city-based startup is helping AI models break language barriers with regional language datasets created by rural Indians 

Published: 17th August 2023 11:07 AM  |   Last Updated: 17th August 2023 11:07 AM


Express News Service

BENGALURU: Over the past year, generative AI tools like ChatGPT and Bard have taken the world by storm. While the benefits and adverse impacts of these revolutionary tools are still contested, the tools themselves face a major technical hurdle to their long-term survival and relevance – bespoke data in languages other than English.

Currently, much of the data used to train the large language models underpinning tools like ChatGPT and Bard is sourced from the open internet, which is largely in English. But if these tools are to reach every society in the world, the companies behind them will need a humongous amount of data in regional languages.

That’s where city-based startup Karya comes in. Originally started at Microsoft Research in Bengaluru by Manu Chopra and Vivek Seshadri, the company provides AI companies with bespoke training data in regional languages. But what sets Karya apart from other data companies is its approach to data collection. Harnessing the spread of smartphones and 4G networks in rural India, Karya provides an additional source of income to underprivileged rural Indians in return for producing datasets in regional languages such as Kannada.

“It’s always been a high-minded dream to bring digital work from Silicon Valley to rural communities. Fortunately, this was the right idea, at the right time and place. We couldn’t have done it anywhere else. This huge digital boom that India has experienced over the last few years made this possible,” shares Chopra. 

A firm believer in equitable access to the internet, Chopra says it’s crucial that these new tools be made available in regional languages to ensure that the benefits of the technology are fairly distributed. “These are basic human rights. It’s fundamental that someone who speaks Kannada, Hindi, Telugu or any other regional language, has the same access to resources on the internet that people who speak English do,” Chopra adds.

Currently, Karya produces all manner of text and audio datasets for a wide variety of uses. But the data collection model it’s pioneering could also offer a way past yet another hurdle that AI companies might face in the future. As the internet itself becomes more saturated with content generated by AI tools, it creates a vicious cycle in which new LLMs are trained on data produced by previous models. Furthermore, as resistance to scraping data from websites increases, AI companies will be forced to turn to companies like Karya.

“That’s a very well known fear for AI companies. At the end of the day, what sets your product apart is the quality of the data. If everyone is training their models on the same dataset, all the models end up alike. Hence, you will need customised, specific datasets for your needs,” Chopra adds.
