Fuelling AI with local lingo
A city-based startup is helping AI models break language barriers with regional language datasets created by rural Indians
BENGALURU: Over the past year, generative AI tools like ChatGPT and Bard have taken the world by storm. While the benefits and the adverse impacts of these revolutionary tools are still contested, the tools themselves face a major technical hurdle to their long-term survival and relevance – a lack of bespoke data in languages other than English.
Currently, much of the data used to train the large language models underpinning tools like ChatGPT and Bard is sourced from the open internet, which is largely in English. But if these tools are to penetrate every society in the world, the companies behind them will need a humongous amount of data in regional languages.
That’s where city-based startup Karya comes in. Originally started at Microsoft Research in Bengaluru by Manu Chopra and Vivek Seshadri, the company provides AI companies with bespoke training data in regional languages. But what sets Karya apart from other data companies is its approach to data collection. Harnessing the spread of smartphones and 4G networks in rural India, Karya provides an additional source of income to underprivileged rural Indians in return for producing data sets in regional languages such as Kannada.
“It’s always been a high-minded dream to bring digital work from Silicon Valley to rural communities. Fortunately, this was the right idea, at the right time and place. We couldn’t have done it anywhere else. This huge digital boom that India has experienced over the last few years made this possible,” shares Chopra.
A firm believer in equitable access to the internet, Chopra says it’s crucial that these new tools be made available in regional languages to ensure that the benefits of the technology are fairly distributed. “These are basic human rights. It’s fundamental that someone who speaks Kannada, Hindi, Telugu or any other regional language, has the same access to resources on the internet that people who speak English do,” Chopra adds.
Currently, Karya produces all manner of text and audio data sets for a wide variety of uses. But the data collection model it’s pioneering could also offer a way around yet another hurdle that AI companies might face in the future. As the internet itself becomes more saturated with content generated by AI tools, a vicious cycle emerges in which new LLMs are trained on data produced by previous models. Furthermore, as resistance to the scraping of data from websites increases, AI companies will be forced to turn to companies like Karya. “That’s a very well-known fear for AI companies. At the end of the day, what sets your product apart is the quality of the data. If everyone is training their models on the same dataset, all the models end up alike. Hence, you will need customised, specific datasets for your needs,” Chopra adds.