HYDERABAD: Researchers from the Language Technologies Research Centre of the International Institute of Information Technology-Hyderabad (IIIT-H) have made a first-of-its-kind attempt at semantic role labelling of Hindi-English code-mixed tweets. With code-mixing, the blending of languages such as Hindi and English into Hinglish, gaining popularity in multilingual countries such as India, it has become imperative to develop a pool of code-mixed data.
“The modern world of social media has lots of code-mixed languages. So if we want to do any parsing or analysis there, the first thing we have to do is handle the code-mixed data,” said Prof Dipti Misra Sharma, who, along with her student Riya Pal, has been conducting research on semantic role labelling of Hinglish tweets.
Riya, a final-year dual-degree student, says that semantic role labelling is used wherever text has to be understood accurately before information is extracted from it. “Take the example of a chatbot. It needs to understand data first before extracting information from it.
“Let’s say there is a sentence like ‘I go to school’. For a machine to understand this, it extracts the action “go”, “who” is going and “where”. Thus, there are labels in that sentence,” she said. Riya has created a Hinglish dataset by manually labelling about 1,500 tweets. The next step is to create an automated tool that uses machine learning techniques to do the same labelling, thereby improving the accuracy of the model.
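
For readers curious what such labels look like in practice, the "I go to school" example can be sketched in a few lines of Python. This is only a toy illustration, not the researchers' tool or dataset format: the `Frame` class, the `toy_label` function and the role names `who` and `where` are hypothetical, and the single hand-written rule handles nothing beyond sentences shaped like "SUBJECT VERB to PLACE".

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Frame:
    """One labelled sentence: the action plus the roles attached to it."""
    predicate: str
    roles: Dict[str, str] = field(default_factory=dict)


def toy_label(sentence: str) -> Frame:
    """Label a sentence of the form 'SUBJECT VERB to PLACE'.

    Hypothetical hand-written rule for illustration only; a real
    labeller would be trained on annotated data instead.
    """
    tokens = sentence.rstrip(".").split()
    if len(tokens) < 4 or tokens[2].lower() != "to":
        raise ValueError("toy rule only handles 'SUBJECT VERB to PLACE'")
    return Frame(
        predicate=tokens[1],                 # the action, e.g. "go"
        roles={
            "who": tokens[0],                # the one doing the action
            "where": " ".join(tokens[3:]),   # the destination
        },
    )


frame = toy_label("I go to school.")
print(frame.predicate, frame.roles)
```

A manually labelled tweet carries the same kind of information, an action and the roles around it, and an automated tool would have to learn to produce it for text that switches between Hindi and English.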