In a world increasingly reliant on artificial intelligence (AI) and the vast expanse of digital data, Nizar Habash, a computer scientist specialising in natural language processing and computational linguistics, stands at a unique crossroads. With the rise of advanced AI systems such as ChatGPT holding the potential to completely transform our world, it’s important to note that the majority of these platforms primarily operate in English, with other languages, like Arabic, facing setbacks due to limited online data. With extensive research spanning machine translation, morphological analysis, and computational modelling of Arabic and its dialects, Habash offers insights into the challenges and opportunities of building Arabic-language AI systems, or in simpler terms, “teaching Arabic to the robots,” he jokes.
A professor of Computer Science at NYU Abu Dhabi, Habash points out the urgent need for developing more sophisticated machine learning systems that are better equipped to process the cultural nuances embedded within different languages. “Arabic is one of the most important languages globally. It ranks among the top in terms of the number of people who use it, whether for day-to-day life or solely for religious purposes. It’s a significant language that has carried knowledge over a large span of human history, essentially preserving it,” says Habash. “Today, when we assess the resources available for Arabic and the AI systems currently in use, we find that they don’t match the level of complexity the language holds.”
Originally hailing from Palestine, Habash mentions, “Being a native speaker of Arabic, I’ve been aware of its complexity from a very young age—from its various dialects across the Arab world to the standards I’ve had to adhere to throughout my education. I’ve often thought about how Arabic serves as a vehicle for our identity, knowledge, and communication, especially in the age of AI. And we encounter numerous examples of problems in this regard.”
Data challenges
Can the limitations of available online data for Arabic language learning impact the development and performance of AI systems? According to Habash, the current push in AI that has been quite successful is “simply that more data is better”. “That’s not the biggest challenge, but for some people, it may be seen as the only challenge. The problem is, you will get to the point where there is no more data that’s naturally created, and the moment we start generating artificial data and training AI systems on that, it’s like creating monsters,” says Habash, previously a research scientist at Columbia University’s Center for Computational Learning Systems.
AI uses feedback loops, which can involve inputs with ‘creative’ mistakes, he explains. Producing 100 times the amount of data means that the mistakes will also be amplified 100 times. “When the mistakes are repeated again and again, it becomes the norm, and the norm can become the operating model,” says Habash. “The model has no concept of reality. It is simply trying to predict the next word, or fill in the blank, or use what’s called masking techniques to figure out the next part of the sentence. AI is great at making mistakes with confidence.”
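The “predict the next word” idea Habash describes can be made concrete with a toy sketch. This is not his research or a real language model — just a bigram counter, trained on an invented sentence, that always picks the most frequent continuation it has seen. It has no concept of reality, only of what tends to follow what:

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration). The "model" only counts
# which word follows which; it knows nothing about meaning or truth.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most frequently observed word after `word`."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # → "cat": the most common continuation seen
```

If the training text contained a mistake repeated often enough, this predictor would confidently reproduce it — which is exactly the amplification problem Habash warns about when models are trained on their own generated output.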
When discussing the limitations of collecting online data for Arabic, Habash highlights the perils of algorithmic bias and the nuances inherent in Arabic script, such as the absence of diacritical marks in common usage. These intricacies pose formidable challenges for AI systems striving to comprehend and process Arabic text accurately. “Arabic, typically for common usage, is written without the diacritical marks, which signify the vowels. Only about one to two per cent of Arabic words in newspapers actually have a marker for vowels, but Arabic readers know how to understand them. However, a word may be ambiguous as a result and could have many meanings. So, when we’re teaching the machines, context becomes really important,” he adds.
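The ambiguity Habash describes can be illustrated with a classic example: the three bare letters ك‑ت‑ب (k‑t‑b), written without diacritics, correspond to several fully vowelled words. The tiny lookup below is a hypothetical sketch, not a real morphological analyser — a system like the ones Habash builds must use context to choose among such readings:

```python
# Hypothetical mini-lexicon: one undiacritised spelling maps to
# several fully vowelled readings with different meanings.
READINGS = {
    "كتب": [                       # bare k-t-b, as written in newspapers
        ("كَتَبَ", "he wrote"),
        ("كُتُب", "books"),
        ("كُتِبَ", "it was written"),
    ],
}

def readings(undiacritised):
    """Return every (vowelled form, gloss) attested for a bare spelling."""
    return READINGS.get(undiacritised, [])

print(len(readings("كتب")))  # → 3: one written form, three possible words
```

A human reader resolves this instantly from context; a machine sees three equally plausible candidates unless it is trained to weigh the surrounding words.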
The Arabic language also has many dialects, and where there are dialects, there are historical variants. “Classical Arabic, the Arabic of the Quran, is spelled in slightly different ways than modern standard Arabic. This is another thing that machines are dealing with. It can mix up the Quran text with the modern standard Arabic, with Egyptian dialect, and put this pile together, which would confuse a lot of things,” says Habash. “There are different complexities. In my opinion, some of the interesting challenges that are not yet tapped are potentially to do with algorithmic bias.”
Cultural sensitivity and biases
What steps should be taken to ensure that Arabic-language AI systems are culturally sensitive and avoid biases in their interactions? “There are different kinds of biases; one is content bias, and the other one is the grammatical form bias. Both are interconnected,” Habash mentions. “The content bias is related to the kind of ideas about the world that a system is likely to generate in generative models. As AI scientist Toby Walsh has previously said, ‘Language is political. There’s always bias embedded’. To an extent, I agree with this. For instance, in traditional journalism reportage, we’d always see the die-kill paradigm, where Israelis seem to always be ‘killed’ and Palestinians always ‘die’—we are impossible to kill. These types of biases can also occur in Arabic language.”
Citing a more recent example from ChatGPT doing the rounds on social media, he adds, “Similarly, ChatGPT was asked ‘Do Palestinians deserve to be free?’ and ‘Do Israelis deserve to be free?’ The answer for the Israelis was something related to ‘Of course, Israelis are human beings, and all human beings deserve freedom’, whereas, for Palestinians, the response was along the lines of ‘The question of Palestinians being free is a complex question with many opinions’. There are biases everywhere; AI will repeat what it learns,” says Habash.
However, even though algorithmic bias stems from human bias, the feedback loop within which machine learning systems operate may amplify the bias, which can be a cause for concern. “The real challenge is to figure out how to get the machines to model properly, to know which things should be given higher weight or lower weight,” says Habash.
Potential solutions to fight the existing algorithmic bias include either adding more data to the mix or researchers working towards identifying content that deviates from the normal distribution, he adds. “For example, if there’s a lot of mentions of doctors being men and nurses being women, can you actually artificially reduce the weights of the model? You don’t have to change the data; you can change how you learn from the data. If we see a pattern that looks kind of odd, we can work on balancing it out,” says Habash. “It’s really an exciting new space because we’re dealing with data and information, and it could be manipulated in different ways.”
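The reweighting idea Habash sketches — change how you learn from the data, not the data itself — can be illustrated with a small example. The data below is invented, and this is one simple technique (inverse-frequency example weighting), not a description of any specific system: over-represented patterns get a lower per-example weight, so each group contributes equally to training:

```python
from collections import Counter

# Invented, deliberately skewed data: "doctor" co-occurs with "man"
# far more often than with "woman" in this toy corpus.
examples = [("doctor", "man")] * 8 + [("doctor", "woman")] * 2

counts = Counter(examples)
total = len(examples)

def weight(example):
    """Inverse-frequency weight: rarer patterns count more per instance."""
    return total / (len(counts) * counts[example])

print(weight(("doctor", "man")))    # 10 / (2 * 8) = 0.625
print(weight(("doctor", "woman")))  # 10 / (2 * 2) = 2.5
# Group totals: 8 * 0.625 == 2 * 2.5 — the skew no longer dominates,
# and the underlying data was never altered.
```

The data stays exactly as collected; only the learning procedure is adjusted, which is the design choice Habash highlights.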
Role of language and AI experts
So, in what ways can computational language experts, such as Habash, contribute to overcoming these challenges to make ‘better’ design choices? “That’s a great question. As an industry, we are more focused on the efficiency, efficacy and design of the model, creating something that is simple and easy with the sort of ‘Google elegance,’” says Habash. “Google simplified everything with one simple search box and that’s very attractive for people who are already overwhelmed. The amount of data on the web is so ridiculously huge. Everyone wants the short answer.”
In the realm of design choices for AI models, Habash advocates for simplicity without sacrificing substance, cautioning against ‘deceptive fluency’. “For example, if you talk to an English speaker who has good pronunciation, and you can understand them and follow what they’re talking about, your basic assumption is that this person sounds good to the ear, so you understand him or her. Clearly, they’re smart; if they’re smart, they’re good; if they’re good, they’re telling the truth.”
“But if you talk to a super smart person who actually knows a lot more but has trouble speaking in English, then you might not think the same, even though they might be giving you jewels of wisdom. It’s the same thing with the machines. Fluency equals intelligence equals truth, which is not truly valid logically,” he explains. “We’re not dealing with something that we have not dealt with before, but the only thing is the volume and accessibility are a lot higher.”
The perils of relinquishing human agency to AI are steep. “If we rely too much on AI to make decisions on our behalf and to be our voice, we’re giving up something about our humanity, intelligence, conscience, and potentially our responsibility, which is not going to take us too far,” says Habash, firmly warning us against the blind reliance on AI systems, emphasising the irreplaceable role of human judgement, empathy, and ethical responsibility. “That’s why I think it’s extremely important to continue to educate human beings.”