Like millions around the world, Southeast Asians are trying out large language models such as Meta's Llama 2 and Mistral AI – but in their native Bahasa Indonesia or Thai. The result is usually gibberish in English.
This puts them at a disadvantage as artificial intelligence transforms education, work and governance around the world, technologists warn.
A Singapore government-led initiative aims to redress the imbalance with a Southeast Asian LLM, the first in a family of models called SEA-LION – Southeast Asian Languages in One Network – trained on the region's languages and cultural norms.
Trained on data in 11 Southeast Asian languages, including Vietnamese, Thai and Bahasa Indonesia, the open-source model is a cheaper and more efficient option for the region's businesses, governments and academia, says Leslie Teo at AI Singapore.
“Do we want to force every person in Southeast Asia to adapt to the machine, or do we want to make it more accessible so that people in the region can fully use the technology without needing to speak English?” he said.
“We're not trying to compete with the big LLMs; we're trying to complement them, so we're well represented,” Teo, senior director of AI products, told the Thomson Reuters Foundation.
More than 7,000 languages are spoken worldwide. Yet LLMs, including OpenAI's GPT-4 and Meta's Llama 2, which are used to build AI systems such as chatbots and other tools, have mostly been developed for, and trained on, the English language.
Governments and technology firms are trying to bridge this gap, with India creating datasets in local languages, an LLM in the United Arab Emirates powering generative AI tools in Arabic, and AI models being built in local languages in China, Japan and Vietnam.
Nuryanti Jalli, an assistant professor at Oklahoma State University's School of Communications, said these models could help local populations participate more equally in a global AI economy dominated by big tech companies.
“Regional LLMs are also needed because they support technology self-reliance,” she said. “Less reliance on Western LLMs could provide better privacy for local populations and better align with national or regional interest.”
Verify and filter
Researchers say that multilingual language models, which are trained on text from several languages at once, can infer semantic and grammatical connections between high-resource languages, which have more data, and low-resource languages.
These models can be used in a variety of applications, from translation to customer-service chatbots, to content moderation on social media platforms that have difficulty detecting hate speech in low-resource languages such as Burmese or Amharic.
13% of SEA-LION's data is drawn from Southeast Asian languages – more than any other major LLM, Teo said. More than 9% of its data is from Chinese text and 63% from English.
Multilingual language models are often trained on translated text and other poor-quality data, so AI Singapore is “cautious” about the data used in training SEA-LION, Teo said in his office at the National University of Singapore.
“The era of pristine data is gone – most of the stuff on the internet now is produced by LLMs, so we have to validate and filter,” he said.
“We can't be perfect, but we can't remove everything we think is bad either,” he added.
More governments are contributing data, and businesses are testing SEA-LION, which because of its small size can be deployed quickly and is cheap to fine-tune and adapt, Teo said.
At Indonesian e-commerce company Tokopedia, the majority of customer interactions are in Bahasa Indonesia, so the models “enhance our ability to connect with customers with local fluency and improve their experiences,” said Paul Kandilis, Tokopedia's associate vice president of data science.
Bias in data
As more countries and regions build their own LLMs, digital and human rights experts worry that the models will only reproduce the dominant views expressed online, which can be particularly problematic in countries with authoritarian governments or strict media censorship, or without a strong civil society.
Chinese social media platforms, for example, censor references to the Tiananmen Square uprising and criticism of the government, while many Southeast Asian countries have enacted laws to clamp down on content that authorities deem misleading.
“Training models on such data risks perpetuating biased, incomplete and misleading narratives,” Jalli said.
“Models may fail to address important socio-political issues such as human rights abuses, corruption or valid criticism of political forces,” she said.
For example, in response to a question on former Indonesian President Suharto, Llama 2 and GPT-4 mentioned his spotty human rights record, while SEA-LION's response focused more on his achievements.
If a model is trained only on favorable narratives about the government, the model is “likely to adopt a worldview that is entirely positive of the government and leave out dissenting opinions,” said Alia Bhatia, a policy analyst at the Center for Democracy & Technology, a US non-profit.
“Regional LLMs better reflect the linguistic and cultural nuances of native language speakers, but they may have less information about the world in general,” she added.
“There is a real danger that government-backed models will foster a revisionist view of history and undermine democratic values.”
But the alternative – relying entirely on Western LLMs with “disproportionately large influences” from wealthy, liberal, Western democracies – means perpetuating different biases related to cultural values, political beliefs and social norms, according to AI Singapore.
“These LLMs have a very distinct West Coast American bias – they're very woke. They don't represent us,” Teo said.
“We're not saying ours is the only perspective – we're trying to rebalance it.”