GenAI and the Global South: Language Preservation and Perception Management
- THE GEOSTRATA
- 6 days ago
Given the rise of Generative Artificial Intelligence (GenAI) platforms like OpenAI’s ChatGPT, Google’s Gemini, X’s Grok, and Microsoft’s Copilot, it has become far easier to access, analyse and appreciate information across the globe.
Illustration by The Geostrata
On the other hand, easy access to GenAI has unfortunately enabled the faster spread of misinformation, including fake AI-generated images and written content. It has also contributed to the erosion of human intellect, as academic users increasingly rely on such software to generate data and content instead of doing actual painstaking research.
Even as the prevalence of AI increases, the world must pay attention to the nuances of its development to understand its broader impact on cultures and civilisations worldwide. This article explores the areas of interest and concern that arise from this revolution shaping our modern age.
NEGATIVE MEDIA STEREOTYPES BEING REINFORCED BY NEW GEN-AI MODELS
Applications of Artificial Intelligence and its subfields, such as Natural Language Processing (NLP) and Machine Learning (ML), generally rely on models trained on vast amounts of data to give coherent outputs. The problem is that these models are often not trained on data that is diverse enough.
For example, facial detection algorithms have been infamous for detecting black or brown faces less reliably than white faces. Given the latest trend centred around energy-intensive AI image generation, it is important to acknowledge and eliminate such biases by first understanding their causes.
A good case in point is the public image of a non-Western nation, especially in the eyes of Western nations and their media outlets. The BBC, for example, has been accused multiple times of being racist in its portrayal of India and Indians, a stance that reeks of baseless neocolonial fixations. India is often portrayed as an undeveloped nation, when in reality it is a rapidly developing and industrialising one.
While India has some phenomenal digital financial infrastructure that eases transactions for the common man, such as the Unified Payments Interface (UPI), Western portrayals of India often show only neighbourhoods steeped in poverty.
The repercussions are huge when a GenAI model is trained on such biased reporting by most global media houses.
When asked to generate ‘images of cities’ for an Instagram chat wallpaper, Meta AI (the assistant built into Meta-owned Instagram) produced results that leaned disastrously on stereotypes, even though a chat wallpaper should intuitively be aesthetically pleasing.
Requests for wallpaper images of Dubai and Riyadh showed only glitzy, shiny skyscrapers (despite the existence of poor neighbourhoods where blue-collar workers live in less glamorous accommodation), while images of Mumbai and Pune showed relatively poor, working-class neighbourhoods (oblivious to the fact that both cities also have prosperous neighbourhoods like Colaba and Koregaon Park, and that skyscrapers stand throughout southern Mumbai).
Similarly, an image of Paris showed the Eiffel Tower, aesthetic buildings and orderly street cafes, and an image of Frankfurt contained a cathedral, skyscrapers and some beautiful European buildings; the images of Delhi and Dhaka, meanwhile, showed only poorer neighbourhoods with people sitting on the ground.
The image generated for Addis Ababa, meanwhile, was balanced, showing a street market as well as skyscrapers. The image for Kuala Lumpur contained the famous Petronas Twin Towers (a good choice), the image for Bangkok was centred on the tuk-tuks so prevalent there, and the image for Sao Paulo showed a street in a favela (slum) rather than the more beautiful side of the city.
More often than not, countries of the Global South (in this case India, Brazil, Ethiopia, Thailand and Bangladesh) end up facing negative stereotypes. At the same time, relatively richer non-Western nations like the UAE and Malaysia are depicted somewhat better, owing to the iconic monuments in their cities.
Thus, instead of inspiring a fairer world by showing unbiased aesthetically pleasing scenes from cities that anyone would desire for their wallpaper, AI-generated wallpaper images just go on to reinforce stereotypes about a city or the nation the city is located in.
This in turn speaks volumes about the biased, unbalanced data fed to the models underlying these AI platforms. It is unfortunate that GenAI is likely to represent Mumbai by its slums (despite the presence of skyscrapers and elite neighbourhoods like Worli and Colaba) and Los Angeles by Hollywood or Downtown LA (despite the presence of unsafe and poor neighbourhoods like Compton or Skid Row). The problem isn’t the image itself: all cities have prosperous and less prosperous neighbourhoods. The issue lies in the biased perception that the model develops through training on lopsided data.
CONCERNS OVER AI-LED LANGUAGE HOMOGENISATION
There are thousands of languages and even more dialects spoken across the world; India itself has 15 languages on its currency note, for example. GenAI content is overwhelmingly in English, but most major platforms like ChatGPT and Gemini have developed the ability to interact with their users in multiple languages. This matters in the 21st century because globalisation is advancing faster than ever before and people, unfortunately, learn languages and scripts based on convenience.
Mainstream scripts like Latin, Cyrillic, Chinese (Han) characters and Devanagari might have no trouble staying ‘relevant’, but less-used languages and scripts are at an ever-greater risk of being wiped off the planet merely because they lose out on this GenAI revolution.
The threat to endangered scripts and languages is grave, and it arises simply from the lack of diverse texts and material fed to those models.
For example, while ChatGPT had no issues conversing in Marathi (written in the Devanagari script) or Spanish (written in the Latin script), it had trouble identifying Ahirani (a dialect of Marathi spoken mostly in Khandesh, in northern Maharashtra) and the Sharada script (native to India’s Jammu and Kashmir region), as illustrated in the examples below.
India is a land of immense cultural and linguistic heritage, so most Indian languages will probably find a way to survive. The same cannot be said for the many native scripts, languages and dialects across the Global South, which risk being lost to irrelevance in the near future.
SPECIFYING THE ISSUES FACED BY ENDANGERED SCRIPTS, LANGUAGES AND DIALECTS DUE TO LACK OF TRAINING GIVEN TO GEN-AI MODELS
It is important not only to understand and acknowledge the phenomenon of ‘language homogenisation’ and its allied phenomena, but also to appreciate the beauty that lies in the diversity of the world’s languages.
Disparity in the allocation of resources: Less-used languages get less attention. The lack of data leads to poor model performance. Poor performance discourages use, which further reduces investment and interest.
Erosion of culture: GenAI tools shape content creation, education, and even daily communication. If these tools ignore minority languages, we’ll see a subtle push toward dominant languages. In the long term, this can lead to reduced intergenerational transmission of native languages.
Digital Marginalisation: For languages that use non-Latin scripts—like Brahmi-derived scripts, Tifinagh, or traditional Mongolian—the risk is higher: lack of OCR, handwriting recognition, and typing support. GenAI systems often "Romanize" or avoid such scripts entirely. The result will likely be a digital ecosystem that renders some scripts obsolete.
Lack of Contextual & Cultural Nuance: When less-used languages are poorly modeled, even when they are supported, the output lacks cultural understanding: sayings, idioms, and worldview embedded in language get lost - GenAI might produce grammatically correct but culturally tone-deaf content.
STRATEGIES TO MITIGATE THE ABOVE ISSUES IN GEN-AI MODELS AND PRESERVATION OF ENDANGERED LANGUAGES
While GenAI models can be held responsible for the reduced relevance of endangered languages, the advent of artificial intelligence and NLP should also be seen as a great opportunity to preserve these means of communication and pass them on to future generations.
In multilingual nations like India, such platforms can help both blue- and white-collar migrant workers learn the native languages spoken in cities like Delhi, Mumbai, Bengaluru, Hyderabad, Chennai and Pune, which have become the economic lifeblood of India. This is an excellent way to strengthen federalism while assuaging the concerns of both migrant employees and the locals of a city.
Here, governments can play a key role in identifying languages and pushing for their inclusion in the training data used by models, by means of regulation, policy formulation and funding. The issue can also be harnessed for greater bilateral cooperation between leading nations of the Global South, like India, and the US, a global centre of LLM development.
Publicly Funded Multilingual Datasets:
Governments and international bodies should fund the creation of high-quality, open-access datasets for low-resource and endangered languages and mandate the inclusion of local languages in publicly funded NLP projects (like ‘Bhashini’, India’s government-owned online translation platform).
Establish regional data stewardship bodies to collect, annotate, and validate local content while also supporting community-driven data collection: oral histories, literature, folklore, and public signage.
Regulation through Public-Private Partnership:
Regulate tech companies to ensure language equity in their AI deployments at the national level and enforce “Minimum Viable Language Sets” before product rollout in any linguistic region.
Introduce linguistic fairness audits of these GenAI platforms, much like environmental or privacy audits, and incentivise companies to report performance gaps across languages, dialects, and scripts.
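As a rough illustration of what reporting "performance gaps" could mean in practice, the sketch below compares each language's score against a reference language and flags those that fall too far behind. The function names, the 0.15 threshold and the scores are all hypothetical, invented purely for illustration; a real audit would need agreed benchmarks and measured results.

```python
def language_gaps(scores, reference="English"):
    """Return each language's shortfall relative to the reference language."""
    ref = scores[reference]
    return {
        lang: round(ref - score, 3)
        for lang, score in scores.items()
        if lang != reference
    }


def flag_underserved(scores, reference="English", max_gap=0.15):
    """List languages whose gap to the reference exceeds max_gap (assumed threshold)."""
    gaps = language_gaps(scores, reference)
    return sorted(lang for lang, gap in gaps.items() if gap > max_gap)


# Hypothetical accuracy figures, NOT measurements of any real model.
example_scores = {
    "English": 0.90,
    "Spanish": 0.86,
    "Marathi": 0.78,
    "Ahirani": 0.35,
}

print(language_gaps(example_scores))
print(flag_underserved(example_scores))
```

Publishing the full gap table, rather than only a pass or fail verdict, would let regulators and communities see exactly which languages and dialects are being left behind.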
Support for Script and Dialect Diversity
Expand beyond language labels to support scripts (writing systems) and dialects; fund Unicode extension efforts for rare or ancient scripts.
Mandate script-specific support in government websites and digital services.
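To see why scripts deserve attention separately from language labels, consider the minimal sketch below, which tallies the writing systems present in a piece of text. It assumes a simple heuristic: the first word of each character's Unicode name (for example "DEVANAGARI" or "TIFINAGH") is treated as a script label. This is only an approximation of the official Unicode Script property, which Python's standard library does not expose, and the function name `scripts_in` is hypothetical.

```python
import unicodedata
from collections import Counter


def scripts_in(text):
    """Count characters per script, using the first word of each
    Unicode character name as a rough script label (a heuristic)."""
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip digits,
        # whitespace and punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


print(scripts_in("मराठी Marathi"))   # Devanagari and Latin in one string
print(scripts_in("ⵜⵉⴼⵉⵏⴰⵖ"))        # Tifinagh
```

A service that logs this kind of tally for its inputs and outputs could notice when a script is silently being romanised or dropped, even if the language label stays the same.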
Decentralised Language Platforms
Give communities direct control over how their language is represented and modelled.
Establish digital language conservatories with funding and training - mandate consultation and consent before using community-generated data.
Use Web3 or blockchain principles for data ownership and provenance of linguistic resources.
Create a Global Linguistic Equity Index for GenAI Models
Create a benchmarking framework to rate AI models on inclusivity - similar to carbon footprint scores, introduce a “Linguistic Inclusivity Score”.
Partner with institutions like UNESCO or the Internet Governance Forum (IGF) to maintain this index and encourage academic and industrial competitions focused on multilingual NLP challenges (e.g., WMT, FLORES).
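One possible shape for such a "Linguistic Inclusivity Score" is sketched below. It is only an assumption about how an index might be computed, not an existing standard: each target language gets a quality score between 0 and 1, unsupported languages count as 0, and the index is the unweighted average expressed out of 100. The function name and the example figures are hypothetical.

```python
def inclusivity_score(per_language, target_languages):
    """Average quality over a target language set, scaled to 0-100.

    per_language: mapping of language -> quality in [0, 1] (hypothetical metric).
    Languages missing from the mapping are treated as unsupported (0).
    """
    if not target_languages:
        raise ValueError("target_languages must not be empty")
    total = sum(per_language.get(lang, 0.0) for lang in target_languages)
    return 100 * total / len(target_languages)


# Invented figures for illustration only.
targets = ["English", "Marathi", "Ahirani", "Tamazight"]
model_quality = {"English": 1.0, "Marathi": 0.75, "Ahirani": 0.25}

print(inclusivity_score(model_quality, targets))
```

The average here is deliberately unweighted by number of speakers, so that a small language counts as much as a large one; an index weighted by population would instead reward models for serving only the dominant languages.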
Multimodal & Oral-Language-First AI
Recognize that many languages are primarily oral, not written:
Fund speech-to-text, speech-to-speech, and oral tradition digitization initiatives.
Use community radio, WhatsApp, and other vernacular communication channels as data sources and prioritize audio-based AI agents that can operate in non-literate communities.
Education & Digital Literacy Campaigns
Empower communities to use, co-create, and critique AI platforms.
Offer AI literacy coursework in indigenous and regional languages across universities.
Provide open-source toolkits that allow communities to train/fine-tune models in their language and promote intergenerational knowledge transfer using GenAI-assisted storytelling and translation tools.
Global AI Ethics & Language Preservation Charter
Align with the UN Declaration on the Rights of Indigenous Peoples (UNDRIP).
Create an AI-language preservation treaty framework under UNESCO or the World Intellectual Property Organization (WIPO).
Require GenAI companies to publish language impact assessments, especially when entering new markets like Indonesia and India, where linguistic diversity is high.
Partnerships with Cultural Institutions
Collaborate with archives, universities, tribal councils, and cultural museums.
Fund language documentation projects (e.g., ELAR, Rosetta Project) and support the translation of ancient texts, manuscripts, and oral histories using human-in-the-loop GenAI.
Promote public-facing interfaces like AI-powered language revival apps, co-designed with elders and linguists.
With these mitigation strategies, one can hope to make sure that this GenAI revolution remains equitable for the nations, languages and cultures of the Global South that currently do not have the power or resources to build such cost-intensive platforms - leading to the collective success of humanity in safeguarding knowledge and cultures that have evolved over thousands of years.
BY ANISH A. KALE
INDUSTRY INNOVATION CENTRE
TEAM GEOSTRATA