A new preprint from researchers at the University of Johannesburg brings computational rigor to a long-standing debate in South Africa: the persistence of racial bias in news media coverage. Titled "Tracing Bias to Its Sources: A Word Embedding Audit of Racism in South African News Outlets," the study uses word embeddings, a natural language processing (NLP) technique, to dissect language patterns across 39 major news platforms. By analyzing over 27,000 COVID-19 vaccination articles published between January 2020 and May 2023, the authors not only confirm the existence of socioeconomic race bias but also pinpoint which outlets contribute most.
The findings confirm what qualitative studies have suggested for decades: South African news language associates concepts of poverty, welfare dependency, and crime more closely with Black-associated terms, while wealth, investment, and safety cluster around White-associated ones. This "inferential racism," as termed in prior research, operates subtly through word associations rather than overt slurs, making it harder to detect but no less damaging to public perceptions and social cohesion.
Historical Roots of Media Bias in Post-Apartheid South Africa
South Africa's news media carries the scars of apartheid, when outlets were state tools for white supremacy propaganda, suppressing Black voices and justifying segregation. The 1994 democratic transition spurred reforms, but challenges persisted. The South African Human Rights Commission's (SAHRC) landmark 2000 "Faultlines" inquiry—sparked by complaints over cartoons and columns stereotyping Black people as corrupt or incompetent—concluded that media institutions were "characterised as racist" due to cumulative effects on dignity and equality. It highlighted underrepresentation (few Black sub-editors), stereotypes in crime/poverty coverage, and white-dominated ownership.
Research by the Media Monitoring Project (MMP) published in 1999, titled "The News in Black and White," documented similar patterns: Black South Africans were framed as victims or perpetrators, whites as experts or benefactors. Despite diversification efforts, a 2022 analysis by Govenden found "inferential racism" enduring into 2014. Commercial pressures, niche targeting (e.g., business papers for affluent readers), and legacy newsroom cultures have slowed transformation, and self-regulation bodies like the Press Ombudsman have handled few racism complaints effectively.
Understanding Word Embeddings: A Crash Course in AI Bias Detection
Word embeddings are vector representations of words in a high-dimensional space, capturing semantic relationships based on co-occurrence in text. Google's Word2Vec (developed in 2013) popularized the approach: by learning to predict words from their context, the model picks up regularities such as "king" - "man" + "woman" ≈ "queen". To detect bias, researchers construct a "dimension" such as race (the average vector of "Black/African/Zulu" minus the average vector of "White/European/Afrikaner") and project stereotype vocabularies onto it. A positive cosine similarity means a word sits closer to the Black pole; a negative one means it sits closer to the White pole.
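The projection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the three-dimensional vectors in `toy` are invented for the example (real Word2Vec vectors would have hundreds of dimensions), and the helper names `mean_vector`, `cosine`, and `race_dimension` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def race_dimension(emb, black_terms, white_terms):
    """Average Black-pole vector minus average White-pole vector."""
    black = mean_vector([emb[w] for w in black_terms])
    white = mean_vector([emb[w] for w in white_terms])
    return [b - w for b, w in zip(black, white)]

# Invented toy embeddings, for illustration only.
toy = {
    "black": [0.9, 0.1, 0.2],
    "african": [0.8, 0.2, 0.1],
    "white": [0.1, 0.9, 0.2],
    "european": [0.2, 0.8, 0.1],
    "grant": [0.7, 0.2, 0.5],
    "investor": [0.2, 0.7, 0.5],
}

dim = race_dimension(toy, ["black", "african"], ["white", "european"])
scores = {w: cosine(toy[w], dim) for w in ("grant", "investor")}
for word, score in scores.items():
    pole = "Black pole" if score > 0 else "White pole"
    print(f"{word}: {score:+.3f} ({pole})")
```

With these made-up vectors, "grant" lands on the positive (Black) side and "investor" on the negative (White) side, which mirrors the sign convention in the text.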
In bias tests like the Word Embedding Association Test (WEAT), terms like "poverty" or "grant" scoring toward the Black pole indicate that negative concepts are more closely associated with Black-associated terms than with White-associated ones. The method scales to millions of words, revealing implicit biases invisible to manual review. Although the embeddings are static (they lack sentence context, unlike BERT), the approach is robust for large corpora and has been validated against human judgments.
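For readers who want to see how a WEAT effect size is put together, here is a hedged sketch of the standard formulation: each target word gets an association score (mean cosine to one attribute set minus mean cosine to the other), and the effect size is the difference in mean scores between the two target sets divided by the standard deviation over all target words. The toy vectors and word lists are invented, the function names are hypothetical, and the sketch uses the sample standard deviation (implementations differ on this). The sign of the result depends on the order in which the sets are passed, so it should not be read as reproducing the study's convention.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(word, attrs_a, attrs_b, emb):
    """Mean similarity of `word` to attribute set A minus attribute set B."""
    sim_a = [cosine(emb[word], emb[a]) for a in attrs_a]
    sim_b = [cosine(emb[word], emb[b]) for b in attrs_b]
    return statistics.mean(sim_a) - statistics.mean(sim_b)

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b, emb):
    """Standardized difference in association between two target sets."""
    s_x = [association(w, attrs_a, attrs_b, emb) for w in targets_x]
    s_y = [association(w, attrs_a, attrs_b, emb) for w in targets_y]
    pooled_sd = statistics.stdev(s_x + s_y)  # sample SD; an assumption of this sketch
    return (statistics.mean(s_x) - statistics.mean(s_y)) / pooled_sd

# Invented toy embeddings, for illustration only.
toy = {
    "black": [0.9, 0.1, 0.2], "african": [0.8, 0.2, 0.1],
    "white": [0.1, 0.9, 0.2], "european": [0.2, 0.8, 0.1],
    "poverty": [0.8, 0.1, 0.3], "grant": [0.7, 0.2, 0.5],
    "investor": [0.2, 0.7, 0.5], "wealth": [0.1, 0.8, 0.3],
}

poverty_terms = ["poverty", "grant"]
wealth_terms = ["investor", "wealth"]
black_terms = ["black", "african"]
white_terms = ["white", "european"]

effect = weat_effect_size(poverty_terms, wealth_terms, black_terms, white_terms, toy)
print(f"WEAT effect size (toy data): {effect:.2f}")
```

Published WEAT results usually add a permutation test to obtain a p-value; that step is omitted here to keep the example short.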
The Study's Methodology: From Corpus to Outlet Vectors
The researchers curated a corpus from Media Monitoring Africa's database: 27,140 articles on COVID-19 vaccinations from 76 outlets, filtered to the 39 outlets with at least 100 articles each. Each article was prefixed with an outlet token (e.g., "News24_news24") so that the outlet itself received an embedding. Ten bootstrap Word2Vec Skip-gram models were then trained with Gensim (dimension 200, window 10), each on a resample of 3,900 articles (100 per outlet).
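The resampling step above might look roughly like the following sketch. Several details are assumptions rather than facts from the preprint: sampling with replacement (suggested by the word "bootstrap"), the `outlet_token` format (guessed from the single "News24_news24" example), and the toy corpus. The `train_skipgram` helper shows how the stated settings map onto Gensim 4.x's `Word2Vec` parameters (`vector_size`, `window`, `sg`); it imports Gensim lazily and is not called in the demo, so the rest of the block runs without Gensim installed.

```python
import random

def outlet_token(outlet):
    """Hypothetical token format, guessed from the example "News24_news24"."""
    return f"{outlet}_{outlet.lower()}"

def resample_corpus(articles_by_outlet, per_outlet=100, seed=0):
    """Draw `per_outlet` articles per outlet with replacement and
    prepend the outlet token to each tokenized article."""
    rng = random.Random(seed)
    sample = []
    for outlet, articles in sorted(articles_by_outlet.items()):
        token = outlet_token(outlet)
        for tokens in rng.choices(articles, k=per_outlet):
            sample.append([token] + tokens)
    return sample

def train_skipgram(sentences, seed):
    """Train one Skip-gram model with the settings named in the text.
    Requires the third-party gensim package (4.x parameter names)."""
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, vector_size=200, window=10,
                    sg=1, seed=seed, workers=1)

# Invented toy corpus: two outlets, three tokenized articles each.
toy_corpus = {
    "News24": [["vaccine", "rollout"], ["clinic", "queue"], ["booster", "dose"]],
    "Moneyweb": [["investor", "vaccine"], ["market", "recovery"], ["business", "mandate"]],
}

# Ten resamples, one per bootstrap model (5 articles per outlet in this toy run).
resamples = [resample_corpus(toy_corpus, per_outlet=5, seed=s) for s in range(10)]
print(len(resamples), "resamples of", len(resamples[0]), "articles each")
print(resamples[0][0])
```

In a full run, each resample would be passed to `train_skipgram` with its own seed, giving ten models whose outlet tokens can then be compared.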
The race dimension was built from validated word pairs (Black-white, African-European, etc.). Socioeconomic stereotype vocabularies were drawn from MMP (1999), Talbot & Durrheim (2012), and ChatGPT-assisted curation: the Black pole included terms such as "township," "SASSA grant," "R350 relief," and "NSFAS," while the White pole included "investor," "taxpayer," and "privilege." Health stereotypes were tested as well. Outlet vectors were averaged and projected onto the race dimension to yield bias scores. The approach was validated via WEAT (effect size -1.04, p<0.001), correlation with a prior study (r=0.75), and ratings from 26 South Africans (r=0.61, α=0.54).
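One way to turn outlet tokens into bias scores across several bootstrap models is sketched below. It makes an assumption the article does not confirm: because separately trained embedding spaces are not aligned with one another, the sketch projects each outlet onto the race dimension within each model and then averages the resulting scores, rather than averaging raw vectors across models. The two toy "models" are invented dictionaries with deliberately swapped axes to show why that matters; the outlet names and function names are hypothetical.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def outlet_bias_scores(models, outlets, black_terms, white_terms):
    """Average, over models, of each outlet's cosine with the race dimension.
    Positive = closer to the Black pole, negative = closer to the White pole."""
    per_outlet = {o: [] for o in outlets}
    for emb in models:
        black = mean_vector([emb[w] for w in black_terms])
        white = mean_vector([emb[w] for w in white_terms])
        dim = [b - w for b, w in zip(black, white)]
        for o in outlets:
            per_outlet[o].append(cosine(emb[o], dim))
    return {o: statistics.mean(vals) for o, vals in per_outlet.items()}

# Two invented 2-d "models"; the second has its axes swapped on purpose.
model_1 = {"black": [1.0, 0.0], "white": [0.0, 1.0],
           "OutletA": [0.2, 0.9], "OutletB": [0.8, 0.3]}
model_2 = {"black": [0.0, 1.0], "white": [1.0, 0.0],
           "OutletA": [0.9, 0.2], "OutletB": [0.3, 0.8]}

bias = outlet_bias_scores([model_1, model_2], ["OutletA", "OutletB"],
                          ["black"], ["white"])
for outlet, score in sorted(bias.items(), key=lambda kv: kv[1]):
    print(f"{outlet}: {score:+.3f}")
```

Even though the raw coordinates differ between the two toy models, each outlet receives the same score in both, because the score is measured relative to that model's own race dimension.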
Main Findings: Consistent Socioeconomic Race Bias Across the Board
The race dimension replicated previously documented biases: the Black pole sat near poverty and crime terms ("informal settlement," "gangster," "load shedding"), while the White pole sat near prosperity terms ("economy," "business rescue"). All 39 outlets showed the bias, with no outliers, which suggests it is systemic.
- Business/finance (Moneyweb, Business Day, Fin24): Strongest White-positive bias (strong wealth links), low Black-negative bias.
- Metropolitan/community/government (Daily Sun, Cape Argus, SABC): Strongest Black-negative bias (welfare/crime), low White-positive bias.
- Mid-tier (The Conversation, Bhekisisa): More balanced, but still biased in both directions.
Health bias was weaker (WEAT -0.36, marginal p): unhealthy terms ("HIV," "TB") sat nearer the Black pole, an effect possibly amplified by the COVID-19 focus of the corpus.
Business Outlets vs. Community Papers: Divergent Worlds?
Business media (e.g., Financial Mail, cosine -0.15, toward the White pole) reflect their affluent, white-skewed audience and prioritize investor concerns. Community papers (e.g., Daily Sun, 0.12, toward the Black pole) cover townships and grants, mirroring their readers' demographics but reinforcing stereotypes. The SABC sits closer to neutral, but all outlets embed bias through topic selection and framing.
Human Validation and Robustness Checks
South African raters scored 40 socioeconomic terms on race association (from 1 = Black to 7 = White), and their ratings correlated strongly with the embedding scores (r=0.61). Sensitivity tests (different seeds and dimensions) confirmed stability. Health terms showed a weaker correlation with human ratings (r=0.27), suggesting subtler bias.
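The validation step above boils down to a Pearson correlation between mean human ratings and embedding scores. The sketch below uses invented numbers for four terms and a hand-written `pearson_r`. One assumption is worth flagging: since the rating scale runs from 1 (Black) to 7 (White) while positive embedding scores point to the Black pole, the sketch reverses the ratings (8 minus the rating) so both measures point the same way. Whether the authors aligned the scales this way is not stated in the article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented mean ratings on the 1 (Black) to 7 (White) scale.
human_ratings = {"grant": 2.1, "township": 1.8, "investor": 5.9, "taxpayer": 5.2}
# Invented embedding scores; positive = closer to the Black pole.
embedding_scores = {"grant": 0.21, "township": 0.30, "investor": -0.25, "taxpayer": -0.12}

terms = sorted(human_ratings)
reversed_ratings = [8 - human_ratings[t] for t in terms]  # now higher = more Black-associated
projections = [embedding_scores[t] for t in terms]

r = pearson_r(reversed_ratings, projections)
print(f"r = {r:.2f} over {len(terms)} terms (toy data)")
```

Without the reversal the same data would give a correlation of equal size and opposite sign, so the direction of each scale has to be settled before r is interpreted.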
Implications for Journalism and Democracy
This audit shows that the bias is institutional rather than individual: newsrooms' audience targeting perpetuates divides. In South Africa's fragile democracy, skewed portrayals erode trust and fuel polarization (e.g., farm murders vs. township crime). COVID-19 coverage amplified inequities, as frames of Black poverty versus white privilege hindered a unified response. A companion paper has been published in EPJ Data Science.
Towards Solutions: AI Audits, Diversity, Training
- Scalable Monitoring: Embeddings enable regular audits, holding outlets accountable.
- Diversity Quotas: More Black editors/sub-editors, per SAHRC recommendations.
- Training: NLP literacy, bias workshops (SANEF-led).
- Policy: Strengthen Press Council, incentivize balanced coverage.
Higher Education's Role: NLP for Social Justice
UJ's Department of Psychology exemplifies the impact of interdisciplinary higher education, bringing NLP together with social psychology. Such tools empower researchers to track how bias evolves. For academics, explore careers in computational social science via research positions or faculty roles.
Limitations and Future Directions
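Static embeddings miss context, so future work could use contextual models such as BERT or XLNet. The COVID-19 corpus limits generalizability, so the analysis should be expanded to full news archives. Coloured and Indian dimensions should be included, and bias should be tracked longitudinally beyond the study period.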
This UJ preprint revives the SAHRC's calls for media transformation, offering AI as an ally against subtle racism. With elections looming, unbiased reporting is vital for social cohesion. The researchers urge newsrooms to audit their language, diversify their staff, and prioritize society over niche audiences. Explore SA higher ed opportunities at AcademicJobs South Africa.

