The rapid proliferation of generative artificial intelligence (AI) and large-scale natural language models such as OpenAI’s ChatGPT and Meta’s LLaMA is reshaping the way information is gathered, analyzed and processed to feed mankind’s insatiable appetite for content. With this new landscape comes a new set of challenges for reputation management, not the least of which is the proclivity of AI models to learn and perpetuate errors, falsehoods and biases in the data on which they are trained.

Among the myriad sources of information being used to train large language models, Wikipedia stands out as a cornerstone. According to research conducted in 2023 by the Allen Institute for AI and published by the Washington Post, of the 15 million websites used to train Google’s T5 and Meta’s LLaMA large language models, Wikipedia was the number one news and media site and the second-largest source of training data overall, edged out only by Google’s own global patent repository.

That Wikipedia features so prominently in the evolution of generative AI should come as no surprise. Since its launch in January 2001, Wikipedia has become the world’s go-to encyclopedia of online knowledge, making it an indispensable resource for researchers, journalists, students and, increasingly, AI developers. That’s because Wikipedia’s vast collection of more than 60 million articles in 309 languages (6.8 million in English alone) on nearly every topic under the sun makes it a virtually bottomless data set for training large language models. Therein lies the problem. Because Wikipedia can be edited by anyone at any time, it’s not uncommon for inaccurate, biased or malicious information to creep into articles unnoticed. Once an article is ingested by a large language model, any inaccuracies it contains can appear in the model’s output, ready to be shared, manipulated and incorporated into other content, all but guaranteeing their wider propagation.

The good news is that in the 23 years since its launch, Wikipedia has become increasingly accurate. So much so that some researchers have concluded that content appearing in popular, highly trafficked Wikipedia articles can be completely reliable due to the sheer number of editors frequently reviewing and modifying the content.

A challenge, however, remains with the millions of smaller, less popular articles that don’t attract regular attention from Wikipedia editors. These articles are far more likely to contain information that is inaccurate, outdated or otherwise inappropriate for inclusion on Wikipedia. Safeguarding the reputations of organizations, individuals or brands that are the subjects of such articles, especially ones that are lightly trafficked, can present significant challenges.

Many believe that the size and scope of Wikipedia’s global legion of volunteer editors provide the best assurance that information is accurate. Wikipedia’s unregulated points of entry, which borrow the basic concept pioneered by open-source software developers, would appear to support this theory. In truth, however, Wikipedia’s community of editors is not quite as large as many think.

As of April 2024, the English-language version of Wikipedia has more than 47 million registered users. However, only about 123,000 are considered “active,” which is defined as having taken at least one action on the platform in the preceding 30 days. The most engaged group of users, those who have earned the title of “extended confirmed,” have accounts that are at least 30 days old and have made more than 500 edits. This group consists of just 69,000 people around the world. From a workload perspective alone, maintaining the accuracy of an encyclopedia with more than 6.8 million entries and 14,000 new articles added every month is a Sisyphean endeavor.
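To put that workload in rough numbers, the figures cited above can simply be divided out. The short Python sketch below does this; the function and constant names are illustrative, and the even split of articles across editors is a simplifying assumption, not a description of how editing effort is actually distributed.

```python
def articles_per_editor(articles: int, editors: int) -> float:
    """Average number of articles each editor would need to cover,
    assuming (unrealistically) that the work is spread evenly."""
    return articles / editors


# Figures taken from the paragraph above (April 2024, English Wikipedia).
ENGLISH_ARTICLES = 6_800_000
ACTIVE_USERS = 123_000
EXTENDED_CONFIRMED = 69_000

per_active = articles_per_editor(ENGLISH_ARTICLES, ACTIVE_USERS)
per_extended = articles_per_editor(ENGLISH_ARTICLES, EXTENDED_CONFIRMED)

print(f"Articles per active user: {per_active:.0f}")
print(f"Articles per extended confirmed editor: {per_extended:.0f}")
```

Under that assumption, each active user would be responsible for roughly 55 articles and each extended confirmed editor for roughly 99, before counting any newly created articles.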

The independence and neutrality of Wikipedia’s editors, coupled with the understanding that no one “owns” the content created on the platform, further complicate matters. Even the subject of an article has no legal right or standing to dictate what the article may say or what facts it should contain or omit.

To mitigate the risks posed by inaccurate or unsavory information, communications professionals need to proactively manage how their organizations and clients appear on Wikipedia. Since Wikipedia users are strongly discouraged from editing articles about themselves, their employers or any topic on which they could be perceived to have a conflict of interest, this calls for a multifaceted approach, including:

  • Continuously monitoring for changes to Wikipedia articles using a combination of automated tools and dedicated personnel.
  • Developing protocols for swiftly addressing inaccurate content through the platform’s editing and moderation processes. This can include creating a dedicated corporate account, properly disclosing potential conflicts of interest and providing Wikipedia editors with corrective information from reliable, independent sources.
  • Collaborating with the Wikipedia community by engaging with editors and administrators to foster transparency, address concerns and ensure the accuracy of information presented on Wikipedia.
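
As an illustration of the automated side of monitoring, the sketch below uses the public MediaWiki Action API (the `action=query`, `prop=revisions` endpoint) to list recent revisions of an article and pick out those newer than the last one reviewed. It is a minimal example under stated assumptions, not a recommended tool: the function names, the User-Agent string and the idea of storing a "last seen" revision ID are all hypothetical choices made for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(title: str, limit: int = 5) -> str:
    """Build a MediaWiki Action API URL for an article's latest revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_revisions(title: str, limit: int = 5) -> dict:
    """Fetch the JSON payload (requires network access)."""
    # Placeholder User-Agent; replace with real contact details before use.
    headers = {"User-Agent": "article-monitor-sketch/0.1 (example@example.org)"}
    with urlopen(Request(build_query(title, limit), headers=headers), timeout=10) as resp:
        return json.load(resp)


def new_revisions(payload: dict, last_seen_revid: int) -> list:
    """Return revisions newer than last_seen_revid, oldest first."""
    found = []
    for page in payload.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            if rev["revid"] > last_seen_revid:
                found.append(rev)
    return sorted(found, key=lambda rev: rev["revid"])
```

In use, a scheduled job could call `fetch_revisions("Article title")`, pass the result to `new_revisions` along with the last revision ID a person has already reviewed, and route anything returned to dedicated personnel for a human check. The code only detects that a change happened; judging whether the change is accurate, and raising it with editors through the proper disclosure channels, remains a manual step.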

As AI and natural language models continue to rely on Wikipedia as a primary source of information, making sure that articles on the platform are accurate and up to date, and that they paint clear, unbiased portraits of their subjects, has never been more critical. Communicators can significantly reduce the risks to their organizations and clients by proactively monitoring articles, establishing clear processes for responding to problems, engaging collaboratively with the Wikipedia community and seeking expert advice where needed on the cultural norms, community standards and complex rules that govern the Wikipedia ecosystem.