The late English writer Douglas Adams is best known as the author of the 1979 book The Hitchhiker’s Guide to the Galaxy. But there is much more to Adams than what is written in his Wikipedia entry. Whether or not you need to know that his birth sign is Pisces or that libraries worldwide store his books under the same string of numbers — 13230702 — you can find out if you head to an overlooked corner of the Wikimedia Foundation called Wikidata.
There, images, text, keywords, and other information related to Adams are stored both on a webpage and, for the robots among us, in formats designed for machines, like JSON.
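For a sense of what that machine-readable side looks like, here is a minimal sketch in Python, using only the standard library, that pulls Adams’s entry (Wikidata item Q42) as JSON from Wikidata’s public Special:EntityData endpoint. The fields follow Wikidata’s standard entity format; the exact description text may differ from what the comments suggest.

```python
# Minimal sketch: fetch the machine-readable (JSON) version of the
# Wikidata entry for Douglas Adams, whose Wikidata identifier is Q42.
import json
import urllib.request

URL = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"

with urllib.request.urlopen(URL) as response:
    entity = json.load(response)["entities"]["Q42"]

# Standard Wikidata entity fields include labels, descriptions, and claims.
print(entity["labels"]["en"]["value"])        # "Douglas Adams"
print(entity["descriptions"]["en"]["value"])  # short English description of the item
print(len(entity["claims"]))                  # number of properties recorded for him
```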
Now, Wikidata is getting a new AI-friendly database that makes it easier for large language models to ingest its information. The database comes from the Wikidata Embedding Project at Wikimedia Deutschland, the German chapter of the Wikimedia Foundation, which oversees Wikidata. The Berlin-based team spent the past year using a large language model to turn the 19 million entries within Wikidata from clunkily structured data into vectors that capture the context and meaning around each entry.
In this vectorized format, information is best imagined as a graph of dots and interconnected lines — Adams would be connected to “human” as well as to the titles of his books, Lydia Pintscher, Wikidata portfolio lead, told The Verge.
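To make the vector idea concrete, here is a rough sketch of what embedding entries looks like in practice. It uses the open sentence-transformers library with a small stand-in model (all-MiniLM-L6-v2), not the model or pipeline the Wikidata team actually used, and the entry texts are simplified, made-up renderings.

```python
# Rough sketch of "vectorizing" entries: each text is mapped to a vector,
# and related entries end up close together in that vector space.
# The model and entry texts are stand-ins, not the project's actual pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

entries = [
    "Douglas Adams: English writer, author of The Hitchhiker's Guide to the Galaxy",
    "Terry Pratchett: English author of comic fantasy novels",
    "Photosynthesis: process by which plants convert light into chemical energy",
]
vectors = model.encode(entries)

# The two author entries should score closer to each other than either
# does to the unrelated science entry.
print(util.cos_sim(vectors[0], vectors[1]).item())  # higher similarity
print(util.cos_sim(vectors[0], vectors[2]).item())  # lower similarity
```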
While the front-end user experience will remain the same — no, Wikipedia is not becoming a chatbot, the project leaders say — the back end will become easier for AI developers to access when building, for example, their own chatbots using the data.
The goal of the project is to level the playing field for AI developers outside the monied core of Big Tech, Pintscher said. Companies like OpenAI and Anthropic have the resources to vectorize Wikidata, just like Pintscher and her team did. It’s the smaller outfits that most benefit from the new access to curated data stored in the vaults of Wikidata. “Really, for me, it’s about giving them that edge up and to at least give them a chance, right?” Pintscher said.
She points to Govdirectory as an example of a project that harnessed Wikidata’s vast store of volunteer-curated data for good. The platform allows users to find the social media handles and email addresses of public officials across the world.
Most AI chatbots prioritize popular words and topics across the internet. In addition to giving Little Tech a leg up, the team hopes that easier access to Wikidata will result in AI systems that better reflect niche topics not widely represented across the internet, Pintscher said. This could be a better way to get information into ChatGPT, for instance, than “generating a ton of content and then waiting for the next time for ChatGPT to retrain, and maybe, or maybe not, taking into account what you contributed,” Pintscher said.
In practice, the vectors will allow AI systems to better access the context around information in addition to the information itself, Philippe Saadé, Wikidata AI project manager, told The Verge.
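As a hedged illustration of what that means for a downstream system (not the project’s actual API, whose details aren’t described here), the sketch below embeds a question, finds the nearest stored entry vectors, and would then hand those entries to a language model as grounding context, the retrieval pattern most chatbots use today.

```python
# Illustrative retrieval step: embed a question, find the nearest stored
# entry vectors, and surface those entries (with their context) to a model.
# The store, model, and entry texts are made up for the example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

entries = [
    "Douglas Adams: English writer, author of The Hitchhiker's Guide to the Galaxy",
    "Govdirectory: directory of official social media accounts of public agencies",
    "Wikidata: free, collaboratively edited knowledge base hosted by Wikimedia",
]
store = model.encode(entries, normalize_embeddings=True)

question = "Who wrote The Hitchhiker's Guide to the Galaxy?"
query = model.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is just a dot product.
scores = store @ query
best = int(np.argmax(scores))
print(entries[best])  # expected: the Douglas Adams entry
```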
The team used a model from AI company Jina AI to turn Wikidata’s structured data, captured through September 18th, 2024, into vectors. DataStax, an IBM company, currently provides the project with free infrastructure to store the vector database.
The team is waiting for feedback from developers who use the database before updating it with information added over the last year. While the current database does not include entirely new information added in that time, Saadé says small edits or tweaks to existing Wikidata entries will not diminish the database’s usefulness. “At the end of the day, the vector that we’re computing is like a general idea of an item, so if some small edit has been made on Wikidata, it’s not going to be super relevant,” he said.