Project description

Import lists of names from Wikidata for analytics of your document sets, structured navigation, interactive filters and aggregated overviews

Open data from open-access databases such as Wikidata (structured data from Wikipedia, published in open standards for linked data and the Semantic Web) can be used to improve search, analysis, filtering and navigation of your private documents, news and data.

Import multilingual lists of names from Wikidata as a universal ontology or thesaurus to
– see which of these names occur in your document sets or search results (aggregated overview)
– get interactive filters for your documents by these names or entities, such as persons (for example all politicians of your parliament), organizations or locations
– automatically find aliases and the same names written in other languages
– configure alerts, feeds or leads for new documents in which important concepts or people occur, by searching or filtering on those lists or facets

Wikidata as an open database of fine-grained, structured, multilingual linked data
If a list, dictionary, vocabulary or thesaurus already exists, or if the entries are already available in another structured format such as a Semantic Web ontology from an open data source like Wikidata, you don’t need to manually enter each named entity (persons, organizations, locations, or important words and concepts) into the thesaurus in order to use them for analytics, navigation, aggregated overviews and interactive filters.

One of the biggest universal open data sources is Wikidata, the multilingual structured database from Wikipedia.
From this multilingual, fine-grained structured data source you can select and download lists of names, for example of people such as politicians, for your analysis of document sets and news, interactive filters (faceted search) or alerts.
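
As an illustration of selecting such a list, the following minimal sketch builds a SPARQL query for the public Wikidata Query Service. It is not taken from the project itself. The function names are hypothetical, and the example assumes you want politicians (Q82955) who are citizens of Germany (Q183); adjust the identifiers to your own use case.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_names_query(occupation_qid, country_qid, limit=100):
    """Build a SPARQL query for persons with a given occupation and citizenship.

    P106 is Wikidata's "occupation" property, P27 is "country of citizenship".
    Labels and aliases are returned in all languages, one row per combination,
    so LIMIT restricts the number of rows, not the number of persons.
    """
    return (
        "SELECT ?person ?label ?alias WHERE {\n"
        f"  ?person wdt:P106 wd:{occupation_qid} ;\n"
        f"          wdt:P27 wd:{country_qid} ;\n"
        "          rdfs:label ?label .\n"
        "  OPTIONAL { ?person skos:altLabel ?alias . }\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return WIKIDATA_SPARQL_ENDPOINT + "?" + params


# Hypothetical example: politicians (Q82955) with German citizenship (Q183).
query = build_names_query("Q82955", "Q183", limit=50)
url = build_request_url(query)
```

The resulting URL can be fetched with any HTTP client; the sketch stops before the network call so that it stays runnable offline.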

Automatic tagging with named entities from lists, such as names of people or organizations from Wikidata
Using this list of important names or ontology, the named entity tagger tags the documents in which these names occur.
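
The idea of dictionary-based tagging can be sketched in a few lines. This is a simplified illustration, not the tagger the project actually uses; the function name `tag_entities` and the sample data are assumptions made for the example.

```python
import re


def tag_entities(text, names):
    """Return the names from `names` that occur in `text` as whole words."""
    found = []
    for name in names:
        # \b prevents a name from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical name list (e.g. imported from Wikidata) and document text.
names = ["Angela Merkel", "European Commission", "Berlin"]
document = "Angela Merkel met members of the European Commission on Tuesday."
tags = tag_entities(document, names)
```

A loop over all names is fine for short lists; for lists with many thousands of entries a dedicated matcher or an index-based tagger would be more appropriate.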

Leads for research by matching watchlists
This gives you leads to potentially relevant documents in which important concepts or people from such a watchlist or ontology occur, for example when analysing a large unknown document set or dataset, or after receiving new documents or news.

Analytics, aggregated overviews and exploration
Faceted search gives you an aggregated overview of the different facets, such as concepts, persons, locations or organizations, showing how many of the found documents (or of all documents) match each entity.
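
What such an aggregated overview computes can be shown with a small in-memory sketch. The document structure and the field names `persons` and `organizations` are assumptions for illustration; in practice the search engine calculates these counts.

```python
from collections import Counter

# Hypothetical tagged documents; each facet holds the entities found in the text.
documents = [
    {"id": "doc1", "persons": ["Angela Merkel"], "organizations": ["Bundestag"]},
    {"id": "doc2", "persons": ["Angela Merkel"], "organizations": []},
    {"id": "doc3", "persons": [], "organizations": ["Bundestag", "European Commission"]},
]


def facet_counts(docs, facet):
    """Count, per entity, how many documents carry that entity in `facet`."""
    counts = Counter()
    for doc in docs:
        # set() ensures a document is counted once per entity.
        counts.update(set(doc.get(facet, [])))
    return counts


person_counts = facet_counts(documents, "persons")
org_counts = facet_counts(documents, "organizations")
```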

Interactive filters (faceted search)
You can use these overviews or named entities as interactive filters to narrow down search results.
A click on a facet (e.g. an organization) drills down the search results to fewer documents, namely those that match this additional facet/filter as well.
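
Since the project lists Apache Solr among its technologies, a drill-down can be pictured as adding a filter query to the search request. The sketch below only builds the query string from standard Solr parameters (`q`, `facet`, `facet.field`, `fq`); the field names `person_ss` and `organization_ss` and the helper name are hypothetical and not taken from the project's schema.

```python
from urllib.parse import urlencode


def build_solr_params(query, facet_fields, filters):
    """Build a Solr query string with facet counts and drill-down filters."""
    params = [("q", query), ("facet", "true")]
    for field in facet_fields:
        # Ask Solr to return counts for each value of this field.
        params.append(("facet.field", field))
    for field, value in filters.items():
        # Each fq narrows the result set to documents matching this facet value.
        # Assumption: `value` contains no double quotes that would need escaping.
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


# Hypothetical search for "budget", drilled down to one organization.
query_string = build_solr_params(
    "budget",
    facet_fields=["person_ss", "organization_ss"],
    filters={"organization_ss": "Bundestag"},
)
```

Each additional click on a facet would add one more `fq` entry, so the result set can only shrink.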

Multilingual labels, alternate labels and aliases for Semantic Search
Because the open standards used for the Semantic Web and Linked Data allow labels such as names to be multilingual, you will find the names in different languages.
For example, after importing names of politicians from Wikidata, a semantic search for Angela Merkel will also find documents that contain Angela Dorothea Kasner, which is Angela Merkel's birth name from before she married.
Documents that contain Angela Merkelowa or Ангела Меркель, which is Angela Merkel written in Russian, will be found as well.
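
The effect of alternate labels on search can be sketched as a simple query expansion. This is an illustration under assumed data, not the project's implementation; the alias index and function names are hypothetical.

```python
# Hypothetical alias index: preferred label -> alternate labels and translations.
aliases = {
    "Angela Merkel": ["Angela Dorothea Kasner", "Ангела Меркель"],
}


def expand_query(term, alias_index):
    """Return an OR query containing the term and all of its known aliases."""
    variants = [term] + alias_index.get(term, [])
    return " OR ".join(f'"{variant}"' for variant in variants)


def search(term, docs, alias_index):
    """Return ids of documents containing the term or any of its aliases."""
    variants = [v.lower() for v in [term] + alias_index.get(term, [])]
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(variant in text.lower() for variant in variants)
    ]


# Hypothetical documents in different languages.
docs = {
    "doc1": "A speech by Angela Merkel in Berlin.",
    "doc2": "School records mention Angela Dorothea Kasner.",
    "doc3": "Ангела Меркель посетила Москву.",
    "doc4": "An unrelated report.",
}
expanded = expand_query("Angela Merkel", aliases)
hits = search("Angela Merkel", docs, aliases)
```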

What makes this project innovative?

Open source software for privacy-aware analytics and text mining of large document collections using open data such as Wikidata on your own computer, so no documents and none of the names that occur in your documents are sent to cloud services.

What was the impact of your project? How did you measure it?

Free open source software with freely available code, used by multiple research projects, archives and investigative journalists. You can follow and collaborate on further development on GitHub.

Source and methodology

Because the open database Wikidata uses open standards for Linked Data and Semantic Web interoperability, its structured data in RDF/XML format, linked via the RDFS and SKOS thesaurus standards, is interoperable. It can be used out of the box to configure interactive filters and analytics for your document set in the Open Semantic Search open source tools for dictionary-based named entity recognition and text mining.
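
To make the SKOS part concrete, the following sketch reads preferred and alternate labels from a small RDF/XML snippet using only the Python standard library. The sample data, the example.org URI and the function name are assumptions for illustration. Real RDF/XML can be serialized in several other ways that this simple reader does not handle, so a full RDF library would be the safer choice in practice.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical SKOS concept with one preferred and two alternate labels.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/entity/angela-merkel">
    <skos:prefLabel xml:lang="en">Angela Merkel</skos:prefLabel>
    <skos:altLabel xml:lang="en">Angela Dorothea Kasner</skos:altLabel>
    <skos:altLabel xml:lang="ru">Ангела Меркель</skos:altLabel>
  </skos:Concept>
</rdf:RDF>
"""


def load_labels(rdf_xml):
    """Map each concept URI to a list of (label type, language, text) tuples."""
    root = ET.fromstring(rdf_xml.strip())
    concepts = {}
    for concept in root.findall(f"{{{SKOS_NS}}}Concept"):
        uri = concept.get(f"{{{RDF_NS}}}about")
        labels = []
        for tag in ("prefLabel", "altLabel"):
            for element in concept.findall(f"{{{SKOS_NS}}}{tag}"):
                labels.append((tag, element.get(XML_LANG), element.text))
        concepts[uri] = labels
    return concepts


concepts = load_labels(SAMPLE)
labels = concepts["http://example.org/entity/angela-merkel"]
```

The resulting tuples carry everything a dictionary-based tagger needs: the text to match, its language, and the concept it belongs to.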

Technologies Used

Apache Solr
Apache Tika
Apache OpenNLP

