Description

Answering complex questions over knowledge bases (KB-QA) must cope with huge input data: billions of facts involving millions of entities and thousands of predicates. For efficiency, QA systems first reduce the answer search space by identifying a set of facts that is likely to contain all answers and relevant cues. The most common technique for doing this is to apply named entity disambiguation (NED) systems to the question and retrieve KB facts for the disambiguated entities.
This work presents CLOCQ, an efficient method that prunes irrelevant parts of the search space using KB-aware signals. CLOCQ uses a top-k query processor over score-ordered lists of KB-items that combine signals about lexical matching, relevance to the question, coherence among candidate items, and connectivity in the KB graph. Experiments with two recent QA benchmarks for complex questions demonstrate the superiority of CLOCQ over state-of-the-art baselines with respect to answer presence, size of the search space, and runtimes.
GitHub link to CLOCQ code

Overview

For search space reduction, CLOCQ takes as input all facts in the KB and the question, and retrieves a set of candidate KB-items for each question word. These KB-items are scored using global signals (connectivity in the KB graph, semantic coherence) and local signals (question relatedness, term matching), and the top-k KB-items for each question word are selected. Since the choice of k is not straightforward, CLOCQ provides a mechanism to set k automatically for each individual question word. For the question "who scored in the 2018 final between france and croatia?", "2018 final" is more ambiguous than "scored", so CLOCQ would consider more KB-items for it to account for potential errors.
Finally, salient facts with disambiguated KB-items are retrieved and can be passed to a QA system as the search space.
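The per-word scoring and dynamic top-k selection described above can be sketched as follows. This is a minimal illustration only: the signal names, equal weights, and the gap-based heuristic for choosing k are assumptions for exposition, not the actual CLOCQ algorithm.

```python
# Illustrative sketch of per-word candidate scoring and dynamic top-k
# selection. Signal names, weights, and the k-selection heuristic are
# assumptions, not the actual CLOCQ implementation.

def score(candidate, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine local signals (term match, question relatedness) and
    global signals (KB connectivity, coherence) into one score."""
    w1, w2, w3, w4 = weights
    return (w1 * candidate["term_match"]
            + w2 * candidate["relatedness"]
            + w3 * candidate["connectivity"]
            + w4 * candidate["coherence"])

def top_k_for_word(candidates, ambiguity_threshold=0.1, k_min=1, k_max=5):
    """Keep more candidates when top scores are close (an ambiguous word),
    fewer when one candidate clearly dominates."""
    ranked = sorted(candidates, key=score, reverse=True)
    k = k_min
    for prev, cur in zip(ranked, ranked[1:]):
        if score(prev) - score(cur) < ambiguity_threshold and k < k_max:
            k += 1  # scores are close: keep this candidate as well
        else:
            break
    return ranked[:k]
```

The intent mirrors the example above: for an ambiguous phrase like "2018 final", several candidates score similarly and more of them are kept; for a clear-cut word, a single dominant candidate suffices.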

Some example disambiguations of CLOCQ (and baselines) can be found here:


Correct disambiguations are colored in green. The examples illustrate how CLOCQ adjusts the parameter k dynamically, allowing for disambiguation errors in the case of very ambiguous question words (e.g. "All We Know", or "son"). In the first example, even though CLOCQ maps some question words to incorrect KB-items, its robustness helps to identify the KB-items that matter for answering the question (Football team, Düsseldorf, and Fortuna Düsseldorf).
For additional details, please refer to the paper.

Contact

For feedback and clarifications, please contact: Philipp Christmann (pchristm AT mmci DOT uni HYPHEN saarland DOT de), Rishiraj Saha Roy (rishiraj AT mpi HYPHEN inf DOT mpg DOT de) or Gerhard Weikum (weikum AT mpi HYPHEN inf DOT mpg DOT de).

To know more about our group, please visit https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/question-answering/.

Paper

"Beyond NED: Fast and Effective Search Space Reduction for Complex Question Answering over Knowledge Bases", Philipp Christmann, Rishiraj Saha Roy, and Gerhard Weikum. In WSDM '22, Phoenix, Arizona, 21–25 February 2022.
[Extended version] [Code] [Poster] [Slides] [Video] [Extended Video]

API

The CLOCQ API currently uses the 2022-01-31 Wikidata dump, processed as outlined in the paper. The API might be updated with a more recent Wikidata dump from time to time; as a rule of thumb, we consider updating the dump on a yearly basis.
We provide an API for accessing the CLOCQ code and for using its efficient KB representation to retrieve information from Wikidata. Please note that the API should not be used for an efficiency analysis of the method, since the API setup is not optimized in that regard; in particular, it is not yet clear how well the API scales when accessed by multiple clients simultaneously. You can find an example Python snippet for integrating the CLOCQ API into your project here: CLOCQ_api_client.py.
With a hop, we refer to a fact-centric hop as defined in the paper. Please do not hesitate to contact us in case of any questions. Also, any kind of feedback is highly appreciated!

Retrieve search space (Wikidata facts) for question.
GET /api/search_space

?question=
Required field.
&k=
Optional field.
&p=
Optional field.
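A call to this endpoint can be sketched as below using only the Python standard library. The base URL is a placeholder assumption (see CLOCQ_api_client.py for the official client); the parameter names match the endpoint listing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://clocq.example.org"  # placeholder host, not the real endpoint

def build_url(endpoint, **params):
    """Assemble a GET URL for a CLOCQ API endpoint; None values are skipped."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/api/{endpoint}?{query}"

def get_search_space(question, k=None, p=None):
    """Fetch the search space (Wikidata facts) for a question.
    `question` is required; `k` and `p` are optional."""
    with urlopen(build_url("search_space", question=question, k=k, p=p)) as resp:
        return json.load(resp)
```

The same `build_url` helper works for the other endpoints below, since they all follow the same query-string convention.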
Retrieve fact-centric 1-hop neighborhood for given Wikidata-item ID.
GET /api/neighborhood

?item=
Required field.
&p=
Optional field.
Retrieve label for given Wikidata-item ID.
GET /api/item_to_label

?item=
Required field.
Retrieve aliases for given Wikidata-item ID (if available).
GET /api/item_to_aliases

?item=
Required field.
Retrieve description for given Wikidata-item ID (if available).
GET /api/item_to_description

?item=
Required field.
Retrieve types for given Wikidata-item ID.
GET /api/item_to_types

?item=
Required field.
Compute frequency for given Wikidata-item ID.
GET /api/frequency

?item=
Required field.
Connectivity check for the two Wikidata-item IDs (within 2-hops).
GET /api/connectivity_check

?item1=
Required field.
&item2=
Required field.
Shortest path(s) between the two Wikidata-item IDs (if available in 2-hops).
GET /api/connect

?item1=
Required field.
&item2=
Required field.
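For the two-item endpoints, the query string simply carries both IDs. A minimal sketch (the host name is a placeholder; Q142 and Q224 are the real Wikidata IDs for France and Croatia, used here only for illustration):

```python
from urllib.parse import urlencode

BASE_URL = "https://clocq.example.org"  # placeholder host, not the real endpoint

# Check whether France (Q142) and Croatia (Q224) are connected within 2 hops.
query = urlencode({"item1": "Q142", "item2": "Q224"})
url = f"{BASE_URL}/api/connectivity_check?{query}"
```

Replacing `connectivity_check` with `connect` in the URL retrieves the shortest path(s) between the two items instead.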