This article presents a comprehensive study of the challenge of natural language question-answering over tabular data. We will explore several approaches that use deep learning, delving into the architectures and datasets employed, and compare their performance using appropriate evaluation metrics.
Answering questions on tabular data is a research problem in natural language processing (NLP), and numerous approaches to it exist. Some use heuristic methods to break down the natural language input and translate it into a structured query language, while others train deep learning models.
A good solution should understand the natural language question, relate it to the structure of the tabular data, and retrieve an accurate answer quickly and efficiently.
Tabular data is data that is structured into rows and columns. Since tables can be large, it is not easy to manually extract the answer to a particular user query, given a condition or a set of conditions. Tabular data can also be found in relational databases.
While Structured Query Language (SQL) queries are often used to extract particular information from a database, an individual’s knowledge of SQL queries may be limited. A more universal solution is to query the tabular data using natural language.
LN2SQL is one approach that uses heuristics to convert natural language to SQL queries. A set of parsers breaks down the natural language construct into its respective SQL components. A thesaurus improves keyword filtering, making it easier to translate specific entities like column names, SQL keywords, etc. The resulting SQL is then executed against the database to fetch results.
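To make the idea concrete, here is a minimal, hypothetical sketch of this kind of heuristic translation: keyword matching against a small thesaurus to assemble a SQL string. The schema, synonym table, and condition pattern below are invented for illustration; LN2SQL itself uses full parsers and a much richer thesaurus.

```python
# Minimal sketch of a heuristic NL-to-SQL translator (illustrative only).

COLUMNS = ["name", "salary", "department"]           # hypothetical schema
THESAURUS = {"pay": "salary", "earnings": "salary"}  # synonym -> column name

def to_sql(question: str, table: str = "employees") -> str:
    tokens = question.lower().replace("?", "").split()
    # Normalize synonyms to canonical column names via the thesaurus.
    tokens = [THESAURUS.get(t, t) for t in tokens]
    # Tokens before "where" select columns; tokens after form a condition.
    if "where" in tokens:
        idx = tokens.index("where")
        select_tokens, cond = tokens[:idx], tokens[idx + 1:]
    else:
        select_tokens, cond = tokens, []
    selected = [t for t in select_tokens if t in COLUMNS] or ["*"]
    sql = f"SELECT {', '.join(selected)} FROM {table}"
    if len(cond) >= 3:
        # Naive "<column> is <value>" pattern.
        sql += f" WHERE {cond[0]} = '{cond[2]}'"
    return sql

print(to_sql("show pay where department is sales"))
# SELECT salary FROM employees WHERE department = 'sales'
```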
Neural networks provide various methods for analyzing tabular data. For instance, a pre-trained model can capture the relationship between the intricate structure of a natural language query and tabular data: both the query and the data are embedded and then processed through the neural network model.
Other solutions are not pre-trained on datasets containing tabular information but use AI and some heuristics to arrive at the solution. These are known as AI-assisted tools. We will take a look at such a tool below.
As we saw earlier, one method of breaking down natural language into smaller components is to convert it into SQL queries. A SQL query is built from several components, including conditions, aggregate operations, and so on. TableQA converts natural language queries into SQL queries. Let's see a demo of how to use this tool to query cancer death data obtained from the Abu Dhabi open data platform.
First, install the package and its requirements via pip:
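Assuming the package is published on PyPI under the name `tableqa` (check the project's README if the name differs):

```bash
pip install tableqa
```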
Next, we’ll access the sample data from the repository.
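One way to do this, assuming the project's GitHub repository ships the sample CSVs, is to clone it:

```bash
git clone https://github.com/abhijithneilabraham/tableQA.git
```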
And now we’ll take our first look at the data.
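A quick way to preview it is with pandas. The file path below is illustrative; point it at wherever the cancer death CSV lives on your machine:

```python
import pandas as pd

# Path is an assumption -- adjust to your local copy of the dataset.
df = pd.read_csv("tableQA/tableqa/cleaned_data/cancer_death.csv")
print(df.head())
```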
Cancer death data
The file we are trying to query is a CSV (comma-separated values) file with columns like Year, Nationality, Gender, Cancer Site, Death Count, and age. Let’s try to answer some questions from this dataset.
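A sketch of the query step, following the `Agent` API from the TableQA README. The data directory path and the exact question wording are assumptions, reconstructed from the SQL discussed below:

```python
from tableqa.agent import Agent

# Point the agent at the directory containing the CSV(s); path is illustrative.
agent = Agent("tableQA/tableqa/cleaned_data")

# Ask a natural language question against the table.
response = agent.query_db("How many people died of stomach cancer under the age of 40?")
print(response)
```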
>>> [(24,)]
We could also take a look at the SQL query that was generated for the same question.
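Assuming the same agent as above, the TableQA README exposes this via `get_query()`. The printed SQL is illustrative, based on the description that follows:

```python
sql = agent.get_query("How many people died of stomach cancer under the age of 40?")
print(sql)
# Illustrative output:
# SELECT COUNT(Death_Count) FROM cancer_death WHERE age < 40 AND Cancer_Site = "Stomach"
```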
As you can see here, there is a COUNT operation done on the column `Death Count`, with conditions like `age<40` and `Cancer Site="Stomach"`.
TableQA builds the SQL query using a combination of a deep learning-based entity extractor and an aggregate clause classifier. The entity extractor identifies conditions by mapping columns to their respective values, if any appear in the question. It uses a pre-trained question-answering model trained on SQuAD to locate a column's value within the input question.
The aggregate clause classifier is a two-layer deep learning model that classifies the input question into one of the SQL aggregate clause types, such as COUNT, SUM, AVG, MAX, MIN, or no aggregation at all.
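TableQA's internal architecture is not reproduced here; the sketch below only shows what a two-layer classifier over question embeddings could look like. The embedding dimension, hidden size, and label set are all assumptions:

```python
import torch
import torch.nn as nn

AGGREGATES = ["NONE", "COUNT", "SUM", "AVG", "MAX", "MIN"]  # assumed label set

class ClauseClassifier(nn.Module):
    """Two-layer feed-forward classifier over a fixed-size question embedding."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, len(AGGREGATES)),
        )

    def forward(self, question_embedding: torch.Tensor) -> torch.Tensor:
        # Returns raw logits over the aggregate clause types.
        return self.net(question_embedding)

# Usage with a dummy embedding standing in for a real sentence encoder:
model = ClauseClassifier()
logits = model(torch.randn(1, 768))
print(AGGREGATES[logits.argmax(dim=-1).item()])
```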
Combining entity extraction and clause classification with the help of some heuristics generates a SQL query, which is then used to query the data. One possible advantage of this approach is that no retraining is necessary to add a new feature: custom fixes can be made by modifying the code, replacing the internal AI models with more accurate ones, and so on.
Heuristics-based solutions have their limitations: handling increasingly complex questions is difficult with explicit programming.
For example, we need to be able to answer questions that require multiple non-trivial steps, like: “Who finished first in this year’s F1 championship?” or “What is the cheapest mode of transport that departs tomorrow morning?”
In cases where human-like inference is required, a model pre-trained with a dataset containing such questions can be used.
WikiTableQuestions (WTQ) is a dataset consisting of complex questions paired with tables from Wikipedia. It contains 2,108 tables across a variety of topics (to cover broader domain knowledge) and 22,033 questions of varying complexity.
Microsoft has made available the Sequential Question Answering (SQA) dataset, where the task is to answer sequences of questions over HTML tables. SQA was created by decomposing 2,022 questions from WikiTableQuestions (WTQ); each WTQ question was decomposed three times, resulting in 6,066 sequences containing 17,553 questions in total. Each question is also associated with answers in the form of cell locations in the tables.
Another popular dataset for this task is WikiSQL, a dataset of 80,654 human-annotated pairs of questions and SQL queries distributed across 24,241 tables from Wikipedia. Unlike the other datasets above, WikiSQL contains SQL queries mapped from questions with annotated components.
In 2020, Google Research open-sourced TAPAS, a popular deep learning model for querying tabular data without generating logical forms. It extends the Masked Language Modeling (MLM) approach to structured data.
Like BERT, TAPAS encodes its input as embeddings, with additional positional embeddings that capture the tabular structure. It is jointly pre-trained end to end on text sequences and tables and learns to restore masked words and table cells. At inference time, it answers a query by selecting a subset of table cells; where applicable, an aggregation operation is performed on top of them. Various datasets have been benchmarked with TAPAS, including WTQ, SQA, and WikiSQL.
Let’s look at a sample.
First, install the requirements:
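Assuming the Hugging Face distribution of TAPAS used below, the requirements boil down to:

```bash
# Older transformers releases also required torch-scatter for TAPAS.
pip install torch transformers pandas
```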
Copy the model files from the cloud to our machine:
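The original release distributes checkpoints via Google Cloud Storage. Alternatively (this is a swap, not the original tutorial's exact step), the Hugging Face hub hosts fine-tuned TAPAS weights that download and cache locally on first use:

```python
from transformers import TapasForQuestionAnswering, TapasTokenizer

# WTQ fine-tuned checkpoint hosted on the Hugging Face hub.
MODEL_NAME = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(MODEL_NAME)
model = TapasForQuestionAnswering.from_pretrained(MODEL_NAME)
```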
We are going to make a `predict()` function that takes the table as a dataframe:
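Here is a minimal version, assuming the `tokenizer` and `model` loaded above; the aggregation handling follows the transformers documentation for TAPAS:

```python
import pandas as pd

def predict(table: pd.DataFrame, queries: list):
    """Answer natural language queries against a table."""
    # TAPAS expects every cell to be a string.
    table = table.astype(str)
    inputs = tokenizer(table=table, queries=queries,
                       padding="max_length", return_tensors="pt")
    outputs = model(**inputs)

    # Convert logits into cell coordinates and aggregation indices.
    coords, agg_indices = tokenizer.convert_logits_to_predictions(
        inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
    )

    aggregations = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
    answers = []
    for cells, agg in zip(coords, agg_indices):
        # Gather the selected cell values for each query.
        values = [table.iat[row, col] for row, col in cells]
        answers.append((aggregations[agg], values))
    return answers
```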
Now let’s use a sample table and load it as a dataframe. It consists of columns containing actor names and the number of movies they played in.
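The values below are illustrative; this actors-and-movies table mirrors the example used in the transformers documentation:

```python
import pandas as pd

data = {
    "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
    "Number of movies": ["87", "53", "69"],
}
table = pd.DataFrame(data)
```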
The model inference from the table can be seen below:
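A hypothetical run of the `predict()` function defined earlier; the printed answers are illustrative, since actual predictions depend on the checkpoint:

```python
queries = [
    "How many movies has George Clooney played in?",
    "What is the total number of movies?",
]
for query, (agg, cells) in zip(queries, predict(table, queries)):
    print(query, "->", agg, cells)

# Illustrative output:
# How many movies has George Clooney played in? -> COUNT ['69']
# What is the total number of movies? -> SUM ['87', '53', '69']
```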
There are certain limitations when using TAPAS. It can only handle a single table as context, and that table must fit into memory, which makes it unsuitable for large datasets or multiple databases. Additionally, TAPAS cannot process questions that require multiple aggregate operations, such as "How many actors have an average number of movies greater than 10?"
A good solution to this task depends on various factors and user preferences. Non-deep-learning solutions that use heuristics are the fastest, but at the cost of poor semantic understanding. A deep learning model like TAPAS is better at comprehending natural language questions and table structures, but raises concerns about memory usage and computational cost. An AI-assisted solution like TableQA has the advantage that its heuristics can be modified to improve performance in most cases without retraining the underlying deep learning models. However, as the natural language becomes more complex, more errors occur when retrieving information from the table.
In this article, we went through various solutions for question-answering on tabular data, and saw how heuristics and deep learning can be combined to semantically parse a natural language question together with tabular data. Choosing the best solution means weighing several factors: speed, natural language understanding, and memory.