A new tool makes it easier for database users to perform complex statistical analysis of tabular data without having to know what’s going on behind the scenes.
GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a few keystrokes.
For example, if the system were used to analyze medical records from a patient who has always had high blood pressure, it might flag a blood pressure reading that is low for that particular patient but would otherwise be in the normal range.
GenSQL automatically integrates a tabular dataset and a probabilistic generative AI model, which can account for uncertainty and adjust its decision-making based on new data.
Additionally, GenSQL can be used to produce and analyze synthetic data that mimics real data in a database. This can be particularly useful in situations where sensitive data cannot be shared, such as patient health records, or where real data is scarce.
This new tool is built on top of SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.
“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs, they just had to query a database in a high-level language. We think that, as we move from simply searching data to asking questions of patterns and data, we’ll need an analogous language that teaches people the coherent questions you can ask a computer that has a probabilistic model of data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of a paper introducing GenSQL and a principal research scientist and head of the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences.
When the researchers compared GenSQL to popular, AI-based approaches to data analysis, they found that it was not only faster, but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and modify them.
“Looking at the data and trying to find some meaningful patterns using just a few simple statistical rules can miss important interactions. You really want to capture the correlations and dependencies of variables, which can be quite complicated, in a model. With GenSQL, we want to enable a large group of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and member of the Probabilistic Computing Project.
They are joined on the paper by Matin Ghavami and Alexander Lew, graduate students at MIT; Cameron Freer, a research scientist; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, MIT professor in the Department of Electrical Engineering and Computer Science and member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.
Combining models and databases
SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, people can ask questions about data using keywords, for example to sum, filter, or group database records.
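As a rough illustration of that kind of keyword-based query, here is a minimal sketch in Python using the standard `sqlite3` module. The `developers` table and its rows are hypothetical, invented only for this example, and the query is ordinary SQL rather than GenSQL.

```python
import sqlite3

# Hypothetical table of developer salaries, invented purely for illustration.
rows = [
    ("Seattle", "Rust", 150),
    ("Seattle", "Python", 130),
    ("Boston", "Rust", 140),
    ("Boston", "Python", 120),
    ("Boston", "Python", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (city TEXT, language TEXT, salary INTEGER)")
conn.executemany("INSERT INTO developers VALUES (?, ?, ?)", rows)

# Filter (WHERE), group (GROUP BY), and sum (SUM) in a single query.
query = """
    SELECT city, SUM(salary) AS total_salary
    FROM developers
    WHERE language = 'Python'
    GROUP BY city
    ORDER BY city
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('Boston', 220), ('Seattle', 130)]
conn.close()
```

A query like this describes only the records that are already stored; it says nothing about how likely an unobserved value is.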
However, looking for a pattern can provide deeper insights, as patterns can capture what the data mean for an individual. For example, a female developer asking if she is underpaid is likely more interested in what the salary data mean for her individually than in trends across database records.
The researchers noticed that SQL does not provide an effective way to incorporate probabilistic AI models, while approaches that use probabilistic models to make inferences do not support complex database queries.
They built GenSQL to fill this gap, enabling one to query both a dataset and a probabilistic model using a straightforward yet powerful formal programming language.
A GenSQL user uploads their data and a probabilistic model, which the system integrates automatically. The user can then run queries on the data that also draw on the probabilistic model running behind the scenes. This not only enables more complex questions, but can also provide more accurate answers.
For example, a query in GenSQL might be something like, “How likely is it that a developer from Seattle knows the Rust programming language?” Just looking at a correlation between columns in a database can miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
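To make the idea of a probability query concrete, here is a minimal Python sketch. It is not GenSQL syntax and not the researchers' method: the records are hypothetical, and a smoothed frequency estimate stands in, as a simplifying assumption, for the far richer probabilistic models the article describes.

```python
from fractions import Fraction

# Hypothetical records (city, knows_rust), invented for illustration.
records = [
    ("Seattle", True), ("Seattle", True), ("Seattle", False),
    ("Boston", False), ("Boston", True), ("Boston", False),
]

def conditional_probability(records, city, alpha=1):
    """Estimate P(knows Rust | city) with add-alpha (Laplace) smoothing."""
    in_city = [knows for c, knows in records if c == city]
    successes = sum(in_city)
    # Smoothing keeps the estimate defined even for a city with no records.
    return Fraction(successes + alpha, len(in_city) + 2 * alpha)

p_seattle = conditional_probability(records, "Seattle")
p_unseen = conditional_probability(records, "Austin")
print(p_seattle, p_unseen)  # 3/5 1/2
```

Even this toy estimate answers a question that plain counting cannot: it returns a probability for a city that never appears in the table, which is the kind of inference a model adds on top of the stored rows.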
Plus, the probabilistic models that GenSQL uses are auditable, so people can see what data the model uses to make decisions. In addition, these models provide measures of calibrated uncertainty along with each response.
For example, with this calibrated uncertainty, if one queries the model for the predicted outcomes of various cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would tell the user that it is uncertain, and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
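One simple way to see why sparse data should produce wider uncertainty is a Beta-Bernoulli model, sketched below in Python. This is an illustrative assumption rather than the model GenSQL uses, and the patient counts are hypothetical.

```python
import math

def beta_posterior(successes, failures):
    """Mean and standard deviation of a Beta(1 + s, 1 + f) posterior.

    Assumes a uniform Beta(1, 1) prior over the response rate.
    """
    a, b = 1 + successes, 1 + failures
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Hypothetical counts of patients who did or did not respond to a treatment.
well_represented = beta_posterior(successes=600, failures=400)
underrepresented = beta_posterior(successes=6, failures=4)

for name, (mean, std) in [("well represented", well_represented),
                          ("underrepresented", underrepresented)]:
    print(f"{name}: estimated response rate {mean:.2f} +/- {std:.2f}")
```

Both groups have nearly the same estimated response rate, but the spread for the smaller group is several times larger, so a system that reports the spread alongside the estimate signals how much weight the answer can bear.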
Faster and more accurate results
To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds while providing more accurate results.
They also applied GenSQL to two case studies: one in which the system identified erroneous clinical trial data and another in which it generated accurate synthetic data that captured complex relationships in genomics.
Next, the researchers want to apply GenSQL more broadly to perform large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw inferences about things like health and salary, while controlling what information is used in the analysis.
They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long term, the researchers want to enable users to make natural language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that one could talk to about any database and that grounds its answers using GenSQL queries.
This research is funded, in part, by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.