World's First AI data analyst that surpasses trained humans: GlazeGPT
Author: Atishay Jain

Generating correct SQL from natural language has long been a topic of research in both academia and industry. The advent of ChatGPT accelerated this research, leading to a surge of startups in this domain.


Many companies developed solutions that simply acted as a wrapper around GPT-4. These showed promise on open-source databases, as evidenced by their performance on datasets like Spider. However, their performance plummeted when interfacing with real-world databases, which are often characterized by poorly named tables (e.g., "v_driver") and ambiguous column names, like a 'flag' column in a 'Patient' table. No LLM can predict the meaning behind such a column.

Although some companies have attempted fine-tuning open-source models like Llama 2, none have reported surpassing GPT-4 on code generation tasks to date.


Database querying demands a high level of accuracy since erroneous data can incur significant costs for a company. This is why deploying ChatGPT directly to business users unfamiliar with SQL or a specific database schema can be risky.


Furthermore, companies often use unique internal jargon that generic models like ChatGPT are unaware of. For instance, asking ChatGPT, "Please show me the NPS for all the North American customers," would be fruitless, as it lacks context on what "NPS" or "North American customers" mean.

What is GlazeGPT: GlazeGPT lets everyone query their databases using plain-English prompts. It leverages an ensemble of LLMs, machine learning techniques, and proprietary fine-tuned models, and it has industry-first features to make your database understandable to LLMs.

We wrote down 4 non-negotiable principles before building GlazeGPT:

  1. It should work with messy and disorganized databases. For example, some of our customers don't even have foreign keys set up.
  2. It should generate results with acceptable accuracy rates.
  3. It should understand the company's internal metrics and jargon.
  4. No changes or assistance should be required from company developers.


In this blog, we assess the accuracy of GlazeGPT across five databases spanning domains such as finance, medical, logistics, and entertainment. These databases are taken from the BIRD dataset.


This file contains details about the databases we used, the questions, the GlazeGPT-generated SQL, and our remarks on a few queries.

Types of databases we used for dataset preparation

  • Databases that have tables with more than 40 columns.
  • Databases where table names are not self-explanatory.
  • Databases with tables whose column names are non-intuitive.
  • Databases with a large number of tables (e.g., more than 50).
  • Databases across multiple domains like finance, medical, logistics, and entertainment.

Types of queries tested:

  • Queries with a human touch; you'll notice this as you browse through the file.
  • To keep things similar to real-world usage, a high percentage of queries require additional information about the database; schema details and the question alone are not sufficient to generate the SQL.
  • At times, we even amalgamated two questions into one, a practice we've noticed is common among human users based on our product analytics.
  • Many queries are not clearly defined.
  • Many queries require the use of a formula, which the user needs to define in GlazeGPT.


Impressively, GlazeGPT generated correct and acceptable SQL for 97.04% of the questions: out of 406 questions, 395 of the generated queries were acceptable. To put this into perspective, the highest accuracy recorded for contemporary solutions is around 55%, and even humans achieved an accuracy rate of only around 92%. The image below contains accuracy numbers for current solutions.

SQL Result Accuracy Comparison

Significance of the BIRD dataset:

Most of the prevalent benchmarks, e.g., Spider and WikiSQL, focus on well-structured database schemas, leaving a gap between academic study and real-world applications. BIRD, a big benchmark for large-scale databases grounded in text-to-SQL tasks, emphasizes database values and highlights new challenges: dirty database contents, the external knowledge needed to connect natural-language questions with database contents, and SQL efficiency, particularly in the context of massive databases. To solve these problems, text-to-SQL models must feature database value comprehension in addition to semantic parsing.


Models, including those based on large language models (LLMs), have achieved impressive performance on existing benchmarks such as Spider and WikiSQL. For instance, the execution accuracy of the top-performing model on the Spider leaderboard is 85.3%. However, current state-of-the-art models still struggle to generalize to more realistic situations characterized by large database sizes and noisy content. In addition, making sense of the values hidden in these huge databases requires external knowledge and reasoning.


Motivated by these observations, the BIRD dataset was created to better represent real-life scenarios and to narrow the gap between experimental and practical settings.

Methodology to determine generated SQL acceptability

A single question can have multiple correct queries. For example, all three of the queries below are correct.

Question: Give me a list of customers

SQL1: select id, first_name, last_name, email from customer;
SQL2: select id as customer_id, (first_name || ' ' || last_name) as full_name, email, address from customer;
SQL3: select * from customer order by signed_up_time;


For each question, there is a SQL query crafted by a human. Accuracy was evaluated by executing both the human-written SQL and the GlazeGPT-generated SQL and comparing their results using our SQL comparison algorithm.

Our SQL comparison algorithm, which compares the generated and the human-written results, covers the following aspects (a simplified sketch follows the list):

  • Removing all duplicate rows.
  • Resolving column aliases: we take a few sample values from a column and check whether any column in the other result contains those values. For low-cardinality columns, we pick three sample rows and use them to match the column to its correct position.
  • Sorting columns and checking whether one result's columns are a subset of the other's.
  • Sorting rows by every column, from first to last (excluding queries categorized under 'order_by' or those specifically requesting a particular order).
  • Comparing decimal numbers only up to two decimal places.
  • Ensuring all rows of matched columns are identical.
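
To make these checks concrete, here is a minimal Python sketch in the spirit of the steps above. It is an illustration only, not GlazeGPT's actual implementation: alias matching via sample values is approximated by simply trying column orderings, and all names are made up.

from itertools import permutations

def normalize(rows, ndigits=2):
    """Round decimals to two places and drop duplicate rows."""
    cleaned = set()
    for row in rows:
        cleaned.add(tuple(
            round(value, ndigits) if isinstance(value, float) else value
            for value in row
        ))
    return cleaned

def results_match(generated_rows, human_rows):
    """Return True if the human result's columns all appear somewhere in the generated result."""
    generated = normalize(generated_rows)
    human = normalize(human_rows)
    if not human:
        return not generated
    n_generated = len(next(iter(generated))) if generated else 0
    n_human = len(next(iter(human)))
    if n_generated < n_human:
        return False
    # Try every way of selecting and ordering generated columns to cover the human columns,
    # so aliases, extra columns, and column order are ignored; sets make row order irrelevant.
    for cols in permutations(range(n_generated), n_human):
        projected = {tuple(row[c] for c in cols) for row in generated}
        if projected == human:
            return True
    return False

# Extra columns, different aliases, and different row order are all tolerated.
human_result = [(1, "Ada"), (2, "Bob")]
generated_result = [("Bob", 2, "bob@example.com"), ("Ada", 1, "ada@example.com")]
print(results_match(generated_result, human_result))  # True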


If the results didn't align, we conducted a secondary evaluation.


Sometimes the answer doesn't match in output but matches in spirit. For the question "Is there any movie filmed in India?", the answer can be yes or no, or the answer can be a list of movies filmed in India. We have accepted both answers.


Then we checked whether the human-written SQL was correct. If not, we corrected it.


Then we checked whether the question was answerable or too ambiguous. If it was ambiguous, we rephrased it. For example, of the two questions below, the first is very ambiguous while the second is clear.

"Among the accounts that have loan validity of more than 12 months, list out the accounts that have the highest approved amount and account opened in 1993".
"Among the accounts which were opened in 1993 and have loan validity more than 12 months, list out the accounts that have the highest approved amount".


Then we checked whether the question required any form of cataloguing in the tool; if so, we completed it. Take the question and SQL below: the SQL cannot be generated without defining the meaning of the columns in the database.

Question: Regarding the average unemployment ratio of 1995 and 1996, which one has higher percentage?

SQL: SELECT DISTINCT IIF(AVG(A13) > AVG(A12), '1996', '1995') FROM district

Columns that need to be defined: A12 refers to the unemployment rate in 1995; A13 refers to the unemployment rate in 1996.
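
For illustration, the cataloguing for this example could look something like the snippet below. The format is hypothetical, not GlazeGPT's actual interface; the point is simply that each cryptic column gets a human-readable description the model can use.

# Hypothetical column catalog; the real GlazeGPT configuration may differ.
column_catalog = {
    ("district", "A12"): "Unemployment rate in 1995 (percentage)",
    ("district", "A13"): "Unemployment rate in 1996 (percentage)",
}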

For some questions, we need to define a formula in GlazeGPT. For example, for the question "What is the NPS for the Bengaluru region in the past quarter?", the formula for NPS must be defined in GlazeGPT.
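
As a purely illustrative example, such a metric definition might look like the following; the survey_score column and the definition format are assumptions, not GlazeGPT's actual API.

# Hypothetical metric definition: NPS = % promoters (score 9-10) minus % detractors (score 0-6).
metric_definitions = {
    "NPS": "100.0 * (SUM(CASE WHEN survey_score >= 9 THEN 1 ELSE 0 END) "
           "- SUM(CASE WHEN survey_score <= 6 THEN 1 ELSE 0 END)) / COUNT(*)",
}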


If we made any of the above changes, we regenerated the SQL for the incorrect results and then compared the new results using our SQL comparison algorithm.

What strategies enhanced our accuracy?

It wasn't a single magic bullet but a combination of strategies ranging from cataloging and fine-tuning to LLM call chaining, proprietary algorithms, and embedding techniques.


Database querying for business users necessitates not just accuracy but also efficiency. Many solutions lag, taking excessive time to generate SQL. In contrast, GlazeGPT responds to over 80% of queries in under 15 seconds.

Building accuracy requires more than just generating SQL. It requires addressing a number of other things:

  • Recognizing that business users might lack database know-how, which can lead them to pose questions the data can't answer.
  • Addressing ambiguous questions like "Which regions are performing well?" Since "performing well" is not defined, correct SQL can never be generated for such a question.
  • Delivering prompt results. Users shouldn't have to wait a minute to get their answers.

By enabling natural language querying of databases, we empower business teams to make data-informed decisions without burdening the data team.

Never miss out on the latest AI developments & news. Subscribe to our newsletter AI-Ronman.
