From Providing Data to Explaining Why: How Large Language Models are Changing the Role of Data Practitioners
Unlocking the power of Large Language Models in data practice: From SQL troubleshooting to natural language querying with LlamaIndex.
This article was originally published at StreamSets.com.
Over the past few months, there has been a lot of talk about how ChatGPT, and other Large Language Models (LLMs), will change the world. As data professionals, the idea that a machine can answer questions, including business questions, might seem like an existential threat. But is it?
Large Language Models can be outstanding at answering what is true (or what the LLM thinks is true); but by the nature of their training and construction, they are not very good at explaining why something is true. LLMs don't "know" anything about context. They're really just guessing based on their training.
What are Data Engineers Really Doing?
According to this Monte Carlo survey, data engineers spend 40% of their workday on bad data. Not only that, but:
- An average organization has 61 data incidents a month, taking an average of 13 hours to identify and resolve, totaling 793 person-hours per month.
- 75% of these incidents take between 3 and 8 hours to detect (not resolve).
- Business analysts spend an average of 3 hours per day answering data questions.
- Bad data impacts 26% of business revenue.
As data practitioners, our job is to explain why things are the way they are, not just provide the data. Let's refer to this data (e.g., revenue is up, or jeans are the most common item in a shopping cart) as the "how." The Monte Carlo survey focuses mostly on detection and discovery: the "how" of the data. We can still outthink the LLMs and provide the essential part, the "why." Therefore, let's use these Large Language Models to do the "how" so we can spend more time on the "why."
The “why” is where the true business value is found. Where LLMs can identify a correlation between answers, the “why” is found in…