
Using LLMs in Production at Zing Data to Make Data Exploration Easy From Anywhere

Lessons learned in building a reliable product even with unreliable responses from OpenAI and Google PaLM 2

Zack Hendlin is the CEO & co-founder of Zing Data - a platform for collaborative data analysis that works anywhere. Built mobile-first to make it easy to query data in seconds on your phone, Zing has been called the 'Figma of data' for turning hard data questions into a couple of taps on a phone.

He was VP of Product at OneSignal, where he 10xed ARR in two years, was a product leader at Facebook where he shipped their first mobile ads format and their first work in speech recognition, and at LinkedIn where he shipped major news feed improvements and post translation.

He holds multiple machine learning patents, and has been featured in Techcrunch, Fortune and VentureBeat. He is a graduate of UC Berkeley and the Massachusetts Institute of Technology.


Zack Hendlin 0:07
Thanks, Erin, and thanks for joining, everybody. So I'll talk for the next half hour or so, and I'll take questions at the end.

Zack Hendlin 0:14
I want to share what we actually see using LLMs in practice. I think, you know, you could go on Twitter, and everybody will say that LLMs are going to take your job or change everything. And I think what we've seen from real data and how people use them in our product to query data is that they work well for some simple use cases. But in a lot of situations, you want to be sure about a result, or you want to know how something actually happened. And so I'm going to talk about some of the challenges and workarounds to actually build production-scale systems that work, that build user trust on top of LLMs.

Zack Hendlin 0:57
A little bit about what the company and product is, which I think will help contextualize the background and understanding for this. We are a platform for mobile data analytics. So you can query data from your phone; we work on web as well. But really, the realization was that it needed to be a step change easier to be able to ask questions of data without relying on having to write SQL or go to someone on your data team. So my co-founder and I met about 10 years ago at MIT (it rarely was this beautiful and sunny there), and we realized that this problem existed, and that you needed to make using data a step change easier. So we built out visual interfaces for querying, and of course SQL. And then the most interesting one was when the bomb dropped of OpenAI, really rethinking how that could change interaction patterns, so you could ask questions with natural language. What we quickly figured out, and I think anybody who's building on top of LLMs has figured out, is you can't really just throw a text box into your product and have it work well. And so I want to talk through some of the nuances of getting it right. To set the stage a little bit, the idea of asking a question in natural language, and getting results from that, is not new. So what you see here is what

Zack Hendlin 2:26
Microsoft Power BI has had for quite a while; Tableau had flavors of this, as did ClearGraph and ThoughtSpot. And the idea was, you could ask a natural language question and get a graph or get a table as a result. What you'll see, though, is that it was really, really limited. So this is from Microsoft Power BI's documentation; they said multiple conditions aren't supported. So you couldn't do something like, say, show me which of my customers are in Texas and purchased at least a $100 product from me last year, because that would be multiple conditions. So you couldn't do that. These were really, really, really limited. And you'll notice these red underlines on the screenshots; that was where something was misspelled or unclear. And before LLMs, you basically needed to map each of those meanings to a field or to a specific value, because if you didn't, your natural language questions wouldn't be understood.

Zack Hendlin 3:30
So the idea of going from natural language to a graph is pretty old. The ability to do that well, though, is new. And so before OpenAI and this new wave of GPT-style models, the sort of state of the art was some of the models I'm showing you on the right. So there was a thing called RAT-SQL from Microsoft Research. ServiceNow had something called PICARD, which was reasonably good at getting English to SQL, but needed really specific wording. It couldn't necessarily understand the difference, or similarity, in a concept like revenue versus sales: if you don't have sales as a field, but you do have revenue as a field, maybe you want to disambiguate to that. So it didn't have a great conceptual understanding of how different words were related. It could say, hey, I see this field name, I can match the field name and write a query around it. Google also had the BERT model (Bidirectional Encoder Representations from Transformers). And all these worked okay, but not great, and were pretty finicky.

Zack Hendlin 4:46
So the big development is with OpenAI, and now Google's PaLM 2, and there's others, like Vicuna, and Falcon, which is a 40-billion-parameter model out of the UAE.

Zack Hendlin 5:00
The big breakthrough here was that you then start bringing in an understanding of different domains. Right? So you go from saying, I don't know what sales is, I don't know what revenue is, I don't know how to disambiguate when you say, what is the sum of sales,

Zack Hendlin 5:16
if I only have a field called revenue, to: hey, now I have this much better understanding that's encoded of human language and how concepts are related, so I can disambiguate those things in a way that makes sense. So that was the big shift that happened as some of these LLMs started coming out. It also made them much easier to use. So you could feed in just table names or views and field names, and get out SQL as a result.

Zack Hendlin 5:48
Not always right, but at least that was the general idea. And then a lot of the practical considerations of trying to get some of the previous models I talked about up and running (needing a ton of memory, setting up the infrastructure, and training or retraining them on your tables or views or schema) were radically simplified when you can just call an API to OpenAI.

Zack Hendlin 6:16
Research code does not a production service make. And OpenAI made that a step change easier. It also made it a step change more affordable, because you didn't need to go provision all these GPUs for model training and stuff like that. So normally, you might say, hey, it's super easy for me to jam an LLM into my app and create a great experience, or a thing that's really cool. I think it's true that you can kind of jam OpenAI or a search box or text box into your app and get something that you could probably demo well. It's nominally as easy as the query I'm showing you right here, but it's actually much harder to get good results in practice. So if I wanted to get a SQL query for all customers in Texas with the name of Jane, I would pass it my schema (in this case, these columns up here) and the name of the table (in this case, customers), and tell it that I want the query for that. And GPT-4, PaLM 2, Vicuna, and a number of other models floating out there can basically do that, and for a simple question like this, get it right. But with even a slight complexity, they'll start getting it wrong. So for instance, let's say, for your customers in Texas, the state Texas is actually represented as TX. Well, this will say where state equals, quote, Texas, close quote. And if your data is structured such that Texas is TX or lowercase texas, it won't actually give you what you want.
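The flow described here (schema and table name in, SQL out) can be sketched roughly as below. This is an illustrative guess at the shape of such a prompt, not Zing's actual prompt; the real system would send this string to the OpenAI or PaLM 2 API.

```python
# Hedged sketch of prompting an LLM for SQL: schema plus table name in,
# a query out. Table and column names here are illustrative.
def build_sql_prompt(table: str, columns: list[str], question: str) -> str:
    schema = ", ".join(columns)
    return (
        f"Given a table `{table}` with columns: {schema}.\n"
        f"Write a SQL query to answer: {question}\n"
        "Return only the SQL."
    )

prompt = build_sql_prompt(
    "customers",
    ["id", "name", "state", "purchase_total"],
    "all customers in Texas with the name Jane",
)
# A model will often answer with WHERE state = 'Texas', which fails
# silently if the data actually stores 'TX', the pitfall described above.
```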

Zack Hendlin 8:02
And so that's where there's a lot of engineering, or prompt engineering, that goes into getting much better results that behave like you expect them to. So I want to talk through some of the practical considerations and the unexpected goodness we've seen, and I'll show you what it looks like in the product. So a user here can ask which products are out of stock, and they'll be able to type that in natural language in Zing, and out the other side they get this list of products, and they can see the SQL that was generated. If the question should generate a graph, we generate a graph. So you have logic for how all that stuff should work. And so the unexpected goodness, the things we just got from the model out of the box that worked really well, was the ability to disambiguate sales, sale, revenue, revenues, misspellings, and still get results that matched the field names that existed and matched the user's intent.
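The "should this be a graph, and which kind" logic mentioned here could look something like the sketch below. These heuristics are illustrative guesses, not Zing's actual rules.

```python
# Illustrative chart-selection heuristics for a query result.
def pick_chart(columns: list[tuple[str, str]], group_bys: int) -> str:
    """columns: (name, type) pairs describing the result set."""
    types = [t for _, t in columns]
    if "timestamp" in types:
        return "line"          # time series: put time on the x axis
    if group_bys >= 2:
        return "stacked_bar"   # several group-bys: stack the series
    if group_bys == 1:
        return "bar"
    return "table"             # no aggregation: just show the rows

print(pick_chart([("sold_at", "timestamp"), ("sales", "number")], 1))  # line
```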

Zack Hendlin 9:06
The models are also pretty good at handling things like WTI, which stands for West Texas Intermediate in the oil and gas space. So WTI price could map to the field name WTI crude oil price, and it can handle those differences reasonably well. It also could do cross-table joins or cross-view joins, if they're reasonably simple and use the same or similar field names to match on. And it worked better out of the box, without any fine-tuning or any of that stuff, than Tableau, ThoughtSpot, or any of these others did, unless you'd invested a lot of time in structuring your data and building out a semantic layer and all this other stuff. You basically could have a table, have fields, point an LLM and a question at them, and be able to get, you know, a SQL query. The unexpected badness, though (and I think this is really where it comes in, if you're a builder or founder or data analyst, as something to think about) is, you may have a query that doesn't necessarily get you a graph; it doesn't necessarily give you the level of aggregation that you need to have a more understandable result. So if I say, to take a trivial example, show me sales over time, well, if my sales are actually keyed on a timestamp, and that timestamp is actually down to a millisecond level of granularity, that's a pretty unhelpful graph to show. And so there's this additional logic layer that you need to build out, or instruct your LLMs

Zack Hendlin 10:51
to take into account when they're building your SQL, to then say, actually, I want to see this by day, or actually, I want to see this by week.
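That aggregation-grain layer (never group on a raw millisecond timestamp; truncate it first) might look roughly like this. It assumes a dialect with `DATE_TRUNC`, such as Postgres or Snowflake; the column and table names are made up.

```python
# Sketch of coercing a raw timestamp to a sensible grain before grouping.
def group_expr(column: str, grain: str = "day") -> str:
    if grain not in {"day", "week", "month"}:
        raise ValueError(f"unsupported grain: {grain}")
    return f"DATE_TRUNC('{grain}', {column})"

sql = (
    f"SELECT {group_expr('sold_at')} AS day, SUM(amount) AS sales\n"
    "FROM orders GROUP BY 1 ORDER BY 1"
)
print(sql)
```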

Zack Hendlin 11:00
Another thing to be aware of: they're very finicky with wording. So if you say show me sales by region versus show me total sales by region, well, sales could be a count of sales; total sales is probably a sum. And so in order to give it the right hints where there is ambiguity (right, is sales by region a sum or a count?), you sometimes need to put in these hints, like show me the count of sales by region, or sum of sales by region, where it might otherwise get that wrong. You also need to do things like give it guidance about what's in fields (we were talking about the Texas example before) so that it isn't guessing in the wrong ways that will result in queries not working. And then you also can't do things very well like reverse lookups. So, you know, if a user says show me customers in Japan, you might have two fields in your table, region one and region two, and you might not really have given the model enough context to figure out whether Japan is a value in region one or a value in region two. So you can create these lookups

Zack Hendlin 12:15
to then refine what the model knows, and have it get better results. There's other things you could do, like preprocessing, to say, hey, I'm only going to tell the model that region one exists if I know Japan is a value in region one; when I'm prompting it with different columns, I'm not going to give it the columns that are irrelevant, or that wouldn't contain the values that I'm looking to match against. And then lastly, there's a lot of datasets which are large. And if you get above something on the order of 100 to 150 tables, depending on how many columns are in each table, you'll just get too many tokens for the models to be able to handle. That's changing a little bit, and there's larger variants that can handle larger token sizes, so this will probably become less of a problem over time. So I want to show you what we built out, and then how we're refining that to make it better. So you can ask, in plain English or French or Spanish or Japanese or German or Arabic, a question in natural language; you have a search bar at the top in the Zing app. And then that will connect to OpenAI or PaLM 2 and generate the SQL, and then that SQL runs live against your cloud data warehouse. And that could be anything from Microsoft SQL Server, a Google Sheet, an Excel file, Snowflake, whatever it is, wherever your data lives. And then one thing we figured out was, sometimes there's a question that you've already asked, or a colleague has already asked, in which case you don't necessarily want to go ask that from scratch of OpenAI or PaLM 2. Instead, you can use the social context that you've got from the fact that you or other colleagues have already asked that question, and maybe saved it. And that's actually probably a more reliable source than going to the models from scratch.
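The value-lookup preprocessing described here (only show the model the columns that could actually match terms in the question) can be sketched as below. The column names and the string-match heuristic are illustrative assumptions; a real system would use a proper value index.

```python
# Index which string values live in which column, then prune the schema
# passed to the model down to columns that could match the question.
from collections import defaultdict

def build_value_index(rows: list[dict]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for row in rows:
        for col, val in row.items():
            if isinstance(val, str):
                index[col].add(val.lower())
    return index

def relevant_columns(question: str, index: dict[str, set[str]]) -> list[str]:
    words = {w.lower() for w in question.split()}
    return [col for col, vals in index.items() if words & vals]

rows = [{"region_1": "Japan", "region_2": "EMEA"},
        {"region_1": "Chile", "region_2": "APAC"}]
idx = build_value_index(rows)
print(relevant_columns("Show me customers in Japan", idx))  # ['region_1']
```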
So we have this UI where you ask a question, we show you a graph or a table, and then we realized pretty quickly that it only worked well on the order of 20 to 30% of the time.

Zack Hendlin 14:36
And what I mean by works well is that a user doesn't need to go reformulate that question, like word it differently, nor do they need to change the SQL, nor do they need to go to, like, a visual query interface. They're able to just save that question, which means that they think the result looks good, without needing to do reformulation.

Zack Hendlin 14:59
What that means, though, is 70 to 80% of the time, the results that these LLMs are spitting out for data queries are just not quite what someone's looking for, or not very good. And so that's where we've thought about, I think to Matt's earlier point on how you build affordances, how to make LLMs more useful. What we've done here, and this is shipping later this week, is instead of showing you just a result, or showing you the SQL that was generated if you wanted to understand how we got the result, we're actually visually saying: hey, here's this thing that we're putting on the y-axis; here's this group-by (we talked about timestamps before); here's this logic we're applying, where a timestamp is grouped by day. And if you want, you can, you know, tap any of these things and say, oh, actually, I want the count instead of the sum, or the median instead of the sum, or I want to group by month instead of day. And you can refine what is shown to a user without them needing to go type in SQL and debug it. Because one of the big challenges is there's this weird gap between, hey, I have a question in natural language (anybody can ask any question; they don't need to know anything about SQL, they don't even need to know too much about the underlying data source) and they'll just get a graph, they'll get an answer, and it's going to be great.

Zack Hendlin 16:30
Which, in theory, is awesome. But then the challenge is, if it doesn't give you quite what you need, and you're in this world of debugging SQL, then a lot of folks who are less technical don't necessarily know how to work through that and how to get to something useful. And so that's why we've thought about this (and again, this goes back to Matt's point on creating the right affordances and the right UI, and not assuming that the LLM is always going to get it right), where we said: okay, a user, even if they're non-technical, will see exactly what is happening. And if they can see that that's not what they want, they can modify it very, very directly.

Zack Hendlin 17:10
So I asked Stable Diffusion to make me a movie poster for "All the Data Everywhere, All at Once," and you can see it doesn't quite get it right; it gets, you know, parts of the idea. I think that's kind of the State of the Union for data analysis powered by these LLMs. I think it's going to get better, but I think that's just where things are at present. So what did we do? Well, we looked at ways we can make this more deterministic. So let me show you here, if I can get it to animate. There we go. So you can see the deterministic way, where we said: you choose a data source, you choose what table you want to deal with (this is, like, some Hacker News posts), and deterministically say, I want to get the count of posts, and I want to see that by type. Great, that's a very deterministic, very clear way. Or I can drag these fields to get exactly what I want. That's a very clear, very deterministic way to get a result without necessarily needing the LLM to always get it right. Because if we show you here an easy GUI to make it right, if the LLM doesn't get it right, then you have a lot more control over what's going on. And you don't have to play this frustrating, sometimes confusing game of let me reword it, let me reword it, let me reword it, and hope that the LLM gets it right with each iteration, which can actually take longer than just deterministically doing the thing you want. Now, one of the cool things that we do is, once you've created this question deterministically and saved it, we can then use that to fine-tune the LLMs.

Zack Hendlin 18:59
So I think Brian asked a question, or Marshall asked a question, on relying on out-of-the-box LLMs versus developing your own models. And I think what we've seen makes the most sense right now is not building individual models for each user at this point.

Zack Hendlin 19:24
The way you construct a SQL query is going to be similar across use cases. What you do need to do is provide schema information about the fields and stuff like that to make it better. And then the other thing that we're pretty excited about, and are in the early innings on, but I think is going to be a way to make this a lot better, is to use these questions that someone saves deterministically as training data. So let's say you create five questions deterministically, not using the LLM, and your colleagues create five questions. Well, if you can feed that back in as fine-tuning examples to PaLM 2, which you can do via Vertex AI, Google Cloud's way to fine-tune models and stuff, or give additional training examples to OpenAI, then you start getting models that are smarter about the way that you ask questions, or your organization asks questions. And so I think this is going to be a really interesting development, where it goes from, I am asking a foundational model a question, to instead, I'm refining the foundational model to my own use case. Oh, and by the way, that also has some social context, right? Because if a question is right or wrong, there's people in your organization who probably know that. And even if you're not technical, maybe someone else at your organization has already asked that same question and was more technical; they can save that, and that can then become a kind of ground-truth answer, if you will, for that question.
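Turning those saved, human-verified questions into fine-tuning data can be as simple as emitting question-and-SQL pairs as JSONL. The chat-style shape below is roughly what OpenAI's fine-tuning endpoint accepts, but treat the exact field names as an assumption and check the current docs; the example data is made up.

```python
# Convert saved (question, sql) pairs into chat-format JSONL rows
# suitable as fine-tuning examples.
import json

saved_questions = [
    {"question": "which products are out of stock",
     "sql": "SELECT name FROM products WHERE stock = 0"},
]

def to_finetune_rows(saved: list[dict]) -> list[str]:
    rows = []
    for item in saved:
        rows.append(json.dumps({
            "messages": [
                {"role": "user", "content": item["question"]},
                {"role": "assistant", "content": item["sql"]},
            ]
        }))
    return rows

print(to_finetune_rows(saved_questions)[0])
```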

Zack Hendlin 21:06
So when you're working with these LLMs and getting back responses, it's pretty important to note that a response is not necessarily a correct response. And if you want to build a product that is reliable, that people understand, then you need to do a good amount of engineering and refinement around making that an experience where people know what's happening and can fix it if it's not right. So, a common thing we see, and we've done work to fix: we would pass, as I was saying before, schema and field names, and we would get results, we would get queries back from OpenAI, that would contain tables and fields that were not provided to it. And what that means is those queries will not run. And that's pretty frustrating to a user; they just get an error. They don't know why, and they don't know why the model is generating unrunnable SQL. So one thing we did to refine that was to say: return false. We added additional guidance in the prompt to say, return false when the required tables or fields to answer this question are not available. And that then let us say, great, we know that it's not going to answer this question well; we can show a user fields that they can pull from, we can show a user tables that they can query, and lead them towards, like, a visual query flow. Or we can ask the model in a slightly different way, do a second-shot prompt.

Zack Hendlin 22:42
We can also engineer checks for whether the tables and fields needed to run the query are there, and if not, you know, send a user down a path to maybe refine the way they worded their question, or make sure that they're using field names, or something close to field names, that already exist. So these are ways you can start building a product and infrastructure and UI around this somewhat variable response that you're getting from the models. Another thing we've seen is that, you know, we'd have users ask a question, and we'd show a graph, but that didn't give enough confidence to end users that the query was correct. And so we had a way where, you know, they could tap and see the SQL. But the challenge was, for folks who weren't technical, they didn't see what was actually happening visually, and so it's hard to know if that was correct or incorrect. And what we found was a much better way to do that is to show this visual interactive query builder that says: here's what we understand your question to be; does that look right? And then when they say yes and run the question, they have a much higher level of confidence that the question actually being asked, the query that's being run, is what they want. And then, one other note is that you don't always need to rely on LLMs.

Zack Hendlin 24:06
So I think there's an inclination to say, hey, you know, we'll just directly hit the LLM whenever a user has a question. But in fact, anytime a user saves something, anytime that they've created their own query and saved it, that's actually something that you probably want to surface. I used to work at Facebook and LinkedIn, and if you think about a social network, or you think about Stack Overflow: you go and you look at answers, or meal and restaurant recommendations, from people you know, from your colleagues, from your friends. And so you don't need to restart from scratch by asking this question. You instead can resurface results from questions that have already been asked by other people. And you can use that also as a way to fine-tune these models.

Zack Hendlin 25:00
If you have a huge amount of that, you can rebuild models from scratch on it, but that requires a lot more data. So typically, you'll want to do something more in the fine-tuning realm. A little bit about our architecture and how we do it. So we connect to your database, any major database; I think there's a list of them here: Trino, Databricks, BigQuery, Sheets, Redshift, ClickHouse, Postgres, and so on. And then, if you have a natural language question, we take that, run it through a couple models, do a little bit of preprocessing and a little bit of post-processing. And the post-processing allows us to go from this table, or this query, to a graph. So we have an idea of, okay, if it's a time series, time should be on the x-axis. Or if there's a couple of group-bys, then we probably should display it as a stacked bar chart, so that we can make out the different series. And so there's engineering and logic beyond the base call to an LLM. There's also looking at multiple LLMs to figure out: are they in agreement with each other? Are they giving us similar types of results? And as a function of that, are we more confident that the alignment between multiple models is giving us a better result? So let's say I had three models that I'm running the same question through. If those don't match, that means that maybe there's a pretty big difference in how they're interpreting the question, and there's probably a good degree of risk that I'm generating a query that's not right, or that certainly could be interpreted in other ways. So running multiple models, like a waterfall or in parallel, is also a way to get a sense of whether the output you're generating is high confidence or not.

Zack Hendlin 27:02
So a little bit about how this works in practice. We have a user, 7-Eleven (not in the US, but in other countries they operate in), and they basically hooked us up to ClickHouse. And then they can ask questions like, which products are out of stock for this store, or show me my fastest-selling beverages. And it takes something that used to require a data team and building dashboards, and makes it something they can do with natural language really easily. And then, because we're mobile, once you've created these kinds of questions or queries, you can then do things on top of that, like set up real-time alerts. So here I'm saying, send me a push notification when the number of events goes above some level, or show me which stores have sold at least $100 of products in the last day, and we can show that on a map. And we can know your current location, if you turn on that ability, so that you're then able to ask these natural language questions that are specific to your current location or to your own data. So, takeaways, and then I'll cover any questions folks have. Create deterministic fallbacks: don't assume that everything someone needs to do, they're going to be able to do via re-prompting over and over again; create a way they can get exactly what they need without having to rely on the LLM being perfect. Build affordances for incorrect responses. Use human knowledge and social information, saved queries, what colleagues have done, so that you're not reinventing the wheel each time someone asks a question. Typically, if you've asked a question, a lot of your colleagues will have a similar question, and you can benefit from what people have done before to skip people to the right answer. Try running models in parallel; that's a way you can get a better read on whether the results are consistent across models, and hence how confident you might be in them.
And then engineer smart guidance for how a user navigates the system. So if they get a result that's not good, say, hey, here's a visual way you can do this, or here's a list of tables or fields or whatever it is, so that a user can do that on their own.

Zack Hendlin 29:42
With that, I'd be happy to take any questions. If you want to try Zing, you can try it out for free; it's on the App Store and Play Store, and it also works on the web. And if any of you are looking for jobs, we're hiring a product designer, a business lead, and a back-end engineer. So, with that, I'll open it up for any questions that folks have.

Erin Mikail Staples 30:04
Super awesome. Thank you so much, Zack, for sharing. I found myself taking screenshots and, like, reminding myself to watch this again later. But we have a couple questions from the audience. First one from Marshall: Zack, even now, you seem to rely on out-of-the-box LLMs as opposed to developing your own models, with such development in the field and quite sophisticated refinement?

Zack Hendlin 30:28
Yeah, yeah. So right now, building a model from scratch: you can do it, and it's getting much better and much cheaper. MosaicML, which recently sold to Databricks for an eye-watering amount of money,

Zack Hendlin 30:44
basically tries to make it much easier. And so you can train these things on your own from scratch. For something where you want a good understanding of how concepts are related, you actually need to ingest a large corpus of how people use language, right? Like, is sales related to revenue, and stuff. If you have a huge corpus on your own, you could kind of figure that out, but you're probably better off starting from one of these models that understands the world pretty well, and then fine-tuning it for your use cases. And fine-tuning has gotten a lot better. So one of the cool things, and I guess it's gotten popular over the last couple of months, is the idea that you can make these slight tweaks, without needing to rebuild a model, in a much more affordable way. And I won't go too technical here right now, but if you just look at, like, fine-tuning, that's kind of where to get started from. And it allows you to change the way the model behaves without needing to invest all the GPU time and money in rebuilding it from scratch.

Erin Mikail Staples 31:57
I totally agree. And I think the fine-tuning stuff (I actually have a collection of resources on that) is super helpful. It's hard to teach machines how to do computers and speak English, and the more layers of complexity you add, the more the build-versus-buy scenario becomes a very new question. But I love this question here from Matt: on average, how many times do folks retry prompting if the first prompt doesn't work?

Zack Hendlin 32:24
Yeah. So what we see is, folks will retry probably three to five times, and if they don't get to an answer they want, then they fall back to these other affordances. So in our case, that's, like, the drag-and-drop visual query builder thing. And so they typically will go to those after three to five attempts of rephrasing.

Zack Hendlin 32:58
I'd say that if it's a simple question, someone will probably get their answer. If I say, like, how many customers do I have, and I have a customer ID field, that's right 95% of the time, probably. But once I start asking these, like, one-or-two-notches-more-involved questions, I'd say that people actually, even once they've reformulated, only get to a success state with the LLMs (and this is on their own data, on real-world data) probably 30% of the time, even with all that re-prompting. So it's not amazing right now. It will get much better fast, especially as you do this fine-tuning on your own examples. But a product that works well 30% of the time is not going to be a thing that you rely on to, like, run your business. That can be a thing that helps you, but we still see that the vast majority of the time, users go to these more deterministic ways of getting the answer.

Erin Mikail Staples 34:07
Right on. And I think, like, yes, 30% is never a good metric when we're looking at success metrics. I always talk to people about, when we think about building a model or using a model: why? Why are we doing it? Like one of the best lines from one of the newer press books: machine learning is just a way of solving a problem or answering your question.

Zack Hendlin 34:29
Yeah, and hopefully it's an aid to do that, and over time it will be a much better way to do that. But I think if you're saying, hey, what products do I need to restock (right, I'm going to buy more of those things and spend money), if that's not working, then you want a more deterministic way to get to that. And I think that's where, if you're on Twitter and you see someone saying it's going to, like, fix everything or change everything... I think it's going to help a lot of stuff over time.

Zack Hendlin 35:00
But the state of the art, at least for querying complex datasets, is it's not quite there to rely on.

Erin Mikail Staples 35:08
So what you're saying is that, like, we all might still have jobs at the end of this AI rush? Wild concept.

Zack Hendlin 35:16
We’ll have to check in a decade, a decade from now and see.

Erin Mikail Staples 35:20
I'm looking forward to that; we can all go to the beach together. But thank you so much, Zack.
