Inside the Brain of a Large Language Model: How It Works in 30 Minutes
Large Language Models like ChatGPT, Claude, and Llama may seem like magic, but you don't need a Ph.D. to understand how they work. In this session, we'll use an innovative, web-developer-friendly approach to decoding AI: a spreadsheet-style interface built around JavaScript and web components. Using this fully functional implementation of GPT-2 (an early precursor to ChatGPT), we'll break down the architecture of a basic LLM into digestible, interactive components that let you see inside the brain of an LLM as it works. In 30 minutes, you'll walk through the anatomy of an LLM, understand the basics of how it works, and even control its mind without prompting.
Ishan Anand is the creator of Spreadsheets-are-all-you-need, a course that teaches the inner workings of large language models (LLMs) through an implementation of GPT-2 (a precursor to ChatGPT) entirely within Excel. He was most recently VP of Product for Edgio Applications, a platform that leverages edge computing, serverless, and AI/ML to enable enterprise teams to accelerate, host, and secure their high-stakes websites. Ishan joined Edgio via the acquisition of Layer0, where he was the CTO and co-founder. Ishan is an experienced speaker on AI, Jamstack, and web performance for conferences such as AI Engineer World’s Fair, AI Tinkerers, Next.js Conf, CascadiaJS, React Day New York, O’Reilly OSCON and more. Ishan holds dual B.S. degrees in Mathematics and Electrical Engineering & Computer Science (EECS) from MIT.
Ishan Anand 5:15 Okay, thank you, Brian, thank you for having me. Welcome to Inside the Brain of a Large Language Model: How It Works in 30 Minutes.
So I'm here to tell you that when you are building with AI, the most important model isn't the one in your code. It's the one in your head. Your mental model is really the most important, because the better your mental model, the more effectively you're going to build. This isn't new. If you're used to writing native applications, you need a mental model of how the CPU, the RAM, and the disk work. If you're doing web development, which we do a lot of here at CFE, you need a mental model of the browser, or maybe the box model if you're working with CSS. And what I'm here to tell you, no surprise, is that one of the most important mental models you're going to need going forward is AI, simply because it's in everything. So you need a mental model of how a large language model works. The problem is, if you want to go beyond high-level analogies, there's a lot of complexity and a lot of technical gatekeeping. It can feel like people are saying, well, you need a semester of calculus and a semester of linear algebra and then a semester of machine learning, and it can leave you feeling like a deer in headlights. And I'm here to tell you it doesn't have to be that way.
I know that because I teach a class online where, in about one to two weeks, we take people from knowing almost nothing about how the models work to seeing every single step of the model in complete detail. And that brings me to who I am. I'm Ishan. I'm probably best known in the AI community for Spreadsheets Are All You Need, which is an implementation of GPT-2, a large language model that OpenAI open sourced in 2019 and a precursor to ChatGPT. I wrote it entirely in Excel using pure Excel functions: no Python, no API calls. But the most important thing for you is that if you are upskilling or retraining on AI, I have been where you are. I have been a traditional software developer up and down the stack, from device drivers to native applications to cloud and web development. I have been a technical product manager at an enterprise B2B company. I have been a technical leader; I was CTO at a startup. In fact, that's where I got to know Brian, through our work with him there, when I was CTO at Layer0. So I've been where you are, and I'm here to tell you that you can do it. If you know just a little JavaScript, as we'll see today, you've got the foundation to understand a sharper mental model than maybe you've been told you can get. So what we're going to do today is go to AI med school. We're going to operate on our patient, and our patient is going to be none other than GPT-2, the large language model that OpenAI open sourced. They built it in 2019, but don't let that date fool you.
When it was launched, it was state of the art. In fact, it was considered too dangerous to release. But more importantly, if you think about all the large language models you're probably familiar with, like ChatGPT, Llama, Gemini (previously known as Bard), GPT-4, and Claude, they are all from the same family of large language models with a common ancestor, and GPT-2 is basically their granddaddy. So if you understand how GPT-2 works, you're basically 80% of the way to understanding how the more modern models you probably use day to day actually operate under the hood. Now, when I've done this talk in the past, I've done it in Excel using a spreadsheet, and the way the spreadsheet works is you type in some text here on the left, and then here on the right you get the predicted word.
But today, what I'm going to do is use a spreadsheet in the browser that I have built using web components. You can go to it right now at spreadsheets-are-all-you-need.ai/gpt2.
And in fact, here it is right there. If you scroll down, the first thing you're going to want to do is read this or watch this video; it'll show you how to load the model weights. Then it'll say "Ready: all model parameters loaded." You have to download a file and then upload it into your IndexedDB database, but at that point everything is running locally on your machine.
And the benefit of that is that you get to debug it using the DevTools inspector, just like any other web app. So, the way it's structured (let's go back to the slides here) is that there's a series of cells, like this, and there are two types. There are these code blocks that you can edit, and you hit play to run them, and then we have these sheet blocks. So let me go to this one, which I've already run.
And if we go right here, you can see it runs a function if you hit play, and then you can see the result of that function.
So we have a web component called a cell, and that's just a wrapper for every single UI element. It gives us this play button and some other controls, like "run all the way to this cell." And then we have our sheet elements, which execute some mathematical formula, just like you would in a spreadsheet, and then display the result as a two-dimensional table; those are wrapped by the same cell component up here. You give it an ID so other formulas can reference it, and then you give it a formula. Here, for example, it calls the "matrix last row" function and gets the matrix from this other table, or sheet, called "block step 16." That's the DOM ID of another table on the page, and it'll simply fetch that; that's what the brackets mean: go grab the matrix over there. And it automatically knows the matrix height and width in order to do any type of computation.
And then you might be wondering, where do all these functions come from? They come from you, or rather from JavaScript, if you're a web developer. These are the same code elements, which look like these right here. Let's scroll up: something like this "prompt to tokens" function is right here. You can see that function is defined, and then it's basically being called right here. So all the mathematical operations that the sheet does to execute GPT-2 are there, every single thing. The matrices themselves are two-dimensional JavaScript arrays. It's built as a teaching tool, so it's very straightforward; even matrix add and matrix multiply are right there. There's nothing here that is obscure, and there are no layers of abstraction to get in the way. The way these work is, again, you have a cell with an ID, and then you have a script block where you put JavaScript. You set the type to text/plain so the browser doesn't execute it, and then the framework pulls it out and executes it when you hit the play button. That's basically how it's structured.

But again, the cool thing about this is that it has a very native look and feel for debugging if you're a web developer, so let me show you a little bit of that (there's a short sketch of this pattern after the agenda below). Let's go to one of these cells, "separate into words." I'm going to run to this cell, and you can see it just executed all the cells before it, and then it ran this "separate into words" function. Now I'm going to open up my inspector. One thing you can do is console logging, so console.log(matches), so I can see what that looks like. I hit play, I've now redefined that function, I run it again, and we can see that actual variable right here in the WebKit inspector. I can put debugging statements anywhere and just watch the inspector window as it's running the model. But we can go even further: I can just say "debugger," redefine the function, run it, and boom, I'm right there in my inspector's debugger. I can step through this model, because, I don't know about you, but sometimes I don't really understand code until I walk through it. I like to say this lets you debug your thinking. That's one of the great things about having the model so accessible and manipulable; that's how you build a mental model, by interacting with something, and you can step through it with the same inspector you're used to using. Let me just remove this debugger statement so it doesn't break or slow anything down if I need this demo later on.

Okay, so that's the power of having this in the browser: it makes it really accessible to those of us, like myself, who come from a web development background rather than an artificial intelligence background. So today, we're going to do three things in our AI med school. We're going to take our patient, GPT-2, and study its anatomy: understand what makes it tick and look at all the different parts that make up the model.
Then we’re going to take our patient and we’re going to put them through a virtual MRI.
And then finally, we’re going to actually perform brain surgery on our patient, and we’re going to change its thinking without even changing the prompt.
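As an aside, here is the debugging pattern from the demo above as a minimal sketch. The function body is a naive stand-in, not the page's actual tokenizer code; the point is that each cell is plain JavaScript you can redefine with a console.log or a debugger statement and rerun:

```js
// Hypothetical sketch of the pattern Ishan demonstrates: a cell's code block defines a
// plain JavaScript function, so you can redefine it with logging or a breakpoint and
// hit play again. The names and the regex are illustrative, not the page's real code.
function separateIntoWords(promptText) {
  // naive split for illustration; the page's real cell follows GPT-2's tokenizer rules
  const matches = promptText.match(/ ?[A-Za-z]+| ?[^A-Za-z\s]+/g) || [];
  console.log("matches:", matches); // watch this in the DevTools console while the model runs
  // debugger;                      // uncomment to pause here and step through in the inspector
  return matches;
}

console.log(separateIntoWords("Mike is quick. He moves"));
```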
Okay, let’s get into anatomy.
So what is a large language model? A large language model simply takes some input text and outputs the next predicted word.
Technically it's the next predicted token, which we'll talk a little bit about, but it outputs just one thing: the next word. So when you go into, say, ChatGPT or Claude and it outputs a paragraph of text, what it's doing is taking some input, in this case "Mike is quick. He moves," and predicting just one word after it, say "quickly." Then it takes that word, sticks it back onto the input, and you get a new input: "Mike is quick. He moves quickly." Then you run it through the model again, and it says, oh, the next word is "and," and it keeps going. This is an autoregressive model, as they say. But the key point is that the core function you need to understand is just how it predicts the next word; then you basically understand every step of the model. And so the spreadsheet, or the web page right here, simply predicts what the next word is. If you want to see paragraphs of text, you take that word, put it back into the prompt, which is right up here, and rerun it, and you'll get the next token.
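That outer loop is simple enough to sketch directly. Here predictNextToken is a stub standing in for the sheet's full forward pass (tokenize, embed, 12 blocks, language head), not its real API:

```js
// Autoregressive generation: repeatedly predict one token and append it to the prompt.
// predictNextToken() is a placeholder stub so the loop runs; the real version would run
// the model's full forward pass and return its actual prediction.
async function predictNextToken(text) {
  return " quickly"; // stand-in result
}

async function generate(prompt, maxNewTokens = 5) {
  let text = prompt;
  for (let i = 0; i < maxNewTokens; i++) {
    const nextToken = await predictNextToken(text); // one full pass, one token out
    text += nextToken;                              // append it and feed it back in
  }
  return text;
}

generate("Mike is quick. He moves").then(console.log);
```

A chat interface is essentially this loop plus a stopping condition.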
Okay, so we said that a large language model like ChatGPT or Claude is trained to complete sentences like this one, to fill in the blank: "Mike is quick. He moves ___." As a human, you probably understand intuitively what a likely completion would be: he moves quickly, maybe he moves around, maybe he moves fast. But how do we get a computer program to do that? Well, here's a fill-in-the-blank that I think you'll be unsurprised a computer can do really well: two plus two equals four. It's a math problem, and computers are really good at math, even if the computation is really, really complex. So, in effect, what researchers have figured out how to do is take what is a word or language problem and turn it into a math or number problem.

In order to do that, they have to do a couple of things. First, they need to take the words and map them into numbers. In this slide I've simplified it: I show each word going to a single number. In reality, as we're going to see, it's a long list of numbers called an embedding. So we break the prompt up into tokens, and then we turn those tokens into embeddings; that gets our words mapped to numbers. Then we do a bunch of number crunching on them. This is where attention, which you may have heard of, gets factored in, and then we have a multi-layer perceptron, which is another name for a neural network. Together they go through 12 passes in order to finally come out with a numerical result. Now, that result is an approximation; it's a number that isn't going to cleanly map back to any single word. What we do is say that the words closest to the number we predicted out of all this calculation get a higher probability, and the words further away from whatever number we get are given a lower probability. Then the model simply samples from that distribution according to those probabilities, and that's the job of the language head.

So we get this diagram for how the model works. First, we take in text. Then we break that text apart into tokens; these are sub-word units we'll talk a little bit about. Then we map those tokens into numbers called embeddings, and we do a bunch of number crunching. Finally, we reverse the process: we turn the numbers back into text, and we get our next token. Let me walk you through each of these steps in a little more detail.

The first step is tokenization. We split the prompt into a series of tokens, not necessarily into words. In our test sentence, "Mike is quick. He moves," it turns out that every single word maps very cleanly to a single token, but it's not uncommon for a single word to map to two or more tokens. Let me give you an example. Let's go back here and change our prompt from "Mike is quick" to "Mike is flavorize." And then I'll rerun to here.
Let's see, did I spell that right?
Oh, that's not the separation into tokens, that's why it looks wrong. Let's go here.
Okay, so let me blow this up and close this. There we go. You can see right here, at "final tokens," the final output of the tokenization process: it broke the word "flavorize" into "flavor" and "ize" as two separate tokens. Two things to note here. One is that the word "flavorize" isn't even in the dictionary, so this illustrates that it can handle words it wasn't even planned for. The second is that it did map "flavorize" to "flavor" and "i-z-e," which is kind of what you would expect
as a human intuition for how you'd break the word up. But the algorithm under the hood doesn't always map that cleanly; it's kind of a happy coincidence when it does. It does find a lot of common suffixes, but that isn't really the same thing as linguistic morphemes, as they're called, the little sub-word parts of speech that humans use. And it's worth noting that this is not linguistic parsing. A great example is if you take "reindeer" and "reinjury." So let's run that.
Okay, so reindeer starts r-e-i-n, and reinjury also starts r-e-i-n, but reinjury gets split into "rein" and "jury." So it doesn't even map to how you would think of it, like "re" and "injury." It's a good example of something not matching your human intuition. It's a compression algorithm, and it doesn't always cleanly map to your intuition; just know it's going to take your input text and break it apart into these sub-word units.
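Here's a toy greedy "longest known chunk" tokenizer to make that concrete. This is not GPT-2's actual byte-pair encoding or vocabulary, just an illustration of why splits follow a learned vocabulary rather than linguistic intuition:

```js
// Toy greedy subword tokenizer: repeatedly take the longest vocabulary entry that matches
// the start of the remaining word. The vocabulary here is invented; GPT-2's real BPE merges
// are learned from its training corpus, which is why its splits ("rein" + "jury") don't
// match the human split ("re" + "injury").
const vocab = new Set(["re", "rein", "deer", "jury", "injury", "flavor", "ize"]);

function tokenizeWord(word) {
  const tokens = [];
  let rest = word;
  while (rest.length > 0) {
    let piece = rest;
    while (piece.length > 1 && !vocab.has(piece)) piece = piece.slice(0, -1);
    tokens.push(piece);              // falls back to a single character if nothing matches
    rest = rest.slice(piece.length);
  }
  return tokens;
}

console.log(tokenizeWord("flavorize")); // ["flavor", "ize"]
console.log(tokenizeWord("reinjury"));  // ["rein", "jury"], longest match wins over "re"
```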
Okay. Next up is the embeddings. This is where we take our tokens and map them to a long list of numbers. In the case of GPT-2, it's 768 numbers; larger models may map to even more numbers. And every single token gets the same count of numbers, so even the period here gets 768 numbers. Let me show you that.
So let's go here, to token embeddings. Here are the token embeddings for, well, let's use "Mike is quick." There we go. I've added some labels here so you can see which row corresponds to which token. Row one right here is all 768 numbers that map to "Mike"; here's the period. And if I go to row one, column 768, you can see that's where it ends. So every single one of these is 768 numbers long, and that 768-long array is supposed to convey the meaning of each of these tokens. What does that mean? Well, in the time we have here, the best way to conceptualize it is that we're creating a map for words, and we want to put related words closer together. Think of words like "happy" and "glad": they're both happy emotions, so we'll put them here. "Sad" isn't a happy emotion, but it is still an emotion, so maybe it's sitting over here nearby. And "dog" and "cat" are animals, not emotions, so they're somewhere else on the other side of the map. And rather than this being a two-dimensional map, we're putting all the words into a multi-dimensional map. So in the case of GPT-2, with 768 of these numbers, you can think of them as positions inside that map: coordinates placing each token in a 768-dimensional space, with related words closer together and unrelated words further away from each other.
Okay, so that’s text and position embeddings.
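To make "closer together on the map means more related" concrete, here is a tiny sketch using cosine similarity. The four-number vectors are invented stand-ins for GPT-2's 768-number embeddings:

```js
// Cosine similarity: how aligned two embedding vectors are (1 means same direction).
// The vectors below are made-up 4-dimensional stand-ins for GPT-2's 768-dimensional ones.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const embeddings = {
  happy: [0.9, 0.8, 0.1, 0.0],
  glad:  [0.85, 0.75, 0.2, 0.1],
  sad:   [-0.7, 0.6, 0.1, 0.0],
  dog:   [0.0, 0.1, 0.9, 0.8],
};

console.log(cosineSimilarity(embeddings.happy, embeddings.glad)); // high: nearby on the "map"
console.log(cosineSimilarity(embeddings.happy, embeddings.dog));  // low: far apart on the "map"
```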
And then we get into the heart of the number crunching. These are called layers; I like to call them blocks to avoid confusion with the layers in the multi-layer perceptron, which we'll talk about in a second. What happens in here are two key steps. Inside the spreadsheet and the web page you'll see there are actually 16 steps, but I'm going to simplify it a lot: the two biggest ones are attention and the perceptron. Attention is where the tokens talk to each other and get a sense of the larger context and meaning of the prompt they're in. So, hypothetically, "he" might look at "Mike," because that's the antecedent of the pronoun, or "moves" might look at "quick" in order to disambiguate the word, because "quick" has multiple meanings in English. It can mean moving fast in physical space; it can mean being smart, as in quick of wit; it can be a body part, as in the quick of your fingernail; or it can mean alive as opposed to dead in Shakespearean English, hence the phrase "the quick and the dead." The fact that the word "moves" is in this prompt helps the next stage, the multi-layer perceptron, understand, when it tries to predict the next token, that we're actually talking about the first definition, moving in physical space. So it could predict "quickly," it could predict "fast," it could predict "around," but it's certainly not going to predict something about your fingernail, because "moves" is there. Now, what does it mean for these tokens to talk to each other? Again, it goes back to embeddings: they're basically sharing subspaces of their embeddings, or actually transformations of subspaces of their embeddings, with each other, in a fairly complex way that we go into in more detail in class. But conceptually, they're just sharing information and disambiguating the meaning, like we're doing here with "quick."
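For a feel of what "tokens sharing information" means mechanically, here is a heavily simplified, single-head attention pass. Real GPT-2 attention adds learned query/key/value projections, scaling, a causal mask, and 12 heads per block, so treat this as a sketch of the idea only:

```js
// Simplified single-head attention over a list of token vectors: each token scores every
// other token (dot product), softmaxes the scores, and takes a weighted average of the
// token vectors. That weighted mixing is the "tokens sharing information" step.
function softmax(xs) {
  const max = Math.max(...xs);
  const exps = xs.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function dot(a, b) {
  return a.reduce((acc, v, i) => acc + v * b[i], 0);
}

function simpleAttention(tokenVectors) {
  return tokenVectors.map(query => {
    const weights = softmax(tokenVectors.map(key => dot(query, key)));
    // weighted sum of all token vectors, according to how relevant each one is to this token
    return query.map((_, d) =>
      tokenVectors.reduce((acc, value, t) => acc + weights[t] * value[d], 0)
    );
  });
}

// Three made-up 3-dimensional "token embeddings" standing in for a short prompt.
console.log(simpleAttention([[1, 0, 0], [0, 1, 0], [0.9, 0.1, 0]]));
```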
And then we repeat this process: we take the output of one block and feed it in as the input of the next block, and we repeat that 12 times.
Then finally we get to the last stage, the language head, where we pick the token and output it. The way this happens is that the output of the very last block, the twelfth, is a predicted embedding: we take the embedding of the last token, and that is a prediction for what will complete our fill-in-the-blank. But we have to map this predicted embedding back to words, and it's not going to map cleanly onto any single word. So we take the tokens it is closest to in this map, this embedding space, and give those a higher probability, and the ones that are further away get a lower probability. Then the LLM runs a random number generator and picks one of these according to what's called a sampling algorithm. In the case of the spreadsheet and the web page you'll be using, I've implemented the simplest one, called greedy, or temperature zero: I simply take the highest-probability word and output that. But what you're probably more used to is a much more complex sampling algorithm. They might be doing something like top-k or top-p, or beam search, where it actually tries to walk the tree of all the different possibilities and pick the most likely, and there are even more complex samplers than that; Entropix is a really interesting one to look at if you're curious. But the key thing is this: most of the randomness you get out of the model comes from this stage and how it picks the next token. That's where the temperature parameter gets used, because it changes the distribution and gives the lower-probability tokens more of a boost than they would normally get. That's what temperature is really doing, just altering this probability distribution, although it's sometimes described as making the model more creative.
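Here is a small sketch of that last step: greedy (temperature zero) picking, which is what the web page implements, versus temperature sampling. The logits are made up for illustration:

```js
// Turning the language head's scores (logits) into the next token.
// Greedy sampling takes the single highest-scoring token; temperature sampling rescales
// the logits first, so lower-probability tokens get picked more often as temperature rises.
function softmax(logits, temperature = 1) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map(l => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function greedy(tokens, logits) {
  return tokens[logits.indexOf(Math.max(...logits))];
}

function sampleWithTemperature(tokens, logits, temperature) {
  const probs = softmax(logits, temperature);
  let r = Math.random();
  for (let i = 0; i < tokens.length; i++) {
    r -= probs[i];
    if (r <= 0) return tokens[i];
  }
  return tokens[tokens.length - 1];
}

const tokens = [" quickly", " fast", " around", " his"];
const logits = [4.1, 3.7, 2.2, 1.0]; // invented scores for illustration

console.log(greedy(tokens, logits));                     // always " quickly"
console.log(sampleWithTemperature(tokens, logits, 1.5)); // sometimes picks the others
```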
Okay, so this brings us back to our diagram. We take in text, we turn the text into tokens, we turn the tokens into numbers, we do some number crunching, and then we turn the numbers back into text and get our next token.
Okay, let's talk about imaging and putting our patient in a virtual MRI. While we're doing that, let me set something up, because I know it's going to take a second to load. Let's get into another interpretation of the large language model. I showed you this diagram of the large language model as a series of five or six steps, where two of them get repeated very often: the multi-head attention and the multi-layer perceptron. We can actually interpret the model as a communication bus, where each token in the prompt, in this case "Mike is quick. He moves," is a communication channel, and the multi-head attention and the multi-layer perceptron are components on this bus that communicate with each other. The multi-head attention is basically moving information between the tokens, as we talked about, and the multi-layer perceptron looks at every single token and tries to make a prediction, so it only operates on its own part of the communication bus. Looked at this way, the model is communicating across layers: the attention of one layer might be sending signals that actually get interpreted by the perceptron of a different layer. And the way we can get a view into this is a technique called the logit lens. We take that last step I mentioned, the language head, and we stick it in between every single layer to see what its output would be. We ask: what would it have predicted if we had just stopped the processing here? There are 12 layers in GPT-2, so if we had stopped, say, halfway through at the fifth block, what would the prediction have been? And then we look at what the results are. So let me give you a feel for what this looks like.
So right here is a version that isn't quite ready and published on the website yet, but it's very much the same as what you saw on the web. I'm going to go to the very end of it. It's showing me, for "Mike is quick. He moves," the top seven predicted next tokens: "quickly," "fast," "like," "around," "his," "with," "very." Basically, I just changed this "code for predicted token" function: instead of grabbing the largest value, which is what the original does (it grabs the max logit and its index and simply returns that), this version goes through, grabs the top seven, and outputs them in order so we can see what they are. So we can see that "quickly" was the most likely and "fast" is the next most likely. But we can ask: what would the model's top predicted token have been halfway through? So we can go right here. The other thing this version of the code does is save the information in between layers, these are called hidden states, inside a global variable. And as a web developer, you know where global variables live: on the window object. So we grab the hidden states off the window object, and since we're zero-indexed, we'll go to index five to get halfway through. Then we just rerun these cells: re-fetch those weights, run the predicted embedding, and rerun the multiplication to get the logits. It takes a second or two to do that matrix multiply, and then we get the predicted token out. Okay, so here you can see its predicted tokens were "forward," "back," "up," "motion." So if we had stopped halfway through, it thought the right answer was "Mike is quick. He moves forward." Not bad. "Quickly" is actually here, but it's just not high enough in probability to be the max. So let me show you an example of this with a different prompt. Here we're going to go back into Excel for a second.
One note about the spreadsheet, and partially why I built this in the web browser, is that Excel on the Mac will lock up running the spreadsheet, so I actually have to run it inside Parallels, which leads to some graphics issues. First of all, the prompt I'm using here is different from the "Mike is quick" one: it's "If today is Tuesday, tomorrow is," and the next predicted token is "Wednesday." Now, GPT-2 gets this right for all seven days; it knows what the days of the week are and how they relate. So if you change this to "today is Thursday," it'll say tomorrow is Friday. And you can see, if we go here, this is a table I've built out of the same results you saw before, where I'm taking, at each layer, the next predicted token. So right here, if you had asked what the next predicted token is after the end of the third block, for "if today is Tuesday, tomorrow is," it would have predicted "not." One thing to note is that when we make a prediction, every single token is trying to predict the next word after it, even if we already have that word in the prompt. So, for example, in "if today is," the "is" token is trying to predict the next word after it, even though we already have it, and at this point it thought that next word would have been "night." If you scroll down to the end, we get the final results for each prefix; for "if today is Tuesday," it thought a comma would be the next predicted thing, and it turns out that in this sentence a comma is in fact the right next thing. So that's how to interpret this: each of these sections here, separated by blue, is the output of one block, showing what that block thought were the top seven most likely next words. The key thing I want you to note is that the right answer, we know, is "Wednesday." Where is it? Well, it's right here, right after "Tuesday," and that makes sense: in that embedding map of words, "Wednesday" is probably pretty close to "Tuesday." But then it completely disappears, and then it pops up here, but it's not the highest-probability thing, and then it slowly moves up in probability until it locks into place. In fact, we don't even need the last few layers. Okay, so what the heck is going on here?
Well, a bunch of researchers used this logit lens technique, on steroids, to figure this out, and they wrote a really great paper called "Mechanistically Interpreting Time in GPT-2 Small." They later turned it into a paper on arXiv called "Successor Heads." What they found is that there are only four components doing most of the heavy lifting of figuring out what the next day is. It's called the next-day circuit in GPT-2 small, and the key components are the perceptron from layer zero, the attention from layer nine, the perceptron from layer nine, and the attention from layer ten. It's a great example of a circuit created by the model to accomplish a task. So the key takeaway is that the model effectively has, in each layer, a toolbox of components that can do different kinds of jobs, and when it's trying to solve a problem it has, almost dynamically on the fly, or rather during training, figured out how to put those components together into a circuit to predict what the next token is.
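Under the hood, the logit lens is just reusing the language head early. Here is a toy sketch; the helper names and the tiny matrices are invented for illustration and are not the web page's actual identifiers or GPT-2's real weights:

```js
// Logit lens sketch: apply the language head to an intermediate block's hidden state to see
// what the model "would have predicted" if it had stopped there. GPT-2 reuses the token
// embedding matrix as the unembedding, so each logit is a dot product between the hidden
// state and that token's vector. All data below is toy data.
function topTokens(hiddenState, unembedding, vocabulary, k = 7) {
  const logits = vocabulary.map((_, v) =>
    hiddenState.reduce((acc, h, d) => acc + h * unembedding[v][d], 0)
  );
  return logits
    .map((logit, v) => ({ token: vocabulary[v], logit }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
}

// Toy data: a 3-dimensional hidden state and a 4-token vocabulary.
const hiddenState = [0.2, 1.3, -0.4];
const vocabulary = [" forward", " quickly", " Wednesday", " phone"];
const unembedding = [
  [0.5, 1.0, 0.0],
  [0.4, 0.9, -0.2],
  [-1.0, 0.1, 0.3],
  [0.0, -0.5, 0.1],
];

console.log(topTokens(hiddenState, unembedding, vocabulary, 2));
```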
Okay. Now it’s time for surgery.
You're probably familiar with Golden Gate Claude, which was a model created by Anthropic that illustrated some groundbreaking research they did on sparse autoencoders. Golden Gate Claude, if you're not aware of it, was a version of their Claude chatbot that was obsessed with the Golden Gate Bridge. If you talked to it about itself, people said it almost seemed like it thought it was the Golden Gate Bridge; really, it was just always going to talk about it. If you asked it for a math problem, it would give you a math problem that involved the Golden Gate Bridge. And the way they did this, at a very high level, is with a technique called a sparse autoencoder. You've got your large language model as the patient, and you've got this thing called the residual stream, which is what we talked about earlier when we did the imaging; that's basically those hidden states we were looking at. And you have another AI model, called a sparse autoencoder, look at the residual stream and try to turn it into interpretable features that are distinct and independent from each other.
And it turns out there's a team of folks, led by Johnny Lin and Joseph Bloom and some others, at neuronpedia.org; you can go to that website, and they have mapped out sparse autoencoders for various open source models as well. So here's a feature they found in GPT-2 small, which is our patient. It's number 7650, and this feature in the sparse autoencoder is obsessed with anything Jedi or Star Wars related; as you can see here, those are the kinds of things it activates on. We can take that and turn it into a vector. You can think of it as: back in our map of words, there's a Jedi direction, and if I add it to anything, it moves it in the Jedi direction. So what I have here is a version of our sheet where I've taken that feature and turned it back into a vector, this Jedi vector you can see here, and I've added some code so that at the start of every block there's a copy-table step; this is a modified copy-table block. Right now, when you run it normally, the predicted word for the prompt "Mike pulls out his," if we scroll down, is "phone." So let's go back to our Jedi vector, and, if I can spell, there we go: we're going to turn this from true to false. The original version just copies the table over unchanged. Now, for every block other than block two we do the same thing we did before, but in block two we grab the Jedi vector, scale it by this strength, and add it to every single token except the last one, only in that block, to see if we can actually change its thinking. So let's do that. Now we need to reset, run the model, and let it go. This is the model running in action on my machine. It takes anywhere from one to two minutes to make its final prediction; it actually runs a little faster than that, but I've added some pauses between the runs of each cell so it's a little easier to watch.
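Here's a minimal sketch of the steering step he describes: add a scaled feature direction to every token's hidden state except the last, in one block only. The function name and the tiny vectors are illustrative; the real Jedi vector is a 768-dimensional feature from the sparse autoencoder catalogued on neuronpedia.org:

```js
// Activation-steering sketch: nudge the residual stream toward a feature direction at one
// block. Everything here (names, sizes, values) is a stand-in for the real 768-dimensional
// feature and the sheet's actual modified copy-table code.
function steerHiddenStates(hiddenStates, featureVector, strength, blockIndex, targetBlock = 2) {
  if (blockIndex !== targetBlock) return hiddenStates; // leave every other block untouched
  return hiddenStates.map((tokenState, t) => {
    if (t === hiddenStates.length - 1) return tokenState; // skip the last token
    return tokenState.map((x, d) => x + strength * featureVector[d]);
  });
}

// Toy example: 3 tokens by 4 dimensions, with a made-up "Jedi" direction.
const hiddenStates = [
  [0.1, 0.2, 0.3, 0.4],
  [0.5, 0.6, 0.7, 0.8],
  [0.9, 1.0, 1.1, 1.2],
];
const jediVector = [1.0, 0.0, -0.5, 0.25];

console.log(steerHiddenStates(hiddenStates, jediVector, 8, 2)); // block 2: steered
console.log(steerHiddenStates(hiddenStates, jediVector, 8, 5)); // other blocks: unchanged
```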
I should note there are multiple ways to do this kind of modification of the model's thinking without prompting. One of the earliest techniques, I believe, was called activation steering. What they did was much simpler than a sparse autoencoder. They took the word they wanted to emphasize, say the word "Jedi," ran it through the model, went to, say, block two, and pulled out the hidden state for the Jedi token. Then they'd find something they wanted to clamp down and de-emphasize, in this case "phone," because we don't want the prediction to be "phone" again, and they'd run "phone" through the model by itself. Then they'd take the result for Jedi at block two and the result for phone at block two, subtract one from the other, and that difference would give them a new steering vector to inject into the model. It turned out to work occasionally, not as reliably as the sparse autoencoders, but it was one of the first of these techniques. There's another technique that uses principal component analysis, or PCA, which is a tool very often used for visualizing these large embedding spaces; they used it to find plausible vectors that might alter, or monitor, the thinking of the model.
One of the things they did in that paper was use PCA techniques to look at the model while it's thinking to determine when it's hallucinating or making errors, so that's another demonstration of a similar technique. Okay, here we go: "Mike pulls out his," and what's the predicted token? It went from "phone" to "lightsaber," right? We moved everything into the Jedi space, so now, instead of a phone, it makes more sense that Mike, if he were a Jedi, would pull out a lightsaber. So we have done it: we have built the world's first GPT-2 Jedi. Who said droids can't wield the Force?

But more seriously, the message here is that you can grow your AI skills by sharpening your mental model of what's inside the black box. There are really three reasons you want to do this, even if you're just prompting the models and you're not a machine learning engineer. The first is to know your tools: you build a better intuition for model behavior and a better sense of where the limits of the model are. The second is that this field is moving really, really fast. If you're going to interpret the latest developments, or judge vendor claims and whether they're credible, it helps to be able to go through a paper and, maybe not understand every formula in complete detail, but at least get the gist when they say we've swapped out the layer normalization with one thing, or we've swapped out the attention with some other form. You at least want to know what they're talking about and whether it even seems reasonable or plausible. And last but not least, as web developers, or developers in general, we are very often communicating with non-technical stakeholders. If we're communicating with somebody who thinks this thing is magic, that can lead to misunderstandings about its limits and its behavior, and if we can explain it more clearly and precisely, we can clear those up.

Let me give you a couple of examples. We talked about the idea that LLMs are basically next-token predictors. I really love this metaphor from Andrew Mayne, who was at OpenAI as one of the first prompt engineers: the LLM is basically like the guy from Memento. If you haven't seen that movie, that's the one where the guy writes stuff on his body so he can remember what he's going to do. His point is that when the LLM sits there putting out the next token, wherever it is, it doesn't really remember what it was doing. It kind of wakes up, looks at the last few tokens, and asks: okay, what was the most plausible thing I was talking about that I could complete this with? A concrete example: some model providers' APIs, and this one is taken from Anthropic, allow you to put words in the LLM's mouth. Here they're trying to extract some structured data out of free text and turn it into JSON, and the API lets you prefill the response with an open curly bracket. Normally the model would wake up and say, "Okay, here's the result," and you want to get rid of that. Instead, you prefill its mouth with the open-bracket symbol, so the model wakes up and goes: okay, what was I doing? Oh, I must have been outputting JSON. Let me make sure the next thing I produce is valid JSON. And that helps guide the model towards outputting JSON.
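Here is roughly what that prefill looks like against Anthropic's Messages API. Treat the model id and the exact fields as assumptions to verify against the current docs:

```js
// Prefilling the assistant's reply with "{" so the model "wakes up" mid-JSON and keeps
// going with valid JSON. This mirrors the general shape of Anthropic's Messages API as I
// understand it; the model id is a placeholder and details may differ from the live API.
async function extractAsJson(text) {
  const response = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": process.env.ANTHROPIC_API_KEY, // assumes Node with the key in the environment
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model: "claude-3-5-sonnet-latest", // placeholder model id
      max_tokens: 512,
      messages: [
        { role: "user", content: `Extract the name and email from this text as JSON: ${text}` },
        { role: "assistant", content: "{" }, // the words we put in the model's mouth
      ],
    }),
  });
  return response.json();
}

extractAsJson("Reach me at mike@example.com. Thanks, Mike").then(result => console.log(result));
```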
Another thing you might have heard about is chain-of-thought prompting. One way to interpret this is to go back to our diagram of the model and note that every step here takes a fixed amount of compute; in fact, the number of times we go through it is fixed. The reason this was implementable inside a spreadsheet at all is that there's fixed compute per token: it's one complete pass, and everything gets hit. That's good for implementing a spreadsheet, but it's bad if you're dealing with a problem that needs more compute, more thinking, because it's especially hard. One way of looking at chain of thought is that by generating more tokens, you're giving the model more compute to spend on the problem, more hits of attention where the tokens can look at each other and figure out what the right answer is. So, for example, if you're going to ask a model to do a marketing plan, don't just ask it to generate the content. Put in some sample content, maybe your favorite content, and don't even ask it to clone that right away; first ask, hey, what are the value propositions? What makes it great? What you're doing is stuffing the context with more material for attention to look at, and the model will pull that context back out when you then prompt it and say, okay, now generate something. Okay, here are the references for everything I talked about today, so you can check them out.
But congratulations on graduating from our AI brain surgery school for December 2024. If you want to dive in deeper, I do have a class that I teach where, in basically one to two weeks, we go through every single step of the model, and you'll see in precise detail how it works. You do not need to be an Excel wizard. You don't even need to be a web development wizard; this is really built with very simple JavaScript.
And we walk through the entire model. We give a lot of what I like to call tangible or intuitive reasons for why things work, we don't get into a lot of math, and you get to see in precise detail how the model works. I've even got students taking it now who are implementing GPT-2 in their favorite programming languages, because of all the detail they were able to see; they found it a lot simpler to understand.
As a special for the CFE audience, I have a holiday promotion code.
If you sign up in the next seven days, you will get a discount, so check it out. And I just want to leave you with where we started: hopefully you've got a better appreciation, where you're no longer a deer in headlights when people talk about how large language models work, and you have the feeling that, even if you don't have complete mastery of the model yet, complete mastery is within your grasp. It's not harder than all the other things you've learned as a developer before, and it's totally something you can accomplish. Okay, that's it. Thank you. Now I'll do Q&A.
Brian Rinaldi 46:08 That was awesome. Honestly, that was, I'm going to say, the best explanation I've seen of an LLM, and yet I feel like I've got to go back and re-watch it. Some of these concepts will take a little bit to sink into my brain.
But yeah, I already feel like I understand it better.
Ishan Anand 46:34 Yeah, that's something I often tell my class: the first time you walk through it, don't expect full comprehension right after somebody shows it to you. That's why I use that phrase: you get the feeling that full comprehension is within your grasp, by going back and watching the video and by playing with it, right? That's how we build a mental model, by interacting with something. And that's why tools like this, or the spreadsheet, are hopefully so helpful.
Brian Rinaldi 47:04 Yeah, definitely. I'm going to go mess with that spreadsheet. So we did have a couple of audience questions, and then I'm going to open it up. Just to remind everybody, we're here to answer any questions at this point, so if you have any, put them in the chat and I will continue to ask them. So, one person asked: why 768?
Ishan Anand 47:27 Yeah, I get that question in every class and every time I do this talk. The answer is that it's entirely empirical. It's what's called a hyperparameter, a number that the model designer essentially picks, and what they're trading off is this: if they make that number too big, it becomes a lot more computation. You saw that the token embeddings matrix was 768 wide, right, row one, column 768. If I decide to have an embedding that's twice as big, I've doubled, actually more than doubled, the number of computations I have to do in every single step. So your memory and compute get worse, but you get more expressive power, because with more dimensions in that space you can place the words more precisely and capture better, more valuable relationships. So it's a trade-off, and the model designer has to pick it. GPT-2 small picked 768; there are other versions of GPT-2, like GPT-2 Large and GPT-2 XL, where the number is larger, up to 1600.
I think Llama, for example, is something like 16,000 in its embedding dimensions. Today's models, because they have more budget to throw at compute, will have a larger number of embedding dimensions. But I will say GPT-2 is a popular model among mechanistic interpretability researchers, because it's so small you can do experiments really cheaply and quickly on it.
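A back-of-the-envelope sketch of that trade-off: the big weight matrices in each block grow roughly with the square of the embedding dimension, so doubling 768 roughly quadruples the per-block weights. This is an approximation for intuition, not an exact GPT-2 parameter count:

```js
// Rough scaling: the attention and MLP weight matrices in a transformer block grow with
// d^2 (d = embedding dimension), so doubling d roughly quadruples the weights and the
// matrix-multiply work per block. Approximate, ignoring biases and other small terms.
function approxBlockWeights(d) {
  const attention = 4 * d * d;   // query, key, value, and output projections
  const mlp = 2 * d * (4 * d);   // up-projection to 4d and back down
  return attention + mlp;
}

const small = approxBlockWeights(768);
const doubled = approxBlockWeights(1536);
console.log(small, doubled, (doubled / small).toFixed(1)); // ratio is about 4.0
```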
Brian Rinaldi 49:14 That makes sense. So, obviously, I'm trying to even picture how this all works. It reminds me of, many years ago, implementing search, where you're trying to find something related to what the user put in, but not literally based on the words. I implemented an algorithm once that didn't literally interpret the words; it just looked at similarities between words and then predicted related things from that. Which is, I guess, at a very minor level, kind of what they're doing: they're mapping things out and finding the relations between these words, but on a much grander scale.
Ishan Anand 50:12 Yeah, actually, I would say that's minor in scale but majorly similar in concept. What you described is like a classic recommendation system. You've probably heard of the Netflix challenge, right? You build this large matrix of what movies a user likes and compare that to another user, and embeddings are essentially doing the same thing. In fact, in my class, one of the exercises we do is build our own embeddings: we take a couple of Wikipedia pages and map the words out just by looking at how often they occur with other words. That's basically what it's doing. Just like we look at movies that people like and other people who like the same thing, we look at words that occur in the same context. This is called the distributional hypothesis. There's a phrase you might have heard when people talk about embeddings, that you shall know a word by the company it keeps, which is the idea that similar words tend to occur in similar contexts. That's how it figures out how to relate the words together in that map. But yeah, very similar concept.

Brian Rinaldi So tell me, how does it work when, for instance, you showed reinjury and reindeer? It's not obviously mapping to anything related to the concept.

Ishan Anand No, yeah. So what's happening is that it's really a compression algorithm, and there isn't an a priori way you can understand how it's going to break a word down, other than to know what it's doing. Imagine it took all the text on the internet and said: I'm going to compress all of this into 50,000 tokens. What would be the smallest set of tokens I would need to represent this large dictionary, this whole corpus, or body of text, of everything on the internet?
So words that occur very frequently get their own token, which is why "Mike is quick. He moves" tokenizes the way it does: those are all very common words, so they each get their own token, because it's most efficient, if a word occurs often, to give it its own token. But if a word doesn't occur a lot, then it says, I need to break this up, and it breaks it into the most efficient sub-word units it can find to represent that word. In the case of reinjury and reindeer, it just so happens that, the way those words are constructed, it's more efficient to represent them as three tokens in one case and two tokens in the other. And I want to be clear: this is about the most efficient way to represent it in the whole corpus of text it was trained on, not the most efficient way to represent it in the prompt you give it. That's a really important distinction, because what it's trying to do is compress all the text on the internet that it's going to put into training, to compress the amount of work it needs to do during the training process, and we're just translating back into that new language it created during training. So it's hard to intuitively describe what it picked. It happens to pick linguistic morphemes not completely by coincidence; it's more like correlation. "Est" on the end of words, or "ize" on the end of words, is very common, so it's not a surprise it finds those, but if a suffix is less common, it may not find it. Reinjury is a great one: it didn't pick the "re"; it decided "rein" was more common. So hopefully that helps. We go through the entire thing in detail in the class, usually 45 minutes or so to go through the whole algorithm, and then as an exercise we work through doing it manually, which hopefully gives a better intuition for how the training built the vocabulary and how it breaks words apart. But the best high-level intuition is that it's just looking for the most common sub-word units and using those.
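As a toy illustration of the co-occurrence idea behind the class exercise Ishan mentions (an invented mini-corpus, not the exercise itself):

```js
// Toy distributional-hypothesis demo: build a tiny "embedding" for each word from counts
// of which words appear near it. Words used in similar contexts end up with similar count
// vectors. The corpus is made up purely for illustration.
const corpus = "the happy dog runs fast the glad dog runs fast the sad cat sleeps".split(" ");
const vocab = [...new Set(corpus)];

function coOccurrenceVectors(words, windowSize = 2) {
  const vectors = Object.fromEntries(vocab.map(w => [w, vocab.map(() => 0)]));
  words.forEach((word, i) => {
    for (let j = Math.max(0, i - windowSize); j <= Math.min(words.length - 1, i + windowSize); j++) {
      if (j !== i) vectors[word][vocab.indexOf(words[j])] += 1;
    }
  });
  return vectors;
}

const vectors = coOccurrenceVectors(corpus);
console.log(vectors["happy"]); // similar counts to vectors["glad"]: they keep the same company
console.log(vectors["sad"]);
```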
Brian Rinaldi 54:10 Awesome, okay, that makes sense. And the other question we had was, I think, related to your spreadsheet. I'm going to assume the answer is no, but somebody asked if it's open source.
Ishan Anand 54:23 on what the website or the
Brian Rinaldi 54:27 Yeah, the spreadsheet that you built that the the
Ishan Anand 54:31 Yeah, your web, the GPT two web page that you the GPT two web page. So the, well, first of all, like, I have a class that, you know, is is paid, but I kind of giving away software is kind of what I say, and I monetize on the content. But the spreadsheet is available for download on GitHub. The Yeah, so you can just go to like, you can just go to GitHub. Spreadsheets are all you need. You can go to release. Is, and you can download, you know, the spread, the old spreadsheet that runs in Excel. The website itself. I’m planning to open source the framework that I use to build it. I haven’t figured out what the license is going to be, so I’m basically still figuring that out. But for and then whether the actual page itself is, is open source, not, I’m still basically deciding, but for the purposes of, like, teaching and exploration, like, I wrote it so it was readable, like you can you saw me stepping through it. I’m just trying to figure out what like, what makes sense license wise. But that’s like, you can go, it’s a web page, and it’s been written in a way that you can step through it and you’ll see comments in there. And I’m planning to go back and make the comments more readable for a reason, because I want people to be able to play with it, so I don’t want to inhibit that. I just haven’t figured out which parts get open sourced and where I draw that line and what the license is. So that’s a TBD, but I very much want this to be used, so I’m figuring that out. In fact, if people have feedback on the license they would like to see, let me know. So I’m all ears. Reach out to me on LinkedIn, Twitter, blue sky. Ping me. I’d love to hear what people think.
Brian Rinaldi 56:20 Awesome. Okay, yeah, that was, he clarified that, that is what he was asking the JavaScript implementation. So, yeah, Dan, you can reach out to us. Ishan, all right, that was, that was, we’re running out of time here. But that was actually fantastic. Like I said, I really do want to go back and re watch it. And your class sounds fabulous.
Ishan Anand 56:43 Oh, one thing on the class I should note: your company probably has a learning and development budget, for those of you watching, and this is usually applicable to that budget. In fact, we're at the end of the year, so your learning and development budget is probably going to expire if you haven't used it. And you can say, hey, I'm going to be a more AI-savvy engineer at work, which is certainly going to be helpful for the organization. So another way to think about it when you see it is: it's a great way to use some budget that might disappear on New Year's.
Brian Rinaldi 57:23 Yep, great idea. Okay, so thank you, Ishan, this was really good.