Dave McAllister 0:09
Cool, thanks. Thank you, Melissa. That was her polite way of saying I’m a good old dude with that historical knowledge. So yes, I’m David McAllister. I’m an open source technologist here at NGINX, which is part of F5. Some of you have probably heard a little bit about NGINX; we’re fairly well known in the open source world. But today I want to talk about something a little bit different, and that’s distributed tracing. And by the way, I love the theme. So, if you’ll notice, I’d like to introduce you to one of my cats, Julia, here. I am owned by four cats, so I am completely used to being ignored. But to talk about distributed traces, we have to talk about some things that are important, and one of them is why. We start off with the concept that we’re doing this for this thing called observability. Observability is the set of things that let us look inside our application space, especially as it gets more complex, so that we can figure out the best way to fix it when it’s not acting right. In general, there are lots of observability signals. These were the three pillars; I think the number is up to six or so now, and I recently saw a new acronym, TEMPLE, for talking about observability signals. But observability signals really break down into these three categories, none of which are brand new; they’ve actually been around for quite some time. Metrics: do I have a problem? This is the numbers side of the house. Logs: why is that problem happening? What does the application think is going on at any particular time? And then finally traces, which are sort of the new member here: where is the specific problem? So you can think of this as metrics to detect, traces to troubleshoot, and logs to give me the root cause, the ultimate source of truth.
And basically, it gives us visibility into the state of the system while at the same time giving us a reduced mean time to clue and a reduced mean time to resolution. Monitoring, which overlaps with observability, basically keeps eyes on the things we know can go wrong: we know we might run out of memory, we know we might run out of space. Observability lets us detect things we didn’t expect and helps us explain why they went wrong, so that we can move them into that monitoring category. Short, dirty point: why does anyone care about this observability? The answer is microservices. Almost all of our applications these days are built out of lots of loosely coupled, independently deployed smaller services. They’re often in different languages. They are very maintainable and very testable, because we test each section independently and we can deploy them independently. The problem is that they’re not all owned by the same team, and they can be spread widely across distributed cloud environments. Because of that, microservices add some unique challenges to our space. Anyone in mathematics knows you only change one variable at a time. But when we take microservices and put them into the cloud, we’ve changed two variables at a minimum. We’ve added microservices, which give us a much more complicated model with this loosely coupled intercommunication practice. At the same time, we’ve added cloud elasticity and ephemeral behavior. This means we have two intersecting things changing, so failures may not repeat, and debugging multi-tenancy can be painful. Generally speaking, and we’re going to cover this a little deeper, we’re talking about a lot of data. So when we think about this, we have to think about how we move past that complicated model, that chaotic model, into emergent practices. And that’s where observability comes into play.
So you’re looking at what’s called the Cynefin framework. It’s one of my best descriptions for how we deal with things that are changing in multiple directions, so I use it quite often. Basically, tracing is a data problem. Tracing is going to tell you what’s going on with each of those pieces. Tracing tracks an independent request as it flows through the system, from where you decided to start all the way through to its conclusion, whether it’s successful or not. That gives us the ability to have actionable insights based on the application experience or the user experience, depending on what portion we’re looking at. It also adds some more metrics, metrics that are very important to us when we’re looking at the best ways to approach making new things work. What you’re looking at here is a somewhat sped-up RED monitoring chart. RED stands for rate, errors, and duration. As you can see, I’ve sped this up, but I was looking at somewhere around 14 to 20 requests coming through per second, the average latency in this case is somewhere around 280, and my error rate goes from nominal all the way up to, I believe, 21% at the high end. I’m also seeing a strip chart approach, so I’m actually seeing the data as it comes into the system. Keep in mind that tracing is an individual piece of information: every single request is a trace. And that can be a ton of data. But data is only useful when we can aggregate it, analyze it, and then generally visualize it. Those things give us the ability to look at new alerts and new debugging, and give us that faster time to clue and to response. But there are lots of different models that can come into play around tracing functionality.
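A minimal sketch of how individual traces could be rolled up into the RED numbers described here (rate, errors, duration). The `TraceSummary` type, the `red_metrics` function, and the sample values are illustrative assumptions, not part of any tracing library.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """One finished request: how long it took and whether it failed (hypothetical type)."""
    duration_ms: float
    error: bool


def red_metrics(traces, window_seconds):
    """Aggregate the traces seen in one time window into rate, errors, and duration."""
    count = len(traces)
    if count == 0:
        return {"rate_per_s": 0.0, "error_pct": 0.0, "avg_duration_ms": 0.0}
    errors = sum(1 for t in traces if t.error)
    total_ms = sum(t.duration_ms for t in traces)
    return {
        "rate_per_s": count / window_seconds,   # R: requests per second
        "error_pct": 100.0 * errors / count,    # E: share of failed requests
        "avg_duration_ms": total_ms / count,    # D: mean latency
    }


# Four made-up requests observed during a two-second window.
window = [
    TraceSummary(250.0, False),
    TraceSummary(310.0, True),
    TraceSummary(280.0, False),
    TraceSummary(280.0, False),
]
metrics = red_metrics(window, window_seconds=2.0)
print(metrics)
```

Each trace on its own says little; the aggregate per window is what a RED chart would plot.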
We can have RUM, real user monitoring, actually tracing the behavior of a user through the system from start to finish. We can have synthetics, which do a similar thing but use a known trace, a known structure, so that we can compare across upgrades or deployments; we can actually start comparing the changes in our behavior from a user viewpoint, because we’re using something that we’ve already baselined. We can have network performance monitoring: how the network performs across a trace. Keep in mind that we are usually talking about a remote system, a browser talking to a back-end system through a loosely coupled network of microservices, so network performance becomes incredibly important. We can have application performance monitoring, showing us what the back end is doing at a particular time. And then underlying all of these, because as much as we’d like to think otherwise, we still have to have infrastructure that we run on. Each of these things is driven by an observability signal. RUM, synthetics, network performance, and APM all make heavy use of tracing, so tracing drives the way we now look at the application model. So that’s why we are looking at this, but this is what it’s really all about: tracing concepts. The tracing concepts are pretty simple. Let’s start with the basic item of work, a span. A span is the discrete unit of work in your system. That can be an application or a microservice itself, or it can be a portion of the microservice; each of these pieces can be defined as a unit of work. The trace is all the spans for a specific request taken together. And that trace is unique inside of your system: there is one and only one trace that matches those pieces.
And then you can also add other things to it: distributed context, where it’s coming from, and what tags you want to add so that people downstream can better understand what the structure is and what the underlying pieces look like. You can also add options, things that you want carried over across each of those pieces.
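To make the span and trace vocabulary concrete, here is a small data-model sketch. The `Span` class, its field names, and the sample timings are assumptions made for illustration; real tracing libraries define their own types.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work; every span in a request shares the same trace_id."""
    trace_id: str
    name: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None  # None marks the top-level span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    tags: dict = field(default_factory=dict)  # extra context for people downstream

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def new_trace_id():
    """Random 32-hex-character ID, so two requests practically never share one."""
    return uuid.uuid4().hex


# One request: a top-level span with two child spans (made-up timings).
trace_id = new_trace_id()
root = Span(trace_id, "GET /checkout", 0.0, 120.0)
auth = Span(trace_id, "auth", 5.0, 38.0, parent_id=root.span_id)
catalog = Span(trace_id, "catalog", 40.0, 112.0, parent_id=root.span_id,
               tags={"component": "database"})

trace = [root, auth, catalog]  # the trace is simply all spans with this trace_id
children = [s.name for s in trace if s.parent_id == root.span_id]
print(children, auth.duration_ms)
```

The parent links are what let a viewer rebuild the request tree from a flat list of spans.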
Dave McAllister 8:45
So let’s take a look at a trace. This is a directed acyclic graph, and it’s a standard way of looking at trace behavior. In this case, I’m looking at the trace behavior going through a system of microservices, and each of these pieces adds specific information. In this particular case, each of my little circles is a microservice or a service application, and inside of that you’ve got how long the application itself took. This is actually not showing an individual trace; it’s showing an aggregate viewpoint. I’m also seeing the pathway where each service sends its next step: for the piece of work that is a span, where does the next span go? As you can tell, each of these pieces goes to a different microservice, sometimes multiple microservices, and each of those communication pathways also has an amount of time that it has to deal with. In this case, the API sending off to the catalog takes 72 milliseconds, and sending off to, I think, the order form takes [unclear] milliseconds. And you can see each of the pieces as they go out. Inside of this, depending on the visualization model, you may get additional information: you may get error rates, or you may get the size of the connection, how much data is flowing through it. Visualization is completely separate from the creation of the trace data itself. So once we have the trace, we can visualize it in lots of different ways, we can bin it, and besides that we can apply all sorts of interesting statistical approaches. This is one of the ways of doing it, and it will clearly show you the pathways and the connectivity between them, especially when there are multiple elements. Now, the other side you’ll see is a standard trace waterfall plot.
The trace waterfall plot starts with, if you will, the trace itself, the top-level environment, and that is the piece that has to extend until the complete work structure ends, times out, or receives an error. Underneath that you will see it breaking down: you’ll see the spans, and inside of the spans you will also see additional items of work where they are being called out. So in this case, we can look at this and say the API was called after the HTTP request, which called the catalog, which called authorization; the authorization return came back and went back through this. And we can see how long that took, in this case roughly 33 milliseconds. At that point we are authorized, which starts the next step, looking at the catalog, because it now knows that the person is the right person. Those two major spans also have child spans. So the trace has spans, those spans can have sub-spans underneath them, and the two of them track together pretty closely; that’s about two milliseconds off from that point. You can also take that same trace data and ask how much time was spent in the application, how much time was spent in a database, and how much time was spent on the network. Each of these pieces gives you additional information specific to where your activity is happening and where your time is being spent. Now, how does that magically work? What’s this thing called baggage? Observability, particularly the tracing side, includes this thing called baggage. Baggage is contextual information that’s passed between spans. Consider it a key-value store that resides alongside the span context and makes values available to any span created within that trace. It uses a thing called context propagation to pass baggage around; context propagation basically makes sure that you have a safe place to store it.
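As a rough illustration of context propagation, the sketch below writes a trace ID, span ID, and baggage into HTTP-style headers and reads them back. The header layout is modeled on the W3C Trace Context `traceparent` and W3C Baggage formats, but it is simplified (no baggage properties, size limits, or validation); the `inject` and `extract` helpers and the sample keys are hypothetical.

```python
from urllib.parse import quote, unquote


def inject(trace_id, span_id, baggage):
    """Write trace context and baggage into a plain dict of HTTP headers."""
    headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    if baggage:
        # Values are percent-encoded but otherwise travel as readable text.
        headers["baggage"] = ",".join(
            f"{key}={quote(value, safe='')}" for key, value in baggage.items()
        )
    return headers


def extract(headers):
    """Recover the trace ID, parent span ID, and baggage on the receiving service."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    baggage = {}
    for item in headers.get("baggage", "").split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            baggage[key.strip()] = unquote(value.strip())
    return trace_id, span_id, baggage


outgoing = inject(
    "4bf92f3577b34da6a3ce929d0e0e4736",
    "00f067aa0ba902b7",
    {"customer.tier": "gold", "region": "us west"},
)
print(outgoing["baggage"])
received = extract(outgoing)
```

Because the baggage header is ordinary text, anything placed in it is visible to whatever can see the request.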
A word of caution: baggage should be used only with data that you are okay with potentially exposing to somebody who looks at your network traffic. It is stored in HTTP headers alongside the current context, so it’s not a protected environment and it’s not an encrypted environment. Outside of the pieces that can’t be seen, there are no integrity checks to make sure that the baggage items you’re getting are the baggage items that are yours, so you have to make sure that you’re getting the right baggage. So that’s a very quick viewpoint of what tracing looks like. Even though this is a very complex world and there’s a ton of data here, it’s really useful. So what do we need to make this tracing stuff work? Well, each trace has to be unique; you cannot duplicate trace IDs. And you have to be able to propagate that trace ID through all of the span activities. So you have a trace ID and you have a span ID, and they have to flow together to make that happen. Once we have that, we have to bring this data together; this is the consolidation structure. We bring the data together and ship it to the right places. We want it to be capable in a distributed environment; we’re talking about cloud work here. We’d like it to be standards based. So agents are good, cloud integration is good, and in my personal case, the closer it is to open source, the happier I am. We’d also like code instrumentation to be as easy and automated as possible, because we have to put the tracing instrumentation into our application. The bottom line is we want any code at any time. So the first concept here is APIs. There’s actually a very simple set of APIs that go into every one of these, and every language has to support each of these concepts. The first one is the tracer provider. It’s the entry point: it gives access to tracers, and it sets up the global provider. It can set up others, but it’s the thing that actually lets us talk to tracers.
And then the tracer creates the spans, that unique concept, and it delegates getting the active span, and marking a given span as active, to the context propagation system, so that span information and context get passed correctly. And the span is simply the piece that traces the specific operation. Keep in mind that you can trace just between applications; you don’t necessarily have to trace within an application as well. So enabling this is pretty straightforward. There are two basic options, and I just kind of mentioned one of them: traffic inspection. With traffic inspection, think of it as a service mesh, we can look at how the request is being passed from service to service. We can look at how long it’s taking to get from one service to another, as well as how long it’s been inside a service. What we don’t get is what happened within that service. So it’s a set of information that will show us the behavior of our products and highlight the spots where we may be having problems, but it won’t give you really tight detail within that application or microservice itself. Code instrumentation is the next piece. Code instrumentation actually gets into the code and adds the specific information that you want in order to trace the behavior, the units of work, within that microservices application or larger application. This is not limited to microservices, even though we tend to talk about it that way. But when we start doing this, we have to do certain things. We have to focus on the code, which means we have to have the code, the license, the language of interest, all those different pieces. And we need to focus on the pieces that are important to us. So, focusing on the code side here, you add the client library. If memory serves me correctly, we currently support about 11 libraries as native libraries for OpenTelemetry.
In tracing, there’s only one of those that is not considered to be stable, in other words ready for production. Once we’ve made sure that our library is in place, we focus on the service-to-service communications, and then we can add the things that we want to add. For instance, we did a piece on adding OpenTelemetry to a microservices application after the fact, a production-level service. And we found that it was really useful to tag the trace ID into our logs and to tag the trace ID into our SQL calls. Each of those pieces gives you additional capabilities and additional information that’s important.
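One possible way to tag a trace ID into log lines and SQL calls, sketched with only the Python standard library. The `TraceIdFilter` class, the `current` dict standing in for a tracing library's active context, and the SQL-comment convention are all assumptions for illustration; the talk does not say how the team actually did it.

```python
import io
import logging


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record (hypothetical helper)."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id() or "-"
        return True  # never drop the record, only enrich it


# Stand-in for "the trace ID of the request being handled right now".
current = {"trace_id": None}

stream = io.StringIO()  # capture output here instead of stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(TraceIdFilter(lambda: current["trace_id"]))

current["trace_id"] = "4bf92f3577b34da6a3ce929d0e0e4736"
logger.info("order placed")

# Carrying the same ID in a SQL comment lets a slow query be tied to its trace.
sql = f"/* trace_id={current['trace_id']} */ SELECT * FROM orders WHERE id = %s"
print(stream.getvalue(), sql)
```

With the ID in both places, a log line or a slow query can be looked up against the trace that produced it.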
Dave McAllister 18:32
So what this basically means for traces: you instantiate the tracer, you create the spans of interest, you enhance your spans, and you configure your SDK. This is where life can be a little bit interesting. There is this thing called auto-instrumentation: automatic, just add the appropriate files. If you’ve got Java, you can just add the Java instrumentation directly to it, and you will get all sorts of tracing data coming out. It is very language dependent. Not all languages have this, and for some of them you will have to go in and actually do the coding. If you’re doing the manual side, you import the API and SDK, configure the API and SDK, and create your traces. It’s very straightforward, but you will need to go in and change the code. Create the metrics of interest; if you want RED, make sure you’re getting that. And then send your data: export the data that’s being created to the right place. You can use automatic and manual at the same time. What you can’t do is use two different automatic instrumentations at the same time; that will give you some really strange results. With automatic, your results may vary wildly, and you may not get what you expect. You may get a lot of data, and it may be challenging to figure out what that data is actually trying to tell you. But your apps have to be instrumented to get the correct observability signal. Quick flashback: out of traces, logs, and metrics, you’re probably already instrumented for logs and metrics. Here, you’re adding traces to that structure. So what are you trying to solve? If you’re looking at things like performance issues, where something is running too slow, traces are amazingly good at helping you understand what and why, as well as where, things are running slow.
Are you having problems with time to resolution or time to detection? Traces, because they look at each request independently, can help you determine when things are sliding out of line. And then, as I mentioned before, sometimes it’s nice to know, particularly in an ephemeral microservices environment, where things are happening. Metrics are great for numbers and alerts, and logs give us what the application does, but traces let us know where something happened. So adding traces to our metrics and logs gives us this context about where in our structure things are happening. Bottom line: the more data you have, generally speaking, the better your answers. But do you really need all of that data? You can have a span for every single thing, or you can have a bare minimum. You could say, I’m just going to deal with it from point A to point B, and I want to know how long things took in between. Or you can say, I want to know every time a specific unit of work or an event has occurred inside the application space, as well as inside the environment. Generally speaking, it’s best to start with your service boundaries and third-party calls. Understanding how long the propagation takes between the services, as well as how long things took overall, without going into each application space, is really useful. The nice thing is you can treat this as an iterative process, which means you can build minimal and then add to it as you discover things that you’re interested in. That means you can constantly upgrade and improve your observability, using traces to get specifically the things that you would like. The bottom line is you really have to be careful about information overload. This adds a lot of data.
Think of it this way: a trace is like a log file entry, and that log entry now has anywhere from one span to potentially hundreds of spans. So the data overload coming from traces can be pretty massive. Make sure that you’re getting the data you need without necessarily getting every piece of data. As an example, a former customer at a previous company was creating 43 terabytes of log files in a fairly large cloud-type environment, you might say. They added traces to this and basically turned on everything, and that number went to 92 terabytes per hour. No one can absorb that much data. In fact, it’s even hard to visualize that data in a meaningful way. So beware of information overload. And keep in mind that sampling is not the answer to information overload, because sampling simply reduces the amount of information you see without necessarily giving you the correct information. So distributed tracing gives you insights into your application and into your infrastructure. But it requires effort; you have to build something. The thing is, you don’t see a win until the project is basically instrumented. That work can vary from automatic to manual, and can vary across a service mesh, but it is really useful once it’s done. And there is a startup process to go through. Tracing is this great thing for user happiness; it’s a proxy for understanding what your users are doing. But tracing does not magically solve any of the issues you may be having. You are still going to have to deal with those. And in fact, my favorite quote: “The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.” Distributed traces are our new print statements.
Brian Kernighan said this in Unix for Beginners in 1979, and it is still just as valid here on October 31, 2023. We are required to think about the data that’s being given to us, as well as the places where we can make use of that data to improve our applications and our users’ experience. And with that, I’d like to thank you for listening and for coming out on Spooky DevOps Day. If you need to get ahold of me, I’m Dave Mc on LinkedIn. My slides are all available on GitHub, dwmcallister on GitHub, and you can always find me on the NGINX Community Slack. And with that, thanks, and I will hand it back over to Jose here.
Melissa McKay 25:51
Awesome. Thank you so much, Dave. Your presentations are just packed full of information, so I’m very happy that you shared your slides. I’m sure there are many in the audience who are going to be looking at those later. Something you said struck me just at the end there, that you have to be thoughtful. I do remember the early days of my career, when print statements were all I had to hang on to. At least that’s what I used until I learned better tools, of course. But it seems like, in order to take advantage of the benefits of observability in general, we probably should be developing with that in mind right away, right? Like, maybe not try to put it on at the very end. Maybe we need to start rearranging our thinking and leading with it first.
Dave McAllister 26:47
Absolutely, if you can, build it in from scratch. That’s actually where the most successful observability exists today. We at NGINX took this banking application, which already existed and was microservices based, and basically said, what’s it going to take to add OpenTelemetry to it? And we regretted our lives for about six months. When you think about architecting, you as an architect should know where the units of work are and how they plug together. When you just build the application, what happens is people don’t actually think about units of work and how they fit together, and so it just magically happens. So with tracing, if you build it in when you do this, you’ll have this great roadmap for what’s going on in your app. It’s a lot, lot easier to build it into a greenfield application than to try to go in and dig up the graveyard and add tracing to the corpses. Right.
Melissa McKay 27:51
So unfortunately, a lot of us are in this position trying to tack it on at the end. Sounds like you’ve had your experience doing that as well.
Dave McAllister 28:00
Melissa McKay 28:03
As painful as this is, just because of how things are when you start, do you recommend OpenTelemetry? Is that something I could use today?
Dave McAllister 28:11
I actually recommend OpenTelemetry wildly, and it is incredibly successful. OpenTelemetry is used in so many production applications. Keep in mind that OpenTelemetry was the blend of OpenTracing and OpenCensus. OpenCensus came out of Google’s experience of basically five years running in production, and OpenTracing had been around, by the time they merged, for what was it, almost eight years. So both of them were heavily baked for tracing. When we added metrics and logs to it, the rules vary a little bit. And by the way, let me also add that languages matter. Tracing is pretty much stable, and you can use tracing pretty easily, with the exception of Rust; I think Rust is still in beta for tracing. Metrics are mostly stable, but there are some exceptions inside of that, because we’re coming out of strong metrics environments like Prometheus, and so we can use Prometheus metrics. Logs just hit a stable definition, so logs are a little more challenging right now to put into the overall picture. But OpenTelemetry is the choice for everybody, for all production applications. I know nobody who uses anything that’s not OTel based these days.
Melissa McKay 29:43
You talked a little bit about auto-instrumentation, and I can think of a few reasons why you might not want to go that direction. But I want to hear from you and get some valid
Dave McAllister 29:59
Oh, yeah. So the problem is not that auto-instrumentation doesn’t give you really cool traces. The problem is that the amount of data it gives you is ridiculous, and it’s probably not the data you want. We’ll use Java as an example. You can just add the correct JAR file to the application and, magically, you will get traces. And the traces will be 150 spans long. Buried in those 150 spans might be pieces of information you want, and that volume means you’re going to miss the 20 pieces of information that you really want. The nice thing about auto-instrumentation is that you can, cross my fingers, generally turn off things you don’t want. But what you’re going to do is auto-instrument it, turn it on, go, “Oh my gosh, look at all this data, what am I supposed to do?” and then go back in and start configuring and turning off the things you don’t actually want. And then you’re going to have to go in and manually add the things that you wanted to see that it didn’t give you in the first place. So when we did it, again, for the MARA project, we found very quickly that we really wanted the trace to know what the database call was when it handed off to the database. The database was 80% of our time, and the database was 80% of our problems. So which trace was causing the problem? We had to go in and manually instrument, so that when we saw a database call, it added the trace ID to it and logged it, of course. And by the way, some tracing tools are better than others, and Java is one of the ones that’s the most complete. But this is where completeness may not be your friend. Yeah,
Melissa McKay 31:56
Everything is not always better, right? That sounds very familiar from some of the struggles I’ve had with even just logging: trying to figure out what’s going on with your system with just logging, and then finding out that, oh, we weren’t so thoughtful about the type of logging, whether it was really important, or what level the logging was on. So when we had to turn everything on, it was an overwhelming amount of information. And, you know, we were in the business of shipping logs, so at least we had that part down. It was good, but the amount of information was incredible. Which leads me to think that this is a similar problem with tracing: you can get too much information, right? You talked a little bit about that. Can you elaborate on that just a little more? Yeah,
Dave McAllister 32:47
so I sort of stated this, kind of buried in the presentation. Remember that a trace is an individual request. In general, from a DevOps viewpoint, we’re actually not too concerned about the individual request; we are concerned about the behavior of our system. I threw up a little line about user happiness proxies. A user cares about their request right now: how long it took, and whether it succeeded or failed. When we get into it from a DevOps viewpoint, we’re concerned about the users’ cumulative requests and what the success and failure rate looks like. So if one user out of 10,000 has an error, we as DevOps may not really care unless it’s repetitive. That user cares a whole heck of a lot, just to let you know. That user really cares, because their request didn’t go through. So what can happen is we start seeing a behavior: say we’re seeing 15% failure rates across our user request environment. Let’s go look at those cumulatively. The difference is that tracing lets us group those things together and get the aggregate viewpoint. So we can start using exponential distribution models to predict where things are, or we can use Bayes’ theorem. I did a talk about stats recently; with Bayes’ theorem you say, oh, things are going bad, it’s probably the database. So probability comes into play. But otherwise, think about taking your average application, running tail -f on the log file, and tell me how much information you get scrolling across your screen. That’s what traces as independent viewpoints can do, and that’s why we have to aggregate, so that we humans can actually understand what’s going on and what the trends look like.
Melissa McKay 34:52
Yeah, that can be really overwhelming. So with getting all this data out, can you give me an example of how we would know that users are okay? Like, what’s a good example of some tracing we’d want to pay attention to?
Dave McAllister 35:10
Okay, so again, think about it from the end-user viewpoint, and that’s how long it took. I’m going to start with that one, generally speaking. Google does a study; I don’t know if they’ve done it this year or not, but every two years they did a study that asks how long a person is willing to wait for a request across the internet to complete. And the answer is 3.7 seconds. At 3.7 seconds, you’re likely to have the person leave your site. Especially in an e-commerce environment, they’ll abandon their carts; they don’t care anymore. So if you start seeing requests regularly cross 3.7 seconds and you’re e-commerce, you have a lost sale. Start looking at it from that view. The other side is using something like synthetic monitoring. Synthetic monitoring says, I have a request that I’m going to throw into the system, and I know how long it should take: it’s going to take 158 milliseconds, start to finish. Then you do a new deployment, and all of a sudden that number goes up to one and a half seconds. That’s another red flag you can look for, and that’s an individual trace result. With aggregation, look for the outliers that are crossing your boundaries of interest. If you’re looking at a user, 3.7 seconds is a long time on the internet, by the way, so maybe look for anything that’s taking more than two seconds and figure out why.
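The two checks described in this answer can be sketched as simple functions over trace durations. The function names, the sample durations, and the 2x tolerance for the synthetic check are assumptions chosen for illustration; the 2-second, 3.7-second, 158-millisecond, and 1.5-second figures come from the talk.

```python
def slow_requests(durations_s, threshold_s=2.0):
    """Return the request durations that cross the boundary of interest."""
    return [d for d in durations_s if d > threshold_s]


def fraction_over(durations_s, threshold_s):
    """Share of requests slower than the threshold (0.0 when there is no data)."""
    if not durations_s:
        return 0.0
    return sum(1 for d in durations_s if d > threshold_s) / len(durations_s)


def synthetic_regressed(baseline_ms, observed_ms, tolerance=2.0):
    """Flag a synthetic check whose latency exceeds its baseline by the given factor."""
    return observed_ms > baseline_ms * tolerance


# Made-up end-to-end durations, in seconds, taken from a batch of traces.
durations = [0.4, 0.9, 2.5, 3.9, 1.1, 0.7, 4.2, 0.3]

worth_investigating = slow_requests(durations)         # anything over 2 seconds
likely_abandoned = fraction_over(durations, 3.7)       # share past 3.7 seconds
after_deploy = synthetic_regressed(158.0, 1500.0)      # 158 ms baseline, 1.5 s observed
print(worth_investigating, likely_abandoned, after_deploy)
```

The first two functions work on aggregates, while the synthetic check compares a single known request against its own baseline.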
Melissa McKay 36:48
It’s amazing how much our patience has changed over the years. That number just goes down and down, doesn’t it? We used to, we
Dave McAllister 36:56
used to have this thing called woodpeckers. On the old terminal models, somebody would type their command and hit the return key, and the system would go off and think about it. It wouldn’t come back, so the person would hit the return key again. It still didn’t come back, and they’d end up hitting the return key over and over. We used to call them woodpeckers because they’d do this with the return key, like this. Then the system would respond and see 128 returns, and everything had scrolled off the screen, so you’d have to start all over.
Melissa McKay 37:23
Yeah, yeah, I think I’ve been guilty of that myself, I’m sure. Okay, this is a very selfish question, because you’re with NGINX. I use NGINX, and have in the past on a lot of different projects. Is it recommended for tracing?
Dave McAllister 37:40
So what we did recently is build what’s called the NGINX OpenTelemetry module, as open source functionality. There’s also a community module. The community module will show you what’s going on with NGINX; the NGINX module shows you the behavior of your request through the NGINX system. We actually found, when we went out and talked to our users, because we’re open source as well as commercial, so we do talk to the users, that the users want to know what the behavior is through our system, not what our system is doing. So the NGINX OTel module was designed to literally trace the behavior of the request coming through the NGINX system and attach it into your trace functionality. You can find it on GitHub under nginxinc. Just do a search for OTel and it’ll show up, and we would love to get your feedback on it. Awesome.
Melissa McKay 38:38
Good to know.