Brett Smith 0:09
All right, so I am a principal software developer at SAS Institute. I might be responsible for some of the supply chain pipelines that we built, but I have been using plausible deniability to escape that type of blame. Currently, I'm a security lead for R&D. For a little background, SAS is a 40-year-old analytics, artificial intelligence, and data management company. We're an ISV: we still deliver software to customers. We deliver RPMs, Debian packages, and OCI containers on several architectures, and we've got a pipeline that builds, tests, stores, and eventually allows us to deliver all this stuff to the customers. So we're a little different. I would love to go work at a software-as-a-service company in my next life, but for now I've got lots of packages to worry about.
So I'm going to talk about supply chain robots, electric sheep, and SLSA. Originally, this talk was pretty SLSA-heavy. Version 0.1 of SLSA was really, really strict and really complete, and for version 1.0 they have backed off and focused on the build section. So what I'm going to do is talk about Executive Order 14028 and the SSDF standards, and we're going to throw some ISO in here and some SLSA, but we're not going to talk about that for long, because you guys can go read about that. What you really want is to get to some of the crazy ideas I have.
A couple of things you need to know about the executive order. Some points that they definitely lean on: they want you to separate and protect each environment. They want you to secure and harden endpoints and improve executable security, so use compilers to do the right thing. They want least-privilege access to the code and tools, provenance data and SBOMs, and software release integrity. So how does SLSA fit in?
SLSA is a checklist of standards and controls that we use to prevent tampering and improve the integrity of packages and infrastructure in your projects and enterprises. It's how we get from safe enough to being as resilient as possible at any link in the chain. The framework establishes three trust boundaries and encourages the right standards, attestation, and technical controls, so we can harden systems against threats and risks. This is a cool slide. It pretty much does what it says. One thing I want to bring up is that I'll focus a little bit later on validating artifact integrity. That's a key component of SLSA, especially in the build section now, so we'll get to that.
So each level works to achieve a greater sense of security based on the previous level, right? Maybe some of us are at level zero, and, you know, good luck. It's the Wild Wild West; we don't know what's going on. At level one, provenance exists, showing how the package was built, and there's documentation that helps us not make mistakes while we're building these packages. At level two, we've got signed provenance generated by the hosted build platform, and this prevents tampering after the build. So I build my package, I sign it, and I can validate that the package is what it says it is based on the signature. For level three, we've hardened the build platform. No devs are allowed in there; nobody's allowed in there. There's one door in and one door out, and this helps prevent tampering during the build. And then of course we build on the other layers, so we end up with hermetic builds, which is where we all want to be.
Brett Smith 4:31
So I have some ideas around some of these things that we want to achieve, and I'm going to go about it in an interesting way. I think you guys will appreciate it. The real answer here is that you pay for a supply chain platform, let somebody else worry about this, and sleep at night. But some of us aren't that lucky. Some of us have a long-in-the-tooth, organic hybrid cloud solution: in-house solutions cobbled together with duct tape and baling wire. And, you know, the older that pipeline gets, the more fun it is to work on. So I'm going to present a series of what-ifs.
Let's start with something that should be obvious, but not everybody takes it as seriously as I do. And that is: what if we use GitOps for everything? I'm talking documentation, I'm talking your code, I'm talking infrastructure as code, I'm talking your policy, your ADRs. All of it is in Git. And I know you're going, well, that's kind of crazy. Why are you putting all that in Git? Well, take documentation. Say I want to require two reviews on a pull request in order for it to be merged into main. Then I'm guaranteed that I've had two reviewers of the documentation change. I know who the two reviewers were from the Git history, and I know who the original author was from the Git history. And they can't get it in until it goes through our CI. So now I've got CI running against my documentation, I've got two reviewers, and I've got a git blame trail to tell me who made the change and why we made the change. And then at the end, after it's merged, I run even more CI that does the publication from Git to wherever we want the documentation to end up. For us, it's Confluence, so we wrote a tool in Python that does Markdown to Confluence. The infrastructure as code is pretty much straightforward. You guys all know that answer.
You've got Argo CD hooked up to a repo, and then you get the same two reviewers that I had before, before you can make changes that are automated into deployments. Policy and ADRs: those are two things where that git log trail, that audit trail, really helps you out in the long run. So those are the reasons why we use GitOps.
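To make the two-reviewer rule concrete, here is a minimal Python sketch of the check it implies. In practice this is normally enforced by the Git hosting platform's branch protection rather than by hand-written code; the `Review` type, the `can_merge` function, and the names in the usage lines are hypothetical and only illustrate the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    approved: bool


def can_merge(author: str, reviews: list[Review], required: int = 2) -> bool:
    """Return True when enough distinct people other than the author approved."""
    approvers = {r.reviewer for r in reviews if r.approved and r.reviewer != author}
    return len(approvers) >= required


# Hypothetical pull request: the author's own approval is ignored.
reviews = [Review("alice", True), Review("bob", True), Review("carol", False)]
print(can_merge("alice", reviews))  # only bob counts, so this is not mergeable yet
```

The point of the sketch is that the rule counts distinct approvers who are not the author, which is what makes the Git history a usable audit trail.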
Brett Smith 7:04
Moving on, this is the best part. So what if we had a complete inventory of our CI/CD infrastructure? That would be awesome, wouldn't it? Seems pretty necessary. In our world, the DevOps teams can do their own deployments. Seems safe, right? Compliance and security problem 101: I didn't deploy it, so I don't know where it is, so how can I enforce it? Well, we've got the old-school way. I've got hundreds of Jenkins servers, and I've got a Confluence page over here with a list of all the URLs of the Jenkins servers. So we know where all the Jenkins servers are based on this list, except for the ones that developers deployed with shadow IT to get around the restrictions we put on the Jenkins servers we have. Okay, so that didn't help that much.
The next problem we ran into is that we built an event-driven framework for our pipeline. It's services and events, and events are all asynchronous. And we're not sure what service gets triggered by what event, because the development teams can deploy their own services on top of the infrastructure we've deployed. So how do we fix this? What are some ideas I can give you guys around this sprawl that I have caused? And, you know, it was my fault. You know what happens: you're doing stuff, you break things, you lose things, I don't know where that server is. Anyway.
What if we build out a framework so that our DevOps teams get inventory management for free? What if we automate the IaC audits, which would tell us where things are deployed? That's not too bad an idea. I mean, as long as they checked it into a Git repo, then based on my GitOps theory, the manifests would tell us where these things are. That doesn't help if they didn't use GitOps. So what if we had event-driven registration? Basically, self-registering services. The idea is that I've got this service, and it's running on the events in the event framework.
And then there's a Kafka topic, and the service ARPs itself when it starts up and says, "Hey, I'm here." That goes on the Kafka bus. You've got a service over here that's collecting all these ARPs off of the Kafka bus and cataloging where all the things are. That's a pretty decent idea, right? That's great, right until you find out that you're two abstraction layers up, and the IP addresses you have in the Kubernetes cluster only relate to the Kubernetes cluster. All right, so now I've put myself in that situation. How do I solve that?
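The self-registration idea can be sketched in a few lines of Python. This is an illustration only: a standard-library `queue.Queue` stands in for the Kafka topic so the example runs on its own, and the message fields (`service`, `listens_to`, `address`) are an assumed schema, not the one used in the pipeline described in the talk.

```python
import json
import queue
import time


def announce(bus: queue.Queue, service: str, listens_to: list[str], address: str) -> None:
    """Publish a 'here I am' message when a service starts (hypothetical schema)."""
    message = {
        "service": service,
        "listens_to": listens_to,
        "address": address,
        "started_at": time.time(),
    }
    bus.put(json.dumps(message))


def collect(bus: queue.Queue) -> dict[str, dict]:
    """Drain the bus and build a catalog keyed by service name."""
    catalog: dict[str, dict] = {}
    while not bus.empty():
        record = json.loads(bus.get())
        catalog[record["service"]] = record
    return catalog


bus = queue.Queue()  # stand-in for the registration topic
announce(bus, "scanner", ["build.done"], "10.0.0.7")
announce(bus, "publisher", ["scan.done"], "10.0.0.8")
for name, record in collect(bus).items():
    print(name, "listens to", record["listens_to"], "at", record["address"])
```

Note that the `address` values here have exactly the weakness the talk points out: inside Kubernetes they are cluster-internal and say nothing about where the cluster itself lives.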
Brett Smith 10:03
What we could do is use this thing called the Downward API. For Kubernetes, this means exposing the pod information to the containers through environment variables. So in the manifest, there are a couple of environment variables, and they surface lower-level Kubernetes information that tells us which cluster it's running on and where that cluster might be. And then we get farther down the chain. And I know what you're thinking. You're like, well, if the developers are doing the deployments, how do you get them to add in this Downward API? Well, the trick is that you put it in the example manifest, and you don't mark it as something they need to change. And I guarantee you, the developers will deploy it just as it is, and you'll get your Downward API. So it is what it is.
All right. So the next thought I had is: what if we remove out-of-band human checks? Every time I come around as security lead and implement a bunch of security, developers get bogged down. It makes it harder for them to get their job done, and we've made their job more complex. So what if we shift everything as far left as we can, and then automate everything else to the right? This keeps the dev experience where I want it, which is: git pull, hack, hack, hack, git push, create a review, get two people to review it, the PR passes all the CI checks, and it merges. The CI checks are right here in the quality and scan stage, right before the source code gets merged. So if we get to that point, the devs don't care what happens after they've fixed everything in the pull request. And then, how do you handle vulnerabilities in that case? So, you know, we have some vulnerabilities, we fix them.
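Going back to the Downward API idea above, here is a hedged Python sketch of what a service could do with those environment variables at startup. The variable names (`POD_NAME`, `POD_NAMESPACE`, `NODE_NAME`, `CLUSTER_NAME`) are assumptions: the first three would be wired in the manifest with `fieldRef` entries such as `metadata.name`, `metadata.namespace`, and `spec.nodeName`, while the cluster name is not something the Downward API supplies by itself, so it is assumed to be set as a plain value in the example manifest.

```python
import os
from typing import Mapping


def pod_identity(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect placement details exposed to the container as environment variables."""
    variables = {
        "pod": "POD_NAME",             # assumed: fieldRef metadata.name
        "namespace": "POD_NAMESPACE",  # assumed: fieldRef metadata.namespace
        "node": "NODE_NAME",           # assumed: fieldRef spec.nodeName
        "cluster": "CLUSTER_NAME",     # assumed: plain value set in the manifest
    }
    return {field: env.get(var, "unknown") for field, var in variables.items()}


# A service could attach this to its self-registration message.
print(pod_identity({"POD_NAME": "scanner-7d9f", "NODE_NAME": "node-3"}))
```

Falling back to `"unknown"` rather than failing keeps the service deployable even when someone did edit the example manifest, while still making the gap visible in the inventory.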
We're running Snyk, so a plug for Snyk, and we have an IDE plugin so the developers can see all of their stuff in their IDE. We also have integration into the GitHub pull requests, so the GitHub pull requests won't pass until we've remediated whatever issues we run into. So the developers go and fix all these things and remediate. And then there's a time box on the remediation. The automation on the right swings back around every time and says, "Okay, your time frame to fix this is up. You either need to get another exception, or I'm going to break the build."
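The time-boxed remediation rule can be expressed as a small gate that the post-merge automation runs. This is a sketch under assumptions: the `Finding` fields and the `overdue_findings` function are hypothetical, and a real setup would pull findings and exceptions from the scanner and ticketing system rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Finding:
    issue_id: str
    remediate_by: date                     # end of the remediation time box
    exception_until: Optional[date] = None  # set when an exception was granted


def overdue_findings(findings: list[Finding], today: date) -> list[str]:
    """Return IDs of findings whose time box and any exception have both lapsed."""
    overdue = []
    for finding in findings:
        deadline = max(finding.remediate_by, finding.exception_until or finding.remediate_by)
        if today > deadline:
            overdue.append(finding.issue_id)
    return overdue


findings = [
    Finding("VULN-1", remediate_by=date(2024, 1, 10)),
    Finding("VULN-2", remediate_by=date(2024, 1, 10), exception_until=date(2024, 2, 10)),
]
blocked = overdue_findings(findings, today=date(2024, 1, 20))
print("break the build" if blocked else "ok", blocked)  # prints: break the build ['VULN-1']
```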
Brett Smith 13:00
Need a clicker? Go back, I clicked too far. So as we shift left, we catch the problems earlier in the SDLC. Scanning and code quality issues are remediated in the IDE, and code merge is blocked until the quality and security standards are met. It keeps the dev in the IDE; they're not off in the web browser and then back in the IDE. Brian's talk before this made me think about that, how he could get so much done and never leave his IDE. And then after the merge, the robots are the ones doing the due diligence and producing your audit trail. It keeps the developers working on the code, and the robots keep up the compliance and security as best they can. And then we automate provenance on the right side, in the pipeline, so it can't be falsified or tampered with.
Okay, all this shift left, this compliance, is great. But we still have attack vectors. If this doesn't keep you up at night, it should. And compliance doesn't equal security. Just because we're compliant doesn't mean we don't have these attack vectors. It just means that we have compliance. Compliance can help, but closing the attack vectors should be priority number one. Monitoring is great. We should do that. However, verification of environment, materials, process, and artifacts before running the actions increases security. And look, I'm not telling you not to monitor. Monitoring and auditing are a big part of the EO 14028 focus, but closing attack vectors should be priority number one.
Okay, attestations. You've been hearing me talk about provenance. Provenance and attestations, we kind of interchange them a little bit. It's all along the same idea. So what if we made SBOMs? SBOMs are good; we should do that. In the end, the SBOM is just a list of ingredients. At some point in the supply chain, we must be able to vouch for ourselves. So we have to create an attestation.
The way to create an attestation that is non-falsifiable is to cryptographically sign it, so your assets can be verified. And then we automate the provenance so that it can't be falsified either.
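As a rough illustration of signing and verifying an attestation, here is a self-contained Python sketch. It uses an HMAC with a shared key from the standard library purely so the example runs without extra dependencies; that is a stand-in, not what a real supply chain would use. Production systems typically use asymmetric signatures so verifiers never hold the signing key. The statement fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_attestation(statement: dict, key: bytes) -> dict:
    """Wrap a statement with a signature over its canonical JSON form."""
    payload = json.dumps(statement, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"statement": statement, "signature": signature}


def verify_attestation(envelope: dict, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    payload = json.dumps(envelope["statement"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])


key = b"demo-only-key"  # assumption: a real pipeline would use managed signing keys
envelope = sign_attestation({"artifact": "pkg-1.0.rpm", "builder": "pipeline-a"}, key)
print(verify_attestation(envelope, key))

envelope["statement"]["builder"] = "someone-else"  # tampering after signing
print(verify_attestation(envelope, key))
```

Serializing with `sort_keys=True` matters: the signer and verifier must hash exactly the same bytes, so the statement needs a canonical form.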
Brett Smith 15:30
Let me give you an example of what an attestation would look like. This is a real-world example, not a YAML example. So you're going to get lab work done with your doctor, right? You've got the environment, and the environment is what lab the test is taking place in and on what date. The process is what machine is performing the test and what the configuration details of that test are. The materials are whose blood is going into the test. This is very important: is the chain of custody verified? And the artifact is the output of the test, which is the test results. The blood test results should be verifiable and traceable.
So what if we teach the robots to verify the environment, the materials, the process, and the artifacts before running the actions? By verifying everything we've already mentioned, we're going to increase security. Verification tools validate the signed provenance and verify metadata, including verification of the source repository used to build the binary. That is a lot of verification. But verification allows us to answer an important question: is that an electric sheep, or is it a wolf? Rachael was a replicant. How did Deckard miss that?
All right, so in the end, we want to build some automation to tend to our electric sheep. What does that look like? The robot gets the signed provenance and validates the signature of the provenance. Then it uses the SHA sums from the provenance to validate what it's about to execute. If the rules for this particular build are make, make install, then I want to make sure that I'm running make and make install, and not make, make install, make foo. Then it uses a different piece of provenance to validate that the tools are what they say they are. If I'm using version 1.0.0 of tool foo, I want to make sure that I didn't get 2.0.0 of tool foo.
It validates that the instructions are what they're supposed to be. So depending on what type of build scripts you're using, it goes through and says, "Okay, you're going to run this step, this step, and this step." We SHA all that, and it gets put in the provenance. Then when you go to run it, you pull the provenance and go, "Okay, I'm running this step and this step. Is that what we're doing here?" And then in the end, it validates the resulting artifact and creates signed provenance for it. So now you've got this artifact that you've got signed provenance for, so that when you go and download that artifact to use it somewhere else, you can grab its signature and know that it's what it's supposed to be, and that it was built in a process that you trust.
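The pre-execution check described above (SHA the build steps, pin the tool versions, compare both against the provenance before running anything) might look roughly like this in Python. The provenance layout (`steps_sha256`, `tools`) and the function names are assumptions made for the sketch, and it presumes the signature on the provenance has already been validated.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_before_run(provenance: dict, steps: list[str], tools: dict[str, str]) -> list[str]:
    """Return a list of mismatches; an empty list means the build may proceed."""
    problems = []
    if sha256_hex("\n".join(steps).encode()) != provenance["steps_sha256"]:
        problems.append("build steps differ from provenance")
    for name, version in provenance["tools"].items():
        found = tools.get(name)
        if found != version:
            problems.append(f"tool {name}: expected {version}, found {found}")
    return problems


# Provenance recorded when the build definition was approved (hypothetical layout).
approved_steps = ["make", "make install"]
provenance = {
    "steps_sha256": sha256_hex("\n".join(approved_steps).encode()),
    "tools": {"foo": "1.0.0"},
}

print(verify_before_run(provenance, ["make", "make install"], {"foo": "1.0.0"}))
print(verify_before_run(provenance, ["make", "make install", "make foo"], {"foo": "2.0.0"}))
```

The first call returns an empty list. The second reports both the extra `make foo` step and the unexpected tool version, which is the point at which the robot should refuse to run.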
Brett Smith 18:45
So the next trick is that you use the pipeline to build the pipeline. In the past, we might have been very worried about the product we were delivering to the customer, and not as worried about the pipeline that builds the product we deliver to the customer. But now we're very worried about that. So you use the supply chain to secure the supply chain, and none of your infrastructure should be built anywhere other than this one pipeline. Now, there's another thing I want to get into, but it's probably a presentation for another time, and that's deterministic builds using two different pipelines. The idea is that I've got pipeline A and pipeline B, and they are different mechanisms. We build the same source code through both, and at the end we test the binaries. If they're identical, then we're fine, and nobody's tampered with anything. So zero trust: it's coming. Securing your supply chain is not just scanning the code and compliance. It is implementing zero trust, and you want to follow the principle of never trust, always verify. And the verification steps we talked about are also required for zero trust. I got through that really fast. Let's see, you want to pop back in? Because I'm sweating, and I'm afraid of robots.
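The two-pipeline comparison mentioned above reduces to comparing digests of the artifacts each pipeline produced. A minimal Python sketch, with hypothetical artifact names and in-memory bytes standing in for real build outputs:

```python
import hashlib


def digest(artifact: bytes) -> str:
    return hashlib.sha256(artifact).hexdigest()


def compare_pipelines(outputs_a: dict[str, bytes], outputs_b: dict[str, bytes]) -> list[str]:
    """Return names of artifacts that are missing from one side or differ in content."""
    mismatched = []
    for name in sorted(set(outputs_a) | set(outputs_b)):
        a, b = outputs_a.get(name), outputs_b.get(name)
        if a is None or b is None or digest(a) != digest(b):
            mismatched.append(name)
    return mismatched


pipeline_a = {"app.bin": b"\x7fELF-demo", "lib.so": b"shared-object"}
pipeline_b = {"app.bin": b"\x7fELF-demo", "lib.so": b"shared-object-with-extra"}
print(compare_pipelines(pipeline_a, pipeline_b))  # prints: ['lib.so']
```

This only works if the builds are deterministic in the first place: embedded timestamps or unordered file listings will produce mismatches that have nothing to do with tampering.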
Melissa McKay 20:10
I was gonna ask you about that. Yeah. And I think we'll just jump into that, because I think you have a good story about this. So why are you afraid of robots, Brett?
Brett Smith 20:25
They run amok. Do you know how easy it is to get a robot to run amok? So we're building out this event-driven pipeline, right? It's early, and we've got our section, the infrastructure of it, kind of figured out and built out. The idea is that you post a receipt to the service, and then the service creates an event. Then everything downstream listening to that event uses the data that you posted in the receipt to go do stuff. So we had an early adopter, and we told them that they were going to build this thing to watch for these events. It should do something and then post a receipt, so that it could, you know, report on what it did. And this particular thing was a security thing. Well, we were early and we didn't think about guardrails. So this thing managed to fork bomb itself into a 97-megabyte receipt. It was posting a receipt, the service was creating an event, and it was listening to the same event and firing itself again. The original message was, you know, a couple hundred kilobytes, and it made it to 97 megabytes before it keeled over and died. This led to us putting in a one-megabyte limit, because Kafka at the time had a one-megabyte limit. So we put a one-megabyte limit on the service, and we went and wrote a bunch of code into our base library that made sure you weren't allowed to listen to the event you were creating. That one was pretty awesome. Yeah. And it happened more than once, which, you know, that's how it goes. Security company.
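The two guardrails from this story (a size cap on receipts, and refusing to let a service listen to the event it creates) can be sketched as follows. The function names and event names are hypothetical, and the exact byte count behind "one megabyte" is an assumption.

```python
MAX_RECEIPT_BYTES = 1_000_000  # assumed byte count for the "one megabyte" limit


class GuardrailError(ValueError):
    """Raised when a service or receipt violates a pipeline guardrail."""


def check_subscriptions(emits: set[str], listens_to: set[str]) -> None:
    """Refuse a service that listens to an event type it also creates."""
    loops = emits & listens_to
    if loops:
        raise GuardrailError(f"service would trigger itself via: {sorted(loops)}")


def check_receipt(payload: bytes) -> None:
    """Refuse a receipt larger than the configured limit."""
    if len(payload) > MAX_RECEIPT_BYTES:
        raise GuardrailError(f"receipt is {len(payload)} bytes, limit is {MAX_RECEIPT_BYTES}")


check_subscriptions(emits={"scan.done"}, listens_to={"build.done"})  # fine
try:
    check_subscriptions(emits={"scan.done"}, listens_to={"scan.done"})
except GuardrailError as err:
    print("rejected:", err)
```

Checking subscriptions at registration time stops the feedback loop before the first event fires, while the size cap is the backstop for loops that span more than one service.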
Melissa McKay 22:20
So easy for that to happen.
Brett Smith 22:24
Yeah, and this was four years ago, so you couldn't just Google on the internet what we were up to. You know, Kafka was mainly a streaming service, and we were kind of cheating and using it as a message bus. So, interesting. Yeah. What was the other story I'm supposed to tell you?
Melissa McKay 22:49
Um, that was the main one, right? But I did have a couple of other questions for you, though. Why all this push for GitOps?
Brett Smith 23:05
Why all the push for GitOps? That wasn't the main one? Why the push for GitOps? Because... okay, here's an example. I was on a previous team, and they did all of their ADRs. We know what ADRs are, right? They're architectural decision records. These are like a point in time. I say, Melissa, you and I are going to put together this event-driven system, and it's going to look like this. And you're like, well, we need to make sure we talk to the Kafka bus this way. And I go, okay, I'm with you, the messages have to look this way. We go write an ADR, and me, you, and everyone else in agreement are at the top of the ADR. And we commit it to the Git repo. Well, in my case, I can't be getting rid of it. But we say, okay, this is the ADR. So on October 31, we made this horribly stupid decision. Next year, we come around and we're like, oh my God, who made that decision? And then we go back to the ADR and there's Melissa and Brett. They don't make good decisions anyway. So that's if you check it into Git.
Okay, so say instead we put that in Confluence. We forget to lock it down. Anyone can come along and hack it, change it. We could accidentally delete a section of it and push an update right afterwards. And yes, Confluence has a rollback and all that other stuff. But two weeks from now, the powers that be are going to swing in and say, we're not paying for Confluence anymore, you're going back to Wikipedia. And so now you've got this document only in Confluence, in Confluence markdown, that you've got to go convert over ASAP. And you've got hundreds of those documents, or thousands like I do. If they're in Git, I've got the blame trail, and all I have to do is write something to translate them from whatever language I decided to write them in. And look, I'm old. I'm old school; I know a lot about old-style doc stuff. We can talk LaTeX if we want to.
But we did Markdown, because it renders really well on GitHub. So now you've got a renderer for your documentation, you've got an approval system, and you've got a Git history of what's been changed in your documentation and when. And if I want to move from GitHub to GitLab, it's easy. So that's pretty much the reasoning behind the documentation. I mean, the IaC explains itself. Infrastructure as code: what day did I deploy this? What day did the manifest change, and who's the knucklehead who put this Docker container in that manifest? Git blame says Smitty. I didn't; the robots did it. This goes back to "I'm afraid of robots." See, I didn't make that change to the manifest. We built a new image, the event system caught that build receipt, went and updated the manifest, pushed it, and Argo CD caught it and deployed it. I knew I was supposed to put tests in there before I did that.
So the idea here is that you can't hurt yourself with GitOps if you let the robots do the work. Because when you do git blame, it's robot one, robot two, robot three. You need unique identifiers for the robots. And you need to be careful with mixing Git. How do I like to put it? I don't like robots checking stuff in to repos run by humans. You keep the robot repo and the human repo separate. Then I can go back to the human repo and go, "Oh, this is why we changed this part of the deployment." And with the other one it's, "Oh, well, this robot went and just deployed that because we built it. We just need to go build it again and fix whatever was broken, and it goes out again." It's real CI/CD, right? It's like, oh, well, just go build it again. Or fix the bug and build it again. But Git ties back into all that, because it's your audit trail. It's your free audit trail.
That's pretty much why everything should be GitOps.
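One way to check the "keep robot repos and human repos separate" rule is to look at who has committed to a repository. The sketch below assumes robots are identifiable by a naming convention (a hypothetical `robot-` prefix) and takes a plain list of author names, which could be gathered with something like `git log --format=%an`.

```python
ROBOT_PREFIX = "robot-"  # hypothetical naming convention for automation identities


def is_robot(author: str) -> bool:
    return author.startswith(ROBOT_PREFIX)


def mixed_authorship(authors: list[str]) -> bool:
    """True when a repository's history mixes robot and human committers."""
    kinds = {is_robot(author) for author in authors}
    return len(kinds) > 1


print(mixed_authorship(["robot-1", "robot-2", "robot-3"]))  # a robot-only repo
print(mixed_authorship(["smitty", "robot-1"]))              # a mixed repo worth flagging
```

Giving each robot its own identity, as the talk recommends, is what makes this check and the blame trail meaningful.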
Melissa McKay 27:38
You have a lot of experience. One thing I appreciate about you, getting to know you, is that you've had a lot of trial and error. You've had quite a journey getting all of this stuff in order, making it more secure, and making sure your ops is getting along with your devs too. That was one thing I wanted to touch on. But before we get to that, I do want to ask you about SLSA, because you brought up SLSA at the beginning of your talk, and you were talking about the different versions. We've had discussions about this, and I just want to know your feelings. I think this is a good venue for it. There were changes between 0.1 and 1.0. Tell us what changed and how you feel about it.
Brett Smith 28:28
So when the SLSA spec came out... I've got a camera that follows my head. It's pretty cool. So when I wobble when we're talking, it follows me. Anyway, when the SLSA spec dropped, I was like, oh, this is the answer. Because version 0.1 covered almost everything that the EO, the executive order, covered, and it was like a set of guidelines for it. You had the source section, you had the build section, you had the provenance section. I mean, it had all the sections, and it took care of everything you needed to do. If you got to SLSA level four, you were pretty sure you had done everything you could do to lock down your pipeline. I'm not gonna lie, it was daunting. I looked at it and I was like, well, where are we? And I could go in and get us to SLSA level one. There were four levels at that time. And I was like, okay, we're gonna shoot for SLSA level four, we're gonna put a plan together, and we're going to lock down everything. We're going to have hermetic builds and signed provenance. The source code is going to be locked down, where you have two-person reviews on every piece of code, and we're going to make sure that the owner of the code also does a review of any pull request. He could be one of the two, or she could be one of the two, or they could be one of the two. Somebody has to be one of the two, as long as it's not a robot. You don't want robots checking your code, because we saw what happened to Brian, right?
So it was a beautiful thing. I went and wrote up all this stuff on it, made these presentations, and we started working on it. And I think what happened is this. Google originally dropped the spec, and I've talked to some people from Google that know, and they're like, yeah, no, our internal rules make SLSA look trivial. I mean, they're way, way more strict about everything.
And I think what happened is that as the community started to take over, and Google started to drop back a little bit, the community was like, we'll never get this done. I'm torn between the source code and the build section, but I think the build section is the most important section of the SLSA doc; a lot of the rest of it is sysadmin 101. So they focused in on that, and they dropped the source piece, and they dropped the other pieces. So it's pretty much just the build environment. And then they dropped down to only having three levels. I still think level three is close to what we were calling for at level four, but they still dropped back on some of the requirements. And I don't know exactly. I really need to go track somebody down that's in that space and ask them what they think happened. But yeah, that's how I think it went.
So that's why, for my talk, I had to go pull from ISO 27001 for source code rules. In ISO 27001, in the 27002 guidelines, they go through and tell you what you should do to lock down your source code on a least-privilege, need-to-know basis, and walk through that. It lined up pretty well with the SLSA spec, as best it could. So I had to improvise there. Hopefully they get back to it. I mean, it says TBD on all those things. So I'm assuming the community will come back and improve on it and get back to it. You should go figure out how we can go contribute to it. Because
Melissa McKay 32:22
Yeah, I think getting more input from the community is also helpful, for sure. And especially from someone like you who's been in the middle of it, trying to implement all of these things. I know that's an overwhelming process starting from zero. So just having some goals that are outlined to get to is good. And I'm very sad that we, as human beings, have to feel pain before we make changes. Unfortunately, it feels like in a lot of cases there are a lot of security measures that most of us are deficient in, that we are only now looking at and trying to improve.
Brett Smith 33:04
So when I was at All Things Open, you were there. I saw a talk on using AI to learn Python, and it was a wonderful talk. The thing that the presenter mentioned the most is that you need to feel some of the struggle, so that the accomplishment feels good. So it shouldn't be so easy. But I would sleep better at night if it was just easy.
Melissa McKay 33:37
Can you pinpoint the most difficult part for you in everything you've done so far? What was the most difficult piece? And, something I was alluding to earlier: did you have any problems with getting folks on board with the changes that you wanted to make?
Brett Smith 33:54
How long have you been in this industry? You remember all those attack vectors I showed on that slide? Yeah, those are multiplied when you have developers. So my famous saying is: if you have developers, you have attack vectors. It's hard. This is why we worked so hard on the shift left. Take how Brian was working, where he just never had to leave the IDE and he could stay focused. The problem is, it's not like that when you first start out. The developers are already trying to get feature A and feature B out, and then you come running and tell them that the piece they released two months ago has five CVEs, go update it. Then as they start to update it, it breaks the code, and they have to go and mess with the code. So they lose time to this CVE on something they thought was already done. That makes them not want to do this, fixing this stuff.
So that's been the hard part: getting developers on board with understanding that security is their problem as well as mine. And I should not use the word problem. Security is everyone's responsibility as a company, because I like revenue, I know you like revenue, I like my bonus. The more secure we are, the better chance we have of getting our bonuses, because it's just that one thing that's going to get us nicked up, and it's bad for the company, it's bad for reputation. And it could be really bad for the customer, and my customers are government agencies and banks, so we don't really want that to happen with our customers. Anyway, getting that message across has taken a lot of work. But making it easier to adopt makes it easier to get it going. We automated a lot of our security checks. On the right side, we run the scan automatically after the build.
And then we will create JIRA issues automatically if we have things that need to be addressed. And then we sort out things like the base image. If the base image has vulnerabilities, and we create JIRA issues, and they get assigned to the guy who owns them, the person who owns the base image, not the other developers. And so that was pretty interesting because we didn’t do that at first and we made a mess. And not being not being careful with who we assigned things to was really really heavy. We’re on the better side of the torch moms are down the pitchforks have been pushed down. Better from that.
Melissa McKay 36:54
You’re all on the same team again. I’m trying to do the right thing. But they’re
Brett Smith 36:59
not trying to chase us down the hill anymore.