diff --git a/src/pages/blog/std-action.md b/src/pages/blog/std-action.md new file mode 100644 index 0000000..aa8510f --- /dev/null +++ b/src/pages/blog/std-action.md @@ -0,0 +1,178 @@ +--- +layout: $/layouts/post.astro +title: Standard Action +description: Do it once, do it right. +tags: + - std + - nix + - devops + - GitHub + - actions +author: Tim D +authorGithub: nrdxp +date: 2022-12-09 +--- + +## CI Should be Simple + +As promised in the [last post](./std), I'd like to expand a bit more on what we've +been working on recently concerning Nix & Standard in CI. + +At work, our current GH action setup is rather _ad hoc_, and the challenge of optimizing that path +around Nix’s strengths lay largely untapped for nearly a year now. Standard has helped somewhat +to get things organized, but there has been a ton of room for improvement in the way tasks are +scheduled and executed in CI. + +[Standard Action][action] is our answer. We have taken the last several months of brainstorming +off and on as time allows, experimenting to find a path that is versatile enough to be useful +in the general case, yet powerful enough for organizations who need extra capacity. So without +any further stalling, let's get into it! + +## The Gist + +The goal is simple, we want a CI system that only does work once and shares the result from there. +If it has been built or evaled before, then we want to share the results from the previous run +rather than start from scratch. + +It is also useful to have some kind of metadata about our actions, which we can use to build +matrices of task runners to accomplish our goals. This also allows us to schedule builds on +multiple OS trivially, for example. + +Task runners shouldn't have to care about Nix evaluation at all, they should just be able to get +to work doing whatever they need to do. If they have access to already reified derivations, they +can do that. + +So how can we accomplish this? Isolate the evaluation to its own dedicated "discovery" phase, and +share the resulting /nix/store and a json list describing each task and its target derivations. + +From there it's just a matter of opimizing the details based on your usecase, and to that end we +have a few optional inputs for things like caching and remote building, if you are so inclined. + +But you can do everything straight on the runner too, if you just need the basics. + +## How it Works + +Talking is fine, but code is better. To that end, feel free to take a look at my own personal CI +for my NixOS system and related packages: [nrdxp/nrdos/ci.yml][nrdos]. + +What is actually evaluated during the discovery phase is determined directly in the +[flake.nix][ci-api]. + +I am not doing anything fancy here at the moment, just some basic package builds, but that is +enough to illustrate what's happening. You can get a quick visual by look at the summary of +a given run: [nrdxp/nrdos#3644114900](https://github.com/nrdxp/nrdos/actions/runs/3644114900). + +You could have any number of matrices here, one for publishing OCI images, one for publishing +documentation, one for running deployments against a target environment, etc, etc. + +Notice in this particular example that CI exited in 2 minutes. That's because everything +represented by these builds is already cached in the specified action input `cache`, so no work is +required, we simply report that the artifacts already exist and exit quickly. + +This is partially enabled by use of the GH action cache. The cache key is set using the following +format: [divnix/std-action/discover/action.yml#key][key]. Coupled with the guarantees nix already +gives us, this is enough to ensure the evaluation will only be used on runners using a matching OS, +on a matching architecture and the exact revision of the current run. + +This is critical for runners to ensure they get an exact cache hit on start, that way they pick +up where the discovery job left off and begin their build work immediately, acting directly +on their target derivation file instead of doing any more evaluation. + +## Caching & Remote Builds + +Caching is also a first class citizen, and even in the event that a given task fails (even +discovery itself), any of its nix dependencies built during the process leading up to that failure +will be cached making sure no nix build _or_ evaluation is ever repeated. The user doesn't have +to set a cache, but if they do, they can be rest assured their results will be well cached, we +make a point to cache the entire build time closure, and not just the runtime closure, which is +important for active developement in projects using a shared cache. + +The builds themselves can also be handed off to a more powerful dedicated remote builder. The +action handles remote builds using the newer and more efficient remote store build API, and when +coupled with a special purpose service such as [nixbuild.net](https://nixbuild.net), which your +author is already doing, it becomes incredibly powerful. + +To get started, you can run all your builds directly on the action runner, and if that becomes +a burden, there is a solid path available if and when you need to split out your build phase to a +dedicated build farm. + +## Import from What? + +This next part is a bit of an aside, so feel free to skip, but the process outlined above just so +happened to solve an otherwise expensive problem for us at work, outlining how thinking through +these problems carefully has helped us improve our process. + +IOG in general is a bit unique in the Nix community as one of the few heavy users of Nix’s IFD +feature via our [haskell.nix][haskell] project. For those unaware, IFD stands for +"import from derivation" and happens any time the contents of some file from one derivations output +path is read into another during evaluation, say to read a lock file and generate fetch actions. + +This gives us great power, but comes at a cost, since the evaluator has to stop and build the +referenced path if it does not already exist in order to be able to read from it. + +For this reason, this feature is banned from inclusion in nixpkgs, and so the tooling used there +(Hydra, _et al._) is not necessarily a good fit for projects that do make use of IFD to some extent. + +So what can be done? Many folks would love to improve the performance of the evaluator itself, your +author included. The current Nix evaluator is single threaded, so there is plenty of room for +splitting this burden across threads, and especially in the case of IFD, it could theoretically +speed things up a great deal. + +However, improving the evaluator performance itself is actually a bit of a red herring as far as +we are concerned here. What we really want to ensure is that we never pay the cost of any given Nix +workload more than once, no matter how long it takes. Then we can ensure we are only ever +building on what has already been done; an additive process if you will. Without careful +consideration of this principle beforehand, even a well optimized evaluator would be wasting cycles +doing the same evals over and over. There is the nix flake evalulation cache, but it comes with +a few [caveats][4279] on its own and so doesn't currently solve our problem either. + +To give you some numbers, to run a fresh eval of my current project at work takes 35 minutes from a +clean /nix/store, but with a popullated /nix/store from a previous run it takes only 2.5 minutes. +Some of the savings is eaten up by data transfer and compression, but the net savings are still +massive. + +I have already begun brainstorming ways we could elimnate that transfer cost entirely by introducing +an optional, dedicated [evaluation store](https://github.com/divnix/std-action/issues/10) for those +who would benefit from it. With that, there is no transfer cost at all during discovery, and the +individual task runners only have to pull the derivations for their particular task, instead of the +entire /nix/store produced by discovery, saving a ton of time in our case. + +Either way, this is a special case optimization, and for those who are content to stick with the +default of using the action cache to share evaluation results, it should more than suffice in the +majority of cases. + +## Wrap Up + +So essentially, we make due with what we have in terms of eval performance, focus on ensuring we +never do the same work twice, and if breakthroughs are made in the Nix evaluator upstream at some +point in the future, great, but we don't have to wait around for it, we can minimize our burden +right now by thinking smart. After all, we are not doing Nix evaluations just for the sake of it, +but to get meaningful work done, and doing new and interesting work is always better than repeating +old tasks because we failed to strategize correctly. + +If we do ever need to migrate to a more complex CI system, these principles themeselves are all +encapsulated in a few fairly minimal shell scripts and could probably be ported to other +systems without incredible effort. Feel free to take a look at the source to see what's really +goin on: [divnix/std-action](https://github.com/divnix/std-action). + +There are some places where we could use some [help][7437] from [upstream][2946], but even then, the +process is efficient enough to be a massive improvement, both for my own personal setup, and for +work. + +As I mentioned in the previous post though, Standard isn't just about convenience or performance, +but arguable the most important aspect is to assist us in being _thorough_. To ensure all +our tasks are run, all our artifacts are cached and all our images are published is no small feat +without something like Standard to help us automate away the tedium, and thank goodness for that. + +For comments or questions, please feel free to drop by the official Standard [Matrix Room][matrix] +as well to track progress as it comes in. Until next time... + +[action]: https://github.com/divnix/std-action +[haskell]: https://github.com/input-output-hk/haskell.nix +[nrdos]: https://github.com/nrdxp/nrdos/blob/master/.github/workflows/ci.yml +[key]: https://github.com/divnix/std-action/blob/6ed23356cab30bd5c1d957d45404c2accb70e4bd/discover/action.yml#L37 +[7437]: https://github.com/NixOS/nix/issues/7437 +[3946]: https://github.com/NixOS/nix/issues/3946#issuecomment-1344612074 +[4279]: https://github.com/NixOS/nix/issues/4279#issuecomment-1343723345 +[matrix]: https://matrix.to/#/#std-nix:matrix.org +[ci-api]: https://github.com/nrdxp/nrdos/blob/66149ed7fdb4d4d282cfe798c138cb1745bef008/flake.nix#L66-L68