Files
sukr/content/blog/atom-anatomy.md
Timothy DeHerrera 47f0876fcd fix: simplify wording
2025-05-18 14:22:44 -06:00

30 KiB
Raw Blame History

title, description, taxonomies, date, extra, draft
title description taxonomies date extra draft
Anatomy of an Atom A Solid & Natural Foundation
tags
opensource
ekala
nixos
2025-05-16
read_time repo_view
true true
true

At last, Im diving into the technical nitty-gritty of my own contributions to the Ekala project, a vision Ive long teased but, regrettably, delayed. Ive already shared a high-level overview of its ambitious vision, but to bring any grand idea to life, it must be broken down into manageable pieces. Ekala started as a thought experiment; an exploration of how an ideal architecture for a Nix-like build system might be structured. What emerged convinced me to pursue a path I feel compelled to follow, despite efforts to steer me otherwise.

In the spirit of honesty, I see the growing complexity in the Nix ecosystem, both technical and social, as a serious threat to its future. As a holistic thinker unafraid to cross boundaries others might avoid, I believe the communitys rising tensions (however well-masked) and the lack of clear technical leadership are deeply intertwined, not merely coincidental.

Let me be clear: this isnt an indictment of any specific person or faction caught up in the drama. I admire the resolve to chase ones convictions, even if I find their foundations flimsier than my own.

Still, by now I hope my readers see that I view the push for political alignment in open-source—however well-intentioned—as unhealthy and utterly at odds with its founding ideals. Ive explored this intimately and technically elsewhere, so Ill keep it brief: I see the hijacking of community goodwill for extreme political agendas that undermine our technical goals—whether in Ekala, Nix, or open-source broadly—as a direct assault on our ethos.

In all this, Ive reached a difficult but reasoned conclusion: we must not support, let alone empower, individuals or institutions that promote or passively tolerate such agendas if open-source is to remain a vibrant force, not a hollow shell of its former self. If my firm stance feels unacceptable, dear reader, feel free to step away—Ill think no less of you.

And if this seems off-topic, forgive me, but I feel compelled to restate my position briefly given the current landscape. Curious why? My linked pieces and earlier writings justify my growing resolve. With that said, Im grateful for the patience of those whove stuck with me. Life has taught me that sometimes the only path forward is one you carve yourself. As Ive noted, personal and social upheaval over the past year has pushed me down an unexpected road. Though my reserved nature makes me hesitant to share too many personal details, and despite the stress its caused, Ive laid the groundwork for what lies ahead, and Ill gladly walk this path—twists and all—as long as Im able.

To the point: with my philosophical footing now secure and my conscience clear, Im ready to unpack the technical details unencumbered. I may have been overly optimistic about timelines at first, blindsided by one of the toughest years Ive faced. Now though, with a clearer perspective, Im aiming for a 6-to-12-month horizon for Atom as I recharge and press on. Since Atom, itself, is the foundational component of my overall vision within Ekala, let's begin this new technical blog series with a thorough exposition of it, shall we?

Atom: A Review

If youve followed my previous writings or poked around in my code, you might already have a rough sense of the foundational format Im championing: the Atom. To keep this piece standalone, though, lets recap its high-level design and the driving force behind it before we go deep. The silver lining to the long gap between iterations? My ability to explain my designs after months of stewing has, hopefully, gotten a lot sharper.

The Runaway Train: A Motivation

Conventional wisdom in tech projects says that once you hit a certain scale, foundational overhauls are a bad idea: iterative tweaks are the safer play. But every now and then, the existing setup is so broken that it starts threatening the projects very existence. When that happens, a radical rethink isnt just an option; its a necessity.

Ive spent nearly a decade with Nix, half of that in professional gigs, and Ive watched the same problems rear their heads as organizations scale up their Nix usage. I wont bore you with the gory details; anyone whos made a non-trivial contribution to nixpkgs knows the pain all too well, and if you are really curious, there is plenty of evidence all over the internet, by now. The real kicker? These issues dont seem fixable without rethinking the core idioms we use to write and, especially, organize Nix code.

As projects like nixpkgs balloon to massive scale, the cracks only get worse. Long-standing social drama has some folks burying their heads in the sand, or dipping out entirely. Others might lack the experience to see the train wreck coming. Some are too tied to the status quo to budge, while others, like the teams behind snix, the promising early-stage cab language, and our own ekapkgs, are stepping up with bold efforts to tackle the mess.

Im rooting for those projects to succeed; their technical vision lines up closely with my own take on the challenges. My original plan was to pitch in and support them, aiming to complement their work rather than reinvent the wheel. But along the way, I stumbled onto what I now see as a glaring gap in the ecosystem: one that has to be filled if were going to solve these scaling issues at their root.

Theres an irony in Nix. Its a domain-specific language (DSL) meticulously crafted to deliver binary software packages with precision and discipline, yet it barely considers packaging its own expressions in a similar way. To avoid confusion—since “package” is quite an overloaded term—we're referring here to source code distribution packages. Think package.json or Cargo.toml: formats that bundle source code into clean, discrete units for easy distribution and downstream use.

Since Im a Rust enthusiast, lets use it to illustrate. In Rust, a repository might house a workspace with dozens, maybe even hundreds, of crates: self-contained package units. When its time to publish, each crate gets neatly bundled and shipped to a central registry. If I need crate a from a larger workspace P, I can grab just a from this registry, no extra baggage from P included. Later, if I need a newer version of a, its a simply another pull from the registry; only the files for a, nothing more.

Now contrast that with nixpkgs. Want package a? Youre stuck pulling the entire repository just to evaluate it. Sure, as dependencies get fetched in the process, but most of the code youre downloading has nothing to do with a. Need a different version of a down the line? Youre fetching another full nixpkgs checkout, with another chunk of totally irrelevant code. Its not hard to see how this spirals out of control. Its not sustainable.

Like any well-designed language ecosystem, we should have a straightforward way to grab only the Nix expressions we need, with their dependencies pulled in piecemeal—no more, no less. Its not just about efficiency; its just as much about maintaining sanity.

The Unique Challenge of Packaging Nix Code

Of course, this feels like it should be obvious, especially after years in the trenches. When flakes came along, I was hopeful theyd crack this nut. Spoiler: they didnt. In fact, they sometimes make it worse, though I wont dive too deep into that here. The core issue is that flakes still require you to fetch the full repository context to evaluate them, which kills any chance of packaging smaller expressions. Even if you split your repo into multiple flakes—not a trivial task—youre still dragging in the whole repo for each subflake. Its the same mess, just rearranged.

The real problem, delivering small, versioned units of Nix code efficiently, has barely been touched. Some folks dedicate tiny repos to a single package, but thats rare. Flakes encourage tying Nix to project source, and nixpkgs itself is a sprawling monolith. The repository boundary is just too coarse. We need something finer-grained, like a single Rust crate in a workspace, to have any hope of taming the challenge of distributing only the Nix expressions needed for building binaries.

This isnt a simple fix, though. Nix is source-first by design; it needs access to source code to evaluate expressions and build packages. That tight, cryptographically secure link between expressions and their source repo is one of Nixs biggest strengths. Slapping on a centralized registry model, like other languages use, would shred that advantage.

So, we need a novel approach; one that doesnt sacrifice what makes Nix powerful. I experimented with tools like josh proxy, which seemed promising but couldnt handle nixpkgs scale. It became clear theres no off-the-shelf solution for this, not at the size were dealing with. What I needed was a system that:

  • Preserves the cryptographic tie between Nix expressions and their source.
  • Distributes directly from the source, staying true to Nixs ethos.
  • Adds no runtime overhead to enumerating available atoms and their versions, ensuring trivial scalability.
  • Scales efficiently across repositories of any size, letting users organize their projects based on preference, not constraints.
  • Delivers only relevant, versioned code in a way thats simple to understand and use.

Atomic Anatomy

In the last section, I introduced the motivation behind atom—a foundational format to rethink how we package and distribute Nix expressions. Driven by the escalating complexity of nixpkgs and the Nix ecosystems scaling woes, I argued that a first-principles overhaul is critical to avoid a maintenance nightmare. While I respect efforts like snix and cab, Ive identified a unique gap in the ecosystem that the atom aims to fill, complementing those projects with a format they could adopt in the future. Now, lets unpack the technical anatomy of an atom and see how it tackles the problem head-on.

After years with my hands deep in the code, stepping back to explain the big picture to newcomers can be tough. But to build support and drive adoption, Ive realized I need to double down on describing my work as simply as possible. So, lets start from the ground up and build from there.

A Packaging API

The atom API is deliberately generic, unbound to any specific language, ecosystem, or storage system. Think of it as a source code packaging API: a frontend defines how to package code for a given language, and a backend, termed an Ekala store, specifies where and how those atoms are stored. This flexibility isnt just elegant design—its practical, letting atom adapt to the diverse needs of different organizations.

Why such an open approach? A clear, high-level API for the atomic universe is good design, but its also about real-world utility. The Git storage backend, which Ill cover soon, aligns perfectly with the open-source ethos of transparency and redistribution. Yet some organizations prioritize privacy and security over source availability—an S3 backend, for example, could offer a centralized solution to meet those needs. This versatility ensures atom supports varied use cases while maintaining a unified user-facing API, without locking anyone into a single mold.

This openness also future-proofs the design. If atom gains traction, it could support new frontends like Guix or even Cab, or integrate with existing packaging formats. Picture “atomic” Cargo crates distributed from an Ekala Git Store—a concept Ill clarify in the next segment. While supporting existing formats isnt my focus now, it underscores the designs potential.

To ground things, lets dive into the Atom Nix frontend and Git storage backend, which are tightly linked to the motivating use case outlined earlier and the heart of current development efforts. Well begin with the latter, the lower-level storage foundation, and build up from there.

Atomic Git

As Ive outlined earlier, Nixs current code distribution mechanism has a glaring flaw. To reference a package at a specific version, you must first identify the nixpkgs checkout containing that version—a process thats neither obvious nor trivial. Need another version? Find another nixpkgs checkout. Need both simultaneously? Youre stuck fetching all of nixpkgs unrelated code twice. Anyone whos wrestled with a bloated flake.lock file has felt this pain, as Ive previously noted, so I wont belabor it here.

If youre familiar with Gits internal object format, though, you might wonder why this is even necessary. Every file and directory in Git is a content-addressed object, which, in theory, should be independently referenceable and fetchable. The issue isnt that Git cant handle this—its that Gits conventional use as a linear chain of history obscures a more elegant solution.

As mentioned, this led me to explore tools like josh proxy, hoping to filter nixpkgs history and extract specific package definitions without fetching the entire monorepo. But nixpkgs massive history overwhelmed even joshs impressive speed, and it required a non-standard Git proxy thatd need ongoing maintenance. Worse, Nix code lacks inherent boundaries, so fetched objects might reference unrelated code from elsewhere in the repo, breaking the isolation we need.

Well tackle Nixs code boundary issue when we discuss the Atom Nix frontend. For now, lets focus on leveraging Gits object structure to solve our storage woes. Git doesnt offer a straightforward API to fetch individual objects, and even if you resort to the lower level plumbing, youd need their IDs upfront—requiring a costly search through the projects history, which is essentially what tools like josh do.

For the uninitiated, Gits high-level entry point is typically a reference (e.g., a branch under refs/heads or a tag under refs/tags). References usually point to a commit or tag object, and users can list them on a Git server with a quick, lightweight request—no need to fetch object data or sift through history. The reference points to a commits hash, letting the client fetch specific objects directly. Pause for a second: this is exactly the behavior we need to fix our problem.

If we could cheaply list server-side references pointing to specific history subsections—say, a Git tree object (a directory)—without pulling the entire repo or filtering its history, wed be golden. If those references had a clear, versioned format, wed have it all: ping the server, see all available package versions, and fetch only the relevant code, no matter the repos size or history.

Thats precisely what the Ekala Git storage backend does, at a high level, but since this is a technical deep dive, lets go a little further.

# demonstration of querying a remote for atom refs with `git` cli
 git ls-remote origin 'refs/atoms/*'
62e1b358b25f22e970d6eecd0d6c8d06fad380a7        refs/atoms/core/0.3.0
c85014bb462e55cc185853c791b8946734fd09bf        refs/atoms/std/0.2.0

An Atomic Reference

The Atom Git Store, as described, uses references to isolate specific repository subsections—both spatially (subdirectories) and temporally (points in history). To make this work seamlessly with Nix, though, we need to address some key details.

Git treats tree and blob objects as low-level implementation details, with no high-level “porcelain” commands to fetch or manipulate them. Most user-facing tools, including Nix, only understand commit or tag objects. For example, passing a tree object reference to Nixs builtins.fetchGit function will fail, as it expects a commit, not a tree.

To bridge this gap, we wrap atomic Git trees in standalone, orphaned commit objects—detached from history, carrying no baggage on fetch. This lets Git-aware tools, like the Git CLI, treat atoms like branches or tags (e.g., for checkout). This detachment, however, risks breaking our requirement to preserve the cryptographic tie between Nix expressions and their source. Fortunately, we can leverage cryptographic primitives to link the atom to its original history rigorously.

How? The implementation offers a peek, but heres the gist: we ensure the orphaned commits hash is fully reproducible for sanity and hygiene, using a fixed author and timestamp (Unix epoch). To tie it to the source, we embed metadata in the commit objects header, which influences its final hash. Specifically, we include:

  • The commit SHA from the source history where the atom was copied.
  • The relative path from the repository root to the atoms root.

These, combined with the commits reproducibility, yield powerful properties:

  • Source Verification: Users can verify the atom by checking the embedded SHA and ensuring the tree object ID at the specified path matches the source commits. Since tree objects are content-addressed, this guarantees the atoms source hasnt been altered.
  • Trust and Signing: A verified, reproducible atom commit can be signed with a standard Git tag object. Organizations can use a trusted signing key for added security, ensuring downstream users who trust the key can rely on the atoms integrity. Since the commit is reproducible, a verified SHA remains trustworthy indefinitely. If a key is compromised, the tag can be revoked and re-signed with a new key—no need to alter the commit.
  • Low Overhead: The atom adds minimal load to the Git server. Using low-level operations via gitoxide, it references existing Git trees and blobs (the actual files). This is like a shallow copy in Rust or C—a new pointer to pre-existing data—making the operation fast and lightweight.

Isotopic Versioning

Weve built a solid foundation for publishing and referencing Nix code (and potentially other languages) with the Atom Git Store. But one critical piece, which Ive stressed before, deserves its own spotlight: versioning. Its the linchpin of the atom scheme and warrants a dedicated section.

Every atom must be versioned, currently using semantic versioning, though we could support other schemes later to accommodate diverse software naturally. As shown earlier, each atoms Git reference lives at refs/atoms/<atom-id>/<atom-version>. This structure is key for efficient discovery. Querying references from a Git server is lightweight, with filtering done on the server side—no heavy object fetching required. A single request made with a simple glob pattern can list all atoms and their versions in a repository. Try that with nixpkgs today—its a slog, requiring costly history traversal and git log parsing, with no guarantee of accuracy if the log format hiccups; not to mention you'll have to have the whole history available locally to be exhaustive.

By contrast, the atom format is standardized (though evolving), efficient, and well-typed. When published using the official atom crate library, atoms are guaranteed to conform to spec. We even embed the format version in the atoms Git commit header, ensuring tools can easily handle future backward-incompatible changes by identifying the format version upfront.

Versioning also enables disciplined dependency management. Dependencies can be locked to simple semantic version constraints (e.g., ^1). Down the line, a version resolver could traverse the dependency tree to minimize the closure while leveraging Nixs ability to handle multiple software versions seamlessly. This will ensure the smallest possible dependency set, even when different versions are needed in the chain.

Equally critical is the user experience (UX). Versioning as the primary abstraction lowers the barrier to entry for Nix newcomers. Users can fetch, use, or build software without grappling with concepts like “derivations.” Only package maintainers and developers need to dive into Nixs internals—evaluation, dependency closures, and the like. Regular users get a smoother, less daunting onboarding while still reaping Nixs powerful benefits.

Atomic Numbers: A Rigorous Identity

This leads us to a critical aspect of atoms: their machine identity. As weve hinted in the reference and versioning scheme, each atom has a human-readable, Unicode ID specified in its manifest alongside its version. This ID, shown in the Git reference before the version (i.e., refs/atoms/<atom-id>/<atom-version>), uniquely identifies the atom within a repository. To keep things hygienic, we enforce sanity rules: no two atoms in the same repository can share the same Unicode ID in the same commit. For example, you cant have atom “foo” under both bar/baz and baz/buz simultaneously, but you can move “foo” between paths across commits.

With thousands or millions of atoms across multiple repositories, Unicode IDs alone become ambiguous—name collisions are inevitable. We need a robust, cryptographic identity to uniquely and efficiently identify atoms. A GitHub discussion (which Ive tried, and unfortunately failed, to track down for reference here) once highlighted a gap in Nix: it lacks a high-level package abstraction to distinguish “packages” from other derivations. A Nix derivation can represent inputs (sources, patches, build scripts) or outputs (packages, systems, JSON files), yet Nix, despite billing itself as a package manager, offers no unified way to identify a package derivation as distinct among these.

Why does this matter? Try tracking a packages evolution in nixpkgs. You might lean on its name or path, but those can shift. Same source, same project, but a tiny tweak changes the derivation hash, and poof—continuitys gone. Without rigor, youre stuck guessing if its the same package across time. Atoms fix this with a machine ID thats logical, rigorous, and ties a package to its versions or even dev builds (like their derivation hashes) with mathematical precision.

So, how do we pull this off? We need to disambiguate atoms with the same Unicode ID across repositories. I wrestled with ideas—maybe the repos URL? But URLs shift without touching the projects core (name, maintainers, versions). After banging my head on it, the answer hit me: the initial commit hash of the repository. Think about it: a repos history flows from one unique starting point: that first "seed" commit. Its set in stone—rewrite it, and youve got a whole new beast. Its the perfect, unchanging marker for a repository, no matter where its hosted or how it evolves.

From there, we derive the atoms machine ID using a keyed BLAKE3 hash over the repositorys initial commit hash, a constant for key derivation, and the atoms Unicode ID. BLAKE3s speed and vast collision space let us index trillions of atoms with negligible risk of collisions. This hash then becomes our bridge, linking the gritty world of derivations to the human world of versions, pulling software distribution idioms cleanly into Nixs rigorous realm of closures.

And whats it good for? A ton. It can power optimizations like bulletproof evaluation and build caches. Picture a backend that spots a users requested atom and version, verifies its pinned commit, and checks the organizations work history. Been built before? Boom—it skips the work and hands over the artifact. Thats not just faster; it splits concerns cleanly. A users client doesnt need to touch a Nix evaluator—just parse the atom API and ping the backend. If evaluation or buildings needed, the backend handles it quietly; if not, you get results instantly.

This opens up a lot of possibilities. Beyond speed, the machine ID boosts provenance tracking, record-keeping—everything a big outfit might need to manage its atoms or meet compliance standards. And it's important to note: the source identity (that initial commit hash) is an abstraction, so future storage backends can pick their own hash keys, keeping Atom flexible for the future.

Now with atom identities locked in, were ready to tackle how non-package content fits into the mix, especially in those sprawling monorepos.

Subatomics

Were nearly ready to climb the abstraction ladder and explore the Atom Nix frontend. But first, we need to cover one more critical piece planned for the Git store before it hits 1.0. Many organizations rely on large monorepos, blending source code with configuration—think package descriptions, CI workflows, and more. A single monorepo might house hundreds or thousands of software projects. As Ive noted, a key goal for the atom format is to work seamlessly across diverse project structures, from sprawling monorepos to small, focused repositories.

If we stopped here, monorepos could still be a pain. Referencing source code from different places and points in history would mean fetching the entire monorepo each time—echoing the nixpkgs dilemma we outlined earlier. To ensure a consistent, pleasant user experience, we need a way to reference repository subsections that arent full atom packages, with the same efficiency as atoms.

Enter subatomics, the working title for these lightweight “lenses” into a monorepos vast history, much like atoms but for non-package content. Their format is slightly tweaked to handle less structured data. Instead of named, versioned references, subatomics use a flat, content-addressed form: refs/subs/<git-tree-id>. The Git tree object ID, already a content-addressed identifier, acts as a simple, self-verifying reference for the subsection. For compatibility with Git tooling, each reference points to a reproducible, orphaned commit object, carrying all the same benefits as atoms: reproducibility, verifiability, and optional signing.

Well explore how users define subatomics when we move up the abstraction chain, but its worth noting that theyre created only when atoms reference other repository segments (e.g. a source tree for a build) as dependencies, ensuring their existence during the atom publishing phase.

User Entry URIs

Weve thoroughly covered the Ekala Git store, the atom formats first storage backend, crafted to tackle Nixs scaling woes while staying intuitive for newcomers and veterans alike. It leans on, perhaps, the most uncontroversial abstraction in software: the version. With subatomics now in the mix to handle non-package content, were ready to shift gears toward the Atom Nix language API—but first, lets talk about user interface, specifically how we reference atoms.

Even the slickest tooling can flop with clunky UX. The eka CLI is still a work in progress, and not all its features tie directly to atoms, but one piece, the atom URI, is already implemented and worth a look. Its how we address atoms, and its a game-changer for usability.

Now, Ive had a love-hate relationship with flakes. I went from preaching their gospel in the early days, to groaning every time I deal with them. Yet one thing I always liked was the flake URI. Its handy, but not without its flaws. The “shortcuts” arent short enough—Im still typing most of github.com. Worse, those shortcodes are hardwired into the Nix binary, so if your favorite Git host isnt listed, youre out of luck. And dont get me started on how flake URIs, embedded in flake.nix, can confuse newcomers and break clickability in editors or IDEs. I wanted to keep what works, fix what doesnt, and add support for explicit atom versions. After a couple of intense hacking weekends, the atom URI was born, and its pretty much feature-complete.

The syntax is dead simple. Heres the schematic:

[scheme://][[user[:pass]@][url-alias:][url-fragment::]atom-id[@version]

The scheme (e.g., https://, ssh://) is usually omitted, with smart heuristics picking a sane default. The user:pass bit is there for completenesss sake but rarely needed. The real magic is in user-defined aliases—think URL shorteners for common paths:

# eka.toml: client config file
[aliases]
# predefined for convenience
gh = "github.com"
# can build on other aliases
work = "gh:my-verbose-work-org"
cool = "work:our-cool-project"
org = "gitlab.com/some-org"

This lets you write commands like:

 eka do org:project::the-atom@^1
 eka get work:repo::a-pkg@0.2
 eka add cool::cool-atom@^3

When adding an atom as a dependency (like that last command), the manifest stores the full URL—e.g., https://github.com/my-verbose-work-org/our-cool-project—making it readable and clickable. This is crucial: embedding aliases in the manifest would break for downstream users without the same aliases, so we expand them to keep things sane.

Additionally, as a core library component, any tool interacting with atoms can tap this URI format to reference them effortlessly. Its a small but mighty piece of the puzzle, making atoms as easy to use as they are powerful.

Now, lets dive into the Atom Nix language API and explore how it harnesses this foundation to help deliver a more disciplined, scalable Nix experience.

Atomic Nix