Nix in the Wild: The Flying Circus

Flox Team | 10 Oct 2023

Nix in the Wild is a series where we dive into the stories of Nix users across the industry, covering everything from the dotfiles of crafty developers to the processes of engineering leaders in large organizations. Learn where Nix is used, how it came to be, and why it works the way it does.

As founder and CEO of the Flying Circus, Christian Theune knows what it takes to keep things running.

He and his co-founder Christian Zagrodnick serve clients of all kinds, making sure that their applications continue operating while they sleep soundly through the night. Although they have become adept at handling operational situations, and gathered years of experience doing so, they still consider themselves to be developers first and operators second. With more than 23 years of collective experience in the realms of Linux and Python, they are seasoned and well-rounded practitioners.

Flying Circus’s customers "don’t have projects with us, they have products that they need to maintain for a long time," says Theune. He has observed that customers in Europe are more interested than customers elsewhere in bootstrapping ecosystems that stand the test of time, and their deployments tend to be long-running.

These customers are often reluctant technology operators, even though management of key applications is becoming core to their business. They require a special kind of service, and their needs have heavily influenced the technical approach at Flying Circus.

When it came time to redesign their system from the ground up, they based their decision on the experience they had built. They chose Nix.

The Need

Over the years, the platform at Flying Circus was based on a combination of Puppet and Gentoo. These technologies, used together, allowed them to build systems exactly to their customers’ specifications and keep them operating for years at a time. They managed a fleet of servers with the ability to parameterize services for customers individually and make extensive customizations.

This system was in place for over 7 years, supporting critical applications. This architecture replaced the original late-1990s "server under a desk" and allowed them to quickly grow into a "proper datacenter" where these applications and their 1.5 PiB of data can live as long as they are needed. However, they ultimately found that choosing separate technologies for managing systems and managing software was too difficult to sustain as an integrated platform.

"We had separate systems for automating configuration and package distribution that were not tightly-knit enough," says Theune. "Normally, 1 + 1 = 2, but our design tenet is that 1 + 1 = 3: if you have two separate systems instead of one, then both systems require work and managing their integration becomes a third piece of work you have to spend time on."

Once they reached several hundred machines, this combination became a maintenance nightmare. Sometimes a Puppet run would take over an hour to complete, and platform OS updates became large-scale projects. Worse, they had decreasing confidence that the changes they pushed would actually do the things they wanted them to do, especially on machines that had existed for a long time. They were ready for a fresh approach.

The Solution

In 2015, the Flying Circus started replacing their Gentoo and Puppet solution with NixOS and delivered a full rewrite of their platform to all new customers in less than 6 months.

"The promise of NixOS was big enough to convince us to go down a greenfield route," says Theune, even though he personally doesn’t like that approach. The transition of existing customers took place over a long period of time, almost 7 years, because they never wanted to force a customer through an upgrade. But, in the end, adopting a single solution for both the management of software packages and the configuration of systems saves them and their customers time and money.

The Flying Circus develops their platform synchronously with NixOS, closely following new releases. They use Hydra to automate builds, and are ready to offer staging environments for customers the moment a NixOS release comes out. That allows them to achieve general availability within a few weeks, an order of magnitude faster than the industry standard.

Hydra builds can be triggered by pull requests, and results are cached to speed up future builds. Because Hydra is capable of deciding which of the functional tests to run and which to skip, it streamlines the testing process. "We always get a level of confidence. Once we evaluate a new release, if it comes out right, we can switch over," says Theune.
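The caching idea described above can be illustrated with a minimal sketch. This is not Hydra's or Nix's actual implementation; it only assumes the general principle that a build or test is identified by a hash of its inputs, so a step whose inputs are unchanged can be skipped. The names `input_hash`, `BuildCache`, and the example test name are hypothetical.

```python
import hashlib
import json


def input_hash(name, sources, dep_keys):
    """Derive a stable key from everything that can influence a step.

    `sources` maps file names to contents; `dep_keys` lists the keys of
    the steps this one depends on, so a change upstream changes this key too.
    """
    payload = json.dumps(
        {"name": name, "sources": sources, "deps": sorted(dep_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class BuildCache:
    """Hypothetical in-memory cache: run a step only if its inputs are new."""

    def __init__(self):
        self._results = {}   # key -> cached result
        self.executed = []   # names of steps that actually ran

    def run(self, name, sources, dep_keys, action):
        key = input_hash(name, sources, dep_keys)
        if key in self._results:
            # Same inputs as before: reuse the result and skip the work.
            return key, self._results[key]
        result = action()
        self.executed.append(name)
        self._results[key] = result
        return key, result


if __name__ == "__main__":
    cache = BuildCache()
    conf_v1 = {"nginx.conf": "worker_processes 2;"}
    conf_v2 = {"nginx.conf": "worker_processes 4;"}

    cache.run("webserver-test", conf_v1, [], lambda: "passed")
    cache.run("webserver-test", conf_v1, [], lambda: "passed")  # skipped
    cache.run("webserver-test", conf_v2, [], lambda: "passed")  # runs again
    print(len(cache.executed))
```

In this sketch, three requests lead to only two executions, because the second request has exactly the same inputs as the first. Passing dependency keys into the hash is what lets a change in one component invalidate only the steps that depend on it.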

Management has become more straightforward as well. NixOS is able to handle both concerns, package management and system configuration, which previously were managed separately and then had to be integrated. They can now serve their customers’ needs using a single set of tools, address customizations in a clear and consistent way, and tie everything together with a cohesive language.

As Theune eloquently puts it, "there is now a whole list of things we don’t have to do anymore; they aren’t needed anymore due to the architecture of NixOS."

The Challenges

As much as Theune is a fan of NixOS today, he initially believed that "many of those in the community were overselling it." While he found the community’s enthusiasm invigorating and motivating, it could sometimes be overwhelming.

Upon reflection, he realized that he was being sold on the wrong set of virtues: purity above all, repeatability, and rollbacks. What he really values most in Nix, however, is:

  • building without deploying,
  • atomic builds,
  • source-first approach with proper binary caching,
  • building new releases as soon as they are available,
  • switching customers to new versions quickly, and
  • enjoying a much faster build time.

"The community has generally been very helpful when we run into problems, but it does work differently than others," says Theune. Some of their ideas didn’t resonate within the Nix community, but that hasn’t prevented them from implementing the platform their customers need. Because of the nature of NixOS, the team at Flying Circus finds it easy to override defaults and conventions when necessary to make the system behave as they need it to.

One reason they develop synchronously with NixOS is that security updates are only available for 4-8 weeks after a release. This forces them to maintain a faster pace and a closer engagement with the community. Theune says "once we synchronized our platform with the bi-yearly upstream release, we realized that we could also implement an 'upstream first' approach as the next release is always just around the corner."

The Future

The team at Flying Circus has gained considerable capabilities from the flexibility of Nix and NixOS. But they have an even more ambitious vision for the years to come.

Theune recognizes that Nix is currently going "into the deep", in his words, exploring the limits of its potential and defining best practices for tomorrow’s important challenges. Nix is a technology that operates on the edge of innovation, and he has observed it acquire some accidental complexity as it advances. "Unfortunately, it’s always possible to add another level of indirection," he reflects.

He looks forward to a day when Nix begins to reduce complexity again. But, due to its nature, he believes this should be a gradual process: "redesigns are necessary for innovation but you have to make the transition almost unnoticeable," he observes.