Greetings Wayfair Engineering Enthusiasts: old, new, and returned. We’re gathered today (virtually) to collect a story from Wayfair’s history. I’m Gary White, volunteering historian, frequent worker. I’m going to share with you all the journey we took as an organization to adopt Continuous Integration. This is part one of two. In part two, I’ll tell you the story of Buildkite at Wayfair. In this part (one), you’ll indulge me in learning about Gitlab CI: where we started, the good, the bad, the decline, and the eventual decision to scrap our critical build infrastructure to do something new.
git(lab) init
Git is in just about every company in some form or another these days. Like a lot of companies, Wayfair was created long before git or GitHub reached popularity. In fact, before we had a Release Engineering team (and before we had a git server at all), we would version control by:
- Check out code from the “code safe”
- Code locally
- Check code back into the code safe
- Deploy
We used Visual SourceSafe, where a user checked out and locked a file so nobody else could edit it until it was checked back in. After learning how hard that model was to scale, we moved to Subversion. Subversion, if you’ve never used it, is a fine version control tool much like git that many companies have dropped over time. When it came time to move on again, we did what we still do with new technologies now: we ran a Proof of Concept to ensure that Gitlab was the right solution.
Despite a truly haunting uncanny valley of a logo, we found that Gitlab itself was a fine tool for the scale of the organization we had about six years ago.
integration of continuous integration
In the old, old deployments, we would use FTP scripts (codesafe->ftp ftw) to update our servers with the necessary code. When that stopped scaling, but before we even needed (or had) Gitlab CI, we used an extremely creatively named deploy tool: Deploy Tool. Deploy Tool would push changes from the code safe into production. Merge code to the code safe, deploy, don’t break anything.
In those days, the onus of testing fell on:
- Teams,
- Individuals,
- or nobody.
To us now, in a world where CI/CD pipelines are as important as they are, this is a glaring issue. While the inconsistency of deployment quality is evident to us now, it worked for a while! As the scale of our organization grew, it became increasingly apparent that we needed automation and quality gate checks for our production deployments. Keep in mind, this was well before we had any one team whose job was solely to monitor and manage deployments. Release Engineering, Train Conductors, teams of people to ensure the quality of our deploys? Those teams did not exist at the time. Folks were not there to merge, run tests, and keep the site up 24/7.
We built pragmatic infrastructure based on our needs, and much of our necessary deployment automation went into Jenkins jobs. As we grew in size and use cases, we needed more oversight of, and options for, our deployment tools. We continued to find use cases that needed a unifying system: identifying and monitoring improvements to testing and deployments, deployment integrations and usability improvements for Windows applications and Python deployments, versioning and auditability of our pipelines, and putting control of all this automation into the hands of our developers, to name a few.
A big hole in our tooling was exposed when we needed to run tests for individual commits. Our suite of Jenkins jobs had swelled to a size that had to be coordinated by a slightly-more-creatively-than-deploy-tool-named tool, The Integrator. The Integrator would merge a bunch of commits together, run tests, and deploy them. Getting jobs accepted for such a heavy-duty tool became a bureaucracy in itself. Consequently, many teams would opt to run testing, monitoring, and verification jobs after code had already made its way into production.
With Jenkins jobs that only ran after a master merge, as anyone currently using The Integrator can probably relate, it was very common for difficult bugs to emerge. We also had a long feedback cycle for scaling and fixing issues in The Integrator related to filled disks or outdated infrastructure configuration. Queues ran longer and longer and longer, and the developer experience got worse and worse and worse. With these problems in mind, and deploy lead time growing, we investigated options for CI that could dig us out.
what you c(i) is what you get
We needed a way to get the code we were writing tested before we merged into master. We needed a solution more tightly integrated with git, and with our code branches. Turns out, that new Gitlab thing we were using to store code was releasing an integrated CI system. As it rolled out, some brave developer souls (BDS) took on the task of updating our Gitlab version to get the new CI feature and try it out. I’d like you to remember, for the next section, that these were brave developer souls (BDS) working on infrastructure.
The first Gitlab Runners (gitrunners) were POC’d against BDS dev machines. These dev machines were provided to engineers and used for web stack development or, frequently, for testing out new functionality connected to the Wayfair subnet. Gitrunners on these machines ran automation (unit tests, linting, etc.) against specific branches, instead of us having to push changes all the way to Gitlab, and The Integrator, and hope everything worked. This allowed for a faster feedback loop, and a path emerged to enhance developer experience significantly.
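For anyone who never saw those early pipelines, here is a minimal sketch of the kind of .gitlab-ci.yml a branch-level gitrunner might have picked up. The stage names, script paths, and scripts themselves are illustrative assumptions, not our actual configuration at the time.

```yaml
# A minimal, hypothetical .gitlab-ci.yml of the sort an early gitrunner on a
# dev machine might have run for each branch push. Stage names and script
# paths are illustrative, not Wayfair's real configuration.
stages:
  - lint
  - test

lint:
  stage: lint
  script:
    - ./scripts/run_lint.sh          # hypothetical wrapper around the linters of the day

unit_tests:
  stage: test
  script:
    - ./scripts/run_unit_tests.sh    # hypothetical wrapper around the unit test suite
```

Even something this small meant a developer could get lint and unit test feedback on their own branch, rather than waiting to find out after pushing through The Integrator.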
Our team of BDS was then further invested in the process of allowing The Integrator to run its jobs in Gitlab CI. Before master merges, and when an Integrator pre-merge branch was created, many of the necessary tests would be run in Gitlab CI. We could now run pipelines that were shown in a pleasant and accessible visualization.
Our teams were happy, and our organization ran well on this new infrastructure. For a while. Slowly but surely, in a thread similar to our Jenkins and Integrator debacle, our builds and deploys developed longer and longer lead times. In Jenkins’ case, we had very few controls in place around the application layer, where developers would see disks filling up or caches not being cleaned. The same issues revealed themselves on the Gitlab agents, with the added instability that gitrunners had been stood up not by infrastructure administrators, but by webstack developers.
hot potatownership
Generally, nobody took proactive ownership of the original Jenkins nodes when it came time to fix deploy issues. Instead, we reactively cleaned out directories when disks filled, and ssh’d into machines for hot-fixes. Some of our Jenkins nodes consequently fell out of step with our needs as developers and as an organization. We moved to Gitlab CI because we wanted to build our workflows and necessary fixes into infrastructure we “owned”.
We never designated formal ownership parameters, SLAs, on-call rotations, ticket time, or organizational alignment within a particular team to ensure a quality developer experience when using Gitlab CI for a deployment block either. At one point, the critical process of deploying our production applications depended entirely on two nodes that were not labelled in any meaningful way (there was no way to tell what they actually did). If either node went down, only a handful of people in the organization would even understand what was keeping deploys on hold.
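For illustration only, and not something we had in place at the time: Gitlab CI lets you register runners with tags and route jobs to them, which is roughly what “labelled in a meaningful way” could have looked like for those two deploy nodes. The tag and script below are assumptions, not our real setup.

```yaml
# Hypothetical illustration: routing a deploy job to runners that advertise
# their purpose via tags. Our two production-deploy nodes carried no such labels.
deploy_production:
  stage: deploy
  tags:
    - production-deploy              # assumed tag a purpose-built runner would register with
  script:
    - ./scripts/deploy_production.sh # hypothetical deploy wrapper
```

A label like that at least tells whoever is paged what the node is for, even if it doesn’t solve the ownership question.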
That isn’t to say there weren’t people willing to stand up and maintain their own processes. There were workflows that delegated to Octopus, or ran Docker, Docker Compose, and Python, and needed minimal maintenance. The teams that maintained and used those workflows were sometimes willing to own those maintenance tasks and were given the means to execute on them. Most of the organization, though, worked on an application large enough that maintaining shared infrastructure without a dedicated team became infeasible.
Smaller applications could innovate and solve their own issues. In the case of The Integrator and production deployments (used for our largest applications), we needed one or more teams to coordinate on the effort. At one point, developer experience suffered so badly that a common issue flow was:
- Check code into the monorepo
- Run status checks (it could take anywhere from 45 minutes to 1.5 hours to get a worker allocated and run)
- Find out a deploy went out while your tests ran, so your base branch is out of date
- Merge master
- Run status checks again
- Find out you can’t join the deploy train because master changed since you started
It’s a cycle. The cycle deterred people from trusting our process or wanting to commit code frequently. Longer-lived branches magnify the possibility that tests fail, so those tests would fail even more often. That is one example of how we got to a decision point around Gitlab and Gitlab CI.
then, suddenly, GitHub!
While these issues were happening, and as we began organizing around stabilizing Gitlab and Gitlab CI, GitHub entered the picture. Gitlab CI was causing SEV after SEV after SEV; we were seeing issues with even being able to check in code. GitHub provided many benefits, but the best summation I got in talking to fellow engineers was:
“GitHub will kill your request to stay alive, Gitlab will kill itself to keep your request alive”
Among other reasons related to Gitlab, we eventually decided to make the switch, which left us with a choice: do we keep Gitlab CI, or move to something else?
That’s where we’ll leave this thread until next week. Thanks for reading, I hope you enjoyed it. If you have questions / comments / feedback / thoughts, feel free to drop me a line at gwhite@wayfair.com.