ci-cd · monorepo · caching · devops

Your Monorepo CI Is Slow Because You Cache Wrong

I see the same six caching mistakes in every monorepo CI I audit. Fix them and pipelines drop from 40 minutes to 8.

14 October 2024·4 min read

Every monorepo CI audit I do starts the same way. The team complains pipelines take 40 minutes. I read the config. The pipeline could take 8.

The same six mistakes show up in every codebase. Fix them in order.

1. You cache the package manager, not the build

Most teams cache node_modules or ~/.m2 or the Go module cache. That is fine. It saves a minute. The expensive thing in a monorepo is not dependency download. It is the build itself, run on every commit, on packages that did not change.

If your CI rebuilds package A on every commit, even when only package B changed, you are wasting most of the wall clock.

The fix is build-aware caching. Bazel, Nx, Turbo, Pants, take your pick. The tool reads inputs, hashes them, and skips work when the hash matches a cache entry. Done well, this turns a 40-minute pipeline into 4 minutes for the 90% of PRs that touch one package.
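A minimal sketch of what this looks like in GitHub Actions with Turbo (paths and key names are illustrative; Turbo's local cache directory varies by version, shown here as .turbo). The outer Actions cache key is just a rolling wrapper, because GitHub cache entries are immutable — the actual content-addressing happens inside Turbo's own hashes:

```yaml
build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    # Restore Turbo's task cache; restore-keys falls back to the most
    # recent entry with a matching prefix.
    - uses: actions/cache@v4
      with:
        path: .turbo
        key: turbo-${{ runner.os }}-${{ github.sha }}
        restore-keys: |
          turbo-${{ runner.os }}-
    - run: npm ci
    # Turbo hashes each package's inputs and replays cached outputs
    # on a hit, skipping the build for unchanged packages.
    - run: npx turbo run build
```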

2. Your cache key is too coarse

Common pattern: cache key is os-${hashFiles('package-lock.json')}. This invalidates the entire cache every time anyone bumps any dependency.

Better: scope cache keys per package. os-pkg-foo-${hashFiles('packages/foo/package.json', 'pnpm-lock.yaml')}. Now a dependency change in package A does not nuke the cache for package B.
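As an Actions step, the per-package version looks roughly like this (the path and package name are illustrative):

```yaml
# One cache entry per package, keyed by that package's own manifest
# plus the lockfile. A dependency bump in another package does not
# invalidate this entry.
- uses: actions/cache@v4
  with:
    path: packages/foo/dist
    key: ${{ runner.os }}-pkg-foo-${{ hashFiles('packages/foo/package.json', 'pnpm-lock.yaml') }}
```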

3. Your cache key is too fine

Equal and opposite mistake. The cache key includes the commit SHA, or file mtimes, or something else that changes on every commit. The cache never hits. You pay the upload cost on every run and never collect the download savings.

Cache keys should be content-addressed, not commit-addressed. Hash the actual inputs. The same inputs on two different commits should produce the same key.
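Side by side, the anti-pattern and the fix (sketched for GitHub Actions; globs are illustrative):

```yaml
# Commit-addressed (anti-pattern): a fresh key every push, zero hits.
- uses: actions/cache@v4
  with:
    path: dist
    key: build-${{ github.sha }}

# Content-addressed (fix): hash the actual inputs. Identical sources
# on two different commits map to the same cache entry.
- uses: actions/cache@v4
  with:
    path: dist
    key: build-${{ hashFiles('src/**', 'package-lock.json') }}
```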

4. You upload from every job

In a fan-out matrix, every shard tries to write to the cache. They race. The CI system either picks one and discards the others, or stores all of them and the next run gets non-deterministic hits. Either way you are paying network and storage for nothing.

Designate one job as the cache writer. All other jobs are read-only. If the writer fails, the cache stays at its previous good state. The reader jobs degrade gracefully.
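The actions/cache/restore and actions/cache/save sub-actions split the two halves cleanly. A sketch (job names and paths are illustrative; runs-on and checkout steps omitted):

```yaml
# Every shard restores, read-only. No save step, so no race.
test-shard:
  steps:
    - uses: actions/cache/restore@v4
      with:
        path: .build-cache
        key: cache-${{ hashFiles('**/package.json') }}

# Exactly one job writes. If it fails, the previous entry survives.
cache-writer:
  needs: [build]
  steps:
    - uses: actions/cache/save@v4
      with:
        path: .build-cache
        key: cache-${{ hashFiles('**/package.json') }}
```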

5. Your remote cache lives in the wrong region

I have seen GitHub-hosted runners in us-east pulling cache artifacts from an S3 bucket in eu-west, paying 200ms of latency on every get, multiplied across a thousand cache lookups per build.

Put the cache in the same region as the runner. If you span regions, replicate. The cost of a replicated S3 bucket is rounding error compared to the engineering hours your builds are burning.
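With Turbo, co-location is a matter of pointing the remote-cache endpoint at the right place. TURBO_API, TURBO_TOKEN, and TURBO_TEAM are Turbo's standard remote-cache environment variables; the URL and team name here are hypothetical:

```yaml
# Remote cache endpoint in the same region as the runners.
env:
  TURBO_API: https://turbo-cache.internal.us-east-1.example.com
  TURBO_TOKEN: ${{ secrets.TURBO_TOKEN }}
  TURBO_TEAM: platform
```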

6. You do not cache test results

This one is the heretical fix. Most teams treat tests as something that must always run. Wrong. A test result is a function of:

  • The test code.
  • The code under test.
  • The dependencies of both.
  • The runtime environment.

Hash those inputs. If the hash matches a previous run that passed, skip the test. This is what Bazel's test result caching does by default, and what Nx and Turbo do for unit tests.

The objection is "but what if the test is flaky and we want to re-run it". Fine. Cache only on green. A failed test never caches. Then your re-runs are still re-runs, and your green tests are skipped on identical inputs.
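Hand-rolled in GitHub Actions, the cache-only-on-green rule falls out almost for free, because actions/cache only saves in its post step when the job succeeded. A sketch (globs, paths, and the test command are illustrative):

```yaml
# Key covers the test code, the code under test, and the lockfile.
- id: test-cache
  uses: actions/cache@v4
  with:
    path: .test-passed
    key: test-foo-${{ hashFiles('packages/foo/**', 'pnpm-lock.yaml') }}

# Run tests only when no green run with identical inputs exists.
# On failure the job fails, the save step never runs, and the next
# run re-executes the tests.
- name: Run tests
  if: steps.test-cache.outputs.cache-hit != 'true'
  run: npx vitest run && touch .test-passed
```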

The fastest CI I have built

For one client this year, a TypeScript monorepo with 80 packages and a thousand-engineer team:

  • Cold build, full test suite: 22 minutes.
  • Median PR (changes one package): 3 minutes.
  • Cache hit rate, p50: 94%.

The pipeline does the boring things right. Build-aware tool (Turbo). Per-package cache keys. Designated writer job. Remote cache co-located with runners. Test result caching gated on green. That is it. No exotic tooling. No bespoke scripts.

What I tell teams in week one

If your CI takes longer than 10 minutes for a one-package PR in a monorepo, you have a caching problem, not a compute problem. Throwing larger runners at it gets you 30%. Fixing the caching gets you 5x.

Start with build-aware caching. Get cache keys right. Move test caching last because it scares people. By the time you are done, your CI bill drops, your engineers stop context-switching during builds, and your DORA lead-time-for-changes number quietly halves.

The cache is the platform. Treat it like one.