{
    "version": "https://jsonfeed.org/version/1",
    "title": "Paul Adams | Field Notes",
    "home_page_url": "https://adams.io",
    "feed_url": "https://api.feedifyrss.com/adams/articles/feed.json",
    "description": "RSS feed for Articles",
    "items": [
        {
            "id": "urn:sha256:3e7018c93dd18ef64502e2c3044a22097442ae7de76dc148ed451b41995734ba",
            "content_html": "<p dir=\"auto\">The incident bridge has a reflex. Something’s wrong in production; the graphs are red, and someone says the most reassuring sentence in operations: “Roll it back.” On the backend, that sentence is a plan. You revert the deploy, the bad version disappears, and you investigate with the pressure off. Minutes, not days.</p><p dir=\"auto\">Say it on a mobile incident bridge, and it means nothing. There’s no version to revert. The bad build is already on millions of devices, and it’s staying there.</p><p dir=\"auto\">And that cost doesn’t stop at engineering. A bad release is an experience your customers keep living in—rating it down, churning out, flooding support—for as long as the build survives. What you can’t take back isn’t only the code; it’s the impression the product makes in millions of hands.</p><p dir=\"auto\">Mobile reliability problems often trace back to a single inherited belief that doesn’t survive contact with the medium: that recovery is something you can do after the fact.</p><h2 dir=\"auto\">The Recovery Reflex</h2><p dir=\"auto\">Much of the reliability canon translates cleanly to mobile: circuit breakers, graceful degradation, bulkheads, and blast-radius thinking all survive the move. The recovery reflex doesn’t. Roll back, revert, redeploy. Each rests on an unexamined assumption: that you control the running artifact and can take it back. On the backend, you do: you own the process, the host, and the button.</p><p dir=\"auto\">On mobile, you own none of that after you ship. The deploy unit isn’t a service you control; it’s a single bundle that clears one app-store review and then belongs to the user. Because that bundle ships all-or-nothing, every team’s work has to ride the same release train. That’s the coordination tax. The same atomicity removes your undo button entirely. 
One property of the medium, billed twice.</p><p dir=\"auto\">So the reflex that organizes backend incident response—revert the artifact, recover fast—is the wrong reflex for mobile. Recovery, if it happens, has to come from somewhere else.</p><h2 dir=\"auto\">The Two Clocks</h2><p dir=\"auto\">There’s a second problem hiding underneath the first. You ship on your clock. Users update on theirs.</p><p dir=\"auto\">You release v2026.7 this week, but a meaningful share of your users are still on v2026.3 from four months ago, and some never move at all. You’re carrying every version still alive in the field at the same time, with no way to retire the bad ones on demand short of forcing the oldest out.</p><p dir=\"auto\">“Fix forward” is the standard answer, and it’s real. But be honest about what it costs. A forward fix is a full build, another store review cycle (expedited if you’re lucky), another staged rollout, and then the part no one can compress: waiting for users to actually install it. Full recovery across the installed base is measured in days to weeks, and partial even then. That’s not a recovery story. That’s a long apology.</p><h2 dir=\"auto\">Where Reliability Actually Lives</h2><p dir=\"auto\">If you can’t recover by reverting, reliability has to move to the only two places you still control: before the artifact leaves your hands, and remotely, through levers you built in advance.</p><p dir=\"auto\">The first is pre-commitment. On mobile, crash-free rate isn’t a dashboard you watch after release. It’s a bar you clear before it: a floor, not a ceiling. I inherited a mobile platform drowning in production defects, and the fix wasn’t a better monitor. We halted roadmap expansion and made reliability a precondition for shipping at all. Crash-free sessions reached 99.98%, and defects fell, not because we got faster at cleanup, but because we moved the decision upstream of the release. 
And the bar didn’t cost us speed; it bought it back: the people we’d had on cleanup went back onto our roadmap.</p><p dir=\"auto\">The bar is one half of working before you ship; the other is limiting how far a bad build can reach. That comes down to staged rollouts and release cadence. A staged rollout exposes a small percentage, not to recover gracefully but to cap the blast radius. But it’s only worth as much as your view into it: the health metrics and the abort threshold have to be set before you start, because a build degrading in the field leaves no time to decide what counts as bad. Halting in flight is the one move that still stops the bleeding, because it works before you’ve committed to all of it, never after.</p><p dir=\"auto\">Cadence does the same job at a different scale. Decoupling our mobile releases from enterprise governance and moving from monthly to every two weeks shrank the blast radius of any single one: a reliability decision as much as a delivery one. Smaller, more frequent units each carry less risk. It’s the closest thing to reversibility a medium without rollback allows.</p><p dir=\"auto\">The second is remote control. After a build ships, the only levers you keep are the ones you compiled into it: kill switches, server-driven feature flags, and configuration you can change without a release. A forced-update floor goes further: where the product allows it, it’s your only way to retire the oldest builds you otherwise can’t kill.</p><p dir=\"auto\">The deeper play is architectural: push as much as you can into the server-controlled surface—server-driven UI, over-the-air code where platforms allow—so the irreversible core shrinks to the native shell and what you compiled into it. 
But the surface that reaches the builds you can’t kill is the one place a bad flag goes global: govern it like a release, or you’ve moved the risk, not removed it.</p><p dir=\"auto\">On a regulated consumer platform facing a fast-moving fraud pattern, what mattered wasn’t patch speed. It was that we could contain the exposure on the server side while the real fix made its way through the pipeline. An off switch you built last quarter is worth more than any rollback you wish you had today.</p><p dir=\"auto\">And the levers ship in that same one-way binary: a kill switch you never exercised is one you don’t actually have.</p><p dir=\"auto\">The discipline isn’t to recover faster. It’s to narrow what a bad release can reach, watch what you’ve exposed, and keep a hand on the switch.</p><h2 dir=\"auto\">The Stakes Just Went Up</h2><p dir=\"auto\">This is where the current rush matters. Everyone is racing to ship AI features into their apps. Most teams keep the model behind an API rather than in the binary, and that’s usually the right call. Shipping models on-device makes it even harder, at least when those weights are compiled in rather than downloaded as a swappable asset. But the client code ships like everything else: the part that calls it, gates it, interprets its output, and decides what to show when it misbehaves. An AI feature is only as reversible as that control plane.</p><p dir=\"auto\">Put the on switch, the thresholds, and the fallback in the app instead of on the server, and you’ve dropped your least predictable feature into your least reversible medium. The oldest constraint in mobile, meeting the newest pressure on it.</p><p dir=\"auto\">Backend teams earn reliability by recovering well. Mobile teams earn it by deciding well, early, and never letting go of the levers. That second discipline outlasts mobile: it’s what the job becomes anywhere the cost of being wrong can’t be taken back.</p><p dir=\"auto\">The rollback isn’t coming. 
Build like it never was.</p>",
            "url": "https://adams.io/blog/you-cant-roll-back-a-phone",
            "title": "You Can’t Roll Back a Phone",
            "summary": "Out here, the only levers left are the ones you packed before launch.",
            "date_modified": "2026-06-18T00:00:00.000Z",
            "tags": [
                "reliability"
            ]
        },
        {
            "id": "urn:sha256:9402c99e4b3a1f68d216b287f0e43cb758069730996c131ee17bddfabe0be6ef",
            "content_html": "<p dir=\"auto\">Planning season ends the same way in most organizations. The deck gets approved. The commitments get made. Then, a few weeks into the quarter, someone finally counts who is actually free to build any of it—and discovers the roadmap was written for an engineering org that doesn’t exist. The org that exists is mostly busy keeping last year’s commitments alive.</p><p dir=\"auto\">This gets diagnosed as an estimation problem, or a prioritization problem, or—on bad days—a velocity problem. It’s none of those. It’s a strategy problem, and it starts with a premise most leaders never examine: that the roadmap is the strategy and the allocation is an implementation detail.</p><p dir=\"auto\">It’s backward. The allocation is the strategy. The roadmap is its output.</p><h2 dir=\"auto\">The Default Portfolio</h2><p dir=\"auto\">Every engineering organization runs a capacity portfolio, whether or not anyone designed one. Some share of capacity keeps the lights on. Some pays down risk. Some builds leverage. Whatever remains places bets. That split exists today, in your org, with real numbers attached. The only question is whether anyone chose them.</p><p dir=\"auto\">Ask directors to state their current allocation in four numbers and most can’t. Not because the data is hard to find—because nobody has asked the question in that form. The allocation accreted. A team formed around an incident three years ago and never disbanded. A platform effort got funded in a flush year and has defended its headcount annually ever since. Support rotations expanded one escalation at a time. Every decision was locally reasonable. Nobody owns the sum.</p><p dir=\"auto\">The sum is the strategy. If 70% of capacity goes to keeping the lights on, you are executing a preservation strategy—regardless of what the slide says about growth. The slide describes intent. The allocation describes reality. 
When they diverge, reality quietly wins with every sprint until the gap is too large to explain away.</p><h2 dir=\"auto\">Four Numbers</h2><p dir=\"auto\">The portfolio doesn’t need to be elaborate. Four buckets cover it.</p><p dir=\"auto\"><strong>Keep the lights on.</strong> Production support, incident response, mandatory upgrades, the maintenance that prevents decay. This work is non-negotiable, which is exactly why it must be measured—unmeasured obligations always expand.</p><p dir=\"auto\"><strong>Risk reduction.</strong> Security, compliance, resiliency, and the subset of technical debt with a real blast radius. Insurance, priced deliberately.</p><p dir=\"auto\"><strong>Platform leverage.</strong> Investment that makes every future unit of work cheaper—tooling, paved roads, shared services. The only bucket that compounds.</p><p dir=\"auto\"><strong>Growth bets.</strong> New capability, new revenue. The roadmap, as most people use the word.</p><p dir=\"auto\">Notice the language. Preservation, insurance, leverage, bets—this is how capital allocators already think. Presenting capacity this way doesn’t just clarify your own decisions; it moves the conversation with finance from “why do you need more engineers” to “which part of the portfolio are we funding.” Those are very different meetings.</p><h2 dir=\"auto\">Allocation First, Roadmap Second</h2><p dir=\"auto\">The discipline follows directly. Set the split deliberately, at the level where tradeoffs across teams are visible. Derive the roadmap from what the growth bucket can actually carry—not from what stakeholders can be talked into believing. Then defend the split, not the project list.</p><p dir=\"auto\">This also settles who owns which layer. Product owns sequencing inside the bets bucket—which features, in what order, for which customers. That’s their job, and the portfolio doesn’t change it. 
The roadmap they produce still does real work, exposing that sequence and its dependencies to the rest of the business. But it’s downstream of a decision most leaders never make on purpose.</p><p dir=\"auto\">Engineering owns the size of every bucket, because engineering answers for what happens when the lights-on work gets starved. The executive team ratifies the split. When product wants more roadmap, the legitimate move isn’t pressure on estimates—it’s a proposal to resize the buckets, with the cost stated out loud.</p><p dir=\"auto\">This inverts most planning fights. When stakeholders argue over roadmap items, they’re negotiating at the wrong layer; any individual project can be argued endlessly because individual projects have champions. The allocation has no champion except you. Move the argument up a layer—“we can shift this from leverage to bets, and here is what that costs us in eighteen months”—and the conversation becomes a tradeoff instead of a siege.</p><p dir=\"auto\">And review it quarterly. Allocation drifts the same way it accreted: one reasonable exception at a time. A split you set in January and never revisit is just a slower version of the default portfolio.</p><h2 dir=\"auto\">The Litmus Test</h2><p dir=\"auto\">Can you state your current allocation in four numbers—and defend why that split serves the business better than any other?</p><p dir=\"auto\">If yes, you have a strategy, and every roadmap conversation gets easier, because you’re no longer defending projects. You’re defending a position.</p><p dir=\"auto\">If no, you still have a strategy. You just didn’t choose it. Someone did—three years ago, one escalation at a time—and they weren’t thinking about where you need to be next year.</p><p dir=\"auto\"><em>The roadmap is the most visible artifact of planning and the least consequential. The allocation is invisible and decides everything. Choose it on purpose.</em></p>",
            "url": "https://adams.io/blog/your-roadmap-is-an-output-not-a-plan",
            "title": "Your Roadmap Is an Output, Not a Plan",
            "summary": "If 70% of capacity keeps the lights on, that’s your strategy—no matter what the slide says.",
            "date_modified": "2026-06-10T00:00:00.000Z",
            "tags": [
                "strategy"
            ]
        },
        {
            "id": "urn:sha256:f6999b86de62e2d5081cd13e0a29938280d76df17c4de77e75b5b62927a49eb3",
            "content_html": "<p dir=\"auto\">Three users in one morning. Transactions they hadn’t made. Money gone. By afternoon, a dozen. On a platform processing real financial transactions, this wasn’t a support queue problem but a structural one. By the time you name it fraud, the clock is already running.</p><p dir=\"auto\">The platform wasn’t built with this threat model in mind. When the transaction surface expanded, we attracted bad actors who understood it better than we did. Account takeover fraud—compromised credentials, VPN-masked access, funds moved to temporary accounts before anyone noticed—is a well-worn playbook. We had never had to defend against it before. We had the logs and incident reports but lacked a system to evaluate signals fast enough to act before the damage landed.</p><p dir=\"auto\">So we made the call that’s never clean: stop shipping features and fix this. There’s always a roadmap, commitments, and a product org with quarterly priorities and stakeholders who aren’t watching the same fraud queue. But the math was simple once we said it out loud: every release shipped while the fraud vector was open made the problem bigger. Velocity was the accelerant. We paused.</p><h2 dir=\"auto\">Build vs. Buy Under Fire</h2><p dir=\"auto\">Account takeover fraud was a different discipline—we had no in-house fraud detection expertise. We deployed hotfixes while the system still bled. Engineers pulled from roadmap work were manually investigating incidents, buying time while we found a real solution.</p><p dir=\"auto\">Building from scratch wasn’t the answer. A fraud model requires a calibration cycle measured in quarters and a large enough historical dataset to train on. You can’t label fraud you haven’t instrumented. We needed something production-ready in weeks.</p><p dir=\"auto\">We needed a service we could instrument fast, that covered the signal types relevant to account takeover, and that could operate inside a regulated environment on day one. 
The first vendor we evaluated demonstrated all three. Under other circumstances, we might have run a longer comparison.</p><p dir=\"auto\">What mattered about its design was that it separated signal from verdict. It evaluated risk across four types: IP address, device fingerprint, email, and phone number. Each returned a score plus a set of attributes—connection type, bot indicators, proxy flags. Their models handled the pattern recognition. Our job was to decide what those signals meant for our specific users.</p><p dir=\"auto\">A fraud score is an input, not a decision. A datacenter IP with an elevated score might be a legitimate enterprise user on a corporate VPN. A residential IP with a proxy flag might be a privacy-conscious user who’s never committed fraud. On a financial platform, a false positive is a trust event—sometimes more damaging than the fraud itself.</p><h2 dir=\"auto\">Building the Judgment Layer</h2><p dir=\"auto\">We didn’t wire scores to decisions. We built a rule evaluation layer on top—one handler per signal type, each consuming the service’s response, applying a configurable threshold, and returning a binary determination.</p><p dir=\"auto\">The thresholds weren’t hardcoded. They were pulled from configuration at runtime, allowing us to adjust what “fraud” meant for our platform without deploying new code. We knew the thresholds would need tuning as real traffic patterns emerged and wanted to avoid a deployment cycle for calibration changes. The device fingerprint check showed how the layered system operated in practice.</p><p dir=\"auto\">First gate: if the device fingerprint in the request didn’t match the one on record, we triggered step-up verification. 
On mobile, reinstalls and upgrades produce legitimate mismatches too frequently to justify a hard block.</p><p dir=\"auto\">Second gate: if fraud probability exceeded the configured threshold, we blocked—email and phone signals provided sufficient corroboration.</p><p dir=\"auto\">Third gate: if the score was elevated but below the threshold, we evaluated connection type alongside bot status. A datacenter connection with an active bot flag read differently than a residential connection using a proxy.</p><p dir=\"auto\">Same score, different context, different outcome.</p><h2 dir=\"auto\">The Signal We Kept Dark</h2><p dir=\"auto\">We deferred IP address evaluation entirely in the first release.</p><p dir=\"auto\">The account-takeover pattern made IP the most tempting signal and the most dangerous to miscalibrate. Bad actors masked behind VPNs, but so did legitimate users. Blocking on IP characteristics alone would have caught both. Everyone who looked at the false-positive rate reached the same conclusion: IP addresses are the noisiest of the four signal types, and we hadn’t characterized our legitimate user population well enough to set a trustworthy threshold. We had the integration ready but chose not to enable it.</p><p dir=\"auto\">Because each signal type was independently scoped, deferring IP didn’t require touching anything else. Email, phone, and device fingerprint went live. IP went dark but stayed instrumented—we could watch the signal without acting on it, which meant we were building the dataset we’d need to calibrate it later. When we eventually brought it online, it slotted in. Nothing else changed.</p><h2 dir=\"auto\">The Lever We Didn’t Pull</h2><p dir=\"auto\">Within weeks of shipping, successful fraud all but stopped. The support ticket clusters stopped. Unreconciled transaction reports dropped to near zero. The on-call rotation stopped getting pulled into fraud incidents at scale. 
Engineers returned to the roadmap.</p><p dir=\"auto\">The IP deferral seemed like a half-measure at the time. In retrospect, it was the most disciplined call in the project. Shipping an uncalibrated block rule would’ve caused a different kind of harm: legitimate transactions declined, accounts locked, and support volume climbed for the wrong reason. A lever you don’t understand is not yours to pull.</p>",
            "url": "https://adams.io/blog/when-fraud-finds-your-platform",
            "title": "When Fraud Finds Your Platform",
            "summary": "The most important decision isn’t which signals you act on. It’s which you leave dark.",
            "date_modified": "2026-06-01T00:00:00.000Z",
            "tags": [
                "architecture"
            ]
        },
        {
            "id": "urn:sha256:56da2072600d063cde61c09e2a36d99f79e40c02b926618d4e97aa857d93600d",
            "content_html": "<p dir=\"auto\">A few years into running mobile engineering, our incident channel lit up again. Wednesday night, same legacy reporting subsystem, same timeout, same two senior engineers dropping everything to stabilize it. By midnight it was stable; by morning standup nobody wanted to talk about it, because talking about it meant owning it.</p><p dir=\"auto\">The code worked like an abandoned building with electricity. No one wanted to walk inside. Every org has a few of these zombie systems—technically alive, structurally dead, and quietly expensive.</p><p dir=\"auto\">The instinct, every time, was to refactor. The right call was to remove. And I avoided that call longer than I should have.</p><h2 dir=\"auto\">The Fractured Bill</h2><p dir=\"auto\">Most engineering orgs treat decommissioning as housekeeping. It’s a portfolio decision that shows up on the wrong line items every sprint.</p><p dir=\"auto\">The bill shows up as features taking longer. Senior engineers pulled into incidents on systems not on their roadmap. Onboarding that runs three weeks long because new hires have to learn one more thing nobody uses directly. A tooling upgrade that stalls because the deprecated framework isn’t supported, so every change in that subsystem costs more than it should.</p><p dir=\"auto\">The CFO sees infrastructure spend. The VP of Engineering sees velocity. The incident rotation sees burnout. The pain is distributed among stakeholders who don’t connect their symptoms, diluting the urgency to act. The line item connecting them is the system that should’ve been turned off two years ago.</p><h2 dir=\"auto\">The Preservation Instinct</h2><p dir=\"auto\">Three forces keep zombie systems alive.</p><p dir=\"auto\">The first is that the original owners are still in the room. Retirement feels like erasure to the people who built it, so the conversation gets tangled in identity rather than economics. 
If the original author is now the director, the conversation often doesn’t start.</p><p dir=\"auto\">The second is sunk-cost gravity. The migration that built the system was hard. That memory becomes the reason to avoid the next hard thing, even when it’s shorter and pays back faster than maintaining what’s there.</p><p dir=\"auto\">The third is that nobody owns the kill decision. Feature teams own features. Platform teams own platforms. Nobody owns the end of life for things nobody currently wants but nothing structurally requires. The decision gets deferred every quarter until the system outlives the people who knew why it existed.</p><h2 dir=\"auto\">The Kill Criteria</h2><p dir=\"auto\">The honest test is one question: <em>if this system disappeared tomorrow and we had to rebuild only what we actually use, would we build it?</em></p><p dir=\"auto\">If the answer is no, the cost of retiring it is lower than the cost of keeping it. And there’s one follow-up that settles most remaining debate: how many engineers are confident they could change it safely? If the answer is one, it’s an organizational liability, not a technical asset—and the person it depends on is one resignation away from being your problem.</p><p dir=\"auto\">If both land hard, the conversation isn’t whether to decommission. It’s who owns the schedule and what gets cut to fund it.</p><h2 dir=\"auto\">Actually Finishing</h2><p dir=\"auto\">Execution is the easy part. The discipline to finish it is what’s rare, and the failure mode is almost always the same: the code stays in the repo. Six months later, someone resurrects it under deadline pressure to ship a quick fix, and you’re back at midnight in two years explaining why the system you killed is on the incident rotation again.</p><p dir=\"auto\">Remove the path back. Delete the code, turn off the infrastructure, and keep only what regulators require—usually a read-only archive satisfies the audit without preserving operational surface area. 
Then quantify the silence: lower cloud spend, fewer escalations, engineering hours returned to feature work.</p><p dir=\"auto\">The Wednesday-night subsystem got killed a few months later. It took a platform migration to give us the cover. The incidents stopped. Nobody mentioned the subsystem again.</p><h2 dir=\"auto\">The Portfolio Multiplier</h2><p dir=\"auto\">Mobile orgs get punished here. A monolithic single-product app has one decommissioning conversation per dead system. A portfolio has the same conversation <em>n</em> times, once per app that still touches it. The temptation is to keep the system alive in apps that haven’t migrated and revisit when those roadmaps allow. That’s the deferral trap. Roadmaps never allow it. Product managers are incentivized to ship visible features over invisible backend work.</p><p dir=\"auto\">That deferral is how portfolios accumulate the silent debt that eventually becomes a reorganization. The shared platform team carries duplicate contracts indefinitely, and a year later someone proposes splitting the team along business lines to “increase responsiveness.” The federated platform that results is the bill for a decommissioning conversation that didn’t happen.</p><p dir=\"auto\">The structural fix is to make the kill decision a portfolio-level call. Someone with authority across apps owns it. It gets scheduled deliberately and funded as a line item, not absorbed into team capacity. That’s the posture.</p><p dir=\"auto\">The tactic is to exploit forcing functions when they appear—platform migrations, framework deprecations, regulatory deadlines—and attach the kill work to them rather than waiting for a standalone mandate. Posture without tactic means the schedule slips every quarter. Tactic without posture means you only kill things when something else forces your hand. 
Most orgs live there, buried in systems that should have died.</p><h2 dir=\"auto\">The Bill You’re Paying</h2><p dir=\"auto\">The question isn’t what to build next. Most engineering orgs already have more answers to that than they have capacity for. The bill for keeping a system past its useful life is paid in the decisions you couldn’t get to—the platform investment you deferred, the migration you pushed to next year, the architectural rework that would have prevented the incident you’re now in.</p><p dir=\"auto\">You are already choosing which systems to keep alive. The question is whether you’re choosing deliberately or inheriting the choice from the team that built them.</p><p dir=\"auto\">So: which one are you afraid to kill?</p>",
            "url": "https://adams.io/blog/zombie-systems-the-hardest-to-kill-still-work",
            "title": "Zombie Systems: The Hardest to Kill Still Work",
            "summary": "Failure isn’t the cost. Existence is.",
            "date_modified": "2026-05-27T00:00:00.000Z",
            "tags": [
                "strategy"
            ]
        },
        {
            "id": "urn:sha256:b3fb40f001107ded6fb3fdb9e0e019705e5daae46518514eb7d9eecbfdcfbba9",
            "content_html": "<p dir=\"auto\">You hired more engineers. The roadmap says you should ship faster. Instead, the pace stalls. CI queues fill up. PR reviews spill into the next day. Senior engineers become bottlenecks on approvals beyond their team. Releases suffer. The natural response is to double down—more discipline, more headcount, more process. Things get worse.</p><p dir=\"auto\">All software engineering becomes an alignment problem at scale. The constraint isn’t coding speed; it’s how many independent decisions teams can make in parallel without blocking each other. That’s a property of release topology, ownership boundaries, and runtime decoupling, not of process. Mobile orgs hit those limits earlier than backend orgs of similar size. Across a portfolio, costs multiply. In the agent era, they multiply faster than human throughput alone would explain. Companies that figured this out—Uber, Spotify, Airbnb, Meta, Grab, Shopify—published most of their answers years ago. The differentiator is sustained investment, not the framework.</p><h2 dir=\"auto\">The Four Things That Make Mobile Different</h2><p dir=\"auto\">Earlier in my career, I led mobile engineering for a multi-tenant SaaS platform shipping 200+ white-labeled apps across eight shared codebases. Hiring helped until it didn’t. The engineers were excellent. Coordination cost was the problem.</p><p dir=\"auto\">Mobile orgs hit the wall earlier than backend orgs of similar size for compounding reasons.</p><ol dir=\"auto\"><li data-preset-tag=\"p\"><p><strong>Atomic distribution.</strong> Backend services deploy at service granularity. A team can own one and ship it on its own schedule. Mobile’s deploy unit is the app, not the team. One published bundle per app ID is shared by every contributing team and passes through a single review queue, release window, and regression surface. 
Source-level decomposition—modules, separate repos, internal SDKs—doesn’t change the publication unit: it’s still one binary. Coupling within an app must be managed at runtime with feature flags and remote configuration, not eliminated by reorganizing the code.</p></li><li data-preset-tag=\"p\"><p><strong>Build-system friction.</strong> Default-configured Xcode and Gradle do not scale beyond a certain codebase size. Clean builds become costly. CI queues lengthen during business hours. Engineers batch larger changes to justify the wait. Bigger PRs, bigger blast radius: more conflicts, more flake, harder review. Branches live longer and drift further. When inner-loop feedback slows from seconds to minutes, throughput loss compounds.</p></li><li data-preset-tag=\"p\"><p><strong>Testing cost.</strong> Device farms, simulator orchestration, visual regression, and OS version compatibility make mobile testing operationally expensive. Backend bugs roll back in minutes; mobile bugs live in installed apps until a fix passes review and users update. As regression suites grow, orgs relax quality discipline to preserve velocity, but this tradeoff rarely holds: the cost of escape exceeds the cost of the suite.</p></li><li data-preset-tag=\"p\"><p><strong>Shared infrastructure as shared exposure.</strong> Large mobile orgs rely heavily on shared systems: repositories, CI, review queues, release trains, observability. Atomic distribution, slow builds, and a growing test surface would each be a single team’s problem in a decoupled system; sharing turns them into everyone’s problem at once. Runaway test suites, repository-wide refactors, and surges of PRs from AI coding agents don’t stay in their lanes.</p></li></ol><p dir=\"auto\">The scale is well documented. Grab’s 2020 Bazel migration covered 2.5M lines of code per platform, over a thousand Android modules, 700 iOS targets, tens of thousands of unit tests per platform, and hundreds of commits a day. 
Airbnb reported nearly 1,500 modules in its iOS app. Meta replaced Buck1 with Buck2 and reported builds roughly twice as fast. None of those metrics primarily describes coding throughput. They describe coordination throughput—what lets independent teams stop interfering with each other.</p><h2 dir=\"auto\">Modular Architecture, Past the Folder Names</h2><p dir=\"auto\">The foundational shift is from a single project to a set of modules with explicit boundaries. Whether those modules live in one repo or many is the question most teams jump to first, and it’s the wrong starting point. Repo strategy is usually a proxy for the actual concerns: dependency direction, ownership boundaries, and release independence. A monorepo with weak boundaries is still a monolith. Multiple repos without governance still produce fragmentation. Modules and ownership are upstream of the repo decision; once those are right, the repo question becomes a tooling preference rather than an architectural one.</p><p dir=\"auto\">Most scalable mobile suites converge on a three-tier shape. A thin app shell initializes the runtime, configures dependency injection (DI), and assembles modules. Feature modules own UI, presentation logic, local business logic, tests, and feature-specific navigation contracts. Core platform libraries provide networking, auth, persistence, analytics, design system, logging, feature flags, localization, and accessibility.</p><p dir=\"auto\">Naming the tiers is easy; enforcing the boundaries is where most orgs fail. Features import each other directly. Core libraries develop circular dependencies. API modules become thin wrappers around implementation details. The folders look modular, but coordination costs don’t change.</p><p dir=\"auto\">The discipline that fixes this is the API module pattern: each feature exposes a public interface as a separate, dependency-free module. Other features depend only on the interface; the implementation is private. 
DI binds implementations at the shell level. The binding mechanism—Dagger, Hilt, or Koin on Android; Needle or Swinject on iOS—matters less than consistency. The result: teams refactor internally without coordinating with the rest of the org.</p><p dir=\"auto\">Module boundaries should also match team ownership. An unowned module is a coordination liability; one shared among three teams is a meeting in disguise. Healthy architectures align the two and rebalance when teams reorganize.</p><p dir=\"auto\">Across a portfolio, the API module pattern extends from features to apps. The same authentication interface might have different implementations for the consumer app, the merchant app, and the white-label deployments—but every app depends on the same contract. The implementation varies; the contract holds. The same discipline extends to the contracts that mobile shares with the backend: generated clients from a single schema source—OpenAPI, GraphQL, Protobuf—let mobile and backend teams evolve in parallel without coordinating on every change. Features within an app, apps within a portfolio, mobile and backend across a shared contract: each layer is the same discipline pushed outward.</p><h2 dir=\"auto\">Platform Teams Are Not Help Desks</h2><p dir=\"auto\">What distinguishes successful enterprise mobile orgs from struggling ones isn’t which build tool they use. It’s the presence of a platform team operating under a platform-as-a-product model.</p><p dir=\"auto\">A mature platform team owns shared infrastructure as a product—versioned APIs, changelogs, migration guidance, documentation, and intake protocols. Its customers are feature teams. 
Its core deliverable is protected engineering throughput: keeping the inner loop fast, CI trustworthy, the release process boring, the build infrastructure invisible, and the telemetry sharp enough that autonomy doesn’t become guesswork.</p><p dir=\"auto\">The anti-pattern is the platform team as a help desk, absorbing every Slack interruption and applying one-off fixes on request. I’ve seen well-staffed platform teams hollow out within two quarters because Slack became the intake channel. The most interrupted engineers were the strongest, and they were the first to leave. A well-staffed platform team without triage is just a more expensive help desk.</p><p dir=\"auto\">The healthy pattern is intake-driven: a single channel, a rotating triage owner, lane-based classification by impact, observation windows for non-critical issues, and a refusal to support ambiguity—urgency without data is just anxiety moving through Slack.</p><p dir=\"auto\">Across a portfolio, the platform team’s customers are distinct apps with their own roadmaps. Versioning, deprecation, and migration become more consequential as the number of apps in-flight grows. A breaking change to the networking layer isn’t one app’s problem; it’s every app’s.</p><h2 dir=\"auto\">Shift-Left Without the Cost-Cutting</h2><p dir=\"auto\">Quality engineering at scale is one of the most consequential architectural concerns and one of the most under-resourced. The dominant industry conversation has been shift-left: defects are cheaper to catch earlier, so testing should move closer to development. The principle is correct, but the implementation is often flawed. I now treat QE cuts as a leading indicator of stability problems—every planning cycle I’ve seen trade QE headcount for a feature push has been followed by climbing production defect rates later in the year.</p><p dir=\"auto\">Shift-left fails when implemented as a cost-cutting measure to transfer responsibility. 
Developers test their own work, and defect detection is pushed later rather than earlier. The reason isn’t that developers can’t test but that without QE partnership, aggregate coverage skews toward the happy path: people writing the code naturally test what they built. Negative paths, platform-parity divergences, accessibility regressions, and OS-version edge cases are the seams a dedicated QE function is trained to find. Defects missed at the story level surface in regression; those missed in regression surface in production.</p><p dir=\"auto\">Shift-left works when it redistributes responsibility with investment in capability. Developers own first-pass story-level testing, supported by tooling and an embedded or adjacent QE function focused on exploratory, platform-parity, accessibility, and complex integration testing. Capability shifts left, headcount stays stable, and defect detection moves earlier.</p><p dir=\"auto\">In a portfolio, QE coverage extends across apps. A bug in the shared networking layer that appears only in App B requires QE involvement in App B’s release cycle, even if the change originated in App A. Parity testing becomes two-dimensional: not just iOS vs. Android but App A vs. App B across the shared surface. Without cross-app QE ownership, regressions hide in apps that didn’t cause the change—exactly where nobody looks.</p><h2 dir=\"auto\">Release Trains Need a Captain</h2><p dir=\"auto\">A scalable release train needs a captain. One person owns each release end-to-end: cutting the branch, monitoring stabilization, deciding which late fixes to cherry-pick, signing off on submission, watching rollout telemetry with flag attribution, and posting the post-mortem. Without that role, release decisions fall to whoever is online when the question comes up. Stabilization gets inconsistent, sign-off slips, and the drift compounds release over release.</p><p dir=\"auto\">The mechanics around the captain are well understood. 
Trunk-based development with feature flags decouples code releases from feature availability. The app ships on a fixed cadence—weekly or biweekly. The release branch cuts at a known point, stabilization runs in parallel with continued trunk development, and the release ships even if individual features aren’t done. Unfinished features stay dark behind flags until ready. What the canon underdescribes is the human ownership a release train requires to stay on time.</p><p dir=\"auto\">Each app has its own release train, and the shared platform layer has its own cadence feeding them all. Named ownership extends to the platform release. Someone owns shared-library cuts separate from any app’s release captain. When no one does, breaking changes land unpredictably, and every app’s release captain spends time tracking platform state rather than stabilizing their own release.</p><h2 dir=\"auto\">The Multi-App Case the Canon Skips</h2><p dir=\"auto\">Most public material on mobile engineering at scale assumes a single-product app—Uber’s, Spotify’s, Airbnb’s. Super-app material covers a different case: one app with many capabilities, like Grab or WeChat. Portfolio orgs fit neither pattern, and the canon skips them.</p><p dir=\"auto\"><strong>Why portfolio orgs don’t publish.</strong> It’s not secrecy or compliance. At banks, retailers, telcos, and white-label platforms, engineering supports the product rather than being the product. External technical evangelism isn’t part of how leadership measures the function, and the recruiting pressure that pushes Meta and Stripe to publish doesn’t apply. The canon reflects where engineering brand is a competitive necessity, not where engineering problems are most common.</p><p dir=\"auto\"><strong>The platform becomes load-bearing.</strong> The platform team’s role shifts from useful to essential. With a single app, a feature team can gradually drift from the platform. 
With multiple apps on the same platform, drift causes immediate divergence. App A is on the new networking layer; App B remains on the old. The next breaking change requires coordination across organizational boundaries that didn’t exist a quarter earlier. Intake discipline is essential because a shared platform team must ration finite capacity across the apps it serves and govern who can add to the platform’s surface. Without managed intake, the loudest app wins; without governance, the platform fragments from its own success.</p><p dir=\"auto\"><strong>The business-unit-aligned trap.</strong> The failure mode this produces is the business-unit-aligned platform. Each business unit creates its own platform team because the central one is too slow. Each new team reimplements similar infrastructure. The shared layer fragments, and the original throughput problem returns multiplied by the number of units. The healthy pattern is per-app feature teams plus a shared central platform with disciplined intake, named escalation paths, and proven responsiveness rather than a federation of mini-platforms.</p><p dir=\"auto\"><strong>Fork, share, or abstract.</strong> A new decision emerges for any shared capability. Take authentication. 
There are three options, and the choice is deliberate.</p><ul dir=\"auto\"><li data-preset-tag=\"p\"><p>Fork, and each app gets its own SDK and its own bugs.</p></li><li data-preset-tag=\"p\"><p>Share, and all apps land in the same release window when the auth provider rotates a certificate.</p></li><li data-preset-tag=\"p\"><p>Abstract—a stable contract with implementations per app—and the consumer app uses biometrics, the merchant app uses SSO, and white-label deployments use the partner’s identity provider, all behind a single interface.</p></li></ul><p dir=\"auto\">The third option is the most expensive and most durable; choose it when apps must vary along axes such as branding, regulatory regimes, or regional payment integrations yet share the same core. The abstraction is paid for once. The wrong fork or the wrong share is paid for every quarter.</p><p dir=\"auto\">Across the 200-app program, every shared capability eventually faced this choice. Teams that chose deliberately with named owners and contract versioning survived the next reorganization with their architecture intact. Teams that drifted into a choice, usually by forking under deadline pressure and intending to consolidate later, rarely did.</p><p dir=\"auto\"><strong>Parity across two axes.</strong> It isn’t just about testing. Published material on iOS vs. Android parity is solid. The multi-app extension for shared features, regulatory standards, accessibility, and security posture is mostly an internal knowledge gap. Without a dashboard tracking both axes, drift quietly compounds until a regulatory audit, security review, or customer-impact incident exposes it.</p><h2 dir=\"auto\">Agents Are a Different Class of Contributor</h2><p dir=\"auto\">Agents are not faster humans. They’re a distinct class of contributor, and treating them as humans with a throughput multiplier is a category error. 
The cost compounds as portfolio size grows.</p><figure><table><tbody><tr><th><p dir=\"auto\"><br></p></th><th><p dir=\"auto\"><strong>Human Engineers</strong></p></th><th><p dir=\"auto\"><strong>Coding Agents</strong></p></th></tr><tr><td><p dir=\"auto\"><strong>Blast radius</strong></p></td><td><p dir=\"auto\">Work within the context of a single app, even on portfolio-wide changes</p></td><td><p dir=\"auto\">Act on the shared surface across every app at once</p></td></tr><tr><td><p dir=\"auto\"><strong>Submission rate</strong></p></td><td><p dir=\"auto\">Pace against visible release windows</p></td><td><p dir=\"auto\">Submit at rates that saturate review economies sized for humans</p></td></tr><tr><td><p dir=\"auto\"><strong>Escalation</strong></p></td><td><p dir=\"auto\">Escalate on social context—release timing, team disruption, deadline pressure</p></td><td><p dir=\"auto\">Escalate on task ambiguity, not on release calendar or organizational state</p></td></tr><tr><td><p dir=\"auto\"><strong>Failure mode</strong></p></td><td><p dir=\"auto\">Process gaps, logic errors</p></td><td><p dir=\"auto\">Convention refactors during stabilization, missed context boundaries</p></td></tr><tr><td><p dir=\"auto\"><strong>Release-calendar awareness</strong></p></td><td><p dir=\"auto\">Inferred from ambient signals—Slack, standups, release announcements</p></td><td><p dir=\"auto\">Absent unless someone wires it in</p></td></tr></tbody></table></figure><p dir=\"auto\">The most common agent failure isn’t a bug—it’s a convention refactor that lands during release stabilization, forcing a regression rerun late in the cycle. The change is technically correct. The agent just has no model of the release calendar.</p><p dir=\"auto\"><strong>Intake under two contributor classes.</strong> The implications for platform teams are concrete. Intake discipline must now govern two classes of contribution with different rate profiles. 
Review economies sized for human submission rates saturate when agents are added without throttling. The platform team’s job extends to owning agent policy at the portfolio level, including which surface areas agents may touch, which review and test bars apply, what attribution is required, and when activity pauses.</p><p dir=\"auto\"><strong>Agent blackouts and policy.</strong> The release captain role extends in parallel. Pausing agent activity on stabilization branches is no longer a nice-to-have; it’s now a must-have. The captain owns the policy, including when these pauses start and end, which exceptions are allowed, and how late agent-originated cherry-picks are evaluated. Without named ownership of the policy, every release captain reinvents it under pressure, and the inconsistency becomes overhead.</p><p dir=\"auto\"><strong>Attribution and provenance.</strong> When an incident traces back to a change, the org needs to know whether the change came from a human, an agent, or a human-agent collaboration—not for blame, but because the remediation path differs. Agent-driven regressions usually indicate a policy gap or a context boundary the agent missed. Human regressions usually indicate a process gap. Conflating them misdirects the fix.</p><p dir=\"auto\">Agents don’t create new coordination problems—they make the old ones non-optional.</p><h2 dir=\"auto\">What Compounds and What Reverses</h2><p dir=\"auto\">None of these patterns are novel. They’re the same ones Uber, Spotify, Airbnb, Meta, Grab, and Shopify have publicly documented over the past decade, plus the emerging practices for governing AI coding agents in shared infrastructure. Most are bullet points in someone’s engineering blog.</p><p dir=\"auto\">Consistency is the hard part. Each of these investments can be reversed within a single budget cycle. The platform team gets reorganized into feature teams to “increase velocity.” QE gets trimmed to fund a feature push. 
The release captain role is absorbed into existing management. Each reversal looks defensible on its own. Repeated across a few quarters, they produce the symptoms this piece opened with.</p><p dir=\"auto\">These investments accrue. Over time, they produce orgs that ship continuously, release without drama, and maintain velocity without exhausting the people who deliver it.</p><p dir=\"auto\">The symptoms are familiar: clogged pipelines, breaking changes cascading across unrelated features, cross-domain advisory bottlenecks, developer-as-tester, missing agent blackouts, production issues absorbed ad hoc, parity gaps between iOS and Android or across apps. When they show up at portfolio scale, the response shouldn’t be more sync meetings or stricter PR review. It should be structural: architecture, tooling, team topology, and governance must evolve together.</p><p dir=\"auto\">Pick any one, and the symptoms remain. Together, they’re what makes a mobile portfolio stop being a code problem.</p><p dir=\"auto\">If your org is feeling these symptoms, the most useful question isn’t which pattern to adopt first. It’s which coordination cost you haven’t named—because that’s the one already shaping how your roadmap actually moves.</p>",
            "url": "https://adams.io/blog/when-a-mobile-portfolio-stops-being-a-code-problem",
            "title": "When a Mobile Portfolio Stops Being a Code Problem",
            "summary": "What reads as engineering inefficiency is coordination cost. The bill: revenue, retention, timing.",
            "date_modified": "2026-05-10T00:00:00.000Z",
            "tags": [
                "architecture"
            ]
        },
        {
            "id": "urn:sha256:98b6bf1280c3d3018a9d9defe407ba08cc3a9b5776c51f38e63f702ff065d05c",
            "content_html": "<p dir=\"auto\">I’ve watched a version of this play out more than once. A staff engineer presents a redesign in an architecture review. Twenty minutes in, two reviewers are nodding; three are quiet.</p><p dir=\"auto\">She reads the quiet as agreement. It isn’t.</p><p dir=\"auto\">Two weeks later, the redesign returns with five questions any of those three reviewers would have asked—if they had the prerequisite context.</p><p dir=\"auto\">Three months of design met twenty minutes of audience—and weeks of rework followed.</p><p dir=\"auto\">Nothing about this looks like a communication failure from the inside. That’s the problem.</p><p dir=\"auto\">She can’t see how the questions could arise—the knowledge she now has erases the memory of not having it. The gap has a name, and it gets installed deeper every quarter a leader stays in role.</p><h2 dir=\"auto\">The Tappers and the Listeners</h2><p dir=\"auto\">In 1990, Stanford grad student Elizabeth Newton ran a now-classic study. Subjects tapped the rhythm of well-known songs—“Happy Birthday,” “The Star-Spangled Banner”—on a tabletop while listeners tried to name them. Tappers predicted listeners would guess correctly about half the time. The actual rate was 2.5%.</p><p dir=\"auto\">Tappers heard the melody in their heads; listeners heard knocking. Each tapper was certain the song was obvious—because for them, it was.</p><p dir=\"auto\">This is the <em>curse of knowledge</em>: once you know something, you can’t accurately model what it’s like not to know it. In engineering orgs, it shows up as a leader frustrated that a team “isn’t getting it”—and a team frustrated with a leader who skips the parts that would make the decision make sense.</p><h2 dir=\"auto\">Seniority Installs It</h2><p dir=\"auto\">Every step up the ladder adds context you don’t realize you’ve absorbed. The EM sits in the planning conversation where scope got cut before the team ever saw the original. 
The director sees the financial constraint that killed the migration. The VP carries the political memory of the org that tried this two years ago and failed.</p><p dir=\"auto\">You communicate from inside that context. You name decisions and skip the journey. The conclusion arrives without the reasoning behind it. When people ask questions you answered months ago, it feels like resistance—not a knowledge gap.</p><p dir=\"auto\">The capability you were promoted for—pattern recognition, compression, judgment—is the same capability that now breaks transmission.</p><h2 dir=\"auto\">How It Leaks Out</h2><p dir=\"auto\">Once you look for it, the fingerprints are everywhere:</p><ul dir=\"auto\"><li data-preset-tag=\"p\"><p><strong>Design docs</strong> that specify the <em>what</em> without the <em>why</em>, because the <em>why</em> feels too obvious to write down.</p></li><li data-preset-tag=\"p\"><p><strong>Roadmap reviews</strong> that describe trade-offs at the resolution they were made, not the resolution the audience needs.</p></li><li data-preset-tag=\"p\"><p><strong>OKRs</strong> written at the altitude leadership negotiated, not the altitude the team has to execute at.</p></li><li data-preset-tag=\"p\"><p><strong>Estimates</strong> sized by the person who already knows the answer, not the person who has to find it.</p></li></ul><p dir=\"auto\">This isn’t individual failure—it’s structural. A transmission problem that worsens with seniority.</p><h2 dir=\"auto\">It Looks Like a People Problem</h2><p dir=\"auto\">From the leader’s seat, the curse looks like a people problem—the team isn’t aligned, they lack urgency, juniors don’t grasp the scope. Each diagnosis triggers a predictable intervention: more all-hands, better OKRs, tighter performance management. The mechanism stays invisible.</p><p dir=\"auto\">Even strong communicators hit this wall. 
Fluency hides the failure.</p><p dir=\"auto\">The diagnostic question isn’t <em>did I explain it well?</em></p><p dir=\"auto\">It’s <em>what context made this conclusion obvious to me—and have I made that context available—not summarized, available—to the people I expect to act on it?</em></p><h2 dir=\"auto\">What Actually Helps</h2><p dir=\"auto\">The fix isn’t humility theater or “explaining better.” It’s a small set of mechanisms that compensate for a bias you can’t will away.</p><p dir=\"auto\"><strong>Make rejected alternatives part of the design.</strong> A design doc that presents only the chosen approach compounds the curse—reviewers must reverse-engineer your reasoning before they can challenge it. Require an “Alternatives Considered” section with at least two rejected options and why each lost. Not “performance”—the load profile, the benchmark, the threshold. Rejections teach more about the constraint space than the recommendation.</p><p dir=\"auto\"><strong>Treat “operator error” as a finding, not a conclusion.</strong> When an incident review lands on “the on-call missed it,” that’s not a root cause—it’s the curse of knowledge in the postmortem. Investigators now have the full timeline; the operator at 2 a.m. had whatever the dashboard showed and the runbook said. Ask what the system surfaced at the decision point—and what it hid. The answer usually points to a runbook or dashboard written for a simpler system. If you keep landing on operator error, you’re not investigating; you’re confessing.</p><p dir=\"auto\"><strong>Send the doc to someone one level less marinated.</strong> Before a design doc or proposal goes wide, hand it to a senior engineer with enough context to engage—but not enough to fill in your gaps. 
The questions they ask are the ones everyone else has but won’t raise in the larger room.</p><h2 dir=\"auto\">Design Around the Asymmetry</h2><p dir=\"auto\">Leaders who handle this well don’t pretend they share the team’s context.</p><p dir=\"auto\">You can’t unknow what you know. The curse isn’t a flaw in your communication—it’s a side effect of becoming the person the org needed. The work isn’t to roll back expertise; it’s to stop assuming the meaning is transmitting just because you’re using words.</p><p dir=\"auto\">If your team keeps not getting it, the question isn’t whether they’re listening.</p><p dir=\"auto\">It’s whether what reached them was the song—or just the tapping.</p>",
            "url": "https://adams.io/blog/the-curse-of-knowing-what-you-know",
            "title": "The Curse of Knowing What You Know",
            "summary": "The more it makes sense to you, the less it reaches them.",
            "date_modified": "2026-05-01T00:00:00.000Z",
            "tags": [
                "leadership"
            ]
        },
        {
            "id": "urn:sha256:521957853d7dc47c34454e5d8d5b187a2ac157982acae3b02f2f69facdaef71a",
            "content_html": "<p dir=\"auto\">I was in a planning review a few years ago where four of us decided to consolidate two services into one. One of us—a staff engineer who knew both systems from the inside—raised an objection. The rest of us rolled it into a footnote and moved on.</p><p dir=\"auto\">Eight weeks into the work, we unwound it. The objection turned out to be the load-bearing constraint. We had treated it as a detail; he had treated it as a wall.</p><p dir=\"auto\">This isn’t a war story. It’s the ordinary operation of most engineering organizations, including the ones I’ve helped lead. Decisions are made by people who don’t have the information: sometimes because the information holder isn’t in the room, sometimes because they are but their voice doesn’t carry the weight the decision actually requires. The ground truth catches up afterward, usually as the explanation for why the decision didn’t work.</p><h2 dir=\"auto\">The Information-Authority Gap</h2><p dir=\"auto\">Engineering org charts were designed for accountability flow, not information flow. The two don’t match, and as organizations scale, the gap widens.</p><p dir=\"auto\">The pattern is consistent. A senior engineer who’s worked on a service for three years knows things no director can know: the production quirks, the dependencies that aren’t documented, the customer-specific behavior nobody ever wrote down. The director has authority over decisions about that service. Information travels up through standups, Jira roll-ups, weekly summaries, and occasional skip-levels. Each hop is a copy-with-loss step. By the time the decision comes back down, it has the shape of correctness without the substance.</p><p dir=\"auto\">Most engineering leaders feel this gap. Few of them have a vocabulary for it. 
The default explanation is “we have a culture problem” or “we need to empower the team more.” Both diagnoses are downstream of something more structural.</p><h2 dir=\"auto\">It’s Not an Empowerment Problem</h2><p dir=\"auto\">It’s two structural problems that have to be solved together.</p><p dir=\"auto\"><strong>Decision-rights allocation.</strong> Authority needs to rest with those who have the relevant knowledge.</p><p dir=\"auto\"><strong>Aligned incentives.</strong> Those same people need to be pulling for the broader organization’s outcomes—not just their team’s metrics or their patch of the system.</p><p dir=\"auto\">Solve only one, and you create new failure modes. Pure decentralization without alignment produces local optimization that hurts the whole. The senior engineer spends a sprint on the elegant refactor without seeing the delivery commitment it blocks. The team ships what makes on-call easier instead of what the customer needs. Pure centralization without information produces fast, confident, wrong decisions. That second pattern is where most engineering orgs are actually stuck.</p><p dir=\"auto\">The fix isn’t rhetorical—it’s diagnostic. For each significant decision, ask where the relevant knowledge lives and whether that’s also where the authority sits. When the two map cleanly, decisions get faster and better at the same time. 
When they don’t, you get the consolidation-room scene above, on repeat.</p><h2 dir=\"auto\">The Visible Signatures</h2><p dir=\"auto\">In engineering orgs, the gap leaves fingerprints:</p><ul dir=\"auto\"><li data-preset-tag=\"p\"><p><strong>Architectural decisions</strong> made by leaders who haven’t read the on-call runbook.</p></li><li data-preset-tag=\"p\"><p><strong>Vendor selections</strong> made by people who haven’t tested the vendor.</p></li><li data-preset-tag=\"p\"><p><strong>Priority calls</strong> made from dashboards instead of from the teams running the systems.</p></li><li data-preset-tag=\"p\"><p><strong>Scoping decisions</strong> made by managers two levels removed from anyone who’d implement them.</p></li><li data-preset-tag=\"p\"><p><strong>Migration plans</strong> approved by reviewers who haven’t touched the legacy system in eighteen months.</p></li></ul><p dir=\"auto\">None of these are pathologies in isolation. The pathology lies in treating them as exceptions instead of patterns. Once you start looking, the pattern is everywhere.</p><h2 dir=\"auto\">What <em>Turn the Ship Around</em> Actually Solved</h2><p dir=\"auto\">David Marquet’s book gets quoted for its slogan: “I intend to…” That’s not the point. The mechanism is.</p><p dir=\"auto\">The protocol does both jobs at once. Authority moves to where the knowledge lives—the officer states what they intend to do, and the captain retains override as a backstop that rarely gets exercised. Accountability stays exactly where it has to be—the officer who proposed the action owns the outcome.</p><p dir=\"auto\">Most leaders cite the spirit and skip the mechanism. The spirit is easier to cite and harder to install. The mechanism is the opposite—a specific protocol you can run in Tuesday morning’s architectural review. Stop accepting requests for permission. Start expecting statements of intent with the relevant context attached. 
The captain doesn’t ask the bridge officer for sonar contacts; the bridge officer brings them with the proposed action.</p><h2 dir=\"auto\">The Audit</h2><p dir=\"auto\">A diagnostic to take into your next leadership meeting:</p><p dir=\"auto\">Pick three significant decisions your organization made last quarter. For each, ask two questions.</p><ol dir=\"auto\"><li data-preset-tag=\"p\"><p>Where did the relevant knowledge live versus where was the decision actually made? Measure the distance in whatever currency fits—reporting layers, time zones, hops.</p></li><li data-preset-tag=\"p\"><p>Whose outcomes did the decision optimize for—the system’s customer, or the deciding team’s convenience?</p></li></ol><p dir=\"auto\">The first reveals the allocation gap; the second reveals the alignment gap.</p><p dir=\"auto\">If the gap is consistently large, you don’t have an empowerment problem. You have a decision-rights allocation problem. It’s costing you in rework, in rollback cycles, and in the senior engineers who stopped fighting and started updating their LinkedIn profiles.</p><p dir=\"auto\">Closing the gap means moving authority toward the knowledge—and ensuring the people now holding it are pulling for the whole organization, not just their patch. Decision-rights allocation without aligned incentives merely relocates the dysfunction.</p><p dir=\"auto\">Your organization is already allocating decision rights. The question is whether you’ve allocated them deliberately or just inherited the allocation from the org chart.</p><p dir=\"auto\">You don’t have a decision problem. You have a placement problem.</p>",
            "url": "https://adams.io/blog/where-your-decisions-actually-get-made",
            "title": "Where Your Decisions Actually Get Made",
            "summary": "Your org chart shows who’s accountable. It doesn’t show who actually knows.",
            "date_modified": "2026-04-25T00:00:00.000Z",
            "tags": [
                "organization"
            ]
        },
        {
            "id": "urn:sha256:d8c61a4bd6fa0d7d042cb1e6fd798b0ef0f3b59934ccc1476320e5b64824f66f",
            "content_html": "<p dir=\"auto\">I once watched a quarterly roadmap get approved that assumed—without anyone saying it—three critical systems wouldn’t fail during the same two-week window. No one modeled it. No one raised it. It was just the bet the organization was making by not discussing it.</p><p dir=\"auto\">Every engineering organization carries reliability bets. Most haven’t written them down. The roadmap is visible. The risk allocation behind it isn’t.</p><p dir=\"auto\">I’ve worked inside enterprise platforms where that gap led to real costs—not from negligence but because the reliability conversation never reached planning. It appeared only in post-mortems after the incident.</p><h2 dir=\"auto\">The Binary Trap</h2><p dir=\"auto\">Many organizations treat reliability as an all-or-nothing proposition: the system works or it’s down. That creates two pitfalls: over-investing everywhere, draining budgets and slowing delivery, or delaying action until a public incident demands a response.</p><p dir=\"auto\">Southwest Airlines’ December 2022 meltdown was a portfolio allocation failure that manifested in technology. Their crew scheduling system had been flagged internally for years, but leadership kept choosing to invest elsewhere. The bet was implicit. When winter weather hit, the portfolio collapsed, and the DOT fined them while they absorbed nearly $1B in losses.</p><h2 dir=\"auto\">Tiered Allocation</h2><p dir=\"auto\">The question isn’t whether to invest in reliability. It’s whether you’ve made that allocation explicit or let it follow whoever last controlled the budget and whatever broke most recently.</p><p dir=\"auto\">I frame this in three tiers. Not as an SRE framework, but as a planning lens.</p><p dir=\"auto\"><strong>Structural systems</strong> are those where failure isn’t just an incident. It’s existential. Payment processing, authentication, and data integrity qualify here. 
These receive deep investment: redundancy, automated failover, rigorous testing. You don’t negotiate these down. In regulated environments, compliance typically makes systems structural by default. If data is financial or PII, regulatory overhead supersedes what engineering might otherwise decide.</p><p dir=\"auto\"><strong>Elastic systems</strong> tolerate some degradation when managed well. Search can be briefly stale, a recommendation engine can return defaults, and a notification can be delayed. Invest so these fail gracefully but not so much that they never fail. Design the degradation path rather than hoping one appears.</p><p dir=\"auto\"><strong>Disposable systems</strong> are those where failure is cheap and recovery is fast. Internal tooling, experiments behind feature flags, and batch jobs that can be re-run all fit here. By design, these get minimal reliability investment, which is appropriate if deliberate.</p><p dir=\"auto\">These tiers classify systems, not teams, and they’re dynamic. Major reliability failures often occur when a system quietly shifts tiers. A tool built on a Friday gets adopted by customer support, baked into a daily workflow, and becomes load-bearing without engineering ever knowing. When it breaks, you discover it was structural all along. Maintaining the portfolio means recognizing when a system has drifted before an incident proves it.</p><p dir=\"auto\">Many organizations struggle here: they invest as if everything is structural but treat everything as disposable when budgets shrink. Since the portfolio remains implicit, risk allocation is set by whoever is most afraid—not by whoever has the best information.</p><h2 dir=\"auto\">Operational Readiness</h2><p dir=\"auto\">I learned this lens long before software. 
In the Army, operational readiness isn’t asking, “Are we ready for anything?” It’s, “What are we prepared for, what aren’t we, and does leadership understand?” Units report status across personnel, equipment, and training. Commanders deploy based on known gaps rather than a single green, yellow, or red. You’d never send a unit downrange without knowing where it stands.</p><p dir=\"auto\">Yet many engineering organizations approve quarterly roadmaps without this conversation about systems—not out of disregard, but because no one built the mechanism.</p><h2 dir=\"auto\">Making It Legible</h2><p dir=\"auto\">The mechanism doesn’t need to be complex. It needs to translate reliability into the language the business already uses.</p><ul dir=\"auto\"><li data-preset-tag=\"p\"><p><strong>Revenue protection</strong>: What do you risk if this system fails during peak? If the payment gateway goes down during holiday traffic, the exposure isn’t theoretical: it’s dollars per minute in lost transactions.</p></li><li data-preset-tag=\"p\"><p><strong>Recovery cost</strong>: What does it take in engineering time, customer goodwill, and contractual standing to recover? A four-hour outage on a structural system can cost a weekend of work, a flood of support tickets, and a difficult conversation with a client whose SLA you just missed.</p></li><li data-preset-tag=\"p\"><p><strong>Opportunity cost</strong>: What won’t get built while keeping legacy systems alive? If a team spends 30% of its capacity nursing a service that should have been replaced two quarters ago, that’s not a maintenance line item. It’s a feature that never shipped.</p></li></ul><p dir=\"auto\">When I frame reliability in these business terms instead of latency or error budgets, the conversation shifts. You’re no longer pitching uptime to people who assume uptime is free. 
You’re pitching risk-adjusted investment to people who think in portfolios.</p><p dir=\"auto\">The artifact that makes this real is simple: a one-page summary with four columns (system, tier, exposure, and the cost to close the gap). It’s something you bring to a quarterly business review, not a dashboard engineers stare at in isolation. If it doesn’t fit on one page, you’re overcomplicating it.</p><h2 dir=\"auto\">The Bet You’re Already Making</h2><p dir=\"auto\">Your organization already has a reliability portfolio. It’s either explicit—documented, defended, reviewed—or implicit, shaped by inertia, recency bias, and whoever made the most noise last quarter.</p><p dir=\"auto\">Making the portfolio explicit won’t prevent all failures. But when something breaks, the organization learns from documented decisions instead of excavating Slack threads to figure out who decided this system didn’t matter.</p>",
            "url": "https://adams.io/blog/reliability-is-a-portfolio-decision",
            "title": "Reliability Is a Portfolio Decision",
            "summary": "You’re already choosing where your systems fail. The question is whether you meant to.",
            "date_modified": "2026-04-13T00:00:00.000Z",
            "tags": [
                "reliability"
            ]
        },
        {
            "id": "urn:sha256:2d64ab720b74ea609d8f04acb1ca714ce104e54f4448d6c2e0edae89ab2abdcb",
            "content_html": "<p dir=\"auto\">It’s Tuesday morning. You’re in a 1:1. One of your direct reports is asking about leading the API redesign next quarter. Is it a good growth opportunity? It’s a reasonable question. You give a reasonable answer.</p><p dir=\"auto\">What they don’t know is that you spent an hour last night on a call with HR, finalizing the conversation that will change everything for them by the end of the week.</p><p dir=\"auto\">Every engineering leader eventually carries this weight: the gap between what you know about someone’s future and what they know about their own. The conversation itself gets all the attention: what to say, how to be direct with empathy. But it’s thirty minutes. The carrying period can last weeks or months of parallel operation. That’s where outcomes are actually shaped. How you manage it decides whether the person lands well, the team stays stable, and the organization stays clean.</p><h2 dir=\"auto\">The Clock You Didn’t Start</h2><p dir=\"auto\">Before the carrying period even starts, most leaders have already lost weeks to a quieter failure mode: hope as strategy. “One more quarter.” “Let’s see if the new support structure helps.” “I don’t want to overreact.” By the time you’ve committed to the decision, you’ve spent more time convincing yourself to act than the process itself requires. The carrying period doesn’t start when the decision is made. It starts when you first suspected and chose not to force clarity.</p><h2 dir=\"auto\">The Bureaucratic Reality</h2><p dir=\"auto\">Let’s be honest about the timeline. Sometimes it’s a PIP that HR and legal require before anything moves forward. Sometimes it’s a reorg that’s been approved but not announced. Sometimes it’s a client-driven rolloff where the decision originates entirely outside your org. The trigger varies. The carrying period doesn’t.</p><p dir=\"auto\">Each version has its own mandated timeline, and those timelines exist for good reasons. 
The failure mode isn’t the mandated timeline. It’s the discretionary delay you add on top of it. The process is complete. The path is clear. And you push it one more week because the timing feels off, or you want one more data point, or Friday just seems too soon. That padding isn’t diligence. It’s the discomfort of acting disguised as thoroughness.</p><p dir=\"auto\">Every day of discretionary delay is a day they commit to projects, turn down recruiter calls, and make plans based on the absence of bad news.</p><h2 dir=\"auto\">Where the Time Goes</h2><p dir=\"auto\">Once you’re in it, the carrying period has failure modes that show up in your day-to-day behavior. Each one is well-intentioned. Each one makes the outcome worse.</p><p dir=\"auto\"><strong>Unconscious withdrawal.</strong> You pull back without realizing it. Your 1:1s get shorter. You stop engaging with their proposals. You stop asking about career goals because it feels dishonest. They notice the shift before they understand it, and by the time the conversation happens, it lands as confirmation of something they already sensed. That turns a recoverable transition into a betrayal.</p><p dir=\"auto\">This is different from a deliberate strategy. Some leaders intentionally create space for someone to read the room and self-select out. A calculated move with a defensible—if uneven—rationale. But unconscious withdrawal isn’t strategy. It’s leakage. Know which one you’re doing.</p><p dir=\"auto\"><strong>Overcorrection.</strong> Guilt pushes you the other direction. You’re warmer, more available, more encouraging than usual. When the conversation arrives, the prior weeks become evidence against your credibility. They replay every positive signal and wonder what was real. The dissonance is worse than the news.</p><p dir=\"auto\"><strong>Performative normalcy.</strong> It sounds noble. But maintaining your exact normal rhythm is its own kind of dishonesty. 
If you know someone is rolling off the project in ten days, a deep-dive architectural review on Wednesday isn’t respect. It’s theater. Letting them take ownership of a complex initiative that someone else will inherit next week isn’t kindness. It’s an organizational liability.</p><p dir=\"auto\">Operational pragmatism isn’t the opposite of respect. Shifting someone toward lower-risk work during the transition window protects the team, the organization, and the person from owning a mess they won’t be around to resolve. The line is intent: adjusting scope to reduce blast radius is leadership. Disengaging because you’ve already written them off is negligence.</p><h2 dir=\"auto\">Managing the Window</h2><p dir=\"auto\">A few things keep the carrying period from degrading the outcome.</p><p dir=\"auto\"><strong>Have an operational partner.</strong> Not a therapist—someone who shares the coordination load and keeps your judgment honest. When you tell your HRBP, “I think I’m pulling back in our 1:1s,” and they say, “You canceled the last one with twelve minutes’ notice,” that’s a correction you can’t give yourself. The loneliness of carrying it alone is what drives most of the failure modes above.</p><p dir=\"auto\"><strong>Separate the emotional timeline from the operational one.</strong> You told them “great work on the postmortem” on Thursday morning and meant it. Thursday night, you’re drafting the transition plan and feeling like a fraud. The guilt says slow down. The guilt is wrong. You can feel conflicted and still execute cleanly. Discomfort isn’t a signal that you’re moving too fast. It’s a signal that you take the human cost seriously—one variable in the equation, not the equation itself. The equation is: does the person land well, does the team maintain velocity, and does the organization stay healthy.</p><p dir=\"auto\"><strong>Know when the carrying has to end.</strong> The process is approved. The path is clear. 
And you push it one more week because they just got back from PTO, or there’s a release on Wednesday. Each reason sounds humane. Each one is another day the person keeps investing in a reality you’ve already decided to change. Once the path is clear, the most respectful thing you can do is move.</p><h2 dir=\"auto\">The Thirty Minutes and the Weeks</h2><p dir=\"auto\">You’ve done this before. You’ll do it again. And the thing that will test you next time, same as last time, isn’t what you say in the room. It’s whether you managed the weeks before it well enough that the person has options, the team has continuity, and the organization doesn’t have a crater where a key person used to be.</p><p dir=\"auto\">That’s the harder part of the job. Not the conversation. The carrying. And by the time you sit down across from someone, the carrying has already decided how it lands.</p>",
            "url": "https://adams.io/blog/the-weight-before-the-words",
            "title": "The Weight Before the Words",
            "summary": "Everyone rehearses the conversation. Nobody warns you about the weeks before it.",
            "date_modified": "2026-04-03T00:00:00.000Z",
            "tags": [
                "leadership"
            ]
        },
        {
            "id": "urn:sha256:4453066bbd4a69c23317f7a44576275888e56ab2a0b82f72c4c5b39c4db977c6",
            "content_html": "<p dir=\"auto\">If your teams spend more time maintaining reports than building product, your reporting system isn’t supporting delivery—it’s competing with it.</p><p dir=\"auto\">Every Monday morning, the release lead pulled data from Jira, reshaped it in a spreadsheet, and built a slide deck—all to describe work that Jira already knew about. One week, the deck didn’t match a Confluence tracker maintained by another team. The conversation didn’t start with “let’s reconcile.” It started with “why isn’t this done?” Two hours later, she’d proven that finished work was finished—time that came straight out of the sprint.</p><p dir=\"auto\">The work was done. The system just couldn’t see it.</p><p dir=\"auto\">If that sounds familiar, it should. Most engineering organizations carry some version of this tax—hours lost every week to duplicating, reformatting, and defending information that already exists in the system of record. It compounds quietly. No single instance looks like a crisis, but across teams and sprints, the drag on delivery is real and measurable.</p><p dir=\"auto\">This isn’t a tooling issue. It’s a data flow problem—the path between the system where work happens and the systems where work gets reported has forked, and nobody maintained the bridge.</p><h2 dir=\"auto\">The Anti-Pattern: A Second System for the Same Work</h2><p dir=\"auto\">Someone needs visibility into delivery, so they build a report. Instead of deriving it from the existing workflow, they create a parallel layer—extra fields, labels, naming conventions, or documents that sit outside the day-to-day work.</p><p dir=\"auto\">Now engineers aren’t just doing the work. They’re maintaining a second system that describes it.</p><p dir=\"auto\">Over time, the reporting layer becomes the thing leadership sees. The workflow becomes something teams use locally. When the two drift—and they always do—the report wins. Work that’s finished shows up as missing. 
Teams are asked to explain gaps that don’t exist.</p><p dir=\"auto\">Two systems. Two truths. One credibility gap.</p><h2 dir=\"auto\">Why It Happens (And Why It’s Predictable)</h2><p dir=\"auto\">Workflows evolve over years. Fields get added and rarely removed. Reporting needs change faster than the underlying system can adapt. Workflow tools are built for the people doing the work, not for the people consuming reports about it. And the people who need visibility often aren’t the same people who know how to extract it from the system.</p><p dir=\"auto\">So they build something they can control—documents, decks, and dashboards outside the workflow. A Confluence page that mirrors what already lives in Jira. A recurring slide deck that takes hours to manually assemble by reshaping data that Jira already has.</p><p dir=\"auto\">Each workaround makes sense on its own. Together, they create a reporting layer that is manual, delayed, duplicative, and easy to get wrong.</p><p dir=\"auto\">Then the direction of effort flips. Instead of reports reflecting the work, the work starts bending to satisfy the reports. Engineers context-switch to maintain metadata that exists solely for a downstream consumer. The cognitive tax is real. The accuracy drops. The trust erodes.</p><p dir=\"auto\">There’s also a dynamic that rarely gets named: reporting flows upward, but the burden of maintaining it flows downward. The people who produce the reports often have the context to know they’re redundant—but not the leverage to push back. That asymmetry turns a reasonable request for visibility into a standing tax on the people closest to the work. 
Solving this requires leaders who recognize that the cost of a report includes the engineering time it displaces.</p><h2 dir=\"auto\">The Principle: Reports Follow the Work</h2><p dir=\"auto\">The reporting layer must be a read-only projection of the system where the work actually happens.</p><p dir=\"auto\">If a report depends on data that isn’t in that system, the answer isn’t to create a parallel process. It’s to fix the system so the data exists where the work is already being done.</p><p dir=\"auto\">Everything else is a workaround—and workarounds accumulate.</p><p dir=\"auto\">From that principle, a few practical shifts follow—each one learned the hard way.</p><h3 dir=\"auto\">Treat metadata like a contract, not a suggestion</h3><p dir=\"auto\">If a field is required for visibility, it should be enforced in the workflow itself—not something people are expected to remember. I’ve watched teams maintain an ad hoc labeling convention for months before anyone realized a third of the values were misspelled variants of each other. The dashboard looked authoritative. The data underneath was noise. If it matters enough to report on, it matters enough to validate at the point of entry.</p><h3 dir=\"auto\">Stop copying data into documents</h3><p dir=\"auto\">If a report requires manual transcription, it’s already out of date. Native dashboards and filters within your workflow tool can surface most of what teams manually replicate in static pages—in real time, from a single source. The barrier is usually familiarity, not capability. I’ve replaced multi-hour weekly deck-building rituals with a saved Jira filter and a shared dashboard that took an afternoon to configure. The data was better, and the team got Monday back.</p><h3 dir=\"auto\">Generate recurring reports instead of rebuilding them</h3><p dir=\"auto\">If something gets assembled the same way every cycle, it should be produced directly from the underlying data. 
If that’s not possible, either the data structure or the report format needs to change—not the overhead of producing it. The question to ask isn’t “how do we make this report easier to build?” It’s “why are we building it at all instead of reading it?”</p><h3 dir=\"auto\">Make the system queryable by the people who need answers</h3><p dir=\"auto\">Most shadow reporting exists because extracting the right view from the system feels harder than recreating it manually. That’s a solvable problem—but it’s a shared one. Engineering owns making the data clean and structured. Program management and ops own defining what they need to see. A few focused working sessions where both sides sit together and build the views will eliminate more duplication than any process document.</p><h2 dir=\"auto\">Start with the Work</h2><p dir=\"auto\">This isn’t about one function getting it wrong. Everyone involved is trying to create visibility. The question is whether that visibility is generated from the system where work is planned and completed, or maintained alongside it—and whether the people asking for it understand what it costs to produce.</p><p dir=\"auto\">Start with the work. Make it observable. Let everything else follow.</p>",
            "url": "https://adams.io/blog/stop-letting-the-tail-wag-the-dog",
            "title": "Stop Letting the Tail Wag the Dog",
            "summary": "When reports compete with delivery, the system meant to help is quietly doing harm.",
            "date_modified": "2026-03-26T00:00:00.000Z",
            "tags": [
                "architecture"
            ]
        }
    ]
}