About this series
I’ve been working on general purpose robots with Everyday Robots for 8 years, and was the engineering lead of the product/applications group until we were impacted[1] by the recent Alphabet layoffs. This series is an attempt to share almost a decade of lessons learned so you can get a head start making robots that live and work among us. Previous posts live here.
What is Tech Debt?
In my experience, there are two kinds of tech debt. The first comes from going fast and building something that cuts corners. This is the kind of tech debt that most people think of. It is a loan against the future to get something done now, and you pay it down (often with interest) when you have more resources. We’ll call this Cut-Corner tech debt, and it gets a bad rap, but it is often a good idea to take on some Cut-Corner tech debt in order to move fast.
“We can’t hardcode the location of every robot charging station into our source code! There should be a cloud! And a UI! And permissions! And ways for the robot to autonomously update them if they detect that they’ve moved!”
“There are only 3 charging stations in the whole world… and I finished hard coding them while you were talking.”
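To make the exchange above concrete, here is a minimal sketch of what the "hard coded" answer might look like. The station names, coordinates, and the `nearest_station` helper are all made up for illustration; the point is only that three literals in source can be a complete solution at this scale.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ChargingStation:
    name: str
    x_m: float  # position in a shared map frame, in metres (hypothetical convention)
    y_m: float


# Hypothetical values: the whole "database" is three literals in source.
CHARGING_STATIONS = (
    ChargingStation("lobby", 2.0, 1.5),
    ChargingStation("kitchen", 14.0, 3.0),
    ChargingStation("lab", 7.5, 22.0),
)


def nearest_station(x_m: float, y_m: float) -> ChargingStation:
    """Return the station closest to (x_m, y_m) by straight-line distance."""
    return min(
        CHARGING_STATIONS,
        key=lambda s: math.hypot(s.x_m - x_m, s.y_m - y_m),
    )
```

If a fourth station shows up, you add a line. If the list ever needs a cloud and a UI, replacing a tuple this small is a cheap refactor.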
Then there is the other kind of tech debt. This one comes from having a problem, sitting down and imagining a system that could solve that problem, writing down all the other problems the system could be extended to solve, and then building that system. It feels really good because you are building something beautiful and future proof and extensible and scalable and all the things that make engineering satisfying.
“If I modify it just a little bit, I think I can make my type system do SLAM and also function as a working ML accelerator”
But you only have experience with the one problem that you started with, and probably still don’t understand it well.[2] And odds are high you will never ever need any of the other features you imagined. You’ll end up with a large, very complicated system that solves a bunch of problems you don’t have, but is hard to change when you learn that you misunderstood the original problem. We’ll call this Grand-Framework tech debt.[3]
“I finished my Universal Task Execution Framework. Now we can use it!”
“Great! Can you get the robot to pick up one of these two cans?”
“Um, well, no, UTEF assumes a single perception context object. Hmm. Maybe I could create an Orchestration Framework for managing multiple UTEFs? Oh, but the cloud optimization depends on a singleton… Umm, let me get back to you in a few weeks.”
YAGNI vs DRY: Fight!
Engineers overestimate the odds they will need a feature and underestimate the effort to refactor a small system.[4] This is why people still have to go around yelling, “You Ain’t Gonna Need It” (YAGNI). Because deep in our hearts we are sure that we will need it, even though we almost never have in the past. Another engineering adage, “Don’t Repeat Yourself” (DRY), can lead you astray, because it seems to say "always make an abstraction".[5] Don’t be so afraid to repeat yourself that you build a big system just in case.
In DRY vs YAGNI I lean toward YAGNI every time.
That's because the first flavor of tech debt is so much nicer. It is cheap to acquire (that's the whole point of cutting corners) and cheaper to clean up (because it's small and you don’t hurt anyone's feelings/career if you delete it). Whereas the second flavor of tech debt is very expensive to acquire AND expensive to remove.[6]
Scar Tissue
As companies grow and mature they develop what I call ‘scar tissue’. This comes from something bad happening (like an outage) followed by a post-mortem (ooh, an intern changed a config) and a policy change (config changes need to be peer reviewed). Each bit of scar tissue is a good idea, but the net effect is a growing continuous drag on getting things done.[7] Folks coming from big companies sometimes see the cut-corners in the ancient parts of the code base and know the pain and scarring they caused and think, “We’ve learned better. That cut-corner-code sure caused a bunch of headaches as we scaled. I won’t make that mistake: we’ll start with scalable stuff right off the bat.” I suspect that the old code at successful companies is full of cut-corner tech debt because that is how those companies moved fast enough (and learned fast enough) to become successful. And the startups that built beautiful abstractions from the beginning starved from lack of progress and never got to hire a bunch of engineers to admire their beautiful architecture.
“Three years in, I have invented the perfect programming language to write my startup’s code in.”
“What’s the product?”
“Well, nothing, actually. We ran out of money a year ago.”
TLDR?
Some tech debt comes from doing a sloppy job on execution, some comes from doing a sloppy job on understanding the problem. Great execution on the wrong problem is way worse in the short and long term, which is why doing rough, throw-away work in order to fully understand the problem is worth it even though it feels bad.
What does this have to do with robots?
When people start working on general purpose robots there is a tendency to try to make sure that everything they do is Fully General Purpose And Future Proof. They have grand visions and world changing ambitions. So it seems like they should be building grand software to match. This pushes folks really hard towards the expensive kind of tech debt. Here are some sniff tests to keep an eye out for:
Can We Make It More General Purpose?:
You are facing a problem for the first time and you propose a simple solution. A fellow engineer (who’s otherwise a smart, well meaning, lovely human being) asks, “Yeah, but is that general purpose enough?” You are in the danger zone for Grand-Framework tech debt. Stick to your guns and generalize only to problems you have right now.
Cart Before the Horse:
If you are building a generic capability and go hunting for a problem to use it on (a problem you otherwise wouldn’t need to solve in order to build your MVP) because it makes your platform feel more “general purpose” you are in the danger zone for Grand-Framework tech debt.
“Good news, I got our customer to agree to spread rubble in parts of their warehouse so we can practice walking on uneven terrain!”
Building for Someone Else:
If you are considering building a library or framework to support a workflow or help solve a problem that you’ve never solved yourself you are in the danger zone for Grand-Framework tech debt.[8]
This is for “Non-Roboticists”:
This is a special case of “Building for Someone Else” that seems to pop up a lot for folks interested in general purpose robots. The impulse is good: making robots do things is hard and it would be good if it was easier. The problem here is that by framing it as “accessible to non-roboticists” you encourage frameworks and libraries that try to hide the complexity of the problem inside of magic. Folks propose APIs like this:
robot.grasp("apple")
robot.place_into("basket")
Which is great until “grasp” doesn’t work for you or do exactly what you’d want. It’s like all those web-app-frameworks that make for great demos in a hackathon (look, you can make Twitter in 4 minutes) but no one actually builds real applications out of them. They let you get to 80% in a few minutes but have no path to 100% because you can’t control all the things you need to. Instead, make programming robots productive for folks as smart and capable as you, but who are less patient for stupid bullshit. That’s how to make a beautiful API.
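One way to read "a path to 100%" is that the convenient call should be a thin composition of pieces you can also call and replace yourself. The sketch below is only an illustration of that idea, not an API from Everyday Robots or any real robotics library: every name in it (`locate`, `plan_top_grasp`, `execute`, `GraspPlan`) is hypothetical, and perception and motion are faked with a dict and a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Pose = Tuple[float, float, float]  # x, y, z in metres (hypothetical convention)


@dataclass
class GraspPlan:
    target: Pose
    approach_height_m: float
    grip_force_n: float


def locate(scene: Dict[str, Pose], label: str) -> Pose:
    """Stand-in for perception: look the object up in a dict."""
    return scene[label]


def plan_top_grasp(target: Pose, approach_height_m: float = 0.10,
                   grip_force_n: float = 20.0) -> GraspPlan:
    """Default planner: approach from above with a fixed grip force."""
    return GraspPlan(target, approach_height_m, grip_force_n)


def execute(plan: GraspPlan) -> str:
    """Stand-in for motion execution: describe what would happen."""
    return (f"grasp at {plan.target} from {plan.approach_height_m:.2f} m "
            f"with {plan.grip_force_n:.0f} N")


def grasp(scene: Dict[str, Pose], label: str,
          planner: Callable[[Pose], GraspPlan] = plan_top_grasp) -> str:
    """Convenience wrapper: just the three public steps above, nothing hidden."""
    return execute(planner(locate(scene, label)))


scene = {"apple": (0.4, 0.1, 0.02)}
default_result = grasp(scene, "apple")
# When the default squashes the apple, swap one step instead of forking the framework.
gentle_result = grasp(scene, "apple",
                      planner=lambda t: plan_top_grasp(t, grip_force_n=5.0))
```

The one-liner still exists for the demo, but when it does the wrong thing you can drop down a level and change exactly the step that is wrong.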
Classic “You Ain’t Gonna Need It”:
If someone says, “We don’t need this now, but I’m sure we’re going to need it eventually and it's going to take a very long time to build so we should start now” you are in the red-hot alarms-blaring danger zone for Grand-Framework tech debt.
How to have the good kind of tech debt
The way out of this, culturally, is to acknowledge that no-one (including you) knows how to build general purpose robots that work. And acknowledge that most things are going to be wrong the first few times. The shortest, best path to general purpose components is to reject DRY and repeat yourself, implementing single purpose, application specific, feels-unscalable-and-throwaway behaviors.
Then, once you have two or three of those under your belt, look at where you spend a bunch of time-that-doesn’t-feel-productive and pull those out into a library or RPC. Keep things composable, opt-in and a-la-carte so it's easy to stop using the parts you don’t like. My favorite pieces of software are the second or third version of libraries written by people who were scratching their own itch.
My Favorite Acronym: TSTTCPW
Always strive to build The Simplest Thing That Could Possibly Work. Let future requirements be covered by future refactoring. You’ll still end up with tech debt (all projects get tech debt), but you’ll end up with mostly the cheap kind and avoid the expensive kind.
"A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work."
– John Gall
[2] I know I don’t feel like I really understand a problem until I’ve solved it at least once, probably twice.
[3] Not all frameworks are tech-debt and not all over designed, overcomplicated systems are frameworks, but the two do seem to spend a lot of time together.
[4] If you don’t believe me, maybe you’ll believe John Carmack in tweet form.
[5] There are other, healthier, ways to interpret DRY. It’s important to have a single source of truth for certain kinds of information, for example. That’s all gravy. And if you are disciplined you can avoid repeating yourself while also keeping your abstractions small and nimble, but DRY is often used to excuse big clunky abstractions.
[6] I did a survey on tech debt among the software team at Everyday Robots and had folks list sources of tech debt, how much it slowed them down and how much effort it would take to clean up. We found we had slightly more instances of cut-corner tech debt, but were slowed down slightly more by grand-framework tech debt. But the real kicker was the grand-framework tech debt was estimated to take 5x more effort to clean up.
[7] Drag is OK if you are already winning. If you are doing well you should be more conservative about thrashing around. No one doing general purpose robots has the luxury of being conservative because no one has it figured out yet. Scar tissue seems like one of the big mechanisms that makes big companies slow. I was shocked at the slowdown going from a startup (12 employees) to Facebook (4k employees in 2012) and then to Alphabet (50k employees in 2014). I don’t know how to reverse the slowdown from scar tissue. It feels crazy to say, “I know we had an outage, but let's not add any more process to prevent it next time, because it slows us down 1%”. But empirically, those 1% slowdowns add up. It does seem worth it to try to resist the accrual of scar tissue as long as you can, as Facebook’s famous “Move Fast and Break Things” tried to do.
[8] If you do have to do this, my advice is to try to solve the problem and write your library at the same time, going back and forth between them. Write the library as you need bits of it, and err on the side of a smaller library and a bigger application. You can always move more of the application into the library later if you need that logic somewhere else.
Something comes through to me as contradictory. In "What is tech debt?" you enumerate type 1: fast, cut corners, and type 2: grand framework. In "Yagni vs. DRY: Fight" you confirm the enumeration and assert type 1 (fast, cut corners) is cheap to remove, doesn't hamper careers, whereas the 2nd flavor is expensive to acquire and remove. But then in note 6 you say your team estimated the cut-corner debt to be 5x more expensive to remove. This seems to contradict your assertion that it's fast and cheap to remove. What am I missing?
Great series - I've read them all & really enjoy them. I feel the pain and have lived it. Thanks!
Great post as usual Benjie!
One thing I've learned to embrace along these lines is to start with spaghetti-code, and organize from there. Starting from the assumption that you won't get it right until you've tested and iterated a bunch of times, a lot of up-front "framework design" is just going to be wasted or worse, getting you over-committed to bad design decisions. The bad thing about spaghetti-code (i.e., more or less one big function that does a bunch of stuff) is mostly that if you come back to it after some time, it becomes impossible to make heads or tails of it. But freshly-written spaghetti-code is really flexible and easy to change (e.g., swapping code blocks around, adding new variables or config values, etc.). So, a process I sometimes recommend is to write the quick-and-dirty spaghetti-code that works, start immediately testing it out and iterating, and as it stabilizes, you organize the code. The abstractions or APIs needed tend to emerge naturally (e.g., what group of variables or configs are often used together, what data is config / state / scratch-space / IO, etc.). One counter point I've heard is "what if you can't test or integrate it right now?", to which I would reply, "if so, then, why are you writing it now?".
On the Grand Framework thing, a common thing to hear people say is: "we might not need this feature right now, but if I don't design the framework with this feature, it will be impossible to add it later". Generally, IMHO, that just points to design issues with the framework, usually a lack of composability or transparency. And as you said, it can often be a DRY thing too. A simple example: say you need a single-producer-single-consumer (SPSC) thread-safe queue. You could write it to be more general, knowing that if you needed an MPMC queue at some point, you'd probably end up having to write a new class (to avoid changing your heavily used SPSC queue) and "repeat-yourself" a lot. It's almost a no-brainer that a very simple SPSC is preferred to a much harder MPMC, and if you design things in a reasonably flexible way, adding or switching to an MPMC later should not be that hard.
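To make the queue example concrete, here is a minimal sketch of a deliberately SPSC-only bounded queue. The class name and interface are hypothetical, and it is lock-based for clarity rather than lock-free. It leans on the single-producer, single-consumer assumption to stay small: with one thread on each side, a single condition variable and a plain `notify()` are enough.

```python
import threading
from collections import deque


class SpscQueue:
    """Bounded FIFO for exactly one producer thread and one consumer thread.

    With one thread per side, the producer (waiting on "full") and the
    consumer (waiting on "empty") can never both be waiting at once, so
    notify() always wakes the other side. With several producers or
    consumers that shortcut can lose wakeups; an MPMC queue would need
    more care (e.g. separate conditions or notify_all()).
    """

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = deque()
        self._capacity = capacity
        self._cond = threading.Condition()

    def put(self, item) -> None:
        """Block while the queue is full, then append the item."""
        with self._cond:
            while len(self._items) >= self._capacity:
                self._cond.wait()
            self._items.append(item)
            self._cond.notify()

    def get(self):
        """Block while the queue is empty, then return the oldest item."""
        with self._cond:
            while not self._items:
                self._cond.wait()
            item = self._items.popleft()
            self._cond.notify()
            return item
```

If an MPMC queue is ever needed, it can be a second small class behind the same `put`/`get` interface, which is the "repeat yourself a little" trade described above.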
I guess another way to put it is that imaginary features should have imaginary implementations, not concrete ones. I think it's reasonable to have TODOs and other comments in the code that discuss pieces of code or API decisions that are as they are in case an imaginary feature is needed, so it remains easy to add or to find where changes are needed. In other words, there is an imaginary implementation for an imaginary feature. What is serious tech debt is when you have concrete implementations for features that aren't used (or barely used), especially when it encumbers a lot of the details of the code. A good framework is flexible, i.e., it allows for many imaginary features, while a grand bloated framework has a bunch of features baked in that are never quite the right mix for any user and with too much overhead (performance, maintenance, cognitive load, etc.).