A commonly held belief is that it’s best to batch work – to handle similar tasks in large, consolidated chunks. The notion makes intuitive sense. It allows you to focus on one activity at a time and avoid the so-called switching costs of moving between activities. But as with so many other instances of unverified intuition, this particular notion is flat-out wrong. Batching may avoid switching costs, but it greatly protracts flow time, which, in the long run, can end up being far more expensive. Which is why the Toyota Production System introduced the concept of heijunka – “leveling”.
By Reading This Post, You Will Learn:
- Why batching to avoid switching costs is a problematic solution
- The cost of batching
- Why the traditional Production → Alpha → Beta → Certification model essentially batches your entire game
- The consequences of this large scale batching
- Why it’s important to run QA along the way and focus on minimizing the cost of QA passes
Understanding The Impact Of Batching On Efficiency
Let’s start with physical products as we work to wrap our heads around this topic. Let’s imagine you have a factory, and you manufacture two items, a Red product and a Blue product, and each item requires two manufacturing activities, Step 1 and Step 2, which can each process one unit at a time. Adjusting each activity from processing Red to processing Blue, or vice versa, imposes a switching cost in terms of time and money.
In order to minimize your exposure to that switching cost, you batch process the items. You run 100 units of Red through Step 1 and then prep Step 1 for Blue units. You then start to run all 100 Red units through Step 2 to finish them while moving 100 Blue units through Step 1. Next, you prep Step 2 for processing Blue units, and reset Step 1 for Red. And on and on, ad infinitum:
So, everything’s all hunky dory. You’re getting a lot of inventory processed, and you’re avoiding switching costs.
Except there’s one really big problem you’re overlooking. By batching in this way, you are maintaining a perpetual amount of inventory in the system (the purple line):
There are two major problems here.
First, think back to the kanban post. The more stuff in the system, the longer the queues get in front of activities. And the longer the queues, the longer the actual flow time. This, in turn, pushes your actual flow time further and further away from your theoretical flow time. Which, consequently, means your flow time efficiency goes right down the crapper.
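To put rough numbers on that, here’s a minimal sketch of the two-step Red/Blue line. It assumes every step takes one time unit per item, that a batch only moves on once every unit in it is finished, and that setup time can be ignored; `batch_line_stats` and its parameters are names made up for this illustration. Inventory is estimated with Little’s Law (average inventory = throughput × flow time).

```python
def batch_line_stats(batch_size, unit_time=1.0, steps=2):
    """Flow time and average inventory for a line that moves whole batches.

    Hypothetical model: each step handles one unit at a time, takes
    `unit_time` per unit, and hands work on only when the whole batch
    is finished. Setup (switching) time is left out to keep this small.
    """
    # A unit has to wait for its entire batch at every step.
    flow_time = steps * batch_size * unit_time
    # Once the line is full, one finished unit comes out every `unit_time`.
    throughput = 1.0 / unit_time
    # Little's Law: average inventory = throughput * flow time.
    inventory = throughput * flow_time
    return flow_time, inventory


for size in (100, 10, 1):
    flow_time, inventory = batch_line_stats(size)
    print(f"batch={size:>3}  flow time={flow_time:>5.0f}  inventory={inventory:>5.0f}")
```

Under these assumptions, a batch of 100 leaves 200 units sitting in the line, and each unit spends 200 time units there. A batch of one brings both numbers down to 2.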
But there’s another reason that’s a little more subtle. And potentially more expensive.
Business operations run on something called “working capital” – this is the amount of money sunk into actually running a company from day to day. And the more money tied up in working capital, the less money you have to invest in other opportunities or return to shareholders.
And one of the largest components of working capital is inventory, both for manufacturing (components for building sellable inventory) and retail (inventory on the shelves).
Examples Always Help
Here’s a practical example. Let’s say you own a company that manufactures widgets, and that company’s cost of capital is 10%. Through inefficiencies, you have $10MM of excess inventory in the production line. This is inventory you don’t need to be able to maintain your current throughput. It’s just bloat that you could eliminate if you were more efficient. And at that cost of capital and that magnitude of bloat, you have an opportunity cost of $1MM per year.
In other words, if you could streamline your operations and free up that $10MM of working capital, you could apply it to other investments and reasonably expect to make $1MM of profit a year. This is why there is an entire field of study called “supply chain management” and why huge corporations spend millions of dollars forecasting demand. They want to have the minimum amount of inventory on hand to reasonably satisfy demand, and not a cent more.
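The arithmetic behind that figure is simple enough to sketch. The function name is my own invention; the $10MM and 10% inputs are the ones from the widget example.

```python
def yearly_opportunity_cost(excess_inventory, cost_of_capital):
    """Return the yearly return forgone by leaving capital tied up in inventory."""
    return excess_inventory * cost_of_capital


# Figures from the widget example: $10MM of excess inventory at a 10% cost of capital.
cost = yearly_opportunity_cost(10_000_000, 0.10)
print(f"${cost:,.0f} per year")
```

Halve the bloat and you halve the bill, which is why shaving inventory gets so much attention.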
Enter The Toyota Concept Of Heijunka (平準化)
If batching occurs because we want to avoid switching costs, then it follows that, to eliminate the need for batching, we should focus on reducing switching costs. This is what effective operations managers concentrate on. Your goal should be to have switching costs so low you can cost-effectively have batch sizes of a single unit.
Toyota refers to this notion as heijunka (literally, “leveling”) – putting an emphasis on keeping all work-in-progress inventory at a minimum level by avoiding batches. It doesn’t want to make 100 red Camrys, then 100 blue RAV4s, and then 100 black 4Runners. It wants to make 1 red Camry, then 1 blue RAV4, then 1 black 4Runner.
No queues. No batches. Every component piece spends the absolute minimum amount of time on the factory floor. Actual flow time moves towards theoretical.
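Here’s a small sketch of the difference between a batched and a leveled build order. It uses a plain round-robin, which is my simplification and not a description of how Toyota actually schedules its lines; a real leveled schedule would also weight the sequence by the demand mix. The function names and the demand numbers are made up for illustration.

```python
from itertools import zip_longest


def batched_sequence(demand):
    """Build every unit of one model before moving on to the next."""
    return [model for model, count in demand.items() for _ in range(count)]


def leveled_sequence(demand):
    """Round-robin through the models, one unit at a time."""
    runs = [[model] * count for model, count in demand.items()]
    # zip_longest pads the shorter runs with None, which we drop.
    return [
        model
        for build_round in zip_longest(*runs)
        for model in build_round
        if model is not None
    ]


demand = {"Camry": 3, "RAV4": 3, "4Runner": 3}  # hypothetical order book
print(batched_sequence(demand))
print(leveled_sequence(demand))
```

The batched order makes the last model wait behind every unit of the first two. The leveled order gets one of each model off the line after just three builds.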
The Overlap With Game Development
Great, Justin. Factories. Blah, blah, blah. Your point?
Here’s where this becomes relevant to game dev: we batch like crazy people. But the problem only becomes apparent depending on how you apply the word “done”.
In the game industry, we generally use the term “done” or “complete” with regard to a feature or user story to mean that we’ve coded that feature or user story and merged it into the build. Under that regime, there doesn’t seem to be any obvious batching. We code one feature, submit it, and then code the next. No worries.
However, if we change the definition of “done” from “coded and merged” to “ready to ship” the problem becomes more apparent.
If a feature isn’t done until it’s ready to ship – i.e., until it’s been run through the QA wringer – then our typical Production → Alpha → Beta → Certification sequence indicates a massive operational problem. If you are waiting until the end of your production schedule to perform dedicated QA testing and fixing, you don’t have “batches”. You just have a batch. One.
YOUR ENTIRE GAME:
We code a game’s worth of features (the yellow line), and accrue defects (the red line) at some multiple. Since the features have defects (and thus aren’t ready to ship) they reside as work-in-progress inventory. Then we go through the madness of post-production: we struggle to un-fuck our buggy house of cards until finally we throw our hands up and say “SHIP IT!”.
And, from an operations science perspective, that is pure, unadulterated lunacy.
The Consequence Of The Alpha/Beta/Cert Mentality
There are a couple of problems that a late QA cycle creates.
First: your flow time efficiency doesn’t just go down the tubes. You essentially negate it. By leaving QA until the end of the project, your actual flow time is so far removed from your theoretical flow time that your efficiency ratio is effectively zero.
The second critical issue is that leaving QA until the end of the project pays no heed to the time value of fixes. Bug fixes are cheapest when they are implemented as soon as possible. While leaving them to linger doesn’t guarantee that each and every defect will increase in scope, you do squander your ability to sort out issues before they fester.
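To put a rough number on the first of those problems, here’s a sketch of the flow time efficiency ratio (theoretical flow time divided by actual flow time) for a single feature. Every figure in it is hypothetical: a feature needing 10 days of hands-on dev and QA work, on a project that runs 500 days before dedicated QA wraps up.

```python
def flow_time_efficiency(touch_days, start_day, ship_ready_day):
    """Theoretical flow time (hands-on work) divided by actual flow time."""
    actual_days = ship_ready_day - start_day
    return touch_days / actual_days


# Hypothetical feature: 10 days of combined dev and QA work, started on day 0.
qa_at_the_end = flow_time_efficiency(touch_days=10, start_day=0, ship_ready_day=500)
qa_right_away = flow_time_efficiency(touch_days=10, start_day=0, ship_ready_day=10)

print(f"QA deferred to the end of the project: {qa_at_the_end:.0%}")
print(f"QA as part of the feature:             {qa_right_away:.0%}")
```

With those made-up numbers, the deferred-QA feature spends 98% of its life sitting in inventory waiting to become shippable.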
What’s To Be Done?
Move from a production/alpha/beta/cert mentality to a build→fix, build→fix, build→fix cadence. Make QA testing and hardening part of the definition of done, and part of the sequence of feature development. QA testing shouldn’t be the last stage of production; it should be the final step of development for every feature. Ideally, you want a single QA pass for every feature submission.
Then, focus on making the cost per QA pass (both in terms of money and time) as low as possible. As in the manufacturing example, you want to eliminate or minimize the switching cost of transitioning a feature from dev to QA.
You Want Us To Slow Down?!!
Any grizzled veterans reading this may balk at the notion of slowing down production to allow for parallel QA testing. Fair enough.
Except, I’m not advocating that you slow down. I’m advocating that you consolidate the work.
Rather than doing 80% of the story (the development) during production and the remaining 20% (the QA) during alpha or beta, put in 100% of the effort to get a feature ready to ship in one pass. Defragment your production process the same way you would defragment a PC hard drive.
The Impact Of Heijunka On Muda
Heijunka impacts two types of muda: excess work in progress (we’re pushing features through QA sooner so they are “done-done” faster) and work queues (we’re eliminating the months-long backlog of hardening work that accrues when teams defer QA to the end of production).
Where Do We Go From Here?
So, at this point in our journey through lean, we’ve taken the time to carefully spec out feature requests to eliminate the potential for human error (poka-yoke). We’re using kanban pull-based production to minimize flow time. We’ve put the robots to work on our behalf with jidoka-style autonomation. We’re using disciplined QA processes to reduce muda. And we are leveraging the concept of heijunka to avoid batching. We have all the tools in place to run an efficient, lean development cycle.
But this doesn’t mean we won’t have problems. Systems, code, and people will fail. And when that happens, we need to investigate why. More specifically, we need to investigate why five times in a row. So, in the next post, I’ll be covering root-cause analysis, known more colloquially as “the five whys”.
- Batching is a common practice to avoid switching and setup costs
- However, batching also increases flow times and opportunity costs, which is problematic
- The goal of effective operations is, therefore, to minimize those switching costs
- The traditional game development model of Production → Alpha → Beta → Certification is particularly problematic because it, in essence, batches the entire game into one large QA pass
- The goal should instead be a continuous Build → Fix cadence, with an emphasis on minimizing the cost and time per QA pass