Breaking The Wheel

Story Points for Feature Estimation – Game Planning With Science! Part 5

In Part 4 of “Game Planning With Science!”, I covered the central limit theorem, and how we can use it for forecasting feature development. At the end of the post I acknowledged that it’s no mean feat to track the time per individual feature without some heavy duty project management software and a team that is superlatively disciplined about tracking their time. In Part 5, I’m going to give you my favorite tool for getting around this problem: Story Points.

The article image for “Story Points for Feature Estimation” is from GraphicStock. Used under license.

Previously on “Game Planning With Science!”: Part 1 Part 2 Part 3 Part 4


By Reading This Post, You Will Learn:

  • The distinction between accuracy and precision and why it’s important to balance the two
  • The downsides of forecasting in discrete units of time
  • What a story point is and why it’s a superior form of estimation versus time
  • How to estimate a feature using story points

Accuracy vs. Precision

First off, it’s important to acknowledge that there is a difference between accuracy and precision. We tend to conflate those two terms, but they are actually quite distinct. Accuracy is an appraisal of whether an answer or measurement actually encapsulates the truth. Precision, on the other hand, is the magnitude of the answer or measurement’s margin of error.

Imagine a man who weighs exactly 200 pounds, no more, no less.

  • Saying that he weighs between 170 and 190 pounds is neither precise nor accurate
  • Meanwhile, saying that he weighs between 190 and 210 pounds is accurate, but not precise
  • Saying that he weighs 176.3334334433 pounds is precise, but not accurate
  • But saying he weighs 200.0000000 pounds is both accurate and precise
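
The four cases above can be written as a small check. This is only a sketch: the function name classify_estimate and the 1-pound cutoff for what counts as "precise" are assumptions made for illustration, not definitions from this post.

```python
def classify_estimate(low, high, true_value, max_width=1.0):
    """Label an interval estimate as accurate and/or precise.

    accurate: the interval contains the true value.
    precise:  the interval is no wider than max_width (an assumed cutoff).
    """
    accurate = low <= true_value <= high
    precise = (high - low) <= max_width
    return accurate, precise


TRUE_WEIGHT = 200.0  # the man from the example, in pounds

examples = {
    "170 to 190": (170.0, 190.0),
    "190 to 210": (190.0, 210.0),
    "176.3334334433": (176.3334334433, 176.3334334433),
    "200.0000000": (200.0, 200.0),
}

for label, (low, high) in examples.items():
    accurate, precise = classify_estimate(low, high, TRUE_WEIGHT)
    print(f"{label:>15}: accurate={accurate}, precise={precise}")
```

Accuracy depends only on whether the truth falls inside the estimate, while precision depends only on how narrow the estimate is. That is why the two can vary independently.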

Forecasting is a Balance Between Accuracy and Precision (but Accuracy is More Important)

Now, obviously, we’d all love to have the fourth scenario. But, when it comes to forecasting, there is an inverse relationship between accuracy and precision. Think back to the confidence interval formulas from Part 4. If you want a higher degree of confidence, you need to accept that the interval will be wider (less precise).

The only way to increase accuracy and precision simultaneously is to gather more data or reduce the variability of your process outputs (e.g., make the time per feature less variable).
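
To make that trade-off concrete, here is a rough sketch of how the width of a confidence interval responds to the confidence level and the sample size. It assumes the standard normal-based interval for a sample mean. The function name interval_half_width and the 6-day standard deviation are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interval_half_width(std_dev, sample_size, confidence):
    """Half-width of a normal confidence interval for a sample mean."""
    # z-score that splits (1 - confidence) evenly across both tails
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / sqrt(sample_size)


STD_DEV = 6.0  # hypothetical spread, in days per feature

for confidence in (0.80, 0.95, 0.99):
    for sample_size in (10, 40):
        half = interval_half_width(STD_DEV, sample_size, confidence)
        print(f"{confidence:.0%} confidence, n={sample_size}: +/- {half:.2f} days")
```

Raising the confidence level widens the window. Only a larger sample or a smaller standard deviation narrows it without giving up confidence.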

Given that we need to prioritize one or the other, I’ll go with accuracy over precision every time. The terms aren’t mutually exclusive. You need some degree of precision for forecasts to be meaningful. But I’m more concerned with being correct than being specific.

What’s Wrong With Being Specific?

Nothing. But the appropriate level of specificity depends on your needs and application. If you’re running physics experiments in a particle collider, or building a CPU, exacting precision is absolutely critical. But that kind of precision is expensive, time consuming, and leaves no room for error. General anesthesia, for example, requires a high degree of precision. Anesthesiologists ride a thin line between keeping a patient unconscious and killing her. Preparing someone to maintain that narrow margin of error is expensive, both in terms of training and equipment.

But precision is also relative. A carpenter only requires the precision level of measuring tape. Calculating the trajectory that will get the Juno probe to Jupiter – taking gravity, orbits, and time into account – is another matter entirely. You swap those levels of precision, and you have a carpenter who never works (because he’s too slow and WAAAAY too expensive) and North Korea’s missile program.

The goal then is to achieve a level of precision that allows you to make meaningful forecasts, and no more. And the definition of meaningful is also subjective over the course of a project. At the start of game development, estimates in terms of months, quarters, or even years are meaningful. In alpha and beta, on the other hand, a precision level of weeks or days is necessary.

The Danger of Forecasting with Discrete Units of Time

I am completely against estimating the development time of individual features in terms of discrete units (days, hours, etc). Experience has convinced me that it’s a fool’s errand.

Discrete units of time are a precision measurement

They prioritize specificity over accuracy, and don’t inherently account for margins of error. In order to account for that margin of error in your forecasts, you would need to estimate in terms of windows of time. And scheduling like that is nightmarish. I’m getting hives just thinking about it.

They trigger anchoring biases

Having a preconceived time estimate in mind triggers a cognitive bias known as anchoring. Anchoring makes it difficult to adapt to new information, such as when a feature turns out to be more complex than first imagined.

Time-based forecasts are complex

You need to know the average time per feature (again, hard to track). Further, you also need to make an assumption – or mandate – about the number of hours per week each team member will work. Then you need to map the upcoming features against that bucket of man-hours. This is not as simple as dividing the one by the other. If you have two developers, each with 40 available hours, and three features, each estimated at 30 hours, what do you do with that third feature?
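
The two-developer example can be sketched in a few lines. This sketch assumes a feature is never split across developers and uses a simple first-fit assignment. The function name whole_features_that_fit is hypothetical.

```python
def whole_features_that_fit(developer_hours, feature_hours):
    """Assign each feature to the first developer with enough hours left.

    Returns (assignments, unscheduled, leftover_hours). Features are never
    split across developers, which is what makes plain division misleading.
    """
    remaining = list(developer_hours)
    assignments = [[] for _ in developer_hours]
    unscheduled = []
    for index, hours in enumerate(feature_hours):
        for dev, free in enumerate(remaining):
            if hours <= free:
                assignments[dev].append(index)
                remaining[dev] -= hours
                break
        else:
            # No developer had room for this feature
            unscheduled.append(index)
    return assignments, unscheduled, remaining


developers = [40, 40]    # available hours per developer this week
features = [30, 30, 30]  # estimated hours per feature

naive = sum(developers) / features[0]
assignments, unscheduled, leftover = whole_features_that_fit(developers, features)

print(f"Naive division says {naive:.2f} features fit")
print(f"Scheduled per developer: {assignments}, unscheduled: {unscheduled}")
print(f"Stranded hours per developer: {leftover}")
```

Division suggests roughly 2.67 features, but only two whole features fit. Each developer is left with 10 hours that cannot absorb the third.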

They encourage bad management behavior

If you think in terms of discrete units of time, you will schedule in terms of discrete units of time. Say there are five features, each of which you’ve estimated at 1 day of work. You may be enticed to load all five onto one developer to give him his work load for the week. Now you’ve booked him at 100% of his capacity. He has no wiggle room if one feature is larger than expected, or his help is desperately needed by another developer during that week. His only option at that point will be to work late to make up hours or slip his deadlines, all of which increases your development variability.

They are time consuming

Estimating every feature in terms of hours is labor-intensive. In operations science terminology, it’s a non-value-adding activity. Gamers will not perceive increased value in your games because you took the time to estimate in days or hours. It provides no value to them. Some non-value-adding activities (e.g., meetings) are necessary in any organization. But you should keep them to a minimum. If a method of estimation is more expensive while providing reduced accuracy, then it’s not worth the time.

The time required per feature is subjective by person

A feature that would take a senior engineer or skilled specialist 2 days to code may take a junior coder or generalist 4 or 5. The time required to code something is therefore more subjective than an estimate of size or scope.

It’s difficult to compare the estimate to the actuality

You might have originally estimated that a feature should take 8 hours, but did it? What was the actual development time? How do you account for sleeping, eating, and bathroom breaks? What about tech discussions?

If Not Hours, Then What?: Story Points

My preferred unit for forecasting development is the humble story point. For those of you not familiar, story points are a concept from scrum. If you’re anti-scrum, don’t worry – this isn’t going to be a pitch for the framework. I, a certified scrum master myself, have my own misgivings about scrum.

For one, its accretion of mainstream acceptance seems to have gone hand-in-hand with a nasty – possibly terminalº – case of dogma. And I have neither the time nor the patience for dogma. Scrum’s habit of renaming things that already existed in order to sell a product (and lots of certification courses) also reeks of the techniques that make people hate marketers.

I say all of that not to piss in my fellow scrum masters’ collective cornflakes, but to make a point. Despite my gripes with scrum, story points are one of the framework’s best, most useful concepts.

What is a story point? An estimate of scope. There are lots of approaches to story points, but the most common, and the best in my opinion, is the Fibonacci Sequence. Features (or user stories, as they’re known in scrum parlance) are given a scope estimate of 1, 2, 3, 5, 8, 13, or 21 story points. Some practitioners go even higher than that, but in my experience, 21 story points is a sufficient upper bound.

And therein lies the key difference between story points and hours or days. Hours and days are estimates of time, while story points are estimates of scope. Hours and days seek to estimate how long something will take (which, as covered above, is subjective by team member). Story points appraise how large, complex, and/or complicated it is.

Why the Fibonacci Sequence? Why not just 1-5 or Small-Medium-Large-XL?

It’s way easier to judge relative size than absolute size. For example, it is really hard to estimate the height of a building, but it’s easy to estimate that it’s half the height of the building next door. In the Fibonacci sequence, each value is the sum of the two preceding values. A 2-point feature is roughly the scope of two 1-point features put together. A 13-point feature is roughly the size of a 5- and an 8-point feature put together.
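
As a quick illustration, the scale can be generated directly from that rule. The function name story_point_scale is hypothetical, and the sketch starts the scale at 1, 2 to match the values listed in this post (the classic Fibonacci sequence starts 1, 1).

```python
def story_point_scale(upper_bound=21):
    """Build the story point scale: each value is the sum of the previous two."""
    scale = [1, 2]  # starting values used in this post
    while scale[-1] + scale[-2] <= upper_bound:
        scale.append(scale[-1] + scale[-2])
    return scale


scale = story_point_scale()
print(scale)

# Every value from the third onward equals the two before it added together.
for smaller, larger, total in zip(scale, scale[1:], scale[2:]):
    print(f"{smaller} + {larger} = {total}")
```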

Another advantage to the Fibonacci sequence is that the exponential spread between values in the sequence nicely accounts for variance. Across any evaluation of scope, there will naturally be some variance. None of your 1-point features will be exactly the same scope. None of your 21-pointers will be exactly the same scope. But there should only be a small variance in the scope of your smallest, 1-point features. On the other hand, we would expect your biggest, hairiest 21-pointers to have significant variance in scope from feature to feature.

But Doesn’t The Exponential Curve of The Fibonacci Sequence Screw Up Any Calculations of Averages or Variance?

Nope. Remember what we learned with the central limit theorem. It doesn’t matter what the probability distribution looks like (in this case exponential). Large collections of averages or sums will still form a normal distribution. All of the formulas from Part 4 still apply!
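
If you want to see this for yourself, a small simulation will do. The sketch below draws story point estimates from a deliberately lopsided mix of feature sizes (the WEIGHTS values are invented for illustration), then compares the skew of the individual estimates with the skew of averages taken over batches of 30 features.

```python
import random
from statistics import mean, pstdev

SCALE = [1, 2, 3, 5, 8, 13, 21]
# Hypothetical mix of feature sizes: many small features, few large ones.
WEIGHTS = [30, 25, 20, 12, 8, 3, 2]


def skewness(values):
    """Population skewness: 0 for a perfectly symmetric distribution."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)


def sample_means(rng, sample_size, num_samples):
    """Average story points over many random batches of features."""
    return [
        mean(rng.choices(SCALE, weights=WEIGHTS, k=sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
raw_points = rng.choices(SCALE, weights=WEIGHTS, k=20_000)
batch_means = sample_means(rng, sample_size=30, num_samples=2_000)

print(f"Skewness of individual estimates: {skewness(raw_points):.2f}")
print(f"Skewness of 30-feature averages:  {skewness(batch_means):.2f}")
```

The individual estimates are heavily skewed toward small values, while the batch averages sit much closer to a symmetric bell shape. Larger batches push the skew closer to zero.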

Turtles All The Way Down: Level Setting

The first question that usually pops into anyone’s head when he or she first hears this scheme is the inherent chicken-and-egg aspect of relative sizing. If you’re estimating everything based on something else, what’s the baseline? How do you establish a 1-point feature? There are a couple of approaches: the proper way, and the pragmatic way.

The Proper Way

The proper way is to go through your list of features, and find the smallest one (or one of the smallest ones). Then establish that as the 1-point baseline, and appraise all the other stories in terms of that 1-pointer. From there, find examples of the various gradations to use for reference.
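
One way to picture this is to ask, for each feature, "roughly how many of the baseline 1-pointers is this?" and then snap the answer to the nearest value on the scale. The sketch below does exactly that. The feature names, the relative sizes, and the rule of rounding ties upward are all assumptions for illustration. In practice the team's judgment does this work, not a formula.

```python
SCALE = [1, 2, 3, 5, 8, 13, 21]


def snap_to_scale(relative_size, scale=SCALE):
    """Return the scale value closest to a relative size.

    relative_size answers "how many baseline 1-pointers is this?".
    Ties go to the larger value, an assumed convention that errs
    toward caution.
    """
    return min(scale, key=lambda points: (abs(points - relative_size), -points))


# Hypothetical backlog, sized relative to the smallest feature.
backlog = {
    "rebind controls menu": 1.0,  # the baseline 1-pointer
    "save game slots": 4.0,
    "enemy patrol AI": 10.0,
    "online leaderboards": 18.0,
}

for feature, relative_size in backlog.items():
    print(f"{feature}: {snap_to_scale(relative_size)} points")
```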

Then, as you go along, pull up reference examples of each point size. Before doing any future point estimations, it’s a good habit (though not essential) to quickly review those examples to re-calibrate.

The Pragmatic Way

Alternatively, the pragmatic way is simply to set an expectation of what each point gradation means. For example:

  • 1-Point: a day or less of work
  • 2-Points: a couple of days
  • 3-Points: the better part of a week
  • 5-Points: around a week
  • 8-Points: a week to a week and a half
  • 13-Points: two full weeks
  • 21-Points: three to four weeks

This is not a great way to think of story points because it gets the team thinking in terms of time again. Any die-hard scrum masters reading this are probably gnashing their teeth right about now. But, if relative estimates are too abstract, the pragmatic method can get you started.

You can always shift to the proper approach once you have some experience and comfort with story points. And in practice, the pragmatic approach doesn’t significantly compromise estimates and forecasts (at least in my anecdotal experience).


Why Are Story Points Better?

Points prioritize accuracy over precision

You are just categorizing features within a finite number of scope categories. Precision isn’t a going concern – there’s no such thing as having precisely 2-points worth of scope. You only need to be accurate: Feature A is roughly the same size as Feature B.

Points can reduce the impact of anchoring

Because you are dealing with broader categories of feature scope, there is less vulnerability to anchoring. You’re not going to start chewing your nails if it takes 16 hours, rather than 8. Some anchoring is unavoidable. If a 2-pointer ends up taking two weeks, you will start chewing your nails and wondering if the developer working on it has been napping in the lounge, even if there is a perfectly legitimate reason that the scope exploded.

Better scheduling

You don’t schedule teams or people by the number of hours they have, but by the amount of scope they can typically process.

Points are faster

Instead of hemming and hawing about whether a feature takes 3 days or 4, you just need to assess which gradation of scope it falls in. In my experience streamlining the estimate – and thinking in terms of scope rather than time – facilitates better discussions. Team members focus on the technical requirements and risks of features, rather than negotiating the blocks of time required to execute.

Scope is not subjective by person

A 13-point feature is a 13-point feature, regardless of who’s working on it. One team member might take longer than another with that feature, but the scope is universal. Obviously this isn’t fool-proof. Some developers have an amazing ability to expand scope by getting lost in the weeds. But, if your team is well coordinated, if your leads and tech-director types are effectively coaching junior team members, and if the requirements for a given feature are well-defined, you can manage that risk.

It’s easy to compare the estimate to the actuality

It’s easy to say that a feature estimated at 5 points was in actuality more like a 13-point feature. You can focus the discussion on why scope was so badly estimated rather than accounting for the specific development time.

Using Story Points For Forecasts

Again, average development time per feature is a difficult value to calculate because it’s hard to track that data accurately. But in a story points-based system, you don’t care about the average amount of time it takes to complete a feature. You just care about the average amount of scope you can complete per unit of time.

For example, in an hours-based system, you might say that the average feature takes 20 hours to complete and your team of six can output 240 man-hours per week. In a story points-based system, you would instead say that your average feature is 8 points of scope and your team can churn through an average of 60 points every two weeks.

By ignoring the time per feature, and just focusing on scope per unit of time, you greatly simplify your data collection. All you need to know is what features were closed, when they were closed, and what their story point estimates were. This is a snap to track with any decent project management software. Or post-it notes on a board, for that matter.
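
Here is a rough sketch of how little data that takes. It assumes two-week sprints counted from a project start date. The function name points_per_sprint and the feature records are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

SPRINT_LENGTH_DAYS = 14  # two-week sprints (an assumption)


def points_per_sprint(closed_features, project_start):
    """Sum story points closed in each sprint.

    closed_features: iterable of (name, closed_on, points) tuples.
    Returns a list of totals, one per sprint, including empty sprints.
    """
    totals = defaultdict(int)
    for _name, closed_on, points in closed_features:
        sprint_index = (closed_on - project_start).days // SPRINT_LENGTH_DAYS
        totals[sprint_index] += points
    if not totals:
        return []
    last_sprint = max(totals)
    return [totals[i] for i in range(last_sprint + 1)]


# Hypothetical tracking data: what closed, when, and its estimate.
start = date(2024, 1, 1)
closed = [
    ("double jump", date(2024, 1, 4), 5),
    ("pause menu", date(2024, 1, 11), 3),
    ("inventory grid", date(2024, 1, 18), 13),
    ("footstep audio", date(2024, 1, 25), 2),
    ("boss arena", date(2024, 2, 6), 21),
]

per_sprint = points_per_sprint(closed, start)
print(f"Points closed per sprint: {per_sprint}")
print(f"Average velocity: {mean(per_sprint):.1f} points per sprint")
```

Nobody logged hours here. The only inputs are the three facts named above: which feature closed, on what date, and how many points it was estimated at.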

You can then take that scope burndown rate (velocity in scrum terminology) and extrapolate against your backlog. Take your average scope per feature, multiply that by the number of features in your backlog, and then divide that number by your average velocity. The result is the number of sprints (or whatever unit of time your velocity is measured in) left until delivery, which you can count forward from today to get your expected delivery date. From there, you can also use your standard deviation to develop a confidence interval window around that delivery date, as I covered in my post about the central limit theorem. More on how to do this in Part 7!
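
The arithmetic can be sketched as follows, using the 8-point average feature and 60-point velocity from the example above. Everything else is an assumption for illustration: the 45-feature backlog, the velocity history, the start date, and the function names. The window is a simplification (a normal interval around the mean velocity) and is not necessarily the method from Part 4 or Part 7.

```python
from datetime import date, timedelta
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev

SPRINT_LENGTH_DAYS = 14  # two-week sprints


def forecast_sprints(avg_points_per_feature, features_remaining, velocity):
    """Expected sprints left: remaining scope divided by velocity."""
    return avg_points_per_feature * features_remaining / velocity


def delivery_window(avg_points, features_remaining, velocities, confidence=0.95):
    """Rough (best, expected, worst) sprint counts.

    Assumption: a normal interval around the mean velocity, with each
    end converted to a sprint count.
    """
    avg_velocity = mean(velocities)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(velocities) / sqrt(len(velocities))
    if margin >= avg_velocity:
        raise ValueError("velocity history is too noisy for this rough method")
    return (
        forecast_sprints(avg_points, features_remaining, avg_velocity + margin),
        forecast_sprints(avg_points, features_remaining, avg_velocity),
        forecast_sprints(avg_points, features_remaining, avg_velocity - margin),
    )


past_velocities = [55, 62, 58, 65, 60, 60]  # hypothetical points per sprint
best, expected, worst = delivery_window(8, 45, past_velocities)

today = date(2024, 3, 1)  # hypothetical forecast date
for label, sprints in (("Best case", best), ("Expected", expected), ("Worst case", worst)):
    # Round up: a partially used sprint still has to be scheduled.
    finish = today + timedelta(days=ceil(sprints) * SPRINT_LENGTH_DAYS)
    print(f"{label}: {sprints:.1f} sprints, done by {finish.isoformat()}")
```

With 360 points of backlog and a velocity of 60, the expected figure is six sprints, or twelve weeks. A steadier velocity history tightens the window around that number.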

Next up: Forecasts

In Part 6 of “Game Planning With Science!”, I’ll walk you through a specific method of feature request: user stories. I’ll cover what they are, why they’re useful, and how to estimate their scope. Click here to read on!


Key Takeaways

  • Keep the distinction between accuracy and precision in mind
  • Estimating feature development in terms of time is problematic for multiple reasons
  • Story points are an estimate of scope
  • Story points are a better form of estimation than time and alleviate many of the inherent problems with time-based estimates
  • The typical method of estimating story points is with the Fibonacci Sequence: 1, 2, 3, 5, 8, 13, 21

*This section was inspired by an example from The Signal and the Noise by fivethirtyeight.com founder Nate Silver
º I say “possibly terminal” because scrum will only be the management framework du jour until something better (or at least better-marketed) comes along. If scrum practitioners don’t evolve their art (and iteratively incorporate our ever-evolving understanding of management science), for fear of deviating from the sacred tomes of Sutherland and Schwaber or being called a “scrum-but” heretic, scrum will be brown bread.

Creative Commons License
“Story Points For Feature Estimation – Game Planning With Science! Part 5” by Justin Fischer is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
