Weighing the Evidence

I’m going to tell you a true story.

Recently, in response to a few observations, I began to make a few changes in my diet and my habits. Perhaps you’ll be impressed.

I cut down radically on my consumption of sugar.
I cut down significantly on carbohydrates. (Very painful; I LOVE rice. I LOVE noodles.)
I started drinking less alcohol. (See above.)
I increased my intake of tea and water.
I’ve been reducing how much I eat during the day; some days I don’t eat at all until dinner. Other days I have breakfast, lunch, and dinner. And a snack.
I reflected on the idea of not eating during the day, thinking about Moslem friends who fast, and about Nassim Taleb’s ideas in Antifragile. I decided that some variation of this kind in a daily regimen is okay; even a good idea.
I started weighing myself regularly.

Impressed yet? Let me give you some data.

When I started, I reckon I was just under 169 lbs. (That’s 76.6 kilograms, for non-Americans and younger Canadians. I still use pounds. I’m old. Plus it’s easier to lose a pound than a kilogram, so I get a milestone-related ego boost more often.)

Actually, that 169 figure is a bit of a guess. When I became curious about my weight, the handiest tool for measuring it was my hotel room’s bathroom scale. I kicked off my shoes, and then weighed myself. 173 lbs., less a correction for my clothes and the weight of all of the crap I habitually carry around in my pockets: Moleskine, iPhone, Swiss Army knife, wallet stuffed with receipts, pocket change (much of it from other countries). Sometimes a paperback.

Eventually I replaced the batteries on our home scale (when did bathroom scales suddenly start needing batteries? Are there electronics in there? Is there software? Has it been tested?—but I digress). The scale implicitly claims a certain level of precision by giving readings to the tenth of a pound. These readings are reliable, I believe; that is, they’re consistent from one measurement to the next. I tested reliability by weighing myself several times over a five-minute period, and the results were consistent to the tenth of a pound. I repeated that test a day or two later. My weight was different, but I observed the same consistency.

I’ve been making the measurement of my actual weight a little more precise by, uh, leaving the clothes out of the measurement. I’ve been losing between one and two pounds a week pretty consistently. A few days ago, I weighed myself, and I got a figure of 159.9 lbs. Under 160! Then I popped up for a day or two. This morning, I weighed myself again. 159.4! Bring on the sugar!

That’s my true story. Now, being a tester, I’ve been musing about aspects of the measurement protocol.

For example, being a bathroom scale, it’s naturally in the bathroom. The number I read from the scale can vary depending on whether I weigh myself Before or After, if you catch my meaning. If I’ve just drunk a half litre of water, that’s a whole pound to add to the variance. I’ve not been weighing myself at consistent times of the day, either. In fact, this afternoon I weighed myself again: 159.0! Aren’t you impressed!

Despite my excitement, it would be kind of bogus for me to claim that I weigh 159.0 lbs, with the “point zero”. I would guess my weight fluctuates by at least a pound through the day. More formally, there’s natural variability in my weight, and to be perfectly honest, I haven’t measured that variability. If I were trying to impress you with my weight-loss achievement, I’d be disposed to report the lowest number on any given day. You’d be justified in being skeptical about my credibility, which would make me obliged to earn it if I care about you. So what could I do to make my report more credible?

I could weigh myself several times per day (say, morning, afternoon, and night) at regular times, average the results, and report the average. If I wanted to be credible, I’d tell you about my procedure. If I wanted to be very credible, I’d tell you about the variances in the readings. If I wanted to be super credible, I’d let you see my raw data, too.
All that would be pretty expensive and disruptive, since I would have to spend few minutes going through a set procedure (no clothes, remember?) at very regular times, every day, whether I was at home or at a business lunch or travelling. Few hotel rooms provide scales, and even if they did, for consistency’s sake, I’d have to bring my own scale with me. Plus I’d have to record and organize and report the data credibly too. So…
Maybe I could weigh myself once a day. To get a credible reading, I’d weigh myself under very similar and very controlled conditions; say, each morning, just before my shower. This would be convenient and efficient, since doffing clothes is part of the shower procedure anyway. (I apologize for my consistent violation of the “no disturbing mental images” rule in this post.) I’d still have to bring my own scale with me on business trips to be sure I’m using consistent instrumentation.
Speaking of instrumentation, it would be a good idea for me to establish the reliability and validity of my scale. I’ve described its reliability above; it produces a consistent reading from one measurement to the next. Is it a valid reading, though? If I desired credibility, I’d calibrate the scale regularly by comparing its readings to a reference scale or reference weight that itself was known to be reliable (consistent between observations) and valid (consistent with some consensus-based agreement on what “a pound” is). If I wanted to be super-credible, I’d report whatever inaccuracy or variability I observed in the reading from my scale, and potential inconsistencies in my reference instruments, hoping that both were within an acceptable range of tolerance. I might also invite other people to scrutinize and critique my procedure.
If I wanted to be ultra-scientific, I’d also have to be prepared to explain my metric—the measurement function by which I hang a number on an observation. and the manner in which I operationalized the metric. The metric here is bound into the bathroom scale: for each unit pound placed on the scale, the figure display should increase by 1.0. We could test that as I did above. Or, more whimsically, if I were to put 159 one-pound weights on one side of Sir Bedevere’s largest scales, and me on the other, the scales would be in perfect balance (“and therefore… A WITCH!”), assuming no problems with the machinery.
If I missed any daily observations, that would be unfortunate and potentially misleading. Owning up to the omission and reporting it would probably preferable to covering it up. Covering up and getting caught would torpedo my credibility.
Based on some early samples, and occasional resampling, I could determine the variability of my own weight. When reporting, I could give a precise figure and along with the natural variation in the measurement: 159.4 lbs, +/- 1.2 lbs.
Unless I’m wasting away, you’d expect to see my weight stabilize after a while. Stabilize, but not freeze. Considering the natural variance in my weight, it would be weird and incredible if I were to report exactly the same weight week after week. In that case, you’d be justified to suspect that something was wrong. It could be a case of quixotic reliability—Kirk and Miller’s term for an observation that is consistent in a trivial and misleading way, as a broken thermometer might yield. Such observations, they say, frequently prove “only that the investigator has managed to observe or elicit ‘party line’ or rehearsed information. Americans, for example, reliably respond to the question ‘How are you?’ with the knee-jerk ‘Fine.” The reliability of this answer does not make it useful data about how Americans are.” Another possibility, of course, is that I’m reporting faked data.
It might be more reasonable to drop the precision while retaining accuracy. “About 160 lbs” is an accurate statement, even if it’s not a precise one. “About 160, give or take a pound or so” is accurate, with a little patina of precision and a reasonable and declared tolerance for imprecision.
Plus, I don’t think anyone else cares about a daily report anyhow. Even I am only really interested in things in the longer term. Having gone this far watching things closely, I can probably relax. One weighing a week, on a reasonably consistent day, first thing in the morning before the shower (I promise; that was the last time I’ll present that image) is probably fine. So I can relax the time and cost of the procedure, too.
I’m looking for progress over time to see the effects of the changes I’m made to my regimen. Saying “I weigh about 160. Six weeks ago, I weighed about 170” adds context to the report. I could provide the raw data:

Plotting the data against time on a chart would illustrate the trend. I could show display the data in a way that showed impressive progress:

But basing the Y-axis at 154.0 (to which Excel defaulted, in this case) wouldn’t be very credible because it exaggerates the significance of the change. To be credible, I’d use a zero base:

Using a zero-based Y-axis on the chart would show the significance of change in a more neutral way.

To support the quantitative data, I might add other observations, too: I’ve run out of holes on my belt and my pants are slipping down. My wife has told me that I look trimmer. Given that, I could add add these observations to the long-term trend in the data, and could cautiously conclude that the regimen overall was having some effect.
All this is fine if I’m trying to find support for the hypothesis that my new regimen is having some effect. It’s not so good for two other things. First, it does not prove that my regimen change is having an effect. Maybe it’s having no effect at all, and I’ve been walking and biking more than before; or maybe I acquired some kind of wasting disease just as I began to cut down on the carbs. Second, it doesn’t identify specific factors that brought about weight loss and rule out other factors. To learn about those and to report on them credibly, I’d have to go back to a more refined approach. I would have to vary aspects of my diet while controlling others and make precise observations of what happened. I’d have to figure out what factors to vary, why they might be important, and what effects they might have. In other words, I’d be developing a hypothesis tied to a model and a body of theory. Then I’d set up experiments, systematically varying the inputs to see their effects, and searching for other factors that might influence the outcomes. I’d have to control for confounding factors outside of my diet. To make the experiment credible, I’d have to show that the numbers were focused on describing results, and not on attaining a goal. That’s the distinction between inquiry metrics and control metrics: an inquiry metric triggers questions; a control metric influences or drives decisions.

When I focus on the number, I set up the possibility of some potentially harmful effects. To make the number look really good on any given day, I might cut my water intake. To make the number look fabulous over a prolonged period (say, as long as I was reporting my weight to you), I could simply starve myself until you stopped paying attention. Then it’d be back to lots of sugar in the coffee, and yes, I will have another beer, thank you.) I know that if I were to start exercising, I’d build up muscle mass, and muscle weighs more than flab. It becomes very tempting to optimize my weight in pounds, not only to impress you, but also to make me feel proud of myself. Worst of all: I might rig the system not consciously, but unconsciously. Controlling the number is reciprocal; the number ends up controlling me.

Having gone through all of this, it might be a good idea to take a step back and line up the accuracy and precision of my measurement scheme with my goal—which I probably should have done in the first place. I don’t really care how much I weigh in pounds; that’s just a number. No one else should care how much I weigh every day. And come to think of it, even if they did care, it’s none of their damn business. The quantitative value of my weight is only a stand-in—a proxy or an indirect measurement—for my real goal. My real goal is to look and feel more sleek and trim. It’s not to weigh a certain number of pounds; it’s to get to a state where my so-called “friends” stop patting my belly and asking me when the baby is due. (You guys know who you are.)

That goal doesn’t warrant a strict scientific approach, a well-defined system of observation, and precise reporting, because it doesn’t matter much except to me. Some data might illustrate or inform the story of my progress, but the evidence that matters is in the mirror; do I look and feel better than before?

In a different context, you may want to persuade people in a professional discipline of some belief of some course of action, while claiming that you’re making solid arguments based on facts. If so, you have to marshal and present your facts in a way that stands up to scrutiny. So, over the next little while, I’ll raise some issues and discuss things that might be important for credible reporting in a professional community.

This blog post was strongly influenced by several sources.

Cem Kaner and Walter P. Bond, “Software Engineering Metrics: What Do They Measure and How Do We Know“. In particular, I used the ten questions on measurement validity from that paper as a checklist for my elaborate and rigourous measurement procedures above. If you’re a tester and you haven’t read the paper, my advice is to read it. If you have read it, read it again.

Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Snappy title, eh? As books go, it’s quite expensive, too. But if you’re going to get serious about looking at measurement validity, it’s a worthwhile investment, extremely interesting and informative.

Jerome Kirk and Mark L. Miller, Reliability and Validity in Qualitative Research. This very slim book raises lots of issues in performing, analyzing, and reporting if your aim is to do credible research. (Ultimately, all research, whether focused on quantitative data or not, serves a qualitative purpose: understanding the nature of things at least a little better.)

Gerald M. (Jerry) Weinberg, Quality Software Management, Vol. 2: First Order Measurement, (also available as two e-books, “How to Observe Software” and “Responding to Significant Software Events”)

Edward Tufte’s Presenting Data and Information (a mind-blowing one-day course) and his books The Visual Display of Quantitative Information; Envisioning Information; Visual Explanations; and Beautiful Evidence.

Prior Art Dept.: As I was writing this post, I dimly recalled Brian Marick posting something on losing weight several years ago. I deliberately did not look at that post until I was finished with this one. From what I can see, that material (http://www.exampler.com/old-blog/2005/04/02/#big-visible-belly) was not related to this. On the other hand, I hope Brian continues to look and feel his best. 🙂

I thank Laurent Bossavit and James Bach for their reviews of earlier drafts of this article.

8 replies to “Weighing the Evidence”

Matt

September 12, 2014 at 3:23 pm

Another factor you might want to take in to account: muscle mass tends to be more dense than fat (by volume), so in theory your pants/belt metric could indicate progress, while your scale might claim otherwise. One way to quantify your progress could be body fat measurements, but I’m not sure what the cost of those measures are.
Matt

September 12, 2014 at 3:24 pm

Never mind, apparently I just missed the part where you covered that.

Michael replies: Explainable; it’s a pretty long post, I’m afraid. Thanks for thinking things through, though.
Kim Engel

September 12, 2014 at 5:27 pm

Excellent post Michael. I hope you didn’t co-incidentally display symptoms of wasting syndrome just as you went on a diet. What are the chances of that…? It’s beyond my current capacity to calculate this 🙂

When it comes to using a mirror as a measurement tool I find it deceptive and inaccurate due to immediacy and subjectivity. Photos allow me to be more objective.

As far as your friends not patting your belly, there are other ways to achieve that goal. E.g. boxing lessons and a speed jab. Which will also help you with the goal of being sleek and trim! With that approach you’d fail to meet an implicit goal, e.g. to not alienate your friends and family in the process. So in the pursuit of one goal, you’ve considered various approaches and chosen one that aligned to your larger goals (ethicsmorals).

Congratulations on your progress!
George Dinwiddie

September 12, 2014 at 10:20 pm

Good post. I find that, if I’m paying attention to what I really desire, precise measurement of the proxies for it don’t really matter. (See http://blog.gdinwiddie.com/2010/04/19/the-importance-of-precise-estimates/ for my take on the situation.)

I am jealous of your reaching 160, though. Congratulations!
Yury Makedonov

September 14, 2014 at 5:10 pm

I know a better way to lose weight. Just book a Caribbean cruise. I lost half a pound on each Caribbean cruise so far.

I have just one data point but assume that my conclusion should be valid for all other people on all their Caribbean cruises. 🙂
Five Blogs – 15 September 2014 | 5blogs

September 14, 2014 at 11:05 pm

[…] Weighing the Evidence Written by: Michael Bolton […]
B Rijsdijk

September 15, 2014 at 11:22 am

The y-axis should be the average weight of someone with the Same sex, hight and age. Or, if you don’t want to Be average, your own personal goal. No one wants to weight 0, not in pounds, not in kilo’s. You also should tot in a line that would indicate a healthy lost of weight, overdoing it causes more harm than it does good.

Michael replies: It might be a good idea to provide some relevant reference information; what constitutes “relevance” is a matter of subjectivity and desire. However, I am typically suspicious of charts that don’t have a zero base because such charts tend to exaggerate the significance of the variance in the data.
Rikard Edgren

September 17, 2014 at 5:25 am

Hi Michael

You seem motivated, so this will work out fine!
To get to “look and feel more sleek and trim” I think you should add exercise as well.
I would recommend walking/running, since you can do it whenever, and even use as a de-focusing technique for tricky testing problems.
Pretty fast walking could be combined with tweeting, but running is, in my experience, more rewarding (the time I spend exercising is given back as more energy the rest of my time.)
Just make sure you run with 180 (exactly;) steps per minute, it will decrease the risk of injurys (less friction.)

Hope to see you soon,
Rikard

8 replies to “Weighing the Evidence”

Leave a Comment Cancel reply