Timing tests, list tests, cover tests, select tests — testing can do wonderful things for your catalog, and carefully thought-out tests should be in every circulation plan you create. However, that “carefully thought-out” part can get embarrassing.
But They’re All Getting the Same Thing
A national business-to-business cataloger wanted to test how mailing one fewer catalog per year to each buyer segment would affect sales. Executives split a large buyer segment into two unequal parts, mailed the usual number of catalogs to the larger segment (the control), and one fewer to the smaller (the test).
In each flight (single mail drop) where test and control both got the same catalog, they applied the same mail code to the control and test segments. But in flights where the control group got a catalog and the test didn’t, the control received a unique mail code.
Staffers soon realized they couldn’t calculate separate response rates, because in flights where test and control had gotten catalogs with the same mail code, it was impossible to tell which segment had generated the order.
When a test involves just one flight, it’s easy to see that each group should have a different keycode applied to its catalogs, so when orders arrive, you can tell which group generated the order.
But if a test spans several flights, you must apply different keycodes to catalogs going to the test and control segments in each flight — even if in any given flight the test and control segments are being treated exactly the same. If you don’t, at season’s end you won’t be able to tell which group (test or control) generated which orders in those flights where test and control got the same keycode. So you won’t be able to create all-season response rates and sales-per-book numbers for your test and control groups.
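The keycode discipline described above can be sketched in a few lines. This is illustrative code, not any cataloger's actual system; the function and code-format choices are assumptions made for the example. The point is simply that a unique code per (flight, group) pair keeps attribution unambiguous even in flights where test and control receive the identical catalog.

```python
def assign_keycodes(flights, groups):
    """Give every flight/group combination its own unique keycode."""
    return {
        (flight, group): f"F{flight:02d}-{group}"
        for flight in flights
        for group in groups
    }


def attribute_order(keycode, keycodes):
    """Trace an incoming order's keycode back to its flight and group."""
    for (flight, group), code in keycodes.items():
        if code == keycode:
            return flight, group
    raise ValueError(f"unknown keycode: {keycode}")


# Three flights, two segments -> six distinct codes.
keycodes = assign_keycodes(flights=range(1, 4), groups=["CTRL", "TEST"])

# Even if flight 2 mailed the same catalog to both segments, the order
# is still attributable, because the codes differ by group.
print(attribute_order("F02-TEST", keycodes))  # -> (2, 'TEST')
```

With one code per flight per group, season-end response rates and sales-per-book can be rolled up by group with a simple lookup, which is exactly what the cataloger in the story could not do.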
In Bad Taste
A national food cataloger wanted to test whether mailing a free product sample to customers would boost sales enough to cover the significant cost of the sample mailing. It divided test and control segments properly, coded them separately, captured keycodes at order time, and even hired an outside fulfillment house to execute the test. (Its in-house fulfillment center couldn’t handle stuffing samples into mailers.)
The test spanned several months, and early results heavily favored sampling. But later results showed a puzzling drop-off in response. Finally, someone at headquarters inserted himself as a seed in the next sample mailing, got the sample package by mail, and realized that the fulfillment company wasn’t rotating its sample stock. The samples were stale and tasted terrible.
No matter how simple your test, something can go wrong. The actual source of the problem for the cataloger above was its decision to hire an outside fulfillment house to implement the sample mailing. It had never used an outside fulfillment vendor before, which meant it was not only running a sample-mailing test but also learning how to manage an outside fulfillment house.
In any test, try to limit the changes to the specific thing being tested. By keeping everything else the same, you maximize your chance of getting a clean result.
A cataloger of luxury goods tested a very large number of prospecting lists one year. The following year it based its entire circulation plan on the prior year’s test results by rolling out to lists that worked and dropping lists that didn’t.
There was just one problem: To test the largest number of lists during the test year, the cataloger mailed fairly small quantities to each list. So the number of orders from each list looked like this: 14, 35, 40, 19.
“It won’t work,” I said. “Your results are too small to be statistically significant.”
They disagreed, rolled out as planned ... and got drastically different results from the prior year.
You don’t need a degree in statistics to understand this simple rule: For a test to be valid, you need at least 50 responses (orders) in each cell. Any cell that gets fewer is meaningless.
And you can turn this around. When creating a test, plan to get at least 50 responses per cell. That’s the reason behind the 5,000-name minimum that most list owners set for test rentals: a 1-percent response rate on 5,000 names yields 50 orders, just enough to be statistically significant. If you’re expecting less than a 1-percent response rate, rent more names to bring your expected order quantity up to a minimum of 50 per cell.
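The arithmetic behind this rule of thumb is simple enough to put in a back-of-the-envelope helper. This is a sketch based on the 50-orders-per-cell guideline above; the function name and the 50-order default are illustrative choices, not an industry-standard formula.

```python
import math


def names_needed(expected_response_rate, min_orders=50):
    """Names to mail per cell so that expected orders reach min_orders.

    expected_response_rate is a fraction, e.g. 0.01 for 1 percent.
    """
    if expected_response_rate <= 0:
        raise ValueError("response rate must be positive")
    return math.ceil(min_orders / expected_response_rate)


print(names_needed(0.01))    # 1% expected response  -> 5000 names per cell
print(names_needed(0.005))   # 0.5% expected response -> 10000 names per cell
```

Note that this sizes the cell so the *expected* count reaches 50; a cell that happens to come in well under expectation, like the 14- and 19-order lists in the story, still fails the rule.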
The No-test Test
A national cataloger created a beautifully simple split-test. It divided a buyer group in half; made sure each group was large enough to yield statistically significant results; gave each group a different mail code; tracked the mail codes at order-taking time; and after all sales were in, was rewarded with a very clear winner. The test segment had significantly outperformed the control segment.
So what had it actually tested that produced this clear result? Absolutely nothing — the test and control groups were treated the same at each point in the test. They got identical catalogs at identical times.
I once performed the above test myself in cooperation with one of my catalog clients. We called it the “no-test test,” because we went through all the motions of a test, only we didn’t do a single thing differently between the control and test segments. The results from both segments basically should have been the same — but they weren’t.
What can we learn from this unsettling outcome? It could’ve been a fluke. At the 95-percent confidence level that statisticians talk about, one test in 20 will be a fluke. It also could’ve been caused by a failure of randomness in how the segments were created. For example, the computer operator who split the segment, rather than doing a true nth-select, could instead have just chopped the group in half. If the original list was alphabetized, that would put all the As, Bs and Cs in one group, and all the Xs, Ys and Zs in the other.
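The difference between the two split methods mentioned above can be shown in a few lines. This is illustrative code, not the computer operator's actual procedure: a simple every-other-record nth-select interleaves the file, while chopping it in half preserves whatever ordering bias the file carries, such as alphabetization.

```python
def nth_select_split(records):
    """A simple nth-select: alternate records between the two segments."""
    control = records[0::2]  # every other record, starting with the first
    test = records[1::2]
    return control, test


def chop_in_half(records):
    """The flawed method: first half of the file vs. second half."""
    mid = len(records) // 2
    return records[:mid], records[mid:]


names = ["Adams", "Baker", "Chen", "Xu", "Young", "Zimmer"]

print(nth_select_split(names))  # names from across the alphabet in each group
print(chop_in_half(names))      # As-Cs in one group, Xs-Zs in the other
```

An nth-select (or a true random shuffle before splitting) spreads any hidden ordering across both segments; chopping an alphabetized file concentrates it, and the "test" ends up comparing early-alphabet customers to late-alphabet ones.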
However, most often this kind of result is caused by code-tracking problems. In fact, one reason for performing a no-test test is to check on how well your operators are handling code tracking. This result also underscores the importance of the traditional process for testing new lists, namely: Test once. If it works, test bigger. If that works, roll out cautiously.
In other words, any test result, no matter how well the test was implemented, must be confirmed by additional testing before rolling out.
Susan McIntyre is president of McIntyre Direct, a full-service catalog agency and consulting firm based in Portland, OR. She can be reached at (503) 286-1400.