Background

Statistical analysis is the core competency of the business.

I studied psychology – focused on conducting experiments – both as an undergraduate (Haverford) and a PhD candidate (University of Pennsylvania).  After leaving graduate school I continued training in statistical analysis, not just in academic classes (Harvard), but also in business seminars (DMA).

Through literally hundreds of regression analysis / predictive modeling projects, and hundreds of cluster analysis, factor analysis, CHAID, and/or other multivariable analyses, I have become very confident about which parts of the academic tradition are critical (e.g., paying close attention to reliability, especially when combining different techniques like CHAID and regression), and about how to incorporate them into a business approach, where spending weeks on an analysis (e.g., checking regression assumptions) just isn’t realistic.

Using an analyst with a business-only statistical background (whether it was learned on-the-job or through seminar training) is a little like trying to design a skyscraper by playing around with AutoCAD.  While I believe that research is critical to good business decisions, one should avoid basing an important decision on work from an analyst who doesn’t understand the underpinnings of regression.


My Approach

I believe that the largest portion of the time spent on the typical statistical analysis project should be devoted to data preparation.  That includes ensuring that there aren’t problems with the data – or at least being able to conduct the analysis with an awareness of them – fully understanding the scale and values of variables, and understanding where the data came from and what it means.

Spending too little time on data preparation can lead to disaster.  By contrast, an analysis that isn’t exactly the right one will probably still give the analyst a reasonable sense of the results.  As an example of the former, one of the first business projects I worked on as an employee of the ad agency was a follow-up analysis of all New Jersey residents turning 65 to see what predicted signing up for a Medicare "supplemental" health insurance plan.  My results were so different from the prior analysis, done by a vendor, that I spent a lot of my own time after work trying to reconcile the two sets of findings.

It turned out that there were a number of problems with the original analysis:

  • One income variable (from appended data) was coded into a scale where higher values of the scale corresponded to lower income!  The analyst wasn’t aware of that and treated the scale as though it correlated positively with income.
     
  • Another scaled income variable used both numbers and letters to fit more than just 10 values into one digit (such as 6, 7, 8, 9, 0, A, B, C, D, etc.).  This was back when appended data was often difficult to work with but at least the data warehouses did a good job of keeping data files small.  When that field was imported into SPSS the letter values became missing data (which still happens today with SPSS if you use their automatic import).  So when that variable was used it excluded everyone with high income from the analysis.
     
  • Finally, at this time in New Jersey (and today!) incomes varied widely by county as did cost of living, but county income percentile was not even acquired from the data warehouse, much less tried in the regression.

There were other problems with the data and with the analysis, but because so much variance was accounted for in the model and the recommendations made intuitive sense, the analyst didn’t go back and check his work.
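A couple of quick checks would likely have caught the two income-coding problems described above.  What follows is only a minimal Python sketch, not a reconstruction of the original project: the code ordering (CODE_ORDER), the sample values, and the home_value comparison variable are all made up for illustration, and the real layout would have to come from the data vendor's documentation.

```python
# Hypothetical one-character income codes. The ordering is an assumption;
# the real one must come from the data vendor's documentation.
CODE_ORDER = "1234567890ABCD"


def naive_numeric(code):
    """Mimic an import that keeps digits only, so letter codes become missing."""
    return int(code) if code.isdigit() else None


def decode(code):
    """Return the rank of a code within CODE_ORDER, or None if unrecognized."""
    code = code.strip().upper()
    if len(code) != 1:
        return None
    rank = CODE_ORDER.find(code)
    return rank if rank >= 0 else None


def pearson(x, y):
    """Plain Pearson correlation, used here only to check a scale's direction."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Check 1: how many records silently drop out under the naive import?
raw_codes = ["3", "7", "9", "A", "C", "0", "D", "5"]  # made-up sample
naive = [naive_numeric(c) for c in raw_codes]
decoded = [decode(c) for c in raw_codes]
lost = sum(n is None for n in naive) - sum(d is None for d in decoded)
print(f"records lost by the naive import: {lost} of {len(raw_codes)}")

# Check 2: does the scale run in the direction we assume?
# Compare it with a variable that should rise with income (made-up values).
income_scale = [1, 2, 3, 4, 5]
home_value = [500, 400, 300, 200, 100]  # hypothetical, in $ thousands
r = pearson(income_scale, home_value)
if r < 0:
    print("warning: higher scale values appear to mean LOWER income")
```

Neither check is sophisticated.  The point is that both are run before any modeling starts, and that the records lost in the first check are not a random subset: under the assumed ordering they are exactly the highest-income cases.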

I don’t want to underplay the value of doing excellent analytical work once the data is in good shape and well understood.  Regression and other techniques are best considered tools in a toolbox that have to be used intelligently (and almost as partners with the analyst) to uncover relationships in the data.  Even if one knows everything about all statistical procedures and their parameters, data analysis should not be automated; it should be an investigative process where the investigator is continually getting clues and figuring out what to do next.

  • One DMA course I took recommended routinely applying a dozen or so of the same transformations (squared, cubed, inverse, log, etc.) to every regression analysis.  That could save time if one is aware of the effect on reliability and one is willing not to stop at those transformations.  But it’s generally a lazy, and sometimes dangerous, approach.
     
  • Later in the same (not so good) course one lecturer told the attendees never to use factor analysis unless the variables were interval or ratio variables (in other words, numeric scales with many possible values rather than just two), and then after a break the other lecturer showed us an example of how he’d solved a difficult problem by using factor analysis with mostly binary data (only two values)!

It’s true that factor analysis can be used, cautiously, with binary variables.  Similarly, you probably gain more than you lose by using linear regression to predict a binary variable if you don’t have access to logistic regression.  But these aren’t great examples to teach!
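To make the trade-off with linear regression on a binary outcome concrete, here is a small Python sketch under stated assumptions: the predictor (score), the outcome (signed_up), and all of the values are invented for illustration.  It fits an ordinary least-squares line to a 0/1 outcome and then looks at the fitted values.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: a single predictor and a binary (0/1) outcome.
score = [1, 2, 3, 4, 5, 6, 7, 8]
signed_up = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = fit_line(score, signed_up)
fitted = [intercept + slope * s for s in score]

# What you gain: the fitted values still rank cases sensibly.
ranking_preserved = fitted == sorted(fitted)

# What you lose: some "probabilities" fall below 0 or above 1.
out_of_range = [round(f, 3) for f in fitted if f < 0 or f > 1]

print("ranking preserved:", ranking_preserved)
print("fitted values outside 0..1:", out_of_range)
```

In this toy case the line orders the cases correctly, which is often what a business application needs, but the lowest and highest fitted values cannot be read as probabilities.  Logistic regression avoids that second problem, which is why it is the better choice when it is available.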

I’ve found that a good way to test whether an analyst knows an adequate amount about regression is to ask about the following phenomenon:

  • Two correlated variables (although they don’t have anything like a perfect correlation) like education and an achievement test score each, on their own, correlate well with income.  But when they’re used together the achievement test score still has a strong positive partial correlation (or t value if you prefer) but education has a negative partial correlation (t value).  Was this a fluke or a problem with the analysis or is this a valuable clue about the composition of the dataset?

If the analyst can tell you that this is suppression, a phenomenon that sometimes arises naturally from how regression works, great.  If he or she can clearly explain how and why it works and use examples, you’ve probably got a very good analyst.

Most analytical agencies I’ve worked with are unfamiliar with suppression.  When they get that kind of confusing result they often throw out education, tell the client the data are faulty, or present the results but gloss over the negative correlation.
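For readers who want to see the education and test score pattern in numbers, here is a minimal Python sketch.  The three correlations are hypothetical, chosen only to reproduce the pattern, and the function uses the standard textbook formula for standardized regression weights with two predictors.  Each weight has the same sign as the corresponding partial correlation.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized regression weights for two predictors, from correlations.

    r_y1, r_y2: correlation of each predictor with the outcome
    r_12:       correlation between the two predictors
    """
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Hypothetical correlations (assumptions for illustration, not real data).
r_income_edu = 0.30   # education with income: positive on its own
r_income_test = 0.60  # test score with income: positive on its own
r_edu_test = 0.70     # the two predictors are correlated, but far from perfectly

beta_edu, beta_test = standardized_betas(r_income_edu, r_income_test, r_edu_test)

# R-squared for the two-predictor model, from the same quantities.
r_squared = beta_edu * r_income_edu + beta_test * r_income_test

print(f"education weight:  {beta_edu:.3f}")
print(f"test score weight: {beta_test:.3f}")
print(f"R-squared with both: {r_squared:.3f}  vs test score alone: {r_income_test ** 2:.3f}")
```

With these made-up correlations, education's weight comes out negative (about -0.24) and the test score's weight is about 0.76, even though both predictors correlate positively with income on their own.  The model with both predictors also explains more variance than the test score alone, which is why throwing out education discards real information.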
