Statistics Can Be Tricky
Hi everyone! In this post I’d like to talk about Simpson’s Paradox. This Wikipedia article might be helpful if you want to know more about it: http://en.wikipedia.org/wiki/Simpson’s_paradox
What is Simpson’s Paradox? In my paraphrase, it is what happens when a trend that shows up in the combined data reverses once the data is split into groups and looked at more carefully, so the conclusion you draw depends on how the data is sliced. Consider this real-life example, which I took from Wikipedia, about the passage of the Civil Rights Act of 1964 in the United States. Overall, a larger fraction of Republican legislators voted in favor of the Act than Democrats. However, when the congressional delegations from the northern and southern states are considered separately, a larger fraction of Democrats voted in favor of the Act in both regions.
|Region||Democrat||Republican|
|Northern||94% (145/154)||85% (138/162)|
|Southern||7% (7/94)||0% (0/10)|
|Both||61% (152/248)||80% (138/172)|
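The reversal in the table can be checked with a few lines of Python. This is only a small sketch: the vote counts are copied from the table, while the names `votes`, `rate`, and `combined` are my own and just for illustration.

```python
# Votes in favor and total votes, copied from the table above.
# Structure: votes[region][party] = (yes_votes, total_votes)
votes = {
    "Northern": {"Democrat": (145, 154), "Republican": (138, 162)},
    "Southern": {"Democrat": (7, 94), "Republican": (0, 10)},
}


def rate(yes, total):
    """Fraction of votes in favor."""
    return yes / total


def combined(party):
    """Add up yes votes and totals for one party across all regions."""
    yes = sum(votes[region][party][0] for region in votes)
    total = sum(votes[region][party][1] for region in votes)
    return yes, total


# Within each region, Democrats have the higher rate...
for region, parties in votes.items():
    dem = rate(*parties["Democrat"])
    rep = rate(*parties["Republican"])
    print(f"{region}: Democrat {dem:.0%}, Republican {rep:.0%}")

# ...but once the regions are pooled, Republicans come out ahead.
dem_all = rate(*combined("Democrat"))
rep_all = rate(*combined("Republican"))
print(f"Both: Democrat {dem_all:.0%}, Republican {rep_all:.0%}")
```

The reason for the flip is visible in the totals: 94 of the 248 Democrats came from the South, where support was very low in both parties, compared with only 10 of the 172 Republicans. The pooled Democratic rate is dragged down by that much larger southern group.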
We discussed this in Statistics 1 class, and all professional statisticians know about it, I suppose. What can we make of this knowledge? Well, at the very least we can be more careful when we read statistical reports in the news or anywhere else. Coming back to the earlier example, if I were a journalist given that data, I would have two ways to present it, depending on which party I wanted the public to favor: I could report the overall figures, which favor the Republicans, or the regional breakdown, which favors the Democrats. And as far as I know, this happens all the time: data gets sliced in whichever way supports a particular opinion or claim.
Another common issue regarding statistics, which my engineering professor always mentioned in class, is significance. I’m sure we’ve all seen articles saying something like “Chocolate lovers have a lower risk of heart attack” or “Contrary to popular belief, [a product or anything] is actually [the new claim]”. Oftentimes they mention that a study has been done at a university, that a certain number of participants took part, and that the result is statistically significant in favor of the new claim. But sometimes they do not tell you what the significance level is. The significance level is normally denoted by the Greek letter alpha, and common values are 1%, 5%, and 10%. Different conclusions can be drawn at different significance levels: claim A may be significant at the 10% level but not at the 1% level. Again, statistics can be tricky and we should be a little more careful!
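To make the point about significance levels concrete, here is a minimal sketch in Python. The data is made up (a hypothetical 60 successes out of 100 trials, tested against a 50/50 null hypothesis), and the helper name `two_sided_binomial_p` is my own, so treat it as an illustration rather than how any particular study was analyzed.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for the null hypothesis that the
    success probability is 0.5 (hypothetical helper for illustration)."""
    # Under this null the distribution is symmetric, so the two-sided
    # p-value is twice the tail at least as extreme as what was observed.
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


# Hypothetical data: 60 successes in 100 trials.
p_value = two_sided_binomial_p(60, 100)

# The same p-value is judged against three common significance levels.
decisions = {alpha: p_value < alpha for alpha in (0.10, 0.05, 0.01)}

for alpha, significant in decisions.items():
    verdict = "significant" if significant else "not significant"
    print(f"alpha = {alpha:.0%}: p = {p_value:.4f} -> {verdict}")
```

With this made-up data the p-value falls between 5% and 10%, so the very same result counts as significant at the 10% level and not significant at the 5% or 1% level. An article that only says "the result is significant" leaves out which of these it means.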