A Failed Experiment

23 Sep 2014

Data Analysis IPython Ice Cream Crimes

(Data and computations for this post can be found here.)

Correlating Ice Cream Sales and Violent Crime Rates

In my last post, I said that several textbooks use the correlation between ice cream sales and violent crime rates to show that correlation does not imply causation. However, every textbook and article I found on the subject only stated this correlation and did little to back it up.

When a source did include evidence that ice cream sales and violent crime rates are positively correlated, the proof was rather weak. The data for ice cream sales usually came from business reports (like this one) that only share their data for a steep price. And most of the crime reports (like this one) used to back up the claim are very location-specific (usually within a single city), and the ice cream sales they were compared against were not equivalently localized.

Wanting to do a thorough job in writing that post, I felt compelled to show this correlation myself.

Hunting for the Data

The entire process of finding suitable data sets proved to be several orders of magnitude harder than I expected.

Initially, I tried finding current US ice cream sales data, but the only sources I could find charged a lot of money to get access, and this blog isn’t a huge cash cow that affords me thousands of dollars to buy a single data set. I also tried open sources, but I never found any data that was recent enough or complete enough to be worth using. The only semi-usable ice cream data I found (from Ch 4 of this textbook) was from the 1950s, but hey, beggars can’t be choosers.

So then I went on to try finding crime rates for the same time frame as the ice cream data. However, all the credible sources I could find only had crime statistics for more recent years. For example, the FBI’s crime statistics only go back to 1960. So close!

My initial goal was to find recent and US-centric data. Since loosening the recency restriction proved fruitless, I had no choice at this point but to relax the US-centric restriction and go global. I had much better luck going this route, and I found two data sets of recent (1995 to 2010) ice cream sales and homicide rates for a number of countries. Both data sets came from the United Nations.


Crunching the Data

Excited to be done with the arduous search for data sets that matched on geographic location and time frame, I immediately jumped into prepping the data. I merged the two data sets on Year and Country so that I could easily compare, say, ice cream sales with the homicide rate in Albania in 2009.

There were some missing data, though, so I decided to drop all rows with any null values. That left me with 291 rows, which I felt was enough to show the correlations I was looking for.
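The merge-and-drop step described above can be sketched roughly as follows. This is only an illustration in plain Python, not the notebook's actual code: the records are made up, the column names `IceCream` and `Homicide` and the helper functions are hypothetical, and it assumes an inner join (the post does not say which kind of join was used).

```python
# Toy records. Every number here is invented for illustration only.
ice_cream = [
    {"Country": "Albania", "Year": 2009, "IceCream": 1.2},
    {"Country": "Canada", "Year": 2009, "IceCream": 9.5},
    {"Country": "Canada", "Year": 2010, "IceCream": None},  # missing value
]
homicides = [
    {"Country": "Albania", "Year": 2009, "Homicide": 2.7},
    {"Country": "Canada", "Year": 2009, "Homicide": 1.6},
    {"Country": "Canada", "Year": 2010, "Homicide": 1.4},
    {"Country": "Chile", "Year": 2009, "Homicide": 3.7},  # no ice cream row
]


def merge_on_country_year(left, right):
    """Inner-join two lists of dicts on the (Country, Year) pair."""
    index = {(row["Country"], row["Year"]): row for row in right}
    merged_rows = []
    for row in left:
        key = (row["Country"], row["Year"])
        if key in index:
            # Combine the columns of both rows into one record.
            merged_rows.append({**row, **index[key]})
    return merged_rows


def drop_nulls(rows):
    """Keep only the rows in which no column is None."""
    return [row for row in rows if all(v is not None for v in row.values())]


merged = merge_on_country_year(ice_cream, homicides)
clean = drop_nulls(merged)
print(len(merged), len(clean))
```

In pandas, the same idea would be roughly `pd.merge(a, b, on=["Year", "Country"]).dropna()`. Note that dropping every row with any null can quietly shrink the data, so it is worth checking the row count before and after.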

So here I was, all of my data gathered and prepared, ready for that moment of glory when I would see a beautifully correlated graph between ice cream sales and violent crime rates, like I had been promised. I graphed all the data on a scatter plot of homicide rates vs ice cream sales, and was dismayed to see the result looking like this.

Correlation between Ice Cream Sales and Homicides

The clear correlation I was expecting is nowhere to be seen. Using this data, the correlation between ice cream sales and homicides is only 0.135196. This graph is less a proof of the correlation between ice cream sales and violent crime rates than it is an accurate representation of my tears upon realizing my utter failure.

Correlation between Failed Experiments and My Tears
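For reference, a correlation figure like the 0.135196 quoted above is typically a Pearson coefficient. The post does not say which coefficient was computed, so treat that as an assumption. A minimal sketch of the calculation, with a hypothetical `pearson` helper and made-up inputs:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (the unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Made-up numbers: a perfectly linear pair and a noisier pair.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # about 1.0
print(pearson([1, 2, 3, 4], [1, 3, 2, 4]))  # about 0.8
```

A value near 0.14 sits much closer to "no linear relationship" than to either of these toy cases.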

After a bit of reflection, though, I realized that the data had not failed me, but I had failed the data. In my eagerness to show a correlation that I was sure existed just to prove my point, I overlooked a glaring flaw in my choice of data sets.

Since I used global data, the wealth of the included countries varied tremendously. Consider Canada (ranked 11th by GDP) and Albania (ranked 125th by GDP), and imagine how each country's wealth affects its ice cream culture. The wealthier the country, the greater its people's access to ice cream. This global trend was much stronger than the correlation between ice cream sales and homicide rates, causing the data to be heavily weighted to the low-homicide-rate end of the graph, thus drowning out any small correlation there may have been between ice cream sales and homicides.
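To make the masking effect concrete, here is a toy sketch. All numbers are invented, and the two "wealth groups" are a hypothetical simplification rather than anything taken from the UN data. Within each group, ice cream sales and homicide rates rise together, but pooling the groups hides that because the wealthier group has both far more ice cream and far fewer homicides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values: (ice cream sales, homicide rate) per country-year.
wealthy_ice, wealthy_hom = [8, 9, 10, 11], [1.0, 1.2, 1.1, 1.4]
poorer_ice, poorer_hom = [1, 2, 3, 4], [5.0, 5.3, 5.2, 5.6]

within_wealthy = pearson(wealthy_ice, wealthy_hom)
within_poorer = pearson(poorer_ice, poorer_hom)
pooled = pearson(wealthy_ice + poorer_ice, wealthy_hom + poorer_hom)

print("within wealthy group:", round(within_wealthy, 2))
print("within poorer group: ", round(within_poorer, 2))
print("pooled:              ", round(pooled, 2))
```

In this toy case the within-group correlations are strongly positive while the pooled one is negative. That is a more extreme outcome than the weak positive value reported here, but it shows how a between-group difference in wealth can swamp whatever relationship exists inside each group.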

Lessons Learned

Planning ahead can save you a lot of wasted time. If only I had listened to Max Shron’s advice from a previous post of mine, I would have focused more on the planning aspect instead of jumping straight into playing with the data. If I’d given more thought to the global trends of ice cream consumption and wealth being heavily correlated, I could have avoided working with that data the way I did.

Don’t give in to the temptation to bend the data to prove your point. After I felt the defeat of not being able to prove my point, I could have continued working to get the data to tell the story I wanted it to tell. I could have removed the rows that didn’t fit into my narrative. But that would be the antithesis of Data Science and Science in general.

Know when to give up. This again goes back to Max Shron’s advice of focusing on the why of your data project. I realized that maybe it wasn’t necessary to show this correlation when I wasn’t planning on using it for anything. I only needed it to justify my use of the phrase “Ice Cream Crimes” in future posts when pointing out other examples of poor data techniques. I decided it wasn’t worth it to go back to the drawing board at this point and to spend even more hours on a fruitless endeavor that I’d probably start resenting soon. After all, ice cream tastes a lot better than sour grapes anyway.
