Data Science and the traditional sciences share a number of characteristics. Both are built upon statistics. Both depend on the reproducibility of results. Both are efforts to shed falsehoods by way of evidence. In many cases, Data Science depends on other sciences for much of its data. And as scientists contend with more and more data, they will need to draw on Data Science principles as well.
The two communities have so much to offer each other that it’s a little surprising how little they overlap. Some scientists are not interested in statistics, though they often run into walls or analyze their data incorrectly as a result. Some Data Scientists don’t consider the methods by which data was gathered, and can come to erroneous conclusions if there are biases in an unscientifically gathered data set.
A recent article, “Ten Simple Rules for Effective Statistical Practice” by Kass, Caffo, Davidian, Meng, Yu, and Reid, tries to bridge this gap. It’s geared toward helping scientific researchers improve their data analysis. I find it just as useful for Data Scientists to read as a way to understand how Data Science methods can apply in a wide range of scientific fields. I highly recommend reading the whole article, but I’ll reproduce the eponymous ten simple rules here:
- Statistical Methods Should Enable Data to Answer Scientific Questions
- Signals Always Come with Noise
- Plan Ahead, Really Ahead
- Worry about Data Quality
- Statistical Analysis Is More Than a Set of Computations
- Keep it Simple
- Provide Assessments of Variability
- Check Your Assumptions
- When Possible, Replicate!
- Make Your Analysis Reproducible
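
To make a couple of these rules concrete, here is a minimal sketch, not taken from the article, of what "Provide Assessments of Variability" and "Make Your Analysis Reproducible" might look like together in Python. It reports a percentile bootstrap confidence interval alongside a sample mean and fixes the random seed so that rerunning the analysis gives the same interval. The function name `bootstrap_mean_ci`, its parameters, and the sample values are all hypothetical, chosen only for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, low, high) using a percentile bootstrap interval.

    Hypothetical helper for illustration; the seed is fixed so that
    repeated runs produce the same interval.
    """
    rng = random.Random(seed)  # private, seeded generator for reproducibility
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Read the interval endpoints off the sorted resample means.
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), low, high


# Made-up measurements, not real data.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0, 4.9, 5.5]
estimate, low, high = bootstrap_mean_ci(sample)
print(f"mean = {estimate:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval rather than the bare mean tells the reader how much the estimate could move under resampling, and the fixed seed means anyone rerunning the script sees the same numbers. The percentile bootstrap is just one simple choice here; the article itself does not prescribe a particular method.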