Man is a tool-using animal. Without tools he is nothing, with tools he is all.
The human brain alone isn’t capable of storing and performing operations on the amount of data that data projects entail, so tools are a necessity.
In this blog, at least at the beginning, we won’t need the really heavy-lifting tools built for humongous data sets. As the blog progresses, tools will surely be added to and removed from the Data Journeyman toolbelt. This post is just about defining a starting place.
Work in data science isn’t always about generating a complete, functioning program like typical programming is. Often what needs to be captured is a series of operations executed on a dataset. IPython Notebook is a great way to document this process. Many future posts will be accompanied by a notebook to show the transformation process on some data related to the post’s topic.
I’m not going to cover the installation process because it’s been covered so well in other places. To get started, check out this guide.
Running IPython Notebook with the command
    ipython notebook --pylab=inline
will automatically import NumPy and matplotlib for you. The examples in this blog will assume IPython Notebook is run this way.
Here’s an example IPython Notebook that covers some of the tools in this post. Just download the file and run the above command in the folder containing it.
NumPy allows you to perform operations on large, multidimensional arrays. Since most data will fit nicely in this format, having an efficient tool for performing linear algebra and other mathematical functions will be very useful.
See the IPython Notebook for this post for a quick look at the library, and check out this fantastic tutorial for a more in-depth look.
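As a quick taste of what that looks like, here is a minimal sketch. The array values are made up purely for illustration, and NumPy is imported explicitly so the snippet also runs outside a notebook started with the pylab option.

```python
import numpy as np

# A small 2-D array standing in for a dataset:
# 3 rows (samples) by 2 columns (features). Values are illustrative only.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Element-wise arithmetic applies to the whole array, no explicit loop needed.
scaled = data * 10

# Reductions take an axis: axis=0 collapses the rows, giving one mean per column.
column_means = data.mean(axis=0)

# Linear algebra: matrix product of the transpose with the data (a 2x2 result).
gram = data.T.dot(data)

print(scaled)
print(column_means)
print(gram)
```

The same pattern (build an array, then apply whole-array operations) carries over to much larger data.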
matplotlib is a great library for basic plotting and visualization. It comes with several baked-in chart and graph types, from basics like line graphs, bar graphs, histograms, and box plots to more advanced types.
Generating visualizations with it is very easy, but the default graphs typically look rather unpolished. There are many other excellent libraries for data visualization, and many of them make very professional charts, but matplotlib has one of the lowest barriers to entry.
There are some examples in the IPython Notebook.
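For a sense of how little code a basic chart takes, here is a rough sketch that draws a line graph and a histogram side by side. The data is synthetic, the output file name is hypothetical, and the "Agg" backend line is only an assumption so the script can run without a display; inside the notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # file-only backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
rng = np.random.RandomState(0)      # fixed seed so the figure is reproducible
samples = rng.normal(size=1000)     # 1,000 draws from a standard normal
x = np.linspace(0, 2 * np.pi, 100)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line graph of a sine wave.
ax_line.plot(x, np.sin(x))
ax_line.set_title("Line graph")

# Histogram of the samples; hist() also returns the bin counts and edges.
counts, bin_edges, _ = ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("example_plots.png")    # hypothetical file name
```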
pandas is a library for handling tabular data at a higher level than NumPy. It tries to make the process of accessing and manipulating data, even when there are missing values, as easy as possible.
Its basic structures are Series – “1D labeled homogeneously-typed array” – and DataFrame – “General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns”. It has great indexing features and moving window statistics. For more, see the package overview.
Again, check the IPython Notebook for some examples.
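To make those ideas concrete, here is a small sketch touching on a Series with a missing value, a moving window statistic, and a DataFrame with mixed column types. The temperature readings and column names are hypothetical, and the `.rolling()` method assumes a reasonably recent pandas release (older versions exposed this differently).

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; np.nan marks a missing value.
temps = pd.Series([20.0, 21.5, np.nan, 23.0, 22.0],
                  index=pd.date_range("2013-01-01", periods=5),
                  name="temp_c")

# Reductions skip missing values by default.
mean_temp = temps.mean()

# Fill the gap by carrying the previous reading forward.
filled = temps.ffill()

# Moving window statistic: 3-day rolling mean (the first two entries are NaN).
rolling = filled.rolling(window=3).mean()

# A DataFrame whose columns have different types (float and bool).
df = pd.DataFrame({"temp_c": filled, "was_missing": temps.isna()})

# Boolean indexing: keep only the rows warmer than 21 degrees.
hot_days = df[df["temp_c"] > 21.0]

print(mean_temp)
print(rolling)
print(hot_days)
```

Forward-filling is just one way to handle the gap; dropping the row or interpolating would be equally reasonable depending on the data.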
In the short amount of time I’ve spent researching data science, I’ve come across many great resources on the subject. I’m going to put a few here, but this list will be far from comprehensive. Please leave a comment if I’m missing any great references and I’ll add them to the list.
Tutorials, Learning Materials, and Curricula
- The Open-Source Data Science Masters
- A Practical Intro to Data Science
- Udacity and Coursera have a bunch of Statistics and Data Analysis courses
- PyData Videos on Vimeo
- Terms in data science defined in less than 50 words