Data Tools

31 Aug 2014


Tools

Man is a tool-using animal. Without tools he is nothing, with tools he is all.

-Thomas Carlyle

The human brain alone isn’t capable of storing and performing operations on the amount of data that data projects entail, so tools are a necessity.

In this blog, at least at the beginning, we won’t need the really heavy-lifting tools built for humongous data sets. As the blog progresses, tools will surely be added to and removed from the Data Journeyman toolbelt. This post is just about defining a starting place.

IPython Notebook

Work in data science isn’t always about generating a complete, functioning program like typical programming is. Often what needs to be captured is a series of operations executed on a dataset. IPython Notebook is a great way to document this process. Many future posts will be accompanied by a notebook to show the transformation process on some data related to the post’s topic.

I’m not going to cover the installation process because it’s been covered so well in other places. To get started, check out this guide.

Running IPython Notebook with the command ipython notebook --pylab=inline will automatically import NumPy and matplotlib for you. The examples in this blog will assume IPython Notebook is run this way.
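
If you run the notebook without that flag, or run code as a plain script, the same names can be brought in by hand. The sketch below is only a rough approximation of what pylab mode sets up (it also pulls many more names into the namespace), and the variable x is just an illustration.

```python
# Explicit imports that cover the names np and plt used in typical examples.
# In a notebook, plots are usually shown inline via the "%matplotlib inline"
# magic, which is IPython syntax rather than plain Python, so it is left out.
import numpy as np
import matplotlib.pyplot as plt

# Quick check that NumPy is available: five evenly spaced points from 0 to 1.
x = np.linspace(0, 1, 5)
print(x)
```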

Here’s an example IPython Notebook that covers some of the tools in this post. Just download the file and run the above command in the folder containing it.

NumPy

NumPy allows you to perform operations on large, multidimensional arrays. Since most data will fit nicely in this format, having an efficient tool for performing linear algebra and other mathematical functions will be very useful.
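
As a small illustration of array operations and linear algebra, here is a minimal sketch on a made-up 3x2 array. The data and variable names are invented for this example.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) by 2 features (columns).
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Mean of each column (axis=0 collapses the rows).
col_means = data.mean(axis=0)

# Broadcasting subtracts the column means from every row.
centered = data - col_means

# Matrix product of the transposed centered data with itself (a 2x2 result).
gram = np.dot(centered.T, centered)

print(col_means)
print(gram)
```

The whole-array expressions replace explicit Python loops, which is where most of NumPy's efficiency comes from.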

See the IPython Notebook for this post for a quick look at the library, and check out this fantastic tutorial for a more in-depth look.

matplotlib

matplotlib is a great library for basic plotting and visualization. It comes with several built-in chart and graph types, from basics like line graphs, bar graphs, histograms, and box plots to more advanced types.

It’s very easy to generate visualizations with, but the graphs will typically look rather unpolished. There are many other excellent libraries for data visualization, and many of them make very professional charts, but matplotlib has one of the lowest barriers to entry.
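
To give a feel for how little code a basic chart takes, here is a minimal sketch that draws a line graph and a histogram from random data. The data, the variable names, and the output file name example_plots.png are all invented for this example, and the Agg backend is chosen only so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 200 draws from a standard normal distribution.
rng = np.random.RandomState(0)
values = rng.normal(size=200)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line graph of the running total of the draws.
ax_line.plot(np.cumsum(values))
ax_line.set_title("Line graph")

# Histogram of the raw draws; hist returns the bin counts and bin edges.
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Write the figure to an image file (hypothetical file name).
fig.savefig("example_plots.png")
```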

There are some examples in the IPython Notebook.

pandas

pandas is a library for handling tabular data at a higher level than NumPy. It tries to make the process of accessing and manipulating data, even when there are missing values, as easy as possible.

Its basic structures are Series – “1D labeled homogeneously-typed array” – and DataFrames – “General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns”. It has great indexing features and moving window statistics. For more, see the package overview.
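
The sketch below touches each of those points on a tiny made-up table: a DataFrame, a Series pulled from it, a missing value, and a moving window statistic. The column names and values are invented, and the rolling call assumes a reasonably recent pandas release (older versions exposed moving windows through different functions).

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing temperature.
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp": [20.0, np.nan, 24.0, 22.0],
})

# Selecting one column of a DataFrame gives a Series.
temps = df["temp"]

# mean() skips missing values by default.
mean_temp = temps.mean()

# Replace the missing value with that mean.
filled = temps.fillna(mean_temp)

# Moving average over a window of 2 rows; the first entry has no full window.
rolling = filled.rolling(window=2).mean()

print(mean_temp)
print(rolling)
```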

Again, check the IPython Notebook for some examples.

References

In the short amount of time I’ve spent researching data science, I’ve come across so many great resources on the subject. I’m going to put a few here, but this list will be far from comprehensive. Please leave a comment if I’m missing any great references and I’ll add them to the list.

Tutorials, Learning Materials, and Curricula

Blogs and Newsletters

Communities
