
# Book Review: Data Analysis with Open Source Tools

Posted by **WingedPanther73**, 01 September 2011

One of the questions that gets asked a LOT on this forum is, "How much math do you need to know to be a good programmer?" The general consensus is usually something along the lines of, "Some algebra, but more always helps." As a mathematician, I never really like that answer, because I know there is so much more that it is good for programmers to be aware of.

I picked up Data Analysis with Open Source Tools about 9 months ago, and have been reading it steadily since then (I keep finding "must read now!" books). This is one of those books that does a great job of illustrating why you should know more than just algebra. It covers the math and programming strategies for solving various problems in data analysis based on real data, with packages that were developed for those types of problems.

It starts with basic descriptive statistics, including discussions of histograms, linear regressions, data plotting, modeling, etc. Almost immediately, the author, Philipp K. Janert, deviates from any statistics book you will find, and discusses the problems with all the standard regression tools. Simply stated, they frequently don't work well on the types of data we encounter in the real world, and computers can give us better models with far more ease than anything that was available more than 30 years ago (most of stats was invented more than 100 years ago). He also gives numerous examples of using Python's NumPy and SciPy libraries to help with analyzing data. GSL (a C library) is also discussed, along with R.
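As a rough illustration of that point (a minimal sketch with made-up data, not an example taken from the book), here is how a single bad data point can throw off an ordinary least-squares line. The helper names and the numbers are hypothetical, and the median-of-pairwise-slopes estimate is just one simple robust alternative I picked for comparison, not necessarily the approach Janert recommends.

```python
import numpy as np

def least_squares_slope(x, y):
    # Ordinary least-squares straight-line fit; polyfit returns [slope, intercept].
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

def median_pairwise_slope(x, y):
    # Theil-Sen style estimate: the median of the slopes over all pairs of points.
    i, j = np.triu_indices(len(x), k=1)
    return np.median((y[j] - y[i]) / (x[j] - x[i]))

# Made-up data: ten points on the exact line y = 2x + 1 ...
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
# ... with one wild outlier standing in for "real world" messiness.
y[9] = 100.0

ols = least_squares_slope(x, y)
robust = median_pairwise_slope(x, y)
print("least squares slope:", ols)
print("median pairwise slope:", robust)
```

The least-squares slope gets dragged well away from 2 by the one outlier, while the median-based estimate stays on the underlying line. It only takes a few lines of NumPy, which is exactly the kind of thing that was impractical by hand.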

That covers the first two sections of the book. In the third section, he discusses data mining. In particular, he goes through the various issues with understanding realistic, multivariate data. This isn't like a stats class, where the problems are straightforward and the results are immediately satisfying. He instead tackles how to explore data with 17 different parameters (wine characteristics) or how to detect clusters of data.

The fourth section discusses a few issues of how data gets used. Most of this work is done for business, so he discusses how business people think, financial concerns of data, and how predictions are made.

Throughout all of this, Philipp makes numerous book recommendations, lists sources for a variety of interesting data sets, and discusses a LOT of math. Basic statistics, calculus, advanced statistics, and linear algebra are discussed frequently. The assumption is that you are either familiar with the concepts, or will study them on your own. If you know nothing of the concepts, you will quickly discover there is a lot that you might be asked to do that requires (gasp!) more math than you were led to believe you needed to know.

Oh, the appendices. There are three of them: programming environments/languages that are available for data analysis, an overview of calculus, and a discussion of how to work with data in general terms (file formats, SQL, where to get your data).

This is a very good book on a complicated field. It makes a point of stating, repeatedly, that real-world data is never as clean as the data in math books. It also makes a fantastic argument that anyone who wants to get a bachelor's degree in computer science should also get at least a minor in math. Will you do this type of analysis all the time? No. On the other hand, being able to analyze data, and being reasonably competent with programming, got me my first coding gig doing database migrations and updating statistical calculation tools.

I still say: learn all the math you can. You never know when it'll come in handy.

Thx for this review,

I read a lot of your reviews (I actually bought SQL Antipatterns and Manage It!, very interesting books, I'm halfway through SQL Antipatterns).

You seem to have good taste in books, and I'm always looking for new books, and sometimes it's hard since most books only repeat the same thing.

But all your reviews are scattered across the blog (not sure this is the right expression, once again, I don't normally speak English, but I mean it can be hard to find them).

So I was thinking it would be great if codecall had a page where you could find all reviews (from you and from other people) separated by subject.

Maybe a separate page or even simply a section on the forum.

Thank you.