Yes – another blog on data science. Crazy, I know. But before you judge me too harshly, let me talk a bit about what data science means to me and what I hope to accomplish with this blog (sorry it is so long). Note: these are my opinions and I make some pretty big generalizations. Also, subject to change.
What is Data Science?
The best definition I have found so far is from Josh Wills:
A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
He expands more on this definition in a Quora post. Read this post! I completely agree with it, and let me explain why.
Traditional Statistics
I used to work as an analyst at an economic consulting firm. I focused mostly on econometrics, which is basically statistics using economic data, and was privileged to work closely with some amazing statisticians. This experience gave me a good look at how statisticians work. The typical process was using static data – meaning that we received data and that data was not going to change – to make inferential models. For example, we might use 10 years of sales data to build a linear regression in order to figure out which variables most affect sales. Our code to build these models didn’t have to worry about changing data, scalability, or being integrated into a product. We would get the code to work for our project, write a report for the client, and never use that code again.
Also, when building these models, our process started with understanding which variables should be important for our model. Data mining was not a good word. Data should not tell you how to build your model; domain knowledge should. Once we had our model, we looked at things like p-values, confidence intervals, and heteroskedasticity.
This is how I view life as a statistician. And I believe consulting firms, insurance companies, and health care firms have been doing this type of stuff for a long time.
How is data science different from statistics?
I think all of the items above are important for a data scientist. A data scientist should understand that too much data mining can bias your results and that domain knowledge is important, among many other things from traditional statistics. Thus, as the quote says, a data scientist is someone who is better at statistics than any software engineer.
But I believe data science, at least at the moment, is more focused on prediction. We use training and test sets to evaluate our models. We throw as many variables as we can at the model and use tools like regularization to help figure out which to use. Data is king. If the data says the butterfly population in Mexico affects gas prices, who are we to argue? We will use models that are very hard to interpret, such as neural networks. These are things that you will not see that often in traditional statistics. But they are powerful tools that have led to some great insights.
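To make the training/test and regularization idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is synthetic (five candidate variables, only the first two actually drive the target), the helper names `soft_threshold`, `fit_lasso`, and `mean_squared_error` are made up, and the penalty of 0.2 is arbitrary. It uses an L1 (lasso-style) penalty fitted by coordinate descent, which is one common form of regularization, not the only one.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, clipping to exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(rows, targets, penalty, sweeps=100):
    """Minimize (1/2n) * squared error + penalty * sum(|w|) by coordinate descent.

    Assumes features and target are roughly centered, so no intercept is fitted.
    """
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual that ignores j
            norm = 0.0  # sum of squares of feature j
            for x, y in zip(rows, targets):
                others = sum(weights[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - others)
                norm += x[j] * x[j]
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


def mean_squared_error(rows, targets, weights):
    errors = [y - sum(w * v for w, v in zip(weights, x))
              for x, y in zip(rows, targets)]
    return sum(e * e for e in errors) / len(errors)


# Synthetic data: only variables 0 and 1 matter; variables 2-4 are pure noise.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
targets = [3 * x[0] - 2 * x[1] + rng.gauss(0, 0.3) for x in rows]

# Hold out the last 50 rows as a test set; fit only on the first 150.
train_rows, test_rows = rows[:150], rows[150:]
train_targets, test_targets = targets[:150], targets[150:]

weights = fit_lasso(train_rows, train_targets, penalty=0.2)
test_mse = mean_squared_error(test_rows, test_targets, weights)

print("weights:", [round(w, 2) for w in weights])
print("test MSE:", round(test_mse, 3))
```

The thing to look at is the weights on the three noise variables, which the penalty should push to zero or very close to it, while the error is measured only on rows the model never saw. In practice you would probably reach for a library rather than hand-rolling this, and you would tune the penalty instead of picking it by hand.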
But as noted above, a data scientist also needs to know his or her statistics. Just running a bunch of models on a bunch of data to maximize test accuracy can get you in trouble if you don’t understand the underlying assumptions of what you are doing. Even Google had some issues with their flu trends. So while data science uses some different methods than statistics, a lot of it is still firmly based on statistics, so a data scientist needs to know his or her stats.
What about this software engineering thing?
In my opinion, this is one of the more exciting aspects of data science. Instead of taking some data, running some models, getting some results, and then making a nice report (more traditional data analytics), a data scientist builds products.
For example, instead of writing a yearly report on sales projections, a data scientist might build a scalable program that takes in real-time sales data, analyzes it, makes some predictions, and then instantly pushes these predictions to a dashboard of some kind. That is powerful. And that requires good, fundamental software engineering skills. Not just the ability to run a regression in R.
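As a toy illustration of that idea, here is a minimal sketch of a prediction engine wrapped up as a small piece of software. The class name `RollingSalesForecaster`, the `publish` callback, and the sales numbers are all hypothetical. The "model" is just a rolling mean and the "dashboard" is a Python list, so this only shows the shape of the thing (data comes in, a prediction goes out immediately), not a scalable system.

```python
from collections import deque
from typing import Callable, Deque, List


class RollingSalesForecaster:
    """Hypothetical engine: forecasts the next sale as the mean of recent sales."""

    def __init__(self, window: int, publish: Callable[[float], None]) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque with maxlen drops the oldest observation automatically
        self._recent: Deque[float] = deque(maxlen=window)
        self._publish = publish

    def observe(self, sale: float) -> float:
        """Take in one new data point, update the forecast, and push it out."""
        self._recent.append(sale)
        forecast = sum(self._recent) / len(self._recent)
        self._publish(forecast)
        return forecast


# Stand-in for a dashboard: each pushed forecast is appended to a list.
dashboard: List[float] = []
forecaster = RollingSalesForecaster(window=3, publish=dashboard.append)

for sale in [100.0, 110.0, 120.0, 130.0]:
    forecaster.observe(sale)

print(dashboard)
```

Passing the publishing step in as a callback keeps the forecasting logic separate from wherever the predictions end up, so the same class could be pointed at a real dashboard or a test harness without changes.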
In my opinion, this is what sets a data scientist apart from a statistician. A data scientist not only understands enough statistics to build a good prediction engine (and understand its shortfalls), but he or she can also incorporate a prediction engine into a code base so it can be used more like a piece of software.
So those are my thoughts on data science. Data science is definitely an evolving field and I am excited to see what it grows into. People have been analyzing data for a long time and I have no doubt that will continue to happen, whether it is called data science or not. I also believe good software engineering skills will continue to be important for data analysts.
So why this blog?
For a long time I have wanted to create a webpage to put up some of my personal data science projects. At first I thought I would mostly do tutorial-type posts, like how to use PCA. But I have found there are already a lot of great data science tutorials on the web. So instead I will focus on putting up posts that explore data that I find interesting. I hope to show how I work through the entire data science process, from defining a question, to getting data, to testing models, and even building a data product if applicable. But if I think a good tutorial is missing on the web, I will try to create it. I will also try to create a list of resources that I have found useful.
This blog will probably be mostly for my benefit. But hopefully some find it interesting and maybe others can critique my methods to help me become better.
If you enjoy a post or would like me to try to create a post on something specific, please let me know! I would love to hear from you!