July 9, 2016

The three most important habits of a data analyst

I've been doing data analysis for almost 15 years -- mostly using Excel and Business Intelligence tools. Since the very first year I have believed that accuracy is the biggest challenge for a data analyst. Accuracy is fundamental because if a calculation result is incorrect, then everything based on it -- visualizations, judgements and conclusions -- becomes irrelevant and worthless. Even performance matters less, because you can sometimes solve a performance problem by throwing more hardware at it, but that will never fix incorrect calculation logic.

Ensuring accuracy is probably the most important skill a data analyst should master. To me, striving for accuracy is a mental discipline developed through constant self-training, rather than something that can be learned overnight. Three practical habits help develop this skill:

1) Sanity checks. These are quick litmus tests that catch grave errors at an early stage. After you get a calculation result for the first time, ask yourself -- does it make sense? Does the order of magnitude look sane? If it's a share (percentage) of something else -- is it reasonably big/small? If it's a list of items -- is it reasonably long/short? It sounds like a no-brainer, but people frequently skip sanity checks.
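As a rough illustration, the questions above can be turned into a small reusable check. This is only a sketch: the function name `sanity_issues` and the expected ranges are my own assumptions, and the ranges in particular have to come from your knowledge of the business, not from the code.

```python
def sanity_issues(total, share, items, expected_total_range, expected_length_range):
    """Return a list of human-readable warnings; an empty list means nothing looks odd.

    All thresholds are supplied by the analyst (hypothetical values in the usage
    below); the function only compares the results against them.
    """
    issues = []

    # Does the order of magnitude look sane?
    low, high = expected_total_range
    if not low <= total <= high:
        issues.append(f"total {total} is outside the expected range {low}..{high}")

    # A share of something else must lie between 0 and 1.
    if not 0.0 <= share <= 1.0:
        issues.append(f"share {share} is not between 0 and 1")

    # Is the list reasonably long/short?
    min_len, max_len = expected_length_range
    if not min_len <= len(items) <= max_len:
        issues.append(f"list has {len(items)} items, expected {min_len}..{max_len}")

    return issues
```

Calling `sanity_issues(1_200_000, 0.35, ["a", "b"], (1e5, 1e7), (1, 100))` returns an empty list, while a total of 12, a share of 1.7 and an empty list of items would each produce a warning.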

2) Full assumption testing. In my experience this is the habit most often overlooked by beginner analysts. Assumptions should not be opinions; they must be verified facts. "We're told that field A has unique keys" -- verify it by trying to find duplicate values in it. "Field B has no nulls" -- again, verify it by counting nulls or checking data model constraints (where applicable). "They said that Gender is encoded with M and F" -- verify it by counting distinct values in field Gender. Whatever assumptions are used for filtering, joining or calculation -- absolutely all of them must be tested and confirmed prior to doing anything else. Once you develop this habit you will be surprised how often assumptions turn out to be wrong. A good data analyst can spend a few days verifying assumptions before even starting to analyze the data itself. Sometimes assumptions are implicit -- e.g. when we compare two text fields we usually implicitly assume that neither of them has special symbols or trailing spaces. A good data samurai is able to see the invisible: to recognize implicit assumptions and test them explicitly.
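The checks described above are simple enough to script. Here is a minimal sketch in plain Python; the sample rows, the field names (`id`, `amount`, `gender`) and the helper function names are all hypothetical and only stand in for whatever your real data looks like. In practice the same checks are often a few lines of SQL or a pivot table.

```python
from collections import Counter

# Hypothetical sample rows with deliberately planted problems.
rows = [
    {"id": 1, "amount": 10.0, "gender": "M"},
    {"id": 2, "amount": None, "gender": "F"},
    {"id": 2, "amount": 7.5, "gender": "f "},
]

def duplicate_values(rows, field):
    """'Field A has unique keys': return every value that occurs more than once."""
    counts = Counter(row[field] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

def null_count(rows, field):
    """'Field B has no nulls': count the rows where the field is None."""
    return sum(1 for row in rows if row[field] is None)

def distinct_values(rows, field):
    """'Gender is encoded with M and F': count each distinct value."""
    return Counter(row[field] for row in rows)

def untrimmed_values(rows, field):
    """Implicit assumption: text values carry no leading or trailing whitespace."""
    return [
        row[field]
        for row in rows
        if isinstance(row[field], str) and row[field] != row[field].strip()
    ]
```

On the sample rows, `duplicate_values(rows, "id")` reports that the key 2 occurs twice, `null_count(rows, "amount")` finds one null, and both `distinct_values(rows, "gender")` and `untrimmed_values(rows, "gender")` expose the stray value `"f "`.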

3) Double calculation. This habit is sometimes overlooked even by experienced analysts, probably because it can require rather tedious effort. It is about creating an alternative calculation, often in a different tool -- typically Excel. The point is to test the core logic, so the alternative calculation can use only a subset of the original data and need not cover minor cases. The results of the alternative calculation should be equal to the results of the main calculation logic, regardless of whether the latter is done in SQL or in some BI/ETL tool.
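To make the idea concrete, here is a small sketch of a double calculation. It is a simplification: both calculations are written in Python for brevity, whereas in real life the alternative one would more likely live in a different tool. The order data, field names and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical order lines.
orders = [
    {"region": "North", "qty": 3, "price": 9.99},
    {"region": "South", "qty": 1, "price": 120.00},
    {"region": "North", "qty": 2, "price": 15.50},
    {"region": "South", "qty": 4, "price": 2.25},
]

def revenue_by_region(orders):
    """Main calculation: revenue for every region in one grouped pass."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["qty"] * order["price"]
    return dict(totals)

def revenue_for_region(orders, region):
    """Alternative calculation: filter one region first, then sum its lines."""
    subset = [order for order in orders if order["region"] == region]
    return math.fsum(order["qty"] * order["price"] for order in subset)

def results_match(orders, region, tolerance=1e-9):
    """Compare both results for one region, allowing for floating-point noise."""
    main = revenue_by_region(orders)[region]
    alternative = revenue_for_region(orders, region)
    return math.isclose(main, alternative, abs_tol=tolerance)
```

The alternative path covers only one region at a time, which is exactly the point: it checks the core logic on a subset. If `results_match(orders, "North")` ever returns `False`, one of the two calculations is wrong and it is time to find out which.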

Let the Accuracy be with you.