Data Science

2 Mutch Geek


[Cross-posted from ICanHazDataScience]

I’ve been descending into a lot of Python hints lately, and not so much on the data analysis side of things.  Whilst I finish writing up an example analysis involving poo (yes, really!), here’s something I wrote a while ago to a young aspiring data scientist…

What it takes to be a data scientist, most of all, is curiosity, persistence, intelligence and the ability to tell a story: that drive to understand what’s wrapped up inside the data, to learn whichever tools it takes to get that understanding, and the skills to explain it to others.

It also helps to have a strong technical background, because we haven’t yet developed the user-friendly tools and training that non-technical data scientists need (we’re working on it!).  Some knowledge of Bayesian statistics and things like text processing helps a lot, as does – unfortunately – the ability to program in a language like R or Python.  The good news is that some excellent training materials are now available through things like python.org and coursera.org (for the advanced user) and meetups – check meetup.com for ideas (I’m here if you want to look at the groups I belong to).  You can also learn a lot – one heck of a lot – from going to some of the data science hackathons like the ones run by DataKind, helping out with the drudge work like data cleaning for the teams, and listening to the techs talking about what they’re doing (being the person who writes up the team webpage text and presentation is a very good way to learn and get geek gratitude points too).

I suspect you were probably looking for me to give you a list of skills that a data scientist needs to learn, but it’s not like that: data science is about storytelling from data, and right now your personality counts for much more than the things that you learnt at school (you will still have to learn things, but you need to start from the right place). Oh, and you also need a very high boredom threshold and a lateral-thinking mind – much of this work is about obtaining, cleaning (removing errors and getting into ‘standard’ formats) and wrestling with data in different (and sometimes unusual) ways. Like all new exploration, it can take an awful long time to do, with much pain along the way, but with equal feelings of triumph at the end.