Infosec, meet data science

Download PDF

I know you’ve been friends for a while, but I hear you’re starting to get closer, and maybe there are some things you need to know about each other. And since part of my job is using my data skills to help secure information assets, it’s time that I put some thoughts down on paper… er… pixels.

Infosec and data science have a lot in common: they’re both about really really understanding systems, and they’re both about really understanding people and their behaviors, and acting on that information to protect or exploit those systems.  It’s no secret that military infosec and counterint people have been working with machine learning and other AI algorithms for years (I think I have a couple of old papers on that myself), or that data scientists and engineers are including practical security and risk in their data governance measures, but I’m starting to see more profound crossovers between the two.

Take data risk, for instance. I’ve spent the past few years as part of the conversation on the new risks from both doing data science and its components like data visualization (the responsible data forum is a good place to look for this): that there is risk to everyone involved in the data chain, from subject and collectors through to processors and end-product users, and that what we need to secure goes way beyond atomic information like EINs and SSNs, to the products and actions that could be generated by combining data points.  That in itself is going to make infosec harder: there will be incursions (or, if data’s coming out from outside, excursions), and tracing what was collected, when and why is becoming a lot more subtle.  Data scientists are also good at subtle: some of the Bayes-based pattern-of-life tools and time-series anomaly algorithms are well-bounded things of beauty.  But it’s not all about DS; also in those conversations have been infosec people who understand how to model threats and risks, and help secure those data chains from harm (I think I have some old talks on that too somewhere, from back in my crisismapping days).

There are also differences.  As a data scientist, I often have the luxury of time: I can think about a system, find datasets, make and test hypotheses and consider the veracity and inherent risks in what I’m doing over days, weeks or sometimes months.  As someone responding to incursion attempts (and yes, it’s already happening, it’s always already happening), it’s often in the moment or shortly after, and the days, weeks or months are taken in preparation and precautions.  Data scientists often play 3d postal chess; infosec can be more like Union-rules rugby, including the part where you’re all muddy and not totally sure who’s on your side any more.

Which isn’t to say that data science doesn’t get real-time and reactive: we’re often the first people to spot that something’s wrong in big streaming data, and the pattern skills we have can both search for and trace unusual events, but much of our craft to date has been more one-shot and deliberate (“help us understand and optimise this system”). Sometimes we realize a long time later that we were reactive (like realizing recently that mappers have been tracking and rejecting information injection attempts back to at least 2010 – yay for decent verification processes!). But even in real-time we have strengths: a lot of data engineering is about scaling data science processes in both volume and time, and work on finding patterns and reducing reaction times in areas ranging from legal discovery (large-scale text analysis) to manufacturing and infrastructure (e.g. not-easy-to-predict power flows) can also be applied to security.

Both infosec and data scientists have to think dangerously: what’s wrong with this data, these algorithms, this system (how is it biased, what is it missing, how is it wrong); how do I attack this system, how can I game these people; how do I make this bad thing least-worst given the information and resources I have available, and that can get us both into difficult ethical territory.  A combination of modern data science and infosec skills means I could gather data on and profile all the people I work with, and know things like their patterns of life and potential vulnerabilities to e.g. phishing attempts, but the ethics of that is very murky: there’s a very very fine line between protection and being seriously creepy (yep, another thing I work on sometimes).  Equally, knowing those patterns of life could help a lot in spotting non-normal behaviours on the inside of our systems (because infosec has gone way beyond just securing the boundaries now), and some of our data summary and anonymisation techniques could be helpful here too.  Luckily much of what I deal with is less ethically murky: system and data access logs, with known vulnerabilities, known data and motivations, and I work with a most wonderfully evil and detailed security nerd.  But we still have a lot to learn from each other.  Back in the Cold War days (the original Cold War, not the one that seems to be restarting now), every time we designed a system, we also designed countermeasures to it, often drawing on disciplines far outside the original system’s scope.  That seems to be core to the infosec art, and data science would seem to be one of those disciplines that could help.