Bursting the right bubbles

[cross-post from Medium]

First, understand the bubble

It’s hard to argue with people if you don’t know where they’re coming from. One way is to ask: engage with people who are vehemently disagreeing with you, find out more about them as people, about their environments and motives. Which definitely should be done, but it also helps to do some background reading…

The Guardian’s started in on this: a round-up of 5 non-liberal articles every week, complete with backgrounder on each author and why the article is important. It doesn’t hurt that some of these authors are friends of friends and therefore maybe approachable with some questions. It’s also worth checking out things like BlueFeed RedFeed.

I’ve taken some flak lately for trying to understand Trump supporters. I’m slowly coming round to amending that to trying to understand Trump voters, especially the ones who voted with their noses held.

Us vs Them

I still can’t be pulled into “them vs us”. God it would be easier sometimes to build barricades and see ‘them’ as evil people supporting an evil leader, but truth is we’re all human, and unless we get out of this together, it’s going to be way beyond hard to get out at all.

I spent the lead-up to the election in a house in Trump country (neighbours with guns and banners about them having guns, and I don’t mean in a jolly countryside let’s-bag-some-pheasants kinda way; Trump signs and flags everywhere including my local pub; that damned hat on people I’d spend time at the bar with) and whilst I wouldn’t necessarily ask for empathy (those flags were on some damn nice houses), I would ask for understanding the narratives and doing some deep soul-searching to see if any of them might be true, because that’s where we start the honest conversations about where we are today.

And yes, I’m going to fight by every nonviolent means possible too — I have much less to lose than others around me (older, no children, nobody who depends on me); I’m hoping that if enough of us fight that way, we’ll never have to fight in the alternative.

Could Ring Theory help?

I’ve been thinking a lot today about people, their interactions, margins, fairness and how to be a better ally, friend and compatriot (thank you Barnaby for making me think hard about this). We humans are complex beasts: en masse it’s hard to apply things like Ring Theory (the “comfort in, dump out” theory for comforting individuals and the layers of people around them that we often talked about in the Crisismappers’ Cancer Survival Chat — yes, there was such a thing, yes there were many people you wouldn’t expect in it, no I didn’t — I was there as an admin/ friend…) because everyone has different pain from different things, especially when that pain is being deliberately seeded in many areas at once. I’m not sure what the answer is, but mutual compassion, respect and walking in the other guys’ shoes have definitely got to be in there somewhere.

Bottom line: understand where other people are coming from. See if you can get them to understand where you are coming from too. Am not talking about the trolls (ignore them), but the people who are genuinely trying to argue with you…

Fake News Isn’t About Truth, It’s About Gaming Belief Systems

[cross-post from Medium]

Thinking about #fakenews. Starting with “what is it”.

* We’re not dealing with truth here: we’re dealing with gaming belief systems. That’s what fake news does (well, one of the things; another thing it does is make money from people reading it), and just correcting fake news is aiming at the wrong thing. Because…
* Information leaves traces in our heads, even when we know what’s going on. If I jokingly tell you that I’ve crashed your car, then go ‘ha ha’, you know that I didn’t crash your car, but I’ve left a trace in your head that I’m an unsafe driver. The bigger the surprise of the thing you initially believe, the bigger the trace it leaves (this is why I never make jokes like that).
* That’s important because #fakenews isn’t about the thing that’s being said. It’s about the things that are being implied. Always look for the thing being implied. That’s what you have to counter.
* Some of those things are, e.g. “Liberals are unpatriotic”. “Terrorists are a real and present threat *to you*”. Work out counters for these, and mechanisms for those counters. F’example: wearing US flags at protests and being loudly patriotic whilst standing up for basic rights is a good idea.
* Yes, straighten the record, but you’re not aiming at the person (or site) spouting fake news. What you *are* trying to change is their readers’ belief in whether something is true.
* America is a big country. Not everyone can go and see what’s true or not. Which means they have to trust someone else to go look for them. The Internet is even bigger. Some of the things on it (e.g. beliefs about other people’s beliefs) don’t have physical touchpoints and are impossible to confirm or deny as ‘truth’.
* Which means you’re trying to change the beliefs of large groups of people, who have a whole bunch of trust issues (both overtrust for in-group, and serious distrust of out-group people) and no direct proof.
* You know who else hacks trust and beliefs in large groups? Salesmen and advertisers. Learn from them (oh, and propagandists, but you might want to be careful what you learn there).
* People often hold conflicting beliefs in their heads (unless they’re Aspie: Aspies have a hard time doing this). Niggling doubts are levers, even when people are still being defensive and doubling-down on their stated beliefs. Look for the traces of these.
* But go gentle. Create too much cognitive dissonance, and people will shut down. Learn from the salesmen on this.
* People are more likely to trust people they know. Get to know the people whose beliefs you want to change (even if it means hanging out in conservative chat channels). Also know that your attention is a resource: learn to distinguish between people who are engaged and might listen (hint: they’re often the ones shouting at you), people who won’t, and sock puppets.
* More advertising tricks: look for influencers (not just on Twitter ‘cos it’s easy goddammit; check in the real world too). There’s only one you: use that you wisely.

Some reading:
* A field guide to earthlings (the Aspie reference)
* Social psychology: a very short introduction

How to culture jam a populist

The Internet is made of beliefs

[cross-post from Medium]

“Most people don’t have the time or headspace to handle IW: we’re going to need to tool up. Is not much, but I’m talking next month on belief, and how some of the pre-big-data AI tools and verification methods we used in mapping could be useful in this new (for many) IW world… am hoping it sparks a few people to build stuff.” — me, whilst thoroughly lost somewhere in Harlem.

Dammit. I’ve started talking about belief and information warfare, and my thoughts looked half-baked and now I’m going to have to follow through. I said we’d need to tool up to deal with the non-truths being presented, but that’s only a small part of the thought. So here are some other thoughts.

1) The internet is also made of beliefs. The internet is made of many things: pages and comment boxes and ports and protocols and tubes (for a given value of ‘tubes’). But it’s also made of belief: it’s a virtual space that’s only tangentially anchored in reality, and to navigate that virtual space, we all build mental models of who is out there, where they’re coming from, who or what to trust, and how to verify that they are who they say they are, and what they’re saying is true (or untrue but entertaining, or fantasy, or… you get the picture).

2) This isn’t new, but it is bigger and faster. The US is a big country; news here has always been either hyperlocal or spread through travelers and media (newspapers, radio, telegrams, messages on ponies). These were made of belief too. Lying isn’t new; double-talk isn’t new; what’s new here is the scale, speed and number of people that it can reach.

3) Don’t let the other guys frame your reality. We’re entering a time where misinformation and double-talk are likely to dominate our feeds, and even people we trust are panic-sharing false information. It’s not enough to pick a media outlet or news site or friend to trust, because they’ve been fooled recently too; we’re going to have to work out together how best to keep a handle on the truth. As a first step, we should separate out our belief in a source from our belief in a piece of information from them, and factor in our knowledge about their potential motivations in that.

4) Verification means going there. For most of us, verification is something we might do up front, but rarely do as a continuing practice. Which, apart from making people easy to phish, also makes us vulnerable to deliberate misinformation. We want to believe stuff? We need to do the leg-work of cross-checking that the source is real (crisismappers had a bunch of techniques for this, including checking how someone’s social media profile had grown, looking at activity patterns), finding alternate sources, getting someone to physically go look at something and send photos (groups like findyr still do this). We want to do this without so much work every time? We need to share that load; help each other out with #icheckedthis tags, pause and think before we hit the “share” button.

5) Actions really do speak louder than words. There will most likely be a blizzard of information heading our way; we will need to learn how to find the things that are important in it. One of the best pieces of information I’ve ever received (originally, it was about men) applies here: “ignore everything they say, and watch everything they do”. Be aware of what people are saying, but also watch their actions. Follow the money, and follow the data; everything leaves a trace somewhere if you know how to look for it (again, something that perhaps is best done as a group).

6) Truth is a fragile concept; aim for strong, well-grounded beliefs instead. Philosophy warning: we will probably never totally know our objective truths. We’re probably not in the matrix, but we humans are all systems whose beliefs in the world are completely shaped by our physical senses, and those senses are imperfect. We’ll rarely have complete information either (e.g. there are always outside influences that we can’t see), so what we really have are very strong to much weaker beliefs. There are some beliefs that we accept as truths (e.g. I have a bruise on my leg because I walked into a table today), but mostly we’re basing what we believe on a combination of evidence and personal viewpoint (e.g. “it’s not okay to let people die because they don’t have healthcare”). Try to make both of those as strong as you can.

I haven’t talked at all about tools yet. That’s for another day. One of the things I’ve been building into my data science practice is the idea of thinking through problems as a human first, before automating them, so perhaps I’ll roll these thoughts around a bit first. I’ve been thinking about things like perception, e.g. a camera’s perception of a car color changes when it moves from daylight to sodium lights, and adaptation (e.g. using other knowledge like position, shape and plates) and actions (clicking the key) and when beliefs do and don’t matter (e.g. they’re usually part of an action cycle, but some action cycles are continuous and adaptive, not one-shot things), how much of data work is based on chasing beliefs and what we can learn from people with different ways of processing information (hello, Aspies!), but human first here.

Infosec, meet data science

I know you’ve been friends for a while, but I hear you’re starting to get closer, and maybe there are some things you need to know about each other. And since part of my job is using my data skills to help secure information assets, it’s time that I put some thoughts down on paper… er… pixels.

Infosec and data science have a lot in common: they’re both about really really understanding systems, and they’re both about really understanding people and their behaviors, and acting on that information to protect or exploit those systems.  It’s no secret that military infosec and counterint people have been working with machine learning and other AI algorithms for years (I think I have a couple of old papers on that myself), or that data scientists and engineers are including practical security and risk in their data governance measures, but I’m starting to see more profound crossovers between the two.

Take data risk, for instance. I’ve spent the past few years as part of the conversation on the new risks from both doing data science and its components like data visualization (the responsible data forum is a good place to look for this): that there is risk to everyone involved in the data chain, from subject and collectors through to processors and end-product users, and that what we need to secure goes way beyond atomic information like EINs and SSNs, to the products and actions that could be generated by combining data points.  That in itself is going to make infosec harder: there will be incursions (or, if data’s coming out from outside, excursions), and tracing what was collected, when and why is becoming a lot more subtle.  Data scientists are also good at subtle: some of the Bayes-based pattern-of-life tools and time-series anomaly algorithms are well-bounded things of beauty.  But it’s not all about DS; also in those conversations have been infosec people who understand how to model threats and risks, and help secure those data chains from harm (I think I have some old talks on that too somewhere, from back in my crisismapping days).

There are also differences.  As a data scientist, I often have the luxury of time: I can think about a system, find datasets, make and test hypotheses and consider the veracity and inherent risks in what I’m doing over days, weeks or sometimes months.  As someone responding to incursion attempts (and yes, it’s already happening, it’s always already happening), it’s often in the moment or shortly after, and the days, weeks or months are taken in preparation and precautions.  Data scientists often play 3d postal chess; infosec can be more like Union-rules rugby, including the part where you’re all muddy and not totally sure who’s on your side any more.

Which isn’t to say that data science doesn’t get real-time and reactive: we’re often the first people to spot that something’s wrong in big streaming data, and the pattern skills we have can both search for and trace unusual events, but much of our craft to date has been more one-shot and deliberate (“help us understand and optimise this system”). Sometimes we realize a long time later that we were reactive (like realizing recently that mappers have been tracking and rejecting information injection attempts back to at least 2010 – yay for decent verification processes!). But even in real-time we have strengths: a lot of data engineering is about scaling data science processes in both volume and time, and work on finding patterns and reducing reaction times in areas ranging from legal discovery (large-scale text analysis) to manufacturing and infrastructure (e.g. not-easy-to-predict power flows) can also be applied to security.
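To make “spotting that something’s wrong in streaming data” a bit more concrete, here is a minimal sketch of one of the simplest time-series anomaly checks: a rolling z-score. It is an illustration only, not a description of any tool mentioned here; the function name, the window size, the threshold and the made-up login counts are all my assumptions.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(counts, window=5, threshold=3.0):
    """Return the indices of values that sit more than `threshold`
    standard deviations from the mean of the `window` values before them."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Completely flat history: treat any change as unusual.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical logins-per-minute series with one obvious spike.
logins_per_minute = [12, 11, 13, 12, 14, 12, 13, 95, 12, 11]
print(rolling_zscore_anomalies(logins_per_minute))  # prints [7]
```

Note that the spike then inflates the next few windows, so smaller oddities straight after it would be missed; real systems usually use more robust baselines (e.g. medians) for that reason.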

Both infosec and data scientists have to think dangerously: what’s wrong with this data, these algorithms, this system (how is it biased, what is it missing, how is it wrong); how do I attack this system, how can I game these people; how do I make this bad thing least-worst given the information and resources I have available, and that can get us both into difficult ethical territory.  A combination of modern data science and infosec skills means I could gather data on and profile all the people I work with, and know things like their patterns of life and potential vulnerabilities to e.g. phishing attempts, but the ethics of that is very murky: there’s a very very fine line between protection and being seriously creepy (yep, another thing I work on sometimes).  Equally, knowing those patterns of life could help a lot in spotting non-normal behaviours on the inside of our systems (because infosec has gone way beyond just securing the boundaries now), and some of our data summary and anonymisation techniques could be helpful here too.  Luckily much of what I deal with is less ethically murky: system and data access logs, with known vulnerabilities, known data and motivations, and I work with a most wonderfully evil and detailed security nerd.  But we still have a lot to learn from each other.  Back in the Cold War days (the original Cold War, not the one that seems to be restarting now), every time we designed a system, we also designed countermeasures to it, often drawing on disciplines far outside the original system’s scope.  That seems to be core to the infosec art, and data science would seem to be one of those disciplines that could help.

Notes from John Sarapata’s talk on online responses to organised adversaries

John Sarapata (@JohnSarapata) = head of engineering at Jigsaw  (= new name for Google Ideas).  Jigsaw = “the group at Google that tries to help users facing organized violence and oppression”.  A common thread in their work is that they’re dealing with the outputs from organized adversaries, e.g. governments, online mobs, extremist groups like ISIS.
One example project is redirectmethod.org, which looks for people who are searching for extremist connections (e.g. ISIS) and shows them content from a different point of view, e.g. a user searching for travel to Aleppo might be shown realistic video of conditions there. [IMHO this is a useful application of social engineering in a clear-cut situation; threats and responses in other situations may be more subtle than this (e.g. what does ‘realistic’ mean in a political context?).]
The Jigsaw team is looking at threats and counters at 3 levels of the tech stack:
  • device/user: activities are consume and create content; threats include attacks by governments, phishing, surveillance, brigading, intimidation
  • wire: activities are find then transfer; threats include DNS hijacking, TOR bridge probes
  • server: activities are hosting; threats include DDOS
[They also appear to be looking at threats and counters on a meta level (e.g. the social hack above).]
Examples of emergent countermeasures outside the team include people bypassing censorship in Turkey by using Google’s public DNS, and people in China after the 2008 Szechwan earthquake posting images of school collapses and investigating links between these (ultimately leading to finding links between collapses, school contractors using substandard concrete and officials being bribed to ignore this) despite government denial of issues.  These are about both reading and generating content, both of which need to be protected.
There are still unsolved problems, for example communications inside a government firewall.  Firewalls (e.g. China’s Great Firewall) generally have slow external pipes with internal alternatives (e.g. Sina Weibo), so people tend to consume information from inside. Communication of external information inside a firewall isn’t solved yet, e.g. mesh networks aren’t great; the use of thumb drives to share information in Cuba was one way around this, but there’s still more to do.  [This comment interested me because that’s exactly the situation we’ve been dealing with in crises over the past few years: using sneakernet/ mopeds,  point-to-point, meshes etc., and there may be things to learn in both directions.]
Example Jigsaw projects and apps include:
  • Unfiltered.news, still in beta: creates a knowledge graph when Google scans news stories (this is language independent). One of the cooler uses of this is being able to find things that are reported on in every country except yours (e.g. Russia, China not showing articles on the Panama Papers).
  • Anti-phishing: team used stuff from Google’s security team for this, e.g. using Password Alert (alerts when user e.g. puts their company password into a non-company site) on Google accounts.
  • Government Attack Warning. Google can see attacks on gmail, google drive etc accounts: when a user logs in, Google displays a message to them about a detected attack, including what they could do.
  • Conversation AI. Internet discussions aren’t always civil, e.g. 20-25 governments including China and Russia have troll armies now, amplified by bots (brigading); conversation AI is machine classification/detection of abuse/harassment in text; the Jigsaw team is working on machine learning approaches together with the youtube comment cleanup team.  The team’s considered the tension that exists between free speech and reducing threats: their response is that detection apps must lay out values, and Jigsaw values include that conversation algorithms are community specific, e.g. each community decides its limits on swearing etc.; a good example of this is Riot Games. [This mirrors a lot of the community-specific work by community of community groups like the Community Leadership Forum].  Three examples of communities using Conversation AI: a Youtube feature that flags potential abuse to channel owners (launching Nov 2016). Wikipedia: flagging personal attacks (e.g. you are full of shit) in talk pages (Wikipedia has a problem with falling numbers of editors, partly because of this). New York Times: scaling existing human moderation of website comments (NYT currently turns off comments on 90% of pages because they don’t have enough human moderators). “NYT has lots of data, good results”.  Team got interesting data on how abuse spreads after releasing a photo of women talking to their team about #gamergate, then watching attackers discuss online (4chan etc) who of those women to attack and how, and the subsequent attacks.
  • Firehook: censorship circumvention. Jigsaw has the uProxy plugin for peer-to-peer information sharing across censorship boundaries (article), but needs to do more, e.g. look at the whole ecosystem.  Most people use proxy servers (e.g. VPNs), but a government could disallow VPNs: we need many different proxies and ways to hide them.  Currently using WebRTC for peer-to-peer proxies (e.g. Germany to Turkey using e.g. NAT hole punching), collateral freedom and domain fronting, e.g. GreatFire routing New York Times articles through Amazon and GitHub.  Domain fronting (David Fifield article) uses the fact that e.g. CloudFlare hosts many sites: the user connects to an allowed host, HTTPS encrypts the request, then the encrypted header is used to reach a blocked site on the same host.  There are still counters to this; China first switched off GitHub access (then had to restore it), and used the Great Cannon to counter GreatFire, e.g. every 100th load of Baidu Analytics injects malware into external machines and creates a DDOS botnet. NB the firewall here was on path, not in path: a machine off to one side listens for banned words and breaks connections, but Great Cannon is inside the connection; and with current access across the great firewall, people see different pages based on who’s browsing.
  • DDOS: http://www.digitalattackmap.com/ (with Arbor Networks), shows DDOS in real time. Digital attacks are now mirroring physical ones, and during recent attacks, e.g. Hong Kong’s Umbrella Revolution, Jigsaw protected sites on both sides (using Project Shield, below) because Google thinks some things, like DDOS, are unfair.  Interesting point: trolling as the human equivalent of DDOS [how far can this comparison go in designing potential counters?].
  • Project Shield: reused Google’s PSS (Page Speed Service) to protect news sites, human rights organizations etc. from DDOS attacks. Sites are on Google cloud: can scale up number of VMs used and nginx allows clever uses with reverse proxies, cookie challenges etc.  Example site: Krebs on Security was being DDOSed (nb a DDOS attack on a site costs about $50 online), moved from host Akamai to Google with Project Shield. Team is tracking a Twitter user bragging about this and other attacks (TL;DR: IoT attack, e.g. baby monitors; big botnet, Mirai botnet source code now released, brought down Twitter, Snapchat).  Krebs currently getting about 5 attacks a day, e.g. brute-force, Slowloris, Hulk (bandwidth, syn flood, post flood, cache busting, WordPress pingback etc), and Jigsaw gets the world’s best DDOSers hacking and testing their services.
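[My own toy sketch of the domain fronting idea from the Firehook notes above, not anything shown in the talk. It does no real networking: the hostnames are made up, and the two functions just model the assumption that a censor can only see the unencrypted server name a connection is opened to, while the shared host routes on the encrypted inner header.]

```python
# Toy model only: no sockets, no TLS. All hostnames are hypothetical.
BLOCKED_BY_CENSOR = {"blocked-news.example"}

# Both sites live on the same shared host / CDN.
CDN_SITES = {
    "allowed-shop.example": "shop front page",
    "blocked-news.example": "news front page",
}

def censor_allows(outer_name):
    """The censor sees only the outer, unencrypted server name."""
    return outer_name not in BLOCKED_BY_CENSOR

def cdn_serve(inner_host_header):
    """The shared host decrypts the request and routes on the inner header."""
    return CDN_SITES.get(inner_host_header, "404")

def fetch(outer_name, inner_host_header):
    """Return the page served, or None if the censor drops the connection."""
    if not censor_allows(outer_name):
        return None
    return cdn_serve(inner_host_header)

direct = fetch("blocked-news.example", "blocked-news.example")
fronted = fetch("allowed-shop.example", "blocked-news.example")
print(direct)   # None: the censor sees the blocked name and drops it
print(fronted)  # "news front page": the censor only sees the allowed name
```

[The counters mentioned above follow from the same model: the censor either blocks the whole shared host (the GitHub switch-off) or stops being a passive observer (Great Cannon).]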
Audience questions:
  • Q: protecting democracy in US, e.g. botnets, online harassment etc.? A: don’t serve specific countries but worldwide.
  • Q: google autofill encouraging hatespeech? A: google reflects the world as it is; google search reflects what people do, holds a mirror back up to you. Researching machine learning bias on google suggest results and bias in training data.  Don’t want to censor, but don’t want to propagate bad things.
  • Q: not censor but inform, can you e.g. tell a user “your baby monitor is hacked”? A: privacy issue, e.g. Google connecting kit and emails to IP addresses.
  • Q: people visiting google site to attack… spamming google auto complete, algorithms? A: if google detects people gaming them, will come down hard.
  • Q: standard for “abusive”? how to compare with human? A: is training data, Wikipedia is saying what’s abusive. Is all people in the end.
  • Q: how deal with e.g. misinformation? A: unsolved problem, politically sensitive, e.g. who gets to decide what’s fake and true? censorship and harassment work will take time.
  • Q: why Twitter not on list of content? A: Twitter might not want this, team is resource constrained, e.g. NYT models are useless on youtube because NYT folks use proper grammar and spelling.
  • Q: how to decide what to intervene in? A: e.g. Google takes sides against ISIS, who are off the charts on e.g. genocide.
  • Q: AWS Shield; does Google want to commercialise their stuff? A: no. NB CloudFlare Galileo is also response to google work.
  • Q: biggest emerging tech threat online? A: Brigading, e.g. groups of people and bots. This bridges physical and online, includes e.g. physical threats and violence, and is hard to detect and attribute.

Why am I writing about belief?

[Cross-post from LinkedIn]

I’ve been meaning to write a set of sessions on computational belief for a while now, based on the work I’ve done over the years on belief, reasoning, artificial intelligence and community beliefs. With all that’s happening in our world now, both online and in the “real world”, I believe that the time has come to do this.

We could start with truth. We often talk about ‘true’ and ‘false’ as though they’re immovable things: that every statement should be able to be assigned one of these values. But it’s a little more complicated than that. What we see as ‘true’ is often the result of a judgement we made, given our perception and experience of the world, that a belief is close enough to certain to be ‘true’.

But what if there are no objective truths? In robotics, we talk about “ground truth” and the “god’s eye view” of the world: the knowledge of the world that our robots (or computer vision or reasoning systems) would have if they had perfect information about the world. We talk about things like the “frame problem”, where a system’s ability to reason and act is limited by the “frame” that it has around the world, and the “naughty baby problem” of outside influences that it has no awareness of and cannot plan for. We accept that a robot’s version of “truth” is limited to what it can perceive. But human beings are also limited by their perceptions of the world, by the amount of information available to them. Without going all “Matrix” on you, is it possible that we too are wrong about our “truths”, and we’re not truly objective in reasoning about them because there is no “God’s Eye View” that we can access?

For now, let’s put aside Gödel’s theorem and the ‘undecidable’ sentences like “this sentence is false” that can’t be assigned a true or false value, and think about what happens in a world where we all have only perception and consensus agreements on ‘reality’, and nobody has perfect information. One of the things that happens is that we stop talking about “true” and “false”, and start talking about perception: what we can reasonably believe to be true or false (or undecidable), the uncertainty we have about those beliefs, influence and what it might take in terms of evidence or new information to change them. In psychology, that gets us into the theories of mind and reasoning like cognitive psychology and studies of people like Aspergers individuals who process ‘facts’ differently; in design, into things like social engineering and the theory of change. In maths and AI, that gets us into territories like multi-state logics (which beliefs are possible, necessary etc) and both frequentist (“what happened”) and Bayesian (“what if”) statistics. We might also shift our focus, and talk not of beliefs, but of what we are trying to achieve with them, getting us into theories of actions, influence and decisions (hello, robotics and operational research). There are many theories of uncertainty, but for now probability theory is dominant, so it’s good to spend time with and understand how that works under the hood.
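As a small taste of “under the hood”, here is a minimal sketch of the two views side by side. The scenario (checking how often a source turns out to be right) and all the function names are made up for illustration; the Bayesian half uses a Beta prior on a yes/no outcome, which is one standard textbook choice rather than the only one.

```python
def frequentist_estimate(successes, trials):
    """'What happened': the observed proportion, nothing more."""
    return successes / trials

def beta_update(prior_a, prior_b, successes, failures):
    """Bayesian update of a Beta(a, b) belief about a yes/no rate:
    add observed successes to a and observed failures to b."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Expected value of a Beta(a, b) belief."""
    return a / (a + b)

# Hypothetical: we checked a source 4 times and it was right 3 times.
observed = frequentist_estimate(3, 4)   # 0.75

# Start from a uniform prior, Beta(1, 1): "no idea how reliable it is".
a, b = beta_update(1, 1, 3, 1)          # Beta(4, 2)
belief = beta_mean(a, b)                # 4 / 6, about 0.67

# 'What if': suppose the next two checks both fail.
a2, b2 = beta_update(a, b, 0, 2)        # Beta(4, 4)
what_if = beta_mean(a2, b2)             # 0.5

print(observed, round(belief, 2), what_if)
```

The point of the sketch is the shape of the reasoning rather than the numbers: the Bayesian belief starts somewhere, carries its uncertainty along with it, and moves as evidence arrives.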

Although this may all seem abstract hand-wavey, late-night-discussiony “are we living in the Matrix already”, theories of belief have many practical applications. They’re used heavily in data science, and underpin decisions on things like technology designs (via e.g. AB testing), political information release and propaganda (which are usually not the same thing). We need to talk about that too, because the tools and the terrain available (e.g. the internet) are already more powerful than their current uses. One thing we’re becoming painfully aware of now is how belief functions in groups; the use of influence, multiple separate sources of information and repetition to spread beliefs that compete with each other, and the role of things like desire in those beliefs. We’re also learning that systems we build based on human outputs (e.g. Internet-based AI) have the same biases in belief as the humans; an obvious-but-not-obvious thing that we need to recognise and handle. There are useful theories for this too, ranging from the maths of multiple viewpoints to techniques used to both create and make sense of competing views (ACH, information incest detection, phemes). There are also theories of how humans think in groups, and how they can be persuaded to or do change their minds, both rapidly and slowly over time (e.g. game theory, creativity theory and the study of both human and scientific revolutions).

I’ve spent a lot of my life thinking about and applying the theories above, but I’ve never really got round (apart from the odd note on intelligence, data science or the risks inherent in processing data about people) to writing it all down. The session notes are, I hope, a start on this, and even if nobody else reads them, it’ll be fun to do some targeted thinking around them.

WriteSpeakCode/ PyLadies joint meetup 2015-10-22: Tales of Open Source: rough notes

Pyladies: international mentorship program for female python coders

  • meetup.com, NYC Pyladies
  • Lisa moderating, Panelists: Maia McCormick, Anna Herlihy, Julian Berman, Ben Darnell, David Turner
  • Intros:
    • Maia: worked on Outreachy (formerly OPW) – gives stipends to women and minorities to work on OS code; currently at Spring
    • Anna: works at MongoDB, does a lot of Mongo OS work.
    • Julian: works at Magnetic (ad company); worked on Twisted, started OS project (schema for validating JSON projects)
    • Ben: Tornado maintainer, working on OS distributed database in Go.
    • David: ex FSF, OpenPlans, now at Twitter, “making git faster”.
  • Q: how to find OS projects, how to get started?
    • D: started contributing to Xchat… someone said “wish chat had the following feature”… silence… recently, whatever the company is working on. Advice: find the right project, see if they’re interested, then write the feature.
    • B: started on python interpreter, was using game library, needed bindings for library
    • J: looked at OpenHatch OS projects.  Found Twisted – told that if want to get code in there, there’s a review process. Found feature/bug, wrote patch, waited for response – that got him in… vehicle for other people to read and respond to code.
    • A: first OS commit was to MongoDB – interned there after college. Couldn’t get feature to work on her Mac, fixed it ’til it ran, then someone asked “are you going to put in a core request”… was first experience of request politics.  Hard to find projects that both need help, and want help. Best to contact first, e.g. “are you interested in a fix for OSX”. Most people’s experience of OS has been rejection or a negatively tinged experience.
    • M: first pull request got landed… top-down approach, “how do I get work experience on a big codebase – obvious answer is OS… applied to Outreachy, who have a list of orgs who want donations”… found Gnome Music on the list… iTunes for Gnome… looked at list of beginner-friendly bugs, built that (“approx a million years”) on own machine.  Gnome are particularly newbie-friendly.
    • Outreachy deadline is Nov 2nd.
  • Q: how do you find a project that wants your contribution? (or tips for what to avoid)
    • D: avoid people who are loudly mean (e.g. Linux kernelists).  Responsiveness beyond everything… e.g. friendly community who took a month to fix their instructions… sat on patch for 1-2 months.  Good: active community, can see closed pull requests (linux/git use mailing lists instead, but those are active)
    • B: has a list of newbie-friendly bugs.
    • J: gauge on whether want to use that software or not.
    • A: bugs are best place to start. Filing a bug report tells you a lot about the maintainers, e.g. on it immediately, starting a conversation about it, you can follow the progress of the bug – see the conversations between the contributors, reminds you that there are humans behind it… “any kind of form of life”.
    • M: probably would have started on bpython (shinier ipython), because peer was really excited about it… peer recommendation, people excited about a project = project probably doesn’t suck.
  • Q: suggestions for good places to find lists of welcoming OS projects
    • OpenHatch
    • Hacktoberfest (organised by DigitalOcean) – everyone submitting 4 projects from the list gets a free t-shirt
    • Look at the projects that OS projects include… those tools are also interesting projects.
    • Go to your bosses and ask if you can release the company software as OS.
  • Q: about your projects, features, bugs – something you’d like to share
    • M: dev environment – hard to build these. Long slog through virtual machine (e.g. fedora 2.1 was still in alpha)… lots of patience, and a new computer. Taking notes – wrote everything down, error messages etc so can do on next install, take to project maintainer as suggestions for things to go into instructions.
    • A: pymongo sometimes gets a bug that spirals out of control, and ends up being a python bug (that’s already been reported)… e.g. multiprocessing bug that took time to figure out. Getting a copy of the project is a big step towards actually contributing.
    • J: like perfect storm types of bugs, e.g. json schema had a bug… likes semantic versioning, maintaining backwards compatibility… a release was broken and put out, got bug report 6 hours after release from people in big orgs (e.g. openstack, mediawiki)… tiny detail – pip environment markers – broke the release; lots of people; doesn’t like fixing bugs until have regression test in place = pressure is on… did in 24 hours…
    • B: async in Tornado: async is an iterator returning awaitable objects; Python library asyncio had a different interpretation – trying to mix them, got stack overflows endlessly trying to convert objects. Still an open issue – did a workaround, but other code will have similar problems with it.
    • D: rewrote hash table function in git, git merge started crashing… because of the fix… git index is also called cache and staging area, depending on which part of code you’re in… created nightmares on Macs… weirdass pointer being pulled out from under code whilst still in use – only happened on a Mac on certain large merges… but patch not accepted the first way was written, so rewrote a different way
  • Q: anything your OS projects want help with now?
    • M: has a bug list – look for Gnome Music getting started page. “Gnome Love Bugs” https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=gnome-music;dist=unstable
    • A: lot of MongoDB driver things to work on… any time release, looking for people to test for bugs – finding one = starting a conversation. Drivers have various levels of accessible bugs… MongoDB is too big a place to start in.  And MongoDB is hiring.
    • J: Twisted has tedious but beginner-friendly work; J has proof of concept projects that wrote parts he needed (e.g. docker python bindings are literal translations of command line commands –  can jump in and extend out that library), etc. code lives on github https://github.com/Julian… not heavily organised.
    • B: cockroachdb has well-organised bug list. https://github.com/cockroachdb/cockroach – can talk to B about stuff that’s not well-organised.
    • D: git doesn’t have a public bug list, but can look at unit tests and see known failures… need to ask git if they’re things that people care about.  Also e.g. “git rm” removes entire account? (is not filed yet). (all panelists are hiring!)
  • Audience questions:
  • Q: Dropbox might be a good starter project.
  • Q: Setting aside time to work on OS? A: motivated by other people – find someone interested in working on a project. Take advantage of frustration – immediately after frustration, try to work something out.
  • Q: How do you deal with ownership in companies based on OS? Ordinary employee = work for hire. Contract employee = 20-point test, but can override that in the contract. Ownership matters if you want to enforce the license – need copyrights to do this.
  • Q: licenses? Apache vs MIT vs GPL? Prefer for most things copyleft (e.g. GPL), otherwise adoption. More permissive, e.g. Apache, MIT. But use FSF-approved license, e.g. Apache, MIT or GPL.

Looking at data with Python: Matplotlib and Pandas

I like python. R and Excel have their uses too for data analysis, but I just keep coming back to Python.

One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper, is to have a good look at what’s in it.  Two python libraries are useful here: Pandas and Matplotlib.

  • Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis.
  • Matplotlib is John D Hunter’s library for Matlab-style plots of data.

Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window.   It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn. I’ll meander through a few of them here.  I’ll be using Nepal medical shipments data as an example.

Reading in data

Pandas makes this easy. Reading in a CSV file is as simple as:

df = pd.read_csv(csvfilename) # Comma-separated file
df = pd.read_csv(csvfilename, sep='\t') # Tab-separated file

There’s also pd.read_json, pd.read_html, pd.read_sas, pd.read_stata and pd.read_sql_table to read in other data formats.  Be careful with read_html though: it only reads in HTML tables (it returns a list of dataframes, one per table), and it needs a parser such as lxml or beautifulsoup installed to do even that. If you want anything else from a webpage, you’ll need to use lxml, beautifulsoup or suchlike directly.
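
Here’s a minimal sketch of the two read_csv calls above, using a small in-memory table rather than a file so it runs anywhere. The column names and values are invented for illustration; they are not from the Nepal dataset.

```python
import io
import pandas as pd

# Made-up sample data standing in for a real CSV file
csv_text = "recipient,item,value\nClinic A,Bandages,120\nClinic B,Gloves,80\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))            # comma-separated
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated

print(df_csv.shape)           # (rows, columns)
print(df_csv.equals(df_tsv))  # both parse to the same dataframe
```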

First-look at the dataframe

I like to know what I’m dealing with before starting analysis.  I usually use Tableau or R for this, but that’s not always possible, and Pandas is a good alternative.

df.columns # List all the column headings
df.head(4) # The first 4 rows of data, same as df.head(n=4)
df[['column1','column2', 'column3']].head(10) # Just some of the columns
df.describe() # Basic statistics for every numerical column

That tells you what your columns are, what your first few rows look like (df.tail(4) will give the last rows) and some basic statistics for numerical columns, but you’re probably more curious than that.


value_counts (e.g. df['column1'].value_counts()) will tell you what’s in a single column.  If you want to know what’s in a pair or combination of columns, you’ll need to start using pivot tables or groupby.
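
A quick sketch of both ideas on a tiny made-up dataframe (the column names are placeholders, and the data is invented): value_counts for one column, then groupby with size() for combinations of two columns.

```python
import pandas as pd

# Invented example data; column names are placeholders
df = pd.DataFrame({
    "columnx": ["a", "a", "b", "b", "b"],
    "columny": ["u", "v", "u", "u", "v"],
})

# How often each value appears in one column, most frequent first
x_counts = df["columnx"].value_counts()

# How often each (x, y) combination appears
pair_counts = df.groupby(["columnx", "columny"]).size()

print(x_counts)
print(pair_counts)
```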

You might know pivot tables from Excel.  They’re ways of creating a new datatable whose rows, columns and content are defined by you.  This function, for example, gives you a new table whose rows are column x values, columns are column y values, and contents are the number of rows that contained those combinations of x and y values.

x_by_y = df.pivot_table(index='columnx',columns='columny', values='columnz', aggfunc='count', fill_value=0)

Column z gets involved here because if you don’t nominate a column for the values, Pandas will return an array with the counts for every combination of columns. I’ve included fill_value=0 because I’m counting, and Pandas would otherwise include NaN (not a number) in its counts.
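
To see what fill_value is doing, here’s a small sketch with invented data in which one combination of x and y never occurs. The column names just mirror the placeholders used above.

```python
import pandas as pd

# Invented example data; there is no row with columnx='b' and columny='v'
df = pd.DataFrame({
    "columnx": ["a", "a", "b", "b"],
    "columny": ["u", "v", "u", "u"],
    "columnz": [10, 20, 30, 40],
})

# Without fill_value, combinations that never occur show up as NaN
with_nan = df.pivot_table(index="columnx", columns="columny",
                          values="columnz", aggfunc="count")

# With fill_value=0 they are reported as a count of zero
x_by_y = df.pivot_table(index="columnx", columns="columny",
                        values="columnz", aggfunc="count", fill_value=0)

print(with_nan)
print(x_by_y)
```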

x_by_y is a data frame. You can plot this, for example:

x_by_y.head(10).plot(kind='bar', stacked=True)

You’re now using Matplotlib.  And that was quite a complex plot: a stacked bar chart, with a legend.   Note that .plot creates a plot object: if you want to *see* your plot, you need to type “plt.show()”.  This will put up a plot window and stop your code until you close the window again.
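
If you’d rather not have a window block your script, one option is to save the plot instead of showing it. This is a sketch under a few assumptions: the counts table is invented, the "Agg" backend is chosen so that no display is needed, and the image is written to an in-memory buffer (pass a filename to savefig instead if you want a file).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented counts table standing in for x_by_y
x_by_y = pd.DataFrame({"u": [1, 2], "v": [1, 0]}, index=["a", "b"])

ax = x_by_y.plot(kind="bar", stacked=True)  # returns a Matplotlib Axes
ax.set_ylabel("Number of rows")

buf = io.BytesIO()
ax.figure.savefig(buf, format="png")        # render without plt.show()
plt.close(ax.figure)
```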

Basic data manipulation

I’ve had a look at the dataset, got some ideas for more things to look at in it, and some of them need calculations. Pandas handles this too.  More soon. Meanwhile, here’s some stuff I did with the Nepal dataset.

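# Count shipments for each material family / recipient pair, then plot
pivot1 = df.pivot_table(
 index='Material Hierarchy Family',
 columns='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count')
pivot1.plot(kind='barh', stacked=True)

# Number of rows per recipient
recipientsize = df.groupby('Final Recipient Name').size()

# The same table with recipients as rows
# (aggfunc='count' is an assumption here, chosen to match pivot1)
pivot2 = df.pivot_table(
 columns='Material Hierarchy Family',
 index='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count')

pivot2.plot(kind='bar', stacked=True, legend=False)
plt.show()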



I’ve been thinking today about the singularity: the point at which machines become smarter than humans, about an internet of things so smart that we don’t know how to manage it with our existing software paradigms.  And I wondered: a good manager will already be managing entities that are much smarter than them (because you don’t want your best thinkers doing the paperwork, management is another discipline/ skill etc etc); is it perhaps time to think about how to use those management skills on clusters of machines?

Notes from meetup: data-driven design 2.0 (Data-driven architecture), 2015-08-24



Melissa Marsh on intros and bios… 

  • “Transforming architectural practice series” = thinking differently about the process of arch: tools, practice, how they run their business (leads to thinking differently about product). 
  • Panelists showing how taken on data-led practice changes how arch does their work… incorporating different methodologies, s/m/l/xl data. 
  • Today = moving from data sources and collection to examples within projects, how to set up projects and client relationships differently.  Came out of feedback from June event.  Continuing looking at future of design relationships. 
  • Panelists: 
    • Jeff Ferzoco (linepointpath), 
    • Zak Kostura (ARUP, high performance structures – currently form found roof system for MX city)… thinking about project setup and info sharing and how it’s changing client relationships.
    • Darrick Borowski – on tools and techniques… data-driven design = ask better questions at the beginning… back and forth with Q&A. 
    • Shawn Rickenbacker – intersection of data and design; urban, architectural, interior design: opportunity to link the scales of design and learn from the scales how to apply design. “love of systems, problem-solving skills”. 
    • (Panelists all teach class: uni?)
  • Later:
    • Oct 12th: coworking and the future of architecture.
    • tues pm: measuring architecture (shift from geometric to other measures, including financial and social)… exhibit and event.  
  • Program: 10-min presentations from each panelist, then Q&A on each topic. 

Jeff Ferzoco (@zingbot): 

  • Practice: info design, mapping and experience design.  Tonight: examples of what he does with data.  Practice is almost all mapping at this point.  Looking at 3 problems and how he solved them. 
  • “hardest part about data is understanding what to do with it when you get it”
  • Old job: 8 years at regional scale, 31 NY counties, NYC not as a city, but as counties and larger areas. e.g. America2050 project, on high-speed rail in US… maps, event, work with stakeholders on what HS rail would look like (america2050.org, linepointpath.com).  Train, traffic, cultural data, produced massive-scale maps. 
  • Since then, looking only at neighbourhood and city scale issues. 
    • NYC released open data. Started working with Geonyc, Betanyc on this… data for NYC policy… 
    • 2013, Citibike called clients… opening data, called Sarah Kaufman NYU… wanted to see what people were doing.  Dataset was 4Gb for the year, 1 row per ride (1.5M rides; pickup place, dropoff place, times, casual/member, gender, zipcode of member… more data got released later).  Put into OpenRefine, looked at favourite stops…
    • Next was DOB site… has 4’ profile of all jobs done.  (http://www.nyc.gov/html/dob/html/bis/bis.shtml?).  Jean wanted to know about renovations in the city.  Had database from 2004, lots of human errors.  Pulled into map, can see where all residential renovations have been.  Very messy data.  See sweeten.com http://linepointpath.com/111242/4498744/work/sweeten-renovation-map
    • Current project: finding where all gay nightlife spots have ever been in NYC.  Went through old guidebooks (60s, 70s) about gay life, scraped them… about 7 sources, 2 archives (gay and lesbian centre) – about 1000 gay spots since 1859.
  • Citibike – automatically generated data; DOB = human-generated; gay = historic data.  
  • Need to understand the tools.  
    • Tools better over last 3 years 
    • Favorite tool is CartoDB for mapping.
    • Take data and put through other tools, like Google Refine (“sophisticated pivot tables, non-destructive”), Google Drive.  Tools for big datasets listed on last slide of his presentation.
    • Jeff posting presentation on the meetup.
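
A rough sketch of the “favourite stops” step on one-row-per-ride data, using only the Python standard library. The rows and field names here are invented for illustration; they are not the real Citibike schema, and the talk used OpenRefine rather than code.

```python
import csv
import io
from collections import Counter

# Invented rows in the shape described: one row per ride.
# Field names are hypothetical, not the real schema.
rides_csv = """start_station,end_station,user_type
Station A,Station B,member
Station A,Station C,casual
Station B,Station A,member
"""

starts = Counter()
ends = Counter()
for ride in csv.DictReader(io.StringIO(rides_csv)):
    starts[ride["start_station"]] += 1
    ends[ride["end_station"]] += 1

# Most popular pickup stops, busiest first
print(starts.most_common(2))
```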

Zak Kostura (ARUP)

  • ARUP people – lots of them, different fascinations including data.
  • Excited can get a lot of data, not know why looking at it, but by exploring it, find opportunities wouldn’t otherwise have known. 
  • Arup were doing taxi commission data on rides: 10-15 years of data collected by the commission.  Arup were handed a hard drive of data… taxi commission went to them with the data.  
  • Zak = structural engineer. “never accept data unless I know what I need it for”. Tonight, talking about 2 projects as examples of this. 
  • Need all structural systems to be sized for any event. 
  • First project was Fulton Centre – the net on the inside, alongside Jamie Carter on the engineering of it. 
    • Soft structure, takes the form that it wants to take, like a hammock.  When forces on the net change (e.g. wind), the shape changes too… in design, needed to understand all the changes possible… e.g. smoke exhausts create wind forces, building moves, heat rises and changes it.  About 1000 panels.  Was 2007… needed algorithms could interrogate on the fly. 
    • Can stop the process of having to idealise.. can look at exact systems, not estimates, can aggregate info and not be afraid of it. 
  • Mexico City airport.  Largest roof in the world… giant X… road around it is 2.9 miles. Only 21 touchdown points, big space-frame system. 
    • Principles simple: forces on it, how much force it takes an element to yield, how much force before it buckles… do this calc for each of 1m+ elements, or approximate it (but approx = inefficiency). 
    • Started data-driven… from a database. Used to use Excel, but not think about using a database… excel not designed for e.g. fast v-lookups. 
    • Could only design because knew exactly what the forces would be on each panel. 
    • <SJT: wasn’t the Guggenheim Bilbao designed like this some time ago?>
    • Processing comes down to a pipe, calculated 3/4m times.  To do this in the timeframe, everyone on the team has to be confident interacting with the data, e.g. using the right SQL commands to do this. Currently working with fire teams, architect etc on a database. Hoping to get to a point where can hand the database over to contractors. 
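
To make the “do this calc for each element, from a database” idea concrete, here is a heavily simplified sketch. Everything specific in it is an assumption: the table schema, the numbers, and the checks themselves (plain yield capacity, and Euler buckling for a pin-ended strut). It is not ARUP’s method and is nowhere near a design-code check.

```python
import math
import sqlite3

# In-memory database with a hypothetical schema and invented values
# (SI units: N, m^2, Pa, m^4, m). Compression is a negative force.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE elements (
    id INTEGER PRIMARY KEY, force REAL, area REAL,
    yield_stress REAL, youngs_modulus REAL, inertia REAL, length REAL)""")
db.executemany(
    "INSERT INTO elements VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1,  5.0e5, 4.0e-3, 355e6, 210e9, 8.0e-6, 3.0),  # tension
        (2, -6.0e5, 4.0e-3, 355e6, 210e9, 8.0e-6, 5.0),  # long strut
        (3, -2.0e6, 4.0e-3, 355e6, 210e9, 8.0e-6, 3.0),  # heavy compression
    ],
)

def utilisation(force, area, fy, e_mod, inertia, length):
    """Ratio of demand to capacity; above 1.0 means the element fails."""
    yield_capacity = fy * area
    if force >= 0:
        return force / yield_capacity
    # Euler buckling load for a pin-ended strut
    buckling_capacity = math.pi ** 2 * e_mod * inertia / length ** 2
    return -force / min(yield_capacity, buckling_capacity)

rows = db.execute(
    "SELECT id, force, area, yield_stress, youngs_modulus, inertia, length "
    "FROM elements ORDER BY id")
results = {row[0]: utilisation(*row[1:]) for row in rows}
failing = [eid for eid, u in results.items() if u > 1.0]
print(results, failing)
```

Keeping the element data in a table means the same query can be rerun whenever the forces change, which is the point made above about not having to approximate.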

Darrick Borowski (design director, ARExA)

  • Follow-up to a talk at the AIA.  This time, talking about the tools used to do projects. 
  • “The generative nature of information, or the dirty things people do with data”
  • Tools: Grasshopper, cellular mod, Dijkstra’s, A* etc.
  •  Last time were talking about material experiments, human behavior as a source of data, extracting learnings from natural systems as a data source, and cultural systems as a source of data (and particularly cities), all in Grasshopper. 
  • Grasshopper is a graphical system editor.  Can model behavior in this.  Looking for the learnings that are available in our world (human, natural, cultural systems). https://en.wikipedia.org/wiki/Grasshopper_3D
  • System: data in, computational model, data out.  Use simulation in all 3 aspects of that.  
    • Data in = pounding pavements, streetcorner counters, material experiments, simulation. 
    • Model: create algorithm to process data and map onto problem trying to solve.  Sometimes data is in the natural system that you are extrapolating from.  Key thing is the mapping from what you’re trying to accomplish to the model you’re using.
    • Data out. Can do many things with this. Data in itself, e.g. 3d model.  Data for decision making, e.g. hand back to designer. Data fed back into another algorithm, adding complexity to that solution. 
  • Project: workspace layout based on connectivity that different layouts enabled. 
    • Tied into: the more frequent conversations people have, the more innovation comes out; the closer you are to other people, the more likely you are to communicate (Allen Curve).
    • Put in different design layouts, measured distance between, looked for layout with best average distance. 
    • Desks are agents; agents are sent to every other person in the room, by navigating around obstacles in the room… draw a vector and test options for routes… 
    • Different desk layouts, paths travelled… fed into excel, averaged, extracted average distance travelled.
    • Inefficiencies in building that behavior from scratch: phase 2 = pre-developed algorithms using efficiencies they couldn’t see in phase 1.  E.g. use A* etc for routes (like Google Maps does), and looking at how slime mould aggregates. http://christinehastie.com/2015/01/collaboration-can-learn-slime-mould/
  • Project: greenhouse. 
    • Problem distributing structural nodes across a dome. Looked at phyllotaxis (e.g. how sunflowers distribute seed heads – how to pack things into a space)… extrapolated principles of algorithm, added into Grasshopper (which scales out, repeats) – produced a structural network with equally-sized members over a surface.
    • Examples of data in a natural system serving as a computational model
  • Example: based on caloric intake of average american, land needed to create that, designed city tissues – areas of land to feed people; reaction to fuel crisis, global warming etc. 
    • Showed power of building things up in layers. Complicated algorithm: tempting to pack everything into one Grasshopper algorithm… robotics etc will build one small piece, e.g. walk, and build on that and build on that again to create the systems. 
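
The workspace layout project above (desks as agents, routes around obstacles, average distance per layout) can be sketched in plain Python. This is my own toy version, not the Grasshopper model from the talk: layouts are tiny text grids, "D" is a desk, "#" is an obstacle, movement is in four directions, routes may pass through other desks, and every desk is assumed reachable from every other.

```python
import heapq
from itertools import combinations

def astar(grid, start, goal):
    """Shortest 4-connected path length from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance never overestimates on this grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]
    best = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None

def average_desk_distance(layout):
    """Mean walking distance over every pair of desks in a layout."""
    grid = [list(row) for row in layout]
    desks = [(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch == "D"]
    dists = [astar(grid, a, b) for a, b in combinations(desks, 2)]
    return sum(dists) / len(dists)

open_plan = ["D.D",
             "...",
             "D.D"]
walled = ["D#D",
          "...",
          "D.D"]
print(average_desk_distance(open_plan), average_desk_distance(walled))
```

The layout with the lower average is the better-connected one by this measure; adding the wall lengthens the route between the top two desks and pushes the average up.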

Shawn Rickenbacker, Urban Data + Design.

  • Use Rhino and Grasshopper quite a bit for complex geometries. 
  • Example: designing a wavy, moving wall.
  • “Somehow ended up as data analysts, working with china partnership in downtown manhattan on how tourism were shaping their environment”.
    • One characteristic: knock-off goods; this black market enormously important part of downtown economy. 
    • Video of group of swallows… look choreographed: interested in why they do this: barometric pressure, predators etc? This is the problem of data: understanding why. 
    • Big data is about… tracking individuals… Borrowed S/M/L/XL from Koolhaas
    • R visualisation… R as a useful tool.
    • Tourism used counters (interns) on streets.  Generated heat maps from the dataset (easy to get from R; had timestamps). R still needs human input to recognize the patterns.  
  • User groups: did some work with Sony. 
    • Client understands the user… each one of the data points is a different individual.  Can’t understand the flock because you don’t understand the individual.  Buildings are reacting to the environment, e.g. wind blows – spatial element.
  • Chinatown again:
    • Distributed QR codes across Chinatown.  Borrowed idea from stores showing pictures with QR codes to purchase things instantly on the street. 
    • Client asked “what happens at night?”.  Started projecting QR codes on walls at night… Chinatown has different night and day populations. 
    • Marc Andreessen: “software is eating the world”… shakes at the core of arch, which is about physicality.  SW was designed as a tool to solve physical problems.
    • Got into distributed networks… XBee + Arduino to create a wireless mesh network, can add sensors and cameras to these.
      • Software eating the world: working less with physical architecture, more with non-discrete architecture (e.g. no physical component, like wireless networks). 
      • E.g. QR code as a pass to get into a Nike event.  this is convergence between the digital and physical.  “felt lame using QR codes back in 2012”.  Nike was pop-up event, large size… had huge LED basketball court projected inside… access only gained by using QR codes set up around the event. 
  • How to deal with software eating the world spatially, architecturally. 
    • Video is best way to track the flock of birds.  Programming Xbox Kinect, then manipulating real-time data from it.  Used Processing, Arduino, R; gestures as data compiled by the camera then processed through p/a/r; creates action = movement of swallows.
  • Kimono Labs are good place for data
    • https://www.kimonolabs.com/
    • Can go to any website, scrape data and produce an API from it. 
    • <SJT: can also do this with Google Spreadsheet!>
    • e.g. taking data from video – bus going by, using data for next time bus goes by… creating interactive.  
  • M: “If software is eating the world, architects are making that world more nutritious”? “N… maybe get a lot skinnier”. 
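
The heat maps from timestamped street counts were done in R; as a sketch of the same binning step in Python, here is how counts might be grouped by place and hour before plotting. The locations, timestamps and tallies are invented, and the record layout is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# Invented tallies: (street corner, timestamp, people counted)
observations = [
    ("Corner A", "2015-08-01 10:15", 40),
    ("Corner A", "2015-08-01 10:45", 35),
    ("Corner A", "2015-08-01 21:05", 12),
    ("Corner B", "2015-08-01 10:30", 8),
    ("Corner B", "2015-08-01 21:20", 50),
]

# heat[location][hour] = total people counted in that hour
heat = defaultdict(lambda: defaultdict(int))
for place, stamp, count in observations:
    hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
    heat[place][hour] += count

for place in sorted(heat):
    print(place, dict(sorted(heat[place].items())))
```

Binning by hour is also what would show the different day and night populations mentioned above; a person still has to look at the grid and decide what the pattern means.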

Q&A session with panelists:

  • Q: If dealing with large amounts of data, how do you share raw data and communicate results with other team members (consultants etc)?
    • A: Zak: Newforma, file transfer protocols; lawyers need legal protocol for sending data, e.g. timestamped, who sent, who received: this is barrier for collaborating over data.  e.g. airport model can’t be done in live collaborative fashion because strips have to be packaged and sent.  Tech: set up database server and use that, but lose ability to see who did what and why?  This hampers ability to share effectively.
    • M: need a next version that takes on responsibility and ownership of data by parties, e.g. all becoming stakeholders collaboratively: share blame, but can do great design.
    • Darrick: problem with Grasshopper, are the only people in the team interested in behavior; everyone else wants geometry; tend to bake data into 3d model for rest of team to work with; sometimes geometry, sometimes excel spreadsheet, sometimes evaluation of spreadsheet. All about shared language.
    • Jeff: use SQL to access mass amounts of taxi data; communicating, usually a layer on top, e.g. google chat. BigQuery for large storage; if smaller, then Dropbox (or Box).  
    • Shawn: Autodesk pilot project called Dreamcatcher (=AutoCad Artificial Intelligence). Future of large data affecting design isn’t as large as commerce data. Dreamcatcher premise is collaborative environment, where can see effect of team member changes.  Security: if you have a secure cloud with limited access, get round problem of waiting for save, send, send back etc.
    • M: data volumes; if look at buildings as they’re lived and occupied, will get to consumer-driven sized volumes of data. Important that archs have a thoughtfulness about that data. NB Autodesk is a sponsor of the global self quant annual meeting.