In July I had the amazing opportunity to attend the Digital Methods Summer School at the University of Amsterdam. I’m finally getting around to writing down some of my thoughts, the things I did and the things I learned from the experience. The course was structured so that we worked on two separate projects – one in each of the two weeks.
The fun part was playing around with and learning a bunch of tools like:
- Google Fusion Tables – quick and dirty merging and filtering of spreadsheets (if you don’t need/want SQL in your life, check out Google Fusion Tables). It also has a cool mapping feature if you have geographic data in your set.
- Gephi – the go-to open source network graphing software
- Netvizz for Facebook – developed by the DMI team
- ImagePlot – developed by Lev Manovich’s Software Studies Lab
- Tracker Tracker – developed by the DMI team
Week 1 focused on Secondary Social Media, defined as both the social media platforms that are understudied and those that did not appear on the leaked NSA slides (in other words, beyond GAFA – Google, Amazon, Facebook, Apple). The week kicked off with a series of lectures from Lev Manovich and Limor Shifman, as well as an introductory lecture by Richard Rogers on Digital Methods and social media.
Our project in Week 1 was one of the meme-centered projects inspired by Limor Shifman’s lecture. As a meme expert, she discussed the phenomenon of Internet memes and the ‘hyper-memetic logic’ of contemporary media. Defining an internet meme as “(a) a group of digital items sharing common characteristics of form, content and/or stance; (b) that were created with awareness of each other; and (c) were circulated, imitated and/or transformed via the internet by many users”, she argued that memes are not only for humour and that we should also look at their political dimensions. This is where we started with our case study of the (then-recent) phenomenon of #lovewins as a response to the SCOTUS ruling on same-sex marriage. The full project report is here.
The rest of the week is a bit of a blur, kind of like a week-long hackathon. We spent a lot of time gathering data and using Gephi to look at network graphs, hashtag themes and hashtag co-occurrence. I spent more of my time playing around with R and ImagePlot to see if we could find any visual patterns.
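For anyone curious what counting hashtag co-occurrence looks like in practice, here is a minimal Python sketch. It is not the workflow we used (we worked in Gephi), and the function name and sample hashtags are made up for illustration. The idea is simply to count how often each pair of hashtags shows up in the same post, which gives you the weighted edges of a co-occurrence network.

```python
from collections import Counter
from itertools import combinations


def hashtag_cooccurrence(posts):
    """Count how often each unordered pair of hashtags appears in the same post."""
    pair_counts = Counter()
    for tags in posts:
        # Normalise case and drop duplicates within a post, then sort so that
        # ("a", "b") and ("b", "a") are counted as the same edge.
        unique_tags = sorted({tag.lower() for tag in tags})
        pair_counts.update(combinations(unique_tags, 2))
    return pair_counts


# Hypothetical sample: each inner list holds the hashtags from one post.
sample_posts = [
    ["lovewins", "pride", "scotus"],
    ["LoveWins", "pride"],
    ["lovewins", "equality"],
]

edges = hashtag_cooccurrence(sample_posts)
for (tag_a, tag_b), weight in sorted(edges.items()):
    print(tag_a, tag_b, weight)
```

Each (pair, count) entry could be written out as a row of an edge list with a weight column, which is one of the formats a network graphing tool can import.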
For me, a takeaway from Week 1 was dealing with the scale of the meme (or viral, or meme-like phenomenon … we had lots of debates about what exactly #lovewins is). In the end it was not possible for us to gather all of the data using the Instagram scraper tool, both because of the sheer volume and because a week had already passed since the hashtag first came into use. This proved an interesting contrast to one of Lev Manovich’s recommendations for studying ‘big cultural data’: look at everything at once instead of selectively sampling. What do you do when you can’t capture everything to look at it? And of course, is it ever possible to capture/look at everything? This seems like a core question of archives and appraisal: how you select the set to preserve from some conceptual ‘whole’ (with the classic Jenkinson vs. Schellenberg debates). But I think the problem is somewhat compounded now for archives. Is it possible to catch what happens at the beginning of a phenomenon or trend? And what do you do when starting even a week later means that it’s not possible to scrape the data using existing tools and APIs?
Week 2 was a significant shift in direction, looking at how we can approach digital media empiricism in a post-Snowden world. Our group ended up looking at advertising trackers, digging deeper into the kinds of relationships or ‘ecologies’ of trackers that are not immediately apparent from tools like Ghostery or the DMI-created Tracker Tracker. We started by gathering data from Alexa for the top 100 websites in different categories: adult, education, sports, religion, news, etc. We then used Tracker Tracker to see what kinds of trackers were present on the websites within each category. We did some low-tech human-powered search work to find out the parent company of a given tracker (and sometimes the parent company’s parent company). We also located each of these companies geographically, and used Ghostery’s database of the types of tracking data each tracker collects. Together this gave a bigger picture of who is collecting data, what kind of data is collected, and where these companies are located (which may not indicate where the data is stored, but might indicate the laws or legal frameworks in that jurisdiction).
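The roll-up from individual trackers to parent companies per category can be sketched in a few lines of Python. This is an illustration only: our actual lookup was done by hand, and every site, tracker and company name below is made up, as is the function name.

```python
from collections import defaultdict


def parents_by_category(trackers_by_site, site_category, tracker_parent):
    """Group the parent companies of all detected trackers by website category."""
    result = defaultdict(set)
    for site, trackers in trackers_by_site.items():
        category = site_category[site]
        for tracker in trackers:
            # Trackers we could not trace to a company are grouped as "unknown".
            result[category].add(tracker_parent.get(tracker, "unknown"))
    return dict(result)


# Hypothetical inputs: trackers detected per site, the category each site
# was listed under, and a hand-built tracker-to-parent lookup table.
trackers_by_site = {
    "news-one.example": {"TrackerA", "TrackerB"},
    "news-two.example": {"TrackerB", "TrackerC"},
    "sports-one.example": {"TrackerA", "TrackerD"},
}
site_category = {
    "news-one.example": "news",
    "news-two.example": "news",
    "sports-one.example": "sports",
}
tracker_parent = {
    "TrackerA": "ParentCo One",
    "TrackerB": "ParentCo One",
    "TrackerC": "ParentCo Two",
}

summary = parents_by_category(trackers_by_site, site_category, tracker_parent)
for category, parents in sorted(summary.items()):
    print(category, sorted(parents))
```

The same pattern would extend to a second lookup from parent company to country, if you wanted the geographic view.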
A selection of the maps of tracker counts by category:
I still struggle to talk about our results, but only because this seemed like the preliminary exploratory work of something much bigger to study. The final report got a bit sidetracked too, but team credits are here. We found that there were category-specific trackers, e.g. trackers only found on porn sites. We also found some interesting things like ‘orphan’ trackers, which are abandoned or associated with a now-defunct company but still present on the web. There were also a bunch of different variables that influenced which trackers were found on a site. We tested a few sites using different durations of time elapsed from the initial loading of a page (generally more trackers were found 30 seconds after page load), as well as different devices and VPNs to represent different countries’ IPs.
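Spotting category-specific trackers comes down to a simple set comparison: a tracker is category-specific if it turns up in exactly one category. Here is a minimal Python sketch of that check, with made-up tracker names and a hypothetical function name, not the process we actually followed during the week.

```python
from collections import defaultdict


def category_specific_trackers(trackers_by_category):
    """Return, per category, the trackers that appear in no other category."""
    # First pass: record every category in which each tracker was seen.
    seen_in = defaultdict(set)
    for category, trackers in trackers_by_category.items():
        for tracker in trackers:
            seen_in[tracker].add(category)

    # Second pass: keep only trackers seen in exactly one category.
    specific = defaultdict(set)
    for tracker, categories in seen_in.items():
        if len(categories) == 1:
            only_category = next(iter(categories))
            specific[only_category].add(tracker)
    return dict(specific)


# Hypothetical input: the set of trackers found across each category's sites.
trackers_by_category = {
    "adult": {"TrackerA", "TrackerX"},
    "news": {"TrackerA", "TrackerB"},
    "sports": {"TrackerA", "TrackerB"},
}

specific = category_specific_trackers(trackers_by_category)
for category, trackers in sorted(specific.items()):
    print(category, sorted(trackers))
```

Categories with no unique trackers are simply left out of the result.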
So what does this have to do with preservation and archives? First, I think it’s significant that tracking tools are largely absent from current web archiving projects as we focus more on the ‘content’ of digital objects and not the ways that websites are encountered and experienced. Looking at trackers from an archives perspective brings up further questions:
- Could we consider the data collected by trackers a kind of archive, or record of your web usage, or does this contradict the way that tracking data is sliced and diced and sold to the highest bidder? How can we reconcile this with our understanding of records and government archives, given what the Snowden revelations made apparent about the NSA’s potential access to data from companies like Google?
- Trackers might also be seen as a key factor in the shift towards personalization and web experiences tailored to your unique profile and browsing history. Can we (should we) curate tracking profiles, with ‘clean’ or ‘dirty’ browsing histories? Would future (or contemporary) researchers benefit from experiencing the web through switching profiles, providing insight into how others experience the world (figuratively walking the web in someone else’s shoes)?
- Finally, we talked a bit about Browser Fingerprinting, the next wave in tracking that doesn’t rely on cookies but instead uses the configuration of settings on your machine (i.e. things that can be queried to load a website like size of browser window or font packages installed) to uniquely identify users across sites. Could those types of tools be appropriated for archives, to provide contextual information about the diversity of web users’ interface experience?
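To make the fingerprinting idea a bit more concrete, here is a minimal Python sketch of the core trick: take a set of configuration values and reduce them to a single stable identifier. This is an assumption-laden toy. Real fingerprinting scripts run in the browser and draw on many more signals, and the attribute names and values below are invented for illustration.

```python
import hashlib
import json


def fingerprint(config):
    """Hash a dictionary of configuration values into one hex identifier."""
    # Serialise with sorted keys so the same settings always produce the same
    # string, regardless of the order in which they were collected.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configurations for two visitors.
visitor_a = {
    "window_size": [1280, 720],
    "fonts": ["Arial", "Georgia", "Helvetica"],
    "timezone": "Europe/Amsterdam",
}
visitor_b = {
    "window_size": [1440, 900],
    "fonts": ["Arial", "Georgia", "Helvetica"],
    "timezone": "Europe/Amsterdam",
}

id_a = fingerprint(visitor_a)
id_b = fingerprint(visitor_b)
print(id_a == id_b)
```

No cookie is stored anywhere in this sketch: the identifier is recomputed from the settings on every visit. For an archive, the more interesting part might be the configuration dictionary itself rather than the hash, since that is where the contextual information about a user’s setup lives.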
TL;DR Had a great time at the DMI Summer School, just getting around to writing about it now, would go back in a heartbeat! Played around with a bunch of tools, and it provoked tons of questions still floating around in my head on how this relates to and integrates with everything we talk about in the iSchool.