Hack IMLS – a CDX case study

I spent the past few days at the Archives Unleashed hackathon/datathon at the Library of Congress. It was an amazing gathering of lovely people, and what the teams managed to put together in just a few days is really astounding.

With the help of the talented Sawood Alam and Ed Summers, I worked on exploring a web crawl of US museum websites. We got access to this crawl courtesy of the Internet Archive, thanks largely to the whims of Jefferson Bailey, who used the list of museums from the IMLS Museum Data Universe File as the seed. I thought this was an interesting approach: using open government data as a seed list gives a clear logic for scoping the collection.

We started by brainstorming a bunch of research questions: could we use this crawl to see any trends in how museums are represented on the web? Are there differences in the use of images and text, and does that reveal anything about the discourse of digital cultural heritage? I was also interested in what might be missing from these crawls, as many museums are extending their web presences to social media platforms – a fact that was very evident among the museums in DC (e.g. from this quick instagram search):

Our best-laid brainstorming quickly ran into a problem of scale. We couldn’t work with the WARC file because of its size – at 24TB, it wasn’t possible to get the file, so instead we were limited to the CDX, a derivative index file.

A brief introduction to CDX
The CDX format is generated as the crawler performs the crawl and acts as an index of the different elements/URLs. It doesn’t really cover the information content (such as the text of a webpage, outlinks, etc.), so our analysis was limited to the following fields:


One approach we took was to compare the crawl data with the information provided in the original IMLS data file, including data on income and location for each museum. (And since these sets are published for different years, there’s potential for some longitudinal analysis). We could see if there was any correlation between things like # of pages with 404 errors, or size of content on the web, and the museum’s income. Ed wrote up the “Neoliberal Museum Analysis” approach describing the process of creating the Redis database and generating the CSV file (which didn’t finish in time for the end of the hackathon).
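The 404-counting idea can be sketched in the shell as well – assuming the status code sits in field 5 of the CDX line (field positions vary between CDX flavors), counting 404s per SURT domain looks something like this (the sample records are made up):

```shell
#count 404 responses per domain - field positions are an assumption here
printf '%s\n' \
  'org,met)/a 20160301000000 http://met.org/a text/html 404' \
  'org,met)/b 20160301000000 http://met.org/b text/html 200' \
  'org,moma)/x 20160301000000 http://moma.org/x text/html 404' |
  awk '$5 == "404" {print $1}' | cut -d ')' -f 1 | sort | uniq -c | sort -nr
```

On the full crawl, the printf stage would be replaced with a zcat of the compressed CDX file.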

Having archival theory readings on my mind, I started thinking about the CDX as a kind of Finding Aid for the WARC file.

The CDX file for a 24TB crawl is still pretty unwieldy, clocking in at ~10GB, or 219,070,202 HTTP transactions. Shifting gears from formulating a research question about museums (the content of the crawl), a second question was about the approach to reading what is in the collection – how do you read a finding aid that’s 200 million lines long?

The obvious answer is you don’t. Or you don’t read all of it. But how can we skim, browse, sample and select?
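One lightweight way to skim a file that size is a systematic sample: keep every Nth line and read that instead. A sketch with awk, using seq as a stand-in for the real 200-million-line CDX:

```shell
#keep every 1000th line - a quick systematic sample of a huge file
seq 1 100000 | awk 'NR % 1000 == 0' | wc -l   # 100 sampled lines
```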

The approach we took was lightweight, which fit our time limitations in the hackathon, but it also makes sense for a use case where a researcher wants to quickly see what might be in a web collection – and doesn’t want to install a bunch of software to do it. We focused on quick-and-dirty work with some simple (and some lesser-known) unix shell commands.

### Description of our workflow, or: a CDX recipe book

#reading the compressed file with zcat
zcat MUSEUM-20160318191052-crawl341.cdx.gz | head 

#sorting by mime type (col 4), for unique values, summed and listed in descending order
zcat MUSEUM-20160318191052-crawl341.cdx.gz | cut -d ' ' -f 4 | sort | uniq -c | sort -nr | less

#sorting by top level domains - select col 1 (massaged url) then select text delim by ,
zcat MUSEUM-20160318191052-crawl341.cdx.gz | cut -d ' ' -f 1  | cut -f 1 -d ',' | sort | uniq -c | sort -nr | less

#sorting for unknown mime types (for text analysis)
awk '{if ($1 ~ /unk/) {print $2}}' mimesurt.txt | sort -u > mimesurt.unk

#splitting words to new lines for word counts
tr -s '[:punct:][:space:]' '\n' < mimesurt.unk | sort | uniq -c | sort -nr > mimesurt.unk.count

#removing numeric values and sorting (for comparison)
awk '{if ($2 !~ /[0-9]/) {print $2}}' mimesurt.pdf.count | sort > mimesurt.pdf.clean.sorted

#compare two file types - plain comm returns 3 columns: unique to first, unique to second, common to both
comm mimesurt.tiff.clean.sorted mimesurt.jpeg.clean.sorted

#compare two file types - returns unique to first (drop cols 2-3)
comm -23 mimesurt.tiff.clean.sorted mimesurt.jpeg.clean.sorted > comparejpgtif_unique_to_tiff.txt

With this approach we started sifting and sorting through the data, and could begin some basic text analysis on the text of the URLs. We also discovered a lot of duplication in the CDX: deduplicating reduced it to (only!) 138,905,796 unique URLs. With some quick D3 skills, Sawood generated this tree map of all the domains:

Museums Domains
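The deduplication count can be sketched with the same tools – sort -u on the original-URL field (field 3 here; the field position is an assumption), then wc -l. The sample records below are made up:

```shell
#count unique original URLs (field 3) - two of the three records share a URL
printf '%s\n' \
  'org,a)/ 20160301000000 http://a.org/ text/html 200' \
  'org,a)/ 20160302000000 http://a.org/ text/html 200' \
  'org,b)/ 20160301000000 http://b.org/ text/html 200' |
  cut -d ' ' -f 3 | sort -u | wc -l   # 2 unique URLs out of 3 records
```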

We also tested some simple text analysis on the URLs by breaking each one into tokens and removing punctuation and numeric characters. For example, comparing the text of the URLs (top 50 tokens) for different file types:
URL text mimetype

You can start to see some places to dig deeper: domains including .org show up across all the types, but .edu is more prominent in jpegs, pdfs, and (troublingly?) the mimetypes registering as unknown. Tiffs are dominated by a few tokens – does this indicate a large collection involving these file names? Or an artifact of the crawler stuck in a loop?
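One way to probe the crawler-loop question is the content digest: if the CDX records a checksum per capture (field 6 in the layout assumed earlier), heavily repeated digests point at the same payload being fetched over and over. A sketch with made-up records:

```shell
#find the most-repeated content digests - field 6 assumed; sample lines made up
printf '%s\n' \
  'org,a)/1 20160301000000 http://a.org/1 image/tiff 200 DIGESTX' \
  'org,a)/2 20160301000000 http://a.org/2 image/tiff 200 DIGESTX' \
  'org,b)/1 20160301000000 http://b.org/1 image/tiff 200 DIGESTY' |
  cut -d ' ' -f 6 | sort | uniq -c | sort -nr | head
```

A digest that appears thousands of times across distinct URLs would suggest duplicated content (or a trapped crawler) rather than a genuinely large collection.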

Aside – an analogy with Scrabble: There are two ways we play Scrabble in my family. The first is the way the game is written, where the winner is determined by the scores printed on the tiles, which are weighted based on the statistical occurrence of letters in English words. And of course, you play to a certain strategy, creating words that maximize your score by taking advantage of the double/triple letter and double/triple word multipliers distributed across the board according to some sort of game-theory logic (I don’t know the details, but I assume Scrabble would not be as enjoyable a game if it weren’t calculated accordingly). The second way to play Scrabble is to try to make the best word (as judged by the players). The only way the rules of the game can account for ‘best word’ is by length – if you use all seven letters on your rack in a single word, you get an extra 50 points. But playing with people who appreciate good words (however that is defined), while we are all very, very competitive at board games and want to maximize our points, we also recognize a really elegant word when it comes up – and we might just play it, even if it’s not the most points, because really good words don’t come around that often (in a way that’s playable on the board).

So, if you have followed me thus far, you can maybe see the analogy I’m going to try to make between Scrabble and data – there are a lot of very precise ways to count data letter by letter, even with statistically robust methods built on a corpus of English-language texts. But there’s no accounting for a really good word. And maybe the most interesting parts of the data are the ones that don’t come up every game, that get lost when we only count the numbers on the tiles. That is how I characterize the challenge of characterizing web data. Lots of data means lots of noise, and it’s much easier to pick out the top occurrences, which aren’t always the most interesting.

For me, this project evolved from one about analyzing data into something else that got me thinking more about methods of analysis, and it was a really useful exercise as a kind of catalyst to start synthesizing some of the areas that are in my reading lists right now: what archival arrangement and description can mean for web archives; approaches to reading data encoding structures as texts; what systems and infrastructures are needed for digital curation and to support internet research – did I mention we were very lucky to have multiple options for virtual machines and servers for working with the data?

In the Saving the Web symposium on Thursday that followed the hackathon, there was a lot of discussion around data literacy. Literacy can mean many things, but I think it has to start with really reading the data itself, in all its messiness, before applying other, more distant methods of analysis.

I learned a lot, and learned a lot about what I need to learn next – so I hope this post can serve as both a reminder for me, and maybe something like a how-to guide for others.

tl;dr reading data is hard when there’s lots of it. making sense of it is even harder.