Robots(.txt) in the Archives

Last week I was very lucky to participate in another Archives Unleashed (my third now!) hosted by the British Library as part of Web Archiving Week, preceding the joint RESAW and IIPC Web Archiving conferences.

We’ve posted the code and explanation of our project on github, but I wanted to complement that with more of a narrative. Our group came together around a shared interest in robots.txt files — this exclusion protocol can be used by webmasters to specify how bots, indexers, and crawlers should interact with their site, for example listing specific files and folders to be excluded from indexing, or specifying a time delay between requests so that the server doesn’t get overloaded. Since compliance is voluntary and the protocol can simply be ignored, recent discussions and debates have asked whether or not web archives should adhere to robots.txt. Different web archives currently have different policies around respecting or ignoring these files, and we wanted to understand the implications of those decisions.

Explanation from robotstxt.org

Our approach was to take a collection that had ignored robots.txt and ask: what would not have been captured if robots.txt had been respected? How many, and which particular resources would not have been captured in a crawl?

We drew on the collections from the National Archives’ UK Government Web Archive. Since their legal mandate is to collect all government publications, the web archiving policy is to ignore robots.txt. For this work we focused on a set of WARCs from the 2010 Elections collection.

Our main method was to:

  • Extract all robots.txt from the WARC collection (using warcbase)
  • Apply the robots.txt retroactively to see what would not have been captured, by:
    • parsing the robots.txt exclusion rules,
    • applying the rules to the URIs and links in the WARC collection.
  • Compare the coverage of a collection adhering to vs. ignoring robots.txt

Extracting the Robots.txt

The first step was to pull out all the robots files from the WARCs. We used warcbase, and our first iteration looked like this:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._

val links = RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/EA-TNA0510.9294.www.decc.gov.uk-20100512172050-00000.warc.gz",sc)
.keepUrlPatterns(Set("https?://[^/]*/robots.txt".r))
.map(r=>(r.getCrawlDate, r.getUrl))
.map(tabDelimit(_))
.saveAsTextFile("robots1.txt")

The above code finds all warc-records for URLs ending in robots.txt; the content of each of these records should ideally look something like this:

HTTP/1.1 200
Server: Microsoft-IIS/5.0
pics-label: (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for "http://www.hse.gov.uk" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.hse.gov.uk" r (n 0 s 0 v 0 l 0))
X-Powered-By: ASP.NET
Date: Wed, 12 May 2010 17:35:11 GMT
Content-Type: text/plain
Accept-Ranges: bytes
Last-Modified: Thu, 13 Aug 2009 07:31:38 GMT
ETag: "f02d4518e81bca1:907"
Content-Length: 248

User-agent: *
Disallow: /search
Allow: /slips/step/index.htm
Disallow: /slips/step
Disallow: /newdesign/index.htm
Disallow: /grip
Disallow: /pubns/priced
Sitemap: http://www.hse.gov.uk/sitemap.txt
Sitemap: http://www.hse.gov.uk/sitemap.xml

However, it also returned some results we didn’t want: resources that are not text files, as well as some 404 error pages returned in place of a robots.txt. A few more iterations got us to this version, which saves all the robots.txt files together (separated by “SNIP HERE” lines):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._
import org.warcbase.data.WarcRecordUtils

val links = RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/*.warc.gz",sc)
.keepUrlPatterns(Set("https?://[^/]*/robots.txt".r))
.filter(r => (r.getContentString.contains("HTTP/1.1 200") || r.getContentString.contains("HTTP/1.0 200") || r.getContentString.contains("HTTP/2.0 200")))
.keepMimeTypes(Set("text/plain"))
.map(r=>("<------- SNIP HERE -------->", r.getDomain, r.getCrawlDate, RemoveHttpHeader(r.getContentString)))
.saveAsTextFile("/mnt/TNA-Dataset/robots/extracted-robots")

Once we have the robots files, we can do some initial analysis with bash. For example, we wrote a bash script to simply determine whether each domain has a robots.txt file in the WARC, returning true/false for each domain:

cd /mnt/TNA-Dataset/robots
# domains that have a robots.txt: take the domain field (col 2) from the SNIP lines
grep SNIP robots-decc.gov.uk/part-all  | cut -d, -f2 | sort | uniq > robots-decc.gov.uk/domains-with-robots-decc
# all domains in the collection, from the tab-delimited URL list (domain is col 1)
cat urls-decc.gov.uk/urls-decc | cut -d$'\t' -f1 | sort | uniq > urls-decc.gov.uk/domains-decc
# comm marks domains present in both lists with '::'; sed turns that into a true/false flag per domain
comm --output-delimiter : urls-decc.gov.uk/domains-decc robots-decc.gov.uk/domains-with-robots-decc | sed -e 's/^\([^:].*\)/\1\tfalse/' -e 's/::\(.*\)/\1\ttrue/' > domains-robots-flag-decc

We performed this on the whole 2010 Elections collection and found 904 domains in total, 478 of which had a robots.txt file, meaning that almost half of the domains (426) don’t have one.

Looking at all the robots files together also allowed us to do some quick analysis, like counting the different user-agents targeted:

Count User-agent specified
857 User-agent: *
27 googlebot
11 msnbot
5 baiduspider
4 yahoo
4 ia_archiver

We can also see how many specify a crawl delay, and of what length:

Count Crawl-delay specified (in seconds)
35 Crawl-delay: 10
4 Crawl-delay: 30
4 Crawl-delay: 60
8 Crawl-delay: 120
2 Crawl-delay: 300
7 Crawl-delay: 3600

Ideally, for future work we can try to output the robots in a more structured format, which would allow for easier comparison of different variables. For example, if the robots.txt files were parsed into more structured fields we could more easily spot trends, e.g. whether long crawl-delays are targeted at specific crawlers, or only found in the robots.txt files of certain sites.
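As a rough illustration (in Python, one of the languages we tested for robots parsing), here is a minimal sketch that flattens a robots.txt body into (user-agent, field, value) rows; the function name and the flat output shape are just one possible design, not what our scripts actually produced:

# minimal sketch: flatten a robots.txt body into (user-agent, field, value) rows
# so that directives like Crawl-delay can be compared across files and agents
def parse_robots(text):
    rows = []
    agents = []            # the User-agent lines heading the current group
    reading_agents = True  # True while we are still collecting User-agent lines
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = [part.strip() for part in line.split(":", 1)]
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:    # a directive ended the previous group, so start a new one
                agents = []
                reading_agents = True
            agents.append(value)
        else:
            reading_agents = False
            for agent in (agents or ["*"]):   # directives with no User-agent default to *
                rows.append((agent, field, value))
    return rows

# e.g. pull out all crawl-delays, by agent:
# delays = [(a, v) for a, f, v in parse_robots(robots_text) if f == "crawl-delay"]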

Applying Robots.txt Retroactively

The next step involved trying to extract the specific sites or files targeted by robots exclusions and retroactively apply them to the crawl, and this was really two separate parts. The first is determining the exclusion rules for each domain. Parsing the rules was relatively straightforward, as others have built robots parsers to draw on; a quick search turned up three options: one in Python, one in Node.js, and one in Scala.

We ended up testing both the Python and Node.js options, but in future would like to explore the Scala option, and whether it could be integrated more directly with warcbase.
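To give a flavour of this first part, here is a small sketch using Python’s built-in urllib.robotparser (not necessarily one of the parsers we tested): feed it the robots.txt body extracted for a domain, then ask whether each captured URL for that domain would have been allowed. The names here are illustrative.

from urllib import robotparser

def blocked_urls(robots_txt, urls, agent="*"):
    """Return the URLs that the given robots.txt would have disallowed for `agent`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(agent, u)]

# e.g. blocked = blocked_urls(robots_text_for_domain, captured_urls_for_domain)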

After determining the robots.txt exclusions, we needed to understand how they would have been applied and in what ways they would have limited the crawl. This proved more difficult than initially anticipated, since there were two dimensions to consider: first, whether a given resource would not have been captured because its URL was blocked by robots; second, which resources would never have been discovered because they are only reachable through links from resources blocked by robots.

Essentially, the impact of robots.txt exclusion in a crawl is a network effect, so we needed to map the link structure of the collection to fully understand it.
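The second dimension is essentially a reachability question over the link graph: starting from the crawl’s seeds, which pages could still be reached if the robots-blocked URLs were never fetched? A rough sketch of that calculation, assuming a link graph like the one extracted in the next section (loaded into a dict) and a blocked set from a robots parser; the seeds, names, and data structures are illustrative rather than our actual implementation:

from collections import deque

def reachable(seeds, links, blocked):
    """Breadth-first traversal of the link graph, skipping robots-blocked URLs.

    seeds   -- iterable of start URLs for the crawl
    links   -- dict mapping a URL to the list of URLs it links to
    blocked -- set of URLs disallowed by robots.txt
    """
    seen = set()
    queue = deque(u for u in seeds if u not in blocked)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for target in links.get(url, []):
            if target not in blocked and target not in seen:
                queue.append(target)
    return seen

# anything captured in the real crawl but absent from reachable(...) would have
# been lost: either blocked directly, or only discoverable via blocked pages
# lost = set(captured_urls) - reachable(seeds, links, blocked)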

Graphing the links within a collection

We used warcbase again to extract the links within a collection. The existing warcbase examples we found aggregate links by domain, so some adjustments were needed for a more granular view of links between individual resources. We were lucky to be able to draw on Ian’s warcbase knowledge, and came up with this script with his help:

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/*.warc.gz",sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2))) 
  .saveAsTextFile("/mnt/TNA-Dataset/robots/links-decc.gov.uk")

This returns a TSV file listing every link as a source and target pair, from which a directed graph can be built (something like an LGA file).

However, in the end the level of granularity for this link graph was too much for Gephi’s memory limits, so we had to simplify a bit. While not representing the detailed network effect we initially wanted to explore, this graph shows all the domains that have robots.txt (in red) compared to those that don’t (in blue), which begins to give an indication of the extent of the impact.
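For reference, collapsing the resource-level link list down to weighted domain-to-domain edges is a simple aggregation. Here is one way it might be done in Python over the extracted links; the file names and the exact cleaning of warcbase’s tuple output are assumptions:

import csv
from collections import Counter
from urllib.parse import urlparse

def domain(url):
    return urlparse(url).netloc.lower()

# aggregate (date, source, target) links into weighted domain-to-domain edges
edges = Counter()
with open("links-decc.tsv") as f:              # hypothetical name for the extracted link file
    for row in csv.reader(f, delimiter="\t"):
        if len(row) != 3:                      # warcbase's tuple output may need some cleaning first
            continue
        _date, source, target = row
        edges[(domain(source), domain(target))] += 1

# write a Source,Target,Weight edge list that Gephi can import as a spreadsheet
with open("domain-edges.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    for (src, tgt), weight in edges.items():
        writer.writerow([src, tgt, weight])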

In the end, while we couldn’t create a detailed graph visualization to overlay and compare the impact of robots.txt exclusions, we were able to apply the full network analysis to a subset of this collection and list the results. We found that only 24 of the captured resources in the sample selection would have been affected. Again, full details and code for this final step can be found on the github page.

Conclusions

For this sample collection, the impact of ignoring robots.txt was found to be minimal. This was a bit of an anticlimactic conclusion, but also not surprising considering the relatively tight scope of this collection, which covers government web pages.

We hope that future work can apply this same method to other collections. We’re also looking at incorporating some of the robots.txt parsing, in Scala, directly into warcbase.


Hack IMLS – a CDX case study

I spent the past few days at the Archives Unleashed hackathon/datathon at the Library of Congress. It was an amazing gathering of lovely people, and what the teams could put together in just a few days is really astounding.

With the help of the talented Sawood Alam and Ed Summers, I worked on exploring a web crawl of US museum websites. We got access to this crawl, courtesy of the Internet Archive, thanks largely to the whims of Jefferson Bailey, who used the list of museums from the IMLS Museum Universe Data File as the seed list. I thought this was an interesting approach: using open government data as a seed list gives a clear logic for scoping the collection.

We started by brainstorming a bunch of research questions: could we use this crawl to see any trends in how museums are represented on the web? Are there differences in the use of images and text, and does that reveal anything about the discourse of digital cultural heritage? I was also interested to see what might be missing from these crawls, as many museums are also extending their web presence to different social media platforms – a fact that was very evident in the museums in DC (e.g. from this quick instagram search):

#photography encouraged #renwickgallery #expression #art #mediaset #photochallenge (a post shared by Robin Rutherford, @invisibletiara)

Our best-laid brainstorming quickly ran into a problem of scale. We couldn’t work with the WARC files because of their size – at 24TB, it wasn’t possible to get the data itself – and instead we were limited to using the CDX, a derivative file.

A brief introduction to CDX
The CDX format is generated as the crawler performs the crawl and acts as an index of the different elements / URLs. It doesn’t really cover the information content (such as the text of a webpage, outlinks, etc.), so our analysis was limited to the following fields:

cdx-table
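The exact columns are declared in the CDX file’s header line (which begins with CDX followed by a field legend), so treat this as orientation only: a line in the commonly used 11-field CDX format might be split up in Python roughly like this. The sample line and field names below are hypothetical, not taken from the museum crawl:

# a hypothetical line in the common 11-field CDX format
line = ("org,example)/exhibits 20160318191052 http://example.org/exhibits "
        "text/html 200 L5NU7UV63RSNPNQBHSGXUGTUKDBTI4WR - - 2153 840 "
        "MUSEUM-20160318191052-crawl341.warc.gz")

# field names per the common legend (check the file's " CDX ..." header line)
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "metatags", "length", "offset", "filename"]

record = dict(zip(FIELDS, line.split()))
print(record["mimetype"])              # -> text/html (column 4, as used in the recipes below)
print(record["urlkey"].split(",")[0])  # -> org (the top-level domain, from the massaged url)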

One approach we took was to compare the crawl data with the information provided in the original IMLS data file, including data on income and location for each museum. (And since these sets are published for different years, there’s potential for some longitudinal analysis). We could see if there was any correlation between things like # of pages with 404 errors, or size of content on the web, and the museum’s income. Ed wrote up the “Neoliberal Museum Analysis” approach describing the process of creating the Redis database and generating the CSV file (which didn’t finish in time for the end of the hackathon).

Having archival theory readings on my mind, I started thinking about the CDX as a kind of Finding Aid for the WARC file.

The CDX file for a 24TB crawl is still pretty unwieldy, clocking in at ~10GB, or 219,070,202 HTTP transactions. Shifting gears from formulating a research question about museums (the content of the crawl), a second question was about the approach to reading what is in the collection – how do you read a finding aid that’s 200 million lines long?

The obvious answer is you don’t. Or you don’t read all of it. But how can we skim, browse, sample and select?
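As an aside on the “sample” part: one lightweight way to pull a random sample out of a file this size, without loading it all, is a single-pass reservoir sample. We didn’t do this at the hackathon, but it fits the same spirit; a minimal Python sketch:

import gzip
import random

def reservoir_sample(path, k=1000):
    """Single-pass random sample of k lines from a gzipped CDX file."""
    sample = []
    with gzip.open(path, "rt") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = random.randint(0, i)   # classic reservoir sampling step
                if j < k:
                    sample[j] = line
    return sample

# e.g. lines = reservoir_sample("MUSEUM-20160318191052-crawl341.cdx.gz")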

The approach we took was lightweight, which fit our time limitations at the hackathon, but it also makes sense if you consider a use case where a researcher wants to quickly see what might be in a web collection and doesn’t want to install a bunch of software to do it. We focused on quick-and-dirty work with some simple (and some lesser-known) unix shell commands.

Description of our workflow, or: a CDX recipe book

#reading the compressed file with zcat
zcat MUSEUM-20160318191052-crawl341.cdx.gz | head 

#counting records by mime type (col 4): unique values with counts, listed in descending order
zcat MUSEUM-20160318191052-crawl341.cdx.gz | cut -d ' ' -f 4 | sort | uniq -c | sort -nr | less

#counting by top level domain - select col 1 (massaged url), then take the part before the first ','
zcat MUSEUM-20160318191052-crawl341.cdx.gz | cut -d ' ' -f 1  | cut -f 1 -d ',' | sort | uniq -c | sort -nr | less

#selecting the URLs with unknown mime types (for text analysis)
awk '{if ($1 ~ /unk/) {print $2}}' mimesurt.txt | sort -u > mimesurt.unk

#splitting words to new line for word count
tr -s '[:punct:][:space:]' '\n' < mimesurt.unk | sort | uniq -c | sort -nr > mimesurt.unk.count

#removing numeric values and sorting (for comparison)
awk '{if ($2 !~ /[0-9]/) {print $2}}' mimesurt.pdf.count | sort > mimesurt.pdf.clean.sorted

#compare two file types - returns 3 columns: unique to first, unique to second, common to both
comm mimesurt.tiff.clean.sorted mimesurt.jpeg.clean.sorted

#compare two file types - returns unique to first (drop cols 2-3)
comm -23 mimesurt.tiff.clean.sorted mimesurt.jpeg.clean.sorted > comparejpgtif_unique_to_tiff.txt

With this approach we started sifting and sorting through the data, and could begin doing some basic text analysis on the text of the URLs. We also discovered a lot of duplication in the CDX, and reduced it down to (only!) 138,905,796 unique URLs. With some quick D3 skills, Sawood generated this tree map of all the domains:

Museums Domains

We also tested out some simple text analysis on the URLs by breaking each one into tokens and removing punctuation and numeric characters. For example, comparing the text of the URLs (top 50 tokens) for different file types:
URL text mimetype

You can start to see some places to dig deeper: domains including .org show up across all the types, but .edu is more prominent in jpegs, pdfs, and (troublingly?) the mimetypes registering as unknown. Tiffs are dominated by a few tokens – does this indicate a large collection involving these file names? Or an artifact of the crawler stuck in a loop?


Aside – an analogy with Scrabble: There are two ways we play Scrabble in my family. The first is the way the game is written, where the winner is determined by the scores written on the tiles, which are weighted based on the statistical occurrence of letters in words in the English language. And of course, you play to a certain strategy of creating words that maximize your score by taking advantage of the multipliers in double/triple letter and double/triple word scores that are also distributed across the board according to some sort of game theory logic (I don’t know the details of this, but I assume that Scrabble would not be as enjoyable a game if it weren’t calculated accordingly). The second way to play Scrabble is to try to make the best word (as judged by the players). The only way the rules of the game can account for ‘best word’ is by length – meaning that if you use all seven letters in your hand in a single word, you get an extra 50 points. But when playing with people who appreciate good words (however that is defined), while we are all very, very competitive at board games and want to maximize our points, we also recognize a really elegant word when it comes up, and we might just play it if we can, even if it’s not the most points, because it’s a really good word and those don’t come around that often (in a way that’s playable on the board) in a Scrabble game.

So, if you have followed me thus far, you can maybe see the analogy that I’m going to try to make with Scrabble and with data – there are a lot of very precise ways to count data letter-by-letter, some of them based on statistically robust methods that use a corpus of English language texts. But there’s no accounting for a really good word. And maybe the most interesting parts of the data are the ones that don’t come up every game, that get lost when we are only counting the numbers on the tiles. That is how I characterize the challenge of characterizing web data. Lots of data means lots of noise, and it’s much easier to pick out the top occurrences, which aren’t always the most interesting.


For me, this project evolved from one about analyzing data into something else that got me thinking more about methods of analysis, and it was a really useful exercise as a kind of catalyst to start synthesizing some of the areas that are in my reading lists right now: what archival arrangement and description can mean for web archives; approaches to reading data encoding structures as texts; what systems and infrastructures are needed for digital curation and to support internet research – did I mention we were very lucky to have multiple options for virtual machines and servers for working with the data?

In the Saving the Web symposium on Thursday that followed the hackathon, there was a lot of discussion around data literacy. Literacy can mean many things, but I think it has to start with really reading the data itself in all its messiness, before applying other, more distant methods of analysis.

I learned a lot, and learned a lot about what I need to learn next – so I hope this post can serve as both a reminder for me, and maybe something like a how-to guide for others.

tl;dr reading data is hard when there’s lots of it. making sense of it is even harder.

Encounters in Electronic Textuality Part 2

I wanted to see if I could use Google Translate on my phone to capture OCR text from an image in my book so I could read and capture quotes on mobile, instead of transcribing by keyboard (this is a normal train of thought for the somewhat lazy academic, right?) BUT then this happened (imagine these images in motion, not static) and it’s super trippy to look at the text on a page morphing and jumping around and I don’t know if I can ever trust my phone camera again.

 

Encounters in Electronic Textuality Part 1

There is a beautiful irony in reading McKenzie’s discussion on interpretation and misquotation (and reading it in the context of studying the materiality of data) and getting this error instead of the image comparing the two quotations:

Screen Shot 2016-05-08 at 1.19.40 PM

And for comparison, the alternative (glitchy) rendering on my iPad:

13119099_10101567514750897_6759765516030719593_n

 

Playing my cards right: Reflections on making a thing

Some thoughts on the making of Data Against Humanities.

Seeing the whole set together, I realized the cards are more of a reflection of me than anything else – my humour, what websites or articles I’ve been reading lately, and generally what I think is funny. I knew that going into it, but it’s somehow still surprising. While I’ve been reading and thinking about things like ‘the death of the author’ lately, somehow my ego is still getting in the way when it comes to my own work. My overarching reaction wasn’t about the academic merit of it all; I was mainly interested in whether or not people were using the cards and finding them funny.

Catalogue of common questions or reactions I’ve gotten so far:

  • Where did you get the cards printed? (probably #1 question so far)
  • Can I take a card(s)? (yes)
  • I’m sorry I took a card. I wasn’t sure if I was supposed to. Do you want it back? (no, I was totally fine with you taking it)
  • Can I get a set of cards? (I honestly can’t believe anyone really wants this, but maybe? I would consider doing a new set more tailored to a specific audience)

I feel like this whole thing was a little test or experiment, and I need to do another iteration, figure out the best way to engage the audience (not in a formal presentation), and how to get people more involved in playing and writing their own cards. Maybe make a rule that you have to write a card to take a card? I could also do a whole other study about which cards were more popular, and which cards were left behind at the end (what does it all mean???? probably not much). There are also a bunch of different ways I could frame the exercise that might change the outcome and it would be fun to try out different stylings.

Origin Stories

I have decided these two quotes are going to be my epigraph:

I want to speak about bodies changed into new forms. You, gods, since you are the ones who alter these, and all other things, inspire my attempt, and spin out a continuous thread of words, from the world’s first origins to my own time.

Before there was earth or sea or the sky that covers everything, Nature appeared the same throughout the whole world: what we call chaos: a raw confused mass, nothing but inert matter, badly combined discordant atoms of things, confused in the one place. There was no Titan yet, shining his light on the world, or waxing Phoebe renewing her white horns, or the earth hovering in surrounding air balanced by her own weight, or watery Amphitrite stretching out her arms along the vast shores of the world. Though there was land and sea and air, it was unstable land, unswimmable water, air needing light. Nothing retained its shape, one thing obstructed another, because in the one body, cold fought with heat, moist with dry, soft with hard, and weight with weightless things.
Ovid’s Metamorphoses

 

To translate it into UNIX system administration terms (Randy’s fundamental metaphor for just about everything), the post-modern, politically correct atheists were like people who had suddenly found themselves in charge of a big and unfathomably complex computer system (viz, society) with no documentation or instructions of any kind, and so whose only way to keep the thing running was to invent and enforce certain rules with a kind of neo-Puritanical rigor, because they were at a loss to deal with any deviations from what they saw as the norm. Whereas people who were wired into a church were like UNIX system administrators who, while they might not understand everything, at least had some documentation, some FAQs and How-tos and README files, providing some guidance on what to do when things got out of whack. They were, in other words, capable of displaying adaptability. 
Cryptonomicon, Neal Stephenson

(the latter may be shortened to the much more direct: “Display some fucking adaptability”)

Settlers of Catafghan

My friends who are fans of German strategy board games had a baby. I figured you’re never too young to start learning about brick and wood as key early-game resources, so this happened.

catafghan_01

catafghan_02

catafghan_03

Pattern adapted from: Galler Yarns Interchangeable Hexagons by Marie Segares/Underground Crafter and Super Simple Hexagon by Leanda Xavian (for the desert)

Here are some terrible notes on my mods for the half-hexagon and weird corner pieces for the water edge sections.

IMG_20160312_221943

so many coffee stains

IMG_20160312_222019

More Mechanisms: On gates and gates and gates

Chapter 5 on Gibson’s Agrippa quotes a passage by Gary Taylor, describing how his writing is “playing on the word ‘gate’ as both logic gate and Bill Gates,” which reminded me of this article I read a few years ago. “Celan Reads Japanese” is also about gates, and the translation of poetry. Attempting to translate poetry from German to Japanese would seem as doomed a task as we are presented with in digital preservation – aren’t we similarly attempting the impossible by translating works over time, to new contexts, attempting to preserve readings and meanings? But, in this essay, Tawada reveals something about the nature of translation (which I think those of us in digital preservation can learn from):

There were exceptions, though, such as the poems of Paul Celan, which I found utterly fascinating even in Japanese translation.  From time to time it occurred to me to wonder whether his poems might not be lacking in quality since they were translatable.  When I ask about a work’s ‘translatability,’ I don’t mean whether a perfect copy of a poem can exist in a foreign language, but whether its translation can itself be a work of literature.  Besides, it would be insufficient if I were to say that Celan’s poems were translatable.  Rather, I had the feeling that they were peering into Japanese.

I love this sentence (highlighted); it does so much – not simply denying the possibility of a perfect copy, but making that question itself irrelevant, shifting focus from imitation to creation. Tawada insists that we recognize the translation as a new work, and asks that as readers we also demand more from this process of translation.

She goes on to describe the artistry of Celan’s translation, how the ideograms that show up all share the radical for ‘gate.’ I don’t know how one judges the authenticity of translated poetry, but this seems to ring true – or, to adopt more Kirschenbaum-esque language, there is a certain craft evident in this new encoding, that also serves to reveal an underlying formal materiality in the original. This is what I want significant properties to be able to do.

I’ve also been thinking a lot about haiku recently – how the ‘genre’ of haiku has transformed over time (in Japanese) and in contrast how the anglicized notion of haiku is stripped of the ‘semantic’ factors (such as juxtaposition of imagery and focus on nature and seasons) and is reduced to following form (counting syllables). Do we value form over nuance – or does nuance always get lost in translation?

But also, maybe there is something more here… about codes and visualizations, about ideograms and notations (Goodman comes to mind again). And about the ‘information’ of each…

Goodman, N. (1976). Languages of art: an approach to a theory of symbols. Indianapolis: Hackett.

Tawada, Y. (2013, March). Celan Reads Japanese. Retrieved February 16, 2016, from http://www.thewhitereview.org/features/celan-reads-japanese/

Mechanisms + SpecLab

I’m a few chapters into Mechanisms and just finished the Preface of SpecLab. The juxtaposition between these two is funny. Drucker is (or at least in the preface purports to be) all about aesthetics, interpretation, imagination and playfulness. In contrast Kirschenbaum’s focus on forensics feels stodgy, insistent in its grounding with the physical apparatus of the hard drive, the magnetic signals, the ‘device.’ Yet, while I think I don’t share his passion for the physical storage medium, I do appreciate Kirschenbaum’s approach – he refuses to gloss over these details of the underlying components of computing, and uses this material view to counter the (tropes of a) ‘medial ideology’ presented by others. This description of Kirschenbaum’s conceptual/analytical framework is particularly nice:

Forensics itself is a Janus-faced word, by definition both the presentation of scientific evidence and the construction of a rhetorical argument. Mechanisms embraces both of these aspects. While it does not shy away from technical detail (or theoretical nuance), it is first and foremost intended as a practical intervention in the current discourse about digital textuality. (p. 21)

But… when does he get to the acknowledgement that storage and inscription are not made meaningful without processing and interpretation? I hope he does. It also strikes me that the choice of devices and technologies in his examples already seems quite dated (or perhaps historical is a better way to say that).

Kirschenbaum focuses on the act of displacement involved in writing to the hard drive. While he begins to hint at the coming changes with cloud services (though maybe that was not a buzzword yet at the time of writing), I see this as a much more significant act of displacement, shifting much more of the physical inscription away from the user. It might be interesting to do a study of inscription with web platforms and cloud services, to document at a really low level what is recorded on the user’s side vs. the service provider’s side, what’s black-boxed or not, and what inscriptions remain locally even when we work in the cloud.

 

Drucker, J. (2009). SpecLab: digital aesthetics and projects in speculative computing. Chicago: University of Chicago Press.

Kirschenbaum, M. G. (2012). Mechanisms: new media and the forensic imagination. Cambridge, Mass.; London: MIT Press.