Robots(.txt) in the Archives

Last week I was very lucky to participate in another Archives Unleashed (my third now!) hosted by the British Library as part of Web Archiving Week, preceding the joint RESAW and IIPC Web Archiving conferences.

We’ve posted the code and explanation of our project on GitHub, but I wanted to complement that with more of a narrative. Our group came together around a shared interest in robots.txt files. This exclusion protocol can be used by webmasters to specify how bots, indexers, and crawlers interact with their site, for example by listing specific files and folders to be excluded from indexing, or by specifying a time delay between requests so that the server doesn’t get overloaded. Since the protocol can be overridden or simply ignored, recent discussions and debates have asked whether web archives should adhere to robots.txt. Different web archives currently have different policies on respecting or ignoring these files, and we wanted to understand the implications of those decisions.

[Image: explanation of the robots exclusion protocol, from robotstxt.org]

Our approach was to take a collection that had ignored robots.txt and ask: what would not have been captured if robots.txt had been respected? How many, and which particular resources would not have been captured in a crawl?

We drew on the collections from the National Archives’ UK Government Web Archive. Since their legal mandate is to collect all government publications, the web archiving policy is to ignore robots.txt. For this work we focused on a set of WARCs from the 2010 Elections collection.

Our main method was to:

  • Extract all robots.txt from the WARC collection (using warcbase)
  • Apply the robots.txt retroactively to see what would not have been captured, by:
    • parsing the robots.txt exclusion rules,
    • applying the rules to the URIs and links in the WARC collection.
  • Compare the coverage of a collection adhering to vs. ignoring robots.txt

Extracting the Robots.txt

The first step was to pull out all the robots files from the WARCs. We used warcbase, and our first iteration looked like this:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._

// keep only records whose URL ends in robots.txt, and save their
// crawl date, URL, and content
val links = RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/EA-TNA0510.9294.www.decc.gov.uk-20100512172050-00000.warc.gz", sc)
  .keepUrlPatterns(Set("https?://[^/]*/robots.txt".r))
  .map(r => (r.getCrawlDate, r.getUrl, r.getContentString))
  .map(tabDelimit(_))
  .saveAsTextFile("robots1.txt")

The above code saves the crawl date, URL, and content of every WARC record whose URL ends in robots.txt, with the content ideally looking something like this:

HTTP/1.1 200
Server: Microsoft-IIS/5.0
pics-label: (pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for "http://www.hse.gov.uk" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.hse.gov.uk" r (n 0 s 0 v 0 l 0))
X-Powered-By: ASP.NET
Date: Wed, 12 May 2010 17:35:11 GMT
Content-Type: text/plain
Accept-Ranges: bytes
Last-Modified: Thu, 13 Aug 2009 07:31:38 GMT
ETag: "f02d4518e81bca1:907"
Content-Length: 248

User-agent: *
Disallow: /search
Allow: /slips/step/index.htm
Disallow: /slips/step
Disallow: /newdesign/index.htm
Disallow: /grip
Disallow: /pubns/priced
Sitemap: http://www.hse.gov.uk/sitemap.txt
Sitemap: http://www.hse.gov.uk/sitemap.xml

However, it also returned some results we didn’t want, such as resources that were not plain-text files and 404 error pages served in place of a robots.txt. A few more iterations got us to this version, which saves all the robots.txt files in a single output (separated by “SNIP HERE” lines):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._
import org.warcbase.data.WarcRecordUtils

val links = RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/*.warc.gz", sc)
  .keepUrlPatterns(Set("https?://[^/]*/robots.txt".r))
  // keep only records that came back with an HTTP 200 response
  .filter(r => (r.getContentString.contains("HTTP/1.1 200") || r.getContentString.contains("HTTP/1.0 200") || r.getContentString.contains("HTTP/2.0 200")))
  // keep only plain-text responses (drops the HTML 404 pages)
  .keepMimeTypes(Set("text/plain"))
  // strip the HTTP headers and prefix a separator so every robots.txt ends up in one file
  .map(r => ("<------- SNIP HERE -------->", r.getDomain, r.getCrawlDate, RemoveHttpHeader(r.getContentString)))
  .saveAsTextFile("/mnt/TNA-Dataset/robots/extracted-robots")

Once we have the robots files, we can do some initial analysis with bash. For example, we wrote a bash script to simply determine whether each domain has a robots.txt file in the WARCs, returning true/false for each domain:

cd /mnt/TNA-Dataset/robots
# domains that have a robots.txt: pull the domain field out of each "SNIP HERE" separator line
grep SNIP robots-decc.gov.uk/part-all  | cut -d, -f2 | sort | uniq > robots-decc.gov.uk/domains-with-robots-decc
# all domains that appear in the collection
cat urls-decc.gov.uk/urls-decc | cut -d$'\t' -f1 | sort | uniq > urls-decc.gov.uk/domains-decc
# comm prefixes domains present in both lists with "::"; sed turns that into a tab-separated true/false flag
comm --output-delimiter : urls-decc.gov.uk/domains-decc robots-decc.gov.uk/domains-with-robots-decc | sed -e 's/^\([^:].*\)/\1\tfalse/' -e 's/::\(.*\)/\1\ttrue/' > domains-robots-flag-decc

We ran this over the whole 2010 Elections collection and found 904 domains in total, 478 of which have a robots.txt file. This means that almost half of the domains don’t have a robots.txt file at all.

Looking at all the robots files together also allowed for some quick analysis, such as counting the different user-agents targeted:

Count   User-agent specified
  857   User-agent: *
   27   googlebot
   11   msnbot
    5   baiduspider
    4   yahoo
    4   ia_archiver

We can also see how many specify a crawl delay, and of what length:

Count   Crawl-delay specified (in seconds)
   35   Crawl-delay: 10
    4   Crawl-delay: 30
    4   Crawl-delay: 60
    8   Crawl-delay: 120
    2   Crawl-delay: 300
    7   Crawl-delay: 3600

Ideally, future work could output the robots.txt data in a more structured format, which would allow for easier comparison across variables. For example, if each robots.txt were parsed into structured fields, we could more easily spot trends, e.g. whether long crawl delays target specific crawlers or only appear in the robots.txt files of certain sites.
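
As a rough illustration of what that structured output might look like (a sketch we did not build at the hackathon, with field names of our own invention), each robots.txt body could be folded into per-agent records:

import scala.util.Try

// One record per User-agent group: the paths it is excluded from and any crawl delay.
// (These field names are our own; they are not part of warcbase.)
case class AgentRules(userAgent: String, disallows: Seq[String], crawlDelay: Option[Int])

def parseByAgent(robotsTxt: String): Seq[AgentRules] = {
  var current: Option[String] = None
  val acc = scala.collection.mutable.LinkedHashMap[String, (List[String], Option[Int])]()
  for (raw <- robotsTxt.split("\n"); line = raw.split("#")(0).trim; if line.contains(":")) {
    val Array(field, value) = line.split(":", 2).map(_.trim)
    field.toLowerCase match {
      case "user-agent" =>
        current = Some(value)
        acc.getOrElseUpdate(value, (Nil, None))
      case "disallow" if value.nonEmpty =>
        current.foreach { ua =>
          val (paths, delay) = acc(ua)
          acc(ua) = (paths :+ value, delay)
        }
      case "crawl-delay" =>
        current.foreach { ua =>
          val (paths, _) = acc(ua)
          acc(ua) = (paths, Try(value.toInt).toOption)
        }
      case _ => // Allow, Sitemap, and anything else is skipped in this summary
    }
  }
  acc.map { case (ua, (paths, delay)) => AgentRules(ua, paths, delay) }.toList
}

With records like these, a question such as whether the long crawl delays are aimed at particular crawlers becomes a simple filter and group-by.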

Applying Robots.txt Retroactively

The next step was to extract the specific sites or files targeted by the robots exclusions and apply them retroactively to the crawl, which was really two separate tasks. The first was determining the exclusion rules for each domain. Parsing the rules was relatively straightforward, as others have already built robots.txt parsers to draw on; a quick search turned up options in Python, NodeJS, and Scala.

We ended up testing the Python and NodeJS options, but in future we would like to explore the Scala option and whether it could be integrated more directly with warcbase.
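
To give a sense of what these parsers do, here is a minimal hand-rolled sketch (not one of the libraries we tested) of how Disallow and Allow rules for “User-agent: *” can be applied to a URL path, using the common longest-match interpretation of the protocol:

// A single exclusion or exception rule taken from a robots.txt file.
case class Rule(allow: Boolean, prefix: String)

// Collect the Allow/Disallow rules that apply to all crawlers ("User-agent: *").
def parseRules(robotsTxt: String): Seq[Rule] = {
  var appliesToAll = false
  val rules = scala.collection.mutable.ArrayBuffer[Rule]()
  for (raw <- robotsTxt.split("\n"); line = raw.split("#")(0).trim; if line.contains(":")) {
    val Array(field, value) = line.split(":", 2).map(_.trim)
    field.toLowerCase match {
      case "user-agent" => appliesToAll = (value == "*")
      case "disallow" if appliesToAll && value.nonEmpty => rules += Rule(allow = false, prefix = value)
      case "allow" if appliesToAll && value.nonEmpty => rules += Rule(allow = true, prefix = value)
      case _ => // Sitemap, Crawl-delay, and rules for other agents are ignored here
    }
  }
  rules.toList
}

// A path is blocked if the longest matching rule is a Disallow (an Allow of the
// same length wins the tie); no matching rule means the path is allowed.
def isBlocked(path: String, rules: Seq[Rule]): Boolean = {
  val matching = rules.filter(r => path.startsWith(r.prefix))
  matching.nonEmpty && !matching.maxBy(r => (r.prefix.length, r.allow)).allow
}

Run against the hse.gov.uk example above, isBlocked("/slips/step/index.htm", rules) returns false because the longer Allow rule wins, while isBlocked("/slips/step/other.htm", rules) returns true.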

After determining the robots.txt exclusions, we needed to understand how they would have been applied and in what ways they would have limited the crawl. This proved more difficult than initially anticipated, since there were two dimensions to consider: first, whether a given resource would not have been captured because its URL was blocked by robots; and second, which resources would never have been discovered because they are only linked from resources that would themselves have been blocked by robots.

Essentially, the impact of robots.txt exclusions on a crawl is a network effect, so we needed to understand the link structure of the collection to measure that impact fully.
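
To make that network effect concrete, here is a sketch (not the code we actually ran; the seed list, link graph, and blocked-URL predicate are assumed inputs) of a breadth-first walk that refuses to cross any URL blocked by robots.txt:

import scala.collection.mutable

// Starting from the crawl seeds, walk the link graph but never enter a URL that
// robots.txt would have blocked. Anything captured in the real crawl that is
// unreachable in this walk would have been lost.
def reachable(seeds: Set[String],
              outLinks: Map[String, Seq[String]],   // source URL -> URLs it links to
              isBlocked: String => Boolean): Set[String] = {
  val seen = mutable.Set[String]()
  val queue = mutable.Queue[String]()
  for (s <- seeds; if !isBlocked(s)) { seen += s; queue += s }
  while (queue.nonEmpty) {
    val url = queue.dequeue()
    for (target <- outLinks.getOrElse(url, Seq.empty); if !seen(target) && !isBlocked(target)) {
      seen += target
      queue += target
    }
  }
  seen.toSet
}

// Resources lost to robots.txt = everything captured minus what stays reachable, e.g.:
// val lost = capturedUrls -- reachable(seedUrls, linkGraph, isBlocked)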

Graphing the links within a collection

We used warcbase again to extract the links within the collection. The existing warcbase examples we found aggregate links by domain, so they required some adjustment to get a more granular view of the links between individual resources. We were lucky to be able to draw on Ian’s warcbase knowledge, and came up with this script with his help:

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/*.warc.gz",sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  // one output row per (crawl date, source URL, target URL)
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .saveAsTextFile("/mnt/TNA-Dataset/robots/links-decc.gov.uk")

which saves every link as a (crawl date, source, target) triple, i.e. an edge list from which a directed graph can be built (something like an LGA file).

However, in the end this level of granularity produced a link graph too large for Gephi’s memory limits, so we had to simplify a bit. While it does not capture the detailed network effect we initially wanted to explore, the resulting domain-level graph shows all the domains that have a robots.txt (in red) compared to those that don’t (in blue), which begins to give an indication of the extent of the impact.
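
For reference, a domain-level simplification along these lines can be produced in warcbase by collapsing each link to a pair of domains and counting the resulting edges. This is a sketch modelled on the standard warcbase link-structure examples rather than the exact script we used, with the input path and output location carried over from the examples above:

import org.warcbase.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/mnt/TNA-Dataset/2010electionsUK/post-election/decc.gov.uk/*.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractLinks(r.getUrl, r.getContentString))
  // collapse each (source, target, anchor) link to a pair of domains
  .flatMap(links => links.map(f => (ExtractDomain(f._1), ExtractDomain(f._2))))
  .filter(edge => edge._1 != "" && edge._2 != "")
  // count how many links connect each pair of domains
  .map(edge => (edge, 1))
  .reduceByKey(_ + _)
  .map { case ((src, dst), weight) => src + "\t" + dst + "\t" + weight }
  .saveAsTextFile("/mnt/TNA-Dataset/robots/domain-links-decc.gov.uk")

The resulting weighted edge list is far smaller than the resource-level one, and each node can then be coloured in Gephi using the true/false robots flag computed earlier.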

In the end, while we couldn’t create a detailed graph visualization to overlay and compare the impact of robots.txt exclusions, we were able to apply the full network analysis to a subset of the collection and list the results. We found that only 24 of the captured resources in that sample would have been affected. Again, full details and code for this final step can be found on the GitHub page.

Conclusions

For this sample collection, the impact of ignoring robots.txt was found to be minimal. This was a bit of an anticlimactic conclusion, but not a surprising one given the relatively tight scope of a collection of government web pages.

We hope that future work can apply the same method to other collections. We’re also looking at incorporating some of the Scala robots.txt parsing directly into warcbase.
