Wget and Web Archiving
Mini intro to practical Wget for archivists
What’s in a URL?
https://example.com/about?key=value#anchor
protocol + domain name (optional port :80) + path + query with parameters + fragment/anchor
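Mapped onto the example URL above, the pieces are:
https://      protocol
example.com   domain name
/about        path
?key=value    query with parameters
#anchor       fragment/anchor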
A subdomain can be added in front of the main domain name(s). For example, in lib.uidaho.edu, lib is a subdomain of uidaho, which is a subdomain of the top-level domain edu.
Dynamic vs Static web
A static website is a collection of HTML, CSS, JS, images, and other files that are delivered to users exactly as they are stored on the server. A URL in a static site generally represents a request for an HTML document in a specific file location.
Dynamic web uses a server-side scripting language to create pages on the fly when a user makes a request. Thus a URL represents a query, rather than an existing document on the server. Content, templates, and metadata are usually stored in a database. For example, WordPress uses the scripting language PHP and database MySQL. This enables more complex interactivity such as comments, customized views, user management, and a web-based admin interface.
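As a hypothetical illustration (the example.com paths are made up), compare a typical static request with a WordPress-style dynamic one:
static:  https://example.com/about/index.html   (the URL maps to a file that already exists on the server)
dynamic: https://example.com/?page_id=2         (the URL is a query; PHP builds the page from the database on request)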
Web crawls harvest the set of pages generated by following the links within a domain. The result is a static snapshot at a specific point in time, meaning the dynamic functionality of a website will NOT be captured. Depending on the site design, this can lead to loss of information. For example, some content is not fetched until a button is clicked, links are written to the page using JS, images are swapped for different browsers, or data is retrieved via a web form (POST request).
In a web archive, features such as search bars, streaming media, widget embeds, and complex JS won’t work (or will introduce context anomalies). (example)
Webrecorder is a tool used to capture features that require user interaction, but this requires actually surfing everything you want to harvest. (example use case, Transparent Idaho)
Dynamic requests issues
UIdaho uses the CMS platform Sitecore. This is a database-driven dynamic site written using ASP.NET (?, check their Sitecore demo site).
Look at https://www.uidaho.edu/ and notice that the hyperlinks on the page look like https://www.uidaho.edu/academics.aspx. Rather than pointing to a static HTML document, each link is a dynamic request to an “active server page extended” (.aspx) script, which generates the page https://www.uidaho.edu/academics (not academics.html or academics/index.html).
This causes problems for a web archive, since we harvest the resulting static HTML, not the aspx script. The file “www.uidaho.edu/academics.aspx” will not be in the web archive; instead, the document is captured as “www.uidaho.edu/academics”. (example)
Wget Prep
Wget is a handy free command-line tool for robustly retrieving documents from the web. It is a standard utility on Linux. On Windows, I suggest setting up a Bash terminal with Wget, for example Cygwin as outlined in Using Cygwin (note: I previously suggested Cmder as a handy portable option; however, we have discovered some bugs when creating WARC files with Wget on Cmder).
See: Intro to the Command Line
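Once the terminal is set up, a quick sanity check is to confirm Wget is available by asking for its version:
wget --version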
Basic Wget
Open a terminal and navigate to a test directory.
Wget commands typically take arguments and a URL. Arguments are set using a -- flag. Most have a long and a short version (e.g. --help and -h):
wget --help
To retrieve a single web page or file, just add the URL:
wget https://evanwill.github.io/_drafts/notes/commandline.html
To get a list of files, create a plain text list of the URLs you want to download, one per line. Use the --input-file= option to pass that list to Wget.
wget --input-file=download-file-list.txt
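The list file itself is just one URL per line; for example, download-file-list.txt might look like this (hypothetical URLs):
https://example.com/reports/report-2014.pdf
https://example.com/reports/report-2015.pdf
https://example.com/images/map.png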
Adding the --recursive argument allows Wget to act as a web crawler, following the links on each page until everything in a domain has been downloaded.
All assets will be downloaded in a directory structure mirroring the site organization.
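For example, a basic recursive crawl of a (hypothetical) site:
wget --recursive https://example.com/
By default Wget follows links up to five levels deep; the scoping options below let you tighten or loosen that.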
A crawl can be limited to a specific file type using the --accept option. For example, download all PDFs:
wget --recursive --accept=pdf http://site-with-pdfs.com/
When using --recursive, add --no-parent, --level=NUMBER, or --domains=LIST to limit your crawl:
wget -r -np -Apdf http://site-with-pdfs.com/services/workshops/resources/
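The short flags in the example above are just abbreviations; written out with the long options, the same crawl reads:
wget --recursive --no-parent --accept=pdf http://site-with-pdfs.com/services/workshops/resources/
Adding --level=2 to that command would additionally stop the crawl two links deep from the starting page.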
Correctly scoping your crawl is important: spend time exploring the hierarchy of the site to ensure you capture what you want, but do not download the entire internet…
To get an entire web site, use the --mirror and --page-requisites arguments. This will recursively crawl the domain and collect everything needed to reproduce the site. Adding --convert-links will rewrite the internal links to work offline, if desired. It is important to add --wait=SECONDS and --random-wait to avoid bothering servers (you are unlikely to overload them, but they are likely to block you).
wget -mpk --wait=5 --random-wait https://example.com
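For reference, the -mpk shorthand in the command above expands to the long options (-m turns on recursion with no depth limit plus timestamping):
wget --mirror --page-requisites --convert-links --wait=5 --random-wait https://example.com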
Archival Wget with WARC
To see Wget’s WARC-related options, search the help output:
wget --help | grep warc
WARC is a web archive format that stores page content, response headers, and metadata for a group of web pages. One WARC can contain all the pages gathered during a web harvest. In addition to HTML documents, it can contain binary content such as images.
Wget can create a WARC for any crawl simply by adding the flag --warc-file="filename" to the command. Wget will harvest the site assets as normal, but will additionally create a WARC compressed as a gzip file (.gz). For larger sites it’s a good idea to add --warc-max-size=1G to limit the maximum size of each WARC file so they don’t get too big.
If the server refuses to give content to Wget’s default user agent (sometimes identified as a robot), you can send a different one, such as --user-agent=Mozilla. Occasionally it may be necessary to ignore “robots.txt” for archival purposes; add --execute robots=off to the command.
Test example:
wget --mirror --page-requisites --wait=2 --random-wait --no-parent --trust-server-names --warc-file="test-archive" http://www.example.com/path/sometopic/
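After a crawl like the test example finishes, you can peek at what was captured without a playback tool. A rough sketch, assuming the output file is named test-archive.warc.gz and that zcat and grep are available:
zcat test-archive.warc.gz | grep -a "WARC-Target-URI" | head
Each matching line shows the URL of one captured request or response record.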
More complete example:
wget -mpkE --span-hosts --domains=example.com,www.example.com,sub.example.com --warc-file="test-archive" --warc-max-size=1G --warc-cdx --user-agent=Mozilla -e robots=off --wait=2 --random-wait http://www.example.com
More examples in use:
wget -mpkE -np --trust-server-names --warc-max-size=1G --warc-file="test-archive" --warc-cdx --wait=1 --random-wait https://www.example.com/path/news/newsletter
Limit using --include-directories= (instead of --no-parent); be sure to include all the directories needed for page requisites:
wget -mpkE --trust-server-names -I /~,/css,/fonts,/Images,/Scripts,/path/news/newsletters --warc-max-size=1G --warc-file="test-archive" --warc-cdx --wait=0.5 https://www.example.com/path/news/newsletters/
Playback
Options:
- Webrecorder Player (actively developed desktop app created by Webrecorder / Rhizome; currently seems extremely slow and buggy on some computers when using WARCs not created by Webrecorder)
- Webarchive Player (not actively developed, but still works; a simple desktop app)
- pywb (“Python WayBack” Python package; lots of features via the wb-manager command line utility; can serve and manage multiple WARC files in a collection; see the sketch below)
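A minimal sketch of the pywb route, assuming pywb is installed with pip and using the test-archive.warc.gz created above (the collection name is arbitrary):
# install pywb, create a collection, add the WARC, and start the replay server
pip install pywb
wb-manager init my-collection
wb-manager add my-collection test-archive.warc.gz
wayback
By default the wayback server runs on port 8080, so the capture can then be browsed at http://localhost:8080/my-collection/.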
Workflow questions
- Policy
- Harvest
- Store
- Catalogue
- Access
Reference:
- ArchiveTeam, “Wget with WARC output”.
- IIPC Awesome Web Archiving.
- IIPC overview.
- Corey Davis, “Archiving the Web: A Case Study from the University of Victoria”, code4lib 26 (2014), http://journal.code4lib.org/articles/10015.
- Adrian Brown, Archiving Websites: A Practical Guide for Information Management Professionals (2006). http://search.lib.uidaho.edu/UID:everything:CP71127753580001451
- Maureen Pennock, “Web-Archiving”, DPC Technology Watch Report 13 (2013), http://www.dpconline.org/docman/technology-watch-reports/865-dpctw13-01-pdf/file
- Jinfang Niu, “An Overview of Web Archiving”, D-Lib Magazine 18, 3/4 (2012), doi:10.1045/march2012-niu1.
- “Archivability” guide, Stanford Libraries
- ArchiveReady (test sites for archivability)
- Oldweb.today (surf web archives in emulated historic web browsers)
Collection Development Policy examples:
- Columbia University Libraries, “Web Resources Collection Program”
- Stanford Libraries, “Collection development”
- MSU Archives, “Web Site Collection Plan” (Web Archives @ MSU)
- Michael Shallcross, “On the Development of the University of Michigan Web Archives: Archival Principles and Strategies” SAA Campus Case Studies 13 (2011) http://files.archivists.org/pubs/CampusCaseStudies/Case13Final.pdf.
Tools:
- Web Curator Tool
- NetarchiveSuite (vagrant, package to manage harvesting developed by The Royal Danish Library)
- WAIL (GUI interface to work with Heritrix and OpenWayback, buggy in my experience, repo)
- WARCreate (Chrome plugin for one off WARC creation)
- Wpull and grab-site (archival-focused Wget alternatives in development)
Videos:
- IIPC dramatic “Why Archive the Web?”
- LOC “Web Archiving”
- UK WebArchive, “What is a web archive?”