
Wget and Web Archiving

Mini intro to practical Wget for archivists

What’s in a URL?

https://example.com/about?key=value#anchor

protocol + domain name (optional port :80) + path + query with parameters + fragment/anchor
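
For the example URL above, the pieces break down as:

https://      protocol
example.com   domain name (optionally with a port, e.g. :80)
/about        path
?key=value    query with parameters
#anchor       fragment/anchor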

A subdomain can be added in front of the main domain name. For example, in lib.uidaho.edu, lib is a subdomain of uidaho.edu, which is itself a domain under the top-level domain edu.

Dynamic vs Static web

A static website is a collection of HTML, CSS, JS, images, and other files that are delivered exactly as they are on the server to users. A URL in a static site generally represents a request for an HTML document in a specific file location.

Dynamic web uses a server-side scripting language to create pages on the fly when a user makes a request. Thus a URL represents a query, rather than an existing document on the server. Content, templates, and metadata are usually stored in a database. For example, WordPress uses the scripting language PHP and database MySQL. This enables more complex interactivity such as comments, customized views, user management, and a web-based admin interface.
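
To illustrate the difference (these URLs are made up), compare a dynamic request that the server answers by querying the database with a static request for a file that already exists:

https://example.com/index.php?page_id=42   dynamic: a PHP script builds the page from the database
https://example.com/about.html             static: an existing HTML file delivered as-is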

Web crawls harvest the set of pages generated by following the links within a domain. This is a static snapshot at a specific point in time, meaning the dynamic functionality of a website will NOT be captured. Depending on the site design, this could lead to loss of information: for example, some information is not fetched until a button is clicked, links are written to the page using JS, images are swapped for different browsers, or data is retrieved via a web form (POST request).

In a web archive, features such as search bars, streaming media, widget embeds, and complex JS won’t work (or will introduce context anomalies). (example)

Webrecorder is a tool used to capture features that require user interaction, but this requires actually surfing everything you want to harvest. (example use case, Transparent Idaho)

Dynamic requests issues

UIdaho uses the CMS platform Sitecore. This is a database-driven dynamic site written using ASP.NET (to confirm, check their Sitecore demo site).

Look at https://www.uidaho.edu/, and notice that the hyperlinks on the page look like https://www.uidaho.edu/academics.aspx. Rather than a static HTML document, each link is a dynamic request to an “active server page extended” script, which generates the page https://www.uidaho.edu/academics (which is not academics.html or academics/index.html).

This causes problems for a web archive, since we harvest the resulting static HTML, not the aspx script. The file “www.uidaho.edu/academics.aspx” will not be in the web archive, but the document is captured as “www.uidaho.edu/academics”. (example)

Wget Prep

Wget is a handy free command-line tool for robustly retrieving documents from the web. It is a standard utility on Linux. On Windows, I suggest setting up a Bash terminal with Wget, for example Cygwin as outlined in Using Cygwin (note: I previously suggested Cmder as a handy portable option, but we have discovered some bugs when creating WARC files with Wget on Cmder).

See: Intro to the Command Line

Basic Wget

Open a terminal and navigate to a test directory.
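
For example (the directory name here is arbitrary):

mkdir wget-test
cd wget-test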

Wget commands typically take arguments and a URL. Arguments are set using a -- flag. There is a long and a short version of most (e.g. --help and -h):

wget --help

To retrieve a single web page or file, just add the URL:

wget https://evanwill.github.io/_drafts/notes/commandline.html

To get a list of files, create a plain text list of the URLs you want to download, one per line. Use the --input-file= option to pass that list to Wget.

wget --input-file=download-file-list.txt
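
The contents of download-file-list.txt might look like this (placeholder URLs):

https://example.com/reports/report-2019.pdf
https://example.com/reports/report-2020.pdf
https://example.com/images/logo.png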

Adding the --recursive argument allows Wget to act as a web crawler, following the links on a page until everything in a domain has been downloaded. All assets will be downloaded in a directory structure mirroring the site organization. A crawl can be limited to a specific file type using the --accept option. For example, to download all PDFs:

wget --recursive --accept=pdf http://site-with-pdfs.com/

When using --recursive, add --no-parent, --level=NUMBER, or --domains=LIST to limit your crawl:

wget -r -np -Apdf http://site-with-pdfs.com/services/workshops/resources/
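
As another sketch (placeholder URL), this crawl stays below the starting directory and only follows links two levels deep:

wget --recursive --no-parent --level=2 --wait=1 --random-wait https://example.com/services/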

Correctly scoping your crawl is important: spend time exploring the hierarchy of the site to ensure you capture what you want without downloading the entire internet…

To get an entire web site, use the --mirror and --page-requisites arguments. This will recursively crawl the domain and collect everything needed to reproduce the site. Adding --convert-links will rewrite the internal links to work offline if desired. It is important to add --wait=SECONDS and --random-wait to avoid bothering servers (you are unlikely to overload them, but they are likely to block you).

wget -mpk --wait=5 --random-wait https://example.com

Archival Wget with WARC

wget --help | grep warc

WARC is a web archive format that stores page content, response headers, and metadata for a group of web pages. One WARC can contain all the pages gathered during a web harvest. In addition to HTML documents, it can contain binary content such as images.

Wget can create a WARC for any crawl simply by adding the flag --warc-file="filename" to the command. Wget will harvest the site assets as normal, but additionally write a WARC compressed with gzip (.warc.gz). For larger sites it’s a good idea to add --warc-max-size=1G to cap the size of each WARC, so the output is split into multiple files rather than one enormous archive.
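
As a minimal sketch (placeholder domain), this is the mirror command from above with WARC output added:

wget -mpk --wait=5 --random-wait --warc-file="example-site" --warc-max-size=1G https://example.com

This produces the usual mirrored directory plus the gzip-compressed WARC output, split into multiple files if the size limit is reached.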

If the server refuses to give content to Wget’s default user agent (sometimes identified as a robot), you can send a different one, such as --user-agent=Mozilla. Occasionally it may be necessary to ignore “robots.txt” for archival purposes; add --execute robots=off to the command.

Test example:

wget --mirror --page-requisites --wait=2 --random-wait --no-parent --trust-server-names --warc-file="test-archive" http://www.example.com/path/sometopic/

More complete example:

wget -mpkE --span-hosts --domains=example.com,www.example.com,sub.example.com --warc-file="test-archive" --warc-max-size=1G --warc-cdx --user-agent=Mozilla -e robots=off --wait=2 --random-wait http://www.example.com

More examples in use:

wget -mpkE -np --trust-server-names --warc-max-size=1G --warc-file="test-archive" --warc-cdx --wait=1 --random-wait https://www.example.com/path/news/newsletter

Limit using --include-directories= (instead of --no-parent); be sure to include all directories needed for page requisites:

wget -mpkE --trust-server-names -I /~,/css,/fonts,/Images,/Scripts,/path/news/newsletters --warc-max-size=1G --warc-file="test-archive" --warc-cdx --wait=0.5 https://www.example.com/path/news/newsletters/

Playback

Options:

  • Webrecorder Player (actively developed desktop app created by Webrecorder / Rhizome. Currently seems extremely slow and buggy on some computers when using WARCs not created by Webrecorder.)
  • Webarchive Player (not actively developed, but still works. Simple desktop app.)
  • pywb (“Python WayBack” Python package. Lots of features using the wb-manager command-line utility. Can serve and manage multiple WARC files in a collection; see the sketch after this list.)
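
As a rough sketch of the pywb route (collection and file names here are placeholders), install the package, create a collection, add a WARC, and start the replay server:

pip install pywb
wb-manager init test-collection
wb-manager add test-collection test-archive.warc.gz
wayback

Then browse to http://localhost:8080/ and open the collection to replay the captured pages.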

Workflow questions

  • Policy
  • Harvest
  • Store
  • Catalogue
  • Access

Reference:

Collection Development Policy examples:

Tools:

Videos: