Power Tool for Messy Data

One of Refine’s strengths is the flexibility to handle all sorts of data from all sorts of sources:

Import formats: CSV, TSV, custom separator, Excel, ODS spreadsheet, XML, JSON, RDF, Google Sheets, MARC, line-based text…

Import sources: local file(s), archive (zip), URL, clipboard, database, or Google Sheets.

Output formats: TSV, CSV, HTML, Excel, ODS spreadsheet, SQL, Wikidata, RDF schema, or custom template…

When creating a new project, Refine imports the data without changing the original source, saving a copy in an optimized format in the “workspace directory” on your computer. Importantly, Refine is not a cloud-based service: no data is sent off your computer during this process (see where data is stored and data privacy FAQ).

Once imported into a project, Refine’s interface represents the data in a tabular grid, using this basic terminology:

[Image: Refine grid with row, column, and cell labelled]

Refine is efficient enough to provide comfortable performance up to hundreds of thousands of rows. For very large data sets, you may want to increase its memory allocation.

Messy Data

Inconsistent formats, unnecessary white space, extra characters, typos, incomplete records, etc… Messy data is the bane of analysis!

For example, each column below contains exactly the same information, represented in inconsistent ways:

| Date          | Amount       | State  |
|---------------|--------------|--------|
| 2015-10-14    | $1,000       | ID     |
| 10/14/2015    | 1000         | I.D.   |
| 10/14/15      | 1,000        | US-ID  |
| Oct 14, 2015  | 1000 dollars | idaho  |
| Wed, Oct 14th | US$1000      | Idaho, |
| 42291         | $1k          | Ihaho  |
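To see why this matters, here is a minimal Python sketch (not how Refine itself works internally) of normalizing the inconsistent date column to ISO 8601; the list of candidate formats is an assumption based on the sample values above:

```python
from datetime import datetime

# Hypothetical list of formats guessed from the sample column.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d, %Y"]

def normalize_date(value):
    """Try each known format; return an ISO 8601 date string or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag values (e.g. the spreadsheet serial "42291") for manual review

print(normalize_date("10/14/2015"))
print(normalize_date("Oct 14, 2015"))
print(normalize_date("42291"))
```

Values that no format matches come back as `None`, which is exactly the kind of inconsistency a tool like Refine helps you find and fix at scale.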

Multi-valued cells limit the ability to manipulate, clean, and use the data:

“Using OpenRefine by Ruben Verborgh and Max De Wilde, September 2013”
“University of Idaho, 875 Perimeter Drive, Moscow, ID, 83844, p. 208-885-6111, info@uidaho.edu”
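As a rough illustration of why multi-valued cells are troublesome, here is a Python sketch that splits the address example into separate fields, similar in spirit to Refine’s split operations (this is an illustration, not Refine’s actual implementation):

```python
# A multi-valued cell packing seven distinct fields into one string.
record = ("University of Idaho, 875 Perimeter Drive, Moscow, ID, 83844, "
          "p. 208-885-6111, info@uidaho.edu")

# Naive split on the separator, trimming whitespace from each part.
parts = [p.strip() for p in record.split(",")]
print(parts)

# Caveat: splitting on "," breaks as soon as a value itself contains a
# comma, which is one reason multi-valued cells are hard to clean reliably.
```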

Luckily, Refine provides powerful visualizations and tools to discover, isolate, and fix these types of data issues.

Unfortunately, there isn’t much literature about standards for wrangling and cleaning data; it has been a behind-the-scenes art that consumes a lot of hidden labor in a project. Yet it is essential to analysis, as well as to preparing your data for preservation and reuse. Documenting your process and being transparent will help both you and others understand your data set, and will shed light on the important work that has gone into it.