What is OpenRefine?

[Image: OpenRefine interface]

OpenRefine is a free, open-source Java application that runs offline on your own computer, using a web browser as its interface.

Its original creator, David Huynh, described Refine as:

“A power tool for working with messy data”

  • more powerful than a spreadsheet
  • more interactive and visual than scripting
  • more provisional / exploratory / experimental / playful than a database

Tabular Data

Refine can handle many kinds of data from many kinds of sources.

The data is imported without changing the original source; a new copy is saved in an optimized format in the Refine working directory. Once imported, the data is represented as a table, using this basic terminology:

[Image: table parts]

Refine is efficient enough to provide comfortable performance with hundreds of thousands of rows (although you may want to increase the memory allocated to Java).

Use Cases

Explore - navigate and evaluate quality with visualizations and filters that help dig deeply into the data so you can get to know it better…

Clean - efficiently discover and fix inconsistency with faceting, clustering, cell transforms, GREL expressions…

Transform - easily change formats, subset, or reshape with split/join multi-valued cells, split columns, transpose columns/rows…

Extend - enrich data by combining files, merging projects, fetching URLs, reconciliation with online databases…

Automate - record and preserve your processing routine for transparency, then automate reuse by exporting operation history in JSON!
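As a rough illustration of how an exported history might be inspected outside Refine, the Python sketch below parses a small hand-written sample and lists its steps. The sample entries and the `summarize_operations` helper are hypothetical, and the `op` and `description` field names are assumed to match what Refine exports, so check them against a real export before relying on this.

```python
import json

# Hand-written sample standing in for an exported operation history.
# The field names are an assumption about the export format.
sample_history = """
[
  {"op": "core/text-transform",
   "columnName": "state",
   "expression": "value.trim()",
   "description": "Text transform on cells in column state using expression value.trim()"},
  {"op": "core/column-rename",
   "oldColumnName": "state",
   "newColumnName": "State",
   "description": "Rename column state to State"}
]
"""


def summarize_operations(history_json):
    """Return (operation id, description) pairs, one per recorded step."""
    operations = json.loads(history_json)
    return [
        (step.get("op", "unknown"), step.get("description", ""))
        for step in operations
    ]


summary = summarize_operations(sample_history)
for number, (op, description) in enumerate(summary, start=1):
    print(f"{number}. {op}: {description}")
```

A listing like this can be kept alongside the data as a plain-language record of what was done to it.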

Messy Data?

Inconsistent formats, unnecessary whitespace, extra characters, typos, and so on: messy data is the bane of analysis! In the example below, every value in a column represents the same information, written a different way:

2015-10-14    | $1,000       | ID
10/14/2015    | 1000         | I.D.
10/14/15      | 1,000        | US-ID
Oct 14, 2015  | 1000 dollars | idaho
Wed, Oct 14th | US$1000      | Idaho,
42291         | $1k          | Ihaho
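To make the problem concrete, here is a minimal Python sketch (not how Refine itself works, and not GREL) showing that the amount values above all reduce to the same number, and that 42291 is the same date stored as a spreadsheet day serial. The helper names `parse_amount` and `serial_to_date` are made up for this example, and the serial conversion assumes the 1900 date system used by common spreadsheet programs.

```python
import re
from datetime import date, timedelta

# Digits with optional thousands separators and decimals, then an optional "k".
AMOUNT_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k)?", re.IGNORECASE)


def parse_amount(text):
    """Return the whole-dollar amount found in a messy string, or None."""
    match = AMOUNT_PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    if match.group(2):  # a trailing "k" means thousands
        number *= 1000
    return int(number)


def serial_to_date(serial):
    """Convert a spreadsheet day serial (1900 date system) to a date."""
    return date(1899, 12, 30) + timedelta(days=serial)


amounts = ["$1,000", "1000", "1,000", "1000 dollars", "US$1000", "$1k"]
cleaned = [parse_amount(a) for a in amounts]
print(cleaned)                # [1000, 1000, 1000, 1000, 1000, 1000]
print(serial_to_date(42291))  # 2015-10-14
```

Writing a parser like this by hand for every column gets tedious quickly, which is the gap an interactive tool is meant to fill.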

Multi-valued cells limit your ability to manipulate, clean, and use the data:

“Using OpenRefine by Ruben Verborgh and Max De Wilde, September 2013”    
“University of Idaho, 875 Perimeter Drive, Moscow, ID, 83844, p. 208-885-6111, info@uidaho.edu”    
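The idea behind splitting a multi-valued cell can be sketched in a few lines of Python. This is only an illustration of the concept, assuming a comma-plus-space separator; the `split_multivalued` helper and the `contact` column name are invented for the example.

```python
def split_multivalued(rows, column, separator=", "):
    """Yield one output row per value found in a multi-valued cell."""
    for row in rows:
        for part in row[column].split(separator):
            # Copy the other fields and replace the cell with a single value.
            yield {**row, column: part.strip()}


rows = [
    {
        "id": 1,
        "contact": "University of Idaho, 875 Perimeter Drive, Moscow, ID, "
                   "83844, p. 208-885-6111, info@uidaho.edu",
    }
]

split_rows = list(split_multivalued(rows, "contact"))
for row in split_rows:
    print(row["id"], row["contact"])
```

Once each value sits in its own cell, it can be filtered, cleaned, or moved into its own column.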

Luckily, Refine provides powerful visualizations and tools to discover these types of data issues, then isolate and fix them.
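One way such tools can discover near-duplicate values is by grouping them under a normalized key. The Python sketch below is a simplified take on that "fingerprint" idea, not Refine's actual implementation: it trims, lowercases, strips punctuation, and sorts the unique words. The function names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group the original values by their fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


states = ["ID", "I.D.", "idaho", "Idaho,", " Idaho "]
clusters = cluster(states)
for key, members in clusters.items():
    print(key, members)
```

Values that differ only in case, spacing, or punctuation land in the same group. A typo such as "Ihaho" produces a different key, so it would need a fuzzier comparison to be caught.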