
Text Analysis Notes

Collection of Text Analysis resources from various workshops and presentations.

Text Analysis?

= a messy term that encompasses many interconnected processes, such as text data collection, cleaning, parsing, summarization, and visualization. Also known as text mining.

Examples: Text Annotation, Natural Language Processing, Sentiment Analysis
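
To make the processes named above concrete, here is a minimal sketch of cleaning, parsing, and summarizing a text using only the Python standard library. The function names (clean, tokenize, summarize) and the sample sentence are made up for illustration, and the tokenizer is deliberately naive: it assumes plain ASCII English and would split accented words.

```python
import re
from collections import Counter

def clean(text):
    """Cleaning: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text):
    """Parsing: pull out word tokens (ASCII letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text)

def summarize(tokens, n=2):
    """Summary: the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Hypothetical sample text standing in for collected data.
sample = "The cat sat on the mat.  The mat was flat."
tokens = tokenize(clean(sample))
print(summarize(tokens))
```

Real projects usually swap each step for something more robust (a proper tokenizer, a curated cleaning routine), but the overall shape of collect, clean, parse, and summarize tends to stay the same.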

Explore a Text

remember: Your Data == Text

Juxta (digital text collation. NINES.)

Voyant Tools (suite of online text visualization tools. Stéfan Sinclair & Geoffrey Rockwell)

Wordseer (suite of tools run on local server)

TokenX (“a text visualization, analysis, and play tool”. Brian Pytlik Zillig, U of Nebraska Libraries)

WordTree (Jason Davies)

NLP

Explore a Corpus

Topic Modeling

Overview Docs (online tool designed for journalists to sort through huge data sets)

Jigsaw (“Visual Analytics for Exploring and Understanding Document Collections”)

Concordancers
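
A concordancer lists every occurrence of a word together with its surrounding context (often called keyword in context, or KWIC). The sketch below is an assumption about the simplest possible version, not how any particular tool works; the function name kwic and its width parameter are hypothetical.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    width is the number of tokens kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = "It was the best of times, it was the worst of times."
for left, kw, right in kwic(sample, "times", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} [{kw}] {right}")
```

Lining the keyword up in a column is what makes patterns in the neighboring words easy to spot by eye.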

Explore a Huge Corpus

Think about the three V’s of Big Data (volume, variety, and velocity).

Ngrams
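
An n-gram is a run of n consecutive tokens; counting them across a corpus is the basic idea behind n-gram viewers. As a rough sketch on a toy sentence (the ngrams helper and the sample are illustrative, not taken from any tool listed here):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample; a real corpus would be tokenized more carefully.
tokens = "the quick brown fox jumps over the quick dog".split()
bigram_counts = Counter(ngrams(tokens, 2))

# Show the most frequent bigram and its count.
print(bigram_counts.most_common(1))
```

At the scale of a huge corpus the counting is the same, but the counts are typically precomputed and stored per year or per source rather than built in memory like this.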

HathiTrust Research Center Portal (big text data)

Twitter analysis

Explore Text with Programming

Python

  • Distribution: Anaconda (get Python 3, 64-bit)
  • Tool Kit: NLTK (install via conda install nltk or pip install nltk)
  • Learn: NLTK Book (Steven Bird, Ewan Klein, and Edward Loper, designed to teach text analysis)

R

Bash Shell

Tool Catalogs / Directories

  • DIRT (Digital Research Tools)
  • TAPoR3 (“Discover research tools for studying text”)
  • DH Commons (good place to find example projects)

Caution: Due to the nature of academic funding cycles, there are a lot of dead tools and projects out there, and a lot of tutorials for dead tools. Many of these tools still work, but without active maintenance they may not for long. Due to the technical and statistical nature of these tools, descriptions of what they do and how they work may be difficult reading. Don’t be intimidated! However, it is good to be aware of the academic literature explaining the tools and algorithms, since you may need to cite them to validate the techniques in your own work.

Helpful DH Resources

Overheard: stopwords, “to be or not to be”
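
One reading of the note above: stopword removal drops very common words, and “to be or not to be” consists entirely of them. The sketch below assumes a tiny hand-made stopword list (STOPWORDS is illustrative; real lists, such as the one shipped with NLTK, are much longer) to show the phrase disappearing.

```python
import re

# A tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, tokenize naively, and drop any token in the stopword set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("To be, or not to be, that is the question"))
print(remove_stopwords("to be or not to be"))
```

The second call returns an empty list, which is the cautionary point: filtering stopwords by default can erase exactly the phrase you were looking for.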