Text Analysis Notes
Collection of Text Analysis resources from various workshops and presentations.
Text analysis = a messy term that encompasses many interconnected processes, such as text data collection, cleaning, parsing, summarization, and visualization. Also known as text mining.
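To make those processes concrete, here is a minimal sketch of cleaning, parsing (tokenizing), summarizing, and a crude text "visualization" using only the Python standard library. The sample string and the function names (clean, tokenize, summarize) are made up for illustration, and the cleaning rule is deliberately simplistic (it drops accented letters and digits, for example).

```python
import re
from collections import Counter

# Hypothetical sample text standing in for "your data".
raw = "The whale, the WHALE! Call me Ishmael. The sea; the ship & the whale..."

def clean(text):
    """Lowercase, then replace anything that is not a-z or whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()

def summarize(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

tokens = tokenize(clean(raw))

# Crude "visualization": one bar of # characters per frequent word.
for word, count in summarize(tokens):
    print(f"{word:<8}{'#' * count} {count}")
```

Real projects usually need more careful choices at each step (what counts as a word, whether to keep case or punctuation), which is part of why the term is messy.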
Explore a Text
remember: Your Data == Text
Juxta (digital text collation. NINES.)
- Juxtacommons (online version, great for sharing)
- desktop (legacy desktop version)
- My Digital Aladore example
Voyant Tools (suite of online text visualization tools. Stéfan Sinclair & Geoffrey Rockwell)
Wordseer (suite of tools run on local server)
TokenX (“a text visualization, analysis, and play tool”. Brian Pytlik Zillig, U of Nebraska Libraries)
WordTree (Jason Davies)
- Stanford NLP Group (a library of Java apps, e.g. named entity tagging demo)
- Open Calais (API trained on web and newspaper text)
- Watson Natural Language Understanding (API trained on web content)
Explore a Corpus
- Megan R. Brett, “Topic Modeling: A Basic Introduction” (2012)
- MALLET (MAchine Learning for LanguagE Toolkit) (Programming Historian MALLET lesson)
- Topic Modeling Tool (simple visual way to use part of MALLET)
Overview Docs (online tool designed for journalists to sort through huge data sets)
Jigsaw (“Visual Analytics for Exploring and Understanding Document Collections”)
Explore a Huge Corpus
Think about the 3 V’s of Big Data (volume, variety, and velocity).
Hathi Trust Research Center Portal (big text data)
- Healey & Ramaswamy, “Visualizing Twitter Sentiment” (2013)
- Clement Levallois, “Umigon: sentiment analysis on Tweets based on terms lists and heuristics” (2013)
- now everyone is doing it! Twitter interactive
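The "terms lists and heuristics" idea behind this kind of sentiment analysis can be sketched in a few lines of Python. This is only an illustration of the general approach, not the method of Umigon or any other tool listed here: the term lists are tiny and invented, and the single heuristic (a negator directly before a term flips its sign) is an assumption chosen for simplicity.

```python
import re

# Hypothetical, tiny term lists for illustration only.
POSITIVE = {"good", "great", "love", "happy", "awesome"}
NEGATIVE = {"bad", "awful", "hate", "sad", "terrible"}
NEGATORS = {"not", "no", "never"}

def score(text):
    """Sum +1 per positive term and -1 per negative term.

    Heuristic: a negator immediately before a term flips that term's sign.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, tok in enumerate(tokens):
        value = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        total += value
    return total

def label(text):
    """Map the numeric score to a coarse sentiment label."""
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

for tweet in ["I love this great tool", "not good, just awful", "reading about text mining"]:
    print(label(tweet), "|", tweet)
```

Even this toy version shows why the published descriptions matter: the results depend entirely on which terms are in the lists and which heuristics are applied.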
Explore Text with Programming
- Distribution: Anaconda (get Python 3, 64-bit)
- Tool Kit: NLTK (install via conda install nltk or pip install nltk)
- Learn: NLTK Book (Steven Bird, Ewan Klein, and Edward Loper, designed to teach text analysis)
- Distribution: R
- IDE: RStudio
- Learn: Text Analysis with R for Students of Literature, Matthew L. Jockers (New York: Springer-Verlag, 2014).
- The command line has lots of great functions for manipulating text files!
- Programming Historian, Introduction to the Bash Commandline
- SWC, Unix Shell
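For a taste of what exploring text with programming looks like, here is a minimal keyword-in-context (KWIC) sketch in plain Python, so it runs without installing anything. The function name kwic, its width parameter, and the sample sentence are made up for illustration. NLTK offers a ready-made concordance view through nltk.Text(...).concordance(); this sketch does not use it and is not how NLTK implements it.

```python
import re

def kwic(text, keyword, width=2):
    """Return keyword-in-context lines: up to `width` tokens on each side of every match."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text.
sample = "Your data is text. Text can be cleaned, and text can be counted."
for line in kwic(sample, "text"):
    print(line)
```

To try it on a real file, replace sample with the contents of a plain-text file read via open(...).read().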
Tool Catalogs / Directories
- DIRT (Digital Research Tools)
- TAPoR3 (“Discover research tools for studying text”)
- DH Commons (good place to find example projects)
Caution: Due to the nature of academic funding cycles, there are a lot of dead tools and projects out there, and a lot of tutorials for dead tools. Many of these tools still work, but without active maintenance they may not for long. Due to the technical and statistical nature of these tools, descriptions of what they do and how they work may be difficult reading. Don’t be intimidated! However, it is good to be aware of the academic literature explaining the tools and algorithms, since you may need to cite it to validate the techniques in your own work.