Text Analysis Notes
Collection of Text Analysis resources from various workshops and presentations.
Text Analysis?
= a messy term that encompasses many interconnected processes, such as text data collection, cleaning, parsing, summarization, and visualization. Also known as text mining.
Examples: Text Annotation, Natural Language Processing, Sentiment Analysis
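To make the "interconnected processes" concrete, here is a minimal sketch of a clean, parse, summarize pipeline using only the Python standard library. The function names (clean, tokenize, summarize) and the sample sentence are made up for illustration; real projects usually need more careful cleaning and tokenization than this.

```python
import re
from collections import Counter

def clean(text):
    """Cleaning: lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text):
    """Parsing: pull out word tokens, keeping internal apostrophes (didn't)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

def summarize(tokens, n=3):
    """Summary: return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# Hypothetical sample text, standing in for a collected document.
sample = "The cat sat.  The cat ran!\nThe dog didn't."
tokens = tokenize(clean(sample))
top = summarize(tokens)
print(top)
```

Each step feeds the next, which is why a choice made early (for example, lowercasing everything) quietly shapes every later count and visualization.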
Explore a Text
remember: Your Data == Text
Juxta (digital text collation. NINES.)
- Juxtacommons (online version, great for sharing)
- desktop (legacy desktop version)
- My Digital Aladore example
Voyant Tools (suite of online text visualization tools. Stéfan Sinclair & Geoffrey Rockwell)
Wordseer (suite of tools run on local server)
TokenX (“a text visualization, analysis, and play tool”. Brian Pytlik Zillig, U of Nebraska Libraries)
WordTree (Jason Davies)
NLP
- Stanford NLP Group (a library of Java apps, e.g. named entity tagging demo)
- Open Calais (API trained on web and newspaper text)
- Watson Natural Language Understanding (API trained on web content)
Explore a Corpus
Topic Modeling
- Megan R. Brett, “Topic Modeling: A Basic Introduction” (2012)
- MALLET (MAchine Learning for LanguagE Toolkit) (Programming Historian MALLET lesson)
- Topic Modeling Tool (simple visual way to use part of MALLET)
- example classroom projects: Sherlock Holmes’s London prep and analysis; Posner basic strategies and tmt_get_started.
Overview Docs (online tool designed for journalists to sort through huge data sets)
Jigsaw (“Visual Analytics for Exploring and Understanding Document Collections”)
Concordancers
- AntConc (lots of software and publications from Laurence Anthony)
- CasualConc (macOS application)
- TextSTAT
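What a concordancer does at its core is show every occurrence of a keyword with a little context on each side (often called keyword-in-context, or KWIC). The sketch below is a rough illustration of that idea, not how any of the tools above are implemented; the kwic function, its width parameter, and the sample sentence are assumptions for the example.

```python
import re

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each keyword hit.

    width is the number of tokens of context kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Hypothetical sample text.
sample = "The whale surfaced. We saw the whale again before the whale dove."
rows = kwic(sample, "whale")
for left, kw, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>15} | {kw} | {right}")
```

Lining up the keyword in a column is what makes patterns in the surrounding words easy to spot by eye.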
Explore a Huge Corpus
Think about the three V’s of Big Data (volume, variety, and velocity).
Ngrams
- Bookworm
- Google Books Ngram Viewer (see TED talk)
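An n-gram is just a run of n consecutive tokens, and the viewers above chart how often such runs appear. A minimal sketch of the counting step, assuming simple regex tokenization; the ngrams helper and the sample sentence are invented for illustration and say nothing about how Bookworm or Google build their indexes.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n staggered copies of the token list; it stops at the shortest.
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text.
sample = "The cat sat on the mat and the cat slept"
tokens = re.findall(r"\w+", sample.lower())

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("the", "cat")])  # → 2
print(ngrams(tokens, 3)[0])           # → ('the', 'cat', 'sat')
```

On a huge corpus the same counts are usually divided by the total number of n-grams per year, so that periods with more published text do not dominate the chart.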
Hathi Trust Research Center Portal (big text data)
Twitter analysis
- Healey & Ramaswamy, “Visualizing Twitter Sentiment” (2013)
- Clement Levallois, “Umigon: sentiment analysis on Tweets based on terms lists and heuristics” (2013)
- now everyone is doing it! Twitter interactive
Explore Text with Programming
Python
- Distribution: Anaconda (get Python 3, 64-bit)
- Tool Kit: NLTK (install via `conda install nltk` or `pip install nltk`)
- Learn: NLTK Book (Steven Bird, Ewan Klein, and Edward Loper; designed to teach text analysis)
R
- Distribution: R
- IDE: RStudio
- Learn: Text Analysis with R for Students of Literature, Matthew L. Jockers (New York: Springer-Verlag, 2014).
Bash Shell
- The command line has lots of great functions for manipulating text files!
- Programming Historian, Introduction to the Bash Commandline
- SWC, Unix Shell
Tool Catalogs / Directories
- DIRT (Digital Research Tools)
- TAPoR3 (“Discover research tools for studying text”)
- DH Commons (good place to find example projects)
Caution: Due to the nature of academic funding cycles, there are a lot of dead tools and projects out there, and a lot of tutorials for dead tools. Many of these tools still work, but without active maintenance they may not for long. Due to the technical and statistical nature of these tools, descriptions of what they do and how they work may be difficult reading… Don’t be intimidated! However, it is good to be aware of the academic literature explaining the tools and algorithms, since you may need to cite them to validate the techniques in your own work.
Helpful DH Resources
- Programming Historian
- Stanford Litlab Pamphlets
- Miriam Posner DH 101 and blog.
Overheard: stopwords, “to be or not to be”
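Why the stopwords joke works: stopword lists drop very common function words, and Hamlet's line is made of nothing else. A small sketch, assuming a tiny hand-made stopword set (the STOPWORDS set and remove_stopwords function below are illustrative only; toolkits such as NLTK ship much longer lists).

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "that", "the", "to"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, tokenize, and drop any token found in the stopword set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("To be or not to be"))
# → []
print(remove_stopwords("To be or not to be, that is the question"))
# → ['question']
```

So check what a tool's default stopword list removes before trusting its word counts, especially for texts where the small words carry the meaning.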