Text as Data

2024
Fall
Methods
Text as Data with Alex Siegel
Published

August 26, 2024

Week 1

Click here for syllabus

Why should we take text as data?

  • Availability!

  • Text and language are integral components of political science.

  • This field is HOT right now. Very accessible.

  • Opens up exciting new opportunities for measurement and inference.

  • a lot of text analysis is discovery - this can shape our question moving forward. The process is somewhat inductive.

What can we learn?

  • social media data gives us real-time measures of mass and elite behavior

  • Text of floor speeches, press releases, supreme court briefs, and manifestos offer new measures of ideological scaling.

  • Web-scraping repeatedly can show us how authoritarian regimes censor content and control information.

  • The language in books and magazines over time can show us how social attitudes have evolved.

Week 2 (Labor Day - No Class)

Week 3 - Introduction to Text as Data:

Lecture Notes:

A fundamentally qualitative exercise:

  • substantive knowledge and deep contextual understanding are essential to determine relevant quantities of interest.

  • is the method appropriate?

  • is this thing I measured actually capturing what I want?

  • automated approaches allow humans to read, organize, and analyze documents at scale.

Data Generating Process:

  • Agnostic about model selection

  • as long as we can accurately measure what we care about.

  • all text models are wrong!

  • evaluate models based on their ability to perform a useful social scientific task.

Validation:

  • Face validity?

  • Does the text measure correlate with similar measures?

  • How does the measure compare with our best guess at “ground truth”?

What automated approaches can and can’t do:

  • Algorithms are really good at sorting.

  • good at compressing texts into lower dimensions.

  • With supervised tasks, if a human can’t do it a machine (usually) can’t either.

Big Data Challenges:

  • Just because data is really big, doesn’t mean it’s representative.

  • Just because we have statistical significance doesn’t mean the result is substantively meaningful.

Ethics:

  • Just because data is publicly available doesn’t mean you don’t need to protect subjects.

  • Prioritize subjects over reproducibility

  • Following platforms’ terms of service is sometimes at odds with open science.

  • Do the best you can, but avoid violating ToS where possible.

Theory Building or Theory Testing?

  • Theory building: develop a new argument or concept to explain something that is not understood well

  • Theory testing: take an existing theory and apply it to new cases, or re-evaluate existing cases, etc. This can also result in the modification of an existing theory.

How do I collect data?

  • APIs

  • Webscraping

  • OCR
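
A hedged sketch of the web-scraping route, assuming the requests and beautifulsoup4 packages and a placeholder URL; APIs and OCR each have their own tooling:

```python
# Minimal web-scraping sketch: download a page and pull out its paragraph text.
# Check the site's terms of service and robots.txt first (see the ethics notes above).
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text          # placeholder URL
soup = BeautifulSoup(html, "html.parser")

paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(paragraphs[:3])
```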

Grimmer, Justin, Margaret Roberts, and Brandon Stewart. Text as Data, Chapters 1-2.

(Textbook)

Chapter 1:

  • The book uses text as data methods from computer science for social science purposes.

  • Big use of inductive reasoning.

  • Rather than a focus on prediction, text as data researchers are much more interested in how well their models provide insights into concepts of interest, how well measurement tools sort documents according to those rules, and how well the assumptions needed for accurate causal inference or prediction are met.

  • Texts can be organized differently. No one correct organization

    • we can then find the best measurement of that particular organization

      • can be tested with extensive validation.

        • the utility for answering the research question becomes the key interest.
  • The authors reject the view that there is an underlying structure that statistical models applied to text are recovering; rather, they view statistical models as useful (and incorrect) summaries of the text documents.

Chapter 2:

  • Typical social science research is deductive

    • Theory > data > test

    • Empirical Implications of Theoretical Models

      • use theory to create formal game model then use the game theory model to extract predictions. Then collect data to test those predictions.
    • strict deductive research can lead to missed opportunities in new questions + strategies.

  • Authors argue there is value in inductive reasoning and using an iterative approach for text analysis.

  • Methods to analyze text as data can contribute to inference at three stages of the process: discovery, measurement, and inference.

Discovery:

  • Discovery - develop a research question

    • the task of deciding what you want to measure from data and goal of inference.

      • need to conceptualize the world.

      • You want to analyze social media, okay, but what aspect?

        • topical content, sentiment, readability, etc.

          • text analysis helps us discover what part we want to look at through the inductive process.

Figure of deductive vs. inductive.
  • By pointing out new ways of organizing the documents, we can read the text differently and prompt new questions.

Measurement:

  • researchers have to demonstrate that their method of measurement does indeed describe the concept or behavior they want to measure—that is, they have to validate their measures and provide evidence that the described quantities are relevant to the theoretical quantity at hand.

Inference:

  • predictions about events

  • used for causal inference.

Agnostic approach to Text Analysis:

  • No true underlying data-generating process

    • just a bunch of different ones with different insights.

Six Principles of Text Analysis:

  1. Theory and substantive knowledge are essential for research design (Section 2.7.1)

  2. Text analysis does not replace humans—it augments them (Section 2.7.2)

  3. Building, refining, and testing social science theories requires iteration and cumulation (Section 2.7.3)

  4. Text analysis methods distill generalizations from language (Section 2.7.4)

  5. The best method depends on the task (Section 2.7.5)

  6. Validations are essential and depend on the theory and the task (Section 2.7.6)

Henry E. Brady. 2019. “The Challenge of Big Data and Data Science.” Annual Review of Political Science.

Abstract:

Big data and data science are transforming the world in ways that spawn new concerns for social scientists, such as the impacts of the internet on citizens and the media, the repercussions of smart cities, the possibilities of cyber-warfare and cyber-terrorism, the implications of precision medicine, and the consequences of artificial intelligence and automation. Along with these changes in society, powerful new data science methods support research using administrative, internet, textual, and sensor-audio-video data. Burgeoning data and innovative methods facilitate answering previously hard-to-tackle questions about society by offering new ways to form concepts from data, to do descriptive inference, to make causal inferences, and to generate predictions. They also pose challenges as social scientists must grasp the meaning of concepts and predictions generated by convoluted algorithms, weigh the relative value of prediction versus causal inference, and cope with ethical challenges as their methods, such as algorithms for mobilizing voters or determining bail, are adopted by policy makers.

Bumper Sticker:

Big data changed poli sci a lot - and there is more work to be done!

Background:

  • So much information causing overload -> fragmentation of information flows.

    • a good material explanation of why people may choose their preferred media source.
  • Big data \(\neq\) better inference.

    • this is especially true if our data collection process is biased or otherwise flawed.

Grimmer, Justin and Brandon Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Documents” Political Analysis. 21, 3 267-297.

Abstract:

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

Overview:

Process of Text as Data

Background:

  • Just because one text method works well on one dataset does not mean it will work well on a different dataset.

  • Classification - organizes texts into a set of categories

    • Dictionary method - use the frequency of key words to determine a document’s class

    • supervised methods: hand coding documents then giving it to the machine to train and classify.

      • easier to validate
    • Fully automated clustering (FAC) - estimate categories then classify those documents into categories.

      • mixed membership models - problem-specific structure to assist in the estimation of categories

      • Computer assisted clustering (CAC) - looks at 1000s of potential clusterings.

    • Unsupervised is not perfect and you need validation.

  • Text/document - unit of analysis - tweet, Facebook post, article, sentence, paragraph

  • Corpus - population of texts

  • Corpora - the plural of corpus (multiple collections of texts).

  • Order of words does not matter. We treat our documents as a bag of words.

  • stemming - reduces ends of words. This reduces dimensionality and keeps everything simpler.

    • family, families, families’, familial

      • all become famili
  • Typically also discard capitalization, punctuation, very common words, and very uncommon words.

  • We can use supervised methods to validate our use of unsupervised methods.

Four Principles of Automated Text Analysis:

  1. All quantitative models of language are wrong - but some are useful

  2. Quantitative methods for text amplify resources and augment humans

  3. There is no globally best method for automated text analysis

  4. Validate, Validate, Validate.

Supervised Method:

  1. Construct a training set

    1. iteratively develop coding scheme

    2. randomly sample to obtain a representative sample of documents

      1. use this as the training set.

      2. n = 500 is ideal

        1. n = 100 is just enough
  2. apply the supervised learning method - learn the relationship between features and categories in the training set, then use this to infer the labels in the test set

  3. validate the model output and classify the remaining documents
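
A minimal sketch of step 1’s sampling logic, assuming a hypothetical `corpus` list and stand-in labels; the actual coding scheme comes from the iterative human process described above:

```python
# Randomly sample documents to hand-code, then split the labeled set for training/validation.
import random

random.seed(42)
corpus = [f"document {i} text ..." for i in range(10_000)]   # hypothetical corpus

to_code = random.sample(corpus, 500)            # n = 500 documents drawn at random to hand-code
labels = [random.choice(["relevant", "irrelevant"]) for _ in to_code]  # stand-in for human codes

pairs = list(zip(to_code, labels))
random.shuffle(pairs)
train, validate = pairs[:400], pairs[400:]      # hold some coded documents out for validation
```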

Fully Automated Clustering (FAC):

an unsupervised method; supervised methods (and hand coding) can be used to validate its output

Single Membership Model:

  • Single membership clustering models estimate a clustering: a partition of documents into mutually exclusive and exhaustive groups.

    • Each group of documents in a clustering is a cluster, which represents an estimate of a category.

Mixed Membership Model

Computer Assisted Clustering

Michel et al. 2011, “Quantitative analysis of culture using millions of digitized books” Science, 331:6014.

Abstract:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Bumper Sticker:

Text as Data can help us figure out language/culture changes over time.

Research Question:

How has culture changed?

Data/methods:

  • Corpus of about 5 million books (selected from the 15+ million digitized by Google)

  • spans hundreds of years and languages.

  • restricted to 1-gram up to 5-gram - looked at how often a 1-gram or n-gram appeared

    • 1-gram is a sequence of characters uninterrupted by space.
  • Usage frequency is computed by dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year.
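
Restating that computation as a formula (notation mine, not the paper’s): for an n-gram \(g\) in year \(t\),

\[
\text{usage frequency}(g, t) = \frac{\text{count of } g \text{ in year } t}{\text{total words in the corpus in year } t}.
\]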

Findings:

  • More words found than in dictionary.

  • English lexicon has nearly doubled from 1950 to 2000.

  • grammar has changed

DiMaggio, Paul. “Adapting computational text analysis to social science (and vice versa).” Big Data & Society 2.2 (2015).

Week 4 (Canceled - Alex out sick)

Week 5

Lecture Notes:

  • Type: a unique sequence of characters that are grouped together in some meaningful way.

  • Token: a particular instance of a type.

    • tokenization is the process of splitting a document into its constituent words

    • tokens are the individual units we split our document into before counting them up.

  • Term: a type that is part of the system’s ‘dictionary’.

  • To do List:

    • choose unit of analysis & make corpus

    • tokenize

    • reduce complexity

    • filter by frequency.

      • remove overly common and overly rare words.
  • A lemma is the canonical form (such as one might find in a dictionary) of a set of words that are related by inflection.

  • A document is a collection of W feature tokens

Grimmer, chp. 3-5.

Chapter 3:

  • We need to decide what information we want to keep and what information we need to discard.

    • figuring out what to discard and how is the hard part.

Principle 1:

  • The usefulness of a corpus depends on the question the researcher wants to answer and the population they want to study.

    • figure out the quantity and population of interest.

      • twitter data is a subset of the population for example.

        • might not be the best way to gauge public opinion.

Principle 2:

  • There is no one right way to represent text for all research questions.

Principle 3:

  • The best assurance that a text representation is working is extensive validation.

    • how do we know that we have selected the “right” representation for our research question?

      • We will know that our representation is working in a measurement model if the measures that we have created using that representation align with validated hand-coded data and facts that we know about the social world.

Chapter 4: Selecting Documents

  • Four types of Bias:

    • Resource Bias

    • Incentive Bias

    • Medium Bias

    • Retrieval Bias

Chapter 5: Bag of Words

  • We now need to represent our words numerically.

  • Bag of Words model:

    • we will represent each document by counting how many times each word appears in it.

    • create a matrix

    • Full Recipe:

      • Choose the Unit of Analysis

      • Tokenize

        • Each individual word in the document is a token and the process of splitting a document into its constituent words is called tokenization.

          • Tokens are the individual units we split our document into before counting them up.
      • Reduce Complexity - helps computationally

        • lowercase

        • remove punctuation

        • Remove stop words

        • Create equivalence classes

        • filter by frequency

      • Create the document-feature matrix (a minimal code sketch follows after the n-grams notes below).

  • N-grams

    • related to tokenization

      • basically, a phrase like “White House” loses its meaning if we split it into “white” and “house”.

        • we use n-grams - an ordered set of n words.

          • unigrams- single words

          • bigrams- ordered pairs

          • trigrams- ordered triples.
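
A minimal sketch of the full recipe using scikit-learn’s CountVectorizer (one reasonable toolchain, not necessarily the one used in class). The toy documents are made up, and stemming/lemmatization would need an extra step (e.g., NLTK) not shown here:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The White House issued a statement.",
    "The house by the white fence needs a permit.",
]

# Lowercasing and punctuation stripping happen in the default tokenizer;
# stop_words drops very common English words; ngram_range=(1, 2) keeps
# bigrams, so "white house" survives as a single feature.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
dtm = vectorizer.fit_transform(docs)           # the document-feature matrix (sparse)

print(vectorizer.get_feature_names_out())      # unigram and bigram features
print(dtm.toarray())                           # counts per document
```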

When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks

Abstract:

Previous research uses negative word counts to measure the tone of a text. We show that word lists developed for other disciplines misclassify common words in financial text. In a large sample of 10-Ks during 1994 to 2008, almost three-fourths of the words identified as negative by the widely used Harvard Dictionary are words typically not considered negative in financial contexts. We develop an alternative negative word list, along with five other word lists, that better reflect tone in financial text. We link the word lists to 10-K filing returns, trading volume, return volatility, fraud, material weakness, and unexpected earnings.

Question:

  • Can dictionaries developed in other disciplines be used for financial/business text data?

Bumper Sticker:

  • Words mean different things in context.

The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods

Abstract:

We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors. This article reviews several computerized text analysis methods and describes how Linguistic Inquiry and Word Count (LIWC) was created and validated. LIWC is a transparent text analysis program that counts words in psychologically meaningful categories. Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.

Background:

  • Content Words:

    • generally nouns, regular verbs, and many adjectives and adverbs.

    • “It was a dark and stormy night”

      • content words: “dark”, “stormy”, “night”
    • convey what people are saying

  • Style Words:

    • aka function words.

      • pronouns, prepositions, articles, conjunctions, auxiliary verbs, and a few other esoteric categories.
    • style words: how people are communicating.

Week 6

Dictionary Methods

  • Keyword counting

  • (weighted) average frequency

  • For dictionaries to work well, words must have a particular meaning in a specific context

  • Be very skeptical of out of the box approaches!

    • the financial liabilities reading is a good example of this.
  • Have to do validation.
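
A minimal keyword-counting sketch with a tiny made-up dictionary; real dictionaries (LIWC, Loughran-McDonald, etc.) are far larger and context-specific, which is exactly why validation matters:

```python
import re
from collections import Counter

# Toy word lists - illustrative only, not a validated lexicon.
positive = {"gain", "growth", "benefit"}
negative = {"liability", "loss", "decline"}

def dictionary_score(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    return (pos - neg) / max(len(tokens), 1)   # net tone as a share of all tokens

print(dictionary_score("Revenue growth offset the one-time liability and loss."))
```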

Validation

  • Develop a confusion matrix

  • minimize false positives and false negatives.

Zipf’s Law

  • In any corpus, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
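
A quick way to eyeball Zipf’s law on a corpus of your own (the one-sentence stand-in below just shows the mechanics; the rank-times-frequency product only flattens out on a real corpus):

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the rug".split()  # stand-in corpus

for rank, (word, freq) in enumerate(Counter(tokens).most_common(), start=1):
    print(rank, word, freq, rank * freq)   # rank * frequency is roughly constant under Zipf's law
```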

Wordscores

  • Extracting dimensional information from political texts using computerized content analysis

  • Training a Wordscores model requires reference scores for texts whose positions on well-defined a priori dimensions are “known”. Afterwards, Wordscores estimates the positions for the remaining “unknown” texts.

  • semi-supervised machine learning.

  • kinda similar to DW-NOMINATE scores.

  • But the issue is what substantive dimension the estimated positions actually capture.

    • DW-NOMINATE has always had issues with interpreting the second dimension (the y-axis).

      • Wordscores has a similar issue with labeling its dimension.
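
A minimal sketch of the Wordscores calculation as described above (the reference texts and scores are made up, and the usual rescaling and uncertainty steps are omitted; in practice this is typically run with quanteda.textmodels in R):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Reference texts with "known" positions on a single dimension (e.g., -1 = left, +1 = right).
ref_texts = ["tax the rich and expand welfare programs",
             "cut taxes and shrink government spending"]
ref_scores = np.array([-1.0, 1.0])
virgin_texts = ["expand welfare but also cut some spending"]

vec = CountVectorizer()
F = vec.fit_transform(ref_texts).toarray().astype(float)

F_wr = F / F.sum(axis=1, keepdims=True)        # relative frequency of each word in each reference text
P_rw = F_wr / F_wr.sum(axis=0, keepdims=True)  # P(reference text | word)
word_scores = ref_scores @ P_rw                # score for every word in the reference vocabulary

V = vec.transform(virgin_texts).toarray().astype(float)
V = V / V.sum(axis=1, keepdims=True)           # relative frequencies over the scored words
print(V @ word_scores)                         # estimated position of the "unknown" text
```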

Week 7 - Similarity and Complex measures

Lecture Notes:

Cosine Similarity:

  • What makes text similar?

  • cosine similarity measures how close two documents are when represented as vectors in space (the cosine of the angle between the two document vectors).

    • maximum similarity will occur between a document and itself.

      • 0 is min similarity and 1 is max for similarity
  • Jaccard Similarity

    • calculate the Jaccard coefficient: the share of words the two documents have in common out of all words that appear in either.
  • Distance metrics

    • euclidean distance

      • hypotenuse of Manhattan triangle.
    • Manhattan distance

      • up and over
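
A minimal sketch of these measures on two made-up word-count vectors:

```python
import numpy as np

a = np.array([2, 0, 1, 3])   # word counts for document A (toy)
b = np.array([1, 1, 0, 3])   # word counts for document B (toy)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))                # 1 = same direction, 0 = orthogonal
jaccard = (np.minimum(a, b) > 0).sum() / (np.maximum(a, b) > 0).sum()   # shared vocabulary / combined vocabulary
euclidean = np.linalg.norm(a - b)                                       # straight-line distance
manhattan = np.abs(a - b).sum()                                         # "up and over" distance

print(cosine, jaccard, euclidean, manhattan)
```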

Weighting and “Goldilocks Words”

  • we want words that are neither too rare nor too frequent

    • weighting can help us with this.

      • when we remove stopwords, we are setting their weight to 0.

      • a common approach is tf-idf

        • term frequency - inverse document frequency
  • TF-IDF (see the sketch below)

    • up-weights terms that appear often within a document but rarely across the corpus

    • down-weights terms that are common across the whole corpus.
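
A minimal tf-idf sketch with scikit-learn (the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the budget cuts healthcare spending",
        "the budget raises defense spending",
        "the committee met on tuesday"]

# Term frequency is multiplied by inverse document frequency, so words that
# show up in every document (like "the") get weights near zero.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))
```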

Textual Complexity

  • Linguistic complexity can lead to interesting measures/comparisons of documents

  • Domain-specific measures of complexity.
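
A crude, made-up illustration of the idea (real applications would use a validated readability index such as Flesch-Kincaid or a domain-specific measure):

```python
import re

text = "The committee convened. Deliberations regarding appropriations commenced immediately."

sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
words = re.findall(r"[A-Za-z]+", text)

print(len(words) / len(sentences))              # average words per sentence
print(sum(len(w) for w in words) / len(words))  # average characters per word
```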

Week 8 - Supervised Machine Learning

Supervised Machine Learning:

  • Create a mapping between the features within a set of documents and more general categories and concepts that have been defined by the researcher.

  • We learn the mapping function from the data by training an algorithm to predict the category of interest from the text using a random sample of coded documents.

  • we are predicting.

  • This approach is quite general and has the advantage that once the parameters of the model are learned classifying additional documents is essentially costless.

Classification Steps

  • Accurately Label Documents (create a training set & validation set)

  • feature representation

  • choose model & classify documents

  • IF YOUR TRAINING DATA SUCKS, YOUR CLASSIFIERS WILL SUCK!

Human Annotation

  • Stratified random sample of documents to annotate.

  • Create a codebook

    • codebook = instructions

      • how are we defining stuff

        • iterative process
      • objectivity/inter-subjectivity

        • you can compute intercoder reliability.

Types of supervised models:

Naive Bayes Classifiers:

  • SPAM v. Ham

    • how do we classify data example

    • Bayes rule

      • probability that a new email is spam given its text is equal to….see slide (the standard Bayes rule form is restated after this list).

        • let’s look at our priors

          • posterior probability is updated probability after taking new info into account.
      • for multiple words we treat each word as independent.

        • this is obviously not true.

          • but Naive Bayes is still useful.
      • Main thing: when training data is imbalanced - we run into a lot of issues!
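
The slide itself isn’t reproduced in these notes, but the standard Bayes rule statement for the spam example, with the naive word-independence assumption mentioned above, is:

\[
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(\text{spam})\, P(w_1, \dots, w_n \mid \text{spam})}{P(w_1, \dots, w_n)}
  \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
\]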

Support Vector Machines:

  • Builds a model from a binary labeled training set (yes/no, Rep/Dem) and returns an optimal hyperplane that puts new examples into one of the categories non-probabilistically.

    • first finds the support vectors (the training points closest to the separating hyperplane)

Model Performance

  • Is Bayes or SVM better?

  • confusion matrix!

  • LOOK AT BALANCED ACCURACY FOR UNBALANCED TRAINING DATA!
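
A minimal scikit-learn sketch comparing the two classifiers with a confusion matrix and balanced accuracy; the labeled documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

texts = ["cut taxes now", "expand medicare for all", "secure the border",
         "green new deal now", "defend the second amendment", "raise the minimum wage"] * 20
labels = ["R", "D", "R", "D", "R", "D"] * 20          # stand-in hand codes

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

for model in (MultinomialNB(), LinearSVC()):
    preds = model.fit(X_train, y_train).predict(X_test)
    print(type(model).__name__)
    print(confusion_matrix(y_test, preds))            # rows = true labels, columns = predictions
    print(balanced_accuracy_score(y_test, preds))     # averages recall across classes, so imbalance matters less
```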

Validation:

  • k-fold Cross-validation.

  • Face validity?

  • Convergent validity

  • hypothesis validity

  • out-of-sample performance may drop relative to performance on the training data.
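
A short k-fold cross-validation sketch, reusing the `X` matrix and `labels` list from the classifier sketch above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 5 folds: train on 4, evaluate on the held-out fold, rotate, then average.
scores = cross_val_score(MultinomialNB(), X, labels, cv=5, scoring="balanced_accuracy")
print(scores.mean(), scores.std())
```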

Week 9 - Unsupervised Machine Learning

  • Algorithms that learn patterns from unlabeled data.

    • there are no labels to start.
  • Helpful for exploratory analysis. What topics are in my data?

  • Researchers assign labels to clusters or topics through post-estimation interpretation of their elements - this requires a lot of substantive knowledge!

Clustering

  • creating partitions of the data.

  • they input a collection of texts and output a set of K categories and document assignments to each of those K categories.

  • How documents are organized is estimated as part of the process, rather than being assumed.

  • Clustering groups documents such that inside a cluster they are very similar to each other but different from those outside the cluster.

  • Clustering gives each document only one membership - LDA allows mixed (multiple) membership.

K-Means Clustering

  • Randomly select K starting centroids

  • each data point is assigned to its nearest centroid

  • Recompute the centroids as the mean of the data points assigned to the respective cluster.

    • repeat until the assignments stop changing (convergence).
  • Kinda an old method

  • Possibility the algorithm finds latent differences that weren’t theorized.

  • If we struggle to label a cluster - it’s probably not a good cluster.
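
A minimal k-means sketch on tf-idf vectors; the documents and the choice of K are made up, and the top centroid words are what you would inspect when trying to label each cluster:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["tax policy and the budget", "budget deficit and spending",
        "climate change and emissions", "carbon emissions policy",
        "border security and immigration", "immigration reform bill"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)                                   # cluster assignment for each document

words = vectorizer.get_feature_names_out()
for k, center in enumerate(km.cluster_centers_):
    top = np.argsort(center)[::-1][:3]
    print(k, [words[i] for i in top])               # highest-weight words near each centroid
```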

Topic Models:

  • Rather than assign each document to only one cluster, topic models assign each document with proportional membership to all categories.

Latent Dirichlet Allocation (LDA)

  • Bayesian hierarchical model that assumes a particular data generating process for how an author produces a text.

  • Suppose that when writing a text the author draws a mixture of topics: a set of weights that will describe how prevalent the particular topics are.

  • Given the set of weights, the author generates the actual text.

  • This topic-specific distribution is common across the documents and characterizes the rates words appear when discussing a particular topic.

  • It is a LATENT VARIABLE MODEL

    • latent variables models are statistical models where in addition to the observed variables, a set of latent variables also exists which is not observed.
  • LDA is basic topic model

  • Dirichlet is a type of distribution.

  • A simplex is a generalization of a triangle; the Dirichlet is a distribution over the simplex, which is where topic proportions live because they are non-negative and sum to 1.

    • a triangle is the simplest such shape in two dimensions.
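
A minimal LDA sketch with scikit-learn (gensim is another common choice); the corpus and the number of topics are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the budget bill cuts spending and taxes",
        "senators debate the spending bill",
        "the team won the championship game",
        "fans celebrate the game and the season"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                  # LDA works on raw counts, not tf-idf

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(k, [words[i] for i in np.argsort(topic)[::-1][:4]])   # top words per topic

print(lda.transform(X).round(2))                    # per-document topic proportions (rows sum to 1)
```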

Week 10 - Word Embeddings

  • We want context now!

  • we need to know whats around words now.

  • Word embeddings enable us to take context into account.

  • transfer learning - using stuff to train something else adjacent.

  • an embedding is a mapping of a discrete (categorical) variable to a vector of continuous numbers

Hidden Layer

  • embeddings basically create a giant lookup table that gets us from the input word to its vector representation.
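
A minimal word2vec sketch with gensim (one common embedding library, not necessarily the one used in class); the toy sentences are too small to give meaningful neighbors, but the API is the same on a real corpus:

```python
from gensim.models import Word2Vec

sentences = [["the", "senator", "proposed", "a", "bill"],
             ["the", "representative", "proposed", "a", "law"],
             ["the", "team", "won", "the", "game"]]

# Each word is mapped to a 50-dimensional vector learned from the words around it.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50, seed=1)

print(model.wv["senator"][:5])                      # first few dimensions of one embedding
print(model.wv.most_similar("senator", topn=2))     # nearest neighbors in embedding space
```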

Week 11

Transformers

  • Leverage embedding and supervised training for more nuanced/better text classification

  • Encoders are layers that process and refine the input text representation

  • Self-attention mechanism: calculates each word’s relevance to others in the sentence

    • allows each word to “focus” on other words in the sentence, adjusting its representation based on context.
  • Masked Language Modeling: Predicts missing words to understand sentence structure

    • “The animal didn’t cross the road because it was tired”

      • BERT can tell that “it” refers to “the animal”.
  • BERT cannot produce its own text.
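
A minimal masked-language-modeling sketch with the Hugging Face transformers library, using the lecture’s sentence (this downloads pretrained bert-base-uncased weights on first run):

```python
from transformers import pipeline

# BERT fills in the [MASK] token using context from both sides of the sentence.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The animal didn't cross the road because it was [MASK]."):
    print(round(pred["score"], 3), pred["token_str"])
```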

Citation

BibTeX citation:
@online{neilon2024,
  author = {Neilon, Stone},
  title = {Text as {Data}},
  date = {2024-08-26},
  url = {https://stoneneilon.github.io/notes/Text_Data/},
  langid = {en}
}
For attribution, please cite this work as:
Neilon, Stone. 2024. “Text as Data.” August 26, 2024. https://stoneneilon.github.io/notes/Text_Data/.