Forschungspraktikum 1+2: Computational Social Science

Session 06: Machine Learning with Text-as-Data

Dr. Christian Czymara

Agenda

  • Introduction to Sentiment Analysis
  • Machine Learning Approaches for Sentiment
  • Last session: How to train a supervised machine learning classifier
  • This session: Automatically determine categories of texts
  • Tutorial: Predicting Sentiment in Movie Reviews

Introduction to Text Classification with Machine Learning

  • Last session, we learned how to
    • Choose and evaluate classifiers
    • Optimize settings for best performance
  • In sessions 02 and 03, we learned how to convert text into numerical features (DFM)
  • Today, we will combine this knowledge

Limits of Dictionary-Based Sentiment Analysis

  • Sentiment (e.g., “positive” or “negative”) is often context-dependent
    • “Increasing” may imply positive growth (e.g., GDP)
    • But is negative in the context of, e.g., the unemployment rate
  • Machine learning can account for context and complex relationships

Limits of Machine Learning-Based Sentiment Analysis

  • Custom annotations are often necessary (resource-intensive)
  • Classifiers trained on one domain (e.g., movie reviews) may not generalize well to another (e.g., political texts)
  • Classifiers may generally perform poorly, especially if the data are messy

Example: Immigration News in Right-Wing Media

Example: Using quanteda for Text Classification to Predict Trump’s Tweets

Example Data: Trump (and others) on Twitter

table(tweets_combined$is_trump)

not trump     trump 
      519       519 

Example Tweets from Trump

head(sample(tweets_combined[tweets_combined$is_trump=="trump", ]$text))
[1] "@JUrciuoli19  I will."                                                                                                                            
[2] "\"\"\"@Jenism101: @realDonaldTrump @97Musick Trump is the only person we believe will actually set things right.\"\"\""                           
[3] "Unsolicited Ballots are uncontrollable, totally open to ELECTION INTERFERENCE by foreign countries, and will lead to massive chaos and confusion!"
[4] "RT @markknoller: Pres signed two broadband bills today:\n-requires Pres to ensure security of 5G and subsequent generation cell phone techno…"    
[5] "\"\"\"@rebelgirl1213: @realDonaldTrump @Carrienguns America cannot survive another Bush! #NoMoreBush\"\"\""                                       
[6] "I feel so badly for Mark Cuban-the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad!"                

Example Tweets not from Trump

head(sample(tweets_combined[tweets_combined$is_trump=="not trump", ]$text))
[1] "@Kat_La VOMIT! Ugh."                                                                                                                     
[2] "Take comfort progressive Australia. Abbott's Victory is the sickness that will force us to vomit out the poison. http://t.co/A5AIw3DdFO" 
[3] "@mynameiselysee it looked like this. I want to vomit rn http://t.co/c5pD43TMuA"                                                          
[4] "About to vomit"                                                                                                                          
[5] "I'm not saying that if you voted Liberal you're a horrible vomit-flavoured garbage heap of  rancid bigotry but um yeah I am saying that."
[6] "Oh fuck! I think I'm going to vomit.... \nThis is not romance. #TheBachelorAU is getting too tacky. #Sickly #VomitMaterial"              

Preprocessing the Data

library(quanteda)

tweets_corpus <- corpus(tolower(tweets_combined$text),
                        docvars = tweets_combined)

toks_tweets <- tokens(tweets_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      remove_separators = TRUE,
                      include_docvars = TRUE)

# toks_tweets <- tokens_group(toks_tweets, groups = tweets_combined$is_trump)

toks_tweets <- tokens_remove(toks_tweets, stopwords())

dfm_tweets <- dfm(toks_tweets)

The DFM

dfm_tweets
Document-feature matrix of: 1,038 documents, 4,487 features (99.76% sparse) and 2 docvars.
       features
docs    bleh feeling vomit gonna bangerz close can just feel miley
  text1    1       1     1     0       0     0   0    0    0     0
  text2    0       0     1     1       1     1   1    1    1     1
  text3    0       0     1     0       0     0   0    0    1     0
  text4    0       0     1     0       0     0   0    0    0     0
  text5    0       0     1     0       0     0   0    0    0     0
  text6    0       0     1     0       0     0   0    0    1     0
[ reached max_ndoc ... 1,032 more documents, reached max_nfeat ... 4,477 more features ]
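With 99.76% sparsity, trimming rare features is a common next step before training. A minimal sketch on toy data (the texts and the threshold below are illustrative, not from the tweet corpus):

```r
library(quanteda)

# Toy corpus: three short "documents" (illustrative only)
toy_toks <- tokens(c("vomit bleh feeling", "vomit gonna close", "vomit feel"))
toy_dfm <- dfm(toy_toks)

# Keep only features that occur at least twice across all documents
toy_dfm_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2)

featnames(toy_dfm_trimmed)  # only "vomit" clears the threshold
```

`dfm_trim()` also accepts document-frequency cutoffs (`min_docfreq`), which is often the more meaningful filter for classification.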

Preprocessing: Split

# Split the data (70% training, 30% test)
set.seed(1337)

train_indices <- sample(1:ndoc(dfm_tweets), size = 0.7 * ndoc(dfm_tweets))

Preprocessing: Split

# Training data
dfm_train <- dfm_tweets[train_indices, ]

trump_observed_train <- docvars(dfm_train, "is_trump")

# Test data
dfm_test <- dfm_tweets[-train_indices, ]

trump_observed_test <- docvars(dfm_test, "is_trump")
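Because `sample()` does not stratify by class, it is worth checking that both labels remain roughly balanced after the split. A self-contained sketch with simulated labels (illustrative, not the slide's tweet data, though the 519/519 balance mirrors it):

```r
set.seed(1337)

# Simulated binary labels, balanced like the tweet data (519/519)
labels <- rep(c("not trump", "trump"), each = 519)

# Unstratified 70/30 split, as on the previous slide
train_idx <- sample(seq_along(labels), size = floor(0.7 * length(labels)))

# Class proportions in training and test set should both be near 0.5
prop.table(table(labels[train_idx]))
prop.table(table(labels[-train_idx]))
```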

Tutorial 06: Exercises 1–2

Train Support Vector Machine

library(quanteda.textmodels)

# Train Support Vector Machine
trump_svm_model <- textmodel_svm(dfm_train, y = trump_observed_train)

trump_svm_model

Call:
textmodel_svm.dfm(x = dfm_train, y = trump_observed_train)

726 training documents; 3,488 fitted features.
Method: L2-regularized L2-loss support vector classification dual (L2R_L2LOSS_SVC_DUAL)

Predict

# Predict the sentiment for the test set
trump_predictions_svm <- predict(trump_svm_model, newdata = dfm_test)

head(trump_predictions_svm)
    text2     text4     text8     text9    text10    text12 
not trump not trump not trump not trump not trump not trump 
Levels: not trump trump

Evaluation: Confusion Matrix

# Generate a confusion matrix
confusion_matrix_trump <- table(Predicted = trump_predictions_svm, Actual = trump_observed_test)

confusion_matrix_trump
           Actual
Predicted   not trump trump
  not trump       157     0
  trump             2   153

Evaluation: Metrics

# Calculate precision and recall
tp_svm <- confusion_matrix_trump["trump", "trump"]
fp_svm <- confusion_matrix_trump["trump", "not trump"]
fn_svm <- confusion_matrix_trump["not trump", "trump"]

precision_svm <- tp_svm / (tp_svm + fp_svm)
recall_svm <- tp_svm / (tp_svm + fn_svm)
f1_score_svm <- 2 * (precision_svm * recall_svm) / (precision_svm + recall_svm)

precision_svm
[1] 0.9870968
recall_svm
[1] 1
f1_score_svm
[1] 0.9935065
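Accuracy can be read off the same confusion matrix; the counts below are copied from the output above:

```r
# Confusion matrix counts from the slide: rows = predicted, cols = actual
cm <- matrix(c(157, 2, 0, 153), nrow = 2,
             dimnames = list(Predicted = c("not trump", "trump"),
                             Actual    = c("not trump", "trump")))

# Accuracy: share of correct predictions, i.e., the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
# [1] 0.9935897
```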

Precision vs. Recall Trade-Off

  • High Recall: Extensive dictionaries capture more relevant terms
    • Drawback: Potentially includes irrelevant terms, reducing precision
  • High Precision: Focused dictionaries improve accuracy but may miss related terms
    • Drawback: Lower recall, potentially overlooking relevant instances
  • Optimal Balance: Depends on research objectives and need for precision
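One way to encode this trade-off numerically is the F-beta score, which weights recall beta times as much as precision. A sketch using the precision and recall from the SVM slide:

```r
# F-beta: beta > 1 favours recall, beta < 1 favours precision
f_beta <- function(precision, recall, beta) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

precision <- 0.9870968  # from the SVM evaluation slide
recall    <- 1

f_beta(precision, recall, beta = 1)    # F1, balanced
f_beta(precision, recall, beta = 2)    # favours recall
f_beta(precision, recall, beta = 0.5)  # favours precision
```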

Tutorial 06: Exercise 3.

Text Classification with quanteda

Using quanteda with text-as-data

  • quanteda.textmodels provides several models for text classification
  • Includes supervised machine learning, text scaling, and regression
  • Alternative to the more general tidymodels

Linear Regression

  • textmodel_lm()
  • Fits a linear regression model using DFM features as predictors
  • Predict continuous outcomes (e.g., scores, ratings) from text data
  • Suitable for regression tasks, interpretable coefficients

Naïve Bayes

  • textmodel_nb()
  • Multinomial Naïve Bayes model
  • Classification of text into categories (e.g., sentiment analysis, spam detection)
  • Fast and simple model, suitable for high-dimensional sparse data like text
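Naïve Bayes can be fitted with one line. A self-contained toy sketch (the texts, labels, and the test document are invented for illustration):

```r
library(quanteda)
library(quanteda.textmodels)

# Toy training data: two classes with distinctive vocabulary (illustrative)
txt <- c("great win huge success", "tremendous great deal",
         "vomit awful gross", "awful vomit tacky")
y   <- factor(c("trump", "trump", "not trump", "not trump"))

nb_toy <- textmodel_nb(dfm(tokens(txt)), y = y,
                       distribution = "multinomial")

# New documents must be matched to the training features before predicting
new_dfm <- dfm_match(dfm(tokens("a great tremendous success")),
                     features = featnames(dfm(tokens(txt))))
predict(nb_toy, newdata = new_dfm)
```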

Support Vector Machines

  • textmodel_svm() (using e1071)
  • Classification of text into categories (e.g., sentiment analysis, spam detection)
  • Effective for high-dimensional data, offers kernel-based flexibility

Wordfish

  • textmodel_wordfish()
  • Wordfish scaling model for estimating latent traits from textual data
  • Used to place texts (e.g., political speeches or documents) on a latent scale based on their word frequencies
  • Unsupervised analysis of, for example, political ideology
  • See Slapin and Proksch (2008)

Wordfish

  • Poisson Naïve Bayes model, assuming the number of times text \(i\) mentions word \(j\) is drawn from a Poisson distribution
  • Dependent variable: one-dimensional latent variable (similarity)
  • \(\beta\): association of each word to the latent scale
  • \(\theta\): position of each document on the latent scale, based on its words
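Putting these pieces together, the model in Slapin and Proksch (2008) for the count \(y_{ij}\) of word \(j\) in text \(i\) is

\(y_{ij} \sim \text{Poisson}(\lambda_{ij})\), with \(\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i\)

where \(\alpha_i\) is a document fixed effect and \(\psi_j\) a word fixed effect, alongside the word weights \(\beta_j\) and document positions \(\theta_i\) described above.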

Summary

Model                  Use Case
---------------------  ------------------------------
textmodel_lm()         Predicting continuous outcomes
textmodel_nb()         Text classification
textmodel_svm()        Complex classification
textmodel_wordfish()   Latent scaling (unsupervised)

Tutorial 06: Exercise 4.

Text Classification with tidymodels (under construction)

Using tidymodels with text-as-data

  • The tidymodels workflow we learned last week can also be applied to text-as-data
    • Integration into the tidyverse
    • Streamlined (e.g., cross-validation)
    • Cutting edge methods

Setting up the tidymodels workflow

# Convert Document-Feature Matrix to data.frame
trump_dfm_df <- quanteda::convert(dfm_tweets, to = "data.frame")

trump_dfm_df$doc_id <- NULL

# Add is_trump
tidy_tweets <- cbind(trump_dfm_df, is_trump = as.factor(tweets_combined$is_trump))

library(tidymodels)

set.seed(1337)
data_split <- initial_split(tidy_tweets, prop = 0.8, strata = is_trump)

train_data <- training(data_split)  # Training set
test_data <- testing(data_split)    # Test set

Setting up the tidymodels workflow

tweets_recipe <- recipe(is_trump ~ ., data = train_data) # Trump category as a function of all terms

# Specify Naive Bayes model
nb_model <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

Setting up the tidymodels workflow

library(naivebayes)
library(discrim)

# Combine recipe and model in a workflow
nb_workflow <- workflow() %>%
  add_recipe(tweets_recipe) %>%
  add_model(nb_model)

# Train Model
set.seed(1337)
nb_fit <- fit(nb_workflow, 
  data = train_data)

# Predict
trump_predictions <- predict(nb_fit, new_data = test_data)

Confusion Matrix

conf_mat(data = bind_cols(test_data, trump_predictions), truth = is_trump, estimate = .pred_class)
           Truth
Prediction  not trump trump
  not trump         0     0
  trump           104   104
  • Something did not work well here…
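  • One plausible (unverified) culprit: with the naivebayes engine, numeric count features are modelled as Gaussian densities, which degenerates on thousands of sparse, zero-heavy text features

One thing to try is Laplace smoothing, which is a real argument of `discrim::naive_Bayes()`; whether it rescues this particular fit is untested here, so treat this as a sketch:

```r
library(tidymodels)
library(discrim)

# Naive Bayes specification with Laplace smoothing added;
# this replaces the plain naive_Bayes() spec from the earlier slide
nb_model_smoothed <- naive_Bayes(Laplace = 1) %>%
  set_engine("naivebayes") %>%
  set_mode("classification")
```

Alternatively, collapsing the counts to presence/absence indicators before fitting often behaves better with this engine.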