Forschungspraktikum 1+2: Computational Social Science

Session 06: Machine Learning with Text-as-Data

Dr. Christian Czymara

Agenda

  • Introduction to Sentiment Analysis
  • Machine Learning Approaches for Sentiment
  • Last session: How to train a supervised machine learning classifier
  • This session: Automatically determine categories of texts
  • Tutorial: Predicting Sentiment in Movie Reviews

Introduction to Text Classification with Machine Learning

  • Last session, we learned how to
    • Choose and evaluate classifiers
    • Optimize settings for best performance
  • In sessions 02 and 03, we learned how to convert text into numerical features (DFM)
  • Today, we will combine this knowledge

Limits of Dictionary-Based Sentiment Analysis

  • Sentiment (e.g., “positive” or “negative”) is often context-dependent
    • “Increasing” may imply positive growth (e.g., GDP)
    • But is negative in the context of, e.g., the unemployment rate
  • Machine learning can account for context and complex relationships

Limits of Machine Learning-Based Sentiment Analysis

  • Custom annotations are often necessary (resource-intensive)
  • Classifiers trained on one domain (e.g., movie reviews) may not generalize well to another (e.g., political texts)
  • Classifiers may generally perform poorly, especially if the data are messy

Example: Immigration News in Right-Wing Media

Example: Using quanteda for Text Classification to Predict Trump’s Tweets

Example Data: Trump (and others) on Twitter

table(tweets_combined$is_trump)

not trump     trump 
      519       519 

Example Tweets from Trump

head(sample(tweets_combined[tweets_combined$is_trump=="trump", ]$text))
[1] "@JUrciuoli19  I will."                                                                                                                            
[2] "\"\"\"@Jenism101: @realDonaldTrump @97Musick Trump is the only person we believe will actually set things right.\"\"\""                           
[3] "Unsolicited Ballots are uncontrollable, totally open to ELECTION INTERFERENCE by foreign countries, and will lead to massive chaos and confusion!"
[4] "RT @markknoller: Pres signed two broadband bills today:\n-requires Pres to ensure security of 5G and subsequent generation cell phone techno…"    
[5] "\"\"\"@rebelgirl1213: @realDonaldTrump @Carrienguns America cannot survive another Bush! #NoMoreBush\"\"\""                                       
[6] "I feel so badly for Mark Cuban-the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad!"                

Example Tweets not from Trump

head(sample(tweets_combined[tweets_combined$is_trump=="not trump", ]$text))
[1] "@Kat_La VOMIT! Ugh."                                                                                                                     
[2] "Take comfort progressive Australia. Abbott's Victory is the sickness that will force us to vomit out the poison. http://t.co/A5AIw3DdFO" 
[3] "@mynameiselysee it looked like this. I want to vomit rn http://t.co/c5pD43TMuA"                                                          
[4] "About to vomit"                                                                                                                          
[5] "I'm not saying that if you voted Liberal you're a horrible vomit-flavoured garbage heap of  rancid bigotry but um yeah I am saying that."
[6] "Oh fuck! I think I'm going to vomit.... \nThis is not romance. #TheBachelorAU is getting too tacky. #Sickly #VomitMaterial"              

Preprocessing the Data

library(quanteda)

tweets_corpus <- corpus(tolower(tweets_combined$text),
                        docvars = tweets_combined)

toks_tweets <- tokens(tweets_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      remove_separators = TRUE,
                      include_docvars = TRUE)

# toks_tweets <- tokens_group(toks_tweets, groups = tweets_combined$is_trump)

toks_tweets <- tokens_remove(toks_tweets, stopwords())

dfm_tweets <- dfm(toks_tweets)

The DFM

dfm_tweets
Document-feature matrix of: 1,038 documents, 4,487 features (99.76% sparse) and 2 docvars.
       features
docs    bleh feeling vomit gonna bangerz close can just feel miley
  text1    1       1     1     0       0     0   0    0    0     0
  text2    0       0     1     1       1     1   1    1    1     1
  text3    0       0     1     0       0     0   0    0    1     0
  text4    0       0     1     0       0     0   0    0    0     0
  text5    0       0     1     0       0     0   0    0    0     0
  text6    0       0     1     0       0     0   0    0    1     0
[ reached max_ndoc ... 1,032 more documents, reached max_nfeat ... 4,477 more features ]
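With 99.76% sparsity, trimming rare features is a common next step before training. A minimal sketch on toy data (the texts and the threshold below are illustrative, not from the tweet corpus):

```r
library(quanteda)

# Toy corpus: three short "documents" (illustrative only)
toy_toks <- tokens(c("vomit bleh feeling", "vomit gonna close", "vomit feel"))
toy_dfm <- dfm(toy_toks)

# Keep only features that occur at least twice across all documents
toy_dfm_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2)

featnames(toy_dfm_trimmed)  # only "vomit" clears the threshold
```

`dfm_trim()` also accepts document-frequency cutoffs (`min_docfreq`), which is often the more meaningful filter for classification.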

Preprocessing: Split

# Split the data (70% training, 30% test)
set.seed(1337)

train_indices <- sample(1:ndoc(dfm_tweets), size = 0.7 * ndoc(dfm_tweets))

Preprocessing: Split

# Training data
dfm_train <- dfm_tweets[train_indices, ]

trump_observed_train <- docvars(dfm_train, "is_trump")

# Test data
dfm_test <- dfm_tweets[-train_indices, ]

trump_observed_test <- docvars(dfm_test, "is_trump")
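Because `sample()` does not stratify by class, it is worth checking that both labels remain roughly balanced after the split. A self-contained sketch with simulated labels (illustrative, not the slide's tweet data, though the 519/519 balance mirrors it):

```r
set.seed(1337)

# Simulated binary labels, balanced like the tweet data (519/519)
labels <- rep(c("not trump", "trump"), each = 519)

# Unstratified 70/30 split, as on the previous slide
train_idx <- sample(seq_along(labels), size = floor(0.7 * length(labels)))

# Class proportions in training and test set should both be near 0.5
prop.table(table(labels[train_idx]))
prop.table(table(labels[-train_idx]))
```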

Tutorial 06: Exercises 1–2

Train Support Vector Machine

library(quanteda.textmodels)

# Train Support Vector Machine
trump_svm_model <- textmodel_svm(dfm_train, y = trump_observed_train)

trump_svm_model

Call:
textmodel_svm.dfm(x = dfm_train, y = trump_observed_train)

726 training documents; 3,488 fitted features.
Method: L2-regularized L2-loss support vector classification dual (L2R_L2LOSS_SVC_DUAL)

Predict

# Predict the sentiment for the test set
trump_predictions_svm <- predict(trump_svm_model, newdata = dfm_test)

head(trump_predictions_svm)
    text2     text4     text8     text9    text10    text12 
not trump not trump not trump not trump not trump not trump 
Levels: not trump trump

Evaluation: Confusion Matrix

# Generate a confusion matrix
confusion_matrix_trump <- table(Predicted = trump_predictions_svm, Actual = trump_observed_test)

confusion_matrix_trump
           Actual
Predicted   not trump trump
  not trump       157     0
  trump             2   153

Evaluation: Metrics

# Calculate precision and recall
tp_svm <- confusion_matrix_trump["trump", "trump"]
fp_svm <- confusion_matrix_trump["trump", "not trump"]
fn_svm <- confusion_matrix_trump["not trump", "trump"]

precision_svm <- tp_svm / (tp_svm + fp_svm)
recall_svm <- tp_svm / (tp_svm + fn_svm)
f1_score_svm <- 2 * (precision_svm * recall_svm) / (precision_svm + recall_svm)

precision_svm
[1] 0.9870968
recall_svm
[1] 1
f1_score_svm
[1] 0.9935065
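Accuracy can be read off the same confusion matrix; the counts below are copied from the output above:

```r
# Confusion matrix counts from the slide: rows = predicted, cols = actual
cm <- matrix(c(157, 2, 0, 153), nrow = 2,
             dimnames = list(Predicted = c("not trump", "trump"),
                             Actual    = c("not trump", "trump")))

# Accuracy: share of correct predictions, i.e., the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
# [1] 0.9935897
```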

Precision vs. Recall Trade-Off

  • High Recall: Extensive dictionaries capture more relevant terms
    • Drawback: Potentially includes irrelevant terms, reducing precision
  • High Precision: Focused dictionaries improve accuracy but may miss related terms
    • Drawback: Lower recall, potentially overlooking relevant instances
  • Optimal Balance: Depends on research objectives and need for precision
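One way to encode this trade-off numerically is the F-beta score, which weights recall beta times as much as precision. A sketch using the precision and recall from the SVM slide:

```r
# F-beta: beta > 1 favours recall, beta < 1 favours precision
f_beta <- function(precision, recall, beta) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

precision <- 0.9870968  # from the SVM evaluation slide
recall    <- 1

f_beta(precision, recall, beta = 1)    # F1, balanced
f_beta(precision, recall, beta = 2)    # favours recall
f_beta(precision, recall, beta = 0.5)  # favours precision
```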

Tutorial 06: Exercise 3.

Text Classification with quanteda

Using quanteda with text-as-data

  • quanteda.textmodels provides several models for text classification
  • Includes supervised machine learning, text scaling, and regression
  • Alternative to the more general tidymodels

Linear Regression

  • textmodel_lm()
  • Fits a linear regression model using DFM features as predictors
  • Predict continuous outcomes (e.g., scores, ratings) from text data
  • Suitable for regression tasks, interpretable coefficients

Naïve Bayes

  • textmodel_nb()
  • Multinomial Naïve Bayes model
  • Classification of text into categories (e.g., sentiment analysis, spam detection)
  • Fast and simple model, suitable for high-dimensional sparse data like text
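Naïve Bayes can be fitted with one line. A self-contained toy sketch (the texts, labels, and the test document are invented for illustration):

```r
library(quanteda)
library(quanteda.textmodels)

# Toy training data: two classes with distinctive vocabulary (illustrative)
txt <- c("great win huge success", "tremendous great deal",
         "vomit awful gross", "awful vomit tacky")
y   <- factor(c("trump", "trump", "not trump", "not trump"))

nb_toy <- textmodel_nb(dfm(tokens(txt)), y = y,
                       distribution = "multinomial")

# New documents must be matched to the training features before predicting
new_dfm <- dfm_match(dfm(tokens("a great tremendous success")),
                     features = featnames(dfm(tokens(txt))))
predict(nb_toy, newdata = new_dfm)
```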

Support Vector Machines

  • textmodel_svm() (using e1071)
  • Classification of text into categories (e.g., sentiment analysis, spam detection)
  • Effective for high-dimensional data, offers kernel-based flexibility

Wordfish

  • textmodel_wordfish()
  • Wordfish scaling model for estimating latent traits from textual data
  • Used to place texts (e.g., political speeches or documents) on a latent scale based on their word frequencies
  • Unsupervised analysis of, for example, political ideology
  • See Slapin and Proksch (2008)

Wordfish

  • Poisson Naïve Bayes model, assuming the number of times text \(i\) mentions word \(j\) is drawn from a Poisson distribution
  • Dependent variable: one-dimensional latent variable (similarity)
  • \(\beta\): association of each word to the latent scale
  • \(\theta\): position of each document on the latent scale, based on its words
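Putting these pieces together, the model in Slapin and Proksch (2008) for the count \(y_{ij}\) of word \(j\) in text \(i\) is

\(y_{ij} \sim \text{Poisson}(\lambda_{ij})\), with \(\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i\)

where \(\alpha_i\) is a document fixed effect and \(\psi_j\) a word fixed effect, alongside the word weights \(\beta_j\) and document positions \(\theta_i\) described above.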

Summary

Model                  Use Case
---------------------  ------------------------------
textmodel_lm()         Predicting continuous outcomes
textmodel_nb()         Text classification
textmodel_svm()        Complex classification
textmodel_wordfish()   Latent scaling (unsupervised)

Tutorial 06: Exercise 4.

Text Classification with tidymodels (under construction)

Using tidymodels with text-as-data

  • The tidymodels workflow we learned last week can also be applied to text-as-data
    • Integration into the tidyverse
    • Streamlined (e.g., cross-validation)
    • Cutting edge methods

Setting up the tidymodels workflow

# Convert Document-Feature Matrix to data.frame
trump_dfm_df <- quanteda::convert(dfm_tweets, to = "data.frame")

trump_dfm_df$doc_id <- NULL

# Add is_trump
tidy_tweets <- cbind(trump_dfm_df, is_trump = as.factor(tweets_combined$is_trump))

library(tidymodels)

set.seed(1337)
data_split <- initial_split(tidy_tweets, prop = 0.8, strata = is_trump)

train_data <- training(data_split)  # Training set
test_data <- testing(data_split)    # Test set

Setting up the tidymodels workflow

tweets_recipe <- recipe(is_trump ~ ., data = train_data) # Trump category as a function of all terms

# Specify Naive Bayes model
nb_model <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

Setting up the tidymodels workflow

library(naivebayes)
library(discrim)

# Combine recipe and model in a workflow
nb_workflow <- workflow() %>%
  add_recipe(tweets_recipe) %>%
  add_model(nb_model)

# Train Model
set.seed(1337)
nb_fit <- fit(nb_workflow, 
  data = train_data)

# Predict
trump_predictions <- predict(nb_fit, new_data = test_data)

Confusion Matrix

conf_mat(data = bind_cols(test_data, trump_predictions), truth = is_trump, estimate = .pred_class)
           Truth
Prediction  not trump trump
  not trump         0     0
  trump           104   104
  • Something did not work well here…
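  • One plausible (unverified) culprit: with the naivebayes engine, numeric count features are modelled as Gaussian densities, which degenerates on thousands of sparse, zero-heavy text features

One thing to try is Laplace smoothing, which is a real argument of `discrim::naive_Bayes()`; whether it rescues this particular fit is untested here, so treat this as a sketch:

```r
library(tidymodels)
library(discrim)

# Naive Bayes specification with Laplace smoothing added;
# this replaces the plain naive_Bayes() spec from the earlier slide
nb_model_smoothed <- naive_Bayes(Laplace = 1) %>%
  set_engine("naivebayes") %>%
  set_mode("classification")
```

Alternatively, collapsing the counts to presence/absence indicators before fitting often behaves better with this engine.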