Forschungspraktikum 1+2: Computational Social Science

Session 05: Machine Learning

Dr. Christian Czymara

Agenda

  • Introduction to supervised machine learning
  • Predicting with linear regression
  • Predicting with machine learning
    • Applying the tidymodels framework
    • Validation
    • Cross-validation
    • Choosing a classifier
  • Tutorial: Predicting internet use in survey data

Supervised vs. Unsupervised Learning

  • Distinction between supervised and unsupervised machine learning (SML vs UML)
  • Types of problems suited for each:
    • SML: Classification and regression
    • UML: Clustering and association (more in the next sessions)

Supervised Machine Learning

  • Predicting a known variable using labeled data
  • For example: Predicting gender from Twitter biographies
    • Manually label some data points (e.g., identify gender for certain profiles)
    • Train model to classify based on labeled examples
    • Use model to predict gender for unlabeled profiles

Unsupervised Machine Learning

  • Analyzing data without labeled examples
  • Clustering and associations
  • Topic Modeling: Extraction of “topics” from textual data without prior labels

Goal of Supervised Machine Learning

  • Objective: Estimate model from data
  • … and use it to predict outcomes for new cases
  • Ideal Scenarios:
    • Large Labeled Dataset: Numerous examples to classify
    • Partial Labeling: A random subset of the dataset is labeled
    • These labeled examples help the model learn to classify new, unlabeled data points

Predicting data

diamonds data set

We’ll use the diamonds dataset for this example

library(ggplot2)

data(diamonds)

nrow(diamonds)
[1] 53940
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Prediction with Linear Regression

Linear Regression

lin_mod <- lm(formula = "price ~ carat + depth",
              data = diamonds)

summary(lin_mod)

Call:
lm(formula = "price ~ carat + depth", data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18238.9   -801.6    -19.6    546.3  12683.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4045.333    286.205   14.13   <2e-16 ***
carat       7765.141     14.009  554.28   <2e-16 ***
depth       -102.165      4.635  -22.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1542 on 53937 degrees of freedom
Multiple R-squared:  0.8507,    Adjusted R-squared:  0.8507 
F-statistic: 1.536e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

Prediction with Linear Regression

equatiomatic::extract_eq(lin_mod, use_coefs = TRUE)

\[ \operatorname{\widehat{price}} = 4045.33 + 7765.14(\operatorname{carat}) - 102.17(\operatorname{depth}) \]

carat <- c(.5, 3)

depth  <- c(65, 72)
newdata <- data.frame(carat, depth)

predict(lin_mod, newdata)
        1         2 
 1287.158 19984.852 
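As a sanity check, the first predicted value can be reproduced by plugging the new case into the fitted equation by hand (a minimal sketch that refits the same lm() model as above):

```r
# refit the model from above and extract its coefficients
library(ggplot2)

data(diamonds)
lin_mod <- lm(price ~ carat + depth, data = diamonds)
b <- coef(lin_mod)  # (Intercept), carat, depth

# first new case: carat = 0.5, depth = 65
by_hand <- b["(Intercept)"] + b["carat"] * 0.5 + b["depth"] * 65

unname(by_hand)  # ~1287.16, matching predict() above
```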

Prediction with Linear Regression

  • Shift focus from interpretation of coefficients to the prediction of the dependent variable for new, unknown cases
  • Step 1: Estimate model
  • Step 2: Apply model to data
  • Problems:
    • Unrealistic predictions
    • Model assumptions may be violated
    • Often interested in predicting categories (“classification”)

Terminology in Machine Learning and Statistics

  Machine Learning Term   Statistical Term
  ---------------------   -------------------------------
  Feature                 Predictor/Independent Variable
  Label                   Outcome/Dependent Variable
  Training Set            Sample/Data Used for Estimation
  Prediction              Estimate
  Accuracy                Proportion Explained/Model Fit

Prediction with Machine Learning

Naïve Bayes

\[P(\text{Class} | \text{Data}) = \frac{P(\text{Data} | \text{Class}) \times P(\text{Class})}{P(\text{Data})}\]

  • \(P(\text{Class} | \text{Data})\): Probability of the class (e.g., spam or not spam), given the data
  • \(P(\text{Data} | \text{Class})\): Probability of observing the data, given the class (the likelihood)
  • \(P(\text{Class})\): Fraction of all cases in this class (e.g., spam)
  • The formula lets us choose the class with the highest probability given the data

Key Idea: Independence Assumption

  • Naïve Bayes assumes that each feature is independent of others given the class
  • This makes calculations simple:

\[P(\text{Data} | \text{Class}) = P(\text{Feature 1} | \text{Class}) \times P(\text{Feature 2} | \text{Class}) \times \dots\]
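The rule above can be illustrated with a toy spam example (the rates below are hypothetical, chosen only for illustration):

```r
# Hypothetical base rates: 40% of emails are spam,
# and the word "free" appears in 60% of spam and 5% of non-spam mails
p_spam <- 0.4
p_ham  <- 0.6
p_free_given_spam <- 0.60
p_free_given_ham  <- 0.05

# Denominator P(Data): total probability of seeing "free"
p_free <- p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' rule: P(spam | "free")
p_spam_given_free <- p_free_given_spam * p_spam / p_free

round(p_spam_given_free, 3)  # 0.889: "free" makes spam the more probable class
```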

Applying Supervised Machine Learning using tidymodels

Tidymodels

  • Tidymodels is a collection of packages for machine learning in R, following the tidyverse principles
  • Consistent syntax and modular design for:
    • Data preparation
    • Model building
    • Evaluation and tuning
# install.packages("tidymodels")

library(tidymodels)

Tidymodels: Seven Steps

  1. Prepare and split data
  2. Define a model and choose a classifier
  3. Create a recipe
  4. Create a workflow
  5. Fit model to training data (train classifier)
  6. Evaluate the model on the test data
  7. Apply the model to the unlabeled data

Example: Predicting Ideal Cut Quality from Diamonds Dataset

  • Goal: Predict the cut quality of diamonds based on other attributes (carat, color, clarity)
  • Thus, we will always need a labeled/annotated/coded data set
  • In this case, we can use an 80% random sample as training data (the data has full information anyway)
  • We use a binary version of the cut quality: “ideal” vs. “not ideal”

Step 1: Prepare Data

  • Randomly split the data into training and test set
    • Training: For model building
    • Test: For model evaluation
    • Stratified by the dependent variable
diamonds$ideal <- as.factor(ifelse(diamonds$cut == "Ideal", "ideal", "not ideal"))

levels(diamonds$ideal)
[1] "ideal"     "not ideal"
set.seed(1337) # ensures reproducibility
diamonds_split <- initial_split(diamonds, prop = 0.8, strata = ideal)

diamonds_train <- training(diamonds_split)

diamonds_test <- testing(diamonds_split)

Tutorial 05: Exercises 1.-2.

Step 2: Define a Model

  • We will use a Naïve Bayes model for classification
library(naivebayes)
library(discrim)

nb_model <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

nb_model
Naive Bayes Model Specification (classification)

Computational engine: naivebayes 

Step 3: Create a Recipe for Preprocessing

  • Defines the model formula (outcome and features) and any data preprocessing steps
diamonds_recipe <- recipe(ideal ~ carat + color + clarity, data = diamonds_train)

Step 4: Create a Workflow

  • Combine the model specification and recipe in a workflow
diamonds_workflow <- workflow() %>%
  add_model(nb_model) %>%
  add_recipe(diamonds_recipe)

diamonds_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: naive_Bayes()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Naive Bayes Model Specification (classification)

Computational engine: naivebayes 

Step 5: Fit the Model

  • Fit the model to the training data
  • Put differently, we train the model using the training dataset
diamonds_fit <- fit(diamonds_workflow, data = diamonds_train)

Tutorial 05: Exercise 3.a)-d).

Step 6: Evaluate the Model

  • How accurate is the classifier?
  • Evaluating it on the training data may show high performance, but this is misleading: the model was fitted to this exact data and may simply be overfit
  • The goal is to assess if it can accurately predict new data
  • To avoid overfitting, test the classifier on the unseen test data

Step 6: Evaluate the Model

  • Predict values in the test data
diamonds_predictions <- predict(diamonds_fit, new_data = diamonds_test)

# add predictions to test data
diamonds_test <- bind_cols(diamonds_test, diamonds_predictions)

diamonds_test[, 11:12]
# A tibble: 10,789 × 2
   ideal     .pred_class
   <fct>     <fct>      
 1 ideal     not ideal  
 2 not ideal not ideal  
 3 not ideal not ideal  
 4 not ideal not ideal  
 5 not ideal not ideal  
 6 not ideal not ideal  
 7 not ideal not ideal  
 8 not ideal not ideal  
 9 not ideal not ideal  
10 ideal     ideal      
# ℹ 10,779 more rows

Step 6: Evaluate the Model

  • Compare the predicted and the observed values using a confusion matrix
conf_mat(data = diamonds_test, truth = ideal, estimate = .pred_class)
           Truth
Prediction  ideal not ideal
  ideal      2064      1657
  not ideal  2247      4821
  • True Positives (TP): Correct positive predictions
  • False Positives (FP): Incorrectly predicted as positive
  • True Negatives (TN): Correct negative predictions
  • False Negatives (FN): Incorrectly predicted as negative

Precision

  • Measures accuracy among predicted positives
  • Formula: \(\text{Precision} = \frac{TP}{TP + FP}\)
  • High precision = fewer false positives
  • In example: \(\text{Precision} = \frac{2064}{2064 + 1657} = \frac{2064}{3721} \approx 0.555\)

Recall

  • Measures completeness of positives captured
  • Formula: \(\text{Recall} = \frac{TP}{TP + FN}\)
  • High recall = fewer false negatives
  • In example: \(\text{Recall} = \frac{2064}{2064 + 2247} = \frac{2064}{4311} \approx 0.479\)

Precision vs. Recall

[Figure: diagram illustrating precision and recall. Source: Wikipedia]

Trade-off: Precision vs. Recall

  • Improving one often reduces the other
  • For example, “predicting” every case as “positive” would lead to perfect recall (no false negatives)
  • Example scenarios:
    • Medical diagnosis (High recall is crucial)
    • Spam detection (High precision is preferred)

F1 Score

  • Harmonic mean of precision and recall
  • Formula: \(\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
  • Balances precision and recall for overall performance
  • In example: \(\text{F1 Score} = 2 \times \frac{0.555 \times 0.479}{0.555 + 0.479} \approx 0.514\)
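All three metrics can be computed directly from the cells of the confusion matrix above, treating “ideal” as the positive class (a base-R sketch):

```r
# cells of the confusion matrix above ("ideal" = positive class)
tp <- 2064  # predicted ideal, truly ideal
fp <- 1657  # predicted ideal, truly not ideal
fn <- 2247  # predicted not ideal, truly ideal

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision    recall        f1
#     0.555     0.479     0.514
```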

Tutorial 05: Exercise 3.f)-g).

Step 7: Apply to New Data

  • Now that we (hopefully) know the classifier performs well, we can use it to predict values in the large, unlabeled dataset
  • Skipped here because there is no larger, unlabeled dataset
# not run: new_data would be the unlabeled dataset
new_data_predicted <- predict(diamonds_fit, new_data)

Cross-validation

What is Cross-Validation?

  • A single train/test split may still over- or underestimate performance just by chance
  • Cross-validation estimates how well the model generalizes to independent datasets
  • Idea: test model performance on multiple “folds” (samples) of the data (split it several times)

Fivefold Cross-Validation

  • Dataset is split into five equal parts (folds)
  • Model is trained on four of these parts and tested on the fifth
  • This process is repeated five times, with each fold used as a test set exactly once

Setting up Fivefold Cross-Validation

set.seed(1337) # ensures reproducibility

cv_folds <- vfold_cv(data = diamonds_train, v = 5)

Fit Model with Cross-Validation

  • Train the model on the training set, then evaluate on the test set
  • Average the results of the five test evaluations to estimate overall model performance
my_metrics <- metric_set(yardstick::f_meas, yardstick::precision, yardstick::recall)

diamonds_5cv <- fit_resamples(diamonds_workflow,
                              resamples = cv_folds,
                              metrics = my_metrics)

collect_metrics(diamonds_5cv)
# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config             
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
1 f_meas    binary     0.506     5 0.00275 Preprocessor1_Model1
2 precision binary     0.549     5 0.00439 Preprocessor1_Model1
3 recall    binary     0.470     5 0.00288 Preprocessor1_Model1

Tutorial 05: Exercise 4.

Choosing A Classifier

Comparing Multiple Classifiers

  • There are plenty of classifiers available
    • Naïve Bayes: Probability of label given feature, independent of all other features
    • Logistic Regression: Probability of label given feature, control for other features
    • Support Vector Machines: Like logistic regression, but optimize hinge loss function
    • Random Forests, K-Nearest Neighbors, …
  • Often, you simply want to choose the best performing one

Comparing Multiple Classifiers

  • For example, let us compare Naïve Bayes and logistic regression
  • First, set up the engines
log_model <- logistic_reg() %>% 
  set_engine("glm") %>%
  set_mode("classification")

nb_model <- naive_Bayes() %>% 
    set_engine("naivebayes") %>%
  set_mode("classification")

Comparing Multiple Classifiers

  • Then, create a workflow set
model_workflow_set <- workflow_set(
  preproc = list(diamonds_recipe),
  models = list(log_model, nb_model)
  )

Comparing Multiple Classifiers

  • Train each model across the 5-fold cross-validation
model_cv_results <- workflow_map(model_workflow_set,
                                 "fit_resamples",
                                 resamples = cv_folds,
                                 metrics = my_metrics,
                                 seed = 1337 # ensures reproducibility
                                 )

Comparing Multiple Classifiers

  • Evaluate each model
autoplot(model_cv_results) + scale_y_continuous(limits = c(.2, .6))

Comparing Multiple Classifiers

  • Choose the best one (in our example: Naïve Bayes)
  • Fit it on the full training data
  • Predict on the test data for further evaluation
best_workflow <- extract_workflow(model_cv_results, id = "recipe_naive_Bayes")

# fit models on full training data
best_fit <- fit(best_workflow, diamonds_train)

# predict on test data
predictions <- predict(best_fit, new_data = diamonds_test)
  • … and, ultimately, use it to predict on a larger, unlabelled dataset

Tutorial 05: Exercise 5.

Other Classifiers Explained

K-Nearest Neighbors

  • Classifies a new data point based on its “k” closest points in the training data
  • Assumes that similar data points tend to belong to the same class

K-Nearest Neighbors

  1. Select the number of neighbors to consider (k)
  2. For a new data point, the model finds the “k” nearest neighbors using a distance measure (e.g., Euclidean distance)
  3. Assigns the new point to the most common class among its “k” neighbors
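The three steps above can be sketched in a few lines of base R (an illustrative toy implementation with made-up points, not the engine tidymodels would use):

```r
# toy k-nearest-neighbors classifier (illustration only)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training point
  diffs <- sweep(train_x, 2, new_x)    # subtract new_x from each row
  d <- sqrt(rowSums(diffs^2))
  neighbors <- train_y[order(d)[1:k]]  # labels of the k closest points
  names(which.max(table(neighbors)))   # majority vote
}

train_x <- rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
train_y <- c("a", "a", "b", "b")

knn_predict(train_x, train_y, new_x = c(5.5, 5), k = 3)  # "b"
```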

Decision Trees

  • Flowchart-like structure where each decision node splits the data based on feature values
  • Divides the data into smaller groups until each final group (or “leaf”) represents a class or value

Decision Trees

  1. Tree begins with a root node that represents the whole dataset
  2. At each node, the tree selects a feature and a threshold to best split the data into groups (based on measures like Gini impurity or information gain)
  3. The tree continues splitting until it reaches pure or mostly pure nodes (where most or all data points in a node belong to the same class)
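Gini impurity, one of the split criteria mentioned above, is easy to compute by hand (the class counts below are hypothetical, for illustration only):

```r
# Gini impurity of a node given its class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 50))   # 0.5: maximally mixed two-class node
gini(c(90, 10))   # 0.18: fairly pure node
gini(c(100, 0))   # 0: perfectly pure node

# a split is chosen to minimize the size-weighted impurity of the child nodes
left  <- c(40, 5)    # 45 cases
right <- c(10, 45)   # 55 cases
(45 * gini(left) + 55 * gini(right)) / 100  # ~0.25, down from 0.5 in the parent
```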

Random Forest

  • Improve on individual decision trees by creating an ensemble of many trees, each built with a random subset of data and features
  • The forest combines predictions from each tree to make a more reliable decision

Random Forest

  1. Many decision trees, each trained on a different random sample of the data with a random subset of features
  2. Aggregate Results: When making predictions, each tree “votes” on the class. The final result is based on the majority vote or the average of all trees.
  3. By using many trees, the random forest balances out individual tree biases, making the model more accurate and less likely to overfit

Support Vector Machines

  • Find the “best boundary” (or hyperplane) that separates different classes in a dataset
  • This boundary maximizes the distance between the closest data points from each class

Support Vector Machines

  1. Focuses on the data points closest to the boundary (support vectors)
  2. Places the hyperplane to maximize the margin between classes, improving its ability to generalize to new data
  3. Once the boundary is set, new points are classified based on which side of the hyperplane they fall on