Forschungspraktikum 1+2: Computational Social Science

Session 05: Machine Learning

Dr. Christian Czymara

Agenda

  • Introduction to supervised machine learning
  • Predicting with linear regression
  • Predicting with machine learning
    • Applying the tidymodels framework
    • Validation
    • Cross-validation
    • Choosing a classifier
  • Tutorial: Predicting internet use in survey data

Supervised vs. Unsupervised Learning

  • Distinction between supervised and unsupervised machine learning (SML vs UML)
  • Types of problems suited for each:
    • SML: Classification and regression
    • UML: Clustering and association (more in the next sessions)

Supervised Machine Learning

  • Predicting a known variable using labeled data
  • For example: Predicting gender from Twitter biographies
    • Manually label some data points (e.g., identify gender for certain profiles)
    • Train model to classify based on labeled examples
    • Use model to predict gender for unlabeled profiles

Unsupervised Machine Learning

  • Analyzing data without labeled examples
  • Clustering and associations
  • Topic Modeling: Extraction of “topics” from textual data without prior labels

Goal of Supervised Machine Learning

  • Objective: Estimate model from data
  • … and use it to predict outcomes for new cases
  • Ideal Scenarios:
    • Large Labeled Dataset: Numerous examples to classify
    • Partial Labeling: A random subset of the dataset is labeled
    • These labeled examples help the model learn to classify new, unlabeled data points

Predicting data

diamonds data set

We’ll use the diamonds dataset for this example

library(ggplot2)

data(diamonds)

nrow(diamonds)
[1] 53940
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Prediction with Linear Regression

Linear Regression

lin_mod <- lm(formula = "price ~ carat + depth",
              data = diamonds)

summary(lin_mod)

Call:
lm(formula = "price ~ carat + depth", data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-18238.9   -801.6    -19.6    546.3  12683.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4045.333    286.205   14.13   <2e-16 ***
carat       7765.141     14.009  554.28   <2e-16 ***
depth       -102.165      4.635  -22.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1542 on 53937 degrees of freedom
Multiple R-squared:  0.8507,    Adjusted R-squared:  0.8507 
F-statistic: 1.536e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

Prediction with Linear Regression

equatiomatic::extract_eq(lin_mod, use_coefs = TRUE)

\[ \operatorname{\widehat{price}} = 4045.33 + 7765.14(\operatorname{carat}) - 102.17(\operatorname{depth}) \]

carat <- c(.5, 3)

depth  <- c(65, 72)
newdata <- data.frame(carat, depth)

predict(lin_mod, newdata)
        1         2 
 1287.158 19984.852 
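As a sanity check, the first predicted value can be reproduced by plugging the new case into the fitted equation by hand (a minimal sketch that refits the same lm() model as above):

```r
# refit the model from above and extract its coefficients
library(ggplot2)

data(diamonds)
lin_mod <- lm(price ~ carat + depth, data = diamonds)
b <- coef(lin_mod)  # (Intercept), carat, depth

# first new case: carat = 0.5, depth = 65
by_hand <- b["(Intercept)"] + b["carat"] * 0.5 + b["depth"] * 65

unname(by_hand)  # ~1287.16, matching predict() above
```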

Prediction with Linear Regression

  • Shift focus from interpretation of coefficients to the prediction of the dependent variable for new, unknown cases
  • Step 1: Estimate model
  • Step 2: Apply model to data
  • Problems:
    • Unrealistic predictions
    • Model assumptions may be violated
    • Often interested in predicting categories (“classification”)

Terminology in Machine Learning and Statistics

  Machine Learning Term   Statistical Term
  ---------------------   -------------------------------
  Feature                 Predictor/Independent Variable
  Label                   Outcome/Dependent Variable
  Training Set            Sample/Data Used for Estimation
  Prediction              Estimate
  Accuracy                Proportion Explained/Model Fit

Prediction with Machine Learning

Naïve Bayes

\[P(\text{Class} | \text{Data}) = \frac{P(\text{Data} | \text{Class}) \times P(\text{Class})}{P(\text{Data})}\]

  • \(P(\text{Class} | \text{Data})\): Probability of the class (e.g., spam or not spam), given the data
  • \(P(\text{Data} | \text{Class})\): Probability of observing the data, given the class (the likelihood)
  • \(P(\text{Class})\): Fraction of all cases in this class (e.g., spam)
  • The formula lets us choose the class with the highest probability given the data

Key Idea: Independence Assumption

  • Naïve Bayes assumes that each feature is independent of others given the class
  • This makes calculations simple:

\[P(\text{Data} | \text{Class}) = P(\text{Feature 1} | \text{Class}) \times P(\text{Feature 2} | \text{Class}) \times \dots\]
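The rule above can be illustrated with a toy spam example (the rates below are hypothetical, chosen only for illustration):

```r
# Hypothetical base rates: 40% of emails are spam,
# and the word "free" appears in 60% of spam and 5% of non-spam mails
p_spam <- 0.4
p_ham  <- 0.6
p_free_given_spam <- 0.60
p_free_given_ham  <- 0.05

# Denominator P(Data): total probability of seeing "free"
p_free <- p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' rule: P(spam | "free")
p_spam_given_free <- p_free_given_spam * p_spam / p_free

round(p_spam_given_free, 3)  # 0.889: "free" makes spam the more probable class
```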

Applying Supervised Machine Learning using tidymodels

Tidymodels

  • Tidymodels is a collection of packages for machine learning in R, following the tidyverse principles
  • Consistent syntax and modular design for:
    • Data preparation
    • Model building
    • Evaluation and tuning
# install.packages("tidymodels")

library(tidymodels)

Tidymodels: Seven Steps

  1. Prepare and split data
  2. Define a model and choose a classifier
  3. Create a recipe
  4. Create a workflow
  5. Fit model to training data (train classifier)
  6. Evaluate the model on the test data
  7. Apply the model to the unlabeled data

Example: Predicting Ideal Cut Quality from Diamonds Dataset

  • Goal: Predict the cut quality of diamonds based on other attributes (carat, color, clarity)
  • Thus, we will always need a labeled/annotated/coded data set
  • In this case, we can use an 80% random sample as training data (the data has full information anyway)
  • We use a binary version of the cut quality: “ideal” vs. “not ideal”

Step 1: Prepare Data

  • Randomly split the data into training and test set
    • Training: For model building
    • Test: For model evaluation
    • Stratified by the dependent variable
diamonds$ideal <- as.factor(ifelse(diamonds$cut == "Ideal", "ideal", "not ideal"))

levels(diamonds$ideal)
[1] "ideal"     "not ideal"
set.seed(1337) # ensures reproducibility
diamonds_split <- initial_split(diamonds, prop = 0.8, strata = ideal)

diamonds_train <- training(diamonds_split)

diamonds_test <- testing(diamonds_split)

Tutorial 05: Exercises 1.-2.

Step 2: Define a Model

  • We will use a Naïve Bayes model for classification
library(naivebayes)
library(discrim)

nb_model <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

nb_model
Naive Bayes Model Specification (classification)

Computational engine: naivebayes 

Step 3: Create a Recipe for Preprocessing

  • Defines the model formula (outcome and features) and any data preprocessing steps
diamonds_recipe <- recipe(ideal ~ carat + color + clarity, data = diamonds_train)

Step 4: Create a Workflow

  • Combine the model specification and recipe in a workflow
diamonds_workflow <- workflow() %>%
  add_model(nb_model) %>%
  add_recipe(diamonds_recipe)

diamonds_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: naive_Bayes()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Naive Bayes Model Specification (classification)

Computational engine: naivebayes 

Step 5: Fit the Model

  • Fit the model to the training data
  • Put differently, we train the model using the training dataset
diamonds_fit <- fit(diamonds_workflow, data = diamonds_train)

Tutorial 05: Exercise 3.a)-d).

Step 6: Evaluate the Model

  • How accurate is the classifier?
  • Evaluating it on the training data may show high performance, but this is misleading: the model was fitted to this exact data and may simply be overfit
  • The goal is to assess if it can accurately predict new data
  • To avoid overfitting, test the classifier on the unseen test data

Step 6: Evaluate the Model

  • Predict values in the test data
diamonds_predictions <- predict(diamonds_fit, new_data = diamonds_test)

# add predictions to test data
diamonds_test <- bind_cols(diamonds_test, diamonds_predictions)

diamonds_test[, 11:12]
# A tibble: 10,789 × 2
   ideal     .pred_class
   <fct>     <fct>      
 1 ideal     not ideal  
 2 not ideal not ideal  
 3 not ideal not ideal  
 4 not ideal not ideal  
 5 not ideal not ideal  
 6 not ideal not ideal  
 7 not ideal not ideal  
 8 not ideal not ideal  
 9 not ideal not ideal  
10 ideal     ideal      
# ℹ 10,779 more rows

Step 6: Evaluate the Model

  • Compare the predicted and the observed values using a confusion matrix
conf_mat(data = diamonds_test, truth = ideal, estimate = .pred_class)
           Truth
Prediction  ideal not ideal
  ideal      2064      1657
  not ideal  2247      4821
  • True Positives (TP): Correct positive predictions
  • False Positives (FP): Incorrectly predicted as positive
  • True Negatives (TN): Correct negative predictions
  • False Negatives (FN): Incorrectly predicted as negative

Precision

  • Measures accuracy among predicted positives
  • Formula: \(\text{Precision} = \frac{TP}{TP + FP}\)
  • High precision = fewer false positives
  • In example: \(\text{Precision} = \frac{2064}{2064 + 1657} = \frac{2064}{3721} \approx 0.555\)

Recall

  • Measures completeness of positives captured
  • Formula: \(\text{Recall} = \frac{TP}{TP + FN}\)
  • High recall = fewer false negatives
  • In example: \(\text{Recall} = \frac{2064}{2064 + 2247} = \frac{2064}{4311} \approx 0.479\)

Precision vs. Recall

[Figure: diagram illustrating precision and recall. Source: Wikipedia]

Trade-off: Precision vs. Recall

  • Improving one often reduces the other
  • For example, “predicting” every case as “positive” would lead to perfect recall (no false negatives)
  • Example scenarios:
    • Medical diagnosis (High recall is crucial)
    • Spam detection (High precision is preferred)

F1 Score

  • Harmonic mean of precision and recall
  • Formula: \(\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
  • Balances precision and recall for overall performance
  • In example: \(\text{F1 Score} = 2 \times \frac{0.555 \times 0.479}{0.555 + 0.479} \approx 0.514\)
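All three metrics can be computed directly from the cells of the confusion matrix above, treating “ideal” as the positive class (a base-R sketch):

```r
# cells of the confusion matrix above ("ideal" = positive class)
tp <- 2064  # predicted ideal, truly ideal
fp <- 1657  # predicted ideal, truly not ideal
fn <- 2247  # predicted not ideal, truly ideal

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision    recall        f1
#     0.555     0.479     0.514
```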

Tutorial 05: Exercise 3.f)-g).

Step 7: Apply to New Data

  • Now that we (hopefully) know the classifier performs well, we can use it to predict values in the large, unlabeled dataset
  • Skipped here because there is no larger, unlabeled dataset
# not run: new_data would be the unlabeled dataset
new_data_predicted <- predict(diamonds_fit, new_data)

Cross-validation

What is Cross-Validation?

  • A single train/test split may still over- or underestimate performance just by chance
  • Cross-validation estimates how well the model generalizes to independent datasets
  • Idea: test model performance on multiple “folds” (samples) of the data (split it several times)

Fivefold Cross-Validation

  • Dataset is split into five equal parts (folds)
  • Model is trained on four of these parts and tested on the fifth
  • This process is repeated five times, with each fold used as a test set exactly once

Setting up Fivefold Cross-Validation

set.seed(1337) # ensures reproducibility

cv_folds <- vfold_cv(data = diamonds_train, v = 5)

Fit Model with Cross-Validation

  • Train the model on the training set, then evaluate on the test set
  • Average the results of the five test evaluations to estimate overall model performance
my_metrics <- metric_set(yardstick::f_meas, yardstick::precision, yardstick::recall)

diamonds_5cv <- fit_resamples(diamonds_workflow,
                              resamples = cv_folds,
                              metrics = my_metrics)

collect_metrics(diamonds_5cv)
# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config             
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>               
1 f_meas    binary     0.506     5 0.00275 Preprocessor1_Model1
2 precision binary     0.549     5 0.00439 Preprocessor1_Model1
3 recall    binary     0.470     5 0.00288 Preprocessor1_Model1

Tutorial 05: Exercise 4.

Choosing A Classifier

Comparing Multiple Classifiers

  • There are plenty of classifiers available
    • Naïve Bayes: Probability of label given feature, independent of all other features
    • Logistic Regression: Probability of label given feature, control for other features
    • Support Vector Machines: Like logistic regression, but optimize hinge loss function
    • Random Forests, K-Nearest Neighbors, …
  • Often, you simply want to choose the best performing one

Comparing Multiple Classifiers

  • For example, let us compare Naïve Bayes and logistic regression
  • First, set up the engines
log_model <- logistic_reg() %>% 
  set_engine("glm") %>%
  set_mode("classification")

nb_model <- naive_Bayes() %>% 
    set_engine("naivebayes") %>%
  set_mode("classification")

Comparing Multiple Classifiers

  • Then, create a workflow set
model_workflow_set <- workflow_set(
  preproc = list(diamonds_recipe),
  models = list(log_model, nb_model)
  )

Comparing Multiple Classifiers

  • Train each model across the 5-fold cross-validation
model_cv_results <- workflow_map(model_workflow_set,
                                 "fit_resamples",
                                 resamples = cv_folds,
                                 metrics = my_metrics,
                                 seed = 1337 # ensures reproducibility
                                 )

Comparing Multiple Classifiers

  • Evaluate each model
autoplot(model_cv_results) + scale_y_continuous(limits = c(.2, .6))

Comparing Multiple Classifiers

  • Choose the best one (in our example: Naïve Bayes)
  • Fit it on the full training data
  • Predict on the test data for further evaluation
best_workflow <- extract_workflow(model_cv_results, id = "recipe_naive_Bayes")

# fit models on full training data
best_fit <- fit(best_workflow, diamonds_train)

# predict on test data
predictions <- predict(best_fit, new_data = diamonds_test)
  • … and, ultimately, use it to predict on a larger, unlabelled dataset

Tutorial 05: Exercise 5.

Other Classifiers Explained

K-Nearest Neighbors

  • Classifies a new data point based on its “k” closest points in the training data
  • Assumes that similar data points tend to belong to the same class

K-Nearest Neighbors

  1. Select the number of neighbors to consider (k)
  2. For a new data point, the model finds the “k” nearest neighbors using a distance measure (e.g., Euclidean distance)
  3. Assigns the new point to the most common class among its “k” neighbors
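The three steps above can be sketched in a few lines of base R (an illustrative toy implementation with made-up points, not the engine tidymodels would use):

```r
# toy k-nearest-neighbors classifier (illustration only)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training point
  diffs <- sweep(train_x, 2, new_x)    # subtract new_x from each row
  d <- sqrt(rowSums(diffs^2))
  neighbors <- train_y[order(d)[1:k]]  # labels of the k closest points
  names(which.max(table(neighbors)))   # majority vote
}

train_x <- rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
train_y <- c("a", "a", "b", "b")

knn_predict(train_x, train_y, new_x = c(5.5, 5), k = 3)  # "b"
```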

Decision Trees

  • Flowchart-like structure where each decision node splits the data based on feature values
  • Divides the data into smaller groups until each final group (or “leaf”) represents a class or value

Decision Trees

  1. Tree begins with a root node that represents the whole dataset
  2. At each node, the tree selects a feature and a threshold to best split the data into groups (based on measures like Gini impurity or information gain)
  3. The tree continues splitting until it reaches pure or mostly pure nodes (where most or all data points in a node belong to the same class)
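Gini impurity, one of the split criteria mentioned above, is easy to compute by hand (the class counts below are hypothetical, for illustration only):

```r
# Gini impurity of a node given its class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 50))   # 0.5: maximally mixed two-class node
gini(c(90, 10))   # 0.18: fairly pure node
gini(c(100, 0))   # 0: perfectly pure node

# a split is chosen to minimize the size-weighted impurity of the child nodes
left  <- c(40, 5)    # 45 cases
right <- c(10, 45)   # 55 cases
(45 * gini(left) + 55 * gini(right)) / 100  # ~0.25, down from 0.5 in the parent
```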

Random Forest

  • Improve on individual decision trees by creating an ensemble of many trees, each built with a random subset of data and features
  • The forest combines predictions from each tree to make a more reliable decision

Random Forest

  1. Many decision trees, each trained on a different random sample of the data with a random subset of features
  2. Aggregate Results: When making predictions, each tree “votes” on the class. The final result is based on the majority vote or the average of all trees.
  3. By using many trees, the random forest balances out individual tree biases, making the model more accurate and less likely to overfit

Support Vector Machines

  • Find the “best boundary” (or hyperplane) that separates different classes in a dataset
  • This boundary maximizes the distance between the closest data points from each class

Support Vector Machines

  1. Focuses on the data points closest to the boundary (support vectors)
  2. Places the hyperplane to maximize the margin between classes, improving its ability to generalize to new data
  3. Once the boundary is set, new points are classified based on which side of the hyperplane they fall on