15.1 Overview of Prediction and Classification

When we predict something, we are estimating some unknown using information we do know and trying to do so as accurately and precisely as possible.

  • Often prediction involves classification: predicting a categorical outcome (e.g., predicting who wins vs. who loses)

Some social science examples of prediction and classification include:

  • Trying to detect hate speech online
  • Trying to flag “fake news” and other misinformation
  • Trying to forecast the results of an election
  • Trying to classify a large amount of text into subject or topic categories for analysis

15.1.1 How to predict or classify

Goal: Estimate some unknown using information we have, and do so as accurately and precisely as possible.

  1. Choose an approach
    • Using an observed (known) measure as a direct proxy to predict an outcome (e.g., a dictionary)
    • Using one or more observed (known) measures in a model (often a regression) to predict an outcome
    • Using a model to automatically select the measures to use for predicting an outcome
  2. Assess accuracy and precision, both in-sample (on the data used to build the prediction) and out-of-sample (on data held back from it), using “training” data for which you know the right answer (the “ground truth”).
    • Prediction error: \(Truth - Prediction\)
    • Bias: Average prediction error: \(\text{mean}(Truth - Prediction)\)
      • A prediction is “unbiased” if the bias is zero (if the prediction is correct on average)
    • In regression: R-squared or root mean squared error (RMSE)
      • RMSE is like “absolute” error: the typical magnitude of the prediction error, \(\sqrt{\text{mean}((Truth - Prediction)^2)}\)
    • For classification: Confusion Matrix
      • A cross-tab of the predicted categories against the true categories, showing which predictions were correct and which were wrong (misclassified)
      • Gives you true positives and true negatives vs. false positives and false negatives
  3. Repeat steps 1 and 2 until you are confident in your method for predicting or classifying.
  4. Apply to completely unknown data.
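
To make the prediction error, bias, and RMSE in step 2 concrete, here is a minimal Python sketch. The function names and the vote-share numbers are hypothetical, chosen only for illustration. The example is built so that the over-predictions and under-predictions cancel out: the bias is zero even though every individual prediction is off, which is why it helps to look at RMSE alongside bias.

```python
import math


def prediction_errors(truth, prediction):
    """Prediction error for each observation: Truth - Prediction."""
    return [t - p for t, p in zip(truth, prediction)]


def bias(truth, prediction):
    """Average prediction error: mean(Truth - Prediction)."""
    errors = prediction_errors(truth, prediction)
    return sum(errors) / len(errors)


def rmse(truth, prediction):
    """Root mean squared error: sqrt(mean((Truth - Prediction)^2))."""
    errors = prediction_errors(truth, prediction)
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical data: true vote shares and our predictions (in percentage points)
truth = [52.0, 48.0, 61.0, 45.0]
prediction = [50.0, 50.0, 58.0, 48.0]

print(prediction_errors(truth, prediction))  # [2.0, -2.0, 3.0, -3.0]
print(bias(truth, prediction))               # 0.0 (errors cancel out on average)
print(round(rmse(truth, prediction), 2))     # 2.55 (typical size of a miss)
```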
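
The difference between in-sample and out-of-sample assessment in step 2 can be sketched with a one-predictor regression. This is only an illustration under stated assumptions: the data are made up, the helper names (`fit_line`, `rmse`) are hypothetical, and the model is a simple least-squares line rather than any particular method from this course. The line is fit on the "training" observations only, then evaluated both on those same observations (in-sample) and on held-out observations it has never seen (out-of-sample).

```python
import math


def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(truth, prediction):
    """Root mean squared error: sqrt(mean((Truth - Prediction)^2))."""
    errors = [t - p for t, p in zip(truth, prediction)]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical data where we know the right answer for every observation
train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
test_x = [6, 7]          # held out: not used to fit the line
test_y = [12.3, 13.6]

intercept, slope = fit_line(train_x, train_y)

in_sample_pred = [intercept + slope * x for x in train_x]
out_of_sample_pred = [intercept + slope * x for x in test_x]

rmse_in = rmse(train_y, in_sample_pred)
rmse_out = rmse(test_y, out_of_sample_pred)

print("In-sample RMSE:    ", round(rmse_in, 3))
print("Out-of-sample RMSE:", round(rmse_out, 3))
```

In this made-up example the out-of-sample RMSE is larger than the in-sample RMSE. That pattern is common, because the model was tuned to the training observations, and it is the reason to check performance on data the model has not seen before applying it to completely unknown data.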
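
For classification, the confusion matrix described in step 2 can be tallied by hand. The sketch below assumes a binary outcome coded 1 (positive, e.g., a post flagged as hate speech) and 0 (negative). The labels are invented and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(truth, prediction, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {
        "true_positive": 0,
        "false_positive": 0,
        "true_negative": 0,
        "false_negative": 0,
    }
    for t, p in zip(truth, prediction):
        if p == positive and t == positive:
            counts["true_positive"] += 1    # predicted positive, truly positive
        elif p == positive and t != positive:
            counts["false_positive"] += 1   # predicted positive, truly negative
        elif p != positive and t != positive:
            counts["true_negative"] += 1    # predicted negative, truly negative
        else:
            counts["false_negative"] += 1   # predicted negative, truly positive
    return counts


# Hypothetical ground truth and predictions for eight observations
truth = [1, 1, 1, 0, 0, 0, 0, 1]
prediction = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(truth, prediction)
print(counts)
# {'true_positive': 3, 'false_positive': 1, 'true_negative': 3, 'false_negative': 1}

# Share of predictions that were correct
accuracy = (counts["true_positive"] + counts["true_negative"]) / len(truth)
print(accuracy)  # 0.75
```

Keeping the four cells separate, rather than reporting only overall accuracy, shows which kind of mistake the classifier makes. A false positive (flagging an innocent post) and a false negative (missing a harmful one) can carry very different costs.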