15.1 Overview of Prediction and Classification

When we predict something, we are estimating some unknown using information we do know and trying to do so as accurately and precisely as possible.

Often prediction involves classification: predicting a categorical outcome (e.g., predicting who wins vs. who loses).

Some social science examples of this might include:

- Trying to detect hate speech online
- Trying to flag “fake news” and other misinformation
- Trying to forecast the results of an election
- Trying to classify a large amount of text into subject or topic categories for analysis

Goal: Estimate (guess) some unknown using information we have, and do so as accurately and precisely as possible.

1. Choose an approach:
   - Use an observed (known) measure as a direct proxy to predict an outcome (e.g., a dictionary)
   - Use one or more observed (known) measures in a model (often a regression) to predict an outcome
   - Use a model to automatically select the measures to use for predicting an outcome
2. Assess accuracy and precision in-sample and out-of-sample using some sample “training” data where you do know the right answer (“ground truth”):
   - Prediction error: \(Truth - Prediction\)
   - Bias: the average prediction error, \(\text{mean}(Truth - Prediction)\)
     - A prediction is “unbiased” if the bias is zero (if the prediction is correct on average)
   - In regression: R-squared or root-mean-squared error (RMSE)
     - RMSE, \(\sqrt{\text{mean}((Truth - Prediction)^2)}\), is like “absolute” error: it summarizes the typical magnitude of the prediction error
   - For classification: confusion matrix
     - A cross-tab of predicted categories against true categories, showing which predictions were correct and which were wrong (misclassified)
     - Gives you true positives and true negatives vs. false positives and false negatives
3. Repeat steps 1 and 2 until you are confident in your method for predicting or classifying.
4. Apply to completely unknown data.
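
The accuracy measures in the assessment step can be sketched in a few lines of code. The example below is a minimal illustration in plain Python, not a prescribed implementation: the function names (`bias`, `rmse`, `confusion_matrix`) and the toy vote-share and win/lose data are hypothetical and invented only to show how each quantity is computed from known “ground truth” values and predictions.

```python
from collections import Counter
from math import sqrt


def bias(truth, prediction):
    """Average prediction error: mean(Truth - Prediction)."""
    errors = [t - p for t, p in zip(truth, prediction)]
    return sum(errors) / len(errors)


def rmse(truth, prediction):
    """Root-mean-squared error: sqrt(mean((Truth - Prediction)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(truth, prediction)]
    return sqrt(sum(squared) / len(squared))


def confusion_matrix(truth, prediction, positive):
    """Count true/false positives and negatives for one 'positive' category."""
    counts = Counter()
    for t, p in zip(truth, prediction):
        if p == positive:
            # Predicted positive: correct only if the truth is also positive
            counts["TP" if t == positive else "FP"] += 1
        else:
            # Predicted negative: wrong if the truth was actually positive
            counts["FN" if t == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical toy data (regression case): true vs. predicted vote shares
truth_share = [52.0, 48.0, 55.0, 45.0]
predicted_share = [50.0, 49.0, 53.0, 48.0]

# Hypothetical toy data (classification case): true vs. predicted outcomes
truth_label = ["win", "lose", "win", "lose", "win"]
predicted_label = ["win", "win", "win", "lose", "lose"]

print(bias(truth_share, predicted_share))   # 0.0
print(rmse(truth_share, predicted_share))   # about 2.12
print(confusion_matrix(truth_label, predicted_label, positive="win"))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```

In this toy data the prediction errors are 2, -1, 2, and -3, so they cancel out and the bias is zero, yet the RMSE is about 2.12. A prediction can be unbiased on average and still miss noticeably on individual cases, which is why it helps to look at both measures.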