MLE - Data 3

Fall 2024 Methods
Professor: Andy Philips
Published August 26, 2024

Syllabus

Topic 1 - Introduction to Probability Models

Required Reading:

King Chapters 1 & 2

Chapter 1:

  • Introduction to the book. Political science methodology is disjointed and not as coherent as it should (and needs) to be. King seeks to organize and unify political science methodology.

  • Statistical Model: a formal representation of the process by which a social system produces output.

    • Since no interesting social systems generate outcomes deterministically, statistical models are assumed to have both systematic and stochastic components.
  • Inference: the general process by which one uses observed data to learn about the social system and its outputs.

  • Estimation: the specific procedure by which one obtains estimates of features (usually parameters) of the statistical model.

  • The important question for political science research:

    • What is the underlying process that gives rise to the observed data?

      • What are the characteristics of the social system that produced these data?

      • What changes in known features of the social system might have produced data with different characteristics?

      • What is the specific stochastic process driving one’s results?

    • By posing these questions, statistical modelling will be more theoretically relevant and empirically fruitful.

Chapter 2:

  • Conditional probabilities: describing the uncertainty of an observed or hypothetical event given a set of assumptions about the world.

  • Likelihood: a measure of relative uncertainty

Ward and Ahlquist Chapter 1:

Lecture:

Week 1 Slides

  • Summary: Run models that find parameters that are most likely to have generated the observed data.

  • These models are hard to interpret.

  • Goal: Familiarize you with a variety of MLE models used in the social sciences.

  • Probability has to sum to 1.

    • We want to find the best estimate of \(\theta\), which we call \(\hat{\theta}\).
  • Probabilities are:

    • Bounded between 0 and 1.

    • Sum of probabilities equals 1.

    • Trials -> \(\infty\)

    • Outcomes are mutually exclusive; trials are independent.

  • Theta is the only parameter we need to estimate.

  • We are still specifying the distribution of the outcome variable. Is it Poisson, Bernoulli, normal, etc.?

    • this will help us specify which model to use.
  • L stands for “likelihood function”

  • Our goal is to select the \(\theta^*\) that becomes our estimate \(\hat{\theta}\), so as to maximize the likelihood of these data being generated. Ways to do this:

    • plug in candidate \(\theta^*\) values

    • look at the graph

    • optimize the function (solve for \(\theta^*\))

  • No priors! (that would be Bayesian)

    • for our coin flip example, we know .5 is the probability but we only have a set of {H,H,T}

    • Without any more knowledge, the best estimate of \(\theta\) is 2/3 (≈ .67).

  • We use ML anytime our dependent variable has a distribution that was not generated by a Gaussian (normal) process.

    • see slide 23 for examples.

    • We can estimate all of these using OLS, but we may hit a few snags and violate assumptions.

  • Working through an ML problem is as follows:

    • Build a parametric statistical model

    • Define the probability density for \(Y_i\) (uncertainty component)

    • Define the systematic component (\(\theta\))

    • Maximize the likelihood function, given the data.

    • Interpret

  • We will pretty much always use log-likelihood.

    • why?

    • logarithms turn multiplication problems into addition problems.

    • the likelihood starts to break down around N = 1000; the log-likelihood does not. Why?

      • Our optimization function is multiplying probabilities

        • what happens when we multiply a bunch of probabilities?

          • Multiplying thousands of probabilities together is simply not a viable approach without infinite precision.

Example:

# compare what happens when we increase n.
# small sample: 2 successes and 3 failures
p <- seq(0, 1, by = 0.01)
l <- (p^2) * (1 - p)^3
plot(l ~ p, type = "l")

p[which(l == max(l))]   # the p that maximizes the likelihood
[1] 0.4

# larger sample: 40 successes and 20 failures
p <- seq(0, 1, by = 0.01)
l <- (p^40) * (1 - p)^20
plot(l ~ p, type = "l")

p[which(l == max(l))]
[1] 0.67
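
A quick sketch of the underflow point (not from the slides; the Bernoulli data are simulated and n is arbitrary): the raw likelihood, a product of thousands of probabilities, collapses to zero, while the log-likelihood stays on a usable scale and still points at the right \(\hat{\theta}\).

set.seed(123)
n <- 5000
y <- rbinom(n, size = 1, prob = 0.4)       # simulated Bernoulli data; true theta = 0.4

theta  <- seq(0.01, 0.99, by = 0.01)
# raw likelihood: a product of n probabilities underflows to zero
lik    <- sapply(theta, function(t) prod(ifelse(y == 1, t, 1 - t)))
# log-likelihood: a sum of logs stays on a usable scale
loglik <- sapply(theta, function(t) sum(ifelse(y == 1, log(t), log(1 - t))))

range(lik)                 # essentially all zeros - nothing left to maximize
theta[which.max(loglik)]   # close to the true 0.4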

Homework 1:

  • “Lab 1.R”

Probability v. Likelihood:

  • Probability = we know which universe we are in, and the probabilities of all events in that one universe add up to 1.

    • area under a fixed distribution

  • Likelihood = we know what we observed, and we consider the probability of what we observed in any possible universe.

Topic 2: Estimation: Looking Under the Hood

  • \(L(\theta \mid y_i) = \Pr(y_i \mid \theta)\)

  • you have to pick which distribution generated y.

    • assume a probability model.
  • Remember:

    • Traditional probability is a measure of absolute uncertainty. It comes from three axioms:

      • See slide 5.
    • However, the likelihood is only a relative measure of uncertainty.

    • Likelihood model is never absolutely true. It is assumed. We always have to assume a probability model.

    • Therefore, we assume that information about \(\theta\) comes from

      • the data

      • assumption about the DGP (assumed probability distribution)

  • Important to assume outcomes are independent.

  • Pick a theta and figure out the probability of the observed outcomes under it.

    • the theta that gives a higher likelihood is better.
  • What happens when we multiply a bunch of probabilities together?

    • they get really small

      • so we use logs.

        • what happens when we take natural logs of probabilities?

          • we get negative numbers - and they become more negative with more observations.

            • take the value closest to zero

              • we are maximizing a negative number (making it less negative).

Under the Hood of ML:

  • We find the point where the derivative of the joint (log-)likelihood equals zero, i.e., its highest point.

  • Types of optimization methods:

    • Numerical

      • grid search: Give me a bunch of plausible values of theta and evaluate.

        • we will find a global maximum.

        • very slow

        • computationally becomes crazy very quickly.

    • Iterative

      • this is the “default” one - everyone does this.

        • these are optimization methods of “steepest ascent” or “gradient ascent” since the algorithm ‘crawls’ up the surface of the likelihood.
    • Others…
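
A minimal sketch of the two approaches on a one-parameter Bernoulli log-likelihood (the data vector and starting value are invented for illustration): a grid search evaluates many candidate values of theta, while optim() climbs the surface iteratively.

y <- c(1, 1, 0, 1, 0, 0, 1, 1, 1, 0)
loglik <- function(theta) sum(dbinom(y, size = 1, prob = theta, log = TRUE))

# numerical grid search: evaluate a bunch of plausible thetas
grid <- seq(0.01, 0.99, by = 0.001)
grid[which.max(sapply(grid, loglik))]

# iterative optimization: let the algorithm crawl up the surface
opt <- optim(par = 0.5, fn = loglik, method = "Brent",
             lower = 0.001, upper = 0.999,
             control = list(fnscale = -1))   # fnscale = -1 turns optim into a maximizer
opt$par   # both approaches land on 0.6, the sample proportion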

Measures of Uncertainty:

  • We have discussed how to obtain the MLE, \(\hat{\theta}\). Yet it is an estimate.

  • uncertainty is measured by the curvature of the log-likelihood around its maximum.

  • standard errors are derived from the negative of the inverse of the second derivative.

    • standard errors can’t be negative

    • bigger values imply smaller variance.

    • bigger negative = more curvature. see equation 8/9 on slides.

    • we take the inverse since larger (more negative) values indicate a sharper curvature, which means more certainty in our estimate.

  • We use the Hessian for standard errors in MLE.

    • Variance = \(-[\textbf{H}^{-1}]\)

    • SE: \(\sqrt{-[\textbf{H}^{-1}]}\)
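
A sketch of turning that curvature into a standard error, reusing the invented Bernoulli data from the sketch above: optim() minimizes by default, so passing the negative log-likelihood means the Hessian it returns is already \(-\textbf{H}\), and its inverse is the variance.

y <- c(1, 1, 0, 1, 0, 0, 1, 1, 1, 0)
negloglik <- function(theta) -sum(dbinom(y, size = 1, prob = theta, log = TRUE))

opt <- optim(par = 0.5, fn = negloglik, method = "L-BFGS-B",
             lower = 0.001, upper = 0.999, hessian = TRUE)

opt$par                    # MLE: 0.6
vcv <- solve(opt$hessian)  # opt$hessian is -H, so this is -[H^-1]
sqrt(diag(vcv))            # standard error of theta-hat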

Properties of ML:

  • Sufficiency: there exists a single statistic (of which the MLE is a function) that contains all of the information in the data about \(\theta\)

  • Consistency

    • \(\hat{\theta} \rightarrow \theta\) as \(n \rightarrow \infty\)
  • Asymptotic normality

  • Invariance: ML estimate is invariant to functional transformations.

  • Efficiency: MLE has the smallest variance (asymptotically), as given by the Cramer-Rao Lower Bound

Disadvantages of ML:

  • small-sample issues: since ML is only asymptotically normal, use z- rather than t-statistics.

  • We know the variance-covariance matrix (VCV) is biased in small samples

  • (not a disadvantage) but most MLE models use z rather than t-stat.

  • Have to make distributional assumptions. We must characterize the nature of the statistical experiment.

  • some regularity conditions must be met.

Information Criteria:

  • Provides goodness-of-fit with penalization for model complexity

    • in spirit, like an adjusted \(R^2\): goodness-of-fit plus a penalty for complexity
  • Used for feature (i.e., covariate) selection

  • Relative, not absolute.

  • Data-dependent (sample-dependent, just like the log-likelihood): models being compared must be fit to identical values of Y.

  • No hypothesis test

    • no p-value.
  • Akaike Information criterion (AIC)

    • \(AIC = 2k - 2\ln(L)\)

    • Lower AIC is preferred model.

  • Schwarz Bayesian information criterion (SBIC)

    • \(SBIC = \ln(n)k - 2\ln(L)\)

    • Lower SBIC is preferred model

    • stronger penalty for overfitting than AIC. Penalty derived from “prior” information.

  • Restricted model: fewer parameters.

  • Unrestricted model: all parameters.

  • the likelihood ratio test basically tells you whether there is a statistically significant difference between two models

    • complex v. simple

      • if the complex model doesn’t do much better than the simple model, the ratio can help you decide how to change your model specification (the complex model may be overkill).
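
A sketch tying these pieces together with invented data and variable names (y, x1, x2): compare a restricted and an unrestricted logit using AIC, BIC, and a likelihood ratio test.

set.seed(42)
x1 <- rnorm(200); x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(0.5 + 1.0 * x1))          # x2 is irrelevant by construction

restricted   <- glm(y ~ x1,      family = binomial)   # fewer parameters
unrestricted <- glm(y ~ x1 + x2, family = binomial)   # all parameters

AIC(restricted); AIC(unrestricted)   # lower is preferred
BIC(restricted); BIC(unrestricted)   # BIC penalizes complexity more heavily

# LR test: twice the gap in log-likelihoods, chi-squared with df = number of restrictions
lr <- 2 * as.numeric(logLik(unrestricted) - logLik(restricted))
pchisq(lr, df = 1, lower.tail = FALSE)   # large p-value: the simple model is enough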

Lagrange Multiplier Test:

  • Lots of tests for additional “nuisance” parameters

  • Reject H_0: freeing the restricted parameter sufficiently improves model fit; it should be unrestricted

  • Fail to reject H_0: the restriction does not hurt fit; keep the restricted model

Wald Test:

  • Similar to the LR test,

  • but only estimate the unrestricted model.

  • if the MLE and the restriction are quite different, W becomes large.

    • the uncertainty of the coefficients matters.
  • reject: parameters sufficiently different from the restriction

    • use the unrestricted model.

The 3 Likelihood Tests:

  • LR-Test requires estimating two models; may be computationally intensive

  • LM-Test requires estimating only the restricted model. Yet finding the MLE of a constrained model is sometimes difficult. Some LM derivations get around this.

  • Wald requires estimating only an unrestricted model. Can also test non-linear restrictions.

  • All are asymptotically equivalent

  • In small samples, LM is most conservative, then LR, then Wald.

    • small sample - use LM test.
  • When doing these in code just keep track of which is restricted and unrestricted.

Topic 3: Generalized Linear Models

  • GLMs are a generalized version of linear regression.

  • basically just a bunch of different link functions

    • we will call the link function g().
  • GLMs are linear in parameters.

  • Basic order:

    • figure out what DV is

      • pick a distribution (based on DV)

        • follow the guide basically.
  • Begin by specifying the random (stochastic) component.

  • Normal:

    • mean, median, and mode all occur at \(\mu\)

    • The distribution is symmetric about \(\mu\) (eliminates any random variable that is skewed)

    • Distribution is continuous on the real-number line (eliminates any discrete random variable or bounded random variable)

    • central limit theorem

    • mean and variance are separable

      • normal distribution is the only one that does this.
    • Most distributions are not normal!

  • MLE uses z-statistics because it is asymptotically normal.

  • \(\sigma^2\) is not usually reported. It is an ancillary parameter.

  • use likelihood-ratio tests to compare model specifications
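
A sketch of the basic order described above, with invented data: pick the distribution from the DV, then hand glm() the matching family and its link function g().

set.seed(7)
x <- rnorm(300)
counts <- rpois(300, lambda = exp(0.2 + 0.6 * x))   # a count DV
binary <- rbinom(300, 1, plogis(-0.5 + 1.2 * x))    # a binary DV

# stochastic component: Poisson; link g() is the log
m_pois  <- glm(counts ~ x, family = poisson(link = "log"))
# stochastic component: Bernoulli; link g() is the logit
m_logit <- glm(binary ~ x, family = binomial(link = "logit"))

summary(m_pois)   # note the z- (not t-) statistics, as discussed above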

Topic 4:

  • \(\rho\) is the correlation coefficient between \(\epsilon_t\) and its prior value \(\epsilon_{t-1}\).

  • when you take the lag of a series, the first observation goes away (see the sketch below).
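
Both points can be seen in a small sketch with invented AR(1) data: estimate \(\rho\) by regressing the residuals on their own lag, and note that lagging costs the first observation.

set.seed(1)
n <- 100
e <- as.numeric(arima.sim(list(ar = 0.6), n = n))   # serially correlated errors, rho = 0.6
x <- rnorm(n)
y <- 1 + 2 * x + e

res     <- resid(lm(y ~ x))
res_lag <- c(NA, res[-n])              # the lag: the first observation becomes NA
coef(lm(res ~ res_lag))["res_lag"]     # rho-hat, roughly 0.6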

Limited Dependent Variables

Censoring changes the shape of the distribution; truncation does not change the distribution over the unaffected range.

Both cause bias (attenuation) and inconsistency. More data does not help us here!

  • Censoring

    • Censoring is a symptom of our measurement

    • the observation is in the sample (just recorded as a discrete value)

    • “values in a certain range are all transformed to (or reported as) a single value”

      • income in surveys often censored ($250,000 or more) since there are so few individuals that would comprise these categories

      • ex: a survey item where a response of 1 means strong Democrat; 2, 3, or 4 weak affiliation/independent; and 5 strong Republican.

    • This results in lumping/bunching near the censoring point \(\tau\)

    • Estimates are biased (attenuated) since observations farthest from the center of the distribution are restricted to some arbitrary upper (or lower) limit (a simulation sketch appears at the end of this section)

    • Three types:

      • left (lower) censor

        • anyone with incomes below $20k coded as < $20k
      • right (upper) censor

        • Anyone with incomes above $250k coded as > $250k
      • Interval censor.

    • “We have an observation in the region somewhere…but we don’t know exactly what the true value is.”

    • Dealing with Censoring

      • censored data can sometimes be thought of as two distributions, one discrete and one continuous.
  • Truncation

    • truncation effects arise when one attempts to make inferences about a larger population from a sample that is drawn from a distinct sub-population.

    • Theory will tell you where the truncation is. To fix truncation, you have to know you have truncation.

    • Truncation is a symptom of our sampling

    • Observation? what observation?

    • produces bias

      • shrinks estimates towards zero.
    • moves mean away from tau

    • shrinks variance too

    • sample selection is a form of truncation

    • The truncation is in my Y (the outcome itself).

    • Examples: data on GDP of countries from the World Bank (excludes those that are too poor to report data from their statistical agency)

      • data on car damage from insurer claims (excludes any damage that was below deductible)
    • Dealing with Truncation:

      • truncated normal distributions are not full probability distributions since the area under the curve (the CDF) does not sum to one

      • thus we cannot form the likelihood function as usual.

  • Sample Selection Bias

    • a type of truncation

    • nonrandom sampling of observations.

    • incidental truncation (truncation caused by some other variable, not y itself)

    • What’s the population we are generalizing to?

  • Selecting on the DV

    • sample selection bias is not the same as selecting on the DV.

    • Sampling on DV means deliberately choosing certain y outcomes.

  • Discrete outcomes
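
Returning to censoring, a quick simulation (all numbers invented) makes the attenuation point flagged above concrete: right-censor a latent outcome at \(\tau\) and the OLS slope shrinks toward zero.

set.seed(99)
n     <- 1000
x     <- rnorm(n)
ystar <- 1 + 2 * x + rnorm(n)   # latent, uncensored outcome
tau   <- 2
y     <- pmin(ystar, tau)       # everything above tau is reported as tau (bunching at tau)

coef(lm(ystar ~ x))["x"]        # recovers the true slope of 2
coef(lm(y ~ x))["x"]            # attenuated: biased toward zero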

Topic 5: Binary Choice Models

Dichotomous dependent variables

  • We can estimate \(\pi_i\) using OLS.

    • \(\pi_i = X_i\beta\) using OLS

    • benefits:

      • linear interpretation of betas

      • simple…much faster since using OLS

      • Works well if \(X_i\) is also distributed Bernoulli

    • Drawbacks:

      • “impossible” predictions: predicted probabilities can fall below 0 or exceed 1 (see the sketch after this list).

      • censoring issue.

    • error not normally distributed.

    • error will not have constant variance.

  • The LPM is not ideal, but it’s not terrible. You can run it sometimes. What we want is

    • \(\pi_i=g(X_i\beta)\)

    • we can first express \(\pi_i\) as odds

      • the odds approach infinity as \(\pi_i\) approaches 1

      • and approach zero as \(\pi_i\) approaches 0
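
A sketch of the “impossible predictions” drawback noted above, with invented data: fit an LPM to a binary outcome generated by a logistic process and inspect the fitted values.

set.seed(3)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.2 + 2 * x))

lpm   <- lm(y ~ x)              # linear probability model
p_hat <- fitted(lpm)
range(p_hat)                    # some fitted "probabilities" fall outside [0, 1]
mean(p_hat < 0 | p_hat > 1)     # share of impossible predictions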

Logit (log-odds)

\(\text{logit}(\pi_i) = \ln\left(\frac{\pi_i}{1-\pi_i}\right)\); it is unbounded: \(-\infty < \text{logit}(\pi_i) < \infty\)

  • creates a sigmoid curve.

    • it is a cumulative distribution function.
  • the inflection point is near zero (at \(X_i\beta = 0\), where \(\pi_i = 0.5\))

  • we are solving for \(\pi_i\)

  • GO THROUGH SLIDE 17!

  • constant shifts the inflection point in logit.

  • big coefficients should have steeper slopes.

    • if the coefficient is smaller, we get a more stretched-out curve (see the sketch below).

      • slide 18.
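
A quick sketch of those two points with invented coefficient values:

x <- seq(-6, 6, by = 0.1)
plot(x,  plogis(0 + 1 * x), type = "l", ylab = "Pr(y = 1)")   # baseline: constant 0, slope 1
lines(x, plogis(2 + 1 * x),   lty = 2)    # bigger constant: inflection point shifts left
lines(x, plogis(0 + 0.4 * x), lty = 3)    # smaller coefficient: more stretched-out curve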

Probit

  • z-score usage

  • different link function

  • coefficients in a probit model show the increase/decrease in the z-score in response to a change in \(x_{ik}\)

  • probit usually steeper but not always

  • Logit more common in poli sci.

    • probably because the coefficients are bigger
  • Do not compare logit and probit coefficients directly - they are on different scales.

Latent Variable

  • think of \(X_i\beta\) as an unbounded latent variable

  • slide 28.

Measures of Fit: \(R^2\)

  • There are several pseudo \(R^2\) measures used for the fit of logit/probit models.

  • kinda useless. They aren’t exactly R^2.

  • no need to report really.

  • AIC/BIC more important.

Interpretation:

  • Signs matter

  • magnitude less so…

  • Interpreting a single \(\beta\) can be done, but be careful about predictions, as log-odds are not a change in Pr(y = 1)

  • Example: predict Pr(Farm Laborer)

  • An odds ratio of 1 means an increase in x does not make Pr(y_i = 1) more or less likely

  • an odds ratio < 1 means an increase in x makes y_i = 1 less likely

  • we can’t speak in terms of probability, only in terms of how much more likely you are to be a farm laborer.

  • ODDS RATIO SHOWN (this is what we should say when we report this).

  • predictions always depend on the value of other covariates. see slide 37.
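
A sketch of reporting odds ratios in code (model and data invented): exponentiate the logit coefficients and their confidence interval.

set.seed(3)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.2 + 2 * x))
m <- glm(y ~ x, family = binomial)

exp(coef(m))              # odds ratios: > 1 raises the odds of y = 1, < 1 lowers them
exp(confint.default(m))   # Wald confidence intervals on the odds-ratio scale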

First Differences (LPM)

  • How much does Y change given a change in X.

    • holding other X’s at some interesting value

First Differences (Logit/Probit)

  • First differences in the logit/probit context do not have all the properties listed previously

  • first-differences for logit are given by: see slide 40.

Read Hanmer and Kalkan

  • critique of expected values/expected change: “we are not aware of any theories that are specifically concerned with 48-year-old white women who are independent politically and have an income of $40k - $45k.”

  • Instead of setting X, we keep all X_i’s at their observed value for each observation, and fix our variable of interest to some value, giving us an expected value for each observation.
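
A sketch of that observed-value approach with an invented model and data: keep every observation at its actual covariate values, fix only the variable of interest, and average the predicted probabilities.

set.seed(11)
dat   <- data.frame(x = rnorm(500), z = rnorm(500))
dat$y <- rbinom(500, 1, plogis(-0.3 + 1.5 * dat$x + 0.5 * dat$z))
m     <- glm(y ~ x + z, family = binomial, data = dat)

d_lo <- transform(dat, x = 0)    # fix the variable of interest at a low value
d_hi <- transform(dat, x = 1)    # and at a high value; all other covariates stay as observed
mean(predict(m, newdata = d_hi, type = "response")) -
  mean(predict(m, newdata = d_lo, type = "response"))   # average first difference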

Stochastic Simulation

  • Growing use of simulation techniques designed to show statistical and substantive significance of the results.

  • Typically used to make predictions of Y.
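
A sketch of one common simulation approach (not necessarily the exact procedure from the slides), reusing the invented model m from the sketch above: draw coefficient vectors from a multivariate normal centered on the estimates and convert each draw into a predicted probability.

library(MASS)
set.seed(12)
sims  <- mvrnorm(1000, mu = coef(m), Sigma = vcov(m))   # simulated coefficient vectors
x_row <- c(1, 1, 0)                  # a chosen scenario: intercept, x = 1, z = 0
p_sim <- plogis(sims %*% x_row)      # Pr(y = 1) implied by each draw
quantile(p_sim, c(0.025, 0.5, 0.975))   # prediction with simulation-based uncertainty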

Example: Flores-Macias and Kreps (2013)

  • When do states adopt war taxes to finance the cost of war?
