Fast, approximate estimation of random coefficient logit models
This post (my first blog post ever) is about doing demand estimation, and specifically estimating random coefficient logit models, as quickly as possible. The model originated by Berry, Levinsohn, and Pakes ("BLP") has been an obsession of the industrial organization community for decades now, and has been used in thousands of papers and hand-coded by hundreds of graduate students. And yet, it was only recently that anyone took it upon themselves to build open-source software that could estimate the model reliably (I'm referring to PyBLP by Jeff Gortmaker and Chris Conlon, which is amazing and is maybe the only Python package I don't hate to use). What that means is that, until PyBLP, nearly every paper that used some version of this canonical model required writing the code from scratch, which is both error-prone and a huge waste of time for a scientific field. It's time consuming to write and test your own code, and it can be even more time consuming to run it over and over trying to get the model to "work".
Even PyBLP goes only so far toward improving the status quo. Whereas I think of PyBLP's focus as doing things "right" (and they do this extremely well), I'd be happy even if we could develop packages or processes that let researchers cut a few small corners if it means that we can get acceptable estimates more quickly and more reliably. In my experience, estimating a BLP-style model (even using PyBLP) is a slow process with an annoyingly high failure rate. I've certainly let some BLP code run for hours only to come back to an optimization procedure that is stuck in a corner of the constraint space, or to estimates that are nonsensical due to data processing or coding errors. If we could estimate models that are pretty close to BLP without nearly as much of the lift, the IO field would be far better off in my eyes (though maybe we'd have to brag about something other than the length of time it takes to estimate our models).
This line of thinking has been my side project for years now. I've tried writing (and re-writing many times) alpha packages like NPDemand.jl (a first "real" version is coming soon as part of a paper with Adam Smith) and FRAC.jl, both of which focus on demand estimation techniques that have near-zero failure rates, and which also share an API that abstracts from method-specific details (e.g., "define_problem", "estimate!", and "price_elasticities!"). This post is another step in that direction. I'm going to walk through a method introduced by Fox, Kim, Ryan, and Bajari (2011), another linear regression-based method that I hadn't spent much time thinking about until recently. Then, I'll discuss how a recent paper by Meeker (2021) opened my eyes to the value of FKRB, and I'll offer one recommendation for how FRAC.jl may play nicely with FKRB to give us a nice balance of speed, informative estimates, and low bias. I'll run some brief simulations to see how my idea performs.
Note: My secret agenda in writing this post is to convince some economics PhD students who might come across this to submit some PRs on FRAC.jl and FKRB.jl, to save me some work and get us better statistical programs faster than I can manage as a solo side project. Right now I mostly do this coding on planes, and I don't take enough flights to make a reasonable amount of progress.
Very quick review of mixed logits
Anyone who is reading now probably knows this, but I'll specify my conceptual model just in case. In a mixed logit model, we're making the assumption that each customer makes a choice among a set of available choices (e.g., choosing among products on a menu/shelf) by maximizing their indirect utility function. That utility function for each customer $i$, product $j$, and market $t$, is usually assumed to be of the form:
\[ u_{ijt} = \theta^k X_{jt} + \xi_{jt} + \epsilon_{ijt} \]
Each component here is important. The first term $\theta^k X_{jt}$ captures the impact of observable product characteristics (including prices) on utility. $\theta^k$ is the vector of preferences for customer type $k$, and $X_{jt}$ is the observed vector of product characteristics. The second term $\xi_{jt}$ represents "demand shocks," which are things that the researcher doesn't observe but the customer making a choice does. Our main concern in industrial organization is that $\xi$ plays a big role in explaining market share variation and that it is likely correlated with prices (and sometimes other things), because firms may try to set high prices when they know that customers have other reasons to buy their good that might offset the cost of a price hike.
The final component is $\epsilon_{ijt}$, which is often called an idiosyncratic error and which denotes random, one-off decision shifters. You woke up on the wrong side of the bed and decided to buy an extra large coffee today, or you had a great day at work and decided to go out to eat instead of buying groceries. Note that this is the only term which is indexed by $i$ here. In some cases $\theta$ will also have an $i$ index, but in this case we're going to assume that potentially many customers have preferences that are similar enough to group them all into a single shared $\theta^k$, such that there are only $K$ total distinct types of customers. Still, each customer has good and bad days and thus still receives idiosyncratic shocks.
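To make this concrete, here is a minimal sketch (in Julia, with made-up parameter values and variable names of my own) of what this model implies: each simulated customer draws a type $k$, receives idiosyncratic Gumbel shocks, and picks the option with the highest utility.

using Distributions, Random, Statistics

# Minimal sketch: J products, two discrete customer types, one market.
Random.seed!(1234)
J, N = 3, 10_000                       # products and simulated customers
X  = randn(J, 2)                       # observed characteristics (e.g., price and x)
xi = 0.3 .* randn(J)                   # demand shocks: customers see these, the researcher doesn't
theta = [-1.0 1.0; -2.0 0.5]           # row k = preferences of type k (here K = 2)
w_true = [0.6, 0.4]                    # mass of each type

choices = map(1:N) do i
    k  = rand(Categorical(w_true))                     # draw this customer's type
    u  = X * theta[k, :] .+ xi .+ rand(Gumbel(), J)    # utilities, including idiosyncratic shocks
    u0 = rand(Gumbel())                                # outside option
    maximum(u) > u0 ? argmax(u) : 0                    # best product, or the outside good
end
shares = [mean(choices .== j) for j in 1:J]            # observed market shares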
Fox, Kim, Ryan, and Bajari (2011)
As I said above, the first paper I want to pull from is Fox, Kim, Ryan, and Bajari (2011) (FKRB). Their approach starts from a really simple insight. Since a mixed logit assumes that every individual consumer behaves according to a logit demand model with their own individual preferences, we can write market shares in a mixed logit as a weighted average of type-specific shares, weighting by the mass of each type of customer:
\[ s_{jt} = \sum_k w_k s^k_{jt} \]
Now notice that $s^k_{jt}$ is something we could calculate in advance, if only we knew the value of $\theta^k$ and $\xi_{jt}$, following the standard logit formula:
\[ s^k_{jt} = \frac{\exp(\theta^k X_{jt} + \xi_{jt})}{1 + \sum_{j'} \exp(\theta^k X_{j't} + \xi_{j't})} \]
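In code, these type-specific shares are cheap to compute for any candidate $\theta^k$. Here is a tiny helper (my own sketch, not part of any package) for a single market, with one row of X per product:

# Logit shares for one customer type in one market.
# X is J x D (one row per product), xi has length J; returns a length-J vector of shares.
function type_shares(theta_k, X, xi)
    v = exp.(X * theta_k .+ xi)   # exponentiated utilities (net of the idiosyncratic shock)
    return v ./ (1 + sum(v))      # logit formula with the outside good normalized to zero
end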
Now, how can we actually utilize this identity, given the two things we don't know? FKRB cover part of this in depth. If we don't know $\theta^k$ but we do know $\xi_{jt}$ (or are willing to assume it away), what we can do is guess a large number of possible values and see which ones are most correlated with the shares we observe in the actual data. All that requires in practice is estimating a fairly simple regression:
\[ s_{jt} = w_1 s^1_{jt} + w_2 s^2_{jt} + \dots + w_K s^K_{jt} + \varepsilon_{jt} \]
Since, as I mentioned, we can actually calculate each term $s^k_{jt}$ here in advance, this regression is more-or-less standard. You can regress observed market shares on these constructed type-specific market shares, and the estimated coefficients are the weights that tell you the mass of that type in your data. This is also incredibly simple in comparison to the inner workings of a package like PyBLP.
There are definitely some catches and some challenges putting this to use in practice. For example, how do you specify your grid of potential values for $\theta^k$? If you specify a grid with too many points, you may over-fit your regression, but you also need enough points in your grid to make sure that you cover the true domain of $\theta^k$. This is a hard balance, but best practice now seems to be to specify a pretty large grid and then estimate the key regression using an elastic net of some sort, and in my simulations that seems to work fairly well. The current best approach, which is only slightly harder to implement, is probably to run a constrained elastic net in which the weights $w_k$ are constrained to be nonnegative and to sum to 1.
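As a rough illustration, here is how that constrained regression could look, continuing the toy example above and treating $\xi$ as known. The grid, the penalty value, and the use of Convex.jl/SCS are all my own choices rather than a prescription, and in a real application you would stack shares across all markets rather than use a single market.

using Convex, SCS

# Grid of candidate theta^k values and the implied type-specific shares.
theta_grid = vec([[t1, t2] for t1 in -3.0:0.25:0.0, t2 in 0.0:0.25:3.0])
S = hcat([type_shares(tk, X, xi) for tk in theta_grid]...)   # column k holds s^k_jt for candidate k

# Constrained least squares with a small ridge penalty, subject to w >= 0 and sum(w) = 1.
# (Under these constraints the L1 part of an elastic net is constant, so only the L2 penalty bites.)
lambda = 1e-6
w = Variable(length(theta_grid))
prob = minimize(sumsquares(S * w - shares) + lambda * sumsquares(w),
                w >= 0, sum(w) == 1)
solve!(prob, SCS.Optimizer)
w_hat = evaluate(w)   # estimated mass on each grid point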
Meeker (2021)
So, FKRB is a really nice tool, but to get here we had to assume that we know the values of $\xi_{jt}$ (and/or assume that they are all equal to 0). This is a bad assumption in most cases. I've done enough demand estimation to know that the things we care about (prices, product characteristics) explain annoyingly little of the variation in market shares most of the time. As a result, omission of $\xi$ is probably hugely problematic for FKRB. Why exactly? If FKRB is correctly specified, then the residual in that regression is probably best interpreted as measurement error. That is, maybe the market shares we observe in the data are constructed from a sample of individual choices, or through some other approach which is less than perfect but unbiased.
If, on the other hand, $\xi$ is erroneously omitted from our construction of type-specific shares, then we have a complicated case of omitted variable bias. The $\xi$ for each product enters each type-specific share (the constructed regressors) nonlinearly, and because $\xi$ is likely to be correlated with prices in real data, this omission could lead to substantial bias in ways that are hard to characterize (we'll look for this bias in simulation below). So, how do we get around this? Fairly simple: we can try to estimate $\xi$ first, and then add $\xi$ into our calculation of the type-specific shares.
This is what Meeker (2021) did, which I thought was clever and wasn't something I'd considered before (part of the reason I wasn't very enthusiastic about FKRB until recently). His approach proceeds in two steps.
Step 1: Estimate a logit model without random coefficients
\[ \log(s_{jt}/s_{0t}) = \bar\theta X_{jt} + \xi_{jt} \]
Step 2: Run FKRB via an elastic net, using the $\hat\xi_{jt}$s estimated from this regression to calculate type-specific market shares
\[ \hat s^k_{jt} = \frac{\exp(\theta^k X_{jt} + \hat\xi_{jt})}{1 + \sum_{j'} \exp(\theta^k X_{j't} + \hat\xi_{j't})} \]
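A bare-bones sketch of the two steps might look like the following, for a long dataframe with one row per product-market (the column names and the OLS shortcut are mine; in practice you would instrument for prices in Step 1):

using DataFrames

# Step 1: plain logit regression of log share ratios on characteristics;
# the residuals are the estimated demand shocks xi_jt.
y  = log.(df.share ./ df.share_0)           # df.share_0 holds the outside share in each market
Xm = [ones(nrow(df)) df.prices df.x]
b  = Xm \ y                                 # OLS here; use instruments for prices in practice
df.xi_hat = y .- Xm * b

# Step 2: rebuild each type-specific share s^k_jt with xi_hat in place of the true xi_jt
# (e.g., with the type_shares helper above, market by market), and rerun the constrained
# FKRB regression on those corrected shares.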
This is probably at least 90% of the practical difference between the original FKRB approach I outlined above and the ideal, and I like it a lot. By generating an initial reasonable estimate of $\xi$ and including it in the constructed type-specific shares, we mostly get rid of the omitted variable bias.
The only issue that remained in my mind after reading about this approach is that the model we used to estimate $\xi$ conflicts with the model we estimate in FKRB. In the former, we assumed that customers all have the same preferences for observable features. In the latter, we are allowing a lot of heterogeneity. So, I wanted to propose one modification that I think makes this two-step approach slightly better, or at least internally consistent.
FRAC + FKRB
What we need here is a model which is quick and easy enough that we can use it before FKRB, but is general enough to also be consistent with FKRB's underlying assumptions. My proposal is a model called FRAC (Fast, Robust, and Approximately Correct), which is something I have pushed many times now on Twitter and to peers. FRAC was introduced by Salanie and Wolak (though now the acronym doesn't fit their new title), and makes the following observation. We can think of random coefficient logit models as perturbations away from the simpler logit model in which all preferences are identical, where the perturbation slightly increases the level of heterogeneity across customers. With that framing, they take a Taylor expansion around the basic logit model and show that mixed logits admit an approximation which is quite similar to the usual logit regression.
For simplicity, let's assume X is only one characteristic (prices $p$), and let $\sigma_\theta$ denote the standard deviation of $\theta$ across customers. Then we can write the following approximation:
\[ \log(s_{jt}/s_{0t}) = \bar\theta X_{jt} + \sigma_{\theta}^2 K_{jt} + \xi_{jt} + O(\sigma_\theta^4) \\ K_{jt} = (p_{jt}/2 - e_{t}) p_{jt} \\ e_{t} = \sum_{j=1}^J s_{jt}p_{jt} \]
This implies that, to estimate a mixed logit (approximately), we can use the same left-hand-side variable that we use in plain logit models, and simply modify the right side of the regression by adding the constructed regressor $K_{jt}$, which is a sort of quadratic term in prices. The amazing thing about this approximation is that it doesn't make any distributional assumptions, meaning that we can estimate $\sigma_\theta^2$ as the coefficient on $K_{jt}$ while still leaving the distribution of $\theta$ unspecified. While amazing, I also think of this as a minor weakness of FRAC in practice, because knowing the mean and variance of $\theta$ doesn't tell me the shape of its distribution, which might be quite important.
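Here is a rough sketch of what the FRAC regression looks like in this single-characteristic case, again with column names of my own and OLS in place of proper instrumenting, so treat it as illustrative:

using DataFrames

# e_t = share-weighted average price in each market, then the constructed regressor K_jt.
df = transform(groupby(df, :market_ids),
               [:share, :prices] => ((s, p) -> sum(s .* p)) => :e_t)
df.K = (df.prices ./ 2 .- df.e_t) .* df.prices

# Regress log share ratios on prices and K; the coefficient on K estimates sigma_theta^2.
y  = log.(df.share ./ df.share_0)
Xm = [ones(nrow(df)) df.prices df.K]
b  = Xm \ y                      # OLS sketch; prices and K should be instrumented in practice
theta_bar, sigma2 = b[2], b[3]
xi_hat_frac = y .- Xm * b        # FRAC-based estimate of xi_jt to feed into the FKRB step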
So, my proposed approach is simply to modify Step 1 in Meeker's paper to use FRAC to generate initial estimates of $\xi$, which can then be plugged into FKRB's type-specific market shares (the constructed regressors). My hope is that FRAC, because it explicitly assumes a model which allows for random coefficients and leaves the distribution of preferences unspecified (just like FKRB), will give us slightly more robust estimates of $\xi$, which will then improve our estimates of the distribution of preferences ($\theta$). Now, there is still some approximation error remaining in the equation above, which is increasing in the magnitude of $\sigma_\theta$, and if we're being formal there might be something fishy about the discreteness of the distribution in FKRB and some assumptions made in FRAC. Still, it feels like this approximation error is probably made smaller by including $K_{jt}$ (and other FRAC adjustments for more complicated models) relative to the logit model used in Meeker (2021).
Simulations
Now, let's see how well this approach works, and whether my suggestion helps at all! To compare the models, I'll simulate some mixed logit data for 400 markets, where products are differentiated by prices (which are chosen endogenously) and an additional exogenous attribute x. In each market, there will be between 2 and 4 products. Simulating this data generating process many times, I'll compare three approaches:
FKRB, ignoring the role of $\xi$ (assuming all are 0, to give a worst-case baseline)
FKRB with Meeker's logit model first stage
FKRB with FRAC as a first stage
I ran this simulation 600 times, calculating the average own-price elasticity for each of the above approaches for each simulation. These averages are shown below, and it looks like there are two main takeaways:
Adding the logit step improves the elasticities a lot relative to not correcting
Using FRAC doesn't add a ton! It does improve the mean estimate, but that improvement comes at the cost of additional variance, and more of it than I expected. This is consistent with what I said above, that the logit correction does at least 90% of the fix we need.
If you've compared elasticities from models with and without random coefficients before, you have probably seen that adding random coefficients very often doesn't change the estimated elasticities much until you're using more complex distributions, or until heterogeneity is really large relative to the variance of $\xi$. Still, we can see some improvement in the mean here visually, and I figure that the ease of FRAC means it's better to be safe than sorry. It's definitely possible that this will matter for some DGPs. I've implemented the FRAC+FKRB approach in my code here. I'm hoping to add more over time, but any improvements/additions are more than welcome. No tests or anything implemented yet, so it's very possible that something obvious is broken.
Extra: Simple example code to run FRAC+FKRB
Here is some simple code which implements the proposed approach. First you have to add FKRB from GitHub (no promises that it works out of the box yet, but I hope it does).
using Pkg
Pkg.add(url = "https://github.com/jamesbrandecon/FKRB.jl", rev = "master")
Then you should be able to run the code below (assuming you've worked in Julia enough to have the packages below or to add them yourself)
using FKRB
using Distributions, Plots
using FixedEffectModels, DataFrames

T = 500; # (number of markets)/2
J1 = 2; # number of products in half of markets
J2 = 4; # number of products in the other half of markets
preference_means = [-1.0 1.0]; # means of preferences
# first element is preference for price, second is x
preference_SDs = [0.3 0.3]; # standard deviations of preferences
sd_xi = 0.3; # standard deviation of demand shocks

df = FKRB.sim_logit_vary_J(J1, J2, T, 1, preference_means, preference_SDs, sd_xi, with_market_FEs = true);
# adding in market fixed effects just to make the problem more complicated

# Drop true demand shifters created during simulation
df = select(df, Not(:xi));

# Define a problem (here nonlinear and linear are required to be the same, in order to match FRAC and NPDemand usage)
problem = FKRB.define_problem(data = df,
    linear = ["x", "prices"],
    nonlinear = ["x", "prices"],
    iv = ["demand_instruments0"]);

# For now, you have to estimate FKRB with a pre-specified set of elastic net penalties,
# but K-fold cross validation can be added
FKRB.estimate!(problem, method = "elasticnet",
    gamma = 1e-5,  # L1 penalty
    lambda = 1e-6) # L2 penalty

# Make a simple plot showing the CDF of the two dimensions of preference parameters
cdf_plot = FKRB.plot_cdfs(problem)

# Calculate price elasticities for all products and markets
FKRB.price_elasticities!(problem)

# Reshape elasticities into a dataframe
elasticities_df = FKRB.elasticities_df(problem)