PD for the Layman



This poor, little old lady’s expression is probably an apt reflection of the faces of all of us who have ever been tasked with building a probability of default model for loans. Of course, the spectrum of possible model complexity is quite wide, ranging from simple proportions to sophisticated Markov chains. And, depending on the context, a simple model may be all that is needed. For example, if you’re interested in estimating the proportion of current, active loans that will default in the next year, you might consider the pool of loans that were active 12 months ago, and calculate the proportion of those loans that have since defaulted.
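To make the simple-proportion idea concrete, here is a minimal sketch in Python. The `Loan` record, its field names, and the five-loan portfolio are all made up for illustration; a real loan tape would look different.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    # Hypothetical fields, invented for this example.
    loan_id: str
    active_12_months_ago: bool
    defaulted_since: bool


def one_year_default_rate(loans):
    """Share of loans active 12 months ago that have since defaulted."""
    pool = [loan for loan in loans if loan.active_12_months_ago]
    if not pool:
        raise ValueError("no loans were active 12 months ago")
    defaults = sum(1 for loan in pool if loan.defaulted_since)
    return defaults / len(pool)


# Made-up portfolio: four loans were active a year ago, one of them defaulted.
portfolio = [
    Loan("A", True, False),
    Loan("B", True, True),
    Loan("C", True, False),
    Loan("D", True, False),
    Loan("E", False, False),  # originated later, so it is not in the pool
]

rate = one_year_default_rate(portfolio)
print(f"Estimated one-year PD: {rate:.2%}")
```

Note that loan E is left out of the denominator because it was not active at the start of the 12-month window.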


Unfortunately, some situations call for more sophistication than this. I’ll try to come up with one off the top of my head. Umm...I don’t know...CECL! The new standard, released by FASB in 2016, will require institutions to use expected lifetime credit losses in their loss allowance calculations. For such a small word, “lifetime” causes a lot of grief for all of us. Because of it, we must consider a loan’s probability of defaulting over its remaining life rather than just over the next year. Additional complexity is added when you wish to consider risk characteristics such as credit score, LTV, and economic conditions.
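To see why “lifetime” changes the arithmetic, here is a rough sketch that chains period-by-period default probabilities into a cumulative one. It assumes each year’s PD is conditional on the loan having survived to the start of that year, and it ignores prepayment entirely. The three yearly PDs are invented numbers, and this is not necessarily how any particular CECL model works.

```python
def lifetime_pd(conditional_pds):
    """Cumulative PD over all periods: 1 minus the chance of surviving each one.

    Assumes each entry is the probability of default in that period,
    given that the loan has not defaulted in an earlier period.
    """
    survival = 1.0
    for pd_t in conditional_pds:
        if not 0.0 <= pd_t <= 1.0:
            raise ValueError("each PD must be between 0 and 1")
        survival *= 1.0 - pd_t
    return 1.0 - survival


# Made-up conditional PDs for a loan with three years of remaining life.
conditional_pds = [0.02, 0.03, 0.03]

cumulative = lifetime_pd(conditional_pds)
print(f"Next-year PD: {conditional_pds[0]:.2%}")
print(f"Lifetime PD:  {cumulative:.2%}")
```

Even with modest yearly numbers, the lifetime figure is noticeably larger than the one-year figure, which is the gap CECL asks institutions to account for.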


There is still a wide range of PD model types that can accommodate these things, but I am going to discuss, at a high level, the model that VE has chosen to use. It is a type of regression model, so I’ll begin by describing what those are. In general, statistical regression allows you to use a set of factors (independent variables) to predict another variable (the dependent variable). For example, you might use characteristics of a mother cow to predict her gestation period (in months), or attributes of a vehicle to estimate its selling price. In these examples, the dependent variables would be the number of months in the gestation period and the price of the vehicle, respectively. Now, it’s important to note that there are two parts involved in this kind of model (and in most models). The first involves using existing data to train, or fit, the model. That is, using existing data to quantify the relationship between each independent variable and the dependent variable. The second is where the model is actually used, and it involves applying those quantified relationships to a new data point to obtain an estimate. For example, once you have fit a vehicle pricing model using existing data, you can use the model to estimate the value of any particular vehicle by “plugging” the vehicle’s attributes into the model.




Figure 2: A scatterplot relating two variables, with a linear regression model fit to the data
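The two parts described above, fitting and then using the model, can be sketched with a one-variable linear regression. This is a bare-bones ordinary least squares fit written out by hand; the vehicle ages and prices are made-up numbers chosen to keep the arithmetic easy to follow, and the function names are just for this example.

```python
def fit_line(xs, ys):
    """Part 1: fit. Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Part 2: use. Plug a new data point into the fitted relationship."""
    return intercept + slope * x


# Made-up training data: vehicle age (years) and selling price ($ thousands).
ages = [1, 2, 3, 4, 5]
prices = [20.0, 18.0, 16.0, 14.0, 12.0]

slope, intercept = fit_line(ages, prices)
estimate = predict(slope, intercept, 6)  # a 6-year-old vehicle not in the data
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate={estimate:.2f}")
```

A real pricing model would use many attributes at once (mileage, make, condition, and so on), but the fit-then-plug-in pattern stays the same.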


In the case of a PD model, the dependent variable, or variable we want to predict, is the probability of default. This adds an intricacy beyond the cow and car examples, in that the number we are trying to predict must be a proportion, so it must fall between 0 and 1. This requires that we use a special kind of regression model (logistic regression). Also, the characteristics of the loan and economy are dynamic, such that the values of independent variables for a loan may be different next month than what they are now. Furthermore, CECL will actually require that institutions account for anticipated future conditions in their loss calculations. All of this is not to mention the fact that we must incorporate the possibility of prepayment in the model (turning logistic regression into multinomial logistic regression). The sum total of these requirements equals one gnarly regression model that Visible Equity is psyched about. In the spirit of keeping this high-level and digestible, I’ll fight the urge to get too academic, and leave it here for now.
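For the curious, here is a small sketch of how logistic and multinomial logistic regression keep their outputs between 0 and 1. The coefficients below are invented purely for illustration and are not Visible Equity’s; in practice they would come from fitting the model to historical loan data. The three outcomes (stay active, default, prepay) follow the description above, with “stay active” treated as the baseline.

```python
import math


def logistic(z):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one score per outcome into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def outcome_probabilities(ltv, score_over_700):
    """Multinomial logistic sketch with hypothetical, hand-picked coefficients.

    ltv is a fraction (0.8 means 80%); score_over_700 is credit score minus 700.
    """
    default_score = -4.0 + 3.0 * ltv - 0.01 * score_over_700
    prepay_score = -3.0 - 1.0 * ltv + 0.005 * score_over_700
    # The baseline outcome ("active") has its score fixed at 0.
    p_active, p_default, p_prepay = softmax([0.0, default_score, prepay_score])
    return {"active": p_active, "default": p_default, "prepay": p_prepay}


low_ltv = outcome_probabilities(ltv=0.6, score_over_700=40)
high_ltv = outcome_probabilities(ltv=0.9, score_over_700=40)
print(low_ltv)
print(high_ltv)
```

With only two outcomes (default or not), `softmax([0.0, z])[1]` gives the same number as `logistic(z)`, which is why adding prepayment as a third outcome is a natural step from logistic to multinomial logistic regression.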

Rachel Messick

Product Manager/Data Scientist at Visible Equity