How to model how long things last when you can't wait around to see them all end — from Sir David Cox's 1972 trick to its deep-learning sequel.
You have a list of subjects and, for each one, a number: how long they lasted. Patients who lived 14 months after diagnosis. Lightbulbs that burned 8,421 hours. Customers who stayed subscribed for 97 days. You'd like to ask: which covariates make this number bigger or smaller? A natural instinct is to reach for ordinary regression. Don't. Survival data has a peculiarity that standard regression cannot handle, and ignoring it will bias every conclusion you draw.
The peculiarity is censoring. Not every subject reaches the event you're measuring before the study ends. Some patients are still alive at the last follow-up. Some bulbs are still burning when you pull the plug on the experiment. Some customers are still subscribed today. For these subjects, you don't know their true lifetime; you only know it's longer than what you observed. Drop them and you bias toward short lifetimes. Treat their observation time as a real event time and you bias in the same direction, because every censored lifetime is recorded as shorter than it really was. Survival analysis is the family of methods designed to use this partial information correctly.
A censored observation is not a missing data point — it's a lower bound on the true lifetime, and it carries real information. The art of survival analysis is squeezing that information out.
Here's what survival data looks like laid out as a timeline. Each horizontal line is one subject. The line starts at their enrollment (time zero) and ends when their event happens — or when the study closes, whichever comes first.
Two observations. First, as you shrink the study duration, more subjects remain censored. Information about their true lifetimes is lost — but the fact that they made it to the study's end is itself valuable. Second, if you simply threw out the censored subjects and ran a regression on the remaining ones, you'd be training on a biased sample: the short-lived. Your predictions would systematically underestimate lifetimes.
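To put numbers on that bias, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: exponential lifetimes with a true mean of 10, everyone enrolled at time zero, a study that closes at time 8, and the names `simulate`, `mean_dropped`, `mean_as_event`, and `mean_mle`. It compares the two naive averages with one censoring-aware estimate, the exponential maximum-likelihood estimate (total observed time divided by the number of observed events), which is only appropriate because the simulated lifetimes really are exponential.

```python
import random

def simulate(n=20000, mean_life=10.0, study_len=8.0, seed=0):
    """Exponential lifetimes, censored when the study closes at study_len."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        life = rng.expovariate(1.0 / mean_life)  # true (partly unseen) lifetime
        times.append(min(life, study_len))       # what the study records
        events.append(life <= study_len)         # True = event was observed
    return times, events

times, events = simulate()

# Naive approach 1: drop censored subjects and average the rest.
observed = [t for t, e in zip(times, events) if e]
mean_dropped = sum(observed) / len(observed)

# Naive approach 2: pretend every censoring time was an event time.
mean_as_event = sum(times) / len(times)

# Censoring-aware: exponential MLE = total time at risk / number of events.
mean_mle = sum(times) / sum(events)

print(f"true mean lifetime       : 10.0")
print(f"drop censored subjects   : {mean_dropped:.2f}")
print(f"treat censoring as event : {mean_as_event:.2f}")
print(f"censoring-aware MLE      : {mean_mle:.2f}")
```

Both naive numbers should land well below the true mean, while the censoring-aware estimate should land close to it. The censored subjects contribute their time at risk to the numerator without adding to the event count, which is exactly the "lower bound" information described above.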
Survival analysis speaks in two linked languages. One is the hazard function h(t) — the instantaneous rate of events at time t, conditional on having survived until t. Think: "given you've made it this far, what's your risk of dying in the next instant?" The other is the survival function S(t) — the probability of lasting beyond time t, full stop. They carry the same information and you can convert between them freely:
S(t) = exp( − ∫₀ᵗ h(s) ds )
The integral in the exponent, H(t) = ∫₀ᵗ h(s) ds, is called the cumulative hazard. The relationship is mechanical: if you know the instantaneous risk at every time, you know the chance of surviving. If you know the chance of surviving, you can recover the instantaneous risk by differentiating: h(t) = −d/dt log S(t).
To build intuition, here are the two curves for a Weibull distribution, a two-parameter family whose shape parameter selects among three common patterns of aging: infant mortality (hazard decreasing), random failures (hazard constant), and wear-out (hazard increasing).
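The hazard-to-survival conversion can be checked numerically. The sketch below uses the standard Weibull forms h(t) = (k/λ)(t/λ)^(k−1) and S(t) = exp(−(t/λ)^k); the parameter names `shape` (k) and `scale` (λ), the helper `survival_from_hazard`, and the evaluation time 1.5 are choices made here for illustration. It integrates the hazard with a simple midpoint rule and compares the result with the closed-form survival function.

```python
import math

def weibull_hazard(t, shape, scale=1.0):
    """h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale=1.0):
    """Closed form S(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

def survival_from_hazard(hazard, t, steps=20000):
    """S(t) = exp(-cumulative hazard), integrating with the midpoint rule."""
    dt = t / steps
    cumulative_hazard = sum(hazard((i + 0.5) * dt) for i in range(steps)) * dt
    return math.exp(-cumulative_hazard)

for shape, label in [(0.5, "decreasing"), (1.0, "constant"), (2.0, "increasing")]:
    early = weibull_hazard(0.5, shape)
    late = weibull_hazard(2.0, shape)
    numeric = survival_from_hazard(lambda s: weibull_hazard(s, shape), 1.5)
    exact = weibull_survival(1.5, shape)
    print(f"shape={shape}: hazard {label} ({early:.3f} -> {late:.3f}), "
          f"S(1.5) numeric={numeric:.4f}, closed form={exact:.4f}")
```

For shape values below 1 the hazard blows up near t = 0, so the midpoint rule is slightly off there; the agreement is essentially exact for the constant and increasing cases.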
Fitting a full hazard function is hard. It's a whole curve, and curves have many parameters. Fitting a single-number effect of each covariate is easy — but if you commit to a specific hazard family like the Weibull, you've baked in assumptions about the shape of aging that may be wrong.
In 1972, Sir David Cox proposed a remarkable compromise. Write the hazard for a subject with covariates x as:
h(t | x) = h₀(t) · exp(β·x)
The first factor, h₀(t), is the baseline hazard — the hazard function for a subject whose covariates are all zero. It can be any function. No assumption on its shape. The second factor, exp(β·x), is a constant-in-time multiplier that scales the baseline up or down depending on covariates. This is the proportional hazards structure: two subjects with different covariates have hazard functions that are scaled copies of each other. The ratio of their hazards at any time t is the same constant, forever.
Why is this a miracle? Because you can estimate β without ever estimating h₀(t). The baseline hazard is a nuisance — it cancels out in the mathematical trick we're about to see. You get the covariate effects, crisp and interpretable, without wrestling with what aging looks like.
The name "proportional hazards" says what it does. Below are two groups — a control group (solid green) and a treatment group (red). The treatment has a log-hazard shift β. When you slide β, the treatment group's hazard and survival curves scale relative to the control, but the shape of the baseline hazard is preserved in both. The hazard ratio is HR = exp(β): HR = 2 means the treated group has twice the instantaneous risk at every time; HR = 0.5 means half.
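A small numerical sketch of the same idea, under an assumed Weibull baseline (shape 1.5, scale 10, picked only for illustration; the function names are also made up here). Multiplying the hazard by HR multiplies the cumulative hazard by HR, so the treated survival curve is the control survival curve raised to the power HR.

```python
import math

SHAPE, SCALE = 1.5, 10.0   # assumed Weibull baseline, for illustration only

def control_hazard(t):
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)

def control_survival(t):
    return math.exp(-((t / SCALE) ** SHAPE))

def treated_hazard(t, beta):
    # Proportional hazards: same baseline shape, scaled by exp(beta).
    return control_hazard(t) * math.exp(beta)

def treated_survival(t, beta):
    # Scaling the hazard by HR scales the cumulative hazard by HR,
    # so S_treated(t) = S_control(t) ** HR.
    return control_survival(t) ** math.exp(beta)

beta = math.log(2.0)       # HR = exp(beta) = 2
for t in (1.0, 5.0, 20.0):
    ratio = treated_hazard(t, beta) / control_hazard(t)
    print(f"t={t:5.1f}  hazard ratio={ratio:.3f}  "
          f"S_control={control_survival(t):.3f}  "
          f"S_treated={treated_survival(t, beta):.3f}")
```

The hazard ratio printed at each time is the same constant, while the gap between the two survival curves changes with t. That is why hazard ratios, not survival differences, are the natural summary under this model.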
Here is the beautiful move. Forget about absolute event times and think only about orderings. Suppose subject i has their event at time tᵢ. At that exact moment, consider the risk set R(tᵢ): every subject still under observation just before tᵢ (no event yet, not yet censored). Cox asks: given that exactly one event happened at tᵢ, what's the probability that the subject who had it was i in particular, rather than any of the others in the risk set?
Under proportional hazards, this probability is:
P(i has the event | one event at tᵢ) = exp(β·xᵢ) / Σ_{j ∈ R(tᵢ)} exp(β·xⱼ)
Notice what just happened. Every subject j in the risk set has hazard h₀(tᵢ)·exp(β·xⱼ) at that moment, so the baseline hazard h₀(tᵢ) multiplies the numerator and every term of the denominator identically, and it cancels. All that's left is the covariate effect β. Multiply these probabilities across every observed event and you get the partial likelihood:
L(β) = Π_{i : event} [ exp(β·xᵢ) / Σ_{j ∈ R(tᵢ)} exp(β·xⱼ) ]
Maximize this and you've got β. Censored subjects show up in the risk sets (they contribute as competitors) but never as numerators (they never had the event). No parametric form for the baseline hazard, no baseline estimation, no fuss. Cox introduced the model in his 1972 paper, and it changed biostatistics forever.
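The partial likelihood is short enough to write out directly. The sketch below is a toy, not a production fitter: it handles a single covariate, assumes no tied event times, and finds the maximum with a ternary search (which works because the negative log partial likelihood is convex in β). The simulated data, the constant baseline hazard of 0.1, the true β of 0.7, and all function names are assumptions for illustration.

```python
import math
import random

def neg_log_partial_likelihood(beta, times, events, x):
    """Negative log partial likelihood, one covariate, no tied times assumed."""
    # Visit subjects from latest to earliest observed time, so the running
    # sum always covers exactly the risk set of the subject being visited.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    risk_sum = 0.0
    nll = 0.0
    for i in order:
        risk_sum += math.exp(beta * x[i])   # i is at risk at its own time
        if events[i]:                       # censored rows add no term
            nll -= beta * x[i] - math.log(risk_sum)
    return nll

def fit_beta(times, events, x, lo=-5.0, hi=5.0, iters=60):
    """Ternary search for the minimizer; relies on convexity in beta."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if (neg_log_partial_likelihood(m1, times, events, x)
                < neg_log_partial_likelihood(m2, times, events, x)):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def simulate(n=1000, true_beta=0.7, seed=1):
    """Exponential lifetimes with hazard 0.1 * exp(beta * x), random censoring."""
    rng = random.Random(seed)
    times, events, x = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        life = rng.expovariate(0.1 * math.exp(true_beta * xi))
        censor = rng.expovariate(0.05)      # independent censoring time
        times.append(min(life, censor))
        events.append(life <= censor)
        x.append(xi)
    return times, events, x

times, events, x = simulate()
beta_hat = fit_beta(times, events, x)
# Any order-preserving change of the time axis leaves the estimate alone.
beta_hat_warped = fit_beta([t ** 2 for t in times], events, x)
print(f"estimated beta: {beta_hat:.3f}, after squaring all times: {beta_hat_warped:.3f}")
```

The second fit squares every observed time, which changes the baseline hazard completely but not the ordering of subjects. The estimate does not move, because only the orderings enter the partial likelihood. Real data with ties needs a tie-handling convention (Breslow or Efron are the usual choices), which this sketch omits.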
Look again at the Cox model: h(t | x) = h₀(t) · exp(β·x). The log-hazard is a linear function of the covariates: log h(t | x) = log h₀(t) + β·x. Doubling a feature doubles its contribution to the log-hazard. If the true relationship is nonlinear (a U-shape in dose response, a threshold in pressure, an interaction between two sensor signals), plain Cox will fit a straight line through the curve and miss everything interesting.
DeepSurv (Katzman et al., 2018) takes a single, direct step. Replace the linear score β·x with a neural network fθ(x):
h(t | x) = h₀(t) · exp(fθ(x))
Train the network's weights θ by maximizing the same partial likelihood — just with the linear risk score replaced by the network's output. The censoring logic, the baseline-free elegance, the whole partial-likelihood machinery all carry through unchanged. You inherit Cox's philosophy and gain the flexibility of deep learning to model arbitrary, nonlinear, interaction-heavy risk functions.
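To make the substitution concrete, here is a deliberately tiny stand-in written with the standard library only. It is not the DeepSurv implementation: the network is a single tanh hidden layer with four units on one scalar covariate, the gradient is approximated by finite differences instead of backpropagation, and the step size is halved whenever a step fails to reduce the loss. The simulated U-shaped data and every name (`cox_loss`, `network`, `train`) are assumptions for illustration. The point is only that the loss is the same negative log partial likelihood, now evaluated on network outputs.

```python
import math
import random

def cox_loss(scores, times, events):
    """Mean negative log partial likelihood for arbitrary risk scores.
    Assumes no tied event times."""
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    risk_sum, nll = 0.0, 0.0
    for i in order:                         # latest observed time first
        risk_sum += math.exp(scores[i])     # subject i joins the risk set
        if events[i]:
            nll -= scores[i] - math.log(risk_sum)
    return nll / sum(events)

def network(x, theta, hidden=4):
    """Toy f_theta(x): one tanh hidden layer, scalar input and output."""
    w1 = theta[:hidden]
    b1 = theta[hidden:2 * hidden]
    w2 = theta[2 * hidden:]
    return sum(w2[k] * math.tanh(w1[k] * x + b1[k]) for k in range(hidden))

def loss(theta, xs, times, events):
    return cox_loss([network(x, theta) for x in xs], times, events)

def train(xs, times, events, steps=60, lr=0.5, eps=1e-5, seed=0):
    """Gradient descent with a finite-difference gradient (toy only)."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 0.5) for _ in range(12)]
    history = [loss(theta, xs, times, events)]
    for _ in range(steps):
        base = history[-1]
        grad = []
        for k in range(len(theta)):
            bumped = theta[:k] + [theta[k] + eps] + theta[k + 1:]
            grad.append((loss(bumped, xs, times, events) - base) / eps)
        candidate = [t - lr * g for t, g in zip(theta, grad)]
        new_loss = loss(candidate, xs, times, events)
        if new_loss < base:                 # accept only improving steps
            theta = candidate
            history.append(new_loss)
        else:
            lr *= 0.5                       # otherwise shrink the step
    return theta, history

# Simulated data with an assumed U-shaped log-hazard, 2 * x**2.
rng = random.Random(42)
xs = [rng.uniform(-1.5, 1.5) for _ in range(200)]
lives = [rng.expovariate(0.1 * math.exp(2.0 * x * x)) for x in xs]
cens = [rng.expovariate(0.05) for _ in xs]
times = [min(l, c) for l, c in zip(lives, cens)]
events = [l <= c for l, c in zip(lives, cens)]

theta, history = train(xs, times, events)
print(f"loss before training: {history[0]:.4f}, after: {history[-1]:.4f}")
```

A practical implementation would use an automatic-differentiation framework, mini-batches, and regularization. The structure of the objective stays the same: sort by time, accumulate the risk-set sum of exp(score), and add one term per observed event.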
The trade-off is exactly what you'd expect. You lose coefficient-level interpretability: there's no single β telling you "a one-unit increase in feature-5 multiplies risk by 1.3." You gain the ability to capture complex risk surfaces that linear Cox cannot represent. For problems where the linear assumption is roughly right, classical Cox often performs as well or better and is far easier to defend. For problems where it clearly isn't, DeepSurv and its cousins (Cox-Time, DeepHit, survival transformers) are where the field has moved.
To see the difference in action, we generate synthetic survival data where the true log-hazard depends on a single covariate x through a U-shape: risk is high at both extremes and low in the middle (a typical drug dose-response pattern). We then fit two models — a linear Cox and a flexible Cox (polynomial basis, an honest in-browser stand-in for what DeepSurv does with neural networks) — and plot their estimated log-hazard functions against the ground truth.
The concordance index (c-index) in the readout is survival analysis's analogue of AUC: over all comparable pairs of subjects, where one is known to have had the event first, how often does the model correctly give that subject the higher risk score? It ranges from 0 to 1: 0.5 is random ranking, 1.0 is perfect, and values below 0.5 mean the ranking is systematically backwards. With a U-shaped ground truth like this one, linear Cox's c-index hovers near 0.5 regardless of sample size, because no straight line can rank both arms of the U correctly. The flexible model's c-index climbs with sample size as it locks onto the curvature. This is the DeepSurv argument in one number.
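A minimal sketch of the c-index computation, in the spirit of Harrell's definition but simplified: it ignores tied times and counts tied scores as half-concordant. The function name, the four-subject example, and the simulated U-shaped data (assumed log-hazard 3·x²) are all illustrative assumptions. A pair is comparable when the subject with the shorter observed time actually had the event, so a censored subject can only appear as the longer-lived member of a pair.

```python
import math
import random

def concordance_index(times, events, scores):
    """Fraction of comparable pairs ranked correctly (higher score = earlier event)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                     # the earlier subject must have an event
        for j in range(n):
            if times[i] < times[j]:      # i failed while j was still at risk
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5    # tied scores count as a coin flip
    return concordant / comparable

# Hand-checkable example: the third subject is censored at t = 6.
times, events = [2, 4, 6, 8], [True, True, False, True]
c_perfect = concordance_index(times, events, [3, 2, 1, 0])
c_reversed = concordance_index(times, events, [0, 1, 2, 3])
c_constant = concordance_index(times, events, [1, 1, 1, 1])
print(c_perfect, c_reversed, c_constant)

# U-shaped risk: compare a linear score with a curved one.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(400)]
lives = [rng.expovariate(0.1 * math.exp(3 * x * x)) for x in xs]
cens = [rng.expovariate(0.05) for _ in xs]
t_obs = [min(l, c) for l, c in zip(lives, cens)]
e_obs = [l <= c for l, c in zip(lives, cens)]
c_linear = concordance_index(t_obs, e_obs, xs)
c_curved = concordance_index(t_obs, e_obs, [x * x for x in xs])
print(f"linear score: {c_linear:.3f}, curved score: {c_curved:.3f}")
```

The small example has five comparable pairs, and the three score vectors give 1.0, 0.0, and 0.5 respectively. On the simulated data the linear score should sit near 0.5 and the curved score clearly above it. The double loop is quadratic in the number of subjects, which is fine for a sketch but slow for large datasets.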
The historical home of Cox regression. Estimating the effect of treatments, biomarkers, or demographic factors on survival in oncology, cardiology, and epidemiology. Hazard ratios are the lingua franca of medical literature.
Time-to-failure for bearings, batteries, power electronics, and mechanical components. Cox lets you bring in operating-condition covariates (load, temperature, duty cycle) without assuming a Weibull or log-normal lifetime distribution.
How long until a subscriber cancels, a user goes dormant, a free trial converts (or doesn't). Censoring is natural here: many customers are still active at the moment you query the database, and survival methods handle them correctly.
Time-to-default modeling with covariates on borrower, loan, and macroeconomic features. Cox and its extensions are standard in banking risk models where the event of interest is rare and most observations are right-censored.
Predicting next failure for large populations of industrial assets — turbines, pumps, vehicles — where each unit's history is a mix of failures, repairs, and still-running censored observations. Perfect DeepSurv territory: many covariates, likely nonlinear interactions.
For power-electronics components operating under varied duty cycles, DeepSurv can learn interactions between temperature, current stress, and switching patterns that a linear Cox would silently miss.
Time-to-promotion, time-to-attrition, time-to-hire. Wide use in workforce planning; Cox keeps you honest about employees who are still with the company at report time.
Extensions of Cox (Andersen-Gill, frailty models) handle subjects with multiple events over time — warranty claims on the same unit, repeated hospitalizations. The partial-likelihood machinery generalizes cleanly.