Cox Regression & DeepSurv — An Illustrated Primer

Why survival data is different

You have a list of subjects and, for each one, a number: how long they lasted. Patients who lived 14 months after diagnosis. Lightbulbs that burned 8,421 hours. Customers who stayed subscribed for 97 days. You'd like to ask: which covariates make this number bigger or smaller? A natural instinct is to reach for ordinary regression. Don't. Survival data has a peculiarity that standard regression cannot handle, and ignoring it will bias every conclusion you draw.

The peculiarity is censoring. Not every subject reaches the event you're measuring before the study ends. Some patients are still alive at the last follow-up. Some bulbs are still burning when you pull the plug on the experiment. Some customers are still subscribed today. For these subjects, you don't know their true lifetime; you only know it's longer than what you observed. Drop them and you bias toward short lifetimes. Treat their observation time as a real event time and you underestimate lifetimes again, because every censored time is shorter than the lifetime it stands in for. Survival analysis is the family of methods designed to use this partial information correctly.

Key insight

A censored observation is not a missing data point — it's a lower bound on the true lifetime, and it carries real information. The art of survival analysis is squeezing that information out.

Censoring, made visible

Here's what survival data looks like laid out as a timeline. Each horizontal line is one subject. The line starts at their enrollment (time zero) and ends when their event happens — or when the study closes, whichever comes first.

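Interactive figure: subject timelines. Controls: study duration (shown at 10) and base event rate (shown at 0.15). Legend: event observed; censored (true lifetime unknown).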

Each bar is a subject, stopping at either an event (solid red dot) or the end of observation (open circle = censored). As you shorten the study duration, more subjects get cut off before their event and censoring grows. Survival methods use all of them — censored included.

Two observations. First, as you shrink the study duration, more subjects remain censored. Information about their true lifetimes is lost — but the fact that they made it to the study's end is itself valuable. Second, if you simply threw out the censored subjects and ran a regression on the remaining ones, you'd be training on a biased sample: the short-lived. Your predictions would systematically underestimate lifetimes.
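A small simulation makes the bias concrete. This is a minimal sketch under stated assumptions: lifetimes are exponential with rate 0.15, every subject enrolls at time zero, and the study closes at time 10. The function name `simulate` and the sample size are illustrative choices, not part of the original figure.

```python
import random

def simulate(n, rate, study_end, seed=0):
    """Exponential lifetimes, cut off (censored) at study_end."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        lifetime = rng.expovariate(rate)
        times.append(min(lifetime, study_end))   # what we actually observe
        events.append(lifetime <= study_end)     # False means censored
    return times, events

times, events = simulate(n=5000, rate=0.15, study_end=10.0)
true_mean = 1 / 0.15

# Naive approach 1: drop censored subjects and average the rest.
observed = [t for t, e in zip(times, events) if e]
drop_censored = sum(observed) / len(observed)

# Naive approach 2: pretend every censored time is an event time.
treat_as_event = sum(times) / len(times)

# Censoring-aware estimate for exponential data:
# total time at risk divided by the number of observed events.
censoring_aware = sum(times) / sum(events)

print(f"true mean lifetime   {true_mean:.2f}")
print(f"drop censored        {drop_censored:.2f}")
print(f"treat as event       {treat_as_event:.2f}")
print(f"censoring-aware MLE  {censoring_aware:.2f}")
```

Both naive averages land well below the true mean, while the censoring-aware estimate lets censored subjects contribute their time at risk without counting them as events.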

The two fundamental functions

Survival analysis speaks in two linked languages. One is the hazard function h(t) — the instantaneous rate of events at time t, conditional on having survived until t. Think: "given you've made it this far, what's your risk of dying in the next instant?" The other is the survival function S(t) — the probability of lasting beyond time t, full stop. They carry the same information and you can convert between them freely:

S(t) = exp( − ∫₀^t h(s) ds )

The integral in the exponent, H(t) = ∫₀^t h(s) ds, is called the cumulative hazard. The relationship is mechanical: if you know the instantaneous risk at every time, you know the chance of surviving. If you know the chance of surviving, you can recover the instantaneous risk by differentiating: h(t) = −d/dt log S(t).
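The conversion in both directions can be checked numerically. The sketch below assumes a hypothetical hazard that rises linearly, h(t) = 0.1 + 0.05·t, chosen only because its survival function has a simple closed form to compare against.

```python
import math

def hazard(t):
    """Hypothetical hazard rising linearly with time."""
    return 0.1 + 0.05 * t

def cumulative_hazard(t, steps=10_000):
    """H(t): integral of the hazard from 0 to t, by the trapezoid rule."""
    dt = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(k * dt) for k in range(1, steps))
    return total * dt

def survival(t):
    """S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t))

def hazard_from_survival(t, eps=1e-4):
    """h(t) = -d/dt log S(t), approximated by a central difference."""
    return -(math.log(survival(t + eps)) - math.log(survival(t - eps))) / (2 * eps)

t = 4.0
closed_form = math.exp(-(0.1 * t + 0.025 * t ** 2))  # exact S(t) for this hazard

print("S(t) numeric vs exact:", survival(t), closed_form)
print("h(t) recovered vs true:", hazard_from_survival(t), hazard(t))
```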

Hazard meets survival

To build intuition, here are the two curves for a Weibull distribution — a flexible two-parameter family that covers most practical shapes of aging: infant mortality (hazard decreasing), random failures (hazard constant), and wear-out (hazard increasing).

Interactive figure: two panels, hazard h(t) and survival S(t), for a Weibull distribution. Controls: shape κ (shown at 1.5) and scale λ (shown at 1.0).

Same distribution, two views. When κ < 1, hazard falls over time (infant mortality — things that survive early will likely survive long). When κ = 1, hazard is flat (memoryless). When κ > 1, hazard rises (wear-out). The survival curve translates this into what you observe: the probability any individual is still with you at time t.
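The three regimes are easy to tabulate. This sketch assumes the common shape/scale parameterization, h(t) = (κ/λ)(t/λ)^(κ−1) and S(t) = exp(−(t/λ)^κ); the text above does not spell out its formula, so treat that choice as an assumption.

```python
import math

def weibull_hazard(t, shape, scale=1.0):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale=1.0):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

grid = [0.5, 1.0, 1.5, 2.0]
for shape in (0.5, 1.0, 1.5):
    hazards = [round(weibull_hazard(t, shape), 3) for t in grid]
    survivals = [round(weibull_survival(t, shape), 3) for t in grid]
    print(f"shape={shape}: h={hazards}  S={survivals}")
```

Reading across each row, the hazard falls for shape 0.5, stays flat for shape 1.0, and rises for shape 1.5, while every survival curve decreases.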

Cox's elegant trick

Fitting a full hazard function is hard. It's a whole curve, and curves have many parameters. Fitting a single-number effect of each covariate is easy — but if you commit to a specific hazard family like the Weibull, you've baked in assumptions about the shape of aging that may be wrong.

In 1972, Sir David Cox proposed a remarkable compromise. Write the hazard for a subject with covariates x as:

h(t | x) = h₀(t) · exp(β·x)

The first factor, h₀(t), is the baseline hazard — the hazard function for a subject whose covariates are all zero. It can be any function. No assumption on its shape. The second factor, exp(β·x), is a constant-in-time multiplier that scales the baseline up or down depending on covariates. This is the proportional hazards structure: two subjects with different covariates have hazard functions that are scaled copies of each other. The ratio of their hazards at any time t is the same constant, forever.

Why is this a miracle? Because you can estimate β without ever estimating h₀(t). The baseline hazard is a nuisance — it cancels out in the mathematical trick we're about to see. You get the covariate effects, crisp and interpretable, without wrestling with what aging looks like.

The proportional hazards assumption

The name "proportional hazards" says what it does. Below are two groups — a control group (solid green) and a treatment group (red). The treatment has a log-hazard shift β. When you slide β, the treatment group's hazard and survival curves scale relative to the control, but the shape of the baseline hazard is preserved in both. The hazard ratio is HR = exp(β): HR = 2 means the treated group has twice the instantaneous risk at every time; HR = 0.5 means half.

Interactive figure: two panels, hazard h(t | group) and survival S(t | group), for the control and treatment groups. Control: log hazard ratio β (shown at 0.69).

The central Cox commitment. On linear axes the two hazards look quite different in magnitude; on log axes they're parallel — their vertical distance is a constant equal to β. This is the signature of proportional hazards. When the real-world hazard ratio drifts with time (a common violation), Cox regression gives biased estimates and you need time-varying coefficients or a different model.
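The constant gap on a log axis, and its effect on survival, can be verified in a few lines. This is a sketch that assumes a Weibull baseline (shape 1.5, scale 1) for the control group; any baseline would do, since the proportional hazards structure does not depend on it.

```python
import math

beta = 0.69
hazard_ratio = math.exp(beta)  # close to 2

def baseline_hazard(t):
    """Hypothetical control-group hazard: Weibull with shape 1.5, scale 1."""
    return 1.5 * t ** 0.5

def baseline_survival(t):
    """Matching control-group survival: exp(-t ** 1.5)."""
    return math.exp(-(t ** 1.5))

for t in (0.25, 0.5, 1.0, 2.0):
    h_control = baseline_hazard(t)
    h_treated = h_control * hazard_ratio
    # On a log axis the gap between the two hazards is beta at every t.
    log_gap = math.log(h_treated) - math.log(h_control)
    # Scaling the hazard by HR raises the survival curve to the power HR.
    s_control = baseline_survival(t)
    s_treated = s_control ** hazard_ratio
    print(f"t={t}: log-gap={log_gap:.2f}  "
          f"S_control={s_control:.3f}  S_treated={s_treated:.3f}")
```

The power relationship S(t | treated) = S(t | control)^HR follows directly from S(t) = exp(−∫ h): multiplying the hazard by a constant multiplies the exponent by the same constant.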

The partial likelihood

Here is the beautiful move. Forget about absolute event times and think only about orderings. Suppose subject i has their event at time t_i. At that exact moment, consider the risk set R(t_i): every subject still under observation just before t_i, meaning no event yet and not yet censored (subject i included). Cox asks: given that exactly one event happened at t_i, what's the probability that the subject who had it was i in particular, rather than any of the others in the risk set?

Under proportional hazards, this probability is:

P(i has the event | one event at t_i) = exp(β·x_i) / Σ_{j ∈ R(t_i)} exp(β·x_j)

Notice what just happened. The baseline hazard h₀(t_i) appears in the numerator and denominator identically — it cancels. All that's left is the covariate effect β. Multiply these probabilities across every observed event and you get the partial likelihood:

L(β) = Π_{i : event} exp(β·x_i) / Σ_{j ∈ R(t_i)} exp(β·x_j)

Maximize this and you've got β. Censored subjects show up in the risk sets (they contribute as competitors) but never as numerators (they never had the event). No assumption about the shape of the baseline hazard, no estimate of h₀(t), no fuss. The idea comes from Cox's 1972 paper, and it changed biostatistics forever.
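The partial likelihood is short enough to write out directly. The sketch below assumes a single covariate and no tied event times; the six-subject dataset is made up for illustration, and the grid search is a crude stand-in for the Newton-type optimizers that real packages use.

```python
import math

def log_partial_likelihood(beta, times, events, x):
    """Cox log partial likelihood for one covariate (assumes no tied event times)."""
    total = 0.0
    for i, (t_i, had_event) in enumerate(zip(times, events)):
        if not had_event:
            continue  # censored subjects never appear as a numerator
        # Risk set: everyone whose observed time is at least t_i.
        denom = sum(math.exp(beta * x[j])
                    for j, t_j in enumerate(times) if t_j >= t_i)
        total += beta * x[i] - math.log(denom)
    return total

# Hypothetical toy data: six subjects, one binary covariate.
times  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [True, True, False, True, True, False]
x      = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]

# Crude grid search for the maximizer over beta in [-3, 3].
grid = [b / 100 for b in range(-300, 301)]
beta_hat = max(grid, key=lambda b: log_partial_likelihood(b, times, events, x))

# Only the ordering matters: a monotone transform of time changes nothing.
squared_times = [t ** 2 for t in times]
order_only = math.isclose(
    log_partial_likelihood(0.5, times, events, x),
    log_partial_likelihood(0.5, squared_times, events, x),
)
print("beta_hat:", beta_hat, " invariant to time transform:", order_only)
```

Squaring every time leaves the value unchanged because the risk sets are identical, which is the sense in which the method looks only at orderings.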

Where Cox ends, DeepSurv begins

Look again at the Cox model: h(t | x) = h₀(t) · exp(β·x). The log-hazard is a linear function of the covariates: each one-unit increase in a feature adds the same fixed amount to the log-hazard, wherever you start. If the true relationship is nonlinear (a U-shape in dose response, a threshold in pressure, an interaction between two sensor signals), plain Cox will fit a straight line through the curve and miss everything interesting.

DeepSurv (Katzman et al., 2018) takes a single, direct step. Replace the linear score β·x with a neural network f_θ(x):

h(t | x) = h₀(t) · exp(f_θ(x))

Train the network's weights θ by maximizing the same partial likelihood — just with the linear risk score replaced by the network's output. The censoring logic, the baseline-free elegance, the whole partial-likelihood machinery all carry through unchanged. You inherit Cox's philosophy and gain the flexibility of deep learning to model arbitrary, nonlinear, interaction-heavy risk functions.
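To show the mechanics without a deep learning framework, here is a deliberately tiny sketch in plain Python. It is not the DeepSurv implementation: the network has one hidden layer of four tanh units, gradients come from finite differences rather than backpropagation, and the step rule is a simple accept-or-shrink loop. The data, the names `cox_loss` and `network`, and all hyperparameters are illustrative assumptions.

```python
import math
import random

def cox_loss(scores, times, events):
    """Negative log partial likelihood, averaged over events (assumes no tied events)."""
    order = sorted(range(len(times)), key=lambda i: -times[i])  # longest time first
    running, loss, n_events = 0.0, 0.0, 0
    for i in order:
        running += math.exp(scores[i])  # running sum over the risk set of subject i
        if events[i]:
            loss -= scores[i] - math.log(running)
            n_events += 1
    return loss / n_events

def network(params, x, hidden=4):
    """One-hidden-layer tanh network: scalar x in, scalar risk score out.
    No output bias: a constant shift cancels in the partial likelihood."""
    w1, b1, w2 = params[:hidden], params[hidden:2 * hidden], params[2 * hidden:]
    return sum(w2[k] * math.tanh(w1[k] * x + b1[k]) for k in range(hidden))

def loss_at(params, xs, times, events):
    return cox_loss([network(params, x) for x in xs], times, events)

# Hypothetical data: U-shaped log-hazard 2.5 * x**2, study closes at t = 1.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(120)]
lifetimes = [rng.expovariate(math.exp(2.5 * x * x)) for x in xs]
times = [min(t, 1.0) for t in lifetimes]
events = [t <= 1.0 for t in lifetimes]

params = [rng.gauss(0, 0.5) for _ in range(12)]
initial_loss = loss_at(params, xs, times, events)
step = 0.5
for _ in range(150):
    base = loss_at(params, xs, times, events)
    grad = []
    for k in range(len(params)):  # finite differences stand in for backprop
        bumped = params[:]
        bumped[k] += 1e-5
        grad.append((loss_at(bumped, xs, times, events) - base) / 1e-5)
    trial = [p - step * g for p, g in zip(params, grad)]
    if loss_at(trial, xs, times, events) < base:
        params = trial   # accept the step only if the loss went down
    else:
        step *= 0.5      # otherwise shrink the step and try again
final_loss = loss_at(params, xs, times, events)
print(f"loss before {initial_loss:.3f}, after {final_loss:.3f}")
```

The only thing that changed relative to linear Cox is where the scores come from; `cox_loss` never needs to know whether they were produced by β·x or by a network. A practical version would use an autodiff library, minibatches, and regularization.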

The trade-off is exactly what you'd expect. You lose coefficient-level interpretability — there's no single β telling you "a one-unit increase in feature-5 multiplies risk by 1.3." You gain the ability to capture complex risk surfaces that linear Cox cannot represent. For problems where the linear assumption is roughly right, classical Cox is often equal or better and far easier to defend. For problems where it clearly isn't, DeepSurv (and its cousins — Cox-Time, DeepHit, survival transformers) is where the field has moved.

Linear vs flexible risk functions

To see the difference in action, we generate synthetic survival data where the true log-hazard depends on a single covariate x through a U-shape: risk is high at both extremes and low in the middle (a typical drug dose-response pattern). We then fit two models — a linear Cox and a flexible Cox (polynomial basis, an honest in-browser stand-in for what DeepSurv does with neural networks) — and plot their estimated log-hazard functions against the ground truth.

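Interactive figure: estimated log-hazard as a function of x. Controls: curvature γ (shown at 2.5) and sample size n (shown at 150). Legend: true log-hazard; linear Cox; flexible (DeepSurv-style).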

Ground truth log-hazard (black) is U-shaped. Linear Cox (red dashed) fits a straight line; the best straight-line fit to a symmetric U has zero slope, so it flattens to near-zero and sees essentially no effect. Flexible Cox (gold), using polynomial basis functions as a stand-in for a neural network, bends to match the truth. With γ near zero the two models agree; as curvature grows, the linear fit falls farther behind.

The concordance index (c-index) in the readout is survival analysis's analogue of AUC: over all valid pairs of subjects where one had the event first, how often does the model correctly give that subject the higher risk score? It ranges from 0 to 1: 0.5 is random ordering, 1.0 is perfect, and values below 0.5 mean the ordering is systematically reversed. With a genuinely nonlinear ground truth, linear Cox's c-index hovers near 0.5 regardless of sample size; it has no power to see the U-shape. The flexible model's c-index climbs with sample size as it locks onto the curvature. This is the DeepSurv argument in one number.
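The c-index itself is a pair-counting exercise. The sketch below implements Harrell's version in plain Python and applies it to hypothetical data with a symmetric U-shaped log-hazard. For brevity it scores subjects with hand-picked functions of x instead of fitted models, and it omits censoring from the simulated data, although the function handles it.

```python
import math
import random

def concordance_index(times, events, risk):
    """Harrell's c-index: share of comparable pairs that the risk score orders correctly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable

# Hypothetical data: log-hazard is 2.5 * x**2, symmetric around zero.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(400)]
times = [rng.expovariate(math.exp(2.5 * x * x)) for x in xs]
events = [True] * len(xs)  # no censoring, to keep the sketch short

c_linear = concordance_index(times, events, xs)                      # monotone in x
c_ushape = concordance_index(times, events, [x * x for x in xs])     # true shape
c_reversed = concordance_index(times, events, [-x * x for x in xs])  # flipped
print(f"linear {c_linear:.2f}, U-shaped {c_ushape:.2f}, reversed {c_reversed:.2f}")
```

Any score that is monotone in x ranks half the U the wrong way round, so it ends up near 0.5. Flipping the sign of a good score pushes the index below 0.5 by the same margin the good score sits above it.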

Where these methods earn their keep

Clinical trials & prognosis

The historical home of Cox regression. Estimating the effect of treatments, biomarkers, or demographic factors on survival in oncology, cardiology, and epidemiology. Hazard ratios are the lingua franca of medical literature.

Reliability & RUL estimation

Time-to-failure for bearings, batteries, power electronics, and mechanical components. Cox lets you bring in operating-condition covariates (load, temperature, duty cycle) without assuming a Weibull or log-normal lifetime distribution.

Customer churn & retention

How long until a subscriber cancels, a user goes dormant, a free trial converts (or doesn't). Censoring is natural here: many customers are still active at the moment you query the database, and survival methods handle them correctly.

Credit risk & loan default

Time-to-default modeling with covariates on borrower, loan, and macroeconomic features. Cox and its extensions are standard in banking risk models where the event of interest is rare and right-censored.

Fleet health & maintenance

Predicting next failure for large populations of industrial assets — turbines, pumps, vehicles — where each unit's history is a mix of failures, repairs, and still-running censored observations. Perfect DeepSurv territory: many covariates, likely nonlinear interactions.

Device & component lifecycling

For power-electronics components operating under varied duty cycles, DeepSurv can learn interactions between temperature, current stress, and switching patterns that a linear Cox would silently miss.

HR analytics

Time-to-promotion, time-to-attrition, time-to-hire. Wide use in workforce planning; Cox keeps you honest about employees who are still with the company at report time.

Recurrent events & warranty claims

Extensions of Cox (Andersen-Gill, frailty models) handle subjects with multiple events over time — warranty claims on the same unit, repeated hospitalizations. The partial-likelihood machinery generalizes cleanly.

Practical notes from the trenches

Check the proportional hazards assumption before trusting a Cox fit. Use Schoenfeld residuals (cox.zph() in R's survival package, CoxPHFitter.check_assumptions() in Python's lifelines). If a covariate's effect drifts with time, which is a common violation, your reported hazard ratio is an average of a moving target and shouldn't be interpreted as a simple effect size.
Handle ties in event times carefully. When two subjects have the same observed event time, the partial likelihood has three common treatments: Breslow (default in many packages, fastest, biased under heavy ties), Efron (usually the best default), and exact (correct but slow). Prefer Efron unless you have strong reasons not to.
Scale your continuous covariates. Rescaling a covariate only rescales its coefficient and leaves the fitted hazards unchanged, so an unpenalized Cox fit is scale-invariant in theory. In practice, gradient-based optimizers converge much faster on standardized inputs, and coefficient magnitudes become directly comparable.
Don't use Cox for absolute prediction without the baseline hazard. The partial likelihood gives you β, not h₀(t). To produce survival probabilities for a new subject, estimate the baseline hazard separately (Breslow estimator is standard) after fitting.
Concordance index is your primary performance metric. Not accuracy, not MSE. The c-index measures how well the model orders subjects by risk — that's what survival models actually do. Aim for >0.65 to be useful, >0.75 for strong, >0.85 for exceptional in most medical domains.
For DeepSurv, start simple. Two or three hidden layers, modest width, batch normalization, dropout around 0.2–0.4. Adam with a small learning rate. Early stopping on validation c-index. Don't begin with a Transformer — you'll tune forever for marginal gains.
Watch out for tiny risk sets at long follow-up. Toward the tail of your observation window, the risk set shrinks and each surviving subject carries enormous weight in the partial likelihood. Consider trimming the last 5–10% of event times or adding a robustness weight.
Beware confusing hazard ratio with risk ratio. HR is a ratio of instantaneous event rates; risk ratio is a ratio of cumulative probabilities over some interval. The two are close only when events are rare and follow-up is short. Report HR with a time horizon if you want the clinical reader to understand it.
Stratify when proportionality fails for a categorical covariate. If a factor (say, hospital) violates PH, you can stratify on it — estimate a separate baseline hazard for each stratum while sharing β across them. It's a clean compromise that preserves interpretation.
DeepSurv shines when n is large and nonlinearity is plausible. With small sample sizes (n < a few hundred events), the linear Cox's simplicity usually wins — fewer parameters, more stable. Don't reach for the neural net until you have enough events to feed it (rough rule: >10 events per hidden-unit's worth of parameters).
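The note on absolute prediction can be made concrete with the Breslow estimator of the cumulative baseline hazard. This is a minimal sketch for a single covariate; the toy data and the coefficient β = 1.07 are assumptions standing in for a real dataset and a previously fitted model, and the function names are illustrative.

```python
import math

def breslow_cumulative_baseline(times, events, x, beta):
    """Breslow estimate of the cumulative baseline hazard H0 at each event time."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    steps, running = [], 0.0
    for t_k in event_times:
        deaths = sum(1 for t, e in zip(times, events) if e and t == t_k)
        # Sum of exp(beta * x) over everyone still at risk at t_k.
        at_risk = sum(math.exp(beta * xj) for t, xj in zip(times, x) if t >= t_k)
        running += deaths / at_risk
        steps.append((t_k, running))
    return steps

def predict_survival(t, x_new, beta, steps):
    """S(t | x) = exp(-H0(t) * exp(beta * x)), with H0 a step function."""
    h0 = 0.0
    for t_k, value in steps:
        if t_k <= t:
            h0 = value
    return math.exp(-h0 * math.exp(beta * x_new))

# Hypothetical toy data; beta is assumed to come from an earlier Cox fit.
times  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [True, True, False, True, True, False]
x      = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
beta   = 1.07

steps = breslow_cumulative_baseline(times, events, x, beta)
for x_new in (0.0, 1.0):
    print(f"x={x_new}: S(4.5 | x) = {predict_survival(4.5, x_new, beta, steps):.3f}")
```

With β set to zero the same function reduces to the Nelson-Aalen estimator, which is a quick sanity check when adapting it.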

References & further reading

Cox, D. R. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society, Series B, 34(2), 187–220. The founding paper. Remarkably readable.
Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). DeepSurv: Personalized Treatment Recommender System Using a Cox Proportional Hazards Deep Neural Network. BMC Medical Research Methodology, 18(1).
Kvamme, H., Borgan, Ø., & Scheel, I. (2019). Time-to-Event Prediction with Neural Networks and Cox Regression. Journal of Machine Learning Research, 20. Introduces Cox-Time and related extensions.
Lee, C., Zame, W., Yoon, J., & van der Schaar, M. (2018). DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks. AAAI. Neural alternative that drops the proportional-hazards assumption entirely.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer. The canonical practical reference for applied Cox regression.
Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer. Comprehensive textbook, good for self-study.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., & Lauer, M. S. (2008). Random Survival Forests. Annals of Applied Statistics, 2(3). A tree-based non-neural alternative worth knowing about.
Davidson-Pilon, C. lifelines: Survival Analysis in Python. lifelines.readthedocs.io. The reference Python implementation; excellent documentation.
Pölsterl, S. scikit-survival: a Python library for survival analysis. scikit-survival.readthedocs.io. Integrates cleanly with scikit-learn pipelines.
Harrell, F. E., Jr. (2015). Regression Modeling Strategies (2nd ed.). Springer. Chapters 20–21 are the classic practical treatment of Cox modeling with splines and other nonlinear extensions.