Customer Lifetime Value — Predict Who Will Keep Buying (BG/NBD Model)

Not all customers are equal. Some will buy from you for years. Others already placed their last order and you do not know it yet. The BG/NBD model looks at each customer's purchase history — when they bought, how often, and how much — and predicts who will keep buying, who has likely churned, and how much revenue each customer will generate over the next 12 months. Upload a CSV of your order data and get a ranked customer list in under 60 seconds.

What Is the BG/NBD Model?

BG/NBD stands for Beta-Geometric / Negative Binomial Distribution. That sounds intimidating, but the idea is simple: the model predicts when a customer will buy next and whether they are still "alive" — meaning still in their buying cycle — based on their purchase history. It answers two questions that spreadsheets and gut instinct cannot: how likely is this customer to make another purchase, and how many transactions should we expect from them over a given time horizon?

Traditional approaches to customer value use backward-looking averages: take the total revenue from a customer and divide by the time they have been around. The problem is obvious — that treats a customer who bought five times last month identically to one who bought five times over three years and has not returned in six months. The BG/NBD model understands the difference. It weighs recency (when did they last buy?) against frequency (how often do they buy?) and recognizes that a long gap since the last purchase means the customer is more likely to have churned.

The model works in two stages. First, BG/NBD estimates the probability each customer is still active — their P(alive) score — and predicts how many future transactions to expect. Second, a companion model called Gamma-Gamma estimates the average monetary value of each customer's transactions. Multiply expected future transactions by expected average order value, and you get a predicted customer lifetime value (CLV) for every person in your database.
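The final multiplication in that second stage is simple enough to sketch in a few lines. The function below follows the published Gamma-Gamma conditional expectation; the parameter values (p, q, v), the expected-transaction count, and the example customer are all hypothetical placeholders, not values fitted to real data:

```python
# Sketch of stage two: Gamma-Gamma expected average order value, then
# CLV as (expected future transactions) x (expected order value).
# Parameter values p, q, v are hypothetical, not fitted estimates.

def expected_avg_order_value(x, mbar, p=6.25, q=3.74, v=15.44):
    """Conditional expectation of a customer's true mean spend, given
    x observed repeat purchases with observed average value mbar.
    Shrinks the observed average toward the population mean."""
    return p * (v + x * mbar) / (p * x + q - 1)

expected_transactions = 3.2  # hypothetical output of the BG/NBD stage
avg_value = expected_avg_order_value(x=5, mbar=42.0)
clv = expected_transactions * avg_value  # predicted revenue over the horizon
```

The shrinkage is deliberate: a customer observed only a few times gets pulled toward the population average, so one lucky large order does not dominate their CLV estimate.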

When to Use BG/NBD

The BG/NBD model is built for non-contractual, repeat-purchase businesses — businesses where customers can buy whenever they want and can leave without telling you. That covers most e-commerce, retail, restaurants, marketplaces, and direct-to-consumer brands. If customers come back and buy multiple times over months or years, this model was designed for your data.

E-commerce: An online store exports its Shopify or WooCommerce order history and wants to know which customers are most valuable over the next year. The model ranks every customer by predicted future revenue, separating the loyal repeat buyers from the one-and-done shoppers.


Subscription businesses: A subscription box company wants to identify members who are likely to cancel before they actually do. P(alive) scores flag customers whose purchase frequency has dropped — they are statistically more likely to churn even if their subscription is technically still active.

Marketing budget allocation: A marketing team needs to decide where to spend their retention budget. Instead of blanket discounts to everyone, the CLV model identifies high-value customers worth investing in versus low-value customers where the return on a retention campaign will not justify the cost.

Customer segmentation: Beyond simple RFM buckets, the model produces continuous scores — P(alive), expected transactions, and predicted CLV — that create precise segments. You can find "high-value at risk" customers (high historical spend but declining P(alive)) and target them with a win-back campaign before they are gone.

What Data Do You Need?

You need a CSV of your transaction history with three required columns: a customer identifier (email address, customer ID, or account number), a transaction date, and an order value. Each row should represent one order. If your export has multiple line items per order (common with Shopify exports), the tool handles deduplication, but you will get the cleanest results if each row is one unique order.

Two optional columns improve the analysis: an order ID (helps deduplicate multi-line exports) and an order status column (lets the model filter out cancelled or refunded orders). Neither is required — the model works with just customer ID, date, and value.
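Internally, the model does not consume raw transaction rows. It first collapses them into a compact per-customer summary: frequency (repeat purchases), recency (customer age at last purchase), and T (total observation age). A stdlib-only sketch of that aggregation, with made-up rows and an assumed observation end date:

```python
from datetime import date
from collections import defaultdict

# Sketch: collapse raw transactions into the per-customer summary the
# model consumes. The row layout (customer, date, value) and the
# observation end date are assumptions for illustration.

transactions = [
    ("cust_a", date(2024, 1, 5), 40.0),
    ("cust_a", date(2024, 2, 9), 55.0),
    ("cust_a", date(2024, 4, 1), 30.0),
    ("cust_b", date(2024, 3, 20), 120.0),
]
observation_end = date(2024, 6, 30)

by_customer = defaultdict(list)
for cust, when, value in transactions:
    by_customer[cust].append((when, value))

summary = {}
for cust, orders in by_customer.items():
    orders.sort()
    first, last = orders[0][0], orders[-1][0]
    summary[cust] = {
        "frequency": len(orders) - 1,            # repeat purchases only
        "recency": (last - first).days,          # age at last purchase
        "T": (observation_end - first).days,     # total observation age
        # mean value of repeat purchases (first purchase excluded, by convention)
        "monetary": sum(v for _, v in orders[1:]) / max(len(orders) - 1, 1),
    }
```

Note the convention: frequency counts repeat purchases, so a one-time buyer has frequency 0, and the monetary average excludes the first purchase.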

For reliable results, aim for at least 50 unique customers with some repeat purchasers and at least 90 days of transaction history. More data is always better — 180 days or more gives the model enough observation window to reliably distinguish active customers from churned ones. Very short windows (under 30 days) make everyone look "alive" because nobody has had time to churn.

How to Read the Report

Overview and Data Pipeline

The report begins with an overview card summarizing your dataset: how many customers, how many transactions, the date range, and key aggregate metrics. Beside it, the preprocessing card shows exactly what happened to your data before analysis — how rows were filtered, how dates were parsed, and whether any deduplication was applied. This matters for reproducibility: you can see every step from raw upload to analysis-ready data.

Expected Transactions Matrix

This heatmap shows expected future transactions plotted by frequency (how many times a customer has bought) and recency (how recently they last bought). The pattern is intuitive: customers in the upper-right corner — high frequency, recent purchase — are expected to buy again many times. Customers in the lower-left — low frequency, long time since last purchase — are expected to buy rarely or never. The gradient makes it immediately obvious where your best customers sit and where the dead zones are.
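For readers who want the math behind the heatmap, the surface comes from the model's conditional-expectation formula (Fader, Hardie & Lee). Below is a minimal pure-Python sketch with illustrative parameter values (r, alpha, a, b) rather than values fitted to your data; the hand-rolled series helper stands in for a proper hypergeometric implementation:

```python
# Sketch of the BG/NBD conditional expected transactions formula.
# Parameter values r, alpha, a, b are illustrative defaults only.

def hyp2f1(a, b, c, z, terms=400):
    """Gaussian hypergeometric series 2F1(a, b; c; z); converges for 0 <= z < 1."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total

def expected_transactions(t, x, t_x, T, r=0.243, alpha=4.414, a=0.793, b=2.426):
    """Expected purchases over the next t periods for a customer with
    x repeat purchases, recency t_x, and observation age T."""
    z = t / (alpha + T + t)
    numerator = (a + b + x - 1) / (a - 1) * (
        1 - ((alpha + T) / (alpha + T + t)) ** (r + x)
        * hyp2f1(r + x, b + x, a + b + x - 1, z)
    )
    denominator = 1.0
    if x > 0:
        denominator += a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return numerator / denominator
```

A frequent, recent buyer scores far higher than a stale, infrequent one, which is exactly the gradient the heatmap makes visible.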

P(Alive) Probability Matrix

This is arguably the most actionable card in the report. It shows the probability each customer segment is still active, again plotted by frequency and recency. A customer who bought 10 times and last purchased yesterday has a P(alive) near 1.0. A customer who bought twice, both times over a year ago, might have a P(alive) of 0.15. The matrix lets you see, at a glance, which frequency-recency combinations correspond to living customers and which correspond to likely churners. Use this to define your win-back threshold: customers with P(alive) between 0.3 and 0.7 are uncertain — they are the ones most worth a re-engagement email or a discount offer.
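Unlike expected transactions, P(alive) has a closed form with no special functions. A minimal sketch, again with illustrative rather than fitted parameters:

```python
# Sketch of the BG/NBD P(alive) formula. Parameter values r, alpha, a, b
# are illustrative defaults, not values fitted to your data.

def p_alive(x, t_x, T, r=0.243, alpha=4.414, a=0.793, b=2.426):
    """Probability a customer with x repeat purchases, recency t_x,
    and observation age T is still active."""
    if x == 0:
        return 1.0  # no repeat purchases yet, so no evidence of churn
    return 1.0 / (1.0 + a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x))

p_alive(10, 200, 201)  # bought often and recently: close to 1
p_alive(2, 30, 400)    # two early purchases, then long silence: close to 0
```

The key driver is the gap between t_x and T: the longer a customer stays silent relative to their past buying rhythm, the faster P(alive) decays.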

CLV Distribution

This chart shows how predicted lifetime values are distributed across your entire customer base. In most businesses, the distribution is heavily right-skewed — a small number of customers have extremely high predicted CLV while the majority cluster at the low end. This is the Pareto principle in action: your top 20% of customers often account for 60-80% of predicted future revenue. The distribution chart quantifies exactly how concentrated your value is and helps you decide how many customers are worth individual attention versus segment-level campaigns.
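Quantifying that concentration takes only a few lines. The CLV values below are made up to illustrate a right-skewed distribution:

```python
# Sketch: what share of predicted value sits in the top slice of customers?
# The example CLV values are hypothetical.

def top_share(clvs, top_frac=0.2):
    """Fraction of total predicted CLV held by the top top_frac of customers."""
    ranked = sorted(clvs, reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return sum(ranked[:k]) / sum(ranked)

clvs = [1000, 500, 100, 50, 40, 30, 20, 10, 5, 5]
share = top_share(clvs)  # top 2 of 10 customers hold ~85% of predicted value
```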

Top Customers by CLV

A ranked bar chart (or table) of your highest-value customers by predicted CLV. These are the customers the model believes will generate the most revenue over the prediction horizon. Each entry shows the customer identifier, their predicted CLV, their P(alive) score, and their historical purchase stats. This is the list your retention team should be working from — losing any of these customers has an outsized impact on future revenue.

Customer Segment Summary

The segment table groups customers into categories based on their model scores — typically segments like "Champions" (high frequency, high recency, high CLV), "At Risk" (historically valuable but declining P(alive)), "Hibernating" (low recent activity), and "New" (recent first purchase, not enough data to predict confidently). Each segment shows its size, average CLV, average P(alive), and total predicted revenue. This is where strategy meets data: you can design a different marketing action for each segment rather than treating your entire customer base as one homogeneous group.

Model Validation

The validation card compares the model's predictions against a holdout period. The data is split into a calibration period (used to fit the model) and a holdout period (used to test it). The table shows, for each frequency group, how many transactions the model predicted versus how many actually occurred. When predicted and actual numbers track closely, you can trust the model's forward-looking estimates. Large discrepancies indicate the model may not fit your data well — perhaps due to strong seasonality, a very short observation window, or too few repeat purchasers.
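The calibration/holdout split itself is a simple partition by date. A sketch, where the row layout (customer, date, value) and the split date are assumptions:

```python
from datetime import date

# Sketch of the calibration/holdout split used for validation.

def split_calibration_holdout(transactions, split_date):
    """Rows on or before split_date fit the model (calibration);
    later rows test its predictions (holdout)."""
    calibration = [t for t in transactions if t[1] <= split_date]
    holdout = [t for t in transactions if t[1] > split_date]
    return calibration, holdout

rows = [
    ("cust_a", date(2024, 1, 5), 40.0),
    ("cust_a", date(2024, 5, 2), 55.0),
    ("cust_b", date(2024, 7, 19), 120.0),
]
calibration, holdout = split_calibration_holdout(rows, date(2024, 6, 30))
```

The model never sees the holdout rows during fitting, so agreement between predicted and actual holdout transactions is genuine out-of-sample evidence.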

Executive Summary (TL;DR)

The executive summary distills the entire analysis into key findings and recommendations. It highlights the most important numbers: total predicted revenue, top customer concentration, churn risk summary, and specific actions to take. This is the card you send to stakeholders who will not read the full report — it answers "so what?" in a few sentences, backed by the statistical evidence from the cards above.

How BG/NBD Differs from Simpler Approaches

Many teams start with RFM segmentation — grouping customers by Recency, Frequency, and Monetary value into buckets like "high/medium/low." RFM is a useful starting point, but it is purely descriptive. It tells you who your best customers were; it does not predict who they will be. Two customers can have identical RFM scores today but very different futures if one is accelerating their purchase frequency while the other is slowing down. BG/NBD captures these dynamics because it models the underlying purchase and churn processes, not just the current snapshot.

Another common approach is the simple LTV formula: average order value multiplied by purchase frequency multiplied by average customer lifetime. This gives a single number for your entire customer base — useful for a pitch deck, but useless for deciding which individual customers to invest in. The BG/NBD model produces a CLV estimate for every single customer, which is what you need for targeting decisions.

Cohort analysis is valuable for understanding retention trends over time — what percentage of January's customers came back in February, March, April. It is excellent for tracking whether your overall retention is improving or declining. But it does not give you per-customer predictions. The BG/NBD model complements cohort analysis: use cohorts to see the big picture, use BG/NBD to act on individual customers.

Common Pitfalls

Duplicate order rows. If your export has one row per line item rather than one row per order, the model will overcount transactions. A single order with three products looks like three separate purchases, inflating frequency and CLV estimates. Use the order ID column mapping to let the tool deduplicate, or pre-process your export to one row per order before uploading.
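If you want to pre-process the export yourself, the dedup step is a small aggregation. A sketch, where the tuple layout (order_id, customer, date, value) is an assumption about your export format:

```python
# Sketch: collapse a one-row-per-line-item export into one row per order
# by summing line values that share an order_id.

def collapse_line_items(rows):
    """Return {order_id: (customer, date, total_value)}, one entry per order."""
    orders = {}
    for order_id, customer, order_date, value in rows:
        if order_id in orders:
            customer, order_date, prev = orders[order_id]
            orders[order_id] = (customer, order_date, prev + value)
        else:
            orders[order_id] = (customer, order_date, value)
    return orders

line_items = [
    ("1001", "cust_a", "2024-03-01", 19.99),
    ("1001", "cust_a", "2024-03-01", 5.50),
    ("1001", "cust_a", "2024-03-01", 12.00),
    ("1002", "cust_b", "2024-03-02", 80.00),
]
orders = collapse_line_items(line_items)  # 4 line items -> 2 orders
```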

Too little history. With less than 90 days of data, the model cannot reliably distinguish customers who are still in their purchase cycle from those who have churned. Everyone looks "alive" because nobody has had time to go dormant. The model will still run, but predictions will have wide uncertainty. Six months of data is the sweet spot for most businesses.

Contractual businesses. If your customers have fixed-term contracts or subscriptions with known end dates (annual SaaS licenses, gym memberships), the BG/NBD model's assumption of "silent churn" does not apply. In contractual settings, you know exactly when a customer leaves. Use survival analysis or a churn classification model instead.

Seasonal businesses. The model assumes purchase rates are roughly stationary — it does not account for holiday spikes or seasonal slowdowns. A gift shop with 80% of sales in Q4 will see predictions skew depending on which months the data covers. If your business is highly seasonal, note that predictions based on off-season data will underestimate future value, and vice versa.

The R Code Behind the Analysis

Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. This is not AI-generated code that changes every run. The same data produces the same analysis every time.

The analysis uses the BTYDplus and BTYD packages in R — the standard implementation of the BG/NBD model used in academic research and industry applications. The bgnbd.EstimateParameters() function fits the purchase and churn model, bgnbd.PAlive() computes the probability each customer is still active, and bgnbd.ConditionalExpectedTransactions() predicts future purchase counts. The Gamma-Gamma monetary model uses spend.EstimateParameters() and spend.expected.value() to predict average transaction values. These functions implement the formulas from Fader, Hardie, and Lee's original BG/NBD paper and its extensions. No custom implementations, no black boxes. Every step is visible in the code tab of your report.