
What to Think About Before Measuring Landing Page Conversions

Tags: analytics, kpi, ab-testing, ga4, conversion

Start with the KGI

A KGI is the business reason your landing page exists, expressed as a number. "Increase monthly subscriptions." "Grow MRR." "Improve paid conversion rate." It doesn't have to be a specific target — a direction is enough.

The most important thing to accept here is that your KGI often can't be fully measured within the LP itself.

In one project, the KGI was "increase monthly paid subscriptions." But subscriptions don't complete on the LP. Users navigate from the LP to an external signup site and subscribe there. That site's data lives in a separate GA4 property managed by a different team.

In other words, you can't measure your own KGI yourself.

At this point, people make one of two mistakes.

The first is ignoring the unmeasurable KGI and building KPIs from only what you can measure. Using "trial signups" or "chat starts" as KPIs looks rational, but if the chat starts 100 times and subscriptions are zero, the business hasn't moved. KPIs disconnected from the KGI move you further in the wrong direction the more you optimize them.

The second is placing the unmeasurable KGI directly as the North Star Metric. In the first version, we set "monthly completed subscriptions" as the NSM. The result: every weekly standup, we'd say "subscription data hasn't arrived from external yet, so NSM progress is unknown." The entire purpose of an NSM — aligning the whole team around a single number — collapses.

The right answer is to set an "interim NSM."

KGI:          Increase monthly subscriptions (ultimate goal)
Final NSM:    Monthly completed subscriptions (external data, once pipeline is ready)
Interim NSM:  Outbound click-through rate (measurable on our site)
Promotion:    When correlation between interim NSM and subscriptions r > 0.7

An interim NSM is a proxy for the KGI that you can measure with your own hands. Once its correlation with the KGI is confirmed, you promote it to the "final NSM."

This isn't a compromise. Meta made MAU their North Star because MAU had a strong correlation with revenue. They didn't chase revenue directly — they chased MAU, trusting the causal link between MAU and revenue. We're simply applying this same structure to a small LP.

Three criteria for choosing an interim NSM. You can measure it yourself. You can explain its causal link to the KGI. Your team's work can move it.

Write Hypotheses First

Once the KGI and interim NSM are decided, the next step is to design the funnel... Wrong. Write hypotheses.

"Why isn't the current LP hitting its target?" "What change would move the numbers?"

Write these before looking at data. If you look at data first, the data will create your hypotheses for you. Humans are far too good at reverse-engineering hypotheses from what they've seen. That's not a hypothesis — it's a post-hoc explanation.

Write as a causal chain.

[Intervention] → [Behavior change] → [KPI change] → [KGI change]

The causal chain for this project looked like this:

Deploy AI chat widget → Users find relevant information more easily → Service page reach rate increases → Outbound click-through to signup site increases → Subscriptions increase

Each arrow in this causal chain is a "hypothesis to validate," and the number at each stage becomes a "KPI." Causal chain first, KPIs second. Never reverse this order. If you start with KPIs, you end up measuring what's easy to measure. What's easy to measure and what moves the business rarely align.
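One way to enforce "causal chain first, KPIs second" is to keep the chain as data and derive the KPI list from it. A sketch, with illustrative stage and KPI names:

```python
# Causal chain as data: each link after the intervention is a hypothesis
# to validate, and each stage's number becomes a KPI. Names are illustrative.
causal_chain = [
    {"stage": "Deploy AI chat widget",                 "kpi": None},  # intervention
    {"stage": "Users find relevant info more easily",  "kpi": "service page reach rate"},
    {"stage": "Outbound click-through increases",      "kpi": "outbound click-through rate"},
    {"stage": "Subscriptions increase",                "kpi": "monthly completed subscriptions"},
]

# KPIs fall out of the chain, not the other way around
kpis = [link["kpi"] for link in causal_chain if link["kpi"]]
print(kpis)
```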

When You Must Not Use Bounce Rate as a KPI

We opened GA4 and looked at the data.

Ad LP bounce rate 64.3%, homepage 76.6%. Looking at these numbers, the first version placed "bounce rate improvement" as the Primary KPI. The improvement margin looked large, measurement was easy, and the story was compelling.

This was the first — and most dangerous — mistake.

In this project, we add an AI chat widget to the B group (test group). A bubble appears, users click, a conversation starts. This entire interaction chain is recorded as GA4 key events.

Here's the trap. In GA4, a "bounce" is a "session that isn't an engaged session," and a session becomes engaged if even one key event fires. The moment you add the widget to the B group, bubble display and click events fire. Even if user behavior doesn't change at all, the measurement mechanism alone "improves" the bounce rate.

The moment you make bounce rate your Primary KPI, you've predetermined the test result. Any widget you put on the B group will "succeed." That's not an A/B test — it's a foregone conclusion.
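The mechanics are easy to demonstrate with a toy simulation. It simplifies GA4's engaged-session rule down to "any key event fired" and gives both groups identical user behavior; only the B group's widget auto-fires an impression event:

```python
import random

random.seed(0)

def is_engaged(session, widget_autofires_key_event):
    # Simplified GA4 rule: a session is "engaged" if any key event fires;
    # a bounce is simply a non-engaged session.
    return session["key_events"] > 0 or widget_autofires_key_event

def bounce_rate(sessions, widget):
    bounces = sum(1 for s in sessions if not is_engaged(s, widget))
    return bounces / len(sessions)

# Identical behavior in both groups: ~70% of sessions fire no key event
sessions = [{"key_events": 0 if random.random() < 0.7 else 1} for _ in range(10_000)]

print(f"A group bounce rate: {bounce_rate(sessions, widget=False):.1%}")  # ~70%
print(f"B group bounce rate: {bounce_rate(sessions, widget=True):.1%}")   # 0%, purely mechanical
```

Nothing about user behavior changed between the two print statements; only the measurement did.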

Over five revisions, bounce rate was demoted from Primary → Secondary → Diagnostic. In the final version, it's used as a halt criterion: "Don't use for judgment, but if B group bounce rate worsens by 5+ points over A group, halt the test immediately."

Bounce rate is a thermometer, not a treatment goal. If it goes above 38°C, you suspect something's wrong, but "get temperature to 36°C" is not the treatment goal.

Working Backwards from the Numbers: Behind a 76.6% Bounce Rate

Homepage bounce rate 76.6% and 3.9 seconds session duration. Looking at these numbers alone, you'd think "terrible page."

But the same site's article pages had bounce rates of 16–24% and session durations of 70–141 seconds. Users who reached articles were reading for 90+ seconds on average. "Content quality" was high. The problem was "content discoverability."

If we'd made bounce rate the KPI, we'd have spent effort on homepage redesign and content additions. But the real bottleneck was "guiding users to good content," and the chat was designed precisely as that guide.

A 76.6% bounce rate is a "symptom," not the "problem." The problem is in navigation design. Using bounce rate as a KPI means targeting the symptom, and the temptation to hack the number instead of solving the root cause becomes irresistible.

Choose Decision Metrics Based on User Intent Signals

After removing bounce rate, what takes its place?

"Chat initiation rate" looks like a candidate, but it can only be measured in the B group. The A group has no chat, so the initiation rate is 0% by definition. Comparing "A group 0% vs B group 15%" merely confirms the tautology that "if you add chat, chat gets used" — zero information for the business.

"Session duration" has the same problem. Chat conversation time gets added, so it increases even if users' information-seeking behavior hasn't changed.

What we ultimately chose was "outbound click-through rate."

The beauty of this metric is that it can be measured identically across A and B groups. The signup site link exists in both groups. And it's causally connected to the KGI (subscription growth). A certain percentage of those who click through will subscribe. Higher click-through rate means more subscriptions.

Keep the decision metric to exactly one. The first version defined 4 Primary KPIs. Having 4 "Primary" metrics defeats the meaning of "Primary." With multiple metrics, you can cherry-pick the favorable one post-test and call it "success." That's statistical malpractice. The final version has exactly one Primary, with the rest separated into Secondary and Diagnostic.

One Event = One Action

The first version measured "bubble click," "iframe focus," and "option selection" all with a single user_engage event.

When defining "chat initiation rate = user_engage / bubble impressions," the numerator includes users who merely touched the bubble and users who engaged in deep conversation. When the metric goes up, you can't distinguish whether "more people are starting chats" or "existing users are clicking more."

The final version split one event into five.

Each event maps to exactly one funnel stage. If 12% click the bubble but only 50% establish a conversation, there's a UI problem after the click. If conversations establish but only 20% interact with options, there's a chat response quality problem. You can see exactly where the bottleneck is.
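With one event per stage, the stage-to-stage rates fall out of a simple pairwise scan. A sketch with hypothetical event names and counts chosen to match the percentages above:

```python
# One event per funnel stage lets you compute stage-to-stage rates
# and localize the bottleneck. Event names and counts are hypothetical.
funnel = [
    ("bubble_impression",        10_000),
    ("bubble_click",              1_200),  # 12% of impressions
    ("conversation_established",    600),  # 50% of clicks: UI problem after the click?
    ("option_interaction",          120),  # 20% of conversations: response quality problem?
]

for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    print(f"{prev_name} -> {name}: {n / prev_n:.0%}")
```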

Define Numerator and Denominator Rigorously

A KPI that just says "conversion rate" is undefined.

Conversion rate = [which events in the numerator] / [which denominator] / [at what unit of analysis] / [over what period]

"Event count / session count" and "sessions with 1+ events / session count" are different. The former gives 3/1 = 300% if one session has 3 clicks. The latter gives 1/1 = 100%.

Another issue: if A/B assignment is user-level (cookie) but analysis is session-level, you have a problem. Multiple sessions from the same user are not independent — it's the same person with the same interests. Sample size calculations break down.

In this project, average pageviews per user were 1.13, so the vast majority were single-visit users. The clustering effect was likely small. But "it's small so we'll ignore it" is fundamentally different from "structurally eliminated."

The final version defined: "Primary analysis uses only each user's first eligible session; all sessions serve as supplementary analysis." This aligns the randomization unit with the analysis unit, making sample size calculation assumptions accurate.
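A minimal sketch of the first-eligible-session rule, with illustrative field names; the primary analysis keeps exactly one session per randomized user:

```python
# Align randomization unit (user) with analysis unit: keep only each
# user's first eligible session for the primary analysis.
sessions = [
    {"user": "u1", "ts": 100, "eligible": True},
    {"user": "u1", "ts": 200, "eligible": True},   # dropped: not the first
    {"user": "u2", "ts": 150, "eligible": False},  # dropped: ineligible
    {"user": "u2", "ts": 300, "eligible": True},
]

first_eligible = {}
for s in sorted(sessions, key=lambda s: s["ts"]):
    if s["eligible"] and s["user"] not in first_eligible:
        first_eligible[s["user"]] = s

primary = list(first_eligible.values())
print([(s["user"], s["ts"]) for s in primary])  # [('u1', 100), ('u2', 300)]
```

All sessions remain available for the supplementary analysis; only the primary analysis is restricted.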

Is 100,000 Users "Enough"?

103,817 users over 90 days. Roughly 1,200 new daily visitors. Is that a sufficient base to justify LP conversion optimization?

Alex Schultz's marketing funnel framework includes "Pool Size Analysis." Measure the base at each funnel layer, find the largest drop-off, and invest there. But the most critical judgment is "is the funnel entrance large enough?"

This service's potential market is in the millions to tens of millions. 100,000 is barely a few percent. The awareness problem is overwhelmingly larger. Improving CVR from 2% to 3% adds only a few dozen conversions per month, but 10x the traffic with CVR held constant gives 10x the effect.

We still chose LP optimization. The reasoning was documented explicitly.

LP redesign requires production approval from the site owner and takes time. Chat can be added via GTM with a lighter approval process. Chat works across all pages, giving broader reach than a single LP revision. If successful, it becomes a reusable package for other services.

"Why this initiative" is a question that must be answered before "why this KPI." Without it, you can't respond when someone asks "wouldn't it be better to just increase ad spend?"

Define Halt Criteria Before the Test

The first two versions had no test halt criteria at all; they only said "p < 0.05 means success."

Among the criteria defined in the final version, bounce rate reappears. Not as a decision metric, but as an alert for rapid experience degradation: if the B group's bounce rate worsens by 5+ points over the A group, something is catastrophically broken. Stop first, investigate the cause second.

Pre-Define Post-Test Decisions

"Roll out to all users if successful" is insufficient. Define four patterns in advance.

Significant + Large effect (absolute diff >= 0.5pt)
→ Roll out. Proceed to next test (scenario optimization)
 
Significant + Small effect (absolute diff < 0.5pt)
→ Improve the initiative and retest
 
No significant difference
→ Shelve the initiative. Pivot to CTA optimization on the LP itself
 
B group worsens
→ Halt immediately. Fundamental UX redesign

Without this, when the click-through rate goes from 2.0% to 2.1% with p=0.048, you'll succumb to the temptation to declare it "significant, therefore successful." The monthly additional click-throughs would be in the single digits, with even fewer additional subscriptions. Likely below the chat widget's monthly cost.
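The four patterns reduce to a small pre-registered function; the point is that the thresholds are fixed before any data arrives. A sketch using the 0.5pt effect floor from the matrix above:

```python
def decide(p_value, abs_diff_pt, b_worse=False, alpha=0.05, effect_floor_pt=0.5):
    """Pre-registered decision matrix; thresholds fixed before the test."""
    if b_worse:
        return "halt immediately; fundamental UX redesign"
    if p_value >= alpha:
        return "shelve; pivot to CTA optimization on the LP"
    if abs_diff_pt >= effect_floor_pt:
        return "roll out; proceed to scenario optimization"
    return "improve the initiative and retest"

# 2.0% -> 2.1% with p = 0.048: significant, but the effect is too small to ship
print(decide(p_value=0.048, abs_diff_pt=0.1))  # improve the initiative and retest
```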

In Alex Schultz's words, if you need a data scientist and a microscope to see the improvement, it isn't moving the business.

Structure Documents by Reader

The most common mistake in KPI documents is writing them in "the order I analyzed things." Open GA4 → pull data → form hypotheses → define KPIs → design events → write test specs. This is the analyst's thought process, not the order readers want.

The first version started with "Purpose and Scope," with the GA4 Property ID and GTM container version number at the top. When the site owner's director received it, the first thing they saw was GTM-XXXXXXX v10.2.0. The document was closed at that point.

A KPI document has exactly two types of readers: decision-makers and implementers. What these two need is fundamentally different.

Decision-makers want to know: What are we doing? Why? How much money if it works? What do I need to do? They won't read GA4 event names or postMessage specifications.

Implementers want to know: What are we measuring? Which events fire on which triggers? What are the numerator and denominator definitions? They don't need to read the hypothesis logic or business impact calculations every time.

First half: Decision-making section
  1. Executive summary (1 page)
  2. Why we're doing this (hypothesis logic)
  3. How we'll judge (test design & criteria)
  4. What we're asking for
 
--- Decision-makers stop reading here ---
 
Second half: Implementation section
  5. KPI definition details
  6. Funnel and targets
  7. Event design
  8. Implementation tasks and priorities

The executive summary is the most important section because decision-makers only read this part. Whether it says "projected impact of X per month" or not determines the entire document's fate.

Across five versions, a business impact estimate was never written. This was the biggest structural flaw.

Speak Business Impact in Currency

No matter how elegant your measurement design, if it doesn't say "this is how much," approval won't come.

GA4 actuals (confirmed): roughly 1,200 new users per day on a 90-day average, so the B group receives about 600 users per day under a 50/50 split.

Baseline assumptions: A group outbound click-through rate 2.0%, B group relative lift +50%, click→subscription CVR 30%, ¥1,000 monthly fee, 18-month average retention.

B group rate         = 2.0% × (1 + 50%) = 3.0%
Daily add'l clicks   = 600 × 1.0% = 6.0
Monthly add'l clicks = 6.0 × 30 = 180
Monthly add'l subs   = 180 × 30% = 54
LTV                  = ¥1,000 × 18 mo = ¥18,000
Monthly add'l LTV    = 54 × ¥18,000 = ¥972,000 (97.2 ¥10K)
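The same arithmetic, with every assumption as an explicit parameter so scenarios can be swapped without touching the logic. A sketch reproducing the baseline estimate:

```python
def monthly_incremental_ltv(daily_b_users, base_ctr, lift, cvr, arpu, retention_mo):
    """Impact model: all assumptions are parameters, nothing is hard-coded."""
    b_ctr = base_ctr * (1 + lift)
    daily_extra_clicks = daily_b_users * (b_ctr - base_ctr)
    monthly_extra_subs = daily_extra_clicks * 30 * cvr
    ltv = arpu * retention_mo
    return monthly_extra_subs * ltv

# Baseline scenario from the document: ~¥972,000 per month
print(monthly_incremental_ltv(600, 0.02, 0.50, 0.30, 1000, 18))
```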
Sensitivity Analysis: Monthly Incremental LTV (¥10K)

CVR \ Lift   +20%   +30%   +50%   +75%   +100%
20%          25.2   39.6   64.8   97.2   130
30%          39.6   57.6   97.2   146    194
40%          52.2   77.4   130    194    259
50%          64.8   97.2   162    243    324
60%          77.4   117    194    292    389

Rows: click→subscription CVR. Columns: B group relative lift. Baseline: A group 2.0%, retention 18 mo.

Summary

Generalizing the lessons from five revisions:

1. Start with the KGI. Why does this LP exist? If you can't measure the KGI, set an interim NSM and define promotion criteria.

2. Write hypotheses first. As causal chains. Before looking at data.

3. Don't use mechanically inflated metrics for decisions. Bounce rate is a thermometer, not a treatment goal. Choose the decision metric closest to user intent signals.

4. One event = one action. Mixing multiple actions makes analysis ambiguous.

5. Define numerator and denominator rigorously. Align the randomization unit with the analysis unit.

6. Document why you chose this initiative. "Why this KPI" comes after "why this initiative."

7. Write halt criteria and decision matrix before the test. After the test is too late.

8. Structure documents by reader. Decision-makers in the first half, implementers in the second half. Documents without executive summaries don't get read.

9. Write business impact in currency. Without this, approval won't come.

In this entire sequence, GA4 event design doesn't appear until the second half. It shouldn't come first. Measurement is a means, not an end. The end is moving the business.