VWO Glossary (https://vwo.com/glossary/)

ROPE (https://vwo.com/glossary/rope/)

The Region of Practical Equivalence in statistics delineates a zone around a baseline where differences, though statistically significant, are considered trivial in practical terms. In experimentation, ROPE enables the early discontinuation of variations that are statistically unlikely to outperform the baseline, helping you avoid wasting time and resources on ineffective changes. Further, if you want to test whether a variation is equivalent to or better than the baseline (but not worse), ROPE allows for early deployment, speeding up the optimization process.

The Region of Practical Equivalence (ROPE)

The Region of Practical Equivalence is a statistical concept that defines a range within which any observed differences are considered practically unimportant. Said another way, it’s like a buffer zone around a baseline value inside which changes are considered too small to matter in a real-world context, even when they attain statistical significance.

In the context of experimentation, ROPE helps determine when a difference between the control and the variation is so minor that it can be considered practically equivalent, even if it is statistically significant. This approach optimizes the testing process by quickly implementing effective changes and stopping variations that barely outperform the baseline. 

Understanding ROPE with an example

Imagine you own an eCommerce platform and your present conversion rate is 40%. You determine that for all practical purposes, any conversion rate within the range of 38% to 42% is considered equivalent to 40% for your business. (We have explained how to determine this in the following sections.)

This range—from 38% to 42%—is your ROPE.

Here’s how you do it:

Find the ends: State the lower and upper limits within which your ROPE lies. In this case, 38% and 42%.

Calculate the difference: Find the difference between these limits and the baseline. Here it is ±2%.

Normalize the difference: Divide the difference by the baseline conversion rate (40%).

±2% / 40% = ±0.05, i.e., ±5%

So, a ±5% relative range around your baseline is your ROPE.
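As a quick illustration, here is a minimal sketch of this calculation in Python. The `rope_bounds` helper and the example lift are illustrative only, not part of VWO’s implementation:

```python
def rope_bounds(baseline, lower, upper):
    """Express the ROPE as a relative range around the baseline."""
    # Absolute half-width of the zone, e.g. 42% - 40% = 2 percentage points
    half_width = max(baseline - lower, upper - baseline)
    # Normalize by the baseline to get a relative range
    return -half_width / baseline, half_width / baseline

low, high = rope_bounds(baseline=0.40, lower=0.38, upper=0.42)
print(f"ROPE: {low:+.0%} to {high:+.0%}")  # ROPE: -5% to +5%

# A variation whose observed relative lift falls inside the ROPE is
# treated as practically equivalent to the baseline.
observed_lift = 0.03  # a 3% relative lift, for illustration
print("practically equivalent" if low <= observed_lift <= high else "practically different")
```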

Significance of ROPE

ROPE helps by closing insignificant variations early, saving valuable visitors from being exposed to changes that aren’t likely to yield meaningful improvements.

As a tradeoff, you invest slightly more visitors in better variations so that you can deploy them with increased accuracy. Overall, since winning ideas are rare and most ideas are insignificant, you save visitors significantly on average.

The wider the ROPE region, the more visitors you save. Larger ROPE means more accurate winners (in exchange for extra visitors) and early stopping of variations that do not have potential.

Minimize false positives

Random variations in your data may sometimes look like a trend of important changes. ROPE steps in to protect against these scenarios, making sure that actions will be taken only on meaningful improvements.

Suppose you’re running an ad campaign, and there is a spike in website traffic while the campaign is running. If proper statistical bounds are not established, this spike may be mistaken for a successful campaign when the real reason could be an external factor such as a holiday or a viral post on social media. ROPE differentiates between ‘real’, actionable improvements and random fluctuations.

Factors to consider while setting ROPE

Defaults

A reasonable default value to start with for your ROPE is a conservative one, say ±1%, especially if you are new to the idea. This will reduce your false positives and still give you the benefits of early closing.

Later, as you start using ROPE, you can increase the ROPE value for faster closing of tests. However, doing so means you might miss out on detecting small but potentially valuable improvements within the ROPE region. 

Essentially, the trade-off is between more rapid test closures and the risk of overlooking minor improvements. 

Business context

Different businesses may have different thresholds for what they consider to be a meaningful change. 

For example, some businesses may require very small improvements to be considered meaningful, while others may only consider larger improvements to be significant. Your understanding of what qualifies as a meaningful change in your particular context will determine the appropriate value for your ROPE.

Let’s take an example of an eCommerce retail company:

A company selling low-margin products (like groceries) might consider even a small percentage increase in conversion rate as meaningful. A 1% lift in conversion rate could translate to a significant increase in overall revenue due to high sales volume. The ROPE would be set accordingly, perhaps from -0.5% to +0.5%. Any improvement outside this range would be considered significant. In this case, small changes can be meaningful due to high transaction volume. Hence, ROPE is narrower to detect these smaller, yet significant, improvements.

But if the same store sells luxury goods, it may require a larger percentage increase in average order value to be considered meaningful. Given the higher profit margins, a smaller absolute increase in revenue might still represent a substantial improvement. A luxury goods retailer might require a 5% increase in average order value to be meaningful. Hence, ROPE would be wider, maybe from -2% to +2%, to accommodate the fact that smaller percentage changes have a lesser impact in absolute terms.

Iterative refinement

When conducting continuous testing, you may start with a much wider ROPE and then refine it based on the outcomes of your experiments. This adaptive approach brings your ROPE closer to the effect size that actually matters for your business.

For example, for unoptimized and new webpages, you can aim for larger ROPE values since they should target bigger uplifts.

Optimized webpages with many visitors can benefit from small improvements as well and hence should keep a smaller ROPE.

High-traffic pages of your website should have lower ROPE values since smaller uplifts can be valuable for them. Low-traffic pages should have higher ROPE values so that early stopping can help save visitors and time.

ROPE in VWO

The good news is that ROPE has been integrated into VWO’s Statistical Engine. ROPE enables quicker decision-making since the stats engine can now recommend disabling a variation when it is unlikely to outperform the baseline. This means you will enjoy all the benefits discussed in this article and can rely on smarter and more accurate results for every test you run. Take a free trial with VWO now.

CUPED (https://vwo.com/glossary/cuped/)

Controlled experiment using pre-experiment data (CUPED) is a variance reduction technique used in A/B testing.

Developed by Microsoft’s data science team in the early 2010s, CUPED was created to address the need for more efficient A/B testing on platforms like Bing and Microsoft Office. Since its inception, the technique has become a standard within the A/B testing and optimization communities due to its ability to reduce variance.

How does CUPED work?

Let’s try to understand how CUPED works through an example. Suppose you run an online store and want to test a new checkout process. You set up an A/B test where half of your visitors see the new checkout process (Group B) and the other half see the current one (Group A). The goal is to determine if the new checkout process leads to more completed purchases.

Before starting the test, you already have extensive data about your visitors’ behavior. For instance, you know how many purchases each visitor made in the month prior to the test. Here’s where CUPED comes into play. For each visitor in both Group A and Group B, CUPED gathers data on their purchase behavior from the previous month. As the test runs, it counts the purchases each group makes during the test period. However, instead of just comparing the raw numbers, CUPED adjusts these figures using each group’s pre-experiment purchase behavior, so that differences that existed before the test began are not mistaken for an effect of the new checkout process.

Without CUPED, if Group A (current checkout) averages 10 purchases and Group B (new checkout) averages 12 purchases after the test, you might conclude that the new checkout is slightly better. But with CUPED, you adjust these numbers using the pre-experiment data. Perhaps Group A’s visitors made an average of 4 purchases, and Group B’s visitors made an average of 2 purchases before the test. After adjusting for this pre-experiment data, you might find that Group B’s improvement is even more significant.

Thus, CUPED helps you make your A/B tests more accurate and reliable by factoring in what you already know about your visitors. 
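To make the adjustment concrete, below is a minimal sketch of the standard CUPED formula (the adjusted metric equals the raw metric minus theta times the centered pre-experiment covariate). The data and variable names are made up for illustration; this is not VWO’s implementation:

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """Adjust an in-experiment metric using a pre-experiment covariate (CUPED)."""
    c = np.cov(covariate, metric)   # 2x2 covariance matrix
    theta = c[0, 1] / c[0, 0]       # theta that minimizes the adjusted variance
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(0)
pre = rng.poisson(3, size=1_000).astype(float)   # purchases in the month before the test
during = pre + rng.poisson(1, size=1_000)        # purchases during the test, correlated with `pre`

adjusted = cuped_adjust(during, pre)
print(during.var(), adjusted.var())   # the adjusted metric has noticeably lower variance
```

The adjusted values keep the same mean as the raw metric, so the comparison between control and variation is unchanged; only the noise around it shrinks.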

Benefits of CUPED

Here are the benefits of using CUPED to make your A/B tests more accurate and reliable:

  • CUPED leverages pre-experiment data to control for natural variations in your visitors’ behavior. This means that if there’s a genuine difference between your test groups, CUPED makes it easier to spot. For instance, if your new checkout process is indeed better, CUPED will help you see that improvement more clearly.
  • Reaching statistical significance requires a large number of visitors. However, with CUPED, you can achieve meaningful conclusions with fewer visitors because it reduces the “noise” from natural variations. This makes your tests more efficient and less resource-intensive.

Limitations of using CUPED

While CUPED offers significant benefits, it’s important to understand its limitations. Here are two key points to keep in mind:

  • CUPED relies on pre-experiment data to reduce variance and improve the accuracy of your test results. This means it can only be used with visitors who have been to your site before. If you have a lot of new visitors, CUPED won’t be effective because there’s no past data to leverage.
  • It is not effective for binary metrics, like conversion rates, because it relies on continuous data (such as the number of purchases) to adjust for pre-experiment differences. This makes it less suitable for scenarios where you’re measuring simple yes/no outcomes.

Conclusion

In conclusion, CUPED is a powerful technique that leverages pre-experiment data to enhance the accuracy and efficiency of A/B testing. It helps control variance and enables you to draw meaningful conclusions with fewer participants. However, keep in mind that CUPED is only effective with past visitors and may not be suitable for binary metrics.

Variance (https://vwo.com/glossary/variance/)

What is variance? 

Variance measures the spread of a dataset by quantifying how much a set of values differs from the mean. A higher variance indicates a more spread-out dataset, while a lower variance indicates values are closer to the mean. 

Variance helps you understand the consistency and spread of a dataset, which in turn supports better decision-making.

[Figure: High Variance Data vs Low Variance Data] The left graph shows a high-variance dataset with more erratic values; the right graph shows a low-variance dataset where the values are more consistent.

How do we calculate variance?

To calculate the variance of a sample, you start by subtracting the mean from each value and then squaring the result. This process is repeated for all values in the data set. Next, you sum all these squared differences. Finally, you divide this sum by the number of values in the data set minus one. The result is the variance of the sample (its square root is the standard deviation).

Variance formula:

s² = Σ(xᵢ − x̄)² / (N − 1)

Where:

xᵢ = Each value in the data set

x̄ = Mean of all values in the data set

N = Number of values in the data set

The variance in a population is calculated slightly differently: the denominator changes from N − 1 to N.

Variance can be calculated in software like Google Sheets using various functions. Here’s a quick guide to the different variance functions:

  1. VAR.P: Calculates the variance for an entire population, using only numerical data.
  2. VAR.S: Calculates the variance for a sample, using only numerical data.
  3. VARA: Calculates the variance for a sample, including numerical values, text strings (treated as 0), and logical values (TRUE as 1, FALSE as 0).
  4. VARPA: Calculates the variance for an entire population, including numerical values, text strings (treated as 0), and logical values (TRUE as 1, FALSE as 0).

Use these functions depending on whether you are working with a sample or the entire population and whether your dataset includes mixed data types.
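For a quick check of the manual calculation above, here is the equivalent in Python’s standard library, mirroring VAR.S and VAR.P (the data set is made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data set

sample_variance = statistics.variance(data)       # divides by N - 1, like VAR.S
population_variance = statistics.pvariance(data)  # divides by N, like VAR.P

print(sample_variance, population_variance)  # 4.571428571428571 4.0
```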

Variance in A/B testing

When conducting A/B testing, we compare the average of a metric (such as spending or conversion) between two distinct groups. We also use the standard error, which indicates how much the average conversion rate might vary if the experiment were repeated multiple times. It is derived from the variance: the standard error of a mean equals the standard deviation (the square root of the variance) divided by the square root of the sample size, so a higher variance results in a higher standard error. A higher standard error means more uncertainty in our estimate of the true conversion rate.
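As a minimal sketch with made-up numbers, the standard error of a mean falls when the variance drops or the sample grows:

```python
import math

def standard_error(variance, n):
    # Standard error of a sample mean: sqrt(variance / n)
    return math.sqrt(variance / n)

print(standard_error(variance=4.0, n=100))  # 0.2
print(standard_error(variance=1.0, n=100))  # 0.1 -> lower variance, lower standard error
print(standard_error(variance=4.0, n=400))  # 0.1 -> larger sample, lower standard error
```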

However, you can reduce the standard error using the following methods.

  1. One of the simplest ways to reduce standard error in A/B testing is by increasing the sample size, as a larger sample size tends to produce a distribution that closely resembles a normal distribution. However, practical constraints often prevent us from increasing the sample size.
  2. Splitting the sample size evenly between the control and variation groups (50-50) in an A/B test can reduce the impact of variance in the dataset and help achieve statistical significance more quickly. An unequal split leaves one group with a smaller sample, which increases the variance (and therefore the standard error) of that group’s estimate.
  3. Normalizing outliers is another effective method to reduce standard error. For instance, when conducting an A/B test on a segment filtered by cost per head, you can improve the accuracy of your results by excluding customer data with exceptionally high or low costs per head.
  4. CUPED, or Controlled-experiment Using Pre-Experiment Data, is another technique in A/B testing that uses data from before the experiment to account for natural variations in user behavior. This reduces the standard error in your results. By considering how users behave on your site beforehand (such as their usual spending habits), CUPED helps smooth out natural fluctuations in behavior, making it easier to see the true impact of your change.

Conclusion

Variance measures a dataset’s spread and influences the accuracy of statistical analysis. In A/B testing, reducing the standard error by increasing sample size, equalizing sample distribution, and normalizing outliers can lead to more reliable and meaningful results. By grasping these concepts and applying appropriate techniques, we can enhance the accuracy and reliability of our data analysis and decision-making processes.

False Positive Rate (https://vwo.com/glossary/false-positive-rate/)

A false positive happens when a test or experiment wrongly shows that a variant is a winner or a loser when actually there is no impact on the target metric. It’s like getting a wrong answer on a test, making you think you’re right when you’re actually wrong. In testing or experiments, false positives can lead to mistaken conclusions and decisions.

Please note: False positives show up as Type-1 errors in A/B testing.

What is a false positive rate?

The false positive rate (FPR) is a critical metric that reveals how frequently a phenomenon is mistakenly identified as statistically significant when it’s not. This measure is vital as it indicates the reliability of a test or outcome. A lower false positive rate signifies higher accuracy and trustworthiness of the test.

Formula for FPR:

FPR = FP / (FP + TN)

Where:

  • FP represents the number of false positives.
  • TN represents the number of true negatives, i.e., the tests that were correctly not declared winners among all the tests that did not have any real improvement.

Example of false positive rates 

Imagine a newly developed diagnostic test aimed at detecting a rare genetic disorder. To gauge its accuracy, 1000 seemingly healthy individuals from diverse demographics and geographical areas undergo the test. Upon analysis, it’s discovered that out of these 1000 individuals, the test incorrectly identifies 20 as having the genetic disorder. This results in a false positive rate of 2%. Despite being healthy, these individuals are wrongly flagged by the test. Such simulated assessments offer vital insights into the efficacy of medical tests, aiding healthcare professionals in assessing their real-world reliability and effectiveness.
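Plugging the numbers from this example into the formula (purely illustrative):

```python
def false_positive_rate(fp, tn):
    return fp / (fp + tn)

# 1000 healthy individuals: 20 wrongly flagged (FP), 980 correctly cleared (TN)
print(false_positive_rate(fp=20, tn=980))  # 0.02, i.e. a 2% false positive rate
```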

Why is evaluating the false positive rate important?

The accuracy of the statistical model is heavily reliant on the false positive rate, making it imperative to maintain a careful balance. 

In medical diagnostics, a high false positive rate can erroneously categorize healthy individuals as having a disease. 

Within finance, false positives manifest in fraud detection systems and credit scoring models. Elevated false positive rates can result in legitimate transactions being flagged as fraudulent.

Cybersecurity tools are susceptible to false positives, which can inundate security analysts with alerts, leading to alert fatigue. Excessive false alerts may cause analysts to overlook genuine threats.

False positives within quality control processes may lead to the rejection of acceptable products, escalating manufacturing costs and diminishing efficiency.

The ramifications of false positives vary across these domains, contingent upon the specific context and repercussions of inaccurate outcomes. Broadly, a heightened false positive rate can squander resources, impair efficiency, undermine trust in systems or models, and potentially yield adverse consequences for individuals or organizations.

False positive rate in A/B testing

The false positive rate poses a significant risk in A/B testing scenarios, where businesses compare different website or app versions to determine which performs better. When the false positive rate is high, the test is more likely to declare a winner that does not actually exist, leading to misguided decisions.

To bolster the reliability and effectiveness of A/B testing software while minimizing false positives, it’s prudent to lower the false positive rate threshold. The threshold is typically set at 5% in A/B testing; reducing it to 1% can enhance test accuracy and reduce false positives. Platforms like VWO use the Probability to Beat the Baseline (PTBB) to control the false positive rate: if the PTBB threshold is 99%, then the FPR is 1%.

Conclusion

In conclusion, the false positive rate is a critical metric that impacts various domains, including medical diagnostics, finance, cybersecurity, and quality control processes. High false positive rates can lead to erroneous decisions, squander resources, and undermine trust in systems or models. 

Platforms like VWO leverage PTBB to mitigate the threat of false positive rates. If you want to know more about it, grab a 30-day free trial of the VWO platform to explore all its capabilities.

Sequential Testing Correction (https://vwo.com/glossary/sequential-testing-correction/)

Sequential testing correction consists of techniques used in statistical analysis to mitigate the risk of increased false positives while continuously monitoring test statistics for significance. Sequential testing can increase the chance of declaring a variation to be a winner when it is actually equivalent to the baseline. That is where sequential testing correction helps control this risk by adjusting the level of confidence required before declaring something as significant.

Sequential testing is a statistical method used to analyze data as it is collected, ensuring decisions are made in a step-by-step way rather than waiting until all data is collected.

This can help to reduce the time and resources needed for experimentation, particularly in situations where the outcome becomes clear before all data is collected. 

Let’s say you’re a product manager for an eCommerce website, and you’re planning to roll out a new feature aimed at increasing conversion rates. However, your website has limited traffic, and acquiring additional traffic through advertising campaigns is costly. In such a case, sequential testing would be ideal. 

You could implement the new feature and use sequential testing to monitor its performance. If the feature shows significant positive results early on, you can conclude the test sooner and roll out the feature to all visitors, saving time and resources. On the other hand, if the feature doesn’t perform as expected, you can stop the test early, preventing further investment of resources in an ineffective feature. 

Sequential testing correction encompasses methods aimed at preventing the issues that arise from sequential testing, such as false conclusions when interpreting interim results. Sometimes, sequential testing may heighten the risk of erroneously concluding a variation to be better when it isn’t (a false positive). Sequential testing correction mitigates this risk by adjusting the threshold of confidence necessary before finalizing significance.

Fixed horizon tests vs Sequential tests

In contrast to sequential tests, fixed horizon tests have both sample sizes and experiment goals predetermined. Conclusions can only be drawn upon completion of the review period. This approach generally provides a higher level of statistical trustworthiness but at the cost of higher traffic being used for each experiment. 

Why are Sequential tests more suited to modern A/B testing? 

In recent years, sequential tests have become increasingly popular, enabling continuous data collection. Here are some reasons why they are better suited to modern A/B testing:

Efficiency

By implementing sequential testing, organizations can quickly identify potential disadvantageous ideas or content at an early stage of development before they are fully implemented or exposed to a large audience. Organizations can effectively allocate resources and minimize the overall costs associated with implementing such ideas. This helps businesses make informed decisions, such as releasing a big feature before a major event, in fast-paced digital environments. 

Flexibility

Modern businesses need experimentation to be visitor efficient so that A/B testing can be done on pages with low traffic as well. With sequential testing, sample sizes are not fixed, offering the option to stop the experiment early if significant results are observed or to continue until reaching a predetermined endpoint, accommodating varying traffic levels and experiment durations.

What are the problems caused by sequential testing?

Despite having benefits, sequential testing may also pose problems for businesses. 

It may seem counterintuitive, but whenever statistical results are calculated multiple times, there is a risk of increasing the false positive rate.

This is the main concern with continuously monitoring A/B test statistics. Therefore, several solutions have been proposed to sequentially correct test statistics and reduce the occurrence of false positives in sequential testing.

How do you correct errors from sequential testing?

There are a couple of ways of correcting errors in sequential testing. They are as follows:

Bonferroni corrections

False positive rates increase roughly linearly with the number of interim checks you make. The simplest solution is to divide your false positive rate by the number of interim checks you are making.

So, if you need a 5% false positive rate and you are making 10 interim analyses, set the false positive rate of each check to 5% / 10 = 0.5%. This is the Bonferroni correction.
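A minimal sketch of this correction (illustrative only):

```python
def bonferroni_alpha(overall_alpha, num_checks):
    """Per-check significance threshold under a Bonferroni correction."""
    return overall_alpha / num_checks

print(bonferroni_alpha(overall_alpha=0.05, num_checks=10))  # 0.005, i.e. 0.5% per interim check
```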

Always valid inference 

This method allows for continuous testing during data collection without determining in advance when to stop or how many interim analyses to conduct. This approach offers flexibility, as it doesn’t require prior knowledge of sample size and supports both streaming and batch data processing. 

Always Valid Inference isn’t popular because it’s complex to grasp and significantly compromises statistical power. This implies that detecting a winner will take significantly longer when one actually exists.

To simplify the testing process and allow you to focus on running tests and obtaining early results without concern for skewed outcomes, VWO uses a derivative of the Alpha-Spending approach to sequential testing correction proposed by Lan and DeMets.

The alpha-spending approach involves distributing the type I error (alpha) across the duration of a sequential A/B test. With this approach, alpha can be allocated flexibly across the selected peek times, and it is only utilized when peeking occurs. If a peek is skipped, the unused alpha can be retained for future use. Additionally, there is no need to predetermine the number of tests or the timing of their execution during data collection.
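To make the idea concrete, here is a sketch of one commonly used spending function from the Lan–DeMets family (the Pocock-type approximation). It illustrates the general approach only and is not the specific function VWO uses:

```python
import math

def pocock_alpha_spent(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (0 < t <= 1)."""
    return alpha * math.log(1 + (math.e - 1) * t)

# The alpha available at each peek is the increment since the previous peek;
# if a peek is skipped, its share simply remains available for later peeks.
previous = 0.0
for t in [0.25, 0.5, 0.75, 1.0]:
    cumulative = pocock_alpha_spent(t)
    print(f"peek at t={t:.2f}: spend {cumulative - previous:.4f} of alpha")
    previous = cumulative
```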

By selecting Sequential Testing Correction in the SmartStats Configuration, decision probabilities will be adjusted to minimize errors while monitoring test results during data collection in the new test reports.

If you prioritize obtaining reliable test results and desire greater control over test statistics, consider using VWO, where our testing system is designed to meet your advanced needs.

Inverse Metrics (https://vwo.com/glossary/inverse-metrics/)

Inverse metrics are considered better when their values decrease. A reduction in their value is seen as an indicator of improvement in the overall visitor experience on a website. For instance, a lower bounce rate suggests higher visitor engagement, and a lower form abandonment rate signifies smoother visitor interactions with webforms.

Inverse metrics on a website are deemed more favorable when their values decrease.

For instance, if you notice an increase in the drop-off rate on your website’s cart page using analytics tools, and the heatmap analysis confirms the same, you might run a test to ‘reduce’ the drop-off. Ideally, you want the checkout rate to increase while the drop-off rate decreases.

In this example, the drop-off rate is the inverse metric you aim to decrease. A reduction in the drop-off rate can contribute to an increase in other crucial metrics, indicating that visitors are taking desired actions on your website and leading to an uplift in conversions for your business.

What are some inverse metrics?

Whether you want to improve conversions, introduce a new feature, or investigate navigation bottlenecks on your website, tracking inverse metrics is important to understand where visitors encounter problems and to find ways to reduce their values. Here are some inverse metrics you should watch out for: 

Page load time

The page load time is an inverse metric because the lower it is, the better the visitor experience on a website. Consequently, maintaining a low page load time helps control other inverse metrics, such as bounce rates.

Bounce rate

Bounce rate is the percentage of visitors leaving after viewing one page on a website. It is important to maintain a low bounce rate to encourage visitors to explore further and move down the conversion funnel on your website.

Refund rate

Refund rate represents the percentage of customers requesting refunds for products or services. A lower refund rate suggests customer satisfaction, good product quality, and effective marketing, all of which are positive indicators for a business.

Customer support tickets

A decrease in the number of customer support tickets indicates that visitors are experiencing fewer issues or challenges with the products or services offered by a business. This could indicate improved product quality, clearer instructions, intuitive features, or the proactive resolution of common customer pain points. 

Form abandonment rate

When visitors abandon web forms midway, it indicates that they found the form-filling process to be a hassle. You can monitor the field-level friction points through form analytics. A clear and intuitive form design encourages visitors to smoothly progress through the required fields.

Cart abandonment rate

A higher cart abandonment rate suggests that visitors are dropping off before completing their purchases, signaling friction in the conversion funnel. Do you want to learn effective methods for minimizing cart abandonment on your website? Download our eBook for valuable frameworks, tips, and real-world examples to guide you through the process.

Cost per acquisition

A lower Cost Per Acquisition (CPA) is desirable because it means a business is acquiring customers at a lower cost, improving profits and returns. Businesses can prioritize high-return channels to acquire new customers, nurture relationships with existing customers, and implement customer retention strategies to bring down CPA.

Businesses successfully reducing inverse metrics

Businesses actively strive to keep inverse metrics under check because a reduction in these values will indicate an improvement in visitor engagement and experience on their websites. Here are some brands that strategized to control inverse metrics and saw improvement in conversion metrics:

  • ReplaceDirect, a Dutch eCommerce site, revamped the second stage of the checkout process by adding an order overview showing the products, total costs, and delivery date. The layouts of the page and the form were changed for a cleaner look, and unnecessary fields were removed. It decreased the cart abandonment rate by 25% and increased sales by 12%.
  • MedaliaArt, an online art gallery, conducted a split URL test where they created two new versions of homepages with a holiday sale banner displayed at different locations – one at the top and another on the right. They wanted to track which variation could help reduce the bounce rate on the website. Variation 1, which showed the banner prominently at the top, was a winner, reducing the bounce rate by 21%.
  • POSist, an online restaurant management platform, wanted to increase the number of sign-ups for a demo of their platform. The team started with homepage improvements to figure out ways to reduce the drop-off on the website. They also reduced the loading time and enhanced the overall performance of their website to ensure faster loading on all devices and platforms. This optimization resulted in a 15.45% increase in visits to the contact page. Moreover, these changes addressed fundamental issues and laid the foundation for a couple of other tests that increased demo requests by 52%.

The lower the values of inverse metrics, the better the visitor experience. If you’re wondering where to start making changes to keep these metrics in check, VWO can help. With VWO, you can derive insights from visitor behavior, identify friction areas, run tests, and implement changes to control inverse metrics. 

In fact, VWO recently introduced two powerful metrics – time spent on page and bounce rate. These metrics reveal how visitors behave, enabling increased engagement and better conversions on a website. In experiments where the bounce rate serves as a metric, VWO views a lower bounce rate as a sign of improved performance. To explore all the features of VWO, sign up for a free trial.

Guardrail Metric (https://vwo.com/glossary/guardrail-metric/)

What is a guardrail metric?

Guardrail metrics are the business metrics that you don’t want to see negatively impacted while conducting experiments like A/B tests. The guardrail metric setting acts as a safety net, ensuring that while you’re focusing on enhancing certain aspects of your business, you’re not inadvertently causing harm to another critical metric essential for overall success.

An organization can establish guardrail metrics common to all teams to prevent negative impacts during experiments. Additionally, different teams can publicly share their key metrics and request to set them as guardrails to avoid causing harm. For instance, the web performance team may share their key metric like website speed threshold, which the marketing team can set as a guardrail metric when conducting an A/B test.

Example of guardrail metric

Let’s imagine a scenario where a SaaS website is conducting an A/B test to improve scroll depth on its landing page. The original design of the landing page is shown below:

[Image: Example of a landing page]

The A/B test involved testing a variation with a scroll-down feature for the “know feature” text in the first fold. To safeguard against unintended consequences, a guardrail metric was established to ensure the visibility and effectiveness of the “Book demo” call-to-action (CTA) in the first fold remained prominent and unaffected.

Throughout the test phase, the team meticulously analyzed user engagement metrics and conversion rates, and gathered feedback. After a few weeks of experimentation, the data revealed a remarkable 20% boost in user scroll depth. Importantly, this increase was achieved without the visibility or effectiveness of the critical “Book demo” CTA dropping below its threshold. The successful outcome showcased a well-balanced approach, achieving increased scroll depth while ensuring there was no negative impact on the guardrail metric.

Types of guardrail metrics

To secure a continuous enhancement of your website or digital touchpoint experience while safeguarding your ROI, it’s crucial to monitor different types of guardrail metrics. Here are the types of guardrail metrics you should keep an eye on:

  1. Financial metrics that have a direct impact on the revenue generated through your digital touchpoint, such as the checkout button click-through rate (CTR).
  2. Metrics that track user experience, including engagement rate, scroll depth, time on page, CTR, and website speed.
  3. Business-specific metrics that change at specific time intervals; for example, a quarterly goal might be to reduce churn, in which case you would track metrics that measure engagement from existing customers.

Benefits of using guardrail metrics 

Setting a guardrail metric for an experimentation campaign offers key advantages:

a. Risk-averse approach

It maintains a risk-averse approach while enabling improvements, ensuring a balance in performance for your key business objectives.

b. Complex relationship insights

It facilitates the understanding of complex relationships between various parameters that may be overlooked during hypothesis creation.

c. Coordination between teams

An organization can ensure that individual teams working to improve respective key metrics don’t inadvertently harm other team metrics.

d. Ease for future hypotheses

The insights gained from tracking guardrail metrics aid in formulating hypotheses by providing clear guidelines on what to avoid for future hypotheses.

Setting and tracking guardrail metrics with VWO

Creating a guardrail metric with VWO is a straightforward process. Suppose you wish to set a guardrail for the form signup rate on your website. The image below shows the VWO interface with the required metric setup. 

Once you have successfully created the metric, applying it to your VWO campaigns is a straightforward process. In any experimentation feature, like VWO Testing, you can access the VWO dashboard where you manage your metrics and goals. Set the primary metric as the one intended for the test and select the secondary metric as the guardrail metric you created. 

[Image: VWO dashboard]

By incorporating a guardrail metric into your VWO campaigns, you ensure a robust monitoring system that allows you to track and safeguard crucial business metrics during experimentation.

If you want to explore the VWO dashboard, discover how to set guardrail metrics, and utilize other experimentation features to enhance your CRO campaigns, we offer a comprehensive 30-day free trial. Give it a try and unlock the potential for optimizing your conversion rates!

Conclusion 

In conclusion, guardrail metrics are crucial for businesses looking to conduct experiments and improve their key metrics without causing harm to other critical metrics essential for overall success. By setting and tracking guardrail metrics, organizations can maintain a risk-averse approach, gain insights into complex relationships, and ensure coordination between teams. 

Engagement Ratio (https://vwo.com/glossary/engagement-ratio/)

Engagement ratio measures users’ interactions, encompassing activities like scrolling, clicking, typing, zooming in, and more on a website. A high engagement score signals high user attention, leading to positive user experiences and increased conversion rates. Zeroing in on customer feedback and creating customer-focused content can help improve the engagement ratio on a website.

Engagement ratios, also known as engagement scores, vary in definition across businesses. For a uniform understanding, we can define the engagement ratio as the active time users spend on a website, encompassing activities like scrolling, clicking, typing, and media playback, among others.

Clicking and scrolling, along with other actions, act as indicators of users’ attention and involvement in a product. 

For example, a user might click on a product image for a better view, click on a button to purchase, or click a link to navigate to another page. Clicking is associated with exploration, navigation, and interaction with a website. 

Users may play a video or audio file by clicking on it. This action is driven by a desire to consume content, such as watching a tutorial, listening to music, or viewing product demonstrations. 

Ultimately, a positive user experience is achieved when users face little to no challenges and show sustained attention throughout their journey. 

Why is engagement ratio important?

An increasing number of businesses are recognizing the importance of prioritizing engagement ratio on their websites. Let’s explore the top reasons why it is crucial.

Capitalize on the strengths 

The engagement score recognizes and leverages existing strengths on a website. By pinpointing where users are most engaged, businesses gain insights into the compelling aspects of their digital experience. This understanding is crucial for strategic optimization that allows businesses to build on their strengths and create a more successful online presence. 

Allocate resources smartly 

Identifying high-engagement areas helps businesses wisely allocate resources. For example, a high engagement score on a specific product page may encourage you to spend more money on ads or create more content for that product. This targeted approach maximizes the impact of your marketing efforts and ensures resources are used efficiently, contributing to overall success.

Personalize offerings 

Past engagement scores offer insights for tailoring experiences for future user actions. Let’s say, if a key page suddenly gets less engagement, it could mean it’s not working well. This signals a need to optimize the page for sustained engagement and conversions. Alternatively, if users engage more with a banner promotion, show them more related content later to push them down the funnel. 

Building customer loyalty 

High engagement scores not only boost the chances of visitors converting but also create an environment where existing customers are more likely to become loyal advocates for your brand. This positive influence gradually extends to a broader circle, forming a positive ripple effect. 

Key tips to improve engagement score on your website

Here are some tips to boost your website’s engagement score and cultivate lasting connections with your users for improved conversions.

Gather user feedback

The first step is to be truly interested in understanding your users’ needs, behavior, and preferences, so you can serve them better. Make the most of on-page surveys, live chats, and exit pop-ups to get their feedback so you can continuously improve your website and align it to their likes and interests. 

Craft user-focused content 

To truly capture attention and drive engagement, you need to make your content all about the users. Ensure your website focuses more on them, their problems, and solutions rather than highlighting your achievements. Create content that’s easy to consume and inspires audiences to take action. 

Humanize messaging 

Establishing a unique connection with your target audience by incorporating empathy into your messages is essential. Avoid jargon-heavy content if you want users to derive true value from it. Authenticity resonates with users, amplifying the engagement score on your website.

Personalize experiences

In today’s business world, customer experiences are personalized at every touchpoint, from social media content to email offers and website product recommendations. This personalization approach ensures that your marketing strategies align with users’ interests, securing their increased satisfaction. 

A roadmap for better tracking of engagement score 

At VWO, we plan to create a dashboard displaying top metrics aligning with the growth of our data infrastructure for improved tracking of engagement scores. 

We plan to use a scoring method based on heuristics. The scoring algorithm will consider these factors:

Media Playback (Start-Stop): Indicates a user’s interest in multimedia content, showcasing active involvement. High engagement during media playback shows the effectiveness of visual elements, helping assess the appeal of multimedia content. 

Copy: This reflects a user’s interest in textual content, indicating their willingness to consume information. 

Mouse Movement: Active mouse movement helps uncover user interaction patterns and areas of interest, signaling improvements for a more user-friendly experience.

Scroll: Users scrolling through content show they’re keen to explore more. It’s a helpful way to see what they find interesting and guides us in arranging content for a better experience.

Right clicks: When users right-click, it shows they’re exploring more options or information through context menus. This helps us understand what users prefer, guiding us to improve the interface for a better experience.

Typing: Typing activities reveal how users engage with input fields or forms. This helps you gauge user engagement with interactive elements and plan for form optimization accordingly.

Taps: Tapping gives insights into how users interact with their touchscreens, revealing room for improvement in mobile interfaces.

Zooming: Zooming is a clear sign that users want a closer look at specific content. This is important because it shows their keen interest in details, providing valuable insights to improve visuals and layouts.

We’ll mark all the important moments of engagement on a timeline and measure the time between them. If it’s less than 5 seconds, we’ll count it as active engagement; otherwise, it’s considered non-engagement.
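A minimal sketch of this timeline heuristic is shown below; the event timestamps, the 5-second gap, and the function name are assumptions based on the description above, not VWO’s implementation:

```python
def active_engagement_seconds(event_timestamps, max_gap=5.0):
    """Sum the gaps between consecutive engagement events,
    counting a gap only when it is shorter than max_gap seconds."""
    events = sorted(event_timestamps)
    active = 0.0
    for previous, current in zip(events, events[1:]):
        gap = current - previous
        if gap < max_gap:
            active += gap  # counts as active engagement
        # longer gaps count as non-engagement
    return active

# Timestamps (in seconds) of clicks, scrolls, key presses, etc. during one page view
events = [0.0, 1.2, 3.0, 4.5, 20.0, 22.0, 23.5]
print(active_engagement_seconds(events))  # 8.0 seconds of active engagement
```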

How can VWO help you improve your engagement score? 

To boost your website’s engagement score, you can leverage VWO Insights to analyze visitor behavior on your website. Heatmaps and session recordings help you assess clicks, scrolls, and typing, enabling you to set engagement scores. Moreover, you can harness the power of website surveys and form analytics to gather user feedback and improve form performance respectively. 

Further, based on these insights, you can conduct tests using VWO Testing to validate hypotheses and improve user experiences. For instance, if heatmaps reveal low clicks on the primary CTA button of your landing page, you can test to see if introducing changes enhances the engagement score, subsequently improving the conversion rate.

How did increased engagement lead to more sign-ups for Ubisoft?

Ubisoft Entertainment, based in Paris, is a renowned French video game publisher known for hit series like Assassin’s Creed, Far Cry, and Just Dance.

For Ubisoft, the conversion on the Buy Now page was the main performance indicator of user experience. They resorted to A/B testing to improve lead generation for the game, For Honor, on the same page. However, before that, the Ubisoft team leveraged heatmaps, scrollmaps, and on-page surveys to gauge the current level of user engagement on the Buy Now page. 

From the observed insights, it was hypothesized that it would be better if the up and down scroll could be reduced and the buying process simplified. 

In the revamped test layout, the section for selecting the edition and console, along with the Order Now step, was relocated to the upper part of the left column, accompanied by an edition comparison feature. 

This redesign effectively eliminated the need for scrolling and led to an enhanced engagement on the Buy Now page. As a result, the variation was a clear winner with a 12% increase in order sign-ups. 

[Image: Control]

[Image: Variation]

Are you inspired by this success story? Aim for high engagement scores to ensure optimal conversions on your website. Take a free trial to get started with VWO today. 

Simpson’s Paradox (https://vwo.com/glossary/simpsons-paradox/)

What is Simpson’s Paradox?  

Simpson’s Paradox is a statistical phenomenon in which a trend or characteristic observed within individual data groups undergoes a reversal or disappearance when these groups are aggregated.

Let’s understand it through a simple hypothetical example.

In a medical research facility, researchers evaluated the effectiveness of two drugs, labeled Drug A and Drug B, in improving a crucial health indicator. The overall results favored Drug A, indicating its superior performance. 

However, when the data was dissected by gender, an interesting nuance emerged. Among men, Drug B surprisingly outperformed Drug A, while a similar trend was observed among women. Despite Drug A’s general superiority, the gender-specific analysis showcased distinct strengths for Drug B in both male and female cohorts.

The issue with Simpson’s Paradox is that it can be difficult for analysts to determine whether they should rely on insights from aggregated data or individual data groups. Simpson’s Paradox isn’t limited to any specific field; instead, it can manifest anywhere. 

Why is Simpson’s Paradox important?

a. Highlights the pitfall of drawing misleading conclusions

Simpson’s Paradox highlights the pitfalls of drawing misleading conclusions from data without taking into account the variables involved. This oversight can be particularly consequential and worrisome in fields like medicine and scientific research, where precise data interpretation is crucial.

b. Emphasizes the need to control confounding variables

Confounding variables are factors that, while not the primary focus of a study, can significantly impact how we interpret the relationship between the main variables under investigation. These variables often sneak into the analysis and introduce biases or distortions, making it difficult to attribute any observed effects solely to the studied variables. Simpson’s Paradox highlights the importance of not only identifying these potential confounding variables but also actively taking steps to control for them in subsequent statistical analyses.

c. Showcases the complexity of the data at hand 

The Simpson’s Paradox emphasizes the intricacy of interpreting data patterns. It shows that observed trends in subgroups may not hold when the data is combined, and vice versa. This serves as a reminder for analysts and researchers to avoid simplistic generalizations and adopt a more sophisticated and context-aware approach to data analysis.

How do you deal with Simpson’s Paradox? 

a. Randomized sampling

In this process, the dataset is randomly divided into equal groups without favoring any specific data variable. The goal is to achieve a balanced distribution of confounding variables in both groups, minimizing the likelihood of their impact and preventing the occurrence of Simpson’s Paradox. Randomized sampling is mostly utilized when there is limited information available regarding confounding variables. It’s important to note, however, that randomized sampling is most effective with large samples, and the risk of uneven distribution of confounding variables increases with smaller sample sizes.

b. Blocking confounding variables 

If you’ve pinpointed a confounding variable in a dataset through literature review and past experiment results, you can address the paradox by blocking those variables in the current dataset. For instance, if a previous dataset revealed a paradox related to male and female users, you can block gender as a variable in the current analysis. However, this approach becomes impractical when dealing with numerous confounding variables. 

Simpson’s Paradox in A/B testing

Simpson’s Paradox can emerge when there’s inconsistency in traffic allocation during an A/B test. For instance, if you start with a 10-90 traffic split between variation and control on day 1, with 1000 visitors, and then, on day 2, you adjust the traffic to a 50-50 split for the variation and control with 600 visitors, you may encounter Simpson’s Paradox in the dataset.

Across both days, the variation appears to boast a superior conversion rate. However, when you amalgamate the dataset, the control emerges as the winner. This discrepancy in results is a classic manifestation of Simpson’s Paradox, induced by the shift in traffic allocation between days. Such deceptive trends can be perilous, especially for large websites with significant financial stakes, potentially leading to misguided decisions. Hence, it’s always advisable to maintain consistent traffic allocation throughout the ongoing test to sidestep the occurrence of Simpson’s Paradox in the results.
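To see how the reversal happens with numbers like these, here is an illustrative calculation; the visitor and conversion counts are made up to match the scenario above:

```python
# Day 1: 10-90 split (variation-control), a high-converting day.
# Day 2: 50-50 split, a low-converting day. Values are (visitors, conversions).
data = {
    "day 1": {"control": (900, 270), "variation": (100, 32)},
    "day 2": {"control": (300, 30),  "variation": (300, 36)},
}

for day, groups in data.items():
    rates = {name: conv / visits for name, (visits, conv) in groups.items()}
    print(day, {name: f"{rate:.0%}" for name, rate in rates.items()})  # the variation wins on both days

totals = {"control": (0, 0), "variation": (0, 0)}
for groups in data.values():
    for name, (visits, conv) in groups.items():
        v, c = totals[name]
        totals[name] = (v + visits, c + conv)

print("combined", {name: f"{c / v:.0%}" for name, (v, c) in totals.items()})  # the control wins overall
```

Because the variation received most of its traffic on the low-converting day, the combined numbers understate its true performance, which is exactly the reversal described above.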

Conclusion

Simpson’s Paradox rears its head in datasets influenced by confounding variables, making it crucial for businesses and analysts to stay vigilant and approach analysis with awareness. Remember, a thorough review of literature, past data analysis, and simulation can be instrumental in mitigating its effects. Being proactive in understanding and addressing potential confounding factors is key to ensuring accurate and reliable data interpretations.

Frequently Asked Questions (FAQs)

What is the primary reason for Simpson’s Paradox?

Simpson’s Paradox occurs when the analysis of data is oversimplified, leading to incorrect conclusions.

What is the solution to Simpson’s Paradox?

The solution to Simpson’s Paradox is identifying and negating the confounding variables. 

What is the difference between Simpson’s Paradox and Berkson’s Paradox?

Simpson’s Paradox manifests as a divergence in trends when data groups are amalgamated. In contrast, Berkson’s Paradox stems from selection bias in the sampling process, creating a correlation that may not exist in the broader population. Both paradoxes underscore the importance of careful consideration and nuanced analysis in statistical interpretation.

Sample Ratio Mismatch (https://vwo.com/glossary/sample-ratio-mismatch/)

Sample Ratio Mismatch (SRM) arises when traffic allocation to the groups in an A/B test deviates from the intended distribution. Maintaining intended sample size allocation between control and treatment groups is crucial for accurate test results and sound decision-making based on them. Hence, early detection and correction of SRM are critical to optimizing test resources and achieving reliable outcomes.

What is Sample Ratio Mismatch?

Sample Ratio Mismatch (SRM) in the context of an A/B test refers to an imbalance in the distribution of users between the control and variation groups. It happens when the intended randomization fails, leading to unequal sample sizes in a test. 

For example, you assign 50% of users to the control group and 50% to the variation group for an A/B test. However, due to some issue, the actual distribution ends up allocating 45% of users to the control group and 55% to the variation group. This is a case of SRM, affecting the accuracy and reliability of your test results.

Another scenario is when the configured allocation is, say, 60:40 in the A/B test, but the observed allocation turns out to be 70:30. Any deviation from the planned distribution is considered an SRM issue.

What causes SRM issues?

There may be several reasons why SRM creeps into your A/B test. Let’s look at some of the classic reasons why this happens below:

User behavior

If users delete or block cookies, it can disrupt the tracking and randomization process, leading to a sample ratio mismatch. This is because regular clearing of cookies may cause such users to be counted as new users, leading to their overrepresentation in one group.

Technical bugs 

Technical issues can also cause an SRM. Consider a test with JavaScript code that’s making one variation crash. Due to this, some visitors sent to the crashing variant may not be recorded properly, causing SRM. 

Geographic or time differences

Geographic or time differences can influence user behaviors, affecting the distribution of users across groups in the A/B test. So, for example, consider an online retail website with a global user base. If your test does not account for time zone differences, it may unwittingly include a significant number of users from a specific region in one group during certain hours. This could result in an SRM in the segment of users coming from that particular location.

Browser or device biases 

When specific browsers or devices are overrepresented due to biases in the randomization process, the integrity of the test can be compromised. For example, suppose you run an A/B test on your SaaS website for mobile, but the variation’s slow loading speed leads to fewer visitors being recorded in the mobile variation. Without careful randomization, one group ends up with a higher proportion of users due to device or browser issues, skewing the test results.

Dogfooding 

Employees, being internal users, are exposed to the latest features or tests by default. As they interact with the product more frequently than external users, their inclusion in the treatment group significantly skews the metrics. This inadvertent inclusion of one’s own company’s employees in a test, also known as dogfooding, can distort test results and lead to an overestimation of the impact of a test. 

When is SRM a problem and when is it not? 

Put simply, SRM arises when one version of a test receives a significantly different number of visitors than originally expected. A classic A/B test has a 50/50 traffic split between two variations. 

But you see that toward the end of the test, the control gets 5,000 visitors, and the variation gets 4,982 visitors. Would you call this a massive problem? Not really.

In the final stage of an A/B test, a slight deviation in traffic allocation can happen due to the inherent randomness in allocation. So, if you see that the majority of traffic is correctly allocated (calculated confidence being 95%-99%), you need not worry about a slight difference in sample ratios.

But SRM becomes a notable issue when the difference in traffic is substantial, such as 5,000 visitors directed to one version and 2,100 to the other.

That’s why staying alert and keeping an eye on visitor counts is so important if you want to obtain accurate test results.

Want to watch how you can split traffic for your A/B test on VWO? Here is a video for you:

[Video: Traffic splitting on VWO]

How to check for SRM?

SRM is like a symptom revealing an underlying issue in A/B testing. Just as a doctor recommends tests for a patient, a chi-square test serves as a diagnostic tool for confirming SRM: a p-value below 0.05 indicates that there is SRM in the test. In some cases, the differences in ratios are so pronounced that no mathematical formula is needed to identify the problem.
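For example, here is a minimal sketch of this check using SciPy’s chi-square test (the visitor counts are illustrative):

```python
from scipy.stats import chisquare

# Observed visitors per group vs. what the intended 50-50 split would predict
observed = [5_000, 4_700]
expected = [sum(observed) * 0.5, sum(observed) * 0.5]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(p_value)  # about 0.002 here; a p-value below 0.05 points to an SRM
```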

Where to check for SRM?

Once you’re sure there’s an SRM in your test (which happens in about 6% of A/B tests), you need to know where to find it. Microsoft’s report highlights the stages where SRM can occur:

Experiment Assignment

Issues could occur if users are placed in the wrong groups, the randomization function malfunctions, or user IDs are corrupted.

Experiment Execution

Variations might start at different times, causing discrepancies, or delays in determining which groups are part of the experiment.

Experiment Log Processing

Challenges may arise from automatic bots mistakenly removing real users or delays in log information arrival.

Experiment Analysis

Errors may occur in triggering or starting variations incorrectly.

Experiment Interference

The experiment might face security threats or interference from other ongoing experiments.

What is the role of segment analysis? 

Sometimes you can find the SRM hidden in one of your visitor segments in the A/B test. Let’s understand with an example. 

Let’s say you’re testing two different discount banners on your grocery website. The link to one variation has been circulated through newsletters, leading to more traffic for that variation and less for the control and the other variation. When you delve into segments, you notice SRM in the user segment from the email source. 

You can exclude this segment and proceed with the test results with properly adjusted users. Or if you think the segment is too important to let go of, consider starting the test anew. We advise you to discuss this with your stakeholders before making a decision. 

Therefore, segment analysis helps you make important optimization decisions, a task not possible with just a chi-square test. While the chi-square test identifies SRM, it doesn’t really tell you why it happened. 

Can SRM affect both Frequentist and Bayesian statistical engines in A/B testing?

Yes. Regardless of the statistical approach used, SRM can jeopardize the authenticity of any A/B tests. Addressing and correcting for SRM is crucial to ensure the reliability of the test results, whether you are using Frequentist or Bayesian statistical engines.
