# Experimental Management

# Required reading before the experiment

# Common Steps to Developing an Experiment

Experimental problem: The conversion rate of the pop-up window is very low.
Experimental hypothesis: Modifying the text of the pop-up button can improve the conversion rate of the pop-up.
Experimental strategy: Two groups of users are randomly selected as control group A and experimental group B. Control group A sees the current online copy, and experimental group B sees the new copy. Exposure and click data are collected for both groups.
Experimental evaluation: Determine whether the pop-up conversion rate improves in experimental group B. If it does, the hypothesis is verified. Otherwise, a new experimental hypothesis needs to be proposed and verified.

# Understand the experimental indicators

Core indicators: The indicators that the experiment focuses on. Only one core indicator is allowed for each experiment.
Monitoring indicators: Indicators observed alongside the core indicator. Generally, an experiment is not allowed to have a negative effect on them.

# Explanations of Common Experimental Terms

Business sensitivity and recommended sample size: Business sensitivity is the size of the difference that the experimenter wants to detect. For example, if you want to detect a difference of more than 5 percent, the business sensitivity is 5 percent. The smaller the business sensitivity, the larger the required sample size. Put simply, more samples are needed to detect smaller differences. The platform's sensitivity is currently fixed at 10 percent, and the platform calculates the recommended sample size for each group at this sensitivity. The recommended sample size is the number of samples needed to detect a difference of 10 percent. If no significant difference is observed at the end of the experiment and the number of users in the experiment has reached the recommended sample size, then at 80 percent statistical power we can conclude that the control group and the experimental group either do not differ or differ by less than 10 percent. If the recommended sample size has not been reached, we cannot conclude whether there is a difference, and it may be necessary to extend the duration of the experiment or increase its traffic.
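
The relationship between sensitivity and sample size can be illustrated with a minimal sketch. It assumes the indicator is a conversion rate compared with a two-sided two-proportion z-test at a 5% significance level and 80% power, and that sensitivity is a relative difference. The platform's actual formula is not documented here, so the function name `recommended_sample_size` and the 5% baseline rate are illustrative assumptions only.

```python
import math
from statistics import NormalDist


def recommended_sample_size(baseline_rate, sensitivity=0.10, alpha=0.05, power=0.80):
    """Per-group sample size needed to detect a relative lift of `sensitivity`.

    Assumes a two-sided two-proportion z-test (normal approximation).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + sensitivity)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical baseline conversion rate of 5%.
n_10 = recommended_sample_size(0.05, sensitivity=0.10)
n_05 = recommended_sample_size(0.05, sensitivity=0.05)
print(n_10, n_05)
```

Under these assumptions, halving the sensitivity from 10% to 5% roughly quadruples the required sample size per group, which is why smaller differences need more traffic or a longer experiment.
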
Confidence interval [a, b]: The platform currently shows 95% confidence intervals on the results page. This means that if we ran the experiment 100 times and obtained 100 confidence intervals, about 95 of them would contain the true overall difference. Therefore, when the experimental group shows a significant increase, the experimenter can use the lower bound of the confidence interval as a conservative estimate of the increase.
- Both a and b are negative: the experimental group decreased relative to the control group. For example, [-10%, -5%] indicates 95% confidence that the effect of the experimental group decreases by at least 5% compared with the control group after the strategy goes online.
- Both a and b are positive: the experimental group improved relative to the control group. For example, [5%, 10%] indicates 95% confidence that the effect of the experimental group improves by at least 5% compared with the control group after the strategy goes online.
- a is negative and b is positive (the interval contains 0): no valid conclusion can be drawn.
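
The three cases above can be written as a small helper. This is only a sketch of the reading rules described in this section; the function name `interpret_interval` and its return values are assumptions, not part of the platform.

```python
def interpret_interval(a, b):
    """Classify a confidence interval [a, b] for the relative difference (B vs. A).

    Returns (verdict, conservative_estimate). The estimate is None when the
    interval contains 0.
    """
    if a > b:
        raise ValueError("expected a <= b")
    if b < 0:
        # Whole interval below 0: the decrease is at least |b|.
        return "decrease", b
    if a > 0:
        # Whole interval above 0: the increase is at least a.
        return "increase", a
    # The interval contains 0: no valid conclusion.
    return "inconclusive", None


print(interpret_interval(-0.10, -0.05))  # ('decrease', -0.05)
print(interpret_interval(0.05, 0.10))    # ('increase', 0.05)
print(interpret_interval(-0.02, 0.03))   # ('inconclusive', None)
```
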
Significance level: A probability that takes a value between 0 and 1. Taking 5% as an example, it means that when groups A and B actually do not differ, there is a 5% probability that the experiment still reports a significant difference. In other words, the experimental system may conclude that a change works when it actually has no effect after going online. Type I errors (false positives) and Type II errors (false negatives) have to be weighed against each other. If you are more concerned about making a Type II error, the easiest way to reduce it is to raise the significance level, which increases the Type I error.
Business sensitivity: Business sensitivity is used only to calculate the sample size. The resulting sample size is the minimum number of samples needed to detect an x% difference, which helps estimate how much traffic and how much time the experiment needs if you focus on that indicator.
Statistical power: The probability that the test detects the difference when there is a real x% difference (x% being the business sensitivity).
- When the p-value of the experiment is significant, the greater the power, the more likely it is that the true overall difference is close to or greater than x%.
- When the p-value of the experiment is not significant and the power is < 0.8 (taking a target statistical power of 0.8 as an example), the number of samples should be increased.
- When the p-value of the experiment is not significant and the power is >= 0.8 (again taking 0.8 as the target), the experiment most likely has no effect.
Winning probability threshold (for T+1 experiments): When the posterior probability that the mean of the experimental group is greater than the mean of the control group exceeds the threshold, the experiment is considered significant.
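
As an illustration of a winning probability, the sketch below estimates the posterior probability that the experimental group's conversion rate exceeds the control group's. It assumes a conversion-rate indicator with independent uniform Beta(1, 1) priors and uses Monte Carlo sampling. The platform's actual model is not described in this document, and the function name `win_probability`, the counts, and the 0.95 threshold are all hypothetical.

```python
import random


def win_probability(conv_a, users_a, conv_b, users_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior of each rate is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + users_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + users_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


THRESHOLD = 0.95  # hypothetical threshold, not a documented platform value
p = win_probability(conv_a=100, users_a=1000, conv_b=150, users_b=1000)
print(p, p > THRESHOLD)
```
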
The layer concept: Layers are designed primarily for traffic reuse. Depending on whether traffic can be reused between experiments, an experiment platform can split traffic in two ways.
- Diversion method 1: Traffic cannot be reused between experiments. Suppose you want to test the effect of the button font color and start an experiment that uses 50% of the traffic. You also want to test the effect of the button background color and start another experiment that uses the remaining 50%. Because traffic cannot be reused, no idle traffic is left for a new experiment.
- Diversion method 2: Traffic can be reused between experiments. Each experiment randomly draws a portion of the overall traffic. For example, 50% of the overall traffic is randomly selected for button font color experiment A, and then another random 50% of the overall traffic is selected for button background color experiment B. The traffic of A and B overlaps (some users hit both experiments). In theory, an unlimited number of experiments can run this way.

Both diversion methods have problems. With method 1 the number of experiments is limited: traffic cannot be reused, the total is 100%, and once it is used up nothing is left. With method 2 unlimited experiments can be started, but traffic is fully reused and experiments may interfere with each other. For example, the experimental group of experiment A sets the button font color to white, and the experimental group of experiment B sets the button background color to white. A user may hit the experimental groups of both A and B and then see a button whose font and background are both white, which makes the text unreadable.

To run as many experiments as possible while reducing interference between them, the platform introduces the concept of layers. The resulting diversion method is as follows:
- Diversion method 3: You can create many layers stacked on top of one another. Each layer contains 100% of the overall traffic. Experiments are created within layers. An experiment can belong to only one layer, and multiple experiments can be created in one layer. Traffic cannot be reused between experiments in the same layer (equivalent to diversion method 1, often called mutually exclusive traffic), while traffic is reused across layers (equivalent to diversion method 2: different layers are randomized independently and do not affect each other, often called orthogonal traffic).

Use the analogy of a tall building. Creating an experiment layer on the platform is equivalent to adding another floor to the building. To determine which experiments a user hits, start on the first floor and walk up through every floor. On the first floor (the first layer), determine which experiment the user hits on that floor. A user can hit at most one experiment per floor and may hit none. On the second floor, whether the user hits an experiment, and which one, is not affected by the first floor, and the same applies to every subsequent floor.
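
The floor-by-floor walk can be sketched with hash-based bucketing. This is an illustrative assumption about how layered diversion could be implemented, not the platform's actual algorithm. The layer names, experiment names, and bucket ranges are hypothetical.

```python
import hashlib


def bucket(user_id, layer_name, buckets=100):
    """Map a user to a bucket in [0, buckets) for one layer.

    Salting the hash with the layer name makes bucketing in different
    layers effectively independent (orthogonal traffic).
    """
    digest = hashlib.sha256(f"{layer_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(user_id, layers):
    """Walk every layer and return {layer_name: experiment_name} for the hits."""
    hits = {}
    for layer_name, experiments in layers.items():
        b = bucket(user_id, layer_name)
        for name, start, end in experiments:
            # Bucket ranges inside one layer do not overlap, so a user
            # hits at most one experiment per layer (mutually exclusive traffic).
            if start <= b < end:
                hits[layer_name] = name
                break
    return hits


# Hypothetical setup: two interfering UI experiments share one layer,
# while an independent ranking experiment lives in another layer.
LAYERS = {
    "ui_layer": [("font_color", 0, 50), ("background_color", 50, 100)],
    "ranking_layer": [("sort_policy", 0, 50)],
}

print(assign("user-42", LAYERS))
```

With this layout a user can never be in both the font color and the background color experiments, but can be in one of them and in the sorting policy experiment at the same time.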

With the layer structure, you can put experiments that would interfere with each other in the same layer, such as the button font color and background color experiments mentioned above. Because a user can hit at most one experiment in a layer, the font and background can never end up the same color. Conversely, put independent experiments into different layers to reuse traffic, so that more experiments can run at the same time. For example, a back-end sorting policy experiment and a front-end button color experiment can be placed in different layers.

To summarize, diversion methods 1 and 2 are two extremes. Method 1 means the entire system has only one layer, and method 2 means every experiment sits in a layer of its own. Method 3 is a compromise between the two. The platform cannot know in advance whether your experiments interfere with each other, which is why layer management is exposed to you.

# Experiment FAQ

Why is there no data in the indicator results?

- The experiment has not been published. After an experiment is created, it must be published, and data can be observed the following day.
- The experiment went live today. Indicator data is updated daily, so data can be observed only from the day after the experiment goes online.
- Every group in the experiment has 0 traffic. In this case no user hits the experiment. Make sure that the published Weixin Mini Program calls the wx.getExptInfoSync interface.

How long should the experiment run?

It is recommended to run the experiment for at least one week, for the following reasons:

- If the experiment is too short, the novelty effect can easily lead to misjudgment. Users may try a new feature out of curiosity even if they do not like it. In such a scenario an indicator may rise for a few days and then fall. The experiment should therefore run long enough to avoid misjudgments caused by novelty.
- User behavior may vary over time. For example, users may behave differently on weekdays and weekends. The periodicity of the product should therefore be analyzed, and the experiment should cover at least one full cycle where possible.
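
A rough way to plan the duration is to combine the recommended sample size with the one-week guideline. The sketch below assumes a weekly behavior cycle, an even traffic split between groups, and a constant number of new users per day. The function name `estimate_run_days` and all the numbers are hypothetical.

```python
import math


def estimate_run_days(sample_per_group, groups, daily_users, traffic_ratio, cycle_days=7):
    """Days needed to reach the per-group sample size, rounded up to whole cycles."""
    # Experiment traffic is divided evenly among the groups.
    users_per_group_per_day = daily_users * traffic_ratio / groups
    days_for_sample = math.ceil(sample_per_group / users_per_group_per_day)
    # Run at least one cycle, and always a whole number of cycles.
    cycles = max(1, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days


# Hypothetical numbers: 31,231 users needed per group, two groups.
print(estimate_run_days(31231, groups=2, daily_users=100000, traffic_ratio=0.2))  # 7
print(estimate_run_days(31231, groups=2, daily_users=10000, traffic_ratio=0.5))   # 14
```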

Why does the experiment show an effect, but the effect disappears after full rollout?

- Type I error. The experiment platform is based on statistical hypothesis testing, in which Type I errors are unavoidable. Specifically, the platform may still conclude that the two groups are significantly different when there is actually no difference between them. The probability of this is 5 percent, about once in every 20 experiments.
- The indicator results show the improvement in the sample. After going online, there is a certain probability that the overall improvement is smaller than the sample improvement. It is recommended to pay attention to the lower bound of the confidence interval in the indicator results.

Why can't I check the results every day and go online as soon as there is a significant improvement?

Doing so inflates the Type I error. The correct method is to fix the duration of the experiment before it starts and to make decisions based only on the conclusions at the end of the experiment. In that case the experiment platform can keep the Type I error rate at 5%. If the experimenter looks at the results every day and stops the experiment as soon as the result is significant, the Type I error is inflated. For example, suppose there is no difference between the two groups and the experimenter checks for seven days in a row, going live as soon as a significant difference appears. Out of 100 such experiments, more than 30 will make a Type I error (treating the daily checks as independent, the Type I error rate is 1 - (1 - 0.05) ^ 7, about 30.2%).
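
The arithmetic behind this figure can be checked directly. The sketch treats each daily check as an independent test at the 5% level, which is a simplification used by the formula in the text. The function name is an assumption for illustration.

```python
def peeking_false_positive_rate(alpha=0.05, looks=7):
    """Chance of at least one false positive across `looks` independent checks."""
    return 1 - (1 - alpha) ** looks


for looks in (1, 7, 14):
    print(looks, round(peeking_false_positive_rate(looks=looks), 4))
```

A single check at the end keeps the rate at 5%, seven daily checks raise it to about 30%, and the rate keeps growing with every additional look.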

# New A/B experiment

Enter the experiment tab. The experiment tab consists mainly of two parts: the experiment scale overview and the list of experiments.

Experiment scale overview: View the number of users currently in experiments and the number of available users in a given experimental layer, to help determine the experimental traffic available in that layer.
List of experiments: View the experiments that have been created and create new experiments.

New experiment

Basic information
- Statistical cycle
  - 1 day: The conclusions of the experiment can be viewed after 1 day, which allows experiments with large differences to be detected quickly.
  - 7 days: The conclusions of the experiment can be viewed after 7 days. The conclusions are more accurate, and this cycle is suitable for all experiments.
Experimental indicators
- Core indicators
  - The indicators that the experiment focuses on. The conclusions of the experiment are assessed mainly on the effect on the core indicators.
  - The system provides key Weixin Mini Program indicators by default, and users can choose among them as core indicators according to the needs of the experiment.
  - Users can also define custom indicators, as described in [Indicator Management - Indicator Creation].
- Monitoring indicators
  - Monitoring indicators assist in observing the effects of the experiment. It is recommended to select no more than 10 monitoring indicators at a time. See [Indicator Management - Indicator Creation] for details on creating indicators.
Experimental traffic
- Experimental layer
  - Select an experimental layer that has been created. Traffic in different layers is orthogonal, which allows traffic reuse (the same user can hit multiple layers at the same time). If a single layer does not have enough traffic, a new layer can be added.
  - Set the proportion of total experimental traffic. The experimental groups divide this traffic evenly.
- Experimental population
  - Random population: The control and experimental groups are drawn at random from all users.
  - Targeted population: Define a user profile and draw users at random from those who match it. Targeting by device, by new versus existing Weixin Mini Program users, and by custom audience is supported.
- Experimental parameters
  - The experimental parameter is the unique identifier that distinguishes the experiment. Make sure that it is unique.
  - The experimental parameter is a string, used mainly to control which experimental group a user belongs to (the parameter value determines the group the user is assigned to).
  - The experimental parameter is obtained through the base library interface wx.getExptInfoSync, and the Mini Program runs different experimental logic according to its value. See [Experimental Interface].
- Experimental grouping
  - Set the experimental parameter values so that developers can distinguish between the control group and the different experimental groups.
  - Set the "test WeChat account" (this takes effect as soon as the experiment is submitted; the experiment does not need to be published).
  - Test with the "test WeChat account"
    - Test on the development version and the trial version (recommended).
    - Test in the [developer tools (version 1.0.2106292 or above)](https://developers.weixin.qq.com/miniprogram/dev/devtools/nightly.html) with base library 2.17.1 or above. If the experimental parameters cannot be obtained, recompile and run again.
Submit the experiment
- After submission, the experiment appears in the list of experiments.

Publish the experiment

Publish the experiment and check whether the real-time number of hit users is normal.

Modify an experiment

Before the experiment is published, all of its settings can be modified, such as the proportion of total traffic, the indicators, and the experimental groups.
An ongoing experiment supports modifying only the proportion of total experimental traffic. The experimental groups divide the traffic evenly, which allows experimental traffic to be ramped up gradually (gray release).

Full rollout of an experiment

If the experimental effect is significant, one experimental group can be rolled out to all traffic without releasing a new Weixin Mini Program version. A full rollout can also be withdrawn.

# Experimental Interpretation

After creating the experiment, open the experimental data tab to view the experimental results, which mainly include:

Indicator trend chart: Provides a daily trend chart of offline indicators and a real-time trend chart of the number of hit users.
Overall data: Provides the cumulative number of hit users and an analysis of the effect on each indicator in the experimental and control groups.
Core indicators: Core indicator details, providing the numerator, the denominator, and the confidence interval used in the calculation.
Experimental conclusions: Once the experiment has run for 7 days or more, the system automatically provides experimental conclusions to help the user judge the effectiveness of the experiment.
Population analysis: The population can be restricted (the system currently supports gender, region, and device) to analyze the experimental results further and explore the effect on different populations.

# Trend chart of indicators

Data cycle
- Past data: Offline indicator data, updated daily.
- Real-time data: For now, only the real-time number of hit users is provided, updated every 10 minutes.

# Overall data

Cumulative number of hit users: The cumulative number of users hit in each group from the start of the experiment to the present. If traffic is split evenly, the groups should show similar numbers.
Core indicators: Provides core indicator statistics and the relative improvement of the experimental group compared with the control group. A significant positive improvement is marked in green, and a significant negative decline is marked in red.
Monitoring indicators: Interpreted in the same way as the core indicators.

# Core indicators

Recommended sample size: For a detailed explanation, see the explanation of common experimental terms under [Experimental Management - Required reading before the experiment].
Confidence interval: For a detailed explanation, see the explanation of common experimental terms under [Experimental Management - Required reading before the experiment].

# Experimental conclusions

Premises for experimental conclusions
- The experiment has run for 7 days or more.
- Traffic is homogeneous across groups. If the groups differ in quality, the traffic is heterogeneous, and creating a new experiment is recommended.

Classification of experimental conclusions
- Positive: The core indicator is positive and no monitoring indicator is negative.
- Positive, pending observation: The core indicator is positive, but a monitoring indicator is negative.
- Negative: The core indicator is negative.
- No effect: The core indicator is not significant, and the number of hit users is greater than or equal to the recommended sample size.
- No conclusion can be drawn: The core indicator is not significant, and the number of hit users is less than the recommended sample size.
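
The premises and classification rules above can be summarized as a decision function. This is a sketch of the rules as written in this section, not the platform's implementation. The function name, parameter names, and return strings are assumptions.

```python
def classify_conclusion(days_run, traffic_homogeneous, core, monitoring_negative,
                        hit_users, recommended_sample_size):
    """Return the conclusion category, or None if the premises are not met.

    core: "positive", "negative", or "not_significant" (result for the core indicator).
    monitoring_negative: True if any monitoring indicator is significantly negative.
    """
    if core not in ("positive", "negative", "not_significant"):
        raise ValueError(f"unknown core result: {core!r}")
    # Premises: at least 7 days of data and homogeneous traffic across groups.
    if days_run < 7 or not traffic_homogeneous:
        return None
    if core == "positive":
        return "positive, pending observation" if monitoring_negative else "positive"
    if core == "negative":
        return "negative"
    # Core indicator not significant: the sample size decides the conclusion.
    if hit_users >= recommended_sample_size:
        return "no effect"
    return "no conclusion can be drawn"


print(classify_conclusion(7, True, "positive", False, 40000, 31231))         # positive
print(classify_conclusion(7, True, "not_significant", False, 20000, 31231))  # no conclusion can be drawn
```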