Interleaving: Netflix's fast online evaluation method for recommendation models



According to AI Technology Review, the author of this article is senior Silicon Valley engineer Wang Zhe. /p/68509372

This is the eighteenth article in "Wang Zhe's Machine Learning Notes". Today we turn to model evaluation and online testing. Experienced algorithm engineers know very well that a model's development cycle is, in practice, a cycle of feature engineering, model evaluation, and deployment. Now that machine learning platforms are very mature, implementing and adjusting a model structure takes only a few lines of code.

Therefore, improving the efficiency of model evaluation and online A/B testing would go a long way toward freeing up algorithm engineers' time.

In this article, we introduce the "secret weapon" that streaming giant Netflix uses for fast online evaluation: interleaving.

Netflix is well known as a US streaming giant, not only for its famous original dramas and high market value, but also because it has long been at the forefront of recommendation technology. A key technology driving the rapid iteration of Netflix's recommendation system is the fast online evaluation method we introduce today: interleaving.

Background: Netflix's recommendation system

Almost every page of Netflix is driven by recommendation algorithms, each optimized for a different recommendation scenario. As shown in Figure 1, "Top Picks" on the homepage offers recommendations based on a personalized ranking of videos, while "Trending Now" reflects recent popularity trends. Together, these personalized rows make up the personalized homepage for Netflix's nearly 100 million members.

Figure 1: Example of a personalized Netflix homepage. Each row is a recommendation category; within a row, the left-to-right ordering of videos is determined by a specific ranking algorithm.

For a company as algorithm-driven as Netflix, iterative innovation in algorithms is essential. To maximize Netflix's business goals through its algorithms (key business metrics include monthly subscriber count and total watch time), a large number of A/B tests are needed to verify whether new algorithms actually improve these key product metrics.

This creates a contradiction, namely:

the tension between algorithm engineers' ever-growing demand for A/B tests and limited online A/B testing resources.

Online A/B tests consume precious online traffic and may even hurt the user experience, yet online traffic is obviously limited and only a small fraction of it can be used for A/B testing. Meanwhile, on the algorithm R&D side, algorithm-driven scenarios keep multiplying, and a large number of candidate algorithms need to be evaluated one by one. The tension between the two is bound to intensify, which urgently calls for a fast online evaluation method.

To this end, Netflix designed a two-stage online testing process (see Figure 2).

In the first stage, a testing method called interleaving rapidly screens the candidate algorithms, selecting a small set of "excellent" ranking algorithms from a large number of initial ideas.

In the second stage, traditional A/B tests are run on this reduced algorithm set to measure their long-term impact on user behavior.


Everyone is familiar with traditional A/B testing, so this article focuses on how Netflix performs fast online evaluation with the interleaving method.


Figure 2: Using interleaving for fast online testing. Light bulbs represent candidate algorithms; the winning algorithm is shown in red. Interleaving can quickly narrow down the initial candidates, determining the best algorithm faster than traditional A/B testing.

Problems with traditional A/B testing

Beyond the efficiency problem, traditional A/B testing also has issues with statistical sensitivity. Here is a typical A/B testing example.

Suppose we run an A/B test to determine whether a user population prefers Coca-Cola or Pepsi. Following the traditional approach, we randomly split the test population into two groups and run a "blind test", i.e., without revealing the cola brands. The first group is served only Coca-Cola and the second only Pepsi; we then observe the cola consumption of each group over a period of time to see whether people prefer Coca-Cola or Pepsi.

This experiment is valid in the general sense, and it is how such tests are often done. But there are some potential problems:

Cola consumption habits vary widely within the test population, ranging from people who hardly ever drink cola to those who drink large amounts every day.

Heavy cola consumers make up only a small fraction of the total test population, but they may account for a large proportion of overall cola consumption.

Together, these two issues mean that even a tiny imbalance in heavy cola consumers between groups A and B can have a disproportionate impact on the conclusion.


The same problem exists in Internet scenarios. In Netflix's case, for example, highly active users are a minority, but the watch time they contribute accounts for a large share of the total. Therefore, whether Netflix's active users land in group A or group B of an A/B test has an outsized effect on the results, masking the model's true effect.

So how do we solve this problem? One approach is not to split the test population at all, but to let every tester freely choose between Pepsi and Coca-Cola (still without brand labels, though the two colas can be told apart). At the end of the experiment, we compute each person's consumption ratio of Coca-Cola to Pepsi, then average these per-person ratios to obtain the overall preference.

The advantages of this test design are:

it eliminates the problem of uneven tester distribution between groups A and B;

by giving every person the same weight, it removes the outsized influence of heavy consumers on the result.
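As a concrete illustration of the equal-weight averaging described above (a minimal sketch with made-up consumption logs, not real data), we can first compute each tester's own Coca-Cola share and only then average across testers:

```python
from collections import defaultdict

# Hypothetical consumption logs: (tester_id, brand) events.
# u3 is a "heavy consumer" with far more events than the others.
events = [
    ("u1", "coke"), ("u1", "coke"), ("u1", "pepsi"),
    ("u2", "pepsi"),
    ("u3", "coke"), ("u3", "coke"), ("u3", "coke"), ("u3", "coke"),
]

per_user = defaultdict(lambda: [0, 0])  # tester -> [coke_count, total_count]
for user, brand in events:
    per_user[user][1] += 1
    if brand == "coke":
        per_user[user][0] += 1

# Average the per-tester shares, so u3 counts no more than anyone else.
shares = [coke / total for coke, total in per_user.values()]
overall_pref = sum(shares) / len(shares)
```

Note that a naive pooled average (6 coke events out of 8, i.e. 0.75) would be dominated by the heavy consumer u3, while the per-tester average here is (2/3 + 0 + 1) / 3 ≈ 0.56.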

Applying this testing idea to Netflix's scenario gives us interleaving.

Netflix's fast online evaluation method: interleaving

Figure 3 illustrates the differences between A/B testing and interleaving.

In a traditional A/B test, Netflix selects two groups of subscribers: one receives recommendations from ranking algorithm A, and the other receives recommendations from ranking algorithm B.

In an interleaving test, there is only one group of subscribers, who receive an interleaved ranking generated by mixing the outputs of algorithms A and B.


This lets users see recommendations from algorithms A and B at the same time (a user cannot tell whether an item was recommended by algorithm A or algorithm B). Metrics such as watch time can then be computed per item to measure whether algorithm A or algorithm B performs better.

Figure 3: Traditional A/B testing vs. interleaving. In a traditional A/B test, users are split into two groups, one exposed to ranking algorithm A and the other to algorithm B, and core evaluation metrics such as watch time are compared between the two groups. Interleaving instead exposes all test users to a mixed ranking of algorithms A and B, then compares the metrics of the items attributed to each algorithm.

Of course, when testing with the interleaving method, position bias must be taken into account, to avoid videos from algorithm A always ranking first. Algorithms A and B therefore need to lead with equal probability. This is similar to a pickup game in which the two captains first flip a coin to decide who picks first, then take turns choosing players.

Figure 4: Mixing the videos of two ranking algorithms in "captain picks" fashion. Ranking algorithms A and B each produce a list of recommended videos; a coin toss decides whether the first video comes from algorithm A or B, and then videos are drafted alternately from each algorithm's list, from top to bottom.
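The "captain picks" drafting and the per-algorithm attribution can be sketched as follows. This is a minimal team-draft-style implementation under my own assumptions (function names, tie-breaking details, and the watch-time attribution are illustrative, not Netflix's production code):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Merge two ranked lists 'captain picks' style: whichever algorithm
    has drafted fewer videos picks next (a coin toss breaks ties), and
    each pick is that algorithm's highest-ranked video not yet drafted."""
    merged, team_a, team_b = [], set(), set()
    all_items = set(ranking_a) | set(ranking_b)
    while len(merged) < len(all_items):
        a_picks = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5
        )
        source, team = (ranking_a, team_a) if a_picks else (ranking_b, team_b)
        remaining = [v for v in source if v not in merged]
        if not remaining:  # this list is exhausted; draft from the other one
            source, team = (ranking_b, team_b) if a_picks else (ranking_a, team_a)
            remaining = [v for v in source if v not in merged]
        merged.append(remaining[0])
        team.add(remaining[0])
    return merged, team_a, team_b

def interleaving_winner(team_a, team_b, watch_minutes):
    """Credit each video's watch time to the algorithm that drafted it."""
    score_a = sum(watch_minutes.get(v, 0) for v in team_a)
    score_b = sum(watch_minutes.get(v, 0) for v in team_b)
    return "A" if score_a > score_b else ("B" if score_b > score_a else "tie")
```

Because the coin toss decides which algorithm leads each round, neither algorithm's videos systematically occupy the top positions, which is exactly the position-bias correction described above.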

Having explained the interleaving method, we still need to verify whether this evaluation approach can replace traditional A/B testing without leading to wrong conclusions. Netflix verified this from two angles: the "sensitivity" of interleaving and the "correctness" of interleaving.

Comparing the sensitivity of interleaving with traditional A/B testing

With this group of experiments, Netflix wanted to verify how efficiently the interleaving method can determine which of algorithms A and B is better. We have repeatedly emphasized how scarce online testing resources are, so the hope here is that interleaving can resolve the evaluation with fewer online resources and fewer test users. This is the "sensitivity comparison".

Figure 5 shows the experimental results. The horizontal axis is the number of samples in the experiment. Netflix does not give a precise explanation of the vertical axis, but we can interpret it as the error rate in determining that algorithm A is better than algorithm B. As shown, the interleaving method needs only about 10^3 samples to determine whether algorithm A is better than B, while an A/B test needs 10^5 samples to bring the error rate below 5%. This means that with the resources of a single A/B test, we can run 100 interleaving experiments, which greatly strengthens online testing capacity.

Figure 5: Sensitivity of interleaving vs. traditional A/B test metrics. Compared with the most sensitive A/B test metric, interleaving needs only 1/100 of the subscriber sample to determine which algorithm users prefer.
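To build intuition for why more samples drive the error rate down, here is a toy simulation (assumed numbers, not Netflix's actual experiment): suppose each interleaving user independently prefers the truly better algorithm A with probability 0.55, and we declare A the winner when a majority of users prefer it.

```python
import random

def correct_rate(n_users, p_prefer_a=0.55, trials=2000, seed=42):
    """Fraction of simulated experiments in which a majority vote of
    n_users correctly identifies algorithm A as the better one."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a_votes = sum(rng.random() < p_prefer_a for _ in range(n_users))
        wins += a_votes * 2 > n_users
    return wins / trials

# The error rate (1 - correct_rate) shrinks as the sample grows.
for n in (100, 1000, 10000):
    print(n, correct_rate(n))
```

The Netflix result in Figure 5 is the analogous curve measured on real traffic: interleaving's curve reaches a given error rate with roughly 100x fewer subscribers than the best A/B test metric.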

Correlation between interleaving metrics and A/B test metrics

Besides enabling fast evaluation on small samples, whether interleaving's verdicts agree with those of A/B tests is also key to whether interleaving can replace A/B testing as the first stage of online evaluation.

Figure 6 shows the correlation between interleaving experiment metrics and A/B test metrics, where each data point represents a ranking algorithm. There is a very strong correlation between the interleaving metrics and the A/B test metrics, which verifies that algorithms winning interleaving experiments are also highly likely to win the subsequent A/B tests.

Figure 6: Correlation between interleaving metrics and A/B test metrics. Each point represents an experimental result for a ranking algorithm. The interleaving metrics are strongly correlated with the A/B test metrics.
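To make "correlation between metrics" concrete, here is a sketch with made-up paired results for six hypothetical ranking algorithms (the numbers are illustrative, not Netflix's data): each algorithm has an interleaving preference score and an A/B test metric lift, and we compute the Pearson correlation between the two.

```python
import math

# Hypothetical paired results: one entry per ranking algorithm.
interleaving_score = [0.49, 0.52, 0.55, 0.58, 0.61, 0.66]
ab_test_lift = [-0.5, 0.1, 0.4, 0.9, 1.4, 2.1]  # metric lift, %

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(interleaving_score, ab_test_lift)
```

A value of r near 1 is what the tight upward-sloping point cloud in Figure 6 expresses: algorithms that score higher in interleaving also show higher A/B lift.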

Conclusion

Through these experiments, we now know that interleaving is a powerful and fast algorithm validation method, which has accelerated the iterative innovation of Netflix's various ranking algorithms.

But we should also recognize that the interleaving method has certain limitations, mainly the following two:

The engineering framework is more complex than for traditional A/B testing. Because the interleaving experiment logic is entangled with business logic, business logic may be disturbed; and implementing interleaving requires adding a large amount of auxiliary data to the entire data pipeline, which is the engineering difficulty.

Interleaving is, after all, only a relative measurement of users' preferences between two algorithms' recommendations. For example, if we want to know how much algorithm A improves users' overall watch time, interleaving cannot give us that answer. For this reason, Netflix designed the interleaving + A/B test two-stage experimental structure to complete its online testing framework.

To close the article, let's discuss a few practice-oriented questions. I hope everyone will share their views, so that the truth becomes clearer through discussion:

What statistical test is the sensitivity test in the article? Is the vertical axis the p-value? (You can refer to the original link at the end of the article.)

Besides A/B testing and interleaving, what online testing methods do you use in your work?

In my view, besides the two drawbacks introduced above, interleaving has some other potential problems. What do you think they are?

Finally, everyone is welcome to follow my WeChat public account: Wang Zhe's Machine Learning Notes (wangzhenotes), to track frontier machine learning topics such as computational advertising and recommendation systems.

Readers who want to communicate further can also get my WeChat contact through the public account to discuss technical issues.

Note: This article is largely a translation of Netflix's original technical blog ( with substantial additions.
