In Part 1 of this series on Data Visualization at Netflix, I recapped four engineers’ experiences working with data at Netflix. The second part of the interviews got a little more technical, so I decided to keep their complete answers to the questions below.
Q: What are the characteristics of the data you are working with? Dimensionality, missing values, data size, etc.?
Kanishka Bhaduri: The most important thing is that it’s big. Whatever the machine size is, you can always have a data set which doesn’t fit in the machine. So a big challenge is, “How do you sample that data set? How do you build models where, even if it’s 100,000 dimensions, you can still figure out which are the most important features to look at?” The other aspect is to make sense of the data when there are missing values. Decision trees, for example, are great at handling missing values, but if you use logistic regression instead of a decision tree, then how do you handle those missing values?
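To make the missing-value question concrete, here is a minimal sketch of one common workaround for models such as logistic regression that cannot consume missing values directly: fill each gap with the column mean and append a flag recording that the value was missing. This is an illustration only, not a description of what the team does; the function name `impute_with_indicator` and the toy rows are hypothetical.

```python
def impute_with_indicator(rows):
    """Mean-impute missing values (None) per column and append 0/1 missing flags."""
    n_cols = len(rows[0])

    # Column means computed from observed values only (0.0 if a column is empty).
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    out = []
    for r in rows:
        filled = [means[j] if r[j] is None else r[j] for j in range(n_cols)]
        flags = [1.0 if r[j] is None else 0.0 for j in range(n_cols)]
        out.append(filled + flags)  # original columns first, then one flag per column
    return out


# Hypothetical toy data: two features, with one gap in each column.
rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
imputed = impute_with_indicator(rows)
```

The extra flag columns let a linear model learn a separate effect for "value was missing", which plain mean imputation would hide.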
Fernando Amat: As you can imagine, there is a lot of data at Netflix, and it is very diverse. Generally, you can think of it along two main axes. The first is the viewpoint of the user. For each user we can have each play that the user has played, at what time, what movie it was, for how long. You can already see that these things are very heterogeneous. And the number of users is very large. I think we’re currently close to 85 million subscribers. The other axis is the video collection perspective. In this case, you have all of the metadata of the show, when it was produced, the main actors, etc. This is relatively smaller. Our catalog is in the few thousands. You can see that it’s very heterogeneous again. You have these two axes, but between them, there are many ways to slice and dice the data. One final important part of the data is its power-law skew. There are a few shows that are watched a lot, that are very popular, and a long tail of shows with fewer views. Similarly, most users watch a couple of hours of shows per week or less, but there are some who watch much more. You have to balance your model so that you don’t fit only the popular users or titles.
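One generic way to keep a model from fitting only the popular titles is to weight each training example inversely to how often its title appears. The sketch below shows that idea under stated assumptions: the function name `inverse_popularity_weights` and the play log are hypothetical, and the interview does not say which balancing technique is actually used.

```python
from collections import Counter


def inverse_popularity_weights(title_ids):
    """Weight each play by n / (k * count(title)), so every title contributes equally.

    n is the number of plays and k the number of distinct titles; the weights
    sum to n, which keeps the overall scale of the loss unchanged.
    """
    counts = Counter(title_ids)
    n = len(title_ids)
    k = len(counts)
    return [n / (k * counts[t]) for t in title_ids]


# Hypothetical play log: one popular title and one long-tail title.
plays = ["hit", "hit", "hit", "niche"]
weights = inverse_popularity_weights(plays)
```

Here each play of the popular title gets a weight below 1 and the single long-tail play gets a weight of 2, so both titles carry the same total weight in training.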
Hussein Taghavi: Data is a very general term. We can think of data as raw data that’s logged from a customer, or data that’s ready to go into a machine learning model. Therefore, there is a large process to go from raw data that’s logged into the data ready for training. We use various kinds of data. We have user history data. We have a lot of data about the videos. Generally, in the raw format they are enormous with many types of values. What we have to do is build a pipeline to go from that into preprocessed data, and in that, we handle a lot of different types. We have data that have missing values. We have data with a lot of noise in them. We have to be mindful of that when we select the machine learning model as well as when we’re doing the preprocessing.
Kenny Xie: I mainly work with two kinds of data. The first kind is customer behavior data. For example, how many hours customers stream on Netflix in a certain time window, say one month? How often do customers visit Netflix? And where do customers discover shows? Is it from the home page or from search? The second kind of data I work with is demographic data, such as the sign-up country of a member, and the device that the customer uses to stream. We have tens of millions of users who stream billions of hours in a quarter. So our data is truly at a massive scale.
Q: What are the biggest challenges that you face in your job?
KB: I think the biggest challenge is to develop the intuition for a new problem or domain. So if tomorrow somebody comes and tells me, “Let’s improve this feature,” first I need to understand, what does it mean? Now, if I understand that part then I have to go and get the data for it. One of the other big challenges is to clean the data. I would say typically in my work, 60% to 70% of the time is just used up in downloading the data and getting it into the format that I want, and then there’s the big question of how do you clean it? How do you impute missing values? How do you make sure that the data is not biased? And those checks and those things. So, putting together the final data set for modeling is the biggest challenge as a data scientist, apart from which model to select.
FA: At my job at Netflix, I face quite a few challenges, but if I had to single out three of them, I would say the first one is definitely good pre-processing of the data. So, I mentioned before we have all this metadata from videos and all this user history, but rarely do we put this kind of raw data into the algorithms. You have to make sure to extract some of the signal, clean the data, so that the input data that goes into your machine learning algorithms is good and has a clear signal. So this kind of pre-processing involves a lot of visualization and intuition and is one of the challenges. The second one is the correlation between what we call offline metrics and online metrics. So, when you’re prototyping and testing on your own machine or desktop, basically you have historical data. You can calculate all sorts of metrics, like accuracy, precision, recall, etc. But of course, that is not the real goal. The real goal is that online this translates to a better customer experience, so that the right content is recommended to the right people. Basically, making sure that the offline metrics correlate with better service is quite challenging. The third one is what we call the pipeline from prototyping to product, to make it as fast as possible. So we talked about using Python for prototyping and then Java for production. You can imagine that sometimes you have to rewrite most of your code, so you want to minimize the transition and maximize how much you can reuse from one to the other. That way you can try as many ideas as possible, as fast as possible. So this kind of velocity of innovation is one of the challenges that we face here.
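The offline metrics mentioned here are computed from historical labels and model predictions. As a minimal sketch, assuming binary labels where 1 means "the member played the title" (the labels, predictions, and the function name `offline_metrics` are made up for illustration):

```python
def offline_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical historical labels and model predictions.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
accuracy, precision, recall = offline_metrics(y_true, y_pred)
```

Numbers like these are cheap to compute on a desktop, which is exactly why they are only a proxy: whether they track the online customer experience has to be checked separately.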
HT: We are in a field that is very dynamic. Tools are changing all the time. Algorithms are improving. And it’s important to know when to move on to the next thing. There is a trade-off: staying up to date and always moving to the latest technology or latest algorithms comes at the cost of having to move all the time and not being able to have an infrastructure that matures on an existing algorithm. So figuring out the right trade-off between the two is, I would say, one of the challenges in our jobs.
KX: The first one is to define the right metrics to measure the performance of a test. The second one is to collect the right data to calculate the metrics. And the final one is to make sensible recommendations based on the metrics’ results.
Q: What type of models do you or your team primarily work with? Linear models like linear regression? Deep Learning?
KB: My team consists of at least eight to ten data scientists who have very different backgrounds. Some of them are machine learning experts. Some of them come from the finance domain. Some of them come from the neuroscience world. Some of them are physicists. We have a whole gamut of people who work on very different kinds of models. Some of the consistent themes that I’ve seen across our team are decision trees, gradient-boosted machines, linear models, of course, logistic regression, and deep learning, which is used to some extent. Then there are factorization machines and other kernel methods, and these are very active. I’ve also seen people work with factor analysis and other variants of low-dimensional representations.
FA: In my team, we primarily work with classification methods. So most of the tasks, because we work on recommendations for the home page, are related to trying to predict the probability of a play for a title, which is a binary classification. So any method that you have learned, like logistic regression or trees for classification, will serve. I think the main distinction is the relation between features and the algorithms. So you can choose a linear method like logistic regression, but then have a lot of features that you can put in the model. Or, you can go the other way around: a smaller set of features that are more complex. Often, these features are just the output of another model. Then you have nonlinear algorithms like classification trees or things like this.
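Framing "probability of a play" as binary classification can be sketched with a tiny logistic regression trained by batch gradient descent. This is a toy under stated assumptions, not the production recommender: the single feature (a hypothetical "affinity" score between 0 and 1), the data, and the function names are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(play) minus the 0/1 label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_play_probability(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical training data: one affinity feature per (member, title) pair,
# labeled 1 if the member played the title and 0 otherwise.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
p_low = predict_play_probability(w, b, [0.1])
p_high = predict_play_probability(w, b, [0.9])
```

The same interface would hold for a tree-based classifier; what changes, as described above, is how much work is pushed into the features versus the algorithm.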
HT: In the Netflix recommender system, we use a variety of machine learning algorithms. We use a lot of supervised algorithms. Particularly, classification is very, very common. And within the umbrella of classification, there are also a variety of algorithms that we use. For instance, we use logistic regression. We use tree-based algorithms. And we also use neural networks. The difference is that generally, when we want to start a new model, we tend to experiment first with simpler models like logistic regression. And only when we see a clear advantage do we move on to a more complex algorithm. Now, on the unsupervised side, we also use a lot of algorithms for clustering and dimensionality reduction, and that can be used just for understanding the data or it could be used for generating features that go into supervised algorithms.
KX: We sometimes use linear regression models or logistic regression models to analyze A/B tests. We also use survival analysis to model customer retention.
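Survival analysis for retention can be illustrated with the Kaplan-Meier estimator, a standard non-parametric estimate of the probability that a member is still subscribed after a given time. The interview does not say which survival method is used, so treat this as one possible sketch; the function name and the membership data are hypothetical.

```python
def kaplan_meier(durations, churned):
    """Return [(t, S(t))] at each time t where at least one churn was observed.

    durations: time each member was observed (e.g. months since sign-up).
    churned:   True if the member cancelled at that time, False if the member
               was still subscribed when observation ended (censored).
    """
    event_times = sorted({d for d, c in zip(durations, churned) if c})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)  # still observed at time t
        events = sum(1 for d, c in zip(durations, churned) if d == t and c)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical members: months observed and whether they cancelled.
months = [1, 2, 2, 3, 4]
churned = [True, True, False, True, False]
curve = kaplan_meier(months, churned)
```

Censored members still count toward the at-risk group until their observation ends, which is what distinguishes this estimate from simply dividing cancellations by sign-ups.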