The Who and The What
Sebastian Thrun and Nitin Sharma of Udacity sit down to talk about the future of Data Science (watch the full video). They cover a range of ideas already mentioned in this blog, like the importance data will play in the future of computing, and their expert opinions offer a lot of insight into this field and where it’s heading, not to mention what you can do to keep up. The interview is in part an advertisement for Udacity’s new Nanodegree program, which Thrun and Sharma say is the best way to break into the field.
In the interview, the Udacity duo tackle questions like
- What’s the difference between a Data Scientist, a Data Analyst, and a Data Engineer?
- How do you transition a traditional development team to become data-centric?
- Is the current Data Science surge a trend or is it here to stay?
- Does it spell the end of software engineering?
- How is doing Data Analysis in a class setting different from the real world?
Below, I’ve taken quotes from the interview that are especially relevant to this blog.
Q: What is Data Science?
Nitin Sharma: To my mind, Data Science is the science of systematically discovering patterns in very large data sets, to extract useful knowledge, and to predict something of value, and to do that systematically, rigorously, is what the Data Science discipline is about.
Sebastian Thrun: There has been an explosion of data that’s available. It’s exponentially growing. And as a result, data has become the future of pretty much every business, and it’s the future of Artificial Intelligence. It’s very, very big.
Q: What opportunities are out there? What can people get involved with in Data Science today?
ST: Value is booming in Data Science right now, and there are tons of open jobs. Data Scientists and Data Analysts are in extremely short supply. (E.g., Netflix, Google.)
NS: And there are lots of other cool projects that are happening in practically every possible industry: health, finance, retail, meteorology and weather predictions, all sorts of areas. IBM has worked on Watson…and now they are using it to do medical diagnostics. In the world of genetics, we are analyzing human genome data to figure out which people are more susceptible to contracting certain kinds of disease. There is hardly any field today that is not touched, that is not being transformed by the availability of data and finding interesting insights from that data.
Q: Are there opportunities in the field for people that want to start their own business?
ST: I can tell you, I mean if you want to start a company in Silicon Valley in the tech field, you very likely will hire a lot of data scientists, and the reason is that no matter what you do, whether it’s oil drilling or in the medical space, where you do personalized medicine, at the end of the day data is going to be your strongest trait. Companies who have really understood this really early on, companies like Google, which started data science at a scale much bigger and grander than anybody else, and as a result produced much better search results which gave them superiority in the business field. Amazon, they do amazing Data Science, not just in their product selection but also the layout of the screen. The entire screen that you see is optimized to make you buy something. I think that that trend from these formerly small companies to big companies is accelerating. So if you start a company today and you don’t do good data work, you likely will fall behind.
Q: I recently heard that the Harvard Business Review called Data Science the sexiest job of the 21st century. Why do you think they’re calling it that?
NS: I think it’s right on because the amount of insight that is hidden in the data is so powerful that it can make the difference between a dramatically successful company and an average company. And the companies that are on top of the data – on what their customers are doing, what they need, what they want, what they will not like – any company that is on top of these trends has a dramatically powerful advantage that will completely overwhelm all of its competitors. It is the equivalent of fighting a war with nuclear weapons versus the other guys fighting with bows and arrows. That’s how powerful Data Science can be.
Q: What do people need to have in their personality and skills to be successful?
NS: One thing is that it’s very important for you to have a solid background in probability and statistics, and have a solid understanding of the data science techniques and algorithms to analyze data, with special emphasis on the strengths and weaknesses of each approach. In what scenario is a decision tree a good way to build a predictive model? When is logistic regression a better algorithm? When is Support Vector Machines the right tool to use? So having a good understanding of the tradeoffs and the strengths and weaknesses is very important.
The second thing I would say is the right skills to process data at a very large scale. Typically we are talking about terabytes or more of data. So you know you have to have a very good understanding of what it takes to build software to handle the data at that scale, to be aware of the tactical considerations, things like outliers or filters or noise in the data, missing data, incorrect values and being able to handle these cases efficiently.
And the third thing I would say is general curiosity and the ability to ask the right questions. Understanding the business context in which these problems appear, figure out what is the right way to solve that problem, and equally importantly, communicate the findings to the business audience so that they can make the best decisions based on that data.
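As an aside, the "tactical considerations" Sharma mentions in his second point (missing data, outliers) can be illustrated with a small sketch. This is my own illustration, not something from the interview: the function name `clean_readings`, the sample values, and the 3.5 cutoff are all assumptions, and the median-based rule shown here is only one common way to flag outliers.

```python
from statistics import median


def clean_readings(values, threshold=3.5):
    """Drop missing entries, then split the rest into kept values and outliers.

    Hypothetical helper for illustration only. Outliers are flagged with a
    median-based score (a "modified z-score"), which a few extreme values
    distort far less than a mean-based rule would.
    """
    # Missing data: keep only the entries that are actually present.
    present = [v for v in values if v is not None]
    if not present:
        return [], []

    center = median(present)
    # Median absolute deviation (MAD): a robust estimate of spread.
    mad = median(abs(v - center) for v in present)
    if mad == 0:
        # At least half the values equal the median; nothing to flag by this rule.
        return present, []

    kept, outliers = [], []
    for v in present:
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # when the data is roughly normal.
        score = 0.6745 * abs(v - center) / mad
        if score > threshold:
            outliers.append(v)
        else:
            kept.append(v)
    return kept, outliers


# Assumed sample data: two missing readings and one obviously bad value.
readings = [10.1, 9.8, None, 10.3, 250.0, 9.9, None, 10.0]
kept, outliers = clean_readings(readings)
print("kept:", kept)
print("outliers:", outliers)
```

At the terabyte scale discussed in the interview you would not hold the data in a Python list, but the same two decisions (what to do with missing entries, and which rule marks a value as suspicious) still have to be made.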
ST: I think that was perfect! The only thing I would add is that there are many different levels of data scientists. There is the PhD level, where people invent new things, new theories. And then there’s more the technician level where people are able to handle these things. I think all of them are important. At the bottom there are many more jobs than at the top. It takes you longer to get to the top. In any level, you always have to understand tools. There are a ton of tools in the industry that people use. If you reinvent everything from scratch, you’re going to spend a lot of time. So we have many tools that you have to understand.
NS: One of the questions that we often get asked is “What is the difference between a Data Scientist, a Data Analyst, and a Data Engineer?” and I think Sebastian just answered that question.
ST: Yeah, the scientist for me is the person who invents the new, crazy stuff, and the Data Analyst is much more at a lower level, but a much broader level with many more jobs, who needs to be able to do meaningful work with data with the existing tools.
Q: For a traditional software development team that wants to evolve to a more data driven product, maybe you could share some advice with them?
NS: So I think what is important is to build the application with scale in mind from the get-go because the data is growing very rapidly and you have to be able to design the algorithms and the applications with scale in mind. So you have to make sure that the application is robust towards potential outliers or missing data or potential corruption in data. So the applications that you develop have to deal with scale, failures, and potential missing values and so on. These applications need to tackle the business problems as opposed to just implementing cool algorithms for the sake of it.
ST: On top of that I would say it’s a cultural issue. In Silicon Valley they’re all very young. If you go to companies that are established and have been around before this data science wave, AKA in 2005, or in the 1900s, there’s usually this theme that there’s a lot of opinions and usually very little data because the tools didn’t exist. So to get a team to be really data-centric, I think you also have to work on management and mid-level management and engineers to really accept data. As you get to the point where data is accepted, a lot of opinions go away and are replaced by data, and I find this extremely refreshing. So when we deal with our content, we don’t argue about opinions, we just look at the data. And that really helps us set the development schedule for our products.
Q: With this prevalence of big data, what are the risks of having 5 different data scientists finding 5 different results from the data? How do you mitigate that?
NS: If the techniques used are rigorous, then probably there will not be insights that are in conflict with each other. So that’s the first part. The second part is that if the 5 data scientists are looking at 5 different aspects of the problem, then it makes the solution all the richer. So there is not necessarily inherently a conflict as long as you follow the right rigorous techniques, and 5 people looking at the problem in different ways might actually come up with more insight that helps you understand the problem and the solutions in a much more interesting way.
ST: In my experience, the more common situation is that I have a data science result, and it completely blows me away. When we built the self driving car, we drove it every day, dozens of them every day to the present day, driving around the Bay Area, and the type of problems that occurred in the data were massively different from anything I could have imagined, and it was only the data that helped us understand what the actual issues were and helped us really shape the development.
Q: Is Data Science a trend? Are you concerned that this big trend will fade out? What do you think the lifespan of this incredible field is?
ST: Well, isn’t everything a trend? I’d say that this is going to be with us for quite a while, in fact. We’ve seen this exponential uptake right now, it’s quite massive, and we see an ability to collect data that’s going up exponentially, and in an ability to process data exponentially, and we see the resulting benefits to go up exponentially to be honest. So I think we’re going to be on it for quite a while.
NS: And I actually think this is just the very beginning, the very first few years of this upward trend that is likely to continue for several decades. The data is just exploding, as we’ve mentioned earlier, and the range of things that we can discover, uncover, from the data, the kinds of insights, are just so powerful that I can’t imagine any business a decade from now operating without a very strong data scientist team. It is simply not possible to compete in the modern world if you’re not on top of the data about your customers, about your business, and how your customers are interacting with the product.
Q: What do you think will be the biggest things that come out of this field specifically? Do you have a sense for where you think we’re going in the next 5 to 10 years?
ST: I mean, obviously, what we’re doing today is we optimize like crazy everything we do. If you have a product on the market, you have to apply data to make it very good. But I think that the next step is actually going from very targeted, structured data questions, with a clear input and a clear output, to very amorphous and broad questions. There is a lot of work on data processing technology that has millions of parameters, like Deep Belief Networks is one that’s going to massively kind of take over the world. There’s going to be all kinds of jobs that will be empowered by an ability to have a level of intelligence that is much stronger than what we have seen before. So right now we can fly planes autonomously. We can drive cars autonomously. We’ve done away with that. The next thing to fall might be, I don’t know, software engineers? A whole bunch of areas, I think, where we can use Data Science to really boost the intelligence of our computers.
NS: And I feel that the next several years are going to lead to much richer and much more diverse data sets. You know: images and events from sensor networks; text; video. The world is just exploding with all sorts of data. That will open up possibilities we can’t even imagine today.
ST: People talk about Artificial Intelligence a lot. Elon Musk just warned against it. That’s data! What is happening in AI right now is all about machine learning. It’s all about data. If we just test 10 years from now the tools, the skills you’re going to pick up doing data will be essential to make machines truly intelligent.
Q: How much of a Data Scientist’s job is figuring out which technologies to use as opposed to analyzing the data?
NS: Most of the emphasis should be on understanding the problem domain and the structure and what techniques would best solve the problem, and not as much on the programming language. The biggest decision that you have to make is what kind of algorithms do you use and what are the features that you need to put into your prediction model.
ST: When you get into your job as a Data Analyst, the number one frustration will be that the tools you’ve learned about aren’t that great for the problems you want to solve. The data might be changing over time or have missing features. So the skill of taking something in the real world that you care about and then taking this toolbox over here, whichever language it is, and making those match is the most important skill. So that’s the number one frustration, to be honest, because in class you find that they match very great because the instructors chose the problems to be compliant, but in the real world when you go into the project phase, you find it’s really hard sometimes to make this connection. And that’s what you want to learn.