Quite interestingly, big data analysis is not a new concept. Large organizations, government agencies, the medical industry and almost all large-scale consumer research companies have been conducting big data analysis for years. So, what is causing people to focus more on big data in the last few years? What are the implications of big data analytics, and how does it impact our lives? How can you explore a big data analytics career path?
Who can give better answers to these questions than big data scientist and author Samuel Berestizhevsky? Samuel and I work together at Rackspace, and presently we are working on a complex paid search statistical analytics project. I asked him about doing this interview blog post, and he quickly agreed. I want to thank Samuel for giving my blog readers the opportunity to dive into his 25+ years of big data analytics experience.
Introduction – Samuel Berestizhevsky
Tell me about yourself and a bit about your career as a data scientist.
[SB] Briefly, I am an actionable analytics guy. I was trained as a statistician and as a software engineer. At the beginning of my career, I spent several years on statistical research of robust regression. My career in data science started about 27 years ago with the development of software for interactive multi-targeting data analysis. This work resulted not only in real working software, but also in the ideas for my two books, published later in 1995 and 1998.
I worked in the academic world for about 10 years at universities and then changed my life by moving to the business world. Back in 1995, I pioneered technology and online analytics for monitoring web user behavior. All this was done without any cookies or user privacy violations. At that time, people hardly understood why they would need this technology when they had web log files :). Those who understood the benefits of this technology made a lot of money. The technology was named WatchWise, and my company was acquired in 2002. 99% of WatchWise customers were advertisers who wanted to know how their advertising worked from a cost-effectiveness perspective. They wanted to know it in real time, anytime they wanted or needed it.
I think the following case may interest you. One of WatchWise's customers, a marketer of an online casino, earned more profit from advertising the casino than the casino itself did. This marketer listed on the London Stock Exchange in 2004 and became a $1B company, while their client, the online casino 888.com, listed on the same exchange 6 months later and became only a $0.5B company. All this happened because the marketer got answers from WatchWise services about whom, what and where to advertise. It helped them make the right decision to adopt a revenue-share pricing model based on the casino profit generated by the players they brought in, rather than charging their client for advertising services.
How did you get into data science? What made you think data?
[SB] Frankly, I did not have any other choice: my father and my uncle were mathematicians, so it was pre-determined that I would pursue some mathematical career.
How long have you been doing data analysis and what life lessons have you learnt from data?
[SB] I believe I have been doing it all my adult life, starting with student research work, through the development of real working analytical software applications, and I continue discovering the unknown in data.
Tell me more about your books.
[SB] I wanted to share my knowledge and experience in developing analytical software applications that produce high-quality, reliable statistical results. “Table-driven strategies for rapid application development” was published in 1995. This book describes a so-called “code-free” technology, which makes it possible to create, modify and maintain complex data-driven applications without changing or updating any programming code. All updates can be done by changing metadata, that is, data that describes application functionality as well as application data. The second book, “Object-based statistical analysis”, describes technology that helps you find a fit between the assumptions of statistical methods and the characteristics of the data, thus making it possible to produce statistical inference of high quality and reliability. Together, these two books provide an end-to-end solution for creating software applications whose behavior can be changed by data, and for ensuring that the results of these analytical applications can be trusted.
What made you write two books on data? What lessons do you think you would like to teach in your next book?
[SB] I wanted to share knowledge and experience, and to help others develop analytical applications as fast as possible, with high quality and reliable results. I would definitely write my next book about analytical applications for “big data”.
In your book “Programming Techniques for Object-Based Statistical Analysis with SAS” you stress an object-based approach for consistent data. Can you elaborate on that?
[SB] Consistent data is a key to success in developing analytical solutions. The main idea of consistency here is to analyze the characteristics of the data and ensure the data is analyzed by an appropriate statistical method. An object-oriented approach helps you preserve the association between data characteristics and statistical methods that are appropriate from the perspective of their assumptions. In other words: many statisticians say that they develop methods or algorithms, and that it is not their problem to interpret the results of applying these algorithms to real data. Some statisticians say that we need to ensure an appropriate method is applied to data appropriate for that method; only then can we ensure the quality of the results and provide a straightforward interpretation. I belong to this small group of statisticians.
I keep hearing the word big data more frequently these days, and I know it’s for real. What do you really think is big data and what can the analytical world do about it? Can you share examples of big data?
[SB] “Big data” is not a new concept or phenomenon. Statistics has actually dealt with big data from day one of its existence. The main idea of extending statistical inferences made on a small sample of data to the whole universe (read “big data”) is definitely not a new one. However, data sets that change their characteristics almost in real time (high velocity of change), data with multi-dimensional properties that can't be reduced, and data of great volume are what we can define as “big data”. I think it's important to mention that if your algorithm doesn't scale with the data it tries to process, you are already dealing with “big data”. I think “big data” brings enormous challenges for so-called large-scale statistical inference, an area that has been developing over the last 20 years. This development will most likely end up with a completely new theory and practice of large-scale statistical inference. Without the possibility of making statistical inferences about data, there is no need to gather and analyze data. This shows how important it is to develop statistical algorithms that scale with data as a foundation for further statistical inference.
Examples of “big data”: network traffic, click-stream data, gene expressions, etc.
What in your eyes is not big data, and what would you not consider big enough for MapReduce or other big data exercises?
[SB] Big data is not necessarily about the size of the data set. Data that has a fixed size I would not consider big data. MapReduce is mainly used to process distributed data of any size, not necessarily big data.
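For readers unfamiliar with the MapReduce pattern Samuel mentions, here is a minimal single-process sketch of the idea (a toy illustration with made-up click data, not a distributed Hadoop job):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, 1) pairs, e.g. one pair per page view
    for page in records:
        yield page, 1

def reduce_phase(pairs):
    # Reduce: sum the values emitted for each key
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

clicks = ["home", "pricing", "home", "blog", "home"]
counts = reduce_phase(map_phase(clicks))
print(counts)  # {'home': 3, 'pricing': 1, 'blog': 1}
```

In a real cluster, the map and reduce phases run on many machines in parallel, which is why the pattern suits distributed data of any size.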
When do you run big data analysis versus traditional Excel- or Access-based analysis?
[SB] I would strongly recommend against Excel and Access for any kind of analytical work. If I have a really huge (but fixed-size) data set, I would prefer the bootstrap technique (multiple random sampling of the data), use existing statistical software (SAS, STATA, R, Statistica, etc.) to perform the statistical analysis, and use multiple comparison methods to formulate statistical inferences.
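The bootstrap technique Samuel describes can be sketched in a few lines. This toy example (synthetic data; in practice you would use SAS, STATA or R as he suggests) resamples a data set with replacement many times to estimate a confidence interval for the mean:

```python
import random
import statistics

random.seed(42)
# Stand-in for a huge, fixed-size data set
data = [random.gauss(100, 15) for _ in range(10_000)]

# Bootstrap: repeatedly resample with replacement and recompute the statistic
boot_means = []
for _ in range(1_000):
    sample = random.choices(data, k=len(data))  # sampling with replacement
    boot_means.append(statistics.fmean(sample))

# The spread of the bootstrap means approximates the sampling distribution
boot_means.sort()
lo, hi = boot_means[25], boot_means[974]  # approximate 95% interval
print(f"mean ≈ {statistics.fmean(data):.2f}, 95% bootstrap CI ≈ ({lo:.2f}, {hi:.2f})")
```

The appeal for huge fixed data sets is that each resample is embarrassingly parallel and the method makes very few distributional assumptions.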
What was your biggest data exercise (volume/complexity)? What challenges did you run into, and how did you overcome them?
[SB] Probably the data about online advertising impressions, audiences and their online behavior. This was a project for discovering new knowledge about audiences and their online behavior, in order to ensure that a relevant ad is presented to a relevant audience at the right time. As far as I remember, the data size was measured in petabytes, and it was stored in heavily distributed data storage similar to Hadoop, called Cosmos (a Microsoft in-house development). Challenges? Everything was a challenge, starting from the problem of how to identify data characteristics, and ending with changes in audience behavior conditional on the content presented to that audience. Development of new methods and algorithms was required, and it was done simultaneously with the development of the project.
What are some of the best practices you would like to give to upcoming big data analysts to make their lives a little easier?
[SB] There is no best practice, and there is no situation that makes a data analyst's life a little easier. As I mentioned already, the high velocity of changes in data characteristics doesn't allow us to create any best practice. What worked well yesterday will not necessarily work well tomorrow.
What do you think is the next evolution in data intelligence and analysis, e.g., real-time analysis, predictive models, location-based models?
[SB] Large-scale statistical inference is the current and next evolution in data intelligence and analysis. Without such inference, the collection, storage and processing of big data may bring no real sense or benefit, beyond counting and accounting.
Data projects (Design, Segment and Experimentation)
What types of data projects have you been involved in? What are the top 3 projects you enjoyed most and the top 3 you hated most, and why?
[SB] I tried not to take on any project that I hated at first sight: I loved projects that delivered enormous benefits, regardless of their complexity. It's hard to pick a top 3, so I will make a random selection of 3 out of my top 30. They are:
1. Development of algorithms and programs for identifying, in real time, the commercial intentions of web users, predicting their next step, and presenting advertising relevant to that next step before web users actually perform it.
2. Development of an analytical solution to compare the relevance of search results produced by virtually unlimited numbers of algorithms for virtually unlimited numbers of queries.
3. Development of a demand forecasting system intended to reduce out-of-stock events.
What are the most effective data models when it comes to web traffic analysis and insights? How do you improve the ROI of your online ad spend using attribution?
[SB] There is no such thing as a most effective data model. Models are effective only when applied to data with specific characteristics. Attribution is a pretty complex problem, so I will give you an example: display advertising. To measure the effectiveness of display advertising, which actually contributes to virtually every other channel, you should first define the audience you plan to advertise to. If you can make sure your ad is delivered to the targeted audience, you can measure the results of this advertising by examining the audience of the new customers you acquired through many different online channels. If your display advertising targeted at a specific audience worked, you will see changes in the audience of your new customers. If you don't see any changes (and you are pretty sure the ad was delivered to the targeted audience), feel free to put your display advertising spend on hold.
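The audience-shift check Samuel describes can be made concrete with a simple statistical test. The sketch below (with hypothetical counts; the test choice is mine, not Samuel's) compares the targeted segment's share of new customers before vs. during a display campaign using a two-proportion z-test:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Hypothetical counts: targeted-segment members among new customers,
# before the campaign (120 of 1000) vs. during it (180 of 1000)
z, p = two_proportion_z(120, 1000, 180, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here would suggest the audience mix of new customers really did shift; no detectable shift, despite confirmed delivery, is Samuel's signal to put the spend on hold.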
What are your top 3 data modeling and statistical tools and why?
[SB] I work with SAS because it can process data one record at a time, meaning SAS can handle very large data files. SAS also has an extensive SAS Macro language that helps you create automated data and statistical processing. I work with STATA because of its very robust and advanced statistical programs. I work with R because it has a very large library of different statistical methods.
How do you handle missing data and what imputation techniques do you recommend?
[SB] Missing data is part of the data and should be treated as such. The imputation technique depends on the underlying nature of the missing data, and should be chosen accordingly. If you have trouble revealing the nature/mechanism of the missing values, I would recommend using fairly generic Hidden Markov Modeling to impute the missing values. This is not a universal method, and its results should be validated using simulations.
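Samuel's advice to validate imputation results by simulation can be sketched as follows. This toy example (synthetic data, with a simple linear-interpolation imputer standing in for a real HMM-based one) hides known values, imputes them, and measures the error:

```python
import math
import random
import statistics

random.seed(0)
# Synthetic series: slow trend plus noise, with no values actually missing
series = [50 + i * 0.1 + random.gauss(0, 1) for i in range(500)]

def impute_interpolate(xs):
    """Fill None gaps by linear interpolation between known neighbors."""
    out = list(xs)
    for i, v in enumerate(out):
        if v is None:
            left = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            right = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out

# Simulation: hide 10% of the known values, impute, and measure the error
hidden = random.sample(range(1, len(series) - 1), 50)
masked = [None if i in hidden else v for i, v in enumerate(series)]
imputed = impute_interpolate(masked)
rmse = math.sqrt(statistics.fmean((imputed[i] - series[i]) ** 2 for i in hidden))
print(f"RMSE on hidden values: {rmse:.2f}")
```

Repeating this masking experiment many times shows whether the chosen imputer's error is acceptable for the data at hand, which is the kind of validation Samuel recommends before trusting any imputation method.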
What are some of the best practices when designing a data analytics experiment?
[SB] Unfortunately, there is no such thing as a best practice in data experiments. Everything depends on the data: its characteristics, availability, accuracy, acquisition cost, etc.
From your vast experience, can you share examples of good and bad designs of experiments?
[SB] All designs are good. Some of their applications are good, but many are bad. Everything depends, unfortunately, on the collaboration between the statistician and the subject-matter expert. If these people can't collaborate, nothing will help. Usually, good designs of experiments are done as follow-up experiments, while bad designs try to get everything in one step.
Career advice and mentorship
What career advice would you like to give data scientist wannabes? What do you think makes a good data scientist?
[SB] Put a question mark on everything you learn or do. This is how you will be able to innovate. The ability to innovate will make you a good data scientist.
Do you know a few “rules of thumb” used in statistical or computer science?
[SB] If something looks too easy to you, then you are probably on the wrong path (in statistics). If something you are doing is too complicated (in computer science!), it means you are absolutely on the wrong path.
Do you think data science is an art or a science?
[SB] It’s both, as in any other science.
Which data scientists do you admire most?
[SB] I was lucky to meet several outstanding statisticians and mathematicians. John Tukey and Andrey Kolmogorov are probably the ones I admire most. BTW, John Tukey was the statistician who coined the term “software”.
What books would you recommend to our listeners following the data science career track?
[SB] Tough question. It depends on the area of data science. I would definitely recommend Peter Huber's latest book, “Data Analysis: What Can Be Learned From the Past 50 Years?”
Future / Predictions
What are some of the key predictions for the future of data science?
[SB] The development of a large-scale statistical inference theory that will eventually replace modern statistics. The development of large-scale automated data analysis will combine machine learning with large-scale statistical inference.
How should the analyst community prepare for the data explosion?
[SB] Actually, it's too late to prepare. It is happening now. I would say the same thing I have said for the last 25 years: put a question mark on everything; that is the way to innovate. Innovation will help you deal with data of any complexity.
Hope you enjoyed this post. Do you have questions for Samuel? Please leave a comment with your question, and we will make sure to get a response from Samuel.