In the last few weeks, we have discussed the implications of big data and big data analytics. We also talked about the latest trends in big data and how big data visualization is becoming a form of art. Today, I want to move from theory to practice with a recent big data project we successfully completed.
Part of my team at Rackspace is responsible for paid search, where we manage thousands of keywords and a pretty substantial budget. Typically, the paid search budget is managed through daily optimization and monitoring of our performance. Some of the routine activities are ad copy optimization, A/B and multivariate (MVT) testing, landing page testing, ad group categorization, day/week parting, budget shifting and ad extensions. Our paid search account analysis, with all its variables (tens of thousands of keywords, budget, clicks, impressions, ad score), easily becomes a big data analytics challenge.
In addition to our traditional paid search analytics, we were interested in identifying statistically significant ROI patterns and new keyword-level optimization opportunities. I reached out to our chief data scientist, Samuel Berestizhevsky, to get his insights. Samuel recommended using ANOVA/ANCOVA analysis for this experiment.
Let’s first understand what ANOVA and ANCOVA mean. Here are the technical definitions from Wikipedia:
ANOVA – Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means and their associated procedures (such as “variation” among and between groups)
(Source : http://en.wikipedia.org/wiki/ANOVA)
ANCOVA – Covariance is a measure of how much two variables change together and how strong the relationship is between them. Analysis of covariance (ANCOVA) is a general linear model which blends ANOVA and regression.
(Source : http://en.wikipedia.org/wiki/Analysis_of_covariance)
In simple terms, ANOVA tests whether the averages of several groups really differ from each other, and ANCOVA does the same while also accounting for a continuous variable (a covariate) that is related to the outcome.
As an example, let’s assume you are a supplier of wool. One of your teams is responsible for raising and managing several herds of sheep. To increase production, you increase the serving of vitamins by the same amount for three different types of sheep (fine wool, medium wool and carpet wool) and compare the results.
Image Source : http://goo.gl/xWyaD
You hire a statistician, Ben, to help solve this challenge using historical data to find the right combination(s). The default state (null hypothesis) in this experiment is that there is no relationship between the type of sheep and the increase in wool production. Ben’s goal is to use statistics to check whether any observed differences are unlikely to be due to chance alone.
He records the output from 21 fine wool, 19 medium wool and 22 carpet wool sheep. He wants to know whether the average output of the fine wool sheep is different from the average output of the medium wool sheep, or whether both averages differ from the average output of the carpet wool sheep.
Using single factor (one factor) ANOVA Ben calculates the following :
ANOVA calculator = http://www.vassarstats.net/anova1u.html
The p-value in these experiments tells us how likely it would be to see differences between the groups at least this large if the null hypothesis were true. In this case, a p-value of 0.081259 is higher than 0.05, which means we cannot reject the null hypothesis: the observed differences between the groups could plausibly be due to chance.
Conclusion – Ben would conclude that there is no statistically significant evidence that the vitamin supplement affects the three types of sheep differently.
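To make the calculation concrete, here is a minimal Python sketch of a single-factor ANOVA. It is not what the calculator linked above does internally: the F statistic follows the standard one-way ANOVA formula, but the p-value is estimated with a permutation test (shuffling the sheep between groups) instead of being read from the F distribution, so that the example needs only the standard library. The wool outputs are made-up numbers generated for illustration, and the function names are my own.

```python
import random


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Estimate the p-value by reshuffling values across groups.

    This is a swapped-in technique for illustration, not the
    F-distribution lookup a classical ANOVA table uses.
    """
    rng = random.Random(seed)
    observed = one_way_anova_f(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if one_way_anova_f(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)


# Made-up wool output per sheep (hypothetical units), using the
# group sizes from the example: 21 fine, 19 medium, 22 carpet.
data_rng = random.Random(42)
fine = [data_rng.gauss(10.0, 2.0) for _ in range(21)]
medium = [data_rng.gauss(10.5, 2.0) for _ in range(19)]
carpet = [data_rng.gauss(11.0, 2.0) for _ in range(22)]

wool_groups = [fine, medium, carpet]
print("F statistic:", round(one_way_anova_f(wool_groups), 3))
print("permutation p-value:", round(permutation_p_value(wool_groups), 3))
```

With real data, a statistics package would give the F-distribution p-value directly. The reading is the same either way: a p-value above 0.05 means we do not reject the null hypothesis.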
This is just a small-scale example of how ANOVA can be used to test for differences between groups. Now let’s go back to our original exercise of evaluating the performance of a large-scale PPC program. We used SAS to perform the statistical analysis. The goals for this exercise are as follows:
Goals of the PPC Analysis:
Part I. Identify which, if any, keywords significantly impact Revenue / ROI.
Part II. How does the Revenue / ROI impact change, if at all, depending on

the amount spent on the keyword searches

the number of paid accounts, impressions & clicks
Part I – Keywords that impact Revenue/ROI
For the first part of the ANOVA analysis, we divided the keywords into several groups.
A. Alpha
B. Beta
C. Theta
D. Gamma
Once these groups were formed we performed the following steps –
1. Draw a random sample from the largest group based on the sample size of the smallest group.
2. With this subset of data (equal observations per group):
 test normality assumptions, transform if necessary
 test equal variance assumption
 perform one-way ANOVA for each group
 report means and confidence intervals
3. Repeat for a total of X random samples.
4. Base conclusions on the summarized results of the X random samples.
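The resampling steps above can be sketched in Python as follows. The actual analysis was done in SAS, so this is only an illustration under assumptions: the ROI values are made up, the ANOVA is assumed to compare the balanced groups against each other, and the normality and equal-variance checks from step 2 are left out to keep the sketch short. The function names are mine.

```python
import random


def one_way_f(samples):
    """One-way ANOVA F statistic for a list of samples."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def balanced_sample(groups, rng):
    """Step 1: randomly downsample every group to the smallest group's size."""
    n = min(len(values) for values in groups.values())
    return {name: rng.sample(values, n) for name, values in groups.items()}


def repeated_anova(groups, n_repeats, seed=0):
    """Steps 2-4: run the ANOVA on each balanced sample and summarize."""
    rng = random.Random(seed)
    f_values = []
    for _ in range(n_repeats):
        sample = balanced_sample(groups, rng)
        f_values.append(one_way_f(list(sample.values())))
    return sum(f_values) / n_repeats, f_values


# Hypothetical ROI values per keyword; group sizes differ on purpose.
data_rng = random.Random(1)
keyword_groups = {
    "Alpha": [data_rng.gauss(1.0, 0.5) for _ in range(40)],
    "Beta": [data_rng.gauss(1.0, 0.5) for _ in range(25)],
    "Theta": [data_rng.gauss(1.1, 0.5) for _ in range(60)],
    "Gamma": [data_rng.gauss(1.4, 0.5) for _ in range(30)],
}

mean_f, f_values = repeated_anova(keyword_groups, n_repeats=20)
print("mean F over", len(f_values), "balanced samples:", round(mean_f, 3))
```

Averaging over many balanced samples keeps a single lucky or unlucky draw from the largest group from driving the conclusion.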
Here are some of the observations (actual results are kept confidential and are replaced by dummy numbers).
We found that the p-value for the Gamma group is smaller than 0.05, which indicates a possible difference in ROI within the Gamma group. For the Alpha, Beta and Theta groups, we found no statistically significant differences in ROI.
Part II – How does the Revenue / ROI impact change, if at all, depending on

the amount spent on the keyword searches

the number of paid accounts, impressions & clicks
For Part II of the analysis, we used either a common slope or an unequal slopes ANCOVA (analysis of covariance) model.
For each of the ANCOVA models / analyses, the following procedure was used:
1. Draw a random sample from the largest group level(s) based on the sample size of the smallest group level.
2. Perform residual and regression diagnostics to verify / check assumptions.
3. Determine the most appropriate form of the covariate model for the sample (either a common slope or unequal slopes model).
4. Repeat 1, 2, and 3 above for 100 samples and report results based on averages of these 100 samples.
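Step 3, choosing between a common slope and an unequal slopes model, is commonly decided with an F test that compares how well the two models fit. The sketch below is a plain-Python illustration of that comparison for a single covariate (say, spend) and a single response (say, ROI). It is not the SAS procedure we ran, the data points are invented, and the function names are my own. It returns the F statistic and its degrees of freedom, which would then be compared against an F distribution in a statistics package.

```python
def _centered_sums(xs, ys):
    """Return (Sxx, Sxy, Syy) around the group's own means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxx, sxy, syy


def slope_homogeneity_f(groups):
    """Compare a common-slope ANCOVA with an unequal-slopes ANCOVA.

    groups: list of (xs, ys) pairs, one pair per group.
    Returns (F, numerator df, denominator df). A large F suggests the
    slopes differ, so the unequal-slopes model fits better.
    Assumes the covariate varies within every group (Sxx > 0).
    """
    stats = [_centered_sums(xs, ys) for xs, ys in groups]
    k = len(groups)
    n_total = sum(len(xs) for xs, _ in groups)

    # Common-slope model: separate intercepts, one shared slope
    pooled_sxx = sum(s[0] for s in stats)
    pooled_sxy = sum(s[1] for s in stats)
    pooled_syy = sum(s[2] for s in stats)
    sse_common = pooled_syy - pooled_sxy ** 2 / pooled_sxx

    # Unequal-slopes model: a separate regression line per group
    sse_unequal = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in stats)

    df_num = k - 1
    df_den = n_total - 2 * k
    f_stat = ((sse_common - sse_unequal) / df_num) / (sse_unequal / df_den)
    return f_stat, df_num, df_den


# Invented (spend, ROI) points for two hypothetical keyword groups
alpha = ([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.1])
gamma = ([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.1, 5.9, 8.2, 9.8])

f_stat, df_num, df_den = slope_homogeneity_f([alpha, gamma])
print("F =", round(f_stat, 2), "with df", (df_num, df_den))
```

If the F statistic is small, the simpler common slope model is usually preferred, because it describes the effect of the covariate with a single number for all groups.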
Here is the summary of the observations from the ANCOVA analysis.
Key Takeaways from this exercise
1. Increase the spend on the Gamma group of keywords as it has the highest correlation to ROI.
2. Decrease the spend on the Alpha keywords as they are not driving enough ROI to justify the spend.
3. Generic observation – Pause the keywords containing “xxxxx” because they are being influenced by other campaigns and the ROI attribution does not tie to the performance of the campaign itself.
I hope this detailed overview of ANOVA/ANCOVA gives you enough ideas to perform this analysis on your own data. The possibilities are endless: you can apply the same methodology to SEO or even display media. Next week, we will look at Part III of this analysis and identify the optimal spend per keyword, with the goal of maximizing the number of paid customers brought in by the keywords.