Thursday, January 9, 2014

Simple Variable Correlation Analysis Against Conversions

In the previous post I showed you a data visualization technique that displays how conversion rates behave against various web site visitor behavior metrics. Using the same data set, I ran R's "pairs" function, which draws a matrix of scatter plots, to show in a static format how each independent (predictor) variable relates to the dependent variable, "ConversionRate". This is the graphical output we get from R.

Initially, the graphical output is a lot to take in, but after a quick look we can see that a few predictor variables show a definite relationship with Conversion Rate, and with each other. Before looking ahead, can you find out which ones they are? For those of you who aren't familiar with matrix plots like these, take a look at the 2nd square in the first row. The dependent variable, on the y-axis, is simply the Conversion Rate. On the x-axis we have Visits.

The second square in the first row is simply a visualization of Conversion Rate as a function of Visits. In this case we can see only a weak relationship between Visits and Conversion Rate. Notice how the mirror image of this graph can be found in the 2nd row of the first column. Is this a meaningful square on the matrix? Not for our purposes. It displays the same two variables, but on opposite axes: Visits becomes the y-axis, or dependent variable, plotted against Conversion Rate. For the purposes of this study we are not concerned with how visits change as a result of conversions. We are only concerned with how Conversion Rate changes as a result of the other independent variables.
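
To put a number on what each square shows, here is a minimal Python sketch (not the R code used for this post) that computes the Pearson correlation between Conversion Rate and each predictor. The data values and column names are invented for illustration; they are not the data set behind the plots. Swapping the two arguments of the function gives the same coefficient, which is why the mirrored square carries no new information about the strength of the relationship.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample data, invented for this sketch.
data = {
    "ConversionRate": [1.2, 1.5, 1.1, 2.0, 2.4, 1.8, 2.9, 2.2],
    "Visits": [900, 400, 1200, 700, 1000, 500, 800, 1100],
    "AvgVisitDuration": [60, 75, 55, 110, 130, 95, 170, 120],
    "PagesPerVisit": [2.1, 2.4, 1.9, 3.0, 3.6, 2.8, 4.1, 3.3],
}

target = "ConversionRate"

# Correlate every predictor with the dependent variable.
correlations = {
    name: pearson(values, data[target])
    for name, values in data.items()
    if name != target
}

# Print the predictors from strongest to weakest absolute correlation.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>18}: r = {r:+.2f}")
```

Ranking by absolute value keeps strong negative relationships from being overlooked, which is easy to do when scanning a plot matrix by eye.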

Because a lot of data is compressed into a small space, you wouldn't want to show a client the raw output without marking the most important variables (and explaining why they are the most important). After mark-up we have this:

Once regression lines are drawn through the scatter plots, we see that two predictor variables, Average Visit Duration and Pages/Visit, have reasonably high coefficients of correlation with Conversion Rate. We will look at them individually.

The average visit duration plot (marked with the yellow regression line) displays a positive relationship to conversion. It is interesting to note that the relationship becomes a little murkier as the time on site gets much higher. The conversion data points follow the regression line much more closely at lower levels of average visit duration. The residuals, or errors, in the second half of the graph are larger, which shows that at higher levels time on site is a somewhat weaker predictor of conversions.

Pages per visit also shows a positive relationship to conversions. Once again, the residuals grow, which tells us that though pages per visit is a reasonably good predictor of conversion rate, its predictive power diminishes as the number of pages per visit grows.
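
One rough way to check whether the scatter around a regression line widens, as described for both variables, is to fit the line and compare the size of the residuals in the lower and upper halves of the predictor's range. The Python sketch below does this by hand. The numbers are synthetic, constructed so that the noise grows with x, and the function names are my own; none of it comes from the original R analysis.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    return mean_y - slope * mean_x, slope

def residual_spread_by_half(x, y):
    """Root-mean-square residual for the lower and upper half of x values."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order the points by x
    mid = len(pairs) // 2

    def rms(points):
        return math.sqrt(
            sum((b - (intercept + slope * a)) ** 2 for a, b in points)
            / len(points)
        )

    return rms(pairs[:mid]), rms(pairs[mid:])

# Synthetic data: pages per visit (x) and conversion rate in percent (y).
pages_per_visit = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
conversion_rate = [1.0, 1.3, 1.4, 1.8, 1.9, 2.6, 1.9, 3.1, 2.2, 3.6]

low, high = residual_spread_by_half(pages_per_visit, conversion_rate)
print(f"RMS residual, lower half of x: {low:.2f}")
print(f"RMS residual, upper half of x: {high:.2f}")
```

A clearly larger value for the upper half is the numeric counterpart of the fanning-out pattern in the plots. Splitting at the median is a crude choice; it is only meant to make the visual impression measurable.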

It's important to note that no single one of these predictor variables carries a high enough coefficient of correlation to predict conversion rate by itself. However, it is worth determining which variables have at least some power to predict conversions. Ultimately, as you may have already assumed, websites that convert well have the intrinsic ability to rank well.

1 comment:

  1. Hi Prateek,
    Good article! With large numbers of predictor variables, it's hard to figure out which ones are the most important. Your use of R software coupled with key insights makes it clear that "average visit duration" and "pages per visit" are the best explanatory variables to predict conversion rate.
    Afif Khaja