Thursday, January 9, 2014

Simple Variable Correlation Analysis Against Conversions

In the previous post I showed you a data visualization technique that displays how conversion rates behave against various web site visitor behavior metrics. Using the same data set, I ran R's simple "pairs" function to show, in a static format, how each independent (predictor) variable relates to the dependent variable, "ConversionRate". This is the graphical output we get from R.

Initially, the graphical output is a lot to take in, but a quick look shows that a few predictor variables have a definite relationship, both with one another and with Conversion Rate. Before looking ahead, can you find out which ones they are? For those of you who aren't familiar with matrix plots like these, take a look at the 2nd square in the first row. The dependent variable, on the y-axis, is simply the Conversion Rate. On the x-axis we have Visits.

The second square in the first row is simply a plot of Conversion Rate as a function of Visits. In this case we can see only a weak relationship between Visits and Conversion Rate. Notice that the mirror image of this graph appears in the 2nd row of the first column. Is this a meaningful square on the matrix? Definitely not. It displays the same two variables on opposite axes: Visits becomes the y-axis, or dependent variable, plotted against Conversion Rate. For the purposes of this study we are not concerned with how visits change as a result of conversions. We are only concerned with how Conversion Rate changes as a result of the other, independent variables.
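
One way to see why the mirrored square adds nothing new is that the correlation between two variables does not depend on which one sits on which axis. The snippet below is a minimal Python sketch of that point (the plots in this post were made in R). The numbers are made up for illustration and are not the site's data, and the helper name pearson_r is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT the data behind the plots in this post.
visits = [120, 150, 90, 200, 170, 130]
conversion_rate = [2.1, 1.8, 2.5, 2.0, 2.4, 1.9]

# Swapping the axes, as the mirrored panel does, leaves the coefficient unchanged.
r_xy = pearson_r(visits, conversion_rate)
r_yx = pearson_r(conversion_rate, visits)
print(r_xy, r_yx)
```

The two printed values are the same, which is why only one of each mirrored pair of squares needs to be read.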

Because a lot of data is compressed into a small space, you wouldn't want to show a client the raw output without marking the most important variables (and explaining why they are the most important). After mark-up we have this:

Once the regression line is drawn through each scatter plot, we see that two predictor variables, Average visit duration and Pages/Visit, have reasonably high correlation coefficients. We will look at them individually.

The average visit duration plot (marked with the yellow regression line) displays a positive relationship to conversion. It is interesting to note that the relationship becomes a little murkier as the time on site gets much higher. The conversion data points follow average visit duration much more closely at lower levels of average visit duration. The residuals, or errors, in the second half of the graph are larger, which shows that at higher levels, time on site is a somewhat weaker predictor of conversions.

Pages per visit also shows a positive relationship to conversions. Once again, the residuals grow, which tells us that although pages per visit is a reasonably good predictor of conversion rate, its predictive power diminishes as the number of pages per visit grows.
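
The growing residuals described for both plots can also be checked with numbers rather than by eye. What follows is a rough Python sketch, not the method used to produce the charts in this post: it fits an ordinary least-squares line and compares the average size of the residuals in the lower and upper halves of the x-axis. The data is invented so that the scatter widens on purpose, and the function names fit_line and mean_abs_residual are my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_abs_residual(xs, ys, slope, intercept):
    """Average absolute distance between the points and the fitted line."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)

# Invented data (already sorted by x): the scatter widens as pages per visit grows.
pages_per_visit = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
conversion_rate = [1.0, 1.3, 1.5, 1.8, 2.0, 2.6, 1.9, 3.2, 2.2, 3.6]

slope, intercept = fit_line(pages_per_visit, conversion_rate)

# Split the points into a lower and an upper half of the x-axis.
half = len(pages_per_visit) // 2
low_spread = mean_abs_residual(
    pages_per_visit[:half], conversion_rate[:half], slope, intercept)
high_spread = mean_abs_residual(
    pages_per_visit[half:], conversion_rate[half:], slope, intercept)

print("slope:", slope)
print("mean |residual|, lower half:", low_spread)
print("mean |residual|, upper half:", high_spread)
```

With this invented data the slope is positive and the upper-half residuals are clearly larger than the lower-half ones, which is the same pattern the marked-up plots show visually.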

It's important to note that no single one of these predictor variables carries a high enough correlation coefficient to predict conversion rate by itself. However, it is worth determining which variables have at least some power to predict conversions. Ultimately, as you may have already assumed, websites that convert well have the intrinsic ability to rank well.

Tuesday, January 7, 2014

Using the Motion Chart for Multivariate Analysis

So, we've all seen them by now: the Motion Charts made famous in a now-iconic TED talk given by Hans Rosling. Rosling used them to show how you can present a great deal of data in a clean and meaningful way: on a two-dimensional plane that displays more than just two dimensions.

Collecting meaningful data has always been the most challenging step in generating actionable visualizations. However, digital analytics has made a wealth of free and pertinent information available. We can use it to see exactly which dimensions affect web site traffic, and formulate insights based on them. Sites that function as e-commerce storefronts are a bit less concerned with general traffic and more concerned with sales. Therefore, it's worth examining the conversion rate in more detail.

Take a look at the recent conversion rates of an actual but unidentified site against the following metrics: bounce rate, pages viewed per visit, aggregated daily visitors, and average time spent on the site per visitor.

For a more traditional view, click the bubble data point, check the box marked 'trails', and change the x-axis to the 'time' option. Now hit play and adjust the speed. You will see how the conversion (sales) rate changes over time, along with the number of overall web site visitors, the number of pages viewed per visit, and the visitor bounce rate. While it would be good to get the R-squared value of a specific metric against the dependent variable (conversion), this would only give us a numeric, abstract statistical understanding of which factors affect conversion most profoundly.
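
For readers who do want that number, here is a minimal Python sketch of the R-squared of a simple one-variable linear fit, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The metric names follow the ones listed above, but every value is made up for illustration and is not the site's data. The function name r_squared is my own.

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up daily values for illustration only.
conversion_rate = [1.2, 1.5, 1.1, 1.9, 1.6, 2.2]
metrics = {
    "bounce_rate": [62, 55, 64, 48, 53, 45],
    "pages_per_visit": [2.1, 2.6, 2.0, 3.4, 2.9, 3.8],
    "daily_visits": [410, 380, 450, 400, 430, 390],
    "avg_visit_duration_sec": [95, 120, 90, 160, 130, 185],
}

# One R-squared per metric, each taken alone as the only predictor.
scores = {name: r_squared(values, conversion_rate)
          for name, values in metrics.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 3))
```

Each score only describes one metric in isolation, which is the limitation the motion chart is meant to get around.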

However, the best way to use the motion chart is to stop using the time variable as an axis. Only then can we see how all four dimensions affect conversions simultaneously. At any given point, the dimensions are represented by the x-axis, the y-axis, the size of the bubble, and the color of the bubble. This gives us a much richer interpretation of the factors that influence conversion.

But what about our clients? After all, aren't we reporting for their benefit? Chances are, they don't care what the R-squared value of any one variable is. But a visual representation like the motion chart can go a long way toward showing just how much influence a particular metric has on conversion.

Motion charts can no longer be generated from the GoogleDoc gadget function. This motion chart was generated in R using the googleVis package, which interfaces with Google Chart Tools.