Sunday, March 23, 2014

Easily Assign Attributes for Your Metrics in R

One of the most common mistakes I've seen is failing to set the correct variable types before analyzing data in R or other statistical software packages. It is critical that your variables are in the right format (numeric, integer, factor, etc.), or you run the risk of producing relatively meaningless insights. Fortunately, R makes it remarkably painless to assign the correct type to each variable.

If you are using Google Analytics to track your data, you are restricted to exporting your data to csv files. As a result, Excel "guesses" at what it thinks each variable should be. These properties are then assigned to the individual variables when the data is read into R.

After exporting the data into a csv file via Google Analytics, I intentionally left the default Excel settings in place to highlight the problems that come up. I read the csv into R and named the dataset "x". Before analyzing your data set, it is critical to take a moment to explore the individual variables and the structure of the data itself. The 'structure' command comes in very handy for just that: type str(x).
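As a rough sketch of that step, the snippet below reads a small stand-in for the export and inspects it. The column names follow the post, but the values and the inline csv text are invented for illustration, and stringsAsFactors = TRUE is set explicitly because newer versions of R no longer turn text columns into factors by default.

```r
# Hypothetical stand-in for the Google Analytics export; the values are invented.
csv_text <- 'date,google,bing,bounced,all
2014/02/01,120,15,60,"1,204"
2014/02/02,135,18,71,"1,310"
2014/02/03,128,12,64,"1,275"'

# stringsAsFactors = TRUE reproduces the older default, where text columns
# are read in as factors.
x <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Print the number of observations and the type of every variable.
str(x)
```

With a real export you would pass the file path to read.csv() instead of the text argument.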


There are 5 variables: Google organic, Bing organic, visitors who bounced, all visitors, and the date. The default date range pulled 31 days of data (the 31 observations in the example). Next, the variable types are printed. Notice that 'google', 'bing', and 'bounced' are labelled as integers, while 'all' and 'date' are factors. As a result, R is treating each distinct value in those two columns as a separate category. Obviously, this poses an enormous problem when attempting to calculate basic summary statistics, since you can't compute a mean or a variance for a categorical variable. R can parse dates formatted YYYY/MM/DD, but the column still has to be converted explicitly before R treats it as a date. The following code assigns the correct attributes to the variables for analysis.

R Code variable transformations
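A minimal sketch of such a transformation is shown here. The data frame is an invented stand-in for "x", and the thousands separator in 'all' is only an assumed reason for that column having been read as a factor.

```r
# Invented stand-in for the imported data set, with the same column types
# described above ('all' and 'date' as factors, the rest as integers).
x <- data.frame(
  date    = factor(c("2014/02/01", "2014/02/02", "2014/02/03")),
  google  = c(120L, 135L, 128L),
  bing    = c(15L, 18L, 12L),
  bounced = c(60L, 71L, 64L),
  all     = factor(c("1,204", "1,310", "1,275"))
)

# Convert the factor to text first, then parse it as a date.
x$date <- as.Date(as.character(x$date), format = "%Y/%m/%d")

# as.numeric() applied directly to a factor returns the internal level codes,
# so go through character and strip the (assumed) thousands separator.
x$all <- as.numeric(gsub(",", "", as.character(x$all)))

# The integer columns convert directly.
x$google  <- as.numeric(x$google)
x$bing    <- as.numeric(x$bing)
x$bounced <- as.numeric(x$bounced)
```

Going through as.character() matters: calling as.numeric() on a factor does not raise an error, it just quietly returns the wrong numbers.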





The x$date reference simply tells R to look for 'date' in dataset 'x', which is then converted from a factor to a date. The rest of the variables are converted to numeric. Though no error messages came up, it is still a good idea to check the result. After a quick structure command on the data set, we can see that the numeric variables are, in fact, numeric and that date is no longer a factor:
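One way to run that check is sketched below, assuming an already converted data frame (the values are invented). Besides str(), it lists the class of each column and looks for missing values, since a failed conversion usually shows up as NA.

```r
# Invented example of the data set after conversion.
x <- data.frame(
  date    = as.Date(c("2014-02-01", "2014-02-02", "2014-02-03")),
  google  = c(120, 135, 128),
  bing    = c(15, 18, 12),
  bounced = c(60, 71, 64),
  all     = c(1204, 1310, 1275)
)

# Structure of the data set: types should now read Date and num.
str(x)

# Class of each column as a named character vector.
classes <- sapply(x, class)
classes

# TRUE here would point to values that could not be converted.
any_missing <- anyNA(x)
```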


This may seem like a mundane step in your analysis, but it is one of the most important things you can do when analyzing any data at all. Note that the numbers themselves have not changed; only the variable types have. Now we can compute summary statistics, covariance tables, plot histograms and more.
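For instance, with correctly typed columns the usual base R functions work as expected. The data below is invented, and the column names simply follow the ones used in this post.

```r
# Invented, already converted data for illustration.
x <- data.frame(
  date    = as.Date("2014-02-01") + 0:4,
  google  = c(120, 135, 128, 142, 110),
  bing    = c(15, 18, 12, 20, 11),
  bounced = c(60, 71, 64, 75, 58),
  all     = c(1204, 1310, 1275, 1402, 1150)
)

num_cols <- c("google", "bing", "bounced", "all")

# Minimum, quartiles, mean and maximum for each numeric variable.
summary(x[, num_cols])

# Covariance table of the numeric variables.
cov_table <- cov(x[, num_cols])
cov_table

# Histogram of daily visitors.
hist(x$all, main = "All visitors per day", xlab = "Visitors")
```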

Thursday, January 9, 2014

Simple Variable Correlation Analysis Against Conversions

In the previous post I showed you a data visualization technique that displays how conversion rates behave against various web site visitor behavior metrics. Using the same data set, I ran a simple "pairs" function in R to show, in a static format, how each independent variable relates to the dependent variable, "ConversionRate". This is the graphical output we get from R.

Initially, the graphical output is a lot to take in, but after a quick look we can see that a few predictor variables show a definite relationship, both with each other and with Conversion Rate. Before looking ahead, can you find out which ones they are? For those of you who aren't familiar with matrix plots like these, take a look at the 2nd square in the first row. The dependent variable, on the y-axis, is simply the Conversion Rate. On the x-axis we have visits.

The second square in the first row is simply a visualization of Conversion Rate as a function of Visits. In this case we can see a weak relationship between visits and Conversion Rate. Notice how the inverse of this graph can be found in the 2nd row of the first column. Is this a meaningful square on the matrix? Definitely not. It displays the same two variables, but on opposite axes: here the y-axis, or dependent variable, is visits, plotted against conversion rate. For the purposes of this study we are not concerned with how visits change as a result of conversions. We are only concerned with how Conversion Rates change as a result of other independent variables.
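A matrix plot of this kind can be sketched with base R as follows. The data is simulated and the column names are placeholders rather than the ones in the original data set; the only point is the layout, with the dependent variable placed first so that it sits on the y-axis across the top row.

```r
set.seed(1)  # simulated data, only so the example runs
n <- 60
site <- data.frame(
  Visits           = round(runif(n, 200, 1200)),
  AvgVisitDuration = runif(n, 30, 300),
  PagesPerVisit    = runif(n, 1, 8),
  BounceRate       = runif(n, 0.2, 0.8)
)
site$ConversionRate <- 0.005 + 0.00005 * site$AvgVisitDuration +
  0.002 * site$PagesPerVisit + rnorm(n, sd = 0.004)

# Put the dependent variable first: every panel in the top row then has
# ConversionRate on the y-axis and one predictor on the x-axis.
site <- site[, c("ConversionRate", "Visits", "AvgVisitDuration",
                 "PagesPerVisit", "BounceRate")]

pairs(site)
```

Each panel below the diagonal mirrors one above it with the axes swapped, which is why only the top row matters here.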

Given the fact that a lot of data is compressed into a small space, you wouldn't want to show a client the raw output without marking the most important variables (and explaining the reason why they are the most important variables). After mark-up we have this:


Once a regression line is drawn through each scatter plot, we see that two predictor variables, Average visit duration and Pages/Visit, have reasonably high coefficients of correlation with conversion rate. We will look at them individually.

The average visit duration plot (marked with the yellow regression line) displays a positive relationship to conversion. It is interesting to note that the relationship becomes a little murkier as the time on site gets much higher. The conversion data points track the regression line much more closely at lower levels of average visit duration. The residuals, or errors, in the second half of the graph are larger, which shows that at higher levels time on site is a somewhat weaker predictor of conversions.

Pages per visit also shows a positive relation to conversions. Once again, the residuals grow, which tells us that though pages per visit is a reasonably good predictor of conversion rate, its predictive power diminishes as the number of pages per visit grows.
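To make the idea of growing residuals concrete, here is a small sketch on simulated data. The variable names and numbers are invented, and the noise is deliberately made to grow with visit duration so the pattern resembles the one described above.

```r
set.seed(2)  # simulated data, not the site's actual figures
n <- 80
duration <- runif(n, 30, 600)

# Noise grows with duration to mimic the widening spread at higher values.
conversion <- 0.01 + 0.00004 * duration + rnorm(n, sd = 0.00001 * duration)

# Coefficient of correlation and a simple linear fit.
r   <- cor(duration, conversion)
fit <- lm(conversion ~ duration)

plot(duration, conversion)
abline(fit, col = "yellow")

# Compare the spread of the residuals in the lower and upper halves.
upper   <- duration > median(duration)
sd_low  <- sd(resid(fit)[!upper])
sd_high <- sd(resid(fit)[upper])
c(lower_half = sd_low, upper_half = sd_high)
```

A clearly larger spread in the upper half is the numeric counterpart of the points drifting away from the line on the right side of the plot.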

It's important to note that none of these, as a single predictor variable, carries a high enough coefficient of correlation to predict conversion rate by itself. However, it is worth determining which variables have at least some power to predict conversions. Ultimately, as you may have already assumed, websites that convert well have the intrinsic ability to rank well.

Tuesday, January 7, 2014

Using the Motion Chart for Multivariate Analysis



So, we've all seen them by now: the Motion Charts made famous in a now iconic TED talk given by Hans Rosling. Rosling used them to show how you can present a great deal of data in a clean and meaningful way: on a two-dimensional plane that displays more than just two dimensions.

Collecting meaningful data has always been the most challenging step in generating actionable visualizations. However, digital analytics has made a wealth of free and pertinent information available. We can use this to see exactly which dimensions affect web site traffic, and formulate insights based on them. Sites that function as e-commerce storefronts are a bit less concerned with general traffic and more concerned with sales. Therefore, it's worth examining the conversion rate in a bit more detail.

Take a look at the recent conversion rates of an actual, but unidentified site against the following metrics: bounce rate, pages viewed per visit, aggregated daily visitors, and average time spent on the site per visitor.

For a more traditional view, click the bubble data point, check the box marked 'trails', and change the x-axis to the 'time' option. Now hit play and adjust the speed. You will see how the conversion (sales) rate changes over time alongside the number of overall web site visitors, the number of pages viewed per visit, and the visitor bounce rate. An R-squared value for a specific metric against the dependent variable (conversion) would be good to have, but it would only give us a numeric, abstract statistical understanding of which factors affect conversion most profoundly.
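For reference, a per-metric R-squared of that kind could be computed roughly like this. The data is simulated and the column names are placeholders for the four metrics listed above.

```r
set.seed(3)  # simulated data for illustration only
n <- 90
site <- data.frame(
  bounce_rate     = runif(n, 0.2, 0.8),
  pages_per_visit = runif(n, 1, 8),
  visits          = round(runif(n, 200, 1200)),
  avg_time        = runif(n, 30, 300)
)
site$conversion <- 0.01 + 0.002 * site$pages_per_visit -
  0.01 * site$bounce_rate + rnorm(n, sd = 0.004)

# Fit one single-predictor regression per metric and keep its R-squared.
metrics <- setdiff(names(site), "conversion")
r_squared <- sapply(metrics, function(m) {
  fit <- lm(reformulate(m, response = "conversion"), data = site)
  summary(fit)$r.squared
})

# Metrics ordered from most to least explanatory.
round(sort(r_squared, decreasing = TRUE), 3)
```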

However, the best way to use the motion chart is to avoid putting the time variable on an axis. Only then can we see how all four dimensions affect conversions simultaneously. The multiple dimensions are represented by the x-axis, the y-axis, the size of the bubble, and the color of the bubble at any given point. This gives us a much richer interpretation of the factors that influence conversion.

But what about our clients? After all, aren't we reporting for their benefit? Chances are, they don't care what the R-squared value of any one variable is. But a visual representation like the motion chart can go a long way to show just how much a particular metric influences conversion.

Motion charts can no longer be generated from the Google Docs gadget function. This motion chart was generated in R using the Google Chart Tools interface and the googleVis package.
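A rough sketch of how such a chart could be set up with googleVis follows. The data frame, its column names, and the choice of which metric goes on which axis are all assumptions for illustration. The call is wrapped in tryCatch() because the motion chart dates from the Flash era and, as far as I know, may not be available in newer googleVis releases.

```r
set.seed(4)  # simulated daily metrics for a single hypothetical site
n <- 30
traffic <- data.frame(
  site            = "example-site",
  date            = as.Date("2013-12-01") + 0:(n - 1),
  conversion_rate = runif(n, 0.01, 0.05),
  pages_per_visit = runif(n, 1, 8),
  bounce_rate     = runif(n, 0.2, 0.8),
  visits          = round(runif(n, 200, 1200))
)

chart <- NULL
if (requireNamespace("googleVis", quietly = TRUE)) {
  # idvar and timevar together must identify each row uniquely.
  chart <- tryCatch(
    googleVis::gvisMotionChart(
      traffic,
      idvar    = "site",
      timevar  = "date",
      xvar     = "pages_per_visit",
      yvar     = "conversion_rate",
      colorvar = "bounce_rate",
      sizevar  = "visits"
    ),
    error = function(e) NULL  # chart type unavailable in this installation
  )
}

# If a chart object was created, plot(chart) would open it in a browser.
```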