Sunday, March 23, 2014

Easily Assign Attributes for Your Metrics in R

One of the most common mistakes I've seen is failing to set the correct variable types before analyzing data in R or other statistical software packages. It is critical that your variables are in the right format (numeric, integer, factor, etc.), or you run the risk of producing relatively meaningless insights. Fortunately, R makes it remarkably painless to assign the correct type to each variable.

If you are using Google Analytics to track your data, you are restricted to exporting your data to csv files. As a result, Excel "guesses" at what it thinks each variable should be. These properties are then carried over to the individual variables when the data is read into R.

After exporting the data into a csv file via Google Analytics, I intentionally left the default Excel settings to highlight the problems that come up. I named the dataset "x", and read the csv into R. Before analyzing your data set, it is critical to take a moment to explore the individual variables and the structure of the data itself. The 'structure' command comes in very handy to do just that. This is done by typing str(x).
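As a rough sketch of this step, the snippet below reads a small CSV and inspects it with str(). The three rows are invented stand-ins for the real export, and the quoted values with thousands separators are an assumption about what Excel left in the file. Note that since R 4.0 read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is set here to mimic the older behavior.

```r
# Hypothetical stand-in for the exported CSV; the rows below are invented.
csv_text <- 'date,google,bing,bounced,all
2014/02/20,512,48,301,"1,204"
2014/02/21,498,51,288,"1,187"
2014/02/22,301,30,190,"745"'

# stringsAsFactors = TRUE reproduces the pre-4.0 default of read.csv(),
# under which text columns are read in as factors.
x <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Compactly display the structure: dimensions, column names, and types.
str(x)
```

With a real export you would pass the file path instead of the text argument, for example read.csv("analytics.csv"), where the file name is only a placeholder.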

There are 5 variables: Google organic, Bing organic, visitors who bounced, all visitors, and the date. The default date range pulled 31 days of data (the 31 observations in the example). Next, the variable types are printed. Notice that 'google', 'bing', and 'bounced' are labelled as integers, while 'all' and 'date' are factors. As a result, R is treating each distinct value in those two columns as a separate category. Obviously, this would pose an enormous problem when attempting to calculate basic summary statistics, since you can't compute a mean or a variance on a categorical variable. R can parse dates formatted YYYY/MM/DD, but the column still has to be converted explicitly before R treats it as a date. The following code assigns the correct attributes to the variables for analysis.

R Code variable transformations
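A minimal sketch of such a conversion might look like the following. It assumes that 'all' was read as a factor because its values contain thousands separators such as "1,204"; that cause, and the sample rows, are invented for illustration. If your factor column has a different cause, the gsub() step will need adjusting.

```r
# Hypothetical stand-in for the exported CSV; the rows below are invented.
csv_text <- 'date,google,bing,bounced,all
2014/02/20,512,48,301,"1,204"
2014/02/21,498,51,288,"1,187"
2014/02/22,301,30,190,"745"'
x <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Convert the date factor to a Date, stating the format explicitly.
x$date <- as.Date(as.character(x$date), format = "%Y/%m/%d")

# Integer columns convert to numeric directly.
x$google  <- as.numeric(x$google)
x$bing    <- as.numeric(x$bing)
x$bounced <- as.numeric(x$bounced)

# For a factor, go through character first and strip the commas.
# Calling as.numeric() on the factor itself would return the internal
# level codes rather than the values shown in the column.
x$all <- as.numeric(gsub(",", "", as.character(x$all)))

# Re-check the structure after converting.
str(x)
```

The detour through as.character() is the important design choice here: it is what keeps the numbers themselves unchanged while only the type changes.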

The x$date expression simply instructs R to look for 'date' in dataset 'x', and the assignment converts it to a date as opposed to a factor. The rest of the variables are converted to numeric. Though no error messages came up, it is still a good idea to check for accuracy. After another quick str() call on the data set, we can see that the numeric variables are, in fact, numeric and that date is no longer a factor.

This may seem like a mundane step in your analysis, but it is one of the most important things you can do when analyzing any data at all. Note that the numbers themselves have not changed; only the variable types have. Now we can compute summary statistics, covariance tables, plot histograms and more.
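To illustrate what becomes possible once the types are right, here is a short sketch using base R only. The data frame is a hypothetical, already-converted stand-in with invented values, and the names metrics, cov_table, and h are just illustrative.

```r
# Hypothetical, already-converted data; the values are invented.
x <- data.frame(
  date    = as.Date(c("2014-02-20", "2014-02-21", "2014-02-22", "2014-02-23")),
  google  = c(512, 498, 301, 275),
  bing    = c(48, 51, 30, 27),
  bounced = c(301, 288, 190, 170),
  all     = c(1204, 1187, 745, 690)
)

# Keep only the numeric metrics for the numeric summaries.
metrics <- x[, c("google", "bing", "bounced", "all")]

# Min, quartiles, mean, and max for each column.
print(summary(metrics))

# Covariance matrix of the numeric columns.
cov_table <- cov(metrics)
print(cov_table)

# Compute histogram bins without drawing; drop plot = FALSE to draw the chart.
h <- hist(x$all, plot = FALSE)
print(h$counts)
```

Had 'all' still been a factor, cov() would have stopped with an error and hist() would have refused the input, which is exactly the kind of problem the conversion step avoids.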