Summarizing data

This post is about using the R language for data analysis. Often we need a quick look at a data set through summaries of its key variables. The str() function reports the dimensions of a data frame (the number of observations and the number of variables), followed by the name of each variable, its type (class) and its first few values.

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
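
The pieces that str() prints can also be retrieved one at a time, which is convenient inside a script. The following is a minimal sketch using base R only; the variable names (iris.dim, iris.names, iris.classes) are made up for illustration.

```r
# Dimensions: number of observations (rows) and variables (columns)
iris.dim <- dim(iris)
iris.dim                      # 150 5

# Names of the variables
iris.names <- names(iris)
iris.names

# Class of each variable, returned as a named character vector
iris.classes <- sapply(iris, class)
iris.classes[["Species"]]     # "factor"

# First three observations
head(iris, 3)
```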

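The summary() function in the base package provides a bird's-eye view of the data: the minimum, quartiles, mean and maximum of each numeric variable, and the number of observations in each level of a factor.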

> summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species 
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50 
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50 
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50 
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                 
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                 
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
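
summary() also works on a single column, and the individual values can then be picked out by name. A short sketch, assuming nothing beyond base R; sepal.summary and species.counts are illustrative names.

```r
# Summary of one numeric variable: a named numeric vector
sepal.summary <- summary(iris$Sepal.Length)
sepal.summary[["Median"]]     # 5.8
sepal.summary[["Max."]]       # 7.9

# Summary of a factor: the count of observations in each level
species.counts <- summary(iris$Species)
species.counts[["setosa"]]    # 50
```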

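The summ() function in the epicalc package provides the summary in a slightly different format, with one row per variable and the standard deviation included. Note that it treats the factor Species as its integer codes (1 to 3), so the statistics in that row are not meaningful.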

> require(epicalc)

> summ(iris)

No. of observations = 150
  Var. name    obs. mean   median  s.d.   min.   max.
1 Sepal.Length 150  5.84   5.8     0.83   4.3    7.9 
2 Sepal.Width  150  3.06   3       0.44   2      4.4 
3 Petal.Length 150  3.76   4.35    1.77   1      6.9 
4 Petal.Width  150  1.2    1.3     0.76   0.1    2.5 
5 Species      150  2      2       0.819  1      3

I was looking for an elegant way to generate a customized summary of the numeric variables in a data frame. Recently, I was browsing through the book ‘Data Manipulation with R’ by Phil Spector. Drawing on its explanations of the sapply and apply functions and on other examples in the book, the following code generates such a summary.

> require(fBasics)
# loads the fBasics package, which provides the skewness() and kurtosis() functions

> summary.fn <- function(x) round(c(obs = sum(!is.na(x)), missing = sum(is.na(x)),
+     median = median(x, na.rm = TRUE), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
+     skewness = skewness(x, na.rm = TRUE), kurtosis = kurtosis(x, na.rm = TRUE)), 3)
# creates a function for the customized summary; na.rm = TRUE keeps the statistics defined when a column has missing values

> dataframe.numeric <- iris[, sapply(iris, is.numeric)]
# selects the numeric columns

> t(apply(dataframe.numeric, 2, summary.fn))
# generates the customized summary, one row per variable

             obs missing median  mean    sd skewness kurtosis
Sepal.Length 150       0   5.80 5.843 0.828    0.309   -0.606
Sepal.Width  150       0   3.00 3.057 0.436    0.313    0.139
Petal.Length 150       0   4.35 3.758 1.765   -0.269   -1.417
Petal.Width  150       0   1.30 1.199 0.762   -0.101   -1.358

The summary.fn function can be redefined to include any other statistic of interest.
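
As one possible variation, the sketch below reports the quartiles and the interquartile range instead of skewness and kurtosis, so it needs no package beyond base R. The names quartile.summary, numeric.columns and quartile.table are hypothetical and chosen only for this example; sapply is used here in place of apply so that the data frame is not first converted to a matrix.

```r
# Hypothetical variant of the customized summary: quartiles and IQR
# instead of skewness and kurtosis (base R only).
quartile.summary <- function(x) {
  round(c(obs     = sum(!is.na(x)),
          missing = sum(is.na(x)),
          q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
          median  = median(x, na.rm = TRUE),
          q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
          iqr     = IQR(x, na.rm = TRUE)),
        3)
}

# Keep the numeric columns, apply the function to each, and
# transpose so that each variable becomes a row.
numeric.columns <- iris[, sapply(iris, is.numeric)]
quartile.table <- t(sapply(numeric.columns, quartile.summary))
quartile.table
```

Any function that takes a numeric vector and returns a named numeric vector can be dropped into the same pattern.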
