# Summarizing data

This post is about using the R language for data analysis. Often we need a quick look at a data set through a summary of its key variables. The function *str()* reports the dimensions of a data frame (the number of observations and the number of variables), along with each variable's name, its type (class), and its first few values.

    > str(iris)
    'data.frame':   150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
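The pieces that *str()* prints can also be retrieved one at a time with base R functions, which is handy when you need them as values rather than as printed text. A minimal sketch on the built-in iris data (the variable names here are only illustrative):

```r
# Dimensions: number of observations (rows) and variables (columns)
dims <- dim(iris)

# Variable names and the class of each variable
var_names <- names(iris)
var_classes <- sapply(iris, class)

# First few observations
first_rows <- head(iris, 3)

print(dims)
print(var_classes)
print(first_rows)
```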

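The *summary()* function in the base package provides a nice bird's-eye view of the data. For numeric variables it reports the minimum, quartiles, median, mean and maximum; for factors it reports the count in each level: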

    > summary(iris)
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
     Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
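*summary()* also works on a single column, and its result can be indexed by name if you need one of the statistics as a value. The sketch below shows this, plus a per-species mean computed with *tapply()*; the variable names are only illustrative.

```r
# Summary of one numeric column; elements can be extracted by name
sl_summary <- summary(iris$Sepal.Length)
sl_median <- sl_summary[["Median"]]

# Mean sepal length within each species
sl_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)

print(sl_summary)
print(sl_by_species)
```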

The *summ()* function in the *epicalc* package provides a summary in a slightly different format. Note that it summarizes the factor Species through its integer codes (1 to 3):

    > require(epicalc)
    > summ(iris)

    No. of observations = 150

      Var. name    obs. mean   median  s.d.   min.   max.
    1 Sepal.Length 150  5.84   5.8     0.83   4.3    7.9
    2 Sepal.Width  150  3.06   3       0.44   2      4.4
    3 Petal.Length 150  3.76   4.35    1.77   1      6.9
    4 Petal.Width  150  1.2    1.3     0.76   0.1    2.5
    5 Species      150  2      2       0.819  1      3

I was looking for an elegant way to generate a customized summary of the numeric columns of a data frame. Recently, I was browsing through the book ‘Data Manipulation with R’ by Phil Spector. Building on its explanations of the *sapply* and *apply* functions and other examples in the book, the following code generates a customized data summary.

    > require(fBasics)  # loads the fBasics package for the skewness() and kurtosis() functions
    > # create a function for the customized summary; na.rm = TRUE keeps
    > # missing values from turning the statistics into NA
    > summary.fn <- function(x) round(c(obs      = sum(!is.na(x)),
    +                                   missing  = sum(is.na(x)),
    +                                   median   = median(x, na.rm = TRUE),
    +                                   mean     = mean(x, na.rm = TRUE),
    +                                   sd       = sd(x, na.rm = TRUE),
    +                                   skewness = skewness(x, na.rm = TRUE),
    +                                   kurtosis = kurtosis(x, na.rm = TRUE)), 3)
    > # select the columns that are of 'numeric' class
    > dataframe.numeric <- iris[, sapply(iris, class) == 'numeric']
    > # apply the function to each column and transpose: one row per variable
    > t(apply(dataframe.numeric, 2, summary.fn))
                 obs missing median  mean    sd skewness kurtosis
    Sepal.Length 150       0   5.80 5.843 0.828    0.309   -0.606
    Sepal.Width  150       0   3.00 3.057 0.436    0.313    0.139
    Petal.Length 150       0   4.35 3.758 1.765   -0.269   -1.417
    Petal.Width  150       0   1.30 1.199 0.762   -0.101   -1.358

The function *summary.fn* can be redefined to include whichever statistics you want in your customized view of the data.
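As an illustration of that flexibility, here is a sketch of a variant that reports the range and interquartile range instead, using only base R. The names `range.fn`, `numeric.cols` and `custom.summary` are hypothetical, and `is.numeric` is used for column selection so that integer columns would also be picked up.

```r
# Hypothetical variant of the summary function, base R only
range.fn <- function(x) {
  round(c(obs     = sum(!is.na(x)),
          missing = sum(is.na(x)),
          min     = min(x, na.rm = TRUE),
          max     = max(x, na.rm = TRUE),
          iqr     = IQR(x, na.rm = TRUE)), 3)
}

# Keep the numeric columns; drop = FALSE preserves the data frame
# even if only one column qualifies
numeric.cols <- iris[, sapply(iris, is.numeric), drop = FALSE]

# sapply() returns one column per variable, so transpose to get
# one row per variable
custom.summary <- t(sapply(numeric.cols, range.fn))
print(custom.summary)
```

Using `sapply()` directly on the data frame avoids the conversion to a matrix that `apply()` performs, which matters only when the columns are of mixed types.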
