Application of the Object Oriented System of R for generating group-wise descriptive statistics


Many R packages, such as data.table, tables and psych, provide group-wise (factor-wise) descriptive statistics such as the mean and standard deviation for numeric variables. This article generates a similar tabulated result using only functions available in base R together with R's object oriented system. The main purpose of the exercise is to illustrate how the object oriented system in R can be used to produce output in whatever form we require. For illustration, the iris data set is used; it contains four variables (Sepal Length, Sepal Width, Petal Length and Petal Width) for three species (the factor levels). The following algorithm and R code calculate the mean and standard deviation of these four variables for each species and produce a tabulated result similar to those obtained from the above packages.

Algorithm and Code

1. To compute the mean and standard deviation of any vector, a user-defined function meansd() is defined as follows:

meansd<-function(x) {

The above function receives a numeric vector as input, calculates its mean and standard deviation, and returns the results as a list l.
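The body of meansd() is not reproduced above, so the following is only a minimal sketch of a function that matches the description. The element names mean and sd inside the list are an assumption.

```r
# Sketch of meansd(): mean and standard deviation of one numeric vector,
# returned as a list (the element names "mean" and "sd" are assumed)
meansd <- function(x) {
  l <- list(mean = mean(x), sd = sd(x))
  l
}

# Example call on one column of the built-in iris data
str(meansd(iris$Sepal.Length))
```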

2. Execution starts by calling a function with just two arguments: i) the data frame containing all the variables for which the mean and standard deviation are required, and ii) a vector containing the factor variable. A new function basstat() is therefore defined with these two arguments. For the iris data, this function can be called as given below.


3. The basstat() function splits the iris data species-wise into a list of three sub data frames, one for each species. It then uses lapply(), which calls another function, result(), on each of these sub data frames and collects the aggregated results in a variable “bres”. To print these aggregated results in a neat tabular form, we rely on the object oriented programming features of R: the class of bres is set to “myclass” and the bres object is returned. The code of the basstat() function is given below:

basstat<-function(df,f) {

4. The lapply() call in step 3 calls the function result() with each sub data frame as its input argument. The result() function contains a call to sapply(), which in turn calls meansd() on each variable (column) of the sub data frame and receives the mean and standard deviation for every variable. It captures these in the object “tres” and returns them to the calling lapply(). The code of the result() function is given below:

result<-function(x) {
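Since the function bodies for steps 2 to 4 are missing above, here is a hedged sketch of how basstat() and result() could fit together. It follows the description in the text (split, lapply, sapply, class assignment); the exact statements are assumptions, not the original code.

```r
# meansd() is repeated here only so that this block runs on its own
meansd <- function(x) list(mean = mean(x), sd = sd(x))

# Step 4 (sketch): apply meansd() to every column of one sub data frame;
# sapply() returns a 2 x p matrix of list elements
result <- function(x) {
  tres <- sapply(x, meansd)
  tres
}

# Step 3 (sketch): split by the factor, aggregate each part,
# and tag the aggregated list with the class "myclass"
basstat <- function(df, f) {
  parts <- split(df, f)
  bres <- lapply(parts, result)
  class(bres) <- "myclass"
  bres
}

# Step 2: call with the four numeric columns and the Species factor
res <- basstat(iris[, 1:4], iris$Species)

# Inspect the raw structure: one matrix per species
str(unclass(res), max.level = 1)
```

Without a print method for "myclass", typing res falls back to the default list printing, which is why the output is complete but not compact.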

Through all these function calls, all the results are now available in the object res, which is of class “myclass”. The results are available species-wise, but not in a neat, compact tabular form, as shown below.


5. To print the results in a compact tabular form, a print method for myclass is defined. This function cbinds all the results, performs the required string manipulations and finally prints the table. The code of the print.myclass() function is given below:

print.myclass<-function(x) {
for(i in 1:length(x)) {
cat(" ")
isp<-paste(rep(" ",tsp),collapse="")
for(i in 1:length(nm)) {
rsp<-paste(rep(" ",esp),collapse="")
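Only fragments of print.myclass() survive above, so the following is an independent, simplified sketch of an S3 print method for the same idea. It assumes the object is a named list of 2 x p matrices (rows mean and sd), and it relies on the default matrix printing instead of the manual space padding that the fragments suggest. The digits argument and the column naming scheme are assumptions.

```r
# Sketch of a print method: bind all groups side by side, variables in rows
print.myclass <- function(x, digits = 3, ...) {
  # Convert each group's matrix to plain numeric (works for list matrices too)
  num <- lapply(unclass(x), function(m) {
    matrix(as.numeric(unlist(m)), nrow = nrow(m), dimnames = dimnames(m))
  })
  # Transpose so variables become rows, then cbind the groups
  tab <- do.call(cbind, lapply(num, t))
  colnames(tab) <- paste(rep(names(num), each = nrow(num[[1]])),
                         colnames(tab), sep = ".")
  print(round(tab, digits))
  invisible(x)
}

# Build a small "myclass" object inline so the block runs on its own
res <- lapply(split(iris[, 1:4], iris$Species), function(d) {
  sapply(d, function(v) c(mean = mean(v), sd = sd(v)))
})
class(res) <- "myclass"

# Typing the object name (or calling print) now dispatches to print.myclass()
print(res)
```

Because dispatch is based on the class attribute, no caller needs to know about the formatting code; changing the table layout only requires editing the method.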

6. These results can be printed by simply typing the name of the res object from step 2.


The code for using the basstat() function and the results obtained are given below :


Some more Results :

i). Descriptive statistics of six variables mpg, disp, hp, drat, wt and qsec for the factor cyl, whose levels/groups are 4, 6 and 8 cylinders, from the dataset mtcars of the datasets package



ii). Descriptive statistics of two variables Prewt and Postwt for the groups CBT, Cont and FT of the dataset anorexia of the MASS package


iii). Descriptive statistics of three variables Price, and MPG.highway for the groups Compact, Large, Midsize, Small, Sporty and Van of the dataset Cars93 of the MASS package



The tabulated output from all the above examples is similar to that obtained from the data.table and tables packages. We could achieve this by using the object oriented concepts of the R language. In this exercise, I have obtained the mean and standard deviation of a number of variables group-wise. It is also possible to modify the program to obtain other statistics such as the min, max, median, and 1st and 3rd quartiles for a number of variables group-wise.
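As a rough illustration of that last point, the per-vector helper can be swapped for one that returns several statistics. This is a sketch only; the names stat_funs and multistat are hypothetical and do not come from the original program.

```r
# Hypothetical list of statistics to compute for each variable
stat_funs <- list(
  min    = min,
  q1     = function(v) unname(quantile(v, 0.25)),
  median = median,
  q3     = function(v) unname(quantile(v, 0.75)),
  max    = max
)

# Apply every statistic to one numeric vector; returns a named numeric vector
multistat <- function(x) sapply(stat_funs, function(fn) fn(x))

# Same split/lapply/sapply pattern as before, with the new helper
by_group <- lapply(split(iris[, 1:4], iris$Species), function(d) {
  sapply(d, multistat)
})

# One 5 x 4 matrix per species
print(by_group$setosa)
```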

Rcmdr Plug-in(s)

These plug-ins enhance the statistical graphical user interface provided by Rcmdr by adding new menus. While the original GUI was created for basic statistical calculations, support for extensions (plug-ins) has greatly widened the possible use and scope of the software. Installing these plug-ins is easy: they are installed like any other R package. After installation, a plug-in can be activated by selecting Tools – Load Rcmdr plug-in(s) in the Rcmdr menu. Some useful Rcmdr plug-ins and their uses are listed below:

1. RcmdrPlugin.BCA – Business and Customer Analytics
2. RcmdrPlugin.depthTools – statistical tools for the description and analysis of gene expression data based on the concept of data depth
3. RcmdrPlugin.DoE – Design of Experiments
4. RcmdrPlugin.EBM – Evidence Based Medicine plug-in
5. RcmdrPlugin.epack – plug-in for time series
6. RcmdrPlugin.EZR – adds a variety of statistical functions, including survival analyses, ROC analyses, meta-analyses, sample size calculation, and so on
7. RcmdrPlugin.FactoMineR – dedicated to multivariate data analysis
8. RcmdrPlugin.KMggplot2 – Kaplan-Meier plots and other plots using the ggplot2 package
9. RcmdrPlugin.NMBU – extends linear models and provides new extended interfaces for PCA, PLS, LDA, QDA, clustering of variables, tests, plots etc.
10. RcmdrPlugin.sampling – provides tools for calculating sample sizes and selecting samples using various sampling designs
11. RcmdrPlugin.survival – plug-in for the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves etc.
12. RcmdrPlugin.temis – provides an integrated solution to perform a series of text mining tasks
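As a small illustration of the installation step, the sketch below lists any plug-in packages already installed and shows, without running it, how one plug-in from the list could be installed. The choice of RcmdrPlugin.EZR is arbitrary, and the install step needs network access, so it is wrapped in a block that is not executed.

```r
# Names of installed packages that follow the RcmdrPlugin.* naming pattern
installed_plugins <- grep("^RcmdrPlugin\\.", rownames(installed.packages()),
                          value = TRUE)
print(installed_plugins)

# Installing a plug-in works like any other CRAN package (shown, not run)
if (FALSE) {
  install.packages("RcmdrPlugin.EZR")
  library(Rcmdr)  # then choose Tools - Load Rcmdr plug-in(s) in the menu
}
```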


Curve Fitting or Polynomial Regression between two variables

When the relation between two variables x and y is not linear but curvilinear (which can be observed from a scatter plot of y against x), one can perform curve fitting or polynomial regression between these two variables. For details of how to perform curve fitting or polynomial regression between two variables x and y using the R functions lm() and poly(), and the stopping criteria that can be followed, read the article at the address given below:
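A minimal sketch of the idea with lm() and poly() follows. The data are simulated, and the stopping rule used here (raise the degree until the added term is no longer significant in a nested F-test) is one common choice, assumed for illustration; it is not necessarily the criterion used in the referenced article.

```r
# Simulated curvilinear data (illustrative only)
set.seed(1)
x <- seq(1, 10, length.out = 60)
y <- 2 + 3 * x - 0.5 * x^2 + rnorm(length(x), sd = 1)

# Fit polynomials of increasing degree using orthogonal polynomial terms
fit1 <- lm(y ~ poly(x, 1))
fit2 <- lm(y ~ poly(x, 2))
fit3 <- lm(y ~ poly(x, 3))

# Nested F-tests: each row tests whether the extra degree improves the fit
cmp <- anova(fit1, fit2, fit3)
print(cmp)

# p-value for adding the quadratic term to the linear model
p_quadratic <- cmp[["Pr(>F)"]][2]
print(p_quadratic)
```

Under the assumed rule, one would keep the lowest degree after which the next row of the table is no longer significant.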

Analysis of tabular data from csv file

Sometimes the data in a csv file are not in the format required for analysis. Consider the following csv file: four different treatments were given to four different groups of patients, a random sample of size 7 was selected from each group, and the Hb percentage levels in the blood were measured after one month. The objective is to test, by applying one-way ANOVA, whether the treatments lead to significant differences in the mean Hb percentage levels.

The following is the csv file containing the patients treatments data.

Method :

1. Read the csv file and convert it to a matrix “trt”. Next, extract the four rows of the matrix and convert them into vectors.
2. Create a list from these vectors and then use the stack() function to convert the list into a data frame “df”.
3. The stack() function creates the data frame df with two columns, values and ind: ind is a categorical variable (factor) identifying the treatment, and values contains the Hb percentage levels. Rename the column ind to “Treatments” and perform one-way ANOVA.
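The three steps above can be sketched as follows. The original csv file is not reproduced here, so the block writes simulated values to a temporary file first; the numbers, the treatment labels T1 to T4 and the file layout (one row per treatment, label in the first column, no header) are all assumptions for illustration.

```r
# Simulated stand-in for the csv file: 4 treatments x 7 patients (made-up values)
set.seed(42)
hb <- matrix(
  round(rnorm(28, mean = rep(c(11, 12, 12.5, 13), each = 7), sd = 0.5), 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("T", 1:4), NULL)
)
csv_path <- tempfile(fileext = ".csv")
write.table(hb, file = csv_path, sep = ",", col.names = FALSE, quote = FALSE)

# Step 1: read the file as a matrix and extract each row as a plain vector
trt <- as.matrix(read.csv(csv_path, header = FALSE, row.names = 1))
t1 <- as.numeric(trt[1, ])
t2 <- as.numeric(trt[2, ])
t3 <- as.numeric(trt[3, ])
t4 <- as.numeric(trt[4, ])

# Step 2: build a named list and stack it into a long data frame
lst <- list(T1 = t1, T2 = t2, T3 = t3, T4 = t4)
df <- stack(lst)

# Step 3: rename the grouping column and run the one-way ANOVA
names(df)[names(df) == "ind"] <- "Treatments"
fit <- aov(values ~ Treatments, data = df)
print(summary(fit))
```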

The results are given below :


The complete code is given below :




Aggregating basic statistics group-wise in R

Often, while doing statistical analysis, we have to compute descriptive statistics such as the mean and standard deviation for a number of variables, group-wise. Most statistical packages, such as SAS and SPSS, provide these features. In R, the data.table package is very useful for aggregating and tabulating results of this type. It offers fast aggregation of large data, fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). In addition, the tables package and the describeBy() function of the psych package were found to be useful for generating results of this type. As an exercise, the iris data is considered, which contains four variables, Sepal Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW), for three species: setosa, versicolor and virginica. Three sets of results were generated, one with each of the three packages listed above.

1. Mean and standard deviation for all four variables species-wise using the data.table package

2. Mean and standard deviation for all four variables species-wise using the tables package

3. Mean and standard deviation for all four variables species-wise using the psych package
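The original code and output for these three results are not reproduced here. As a hedged sketch, one way to obtain comparable summaries with each package is shown below; each part runs only if the corresponding package is installed, and the particular calls are one reasonable choice rather than the calls used to produce the original results.

```r
# 1. data.table: one row per species, one column per variable
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- as.data.table(iris)
  dt_mean <- DT[, lapply(.SD, mean), by = Species]
  dt_sd   <- DT[, lapply(.SD, sd),   by = Species]
  print(dt_mean)
  print(dt_sd)
}

# 2. tables: species in rows, mean and sd of each variable in columns
if (requireNamespace("tables", quietly = TRUE)) {
  library(tables)
  tab <- tabular(
    Species ~ (Sepal.Length + Sepal.Width + Petal.Length + Petal.Width) *
      (mean + sd),
    data = iris
  )
  print(tab)
}

# 3. psych: full descriptive summary (including mean and sd) per species
if (requireNamespace("psych", quietly = TRUE)) {
  library(psych)
  print(describeBy(iris[, 1:4], group = iris$Species))
}
```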