Commercial Applications using R

Introduction

R can also be used for commercial applications. In this article, I describe how R can be used for a commercial application such as payroll. For this exercise, assume that in a typical Indian company every employee receives two allowances, Dearness Allowance (DA) and House Rent Allowance (HRA), in addition to the Basic Salary. Assume that DA and HRA are 22% and 30% of the Basic Salary respectively, and that the company deducts Income Tax at 10% of the Basic Salary. The Gross Salary is the sum of the Basic Salary, DA and HRA. The Net Salary is the Gross Salary minus the Income Tax. The task is to perform these calculations for each employee and to generate a payslip for each employee in a specified format showing all these details.
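As a quick check of these rules, the arithmetic for a single employee can be worked through directly in R. The basic salary of 10000 is a hypothetical figure chosen only for illustration; the percentages are the ones stated above.

```r
# Hypothetical basic salary, chosen only to illustrate the arithmetic
basic <- 10000

da    <- basic * 0.22        # Dearness Allowance: 22% of basic
hra   <- basic * 0.30        # House Rent Allowance: 30% of basic
gross <- basic + da + hra    # Gross Salary = basic + DA + HRA
itax  <- basic * 0.10        # Income Tax: 10% of basic
net   <- gross - itax        # Net Salary = gross - income tax

cat("Gross:", gross, "Net:", net, "\n")
```

With these figures the gross salary is 15200 and the net salary is 14200, that is, 1.52 and 1.42 times the basic salary.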

Sample Data

Sample data for four employees is shown below:

[Figure: data]

Calculations and Printing of Payslips

The following user-defined function payroll() performs all the calculations and returns the results as a list of class "payroll":

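payroll<-function(df) {
  # Copy the employee details from the data frame
  lst<-list()
  lst$eno<-df$eno
  lst$name<-df$name
  lst$dept<-df$dept
  lst$salary<-df$salary
  # Derive the pay components from the basic salary
  lst$da<-df$salary*0.22                 # DA: 22% of basic
  lst$hra<-df$salary*0.3                 # HRA: 30% of basic
  lst$gpay<-df$salary+lst$da+lst$hra     # gross pay
  lst$itax<-df$salary*0.1                # income tax: 10% of basic
  lst$npay<-lst$gpay-lst$itax            # net pay
  class(lst)<-"payroll"
  lst
}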

To print the payslips, a print method for the class "payroll" is defined as follows:

print.payroll<-function(x, ...) {
  n<-length(x$eno)   # number of employees, taken from the object itself
  cat("\n")
  for(i in seq_len(n)) {
    cat("Eno : ",x$eno[i],"\n")
    cat("Name : ",x$name[i],"\n")
    cat("Dept : ",x$dept[i],"\n")
    cat("Salary : ",x$salary[i],"\n")
    cat("DA : ",x$da[i],"\n")
    cat("HRA : ",x$hra[i],"\n")
    cat("Gross Pay : ",x$gpay[i],"\n")
    cat("ITax : ",x$itax[i],"\n")
    cat("Net Salary : ",x$npay[i],"\n\n")
    # Separator line of 40 "=" characters after each payslip
    cat(strrep("=",40),"\n",sep="")
  }
  invisible(x)
}

The following R code reads the data from the CSV file payroll.csv, performs all the calculations and prints the payslips for all the employees.

df<-read.csv("g:/RExercises/OOP/payroll.csv",header=TRUE,sep=",",
stringsAsFactors=FALSE)
n<-nrow(df)          # number of employees
res<-payroll(df)
res                  # typing the object name calls print.payroll()
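Because the CSV file itself is not reproduced here, the following sketch builds a small in-memory data set with the column names the code above expects (eno, name, dept, salary) and applies the same percentages to whole columns at once. The four records are hypothetical, not the author's data, and the names sample_csv and pay are illustrative only.

```r
# Hypothetical records; only the column names come from the payroll() code
sample_csv <- "eno,name,dept,salary
101,Asha,Sales,20000
102,Ravi,Accounts,25000
103,Meena,HR,30000
104,Kiran,IT,40000"

pay <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# Same rules as payroll(), applied column-wise (vectorised)
pay$da   <- pay$salary * 0.22
pay$hra  <- pay$salary * 0.30
pay$gpay <- pay$salary + pay$da + pay$hra
pay$itax <- pay$salary * 0.10
pay$npay <- pay$gpay - pay$itax

print(pay)
```

Keeping the results in a data frame like this is convenient for checking the figures, while the list of class "payroll" is what allows the custom payslip layout through print.payroll().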

Payslips printing

The payslips generated by the above code are given below:

[Figure: payslip1]

[Figure: payslip2]

Conclusions

This article described an R program for a commercial application such as payroll.


Application of Object Oriented System of R for generating descriptive statistics group-wise

Introduction

Many R packages, such as data.table, tables and psych, provide group-wise (factor-wise) descriptive statistics such as the mean and standard deviation for numeric variables. In this article, an attempt is made to generate similar tabulated results using only functions available in base R together with R's object-oriented system. The main purpose of the exercise is to illustrate how the object-oriented system in R can be used to produce output in the format we require. For illustration, the iris data set is used, which contains four variables, Sepal Length, Sepal Width, Petal Length and Petal Width, for three species (the factor levels). The following algorithm and R code calculate the mean and standard deviation of these four variables for each species and generate a tabulated result similar to that obtained from the above packages.

Algorithm and Code

1. To generate the mean and standard deviation of any vector, a user-defined function meansd() is defined as follows:

meansd<-function(x) {
l<-list()
l$Mean<-mean(x)
l$SD<-sd(x)
return(l)
}

The above function receives any numeric vector as input, calculates its mean and standard deviation and returns the results as a list l.
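A short usage sketch may help here. The function below is an equivalent, condensed form of meansd(), repeated so the snippet runs on its own, and the input vector v is hypothetical.

```r
# Equivalent, condensed form of meansd() so the snippet is self-contained
meansd <- function(x) {
  list(Mean = mean(x), SD = sd(x))
}

# Hypothetical input vector
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
out <- meansd(v)

str(out)      # a list with two numeric elements, Mean and SD
out$Mean
out$SD
```

Note that sd() computes the sample standard deviation, dividing by n - 1 rather than n.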

2. Execution starts by calling a function with just two arguments: i) the data frame containing all the variables for which the mean and standard deviation are required and ii) a vector containing the factor variable. So a new function basstat() is defined with just two arguments, the first being the data frame of variables and the second the factor variable. For the iris data, this function can be called as shown below.

res<-basstat(iris[,1:4],iris[,5])

3. The basstat() function splits the iris data species-wise into a list of three sub data frames, one for each species. It then uses lapply(), which calls another function result() on each of these sub data frames and collects the aggregated results in the variable "res". To print these aggregated results in a neat tabular form, we use R's object-oriented programming features: the class of res is set to "myclass" and the res object is returned. The code of the basstat() function is given below:

basstat<-function(df,f) {
  l<-split(df,f)
  res<-lapply(l,result)
  class(res)<-"myclass"
  return(res)
}
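To see what the split() step inside basstat() produces, the following sketch applies it to the iris data on its own. The name parts is illustrative only.

```r
# split() divides the rows of a data frame by the levels of a factor
parts <- split(iris[, 1:4], iris[, 5])

class(parts)          # a plain list
names(parts)          # one element per species
sapply(parts, nrow)   # number of rows in each sub data frame
```

Each element of the list is a data frame with the same four columns, which is why the same function can then be applied to every element with lapply().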

4. The lapply() call in step 3 calls the function result() with each sub data frame as its input argument. The result() function contains a call to sapply(), which in turn calls meansd() on each column (variable) of the sub data frame and receives the mean and standard deviation of every variable in it. It stores them in the object "tres" and returns them to the calling lapply(). The code of the result() function is given below:

result<-function(x) {
tres<-sapply(x,meansd)
return(tres)
}
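The shape of the object returned by result() explains why a custom print method is useful later. The sketch below runs the same sapply() call on the setosa rows only; the condensed meansd() is repeated so the snippet is self-contained, and the name setosa is illustrative.

```r
# Condensed form of meansd(), repeated so the snippet runs on its own
meansd <- function(x) list(Mean = mean(x), SD = sd(x))

setosa <- iris[iris$Species == "setosa", 1:4]
tres <- sapply(setosa, meansd)   # meansd() is applied to each column

dim(tres)        # 2 rows (Mean, SD) by 4 columns (one per variable)
rownames(tres)
is.list(tres)    # the matrix cells are list elements, not plain numbers
tres[["Mean", "Sepal.Length"]]   # extract a single value
```

Because meansd() returns a list, sapply() simplifies the results to a matrix whose cells are list elements; transposing such matrices and binding them together is what the print method for myclass does.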

Through all these function calls, all the results are now available in the object res, which is of class "myclass". At this stage the results are available species-wise, but not in a compact tabular form, as shown below.

[Figure: raw-results]

5. To print the results in a compact tabular form, a print method for myclass is defined as follows. This function combines all the results with cbind(), does the required string manipulation for the header line and finally prints the table. The code of the print.myclass() function is given below:

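print.myclass<-function(x, ...) {
  nm<-names(x)
  old<-options(digits=4)   # print with 4 significant digits
  on.exit(options(old))    # restore the previous setting on exit
  # Bind the transposed result tables of all groups side by side
  finres<-vector()
  for(i in seq_along(x)) {
    finres<-cbind(finres,t(x[[i]]))
  }
  # Header line: leading indent, then each group name padded with spaces
  cat(" ")
  tsp<-max(nchar(names(x)))
  isp<-paste(rep(" ",tsp),collapse="")
  cat(isp)
  for(i in seq_along(nm)) {
    tt<-nchar(nm[i])
    esp<-if(tt<12) 12-tt else 1   # number of padding spaces
    rsp<-paste(rep(" ",esp),collapse="")
    nm[i]<-paste(nm[i],rsp)
    cat(nm[i])
  }
  cat("\n")
  print(finres)
  invisible(x)
}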

6. These results can be printed by simply typing the name of the res object from step 2:

>res

The code for using the basstat() function and the results obtained are given below:

[Figure: irisresults]

Some more results:

i) Descriptive statistics of six variables mpg, disp, hp, drat, wt and qsec for the factor cyl, whose levels/groups are 4, 6 and 8 cylinders, from the mtcars data set of the datasets package

res1<-basstat(mtcars[,c(1,3,4,5,6,7)],mtcars[,2])
res1

[Figure: mtcarsresults]

ii) Descriptive statistics of two variables Prewt and Postwt for the groups CBT, Cont and FT of the anorexia data set of the MASS package

[Figure: anorexia-results]

iii) Descriptive statistics of three variables Price, MPG.city and MPG.highway for the groups Compact, Large, Midsize, Small, Sporty and Van of the Cars93 data set of the MASS package

[Figure: cars93-results]

Conclusions

The tabulated output obtained in all the above examples is similar to that obtained from the data.table and tables packages. This was achieved using the object-oriented features of the R language. In this exercise, I obtained the mean and standard deviation of a number of variables group-wise. The program can also be modified to obtain other statistics such as the minimum, maximum, median and first and third quartiles for a number of variables group-wise.
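One possible way to make that modification is sketched below. The function morestats() and the object names groups and ext are hypothetical, not part of the original program; the sketch returns a named numeric vector instead of a list, so the per-group result is an ordinary numeric matrix.

```r
# Hypothetical extension of meansd(): more summary statistics per vector
morestats <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(Min = min(x), Q1 = q[1], Median = median(x), Mean = mean(x),
    Q3 = q[2], Max = max(x), SD = sd(x))
}

# Apply it to every variable within every group, as basstat() does
groups <- split(iris[, 1:4], iris[, 5])
ext <- lapply(groups, function(d) sapply(d, morestats))

round(ext$setosa, 3)   # statistics in rows, variables in columns
```

Since each element of ext is a plain numeric matrix, it prints readably without a custom method; a print method like the one above would only be needed to place the groups side by side.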

Rcmdr Plug-in(s)

These plug-ins enhance the Rcmdr graphical user interface by adding new menus to the statistical package it provides. While the original GUI was created for basic statistical calculations, support for extensions (plug-ins) has greatly widened the possible use and scope of the software. Installing these plug-ins is easy: they can be installed like any other R package. After installing a plug-in package, it can be activated by selecting the menu option Tools – Load Rcmdr plug-in(s) in Rcmdr. Some useful Rcmdr plug-ins and their uses are listed below:

1. RcmdrPlugin.bca – Business and Customer Analytics
2. RcmdrPlugin.depthTools – implements different statistical tools for the description and analysis of gene expression data based on the concept of data depth
3. RcmdrPlugin.DoE – Design of Experiments
4. RcmdrPlugin.EBM – Evidence Based Medicine plug-in
5. RcmdrPlugin.epack – plug-in for time series
6. RcmdrPlugin.EZR – adds a variety of statistical functions, including survival analyses, ROC analyses, meta-analyses, sample size calculation, and so on
7. RcmdrPlugin.FactoMineR – dedicated to multivariate data analysis
8. RcmdrPlugin.KMggplot2 – Kaplan-Meier plots and other plots using the ggplot2 package
9. RcmdrPlugin.NMBU – extends linear models and provides new extended interfaces for PCA, PLS, LDA, QDA, clustering of variables, tests, plots etc.
10. RcmdrPlugin.sampling – provides tools for calculating sample sizes and selecting samples using various sampling designs
11. RcmdrPlugin.survival – interface to the survival package, with dialogs for Cox models, parametric survival regression models, estimation of survival curves etc.
12. RcmdrPlugin.temis – provides an integrated solution to perform a series of text mining tasks

Curve Fitting or Polynomial Regression between two variables

When the relationship between two variables x and y is not linear but curvilinear (which can be seen in a scatter plot of y against x), one can perform curve fitting, or polynomial regression, between the two variables. For details on how to perform curve fitting or polynomial regression between two variables x and y using the R functions lm and poly, and on the stopping criteria that can be followed, read the article at the address given below:

http://davetang.org/muse/2013/05/09/on-curve-fitting/
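As a minimal sketch of the idea, the code below fits polynomials of increasing degree with lm() and poly() to synthetic data and compares them with sequential F-tests. The data are simulated, and using anova() as the stopping rule is an assumption made here for illustration; it is not necessarily the criterion used in the linked article.

```r
set.seed(1)                          # reproducible synthetic data
x <- seq(1, 10, length.out = 40)
# Hypothetical curved relationship plus random noise
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(40, sd = 0.5)

fit1 <- lm(y ~ x)                    # straight line
fit2 <- lm(y ~ poly(x, 2))           # quadratic
fit3 <- lm(y ~ poly(x, 3))           # cubic

# Sequential F-tests: does each extra degree improve the fit significantly?
cmp <- anova(fit1, fit2, fit3)
print(cmp)

# R-squared of each fit, for a quick comparison
sapply(list(linear = fit1, quadratic = fit2, cubic = fit3),
       function(m) summary(m)$r.squared)
```

Under this rule, one would stop at the lowest degree after which the next F-test is no longer significant.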

Analysis of tabular data from csv file

Sometimes the data in a csv file are not in the required format. Consider the following csv file: four different treatments were given to four different groups of patients, a random sample of size 7 was selected from each group, and the Hb percentage level in the blood was measured after one month. The objective is to test, using one-way ANOVA, whether the treatments produce significant differences in the mean Hb percentage levels.

The following is the csv file containing the patients' treatment data.

[Figure: treatments-rawdata1]

Method:

1. Read the csv file and convert it to a matrix "trt". Next, extract the four rows of the matrix as vectors.
2. Create a list from these vectors and then convert the list into a data frame "df" using the stack() function.
3. The stack() function creates the data frame df with two columns, values and ind. ind is a categorical variable (factor) and values contains the Hb percentage levels. Rename the column ind to "Treatments" and perform one-way ANOVA.
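The steps above can be sketched as follows. The numbers are hypothetical and the matrix is typed in directly as a stand-in for the one obtained from the csv file, so this is an illustration of the method under those assumptions rather than a reproduction of the original code. The list element names T1 to T4 and the object names lst and fit are also illustrative.

```r
# Hypothetical Hb percentage values: 4 treatments (rows) x 7 patients (columns)
trt <- matrix(c(11.2, 10.8, 11.5, 10.9, 11.1, 11.4, 10.7,
                12.1, 12.4, 11.9, 12.6, 12.2, 12.0, 12.5,
                10.1, 10.4,  9.9, 10.6, 10.2, 10.0, 10.5,
                11.8, 11.6, 12.0, 11.7, 11.9, 11.5, 12.1),
              nrow = 4, byrow = TRUE)

# Steps 1 and 2: one vector per treatment (matrix row), collected in a
# named list and stacked into a long-format data frame
lst <- list(T1 = trt[1, ], T2 = trt[2, ], T3 = trt[3, ], T4 = trt[4, ])
df  <- stack(lst)                  # columns: values (numeric), ind (factor)

# Step 3: rename ind and run the one-way ANOVA
names(df)[names(df) == "ind"] <- "Treatments"
fit <- aov(values ~ Treatments, data = df)
summary(fit)
```

With 4 groups of 7 observations, the ANOVA table has 3 degrees of freedom for Treatments and 24 for the residuals.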

The results are given below:

[Figure: anova-results]

The complete code is given below:

[Figure: anovacode]

Aggregating basic statistics group-wise in R

Often, while doing statistical analysis, we have to evaluate descriptive statistics such as the mean and standard deviation for a number of variables, group-wise. Most statistical packages, such as SAS and SPSS, provide these features. In R, the data.table package is very useful for aggregating and tabulating results of this type. It offers fast aggregation of large data, fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). In addition, the tables package and the describeBy function of the psych package were also found to be useful for generating results of this type. As an exercise, the iris data set is considered, which contains four variables, Sepal Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW), for three species: setosa, versicolor and virginica. Three sets of results were generated, one with each of the three packages listed above.
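The code used to produce these results is not reproduced in the text, so the following is only a sketch of how such a group-wise aggregation might look. It first uses base R's aggregate() as a baseline (a technique swapped in here for comparison, not one of the three packages named above), and then repeats the aggregation with data.table's .SD idiom, which runs only if that package is installed. The object names grp_mean, grp_sd and dt are illustrative.

```r
# Base R baseline: group-wise mean and SD of the four iris variables
grp_mean <- aggregate(. ~ Species, data = iris, FUN = mean)
grp_sd   <- aggregate(. ~ Species, data = iris, FUN = sd)
print(grp_mean)
print(grp_sd)

# The same aggregation with data.table, run only if the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(iris)
  # .SD holds the non-grouping columns of each Species group
  print(dt[, lapply(.SD, mean), by = Species])
  print(dt[, lapply(.SD, sd),   by = Species])
}
```

Both approaches give one row per species and one column per variable; data.table becomes the more attractive choice as the data grow larger.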

1. Mean and standard deviation for all four variables species-wise using the data.table package

[Figure: data-table-results]

2. Mean and standard deviation for all four variables species-wise using the tables package

[Figure: table-output]

3. Mean and standard deviation for all four variables species-wise using the psych package

[Figure: psych-output]