R - Grouping Data

From Training Material
Revision as of 05:24, 14 February 2017 by Daniel Rodriguez (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search


Find the average salary in the whole company.

> mean(emp$SAL)
[1] 2073.214

How many employees are in the database?

> length(emp$ID)
[1] 14

Other salary statistics (minimal, maximal, sum, etc.)

> sd(emp$SAL)
[1] 1182.503
> var(emp$SAL)
[1] 1398314
> min(emp$SAL)
[1] 800
> max(emp$SAL)
[1] 5000
> sum(emp$SAL)
[1] 29025
> summary(emp)
      ID          ENAME                  JOB         MGR          HIREDATE               SAL            COMM          DEPTNO     
Min.   :7369   Length:14          ANALYST  :2   Min.   :7566   Min.   :1980-12-17   Min.   : 800   Min.   :   0   Min.   :10.00  
1st Qu.:7588   Class :character   CLERK    :4   1st Qu.:7698   1st Qu.:1981-04-09   1st Qu.:1250   1st Qu.: 225   1st Qu.:20.00  
Median :7785   Mode  :character   MANAGER  :3   Median :7698   Median :1981-09-18   Median :1550   Median : 400   Median :20.00  
Mean   :7727                      PRESIDENT:1   Mean   :7739   Mean   :1981-09-29   Mean   :2073   Mean   : 550   Mean   :22.14  
3rd Qu.:7868                      SALESMAN :4   3rd Qu.:7839   3rd Qu.:1981-12-03   3rd Qu.:2944   3rd Qu.: 725   3rd Qu.:30.00  
Max.   :7934                                    Max.   :7902   Max.   :1983-01-12   Max.   :5000   Max.   :1400   Max.   :30.00  
                                                NA's   :1                                          NA's   :10

The median for department number is pretty much useless. It is better to treat DEPTNO as a factor


> emp$DEPTNO <- as.factor(DEPTNO)
> summary(emp)
      ID          ENAME                  JOB         MGR          HIREDATE               SAL            COMM      DEPTNO
Min.   :7369   Length:14          ANALYST  :2   Min.   :7566   Min.   :1980-12-17   Min.   : 800   Min.   :   0   10:3  
1st Qu.:7588   Class :character   CLERK    :4   1st Qu.:7698   1st Qu.:1981-04-09   1st Qu.:1250   1st Qu.: 225   20:5  
Median :7785   Mode  :character   MANAGER  :3   Median :7698   Median :1981-09-18   Median :1550   Median : 400   30:6  
Mean   :7727                      PRESIDENT:1   Mean   :7739   Mean   :1981-09-29   Mean   :2073   Mean   : 550         
3rd Qu.:7868                      SALESMAN :4   3rd Qu.:7839   3rd Qu.:1981-12-03   3rd Qu.:2944   3rd Qu.: 725         
Max.   :7934                                    Max.   :7902   Max.   :1983-01-12   Max.   :5000   Max.   :1400         
                                                NA's   :1                                          NA's   :10

Grouping

Groups contain all rows which have the same values in a column or columns. The grouping columns are called factors in R.

Find averages in specific jobs

> tapply(SAL,JOB,mean)
 ANALYST     CLERK   MANAGER PRESIDENT  SALESMAN 
3000.000  1037.500  2758.333  5000.000  1400.000

You can group by more than one factor

> tapply(SAL,list(JOB,DEPTNO),mean)
            10   20   30
ANALYST     NA 3000   NA
CLERK     1300  950  950
MANAGER   2450 2975 2850
PRESIDENT 5000   NA   NA
SALESMAN    NA   NA 1400

Example with three factors

> tapply(SAL,list(JOB,DEPTNO,format(emp$HIREDATE,"%Y")),mean)
, , 1980

          10  20 30
ANALYST   NA  NA NA
CLERK     NA 800 NA
MANAGER   NA  NA NA
PRESIDENT NA  NA NA
SALESMAN  NA  NA NA 

, , 1981

            10   20   30
ANALYST     NA 3000   NA
CLERK       NA   NA  950
MANAGER   2450 2975 2850
PRESIDENT 5000   NA   NA
SALESMAN    NA   NA 1400 

, , 1982

            10   20 30
ANALYST     NA 3000 NA
CLERK     1300   NA NA
MANAGER     NA   NA NA
PRESIDENT   NA   NA NA
SALESMAN    NA   NA NA 

, , 1983

          10   20 30
ANALYST   NA   NA NA
CLERK     NA 1100 NA
MANAGER   NA   NA NA
PRESIDENT NA   NA NA
SALESMAN  NA   NA NA

Grouping with data.table package

 install.packages("data.table")
 library("data.table")
 # Convert
 emp.dt = data.table(emp)
 # Group
 emp.dt[,sum(SAL),by=JOB]
           JOB   V1
[1,] PRESIDENT 5000
[2,]   MANAGER 8275
[3,]  SALESMAN 5600
[4,]     CLERK 4150
[5,]   ANALYST 6000

Exercises

23. Exercise

Find the minimal, maximal and average salaries in the whole company.

     Minimal Maximal     Mean
[1,]     800    5000 2073.214

24. Exercise

Find the difference between maximal and minimal salary.

[1] 4200


25. Exercise

Find the average salary for every post.

 ANALYST     CLERK   MANAGER PRESIDENT  SALESMAN 
3000.000  1037.500  2758.333  5000.000  1400.000 


26. Exercise

How many managers work for the company?

MANAGER 
      3 


27. Exercise

Find the average annual salaries in departments.

   10    20    30 
35000 26100 18800 


28. Exercise

Find departments with more than 3 workers.

20 30 
 5  6 

29. *Exercise

Check that every employees id is unique.

30. *Exercise

For each of the managers find a minimum salary of their subordinates.

+------+----------+
| mgr  | min(sal) |
+------+----------+
| NULL | 5000.00  |
| 7566 | 3000.00  |
| 7698 | 950.00   |
| 7782 | 1300.00  |
| 7788 | 1100.00  |
| 7839 | 2450.00  |
| 7902 | 800.00   |
+------+----------+