R - Grouping Data: Difference between revisions
No edit summary  | 
			
(No difference) 
 | 
Latest revision as of 05:24, 14 February 2017
Find the average salary in the whole company.
> mean(emp$SAL) [1] 2073.214
How many employees are in the database?
> length(emp$ID) [1] 14
Other salary statistics (minimal, maximal, sum, etc.)
> sd(emp$SAL) [1] 1182.503 > var(emp$SAL) [1] 1398314 > min(emp$SAL) [1] 800 > max(emp$SAL) [1] 5000 > sum(emp$SAL) [1] 29025
> summary(emp)
      ID          ENAME                  JOB         MGR          HIREDATE               SAL            COMM          DEPTNO     
Min.   :7369   Length:14          ANALYST  :2   Min.   :7566   Min.   :1980-12-17   Min.   : 800   Min.   :   0   Min.   :10.00  
1st Qu.:7588   Class :character   CLERK    :4   1st Qu.:7698   1st Qu.:1981-04-09   1st Qu.:1250   1st Qu.: 225   1st Qu.:20.00  
Median :7785   Mode  :character   MANAGER  :3   Median :7698   Median :1981-09-18   Median :1550   Median : 400   Median :20.00  
Mean   :7727                      PRESIDENT:1   Mean   :7739   Mean   :1981-09-29   Mean   :2073   Mean   : 550   Mean   :22.14  
3rd Qu.:7868                      SALESMAN :4   3rd Qu.:7839   3rd Qu.:1981-12-03   3rd Qu.:2944   3rd Qu.: 725   3rd Qu.:30.00  
Max.   :7934                                    Max.   :7902   Max.   :1983-01-12   Max.   :5000   Max.   :1400   Max.   :30.00  
                                                NA's   :1                                          NA's   :10
The median for department number is pretty much useless. It is better to treat DEPTNO as a factor
> emp$DEPTNO <- as.factor(DEPTNO)
> summary(emp)
      ID          ENAME                  JOB         MGR          HIREDATE               SAL            COMM      DEPTNO
Min.   :7369   Length:14          ANALYST  :2   Min.   :7566   Min.   :1980-12-17   Min.   : 800   Min.   :   0   10:3  
1st Qu.:7588   Class :character   CLERK    :4   1st Qu.:7698   1st Qu.:1981-04-09   1st Qu.:1250   1st Qu.: 225   20:5  
Median :7785   Mode  :character   MANAGER  :3   Median :7698   Median :1981-09-18   Median :1550   Median : 400   30:6  
Mean   :7727                      PRESIDENT:1   Mean   :7739   Mean   :1981-09-29   Mean   :2073   Mean   : 550         
3rd Qu.:7868                      SALESMAN :4   3rd Qu.:7839   3rd Qu.:1981-12-03   3rd Qu.:2944   3rd Qu.: 725         
Max.   :7934                                    Max.   :7902   Max.   :1983-01-12   Max.   :5000   Max.   :1400         
                                                NA's   :1                                          NA's   :10
Grouping
Groups contain all rows which have the same values in a column or columns. The grouping columns are called factors in R.
Find averages in specific jobs
> tapply(SAL,JOB,mean) ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
You can group by more than one factor
> tapply(SAL,list(JOB,DEPTNO),mean)
            10   20   30
ANALYST     NA 3000   NA
CLERK     1300  950  950
MANAGER   2450 2975 2850
PRESIDENT 5000   NA   NA
SALESMAN    NA   NA 1400
Example with three factors
> tapply(SAL,list(JOB,DEPTNO,format(emp$HIREDATE,"%Y")),mean)
, , 1980
          10  20 30
ANALYST   NA  NA NA
CLERK     NA 800 NA
MANAGER   NA  NA NA
PRESIDENT NA  NA NA
SALESMAN  NA  NA NA 
, , 1981
            10   20   30
ANALYST     NA 3000   NA
CLERK       NA   NA  950
MANAGER   2450 2975 2850
PRESIDENT 5000   NA   NA
SALESMAN    NA   NA 1400 
, , 1982
            10   20 30
ANALYST     NA 3000 NA
CLERK     1300   NA NA
MANAGER     NA   NA NA
PRESIDENT   NA   NA NA
SALESMAN    NA   NA NA 
, , 1983
          10   20 30
ANALYST   NA   NA NA
CLERK     NA 1100 NA
MANAGER   NA   NA NA
PRESIDENT NA   NA NA
SALESMAN  NA   NA NA
Grouping with data.table package
 install.packages("data.table")
 library("data.table")
 # Convert
 emp.dt = data.table(emp)
 # Group
 emp.dt[,sum(SAL),by=JOB]
JOB V1 [1,] PRESIDENT 5000 [2,] MANAGER 8275 [3,] SALESMAN 5600 [4,] CLERK 4150 [5,] ANALYST 6000
Exercises
23. Exercise
Find the minimal, maximal and average salaries in the whole company.
Minimal Maximal Mean [1,] 800 5000 2073.214
24. Exercise
Find the difference between maximal and minimal salary.
[1] 4200
25.  Exercise
Find the average salary for every post.
ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
26.  Exercise
How many managers work for the company?
MANAGER 
      3 
27.  Exercise
Find the average annual salaries in departments.
10 20 30 35000 26100 18800
28.  Exercise
Find departments with more than 3 workers.
20 30 5 6
29. *Exercise
Check that every employees id is unique.
30. *Exercise
For each of the managers find a minimum salary of their subordinates.
+------+----------+ | mgr | min(sal) | +------+----------+ | NULL | 5000.00 | | 7566 | 3000.00 | | 7698 | 950.00 | | 7782 | 1300.00 | | 7788 | 1100.00 | | 7839 | 2450.00 | | 7902 | 800.00 | +------+----------+