R - Grouping Data
Find the average salary in the whole company.
> mean(emp$SAL) [1] 2073.214
How many employees are in the database?
> length(emp$ID) [1] 14
Other salary statistics (minimal, maximal, sum, etc.)
> sd(emp$SAL) [1] 1182.503 > var(emp$SAL) [1] 1398314 > min(emp$SAL) [1] 800 > max(emp$SAL) [1] 5000 > sum(emp$SAL) [1] 29025
> summary(emp)
ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO
Min. :7369 Length:14 ANALYST :2 Min. :7566 Min. :1980-12-17 Min. : 800 Min. : 0 Min. :10.00
1st Qu.:7588 Class :character CLERK :4 1st Qu.:7698 1st Qu.:1981-04-09 1st Qu.:1250 1st Qu.: 225 1st Qu.:20.00
Median :7785 Mode :character MANAGER :3 Median :7698 Median :1981-09-18 Median :1550 Median : 400 Median :20.00
Mean :7727 PRESIDENT:1 Mean :7739 Mean :1981-09-29 Mean :2073 Mean : 550 Mean :22.14
3rd Qu.:7868 SALESMAN :4 3rd Qu.:7839 3rd Qu.:1981-12-03 3rd Qu.:2944 3rd Qu.: 725 3rd Qu.:30.00
Max. :7934 Max. :7902 Max. :1983-01-12 Max. :5000 Max. :1400 Max. :30.00
NA's :1 NA's :10
The median for department number is pretty much useless. It is better to treat DEPTNO as a factor
> emp$DEPTNO <- as.factor(DEPTNO)
> summary(emp)
ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO
Min. :7369 Length:14 ANALYST :2 Min. :7566 Min. :1980-12-17 Min. : 800 Min. : 0 10:3
1st Qu.:7588 Class :character CLERK :4 1st Qu.:7698 1st Qu.:1981-04-09 1st Qu.:1250 1st Qu.: 225 20:5
Median :7785 Mode :character MANAGER :3 Median :7698 Median :1981-09-18 Median :1550 Median : 400 30:6
Mean :7727 PRESIDENT:1 Mean :7739 Mean :1981-09-29 Mean :2073 Mean : 550
3rd Qu.:7868 SALESMAN :4 3rd Qu.:7839 3rd Qu.:1981-12-03 3rd Qu.:2944 3rd Qu.: 725
Max. :7934 Max. :7902 Max. :1983-01-12 Max. :5000 Max. :1400
NA's :1 NA's :10
Grouping
Groups contain all rows which have the same values in a column or columns. The grouping columns are called factors in R.
Find averages in specific jobs
> tapply(SAL,JOB,mean) ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
You can group by more than one factor
> tapply(SAL,list(JOB,DEPTNO),mean)
10 20 30
ANALYST NA 3000 NA
CLERK 1300 950 950
MANAGER 2450 2975 2850
PRESIDENT 5000 NA NA
SALESMAN NA NA 1400
Example with three factors
> tapply(SAL,list(JOB,DEPTNO,format(emp$HIREDATE,"%Y")),mean)
, , 1980
10 20 30
ANALYST NA NA NA
CLERK NA 800 NA
MANAGER NA NA NA
PRESIDENT NA NA NA
SALESMAN NA NA NA
, , 1981
10 20 30
ANALYST NA 3000 NA
CLERK NA NA 950
MANAGER 2450 2975 2850
PRESIDENT 5000 NA NA
SALESMAN NA NA 1400
, , 1982
10 20 30
ANALYST NA 3000 NA
CLERK 1300 NA NA
MANAGER NA NA NA
PRESIDENT NA NA NA
SALESMAN NA NA NA
, , 1983
10 20 30
ANALYST NA NA NA
CLERK NA 1100 NA
MANAGER NA NA NA
PRESIDENT NA NA NA
SALESMAN NA NA NA
Grouping with data.table package
install.packages("data.table")
library("data.table")
# Convert
emp.dt = data.table(emp)
# Group
emp.dt[,sum(SAL),by=JOB]
JOB V1 [1,] PRESIDENT 5000 [2,] MANAGER 8275 [3,] SALESMAN 5600 [4,] CLERK 4150 [5,] ANALYST 6000
Exercises
23. Exercise
Find the minimal, maximal and average salaries in the whole company.
Minimal Maximal Mean [1,] 800 5000 2073.214
24. Exercise
Find the difference between maximal and minimal salary.
[1] 4200
25. Exercise
Find the average salary for every post.
ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
26. Exercise
How many managers work for the company?
MANAGER
3
27. Exercise
Find the average annual salaries in departments.
10 20 30 35000 26100 18800
28. Exercise
Find departments with more than 3 workers.
20 30 5 6
29. *Exercise
Check that every employees id is unique.
30. *Exercise
For each of the managers find a minimum salary of their subordinates.
+------+----------+ | mgr | min(sal) | +------+----------+ | NULL | 5000.00 | | 7566 | 3000.00 | | 7698 | 950.00 | | 7782 | 1300.00 | | 7788 | 1100.00 | | 7839 | 2450.00 | | 7902 | 800.00 | +------+----------+