R - Grouping Data
Find the average salary in the whole company.
> mean(emp$SAL) [1] 2073.214
How many employees are in the database?
> length(emp$ID) [1] 14
Other salary statistics (minimal, maximal, sum, etc.)
> sd(emp$SAL) [1] 1182.503 > var(emp$SAL) [1] 1398314 > min(emp$SAL) [1] 800 > max(emp$SAL) [1] 5000 > sum(emp$SAL) [1] 29025
> summary(emp) ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO Min. :7369 Length:14 ANALYST :2 Min. :7566 Min. :1980-12-17 Min. : 800 Min. : 0 Min. :10.00 1st Qu.:7588 Class :character CLERK :4 1st Qu.:7698 1st Qu.:1981-04-09 1st Qu.:1250 1st Qu.: 225 1st Qu.:20.00 Median :7785 Mode :character MANAGER :3 Median :7698 Median :1981-09-18 Median :1550 Median : 400 Median :20.00 Mean :7727 PRESIDENT:1 Mean :7739 Mean :1981-09-29 Mean :2073 Mean : 550 Mean :22.14 3rd Qu.:7868 SALESMAN :4 3rd Qu.:7839 3rd Qu.:1981-12-03 3rd Qu.:2944 3rd Qu.: 725 3rd Qu.:30.00 Max. :7934 Max. :7902 Max. :1983-01-12 Max. :5000 Max. :1400 Max. :30.00 NA's :1 NA's :10
The median for department number is pretty much useless. It is better to treat DEPTNO as a factor
> emp$DEPTNO <- as.factor(DEPTNO)
> summary(emp) ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO Min. :7369 Length:14 ANALYST :2 Min. :7566 Min. :1980-12-17 Min. : 800 Min. : 0 10:3 1st Qu.:7588 Class :character CLERK :4 1st Qu.:7698 1st Qu.:1981-04-09 1st Qu.:1250 1st Qu.: 225 20:5 Median :7785 Mode :character MANAGER :3 Median :7698 Median :1981-09-18 Median :1550 Median : 400 30:6 Mean :7727 PRESIDENT:1 Mean :7739 Mean :1981-09-29 Mean :2073 Mean : 550 3rd Qu.:7868 SALESMAN :4 3rd Qu.:7839 3rd Qu.:1981-12-03 3rd Qu.:2944 3rd Qu.: 725 Max. :7934 Max. :7902 Max. :1983-01-12 Max. :5000 Max. :1400 NA's :1 NA's :10
Grouping
Groups contain all rows which have the same values in a column or columns. The grouping columns are called factors in R.
Find averages in specific jobs
> tapply(SAL,JOB,mean) ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
You can group by more than one factor
> tapply(SAL,list(JOB,DEPTNO),mean) 10 20 30 ANALYST NA 3000 NA CLERK 1300 950 950 MANAGER 2450 2975 2850 PRESIDENT 5000 NA NA SALESMAN NA NA 1400
Example with three factors
> tapply(SAL,list(JOB,DEPTNO,format(emp$HIREDATE,"%Y")),mean) , , 1980 10 20 30 ANALYST NA NA NA CLERK NA 800 NA MANAGER NA NA NA PRESIDENT NA NA NA SALESMAN NA NA NA , , 1981 10 20 30 ANALYST NA 3000 NA CLERK NA NA 950 MANAGER 2450 2975 2850 PRESIDENT 5000 NA NA SALESMAN NA NA 1400 , , 1982 10 20 30 ANALYST NA 3000 NA CLERK 1300 NA NA MANAGER NA NA NA PRESIDENT NA NA NA SALESMAN NA NA NA , , 1983 10 20 30 ANALYST NA NA NA CLERK NA 1100 NA MANAGER NA NA NA PRESIDENT NA NA NA SALESMAN NA NA NA
Grouping with data.table package
install.packages("data.table")
library("data.table")
# Convert
emp.dt = data.table(emp)
# Group
emp.dt[,sum(SAL),by=JOB]
JOB V1 [1,] PRESIDENT 5000 [2,] MANAGER 8275 [3,] SALESMAN 5600 [4,] CLERK 4150 [5,] ANALYST 6000
Exercises
23. Exercise
Find the minimal, maximal and average salaries in the whole company.
Minimal Maximal Mean [1,] 800 5000 2073.214
24. Exercise
Find the difference between maximal and minimal salary.
[1] 4200
25. Exercise
Find the average salary for every post.
ANALYST CLERK MANAGER PRESIDENT SALESMAN 3000.000 1037.500 2758.333 5000.000 1400.000
26. Exercise
How many managers work for the company?
MANAGER 3
27. Exercise
Find the average annual salaries in departments.
10 20 30 35000 26100 18800
28. Exercise
Find departments with more than 3 workers.
20 30 5 6
29. *Exercise
Check that every employees id is unique.
30. *Exercise
For each of the managers find a minimum salary of their subordinates.
+------+----------+ | mgr | min(sal) | +------+----------+ | NULL | 5000.00 | | 7566 | 3000.00 | | 7698 | 950.00 | | 7782 | 1300.00 | | 7788 | 1100.00 | | 7839 | 2450.00 | | 7902 | 800.00 | +------+----------+