R - Merging Data Frames
We want to select the name of an employee (ENAME column in emp data frame) and the department name (DNAME column in dept data frame) in which the employee works.
First, merge two tables
merge(emp,dept,by="DEPTNO")
DEPTNO ID ENAME JOB MGR HIREDATE SAL COMM DNAME LOC 1 10 7839 BUSH PRESIDENT NA 17/11/1981 5000 NA ACCOUNTING NEW YORK 2 10 7934 ELISON CLERK 7782 23/01/1982 1300 NA ACCOUNTING NEW YORK 3 10 7782 MERKEL MANAGER 7839 09/06/1981 2450 NA ACCOUNTING NEW YORK 4 20 7566 PUTIN MANAGER 7839 02/04/1981 2975 NA RESEARCH LONDON 5 20 7369 THATCHER CLERK 7902 17/12/1980 800 NA RESEARCH LONDON 6 20 7788 CARNEGIE ANALYST 7566 09/12/1982 3000 NA RESEARCH LONDON 7 20 7876 FORD CLERK 7788 12/01/1983 1100 NA RESEARCH LONDON 8 20 7902 TOOSK ANALYST 7566 03/12/1981 3000 NA RESEARCH LONDON 9 30 7900 BUFFETT CLERK 7698 03/12/1981 950 NA SALES PARIS 10 30 7654 CHIRACK SALESMAN 7698 28/09/1981 1250 1400 SALES PARIS 11 30 7698 BLAIR MANAGER 7839 01/05/1981 2850 NA SALES PARIS 12 30 7844 GATES SALESMAN 7698 08/09/1981 1500 0 SALES PARIS 13 30 7521 WALTON SALESMAN 7698 22/02/1981 1250 500 SALES PARIS 14 30 7499 BAROSSO SALESMAN 7698 20/02/1981 1600 300 SALES PARIS
Second, select ENAME and DNAME columns
merge(emp,dept,by="DEPTNO")[,c("ENAME","DNAME")]
ENAME DNAME 1 BUSH ACCOUNTING 2 ELISON ACCOUNTING 3 MERKEL ACCOUNTING 4 PUTIN RESEARCH 5 THATCHER RESEARCH 6 CARNEGIE RESEARCH 7 FORD RESEARCH 8 TOOSK RESEARCH 9 BUFFETT SALES 10 CHIRACK SALES 11 BLAIR SALES 12 GATES SALES 13 WALTON SALES 14 BAROSSO SALES
You can show all rows, even if there are no corresponding rows in another data frame.
merge(emp,dept,by="DEPTNO",all=T)[,c("ENAME","DNAME")]
ENAME DNAME 1 BUSH ACCOUNTING 2 ELISON ACCOUNTING 3 MERKEL ACCOUNTING 4 PUTIN RESEARCH 5 THATCHER RESEARCH 6 CARNEGIE RESEARCH 7 FORD RESEARCH 8 TOOSK RESEARCH 9 BUFFETT SALES 10 CHIRACK SALES 11 BLAIR SALES 12 GATES SALES 13 WALTON SALES 14 BAROSSO SALES 15 <NA> OPERATIONS
Non-equi join
If merge cannot find a common column or if the by parameter is not specified, merge function will return Cartesian Product of two data frames.
This allows us to filter out rows using any condition we want.
If we want to find the name of an employee and their salary grade, we have to get data from two data frames: EMP and SALGRADE.
empsalgrade <- merge(emp,salgrade)
with(empsalgrade,empsalgrade[SAL >= LOSAL & SAL <= HISAL,])
ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO GRADE LOSAL HISAL 8 7900 BUFFETT CLERK 7698 03/12/1981 950 NA 30 1 700 1200 11 7369 THATCHER CLERK 7902 17/12/1980 800 NA 20 1 700 1200 13 7876 FORD CLERK 7788 12/01/1983 1100 NA 20 1 700 1200 19 7654 CHIRACK SALESMAN 7698 28/09/1981 1250 1400 30 2 1201 1400 23 7521 WALTON SALESMAN 7698 22/02/1981 1250 500 30 2 1201 1400 28 7934 ELISON CLERK 7782 23/01/1982 1300 NA 10 2 1201 1400 34 7499 BAROSSO SALESMAN 7698 20/02/1981 1600 300 30 3 1401 2000 35 7844 GATES SALESMAN 7698 08/09/1981 1500 0 30 3 1401 2000 44 7698 BLAIR MANAGER 7839 01/05/1981 2850 NA 30 4 2001 3000 45 7782 MERKEL MANAGER 7839 09/06/1981 2450 NA 10 4 2001 3000 46 7566 PUTIN MANAGER 7839 02/04/1981 2975 NA 20 4 2001 3000 52 7902 TOOSK ANALYST 7566 03/12/1981 3000 NA 20 4 2001 3000 54 7788 CARNEGIE ANALYST 7566 09/12/1982 3000 NA 20 4 2001 3000 57 7839 BUSH PRESIDENT NA 17/11/1981 5000 NA 10 5 3001 9999
Save the joined tables and show only appropriate columns
> empsalgrade <- with(empsalgrade,empsalgrade[SAL > LOSAL & SAL < HISAL,]) > empsalgrade[c("ENAME","SAL","GRADE")] ENAME SAL GRADE 8 BUFFETT 950 1 11 THATCHER 800 1 13 FORD 1100 1 19 CHIRACK 1250 2 23 WALTON 1250 2 28 ELISON 1300 2 34 BAROSSO 1600 3 35 GATES 1500 3 44 BLAIR 2850 4 45 MERKEL 2450 4 46 PUTIN 2975 4 57 BUSH 5000 5
Self-join
We want to know the person's boss name
m1 <- merge(emp,emp,by.x="MGR",by.y="ID")[,c("ENAME.x","ENAME.y")]
library(plyr)
rename(m1,c("ENAME.x" = "Employee","ENAME.y" = "Boss"))
ENAME.x ENAME.y 1 TOOSK PUTIN 2 CARNEGIE PUTIN 3 CHIRACK BLAIR 4 BAROSSO BLAIR 5 GATES BLAIR 6 BUFFETT BLAIR 7 WALTON BLAIR 8 ELISON MERKEL 9 FORD CARNEGIE 10 BLAIR BUSH 11 MERKEL BUSH 12 PUTIN BUSH 13 THATCHER TOOSK
Generating Cartesian Product
If merge criteria are more complex, you can always generate Cartesian product and use filtering criteria to get results you are interested in.
> merge(emp,emp,,all=T,by=NULL) ID.x ENAME.x JOB.x MGR.x HIREDATE.x SAL.x COMM.x DEPTNO.x ID.y ENAME.y JOB.y MGR.y HIREDATE.y SAL.y COMM.y DEPTNO.y 1 7839 BUSH PRESIDENT NA 1981-11-17 5000 NA 10 7839 BUSH PRESIDENT NA 1981-11-17 5000 NA 10 2 7698 BLAIR MANAGER 7839 1981-05-01 2850 NA 30 7839 BUSH PRESIDENT NA 1981-11-17 5000 NA 10 ... 195 7876 FORD CLERK 7788 1983-01-12 1100 NA 20 7934 ELISON CLERK 7782 1982-01-23 1300 NA 10 196 7934 ELISON CLERK 7782 1982-01-23 1300 NA 10 7934 ELISON CLERK 7782 1982-01-23 1300 NA 10
Exercises
31. Exercise
Select the name of the employee and the city (LOC column in DEPT table) in which they work.
ENAME LOC 1 BUSH NEW YORK 2 ELISON NEW YORK 3 MERKEL NEW YORK 4 PUTIN LONDON 5 THATCHER LONDON 6 CARNEGIE LONDON 7 FORD LONDON 8 TOOSK LONDON 9 BUFFETT PARIS 10 CHIRACK PARIS 11 BLAIR PARIS 12 GATES PARIS 13 WALTON PARIS 14 BAROSSO PARIS
32. Exercise
Select the names of the employees, and the name and number of their department.
ENAME DNAME DEPTNO 1 BUSH ACCOUNTING 10 2 ELISON ACCOUNTING 10 3 MERKEL ACCOUNTING 10 4 PUTIN RESEARCH 20 5 THATCHER RESEARCH 20 6 CARNEGIE RESEARCH 20 7 FORD RESEARCH 20 8 TOOSK RESEARCH 20 9 BUFFETT SALES 30 10 CHIRACK SALES 30 11 BLAIR SALES 30 12 GATES SALES 30 13 WALTON SALES 30 14 BAROSSO SALES 30
33. Exercise
Select the names of the employees, their salary and salary grade, but only those whose salary is more than 2000.
ENAME SAL GRADE 44 BLAIR 2850 4 45 MERKEL 2450 4 46 PUTIN 2975 4 52 TOOSK 3000 4 54 CARNEGIE 3000 4 57 BUSH 5000 5
34. Exercise
Select employees working in London.
ENAME LOC 4 PUTIN LONDON 5 THATCHER LONDON 6 CARNEGIE LONDON 7 FORD LONDON 8 TOOSK LONDON
35. Exercise
Select employees and their salary grades except those from London.
ENAME LOC GRADE 1 MERKEL NEW YORK 4 2 BUSH NEW YORK 5 3 ELISON NEW YORK 2 9 BAROSSO PARIS 3 10 WALTON PARIS 2 11 CHIRACK PARIS 2 12 BLAIR PARIS 4 13 GATES PARIS 3 14 BUFFETT PARIS 1
Exercises
36. Exercise
Find departments without employees.
DEPTNO DNAME 15 40 OPERATIONS
37. *Exercise
Select an employee’s name and his/her boss’s name.
Employee Boss 1 TOOSK PUTIN 2 CARNEGIE PUTIN 3 CHIRACK BLAIR 4 BAROSSO BLAIR 5 GATES BLAIR 6 BUFFETT BLAIR 7 WALTON BLAIR 8 ELISON MERKEL 9 FORD CARNEGIE 10 BLAIR BUSH 11 MERKEL BUSH 12 PUTIN BUSH 13 THATCHER TOOSK
38. Exercise
Select an employee’s name, his/her boss’s name and show the employees who have no superior.
> be Employee Boss 1 TOOSK PUTIN 2 CARNEGIE PUTIN 3 CHIRACK BLAIR 4 BAROSSO BLAIR 5 GATES BLAIR 6 BUFFETT BLAIR 7 WALTON BLAIR 8 ELISON MERKEL 9 FORD CARNEGIE 10 BLAIR BUSH 11 MERKEL BUSH 12 PUTIN BUSH 13 THATCHER TOOSK 14 BUSH <NA>
39. Exercise
Select an employee’s name, his/her boss’s name and show all employees even though they have no subordinates.
+----------+----------+ | Employee | Boss | +----------+----------+ | BLAIR | BUSH | | MERKEL | BUSH | | PUTIN | BUSH | | CHIRACK | BLAIR | | BAROSSO | BLAIR | | GATES | BLAIR | | BUFFETT | BLAIR | | WALTON | BLAIR | | ELISON | MERKEL | | TOOSK | PUTIN | | CARNEGIE | PUTIN | | | CHIRACK | | | BAROSSO | | | GATES | | | BUFFETT | | | WALTON | | THATCHER | TOOSK | | | THATCHER | | FORD | CARNEGIE | | | FORD | | | ELISON | +----------+----------+
40. Exercise*
Select an employee’s name, his/her boss’s name and show all employees even though they have no subordinates and also those who have no superior.
Employee Boss 1 <NA> THATCHER 2 <NA> BAROSSO 3 <NA> WALTON 4 TOOSK PUTIN 5 CARNEGIE PUTIN 6 <NA> CHIRACK 7 CHIRACK BLAIR 8 BAROSSO BLAIR 9 GATES BLAIR 10 BUFFETT BLAIR 11 WALTON BLAIR 12 ELISON MERKEL 13 FORD CARNEGIE 14 BLAIR BUSH 15 MERKEL BUSH 16 PUTIN BUSH 17 <NA> GATES 18 <NA> FORD 19 <NA> BUFFETT 20 THATCHER TOOSK 21 <NA> ELISON 22 BUSH <NA>
41. Exercise
Select employees hired earlier than their bosses.
Employee Employee hiredate Boss Boss hiredate 4 BAROSSO 1981-02-20 BLAIR 1981-05-01 7 WALTON 1981-02-22 BLAIR 1981-05-01 10 BLAIR 1981-05-01 BUSH 1981-11-17 11 MERKEL 1981-06-09 BUSH 1981-11-17 12 PUTIN 1981-04-02 BUSH 1981-11-17 13 THATCHER 1980-12-17 TOOSK 1981-12-03
42. Exercise*
Show employee, their boss and boss's boss
Employee Boss BossBoss 1 FORD CARNEGIE PUTIN 2 THATCHER TOOSK PUTIN 3 TOOSK PUTIN BUSH 4 GATES BLAIR BUSH 5 CHIRACK BLAIR BUSH 6 BAROSSO BLAIR BUSH 7 CARNEGIE PUTIN BUSH 8 WALTON BLAIR BUSH 9 ELISON MERKEL BUSH 10 BUFFETT BLAIR BUSH 11 BLAIR BUSH <NA> 12 MERKEL BUSH <NA> 13 PUTIN BUSH <NA> 14 BUSH <NA> <NA>
Set operations
UNION
In order to combine two data frames we can use rbind function
We want to see people earning less than 2000 and more than 3000
> rbind(emp[SAL<2000,],emp[SAL>3000,]) ID ENAME JOB MGR HIREDATE SAL COMM DEPTNO 5 7654 CHIRACK SALESMAN 7698 1981-09-28 1250 1400 30 6 7499 BAROSSO SALESMAN 7698 1981-02-20 1600 300 30 7 7844 GATES SALESMAN 7698 1981-09-08 1500 0 30 8 7900 BUFFETT CLERK 7698 1981-12-03 950 NA 30 9 7521 WALTON SALESMAN 7698 1981-02-22 1250 500 30 11 7369 THATCHER CLERK 7902 1980-12-17 800 NA 20 13 7876 FORD CLERK 7788 1983-01-12 1100 NA 20 14 7934 ELISON CLERK 7782 1982-01-23 1300 NA 10 91 7839 BUSH PRESIDENT NA 1981-11-17 5000 NA 10
If we interested only in names
> union(emp[SAL<2000,"ENAME"],emp[SAL>3000,"ENAME"]) [1] "CHIRACK" "BAROSSO" "GATES" "BUFFETT" "WALTON" "THATCHER" "FORD" "ELISON" "BUSH"
INTERSECT Function
INTERSECT operator (set intersection) allows you to select rows common for two data frames
Find jobs existing in department 30 and 20
> intersect(emp[DEPTNO==20,"JOB"] ,emp[DEPTNO==30,"JOB"]) [1] "MANAGER" "CLERK"
SETDIFF Operator
This operator allows to to show rows present in the first query but not present in the second query.
Find jobs in department 30 which do not exist in department 20
> setdiff(emp[DEPTNO==30,"JOB"] ,emp[DEPTNO==20,"JOB"]) [1] "SALESMAN"
Exercises
43. *Exercise
Find post filled in 1982 and 1983.
[1] "CLERK"
Answer >>
intersect(emp[format(emp$HIREDATE,"%Y") == 1982,"JOB"] ,emp[format(emp$HIREDATE,"%Y") == 1981,"JOB"] )
44. ***Exercise
Find jobs which are in department 10 but not in department 20.
[1] "PRESIDENT"
Answer >>