R - Merging Data Frames

From Training Material
Jump to navigation Jump to search


We want to select the name of an employee (ENAME column in emp data frame) and the department name (DNAME column in dept data frame) in which the employee works.

First, merge two tables

 merge(emp,dept,by="DEPTNO")
   DEPTNO   ID    ENAME       JOB  MGR   HIREDATE  SAL COMM      DNAME      LOC
1      10 7839     BUSH PRESIDENT   NA 17/11/1981 5000   NA ACCOUNTING NEW YORK
2      10 7934   ELISON     CLERK 7782 23/01/1982 1300   NA ACCOUNTING NEW YORK
3      10 7782   MERKEL   MANAGER 7839 09/06/1981 2450   NA ACCOUNTING NEW YORK
4      20 7566    PUTIN   MANAGER 7839 02/04/1981 2975   NA   RESEARCH   LONDON
5      20 7369 THATCHER     CLERK 7902 17/12/1980  800   NA   RESEARCH   LONDON
6      20 7788 CARNEGIE   ANALYST 7566 09/12/1982 3000   NA   RESEARCH   LONDON
7      20 7876     FORD     CLERK 7788 12/01/1983 1100   NA   RESEARCH   LONDON
8      20 7902    TOOSK   ANALYST 7566 03/12/1981 3000   NA   RESEARCH   LONDON
9      30 7900  BUFFETT     CLERK 7698 03/12/1981  950   NA      SALES    PARIS
10     30 7654  CHIRACK  SALESMAN 7698 28/09/1981 1250 1400      SALES    PARIS
11     30 7698    BLAIR   MANAGER 7839 01/05/1981 2850   NA      SALES    PARIS
12     30 7844    GATES  SALESMAN 7698 08/09/1981 1500    0      SALES    PARIS
13     30 7521   WALTON  SALESMAN 7698 22/02/1981 1250  500      SALES    PARIS
14     30 7499  BAROSSO  SALESMAN 7698 20/02/1981 1600  300      SALES    PARIS

Second, select ENAME and DNAME columns

 merge(emp,dept,by="DEPTNO")[,c("ENAME","DNAME")]
      ENAME      DNAME
1      BUSH ACCOUNTING
2    ELISON ACCOUNTING
3    MERKEL ACCOUNTING
4     PUTIN   RESEARCH
5  THATCHER   RESEARCH 
6  CARNEGIE   RESEARCH 
7      FORD   RESEARCH
8     TOOSK   RESEARCH
9   BUFFETT      SALES
10  CHIRACK      SALES
11    BLAIR      SALES
12    GATES      SALES
13   WALTON      SALES
14  BAROSSO      SALES


You can show all rows, even if there are no corresponding rows in another data frame.

 merge(emp,dept,by="DEPTNO",all=T)[,c("ENAME","DNAME")]
     ENAME      DNAME
1      BUSH ACCOUNTING
2    ELISON ACCOUNTING
3    MERKEL ACCOUNTING
4     PUTIN   RESEARCH
5  THATCHER   RESEARCH
6  CARNEGIE   RESEARCH
7      FORD   RESEARCH
8     TOOSK   RESEARCH
9   BUFFETT      SALES
10  CHIRACK      SALES
11    BLAIR      SALES
12    GATES      SALES
13   WALTON      SALES
14  BAROSSO      SALES
15     <NA> OPERATIONS


Non-equi join

If merge cannot find a common column or if the by parameter is not specified, merge function will return Cartesian Product of two data frames.

This allows us to filter out rows using any condition we want.

If we want to find the name of an employee and their salary grade, we have to get data from two data frames: EMP and SALGRADE.

  empsalgrade <- merge(emp,salgrade)
  with(empsalgrade,empsalgrade[SAL >= LOSAL & SAL <= HISAL,])
     ID    ENAME       JOB  MGR   HIREDATE  SAL COMM DEPTNO GRADE LOSAL HISAL
8  7900  BUFFETT     CLERK 7698 03/12/1981  950   NA     30     1   700  1200
11 7369 THATCHER     CLERK 7902 17/12/1980  800   NA     20     1   700  1200
13 7876     FORD     CLERK 7788 12/01/1983 1100   NA     20     1   700  1200
19 7654  CHIRACK  SALESMAN 7698 28/09/1981 1250 1400     30     2  1201  1400
23 7521   WALTON  SALESMAN 7698 22/02/1981 1250  500     30     2  1201  1400
28 7934   ELISON     CLERK 7782 23/01/1982 1300   NA     10     2  1201  1400
34 7499  BAROSSO  SALESMAN 7698 20/02/1981 1600  300     30     3  1401  2000
35 7844    GATES  SALESMAN 7698 08/09/1981 1500    0     30     3  1401  2000
44 7698    BLAIR   MANAGER 7839 01/05/1981 2850   NA     30     4  2001  3000
45 7782   MERKEL   MANAGER 7839 09/06/1981 2450   NA     10     4  2001  3000
46 7566    PUTIN   MANAGER 7839 02/04/1981 2975   NA     20     4  2001  3000
52 7902    TOOSK   ANALYST 7566 03/12/1981 3000   NA     20     4  2001  3000
54 7788 CARNEGIE   ANALYST 7566 09/12/1982 3000   NA     20     4  2001  3000
57 7839     BUSH PRESIDENT   NA 17/11/1981 5000   NA     10     5  3001  9999


Save the joined tables and show only appropriate columns

> empsalgrade <- with(empsalgrade,empsalgrade[SAL > LOSAL & SAL < HISAL,])
> empsalgrade[c("ENAME","SAL","GRADE")]
      ENAME  SAL GRADE
8   BUFFETT  950     1
11 THATCHER  800     1
13     FORD 1100     1
19  CHIRACK 1250     2
23   WALTON 1250     2
28   ELISON 1300     2
34  BAROSSO 1600     3
35    GATES 1500     3
44    BLAIR 2850     4
45   MERKEL 2450     4
46    PUTIN 2975     4
57     BUSH 5000     5

Self-join

We want to know the person's boss name

m1 <- merge(emp,emp,by.x="MGR",by.y="ID")[,c("ENAME.x","ENAME.y")]
library(plyr)
rename(m1,c("ENAME.x" = "Employee","ENAME.y" = "Boss"))
    ENAME.x  ENAME.y
1     TOOSK    PUTIN
2  CARNEGIE    PUTIN
3   CHIRACK    BLAIR
4   BAROSSO    BLAIR
5     GATES    BLAIR
6   BUFFETT    BLAIR
7    WALTON    BLAIR
8    ELISON   MERKEL
9      FORD CARNEGIE
10    BLAIR     BUSH
11   MERKEL     BUSH
12    PUTIN     BUSH
13 THATCHER    TOOSK

Generating Cartesian Product

If merge criteria are more complex, you can always generate Cartesian product and use filtering criteria to get results you are interested in.

> merge(emp,emp,,all=T,by=NULL)
    ID.x  ENAME.x     JOB.x MGR.x HIREDATE.x SAL.x COMM.x DEPTNO.x ID.y  ENAME.y     JOB.y MGR.y HIREDATE.y SAL.y COMM.y DEPTNO.y
1   7839     BUSH PRESIDENT    NA 1981-11-17  5000     NA       10 7839     BUSH PRESIDENT    NA 1981-11-17  5000     NA       10
2   7698    BLAIR   MANAGER  7839 1981-05-01  2850     NA       30 7839     BUSH PRESIDENT    NA 1981-11-17  5000     NA       10
...
195 7876     FORD     CLERK  7788 1983-01-12  1100     NA       20 7934   ELISON     CLERK  7782 1982-01-23  1300     NA       10
196 7934   ELISON     CLERK  7782 1982-01-23  1300     NA       10 7934   ELISON     CLERK  7782 1982-01-23  1300     NA       10

Exercises

31. Exercise

Select the name of the employee and the city (LOC column in DEPT table) in which they work.

       ENAME      LOC
1      BUSH NEW YORK
2    ELISON NEW YORK
3    MERKEL NEW YORK
4     PUTIN   LONDON
5  THATCHER   LONDON
6  CARNEGIE   LONDON
7      FORD   LONDON
8     TOOSK   LONDON
9   BUFFETT    PARIS
10  CHIRACK    PARIS
11    BLAIR    PARIS
12    GATES    PARIS
13   WALTON    PARIS
14  BAROSSO    PARIS


32. Exercise

Select the names of the employees, and the name and number of their department.

      ENAME      DNAME DEPTNO
1      BUSH ACCOUNTING     10
2    ELISON ACCOUNTING     10
3    MERKEL ACCOUNTING     10
4     PUTIN   RESEARCH     20
5  THATCHER   RESEARCH     20
6  CARNEGIE   RESEARCH     20
7      FORD   RESEARCH     20
8     TOOSK   RESEARCH     20
9   BUFFETT      SALES     30
10  CHIRACK      SALES     30
11    BLAIR      SALES     30
12    GATES      SALES     30
13   WALTON      SALES     30
14  BAROSSO      SALES     30


33. Exercise

Select the names of the employees, their salary and salary grade, but only those whose salary is more than 2000.

      ENAME  SAL GRADE
44    BLAIR 2850     4
45   MERKEL 2450     4
46    PUTIN 2975     4
52    TOOSK 3000     4
54 CARNEGIE 3000     4
57     BUSH 5000     5


34. Exercise

Select employees working in London.

     ENAME    LOC
4    PUTIN LONDON
5 THATCHER LONDON
6 CARNEGIE LONDON
7     FORD LONDON
8    TOOSK LONDON


35. Exercise

Select employees and their salary grades except those from London.

     ENAME      LOC GRADE
1   MERKEL NEW YORK     4
2     BUSH NEW YORK     5
3   ELISON NEW YORK     2
9  BAROSSO    PARIS     3
10  WALTON    PARIS     2
11 CHIRACK    PARIS     2
12   BLAIR    PARIS     4
13   GATES    PARIS     3
14 BUFFETT    PARIS     1


Exercises

36. Exercise

Find departments without employees.

   DEPTNO      DNAME
15     40 OPERATIONS


37. *Exercise

Select an employee’s name and his/her boss’s name.

   Employee     Boss
1     TOOSK    PUTIN
2  CARNEGIE    PUTIN
3   CHIRACK    BLAIR
4   BAROSSO    BLAIR
5     GATES    BLAIR
6   BUFFETT    BLAIR
7    WALTON    BLAIR
8    ELISON   MERKEL
9      FORD CARNEGIE
10    BLAIR     BUSH
11   MERKEL     BUSH
12    PUTIN     BUSH
13 THATCHER    TOOSK


38. Exercise

Select an employee’s name, his/her boss’s name and show the employees who have no superior.

> be
   Employee     Boss
1     TOOSK    PUTIN
2  CARNEGIE    PUTIN
3   CHIRACK    BLAIR
4   BAROSSO    BLAIR
5     GATES    BLAIR
6   BUFFETT    BLAIR
7    WALTON    BLAIR
8    ELISON   MERKEL
9      FORD CARNEGIE
10    BLAIR     BUSH
11   MERKEL     BUSH
12    PUTIN     BUSH
13 THATCHER    TOOSK
14     BUSH     <NA>

39. Exercise

Select an employee’s name, his/her boss’s name and show all employees even though they have no subordinates.

+----------+----------+
| Employee | Boss     |
+----------+----------+
| BLAIR    | BUSH     |
| MERKEL   | BUSH     |
| PUTIN    | BUSH     |
| CHIRACK  | BLAIR    |
| BAROSSO  | BLAIR    |
| GATES    | BLAIR    |
| BUFFETT  | BLAIR    |
| WALTON   | BLAIR    |
| ELISON   | MERKEL   |
| TOOSK    | PUTIN    |
| CARNEGIE | PUTIN    |
|          | CHIRACK  |
|          | BAROSSO  |
|          | GATES    |
|          | BUFFETT  |
|          | WALTON   |
| THATCHER | TOOSK    |
|          | THATCHER |
| FORD     | CARNEGIE |
|          | FORD     |
|          | ELISON   |
+----------+----------+

40. Exercise*

Select an employee’s name, his/her boss’s name and show all employees even though they have no subordinates and also those who have no superior.

   Employee     Boss
1      <NA> THATCHER
2      <NA>  BAROSSO
3      <NA>   WALTON
4     TOOSK    PUTIN
5  CARNEGIE    PUTIN
6      <NA>  CHIRACK
7   CHIRACK    BLAIR
8   BAROSSO    BLAIR
9     GATES    BLAIR
10  BUFFETT    BLAIR
11   WALTON    BLAIR
12   ELISON   MERKEL
13     FORD CARNEGIE
14    BLAIR     BUSH
15   MERKEL     BUSH
16    PUTIN     BUSH
17     <NA>    GATES
18     <NA>     FORD
19     <NA>  BUFFETT
20 THATCHER    TOOSK
21     <NA>   ELISON
22     BUSH     <NA>

41. Exercise

Select employees hired earlier than their bosses.

   Employee Employee hiredate  Boss Boss hiredate
4   BAROSSO        1981-02-20 BLAIR    1981-05-01
7    WALTON        1981-02-22 BLAIR    1981-05-01
10    BLAIR        1981-05-01  BUSH    1981-11-17
11   MERKEL        1981-06-09  BUSH    1981-11-17
12    PUTIN        1981-04-02  BUSH    1981-11-17
13 THATCHER        1980-12-17 TOOSK    1981-12-03


42. Exercise*

Show employee, their boss and boss's boss

   Employee     Boss BossBoss
1      FORD CARNEGIE    PUTIN
2  THATCHER    TOOSK    PUTIN
3     TOOSK    PUTIN     BUSH
4     GATES    BLAIR     BUSH
5   CHIRACK    BLAIR     BUSH
6   BAROSSO    BLAIR     BUSH
7  CARNEGIE    PUTIN     BUSH
8    WALTON    BLAIR     BUSH
9    ELISON   MERKEL     BUSH
10  BUFFETT    BLAIR     BUSH
11    BLAIR     BUSH     <NA>
12   MERKEL     BUSH     <NA>
13    PUTIN     BUSH     <NA>
14     BUSH     <NA>     <NA>


Set operations

UNION

In order to combine two data frames we can use rbind function

We want to see people earning less than 2000 and more than 3000

> rbind(emp[SAL<2000,],emp[SAL>3000,])
    ID    ENAME       JOB  MGR   HIREDATE  SAL COMM DEPTNO
5  7654  CHIRACK  SALESMAN 7698 1981-09-28 1250 1400     30
6  7499  BAROSSO  SALESMAN 7698 1981-02-20 1600  300     30
7  7844    GATES  SALESMAN 7698 1981-09-08 1500    0     30
8  7900  BUFFETT     CLERK 7698 1981-12-03  950   NA     30
9  7521   WALTON  SALESMAN 7698 1981-02-22 1250  500     30
11 7369 THATCHER     CLERK 7902 1980-12-17  800   NA     20
13 7876     FORD     CLERK 7788 1983-01-12 1100   NA     20
14 7934   ELISON     CLERK 7782 1982-01-23 1300   NA     10
91 7839     BUSH PRESIDENT   NA 1981-11-17 5000   NA     10

If we interested only in names

> union(emp[SAL<2000,"ENAME"],emp[SAL>3000,"ENAME"])
[1] "CHIRACK"  "BAROSSO"  "GATES"    "BUFFETT"  "WALTON"   "THATCHER" "FORD"     "ELISON"   "BUSH"    


INTERSECT Function

INTERSECT operator (set intersection) allows you to select rows common for two data frames

Find jobs existing in department 30 and 20

> intersect(emp[DEPTNO==20,"JOB"] ,emp[DEPTNO==30,"JOB"])
[1] "MANAGER" "CLERK"  

SETDIFF Operator

This operator allows to to show rows present in the first query but not present in the second query.

Find jobs in department 30 which do not exist in department 20

> setdiff(emp[DEPTNO==30,"JOB"] ,emp[DEPTNO==20,"JOB"])
[1] "SALESMAN"

Exercises

43. *Exercise

Find post filled in 1982 and 1983.

[1] "CLERK"

Answer >>

intersect(emp[format(emp$HIREDATE,"%Y") == 1982,"JOB"] ,emp[format(emp$HIREDATE,"%Y") == 1981,"JOB"] )


44. ***Exercise

Find jobs which are in department 10 but not in department 20.

[1] "PRESIDENT"

Answer >>

> setdiff(emp[DEPTNO==10,"JOB"] ,emp[DEPTNO==20,"JOB"])