Replication in MongoDB

Replication

what will happen when standalone server crashes?
replication is a way of keeping identical server copies
recommended for all production deployments
replication is a set of one primary server and many secondaries
replication in MongoDB is asynchronous
primary server handles client requests
when primary crashes secondaries will elect new primary
primary (master), secondary (slave)

Test Setup

this is the way to start 3 servers on ports 31000, 31001, 31002
- or on ports 20000, 20001, 20002 starting from MongoDB 3.2
databases are stored in default directory (/data/db)

$ mongo --nodb
> replicaSet = new ReplSetTest({"nodes" : 3})
> replicaSet.startSet()
> replicaSet.initiate()

$ mongo --nodb
> conn0 = new Mongo("localhost:31000")
> db0 = conn0.getDB("test")
> db0.isMaster()
> for (i=0; i<100; i++) { db0.repl.insert({"x" : i}) }
> db0.repl.count()
>
> conn1 = new Mongo("localhost:31001")
> db1 = conn1.getDB("test")
> db1.repl.count()
> db1.setSlaveOk()
> db1.repl.count()
>
> db0.adminCommand({"shutdown" : 1})
> db1.isMaster()

Production-Like Setup

this example show how to start replica with 3 members
it is on the same server but easily can be changed to multiple machines
only one server (localhost:27100) may contain data
replica set name can be any utf-8 string (np_rep in this example)
use --replSet np_rep when starting mongod
each member of the replica set must be able to connect with all members

$ mongod --port 27100 --replSet np_rep --dbpath /data/np_rep/rep0 --logpath /data/np_rep/rep0.log --fork --bind_ip 127.0.0.1 --rest --httpinterface --logappend
$ mongod --port 27101 --replSet np_rep --dbpath /data/np_rep/rep1 --logpath /data/np_rep/rep1.log --fork ...
$ mongod --port 27102 --replSet np_rep --dbpath /data/np_rep/rep2 --logpath /data/np_rep/rep2.log --fork ...
$
$ mongod --shutdown --dbpath /data/np_rep/rep0

Mongo Shell is the only way to configure replica set
prepare configuration document (np_rep_config)
- _id key is your replica set name
- members is a list of all servers in replica set
send configuration to server with data
this server will change configuration of other members
servers will elect primary and start handling clients requests
localhost as a server name can be used only for testing

> np_rep_config = { 
...   "_id" : "np_rep",
...   "members" : [
...     {"_id" : 0, "host" : "localhost:27100"},
...     {"_id" : 1, "host" : "localhost:27101"},
...     {"_id" : 2, "host" : "localhost:27102"}
...   ]
... }
>
> rs.initiate(np_rep_config)
> db = (new Mongo("localhost:27102")).getDB("test")

rs Helper

rs is a global variable containing replication helper functions
rs.help() - list of all functions
rs.initiate(np_rep_config) is a wrapper to adminCommand
- db.adminCommand({"replSetInitiate" : np_rep_config})

> rs.help()
> rs.add("localhost:27103")
> rs.remove("localhost:27103")
> rs.config()
> np_rep_config = rs.config()
> edit np_rep_config 
> rs.reconfig(np_rep_config)

All About Majorities

how to design replica set?
majority = more than half of all members in the set
majority is based on configuration not on the current status of replica
- i.e. 5 members replica, 3 members are down
- two remaining members can not reach majority
- if one of two was primary it would step down
- all two members will be secondaries
why two of five can not reach majority?
examples: 3+2 members, 2+2+1 members, only 2 members

How Election Works

election starts when secondary can not reach primary
it will contact all other members and request that it be primary
others do several checks:
- can they reach primary?
- is candidate up to date with replication?
- is there any member with higher priority?
election ends when candidate receives "yes" from a majority
members can send veto (10000 votes)
heartbeat is sent every 2 seconds with 10 seconds time-out to all members
- based on this, the primary knows if it can reach a majority
- when election results in a tie all members will wait 30 seconds

Member Configuration

arbiter
- normal mongod process started with --replSet option and empty data directory
- > rs.addArb("server:port")
- > rs.add({"_id" : 7, "host" : "server:port", "arbiterOnly" : true})
- use at most one arbiter, don't add arbiters "just in case"
- arbiter once added can't be changed to normal mongod, and vice versa
- use normal data member instead of an arbiter whenever possible
priority
- priority range is from 0 to 100 (default 1)
- it means how badly this member wants to become primary
- member with priority 0 can never become primary (passive members)
hidden
- hidden members don't handle client requests
- hidden members are not preferred as replication sources
- useful for backup or less powerful machines
- use hidden : true with priority : 0
- rs.isMaster(), rs.status(), rs.config()
slave delay
- delayed secondary will purposely lag by the specified number of second
- use slaveDelay : seconds with priority : 0
- ... and hidden : true if your clients reads from secondaries
building indexes
- useful for backup servers
- prevents from building any indexes
- use buildIndexes : false with priority : 0
- non-index-building member can't be changed to normal easily

Sync Process

oplog contains all operations that primary performs
capped collection in local database on each member
any member can be used as a source for replication
steps: read data from primary -> apply operation to data -> write to local oplog
re applying the same oplog operation is safe and handled properly (the same result)
oplog is fixed in size not in time
- usually one operation on data results in one operation in oplog
- exception: one operation that affects multiple documents is stored in oplog as many operations on single document
secondary may go stale when: secondary had downtime, has too many writes than it can handle, to busy handling reads

Initial Sync Process

initial checks (choose a source, drop existing databases)
cloning data from source (longest operation)
first oplog application
second oplog application
index building (long operation)
third oplog application
switching to normal syncing and normal operations

restoring from backup is often faster
cloning may ruin source's working set
initial sync may fail if oplog is too short

States of Replica Set Members

Primary
Secondary
Startup (when you start a member for the first time)
Startup2 (initial sync, short state on normal members: starts replication and election process)
Recovering (failure state, member is operating correctly but not available for reads, occurs in many situations)
Arbiter
Down (when member was up and then becomes unreachable)
Unknown (when member has never been able to reach another member)
Removed (after removing from replica set, when added back it will return into "normal" state)
Rollback (when rolling back data)
Fatal (find a reason in log, grep "replSet FATAL", restore from backup or resync)

Rollbacks

what is rollback and when it will be performed
synchronization must be done manually
- collectionName.bson file in a rollback directory in data directory
- mongorestore above file into temporary collection and perform manual merge
rollback will fail if there is more than 300MB of data or about 30 minutes of operations to rollback
how to prevent rollbacks
- do not change number of votes of replica set members
- keep secondaries up to date

Connection to Replica Set and Replication Guarantees

connection strings: mongodb://user_name:password@host1:27017,host2:27018,host3:27019/?replicaSet=replicaName&connectTimeoutMS=10000&authMechanism=SCRAM-SHA-1

> db.runCommand({getLastError: 1, w: "majority"})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority"}})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority", wtimeout: 5000}})
> db.testColl.insert({a: 1}, {writeConcern: {w: 3, wtimeout: 5000}})

> config = rs.conf()
> config.settings = {}
> config.settings.getLastErrorDefaults = {w: "majority", wtimeout: 5000}
> rs.reconfig(config)

> config = rs.config()
> config.members[0].tags = {"dc": "PL"}
> config.members[1].tags = {"dc": "UK"}
> config.settings = config.settings || {}
> config.settings.getLastErrorModes = [{"allDataCenters" : {"dc": 2}}]
> rs.reconfig(config)
> db.testColl.insert({a: 1}, {writeConcern: {w: "allDataCenters", wtimeout: 5000}})

Read Preference

read preference describes how MongoDB clients route read operations to the members of a replica set
read preference modes:
- primary - default mode
- primaryPreferred
- secondary
- secondaryPreferred
- nearest
tag sets

Administration of Replica Set

Run as Standalone

many operation can not be done on secondaries
1. stop the secondary
2. run mongod without --replSet, on different port, with the same --dbpath

$ mongod --port 27999 --dbpath /var/lib/mongodb

Large Replica Sets

replica sets are limited to 50 members (12 before version 3.0) and only 7 voting members
this is limited to reduce network traffic generated by heartbeat
use Master-Slave configuration when more than 50 secondaries is needed

> rs.add({"_id" : 51, "host" : "server-8:27017", "votes" : 0})

Forcing Reconfiguration

useful when you majority is lost permanantly
force allows you to send the configuration to secondary (not only to primary)
configuration needs to be prepared correctly
force will change version dramatically

> var config = rs.config()
> edit config
> rs.reconfig(config, {"force" : true})

Changing Member Status Manually

there is no way to force member to become primary
demote primary to a secondary (by default 60 seconds)

> rs.stepDown()
> rs.stepDown(600)

Preventing Elections

in can be also used on demoted primary to unfreeze it

> rs.freeze(600)
> rs.freeze(0)

Maintenance Mode

server will go into recovery state
useful if server is performing long operations or is far behind in replication

> db.adminCommand({"replSetMaintenanceMode" : true})
> db.adminCommand({"replSetMaintenanceMode" : false})

Monitoring Replication

status of the replica set (from the current server perspective)
db.adminCommand("replSetGetStatus") or rs.status()
important fields: self, stateStr, uptime [s], optimeDate (last oplog operation), pingMs, errmsg

> rs.status()

Replication Source

rs.status() can be used to create the replication graph
replication source is determined by ping (smallest)
it can cause chains creation in replication
replication loops: s1 from s2, s2 from s3, s3 from s1

> db.adminCommand({"replSetSyncFrom" : server:27017})
> var config = rs.config()
> config.settings = config.settings || {}
> config.settings.chainingAllowed = false
> rs.reconfig(config)

Resizing the Oplog

long oplog gives you time for maintenance
oplog on primary should be at least few days or even weeks long

> db.printReplicationInfo()
> db.printSlaveReplicationInfo()

procedure to resize the oplog:

demote primary into secondary
shut down and restart as standalone
copy the last insert from the oplog into temporary collection

> use local
> var cursor = db.oplog.rs.find({"op" : "i"})
> var lastInsert = cursor.sort({"$natural" : -1}).limit(1).next()
> db.tmpLastOp.save(lastInsert)
> db.tmpLastOp.findOne()

drop current oplog

> db.oplog.rs.drop()

create new oplog with different size

> db.createCollection("oplog.rs", {"capped" : true, "size" : 10000000})

move last operation from temporary collection into new oplog

> var lastInsert = db.tmpLastOp.findOne()
> db.oplog.rs.insert(lastInsert)
> db.oplog.rs.findOne()

shut down standalone server and restart as replica set member

Other operations

restoring data from delayed secondary
building indexes
replication on cheap machines: priority:0, hidden:true, buildIndexes:false, votes:0
master - slave configuration and mimicking this behaviour when using replica set
calculating lag: local.slaves, local.me, local.slaves.drop()

Replication in MongoDB

Contents

Replication

Test Setup

Production-Like Setup

rs Helper

All About Majorities

How Election Works

Member Configuration

Sync Process

Initial Sync Process

States of Replica Set Members

Rollbacks

Connection to Replica Set and Replication Guarantees

Read Preference

Administration of Replica Set

Run as Standalone

Large Replica Sets

Forcing Reconfiguration

Changing Member Status Manually

Preventing Elections

Maintenance Mode

Monitoring Replication

Replication Source

Resizing the Oplog

Other operations

Navigation menu

Personal tools

Namespaces

Variants

Views

Search

Opportunities

Navigation

Tools