Replication in MongoDB
Replication
- what happens when a standalone server crashes?
- replication is a way of keeping identical copies of data on multiple servers
- recommended for all production deployments
- a replica set consists of one primary server and many secondaries
- replication in MongoDB is asynchronous
- the primary server handles client requests
- when the primary crashes, the secondaries elect a new primary
- primary (master), secondary (slave)
Test Setup
- the following starts 3 test servers on ports 31000, 31001, 31002
- or on ports 20000, 20001, 20002 starting from MongoDB 3.2
- databases are stored in the default directory (/data/db)
$ mongo --nodb
> replicaSet = new ReplSetTest({"nodes" : 3})
> replicaSet.startSet()
> replicaSet.initiate()
$ mongo --nodb
> conn0 = new Mongo("localhost:31000")
> db0 = conn0.getDB("test")
> db0.isMaster()
> for (i=0; i<100; i++) { db0.repl.insert({"x" : i}) }
> db0.repl.count()
>
> conn1 = new Mongo("localhost:31001")
> db1 = conn1.getDB("test")
> db1.repl.count()
> db1.setSlaveOk()
> db1.repl.count()
>
> db0.adminCommand({"shutdown" : 1})
> db1.isMaster()
Production-Like Setup
- this example shows how to start a replica set with 3 members
- all members run on the same server here, but this can easily be changed to multiple machines
- only one server (localhost:27100) may contain data; the other members must start empty
- the replica set name can be any UTF-8 string (np_rep in this example)
- use --replSet np_rep when starting mongod
- each member of the replica set must be able to connect with all members
$ mongod --port 27100 --replSet np_rep --dbpath /data/np_rep/rep0 --logpath /data/np_rep/rep0.log --fork --bind_ip 127.0.0.1 --rest --httpinterface --logappend
$ mongod --port 27101 --replSet np_rep --dbpath /data/np_rep/rep1 --logpath /data/np_rep/rep1.log --fork ...
$ mongod --port 27102 --replSet np_rep --dbpath /data/np_rep/rep2 --logpath /data/np_rep/rep2.log --fork ...
$
$ mongod --shutdown --dbpath /data/np_rep/rep0
- the mongo shell is the only way to configure a replica set
- prepare a configuration document (np_rep_config)
- the _id key is the replica set name
- members is a list of all servers in the replica set
- send the configuration to the server that holds the data
- this server will propagate the configuration to the other members
- the servers will elect a primary and start handling client requests
- localhost as a server name can be used only for testing
> np_rep_config = {
... "_id" : "np_rep",
... "members" : [
... {"_id" : 0, "host" : "localhost:27100"},
... {"_id" : 1, "host" : "localhost:27101"},
... {"_id" : 2, "host" : "localhost:27102"}
... ]
... }
>
> rs.initiate(np_rep_config)
> db = (new Mongo("localhost:27102")).getDB("test")
rs Helper
- rs is a global variable containing replication helper functions
- rs.help() - list of all functions
- rs.initiate(np_rep_config) is a wrapper to adminCommand
- db.adminCommand({"replSetInitiate" : np_rep_config})
> rs.help()
> rs.add("localhost:27103")
> rs.remove("localhost:27103")
> rs.config()
> np_rep_config = rs.config()
> edit np_rep_config
> rs.reconfig(np_rep_config)
All About Majorities
- how do you design a replica set?
- majority = more than half of all members in the set
- the majority is based on the set's configuration, not on which members are currently up
- e.g. a 5-member replica set with 3 members down
- the two remaining members cannot reach a majority
- if one of the two was the primary, it would step down
- both remaining members will be secondaries
- why can't two of five form a majority? the other three may still be up on the far side of a network partition and elect their own primary, so letting the two elect one as well could create two primaries
- examples: 3+2 members, 2+2+1 members, only 2 members
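- majority needed for a given set size is floor(n/2) + 1: 1 of 1, 2 of 2, 2 of 3, 3 of 4, 3 of 5, 4 of 6, 4 of 7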
How Election Works
- an election starts when a secondary cannot reach the primary
- it contacts all other members and requests that it be made primary
- the others do several checks:
- can they reach the primary?
- is the candidate up to date with replication?
- is there any member with a higher priority?
- the election ends when the candidate receives "yes" votes from a majority
- members can veto the election (a veto counts as 10,000 negative votes)
- heartbeats are sent to all members every 2 seconds, with a 10-second timeout (adjustable, see the sketch after this list)
- based on this, the primary knows if it can reach a majority
- when an election results in a tie, all members wait 30 seconds before the next attempt
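- a minimal sketch of tuning the heartbeat timeout via settings.heartbeatTimeoutSecs (10 is the default; 15 below is only an example value)
> var config = rs.config()
> config.settings = config.settings || {}
> config.settings.heartbeatTimeoutSecs = 15
> rs.reconfig(config)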
Member Configuration
- arbiter
- normal mongod process started with --replSet option and empty data directory
- > rs.addArb("server:port")
- > rs.add({"_id" : 7, "host" : "server:port", "arbiterOnly" : true})
- use at most one arbiter, don't add arbiters "just in case"
- once added, an arbiter can't be changed into a normal data member, and vice versa
- use a normal data member instead of an arbiter whenever possible
- priority
- priority range is from 0 to 100 (default 1)
- it expresses how strongly this member wants to become primary
- a member with priority 0 can never become primary (a passive member)
- hidden
- hidden members don't handle client requests
- hidden members are not preferred as replication sources
- useful for backup or less powerful machines
- use hidden : true with priority : 0
- hidden members are not listed by rs.isMaster(), but are visible in rs.status() and rs.config()
- slave delay
- a delayed secondary purposely lags behind the primary by the specified number of seconds
- use slaveDelay : seconds with priority : 0
- ... and hidden : true if your clients read from secondaries
- building indexes
- useful for backup servers
- prevents the member from building any indexes
- use buildIndexes : false with priority : 0
- a non-index-building member can't easily be changed back to a normal member (these options are combined in the sketch after this list)
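- a minimal sketch of adding a member with the options above; the _id, host and the 3600-second delay are example values only
> rs.add({"_id" : 3, "host" : "localhost:27103", "priority" : 0, "hidden" : true, "slaveDelay" : 3600, "buildIndexes" : false})
> rs.config().members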
Sync Process
- oplog contains all operations that primary performs
- capped collection in local database on each member
- any member can be used as a source for replication
- sync steps: read an operation from the sync source's oplog -> apply it to the local data -> write it to the local oplog
- re-applying the same oplog operation is safe: operations are idempotent and give the same result
- the oplog is fixed in size, not in time (it can be inspected directly, as shown after this list)
- usually one operation on data results in one operation in the oplog
- exception: one operation that affects multiple documents is stored in the oplog as many single-document operations
- a secondary may go stale when it has had downtime, receives more writes than it can handle, or is too busy handling reads
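- a quick sketch of inspecting the oplog from the local database, e.g. to see the most recent operations and the time range it covers
> use local
> db.oplog.rs.find().sort({"$natural" : -1}).limit(3)
> db.printReplicationInfo()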
Initial Sync Process
- initial checks (choose a source, drop existing databases)
- cloning data from source (longest operation)
- first oplog application
- second oplog application
- index building (long operation)
- third oplog application
- switching to normal syncing and normal operations
- restoring from backup is often faster
- cloning may ruin the source's working set
- initial sync may fail if the source's oplog is too short to cover the whole sync (forcing a resync is sketched below)
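- a sketch of forcing an initial sync, assuming the data directories from the production-like setup above: stop the member, empty its data directory and start it again
$ mongod --shutdown --dbpath /data/np_rep/rep2
$ rm -rf /data/np_rep/rep2/*
$ mongod --port 27102 --replSet np_rep --dbpath /data/np_rep/rep2 --logpath /data/np_rep/rep2.log --fork ...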
States of Replica Set Members
- Primary
- Secondary
- Startup (when you start a member for the first time)
- Startup2 (initial sync, short state on normal members: starts replication and election process)
- Recovering (member is operating correctly but is temporarily not available for reads, occurs in many situations)
- Arbiter
- Down (when member was up and then becomes unreachable)
- Unknown (when member has never been able to reach another member)
- Removed (after removing from replica set, when added back it will return into "normal" state)
- Rollback (when rolling back data)
- Fatal (find a reason in log, grep "replSet FATAL", restore from backup or resync)
Rollbacks
- a rollback undoes writes that a former primary accepted but that did not replicate to the new primary before the failover
- the rolled-back data is not merged back automatically; this must be done manually
- rolled-back documents are written to a collectionName.bson file in the rollback directory inside the data directory
- mongorestore this file into a temporary collection and perform a manual merge (see the sketch after this list)
- rollback will fail if there is more than 300MB of data or about 30 minutes of operations to rollback
- how to prevent rollbacks
- do not change number of votes of replica set members
- keep secondaries up to date
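- a sketch of loading rolled-back documents for inspection with mongorestore; the port, paths and file name are examples only
$ mongorestore --port 27101 --db rollback_tmp --collection repl /data/np_rep/rep1/rollback/repl.bson
> use rollback_tmp
> db.repl.find()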
Connection to Replica Set and Replication Guarantees
- connection strings: mongodb://user_name:password@host1:27017,host2:27018,host3:27019/?replicaSet=replicaName&connectTimeoutMS=10000&authMechanism=SCRAM-SHA-1
> db.runCommand({getLastError: 1, w: "majority"})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority"}})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority", wtimeout: 5000}})
> db.testColl.insert({a: 1}, {writeConcern: {w: 3, wtimeout: 5000}})
> config = rs.conf()
> config.settings = {}
> config.settings.getLastErrorDefaults = {w: "majority", wtimeout: 5000}
> rs.reconfig(config)
> config = rs.config()
> config.members[0].tags = {"dc": "PL"}
> config.members[1].tags = {"dc": "UK"}
> config.settings = config.settings || {}
> config.settings.getLastErrorModes = [{"allDataCenters" : {"dc": 2}}]
> rs.reconfig(config)
> db.testColl.insert({a: 1}, {writeConcern: {w: "allDataCenters", wtimeout: 5000}})
Read Preference
- read preference describes how MongoDB clients route read operations to the members of a replica set
- read preference modes:
- primary - default mode
- primaryPreferred
- secondary
- secondaryPreferred
- nearest
- tag sets can be combined with a mode to target specific members (see the sketch after this list)
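- a minimal sketch of setting a read preference from the mongo shell; the "dc" tag assumes the tags configured in the write concern example above
> db.getMongo().setReadPref("secondaryPreferred")
> db.getMongo().setReadPref("nearest", [{"dc" : "PL"}, {}])
> db.testColl.find().readPref("secondary")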
Administration of Replica Set
Run as Standalone
- many operations cannot be performed on secondaries
- stop the secondary
- run mongod without --replSet, on a different port, with the same --dbpath
$ mongod --port 27999 --dbpath /var/lib/mongodb
Large Replica Sets
- replica sets are limited to 50 members (12 before version 3.0) and only 7 voting members
- the limit exists to reduce network traffic generated by heartbeats
- use a master-slave configuration when more than 50 secondaries are needed
> rs.add({"_id" : 51, "host" : "server-8:27017", "votes" : 0})
Forcing Reconfiguration
- useful when the majority is lost permanently
- force allows you to send the configuration to a secondary (not only to the primary)
- the configuration needs to be prepared correctly
- force will increase the configuration version number dramatically
> var config = rs.config()
> edit config
> rs.reconfig(config, {"force" : true})
Changing Member Status Manually
- there is no way to force a member to become primary
- rs.stepDown() demotes the primary to a secondary (for 60 seconds by default)
> rs.stepDown()
> rs.stepDown(600)
Preventing Elections
- rs.freeze(seconds) prevents a member from seeking election; it can also be used on a demoted primary to unfreeze it (rs.freeze(0))
> rs.freeze(600)
> rs.freeze(0)
Maintenance Mode
- the server will go into the Recovering state
- useful if a server is performing a long operation or is far behind in replication
> db.adminCommand({"replSetMaintenanceMode" : true})
> db.adminCommand({"replSetMaintenanceMode" : false})
Monitoring Replication
- status of the replica set (from the current server perspective)
- db.adminCommand("replSetGetStatus") or rs.status()
- important fields: self, stateStr, uptime [s], optimeDate (last oplog operation), pingMs, errmsg
> rs.status()
Replication Source
- rs.status() can be used to create the replication graph
- the replication source is chosen by the smallest ping time
- this can lead to chaining in replication
- replication loops (s1 syncs from s2, s2 from s3, s3 from s1) should be avoided
> db.adminCommand({"replSetSyncFrom" : "server:27017"})
> var config = rs.config()
> config.settings = config.settings || {}
> config.settings.chainingAllowed = false
> rs.reconfig(config)
Resizing the Oplog
- a long oplog gives you time for maintenance
- the oplog on the primary should cover at least a few days, or even weeks, of operations
> db.printReplicationInfo()
> db.printSlaveReplicationInfo()
- procedure to resize the oplog:
- demote primary into secondary
- shut down and restart as standalone
- copy the last insert from the oplog into temporary collection
> use local
> var cursor = db.oplog.rs.find({"op" : "i"})
> var lastInsert = cursor.sort({"$natural" : -1}).limit(1).next()
> db.tmpLastOp.save(lastInsert)
> db.tmpLastOp.findOne()
- drop current oplog
> db.oplog.rs.drop()
- create new oplog with different size
> db.createCollection("oplog.rs", {"capped" : true, "size" : 10000000})
- move last operation from temporary collection into new oplog
> var lastInsert = db.tmpLastOp.findOne()
> db.oplog.rs.insert(lastInsert)
> db.oplog.rs.findOne()
- shut down standalone server and restart as replica set member
Other operations
- restoring data from delayed secondary
- building indexes
- replication on cheap machines: priority:0, hidden:true, buildIndexes:false, votes:0 (see the sketch below)
- master-slave configuration and mimicking this behaviour with a replica set
- calculating lag: local.slaves, local.me, local.slaves.drop()
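- a minimal sketch of adding the "cheap machine" member described above; the _id and host are example values only
> rs.add({"_id" : 4, "host" : "server-5:27017", "priority" : 0, "hidden" : true, "buildIndexes" : false, "votes" : 0})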