Replication in MongoDB

From Training Material
Jump to: navigation, search


Replication

  • what will happen when standalone server crashes?
  • replication is a way of keeping identical server copies
  • recommended for all production deployments
  • replication is a set of one primary server and many secondaries
  • replication in MongoDB is asynchronous
  • primary server handles client requests
  • when primary crashes secondaries will elect new primary
  • primary (master), secondary (slave)


Test Setup

  • this is the way to start 3 servers on ports 31000, 31001, 31002
    • or on ports 20000, 20001, 20002 starting from MongoDB 3.2
  • databases are stored in default directory (/data/db)


$ mongo --nodb
> replicaSet = new ReplSetTest({"nodes" : 3})
> replicaSet.startSet()
> replicaSet.initiate()


$ mongo --nodb
> conn0 = new Mongo("localhost:31000")
> db0 = conn0.getDB("test")
> db0.isMaster()
> for (i=0; i<100; i++) { db0.repl.insert({"x" : i}) }
> db0.repl.count()
>
> conn1 = new Mongo("localhost:31001")
> db1 = conn1.getDB("test")
> db1.repl.count()
> db1.setSlaveOk()
> db1.repl.count()
>
> db0.adminCommand({"shutdown" : 1})
> db1.isMaster()


Production-Like Setup

  • this example show how to start replica with 3 members
  • it is on the same server but easily can be changed to multiple machines
  • only one server (localhost:27100) may contain data
  • replica set name can be any utf-8 string (np_rep in this example)
  • use --replSet np_rep when starting mongod
  • each member of the replica set must be able to connect with all members


$ mongod --port 27100 --replSet np_rep --dbpath /data/np_rep/rep0 --logpath /data/np_rep/rep0.log --fork --bind_ip 127.0.0.1 --rest --httpinterface --logappend
$ mongod --port 27101 --replSet np_rep --dbpath /data/np_rep/rep1 --logpath /data/np_rep/rep1.log --fork ...
$ mongod --port 27102 --replSet np_rep --dbpath /data/np_rep/rep2 --logpath /data/np_rep/rep2.log --fork ...
$
$ mongod --shutdown --dbpath /data/np_rep/rep0


  • Mongo Shell is the only way to configure replica set
  • prepare configuration document (np_rep_config)
    • _id key is your replica set name
    • members is a list of all servers in replica set
  • send configuration to server with data
  • this server will change configuration of other members
  • servers will elect primary and start handling clients requests
  • localhost as a server name can be used only for testing


> np_rep_config = { 
...   "_id" : "np_rep",
...   "members" : [
...     {"_id" : 0, "host" : "localhost:27100"},
...     {"_id" : 1, "host" : "localhost:27101"},
...     {"_id" : 2, "host" : "localhost:27102"}
...   ]
... }
>
> rs.initiate(np_rep_config)
> db = (new Mongo("localhost:27102")).getDB("test")


rs Helper

  • rs is a global variable containing replication helper functions
  • rs.help() - list of all functions
  • rs.initiate(np_rep_config) is a wrapper to adminCommand
    • db.adminCommand({"replSetInitiate" : np_rep_config})
> rs.help()
> rs.add("localhost:27103")
> rs.remove("localhost:27103")
> rs.config()
> np_rep_config = rs.config()
> edit np_rep_config 
> rs.reconfig(np_rep_config)


All About Majorities

  • how to design replica set?
  • majority = more than half of all members in the set
  • majority is based on configuration not on the current status of replica
    • i.e. 5 members replica, 3 members are down
    • two remaining members can not reach majority
    • if one of two was primary it would step down
    • all two members will be secondaries
  • why two of five can not reach majority?
  • examples: 3+2 members, 2+2+1 members, only 2 members


How Election Works

  • election starts when secondary can not reach primary
  • it will contact all other members and request that it be primary
  • others do several checks:
    • can they reach primary?
    • is candidate up to date with replication?
    • is there any member with higher priority?
  • election ends when candidate receives "yes" from a majority
  • members can send veto (10000 votes)
  • heartbeat is sent every 2 seconds with 10 seconds time-out to all members
    • based on this, the primary knows if it can reach a majority
    • when election results in a tie all members will wait 30 seconds


Member Configuration

  • arbiter
    • normal mongod process started with --replSet option and empty data directory
    • > rs.addArb("server:port")
    • > rs.add({"_id" : 7, "host" : "server:port", "arbiterOnly" : true})
    • use at most one arbiter, don't add arbiters "just in case"
    • arbiter once added can't be changed to normal mongod, and vice versa
    • use normal data member instead of an arbiter whenever possible

  • priority
    • priority range is from 0 to 100 (default 1)
    • it means how badly this member wants to become primary
    • member with priority 0 can never become primary (passive members)

  • hidden
    • hidden members don't handle client requests
    • hidden members are not preferred as replication sources
    • useful for backup or less powerful machines
    • use hidden : true with priority : 0
    • rs.isMaster(), rs.status(), rs.config()

  • slave delay
    • delayed secondary will purposely lag by the specified number of second
    • use slaveDelay : seconds with priority : 0
    • ... and hidden : true if your clients reads from secondaries

  • building indexes
    • useful for backup servers
    • prevents from building any indexes
    • use buildIndexes : false with priority : 0
    • non-index-building member can't be changed to normal easily


Sync Process

  • oplog contains all operations that primary performs
  • capped collection in local database on each member
  • any member can be used as a source for replication
  • steps: read data from primary -> apply operation to data -> write to local oplog
  • re applying the same oplog operation is safe and handled properly (the same result)
  • oplog is fixed in size not in time
    • usually one operation on data results in one operation in oplog
    • exception: one operation that affects multiple documents is stored in oplog as many operations on single document
  • secondary may go stale when: secondary had downtime, has too many writes than it can handle, to busy handling reads


Initial Sync Process

  1. initial checks (choose a source, drop existing databases)
  2. cloning data from source (longest operation)
  3. first oplog application
  4. second oplog application
  5. index building (long operation)
  6. third oplog application
  7. switching to normal syncing and normal operations
  • restoring from backup is often faster
  • cloning may ruin source's working set
  • initial sync may fail if oplog is too short


States of Replica Set Members

  1. Primary
  2. Secondary
  3. Startup (when you start a member for the first time)
  4. Startup2 (initial sync, short state on normal members: starts replication and election process)
  5. Recovering (failure state, member is operating correctly but not available for reads, occurs in many situations)
  6. Arbiter
  7. Down (when member was up and then becomes unreachable)
  8. Unknown (when member has never been able to reach another member)
  9. Removed (after removing from replica set, when added back it will return into "normal" state)
  10. Rollback (when rolling back data)
  11. Fatal (find a reason in log, grep "replSet FATAL", restore from backup or resync)


Rollbacks

  1. what is rollback and when it will be performed
  2. synchronization must be done manually
    • collectionName.bson file in a rollback directory in data directory
    • mongorestore above file into temporary collection and perform manual merge
  3. rollback will fail if there is more than 300MB of data or about 30 minutes of operations to rollback
  4. how to prevent rollbacks
    • do not change number of votes of replica set members
    • keep secondaries up to date


Connection to Replica Set and Replication Guarantees

  • connection strings: mongodb://user_name:password@host1:27017,host2:27018,host3:27019/?replicaSet=replicaName&connectTimeoutMS=10000&authMechanism=SCRAM-SHA-1


> db.runCommand({getLastError: 1, w: "majority"})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority"}})
> db.testColl.insert({a: 1}, {writeConcern: {w: "majority", wtimeout: 5000}})
> db.testColl.insert({a: 1}, {writeConcern: {w: 3, wtimeout: 5000}})


> config = rs.conf()
> config.settings = {}
> config.settings.getLastErrorDefaults = {w: "majority", wtimeout: 5000}
> rs.reconfig(config)


> config = rs.config()
> config.members[0].tags = {"dc": "PL"}
> config.members[1].tags = {"dc": "UK"}
> config.settings = config.settings || {}
> config.settings.getLastErrorModes = [{"allDataCenters" : {"dc": 2}}]
> rs.reconfig(config)
> db.testColl.insert({a: 1}, {writeConcern: {w: "allDataCenters", wtimeout: 5000}})


Read Preference

  • read preference describes how MongoDB clients route read operations to the members of a replica set
  • read preference modes:
    • primary - default mode
    • primaryPreferred
    • secondary
    • secondaryPreferred
    • nearest
  • tag sets


Administration of Replica Set


Run as Standalone

  • many operation can not be done on secondaries
    1. stop the secondary
    2. run mongod without --replSet, on different port, with the same --dbpath
$ mongod --port 27999 --dbpath /var/lib/mongodb


Large Replica Sets

  • replica sets are limited to 50 members (12 before version 3.0) and only 7 voting members
  • this is limited to reduce network traffic generated by heartbeat
  • use Master-Slave configuration when more than 50 secondaries is needed
> rs.add({"_id" : 51, "host" : "server-8:27017", "votes" : 0})


Forcing Reconfiguration

  • useful when you majority is lost permanantly
  • force allows you to send the configuration to secondary (not only to primary)
  • configuration needs to be prepared correctly
  • force will change version dramatically
> var config = rs.config()
> edit config
> rs.reconfig(config, {"force" : true})


Changing Member Status Manually

  • there is no way to force member to become primary
  • demote primary to a secondary (by default 60 seconds)
> rs.stepDown()
> rs.stepDown(600)


Preventing Elections

  • in can be also used on demoted primary to unfreeze it
> rs.freeze(600)
> rs.freeze(0)


Maintenance Mode

  • server will go into recovery state
  • useful if server is performing long operations or is far behind in replication
> db.adminCommand({"replSetMaintenanceMode" : true})
> db.adminCommand({"replSetMaintenanceMode" : false})


Monitoring Replication

  • status of the replica set (from the current server perspective)
  • db.adminCommand("replSetGetStatus") or rs.status()
  • important fields: self, stateStr, uptime [s], optimeDate (last oplog operation), pingMs, errmsg
> rs.status()


Replication Source

  • rs.status() can be used to create the replication graph
  • replication source is determined by ping (smallest)
  • it can cause chains creation in replication
  • replication loops: s1 from s2, s2 from s3, s3 from s1
> db.adminCommand({"replSetSyncFrom" : server:27017})
> var config = rs.config()
> config.settings = config.settings || {}
> config.settings.chainingAllowed = false
> rs.reconfig(config)


Resizing the Oplog

  • long oplog gives you time for maintenance
  • oplog on primary should be at least few days or even weeks long
> db.printReplicationInfo()
> db.printSlaveReplicationInfo()


procedure to resize the oplog:

  • demote primary into secondary
  • shut down and restart as standalone
  • copy the last insert from the oplog into temporary collection
> use local
> var cursor = db.oplog.rs.find({"op" : "i"})
> var lastInsert = cursor.sort({"$natural" : -1}).limit(1).next()
> db.tmpLastOp.save(lastInsert)
> db.tmpLastOp.findOne()
  • drop current oplog
> db.oplog.rs.drop()
  • create new oplog with different size
> db.createCollection("oplog.rs", {"capped" : true, "size" : 10000000})
  • move last operation from temporary collection into new oplog
> var lastInsert = db.tmpLastOp.findOne()
> db.oplog.rs.insert(lastInsert)
> db.oplog.rs.findOne()
  • shut down standalone server and restart as replica set member


Other operations

  • restoring data from delayed secondary
  • building indexes
  • replication on cheap machines: priority:0, hidden:true, buildIndexes:false, votes:0
  • master - slave configuration and mimicking this behaviour when using replica set
  • calculating lag: local.slaves, local.me, local.slaves.drop()