# Operational Guide

## Using a process supervisor
Mesos uses a "fail-fast" approach to error handling: if a serious error occurs, Mesos will typically exit rather than trying to continue running in a possibly erroneous state. For example, when Mesos is configured for high availability, the leading master aborts itself when it discovers it has been partitioned away from the ZooKeeper quorum. This is a safety precaution to ensure the previous leader doesn't continue communicating in an unsafe state.
To ensure that such failures are handled appropriately, production deployments of Mesos typically use a process supervisor (such as systemd or supervisord) to detect when Mesos processes exit. The supervisor can be configured to restart the failed process automatically and/or to notify the cluster operator to investigate the situation.
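For example, with systemd a unit along the following lines restarts the master automatically after a fail-fast exit. This is a minimal sketch; the binary path, ZooKeeper addresses, and other flag values are assumptions to adapt to your installation:

```ini
# /etc/systemd/system/mesos-master.service -- minimal sketch; adjust
# the binary path and flag values to match your installation.
[Unit]
Description=Mesos Master
After=network.target

[Service]
ExecStart=/usr/sbin/mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
# Restart the process whenever it exits, including fail-fast aborts.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

An operator could additionally attach `OnFailure=` hooks (or, with supervisord, event listeners) to notify someone whenever a restart occurs.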
## Changing the master quorum
The master leverages a Paxos-based replicated log as its storage backend (`--registry=replicated_log` is the only storage backend currently supported). Each master participates in the ensemble as a log replica. The `--quorum` flag specifies how many masters constitute a majority: log writes must reach a quorum of replicas before they succeed.
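For instance, each master in a 3-master ensemble (quorum size 2) would be started along these lines; the hostnames and paths are placeholders:

```sh
# Run on each of the three master hosts (hostnames/paths are placeholders).
mesos-master \
  --registry=replicated_log \
  --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos
```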
The following table shows the tolerance to master failures for each quorum size:
| Masters | Quorum Size | Failure Tolerance |
|---------|-------------|-------------------|
| 1       | 1           | 0                 |
| 3       | 2           | 1                 |
| 5       | 3           | 2                 |
| ...     | ...         | ...               |
| 2N - 1  | N           | N - 1             |
It is recommended to run with 3 or 5 masters when high availability is desired.
NOTE: When configuring the quorum, it is essential to ensure that no more masters are running than the number specified in the table above. Running additional masters violates the quorum and may corrupt the log! It is therefore recommended to gate the running of the master process with something that enforces a static whitelist of the master hosts (a minimal sketch follows). See MESOS-1546 for adding a safety whitelist within Mesos itself.
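One way to implement such a gate is a wrapper script that refuses to start the master on any host outside a static list. This is a hypothetical sketch; the whitelist path, the wrapper itself, and the flag values are assumptions:

```sh
#!/bin/sh
# Hypothetical gate: only start mesos-master on whitelisted hosts.
# /etc/mesos/master-whitelist is an assumed file, one hostname per line.
if ! grep -Fqx "$(hostname)" /etc/mesos/master-whitelist; then
  echo "$(hostname) is not a whitelisted master host; refusing to start." >&2
  exit 1
fi
exec mesos-master --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
```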
For online reconfiguration of the log, see: MESOS-683.
## Increasing the quorum size
As the size of a cluster grows, it may be desirable to increase the quorum size for additional fault tolerance.
The following steps show how to increment the quorum size, using 3 -> 5 masters as an example (quorum size 2 -> 3); a shell sketch follows these steps:

- Initially, 3 masters are running with `--quorum=2`.
- Restart the original 3 masters with `--quorum=3`.
- Start 2 additional masters with `--quorum=3`.
To increase the quorum by N, repeat this process to increment the quorum size N times.
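In shell form, the 3 -> 5 transition might look like the following; hostnames, paths, and the non-quorum flags are placeholders:

```sh
# On each of the original three masters (master1-3): restart with the
# larger quorum size.
mesos-master --zk=zk://zk1:2181/mesos --quorum=3 --work_dir=/var/lib/mesos

# On the two new masters (master4-5): start with the same quorum size.
mesos-master --zk=zk://zk1:2181/mesos --quorum=3 --work_dir=/var/lib/mesos
```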
NOTE: Currently, moving out of a single-master setup requires wiping the replicated log state and starting fresh. This will wipe all persistent data (e.g., agents, maintenance information, quota information, etc.). To move from 1 master to 3 masters (see the sketch after this list):

- Stop the standalone master.
- Remove the replicated log data (`replicated_log` under the `--work_dir`).
- Start the original master and two new masters with `--quorum=2`.
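A sketch of those steps, assuming placeholder paths and hostnames and a systemd-managed master:

```sh
# On the standalone master:
systemctl stop mesos-master              # stop the standalone master
rm -rf /var/lib/mesos/replicated_log     # wipe the log under --work_dir
mesos-master --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos

# On the two new master hosts:
mesos-master --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
```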
## Decreasing the quorum size
The following steps show how to decrement the quorum size, using 5 -> 3 masters as an example (quorum size 3 -> 2); a shell sketch follows these steps:

- Initially, 5 masters are running with `--quorum=3`.
- Remove 2 masters from the cluster and ensure they will not be restarted (see the NOTE section above). Now 3 masters are running with `--quorum=3`.
- Restart the 3 remaining masters with `--quorum=2`.
To decrease the quorum by N, repeat this process to decrement the quorum size N times.
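Assuming systemd manages the masters, the 5 -> 3 transition could look like this (hostnames and flag values other than `--quorum` are placeholders):

```sh
# On the two masters being removed: stop them and disable the unit so a
# reboot cannot bring them back into the ensemble.
systemctl disable --now mesos-master

# On each of the three remaining masters: restart with the smaller quorum.
mesos-master --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
```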
## Replacing a master
Please see the NOTE section above. As long as the failed master is guaranteed not to re-join the ensemble, it is safe to start a new master with an empty log and let it catch up, as sketched below.
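A minimal sketch of bringing up a replacement replica; the hostname and paths are assumptions:

```sh
# On the replacement host: an empty --work_dir means an empty log, and the
# new replica catches up automatically from the rest of the ensemble.
mesos-master --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
```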
## External access for Mesos master
If the default IP (or the command line argument `--ip`) is an internal IP, then external entities such as framework schedulers will be unable to reach the master. To address this scenario, an externally accessible IP:port pair can be set up via the `--advertise_ip` and `--advertise_port` command line arguments of `mesos-master`. If configured, external entities such as framework schedulers interact with `advertise_ip:advertise_port`, from which requests need to be proxied to the internal IP:port on which the Mesos master is listening.
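For example (all addresses are placeholders; 203.0.113.10 stands in for a publicly reachable address):

```sh
# Bind to an internal address but advertise an external one. A proxy must
# forward 203.0.113.10:5050 to 10.0.0.5:5050 for this to work.
mesos-master \
  --ip=10.0.0.5 \
  --advertise_ip=203.0.113.10 \
  --advertise_port=5050 \
  --zk=zk://zk1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos
```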
## HTTP requests to non-leading master
HTTP requests to some master endpoints (e.g., `/state`, `/machine/down`) can only be answered by the leading master. Such requests made to a non-leading master will result in either a `307 Temporary Redirect` (with the location of the leading master) or a `503 Service Unavailable` (if the master does not know who the current leader is).
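Clients can simply follow the redirect; for example, with curl (the hostname is a placeholder):

```sh
# -L follows the 307 redirect from a non-leading master to the leader;
# a 503 is returned unchanged if no leader is known.
curl -sL http://master2.example.com:5050/state
```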