Mesos Observability Metrics
This document describes the observability metrics provided by Mesos master and agent nodes, and provides some initial guidance on which metrics you should monitor to detect abnormal situations in your cluster.
Overview
Mesos master and agent nodes report a set of statistics and metrics that enable cluster operators to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active agents, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.
Metric information is not persisted to disk at either master or agent nodes, which means that metrics will be reset when masters and agents are restarted. Similarly, if the current leading master fails and a new leading master is elected, metrics at the new master will be reset.
Metric Types
Mesos provides two different kinds of metrics: counters and gauges.
Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of agent registrations. For some metrics of this type, the rate of change is often more useful than the value itself.
Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected agents. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
The tables in this document indicate the type of each available metric.
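To make the distinction concrete, here is a minimal sketch (in Python, not part of Mesos itself) of how each metric type is typically consumed; the sample values are hypothetical:

```python
# Two hypothetical /metrics/snapshot samples taken 60 seconds apart.
previous = {"master/tasks_failed": 10, "master/cpus_percent": 0.52}
current = {"master/tasks_failed": 16, "master/cpus_percent": 0.93}

# Counter: derive a rate of change. A negative delta means the master
# restarted and the counter was reset, since metrics are not persisted.
delta = current["master/tasks_failed"] - previous["master/tasks_failed"]
failures_per_minute = max(delta, 0)

# Gauge: compare the instantaneous sample against a threshold; alerting
# usually requires the value to stay beyond the threshold for a while.
cpu_above_threshold = current["master/cpus_percent"] > 0.9
```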
Master Nodes
Metrics from each master node are available via the /metrics/snapshot master endpoint. The response is a JSON object that contains metrics names and values as key-value pairs.
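For example, a minimal Python sketch that polls this endpoint; the host and port are assumptions to adapt to your deployment (5050 is the default master port):

```python
import json
import urllib.request

# Assumes a master listening on the default port 5050; agents serve the
# same endpoint on their own port (5051 by default).
with urllib.request.urlopen("http://localhost:5050/metrics/snapshot") as resp:
    metrics = json.loads(resp.read())

# Keys are the metric names listed in the tables below.
print(metrics["master/elected"], metrics["master/uptime_secs"])
```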
Observability metrics
This section lists all available metrics from Mesos master nodes grouped by category.
Resources
The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.
Metric | Description | Type |
---|---|---|
`master/cpus_percent` | Percentage of allocated CPUs | Gauge |
`master/cpus_used` | Number of allocated CPUs | Gauge |
`master/cpus_total` | Number of CPUs | Gauge |
`master/cpus_revocable_percent` | Percentage of allocated revocable CPUs | Gauge |
`master/cpus_revocable_total` | Number of revocable CPUs | Gauge |
`master/cpus_revocable_used` | Number of allocated revocable CPUs | Gauge |
`master/disk_percent` | Percentage of allocated disk space | Gauge |
`master/disk_used` | Allocated disk space in MB | Gauge |
`master/disk_total` | Disk space in MB | Gauge |
`master/disk_revocable_percent` | Percentage of allocated revocable disk space | Gauge |
`master/disk_revocable_total` | Revocable disk space in MB | Gauge |
`master/disk_revocable_used` | Allocated revocable disk space in MB | Gauge |
`master/gpus_percent` | Percentage of allocated GPUs | Gauge |
`master/gpus_used` | Number of allocated GPUs | Gauge |
`master/gpus_total` | Number of GPUs | Gauge |
`master/gpus_revocable_percent` | Percentage of allocated revocable GPUs | Gauge |
`master/gpus_revocable_total` | Number of revocable GPUs | Gauge |
`master/gpus_revocable_used` | Number of allocated revocable GPUs | Gauge |
`master/mem_percent` | Percentage of allocated memory | Gauge |
`master/mem_used` | Allocated memory in MB | Gauge |
`master/mem_total` | Memory in MB | Gauge |
`master/mem_revocable_percent` | Percentage of allocated revocable memory | Gauge |
`master/mem_revocable_total` | Revocable memory in MB | Gauge |
`master/mem_revocable_used` | Allocated revocable memory in MB | Gauge |
Master
The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time indicates a malfunctioning cluster. This points either to leadership election issues (check the connection to ZooKeeper) or to a flapping master process. A low uptime value indicates that the master has restarted recently.
Metric | Description | Type |
---|---|---|
`master/elected` | Whether this is the elected master | Gauge |
`master/uptime_secs` | Uptime in seconds | Gauge |
System
The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.
Metric | Description | Type |
---|---|---|
`system/cpus_total` | Number of CPUs available in this master node | Gauge |
`system/load_15min` | Load average for the past 15 minutes | Gauge |
`system/load_5min` | Load average for the past 5 minutes | Gauge |
`system/load_1min` | Load average for the past minute | Gauge |
`system/mem_free_bytes` | Free memory in bytes | Gauge |
`system/mem_total_bytes` | Total memory in bytes | Gauge |
Agents
The following metrics provide information about agent events, agent counts, and agent states. A low number of active agents may indicate that agents are unhealthy or that they are not able to connect to the elected master.
Metric | Description | Type |
---|---|---|
`master/slave_registrations` | Number of agents that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected | Counter |
`master/slave_removals` | Number of agents removed for various reasons, including maintenance | Counter |
`master/slave_reregistrations` | Number of agent re-registrations | Counter |
`master/slave_unreachable_scheduled` | Number of agents which have failed their health check and are scheduled to be marked unreachable. They will not be marked unreachable immediately due to the agent removal rate limit, but `master/slave_unreachable_completed` will start increasing as they do get removed. | Counter |
`master/slave_unreachable_canceled` | Number of times that an agent was due to be marked unreachable but this transition was cancelled. This happens when the agent removal rate limit is enabled and the agent sends a `PONG` response message to the master before the rate limit allows the agent to be marked unreachable. | Counter |
`master/slave_unreachable_completed` | Number of agents that were marked as unreachable because they failed health checks. These are agents which were not heard from despite the agent removal rate limit, and have been marked as unreachable in the master's agent registry. | Counter |
`master/slaves_active` | Number of active agents | Gauge |
`master/slaves_connected` | Number of connected agents | Gauge |
`master/slaves_disconnected` | Number of disconnected agents | Gauge |
`master/slaves_inactive` | Number of inactive agents | Gauge |
`master/slaves_unreachable` | Number of unreachable agents. Unreachable agents are periodically garbage collected from the registry, which will cause this value to decrease. | Gauge |
Frameworks
The following metrics provide information about the registered frameworks in the cluster. Having no active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.
Metric | Description | Type |
---|---|---|
`master/frameworks_active` | Number of active frameworks | Gauge |
`master/frameworks_connected` | Number of connected frameworks | Gauge |
`master/frameworks_disconnected` | Number of disconnected frameworks | Gauge |
`master/frameworks_inactive` | Number of inactive frameworks | Gauge |
`master/outstanding_offers` | Number of outstanding resource offers | Gauge |
The following metrics are added for each framework which registers with the master, in order to provide detailed information about the behavior of the framework. The framework name is percent-encoded before creating these metrics; the actual name can be recovered by percent-decoding, as shown in the sketch after the table below.
Metric | Description | Type |
---|---|---|
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/subscribed` | Whether or not this framework is currently subscribed | Gauge |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/calls` | Total number of calls sent by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/calls/<CALL_TYPE>` | Number of each type of call sent by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/events` | Total number of events sent to this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/events/<EVENT_TYPE>` | Number of each type of event sent to this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/operations` | Total number of offer operations performed by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/operations/<OPERATION_TYPE>` | Number of each type of offer operation performed by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/tasks/active/<TASK_STATE>` | Number of this framework's tasks currently in each active task state | Gauge |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/tasks/terminal/<TASK_STATE>` | Number of this framework's tasks which have transitioned into each terminal task state | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/offers/sent` | Number of offers sent to this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/offers/accepted` | Number of offers accepted by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/offers/declined` | Number of offers explicitly declined by this framework | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/offers/rescinded` | Number of offers sent to this framework which were subsequently rescinded | Counter |
`master/frameworks/<ENCODED_FRAMEWORK_NAME>/<FRAMEWORK_ID>/roles/<ROLE_NAME>/suppressed` | For each of the framework's subscribed roles, whether or not offers for that role are currently suppressed | Gauge |
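As noted above, the framework name can be recovered by percent-decoding. A minimal Python sketch (the metric key below is hypothetical):

```python
from urllib.parse import unquote

# Hypothetical metric key for a framework named "my framework".
key = "master/frameworks/my%20framework/201103282247-0000000019-0000/calls"

_, _, encoded_name, framework_id, *rest = key.split("/")
print(unquote(encoded_name))  # -> my framework
print(framework_id)           # -> 201103282247-0000000019-0000
```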
Tasks
The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.
Metric | Description | Type |
---|---|---|
`master/tasks_error` | Number of tasks that were invalid | Counter |
`master/tasks_failed` | Number of failed tasks | Counter |
`master/tasks_finished` | Number of finished tasks | Counter |
`master/tasks_killed` | Number of killed tasks | Counter |
`master/tasks_killing` | Number of tasks currently being killed | Gauge |
`master/tasks_lost` | Number of lost tasks | Counter |
`master/tasks_running` | Number of running tasks | Gauge |
`master/tasks_staging` | Number of staging tasks | Gauge |
`master/tasks_starting` | Number of starting tasks | Gauge |
`master/tasks_unreachable` | Number of unreachable tasks | Gauge |
Operations
The following metrics provide information about offer operations on the master.
Below, `OPERATION_TYPE` refers to any one of `reserve`, `unreserve`, `create`, `destroy`, `grow_volume`, `shrink_volume`, `create_disk` or `destroy_disk`.

NOTE: The counter for terminal operation states can over-count over time. In particular, if an agent contained unacknowledged terminal status updates when it was marked gone or marked unreachable, these operations will be double-counted as both their original state and `OPERATION_GONE`/`OPERATION_UNREACHABLE`.
Metric | Description | Type |
---|---|---|
`master/operations/total` | Total number of operations known to this master | Gauge |
`master/operations/<OPERATION_STATE>` | Number of operations in the given non-terminal state (`pending`, `recovering` or `unreachable`) | Gauge |
`master/operations/<OPERATION_STATE>` | Number of operations in the given terminal state (`finished`, `error`, `dropped` or `gone_by_operator`) | Counter |
`master/operations/<OPERATION_TYPE>/total` | Total number of operations with the given type known to this master | Gauge |
`master/operations/<OPERATION_TYPE>/<OPERATION_STATE>` | Number of operations with the given type in the given non-terminal state (`pending`, `recovering` or `unreachable`) | Gauge |
`master/operations/<OPERATION_TYPE>/<OPERATION_STATE>` | Number of operations with the given type in the given terminal state (`finished`, `error`, `dropped` or `gone_by_operator`) | Counter |
Messages
The following metrics provide information about messages between the master and the agents and between the framework and the executors. A high rate of dropped messages may indicate that there is a problem with the network.
Metric | Description | Type |
---|---|---|
`master/invalid_executor_to_framework_messages` | Number of invalid executor to framework messages | Counter |
`master/invalid_framework_to_executor_messages` | Number of invalid framework to executor messages | Counter |
`master/invalid_operation_status_update_acknowledgements` | Number of invalid operation status update acknowledgements | Counter |
`master/invalid_status_update_acknowledgements` | Number of invalid status update acknowledgements | Counter |
`master/invalid_status_updates` | Number of invalid status updates | Counter |
`master/dropped_messages` | Number of dropped messages | Counter |
`master/messages_authenticate` | Number of authentication messages | Counter |
`master/messages_deactivate_framework` | Number of framework deactivation messages | Counter |
`master/messages_decline_offers` | Number of offers declined | Counter |
`master/messages_executor_to_framework` | Number of executor to framework messages | Counter |
`master/messages_exited_executor` | Number of terminated executor messages | Counter |
`master/messages_framework_to_executor` | Number of messages from a framework to an executor | Counter |
`master/messages_kill_task` | Number of kill task messages | Counter |
`master/messages_launch_tasks` | Number of launch task messages | Counter |
`master/messages_operation_status_update_acknowledgement` | Number of operation status update acknowledgement messages | Counter |
`master/messages_reconcile_operations` | Number of reconcile operations messages | Counter |
`master/messages_reconcile_tasks` | Number of reconcile task messages | Counter |
`master/messages_register_framework` | Number of framework registration messages | Counter |
`master/messages_register_slave` | Number of agent registration messages | Counter |
`master/messages_reregister_framework` | Number of framework re-registration messages | Counter |
`master/messages_reregister_slave` | Number of agent re-registration messages | Counter |
`master/messages_resource_request` | Number of resource request messages | Counter |
`master/messages_revive_offers` | Number of offer revival messages | Counter |
`master/messages_status_update` | Number of status update messages | Counter |
`master/messages_status_update_acknowledgement` | Number of status update acknowledgement messages | Counter |
`master/messages_unregister_framework` | Number of framework unregistration messages | Counter |
`master/messages_unregister_slave` | Number of agent unregistration messages | Counter |
`master/messages_update_slave` | Number of update agent messages | Counter |
`master/recovery_slave_removals` | Number of agents not reregistered during master failover | Counter |
`master/slave_removals/reason_registered` | Number of agents removed when new agents registered at the same address | Counter |
`master/slave_removals/reason_unhealthy` | Number of agents removed due to failed health checks | Counter |
`master/slave_removals/reason_unregistered` | Number of agents unregistered | Counter |
`master/valid_framework_to_executor_messages` | Number of valid framework to executor messages | Counter |
`master/valid_operation_status_update_acknowledgements` | Number of valid operation status update acknowledgement messages | Counter |
`master/valid_status_update_acknowledgements` | Number of valid status update acknowledgement messages | Counter |
`master/valid_status_updates` | Number of valid status update messages | Counter |
`master/task_lost/source_master/reason_invalid_offers` | Number of tasks lost due to invalid offers | Counter |
`master/task_lost/source_master/reason_slave_removed` | Number of tasks lost due to agent removal | Counter |
`master/task_lost/source_slave/reason_executor_terminated` | Number of tasks lost due to executor termination | Counter |
`master/valid_executor_to_framework_messages` | Number of valid executor to framework messages | Counter |
Event queue
The following metrics provide information about different types of events in the event queue.
Metric | Description | Type |
---|---|---|
`master/event_queue_dispatches` | Number of dispatches in the event queue | Gauge |
`master/event_queue_http_requests` | Number of HTTP requests in the event queue | Gauge |
`master/event_queue_messages` | Number of messages in the event queue | Gauge |
`master/operator_event_stream_subscribers` | Number of subscribers to the operator event stream | Gauge |
Registrar
The following metrics provide information about read and write latency to the agent registrar.
Metric | Description | Type |
---|---|---|
`registrar/state_fetch_ms` | Registry read latency in ms | Gauge |
`registrar/state_store_ms` | Registry write latency in ms | Gauge |
`registrar/state_store_ms/max` | Maximum registry write latency in ms | Gauge |
`registrar/state_store_ms/min` | Minimum registry write latency in ms | Gauge |
`registrar/state_store_ms/p50` | Median registry write latency in ms | Gauge |
`registrar/state_store_ms/p90` | 90th percentile registry write latency in ms | Gauge |
`registrar/state_store_ms/p95` | 95th percentile registry write latency in ms | Gauge |
`registrar/state_store_ms/p99` | 99th percentile registry write latency in ms | Gauge |
`registrar/state_store_ms/p999` | 99.9th percentile registry write latency in ms | Gauge |
`registrar/state_store_ms/p9999` | 99.99th percentile registry write latency in ms | Gauge |
Replicated log
The following metrics provide information about the replicated log underneath the registrar, which is the persistent store for masters.
Metric | Description | Type |
---|---|---|
`registrar/log/recovered` | Whether the replicated log for the registrar has caught up with the other masters in the cluster. A cluster is operational as long as a quorum of "recovered" masters is available in the cluster. | Gauge |
`registrar/log/ensemble_size` | The number of masters in the ensemble (cluster) that the current master communicates with (including itself) to form the replicated log quorum. It's imperative that this number is always less than `--quorum * 2` to prevent split-brain. It's also important that it should be greater than or equal to `--quorum` to maintain availability. | Gauge |
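A minimal sketch of how an operator might check these two metrics; the `quorum` value is an assumption and must match the master's `--quorum` flag:

```python
def check_replicated_log(metrics: dict, quorum: int) -> None:
    """Raise if the documented invariant is violated:
    quorum <= ensemble_size < quorum * 2, and the log has recovered."""
    ensemble = metrics["registrar/log/ensemble_size"]
    if not (quorum <= ensemble < quorum * 2):
        raise RuntimeError(f"ensemble size {ensemble} unsafe for quorum {quorum}")
    if metrics["registrar/log/recovered"] != 1:
        raise RuntimeError("replicated log has not caught up with the quorum")

# A healthy 3-master ensemble running with --quorum=2:
check_replicated_log(
    {"registrar/log/ensemble_size": 3, "registrar/log/recovered": 1}, quorum=2)
```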
Allocator
The following metrics provide information about performance and resource allocations in the allocator.
Metric | Description | Type |
---|---|---|
`allocator/mesos/allocation_run_ms` | Time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/count` | Number of allocation algorithm time measurements in the window | Gauge |
`allocator/mesos/allocation_run_ms/max` | Maximum time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/min` | Minimum time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p50` | Median time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p90` | 90th percentile of time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p95` | 95th percentile of time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p99` | 99th percentile of time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p999` | 99.9th percentile of time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_run_ms/p9999` | 99.99th percentile of time spent in allocation algorithm in ms | Gauge |
`allocator/mesos/allocation_runs` | Number of times the allocation algorithm has run | Counter |
`allocator/mesos/allocation_run_latency_ms` | Allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/count` | Number of allocation batch latency measurements in the window | Gauge |
`allocator/mesos/allocation_run_latency_ms/max` | Maximum allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/min` | Minimum allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p50` | Median allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p90` | 90th percentile allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p95` | 95th percentile allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p99` | 99th percentile allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p999` | 99.9th percentile allocation batch latency in ms | Gauge |
`allocator/mesos/allocation_run_latency_ms/p9999` | 99.99th percentile allocation batch latency in ms | Gauge |
`allocator/mesos/roles/<role>/shares/dominant` | Dominant resource share for the role, exposed as a percentage (0.0-1.0) | Gauge |
`allocator/mesos/event_queue_dispatches` | Number of dispatch events in the event queue | Gauge |
`allocator/mesos/offer_filters/roles/<role>/active` | Number of active offer filters for all frameworks within the role | Gauge |
`allocator/mesos/quota/roles/<role>/resources/<resource>/offered_or_allocated` | Amount of resources considered offered or allocated towards a role's quota guarantee | Gauge |
`allocator/mesos/quota/roles/<role>/resources/<resource>/guarantee` | Amount of resources guaranteed for a role via quota | Gauge |
`allocator/mesos/resources/cpus/offered_or_allocated` | Number of CPUs offered or allocated | Gauge |
`allocator/mesos/resources/cpus/total` | Number of CPUs | Gauge |
`allocator/mesos/resources/disk/offered_or_allocated` | Allocated or offered disk space in MB | Gauge |
`allocator/mesos/resources/disk/total` | Total disk space in MB | Gauge |
`allocator/mesos/resources/mem/offered_or_allocated` | Allocated or offered memory in MB | Gauge |
`allocator/mesos/resources/mem/total` | Total memory in MB | Gauge |
Basic Alerts
This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.
`master/uptime_secs` is low
The master has restarted.

`master/uptime_secs < 60` for sustained periods of time
The cluster has a flapping master node.

`master/tasks_lost` is increasing rapidly
Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.

`master/slaves_active` is low
Agents are having trouble connecting to the master.

`master/cpus_percent > 0.9` for sustained periods of time
Cluster CPU utilization is close to capacity.

`master/mem_percent > 0.9` for sustained periods of time
Cluster memory utilization is close to capacity.

`master/elected` is 0 for sustained periods of time
No master is currently elected.
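A minimal polling sketch that wires some of these alerts together; the endpoint, thresholds, and polling interval are assumptions, and a production setup would typically feed these metrics into a dedicated monitoring system instead:

```python
import json
import time
import urllib.request

MASTER = "http://localhost:5050/metrics/snapshot"  # assumed master address

def fetch() -> dict:
    with urllib.request.urlopen(MASTER) as resp:
        return json.loads(resp.read())

previous_lost = None
while True:
    m = fetch()
    if m["master/uptime_secs"] < 60:
        print("ALERT: master restarted recently (possible flapping)")
    if m["master/cpus_percent"] > 0.9:
        print("ALERT: cluster CPU utilization close to capacity")
    if m["master/mem_percent"] > 0.9:
        print("ALERT: cluster memory utilization close to capacity")
    # master/tasks_lost is a counter, so alert on its rate of change.
    if previous_lost is not None and m["master/tasks_lost"] - previous_lost > 10:
        print("ALERT: tasks are being lost rapidly")
    previous_lost = m["master/tasks_lost"]
    time.sleep(60)
```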
Agent Nodes
Metrics from each agent node are available via the /metrics/snapshot agent endpoint. The response is a JSON object that contains metrics names and values as key-value pairs.
Observability Metrics
This section lists all available metrics from Mesos agent nodes grouped by category.
Resources
The following metrics provide information about the total resources available in the agent and their current usage.
Metric | Description | Type |
---|---|---|
`containerizer/fetcher/cache_size_total_bytes` | The configured maximum size of the fetcher cache in bytes. This value is constant for the life of the Mesos agent. | Gauge |
`containerizer/fetcher/cache_size_used_bytes` | The current amount of data stored in the fetcher cache in bytes. | Gauge |
`gc/path_removals_failed` | Number of times the agent garbage collection process has failed to remove a sandbox path. | Counter |
`gc/path_removals_pending` | Number of sandbox paths that are currently pending agent garbage collection. | Gauge |
`gc/path_removals_succeeded` | Number of sandbox paths the agent successfully removed. | Counter |
`slave/cpus_percent` | Percentage of allocated CPUs | Gauge |
`slave/cpus_used` | Number of allocated CPUs | Gauge |
`slave/cpus_total` | Number of CPUs | Gauge |
`slave/cpus_revocable_percent` | Percentage of allocated revocable CPUs | Gauge |
`slave/cpus_revocable_total` | Number of revocable CPUs | Gauge |
`slave/cpus_revocable_used` | Number of allocated revocable CPUs | Gauge |
`slave/disk_percent` | Percentage of allocated disk space | Gauge |
`slave/disk_used` | Allocated disk space in MB | Gauge |
`slave/disk_total` | Disk space in MB | Gauge |
`slave/gpus_percent` | Percentage of allocated GPUs | Gauge |
`slave/gpus_used` | Number of allocated GPUs | Gauge |
`slave/gpus_total` | Number of GPUs | Gauge |
`slave/gpus_revocable_percent` | Percentage of allocated revocable GPUs | Gauge |
`slave/gpus_revocable_total` | Number of revocable GPUs | Gauge |
`slave/gpus_revocable_used` | Number of allocated revocable GPUs | Gauge |
`slave/mem_percent` | Percentage of allocated memory | Gauge |
`slave/disk_revocable_percent` | Percentage of allocated revocable disk space | Gauge |
`slave/disk_revocable_total` | Revocable disk space in MB | Gauge |
`slave/disk_revocable_used` | Allocated revocable disk space in MB | Gauge |
`slave/mem_used` | Allocated memory in MB | Gauge |
`slave/mem_total` | Memory in MB | Gauge |
`slave/mem_revocable_percent` | Percentage of allocated revocable memory | Gauge |
`slave/mem_revocable_total` | Revocable memory in MB | Gauge |
`slave/mem_revocable_used` | Allocated revocable memory in MB | Gauge |
`volume_gid_manager/volume_gids_total` | Number of gids configured for volume gid manager | Gauge |
`volume_gid_manager/volume_gids_free` | Number of free gids available for volume gid manager | Gauge |
Agent
The following metrics provide information about whether an agent is currently registered with a master and for how long it has been running.
Metric | Description | Type |
---|---|---|
`slave/registered` | Whether this agent is registered with a master | Gauge |
`slave/uptime_secs` | Uptime in seconds | Gauge |
System
The following metrics provide information about the agent system.
Metric | Description | Type |
---|---|---|
`system/cpus_total` | Number of CPUs available | Gauge |
`system/load_15min` | Load average for the past 15 minutes | Gauge |
`system/load_5min` | Load average for the past 5 minutes | Gauge |
`system/load_1min` | Load average for the past minute | Gauge |
`system/mem_free_bytes` | Free memory in bytes | Gauge |
`system/mem_total_bytes` | Total memory in bytes | Gauge |
Executors
The following metrics provide information about the executor instances running on the agent.
Metric | Description | Type |
---|---|---|
`containerizer/mesos/container_destroy_errors` | Number of containers destroyed due to launch errors | Counter |
`containerizer/fetcher/task_fetches_succeeded` | Total number of times the Mesos fetcher successfully fetched all the URIs for a task. | Counter |
`containerizer/fetcher/task_fetches_failed` | Number of times the Mesos fetcher failed to fetch all the URIs for a task. | Counter |
`slave/container_launch_errors` | Number of container launch errors | Counter |
`slave/executors_preempted` | Number of executors destroyed due to preemption | Counter |
`slave/frameworks_active` | Number of active frameworks | Gauge |
`slave/executor_directory_max_allowed_age_secs` | Maximum allowed age in seconds to delete executor directory | Gauge |
`slave/executors_registering` | Number of executors registering | Gauge |
`slave/executors_running` | Number of executors running | Gauge |
`slave/executors_terminated` | Number of terminated executors | Counter |
`slave/executors_terminating` | Number of terminating executors | Gauge |
`slave/recovery_errors` | Number of errors encountered during agent recovery | Gauge |
`slave/recovery_time_secs` | Agent recovery time in seconds. This value is only available after agent recovery succeeded and remains constant for the life of the Mesos agent. | Gauge |
Tasks
The following metrics provide information about active and terminated tasks.
Metric | Description | Type |
---|---|---|
`slave/tasks_failed` | Number of failed tasks | Counter |
`slave/tasks_finished` | Number of finished tasks | Counter |
`slave/tasks_killed` | Number of killed tasks | Counter |
`slave/tasks_lost` | Number of lost tasks | Counter |
`slave/tasks_running` | Number of running tasks | Gauge |
`slave/tasks_staging` | Number of staging tasks | Gauge |
`slave/tasks_starting` | Number of starting tasks | Gauge |
Messages
The following metrics provide information about messages between the agent and the master it is registered with.
Metric | Description | Type |
---|---|---|
`slave/invalid_framework_messages` | Number of invalid framework messages | Counter |
`slave/invalid_status_updates` | Number of invalid status updates | Counter |
`slave/valid_framework_messages` | Number of valid framework messages | Counter |
`slave/valid_status_updates` | Number of valid status updates | Counter |
Containerizers
The following metrics provide information about both Mesos and Docker containerizers.
Metric | Description | Type |
---|---|---|
`containerizer/docker/image_pull_ms` | Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/count` | Number of Docker containerizer image pulls | Gauge |
`containerizer/docker/image_pull_ms/max` | Maximum Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/min` | Minimum Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p50` | Median Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p90` | 90th percentile Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p95` | 95th percentile Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p99` | 99th percentile Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p999` | 99.9th percentile Docker containerizer image pull latency in ms | Gauge |
`containerizer/docker/image_pull_ms/p9999` | 99.99th percentile Docker containerizer image pull latency in ms | Gauge |
`containerizer/mesos/disk/project_ids_free` | Number of free project IDs available to the XFS Disk isolator | Gauge |
`containerizer/mesos/disk/project_ids_total` | Number of project IDs configured for the XFS Disk isolator | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms` | Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/count` | Number of Mesos containerizer docker image pulls | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/max` | Maximum Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/min` | Minimum Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p50` | Median Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p90` | 90th percentile Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p95` | 95th percentile Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p99` | 99th percentile Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p999` | 99.9th percentile Mesos containerizer docker image pull latency in ms | Gauge |
`containerizer/mesos/provisioner/docker_store/image_pull_ms/p9999` | 99.99th percentile Mesos containerizer docker image pull latency in ms | Gauge |
Resource Providers
The following metrics provide information about ongoing and completed operations that apply to resources provided by a resource provider with the given type and name. In the following metrics, the `operation` placeholder refers to the name of a particular operation type, which is listed in the Supported Operation Types table below.
Metric | Description | Type |
---|---|---|
`resource_providers/<type>.<name>/operations/<operation>/pending` | Number of ongoing operations | Gauge |
`resource_providers/<type>.<name>/operations/<operation>/finished` | Number of finished operations | Counter |
`resource_providers/<type>.<name>/operations/<operation>/failed` | Number of failed operations | Counter |
`resource_providers/<type>.<name>/operations/<operation>/dropped` | Number of dropped operations | Counter |
Supported Operation Types
Since the supported operation types may vary among different resource providers, the following is a comprehensive list of operation types and the corresponding resource providers that support them. Note that the Name column gives the value to use for the `operation` placeholder in the above metrics.
Type | Name | Supported Resource Provider Types |
---|---|---|
`RESERVE` | `reserve` | All |
`UNRESERVE` | `unreserve` | All |
`CREATE` | `create` | `org.apache.mesos.rp.local.storage` |
`DESTROY` | `destroy` | `org.apache.mesos.rp.local.storage` |
`CREATE_DISK` | `create_disk` | `org.apache.mesos.rp.local.storage` |
`DESTROY_DISK` | `destroy_disk` | `org.apache.mesos.rp.local.storage` |
For example, cluster operators can monitor the number of successful `CREATE_DISK` operations that are applied to the resource provider with type `org.apache.mesos.rp.local.storage` and name `lvm` through the `resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished` metric.
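A small sketch of how such metric keys can be assembled; the provider type, name, and operation below are the illustrative values from the example above:

```python
def operation_metric(rp_type: str, rp_name: str,
                     operation: str, state: str) -> str:
    """Build a resource provider operation metric key."""
    return f"resource_providers/{rp_type}.{rp_name}/operations/{operation}/{state}"

key = operation_metric(
    "org.apache.mesos.rp.local.storage", "lvm", "create_disk", "finished")
# -> resource_providers/org.apache.mesos.rp.local.storage.lvm/operations/create_disk/finished
```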
CSI Plugins
Storage resource providers in Mesos are backed by CSI plugins running in standalone containers. To monitor the health of the CSI plugins backing a storage resource provider with a given type and name, the following metrics provide information about plugin terminations and about ongoing and completed CSI calls made to the plugin.
Metric | Description | Type |
---|---|---|
`resource_providers/<type>.<name>/csi_plugin/container_terminations` | Number of terminated CSI plugin containers | Counter |
`resource_providers/<type>.<name>/csi_plugin/rpcs_pending` | Number of ongoing CSI calls | Gauge |
`resource_providers/<type>.<name>/csi_plugin/rpcs_finished` | Number of successful CSI calls | Counter |
`resource_providers/<type>.<name>/csi_plugin/rpcs_failed` | Number of failed CSI calls | Counter |
`resource_providers/<type>.<name>/csi_plugin/rpcs_cancelled` | Number of cancelled CSI calls | Counter |