CouchBase Monitoring Extension

Use Case

Couchbase Server is an open source, distributed (shared-nothing architecture) NoSQL document-oriented database that is optimized for interactive applications. This extension allows the user to connect to a specific cluster host and retrieve metrics about the cluster, all the nodes within the cluster and any buckets associated with the nodes.

Prerequisites

  1. This extension works only with the standalone Java machine agent. The extension requires the machine agent to be up and running.
  2. This extension creates a client to the CouchBase server that needs to be monitored, so that server must be reachable from the machine on which the extension is installed.
  3. The client created through the extension uses various REST endpoints provided by the CouchBase server to retrieve metrics. Please make sure your user account has an admin role that can access all the REST endpoints. The "Full", "Cluster" and "Read-only" level roles give you access to all the REST endpoints.
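
Before installing, it can help to confirm from the agent machine that the account can reach the cluster over REST. The following is a minimal sketch, not the extension's own client: it assumes Couchbase's cluster overview endpoint `/pools/default` on the admin port and HTTP basic authentication. The function names `build_pools_request` and `check_access` are hypothetical helpers introduced here for illustration.

```python
import base64
import urllib.error
import urllib.request


def build_pools_request(host, port, username=None, password=None):
    """Build a GET request for the cluster overview endpoint (/pools/default)."""
    url = f"http://{host}:{port}/pools/default"
    request = urllib.request.Request(url, method="GET")
    # Credentials are only attached when authentication is enabled.
    if username is not None and password is not None:
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


def check_access(request, timeout=5):
    """Return the HTTP status code; 401 or 403 usually points at a role problem."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `check_access(build_pools_request("localhost", "8091", "Administrator", "password1"))` against a reachable node should return 200 when the role is sufficient. A connection error (rather than an HTTP status) suggests the server is not reachable from this machine.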

Installing the extension

  1. Download and unzip the CouchBaseMonitor-version.zip file into <MACHINE_AGENT_HOME>/monitors/ directory.
  2. Configure the extension as described in the section below.
  3. Restart the machine agent.

Configuring the extension using config.yml

Configure the CouchBase monitoring extension by editing the config.yml file in <MACHINE_AGENT_HOME>/monitors/CouchBaseMonitor/

  1. Configure the "tier" under which the metrics need to be reported. This can be done by changing the value of <TIER NAME OR TIER ID> in metricPrefix: "Server|Component:<TIER NAME OR TIER ID>|Custom Metrics|CouchBase"

    For example,

    metricPrefix: "Server|Component:Extensions tier|Custom Metrics|CouchBase"

  2. Configure the CouchBase cluster by specifying the name (required), host (required), port (required) and queryPort (required) of any node (server) in the CouchBase cluster, along with the username (only if authentication is enabled), password (only if authentication is enabled) and passwordEncrypted (only if password encryption is required).

    For example,

    servers:
       - name: "Cluster1"
         host: "localhost"
         port: "8091"
         queryPort: "8093"
         username: "Administrator"
         password: "password1"
         passwordEncrypted: ""
    

  3. Configure the encryptionKey for encrypted passwords (only if password encryption is required). See the "Password encryption" section for how to encrypt a password.

    For example,

    #Encryption key for Encrypted password.
    encryptionKey: "axcdde43535hdhdgfiniyy576"
    

  4. Configure the metrics section.

    For configuring the metrics, the following properties can be used:

    | Property | Default value | Possible values | Description |
    | --- | --- | --- | --- |
    | alias | metric name | Any string | The substitute name to be used in the metric browser instead of the metric name. |
    | aggregationType | "AVERAGE" | "AVERAGE", "SUM", "OBSERVATION" | Aggregation qualifier |
    | timeRollUpType | "AVERAGE" | "AVERAGE", "SUM", "CURRENT" | Time roll-up qualifier |
    | clusterRollUpType | "INDIVIDUAL" | "INDIVIDUAL", "COLLECTIVE" | Cluster roll-up qualifier |
    | multiplier | 1 | Any number | Value by which the metric needs to be multiplied. |
    | convert | null | Any key value map | Set of key value pairs that indicates the value to which the metrics need to be transformed, e.g. UP:0, DOWN:1 |
    | delta | false | true, false | If enabled, gives the delta values of metrics instead of actual values. |

    For example,

    - bandwidth_usage:  #Bandwidth used during replication, measured in bytes per second.
        alias: "bandwidthUsed"
        multiplier: 1
        aggregationType: "SUM"
        timeRollUpType: "CURRENT"
        clusterRollUpType: "INDIVIDUAL"
        delta: true
    - status:
        alias: "status"
        convert:
          "healthy" : 1
          "warmup" : 2
    

    **All these metric properties are optional. If a property is not specified, the default value shown in the table is applied to the metric.**

    There are six categories of metrics: cluster, node, bucket, query, xdcr and index. To disable any of these sections, set the include parameter under that section to "false", as follows:

    index:
       include: "false"
       stats:
           - memorySnapshotInterval:
                 alias: "memorySnapshotInterval"
           - stableSnapshotInterval:
                 alias: "stableSnapshotInterval"
           - maxRollbackPoints:
                 alias: "maxRollbackPoints"
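
To make the effect of the per-metric properties from step 4 concrete, here is a rough sketch of how `alias`, `convert`, `delta` and `multiplier` could be applied to one raw value. This is an illustration under assumptions, not the extension's actual code: the function name `apply_metric_properties`, the order in which the properties are applied, and the handling of unmapped values are all hypothetical.

```python
def apply_metric_properties(name, raw_value, props=None, previous_raw=None):
    """Return (reported_name, value) for one sample, or value None if unreportable."""
    props = props or {}
    reported_name = props.get("alias", name)   # default alias is the metric name

    # convert: map textual states such as "healthy" onto numbers.
    convert = props.get("convert") or {}
    value = convert.get(raw_value, raw_value) if isinstance(raw_value, str) else raw_value

    try:
        value = float(value)
    except (TypeError, ValueError):
        return reported_name, None             # unmapped, non-numeric value

    # delta: report the change since the previous sample instead of the value.
    if props.get("delta", False):
        if previous_raw is None:
            return reported_name, None         # first sample has no baseline
        value -= float(previous_raw)

    # multiplier: scale the result (default 1).
    return reported_name, value * props.get("multiplier", 1)
```

With the `bandwidth_usage` example above, a raw value of 1500 following a previous sample of 1200 would be reported as `bandwidthUsed` with a delta of 300, and a `status` of "healthy" would be reported as 1.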
    

Password encryption:

To avoid setting the clear text password in the config.yml, please follow the steps below to encrypt the password and set the encrypted password and the key in the config.yml:

  1. Download the util jar to encrypt the password from here.
  2. Encrypt the password from the command line using the following command:
    java -cp "appd-exts-commons-2.0.0.jar" com.appdynamics.extensions.crypto.Encryptor myKey myPassword
    
    where "myKey" is any random key and "myPassword" is the actual password that needs to be encrypted.
  3. Add the values for "encryptionKey", "passwordEncrypted" in the config.yml. The value for "encryptionKey" is the value substituted for "myKey" in the above command. The value for "passwordEncrypted" is the result of the above command.

Recommendations

It is recommended that a single CouchBase monitoring extension be used to monitor a single CouchBase cluster.

Metrics

cluster metrics
        ram:
            - total: #Total ram available to cluster (bytes)
                  alias: "total"
            - quotaTotal: #Ram quota total for the cluster (bytes)
                  alias: "quotaTotal"
            - quotaUsed: #Ram quota used by the cluster (bytes)
                  alias: "quotaUsed"
            - used: #Ram used by the cluster (bytes)
                  alias: "used"
            - usedByData: #Ram used by the data in the cluster (bytes)
                  alias: "usedByData"
            - quotaUsedPerNode: #Ram quota used per node in the cluster (bytes)
                  alias: "quotaUsedPerNode"
            - quotaTotalPerNode:  #Ram quota total per node in the cluster (bytes)
                  alias: "quotaTotalPerNode"
        hdd:
            - total: #Total harddrive space available to cluster (bytes)
                  alias: "total"
            - quotaTotal: #Harddrive quota total for the cluster (bytes)
                  alias: "quotaTotal"
            - used: #Harddrive space used by the cluster (bytes)
                  alias: "used"
            - usedByData: #Harddrive space used by the data in the cluster (bytes)
                  alias: "usedByData"
            - free: #Free harddrive space in the cluster (bytes)
                  alias: "free"
        counters:
            - rebalance_success:
                  alias: "rebalance_success"
            - rebalance_start:
                  alias: "rebalance_start"
        others:
            - rebalanceStatus: #Rebalancing status of the cluster
                  alias: "rebalanceStatus"
                  convert:
                    "none" : 0
nodes metrics
        systemStats:
            - cpu_utilization_rate: #The CPU utilization rate (%)
                  alias: "cpu_utilization_rate"
            - swap_total: #Total swap size allocated (bytes)
                  alias: "swap_total"
            - swap_used: #Amount of swap space used(bytes)
                  alias: "swap_used"
            - mem_total: #Total memory available to the node (bytes)
                  alias: "mem_total"
            - mem_free: #Amount of memory free for the node (bytes)
                  alias: "mem_free"
        interestingStats:
            - cmd_get: #Number of get commands
                  alias: "cmd_get"
            - couch_docs_actual_disk_size: #Amount of disk space used by Couch docs(bytes)
                  alias: "couch_docs_actual_disk_size"
            - couch_docs_data_size: #Data size of couch documents associated with a node (bytes)
                  alias: "couch_docs_data_size"
            - couch_spatial_data_size: #Size of object data for Couch spatial views (bytes)
                  alias: "couch_spatial_data_size"
            - couch_spatial_disk_size: #Amount of disk space occupied by Couch spatial views (bytes)
                  alias: "couch_spatial_disk_size"
            - couch_views_actual_disk_size: #Amount of disk space occupied by Couch views (bytes)
                  alias: "couch_views_actual_disk_size"
            - couch_views_data_size: #Size of object data for Couch views (bytes)
                  alias: "couch_views_data_size"
            - curr_items: #Number of current items
                  alias: "curr_items"
            - curr_items_tot: #Total number of items associated with node
                  alias: "curr_items_tot"
            - ep_bg_fetched: #Number of disk fetches performed since server was started
                  alias: "ep_bg_fetched"
            - get_hits: #Number of get hits
                  alias: "get_hits"
            - mem_used: #Memory used by the node (bytes)
                  alias: "mem_used"
            - ops: #Number of operations performed on Couchbase
                  alias: "ops"
            - vb_replica_curr_items: #Number of items/documents that are replicas
                  alias: "vb_replica_curr_items"
        otherStats:
            - clusterMembership: #Current node status with respect to membership in the cluster
                  alias: "clusterMembership"
                  convert:
                    "active" : 1
                    "inactiveAdded" : 2
                    "inactiveFailed" : 3
            - status: #Node status
                  alias: "status"
                  convert:
                    "healthy" : 1
                    "warmup" : 2
buckets metrics
        quota:
            - ram: #Amount of RAM used by the bucket (bytes)
                  alias: "ram"
            - rawRAM: #Amount of raw RAM used by the bucket (bytes)
                  alias: "rawRam"
        basicStats:
            - quotaPercentUsed: #Percentage of RAM used (for active objects) against the configured bucket size (%)
                  alias: "quotaPercentUsed"
            - opsPerSec: #Number of operations per second
                  alias: "opsPerSec"
            - diskFetches: #Number of disk fetches
                  alias: "diskFetches"
            - itemCount: #Number of items associated with the bucket
                  alias: "itemCount"
            - diskUsed: #Amount of disk used (bytes)
                  alias: "diskUsed"
            - dataUsed: #Size of user data within buckets of the specified state that are resident in RAM(%)
                  alias: "dataUsed"
            - memUsed: #Amount of memory used by the bucket (bytes)
                  alias: "memUsed"
        otherStats:
            - couch_total_disk_size:
            - couch_docs_fragmentation:
            - couch_views_fragmentation:
            - hit_ratio:
            - ep_cache_miss_rate: 
            - ep_resident_items_rate:
            - vb_avg_active_queue_age:
            - vb_avg_replica_queue_age:
            - vb_avg_pending_queue_age:
            - vb_avg_total_queue_age:
            - vb_active_resident_items_ratio:
            - vb_replica_resident_items_ratio:
            - vb_pending_resident_items_ratio:
            - avg_disk_update_time: #Average time required to update data on disk
            - avg_disk_commit_time: #Average time required to commit data on disk
            - avg_bg_wait_time: #The average background fetch time in microseconds
            - avg_active_timestamp_drift:
            - avg_replica_timestamp_drift:
            - bg_wait_count:
            - bg_wait_total:
            - bytes_read:
            - bytes_written:
            - cas_badval: #No. of CAS operations per second using an incorrect CAS ID for data that this bucket contains.
            - cas_hits: #No. of CAS operations per second for data that this bucket contains
            - cas_misses: #No. of CAS operations per second for data that this bucket does not contain
            - cmd_get:
            - cmd_set:
            - curr_connections:
            - curr_items:
            - curr_items_tot:
            - decr_hits: #No of decrement operations per sec for data that this bucket contains
            - decr_misses: #No of decrement operations per sec for data that this bucket does not contain
            - delete_hits: #No of delete operations per sec for data that this bucket contains
            - delete_misses: #No of delete operations per sec for data that this bucket does not contain
            - evictions:
            - get_hits: #No of get operations per sec for data that this bucket contains
            - get_misses: #No of get operations per sec for data that this bucket does not contain
            - incr_hits: #No of increment operations per sec for data that this bucket contains
            - incr_misses: #No of increment operations per sec for data that this bucket does not contain
            - misses:
            - xdc_ops: #No of XDCR related operations per second for this bucket
            - cpu_idle_ms:
            - cpu_local_ms:
            - cpu_utilization_rate:
            - hibernated_requests:
            - hibernated_waked:
            - mem_actual_free:
            - mem_actual_used:
            - mem_free:
            - mem_total:
            - mem_used_sys:
            - rest_requests:
            - swap_total:
            - swap_used:
query metrics
        systemVitals:
            - request.completed.count: #Number of requests completed
                  alias: "request.completed.count"
            - request.active.count: #Number of active requests
                  alias: "request.active.count"
            - request.per.sec.1min: #query throughput 1 minute
                  alias: "request.per.sec.1min"
            - request.per.sec.5min: #query throughput 5 minutes
                  alias: "request.per.sec.5min"
            - request.per.sec.15min: #query throughput 15 minutes
                  alias: "request.per.sec.15min"
            - request_time.mean: #Mean time to complete a request
                  alias: "request_time.mean(ms)"
            - request_time.median: #Median time to complete a request
                  alias: "request_time.median(ms)"
            - request_time.80percentile: #80th percentile query response time
                  alias: "request_time.80percentile(ms)"
            - request_time.95percentile: #95th percentile query response time
                  alias: "request_time.95percentile(ms)"
            - request_time.99percentile: #99th percentile query response time
                  alias: "request_time.99percentile(ms)"
            - request.prepared.percent: #percentage of prepared requests
                  alias: "request.prepared.percent"
xdcr metrics
        stats:
            - bandwidth_usage: #Bandwidth used during replication, measured in bytes per second.
                  alias: "bandwidth_usage"
            - changes_left: #Number of updates still pending replication.
                  alias: "changes_left"
            - data_replicated: #Size of data replicated in bytes.
                 alias: "data_replicated"
            - docs_checked: #Number of documents checked for changes.
                 alias: "docs_checked"
            - docs_failed_cr_source: #Number of documents that have failed conflict resolution on the source cluster and not replicated to target cluster.
                 alias: "docs_failed_cr_source"
            - docs_filtered: #Number of documents that have been filtered out and not replicated to target cluster.
                 alias: "docs_filtered"
            - docs_latency_wt: #Weighted average latency for sending replicated changes to destination cluster.
                 alias: "docs_latency_wt"
            - docs_opt_repd: #Number of docs sent optimistically.
                 alias: "docs_opt_repd"
            - docs_received_from_dcp: #Number of documents received from DCP.
                 alias: "docs_received_from_dcp"
            - docs_rep_queue: #Number of documents in replication queue.
                 alias: "docs_rep_queue"
            - docs_written: #Number of documents written to the destination cluster via xdcr.
                 alias: "docs_written"
            - meta_latency_wt: #Weighted average time for requesting document metadata. xdcr uses this for conflict resolution prior to sending the document into the replication queue.
                 alias: "meta_latency_wt"
            - num_checkpoints: #Number of checkpoints issued in replication queue.
                 alias: "num_checkpoints"
            - num_failedckpts: #Number of checkpoints failed during replication.
                 alias: "num_failedckpts"
            - rate_received_from_dcp: #Number of documents received from DCP per second.
                 alias: "rate_received_from_dcp"
            - rate_replication: #Rate of documents being replicated, measured in documents per second.
                 alias: "rate_replication"
            - size_rep_queue: #Size of replication queue in bytes.
                 alias: "size_rep_queue"
            - time_committing: #Seconds elapsed during replication.
                 alias: "time_committing"
index metrics
        stats:
            - memorySnapshotInterval: #Represents how often the indexer creates an in-memory snapshot for querying
                  alias: "memorySnapshotInterval"
            - stableSnapshotInterval: #Represents how often the indexer creates a persistent snapshot for recovery
                  alias: "stableSnapshotInterval"
            - maxRollbackPoints: #Maximum number of rollback points
                  alias: "maxRollbackPoints"
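
The cluster `ram` and `hdd` keys listed above have the same names as the fields under `storageTotals` in the JSON returned by Couchbase's `/pools/default` endpoint. As a rough sketch of how such a response could be flattened into metric paths, consider the following. The function name `cluster_metric_paths` and the path layout `<metricPrefix>|<cluster name>|cluster|<section>|<key>` are assumptions made for illustration; this document does not state the exact path the extension registers.

```python
def cluster_metric_paths(pools_default, metric_prefix, cluster_name, wanted):
    """Yield (metric_path, value) pairs for the requested cluster metrics.

    pools_default: parsed JSON from /pools/default
    wanted: mapping of section name to the keys to report,
            e.g. {"ram": ["total", "used"], "hdd": ["free"]}
    """
    storage_totals = pools_default.get("storageTotals", {})
    for section, keys in wanted.items():
        section_stats = storage_totals.get(section, {})
        for key in keys:
            if key in section_stats:           # skip keys the server did not return
                path = f"{metric_prefix}|{cluster_name}|cluster|{section}|{key}"
                yield path, section_stats[key]


# Made-up sample values, for illustration only.
sample = {"storageTotals": {"ram": {"total": 8000, "used": 6000},
                            "hdd": {"free": 50000}}}
prefix = "Server|Component:Extensions tier|Custom Metrics|CouchBase"
paths = dict(cluster_metric_paths(sample, prefix, "Cluster1",
                                  {"ram": ["total", "used"], "hdd": ["free", "quotaTotal"]}))
```

Here `paths` holds three entries, because `quotaTotal` is missing from the sample `hdd` section and is therefore skipped.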

Workbench

Workbench is a feature that lets you preview the metrics before registering them with the controller. This is useful if you want to fine-tune the configuration. Workbench is embedded into the extension jar. To use the workbench, first follow all the installation and configuration steps.

  1. Start the workbench with the following command if you are in

    java -jar /monitors/CouchBaseMonitor/couchbase-monitoring-extension.jar 
    
    This starts an HTTP server at http://localhost:9090/. This can be accessed from the browser.

  2. If the server is not accessible from outside/browser, you can use the following endpoints to see the list of registered metrics and errors.

    Get the stats
    curl http://localhost:9090/api/stats
    Get the registered metrics
    curl http://localhost:9090/api/metric-paths
    

  3. Once the configuration is complete, you can kill the workbench and start the Machine Agent.
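
The two workbench endpoints from step 2 can also be polled from a script, for example while iterating on config.yml. This is a small sketch assuming the workbench is listening on http://localhost:9090/ as described above. The helper names `workbench_endpoint` and `fetch` are hypothetical, and because the response format is not described in this document, the body is returned as plain text.

```python
import urllib.request

WORKBENCH_URL = "http://localhost:9090"


def workbench_endpoint(path, base_url=WORKBENCH_URL):
    """Build the URL of a workbench API endpoint such as 'stats' or 'metric-paths'."""
    return f"{base_url.rstrip('/')}/api/{path.lstrip('/')}"


def fetch(path, base_url=WORKBENCH_URL, timeout=5):
    """Return the raw response body of a workbench endpoint as text."""
    url = workbench_endpoint(path, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `print(fetch("metric-paths"))` shows the same output as the second curl command, provided the workbench is running.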

Version

Current Version:

2.0.0

CouchBase Version Tested On:

4.6

Last Update:

11/13/2017

 

2.0.0 - Revamped the extension to support the new extensions framework (2.0.0), added 3 new categories of metrics (query, xdcr and index), and added extra metrics in the cluster, node and bucket categories.

Troubleshooting

Please follow the steps specified in the Troubleshooting document to debug problems faced while using the extension.

Contributing

Always feel free to fork and contribute any changes directly via GitHub.

Support

For any questions or feature requests, please contact AppDynamics Support.