Manually Upgrading an Active Failover Deployment
This page will guide you through the process of a manual upgrade of an Active Failover setup. The different nodes can be upgraded one at a time without incurring a prolonged downtime of the entire system. The downtimes of the individual nodes should also stay fairly low.
The manual upgrade procedure described in this section can be used to upgrade to a new hotfix version, or to perform an upgrade to a new minor version of ArangoDB. Please refer to the Upgrade Paths section for detailed information.
Preparations
The ArangoDB installation packages (e.g. for Debian or Ubuntu) set up a convenient standalone instance of arangod. During installation, this instance's database will be upgraded (see --database.auto-upgrade) and the service will be (re)started.
You have to make sure that your Active Failover deployment is independent of this standalone instance. Specifically, make sure that the database directory as well as the socket used by the standalone instance provided by the package are separate from the ones in your Active Failover configuration. Also make sure that you haven't modified the init script or systemd unit file for the standalone instance in a way that would start or stop your Active Failover instance instead.
Install the new ArangoDB version binary
The first step is to install the new ArangoDB package.
Note: you do not have to stop the Active Failover (arangod) processes before installing the new package.
For example, if you want to upgrade to 3.7.13 on Debian or Ubuntu, either call
$ apt install arangodb3=3.7.13-1
(apt-get on older versions) if you have added the ArangoDB repository. Or install a specific package using
$ dpkg -i arangodb3-3.7.13-1_amd64.deb
after you have downloaded the corresponding file from download.arangodb.com.
Stop the Standalone Instance
As the package will automatically start the standalone instance, you should stop that instance now to avoid confusion later. Because you start the Active Failover processes manually, you do not need the automatically installed and started standalone instance. Stop it via:
$ service arangodb3 stop
Also, you might want to remove the standalone instance from the default runlevels to prevent it from starting on the next reboot of your machine. How this is done depends on your distribution and init system. For example, on older Debian and Ubuntu systems using a SystemV-compatible init, you can use:
$ update-rc.d -f arangodb3 remove
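On systems that use systemd instead of a SystemV-compatible init, the equivalent steps are typically done with systemctl. The following is a minimal sketch in POSIX shell that only prints the commands for the detected init system, so that you can review them before running them. The function name standalone_disable_commands is hypothetical, and the unit name arangodb3 is an assumption based on the service name used above.

```shell
# Hypothetical helper: print (do not run) the commands that stop the
# standalone service and keep it from starting at boot.
# Assumption: the service/unit is called "arangodb3", as in the text above.
standalone_disable_commands() {
  sd_service="${1:-arangodb3}"
  # /run/systemd/system exists only when systemd is the running init system.
  if [ -d /run/systemd/system ]; then
    echo "systemctl stop $sd_service"
    echo "systemctl disable $sd_service"
  else
    echo "service $sd_service stop"
    echo "update-rc.d -f $sd_service remove"
  fi
}

# Example (prints two commands to review and run as root):
#   standalone_disable_commands arangodb3
```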
Set supervision into maintenance mode
You have two main choices when performing an upgrade of the Active Failover setup:
- Upgrade while incurring a leader-to-follower switch (with reduced downtime)
- Upgrade with no leader-to-follower switch
Turning the maintenance mode on will enable the latter case. You might have a short downtime during the leader upgrade, but there will be no potential loss of acknowledged operations.
Enabling the maintenance mode essentially disables the Agency supervision for a limited amount of time during the upgrade procedure. The following API calls activate and deactivate the maintenance mode of the supervision. You can use curl to send the API calls.
The following examples assume there is an Active Failover node running on localhost on port 7002.
Activate Maintenance mode
curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"on"'
For example:
curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"on"'
{"error":false,"warning":"Cluster supervision deactivated. It will be reactivated automatically in 60 minutes unless this call is repeated until then."}
Note: In case the manual upgrade takes longer than 60 minutes, the API call has to be resent.
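If the upgrade may take longer than 60 minutes, the activation call can be repeated periodically in the background. The following is a minimal sketch in POSIX shell and not part of ArangoDB. The function name keep_maintenance_on, the 30-minute default interval and the default repeat count are illustrative assumptions. It only relies on the curl call shown above.

```shell
# Hypothetical helper: re-send the maintenance "on" call at a fixed interval
# so that the supervision stays disabled during a long upgrade.
# Arguments: base URL, credentials, interval in seconds, number of calls.
keep_maintenance_on() {
  km_url="$1"
  km_cred="$2"
  km_interval="${3:-1800}"   # assumption: every 30 minutes
  km_repeats="${4:-4}"       # assumption: four calls in total
  km_i=0
  while [ "$km_i" -lt "$km_repeats" ]; do
    # Same request as above; --fail makes curl return non-zero on HTTP errors.
    curl --silent --fail -u "$km_cred" \
      "$km_url/_admin/cluster/maintenance" -XPUT -d'"on"' || return 1
    echo
    km_i=$((km_i + 1))
    # Sleep between calls, but not after the last one.
    if [ "$km_i" -lt "$km_repeats" ]; then
      sleep "$km_interval"
    fi
  done
}

# Example (runs in the background while you upgrade):
#   keep_maintenance_on http://localhost:7002 "root:" 1800 4 &
```

Remember to stop the background job before deactivating the maintenance mode, otherwise a later call would switch it on again.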
Deactivate Maintenance mode
The cluster supervision resumes automatically 60 minutes after disabling it. It can be manually reactivated earlier at any point using the following API call:
curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"off"'
For example:
curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"off"'
{"error":false,"warning":"Cluster supervision reactivated."}
Upgrade the Active Failover processes
Now all the Active Failover (Agents, Single-Server) processes (arangod) have to be upgraded on each node.
Note: Please read the section regarding the maintenance mode above.
To stop an arangod process, use a command like kill -15:
kill -15 <pid-of-arangod-process>
The process ids (pids) of the arangod processes in your Active Failover setup can be checked using a command like ps:
ps -C arangod -fww
The output of the command above shows not only the process ids of all arangod processes but also the commands used to start them, which is useful when restarting the arangod processes later.
The output below is from a test machine where three Agents and two Single-Servers were running locally. In a more production-like scenario, you will find only one instance of each type running per machine:
ps -C arangod -fww
UID PID PPID C STIME TTY TIME CMD
max 29075 8072 0 13:50 pts/2 00:00:42 arangod --server.endpoint tcp://0.0.0.0:5001 --agency.my-address=tcp://127.0.0.1:5001 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a1 --javascript.app-path /tmp --database.directory agent1
max 29208 8072 2 13:51 pts/2 00:02:08 arangod --server.endpoint tcp://0.0.0.0:5002 --agency.my-address=tcp://127.0.0.1:5002 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a2 --javascript.app-path /tmp --database.directory agent2
max 29329 16224 0 13:51 pts/3 00:00:42 arangod --server.endpoint tcp://0.0.0.0:5003 --agency.my-address=tcp://127.0.0.1:5003 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a3 --javascript.app-path /tmp --database.directory agent3
max 29824 16224 1 13:55 pts/3 00:01:53 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7001 --cluster.my-address tcp://127.0.0.1:7001 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c1 --javascript.app-path /tmp --database.directory single1
max 29938 16224 2 13:56 pts/3 00:02:13 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7002 --cluster.my-address tcp://127.0.0.1:7002 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c2 --javascript.app-path /tmp --database.directory single2
Note: The start commands of Agent and Single Server are required for restarting the processes later.
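Because the start commands are needed again later, it can help to save them before stopping anything. The following is a minimal sketch in POSIX shell that condenses ps -C arangod -fww output into one line per process with pid, role and start command. The function name summarize_arangod_processes and the output file name are hypothetical, and the role is only guessed from the options visible in the sample output above (--agency.activate true for Agents, --cluster.my-role SINGLE for Single-Servers).

```shell
# Hypothetical helper: read `ps -C arangod -fww` output on stdin and print
# "<pid> TAB <role> TAB <start command>" for every process.
# In `ps -f` output the command starts at the 8th column.
summarize_arangod_processes() {
  awk 'NR > 1 {
    cmd = ""
    for (i = 8; i <= NF; i++) cmd = cmd (i > 8 ? " " : "") $i
    role = "unknown"
    if (cmd ~ /--agency\.activate[ =]true/) role = "agent"
    else if (cmd ~ /--cluster\.my-role[ =]SINGLE/) role = "single"
    print $2 "\t" role "\t" cmd
  }'
}

# Example (file name is an arbitrary choice):
#   ps -C arangod -fww | summarize_arangod_processes > arangod-commands.txt
```

Note that relative paths in the saved commands (such as --database.directory agent1 in the sample output) only work when the command is run again from the same working directory.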
The recommended procedure for upgrading an Active Failover setup is to stop, upgrade and restart the arangod instances one by one on all participating servers, starting first with all Agent instances, and then following with the Active Failover instances themselves. When upgrading the Active Failover instances, the followers should be upgraded first.
To find out which nodes are followers, you can consult the cluster endpoints API:
curl http://<single-server>:7002/_api/cluster/endpoints
This will yield a list of endpoints, the first of which is always the leader node.
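Since the first endpoint in the list is the leader, the response can be turned into a labeled list. The following is a minimal sketch in POSIX shell. The function name label_endpoints is hypothetical, and the sketch assumes that the response contains an "endpoints" array of objects that each have an "endpoint" attribute. Check the actual response of your version before relying on it.

```shell
# Hypothetical helper: read the JSON response of /_api/cluster/endpoints on
# stdin and label the first endpoint as leader and all others as followers.
# Assumed response shape: {"endpoints":[{"endpoint":"tcp://..."}, ...]}
label_endpoints() {
  # Put each JSON member on its own line, then extract the endpoint values.
  tr ',' '\n' |
    sed -n 's/.*"endpoint"[ ]*:[ ]*"\([^"]*\)".*/\1/p' |
    awk 'NR == 1 { print "leader " $0; next } { print "follower " $0 }'
}

# Example:
#   curl -s http://<single-server>:7002/_api/cluster/endpoints | label_endpoints
```

This avoids a dependency on a JSON tool by using simple text processing, which is sufficient for the flat structure assumed here but is not a general JSON parser.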
Stopping, upgrading and restarting an instance
To stop an instance, the currently running process has to be identified using the ps command above.
Let's assume we are about to upgrade an Agent instance, so we have to look in the ps output for an Agent instance first, and note its process id (pid) and start command.
The process can then be stopped using the following command:
kill -15 <pid-of-agent>
The instance then has to be upgraded using the same command that was used before (in the ps output), but with the additional option:
--database.auto-upgrade=true
After the upgrade procedure has finished successfully, the instance will remain stopped. So it has to be restarted using the command from the ps output before (this time without the --database.auto-upgrade option).
Once an Agent has been upgraded and restarted successfully, repeat the procedure for the other Agent instances in the setup. Then repeat the procedure for the Active Failover instances, starting with the followers.
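The stop, upgrade and restart steps for a single instance can be combined into one helper. The following is a minimal sketch in POSIX shell under stated assumptions, not an official tool: the function names wait_for_exit and upgrade_instance and the 60-second default timeout are hypothetical. It relies on the behavior described above, namely that the run with --database.auto-upgrade=true exits on its own when the upgrade is finished.

```shell
# Hypothetical helper: wait until the process with the given pid has exited.
# Returns 1 if it is still running after the timeout (in seconds).
wait_for_exit() {
  we_pid="$1"
  we_timeout="${2:-60}"
  while kill -0 "$we_pid" 2>/dev/null; do
    [ "$we_timeout" -gt 0 ] || return 1
    sleep 1
    we_timeout=$((we_timeout - 1))
  done
}

# Hypothetical helper: stop one instance, run the upgrade, restart it.
# Arguments: pid of the running instance, then its original start command.
upgrade_instance() {
  ui_pid="$1"
  shift
  # 1. Stop the running instance gracefully and wait for it to exit.
  kill -15 "$ui_pid" || return 1
  wait_for_exit "$ui_pid" || return 1
  # 2. Upgrade run: same command plus the auto-upgrade option.
  "$@" --database.auto-upgrade=true || return 1
  # 3. Restart with the original command, in the background.
  "$@" &
}

# Example (pid and command taken from the ps output, shortened here):
#   upgrade_instance 29075 arangod --server.endpoint tcp://0.0.0.0:5001 ...
```

Run the helper from the directory in which the instance was originally started if the command contains relative paths, and handle one instance at a time so that the remaining instances stay available.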
Final words
If you enabled the maintenance mode, the Agency supervision then needs to be reactivated by issuing the following API call to the leader:
curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"off"'