Conditionally Inserting and Modifying Documents
A common requirement when ingesting data is to ensure that certain documents exist in a collection. Oftentimes when running a command it is unclear whether the target documents are already present in the collection or need to be inserted first.
Unconditional INSERT
operations will not work here, because they may run
into errors if the target documents already exist. This will trigger a
“unique constraint violation” error. Unconditional UPDATE
or REPLACE
operations will also fail, because they require that the target documents are
already present. If this is not the case, the operations would run into
“document not found” errors.
So what needs to be run instead are conditional inserts/updates/replaces, also called upserts or repserts. The behavior of such operations is:
- Check if a document exists, based on some criteria
- If it does not exist, create the document
- If it exists, update or replace it with a new version
ArangoDB provides the following options in AQL to achieve this:
UPSERT
AQL operationINSERT
AQL operation withoverwriteMode
- Insert operation not using AQL, but the Document REST API
These alternatives have different capabilities and performance characteristics.
UPSERT
AQL Operation
Let us start with the UPSERT
AQL operation,
which is very generic and flexible.
The purpose of the UPSERT
AQL operation is to ensure that a specific document
exists after the operation has finished.
UPSERT
will look for a specific document, based on user-configurable
attributes/values, and create the document if it does not yet exist.
If UPSERT
finds such document, it can partially adjust it (UPDATE
) or fully
replace it (REPLACE
).
To recap, the syntaxes of AQL UPSERT
are, depending on whether you want to
update replace a document:
UPSERT <search-expression>
INSERT <insert-expression>
UPDATE <update-expression>
IN <collection> OPTIONS <options>
or
UPSERT <search-expression>
INSERT <insert-expression>
REPLACE <replace-expression>
IN <collection> OPTIONS <options>
The OPTIONS
part is optional.
An example UPSERT
operation looks like this:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
UPDATE { status: "updated" }
IN pages
This will look for a document in the pages
collection with the page
attribute having a value of index.html
. If such document cannot be found, the
INSERT
part will be executed, which will create a document with the page
and
status
attributes. If the operation finds an existing document with page
being index.html
, it will execute the UPDATE
part, which will set the
document’s status
attribute to updated
.
Tracking Modification Dates
The UPSERT
AQL operation is sometimes used in combination with
date/time-keeping. For example, the following query keeps track of when a
document was first created, and when it was last updated:
UPSERT { page: "index.html" }
INSERT { page: "index.html", created: DATE_NOW() }
UPDATE { updated: DATE_NOW() }
IN pages
OLD
variable
The UPSERT
AQL operation also provides a pseudo-variable named OLD
to refer
to the existing document and its values in the UPDATE
/REPLACE
part.
Following is an example that increments a counter on a document whenever the
UPSERT
operation is executed:
UPSERT { page: "index.html" }
INSERT { page: "index.html", hits: 1 }
UPDATE { hits: OLD.value + 1 }
IN pages
UPSERT
Caveats
UPSERT
is a very flexible operation, so some things should be kept in mind to
use it effectively and efficiently.
Repeat the Search Attributes
First of all, the INSERT
part of an UPSERT
operation should contain all
attributes that are used in the search expression. Consider the following
counter-example:
UPSERT { page: "index.html" }
INSERT { status: "inserted" } /* page attribute missing here! */
UPDATE { status: "updated" }
IN pages
Forgetting to specify the search attributes in the INSERT
part introduces a
problem: The first time the UPSERT
is executed and does not find a document
with page
being index.html
, it will branch into the INSERT
part as
expected. However, the INSERT
part will create a document with only the
status
attribute set. The page
attribute is missing here, so when the
INSERT
completes, there is still no document with page
being index.html
.
That means whenever this UPSERT
statement executes, it will branch into the
INSERT
part, and the UPDATE
part will never be reached. This is likely
unintentional.
The problem can easily be avoided by adding the search attributes to the
INSERT
part:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
UPDATE { status: "updated" }
IN pages
Note that it is not necessary to repeat the search attributes in the UPDATE
part, because UPDATE
is a partial update. It will only set the attributes that
are specified in the UPDATE
part, and leave all other existing attributes
alone. However, it is necessary to repeat the search attributes in the REPLACE
part, because REPLACE
will completely overwrite the existing document with
what is specified in the REPLACE
part.
That means when using the REPLACE
operation, the query should look like:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
REPLACE { page: "index.html", status: "updated" }
IN pages
Use Indexes for Search Attributes
A downside of UPSERT
’s flexibility is that it can be used on arbitrary
collection attributes, even if those are not indexed.
When the UPSERT
looks for an existing document, it will use an index if an
index exists, but will also continue if no index exists. In the latter case,
the UPSERT
will execute a full collection scan, which can be expensive for
large collections. So it is advised to create an index on the search
attribute(s) used in an UPSERT
.
UPSERT
is Non-Atomic
The overall UPSERT
operation does not execute atomically for a single document.
It is basically a document lookup followed by either a document insert, update
or replace operation.
That means if multiple UPSERT
operations run concurrently with the same search
values, they may all determine that the target document does not exist - and
then all decide to create such document. That will mean one will end up with
multiple instances of the target document afterwards.
To avoid such concurrency issues, a unique index can be created on the search
attribute(s). Such index will prevent concurrent UPSERT
operations from
creating identical documents. Instead, only one of the concurrent UPSERT
s will
succeed, whereas the others will fail with a “unique constraint violated” error.
In that case the client application can either retry the operation (which then
should go into the UPDATE
/REPLACE
branch), or ignore the error if the goal
was only to ensure the target document exists.
Using a unique index on the search attribute(s) will thus improve lookup performance and avoid duplicates.
Using Shard Key(s) for Lookups
In a cluster setup, the search expression should contain the shard key(s), as this allows the lookup to be sent to a single shard only. This will be more efficient than having to execute the lookup on all the shards of the collection.
Another benefit of using the shard key(s) in the search expression is that unique indexes are only supported if they contain the shard key(s).
INSERT
AQL Operation with overwriteMode
While the UPSERT
AQL operation is very powerful and flexible, it is often not
the ideal choice for high-volume ingestion.
A much more efficient alternative to the UPSERT
AQL operation is the
INSERT
AQL operation with the overwriteMode
attribute set. This operation is not a drop-in replacement for UPSERT
, but
rather a fast alternative in case the document key (_key
attribute) is known
when the operation is executed, and none of the old values need to be referenced.
The general syntax of the INSERT
AQL operation is:
INSERT <insert-expression>
IN <collection> OPTIONS <options>
As we will deal with the overwriteMode
option here, we are focussing on
INSERT
operations with this option set, for example:
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "ignore" }
Regardless of the selected overwriteMode
, the INSERT
operation will insert
the document if no document exists in the collection with the specified _key
.
In this aspect it behaves as a regular INSERT
operation.
However, if a document with the specified _key
already exists in the
collection, the INSERT
behavior will be as follows, depending on the selected
overwriteMode
:
conflict
(default): if a document with the specified_key
exists, return a “unique constraint violation”ignore
: if a document with the specified_key
exists, do nothing. Especially do not report a “unique constraint violation” error.update
: if a document with the specified_key
exists, (partially) update the document with the attributes specified.replace
: if a document with the specified_key
exists, fully replace the document with the attributes specified.
If no overwriteMode
is specified, the behavior of an INSERT
operation is as
if the overwriteMode
was set to conflict
.
The benefit of using INSERT
with overwriteMode
set to ignore
, update
or
replace
is that the INSERT
operation is going to be very fast, especially in
comparison with the UPSERT
operation. In addition, INSERT
will do a lookup
using the _key
attribute, which is always indexed. So it will always use the
primary index and never do full collection scans. It also does not require
setting up additional indexes, because the primary index is automatically
present for all collections.
There are also a few caveats when working with INSERT
AQL operations:
-
They can only be used when the value of the
_key
attribute is known at the time of insert. That means the client application must be able to provide the document keys in a deterministic way. -
The values that can be used for the
_key
attribute have some character and length restrictions, but alphanumeric keys work well. -
In a cluster setup, the underlying collection must be sharded by
_key
. This is the default shard key, however. -
There is no access to the data of an existing document for arbitrary calculations when going into the
update
orreplace
mode.
Please note that even though the INSERT
AQL operation cannot refer to existing
documents to calculate values for updating/replacing, it can still return the
previous version of the document in case the document is already present.
This can be achieved by appending a RETURN OLD
to the INSERT
operation,
e.g.
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "replace" }
RETURN OLD
It is also possible to return the new version of the document (the inserted
document if no previous document existed, or the updated/replaced version in
case a document already existed) by using RETURN NEW
:
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "replace" }
RETURN NEW
Insert Operation not Using AQL
There is the option to execute an insert operation with overwriteMode
outside
of AQL. The POST /_api/document/{collection}
endpoint is a dedicated REST API for insert operations, which can handle one
document, or multiple documents at once.
Conceptually this API behaves like the INSERT
AQL operation, but it can be
called with a batch of documents at once. This is the most efficient solution,
and should be preferred if possible.
Most ArangoDB drivers also provide a means to insert multiple documents at once, which will internally call this same REST API.
The REST API provides the returnOld
and returnNew
options to make it return
the previous versions of documents or the insert/updated/replaced documents, in
the same way as the INSERT
AQL operation can do.
Summary
The UPSERT
AQL operation is the most flexible way to conditionally insert or
update/replace documents in ArangoDB, but it is also the least efficient variant.
The INSERT
AQL operation with the overwriteMode
set will outperform
UPSERT
, but it can only be used for some use cases.
Using the dedicated REST API for document inserts will be even more efficient, and is thus the preferred option for bulk document inserts.