Features and Improvements in ArangoDB 3.10
The following list shows in detail which features have been added or improved in ArangoDB 3.10. ArangoDB 3.10 also contains several bug fixes that are not listed here.
Native ARM Support
ArangoDB is now available for the ARM architecture, in addition to the x86-64 architecture.
It can natively run on Apple silicon (e.g. M1 chips). It was already possible to run 3.8.x and older versions on these systems via Rosetta 2 emulation, but not 3.9.x because of its use of AVX instructions, which Rosetta 2 does not emulate. 3.10.x runs on this hardware again, but now without emulation.
ArangoDB 3.10.x also runs on 64-bit ARM (AArch64, Little Endian) chips under Linux. The minimum requirement is an ARMv8 chip with Neon (SIMD extension).
SmartGraphs (Enterprise Edition)
SmartGraphs and SatelliteGraphs on a single server
It is now possible to test SmartGraphs and SatelliteGraphs on a single server and then to port them to a cluster with multiple servers.
You can create SmartGraphs, Disjoint SmartGraphs, SmartGraphs using
SatelliteCollections, Disjoint SmartGraphs using SatelliteCollections, as well
as SatelliteGraphs in the usual way, using
arangosh
for instance, but on a single server, then dump them, start a cluster
(with multiple servers) and restore the graphs in the cluster. The graphs and
the collections will keep all properties that are kept when the graph is already
created in a cluster.
Also see SmartGraphs and SatelliteGraphs on a Single Server.
This feature is only available in the Enterprise Edition.
EnterpriseGraphs (Enterprise Edition)
The 3.10 version of ArangoDB introduces a specialized version of SmartGraphs, called EnterpriseGraphs.
EnterpriseGraphs come with a concept of automated sharding key selection, meaning that the sharding key is randomly selected while ensuring that all their adjacent edges are co-located on the same servers, whenever possible. This approach provides significant advantages as it minimizes the impact of having suboptimal sharding keys defined when creating the graph.
See the EnterpriseGraphs chapter for more details.
This feature is only available in the Enterprise Edition.
(Disjoint) Hybrid SmartGraphs renaming
(Disjoint) Hybrid SmartGraphs were renamed to (Disjoint) SmartGraphs using SatelliteCollections. The functionality and behavior of both types of graphs stay the same.
Computed Values
This feature lets you define expressions on the collection level that run on inserts, modifications, or both. You can access the data of the current document to compute new values using a subset of AQL.
Possible use cases are to add timestamps of the creation or last modification to every document, to add default attributes, or to automatically process attributes for indexing purposes, like filtering, combining multiple attributes into one, and to convert characters to lowercase.
The following example uses the JavaScript API to create a collection with two computed values, one to add a timestamp of the document creation, and another to maintain an attribute that combines two other attributes:
var coll = db._create("users", {
computedValues: [
{
name: "createdAt",
expression: "RETURN DATE_ISO8601(DATE_NOW())",
overwrite: true,
computeOn: ["insert"]
},
{
name: "fullName",
expression: "RETURN LOWER(CONCAT_SEPARATOR(' ', @doc.firstName, @doc.lastName))",
overwrite: false,
computeOn: ["insert", "update", "replace"], // default
keepNull: false,
failOnWarning: true
}
]
});
var doc = db.users.save({ firstName: "Paula", lastName: "Plant" });
/* Stored document:
{
"createdAt": "2022-08-08T17:14:37.362Z",
"fullName": "paula plant",
"firstName": "Paula",
"lastName": "Plant"
}
*/
See Computed Values for more information and examples.
ArangoSearch
Inverted collection indexes
A new inverted
index type is available that you can define at the collection
level. You can use these indexes similar to arangosearch
Views, to accelerate
queries like value lookups, range queries, accent- and case-insensitive search,
wildcard and fuzzy search, nested search, as well as for
sophisticated full-text search with the ability to search for words, phrases,
and more.
db.imdb_vertices.ensureIndex({
type: "inverted",
name: "inv-idx",
fields: [
"title",
{ name: "description", analyzer: "text_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-idx", forceIndexHint: true }
FILTER TOKENS("neo morpheus", "text_en") ALL IN doc.description
RETURN doc.title`);
You need to use an index hint to utilize an inverted index.
Note that the FOR
loop uses a collection, not a View, and documents are
matched using a FILTER
operation. Filter conditions work similar to the
SEARCH
operation but you don’t need to specify Analyzers with the ANALYZER()
function in queries.
Like Views, this type of index is eventually consistent.
See Inverted index for details.
search-alias
Views
A new search-alias
View type was added to let you add one or more
inverted indexes to a View, enabling federated searching over multiple collections,
ranking of search results by relevance, and search highlighting, similar to
arangosearch
Views. This is on top of the filtering capabilities provided by the
inverted indexes, including full-text processing with Analyzers and more.
db._createView("view", "search-alias", {
indexes: [
{ collection: "coll", index: "inv-idx" }
]
});
db._query(`FOR doc IN imdb
SEARCH STARTS_WITH(doc.title, "The Matrix") AND
PHRASE(doc.description, "invasion", 2, "machine")
RETURN doc`);
A key difference between arangosearch
and search-alias
Views is that you don’t
need to specify an Analyzer context with the ANALYZER()
function in queries
because it is inferred from the inverted index definition, which only supports
a single Analyzer per field.
Also see Getting started with ArangoSearch.
Search highlighting (Enterprise Edition)
Views now support search highlighting, letting you retrieve the positions of matches within strings, to highlight what was found in search results.
You need to index attributes with custom Analyzers that have the new offset
feature enabled to use this feature. You can then call the
OFFSET_INFO()
function to
get the start offsets and lengths of the matches (in bytes).
You can use the SUBSTRING_BYTES()
function
together with the VALUE()
function to
extract a match.
db._create("food");
db.food.save({ name: "avocado", description: { en: "The avocado is a medium-sized, evergreen tree, native to the Americas." } });
db.food.save({ name: "carrot", description: { en: "The carrot is a root vegetable, typically orange in color, native to Europe and Southwestern Asia." } });
db.food.save({ name: "chili pepper", description: { en: "Chili peppers are varieties of the berry-fruit of plants from the genus Capsicum, cultivated for their pungency." } });
db.food.save({ name: "tomato", description: { en: "The tomato is the edible berry of the tomato plant." } });
var analyzers = require("@arangodb/analyzers");
var analyzer = analyzers.save("text_en_offset", "text", { locale: "en", stopwords: [] }, ["frequency", "norm", "position", "offset"]);
db._createView("food_view", "arangosearch", { links: { food: { fields: { description: { fields: { en: { analyzers: ["text_en_offset"] } } } } } } });
db._query(`FOR doc IN food_view
SEARCH ANALYZER(
TOKENS("avocado tomato", "text_en_offset") ANY == doc.description.en OR
PHRASE(doc.description.en, "cultivated", 2, "pungency") OR
STARTS_WITH(doc.description.en, "cap")
, "text_en_offset")
FOR offsetInfo IN OFFSET_INFO(doc, ["description.en"])
RETURN {
description: doc.description,
name: offsetInfo.name,
matches: offsetInfo.offsets[* RETURN {
offset: CURRENT,
match: SUBSTRING_BYTES(VALUE(doc, offsetInfo.name), CURRENT[0], CURRENT[1])
}]
}`);
[
{
"description" : { "en" : "The avocado is a medium-sized, evergreen tree, native to the Americas." },
"name" : [ "description", "en" ],
"matches" : [
{ "offset" : [4, 11], "match" : "avocado" }
]
},
{
"description" : { "en" : "Chili peppers are varieties of the berry-fruit of plants from the genus Capsicum, cultivated for their pungency." },
"name" : [ "description", "en" ],
"matches" : [
{ "offset" : [82, 111], "match" : "cultivated for their pungency" },
{ "offset" : [72, 80], "match" : "Capsicum" }
]
},
{
"description" : { "en" : "The tomato is the edible berry of the tomato plant." },
"name" : [ "description", "en" ],
"matches" : [
{ "offset" : [4, 10], "match" : "tomato" },
{ "offset" : [38, 44], "match" : "tomato" }
]
}
]
*/
See Search highlighting with ArangoSearch for details.
Nested search (Enterprise Edition)
Nested search allows you to index arrays of objects in a way that lets you search the sub-objects with all the conditions met by a single sub-object instead of each condition being met by any of the sub-objects.
Consider a document like the following:
{
"dimensions": [
{ "type": "height", "value": 35 },
{ "type": "width", "value": 60 }
]
}
The following search query matches the document because the individual conditions are satisfied by one of the nested objects:
FOR doc IN viewName
SEARCH doc.dimensions.type == "height" AND doc.dimensions.value > 40
RETURN doc
If you only want to find documents where both conditions are true for a single nested object, you could post-filter the search results:
FOR doc IN viewName
SEARCH doc.dimensions.type == "height" AND doc.dimensions.value > 40
FILTER LENGTH(doc.dimensions[* FILTER CURRENT.type == "height" AND CURRENT.value > 40]) > 0
RETURN doc
The new nested search feature allows you to simplify the query and get better performance:
FOR doc IN viewName
SEARCH doc.dimensions[? FILTER CURRENT.type == "height" AND CURRENT.value > 40]
RETURN doc
See Nested search with ArangoSearch using Views and the nested search example using Inverted indexes for details.
This feature is only available in the Enterprise Edition.
Optimization rule arangosearch-constrained-sort
This new optimization rule brings significant performance improvements by
allowing you to perform sorting and limiting inside arangosearch
Views
enumeration node, if using just scoring for a sort operation.
ArangoSearch column cache (Enterprise Edition)
arangosearch
Views support new caching options.
Introduced in: v3.9.5, v3.10.2
-
You can enable the new
cache
option for individual View links or fields to always cache field normalization values in memory. This can improve the performance of scoring and ranking queries. -
You can enable the new
cache
option in the definition of astoredValues
View property to always cache stored values in memory. This can improve the query performance if stored values are involved.
Introduced in: v3.9.6, v3.10.2
-
You can enable the new
primarySortCache
View property to always cache the primary sort columns in memory. This can improve the performance of queries that utilize the primary sort order. -
You can enable the new
primaryKeyCache
View property to always cache the primary key column in memory. This can improve the performance of queries that return many documents.
Inverted indexes also support similar new caching options.
Introduced in: v3.10.2
-
A new
cache
option for inverted indexes as the default or for specificfields
to always cache field normalization values in memory. -
A new
cache
option per object in the definition of thestoredValues
elements to always cache stored values in memory. -
A new
cache
option in theprimarySort
property to always cache the primary sort columns in memory. -
A new
primaryKeyCache
property for inverted indexes to always cache the primary key column in memory.
The cache size can be controlled with the new --arangosearch.columns-cache-limit
startup option and monitored via the new arangodb_search_columns_cache_size
metric.
ArangoSearch caching is only available in the Enterprise Edition.
See Optimizing View and inverted index query performance for examples.
If you use ArangoSearch caching in supported 3.9 versions and upgrade an Active Failover deployment to 3.10, you may need to re-configure the cache-related options and thus recreate inverted indexes and Views. See Known Issues in 3.10.
New startup options
-
With
--arangosearch.skip-recovery
, you can skip data recovery for the specified View links and inverted indexes on startup. Values for this startup option should have the format<collection-name>/<link-id>
,<collection-name>/<index-id>
, or<collection-name>/<index-name>
. On DB-Servers, the<collection-name>
part should contain a shard name. -
With the
--arangosearch.fail-queries-on-out-of-sync
startup option you can let write operations fail ifarangosearch
View links or inverted indexes are not up-to-date with the collection data. The option is set tofalse
by default. Queries on out-of-sync links/indexes are answered normally, but the return data may be incomplete. If set totrue
, any data retrieval queries on out-of-sync links/indexes are going to fail with error “collection/view is out of sync” (error code 1481).
Introduced in: v3.10.4
- The new
--javascript.user-defined-functions
startup option lets you disable user-defined AQL functions so that no user-defined JavaScript code of UDFs runs on the server. Also see Server security options.
ArangoSearch metrics and figures
The Metrics HTTP API has been
extended with metrics for ArangoSearch for monitoring arangosearch
View links
and inverted indexes.
The following metrics have been added in ArangoDB 3.10:
Label | Description |
---|---|
arangodb_search_cleanup_time |
Average time of few last cleanups. |
arangodb_search_commit_time |
Average time of few last commits. |
arangodb_search_consolidation_time |
Average time of few last consolidations. |
arangodb_search_index_size |
Size of the index in bytes for current snapshot. |
arangodb_search_num_docs |
Number of documents for current snapshot. |
arangodb_search_num_failed_cleanups |
Number of failed cleanups. |
arangodb_search_num_failed_commits |
Number of failed commits. |
arangodb_search_num_failed_consolidations |
Number of failed consolidations. |
arangodb_search_num_files |
Number of files for current snapshot. |
arangodb_search_num_live_docs |
Number of live documents for current snapshot. |
arangodb_search_num_out_of_sync_links |
Number of out-of-sync arangosearch links/inverted indexes. |
arangodb_search_num_segments |
Number of segments for current snapshot. |
arangodb_search_columns_cache_size |
Size of all ArangoSearch columns currently loaded into the cache. |
These metrics are exposed by single servers and DB-Servers.
Additionally, the JavaScript and HTTP API for indexes has been extended with
figures for arangosearch
View links and inverted indexes.
In arangosh, you can call db.<collection>.indexes(true, true);
to get at this
information. Also see Listing all indexes of a collection.
The information has the following structure:
{
"figures" : {
"numDocs" : 4, // total number of documents in the index with removals
"numLiveDocs" : 4, // total number of documents in the index without removals
"numSegments" : 1, // total number of index segments
"numFiles" : 8, // total number of index files
"indexSize" : 1358 // size of the index in bytes
}, ...
}
Note that the number of (live) docs may differ from the actual number of documents if the nested search feature is used.
Analyzers
minhash
Analyzer (Enterprise Edition)
This new Analyzer applies another Analyzer, for example, a text
Analyzer to
tokenize text into words, and then computes so called MinHash signatures from
the tokens using a locality-sensitive hash function. The result lets you
approximate the Jaccard similarity of sets.
A common use case is to compare sets with many elements for entity resolution, such as for finding duplicate records, based on how many common elements they have.
You can use the Analyzer with a new inverted index or arangosearch
View to
quickly find candidates for the actual Jaccard similarity comparisons you want
to perform.
This feature is only available in the Enterprise Edition.
See Analyzers for details.
classification
Analyzer (Enterprise Edition)
A new, experimental Analyzer for classifying individual tokens or entire inputs using a supervised fastText word embedding model that you provide.
This feature is only available in the Enterprise Edition.
See Analyzers for details.
nearest_neighbors
Analyzer (Enterprise Edition)
A new, experimental Analyzer for finding similar tokens using a supervised fastText word embedding model that you provide.
This feature is only available in the Enterprise Edition.
See Analyzers for details.
geo_s2
Analyzer (Enterprise Edition)
Introduced in: v3.10.5
This new Analyzer lets you index GeoJSON data with inverted indexes or Views
similar to the existing geojson
Analyzer, but it internally uses a format for
storing the geo-spatial data that is more efficient.
You can choose between different formats to make a tradeoff between the size on disk, the precision, and query performance:
- 8 bytes per coordinate pair using 4-byte integer values, with limited precision.
- 16 bytes per coordinate pair using 8-byte floating-point values, which is still
more compact than the VelocyPack format used by the
geojson
Analyzer - 24 bytes per coordinate pair using the native Google S2 format to reduce the number of computations necessary when you execute geo-spatial queries.
This feature is only available in the Enterprise Edition.
See Analyzers for details.
Web Interface
The 3.10 release of ArangoDB introduces a new Web UI for Views that allows
you to view, configure, or drop arangosearch
Views.
AQL
All Shortest Paths Graph Traversal
In addition to finding any shortest path and enumerating all paths between two
vertices in order of increasing length, you can now use the new
ALL_SHORTEST_PATHS
graph traversal algorithm in AQL to get all paths of
shortest length:
FOR p IN OUTBOUND ALL_SHORTEST_PATHS 'places/Carlisle' TO 'places/London'
GRAPH 'kShortestPathsGraph'
RETURN { places: p.vertices[*].label }
See All Shortest Paths in AQL for details.
Parallelism for Sharded Graphs (Enterprise Edition)
The 3.10 release supports traversal parallelism for Sharded Graphs, which means that traversals with many start vertices can now run in parallel. An almost linear performance improvement has been achieved, so the parallel processing of threads leads to faster results.
This feature supports all types of graphs - General Graphs, SmartGraphs, EnterpriseGraphs (including Disjoint).
Traversals with many start vertices can now run in parallel. A traversal always starts with one single start vertex and then explores the vertex neighborhood. When you want to explore the neighborhoods of multiple vertices, you now have the option to do multiple operations in parallel.
The example below shows how to use parallelism to allow using up to three
threads to search for v/1
, v/2
, and v/3
in parallel and independent of one
another. Note that parallelism increases the memory usage of the query due to
having multiple operations performed simultaneously, instead of one after the
other.
FOR startVertex IN ["v/1", "v/2", "v/3", "v/4"]
FOR v,e,p IN 1..3 OUTBOUND startVertex GRAPH "yourShardedGraph" OPTIONS {parallelism: 3}
[...]
This feature is only available in the Enterprise Edition.
GeoJSON changes
The 3.10 release of ArangoDB conforms to the standards specified in GeoJSON and GeoJSON Mode. This diverges from the previous implementation in two fundamental ways:
-
The syntax of GeoJSON objects is interpreted so that lines on the sphere are geodesics (pieces of great circles). This is in particular true for boundaries of polygons. No special treatment of longitude-latitude-rectangles is done any more.
-
Linear rings in polygons are no longer automatically normalized so that the “smaller” of the two connected components are the interior. This allows specifying polygons that cover more than half of the surface of the Earth and conforms to the GeoJSON standard.
Additionally, the reported issues, which occasionally produced wrong results in geo queries when using geo indexes, have been fixed.
For existing users who do not wish to rebuild their geo indexes and
continue using the previous behavior, the legacyPolygons
index option
has been introduced to guarantee backwards compatibility.
For existing users who wish to take advantage of the new standard behavior, geo indexes need to be dropped and recreated after an upgrade.
See Legacy Polygons for details and for hints about upgrading to version 3.10 or later.
Traversal Projections (Enterprise Edition)
Starting with version 3.10, the AQL optimizer automatically detects which document attributes you access in traversal queries and optimizes the data loading. This optimization is beneficial if you have large document sizes but only access small parts of the documents.
By default, up to 5 attributes are extracted instead of loading the full document.
You can control this number with the maxProjections
option, which is now
supported for graph traversals.
See also how to use maxProjections
with FOR loops.
In the following query, the accessed attributes are the name
attribute of the
neighbor vertices and the vertex
attribute of the edges (via the path variable):
FOR v, e, p IN 1..3 OUTBOUND "persons/alice" GRAPH "knows_graph" OPTIONS { maxProjections: 5 }
RETURN { name: v.name, vertices: p.edges[*].vertex }
The use of projections is indicated in the query explain output:
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
2 TraversalNode 1 - FOR v /* vertex (projections: `name`) */, p /* paths: edges (projections: `_from`, `_to`, `vertex`) */ IN 1..3 /* min..maxPathDepth */ OUTBOUND 'persons/alice' /* startnode */ GRAPH 'knows_graph'
3 CalculationNode 1 - LET #7 = { "name" : v.`name`, "vertex" : p.`edges`[*].`vertex` } /* simple expression */
4 ReturnNode 1 - RETURN #7
This feature is only available in the Enterprise Edition.
Number of filtered documents in profiling output
The AQL query profiling output now shows the number of filtered inputs for each execution node separately, so that it is more visible how often filter conditions are invoked and how effective they are. Previously the number of filtered inputs was only available as a total value in the profiling output, and it wasn’t clear which execution node caused which amount of filtering.
For example, consider the following query:
FOR doc1 IN collection
FILTER doc1.value1 < 1000 /* uses index */
FILTER doc1.value2 NOT IN [1, 4, 7] /* post filter */
FOR doc2 IN collection
FILTER doc1.value1 == doc2.value2 /* uses index */
FILTER doc2.value2 != 5 /* post filter */
RETURN doc2
The profiling output for this query now shows how often the filters were invoked for the different execution nodes:
Execution plan:
Id NodeType Calls Items Filtered Runtime [s] Comment
1 SingletonNode 1 1 0 0.00008 * ROOT
14 IndexNode 1 700 300 0.00694 - FOR doc1 IN collection /* persistent index scan, projections: `value1`, `value2` */ FILTER (doc1.`value2` not in [ 1, 4, 7 ]) /* early pruning */
13 IndexNode 61 60000 10000 0.11745 - FOR doc2 IN collection /* persistent index scan */ FILTER (doc2.`value2` != 5) /* early pruning */
12 ReturnNode 61 60000 0 0.00212 - RETURN doc2
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
14 idx_1727463382256189440 persistent collection false false 99.99 % [ `value1` ] (doc1.`value1` < 1000)
13 idx_1727463477736374272 persistent collection false false 0.01 % [ `value2` ] (doc1.`value1` == doc2.`value2`)
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 0 71000 0 / 0 10300 98304 0.13026
Number of cache hits / cache misses in profiling output
When profiling an AQL query via db._profileQuery(...)
command or via the web interface, the
query profile output will now contain the number of index entries read from
in-memory caches (usable for edge and persistent indexes) plus the number of cache misses.
In the following example query, there are in-memory caches present for both indexes used in the query. However, only the innermost index node #13 can use the cache, because the outer FOR loop does not use an equality lookup.
Query String (270 chars, cacheable: false):
FOR doc1 IN collection FILTER doc1.value1 < 1000 FILTER doc1.value2 NOT IN [1, 4, 7]
FOR doc2 IN collection FILTER doc1.value1 == doc2.value2 FILTER doc2.value2 != 5 RETURN doc2
Execution plan:
Id NodeType Calls Items Filtered Runtime [s] Comment
1 SingletonNode 1 1 0 0.00008 * ROOT
14 IndexNode 1 700 300 0.00630 - FOR doc1 IN collection /* persistent index scan, projections: `value1`, `value2` */ FILTER (doc1.`value2` not in [ 1, 4, 7 ]) /* early pruning */
13 IndexNode 61 60000 10000 0.14254 - FOR doc2 IN collection /* persistent index scan */ FILTER (doc2.`value2` != 5) /* early pruning */
12 ReturnNode 61 60000 0 0.00168 - RETURN doc2
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Ranges
14 idx_1727463613020504064 persistent collection false false true 99.99 % [ `value1` ] (doc1.`value1` < 1000)
13 idx_1727463601873092608 persistent collection false false true 0.01 % [ `value2` ] (doc1.`value1` == doc2.`value2`)
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 0 71000 689 / 11 10300 98304 0.15389
Number of cluster requests in profiling output
Introduced in: v3.9.5, v3.10.2
The query profiling output in the web interface and arangosh now shows the
number of HTTP requests for queries that you run against cluster deployments in
the Query Statistics
:
Query String (33 chars, cacheable: false):
FOR doc IN coll
RETURN doc._key
Execution plan:
Id NodeType Site Calls Items Filtered Runtime [s] Comment
1 SingletonNode DBS 3 3 0 0.00024 * ROOT
9 IndexNode DBS 3 0 0 0.00060 - FOR doc IN coll /* primary index scan, index only (projections: `_key`), 3 shard(s) */
3 CalculationNode DBS 3 0 0 0.00025 - LET #1 = doc.`_key` /* attribute expression */ /* collections used: doc : coll */
7 RemoteNode COOR 6 0 0 0.00227 - REMOTE
8 GatherNode COOR 2 0 0 0.00209 - GATHER /* parallel, unsorted */
4 ReturnNode COOR 2 0 0 0.00008 - RETURN #1
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
9 primary primary coll true false false 100.00 % [ `_key` ] [ ] *
Optimization rules applied:
Id RuleName
1 scatter-in-cluster
2 distribute-filtercalc-to-cluster
3 remove-unnecessary-remote-scatter
4 reduce-extraction-to-projection
5 parallelize-gather
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Cache Hits/Misses Filtered Requests Peak Mem [b] Exec Time [s]
0 0 0 0 0 / 0 0 9 32768 0.00564
Index Lookup Optimization
ArangoDB 3.10 features a new optimization for index lookups that can help to speed up index accesses with post-filter conditions. The optimization is triggered if an index is used that does not cover all required attributes for the query, but the index lookup post-filter conditions only access attributes that are part of the index.
For example, if you have a collection with an index on value1
and value2
,
a query like below can only partially utilize the index for the lookup:
FOR doc IN collection
FILTER doc.value1 == @value1 /* uses the index */
FILTER ABS(doc.value2) != @value2 /* does not use the index */
RETURN doc
In this case, previous versions of ArangoDB always fetched the full documents from the storage engine for all index entries that matched the index lookup conditions. 3.10 will now only fetch the full documents from the storage engine for all index entries that matched the index lookup conditions, and that satisfy the index lookup post-filter conditions, too.
If the post-filter conditions filter out a lot of documents, this optimization can significantly speed up queries that produce large result sets from index lookups but filter many of the documents away with post-filter conditions.
See Filter Projections Optimization for details.
Lookahead for Multi-Dimensional Indexes
The multi-dimensional index type zkd
(experimental) now supports an optional
index hint for tweaking performance:
FOR … IN … OPTIONS { lookahead: 32 }
See Lookahead Index Hint.
Question mark operator
The new [? ... ]
array operator is a shorthand for an inline filter with a
surrounding length check:
LET arr = [ 1, 2, 3, 4 ]
RETURN arr[? 2 FILTER CURRENT % 2 == 0] // true
/* Equivalent expression:
RETURN LENGTH(arr[* FILTER CURRENT % 2 == 0]) == 2
*/
The quantifier can be a number, a range, NONE
, ANY
, ALL
, or AT LEAST
.
This operator is used for the new Nested search feature (Enterprise Edition only).
See Array Operators for details.
New AT LEAST
array comparison operator
You can now combine one of the supported comparison operators with the special
AT LEAST (<expression>)
operator to require an arbitrary number of elements
to satisfy the condition to evaluate to true
. You can use a static number or
calculate it dynamically using an expression:
[ 1, 2, 3 ] AT LEAST (2) IN [ 2, 3, 4 ] // true
["foo", "bar"] AT LEAST (1+1) == "foo" // false
See Array Comparison Operators.
New and Changed AQL Functions
AQL functions added to the 3.10 Enterprise Edition:
-
OFFSET_INFO()
: An ArangoSearch function to get the start offsets and lengths of matches for search highlighting. -
MINHASH()
: A new function for locality-sensitive hashing to approximate the Jaccard similarity. -
MINHASH_COUNT()
: A helper function to calculate the number of hashes (MinHash signature size) needed to not exceed the specified error amount. -
MINHASH_ERROR()
: A helper function to calculate the error amount based on the number of hashes (MinHash signature size). -
MINHASH_MATCH()
: A new ArangoSearch function to match documents with an approximate Jaccard similarity of at least the specified threshold that are indexed by a View.
AQL functions added to all editions of 3.10:
-
SUBSTRING_BYTES()
: A function to get a string subset using a start and length in bytes instead of in number of characters. -
VALUE()
: A new document function to dynamically get an attribute value of an object, using an array to specify the path. -
KEEP_RECURSIVE()
: A document function to recursively keep attributes from objects/documents, as a counterpart toUNSET_RECURSIVE()
.
AQL functions changed in 3.10:
-
MERGE_RECURSIVE()
: You can now call the function with a single argument instead of at least two. It also accepts an array of objects now, matching the behavior of theMERGE()
function. -
EXISTS()
: The function supports a new signatureEXISTS(doc.attr, "nested")
to check whether the specified attribute is indexed as nested field by a View or inverted index (introduced in v3.10.1).
Edge cache refilling (experimental)
Introduced in: v3.9.6, v3.10.2
A new feature to automatically refill the in-memory edge cache is available. When edges are added, modified, or removed, these changes are tracked and a background thread tries to update the edge cache accordingly if the feature is enabled, by adding new, updating existing, or deleting and refilling cache entries.
You can enable it for individual INSERT
, UPDATE
, REPLACE
, and REMOVE
operations in AQL queries (using OPTIONS { refillIndexCaches: true }
), for
individual document API requests that insert, update, replace, or remove single
or multiple edge documents (by setting refillIndexCaches=true
as query
parameter), as well as enable it by default using the new
--rocksdb.auto-refill-index-caches-on-modify
startup option.
The new --rocksdb.auto-refill-index-caches-queue-capacity
startup option
restricts how many edge cache entries the background thread can queue at most.
This limits the memory usage for the case of the background thread being slower
than other operations that invalidate edge cache entries.
The background refilling is done on a best-effort basis and not guaranteed to succeed, for example, if there is no memory available for the cache subsystem, or during cache grow/shrink operations. A background thread is used so that foreground write operations are not slowed down by a lot. It may still cause additional I/O activity to look up data from the storage engine to repopulate the cache.
In addition to refilling the edge cache, the cache can also automatically be
seeded on server startup. Use the new --rocksdb.auto-fill-index-caches-on-startup
startup option to enable this feature. It may cause additional CPU and I/O load.
You can limit how many index filling operations can execute concurrently with the
--rocksdb.max-concurrent-index-fill-tasks
option. The lower this number, the
lower the impact of the cache filling, but the longer it takes to complete.
The following metrics are available:
Label | Description |
---|---|
rocksdb_cache_auto_refill_loaded_total |
Total number of queued items for in-memory index caches refilling. |
rocksdb_cache_auto_refill_dropped_total |
Total number of dropped items for in-memory index caches refilling. |
rocksdb_cache_full_index_refills_total |
Total number of in-memory index caches refill operations for entire indexes. |
This feature is experimental.
Also see:
- AQL
INSERT
operation - AQL
UPDATE
operation - AQL
REPLACE
operation - AQL
REMOVE
operation - Document HTTP API
- Edge cache refill options
Extended query explain statistics
Introduced in: v3.10.4
The query explain result now includes the peak memory usage and execution time. This helps finding queries that use a lot of memory or take long to build the execution plan.
The additional statistics are displayed at the end of the output in the
web interface (using the Explain button in the QUERIES section) and in
arangosh (using db._explain()
):
44 rule(s) executed, 1 plan(s) created, peak mem [b]: 32768, exec time [s]: 0.00214
The HTTP API returns the extended statistics in the stats
attribute when you
use the POST /_api/explain
endpoint:
{
...
"stats": {
"rulesExecuted": 44,
"rulesSkipped": 0,
"plansCreated": 1,
"peakMemoryUsage": 32768,
"executionTime": 0.00241307167840004
}
}
Also see:
Indexes
Parallel index creation (Enterprise Edition)
In the Enterprise Edition, non-unique indexes can be created with multiple threads in parallel. The number of parallel index creation threads is currently set to 2, but future versions of ArangoDB may increase this value. Parallel index creation is only triggered for collections with at least 120,000 documents.
Storing additional values in indexes
Persistent indexes now allow you to store additional attributes in the index that can be used to cover more queries without having to look up full documents. They cannot be used for index lookups or for sorting, but for projections only.
You can specify the additional attributes in the new storedValues
option when
creating a new persistent index:
db.<collection>.ensureIndex({
type: "persistent",
fields: ["value1"],
storedValues: ["value2"]
});
See Persistent Indexes.
Enabling caching for index values
Persistent indexes now support in-memory caching of index entries, which can be
used when doing point lookups on the index. You can enable the cache with the
new cacheEnabled
option when creating a persistent index:
db.<collection>.ensureIndex({
type: "persistent",
fields: ["value"],
cacheEnabled: true
});
This can have a great positive effect on index scan performance if the number of scanned index entries is large.
As the cache is hash-based and unsorted, it cannot be used for full or partial range scans, for sorting, or for lookups that do not include all index attributes.
See Persistent Indexes.
Document keys
Some key generators can generate keys in an ascending order, meaning that document
keys with “higher” values also represent newer documents. This is true for the
traditional
, autoincrement
and padded
key generators.
Previously, the generated keys were only guaranteed to be truly ascending in single
server deployments. The reason was that document keys could be generated not only by
the DB-Server, but also by Coordinators (of which there are normally multiple instances).
While each component would still generate an ascending sequence of keys, the overall
sequence (mixing the results from different components) was not guaranteed to be
ascending.
ArangoDB 3.10 changes this behavior so that collections with only a single
shard can provide truly ascending keys. This includes collections in OneShard
databases as well.
Also, autoincrement
key generation is now supported in cluster mode for
single-sharded collections.
Document keys are still not guaranteed to be truly ascending for collections with
more than a single shard.
Read from followers in Clusters (Enterprise Edition)
You can now allow for reads from followers for a number of read-only operations in cluster deployments. In this case, Coordinators are allowed to read not only from leader shards but also from follower shards. This has a positive effect, because the reads can scale out to all DB-Servers that have copies of the data. Therefore, the read throughput is higher.
This feature is only available in the Enterprise Edition.
For more information, see Read from followers.
Improved shard rebalancing
Starting with version 3.10, the shard rebalancing feature introduces an automatic shard rebalancing API.
You can do any of the following by using the API:
- Get an analysis of the current cluster imbalance.
- Compute a plan of move shard operations to rebalance the cluster and thus improve balance.
- Execute the given set of move shard operations.
- Compute a set of move shard operations to improve balance and execute them immediately.
For more information, see the Cluster section of the HTTP API documentation.
Query result spillover to decrease memory usage
Queries can be executed with storing intermediate and final results temporarily on disk to decrease memory usage when a specified threshold is reached, either based on the memory usage (in bytes) or the number of result rows.
This feature is experimental and is turned off by default. It is currently limited to AQL queries that use SORT
operations but without a LIMIT
. The query results are still built up entirely in memory on Coordinators and single servers unless you use streaming queries.
An example of how to configure the spillover feature:
arangod --database.directory "myDir"
--temp.intermediate-results-path "tempDir"
--temp.intermediate-results-encryption
--temp.intermediate-results-encryption-hardware-acceleration
--temp.intermediate-results-spillover-threshold-memory-usage 134217728
--temp.intermediate-results-spillover-threshold-num-rows 50000
You can also set the thresholds per query in the JavaScript and HTTP APIs.
For details, see:
Server options
Responses early during instance startup
The HTTP interface of arangod instances can now optionally be started earlier during the startup process, so that ping probes from monitoring tools can already be responded to when the instance has not fully started.
You can set the new --server.early-connections
startup option to true
to
let the instance respond to the /_api/version
, /_admin/version
, and
/_admin/status
REST APIs early.
See Respond to liveliness probes.
Cache RocksDB index and filter blocks by default
The default value of the --rocksdb.cache-index-and-filter-blocks
startup option was changed
from false
to true
. This makes RocksDB track all loaded index and filter blocks in the
block cache, so they are accounted against the RocksDB’s block cache memory limit.
The default value for the --rocksdb.enforce-block-cache-size-limit
startup option was also
changed from false
to true
to make the RocksDB block cache not temporarily exceed the
configured memory limit.
These default value changes will make RocksDB adhere much better to the configured memory limit
(configurable via --rocksdb.block-cache-size
).
The changes may have a small negative impact on performance because, if the block cache is
not large enough to hold the data plus the index and filter blocks, additional disk I/O may
need to be performed compared to the previous versions.
This is a trade-off between memory usage predictability and performance, and ArangoDB 3.10
will default to more stable and predictable memory usage. If there is still unused RAM
capacity available, it may be sensible to increase the total size of the RocksDB block cache,
by increasing --rocksdb.block-cache-size
. Due to the changed configuration, the block
cache size limit will not be exceeded anymore.
It is possible to opt out of these changes and get back the memory and performance characteristics
of previous versions by setting the --rocksdb.cache-index-and-filter-blocks
and --rocksdb.enforce-block-cache-size-limit
startup options to false
on startup.
RocksDB range delete operations in cluster
The new --rocksdb.use-range-delete-in-wal
startup option controls whether the
collection truncate operation in a cluster can use RangeDelete operations in
RocksDB. Using RangeDeletes is fast and reduces the algorithmic complexity of
the truncate operation to O(1), compared to O(n) when this option is turned off
(with n being the number of documents in the collection/shard).
Previous versions of ArangoDB used RangeDeletes only on a single server, but never in a cluster.
The default value for this startup option is true
, and the option should only
be changed in case of emergency. This option is only honored in the cluster.
Single server and Active Failover deployments use RangeDeletes regardless of the
value of this option.
Note that it is not guaranteed that all truncate operations use a RangeDelete operation. For collections containing a low number of documents, the O(n) truncate method may still be used.
Pregel configuration options
There are now several startup options to configure the parallelism of Pregel jobs:
--pregel.min-parallelism
: minimum parallelism usable in Pregel jobs.--pregel.max-parallelism
: maximum parallelism usable in Pregel jobs.--pregel.parallelism
: default parallelism to use in Pregel jobs.
Administrators can use these options to set concurrency defaults and bounds for Pregel jobs on an instance level.
There are also new startup options to configure the usage of memory-mapped files for Pregel temporary data:
-
--pregel.memory-mapped-files
: to specify whether to use memory-mapped files or RAM for storing temporary Pregel data. -
--pregel.memory-mapped-files-location-type
: to set a location for memory-mapped files written by Pregel. This option is only meaningful, if memory-mapped files are used.
For more information on the new options, please refer to ArangoDB Server Pregel Options.
Query spillover options
The following new options are available to control the Query spillover feature.
--temp.intermediate-results-path
--temp.intermediate-results-encryption
(Enterprise Edition only)--temp.intermediate-results-encryption-hardware-acceleration
(Enterprise Edition only)--temp.intermediate-results-spillover-threshold-memory-usage
--temp.intermediate-results-spillover-threshold-num-rows
AQL query logging
Introduced in: v3.9.5, v3.10.2
There are three new startup options to configure how AQL queries are logged:
--query.log-failed
for logging all failed AQL queries, to be used during development or to catch unexpected failed queries in production (off by default)--query.log-memory-usage-threshold
to define a peak memory threshold from which on a warning is logged for AQL queries that exceed it (default: 4 GB)--query.max-artifact-log-length
for controlling the length of logged query strings and bind parameter values. Both are truncated to 4096 bytes by default.
ArangoSearch column cache limit
Introduced in: v3.9.5, v3.10.2
The new --arangosearch.columns-cache-limit
startup option lets you control how
much memory (in bytes) the ArangoSearch column cache
is allowed to use.
Cluster supervision options
Introduced in: v3.9.6, v3.10.2
The following new options allow you to delay supervision actions for a configurable amount of time. This is desirable in case DB-Servers are restarted or fail and come back quickly because it gives the cluster a chance to get in sync and fully resilient without deploying additional shard replicas and thus without causing any data imbalance:
-
--agency.supervision-delay-add-follower
: The delay in supervision, before an AddFollower job is executed (in seconds). -
--agency.supervision-delay-failed-follower
: The delay in supervision, before a FailedFollower job is executed (in seconds).
Edge cache refill options
Introduced in: v3.9.6, v3.10.2
--rocksdb.auto-refill-index-caches-on-modify
: Whether to automatically (re-)fill in-memory edge cache entries on insert/update/replace operations by default. Default:false
.--rocksdb.auto-refill-index-caches-queue-capacity
: How many changes can be queued at most for automatically refilling the edge cache. Default:131072
.--rocksdb.auto-fill-index-caches-on-startup
: Whether to automatically fill the in-memory edge cache with entries on server startup. Default:false
.--rocksdb.max-concurrent-index-fill-tasks
: The maximum number of index fill tasks that can run concurrently on server startup. Default: the number of cores divided by 8, but at least1
.
Introduced in: v3.9.10, v3.10.5
--rocksdb.auto-refill-index-caches-on-followers
: Control whether automatic refilling of in-memory caches should happen on followers or only leaders. The default value istrue
, i.e. refilling happens on followers, too.
Agency option to control whether a failed leader adds a shard follower
Introduced in: v3.9.7, v3.10.2
A --agency.supervision-failed-leader-adds-follower
startup option has been
added with a default of true
(behavior as before). If you set this option to
false
, a FailedLeader
job does not automatically configure a new shard
follower, thereby preventing unnecessary network traffic, CPU load, and I/O load
for the case that the server comes back quickly. If the server has permanently
failed, an AddFollower
job is created anyway eventually.
RocksDB Bloom filter option
Introduced in: v3.10.3
A new --rocksdb.bloom-filter-bits-per-key
startup option has been added to
configure the number of bits to use per key in a Bloom filter.
The default value is 10
, which is downwards-compatible to the previously
hard-coded value.
Option to disable Foxx
Introduced in: v3.10.5
A --foxx.enable
startup option has been added to let you configure whether
access to user-defined Foxx services is possible for the instance. It defaults
to true
.
If you set the option to false
, access to Foxx services is forbidden and is
responded with an HTTP 403 Forbidden
error. Access to the management APIs for
Foxx services are also disabled as if you set --foxx.api false
manually.
Access to ArangoDB’s built-in web interface, which is also a Foxx service, is
still possible even with the option set to false
.
RocksDB auto-flushing
Introduced in: v3.9.10, v3.10.5
A new feature for automatically flushing RocksDB Write-Ahead Log (WAL) files and in-memory column family data has been added.
An auto-flush occurs if the number of live WAL files exceeds a certain threshold.
This ensures that WAL files are moved to the archive when there are a lot of
live WAL files present, for example, after a restart. In this case, RocksDB does
not count any previously existing WAL files when calculating the size of WAL
files and comparing its max_total_wal_size
. Auto-flushing fixes this problem,
but may prevent WAL files from being moved to the archive quickly.
You can configure the feature via the following new startup options:
--rocksdb.auto-flush-min-live-wal-files
: The minimum number of live WAL files that triggers an auto-flush. Defaults to10
.--rocksdb.auto-flush-check-interval
: The interval (in seconds) in which auto-flushes are executed. Defaults to3600
. Note that an auto-flush is only executed if the number of live WAL files exceeds the configured threshold and the last auto-flush is longer ago than the configured auto-flush check interval. This avoids too frequent auto-flushes.
Miscellaneous changes
Optimizer rules endpoint
Added the GET /_api/query/rules
REST API endpoint that returns the available
optimizer rules for AQL queries.
Additional metrics for caching subsystem
The caching subsystem now provides the following 3 additional metrics:
rocksdb_cache_active_tables
: total number of active hash tables used for caching index values. There should be 1 table per shard per index for which the in-memory cache is enabled. The number also includes temporary tables that are built when migrating existing tables to larger equivalents.rocksdb_cache_unused_memory
: total amount of memory used for inactive hash tables used for caching index values. Some inactive tables can be kept around after use, so they can be recycled quickly. The overall amount of inactive tables is limited, so not much memory will be used here.rocksdb_cache_unused_tables
: total number of inactive hash tables used for caching index values. Some inactive tables are kept around after use, so they can be recycled quickly. The overall amount of inactive tables is limited, so not much memory will be used here.
Replication improvements
For synchronous replication of document operations in the cluster, the follower can now return smaller responses to the leader. This change reduces the network traffic between the leader and its followers, and can lead to slightly faster turnover in replication.
Calculation of file hashes (Enterprise Edition)
The calculation of SHA256 file hashes for the .sst files created by RocksDB and that are required for hot backups has been moved from a separate background thread into the actual RocksDB operations that write out the .sst files.
The SHA256 hashes are now calculated incrementally while .sst files are being
written, so that no post-processing of .sst files is necessary anymore.
The previous background thread named Sha256Thread
, which was responsible for
calculating the SHA256 hashes and sometimes for high CPU utilization after
larger write operations, has now been fully removed.
Traffic accounting metrics
Introduced in: v3.8.9, v3.9.6, v3.10.2
The following metrics for traffic accounting were added:
Label | Description |
---|---|
arangodb_client_user_connection_statistics_bytes_received |
Bytes received for requests, only user traffic. |
arangodb_client_user_connection_statistics_bytes_sent |
Bytes sent for responses, only user traffic. |
arangodb_http1_connections_total |
Total number of HTTP/1.1 connections accepted. |
Configurable CACHE_OBLIVIOUS
option for jemalloc
Introduced in: v3.9.7, v3.10.3
The jemalloc memory allocator supports an option to toggle cache-oblivious large allocation alignment. It is enabled by default up to v3.10.3, but disabled from v3.10.4 onwards. Disabling it helps to save 4096 bytes of memory for every allocation which is at least 16384 bytes large. This is particularly beneficial for the RocksDB buffer cache.
You can now configure the option by setting a CACHE_OBLIVIOUS
environment
variable to the string true
or false
before starting ArangoDB.
See ArangoDB Server environment variables for details.
WAL file tracking metrics
Introduced in: v3.9.10, v3.10.5
The following metrics for write-ahead log (WAL) file tracking have been added:
Label | Description |
---|---|
rocksdb_live_wal_files |
Number of live RocksDB WAL files. |
rocksdb_wal_released_tick_flush |
Lower bound sequence number from which WAL files need to be kept because of external flushing needs. |
rocksdb_wal_released_tick_replication |
Lower bound sequence number from which WAL files need to be kept because of replication. |
arangodb_flush_subscriptions |
Number of currently active flush subscriptions. |
Number of replication clients metric
The following metric for the number of replication clients for a server has been added:
Introduced in: v3.10.5
Label | Description |
---|---|
arangodb_replication_clients |
Number of currently connected/active replication clients. |
Client tools
arangobench
arangobench has a new --create-collection
startup option that can be set to false
to skip setting up a new collection for the to-be-run workload. That way, some
workloads can be run on already existing collections.
arangoexport
arangoexport has a new --custom-query-bindvars
startup option that lets you set
bind variables that you can now use in the --custom-query
option
(renamed from --query
):
arangoexport \
--custom-query 'FOR book IN @@@@collectionName FILTER book.sold > @@sold RETURN book' \
--custom-query-bindvars '{"@@collectionName": "books", "sold": 100}' \
...
Note that you need to escape at signs in command-lines by doubling them (see Environment variables as parameters).
arangoexport now also has a --custom-query-file
startup option that you can
use instead of --custom-query
, to read a query from a file. This allows you to
store complex queries and no escaping is necessary in the file:
// example.aql
FOR book IN @@collectionName
FILTER book.sold > @sold
RETURN book
arangoexport \
--custom-query-file example.aql \
--custom-query-bindvars '{"@@collectionName": "books", "sold": 100}' \
...
arangoimport
arangoimport has a new --overwrite-collection-prefix
option that is useful
when importing edge collections. This option should be used together with
--to-collection-prefix
or --from-collection-prefix
.
If there are vertex collection prefixes in the file you want to import,
you can overwrite them with prefixes specified on the command line. If the option is set
to false
, only _from
and _to
values without a prefix are going to be
prefixed by the entered values.
arangoimport now supports the --remove-attribute
option when using JSON
as input file format.
For more information, refer to the arangoimport Options.
ArangoDB Starter
ArangoDB Starter has a new feature that allows you to configure the binary
by reading from a configuration file. The default configuration file of the Starter
is arangodb-starter.conf
and can be changed using the --configuration
option.
See the Starter configuration file section for more information about the configuration file format, passing through command line options, and examples.
Internal changes
C++20
ArangoDB is now compiled using the -std=c++20
compile option on Linux and MacOS.
A compiler with c++-20 support is thus needed to compile ArangoDB from source.
Upgraded bundled library versions
The bundled version of the RocksDB library has been upgraded from 6.8.0 to 7.2.
The bundled version of the Boost library has been upgraded from 1.71.0 to 1.78.0.
The bundled version of the immer library has been upgraded from 0.6.2 to 0.7.0.
The bundled version of the jemalloc library has been upgraded from 5.2.1-dev to 5.3.0.
The bundled version of the zlib library has been upgraded from 1.2.11 to 1.2.12.
For ArangoDB 3.10, the bundled version of rclone is 1.59.0.