SonarW Extensions

SonarW is largely compatible with MongoDB, but it is a Big Data warehouse rather than an OLTP database like MongoDB. SonarW therefore adds numerous extensions.

SonarW offers a number of language extensions that address specific analytic use cases. This section details these language extensions.

The extensions outlined in these sections apply both to the CRUD query language (specifically the find query) and to the aggregation framework. Larger extensions provided by SonarW - such as joins, window functions, and views - are described in separate, stand-alone sections.

String Processing Extensions

Searching for the Existence of Multiple Words within a String

The $contains operator allows you to search using a set of values and is often used with a subquery operator. It searches string fields that contain many words and matches if any of the words appears within the string. While the same can be done with a regular expression, writing that regular expression can be difficult; using $contains is usually easier.

For example, suppose that you have the following data:

> db.sayings.find()
{ "_id" : ObjectId("56329387619cd4f520659320"), "text" : "A barking dog never bites" }
{ "_id" : ObjectId("5632939c619cd4f520659321"), "text" : "raining cats and dogs" }
{ "_id" : ObjectId("563293aa619cd4f520659322"), "text" : "You cannot teach an old dog new tricks" }
{ "_id" : ObjectId("563293c6619cd4f520659323"), "text" : "grin like a Cheshire cat" }
{ "_id" : ObjectId("563293da619cd4f520659324"), "text" : "Don't change horses at midstream." }

then:

> db.sayings.find({$expr: {$contains: ["$text", ["dog", "cat"]]}})
{ "_id" : ObjectId("56329387619cd4f520659320"), "text" : "A barking dog never bites" }
{ "_id" : ObjectId("563293aa619cd4f520659322"), "text" : "You cannot teach an old dog new tricks" }
{ "_id" : ObjectId("563293c6619cd4f520659323"), "text" : "grin like a Cheshire cat" }

Note that the match is always case-insensitive. While $contains looks similar to $in, it matches any word within the text rather than trying to match the text as a whole. It is closest to $tsearch, but supports multiple words and can always be used, regardless of whether a text index exists.

With two words it is easy to do the same with a $regex, but imagine having 100 words to search for. The $contains implementation is also tuned for matching many words and is faster than using a $regex.
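As an illustration of the matching rule (not SonarW's implementation), the whole-token, case-insensitive semantics can be sketched in plain JavaScript; `containsAny` is a hypothetical helper name:

```javascript
// Sketch of $contains semantics: split the input string into tokens on
// the delimiter characters, then test case-insensitively whether any of
// the search words appears as a whole token (so "dogs" does NOT match
// a search for "dog").
function containsAny(text, words, delims = " ") {
  const cls = delims.replace(/[\]\\^-]/g, "\\$&");  // escape char-class metacharacters
  const tokens = text.toLowerCase().split(new RegExp("[" + cls + "]+")).filter(Boolean);
  const wanted = new Set(words.map(w => w.toLowerCase()));
  return tokens.some(t => wanted.has(t));
}

containsAny("A barking dog never bites", ["dog", "cat"]);  // whole token "dog" matches
containsAny("raining cats and dogs", ["dog", "cat"]);      // "cats"/"dogs" are not "cat"/"dog"
```

This also explains why, in the example above, "raining cats and dogs" does not match a search for ["dog", "cat"]: matching is per whole word.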

A $contains is most often used in tandem with a subquery (using $query within the $expr). The subquery computes a set of values that are then used to match a certain field. For example, assume you have documents that look like:

> db.session.findOne()
{
   "_id" : NumberLong("10000000000"),
   "Access Id" : "32",
   "Database Name" : "DB_NAME_2672",
   "Session Start" : ISODate("2014-01-01T00:00:00Z"),
   "Uid Chain" : null,
   "Session End" : ISODate("1970-01-01T00:00:00Z"),
   "Analyzed Client IP" : "186.211.9.154",
   "Server IP" : "66.91.145.34",
   "Client Host Name" : "CLIENT_HOST_NAME_680",
   "Login Succeeded" : 1,
   "DB User Name" : "DB_USER_906",
   "Source Program" : "SOURCE_PROGRAM_49721",
   "Server Type" : "MONGODB",
   "Service Name" : "SERVICE_NAME_487",
   "OS User" : "OS_USER_49088",
   "Server Host Name" : "SERVER_HOST_NAME_49615",
   "Original Timezone" : "UTC-04:00"
}

where each document has a user name (in the field DB User Name). Also, assume that you have a collection that has groups of users, e.g.:

> db.group_members.findOne({"Group Description": "Admin Users"})
{
   "_id" : ObjectId("563290be16ca4734e6369801"),
   "Group Type" : "USERS",
   "Group Description" : "Admin Users",
   "Group Member" : "ADMIN"
}

If you now want to look for all sessions that belong to any one of the users in the Admin Users group you can use:

> db.session.findOne(
      {"$expr": {
         "$contains": [
            "$DB User Name",
            {"$query": {"$ns": "group_members", "$q": {"Group Type": "USERS", "Group Description": "Admin Users"}, "$p": "Group Member"}}
         ]}})
{
   "_id" : NumberLong("30000180089"),
   "Access Id" : "20d159cc238651adf0df7d3d9897f4a04823e26e",
   "Session Start" : ISODate("2015-10-20T12:02:17Z"),
   "Session End" : ISODate("2015-10-20T12:02:17Z"),
   "Database Name" : "ON12SCAL@ON12SCAL",
   "Uid Chain" : "(1,root,init [5])->(4716,root,/usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid)->(19816,root,sshd: root@pts/1)->(19819,root,-bash)->(19850,root,su - oracle11)->(19851,oracle11,-bash)->(19877,oracle11,sqlplus)->(19881,oracle11,oracleon12scal (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))",
   "Login Succeeded" : 1,
   "DB User Name" : "SYS",
   "OS User" : "ORACLE11",
   "Source Program" : "SQLPLUS",
   "Server IP" : "9.70.147.189",
   "Analyzed Client IP" : "9.70.147.189",
   "Service Name" : "ORACLEON12SCAL",
   "Client Host Name" : "SCALE-DB04",
   "Server Type" : "ORACLE",
   "Server Host Name" : "SCALE-DB04",
   "Original Timezone" : "UTC-04:00"
}

By default, space is the delimiter used to tokenize the input string into words, but you can specify additional delimiters. Each delimiter is a single character. To tokenize on space, semicolon, and comma do:

> db.sayings.find({$expr: {$contains: ["$text", ["dog", "cat"], " ;,"]}})

Limits:

Each $contains is limited to searching up to 16K strings and each string is limited to 1K characters.

Searching for Word Tuples in Strings (an OR of AND Containment)

Similar to $contains, $containsTuple searches within strings, but each search entry is a tuple of words joined by a delimiter. A tuple matches only if all of its words are found in the string. As an example:

> db.sayings.find()
{ "_id" : ObjectId("56329387619cd4f520659320"), "text" : "A barking dog never bites" }
{ "_id" : ObjectId("5632939c619cd4f520659321"), "text" : "raining cats and dogs" }
{ "_id" : ObjectId("563293aa619cd4f520659322"), "text" : "You cannot teach an old dog new tricks" }
{ "_id" : ObjectId("563293c6619cd4f520659323"), "text" : "grin like a Cheshire cat" }
{ "_id" : ObjectId("563293da619cd4f520659324"), "text" : "Don't change horses at midstream." }

then:

> db.sayings.find({$expr: {$containsTuple: ["$text", "+", ["bites+dog", "grin+cat"]]}})
{ "_id" : ObjectId("56329387619cd4f520659320"), "text" : "A barking dog never bites" }
{ "_id" : ObjectId("563293c6619cd4f520659323"), "text" : "grin like a Cheshire cat" }

Parameters to $containsTuple are passed as an array. The first parameter in the array is the field to search within. The second is a delimiter string. The third is an array of delimited strings used in the search. Note that like $contains, this array can be the result of a computed subquery.

By default, space is the delimiter used to tokenize the input string into words, but you can specify additional delimiters. Each delimiter is a single character. To tokenize on space, semicolon, and comma do:

> db.sayings.find({$expr: {$containsTuple: ["$text", "+", ["bites+dog", "grin+cat"], " ;,"]}})

Note that $containsTuple takes TWO delimiters: the first controls the tokenization of the tuples, and the second controls the tokenization of the input strings.

Limits:

Each $containsTuple is limited to searching up to 16K tuples and each tuple definition is limited to 1K characters. Overall there cannot be more than 64K distinct words in all tuples.
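The OR-of-AND semantics can be sketched in plain JavaScript (an illustration only, not SonarW's implementation; the delimiter characters are assumed not to be regex metacharacters):

```javascript
// Sketch of $containsTuple: each entry in `tuples` is split on the tuple
// delimiter into words; the string matches if, for at least one tuple,
// ALL of that tuple's words appear as tokens of the input string.
function containsTuple(text, tupleDelim, tuples, delims = " ") {
  const tokens = new Set(
    text.toLowerCase().split(new RegExp("[" + delims + "]+")).filter(Boolean));
  return tuples.some(tuple =>
    tuple.toLowerCase().split(tupleDelim).every(word => tokens.has(word)));
}

containsTuple("A barking dog never bites", "+", ["bites+dog", "grin+cat"]);  // "bites" AND "dog" present
containsTuple("raining cats and dogs", "+", ["bites+dog", "grin+cat"]);      // no tuple fully matches
```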

Finding Unique Heads and Tails in a String

The $getHeadTail operator returns a unique list of the "heads" or "tails" in a string that uses the head/tail construction.

A string in head/tail construction can contain multiple "sentences" separated by a sentence delimiter. Each sentence has a "head" and a "tail", separated by the first head/tail delimiter found in the sentence. For example, the string "TABLE1 SELECT;TABLE2 DROP TABLE", with ';' as the sentence delimiter and ' ' as the head/tail delimiter, has two sentences: 1. "TABLE1 SELECT" - head: "TABLE1", tail: "SELECT"; 2. "TABLE2 DROP TABLE" - head: "TABLE2", tail: "DROP TABLE".

Syntax:

$getHeadTail: {
    value: < an expression to run the operator on >,
    head: < boolean: whether you want the head (true) or the tail (false) >,
    sentence_delimiter: < [OPTIONAL] string that represents the sentence delimiter, defaults to ';' >,
    head_tail_delimiter: < [OPTIONAL] string that represents the head/tail delimiter, defaults to ' ' >,
    case_insensitive: < [OPTIONAL] boolean for case-insensitive matching, defaults to false (case sensitive) >
}

Example:

>db.a.find()
{ "a": "TABLE1 SELECT;TABLE2 DROP TABLE;table3 select" }

>db.a.aggregate({$project:{_id:0, tails:{$getHeadTail:{value:"$a", head:false, case_insensitive: true}}}})
{ "tails": [ "SELECT", "DROP TABLE" ] }

>db.a.aggregate({$project:{_id:0, heads:{$getHeadTail:{value:"$a", head:true, case_insensitive: false}}}})
{ "heads": [ "TABLE1", "TABLE2", "table3" ] }
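The parsing rule behind $getHeadTail can be sketched in plain JavaScript (an illustration of the semantics only; returning the first-seen spelling of a case-insensitive duplicate is an assumption of this sketch):

```javascript
// Sketch of $getHeadTail: split the string into sentences, split each
// sentence on the FIRST head/tail delimiter, and return the unique heads
// (or tails) in order of first appearance.
function getHeadTail(value, head, sentenceDelim = ";", htDelim = " ", caseInsensitive = false) {
  const out = [];
  const seen = new Set();
  for (const sentence of value.split(sentenceDelim)) {
    const i = sentence.indexOf(htDelim);
    if (i < 0) continue;                               // sentence has no head/tail split
    const part = head ? sentence.slice(0, i) : sentence.slice(i + 1);
    const key = caseInsensitive ? part.toLowerCase() : part;
    if (!seen.has(key)) { seen.add(key); out.push(part); }
  }
  return out;
}

getHeadTail("TABLE1 SELECT;TABLE2 DROP TABLE;table3 select", false, ";", " ", true);
// tails, unique case-insensitively: [ "SELECT", "DROP TABLE" ]
```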

Example with real data:

db.foo.aggregate({$project:{_id:0,n:{$getHeadTail:{value:"$Objects and Verbs",head:true}}}})
{ "n" : [ "COUNT_BIG", "master.dbo.GDMC_DEF_mn0qa_db25mn0fal" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.GDMC_WI8KU2X64T_MSS_mss609tcp" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.GDPR_SAMPLE" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.GDPR_SAMPLE1" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.MSreplication_options" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.purchase_visa" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_aao" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_acp" ] }

Finding Substrings in a Repeating Expression via Regex

The $regexSplit projection operator applies a regular expression repeatedly to a string and collects the capture groups of each match: with multiple capture groups each match yields an array of groups, while with a single capture group each match yields just that string.

Example:

> db.foo.aggregate({$project:{_id:0,n:{$regexSplit:["$Objects and Verbs",/[ ]*([^ ]+)[ ]*([^;]*)[;]*/]}}})
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.GDMC_DEF_mn0qa_db25mn0fal", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.GDMC_WI8KU2X64T_MSS_mss609tcp", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.GDPR_SAMPLE", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.GDPR_SAMPLE1", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.MSreplication_options", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.purchase_visa", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_aao", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_acp", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.scl_tbl0_afq", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.scl_tbl0_ahy", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_aic", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.scl_tbl0_ajb", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_aky", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_alp", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.scl_tbl0_alr", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_anb", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_any", "SELECT" ] ] }
{ "n" : [ [ "master.dbo.scl_tbl0_aoe", "SELECT" ], [ "COUNT_BIG", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_aup", "SELECT" ] ] }
{ "n" : [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.scl_tbl0_awa", "SELECT" ] ] }

> db.foo.aggregate({$project:{_id:0,n:{$regexSplit:["$Objects and Verbs",/[ ]*([^ ]+)[ ]*[^;]*[;]*/]}}})
{ "n" : [ "COUNT_BIG", "master.dbo.GDMC_DEF_mn0qa_db25mn0fal" ] }
{ "n" : [ "master.dbo.GDMC_WI8KU2X64T_MSS_mss609tcp", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.GDPR_SAMPLE" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.GDPR_SAMPLE1" ] }
{ "n" : [ "master.dbo.MSreplication_options", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.purchase_visa" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_aao" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_acp" ] }
{ "n" : [ "master.dbo.scl_tbl0_afq", "COUNT_BIG" ] }
{ "n" : [ "master.dbo.scl_tbl0_ahy", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_aic" ] }
{ "n" : [ "master.dbo.scl_tbl0_ajb", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_aky" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_alp" ] }
{ "n" : [ "master.dbo.scl_tbl0_alr", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_anb" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_any" ] }
{ "n" : [ "master.dbo.scl_tbl0_aoe", "COUNT_BIG" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_aup" ] }
{ "n" : [ "COUNT_BIG", "master.dbo.scl_tbl0_awa" ] }
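What $regexSplit computes can be sketched with JavaScript's matchAll (an illustration only; `regexSplit` here is a hypothetical stand-in, not SonarW code):

```javascript
// Sketch of $regexSplit: apply the regex repeatedly along the string and
// collect the capture groups of every match. With several capture groups
// each match yields an array of groups; with one group, just that string.
function regexSplit(text, re) {
  const flags = re.flags.includes("g") ? re.flags : re.flags + "g";
  const out = [];
  for (const m of text.matchAll(new RegExp(re.source, flags))) {
    const groups = m.slice(1);
    out.push(groups.length === 1 ? groups[0] : groups);
  }
  return out;
}

regexSplit("COUNT_BIG SELECT;master.dbo.GDPR_SAMPLE SELECT;",
           /[ ]*([^ ]+)[ ]*([^;]*)[;]*/);
// [ [ "COUNT_BIG", "SELECT" ], [ "master.dbo.GDPR_SAMPLE", "SELECT" ] ]
```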

Searching for Full Inclusion of a Set of Words

$allContained returns true if and only if all words in the input string are contained in the supplied array of words. If any tokenized word is not in the array, the operator returns false. An optional delimiter string defines how the input string is tokenized. For example, if the collection contains:

> db.sayings.find()
{ "_id" : ObjectId("56329387619cd4f520659320"), "text" : "A barking dog never bites" }
{ "_id" : ObjectId("5632939c619cd4f520659321"), "text" : "raining cats and dogs" }
{ "_id" : ObjectId("563293aa619cd4f520659322"), "text" : "You cannot teach an old dog new tricks" }
{ "_id" : ObjectId("563293c6619cd4f520659323"), "text" : "grin like a Cheshire cat" }
{ "_id" : ObjectId("563293da619cd4f520659324"), "text" : "Don't change horses at midstream." }
{ "_id" : ObjectId("5632939c619cd4f520659325"), "text" : "raining" }
{ "_id" : ObjectId("5632939c619cd4f520659326"), "text" : "raining cats but not dogs" }
{ "_id" : ObjectId("5632939c619cd4f520659327"), "text" : "raining cats but dogs" }

then the following:

> db.sayings.find({$expr: {$allContained: ["$text", ["raining", "cats", "and", "dogs", "but"]]}})

returns:

{ "_id" : ObjectId("5632939c619cd4f520659321"), "text" : "raining cats and dogs" }
{ "_id" : ObjectId("5632939c619cd4f520659325"), "text" : "raining" }
{ "_id" : ObjectId("5632939c619cd4f520659327"), "text" : "raining cats but dogs" }
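A JavaScript sketch of the rule (illustration only; treating the comparison as case-insensitive, like $contains, is an assumption of this sketch):

```javascript
// Sketch of $allContained: tokenize the input string on the delimiter
// characters and return true only if EVERY token appears in the supplied
// word list.
function allContained(text, words, delims = " ") {
  const allowed = new Set(words.map(w => w.toLowerCase()));
  return text.toLowerCase()
             .split(new RegExp("[" + delims + "]+"))
             .filter(Boolean)
             .every(token => allowed.has(token));
}

allContained("raining cats and dogs", ["raining", "cats", "and", "dogs", "but"]);     // every token allowed
allContained("raining cats but not dogs", ["raining", "cats", "and", "dogs", "but"]); // "not" is not allowed
```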

Searching for Words in Strings (an OR of AND Containment) with Location Specificity

The $headTail operator provides more refined searches, akin to $contains and $containsTuple, but adds an element of location to the search. It works on strings that can be broken into "sentences" using a delimiter you specify. Each sentence is further broken into a head and a tail component using an additional delimiter. You then provide an array of words (which can also be produced by a subquery) and search for any of these words in either the head sections or the tail sections. Specifying "all" instead of "any" matches only texts in which every sentence contains at least one such word. You can also control whether a word matches the head or tail exactly or as a LIKE match, using % as a wildcard. Finally, when the array contains tuples (with a delimiter), a match requires both the head and the tail to match (similar to $elemMatch).

Synopsis of the command in a $match:

$expr: {
        $headTail: [
                "$<field name to match>",
                [ .. array of strings (words or tuples) .. ],
                "head|tail|both", // both for tuples
                "any|all", // match only if every sentence matches
                "string|regex", // regex only matches a % as a wildcard
                ";", // sentence delimiter
                " ", // delimiter between the head and the tail - first one delimits
                "+" // delimiter of tuples (when using "both")
        ]}

Find by Network

Use $inCIDR to match IP address strings that fall within a CIDR block (IPv4 or IPv6), and $inNetwork to match against a network address and netmask.

Examples:

> db.coll.find()
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "ip" : "192.168.36.28" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "ip" : "192.168.36.10" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "ip" : "192.168.36.66" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "ip" : "192.168.1.130" }

> db.coll2.find()
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c469"), "ip" : "fdb7:851c:0e1c:e1a9::abdf:1234" }
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c46a"), "ip" : "1234:7712:123::abcd:1234:5678" }

> db.coll.find({"ip":{"$inCIDR":"192.168.36.8/29"}})
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "ip" : "192.168.36.10" }

> db.coll.find({"ip":{"$inNetwork":"192.168.36.0/255.255.255.0"}})
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "ip" : "192.168.36.28" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "ip" : "192.168.36.10" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "ip" : "192.168.36.66" }

> db.coll.find({"ip":{"$inNetwork":"192.168.36.0/255.255.255.248"}})

> db.coll.find({"ip":{"$inNetwork":"192.168.36.0/255.255.255.240"}})
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "ip" : "192.168.36.10" }

> db.coll2.find({"ip":{"$inCIDR":"1234:7712::/32"}})
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c46a"), "ip" : "1234:7712:123::abcd:1234:5678" }
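The IPv4 matching rule can be sketched in plain JavaScript (an illustration of CIDR/netmask arithmetic only, not SonarW's implementation; IPv6 handling is omitted):

```javascript
// Convert a dotted-quad IPv4 address to a 32-bit integer.
function ipToInt(ip) {
  return ip.split(".").reduce((n, octet) => n * 256 + Number(octet), 0);
}

// $inCIDR sketch: does `ip` fall inside "net/bits", e.g. "192.168.36.8/29"?
function inCIDR(ip, cidr) {
  const [net, bits] = cidr.split("/");
  const mask = bits === "0" ? 0 : (0xFFFFFFFF << (32 - Number(bits))) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(net) & mask) >>> 0);
}

// $inNetwork sketch: does `ip` fall inside "net/netmask",
// e.g. "192.168.36.0/255.255.255.0"?
function inNetwork(ip, spec) {
  const [net, mask] = spec.split("/");
  const m = ipToInt(mask);
  return ((ipToInt(ip) & m) >>> 0) === ((ipToInt(net) & m) >>> 0);
}

inCIDR("192.168.36.10", "192.168.36.8/29");  // .10 is inside 192.168.36.8-15
inCIDR("192.168.36.28", "192.168.36.8/29");  // .28 is outside that block
```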

Additional Network Projection Expressions

Examples:

> db.coll.find()
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "ip" : "192.168.36.28" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "ip" : "192.168.36.10" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "ip" : "192.168.36.66" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "ip" : "192.168.1.130" }

> db.coll2.find()
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c469"), "ip" : "fdb7:851c:0e1c:e1a9::abdf:1234" }
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c46a"), "ip" : "1234:7712:123::abcd:1234:5678" }

> db.coll.aggregate({$project:{n:{$inCIDR:["$ip","192.168.36.0",24]}}})
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : true }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : true }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : true }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : false }

> db.coll.aggregate({$project:{n:{$inNetwork:["$ip","192.168.36.8/255.255.255.248"]}}})
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : false }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : true }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : false }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : false }

> db.coll.aggregate({$project:{n:{$inNetwork:["$ip","192.168.36.8","255.255.255.248"]}}})
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : false }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : true }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : false }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : false }

> db.coll2.aggregate([{'$project':{'n':{'$inCIDR':["$ip","1234:7712::/32"]}}}])
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c469"), "n" : false }
{ "_id" : ObjectId("5a2ebbd7d4ff6c80e851c46a"), "n" : true }

> db.coll.aggregate([{'$project':{'n':{'$netCIDR':["$ip",29]}}}])
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : "192.168.36.24" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : "192.168.36.8" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : "192.168.36.64" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : "192.168.1.128" }

> db.coll.aggregate([{'$project':{'n':{'$netMask':["$ip","255.255.255.0"]}}}])
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : "192.168.36.0" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : "192.168.36.0" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : "192.168.36.0" }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : "192.168.1.0" }

> db.coll.aggregate({$project:{n:{$bucketIP:["$ip",[["192.168.1.0","192.168.1.255"],["192.168.36.0","192.168.36.32"],["192.168.36.0", "192.168.36.255"]]]}}})
{ "_id" : ObjectId("5a2ebb03d4ff6c80e851c465"), "n" : [ NumberLong(1), NumberLong(2) ] }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c466"), "n" : [ NumberLong(1), NumberLong(2) ] }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c467"), "n" : NumberLong(2) }
{ "_id" : ObjectId("5a2ebb04d4ff6c80e851c468"), "n" : NumberLong(0) }
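The $netCIDR and $netMask projections reduce an address to its network address. A JavaScript sketch of the arithmetic (IPv4 only, for illustration):

```javascript
// Reduce an IPv4 address to its network address, by prefix length
// ($netCIDR) or by explicit netmask ($netMask).
function ipToInt(ip) { return ip.split(".").reduce((n, o) => n * 256 + Number(o), 0); }
function intToIp(n) {
  return [(n >>> 24) & 255, (n >>> 16) & 255, (n >>> 8) & 255, n & 255].join(".");
}

function netCIDR(ip, bits) {
  const mask = bits === 0 ? 0 : (0xFFFFFFFF << (32 - bits)) >>> 0;
  return intToIp((ipToInt(ip) & mask) >>> 0);
}

function netMask(ip, mask) {
  return intToIp((ipToInt(ip) & ipToInt(mask)) >>> 0);
}

netCIDR("192.168.36.28", 29);               // "192.168.36.24"
netMask("192.168.36.28", "255.255.255.0");  // "192.168.36.0"
```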

Cron Date Matching and Generation

Basic Syntax:

{$cron:{date:<date>, at: <cron string>, fire: <non-zero integer>}}

The 'date' field is optional; if omitted, it defaults to the time at which the query was started.

Examples:

For a cron event scheduled for noon, Monday to Friday, show the next time it will fire from now:

> db.one.aggregate({$project:{_id:0,n:{$cron:{at:'0 12 * * Mon-Fri *',fire:1}}}})
{ "n" : ISODate("2018-03-08T12:00:00Z") }

For a cron event scheduled every 3 hours, starting at midnight, but only running on Mondays and Fridays, show the 2nd last time this query was started:

> db.one.aggregate({$project:{_id:0,n:{$cron:{at:'0 */3 * * 1,5 *',fire:-2}}}})
{ "n" : ISODate("2018-03-05T18:00:00Z") }

For a cron event scheduled to occur at 3 am on the first monday of every other month starting in January, show the 4th time after a date in the database that the query would execute:

> db.dates.aggregate({$project:{_id:0,n:{$cron:{date:"$date",at:'0 3 1-7 */2 Mon *',fire:4}}}})
{ "n" : ISODate("2018-11-05T03:00:00Z") }

For a cron event scheduled to occur only on Feb 29, and to happen at 6:01 am, show the time after the next time it will happen:

> db.one.aggregate({$project:{_id:0,n:{$cron:{at:'1 6 29 Feb * *',fire:1}}}})
{ "n" : ISODate("2020-02-29T06:01:00Z") }

One can also find matches based on a cron date, e.g.:

> db.schedules.find({date:{$gte:{$cron:{at:'1 12 * * Wed *', fire:-1}}}}) ...

would match any date since 12:01 pm last Wednesday.
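To make the fire semantics concrete, here is a heavily simplified JavaScript sketch of what fire: 1 computes for a fixed-minute, fixed-hour, weekday-restricted pattern such as '0 12 * * Mon-Fri *'. Real cron parsing (ranges, steps, month and day-of-month fields) is omitted, and this is not SonarW's implementation:

```javascript
// Scan forward minute by minute from `from` (exclusive) until the
// minute, hour, and UTC weekday all match; return that Date or null.
function nextFire(from, minute, hour, weekdays /* Set of 0-6, Sun = 0 */) {
  const d = new Date(from.getTime());
  d.setUTCSeconds(0, 0);
  d.setUTCMinutes(d.getUTCMinutes() + 1);        // strictly after `from`
  for (let i = 0; i < 60 * 24 * 8; i++) {        // give up after 8 days
    if (d.getUTCMinutes() === minute && d.getUTCHours() === hour &&
        weekdays.has(d.getUTCDay())) {
      return new Date(d.getTime());
    }
    d.setUTCMinutes(d.getUTCMinutes() + 1);
  }
  return null;
}

// From Wednesday 2018-03-07 15:00 UTC, the next weekday noon is
// Thursday 2018-03-08 12:00 UTC.
nextFire(new Date(Date.UTC(2018, 2, 7, 15, 0)), 0, 12, new Set([1, 2, 3, 4, 5]));
```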

Searching for Inclusion of Sequences of Characters anywhere in Strings

Note: $snippet is only available when SonarW is running on a machine that has an Nvidia GPU.

SonarW has a fast-search facility called snippets that can be used to implement "Google-like" search on certain fields in a document. The search does not require matching the entire contents of a field, a stemmed word, or a prefix; it can find any combination of consecutive characters in strings. For example, you can search for "ear" and get hits from words including earring, search, earl, and clear. The search is case-insensitive. In the current release this is only available for English strings.

Snippet search is always done in-memory for speed. You therefore need to tell the system how much memory to allocate for this facility and which columns need to be loaded for snippet search. To do this add the following to your sonard.conf file:

snippet_columns="retail.products.Product.Issuer,retail.products.Product.Description"
max_snippet_memory=10000M

This means that you are allocating 10GB of memory and will hold the Product.Issuer and Product.Description columns in the snippet memory for searches.

When SonarW starts you will see two relevant messages in the log file:

Starting snippets...
Loading snippets for 2 columns : <retail.products.Product.Issuer> <retail.products.Product.Description>

and:

Snippets memory usage: 9163 Mbytes / 10000  Mbytes  :
     Column retail.products.Product.Description : 259120526 document, 7716 Mbytes in 37638 blocks
     Column retail.products.Product.Issuer : 259120419 document, 1447 Mbytes in 37593 blocks

You can now add a $snippet clause to your find or to an aggregation $match. You can mix $snippet with any other find operator. For a snippet search you specify the field to search on and an array of words to look for. For example:

"Product.Description": {$snippet: {$search: ["joe", "jane"], $limit: 1000, $first: true}}

In this case SonarW will look through the "Product.Description" fields for matches on either joe or jane. Once it finds the first 1000 word matches (note: word matches, not document matches) it returns the result.

$limit can be any number up to 1 million (actually 1024*1024); any larger number is clamped to this hard limit.

There are two modes for working with $snippet, depending on whether you specify $first: true or $first: false. This controls the order of operations: does the GPU do the string matching first and then apply the rest of the conditions, or are the other conditions applied first, with the GPU search run only on the selected data. As an example:

db.products.find(
   {"Product.Description":{
      "$snippet":{"$limit":1000000,"$search":["Drum","Rope"],"$first":true}
   },
   "Product.MarketSectorDescription" : "sports"})

will first find the first 1,000,000 matches on either Drum or Rope and then apply the other conditions (in this case on Product.MarketSectorDescription).

If, on the other hand, you specify $first: false, the system first reduces the collection using all the other conditions and only then searches through the word matches. In this case $limit has no effect and everything is searched.

Which option is faster depends on your data and queries. The GPU is very fast, using thousands of cores, but if there is a huge number of word matches and most of the filtering is done by the other conditions, much of that work is wasted. If the other conditions drastically reduce the data set, it is usually better to apply them first. If the bulk of the workload is the word matching and matches are infrequent (yet a lot of data needs to be searched), then $first: true is usually preferable.

$geoWithin projection

Use the $geoWithin projection when a document contains both a single-ring polygon and a point, and you want to project whether or not the point is inside the polygon.

Example:

db.bar.find()
{ "_id": 1, "point" : [ 1.1, 2.2 ], "poly" : [ [ 1.8, 1.9 ], [ 2.3, 2.1 ], [ 3.3, 3.3 ] ]}
{ "_id": 2, "point" : [ 1.1, 2.2 ], "poly" : [ [ 1.1, 2.2 ], [ 1.8, 1.9 ], [ 2.3, 2.1 ], [ 3.3, 3.3 ] ]}

db.bar.aggregate({
    $project: {
        w: {$geoWithin: ['$point', '$poly']},
    }
})

{ "_id": 1, "w" : false }
{ "_id": 2, "w" : true }
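The test behind this projection can be sketched with the standard ray-casting point-in-polygon algorithm (a planar illustration only, not SonarW's implementation; behavior for points exactly on an edge is not specified here):

```javascript
// Ray casting: shoot a horizontal ray from the point and count how many
// polygon edges it crosses; an odd count means the point is inside.
function pointInPolygon(point, ring) {
  const [x, y] = point;
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses = (yi > y) !== (yj > y) &&
                    x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

pointInPolygon([1.1, 2.2], [[1.8, 1.9], [2.3, 2.1], [3.3, 3.3]]);  // false, as for _id 1
```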

geoNear command and $geoNear aggregation option: geoField

Use the geoField option to specify the field on which the geoNear operation should run.

A geoNear issued without specifying a field cannot run on a collection that has more than one geo index. This extension allows geoNear to run when you have geo indexes on different fields of the same collection.

Example:

db.a.drop()
db.a.createIndex({a:"2dsphere"})
db.a.createIndex({b:"2dsphere"})
db.a.insert({"_id": 0, a:[0,0], b:[1,1]})
db.a.insert({"_id": 1, a:[1,1], b:[0,0]})
db.a.aggregate({$geoNear: {near:[0,0], distanceField: "dis", geoField: "a", spherical: true}})

Results:
{ "_id" : 0, "a" : [ 0, 0 ], "b" : [ 1, 1 ], "dis" : 0 }
{ "_id" : 1, "a" : [ 1, 1 ], "b" : [ 0, 0 ], "dis" : 0.024682056391766436 }

Removing (Trimming) Whitespace

To trim whitespace from the left, right, or both sides of a string, use $ltrim, $rtrim, or $trim, respectively.

Syntax: {$trim: <string>}

Examples:

db.foo.find()
{ "_id" : 0, "a" : "  dandelion stack   " }

Query:
db.foo.aggregate({$project: {"hey": {$ltrim: "$a"}}})
Result:
{ "_id" : 0, "hey" : "dandelion stack   " }

Query:
db.foo.aggregate({$project: {"hey": {$rtrim: "$a"}}})
Result:
{ "_id" : 0, "hey" : "  dandelion stack" }

Query:
db.foo.aggregate({$project: {"hey": {$trim: "$a"}}})
Result:
{ "_id" : 0, "hey" : "dandelion stack" }

Replace in String

The $replaceAll projection operator replaces every occurrence of a string or regular expression within a string. Usage is:

$replaceAll:[<input_text>,<string_or_regex_to_search_for>,<replacement_string>]

For example:

db.ex.aggregate({'$project':{'example':{$replaceAll:['Protagonist woke up. Protagonist went to the store. Protagonist went back to bed','Protagonist','Geoff']}}})

{ "example" : "Geoff woke up. Geoff went to the store. Geoff went back to bed", "_id" : ObjectId("5684629fdb82dc919baa1f64") }

db.ex.aggregate({'$project':{'example':{$replaceAll:['<name> woke up. <first_name> went to the store. <name> went back to bed',/<\S*name>/,'Geoff']}}})

{ "example" : "Geoff woke up. Geoff went to the store. Geoff went back to bed", "_id" : ObjectId("5684629fdb82dc919baa1f64") }

Input can include field names. For example, if a document contains:

{ "d": "201512110600", "hostname": "sonarg38", "Name": "EXCEPTION_LOG" }

And you wish to replace all underscores with spaces in the Name field, use:

{$project: {Name: {$replaceAll: ["$Name", "_", " "]}}}
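The equivalent behavior in plain JavaScript (an illustration of the semantics, not SonarW code): a string pattern replaces every literal occurrence, while a regex pattern is applied globally:

```javascript
// Sketch of $replaceAll: replace ALL occurrences of a literal string or
// of a regular expression within the input text.
function replaceAllSketch(text, pattern, replacement) {
  if (typeof pattern === "string") {
    return text.split(pattern).join(replacement);     // every literal occurrence
  }
  return text.replace(new RegExp(pattern.source, "g"), replacement);
}

replaceAllSketch("EXCEPTION_LOG", "_", " ");                        // "EXCEPTION LOG"
replaceAllSketch("<name> woke. <first_name> left.", /<\S*name>/, "Geoff");
// "Geoff woke. Geoff left."
```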

String Length

$length returns the length of a string, e.g.:

> db.session.aggregate({$project: {a: {$length: "$Access Id"}  ,  "Access Id":1 }})
{ "a" : NumberLong(2), "Access Id" : "29", "_id" : NumberLong("10000000001") }
{ "a" : NumberLong(3), "Access Id" : "x14", "_id" : NumberLong("10000000002") }
{ "a" : NumberLong(2), "Access Id" : "x7", "_id" : NumberLong("10000000003") }

Encrypting/Decrypting Strings

Use $aesEncrypt and $aesDecrypt to perform AES encryption and decryption of strings. The result of $aesEncrypt is a BinData value, and both operators take a key, which must be a string. For example, to encrypt the password field (to a new collection) do:

db.datasources.aggregate({$project: { encrypted_pwd: { $aesEncrypt:['$password','s7p3rs3cr3t']}}},{$out: "pwdok"})

To get the passwords back do:

db.pwdok.aggregate( {$project: { orig_pwd: { $aesDecrypt: ['$encrypted_pwd', 's7p3rs3cr3t']}}})

Only string fields can be encrypted/decrypted and the key must also be a string.

Concatenation of a Null Value

In SonarW, when $concat encounters a null (a null value or a missing field), it substitutes an empty string "" and performs the rest of the concatenation for that document.

Example:

> db.test.find()
{ "_id" : 1, "f_name" : "John", "m_name" : "basher", "l_name" : "Doe" }
{ "_id" : 2, "f_name" : "Jane", "l_name" : "Doe" }
{ "_id" : 3, "f_name" : "John", "m_name" : "Junior", "l_name" : "Doe" }

Sonar:
> db.test.aggregate({"$project":{"full name":{"$concat":["$f_name"," ","$m_name"," ","$l_name"]}}})
{ "full name" : "John basher Doe", "_id" : 1 }
{ "full name" : "Jane  Doe", "_id" : 2 }
{ "full name" : "John Junior Doe", "_id" : 3 }

Mongo:
> db.test.aggregate({"$project":{"full name":{"$concat":["$f_name"," ","$m_name"," ","$l_name"]}}})
{ "_id" : 1, "full name" : "John basher Doe" }
{ "_id" : 2, "full name" : null }
{ "_id" : 3, "full name" : "John Junior Doe" }
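The difference can be sketched in plain JavaScript (illustration only; undefined stands in for a missing field):

```javascript
// SonarW's $concat: a null/missing operand becomes "" and the rest of
// the concatenation still happens.
function concatSonarW(parts) {
  return parts.map(p => (p === null || p === undefined) ? "" : p).join("");
}

// MongoDB's $concat: any null/missing operand makes the whole result null.
function concatMongo(parts) {
  return parts.some(p => p === null || p === undefined) ? null : parts.join("");
}

concatSonarW(["Jane", " ", undefined, " ", "Doe"]);  // "Jane  Doe"
concatMongo(["Jane", " ", undefined, " ", "Doe"]);   // null
```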

Searching for Text Only within a Named Field

$tsearch does a text search restricted to a named field. For example, if the collection contains:

> use tsearch
> db.prods.drop()
> db.prods.createIndex({ "Product.Description": "text", "Product.ShortDescription": "text" })
> db.prods.insert(
        [
        { "_id" : 25029537, "Product" : { "Description" : "Drill Brain Passport Coffee racquet Ears Rope Cup School Vacuum Cave Butterfly", "ShortDescription" : "Eraser Worm" } },
        { "_id" : 12852162, "Product" : { "Description" : "Hose Aircraft Banana Cappuccino Box Post-office Chisel Square Coffee-shop Map Light Shower", "ShortDescription" : "Drill Library" } },
        { "_id" : 36714549, "Product" : { "Description" : "Rifle Chess Drum Circus Compact Kitchen Stomach Bank Magnet Restaurant Fire Hat", "ShortDescription" : "Bee Leg Drill" } }
       ]
 )

The full text search using $text:

> db.prods.find({$text: {$search: "Drill"}}, {_id: 1})

returns all three documents:

{ "_id" : 25029537 }
{ "_id" : 12852162 }
{ "_id" : 36714549 }

Using $tsearch to restrict to the Product.Description field:

> db.prods.find({"Product.Description": {$tsearch: "Drill"}}, {_id: 1})

we see only one document returned:

{ "_id" : 25029537 }

Using $tsearch to restrict to the Product.ShortDescription field:

> db.prods.find({"Product.ShortDescription": {$tsearch: "Drill"}}, {_id: 1})

returns the other two documents:

{ "_id" : 12852162 }
{ "_id" : 36714549 }

Regular Expression Extensions

Projecting a Regular Expression Match ($regex and $regexp in $project)

Use $regex to project the result of a regular expression match. The syntax lets you project a boolean indicating whether a match occurred, or extract part of the string using the capture-group (parentheses) syntax.

For example, assume the following data:

> db.instance.aggregate(
   {$match: {"Objects and Verbs": {$exists: 1}}},
   {$project: {"Objects and Verbs":1, _id:0}})
{ "Objects and Verbs" : "db31_ycs GRANT" }
{ "Objects and Verbs" : null }
{ "Objects and Verbs" : "GUARD_TABLE784 CREATE TABLE" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT" }
{ "Objects and Verbs" : "GUARD_TABLE784 INSERT" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT" }

To project the result of the regex match testing which rows contain an INSERT (a boolean), use:

> db.instance.aggregate(
   {$match: {"Objects and Verbs": {$exists: 1}}},
   {$project: {"Objects and Verbs":1, _id:0, bool: {$regex: ["$Objects and Verbs", /.* INSERT/]}}})
{ "Objects and Verbs" : "db31_ycs GRANT", "bool" : false }
{ "Objects and Verbs" : null, "bool" : false }
{ "Objects and Verbs" : "GUARD_TABLE784 CREATE TABLE", "bool" : false }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : false }
{ "Objects and Verbs" : "GUARD_TABLE784 INSERT", "bool" : true }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : false }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : false }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : false }

To extract the table name (the first string) for each insert use:

> db.instance.aggregate(
      {$match: {"Objects and Verbs": {$exists: 1}}},
      {$project: {"Objects and Verbs":1, _id:0, bool: {$regex: ["$Objects and Verbs", /(.*) INSERT/]}}})
{ "Objects and Verbs" : "db31_ycs GRANT", "bool" : "" }
{ "Objects and Verbs" : null, "bool" : "" }
{ "Objects and Verbs" : "GUARD_TABLE784 CREATE TABLE", "bool" : "" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : "" }
{ "Objects and Verbs" : "GUARD_TABLE784 INSERT", "bool" : "GUARD_TABLE784" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : "" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : "" }
{ "Objects and Verbs" : "GUARD_TABLE784 SELECT", "bool" : "" }
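
The two projections above follow the same rule: without capture groups, $regex projects a boolean; with capture groups, it projects the captured text (or an empty string on no match), and multiple groups yield an array. A Python sketch of these semantics, based on the examples above (the function name `sonar_regex` is illustrative, not a SonarW API):

```python
import re

def sonar_regex(value, pattern):
    """Approximate the $regex projection semantics shown above.

    No capture groups -> boolean (did the pattern match?).
    One capture group -> the captured text, or "" on no match.
    Several groups    -> a list of captured strings.
    """
    m = re.search(pattern, value or "")
    if re.compile(pattern).groups == 0:
        return m is not None
    if m is None:
        return ""
    groups = m.groups()
    return groups[0] if len(groups) == 1 else list(groups)

print(sonar_regex("GUARD_TABLE784 INSERT", r".* INSERT"))    # True
print(sonar_regex("GUARD_TABLE784 INSERT", r"(.*) INSERT"))  # "GUARD_TABLE784"
print(sonar_regex("db31_ycs GRANT", r"(.*) INSERT"))         # ""
print(sonar_regex("no number sry buddy", r"(.*) (.*) (.*) (.*)"))
# ['no', 'number', 'sry', 'buddy']
```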

A $regex can accept an array to operate on, making the combination of $regex and $split very useful. For example, if you have a string field such as:

db.fullsql.insert({"fullsql":"SELECT SALARY FROM EMP ; SELECT ADDRESS FROM DEPT;  INSERT SALARY INTO EMP"})

and you want to project all the statements that work on the EMP table only, you can use a split with a regex such as:

db.fullsql.aggregate([{'$project':{'emp':{'$regex':[{"$split":["$fullsql",";"]},/(.*)EMP/i]},'_id':0}}])

{ "emp" : [ "SELECT SALARY FROM ", "  INSERT SALARY INTO " ] }

Finally, you can have multiple matches in a single $regex using multiple parentheses expressions. For example, to match all words separated by spaces in a string and return an array of the strings do:

db.t1.insert({a: 'no number sry buddy'})

db.t1.aggregate({$project: {words: {$regex: ['$a',  /(.*) (.*) (.*) (.*)/ ]}}})

to get:

 {"words" : [ "no", "number", "sry", "buddy" ] }

You can also operate a $regex on an array. If t1 has:

{"a":['0000123456','no number sry buddy','12344444445'] },
{"a":"1234556"}
{"a":["This is a string","This is also one",'123456789'] }

performing the above aggregation yields:

{ "words" : [ [ "no", "number", "sry", "buddy" ] ] }, // only second array matched
{ "words" : false },   // because there was no match
{ "words" : [ [ "This", "is", "a", "string" ], [ "This", "is", "also", "one" ] ] } // first and second matched

Note that if there is a single match you will get that match and not an array with one element.

Note that you can use the $regex projection primitive in a $match using an $expr. For example, to split a string similar to that used in the $split example and then filter only those that do not start with GUARD_TABLE, use:

> db.instance.aggregate(
   {$project: {arr: {$split: ["$Objects and Verbs",";"]}}},
   {$match: {$expr: {$regex: ["$arr", /^(?!GUARD_TABLE)/]}}},
   {$project: {sql: {$arrayElemAt: ["$arr", 0]}}})

{ "sql" : "scl_tbl0_jik CREATE TABLE", "_id" : ObjectId("5668f102b34d03395f0c9ed2") }
{ "sql" : "db32_mzn REVOKE", "_id" : ObjectId("5668f102b34d03395f0c64a6") }
{ "sql" : "db32_uic GRANT", "_id" : ObjectId("5668f102b34d03395f0c6340") }
{ "sql" : "db31_ywr REVOKE", "_id" : ObjectId("5668f102b34d03395f0cb025") }
{ "sql" : "db31_wjj REVOKE", "_id" : ObjectId("5668f102b34d03395f0cad3d") }
{ "sql" : "db32_jor GRANT", "_id" : ObjectId("5668f102b34d03395f0c67c4") }
{ "sql" : "db32_sqk REVOKE", "_id" : ObjectId("5668f102b34d03395f0c6612") }

As another example, assume the following documents:

> db.rt.find()
{ "_id" : ObjectId("56381aa60ad0a0038c437fa0"), "a" : "ze hu 747" }
{ "_id" : ObjectId("56381aa80ad0a0038c437fa1"), "a" : "ze hu 737" }
{ "_id" : ObjectId("56381c360ad0a0038c437fa2"), "a" : "ze hu 707" }
{ "_id" : ObjectId("56381c380ad0a0038c437fa3"), "a" : "ze hu 737" }
{ "_id" : ObjectId("56381c3a0ad0a0038c437fa4"), "a" : "ze hu 777" }
{ "_id" : ObjectId("56381c400ad0a0038c437fa5"), "a" : "ze hu 210-300" }
{ "_id" : ObjectId("56381cf80ad0a0038c437fa6"), "a" : 5434 }

To match the middle numeral between the two 7s (effectively the Boeing aircraft type), project:

> db.rt.aggregate({$project: { t: {$regex: ['$a',"/7(\\d)7/"  ]}   }  })
{ "t" : "4", "_id" : ObjectId("56381aa60ad0a0038c437fa0") }
{ "t" : "3", "_id" : ObjectId("56381aa80ad0a0038c437fa1") }
{ "t" : "0", "_id" : ObjectId("56381c360ad0a0038c437fa2") }
{ "t" : "3", "_id" : ObjectId("56381c380ad0a0038c437fa3") }
{ "t" : "7", "_id" : ObjectId("56381c3a0ad0a0038c437fa4") }
{ "t" : "", "_id" : ObjectId("56381c400ad0a0038c437fa5") }
{ "t" : "", "_id" : ObjectId("56381cf80ad0a0038c437fa6") }

Note that if a match is not made, an empty string is projected.

If you exclude the parentheses syntax a true/false indicator is projected:

> db.rt.aggregate({$project: { t: {$regex: ['$a',"/7\\d7/"  ]}   }  })
{ "t" : true, "_id" : ObjectId("56381aa60ad0a0038c437fa0") }
{ "t" : true, "_id" : ObjectId("56381aa80ad0a0038c437fa1") }
{ "t" : true, "_id" : ObjectId("56381c360ad0a0038c437fa2") }
{ "t" : true, "_id" : ObjectId("56381c380ad0a0038c437fa3") }
{ "t" : true, "_id" : ObjectId("56381c3a0ad0a0038c437fa4") }
{ "t" : false, "_id" : ObjectId("56381c400ad0a0038c437fa5") }
{ "t" : false, "_id" : ObjectId("56381cf80ad0a0038c437fa6") }

NOTE: Some drivers (e.g. the Java driver) expect a string after a $regex operator, since this is how a regex is expressed there; i.e. where the shell uses /some/, in Java you would use {$regex: "some"}. Therefore, when you use the Java driver you cannot use $regex as shown above, since the driver expects a string after the $regex (and the above specifies an array). In this case use $regexp - which is a SonarW synonym for this feature.

NOTE: You can use strings instead of true regular expression types, e.g.:

{$project: {is_it_matching: {$regex: ["$name", ["Chris", "/Jane/i", "Mark"]]}}}  // match Chris, Mark or case-insensitive Jane

SQL-Like Matching

Use $like for matching similar to regular expressions, but with the % character serving as a wildcard for any number of characters. $like takes strings as input and always performs case-insensitive matches, e.g.:

> db.bar.aggregate({$match:{$expr:{$like:["$a",["G%T", "%LOR"]]}}})
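
A Python sketch of the assumed $like semantics (each % becomes a multi-character wildcard, matching is case-insensitive over the whole string, and any pattern in the list may match); the function name `sql_like` is illustrative:

```python
import re

def sql_like(value, patterns):
    """Sketch of $like: % is a wildcard for any number of characters,
    matching is case-insensitive, and any listed pattern may match."""
    for p in patterns:
        regex = "^" + ".*".join(re.escape(part) for part in p.split("%")) + "$"
        if re.match(regex, value, re.IGNORECASE):
            return True
    return False

print(sql_like("GRANT", ["G%T", "%LOR"]))   # True: matches G%T
print(sql_like("SAILOR", ["G%T", "%LOR"]))  # True: matches %LOR
print(sql_like("DOG", ["G%T", "%LOR"]))     # False
```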

Regex in Field Names

Another extension that is very useful for wide searches when you don’t know precisely what you are searching for (or, more precisely, where you are searching), is the ability to search in all fields based on a regex.

For example, suppose that you have various fields that are all sorts of names:

> db.people.find()
{ "_id" : ObjectId("5410e508a38e812d3e6f422d"), "first_name" : "John", "last_name" : "Doe", "nickname" : "J", "previous_name" : "Locco" }
{ "_id" : ObjectId("5410e53ea38e812d3e6f422e"), "first_name" : "Jane", "last_name" : "Doe", "maiden_name" : "Lombardy", "nickname" : "Jenna" }
{ "_id" : ObjectId("5410e55ea38e812d3e6f422f"), "first_name" : "Chris", "last_name" : "Lombardy", "nickname" : "Chrissy" }

And you want to search for all people that have the string “Lombardy” in any name - not in a specific name. Then do:

> db.people.find({"/name/": "Lombardy"})
{ "_id" : ObjectId("5410e53ea38e812d3e6f422e"), "first_name" : "Jane", "last_name" : "Doe", "maiden_name" : "Lombardy", "nickname" : "Jenna" }
{ "_id" : ObjectId("5410e55ea38e812d3e6f422f"), "first_name" : "Chris", "last_name" : "Lombardy", "nickname" : "Chrissy" }
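
Conceptually, the /name/ query applies the regex to field *names* rather than values. A Python sketch of that idea over a single document (the helper `match_any_field` is illustrative, not part of SonarW):

```python
import re

def match_any_field(doc, name_pattern, value):
    """Match a document if any field whose *name* matches the regex
    holds the given value (sketch of the /name/ field-name syntax)."""
    pat = re.compile(name_pattern)
    return any(pat.search(field) and val == value for field, val in doc.items())

jane = {"first_name": "Jane", "last_name": "Doe",
        "maiden_name": "Lombardy", "nickname": "Jenna"}
john = {"first_name": "John", "last_name": "Doe",
        "nickname": "J", "previous_name": "Locco"}
print(match_any_field(jane, "name", "Lombardy"))  # True
print(match_any_field(john, "name", "Lombardy"))  # False
```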

Regex with Array

The $regex operator can also work with array inputs, in which case the input array is reduced to the matching strings.

For example:

> db.rt.insert([{"s":['747','757','737','727','717']}, {"s": '747'}, {"s": '737'}])
> db.rt.aggregate([{$project:{'x':{$regex:['$s', [/2/, /3/]]}, "_id" : 0}}])
{ "x" : [ "737", "727" ] }
{ "x" : false }
{ "x" : true }

Using % as a Wildcard

It is common to use the percent wildcard character '%' within SQL queries. NoSQL databases use regular expressions instead, where the syntax ".*" indicates any character, any number of times (i.e. the equivalent meaning).

As a convenience to SQL users, SonarW can treat the % character as this wildcard whenever the M option is set.

Note: This feature must be enabled by setting the following configuration in sonargd.conf:

use_percent_wildcard=true

In this case:

db.vet.drop()
db.vet.insert({ "_id" : 1, "b" : "dog" })
db.vet.insert({ "_id" : 2, "b" : "cat" })
db.vet.insert({ "_id" : 3, "b" : "dig" })

db.vet.aggregate([{'$match': { 'b' : {'$regex': 'd%g', '$options': 'm'} } }])

{ "_id" : 1, "b" : "dog" }
{ "_id" : 3, "b" : "dig" }

Converting strings and arrays to a regular expression

The $toRegex operator can convert a given string, or array of strings, to an equivalent regular expression.

Pattern options can also be provided (using the /pattern/options syntax).

SonarW uses Perl Compatible Regular Expression (PCRE) for the conversion.

For example:

>db.test.insert([
{ "_id" : 1, "username" : "james_mark32", "age" : 32 },
{ "_id" : 2, "username" : "Mark.Holl15", "age" : 25 },
{ "_id" : 3, "username" : "sootline33", "age" : 28 },
{ "_id" : 4, "username" : "markkkz", "age" : 24 }
])
> var r = db.test.aggregate({$project: {x: {$toRegex: '/^mark/i'}, _id: 0}}, {$limit: 1})
> db.test.find( { username: { $regex: r.next().x } } )
{ "_id" : 2, "username" : "Mark.Holl15", "age" : 25 }
{ "_id" : 4, "username" : "markkkz", "age" : 24 }

Statistical Extensions

SonarW supports additional aggregation operators beyond $avg and $sum such as $var, $kElement, $med, $q1, $q3, $linreg, $cor.

Variance

The $var aggregation function calculates the variance of a group of values. Each value may be any numerical type (int/long/float). For example:

db.sample.insert([
{a: 1},
{a: 20},
{a: 25},
{a: 100}])

db.sample.aggregate({$group: {_id : 0, 'variance': {$var: '$a'}}})
{ "_id" : 0, "variance": 1424.25 }
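
The 1424.25 above corresponds to the population variance (dividing by the number of values, not n-1). A Python check of that arithmetic:

```python
# Population variance of the group, matching the $var output above.
values = [1, 20, 25, 100]
mean = sum(values) / len(values)                          # 36.5
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(variance)  # 1424.25
```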

K-th Element

The $kElement aggregation function reports the k-th element of a group of values. Each value may be any supported type. In order for $kElement to be meaningful, the elements must have a defined order; therefore, $kElement should follow a $sort. The syntax is as follows:

db.sample.insert([
{'a': {x: 5}, 'i': 1},
{'a': {x: 2}, 'i': 1},
{'a': {x: 4}, 'i': 1},
{'a': {x: 300}, 'i': 2},
{'a': {x: 400}, 'i': 2}
])

db.sample.aggregate([{$sort: {a: 1}}, {$group: {_id : '$i', 'kElement': {$kElement: [3,'$a']}}}])
{ "_id" : 1, "kElement" : { "x" : 5 } }
{ "_id" : 2 }
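
Note that the second group projects no kElement because it has fewer than three members. A Python sketch of that behavior (1-based indexing into the sorted group; `k_element` is an illustrative name):

```python
def k_element(sorted_values, k):
    """k-th (1-based) element of an already-sorted group; None when the
    group has fewer than k elements, mirroring the example above."""
    return sorted_values[k - 1] if len(sorted_values) >= k else None

group1 = sorted([{"x": 5}, {"x": 2}, {"x": 4}], key=lambda d: d["x"])
group2 = [{"x": 300}, {"x": 400}]
print(k_element(group1, 3))  # {'x': 5}
print(k_element(group2, 3))  # None
```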

Median

The $med aggregation function calculates the approximate median of a group of values. Each value may be any numerical type (int/long/float). Values are accumulated and reported as float. Approximate medians are much faster to compute than real medians. An example usage is:

db.t.insert([
{'a': 1, 'b': 1},
{'a': 100, 'b': 1},
{'a': 200, 'b': 1},
{'a': 300, 'b': 1},
{'a': 500, 'b': 2},
{'a': 600, 'b': 2},
{'a': 700, 'b': 2}
])

db.t.aggregate({$group: {_id : '$b', 'med': {$med: '$a'}}})
{ "_id" : 1, "med" : 200 }
{ "_id" : 2, "med" : 600 }

First Quartile

The $q1 aggregation function calculates the approximate first quartile of a group of values. Each value may be any numerical type (int/long/float). Values are accumulated and reported as float:

db.t.insert([
{'a': 1, 'b': 1},
{'a': 100, 'b': 1},
{'a': 200, 'b': 1},
{'a': 300, 'b': 1},
{'a': 500, 'b': 2},
{'a': 600, 'b': 2},
{'a': 700, 'b': 2}
])

db.t.aggregate({$group: {_id : '$b', 'q1': {$q1: '$a'}}})
{ "_id" : 1, "q1" : 100 }
{ "_id" : 2, "q1" : 500 }

Third Quartile

The $q3 aggregation function calculates the approximate third quartile of a group of values. Each value may be any numerical type (int/long/float). Values are accumulated and reported as float:

db.t.insert([
{'a': 1, 'b': 1},
{'a': 100, 'b': 1},
{'a': 200, 'b': 1},
{'a': 300, 'b': 1},
{'a': 500, 'b': 2},
{'a': 600, 'b': 2},
{'a': 700, 'b': 2}
])

db.t.aggregate({$group: {_id : '$b', 'q3': {$q3: '$a'}}})
{ "_id" : 1, "q3" : 300 }
{ "_id" : 2, "q3" : 700 }

Linear Regression

The $linreg aggregation function provides the linear regression of two variables. It calculates the line of best fit for the given points by means of least squares and returns the line defined by its slope and intercept.

Example:

db.sample.insert([ {a: 1,b:0},
{a: 20,b:-19},
{a: 25,b:-24},
{a: 100,b:-99}])

db.sample.aggregate({$group:{_id:0,'linreg':{'$linreg':['$a','$b']}}})
{ "_id" : 0, "linreg" : { "slope" : -1, "intercept" : 1 } }
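
The sample points satisfy b = 1 - a exactly, hence slope -1 and intercept 1. A Python sketch of the ordinary least-squares computation (the function name `linreg` is illustrative, not the SonarW implementation):

```python
def linreg(points):
    """Ordinary least-squares line of best fit: (slope, intercept)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

print(linreg([(1, 0), (20, -19), (25, -24), (100, -99)]))  # (-1.0, 1.0)
```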

Correlation Coefficient

The $cor aggregation function provides the correlation coefficient between two variables. It measures how close the line of best fit approximates the relationship of the two variables with the sign being the sign of the slope of the line of best fit. The strength of relationship can be anywhere between −1 and +1. The stronger the correlation, the closer the correlation coefficient comes to ±1. If the coefficient is a positive number, the variables are directly related. If, on the other hand, the coefficient is a negative number, the variables are inversely related. 0 means no correlation.

Example:

db.sample.insert([ {a: 1,b:0},
{a: 20,b:-19},
{a: 25,b:-24},
{a: 100,b:-99}])

> db.sample.aggregate({$group:{_id:0,'cor':{'$cor':['$a','$b']}}})
{ "_id" : 0, "cor" : -1 }

> db.sample.insert({a:5,b:21.2})

> db.sample.aggregate({$group:{_id:0,'cor':{'$cor':['$a','$b']}}})
{ "_id" : 0, "cor" : -0.9726638336871489 }
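
$cor computes the Pearson correlation coefficient. A Python sketch of that formula (the function name `pearson` is illustrative); for the original four perfectly linear points it returns -1:

```python
import math

def pearson(points):
    """Pearson correlation coefficient of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points))
    return cov / (sx * sy)

r = pearson([(1, 0), (20, -19), (25, -24), (100, -99)])
print(r)  # ≈ -1.0: the points lie exactly on a line with negative slope
```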

Numerical Extensions

Random Numbers

SonarW can project random numbers of various types (e.g. integer or float) and in various ranges. Assuming that foo has two documents then:

> db.foo.aggregate({$project: {r: {$random:1.0}}})
{ "r" : 0.12272948849980655, "_id" : ObjectId("5600a4717eef85863c51d8b3") }
{ "r" : 0.7621166506591738, "_id" : ObjectId("5600a4f37eef85863c51d8b4") }

> db.foo.aggregate({$project: {r: {$random:100.0}}})
{ "r" : 84.17849744497171, "_id" : ObjectId("5600a4717eef85863c51d8b3") }
{ "r" : 43.139700041881554, "_id" : ObjectId("5600a4f37eef85863c51d8b4") }

> db.foo.aggregate({$project: {r: {$random:NumberInt(100)}}})
{ "r" : 54, "_id" : ObjectId("5600a4717eef85863c51d8b3") }
{ "r" : 16, "_id" : ObjectId("5600a4f37eef85863c51d8b4") }

> db.foo.aggregate({$project: {r: {$random:true}}})
{ "r" : true, "_id" : ObjectId("5600a4717eef85863c51d8b3") }
{ "r" : false, "_id" : ObjectId("5600a4f37eef85863c51d8b4") }

When a numeric is used, the random number will be between zero and the specified number.

Round and Truncate

To truncate a floating point number to an integer use the $truncate operator. For example, if:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: -800}, def: {$multiply: ["$start_date", 1.5523764]}}})
{ "abc" : ISODate("2015-03-04T15:27:42Z"), "def" : 3118.7241876, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:27:42Z"), "def" : 3120.276564, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:27:42Z"), "def" : 3117.1718112, "wp" : "WP 19231" }

Then:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: -800}, def: {$truncate: {$multiply: ["$start_date", 1.5523764]}}}})
{ "abc" : ISODate("2015-03-04T15:28:16Z"), "def" : 3118, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:28:16Z"), "def" : 3120, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:28:16Z"), "def" : 3117, "wp" : "WP 19231" }

Similarly, to round the number to the nearest integer:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: -800}, def: {$round: {$multiply: ["$start_date", 1.5523764]}}}})
{ "abc" : ISODate("2015-03-04T15:29:32Z"), "def" : 3119, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:29:32Z"), "def" : 3120, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:29:32Z"), "def" : 3117, "wp" : "WP 19231" }

and to round it to the nearest 2nd digit after the decimal point:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: -800}, def: {$round: [{$multiply: ["$start_date", 1.5523764]}, 2]}}})
{ "abc" : ISODate("2015-03-04T15:33:32Z"), "def" : 3118.72, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:33:32Z"), "def" : 3120.28, "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:33:32Z"), "def" : 3117.17, "wp" : "WP 19231" }
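
The three operations map directly onto familiar numeric primitives; a Python check of the arithmetic on the first value above (note Python's round uses banker's rounding at exact .5 ties, which may differ from SonarW's behavior in that edge case):

```python
v = 3118.7241876
truncated = int(v)        # $truncate: drop the fraction -> 3118
rounded = round(v)        # $round: nearest integer      -> 3119
two_digits = round(v, 2)  # $round with 2 digits         -> 3118.72
print(truncated, rounded, two_digits)
```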

Date Extensions

Current Date/Time

The $now operator will retrieve the current UTC time from the machine on which SonarW is running and can be used as part of a query. For example:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: null}}})
{ "abc" : ISODate("2015-03-04T23:26:19Z"), "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T23:26:19Z"), "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T23:26:19Z"), "wp" : "WP 19231" }

You can also add an offset for application-level setting of a timezone. For example, to get the times in Pacific Standard Time:

> db.schedule.aggregate({$limit: 3},{$project: {wp:1, abc: {$now: -800}}})
{ "abc" : ISODate("2015-03-04T15:26:35Z"), "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:26:35Z"), "wp" : "WP 19231" }
{ "abc" : ISODate("2015-03-04T15:26:35Z"), "wp" : "WP 19231" }

Converting a String to an ISODate

SonarW provides a conversion function from a string to an ISODate. The syntax follows that of $dateToString, where date: is the string from which to create a date and format: is the format string. For example, if $str is a string holding a date of the form 2014-11-03, then the following generates an ISODate from that string:

$stringToDate: {
       format: "%Y-%m-%d",
       date: "$str"
     }

For converting both date and time use the full date-time format. For example when the date string provides the date and time information in ISO format (“2016-05-27T20:41:06.712Z”) then the conversion format will look like:

$stringToDate: {
     format: "%Y-%m-%dT%H:%M:%S.%LZ",
     date: "$str"
  }

Note: You can use any subset of the full format, as long as the string to be parsed conforms to the format provided. For example:

$stringToDate: {
   format: "%Y-%m-%dT%H:%M",
   date: "2014-05-27T20:41"
}

You can also control the timezone in which the UTC date should be shown using the tz field, for example:

{"$project" :
        {
        "_id" : 0,
        "inPacific" : {
                "$dateToString" : {
                        "date" : ISODate("2017-06-01T16:06:40.023Z"),
                        "format" : "%H:%M:%S",
                        "tz" : "PST"
} } } }
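
These format strings correspond closely to strftime/strptime patterns. A Python sketch of the two conversions above (assumption: SonarW's %L milliseconds field plays the role of Python's %f microseconds field):

```python
from datetime import datetime

# $stringToDate with format "%Y-%m-%d" corresponds to:
d = datetime.strptime("2014-11-03", "%Y-%m-%d")
print(d.date())  # 2014-11-03

# Full ISO date-time; Python spells fractional seconds %f where SonarW uses %L:
d2 = datetime.strptime("2016-05-27T20:41:06.712Z", "%Y-%m-%dT%H:%M:%S.%fZ")
print(d2.isoformat())  # 2016-05-27T20:41:06.712000
```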

Adding/Subtracting a Month from a Date

Use $addMonth to add (or subtract) a number of months from an ISODate, for example:

{ '$project': { '_id':0, 'date': {'$addMonth':['$mydate', '$offset']}} }

To add or subtract days, weeks, years, hours etc. use standard $add and $subtract.

Computing the Quarter a Date is in

$quarter returns the quarter that an ISODate is in. It is similar to $month, $year etc. - but returns the quarter within the year.

Extracting the Date and Time from an ISODate

$toDate will leave the date component of an ISODate and $toTime will leave the time component of an ISODate (zeroing out the part that is not needed). For example, if test has a document:

{a: ISODate("2015-12-18T19:36:29.369Z")}

then:

> db.test.aggregate({$project:{n:{$toDate:"$a"},_id:0}})
{ "n" : ISODate("2015-12-18T00:00:00Z") }
> db.test.aggregate({$project:{n:{$toTime:"$a"},_id:0}})
{ "n" : ISODate("1970-01-01T19:36:29.369Z") }

The operand to these functions can be an ISODate or a string - but the string must be formatted using the ISO date format. Other forms of dates may be parsed from strings using $stringToDate, which follows the same style as $dateToString.
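
$toDate zeroes the time-of-day components, while $toTime resets the date to the epoch. A Python sketch of the same example:

```python
from datetime import datetime

ts = datetime(2015, 12, 18, 19, 36, 29, 369000)
date_part = ts.replace(hour=0, minute=0, second=0, microsecond=0)  # $toDate
time_part = ts.replace(year=1970, month=1, day=1)                  # $toTime
print(date_part)  # 2015-12-18 00:00:00
print(time_part)  # 1970-01-01 19:36:29.369000
```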

Single Digit Hour and AM/PM Format designators

Use %I for a single digit hour and %P for am/pm designation, for example:

> db.t1.aggregate({$project:{orig: "$Event Date and Time", d:{$stringToDate:{date:"$Event Date and Time",format:"%m/%d/%Y %I:%M:%S %P%/[mM]/"}}}})
{ "orig" : "10/07/2015 11:34:06 PM", "d" : ISODate("2015-10-07T23:34:06Z"), "_id" : ObjectId("57a29a10277fff0b2a4b79e7") }

Average Date Calculation

$avg can be used to calculate the average date (average milliseconds since epoch).

Example:

use test;
db.test.drop();
db.test.insert({"a": ISODate("2017-12-08T23:27:40.552Z"), "b": ISODate("2010-01-01")});
db.test.aggregate([{"$project": {"average_date": {"$avg": ["$a", "$b"]}}}])

Produces:

{ "_id" : ObjectId("5a2b39ba9a242a74e37404eb"), "average_date" : ISODate("2013-12-20T11:43:50.276Z") }
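
Since the average is taken over milliseconds since the epoch, it is simply the midpoint of the two instants. A Python check of the result above:

```python
from datetime import datetime, timezone

a = datetime(2017, 12, 8, 23, 27, 40, 552000, tzinfo=timezone.utc)
b = datetime(2010, 1, 1, tzinfo=timezone.utc)
# Average of the two epoch-millisecond values, as $avg computes it:
avg_ms = (a.timestamp() * 1000 + b.timestamp() * 1000) / 2
avg = datetime.fromtimestamp(avg_ms / 1000, tz=timezone.utc)
print(avg.isoformat())  # 2013-12-20T11:43:50.276000+00:00
```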

Array and Object Extensions

Arbitrary Expressions in Aggregation Stages

SonarW allows you to write projection expressions directly into group, sort, match and join phases. This often eliminates the need for separate $project phases and thus improves performance.

For example, in a sort stage you can supply an array of pairs - each pair specifying what to sort by and whether to sort ascending or descending. As an example, to sort ascending by a+b and then descending by rank do:

{$sort:[{$add:["$a","$b"]},1,"$rank",-1]}

Sorting the Contents of an Array and Removing Duplicates

The $sorted projection operator sorts an array, producing an array with the same elements as the source array but in sorted order. For example:

> db.coll.drop()
> db.coll.insert([ { "_id" : 0, "data" : [ 9, 4 ] },
                { "_id" : 1, "data" : [ 0, 8, 4, 8, 0 ] },
                { "_id" : 2, "data" : [ 6, 3 ] },
                { "_id" : 3, "data" : [ 7, 4, 7, 6, 3, 6 ] },
                { "_id" : 4, "data" : [ 3, 2, 4, 2, 6 ] },
                { "_id" : 5, "data" : [ 3, 8, 8 ] },
                { "_id" : 6, "data" : [ 5, 5, 2, 3, 8, 0, 4, 4 ] },
                { "_id" : 7, "data" : [ 0, 6, 0 ] },
                { "_id" : 8, "data" : [ 8, 9, 2, 8, 3, 6 ] },
                { "_id" : 9, "data" : [ 1, 4, 7, 4, 0 ] } ])

> db.coll.aggregate([{ "$project" : { "_id" : 0, "m" : { "$sorted" : "$data" } } }])
{ "m" : [ 4, 9 ] }
{ "m" : [ 0, 0, 4, 8, 8 ] }
{ "m" : [ 3, 6 ] }
{ "m" : [ 3, 4, 6, 6, 7, 7 ] }
{ "m" : [ 2, 2, 3, 4, 6 ] }
{ "m" : [ 3, 8, 8 ] }
{ "m" : [ 0, 2, 3, 4, 4, 5, 5, 8 ] }
{ "m" : [ 0, 0, 6 ] }
{ "m" : [ 2, 3, 6, 8, 8, 9 ] }
{ "m" : [ 0, 1, 4, 4, 7 ] }

The $unique projection operator sorts an array and removes duplicates, producing an ordered array with no duplicate members.
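
Both operators have direct analogues on plain lists; a Python check using the data of _id 6 above:

```python
data = [5, 5, 2, 3, 8, 0, 4, 4]
sorted_data = sorted(data)       # $sorted: keeps duplicates
unique_data = sorted(set(data))  # $unique: sorted, duplicates removed
print(sorted_data)  # [0, 2, 3, 4, 4, 5, 5, 8]
print(unique_data)  # [0, 2, 3, 4, 5, 8]
```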

Creating a Hash that Identifies the Unique Contents of an Array

$hashOfSet takes an array and returns a NumberLong that is a computed hash of the elements in the array, excluding duplicates, and irrespective of order. For example, you could use $unique with $hashOfSet when you want to associate a unique signature based on a collection of indicators stored in an array:

> db.coll.drop()
> db.coll.insert([
    { "task" : "project", "name" : "Charlie" }, { "task" : "assignment", "name" : "Susan" }, { "task" : "exercise", "name" : "Charlie" },
    { "task" : "assignment", "name" : "Susan" }, { "task" : "exercise", "name" : "Charlie" }, { "task" : "exercise", "name" : "Greg" },
    { "task" : "lecture", "name" : "Sally" }, { "task" : "grading", "name" : "Jim" }, { "task" : "lecture", "name" : "Sally" },
    { "task" : "work-study", "name" : "Joe" }, { "task" : "grading", "name" : "Bill" }, { "task" : "lab", "name" : "Jim" },
    { "task" : "lab", "name" : "Bob" }, { "task" : "lab prep", "name" : "Fred" }, { "task" : "project", "name" : "Jim" },
    { "task" : "assignment", "name" : "Bill" }, { "task" : "assignment", "name" : "Bill" }, { "task" : "exercise", "name" : "Charlie" },
    { "task" : "grading", "name" : "Bill" }, { "task" : "lab", "name" : "Susan" }, { "task" : "lecture", "name" : "Joe" },
    { "task" : "project", "name" : "Charlie" }, { "task" : "lab", "name" : "Susan" }, { "task" : "lecture", "name" : "Jim" },
    { "task" : "lab", "name" : "Alice" }, { "task" : "assignment", "name" : "Fred" }, { "task" : "project", "name" : "Bill" },
    { "task" : "lecture", "name" : "Sally" }, { "task" : "exercise", "name" : "Sally" }, { "task" : "work-study", "name" : "Bill" },
    { "task" : "lab prep", "name" : "Jim" }, { "task" : "grading", "name" : "Bob" }, { "task" : "assignment", "name" : "Jim" },
    { "task" : "lab", "name" : "Jim" }, { "task" : "lab prep", "name" : "Charlie" }, { "task" : "work-study", "name" : "Fred" },
    { "task" : "project", "name" : "Joe" }, { "task" : "grading", "name" : "Joe" }, { "task" : "assignment", "name" : "Alice" },
    { "task" : "grading", "name" : "Bob" }, { "task" : "work-study", "name" : "Joe" }, { "task" : "project", "name" : "Joe" },
    { "task" : "lab", "name" : "Alice" }, { "task" : "exercise", "name" : "Fred" }, { "task" : "work-study", "name" : "Bill" },
    { "task" : "assignment", "name" : "Alice" }, { "task" : "exercise", "name" : "Fred" }, { "task" : "exercise", "name" : "Charlie" },
    { "task" : "project", "name" : "Greg" }, { "task" : "lecture", "name" : "Greg" }
])

> db.coll.aggregate( [
   {  "$group" : {  "_id" : "$name",  "tasks" : {  "$push" : "$task" } } },
   {  "$group" : {  "_id" : {  "$hashOfSet" : "$tasks" },  "tasks" : {  "$first" : {  "$unique" : "$tasks" } },  "count" : {  "$sum" : 1 } } },
   {  "$project" : {  "_id" : 0,  "*" : 1 } } ])
{ "tasks" : [ "assignment", "grading", "lab", "lab prep", "lecture", "project" ], "count" : 1 }
{ "tasks" : [ "exercise", "lecture", "project" ], "count" : 1 }
{ "tasks" : [ "assignment", "lab" ], "count" : 2 }
{ "tasks" : [ "exercise", "lecture" ], "count" : 1 }
{ "tasks" : [ "assignment", "grading", "project", "work-study" ], "count" : 1 }
{ "tasks" : [ "grading", "lab" ], "count" : 1 }
{ "tasks" : [ "exercise", "lab prep", "project" ], "count" : 1 }
{ "tasks" : [ "assignment", "exercise", "lab prep", "work-study" ], "count" : 1 }
{ "tasks" : [ "grading", "lecture", "project", "work-study" ], "count" : 1 }
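
The key property of $hashOfSet is that the hash ignores both duplicates and element order. A Python sketch of that contract (using a frozenset hash; `hash_of_set` is an illustrative name, and the actual SonarW hash function differs):

```python
def hash_of_set(items):
    """Hash over an array's distinct elements, independent of order and
    duplicates - frozenset hashing gives both properties for free."""
    return hash(frozenset(items))

a = hash_of_set(["lab", "assignment", "lab", "grading"])
b = hash_of_set(["grading", "assignment", "lab"])
print(a == b)  # True: same set of tasks, same signature
```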

Flattening Nested Arrays to a Single Array

Use $flatten to flatten nested arrays into a single array. Arrays of arrays (and more nesting) will be flattened into a single-level array. Non-array elements will remain in the top-level array. For example, if:

> db.t1.find()
{ "_id" : ObjectId("567331feea56094168280431"), "a" : [ [ 1, 2 ], [ 3, 4 ], [ "a", "v" ], [ { "a" : 1, "b" : "aaa" }, 1 ], 2, { "b" : 1 } ] }

Then:

> db.t1.aggregate({$project: {A: {$flatten: "$a"}, _id:0}})
{ "A" : [ 1, 2, 3, 4, "a", "v", { "a" : 1, "b" : "aaa" }, 1, 2, { "b" : 1 } ] }
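
The semantics are a standard recursive flatten; a Python sketch reproducing the example above:

```python
def flatten(value):
    """Flatten nested lists into a single-level list; non-list elements
    remain in the top-level list (mirrors $flatten)."""
    out = []
    for v in value:
        if isinstance(v, list):
            out.extend(flatten(v))
        else:
            out.append(v)
    return out

a = [[1, 2], [3, 4], ["a", "v"], [{"a": 1, "b": "aaa"}, 1], 2, {"b": 1}]
print(flatten(a))
# [1, 2, 3, 4, 'a', 'v', {'a': 1, 'b': 'aaa'}, 1, 2, {'b': 1}]
```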

Flattening Nested Arrays to a Single Set (Array with no duplicates)

Use $flattenToSet to flatten nested arrays into a single array with no duplicates. Arrays of arrays (and more nesting) will be flattened into a single-level array. Non-array elements will remain in the top-level array. For example, if:

> db.t1.find()
{ "_id" : ObjectId("567331feea56094168280431"), "a" : [ [ 1, 2 ], [ 3, 1 ], [[ "a", "a" ]], [ { "a" : 1, "b" : "aaa" }, 1 ], 2, { "b" : 1 } ] }

Then:

> db.t1.aggregate({$project: {A: {$flattenToSet: "$a"}, _id:0}})
{ "A" : [ 1, 2, 3, "a", { "a" : 1, "b" : "aaa" }, { "b" : 1 } ] }

Removing duplicates from an array

Use $removeDuplicates to remove duplicates from a given array.

For example, if:

> db.t1.find()
{ "_id" : ObjectId("567331feea56094168280431"), "a" : [ [ 1, 2 ], 1, 2, 1, [ "a", "a" ], 2, { "b" : 1 } ] }

Then:

> db.t1.aggregate({$project: {A: {$removeDuplicates: "$a"}, _id:0}})
{ "A" : [ [ 1, 2 ], 1, 2, [ "a", "a" ], { "b" : 1 } ] }

> db.t1.aggregate({$project: {A: {$removeDuplicates: { "$arrayElemAt": ["$a", 3]}}}})
{ "A" : [ "a" ] }

Projecting a Subdocument by Name

$elementAt is an extension to $arrayElemAt that allows you to project a subdocument element by name. It is useful when you want to use a variable in a document as a reference into another subdocument - i.e. use a value as a field name. For example, if a document looks like:

> db.foo.find().pretty()
{
   "_id" : ObjectId("5676f34f0b5c4daad5c91950"),
   "a" : {
      "c" : 4,
      "d" : 5
   }
}
{
   "_id" : ObjectId("5676f3830b5c4daad5c91951"),
   "a" : {
      "c" : 4,
      "d" : 5
   },
   "p" : "c"
}
{
   "_id" : ObjectId("5676f3860b5c4daad5c91952"),
   "a" : {
      "c" : 4,
      "d" : 5
   },
   "p" : "d"
}

you can use the value in “p” to lookup the appropriate field in a. For example:

> db.foo.aggregate({$project: { em: {$elementAt:[ "$a","$p" ] }  } })

{ "_id" : ObjectId("5676f34f0b5c4daad5c91950"), "em" : null }
{ "_id" : ObjectId("5676f3830b5c4daad5c91951"), "em" : 4 }
{ "_id" : ObjectId("5676f3860b5c4daad5c91952"), "em" : 5 }

The first element of the array in $elementAt is the subdocument to look into and the second element of the array is the field from which to take the value used for the field name. The second value in the array must compute to a string.
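
In effect, $elementAt is a dictionary lookup where the key itself comes from the data; documents lacking the key field project null. A Python sketch (the helper `element_at` is illustrative):

```python
def element_at(subdoc, key):
    """Look up a field in a subdocument by a key taken from the data;
    None when the key is missing (the first document above has no "p")."""
    return subdoc.get(key) if key is not None else None

docs = [{"a": {"c": 4, "d": 5}},
        {"a": {"c": 4, "d": 5}, "p": "c"},
        {"a": {"c": 4, "d": 5}, "p": "d"}]
for doc in docs:
    print(element_at(doc["a"], doc.get("p")))  # None, 4, 5
```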

Applying an Expression to Each Element of an Array to Reduce an Array to a Single Value

$arrayReduce applies an expression to each element of an array, effectively reducing the array to a single value. $arrayReduce has the following arguments:

Required:
  • input: The column to use to retrieve an array
  • as: The new column name for each individual element of the array.
  • into: The name of the column which contains the accumulation of all elements so far.
  • in: The expression used to evaluate each element.
  • init: The expression used to set the initial value of into.
Optional:
  • finish: The expression used to set the final value of the reduce.

Within the body of the in, it is possible to refer to an element of the array, using $$ syntax. It is also possible to refer to the item being accumulated by name, using the $$ syntax. Observe $$item.price and $$s respectively in the example below. Within the body of the init and finish (and in), it is possible to refer to other columns within the original object. Observe $x and $items in the example below. The finish expression can also refer to the accumulated value ($$s).

As an example, we can use arrayReduce to manually calculate the average:

db.sales.drop()
db.sales.insert(
[{
   _id: 0,
   items: [
     { item_id: 43, quantity: 2, price: 10 },
     { item_id: 2, quantity: 1, price: 240 }
   ],
x: 0.0,
},
{
   _id: 1,
   items: [
     { item_id: 23, quantity: 3, price: 110 },
     { item_id: 103, quantity: 4, price: 5 },
     { item_id: 38, quantity: 1, price: 300 }
   ],
   x: 0.0,
},
{
    _id: 2,
    items: [
       { item_id: 4, quantity: 1, price: 23 }
    ],
   x: 0.0,
},
])

db.sales.aggregate([
    {
       $project: {
         myAvg: {
            $arrayReduce: {
               input: "$items",
               as: "item",
               into: "s",
               in: { $sum: [ "$$s", "$$item.price" ] },
               init: { $literal: '$x' },
               finish: { $div: [ "$$s", {$size: '$items'}] },
            }
         }
      }
   }
])

{ "_id" : 0, "myAvg" : 125 }
{ "_id" : 1, "myAvg" : 138.33333333333334 }
{ "_id" : 2, "myAvg" : 23 }

Projecting Documents

$struct produces an object whose field names, as well as their values, can be expressions.

The syntax is as follows:

{$struct: [ expr, expr, .... ] }

The first, third, fifth, etc. expressions in the list are field names and the second, fourth, sixth, etc. expressions are the values. If an odd number of expressions is given, the last one is ignored. For example, given documents that have a letter_grade field and a name field, the following produces documents where the letter_grade value becomes the field name:

> db.grades.aggregate({$project:{_id:0,grades:{$struct:["$letter_grade","$name"]}}})
{ "A": "Joe" }
{ "C": "Darren" }
{ "A": "Elizabeth" }
{ "B": "Fred" }
{ "C": "Ralph" }
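The name/value pairing rule can be sketched in plain JavaScript (structFrom is an illustrative name, not a SonarW API):

```javascript
// Sketch of $struct pairing: odd positions supply field names,
// even positions supply values. An unpaired trailing expression
// is ignored, as documented above.
function structFrom(exprs) {
  const out = {};
  for (let i = 0; i + 1 < exprs.length; i += 2) {
    out[String(exprs[i])] = exprs[i + 1];
  }
  return out;
}
```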

Regex Case Matching

$awk:

Given:
> db.queries.find({},{_id:0})
{ "s" : "insert into sales values ('george',55.99)" }
{ "s" : "drop table foobar" }
{ "s" : "CREATE TABLE foobar (id INTEGER,name VARCHAR(255))" }

and:
> db.sqlcases.find({},{_id:0})
{ "case" : /^[ ]*insert[ ]+into[ ]+([^ ]+)[ ]+values[ ]*[(]([^)]*)[)]/i, "expr" : "{ \"op\" : \"insert\", \"table\" : { \"$toUpper\" : \"$$1\" }, \"values\" : { \"$split\" : [ \"$$2\", \",\" ] } }" }
{ "case" : /^[ ]*drop[ ]+table[ ]+([^ ]+)/i, "expr" : "{ \"op\" : \"drop\", \"table\" : { \"$toUpper\" : \"$$1\" } }" }
{ "case" : /^[ ]*create[ ]+table[ ]+([^ (]+)[ ]*[(]((?:[^()]|[(][^)]*[)])*)[)]/i, "expr" : "{ \"op\" : \"create\", \"table\" : { \"$toUpper\" : \"$$1\" }, \"columns\" : { \"$split\" : [ \"$$2\", \",\" ] } }" }

> db.queries.aggregate({$project:{_id:0,n:{$awk:["$s",{$a:[{$project:{'*':1,_id:0}}],$ns:'sqlcases',$p:'*'}]}}})
{ "n" : { "op" : "insert", "table" : "SALES", "values" : [ "'george'", "55.99" ] } }
{ "n" : { "op" : "drop", "table" : "FOOBAR" } }
{ "n" : { "op" : "create", "table" : "FOOBAR", "columns" : [ "id INTEGER", "name VARCHAR(255)" ] } }

or if:
> db.sqlcases.find({},{_id:0})
{ "case" : /^[ ]*insert[ ]+into[ ]+([^ ]+)[ ]+values[ ]*[(]([^)]*)[)]/i, "value" : { "table" : { "$toUpper" : "$$1" }, "values" : { "$split" : [ "$$2", "," ] }, "op" : "insert" }, "priority": 0 }
{ "case" : /^[ ]*drop[ ]+table[ ]+([^ ]+)/i, "value" : { "table" : { "$toUpper" : "$$1" }, "op" : "drop" }, "priority": 1 }
{ "case" : /^[ ]*create[ ]+table[ ]+([^ (]+)[ ]*[(]((?:[^()]|[(][^)]*[)])*)[)]/i, "value" : { "table" : { "$toUpper" : "$$1" }, "columns" : { "$split" : [ "$$2", "," ] }, "op" : "create" }, "priority": 2 }

then:
> db.queries.aggregate({$project:{_id:0,n:{$awk:["$s",{$a:[{$project:{'*':1,_id:0}}],$ns:'sqlcases',$p:'*'}]}}})
{ "n" : { "op" : "insert", "table" : "SALES", "values" : [ "'george'", "55.99" ] } }
{ "n" : { "op" : "drop", "table" : "FOOBAR" } }
{ "n" : { "op" : "create", "table" : "FOOBAR", "columns" : [ "id INTEGER", "name VARCHAR(255)" ] } }

The difference between ‘expr’ and ‘value’ is that ‘expr’ is always a string representation of a document describing the value, while ‘value’ is the actual value to be used for that case.

The priority field, included in the second example above, is optional; it gives an explicit order to cases when more than one case may match the input. When multiple cases match the input, the lowest-numbered priority takes precedence.
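The case-selection rule can be sketched in plain JavaScript (matchCase and the document shapes below are illustrative, not SonarW internals):

```javascript
// Sketch of $awk case selection: among all cases whose regex matches
// the input string, pick the case with the lowest numeric priority.
// A missing priority is treated here as 0.
function matchCase(input, cases) {
  const hits = cases.filter(c => c.case.test(input));
  if (hits.length === 0) return undefined;
  hits.sort((a, b) => (a.priority ?? 0) - (b.priority ?? 0));
  return hits[0];
}
```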

Network Query and Manipulation Functions

$isIP:

> db.x.aggregate({$project:{_id:0, ip:1, 'has Ip':{$isIP:"$ip"}}})
{ "has Ip" : false }
{ "ip": "<invalid>", "has Ip" : false }
{ "ip": "192.168.1.66", "has Ip" : true }
{ "ip": "2001:0db8:85a3::8a2e:0370:7334", "has Ip" : true }

> db.x.aggregate({$project:{_id:0, ip:1, 'has Ip':{$isIP:["$ip","ipv4"]}}})
{ "has Ip" : false }
{ "ip": "<invalid>", "has Ip" : false }
{ "ip": "192.168.1.66", "has Ip" : true }
{ "ip": "2001:0db8:85a3::8a2e:0370:7334", "has Ip" : false }

> db.x.aggregate({$project:{_id:0, ip:1, 'has Ip':{$isIP:["$ip","ipv6"]}}})
{ "has Ip" : false }
{ "ip": "<invalid>", "has Ip" : false }
{ "ip": "192.168.1.66", "has Ip" : false }
{ "ip": "2001:0db8:85a3::8a2e:0370:7334", "has Ip" : true }
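The classification behavior shown above can be approximated in plain JavaScript (isIP here is a rough sketch; SonarW's actual validation may be stricter):

```javascript
// Rough sketch of $isIP semantics: classify a value as IPv4, IPv6, or
// neither, with an optional "ipv4"/"ipv6" family restriction.
function isIP(value, family) {
  if (typeof value !== 'string') return false;
  // Four dotted decimal octets, each 0-255.
  const v4 = /^(25[0-5]|2[0-4]\d|1?\d?\d)(\.(25[0-5]|2[0-4]\d|1?\d?\d)){3}$/.test(value);
  // Crude IPv6 check: hex digits and colons only.
  const v6 = value.includes(':') && /^[0-9a-f:]+$/i.test(value);
  if (family === 'ipv4') return v4;
  if (family === 'ipv6') return v6;
  return v4 || v6;
}
```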

$isDomain:

> db.x.aggregate({$project:{_id:0,host:1,isdomain:{$isDomain:"$host"}}})
{ "host" : "192.168.1.99", "isdomain" : false }
{ "host" : "foobar.com", "isdomain" : true }
{ "host" : 10000, "isdomain" : false }
{ "host" : null, "isdomain" : false }

$isEmail:

> db.x.aggregate({$project:{_id:0,e:1,result:{$isEmail:"$e"}}})
{ "e" : "Ury", "result" : false }
{ "e" : "Ury@jsonar", "result" : false }
{ "e" : "Ury@jsonar.com", "result" : true }
{ "e" : "Ury@192.168.1.123", "result" : true }

$isUrl:

> db.x.aggregate({$project:{_id:0,h:1,result:{$isUrl:"$h"}}})
{ "h" : "www.jsonar.com", "result" : false }
{ "h" : "https://www.jsonar.com", "result" : true }
{ "h" : "https:8008//www.jsonar.com", "result" : true }
{ "h" : "https://www.jsonar.com/about.html", "result" : true }
{ "h" : "ftp://ftp.slackware.com/CONTENTS", "result" : true }

$urlToHost:

> db.x.aggregate({$project:{_id:0,h:1,result:{$urlToHost:"$h"}}})
{ "h" : "https://www.jsonar.com", "result" : "www.jsonar.com" }
{ "h" : "https://www.jsonar.com:8008", "result" : "www.jsonar.com" }
{ "h" : "https://www.jsonar.com/about.html", "result" : "www.jsonar.com" }
{ "h" : "ftp://ftp.slackware.com/CONTENTS", "result" : "ftp.slackware.com" }

$urlToProtocol:

> db.x.aggregate({$project:{_id:0,h:1,result:{$urlToProtocol:"$h"}}})
{ "h" : "https://www.jsonar.com", "result" : "https" }
{ "h" : "https://www.jsonar.com:8008", "result" : "https" }
{ "h" : "https://www.jsonar.com/about.html", "result" : "https" }
{ "h" : "ftp://ftp.slackware.com/CONTENTS", "result" : "ftp" }

$urlToPort:

> db.x.aggregate({$project:{_id:0,h:1,result:{$urlToPort:"$h"}}})
{ "h" : "https://www.jsonar.com", "result" : "443" }
{ "h" : "https://www.jsonar.com:8008", "result" : "8008" }
{ "h" : "https://www.jsonar.com/about.html", "result" : "443" }
{ "h" : "ftp://ftp.slackware.com/CONTENTS", "result" : "21" }

$urlToPath:

> db.x.aggregate({$project:{_id:0,h:1,result:{$urlToPath:"$h"}}})
{ "h" : "https://www.jsonar.com", "result" : "" }
{ "h" : "https://www.jsonar.com:8008", "result" : "" }
{ "h" : "https://www.jsonar.com/about.html", "result" : "/about.html" }
{ "h" : "ftp://ftp.slackware.com/CONTENTS", "result" : "/CONTENTS" }
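The $urlTo* accessors above can be sketched with standard WHATWG URL parsing (available as the global URL class in Node.js); urlParts and the default-port table are illustrative, chosen to mirror the documented outputs (443 for https, 21 for ftp):

```javascript
// Sketch of $urlToHost / $urlToProtocol / $urlToPort / $urlToPath.
const DEFAULT_PORTS = { 'https:': '443', 'http:': '80', 'ftp:': '21' };

function urlParts(s) {
  const u = new URL(s);                       // WHATWG URL parser
  return {
    host: u.hostname,                         // $urlToHost
    protocol: u.protocol.slice(0, -1),        // $urlToProtocol (drop trailing ':')
    port: u.port || DEFAULT_PORTS[u.protocol] || '',  // $urlToPort with defaults
    path: u.pathname === '/' ? '' : u.pathname,       // $urlToPath ('' when no path)
  };
}
```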

$domainToTLD:

> db.x.aggregate({$project:{_id:0,d:1,result:{$domainToTLD:"$d"}}})
{ "d" : "www.jsonar.com", "result" : "com" }
{ "d" : "www.foobar.net", "result" : "net" }
{ "d" : "jsonar.com", "result" : "com" }
{ "d" : "www.something.info", "result" : "info" }

$domainRemoveTLD:

> db.x.aggregate({$project:{_id:0,d:1,result:{$domainRemoveTLD:"$d"}}})
{ "d" : "www.jsonar.com", "result" : "www.jsonar" }
{ "d" : "www.foobar.net", "result" : "www.foobar" }
{ "d" : "jsonar.com", "result" : "jsonar" }
{ "d" : "www.something.info", "result" : "www.something" }

$domainRemoveSubdomain:

> db.x.aggregate({$project:{_id:0,d:1,result:{$domainRemoveSubdomain:"$d"}}})
{ "d" : "www.jsonar.com", "result" : "jsonar.com" }
{ "d" : "www.foobar.net", "result" : "foobar.net" }
{ "d" : "jsonar.com", "result" : "jsonar.com" }
{ "d" : "www.something.info", "result" : "something.info" }
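The three domain helpers can be sketched by splitting on dots, consistent with the outputs above. Note this naive split treats the last label as the TLD, so multi-part public suffixes like "co.uk" are not handled in this sketch; the function names are illustrative:

```javascript
// Sketches of $domainToTLD, $domainRemoveTLD, and $domainRemoveSubdomain.
function domainToTLD(d)           { return d.split('.').pop(); }
function domainRemoveTLD(d)       { return d.split('.').slice(0, -1).join('.'); }
function domainRemoveSubdomain(d) {
  const parts = d.split('.');
  // With only two labels there is no subdomain to remove.
  return parts.length > 2 ? parts.slice(1).join('.') : d;
}
```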

Operators that Work on Arrays and Non-Arrays

$size (in aggregation, not the query operator) will work also on non-arrays and will behave as though the value is an array with one element being the field’s value. This allows running $size on variables that sometimes hold an array and sometimes a single value as is common in data that comes from XML conversions.

For example:

db.example.drop()
db.example.insert({a: 1 , col: ["red", "blue", "green"]})
db.example.insert({a: 1 , col:  "yellow"})

The result of checking the size of the column col:

db.example.aggregate({"$project": {"size": {"$size": "$col"}}})
{ "_id" : ObjectId("59c2b24665aa78c6cf23ec26"), "size" : 3 }
{ "_id" : ObjectId("59c2b24665aa78c6cf23ec27"), "size" : 1 }
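The relaxed rule reduces to a one-line sketch in plain JavaScript (sonarSize is an illustrative name):

```javascript
// Sketch of SonarW's relaxed aggregation $size:
// a non-array counts as a one-element array.
function sonarSize(v) {
  return Array.isArray(v) ? v.length : 1;
}
```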

$concat is also extended in SonarW to support arrays. If a $concat field is an array, all elements of the array are concatenated as well.

For example, if:

> db.students.find()
{ "_id" : 0, "first_name" : "Claire", "last_name" : "Jones", "grade" : 10, "classes" : [ "Science", "Math", "French" ] }
{ "_id" : 1, "first_name" : "Paul", "last_name" : "Warner", "grade" : 12, "classes" : [ "Calculus", "Physics", "French", "English" ] }
{ "_id" : 2, "first_name" : "Steve", "last_name" : "Grey", "grade" : 10, "classes" : [ "Science", "English", "PE", "Spanish" ] }
{ "_id" : 3, "first_name" : "Samantha", "last_name" : "Kim", "grade" : 11, "classes" : [ "Math", "Science", "French" ] }

then you can use a $concat even if the field has an array, e.g.:

> db.students.aggregate({'$project':{'_id':0, 'student_info':{'$concat':['$last_name','_','$classes']}}})
{ "student_info" : "Jones_ScienceMathFrench" }
{ "student_info" : "Warner_CalculusPhysicsFrenchEnglish" }
{ "student_info" : "Grey_ScienceEnglishPESpanish" }
{ "student_info" : "Kim_MathScienceFrench" }

This is a generalization of the fact that $concat works on non-string data such as integers, e.g.:

> db.students.aggregate({'$project':{'_id':0, 'student_info':{'$concat':['$last_name','_','$grade']}}})
{ "student_info" : "Jones_10" }
{ "student_info" : "Warner_12" }
{ "student_info" : "Grey_10" }
{ "student_info" : "Kim_11" }
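Both extensions (array flattening and non-string stringification) can be sketched together in plain JavaScript (sonarConcat is an illustrative name):

```javascript
// Sketch of SonarW's extended $concat: array arguments are flattened,
// and every element (including non-string scalars like integers) is
// stringified before concatenation.
function sonarConcat(args) {
  return args
    .map(a => Array.isArray(a) ? a.map(String).join('') : String(a))
    .join('');
}
```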

$elemMatch in SonarW will also match a non-array value, e.g.:

data=[ { "_id" : 1, "a" : [{ "b" : 1 }]}, { "_id" : 2, "a" : { "b" : 1 }} ]
db.coll1.drop()
db.coll1.insert(data)
> db.coll1.find( { a:{ $elemMatch:{ b:1 } } } )
{ "_id" : 1, "a" : [ { "b" : 1 } ] }
{ "_id" : 2, "a" : { "b" : 1 } }

Producing Subdocuments in a Group Stage

In a $group, generated field names may include a period in which case a subdocument is created. Note that this is slower compared to generating top-level fields and should not be used for very large collections.

For example, the following is legal:

> db.test.aggregate({"$group":{"_id":"$a", "new_field.name":{$sum:"$b"}}})
   {
    "_id" : 1,
    "new_field" :
     { "name" : 5.84 }
   }

Indirect in Projection and Calculated Field Names

The $indirect operator allows you to create new fields from data (i.e., from values). $indirect takes an array of two values: the first determines the field from which a value is taken to become the projected field name, and the second determines the field from which a value is taken to become that field's value.

For example, suppose you have a collection that has name-value pairs as documents (perhaps because they were imported this way):

> db.nv.find()
{ "_id" : ObjectId("544815c30da22840ff530562"), "name" : "f1", "value" : "v1" }
{ "_id" : ObjectId("544815c80da22840ff530563"), "name" : "f2", "value" : "v2" }
{ "_id" : ObjectId("544815cc0da22840ff530564"), "name" : "f3", "value" : "v3" }

Use the $indirect operator to project a new field - f1 for the first document, f2 for the second etc.:

> db.nv.aggregate({$project: {$indirect: ['$name', '$value']}})
{ "_id" : ObjectId("544815c30da22840ff530562"), "f1" : "v1" }
{ "_id" : ObjectId("544815c80da22840ff530563"), "f2" : "v2" }
{ "_id" : ObjectId("544815cc0da22840ff530564"), "f3" : "v3" }

Note that you can concatenate a prefix to ensure that the generated field will be placed in a subdocument, for example:

$indirect: [
      {
        $concat: [
          'a.b.c.',
            'header.type'
          ]
        },
        'header.text'
      ],

This will place the generated field inside the subdocument a: { b: { c: { ... } } }.

Note also that you can create multiple such fields by using an array of arrays: $indirect takes either an array of two values, or an array of arrays where each inner array contains two values, allowing you to specify multiple columns with a single $indirect spec. For example, if:

> db.t1.find()
{ "_id" : ObjectId("577e586f4cc802f0c13be7d9"), "dim1" : "ID1", "dim2" : "ID2", "KI1" : "v1", "KI2" : "v2" }

Then:

> db.t1.aggregate({$project: {_id: 0, $indirect: [[{$concat: ["$dim1", "_KI1"]}, "$KI1"],[{$concat: ["$dim2", "_KI2"]}, "$KI2"]]}})
{ "ID1_KI1" : "v1", "ID2_KI2" : "v2"}

This pattern is often used together with the $union pattern to create a single document for all these name-value pairs, e.g.:

> db.nv.aggregate(
   {$project: {$indirect: ['$name', '$value']}},
   {$group: {_id: 'all', attributes: {$union: 1}}}).pretty()
{
   "_id" : "all",
   "attributes" : {
      "f1" : "v1",
      "f2" : "v2",
      "f3" : "v3"
   }
}
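The fold performed by this $indirect-then-$union pipeline can be sketched in plain JavaScript (indirectUnion is an illustrative name, not a SonarW function):

```javascript
// Sketch of the $indirect + $union pattern: each {name, value} document
// becomes a single-field object, and the group folds those objects into
// one merged document.
function indirectUnion(docs) {
  return docs.reduce((acc, d) => ({ ...acc, [d.name]: d.value }), {});
}
```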

You can use any additional operators in the $indirect array - for example, to project to field_named_f1, field_named_f2 etc:

> db.nv.aggregate({$project: {
   $indirect: [
      {$concat: ['field_named_', '$name']},
      '$value']
      }})
{ "_id" : ObjectId("544815c30da22840ff530562"), "field_named_f1" : "v1" }
{ "_id" : ObjectId("544815c80da22840ff530563"), "field_named_f2" : "v2" }
{ "_id" : ObjectId("544815cc0da22840ff530564"), "field_named_f3" : "v3" }

A simpler use of the $indirect operator in a projection clause creates new fields based on some field in the source document. But contrary to a normal projection, the field used for the projection is determined indirectly, from the value of another field.

Take for example a collection that has the different names that people have along with a special field that tells us which of the names the person prefers to be called:

> db.names.find().pretty()
{
   "_id" : ObjectId("54481dab0da22840ff530569"),
   "first" : "John",
   "last" : "Doe",
   "nick" : "Johnny",
   "pref" : "first"
}
{
   "_id" : ObjectId("54481de10da22840ff53056a"),
   "first" : "Remy",
   "last" : "Hadley",
   "nick" : "13",
   "pref" : "nick"
}

If we want to project what each person should be called we can use the $indirect operator to tell SonarW to look at the pref field and use that value as the specifier from which to project:

> db.names.aggregate({$project: {call_me: {$indirect: "$pref"}}})
{ "_id" : ObjectId("54481dab0da22840ff530569"), "call_me" : "John" }
{ "_id" : ObjectId("54481de10da22840ff53056a"), "call_me" : "13" }

The right side of $indirect can be any expression and not just a field name.

Example for $indirect - Converting an array of name-value pairs (a-la ETL)

Assume documents that each have an array of name-value pairs and we want to convert these into real field names and their values. Starting from the array:

> db.nva.find().pretty()
{
   "_id" : ObjectId("544823550da22840ff53056b"),
   "i" : 1,
   "arr" : [
      {
         "n" : "f1",
         "v" : "v1"
      },
      {
         "n" : "f2",
         "v" : "v2"
      },
      {
         "n" : "f3",
         "v" : "v3"
      }
   ]
}
{
   "_id" : ObjectId("5448236d0da22840ff53056c"),
   "i" : 2,
   "arr" : [
      {
         "n" : "f11",
         "v" : "v11"
      },
      {
         "n" : "f22",
         "v" : "v22"
      },
      {
         "n" : "f33",
         "v" : "v33"
      }
   ]
}

We use an aggregation pipeline to transform these into “real” documents with fields. First we use an $unwind stage to get each name value pair separately. Then we use projection with a $indirect and a $concat to get the field names created. Finally we use a group stage with $union to create the following resulting documents:

{
   "_id" : 1,
   "vars" : {
      "i" : 1,
      "field_1" : "v1",
      "field_2" : "v2",
      "field_3" : "v3"
   }
}
{
   "_id" : 2,
   "vars" : {
      "i" : 2,
      "field_11" : "v11",
      "field_22" : "v22",
      "field_33" : "v33"
   }
}

The pipeline is:

> db.nva.aggregate(
   {$unwind: "$arr"},
   {$project: {i:1, $indirect: [{"$concat": [{"$substr": ["$arr.n", 0, 1]}, "ield_", {"$substr": ["$arr.n", 1, -1]}]}, "$arr.v"]}},
   {$group: {_id: "$i", vars: {$union: 1}}}
  )

Example - Using $indirect and $union to aggregate Steam gaming data

Assume we start with Steam gaming data (a popular gaming e-store and online community) recording which games each player plays and how long they play it. Each document takes the form of:

{
   "_id" : ObjectId("53b21af672a53f1ac56db3cc"),
   "game" : "Garry's Mod",
   "game_id" : 4000,
   "owner" : "76561198000664965",
   "playtime_forever" : 27
}

We want to transform this into a set of documents where each document is a representation of all games played by that player. This is a data structure that we can then use to build a recommendation engine. Each document should have fields - one per played game. The field name should be the game name (the value under the game field above) and the value should be the value of playtime_forever above. This is a classic use case for $indirect and for $union:

db.steam.aggregate(
   {$project: {owner: 1, $indirect: ["$game", "$playtime_forever"]}},
   {$group: {_id: "$owner", matrix: {$union: 1}}})

Selecting documents in a range of internal IDs

SonarW maintains an internal ID per document - different and in addition to the _id.

The $did operator expects an array with one or two SonarW IDs as limits, and will select all the documents with IDs in that range. Note that these are SonarW’s internal document IDs.

The following example finds all documents within the employees collection with IDs within the range [0,4]:

> db.employees.find({_id: {$did: [0,4]}})
{ "_id" : 0, "name" : "Bob" }
{ "_id" : 1, "name" : "Jodie" }
{ "_id" : 2, "name" : "Cindy" }
{ "_id" : 3, "name" : "Alex" }

or:

> db.bonds.find({_id: {$did: [4,6]}}, {'Product.Country': 1})
{ "Product" : { "Country" : "BN" }, "_id" : ObjectId("556f8513a77bf01caf43f90e") }
{ "Product" : { "Country" : "CAN" }, "_id" : ObjectId("556f8513a77bf01caf43f90f") }

If only one ID is given, the query will find the document with that ID if it exists:

> db.employees.find({_id: {$did: [1]}})
{ "_id" : 1, "name" : "Jodie" }

Projecting the Internal ID

SonarW maintains an internal ID per document - different and in addition to the _id.

You can project the internal ID using the $did projection operator - for example:

db.customers.find( {age:25} , {$did:1, _id:1} )

You can also use the $did operator in an aggregation projection:

db.t1.drop()
db.t1.insert([{'_id': 0, 'a': 1},{'_id': 1, 'a':2}])
db.t1.aggregate({$match: {'a':2}},{$project: {'*':1, did:{$did:1}}})
{ "did" : NumberLong(1), "_id" : 1, "a" : 2 }

Projecting a new ObjectId

You can compute and project a new ObjectId that can be used for creating a unique number (regardless of the _id), e.g.:

db.t1.aggregate({$project: {unique: {$objectId: 1}}})

Dereferencing a DBRef

SonarW allows you to de-reference and traverse a DBRef “pointer” in a $project. As an example, if collection messages has a field called account which is a DBRef to the accounts collection, you can project elements from the associated/pointed-to collection using:

> db.messages.aggregate({"$project": {"account": 1, _id: 0, name: {$deref: "$account.$name"}}})
{
   "account" : { "$id" : 1, "$ref" : "account" },
   "name" : "Account 101"
}

Note that if you had additional DBRef values pointing to different collections, each would dereference to the right collection. Each dereferencing operation performs its own id-based query, and throughput is no more than a few thousand dereference operations per second. Therefore, consider dereferencing only after other aggregation stages have reduced the data, and consider using a $join if you need higher throughput.

Determining if Two Documents are Different

The $hasDiff projection operator:

$hasDiff: [<arg1>, <arg2>]

returns a true/false value indicating whether there are any differences between the documents passed as arguments, ignoring the _id field. One of the arguments (but not both) may also be an array, in which case the output is an array of booleans, one per document in the array, indicating whether the corresponding document differs from the document given as the other argument. The order of the arguments does not affect the output.

The $diff projection operator:

$diff: [<arg1>, <arg2>, optional-flag]

produces the differences between objects in a Unix 'diff'-like manner, showing fields that have been added, removed, or changed between the two documents, using the field name "-" to indicate removal and "+" to indicate addition. Both "+" and "-" are present when a field has the same name but a different value. While the order of elements in arrays is considered when comparing values for equality, the order of fields within documents or subdocuments is not. If the optional third flag is specified and is true, then fields that are the same are also output (for context).

If either argument is an array, then each document in that array is compared with the other argument, and the output is an array of documents showing the differences between each document in the array and the given document. Any elements in the array that are not documents are ignored. If one argument is an array, the other must be an object/document or this expression produces an empty result. If neither argument is an object, the expression also produces an empty result.
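A flat sketch of this output shape for two documents follows; flatDiff is an illustrative name, and it compares nested values crudely via JSON rather than with SonarW's comparison rules (in particular, it does not group removed fields under a single top-level "-" the way SonarW can):

```javascript
// Flat sketch of $diff: removed-only fields appear under "-", added-only
// under "+", changed fields get both, and unchanged fields are echoed
// only when withContext (the optional third flag) is true.
function flatDiff(a, b, withContext = false) {
  const out = {};
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const k of keys) {
    const inA = k in a, inB = k in b;
    if (inA && inB && JSON.stringify(a[k]) === JSON.stringify(b[k])) {
      if (withContext) out[k] = a[k];   // unchanged, kept for context only
    } else {
      out[k] = {};
      if (inA) out[k]['-'] = a[k];      // old value (removed/changed)
      if (inB) out[k]['+'] = b[k];      // new value (added/changed)
    }
  }
  return out;
}
```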

You can use self-joins, $$CURRENT and the * projection operator to compare and compute diffs between documents in the same collection (e.g. comparing different items or the same item over time). For example, if the height_progress collection contains documents describing people’s height and age and you want to compare the heights at the various ages between Jack and Dan you could use:

> db.height_progress.aggregate([
      {$match:{name:'Jack'}},
      {$join: {
         $joined: {self:'height_progress'},
         $match : { age: '$self.$age'},
         $having:{'diff.name.+':{$eq:'Dan'}},
         $project: { diff:{$diff:['$$CURRENT','$self.$*',true]}}}}])
   .toArray()
[
 {
     "diff" : {
         "_id" : {
             "-" : ObjectId("573f7f69758587e7b2879246"),
             "+" : ObjectId("573f7f69758587e7b2879245")
         },
         "age" : 5,
         "height" : 34,
         "name" : {
             "-" : "Jack",
             "+" : "Dan"
         }
     },
     "_id" : ObjectId("573f7f69758587e7b2879246")
 },
 {
     "diff" : {
         "_id" : {
             "-" : ObjectId("573f7f6a758587e7b2879248"),
             "+" : ObjectId("573f7f6a758587e7b2879247")
         },
         "age" : 6,
         "height" : {
             "-" : 45,
             "+" : 56
         },
         "name" : {
             "-" : "Jack",
             "+" : "Dan"
         }
     },
     "_id" : ObjectId("573f7f6a758587e7b2879248")
 },
 {
     "diff" : {
         "_id" : {
             "-" : ObjectId("573f7f6a758587e7b287924a"),
             "+" : ObjectId("573f7f6a758587e7b2879249")
         },
         "age" : 7,
         "height" : {
             "-" : 56,
             "+" : 60
         },
         "name" : {
             "-" : "Jack",
             "+" : "Dan"
         }
     },
     "_id" : ObjectId("573f7f6a758587e7b287924a")
 },
 {
     "diff" : {
         "_id" : {
             "-" : ObjectId("573f7f6a758587e7b287924c"),
             "+" : ObjectId("573f7f6a758587e7b287924b")
         },
         "age" : 8,
         "height" : {
             "-" : 90,
             "+" : 80
         },
         "name" : {
             "-" : "Jack",
             "+" : "Dan"
         }
     },
     "_id" : ObjectId("573f7f6a758587e7b287924c")
 }
]

Note: In such a group-diff over a set of documents, each diff is incremental, relative to the previous document. To always compare against the first document, use $diffToFirst.

Comparing Documents:

Using a $group and a $diff you can compare a group of documents to get the main document and all its variations. Here $diff acts as an aggregation operator within $group, comparing every document in the group to the first document of the group. For example, to compare all versions of documents in the datasources collection do:

> db.datasources.aggregate({$group: {_id: "$Datasource Id", diff: {$diff: "$$CURRENT"}}}).pretty()
{
        "_id" : "20017",
        "diff" : [
                {
                        "_id" : ObjectId("575acf93b34d03154b14b561"),
                        "UTC Offset" : -4,
                        "Datasource Id" : "20017",
                        "Datasource Type" : "INFORMIX",
                        "Datasource Name" : "informixTest1",
                        "Host" : "9.70.144.157",
                        "Port" : "1403",
                        "Service Name" : "9.70.144.157",
                        "User Name" : "informix",
                        "Shared" : "true",
                        "Severity" : "MED",
                        "Timestamp" : ISODate("2016-03-10T12:52:54Z"),
                        "last_update_time" : ISODate("2016-07-26T13:00:34Z"),
                        "Password" : "_vy_miqJQytaR2mIW-_HJg",
                        "Name" : "Chris B",
                        "email" : "cbrown@jsonar.com"
                },
                {
                        "_id" : {
                                "-" : ObjectId("575acf93b34d03154b14b561"),
                                "+" : ObjectId("575bef2cb34d0319c42feca1")
                        },
                        "last_update_time" : {
                                "-" : ISODate("2016-07-26T13:00:34Z"),
                                "+" : ISODate("2016-07-26T13:00:35Z")
                        }
                },{
                        "-" : {
                                "Name" : "Chris B",
                                "email" : "cbrown@jsonar.com"
                        },
                        "_id" : {
                                "-" : ObjectId("575acf93b34d03154b14b561"),
                                "+" : ObjectId("57667b9eb34d03999d4742c1")
                        },
                        "last_update_time" : {
                                "-" : ISODate("2016-07-26T13:00:34Z"),
                                "+" : ISODate("2016-07-26T13:00:44Z")
                        }
                }, ...

Casting Operators

Use $toNumber, $toBool, $toInt, $toDateTime, $toLong, and $toString to cast types in aggregation framework stages. For example, if:

> db.schedule.aggregate({$limit: 3}, {$project: {proj_id: 1, now: {$now: null}}})

{ "_id" : ObjectId("59d3fe7efa1c157cc0cc02d6"), "proj_id" : 51965, "now" : ISODate("2017-10-03T21:18:38Z") }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d7"), "proj_id" : 51965, "now" : ISODate("2017-10-03T21:18:38Z") }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d8"), "proj_id" : 51965, "now" : ISODate("2017-10-03T21:18:38Z") }

Then:

> db.schedule.aggregate({$limit: 3}, {$project: {proj_id: {$toString: "$proj_id"}, now: {$toString: {$now: null}}}})

{ "_id" : ObjectId("59d3fe7efa1c157cc0cc02d6"), "proj_id" : "51965", "now" : "2017-10-03T21:18:56" }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d7"), "proj_id" : "51965", "now" : "2017-10-03T21:18:56" }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d8"), "proj_id" : "51965", "now" : "2017-10-03T21:18:56" }

> db.schedule.aggregate({$limit: 3}, {$project: {proj_id: {$toBool: "$proj_id"}, now: {$toNumber: {$now: null}}}})

{ "_id" : ObjectId("59d3fe7efa1c157cc0cc02d6"), "proj_id" : true, "now" : 1507065552000 }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d7"), "proj_id" : true, "now" : 1507065552000 }
{ "_id" : ObjectId("59d3fe7ffa1c157cc0cc02d8"), "proj_id" : true, "now" : 1507065552000 }

The $toJson projection operator creates a String containing JSON from a document or a field. For example, to get a string containing the JSON per document project:

{
  "$project": {
    json: {
      $toJson: "$$ROOT"
      }
    }
  }

There are three types of conversions - affecting how typed data gets converted. For example, given the data:

> db.x.find()
{ "a" : 1 }
{ "a" : NumberDecimal("1.2") }
{ "a" : NumberLong(12345678) }
{ "a" : [ "abc", 1, 2, 3 ] }
{ "a" : ObjectId("5956b7d4a4c4ce4d299d84c9") }
{ "a" : { "key" : "value", "array_key" : [ ISODate("2017-06-30T20:43:00.302Z"), 300, null ] } }

The three possible options are mongoexport, shell and flat:

> db.x.aggregate({$project:{json:{$toJson:["$a","mongoexport"]},_id:0}})
{ "json" : "1" }
{ "json" : "{\"$numberDecimal\":\"1.2\"}" }
{ "json" : "{\"$numberLong\":\"12345678\"}" }
{ "json" : "[\"abc\",1,2,3]" }
{ "json" : "{\"$oid\":\"5956b7d4a4c4ce4d299d84c9\"}" }
{ "json" : "{\"key\":\"value\",\"array_key\":[{ \"$date\": { \"$numberLong\": \"1498855380302\" } },300,null]}" }

> db.x.aggregate({$project:{json:{$toJson:["$a","shell"]},_id:0}})
{ "json" : "1" }
{ "json" : "NumberDecimal('1.2')" }
{ "json" : "NumberLong('12345678')" }
{ "json" : "[\"abc\",1,2,3]" }
{ "json" : "ObjectId('5956b7d4a4c4ce4d299d84c9')" }
{ "json" : "{\"key\":\"value\",\"array_key\":[ISODate('2017-06-30T20:43:00.302Z'),300,null]}" }

> db.x.aggregate({$project:{json:{$toJson:["$a","flat"]},_id:0}})
{ "json" : "1" }
{ "json" : "1.2" }
{ "json" : "12345678" }
{ "json" : "[\"abc\",1,2,3]" }
{ "json" : "\"5956b7d4a4c4ce4d299d84c9\"" }
{ "json" : "{\"key\":\"value\",\"array_key\":['2017-06-30T20:43:00.302Z',300,null]}" }

$fromJson does the opposite, translating a string containing ordinary JSON ("flat" mode) into a document or other valid BSON type. The JSON parser used by $fromJson has no understanding of specific BSON types and can only identify strings, numbers, booleans, arrays, objects, or null. If the JSON is not well formed (e.g., non-matching brackets), $fromJson yields 'undefined'.
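The error behavior reduces to a small sketch in plain JavaScript (fromJson is an illustrative name):

```javascript
// Sketch of $fromJson's behavior: ordinary JSON parsing that yields
// undefined (rather than raising an error) for malformed input.
function fromJson(s) {
  try {
    return JSON.parse(s);
  } catch (e) {
    return undefined;
  }
}
```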

Privileges Extensions

$user

Use the $user projection primitive to project the name of the currently logged-in user. {$user: 1} will project the string of the user that owns this session.

For example:

> db.foo.aggregate({$project: {"_id": 0, "user": {"$user": 1}}})
{ "user" : "admin" }

$hasPrivilege

$hasPrivilege is an operator that you can use in a find or a $project to evaluate whether the user you are running as has a certain privilege. You can combine it with other conditions so that a query behaves differently depending on which user is currently logged into the database (and what privileges that user has). This is useful when building applications whose queries are run by different users. For example, the following query checks that Login Succeeded is equal to 1 but also that the user has write privileges on the database; a user with only read privileges will see nothing. Therefore, if user qa1 has readWrite on the database they will get:

> db.session.findOne({"Login Succeeded" : 1, $expr: {$hasPrivilege: ["session", "update"]}})
{
   "_id" : NumberLong("10000000000"),
   "Access Id" : "32",
   "Database Name" : "DB_NAME_2672",
   "Session Start" : ISODate("2014-01-01T00:00:00Z"),
   "Uid Chain" : null,
   "Session End" : ISODate("1970-01-01T00:00:00Z"),
   "Analyzed Client IP" : "186.211.9.154",
   "Server IP" : "66.91.145.34",
   "Client Host Name" : "CLIENT_HOST_NAME_680",
   "Login Succeeded" : 1,
   "DB User Name" : "DB_USER_906",
   "Source Program" : "SOURCE_PROGRAM_49721",
   "Server Type" : "MONGODB",
   "Service Name" : "SERVICE_NAME_487",
   "OS User" : "OS_USER_49088",
   "Server Host Name" : "SERVER_HOST_NAME_49615",
   "Original Timezone" : "UTC-04:00"
}

while a user qa2 with only read permission will get:

> db.session.findOne({"Login Succeeded" : 1, $expr: {$hasPrivilege: ["session", "update"]}})

currentOp and killOp

There are a number of extensions to the currentOp and killOp commands. In currentOp, SonarW provides additional information specific to queries - e.g., in aggregation pipelines it shows the progress of the pipeline through the data set. In both currentOp and killOp, non-admin users can see and kill only their own operations, allowing administrators to grant these privileges to users more easily.

db.currentOp() - Returns a document describing the current operations on the database belonging to the current user, for example:

db.currentOp()
{
   "inprog" : [
         {
                 "query" : {

                 },
                 "active" : false,
                 "client" : "127.0.0.1:48917",
                 "threadId" : "477749248",
                 "connectionId" : 3,
                 "waitingForLock" : false,
                 "killed" : false,
                 "pipeline_progress" : {
                         "chunk" : NumberLong(1),
                         "chunk_time" : NumberLong(0),
                         "component" : "Aggregation_Sort",
                         "component_no" : NumberLong(1),
                         "collection_docs" : NumberLong(3000000),
                         "current_segment" : NumberLong(1),
                         "segments" : [
                                 "[ Aggregation_Sort ]",
                                 "[ Aggregation_Sort ]"
                         ],
                         "components" : [
                                 {
                                         "segment" : 1,
                                         "component_no" : 1,
                                         "component" : "Aggregation_Sort",
                                         "at_document" : NumberLong(16384)
                                 }
                         ]
                 }
         }
        ]
}

Note: A user with root privileges can monitor the operations belonging to any user and may kill any active operation.

db.killOp(operationId) - Kill the operation defined by operationId. If the current user does not have root privileges, the operation must belong to the current user.

Additional Match Extensions

$expr

A $match stage in an aggregation pipeline or a find can include an $expr expression. An $expr can include any operator sequence as long as it evaluates to true or false. This is a very flexible way to perform any type of complex matching. For example, you would use $expr to match only documents where one field is larger than another (as opposed to larger than a literal). The following matches all shapes whose width is larger than their length:

db.shapes.aggregate({$match: {$expr: {$gt: ["$width", "$length"]}}})

When using $expr, place the two sides of the expression in an array rather than the <lhs>: { <operator> : <rhs>} syntax normally used. The example above cannot even be expressed using standard syntax since it involves two fields.

For equality use $eq.

Note that $expr is slower than normal operator syntax.

$sonarSyntax

SonarW allows an extended syntax in which projection operators compute values within find and within $match. For example, you could do a find with an operator such as:

"full_name" : {  "$concat" : [ "Jane", " ", "Doe" ] }

This is especially useful for code generated by applications or translators. While you can do this with an $expr, SonarW has a mechanism to precompute constant values of this kind before the query begins, thus avoiding the slow-down inherent in $expr. But because this syntax conflicts with MongoDB syntax, you need to add the $sonarSyntax modifier, for example:

db.queue.find({
   "$sonarSyntax":true,
   "full_name" : {  "$concat" : [ "Jane", " ", "Doe" ] } })

SonarW will rewrite this query to the following before execution:

db.queue.find({ "full_name" : "Jane Doe" })

Similarly, this:

db.employees.find({$sonarSyntax: true, salary:{ $gt: { $add:[50000,40000] } }})

will be rewritten to:

db.employees.find({$sonarSyntax: true, salary:{ $gt: 90000 }})

If you use $sonarSyntax and an extended expression is discovered that is not constant, an error is raised.
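The constant folding that $sonarSyntax performs can be pictured with a small Python sketch. This is a hypothetical model, not SonarW's implementation: it covers only $concat and $add, and treats any "$"-prefixed string (a field reference) as a non-constant expression that triggers an error.

```python
def fold(expr):
    """Recursively replace constant operator expressions with their value."""
    if isinstance(expr, dict):
        folded = {k: fold(v) for k, v in expr.items()}
        if len(folded) == 1:
            op, args = next(iter(folded.items()))
            # only constant-argument operators are folded in this sketch
            if op == "$concat" and all(isinstance(a, str) for a in args):
                return "".join(args)
            if op == "$add" and all(isinstance(a, (int, float)) for a in args):
                return sum(args)
        return folded
    if isinstance(expr, list):
        return [fold(e) for e in expr]
    if isinstance(expr, str) and expr.startswith("$"):
        raise ValueError("non-constant expression under $sonarSyntax")
    return expr

print(fold({"full_name": {"$concat": ["Jane", " ", "Doe"]}}))
# {'full_name': 'Jane Doe'}
print(fold({"salary": {"$gt": {"$add": [50000, 40000]}}}))
# {'salary': {'$gt': 90000}}
```

Both rewrites match the examples above: the constant subtree is evaluated once, before the query runs.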

Updating using $expr

When updating a document using the $set operator, it is possible for the new value to be the result of evaluating an expression.

The expression may refer to any of the columns of the original data. Any of the columns (except for _id) may be set in this way. The $expr may be any valid expression.

The $expr operations may be intermixed with any other $set operation.

Note: When a column is evaluated within an expression, and the value is also being updated, the values seen by the $expr are the values prior to the update.

Example:

db.col.drop()
db.col.insert({_id:1, a: 1, b: 2, c: 3, d: 4})
db.col.insert({_id:2, a: 1, b: 2})
db.col.insert({_id:3, a: 1, b: 20, c: 30, d: 40})
db.col.update({'a':1},{'$set':{'a': 44, 'b': {$expr: {$div: ['$b',2]}},c: {$expr: {$sum: ['$a','$b','$c']}}}},{multi:true})
{ "_id" : 1, "a" : 44, "b" : 1, "c" : 6, "d" : 4 }
{ "_id" : 2, "a" : 44, "b" : 1, "c" : 3 }
{ "_id" : 3, "a" : 44, "b" : 10, "c" : 51, "d" : 40 }
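The pre-update semantics can be illustrated with a minimal Python simulation (a sketch with hypothetical helper names, modeling only $div and $sum): every $expr reads from a snapshot taken before any field is changed, and missing fields are treated as 0, which is why document _id 2 above gets c = 1 + 2 = 3.

```python
def apply_set(doc, changes):
    """Simulate $set where $expr values read the pre-update document."""
    snapshot = dict(doc)          # values prior to the update
    out = dict(doc)
    for field, value in changes.items():
        if isinstance(value, dict) and "$expr" in value:
            op, args = next(iter(value["$expr"].items()))
            # "$name" strings resolve against the snapshot, not `out`
            vals = [snapshot.get(a[1:], 0) if isinstance(a, str) else a
                    for a in args]
            value = {"$div": lambda v: v[0] / v[1], "$sum": sum}[op](vals)
        out[field] = value
    return out

doc = {"_id": 1, "a": 1, "b": 2, "c": 3, "d": 4}
print(apply_set(doc, {"a": 44,
                      "b": {"$expr": {"$div": ["$b", 2]}},
                      "c": {"$expr": {"$sum": ["$a", "$b", "$c"]}}}))
# {'_id': 1, 'a': 44, 'b': 1.0, 'c': 6, 'd': 4}
```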

Using Javascript Expressions in Match and Project

SonarW supports writing limited JavaScript expressions using the $js operator. $js accepts a string and parses the JavaScript inside it, converting it into the appropriate projection expressions ($cond, $ifNull, etc.). This is far less verbose and much easier to write. You can include a $js in a $match, a $project and even in a $group, for example:

> db.t1.find()
{ "_id" : ObjectId("5654f13fd08195132035b264"), "a" : 1 }
{ "_id" : ObjectId("5654f148d08195132035b265"), "a" : 2, "b" : 1 }
{ "_id" : ObjectId("5654f14bd08195132035b266"), "a" : 2, "b" : 3 }
{ "_id" : ObjectId("5654f151d08195132035b267"), "a" : null, "b" : 3 }

> db.t1.aggregate({$group: {
      _id: 'a',
      c: {$sum:1},
      a: {$sum: {$js: "($a != null) ? 0 : 1"}},
      b: {$sum: {$js: "($b != null) ? 0 : 1"}}}})

{ "_id" : "a", "c" : 4, "a" : 1, "b" : 0 }

Note that the $eq projection operator does not equate a non-existent field with null. Therefore, if you want to match a variable v against documents where v is null as well as documents where v does not exist, use:

($v==null || $v==undefined)

The following JS operators and functions are supported in $js, and map to the corresponding projection expressions:

  • < $lt
  • > $gt
  • <= $lte
  • >= $gte
  • != $ne
  • ==, === $eq
  • + $add
  • - $subtract
  • * $multiply
  • % $mod
  • / $divide (with some restrictions, see below)
  • ! $not
  • && $and
  • || $or
  • ?: $cond
  • substr $substr
  • concat $concat
  • strcasecmp $strcasecmp
  • toLower $toLower
  • toUpper $toUpper
  • dayOfYear $dayOfYear
  • dayOfMonth $dayOfMonth
  • dayOfWeek $dayOfWeek
  • year $year
  • month $month
  • week $week
  • hour $hour
  • minute $minute
  • second $second
  • dateToString $dateToString

The restriction on using the / operator is that no more than one may appear in the entire $js expression, because of the ambiguity with regular expression literals.

If a field name contains a space you can still refer to it in a $js expression. Note that the name is terminated by the first character that parses as an operator. Therefore, if you have a field named “a a” and a field named “b b” you can refer to them and use them as follows:

> db.t1.find()
{ "_id" : ObjectId("56a797b5139f60aa16508e22"), "a a" : 1, "b b" : 2 }

> db.t1.aggregate({$project: {c: {$js: "$a a - $b b"}}})
{ "_id" : ObjectId("56a797b5139f60aa16508e22"), "c" : -1 }

Note that this will not work if your field is named “a a ” or “a a +”; use explicit non-$js expressions in these cases.

Output Extensions

Appending to an Existing Collection Using $out

By default, a $out stage will drop an existing collection before inserting documents. To keep the existing contents and append the documents to the existing collection use:

{$out: {name: "<collection name>", append: true}}

Instead of the default:

{$out: "<collection name>"}

Appending to an existing collection inserts documents. If you append documents whose _id matches a document already in the collection, the new document is still added, but the collection is then flagged as accepting duplicate _ids and cannot be changed back until the duplicate document(s) are deleted. If you specify “upsert” instead of “append”, then whenever a matching _id already exists in the collection, the existing document is updated instead, e.g.:

db.foo.aggregate([....,{$out: {name: "bar", upsert: true} }])

Specifying both “upsert” and “append” is an error, and is reported as such.

Updating an Existing Collection Using $out

A $out usually inserts new documents into the target collection. You may instead want each document resulting from an aggregation pipeline to update an existing collection. In that case use the upsert extension. Each document will cause an upsert statement to be formed:

{$out: {name: "<collection name>", upsert: true}}

The upsert statement will use the _id for the query condition (and thus you must have an _id). If there is no _id in the document then the document is inserted.
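The upsert behavior described above can be modeled in a few lines of Python (a sketch, not the server's code): documents whose _id matches an existing document replace it, and documents without an _id, or with a new one, are simply inserted.

```python
def out_upsert(target, docs):
    """Model $out with upsert: update by _id, otherwise insert."""
    by_id = {d["_id"]: i for i, d in enumerate(target) if "_id" in d}
    for doc in docs:
        key = doc.get("_id")
        if key is not None and key in by_id:
            target[by_id[key]] = doc       # matching _id: update in place
        else:
            target.append(doc)             # no _id or new _id: plain insert
            if key is not None:
                by_id[key] = len(target) - 1
    return target

bar = [{"_id": 1, "v": "old"}]
print(out_upsert(bar, [{"_id": 1, "v": "new"}, {"v": "no id"}]))
# [{'_id': 1, 'v': 'new'}, {'v': 'no id'}]
```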

Outputting to a Collection in Another Database

$out can also be used to generate the collection in another database (on the same instance); append is optional:

{$out: {append: <true/false>, name:<collection-name>, db: <database name>}}

Outputting to a Collection Without an _id, Ingesting and Partitioning

New collections generated using $out can be created with or without unique _id indexing. If not specified, the database default from sonard.conf is used. To specify this explicitly use:

{$out: {name:<collection-name>, allow_duplicate_ids: true}}

New collections generated using $out may use partitioning and be ingested for faster creation. To do so you need to specify the keys used for partitioning in an array (can be empty). For example:

{$out: {name:<collection-name>, allow_duplicate_ids: true, ingest: ['Session Id']}}

If ingest is not specified, normal inserts are used to create the collection. If an empty array is specified, ingestion without partitioning occurs.

Outputting to an Operating System File

By default $out generates a collection within SonarW. To get the output as an operating system file use the fstype: “local” qualifier in a subdocument as:

{ $out:{"fstype": "local", "format": "csv", "filename": "foo" }}
or
{ $out:{"fstype": "local", "format": "json", "filename": "foo" }}
or
{ $out:{"fstype": "local", "format": "bson", "filename": "foo" }}

To append to an existing file use:

{ $out:{"fstype": "local", "format": "csv", "filename": "foo" , append: true}}

NOTE: Some MongoDB drivers (e.g. the Java driver) require that $out be followed by a string and may throw an exception even though there is no output after a $out. If this happens use $sout instead of $out.

When outputting to a file, $out must be preceded by a $project, for example:

db.attendance.aggregate([{$project:{'*':1}},{$out:{fstype:'local',  "format": "json", "filename": "foo" }}])

Users with “dbAdmin” or “root” access are allowed to write output to an operating system file. Other users must be given the “createLocalOutFile” role to perform this action.

When you output data to an external file you have two more options:

  1. zip: true - will generate compressed zip files
  2. partSize: num - will break up the output into multiple files that are generated incrementally, each one limited to the number of bytes you specify. num can be a string such as “5K” or “10M” (5 or 10 being examples), or a plain number of bytes. This option is useful when you are generating large files but want to start sending the parts to the consumer as they become ready.

Outputting to Amazon AWS S3

To get the output into an Amazon AWS S3 bucket, use fstype: ‘s3’ as in:

{$out: {
   fstype:'s3',
   key_id: <amazon key id>,
   secret: <secret access key>,
   bucket: <amazon s3 bucket name>,
   name: <name to give to the file>,
   amz_headers:<optional additional request headers>,
   format:<format>}}

The formats supported are CSV and JSON as in local files.

All fields are mandatory except ‘amz_headers’, which may contain either a single string (one additional header) or an array of strings, where each string is formatted like an HTTP request header. Use it to pass any additional information to S3 with the save request, such as “x-amz-acl:public-read”, which specifies that the file is intended to be read by anyone. The bucket specified must already exist and be configured to be writable by the user whose credentials are supplied.

If errors are encountered during the transfer, or there are problems with the credentials, they are reported on the server, but are not currently reported back to the client.

Just as with $out to a file, $out to an s3 file must also be preceded by a $project, for example:

db.attendance.aggregate([{$project:{'*':1}},{$out:{fstype:'s3', bucket:'jsonar.marksbucket', key_id:'EXAMPLE_KEY', secret:'SECRET_KEY', name:'class_attendance.json', amz_headers:'x-amz-acl:public-read', format:'json'}}])

Outputting to Syslog

You can send data directly over syslog and you can format JSON documents as LEEF.

As an example, if a collection has documents of the form:

> db.webservice.findOne({number: 'INC0000002'})
{
        "_id" : ObjectId("59cf96748a7597550015e40f"),
        "number" : "INC0000002",
        "IP Address" : "localhost",
        "SonarG Source" : "SonarDispatcher-Web-Service",
        "__status" : "success",
        "active" : true,
        "activity_due" : "2017-08-10 14:51:11",
        "approval" : "null",
        "assigned_to" : "Howard Johnson",
        "assignment_group" : "Network",
        "caller_id" : "Fred Luddy",
        "category" : "Network",
        "close_code" : "null",
        "cmdb_ci" : "FileServerFloor2",
        "comments" : "2017-08-03 16:13:23 - System Administrator (Additional comments)\nAdded an attachment\n\n",
        "comments_and_work_notes" : "2017-08-03 16:13:23 - System Administrator (Additional comments)\nAdded an attachment\n\n",
        "contact_type" : "null",
        "description" : "User can't get to any of his files on the file server.",
        "escalation" : "Overdue",
        "hold_reason" : "Awaiting Vendor",
        "impact" : "1 - High",
        "incident_state" : "On Hold",
        "knowledge" : false,
        "location" : "1050 Sunnyview Road Northeast, Salem,OR",
        "made_sla" : false,
        "notify" : "Do Not Notify",
        "opened_at" : ISODate("2017-05-04T16:07:12Z"),
        "opened_by" : "Joe Employee",
        "priority" : "1 - Critical",
        "problem_id" : "PRB0000007",
        "reassignment_count" : NumberLong(1),
        "section" : "web_service",
        "severity" : "1 - High",
        "short_description" : "Network file shares access issue",
        "sla_due" : "UNKNOWN"
}

and you want to send the contents over syslog in a LEEF format you can run:

db.webservice.aggregate(
        {$project:{'*':1}},
        {$out:{
                format:'leef',
                vendor:'default_vendor',
                product:'$SonarG Source',
                product_version:'$number',
                ignore_fields:["_id","SonarG Source","number"],
                value_replace:[{from:'\n',to:'\\n'}],
                fstype:'syslog',
                host:'localhost',
                loglevel:'notice',
                facility: 'user',
                protocol:'udp',
                port:10516
        }})

Producing:

Oct 13 13:27:39 localhost sonarw: LEEF:2.0|default_vendor|SonarDispatcher-Web-Service|INC0000002|59cf96748a7597550015e40f|IP Address=localhost#011__status=success#011active=true#011activity_due=2017-08-10 14:51:11#011approval=null#011assigned_to=Howard Johnson#011assignment_group=Network#011caller_id=Fred Luddy#011category=Network#011close_code=null#011cmdb_ci=FileServerFloor2#011comments=2017-08-03 16:13:23 - System Administrator (Additional comments)\nAdded an attachment\n\n#011comments_and_work_notes=2017-08-03 16:13:23 - System Administrator (Additional comments)\nAdded an attachment\n\n#011contact_type=null#011description=User can't get to any of his files on the file server.#011escalation=Overdue#011hold_reason=Awaiting Vendor#011impact=1 - High#011incident_state=On Hold#011knowledge=false#011location=1050 Sunnyview Road Northeast, Salem,OR#011made_sla=false#011notify=Do Not Notify#011opened_at=2017-05-04T16:07:12#011opened_by=Joe Employee#011priority=1 - Critical#011problem_id=PRB0000007#011reassignment_count=1#011section=web_service#011severity=1 - High#011short_description=Network file shares access issue#011sla_due=UNKNOWN
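For illustration only, here is a rough Python sketch of how such a LEEF line could be assembled. The helper is hypothetical: it assumes the header is vendor|product|version|_id, that ignore_fields are dropped, that value_replace rewrites each value, and that the tab delimiter between key=value pairs is what the syslog daemon renders as #011 in the output above.

```python
def to_leef(doc, vendor, product_field, version_field, ignore, replace):
    """Format one document as a LEEF 2.0 line (simplified model)."""
    header = ["LEEF:2.0", vendor, str(doc[product_field]),
              str(doc[version_field]), str(doc["_id"])]
    body = []
    for key, value in doc.items():
        if key in ignore:
            continue
        text = str(value)
        for rule in replace:                       # value_replace rules
            text = text.replace(rule["from"], rule["to"])
        body.append(f"{key}={text}")
    return "|".join(header) + "|" + "\t".join(body)

doc = {"_id": "59cf96748a7597550015e40f", "number": "INC0000002",
       "SonarG Source": "SonarDispatcher-Web-Service",
       "comments": "line one\nline two"}
line = to_leef(doc, "default_vendor", "SonarG Source", "number",
               ignore={"_id", "SonarG Source", "number"},
               replace=[{"from": "\n", "to": "\\n"}])
print(line)
```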

Chunking the output to a file and compressing files

When you use $out to output data to a file (whether local or on s3) you can compress the file and you can chunk the output into part files. This allows downstream applications to start using file parts as they become available without having to wait until the entire output is complete. Synopsis:

{$out:
        {fstype: 'file',
        name: "s3://access_key:secret_key:bucket/name", // or the path/filename when in the local file system
        zip: true|false, // default is false and can be omitted
        partSize: <size>, // when partSize is specified and there are multiple parts the files will be named name_1, name_2, ..
        format: "csv|json"
}}

partSize can be a number (in bytes) or a string of the form “<num>K|M|G”.
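A small Python sketch of parsing partSize as described (assuming K/M/G denote powers of 1024, which the text does not specify):

```python
def parse_part_size(spec):
    """Parse partSize: a plain number is bytes, "<num>K|M|G" scales it."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if isinstance(spec, int):
        return spec
    spec = spec.strip().upper()
    if spec[-1] in units:
        return int(spec[:-1]) * units[spec[-1]]
    return int(spec)

print(parse_part_size("10M"))  # 10485760
```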

Sort Extensions

Using Interim Sort Results in Applications While the Sort is Still Running

Use the $view sort extension when you want to see interim sort results while the sort is still running (large sorts can take a long time). When $view is used as a field name in the sort specification, a view is defined for the sort that can contain interim results while the sort is operating.

Example:

db.connection_log.aggregate({$sort:{time_ms:-1,$view:{name:"slowest_times",limit:100}}})

creates a view called slowest_times that shows the entries from connection_log with the worst operation times while the sort is occurring.

Use the disposeView admin command to manually remove such a view from the database, e.g:

db.runCommand({disposeView: "slowest_times"})

releases the memory being used by the view called ‘slowest_times’, as long as it was automatically created using $view in a sort operation. Trying to dispose of any other type of view with this command has no effect.

When the connection is terminated all these transient views are destroyed automatically.

Sort “Precision”

When sorting strings it may be useful to configure the MaxFieldSize specifier in order to increase the accuracy of the sort. By default, SonarW considers only the first 64 bytes of each string; differences beyond that point are ignored.

This behavior can be changed using the max_disk_sort_string_len configuration setting, however this limit applies to every sorted field. Since SonarW does not use a schema, it assumes that each and every sorted field has the same maximum length.

Note also that both of these parameters apply only to multi-column disk sorts.

If most sort operations fit within the 64-byte limitation, and only a select few require a larger field, use the MaxFieldSize specifier, for example:

db.sales.aggregate({$sort: {$maxFieldSize: 1024, 'Account': 1, 'Item': -1}})

This setting is only used when there are millions of rows of data (i.e. a disk sort).

$lag

$lag is a projection extension that allows you to project a value from previously processed documents. While it can be used anywhere, it is mostly used after a sort, where you control the processing order of the documents. For example, if you order the documents by time you can compute the delta of a certain measurement between readings using:

{$sort: {time: 1}},{$project: {delta: {$subtract: ["$measurement", {$lag: ["$measurement", 1]}]}}}
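The same lag-and-subtract computation, written as a plain Python sketch for clarity (None stands in for the missing lag of the first document, an assumption on our part):

```python
def lag_delta(docs, field, n=1):
    """Sort by time, then subtract the value n documents back."""
    docs = sorted(docs, key=lambda d: d["time"])
    out = []
    for i, d in enumerate(docs):
        prev = docs[i - n][field] if i >= n else None   # the $lag value
        delta = d[field] - prev if prev is not None else None
        out.append({"time": d["time"], "delta": delta})
    return out

readings = [{"time": 1, "measurement": 10},
            {"time": 2, "measurement": 13},
            {"time": 3, "measurement": 11}]
print(lag_delta(readings, "measurement"))
# [{'time': 1, 'delta': None}, {'time': 2, 'delta': 3}, {'time': 3, 'delta': -2}]
```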

Grouping Extensions

Specifying Filtering Criteria as Part of a Group

If a $group stage will produce a very large number of groups (millions or more) and a subsequent $match will drastically cut down the number of groups, you can use a $having modifier within the $group to lower the cost of materializing the output of the $group phase. Use this when the group stage produces many groups but only a small subset will be carried forward.

If for example:

> db.emp.aggregate({$group: {_id:'$dept' , s: {$sum:'$sal'} } })
{ "_id" : 0, "s" : 450 }
{ "_id" : 1, "s" : 460 }
{ "_id" : 2, "s" : 470 }
{ "_id" : 3, "s" : 480 }
{ "_id" : 4, "s" : 490 }
{ "_id" : 5, "s" : 500 }
{ "_id" : 6, "s" : 510 }
{ "_id" : 7, "s" : 520 }
{ "_id" : 8, "s" : 530 }
{ "_id" : 9, "s" : 540 }

Then:

> db.emp.aggregate({$group: {_id:'$dept' , s: {$sum:'$sal'},
      $having: {$or: [ { $eq:['$s',500]} ,{ $lt: ['$_id',2]}  ]}  } } )
{ "_id" : 0, "s" : 450 }
{ "_id" : 1, "s" : 460 }
{ "_id" : 5, "s" : 500 }

Note that the filter can be on computed aggregate fields or some combination of the id fields and aggregate fields and therefore cannot be pushed as a $match before the $group.
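Logically, $having is equivalent to filtering each group's result after aggregation, as this Python sketch shows; the benefit in SonarW is that discarded groups are never materialized.

```python
from collections import defaultdict

def group_sum_having(rows, key, field, keep):
    """Group-and-sum, then keep only groups passing the `keep` predicate."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[field]
    return [{"_id": k, "s": s} for k, s in sums.items() if keep(k, s)]

# one employee per dept, salaries 450..540, mirroring the example data
emps = [{"dept": d, "sal": 450 + 10 * d} for d in range(10)]
print(group_sum_having(emps, "dept", "sal",
                       keep=lambda _id, s: s == 500 or _id < 2))
# [{'_id': 0, 's': 450}, {'_id': 1, 's': 460}, {'_id': 5, 's': 500}]
```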

Group by Regex

The $group operator in the aggregation framework has been extended to support regular expressions. You can group by parts of a string specified using a regular expression. Consider the following sample data:

> db.assets.find()
{ "_id" : ObjectId("5410b747e7294618c513a150"), "text" : "Oracle 11.1" }
{ "_id" : ObjectId("5410b74de7294618c513a151"), "text" : "Oracle 10.2" }
{ "_id" : ObjectId("5410b751e7294618c513a152"), "text" : "Oracle 10.1" }
{ "_id" : ObjectId("5410b75ee7294618c513a153"), "text" : "Sybase 11.0" }
{ "_id" : ObjectId("5410b76ce7294618c513a154"), "text" : "MySQL 5.5" }
{ "_id" : ObjectId("5410b76fe7294618c513a155"), "text" : "MySQL 5.6" }
{ "_id" : ObjectId("5410b774e7294618c513a156"), "text" : "MySQL 4" }

Suppose that each text includes a product name and a version number, where the version number has a major segment and may have any number of minor segments. If you want to count how many assets there are per product you can do:

> db.assets.aggregate({$group: {_id: {"text": /([a-zA-Z]+) [0-9]+[0-9\.]*/}, count: {$sum: 1}}})
{ "_id" : { "match_0" : "Oracle" }, "count" : 3 }
{ "_id" : { "match_0" : "Sybase" }, "count" : 1 }
{ "_id" : { "match_0" : "MySQL" }, "count" : 3 }

If you want to count how many assets there are per major number you can do:

> db.assets.aggregate({$group: {_id: {"text": /[a-zA-Z]+ ([0-9]+)[0-9\.]*/}, count: {$sum: 1}}})
{ "_id" : { "match_0" : "11" }, "count" : 2 }
{ "_id" : { "match_0" : "10" }, "count" : 2 }
{ "_id" : { "match_0" : "5" }, "count" : 2 }
{ "_id" : { "match_0" : "4" }, "count" : 1 }

You can have as many match elements in the regex as you need. For example, if you want to group on both the product and the major version number do:

> db.assets.aggregate({$group: {_id: {"text": /([a-zA-Z]+) ([0-9]+)[0-9\.]*/}, count: {$sum: 1}}})
{ "_id" : { "match_0" : "Oracle", "match_1" : "11" }, "count" : 1 }
{ "_id" : { "match_0" : "Oracle", "match_1" : "10" }, "count" : 2 }
{ "_id" : { "match_0" : "Sybase", "match_1" : "11" }, "count" : 1 }
{ "_id" : { "match_0" : "MySQL", "match_1" : "5" }, "count" : 2 }
{ "_id" : { "match_0" : "MySQL", "match_1" : "4" }, "count" : 1 }

Each () parentheses pair generates another grouping/matching element.

Note that you may not mix elements of a group-by-regex and regular grouping fields in the same $group operator.
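The same grouping can be reproduced in Python with the standard re module; each () capture group becomes a match_<n> element of the key. (Documents that do not match are simply skipped here, an assumption on our part, since the text does not say how they are handled.)

```python
import re
from collections import Counter

def group_by_regex(docs, field, pattern):
    """Group documents by the capture groups of `pattern` on `field`."""
    counts = Counter()
    for doc in docs:
        m = re.search(pattern, doc[field])
        if m:
            key = tuple(f"match_{i}:{g}" for i, g in enumerate(m.groups()))
            counts[key] += 1
    return counts

assets = [{"text": t} for t in ["Oracle 11.1", "Oracle 10.2", "Oracle 10.1",
                                "Sybase 11.0", "MySQL 5.5", "MySQL 5.6",
                                "MySQL 4"]]
print(group_by_regex(assets, "text", r"([a-zA-Z]+) ([0-9]+)[0-9.]*"))
```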

Group Pivot Ranges

SonarW can generate pivots based on ranges while grouping data. This is useful when you need indicator variables; rather than computing a large number of such variables you can use a single pivot expression.

Suppose, as an example, that you have documents expressing salaries and locations, e.g.:

db.pivots.insert([
{city: 'LA', age: 31, salary: 50},
{city: 'LA', age: 37, salary: 62},
{city: 'LA', age: 25, salary: 48},
{city: 'NY', age: 23, salary: 62},
{city: 'NY', age: 28, salary: 72}
])

Suppose that you want to compute, per city, how many people make between 20 and 30, how many between 30 and 40, and so on. You would do:

> db.pivots.aggregate({$group: {_id: '$city', sumSalary:{$sum:[1, '$salary', [20,30,40,50,60,70,100]]}}}).pretty()
{
   "_id" : "LA",
   "sumSalary_20_30" : 0,
   "sumSalary_30_40" : 0,
   "sumSalary_40_50" : 1,
   "sumSalary_50_60" : 1,
   "sumSalary_60_70" : 1,
   "sumSalary_70_100" : 0,
   "sumSalary_100" : 0
}
{
   "_id" : "NY",
   "sumSalary_20_30" : 0,
   "sumSalary_30_40" : 0,
   "sumSalary_40_50" : 0,
   "sumSalary_50_60" : 0,
   "sumSalary_60_70" : 1,
   "sumSalary_70_100" : 1,
   "sumSalary_100" : 0
}

The command generates a set of fields, all with the sumSalary prefix. While the groups are being formed, indicator variables are computed based on the ranges in the array parameter. The second argument, $salary, specifies which field to look at to determine which aggregate value should be affected. As another example, if the buckets are to be just 0-50, 50-100 and over 100:

> db.pivots.aggregate({$group: {_id: '$city', sumSalary:{$sum:[1, '$salary', [0,50,100]]}}}).pretty()
{
   "_id" : "LA",
   "sumSalary_0_50" : 1,
   "sumSalary_50_100" : 2,
   "sumSalary_100" : 0
}
{
   "_id" : "NY",
   "sumSalary_0_50" : 0,
   "sumSalary_50_100" : 2,
   "sumSalary_100" : 0
}

Also, in both of these examples the operator was a count - a sum of 1 for each document. You can use any variable and any group aggregation operator. For example, to compute average salaries by age group per city:

> db.pivots.aggregate({$group: {_id: '$city', avgSalary:{$avg:['$salary','$age',[20,30,40,50,60,70,100]]}}}).pretty()
{ "_id" : "LA", "avgSalary_20_30" : 48, "avgSalary_30_40" : 56 }
{ "_id" : "NY", "avgSalary_20_30" : 67 }
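A Python model of the range pivot may help: buckets are half-open intervals [low, high) built from consecutive edges, plus an open-ended final bucket. The low-inclusive assumption matches the output above, where a salary of exactly 50 lands in the 50_60 bucket.

```python
from collections import defaultdict

def pivot_count(rows, group_key, pivot, edges):
    """Count rows per group into range buckets named count_<lo>_<hi>."""
    def empty():
        buckets = {f"count_{lo}_{hi}": 0 for lo, hi in zip(edges, edges[1:])}
        buckets[f"count_{edges[-1]}"] = 0      # open-ended final bucket
        return buckets
    groups = defaultdict(empty)
    for row in rows:
        p, g = row[pivot], groups[row[group_key]]
        for lo, hi in zip(edges, edges[1:]):
            if lo <= p < hi:                   # low edge inclusive
                g[f"count_{lo}_{hi}"] += 1
                break
        else:
            if p >= edges[-1]:
                g[f"count_{edges[-1]}"] += 1
    return dict(groups)

people = [{"city": "LA", "salary": 50}, {"city": "LA", "salary": 62},
          {"city": "LA", "salary": 48}, {"city": "NY", "salary": 62},
          {"city": "NY", "salary": 72}]
print(pivot_count(people, "city", "salary", [0, 50, 100]))
# {'LA': {'count_0_50': 1, 'count_50_100': 2, 'count_100': 0},
#  'NY': {'count_0_50': 0, 'count_50_100': 2, 'count_100': 0}}
```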

Let's look at another example, this time from SonarG. One of the collections generated by SonarG's SAGEs analytic engines is ae_dt, which has (among other things) a document per hour and per day for every session combination (user, ip, service, etc.). A typical document looks like:

{
  "_id": {
    "$oid": "570323b88e82c4680000002b"
  },
  "DB_User_Name": "GUARDIUM_QA",
  "Server_IP": "9.70.156.59",
  "Client_IP": "9.70.157.12",
  "Service_Name": "9.70.156.59:5.6.10",
  "Client_Host_Name": null,
  "Day": {
    "$date": "2016-04-04T00:00:00.000Z"
  },
  "Hour": 22,
  "count": 1,
  "Succeeded": 1,
  "Failed": 0,
  "OS_User": null,
  "Type": "session"
}

The following query generates a pivot table for every user who has over 1000 sessions overall, listing the number of sessions that user has per month:

db.ae_dt.aggregate(
   {$match: {"Type": "session"}},
   {$group: {
      _id: "$DB_User_Name",
      SessionInMonth: {$sum: ["$count",{$month: "$Day"}, [1,2,3,4,5,6,7,8,9,10,11,12]]},
      Count: {$sum: "$count"},
      $having: {Count: {$gt: 1000}}
   }})

The result pivot table looks like the following figure.

_images/pivot.png

Combining Multiple Documents into One

The $union group operator allows you to combine multiple documents that match a grouping criterion into a single document. It is useful when different documents describe the same real-world entity (perhaps with different attributes/fields) and you want to unify them into a single subdocument.

Suppose for example that you have multiple documents describing different attributes of the same person:

> db.people.find()
{ "_id" : ObjectId("5410dfc4a38e812d3e6f4228"), "lang" : "en", "name" : "John" }
{ "_id" : ObjectId("5410dfd2a38e812d3e6f4229"), "lang" : "pt", "name" : "Sergio" }
{ "_id" : ObjectId("5410dfdba38e812d3e6f422a"), "age" : 32, "name" : "John" }
{ "_id" : ObjectId("5410dfeca38e812d3e6f422b"), "age" : 28, "name" : "Sergio" }

You can combine these documents into a single document using a $group with a $union:

> db.people.aggregate({$group: {_id: "$name", attributes: {$union: 1}}})
{ "_id" : "John", "attributes" : { "age" : 32, "lang" : "en", "name" : "John" } }
{ "_id" : "Sergio", "attributes" : { "age" : 28, "lang" : "pt", "name" : "Sergio" } }

Note that if there are multiple values for the same field, one of them will be selected, e.g. if we add:

> db.people.insert({name: "Sergio", age: 29, height: 185})

Then:

> db.people.aggregate({$group: {_id: "$name", attributes: {$union: 1}}})
{ "_id" : "John", "attributes" : { "age" : 32, "lang" : "en", "name" : "John" } }
{ "_id" : "Sergio", "attributes" : { "age" : 28, "height" : 185, "lang" : "pt", "name" : "Sergio" } }

$union is often used in a pattern along with indirect creation of fields (calculated field names in $project).
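The merge that $union performs per group can be sketched as a dict merge in Python. Note the text only guarantees that one of the conflicting values is selected; this sketch arbitrarily keeps the first value seen for a field.

```python
from collections import defaultdict

def union_group(docs, key):
    """Merge all documents sharing `key` into one attributes subdocument."""
    merged = defaultdict(dict)
    for doc in docs:
        for field, value in doc.items():
            if field != "_id":
                merged[doc[key]].setdefault(field, value)  # first value wins
    return [{"_id": k, "attributes": v} for k, v in merged.items()]

people = [{"_id": 1, "lang": "en", "name": "John"},
          {"_id": 2, "lang": "pt", "name": "Sergio"},
          {"_id": 3, "age": 32, "name": "John"},
          {"_id": 4, "age": 28, "name": "Sergio"}]
print(union_group(people, "name"))
```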

$countDistinct function in $group

SonarW supports an additional accumulator operator called $countDistinct, which counts how many distinct non-null values of an expression exist in a group.

Example:

> db.people.find()
{ "_id" : 1, "name" : "john" }
{ "_id" : 2, "name" : "Sergio" }
{ "_id" : 3, "name" : "John" }
{ "_id" : 4, "name" : "Sergio" }
{ "_id" : 5, "name" : null }

> db.people.aggregate({$group: {_id: null, number_of_names: {$countDistinct: "$name"}}})
{ "_id" : null, "number_of_names" : NumberLong(3) }

> db.people.aggregate({
    "$group":{
        "_id": null,
        "number_of_names": {
            "$countDistinct": {
                "$cond":[
                    {$ne:["$name",  null]},
                    {"$toLower": "$name"},
                    "$name"
                ]
            }
        }
    }
})

{ "_id" : null, "number_of_names" : NumberLong(2) }

The example shows that the documents contain three distinct names (john, John and Sergio), so the first query returns 3.

In the second query we count the distinct values of an expression that converts the “name” field to lowercase when it is not null (so that nulls are not converted to empty strings). The result is 2: john and sergio.
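The $countDistinct computation reduces to a set size over non-null values, as in this tiny Python sketch:

```python
def count_distinct(values):
    # $countDistinct: number of distinct values, nulls (None) excluded
    return len({v for v in values if v is not None})

names = ["john", "Sergio", "John", "Sergio", None]
print(count_distinct(names))                                      # 3
print(count_distinct([n.lower() if n else None for n in names]))  # 2
```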

$sort and $limit within $group

For better performance, SonarW supports sorting and limiting the results within the group pipeline stage.

Example:

> db.people.find()
{ "_id" : 1, "name" : "john" }
{ "_id" : 2, "name" : "Sergio" }
{ "_id" : 3, "name" : "John" }
{ "_id" : 4, "name" : "Sergio" }
{ "_id" : 5, "name" : null }

> db.people.aggregate({$group: {_id: "$name", count: {$sum: 1}, $sort: {$keys: {count: -1}, $limit: 1}}})
{ "_id" : "Sergio", "count" : 2 }

> db.people.aggregate({$group: {_id: "$name", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 1})
{ "_id" : "Sergio", "count" : 2 }

The example shows that doing $sort and $limit within the $group returns the same results as adding them as later stages in the pipeline.

Getting the document responsible for the min or max value of a field

A very common pattern in analytics is to find the value of one field in the document where another field is the min or max of a group. For example, find the number of open tickets on the last day worked by a person: the min or max is taken over a date field, but the desired value is in another field. While you can do this using window functions (supported in SonarW as of version 1.2), an extension to $group allows you to perform this very easily. For example, using:

$group: {
   _id: "$user",
   latest: {$maxDoc: "$date"}
}

will project the latest document per person (by the date field).
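$maxDoc can be pictured as keeping, per group, the whole document holding the maximum of the given field. This Python sketch models that behavior (a $min counterpart is implied by the text but not shown, so we model only the max case; ISO date strings compare chronologically, so plain comparison suffices here).

```python
def max_doc(docs, group_key, by):
    """Keep, per group, the entire document with the largest `by` value."""
    best = {}
    for doc in docs:
        key = doc[group_key]
        if key not in best or doc[by] > best[key][by]:
            best[key] = doc
    return best

tickets = [{"user": "ann", "date": "2020-01-01", "open": 7},
           {"user": "ann", "date": "2020-03-01", "open": 2},
           {"user": "bob", "date": "2020-02-01", "open": 5}]
print(max_doc(tickets, "user", "date"))
# {'ann': {'user': 'ann', 'date': '2020-03-01', 'open': 2},
#  'bob': {'user': 'bob', 'date': '2020-02-01', 'open': 5}}
```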

Multi Grouping

SonarW adds the ability to group a single document into multiple groups, via two special operators:

  1. $headTailGroup - group by the heads or the tails of a string with a head/tail format.
  2. $iterGroup - group by the elements in an array

Note

Only a single multi-grouping operator is permitted per _id field within the $group syntax.

HeadTailGroup Multi Grouping Operator

Syntax:

$headTailGroup: {
    "value": <Expression to run the head tail grouping on>,
    "head": true/false,
    "sentence_delimiter": <optional single char representing the sentence delimiter. defaults to ";">,
    "head_tail_delimiter": <optional single char representing the head_tail delimiter. defaults to " ">,
    "case_insensitive": <optional boolean for case insensitivity. defaults to false (case sensitive) >
}

The sentence_delimiter and head_tail_delimiter parameters are optional, defaulting to “;” and “ ” respectively.

Example usage:

> db.data.find()
{ "head_tail_field": "TABLE1 SELECT;TABLE2 SELECT", "User": "Ron" }
{ "head_tail_field": "'TABLE 1' SELECT;'TABLE 1' DROP TABLE", "User": "Rachel" }

> db.data.aggregate(
    {"$group": {
        "_id": {
            "$headTailGroup": {
                "value": "$head_tail_field",
                "head": true
            }
        },
        "count": {"$sum": 1}
    }}
)
{ "_id" : "TABLE 1", "count" : 2 }
{ "_id" : "TABLE1", "count" : 1 }
{ "_id" : "TABLE2", "count" : 1 }

> db.data.aggregate(
    {"$group": {
        "_id": {
            "$headTailGroup": {
                "value": "$head_tail_field",
                "head": false
            }
        },
        "count": {"$sum": 1}
    }}
)
{ "_id" : "SELECT", "count" : 3 }
{ "_id" : "DROP TABLE", "count" : 1 }

> db.data.aggregate(
    {"$group": {
        "_id": {
            "user": "$User",
            "verb": {
                "$headTailGroup": {
                    "value": "$head_tail_field",
                    "head": false
                }
            }
        },
        "count": {"$sum": 1}
    }}
)
{ "_id" : { "user" : "Rachel", "verb" : "SELECT" }, "count" : 1 }
{ "_id" : { "user" : "Rachel", "verb" : "DROP TABLE" }, "count" : 1 }
{ "_id" : { "user" : "Ron", "verb" : "SELECT" }, "count" : 2 }

IterGroup Multi Grouping Operator

$iterGroup can group on elements of an array.

Syntax:

$iterGroup: <Expression to run the grouping on - should be an array>

Example usage with array of numbers:

> db.data.find()
{ "array_field": [1, 2, 1], "User": "Ron" }
{ "array_field": [1, 2, 3], "User": "Ron" }
{ "array_field": 1, "User": "Rachel" }

> db.data.aggregate(
    {"$group": {
        "_id": {
            "$iterGroup": "$array_field"
        },
        "count": {"$sum": 1}
    }}
)
{ "_id" : 1, "count" : 4 }
{ "_id" : 2, "count" : 2 }
{ "_id" : 3, "count" : 1 }

> db.data.aggregate(
    {"$group": {
        "_id": {
            "user": "$User",
            "arr_element": {
                "$iterGroup": "$array_field"
            }
        },
        "count": {"$sum": 1}
    }}
)
{ "_id" : { "user" : "Rachel", "arr_element" : 1 }, "count" : 1 }
{ "_id" : { "user" : "Ron", "arr_element" : 1 }, "count" : 3 }
{ "_id" : { "user" : "Ron", "arr_element" : 2 }, "count" : 2 }
{ "_id" : { "user" : "Ron", "arr_element" : 3 }, "count" : 1 }

Example usage with an array of Documents:

> db.data.find()
{ "stocks": [{"stock": "GOOGL", "amount": 5642}, {"stock": "NFLX", "amount": 8645}], "User": "Ron" }
{ "stocks": [{"stock": "GOOGL", "amount": 125}, {"stock": "AAPL", "amount": 789}], "User": "Ron" }
{ "stocks": [{"stock": "NFLX", "amount": 5688}, {"stock": "AAPL", "amount": 1112}], "User": "Rachel" }

db.data.aggregate(
    {"$group": {
        "_id": {
            "user": "$User",
            "stock": {
                "$iterGroup": "$stocks.stock"
            }
        },
        "count": {"$sum": 1}
    }}
)

{ "_id" : { "user" : "Rachel", "stock" : "AAPL" }, "count" : 1 }
{ "_id" : { "user" : "Rachel", "stock" : "NFLX" }, "count" : 1 }
{ "_id" : { "user" : "Ron",    "stock" : "GOOGL" }, "count" : 2 }
{ "_id" : { "user" : "Ron",    "stock" : "NFLX" }, "count" : 1 }
{ "_id" : { "user" : "Ron",    "stock" : "AAPL" }, "count" : 1 }

To use group functions on the elements of the array, use the $iter modifier on the elements you want to inspect. For example:

db.data.aggregate(
    {"$group": {
        "_id": {
            "user": "$User",
            "stock": {
                "$iterGroup": "$stocks.stock"
            }
        },
        "total_amount": {$iter: {"$sum": "$stocks.amount"}}}
    }
)
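
Assuming the $iter modifier pairs each accumulator evaluation with the array element currently being iterated (an assumption made here for illustration), the pipeline above amounts to summing amount per (user, stock) pair, roughly:

```javascript
// Sketch: total amount per (user, stock) pair, one group per array element.
function groupStocks(docs) {
  const totals = {};
  for (const doc of docs) {
    for (const s of doc.stocks) {
      const key = doc.User + "/" + s.stock; // compound group key
      totals[key] = (totals[key] || 0) + s.amount;
    }
  }
  return totals;
}

const portfolio = [
  { stocks: [{ stock: "GOOGL", amount: 5642 }, { stock: "NFLX", amount: 8645 }], User: "Ron" },
  { stocks: [{ stock: "GOOGL", amount: 125 }, { stock: "AAPL", amount: 789 }], User: "Ron" },
  { stocks: [{ stock: "NFLX", amount: 5688 }, { stock: "AAPL", amount: 1112 }], User: "Rachel" },
];
groupStocks(portfolio); // { "Ron/GOOGL": 5767, "Ron/NFLX": 8645, ... }
```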

Nested Grouping

In SonarW we added the ability to do nested grouping, where the "inner" group operates on the documents that fall into the "outer" group. This allows you to do analytics within each group as well as across the whole collection.

Syntax:

<field_name>: {"$group": <inner group>}

Example:

Sample Data:

{ "first_name" : "Chrysa", "gender" : "Female", "country" : "Canada", "city" : "Montréal-Est" }
{ "first_name" : "Wylma", "gender" : "Female", "country" : "Canada", "city" : "Cochrane" }
{ "first_name" : "Deb", "gender" : "Female", "country" : "Canada", "city" : "Amos" }
{ "first_name" : "Jeanie", "gender" : "Female", "country" : "Canada", "city" : "Camlachie" }
{ "first_name" : "Hamilton", "gender" : "Male", "country" : "Canada", "city" : "Thetford-Mines" }
{ "first_name" : "Cassi", "gender" : "Female", "country" : "Germany", "city" : "Nürnberg" }
{ "first_name" : "Zed", "gender" : "Male", "country" : "Canada", "city" : "Adstock" }
{ "first_name" : "Lebbie", "gender" : "Female", "country" : "Canada", "city" : "Lumsden" }
{ "first_name" : "Urson", "gender" : "Male", "country" : "Canada", "city" : "Roberval" }
{ "first_name" : "Nichol", "gender" : "Female", "country" : "Canada", "city" : "Belfast" }
{ "first_name" : "Alexis", "gender" : "Male", "country" : "Uganda", "city" : "Kayunga" }
{ "first_name" : "Berky", "gender" : "Male", "country" : "Canada", "city" : "Bonavista" }
{ "first_name" : "Levon", "gender" : "Male", "country" : "Uganda", "city" : "Lyantonde" }
{ "first_name" : "Egan", "gender" : "Male", "country" : "Germany", "city" : "Duisburg" }
{ "first_name" : "Dominic", "gender" : "Male", "country" : "Canada", "city" : "Little Current" }
{ "first_name" : "Broderick", "gender" : "Male", "country" : "Germany", "city" : "Bochum" }
{ "first_name" : "Field", "gender" : "Male", "country" : "Germany", "city" : "Osnabrück" }
{ "first_name" : "Bax", "gender" : "Male", "country" : "Canada", "city" : "Lunenburg" }
{ "first_name" : "Winnifred", "gender" : "Female", "country" : "Canada", "city" : "Kimberley" }
{ "first_name" : "Chickie", "gender" : "Female", "country" : "Canada", "city" : "Montréal" }
...

Query:

db.mock.aggregate(
    {"$group": {
        "_id": "$country",
        "country_count": {"$sum": 1},
        "top_cities": {"$group": {
            "_id": "$city",
            "city_count": {"$sum": 1},
            "$sort": {
                "$keys": {"city_count": -1},
                "$limit": 3
            },
            "gender_split": {"$group": {
                "_id": "$gender",
                "gender_count": {"$sum": 1}
            }}
        }}
    }}
)

Results:

{ "_id": "Canada", "country_count": 3480, "top_cities": [
    { "_id" : "Windsor", "city_count" : 20, "gender_split": [
        { "_id" : "Female", "gender_count" : 15 },
        { "_id" : "Male", "gender_count" : 5 }
    ]},
    { "_id" : "Chilliwack", "city_count" : 16 "gender_split": [
        { "_id" : "Female", "gender_count" : 8 },
        { "_id" : "Male", "gender_count" : 8 }
    ]},
    { "_id" : "White City", "city_count" : 15 "gender_split": [
        { "_id" : "Female", "gender_count" : 7 },
        { "_id" : "Male", "gender_count" : 8 }
    ]}
]},
{ "_id": "Germany", "country_count": 876, "top_cities": [
    { "_id" : "Berlin", "city_count" : 122, "gender_split": [
        { "_id" : "Female", "gender_count" : 65 },
        { "_id" : "Male", "gender_count" : 57 }
    ]},
    { "_id" : "Hamburg", "city_count" : 61 "gender_split": [
        { "_id" : "Female", "gender_count" : 27 },
        { "_id" : "Male", "gender_count" : 34 }
    ]},
    { "_id" : "München", "city_count" : 58 "gender_split": [
        { "_id" : "Female", "gender_count" : 27 },
        { "_id" : "Male", "gender_count" : 31 }
    ]}
]},
{ "_id": "Uganda", "country_count": 644, "top_cities": [
    { "_id" : "Bweyogerere", "city_count" : 15, "gender_split": [
        { "_id" : "Female", "gender_count" : 8 },
        { "_id" : "Male", "gender_count" : 7 }
    ]},
    { "_id" : "Amuru", "city_count" : 14 "gender_split": [
        { "_id" : "Female", "gender_count" : 7 },
        { "_id" : "Male", "gender_count" : 7 }
    ]},
    { "_id" : "Kabale", "city_count" : 13 "gender_split": [
        { "_id" : "Female", "gender_count" : 8 },
        { "_id" : "Male", "gender_count" : 5 }
    ]}
]}
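
As a plain-JavaScript sketch of the same idea, here is a two-level nested group (country, then gender) over a handful of the sample documents; the inner group sees only the documents that fell into its outer group:

```javascript
// Sketch: outer group on country, inner group on gender within each country.
function nestedGroup(docs) {
  const out = {};
  for (const d of docs) {
    const c = out[d.country] || (out[d.country] = { country_count: 0, gender_split: {} });
    c.country_count += 1;                                           // outer accumulator
    c.gender_split[d.gender] = (c.gender_split[d.gender] || 0) + 1; // inner accumulator
  }
  return out;
}

const people = [
  { first_name: "Chrysa", gender: "Female", country: "Canada" },
  { first_name: "Cassi", gender: "Female", country: "Germany" },
  { first_name: "Zed", gender: "Male", country: "Canada" },
  { first_name: "Alexis", gender: "Male", country: "Uganda" },
  { first_name: "Egan", gender: "Male", country: "Germany" },
];
nestedGroup(people);
// { Canada: { country_count: 2, gender_split: { Female: 1, Male: 1 } }, ... }
```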

Unwind Extensions

Delayed Projection After an Unwind

The unwind pipeline stage deconstructs a column which contains an array and builds a full new document for each value in the array.

Given the following sample data:

db.example.drop()
db.example.insert({a: 1 , col: ["red", "blue", "green"]})

The result of unwinding the column col:

db.example.aggregate({$unwind: '$col'})
{ "_id" : ObjectId("5632788e46ce566e057b5451"), "a" : 1, "col" : "red" }
{ "_id" : ObjectId("5632788e46ce566e057b5451"), "a" : 1, "col" : "blue" }
{ "_id" : ObjectId("5632788e46ce566e057b5451"), "a" : 1, "col" : "green" }

Consider the special condition that an unwind is followed by a project or a group. Example:

db.example.drop()
db.example.insert({a: 1 , col: ["red", "blue", "green"], other: "this is a message"})
db.example.aggregate({$unwind: '$col'},{$project: {_id: 0, a: 1, col: 1}})

In this example, we don’t need the columns other than a, col or _id. In some cases there may be many such columns which will not be projected (or grouped) and will not be available to future stages.

In this case, SonarW automatically delays materializing any column until the next project or group, when we can be sure the column is required. This optimization results in a significant speedup.

The same optimization is applied when the next operation is another unwind: materialization is delayed until after the following unwind.

Note: This optimization can only be applied to at most two consecutive unwind operations. If there are 3 or more consecutive unwind operations, only the last two will be optimized.

Note: Two unwind operations separated by a $skip or $limit are still considered consecutive. If any other operation appears between two unwind operations, they are not considered consecutive.

Specifying Match Criteria Within an Unwind

SonarW supports a matching primitive within the unwind itself. This match operates only on the elements being unwound from the array. The benefit of this combination of operators is speed: it is far faster to do the match as part of the unwind than as a separate stage after it, because array members that do not match the conditions are never generated. For maximum throughput, avoid any use of $match after the $unwind. If you want to match on fields outside the unwound array, do the match before the $unwind; if you want to match on elements in the array, move the conditions into the unwind. In most cases the optimizer will do the first rewrite for you, but it will not move conditions into an $elemMatch within the $unwind.

If you want to match on elements in the unwound array, then instead of the unwind:

db.col.aggregate({$unwind: "$arr"})

Use:

db.col.aggregate({$unwind: {field: "$arr", $elemMatch: {.. your match conditions ...}}})

For example, if:

db.elem.drop()

db.elem.insert({ a: [ {b:1, c:2}, {c:2, d:31}, {d:13, e:4} ] })
db.elem.insert({ a: [ {b:2, c:2}, {c:2, d:32}, {d:14, e:4} ] })
db.elem.insert({ a: [ {b:3, c:2}, {c:2, d:33}, {d:15, e:4} ] })
db.elem.insert({ a: [ {b:4, c:2}, {c:2, f:34}, {f:16, e:4} ] })

Then you can unwind a but emit only subdocuments that have a field called d:

db.elem.aggregate({$unwind: {field: "$a", $elemMatch: {d: {$exists: true}}}})
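
Conceptually this is equivalent to unwinding and then filtering, except that non-matching elements are never generated. A plain-JavaScript sketch of the combined operation:

```javascript
// Sketch: unwind `field`, emitting one document per array element that
// passes the predicate (here, elements that have a field `d`).
function unwindWhere(docs, field, pred) {
  const out = [];
  for (const doc of docs)
    for (const elem of doc[field] || [])
      if (pred(elem)) out.push(Object.assign({}, doc, { [field]: elem }));
  return out;
}

const elems = [
  { a: [{ b: 1, c: 2 }, { c: 2, d: 31 }, { d: 13, e: 4 }] },
  { a: [{ b: 2, c: 2 }, { c: 2, d: 32 }, { d: 14, e: 4 }] },
  { a: [{ b: 3, c: 2 }, { c: 2, d: 33 }, { d: 15, e: 4 }] },
  { a: [{ b: 4, c: 2 }, { c: 2, f: 34 }, { f: 16, e: 4 }] },
];
unwindWhere(elems, "a", (e) => "d" in e); // 6 documents, two per matching row
```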

Note that a second unwind cannot follow an unwind with an elemMatch in this version of SonarW. For example, this is supported:

db.elem.drop()
db.elem.insert({ a: [ {c: [{k: 1, d:1}]} ]  })
db.elem.insert({ a: [ {c: [{k: 2, d: 2}, {k: 3, d: 3}]}] })
db.elem.insert({ a: [ {b:1, d: 11}, {c: [{k: 2, d: null}, {k: 5}]}] })

db.elem.aggregate([
{$unwind: '$a'},
{$unwind: {field: '$a.c', $elemMatch: {d: {$gt: 2}}}},
{$project: {_id: 0, a: 1}} ])

{ "a" : { "c" : { "k" : 3, "d" : 3 } } }

But this is not supported:

db.elem.drop()
db.elem.insert({ a: [ {i: 1, c: [{k: 1, d:1}]} ]  })
db.elem.insert({ a: [ {i: 2, c: [{k: 2, d: 2}, {k: 3, d: 3}]}] })
db.elem.insert({ a: [ {i: 3, b:1, d: 11}, {i: 4, c: [{k: 2, d: null}, {k: 5}]}] })

db.elem.aggregate([
{$unwind: {field: '$a', $elemMatch: {i: {$gt: 2}}}},
{$unwind: '$a.c'},
{$project: {_id: 0, a: 1}} ])

Converting an Array into a Document (Name-Value Pairs)

The $unwindByKey operator is used within an aggregation $project and allows you to convert an array to a document with tags in-place. It produces a single document for each input document (as opposed to the classic $unwind operator, which generates many documents based on an array). It converts the array into a subdocument whose keys are derived from a field in each array element.

For example, suppose you have an array with name-value pairs as:

> db.t1.findOne()
{
   "_id" : ObjectId("5544e757e2a9104e7da5e8b5"),
   "nvs" : [
      {
         "name" : "attr1",
         "value" : "val1"
      },
      {
         "name" : "attr2",
         "val" : "val2"
      }
   ]
}

The array is not very convenient for queries since any operation will first require an unwind. Therefore, it may be useful to convert this into a subdocument using the $unwindByKey operator. To use the operator you specify the array field name and what to use as key, for example:

> db.t1.aggregate({$project: {nvs: {$unwindByKey: {field: "$nvs", keys: ["name"]}}}},{$out: "t2"})
> db.t2.findOne()
{
   "_id" : ObjectId("5544e757e2a9104e7da5e8b5"),
   "nvs" : {
      "attr1" : {
         "value" : "val1"
      },
      "attr2" : {
         "val" : "val2"
      }
   }
}
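
The basic conversion can be sketched in plain JavaScript as follows - one output entry per array element, with an unnamed_N fallback for elements missing all key fields (the later examples in this section show the fallback in action):

```javascript
// Sketch: convert a name-value array into a subdocument keyed by the
// first listed key field present in each element.
function unwindByKey(arr, keys) {
  const out = {};
  let unnamed = 0;
  for (const elem of arr) {
    const keyField = keys.find((k) => k in elem);
    const name = keyField === undefined ? "unnamed_" + unnamed++ : elem[keyField];
    const rest = {};
    for (const k in elem) if (k !== keyField) rest[k] = elem[k]; // drop the key field
    out[name] = rest;
  }
  return out;
}

unwindByKey(
  [{ name: "attr1", value: "val1" }, { name: "attr2", val: "val2" }],
  ["name"]
); // { attr1: { value: "val1" }, attr2: { val: "val2" } }
```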

Note that the keys field will accept an array. This allows you to define multiple keys that will be used for creating the subdocument keys so that if the original array uses different values, you can still unwind it. For example, if the original document has mixed names such as:

> db.t1.find().pretty()
{
   "_id" : ObjectId("5544e757e2a9104e7da5e8b5"),
   "nvs" : [
      {
         "name" : "attr1",
         "value" : "val1"
      },
      {
         "name" : "attr2",
         "val" : "val2"
      }
   ]
}
{
   "_id" : ObjectId("5544e8dee2a9104e7da5e8b6"),
   "nvs" : [
      {
         "name" : "attr3",
         "value" : "val3"
      },
      {
         "val" : "val4",
         "description" : "attr4"
      }
   ]
}

You can still convert the array in-place:

> db.t1.find().pretty()
{
   "_id" : ObjectId("5544e757e2a9104e7da5e8b5"),
   "nvs" : [
      {
         "name" : "attr1",
         "value" : "val1"
      },
      {
         "name" : "attr2",
         "val" : "val2"
      }
   ]
}
{
   "_id" : ObjectId("5544e8dee2a9104e7da5e8b6"),
   "nvs" : [
      {
         "name" : "attr3",
         "value" : "val3"
      },
      {
         "val" : "val4",
         "description" : "attr4"
      }
   ]
}
> db.t1.aggregate({$project: {nvs: {$unwindByKey: {field: "$nvs", keys: ["name", "description"]}}}},{$out: "t2"})
> db.t2.find().pretty()
{
   "_id" : ObjectId("5544e757e2a9104e7da5e8b5"),
   "nvs" : {
      "attr1" : {
         "value" : "val1"
      },
      "attr2" : {
         "val" : "val2"
      }
   }
}
{
   "_id" : ObjectId("5544e8dee2a9104e7da5e8b6"),
   "nvs" : {
      "attr3" : {
         "value" : "val3"
      },
      "attr4" : {
         "val" : "val4"
      }
   }
}

The order of the keys array matters when an element has a value for more than one of the keys - for example:

> db.t1.findOne()
{
   "_id" : ObjectId("5544ea3ce2a9104e7da5e8b7"),
   "nvs" : [
      {
         "name" : "attr3",
         "value" : "val3"
      },
      {
         "name" : "attr2",
         "description" : "attr4",
         "val" : "val4"
      }
   ]
}
> db.t1.aggregate({$project: {nvs: {$unwindByKey: {field: "$nvs", keys: ["name", "description"]}}}},{$out: "t2"})
> db.t2.findOne()
{
   "_id" : ObjectId("5544ea3ce2a9104e7da5e8b7"),
   "nvs" : {
      "attr3" : {
         "value" : "val3"
      },
      "attr2" : {
         "description" : "attr4",
         "val" : "val4"
      }
   }
}
> db.t1.aggregate({$project: {nvs: {$unwindByKey: {field: "$nvs", keys: ["description", "name"]}}}},{$out: "t2"})
> db.t2.findOne()
{
   "_id" : ObjectId("5544ea3ce2a9104e7da5e8b7"),
   "nvs" : {
      "unnamed_0" : {
         "value" : "val3"
      },
      "attr3" : {
         "value" : "val3"
      },
      "attr4" : {
         "name" : "attr2",
         "val" : "val4"
      }
   }
}

$unwindByKey can also be used recursively - i.e. multiple embedded arrays are converted. For example:

> db.t1.findOne()
{
   "_id" : ObjectId("5544eb19e2a9104e7da5e8b8"),
   "nvs" : [
      {
         "name" : "attr3",
         "value" : "val3"
      },
      {
         "name" : "attr2",
         "description" : "attr4",
         "val" : "val4"
      },
      {
         "name" : "arr",
         "value" : [
            {
               "d" : "attr7",
               "v" : "val7"
            },
            {
               "d" : "attr8",
               "v" : "val8"
            }
         ]
      }
   ]
}
> db.t1.aggregate({$project: {nvs: {$unwindByKey: {field: "$nvs", keys: ["description", "name"], $unwindByKey: {field: "value", keys: ["d"]}}}}},{$out: "t2"})
> db.t2.findOne()
{
   "_id" : ObjectId("5544eb19e2a9104e7da5e8b8"),
   "nvs" : {
      "unnamed_0" : {
         "value" : "val3"
      },
      "attr3" : {
         "value" : "val3"
      },
      "attr4" : {
         "name" : "attr2",
         "val" : "val4"
      },
      "arr" : {
         "attr7" : {
            "v" : "val7"
         },
         "attr8" : {
            "v" : "val8"
         }
      }
   }
}

Converting an Array to a Document Using Positional Elements

$unwindPositional allows you to unwind an array based on element positions rather than tags. It is useful when you have multi-type arrays where each position has a distinct meaning - data that probably should have been structured as a document in the first place, but was not.

For example, if documents contain an array where the first position is the name, the second is an age and the third is a term (which is a subdocument) such as:

> db.t1.findOne()
{
   "_id" : ObjectId("5544eed1e2a9104e7da5e8ba"),
   "desc" : [
      "Joe",
      35,
      {
         "term1" : "val1",
         "term2" : "val2"
      }
   ]
}

Then:

> db.t1.aggregate({$project: {data: {$unwindPositional: "$desc"}}}).pretty()
{
   "_id" : ObjectId("5544eed1e2a9104e7da5e8ba"),
   "data" : {
      "at_0" : "Joe",
      "at_1" : 35,
      "at_2" : {
         "term1" : "val1",
         "term2" : "val2"
      }
   }
}
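
The positional conversion is straightforward to sketch in plain JavaScript; each element lands under an at_N key derived from its index:

```javascript
// Sketch: convert an array into a subdocument keyed by position,
// mirroring the at_0, at_1, ... keys produced by $unwindPositional.
function unwindPositional(arr) {
  const out = {};
  arr.forEach((v, i) => { out["at_" + i] = v; });
  return out;
}

unwindPositional(["Joe", 35, { term1: "val1", term2: "val2" }]);
// { at_0: "Joe", at_1: 35, at_2: { term1: "val1", term2: "val2" } }
```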

Aggregation Pipeline Control

Optimizer Flag

SonarW has various optimizers that accelerate your queries. These include the pipeline optimizer that rewrites your aggregation pipeline for efficiency and the cost-based execution optimizer which performs the query in the least-cost order based on cardinality information and various runtime parameters.

By default all optimizers are invoked for every query. You can disable either one of these optimizers, or both, using the $optimizer flag:

  1. $optimizer: 0 - disable both optimizers.
  2. $optimizer: 1 - disable pipeline rewrites but use the cost-based optimizer.
  3. $optimizer: 2 - disable the cost-based optimizer but do pipeline optimization.
  4. $optimizer: 3 - use both optimizers (the default).

In a find, add this as a top-level condition; in an aggregation, add it as the first step in the pipeline, for example:

db.products.find({$optimizer: 0, name: /wheel/})

db.products.aggregate({$optimizer: 0},{$match: {name: /wheel/}})

Strict Checking of Collection and Field Existence

SonarW normally checks your pipelines and produces an error if you reference a field that does not exist. This is useful for catching mistakes and typos that would otherwise go without warning. Sometimes you want to allow such references without producing an error - for example, when a collection does not yet have a field but eventually will. Add $strict: false to disable the error. For example:

> db.t1.find()
{ "_id" : ObjectId("56fbc656b9effdcf1fabf23a"), "a" : 2, "f" : 1 }
{ "_id" : ObjectId("56fbc650b9effdcf1fabf237"), "a" : 1, "b" : 2, "f" : 1 }

> db.t1.aggregate({$project: {a: 1, c:1}})
2016-04-07T20:18:19.687+0000 error: {
     "$err" : "Column 'c' not found in collection t1.",
     "errmsg" : "Column 'c' not found in collection t1.",
     "code" : -15,
     "ok" : 0
} at src/mongo/shell/query.js:131

> db.t1.aggregate({$strict: false},{$project: {a: 1, c:1}})
{ "a" : 2, "_id" : ObjectId("56fbc656b9effdcf1fabf23a") }
{ "a" : 1, "_id" : ObjectId("56fbc650b9effdcf1fabf237") }

You can set the default to not check within the sonard.conf file in which case you can selectively turn on checking using $strict: true.

Ensuring that the Output of an Aggregation Pipeline has Only Distinct Documents

When {$distinctDocs: true} is specified on an aggregation pipeline, it forces every document in the final output collection to be distinct, after excluding the _id field.

Example:

db.courses.aggregate({$match:{$or:[{course_id:"MATH2200"},{course_id:"CPSC2200"}]}},{$project:{prerequisite:1}},{$distinctDocs:true})

This query would show the list of prerequisites for either MATH2200 or CPSC2200, excluding any duplicates.

$distinctDocs may also be followed by a number instead of the boolean true; this means that only the first n entries in the output will be distinct, where n is the number following $distinctDocs.
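
The semantics, including the numeric variant, can be sketched in plain JavaScript (using JSON serialization of the document minus _id as the comparison key, which assumes consistent field order):

```javascript
// Sketch: keep only distinct documents (ignoring _id); when n is given,
// only the first n output entries are guaranteed distinct.
function distinctDocs(docs, n = Infinity) {
  const seen = new Set();
  const out = [];
  for (const doc of docs) {
    if (out.length >= n) { out.push(doc); continue; } // past n: no dedup
    const { _id, ...rest } = doc; // _id is excluded from the comparison
    const key = JSON.stringify(rest);
    if (!seen.has(key)) { seen.add(key); out.push(doc); }
  }
  return out;
}

const flights = [
  { _id: 1, Origin: "JAX", Dest: "PHL" },
  { _id: 2, Origin: "JAX", Dest: "PHL" }, // duplicate apart from _id
  { _id: 3, Origin: "IND", Dest: "JAX" },
];
distinctDocs(flights).length;    // 2
distinctDocs(flights, 1).length; // 3 (only the first entry is deduplicated)
```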

Note that $distinctDocs is not a pipeline stage and may appear anywhere in the pipeline with the same effect. Its effect is as a modifier for the entire pipeline that causes the pipeline’s output to have unique documents (apart from the _id). As another example, given:

{ "_id" : ObjectId("55b00dfba87eba6000000035"), "FlightNum" : 378, "Origin" : "IND", "Dest" : "JAX" }
{ "_id" : ObjectId("55b00dfba87eba6000000038"), "FlightNum" : 1016, "Origin" : "IND", "Dest" : "MDW" }
{ "_id" : ObjectId("55b00dfba87eba6000000042"), "FlightNum" : 1244, "Origin" : "ISP", "Dest" : "BWI" }
{ "_id" : ObjectId("55b00dfba87eba6000000053"), "FlightNum" : 1056, "Origin" : "ISP", "Dest" : "PBI" }
{ "_id" : ObjectId("55b00dfba87eba6000000075"), "FlightNum" : 23, "Origin" : "JAX", "Dest" : "PHL" }
{ "_id" : ObjectId("55b00dfba87eba6000000076"), "FlightNum" : 1232, "Origin" : "JAX", "Dest" : "PHL" }
{ "_id" : ObjectId("55b00dfba87eba600000008a"), "FlightNum" : 2032, "Origin" : "LAS", "Dest" : "B0" }
{ "_id" : ObjectId("55b00dfba87eba60000000d1"), "FlightNum" : 693, "Origin" : "LAS", "Dest" : "MDW" }
{ "_id" : ObjectId("55b00dfba87eba60000000e3"), "FlightNum" : 962, "Origin" : "LAS", "Dest" : "OAK" }
{ "_id" : ObjectId("55b00dfba87eba60000000e8"), "FlightNum" : 3233, "Origin" : "LAS", "Dest" : "OAK" }
{ "_id" : ObjectId("55b00dfba87eba60000000f0"), "FlightNum" : 731, "Origin" : "LAS", "Dest" : "ONT" }
{ "_id" : ObjectId("55b00dfba87eba60000000f1"), "FlightNum" : 872, "Origin" : "LAS", "Dest" : "ONT" }
{ "_id" : ObjectId("55b00dfba87eba60000000f2"), "FlightNum" : 906, "Origin" : "LAS", "Dest" : "ONT" }
{ "_id" : ObjectId("55b00dfba87eba6000000102"), "FlightNum" : 504, "Origin" : "LAS", "Dest" : "PHX" }
{ "_id" : ObjectId("55b00dfba87eba6000000112"), "FlightNum" : 594, "Origin" : "LAS", "Dest" : "PIT" }
{ "_id" : ObjectId("55b00dfba87eba6000000117"), "FlightNum" : 298, "Origin" : "LAS", "Dest" : "RNO" }
{ "_id" : ObjectId("55b00dfba87eba6000000125"), "FlightNum" : 708, "Origin" : "LAS", "Dest" : "SAN" }
{ "_id" : ObjectId("55b00dfba87eba6000000132"), "FlightNum" : 2113, "Origin" : "LAS", "Dest" : "SAT" }
{ "_id" : ObjectId("55b00dfba87eba600000013e"), "FlightNum" : 2886, "Origin" : "LAS", "Dest" : "SFO" }
{ "_id" : ObjectId("55b00dfba87eba6000000141"), "FlightNum" : 12, "Origin" : "LAS", "Dest" : "SJC" }

We could find the unique origin/destination combinations as follows:

> db.flights.aggregate({$distinctDocs:true},{$project:{Origin:1,Dest:1}})
{ "Origin" : "IND", "Dest" : "JAX", "_id" : ObjectId("55b00dfba87eba6000000035") }
{ "Origin" : "IND", "Dest" : "MDW", "_id" : ObjectId("55b00dfba87eba6000000038") }
{ "Origin" : "ISP", "Dest" : "BWI", "_id" : ObjectId("55b00dfba87eba6000000042") }
{ "Origin" : "ISP", "Dest" : "PBI", "_id" : ObjectId("55b00dfba87eba6000000053") }
{ "Origin" : "JAX", "Dest" : "PHL", "_id" : ObjectId("55b00dfba87eba6000000075") }
{ "Origin" : "LAS", "Dest" : "B0", "_id" : ObjectId("55b00dfba87eba600000008a") }
{ "Origin" : "LAS", "Dest" : "MDW", "_id" : ObjectId("55b00dfba87eba60000000d1") }
{ "Origin" : "LAS", "Dest" : "OAK", "_id" : ObjectId("55b00dfba87eba60000000e3") }
{ "Origin" : "LAS", "Dest" : "ONT", "_id" : ObjectId("55b00dfba87eba60000000f0") }
{ "Origin" : "LAS", "Dest" : "PHX", "_id" : ObjectId("55b00dfba87eba6000000102") }
{ "Origin" : "LAS", "Dest" : "PIT", "_id" : ObjectId("55b00dfba87eba6000000112") }
{ "Origin" : "LAS", "Dest" : "RNO", "_id" : ObjectId("55b00dfba87eba6000000117") }
{ "Origin" : "LAS", "Dest" : "SAN", "_id" : ObjectId("55b00dfba87eba6000000125") }
{ "Origin" : "LAS", "Dest" : "SAT", "_id" : ObjectId("55b00dfba87eba6000000132") }
{ "Origin" : "LAS", "Dest" : "SFO", "_id" : ObjectId("55b00dfba87eba600000013e") }
{ "Origin" : "LAS", "Dest" : "SJC", "_id" : ObjectId("55b00dfba87eba6000000141") }

Deferred Errors ($warning)

Since aggregation pipelines usually operate on large data sets, SonarW has the ability to not stop when an error occurs. If for example your collection has 1 Billion documents and you perform the work for 900 Million of them, it would be a shame to stop processing when an error occurs. Some errors do not cause SonarW to stop processing and are treated as warnings. SonarW allows you to specify that you want these errors to be recorded for later review.

Use the $warning aggregation step to denote the collection name in which you want the errors recorded. You can only have one $warning step in an aggregation pipeline. For example, if you have documents with an "a" and a "d" and wish to divide a by d, but the collection has cases where d may be zero, running the following aggregation step will not stop upon the first division by zero:

> db.f2.aggregate({$warning:'myerrors' } ,{ $project: {v:{$div:['$a','$d']}} })

By default 10 errors are recorded - the first 5 and the last 5, e.g.:

> db.myerrors.find()

{ "warnings" : [
   "Error projecting document 0: Division by zero (-66)",
   "Error projecting document 1: Division by zero (-66)",
   "Error projecting document 2: Division by zero (-66)",
   "Error projecting document 3: Division by zero (-66)",
   "Error projecting document 4: Division by zero (-66)",
   "Error projecting document 10002: Division by zero (-66)",
   "Error projecting document 10003: Division by zero (-66)",
   "Error projecting document 10004: Division by zero (-66)",
   "Error projecting document 10005: Division by zero (-66)",
   "Error projecting document 10006: Division by zero (-66)" ] }

You can control the number of errors reported using max_warning_messages.

The myerrors collection above is a view rather than a real collection and exists for 60 seconds after the completion of the pipeline.

Undef as a Result in Operators

When running an aggregation pipeline, SonarW will return an undef when an operator cannot be performed and will continue performing the pipeline. For example, doing a $add to a type that cannot be added or a $substr to a non-string will return an undef. This behavior is preferred over stopping the pipeline with an error. When running a query over billions of documents it is not desirable to stop the query in the event of an error.

Subqueries

SonarW supports subqueries so that you can use data in different collections to answer your questions. Subqueries are not implemented through joins but rather as multiple queries; this is all done within the database engine and is therefore much faster than multiple passes performed by the client.

Using $in, $nin and $all with Subqueries

Subqueries allow you to use the results of a query within another query when using a $in, $nin or $all operator. You can specify the subquery using the following syntax:

<field name> : {
   <$in or $nin or $all> : {
      $ns: <collection name to run the subquery on>,
      $q: { .. the query to perform for the subquery .. },
      $p: <field name to use for the $in or $nin or $all operator>,
      $limit: <limit value>
   }
}

In general, subqueries remove duplicate projected values from the subquery result by default. If you want to keep all of the values in the array - for example, to use in a subsequent $project - use the field "$p!" instead of $p to force the system to keep any duplicates. In a $join operation, however, duplicates are always retained, and "$p!" and "$p" are treated as synonyms in that context. An error is reported whenever one tries to specify both $p and "$p!" in the same subquery.

$ns can be a collection in the same or in another database. $limit is optional; 1 Million is the default. In no case will a subquery return more than 100 Million values in the array (although this limit is a parameter in the sonard conf file). If you hit this limit (explicit or implicit) you will receive an error message - e.g. if you do not specify a limit (and therefore use the soft 1 Million limit) and have more than 1 Million documents, you will get an error of the form:

"$err" : "Subquery result list too long: 1048577, reached soft limit 1048576,
hard limit configured to 104857600."

You can then add a $limit attribute to increase up to 100 Million.

For example, suppose you have the following data:

> db.products.find()
{ "_id" : ObjectId("54118b75cfe9f9efd501e2ba"), "product_id" : 1, "product_name" : "soap" }
{ "_id" : ObjectId("54118b75cfe9f9efd501e2bb"), "product_id" : 2, "product_name" : "shampoo" }
{ "_id" : ObjectId("54118b75cfe9f9efd501e2bc"), "product_id" : 3, "product_name" : "conditioner" }
> db.orders.find()
{ "_id" : ObjectId("54118b75cfe9f9efd501e2c0"), "customer_id" : 1, "order_id" : 1, "product_id" : 1, "qty" : 10 }
{ "_id" : ObjectId("54118b75cfe9f9efd501e2c1"), "customer_id" : 2, "order_id" : 2, "product_id" : 2, "qty" : 20 }
{ "_id" : ObjectId("54118b76cfe9f9efd501e2c2"), "customer_id" : 3, "order_id" : 3, "product_id" : 3, "qty" : 30 }
{ "_id" : ObjectId("54118be1cfe9f9efd501e2c3"), "customer_id" : 1, "order_id" : 4, "product_id" : 1, "qty" : 40 }
> db.customers.find()
{ "_id" : ObjectId("54118b75cfe9f9efd501e2bd"), "customer_id" : 1, "cutomer_name" : "Ola" }
{ "_id" : ObjectId("54118b75cfe9f9efd501e2be"), "customer_id" : 2, "cutomer_name" : "Leha" }
{ "_id" : ObjectId("54118b75cfe9f9efd501e2bf"), "customer_id" : 3, "cutomer_name" : "Davidoff" }

To get all orders of soap do:

> db.orders.find({product_id: {$in: {$ns: "products", $q: {product_name: "soap"}, $p: "product_id"}}})
{ "_id" : ObjectId("54118b75cfe9f9efd501e2c0"), "customer_id" : 1, "order_id" : 1, "product_id" : 1, "qty" : 10 }
{ "_id" : ObjectId("54118be1cfe9f9efd501e2c3"), "customer_id" : 1, "order_id" : 4, "product_id" : 1, "qty" : 40 }

To get all customers that have ordered any soap do:

> db.customers.find({customer_id: {$in: {$ns: "orders", $q: {product_id: {$in: {$ns: "products", $q: {product_name: "soap"}, $p: "product_id"}}}, $p: "customer_id"}}})
{ "_id" : ObjectId("54118b75cfe9f9efd501e2bd"), "customer_id" : 1, "cutomer_name" : "Ola" }

In addition, you can use $a instead of $q when you want to run an arbitrary aggregation pipeline within the subquery - for example:

> db.math_score.aggregate({  "$group" : {
   "_id" : "$name",
   "s" : {  "$sum" : "$grade" },
   "$having" : {
      "s" : {  "$in" : {
         "$ns" : "cs_score",
         "$p" : "s",
         "$a" : [ { "$group" : { "_id" : "$name", "s" : { "$sum" : "$grade" } } } ]
   } } } } })

Using the General form of a Subquery

$query produces a subquery result that can be used anywhere in another aggregation step - including in $match operators such as $contains, as well as inside a projection expression, where it evaluates to a constant array that can then be used within another query. The result is an array of all of the values in a single column that satisfy the query. It is useful when you want a subquery that is evaluated once, rather than once per document in the iterated collection. Always use $query when the internal subquery does not depend on the document being operated on.

The following query first computes a list of all of the courses that have MATH1401 as a prerequisite and then projects per student, excluding the courses they may have already completed or are in the process of taking (the ‘courses’ field; assume for example students that have MATH 1401 are trying to select which follow-on course they should take):

db.students.aggregate(
   {$project:{
      course_options:{
         $setDifference:[
            {$query:{$ns:"courses",$q:{prerequisite:"MATH1401"},$p:"course_id"}},
            "$courses"]
   }}})

The $p field can be used either by specifying the field(s) to be projected from the subquery, or by specifying "*" to get all fields.

Using an Aggregation Pipeline in a Subquery

In all of the above examples, $q was used with an expression that translates to a find command on the $ns collection. You can also use $a along with an array of subdocuments to run a pipeline as your subquery. For example, to add a stage to a pipeline that leaves only documents with the smallest month, add:

{$match:
   {$expr: {$in:
      ["$Month",
      {$query: {
         $a: [{$group: {_id: 1, m: {$min: {$month: "$T1"}}}}],
         $p: "m",
         $ns: "grdmrec"}
      }]}}}
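
What this stage computes can be sketched in plain JavaScript (illustrative; the sample documents below are hypothetical):

```javascript
// Plain-JavaScript sketch of the min-month $match stage above.
// The sample documents are hypothetical.
const grdmrec = [
  { Month: 1, T1: new Date("2015-01-10") },
  { Month: 3, T1: new Date("2015-03-05") },
  { Month: 1, T1: new Date("2015-01-22") }
];

// The $query/$a subquery is evaluated once:
// {$group: {_id: 1, m: {$min: {$month: "$T1"}}}} projected on "m".
const minMonth = Math.min(...grdmrec.map(d => d.T1.getUTCMonth() + 1));

// The outer $match keeps documents whose $Month is $in [minMonth].
const kept = grdmrec.filter(d => d.Month === minMonth);
console.log(kept.length);  // 2
```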

Union as an Aggregation Step

SonarW supports a union operator that is similar to UNION in SQL.

Syntax: {$union: {collection-name:[pipeline]}}

$union combines the output of the aggregation steps that preceded it with the output of the aggregation steps supplied as its pipeline. You may use multiple $union steps if you want to combine the results with multiple aggregations. For example:

> db.ticket_sales.aggregate([
   { $group : { _id : "$theatre", sales: { $sum : "$revenue" } } },
   { $union : { concession : [ { $group : { _id : "$cinema", sales: { $sum : "$sales" } } } ] } } ])

This example combines the output of two different $groups on different collections into a single aggregation, making sure that there are no duplicates in case both collections contained the same theatre (even though the field holding the theatre name differs - theatre in ticket_sales and cinema in concession):

> db.ticket_sales.aggregate([
   { $group : { _id : "$theatre", sales : { $sum : "$revenue" } } },
   { $union : { concession : [ { $group : { _id : "$cinema", sales : { "$sum" : "$sales" } } } ] } },
   { $group : { _id : "$_id", sales : { $sum : "$sales" } } } ])
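
The union-then-regroup semantics above can be sketched in plain JavaScript (an illustration only; the sales figures are hypothetical):

```javascript
// Plain-JavaScript sketch of the union-then-regroup semantics above.
// The sales figures are hypothetical.
const ticketGroups = [        // output of the first $group (by $theatre)
  { _id: "Rex", sales: 500 },
  { _id: "Lux", sales: 300 }
];
const concessionGroups = [    // output of the $union pipeline (by $cinema)
  { _id: "Rex",   sales: 120 },
  { _id: "Odeon", sales: 80 }
];

// $union simply concatenates the two result streams...
const combined = ticketGroups.concat(concessionGroups);

// ...so the final $group on _id is what merges duplicate theatres.
const merged = new Map();
for (const g of combined) merged.set(g._id, (merged.get(g._id) || 0) + g.sales);
const result = [...merged].map(([_id, sales]) => ({ _id, sales }));
console.log(result);  // Rex appears once, with sales 620
```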

Schema and Metadata

Even though SonarW is a JSON database supporting flexible data, it is also a columnar warehouse and as such maintains information about a collection’s schema. This information is an implicit, or de-facto, schema. SonarW does not force you to define a schema or enforce one, so it does not interfere with schema-on-read advantages and flexible data, but it is able to report and alert on defined deviations (such as a new field that should not be there).

Use the stats() function on a collection to see what columns exist in documents in a collection. For example, if t3 has only this document:

> db.t3.find()
{ "_id" : ObjectId("5632516b2af8a03a2c661ed3"), "Name" : "John", "Phone" : "604-123-4567", "Age" : 25 }

Then the stats command will show you the columns in this collection:

> db.t3.stats()
{
        "ns" : "test.t3",
        "blocks" : NumberLong(5),
        "count" : NumberLong(1),
        "deleted" : NumberLong(0),
        "size" : 507,
        "avgObjSize" : 507,
        "storageSize" : 507,
        "numExtents" : 1,
        "nindexes" : 1,
        "lastExtentSize" : 507,
        "paddingFactor" : 0,
        "stream" : false,
        "totalIndexSize" : 0,
        "number_of_columns" : NumberLong(5),
        "columns" : [
                "_id",
                "lmrm_bson_",
                "lmrm_n__Age",
                "lmrm_s__Name",
                "lmrm_s__Phone"
        ],
        "view" : false,
        "ok" : 1
}

You can also use the db command syntax, which is equivalent to db.<collection-name>.stats():

db.runCommand({collStats:<collection-name>})

Two things stand out:
  • A prefix of lmrm_ followed by a type tag (n__ for number, s__ for string, etc.) was added to each column name. This is internal information, used by SonarW to identify the data type of the stored data. It is transparent to the user during regular use.
  • There is an added column called lmrm_bson_, which is used for fast retrieval when the whole document is needed. It too is transparent to the user during regular use.

To see just the columns do:

> db.t3.stats()["columns"]
[ "_id", "lmrm_bson_", "lmrm_n__Age", "lmrm_s__Name", "lmrm_s__Phone" ]
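
A small helper can decode these internal names back into field name and type. This is an illustrative sketch, not a SonarW utility; it maps only the tags shown above (n__, s__) and treats anything unprefixed as internal:

```javascript
// Illustrative helper (not part of SonarW) that decodes the internal
// column names reported by stats() back into field name and type.
// Only the tags shown above (n__, s__) are mapped; other tags may exist.
function decodeColumn(col) {
  const m = /^lmrm_([a-z])__(.+)$/.exec(col);   // lmrm_<tag>__<field>
  if (!m) return { field: col, type: "internal" };  // _id, lmrm_bson_ carry no tag
  const typeNames = { n: "number", s: "string" };
  return { field: m[2], type: typeNames[m[1]] || m[1] };
}

const cols = ["_id", "lmrm_bson_", "lmrm_n__Age", "lmrm_s__Name", "lmrm_s__Phone"];
console.log(cols.map(decodeColumn));
```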

Now add one more document to the collection:

> db.t3.insert({Name:'Jane', Phone:6049999999, new_field:'foo'})
WriteResult({ "nInserted" : 1 })
> db.t3.find()
{ "_id" : ObjectId("5632516b2af8a03a2c661ed3"), "Name" : "John", "Phone" : "604-123-4567", "Age" : 25 }
{ "_id" : ObjectId("563256012af8a03a2c661ed4"), "Name" : "Jane", "Phone" : 6049999999, "new_field" : "foo" }

The column list is automatically updated:

> db.t3.stats()["columns"]
[
        "_id",
        "lmrm_bson_",
        "lmrm_n__Age",
        "lmrm_n__Phone",
        "lmrm_s__Name",
        "lmrm_s__Phone",
        "lmrm_s__new_field"
]

Two additional columns now exist: lmrm_s__new_field, holding strings, and lmrm_n__Phone, holding phone numbers stored as a number type.

Additionally, SonarW also has a metadata statistics command. This command computes statistics on the occurrence of fields and types in the database allowing you to know when fields are sparse, when types are inconsistent and more. It is also used as the basis for schema analysis and viewing.

To compute statistics run:

> db.runCommand({lmrm__metadata:1})
{ "ok" : 1 }

This command generates and populates a collection named lmrm__metadata2 in the database. Each of the generated documents describes a field/column.

Metadata records show types, frequency, how many documents exist, and so on. You can query for the metadata of a specific collection/field:

> db.lmrm__metadata2.find({name:'t3', "field_name" : "Phone"}).pretty()
{
        "_id" : ObjectId("56325736c5b5480f0000069c"),
        "name" : "t3",
        "field_name" : "Phone",
        "docs_sampled" : NumberLong(2),
        "types" : {
                "<type 'float'>" : 0.5,
                "<type 'unicode'>" : 0.5
        },
        "last_updated" : ISODate("2015-10-29T17:28:22Z")
}

There are 2 documents in the ‘t3’ collection that include the ‘Phone’ field: 50% of them hold a numeric value, and 50% hold a string value.
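
The way these frequencies are derived can be sketched in plain JavaScript (an illustration of the computation, not SonarW internals; the documents mirror the t3 collection above):

```javascript
// Plain-JavaScript sketch of how the per-field type frequencies can be
// derived from the documents themselves (illustrative, not SonarW internals).
const docs = [
  { Name: "John", Phone: "604-123-4567", Age: 25 },
  { Name: "Jane", Phone: 6049999999, new_field: "foo" }
];

function fieldTypeStats(docs, field) {
  const present = docs.filter(d => field in d);
  const counts = {};
  for (const d of present) {
    const t = typeof d[field];               // "number", "string", ...
    counts[t] = (counts[t] || 0) + 1;
  }
  const types = {};
  for (const t in counts) types[t] = counts[t] / present.length;
  return { field_name: field, docs_sampled: present.length, types };
}

console.log(fieldTypeStats(docs, "Phone"));
// 2 docs sampled; 50% string, 50% number - matching the output above
```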

User Defined Functions (UDFs)

User defined functions allow you to use the results of JavaScript functions in projections and in matches. Any JavaScript code can be used - both your own and existing code in libraries.

UDFs are supported as both projections and for match conditions. UDFs in projections are supported in aggregations and UDFs for match conditions are supported in both aggregations and find.

User defined functions run as JavaScript within the V8 engine embedded within SonarW. As such, they are far slower than the built-in SonarW operators and should be used judiciously.

UDFs in Projections

For any JavaScript function, you can add the result of a call to that function as a field projection using the $call operator. For example, to use the built-in Math.random function to create a random field you can do:

> db.steam.aggregate({$project: {owner: 1, rnd: {$call: "Math.random"}}})
{ "_id" : ObjectId("53b21af672a53f1ac56db3cc"), "owner" : "76561198000664965", "rnd" : 0.7133265407755971 }
{ "_id" : ObjectId("53b21af672a53f1ac56db3cd"), "owner" : "76561198000664965", "rnd" : 0.508274263003841 }
{ "_id" : ObjectId("53b21af672a53f1ac56db3ce"), "owner" : "76561198000664965", "rnd" : 0.8214249196462333 }

You can also call a function that takes parameters. Arguments to the parameters can be literals or column names. For example, given a collection of shapes such as:

> db.shape.find()
{ "_id" : ObjectId("544ab13c60aa305d3780516d"), "name" : "shape1", "width" : 10, "height" : 12 }
{ "_id" : ObjectId("544ab14c60aa305d3780516e"), "name" : "shape2", "width" : 10, "height" : 120 }
{ "_id" : ObjectId("544ab15560aa305d3780516f"), "name" : "shape3", "width" : 100, "height" : 12 }

You can get the larger of the dimensions using:

> db.shape.aggregate({$project: {name:1, max_dim: {$call: {$fn: function (x, y) {if (x < y) return y; else return x;} , $params: ["$width", "$height"]}}}})
{ "_id" : ObjectId("544ab13c60aa305d3780516d"), "max_dim" : 12, "name" : "shape1" }
{ "_id" : ObjectId("544ab14c60aa305d3780516e"), "max_dim" : 120, "name" : "shape2" }
{ "_id" : ObjectId("544ab15560aa305d3780516f"), "max_dim" : 100, "name" : "shape3" }

The function being called can be a function already stored in system.js, for example:

> db.system.js.save({_id: "mymax", value: function (x, y) {if (x < y) return y; else return x;}})
> db.system.js.find()
{ "_id" : "mymax", "value" : "function (x, y) {if (x < y) return y; else return x;}" }
> db.shape.aggregate({$project: {name:1, max_dim: {$call: {$fn: "mymax" , $params: ["$width", "$height"]}}}})
{ "_id" : ObjectId("544ab13c60aa305d3780516d"), "max_dim" : 12, "name" : "shape1" }
{ "_id" : ObjectId("544ab14c60aa305d3780516e"), "max_dim" : 120, "name" : "shape2" }
{ "_id" : ObjectId("544ab15560aa305d3780516f"), "max_dim" : 100, "name" : "shape3" }

In all cases, the returned value can be any JSON - including arrays and objects, e.g:

> db.shape.aggregate({$project: {name:1, max_dim: {$call: {$fn: function (x, y) {return {"a": {"b": x, "c":y}}} , $params: ["$width", "$height"]}}}}).pretty()
{
   "_id" : ObjectId("544ab13c60aa305d3780516d"),
   "max_dim" : {
      "a" : {
         "b" : 10,
         "c" : 12
      }
   },
   "name" : "shape1"
}
{
   "_id" : ObjectId("544ab14c60aa305d3780516e"),
   "max_dim" : {
      "a" : {
         "b" : 10,
         "c" : 120
      }
   },
   "name" : "shape2"
}
{
   "_id" : ObjectId("544ab15560aa305d3780516f"),
   "max_dim" : {
      "a" : {
         "b" : 100,
         "c" : 12
      }
   },
   "name" : "shape3"
}
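
The per-document evaluation that $call performs in a projection can be sketched in plain JavaScript (illustrative; mymax mirrors the stored function and the shapes mirror the sample collection above):

```javascript
// Plain-JavaScript sketch of the per-document evaluation $call performs
// in a projection; mymax mirrors the stored function above.
const shapes = [
  { name: "shape1", width: 10,  height: 12 },
  { name: "shape2", width: 10,  height: 120 },
  { name: "shape3", width: 100, height: 12 }
];

const mymax = (x, y) => (x < y ? y : x);

// $params: ["$width", "$height"] supplies the field values per document.
const projected = shapes.map(s => ({ name: s.name, max_dim: mymax(s.width, s.height) }));
console.log(projected);  // max_dim: 12, 120, 100
```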

UDFs in Matches

Whereas a UDF that is used in a projection can return any value, a UDF that is used in a match must return a boolean value to determine whether to include the document or not. For example, if we want a random sampling of 10% of the documents from a collection we can use:

> db.carriers.find({_id: {$call: {$fn: function(x) { return Math.random() < 0.1; }}}})
{ "Code" : "0GQ", "Description" : "Inter Island Airways, d/b/a Inter Island Air", "_id" : ObjectId("534208031c479f6149b0ed61") }
{ "Code" : "0Q", "Description" : "Flying Service N.V.", "_id" : ObjectId("534208031c479f6149b0ed68") }
{ "Code" : "13Q", "Description" : "Chartright Air Inc.", "_id" : ObjectId("534208031c479f6149b0ed72") }
{ "Code" : "1MQ", "Description" : "Twin Cities Air Service LLC", "_id" : ObjectId("534208031c479f6149b0ed82") }
{ "Code" : "1TQ", "Description" : "Rhoades Aviation dba Transair", "_id" : ObjectId("534208031c479f6149b0ed88") }
...

or:

> db.carriers.aggregate({$match: {_id: {$call: {$fn: function(x) { return Math.random() < 0.1; }}}}})
{ "Code" : "0CQ", "Description" : "ACM AIR CHARTER GmbH", "_id" : ObjectId("534208031c479f6149b0ed60") }
{ "Code" : "0J", "Description" : "JetClub AG", "_id" : ObjectId("534208031c479f6149b0ed63") }
{ "Code" : "0RQ", "Description" : "TAG Aviation Espana S.L.", "_id" : ObjectId("534208031c479f6149b0ed6a") }
{ "Code" : "0TQ", "Description" : "Corporatejets, XXI", "_id" : ObjectId("534208031c479f6149b0ed6b") }
{ "Code" : "13Q", "Description" : "Chartright Air Inc.", "_id" : ObjectId("534208031c479f6149b0ed72") }
...

This is a trivial example because it does not use the field value. In the more general form the function (whether inline or in system.js) is called and the value of the field is passed into the function as an argument. The function should return a true or a false value which is used to determine if the document should be included or not. For example, assume the following registrations:

> db.registrations.find()
{ "_id" : ObjectId("544ac25360aa305d37805170"), "name" : "Joe", "at" : ISODate("2014-10-24T21:19:15.186Z") }
{ "_id" : ObjectId("544ac25860aa305d37805171"), "name" : "Jane", "at" : ISODate("2014-10-24T21:19:20.507Z") }
{ "_id" : ObjectId("544ac25c60aa305d37805172"), "name" : "Jeff", "at" : ISODate("2014-10-24T21:19:24.683Z") }
{ "_id" : ObjectId("544ac26260aa305d37805173"), "name" : "Jack", "at" : ISODate("2014-10-24T21:19:30.244Z") }
{ "_id" : ObjectId("544ac26660aa305d37805174"), "name" : "Jill", "at" : ISODate("2014-10-24T21:19:34.692Z") }

To find all people who registered less than 15 minutes ago (900 seconds, or 900000 milliseconds), relative to whenever the call is made, use:

> db.registrations.aggregate({$match: {at: {$call: {$fn: function(x) { return ((new Date()).getTime() - x.getTime()) < 900000; }}}}})
{ "_id" : ObjectId("544ac25c60aa305d37805172"), "name" : "Jeff", "at" : ISODate("2014-10-24T21:19:24.683Z") }
{ "_id" : ObjectId("544ac26260aa305d37805173"), "name" : "Jack", "at" : ISODate("2014-10-24T21:19:30.244Z") }
{ "_id" : ObjectId("544ac26660aa305d37805174"), "name" : "Jill", "at" : ISODate("2014-10-24T21:19:34.692Z") }
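
The predicate in this example can be written as a standalone JavaScript function, which makes it easy to test outside the database (the 900000 ms cutoff matches the example; the timestamps come from the collection above, and the "current" time is a hypothetical value chosen to land between Jane's and Jeff's registrations plus 15 minutes):

```javascript
// The 15-minute predicate from the example, as a standalone function.
function within15Minutes(at, now = new Date()) {
  return now.getTime() - at.getTime() < 900000;  // 900 s = 15 min
}

// Hypothetical "current" time just under 15 minutes after Jeff registered:
const now = new Date("2014-10-24T21:34:22Z");
console.log(within15Minutes(new Date("2014-10-24T21:19:15.186Z"), now));  // false (Joe)
console.log(within15Minutes(new Date("2014-10-24T21:19:24.683Z"), now));  // true  (Jeff)
```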

A UDF is not limited to working on a single field - multiple fields may be passed into a UDF. The first parameter is always the field on which the $match is performed. Any additional arguments are passed to the function using $params. For example, assume a collection of shapes:

> db.shape.find()
{ "_id" : ObjectId("544ab13c60aa305d3780516d"), "height" : 12, "name" : "shape1", "width" : 10 }
{ "_id" : ObjectId("544ab14c60aa305d3780516e"), "height" : 120, "name" : "shape2", "width" : 10 }
{ "_id" : ObjectId("544ab15560aa305d3780516f"), "height" : 12, "name" : "shape3", "width" : 100 }

The following UDF can be used to return only those shapes where the height is larger than the width:

> db.shape.aggregate(
   {$match: {
      width: {$call: {
         $fn: function(x,y) {return x < y; } ,
         $params: ["$height"]}}
    }})
{ "_id" : ObjectId("544ab13c60aa305d3780516d"), "height" : 12, "name" : "shape1", "width" : 10 }
{ "_id" : ObjectId("544ab14c60aa305d3780516e"), "height" : 120, "name" : "shape2", "width" : 10 }
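
The argument-ordering convention - matched field first, then $params - can be sketched in plain JavaScript (illustrative; the shapes mirror the sample collection above):

```javascript
// Plain-JavaScript sketch of the match-UDF argument order: the matched
// field ($width) is always the first argument; $params supplies the rest.
const shapes = [
  { name: "shape1", width: 10,  height: 12 },
  { name: "shape2", width: 10,  height: 120 },
  { name: "shape3", width: 100, height: 12 }
];

const fn = (x, y) => x < y;  // x = width (matched field), y = height ($params)

const tallerThanWide = shapes.filter(s => fn(s.width, s.height));
console.log(tallerThanWide.map(s => s.name));  // ["shape1", "shape2"]
```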

Automatic Columns

SonarW provides an option to automatically add fields (columns) to each document. This feature is set on a collection, and works on all the contained documents.

The two fields are:

  • Creation-Time
  • Last-Update-Time

These columns automatically assume the “current time” when the document is created or updated.

Creation-Time

The next example shows how to define a creation time column called ‘Creation_Time’ in the collection ‘test’:

> db.runCommand({auto_create_column:{test:'Creation_Time'}})

From the time this command is executed, any new document inserted into the collection will have the field ‘Creation_Time’ automatically added; it will contain an ISODate that reflects the time the document was created. If the document already contains a field called ‘Creation_Time’, then the already-provided value takes precedence (i.e. it will not be overwritten automatically). Regardless of whether the field was automatically added or not, the field that is set as the auto_create_column is retained whenever any updates to the document are made.

Last-Update-Time

The next example shows how to define a last-update-time column called Update_Time in the collection test:

> db.runCommand({auto_update_column:{test:'Update_Time'}})

Whenever a document is inserted into the collection for the first time or modified, this field will reflect the time the document was last updated. As with the creation-time column, if an explicit value is provided for this field then the provided value takes precedence. Note that if the document is later updated, the column will be changed to the current time unless an explicit value is provided for it.

It is, of course, an error to set both of these columns to the same field in a collection, and therefore it is not allowed in SonarW.
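
The precedence rules described above can be sketched in plain JavaScript (an illustration of the insert-time behavior only, not SonarW internals):

```javascript
// Plain-JavaScript sketch of the precedence rules on insert described
// above (an illustration, not SonarW internals).
function applyAutoColumns(doc, { createCol, updateCol }, now = new Date()) {
  const out = { ...doc };
  // Creation time: set only if the document does not already provide it.
  if (createCol && !(createCol in out)) out[createCol] = now;
  // Update time: likewise, an explicit value takes precedence on insert.
  if (updateCol && !(updateCol in out)) out[updateCol] = now;
  return out;
}

const cfg = { createCol: "Creation_Time", updateCol: "Update_Time" };
const t = new Date("2015-01-01T00:00:00Z");
console.log(applyAutoColumns({ Name: "Jane" }, cfg, t));
// both Creation_Time and Update_Time are set to t
console.log(applyAutoColumns({ Name: "Joe", Creation_Time: "explicit" }, cfg, t));
// the explicit Creation_Time value is retained
```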

It is possible to query the collection to see the names of its update and creation columns, if they exist.

For example to get the names of the columns (in the ‘test’ collection) use:

> db.runCommand({auto_create_column:'test'})
{ "result" : "ct", "ok" : true }

> db.runCommand({auto_update_column:'test'})
{ "result" : "update_time", "ok" : true }

If no such column is defined for the collection, the ‘result’ field in the above responses will be false instead of a string.

To turn the feature off (in the ‘test’ collection) use:

> db.runCommand({auto_create_column:{'test':false}})
and
> db.runCommand({auto_update_column:{'test':false}})

Turning the functionality off will not impact documents that have already been inserted or updated. Documents updated after the feature is turned off will not necessarily keep the automatically inserted columns (depending on the kind of update). Note that even when the fields are retained, they will no longer be updated automatically.