Joins in SonarW

With SonarW you can join between two or more collections. A join always has a primary collection that SonarW will iterate over and joined collections from which it will bring additional data. In the descriptions below we refer to the main collection as the fact collection or fact table and the collections joined to the main collection as dimension (or dim) collections or dim tables.

Joins are supported through a special step in the aggregation framework. You can join any number of collections including multi-step and cascading joins (where there is not necessarily a single join table but rather many 1-N or N-N relationships).

A join in aggregation is a new step taking the form:

$join : {
    $joined: {
        dim1: "D1",
        dim2: "D2",
        dim3: "D3"
    },
    $outer: [
        "dim1",
        "dim3"
    ],
    $match: [
        {
            D1_id: "$dim1.$_id"
        },
        {
            D2_id: "$dim2.$_id"
        },
        {
            D3_id: "$dim3$_id"
        }
    ] ,
    $selector: {
        dim1: {
            <selection criteria that operates on the joined table>
        },
        dim2: {
            ...
        },
        dim3: {
            ...
        },
   },
   $project: {
       <projection elements from either fact collection or dim collection>
   },
   $group: {
       <similar to projections but used to group together multiple joined docs>
   },
   $having: {
       <match operators using projected field names>
   }
}
$joined
specifies the names of the collection to join with. dim1, dim2 and dim3 shown above are the alias names and all references to the dim collections need to use the alias. Above, dim1 is an alias to the D1 collection, dim2 is an alias to D2 and so on. Aliases must be unique throughout the join expression.

You can join with collections in another database using the syntax:

dim3: {
   $db: "db_name",
    $ns: "collection_name"
}

Finally, you can join with a result of a query - whether it is a simple find query or the result of an inner aggregation pipeline (similarly to joining with the results of an inner subquery in SQL). Use $q for a simple query and $a for an aggregation pipeline:

"$joined" : { "dim" : { "$ns" : "names", "$q" : { "age" : { "$lt" : 40 } } } }

    or

"$joined" : { "dim" : { "$ns" : "names", "$a" : [   agg array      ] } }
$match

defines the join conditions and can include multiple conditions.

$match expressions always have the fact collection on the left hand side of the : and a dim collection reference on the right hand side using the alias name.

There are two modes for handling the DIM table. Normally the appropriate fields of the dim table are all loaded to memory so that large joins can run very fast. These dim caches are maintained in memory in an LRU cache for handling repeated queries. If you have a 1-1 mapping and you do not want the entire dim collection be loaded but rather only those selected by the fact, use the $oneToOne modifier. For example, use:

D1_id: {$oneToOne: "$dim1.$_id"}

instead of:

D1_id: "$dim1.$_id"

in the $match

$project

You must specify which fields to include from both the dim collection and the fact collection - i.e. without specifying a projection you will only get an _id. You can include any field from the fact collection using $<field name> or any field from the dim collections using $<alias>.$<field name>.

To project all fields from the fact collection, use *: 1 as usual. To project all fields from the dim collection(s), use $<alias>.*: 1 (for example, $dim1.* : 1 would project all fields from D1 above).

When you project from the fact collection, always use the $project clause. When you project from the dim collections you have two options:

  1. If the relationship is a 1-1 relationship, use the $project clause.

  2. If the relationship is 1-N (i.e. there are many documents in the dim collection that you associate with a single document in the fact collection) then use the $group clause. For example, to create an array that has all the values for field foo of the dim collection bar (alias bar) do:

    $join :  {
        ...
        $group: {
            foo_arr: {
                $push: "$bar.$foo"
            }
        }
    }
    
$outer
The default join is an inner join and the $outer is optionally used to request an outer join. In an inner join, a document will only be created in the result if there was a matching document in the dim collection. If you want one to be created when a match does not exist add the dim collection to the $outer array.
$orphan
You can also use $orphan instead of $outer when what you are looking for are only the documents in the main collection that do not have a match in the joined collection. Rather than using a $outer with a $having clause, use $orphan which is much more efficient.
$group

The $group clause ensures that the output contains only one document per each document in the fact collection. If you choose to leave out the group clause, a 1-N relationship will produce N documents per fact document.

Use the $group clause to define how to aggregate the N records into a single document in the output. $group operators can be one of:

<group-op>: $sum|$avg|$first|$last|$min|$max|$push|$addToSet

Note that you must also include the variable that you want to use in the $group clause inside the $project clause.

When a $group clause is present, all variables from the $project must be aggregated, otherwise the result is undefined - the field might be stripped completely, or just show an single arbitrary value.

$selector

The $selector clause is a subdocument that has the same structure as that of a $match stage or a find operation. It operates on the dim collection before the join is made and is used to speed up joins. The selection criteria eliminates the creation of joined documents in the first place (as opposed to creating them and discarding them later).

You can have different $selector clauses for the different dim collections you are joining but each selector only operates on a single dim collection.

$having

Use the $having clause for additional comparison restrictions if they pertain to more than one source (different dim collections or conditions in the dim collections and the fact collections that are too complex to add to the $match). Note that the $having clause restricts the result coming out of the $join and therefore speeds up the materialization of the results. However, as opposed to $selector it does not speed up the join itself.

The $having clause can contain complex expressions. These expressions can only refer to projection fields and you therefore need to use the projection field names, for example:

$having: {
    $eq: [
        "$projection_field1",
        $add: [
            "$projection_field2",
            "$projection_field3"
        ]
    ]
}

Example

Let’s look at an example from the flights database. In it are two collections - one collection describing flights and one describing airport. Each flight document looks like:

{
  "TAIL_NUM": "N325US",
  "FL_NUM": 1497,
  "ORIGIN_AIRPORT_ID": 14100,
  "ORIGIN_AIRPORT_SEQ_ID": 1410002,
  "ORIGIN_CITY_MARKET_ID": 34100,
  "ORIGIN": "PHL",
  ...
  "DEST": "MSP",
}

and each airport document looks like:

{
  "_id": {
    "$oid": "5383623b7bfb8767e2e9ca1f"
  },
  "iata": "00M",
  "airport": "Thigpen ",
  "city": "Bay Springs",
  "state": "MS",
  "country": "USA",
  "lat": 31.95376472,
  "long": -89.23450472
}

For example, to join the flights document with airport data for both origin and destination and project the tail number from flights and the state for the origin and destination do:

db.flights.aggregate([
    {
        $join: {
            $joined: {
                origin: "airports",
                dest: "airports"
            },
            $match: [
                {
                    ORIGIN: "$origin.$iata"
                },
                {
                    DEST: "$dest.$iata"
                }
            ],
            $project: {
                ORIGIN: "$ORIGIN",
                DEST: "$DEST",
                TAIL_NUM: "$TAIL_NUM",
                origin_state: "$origin.$state",
                dest_state: "$dest.$state"
            }
        }
    }
])

The result documents look like:

{
  "ORIGIN": "FLL",
  "DEST": "LGA",
  "TAIL_NUM": "N325NB",
  "_id": {
    "$oid": "534205f61c479f6149a9270a"
  },
  "origin_state": "FL",
  "dest_state": "NY"
}

To understand how the $group works let’s look at an example and do the reverse join - i.e. per airport let’s join the flights that fly from that airport.

If we want to get all flights that originate from it as an array:

db.airports.aggregate(
   [
     {
    "$join": {
      "$joined": {
        "flights": "flights"
        },
        "$match": [
        {
          "iata": "$flights.$ORIGIN"
          }
        ],
        "$project": {
          FL_NUM: "$flights.$FL_NUM",
          airport: "$airport",
          city: "$city",
          state: "$state"
        },
        "$group": {
        flights: {
          $addToSet: "$FL_NUM"
          }
        }
      }
    }
   ] )

To get documents such as:

{
   "flights": [
     4879,
     4906,
     ...,
     4334,
     4359,
     4365,
     4372
   ],
   "airport": "Lehigh Valley International",
   "city": "Allentown",
   "state": "PA",
   "_id": {
     "$oid": "5383623c7bfb8767e2e9cd16"
   }
 }

Join collections from different Databases

When specifying a joined collection ( the DIM collection), it is possible to join a collection from ANOTHER db.

For example: the alias DIM3 refers to a collection from a different DB.

$joined: { dim1: “D1” , dim2: “D2” , dim3: { $db:”db_name”, $ns: “collection_name” } }

Match Types

There are three ways to match documents from the fact table to documents in the dim table.

Match by Hash

All selected values of the field in the dim are hashed. The fact field values are hashed and matched against them. Example:

$match: [
    {
        $fact-field: "$alias.$dim-field "
    }
]

Match by Expression Hash

Expression values for the selected dim documents/fields are hashed. For all selected fact documents, the expression is evaluated and a hash of the result is matched. Example:

$match: [
    {
        $eq: [
            $add: [ "$D1_id", 4 ] ,
            $length: [ "$dim1.$_id" ]
        ]
    }
]

Note: In the expr-on-fact item, only fact fields can be used. In the expr-on-dim, only dim fields can be used, and only from one dim collection.

Match by Range

In this type of match, each document in the fact is joined to as many documents in the dim for which the fact value or field falls in the range described. Both the value and the range edges can be expressions.

The range is specified using the following syntax:

$match: [
    {
        $value: <fact-expr>,
        $gt[e]: <dim-lower-bound>,
        $lt[e]: <dim-upper-bound>
    },
    ...
]

Three keys are required to specify the range: the $value, a lower bound, using $gt or $gte, and an upper bound using $lt or $lte.

It is possible to specify multiple range objects, in which case the intersection thereof would be considered for the join.

<fact-expr> can only involve fields from the fact, while <dim-lower-bound> and <dim-upper-bound> can only include expressions from the dim.

Match by Geo Within

In this type of match, each document in the fact is joined to as many documents in the dim for which the fact value is a point within the dim polygon value.

The join field in the fact table has to have a “2d” geo index. The fact field that is being joined needs to include geoJson documents that represent polygons or multipolygons.

Both the fact field and the dim field cannot be expressions and need to represent actual fields in the respective collections.

A join including this kind of match cannot be used after collection changing pipeline stages like $group or $project. It can be used after $match or $limit.

The geoWithin match is specified using the following syntax:

$match: [
    {
        "fact-field": {"$geoWithin": "$alias.$dim-field"}
    },
    ...
]

Example

As an example, the following joins with a dept collection using a hash-based join, a salary collection using a rang-based match and a gen collection using an expression hash:

{
   "$join" : {
      "$joined" : {
         "salA" : "salary",
         "deptA" : "dept",
         "genA" : "gen"
      },
      "$match" : [
         {
            "$value" : "$sal",    // sal is in the FACT collection. Join with all docs in salary where low+50 < sal and high >= sal
            "$gt" : {
               "$add" : [
                  "$salA.$low",
                  50
               ]
            },
            "$lte" : "$salA.$high"
         },
         {
            "dept" : "$deptA.$deptid"  // hash-based sort with the dept collection
         },
         {
            "$eq" : [        // join on gen documents where the genderid+1 == gender in FACT
               "$gender",
               {
                  "$add" : [
                     "$genA.$genderid",
                     1
                  ]
               }
            ]
         }
      ],
      "$project" : {
         "min_salary" : "$salA.$low",
         "max_salary" : "$salA.$high",
         "salary" : "$salary",
         "rank" : "$salA.$rank",
         "bld" : "$deptA.$bld",
         "gender" : "$genA.$txt"
      }
   }
}

Cascading Joins

SonarW supports snowflake-type queries using cascading joins. If for example you join DIM1, DIM2 and DIM3 to the FACT collection but then also need to join D1A, D1B and D1C to DIM1 and D2A, D2B and D2C to DIM2 then add the following to the join document:

dim1: {
   $joined:  { dim1a: "D1A" , dim1b: "D1B" , dim1c: "D1C"   } ,
   $match: [ { D1A_id: "$dim1a.$_id"},{D1B_id: "$dim1b.$_id" },{ D1C_id: "$dim1c$_id"   } ] ,
} ,

dim2: {
  $joined:  { dim2a: "D2A" , dim2b: "D2B" , dim2c: "D2C"   } ,
  $outer: ["dim2c"] ,
  $match: [ { D2A_id: "$dim2a.$_id" },{ D2B_id: "$dim2b.$_id" },{ D2C_id: "$dim2c$_id"   } ] ,
} ,

Cascading clause can have a $joined, $outer, $selector and $match. On the other hand, $projection, $group and $having can only be specified in the top level but can refer to any alias including those in the cascading section. There is no limit to the cascading depth.

For example, to join an employees collection with department collection and gender collection and then cascade and join a bld (building) collection to the department table use:

db.emp.aggregate( {
 $join: {
   $joined: { deptA: "dept", genA: "gen" } ,
   $match: [
             {dept:"$deptA.$deptid" },
             {gender:"$genA.$genderid"}
           ],
   deptA : {
        $joined: { bldA: "bld" } ,
        $match: [ {"bld" : "$bldA.$bldid" }  ],
   } ,
   $project: {  "$deptA.$*":1 , "*":1 , "$bldA.*":1 , "$genA.*":1 }
 }
})

To include a selector in the cascading join (one selector on gen and one on bld that is joined on the DIM collection dept:

db.emp.aggregate( {
 $join: {
   $joined: { deptA: "dept", genA: "gen" } ,
   $match: [
             {dept:"$deptA.$deptid" },
             {gender:"$genA.$genderid"}
           ],

   $selector: { genA: { "txt" : "Female" } } ,

   deptA : {
        $joined: { bldA: "bld" } ,
        $match: [ {"bld" : "$bldA.$bldid" }  ],
        $selector: { bldA: { "bldname" : /2/ } } ,
   } ,

   $project: {  "$deptA.$*":1 , "*":1 , "$bldA.*":1 , "$genA.*":1 }
 }
})

Example for $oneToOne

When joining between ean exception collection and it’s session collection, instead of hashing all of the $dim table (session), you can load only those lines from session collection documents that survived the previous $match. You ust ensure that session_id is unique in the session table, and that there are not a lot of survivors from the first $match for thie option to be more efficient than the standard option.

The $oneToOne syntax in the match prt of the $join in this case looke like:

db.exception.aggregate ( [
{
   {"$match":
            {"Exception Timestamp":{"$gte":ISODate('2016-06-29T18:17:38.727Z'),"$lte":ISODate('2016-07-30T00:00:00.000Z')}},
     }
} ,
{  "$join":{
   "$joined":{"session":"session"},
   "$match":[{"Session Id":{"$oneToOne":"$session.$_id"}}],
    "$project":{
            "User_Name":"$User Name",
            "OS_User":"$session.$OS User",
            "Client_Host_Name":"$session.$Client Host Name",
            "Destination_Address":"$Destination Address"
   }
 }
}
])