Standing Query Email Notifications

This research spike concerns generating email notifications for standing queries in user workspaces. A few assumptions were made for the purposes of this spike, namely:

  1. Notifications are generated for workspaces rather than for individual queries;
  2. Users must opt in to receive email notifications for a workspace;
  3. Notifications are generated only for the workspaces that have at least one active (i.e., unexpired) standing query.

Listed here are some of the major concerns to be considered for a solution:

  1. Security
  2. Performance
  3. Support for federation
  4. Reusability

The Proposed Solution

Standing query notification generation should run once per day at a time specified by the system administrator (most likely at night). The standing query notification service will fetch all active workspaces and the active standing queries belonging to each, then run every standing query for each workspace. Because the service is only interested in the most recent results, each query will be modified to return only results that were modified in the last 24 hours (i.e., those whose Metacard.MODIFIED attribute falls within the last 24 hours). To get an accurate count, each query will also have to request the total result count.
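The 24-hour restriction described above can be sketched as a small time-window helper. This is only an illustration under assumptions: ModifiedWindow is a hypothetical class, not part of DDF, and in the real service the window would be expressed as a temporal filter on Metacard.MODIFIED combined with the standing query rather than checked in memory.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical value type: a half-open time window [start, end). */
final class ModifiedWindow {
    final Instant start;
    final Instant end;

    ModifiedWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    /** Window covering the 24 hours immediately before the given run time. */
    static ModifiedWindow lastDay(Instant runTime) {
        return new ModifiedWindow(runTime.minus(Duration.ofHours(24)), runTime);
    }

    /** True if a modified timestamp falls inside the window (start inclusive, end exclusive). */
    boolean contains(Instant modified) {
        return !modified.isBefore(start) && modified.isBefore(end);
    }
}
```

Using a half-open window means that two consecutive daily runs, each anchored at its own scheduled run time, would not count the same result twice at the boundary. Whether the real filter should be inclusive or exclusive at either end is a design choice this spike does not settle.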

One problem here is that the queries cannot be run with the workspace owner’s Subject. Even though we know who owns the workspace, we can’t access their Subject because the system runs the queries in a batch at a scheduled time, when the user would likely be logged out anyway. Consequently, the results cannot be filtered appropriately for the user. Thus we must restrict the information contained in the email notifications to a numerical summary of the new query results, e.g., “There are <X> new results available for workspace <Y>. Log in to view them.” If we run the queries using the system’s high watermark, they may return results that the user is not allowed to see. In that case, the user may log in after receiving their notification and find fewer new results than the notification claimed were available.

After running the workspaces’ standing queries, we will have a map of the workspaces to the number of new results for each. This information is potentially useful for applications other than email notification. The standing query notifier service could maintain a reference list of subscriber services and notify each of them after obtaining the new result counts. One such subscriber service would be an email generator that emails the summaries to the users who have opted in to email notifications. Another subscriber service could be a notification generator that creates notifications that will be displayed to the user in the Search UI. With this design, the new workspace search result summary information can be made available to any interested parties.
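The subscriber arrangement described above might look roughly like the following sketch. All names here (WorkspaceSummarySubscriber, StandingQueryNotifier, onSummaries) are hypothetical and chosen for illustration; workspaces are assumed to be identified by a string ID, and how subscribers would actually be registered (for example, as services) is left open.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical callback for any service interested in new-result counts. */
interface WorkspaceSummarySubscriber {
    void onSummaries(Map<String, Long> newResultCountsByWorkspaceId);
}

/** Hypothetical notifier that fans the counts out to every registered subscriber. */
final class StandingQueryNotifier {
    private final List<WorkspaceSummarySubscriber> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(WorkspaceSummarySubscriber subscriber) {
        subscribers.add(subscriber);
    }

    void publish(Map<String, Long> counts) {
        // Hand out a read-only copy so that no subscriber can alter what the others see.
        Map<String, Long> snapshot = Collections.unmodifiableMap(new HashMap<>(counts));
        for (WorkspaceSummarySubscriber subscriber : subscribers) {
            try {
                subscriber.onSummaries(snapshot);
            } catch (RuntimeException e) {
                // A failing subscriber should not prevent the remaining ones from being notified.
                System.err.println("Subscriber failed: " + e.getMessage());
            }
        }
    }
}
```

An email generator and a Search UI notification generator would then each be one implementation of the subscriber interface, with neither aware of the other.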

For each workspace, the email generator service will have to obtain the list of the users subscribed to email notifications. If a user has opted in for email notifications for multiple workspaces, they will probably only want one email containing the new results for all of their workspaces rather than one email for each workspace. Essentially, the email generator service will iterate through the workspaces and build a map of the subscribed users to their workspace summaries. Then it can simply construct the emails and send them.
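A minimal sketch of that inversion step follows. The types are hypothetical: WorkspaceSummary and EmailDigestBuilder are illustrative names, users are represented by their email addresses, and workspaces by string IDs. Skipping workspaces with zero new results is an assumption made here, not something the spike decided.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-workspace summary: a title and the number of new results. */
final class WorkspaceSummary {
    final String workspaceTitle;
    final long newResultCount;

    WorkspaceSummary(String workspaceTitle, long newResultCount) {
        this.workspaceTitle = workspaceTitle;
        this.newResultCount = newResultCount;
    }
}

final class EmailDigestBuilder {
    /** Inverts workspace -> subscribers into subscriber -> summaries, so each user gets one email. */
    static Map<String, List<WorkspaceSummary>> groupByUser(
            Map<String, Set<String>> subscribersByWorkspaceId,
            Map<String, WorkspaceSummary> summaryByWorkspaceId) {
        Map<String, List<WorkspaceSummary>> byUser = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : subscribersByWorkspaceId.entrySet()) {
            WorkspaceSummary summary = summaryByWorkspaceId.get(entry.getKey());
            // Assumption: workspaces with no new results are left out of the email.
            if (summary == null || summary.newResultCount == 0) {
                continue;
            }
            for (String email : entry.getValue()) {
                byUser.computeIfAbsent(email, k -> new ArrayList<>()).add(summary);
            }
        }
        return byUser;
    }

    /** Builds the numerical-summary body, one line per workspace. */
    static String body(List<WorkspaceSummary> summaries) {
        StringBuilder sb = new StringBuilder();
        for (WorkspaceSummary s : summaries) {
            sb.append("There are ").append(s.newResultCount)
              .append(" new results available for workspace ").append(s.workspaceTitle)
              .append(". Log in to view them.\n");
        }
        return sb.toString();
    }
}
```

Actually sending the messages is deliberately left out, since the mail transport is outside the scope of this sketch.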

There are some performance concerns with this solution since the standing query notifier service may need to run a large number of queries. But for this particular feature, performance is not absolutely critical since it will run infrequently and the result does not have to be returned quickly. Nevertheless, there are several simple optimizations we could make.

One such optimization is to group a workspace’s standing queries by source. In other words, a workspace’s standing queries are grouped into those that query source A, those that query source B, etc. Each group’s queries are ORed together, and the resulting query is sent to the catalog framework. This means that for each workspace, each source will be sent at most one query.
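The grouping step can be illustrated with the sketch below. It is a simplification under stated assumptions: StandingQuery and QueryGrouper are hypothetical types, a query's filter is represented as plain text, and joining with " OR " stands in for however the real filters would be combined before being handed to the catalog framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical standing query: a textual filter plus the IDs of the sources it targets. */
final class StandingQuery {
    final String filter;
    final Set<String> sourceIds;

    StandingQuery(String filter, Set<String> sourceIds) {
        this.filter = filter;
        this.sourceIds = sourceIds;
    }
}

final class QueryGrouper {
    /** Returns one combined (ORed) filter per source for a single workspace's queries. */
    static Map<String, String> combineBySource(List<StandingQuery> queries) {
        Map<String, List<String>> filtersBySource = new TreeMap<>();
        for (StandingQuery query : queries) {
            for (String sourceId : query.sourceIds) {
                // Parenthesize each filter so the OR cannot change its internal precedence.
                filtersBySource.computeIfAbsent(sourceId, k -> new ArrayList<>())
                               .add("(" + query.filter + ")");
            }
        }
        Map<String, String> combined = new TreeMap<>();
        filtersBySource.forEach((sourceId, filters) ->
                combined.put(sourceId, String.join(" OR ", filters)));
        return combined;
    }
}
```

One consequence worth keeping in mind is that the total count of an ORed query is the size of the union of the individual result sets, so a result matching two standing queries is counted once rather than twice. For a per-workspace summary that is arguably the desired number.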

Another optimization follows from the fact that this problem is naturally separated into independent parts (workspaces). Since each workspace will have an associated set of queries to run, fetching the workspaces’ new results can easily be done in parallel. In addition, the queries within each workspace can be run in parallel.
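As a rough sketch of the workspace-level parallelism, the following uses a fixed thread pool and CompletableFuture from the standard library. ParallelCounter is a hypothetical name, and the countNewResults function is a placeholder for whatever actually runs a workspace's standing queries and returns the number of new results.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToLongFunction;

final class ParallelCounter {
    /** Counts new results for every workspace, running the workspaces concurrently. */
    static Map<String, Long> countAll(Iterable<String> workspaceIds,
                                      ToLongFunction<String> countNewResults,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per workspace; the tasks are independent of each other.
            Map<String, CompletableFuture<Long>> pending = new LinkedHashMap<>();
            for (String id : workspaceIds) {
                pending.put(id, CompletableFuture.supplyAsync(
                        () -> countNewResults.applyAsLong(id), pool));
            }
            // Wait for each task and collect the counts in submission order.
            Map<String, Long> counts = new LinkedHashMap<>();
            pending.forEach((id, future) -> counts.put(id, future.join()));
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```

A bounded pool is used here so that a nightly run cannot flood the sources with requests. Note that in this sketch a failure in any one workspace surfaces from join() as a CompletionException and aborts the whole run; a real implementation would probably want to record the failure and continue. The same pattern could be applied one level down to the queries within a workspace.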

There are likely more optimizations that could be made; the two presented above were the easiest to find and appeared to have straightforward implementations.

Several conditions that would diminish the performance concerns of this implementation are likely to hold in most scenarios. First, any given DDF node will likely have a very small number of users. Furthermore, each user will likely have only a few active workspaces, and each workspace will likely have only a few standing queries. Workspace sharing will also reduce the number of active workspaces. Consequently, the number of active standing queries in a DDF node will probably be low.

Other Notes Regarding This Design

If workspaces are closed or deleted or if standing queries expire, the standing query notification service will not be affected since it only evaluates active workspaces and standing queries.

Link to Prototype