This research spike concerns generating email notifications for standing queries in user workspaces. A few assumptions were made for the purposes of this spike, namely:
Listed here are some of the major concerns to be considered for a solution:
Standing query notification generation should run once per day at a time specified by the system administrator (likely at night). The standing query notification service will fetch all active workspaces and the active standing queries belonging to each. Then, for each workspace, it will run each standing query. Because the service is interested only in the most recent results, each query will be modified to restrict it to results that have been modified in the last 24 hours (i.e., the Metacard.MODIFIED attribute falls within the last 24 hours). To get an accurate count, each query will have to request the total result count.
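The 24-hour restriction can be sketched in isolation. The following is a minimal illustration, not DDF code: ModifiedWindow and its methods are hypothetical names, and in a real implementation the restriction would presumably be expressed as a filter on Metacard.MODIFIED combined with the user's query rather than checked in memory. The half-open interval is an assumption, chosen so that consecutive daily runs do not count the same result twice.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: decides whether a modification time falls in the notification window. */
final class ModifiedWindow {
    private final Instant start; // inclusive
    private final Instant end;   // exclusive

    private ModifiedWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    /** Window covering the 24 hours that end at the given run time. */
    static ModifiedWindow lastDay(Instant runTime) {
        return new ModifiedWindow(runTime.minus(Duration.ofHours(24)), runTime);
    }

    /** True when start <= modified < end. */
    boolean contains(Instant modified) {
        return !modified.isBefore(start) && modified.isBefore(end);
    }
}
```

For a run at 02:00 UTC, ModifiedWindow.lastDay(runTime) accepts a result modified at 02:00 UTC the previous day and rejects one modified exactly at the run time, which would fall into the next day's window.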
One problem here is that the queries cannot be run with the workspace owner’s Subject. Even though we know who owns the workspace, we can’t access their Subject because the system runs the queries in a batch at a scheduled time, and the user would likely be logged out when the queries run anyway. Consequently, the results cannot be filtered appropriately for the user. Thus we must restrict the information contained in the email notifications to a numerical summary of the new query results, e.g. “There are <X> new results available for workspace <Y>. Log in to view them.”
If we run the queries using the system’s high watermark, they may return results that the user is not allowed to see. In this case, the user may log in after receiving their notification and find that there are fewer new results than the notification claimed were available.
After running the workspaces’ standing queries, we will have a map from each workspace to its number of new results. This information is potentially useful for applications other than email notification. The standing query notifier service could maintain a list of subscriber services and notify each of them after obtaining the new result counts. One such subscriber service would be an email generator that emails the summaries to the users who have opted in to email notifications. Another subscriber service could be a notification generator that creates notifications to be displayed to the user in the Search UI. With this design, the new workspace search result summary information can be made available to any interested parties.
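The subscriber design described above resembles a plain observer pattern. The following is a minimal sketch under that assumption; ResultCountSubscriber, StandingQueryNotifier, and their methods are hypothetical names, and workspaces are identified by plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical callback for services interested in the new-result counts. */
interface ResultCountSubscriber {
    void onNewResults(Map<String, Long> newResultsByWorkspace);
}

/** Hypothetical notifier that hands the counts to every registered subscriber. */
final class StandingQueryNotifier {
    private final List<ResultCountSubscriber> subscribers = new ArrayList<>();

    void subscribe(ResultCountSubscriber subscriber) {
        subscribers.add(subscriber);
    }

    /** Passes a read-only view of the counts to each subscriber, in registration order. */
    void publish(Map<String, Long> newResultsByWorkspace) {
        Map<String, Long> readOnly = Collections.unmodifiableMap(newResultsByWorkspace);
        for (ResultCountSubscriber subscriber : subscribers) {
            subscriber.onNewResults(readOnly);
        }
    }
}
```

An email generator and a Search UI notification generator would each implement ResultCountSubscriber and register once. The notifier itself needs no knowledge of what they do with the counts.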
For each workspace, the email generator service will have to obtain the list of users subscribed to email notifications. If a user has opted in to email notifications for multiple workspaces, they will probably want a single email containing the new results for all of their workspaces rather than one email per workspace. Essentially, the email generator service will iterate through the workspaces and build a map of the subscribed users to their workspace summaries. Then it can simply construct the emails and send them.
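Building the map of users to workspace summaries amounts to inverting the workspace-to-subscribers relation. The following is a minimal sketch, assuming workspaces and users are identified by plain strings; EmailDigestBuilder and its methods are hypothetical names, and the message wording follows the example given earlier in this spike.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that groups workspace summaries by subscribed user. */
final class EmailDigestBuilder {

    /** Inverts workspace -> subscribers into user -> summary lines (one line per workspace). */
    static Map<String, List<String>> summariesByUser(
            Map<String, Long> newResultsByWorkspace,
            Map<String, List<String>> subscribersByWorkspace) {
        Map<String, List<String>> byUser = new TreeMap<>();
        for (Map.Entry<String, Long> entry : newResultsByWorkspace.entrySet()) {
            String workspace = entry.getKey();
            String line = "There are " + entry.getValue()
                    + " new results available for workspace " + workspace + ".";
            // A workspace without subscribers contributes nothing.
            List<String> users =
                    subscribersByWorkspace.getOrDefault(workspace, Collections.emptyList());
            for (String user : users) {
                byUser.computeIfAbsent(user, u -> new ArrayList<>()).add(line);
            }
        }
        return byUser;
    }

    /** Joins one user's summary lines into a single email body. */
    static String body(List<String> lines) {
        return String.join("\n", lines) + "\nLog in to view them.";
    }
}
```

Each key of the returned map corresponds to exactly one email, so a user subscribed to several workspaces receives one message with one line per workspace.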
There are some performance concerns with this solution since the standing query notifier service may need to run a large number of queries. But for this particular feature, performance is not absolutely critical since it will run infrequently and the result does not have to be returned quickly. Nevertheless, there are several simple optimizations we could make.
One such optimization is to group a workspace’s standing queries by source. In other words, a workspace’s standing queries are grouped into those that query source A, those that query source B, etc. Each group’s queries are ORed together, and the resulting query is sent to the catalog framework. This means that for each workspace, each source will be sent at most one query.
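The grouping step can be sketched as follows. This is an illustration only: StandingQuery and QueryGrouper are hypothetical names, plain strings stand in for real filter objects, and joining with " OR " stands in for whatever OR operation the actual filter API provides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical standing query: a filter expression aimed at one source. */
final class StandingQuery {
    final String sourceId;
    final String filter;

    StandingQuery(String sourceId, String filter) {
        this.sourceId = sourceId;
        this.filter = filter;
    }
}

/** Hypothetical helper that merges a workspace's queries into one query per source. */
final class QueryGrouper {

    /** Returns one combined filter per source, in first-seen source order. */
    static Map<String, String> combineBySource(List<StandingQuery> queries) {
        Map<String, List<String>> filtersBySource = new LinkedHashMap<>();
        for (StandingQuery query : queries) {
            // Parenthesize each filter so the OR cannot change its meaning.
            filtersBySource.computeIfAbsent(query.sourceId, s -> new ArrayList<>())
                    .add("(" + query.filter + ")");
        }
        Map<String, String> combined = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : filtersBySource.entrySet()) {
            combined.put(entry.getKey(), String.join(" OR ", entry.getValue()));
        }
        return combined;
    }
}
```

One consequence of ORing is that a result matching several queries in the same group is counted once, and per-query counts are no longer available. That fits the per-workspace summary described in this spike.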
Another optimization follows from the fact that this problem is naturally separated into independent parts (workspaces). Since each workspace will have an associated set of queries to run, fetching the workspaces’ new results can easily be done in parallel. In addition, the queries within each workspace can be run in parallel.
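One way to run the queries in parallel is to submit them all to a thread pool and collect the counts afterwards. The following is a minimal sketch using java.util.concurrent; ParallelCounter is a hypothetical name, countFor is a stand-in for sending one query to the catalog and reading its total result count, and summing the per-query counts is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.ToLongFunction;

/** Hypothetical helper that counts new results for all workspaces concurrently. */
final class ParallelCounter {

    /**
     * Runs every query of every workspace on the given pool and sums the counts per workspace.
     * Note: summing can count a result twice if it matches more than one query.
     */
    static Map<String, Long> countAll(
            Map<String, List<String>> queriesByWorkspace,
            ToLongFunction<String> countFor,
            ExecutorService pool) {
        // Submit everything first so that no query waits on another.
        Map<String, List<CompletableFuture<Long>>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : queriesByWorkspace.entrySet()) {
            List<CompletableFuture<Long>> futures = new ArrayList<>();
            for (String query : entry.getValue()) {
                futures.add(CompletableFuture.supplyAsync(() -> countFor.applyAsLong(query), pool));
            }
            pending.put(entry.getKey(), futures);
        }
        // Then collect: join() blocks until each count is available.
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<CompletableFuture<Long>>> entry : pending.entrySet()) {
            long total = 0;
            for (CompletableFuture<Long> future : entry.getValue()) {
                total += future.join();
            }
            totals.put(entry.getKey(), total);
        }
        return totals;
    }
}
```

The caller owns the pool and is responsible for shutting it down. A bounded pool, for example one created with Executors.newFixedThreadPool, also limits how many queries reach the catalog at the same time.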
There are likely more optimizations that could be made; the two presented above were the easiest to find and appeared to have straightforward implementations.
Several conditions that would diminish the performance concerns of this implementation are likely to hold in most scenarios. First, any given DDF node will likely have a very small number of users. Furthermore, each user will likely have only a few active workspaces, and each workspace will likely have only a few standing queries. Workspace sharing will also reduce the number of active workspaces. Consequently, the number of active standing queries in a DDF node will probably be low.
If workspaces are closed or deleted, or if standing queries expire, the standing query notification service will not be affected since it only evaluates active workspaces and standing queries.