Research ways to display clusters of features displayed on the map

Introduction

The purpose of this spike is to determine the feasibility and mechanism to display clusters of features on the map, and to “drill” down into those clusters by clicking on them.  When working with large data sets, it is not ideal to naively plot every point on a map.  This can overwhelm the user and can lead to performance degradation.  One solution to this problem is to collect nearby map points into clusters based on different levels of zoom.

 Research was done in the following areas:

  • Determine the general mechanism behind map clustering
  • Producing a reasonable algorithm to cluster map points
  • Examine the pros and cons of a server vs client-side approach
  • Develop prototypes as time permits.
  • Determine the implementation differences between different map types.
  • Determine which implementation type is most appropriate for DDF purposes

General Mechanism

Map clustering would ideally be a feature that can be turned on in the Standard Search UI, and would take effect after the Search UI obtains the results from a query.  When the map finishes moving, a clustering algorithm would be called that gathers nearby points into groups, adds cluster points, and hides the original non-clustered points.  When the user hovers over the cluster, a polygon can be drawn based on the points that exist within the cluster using a convex hull algorithm (discussed later).   When the user clicks on a cluster, the map will zoom in to a level that encapsulates this polygonal region.

Algorithm

A very straightforward approach to the map clustering problem is to consider the pixel distances between two points on a map.  As the user zooms the map in or out, this pixel distance will change.  If a set of points are within a certain pixel distance of each other, they will all fall under one cluster.  This approach is ideal because it is not implementation dependent, and the APIs for Cesium and OpenLayers both support these calculations.  With some preproccesing (getting pixel locations) the algorithm can be used by both implementations, with the only differences in code being the rendering process.

clusteringAlgorithm:

         clusterList := a new list

         mapPoints := a list of points with pixel locations

         for each feature f in mapPoints:

                  for each cluster c in clusterList:

                           if f fits in c:

                                    add f to c

                           else:

                                    create a new cluster and add to clusterList

                           hide f

         end

         for each cluster c in clusterList:

                  determine position of the center of the cluster and store in c

         end

         return clusterList to be rendered

This proposed algorithm runs in O(n * c), where n is the number of points that exist on the map and c is the number of clusters that are created during the clustering process.  The default number of search results to display in DDF is 250, so the performance of the algorithm should not be terribly concerning.  During the research spike, a live demo (link below) of an algorithm using pixel distances in Cesium was found.  It is worth noting that even with 1000 points, the performance degradation of the clustering is negligible.  

http://biksapps.com/racheli/cesium_cluster/dist/#/ 

When hovering over clusters in this live demo, a polygon is drawn around the encompassing region, this is done by using a convex-hull algorithm.

Convex-Hull

http://roadtolarissa.com/convex-hulls/

Server-Side Implementation

There has been some debate over having a server-side vs client-side clustering algorithm in DDF.  Considering that the algorithm uses pixel distances to cluster the points together, a skeleton set of point data can be passed to the server for processing whenever the user moves the map.  To limit the amount of information being passed between the client and server, this skeleton set of point data could be cached on the server.  For a point, the skeleton information required by the clustering algorithm would be only the pixel location, and some unique identifier for the front end (used to associate the clustering information back to the view).

Pros

  • The clustering algorithm is not tied to a particular front end implementation. (Code once, use in any browser with any toolkit.)
    • However, rendering is already done differently for different toolkits (Cesium vs Openlayers).

Cons

  • Client-Server communication would have to occur every time a user moves the map. 
    • This means there will be multiple queries for multiple users over a short period of time.
    • While a skeleton set of data can be passed, this is still not scalable if a user wants to cluster 1000 points.
    • Cesium Tile refresh is done by client-server communication when changing the zoom level, and is a good example of performance issues that can arise from this approach.
    • If caching, there currently does not exist a way to cache a data set and distinguish the data per user.
      • Caching information per user could cause data/memory performance problems as the amount of active users goes up.
      • If querying the c is done against a slow federated source, the clustering algorithm performance will suffer tremendously.

Client-Side Implementation

With a client-side implementation of the clustering algorithm, data will not be passed around, and the processing can be done immediately after a camera change. 

Pros

  • The client will already have the resulting set from a query, and need not communicate with the server to run the algorithm after a search is completed.
    • The clustered data can be rapidly rendered on the browser as soon as the algorithm finishes.
    • The proposed algorithm uses the pixel distance to cluster points, which is readily available on the client.

 

Cons

  • The clustering algorithm may have to be written more than once for different front-end implementations.  (Cesium, OpenLayers).
    • The rendering is already done differently between different maps.

Prototype

During the research spike, a prototype of clustering on the Cesium map was created in DDF as a proof of concept and feasibility.  In the interest of time, this prototype did not separate the clustering process from the Cesium rendering process. This should be done in the real world implementation since Cesium and OpenLayers have different scripts to render the results from a query.

GitHub code:

https://github.com/bdeining/ddf/tree/DDF-Clustering-Prototype

OpenLayers Rendering

The rendering process for Cesium is demonstrated in the protype, but it is necessary to consider the differences between rendering the information in Cesium vs. Openlayers.  In general, the process is not terribly different.  The following code snippet highlights some of the differences that were found during the research spike:

Listen to the moveend trigger when the map has stopped moving:

map.events.register('moveend', map, clusteringAlgorithm);

To Get the Pixel Coordinates:

var pixel = map.getPixelFromCoordinate(coordinate);

To set the visibility of points:

var features = layer.features;
for( var i = 0; i < features.length; i++ ) {
  features[i].style = { visibility: 'hidden' };
}
layer.redraw();

Considerations

  • In Openlayers, zooming in and out using the mouse wheel is extremely sensitive and may need to be tweaked to facilitate smooth map clustering
  • In Cesium, zooming out seemingly has no upper altitude limit, and the map can be zoomed out until the Earth is no longer in view due to its scale.

Conclusion

Map Clustering makes more sense to implement on the front-end.  This determination is based on the number of cons to a server-side approach, as well as the readily available data that the client-side will have.  Since the clustering algorithm will be run any time the camera moves, the amount of communication required (and the amount of information that is passed) would be less than ideal.  As noted earlier, the performance degradation on the live demo with 1000 points was negligible, but this may be noticeable if the information was communicated to a server for processing.  Since the information already exists on the front-end, it can be easily manipulated and rendered on a per user basis.