Thursday, October 27, 2011

EMF Compare scalability

During the sponsored work on EMF Compare, we have taken some time to measure the performance of EMF Compare and think about possible improvement axis. We meant to profile both the time needed to compare two models and the overall memory footprint of a comparison, this has been achieved through the use of the Yourkit Java profiler.

The first measures we took highlighted such a huge kludge that we took the time to better the comparison algorithms before taking any further action. Through the use of google's Guava and some rethinking, we improved the comparison time on big model comparison a great deal, dividing the total comparison time in half!

Here is a sample of the time it takes now to compare your models with the latest builds of EMF Compare :

Structure of the sample models used in these tests ("fragments" are the number of fragmented files, the rest are UML model elements contained by the samples) :

Small Nominal Large

Fragments 99 399 947

Packages 97 389 880

Classes 140 578 2169

Primitive Types 581 5370 17152

Data Types 599 5781 18637

State Machines 55 209 1311

States 202 765 10156

Dependencies 235 2522 8681

Transitions 798 3106 49805

Operations 1183 5903 46029

Time and memory used required to compare each of the model sizes (the model is copied, randomly modified, then the copy is compared with its original) :

Small Nominal Large

Time (seconds) 6 22 125

maximum Heap (Mo) 555 1019 2100

We initially thought that the time was mostly spent matching the elements together. It turns out that the real bottleneck of a comparison is the differencing phase (we know the two elements "match", now we need to know if, and how they differ). Just goes to show you that a profiler is mandatory in order to really know why your product might be slow ;).

Future development

These profiling sessions made clear the problems of EMF Compare on large models. We can properly handle "medium" models : EMF Compare is fast and its GUI can react at reasonable speed. However we cannot scale this behavior to large models : memory management is somewhat inefficient (more than 2Go of heap space to compare two models that both "weigh" 50Mo on disk...), the GUI is sluggish... We did isolate a number of potential improvements though. Stay tuned!

Note : the report is online and can be retrieved from here for those interested.