Subtle bug in Java standard library

How to removeAll performance from your HashSet

I stumbled over an interesting Tweet from @relizarov stating

OMG! screamDon’t use Set.removeAll(list) on JVM. You can get O(N^2). I have no words. Only emotions.

As I cound not believe that in the first place, I digged a little deeper and learned about a bug in the Java HashSet implementation with a very long history and finally a fix in sight. It provides a great insight into the complexity of simple things.

Reproducing the problem

First of all lets have a look what the problem is all about and reproduce it. I created a very small starter project you can clone from Codeberg and run it with a simple mvn test. The code is quite simple as you will find as follows. It creates a Set and a List of Integers and fills them with a range from 0 to a given size. After that, we measure the milliseconds needed to remove all list items from the set.

long getExecutionTimeMillisForSizes(int setSize, int listSize){
    var set = IntStream.range(0, setSize).boxed().collect(toSet());
    var listToBeRemoved = IntStream.range(0, listSize).boxed().collect(toList());

    var from = Instant.now();
    // remove the list from the set. This is the interesting line.
    set.removeAll(listToBeRemoved);
    var to = Instant.now();
    return Duration.between(from, to).toMillis();
}

Not much to go wrong here, right? Now let’s run this with an increasing number of Set and List sizes. We create a Set of 1.000 elements and remove a List of 999 elements using set.removeAll(listToBeRemoved). Then we go on with a Set of 10.000 elements and remove a List of 9.999 elements, and so on.

The results are not surprising. The execution time slowly increases with the number of items getting larger, but overall, it’s still pretty fast for large Sets and Lists.

[INFO] Running de.martinspielmann.badexample.bugs.hashset.removeall.HashsetRemoveAllUnequalSizesTest
Millis for 1000-999: 2
Millis for 10000-9999: 3
Millis for 100000-99999: 34
Millis for 1000000-999999: 58
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.765 s - in de.martinspielmann.badexample.bugs.hashset.removeall.HashsetRemoveAllUnequalSizesTest

Now what happens if one more item should be removed from the Set. Let’s create a Set of 1.000 elements and remove a List of 1.000 elements. Then we go on with a Set of 10.000 elements and remove a List of 10.000 elements, and so on.

This time, the results are more insteresting. Still starting quickly for small datasets, the execution time explodes to over 22 minutes for a million elements. What the hell is happening here?!

[INFO] Running de.martinspielmann.badexample.bugs.hashset.removeall.HashsetRemoveAllEqualSizesTest
Millis for 1000-1000: 28
Millis for 10000-10000: 150
Millis for 100000-100000: 11897
Millis for 1000000-1000000: 1359740
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1,372.159 s - in de.martinspielmann.badexample.bugs.hashset.removeall.HashsetRemoveAllEqualSizesTest

After reproducing the problem, it is now time to find the cause.

Finding the root cause

The first hint can actually be found the the JavaDoc of AbstractSet.removeAll() which is also used by HashSet:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator’s remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method.

So as the documentation tells us, there is a small optimization in the code that that leads to a different behavior based on the sizes of the Set and the Collection to be removed. The code looks like this:

public boolean removeAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;

    if (size() > c.size()) {
        for (Object e : c)
            modified |= remove(e);
    } else {
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}

If the size of the Set is bigger than the given Collection, HashSet.remove() is called on the HashSet which is pretty fast, O(1) to be precise.

The problems start as soon as the Set is no longer bigger than the given Collection. Then the second branch of the if-else statement is executed and Collection.contains() is called. The performance of this methods is unfortunately highly depending on the specific implementation. For example, looking at HashSet it’s again O(1) and will not cause any issue, whereas ArrayList.contains() will run in O(N) and here we are. That’s everything it needs to make the execution time explode. ArrayList.contains() will be called in a loop for every element in the set. For a set of size M, this results in O(M*N). In our example, as both the Set and the List are of equal size, the complexity is in fact worst case with O(N^2). So what has been planned as an optimization actually destroys predictable program behavior and may kill application performance under certain circumstances.

Fixing the issue

There are multiple workarounds that were initially posted in this blog post which also described the problem already 10 years ago! First option is to simply force the Collection to be removed to be a HashSet. There is overhead in creating the new HashSet, but it is almost negligible.

set.removeAll(new HashSet<>(listToBeRemoved));

Another workaround is to not use removeAll, but manually iterate over the collection to be removed and call HashSet.remove() on the source set. This is basically reproducing the first if-else branch of the original removeAll method which also avoids the overhead of creating the new HashSet.

listToBeRemoved.stream().forEach(set::remove)

The cleanest way to fix this issue is to upgrade to Java 15. Yes, there is life in JDK-6394757 which is open since 2006. Actually, the problem has been introduced in Java 1.3 which has been released in the year 2000.

Finally, after 20 years, this issue will be fixed in Java 15 which will be released on 2020/09/15.

Update from September 2020

Java 15 is now available. Unfortunately, running the test again had the same bad results as before. Bug has been moved to JDK 16 now.

References

Leave a Reply

Your email address will not be published. Required fields are marked *