C# Geek

LINQ: Ensure that all items have the same parent key

Why C# Geek post series?

I love programming and C#, and I have been working with it for the last 15 years. While my roles and responsibilities have changed over this time, my passion for C# has stayed the same, so I decided to write a series of posts sharing my personal observations, experiences, and learnings.

What to expect from this series?
Well, I will try to cover various aspects of working with C# and .NET, avoiding opinionated posts and focusing instead on sharing findings and providing comparisons.

LINQ: Ensure that all items have the same parent key

Recently I encountered an interesting problem. A method processes a collection of items in bulk, but it can only do so if all items belong to the same parent object. To ensure this is the case, the method validates the input collection using the following code:

if (items.GroupBy(i => i.ParentKey).Count() > 1)
    throw new InvalidOperationException( /* ... */);

The presented code is perfectly fine and does the job. I know, however, that LINQ offers various extension methods that can achieve the same outcome, so I got curious to explore them and find out which one would be best suited for this particular scenario.

Setting up BenchmarkDotNet

In order to compare the differences, I created a new project following the guidelines from the BenchmarkDotNet overview page, with the Program class:

public class Program
{
    static void Main(string[] args) => BenchmarkRunner.Run(typeof(Program).Assembly);
}

I created a class Item with ParentKey property:

public class Item
{
    public string ParentKey { get; init; }
}

Then, I created a base benchmark class that will contain all benchmark methods:

[MemoryDiagnoser]
public abstract class BenchmarkBase
{
    private readonly IEnumerable<Item> _data;

    protected BenchmarkBase()
    {
        _data = GetData().ToArray();
    }

    protected abstract IEnumerable<Item> GetData();

    /* Benchmark methods... */
}

… followed by a few child classes that run these benchmarks against collections of 1000 items, where:

  • all items have the same key
  • the first item has a different key
  • the last item has a different key
  • the middle item has a different key

public class AllTheSame : BenchmarkBase
{
    protected override IEnumerable<Item> GetData() => Enumerable.Range(0, 1000)
        .Select(_ => new Item { ParentKey = "key1" });
}

public class FirstDifferent : BenchmarkBase
{
    protected override IEnumerable<Item> GetData() => Enumerable.Range(0, 999)
        .Select(_ => new Item { ParentKey = "key1" })
        .Prepend(new Item { ParentKey = "key2" });
}

public class LastDifferent : BenchmarkBase
{
    protected override IEnumerable<Item> GetData() => Enumerable.Range(0, 999)
        .Select(_ => new Item { ParentKey = "key1" })
        .Append(new Item { ParentKey = "key2" });
}

public class MiddleDifferent : BenchmarkBase
{
    protected override IEnumerable<Item> GetData() => Enumerable.Range(0, 500)
        .Select(_ => new Item { ParentKey = "key1" })
        .Append(new Item { ParentKey = "key2" })
        .Concat(Enumerable.Range(0, 499).Select(_ => new Item { ParentKey = "key1" }));
}

Validation methods

Having the benchmark skeleton in place, I created the MultipleParentKeyDetector class and implemented a few alternative methods for detecting the presence of items with a different ParentKey.

The GroupBy_Count() is the method used in the original code:

public bool GroupBy_Count(IEnumerable<Item> data) 
    => data.GroupBy(x => x.ParentKey).Count() > 1;

I followed it with an equivalent that uses the Distinct() method:

public bool Select_Distinct_Count(IEnumerable<Item> data) 
    => data.Select(x => x.ParentKey).Distinct().Count() > 1;

Having it implemented, I realized that we do not really care about the exact number of different keys. The next method I wrote checks whether there is any parent key other than the first one:

public bool Select_Distinct_Skip_Any(IEnumerable<Item> data) 
    => data.Select(x => x.ParentKey).Distinct().Skip(1).Any();

Finally, I implemented a reference method using foreach that returns immediately after spotting the first difference:

public bool Foreach(IEnumerable<Item> data)
{
    var first = true;
    string firstKey = null;
    foreach (var item in data)
    {
        if (firstKey != item.ParentKey && !first)
            return true;
        if (first)
        {
            firstKey = item.ParentKey;
            first = false;
        }
    }
    return false;
}

Running benchmarks

Having the validation methods implemented, I came back to BenchmarkBase, added a MultipleParentKeyDetector field, and implemented the benchmark methods:

private readonly MultipleParentKeyDetector _detector = new();

[Benchmark(Baseline = true)]
public bool Foreach() => _detector.Foreach(_data);

[Benchmark]
public bool GroupBy_Count() => _detector.GroupBy_Count(_data);

[Benchmark]
public bool Select_Distinct_Count() => _detector.Select_Distinct_Count(_data);

[Benchmark]
public bool Select_Distinct_Skip_Any() => _detector.Select_Distinct_Skip_Any(_data);

Finally, I ran the benchmarks with the dotnet run -c Release command.

The outcome

The benchmarks were run with the .NET 5 SDK and runtime on Windows 10:

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19042.1083 (20H2/October2020Update)
Intel Core i9-9900KS CPU 4.00GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
  [Host]     : .NET 5.0.8 (5.0.821.31504), X64 RyuJIT
  DefaultJob : .NET 5.0.8 (5.0.821.31504), X64 RyuJIT

AllTheSame

|                   Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------- |----------:|----------:|----------:|------:|--------:|-------:|-------:|------:|----------:|
|                  Foreach |  4.658 μs | 0.0204 μs | 0.0181 μs |  1.00 |    0.00 |      - |      - |     - |      32 B |
|            GroupBy_Count | 24.903 μs | 0.1545 μs | 0.1445 μs |  5.35 |    0.04 | 2.0142 | 0.0610 |     - |  16,896 B |
|    Select_Distinct_Count | 22.873 μs | 0.0362 μs | 0.0283 μs |  4.91 |    0.02 | 0.0305 |      - |     - |     352 B |
| Select_Distinct_Skip_Any | 23.640 μs | 0.0558 μs | 0.0495 μs |  5.08 |    0.02 | 0.0305 |      - |     - |     408 B |

FirstDifferent

|                   Method |         Mean |     Error |    StdDev |    Ratio | RatioSD |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------- |-------------:|----------:|----------:|---------:|--------:|-------:|-------:|------:|----------:|
|                  Foreach |     20.65 ns |  0.114 ns |  0.101 ns |     1.00 |    0.00 | 0.0038 |      - |     - |      32 B |
|            GroupBy_Count | 23,784.68 ns | 60.390 ns | 56.489 ns | 1,151.68 |    6.82 | 2.0142 | 0.0610 |     - |  16,984 B |
|    Select_Distinct_Count | 22,816.42 ns |  3.934 ns |  3.285 ns | 1,104.56 |    5.44 | 0.0305 |      - |     - |     352 B |
| Select_Distinct_Skip_Any |    183.27 ns |  2.127 ns |  1.990 ns |     8.87 |    0.12 | 0.0486 |      - |     - |     408 B |

MiddleDifferent

|                   Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------- |----------:|----------:|----------:|------:|--------:|-------:|-------:|------:|----------:|
|                  Foreach |  2.141 μs | 0.0098 μs | 0.0091 μs |  1.00 |    0.00 | 0.0038 |      - |     - |      32 B |
|            GroupBy_Count | 24.941 μs | 0.1059 μs | 0.0991 μs | 11.65 |    0.05 | 2.0142 | 0.0610 |     - |  16,984 B |
|    Select_Distinct_Count | 22.839 μs | 0.0146 μs | 0.0114 μs | 10.66 |    0.05 | 0.0305 |      - |     - |     352 B |
| Select_Distinct_Skip_Any | 11.906 μs | 0.0330 μs | 0.0308 μs |  5.56 |    0.03 | 0.0458 |      - |     - |     408 B |

LastDifferent

|                   Method |      Mean |     Error |    StdDev | Ratio | RatioSD |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------------------------- |----------:|----------:|----------:|------:|--------:|-------:|-------:|------:|----------:|
|                  Foreach |  4.081 μs | 0.0087 μs | 0.0077 μs |  1.00 |    0.00 |      - |      - |     - |      32 B |
|            GroupBy_Count | 24.983 μs | 0.0767 μs | 0.0680 μs |  6.12 |    0.02 | 2.0142 | 0.0610 |     - |  16,984 B |
|    Select_Distinct_Count | 23.225 μs | 0.0366 μs | 0.0324 μs |  5.69 |    0.01 | 0.0305 |      - |     - |     352 B |
| Select_Distinct_Skip_Any | 22.885 μs | 0.0429 μs | 0.0335 μs |  5.61 |    0.01 | 0.0305 |      - |     - |     408 B |

The Foreach approach is roughly 5 times faster than any alternative based on GroupBy() or Distinct(), even in the worst case, and dramatically faster when the differing item appears early in the collection.

For scenarios where all items have the same parent key or the “problematic” item is at the end of the collection, the performance of all LINQ solutions is similar.

However, the Select_Distinct_Skip_Any() outperforms the other LINQ methods when the “problematic” item appears earlier in the collection. This is because it stops enumerating as soon as it spots the difference, while the Count()-based solutions always enumerate the entire collection.
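
To see this short-circuiting in isolation, here is a small self-contained sketch (my own illustration, not part of the benchmark project) that counts how many items each query pulls from a source whose differing key comes first:

using System;
using System.Collections.Generic;
using System.Linq;

var enumerated = 0;

IEnumerable<string> Keys()
{
    // The differing key comes first, mirroring the FirstDifferent scenario.
    foreach (var key in new[] { "key2", "key1", "key1", "key1" })
    {
        enumerated++;
        yield return key;
    }
}

var viaCount = Keys().Distinct().Count() > 1;
Console.WriteLine($"Count:    {viaCount}, items enumerated: {enumerated}");   // True, 4

enumerated = 0;
var viaSkipAny = Keys().Distinct().Skip(1).Any();
Console.WriteLine($"Skip/Any: {viaSkipAny}, items enumerated: {enumerated}"); // True, 2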

Allocation-wise, the GroupBy_Count() is the most expensive. This is because GroupBy() has to preserve all items and group them together, while Distinct() can discard the duplicates.
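
As a rough illustration outside of BenchmarkDotNet (my own sketch, not a replacement for the MemoryDiagnoser numbers above), GC.GetAllocatedBytesForCurrentThread() can be used to compare the two approaches on the AllTheSame data, reusing the Item class defined earlier:

using System;
using System.Linq;

var items = Enumerable.Range(0, 1000)
    .Select(_ => new Item { ParentKey = "key1" })
    .ToArray();

// Bytes allocated on the current thread while running the given check.
long Measure(Func<bool> check)
{
    var before = GC.GetAllocatedBytesForCurrentThread();
    check();
    return GC.GetAllocatedBytesForCurrentThread() - before;
}

// GroupBy() keeps every item inside its grouping, while Distinct() only tracks the keys it has already seen.
Console.WriteLine($"GroupBy:  {Measure(() => items.GroupBy(x => x.ParentKey).Count() > 1)} B");
Console.WriteLine($"Distinct: {Measure(() => items.Select(x => x.ParentKey).Distinct().Count() > 1)} B");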

The Summary

It looks like the Foreach() approach is the most performant and allocates the least memory. In scenarios where high performance is important, I would consider writing specialized methods like it.
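
As an example of such a specialized method, the foreach approach generalizes into a reusable extension method; the following is only a sketch of what it could look like (the AllHaveSameKey name and signature are mine, not part of the benchmark project):

using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Returns true when every element maps to the same key; an empty sequence counts as true.
    public static bool AllHaveSameKey<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey> comparer = null)
    {
        comparer ??= EqualityComparer<TKey>.Default;
        var first = true;
        TKey firstKey = default;

        foreach (var item in source)
        {
            if (first)
            {
                firstKey = keySelector(item);
                first = false;
            }
            else if (!comparer.Equals(firstKey, keySelector(item)))
            {
                return false;
            }
        }

        return true;
    }
}

With such a method in place, the original validation could read if (!items.AllHaveSameKey(i => i.ParentKey)) throw new InvalidOperationException( /* ... */);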

However, in less critical scenarios, I would stick to the options offered by LINQ, as they produce more concise code that is easier to read and maintain.

For this particular scenario I would choose:

  • Select(x => x.ParentKey).Distinct() over GroupBy(x => x.ParentKey), because the former approach is much more lightweight,
  • Skip(1).Any() over Count() > 1, because the former approach won’t enumerate the entire collection if not required (the resulting check is shown below).
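
Putting those two choices together, the original validation would become:

if (items.Select(i => i.ParentKey).Distinct().Skip(1).Any())
    throw new InvalidOperationException( /* ... */);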

The Source-code

The source code for this exercise can be found in my GitHub project: https://github.com/Suremaker/cs-geek/tree/main/AllItemsWithSameKey