LINQ, I love you, but I'm going back to my roots

Posted by Matthew Watkins on December 19, 2017

So, I faced an interesting situation today. I was getting a collection of objects from a database using a stored proc, but I only needed some of the objects that were returned. And I had to return the objects as a List.

Sounds like a good job for a LINQ where clause, right?

public List<object> DoWork() => GetObjects().Where(ShouldKeep).ToList();

private bool ShouldKeep(object o)
{
  // Do a bit of work here...
}

Ah, nice. Short. Readable. Beautiful. And unfortunately, not a good idea in this case.

When LINQ is not appropriate

Listen, I love LINQ. When Java devs ask me what makes me love C# so much, LINQ is at the very top of my list. That said, I think that one of the biggest sins we .NET developers commit is the overuse of LINQ. My use case today gave me several reasons not to use my favorite feature. Any one of them individually would probably be enough to disqualify LINQ as a valid tool for this problem. Here are a few:

Two heads are better than one. But two collections aren’t.

Because of the way our data repository is implemented, that GetObjects() method doesn’t return a lazy IEnumerable<object>. It doesn’t yield return the elements one at a time. There is no deferred execution. It returns a list. A list that is taking up memory on my heap. A list that can get pretty big.

And unfortunately, I need to return a list, too. Just a (potentially) smaller one. And here’s the rub: that .Where(ShouldKeep).ToList() creates a second list that can be anywhere up to the same size as the potentially large list I fetched from my database call. You will rightly point out that the first list, in this case, becomes just an intermediate list. It will fall out of scope and become eligible for garbage collection the moment the method returns the final results list. And that’s true. But while that final results list is being built by my beloved LINQ, you can’t escape the fact that you will have two potentially large lists putting memory pressure on your process.
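To make the memory math concrete, here is roughly what that pretty one-liner expands to (the variable names are mine, not real code from the repository):

List<object> source = GetObjects();        // big list #1, fully materialized
List<object> results = source
    .Where(ShouldKeep)                     // streams one element at a time, but...
    .ToList();                             // ...accumulates big list #2 while #1 is still rooted
return results;                            // only now can list #1 be collected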

LINQ is best for atomic operations

OK, I lied. My ShouldKeep() method wouldn’t just take the element as its parameter. The decision of whether any given element should be returned in the final list is a lot more complex and nuanced (and very stateful). It depends not only on the nature of the element itself, but also on features of other elements in the list, elements of other lists, and even (gulp) potentially another database call.

private bool ShouldKeep(List<object> list, object o, int index, DataProvider database, CustomConfiguration cfg, ...)
{
  // Do a LOT of work here...
}

All of these factors bring their own weight of timers and logging and business logic. I will grant that some developer out there could probably break it into a bunch of separate predicates and use some lesser-known LINQ overloads that pass the index along with the element (see the sketch below) to perform all sorts of wonderful voodoo, but that’s not a game I want to play right now. It makes the code harder to read and even harder to set breakpoints in. LINQ excels when its predicates are atomic operations, but if you try to get too fancy, you’re probably going to end up sacrificing something big.
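For reference, the indexed overload is a real part of LINQ: Where has a variant whose predicate also receives each element's zero-based position. A toy sketch, with a made-up predicate:

var kept = GetObjects()
    .Where((element, index) => index > 0 && LooksInteresting(element)) // LooksInteresting is a stand-in predicate
    .ToList();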

IEnumerable is awesome. But only once.

This next point isn’t strictly related to my particular use case since I already mentioned that the database method I’m calling returns a list, but since I’m already on my soapbox, I might as well. Don’t treat IEnumerable<T> as an actual collection. Here’s an example:

public static void Main()
{
  Console.WriteLine("Starting");
  var myNumbers = GetNumbersFromDatabase();
  if (myNumbers.Any())
  {
    Console.WriteLine("Got back " + myNumbers.Count() + " results. Here they are:");
    foreach (var number in myNumbers)
    {
      Console.WriteLine(number);
    }
  }
  else
  {
    Console.WriteLine("Didn't get any results");
  }
  Console.WriteLine("Done");
}

At first glance during an over-the-shoulder code review, this may look fine. Maybe you even know that GetNumbersFromDatabase() hits 3 different tables so it’s kind of expensive, but that’s OK, you’re only hitting it once, so it’s fine, right?

Sure, it’s all fine. Fine until it hits production and your performance slows to a crawl. And your users start complaining that they are seeing “Got back 0 results. Here they are:” instead of “Didn’t get any results.” So, after hours of looking for the bug, you finally add logging to the GetNumbersFromDatabase() method:

private static IEnumerable<int> GetNumbersFromDatabase()
{
  Console.WriteLine("In GetNumbersFromDatabase. Let's hit the database!");
  for (var i = 1; i <= 3; i++)
  {
    Console.WriteLine("Querying table " + i);
    yield return i;
  }
}

And you see the following in the console:

Starting
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
Querying table 2
Querying table 3
Got back 3 results. Here they are:
In GetNumbersFromDatabase. Let's hit the database!
Querying table 1
1
Querying table 2
2
Querying table 3
3
Done

Every time you enumerate something that implements IEnumerable<T> (and Any(), Count(), and the foreach loop each start a fresh enumeration), the source is re-evaluated. If the underlying object is an array or a list, that’s usually fine, since the iteration just walks an object that already exists in memory. But if the implementation is doing something fancy like yield-returning its elements one at a time, you redo whatever work that method does each and every time. And if the underlying data can change between enumerations, you get impossible-looking states like the one above: Any() saw a row, but by the time Count() ran, the rows were gone, hence “Got back 0 results.”

The fix? If you need to know anything about (or do anything to) a bunch of objects beyond iterating through them exactly once, put the elements in a concrete collection. Adding .ToArray() to the end of the call to GetNumbersFromDatabase() fixes it right away.
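Concretely, that one-line fix looks like this:

var myNumbers = GetNumbersFromDatabase().ToArray(); // enumerate once, keep the results

Now the generator runs exactly once, and Any(), Count(), and the foreach all walk the same in-memory array.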

Another tip: I like the var keyword in general. But in this case, var hid the fact that the object coming back from the database call was a lazy iterator rather than a materialized collection. So maybe rethink your use of var in cases like this.
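To illustrate (variable names are mine), spelling the type out makes the laziness visible right at the call site:

IEnumerable<int> myNumbers = GetNumbersFromDatabase();    // "IEnumerable" is a hint: this may be deferred
int[] materialized = GetNumbersFromDatabase().ToArray();  // an array, by contrast, is always in memory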

Back to the use case

So what did I do for my GetObjects() case? Well, ultimately I wanted to do something like this:

List<object> listFromDB = GetObjects();
foreach (var element in listFromDB)
{
  if (!ShouldKeep(element, ...))
  {
    listFromDB.Remove(element);
  }
}
return listFromDB;

But alas, if I try to run that, I get the good old “Collection was modified; enumeration operation may not execute” exception. It’s almost like the framework is begging you to duplicate the collection (which was reason #1 for not using LINQ in the first place).

Ditching LINQ and foreach

So how can we get around this? Well, we already ditched LINQ. Let’s ditch our other C# favorite, foreach. If we access the elements by index, the responsibility of keeping track of iteration state falls on us instead of the runtime, so .NET won’t complain and yell at us when we try to change something:

List<object> listFromDB = GetObjects();
for (var i = 0; i < listFromDB.Count; i++)
{
  if (!ShouldKeep(listFromDB[i], ...))
  {
    listFromDB.RemoveAt(i);
  }
}
return listFromDB;

Let’s try that with a toy example where we filter out numbers between 3 and 7 inclusive:

public static void Main()
{
  var list = Enumerable.Range(1, 10).ToList();
  for (var i = 0; i < list.Count; i++)
  {
    if (list[i] >= 3 && list[i] <= 7)
    {
      list.RemoveAt(i);
    }
  }
  Console.WriteLine(string.Join(",", list));
}

A collection removal gotcha

Let’s look at the output:

1,2,4,6,8,9,10

Whoa, what’s with that 4,6,8 nonsense? Well, like I said, when we access the elements by index, the responsibility of keeping track of iteration state falls on us instead of the runtime. And we’re doing it wrong. When we remove an element at an index, the indices of all the elements past it shift down by one. So we remove the element at i, and the for loop takes us on to element i + 1. But what used to be at i + 1 has dropped down to become i, so we’re now evaluating what used to be at i + 2, skipping the old i + 1 entirely. Whenever you remove at the current index, make sure to decrement the iterating index so you don’t skip anything:

public static void Main()
{
  var list = Enumerable.Range(1, 10).ToList();
  for (var i = 0; i < list.Count; i++)
  {
    if (list[i] >= 3 && list[i] <= 7)
    {
      list.RemoveAt(i);
      i--;
    }
  }
  Console.WriteLine(string.Join(",", list));
}

That’s better:

1,2,8,9,10

Now we have a method that doesn’t create an additional collection; in fact, it will only ever shrink the existing collection.
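One footnote for the simpler cases: List<T> has a built-in RemoveAll(Predicate<T>) method that does this same in-place filtering, and it compacts the list in a single pass, whereas each RemoveAt call shifts the whole tail of the list. If your keep-or-toss logic fits in a predicate, a sketch like this gets you the same single-collection behavior:

List<object> listFromDB = GetObjects();
// RemoveAll mutates the list in place and returns the number of elements removed.
listFromDB.RemoveAll(element => !ShouldKeep(element, ...));
return listFromDB;

In my case the stateful, multi-argument ShouldKeep made the plain for loop the more readable choice, but it’s worth knowing the single-pass option exists.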

So next time you’re faced with filtering a large data set in a non-trivial way, take a hard look at your LINQ and foreach and see if you can’t easily release some memory pressure on your application by getting back to your roots.

This post first appeared on Another Dev Blog