LINQ: Let me re-iterate
In his post Generalized IEnumerable to chunked IEnumerable Steve Gilham shows how an IEnumerable can be chopped into chunks and processed as part of a LINQ query. His example is great because it provides an extension method to chop any IEnumerable
Steve’s post shows how a LINQ implementation detail causes a problem when chunking – and how to address it. So his example is useful in exposing some LINQ implementation details, knowledge of which can be used to explain and solve other LINQ problems. In my case I wanted to skip a few bytes before chunking. But prepending .Skip(n) to the call to Window() in the line above caused a problem with a ‘SkipIterator’. So this post publishes an update to Steve’s code to address the problem skipping (or taking or or filtering or selecting) and tries to explain why the problem occurs.
Problems chunking
You’d think chunking will be easy. For example it’s not obvious there’s a problem with this implementation:
1 2 3 4 5 6 7 8 9 10 | public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunk) { while (true) { var result = source.Take(chunk); if (!result.Any()) yield break; yield return result; } } |
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunk) { while (true) { var result = source.Take(chunk); if (!result.Any()) yield break; yield return result; } }
But there is. It turns out that every time the ‘source’ parameter is asked for it’s iterator (in this case by the use of the Take() extension method) the iterator is reset so the code runs indefinitately. In fact it’s worse. The iterator is created anew each time and then disposed.
Steve Gilham shows how to address this problem by wrapping the source in a class and accessing the IEnumerator interface through the wrapper rather than getting it directly from the source. By wrapping the source and grabbing the source’s enumerator the enumerator is not released so cannot be be disposed of by the garbage collector. As a result, subsequent calls to retrieve the enumerator return the same one which means successive calls return successive chunks as was originally expected. The intuitive (but incorrect) example above can be changed to fix the problem:
1 2 3 4 5 6 7 8 9 10 11 12 | public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk) { var readonce = new EnumerableWrapper<T>(source); while (true) { var result = readonce.Take(chunk).ToArray(); if (!result.Any()) break; yield return result; } } |
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk) { var readonce = new EnumerableWrapper<T>(source); while (true) { var result = readonce.Take(chunk).ToArray(); if (!result.Any()) break; yield return result; } }
This change is almost trivial though it relies on a helper class to implement the wrapper. The wrapper class implements the IEnumerable interface so it can provide the source enumerator but prevent it being disposed of by the garbage collector as soon as it goes out of scope. It achieves this by hanging on to a reference to the source enumerator.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | private class EnumerableWrapper<T> : IEnumerable<T>, IDisposable { IEnumerator<T> ienumerator = null; public EnumerableWrapper(IEnumerable<T> source) { this.ienumerator = source.GetEnumerator(); } ~EnumerableWrapper() { Dispose(); } #region IEnumerable public IEnumerator<T> GetEnumerator() { return ienumerator; } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return ienumerator; } #endregion public void Dispose() { GC.SuppressFinalize(this); if (ienumerator == null) return; ienumerator.Dispose(); } } |
private class EnumerableWrapper<T> : IEnumerable<T>, IDisposable { IEnumerator<T> ienumerator = null; public EnumerableWrapper(IEnumerable<T> source) { this.ienumerator = source.GetEnumerator(); } ~EnumerableWrapper() { Dispose(); } #region IEnumerable public IEnumerator<T> GetEnumerator() { return ienumerator; } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { return ienumerator; } #endregion public void Dispose() { GC.SuppressFinalize(this); if (ienumerator == null) return; ienumerator.Dispose(); } }
Composing chunks
The purpose of using the Chunk extension method is to make chunking composable. For example, in my case I want to skip past a few bytes at the beginning of the file:
Again, you’d think this will work. But what happens is that only the first chunk is returned. So what’s going on? The answer provides some insight as to what’s going on under the LINQ covers.
In the original example, the ‘source’ parameter of the Chunk() extension method is a byte array cast to an enumerable which acts as an iterator source. In this example the ‘source’ parameter is not a byte array but a different type of iterator called a SkipIterator. If the example had used Take() instead of Skip() it would have been a TakeIterator. If a Select() or a Where() method had been used it would be a WhereSelectIterator. The behavior of these classes causes the result to change: that only one chunk is returned. Back to square one.
So why does the behavior change? It seems that the SkipIterator, TakeIterator and WhereSelectIterator are disposed immediately after being used only once and recreated if not already initialized. The problem with being re-created is that there’s no content so the result is always an empty array. In short: you can’t re-iterate this way. Instead, the wrapper has to be extended to implement IEnumerator<T> and prevent the LINQ infrastructure disposing of iterator instances too soon.
Here’s a re-implementation of the EnumerableWrapper class above which resolves the problem. The obvious change is the implementation of IEnumerator
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | private class EnumerableWrapper<T> : IEnumerable<T>, IEnumerator<T> { IEnumerator<T> ienumerator = null; public EnumerableWrapper(IEnumerable<T> source) { this.ienumerator = source.GetEnumerator(); } #region IEnumerable public IEnumerator<T> GetEnumerator() { // The iterator must be this instance so disposing can be controlled return this; } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { // The iterator must be this instance so disposing can be controlled return this; } #endregion #region IEnumerator<T> public T Current { get { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.Current; } } /// <summary> /// This is a custom dispose method used only by the Chunk() method /// </summary> internal void DisposeEnumerator() { GC.SuppressFinalize(this); if (ienumerator == null) return; this.ienumerator.Dispose(); } object System.Collections.IEnumerator.Current { get { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.Current; } } public bool MoveNext() { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.MoveNext(); } public void Reset() { if (this.ienumerator == null) new NotImplementedException(); this.ienumerator.Reset(); } #endregion public void Dispose() { // Do nothing - that's the purpose of this implementation } } |
private class EnumerableWrapper<T> : IEnumerable<T>, IEnumerator<T> { IEnumerator<T> ienumerator = null; public EnumerableWrapper(IEnumerable<T> source) { this.ienumerator = source.GetEnumerator(); } #region IEnumerable public IEnumerator<T> GetEnumerator() { // The iterator must be this instance so disposing can be controlled return this; } System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { // The iterator must be this instance so disposing can be controlled return this; } #endregion #region IEnumerator<T> public T Current { get { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.Current; } } /// <summary> /// This is a custom dispose method used only by the Chunk() method /// </summary> internal void DisposeEnumerator() { GC.SuppressFinalize(this); if (ienumerator == null) return; this.ienumerator.Dispose(); } object System.Collections.IEnumerator.Current { get { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.Current; } } public bool MoveNext() { if (this.ienumerator == null) new NotImplementedException(); return this.ienumerator.MoveNext(); } public void Reset() { if (this.ienumerator == null) new NotImplementedException(); this.ienumerator.Reset(); } #endregion public void Dispose() { // Do nothing - that's the purpose of this implementation } }
As a result of these changes the Chunk() method is changed though only slightly to call the DisposeEnumerator() method:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk) { var readonce = new EnumerableWrapper<T>(source); while (true) { var result = readonce.Take(chunk).ToArray(); if (!result.Any()) { readonce.DisposeEnumerator(); yield break; } yield return result; } } |
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk) { var readonce = new EnumerableWrapper<T>(source); while (true) { var result = readonce.Take(chunk).ToArray(); if (!result.Any()) { readonce.DisposeEnumerator(); yield break; } yield return result; } }