LINQ: Let me re-iterate


In his post Generalized IEnumerable to chunked IEnumerable Steve Gilham shows how an IEnumerable can be chopped into chunks and processed as part of a LINQ query. His example is great because it provides an extension method to chop any IEnumerable (for example a file stream) into chunks which are to be processed independently. In my case each 4K block of a file had to be hashed separately but in another case it might be handy to chop a file being uploaded into 64K chunks.

1
var chunks = myByteArray.Chunk(4096);
var chunks = myByteArray.Chunk(4096);

Steve’s post shows how a LINQ implementation detail causes a problem when chunking – and how to address it. So his example is useful in exposing some LINQ implementation details, knowledge of which can be used to explain and solve other LINQ problems. In my case I wanted to skip a few bytes before chunking. But prepending .Skip(n) to the call to Window() in the line above caused a problem with a ‘SkipIterator’. So this post publishes an update to Steve’s code to address the problem skipping (or taking or or filtering or selecting) and tries to explain why the problem occurs.

Problems chunking

You’d think chunking will be easy. For example it’s not obvious there’s a problem with this implementation:

1
2
3
4
5
6
7
8
9
10
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunk)
{
    while (true)
    {
        var result = source.Take(chunk);
 
        if (!result.Any()) yield break;
        yield return result;
    }
}
public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> source, int chunk)
{
	while (true)
	{
		var result = source.Take(chunk);

		if (!result.Any()) yield break;
		yield return result;
	}
}

But there is. It turns out that every time the ‘source’ parameter is asked for it’s iterator (in this case by the use of the Take() extension method) the iterator is reset so the code runs indefinitately. In fact it’s worse. The iterator is created anew each time and then disposed.

Steve Gilham shows how to address this problem by wrapping the source in a class and accessing the IEnumerator interface through the wrapper rather than getting it directly from the source. By wrapping the source and grabbing the source’s enumerator the enumerator is not released so cannot be be disposed of by the garbage collector. As a result, subsequent calls to retrieve the enumerator return the same one which means successive calls return successive chunks as was originally expected. The intuitive (but incorrect) example above can be changed to fix the problem:

1
2
3
4
5
6
7
8
9
10
11
12
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk)
{
    var readonce = new EnumerableWrapper<T>(source);
 
    while (true)
    {
        var result = readonce.Take(chunk).ToArray();
 
        if (!result.Any()) break;
        yield return result;
    }
}
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk)
{
	var readonce = new EnumerableWrapper<T>(source);

	while (true)
	{
		var result = readonce.Take(chunk).ToArray();

		if (!result.Any()) break;
		yield return result;
	}
}

This change is almost trivial though it relies on a helper class to implement the wrapper. The wrapper class implements the IEnumerable interface so it can provide the source enumerator but prevent it being disposed of by the garbage collector as soon as it goes out of scope. It achieves this by hanging on to a reference to the source enumerator.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
private class EnumerableWrapper<T> : IEnumerable<T>, IDisposable
{
    IEnumerator<T> ienumerator = null;
 
    public EnumerableWrapper(IEnumerable<T> source)
    {
        this.ienumerator = source.GetEnumerator();
    }
 
    ~EnumerableWrapper()
    {
        Dispose();
    }
 
    #region IEnumerable
 
    public IEnumerator<T> GetEnumerator()
    {
        return ienumerator;
    }
 
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return ienumerator;
    }
 
    #endregion
 
    public void Dispose()
    {
        GC.SuppressFinalize(this);
        if (ienumerator == null) return;
        ienumerator.Dispose();
    }
}
private class EnumerableWrapper<T> : IEnumerable<T>, IDisposable
{
	IEnumerator<T> ienumerator = null;

	public EnumerableWrapper(IEnumerable<T> source)
	{
		this.ienumerator = source.GetEnumerator();
	}

	~EnumerableWrapper()
	{
		Dispose();
	}

	#region IEnumerable

	public IEnumerator<T> GetEnumerator()
	{
		return ienumerator;
	}

	System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
	{
		return ienumerator;
	}

	#endregion

	public void Dispose()
	{
		GC.SuppressFinalize(this);
		if (ienumerator == null) return;
		ienumerator.Dispose();
	}
}

Composing chunks

The purpose of using the Chunk extension method is to make chunking composable. For example, in my case I want to skip past a few bytes at the beginning of the file:

1
var chunks = myByteArray.Skip(8).Chunk(4096);
var chunks = myByteArray.Skip(8).Chunk(4096);

Again, you’d think this will work. But what happens is that only the first chunk is returned. So what’s going on? The answer provides some insight as to what’s going on under the LINQ covers.

In the original example, the ‘source’ parameter of the Chunk() extension method is a byte array cast to an enumerable which acts as an iterator source. In this example the ‘source’ parameter is not a byte array but a different type of iterator called a SkipIterator. If the example had used Take() instead of Skip() it would have been a TakeIterator. If a Select() or a Where() method had been used it would be a WhereSelectIterator. The behavior of these classes causes the result to change: that only one chunk is returned. Back to square one.

So why does the behavior change? It seems that the SkipIterator, TakeIterator and WhereSelectIterator are disposed immediately after being used only once and recreated if not already initialized. The problem with being re-created is that there’s no content so the result is always an empty array. In short: you can’t re-iterate this way. Instead, the wrapper has to be extended to implement IEnumerator<T> and prevent the LINQ infrastructure disposing of iterator instances too soon.

Here’s a re-implementation of the EnumerableWrapper class above which resolves the problem. The obvious change is the implementation of IEnumerator which returns the EnumerableWrapper instance as the enumerator not the underlying Skip/Take/Select/Where iterator’s enumerator. This means the caller cannot dispose of the underlying iterator. However the EnumerableWrapper’s IEnumerator implementation delegates to the underlying iterator. The more subtle change is that the Dispose() method does nothing and that the Chunk() method is responsible for calling the new custom dispose method DisposeEnumerator().

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
private class EnumerableWrapper<T> : IEnumerable<T>, IEnumerator<T>
{
    IEnumerator<T> ienumerator = null;
 
    public EnumerableWrapper(IEnumerable<T> source)
    {
        this.ienumerator = source.GetEnumerator();
    }
 
    #region IEnumerable
 
    public IEnumerator<T> GetEnumerator()
    {
        // The iterator must be this instance so disposing can be controlled
        return this;
    }
 
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        // The iterator must be this instance so disposing can be controlled
        return this;
    }
 
    #endregion
 
    #region IEnumerator<T>
 
    public T Current
    {
        get
        {
            if (this.ienumerator == null) new NotImplementedException();
            return this.ienumerator.Current;
        }
    }
 
    /// <summary>
    /// This is a custom dispose method used only by the Chunk() method
    /// </summary>
    internal void DisposeEnumerator()
    {
        GC.SuppressFinalize(this);
        if (ienumerator == null) return;
        this.ienumerator.Dispose();
    }
 
    object System.Collections.IEnumerator.Current
    {
        get
        {
            if (this.ienumerator == null) new NotImplementedException();
            return this.ienumerator.Current;
        }
    }
 
    public bool MoveNext()
    {
        if (this.ienumerator == null) new NotImplementedException();
        return this.ienumerator.MoveNext();
    }
 
    public void Reset()
    {
        if (this.ienumerator == null) new NotImplementedException();
        this.ienumerator.Reset();
    }
 
    #endregion
 
    public void Dispose()
    {
        // Do nothing - that's the purpose of this implementation
    }
}
private class EnumerableWrapper<T> : IEnumerable<T>, IEnumerator<T>
{
	IEnumerator<T> ienumerator = null;

	public EnumerableWrapper(IEnumerable<T> source)
	{
		this.ienumerator = source.GetEnumerator();
	}

	#region IEnumerable

	public IEnumerator<T> GetEnumerator()
	{
		// The iterator must be this instance so disposing can be controlled
		return this;
	}

	System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
	{
		// The iterator must be this instance so disposing can be controlled
		return this;
	}

	#endregion

	#region IEnumerator<T>

	public T Current
	{
		get
		{
			if (this.ienumerator == null) new NotImplementedException();
			return this.ienumerator.Current;
		}
	}

	/// <summary>
	/// This is a custom dispose method used only by the Chunk() method
	/// </summary>
	internal void DisposeEnumerator()
	{
		GC.SuppressFinalize(this);
		if (ienumerator == null) return;
		this.ienumerator.Dispose();
	}

	object System.Collections.IEnumerator.Current
	{
		get
		{
			if (this.ienumerator == null) new NotImplementedException();
			return this.ienumerator.Current;
		}
	}

	public bool MoveNext()
	{
		if (this.ienumerator == null) new NotImplementedException();
		return this.ienumerator.MoveNext();
	}

	public void Reset()
	{
		if (this.ienumerator == null) new NotImplementedException();
		this.ienumerator.Reset();
	}

	#endregion

	public void Dispose()
	{
		// Do nothing - that's the purpose of this implementation
	}
}

As a result of these changes the Chunk() method is changed though only slightly to call the DisposeEnumerator() method:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk)
{
    var readonce = new EnumerableWrapper<T>(source);
 
    while (true)
    {
        var result = readonce.Take(chunk).ToArray();
 
        if (!result.Any())
        {
            readonce.DisposeEnumerator();
            yield break;
        }
        yield return result;
    }
}
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int chunk)
{
	var readonce = new EnumerableWrapper<T>(source);

	while (true)
	{
		var result = readonce.Take(chunk).ToArray();

		if (!result.Any())
		{
			readonce.DisposeEnumerator();
			yield break;
		}
		yield return result;
	}
}

Information and Links

Join the fray by commenting, tracking what others have to say, or linking to it from your blog.


Other Posts

Write a Comment

Take a moment to comment and tell us what you think. Some basic HTML is allowed for formatting.

Reader Comments

Be the first to leave a comment!