Batch Processing with Directory.EnumerateFiles

28 March 2021 · CPOL · 3 min read
Batching is a nice technique that allows you to handle big amounts of data gracefully. Directory.EnumerateFiles is the API that allows you to organize batch processing for the directory with a large number of files.

Introduction

If you want to retrieve the files in a directory, Directory.GetFiles is a simple answer sufficient for most scenarios. However, when you deal with a large amount of data, you might need more advanced techniques.

Example

Let’s assume you have a big-data solution and you need to process a directory that contains 200,000 files, extracting some basic information from each file.

C#
public record FileProcessingDto
{
    public string FullPath { get; init; }
    public long Size { get; init; }
    public string FileNameWithoutExtension { get; init; }
    public string Hash { get; internal init; }
}

Note that we use the C# 9 record type for our DTO here, with init-only setters since the DTO never changes after it is created.
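Value-based equality and non-destructive mutation are the main things records buy us for DTOs. A minimal standalone sketch (FileStat is a hypothetical type for illustration, not part of the article’s code):

```csharp
using System;

public record FileStat
{
    public string FullPath { get; init; }
    public long Size { get; init; }
}

public static class RecordDemo
{
    public static void Main()
    {
        var a = new FileStat { FullPath = "a.txt", Size = 10 };
        var b = new FileStat { FullPath = "a.txt", Size = 10 };

        // Records compare by value, not by reference.
        Console.WriteLine(a == b); // True

        // `with` produces a modified copy without mutating the original.
        var bigger = a with { Size = 20 };
        Console.WriteLine(bigger.Size); // 20
        Console.WriteLine(a.Size);      // 10
    }
}
```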

After that, we send extracted info for further processing. Let’s emulate it with the following snippet.

C#
public class FileProcessingService
{
    public Task Process(IReadOnlyCollection<FileProcessingDto> files,
                        CancellationToken cancellationToken = default)
    {
        // A lazy Select would never execute its side effects,
        // so we enumerate with foreach instead.
        foreach (var p in files)
        {
            Console.WriteLine(
                $"Processing {p.FileNameWithoutExtension} located at {p.FullPath} of size {p.Size} bytes");
        }

        return Task.Delay(TimeSpan.FromMilliseconds(20), cancellationToken);
    }
}

Now the final piece is extracting info and calling the service.

C#
public class Worker
{
    public const string Path = @"path to 200k files";
    private readonly FileProcessingService _processingService;

    public Worker()
    {
        _processingService = new FileProcessingService();
    }

    private string CalculateHash(string file)
    {
        using (var md5Instance = MD5.Create())
        {
            using (var stream = File.OpenRead(file))
            {
                var hashResult = md5Instance.ComputeHash(stream);
                return BitConverter.ToString(hashResult)
                    .Replace("-", "", StringComparison.OrdinalIgnoreCase)
                    .ToLowerInvariant();
            }
        }
    }

    private FileProcessingDto MapToDto(string file)
    {
        var fileInfo = new FileInfo(file);
        return new FileProcessingDto()
        {
            FullPath = file,
            Size = fileInfo.Length,
            // fileInfo.Name includes the extension; System.IO.Path is fully
            // qualified because the Path constant above shadows it.
            FileNameWithoutExtension = System.IO.Path.GetFileNameWithoutExtension(file),
            Hash = CalculateHash(file)
        };
    }

    public Task DoWork()
    {
        var files = Directory.GetFiles(Path)
            .Select(p => MapToDto(p))
            .ToList();

        return _processingService.Process(files);
    }
}

Note that here, we act in a naive fashion and extract all files via Directory.GetFiles(Path) in one go.

However, once you run this code via:

C#
await new Worker().DoWork();

you’ll notice that results are far from satisfying and the application is consuming memory extensively.

Directory.EnumerateFiles to the Rescue

The thing with Directory.EnumerateFiles is that it returns IEnumerable<string>, allowing us to fetch collection items one by one. This, in turn, avoids the excessive memory use of loading the whole file listing at once.
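The effect is easy to demonstrate with a plain iterator (EnumerateItems below is a hypothetical stand-in for Directory.EnumerateFiles, so the sketch runs without a real directory):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Stand-in for Directory.EnumerateFiles: yields items one at a time.
    public static IEnumerable<string> EnumerateItems(int count)
    {
        for (var i = 0; i < count; i++)
        {
            // Each item is produced only when the consumer asks for it,
            // so the full set never has to sit in memory at once.
            yield return $"file_{i}.txt";
        }
    }

    public static void Main()
    {
        // Take(3) pulls exactly three items; the remaining 199,997
        // are never materialized.
        var firstThree = LazyDemo.EnumerateItems(200_000).Take(3).ToList();
        Console.WriteLine(string.Join(", ", firstThree));
        // file_0.txt, file_1.txt, file_2.txt
    }
}
```

An array-returning API like Directory.GetFiles has to build the entire result before returning; an iterator hands back control after every element.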

Still, as you may have noticed, FileProcessingService.Process has a delay coded into it (a stand-in for the I/O operation we emulate). In a real-world scenario, this might be a call to an external HTTP endpoint or to storage. This brings us to the conclusion that calling FileProcessingService.Process 200,000 times might be inefficient: at 20 ms of overhead per call, 200,000 individual calls would spend over an hour on overhead alone, while 20 batches of 10,000 items spend well under a second. That’s why we’re going to load reasonable batches of data into memory at once.

The reworked code looks as follows:

C#
public class WorkerImproved
{
    //omitted for brevity

    public async Task DoWork()
    {
        const int batchSize = 10000;
        var files = Directory.EnumerateFiles(Path);
        var count = 0;
        var filesToProcess = new List<FileProcessingDto>(batchSize);

        foreach (var file in files)
        {
            count++;
            filesToProcess.Add(MapToDto(file));
            if (count == batchSize)
            {
                await _processingService.Process(filesToProcess);
                count = 0;
                filesToProcess.Clear();
            }

        }
        if (filesToProcess.Any())
        {
            await _processingService.Process(filesToProcess);
        }
    }
}

Here, we enumerate the collection with foreach, and once we reach the batch size, we process the batch and flush the collection. The only subtle point is calling the service one last time after exiting the loop, in order to flush the remaining items.
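This batch-and-flush logic can also be factored into a reusable extension method (a sketch; on .NET 6+ the built-in Enumerable.Chunk does the same job):

```csharp
using System;
using System.Collections.Generic;

public static class BatchingExtensions
{
    // Yields the source in lists of at most batchSize items,
    // materializing only one batch at a time.
    public static IEnumerable<IReadOnlyCollection<T>> Batch<T>(
        this IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                // Start a fresh list rather than Clear(), in case the
                // consumer holds on to the yielded batch.
                batch = new List<T>(batchSize);
            }
        }

        // Flush the final, partially filled batch.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

With such a helper, DoWork collapses to a foreach over Directory.EnumerateFiles(Path) piped through Batch, with a single await inside the loop.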

Evaluation

Results produced by BenchmarkDotNet are pretty convincing: the batched version avoids the heavy memory consumption of the naive approach. (The benchmark results table is omitted here.)

Few Words on Batch Processing

In this article, we took a glance at a common pattern in software engineering. Batches of reasonable size help us beat both the I/O penalty of working in an item-by-item fashion and the excessive memory consumption of loading all items into memory at once.

As a rule, you should strive to use batch APIs when performing I/O operations on multiple items. And once the number of items becomes large, you should think about splitting those items into batches.

Few Words on Return Types

Quite often when dealing with codebases, I see code similar to the following:

C#
public IEnumerable<int> Numbers => new List<int> { 1, 2, 3 };

I would argue that this code violates Postel’s principle: as a consumer of the property, I can’t tell whether the items can be enumerated one by one or are already loaded into memory at once.

This is why I suggest being more specific about the return type, e.g.:

C#
public IList<int> Numbers => new List<int> { 1, 2, 3 };
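To see why the IEnumerable<int> version is ambiguous, compare it with a lazy sequence that re-runs its pipeline on every enumeration (a contrived example for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReturnTypeDemo
{
    static int _evaluations;

    // From the outside, this looks just like the eager List-backed
    // property, but the pipeline executes on every enumeration.
    public static IEnumerable<int> LazyNumbers =>
        Enumerable.Range(1, 3).Select(n => { _evaluations++; return n; });

    public static void Main()
    {
        var numbers = LazyNumbers;

        // Each aggregate walks the sequence from scratch.
        var sum = numbers.Sum();
        var max = numbers.Max();

        Console.WriteLine(_evaluations); // 6, not 3: the pipeline ran twice
    }
}
```

An IList<int> or IReadOnlyList<int> return type tells the caller the items are already materialized and safe to enumerate repeatedly.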

Conclusion

Batching is a nice technique that allows you to handle big amounts of data gracefully. Directory.EnumerateFiles is the API that allows you to organize batch processing for the directory with a large number of files.

History

  • 28th March, 2021: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Bohdan Stupak
Software Developer
Ukraine Ukraine
https://twitter.com/BohdanStupak1
