Batch Processing with Directory.EnumerateFiles

28 March 2021 · CPOL · 3 min read
Batching is a nice technique that allows you to handle big amounts of data gracefully. Directory.EnumerateFiles is the API that allows you to organize batch processing for the directory with a large number of files.

Introduction

If you want to retrieve the files in a directory, Directory.GetFiles is a simple answer sufficient for most scenarios. However, when you deal with a large amount of data, you might need more advanced techniques.

Example

Let’s assume you have a big-data solution and you need to process a directory that contains 200,000 files, extracting some basic information from each file.

C#
public record FileProcessingDto
{
    public string FullPath { get; init; }
    public long Size { get; init; }
    public string FileNameWithoutExtension { get; init; }
    public string Hash { get; internal init; }
}

Note that we use the C# 9 record type for our DTO here, with init-only setters since the DTO never changes after it is created.
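Value-based equality and non-destructive mutation are the main things records buy us for DTOs. A minimal standalone sketch (FileStat is a hypothetical type for illustration, not part of the article’s code):

```csharp
using System;

public record FileStat
{
    public string FullPath { get; init; }
    public long Size { get; init; }
}

public static class RecordDemo
{
    public static void Main()
    {
        var a = new FileStat { FullPath = "a.txt", Size = 10 };
        var b = new FileStat { FullPath = "a.txt", Size = 10 };

        // Records compare by value, not by reference.
        Console.WriteLine(a == b); // True

        // `with` produces a modified copy without mutating the original.
        var bigger = a with { Size = 20 };
        Console.WriteLine(bigger.Size); // 20
        Console.WriteLine(a.Size);      // 10
    }
}
```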

After that, we send extracted info for further processing. Let’s emulate it with the following snippet.

C#
public class FileProcessingService
{
    public Task Process(IReadOnlyCollection<FileProcessingDto> files,
                        CancellationToken cancellationToken = default)
    {
        // A lazy Select would never execute its side effects,
        // so we enumerate with foreach instead.
        foreach (var p in files)
        {
            Console.WriteLine(
                $"Processing {p.FileNameWithoutExtension} located at {p.FullPath} of size {p.Size} bytes");
        }

        return Task.Delay(TimeSpan.FromMilliseconds(20), cancellationToken);
    }
}

Now the final piece is extracting info and calling the service.

C#
public class Worker
{
    public const string Path = @"path to 200k files";
    private readonly FileProcessingService _processingService;

    public Worker()
    {
        _processingService = new FileProcessingService();
    }

    private string CalculateHash(string file)
    {
        using (var md5Instance = MD5.Create())
        {
            using (var stream = File.OpenRead(file))
            {
                var hashResult = md5Instance.ComputeHash(stream);
                return BitConverter.ToString(hashResult)
                    .Replace("-", "", StringComparison.OrdinalIgnoreCase)
                    .ToLowerInvariant();
            }
        }
    }

    private FileProcessingDto MapToDto(string file)
    {
        var fileInfo = new FileInfo(file);
        return new FileProcessingDto()
        {
            FullPath = file,
            Size = fileInfo.Length,
            // fileInfo.Name includes the extension; System.IO.Path is fully
            // qualified because the Path constant above shadows it.
            FileNameWithoutExtension = System.IO.Path.GetFileNameWithoutExtension(file),
            Hash = CalculateHash(file)
        };
    }

    public Task DoWork()
    {
        var files = Directory.GetFiles(Path)
            .Select(p => MapToDto(p))
            .ToList();

        return _processingService.Process(files);
    }
}

Note that here, we act in a naive fashion and extract all files via Directory.GetFiles(Path) in one go.

However, once you run this code via:

C#
await new Worker().DoWork();

you’ll notice that results are far from satisfying and the application is consuming memory extensively.

Directory.EnumerateFiles to the Rescue

The thing with Directory.EnumerateFiles is that it returns IEnumerable<string>, allowing us to fetch collection items one by one. This, in turn, avoids the excessive memory use of loading the whole file listing at once.
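The effect is easy to demonstrate with a plain iterator (EnumerateItems below is a hypothetical stand-in for Directory.EnumerateFiles, so the sketch runs without a real directory):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Stand-in for Directory.EnumerateFiles: yields items one at a time.
    public static IEnumerable<string> EnumerateItems(int count)
    {
        for (var i = 0; i < count; i++)
        {
            // Each item is produced only when the consumer asks for it,
            // so the full set never has to sit in memory at once.
            yield return $"file_{i}.txt";
        }
    }

    public static void Main()
    {
        // Take(3) pulls exactly three items; the remaining 199,997
        // are never materialized.
        var firstThree = LazyDemo.EnumerateItems(200_000).Take(3).ToList();
        Console.WriteLine(string.Join(", ", firstThree));
        // file_0.txt, file_1.txt, file_2.txt
    }
}
```

An array-returning API like Directory.GetFiles has to build the entire result before returning; an iterator hands back control after every element.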

Still, as you may have noticed, FileProcessingService.Process has a delay coded into it (a stand-in for the I/O operation we emulate). In a real-world scenario, this might be a call to an external HTTP endpoint or to storage. This brings us to the conclusion that calling FileProcessingService.Process 200,000 times might be inefficient: at 20 ms of overhead per call, 200,000 individual calls would spend over an hour on overhead alone, while 20 batches of 10,000 items spend well under a second. That’s why we’re going to load reasonable batches of data into memory at once.

The reworked code looks as follows:

C#
public class WorkerImproved
{
    //omitted for brevity

    public async Task DoWork()
    {
        const int batchSize = 10000;
        var files = Directory.EnumerateFiles(Path);
        var count = 0;
        var filesToProcess = new List<FileProcessingDto>(batchSize);

        foreach (var file in files)
        {
            count++;
            filesToProcess.Add(MapToDto(file));
            if (count == batchSize)
            {
                await _processingService.Process(filesToProcess);
                count = 0;
                filesToProcess.Clear();
            }

        }
        if (filesToProcess.Any())
        {
            await _processingService.Process(filesToProcess);
        }
    }
}

Here, we enumerate the collection with foreach, and once we reach the batch size, we process the batch and flush the collection. The only subtle point is calling the service one last time after exiting the loop, in order to flush the remaining items.
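This batch-and-flush logic can also be factored into a reusable extension method (a sketch; on .NET 6+ the built-in Enumerable.Chunk does the same job):

```csharp
using System;
using System.Collections.Generic;

public static class BatchingExtensions
{
    // Yields the source in lists of at most batchSize items,
    // materializing only one batch at a time.
    public static IEnumerable<IReadOnlyCollection<T>> Batch<T>(
        this IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                // Start a fresh list rather than Clear(), in case the
                // consumer holds on to the yielded batch.
                batch = new List<T>(batchSize);
            }
        }

        // Flush the final, partially filled batch.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

With such a helper, DoWork collapses to a foreach over Directory.EnumerateFiles(Path) piped through Batch, with a single await inside the loop.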

Evaluation

Results produced by BenchmarkDotNet are pretty convincing: the batched version avoids the heavy memory consumption of the naive approach. (The benchmark results table is omitted here.)

Few Words on Batch Processing

In this article, we took a glance at a common pattern in software engineering. Batches of reasonable size help us beat both the I/O penalty of working in an item-by-item fashion and the excessive memory consumption of loading all items into memory at once.

As a rule, you should strive to use batch APIs when performing I/O operations on multiple items. And once the number of items becomes large, you should think about splitting those items into batches.

Few Words on Return Types

Quite often when dealing with codebases, I see code similar to the following:

C#
public IEnumerable<int> Numbers => new List<int> { 1, 2, 3 };

I would argue that this code violates Postel’s principle: as a consumer of the property, I can’t tell whether the items can be enumerated one by one or are already loaded into memory at once.

This is why I suggest being more specific about the return type, e.g.:

C#
public IList<int> Numbers => new List<int> { 1, 2, 3 };
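To see why the IEnumerable<int> version is ambiguous, compare it with a lazy sequence that re-runs its pipeline on every enumeration (a contrived example for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReturnTypeDemo
{
    static int _evaluations;

    // From the outside, this looks just like the eager List-backed
    // property, but the pipeline executes on every enumeration.
    public static IEnumerable<int> LazyNumbers =>
        Enumerable.Range(1, 3).Select(n => { _evaluations++; return n; });

    public static void Main()
    {
        var numbers = LazyNumbers;

        // Each aggregate walks the sequence from scratch.
        var sum = numbers.Sum();
        var max = numbers.Max();

        Console.WriteLine(_evaluations); // 6, not 3: the pipeline ran twice
    }
}
```

An IList<int> or IReadOnlyList<int> return type tells the caller the items are already materialized and safe to enumerate repeatedly.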

Conclusion

Batching is a nice technique that allows you to handle big amounts of data gracefully. Directory.EnumerateFiles is the API that allows you to organize batch processing for the directory with a large number of files.

History

  • 28th March, 2021: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Bohdan Stupak
Software Developer
Ukraine Ukraine
https://twitter.com/BohdanStupak1
