A small Content Detection Library

1 May 2007 | CPOL | 3 min read
Introducing a library to detect content based on file content (and extension)

Screenshot - ContentDetectorLib_01.png

Introduction

In my recently published article about the Zeta Uploader application (in short, a website for uploading files and sending e-mail messages with links to the uploaded files), a discussion came up (thanks to Phil.Benson) about the need to monitor the uploaded files in order to avoid copyright infringement.

This article introduces a library that I wrote last evening and this morning (so it is "really" fresh) as a first step in the right direction.

What the library does

Since I wanted to avoid (at least for now) forcing users of the Zeta Uploader to register and log in before they can use the service, I decided to try a different approach:

After a file is uploaded, it is checked to see whether it is "prohibited", meaning that it may not be shared through Zeta Uploader. Currently, I treat movie files (AVI, MOV, etc.) and music files (MP3, WAV, etc.) as prohibited.

How the library works

The detection algorithm uses the following mechanisms to test a file for being prohibited or allowed:

  • File extension

    Look at the file extension. If it matches a given extension on the prohibited list, the file is considered "prohibited".

  • File content

    Look inside the first few bytes of the file for known binary patterns ("magic bytes") that match a list of prohibited patterns.

  • Archive extraction

    If the file is detected to be an archive, it is temporarily extracted and the extracted files are scanned as well (recursively, if they contain archives themselves).
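The order of the three steps above can be sketched roughly as follows. This is a minimal illustration and not the library's actual engine: the class name `DetectionFlowSketch` and the four delegates are hypothetical placeholders for the individual checks.

```csharp
using System;
using System.IO;

// Hypothetical sketch of the overall flow; not the library's real engine.
public sealed class DetectionFlowSketch
{
    private readonly Func<FileInfo, bool> _hasProhibitedExtension;
    private readonly Func<FileInfo, bool> _hasProhibitedContent;
    private readonly Func<FileInfo, bool> _isArchive;
    private readonly Action<FileInfo, DirectoryInfo> _extract;

    public DetectionFlowSketch(
        Func<FileInfo, bool> hasProhibitedExtension,
        Func<FileInfo, bool> hasProhibitedContent,
        Func<FileInfo, bool> isArchive,
        Action<FileInfo, DirectoryInfo> extract)
    {
        _hasProhibitedExtension = hasProhibitedExtension;
        _hasProhibitedContent = hasProhibitedContent;
        _isArchive = isArchive;
        _extract = extract;
    }

    public bool IsProhibited(FileInfo file)
    {
        // 1. Quick check: the file extension.
        if (_hasProhibitedExtension(file)) return true;

        // 2. Content check: magic bytes at the start of the file.
        if (_hasProhibitedContent(file)) return true;

        // 3. Archives are extracted into a temporary folder and
        //    every extracted file is checked recursively.
        if (!_isArchive(file)) return false;

        DirectoryInfo tempFolder = new DirectoryInfo(
            Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N")));
        tempFolder.Create();
        try
        {
            _extract(file, tempFolder);
            foreach (FileInfo inner in
                tempFolder.GetFiles("*", SearchOption.AllDirectories))
            {
                if (IsProhibited(inner)) return true;
            }
            return false;
        }
        finally
        {
            // Always remove the temporary extraction folder again.
            tempFolder.Delete(true);
        }
    }
}
```

Passing the checks in as delegates keeps the sketch independent of how each individual check is implemented.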

The next sections briefly discuss these different mechanisms.

File extension checking

This check looks only at the extension of the file name. Since an extension is easy to fake, it serves as a first quick check only. If the extension matches, the file is considered prohibited and detection for that file is finished.

If not, a content analysis is done, as described next.
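A minimal sketch of such an extension check could look like this. The class and member names are made up for illustration, and the list contains only the four example extensions mentioned earlier; the library's real list may well differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch; names and list contents are assumptions.
public static class ExtensionCheckSketch
{
    // Case-insensitive, so ".MP3" and ".mp3" are treated alike.
    private static readonly HashSet<string> ProhibitedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".avi", ".mov", ".mp3", ".wav"
        };

    public static bool HasProhibitedExtension(FileInfo file)
    {
        // FileInfo.Extension includes the leading dot and is
        // an empty string when the name has no extension.
        return ProhibitedExtensions.Contains(file.Extension);
    }
}
```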

Content analysis

The main work of the library is to apply simple "pattern matching" to the content of a file. Through an extensible ISignatureChecker interface, more complex tests can be added later. I've included a simple check for MP3s that does a little bit more than just pattern matching (class Mp3SignatureChecker).

The ISignatureChecker interface is defined as follows:

C#
/// <summary>
/// Interface to implement when checking a buffer
/// for a certain signature.
/// </summary>
internal interface ISignatureChecker
{
    /// <summary>
    /// Check whether a given buffer matches the signature.
    /// </summary>
    /// <param name="buffer">The buffer.</param>
    /// <returns><c>true</c> if the buffer matches the signature;
    /// otherwise <c>false</c>.</returns>
    bool MatchesSignature(
        byte[] buffer );

    /// <summary>
    /// Gets the first number of bytes to read.
    /// </summary>
    /// <value>The first number of bytes to read.</value>
    int FirstNumberOfBytesToRead
    {
        get;
    }

    /// <summary>
    /// Gets the minimum length of the required buffer.
    /// </summary>
    /// <value>The minimum length of the required buffer.</value>
    int MinimumRequiredBufferLength
    {
        get;
    }
}

Through this interface, the check engine communicates with the individual signature checkers. See the source files for details and examples.
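To illustrate how a checker might plug into this interface, here is a small sketch for RIFF-based files (AVI and WAV), which start with the ASCII bytes "RIFF", followed by a four-byte size and a four-character form type such as "AVI " or "WAVE". The class `RiffSignatureChecker` is hypothetical and not part of the download. The interface is repeated without comments only so that the example compiles on its own, and treating both properties as "12 bytes" is my assumption about what they are meant to express.

```csharp
using System;
using System.Text;

// Repeated here (without comments) so the sketch compiles on its own.
internal interface ISignatureChecker
{
    bool MatchesSignature(byte[] buffer);
    int FirstNumberOfBytesToRead { get; }
    int MinimumRequiredBufferLength { get; }
}

// Hypothetical checker for RIFF containers; not from the library.
internal sealed class RiffSignatureChecker : ISignatureChecker
{
    private static readonly byte[] Riff = { 0x52, 0x49, 0x46, 0x46 }; // "RIFF"
    private readonly byte[] _formType;

    // formType is the four-character code at offset 8,
    // for example "AVI " (with trailing space) or "WAVE".
    public RiffSignatureChecker(string formType)
    {
        if (formType == null || formType.Length != 4)
            throw new ArgumentException("Form type must have 4 characters.");
        _formType = Encoding.ASCII.GetBytes(formType);
    }

    // Assumption: both values describe the 12-byte RIFF header.
    public int FirstNumberOfBytesToRead { get { return 12; } }
    public int MinimumRequiredBufferLength { get { return 12; } }

    public bool MatchesSignature(byte[] buffer)
    {
        if (buffer == null || buffer.Length < MinimumRequiredBufferLength)
            return false;

        return MatchesAt(buffer, 0, Riff) && MatchesAt(buffer, 8, _formType);
    }

    // Compares the pattern against the buffer starting at the offset.
    private static bool MatchesAt(byte[] buffer, int offset, byte[] pattern)
    {
        for (int i = 0; i < pattern.Length; i++)
        {
            if (buffer[offset + i] != pattern[i]) return false;
        }
        return true;
    }
}
```

An engine could then read `FirstNumberOfBytesToRead` bytes from the start of a file and pass them to `MatchesSignature`.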

Archive extraction

Since many uploaded files are compressed archives, it is important to extract and scan these, too.

Again, I've built an extensible mini-framework based on the IArchiveExtractor interface to allow for adding more archive extractors in the future.

The interface is defined as follows:

C#
/// <summary>
/// Interface for archive extractors.
/// </summary>
internal interface IArchiveExtractor
{
    /// <summary>
    /// Extracts the specified file path.
    /// </summary>
    /// <param name="filePath">The file path.</param>
    /// <param name="folderPathToExtractInto">The folder path
    /// to extract into.</param>
    void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto );
}

Currently, I am using SharpZipLib to provide extractors for ZIP, gzip and bzip2.
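As an illustration of what an extractor behind this interface might look like, the following sketch handles gzip files. Note that it deliberately uses the framework's own `GZipStream` instead of SharpZipLib, so that it runs without an external reference; it is therefore not the extractor from the download. The class name `GZipExtractorSketch` is hypothetical, and the interface is repeated only so that the example compiles on its own.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Repeated here (without comments) so the sketch compiles on its own.
internal interface IArchiveExtractor
{
    void Extract(FileInfo filePath, DirectoryInfo folderPathToExtractInto);
}

// Hypothetical gzip extractor based on System.IO.Compression,
// not on SharpZipLib as used by the actual library.
internal sealed class GZipExtractorSketch : IArchiveExtractor
{
    public void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto)
    {
        folderPathToExtractInto.Create();

        // A gzip file wraps a single stream; name the output after
        // the archive without its last extension ("a.txt.gz" -> "a.txt").
        string outputName = Path.GetFileNameWithoutExtension(filePath.Name);
        if (outputName.Length == 0) outputName = "extracted";

        string outputPath =
            Path.Combine(folderPathToExtractInto.FullName, outputName);

        using (FileStream input = filePath.OpenRead())
        using (GZipStream gzip =
            new GZipStream(input, CompressionMode.Decompress))
        using (FileStream output = File.Create(outputPath))
        {
            gzip.CopyTo(output);
        }
    }
}
```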

Test application

There is no test application in the download. Instead, the following code snippet is the complete Main function of my own console test application.

C#
/// <summary>
/// The main function.
/// </summary>
private static void Main()
{
    // Instantiate the engine.
    ContentDetectorEngine engine = new ContentDetectorEngine();

    // --
    // Testing discrete files.

    // Collect some files to test.
    FileInfo[] filePaths = new FileInfo[]
    {
        new FileInfo( @"c:\AnotherFolder\112431940.mp3" ),
        new FileInfo( @"c:\AnotherFolder\247293565.txt" ),
        new FileInfo( @"c:\AnotherFolder\008284502.zip" ),
        new FileInfo( @"c:\AnotherFolder\190243241.mdb" ),
        new FileInfo( @"c:\AnotherFolder\182944456.zip" ),
    };

    // Iterate over the files.
    foreach ( FileInfo filePath in filePaths )
    {
        bool contains =
            engine.ContainsFileProhibitedContent( filePath );
        Console.WriteLine(
            @"Contains '{0}': {1}.",
            filePath.Name,
            contains );
    }

    // --
    // Testing a complete folder.

    // Find all prohibited files in the given folder.
    FileInfo[] prohibitedPaths =
        engine.ContainsFolderProhibitedContent(
        new DirectoryInfo(
        @"C:\SomeFolder" ) );

    Console.WriteLine( @"Folder contains {0} prohibited files.",
        prohibitedPaths.Length );

    foreach ( FileInfo prohibitedPath in prohibitedPaths )
    {
        // A regular (non-verbatim) string, so that \t is emitted as a tab.
        Console.WriteLine(
            "\tProhibited file: '{0}'.", prohibitedPath );
    }
}

Simply copy it into your own console application and you are done.

Conclusion

In this article, I've shown you a library to detect file types based on their content. Although this is only a first version of the library and some of its approaches are probably somewhat naive, I'm sure the code is useful and can be extended in the future to become even more usable.

If you have feedback, questions or comments, simply post them in the comments section below. I'm looking forward to your messages!

References

  1. HeaderSig.txt - Several signatures for file types
  2. Magic number (programming) - Wikipedia article

History

  • 2007-05-01: Initial release of the library

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Chief Technology Officer Zeta Software GmbH
Germany
Uwe has been programming since 1989, with experience in Assembler, C++, MFC and lots of web and database stuff, and now uses ASP.NET and C# extensively, too. He has also taught programming to students at the local university.

In his free time, he does climbing, running and mountain biking. In 2012 he became a father of a cute boy and in 2014 of an awesome girl.
