
DirectShow Filters Development Part 3: Transform Filters

15 Mar 2011 | CPOL | 6 min read
A text overlay filter and a JPEG/JPEG2000 encoder using transform filters.

Introduction

Transform filters are probably the most interesting pieces of the DirectShow puzzle: they encapsulate complex image and video processing algorithms. From a filter development point of view, they are no harder to implement than other filters, though they do require some additional coding and method overrides. As with rendering and source filters, transform filters have base classes from which you should inherit when implementing your custom work.

Transform filters have at least two pins: one input pin and one output pin. They fall into two categories - copy-transform filters and in-place transform filters. As the names imply, a copy-transform filter reads data from the input pin, transforms it, and writes the result to a separate output buffer, whereas an in-place transform filter modifies the input sample's buffer directly and passes the same sample downstream.

DirectShow provides three base classes for writing transform filters:

  1. CTransformFilter - base class for copy-transform filters
  2. CTransInPlaceFilter - base class for in-place transforms
  3. CVideoTransformFilter - designed for video decoding, with built-in quality control management for dropping frames when the graph is flooded with samples

I will cover the first two classes in this article: a CTransInPlaceFilter descendant will be used for a text overlay filter, and a CTransformFilter descendant for a JPEG/JPEG2000 encoder.

Before we continue, you should take a look at part 1 of this series as the filter development prerequisites, filter registration, and filter debugging are all the same.

Text Overlay Filter

A text overlay filter adds some user defined text to each and every frame that goes through the filter. It can be used for displaying subtitles or a logo. Adding text to the video frame does not change its media subtype or format, therefore an in-place transform suits perfectly. I will be using GDI+ for overlays, as it provides a convenient API for creating in-place bitmaps and drawing characters on a bitmap.

C++
using namespace Gdiplus;
using namespace std;

class CTextOverlay : public CTransInPlaceFilter, public ITextAdditor
{
public:
        DECLARE_IUNKNOWN;

        CTextOverlay(LPUNKNOWN pUnk, HRESULT *phr);
        virtual ~CTextOverlay(void);

        virtual HRESULT CheckInputType(const CMediaType* mtIn);
        virtual HRESULT SetMediaType(PIN_DIRECTION direction, const CMediaType *pmt);
        virtual HRESULT Transform(IMediaSample *pSample);

        static CUnknown *WINAPI CreateInstance(LPUNKNOWN pUnk, HRESULT *phr); 
        STDMETHODIMP NonDelegatingQueryInterface(REFIID riid, void ** ppv);

        STDMETHODIMP AddTextOverlay(WCHAR* text, DWORD id, RECT position, 
                     COLORREF color = RGB(255, 255, 255), float fontSize = 20);
        STDMETHODIMP Clear(void);
        STDMETHODIMP Remove(DWORD id);

private:
        ULONG_PTR m_gdiplusToken;
        VIDEOINFOHEADER m_videoInfo;
        PixelFormat m_pixFmt;
        int m_stride;
        map<DWORD, Overlay*> m_overlays;
};

The only pure virtual method is Transform, and it must be implemented in your class. In addition, I have overridden CheckInputType, which is called for each media type during the pin connection negotiation. Since a transform filter has at least two pins, SetMediaType takes a direction argument that indicates whether the connection is being made on the input or the output pin. You may want to save both the input and output video headers; in this case, I only need the input video header since it is identical to the output:

C++
HRESULT CTextOverlay::SetMediaType(PIN_DIRECTION direction, const CMediaType *pmt)
{
   if(direction == PINDIR_INPUT)
   {
      VIDEOINFOHEADER* pvih = (VIDEOINFOHEADER*)pmt->pbFormat;
      m_videoInfo = *pvih;
      HRESULT hr = GetPixleFormat(m_videoInfo.bmiHeader.biBitCount, &m_pixFmt);
      if(FAILED(hr))
      {
             return hr;
      }

      BITMAPINFOHEADER bih = m_videoInfo.bmiHeader;
      // DIB scan lines are DWORD-aligned, so pad the row length up to a multiple of 4
      m_stride = ((bih.biBitCount / 8 * bih.biWidth) + 3) & ~3;
   }

   return S_OK;
}

The filter accepts only RGB formats with 15, 16, 24, and 32 bits per pixel. Using the GDI+ Bitmap class, it is possible to create an in-place bitmap object without any buffer copy. I then create a Graphics object from that bitmap and call the Graphics::DrawString method to draw the user-defined text on the bitmap:

C++
HRESULT CTextOverlay::Transform(IMediaSample *pSample)
{
       CAutoLock lock(m_pLock);

       BYTE* pBuffer = NULL;
       Status s = Ok;
       map<DWORD, Overlay*>::iterator it;

       HRESULT hr = pSample->GetPointer(&pBuffer);
       if(FAILED(hr))
       {
           return hr;
       }

       BITMAPINFOHEADER bih = m_videoInfo.bmiHeader;
       Bitmap bmp(bih.biWidth, bih.biHeight, m_stride, m_pixFmt, pBuffer);
       Graphics g(&bmp);
       
       for ( it = m_overlays.begin() ; it != m_overlays.end(); it++ )
       {
          Overlay* over = (*it).second;
          SolidBrush brush(over->color);
          Font font(FontFamily::GenericSerif(), over->fontSize);
          s = g.DrawString(over->text, -1, &font, over->pos, 
                           StringFormat::GenericDefault(), &brush);
          if(s != Ok)
          {
             WCHAR msg[100];
             wsprintfW(msg, L"Failed to draw text : %s", over->text);
             ::OutputDebugStringW(msg);
          }
       }

    return S_OK;
}

Using the ITextAdditor interface, you can add a text overlay with an ID, remove one by ID, or remove all of them. Each overlay contains the text, the bounding rectangle, the color, and the font size:

C++
DECLARE_INTERFACE_(ITextAdditor, IUnknown)
{
   STDMETHOD(AddTextOverlay)(WCHAR* text, DWORD id, RECT position, 
             COLORREF color, float fontSize) PURE;
   STDMETHOD(Clear)(void) PURE;
   STDMETHOD(Remove)(DWORD id) PURE;
};

Overlay objects are stored in a map in a thread-safe manner, so you can freely add and remove overlays during playback. Thread safety in the DirectShow framework is achieved with critical sections and the CAutoLock class: the lock is usually declared at the beginning of a method, and when it goes out of scope at the end of the method, the critical section is released.

JPEG / JPEG2000 Encoder

It took me a while to decide what type of video encoding to implement, and eventually I decided to make a simple intra-frame encoder: each video frame is encoded with no reference to the previous or next frame. This type of encoding is easier to implement than inter-frame encoding standards like MPEG4 or H264, but produces a higher bit rate, since much of the pixel information in neighboring frames is redundant. I also created a base class for other intra-frame encoder types, so you can easily swap the implementation by inheriting from CBaseCompressor and updating the factory method that creates the concrete implementations:

C++
struct CBaseCompressor
{
     virtual HRESULT Init(BITMAPINFOHEADER* pBih) PURE;

     virtual HRESULT Compress(BYTE* pInput, DWORD inputSize, BYTE* pOutput,
                              DWORD* outputSize) PURE;

     virtual HRESULT SetQuality(BYTE quality) PURE;

     virtual HRESULT GetMediaSubTypeAndCompression(GUID* mediaSubType,
                                                   DWORD* compression) PURE;
};

By default, the encoding standard is JPEG, based on code I found here on CodeProject. Using the IJ2KEncoder::SetEncoderType method, you can switch the implementation to the JPEG2000 encoding standard, which is based on the OpenJpeg library. Please note that once one of the filter's pins is connected, you can no longer change the encoder implementation, so it is best to set the desired encoding algorithm right after filter creation.

JPEG2000 and Media Sub Types

When using a JPEG compressor, DirectShow provides a built-in media sub type called MEDIASUBTYPE_MJPG, and it is declared in the uuids.h file. Regarding JPEG2000, I could not find any appropriate GUID, so I created one using the following macro definition:

C++
DEFINE_GUID( MEDIASUBTYPE_MJ2C, MAKEFOURCC('M', 'J', '2', 'C'),
             0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 
             0x00, 0x38, 0x9b, 0x71);

When using the BITMAPINFOHEADER structure for compressed images, you have to set the biCompression field to MAKEFOURCC('M', 'J', '2', 'C'). This way, the filter can connect to JPEG2000 decoders, like this one.

MJ2C denotes a raw JPEG2000 code stream; it is effectively a Motion JPEG2000 format in which each frame consists of independently compressed image data. Another variant is J2K, which is usually used for still-image encoding and also contains file headers.

Although JPEG2000 provides better compression ratios and better image quality, especially at lower bit rates, it is more CPU intensive than JPEG and hence less suitable for high-resolution video. While researching JPEG2000 implementations, I found a project called CUJ2K - a JPEG2000 implementation based on CUDA, NVIDIA's GPU computing platform. Since that library was designed for still images stored on disk, it takes the source and destination paths on the command line; making it work with in-memory buffers would have required additional work, so I decided to go with OpenJpeg. CUJ2K is still worth a look if you need better performance.

Filter Implementation

To implement a transform filter, you have to implement six methods:

  • Transform - receives input and output media samples.
  • CheckInputType - checks whether the input pin can connect to an upstream filter.
  • CheckTransform - checks whether a transformation is possible between input and output media types.
  • DecideBufferSize - sets the memory buffer size for the output media samples.
  • GetMediaType - returns the media type used to connect the output pin with the downstream filter.
  • SetMediaType - called when the input and output pins are successfully connected.
C++
class CJ2kCompressor : public CTransformFilter, public IJ2KEncoder
{
public:
       DECLARE_IUNKNOWN;

       CJ2kCompressor(LPUNKNOWN pUnk, HRESULT *phr);
       virtual ~CJ2kCompressor(void);

       // CTransformFilter overrides
       virtual HRESULT Transform(IMediaSample * pIn, IMediaSample *pOut);
       virtual HRESULT CheckInputType(const CMediaType* mtIn);
       virtual HRESULT CheckTransform(const CMediaType* mtIn, 
                       const CMediaType* mtOut);
       virtual HRESULT DecideBufferSize(IMemAllocator * pAlloc, 
                       ALLOCATOR_PROPERTIES *pProp);
       virtual HRESULT GetMediaType(int iPosition, CMediaType *pMediaType);
       virtual HRESULT SetMediaType(PIN_DIRECTION direction, const CMediaType *pmt);

       static CUnknown * WINAPI CreateInstance(LPUNKNOWN pUnk, HRESULT *pHr);
       STDMETHODIMP NonDelegatingQueryInterface(REFIID riid, void ** ppv);

       // IJ2KEncoder
       STDMETHODIMP SetQuality(BYTE quality);
       STDMETHODIMP SetEncoderType(EncoderType encoderType);

private:
       VIDEOINFOHEADER m_VihIn;   
       VIDEOINFOHEADER m_VihOut; 
       CBaseCompressor* m_encoder;
};

The Transform method implementation is pretty straightforward: I get the buffer pointers from the input and output media samples and pass them to the CBaseCompressor implementation. After that, I set the actual output media sample size and mark the sample as a sync point, since with intra-frame encoding every frame is a key frame:

C++
HRESULT CJ2kCompressor::Transform(IMediaSample* pIn, IMediaSample* pOut)
{
     HRESULT hr = S_OK;

     BYTE *pBufIn, *pBufOut;
     long sizeIn;
     DWORD sizeOut;

     hr = pIn->GetPointer(&pBufIn);
     if(FAILED(hr))
     {
        return hr;
     }

     sizeIn = pIn->GetActualDataLength();

     hr = pOut->GetPointer(&pBufOut);
     if(FAILED(hr))
     {
        return hr;
     }
 
     hr = m_encoder->Compress(pBufIn, sizeIn, pBufOut, &sizeOut);

     if(FAILED(hr))
     {
        return hr;
     }

     hr = pOut->SetActualDataLength(sizeOut);

     if(FAILED(hr))
     {
        return hr;
     }

     hr = pOut->SetSyncPoint(TRUE);

     return hr;
}

Filter Registration

Since this filter is a video encoder, it should be registered in the video compressor filter category; this is done using the IFilterMapper2 interface:

C++
STDAPI RegisterFilters( BOOL bRegister )
{
   HRESULT hr = NOERROR;
   WCHAR achFileName[MAX_PATH];
   char achTemp[MAX_PATH];
   ASSERT(g_hInst != 0);

   if( 0 == GetModuleFileNameA(g_hInst, achTemp, sizeof(achTemp))) 
   {
          return AmHresultFromWin32(GetLastError());
   }

   MultiByteToWideChar(CP_ACP, 0L, achTemp, lstrlenA(achTemp) + 1, 
                       achFileName, NUMELMS(achFileName));

   hr = CoInitialize(0);
   if(bRegister)
   {
          hr = AMovieSetupRegisterServer(CLSID_Jpeg2000Encoder, 
                  J2K_FILTER_NAME, achFileName, L"Both", L"InprocServer32");
   }

   if( SUCCEEDED(hr) )
   {
      IFilterMapper2 *fm = 0;
      hr = CoCreateInstance( CLSID_FilterMapper2, NULL, 
              CLSCTX_INPROC_SERVER, IID_IFilterMapper2, (void **)&fm);

      if( SUCCEEDED(hr) )
      {
         if(bRegister)
         {
           IMoniker *pMoniker = 0;
           REGFILTER2 rf2;
           rf2.dwVersion = 1;
           rf2.dwMerit = MERIT_DO_NOT_USE;
           rf2.cPins = 2;
           rf2.rgPins = psudPins;
           hr = fm->RegisterFilter(CLSID_Jpeg2000Encoder, J2K_FILTER_NAME, 
                &pMoniker, &CLSID_VideoCompressorCategory, NULL, &rf2);
         }
         else
         {
            hr = fm->UnregisterFilter(&CLSID_VideoCompressorCategory, 0, 
                                      CLSID_Jpeg2000Encoder);
         }
      }

      if(fm)
         fm->Release();
   }

   if( SUCCEEDED(hr) && !bRegister )
   {
          hr = AMovieSetupUnregisterServer( CLSID_Jpeg2000Encoder );
   }

   CoFreeUnusedLibraries();
   CoUninitialize();
   return hr;
}

STDAPI DllRegisterServer() 
{
       return RegisterFilters(TRUE);
}

STDAPI DllUnregisterServer()
{
       return RegisterFilters(FALSE);
}

References

  1. Programming DirectShow for Digital Video and TV
  2. Writing transform filters
  3. OpenJpeg library
  4. Tony Lin JPEG codec

History

  • 23.2.2011
    • Initial version
  • 13.3.2011
    • Changed source to use smart pointers
    • Fixed SetQuality implementation

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Israel

Comments and Discussions

Question: Does the Transform() function decrease the frame rate of real-time video? (Member 13775524, 10-Jul-18 21:47)
Question: Problems in VS2015 when compiling the TextOverlay code (Member 10566729, 6-Dec-16 22:40)
Question: FFT (Vaclav_, 17-Aug-13 5:38)
Answer: Re: FFT (roman_gin, 17-Aug-13 8:42)
General: Re: FFT (Vaclav_, 17-Aug-13 16:08)
Question: What does the for loop in the Transform function do? (D-vyah, 1-Aug-13 23:59)
General: Overlay continuously changing GPS data instead of user-defined text (D-vyah, 28-Jul-13 23:24)
Question: Class not registered error (sdancer75, 5-Jul-13 23:11)
Question: Second time crashing (Chrisen1337, 14-Dec-12 4:53)
Question: 4 problems adding text using this, please help me (Praveen S, 28-Sep-12 20:00)
Question: Text Overlay Filter (weirdProgrammer-2, 25-Sep-12 5:07)
Answer: Re: Text Overlay Filter (weirdProgrammer-2, 26-Sep-12 4:15)
Question: So, what about inter-frame encoding? (are_all_nicks_taken_or_what, 21-Sep-12 8:55)
Answer: Re: So, what about inter-frame encoding? (roman_gin, 21-Sep-12 23:25)
Question: It works fine but the remove function gets heap corruption errors (aliiz, 28-Mar-12 21:33)
Answer: Re: It works fine but the remove function gets heap corruption errors (roman_gin, 29-Mar-12 21:22)
General: Re: It works fine but the remove function gets heap corruption errors (morley83, 2-Apr-12 2:14)
General: Re: It works fine but the remove function gets heap corruption errors (roman_gin, 3-Apr-12 22:04)
Question: Looking for something very similar (ranma500, 6-Dec-11 1:21)
Answer: Re: Looking for something very similar (roman_gin, 6-Dec-11 2:15)
General: Re: Looking for something very similar (ranma500, 6-Dec-11 2:21)
Hi.

I'm a filmmaker, who has been developing a digital cinema camera.
I have been looking for a suitable codec, but am yet to find one.
The codec requirements for a RAW Bayer sensor are somewhat different from usual video.

I have made some notes below.

Any thoughts on how this might be realized would be gratefully received.

COMPRESSED RAW BAYER CODEC
BACKGROUND AND REQUIREMENTS


ABOUT BAYER SIGNALS

Taken from wikipedia

A Bayer filter mosaic is a color filter array (CFA) for arranging RGB color filters on a square grid of photosensors. Its particular arrangement of color filters is used in most single-chip digital image sensors used in digital cameras, camcorders, and scanners to create a color image. The filter pattern is 50% green, 25% red and 25% blue, hence is also called RGBG[1][2], GRGB[3], or RGGB.[4]

Important things to note:
1. There are twice as many green-filtered pixel sensors as red or blue. This is always the case.
2. The sensors beneath the filters have no way of measuring color. They purely measure luminance.
3. Each pixel sensor produces a single integer (usually 12-bit), referring to the intensity of light received by that pixel.
4. The image information that comes from the sensor contains only luminance information. There is no saturation or chrominance information.

Thus if we consider a 12-bit RGBG sensor, a single 2x2 block will contain the following information
R - luminance of the red-filtered pixel - value of 0-4095
G - luminance of the green-filtered pixel - value of 0-4095
B - luminance of the blue-filtered pixel - value of 0-4095
G - luminance of the second green-filtered pixel - value of 0-4095

One might say that the resulting signal structurally looks identical to a 12-bit grayscale image.

DEBAYERING
Debayering is the process of taking this grayscale image, and by using knowledge of which pixels were captured by using which color filter, extrapolating a conventional color picture.
Debayering is quite complicated, and there are many different methods, from simple nearest neighbor algorithms to more complex algorithms.

STORING THE IMAGE
The major question when storing an image produced by a Bayer sensor is whether to debayer before storing.
Any professional stills photographer will know that there are a lot of good reasons to store the undebayered image - known as RAW.
This stores the information as received by the sensor, allowing a photographer to use whatever debayering algorithm is best for that particular image later on. It also gives him other options whilst debayering, including color correction, detail enhancement etc.
If one debayers before storage, picture information may be lost in the process which can never be recovered.
However, RAW files are usually much bigger in size, as they usually are uncompressed.
Thus this requires more storage.
This is the reason why compact cameras sometimes do not offer RAW storage, as the average person probably doesn't care too much about whether pictures have been debayered or not, but would rather be able to store a lot of pictures.

UNCOMPRESSED vs. RAW
A lot of people think that these two phrases mean the same thing.
They do not.
RAW means that the sensor information has not been debayered.
'Uncompressed' means that the sensor information has not been compressed.
The reason that people think that they mean the same thing is because usually RAW images are uncompressed.
However RAW images can be compressed, and still be RAW images.

IS COMPRESSION A DIRTY WORD?

Compression takes data and makes it smaller.
We do not worry at all about compressing text documents as a ZIP. We would always expect that once unzipped, the document file would be just the same as before it was compressed. If it were to change a few words here and there, it would be useless.
This is a good example of lossless compression.
However, a lot of filmmakers have come to hate video compression. They associate it with blockiness, detail loss, strange effects on faces, etc.
This is because they have witnessed the effects of extreme video compression.
For example, MPEG-4 compression often compresses a video to 1/60th of its original size. It's pretty obvious that if you are using compression to that degree, there has to be a loss of quality.
But, video compression does not have to produce horrible effects.
What if the compression method used was totally lossless?
This means that it would work just like ZIP. Whatever is compressed will decompress exactly as it was before.
There seems to me to be little argument that if we can achieve lossless video compression, it is a pretty good way to save storage space without any loss of quality.

LOSSY COMPRESSION
One must be careful when encountering the phrase 'visually lossless'. This means that the compression method is lossy – some information has been lost or altered by the compression process, however it is not noticeable to the human eye.
This brings up an enormous question as to how to define 'visually lossless'. One must be careful that this is not just a marketing way of saying 'yes, we chuck out some of your information, but don't worry, you'll never notice'.
Of course lossy compression has its place. Clever lossy compression is how we can get realtime streaming in YouTube, and is used in all domestic satellite TV systems.
But if we are storing masters straight from the camera, we should be very careful about using a lossy system.

COMPRESSED RAW BAYER VIDEO
When storing video from a Bayer sensor, there are three things we need:
1. To store the video images 'undebayered', i.e. RAW.
2. To lose little or no information in the process of storing.
3. To be able to store as much information as possible on the disk, card, SSD, or whatever storage media we are using. This reduces the number of times we have to swap cards; reduces costs, because we don't have to buy so many storage devices; reduces the amount of things we have to carry around, etc.

So, it seems to me that what is needed is a method of compressing RAW video that is lossless.

IT ALREADY EXISTS

The RED cameras use REDRaw which is a form of compressed RAW.
Cineform does the same thing.

SO WHY CREATE A NEW CODEC?
Amazingly, there is no open codec for camera developers, digital recorder developers, to use that stores RAW video.
RED keeps their compression algorithms a total secret, and you cannot use them on any camera or device other than the RED camera.
Cineform is a great system, but their revenue stream is derived from licensing the codec, and selling decompression packages. If they gave it away, they wouldn't make any money.

There is a need for a new codec – one that can be used by developers, that is free to use, and that can be used either directly, or as an interchange storage format.

WHY NOT USE A STANDARD CODEC BASED ON H.264 OR OTHER MODERN CODECS?
1. Most codecs expect to receive image information as RGB or YUV.
This stores the information for each pixel in three separate components.
For example, YUV contains the luminance in the Y channel, and the color information encoded into the U and V channels.
However, as shown above, the RAW sensor information does not contain any color information. There is no U and V – only Y.
So this is not a perfect method for storing the information.

2. Most codecs use intraframe compression methods which look for similarities or patterns in nearby pixels.
This doesn't work so well with Bayer images. For example, if we take a picture of a smooth red wall, an RGB compression system would find this very easy, because all adjacent pixels would have very similar values.
However, the RAW Bayer image of the same red wall would contain dramatic differences between adjacent pixels. The pixel that is red-filtered will contain a very high value, whilst the green-filtered pixel will contain a very low value. There will be few similarities for the codec to find.

HOW WOULD A NEW CODEC AVOID THESE TWO ISSUES?

Firstly, we need to find a compression method that works efficiently with 12-bit grayscale images. As we discovered above, in terms of structure, that is what a RAW image effectively is.

Secondly, we need to make it easier for the compression method to work and find patterns or similarities, so that it is not confused by the mosaic-like pattern of the RAW sensor.

PROPOSED SOLUTION
COMPRESSION METHOD
It turns out that there is a compression method that seems ideally suited to this: JPEG-LS
It accepts up to 16-bit grayscale images.
It is lossless or lossy, depending on user preferences.
It is low-complexity, meaning that it could be performed in realtime.
It is freely available for implementation.

There is much information on the internet about JPEG-LS, and a number of people have considered its use for video.

SPATIAL TRANSFORM

However, we still have the problem of the mosaic, and the difficulties of compressing a mosaic image with intraframe compression.

The solution for this is very simple.
We can move the pixels about before compression.
As long as we move them back again after decompression, we can ensure that nothing changes.
For example:
If these are our pixels on the sensor:

GBGBGBGB
RGRGRGRG
GBGBGBGB
RGRGRGRG
GBGBGBGB
RGRGRGRG
GBGBGBGB
RGRGRGRG

We can move them around and group them as follows:

GGGGBBBB
GGGGBBBB
GGGGBBBB
GGGGBBBB
GGGGRRRR
GGGGRRRR
GGGGRRRR
GGGGRRRR

The Blue-filtered pixels are now all in a single block, the Red-filtered pixels are in a single block, and the Green filtered pixels are in two blocks.
This will enable the compression method to work with much greater efficiency.

VARIABLES
The codec will need to be able to deal with different inputs, with the following variables:
1.The size of the image will vary. Standard HD video is 1920x1080. But the codec must support other pixel dimensions.
2.Sensors vary in the order of the pixels. The codec must deal with RGBG, GBGR or RGGB. This does not need to be automatic, but can require user setting.
3.There should be a setting for lossless or the level of acceptable error for lossy.


FUTURE ENHANCEMENT

ENHANCED SPATIAL TRANSFORM
There may be ways of altering the spatial transform to gain greater compression efficiency. For example, rather than separating the image into blocks of G1 G2 R B , it may be that it will be more efficient to feed difference values into the compression algorithm, e.g. G1 G1-G2 G1-R G1-B.

This will require experimentation to find the best balance of efficiency with speed. It is vital that the codec can achieve real-time performance.

METADATA
Useful information could be included in the file header to include metadata, holding information about the image and its creation.

TECHNICAL REQUIREMENTS
Firstly, the final codec should be a FourCC codec which can be selectable as a standard windows codec, to enable easy integration with software.
Secondly, it must be coded efficiently, so that on a standard i5 system, real-time compression can be achieved.
Thirdly, it must work totally losslessly, such that the information stored can be decompressed such that it is identical to the information taken from the sensor.
Fourthly, it should be freely available.
General: Re: Looking for something very similar (roman_gin, 6-Dec-11 23:15)
General: Re: Looking for something very similar (ranma500, 6-Dec-11 23:31)
General: Re: Looking for something very similar (roman_gin, 12-Dec-11 7:14)
Question: I have some errors running TextOverlay on my PC. Is it possible to send me the upgraded projects or TextOverlayFilter.dll? (Zagros21, 15-Nov-11 9:35)
