This article goes over how to use a small library that exposes the new SpeechRecognizer UWP API to plain Win32 applications: the two functions the code exports and how they are used.
In past years, many of us sound engineers tried to create and improve speech recognition algorithms: lots of training, neural networks, cepstra, Fourier transforms, wavelets, that sort of life-consuming research. The Windows Speech API would try to implement such algorithms with minor success.
Now that the internet has grown so much in capacity and speed that it can hold and compare enormous amounts of data, all those local algorithms have faded out in favor of network-based voice recognition. Instead of local analysis, your voice is transmitted to a server which holds many, many samples and can deduce your wording with great accuracy. Google already uses this in Android.
In Windows, we have the new SpeechRecognizer UWP API which, with a bit of code, can be used in plain Win32 applications. Here is a small library that will handle the details for you. I have been using this in my big audio and video sequencer, Turbo Play.
Using the Library
The code exports just two functions:
HRESULT __stdcall SpeechX3(const wchar_t* t, std::vector<uint8_t>* tx, bool XML);
HRESULT __stdcall SpeechX1(void* ptr, SpeechX2 x2,
const wchar_t* langx = L"en-us", int Mode = 0);
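The SpeechX2 type is not shown above; a plausible declaration, reconstructed from the MyCallback signature discussed below (check the library header for the authoritative definition), would be:

```cpp
// Reconstruction of the SpeechX2 callback type -- the library header is
// authoritative; this typedef merely mirrors the MyCallback signature below.
#include <cassert>               // for the quick self-check
#ifdef _WIN32
#include <windows.h>             // HRESULT, __stdcall, S_OK
#else
typedef long HRESULT;            // shim so the sketch compiles anywhere
#define __stdcall
#define S_OK 0L
#endif

// Called with (your custom pointer, recognized text or nullptr, confidence).
typedef HRESULT(__stdcall* SpeechX2)(void* ptr, const wchar_t* reco, int conf);
```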
For text to speech, use SpeechX3 with the text and a vector that will receive the WAVE file data. Set XML to true if the text contains XML (SSML) markup to configure the synthesis.
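For instance, saving a synthesized phrase to disk could look like the sketch below. The declaration is copied from the exports above; the phrase and file name are mine, and error handling is minimal:

```cpp
#include <windows.h>   // HRESULT, SUCCEEDED
#include <cstdint>
#include <fstream>
#include <vector>

// Declared in the library header; link against the DLL or static library.
HRESULT __stdcall SpeechX3(const wchar_t* t, std::vector<uint8_t>* tx, bool XML);

int wmain()
{
    std::vector<uint8_t> wav;
    // Plain text, so XML = false.
    if (SUCCEEDED(SpeechX3(L"Hello from Turbo Play!", &wav, false)))
    {
        // The vector already holds a complete WAVE file, so just dump it.
        std::ofstream f("hello.wav", std::ios::binary);
        f.write(reinterpret_cast<const char*>(wav.data()),
                static_cast<std::streamsize>(wav.size()));
    }
    return 0;
}
```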
For speech to text, call SpeechX1. With Mode = 2, pass a std::vector<std::tuple<std::wstring, std::wstring>>* as ptr to get all supported languages:
std::vector<std::tuple<std::wstring, std::wstring>> sx;
SpeechX1((void*)&sx, nullptr, nullptr, 2);
for (auto& e : sx)
std::wcout << std::get<0>(e) << L" - " << std::get<1>(e) << std::endl;
The first tuple item is the display name of the language, the second is the code that you would pass again to the function later to initiate the speech recognition.
Once you have picked the language to use, call SpeechX1 again with Mode = 0 and ptr set to a custom pointer that will be passed to your callback. The third parameter is the language code you picked, and for the second parameter you will pass a callback:
HRESULT __stdcall MyCallback(void* ptr, const wchar_t* reco, int conf);
which is called on three occasions:
- Periodically, to confirm the status, with reco = nullptr.
- With conf == -1 when the recognition is a pending hypothesis; reco is the partial text recognized so far.
- With conf >= 0 when the recognition is completed; reco is the final text, and the confidence parameter ranges from 0 to 3 (the lower, the better) to indicate the accuracy of the recognition.
Return S_OK to continue. If you return an error, SpeechX1 returns and the speech recognition session ends.
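Putting the pieces together, a minimal recognition session might look like the sketch below. The SpeechX2 typedef is my reconstruction from the callback signature, and stopping after the first final result is just one way to end the session:

```cpp
#include <windows.h>   // HRESULT, S_OK, E_FAIL
#include <iostream>

// From the library header (the SpeechX2 typedef is my assumption).
typedef HRESULT(__stdcall* SpeechX2)(void* ptr, const wchar_t* reco, int conf);
HRESULT __stdcall SpeechX1(void* ptr, SpeechX2 x2,
    const wchar_t* langx = L"en-us", int Mode = 0);

HRESULT __stdcall MyCallback(void* ptr, const wchar_t* reco, int conf)
{
    if (!reco)
        return S_OK;                            // periodic status ping
    if (conf == -1)
    {
        std::wcout << L"... " << reco << L"\r"; // pending hypothesis
        return S_OK;
    }
    // Final result; conf is 0..3, the lower the better.
    std::wcout << L"\n[confidence " << conf << L"] " << reco << std::endl;
    return E_FAIL;                              // an error ends the session
}

int wmain()
{
    // Blocks until the callback returns an error; ptr is unused here.
    SpeechX1(nullptr, MyCallback, L"en-us", 0);
    return 0;
}
```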
With Mode == 1, the library tests the specific voice recognition engine without returning results; instead, you will hear a playback of the recognized voice from your speakers.
The library is provided both as a DLL and as a static library, and a command line testing tool is included.
Have fun with it!
- 27th May, 2020: First release