Voice recognition requires some kind of transform, either Fourier or wavelet, to reduce the speaker's voice to a short numerical representation. You would then compare that representation to a stored command template, not to the raw WAV input.
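To make the idea concrete, here is a rough sketch in Python using only the standard library. It is not a real recognizer: the function names (`magnitude_spectrum`, `band_features`, `similarity`, `tone`), the choice of 8 bands, and the synthetic sine-wave "commands" are all assumptions made up for illustration. It only shows the general shape of the approach: transform the samples, shrink the spectrum to a short feature vector, and compare that vector to a stored template.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Naive O(n^2) discrete Fourier transform; magnitudes of bins 0..n/2-1."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]

def band_features(samples, bands=8):
    """Collapse the spectrum into a few band energies, scaled to unit length."""
    spectrum = magnitude_spectrum(samples)
    width = len(spectrum) // bands
    energies = [
        sum(m * m for m in spectrum[b * width:(b + 1) * width])
        for b in range(bands)
    ]
    norm = math.sqrt(sum(e * e for e in energies)) or 1.0
    return [e / norm for e in energies]

def similarity(a, b):
    """Cosine similarity between two unit-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def tone(freq_bin, n=128, phase=0.0):
    """Hypothetical stand-in for recorded audio: a pure sine wave."""
    return [math.sin(2 * math.pi * freq_bin * t / n + phase) for t in range(n)]

# The stored template is the short feature vector, not the raw samples.
template = band_features(tone(5))
same_command = band_features(tone(5, phase=1.0))   # same pitch, shifted in time
other_command = band_features(tone(40))            # a different pitch

print(similarity(template, same_command))   # close to 1: treated as a match
print(similarity(template, other_command))  # close to 0: no match
```

Note that the two recordings of the "same" command differ sample by sample because of the phase shift, which is why comparing raw WAV data directly does not work, while their feature vectors are nearly identical. A practical system would use an FFT library and compare sequences of short-time features rather than one vector per recording.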
Check the following article:
Speech Recognition