I am working on a speech-to-text project in Python using the Vosk API. I am trying to get the timestamps of certain phrases in the audio for some data analysis. I need an algorithm or approach for doing this without using the Google Cloud Speech API or the IBM Watson Speech API. Any sort of help is welcome.

What I have tried:

I have tried SimpleAudioIndexer, which uses pocketsphinx and the Watson Cloud API, but its accuracy is not up to the mark.
Updated 17-Jan-22 16:40pm
Comments
[no name] 2-Oct-20 11:50am    
If you play back the text (as text to speech), you should be able to do timings because the speech engine should have events for: word (breaks), sentences, etc. that you can hook onto.

1 solution

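Vosk can include timestamps in its output if you call SetWords(True) on the recognizer. The times are reported per word within each recognized utterance.
e.g.
from vosk import KaldiRecognizer
rec = KaldiRecognizer(model, sample_rate)  # model: an already loaded vosk Model
rec.SetWords(True)  # ask for per-word timing details in the results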
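For context, here is a minimal sketch of how the recognizer might be driven over a whole file to collect the word entries. It assumes a mono 16-bit PCM WAV file and a Vosk model directory you have already downloaded; the function names `words_from_results` and `transcribe_with_times`, and the chunk size of 4000 frames, are illustrative choices and not part of Vosk.

```python
import json
import wave


def words_from_results(result_jsons):
    """Collect the per-word entries from a sequence of Vosk result JSON strings."""
    words = []
    for raw in result_jsons:
        # Results for silent stretches may have no "result" list, so default to [].
        words.extend(json.loads(raw).get("result", []))
    return words


def transcribe_with_times(wav_path, model_path):
    """Run Vosk over a WAV file and return the word entries with start/end times."""
    # Imported here so the helper above can be used without Vosk installed.
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_path), wf.getframerate())
        rec.SetWords(True)
        while True:
            data = wf.readframes(4000)  # chunk size is an arbitrary choice
            if not data:
                break
            # AcceptWaveform returns True when an utterance has been completed.
            if rec.AcceptWaveform(data):
                results.append(rec.Result())
        # Flush whatever is still buffered at the end of the file.
        results.append(rec.FinalResult())
    return words_from_results(results)
```

Calling `transcribe_with_times("audio.wav", "model")` (both paths are placeholders) should return a flat list of dictionaries, each with "word", "start", "end" and "conf" keys.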
-----
With this set, each completed result includes a "result" list giving the confidence, start time and end time (in seconds) of every word,
e.g.
{
"partial" : "zero one eight zero"
}
{
"result" : [{
"conf" : 1.000000,
"end" : 6.690000,
"start" : 6.240000,
"word" : "zero"
}, {
"conf" : 1.000000,
"end" : 6.900000,
"start" : 6.690000,
"word" : "one"
}, {
"conf" : 1.000000,
"end" : 7.140000,
"start" : 6.900000,
"word" : "eight"
}, {
"conf" : 1.000000,
"end" : 7.500000,
"start" : 7.140000,
"word" : "zero"
}, {
"conf" : 1.000000,
"end" : 8.010000,
"start" : 7.500000,
"word" : "three"
}],
"text" : "zero one eight zero three"
}
 
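To get the timestamps of a phrase rather than of single words, one possible approach is to scan the word list for consecutive words that match the phrase, then take the start time of the first word and the end time of the last. The sketch below assumes the word entries have already been parsed from the JSON into Python dictionaries; `find_phrase` is a hypothetical helper name, and the matching is exact and case-insensitive, so it will miss phrases the recognizer transcribed differently.

```python
def find_phrase(words, phrase):
    """Return a (start, end) tuple for every occurrence of `phrase` in `words`."""
    target = phrase.lower().split()
    tokens = [w["word"].lower() for w in words]
    hits = []
    if not target:
        return hits
    # Slide a window the length of the phrase across the recognized words.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            last = i + len(target) - 1
            hits.append((words[i]["start"], words[last]["end"]))
    return hits


# Word entries taken from the example output above.
words = [
    {"conf": 1.0, "start": 6.24, "end": 6.69, "word": "zero"},
    {"conf": 1.0, "start": 6.69, "end": 6.90, "word": "one"},
    {"conf": 1.0, "start": 6.90, "end": 7.14, "word": "eight"},
    {"conf": 1.0, "start": 7.14, "end": 7.50, "word": "zero"},
    {"conf": 1.0, "start": 7.50, "end": 8.01, "word": "three"},
]

print(find_phrase(words, "eight zero"))  # [(6.9, 7.5)]
print(find_phrase(words, "zero"))        # [(6.24, 6.69), (7.14, 7.5)]
```

If the audio is processed in several utterances, collect the "result" lists from all of them into one list before searching, so that a phrase spanning two results is not missed.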
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


