I am using a KenLM 4-gram language model (binary) with DeepSpeech2, and it works quite well, but I constantly get warnings that seem unnecessary:
"WARNING:pyctcdecode.decoder:Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
WARNING:pyctcdecode.language_model:No known unigrams provided, decoding results might be a lot worse."
When I provide a list of unigrams like this, the warnings go away and the accuracy seems unchanged, but decoding takes significantly longer:
unigrams_file = "./kenlm-model/vocab-500000.txt"
with open(unigrams_file) as f:
    list_of_unigrams = [line.rstrip() for line in f]

def ctc_decoding_lm(logits, model_path=LM_MODEL_PATH, unigrams=list_of_unigrams):
    decoder = pyctcdecode.build_ctcdecoder(
        labels=char_to_int.get_vocabulary(),
        kenlm_model_path=model_path,
        unigrams=unigrams,
        alpha=0.9,
        beta=1.2,
    )
    logits = np.squeeze(logits)
    return decoder.decode(logits)
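For what it's worth, part of the extra time may come from re-reading and re-passing the unigram list on every call. A minimal stdlib-only sketch (the file path and `load_unigrams` helper are hypothetical, assuming one token per line) of loading the vocabulary once, stripping whitespace, dropping blank lines, and deduplicating while preserving order:

    # Load a unigram vocabulary once at startup; pyctcdecode accepts any
    # collection of plain string tokens for its `unigrams` argument.
    def load_unigrams(path):
        seen = set()
        unigrams = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                token = line.strip()
                if token and token not in seen:  # skip blanks and duplicates
                    seen.add(token)
                    unigrams.append(token)
        return unigrams

The resulting list can then be built once at module load and reused across calls instead of being rebuilt per decode.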
In which cases is providing unigrams actually relevant?