I am using a KenLM 4-gram language model (binary) with DeepSpeech2, and it works quite well, but I constantly get warnings that seem unnecessary:
"WARNING:pyctcdecode.decoder:Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
WARNING:pyctcdecode.language_model:No known unigrams provided, decoding results might be a lot worse."
When I provide a list of unigrams like this, the warnings go away and the accuracy seems unchanged, but decoding takes significantly longer:
unigrams_file = "./kenlm-model/vocab-500000.txt"
with open(unigrams_file) as f:
    list_of_unigrams = [line.rstrip() for line in f]

def ctc_decoding_lm(logits, model_path=LM_MODEL_PATH, unigrams=list_of_unigrams):
    decoder = pyctcdecode.build_ctcdecoder(
        labels=char_to_int.get_vocabulary(),
        kenlm_model_path=model_path,
        unigrams=unigrams,
        alpha=0.9,
        beta=1.2,
    )
    logits = np.squeeze(logits)
    return decoder.decode(logits)
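For what it's worth, part of the extra time may come from re-reading and re-passing the unigram list on every call. A minimal stdlib-only sketch (the file path and `load_unigrams` helper are hypothetical, assuming one token per line) of loading the vocabulary once, stripping whitespace, dropping blank lines, and deduplicating while preserving order:

    # Load a unigram vocabulary once at startup; pyctcdecode accepts any
    # collection of plain string tokens for its `unigrams` argument.
    def load_unigrams(path):
        seen = set()
        unigrams = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                token = line.strip()
                if token and token not in seen:  # skip blanks and duplicates
                    seen.add(token)
                    unigrams.append(token)
        return unigrams

The resulting list can then be built once at module load and reused across calls instead of being rebuilt per decode.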
In which cases is providing unigrams actually relevant?