
Unigrams not provided warning #85

@to-schi


I am using a KenLM 4-gram language model (binary format) with DeepSpeech2, and it works quite well, but I constantly get warnings that seem unnecessary:

"WARNING:pyctcdecode.decoder:Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
WARNING:pyctcdecode.language_model:No known unigrams provided, decoding results might be a lot worse."

When I provide a list of unigrams as shown below, the warnings are gone and accuracy seems unchanged, but the computation time is significantly higher:

import numpy as np
import pyctcdecode

unigrams_file = "./kenlm-model/vocab-500000.txt"
with open(unigrams_file) as f:
    list_of_unigrams = [line.rstrip() for line in f]

def ctc_decoding_lm(logits, model_path=LM_MODEL_PATH, unigrams=list_of_unigrams):
    decoder = pyctcdecode.build_ctcdecoder(
        labels=char_to_int.get_vocabulary(),
        kenlm_model_path=model_path,
        unigrams=unigrams,
        alpha=0.9,
        beta=1.2,
    )
    logits = np.squeeze(logits)
    text = decoder.decode(logits)
    return text
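One thing I noticed while testing: build_ctcdecoder is called inside ctc_decoding_lm, so the decoder (including any unigram index) is rebuilt on every invocation. If the extra time comes mostly from construction rather than decoding, building the decoder once and reusing it might help. A minimal sketch of that caching pattern, with a placeholder dict standing in for the real pyctcdecode call so it runs standalone:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_decoder(model_path):
    # In the real code this would be the pyctcdecode.build_ctcdecoder(...)
    # call, with labels, kenlm_model_path=model_path, unigrams, alpha and
    # beta captured from module scope (a Python list is not hashable, so
    # the unigrams should not be passed as a cached argument). The dict
    # below is a placeholder so the pattern runs without pyctcdecode.
    return {"model_path": model_path}

# The first call builds the decoder; later calls with the same path
# return the cached instance instead of rebuilding it.
d1 = get_decoder("./kenlm-model/model.bin")
d2 = get_decoder("./kenlm-model/model.bin")
assert d1 is d2
```

This only separates construction cost from per-utterance decoding cost; whether the unigrams themselves slow down the beam search is a separate question.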

In which cases is providing explicit unigrams relevant?
