mms

all lexicon.txt files are symlinks

#1
by soupdtag - opened

I cloned mms-cclms on my machine, and I noticed that all lexicon.txt files (in each of the mms-1b-all/[language]/ directories) are symbolic links.

Originally I thought this was an issue with git / git lfs on my end, so I viewed and downloaded a couple of lexicon.txt files from Hugging Face online, but each of the ones I viewed still appears to be a symbolic link pointing to a location I don't recognize (presumably a file location on the machine that this repo was originally uploaded from).

Anybody else having this issue?
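For anyone who wants to check their own clone, here is a minimal sketch that lists every lexicon.txt under a directory that is a symbolic link, together with the target it points to. The function name `find_symlinked_lexicons` and the example root path are hypothetical, not part of the repo. It uses `os.walk` so that broken links (targets that do not exist on your machine) are still reported.

```python
import os


def find_symlinked_lexicons(root):
    """Return sorted (path, link_target) pairs for each lexicon.txt under root that is a symlink."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "lexicon.txt" in filenames:
            path = os.path.join(dirpath, "lexicon.txt")
            # islink() is True even when the link target is missing.
            if os.path.islink(path):
                found.append((path, os.readlink(path)))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical location of the local clone; adjust to your own path.
    for path, target in find_symlinked_lexicons("mms-cclms/mms-1b-all"):
        print(f"{path} -> {target}")
```

An empty result means the lexicon.txt files in the clone are regular files.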

From the diff, it looks like the files have been moved to /lms.


Were you able to work with it? I am trying to use the kenlm package to load the LM .bin files, but I am getting a format load exception.


@vineelpratap what would be the right way to use these LMs? I have been trying to load them using kenlm, but it does not recognize the format.

I just got this issue myself, not sure how to deal with it yet.

EDIT: The only solution I've come up with so far is to make my own lexicon file for the language I'm interested in (Amharic). I used text data from data.statmt.org/cc-100. The Amharic file is relatively small, so it was easy to load locally and gather the top 250k words from it. That might be harder to do for more popular languages, which have larger files.
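In case it helps, the workaround above can be sketched roughly as follows. This is only an illustration under a few assumptions: words are split on whitespace with no normalization or punctuation handling, and each lexicon line uses the "word, then its characters separated by spaces, then `|`" layout common to flashlight-style CTC decoders. I have not verified that this matches the official MMS lexicons. The function name `build_lexicon` and the file names are hypothetical.

```python
from collections import Counter


def build_lexicon(corpus_path, lexicon_path, top_n=250_000):
    """Write the top_n most frequent whitespace-separated words to a lexicon file.

    Returns the number of lexicon entries written.
    """
    counts = Counter()
    # Read line by line so the whole corpus never has to sit in memory at once.
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            counts.update(line.split())

    entries = counts.most_common(top_n)
    with open(lexicon_path, "w", encoding="utf-8") as out:
        for word, _count in entries:
            # Assumed format: "word c1 c2 ... cn |" (spelling plus word-boundary token).
            out.write(f"{word} {' '.join(word)} |\n")
    return len(entries)


if __name__ == "__main__":
    # Hypothetical file names for a decompressed CC-100 text file and the output lexicon.
    written = build_lexicon("am.txt", "lexicon.txt")
    print(f"wrote {written} entries")
```

Streaming the file keeps memory bounded by the vocabulary size rather than the corpus size, which may make this more workable for the larger languages, although the word counter itself can still grow large.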

AI at Meta org

I have updated the lexicon files. Please try again.

Can confirm the update works. Thanks for uploading the lexicons!

vineelpratap changed discussion status to closed
