Dictionaries
UralicNLP makes it possible to obtain lexicographic information from the Giella dictionaries. This information can include translations, example sentences, semantic tags, morphological information and so on. You have to specify the language code of the dictionary.
For example, "sms" selects the Skolt Sami dictionary. The word used to query, however, can appear in any language. If the word is a lemma in Skolt Sami, the result will appear in "exact_match", if it's a word form for a Skolt Sami word, the results will appear in "lemmatized", and if it's a word in some other language, the results will appear in "other_languages", i.e if you search for cat in the Skolt Sami dictionary, you will get a result of a form {"other_languages": [Skolt Sami lexical items that translate to cat]}
An example of querying the Skolt Sami dictionary with the word car:
from uralicNLP import uralicApi
uralicApi.dictionary_search("car", "sms")
>> {'lemmatized': [], 'exact_match': [], 'other_languages': [{'lemma': 'autt', ...}, ...]}
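For comparison, a minimal sketch of a query with a word that is itself a Skolt Sami lemma (here autt, which appears in the output above). Based on the description of the result structure, the matches would be expected under "exact_match"; the exact output depends on the installed dictionary data.
from uralicNLP import uralicApi
# "autt" is a Skolt Sami lemma, so matches should appear under "exact_match"
uralicApi.dictionary_search("autt", "sms")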
It is possible to list all lemmas in the dictionary:
from uralicNLP import uralicApi
uralicApi.dictionary_lemmas("sms")
>> ['autt', 'sokk', ...]
You can also group the lemmas by part-of-speech:
from uralicNLP import uralicApi
uralicApi.dictionary_lemmas("sms",group_by_pos=True)
>> {"N": ['autt', 'sokk' ...], "V":[...]}
To find translations from an endangered language dictionary into a certain language, you can run the following script:
from uralicNLP import uralicApi
uralicApi.get_translation("piânnai", "sms", "fin")
>> ['koira']
Notice that it is optional to give the target language:
from uralicNLP import uralicApi
uralicApi.get_translation("piânnai", "sms")
>> {'sme': ['beana'], 'sjd': ['пе̄ннэ'], 'sju': ['biegˈŋja', 'dä̀rra'], 'rus': ['собака'], 'nob': ['hund'], 'eng': ['dog'], 'deu': ['Hund'], 'fin': ['koira']}
The first example searches for the word piânnai in the Skolt Sami dictionary and returns its translations in Finnish; without a target language, translations in all available languages are returned, keyed by language code.
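A minimal sketch of working with the result when no target language is given; as in the output above, the returned dict maps language codes to lists of translations:
from uralicNLP import uralicApi
# Without a target language, the result is a dict keyed by language codes
translations = uralicApi.get_translation("piânnai", "sms")
# Print one line per target language
for language, words in translations.items():
    print(language, ", ".join(words))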
If the get_translation method is slow, all you need to do is pip install hfst.
By default, UralicNLP uses a TinyDB backend. This is convenient because it does not require an external database server, but it can be extremely slow. For this reason, UralicNLP also provides a MongoDB backend.
Make sure you have both MongoDB and pymongo installed.
First, you will need to download the dictionary and import it to MongoDB. The following example shows how to do it for Komi-Zyrian.
from uralicNLP import uralicApi
uralicApi.download("kpv") #Download the latest dictionary data
uralicApi.import_dictionary_to_db("kpv") #Update the MongoDB with the new data
After the initial setup, you can use the dictionary queries, but you will need to specify the backend.
from uralicNLP import uralicApi
from uralicNLP.dictionary_backends import MongoDictionary
uralicApi.dictionary_lemmas("sms",backend=MongoDictionary)
uralicApi.dictionary_search("car", "sms",backend=MongoDictionary)
Now you can query the dictionaries fast.
UralicNLP is an open-source Python library by Mika Hämäläinen