The Neapolitan model is unusable #1356
Description
Hello,
The model for Neapolitan (nap) is unusable because of its poor quality:
>>> ft.get_nearest_neighbors("cuccuziello")
[(0.9683643579483032, 'Masiello'), (0.9683618545532227, 'soldatiello'), (0.9682843685150146, 'Mezzaniello'), (0.9651128053665161, 'perettiello'), (0.963299572467804, 'maretiello'), (0.9630503058433533, 'nnammoratiello'), (0.9629217386245728, 'Fermariello'), (0.9614925384521484, 'poveriello'), (0.9613924622535706, 'Manniello'), (0.9589092135429382, 'ciancianiello')]
The output above lists the nearest neighbors of "cuccuziello", the word for "zucchini". The model returns "Masiello" (a family name), "soldatiello" (diminutive of "soldier"), "Mezzaniello" (a type of pasta), "perettiello" (a type of wine container), "maretiello" (diminutive of "husband"), etc. None of them is related to zucchini.
Let’s try with mare (sea):
>>> ft.get_nearest_neighbors("mare")
[(0.6819297671318054, 'maree'), (0.6802213788032532, 'sommare'), (0.67812180519104, 'Altomare'), (0.6762729287147522, 'mmare'), (0.6754312515258789, 'sciummare'), (0.6556524038314819, 'Oltremare'), (0.6542813181877136, 'amare'), (0.6521005630493164, 'Croismare'), (0.6465907692909241, 'lungomare'), (0.6444516181945801, 'Zimmare')]
Here it is marginally better: 40% of the neighbors are related to the sea, probably because "mare" is the same word in Italian and all those neighbors come from Italian.
Let’s try with a famous word, guaglione (young man, adolescent):
>>> ft.get_nearest_neighbors("guaglione")
[(0.9444118738174438, 'gguaglione'), (0.9239395260810852, 'uaglione'), (0.922201931476593, 'Quaglione'), (0.9067193269729614, 'Guaglione'), (0.8721657991409302, 'Scaglione'), (0.8564983010292053, 'Baglione'), (0.8542811870574951, 'Faraglione'), (0.8541175127029419, 'muraglione'), (0.8494646549224854, 'Zampaglione'), (0.8474137783050537, 'Maglione')]
"gguaglione" (feminine plural), "uaglione" (variant) and "Guaglione" (with a capital letter) are various versions of "guaglione", but the other words have nothing to do with it.
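One pattern in the three outputs above is that nearly every neighbor ends with the same letters as the query. A minimal sketch to quantify this follows. It uses only the neighbor words copied from the outputs above, and the helper `shared_suffix_share` and the per-query suffix lengths are hypothetical choices made for this illustration, not part of fastText. A high share would be consistent with similarity being driven by shared character sequences rather than meaning, though it does not prove it.

```python
# Neighbor words copied from the outputs above (similarity scores omitted).
NEIGHBORS = {
    "cuccuziello": ["Masiello", "soldatiello", "Mezzaniello", "perettiello",
                    "maretiello", "nnammoratiello", "Fermariello",
                    "poveriello", "Manniello", "ciancianiello"],
    "mare": ["maree", "sommare", "Altomare", "mmare", "sciummare",
             "Oltremare", "amare", "Croismare", "lungomare", "Zimmare"],
    "guaglione": ["gguaglione", "uaglione", "Quaglione", "Guaglione",
                  "Scaglione", "Baglione", "Faraglione", "muraglione",
                  "Zampaglione", "Maglione"],
}

# Hand-picked suffix lengths (an assumption made for this illustration).
SUFFIX_LEN = {"cuccuziello": 5, "mare": 4, "guaglione": 7}


def shared_suffix_share(query, neighbors, suffix_len):
    """Return the fraction of neighbors that end with the last
    `suffix_len` characters of `query`, ignoring case."""
    suffix = query[-suffix_len:].lower()
    matches = sum(1 for word in neighbors if word.lower().endswith(suffix))
    return matches / len(neighbors)


shares = {
    query: shared_suffix_share(query, words, SUFFIX_LEN[query])
    for query, words in NEIGHBORS.items()
}

for query, share in shares.items():
    suffix = query[-SUFFIX_LEN[query]:]
    print(f"{query}: {share:.0%} of neighbors end in '{suffix}'")
```

On the lists above, all ten neighbors of "cuccuziello" end in "iello", all ten neighbors of "guaglione" end in "aglione", and nine of the ten neighbors of "mare" end in "mare".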
Is there anything one can do to improve the accuracy of the model, or is the poor quality inherent to the small size of the training corpus?