This repository was archived by the owner on Mar 19, 2024. It is now read-only.

The Neapolitan model is unusable #1356

@bfontaine

Hello,
The model for Neapolitan (nap) is unusable because of its poor quality:

>>> ft.get_nearest_neighbors("cuccuziello")
[(0.9683643579483032, 'Masiello'), (0.9683618545532227, 'soldatiello'), (0.9682843685150146, 'Mezzaniello'), (0.9651128053665161, 'perettiello'), (0.963299572467804, 'maretiello'), (0.9630503058433533, 'nnammoratiello'), (0.9629217386245728, 'Fermariello'), (0.9614925384521484, 'poveriello'), (0.9613924622535706, 'Manniello'), (0.9589092135429382, 'ciancianiello')]

The code above shows the nearest neighbors of the word for "zucchini". It returns "Masiello" (a family name), "soldatiello" (diminutive of "soldat"), "Mezzaniello" (a type of pasta), "perettiello" (a type of wine container), "maretiello" (diminutive of "husband"), etc. These words share the "-iello" ending with the query but not its meaning.

Let’s try with mare (sea):

>>> ft.get_nearest_neighbors("mare")
[(0.6819297671318054, 'maree'), (0.6802213788032532, 'sommare'), (0.67812180519104, 'Altomare'), (0.6762729287147522, 'mmare'), (0.6754312515258789, 'sciummare'), (0.6556524038314819, 'Oltremare'), (0.6542813181877136, 'amare'), (0.6521005630493164, 'Croismare'), (0.6465907692909241, 'lungomare'), (0.6444516181945801, 'Zimmare')]

Here it’s marginally better: 40% of the words are related to the sea, probably because "mare" is the same in Italian and all those words come from Italian.

Let’s try with a famous word, guaglione (young man, adolescent):

>>> ft.get_nearest_neighbors("guaglione")
[(0.9444118738174438, 'gguaglione'), (0.9239395260810852, 'uaglione'), (0.922201931476593, 'Quaglione'), (0.9067193269729614, 'Guaglione'), (0.8721657991409302, 'Scaglione'), (0.8564983010292053, 'Baglione'), (0.8542811870574951, 'Faraglione'), (0.8541175127029419, 'muraglione'), (0.8494646549224854, 'Zampaglione'), (0.8474137783050537, 'Maglione')]

"gguaglione" (feminine plural), "uaglione" (variant) and "Guaglione" (with a capital letter) are various versions of "guaglione", but the other words have nothing to do with it.

Is there anything one can do to improve the accuracy of the model, or is it inherent to the small size of the corpus?
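
For reference, the pattern in the three queries above can be put into a number by counting how many neighbors merely share a long suffix with the query. This is only a rough sketch in plain Python: the helper names `shared_suffix_len` and `suffix_share` and the 4-character threshold are my own choices for illustration, not part of fastText. The word list is copied from the "mare" output above.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def suffix_share(query: str, words: list[str], min_len: int = 4) -> float:
    """Fraction of words sharing at least min_len trailing characters with query.

    min_len=4 is an arbitrary threshold chosen for this illustration.
    """
    hits = [w for w in words if shared_suffix_len(query, w) >= min_len]
    return len(hits) / len(words)


# Neighbors of "mare" as reported above (scores dropped).
# With a loaded model: words = [w for _, w in ft.get_nearest_neighbors("mare")]
mare_neighbors = [
    "maree", "sommare", "Altomare", "mmare", "sciummare",
    "Oltremare", "amare", "Croismare", "lungomare", "Zimmare",
]

print(suffix_share("mare", mare_neighbors))
```

A value close to 1.0 means the neighbors are mostly words that end like the query, which matches what the outputs above look like by eye.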
