The Neapolitan model is unusable #1356

Hello,
The model for Neapolitan (`nap`) is unusable because of its poor quality:

```pycon
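>>> import fasttext
>>> ft = fasttext.load_model("cc.nap.300.bin")  # assumed file name; adjust to the model file actually used
>>> ft.get_nearest_neighbors("cuccuziello")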
[(0.9683643579483032, 'Masiello'), (0.9683618545532227, 'soldatiello'), (0.9682843685150146, 'Mezzaniello'), (0.9651128053665161, 'perettiello'), (0.963299572467804, 'maretiello'), (0.9630503058433533, 'nnammoratiello'), (0.9629217386245728, 'Fermariello'), (0.9614925384521484, 'poveriello'), (0.9613924622535706, 'Manniello'), (0.9589092135429382, 'ciancianiello')]
```
The code above lists the nearest neighbors of "cuccuziello" (zucchini). It returns "Masiello" (a family name), "soldatiello" (diminutive of "soldier"), "Mezzaniello" (a type of pasta), "perettiello" (a type of wine container), "maretiello" (diminutive of "husband"), etc. All ten results share the "-iello" ending with the query, and none of the ones glossed here has anything to do with zucchini.

Let’s try with `mare` (sea):
```pycon
>>> ft.get_nearest_neighbors("mare")
[(0.6819297671318054, 'maree'), (0.6802213788032532, 'sommare'), (0.67812180519104, 'Altomare'), (0.6762729287147522, 'mmare'), (0.6754312515258789, 'sciummare'), (0.6556524038314819, 'Oltremare'), (0.6542813181877136, 'amare'), (0.6521005630493164, 'Croismare'), (0.6465907692909241, 'lungomare'), (0.6444516181945801, 'Zimmare')]
```
Here it’s marginally better: 40% of the words are related to the sea, probably because "mare" is the same word in Italian and all of those words come from Italian.

Let’s try with a famous word, `guaglione` (young man, adolescent):
```pycon
>>> ft.get_nearest_neighbors("guaglione")
[(0.9444118738174438, 'gguaglione'), (0.9239395260810852, 'uaglione'), (0.922201931476593, 'Quaglione'), (0.9067193269729614, 'Guaglione'), (0.8721657991409302, 'Scaglione'), (0.8564983010292053, 'Baglione'), (0.8542811870574951, 'Faraglione'), (0.8541175127029419, 'muraglione'), (0.8494646549224854, 'Zampaglione'), (0.8474137783050537, 'Maglione')]
```
"gguaglione" (feminine plural), "uaglione" (variant) and "Guaglione" (with a capital letter) are various versions of "guaglione", but the other words have nothing to do with it.

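To make the pattern concrete, here is a small sketch in plain Python (no fastText needed) that counts how many of the reported neighbors share a trailing run of characters with the query. The word lists are copied from the outputs above; the helper names and the 4-character threshold are my own arbitrary choices, not anything fastText defines.

```python
# Queries and neighbor words copied from the outputs above (scores omitted).
REPORTED = {
    "cuccuziello": ["Masiello", "soldatiello", "Mezzaniello", "perettiello", "maretiello",
                    "nnammoratiello", "Fermariello", "poveriello", "Manniello", "ciancianiello"],
    "mare": ["maree", "sommare", "Altomare", "mmare", "sciummare",
             "Oltremare", "amare", "Croismare", "lungomare", "Zimmare"],
    "guaglione": ["gguaglione", "uaglione", "Quaglione", "Guaglione", "Scaglione",
                  "Baglione", "Faraglione", "muraglione", "Zampaglione", "Maglione"],
}


def common_suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def suffix_overlap(query: str, neighbors: list, min_len: int = 4) -> int:
    """Count neighbors sharing at least min_len trailing characters with query."""
    return sum(common_suffix_len(query, w) >= min_len for w in neighbors)


for query, neighbors in REPORTED.items():
    shared = suffix_overlap(query, neighbors)
    print(f"{query}: {shared}/{len(neighbors)} neighbors share a suffix of 4+ characters")
```

If nearly every neighbor passes this check, the ranking looks driven by spelling rather than by meaning.
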
Is there anything one can do to improve the accuracy of the model, or is the poor quality inherent in the small size of the corpus?

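One check that might help narrow this down is whether each query word is in the model's vocabulary at all, since an out-of-vocabulary word gets its vector only from character n-grams. This is a minimal sketch, assuming `ft` is the loaded fastText model and relying on the Python binding's `get_word_id` (which returns -1 for words not in the dictionary) and `get_subwords`. The helper name `describe_query` is hypothetical.

```python
def describe_query(model, word):
    """Report whether `word` is in the model's vocabulary and how many
    subword entries contribute to its vector.

    `model` is expected to expose fastText's `get_word_id` and
    `get_subwords` methods; `describe_query` is a hypothetical helper.
    """
    # get_word_id returns -1 when the word is not in the dictionary.
    in_vocab = model.get_word_id(word) != -1
    # get_subwords returns a pair: (subword strings, subword indices).
    subwords, _indices = model.get_subwords(word)
    return {"word": word, "in_vocab": in_vocab, "n_subwords": len(subwords)}


# Usage, with `ft` loaded as in the session above:
# for w in ("cuccuziello", "mare", "guaglione"):
#     print(describe_query(ft, w))
```

A query reported as out of vocabulary would be consistent with neighbors that only look alike, though this alone does not prove the corpus size is the cause.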