@mauryaland
Contributor

Related to #3142. This solution takes the difference between the first token's start index and the sentence's start index. It keeps the initial whitespace in the special cases where a sentence really begins with whitespace, while no longer printing a long run of useless whitespace that appeared only because token indices are stored relative to a bigger document.

For example, take the following text, which is split into two Sentence objects in my application:

text = "Amaury et Valentin mangent au 77 boulevard  Perreire.\n\n\n\n Paul  a modifié le tokenizer."

Before the fix, it gives:

[Sentence[9]: "Amaury et Valentin mangent au 77 boulevard  Perreire.",
 Sentence[6]: "                                                          Paul  a modifié le tokenizer."]

And after the fix:

[Sentence[9]: "Amaury et Valentin mangent au 77 boulevard  Perreire.",
 Sentence[6]: "Paul  a modifié le tokenizer."]

And a sentence that begins with whitespace:

sentence = Sentence(" ... and then?")

print(sentence)

still keeps it in the printout:

Sentence[4]: " ... and then?"
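
The offset arithmetic described above can be sketched in plain Python, without flair. This is only an illustration under stated assumptions, not the code merged in this PR: `Tok`, `tokenize`, and `rebuild_text` are hypothetical names, the regex tokenizer is a toy, and the second sentence is assumed to start at the offset of "Paul". The point it shows is that measuring the first gap from the sentence start instead of from 0 drops the document-level padding but preserves genuine leading whitespace.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for a token: its text and its offset in the full document."""
    text: str
    start: int


def tokenize(document: str) -> List[Tok]:
    # Toy tokenizer (an assumption for this sketch): words or single punctuation marks.
    return [Tok(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", document)]


def rebuild_text(tokens: List[Tok], sentence_start: int) -> str:
    # Pad each token by the gap between its start and the end of the previous token.
    # The first gap is measured from sentence_start rather than from 0.
    parts = []
    cursor = sentence_start
    for tok in tokens:
        parts.append(" " * (tok.start - cursor))
        parts.append(tok.text)
        cursor = tok.start + len(tok.text)
    return "".join(parts)


text = "Amaury et Valentin mangent au 77 boulevard  Perreire.\n\n\n\n Paul  a modifié le tokenizer."
start = text.index("Paul")  # assumed start position of the second sentence
second = [t for t in tokenize(text) if t.start >= start]

# Measuring from 0 reproduces the long run of leading spaces.
print(repr(rebuild_text(second, sentence_start=0)))
# Measuring from the sentence start removes it.
print(repr(rebuild_text(second, sentence_start=start)))
# A standalone sentence that begins with a space keeps that space.
print(repr(rebuild_text(tokenize(" ... and then?"), sentence_start=0)))
```

Passing the sentence start explicitly keeps the function independent of how the surrounding document was split, which is the same idea as using `Sentence.start_position` in the fix.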

@mauryaland changed the title from "take into account Sentence.start_position to calculate whitespace" to "Modify Sentence.to_original_text() to take into account Sentence.start_position for whitespace calculation" on Mar 15, 2023
@alanakbik
Collaborator

@mauryaland thanks for fixing this!

@alanakbik merged commit 1807b5d into flairNLP:master on Mar 22, 2023
@mauryaland deleted the patch-2 branch on March 22, 2023 14:15