GH-2015: flert #2032
Merged
GH-2015: refactor FLERT
stefan-it reviewed on Dec 16, 2020
This PR adds FLERT as presented in our recent paper (closes #2015).
A number of changes are made:
- `Sentence` objects now have `next_sentence()` and `previous_sentence()` methods that are set automatically if loaded through `ColumnCorpus`. This is a pointer system for navigating through the sentences in a corpus, and it allows dynamic computation of contexts in the embedding classes.
- `Sentence` objects now have the `is_document_boundary` field, which is set through the `ColumnCorpus`. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object.
- `TransformerWordEmbeddings` refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: `pooling_operation` is now `subtoken_pooling` (to make clear that we pool subtokens), `use_scalar_mean` is now `layer_mean` (we only do a simple layer mean) and `use_context` can now optionally take an integer to indicate the length of the context. Default arguments are also changed. For instance, to create embeddings with a document-level context of 64 subtokens, set `use_context` to 64 at init.
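To illustrate how the sentence pointers and the boundary flag can be combined to compute context dynamically, here is a minimal sketch in plain Python. It is a toy illustration and not flair's implementation: `ToySentence`, `link_sentences`, `left_context` and `right_context` are hypothetical names, context is counted in whitespace tokens rather than subtokens, and stopping at document boundaries is an assumption made for the example.

```python
from typing import List, Optional


class ToySentence:
    """Hypothetical stand-in for a sentence with neighbour pointers."""

    def __init__(self, text: str, is_document_boundary: bool = False):
        self.tokens: List[str] = text.split()
        self.is_document_boundary = is_document_boundary
        self._previous: Optional["ToySentence"] = None
        self._next: Optional["ToySentence"] = None

    def previous_sentence(self) -> Optional["ToySentence"]:
        return self._previous

    def next_sentence(self) -> Optional["ToySentence"]:
        return self._next


def link_sentences(sentences: List[ToySentence]) -> List[ToySentence]:
    """Set the neighbour pointers, as a corpus loader might do."""
    for left, right in zip(sentences, sentences[1:]):
        left._next = right
        right._previous = left
    return sentences


def left_context(sentence: ToySentence, max_tokens: int) -> List[str]:
    """Walk backwards until max_tokens are collected or a boundary is hit."""
    context: List[str] = []
    current = sentence.previous_sentence()
    while current is not None and len(context) < max_tokens:
        if current.is_document_boundary:  # assumption: do not cross documents
            break
        context = current.tokens + context
        current = current.previous_sentence()
    # Keep only the tokens closest to the target sentence.
    return context[-max_tokens:] if max_tokens > 0 else []


def right_context(sentence: ToySentence, max_tokens: int) -> List[str]:
    """Walk forwards until max_tokens are collected or a boundary is hit."""
    context: List[str] = []
    current = sentence.next_sentence()
    while current is not None and len(context) < max_tokens:
        if current.is_document_boundary:  # assumption: do not cross documents
            break
        context = context + current.tokens
        current = current.next_sentence()
    return context[:max_tokens]


corpus = link_sentences([
    ToySentence("-DOCSTART-", is_document_boundary=True),
    ToySentence("George Washington went to Washington ."),
    ToySentence("He stayed there ."),
    ToySentence("-DOCSTART-", is_document_boundary=True),
    ToySentence("Berlin is a city ."),
])

target = corpus[2]
print(left_context(target, 3))   # tokens taken from the preceding sentence
print(right_context(target, 3))  # empty: the next sentence is a boundary
```

Because each sentence only stores pointers to its neighbours, the context window can be chosen at embedding time instead of being fixed when the corpus is loaded.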
From my testing, it also seems that the new implementation is a bit faster.