
Conversation

alanakbik (Collaborator) commented on Dec 16, 2020

This PR adds FLERT as presented in our recent paper (closes #2015).

A number of changes are made:

  1. Sentence objects now have next_sentence() and previous_sentence() methods; the underlying pointers are set automatically when a corpus is loaded through ColumnCorpus. This pointer system lets you navigate through the sentences of a corpus:
from flair.datasets import MIT_MOVIE_NER_SIMPLE

# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)
# get the previous sentence
print(sentence.previous_sentence())
# get the sentence after that
print(sentence.next_sentence())
# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())

This allows dynamic computation of contexts in the embedding classes.
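To illustrate the idea, here is a rough sketch of how a left context window could be assembled by following the previous_sentence() pointers. The function name left_context and its max_tokens parameter are made up for this example, and it counts whole tokens; the actual implementation lives inside TransformerWordEmbeddings and operates on subtokens.

def left_context(sentence, max_tokens=64):
    # walk backwards through the corpus, collecting up to max_tokens of left context
    context = []
    prev = sentence.previous_sentence()
    while prev is not None and len(context) < max_tokens:
        # do not cross document boundaries (see is_document_boundary below)
        if prev.is_document_boundary:
            break
        context = [token.text for token in prev] + context
        prev = prev.previous_sentence()
    return context[-max_tokens:]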

  2. Sentence objects now have an is_document_boundary field that is set through ColumnCorpus. Some datasets contain sentences such as "-DOCSTART-" that only indicate document boundaries; this is now recorded as a boolean on the object.
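For instance, assuming the loaded corpus actually contains such boundary markers (as CoNLL-03 does with its "-DOCSTART-" lines), you can inspect or filter them like this (a small illustrative sketch, not part of the PR):

for sentence in corpus.train:
    # boundary sentences like "-DOCSTART-" carry is_document_boundary == True
    if sentence.is_document_boundary:
        print("new document starts at:", sentence)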

  3. TransformerWordEmbeddings has been refactored for dynamic context, robustness to long sentences, and readability. Some constructor arguments have been renamed for clarity: pooling_operation is now subtoken_pooling (to make clear that we pool subtokens), use_scalar_mix is now layer_mean (we only do a simple layer mean), and use_context can now optionally take an integer to indicate the length of the context. Default arguments have also changed.

For instance, to create embeddings with a document-level context of 64 subtokens, init like this:

embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)

From my testing, it also seems that the new implementation is a bit faster.
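As a quick usage sketch (illustrative only, reusing the embeddings object created above): the embeddings are used like any other Flair token embeddings, and the context window is applied automatically for sentences that carry next/previous pointers, i.e. sentences loaded through a corpus. A standalone sentence like the one below has no neighbors, so no extra context is used.

from flair.data import Sentence

sentence = Sentence("George Washington went to Washington .")
embeddings.embed(sentence)

# each token now holds its contextual embedding
for token in sentence:
    print(token.text, token.embedding.shape)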

  4. You can train FLERT like this:
import torch
from torch.optim.lr_scheduler import OneCycleLR

from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer


corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# fine-tune with AdamW, a one-cycle LR schedule, 20 epochs and a small learning rate
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

context_string = '+context' if use_context else ''

trainer.train(f"resources/flert{context_string}",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
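After training, the tagger can be used for prediction in the usual way. This is a sketch: the model path below assumes the output folder from the run above (with use_context=64) and Flair's default final-model.pt file name.

from flair.data import Sentence
from flair.models import SequenceTagger

# load the fine-tuned FLERT tagger from the training output folder
tagger = SequenceTagger.load('resources/flert+context/final-model.pt')

sentence = Sentence("George Washington went to Washington .")
tagger.predict(sentence)

# print the predicted entity spans
for entity in sentence.get_spans('ner'):
    print(entity)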

alanakbik merged commit e66baf4 into master on Dec 16, 2020
alanakbik deleted the GH-2015-flert branch on December 16, 2020, 22:50