standardizing, streamlining, and snuggling up to spaCy
New and Changed:

- Removed `textacy.Doc`, and split its functionality into two parts:
  - New: Added `textacy.make_spacy_doc()` as a convenient and flexible entry point
    for making spaCy `Doc`s from text or (text, metadata) pairs, with optional
    spaCy language pipeline specification. It's similar to `textacy.Doc.__init__`,
    with the exception that text and metadata are passed in together as a 2-tuple.
  - New: Added a variety of custom doc property and method extensions to
    the global `spacy.tokens.Doc` class, accessible via its `Doc._` ("underscore")
    property. These are similar to the properties/methods on `textacy.Doc`;
    they just require an interstitial underscore. For example,
    `textacy.Doc.to_bag_of_words()` => `spacy.tokens.Doc._.to_bag_of_words()`.
  - New: Added functions for setting, getting, and removing these extensions.
    Note that they are set automatically when textacy is imported.
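The "text or (text, metadata) 2-tuple" input handling described above can be sketched in plain Python. This is a hypothetical illustration of the dispatch pattern only, not textacy's actual implementation:

```python
# Hypothetical sketch of an entry point that accepts either a bare text
# or a (text, metadata) 2-tuple -- NOT textacy's actual implementation.
def make_doc(data):
    """Normalize `data` into a doc-like dict with metadata attached."""
    if isinstance(data, str):
        text, meta = data, {}
    elif isinstance(data, tuple) and len(data) == 2:
        text, meta = data
    else:
        raise TypeError(
            "`data` must be a text or a (text, metadata) 2-tuple, "
            "not {}".format(type(data))
        )
    # a real implementation would run `text` through a language pipeline here
    return {"text": text, "meta": meta}

doc1 = make_doc("Many years later...")
doc2 = make_doc(("Many years later...", {"author": "Gabriel García Márquez"}))
```

Either call produces a doc with its metadata riding along, so downstream code doesn't need to handle text and metadata as separate streams.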
- Simplified and improved performance of `textacy.Corpus`:
  - Documents are now added through a simpler API, either in `Corpus.__init__`
    or `Corpus.add()`; they may be one or a stream of texts, (text, metadata)
    pairs, or existing spaCy `Doc`s. When adding many documents, the spaCy
    language processing pipeline is used in a faster and more efficient way.
  - Saving / loading corpus data to disk is now more efficient and robust.
  - Note: `Corpus` is now a collection of spaCy `Doc`s rather than `textacy.Doc`s.
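The "one or a stream" add API can be sketched with a minimal, hypothetical corpus class (not textacy's actual `Corpus` code):

```python
# Minimal hypothetical corpus illustrating the add API described above:
# .add() accepts a single text, a (text, metadata) pair, or a stream of either.
class MiniCorpus:
    def __init__(self, data=None):
        self.docs = []
        if data is not None:
            self.add(data)

    def add(self, data):
        # wrap a single item so the loop below always sees a stream
        if isinstance(data, (str, tuple)):
            data = [data]
        for item in data:
            text, meta = item if isinstance(item, tuple) else (item, {})
            self.docs.append({"text": text, "meta": meta})

corpus = MiniCorpus("first text")
corpus.add([("second text", {"lang": "en"}), "third text"])
```

Accepting one item or many through a single method keeps the caller's code identical whether it has a lone document or a large stream.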
- Simplified, standardized, and added `Dataset` functionality:
  - New: Added an `IMDB` dataset, built on the classic 2011 dataset
    commonly used to train sentiment analysis models.
  - New: Added a base `Wikimedia` dataset, from which a reworked
    `Wikipedia` dataset and a separate `Wikinews` dataset inherit.
    The underlying data source has changed, from XML db dumps of raw wiki markup
    to JSON db dumps of (relatively) clean text and metadata; now, the code is
    simpler, faster, and totally language-agnostic.
  - `Dataset.records()` now streams (text, metadata) pairs rather than a dict
    containing both text and metadata, so users don't need to know field names
    and split them into separate streams before creating `Doc` or `Corpus`
    objects from the data.
  - Filtering and limiting the number of texts/records produced is now clearer
    and more consistent between the `.texts()` and `.records()` methods on
    a given `Dataset` --- and more performant!
  - Downloading datasets now always shows progress bars and saves to the same
    file names. When appropriate, downloaded archive files' contents are
    automatically extracted for easy inspection.
  - Common functionality (such as validating filter values) is now standardized
    and consolidated in the `datasets.utils` module.
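The records-as-pairs change can be illustrated with a toy dataset. Both the `MiniDataset` class and the sample rows below are invented for illustration; this is not textacy's actual `Dataset` code:

```python
# Toy illustration of streaming (text, metadata) pairs with a limit,
# as described above -- not textacy's actual Dataset code.
RAW_ROWS = [
    {"text": "Great movie!", "rating": 9},
    {"text": "Terrible.", "rating": 2},
    {"text": "Just okay.", "rating": 5},
]

class MiniDataset:
    def records(self, limit=None):
        """Yield (text, metadata) 2-tuples, stopping after `limit` rows."""
        for i, row in enumerate(RAW_ROWS):
            if limit is not None and i >= limit:
                break
            meta = {k: v for k, v in row.items() if k != "text"}
            yield row["text"], meta

    def texts(self, limit=None):
        """Yield texts only, reusing the same streaming logic."""
        for text, _ in self.records(limit=limit):
            yield text

pairs = list(MiniDataset().records(limit=2))
```

Because each record is already a (text, metadata) pair, it can be fed straight into doc- or corpus-building code without splitting fields apart first.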
- Quality of life improvements:
  - Reduced load time for `import textacy` from ~2-3 seconds to ~1 second
    by lazy-loading expensive variables, deferring a couple of heavy imports, and
    dropping a couple of dependencies. Specifically:
    - `ftfy` was dropped, and a `NotImplementedError` is now raised
      in textacy's wrapper function, `textacy.preprocess.fix_bad_unicode()`.
      Users with bad unicode should now directly call `ftfy.fix_text()`.
    - `ijson` was dropped, and the behavior of `textacy.read_json()`
      is now simpler and consistent with other functions for line-delimited data.
    - `mwparserfromhell` was dropped, since the reworked `Wikipedia` dataset
      no longer requires complicated and slow parsing of wiki markup.
  - Renamed certain functions and variables for clarity, and for consistency with
    existing conventions:
    - `textacy.load_spacy()` => `textacy.load_spacy_lang()`
    - `textacy.extract.named_entities()` => `textacy.extract.entities()`
    - `textacy.data_dir` => `textacy.DEFAULT_DATA_DIR`
    - `filename` => `filepath` and `dirname` => `dirpath` when specifying
      full paths to files/dirs on disk, and `textacy.io.utils.get_filenames()`
      => `textacy.io.utils.get_filepaths()` accordingly
    - `SpacyDoc` => `Doc`, `SpacySpan` => `Span`, `SpacyToken` => `Token`,
      and `SpacyLang` => `Language` as variables and in docs
    - Compiled regular expressions now consistently start with `RE_`
  - Removed deprecated functionality:
    - Top-level `spacy_utils.py` and `spacy_pipelines.py` are gone;
      use the equivalent functionality in the `spacier` subpackage instead.
    - Top-level `math_utils.py` is gone; it was long neglected, and never
      actually used.
  - Replaced `textacy.compat.bytes_to_unicode()` and `textacy.compat.unicode_to_bytes()`
    with `textacy.compat.to_unicode()` and `textacy.compat.to_bytes()`, which
    are safer and accept either binary or text strings as input.
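A minimal sketch of what such "accepts either kind of string" converters might look like; this is hypothetical code, not textacy's actual `compat` module:

```python
# Hypothetical sketch of converters that accept either binary or text
# strings, as described above -- not textacy's actual compat code.
def to_unicode(s, encoding="utf-8"):
    """Return `s` as a text (unicode) string."""
    if isinstance(s, bytes):
        return s.decode(encoding)
    if isinstance(s, str):
        return s
    raise TypeError("`s` must be a text or binary string, not {}".format(type(s)))

def to_bytes(s, encoding="utf-8"):
    """Return `s` as a binary string."""
    if isinstance(s, str):
        return s.encode(encoding)
    if isinstance(s, bytes):
        return s
    raise TypeError("`s` must be a text or binary string, not {}".format(type(s)))
```

Passing an already-converted value through unchanged is what makes these safer than one-directional converters: callers don't need to check the input type first.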
  - Moved and renamed language detection functionality:
    `textacy.text_utils.detect_language()` => `textacy.lang_utils.detect_lang()`.
    The idea is to add more/better lang-related functionality here in the future.
  - Updated and cleaned up documentation throughout the code base.
  - Added and refactored many tests, for both new and old functionality,
    significantly increasing test coverage while significantly reducing run time.
    Also, added a proper coverage report to CI builds. This should help prevent
    future errors and inspire better test-writing.
  - Bumped the minimum required spaCy version, `v2.0.0` => `v2.0.12`,
    for access to their full set of custom extension functionality.
- Fixed:
  - The progress bar during an HTTP download now always closes, preventing weird
    nesting issues if another bar is subsequently displayed.
  - Filtering datasets by multiple values performed either a logical AND or a
    logical OR over the values, which was confusing; now, a logical OR is always
    performed.
  - The existence of files/directories on disk is now checked properly via
    `os.path.isfile()` or `os.path.isdir()`, rather than `os.path.exists()`.
  - Fixed a variety of formatting errors raised by sphinx when generating HTML docs.
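The motivation for the `os.path.exists()` change can be seen in a short example: `exists()` is true for either a file or a directory, so it can't validate that a path is the expected kind of thing.

```python
import os
import tempfile

# exists() is True for files *and* directories; isfile()/isdir() also
# check that the path is the expected type.
with tempfile.TemporaryDirectory() as dirpath:
    filepath = os.path.join(dirpath, "data.txt")
    with open(filepath, "w") as f:
        f.write("hello")

    assert os.path.exists(dirpath) and os.path.exists(filepath)  # both pass
    assert os.path.isfile(filepath) and not os.path.isfile(dirpath)
    assert os.path.isdir(dirpath) and not os.path.isdir(filepath)
```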