Releases: webis-de/small-text
v2.0.0.dev3
This intermediate release serves as a preliminary version of the upcoming v2.0.0. Consider it an alpha release, where interface changes are still possible.
Due to overlap with the previous v2.0.0.dev* releases, no changes will be shown here, but instead we refer to the CHANGELOG file.
v2.0.0.dev2
This intermediate release serves as a preliminary version of the upcoming v2.0.0. Consider it an alpha release, where interface changes are still possible.
Due to overlap with v2.0.0.dev1, no changes will be shown here, but instead we refer to the CHANGELOG file.
v2.0.0.dev1
This intermediate release serves as a preliminary version of the upcoming v2.0.0. Consider it an alpha release, where interface changes are still possible.
Added
- General
- Python requirements raised to Python 3.8 since Python 3.7 reached end of life on 2023-06-27.
- Dropped torchtext as an integration dependency. For individual use cases it can of course still be used.
- Added environment variables `SMALL_TEXT_PROGRESS_BARS` and `SMALL_TEXT_OFFLINE` to control the default behavior for progress bars and model downloading.
- PoolBasedActiveLearner:
- `initialize_data()` has been replaced by `initialize()`, which can now also be used to provide an initial model in cold start scenarios. (#10)
- Classification:
- All PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) now support `torch.compile()`, which can be enabled on demand. (Requires PyTorch >= 2.0.0.)
- All PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) now support Automatic Mixed Precision.
- `SetFitClassification.__init__()` now has a verbosity parameter (similar to `TransformerBasedClassification`) through which you can control the progress bar output of `SetFitClassification.fit()`.
- TransformerBasedClassification:
- Removed unnecessary `token_type_ids` keyword argument in model call.
- Additional keyword args for config, tokenizer, and model can now be configured.
- Embeddings:
- Prevented unnecessary gradient computations for some embedding types and unified code structure.
- Pytorch:
- Added an `inference_mode()` context manager that applies `torch.inference_mode` or `torch.no_grad` for older Pytorch versions.
- Query Strategies:
- New strategies: DiscriminativeRepresentationLearning, LabelCardinalityInconsistency, ClassBalancer, and ProbCover.
- Query strategies now have a tie-breaking mechanism that applies a random permutation when scores are tied.
- Added `ScoringMixin` to enable a reusable scoring mechanism for query strategies.
- LightweightCoreset can now process input in batches. (#23)
- Vector Index Functionality:
- A new vector index API provides a unified interface over different implementations of k-nearest neighbor search.
- Existing strategies that used a hard-coded vector search (ContrastiveActiveLearning, SEALS, AnchorSubsampling) have been adapted and can now be used with different vector index implementations.
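The tie-breaking mechanism listed under Query Strategies can be pictured with a small sketch. This is not small-text's implementation; the function name `top_k_with_random_ties` and its parameters are hypothetical and only illustrate the general idea of permuting randomly before a stable sort, so that equally scored items are not always returned in index order.

```python
import random


def top_k_with_random_ties(scores, k, rng=None):
    """Return the indices of the k highest scores, breaking ties at random.

    Hypothetical helper for illustration; not part of small-text.
    """
    rng = rng or random.Random()
    indices = list(range(len(scores)))
    # Randomly permute first ...
    rng.shuffle(indices)
    # ... then sort by score. Python's sort is stable (also with reverse=True),
    # so tied items keep their shuffled, i.e. random, relative order.
    indices.sort(key=lambda i: scores[i], reverse=True)
    return indices[:k]


scores = [0.9, 0.5, 0.5, 0.5, 0.1]
print(top_k_with_random_ties(scores, 2, random.Random(42)))
```

Without the shuffle, index 1 would always win the tie between indices 1, 2, and 3, which can bias the queried samples toward the beginning of the pool.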
Fixed
- Fixed a bug where the `clone()` operation wrapped the labels, which then raised an error. This affected the single-label scenario for PytorchTextClassificationDataset and TransformersDataset. (#35)
- Fixed a bug where the batching in `greedy_coreset()` and `lightweight_coreset()` resulted in incorrect batch sizes. (#50)
- Fixed a bug where `lightweight_coreset()` failed when computing the norm of the elementwise mean vector.
Changed
- General
- Moved `split_data()` method from `small_text.data.datasets` to `small_text.data.splits`.
- Dependencies
- Raised setfit version to 1.1.0.
- Classification:
- The `initialize()` methods of all PyTorch-classifiers (KimCNN, TransformerBasedClassification, SetFitClassification) are now more unified. (#57)
- KimCNNClassifier / TransformerBasedClassification: model selection is now disabled by default. Also, it no longer saves models when disabled, thereby greatly reducing the runtime.
- Utils
- `init_kmeans_plusplus_safe()` now supports weighted kmeans++ initialization for `scikit-learn>=1.3.0`.
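For readers unfamiliar with weighted kmeans++ initialization, the following is a minimal pure-Python sketch of the general seeding idea. It is neither small-text's `init_kmeans_plusplus_safe()` nor scikit-learn's code; the function `weighted_kmeans_plusplus` and its parameters are hypothetical. The sketch assumes there are at least `k` distinct points with positive weight.

```python
import random


def weighted_kmeans_plusplus(points, weights, k, rng=None):
    """Return the indices of k initial centers chosen by weighted kmeans++ seeding.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    indices = list(range(len(points)))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # First center: drawn with probability proportional to the sample weight.
    first = rng.choices(indices, weights=weights, k=1)[0]
    centers = [first]
    # Squared distance from every point to its closest chosen center.
    closest = [sq_dist(p, points[first]) for p in points]

    while len(centers) < k:
        # Next center: probability proportional to weight * squared distance,
        # so heavy points far away from all chosen centers are preferred.
        probs = [w * d for w, d in zip(weights, closest)]
        nxt = rng.choices(indices, weights=probs, k=1)[0]
        centers.append(nxt)
        closest = [min(d, sq_dist(p, points[nxt])) for p, d in zip(points, closest)]
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (50.0, 50.0)]
weights = [1.0, 1.0, 1.0, 0.0]  # the last point has zero weight
print(weighted_kmeans_plusplus(points, weights, 3, random.Random(0)))
```

With uniform weights this reduces to ordinary kmeans++ seeding; a point with zero weight is never drawn as a center, however far it lies from the others.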
Removed
- Deprecated functionality
- Removed `default_tensor_type()` method.
- Removed `small_text.utils.labels.get_flattened_unique_labels()`.
- Removed `small_text.integrations.pytorch.utils.labels.get_flattened_unique_labels()`.
- Classification
- Removed early stopping legacy arguments in `__init__()` for KimCNN and TransformerBasedClassification. (Use `fit()` keyword arguments instead.)
- Removed model selection legacy argument in `TransformerBasedClassification.__init__()`.
- The explicit installation instruction for conda was removed, but the small-text conda-forge package will remain.
v1.4.1
Bugfix release.
Fixed
- Fixed an out of bounds error that occurred when `DiscriminativeActiveLearning` queries all remaining unlabeled data.
- Fixed typos/wording in PoolBasedActiveLearner docstrings.
- Pinned SetFit version in notebook example. (#64)
- Fixed an out of bounds error that could occur in `SetFitClassification` for both 32bit systems and Windows. (#66)
- Fixed errors in notebook examples that occurred with more recent seaborn / matplotlib versions.
Changed
- Documentation: added links to bibliography. (#65)
v1.4.0
Fixes SetFit seed control and adds the AnchorSubsampling query strategy.
Added
- New query strategy: AnchorSubsampling.
Fixed
- Changed how the seed is controlled in `SetFitClassification`, since the seed was fixed unless explicitly set via the respective trainer keyword argument.
Changed
- Documentation: Added a section where compatible transformer models are listed.
- Documentation: Updated showcase section.
v1.3.3
v1.3.2
v1.3.1
v1.3.0
SetFitClassification now also supports dropout sampling (like KimCNNClassifier and TransformerBasedClassification).
Added
- Added dropout sampling to SetFitClassification.
Fixed
- Fixed broken link in README.md.
- Fixed typo in README.md. (#26)
Changed
Stopping Criteria
- The ClassificationChange stopping criterion now supports multi-label classification.
Documentation
- Updated the active learning setup figure.
- The documentation of integrations has been reorganized.
v1.2.0
This release adds a SetFit classifier, the BALD query strategy, and two new example notebooks.
Added
Active Learning
- PoolBasedActiveLearner now handles keyword arguments passed to the classifier's `fit()` during the `update()` step.
- New strategy: BALD.
- SubsamplingQueryStrategy now uses the remaining unlabeled pool when more samples are requested than are available.
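The subsampling fallback described in the last item can be sketched as follows. This is an illustration under assumptions, not the code of SubsamplingQueryStrategy; the function `subsample_pool` and its parameters are hypothetical.

```python
import random


def subsample_pool(unlabeled_indices, subsample_size, rng=None):
    """Draw a random subsample of the unlabeled pool.

    If more samples are requested than are available, fall back to the
    whole remaining pool instead of raising an error.
    Hypothetical helper for illustration; not part of small-text.
    """
    rng = rng or random.Random()
    pool = list(unlabeled_indices)
    if subsample_size >= len(pool):
        return pool
    return rng.sample(pool, subsample_size)


# Only three unlabeled samples remain, but ten are requested.
print(subsample_pool([3, 5, 8], 10))
```

Toward the end of an active learning run the unlabeled pool can shrink below the configured subsample size, which is when a fallback of this kind matters.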
Classification
- Added new classifier: SetFitClassification which wraps huggingface/setfit.
Examples
- Revised both existing notebook examples.
- Added a notebook example for active learning with SetFit classifiers.
- Added a notebook example for cold start initialization with SetFit classifiers.
Documentation
- A showcase section has been added to the documentation.
Fixed
- Distances in lightweight_coreset were not correctly projected onto the [0, 1] interval (but ranking was unaffected).
Changed
- Coreset implementations now use the distance-based (as opposed to the similarity-based) formulation.
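To illustrate the difference between a similarity-based and a distance-based formulation, and why a faulty projection onto [0, 1] can leave the ranking intact, here is a minimal sketch. It assumes cosine similarity and the rescaling (1 - s) / 2; both are illustrative choices and not taken from the small-text source. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1] for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance_01(a, b):
    """Distance in [0, 1]: 0 for identical directions, 1 for opposite ones.

    Illustrative rescaling; the exact projection used by small-text
    is an assumption here.
    """
    return (1.0 - cosine_similarity(a, b)) / 2.0


anchor = [1.0, 0.0]
candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]

# Sorting by ascending distance yields the same order as sorting by
# descending similarity, because the mapping is strictly decreasing.
by_distance = sorted(range(4), key=lambda i: cosine_distance_01(anchor, candidates[i]))
by_similarity = sorted(range(4), key=lambda i: -cosine_similarity(anchor, candidates[i]))
print(by_distance == by_similarity)
```

Any strictly monotone rescaling of the distances preserves their order, which is consistent with the note that the ranking was unaffected by the projection bug.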