Call to preprocess.preprocess_text raises TypeError (0.7.0) #243

@gryBox

steps to reproduce

some_string = "'A chemical combination brought about by the action of light, as in the formation of carbohydrates in living plants from the carbon di-oxid and water of the air under the influence of sunlight."

Scenario 1

import textacy

textacy.preprocess.preprocess_text(
    some_string,
    fix_unicode=True,
    lowercase=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_currency_symbols=False,
    no_punct=False,
    no_contractions=False,
    no_accents=False,
)

Result:

~/anaconda3/envs/py36-ml/lib/python3.6/site-packages/textacy/preprocess.py in preprocess_text(text, fix_unicode, lowercase, no_urls, no_emails, no_phone_numbers, no_numbers, no_currency_symbols, no_punct, no_contractions, no_accents)
    246         text = text.lower()
    247     # always normalize whitespace; treat linebreaks separately from spacing
--> 248     text = normalize_whitespace(text)
    249 
    250     return text


~/anaconda3/envs/py36-ml/lib/python3.6/site-packages/textacy/preprocess.py in normalize_whitespace(text)
     39     """
     40     return constants.RE_NONBREAKING_SPACE.sub(
---> 41         " ", constants.RE_LINEBREAK.sub(r"\n", text)
     42     ).strip()
     43 

TypeError: expected string or bytes-like object

Should fix_unicode be removed since it is no longer supported by textacy directly?
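
For context, the message in the traceback is the one Python's re module raises when a compiled pattern's sub() receives something other than str or bytes (for example None). The following is only a minimal sketch of that failure mode. It assumes, without confirmation from this issue, that an earlier step in preprocess_text handed a non-string to normalize_whitespace. The pattern is a hypothetical stand-in, not textacy's actual constants.RE_NONBREAKING_SPACE.

```python
import re

# Hypothetical stand-in pattern: runs of whitespace other than newlines.
# The real pattern lives in textacy.constants and may differ.
RE_NONBREAKING_SPACE = re.compile(r"[^\S\n]+")


def normalize_whitespace(text):
    """Collapse runs of non-newline whitespace into single spaces."""
    return RE_NONBREAKING_SPACE.sub(" ", text).strip()


# A real string is normalized as expected.
cleaned = normalize_whitespace("a \t b ")

# Assumption for illustration: an upstream step returned None instead of a str.
try:
    normalize_whitespace(None)
except TypeError as exc:
    message = str(exc)
    print(message)
```

If this is what happens, checking the type of the value just before the normalize_whitespace call in Scenario 1 would show which step stopped returning a string.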

Scenario 2 (all false)

import textacy

textacy.preprocess.preprocess_text(
    some_string,
    fix_unicode=False,
    lowercase=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_currency_symbols=False,
    no_punct=False,
    no_contractions=False,
    no_accents=False,
)

Result:
'A chemical combination brought about by the action of light, as in the formation of carbohydrates in living plants from the carbon di-oxid and water of the air under the influence of sunlight.'

expected vs. actual behavior

Expected:

"'a chemical combination brought about by the action of light as in the formation of carbohydrates in living plants from the carbon di oxid and water of the air under the influence of sunlight"

Actual: Scenario 1 raises the TypeError shown above; Scenario 2 returns the input unchanged.

I know preprocess_text worked in 0.6.x.

environment

  • operating system: aws linux
  • python version: 3.7
  • spacy version: 2.1.4
  • installed spacy models: en_core_web_sm
  • textacy version: 0.7.0
