Error training for instances with only numbers. #46

@romualdoalan

I found an error related to an output length mismatch in the get_example_output function in postprocessing.py. An AssertionError is raised when the code checks that the length of the output (complete_output) matches the number of document tokens. It happens for an example whose text contains only numbers.

The assertion fails only for instances like this one:

{
  "doc_id": "TEST-205",
  "doc_text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
  "entities": [
    {
      "entity_id": 0,
      "text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
      "label": "NUMEROS_OUTROS",
      "start_offset": 0,
      "end_offset": 54
    }
  ]
}
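To rule out a problem with the annotation itself, here is a minimal sketch (not code from the repository) that checks the instance above is self-consistent. The helper name check_entity_offsets is hypothetical, and it assumes end_offset is exclusive, as the value 54 for a 54-character text suggests.

```python
# Sanity check of the instance: do the entity offsets point at the entity text?
example = {
    "doc_id": "TEST-205",
    "doc_text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
    "entities": [
        {
            "entity_id": 0,
            "text": "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960",
            "label": "NUMEROS_OUTROS",
            "start_offset": 0,
            "end_offset": 54,
        }
    ],
}


def check_entity_offsets(doc):
    """Hypothetical helper: True if each entity text equals the slice
    doc_text[start_offset:end_offset] (end_offset treated as exclusive)."""
    text = doc["doc_text"]
    return all(
        text[ent["start_offset"]:ent["end_offset"]] == ent["text"]
        for ent in doc["entities"]
    )


offsets_ok = check_entity_offsets(example)
# Whitespace splitting is an assumption about how the document is tokenized.
num_whitespace_tokens = len(example["doc_text"].split())

print("offsets consistent:", offsets_ok)
print("whitespace tokens:", num_whitespace_tokens)
```

Under these assumptions the offsets are consistent and the text splits into 11 whitespace-separated tokens, so the input data does not look malformed.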

Any insight into the cause would be appreciated. Thank you.

The error:

File "D:\Anonimização\NER\postprocessing.py", line 157, in get_example_output
    assert len(complete_output) == len(self.examples[example_ix].doc_tokens), \
AssertionError: Length mismatch for example 169: [ 0  0  0  3  4  4  4  4  9 10 10 10 10 10  9 10 10 10 10 10] !=
             11 in example 169:

doc_id: TEST-205
orig_text:3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960
doc_tokens: [Token(text='3123', offset=0, index=0, tail=' ', tag=None), Token(text='0346', offset=5, index=1, tail=' ', tag=None), Token(text='2154', offset=10, index=2, tail=' ', tag=None), Token(text='8600', offset=15, index=3, tail=' ', tag=None), Token(text='0186', offset=20, index=4, tail=' ', tag=None), Token(text='5500', offset=25, index=5, tail=' ', tag=None), Token(text='1000', offset=30, index=6, tail=' ', tag=None), Token(text='0001', offset=35, index=7, tail=' ', tag=None), Token(text='6015', offset=40, index=8, tail=' ', tag=None), Token(text='3585', offset=45, index=9, tail=' ', tag=None), Token(text='0960', offset=50, index=10, tail='', tag=None)]

labels: ['B-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS', 'I-NUMEROS_OUTROS']

tags: [NETag(doc_id='HAREM-205', entity_id=0, text='3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960', type='NUMEROS_OUTROS', start_position=0, end_position=10)]

[array([ 0,  0,  0,  3,  4,  4,  4,  4,  9, 10, 10, 10, 10, 10,  9, 10, 10, 10, 10, 10])]
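To make the mismatch concrete, here is a minimal sketch (again, not code from the repository) that reproduces the condition checked by the assertion, using only the values printed in the error message. The function name length_mismatch is hypothetical, and splitting on whitespace is an assumption that happens to match the 11 doc_tokens listed above.

```python
# Values copied from the error message; a plain list stands in for the array.
complete_output = [0, 0, 0, 3, 4, 4, 4, 4, 9, 10,
                   10, 10, 10, 10, 9, 10, 10, 10, 10, 10]
doc_text = "3123 0346 2154 8600 0186 5500 1000 0001 6015 3585 0960"

# Assumption: whitespace splitting gives the same tokens as doc_tokens here.
doc_tokens = doc_text.split()


def length_mismatch(output, tokens):
    """Hypothetical helper: number of predictions minus number of tokens."""
    return len(output) - len(tokens)


extra = length_mismatch(complete_output, doc_tokens)
print(f"{len(complete_output)} predictions, {len(doc_tokens)} tokens, "
      f"{extra} extra predictions")
```

The output array has more entries than there are document tokens, which is exactly what the assertion rejects. I do not know yet where the extra entries come from.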

@fabiocapsouza @rodrigonogueira4
