Problem in tokens

#5
by maithakh - opened

Hello,
This is the code I executed. I am wondering why the words are broken into non-English tokens.

text = "Visible porosity appears moderate, characterized by minor to common mouldic, intraparticle and vuggy macropores, along with rare to minor grain and matrix-hosted microporosity. The measured value is agreeable with the observed volume. However, common cementation has degraded connectivity and common cracks and fractures may have affected the measured permeability value. Therefore, a poor to possibly moderate reservoir quality is inferred. Open cracks and fractures are interpreted to be artefacts from sample preparation."

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TokenClassificationPipeline,
)

model_name = "vblagoje/bert-english-uncased-finetuned-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

pipeline = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
outputs = pipeline(text)

for output in outputs:
    if output['entity'] in ['NOUN', 'ADJ']:
        print(output['word'], output['entity'])

This is the output I am getting:

visible ADJ
por NOUN
##osity NOUN
moderate ADJ
minor ADJ
common ADJ
mo NOUN
##uld NOUN
##ic ADJ
intra ADJ
##par NOUN
##tic NOUN
##le NOUN
vu ADJ
##ggy ADJ
macro NOUN
##pore NOUN
##s NOUN
##rar NOUN
##e ADJ
minor ADJ
grain NOUN
matrix NOUN
micro NOUN
##por NOUN
##osity NOUN
value NOUN
agree ADJ
##able ADJ
volume NOUN
common ADJ
cement NOUN
##ation NOUN
connectivity NOUN
common ADJ
cracks NOUN
fractures NOUN
per NOUN
##me NOUN
##ability NOUN
value NOUN
poor ADJ
moderate ADJ
reservoir NOUN
quality NOUN
open ADJ
cracks NOUN
fractures NOUN
artefacts NOUN
sample NOUN
preparation NOUN

Any idea what the reason is and how to solve this issue?
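For context, the `##` prefix marks WordPiece continuation pieces: BERT's tokenizer splits out-of-vocabulary words like "porosity" into subword units. Below is a minimal sketch of gluing those pieces back into whole words, using a few entries from the output above (the `merge_subwords` helper name is my own, not from transformers):

```python
# A few entries copied from the pipeline output above.
outputs = [
    {"word": "por", "entity": "NOUN"},
    {"word": "##osity", "entity": "NOUN"},
    {"word": "moderate", "entity": "ADJ"},
]

def merge_subwords(outputs):
    """Glue '##' continuation pieces onto the preceding token,
    keeping the first piece's tag for the whole word."""
    merged = []
    for out in outputs:
        if out["word"].startswith("##") and merged:
            merged[-1]["word"] += out["word"][2:]
        else:
            merged.append({"word": out["word"], "entity": out["entity"]})
    return merged

print(merge_subwords(outputs))
# → [{'word': 'porosity', 'entity': 'NOUN'}, {'word': 'moderate', 'entity': 'ADJ'}]
```

Recent transformers versions can also do this grouping for you via `TokenClassificationPipeline(..., aggregation_strategy="simple")`, in which case each result carries an `entity_group` key instead of `entity`.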
