pyarrow pandas numpy arxiv sentence_transformers regex scikit-learn