beautifulsoup4
bs4
docx2txt
newspaper3k
PyPDF2
regex
requests
requests-file
requests-oauthlib
torch
transformers
validators
nltk==3.7
sentence-transformers
rank-bm25
spacy_streamlit
altair<5