DarkBERT

A BERT-like model pretrained on a Dark Web corpus, as described in "DarkBERT: A Language Model for the Dark Side of the Internet" (ACL 2023).

Conditions

DarkBERT is available upon request. Users may submit a request using the request form, which asks for the user’s name, the user’s institution, an email address that matches the institution, and the purpose of usage (in as much detail as possible). We especially emphasize the email requirement: non-academic addresses such as Gmail, Tutanota, or ProtonMail are automatically rejected because they make it difficult for us to verify your affiliation with the institution.

By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this model is restricted to research and/or academic purposes only. Access to the model will be granted after the request is manually reviewed. A request may be declined if it does not sufficiently describe research purposes that follow the ACM Code of Ethics (https://www.acm.org/code-of-ethics).

The information provided by the requesting user will not be used in any way except for sending the dataset to the user and keeping track of request history for DarkBERT. By requesting the model, the user agrees to our collection of the provided information.

This model shall only be used for non-profit research purposes and in a manner consistent with fair practice. Do not redistribute this dataset to others. The user should indicate the source of this model (found in the Citation section at the bottom of the page) when using or citing the model in their research or article.

What is included?

The preprocessed version of DarkBERT.

Benchmark datasets in the benchmark-dataset directory.

Sample Usage

>>> from transformers import pipeline
>>> # Local folder containing the downloaded DarkBERT files
>>> folder_dir = "DarkBERT"
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")

[{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
{'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
{'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
{'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
{'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]
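
The pipeline returns a list of dictionaries, one per candidate token, sorted by score. The following is a minimal post-processing sketch, assuming the output structure shown above. The helper name `top_tokens` and the `min_score` threshold are hypothetical and not part of DarkBERT or transformers; the two sample entries are copied from the output above so the snippet runs without the model.

```python
def top_tokens(predictions, min_score=0.05):
    """Return (token, score) pairs whose score is at least min_score.

    `predictions` is assumed to be the list of dicts returned by the
    fill-mask pipeline for a single <mask>, as in the output above.
    """
    kept = []
    for pred in predictions:
        if pred["score"] >= min_score:
            # token_str keeps the leading space of the vocabulary entry,
            # so strip it before displaying the token
            kept.append((pred["token_str"].strip(), pred["score"]))
    return kept


# Two entries copied from the sample output above
sample = [
    {"score": 0.4952353239059448, "token": 25346, "token_str": " ransomware",
     "sequence": "RagnarLocker, LockBit, and REvil are types of ransomware."},
    {"score": 0.04661545157432556, "token": 16886, "token_str": " malware",
     "sequence": "RagnarLocker, LockBit, and REvil are types of malware."},
]

print(top_tokens(sample))  # [('ransomware', 0.4952353239059448)]
```

With the real pipeline, the same helper could be applied directly to the return value of `unmasker(...)`; the 0.05 threshold is an arbitrary illustration and would need tuning for any actual use.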

>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> # output[0] is the last hidden state: (batch size, number of tokens, hidden size)
>>> output[0].shape

torch.Size([1, 27, 768])

Citation

If you use the DarkBERT model, please cite the following paper:

Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.