NlpHUST
/

vibert4news-base-cased

Inference Endpoints

Model card Files Files and versions Community

vibert4news-base-cased / README.md

julien-c's picture

julien-c HF staff

Migrate model card from transformers-repo

27e7bc5 over 3 years ago

|

raw history blame

No virus

1.96 kB

	---
	language: vn
	---
	# BERT for Vietnamese is trained on more 20 GB news dataset

	Apply for task sentiment analysis on using [AIViVN's comments dataset](https://www.aivivn.com/contests/6)

	The model achieved 0.90268 on the public leaderboard, (winner's score is 0.90087)
	Bert4news is used for a toolkit Vietnames(segmentation and Named Entity Recognition) at ViNLPtoolkit(https://github.com/bino282/ViNLP)

	*************New Mar 11 , 2020 *************

	[BERT](https://github.com/google-research/bert) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).

	We use word sentencepiece, use basic bert tokenization and same config with bert base with lowercase = False.

	You can download trained model:
	- [tensorflow](https://drive.google.com/file/d/1X-sRDYf7moS_h61J3L79NkMVGHP-P-k5/view?usp=sharing).
	- [pytorch](https://drive.google.com/file/d/11aFSTpYIurn-oI2XpAmcCTccB_AonMOu/view?usp=sharing).

	Use with huggingface/transformers
	``` bash
	import torch
	from transformers import AutoTokenizer,AutoModel
	tokenizer= AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
	bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")

	line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
	input_id = tokenizer.encode(line,add_special_tokens = True)
	att_mask = [int(token_id > 0) for token_id in input_id]
	input_ids = torch.tensor([input_id])
	att_masks = torch.tensor([att_mask])
	with torch.no_grad():
	features = bert_model(input_ids,att_masks)

	print(features)

	```

	Run training with base config

	``` bash

	python train_pytorch.py \
	--model_path=bert4news.pytorch \
	--max_len=200 \
	--batch_size=16 \
	--epochs=6 \
	--lr=2e-5

	```

	### Contact information
	For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).