---
license: other
language: it
tags:
- banks
- taxonomy
datasets:
- ABILab
widget:
- text: "Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto è una istanza positiva per il termine definizione budget aziendale"
  example_title: "Definition Recognition 1"
- text: "Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto è una istanza positiva per il termine lavorazione assegni tratti"
  example_title: "Definition Recognition 2"
---

# Overview

**ABILaBERT** was created to classify a text into one or more concepts of a *taxonomy* describing the banking domain. The taxonomy can be bank-specific or general to the domain knowledge: it is modeled for the text classifier through a pre-training process acting over the taxonomy itself.

In this work we consider the [ABILab Process Tree Taxonomy](https://www.abilab.it/tassonomia-processi-bancari) as a general, i.e., bank-independent, formalization of the processes currently active in the *Italian* banking eco-system. The objective of this taxonomy is to achieve a complete and shared mapping of a bank's processes, covering all areas of activity at a level of detail that can be considered common across different banks and financial organizations, without explicit reference to existing organizational structures, products offered, or delivery channels.

To remedy the complete absence of training data, we adopt a **Zero-Shot Learning** approach that relies on a semantic model: we exploit the taxonomy itself as a source of information, in particular by making explicit all relationships between concepts. In the proposed augmentation, we map individual taxonomic information (e.g., relations) into short texts that declare the corresponding semantic evidence (e.g., the correctness of a definition for a concept name, or the statement of the underlying hierarchical relation). We call the processes of recognizing this information, i.e., of accepting such texts as true, **Sub-Tasks**.

As a result, a dataset of more than 1 million examples was obtained, with which we trained the initial [gilberto-uncased-from-camembert](https://huggingface.co/idb-ita/gilberto-uncased-from-camembert). For more information on dataset creation, training, and the classification of a text into one or more concepts of the taxonomy, please refer to the paper (Margiotta et al., 2021), titled **"Knowledge-based neural pre-training for Intelligent Document Management"**, available at: [link](#). Here we describe only the use of the model for solving the Sub-Tasks employed in training.

# Sub-Tasks for Domain Specific Pre-training

The Sub-Tasks aim at acquiring domain knowledge implicitly from definitions and from relational texts, i.e., statements about direct subsumption relationships between concepts at different levels of the taxonomy. In particular, the model was trained to provide predictions for the following tasks (a sketch of both verbalization templates is given after the list):

* **Definition Recognition**: a description and the term of a concept in the taxonomy are paired, and the model is expected to recognize whether that association is true or false.
* **Subsumption Recognition**: hierarchical relations are mapped into composite sentences declaring the subsumption property over two concepts.
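As an illustration, the snippet below shows how a single taxonomy entry could be verbalized into the two templates used throughout this card. This is a minimal sketch: the helper functions and the data layout are hypothetical, not part of the released code; only the templates themselves ("definisce il termine", "è un concetto più specifico del concetto denominato") come from the examples in this card.

```python
# Hypothetical helpers: verbalize one taxonomy entry into sub-task sentences.

def definition_sentence(description, term):
    # Definition Recognition: "<description> definisce il termine <term>"
    return f"{description} definisce il termine {term}"

def subsumption_sentence(description, parent_term):
    # Subsumption Recognition:
    # "<description> è un concetto più specifico del concetto denominato <parent>"
    return f"{description} è un concetto più specifico del concetto denominato {parent_term}"

description = (
    "Processo di gestione del piano di budget attraverso l'individuazione "
    "delle regole di predisposizione, la predisposizione effettiva e il "
    "controllo del suo rispetto"
)
print(definition_sentence(description, "definizione budget aziendale"))
print(subsumption_sentence(description, "allocazione risorse e definizione del budget"))
```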
# Examples of Sub-Tasks

* An instance of the **Definition Recognition** sub-task is as follows:
  - **IT**: *"Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto definisce il termine definizione budget aziendale."*
  - **EN**: *"Process of managing the budget plan by identifying the rules of preparation, actual preparation and monitoring of its compliance defines the term corporate budget definition."*
* An instance of the **Subsumption Recognition** sub-task is as follows:
  - **IT**: *"Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto è un concetto più specifico del concetto denominato allocazione risorse e definizione del budget."*
  - **EN**: *"Process of managing the budget plan by identifying rules of preparation, actual preparation, and monitoring compliance is a concept more specific than the one named resource allocation and budget setting."*

# Code Example

The following is a brief Python snippet describing the correct use of the model for a prediction in the **Definition Recognition** task.

## Define a list of input sentences

We compose each example from two parts: "**banking_text**", a candidate definition of a concept, and "**concept**", the name of the concept whose association with that definition we want to verify.

* In this case the candidate definition "**banking_text**" is: "Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto"
* The "**concept**" for the first association is: "definizione budget aziendale" (TRUE)
* The "**concept**" for the second association is: "lavorazione assegni tratti" (FALSE)

```python
inputs = []
banking_text = (
    "Processo di gestione del piano di budget attraverso l'individuazione "
    "delle regole di predisposizione, la predisposizione effettiva e il "
    "controllo del suo rispetto"
)
concepts = ["definizione budget aziendale", "lavorazione assegni tratti"]  # ... add further candidate concepts here

for concept in concepts:
    # Instantiate the Definition Recognition template for each candidate concept.
    inputs.append(banking_text + " definisce il termine " + concept)
```
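The same construction works for the **Subsumption Recognition** task. A minimal sketch, assuming only the template shown in the examples above and reusing `banking_text`; the candidate parent concept is taken from that same example:

```python
# Hypothetical: build Subsumption Recognition inputs with the template
# "è un concetto più specifico del concetto denominato" from the examples section.
parent_concepts = ["allocazione risorse e definizione del budget"]

subsumption_inputs = []
for parent in parent_concepts:
    subsumption_inputs.append(
        banking_text + " è un concetto più specifico del concetto denominato " + parent
    )
```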
## Download the model

```python
from transformers import CamembertTokenizer, CamembertForSequenceClassification

tokenizer = CamembertTokenizer.from_pretrained("Abilab-Uniroma2/ABILaBERT")
model = CamembertForSequenceClassification.from_pretrained("Abilab-Uniroma2/ABILaBERT")
```

## Set the functions for prediction output

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generateDataLoader(sentences):
    # Tokenize the sentences, pad/truncate them to 256 tokens, and wrap them
    # in a sequential DataLoader.
    encoded_data_classifications = tokenizer.batch_encode_plus(
        sentences,
        add_special_tokens=True,
        return_attention_mask=True,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    input_ids_classifications = encoded_data_classifications['input_ids']
    attention_masks_classifications = encoded_data_classifications['attention_mask']
    # Dummy labels: only needed to keep the batch structure uniform at inference.
    labels_classifications = torch.tensor([0] * len(sentences))
    dataset_classifications = TensorDataset(input_ids_classifications,
                                            attention_masks_classifications,
                                            labels_classifications)
    return DataLoader(dataset_classifications,
                      sampler=SequentialSampler(dataset_classifications),
                      batch_size=16)

def prediction(dataloader_val):
    model.eval()
    predictions = []
    for batch in dataloader_val:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2],
                  }
        with torch.no_grad():
            outputs = model(**inputs)
        # Since labels are passed, outputs[0] is the loss and outputs[1] the logits.
        logits = outputs[1].detach().cpu().numpy()
        predictions.append(logits)
    # Concatenate the per-batch logits so indexing works for any number of inputs.
    return np.concatenate(predictions, axis=0)

def showPrediction(inputs, outputs):
    for i, x in enumerate(inputs):
        # Class 0 is FALSE, class 1 is TRUE: compare the two logits.
        if outputs[i][0] > outputs[i][1]:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[31mFALSE\x1b[0m')
        else:
            print(f'INPUT:\t{x}\nOUTPUT:\t\x1b[92mTRUE\x1b[0m')
```

## Get the output

```python
def modelPredictions(inputs):
    dataLoader = generateDataLoader(inputs)
    outputs = prediction(dataLoader)
    showPrediction(inputs, outputs)

modelPredictions(inputs)
```

* **INPUT**: Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto definisce il termine definizione budget aziendale
* **OUTPUT**: TRUE
* **INPUT**: Processo di gestione del piano di budget attraverso l'individuazione delle regole di predisposizione, la predisposizione effettiva e il controllo del suo rispetto definisce il termine lavorazione assegni tratti
* **OUTPUT**: FALSE
* ...
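If a confidence score is preferred over the hard TRUE/FALSE label, the raw logits can be turned into class probabilities with a softmax. This is a minimal sketch under the assumption, implied by `showPrediction` above, that class index 1 corresponds to TRUE; the helper name is hypothetical:

```python
# Hypothetical post-processing: convert the (N, 2) logits array to probabilities.
def predictionProbabilities(logits):
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

probs = predictionProbabilities(prediction(generateDataLoader(inputs)))
for sentence, p in zip(inputs, probs):
    print(f"{sentence}\n  P(TRUE) = {p[1]:.3f}")
```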