BART

BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based model architecture introduced by Facebook AI. It combines elements of both bidirectional (like BERT) and autoregressive (like GPT) models into a single architecture.

Bidirectional Meaning

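The term "bidirectional" refers to the encoder's ability to use context from both sides of a token when encoding it. In a transformer encoder such as BERT's or BART's, this happens in a single pass: self-attention is not causally masked, so every token can attend to every other token in the input, whether it comes before or after. There are not two separate left-to-right and right-to-left passes.

It is still helpful to look at the two kinds of context separately:

Forward Encoding: This is the left-to-right view, in which each token draws on the tokens that precede it in the sequence. It lets the model capture contextual information from preceding tokens when encoding each token.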

Let's explain forward encoding in the context of a language model using a simplified example.

Imagine you have a language model tasked with predicting the next word in a sentence based on the words that came before it. Let's consider a simple example sentence:

"The cat sat on the ____."

In forward encoding:

The language model processes the sentence from left to right, one word at a time.
It starts by predicting the next word based on the words that have already been seen in the sentence.
For example, given the words "The cat sat on the", the model predicts the next word ("table") based on the context provided by the preceding words.
After predicting "table", it moves forward to predict the next word in the sequence, and so on, until the end of the sentence is reached.

In essence, this left-to-right view uses only the words that precede a position as context for it. A purely left-to-right language model such as GPT works this way; BART's encoder has access to this context and also to the right-hand context described next.

Backward Encoding: This is the right-to-left view, in which each token draws on the tokens that follow it in the sequence. It lets the model capture contextual information from subsequent tokens when encoding each token.

Let's consider a different example to illustrate backward encoding:

Imagine the model has to fill in a missing word in the middle of a sentence:

"Once upon a time, there was a ____ that barked at the mailman."

In backward encoding:

The model looks at the words to the right of the blank: "that barked at the mailman".
These following words carry the clue. Something that barks is very likely a dog, even though "there was a" on the left could be followed by almost anything.
Using the context on the right side, the model can therefore fill the blank with "dog".

In essence, backward encoding means using the words that come after a position as context for it. A bidirectional encoder combines this with the left-hand context, so the representation of every token reflects the whole input.
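
The difference between bidirectional and left-to-right context can be made concrete with attention masks. The following is a minimal sketch in plain Python, not BART's actual implementation; the function name `attention_mask` and the four-token sentence are made up for illustration. Entry `[i][j]` is 1 when token `i` is allowed to attend to token `j`.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["The", "cat", "sat", "down"]

# Encoder-style mask: every token sees every other token.
bidirectional = attention_mask(len(tokens), causal=False)

# Decoder-style mask: each token sees only itself and earlier tokens.
left_to_right = attention_mask(len(tokens), causal=True)

for name, mask in [("bidirectional", bidirectional), ("left-to-right", left_to_right)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the bidirectional mask every row is all ones, while the left-to-right mask is lower triangular: the first token sees only itself, and the last token sees the whole sequence.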

Auto-regressive Meaning

An autoregressive decoder, like the one used in models such as GPT (Generative Pre-trained Transformer), generates output sequentially, predicting one token at a time based on the previously generated tokens. This means that the model generates the output sequence in an autoregressive manner, where each token is generated conditionally on the tokens that have been generated before it.

Similarly, BART (Bidirectional and Auto-Regressive Transformers) also employs an autoregressive decoder for tasks like text generation and summarization. In the autoregressive decoding process:

The model predicts the next token in the sequence based on the tokens it has already generated.
It generates tokens one-by-one, iterating through the sequence until it reaches the desired length or generates an end-of-sequence token.

In summary, BART's decoder works like GPT's in that it generates output left to right, one token at a time, conditioned on the tokens generated so far. Unlike GPT, it also attends to the encoder's representation of the input through cross-attention, which is what ties the generated text to the source document in tasks like summarization.
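
The decoding loop described above can be sketched in a few lines. This is a toy illustration, not BART's decoder: `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a trained model, and the conditioning on the encoder output is left out to keep the loop visible.

```python
def greedy_decode(next_token_fn, start_token, end_token, max_length):
    """Generate tokens one at a time until end_token appears or max_length is reached."""
    generated = [start_token]
    while len(generated) < max_length:
        # The next token depends only on what has been generated so far.
        token = next_token_fn(generated)
        generated.append(token)
        if token == end_token:
            break
    return generated


# Hypothetical stand-in for a model: a fixed lookup on the last generated token.
TOY_TABLE = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}


def toy_next_token(prefix):
    return TOY_TABLE.get(prefix[-1], "</s>")


print(greedy_decode(toy_next_token, "<s>", "</s>", max_length=10))
```

A real decoder would return a probability distribution over the vocabulary at each step, and strategies such as beam search keep several candidate sequences instead of one.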

BART Components

Encoder-Decoder Architecture: BART follows the encoder-decoder architecture commonly used in sequence-to-sequence (seq2seq) models. The encoder processes the input sequence (text) bidirectionally, capturing contextual information from both directions. This bidirectional encoding helps BART understand the input text more comprehensively. The decoder then generates the output sequence autoregressively, one token at a time, based on the encoded input and previous tokens generated.
Bidirectional Encoder: BART's encoder is similar to the encoder used in BERT (Bidirectional Encoder Representations from Transformers). It processes the input text bidirectionally, allowing it to capture contextual information from both preceding and succeeding tokens in the input sequence. This bidirectional encoding helps BART understand the relationships between different parts of the input text.
Autoregressive Decoder: BART's decoder is similar to the decoder used in autoregressive models like GPT (Generative Pre-trained Transformer). It generates the output sequence autoregressively, predicting the next token in the sequence based on the previously generated tokens and the encoded input. This autoregressive decoding allows BART to generate coherent and contextually relevant output sequences.
Pre-training with Noising Function: BART is pre-trained using a denoising autoencoding objective. During pre-training, input text is corrupted with an arbitrary noising function, such as masking, shuffling, or dropping tokens. The model is then trained to reconstruct the original text from the corrupted input. This pre-training strategy encourages the model to learn robust representations of the input text and improves its ability to handle noisy or imperfect input during fine-tuning and inference.
Fine-tuning for Text Generation and Comprehension: BART is particularly effective when fine-tuned for text generation tasks such as summarization and translation. Its bidirectional encoder and autoregressive decoder make it well-suited for capturing contextual information and generating coherent and contextually relevant output sequences. Additionally, BART also performs well on comprehension tasks such as text classification and question answering, demonstrating its versatility and effectiveness across a range of natural language processing tasks.

In summary, BART is a transformer-based model architecture that combines bidirectional encoding with autoregressive decoding. It is pre-trained using a denoising autoencoding objective and is effective for both text generation and comprehension tasks.

Noising Functions in BART

Noising functions are used during the pre-training phase of models like BART to corrupt the input text; the model is then trained to reconstruct the original, which pushes it to learn robust representations. The BART paper describes the following noising functions:

Token Masking: Random tokens in the input text are replaced with a special "mask" token, as in BERT. The model is trained to predict the original tokens that were masked out.
Sentence Permutation (Shuffling): The document is split into sentences, and the sentences are shuffled into a random order. The model is trained to restore the original order, which helps it learn the structure and dependencies across a longer text. Note that BART shuffles whole sentences, not individual tokens.
Token Deletion (Token Dropout): Random tokens are removed from the input text. Unlike masking, nothing marks where a token used to be, so the model must also work out which positions are missing.
Text Infilling: Spans of text are sampled, with span lengths drawn from a Poisson distribution, and each span is replaced with a single "mask" token regardless of its length. The model is still trained to reconstruct the original text, but it must also predict how many tokens are missing behind each mask.
Document Rotation: A token is chosen at random and the document is rotated so that it begins with that token. The model is trained to identify the true start of the document.

The goal of using these functions is to expose the model to a diverse range of corrupted inputs, which helps it learn more robust and generalizable representations of the text. The released BART models were pre-trained with a combination of text infilling and sentence permutation.
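
To make these corruptions concrete, here is a small sketch on word-level tokens. It is an illustration under simplifying assumptions, not the pre-training code: all function names are made up, the probabilities are arbitrary, and `text_infilling` takes the span position and length as arguments instead of sampling them.

```python
import random

MASK = "<mask>"


def token_masking(tokens, prob, rng):
    """Replace each token with MASK independently with probability prob."""
    return [MASK if rng.random() < prob else t for t in tokens]


def token_deletion(tokens, prob, rng):
    """Drop each token independently with probability prob."""
    return [t for t in tokens if rng.random() >= prob]


def text_infilling(tokens, start, span_length):
    """Replace tokens[start:start + span_length] with a single MASK token."""
    return tokens[:start] + [MASK] + tokens[start + span_length:]


def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, start):
    """Rotate the token list so that it begins at position start."""
    return tokens[start:] + tokens[:start]


tokens = "the cat sat on the mat".split()
rng = random.Random(0)  # fixed seed so repeated runs give the same corruption

print(token_masking(tokens, 0.3, rng))
print(token_deletion(tokens, 0.3, rng))
print(text_infilling(tokens, 1, 3))   # three tokens collapse into one mask
print(text_infilling(tokens, 2, 0))   # a zero-length span inserts a mask
print(document_rotation(tokens, 2))
print(sentence_permutation(["First.", "Second.", "Third."], rng))
```

Note how text infilling changes the sequence length: the model sees one mask and cannot tell from the input alone whether it hides zero, one, or several tokens.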

BART - Base Model

The "BART base" model is the smaller of the two checkpoint sizes released for BART: it has 6 encoder and 6 decoder layers (roughly 140 million parameters), while "BART large" has 12 of each (roughly 400 million parameters). In this document, "BART base" also means the checkpoint as released, before any fine-tuning on downstream tasks has been applied.

This base model is pre-trained on large text corpora using the denoising autoencoding objective but has not been fine-tuned for specific tasks such as text summarization, translation, or question answering.

After pre-training, the BART base model can be further fine-tuned on downstream tasks by continuing training on task-specific datasets. Fine-tuning allows the model to adapt its pre-learned representations to the specific characteristics of the target task, resulting in improved performance on that task.

Experiment with BART Base Model for Text Summarization

# Import necessary libraries
from transformers import BartTokenizer, BartForConditionalGeneration

# Load pre-trained BART tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Load pre-trained BART model for conditional generation
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Input text for summarization
input_text = """
Queenie: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
Rebecca: I found it would be a good idea to get a check-up.
Queenie: Yes, well, you haven't had one for 5 years. You should have one every year.
Rebecca: I know. I figure as long as there is nothing wrong, why go see the doctor?
Queenie: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
Rebecca: Ok.
Queenie: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
Rebecca: Yes.
Queenie: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
Rebecca: I've tried hundreds of times, but I just can't seem to kick the habit.
Queenie: Well, we have classes and some medications that might help. I'll give you more information before you leave.
Rebecca: Ok, thanks doctor.
"""

# Tokenize input text
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate summary
summary_ids = model.generate(input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode and print the generated summary
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Generated Summary:", summary_text)

Explanation:

Import Libraries:
- We import the necessary libraries from the transformers package, including BartTokenizer and BartForConditionalGeneration, which are required for tokenization and model loading.
Load Pre-trained BART Model:
- We load the pre-trained BART tokenizer (BartTokenizer.from_pretrained) and BART model for conditional generation (BartForConditionalGeneration.from_pretrained) from the "facebook/bart-base" checkpoint.
Input Text for Summarization:
- We define a multi-line string (input_text) containing a conversation between two speakers, labeled Queenie and Rebecca, who play the roles of Doctor Hawkins and the patient, Mr. Smith. This serves as input text for the summarization task.
Tokenize Input Text:
- We tokenize the input text using the BART tokenizer (tokenizer.encode). The return_tensors="pt" parameter specifies that the tokenized inputs should be returned as PyTorch tensors.
Generate Summary:
- We use the pre-trained BART model to generate a summary (model.generate) based on the tokenized input text (input_ids). Parameters such as max_length, min_length, length_penalty, num_beams, and early_stopping are provided to control the generation process.
Decode and Print Generated Summary:
- We decode the generated summary (tokenizer.decode) to convert the token IDs back into human-readable text and print the result.

Generated Summary:

Queenie: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?Rebecca: I found it would be a good idea to get a check-up. Have you had a heart attack in the past 5 years?Queenie (in a low voice): Yes, well, you haven't had one for 5 years. You should have one every year. Are you sure you don't have a heart disease, Mr?ReRebecca? I know. I figure as long as there is nothing wrong, why go see the doctor? Are you serious about your heart disease?Q: Yes, I am serious about my heart disease.Q: What can I do to help you?QQueenie

Issue Description:

The generated output closely resembles the input text and fails to provide a concise and informative summary of the conversation. Instead of condensing the information and highlighting the main points, it copies the opening turns of the dialogue almost verbatim and adds lines that never occurred in it (for example, the questions about a heart attack). This behavior is expected from a checkpoint that has only been pre-trained: the denoising objective teaches the model to reconstruct its input, not to shorten it. Summarization has to be learned through fine-tuning.

Fine Tuning with Dialog Dataset

Dialog Dataset Used:

Dialog ~ Summary : A personally curated dataset, designed to include a variety of chat styles and topics, ensuring robustness and versatility in the model's performance.
DialogSum Dataset : A specialized dataset for dialogue summarization, offering diverse conversational examples.
Samsum dataset for chat summarization : This dataset comprises scripted chat conversations with associated human-written summaries, providing a rich ground for training and validating summarization models.

Training Dataset

DialogSum Dataset:
- Description: The DialogSum dataset consists of three subsets: train, val, and test. Each subset contains dialogs paired with one summary in JSONL format.
- Conversion Process:
  - The train and val subsets were parsed and converted into a pandas DataFrame, with two columns: 'dialog' and 'summary'.
  - Since the dialogs contain placeholders like #Person1# and #Person2#, a mapping from placeholder names to actual person names was created using a predefined name list.
  - The placeholders were replaced with actual person names in the 'dialog' column.
- Outcome: Three DataFrames were created, each containing dialogs paired with their corresponding summaries, suitable for model training.
Samsum Dataset:
- Description: The Samsum dataset includes train, val, and test sets in JSON format, each containing dialogs and their summaries.
- Conversion Process:
  - The train, val, and test sets were parsed and converted into pandas DataFrames, each with two columns: 'dialog' and 'summary'.
- Outcome: Three DataFrames were created, each containing dialogs paired with their corresponding summaries, suitable for model training.
Dialog ~ Summary Dataset:
- Description: The Dialog ~ Summary dataset consists of a text dataset where each line contains a dialog followed by a summary separated by a tilde (~).
- Conversion Process:
  - The text dataset was parsed, and each line was split at the tilde (~) to separate the dialog and summary components.
  - The dialog and summary pairs were then stored in a pandas DataFrame with two columns: 'dialog' and 'summary'.
- Outcome: A DataFrame was created containing dialogs paired with their corresponding summaries, facilitating further analysis and model training.
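
The two less obvious conversion steps, replacing speaker placeholders and splitting tilde-separated lines, could look roughly like the sketch below. It is not the original conversion code: `NAME_LIST` is a hypothetical stand-in for the predefined name list, the placeholders are assumed to look like `#Person1#`, and splitting at the last tilde is an assumption about how lines containing more than one tilde should be handled.

```python
import random
import re

# Hypothetical stand-in; the actual predefined name list is not given in this document.
NAME_LIST = ["Alice", "Miles", "Amelia", "Samuel", "Quinn"]

PLACEHOLDER = re.compile(r"#Person\d+#")


def replace_placeholders(dialog, names, rng):
    """Map each distinct placeholder in a dialog to a distinct, randomly chosen name."""
    placeholders = sorted(set(PLACEHOLDER.findall(dialog)))
    chosen = rng.sample(names, len(placeholders))  # needs at least as many names as speakers
    mapping = dict(zip(placeholders, chosen))
    return PLACEHOLDER.sub(lambda match: mapping[match.group(0)], dialog)


def parse_tilde_line(line):
    """Split a 'dialog ~ summary' line at the last tilde into a record."""
    dialog, summary = line.rsplit("~", 1)
    return {"dialog": dialog.strip(), "summary": summary.strip()}


sample = "#Person1#: Hi, how are you?\n#Person2#: Fine, thanks."
print(replace_placeholders(sample, NAME_LIST, random.Random(0)))

record = parse_tilde_line("Ann: Lunch at noon? Bob: Sure. ~ Ann and Bob agree to have lunch at noon.")
print(record)
```

A list of such records can be passed directly to `pandas.DataFrame(records)` to obtain the two-column DataFrame described above.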

After processing and consolidating multiple datasets, a training dataset was created by concatenating the following datasets:

DialogSum Dataset (Train and Validation):
- Number of Dialogs: 12,460 (Train) + 500 (Validation)
- Description: The DialogSum dataset contains dialogs paired with one summary per dialog in JSONL format.
- Total Dialogs: 12,960
Samsum Dataset (Train and Validation):
- Number of Dialogs: 14,732 (Train) + 818 (Validation)
- Description: The Samsum dataset includes dialogs and summaries in JSON format.
- Total Dialogs: 15,550
Dialog ~ Summary Dataset (Train):
- Number of Dialogs: 909
- Description: The Dialog ~ Summary dataset consists of dialogs followed by summaries separated by a tilde (~).
- Total Dialogs: 909

Total Training Dataset:

Total number of dialogs after concatenation: 12,960 (DialogSum) + 15,550 (Samsum) + 909 (Dialog ~ Summary) = 29,419 dialogs.

Index	Dialogue	Summary
0	Miles: Hi, Mr. Smith. I'm Doctor Hawkins. Why ...	Mr. Smith's getting a check-up, and Doctor Hawkins is his doctor.
1	Alice: Hello Mrs. Parker, how have you been?\n...	Mrs. Parker takes Ricky for his vaccines, and Dr. Parker administers the vaccines.
2	Amelia: Excuse me, did you see a set of keys?...	Amelia is looking for a set of keys and asks for help from nearby individuals.
3	Samuel: Why didn't you tell me you had a girlf...	Samuel's angry because Luna didn't tell Samuel she had a boyfriend.
4	Quinn: Watsup, ladies! Y'll looking'fine tonig...	Malik invites Nikki to dance, and Nikki agrees if her friends join too.
...	...	...
29414	Ravi: I've been experimenting with cooking dif...	Ravi tells Mei about his culinary experiments and offers to share recipes.
29415	Sophie: I'm working on a project to clean up o...	Sophie discusses her project to clean up the local park and seeks volunteers.
29416	Neil: I've been exploring historical novels re...	Neil shares his interest in historical novels and recommends some titles to Mary.
29417	Grace: I started a blog about sustainable livi...	Grace mentions her new blog on sustainable living and invites feedback from friends.
29418	Lena: I've been learning sign language. It's a...	Lena talks about learning sign language, and Mr. Brown expresses appreciation for her efforts.

The combined training dataset contains a total of 29,419 dialogs, with each dialog paired with its corresponding summary. This dataset is now ready for further preprocessing and model training to develop natural language processing models, such as text summarization models.

Making Training Dataset Ready for Finetuning

from torch.utils.data import Dataset

class SummaryDataset(Dataset):
    # Initialize the dataset with a tokenizer, data, and maximum token length
    def __init__(self, tokenizer, data, max_length=512):
        self.tokenizer = tokenizer  # Tokenizer for encoding text
        self.data = data            # Data containing dialogues and summaries
        self.max_length = max_length # Maximum length of tokens

    # Return the number of items in the dataset
    def __len__(self):
        return len(self.data)

    # Retrieve an item from the dataset by index
    def __getitem__(self, idx):
        item = self.data.iloc[idx]  # Get the row at the specified index
        dialogue = item['dialogue'] # Extract dialogue from the row
        summary = item['summary']   # Extract summary from the row

        # Encode the dialogue as input data for the model
        source = self.tokenizer.encode_plus(
            dialogue, 
            max_length=self.max_length, 
            padding='max_length', 
            return_tensors='pt', 
            truncation=True
        )

        # Encode the summary as target data for the model
        target = self.tokenizer.encode_plus(
            summary, 
            max_length=self.max_length, 
            padding='max_length', 
            return_tensors='pt', 
            truncation=True
        )

        # Return a dictionary containing input_ids, attention_mask, labels, and the original summary text
        return {
            'input_ids': source['input_ids'].flatten(),
            'attention_mask': source['attention_mask'].flatten(),
            'labels': target['input_ids'].flatten(),
            'summary': summary 
        }

Tokenization Process Explanation:

Initialization: The SummaryDataset class is initialized with a tokenizer, data (containing dialogues and summaries), and a maximum token length.
Retrieve Data: When an item is requested from the dataset (__getitem__ method), the dialogue and summary from the dataset are retrieved based on the index.
Tokenization: The dialogue and summary are tokenized separately using the tokenizer's encode_plus method.
- The encode_plus method tokenizes the text, adds special tokens (for BART, <s> at the start and </s> at the end), and returns a dictionary containing the tokenized input_ids and attention_mask tensors.
Padding and Truncation: The tokenized sequences are padded to the maximum token length and truncated if they exceed it. This ensures that all sequences have the same length.
Return: The tokenized dialogue, attention_mask, and tokenized summary are returned as a dictionary along with the original summary text.
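
One detail worth checking in a setup like this: the labels are padded to max_length, so they contain pad token IDs, and those positions count toward the training loss. Hugging Face sequence-to-sequence models ignore label positions set to -100 when computing the loss, so a common variation (not part of the dataset class above) is to replace padding in the labels with -100. A minimal sketch on plain lists, with made-up token IDs and a hypothetical helper name:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_in_labels(label_ids, pad_token_id):
    """Replace pad token IDs in the labels so they do not contribute to the loss."""
    return [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in label_ids]


# Made-up IDs for illustration; 1 plays the role of the pad token here.
labels = [0, 713, 16, 2, 1, 1, 1]
print(mask_padding_in_labels(labels, pad_token_id=1))
```

With tensors, the same effect is usually achieved in one line, for example `labels[labels == tokenizer.pad_token_id] = -100`. Applying this would change the reported loss values, since padded positions are easy to predict and pull the average loss down.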

Code Snippet and Documentation:

# Encode the dialogue as input data for the model
source = self.tokenizer.encode_plus(
    dialogue, 
    max_length=self.max_length, 
    padding='max_length', 
    return_tensors='pt', 
    truncation=True
)

# Encode the summary as target data for the model
target = self.tokenizer.encode_plus(
    summary, 
    max_length=self.max_length, 
    padding='max_length', 
    return_tensors='pt', 
    truncation=True
)

Explanation:

encode_plus method is called on the tokenizer to tokenize the dialogue and summary separately.
dialogue and summary are passed as input to tokenize.
max_length specifies the maximum length of the tokenized sequences.
padding='max_length' pads sequences to the maximum length specified.
return_tensors='pt' returns PyTorch tensors.
truncation=True truncates sequences that exceed the maximum length.
The resulting tokenized sequences are stored in source and target dictionaries.

This code snippet tokenizes the dialogue and summary texts using the tokenizer, ensuring that they are formatted appropriately for input to the model during training. The tokenized sequences are then padded and truncated as necessary before being returned as tensors.

NOTE:

The encode and encode_plus methods are both provided by the tokenizer classes in the Hugging Face transformers library. They encode text inputs into numerical representations suitable for input to transformer-based models like BART.

Here's the difference between the two methods:

encode Method:
- The encode method encodes a single text input (or a pair of texts).
- It takes the input text and converts it into a sequence of token IDs.
- The returned output is a list of token IDs representing the input text (or a tensor if return_tensors is set). No attention mask is returned.
- Additional parameters like max_length, truncation, and padding can be specified to control the length of the encoded sequence.
encode_plus Method:
- The encode_plus method also encodes a single text input or a pair of texts, but returns additional information such as the attention mask. Batches of texts are handled by batch_encode_plus or by calling the tokenizer object directly.
- It accepts the same padding and truncation parameters as encode.
- It returns a dictionary-like object containing input_ids and attention_mask, plus token type IDs for models that use segment embeddings.
- In recent versions of transformers, calling the tokenizer directly, as in tokenizer(text, ...), is the recommended way to get the same output.

In summary, both methods tokenize and encode text the same way. encode returns only the token IDs, while encode_plus also returns the attention mask needed when sequences are padded, which is why it is used in the dataset class above.

ATTENTION MASK IN encode_plus

The attention mask is a binary tensor used in transformer-based models like BART to indicate which tokens should be attended to and which ones should be ignored during the self-attention mechanism.

In the encode_plus method of a Hugging Face transformers tokenizer, the attention mask is automatically generated based on the encoded input sequence. It has the same length as the input sequence and consists of 1s and 0s. Here's what the attention mask indicates:

1: Tokens that should be attended to by the model.
0: Tokens that should be ignored (masked) by the model.

The attention mask helps the model focus on the relevant tokens in the input sequence while ignoring the padding tokens. This is particularly important when dealing with input sequences of varying lengths, as it ensures that the model doesn't pay attention to padded tokens, which don't contain meaningful information.

For example, consider a BART input padded to six tokens: <s> Hello world </s> <pad> <pad>. The attention mask for this sequence would be [1, 1, 1, 1, 0, 0], indicating that the model should attend to the first four tokens (<s>, Hello, world, </s>) and ignore the two padding tokens (<pad>).

In summary, the attention mask helps improve the efficiency and effectiveness of transformer-based models by guiding their attention mechanism to focus on the relevant parts of the input sequence while disregarding padding tokens.
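
The relationship between padding and the attention mask can be sketched without any library. This is an illustration of the idea, not the tokenizer's implementation: `pad_with_mask` is a hypothetical helper, the token IDs are illustrative, and unlike a real tokenizer the simple truncation here does not preserve the end-of-sequence token.

```python
def pad_with_mask(token_ids, max_length, pad_token_id):
    """Truncate or pad token_ids to max_length and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # real tokens are attended to
    padding = max_length - len(ids)
    return ids + [pad_token_id] * padding, mask + [0] * padding  # padding is ignored


# Illustrative IDs: 0 and 2 stand for <s> and </s>, 1 for <pad>.
input_ids, attention_mask = pad_with_mask([0, 31414, 232, 2], max_length=6, pad_token_id=1)
print(input_ids)
print(attention_mask)
```

The mask always has the same length as the padded sequence, with one entry per position.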

Fine-tune BART base model

Training Arguments

from transformers import TrainingArguments

# Define training arguments for the model
training_args = TrainingArguments(
    output_dir='./results',          # Directory to save model output and checkpoints
    num_train_epochs=2,              # Number of epochs to train the model
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Weight decay for regularization
    logging_dir='./logs',            # Directory to save logs
    logging_steps=10,                # Log metrics every specified number of steps
    evaluation_strategy="epoch",     # Evaluation is done at the end of each epoch
    report_to='none'                 # Disables reporting to any online services (e.g., TensorBoard, WandB)
)

output_dir='./results': Specifies the directory where model checkpoints and output files will be saved during training.
num_train_epochs=2: Defines the number of epochs for training the model, indicating how many times the entire training dataset will be passed through the model.
per_device_train_batch_size=8: Sets the batch size per device (e.g., GPU) during training, controlling the number of training samples processed simultaneously on each device.
per_device_eval_batch_size=8: Defines the batch size per device for evaluation, indicating the number of samples evaluated simultaneously on each device during model evaluation.
warmup_steps=500: Specifies the number of warmup steps for the learning rate scheduler, determining the initial optimization steps during which the learning rate increases gradually.
weight_decay=0.01: Sets the weight decay coefficient for regularization, controlling the amount of regularization applied to the model's weights during optimization.
logging_dir='./logs': Defines the directory where logs, including training metrics and evaluation results, will be saved during training.
logging_steps=10: Specifies the frequency at which training metrics will be logged, indicating how often (in number of steps) metrics will be recorded during training.
evaluation_strategy="epoch": Selects when evaluation runs. "epoch" evaluates at the end of each epoch, "steps" evaluates at fixed step intervals, and "no" disables evaluation. Recent transformers releases rename this argument to eval_strategy.
report_to='none': Specifies where training progress and results will be reported, with "none" indicating that reporting to any online services (e.g., TensorBoard, Weights & Biases) is disabled.
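
To get a feel for what these numbers mean in practice, the sketch below works out the number of optimizer steps and the learning rate during warmup. Several assumptions go into it that are not stated in this document: all 29,419 dialogs are used for training, training runs on a single device without gradient accumulation, and the learning rate and scheduler are left at the TrainingArguments defaults (5e-5 with a linear schedule). The helper names are made up.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps in one epoch (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


per_epoch = steps_per_epoch(29_419, 8)   # assumes the full combined dataset is the training set
total = per_epoch * 2                    # num_train_epochs=2

print("steps per epoch:", per_epoch)
print("total steps:", total)
for step in (0, 250, 500, total // 2, total):
    print(step, linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total))
```

Under these assumptions there are 3,678 steps per epoch and 7,356 in total, so the 500 warmup steps cover roughly the first 7% of training, and logging_steps=10 produces several hundred log entries per epoch.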

Issue and Solution

➜ Issue:

ImportError Traceback (most recent call last)
in <cell line: 1>()
----> 1 training_args = TrainingArguments(output_dir="test-trainer")

4 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
1670 if not is_sagemaker_mp_enabled():
1671 if not is_accelerate_available(min_version="0.20.1"):
-> 1672 raise ImportError(
1673 "Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U"
1674 )

ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U

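➜ Solution: TrainingArguments fails on Colab when the installed accelerate package is missing or older than 0.20.1. As the error message says, run pip install accelerate -U (or pip install transformers[torch]), then restart the runtime so that the updated package is picked up.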

Training Process

The code initializes a Trainer object with the specified model, training arguments, and datasets. It then starts the training process by calling the train() method on the Trainer object.

from transformers import Trainer

# Initializing the Trainer object
trainer = Trainer(
    model=model,                             # The model to be trained
    args=training_args,                      # Training arguments
    train_dataset=train_dataset,             # Training dataset
    eval_dataset=eval_dataset                # Evaluation dataset
)

# Starting the training process
trainer.train()

Training Result

Epoch	Training Loss	Validation Loss
1	0.095800	0.084215
2	0.075400	0.081112

This table provides a summary of the training and validation losses for each epoch during the training process.

Evaluation with ROUGE Score

Metric	Bound	Precision	Recall	F-Measure
rouge1	low	0.5203	0.4547	0.4632
	mid	0.5354	0.4689	0.4753
	high	0.5502	0.4824	0.4874
rouge2	low	0.2507	0.2160	0.2205
	mid	0.2656	0.2292	0.2331
	high	0.2808	0.2428	0.2459
rougeL	low	0.4318	0.3784	0.3843
	mid	0.4465	0.3907	0.3964
	high	0.4613	0.4039	0.4090
rougeLsum	low	0.4324	0.3770	0.3830
	mid	0.4463	0.3903	0.3960
	high	0.4616	0.4031	0.4075

Now, let's interpret the scores:

Precision: It measures the proportion of the generated summary's n-grams (or, for ROUGE-L, its words) that also appear in the reference summary. A higher precision indicates that less of the generated summary is extraneous.
Recall: It measures the proportion of the reference summary's n-grams that are captured by the generated summary. A higher recall indicates that more of the reference content is covered.
F-Measure: It is the harmonic mean of precision and recall, so it is high only when both precision and recall are high.

Interpreting the scores:

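Higher scores are better for all metrics; a score of 1 means the generated summary matches the reference exactly under that metric.
The low, mid, and high rows are not separate categories of performance. This layout matches the output of a bootstrap aggregator (such as the one in the rouge_score package), where mid is the aggregate score over the evaluation set and low and high are the bounds of a confidence interval around it. The mid row is the one to report.
The gap between low and high is narrow (roughly 0.02 to 0.03), which indicates that the estimates are stable across the evaluation set.

Overall, with a ROUGE-1 F-measure of about 0.48 and a ROUGE-2 F-measure of about 0.23, the fine-tuned BART base model produces summaries that share a substantial amount of wording with the references. Precision is consistently higher than recall (0.54 versus 0.47 for ROUGE-1), which suggests the generated summaries tend to leave out some of the content in the reference summaries, so there is still room for improvement in coverage.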

NOTE - ROUGE Score:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference or gold-standard summaries. ROUGE scores measure the overlap of n-grams (contiguous sequences of n words) between the generated summary and the reference summary.

There are several variants of ROUGE metrics, including ROUGE-N, ROUGE-L, and ROUGE-W. Here's a brief explanation of each:

ROUGE-N (ROUGE-Ngram): This metric computes the overlap of n-grams between the generated summary and the reference summary. ROUGE-N scores are calculated for different values of n (e.g., unigrams, bigrams, trigrams) to capture different levels of phrase overlap. To make this concrete, suppose you and a friend each summarize the same movie, and your friend's summary is treated as the reference. ROUGE-2 (bigrams) counts how many pairs of adjacent words appear in both summaries. The more key phrases and wording the two share, the higher the ROUGE-N score.
- Example: Let's say your summary includes the phrase "thrilling action scenes," and your friend's summary mentions "exciting action scenes." Both contain the bigram "action scenes," which counts as one match toward ROUGE-2.
ROUGE-L (ROUGE-Longest Common Subsequence): This metric measures the longest common subsequence (LCS) between the generated summary and the reference summary, that is, the longest sequence of words that appears in both in the same order, though not necessarily contiguously.
- Example: If your summary says "The movie features amazing special effects," and your friend's summary says "Special effects in the movie are stunning," the longest common subsequence has two words ("special effects" or, equally, "the movie"). The summaries share four words, but because the word order differs the LCS is only two, so ROUGE-L is lower than the unigram overlap would suggest.
ROUGE-W (Weighted Longest Common Subsequence): This metric extends ROUGE-L by weighting the LCS so that consecutive matches score higher than matches scattered across the text.
- Example: Against the reference "The movie has fantastic visuals," the summaries "This film has fantastic visuals" and "Has truly fantastic, colorful visuals" both share the three-word LCS "has fantastic visuals" and get the same ROUGE-L score, but ROUGE-W scores the first one higher because its matching words are consecutive.
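To make the ROUGE-N and ROUGE-L definitions concrete, here is a simplified, from-scratch sketch. It is not the implementation behind the scores reported above: real ROUGE packages add details such as stemming and sentence splitting (for rougeLsum). The helper names and the lowercase alphanumeric tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prf(overlap, candidate_len, reference_len):
    """Precision, recall, and F1 from an overlap count."""
    p = overlap / candidate_len if candidate_len else 0.0
    r = overlap / reference_len if reference_len else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def ngrams(tokens, n):
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram to its smaller count.
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = tokenize("The movie features amazing special effects")
reference = tokenize("Special effects in the movie are stunning")
for name, score in [
    ("rouge1", rouge_n(candidate, reference, 1)),
    ("rouge2", rouge_n(candidate, reference, 2)),
    ("rougeL", rouge_l(candidate, reference)),
]:
    print(name, "P=%.3f R=%.3f F=%.3f" % score)
```

On this pair, four unigrams and two bigrams overlap, while the LCS has length two, which shows how ROUGE-L penalizes shared words that appear in a different order.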

ROUGE scores are typically reported as F1 scores, which are harmonic means of precision and recall. A higher ROUGE score indicates better agreement between the generated summary and the reference summary, with perfect agreement resulting in a score of 1.0.

ROUGE scores are widely used in the evaluation of text summarization systems and other natural language processing tasks where automatic evaluation of generated text is required. They provide objective measures of summary quality that can be used to compare different summarization models or tuning parameters.

Inference

The test conversation:

Web Developer (You): Hey, I just launched a new website with some exciting features. Would you like to check it out?
Machine Learning Enthusiast: That sounds interesting! I'd love to see how you've integrated machine learning into it.
Computer Science Student: Speaking of machine learning, have you heard about the latest breakthroughs in natural language processing?
Science Enthusiast: Yes, I've been following those developments closely. It's amazing how AI is transforming language understanding.
Mathematics Enthusiast: Absolutely! The mathematical foundations of deep learning play a crucial role in these advancements.
News Enthusiast: By the way, did you catch the latest headlines? There's a lot happening in the world right now.
Web Developer (You): I did! In fact, my website can recommend personalized news articles based on user preferences.
Clinical Medical Assistant: That's impressive! Speaking of recommendations, have you worked on any projects related to healthcare?
Machine Learning Enthusiast: Yes, I did a project on hybrid acoustic and facial emotion recognition, which could have applications in mental health.
Computer Science Student: That's fascinating! It's incredible how our interests and expertise intersect across various fields of study and technology.

Generated Summary:

Web Developer (You) introduces the latest breakthroughs in natural language processing to Computer Science Student. Science Enthusiast thinks AI is transforming language understanding and the mathematical foundations of deep learning play a crucial role in these advancements. Web Developer's website can recommend personalized news articles based on user preferences.

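Overall, the summary is a condensed version of the conversation, but it has clear weaknesses. It misattributes statements: the Computer Science Student, not the Web Developer, raised the breakthroughs in natural language processing, and the Mathematics Enthusiast, not the Science Enthusiast, made the point about the mathematical foundations of deep learning. It also omits the healthcare and emotion recognition discussion entirely. Both coverage and coherence could be improved.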