data_dynamos4 / CORR_TEXT_ZOOM_NLP_TEXT_CLASSIFICATION.txt
Hi there! Good afternoon. How are you doing? Okay. So, as you saw, let's give the rest of the class a little time to join. Can you see my slides? Alright, let's kick off. Today we are going to talk a little bit about text classification, which is going to be the first of the many natural language processing applications that we are going to dive into. As a reminder from the previous session, we talked about language modeling. Language modeling is important nowadays because most of the representations used for natural language processing are based on some type of language model. Language modeling is a simple-looking problem in which, given a sentence, you try to predict the next word of the sentence, or to predict some of the words in the sentence that have been removed. This may not seem very interesting, but it is. The idea is that if a language model is able to do this, it is because it has a deep understanding of the language; otherwise it would not be able to predict properly, because it needs to follow the grammatical rules and understand the semantics of the sentence. The first attempts at language modeling were based on simplistic methods using n-grams, as we covered in the forum session. The idea is to try to predict the probability of the next word given the two previous words: you basically count the number of times that the full sequence appears in the corpus and divide by the number of times that the preceding words appear in the dataset. As you have seen in the forum session, it works a little bit, but when you try to learn from really long sequences, it fails. This is the main limitation that we are now able to address with different deep learning architectures. Thanks to these architectures, we are able to properly model long sequences of information. The first family of models directly captures the sequential nature of your data; in this case, the text is a sequence of words. The most well-known models in this regard are probably Recurrent Neural Networks, and in particular Long Short-Term Memory networks. With these, we are able to start creating representations that, when applied to other scenarios, are good enough. Even better, there is a newer architecture from 2017 called the Transformer. The idea is that, instead of trying to model the text by running through it sequentially, it extracts content by means of self-attention, modeling the relationships between the different words in a sentence. With this, we get a great representation of our data. This is the idea, for example, behind BERT, and also behind the GPTs: GPT-1, GPT-2, and GPT-3. With this Transformer architecture, we are able to properly do language modeling, which is helpful to generate new information. Text classification is a typical scenario: for example, if you give one of these interfaces a couple of documents and ask it to classify them into different categories, it will be more or less able to do that. This is a new avenue that we will explore next year, to see whether we even need to do fine-tuning. But in any case, today in text classification we will start from the basics and then move on to these new Transformer-based architectures. We will also try to solve the problem via traditional machine learning. So that's the long story. Before moving on to text classification, do you have any doubts from the previous session or from the theoretical sessions? Well, here are two questions and answers that might be of interest to you.
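As a rough illustration of the counting idea just described, here is a minimal count-based trigram language model. This is a toy sketch, not code from the lecture; it assumes sentences are already tokenized into lists of words.

```python
# A minimal count-based trigram language model (toy illustration).
from collections import Counter, defaultdict

def train_trigram_lm(sentences):
    # counts[(w1, w2)][w3] = number of times w3 followed the bigram (w1, w2)
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def next_word_probability(counts, w1, w2, w3):
    # P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    total = sum(counts[(w1, w2)].values())
    return counts[(w1, w2)][w3] / total if total else 0.0

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "cat", "ate", "the", "fish"]]
lm = train_trigram_lm(sentences)
print(next_word_probability(lm, "the", "cat", "sat"))  # 0.5 on this toy corpus
```

As soon as the sequence you condition on gets long or was never seen in the corpus, the counts become zero, which is exactly the limitation mentioned above.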
So the idea is that if I have a profile based on the documents and topics that you have reviewed, I can use this to train a text classifier, and the output of the classifier will be whether a new text is relevant to you or not. Right. This is a common scenario in content-based recommendation: if I know what text content you have reviewed, I can train a classifier to understand whether a new piece of content is going to be relevant to you or not. So for personalization and recommendation, you can apply text classification. Another scenario is authorship attribution: you have a piece of text and you would like to understand whether it belongs to a particular author, whether it is an original piece of writing, or whether it is something that has been copied from somewhere else. Text classification is the answer here, because you have, for example, a set of books written by a given author, and you want to decide whether a new piece of text is from this author or not. This is particularly relevant nowadays, because one of the main issues with GPT-2, GPT-3 and these natural language interfaces is that you can easily generate text. There have been some efforts to create classifiers that detect whether a given piece of text has been machine-generated or not; in fact, OpenAI tried to do that some years ago with GPT-2, the previous version, and now there are other tools trying to do the same. This is a text classification problem too, right? The output is whether this text has been written by a machine or by an actual person, so that you cannot take generated text and pass it off as something you have created yourself. Then, of course, there is sentiment classification: if you want to classify something into positive and negative, or into different levels of polarity, that is a text classifier as well. There are some other examples; I have just compiled a few of them. But again, any time you have text that you would like to categorize, you can create a text classifier to decide whether a given piece of text belongs to a given class, whatever that class is: it could be sarcastic or not, it could be fake news or not. So think about whatever categorization problem you have in natural language processing: if it can be solved, it can most likely be solved via text classification. We can rely on all the advanced methods that we are going to explain today, and you will see later on, next week in the problem session, that from the practical point of view they are very simple to implement. They are not much more difficult than doing traditional machine learning with a library such as Scikit-Learn. Okay. Nice. So let's start from the most basic way of doing text classification, which, as always in natural language processing, is to try to do it by hand, with rules. For example, imagine that you would like to detect spam. Perhaps you can detect spam based on keywords that appear in the emails: if a certain keyword appears in the email, then the document belongs to the spam class, right?
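A minimal sketch of that keyword-rule idea is below. The keyword list and threshold are purely illustrative assumptions, not a real rule set.

```python
# A toy rule-based spam classifier: flag an email if it contains "enough" spam keywords.
SPAM_KEYWORDS = {"free", "winner", "prize", "lottery", "pills"}

def rule_based_spam_classifier(email_text, threshold=1):
    tokens = email_text.lower().split()
    hits = sum(1 for t in tokens if t in SPAM_KEYWORDS)
    return "spam" if hits >= threshold else "ham"

print(rule_based_spam_classifier("You are the lucky winner of a free prize"))  # spam
print(rule_based_spam_classifier("Meeting moved to Thursday at 10am"))         # ham
```

Every new kind of spam you want to catch means editing this keyword list by hand, which is exactly the maintenance problem discussed next.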
You would be surprised, because rule-based systems are widely used in industry even nowadays, with the advent of deep learning, transformers and so on. Still, many, many systems out there rely on manual rules crafted by experts in the domain, applying simple linguistics or simple regular expressions. The idea is that this can work if you have very extensive domain knowledge and the problem you are trying to solve is not too difficult. If you try to detect fake news, for example, it is a little bit harder: it's not clear how you can write a rule that tells fake news apart from legitimate text. But for other problems it may be very simple, and in fact you will see that there are many systems out there working with rules. My only point of concern with these systems is that, if you are thinking of implementing one of them, be aware that they are very expensive to maintain. As you can imagine, unless the domain you are targeting is very narrow, any time you want to take into account a new example or a new piece of information you need to update or redefine the rules; you will end up with thousands of rules, and you will not be able to maintain or scale the system. But in any case, there are systems out there working under this assumption, so don't discard this approach from the very beginning. Perhaps this is something you can use as a baseline for your system: you don't want to deploy it, but you create a couple of rules, you see what the performance of these rules is, and then whatever you build on top of that should be able to improve on these results. Well, here we are doing a data-driven master's, right? Because we want to actually leverage data. So I guess that the next step is supervised machine learning. This is a traditional scenario that you already know from machine learning: you have a new document to classify, a set of classes, and training data, where you have a hopefully large set of documents already labeled with the class they belong to, and you would like to learn a classifier. This is the same thing that you have already done in machine learning, in this case with text. We have some useful algorithms to apply for this classification; these are things most of you already know, so let's just review them. First there is Naive Bayes, which is the most simplistic approach. It is a probabilistic approach, which basically tries to balance the new evidence that you observe in order to understand whether the document belongs to a class or not. This is called the likelihood. What is the evidence in this case? The information you have in the documents: the words that appear in them. So if you observe the words "ball" or "soccer" or "basketball", the probability of the document belonging to the class "sports" is high, because these words commonly appear in that class and not in other classes. You also have the prior probability, which is: how likely is the class in my dataset? It's not the same if you are working at a sports newspaper or at the Financial Times, right? The prior probability of a given document belonging to the class "sports" is going to be very different.
With these, you can compute the posterior probability, which is basically how likely this document is to belong to the class I am considering. This is Bayes' rule, and the Naive Bayes algorithm applies this rule with the assumption that all the evidence you are observing, all the words in your document, are independent of each other. I hope that at this point in the program this is clear. You understand that this assumption is completely wrong: the words in a document are not independent of each other. In fact, the whole point of language modeling is that you are able to predict the next word given the words that come before, precisely because there is a strong relationship between them. Still, this works reasonably well for text classification, because the idea of balancing the evidence you observe gives you a good estimate of the class, especially if the classification problem you are addressing is not very complex. It's rather common to use this as a baseline in many scenarios, in order to see whether your deep learning approaches are going to do better. It is probably not the model you will end up deploying, but it is something you can always try. This is a quick overview of these ideas, but if at any moment you need me to delve a little bit more into any of them, just let me know. These are not the most up-to-date tools in our natural language processing toolkit, but it's important that you understand these basics. Good. I guess the next one is going to be a new concept for you: the maximum entropy classifier. These were a big thing back in the '90s; we used to use them a lot in NLP. It is a fairly simple methodology that works rather well. We are now in the era of deep learning, so they are not something we go for anymore, but it's important to understand these earlier methods and the basic idea behind the maximum entropy classifier. The idea is to not make the simplistic independence assumption, so that the model better reflects the actual scenario of text classification: the words in your documents are not independent, they are related to each other, so you don't want to assume otherwise. The maximum entropy classifier proposes a mathematical, probabilistic framework that allows you to do that in a very simple way, based on the concept of entropy. Let me explain this with an example. You are working at a newspaper and you would like to classify news reports into four different classes: economics, politics, arts, sports. As this model is very simple, you don't make any kind of assumption on the data at the very beginning: you assign a uniform probability to each one of them, so 25%. Now, the only thing you need to do is to start counting with the documents that you have. Of course, you will need a training set, an annotated dataset with documents belonging to these classes. So you can start estimating the probability of a given class given the words that you observe in the document. For example, imagine that the word "ball" appears in the text, and you know from your training set whether documents containing this word belong to the class "sports" or to the other classes.
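Before going further into maximum entropy, here is a minimal sketch of the Naive Bayes baseline just described, using scikit-learn. The tiny example texts and labels are made up for illustration.

```python
# A minimal Naive Bayes text classification baseline (toy data, for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the team won the basketball game",
         "the ball went over the goal",
         "stocks fell as markets reacted",
         "the central bank raised interest rates"]
labels = ["sports", "sports", "economics", "economics"]

# CountVectorizer builds the word-count evidence; MultinomialNB applies Bayes' rule
# under the naive assumption that words are independent given the class.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the player kicked the ball"]))  # likely ['sports'] on this toy data
```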
So you basically count the number of times that the word "ball" appears in the class "sports" and you divide by the total number of times that the word appears in your data, across all the classes. With that, you are able to estimate this probability: what is the probability that, when I see the word "ball", the document belongs to the class "sports"? You do the same for the rest of the words in your vocabulary. And, as you see, the more documents you have, and the more words that appear in those documents, the more accurate your model is going to be. This methodology is guaranteed to converge to the actual probability distribution: from a mathematical point of view, if you had an infinite number of documents, it would converge to the actual distribution that is generating the data. This is the idea behind the maximum entropy classifier. Each one of these estimated probabilities acts like a rule or a constraint. Imagine the space of all the possible solutions for your problem; it is typically very large, and you have a potentially infinite number of configurations for your model. The probability of the classes given the word "ball" is a given amount, and there will be some classifiers that do not satisfy this constraint, so the solution is not there. What you do is restrict, constrain, your hypothesis space: as you observe more evidence in your dataset, there are going to be more regions in which the solution cannot be. To pick a single model among the remaining ones, you apply the concept of entropy and select the model which assigns the most uniform probability to everything that you have not observed. This way, you are only making decisions based on what you see in your data and not assuming anything else. These models are not perfect, but they have worked rather well in the past. I have included some links in case you want to delve into them, but you will need to dive into some complex mathematical ideas that may be out of the scope of your interest in text classification, even though it is very interesting. Now, if you try to find a maximum entropy classifier implementation in scikit-learn and you search for it on Google, you will see logistic regression. Why logistic regression? Logistic regression, a.k.a. the logit model, a.k.a. maximum entropy. What does logistic regression have to do with maximum entropy classification? Well, there is a rather involved mathematical derivation that you can read in this paper, but the basic idea is that, although the two algorithms are different, a maximum entropy classifier and a logistic regression reach the same solution. This was a big revelation at the time, because maximum entropy classifiers are sometimes difficult to train, while the optimization problem behind logistic regression can be solved rather fast. That means that you can apply logistic regression for text classification and you will obtain a reasonable solution, because you are effectively solving the maximum entropy classification problem and you are not making unnecessary assumptions on the data. There is no need to dive into this derivation. The only thing you need to know is that applying logistic regression for text classification makes sense, and it will give you a baseline that is really hard to beat.
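As a sketch of that "hard to beat" baseline, here is TF-IDF plus logistic regression in scikit-learn. The dataset and category choice (20 newsgroups, two classes) are illustrative assumptions, not the lecture's exact setup.

```python
# A minimal TF-IDF + logistic regression baseline for text classification.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

cats = ["rec.sport.baseball", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# Logistic regression (a.k.a. maximum entropy) on sparse TF-IDF features.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train.data, train.target)
print(accuracy_score(test.target, clf.predict(test.data)))
```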
So this is something I do recommend: try applying logistic regression to your text classification problems and see what happens. You will see that even if later on you try deep learning, you will not get a huge jump in performance. Of course, deep learning will improve the results, especially if the problem is a difficult one, but logistic regression will give you a very good initial solution and a very good baseline, something I recommend you always test your models against. The last model I want to mention before we dive into deep learning is support vector machines. In the past, they were the state of the art for text classification, because they are well suited to high-dimensional data, which is exactly what we have in text classification. If you remember the term-document matrix that we talked about, most words do not appear in most documents, so the data is sparse and highly dimensional; the dimensionality of the dataset is basically the size of the vocabulary. Remember this term-document matrix in which you have the documents, the terms, and the weight of each word in each document: this is the matrix that gets fed into the support vector machine classifier, so it is highly dimensional, and support vector machines handle that pretty well. Here in this table you have a comparison of a support vector machine classifier against several other machine learning and deep learning methods, and you can see that they perform well: not the best performance on any of the datasets, but more or less on the same level. Well, I told you that logistic regression still makes sense nowadays because it's a very simple model and it gives you a good baseline. I don't see much of a reason, from a practical point of view, to go for support vector machines for text classification anymore. They are not so simple: you will need to train them with a kernel in order to model nonlinearities in your data, and I don't know to what extent you have experimented with support vector machines before, but if the dataset is fairly large, training will take some time, of course not as much as a deep learning method. If you want to add complexity in order to hopefully get the best possible solution, I think it's a better idea to apply some of the deep learning methods that we are going to see now. And for production, it's not a great compromise between complexity and accuracy, whereas logistic regression is not the optimal accuracy but is a very simple model to try and test. In any case, support vector machines were the state of the art before the boom of deep learning, and I still see some systems based on them, so I just wanted to include this to give you the full picture before the move to deep learning. But before diving into that, are there any questions so far? Is everything clear? Okay. I have reviewed these methods very quickly because I'm not introducing anything you didn't already know; I just wanted to put them on the table for you to have the entire context. What makes more sense nowadays is to really dive into deep learning, but it was important to have them there, because otherwise the text classification class would not be complete. Okay.
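Before we move to deep learning, for completeness, here is a sketch of the SVM baseline on the same kind of sparse TF-IDF features, using a linear SVM for speed; the dataset choice is again an illustrative assumption.

```python
# A linear SVM baseline on sparse TF-IDF features, for comparison with logistic regression.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

cats = ["rec.sport.baseball", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_clf.fit(train.data, train.target)
print(svm_clf.score(test.data, test.target))
```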
So the other day we talked about recurrent neural networks, and you told me that you were more or less in the middle of reviewing them during your deep learning sessions; did you have the chance to do that? Okay. So again, as I told you the other day, we don't need a very deep understanding of them here. Once you do an implementation, you will see how the different pieces fit together, how they are trained, and so on. But just to reiterate: they are great for modeling sequences, and text is a sequence of words. This is something we already covered the other day, but the whole point of a recurrent network is the following. You have the raw input, and you feed the different words into an embedding layer. We've talked about this embedding layer this week in the forum session, because it has come up several times. This embedding layer could be something that is already pre-trained, based on word2vec for example, or something completely new, initialized randomly. This gives you an initial representation of your data. Then you have some hidden states, and these hidden states can be as complex as you want; you can have as many layers as you want. And you have an output. The whole point is that you can train this neural net to predict the next word: the hidden state you get from the first step is fed, together with the next word, into the second step, and so on, and what you want is to be able to predict the next word, very similar to language modeling. But here in text classification we don't care so much about that output; what matters is to classify something, for example into positive and negative. So the idea is that we are going to remove this final layer and put some kind of classifier on top. Why? Because what is actually of interest to me is the hidden state. But before diving into that, are there any questions so far? Is anything unclear? Okay.
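A minimal sketch of that idea is below: an embedding layer, a recurrent (LSTM) layer, and a classifier head on top of the final hidden state. Vocabulary size, dimensions, and number of classes are illustrative assumptions.

```python
# Embedding -> LSTM -> linear classifier on the final hidden state (PyTorch sketch).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # could be word2vec-initialized
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)  # replaces the LM output layer

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])            # logits: (batch, num_classes)

model = LSTMClassifier()
dummy_batch = torch.randint(0, 10000, (4, 20))     # 4 sequences of 20 token ids
print(model(dummy_batch).shape)                    # torch.Size([4, 2])
```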
That hidden state is a representation we learn on one task and reuse for another, and that's why we talk about transferring the learning from one task to the other. So, what's the point of transfer learning? Why are we doing this? The idea is basically to reuse models that have learned the basics of the language: if you have a model and you train it to learn the basics of the language, then you can use it for many different problems; with this single representation, you can feed many different models. That's one reason. But you could say: I prefer to just train a model directly for classification, because that's what I want to do, I don't want to do anything else. So why would you still need this language modeling step, why train on language modeling rather than going straight to text classification? I guess that from the literature you have heard that deep learning is data-hungry: you need big data. If you only have 1,000 examples, you are not going to be able to train a deep network; you need millions of examples. And is text classification a supervised problem or an unsupervised problem? It is supervised: you have a training dataset with your features and your labels. Who is giving you this dataset? How do you get these labels? If you want to train a neural net from scratch on text classification, you need millions, if not billions, of examples, and you cannot sit down and annotate three or four million documents. That's why you first train the neural net on a different, much cheaper problem, where you can easily access tons of data. Then, even if you only have a handful of labeled examples, you don't care, because what you want to do is just fine-tune this learning on your dataset. That's the whole idea of transfer learning. It's like trying to explain mathematics to a baby: you cannot, because the baby does not understand you. First you need to teach the baby how to speak your language, and then, when the baby can talk to you, you can try to explain the basics of mathematics. Now, in what sense is language modeling much cheaper than the text classification problem? Let's talk a little bit about that. Language modeling: remember, I give you a sentence and you predict the next word. Is this supervised or unsupervised? Unsupervised? Why? Okay, so your idea is basically that your dataset is just the features: you are not providing any labels. You just download the data from the Internet and you have it there; you don't need to label the data. Okay, that's fair. Any other opinion? Okay. So, actually, what you are telling me is this:
you don't have explicit labels, but you do have labels, because if I give you sentences, what I can do is remove the last word from each sentence, and that word is the label. So you are both right. In fact, we call these methods self-supervised methods. Why? Because from your point of view it looks like an unsupervised problem: you just feed in data with no labels whatsoever, which makes it very easy to get data. But you get the performance of a supervised model, because language modeling is in fact a supervised problem; you need a label, but the model is able to create the labels itself. Basically, you take the data, you remove something at random, and you have your label. So this self-supervised problem is rather cheap. That's why we like language modeling so much: it's a self-supervised problem for which it is so simple to get data, basically for free, that you can train very huge neural networks. You learn the basics of language with that, and then you apply it to whatever you want, in this case text classification. If tomorrow you want to do machine translation, you take the same model, the same representation of the data, to machine translation. If tomorrow you want to do question answering, you do the same. The only thing you need to do is remove the language modeling layer, put a new task-specific layer on top, and fine-tune it with a small annotated dataset, because the hard problem has already been solved. This concept of self-supervision is very important and goes beyond natural language processing. In fact, given the huge success of self-supervision in natural language processing, people are now trying to achieve the same in different domains. For example, if you have an image, you can randomly remove pixels from it and ask the model to predict those pixels, so it learns the semantics of the image, and then you can apply this model to whatever you want to do. You can take a video and randomly remove frames: if the model is able to recreate these frames, it understands the video. You can take some music, remove seconds from it, and ask the model to recreate them, so it learns about music. Something that we are trying to do, but have not yet been very successful at, and that would revolutionize machine learning in commercial practice, is to do the same for tabular data. Imagine that you take the Titanic dataset and you randomly remove values: for a given passenger, you hide a given feature, and you ask the model to recreate it. That would be great. The main problem right now is that this transfer is very weak, because whatever you learn from the Titanic dataset is going to be of little use, so to speak, for another dataset with a completely different set of features. So far we have not been successful. But in any scenario where you can apply this, it's great, because you can basically get infinite data for free. You train a model that understands the relationships, the semantics, of this data, and once it knows them, you can then apply it to whatever task you want. We have already introduced this idea when talking about language modeling, and it is very important, not to be overlooked. Nowadays you take a deep learning model, train it for language modeling, and then you can use the model for many specific tasks: this is the foundation model that we were talking about. Is this clear?
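A tiny illustration of why this is "self-supervised": the label is created from the raw text itself by hiding one word at random. This is toy code under that assumption, not a real masking scheme like BERT's.

```python
# Turn an unlabeled sentence into a (masked input, target word) training pair.
import random

def make_self_supervised_example(sentence, mask_token="[MASK]"):
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    target = tokens[i]            # the "label" is just the word we removed
    tokens[i] = mask_token
    return " ".join(tokens), target

random.seed(0)
print(make_self_supervised_example("the cat sat on the mat"))
# e.g. ('the cat sat [MASK] the mat', 'on')
```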
Is this idea clear? The more we talk about it, the better. We had a small discussion about this at the end of the previous session. It has been proven, especially since GPT-3, that these models are not only useful for obtaining a better representation: if you reach the scale of GPT-3, you don't even need to retrain the model for anything. The astonishing performance you see is not because they have been trained on summarization or on whatever text classification task you give them; no, they have only been trained on language modeling and nothing else. We are in the middle of a race to see how big the models we create can be. People are talking about GPT-4, and there is a lot of speculation out there. Google has something called PaLM, which is roughly three times larger than GPT-3. We are also scaling up not only the size of the models but the amount of data used to train them: instead of creating a model as large as GPT-3, some groups are creating smaller models but training them with more data. This is also the basis of LaMDA, the conversational AI system that was presented a week ago. There is no limit: the more data you have, the better. These foundation models, or pre-trained models, can be based on several deep learning architectures. The most well-known is BERT. We will play around with BERT, but it is worth noting that BERT is not the only pre-trained model. I recommend you look into the model repository for natural language processing at Hugging Face. There you have a lot of models for basically anything you want to do, so it's not just BERT that has been pre-trained. Basically, people have taken a language model and trained it to do many different things. For example, if you go to text classification, you will see many more models, most of them based on BERT and RoBERTa. There are, for instance, BERT-style models that have been pre-trained on finance or legal language. All the information about how to use these models is in the Transformers library: if you want to use one, you can do it from the Python library. Thanks to the Transformers library, you have access to a repository of models ready for several tasks. You can train a model once and then reuse it across many tasks, companies, and domains. The only thing you need to do is fine-tune it: you take the model as it is and then fine-tune it for the specific problem you have. To do this, you remove the last layer of the model, which is about language modeling, and put a classifier on top of that. We will talk about how to train the classifier on top and also how to fine-tune the rest of the model. What do I mean by fine-tuning? If you review the BERT architecture, it is basically a stack of layers. In the forum session I think I shared a paper on this idea of "BERTology", which analyzes what a BERT model is doing, and this applies to pretty much any of these big pre-trained models, because they are always built the same way. So you have this very big stack of layers. What we have seen when analyzing these models is that the lower layers are learning just the basics of the language: they are learning how words relate to each other, and as you move up in the architecture, they learn more abstract things. This is the same idea that we have seen so many times with convolutional neural networks applied to images.
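Here is a sketch of that step: take a pre-trained model from the Hugging Face hub and put a (randomly initialized) classification head on top, ready for fine-tuning. The model name, toy texts, and label count are illustrative choices, not the lecture's exact setup.

```python
# Load a pre-trained BERT and attach a sequence classification head (Hugging Face Transformers).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["win a free prize now", "see you at the meeting tomorrow"]
labels = torch.tensor([1, 0])  # toy labels: 1 = spam, 0 = ham

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)  # fine-tuning would backpropagate this loss
```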
Right? First, the lower layers learn how to detect edges, then how to detect shapes, how to combine these shapes into objects, and so on. Same here: the lower layers capture the basic building blocks of language, which you then combine as you move up. Well, perhaps you don't want to retrain those lower layers, which capture the basics of the language, because the basics of the language are the same for the different tasks. But as you move up in the architecture, the layers on top somehow reflect the fact that you were doing language modeling, that you were trying to predict the next word. You would like to retrain them so that they understand that now you are doing classification. So if you see the word "pills", you know that in the context of spam it has a strong meaning: it is strongly related to some kind of pills being sold, and therefore strongly related to spam. Or "cash prizes": you know that you are not going to win a cash prize, and they are certainly not going to tell you via email, so this is strongly correlated with spam. You want the model to relearn a little bit the meaning of these words and the relationships between them. So how do you do that? For the layer on top, you allow the weights to change; retraining it is what we call fine-tuning. What do I mean by retraining this layer? The layer has a set of weights, so you can change those weights. But you cannot change them very much, because otherwise you will destroy everything that is already there. So I would like to retrain this layer, but only a little bit, and the one below it even less: I want to retrain each layer, but only by a small amount. The trick for doing that is to reduce the learning rate as you move down in the architecture: the top layer you retrain with a small learning rate, the second with a smaller learning rate, the next one even smaller. The idea is that if you reduce the learning rate, you reduce the speed, the pace at which you allow the weights to change; a smaller learning rate basically won't allow the weights to change too much, which is exactly what you want: you want them to change, but only a little bit. We are going to play around with this idea and see how it works with BERT and other deep learning models. So I hope you are with me on this idea: it makes sense to change a little bit the representation learned by BERT in order to adapt it to the new scenario that I am trying to solve. And again, put that into context: transfer learning means having a pre-trained model, trained on a huge dataset as a language model, that I then refine in order to solve my particular problem. Don't worry if that seems too overwhelming; we will practice. We will write some lines of code, which is rather simple, very similar to what you can do with scikit-learn. Do you have any doubts so far? Okay. Another idea that I would like to relate to transfer learning is that so far we have a two-step approach: first we pre-train a model on a huge dataset, and then we retrain this model on the task we want to solve. But as you will see here, there is a third step that you can include.
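A sketch of the "smaller learning rate as you go down" idea is below, using parameter groups in PyTorch. The attribute names (model.bert.embeddings, model.bert.encoder.layer, model.classifier) assume a Hugging Face BertForSequenceClassification; the base learning rate and decay factor are illustrative assumptions.

```python
# Layer-wise (discriminative) learning rates: top layers change more, lower layers less.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr, decay = 2e-5, 0.9
param_groups = [
    {"params": model.classifier.parameters(), "lr": base_lr},      # new head: full lr
    {"params": model.bert.pooler.parameters(), "lr": base_lr},
]
for depth, layer in enumerate(reversed(model.bert.encoder.layer)):  # top encoder layer first
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr * decay ** (depth + 1)})
param_groups.append({"params": model.bert.embeddings.parameters(),  # lowest layers: smallest lr
                     "lr": base_lr * decay ** (len(model.bert.encoder.layer) + 1)})

optimizer = torch.optim.AdamW(param_groups)
```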
I'm taking this picture from the fast.ai course. fast.ai is also a library, and I do recommend you give it a try; we are going to use it in the practice session. Something that they did, and it worked pretty well, is the following. We have this huge model trained, in this case, on Wikipedia: a language model. In their case the model is a recurrent neural network, but it could be whatever model you want. So this model knows how to predict the next word on Wikipedia. Now, we want to do text classification; the problem they are trying to solve is to classify the reviews in the Internet Movie Database into positive and negative, which is a classification problem. But before doing the classification, they take this very big model and train it again on language modeling, this time using the dataset that you are going to use in your final task. Think about that: the original model was trained on Wikipedia, and the language on Wikipedia and the language in the Internet Movie Database are quite different. So you would like the model to learn a little bit of the language relationships in the movie reviews themselves. This is the same idea as the model I showed you on the Hugging Face hub, pre-trained on finance and news text: the vocabulary in finance and news is not the same as the vocabulary in general-domain English. Now the model is able to predict the next word in the movie reviews, and only then do you move to the classification problem. So basically what they did was to include a step in the middle: language modeling, but on the dataset that you are going to use. We will apply this as well; it is something we will use in the practice sessions. Here I'm just including some slides on how the state of the art has been improved by a lot thanks to that. So here we have compiled the main ideas, but before we dive into the practice, I would like to check whether there are any issues or doubts. We will review these ideas in the forum and we will apply them in practice, but before moving on I would like to make sure that everything is okay. Yeah, I don't think these ideas are very difficult; they all kind of make sense, right? When you see them now, they seem kind of obvious. The main thing is that nowadays we are able to create these huge models because we are able to access huge amounts of data, and we have large computers that are able to process them; and to train these models we have the Transformer architecture, which is great for this. So again, nothing mind-blowing: just more data, more computational power, and a new architecture, the Transformer. Actually, the Transformer architecture has been a huge advance for the field. When I was doing my PhD, depending on the task that you wanted to solve, you needed to use a different model: basically, for each of the different tasks and problems you had a specific model and a specific natural language processing pipeline. We are going to see this, for example, in your question answering session, where you will see how question answering was solved previously.
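A sketch of that three-step recipe with the fast.ai library is below: start from an AWD-LSTM language model pre-trained on Wikipedia, fine-tune it as a language model on the IMDB texts, then train the classifier. The API details follow fastai v2 and may differ slightly between versions; treat this as a sketch, not the exact practice-session code.

```python
# ULMFiT-style three-step recipe with fastai (sketch).
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 2: fine-tune the Wikipedia-pretrained language model on the IMDB text itself
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder("finetuned_enc")

# Step 3: reuse that encoder for the positive/negative classification task
dls_clf = TextDataLoaders.from_folder(path, valid="test", text_vocab=dls_lm.vocab)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
learn_clf.load_encoder("finetuned_enc")
learn_clf.fine_tune(2, 2e-2)
```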
To do question answering that way, you needed to create many different pieces that had to fit together: if you wanted to do question answering, you had to build one complex system; if you wanted to do machine translation, you needed something completely different; if you wanted to do text summarization, something completely different again. Nowadays, all the natural language processing pipelines look very similar: you have a pre-trained model, you fine-tune this model on your problem with an annotated dataset, and that's it. All of them are based on Transformers; it is some kind of unifying architecture. What I presented is for natural language processing, but it seems this is extending beyond that: most of computer vision is now also moving to Transformers; it used to be based on convolutional neural networks, but not anymore. Even for music, video, and so on, Transformers are making a large impact. Okay, so here is a very quick summary. If you want to solve a classification problem, what do you do? If you don't have data, you can try to create some rules, remembering that creating these rules is not straightforward. If you have a handful of data points, a small annotated dataset, you can try to label some more data, or you can try some kind of semi-supervised annotation, which is basically training a machine learning classifier, labeling more data with the output of this classifier, retraining, and so on. If you have a reasonable amount of data, which is the scenario we have generally seen in machine learning, you can use machine learning or deep learning models. And now, thanks to the idea of transfer learning, even if you don't have a very large dataset, you can still apply these deep learning ideas. I also wanted to include some other classic ideas to enhance text classification methods, such as using part-of-speech tags or dependency parsing as extra features, but as you have seen, once you have a little bit of data you can apply a deep learning method, so I just wanted to mention them. The basic idea is that nowadays, if you are facing a performance problem, it does not need to be solved that way; if you are moving to deep learning, it makes more sense to find a model that is better suited to your problem, fine-tune it, and of course collect a good dataset for training. Here you have some of the resources. From a practical point of view, NLTK is something I don't recommend you use in practice, as it is quite outdated. With scikit-learn we have a plethora of classifiers that will work rather well, and if you want to move to deep learning, these are the libraries I recommend you check, mainly Hugging Face. Hugging Face has become very popular for natural language processing lately: they have a great library for deep learning in general, and they have plenty of models for text classification and natural language processing in general. For the practice session, however, we are not going to use Hugging Face directly: we are going to use fast.ai and a wrapper library that sits on top of Hugging Face, just to make its usage a little bit simpler. But you can use Hugging Face by default, because its interface is not so difficult. If you want to dive into the practicalities of natural language processing, I do recommend you start with fast.ai, and even for the assignment you can try using deep learning.
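To illustrate how uniform these pipelines have become, here is a ready-made, already fine-tuned sentiment model pulled from the Hugging Face hub in a few lines; the default model behind "sentiment-analysis" is chosen by the library itself, not specified in the lecture.

```python
# Using an off-the-shelf fine-tuned text classifier from the Hugging Face hub.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["I loved this movie", "The plot was a complete mess"]))
# e.g. [{'label': 'POSITIVE', ...}, {'label': 'NEGATIVE', ...}]
```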
For the practice session, we will start by solving a classification problem with traditional machine learning, then apply a recurrent neural network by making use of the fast.ai library, and then use Hugging Face to try to solve the same text classification problem. Hopefully you will see the practicalities of each of them and what you can do with them, so that you have a better understanding of natural language processing and text classification in case you later want to make use of these methods. If you have any doubts or questions, let me know in the forum session or drop me an email.