FoodDesert committed on
Commit 0e986bc
1 Parent(s): 1e4bd6c

Upload app.py

Files changed (1)
  1. app.py +7 -0
app.py CHANGED
@@ -78,6 +78,13 @@ We then randomly replace about 10% of the tags in each document with a randomly
  We then train a FastText (https://fasttext.cc/) model on the documents. The result of this training is a function that maps arbitrary words to vectors such that
  the vector for a tag and the vectors for its aliases are all close together (because the model has seen them in similar contexts).
  Since the lists of aliases contain misspellings and rephrasings of tags, the model should be robust to these kinds of problems as long as they are not too dissimilar from the alias lists.
+
+ To further improve the tag corrector, we use conditional probabilities to refine its predictions.
+ Using the same 4-million-post dataset, we calculate the conditional probability of each tag given the other tags that appear in the same document.
+ To do this, we build a co-occurrence matrix from the dataset, which records how often each pair of tags appears together across all documents.
+ By taking into account the context in which tags are used, the corrector can not only fix misspellings and rephrasings but also make more contextually relevant suggestions.
+ The "similarity weight" slider controls how much weight these conditional probabilities are given vs. the FastText similarity model when suggesting replacements for invalid tags.
+ A similarity weight of 0 means that only the FastText model's predictions are used to calculate similarity scores, and a value of 1 means that only the conditional probabilities are used (although the FastText model is still used to trim the list of candidates).
  """
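
The added docstring text describes a pipeline: count tag co-occurrences, turn them into conditional probabilities, then blend those with the FastText similarity according to the slider. A minimal sketch of that idea follows. It is not the code in app.py: every function name is hypothetical, `fasttext_scores` is a plain dict standing in for the candidates already trimmed by the FastText model, averaging over the context tags is an assumed way to combine them, and the linear blend is an assumption chosen only because it matches the described endpoints (0 means FastText only, 1 means conditional probabilities only).

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(documents):
    """Count how many documents contain each tag and each unordered tag pair."""
    tag_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        tags = sorted(set(doc))  # ignore repeated tags within one document
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))  # pairs come out in sorted order
    return tag_counts, pair_counts


def conditional_probability(candidate, context_tag, tag_counts, pair_counts):
    """Estimate P(candidate | context_tag) as count(both) / count(context_tag)."""
    if tag_counts[context_tag] == 0:
        return 0.0
    pair = tuple(sorted((candidate, context_tag)))
    return pair_counts[pair] / tag_counts[context_tag]


def context_score(candidate, context_tags, tag_counts, pair_counts):
    """Average P(candidate | t) over the other tags t (assumed aggregation)."""
    others = [t for t in context_tags if t != candidate]
    if not others:
        return 0.0
    total = sum(
        conditional_probability(candidate, t, tag_counts, pair_counts)
        for t in others
    )
    return total / len(others)


def rank_candidates(fasttext_scores, context_tags, tag_counts, pair_counts,
                    similarity_weight):
    """Rank candidates by a linear blend of the two scores (assumed formula).

    similarity_weight = 0 uses only the FastText score;
    similarity_weight = 1 uses only the conditional probabilities.
    """
    blended = {}
    for candidate, ft_score in fasttext_scores.items():
        ctx = context_score(candidate, context_tags, tag_counts, pair_counts)
        blended[candidate] = (
            (1 - similarity_weight) * ft_score + similarity_weight * ctx
        )
    return sorted(blended, key=blended.get, reverse=True)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    documents = [
        ["cat", "whiskers", "tail"],
        ["cat", "whiskers"],
        ["car", "wheel"],
        ["car", "wheel", "road"],
    ]
    tag_counts, pair_counts = build_cooccurrence(documents)
    fasttext_scores = {"car": 0.9, "cat": 0.8}  # made-up similarity scores
    for weight in (0.0, 1.0):
        ranking = rank_candidates(
            fasttext_scores, ["whiskers"], tag_counts, pair_counts, weight
        )
        print(weight, ranking)
```

In the toy data, "car" has the higher made-up FastText score, but only "cat" ever co-occurs with "whiskers", so moving the weight from 0 to 1 flips the ranking. The candidate set stays the same at every weight because only the tags present in `fasttext_scores` are ever scored, which mirrors the note that FastText still trims the candidate list.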