Update README.md
README.md
CHANGED
@@ -49,7 +49,7 @@ A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.r
 1. Calculate shingles with size of 5
 2. Calculate MinHash with 100 seeds → for every sample (text) have a hash of size 100
 3. Split every hash into 10 buckets → every bucket, which contains (100 / 10) = 10 numbers, get hashed into 1 hash → we have 10 hashes for every sample
-4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard
+4. For each bucket find duplicates: find samples which have the same hash → calculate pair-wise jaccard similarity → if the similarity is >0.7 then it's a duplicate
 5. Gather duplicates from all the buckets and filter
 
 ### Training Hyperparameters
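
The five deduplication steps in the hunk above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: character-level shingles, the `blake2b`-based seeded hash, and all function names (`shingles`, `minhash`, `bucket_hashes`, `find_duplicates`) are assumptions made for the example. The README does not say how step 5 filters the gathered pairs, so the sketch simply returns them.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

SHINGLE_SIZE = 5   # step 1 (character-level shingles are an assumption)
NUM_SEEDS = 100    # step 2: length of the MinHash signature
NUM_BUCKETS = 10   # step 3: 100 / 10 = 10 numbers per bucket
THRESHOLD = 0.7    # step 4: Jaccard similarity cut-off


def shingles(text, size=SHINGLE_SIZE):
    """Step 1: set of overlapping character shingles."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def seeded_hash(seed, value):
    """Deterministic 64-bit hash of a string under a given seed (assumed hash family)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    """Step 2: one minimum per seed gives a signature of num_seeds numbers."""
    return [min(seeded_hash(seed, s) for s in shingle_set) for seed in range(num_seeds)]


def bucket_hashes(signature, num_buckets=NUM_BUCKETS):
    """Step 3: split the signature into buckets and hash each bucket into one value."""
    rows = len(signature) // num_buckets
    return [hash(tuple(signature[b * rows:(b + 1) * rows])) for b in range(num_buckets)]


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def find_duplicates(texts, threshold=THRESHOLD):
    """Steps 4 and 5: collect candidate pairs per bucket, keep those above the threshold."""
    shingle_sets = [shingles(t) for t in texts]
    buckets = [bucket_hashes(minhash(s)) for s in shingle_sets]
    duplicates = set()
    for b in range(NUM_BUCKETS):
        groups = defaultdict(list)
        for idx, sample_buckets in enumerate(buckets):
            groups[sample_buckets[b]].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if jaccard(shingle_sets[i], shingle_sets[j]) > threshold:
                    duplicates.add((i, j))
    return duplicates


texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "completely different sentence about something else",
]
print(find_duplicates(texts))  # {(0, 1)}
```

Only samples that share at least one bucket hash are compared pair-wise, which is what keeps the exact Jaccard computation affordable on a large corpus.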