Goran Glavaš committed on
Commit 41fc5cc
1 Parent(s): 80dbedc

Deleted embeddings.txt due to repo size limitation. Updated the README file.

Files changed (2)
  1. README.txt +5 -5
  2. source/res/embeddings.txt +0 -3
README.txt CHANGED
@@ -31,13 +31,13 @@ Example command:
 
  java -jar graphseg.jar /home/seg-input /home/seg-output 0.25 3
 
- The tool's correct execution depends on the resources in the /source/res directory. These three files are as follows:
+ The tool's correct execution depends on the resources in the /source/res directory. There are three files that need to be there:
 
- (1) embeddings.txt -- the word embeddings used for measuring semantic similarity between sentences. The default file used are 200-dimensional GloVe embeddings obtained on Wikipedia 2014 + Giga 5 corpus (http://nlp.stanford.edu/data/glove.6B.zip).
- (2) stopwords.txt -- the list of English stopwords (excluded from sentences when measuring semantic similarity)
- (3) freqs.txt -- frequencies of English words on a large corpus, needed for the IC-weighting of word contribution
+ (1) embeddings.txt -- the word embeddings used for measuring semantic similarity between sentences. The default file used are 200-dimensional GloVe embeddings obtained on Wikipedia 2014 + Giga 5 corpus (http://nlp.stanford.edu/data/glove.6B.zip). This file is bundled into the standalone binary file graphseg.jar, but is omitted from the source/res folder due to space constraints of the repository;
+ (2) stopwords.txt -- the list of English stopwords (excluded from sentences when measuring semantic similarity);
+ (3) freqs.txt -- frequencies of English words on a large corpus, needed for the IC-weighting of word contribution.
 
- You may choose to replace these default files (e.g., by using different embeddings or different stopword list), but make sure you name the new files exactly the same (i.e., embeddings.txt, stopwords.txt, and freqs.txt, respectively).
+ The last two files (stopwords.txt and freqs.txt) are provided in the res folder, whereas the embeddings.txt are bundled into the binary (/binary/graphseg.jar) but omitted from the /source/res folder due to repository size constraints. You may choose to replace these default files (e.g., by using different embeddings or different stopword list), but make sure you name the new files exactly the same (i.e., embeddings.txt, stopwords.txt, and freqs.txt, respectively).
 
  Credit
  ========
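Since the README allows swapping in a different embeddings.txt, a quick sanity check on a replacement file may be useful before running the tool. The sketch below assumes the plain-text GloVe layout (one token per line, followed by space-separated floats), which is how the files in glove.6B.zip are laid out. Whether GraphSeg accepts other layouts or dimensionalities is not stated in this README. The function name `check_embeddings` and its parameters are illustrative and not part of GraphSeg.

```python
from typing import Iterable, List, Tuple


def check_embeddings(lines: Iterable[str], expected_dim: int = 200,
                     max_errors: int = 5) -> Tuple[int, List[str]]:
    """Count well-formed vectors and collect up to max_errors problems.

    Assumes GloVe text format: "<token> <float> <float> ...".
    expected_dim defaults to 200 to match the default file described above.
    """
    ok = 0
    problems: List[str] = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], parts[1:]
        if not word or len(values) != expected_dim:
            problems.append(
                f"line {lineno}: expected a token plus {expected_dim} values, "
                f"found {len(values)} values"
            )
        else:
            try:
                for value in values:
                    float(value)  # raises ValueError on a non-numeric field
                ok += 1
            except ValueError:
                problems.append(f"line {lineno}: non-numeric vector component")
        if len(problems) >= max_errors:
            break  # stop early; a badly formatted file would flood the report
    return ok, problems
```

To check a file on disk, pass an open text handle, for example `check_embeddings(open("source/res/embeddings.txt", encoding="utf-8"))`. The handle is read line by line, so the whole file is never held in memory.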
source/res/embeddings.txt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:18870b0a7516e4a72b44d3c226c242d2d846008967d8ce40b94c723a94d1a32b
- size 693432828
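The three deleted lines are a Git LFS pointer, not the embeddings themselves. The pointer records the SHA-256 digest and the byte size (693432828 bytes, roughly 661 MiB) of the file that was tracked. The sketch below shows one way to check whether a locally obtained file is byte-identical to the one this pointer referred to. The helper names `parse_lfs_pointer` and `matches_pointer` are illustrative and do not belong to any Git LFS tooling.

```python
import hashlib
from typing import BinaryIO, Dict

# Pointer contents copied from the deleted source/res/embeddings.txt.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:18870b0a7516e4a72b44d3c226c242d2d846008967d8ce40b94c723a94d1a32b
size 693432828
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each "key value" line of a pointer file into a dict entry."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(stream: BinaryIO, pointer: Dict[str, str],
                    chunk_size: int = 1 << 20) -> bool:
    """Return True if the stream's size and digest match the pointer."""
    algorithm, _, expected_hex = pointer["oid"].partition(":")
    digest = hashlib.new(algorithm)
    size = 0
    while True:
        chunk = stream.read(chunk_size)  # read in chunks to bound memory use
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    return size == int(pointer["size"]) and digest.hexdigest() == expected_hex
```

A candidate file would be opened in binary mode and passed as `stream`, together with `parse_lfs_pointer(POINTER)`. A `False` result only means the bytes differ from the tracked file. It says nothing about whether the candidate would still work as embeddings.txt.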