onerayhan committed on
Commit
3a256e3
·
1 Parent(s): 76f431e

Update README.md

Files changed (1)
  1. README.md +164 -0
README.md CHANGED
@@ -1,3 +1,167 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - tr
5
+ library_name: bertopic
6
+ tags:
7
+ - finance
8
+ metrics:
9
+ - accuracy 0.9
10
  ---
11
+ # Financial Sentiment Analysis with BERT for Borsa Istanbul (BIST100)
12
+
13
+ - Hello! This is the repository for our CS210 project, in which we trained a financial sentiment analysis model for Borsa Istanbul using BERT. Because live alternatives cost more than 750 TL per month, we plan to develop this project into a live, open-source alternative to them. Please star the project to support further upgrades :))
14
+
15
+ - The code for data gathering, parsing, model training and visualization is organized in separate folders.
16
+ All the documentation is public and open source.
17
+ This page will be updated, but for the moment
18
+ let's get a good grade :)
19
+
20
+ -----------------------------------------------
21
+
22
+ # Links to resources
23
+ - The data and the trained model's source files are shared through Google Drive because they are large in size or number
24
+ ------------------------------------------------------------
25
+ ## Link to the Repository
26
+ - The repository includes every file we implemented during the project
27
+ - The model reaches an accuracy above 0.90 on the test data
28
+ - [Repository Link](https://github.com/onerayhan/cs210_trend_followers)
29
+ ------------------------------------------------------------
30
+ ## Link to the Prelabeled Data
31
+ - The data is pre-labeled using the daily return of the BIST-100 to get a first insight
32
+ - The files are sorted into neg and pos subfolders
33
+ - [Pre-Labeled Data Link](https://drive.google.com/drive/folders/1NYB9wBx8yt31drdczAB_ll5s31I1dcN4?usp=sharing)
34
+ -------------------------------------------------------------
35
+ ## Link to the True Labeled Data
36
+ - The data was then relabeled with a keyword search, which moved more than 200 files from neg to pos or vice versa:
37
+ - [Labeled Data Link](https://drive.google.com/drive/folders/1sn4JtCZ44wH2FO60Opm3FKXQwYLMtwGY?usp=sharing)
38
+
39
+ # Additional Notes
40
+
41
+ ## Data Gathering
42
+ - We gathered daily brokerage reviews, daily news and tweets, but left the tweets out of model training because of the data's limitations and its high share of spam
43
+ - The data covers the period from 01.01.21 to 25.05.23, except for tweets, which we could not access further back than one month
44
+ - We gathered the data with several libraries, including Selenium, Requests, BeautifulSoup and snscrape
45
+ ## Data Preprocessing
46
+ - We used Python's built-in libraries, pypdfium2, pandas, NumPy, Transformers and BertTokenizer for preprocessing
47
+ ## Model Training
48
+ - We fine-tuned a cased Turkish BERT model on the data with Transformers
49
+ - [Link to the Pretrained Base Model](https://huggingface.co/dbmdz/bert-base-turkish-cased)
50
+ ## Visualization
51
+ - We used Matplotlib and Seaborn for visualizations
52
+
53
+ -----------------------------------------------
54
+
55
+ # Quick File Explanations in the Repository
56
+ Below are quick explanations of what each script does.
57
+ The workings of the Python code are explained in more detail by the comments in each script.
58
+ These explanations can also be found in our repository
59
+
60
+ ## Downloading Links
61
+
62
+ - akbank_link_download.py
63
+
64
+ Collects links to PDFs from the specified URL until it reaches 04.01.2021, using Selenium to traverse the interactive page on the Akbank website
65
+
66
+ - gedik_link_download.py
67
+
68
+ Collects links to PDFs from the specified URL, using Requests and BeautifulSoup to take the links sequentially from the Gedik website
69
+
70
+ - download_links_yk_garan
71
+ Collects links to PDFs from the specified URLs, using Requests, BeautifulSoup and Selenium to take the links sequentially from the Garanti and Yapı Kredi websites
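
As a rough illustration of the link-collection step, the sketch below pulls PDF links out of an already-downloaded HTML page. It is not the code from the scripts above: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and the names `PdfLinkParser`, `extract_pdf_links` and `save_links` are hypothetical. It also assumes the link file holds one URL per line.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects href values of <a> tags that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return all PDF links found in one HTML page, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def save_links(links, path):
    """Write one link per line (assumed format of the .txt link files)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(links))
```

Pages that build their link lists with JavaScript, as the Akbank page apparently does, would still need a browser driver such as Selenium to produce the HTML first.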
72
+
73
+ ## Downloading PDFs
74
+
75
+ - akbank_PDF_download.py
76
+
77
+ Reads a .txt file of links, downloads each PDF and saves it to /data/akbank_PDF; the folders need to be created beforehand
78
+
79
+ - garanti_PDF_download.py
80
+
81
+ Reads a .txt file of links, downloads each PDF and saves it to /data/garanti_PDF; the folders need to be created beforehand
82
+
83
+
84
+ - gedik_PDF_download.py
85
+
86
+ Reads a .txt file of links, downloads each PDF and saves it to /data/gedik_PDF; the folders need to be created beforehand
87
+
88
+
89
+ - yapikredi_PDF_download.py
90
+
91
+ Reads a .txt file of links, downloads each PDF and saves it to /data/yapikredi_PDF; the folders need to be created beforehand
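
The download scripts all follow the same pattern, which might look roughly like the sketch below. This is an assumption-laden illustration, not the repository's code: it uses `urllib` from the standard library rather than Requests, the file-naming rule in `target_path` is hypothetical, and `fetch` is an injectable helper added here so the logic can be tried without network access.

```python
import os
from urllib.request import urlopen


def read_links(txt_path):
    """Return the non-empty lines of the link file."""
    with open(txt_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def target_path(link, out_dir, index):
    """Build a local file name from the last URL segment (hypothetical naming)."""
    name = link.split("?")[0].rstrip("/").split("/")[-1]
    if not name.lower().endswith(".pdf"):
        name = f"report_{index}.pdf"
    return os.path.join(out_dir, name)


def download_all(txt_path, out_dir, fetch=None):
    """Download every linked PDF into out_dir, which must already exist."""
    if fetch is None:
        # Default: plain HTTP GET with the standard library.
        fetch = lambda url: urlopen(url, timeout=30).read()
    if not os.path.isdir(out_dir):
        # Mirrors the note above: folders are not created automatically.
        raise FileNotFoundError(f"create {out_dir} before running")
    saved = []
    for i, link in enumerate(read_links(txt_path)):
        path = target_path(link, out_dir, i)
        with open(path, "wb") as f:
            f.write(fetch(link))
        saved.append(path)
    return saved
```

A call such as `download_all("akbank_links.txt", "data/akbank_PDF")` would then fill the folder, where the link file name is only an example.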
92
+
93
+ ## Extracting Text
94
+
95
+ - pypdfium2_akbank.py
96
+
97
+ Uses pypdfium2 to extract the necessary text from the Akbank PDFs located in data/akbank_PDF,
98
+ puts all extracted text into a list of dictionaries with the keys date, count and paragraph,
99
+ and writes the combined list to a .json file
100
+
101
+
102
+ - pypdfium2_garanti.py
103
+
104
+ Uses pypdfium2 to extract the necessary text from the Garanti PDFs located in data/garanti_PDF,
105
+ puts all extracted text into a list of dictionaries with the keys date, count and paragraph,
106
+ and writes the combined list to a .json file
107
+
108
+
109
+ - pypdfium2_gedik.py
110
+
111
+ Uses pypdfium2 to extract the necessary text from the Gedik PDFs located in data/gedik_PDF,
112
+ puts all extracted text into a list of dictionaries with the keys date, monthAgo, count and paragraph,
113
+ and writes the combined list to a .json file
114
+
115
+
116
+ - pypdfium2_yapikredi.py
117
+
118
+ Uses pypdfium2 to extract the necessary text from the Yapı Kredi PDFs located in data/yapikredi_PDF,
119
+ puts all extracted text into a list of dictionaries with the keys date, count and paragraph,
120
+ and writes the combined list to a .json file
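
The record format described above can be sketched as follows. The PDF extraction itself is left out, so the input here is text that has already been extracted. Treating `count` as the paragraph's position within its report is an assumption, as are the function names and the blank-line paragraph split.

```python
import json


def build_records(date, text):
    """Split one report's text into paragraph records with the README's keys."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        # Assumption: count is the 1-based position of the paragraph.
        {"date": date, "count": i, "paragraph": p}
        for i, p in enumerate(paragraphs, start=1)
    ]


def write_records(texts_by_date, json_path):
    """Combine the records of all reports and dump them to one .json file."""
    records = []
    for date, text in sorted(texts_by_date.items()):
        records.extend(build_records(date, text))
    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Turkish characters readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```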
121
+
122
+ ## JSON Labeling
123
+
124
+ After text extraction, the output .json files were split according to the daily change of the BIST-100:
125
+ if a text was published on a day when the BIST-100 had a negative change, the processed text was put into the negative folder;
126
+ otherwise it was put into the positive folder. These folders serve as the labeled data for our machine learning model
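
The labeling rule can be illustrated with a minimal sketch. Here a plain dictionary `daily_change` stands in for the XU100 Excel sheet, and the function names, output file naming and the choice to skip dates without trading data are all assumptions made for illustration.

```python
import os


def label_for(change):
    """Negative change -> 'negative'; anything else -> 'positive'."""
    return "negative" if change < 0 else "positive"


def sort_records(records, daily_change, out_dir):
    """Write each record into out_dir/negative or out_dir/positive."""
    counts = {"negative": 0, "positive": 0}
    for rec in records:
        change = daily_change.get(rec["date"])
        if change is None:
            # Assumption: dates missing from the sheet (e.g. weekends) are skipped.
            continue
        label = label_for(change)
        folder = os.path.join(out_dir, label)
        os.makedirs(folder, exist_ok=True)
        # Hypothetical file name built from the record's date and count.
        name = f"{rec['date']}_{rec['count']}.txt"
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(rec["paragraph"])
        counts[label] += 1
    return counts
```

Note that a day with exactly zero change lands in the positive folder under this rule, because only negative changes are singled out.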
127
+
128
+ - json_sorter.py
129
+
130
+ This program receives a JSON file containing a list of dictionaries with the keys date, count and paragraph.
131
+ The date of each element is compared with the XU100 Excel sheet, which holds the changes in the BIST-100 value.
132
+ Each dictionary's date is looked up in XU100, and the entry is sorted into the negative folder if the change is negative
133
+ or into the positive folder if it is positive.
134
+
135
+
136
+ - json_sort_haberler.py
137
+
138
+ This program receives a JSON file containing a list of dictionaries with the keys date, count and paragraph.
139
+ The date of each element is compared with the XU100 Excel sheet, which holds the changes in the BIST-100 value.
140
+ Each dictionary's date is looked up in XU100, and the entry is sorted into the negative folder if the change is negative
141
+ or into the positive folder if it is positive.
142
+
143
+
144
+ - json_sort_tweet.py
145
+
146
+ This program receives a JSON file containing a dictionary of dictionaries, with ids as keys and
147
+ date, tweet and views as the fields of each value. The date of each element is compared with the XU100 Excel sheet, which
148
+ holds the changes in the BIST-100 value. Each entry's date is looked up in XU100, and the entry is
149
+ sorted into the negative folder if the change is negative or into the positive folder if it is positive.
150
+ ## True Labeling
151
+
152
+ - parse_keywords.py
153
+ Checks the keywords in each file assigned to neg or pos and moves the file to the other folder if it was mislabeled
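
A keyword-based relabeling pass could be sketched as below. The keyword lists are purely hypothetical placeholders, since the project's real lists are not shown here, and the counting rule (move a file only when one side clearly outnumbers the other) is likewise an assumption rather than the logic of parse_keywords.py.

```python
import os
import shutil

# Hypothetical keyword lists; replace with the project's real ones.
POS_WORDS = ("yükseliş", "artış", "rekor")
NEG_WORDS = ("düşüş", "kayıp", "satış baskısı")


def keyword_label(text):
    """Return 'pos', 'neg', or None when the keywords give no clear signal."""
    lowered = text.lower()
    pos = sum(lowered.count(w) for w in POS_WORDS)
    neg = sum(lowered.count(w) for w in NEG_WORDS)
    if pos == neg:
        return None
    return "pos" if pos > neg else "neg"


def relabel(root):
    """Move files whose keyword label disagrees with their current folder."""
    moved = []
    for current in ("pos", "neg"):
        folder = os.path.join(root, current)
        for name in sorted(os.listdir(folder)):
            src = os.path.join(folder, name)
            with open(src, encoding="utf-8") as f:
                label = keyword_label(f.read())
            if label is not None and label != current:
                shutil.move(src, os.path.join(root, label, name))
                moved.append((name, current, label))
    return moved
```

One caveat with this sketch: Python's `str.lower()` is not Turkish-aware (an upper-case `I` becomes `i`, not `ı`), so text in capitals may need dedicated normalization before matching.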
154
+
155
+ ## Model Training
156
+
157
+ - bert_train
158
+ + Trains the BERT model on the data and checks the results. BertTokenizer is also used to further preprocess the data.
159
+ + To see the results and scores of the model, please check this file.
160
+
161
+ ## Visualizations
162
+
163
+ - Visualizations.ipynb
164
+ Shows the performance of the model on the whole dataset and visualizes the sentiment of the brokerage reports and news.
165
+ - CS210Visualization.pptx
166
+ - sentiment_of_broker_sites.ipynb
167
+ Plots the sentiment of four broker sites as percentage comparisons, with positive and negative sentiment as the categories
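
A percentage comparison of this kind might be computed and drawn as in the sketch below. It is an illustration under assumptions, not the notebook's code: the function names are hypothetical, the input is a mapping from site name to counts of positive and negative texts, and a stacked bar chart is only one possible way to show the shares.

```python
def sentiment_shares(counts):
    """Convert {'site': (n_pos, n_neg)} into {'site': (pos_pct, neg_pct)}."""
    shares = {}
    for site, (pos, neg) in counts.items():
        total = pos + neg
        if total == 0:
            # No texts for this site: report zero shares instead of dividing by zero.
            shares[site] = (0.0, 0.0)
        else:
            shares[site] = (100.0 * pos / total, 100.0 * neg / total)
    return shares


def plot_shares(shares):
    """Draw one stacked bar per site; requires Matplotlib."""
    # Imported lazily so sentiment_shares works without Matplotlib installed.
    import matplotlib.pyplot as plt

    sites = list(shares)
    pos = [shares[s][0] for s in sites]
    neg = [shares[s][1] for s in sites]
    fig, ax = plt.subplots()
    ax.bar(sites, pos, label="positive")
    # Stack the negative share on top of the positive share.
    ax.bar(sites, neg, bottom=pos, label="negative")
    ax.set_ylabel("Share of texts (%)")
    ax.legend()
    return fig
```

The counts passed to `sentiment_shares` would come from the model's predictions for each broker site.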