Harika22 committed on
Commit 058b5ee · verified · 1 Parent(s): aea589c

Update pages/6_Feature_Engineering.py

Files changed (1)
  1. pages/6_Feature_Engineering.py +31 -29
pages/6_Feature_Engineering.py CHANGED
@@ -197,42 +197,44 @@ if file_type == "One-Hot Vectorization":
  st.subheader(":red[Disadvantages]")

  st.subheader(":blue[Different Document Length]")
- st.markdown(
- "<p class='content'>Every document contains a different number of words. Here, we are not converting the entire document into a vector, but rather each word separately. "
- "This makes it difficult to structure the data into a tabular format. Converting entire documents into vectors, which is addressed by Bag of Words (BOW), solves this issue.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":blue[Sparsity]")
- st.markdown(
- "<p class='content'>The vectors created using one-hot encoding tend to be sparse. When data is given to any algorithm, the model may become biased towards zero values, "
- "leading to an issue in machine learning known as overfitting. This problem is primarily addressed in deep learning.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":blue[Curse of Dimensionality]")
- st.markdown(
- "<p class='content'>As the number of documents increases, the vocabulary size grows, leading to an increase in dimensionality. This negatively impacts machine learning performance "
- "because the dimensionality of vectors is directly dependent on vocabulary size, which grows as more documents are introduced.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":blue[Out of Vocabulary Issue]")
- st.markdown(
- "<p class='content'>One-hot encoding only converts words that were present in the dataset at the time of training. If a new word appears during inference and was not included in the "
- "training dataset, it cannot be converted into a vector, causing a key error. This issue is effectively solved by FastText.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":blue[Inability to Preserve Semantic Meaning]")
- st.markdown(
- "<p class='content'>When converting text into vector format, the relationships between words should be preserved. Ideally, similar words should be represented by similar vectors, "
- "meaning the distance between their vectors should be small. If this is achieved, the vectorization method successfully preserves semantic meaning.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":blue[Lack of Sequential Information]")
- st.markdown(
- "<p class='content'>One-hot encoding does not preserve sequential information in text. The order of words, which is crucial in natural language, is completely lost in this encoding method.</p>",
- unsafe_allow_html=True,
- )

  st.subheader(":red[Disadvantages]")

  st.subheader(":blue[Different Document Length]")
+ st.markdown('''
+ - Every document has a different number of words (here we are not converting a document into a vector; we are converting each word into a vector)
+ - Because of this, the data cannot be arranged into a tabular format
+ - Converting into tabular data becomes possible once entire documents are converted into vectors, which is solved by Bag of Words (BOW)
+ ''')
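To make the mismatch concrete, here is a minimal sketch; the two-document corpus and the `one_hot` helper are illustrative assumptions, not part of the app:

```python
import numpy as np

# Illustrative toy corpus: documents contain different numbers of words
docs = [["the", "food", "was", "good"], ["good", "food"]]
vocab = sorted({w for d in docs for w in d})   # ['food', 'good', 'the', 'was']

def one_hot(word):
    """One-hot vector for a single word over the shared vocabulary."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

# Each document becomes a (num_words, vocab_size) matrix, so the two
# documents produce arrays of different shapes -- not a single table.
matrices = [np.array([one_hot(w) for w in doc]) for doc in docs]
print([m.shape for m in matrices])   # [(4, 4), (2, 4)]
```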
 
  st.subheader(":blue[Sparsity]")
+ st.markdown('''
+ - The vectors created using one-hot vectorization are sparse
+ - When the entire dataset is given to an algorithm, the machine learns from that data, and because the data is sparse the algorithm becomes biased towards the zero values
+ - This issue in ML is known as overfitting
+ - It is solved in Deep Learning
+ ''')
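A quick sketch of the sparsity claim above, assuming a 10,000-word vocabulary purely for illustration:

```python
import numpy as np

vocab_size = 10_000              # assumed vocabulary size, for illustration
vec = np.zeros(vocab_size)       # one-hot vector for a single word
vec[42] = 1                      # exactly one non-zero entry

# 9,999 of the 10,000 entries are zero
sparsity = 1 - np.count_nonzero(vec) / vec.size
print(f"sparsity = {sparsity:.2%}")   # sparsity = 99.99%
```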
 
  st.subheader(":blue[Curse of Dimensionality]")
+ st.markdown('''
+ - Documents increase ↑ → vocabulary increases ↑ → vector length and dimensionality also increase ↑
+ - ML performance decreases ↓, because dimensionality depends entirely on the vocabulary, which shoots up as more and more distinct documents are added
+ ''')

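A minimal sketch of how dimensionality tracks vocabulary size, using an assumed three-document corpus:

```python
# Assumed toy corpus: each new document brings new words into the vocabulary
corpus = [
    "the food was good",
    "the service was slow",
    "delivery arrived late yesterday",
]

vocab = set()
for i, doc in enumerate(corpus, start=1):
    vocab |= set(doc.split())
    # one-hot dimensionality == vocabulary size, so it grows with every document
    print(f"after document {i}: vector dimensionality = {len(vocab)}")
# after document 1: 4 -> after document 2: 6 -> after document 3: 10
```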
  st.subheader(":blue[Out of Vocabulary Issue]")
+ st.markdown('''
+ - Words are only converted during training time, and we supply our own dataset
+ - If a word was not present in the dataset during training, it cannot be converted into vector format, which results in a key error
+ - This is solved by FastText
+ ''')
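The key error described above is easy to reproduce; the `word_to_index` lookup below is a hypothetical helper, not from the app:

```python
import numpy as np

# Hypothetical lookup built from the training vocabulary only
vocab = ["food", "good", "the", "was"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1   # raises KeyError for unseen words
    return vec

one_hot("good")    # fine: "good" was seen during training
one_hot("tasty")   # KeyError: 'tasty' never appeared in the training data
```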
 
  st.subheader(":blue[Inability to Preserve Semantic Meaning]")
+ st.markdown('''
+ - While converting text → vector format, the relationships between words should be preserved
+ - We need to convert documents into vectors in such a way that the semantic relationship is preserved
+ - Similarity ⬆️ ⇒ Distance ⬇️, i.e. Similarity ∝ 1 / Distance
+ - The distance between vectors of similar words should be very small
+ - If this is satisfied, the technique preserves semantic meaning well
+ ''')
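One way to see why one-hot vectors fail this test: every pair of distinct one-hot vectors is the same distance apart, so no similarity structure can be encoded. A minimal sketch with an assumed toy vocabulary:

```python
import numpy as np

vocab = ["good", "great", "terrible"]   # assumed toy vocabulary
vectors = np.eye(len(vocab))            # one one-hot vector per word

def dist(i, j):
    return np.linalg.norm(vectors[i] - vectors[j])

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart, so
# "good"/"great" are no closer than "good"/"terrible".
print(dist(0, 1), dist(0, 2))   # 1.4142... 1.4142...
```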
 
  st.subheader(":blue[Lack of Sequential Information]")
+ st.markdown('''
+ - Sequential information (the order of words) is not preserved
+ ''')
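A minimal sketch of the loss of order, assuming documents are represented as a sum of per-word one-hot vectors (an illustrative choice, not the app's code):

```python
import numpy as np

vocab = ["bites", "dog", "man"]           # assumed toy vocabulary
idx = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    """Sum of per-word one-hot vectors: an order-free representation."""
    vec = np.zeros(len(vocab), dtype=int)
    for w in sentence.split():
        vec[idx[w]] += 1
    return vec

# Opposite meanings, identical vectors: the word order is lost
print(encode("dog bites man"))   # [1 1 1]
print(encode("man bites dog"))   # [1 1 1]
```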