Update pages/6_Feature_Engineering.py

pages/6_Feature_Engineering.py (+31 −29)
```diff
@@ -197,42 +197,44 @@ if file_type == "One-Hot Vectorization":
 st.subheader(":red[Disadvantages]")
 
 st.subheader(":blue[Different Document Length]")
-st.markdown(
-
-
-
-)
+st.markdown('''
+- Every document has a different number of words (here we convert each word to a vector, not the whole document to a vector)
+- Because of this, we cannot convert the corpus into tabular data
+- Converting into tabular data becomes possible once each document is converted into a single vector (this is solved by Bag of Words (BOW))
+''')
 
 st.subheader(":blue[Sparsity]")
-st.markdown(
-
-
-
-)
+st.markdown('''
+- The vectors created by one-hot vectorization are sparse vectors
+- When the entire dataset is given to an algorithm, the model learns from data that is mostly zeros and becomes biased towards the zero values
+- In ML this issue shows up as overfitting
+- It is solved in deep learning
+''')
 
 st.subheader(":blue[Curse of Dimensionality]")
-st.markdown(
-
-
-
-)
+st.markdown('''
+- Documents increase ↑ → vocabulary increases ↑ → vector size increases ↑ → dimensionality also increases ↑
+- ML performance decreases ↓, since dimensionality depends entirely on the vocabulary and shoots up as documents grow in number and variety
+''')
 
 st.subheader(":blue[Out of Vocabulary Issue]")
-st.markdown(
-
-
-
-)
+st.markdown('''
+- Documents are converted to vectors only at training time, using our own dataset
+- A word that was not present in the training dataset cannot be converted into vector format, which results in a key error
+- This is solved by FastText
+''')
 
 st.subheader(":blue[Inability to Preserve Semantic Meaning]")
-st.markdown(
-
-
-
-)
+st.markdown('''
+- While converting text → vector format, the relationships between words should be preserved
+- We need to convert the document into vectors in such a way that the semantic relationship is preserved
+- Similarity ⬆️ and Distance ⬇️
+- Similarity ∝ 1 / Distance
+- The distance between vectors of similar words should be very small
+- If this is satisfied, the technique preserves semantic meaning well
+''')
 
 st.subheader(":blue[Lack of Sequential Information]")
-st.markdown(
-
-
-)
+st.markdown('''
+- Sequential (word-order) information is not preserved
+''')
```
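The examples below are minimal sketches, not part of this commit; all names and toy data in them are made up for illustration. For the "Different Document Length" point: per-word one-hot encoding turns a document into one row per word, so documents of different lengths produce matrices of different heights, while a per-document Bag of Words vector always has width `len(vocab)`.

```python
# Sketch: per-word one-hot vectors vs. a per-document Bag of Words vector.
docs = ["good movie", "not a good movie at all"]
vocab = sorted({w for d in docs for w in d.split()})

def one_hot(word):
    # One 0/1 vector per word; a document becomes len(words) x len(vocab).
    return [1 if v == word else 0 for v in vocab]

doc_matrices = [[one_hot(w) for w in d.split()] for d in docs]
print([len(m) for m in doc_matrices])        # [2, 6] -- heights differ per document

def bag_of_words(doc):
    # One fixed-width vector per document -- tabular data becomes possible.
    words = doc.split()
    return [words.count(v) for v in vocab]

print([len(bag_of_words(d)) for d in docs])  # [6, 6] -- same width for every document
```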
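For the "Sparsity" point, a sketch assuming a 10,000-word vocabulary: each one-hot vector has exactly one non-zero entry, so almost everything a model sees is zero.

```python
# Sketch: fraction of zeros in a single one-hot word vector.
vocab_size = 10_000
vec = [0] * vocab_size
vec[42] = 1                      # exactly one "hot" position per word
print(f"{vec.count(0) / vocab_size:.4%} of entries are zero")   # 99.9900%
```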
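For the "Curse of Dimensionality" point: the one-hot dimension equals the vocabulary size, so it grows with every new document that brings new words.

```python
# Sketch: vector dimensionality tracks vocabulary size as the corpus grows.
corpus = []
for doc in ["the cat sat", "a dog ran fast", "birds can fly very high"]:
    corpus.append(doc)
    vocab = {w for d in corpus for w in d.split()}
    print(f"docs={len(corpus)}  vocab={len(vocab)}  one-hot dimension={len(vocab)}")
```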
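For the "Out of Vocabulary Issue": a word-to-index mapping built at training time raises `KeyError` for any unseen word (FastText sidesteps this by building vectors from character n-grams).

```python
# Sketch: encoding fails for a word that was never seen during training.
train_words = ["good", "movie", "bad"]
word_to_index = {w: i for i, w in enumerate(train_words)}

def encode(word):
    vec = [0] * len(word_to_index)
    vec[word_to_index[word]] = 1   # KeyError if the word was not in training data
    return vec

print(encode("good"))              # [1, 0, 0]
try:
    encode("great")                # never seen during training
except KeyError as err:
    print("out-of-vocabulary word:", err)
```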
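For "Inability to Preserve Semantic Meaning": distinct one-hot vectors are always orthogonal and equidistant, so by the Similarity ∝ 1 / Distance rule above, "king" looks exactly as related to "queen" as to "pizza".

```python
import math

# Sketch: every pair of distinct one-hot vectors has cosine similarity 0
# and the same Euclidean distance -- semantic relationships are not preserved.
vocab = ["king", "queen", "pizza"]
one_hot = {w: [1 if v == w else 0 for v in vocab] for w in vocab}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine(one_hot["king"], one_hot["queen"]), distance(one_hot["king"], one_hot["queen"]))  # 0.0 1.414...
print(cosine(one_hot["king"], one_hot["pizza"]), distance(one_hot["king"], one_hot["pizza"]))  # 0.0 1.414...
```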
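For "Lack of Sequential Information": two sentences with opposite meanings produce the same multiset of word vectors once order is discarded.

```python
from collections import Counter

# Sketch: word order is lost -- both sentences yield identical representations.
s1 = "dog bites man"
s2 = "man bites dog"
print(Counter(s1.split()) == Counter(s2.split()))   # True: order information is gone
```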