MilesCranmer committed
Commit d94ce53
1 Parent(s): 9060684

Many more examples in docs

Files changed (2)
  1. docs/_sidebar.md +2 -2
  2. docs/examples.md +86 -1
docs/_sidebar.md CHANGED
@@ -1,9 +1,9 @@
  - Using PySR
 
  - [Getting Started](/)
- - [Options](options.md)
- - [Operators](operators.md)
  - [Examples](examples.md)
+ - [More Options](options.md)
+ - [Operators](operators.md)
 
  - API Reference
 
docs/examples.md CHANGED
@@ -96,7 +96,92 @@ Which gives us:
 
  ![](https://github.com/MilesCranmer/PySR/raw/master/docs/images/example_plot.png)
 
- ## 5. Additional features
+ ## 5. Feature selection
+
+ PySR, like evolution-based symbolic regression in general, performs
+ very poorly when the number of features is large.
+ Even, say, 10 features might be too many for a typical equation search.
+
+ If you are dealing with high-dimensional data with a particular type of structure,
+ you might consider using deep learning to break the problem into
+ smaller "chunks" which can then be solved by PySR, as explained in the paper
+ [2006.11287](https://arxiv.org/abs/2006.11287).
+
+ For tabular datasets, this is a bit trickier. Luckily, PySR has a built-in feature
+ selection mechanism: simply set the parameter `select_k_features=5` to select
+ the 5 most important features.
+
+ Here is an example. Let's say we have 30 input features and 300 data points, but only 2
+ of those features are actually used:
+ ```python
+ X = np.random.randn(300, 30)
+ y = X[:, 3]**2 - X[:, 19]**2 + 1.5
+ ```
+
+ Let's create a model with the feature selection argument set up:
+ ```python
+ model = PySRRegressor(
+     binary_operators=["+", "-", "*", "/"],
+     unary_operators=["exp"],
+     select_k_features=5,
+     **kwargs
+ )
+ ```
+ Now let's fit this:
+ ```python
+ model.fit(X, y)
+ ```
+
+ Before the Julia backend is launched, you should see the string:
+ ```
+ Using features ['x3', 'x5', 'x7', 'x19', 'x21']
+ ```
+ which indicates that the feature selection (powered by a gradient-boosting tree)
+ has successfully kept the two relevant features, x3 and x19, among the five it selected.
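+
+ As a rough sketch of what this selection step does (an illustration only; PySR's internal estimator and settings may differ), you could rank the features yourself with scikit-learn:
+ ```python
+ # Illustrative sketch of tree-based feature selection (not PySR's exact code).
+ import numpy as np
+ from sklearn.ensemble import GradientBoostingRegressor
+
+ # Rank features by how much a boosted-tree ensemble relies on them:
+ gbr = GradientBoostingRegressor().fit(X, y)
+ top_k = np.argsort(gbr.feature_importances_)[::-1][:5]
+ print(sorted(top_k))  # should include 3 and 19
+ ```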
+
+ This fit should find the solution quickly, whereas with all 30 features,
+ the search would have struggled.
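+
+ Once the fit completes, a quick way to inspect what was found (a small usage sketch; `model.sympy()` returns the best discovered equation as a SymPy expression):
+ ```python
+ print(model)          # table of discovered equations
+ print(model.sympy())  # best discovered equation, as SymPy
+ ```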
+
+ This simple preprocessing step is enough to simplify our tabular dataset,
+ but again, for more structured datasets, you should try the deep learning
+ approach mentioned above.
+
+ ## 6. Denoising
+
+ Many datasets, especially in the observational sciences,
+ contain intrinsic noise. PySR is itself fairly robust to noise, as it is simply optimizing a loss function,
+ but there are still some additional steps you can take to reduce the effect of noise.
+
+ One thing you could do, which we won't detail here, is to create a custom log-likelihood
+ given some assumed noise model. By passing weights to the fit function, and
+ defining a custom loss function such as `loss="myloss(x, y, w) = w * (x - y)^2"`,
+ you can define any sort of log-likelihood you wish. (However, note that the loss must be bounded below by zero.)
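+
+ For instance, if each data point came with a known standard error, inverse-variance weighting might look like the following sketch (the `sigma` array and the weighting scheme are illustrative assumptions, not PySR defaults; `**kwargs` mirrors the other snippets here):
+ ```python
+ # Hypothetical per-point uncertainties; weight each point by 1/sigma^2.
+ sigma = np.full(len(y), 0.1)
+ weights = 1 / sigma**2
+
+ model = PySRRegressor(
+     binary_operators=["+", "-", "*", "/"],
+     # Weighted squared error, proportional to a Gaussian negative log-likelihood:
+     loss="myloss(x, y, w) = w * (x - y)^2",
+     **kwargs
+ )
+ model.fit(X, y, weights=weights)
+ ```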
+
+ However, the simplest thing to do is preprocessing, just like for feature selection. To do this,
+ set the parameter `denoise=True`. This will fit a Gaussian process (containing a white noise kernel)
+ to the input dataset, and predict new, denoised targets from that Gaussian process.
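+
+ Conceptually, that preprocessing resembles this scikit-learn sketch for arrays `X` and `y` (an illustration of the idea; the exact kernel and settings used internally may differ):
+ ```python
+ # Illustrative only: denoise the targets with a Gaussian process.
+ from sklearn.gaussian_process import GaussianProcessRegressor
+ from sklearn.gaussian_process.kernels import RBF, WhiteKernel
+
+ # RBF captures the smooth signal; WhiteKernel absorbs the noise.
+ gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
+ gp.fit(X, y)
+ y_denoised = gp.predict(X)  # smoothed targets, used in place of y
+ ```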
+
+ For example:
+ ```python
+ X = np.random.randn(100, 5)
+ noise = np.random.randn(100) * 0.1
+ y = np.exp(X[:, 0]) + X[:, 1] + X[:, 2] + noise
+ ```
+
+ Let's create and fit a model with the denoising argument set up:
+ ```python
+ model = PySRRegressor(
+     binary_operators=["+", "-", "*", "/"],
+     unary_operators=["exp"],
+     denoise=True,
+     **kwargs
+ )
+ model.fit(X, y)
+ print(model)
+ ```
+ If all goes well, you should find that it predicts the correct input equation, without the noise term!
+
+ ## 7. Additional features
 
  For the many other features available in PySR, please
  read the [Options section](options.md).