# Eureqa.jl

Symbolic regression built on Julia, with a Python interface.
Uses regularized evolution and simulated annealing.

## Installation

Install [Julia](https://julialang.org/downloads/). Then, at the command line,
install the `Optim` package via: `julia -e 'import Pkg; Pkg.add("Optim")'`.
On the Python side, you need Python 3 with numpy and pandas installed
(e.g., `pip install numpy pandas`).

## Running

### Quickstart

```python
import numpy as np
from eureqa import eureqa

# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = eureqa(X, y, niterations=5)

...

print(equations)
```

which gives:

```
   Complexity       MSE                                                Equation
0           5  1.947431                          plus(-1.7420927, mult(x0, x0))
1           8  0.486858           plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
2          11  0.000000  plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))
```

### API

What follows is the API reference for the numpy interface.
Note that most parameters here
have been tuned with ~1000 trials over several example
equations, so you don't need to tune them yourself.
However, you should adjust `threads`, `niterations`,
`binary_operators`, and `unary_operators` to your requirements,
as in the sketch below.
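
A minimal example of such a call (parameter names are taken from the
signature documented below; the values are illustrative):

```python
# Illustrative values only; all other parameters keep the defaults
# from the signature documented below.
equations = eureqa(
    X, y,
    threads=8,                          # one population per thread
    niterations=40,                     # more iterations => a longer search
    binary_operators=["plus", "mult"],
    unary_operators=["cos", "exp"],
)
```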

The program will output a pandas DataFrame containing the equations,
mean square error, and complexity. It also dumps the results to a csv
at the end of every iteration (`hall_of_fame.csv` by default)
and prints the equations to stdout.
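
Since the csv is rewritten after each iteration, you can monitor a long run
from another process. A minimal sketch, assuming the default filename and the
`|` separator noted under `equation_file` below:

```python
import pandas as pd

# The hall of fame is rewritten after each iteration; per the
# `equation_file` notes below, the csv is '|'-separated.
hof = pd.read_csv("hall_of_fame.csv", sep="|")
print(hof)
```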

You can add more operators in `operators.jl`, or use the default
Julia ones. Make sure all operators are defined for scalar `Float32`.
Then just specify the operator names in your call, as in the example below.
You can also change the dataset learned on by passing in `X` and `y` as
numpy arrays to `eureqa(...)`.
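
For example, if you had defined a scalar `Float32` operator named `square`
in `operators.jl` (a hypothetical operator, used here only for illustration),
you would pass it by name:

```python
# `square` is a hypothetical unary operator, assumed to be defined in
# operators.jl for scalar Float32, e.g. square(x::Float32) = x * x.
equations = eureqa(X, y,
                   binary_operators=["plus", "mult"],
                   unary_operators=["cos", "square"])
```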

```python
eureqa(X=None, y=None, threads=4, niterations=20,
   ncyclesperiteration=int(default_ncyclesperiteration),
   binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"],
   alpha=default_alpha, annealing=True, fractionReplaced=default_fractionReplaced,
   fractionReplacedHof=default_fractionReplacedHof, npop=int(default_npop),
   parsimony=default_parsimony, migration=True, hofMigration=True,
   shouldOptimizeConstants=True, topn=int(default_topn),
   weightAddNode=default_weightAddNode, weightDeleteNode=default_weightDeleteNode,
   weightDoNothing=default_weightDoNothing,
   weightMutateConstant=default_weightMutateConstant,
   weightMutateOperator=default_weightMutateOperator,
   weightRandomize=default_weightRandomize, weightSimplify=default_weightSimplify,
   timeout=None, equation_file='hall_of_fame.csv', test='simple1', maxsize=20)
```

Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.

**Arguments**:

- `X`: np.ndarray, 2D array. Rows are examples, columns are features.
- `y`: np.ndarray, 1D array. Rows are examples.
- `threads`: int, Number of threads (= number of populations running).
You can have more threads than cores; it actually makes the search more
efficient.
- `niterations`: int, Number of iterations of the algorithm to run. The best
equations are printed, and migrate between populations, at the
end of each.
- `ncyclesperiteration`: int, Number of total mutations to run, per 10
samples of the population, per iteration.
- `binary_operators`: list, List of strings giving the binary operators,
either from Julia's Base or from `operators.jl`.
- `unary_operators`: list, Same but for operators taking a single `Float32`.
- `alpha`: float, Initial temperature.
- `annealing`: bool, Whether to use annealing. You should (and it is the default).
- `fractionReplaced`: float, How much of the population to replace with migrating
equations from other populations.
- `fractionReplacedHof`: float, How much of the population to replace with migrating
equations from the hall of fame.
- `npop`: int, Number of individuals in each population.
- `parsimony`: float, Multiplicative factor for how much to punish complexity.
- `migration`: bool, Whether to migrate.
- `hofMigration`: bool, Whether to have the hall of fame migrate.
- `shouldOptimizeConstants`: bool, Whether to numerically optimize
constants (Nelder-Mead/Newton) at the end of each iteration.
- `topn`: int, How many top individuals migrate from each population.
- `weightAddNode`: float, Relative likelihood for mutation to add a node.
- `weightDeleteNode`: float, Relative likelihood for mutation to delete a node.
- `weightDoNothing`: float, Relative likelihood for mutation to leave the individual unchanged.
- `weightMutateConstant`: float, Relative likelihood for mutation to change
the constant slightly in a random direction.
- `weightMutateOperator`: float, Relative likelihood for mutation to swap
an operator.
- `weightRandomize`: float, Relative likelihood for mutation to completely
delete and then randomly generate the equation.
- `weightSimplify`: float, Relative likelihood for mutation to simplify
constant parts by evaluation.
- `timeout`: float, Time in seconds after which to stop the search.
- `equation_file`: str, Where to save the results (a `|`-separated csv).
- `test`: str, Which test problem to run, if `X`, `y` are not passed.
- `maxsize`: int, Max size of an equation.

**Returns**:

pd.DataFrame, Results dataframe giving complexity, MSE, and equations
(as strings).
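
For example, to pull out the lowest-error equation (a sketch assuming the
column names shown in the Quickstart output above):

```python
# Column names as printed in the Quickstart output above.
best = equations.loc[equations["MSE"].idxmin()]
print(best["Equation"])
```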


# TODO

- [ ] Consider adding mutation for constant<->variable
- [ ] Use an NN to generate weights over the full probability distribution, conditional on error and the existing equation, and train on some randomly-generated equations
- [ ] Performance:
    - [ ] Use an enum for functions instead of storing them?
    - Current most expensive operations:
        - [x] deepcopy() before the mutate, to see whether to accept or not.
            - Seems like it's necessary right now, but it is still by far the slowest option.
        - [ ] Calculating the loss function: there are duplicate calculations happening.
        - [ ] Declaration of the weights array every iteration
- [x] Hyperparameter tune
- [x] Create a benchmark for accuracy
- [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
    - Could just write out the dataset in julia, or load it.
- [x] Create a Python interface
- [x] Explicit constant optimization on hall-of-fame
    - Create method to find and return all constants, from left to right
    - Create method to find and set all constants, in same order
    - Pull up some optimization algorithm and add it. Keep the package small!
- [x] Create a benchmark for speed
- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- [x] Record hall of fame
- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- [x] Test performance of reduced precision integers
    - No effect
- [x] Create struct to pass through all hyperparameters, instead of treating as constants
    - Make sure this doesn't affect performance