PySR.jl

Symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution and simulated annealing.

Backstory: we used the original eureqa in our paper to convert a graph neural network into an analytic equation describing dark matter overdensity. However, eureqa is GUI-only, doesn't allow for user-defined operators, has no distributed capabilities, and has become proprietary. Thus, the goal of this package is to have an open-source symbolic regression tool as efficient as eureqa, while also exposing a configurable python interface.

The algorithms here implement regularized evolution, as in AutoML-Zero, but with additional algorithmic changes such as simulated annealing, and classical optimization of constants.

Installation

Install Julia - see downloads, and then instructions for mac and linux. Then, at the command line, install the Optim and SpecialFunctions packages via: julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")'.

For python, you need to have Python 3, numpy, and pandas installed.

You can install this package from PyPI with:

pip install pysr

Quickstart

import numpy as np
from pysr import pysr

# Dataset
X = 2*np.random.randn(100, 5)
y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2

# Learn equations
equations = pysr(X, y, niterations=5,
            binary_operators=["plus", "mult"],
            unary_operators=["cos", "exp", "sin"])

...

print(equations)

which gives:

   Complexity       MSE                                                Equation
0           5  1.947431                          plus(-1.7420927, mult(x0, x0))
1           8  0.486858           plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
2          11  0.000000  plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))

Custom operators

One can define custom operators in Julia by passing a string:

equations = pysr.pysr(X, y, niterations=100,
    binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
    unary_operators=["cos"])

Now, the symbolic regression code can search using this special function that squares its left argument and adds it to its right. Make sure all passed functions are valid Julia code, and take one (unary) or two (binary) float32 scalars as input, and output a float32. Operators are automatically vectorized.

One can also edit operators.jl. See below for more options.

Weighted data

Here, we assign weights to each row of data using inverse uncertainty squared. We also use 10 threads instead of the usual 4, which creates more populations (one population per thread).

sigma = ...
weights = 1/sigma**2

equations = pysr.pysr(X, y, weights=weights, threads=10)

Operators

All Base julia operators that take 1 or 2 float32 as input, and output a float32 as output, are available. A selection of these and other valid operators are stated below. You can also define your own in operators.jl, and pass the function name as a string.

Binary

plus, mult, pow, div, greater, mod, beta, logical_or, logical_and

Unary

neg, exp, abs, logm (=log(abs(x) + 1e-8)), logm10 (=log10(abs(x) + 1e-8)), logm2 (=log2(abs(x) + 1e-8)), log1p, sin, cos, tan, sinh, cosh, tanh, asin, acos, atan, asinh, acosh, atanh, erf, erfc, gamma, relu, round, floor, ceil, round, sign.

Full API

What follows is the API reference for running the numpy interface. You likely don't need to tune the hyperparameters yourself, but if you would like, you can use hyperopt.py as an example. However, you should adjust threads, niterations, binary_operators, unary_operators, and maxsize to your requirements.

The program will output a pandas DataFrame containing the equations, mean square error, and complexity. It will also dump to a csv at the end of every iteration, which is hall_of_fame.csv by default. It also prints the equations to stdout.

pysr(X=None, y=None, weights=None, threads=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20)

Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i. Note: most default parameters have been tuned over several example equations, but you should adjust threads, niterations, binary_operators, unary_operators to your requirements.

Arguments:

X: np.ndarray, 2D array. Rows are examples, columns are features.
y: np.ndarray, 1D array. Rows are examples.
weights: np.ndarray, 1D array. Same shape as y. Optional weighted sum (e.g., 1/error^2).
threads: int, Number of threads (=number of populations running). You can have more threads than cores - it actually makes it more efficient.
niterations: int, Number of iterations of the algorithm to run. The best equations are printed, and migrate between populations, at the end of each.
ncyclesperiteration: int, Number of total mutations to run, per 10 samples of the population, per iteration.
binary_operators: list, List of strings giving the binary operators in Julia's Base, or in operator.jl.
unary_operators: list, Same but for operators taking a single Float32.
alpha: float, Initial temperature.
annealing: bool, Whether to use annealing. You should (and it is default).
fractionReplaced: float, How much of population to replace with migrating equations from other populations.
fractionReplacedHof: float, How much of population to replace with migrating equations from hall of fame.
npop: int, Number of individuals in each population
parsimony: float, Multiplicative factor for how much to punish complexity.
migration: bool, Whether to migrate.
hofMigration: bool, Whether to have the hall of fame migrate.
shouldOptimizeConstants: bool, Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration.
topn: int, How many top individuals migrate from each population.
nrestarts: int, Number of times to restart the constant optimizer
perturbationFactor: float, Constants are perturbed by a max factor of (perturbationFactor*T + 1). Either multiplied by this or divided by this.
weightAddNode: float, Relative likelihood for mutation to add a node
weightInsertNode: float, Relative likelihood for mutation to insert a node
weightDeleteNode: float, Relative likelihood for mutation to delete a node
weightDoNothing: float, Relative likelihood for mutation to leave the individual
weightMutateConstant: float, Relative likelihood for mutation to change the constant slightly in a random direction.
weightMutateOperator: float, Relative likelihood for mutation to swap an operator.
weightRandomize: float, Relative likelihood for mutation to completely delete and then randomly generate the equation
weightSimplify: float, Relative likelihood for mutation to simplify constant parts by evaluation
timeout: float, Time in seconds to timeout search
equation_file: str, Where to save the files (.csv separated by |)
test: str, What test to run, if X,y not passed.
maxsize: int, Max size of an equation.

Returns:

pd.DataFrame, Results dataframe, giving complexity, MSE, and equations (as strings).

TODO

Refresh screen rather than dumping to stdout?
Test suite
Add ability to save state from python
Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core.
Calculate feature importances based on features we've already seen, then weight those features up in all random generations.
Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
- Store feature importances of future, and periodically update it.
Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
Experiment with freezing parts of model; then we only append/delete at end of tree.
Sympy printing
Sympy evaluation
Consider adding mutation for constant<->variable
Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
Add GPU capability?
- Not sure if possible, as binary trees are the real bottleneck.
Performance:
- Use an enum for functions instead of storing them?
- Current most expensive operations:
  - Calculating the loss function - there is duplicate calculations happening.
  - Declaration of the weights array every iteration
Print out speed of equation evaluation over time. Measure time it takes per cycle
Add ability to pass an operator as an anonymous function string. E.g., binary_operators=["g(x, y) = x+y"].
Add error bar capability (thanks Johannes Buchner for suggestion)
Why don't the constants continually change? It should optimize them every time the equation appears.
- Restart the optimizer to help with this.
Add several common unary and binary operators; list these.
Try other initial conditions for optimizer
Make scaling of changes to constant a hyperparameter
Make deletion op join deleted subtree to parent
Update hall of fame every iteration?
- Seems to overfit early if we do this.
Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
- (Added full insertion operator
Add a node at the top of a tree
Insert a node at the top of a subtree
Record very best individual in each population, and return at end.
Write our own tree copy operation; deepcopy() is the slowest operation by far.
Hyperparameter tune
Create a benchmark for accuracy
Add interface for either defining an operation to learn, or loading in arbitrary dataset.
- Could just write out the dataset in julia, or load it.
Create a Python interface
Explicit constant optimization on hall-of-fame
- Create method to find and return all constants, from left to right
- Create method to find and set all constants, in same order
- Pull up some optimization algorithm and add it. Keep the package small!
Create a benchmark for speed
Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
Record hall of fame
Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
Test performance of reduced precision integers
- No effect
Create struct to pass through all hyperparameters, instead of treating as constants
- Make sure doesn't affect performance
Rename package to avoid trademark issues
- PySR?
Put on PyPI