# PySR.jl [![DOI](https://zenodo.org/badge/295391759.svg)](https://zenodo.org/badge/latestdoi/295391759) [![PyPI version](https://badge.fury.io/py/pysr.svg)](https://badge.fury.io/py/pysr) [![Build Status](https://travis-ci.com/MilesCranmer/PySR.svg?branch=master)](https://travis-ci.com/MilesCranmer/PySR) **Symbolic regression built on Julia, and interfaced by Python. Uses regularized evolution, simulated annealing, and gradient-free optimization.** Symbolic regression is a very interpretable machine learning algorithm for low-dimensional problems: these tools search equation space to find algebraic relations that approximate a dataset. One can also extend these approaches to higher-dimensional spaces by using a neural network as proxy, as explained in https://arxiv.org/abs/2006.11287, where we apply it to N-body problems. Here, one essentially uses symbolic regression to convert a neural net to an analytic equation. Thus, these tools simultaneously present an explicit and powerful way to interpret deep models. *Backstory:* Previously, we have used [eureqa](https://www.creativemachineslab.com/eureqa.html), which is a very efficient and user-friendly tool. However, eureqa is GUI-only, doesn't allow for user-defined operators, has no distributed capabilities, and has become proprietary (and recently been merged into an online service). Thus, the goal of this package is to have an open-source symbolic regression tool as efficient as eureqa, while also exposing a configurable python interface. # Installation Install Julia - see [downloads](https://julialang.org/downloads/), and then instructions for [mac](https://julialang.org/downloads/platform/#macos) and [linux](https://julialang.org/downloads/platform/#linux_and_freebsd). Then, at the command line, install the `Optim` and `SpecialFunctions` packages via: ```bash julia -e 'import Pkg; Pkg.add("Optim"); Pkg.add("SpecialFunctions")' ``` For python, you need to have Python 3, numpy, sympy, and pandas installed. You can install this package from PyPI with: ```bash pip install pysr ``` # Quickstart ```python import numpy as np from pysr import pysr # Dataset X = 2*np.random.randn(100, 5) y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2 # Learn equations equations = pysr(X, y, niterations=5, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"]) ... print(equations) ``` which gives: ``` Complexity MSE Equation 0 5 1.947431 plus(-1.7420927, mult(x0, x0)) 1 8 0.486858 plus(-1.8710494, plus(cos(x3), mult(x0, x0))) 2 11 0.000000 plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3))) ``` The newest version of PySR also returns three additional columns: - `score` - a metric akin to Occam's razor; you should use this to help select the "true" equation. - `sympy_format` - sympy equation. - `lambda_format` - a lambda function for that equation, that you can pass values through. ### Custom operators One can define custom operators in Julia by passing a string: ```python equations = pysr.pysr(X, y, niterations=100, binary_operators=["mult", "plus", "special(x, y) = x^2 + y"], extra_sympy_mappings={'special': lambda x, y: x**2 + y}, unary_operators=["cos"]) ``` Now, the symbolic regression code can search using this `special` function that squares its left argument and adds it to its right. Make sure all passed functions are valid Julia code, and take one (unary) or two (binary) float32 scalars as input, and output a float32. Operators are automatically vectorized. We also define `extra_sympy_mappings`, so that the SymPy code can understand the output equation from Julia, when constructing a useable function. This step is optional, but is necessary for the `lambda_format` to work. One can also edit `operators.jl`. See below for more options. ### Weighted data Here, we assign weights to each row of data using inverse uncertainty squared. We also use 10 processes instead of the usual 4, which creates more populations (one population per thread). ```python sigma = ... weights = 1/sigma**2 equations = pysr.pysr(X, y, weights=weights, procs=10) ``` # Operators All Base julia operators that take 1 or 2 float32 as input, and output a float32 as output, are available. A selection of these and other valid operators are stated below. You can also define your own in `operators.jl`, and pass the function name as a string. Your operator should work with the entire real line (you can use abs(x) - see `logm`); otherwise the search code will be slowed down with domain errors. **Binary** `plus`, `mult`, `pow`, `div`, `greater`, `mod`, `beta`, `logical_or`, `logical_and` **Unary** `neg`, `exp`, `abs`, `logm` (=log(abs(x) + 1e-8)), `logm10` (=log10(abs(x) + 1e-8)), `logm2` (=log2(abs(x) + 1e-8)), `sqrtm` (=sqrt(abs(x))) `log1p`, `sin`, `cos`, `tan`, `sinh`, `cosh`, `tanh`, `asin`, `acos`, `atan`, `asinh`, `acosh`, `atanh`, `erf`, `erfc`, `gamma`, `relu`, `round`, `floor`, `ceil`, `round`, `sign`. # Full API What follows is the API reference for running the numpy interface. You likely don't need to tune the hyperparameters yourself, but if you would like, you can use `hyperparamopt.py` as an example. However, you should adjust `procs`, `niterations`, `binary_operators`, `unary_operators`, and `maxsize` to your requirements. The program will output a pandas DataFrame containing the equations, mean square error, and complexity. It will also dump to a csv at the end of every iteration, which is `hall_of_fame.csv` by default. It also prints the equations to stdout. ```python pysr(X=None, y=None, weights=None, procs=4, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20) ``` Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i. **Arguments**: - `X`: np.ndarray, 2D array. Rows are examples, columns are features. - `y`: np.ndarray, 1D array. Rows are examples. - `weights`: np.ndarray, 1D array. Same shape as `y`. Optional weighted sum (e.g., 1/error^2). - `procs`: int, Number of processes running (=number of populations running). - `niterations`: int, Number of iterations of the algorithm to run. The best equations are printed, and migrate between populations, at the end of each. - `ncyclesperiteration`: int, Number of total mutations to run, per 10 samples of the population, per iteration. - `binary_operators`: list, List of strings giving the binary operators in Julia's Base, or in `operator.jl`. - `unary_operators`: list, Same but for operators taking a single `Float32`. - `alpha`: float, Initial temperature. - `annealing`: bool, Whether to use annealing. You should (and it is default). - `fractionReplaced`: float, How much of population to replace with migrating equations from other populations. - `fractionReplacedHof`: float, How much of population to replace with migrating equations from hall of fame. - `npop`: int, Number of individuals in each population - `parsimony`: float, Multiplicative factor for how much to punish complexity. - `migration`: bool, Whether to migrate. - `hofMigration`: bool, Whether to have the hall of fame migrate. - `shouldOptimizeConstants`: bool, Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration. - `topn`: int, How many top individuals migrate from each population. - `nrestarts`: int, Number of times to restart the constant optimizer - `perturbationFactor`: float, Constants are perturbed by a max factor of (perturbationFactor\*T + 1). Either multiplied by this or divided by this. - `weightAddNode`: float, Relative likelihood for mutation to add a node - `weightInsertNode`: float, Relative likelihood for mutation to insert a node - `weightDeleteNode`: float, Relative likelihood for mutation to delete a node - `weightDoNothing`: float, Relative likelihood for mutation to leave the individual - `weightMutateConstant`: float, Relative likelihood for mutation to change the constant slightly in a random direction. - `weightMutateOperator`: float, Relative likelihood for mutation to swap an operator. - `weightRandomize`: float, Relative likelihood for mutation to completely delete and then randomly generate the equation - `weightSimplify`: float, Relative likelihood for mutation to simplify constant parts by evaluation - `timeout`: float, Time in seconds to timeout search - `equation_file`: str, Where to save the files (.csv separated by |) - `test`: str, What test to run, if X,y not passed. - `maxsize`: int, Max size of an equation. **Returns**: pd.DataFrame, Results dataframe, giving complexity, MSE, and equations (as strings). # TODO - [x] Async threading, and have a server of equations. So that threads aren't waiting for others to finish. - [x] Print out speed of equation evaluation over time. Measure time it takes per cycle - [x] Add ability to pass an operator as an anonymous function string. E.g., `binary_operators=["g(x, y) = x+y"]`. - [x] Add error bar capability (thanks Johannes Buchner for suggestion) - [x] Why don't the constants continually change? It should optimize them every time the equation appears. - Restart the optimizer to help with this. - [x] Add several common unary and binary operators; list these. - [x] Try other initial conditions for optimizer - [x] Make scaling of changes to constant a hyperparameter - [x] Make deletion op join deleted subtree to parent - [x] Update hall of fame every iteration? - Seems to overfit early if we do this. - [x] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...)) - (Added full insertion operator - [x] Add a node at the top of a tree - [x] Insert a node at the top of a subtree - [x] Record very best individual in each population, and return at end. - [x] Write our own tree copy operation; deepcopy() is the slowest operation by far. - [x] Hyperparameter tune - [x] Create a benchmark for accuracy - [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset. - Could just write out the dataset in julia, or load it. - [x] Create a Python interface - [x] Explicit constant optimization on hall-of-fame - Create method to find and return all constants, from left to right - Create method to find and set all constants, in same order - Pull up some optimization algorithm and add it. Keep the package small! - [x] Create a benchmark for speed - [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes? - [x] Record hall of fame - [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests - [x] Test performance of reduced precision integers - No effect - [x] Create struct to pass through all hyperparameters, instead of treating as constants - Make sure doesn't affect performance - [x] Rename package to avoid trademark issues - PySR? - [x] Put on PyPI - [x] Treat baseline as a solution. - [x] Print score alongside MSE: \delta \log(MSE)/\delta \log(complexity) - [x] Calculating the loss function - there is duplicate calculations happening. - [x] Declaration of the weights array every iteration - [x] Sympy evaluation - [x] Threaded recursion - [x] Test suite - [x] Performance: - Use an enum for functions instead of storing them? - Gets ~40% speedup on small test. - [ ] Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core. - Ongoing in cluster branch - [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there. - [ ] Dump scores alongside MSE to .csv (and return with Pandas). - [ ] Consider returning only the equation of interest; rather than all equations. - [ ] Use @fastmath - [ ] Refresh screen rather than dumping to stdout? - [ ] Add ability to save state from python - [ ] Calculate feature importances based on features we've already seen, then weight those features up in all random generations. - [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features. - Store feature importances of future, and periodically update it. - [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html - [ ] Experiment with freezing parts of model; then we only append/delete at end of tree. - [ ] Sympy printing - [ ] Consider adding mutation for constant<->variable - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation? - [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations - [ ] Add GPU capability? - Not sure if possible, as binary trees are the real bottleneck. - Could generate on CPU, evaluate score on GPU? - [ ] Idea: use gradient of equation with respect to each operator (perhaps simply add to each operator) to tell which part is the most "sensitive" to changes. Then, perhaps insert/delete/mutate on that part of the tree? - [ ] For hierarchical idea: after running some number of iterations, do a search for "most common pattern". Then, turn that subtree into its own operator. - [ ] Additional degree operators? - [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below? - [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these. - [ ] Can we cache calculations, or does the compiler do that? E.g., I should only have to run exp(x0) once; after that it should be read from memory. - Maybe I could store the result of calculations in a tree (or an index to a massive array that does this). And only when something in the subtree updates, does the rest of the tree update! - [ ] Try Memoize.jl instead of manually caching. - [ ] Try threading over population. Do random sort, compute mutation for each, then replace 10% oldest. - [ ] Call function to read from csv after running - [ ] Add function to plot equations - [ ] Sort this todo list by priority - [ ] Consider printing output sorted by score, not by complexity. - [ ] Performance: try inling things? - [ ] Multi targets (vector ops)