# Eureqa.jl

Symbolic regression built on Julia, with a Python interface.
Uses regularized evolution and simulated annealing.

## Installation

Install [Julia](https://julialang.org/downloads/). Then, at the command line,
install the `Optim` package via: `julia -e 'import Pkg; Pkg.add("Optim")'`.
For Python, you need Python 3, numpy, and pandas installed.

## Running:

What follows is the API reference for running the numpy interface.
Note that nearly all parameters here have been tuned with ~1000 trials
over several example equations. However, you should adjust `threads`,
`niterations`, `binary_operators`, and `unary_operators` to your requirements.

The program outputs a pandas DataFrame containing the equations,
mean square error, and complexity. It also dumps the results to a csv file
at the end of every iteration (`hall_of_fame.csv` by default)
and prints the equations to stdout.

You can add more operators in `operators.jl`, or use the default Julia ones.
Make sure all operators are defined for scalar `Float32`.
Then just specify the operator names in the `binary_operators` and
`unary_operators` arguments of your call.

You can also change the dataset learned on by passing in `X` and `y` as
numpy arrays to `eureqa(...)`.

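
As a rough illustration of passing your own dataset, the sketch below builds a small synthetic `X` and `y` and wraps the call in a helper. The import path `from eureqa import eureqa`, the helper name `run_search`, and the target function are assumptions made for this example, so adjust them to match your checkout.

```python
import numpy as np

# Synthetic dataset: 100 examples (rows) with 5 features (columns).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Illustrative target that depends only on features 0 and 3.
y = 2.0 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 2.0


def run_search(X, y):
    """Run the search on (X, y) and return the results DataFrame."""
    # ASSUMPTION: the Python interface is importable as `eureqa`.
    # The import lives inside the function so the dataset code above
    # runs even when the package is not on the path.
    from eureqa import eureqa

    return eureqa(
        X=X,
        y=y,
        threads=4,
        niterations=5,
        binary_operators=["plus", "mult"],
        unary_operators=["cos"],
    )
```

Calling `run_search(X, y)` should then return the DataFrame described above and write the equations file as a side effect.
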
```python
eureqa(X=None, y=None, threads=4, niterations=20,
       ncyclesperiteration=int(default_ncyclesperiteration),
       binary_operators=["plus", "mult"],
       unary_operators=["cos", "exp", "sin"],
       alpha=default_alpha, annealing=True,
       fractionReplaced=default_fractionReplaced,
       fractionReplacedHof=default_fractionReplacedHof,
       npop=int(default_npop), parsimony=default_parsimony,
       migration=True, hofMigration=True, shouldOptimizeConstants=True,
       topn=int(default_topn),
       weightAddNode=default_weightAddNode,
       weightDeleteNode=default_weightDeleteNode,
       weightDoNothing=default_weightDoNothing,
       weightMutateConstant=default_weightMutateConstant,
       weightMutateOperator=default_weightMutateOperator,
       weightRandomize=default_weightRandomize,
       weightSimplify=default_weightSimplify,
       timeout=None, equation_file='hall_of_fame.csv',
       test='simple1', maxsize=20)
```

Run symbolic regression to fit `f(X[i, :]) ~ y[i]` for all `i`.

**Arguments**:

- `X`: np.ndarray, 2D array. Rows are examples, columns are features.
- `y`: np.ndarray, 1D array. Rows are examples.
- `threads`: int, Number of threads (= number of populations running). You can have more threads than cores; it actually makes the search more efficient.
- `niterations`: int, Number of iterations of the algorithm to run. At the end of each iteration, the best equations are printed and migrate between populations.
- `ncyclesperiteration`: int, Number of total mutations to run, per 10 samples of the population, per iteration.
- `binary_operators`: list, List of strings giving the binary operators in Julia's Base, or in `operators.jl`.
- `unary_operators`: list, Same but for operators taking a single `Float32`.
- `alpha`: float, Initial temperature.
- `annealing`: bool, Whether to use annealing. Recommended, and enabled by default.
- `fractionReplaced`: float, How much of the population to replace with migrating equations from other populations.
- `fractionReplacedHof`: float, How much of the population to replace with migrating equations from the hall of fame.
- `npop`: int, Number of individuals in each population.
- `parsimony`: float, Multiplicative factor for how much to punish complexity.
- `migration`: bool, Whether to migrate.
- `hofMigration`: bool, Whether to have the hall of fame migrate.
- `shouldOptimizeConstants`: bool, Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration.
- `topn`: int, How many top individuals migrate from each population.
- `weightAddNode`: float, Relative likelihood for mutation to add a node.
- `weightDeleteNode`: float, Relative likelihood for mutation to delete a node.
- `weightDoNothing`: float, Relative likelihood for mutation to leave the individual unchanged.
- `weightMutateConstant`: float, Relative likelihood for mutation to change the constant slightly in a random direction.
- `weightMutateOperator`: float, Relative likelihood for mutation to swap an operator.
- `weightRandomize`: float, Relative likelihood for mutation to completely delete and then randomly generate the equation.
- `weightSimplify`: float, Relative likelihood for mutation to simplify constant parts by evaluation.
- `timeout`: float, Time in seconds after which the search times out.
- `equation_file`: str, Where to save the equations (a `.csv` file using `|` as the separator).
- `test`: str, What test to run, if `X` and `y` are not passed.
- `maxsize`: int, Max size of an equation.

**Returns**:

pd.DataFrame, Results dataframe, giving complexity, MSE, and equations (as strings).

# TODO

- [ ] Hyperparameter tune
- [ ] Add mutation for constant<->variable
- [ ] Create a benchmark for accuracy
- [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
- [ ] Performance:
    - [ ] Use an enum for functions instead of storing them?
    - Current most expensive operations:
        - [x] deepcopy() before the mutate, to see whether to accept or not.
            - Seems like it's necessary right now. But still by far the slowest option.
        - [ ] Calculating the loss function - there are duplicate calculations happening.
        - [ ] Declaration of the weights array every iteration
- [x] Add interface for either defining an operation to learn, or loading in an arbitrary dataset.
    - Could just write out the dataset in Julia, or load it.
- [x] Create a Python interface
- [x] Explicit constant optimization on hall-of-fame
    - Create method to find and return all constants, from left to right
    - Create method to find and set all constants, in same order
    - Pull up some optimization algorithm and add it. Keep the package small!
- [x] Create a benchmark for speed
- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- [x] Record hall of fame
- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- [x] Test performance of reduced precision integers
    - No effect
- [x] Create struct to pass through all hyperparameters, instead of treating as constants
    - Make sure it doesn't affect performance
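
The constant-optimization item above describes two matching tree traversals: one that reads all constants from left to right, and one that writes them back in the same order, so that a numerical optimizer can work on a flat vector. The Python sketch below only illustrates that idea. The real code is written in Julia and may be structured differently, and `Node`, `get_constants`, and `set_constants` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical expression-tree node: constant, variable, or operator."""
    op: Optional[str] = None        # operator name, e.g. "plus" or "cos"
    value: Optional[float] = None   # set only on constant leaves
    feature: Optional[int] = None   # set only on variable leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def get_constants(node: Optional[Node]) -> List[float]:
    """Collect all constants in left-to-right order."""
    if node is None:
        return []
    if node.value is not None:
        return [node.value]
    return get_constants(node.left) + get_constants(node.right)


def set_constants(node: Optional[Node], constants: List[float], start: int = 0) -> int:
    """Write constants back in the same order; return the next unused index."""
    if node is None:
        return start
    if node.value is not None:
        node.value = constants[start]
        return start + 1
    start = set_constants(node.left, constants, start)
    return set_constants(node.right, constants, start)


# Example tree for (2.0 * x0) + cos(3.0)
tree = Node(
    op="plus",
    left=Node(op="mult", left=Node(value=2.0), right=Node(feature=0)),
    right=Node(op="cos", left=Node(value=3.0)),
)
print(get_constants(tree))       # [2.0, 3.0]
set_constants(tree, [5.0, 7.0])  # e.g. values proposed by an optimizer
print(get_constants(tree))       # [5.0, 7.0]
```

Because both traversals visit leaves in the same order, the vector returned by `get_constants` can be handed to an optimizer and the result passed straight back to `set_constants`.
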