-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
169 lines (127 loc) · 4.12 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->
# Table of Contents
- [SymbolicR](#symbolicr)
* [Installation](#installation)
* [Genetic search](#genetic-search)
* [Random search](#random-search)
* [Combinatorial Search](#combinatorial-search)
<!-- TOC end -->
# SymbolicR
[![DOI](https://zenodo.org/badge/830556463.svg)](https://zenodo.org/doi/10.5281/zenodo.12904321)
<!-- badges: start -->
<!-- badges: end -->
Find non-linear formulas that fits your input data.
You can systematically explore and memoize the possible formulas and its cross-validation performance, in an incremental fashon.
Three main interoperable search functions are available:
- `random.search` performs a random exploration,
- `genetic.search` employs a genetic optimization algorithm
- `comb.search` allows the user to provide a data.frame of formulas to be tested
After installation, see tutorials with
```r
browseVignettes('symbolicr')
```
## Installation
You can install the development version of symbolicr like so:
``` r
devtools::install_github('cosbi-research/symbolicr', build_vignettes = TRUE)
```
## Genetic search
This is a minimum viable example on how to use genetic search to find the non-linear relationship:
```{r example}
library(symbolicr)
set.seed(1)
x1<-runif(100, min=2, max=67)
x2<-runif(100, min=0.01, max=0.1)
y <- log10(x1^2*x2) + rnorm(100, 0, 0.001)
X <- data.frame(x1=x1, x2=x2)
results <- genetic.search(
X, y,
n.squares=2,
max.formula.len = 1,
N=2,
K=10,
best.vars.l = list(
c('log.x1')
),
transformations = list(
"log"=function(rdf, x, stats){ log(x) },
"inv"=function(rdf, x, stats){ 1/x }
),
keepBest=T,
cv.norm=F
)
```
We found the correct non-linear formula starting from an initial guess!
We can now get the best formula
```{r best}
results$best
```
And all the formula the genetic algorithm found to be best at each one of the 100 evolution iterations
(last 5 shown for brevity)
```{r evolution}
results$best.iter[seq(length(results$best.iter)-5,length(results$best.iter))]
```
Note that `cv.norm=FALSE` means data is used as-is.
Before running this example we checked for the non-negativeness of `x1^2*x2`.
If you would like to normalize data to avoid scaling issues just use `cv.norm=TRUE`
but in this case, to avoid computing the log of a negative value, we use this updated transformation function
```{r log-std}
# NB: function applied to standardized X values!
# they can be negative
log.std <- function(x, stats){ log(0.1 + abs(stats$min) + x) }
```
The `stats` object contains
- min: the minimum of the column values
- absmin: the minimum of the absolute values of the columns
- absmax: the maximum of the absolute values of the columns
- projzero: -mean/sd of the columns, that is the position of the zero in the original, non-normalized space.
Type `?dataset.min.maxs` in your R console for further informations.
## Random search
This is a minimum viable example on how to use random search to test multiple non-linear relationships, and get a summary of the performances
in a data.frame:
```{r random.example}
random.results <- random.search(
X, y,
n.squares=2,
formula.len = 1,
N=2,
K=10,
transformations = list(
"log"=function(rdf, x, stats){ log(x) },
"inv"=function(rdf, x, stats){ 1/x }
),
cv.norm=F
)
```
You can then inspect results in the resulting data.frame:
```{r random.inspect}
random.results[order(random.results$base.r.squared, decreasing = T), ][seq(5), ]
```
## Combinatorial Search
Search over all non-duplicated combinations of terms taken from an user-supplied list of formulas.
```{r combination.example}
# test the three formula x1, x2 and log(x2)
comb.search(
X, y,
combinations=data.frame(t1=c('x1','x2','log.x2')),
N=2,
K=10,
transformations = list(
"log"=function(rdf, x, stats){ log(x) },
"exp"=function(rdf, x, stats){ exp(x) }
),
cv.norm=F
)
```