--- title: "Parameter selection with CroptimizR" output: rmarkdown::html_vignette author: - name: "Samuel Buis" affiliation: "INRAE - EMMAH" date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{Parameter selection with CroptimizR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: eval_auto_vignette: TRUE # automatic execution of the vignette eval_manual_vignette: FALSE # manual execution of the vignette (e.g. to create results graphs included in it) eval_auto_test: FALSE # automatic test execution of the vignette, SET IT TO TRUE IF manual_vignette IS SET TO TRUE path_to_JavaStics: "D:/Home/sbuis/Documents/OUTILS-INFORMATIQUE/STICS/JavaSTICS-1.41-stics-9.0" result_path: "D:/Home/sbuis/Documents/GitHub/CroptimizR/vignettes" --- ```{r setup, eval=TRUE, include=FALSE} # Global options path_to_JavaStics <- params$path_to_JavaStics ``` ```{r setup_install_per_package, eval=TRUE, include=FALSE} # Install and load the needed libraries # This one is adapted for manual or test cases (one can first install the # version of the libraries we want to test) library(SticsOnR) library(SticsRFiles) library(CroptimizR) library(CroPlotR) library(ggplot2) library(gridExtra) library(dplyr) library(tidyr) ``` ## Introduction This vignette illustrates a **general workflow for parameter estimation and selection** in crop models using CroptimizR and CroPlotR. The workflow presented here implements the parameter selection algorithm described in [Wallach et al. (2023)](https://link.springer.com/article/10.1007/s13593-023-00900-0), which provides a detailed methodology for stepwise parameter estimation. This parameter estimation and selection algorithm proceeds through a sequence of successive steps. Its core idea is to distinguish between major parameters, which are systematically estimated in order to reduce bias, and candidate parameters, whose estimation is optional. Major parameters are chosen for their strong and consistent influence on model outputs across all environments. A good candidate for this category typically affects simulated values in all environments. Candidate parameters are those expected to explain part of the remaining variability between environments, but for which it is not always clear whether estimating them will improve model performance. In the algorithm, major parameters are estimated first using an Ordinary Least Square (OLS) criterion. Candidate parameters are then added one by one to the set of estimated parameters and retained only if they lead to an improvement in an information criterion value (e.g., Akaike or Bayesian Information Criterion). Candidate parameters that are not selected are kept fixed at their default values. Candidate parameters should be ordered, as far as possible, according to decreasing expected importance. While there is no strict limit on the number of candidate parameters, increasing this number directly increases the computational cost of the procedure. Using an information criterion helps the algorithm limit overfitting, because it is calculated as the sum of a term measuring the fit of simulated values to observations and a penalty that increases with the number of estimated parameters. ## Study Case The crop model input data used in this example comes from a maize crop experiment (see description in [Wallach et al., 2011](https://www.sciencedirect.com/science/article/pii/S1364815210002574)). To illustrate the calibration procedure, julian days of two phenological stages were simulated from this dataset on 8 different environments (called situations in CroptimizR language) and then used as observations, after adding some random errors. The STICS crop model is used in this example. Initialization steps required by the use of this model (definition of the `model_options` argument of the `estim_param` function) are hidden in this example for sake of brevity. They are detailed in the vignette [Parameter estimation with CroptimizR: a simple case using the STICS crop Model](https://SticsRPacks.github.io/CroptimizR/articles/Parameter_estimation_simple_case.html). Note however that CroptimizR can be applied to any crop model, provided that a suitable wrapper is available. Guidelines and examples for implementing wrappers are provided in the [Designing a model wrapper vignette](https://sticsrpacks.github.io/CroptimizR/articles/Designing_a_model_wrapper.html). Wrappers already exist for several models. For more details, see the [Get started page](https://sticsrpacks.github.io/CroptimizR/articles/CroptimizR.html) or contact the authors. ## Plotting the observations ```{r setup_initializations, eval=params$eval_auto_test, message=FALSE, results=FALSE, warning=FALSE, include=FALSE} # DEFINE THE PATH TO YOUR LOCALLY INSTALLED VERSION OF JAVASTICS javastics_path <- path_to_JavaStics # Download the example USMs and define the path to the JavaStics workspace # (JavaStics XML input files): data_dir <- file.path(SticsRFiles::download_data( example_dirs = "study_case_1", stics_version = "V9.0" )) javastics_workspace_path <- file.path(data_dir, "XmlFiles") model_wrapper <- stics_wrapper ``` ```{r gen_dirs, eval=params$eval_auto_test, results='hide', message=FALSE, warning=FALSE, include=FALSE} stics_inputs_path <- file.path(data_dir, "TxtFiles") dir.create(stics_inputs_path, showWarnings = FALSE) gen_usms_xml2txt( javastics = javastics_path, workspace = javastics_workspace_path, out_dir = stics_inputs_path, verbose = TRUE ) ``` ```{r eval=params$eval_auto_test, results='hide', message=FALSE, warning=FALSE, include=FALSE} # Set the model options (see '? stics_wrapper_options' for details) model_options <- stics_wrapper_options( javastics = javastics_path, workspace = stics_inputs_path, parallel = TRUE, cores = 4 ) # Run the model on all situations found in stics_inputs_path sim_res <- stics_wrapper( model_options = model_options, param_values = c( stlevamf = 350, stamflax = 650, tdmin = 7, tdmax = 26 ), var = c("iamfs", "ilaxs") ) set.seed(1234) obs_list <- lapply(sim_res$sim_list, function(x) { tmp <- x[nrow(x), c("Date", "iamfs", "ilaxs")] row.names(tmp) <- NULL tmp[, c("iamfs", "ilaxs")] <- tmp[, c("iamfs", "ilaxs")] + round(rnorm(2, mean = 0, sd = 3)) return(tmp) }) ``` ```{r eval=params$eval_manual_vignette, results='hide', message=FALSE, warning=FALSE, include=FALSE} save(obs_list, file = file.path(params$result_path, "obs_list.Rdata")) ``` ```{r eval=params$eval_auto_vignette, results='hide', message=FALSE, warning=FALSE, include=FALSE} load(file.path("ResultsParameterSelection", "obs_list.Rdata")) ``` The observations are provided here in the `obs_list` object, in the format required by CroptimizR, i.e. as a named list of data frames (or tibble), each element being named after the corresponding situation (see the [Get started](https://sticsrpacks.github.io/CroptimizR/articles/CroptimizR.html) vignette of CroptimizR for details). In this example, the variables corresponding to the two observed stages considered are called `iamfs` and `ilaxs`. They correspond to the julian days (from the beginning of the sowing year) of "end juvenile" and "maximum LAI" stages as defined in the STICS crop model. ```{r eval=params$eval_auto_vignette, message=FALSE, warning=FALSE} # Show the observations for the first two situations obs_list[1:2] ``` The 8 situations are called here `bo96iN+`, `bou00t1`, `bou00t3`, `bou99t1`, `bou99t3`, `lu96iN+`, `lu96iN6` and `lu97iN+`. Let's plot the values of these observations: ```{r eval=params$eval_auto_vignette, message=FALSE, warning=FALSE, fig.width=4, fig.height=3} obs_df <- dplyr::bind_rows(obs_list) %>% tidyr::pivot_longer( cols = c("iamfs", "ilaxs"), names_to = "stage" ) ggplot(obs_df, aes(x = stage, y = value)) + geom_boxplot() ``` The "end juvenile" stage occurred around day 180 after sowing, while the "maximum LAI" stage occurred around day 230 after sowing. ## Setting information on the parameters to estimate The whole list of parameters to estimate and their lower and upper bounds (noted resp. `lb` and `ub`) must be provided in the `param_info` argument of the `estim_param` function. In this example, we will consider 5 parameters of the STICS crop model: `stlevamf`, `stamflax`, `tdmin`, `stressdev` and `tdmax`. ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE} param_info <- list( lb = c(stlevamf = 100, stamflax = 300, tdmin = 4, stressdev = 0, tdmax = 25), ub = c(stlevamf = 500, stamflax = 800, tdmin = 8, stressdev = 1, tdmax = 32) ) ``` Lower or upper bounds can be set to `-Inf` or `Inf`, respectively. In such cases, initial values must be provided for the corresponding parameters (see `?estim_param` for further details). Candidate parameters must be specified through the `candidate_param` argument of the `estim_param` function. This vector must be ordered from those thought to be the most influential on the considered variable, to the least influential. This ordering is important, as it directly determines the sequence in which candidate parameters are considered and tested during the stepwise parameter selection procedure. In this example, 3 parameters were considered as candidates. ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE} candidate_param <- c("tdmin", "stressdev", "tdmax") ``` Parameters that are included in `param_info` but not listed in `candidate_param` are treated as major parameters and are therefore systematically estimated at all steps of the procedure. In this example, `stlevamf` and `stamflax` belong to this category and are estimated regardless of the selected candidate parameters. ## Choosing the default parameters values Default values for the non-estimated parameters and for the candidate parameters can be defined in the `forced_param_values` argument of the `estim_param` function. ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE} forced_param_values <- c(tdmin = 5.0, stressdev = 0.2, tdmax = 30.0) ``` These values will be used at each step for which these parameters are not estimated. ## Setting optimization options During the first step of the algorithm, all major parameters are estimated simultaneously, starting from randomly selected initial values. For subsequent steps, the initial values of parameters that have already been estimated are fixed at their previously estimated values, and only the initial value of the newly added candidate parameter is randomly drawn. As a consequence, Wallach et al. (2023) recommend performing a larger number of optimization repetitions for the first step (typically 10), and fewer repetitions for the subsequent steps (typically 5). This strategy is implemented as follows: ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE} optim_options <- list(nb_rep = c(10, 5)) ``` ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE, include=FALSE} optim_options$xtol_rel <- 1e-2 optim_options$ftol_rel <- 1e-3 optim_options$ranseed <- 1234 ``` ## Running the optimization A single call to the `estim_param` function is sufficient to run the complete parameter estimation and selection procedure. Compared with the simple example presented in https://SticsRPacks.github.io/CroptimizR/articles/Parameter_estimation_simple_case.html , it is important here to explicitly define the `crit_function` argument. The stepwise parameter selection procedure illustrated in this vignette is currently implemented for an Ordinary Least Squares objective function, provided by `crit_ols`, which is not the default criterion used by `estim_param.` Note also the use of the `forced_param_values` and `candidate_param` arguments, which respectively control the values of non-estimated parameters and the sequential inclusion of candidate parameters during the selection process. By default, `estim_param` uses the corrected Akaike Information Criterion (AICc) for parameter selection. In this example, the Bayesian Information Criterion (BIC) is used instead, as specified through the `info_crit_func` argument. ```{r eval=params$eval_auto_test, message=FALSE, warning=FALSE} res <- estim_param( obs_list = obs_list, model_function = model_wrapper, model_options = model_options, crit_function = crit_ols, optim_options = optim_options, param_info = param_info, forced_param_values = forced_param_values, candidate_param = candidate_param, info_crit_func = list(CroptimizR::BIC), out_dir = data_dir ) ``` At the end of the execution, some information such as the selected step number, the list of selected parameters, their estimated values and some stats about the execution time are printed in the R console: ```{r eval=FALSE, echo=TRUE} ... # -- Summary of the Parameter Selection procedure results # # Selected step: 2 # Selected parameters: stlevamf,stamflax,tdmin # Estimated value for stlevamf : 321.18 # Estimated value for stamflax : 589.73 # Estimated value for tdmin : 7.96 # # A table summarizing the results obtained at the different steps is stored in C:/Users/sbuis/AppData/Local/Temp/RtmpwRzZkP/data-master/study_case_1/V9.0/param_selection_steps.csv # Graphs and detailed results obtained for the different steps can be found in C:/Users/sbuis/AppData/Local/Temp/RtmpwRzZkP/data-master/study_case_1/V9.0/param_select_step# folders. # # Average time for the model to simulate all required situations: 4.1 sec elapsed # Total number of criterion evaluation: 806 # Total time of model simulations: 3293 sec elapsed # Total time of parameter estimation process: 3314 sec elapsed # ---------------------- ``` In this case, the object returned by the `estim_param` function contains the results corresponding to the selected step, as well as a data frame named `param_selection_steps` that provides detailed information for all steps of the parameter selection procedure. For each minimization step, `param_selection_steps` reports the set of estimated parameters, their initial and final values, and the associated values of the sum of squares and information criteria. The selected step is identified in the last column of this data frame. For convenience, the same information is also written to a CSV file (`param_selection_steps.csv`), saved either in the current working directory or in the directory specified by the `out_dir` argument of `estim_param.` ```{r eval=params$eval_auto_vignette, message=FALSE, warning=FALSE} df <- read.csv("ResultsParameterSelection/param_selection_steps.csv", sep = ";") knitr::kable(df) ``` In this example, the selected parameter set consists of `stlevamf`, `stamflax`, and `tdmin`, corresponding to the lowest value of the BIC among all tested steps. The complete object returned by the estim_param function is also saved to disk in the file `optim_results.Rdata`. Standard result files and diagnostic plots produced by `estim_param` for each step of the procedure are stored in the `param_select_step#` folders. ## Generating diagnostics using CroPlotR To generate diagnostic plots, the model wrapper is first run using the estimated values of the selected parameters, while all remaining parameters are set to their default values. These values are passed to the model through the `param_values` argument. ```{r eval=params$eval_auto_vignette, results='hide', message=FALSE, warning=FALSE, include=FALSE} load(file.path("ResultsParameterSelection", "optim_results.Rdata")) ``` ```{r eval=params$eval_manual_vignette, message=FALSE, warning=FALSE} param_values <- c(res$final_values, res$forced_param_values) sim_after_optim <- model_wrapper( param_values = param_values, model_options = model_options, var = c("iamfs", "ilaxs") ) ``` Once the simulations have been performed, simulated and observed values can be compared graphically using the plotting functions provided by CroPlotR. In particular, scatter plots of simulated versus observed values can be produced as follows: ```{r eval=FALSE, warning=FALSE, message=FALSE} plot(sim_after_optim$sim_list, obs = obs_list, type = "scatter") ``` ```{r eval=params$eval_manual_vignette, warning=FALSE, message=FALSE, echo=FALSE} p <- plot(sim_after_optim$sim_list, obs = obs_list, type = "scatter") save_plot_png(p, data_dir) ``` ```{r eval=params$eval_auto_vignette, echo=FALSE, message=FALSE, warning=FALSE, out.width = '60%'} knitr::include_graphics("ResultsParameterSelection/all_situations.png") ``` In addition to graphical diagnostics, quantitative performance metrics can be computed for each variable. In this example, the Mean Squared Error (MSE) and its decomposition are calculated using the `summary` function from CroPlotR. ```{r eval=FALSE, warning=FALSE, message=FALSE} summary(sim_after_optim$sim_list, obs = obs_list, stats = c("MSE", "Bias2", "SDSD", "LCS") ) ``` ```{r eval=params$eval_manual_vignette, warning=FALSE, message=FALSE, echo=FALSE} stats <- summary(sim_after_optim$sim_list, obs = obs_list, stats = c("MSE", "Bias2", "SDSD", "LCS") ) save(stats, file = file.path(params$result_path, "stats.Rdata")) ``` ```{r eval=params$eval_auto_vignette, echo=FALSE, message=FALSE, warning=FALSE} load("ResultsParameterSelection/stats.Rdata") print(stats) ``` ```{r move_results, eval=params$eval_manual_vignette, include=FALSE} # Move the files produced since the temp. folder is removed after Rmd execution file.copy(file.path(data_dir, "param_selection_steps.csv"), params$result_path, overwrite = TRUE ) file.copy(file.path(data_dir, "param_selection_steps.Rdata"), params$result_path, overwrite = TRUE ) file.copy(file.path(data_dir, "optim_results.Rdata"), params$result_path, overwrite = TRUE ) file.copy(file.path(data_dir, "all_situations.png"), params$result_path, overwrite = TRUE ) dir.create(file.path(params$result_path, "results_all_steps")) file.copy( from = file.path(data_dir, "results_all_steps"), to = file.path(params$result_path, "results_all_steps"), recursive = TRUE, overwrite = TRUE ) ```