HyperPipe

The adaptive HyperPipe code, written by Richard O’Shaughnessy and Atul Kedia, performs adaptive parameter estimation on any observational or simulated data. It generalizes RIFT for applications beyond gravitational-wave data analysis.

HyperPipe adaptively explores regions of parameter space for fully generic, simulation-based inference. The code requires an executable that applies the appropriate physics and calculates the (marginalized) likelihood for a given problem. Given an initial grid and the executable, the code will explore different regions in the parameter space in order to obtain a posterior not skewed by the initial grid guess. The code is highly parallelized for fast parameter estimation. Find a description below of the inputs, outputs, and structure of this pipeline.

Basics

The pipeline performs the inference by iteratively improving upon the posterior distribution. The procedure is as follows:

  1. Evaluate the (marginalized) likelihood on a user-provided grid of parameters.

  2. Compute posterior via Monte Carlo integration and generate two new grids for exploration. One grid is purely based on the posterior generated by the MC integral, and the other “puffs” the first by a user-specified amount.

  3. Repeat the above two steps for the desired number of iterations.
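Schematically, one iteration can be pictured with the Python-style sketch below. The function names are placeholders for the MARG, EOS_POST, and PUFF stages described later in this document, not actual HyperPipe internals.

# Illustrative sketch of the HyperPipe loop; the helper functions are
# placeholders for the MARG, EOS_POST, and PUFF stages described below.
grid = load_initial_grid("grid-0.dat")                  # user-supplied starting grid
for iteration in range(n_iterations):
    lnL, sigma_lnL = evaluate_likelihood(grid)          # MARG: user-provided executable
    posterior_grid = monte_carlo_posterior(grid, lnL)   # EOS_POST: fair draws from the posterior
    puff_grid = puff(posterior_grid)                    # PUFF: covariance-based dithering
    grid = combine(posterior_grid, puff_grid)           # input grid for the next iteration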

The user must provide the following:

  1. An executable that can calculate the likelihood for a set of parameters,

  2. An initial parameter grid in the standard RIFT structure, i.e. a header line # lnL sigma_lnL parameter_names …, with each subsequent row containing 0, 0, followed by the parameter values,

  3. Exploration ranges for each parameter (the minimum and maximum of each).

There is a demo that shows example executables and a Makefile which creates a sample run directory. Note that the user is currently expected (though not required) to use this Makefile as a template for running their executable, parameters, etc.

Create a Run Directory

HyperPipe uses some codes found in the standard RIFT repository. The run directory is constructed by create_eos_posterior_pipeline.py. This script prepares the SUB files containing the appropriate script references and arguments, and constructs the DAG that is submitted to HTCondor. To see the help for this tool, run

$ create_eos_posterior_pipeline --help

or look at the possible options in the source. Some options have default values, but the user should inspect the options to ensure the appropriate choices are made. In the example Makefile, this is done automatically. An example of this command is:

$ create_eos_posterior_pipeline --marg-event-exe-list-file `pwd`/args_marg_eos_exe.txt --marg-event-args-list-file  `pwd`/args_marg_eos.txt --eos-post-args `pwd`/args_eos_post.txt --eos-post-exe `which util_ConstructEOSPosterior.py`  --puff-exe `which util_HyperparameterPuffball.py` --puff-args `pwd`/args_puff.txt --input-grid ${SAMPLE_INPUT_GRID_FOR_GAUSSIAN} --n-samples-per-job 1000 --use-full-submit-paths  --working-dir `pwd` --event-file `pwd`/my_event_A.txt --n-iterations 5 --eos-post-explode-jobs 5

Note there are a few important files that should be specified. You will have to provide an executable that performs the required task of evaluating the likelihoods.

Input Files

  1. Initial grid - a grid specifying the parameters of interest over the desired ranges. The first grid should set the log-likelihood and its error to zero; these are updated in successive iterations. All grids the code produces use the same standard format, for both input and output:

# lnL sigma_lnL <param1> <param2> <param3> ...

If your executable requires an input format that differs from the HyperPipe output, some format conversion is needed. Rather than writing a new executable from scratch, you can write a wrapper script that reads the RIFT-format parameters and translates them into the input format required by your code, as sketched below.
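As an illustration, a minimal wrapper might look like the following sketch. It assumes your code (called my_physics_code here, a placeholder) reads a plain list of parameter points and writes one lnL value per point; in practice the wrapper must also accept the other arguments listed under the Executable item below.

#!/usr/bin/env python
# Hypothetical wrapper: read the RIFT-format grid, hand the parameter columns
# to an external code, and write the results back in the # lnL sigma_lnL ... format.
import argparse
import subprocess
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--using-eos", help="input grid: lnL sigma_lnL param1 param2 ...")
parser.add_argument("--fname-output-integral", help="output file, same RIFT format")
opts = parser.parse_args()

grid = np.loadtxt(opts.using_eos)
params = grid[:, 2:]                                      # strip the lnL and sigma_lnL columns

np.savetxt("my_code_input.txt", params)                   # whatever format your code expects
subprocess.check_call(["my_physics_code", "my_code_input.txt", "my_code_output.txt"])
lnL = np.loadtxt("my_code_output.txt")                    # assumed: one lnL value per point

out = np.column_stack([lnL, np.zeros_like(lnL), params])  # sigma_lnL set to zero here
np.savetxt(opts.fname_output_integral, out, header="lnL sigma_lnL params...")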

The pipeline will help the user construct an initial grid in the required format. The script that does this is util_HyperparameterGrid.py. The user must specify each parameter and its [min, max] range, as well as the total desired number of points and, optionally, an output file name (in the demo these are set in the Makefile). For example:

$ util_HyperparameterGrid.py --random-parameter x --random-parameter-range [-5,-2] --random-parameter y --random-parameter-range [2,5] --random-parameter z --random-parameter-range [2,5] --npts 1000 --fname-out 'gaussian.dat'

This creates a uniform grid of points from which to begin the search for the highest-likelihood parameter values.
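The first few lines of the resulting gaussian.dat would then look something like the following (the parameter values shown are illustrative draws from the requested ranges):

# lnL sigma_lnL x y z
0 0 -3.71 2.48 4.02
0 0 -2.15 4.66 2.93
0 0 -4.38 3.12 3.57
...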

  2. Executable - this executable must take in a text file containing a list of points in parameter space, along with the log-likelihood and its error at each point. It then calculates and updates the likelihood at every point on the grid. The output must be written in the standard format above for the posterior sampling to work.

Two or more executables can also be provided if additional constraints need to be applied for each observation. Example executables for a Gaussian are provided in the demo folder.

IMPORTANT NOTE: The following arguments are all necessary for running this pipeline adaptively. In other words, make your executable accept the same argument names: --using-eos, --outdir, --fname-output-integral, and so on; see the example executable example_gaussian.py.

--fname : dummy argument required by the API
--using-eos : eos file with [lnL, sigma_lnL, lambda1, lambda2, lambda3, lambda4, ...] as the params
--outdir : output directory; a different name can be chosen as desired
--fname-output-integral : name of the output file written by this executable
--eos_start_index : index of the first grid point at which to evaluate the likelihood
--eos_end_index : index at which to stop likelihood evaluation
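A minimal argument-parsing skeleton consistent with this list is sketched below. The unit-Gaussian log-likelihood is only a stand-in for your physics, and the way --outdir and --fname-output-integral are combined here is an assumption; example_gaussian.py in the demo remains the reference implementation.

#!/usr/bin/env python
# Sketch of an executable exposing the argument names HyperPipe expects.
# The Gaussian lnL below is a placeholder; see example_gaussian.py in the demo.
import argparse
import os
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--fname")                      # dummy argument required by the API
parser.add_argument("--using-eos")                  # input grid: lnL sigma_lnL param1 ...
parser.add_argument("--outdir", default=".")
parser.add_argument("--fname-output-integral", default="integral_output.dat")
parser.add_argument("--eos_start_index", type=int, default=0)
parser.add_argument("--eos_end_index", type=int, default=None)
opts = parser.parse_args()

grid = np.loadtxt(opts.using_eos)
chunk = grid[opts.eos_start_index:opts.eos_end_index]    # this job's share of the grid

params = chunk[:, 2:]
lnL = -0.5 * np.sum(params**2, axis=1)              # placeholder: unit-Gaussian likelihood
sigma_lnL = np.zeros_like(lnL)

out = np.column_stack([lnL, sigma_lnL, params])
# Assumes the output lands in outdir/fname; check the demo for the actual convention.
np.savetxt(os.path.join(opts.outdir, opts.fname_output_integral), out,
           header="lnL sigma_lnL " + " ".join("param%d" % k for k in range(params.shape[1])))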

Directory Structure

When create_eos_posterior_pipeline runs, the pipeline will create a directory structure as follows:

long_directory_name_here/
   -> grid-0.dat
   -> local.cache
   -> iteration_0_marg/
   -> iteration_0_post/
   -> iteration_0_con/
   -> iteration_1_marg/
   -> ...
   -> MARG_0.sub
   -> CON.sub
   -> CON_PROD.sub
   -> UNIFY.sub
   -> EOS_POST.sub
   -> JOIN_POST.sub
   -> PUFF.sub
   -> marginalize_hyperparameters.dag

Inside each iteration directory is a logs subdirectory. The iteration directories are initially empty, apart from these log-file locations. The top-level directory contains several *.sub submission scripts, along with the top-level DAG submission script.

Understanding the Stages

MARG.sub

The MARG.sub file contains the call to and arguments for the user-provided executable. This step invokes the executable, which performs the physics calculation for each parameter point and computes the corresponding marginal likelihood against some data (whether that data is observational or fiducial is a physics question for the user to answer). The log-likelihood and its error for each parameter combination are then stored in a file.

MARG_PUFF.sub

This performs the same function as MARG.sub, but for the parameters in the puff grid, which is generated from the second iteration onwards.

CON.sub and CON_PROD.sub

CON.sub runs first. It CONsolidates single events via con_marg.sh, joining together all the data files from a single run. This step matches duplicate points and averages their likelihoods.

CON_PROD.sub runs second. It joins multiple events together into a single overall results file via util_HyperCombine.py, merging MARG entries for the same physical system. By default, it combines samples by averaging, but the option in con_prod.sh may be changed to sum or product.

UNIFY.sub

This step unifies the results from all previous iterations with the currently available ones, stored in *.net_marg files. It grows the grid and keeps track of the likelihood results cumulatively over successive iterations; this is the one file where all the likelihood calculations end up being stored.

EOS_POST.sub

EOS_POST.sub contains the call to and arguments for util_ConstructEOSPosterior.py. During this step, the generic-format log-likelihood data is loaded and a Monte Carlo integration of likelihood * prior (i.e., the posterior) is performed, producing a set of weighted samples. Samples are then fairly drawn from this posterior distribution to generate a new grid, which becomes the input for the likelihood calculations in the succeeding iteration.
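Conceptually, the step resembles the importance-sampling sketch below. This is an illustration of the idea only, not the actual util_ConstructEOSPosterior.py implementation (which also fits the sparse lnL evaluations before integrating); the function and argument names are placeholders.

# Conceptual sketch of the posterior step, not the util_ConstructEOSPosterior.py
# code: weight prior draws by the (fitted) likelihood, then resample fairly.
import numpy as np

rng = np.random.default_rng()

def posterior_resample(lnL_of, prior_draws, n_out):
    """lnL_of: fitted marginal log-likelihood; prior_draws: (n, d) array of prior samples."""
    lnL = np.array([lnL_of(p) for p in prior_draws])
    weights = np.exp(lnL - lnL.max())                  # importance weights, numerically stabilized
    weights /= weights.sum()
    idx = rng.choice(len(prior_draws), size=n_out, p=weights)
    return prior_draws[idx]                            # fair draws from the posterior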

PUFF.sub

This contains the call to and arguments for util_HyperparameterPuffball.py. This dithering step ensures the parameter space is well covered: it assesses the parameter covariance and adds random numbers drawn from that covariance to the parameter points. The grid constructed by this “puff” step is added to the posterior grid.

This step pushes the search into new regions of parameter space and is not a fair draw from the posterior; it widens the search in parameters. Whereas util_ConstructEOSPosterior.py homes in directly on posterior maxima, util_HyperparameterPuffball.py complements it so that edge details and any secondary maxima away from the main posterior peak are not lost. This produces a new grid_puff, which is used alongside the grid generated by util_ConstructEOSPosterior.py for the next iteration’s marginalization.
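The dithering idea can be illustrated in a few lines of Python. This is a conceptual sketch only, not the util_HyperparameterPuffball.py source, and the scale argument stands in for the user-specified puff amount.

# Conceptual sketch of the "puff" step: scatter the current samples using
# their own empirical covariance so the next grid covers a wider region.
import numpy as np

rng = np.random.default_rng()

def puff(samples, scale=1.0):
    """samples: (n, d) array of parameter points; scale: user-specified puff factor."""
    cov = np.cov(samples, rowvar=False)                # empirical parameter covariance
    jitter = rng.multivariate_normal(np.zeros(samples.shape[1]),
                                     scale**2 * cov, size=len(samples))
    return samples + jitter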

Submitting Workflow

As with the standard RIFT pipeline, this workflow only runs within an HTCondor scheduling environment. To submit the workflow, use

$ condor_submit_dag marginalize_hyperparameters.dag

The workflow loosely consists of two parts: worker MARG jobs, which evaluate the marginalized likelihood; and fitting/posterior jobs, which fit the marginalized likelihood and estimate the posterior distribution. Other nodes help group the output of individual jobs and iterations together.

As your run proceeds, files will begin to appear in your directory. A description of some of the files is as follows:

  • grid-0.dat: The initial grid used in the iterative analysis. You’re free to use any grid you want (e.g., the output of some previous analysis), and the workflow can also create the initial grid for you.

  • grid-*.dat: These files are the inferred posterior distributions from each iteration. The data format is compatible with standard postprocessing tools, and the final output samples are used to create diagnostic plots.

  • iteration_*: Directories holding the output of each iteration, including log files.

As the workflow progresses, you’ll see the following additional files:

  • consolidated_*.net_marg: These files (particularly those ending in .composite) are the output of each iteration’s MARG jobs. Each file is a list of parameters and the value of the marginalized likelihood at those parameters.

Monitor your workflow using

$ watch condor_q

Multiple Constraints

You can run this pipeline with multiple observable constraints. For instance, you may have observational constraints on your model parameters, while terrestrial experiments (nuclear physics experiments, for instance) constrain some of the same parameters. In this case you may have more than one executable to provide for parameter inference, and HyperPipe can use all of them in a single run.

For example, there is a demo for recovery of a bimodal Gaussian distribution. The submit script is very similar to the unimodal case, but now the user must specify the second executable (and its arguments and parameters). The arguments for each executable may be different.

The current version of the pipeline multiplies the likelihoods from the various events, but for inference with multiple constraints you may want to add the likelihoods instead. To change this, edit con_prod.sh and change ‘--combination product’ to ‘--combination sum’.

For now, this is useful for the multiple-Gaussian example because the code then recovers both peaks, as opposed to a single peak in between the two Gaussians. This will be fixed in the future. If the likelihoods are to be multiplied (which is the correct way to combine likelihoods), submit the job without any edits.

For RIFT Users

This parameter estimation workflow is a fully generalized version of the RIFT pipeline that performs simulation-based inference. This means it is no longer restricted to gravitational-wave inference, but can be used, coordinate-free, for a wide variety of applications. It is also powerful because it can perform multiple analyses simultaneously, considering multiple sources of information at once. Despite the changes that generalize the RIFT pipeline, the basic structure of the workflow remains intact.

  • ILE.sub becomes MARG.sub. You may also have multiple instances of MARG if you wish to consider multiple sources of information.

  • PUFF.sub becomes MARG_PUFF.sub

  • CIP.sub becomes EOS_POST.sub