Pyroed is a framework for model-based optimization of sequences of discrete choices with constraints among choices. Pyroed aims to address the regime where there is very little data (100-10000 observations), small batch size (say 10-100), short sequences (length 2-100) of heterogeneous choice sets, and possibly with constraints among choices at different positions in the sequence.
Under the hood, Pyroed performs Thompson sampling against a hierarchical Bayesian linear regression model that is automatically generated from a Pyroed problem specification, deferring to Pyro for Bayesian inference (either variational or MCMC) and to annealed Gibbs sampling for discrete optimization. All numerics is performed by PyTorch.
You can install directly from github via
pip install https://github.com/pyro-ppl/pyroed/archive/main.zipFor developing Pyroed you can install from source
git clone git@github.com:pyro-ppl/pyroed cd pyroed pip install -e .First specify your sequence space by declaring a SCHEMA, CONSTRAINTS, FEATURE_BLOCKS, and GIBBS_BLOCKS. These are all simple Python data structures. For example to optimize a nucleotide sequence of length 6:
# Declare the set of choices and the values each choice can take. SCHEMA = OrderedDict() SCHEMA["nuc0"] = ["A", "C", "G", "T"] # these are the same, but SCHEMA["nuc1"] = ["A", "C", "G", "T"] # you can make each list different SCHEMA["nuc2"] = ["A", "C", "G", "T"] SCHEMA["nuc3"] = ["A", "C", "G", "T"] SCHEMA["nuc4"] = ["A", "C", "G", "T"] SCHEMA["nuc5"] = ["A", "C", "G", "T"] # Declare some constraints. See pyroed.constraints for options. CONSTRAINTS = [] CONSTRAINTS.append(AllDifferent("nuc0", "nuc1", "nuc2")) CONSTRAINTS.append(Iff(TakesValue("nuc4", "T"), TakesValue("nuc5", "T"))) # Specify groups of cross features for the Bayesian linear regression model. FEATURE_BLOCKS = [] FEATURE_BLOCKS.append(["nuc0"]) # single features FEATURE_BLOCKS.append(["nuc1"]) FEATURE_BLOCKS.append(["nuc2"]) FEATURE_BLOCKS.append(["nuc3"]) FEATURE_BLOCKS.append(["nuc4"]) FEATURE_BLOCKS.append(["nuc5"]) FEATURE_BLOCKS.append(["nuc0", "nuc1"]) # consecutive pairs FEATURE_BLOCKS.append(["nuc1", "nuc2"]) FEATURE_BLOCKS.append(["nuc2", "nuc3"]) FEATURE_BLOCKS.append(["nuc3", "nuc4"]) FEATURE_BLOCKS.append(["nuc4", "nuc5"]) # Finally define Gibbs sampling blocks for the discrete optimization. GIBBS_BLOCKS = [] GIBBS_BLOCKS.append(["nuc0", "nuc1"]) # consecutive pairs GIBBS_BLOCKS.append(["nuc1", "nuc2"]) GIBBS_BLOCKS.append(["nuc2", "nuc3"]) GIBBS_BLOCKS.append(["nuc3", "nuc4"]) GIBBS_BLOCKS.append(["nuc4", "nuc5"])An experiment consists of a set of sequences and the experimentally measured responses of those sequences.
# Enter your existing data. sequences = ["ACGAAA", "ACGATT", "AGTTTT"] responses = torch.tensor([0.1, 0.2, 0.6]) # Collect these into a dictionary that we'll maintain throughout our workflow. design = pyroed.encode_design(SCHEMA, sequences) experiment = pyroed.start_experiment(SCHEMA, design, responses)At each step of our optimization loop, we'll query Pyroed for a new design. Pyroed choose the design to balance exploitation (finding sequences with high response) and exploration.
design = pyroed.get_next_design( SCHEMA, CONSTRAINTS, FEATURE_BLOCKS, GIBBS_BLOCKS, experiment, design_size=3 ) new_seqences = ["".join(s) for s in pyroed.decode_design(SCHEMA, design)] print(new_sequences) # ["CAGTGC", "GCAGTT", "TAGGTT"]Then we'll go to the lab, measure the responses of these new sequences, and append the new results to our experiment:
new_responses = torch.tensor([0.04, 0.3, 0.25]) experiment = pyroed.update_experiment(SCHEMA, experiment, design, new_responses)We repeat step 3 as long as we like.
For a more in-depth demonstration of Pyroed usage in practice on some transcription factor data see rollout_tf8.py and tf8_demo.ipynb.
