RL Algorithms
Common
- class rl4co.models.rl.common.base.RL4COLitModule(env, policy, batch_size=512, val_batch_size=None, test_batch_size=None, train_data_size=100000, val_data_size=10000, test_data_size=10000, optimizer='Adam', optimizer_kwargs={'lr': 0.0001}, lr_scheduler=None, lr_scheduler_kwargs={'gamma': 0.1, 'milestones': [80, 95]}, lr_scheduler_interval='epoch', lr_scheduler_monitor='val/reward', generate_default_data=False, shuffle_train_dataloader=False, dataloader_num_workers=0, data_dir='data/', log_on_step=True, metrics={}, **litmodule_kwargs)
Bases: LightningModule
Base class for Lightning modules for RL4CO. This defines the general training loop in terms of RL algorithms. Subclasses should mainly implement shared_step to define their specific loss functions and optimization routines.
- Parameters:
  - env (RL4COEnvBase) – RL4CO environment
  - policy (Module) – policy network (actor)
  - batch_size (int) – batch size (general one, default used for training)
  - val_batch_size (int) – specific batch size for validation
  - test_batch_size (int) – specific batch size for testing
  - train_data_size (int) – size of training dataset for one epoch
  - val_data_size (int) – size of validation dataset for one epoch
  - test_data_size (int) – size of testing dataset for one epoch
  - optimizer (Union[str, Optimizer, partial]) – optimizer or optimizer name
  - optimizer_kwargs (dict) – optimizer kwargs
  - lr_scheduler (Union[str, LRScheduler, partial]) – learning rate scheduler or learning rate scheduler name
  - lr_scheduler_kwargs (dict) – learning rate scheduler kwargs
  - lr_scheduler_interval (str) – learning rate scheduler interval
  - lr_scheduler_monitor (str) – learning rate scheduler monitor
  - generate_default_data (bool) – whether to generate default datasets, filling up the data directory
  - shuffle_train_dataloader (bool) – whether to shuffle training dataloader. Default is False since we recreate dataset every epoch
  - dataloader_num_workers (int) – number of workers for dataloader
  - data_dir (str) – data directory
  - metrics (dict) – metrics
  - litmodule_kwargs – kwargs for LightningModule
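The default lr_scheduler_kwargs only matter once a scheduler is selected, since lr_scheduler defaults to None. As a rough sketch, assuming these kwargs are handed to a MultiStepLR-style scheduler (an assumption; the signature does not say which scheduler they target), the learning rate would evolve as below. The helper multistep_lr is hypothetical, written in plain Python, and is not part of RL4CO or PyTorch.

```python
def multistep_lr(base_lr, epoch, milestones=(80, 95), gamma=0.1):
    """Learning rate at `epoch` when it is multiplied by `gamma` at each milestone.

    Hypothetical helper mirroring the MultiStepLR rule; the defaults are copied
    from the signature above (lr=1e-4, gamma=0.1, milestones=[80, 95]).
    """
    passed = sum(1 for m in milestones if epoch >= m)  # milestones already reached
    return base_lr * gamma ** passed


# Print the learning rate before, at, and after the milestones.
for epoch in (0, 79, 80, 95):
    print(epoch, multistep_lr(1e-4, epoch))
```

Under these assumptions the rate stays at the optimizer's lr until epoch 80, then drops by a factor of 10 at each milestone.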
- forward(td, **kwargs)
Forward pass for the model. Simple wrapper around policy. Uses env from the module if not provided.
- log_metrics(metric_dict, phase, dataloader_idx=None)
Log metrics to logger and progress bar.
- on_train_epoch_end()
Called at the end of the training epoch. This can be used, for instance, to update the train dataset with new data (which is the case in RL).
- post_setup_hook()
Hook to be called after setup. Can be used to set up subclasses without overriding setup.
- setup(stage='fit')
Base LightningModule setup method. This will set up the datasets and dataloaders.
Note
We also send to the loggers all hyperparams that are not nn.Module (i.e. the policy). Apparently PyTorch Lightning does not do this by default.
Shared step between train/val/test. To be implemented in subclass
- test_dataloader()
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see this section.
For data processing use the following pattern:
  - download in prepare_data()
  - process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data.
  - test()
  - prepare_data()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- test_step(batch, batch_idx, dataloader_idx=None)
Operates on a single batch of data from the test set. In this step you’d normally generate examples or calculate anything of interest such as accuracy.
- Parameters:
- Returns:
  - Tensor - The loss tensor
  - dict - A dictionary. Can include any keys, but must include the key 'loss'.
  - None - Skip to the next batch.

    # if you have one test dataloader:
    def test_step(self, batch, batch_idx): ...

    # if you have multiple test dataloaders:
    def test_step(self, batch, batch_idx, dataloader_idx=0): ...

Examples:

    # CASE 1: A single test dataset
    def test_step(self, batch, batch_idx):
        x, y = batch

        # implement your own
        out = self(x)
        loss = self.loss(out, y)

        # log 6 example images
        # or generated text... or whatever
        sample_imgs = x[:6]
        grid = torchvision.utils.make_grid(sample_imgs)
        self.logger.experiment.add_image('example_images', grid, 0)

        # calculate acc
        labels_hat = torch.argmax(out, dim=1)
        test_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

        # log the outputs!
        self.log_dict({'test_loss': loss, 'test_acc': test_acc})

If you pass in multiple test dataloaders, test_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

    # CASE 2: multiple test dataloaders
    def test_step(self, batch, batch_idx, dataloader_idx=0):
        # dataloader_idx tells you which dataset this is.
        ...
Note
If you don’t need to test you don’t need to implement this method.
Note
When test_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of the test epoch, the model goes back to training mode and gradients are enabled.
- train_dataloader()
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
  - download in prepare_data()
  - process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data.
  - fit()
  - prepare_data()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- training_step(batch, batch_idx)
Here you compute and return the training loss and some additional metrics for e.g. the progress bar or logger.
- Parameters:
- Returns:
  - Tensor - The loss tensor
  - dict - A dictionary. Can include any keys, but must include the key 'loss'.
  - None - Skip to the next batch. This is only supported for automatic optimization. This is not supported for multi-GPU, TPU, IPU, or DeepSpeed.
In this step you’d normally do the forward pass and calculate the loss for a batch. You can also do fancier things like multiple forward passes or something model specific.
Example:

    def training_step(self, batch, batch_idx):
        x, y, z = batch
        out = self.encoder(x)
        loss = self.loss(out, x)
        return loss

To use multiple optimizers, you can switch to ‘manual optimization’ and control their stepping:

    def __init__(self):
        super().__init__()
        self.automatic_optimization = False

    # Multiple optimizers (e.g.: GANs)
    def training_step(self, batch, batch_idx):
        opt1, opt2 = self.optimizers()

        # do training_step with encoder
        ...
        opt1.step()
        # do training_step with decoder
        ...
        opt2.step()

Note
When accumulate_grad_batches > 1, the loss returned here will be automatically normalized by accumulate_grad_batches internally.
- val_dataloader()
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
  - fit()
  - validate()
  - prepare_data()
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- validation_step(batch, batch_idx, dataloader_idx=None)
Operates on a single batch of data from the validation set. In this step you might generate examples or calculate anything of interest like accuracy.
- Parameters:
- Returns:
  - Tensor - The loss tensor
  - dict - A dictionary. Can include any keys, but must include the key 'loss'.
  - None - Skip to the next batch.

    # if you have one val dataloader:
    def validation_step(self, batch, batch_idx): ...

    # if you have multiple val dataloaders:
    def validation_step(self, batch, batch_idx, dataloader_idx=0): ...

Examples:

    # CASE 1: A single validation dataset
    def validation_step(self, batch, batch_idx):
        x, y = batch

        # implement your own
        out = self(x)
        loss = self.loss(out, y)

        # log 6 example images
        # or generated text... or whatever
        sample_imgs = x[:6]
        grid = torchvision.utils.make_grid(sample_imgs)
        self.logger.experiment.add_image('example_images', grid, 0)

        # calculate acc
        labels_hat = torch.argmax(out, dim=1)
        val_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)

        # log the outputs!
        self.log_dict({'val_loss': loss, 'val_acc': val_acc})

If you pass in multiple val dataloaders, validation_step() will have an additional argument. We recommend setting the default value of 0 so that you can quickly switch between single and multiple dataloaders.

    # CASE 2: multiple validation dataloaders
    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        # dataloader_idx tells you which dataset this is.
        ...
Note
If you don’t need to validate you don’t need to implement this method.
Note
When validation_step() is called, the model has been put in eval mode and PyTorch gradients have been disabled. At the end of validation, the model goes back to training mode and gradients are enabled.
- class rl4co.models.rl.common.critic.CriticNetwork(env_name=None, encoder=None, init_embedding=None, embedding_dim=128, hidden_dim=512, num_layers=3, num_heads=8, normalization='batch', sdpa_fn=None, **unused_kwargs)
Bases: Module
We make the critic network compatible with any problem by using an encoder for any environment. Refactored from Kool et al. (2019), which only worked for TSP; here it is made compatible with any problem by using the environment init embedding. Note that if neither an environment name nor an init embedding is provided, the critic network does not transform the input (i.e. it should be a tensor of shape (batch_size, embedding_dim)).
- Parameters:
  - env_name (str) – environment name to solve
  - encoder (Module) – Encoder to use for the critic
  - init_embedding (Module) – Initial embedding to use for the critic
  - embedding_dim (int) – Dimension of the embeddings
  - hidden_dim (int) – Hidden dimension for the feed-forward network
  - num_layers (int) – Number of layers for the encoder
  - num_heads (int) – Number of heads for the attention
  - normalization (str) – Normalization to use for the attention
  - sdpa_fn (Optional[Callable]) – Scaled dot product function to use for the attention
Initializes internal Module state, shared by both nn.Module and ScriptModule.
PPO
- class rl4co.models.rl.ppo.ppo.PPO(env, policy, critic, clip_range=0.2, ppo_epochs=2, mini_batch_size=0.25, vf_lambda=0.5, entropy_lambda=0.0, normalize_adv=False, max_grad_norm=0.5, metrics={'train': ['loss', 'surrogate_loss', 'value_loss', 'entropy']}, **kwargs)
Bases: RL4COLitModule
An implementation of the Proximal Policy Optimization (PPO) algorithm (https://arxiv.org/abs/1707.06347), with modifications for autoregressive decoding schemes.
In contrast to the original PPO algorithm, this implementation does not consider autoregressive decoding steps as part of the MDP transition. While many Neural Combinatorial Optimization (NCO) studies model decoding steps as transitions in a solution-construction MDP, we treat autoregressive solution construction as an algorithmic choice for tractable CO solution generation. This choice aligns with the Attention Model (AM) (https://openreview.net/forum?id=ByxBFsRqYm), which treats decoding steps as a single-step MDP in Equation 9.
Modeling autoregressive decoding steps as a single-step MDP introduces significant changes to the PPO implementation, including:
  - Generalized Advantage Estimation (GAE) (https://arxiv.org/abs/1506.02438) is not applicable since we are dealing with a single-step MDP.
  - The definition of policy entropy can differ from the commonly implemented one.
The commonly implemented definition of policy entropy is the entropy of the policy distribution, given by:
\[H(\pi(x_t)) = - \sum_{a_t \in A_t} \pi(a_t|x_t) \log \pi(a_t|x_t)\]
where \(x_t\) represents the given state at step \(t\), \(A_t\) is the set of all (admissible) actions at step \(t\), and \(a_t\) is the action taken at step \(t\).
If we interpret autoregressive decoding steps as transition steps of an MDP, the entropy for the entire decoding process can be defined as the sum of entropies for each decoding step:
\[H(\pi) = \sum_t H(\pi(x_t))\]
However, if we consider autoregressive decoding steps as an algorithmic choice, the entropy for the entire decoding process is defined as:
\[H(\pi) = - \sum_{a \in A} \pi(a|x) \log \pi(a|x)\]
where \(x\) represents the given CO problem instance, and \(A\) is the set of all feasible solutions.
Due to the intractability of computing the entropy of the policy distribution over all feasible solutions, we approximate it by computing the entropy over solutions generated by the policy itself. This approximation serves as a proxy for the second definition of entropy, utilizing Monte Carlo sampling.
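To make the solution-level entropy concrete, here is a minimal sketch on a toy instance with three hypothetical candidate solutions and made-up probabilities. It compares the exact entropy over all solutions with a Monte Carlo estimate that averages \(-\log \pi(a|x)\) over solutions sampled from the policy itself. It illustrates the idea only and is not RL4CO's implementation.

```python
import math
import random

# Toy policy over three hypothetical solutions for one instance x (made-up numbers).
probs = {"tour_a": 0.5, "tour_b": 0.3, "tour_c": 0.2}

# Exact entropy: H(pi) = -sum_a pi(a|x) log pi(a|x); only tractable for a tiny set A.
exact_entropy = -sum(p * math.log(p) for p in probs.values())

# Monte Carlo proxy: sample solutions from the policy and average -log pi(a|x).
rng = random.Random(0)
solutions = list(probs)
weights = [probs[s] for s in solutions]
samples = rng.choices(solutions, weights=weights, k=20_000)
mc_entropy = -sum(math.log(probs[s]) for s in samples) / len(samples)

print(f"exact: {exact_entropy:.4f}, Monte Carlo: {mc_entropy:.4f}")
```

In a real CO problem the set of feasible solutions is far too large to enumerate, so only the sampled estimate is available.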
It is worth noting that our modeling of decoding steps and the implementation of the PPO algorithm align with recent work in the Natural Language Processing (NLP) community, specifically RL with Human Feedback (RLHF) (e.g., https://github.com/lucidrains/PaLM-rlhf-pytorch).
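Under the single-step view described above, each sampled solution counts as one action, so the advantage reduces to the reward minus the critic's value (no GAE) and the probability ratio is taken over whole-solution log-likelihoods. The sketch below applies the standard clipped PPO objective to plain Python floats. The function name, the squared-error value loss, and the way the terms are combined with vf_lambda and entropy_lambda are assumptions following the usual PPO convention, not a copy of the library's code.

```python
import math


def ppo_single_step_loss(logp_new, logp_old, reward, value, entropy,
                         clip_range=0.2, vf_lambda=0.5, entropy_lambda=0.0):
    """Clipped PPO loss for one sampled solution treated as a single action."""
    advantage = reward - value              # single-step MDP: no GAE needed
    ratio = math.exp(logp_new - logp_old)   # pi_new(a|x) / pi_old(a|x)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    surrogate_loss = -min(ratio * advantage, clipped * advantage)
    value_loss = (value - reward) ** 2      # assumed squared-error critic loss
    return surrogate_loss + vf_lambda * value_loss - entropy_lambda * entropy


# The ratio exp(0.5) is above 1 + clip_range, so the surrogate term is clipped at 1.2.
loss = ppo_single_step_loss(logp_new=-1.0, logp_old=-1.5,
                            reward=-10.0, value=-11.0, entropy=0.0)
print(loss)
```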
Shared step between train/val/test. To be implemented in subclass
REINFORCE
- class rl4co.models.rl.reinforce.baselines.CriticBaseline(critic=None, **unused_kw)
Bases: REINFORCEBaseline
Critic baseline: use critic network as baseline
- Parameters:
  - critic (Module) – Critic network to use as baseline. If None, create a new critic network based on the environment
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- class rl4co.models.rl.reinforce.baselines.ExponentialBaseline(beta=0.8, **kw)
Bases: REINFORCEBaseline
Exponential baseline: return exponential moving average of reward as baseline
- Parameters:
  - beta – Beta value for the exponential moving average
Initializes internal Module state, shared by both nn.Module and ScriptModule.
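As an illustration of an exponential moving average baseline, the sketch below keeps a running value that is updated with each batch's mean reward and subtracted from the rewards to form advantages. The class name, the initialisation with the first batch mean, and the reward values are assumptions for illustration, not the library's code.

```python
class ExponentialMovingAverage:
    """Hypothetical stand-in for an exponential reward baseline."""

    def __init__(self, beta=0.8):
        self.beta = beta
        self.value = None  # no baseline before the first batch

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        if self.value is None:
            self.value = batch_mean  # initialise with the first batch mean
        else:
            self.value = self.beta * self.value + (1.0 - self.beta) * batch_mean
        return self.value


ema = ExponentialMovingAverage(beta=0.8)
for batch in ([-10.0, -12.0], [-8.0, -8.0], [-6.0, -7.0]):
    baseline = ema.update(batch)
    advantages = [r - baseline for r in batch]  # reward minus baseline
    print(round(baseline, 3), [round(a, 3) for a in advantages])
```

A larger beta makes the baseline react more slowly to recent batches.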
- class rl4co.models.rl.reinforce.baselines.MeanBaseline(**kw)
Bases: REINFORCEBaseline
Mean baseline: return mean of reward as baseline
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- class rl4co.models.rl.reinforce.baselines.NoBaseline(*args, **kw)
Bases: REINFORCEBaseline
No baseline: return 0 for both the baseline value and its loss term
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- class rl4co.models.rl.reinforce.baselines.REINFORCEBaseline(*args, **kw)
Bases: Module
Base class for REINFORCE baselines
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- epoch_callback(*args, **kw)
Callback at the end of each epoch. For example, update baseline parameters and obtain baseline values.
- class rl4co.models.rl.reinforce.baselines.RolloutBaseline(bl_alpha=0.05, **kw)
Bases: REINFORCEBaseline
Rollout baseline: use greedy rollout as baseline
- Parameters:
  - bl_alpha – Alpha value (significance level) for the baseline T-test
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- epoch_callback(model, env, batch_size=64, device='cpu', epoch=None, dataset_size=None)
Challenges the current baseline with the model and replaces the baseline model if it is improved.
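The challenge can be pictured as a one-sided paired test on rewards collected over the same evaluation instances, with bl_alpha as the significance level. The sketch below is an assumption-laden illustration in plain Python: the function name is hypothetical, the rewards are synthetic, and the p-value uses a normal approximation to the t distribution (reasonable only for large evaluation sets), so it is not the library's exact procedure.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def should_replace_baseline(candidate_rewards, baseline_rewards, bl_alpha=0.05):
    """One-sided paired test: is the candidate's mean reward significantly higher?"""
    diffs = [c - b for c, b in zip(candidate_rewards, baseline_rewards)]
    if mean(diffs) <= 0:
        return False  # candidate is not better on average
    spread = stdev(diffs)
    if spread == 0:
        return True  # uniformly better by a constant margin
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    # Normal approximation to the t distribution (assumes many evaluation instances).
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return p_value < bl_alpha


# Synthetic rewards: the candidate is better by about 0.2 on average.
rng = random.Random(0)
baseline_rewards = [rng.gauss(-10.0, 1.0) for _ in range(1000)]
candidate_rewards = [b + rng.gauss(0.2, 0.5) for b in baseline_rewards]
print(should_replace_baseline(candidate_rewards, baseline_rewards))
```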
- eval(td, reward, env)
Evaluate rollout baseline
Warning
This is not differentiable and should only be used for evaluation. Also, it is recommended to use the rollout method directly instead of this method.
- rollout(model, env, batch_size=64, device='cpu', dataset=None)
Rollout the model on the given dataset
- setup(*args, **kw)
To be called before training during the setup phase. This follows PyTorch Lightning’s setup() convention.
- wrap_dataset(dataset, env, batch_size=64, device='cpu', **kw)
Wrap the dataset in a baseline dataset
Note
This is an alternative to eval that does not require the model to be passed at every call but just once. Values are added to the dataset. This also allows for larger batch sizes since we evaluate the model without gradients.
Bases: REINFORCEBaseline
Shared baseline: return mean of reward as baseline
Initializes internal Module state, shared by both nn.Module and ScriptModule.
Evaluate baseline
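A shared baseline is typically paired with decoding several solutions per instance, where the mean reward of those solutions serves as the baseline for that instance. The sketch below assumes rewards arranged as one list per instance; the layout, the function name, and the reward values are assumptions for illustration.

```python
def shared_baseline(rewards_per_instance):
    """Return (baseline, advantages) per instance, using the per-instance mean reward."""
    results = []
    for rewards in rewards_per_instance:
        baseline = sum(rewards) / len(rewards)
        results.append((baseline, [r - baseline for r in rewards]))
    return results


# Two instances, three sampled solutions each (made-up rewards).
out = shared_baseline([[-10.0, -12.0, -11.0], [-5.0, -7.0, -6.0]])
for baseline, advantages in out:
    print(baseline, advantages)
```

Because the baseline comes from the other samples of the same instance, no extra network or rollout is needed.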
- class rl4co.models.rl.reinforce.baselines.WarmupBaseline(baseline, n_epochs=1, warmup_exp_beta=0.8, **kw)
Bases: REINFORCEBaseline
Warmup baseline: return convex combination of baseline and exponential baseline
- Parameters:
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- epoch_callback(*args, **kw)
Callback at the end of each epoch. For example, update baseline parameters and obtain baseline values.
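The convex combination used by the warmup baseline can be sketched as a weight alpha that moves from the exponential baseline toward the wrapped baseline over n_epochs. The linear schedule for alpha, the function name, and the numbers below are assumptions for illustration; only the convex combination itself comes from the description above.

```python
def warmup_baseline_value(target_value, exp_value, epoch, n_epochs=1):
    """Blend the wrapped baseline with an exponential baseline during warmup."""
    alpha = min(1.0, (epoch + 1) / n_epochs)  # assumed linear warmup schedule
    return alpha * target_value + (1.0 - alpha) * exp_value


# With n_epochs=3 the blend reaches the wrapped baseline value at epoch index 2.
for epoch in range(4):
    print(epoch, warmup_baseline_value(-9.0, -11.0, epoch, n_epochs=3))
```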
- rl4co.models.rl.reinforce.baselines.get_reinforce_baseline(name, **kw)
Get a REINFORCE baseline by name. The rollout baseline defaults to a warmup baseline with one epoch of exponential baseline and the greedy rollout.
- class rl4co.models.rl.reinforce.reinforce.REINFORCE(env, policy, baseline='rollout', baseline_kwargs={}, **kwargs)
Bases: RL4COLitModule
REINFORCE algorithm, also known as policy gradients. See superclass RL4COLitModule for more details.
- Parameters:
  - env (RL4COEnvBase) – Environment to use for the algorithm
  - policy (Module) – Policy to use for the algorithm
  - baseline (Union[REINFORCEBaseline, str]) – REINFORCE baseline
  - baseline_kwargs (dict) – Keyword arguments for baseline. Ignored if baseline is not a string
  - **kwargs – Keyword arguments passed to the superclass
- calculate_loss(td, batch, policy_out, reward=None, log_likelihood=None)
Calculate loss for REINFORCE algorithm.
- Parameters:
  - td (TensorDict) – TensorDict containing the current state of the environment
  - batch (TensorDict) – Batch of data. This is used to get the extra loss terms, e.g., REINFORCE baseline
  - policy_out (dict) – Output of the policy network
  - reward (Optional[Tensor]) – Reward tensor. If None, it is taken from policy_out
  - log_likelihood (Optional[Tensor]) – Log-likelihood tensor. If None, it is taken from policy_out
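For orientation, the textbook REINFORCE loss with a baseline weights each solution's log-likelihood by its advantage (reward minus baseline) and averages over the batch. The sketch below shows that computation on plain Python lists. The function name, the optional baseline_loss term, and the numbers are assumptions for illustration; the library's calculate_loss may differ in details.

```python
def reinforce_loss(rewards, log_likelihoods, baseline_values, baseline_loss=0.0):
    """Textbook REINFORCE loss with a baseline, averaged over the batch."""
    advantages = [r - b for r, b in zip(rewards, baseline_values)]
    # Minimising this loss raises the likelihood of solutions with positive advantage.
    policy_loss = -sum(a * ll for a, ll in zip(advantages, log_likelihoods)) / len(rewards)
    return policy_loss + baseline_loss


# Two sampled solutions with made-up rewards, log-likelihoods, and baseline values.
loss = reinforce_loss(
    rewards=[-10.0, -12.0],
    log_likelihoods=[-2.0, -3.0],
    baseline_values=[-11.0, -11.0],
)
print(loss)
```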
- classmethod load_from_checkpoint(checkpoint_path, map_location=None, hparams_file=None, strict=False, load_baseline=True, **kwargs)
Load model from checkpoint.
- Return type: Self
Note
This is a modified version of load_from_checkpoint from pytorch_lightning.core.saving. It deals with matching keys for the baseline by first running setup.
- post_setup_hook(stage='fit')
Hook to be called after setup. Can be used to set up subclasses without overriding setup
- set_decode_type_multistart(phase)
Set decode type to multistart for train, val and test in policy. For example, if the decode type is greedy, it will be set to greedy_multistart.
- Parameters:
  - phase (str) – Phase to set decode type for. Must be one of train, val or test.
Shared step between train/val/test. To be implemented in subclass