NOTE [ Autograd View Variables ]
Many operations return a Variable that shares storage with an input Variable.
The returned Variable is called a **view** Variable of the input **base**
Variable.
In PyTorch, we have two types of views: differentiable views, and
non-differentiable views. In either type, to support proper version
checking, the base and view Variables must always share the same
version_counter.
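The shared counter is observable from Python; a minimal sketch, assuming a
recent PyTorch and using the internal Tensor._version attribute to read it:

    import torch

    base = torch.zeros(4)
    view = base[1:3]   # view shares storage and version counter with base
    view.add_(1)       # an in-place op through the view bumps the shared counter
    assert base._version == view._version == 1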
Differentiable Views
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This class allows tracking both forward and backward AD differentiable
views. Such a view can have a different base for forward and for backward
mode AD, because what counts as a non-differentiable view is not the same
in the two modes.
Most functions are either both forward and backward differentiable views (for
example: view, select, narrow, transpose, etc.) or neither forward nor
backward differentiable views (for example: indices, values, eq, lt, etc.).
But there are also functions that are forward but not backward
differentiable views (only detach for now) and functions that are backward
but not forward differentiable views (only make_dual and unpack_dual for
now).
A concrete example of two views with different bases is as follows:

    # Have:
    #   dual is a dual Tensor that is neither a forward nor a backward view
    detached_dual = dual.detach()
    view = detached_dual.view_as(dual)
    # The forward base of view is dual
    # The backward base of view is detached_dual
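A runnable version of the same situation, as a minimal sketch using the
public torch.autograd.forward_ad API (shapes are arbitrary; the comments
only restate the view relations described above):

    import torch
    import torch.autograd.forward_ad as fwAD

    with fwAD.dual_level():
        # Build a dual Tensor (primal + tangent) for the sketch
        dual = fwAD.make_dual(torch.rand(3), torch.ones(3))
        # detach() is a forward (but not backward) differentiable view
        detached_dual = dual.detach()
        # view_as() is both a forward and a backward differentiable view:
        # its forward base is dual, its backward base is detached_dual
        view = detached_dual.view_as(dual)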
- Backward Mode View
Differentiable views are the view variables where you want gradients to flow
back to the base variables. Out-of-place operations on views are quite
straightforward, but in-place ones are very tricky. Even if the base
variable does not require grad when we create the view, we still need to
track the view relation because future in-place ops may require
back-propagating through it. For example, we need to support
(1) in-place operation on view, e.g.,

    # Have:
    #   base.requires_grad = False
    #   var.requires_grad = True
    base[1] = var  # i.e., base[1].copy_(var)
    torch.autograd.grad(base.sum(), var)  <- should return an all ones tensor

(2) in-place operation on base after view is created, e.g.,

    # Have:
    #   base.requires_grad = False
    #   var.requires_grad = True
    view = base[1]
    base.copy_(var)
    torch.autograd.grad(view.sum(), var)  <- should return a tensor with
                                             var[1] filled with all ones and
                                             zeros everywhere else
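Both cases can be exercised directly; a minimal runnable sketch (the shapes
are chosen only so that the indexing above makes sense):

    import torch

    # (1) in-place operation on view
    base = torch.zeros(2, 3)                        # base.requires_grad = False
    var = torch.ones(3, requires_grad=True)
    base[1] = var                                   # i.e., base[1].copy_(var)
    (grad,) = torch.autograd.grad(base.sum(), var)  # grad is an all ones tensor

    # (2) in-place operation on base after view is created
    base = torch.zeros(2, 3)                        # base.requires_grad = False
    var = torch.ones(2, 3, requires_grad=True)
    view = base[1]
    base.copy_(var)
    (grad,) = torch.autograd.grad(view.sum(), var)  # grad[1] is all ones, zeros elsewhere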
- Forward Mode View
Forward differentiable views follow the same semantics as backward ones but
show up differently as they are computed along with the forward evaluation.
The hard examples above are thus very similar:

(1) in-place operation on view, e.g.,

    # Have:
    #   base is a regular Tensor
    #   var is a dual Tensor whose tangent is all ones
    base[1] = var  # i.e., base[1].copy_(var)
    # Now, base is a dual Tensor
    _, fw_grad = fwAD.unpack_dual(base)  <- fw_grad should be a tensor with
                                            fw_grad[1] filled with all ones
                                            and zeros everywhere else

(2) in-place operation on base after view is created, e.g.,

    # Have:
    #   base is a regular Tensor
    #   var is a dual Tensor whose tangent is all ones
    view = base[1]
    base.copy_(var)
    _, fw_grad = fwAD.unpack_dual(view)  <- fw_grad should be an all ones tensor
See Note [Forward Grad View/inplace] for more details on how we handle these
hard cases.
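The same two cases, as a minimal runnable sketch using the public forward AD
API (an illustration of the expected tangents, not a specification):

    import torch
    import torch.autograd.forward_ad as fwAD

    with fwAD.dual_level():
        # (1) in-place operation on view
        base = torch.zeros(2, 3)                            # regular Tensor
        var = fwAD.make_dual(torch.rand(3), torch.ones(3))  # dual Tensor, tangent all ones
        base[1] = var                                       # base becomes a dual Tensor
        _, fw_grad = fwAD.unpack_dual(base)                 # fw_grad[1] all ones, zeros elsewhere

        # (2) in-place operation on base after view is created
        base = torch.zeros(2, 3)
        var = fwAD.make_dual(torch.rand(2, 3), torch.ones(2, 3))
        view = base[1]
        base.copy_(var)
        _, fw_grad = fwAD.unpack_dual(view)                 # fw_grad is an all ones tensor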
DifferentiableViewMeta is created to support gradient tracking of
such **in-place** operations. In particular,
+ if an in-place op is done on base, the grad_fn field of the view may
become stale. So accesses should always go through grad_fn(), which
reconstructs an updated grad_fn if the version_counter has incremented
(a user-visible illustration follows this list). All other fields are
always valid.
+ if an in-place op is done on view, in rebase_history() of view, which is
called after every in-place op in VariableType.cpp, the grad_fn of base
is updated.
+ if a single autograd Node returns multiple differentiable views and any
output is modified by an inplace operation, the autograd engine will build
an equivalent graph (corresponding to the view operations) instead of using
the provided grad_fn, where each output is treated as if it were produced by
a distinct view operation. This discards the original (e.g., user provided)
grad_fn. If the provided grad_fn does more than the backward of the view,
then the DifferentiableViewMeta must be created with
creation_meta=CreationMeta::MULTI_OUTPUT_NODE to prevent the engine from
ignoring the provided grad_fn.
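The first point is user visible: after an in-place op on the base, the view's
grad_fn is lazily rebuilt from the base's updated history. A minimal sketch
(the exact backward node class names, e.g. SelectBackward0 vs
AsStridedBackward0, depend on the PyTorch version):

    import torch

    base = torch.rand(2, 3, requires_grad=True).clone()  # non-leaf, so in-place ops are allowed
    view = base[1]
    print(type(view.grad_fn).__name__)  # a select-style backward node
    base.mul_(2)                        # in-place op on base: the view's cached grad_fn is stale
    print(type(view.grad_fn).__name__)  # reconstructed from base's updated grad_fn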
Interaction with GradMode:
The particular case that we consider here is:

    # Have:
    #   base.requires_grad = True or False
    with torch.no_grad():
        view = base[1]
    base.requires_grad_()
    view.copy_(var)
    torch.autograd.grad(base.sum(), var)  <- what should it return?

Given that this particular code example is ambiguous and can easily be
replaced by moving both the view creation and the in-place op either inside
the no_grad block or outside of it, we explicitly forbid it. For now, it is
deprecated with a warning. This is achieved by setting
creation_meta=CreationMeta::NO_GRAD_MODE for all differentiable views
created in no_grad mode.
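For reference, the forbidden pattern as a runnable snippet; depending on the
PyTorch version, the in-place op on the tagged view either emits a
deprecation warning or raises a RuntimeError (the snippet only triggers the
check, it does not answer the ambiguous question above):

    import torch

    base = torch.zeros(2, 3)
    var = torch.ones(3, requires_grad=True)

    with torch.no_grad():
        view = base[1]    # tagged with CreationMeta::NO_GRAD_MODE
    base.requires_grad_()

    view.copy_(var)       # view created in no_grad, modified with grad mode enabled: warns or errors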
See Note [View + Inplace update for base tensor]
and Note [View + Inplace update for view tensor] for details on how
autograd handles in-place updates with view ops.
Non-Differentiable Views
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In certain cases, although function outputs share storage with inputs, they
will **never** require gradient history tracking. Instead of registering the
view relation via DifferentiableViewMeta in autograd, the views use the
usual AutogradMeta and just share the version counters with the base
Variables.
Such views include:
1. Views created from .detach()
2. Views that are non-differentiable by their nature, e.g.,
sparse_tensor.indices() is an integral view on a (possibly) floating point
tensor. See the top of derivatives.yaml for how to specify that the outputs
of a function are non-differentiable.
These are called non-differentiable views as the gradients do not flow
through the view relation.
Relevant logic for both differentiable and non-differentiable views is
implemented in make_variable_(non_)differentiable_view below, and
wrap_output of gen_variable_type.py.
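A small sketch of the user-visible behavior of such views, assuming a recent
PyTorch:

    import torch

    x = torch.rand(3, requires_grad=True)
    y = x.detach()                 # non-differentiable view: same storage, no gradient history
    assert y.data_ptr() == x.data_ptr()
    assert not y.requires_grad

    s = torch.rand(3, 3).to_sparse()
    idx = s.indices()              # integral view into a floating point sparse tensor
    assert idx.dtype == torch.int64
    assert not idx.requires_grad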
NOTE [ View + Inplace detection ]
We want to detect views followed by inplace as they are often forbidden to
ensure correctness of the computed gradients. But since we want to only
notify the user when both happen, we tag the DifferentiableViewMeta when the
view is created via the make_variable_*_view() functions. This tag is then
checked by the check_inplace() function from VariableTypeUtils.h, which
should be called before every inplace operation. To detect cases where other
views are modified and this one is rebased by side effect, we also check in
VariableHooks::grad_fn().
The creation_meta flag gives more information about when this view was created:
- IN_CUSTOM_FUNCTION should be set when the view is created inside a custom
autograd Function and returned from it.
- NO_GRAD_MODE should be set when a view is created while GradMode is
disabled.
- MULTI_OUTPUT_NODE should be set when a Node created by codegen code returns
multiple differentiable views.
- INFERENCE_MODE should be set when a view of a normal tensor is created in
InferenceMode.
- DEFAULT is for all other cases.
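An illustrative, hedged mapping from each flag to the kind of user code
expected to produce it (the tagging happens internally, so the comments only
indicate which flag applies; details vary across PyTorch versions):

    import torch

    base = torch.rand(2, 3, requires_grad=True)

    v_default = base[0]                # DEFAULT: plain differentiable view

    class ReturnView(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view(-1)          # view created inside a custom Function and returned
        @staticmethod
        def backward(ctx, grad):
            return grad.view(2, 3)

    v_custom = ReturnView.apply(base)  # IN_CUSTOM_FUNCTION

    with torch.no_grad():
        v_no_grad = base[0]            # NO_GRAD_MODE

    v_multi, _ = base.split(1)         # MULTI_OUTPUT_NODE: one Node returning several views

    normal = torch.rand(2, 3)          # created outside InferenceMode
    with torch.inference_mode():
        v_inference = normal[0]        # INFERENCE_MODE: view of a normal tensor inside InferenceMode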