@Operator(group="nn") public final class FixedUnigramCandidateSampler extends PrimitiveOp
This unigram sampler uses a fixed unigram distribution, either read from a file or passed in as an in-memory array, instead of building up the distribution from data on the fly. The distribution can optionally be skewed by raising each weight to a distortion power.
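To make the effect of the distortion power concrete, here is a minimal sketch in plain Java. The class `DistortionSketch` and its `distort` method are hypothetical names used only for illustration; they are not part of the TensorFlow API. The sketch assumes strictly positive weights and that the distorted weights are normalized to sum to 1, which the page does not state explicitly.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the TensorFlow API. */
final class DistortionSketch {

    /**
     * Raises each weight to the given power, then normalizes so the result sums to 1.
     * Assumes strictly positive weights (note that Math.pow(0.0, 0.0) is 1.0 in Java).
     */
    static double[] distort(double[] weights, double distortion) {
        double[] out = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            out[i] = Math.pow(weights[i], distortion);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] counts = {1.0, 4.0, 16.0};
        // distortion = 1.0 keeps the original proportions.
        System.out.println(Arrays.toString(distort(counts, 1.0)));
        // distortion = 0.5 flattens the distribution.
        System.out.println(Arrays.toString(distort(counts, 0.5)));
        // distortion = 0.0 yields a uniform distribution.
        System.out.println(Arrays.toString(distort(counts, 0.0)));
    }
}
```

Values between 0.0 and 1.0 sit between the two extremes: frequent entries are still favored, but less strongly than their raw counts would suggest.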
The vocabulary file should be in CSV-like format, with the last field being the weight associated with the word.
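As an illustration of the "last field is the weight" convention, the following plain-Java sketch extracts weights from CSV-like lines. `VocabLineSketch` is a hypothetical helper, not the op's actual parser. The comma delimiter and the choice to skip blank or unparsable lines are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the op's actual file parser. */
final class VocabLineSketch {

    /** Returns the weight in the last comma-separated field, or null if the line is unusable. */
    static Float parseWeight(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        // lastIndexOf returns -1 when there is no comma, so the whole line is the field.
        String last = trimmed.substring(trimmed.lastIndexOf(',') + 1).trim();
        try {
            return Float.parseFloat(last);
        } catch (NumberFormatException e) {
            return null; // Assumption: lines without a numeric last field are skipped.
        }
    }

    /** Collects the weights of all usable lines, in file order. */
    static List<Float> parseWeights(List<String> lines) {
        List<Float> weights = new ArrayList<>();
        for (String line : lines) {
            Float w = parseWeight(line);
            if (w != null) {
                weights.add(w);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the,120", "of,0.5", "", "word,extra,7");
        System.out.println(parseWeights(lines));
    }
}
```

A list built this way has the same shape as the in-memory alternative, `unigrams(List<Float> unigrams)`: one weight per ID, in sequential order.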
For each batch, this op picks a single set of sampled candidate labels.
The advantages of sampling candidates per-batch are simplicity and the possibility of efficient dense matrix multiplication. The disadvantage is that the sampled candidates must be chosen independently of the context and of the true labels.
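The per-batch behavior can be sketched in plain Java as follows. This is only an illustration of the idea, not the TensorFlow kernel: `BatchSamplerSketch` is a hypothetical class, the linear scan over cumulative weights is a simplification, and the approximation used to estimate post-rejection probabilities is not modeled. One set of IDs is drawn from the weights alone, without looking at any context or true label, and every example in the batch would share it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration of per-batch candidate sampling; not the TensorFlow kernel. */
final class BatchSamplerSketch {

    /** Draws one set of candidate IDs for the whole batch. Assumes non-negative weights. */
    static long[] sample(double[] weights, int numSampled, boolean unique, long seed) {
        double[] cumulative = new double[weights.length];
        double total = 0.0;
        int positive = 0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] > 0.0) {
                positive++;
            }
            total += weights[i];
            cumulative[i] = total;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        if (unique && numSampled > positive) {
            // Without this check, rejection sampling could loop forever.
            throw new IllegalArgumentException("not enough IDs with nonzero weight");
        }
        Random rng = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        long[] out = new long[numSampled];
        int n = 0;
        while (n < numSampled) {
            int id = pick(cumulative, rng.nextDouble() * total);
            if (unique && !seen.add(id)) {
                continue; // Rejection: this ID was already drawn for the batch.
            }
            out[n++] = id;
        }
        return out;
    }

    /** Returns the first ID whose cumulative weight exceeds u. */
    private static int pick(double[] cumulative, double u) {
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] > u) {
                return i;
            }
        }
        // Floating-point edge case: fall back to the last ID with nonzero weight.
        for (int i = cumulative.length - 1; i > 0; i--) {
            if (cumulative[i] > cumulative[i - 1]) {
                return i;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // ID 0 has weight 0, so it is never sampled.
        double[] weights = {0.0, 3.0, 2.0, 1.0, 4.0};
        System.out.println(Arrays.toString(sample(weights, 3, true, 42L)));
    }
}
```

Because the same candidate IDs apply to every row of the batch, their embeddings can be gathered once and multiplied against the whole batch in a single dense matrix product.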
Modifier and Type | Class and Description |
---|---|
static class | FixedUnigramCandidateSampler.Options Optional attributes for FixedUnigramCandidateSampler |
Fields inherited from class PrimitiveOp: operation
Modifier and Type | Method and Description |
---|---|
static FixedUnigramCandidateSampler | create(Scope scope, Operand<Long> trueClasses, Long numTrue, Long numSampled, Boolean unique, Long rangeMax, FixedUnigramCandidateSampler.Options... options) Factory method to create a class wrapping a new FixedUnigramCandidateSampler operation. |
static FixedUnigramCandidateSampler.Options | distortion(Float distortion) |
static FixedUnigramCandidateSampler.Options | numReservedIds(Long numReservedIds) |
static FixedUnigramCandidateSampler.Options | numShards(Long numShards) |
Output<Long> | sampledCandidates() A vector of length num_sampled, in which each element is the ID of a sampled candidate. |
Output<Float> | sampledExpectedCount() A vector of length num_sampled, for each sampled candidate representing the number of times the candidate is expected to occur in a batch of sampled candidates. |
static FixedUnigramCandidateSampler.Options | seed(Long seed) |
static FixedUnigramCandidateSampler.Options | seed2(Long seed2) |
static FixedUnigramCandidateSampler.Options | shard(Long shard) |
Output<Float> | trueExpectedCount() A batch_size * num_true matrix, representing the number of times each candidate is expected to occur in a batch of sampled candidates. |
static FixedUnigramCandidateSampler.Options | unigrams(List<Float> unigrams) |
static FixedUnigramCandidateSampler.Options | vocabFile(String vocabFile) |
Methods inherited from class PrimitiveOp: equals, hashCode, op, toString
public static FixedUnigramCandidateSampler create(Scope scope, Operand<Long> trueClasses, Long numTrue, Long numSampled, Boolean unique, Long rangeMax, FixedUnigramCandidateSampler.Options... options)
scope - current scope
trueClasses - A batch_size * num_true matrix, in which each row contains the
IDs of the num_true target_classes in the corresponding original label.
numTrue - Number of true labels per context.
numSampled - Number of candidates to randomly sample.
unique - If unique is true, we sample with rejection, so that all sampled
candidates in a batch are unique. This requires some approximation to
estimate the post-rejection sampling probabilities.
rangeMax - The sampler will sample integers from the interval [0, range_max).
options - carries optional attribute values
public static FixedUnigramCandidateSampler.Options vocabFile(String vocabFile)
vocabFile - Each valid line in this file (which should have a CSV-like format)
corresponds to a valid word ID. IDs are in sequential order, starting from
num_reserved_ids. The last entry in each line is expected to be a value
corresponding to the count or relative probability. Exactly one of vocab_file
and unigrams needs to be passed to this op.
public static FixedUnigramCandidateSampler.Options distortion(Float distortion)
distortion - The distortion is used to skew the unigram probability distribution.
Each weight is first raised to the power of the distortion before being added
to the internal unigram distribution. As a result, distortion = 1.0 gives regular
unigram sampling (as defined by the vocab file), and distortion = 0.0 gives
a uniform distribution.
public static FixedUnigramCandidateSampler.Options numReservedIds(Long numReservedIds)
numReservedIds - Optionally, users can reserve some IDs in the range
[0, num_reserved_ids). One use case is a special unknown-word token that is
assigned ID 0. Reserved IDs have a sampling probability of 0.
public static FixedUnigramCandidateSampler.Options numShards(Long numShards)
numShards - A sampler can be used to sample from a subset of the original range
in order to speed up the whole computation through parallelism. This parameter
(together with 'shard') indicates the number of partitions that are being
used in the overall computation.
public static FixedUnigramCandidateSampler.Options shard(Long shard)
shard - A sampler can be used to sample from a subset of the original range
in order to speed up the whole computation through parallelism. This parameter
(together with 'num_shards') indicates the particular partition number of a
sampler op, when partitioning is being used.
public static FixedUnigramCandidateSampler.Options unigrams(List<Float> unigrams)
unigrams - A list of unigram counts or probabilities, one per ID in sequential
order. Exactly one of vocab_file and unigrams should be passed to this op.
public static FixedUnigramCandidateSampler.Options seed(Long seed)
seed - If either seed or seed2 is set to a non-zero value, the random number
generator is seeded by the given seed. Otherwise, it is seeded by a
random seed.
public static FixedUnigramCandidateSampler.Options seed2(Long seed2)
seed2 - A second seed to avoid seed collision.
public Output<Long> sampledCandidates()
public Output<Float> trueExpectedCount()
Copyright © 2022. All rights reserved.