Sequence Distribution

The sequence distribution models independent and identically distributed (iid) sequences of observations of varying lengths. The distribution of the sequence lengths can also be modeled.

Assume \(x_i = (x_{i,1}, \dots, x_{i,n_i})\) is a sequence of length \(n_i\) with entries of data type T. The sequence distribution models each entry \(x_{i,j}\) with a distribution compatible with type T data, \(g(x_{i,j} \vert \theta)\). The sequence lengths \(n_i\) are modeled with a distribution on the integers, \(h(n_i \vert \phi)\). The likelihood for a set of observed sequences \(X=([x_{1,1}, \dots, x_{1, n_1}], \dots, [x_{N, 1}, \dots, x_{N, n_N}])\) is

\[f(X) = \prod_{i=1}^{N} h(n_i \vert \phi) \prod_{j=1}^{n_i} g(x_{i, j} \vert \theta).\]
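As a concrete illustration of the likelihood above, the following minimal sketch evaluates the log-likelihood of a few sequences in plain Python. It does not use the dmx API. The Gaussian base density, the Poisson length distribution, and all function names are assumptions chosen only for this example.

```python
import math

def gaussian_log_density(x, mu, sigma2):
    """Log of an illustrative base density g(x | theta), theta = (mu, sigma2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)

def poisson_log_pmf(n, lam):
    """Log of an illustrative length distribution h(n | phi), phi = lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def sequence_log_likelihood(seqs, mu, sigma2, lam):
    """log f(X) = sum_i [ log h(n_i | phi) + sum_j log g(x_ij | theta) ]."""
    total = 0.0
    for seq in seqs:
        total += poisson_log_pmf(len(seq), lam)  # length term
        total += sum(gaussian_log_density(x, mu, sigma2) for x in seq)  # iid entries
    return total

# Three observed sequences of different lengths (made-up data).
X = [[0.1, -0.4, 1.2], [0.7], [-1.0, 0.3]]
ll = sequence_log_likelihood(X, mu=0.0, sigma2=1.0, lam=2.0)
```

Because the sequences are independent, the log-likelihood of the whole set is the sum of the per-sequence log-likelihoods, which is why the loop simply accumulates one length term and one entry term per sequence.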

SequenceDistribution

class dmx.stats.sequence.SequenceDistribution(dist, len_dist=NullDistribution(name=None), len_normalized=False, name=None, keys=None)

SequenceDistribution object for a sequence of iid observations from a distribution of data type T.

dist

Base distribution of sequence (compatible with T).

Type:

SequenceEncodableProbabilityDistribution

len_dist

Length distribution for modeling lengths of sequences of observations (compatible with type int). Set to NullDistribution if None is passed.

Type:

Optional[SequenceEncodableProbabilityDistribution]

len_normalized

If True, take geometric mean density for any density evaluation.

Type:

Optional[bool]

name

Name of the SequenceDistribution instance.

Type:

Optional[str]

null_len_dist

True if ‘len_dist’ is an instance of NullDistribution.

Type:

bool

keys

Key for parameters of sequence distribution.

Type:

Optional[str]

__init__(dist, len_dist=NullDistribution(name=None), len_normalized=False, name=None, keys=None)

SequenceDistribution object.

Parameters:
  • dist (SequenceEncodableProbabilityDistribution) – Set base distribution of sequence (compatible with T).

  • len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Length distribution for modeling lengths of sequences of observations (compatible with type int).

  • len_normalized (Optional[bool]) – If True, take geometric mean density for any density evaluation.

  • name (Optional[str]) – Set name of the SequenceDistribution instance.

  • keys (Optional[str]) – Key for parameters of sequence distribution.

density(x)

Evaluate the density of SequenceDistribution at observed sequence x.

Parameters:

x (Sequence[T]) – Sequence of iid observations from base distribution of SequenceDistribution.

Returns:

Density evaluated at observation x.

Return type:

float

dist_to_encoder()

Create DataSequenceEncoder object for SequenceEncodableProbabilityDistribution instance.

Return type:

SequenceDataEncoder

Returns:

DataSequenceEncoder

estimator(pseudo_count=None)

Create a ParameterEstimator for corresponding SequenceEncodableProbabilityDistribution.

Parameters:

pseudo_count (Optional[float]) – Regularize sufficient statistics in estimation step.

Return type:

SequenceEstimator

Returns:

ParameterEstimator

log_density(x)

Evaluate the log-density of SequenceDistribution at observed sequence x.

Parameters:

x (Sequence[T]) – Sequence of iid observations from base distribution of SequenceDistribution.

Returns:

Log-density evaluated at observation x.

Return type:

float
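To make the relation between density, log-density, and the len_normalized option more concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the exponential base density and the function names are hypothetical, and the length term is left out entirely, on the assumption that no length distribution contributes to the density.

```python
import math

def exp_log_density(x, rate):
    """Log-density of an exponential base distribution (illustrative choice)."""
    return math.log(rate) - rate * x

def sequence_log_density(seq, rate, len_normalized=False):
    """Sum of per-entry log-densities, divided by the length if len_normalized."""
    ll = sum(exp_log_density(x, rate) for x in seq)
    if len_normalized and len(seq) > 0:
        # Dividing the log-density by n gives the log of the geometric
        # mean of the n per-entry densities.
        ll /= len(seq)
    return ll

seq = [0.5, 1.5, 1.0]
log_d = sequence_log_density(seq, rate=1.0)
log_d_norm = sequence_log_density(seq, rate=1.0, len_normalized=True)
density = math.exp(log_d)  # the density is the exponential of the log-density
```

With a unit rate each entry contributes a log-density of -x, so the unnormalized value is the negative sum of the entries and the normalized value is the negative mean. Normalizing by length makes sequences of different lengths comparable on a per-observation scale.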

sampler(seed=None)

Create a DistributionSampler object for a given ProbabilityDistribution.

Parameters:

seed (Optional[int]) – Set seed for drawing samples from distribution.

Return type:

SequenceSampler

seq_log_density(x)

Vectorized evaluation of the log density.

Parameters:

x (EncodedDataSequence) – EncodedDataSequence for corresponding SequenceEncodableProbabilityDistribution.

Return type:

ndarray

Returns:

np.ndarray

SequenceEstimator

class dmx.stats.sequence.SequenceEstimator(estimator, len_estimator=<dmx.stats.null_dist.NullEstimator object>, len_dist=None, len_normalized=False, name=None, keys=None)

SequenceEstimator object for estimating SequenceDistribution from aggregated sufficient statistics.

Notes

Requires arg ‘estimator’ to be ParameterEstimator of data type T, compatible with the observed entry values of SequenceDistribution.

If arg ‘len_estimator’ is passed, it must be a ParameterEstimator object compatible with non-negative integers.

If len_estimator is NullEstimator() or None, len_dist is used as length distribution in estimation.

estimator

ParameterEstimator for base distribution.

Type:

ParameterEstimator

len_estimator

ParameterEstimator for length distribution. If None, set to NullEstimator.

Type:

Optional[ParameterEstimator]

len_dist

Set a fixed length distribution.

Type:

Optional[SequenceEncodableProbabilityDistribution]

len_normalized

Take geometric mean of density if True.

Type:

Optional[bool]

name

Name of SequenceEstimator instance.

Type:

Optional[str]

keys

Key for SequenceEstimator instance used in aggregating sufficient statistics.

Type:

Optional[str]

__init__(estimator, len_estimator=<dmx.stats.null_dist.NullEstimator object>, len_dist=None, len_normalized=False, name=None, keys=None)

SequenceEstimator object.

Parameters:
  • estimator (ParameterEstimator) – Set ParameterEstimator for base distribution.

  • len_estimator (Optional[ParameterEstimator]) – Set ParameterEstimator for length distribution.

  • len_dist (Optional[SequenceEncodableProbabilityDistribution]) – Set a fixed length distribution.

  • len_normalized (Optional[bool]) – Take geometric mean of density if True.

  • name (Optional[str]) – Set name of the SequenceEstimator instance.

  • keys (Optional[str]) – Set key to SequenceEstimator instance for merging sufficient statistics.

accumulator_factory()

Create SequenceEncodableStatisticAccumulator object.

Return type:

SequenceAccumulatorFactory

estimate(nobs, suff_stat)

Estimate SequenceEncodableProbabilityDistribution for sufficient statistics.

Parameters:
  • nobs (Optional[float]) – Weighted number of observations.

  • suff_stat (Tuple[int, np.ndarray, np.ndarray, np.ndarray]) – Aggregated sufficient statistics for the sequence distribution.

Return type:

SequenceDistribution

Returns:

SequenceEncodableProbabilityDistribution
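The idea of estimating from aggregated sufficient statistics can be sketched in plain Python as follows. This is not the SequenceEstimator implementation: the Gaussian base model, the Poisson length model, the layout of the statistics tuple, and the function names are all assumptions made for illustration.

```python
def accumulate(seqs):
    """Aggregate sufficient statistics over all sequences.

    Returns (base_stats, len_stats), where base_stats = (count, sum, sum_sq)
    pools every entry of every sequence and len_stats = (num_seqs, total_len).
    """
    count, s1, s2 = 0.0, 0.0, 0.0
    for seq in seqs:
        for x in seq:
            count += 1.0
            s1 += x
            s2 += x * x
    return (count, s1, s2), (float(len(seqs)), count)

def estimate(suff_stat):
    """Maximum-likelihood estimates from the aggregated statistics."""
    (count, s1, s2), (num_seqs, total_len) = suff_stat
    mu = s1 / count                 # Gaussian mean of the pooled entries
    sigma2 = s2 / count - mu * mu   # Gaussian variance of the pooled entries
    lam = total_len / num_seqs      # Poisson rate of the sequence lengths
    return mu, sigma2, lam

# Made-up data: three sequences holding six entries in total.
X = [[1.0, 3.0], [2.0], [0.0, 2.0, 4.0]]
mu, sigma2, lam = estimate(accumulate(X))
```

Because the entries are iid, the base model only needs statistics pooled across all sequences, while the length model only needs the sequence lengths. The two parts can therefore be estimated independently.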

SequenceSampler

class dmx.stats.sequence.SequenceSampler(dist, len_dist, seed=None)

SequenceSampler object for sampling from a SequenceDistribution instance.

dist

Base distribution for the sequences (data type T).

Type:

SequenceEncodableProbabilityDistribution

len_dist

Length distribution for the length of the sequences (support on positive integers).

Type:

SequenceEncodableProbabilityDistribution

rng

RandomState object for random sampling.

Type:

RandomState

dist_sampler

DistributionSampler instance from base distribution.

Type:

DistributionSampler

len_sampler

DistributionSampler instance from length distribution.

Type:

DistributionSampler

sample(size=None)

Generate iid samples from SequenceSampler object.

If size is None, the length ‘n’ of the sequence is sampled from ‘len_sampler’. Then ‘n’ iid samples are drawn from the base distribution using ‘dist_sampler’, and the resulting List[T] is returned.

If size > 0, the above is repeated size times and a list of size sequences (List[List[T]]) is returned.

Parameters:

size (Optional[int]) – Number of sequences to sample. If None, a single sequence is returned.

Return type:

List[Any]

Returns:

List[T] if size is None, otherwise List[List[T]] of length size.
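The two-stage sampling procedure described for sample() can be sketched in plain Python as below. This is an illustration rather than the SequenceSampler code: the helper names, the uniform length sampler, and the Gaussian base sampler are hypothetical stand-ins for len_sampler and dist_sampler.

```python
import random

def sample_sequences(draw_length, draw_value, size=None):
    """Draw one sequence (size=None) or a list of `size` sequences."""
    def one_sequence():
        n = draw_length()                        # first sample the length n
        return [draw_value() for _ in range(n)]  # then n iid base draws
    if size is None:
        return one_sequence()
    return [one_sequence() for _ in range(size)]

rng = random.Random(0)  # seeded generator for reproducible draws

def draw_length():
    """Stand-in length sampler: uniform on {1, ..., 5}."""
    return rng.randint(1, 5)

def draw_value():
    """Stand-in base sampler: standard normal draws."""
    return rng.gauss(0.0, 1.0)

single = sample_sequences(draw_length, draw_value)          # one sequence
batch = sample_sequences(draw_length, draw_value, size=3)   # three sequences
```

Here single is a flat list of values, while batch is a list of three lists whose lengths generally differ, mirroring the List[T] versus List[List[T]] return shapes.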