Reference

Contents

Algorithms

DPMM.fit - Function.
fit(X::AbstractMatrix; algorithm=DEFAULT_ALGO, ncpu=1, T=3000, benchmark=false, scene=nothing, o...)

fit is the main function of DPMM.jl. It clusters a data matrix X whose columns are data points.

The output is the label of each data point.

The default clustering algorithm is SplitMergeAlgorithm.

Keywords:

  • ncpu=1 : the number of parallel workers

  • T=3000 : the number of iterations

  • benchmark=false : if true, returns the elapsed time

  • scene=nothing : plot scene for visualization; see setup_scene

  • o... : other keyword arguments specific to the algorithm

source
DPMMAlgorithm{P}

Abstract base class for algorithms

P stands for parallel.

Each subtype should provide the following methods:

  • AlgoType(X::AbstractMatrix; o...) : constructor
  • random_labels(X::AbstractMatrix,algo::AlgoType{P}) where P : random label generator
  • create_clusters(X::AbstractMatrix,algo::AlgoType{P},labels) where P : initial clusters
  • empty_cluster(algo::AlgoType) where P : an empty cluster (may be nothing)
  • run!(algo::AlgoType{P}, X, labels, clusters, emptycluster;o...) where P : run! modifies labels

Other generic functions are implemented on top of these core functions.
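
As a rough illustration of the interface above, the following self-contained sketch defines a toy hierarchy that mirrors the listed methods. Every Toy* name is hypothetical and not part of DPMM.jl, and the method bodies are placeholders rather than the package's implementation.

```julia
using Random

# Hypothetical stand-ins; none of these Toy* names exist in DPMM.jl.
abstract type ToyAlgorithm{P} end

struct ToyGibbs{P} <: ToyAlgorithm{P}
    ninit::Int   # number of initial clusters
    α::Float64   # concentration parameter
end

# Constructor from data, following the AlgoType(X; o...) pattern.
ToyGibbs(X::AbstractMatrix; ninit=1, α=1.0, parallel=false) =
    ToyGibbs{parallel}(ninit, α)

# One random label in 1:ninit per column (data point).
random_labels(X::AbstractMatrix, algo::ToyGibbs{P}) where P =
    rand(1:algo.ninit, size(X, 2))

# Initial clusters: here simply the column indices that share a label.
function create_clusters(X::AbstractMatrix, algo::ToyGibbs{P}, labels) where P
    clusters = Dict{Int,Vector{Int}}()
    for (i, k) in enumerate(labels)
        push!(get!(clusters, k, Int[]), i)
    end
    return clusters
end

# An empty cluster may be `nothing`, as the list above allows.
empty_cluster(algo::ToyGibbs) = nothing

# run! mutates labels in place; this placeholder only reshuffles them.
function run!(algo::ToyGibbs{P}, X, labels, clusters, emptycluster; T=10) where P
    for _ in 1:T
        shuffle!(labels)
    end
    return labels
end

X = randn(2, 20)                       # 20 data points as columns
algo = ToyGibbs(X; ninit=3)
labels = random_labels(X, algo)
clusters = create_clusters(X, algo, labels)
run!(algo, X, labels, clusters, empty_cluster(algo); T=5)
```

The parallel flag is carried in the type parameter P, so methods can specialize on it through dispatch instead of branching at run time.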

source
CollapsedAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = CollapsedAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. The quasi algorithm updates the clusters only at the end of each iteration. The parallel algorithm is valid for the quasi-collapsed algorithm only. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.

Provides the following methods:

  • CollapsedAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
  • random_labels(X::AbstractMatrix, algo::CollapsedAlgorithm) where P
  • create_clusters(X::AbstractMatrix, algo::CollapsedAlgorithm,labels) where P
  • empty_cluster(algo::CollapsedAlgorithm) where P : an empty cluster
  • run!(algo::CollapsedAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source
DirectAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = DirectAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. The quasi algorithm uses cluster population proportions as cluster weights, so it does not sample mixture weights from a Dirichlet distribution. For large N, this is very similar to the non-quasi sampler. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
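
The difference between the two weighting schemes can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's code: the function names and the counts are hypothetical, and the weight of a potential new cluster (governed by α) is left out for brevity.

```julia
using Random

# Quasi variant: weights are the cluster population proportions.
quasi_weights(counts::Vector{Int}) = counts ./ sum(counts)

# Non-quasi variant: one draw from Dirichlet(counts), built from Gamma draws.
# For an integer shape n, Gamma(n, 1) is a sum of n Exp(1) variables.
function dirichlet_weights(rng::AbstractRNG, counts::Vector{Int})
    g = [sum(randexp(rng, n)) for n in counts]
    return g ./ sum(g)
end

rng = MersenneTwister(1)
counts = [5000, 3000, 2000]           # hypothetical cluster populations
πq = quasi_weights(counts)            # exact proportions
πd = dirichlet_weights(rng, counts)   # random, but close to πq for large N
```

With N = 10000 points the standard deviation of each Dirichlet weight is about 0.005, which is why the two samplers behave alike on large data.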

Provides the following methods:

  • DirectAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
  • random_labels(X::AbstractMatrix, algo::DirectAlgorithm) where P
  • create_clusters(X::AbstractMatrix, algo::DirectAlgorithm,labels) where P
  • empty_cluster(algo::DirectAlgorithm) where P : an empty cluster
  • run!(algo::DirectAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source
SplitMergeAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = SplitMergeAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. With M=false the algorithm does no merge moves at all, so it is not exact. However, empirical results show that merge moves are very unlikely. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.

Provides the following methods:

  • SplitMergeAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
  • random_labels(X::AbstractMatrix, algo::SplitMergeAlgorithm) where P
  • create_clusters(X::AbstractMatrix, algo::SplitMergeAlgorithm,labels) where P
  • empty_cluster(algo::SplitMergeAlgorithm) where P : an empty cluster
  • run!(algo::SplitMergeAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source
DPMM.run! - Function.
run!(algo::DPMMAlgorithm, X, labels, clusters, emptycluster;o...)

Runs the specified Gibbs algorithm. Available algorithms are:

  • CollapsedAlgorithm
  • DirectAlgorithm
  • SplitMergeAlgorithm
source
DPMM.setup_workers - Function.
setup_workers(ncpu::Integer)

Sets up parallel processes and initializes the required modules.

source
initialize_clusters(X::AbstractMatrix, algo::DPMMAlgorithm{P})

Initializes clusters and labels, and sends the related data to the workers if the algorithm is parallel.

source

Algorithms (Internal)

DPMM.random_labels - Function.
random_labels(X::AbstractMatrix, algo::DPMMAlgorithm)

Generates random labels for the data. algo.ninit specifies the number of clusters.

source
DPMM.create_clusters - Function.
create_clusters(X,algo::CollapsedAlgorithm,labels)

Generates clusters for the data from the given labels. algo.ninit specifies the number of clusters.

source
DPMM.empty_cluster - Function.
empty_cluster(X,algo::CollapsedAlgorithm,labels)

Generates an empty cluster (0 data points).

source
RestrictedClusterProbs(πs::AbstractVector{V}, clusters::Dict,  x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to each cluster.

source
DPMM.CRPprobs - Function.
CRPprobs(clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real

Returns the Chinese Restaurant Process probabilities of a data point belonging to each existing cluster or to a new cluster.
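
The computation can be sketched in log space as follows. This is a minimal illustration, not the body of CRPprobs: the function name and arguments are hypothetical, and weighting the new cluster by the concentration parameter α follows the standard Chinese Restaurant Process rule rather than anything shown in this reference.

```julia
# Normalized CRP assignment probabilities, computed in log space.
# counts[k]  : population of existing cluster k
# loglik[k]  : log-likelihood (up to a shared constant) of x under cluster k
# loglik_new : the same quantity under an empty cluster
function crp_probs(counts::Vector{Int}, loglik::Vector{Float64},
                   loglik_new::Float64, α::Float64)
    # log(population) + log-likelihood per existing cluster, then the new cluster
    logp = vcat(log.(counts) .+ loglik, log(α) + loglik_new)
    logp .-= maximum(logp)        # guard against underflow in exp
    p = exp.(logp)
    return p ./ sum(p)
end

# With equal likelihoods the probabilities are proportional to (3, 1, α = 1).
p = crp_probs([3, 1], [0.0, 0.0], 0.0, 1.0)
```

The last entry of the returned vector is the probability of opening a new cluster.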

source
DPMM.SampleSubCluster - Function.
SampleSubCluster(πs::Vector{V}, cluster::SplitMergeCluster, x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to the right or left subcluster.

source
DPMM.ClusterProbs - Function.
ClusterProbs(πs::AbstractVector{V}, clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to each existing cluster or to a new cluster.

source
DPMM.place_x! - Function.
place_x!(model::AbstractDPModel,clusters::Dict,knew::Int,xi::AbstractVector)

Places a data point into its new cluster. This modifies clusters.
source
DPMM.label_x - Function.
label_x(clusters::Dict,knew::Int)

Returns the new cluster number for a data point.
source
DPMM.logmixture_πs - Function.
logmixture_πs(α::V, clusters::Dict{<:Integer, <:AbstractCluster}) where V<:Real

Samples log mixture weights from a Dirichlet distribution.

source

Distributions

NormalInverseWishart{T<:Real,S<:AbstractPDMat} <: ContinuousUnivariateDistribution

The Normal Inverse Wishart distribution is the prior for the MvNormalFast distribution.

see MvNormalFast

source
MvNormalFast{T<:Real,Prec<:AbstractPDMat,Mean<:AbstractVector} <: AbstractMvNormal

The normal distribution is redefined for the purpose of fast likelihood calculations.

It uses the μ (mean), J (precision) parametrization.
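
To see why a precision parametrization helps, consider the following sketch of a Gaussian log-density written directly in terms of J. It is an illustration using only the standard library; logpdf_precision is a hypothetical name, not a DPMM.jl function.

```julia
using LinearAlgebra

# Log-density of N(μ, J⁻¹) written directly in terms of the precision J,
# so no matrix inverse or linear solve is needed per data point.
function logpdf_precision(μ::AbstractVector, J::AbstractMatrix, x::AbstractVector)
    d = length(μ)
    r = x .- μ
    return 0.5 * (logdet(J) - d * log(2π) - dot(r, J, r))
end

μ = [0.0, 0.0]
J = [2.0 0.0; 0.0 2.0]            # precision matrix, i.e. covariance 0.5 * I
lp = logpdf_precision(μ, J, [0.0, 0.0])
```

Since logdet(J) depends only on the distribution, it can be computed once and reused, leaving a single quadratic form per evaluation.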

source
DirichletFast{T<:Real} <:  ContinuousMultivariateDistribution

Dirichlet distribution used as a prior for multinomial parameters.

The difference between DirichletFast and Dirichlet is that randn returns a MultinomialFast distribution in DirichletFast.

It also never calculates the normalization constant, so its constructor is faster than that of Dirichlet.

see MultinomialFast

source
MultinomialFast{T<:Real} <:  ContinuousMultivariateDistribution

The multinomial distribution is redefined for the purpose of fast likelihood calculations on DPSparseVector.

The other difference between MultinomialFast and Multinomial is that n, the number of trials, is not set. It is calculated from the input vector in the pdf function, so it can produce a pdf for any discrete vector x.
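
The idea of inferring the trial count from the input can be sketched as below. This is a minimal stand-alone illustration, not the MultinomialFast implementation; logfact and multinomial_logpdf are hypothetical helper names.

```julia
# Multinomial log-pmf where the trial count n is taken from the input itself.
# logfact uses a plain sum of logs to stay within Base.
logfact(k::Integer) = sum(log, 1:k; init=0.0)

function multinomial_logpdf(logp::Vector{Float64}, x::Vector{Int})
    n = sum(x)                                  # trial count inferred from x
    return logfact(n) - sum(logfact, x) + sum(x .* logp)
end

logp = log.([0.5, 0.25, 0.25])
a = multinomial_logpdf(logp, [1, 1, 0])         # n = 2
b = multinomial_logpdf(logp, [2, 1, 1])         # n = 4, same parameters
```

The same parameter vector therefore scores count vectors of different totals, which suits documents of varying length.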

source

Clusters

AbstractCluster

Abstract base class for clusters

Each subtype should provide the following methods:

  • population(c): population of the cluster
  • isempty(m::AbstractCluster): checks whether the cluster is empty
  • logαpdf(c,x) : log(∝likelihood) of a data point
  • lognαpdf(c,x): log(population) + logαpdf(c,x) for a data point (used in CRP calculations)
  • ClusterType(m::AbstractDPModel,X::AbstractArray) : constructor (X is the data as columns)
  • ClusterType(m::AbstractDPModel,s::SufficientStats): constructor

Other generic functions are implemented on top of these core functions.

source
DPMM.lognαpdf - Function.
lognαpdf(m::AbstractCluster,x::AbstractArray)

log(population) + log(∝likelihood) of a data point given by a cluster.

source
DPMM.logαpdf - Function.
logαpdf(m::AbstractCluster,x::AbstractArray)

log(∝likelihood) of a data point given by a cluster.

source
DPMM.population - Function.
population(m::AbstractCluster)

Number of data points in a cluster

source
CollapsedCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster

The CollapsedCluster is designed for Collapsed Gibbs algorithms.

CollapsedCluster has the following fields:

  • n : population
  • predictive : predictive distribution
  • prior : prior distribution

A CollapsedCluster is constructed via SufficientStats or data points:

CollapsedCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
CollapsedCluster(m::AbstractDPModel, s::SufficientStats)

There is also a generic (not specific to CollapsedCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})

There are also specific methods defined for creating clusters for whole data as a dictionary:

CollapsedClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
CollapsedClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})

The - and + operations are defined for removing a data point from the cluster and adding a data point to it:

-(c::CollapsedCluster, x::AbstractVector)
+(c::CollapsedCluster, x::AbstractVector)
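
A sketch of how such operators can work on running statistics follows. ToyStats is a hypothetical type for illustration only; the fields of the real CollapsedCluster are listed above and its operators may differ.

```julia
using LinearAlgebra

# Hypothetical running statistics of a Gaussian cluster (not a DPMM.jl type).
struct ToyStats
    n::Int                 # population
    s::Vector{Float64}     # sum of data points
    S::Matrix{Float64}     # sum of outer products x * x'
end

ToyStats(d::Int) = ToyStats(0, zeros(d), zeros(d, d))

# Adding or removing one point only touches the running sums.
Base.:+(c::ToyStats, x::AbstractVector) = ToyStats(c.n + 1, c.s + x, c.S + x * x')
Base.:-(c::ToyStats, x::AbstractVector) = ToyStats(c.n - 1, c.s - x, c.S - x * x')

c = (ToyStats(2) + [1.0, 2.0]) + [3.0, 4.0]
c2 = c - [1.0, 2.0]        # same statistics as a cluster holding only [3.0, 4.0]
```

In a collapsed Gibbs sweep this lets a point be taken out of its cluster and placed into another without revisiting the rest of the data.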

see AbstractCluster for generic functions for all Cluster types.

source
DirectCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster

The DirectCluster is designed for Direct Gibbs algorithms.

DirectCluster has the following fields:

  • n : population
  • sampled : sampled parameter distribution
  • prior : prior distribution

A DirectCluster is constructed via SufficientStats or data points:

DirectCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
DirectCluster(m::AbstractDPModel,s::SufficientStats)

There is also a generic (not specific to DirectCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})

There are also specific methods defined for creating clusters for whole data as a dictionary:

DirectClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
DirectClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})

see AbstractCluster for generic functions for all Cluster types.

source
SplitMergeCluster{Pred<:Distribution, Post<:Distribution, Prior<:Distribution} <: AbstractCluster

The SplitMergeCluster is designed for Split-Merge Gibbs algorithm.

SplitMergeCluster has the following fields:

  • n : population
  • nr : right subcluster population
  • nl : left subcluster population
  • sampled : sampled parameter distribution
  • right : right subcluster sampled parameter distribution
  • left : left subcluster sampled parameter distribution
  • post : posterior distributions
  • rightpost : right subcluster posterior distributions
  • leftpost : left subcluster posterior distributions
  • prior : prior distribution
  • llhs : log marginal likelihoods assigned by the cluster, right subcluster, and left subcluster
  • llh_hist : right + left log marginal likelihood history over 4 iterations

A SplitMergeCluster is constructed via SufficientStats or data points:

SplitMergeCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
SplitMergeCluster(m::AbstractDPModel,s::SufficientStats)

There is also a generic SuffStats method for getting the sufficient statistics of the whole data:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractVector{Tuple{Int,Bool}})

There are also specific methods defined for creating clusters for whole data:

SplitMergeClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractVector{Tuple{Int,Bool}})

see AbstractCluster for generic functions for all Cluster types.

The logαpdf and lognαpdf generic functions are extended for subcluster likelihoods.

logαpdf(m::SplitMergeCluster,x,::Val{false}) # right subcluster likelihood
logαpdf(m::SplitMergeCluster,x,::Val{true})  # left subcluster likelihood
lognαpdf(m::SplitMergeCluster, x, ::Val{false})  = log(population(m,Val(false))) + logαpdf(m, x, Val(false))
lognαpdf(m::SplitMergeCluster, x, ::Val{true})   = log(population(m,Val(true))) + logαpdf(m, x, Val(true))
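
The Val-based subcluster selection used in the listing above, together with the Tuple{Int,Bool} labels, can be illustrated with a toy type. ToySplitCluster and toy_clusters are hypothetical names, and the sketch tracks populations only.

```julia
# Hypothetical cluster that stores only populations (not DPMM.jl's type).
struct ToySplitCluster
    n::Int    # total population
    nr::Int   # right subcluster population
    nl::Int   # left subcluster population
end

population(c::ToySplitCluster) = c.n
population(c::ToySplitCluster, ::Val{false}) = c.nr   # right subcluster
population(c::ToySplitCluster, ::Val{true})  = c.nl   # left subcluster

# Build clusters from (cluster, subcluster) labels; false = right, true = left,
# following the Val{false}/Val{true} convention in the listing above.
function toy_clusters(labels::AbstractVector{Tuple{Int,Bool}})
    counts = Dict{Int,Tuple{Int,Int}}()          # k => (nr, nl)
    for (k, isleft) in labels
        nr, nl = get(counts, k, (0, 0))
        counts[k] = isleft ? (nr, nl + 1) : (nr + 1, nl)
    end
    return Dict(k => ToySplitCluster(nr + nl, nr, nl) for (k, (nr, nl)) in counts)
end

labels = [(1, false), (1, true), (1, true), (2, false)]
clusters = toy_clusters(labels)
```

Passing Val(false) or Val(true) selects the subcluster method at dispatch time, so the same generic function name covers the cluster and both of its halves.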
source

Models

AbstractDPModel{T,D}

Abstract base class for DPMMs

T stands for element type, D is for dimensionality of the data

source
DPMM.DPGMM - Type.

DPGMM{T<:Real,D} <: AbstractDPModel{T,D}

Class for DP Gaussian Mixture Models

source
DPMM.DPMNMM - Type.

DPMNMM{T<:Real,D} <: AbstractDPModel{T,D}

Class for DP Multinomial Mixture Models

source
DPMM.DPGMMStats - Type.
DPGMMStats{T<:Real} <: SufficientStats

Sufficient statistics for Gaussian Models
source

DPMNMMStats{T<:Real} <: SufficientStats

Sufficient statistics for Multinomial Models

source

Data

DPMM.setup_scene - Function.
setup_scene(X)

Initializes plots for visualizing 2D data.
source
DPMM.readNYTimes - Function.
readNYTimes(file::AbstractString)

Reads the NYTimes dataset from the given data file and returns a DPSparseMatrix.
source
DPMM.GridMixture - Function.
GridMixture(L::Integer; πs::Vector{T}=ones(L*L)/(L*L)) where T<:Real

Generates an LxL grid of Gaussians.
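
What sampling from such a grid mixture looks like can be sketched with the standard library. This does not reproduce GridMixture itself: grid_means and sample_grid are hypothetical names, and the unit grid spacing and the noise scale σ are assumptions made for the example.

```julia
using Random

# Means of an L×L grid of 2D Gaussians with unit spacing (spacing is an assumption).
grid_means(L::Integer) = [Float64[i, j] for i in 1:L for j in 1:L]

# Draw N points: pick a component by weight, then add isotropic noise.
function sample_grid(rng::AbstractRNG, L::Integer, N::Integer;
                     πs::Vector{Float64}=ones(L * L) / (L * L), σ::Float64=0.1)
    μs = grid_means(L)
    cdf = cumsum(πs)
    X = Matrix{Float64}(undef, 2, N)
    z = Vector{Int}(undef, N)
    for n in 1:N
        # inverse-CDF draw of the component index, clamped against rounding
        z[n] = min(searchsortedfirst(cdf, rand(rng)), L * L)
        X[:, n] = μs[z[n]] .+ σ .* randn(rng, 2)
    end
    return X, z
end

X, z = sample_grid(MersenneTwister(0), 3, 200)
```

The result stores data points as columns, which matches the layout that fit expects, and z holds the generating component of each point for comparison with fitted labels.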

source
DPMM.RandMixture - Function.
RandMixture(K::Integer;D::Int=2,πs::Vector{T}=ones(K)/K) where T<:Real

Randomly generates K Gaussians.

source
DPSparseMatrix(X::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}

DPSparseMatrix has fast getindex methods for column indexing (i.e. X[:,i]). It also does not copy the column; column indexing returns a DPSparseVector.
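
The benefit of copy-free column access can be illustrated with the SparseArrays standard library. The internals of DPSparseMatrix are not shown in this reference, so the sketch below works on a plain SparseMatrixCSC, and column_dot is a hypothetical helper.

```julia
using SparseArrays

# Visit the stored entries of column j of a CSC matrix without allocating a copy.
function column_dot(X::SparseMatrixCSC, j::Integer, w::AbstractVector)
    rows, vals = rowvals(X), nonzeros(X)
    acc = zero(promote_type(eltype(X), eltype(w)))
    for p in nzrange(X, j)          # positions of column j in the CSC buffers
        acc += vals[p] * w[rows[p]]
    end
    return acc
end

# Column 1 stores rows 1 and 3; column 2 stores row 2.
X = sparse([1, 3, 2], [1, 1, 2], [2.0, 4.0, 5.0], 3, 2)
w = [1.0, 10.0, 100.0]
d1 = column_dot(X, 1, w)            # 2.0 * 1.0 + 4.0 * 100.0
d2 = column_dot(X, 2, w)            # 5.0 * 10.0
```

By contrast, X[:, j] on a SparseMatrixCSC allocates a new SparseVector on every call, which adds up when each Gibbs iteration touches every column.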

see DPSparseVector

source
DPSparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}

DPSparseVector is almost the same as SparseArrays.SparseVector.

The only difference is that the sum of DPSparseVectors results in a Vector.

source

Function Index