Reference

Contents

Algorithms
Algorithms (Internal)
Distributions
Clusters
Models
Data
Function Index

Algorithms

DPMM.fit — Function.

fit(X::AbstractMatrix; algorithm=DEFAULT_ALGO, ncpu=1, T=3000, benchmark=false, scene=nothing, o...)

fit is the main function of DPMM.jl. It clusters the given data matrix, whose columns are data points.

The output is a label for each data point.

The default clustering algorithm is SplitMergeAlgorithm.

Keywords:

ncpu=1 : the number of parallel workers
T=3000 : the number of iterations
benchmark=false : if true, returns the elapsed time
scene=nothing : plot scene for visualization; see setup_scene
o... : other keyword arguments specific to the algorithm
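
As a minimal sketch of the input layout described above, the following builds a 2×N matrix whose columns are data points. It uses only the Julia standard library; the blob centers and sizes are made-up values for illustration, and the commented-out call simply mirrors the signature documented here.

```julia
using Random

Random.seed!(42)

# Two illustrative 2-D blobs (centers and sizes are arbitrary assumptions).
n1, n2 = 100, 150
blob1 = randn(2, n1) .+ [-5.0, 0.0]   # each column is one data point
blob2 = randn(2, n2) .+ [5.0, 0.0]

# fit expects a D×N matrix: D features in rows, N data points in columns.
X = hcat(blob1, blob2)

# With DPMM.jl loaded, clustering would follow the documented signature:
# labels = fit(X; algorithm = SplitMergeAlgorithm, ncpu = 1, T = 1000)
```

If the data is stored with one observation per row, it needs to be transposed before being passed in.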

source

DPMM.DPMMAlgorithm — Type.

DPMMAlgorithm{P}

Abstract base class for algorithms

P stands for parallel.

Each subtype should provide the following methods:

AlgoType(X::AbstractMatrix; o...) : constructor
random_labels(X::AbstractMatrix,algo::AlgoType{P}) where P : random label generator
create_clusters(X::AbstractMatrix,algo::AlgoType{P},labels) where P : initial clusters
empty_cluster(algo::AlgoType) where P : an empty cluster (may be nothing)
run!(algo::AlgoType{P}, X, labels, clusters, emptycluster;o...) where P : run! modifies labels

Other generic functions are implemented on top of these core functions.
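
The interface pattern above (an abstract type with a parallel flag P, plus a few core methods per subtype) can be sketched as follows. All names here (ToyAlgorithm, ToyGibbs, isparallel) are hypothetical and are not part of DPMM.jl; the sketch only illustrates how generic code can dispatch on P.

```julia
# Hypothetical names throughout; this is not DPMM.jl's code.
abstract type ToyAlgorithm{P} end          # P is a Bool type parameter: parallel or not

struct ToyGibbs{P} <: ToyAlgorithm{P}
    ninit::Int                             # number of initial clusters
end

# Constructor taking the data and keyword options, in the spirit of AlgoType(X; o...).
ToyGibbs(X::AbstractMatrix; ninit::Int = 1, parallel::Bool = false) =
    ToyGibbs{parallel}(ninit)

# A core method that each subtype provides.
random_labels(X::AbstractMatrix, algo::ToyGibbs) = rand(1:algo.ninit, size(X, 2))

# Generic code can read P without knowing the concrete subtype.
isparallel(::ToyAlgorithm{P}) where {P} = P

X = rand(2, 10)
algo = ToyGibbs(X; ninit = 3)
labels = random_labels(X, algo)
```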

source

DPMM.CollapsedAlgorithm — Type.

CollapsedAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = CollapsedAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. The quasi algorithm updates the clusters only at the end of each iteration. The parallel algorithm is valid only for the quasi-collapsed algorithm. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.

Provides following methods:

CollapsedAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::CollapsedAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::CollapsedAlgorithm,labels) where P
empty_cluster(algo::CollapsedAlgorithm) where P : an empty cluster
run!(algo::CollapsedAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source

DPMM.DirectAlgorithm — Type.

DirectAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = DirectAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. The quasi algorithm uses cluster population proportions as cluster weights, so it does not sample mixture weights from a Dirichlet distribution. For large N, this is very similar to the non-quasi sampler. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
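
To illustrate why the two choices are close for large N, here is a small standard-library sketch that compares population proportions with one Dirichlet draw. It is not DPMM.jl's sampler: the helper names are hypothetical, the concentration parameter α (the new-cluster mass) is left out, and the Gamma draws assume integer counts.

```julia
using Random

Random.seed!(1)

# Gamma(k, 1) draw for a positive integer k, as a sum of k Exp(1) draws.
gamma_int(k::Integer) = sum(randexp(k))

# One Dirichlet(counts) sample: normalize independent Gamma draws.
function dirichlet_sample(counts::AbstractVector{<:Integer})
    g = [gamma_int(k) for k in counts]
    return g ./ sum(g)
end

counts = [5000, 3000, 2000]             # illustrative cluster populations
proportions = counts ./ sum(counts)     # quasi: use proportions directly
sampled = dirichlet_sample(counts)      # non-quasi: sample the weights

# Largest deviation between the sampled weights and the proportions.
gap = maximum(abs.(sampled .- proportions))
```

With populations in the thousands, the standard deviation of each sampled weight is on the order of 0.005, so gap is small.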

Provides following methods:

DirectAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::DirectAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::DirectAlgorithm,labels) where P
empty_cluster(algo::DirectAlgorithm) where P : an empty cluster
run!(algo::DirectAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source

DPMM.SplitMergeAlgorithm — Type.

SplitMergeAlgorithm{P,Q} <: DPMMAlgorithm{P}

Run it by:

labels = fit(X; algorithm = SplitMergeAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)

P stands for parallel, Q stands for quasi. With M=false, the algorithm performs no merge moves at all, so it is not exact. However, empirical results show that merge moves are very unlikely. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.

Provides following methods:

SplitMergeAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
random_labels(X::AbstractMatrix, algo::SplitMergeAlgorithm) where P
create_clusters(X::AbstractMatrix, algo::SplitMergeAlgorithm,labels) where P
empty_cluster(algo::SplitMergeAlgorithm) where P : an empty cluster
run!(algo::SplitMergeAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}

Other generic functions are implemented on top of these core functions.

source

DPMM.run! — Function.

run!(algo::DPMMAlgorithm, X, labels, clusters, emptycluster;o...)

Runs the specified Gibbs algorithm. Available algorithms are:

CollapsedAlgorithm
DirectAlgorithm
SplitMergeAlgorithm

source

DPMM.setup_workers — Function.

setup_workers(ncpu::Integer)

Sets up the parallel processes and initializes the required modules.

source

DPMM.initialize_clusters — Function.

initialize_clusters(X::AbstractMatrix, algo::DPMMAlgorithm{P})

Initializes clusters and labels, and sends the related data to the workers if the algorithm is parallel.

source

Algorithms (Internal)

DPMM.random_labels — Function.

random_labels(X::AbstractMatrix, algo::DPMMAlgorithm)

Random label generator for the data. algo.ninit specifies the number of clusters.

source

DPMM.create_clusters — Function.

create_clusters(X,algo::CollapsedAlgorithm,labels)

Generates clusters from the labels of the data. algo.ninit specifies the number of clusters.
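
The idea of turning a label vector into per-cluster groups can be sketched with the standard library alone. The helper group_by_label below is hypothetical and only collects column indices; DPMM.jl's create_clusters builds cluster objects instead.

```julia
# Hypothetical helper: collect the column indices that share a label.
function group_by_label(labels::AbstractVector{Int})
    groups = Dict{Int,Vector{Int}}()
    for (i, k) in enumerate(labels)
        push!(get!(groups, k, Int[]), i)
    end
    return groups
end

X = [1.0 2.0 10.0 11.0;
     1.0 2.0 10.0 11.0]        # 2×4 data, columns are points
labels = [1, 1, 2, 2]

groups = group_by_label(labels)

# Per-cluster column means, computed from the grouped columns.
centers = Dict(k => vec(sum(X[:, idx], dims = 2)) ./ length(idx) for (k, idx) in groups)
```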

source

DPMM.empty_cluster — Function.

empty_cluster(algo::CollapsedAlgorithm)

Generates an empty cluster (one with 0 data points).

source

DPMM.RestrictedClusterProbs — Function.

RestrictedClusterProbs(πs::AbstractVector{V}, clusters::Dict,  x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to each cluster.

source

DPMM.CRPprobs — Function.

CRPprobs(clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real

Returns the Chinese Restaurant Process probabilities of a data point belonging to each existing cluster or to a new cluster.
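
As a rough illustration, the standard CRP weighting (population times likelihood for existing clusters, α times likelihood for a new one) can be computed in log space and then normalized. The function names and inputs below are hypothetical, and the sketch takes the log-likelihoods as given rather than computing them from cluster objects.

```julia
# Normalize log-weights into probabilities, subtracting the maximum for stability.
function normalize_logweights(logw::AbstractVector{<:Real})
    m = maximum(logw)
    w = exp.(logw .- m)
    return w ./ sum(w)
end

# Hypothetical inputs: populations, per-cluster log-likelihoods of x, and the
# log-likelihood of x under an empty cluster.
function crp_probs(ns::Vector{Int}, loglik::Vector{Float64}, loglik_new::Float64, α::Real)
    logw = log.(ns) .+ loglik            # log(population) + log-likelihood
    push!(logw, log(α) + loglik_new)     # last entry: the new cluster
    return normalize_logweights(logw)
end

# Equal likelihoods, so the result depends only on populations and α.
p = crp_probs([3, 1], [-1.0, -1.0], -1.0, 1.0)
```

Here p has one entry per existing cluster followed by one entry for the new cluster.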

source

DPMM.SampleSubCluster — Function.

SampleSubCluster(πs::Vector{V}, cluster::SplitMergeCluster, x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to the right or the left subcluster.

source

DPMM.ClusterProbs — Function.

ClusterProbs(πs::AbstractVector{V}, clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real

Returns the normalized probability vector of a data point belonging to each existing cluster or to a new cluster.

source

DPMM.place_x! — Function.

place_x!(model::AbstractDPModel,clusters::Dict,knew::Int,xi::AbstractVector)

Places a data point in its new cluster. This modifies clusters.

source

DPMM.label_x — Function.

label_x(clusters::Dict,knew::Int)

Returns the new cluster number for a data point.

source

DPMM.logmixture_πs — Function.

logmixture_πs(α::V, clusters::Dict{<:Integer, <:AbstractCluster}) where V<:Real

Samples log mixture weights from a Dirichlet distribution.

source

Distributions

DPMM.NormalInverseWishart — Type.

NormalInverseWishart{T<:Real,S<:AbstractPDMat} <: ContinuousUnivariateDistribution

The Normal-Inverse-Wishart distribution is the prior for the MvNormalFast distribution.

See MvNormalFast.

source

DPMM.MvNormalFast — Type.

MvNormalFast{T<:Real,Prec<:AbstractPDMat,Mean<:AbstractVector} <: AbstractMvNormal

The normal distribution is redefined for the purpose of fast likelihood calculations.

It uses μ(mean), J (precision) parametrization.
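
To show why a precision parametrization helps, here is a standard-library sketch of the multivariate normal log-density written in terms of μ and J. The function logpdf_precision is a hypothetical name, not DPMM.jl's implementation; the point is that no linear solve is needed per evaluation once J is stored.

```julia
using LinearAlgebra

# Log-density of a multivariate normal given mean μ and precision matrix J.
function logpdf_precision(μ::AbstractVector, J::AbstractMatrix, x::AbstractVector)
    d = x .- μ
    D = length(μ)
    return -0.5 * D * log(2π) + 0.5 * logdet(J) - 0.5 * dot(d, J, d)
end

μ = [0.0, 0.0]
Σ = [2.0 0.5; 0.5 1.0]
J = inv(Σ)                      # precision is the inverse covariance
x = [1.0, -1.0]

lp = logpdf_precision(μ, J, x)

# Same value from the covariance form, which needs a solve per evaluation.
lp_cov = -0.5 * 2 * log(2π) - 0.5 * logdet(Σ) - 0.5 * dot(x - μ, Σ \ (x - μ))
```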

source

DPMM.DirichletFast — Type.

DirichletFast{T<:Real} <:  ContinuousMultivariateDistribution

Dirichlet distribution as a prior to multinomial parameters.

The difference between DirichletFast and Dirichlet is that rand returns a MultinomialFast distribution for DirichletFast.

It also never calculates the normalization constant, so its constructor is faster than that of Dirichlet.

See MultinomialFast.

source

DPMM.MultinomialFast — Type.

MultinomialFast{T<:Real} <:  ContinuousMultivariateDistribution

The multinomial distribution is redefined for the purpose of fast likelihood calculations on DPSparseVector.

The other difference between MultinomialFast and Multinomial is that n, the number of trials, is not set. It is calculated from the input vector in the pdf function, so the distribution can produce a pdf for any discrete x vector.
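
The idea of reading the number of trials off the input can be sketched as follows. This is a plain multinomial log-pmf written with the standard library; the function names are hypothetical, and it includes the normalizing coefficient, which a proportional likelihood may omit.

```julia
# log(k!) via a sum of logs; adequate for the small counts used here.
logfact(k::Integer) = k < 2 ? 0.0 : sum(log, 2:k)

# Multinomial log-pmf where the number of trials is inferred from x.
function multinomial_logpdf(p::AbstractVector{<:Real}, x::AbstractVector{<:Integer})
    n = sum(x)                               # trials inferred from the input
    ll = logfact(n) - sum(logfact, x)        # log multinomial coefficient
    for (xk, pk) in zip(x, p)
        xk > 0 && (ll += xk * log(pk))       # skip zero counts so 0*log(0) never appears
    end
    return ll
end

p = [0.5, 0.5]
a = exp(multinomial_logpdf(p, [1, 1]))   # two trials
b = exp(multinomial_logpdf(p, [3, 0]))   # three trials, same distribution object
```

The same probability vector p serves inputs with different totals, which is the behavior described above.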

source

Clusters

DPMM.AbstractCluster — Type.

AbstractCluster

Abstract base class for clusters

Each subtype should provide the following methods:

population(c): population of the cluster
isempty(m::AbstractCluster): checks whether the cluster is empty
logαpdf(c,x) : log(∝likelihood) of a data point
lognαpdf(c,x): log(population) + logαpdf(c,x) for a data point (used in CRP calculations)
ClusterType(m::AbstractDPModel,X::AbstractArray) : constructor (X is the data as columns)
ClusterType(m::AbstractDPModel,s::SufficientStats): constructor

Other generic functions are implemented on top of these core functions.

source

DPMM.lognαpdf — Function.

lognαpdf(m::AbstractCluster,x::AbstractArray)

log(population) + log(∝likelihood) of a data point given by a cluster.

source

DPMM.logαpdf — Function.

logαpdf(m::AbstractCluster,x::AbstractArray)

log(∝likelihood) of a data point given by a cluster.

source

DPMM.population — Function.

population(m::AbstractCluster)

Number of data points in a cluster

source

DPMM.CollapsedCluster — Type.

CollapsedCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster

The CollapsedCluster is designed for Collapsed Gibbs algorithms.

CollapsedCluster has the following fields:

n : population
predictive : predictive distribution
prior : prior distribution

A CollapsedCluster is constructed from SufficientStats or from data points:

CollapsedCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
CollapsedCluster(m::AbstractDPModel, s::SufficientStats)

There is also a generic (not specific to CollapsedCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})

There are also specific methods for creating clusters for the whole data as a dictionary:

CollapsedClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
CollapsedClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})

The - and + operations are defined for removing a data point from the cluster and adding a data point to it:

-(c::CollapsedCluster, x::AbstractVector)
+(c::CollapsedCluster, x::AbstractVector)
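
A minimal sketch of why such operators are cheap: with additive sufficient statistics, adding or removing one point is a constant-size update. ToyStats below is a hypothetical type holding a count, a sum, and a sum of outer products; it is not DPMM.jl's SufficientStats.

```julia
# Hypothetical sufficient statistics for Gaussian data.
struct ToyStats
    n::Int                 # number of points
    s::Vector{Float64}     # sum of points
    S::Matrix{Float64}     # sum of outer products
end

ToyStats(D::Int) = ToyStats(0, zeros(D), zeros(D, D))

# Adding or removing one point updates the statistics without revisiting the rest.
Base.:+(t::ToyStats, x::AbstractVector) = ToyStats(t.n + 1, t.s .+ x, t.S .+ x * x')
Base.:-(t::ToyStats, x::AbstractVector) = ToyStats(t.n - 1, t.s .- x, t.S .- x * x')

t0 = ToyStats(2) + [1.0, 2.0]
t1 = t0 + [3.0, 4.0]
t2 = t1 - [3.0, 4.0]      # back to the state of t0
```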

See AbstractCluster for the generic functions shared by all cluster types.

source

DPMM.DirectCluster — Type.

DirectCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster

The DirectCluster is designed for Direct Gibbs algorithms.

DirectCluster has the following fields:

n : population
sampled : sampled parameter distribution
prior : prior distribution

A DirectCluster is constructed from SufficientStats or from data points:

DirectCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
DirectCluster(m::AbstractDPModel,s::SufficientStats)

There is also a generic (not specific to DirectCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})

There are also specific methods for creating clusters for the whole data as a dictionary:

DirectClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
DirectClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})

See AbstractCluster for the generic functions shared by all cluster types.

source

DPMM.SplitMergeCluster — Type.

SplitMergeCluster{Pred<:Distribution, Post<:Distribution, Prior<:Distribution} <: AbstractCluster

The SplitMergeCluster is designed for the Split-Merge Gibbs algorithm.

SplitMergeCluster has the following fields:

n : population
nr : right subcluster population
nl : left subcluster population
sampled : sampled parameter distribution
right : right subcluster sampled parameter distribution
left : left subcluster sampled parameter distribution
post : posterior distributions
rightpost : right subcluster posterior distributions
leftpost : left subcluster posterior distributions
prior : prior distribution
llhs : log marginal likelihoods assigned by the cluster, the right subcluster, and the left subcluster
llh_hist : history of the right + left log marginal likelihood over 4 iterations

A SplitMergeCluster is constructed from SufficientStats or from data points:

SplitMergeCluster(m::AbstractDPModel,X::AbstractArray) # X is the data as columns
SplitMergeCluster(m::AbstractDPModel,s::SufficientStats)

There is also a generic SuffStats method for getting the sufficient statistics of the whole data:

SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractVector{Tuple{Int,Bool}})

There are also specific methods for creating clusters for the whole data:

SplitMergeClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractVector{Tuple{Int,Bool}})

See AbstractCluster for the generic functions shared by all cluster types.

The logαpdf and lognαpdf generic functions are extended for subcluster likelihoods.

logαpdf(m::SplitMergeCluster,x,::Val{false}) # right subcluster likelihood
logαpdf(m::SplitMergeCluster,x,::Val{true})  # left subcluster likelihood
lognαpdf(m::SplitMergeCluster, x, ::Val{false})  = log(population(m,Val(false))) + logαpdf(m, x, Val(false))
lognαpdf(m::SplitMergeCluster, x, ::Val{true})   = log(population(m,Val(true))) + logαpdf(m, x, Val(true))
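
The Val-based selection of a subcluster can be sketched with a toy type. ToySplitCluster is hypothetical and stores only the two subcluster populations; the sketch follows the convention in the signatures above (Val{false} for right, Val{true} for left) and assumes the total population is the sum of the two.

```julia
# Hypothetical cluster holding only its subcluster populations.
struct ToySplitCluster
    nr::Int    # right subcluster population
    nl::Int    # left subcluster population
end

population(c::ToySplitCluster) = c.nr + c.nl          # assumption: n = nr + nl
population(c::ToySplitCluster, ::Val{false}) = c.nr   # right
population(c::ToySplitCluster, ::Val{true})  = c.nl   # left

# A label pairs a cluster id with a Bool that picks the subcluster.
label = (7, true)
c = ToySplitCluster(4, 6)
n_sub = population(c, Val(label[2]))
```

Wrapping the Bool in Val lets the method be chosen by dispatch rather than by a branch inside the function body.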

source

Models

DPMM.AbstractDPModel — Type.

AbstractDPModel{T,D}

Abstract base class for DPMMs

T stands for the element type and D for the dimensionality of the data.

source

DPMM.DPGMM — Type.

DPGMM{T<:Real,D} <: AbstractDPModel{T,D}

Class for DP Gaussian Mixture Models

source

DPMM.DPMNMM — Type.

DPMNMM{T<:Real,D} <: AbstractDPModel{T,D}

Class for DP Multinomial Mixture Models

source

DPMM.DPGMMStats — Type.

DPGMMStats{T<:Real} <: SufficientStats

Sufficient statistics for Gaussian Models

source

DPMM.DPMNMMStats — Type.

DPMNMMStats{T<:Real} <: SufficientStats

Sufficient statistics for Multinomial Models

source

Data

DPMM.setup_scene — Function.

setup_scene(X)

Initialize plots for visualizing 2D data

source

DPMM.readNYTimes — Function.

readNYTimes(file::AbstractString)

Reads the NYTimes dataset from the given data file. It returns a DPSparseMatrix.

source

DPMM.GridMixture — Function.

GridMixture(L::Integer; πs::Vector{T}=ones(L*L)/(L*L)) where T<:Real

Generates an L×L grid of Gaussians.

source

DPMM.RandMixture — Function.

RandMixture(K::Integer;D::Int=2,πs::Vector{T}=ones(K)/K) where T<:Real

Randomly generates K Gaussians.

source

DPMM.DPSparseMatrix — Type.

DPSparseMatrix(X::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}

DPSparseMatrix has fast getindex methods for column indexing (i.e., X[:,i]). It also does not copy the column: column indexing returns a DPSparseVector.

See DPSparseVector.
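
For comparison, copy-free access to one column of a plain SparseMatrixCSC can be sketched with the standard SparseArrays accessors nzrange, rowvals, and nonzeros. The helper names column_sum and column_dot are hypothetical, and this is not how DPSparseMatrix is implemented; it only illustrates reading a column's stored entries in place.

```julia
using SparseArrays

# 3×3 matrix with stored entries (1,1)=1, (3,1)=2, (2,2)=3, (3,3)=4.
X = sparse([1, 3, 2, 3], [1, 1, 2, 3], [1.0, 2.0, 3.0, 4.0], 3, 3)

# Sum the stored entries of column j by walking the CSC buffers directly,
# so no new vector is allocated for the column.
function column_sum(A::SparseMatrixCSC, j::Integer)
    vals = nonzeros(A)
    total = zero(eltype(A))
    for k in nzrange(A, j)
        total += vals[k]
    end
    return total
end

# Dot product of column j with a dense vector, touching only stored rows.
function column_dot(A::SparseMatrixCSC, j::Integer, w::AbstractVector)
    rows, vals = rowvals(A), nonzeros(A)
    total = zero(eltype(A))
    for k in nzrange(A, j)
        total += vals[k] * w[rows[k]]
    end
    return total
end

s1 = column_sum(X, 1)
d1 = column_dot(X, 1, [1.0, 10.0, 100.0])
```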

source

DPMM.DPSparseVector — Type.

DPSparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}

DPSparseVector is almost the same as SparseArrays.SparseVector.

The only difference is that summing DPSparseVectors results in a Vector.

source

Function Index

DPMM.AbstractCluster
DPMM.AbstractDPModel
DPMM.CollapsedAlgorithm
DPMM.CollapsedCluster
DPMM.DPGMM
DPMM.DPGMMStats
DPMM.DPMMAlgorithm
DPMM.DPMNMM
DPMM.DPMNMMStats
DPMM.DPSparseMatrix
DPMM.DPSparseVector
DPMM.DirectAlgorithm
DPMM.DirectCluster
DPMM.DirichletFast
DPMM.MultinomialFast
DPMM.MvNormalFast
DPMM.NormalInverseWishart
DPMM.SplitMergeAlgorithm
DPMM.SplitMergeCluster
DPMM.CRPprobs
DPMM.ClusterProbs
DPMM.GridMixture
DPMM.RandMixture
DPMM.RestrictedClusterProbs
DPMM.SampleSubCluster
DPMM.create_clusters
DPMM.empty_cluster
DPMM.fit
DPMM.initialize_clusters
DPMM.label_x
DPMM.logmixture_πs
DPMM.lognαpdf
DPMM.logαpdf
DPMM.place_x!
DPMM.population
DPMM.random_labels
DPMM.readNYTimes
DPMM.run!
DPMM.setup_scene
DPMM.setup_workers