Reference
Contents
- Algorithms
- Algorithms (Internal)
- Distributions
- Clusters
- Models
- Data
- Function Index
Algorithms
DPMM.fit — Function
fit(X::AbstractMatrix; algorithm=DEFAULT_ALGO, ncpu=1, T=3000, benchmark=false, scene=nothing, o...)
fit is the main function of DPMM.jl. It clusters the given data matrix, where columns are data points, and returns a label for each data point. The default clustering algorithm is SplitMergeAlgorithm. A usage sketch follows the keyword list below.
Keywords:
- ncpu=1: the number of parallel workers.
- T=3000: the number of iterations.
- benchmark=false: if true, returns the elapsed time.
- scene=nothing: plot scene for visualization, see setup_scene.
- o...: other keyword arguments specific to the chosen algorithm.
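For example, a minimal call might look like the following sketch (the random 2×1000 matrix is purely illustrative):

```julia
using DPMM

X = randn(2, 1_000)                                     # toy data: 2 dimensions, columns are data points
labels = fit(X; T=1000)                                 # default SplitMergeAlgorithm
labels = fit(X; algorithm=CollapsedAlgorithm, T=1000)   # choose a different sampler
```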
DPMM.DPMMAlgorithm — Type
DPMMAlgorithm{P}
Abstract base type for algorithms. P stands for parallel.
Each subtype should provide the following methods:
- AlgoType(X::AbstractMatrix; o...): constructor
- random_labels(X::AbstractMatrix, algo::AlgoType{P}) where P: random label generator
- create_clusters(X::AbstractMatrix, algo::AlgoType{P}, labels) where P: initial clusters
- empty_cluster(algo::AlgoType): an empty cluster (may be nothing)
- run!(algo::AlgoType{P}, X, labels, clusters, emptycluster; o...) where P: run! modifies labels
Other generic functions are implemented on top of these core functions. A sketch of a custom subtype is shown after this list.
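For illustration only, a new algorithm could plug into this interface roughly as follows. MySampler and its field are hypothetical and not part of the package; only the method names and argument orders listed above are taken from the documentation.

```julia
using DPMM

# Hypothetical skeleton of a serial (P = false) sampler.
struct MySampler{P} <: DPMM.DPMMAlgorithm{P}
    ninit::Int   # number of initial clusters
end

MySampler(X::AbstractMatrix; ninit=1, o...) = MySampler{false}(ninit)

DPMM.random_labels(X::AbstractMatrix, algo::MySampler)          = rand(1:algo.ninit, size(X, 2))
DPMM.create_clusters(X::AbstractMatrix, algo::MySampler, labels) = Dict{Int,Any}()   # placeholder clusters
DPMM.empty_cluster(algo::MySampler)                              = nothing

function DPMM.run!(algo::MySampler, X, labels, clusters, emptycluster; o...)
    # Gibbs sweeps would go here; run! mutates `labels` in place.
    return labels
end
```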
DPMM.CollapsedAlgorithm — Type
CollapsedAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = CollapsedAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. The quasi algorithm updates the clusters only at the end of each iteration. The parallel algorithm is valid for the quasi-collapsed algorithm only. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions (see the example after the method list below).
Provides the following methods:
- CollapsedAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
- random_labels(X::AbstractMatrix, algo::CollapsedAlgorithm)
- create_clusters(X::AbstractMatrix, algo::CollapsedAlgorithm, labels)
- empty_cluster(algo::CollapsedAlgorithm): an empty cluster
- run!(algo::CollapsedAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
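As a concrete sketch of the keyword interaction noted above (parallel runs require the quasi variant), a parallel quasi-collapsed run might look like:

```julia
using DPMM

X = randn(2, 10_000)    # illustrative data
# quasi=true is required here: the parallel collapsed sampler is only the quasi variant.
labels = fit(X; algorithm=CollapsedAlgorithm, quasi=true, ncpu=4, T=1000)
```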
DPMM.DirectAlgorithm — Type
DirectAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = DirectAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. The quasi algorithm uses cluster population proportions as cluster weights, so it does not sample the mixture weights from a Dirichlet distribution. For large N, this is very similar to the non-quasi sampler. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
Provides the following methods:
- DirectAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
- random_labels(X::AbstractMatrix, algo::DirectAlgorithm)
- create_clusters(X::AbstractMatrix, algo::DirectAlgorithm, labels)
- empty_cluster(algo::DirectAlgorithm): an empty cluster
- run!(algo::DirectAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
DPMM.SplitMergeAlgorithm — Type
SplitMergeAlgorithm{P,Q} <: DPMMAlgorithm{P}
Run it by:
labels = fit(X; algorithm = SplitMergeAlgorithm, quasi=false, ncpu=1, T=1000, keywords...)
P stands for parallel, Q stands for quasi. With M=false the algorithm does not perform merge moves at all, so it is not exact. However, empirical results show that merge moves are accepted very rarely. The number of workers can be passed via the ncpu keyword argument to the fit or run! functions.
Provides the following methods:
- SplitMergeAlgorithm(X::AbstractMatrix{T}; modelType=_default_model(T), α=1, ninit=1, parallel=false, quasi=false, o...)
- random_labels(X::AbstractMatrix, algo::SplitMergeAlgorithm)
- create_clusters(X::AbstractMatrix, algo::SplitMergeAlgorithm, labels)
- empty_cluster(algo::SplitMergeAlgorithm): an empty cluster
- run!(algo::SplitMergeAlgorithm{P,Q}, X, labels, clusters, cluster0; o...) where {P,Q}
Other generic functions are implemented on top of these core functions.
DPMM.run! — Function
run!(algo::DPMMAlgorithm, X, labels, clusters, emptycluster; o...)
Runs the specified Gibbs algorithm. Available algorithms are:
- CollapsedAlgorithm
- DirectAlgorithm
- SplitMergeAlgorithm
DPMM.setup_workers — Function
setup_workers(ncpu::Integer)
Sets up the parallel worker processes and initializes the required modules on them.
DPMM.initialize_clusters — Function
initialize_clusters(X::AbstractMatrix, algo::DPMMAlgorithm{P})
Initializes clusters and labels, and sends the related data to the workers if the algorithm is parallel. A sketch of the lower-level pipeline built from these functions is shown below.
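The following is a rough sketch of how these lower-level functions fit together when bypassing fit. The return values of initialize_clusters (labels and clusters) and their order are an assumption based on the description above, not a confirmed signature.

```julia
using DPMM

X = randn(2, 5_000)          # illustrative data
DPMM.setup_workers(4)        # start 4 workers and load the required modules

algo = SplitMergeAlgorithm(X; parallel=true)
# Assumed return order (labels, clusters); check the package source before relying on it.
labels, clusters = DPMM.initialize_clusters(X, algo)
cluster0 = DPMM.empty_cluster(algo)
DPMM.run!(algo, X, labels, clusters, cluster0)
```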
Algorithms (Internal)
DPMM.random_labels — Function
random_labels(X::AbstractMatrix, algo::DPMMAlgorithm)
Random label generator for the data. algo.ninit specifies the number of clusters.
DPMM.create_clusters — Function
create_clusters(X, algo::CollapsedAlgorithm, labels)
Generates clusters from the labels. algo.ninit specifies the number of clusters.
DPMM.empty_cluster — Function
empty_cluster(X, algo::CollapsedAlgorithm, labels)
Generates an empty cluster (0 data points).
DPMM.RestrictedClusterProbs — Function
RestrictedClusterProbs(πs::AbstractVector{V}, clusters::Dict, x::AbstractVector) where V<:Real
Returns the normalized probability vector for a data point belonging to each existing cluster.
DPMM.CRPprobs — Function
CRPprobs(clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real
Returns the Chinese Restaurant Process probabilities for a data point belonging to each existing cluster or to a new cluster.
DPMM.SampleSubCluster — Function
SampleSubCluster(πs::Vector{V}, cluster::SplitMergeCluster, x::AbstractVector) where V<:Real
Returns the normalized probability vector for a data point belonging to the right or left subcluster.
DPMM.ClusterProbs — Function
ClusterProbs(πs::AbstractVector{V}, clusters::Dict, cluster0::AbstractCluster, x::AbstractVector) where V<:Real
Returns the normalized probability vector for a data point belonging to each existing cluster or to a new cluster. An illustrative sketch of the underlying CRP weighting follows.
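As an illustration (not the package's internal code) of the Chinese Restaurant Process weighting these functions implement: an existing cluster contributes log(population) plus its log ∝likelihood (lognαpdf), while a new cluster contributes log(α) plus the prior-predictive log ∝likelihood of the empty cluster (logαpdf). The helper below is a sketch built only from those documented primitives.

```julia
using DPMM

# Illustrative sketch only; CRPprobs above is the package's actual implementation.
function crp_logweights(clusters::Dict, cluster0, x; α=1.0)
    ks   = collect(keys(clusters))
    logw = [DPMM.lognαpdf(clusters[k], x) for k in ks]   # log(population) + log ∝likelihood
    push!(logw, log(α) + DPMM.logαpdf(cluster0, x))      # "new table": α times the prior predictive
    return ks, logw                                      # exponentiate and normalize to obtain probabilities
end
```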
DPMM.place_x! — Function
place_x!(model::AbstractDPModel, clusters::Dict, knew::Int, xi::AbstractVector)
Places a data point into its new cluster. This modifies `clusters`.
DPMM.label_x — Function
label_x(clusters::Dict, knew::Int)
Returns the new cluster number for a data point.
DPMM.logmixture_πs — Function
logmixture_πs(α::V, clusters::Dict{<:Integer, <:AbstractCluster}) where V<:Real
Samples log mixture weights from the Dirichlet distribution.
Distributions
DPMM.NormalInverseWishart — Type
NormalInverseWishart{T<:Real,S<:AbstractPDMat} <: ContinuousUnivariateDistribution
The Normal-Inverse-Wishart distribution is the prior for the MvNormalFast distribution.
See MvNormalFast.
DPMM.MvNormalFast — Type
MvNormalFast{T<:Real,Prec<:AbstractPDMat,Mean<:AbstractVector} <: AbstractMvNormal
The multivariate normal distribution is redefined for the purpose of fast likelihood calculations. It uses the μ (mean), J (precision) parametrization.
DPMM.DirichletFast — Type
DirichletFast{T<:Real} <: ContinuousMultivariateDistribution
The Dirichlet distribution as a prior for multinomial parameters.
The difference between DirichletFast and Dirichlet is that randn on a DirichletFast returns a MultinomialFast distribution. It also never calculates the normalization constant, so its constructor is faster than Dirichlet's.
See MultinomialFast.
DPMM.MultinomialFast — Type
MultinomialFast{T<:Real} <: ContinuousMultivariateDistribution
The multinomial distribution is redefined for the purpose of fast likelihood calculations on DPSparseVector.
The other difference between MultinomialFast and Multinomial is that the number of trials n is not fixed; it is computed from the input vector inside the pdf function, so the distribution can produce a pdf for any discrete x vector. A sketch of the resulting likelihood computation is shown below.
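For intuition, the likelihood this enables is, up to the multinomial normalization constant, a dot product between the count vector and the log-probabilities, which is cheap on sparse counts. The helper below is an illustrative sketch, not the package's implementation:

```julia
using SparseArrays

# Unnormalized multinomial log-likelihood: sum over the nonzero counts of x[i] * log(p[i]).
# A sparse x means only its stored entries are touched.
function sparse_loglik(logp::AbstractVector{<:Real}, x::SparseVector)
    ids, counts = findnz(x)
    return sum(counts .* logp[ids])
end
```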
Clusters
DPMM.AbstractCluster — Type
AbstractCluster
Abstract base type for clusters.
Each subtype should provide the following methods:
- population(c): population of the cluster
- isempty(m::AbstractCluster): checks whether the cluster is empty
- logαpdf(c, x): log(∝likelihood) of a data point
- lognαpdf(c, x): log(population) + logαpdf(c, x) for a data point (used in CRP calculations)
- ClusterType(m::AbstractDPModel, X::AbstractArray): constructor (X is the data as columns)
- ClusterType(m::AbstractDPModel, s::SufficientStats): constructor
Other generic functions are implemented on top of these core functions.
DPMM.lognαpdf — Function
lognαpdf(m::AbstractCluster, x::AbstractArray)
log(population) + log(∝likelihood) of a data point given by a cluster.
DPMM.logαpdf — Function
logαpdf(m::AbstractCluster, x::AbstractArray)
log(∝likelihood) of a data point given by a cluster.
DPMM.population — Function
population(m::AbstractCluster)
Number of data points in a cluster.
DPMM.CollapsedCluster — Type
CollapsedCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster
The CollapsedCluster is designed for Collapsed Gibbs algorithms.
CollapsedCluster has the following fields:
- n: population
- predictive: predictive distribution
- prior: prior distribution
A CollapsedCluster is constructed from SufficientStats or from data points:
CollapsedCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
CollapsedCluster(m::AbstractDPModel, s::SufficientStats)
There is also a generic (not specific to CollapsedCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})
There are also specific methods defined for creating clusters for the whole data as a dictionary:
CollapsedClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
CollapsedClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})
The - and + operations are defined for data removal from and data addition to the cluster (see the sketch after this entry):
-(c::CollapsedCluster, x::AbstractVector)
+(c::CollapsedCluster, x::AbstractVector)
See AbstractCluster for generic functions common to all cluster types.
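A rough sketch of how these operators are typically used when a data point changes cluster in a collapsed Gibbs sweep. The helper and its names are illustrative only; the - and + operators are the documented ones.

```julia
using DPMM

# Illustrative helper (not part of the package): move data point x from cluster kold to knew.
function move_point!(clusters::Dict{Int,<:DPMM.CollapsedCluster}, kold::Int, knew::Int, x::AbstractVector)
    clusters[kold] = clusters[kold] - x   # remove x's sufficient statistics from the old cluster
    clusters[knew] = clusters[knew] + x   # add x to the new cluster
    return clusters
end
```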
DPMM.DirectCluster — Type
DirectCluster{Pred<:Distribution, Prior<:Distribution} <: AbstractCluster
The DirectCluster is designed for Direct Gibbs algorithms.
DirectCluster has the following fields:
- n: population
- sampled: sampled parameter distribution
- prior: prior distribution
A DirectCluster is constructed from SufficientStats or from data points:
DirectCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
DirectCluster(m::AbstractDPModel, s::SufficientStats)
There is also a generic (not specific to DirectCluster) SuffStats method for getting the sufficient statistics of the whole data as a dictionary:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractArray{Int})
There are also specific methods defined for creating clusters for the whole data as a dictionary:
DirectClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractArray{Int})
DirectClusters(model::AbstractDPModel, stats::Dict{Int,<:SufficientStats})
See AbstractCluster for generic functions common to all cluster types.
DPMM.SplitMergeCluster — Type
SplitMergeCluster{Pred<:Distribution, Post<:Distribution, Prior<:Distribution} <: AbstractCluster
The SplitMergeCluster is designed for the Split-Merge Gibbs algorithm.
SplitMergeCluster has the following fields:
- n: population
- nr: right subcluster population
- nl: left subcluster population
- sampled: sampled parameter distribution
- right: right subcluster sampled parameter distribution
- left: left subcluster sampled parameter distribution
- post: posterior distribution
- rightpost: right subcluster posterior distribution
- leftpost: left subcluster posterior distribution
- prior: prior distribution
- llhs: log marginal likelihoods assigned by the cluster, the right subcluster, and the left subcluster
- llh_hist: right + left log marginal likelihood history over 4 iterations
A SplitMergeCluster is constructed from SufficientStats or from data points:
SplitMergeCluster(m::AbstractDPModel, X::AbstractArray) # X is the data as columns
SplitMergeCluster(m::AbstractDPModel, s::SufficientStats)
There is also a generic SuffStats method for getting the sufficient statistics of the whole data:
SuffStats(model::AbstractDPModel, X::AbstractMatrix, z::AbstractVector{Tuple{Int,Bool}})
There are also specific methods defined for creating clusters for the whole data:
SplitMergeClusters(model::AbstractDPModel, X::AbstractMatrix, labels::AbstractVector{Tuple{Int,Bool}})
See AbstractCluster for generic functions common to all cluster types.
The logαpdf and lognαpdf generic functions are extended for subcluster likelihoods:
logαpdf(m::SplitMergeCluster, x, ::Val{false}) # right subcluster likelihood
logαpdf(m::SplitMergeCluster, x, ::Val{true})  # left subcluster likelihood
lognαpdf(m::SplitMergeCluster, x, ::Val{false}) = log(population(m, Val(false))) + logαpdf(m, x, Val(false))
lognαpdf(m::SplitMergeCluster, x, ::Val{true})  = log(population(m, Val(true)))  + logαpdf(m, x, Val(true))
Models
DPMM.AbstractDPModel — Type
AbstractDPModel{T,D}
Abstract base type for DPMMs. T stands for the element type and D for the dimensionality of the data.
DPMM.DPGMM — Type
DPGMM{T<:Real,D} <: AbstractDPModel{T,D}
Model type for DP Gaussian Mixture Models.
DPMM.DPMNMM — Type
DPMNMM{T<:Real,D} <: AbstractDPModel{T,D}
Model type for DP Multinomial Mixture Models.
DPMM.DPGMMStats — Type
DPGMMStats{T<:Real} <: SufficientStats
Sufficient statistics for Gaussian models.
DPMM.DPMNMMStats — Type
DPMNMMStats{T<:Real} <: SufficientStats
Sufficient statistics for multinomial models. A sketch of selecting the model family through fit follows.
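Since the algorithm constructors accept a modelType keyword and fit forwards its extra keywords to the chosen algorithm, the model family can presumably be selected as below. This is a hedged sketch: the forwarding of modelType through fit is an assumption based on the keyword descriptions above, and by default _default_model picks a model from the element type of X.

```julia
using DPMM

Xg = randn(2, 1_000)                            # real-valued data
labels = fit(Xg; modelType=DPMM.DPGMM)          # DP Gaussian mixture

Xm = rand(0:5, 100, 1_000)                      # nonnegative count data (e.g. word counts)
labels = fit(Xm; modelType=DPMM.DPMNMM)         # DP multinomial mixture
```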
Data
DPMM.setup_scene — Function
setup_scene(X)
Initializes the plots for visualizing 2D data.
DPMM.readNYTimes — Function
readNYTimes(file::AbstractString)
Reads the NYTimes dataset from the given data file and returns a DPSparseMatrix.
DPMM.GridMixture — Function
GridMixture(L::Integer; πs::Vector{T}=ones(L*L)/(L*L)) where T<:Real
Generates an LxL grid of Gaussians.
DPMM.RandMixture — Function
RandMixture(K::Integer; D::Int=2, πs::Vector{T}=ones(K)/K) where T<:Real
Randomly generates a mixture of K Gaussians. An example of using these generators for synthetic data is sketched below.
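For instance, a synthetic 2-D benchmark might be set up as follows. Whether the returned mixture object can be sampled directly with rand into a D×N matrix is an assumption here, not something the documentation above states.

```julia
using DPMM

gm = DPMM.GridMixture(2)     # 2×2 grid of Gaussians with uniform weights
# Assumption: the returned mixture behaves like a Distributions.jl mixture,
# so it can be sampled column-wise into a data matrix.
X = rand(gm, 10_000)
labels = fit(X; T=1000)
```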
DPMM.DPSparseMatrix — Type
DPSparseMatrix(X::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
DPSparseMatrix has fast getindex methods for column indexing (i.e. X[:,i]). Column indexing does not copy: it returns a DPSparseVector for the column.
See DPSparseVector.
DPMM.DPSparseVector — Type
DPSparseVector{Tv,Ti<:Integer} <: AbstractSparseVector{Tv,Ti}
DPSparseVector is almost the same as SparseArrays.SparseVector. The only difference is that summation of DPSparseVectors results in a dense Vector. A short usage sketch follows.
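A brief sketch of wrapping a standard sparse matrix, based on the constructor and column-indexing behavior documented above. Passing the wrapped matrix to fit is an assumption, motivated by readNYTimes returning a DPSparseMatrix.

```julia
using DPMM, SparseArrays

S  = sprand(10_000, 50_000, 1e-3)    # random sparse matrix; columns are data points
Xs = DPMM.DPSparseMatrix(S)          # wrap for fast, copy-free column access
x1 = Xs[:, 1]                        # returns a DPSparseVector, no copy
labels = fit(Xs; algorithm=CollapsedAlgorithm, T=1000)   # assumed to accept the wrapper
```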
Function Index
- DPMM.AbstractCluster
- DPMM.AbstractDPModel
- DPMM.CollapsedAlgorithm
- DPMM.CollapsedCluster
- DPMM.DPGMM
- DPMM.DPGMMStats
- DPMM.DPMMAlgorithm
- DPMM.DPMNMM
- DPMM.DPMNMMStats
- DPMM.DPSparseMatrix
- DPMM.DPSparseVector
- DPMM.DirectAlgorithm
- DPMM.DirectCluster
- DPMM.DirichletFast
- DPMM.MultinomialFast
- DPMM.MvNormalFast
- DPMM.NormalInverseWishart
- DPMM.SplitMergeAlgorithm
- DPMM.SplitMergeCluster
- DPMM.CRPprobs
- DPMM.ClusterProbs
- DPMM.GridMixture
- DPMM.RandMixture
- DPMM.RestrictedClusterProbs
- DPMM.SampleSubCluster
- DPMM.create_clusters
- DPMM.empty_cluster
- DPMM.fit
- DPMM.initialize_clusters
- DPMM.label_x
- DPMM.logmixture_πs
- DPMM.lognαpdf
- DPMM.logαpdf
- DPMM.place_x!
- DPMM.population
- DPMM.random_labels
- DPMM.readNYTimes
- DPMM.run!
- DPMM.setup_scene
- DPMM.setup_workers