I am implementing a discrete Bayesian network using a few `Nonlinear` functions and cannot get the `messagePassingAlgorithm()` method to run without error. The resulting graph has all its edges terminated either by a `Clamp` or by a `placeholder`. Here is the code, which is rather simple:
```julia
using ForneyLab
using Plots

gA = 3
gB = 5
gC = 8

# Three deterministic functions
function grade(i, d)
    if i == 0 && d == 0
        g1, g2, g3 = [0.3, 0.4, 0.3]
    elseif i == 0 && d == 1
        g1, g2, g3 = [0.05, 0.25, 0.7]
    elseif i == 1 && d == 0
        g1, g2, g3 = [0.9, 0.08, 0.02]
    else
        g1, g2, g3 = [0.5, 0.3, 0.2]
    end
    return [g1, g2, g3]
end

function letter(g)
    if g == gA
        l = 0.9
    elseif g == gB
        l = 0.6
    else
        l = 0.01
    end
    return l
end

function score(i)
    if i == 0
        s = 0.05
    else
        s = 0.8
    end
    return s
end

# The model
gr = FactorGraph()

# @RV b1 ~ Clamp(0.4)
# @RV b2 ~ Clamp(0.3)
@RV d ~ Bernoulli(0.4)
@RV i ~ Bernoulli(0.3)
@RV grade_ ~ Nonlinear{Sampling}(i, d, g=grade, n_samples=50)
@RV g ~ Categorical(grade_)
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50)
@RV l ~ Bernoulli(letter_)
@RV score_ ~ Nonlinear{Sampling}(i, g=score, n_samples=50)
@RV s ~ Bernoulli(score_)

# placeholder(d, :d)
placeholder(i, :i)
placeholder(s, :s)
placeholder(l, :l)
placeholder(g, :g)

# The line that produces the error:
algo = messagePassingAlgorithm([d]) # <<<<< ERROR
source_code = algorithmSourceCode(algo)
eval(Meta.parse(source_code));
```
The error produced is:
```
No applicable SumProductRule{Categorical} update for Categorical node with inbound types: Message{SampleList}, Nothing
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] inferUpdateRule!(entry::ScheduleEntry, rule_type::Type{SumProductRule{Categorical}}, inferred_outbound_types::Dict{Interface, Type})
    @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/sum_product.jl:75
  [3] inferUpdateRules!(schedule::Vector{ScheduleEntry}; inferred_outbound_types::Dict{Interface, Type})
    @ ForneyLab ~/src/2022/ForneyLab.jl/src/message_passing.jl:203
  [4] messagePassingSchedule(pf::PosteriorFactor)
    @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/posterior_factor.jl:75
  [5] messagePassingAlgorithm(target_variables::Vector{Variable}, pfz::PosteriorFactorization; ep_sites::Vector{Tuple}, id::Symbol, free_energy::Bool)
    @ ForneyLab ~/src/2022/ForneyLab.jl/src/algorithms/inference_algorithm.jl:82
  [6] messagePassingAlgorithm (repeats 2 times)
    @ ~/src/2022/ForneyLab.jl/src/algorithms/inference_algorithm.jl:58 [inlined]
  [7] top-level scope
    @ In[31]:1
  [8] eval
    @ ./boot.jl:373 [inlined]
  [9] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
```
Clearly, this means that there is no sum-product rule to handle a Categorical node with a `SampleList` message input. So my question is: how is this case treated? Would a straightforward variational technique be appropriate? Thanks.
albertpod commented on May 29, 2022

Hi @erlebach!

I am not sure I understand your model. Can you provide the equations for your generative model? Or can you try to explain what you are trying to model with this graph? It seems strange to me that you have a shared placeholder for the output of a `Bernoulli` distribution and the input of a `Nonlinear` node (e.g. `placeholder(i, :i)`).

First of all, there is indeed no closed-form solution (a.k.a. sum-product message) in `ForneyLab.jl` for these incoming messages, i.e. `Message{SampleList}, Nothing`.

To circumvent the problem, you need to introduce a recognition distribution (a "variational technique") around the Categorical node. In the code below I've added one more line before algorithm construction:

However, this code won't work either, because one of your recognition factors, `g`, is clamped (connected to `placeholder(g, :g)`).

What you can do is remove `placeholder(g, :g)` to build the algorithm. THIS WORKS!

However, you might encounter other issues later. Hence, I would first suggest we discuss your model and then think about the inference.
erlebach commented on May 29, 2022

Hi @albertpod:

Thank you for your investigation and suggestion, which I will try out. Let me provide some context. This problem was proposed as a class demonstration of probabilistic graphical models. The objective is to understand the relationships between intelligence (I), SAT score (S), problem difficulty (D), test grade (G), and whether the student receives a letter of recommendation (L). Here is the proposed model:

P(I) and P(D) are modeled as Bernoulli distributions. P(G|D,I) is modeled as a categorical distribution (we assume three different grades). The categorical probabilities depend on the inputs (D,I), so there are four different categorical triplets (whose elements sum to 1). P(L|G) and P(S|I) are Bernoulli distributions whose probability parameter depends on their conditioning variables, so each has two probabilities. These last three sets of probabilities are captured through the functions `letter`, `score`, and `grade`. The former two can be interpreted as mixtures of Bernoullis, and `grade` is a mixture of categorical distributions (I believe the `TransitionMixture` factor could be of help, but I have not looked into it).

I was hoping initially to be able to compute marginals on any unobserved variable without appealing to variational inference. I wished to experiment with choosing different subsets of observed variables and examine their effect. Furthermore, I want to be able to specify which variables are observed via a dictionary, passed to some function, and have the appropriate inference executed. Note that I am only providing a single observation at this stage, and the distributions of the nodes are fully specified, so an analytical solution is possible. By analytical, I mean the following: the marginal of I is a Bernoulli with parameter p = 0.3. If I specify that s = 1, for example, then I becomes a Bernoulli with a different (increased) parameter. This parameter can be computed exactly. I can also perform a Monte Carlo simulation to compute it, which is what I did in PyMC. I thought that ForneyLab would be able to apply expectation propagation to do this, since it can be done manually (with pencil and paper). I do realize that if I place a prior on the probability of intelligence, then its posterior can be evaluated via variational inference, and clarity ensues. But that was not my initial intent.

Note that I might want to observe `g`. What does one do in that case? In reality, we can consider the current model to have a plate with N measurements, and I am considering the case of N = 1, i.e., the simplest case. That should be possible, no?

I have implemented the model successfully in PyMC, if that is of interest to you. These implementations serve as exercises to develop expertise in a variety of frameworks in Julia and Python.
For reference, I have never heard the term "recognition distribution" before.

erlebach commented on May 29, 2022
How about we move this to the discussions?
albertpod commented on May 30, 2022

Sorry, I noticed the new topic on Discussions before I read your answer, @erlebach.

As for the recognition distribution `q`: this term is also referred to as an instrumental distribution. It arises when we approximate our posterior distribution `p` with some reference probability distribution `q` (the recognition distribution). In this case we try to minimize the KL divergence between `p` and `q`, since the log evidence cannot be computed.

We had a short discussion with @ThijsvdLaar about your issue. It seems like you can be just fine with `TransitionMixture` for this problem. @ThijsvdLaar will get back to you regarding this.

I understand your wish to observe `g`, but then (to my understanding) your graph will change as well, i.e., you won't need P(G|D, I).

In the meantime, I will try to come up with an inference solution for your model using a `Nonlinear` node.

erlebach commented on May 30, 2022
Thank you! Quick question on something I never understood. If a variable is observed, its probability distribution is quite irrelevant, is that not true? However, when estimating a likelihood, its probability distribution still impacts the posterior through the remaining non-observed variables. If true, P(G|D, I) is still needed. In any case, ideally, I would like, if possible, the flexibility of observing or not observing any subset of random variables without having to write different methods for each combination of observables. Couldn't an observation be implemented in a general manner through multiplication by a delta function? This is possible with MCMC, but perhaps not with message passing.
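For discrete models, the delta-function idea in this question can indeed be phrased generically: multiply a tabular joint by an indicator on the observed values and renormalize. A minimal Python sketch (the helper `condition` is hypothetical, not a ForneyLab API), reusing the I/S numbers from this thread:

```python
def condition(joint, evidence):
    """Multiply a tabular joint {assignment_tuple: prob} by an indicator
    (a discrete delta) on the observed tuple positions, then renormalize."""
    kept = {a: p for a, p in joint.items()
            if all(a[k] == v for k, v in evidence.items())}
    z = sum(kept.values())
    return {a: p / z for a, p in kept.items()}

# Toy joint over (i, s): P(i) * P(s | i), with P(S=1 | I) from the score() function
p_i = {0: 0.7, 1: 0.3}
p_s1 = {0: 0.05, 1: 0.8}
joint = {(i, s): p_i[i] * (p_s1[i] if s == 1 else 1 - p_s1[i])
         for i in (0, 1) for s in (0, 1)}

post = condition(joint, {1: 1})  # observe s = 1 (position 1 of the tuple)
print(post[(1, 1)])              # P(I=1 | S=1), about 0.8727
```

Because the observed set is just a dictionary argument, any subset of variables can be clamped without writing a separate method per combination, which is exactly the flexibility asked about; message-passing frameworks achieve the same effect structurally, by terminating edges with clamps.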
albertpod commented on May 30, 2022

I am not sure I've understood your question. What do you mean by an irrelevant distribution? (Let's continue this part in the Discussion; it is always helpful to draw a graph and inspect how the messages flow depending on your graph structure.)

As for the solution with a `Nonlinear` node: it seems we have found a bug that doesn't allow us to proceed with the mix of Bernoulli and Categorical distributions. However, a Bernoulli can always be converted to a Categorical distribution, e.g. `Bernoulli(0.3) <-> Categorical([0.3, 0.7])`.
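The Bernoulli-to-Categorical conversion is easy to sanity-check outside ForneyLab. A short Python sketch (the success/category-index convention below is an assumption chosen to match the `Bernoulli(0.3) <-> Categorical([0.3, 0.7])` mapping in this comment):

```python
def bernoulli_pmf(p, x):
    """P(X = x) for X ~ Bernoulli(p), with x in {0, 1} and x = 1 as 'success'."""
    return p if x == 1 else 1 - p

def categorical_pmf(probs, k):
    """P(X = k) for X ~ Categorical(probs), k a 0-based category index."""
    return probs[k]

# Bernoulli(0.3) <-> Categorical([0.3, 0.7]), with category 0 playing
# the role of 'success' as in the mapping quoted above
p = 0.3
assert bernoulli_pmf(p, 1) == categorical_pmf([p, 1 - p], 0)
assert bernoulli_pmf(p, 0) == categorical_pmf([p, 1 - p], 1)
```

The two parameterizations carry identical information; only the event labels change, which is why the workaround loses nothing.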
Given that, I rewrote your model in the following way:

As a side note, at the moment our lab focuses on `ReactiveMP.jl`. Only three people can tackle `ForneyLab` issues (two of those are writing theses, and one will be unavailable for a few days). As a consequence, our replies will be slow, but they will come sooner or later :) Thanks for trying FL in this setting; it helps to improve the framework.

erlebach commented on May 30, 2022
Thanks! One question: you wrote:

```julia
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50, dims=[(2,),(2,),(2,)])
```

Shouldn't this be written as

```julia
@RV letter_ ~ Nonlinear{Sampling}(g, g=letter, n_samples=50, dims=[(2,),(2,)])
```

I am surprised this would have worked for you.
albertpod commented on May 30, 2022

@erlebach Ah, right, I made a typo; your variant is correct. FL just ignored the last dimensionality specification. We should throw an error or warning when a user does this.
erlebach commented on May 30, 2022

Great. I ran the code, and it does not work. But I have a question. You wrote:

Why is that? When sampled, a Categorical returns a single number: (0, 1) or (1, 2)? Why are you specifying a vector of length 2? Or are you specifying two observed values? I just want to make sure, since in my original problem I only assigned one observed value.

When executing `stepGRADE`, I get the error:

Looking more closely, the error is in ProgressMeter. I removed ProgressMeter and still get the same error. More specifically:
albertpod commented on May 30, 2022

I have a feeling that you haven't specified the dimensionality of your placeholders as I did, i.e. `placeholder(s, :s, dims=(2,))`. Are you sure you are using exactly the same code?

As for the data: the Categorical node in FL defines a one-dimensional probability distribution over the normal basis vectors of dimension `d`. The output is one-hot encoded, meaning that for 1 we use (1,0), for 2, (0,1), etc.

erlebach commented on May 30, 2022
You are correct, @albertpod; I had not specified the dimensionality of the placeholders. I did not copy/paste your code, but instead modified mine; that is how I learn faster.

I did not realize that a Categorical was one-hot encoded. Makes sense (different from PyMC).
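The one-hot convention described above is worth pinning down, since it explains why the placeholders need `dims=(2,)`. A small illustrative Python helper (not FL's internal representation):

```python
def one_hot(k, d):
    """Encode category k (1-based, as in this thread) as a length-d 0/1 tuple."""
    return tuple(1 if j == k - 1 else 0 for j in range(d))

assert one_hot(1, 2) == (1, 0)     # category 1 -> (1, 0)
assert one_hot(2, 2) == (0, 1)     # category 2 -> (0, 1)
assert one_hot(2, 3) == (0, 1, 0)  # three grades would use length-3 vectors
```

So a single binary observation is fed to FL as a length-2 vector, not a scalar, and a single observed grade in the three-category case would be a length-3 vector.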
The code now runs. Whether it runs correctly is another matter. I am surprised that 100 iterations of a variational method run slower than PyMC running 5,000 iterations of a Monte Carlo sampler. Of course, the variational approach provides a parametrized distribution, and I have not been careful to make precise measurements. I guess the answer is that I am using the nonlinear sampler three times in my model. :-)
Why do you use `unsafeMean` rather than `mean()`? I cannot find any documentation on `unsafeMean`. What does it do?

albertpod commented on May 30, 2022
I understand, no worries.
Well, actually you also use sampling. ForneyLab knows how to combine different message-passing techniques; here you have used hybrid message passing, i.e. SP, VMP, and sampling-based (around the nonlinear nodes). It's hard to compare iterations from one package to another, but I am also surprised by your observation.

I would also be suspicious about the correctness of the result. Unfortunately, for this model we cannot inspect the free energy (our minimization target), so we don't know when the algorithm converges: within 1, 2, 3, or 100 iterations. Maybe you can check how your marginals change over iterations.
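One way to operationalize "check how your marginals change over iterations" is to record the marginal of interest after each iteration and stop when successive estimates barely move. A hedged Python sketch with made-up trace values (the real trace would come from your inference loop):

```python
def tv_distance(p, q):
    """Total variation distance between two discrete marginals."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical marginals of I recorded after successive iterations
trace = [[0.5, 0.5], [0.8, 0.2], [0.87, 0.13], [0.8727, 0.1273]]
deltas = [tv_distance(a, b) for a, b in zip(trace, trace[1:])]
converged = deltas[-1] < 1e-2  # successive marginals have nearly stopped moving
```

This is only a proxy: stalled marginals do not prove convergence to the right answer, which is why inspecting the free energy (when available) is preferable.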
We still need to check a few things on our side. Perhaps, we'll come up with a better solution.
As for `unsafeMean`, that's just a habit (old-school FL); you can safely use `mean` now.