
token.vector is calculated as all-0s creating problems with token.similarity()  #1092

Closed
@mraduldubey

Description

The code below shows why the similarity is calculated incorrectly: it comes down to incorrect vector calculations:

import spacy

# Model name assumed; any English model that ships word vectors will do.
nlp = spacy.load('en_core_web_md')

doc = nlp(u'apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Manual cosine similarity: divides by the product of the two norms,
    # so an all-zero vector produces 0/0 = nan.
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i] * b[i]
        sa += a[i] * a[i]
        sb += b[i] * b[i]
    sa = sa ** 0.5
    sb = sb ** 0.5
    return ans / (sa * sb)

print(doc[0], doc[2], doc[4], doc[8])

print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector),
      dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))

print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]),
      doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))

Output:

apples apple orange oranges
0.750411317806 0.51238496547 nan nan    # results of manual cosine similarity
0.750411349583 0.512384940626 0.0 0.0   # token.similarity()
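
The nan values in the manual results follow directly from the division in dot_prd: when one vector is all zeros, both the dot product and the product of norms are 0, so the cosine evaluates to 0/0. A minimal NumPy sketch (independent of spaCy, with made-up vectors) reproduces this:

import numpy as np

def cosine(a, b):
    # Unguarded cosine similarity, equivalent to dot_prd above:
    # an all-zero vector makes the denominator 0 and the result nan.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(np.dot(a, b) / (na * nb))

v = np.array([1.0, 2.0, 3.0])
zero = np.zeros(3)
print(cosine(v, v))     # 1.0
print(cosine(v, zero))  # nan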

doc[8].vector is all zeros. So why is the vector for the 'oranges' token calculated as all-0s?
The vectors for 'orange' and 'apple' are calculated correctly, and, more importantly, so is the vector for 'apples'. So why is 'oranges' a problem? Is this a problem with the model?
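
Judging from the output above, token.similarity() apparently guards the zero-norm case and returns 0.0 where the manual cosine gives nan. A hedged sketch of that behavior (pure NumPy, not spaCy's actual implementation):

import numpy as np

def safe_similarity(a, b):
    # Guarded cosine: return 0.0 when either vector has zero norm,
    # matching the 0.0 that token.similarity() reports for 'oranges'
    # instead of the nan from the unguarded division.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

oov = np.zeros(4)                      # stands in for the all-zero 'oranges' vector
v = np.array([0.1, 0.2, 0.3, 0.4])
print(safe_similarity(v, oov))         # 0.0, no nan

In spaCy itself, token.has_vector can be checked before calling similarity() to tell an out-of-vocabulary token (all-zero vector) apart from a genuinely dissimilar one.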

Labels: docs (Documentation and website)