token.vector is calculated as all-0s, creating problems with token.similarity() #1092
Closed
Description
The code below shows why the similarity is calculated incorrectly: the problem is in the vector calculations.
import spacy

nlp = spacy.load('en')  # a model that ships with word vectors
doc = nlp(u'apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Manual cosine similarity: dot product divided by the two vector norms.
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i] * b[i]
        sa += a[i] * a[i]
        sb += b[i] * b[i]
    sa = sa ** 0.5
    sb = sb ** 0.5
    return ans / (sa * sb)  # divides by zero (nan) if either vector is all zeros

print(doc[0], doc[2], doc[4], doc[8])
print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector), dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))
print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]), doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))
Output:
apples apple orange oranges
0.750411317806 0.51238496547 nan nan  # results of the manual cosine similarity
0.750411349583 0.512384940626 0.0 0.0  # token.similarity()
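
(Side note: the nan values from my manual calculation just come from dividing by a zero norm. A guarded version, a quick numpy sketch rather than whatever spaCy does internally, reproduces the 0.0 that token.similarity() returns:)

import numpy as np

def safe_cosine(a, b):
    # Return 0.0 when either vector has zero norm instead of dividing by zero.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return np.dot(a, b) / (na * nb)

print(safe_cosine(doc[0].vector, doc[8].vector))  # 0.0, matching token.similarity()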
doc[8].vector is all zeros. So why is the vector for the 'oranges' token calculated as all-0s?
The vectors for 'orange' and 'apple' are calculated correctly. More importantly, the vector for 'apples' is also calculated correctly. So why is 'oranges' a problem? Is this a problem with the model?
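
One way to check whether this is a vocabulary-coverage issue (just a diagnostic sketch; I'm assuming the standard Token attributes here) is to print each token's has_vector flag and vector norm:

for token in doc:
    # has_vector is False when the lexeme has no row in the model's
    # vector table, in which case token.vector is all zeros.
    print(token.text, token.has_vector, token.vector_norm)

If 'oranges' comes back with has_vector == False, the zero vector would just be the expected behaviour for a word outside the model's vector table, and the fix would be a model with broader vector coverage.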