token.vector is calculated as all-0s, creating problems with token.similarity() #1092
Closed
Description
The code below shows why the similarity is calculated incorrectly: the problem is in the vector calculations.
import spacy

nlp = spacy.load('en')  # a model that ships with word vectors
doc = nlp(u'apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Manual cosine similarity: dot product divided by the two vector norms.
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i] * b[i]
        sa += a[i] * a[i]
        sb += b[i] * b[i]
    sa = sa ** 0.5
    sb = sb ** 0.5
    return ans / (sa * sb)  # divides by zero (nan) if either vector is all zeros

print(doc[0], doc[2], doc[4], doc[8])
print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector), dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))
print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]), doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))
Output:
apples apple orange oranges
0.750411317806 0.51238496547 nan nan  # results of the manual cosine similarity
0.750411349583 0.512384940626 0.0 0.0  # token.similarity()
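
(Side note: the nan values from my manual calculation just come from dividing by a zero norm. A guarded version, a quick numpy sketch rather than whatever spaCy does internally, reproduces the 0.0 that token.similarity() returns:)

import numpy as np

def safe_cosine(a, b):
    # Return 0.0 when either vector has zero norm instead of dividing by zero.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return np.dot(a, b) / (na * nb)

print(safe_cosine(doc[0].vector, doc[8].vector))  # 0.0, matching token.similarity()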
doc[8].vector is all zeros. So why is the vector for the 'oranges' token calculated as all-0s?
The vectors for 'orange' and 'apple' are calculated correctly. More importantly, the vector for 'apples' is also calculated correctly. So why is 'oranges' a problem? Is this a problem with the model?
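
One way to check whether this is a vocabulary-coverage issue (just a diagnostic sketch; I'm assuming the standard Token attributes here) is to print each token's has_vector flag and vector norm:

for token in doc:
    # has_vector is False when the lexeme has no row in the model's
    # vector table, in which case token.vector is all zeros.
    print(token.text, token.has_vector, token.vector_norm)

If 'oranges' comes back with has_vector == False, the zero vector would just be the expected behaviour for a word outside the model's vector table, and the fix would be a model with broader vector coverage.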