Tokenization with exception patterns #700

Merged
merged 53 commits on Jan 2, 2017
Changes from 1 commit
Commits
53 commits
5b00039
First steps towards the Hungarian tokenizer code.
oroszgy Dec 7, 2016
90d22db
Added Hungarian resource files.
oroszgy Dec 8, 2016
0289b8c
Additional abbreviation tests.
oroszgy Dec 8, 2016
2051726
Passing Hungarian abbrev tests.
oroszgy Dec 10, 2016
0cf2144
Adding partial hyphen and quote handling support.
oroszgy Dec 10, 2016
c035928
Partial Hungarian number tokenization is added.
oroszgy Dec 20, 2016
366b3f8
Merge branch 'master' into hu_tokenizer
oroszgy Dec 20, 2016
6add156
Refactored language data structure
oroszgy Dec 20, 2016
23956e7
Improved partial support for tokenizing Hungarian numbers
oroszgy Dec 20, 2016
3d5306a
Added further testcases.
oroszgy Dec 20, 2016
ab2f6ea
Removed data files from tests.
oroszgy Dec 21, 2016
35aa547
Hungarian module is exposed in spacy.
oroszgy Dec 21, 2016
1748549
Added exception pattern mechanism to the tokenizer.
oroszgy Dec 21, 2016
d9c59c4
Maintaining backward compatibility.
oroszgy Dec 21, 2016
c5c0ed9
fixed minor typo
fnorf Dec 22, 2016
642803d
Merge pull request #702 from fnorf/patch-1
ines Dec 22, 2016
fdf4776
Added Swedish abbreviations
Dec 22, 2016
7f411fd
Remove exceptions containing whitespace / no special chars
ines Dec 23, 2016
11ec02d
Separate inline icon and help cursor classes
ines Dec 23, 2016
cc051dd
Add resources page to usage docs
ines Dec 23, 2016
48b03b4
Fix formatting and wording
ines Dec 23, 2016
12bb0aa
Fix license formatting for GitHub's parser
ines Dec 23, 2016
1d64527
Update Spanish tokenizer
ines Dec 23, 2016
1436b9f
Fix formatting and consistency
ines Dec 23, 2016
207555f
Fix spelling
ines Dec 23, 2016
3a9be4d
Updated token exception handling mechanism to allow the usage of arbi…
oroszgy Dec 23, 2016
72b61b6
Typo fix.
oroszgy Dec 23, 2016
45e045a
Unicode/UTF8 compatibility for Python2
oroszgy Dec 23, 2016
8785706
Reformat stop words for better readability
ines Dec 23, 2016
b893126
Use link mixin instead of plain link markup
ines Dec 24, 2016
f6f6e02
Make links detect target automatically and replace false with null fo…
ines Dec 24, 2016
6dd8ae1
Update README.md
ines Dec 25, 2016
b7becae
Fix typo
ines Dec 25, 2016
ade7487
Accepted contributor agreement.
oroszgy Dec 26, 2016
ef8f310
Merge branch 'hu_tokenizer' of github.com:oroszgy/spaCy into hu_token…
oroszgy Dec 26, 2016
78f754d
Merge pull request #705 from oroszgy/hu_tokenizer
ines Dec 26, 2016
223142d
Update CONTRIBUTORS.md
ines Dec 26, 2016
ad3669c
Merge pull request #703 from magnusburton/master
ines Dec 27, 2016
ce4539d
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
honnibal Dec 27, 2016
cade536
Merge branch 'master' of ssh://github.com/explosion/spaCy
honnibal Dec 27, 2016
f62db78
Increment version
honnibal Dec 27, 2016
e80dad8
Update version
ines Dec 27, 2016
decb743
Update README.rst
ines Dec 27, 2016
d158595
Add Hungarian to alpha support overview
ines Dec 27, 2016
9f24eb3
Update CONTRIBUTORS.md
ines Dec 27, 2016
14295f9
Update README.rst
ines Dec 27, 2016
f112e77
Add PART to tag map
petterhh Dec 28, 2016
9d39e78
Merge pull request #713 from petterhh/patch-1
ines Dec 28, 2016
623d94e
Whitespace
honnibal Dec 30, 2016
3e8d9c7
Test interaction of token_match and punctuation
honnibal Dec 30, 2016
9936a1b
Merge branch 'tokenization_w_exception_patterns' of https://github.co…
syllog1sm Dec 30, 2016
3ba7c16
Fix URL tests
syllog1sm Dec 30, 2016
fde53be
Move whole token match inside _split_affixes.
syllog1sm Dec 30, 2016
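Per the PR title and the later commits above ("Added exception pattern mechanism to the tokenizer", "Test interaction of token_match and punctuation"), the core mechanism is a whole-token exception pattern: when a candidate token matches the pattern, it is kept intact instead of being split on affixes. A minimal self-contained sketch of that idea (the regex and helper here are illustrative stand-ins, not spaCy's actual internals):

```python
import re

# Illustrative whole-token exception pattern (hypothetical, not spaCy's):
# numbers with decimal separators, and dotted abbreviations like "e.g.".
TOKEN_MATCH = re.compile(r"^(?:\d+(?:[.,]\d+)*|[a-z]+\.[a-z.]+)$").match

def split_token(text):
    # If the exception pattern matches the whole token, keep it intact.
    if TOKEN_MATCH(text):
        return [text]
    # Otherwise split trailing punctuation off one character at a time
    # (a drastically simplified stand-in for suffix handling).
    parts = []
    while text and text[-1] in ".,!?":
        parts.insert(0, text[-1])
        text = text[:-1]
    return ([text] if text else []) + parts

print(split_token("3,14"))   # the pattern matches, so the token stays whole
print(split_token("vége."))  # no match: the final period is split off
```

The actual change wires an equivalent check into the affix-splitting loop (see the "Move whole token match inside _split_affixes" commit), so matched tokens bypass prefix and suffix splitting entirely.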
First steps towards the Hungarian tokenizer code.
oroszgy committed Dec 7, 2016
commit 5b00039955a5dc259ce9e63cfe8bebc588f17585
2 changes: 2 additions & 0 deletions spacy/__init__.py
@@ -1,5 +1,6 @@
import pathlib

from spacy import hu
from .util import set_lang_class, get_lang_class
from .about import __version__

@@ -24,6 +25,7 @@
set_lang_class(pt.Portuguese.lang, pt.Portuguese)
set_lang_class(fr.French.lang, fr.French)
set_lang_class(it.Italian.lang, it.Italian)
set_lang_class(hu.Hungarian.lang, hu.Hungarian)
set_lang_class(zh.Chinese.lang, zh.Chinese)


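The set_lang_class calls in the diff above follow a plain name-to-class registry pattern. A toy sketch of how such a registry works (illustrative only; not spaCy's actual util module):

```python
# Toy name-to-class registry mirroring the set_lang_class/get_lang_class
# calls in the diff above (illustrative; not spaCy's actual util code).
_LANG_CLASSES = {}

def set_lang_class(name, cls):
    _LANG_CLASSES[name] = cls

def get_lang_class(name):
    try:
        return _LANG_CLASSES[name]
    except KeyError:
        raise KeyError("Unknown language: %r" % name)

class Hungarian(object):
    lang = 'hu'

# Registering under the class's own lang code, as the diff does.
set_lang_class(Hungarian.lang, Hungarian)
assert get_lang_class('hu') is Hungarian
```

Registering the new Hungarian class at import time is all the one-line diff hunk above needs to do; lookup by language code then works the same as for the existing languages.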
24 changes: 24 additions & 0 deletions spacy/hu/__init__.py
@@ -0,0 +1,24 @@
from __future__ import unicode_literals, print_function

from . import language_data
from ..attrs import LANG
from ..language import Language


class Hungarian(Language):
lang = 'hu'

class Defaults(Language.Defaults):
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'hu'

prefixes = tuple(language_data.TOKENIZER_PREFIXES)

suffixes = tuple(language_data.TOKENIZER_SUFFIXES)

infixes = tuple(language_data.TOKENIZER_INFIXES)

tag_map = dict(language_data.TAG_MAP)

stop_words = set(language_data.STOP_WORDS)
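The Hungarian class above configures tokenization entirely through class attributes on a nested Defaults subclass, copying each base table so it can be extended without mutating the parent. A stripped-down illustration of that inheritance pattern (the Language base here is a mock invented for the sketch; spaCy's real base class does much more):

```python
# Mock Language base class (invented for this sketch; attribute names
# follow the diff, but this is not spaCy's real implementation).
class Language(object):
    lang = None

    class Defaults(object):
        lex_attr_getters = {}   # base table shared by all languages
        stop_words = set()

class Hungarian(Language):
    lang = 'hu'

    class Defaults(Language.Defaults):
        # Copy the base table so extending it cannot mutate the parent.
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters['LANG'] = lambda text: 'hu'
        stop_words = {'és', 'vagy', 'nem'}

print(Hungarian.Defaults.lex_attr_getters['LANG']('akármi'))  # prints hu
print(Language.Defaults.lex_attr_getters)                     # prints {}
```

Because the subclass copies the dict instead of assigning into the inherited one, the base Defaults tables stay untouched for every other language.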
219 changes: 219 additions & 0 deletions spacy/hu/data/stopwords.txt
@@ -0,0 +1,219 @@
a
abban
ahhoz
ahogy
ahol
aki
akik
akkor
akár
alatt
amely
amelyek
amelyekben
amelyeket
amelyet
amelynek
ami
amikor
amit
amolyan
amíg
annak
arra
arról
az
azok
azon
azonban
azt
aztán
azután
azzal
azért
be
belül
benne
bár
cikk
cikkek
cikkeket
csak
de
e
ebben
eddig
egy
egyes
egyetlen
egyik
egyre
egyéb
egész
ehhez
ekkor
el
ellen
elo
eloször
elott
elso
elég
előtt
emilyen
ennek
erre
ez
ezek
ezen
ezt
ezzel
ezért
fel
felé
ha
hanem
hiszen
hogy
hogyan
hát
ide
igen
ill
ill.
illetve
ilyen
ilyenkor
inkább
is
ismét
ison
itt
jobban
jól
kell
kellett
keressünk
keresztül
ki
kívül
között
közül
le
legalább
legyen
lehet
lehetett
lenne
lenni
lesz
lett
ma
maga
magát
majd
meg
mellett
mely
melyek
mert
mi
miatt
mikor
milyen
minden
mindenki
mindent
mindig
mint
mintha
mit
mivel
miért
mondta
most
már
más
másik
még
míg
nagy
nagyobb
nagyon
ne
nekem
neki
nem
nincs
néha
néhány
nélkül
o
oda
ok
oket
olyan
ott
pedig
persze
például
s
saját
sem
semmi
sok
sokat
sokkal
stb.
szemben
szerint
szinte
számára
szét
talán
te
tehát
teljes
ti
tovább
továbbá
több
túl
ugyanis
utolsó
után
utána
vagy
vagyis
vagyok
valaki
valami
valamint
való
van
vannak
vele
vissza
viszont
volna
volt
voltak
voltam
voltunk
által
általában
át
én
éppen
és
így
ön
össze
úgy
új
újabb
újra
ő
őket