#TODO DO SPECS BEFORE EACH IMPL!!
# INDEXER:
..........
Think about a new design...
1. Test on a TREC file (edit it)
2. Memory management with writing to disk...
3. Compression?
4. LATER: improve to tokenize in each iteration over the file
5. Dynamic indexing...
# TOKENIZATION
..........
Add positions to edge n-grams / n-grams
1. Define the behaviours again, improve them
2. Fix standard tokenization for the ".2", "a-", etc. cases
3. Implement n-gram tokenization
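
Sketch for the n-gram items above: a minimal take on character n-grams and edge (prefix) n-grams that carry the token position. The function names, the `(gram, position)` pair shape, and the rule that tokens shorter than `n` yield nothing are all assumptions for illustration, not the project's actual behaviour.

```python
from typing import List, Tuple


def char_ngrams(tokens: List[str], n: int) -> List[Tuple[str, int]]:
    """Return (ngram, token_position) for every character n-gram of every token.

    Assumption: tokens shorter than n produce no n-grams.
    """
    out = []
    for pos, tok in enumerate(tokens):
        for i in range(len(tok) - n + 1):
            out.append((tok[i:i + n], pos))
    return out


def edge_ngrams(tokens: List[str], min_n: int = 1, max_n: int = 3) -> List[Tuple[str, int]]:
    """Return (prefix, token_position) for prefixes of length min_n..max_n."""
    out = []
    for pos, tok in enumerate(tokens):
        # Cap the prefix length at the token length.
        for n in range(min_n, min(max_n, len(tok)) + 1):
            out.append((tok[:n], pos))
    return out
```

Keeping the token position on each gram means a later phrase or proximity query can still reason about where the original token sat.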
# SEARCH / SCORING
........
#TODO improve boolean search and boolean modeling (currently lousy!)
TF-IDF indexing!
1. Run queries and check results
2. Query processor + optimization
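
Sketch for the TF-IDF item above: one common weighting, `(1 + log tf) * log(N / df)`, summed over the query terms. Other variants exist, and the function name and the `docs` layout (doc id mapped to a token list) are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(query: List[str], docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score every document against the query with log-scaled TF times IDF."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # term absent from this document
            score += (1 + math.log(tf[term])) * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores
```

A term that occurs in every document gets an IDF of 0 under this formula, so it contributes nothing to the ranking.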
# SPELLING CORRECTION
...........
1.
# CONCURRENCY:
..........
1. Implement MapReduce for indexing
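
Sketch for the MapReduce item above: the map step emits `(term, doc_id)` pairs per document and the reduce step merges them into postings lists. All names here are hypothetical. A thread pool stands in for real workers; for CPU-bound tokenization a process pool or a proper MapReduce framework would be the likelier choice.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple


def map_doc(item: Tuple[int, str]) -> List[Tuple[str, int]]:
    """Map step: emit one (term, doc_id) pair per distinct term in the document."""
    doc_id, text = item
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_pairs(pair_lists: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Reduce step: group pairs by term into sorted postings lists."""
    index = defaultdict(set)
    for pairs in pair_lists:
        for term, doc_id in pairs:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(docs: Dict[int, str], workers: int = 4) -> Dict[str, List[int]]:
    """Run the map step concurrently, then reduce in the calling thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_doc, docs.items()))
    return reduce_pairs(mapped)
```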
# QUERY PROCESSOR:
.......
1. Use an implementation from some library
# MICROSERVICES
.......
1. Find tutorial on this
2. Start...
# THEORY:
.......
1. Elasticsearch theory behind relevance scoring
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
# WRITE ON:
........
1. Inverted index
2. Retrieval models
- applying boolean search (via the Stanford book)
  Document data >> modified tokens >> indexer builds the inverted index
3. Boolean search
4. Phrase search
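
Sketch for the write-up topics above: documents >> tokens >> positional inverted index, then boolean AND and phrase search on top of it. Lowercased whitespace splitting stands in for the real tokenizer, and every name here is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Set

# term -> doc_id -> list of token positions
Index = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, str]) -> Index:
    """Build a positional inverted index from raw document text."""
    index: Index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def boolean_and(index: Index, terms: List[str]) -> Set[int]:
    """Return ids of documents that contain every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def phrase_search(index: Index, phrase: List[str]) -> Set[int]:
    """Return ids of documents where the terms occur consecutively, in order."""
    hits = set()
    for doc_id in boolean_and(index, phrase):
        starts = index[phrase[0]][doc_id]
        # The phrase matches if term k sits at position start + k for every k.
        if any(all(s + k in index[term][doc_id] for k, term in enumerate(phrase))
               for s in starts):
            hits.add(doc_id)
    return hits
```

Phrase search first narrows candidates with the boolean AND, so positions are only checked for documents that contain all the terms.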
#TODO LAST:
- Set up internal CI/CD to run tests/build after each commit
- Benchmark NLTK vs. my implementation
- Do some Laws of Text coding
#TESTS
- Add test documentation